python - Scraping a hidden table from the web with Python
Question
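I am trying to scrape the "Traits" table from https://www.ebi.ac.uk/gwas/genes/SAMD12 (the URL will change depending on what I need, but the page structure stays the same).
The problem is that my web scraping knowledge is very limited, and I can't get this table with the basic BeautifulSoup workflow I've seen here.
Here is my code: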
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
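page = requests.get(url)
# Parse the downloaded HTML so that `soup` exists for the lookups below.
soup = BeautifulSoup(page.content, 'html.parser')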
I am looking for the "efotrait-table":
efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
<div class="panel panel-default" id="efotrait_panel">
<div class="panel-heading background-color-primary-accent">
<h3 class="panel-title">
<span class="efotrait_label">
Traits
</span>
<span class="efotrait_count badge available-data-btn-badge">
</span>
</h3>
<span class="pull-right">
<span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
<span class="glyphicon glyphicon-chevron-up">
</span>
</span>
</span>
</div>
<div class="panel-body">
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
</div>
</div>
</div>
Specifically, this one:
soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
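As you can see, the table's contents are missing. The website has an option to save the table as CSV, and it would be great if I could somehow get that download link. But when I click the link to copy it, all I get is "javascript:void(0)". I haven't learned JavaScript; should I?
The table is hidden, and even if it weren't, I would need to interactively select more rows per page to get the whole table (and the URL doesn't change, so I can't get the table that way either).
I would like a way to access this table programmatically (unstructured is fine); tidying it up afterwards is a minor matter. Any clue about how to do this (or what I should read up on) would be greatly appreciated.
Thanks in advance.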
Solution
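The table is filled in by JavaScript after the page loads, which is why the HTML returned by requests contains an empty <table>. The data you need is available from an API call instead.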
import requests
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
    # POST the search parameters and decode the JSON response into a dict.
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
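The question mentions that the gene in the URL will change, so the "q" entry of the payload can be generated rather than hard-coded. The following is a minimal sketch: `build_query` and `payload_for` are hypothetical helper names, and it assumes (not confirmed here) that the endpoint accepts other gene symbols in the same query pattern as SAMD12.

```python
def build_query(gene):
    """Return the 'q' search string for one gene symbol.

    Mirrors the pattern used for SAMD12 in the payload above.
    """
    return (f'ensemblMappedGenes: "{gene}" OR '
            f'association_ensemblMappedGenes: "{gene}"')


def payload_for(gene, base):
    """Return a copy of the base payload with 'q' set for the given gene.

    The base dict is left untouched so it can be reused for many genes.
    """
    payload = dict(base)
    payload["q"] = build_query(gene)
    return payload


if __name__ == "__main__":
    # Show the query string generated for the gene used in the question.
    print(build_query("SAMD12"))
```

With the `data` dict from the answer as `base`, `requests.post(url, data=payload_for("SAMD12", data))` would send the same request as the original code.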
You can inspect r.keys() on the parsed JSON response
and pull out the data you want by indexing into the dict.
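Exploring a deeply nested response key by key gets tedious. The helper below is a minimal sketch that lists the key structure of any parsed JSON object down to a chosen depth. `summarize` is a hypothetical name, and the `sample` dict is made-up data for illustration, not the actual layout of the GWAS response.

```python
def summarize(obj, depth=2, prefix=""):
    """Return lines describing the keys of a nested JSON-like object.

    `depth` is the number of key levels to descend into.
    """
    lines = []
    if depth == 0:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{prefix}{key}: {type(value).__name__}")
            lines.extend(summarize(value, depth - 1, prefix + "  "))
    elif isinstance(obj, list) and obj:
        # Look only at the first element; list items usually share a shape.
        lines.extend(summarize(obj[0], depth, prefix))
    return lines


if __name__ == "__main__":
    # Made-up data; with the real response, pass the dict returned by .json().
    sample = {
        "responseHeader": {"status": 0},
        "docs": [{"traitName_s": "x", "mappedLabel": ["y"]}],
    }
    print("\n".join(summarize(sample)))
```

Calling `summarize(r, depth=4)` on the real response should show where the lists of records sit, which tells you which keys to index.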
Here is a quick-and-dirty way to load it, though (lazy code):
import requests
import re
import pandas as pd
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
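def main(url):
    r = requests.post(url, data=data)
    # Collect unique (mappedLabel, traitName_s) pairs straight from the raw response text.
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    # Pass a list rather than the set: sets are unordered and pandas may reject them.
    df = pd.DataFrame(list(match))
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")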
Output:
0 heel bone mineral density Heel bone mineral density
1 interleukin-8 measurement Chronic obstructive pulmonary disease-related ...
2 self reported educational attainment Educational attainment (years of education)
3 waist-hip ratio Waist-hip ratio
4 eye morphology measurement Eye morphology
5 CC16 measurement Chronic obstructive pulmonary disease-related ...
6 age-related hearing impairment Age-related hearing impairment (SNP x SNP inte...
7 eosinophil percentage of leukocytes Eosinophil percentage of white cells
8 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
9 multiple sclerosis Multiple sclerosis
10 mathematical ability Highest math class taken (MTAG)
11 risk-taking behaviour General risk tolerance (MTAG)
12 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
13 self reported educational attainment Educational attainment (MTAG)
14 pancreatitis Pancreatitis
15 hair colour measurement Hair color
16 breast carcinoma Breast cancer specific mortality in breast cancer
17 eosinophil count Eosinophil counts
18 self rated health Self-rated health
19 bone density Bone mineral density
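Running a regex over JSON text is fragile, and the question asked for a CSV. As an alternative, the sketch below walks the parsed JSON, collects every record that has both a `traitName_s` and a `mappedLabel` key (the two fields the regex above targets), and writes the pairs as CSV text. The helper names and the `sample` payload are hypothetical. The walker searches at any depth, so it does not rely on a particular response layout.

```python
import csv
import io


def find_trait_pairs(node):
    """Yield (mappedLabel, traitName_s) pairs found at any depth of parsed JSON."""
    if isinstance(node, dict):
        if "traitName_s" in node and "mappedLabel" in node:
            labels = node["mappedLabel"]
            # mappedLabel appears as a list in the response text; use its first entry.
            if isinstance(labels, list):
                labels = labels[0] if labels else ""
            yield (labels, node["traitName_s"])
        for value in node.values():
            yield from find_trait_pairs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_trait_pairs(item)


def pairs_to_csv(pairs):
    """Return CSV text with a header row for the given pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["mappedLabel", "traitName_s"])
    writer.writerows(pairs)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up payload for illustration; pass requests.post(...).json() in real use.
    sample = {"groups": [{"docs": [
        {"traitName_s": "Hair color", "mappedLabel": ["hair colour measurement"]},
        {"traitName_s": "Hair color", "mappedLabel": ["hair colour measurement"]},
        {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
    ]}]}
    # dict.fromkeys drops duplicate pairs while keeping first-seen order.
    unique_pairs = list(dict.fromkeys(find_trait_pairs(sample)))
    print(pairs_to_csv(unique_pairs))
```

Unlike the set used in the regex version, `dict.fromkeys` keeps the rows in the order they appear in the response.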