Scraping a hidden table from the web with Python

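Problem description

I am trying to scrape the "Traits" table from https://www.ebi.ac.uk/gwas/genes/SAMD12 (the URL will change depending on what I need, but the structure stays the same).

The problem is that my knowledge of web scraping is very limited, and I could not get this table with the basic BeautifulSoup workflow I have seen here.

Here is my code: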

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

I am looking for "efotrait-table":

efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
 <div class="panel panel-default" id="efotrait_panel">
  <div class="panel-heading background-color-primary-accent">
   <h3 class="panel-title">
    <span class="efotrait_label">
     Traits
    </span>
    <span class="efotrait_count badge available-data-btn-badge">
    </span>
   </h3>
   <span class="pull-right">
    <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
     <span class="glyphicon glyphicon-chevron-up">
     </span>
    </span>
   </span>
  </div>
  <div class="panel-body">
   <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
   </table>
  </div>
 </div>
</div>

Specifically, this:

soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>

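As you can see, the table's contents are not there. The website offers an option to save the table as CSV, and it would be great if I could somehow get that download link. But when I click the link to copy it, all I get is "javascript:void(0)". I have never learned JavaScript; should I?

The table is hidden, and even if it were not, I would have to interactively select more rows per page to get the whole table (and the URL does not change, so I cannot get at the table that way either).

I would like to know a way to access this table programmatically (even as unstructured data); the minor details of organizing the table can be handled afterwards. Any clue about how to do this (or what I should study) would be greatly appreciated.

Thanks in advance.

Tags: python, web-scraping, beautifulsoup

Solution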


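The table is filled in by JavaScript after the page loads, which is why the HTML fetched with requests contains only an empty table element. The data you need is available from an API call.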

import requests

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
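Since the question mentions that the gene in the URL will change, here is a minimal sketch of reusing the same form data for another gene. It assumes that only the "q" value depends on the gene symbol; the helper name payload_for is hypothetical and not part of the original answer.

```python
def payload_for(gene, template):
    """Return a copy of the form data with the query rewritten for `gene`.

    `template` is a dict shaped like the `data` dict above. Assumption:
    only the "q" entry depends on the gene symbol.
    """
    payload = dict(template)  # shallow copy, so the template stays untouched
    payload["q"] = (
        f'ensemblMappedGenes: "{gene}" OR '
        f'association_ensemblMappedGenes: "{gene}"'
    )
    return payload
```

With the data dict defined above, requests.post(url, data=payload_for("SAMD12", data)) sends the same request as the original code, and any other gene symbol can be substituted.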

In the snippet above, r is the decoded JSON response (a dict). You can inspect r.keys() and index into the dict to load the data you need.
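Following the keys by hand gets tedious when the response is deeply nested. The sketch below prints an outline of any decoded JSON value, so you can see where the records live without assuming a particular layout. The function name outline and its max_depth parameter are illustrative, not part of any library.

```python
def outline(obj, depth=0, max_depth=3):
    """Return indented lines describing the nesting of a decoded JSON value."""
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}..."]  # stop descending; deeper levels are elided
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; list items usually share one shape.
        first = obj[0]
        lines.append(f"{pad}[0] of {len(obj)}: {type(first).__name__}")
        if isinstance(first, (dict, list)):
            lines.extend(outline(first, depth + 1, max_depth))
    return lines
```

For example, print("\n".join(outline(r))) lists the top-level keys of the response together with the type of each value, a few levels deep.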

Here is a quick way to load it (lazy code that pulls the two trait fields out of the raw response text with a regex):

import requests
import re
import pandas as pd

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data)
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    # Recent pandas versions reject a set as input, so convert it to a list of rows first.
    df = pd.DataFrame(list(match))
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

Output:

0              heel bone mineral density                          Heel bone mineral density
1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment        Educational attainment (years of education)
3                        waist-hip ratio                                    Waist-hip ratio
4             eye morphology measurement                                     Eye morphology
5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
7    eosinophil percentage of leukocytes               Eosinophil percentage of white cells
8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
9                     multiple sclerosis                                 Multiple sclerosis
10                  mathematical ability                    Highest math class taken (MTAG)
11                 risk-taking behaviour                      General risk tolerance (MTAG)
12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment                      Educational attainment (MTAG)
14                          pancreatitis                                       Pancreatitis
15               hair colour measurement                                         Hair color
16                      breast carcinoma  Breast cancer specific mortality in breast cancer
17                      eosinophil count                                  Eosinophil counts
18                     self rated health                                  Self-rated health
19                          bone density                               Bone mineral density
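As an alternative to the regex, the same pairs can be collected from the decoded JSON and written out as CSV, which is what the question originally asked for. This is a sketch under assumptions: it relies only on the two field names used by the regex above (traitName_s as a string, mappedLabel as a list) and walks the whole response instead of assuming where the records sit. The function names are hypothetical.

```python
import csv
import io


def find_trait_pairs(node):
    """Yield (mappedLabel, traitName_s) for every dict that has both keys."""
    if isinstance(node, dict):
        if "mappedLabel" in node and "traitName_s" in node:
            mapped = node["mappedLabel"]
            # The regex above matches mappedLabel as a JSON array; use its first entry.
            if isinstance(mapped, list):
                mapped = mapped[0] if mapped else None
            if mapped is not None:
                yield mapped, node["traitName_s"]
        for value in node.values():
            yield from find_trait_pairs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_trait_pairs(item)


def pairs_to_csv(pairs):
    """Return CSV text with a header row and the unique pairs in sorted order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["mappedLabel", "traitName_s"])
    writer.writerows(sorted(set(pairs)))
    return buffer.getvalue()
```

Calling pairs_to_csv(find_trait_pairs(requests.post(url, data=data).json())) returns text that can be written to a file. Unlike the regex, this only pairs values that sit in the same record, so the result may differ slightly from the output above.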
