BeautifulSoup's find_all() returns an empty list no matter which CSS class I pass to it

Question

import requests
from bs4 import BeautifulSoup

gene_list = {"Ccl2", "CXCR4"}

for seq in gene_list:
    text = requests.get("https://uswest.ensembl.org/Multi/Search/Results?q=" + seq + ";site=ensembl").text
    soup = BeautifulSoup(text, "lxml")
    # Print the matches so the result is visible; this prints [] for every gene
    print(soup.find_all("div", {"class": "table_result"}))

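I am trying to search the Ensembl website (https://uswest.ensembl.org/index.html) for unannotated genes that we found in our sequencing data, and then run some follow-up searches and data processing on them.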

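I just can't get my scraper to return anything from find_all(). I have tried every parser (html.parser, lxml, html5lib), both syntaxes for the class ({'class': 'name_of_class'} and class_='name_of_class'), and several different CSS classes. If I pass only an HTML element name such as "div", it returns the expected output. I don't know why it returns nothing for the div/class combination specified above.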

I have tried many different CSS classes; the one in the code above is simply the last one I tried.

Finally, here is my session info:

-----
bs4         4.9.1
requests    2.24.0
sinfo       0.3.1
-----
Python 3.7.3 (default, Apr 24 2020, 18:51:23) [Clang 11.0.3 (clang-1103.0.32.62)]
Darwin-19.5.0-x86_64-i386-64bit
8 logical CPU cores, i386
-----
Session information updated at 2020-07-13 16:18

Tags: python, html, web-scraping, beautifulsoup

Answer


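After looking through the site's Network tab in the browser, I finally worked out how the site works. The search results are not part of the HTML that requests downloads, which is why find_all() comes back empty; the page fetches them afterwards through API calls. It makes four different API calls behind the scenes. They are as follows: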

  1. https://asia.ensembl.org/Multi/Ajax/search?q=name%3A%22CXCR4%22&rows=200&fq=feature_type%3AGene+AND+database_type%3Acore&facet.field=species&facet.mincount=1&facet=true
  2. https://asia.ensembl.org/Multi/Ajax/search?q=(+NOT+species%3Axxx+)+AND+(+CXCR4+)+AND+(+NOT+species%3Ayyy+)&fq=&rows=1&facet.field=species&facet.field=feature_type&facet.field=strain&facet.mincount=1&facet=true&facet.limit=-1
  3. https://asia.ensembl.org/Multi/Ajax/search?q=(+CXCR4%5E316+AND+species%3A%22CrossSpecies%22+)+OR+(+CXCR4%5E190+AND+species%3A%22Human%22+)+OR+(+CXCR4%5E80+AND+species%3A%22Mouse%22+)+OR+(+CXCR4+AND+species%3A%22Zebrafish%22+)&fq=(++(++species%3A%22CrossSpecies%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Human%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Mouse%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Zebrafish%22+AND+(+reference_strain%3A1+)++)++)&hl=true&hl.fl=_hr&hl.fl=content&hl.fl=description&hl.fragsize=500&rows=10&start=0

So what the site displays is the combined output of the 4 API calls above, and the pieces appear on different pages of the site.
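One quick way to confirm this kind of situation before switching to the API is to list the CSS classes that actually occur in the raw response body. If the class you are searching for is not among them, no parser will find it. The following is a minimal sketch using only the standard library; the sample markup and the class name `search_wrapper` are hypothetical stand-ins, not the real Ensembl page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a piece of markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names found in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Hypothetical stand-in for a raw response body: the results container is
# empty because a script fills it in after the page has loaded.
static_html = """
<html><body>
  <div id="results" class="search_wrapper"></div>
  <script>/* results are fetched and inserted here by JavaScript */</script>
</body></html>
"""

found = classes_in(static_html)
print("table_result" in found)  # False: the class is absent from the raw HTML
print(sorted(found))            # ['search_wrapper']
```

In practice you would pass the text of the requests response to `classes_in()` instead of the sample string.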

import requests

res = requests.get("https://asia.ensembl.org/Multi/Ajax/search?q=(+CXCR4%5E316+AND+species%3A%22CrossSpecies%22+)+OR+(+CXCR4%5E190+AND+species%3A%22Human%22+)+OR+(+CXCR4%5E80+AND+species%3A%22Mouse%22+)+OR+(+CXCR4+AND+species%3A%22Zebrafish%22+)&fq=(++(++species%3A%22CrossSpecies%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Human%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Mouse%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Zebrafish%22+AND+(+reference_strain%3A1+)++)++)&hl=true&hl.fl=_hr&hl.fl=content&hl.fl=description&hl.fragsize=500&rows=10&start=0", verify=False)

result = res.json()
print(result)

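Note: don't forget to pass verify=False in your requests call, otherwise it throws an SSL exception. Be aware that verify=False turns off TLS certificate verification for that request.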
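The hand-encoded URL above only covers CXCR4. To run the same search for every gene in the question's gene_list, the query can be assembled from a template instead. The sketch below rebuilds the `q` and `fq` strings seen in the captured request; the helper names `build_search_params` and `build_search_url` are made up for illustration, and it is an assumption that the species list and boost factors (316, 190, 80) carry over unchanged to other genes.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://asia.ensembl.org/Multi/Ajax/search"

# Species and boost factors copied from the captured CXCR4 request.
# Assumption: the site accepts the same values for any other gene.
SPECIES_BOOSTS = [
    ("CrossSpecies", 316),
    ("Human", 190),
    ("Mouse", 80),
    ("Zebrafish", None),  # no boost in the captured request
]


def build_search_params(gene, rows=10, start=0):
    """Build the query parameters for one gene symbol (hypothetical helper)."""
    q_terms = []
    fq_terms = []
    for species, boost in SPECIES_BOOSTS:
        term = gene if boost is None else f"{gene}^{boost}"
        q_terms.append(f'( {term} AND species:"{species}" )')
        fq_terms.append(f'(  species:"{species}" AND ( reference_strain:1 )  )')
    return {
        "q": " OR ".join(q_terms),
        "fq": "(  " + "  OR  ".join(fq_terms) + "  )",
        "hl": "true",
        "hl.fl": ["_hr", "content", "description"],  # sent as repeated keys
        "hl.fragsize": 500,
        "rows": rows,
        "start": start,
    }


def build_search_url(gene, **kwargs):
    """Return the full, percent-encoded search URL for one gene symbol."""
    params = build_search_params(gene, **kwargs)
    return f"{SEARCH_ENDPOINT}?{urlencode(params, doseq=True)}"


gene_list = ["Ccl2", "CXCR4"]
for gene in gene_list:
    print(build_search_url(gene))
```

The dictionary from `build_search_params()` can also be handed straight to `requests.get(SEARCH_ENDPOINT, params=...)`, which encodes list values as repeated keys in the same way. The encoded URL escapes the parentheses as well, which is equivalent to the literal parentheses in the captured request.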

Output:

{'error': '',
 'result': {'highlighting': {'1d3be01c-f969-40de-a1f8-bfd5bbf40fc1': {},
   '5b2accd3-cfef-4e2a-9d9c-2e70752e4a68': {'_hr': ['<strong><em>Cxcr4</em></strong>-001 (Vega transcript) is an external reference matched to Transcript ENSMUST00000052172']},
   'd2f9e02b-f3f3-4823-9e39-3f727a265acb': {'_hr': ['GO:0031723 (GO record; description: <strong><em>CXCR4</em></strong> chemokine receptor binding,) is an external reference matched to Transcript ENST00000291526']},
   'b66c389f-ade7-4bc6-bcd6-b7011e7bc10e': {'_hr': ['LRG_51t1 (LRG display in Ensembl transcript record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>) is an external reference matched to Transcript ENST00000409817']},
   'dc70ef4d-7627-49d3-bfe8-f7e0c5fde994': {'_hr': ['<strong><em>Cxcr4</em></strong>-002 (Vega transcript) is an external reference matched to Transcript ENSMUST00000142893']},
   'e7d394ec-fd37-4cc2-8a5c-81482299c695': {},
   '8a02b397-ad39-420e-a4ed-89b709d4a3f5': {},
   '2d5880cc-d9f6-4fec-a154-ce9b7ba3c590': {'_hr': ['LRG_51t1 (LRG display in Ensembl transcript record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>) is an external reference matched to Transcript ENST00000241393']},
   '7f406926-0470-4c70-b8c7-f2bd8228be08': {'_hr': ['<strong><em>Cxcr4</em></strong>-001 (Vega transcript) is an external reference matched to Transcript ENSMUST00000052172']},
   'cf47bc6b-6bd0-4690-a0ac-8feed5a5a112': {'_hr': ['LRG_51 (LRG display in Ensembl gene record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>,) is an external reference matched to Gene ENSG00000121966']}},
  'responseHeader': {'QTime': 37,
   'params': {'fq': '(  (  species:"CrossSpecies" AND ( reference_strain:1 )  )  OR  (  species:"Human" AND ( reference_strain:1 )  )  OR  (  species:"Mouse" AND ( reference_strain:1 )  )  OR  (  species:"Zebrafish" AND ( reference_strain:1 )  )  )',
    'hl.fragsize': '500',
    'hl.fl': ['_hr', 'content', 'description'],
    'q': '( CXCR4^316 AND species:"CrossSpecies" ) OR ( CXCR4^190 AND species:"Human" ) OR ( CXCR4^80 AND species:"Mouse" ) OR ( CXCR4 AND species:"Zebrafish" )',
    'hl': 'true',
    'wt': 'json',
    'start': ['0', '0'],
    'rows': '10'},
   'status': 0},
  'response': {'numFound': 24,
   'docs': [{'domain_url': 'homo_sapiens/Gene/Summary?g=ENSG00000121966&db=core',
     'name': 'CXCR4',
     'species': 'Human',
     'ref_boost': 10,
     'location': '2:136114349-136118149:-1',
     'quick_links': ['orthologues:1'],
     'db_boost': 40,
     'website': 'http://www.ensembl.org',
     'reference_strain': 1,
     'id': 'ENSG00000121966',
     'domain': 'http://www.ensembl.org',
     'uid': 'cf47bc6b-6bd0-4690-a0ac-8feed5a5a112',
     'feature_type': 'Gene',
     'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
     'score': 3.3581953,
     'database_type': 'core'},
    {'feature_type': 'Transcript',
     'score': 2.238805,
     'database_type': 'core',
     'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
     'reference_strain': 1,
     'website': 'http://www.ensembl.org',
     'db_boost': 40,
     'uid': '2d5880cc-d9f6-4fec-a154-ce9b7ba3c590',
     'domain': 'http://www.ensembl.org',
     'id': 'ENST00000241393',
     'name': 'CXCR4-201',
     'location': '2:136114349-136118149:-1',
     'quick_links': ['protein:1'],
     'ref_boost': 10,
     'species': 'Human',
     'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000241393&db=core'},
    {'feature_type': 'Transcript',
     'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
     'database_type': 'core',
     'score': 2.238805,
     'website': 'http://www.ensembl.org',
     'db_boost': 40,
     'reference_strain': 1,
     'domain': 'http://www.ensembl.org',
     'id': 'ENST00000409817',
     'uid': 'b66c389f-ade7-4bc6-bcd6-b7011e7bc10e',
     'name': 'CXCR4-202',
     'location': '2:136114349-136116243:-1',
     'quick_links': ['protein:1'],
     'species': 'Human',
     'ref_boost': 10,
     'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000409817&db=core'},
    {'name': 'CXCR4-203',
     'quick_links': ['protein:0'],
     'location': '2:136114637-136117737:-1',
     'species': 'Human',
     'ref_boost': 10,
     'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000466288&db=core',
     'feature_type': 'Transcript',
     'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
     'database_type': 'core',
     'score': 2.238805,
     'website': 'http://www.ensembl.org',
     'db_boost': 40,
     'reference_strain': 1,
     'domain': 'http://www.ensembl.org',
     'id': 'ENST00000466288',
     'uid': '1d3be01c-f969-40de-a1f8-bfd5bbf40fc1'},
    {'domain_url': 'mus_musculus/Gene/Summary?g=ENSMUSG00000045382&db=core',
     'strain': 'Mouse reference (CL57BL6)',
     'name': 'Cxcr4',
     'ref_boost': 10,
     'species': 'Mouse',
     'quick_links': ['orthologues:1'],
     'location': '1:128588199-128592293:-1',
     'db_boost': 40,
     'website': 'http://www.ensembl.org',
     'reference_strain': 1,
     'id': 'ENSMUSG00000045382',
     'domain': 'http://www.ensembl.org',
     'uid': '5b2accd3-cfef-4e2a-9d9c-2e70752e4a68',
     'feature_type': 'Gene',
     'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
     'score': 1.4139885,
     'database_type': 'core'},
    {'location': '1:128588199-128592290:-1',
     'quick_links': ['protein:1'],
     'species': 'Mouse',
     'ref_boost': 10,
     'name': 'Cxcr4-201',
     'strain': 'Mouse reference (CL57BL6)',
     'domain_url': 'mus_musculus/Transcript/Summary?t=ENSMUST00000052172&db=core',
     'score': 0.9426663,
     'database_type': 'core',
     'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
     'feature_type': 'Transcript',
     'uid': '7f406926-0470-4c70-b8c7-f2bd8228be08',
     'domain': 'http://www.ensembl.org',
     'id': 'ENSMUST00000052172',
     'reference_strain': 1,
     'website': 'http://www.ensembl.org',
     'db_boost': 40},
    {'reference_strain': 1,
     'website': 'http://www.ensembl.org',
     'db_boost': 40,
     'uid': 'dc70ef4d-7627-49d3-bfe8-f7e0c5fde994',
     'domain': 'http://www.ensembl.org',
     'id': 'ENSMUST00000142893',
     'feature_type': 'Transcript',
     'score': 0.9426663,
     'database_type': 'core',
     'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
     'domain_url': 'mus_musculus/Transcript/Summary?t=ENSMUST00000142893&db=core',
     'strain': 'Mouse reference (CL57BL6)',
     'name': 'Cxcr4-202',
     'location': '1:128589099-128592293:-1',
     'quick_links': ['protein:1'],
     'species': 'Mouse',
     'ref_boost': 10},
    {'reference_strain': 1,
     'website': 'http://www.ensembl.org',
     'uid': 'e7d394ec-fd37-4cc2-8a5c-81482299c695',
     'id': 'Cxcr4',
     'domain': 'http://www.ensembl.org',
     'feature_type': 'Marker',
     'database_type': 'core',
     'score': 0.01179975,
     'domain_url': 'mus_musculus/Marker/Details?m=Cxcr4',
     'strain': 'Mouse reference (CL57BL6)',
     'species': 'Mouse'},
    {'domain_url': 'homo_sapiens/Gene/Summary?g=ENSG00000160181&db=core',
     'ref_boost': 10,
     'species': 'Human',
     'quick_links': ['orthologues:1'],
     'location': '21:42346357-42350997:-1',
     'name': 'TFF2',
     'id': 'ENSG00000160181',
     'domain': 'http://www.ensembl.org',
     'uid': 'd2f9e02b-f3f3-4823-9e39-3f727a265acb',
     'db_boost': 40,
     'website': 'http://www.ensembl.org',
     'reference_strain': 1,
     'description': 'trefoil factor 2 [Source:HGNC Symbol;Acc:HGNC:11756]',
     'database_type': 'core',
     'score': 0.0072125974,
     'feature_type': 'Gene'},
    {'feature_type': 'Protein Family',
     'description': 'Ensembl protein family PTHR24227 [C C CHEMOKINE RECEPTOR TYPE C C CKR CC CKR CCR ANTIGEN]: 27 genes / 77 proteins in homo sapiens',
     'score': 0.004816582,
     'database_type': 'core',
     'website': 'http://www.ensembl.org',
     'reference_strain': 1,
     'domain': 'http://www.ensembl.org',
     'id': 'PTHR24227',
     'uid': '8a02b397-ad39-420e-a4ed-89b709d4a3f5',
     'name': 'PTHR24227',
     'species': 'Human',
     'domain_url': 'homo_sapiens/Gene/Family?family=PTHR24227;g=ENSG00000163464'}],
   'start': 0,
   'maxScore': 3.3581953}}}
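For the follow-up processing mentioned in the question, the interesting part of this response is the list under result → response → docs, with the matching highlight text stored under result → highlighting and keyed by each document's uid. Here is a rough sketch of flattening that structure into one row per hit. The function name `summarize_docs` is hypothetical, the sample payload is trimmed from the output above, and joining `domain` and `domain_url` with a slash to form a link is an assumption.

```python
def summarize_docs(payload, feature_type=None):
    """Flatten the search response into a list of small dicts, one per hit.

    If feature_type is given (e.g. "Gene"), only hits of that type are kept.
    """
    result = payload["result"]
    highlighting = result.get("highlighting", {})
    rows = []
    for doc in result["response"]["docs"]:
        if feature_type is not None and doc.get("feature_type") != feature_type:
            continue
        rows.append({
            "id": doc.get("id"),
            # Some hits (e.g. Marker entries) have no name or location
            "name": doc.get("name"),
            "species": doc.get("species"),
            "feature_type": doc.get("feature_type"),
            "location": doc.get("location"),
            # Assumption: the page link is domain + "/" + domain_url
            "url": f"{doc['domain']}/{doc['domain_url']}",
            "highlights": highlighting.get(doc.get("uid"), {}).get("_hr", []),
        })
    return rows


# Trimmed sample taken from the output above (two of the ten docs)
sample = {
    "error": "",
    "result": {
        "highlighting": {"2d5880cc-d9f6-4fec-a154-ce9b7ba3c590": {}},
        "response": {
            "numFound": 24,
            "docs": [
                {"id": "ENSG00000121966", "name": "CXCR4", "species": "Human",
                 "feature_type": "Gene", "location": "2:136114349-136118149:-1",
                 "domain": "http://www.ensembl.org",
                 "domain_url": "homo_sapiens/Gene/Summary?g=ENSG00000121966&db=core",
                 "uid": "cf47bc6b-6bd0-4690-a0ac-8feed5a5a112"},
                {"id": "ENST00000241393", "name": "CXCR4-201", "species": "Human",
                 "feature_type": "Transcript", "location": "2:136114349-136118149:-1",
                 "domain": "http://www.ensembl.org",
                 "domain_url": "homo_sapiens/Transcript/Summary?t=ENST00000241393&db=core",
                 "uid": "2d5880cc-d9f6-4fec-a154-ce9b7ba3c590"},
            ],
        },
    },
}

genes = summarize_docs(sample, feature_type="Gene")
for row in genes:
    print(row["id"], row["name"], row["species"], row["location"])
```

With a live response, `summarize_docs(res.json(), feature_type="Gene")` would be the equivalent call.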
