python - No matter which CSS class I pass to BeautifulSoup's find_all(), I get an empty list as output
Problem description
import requests
from bs4 import BeautifulSoup

gene_list = {"Ccl2", "CXCR4"}
for seq in gene_list:
    # Fetch the search results page for each gene name.
    text = requests.get("https://uswest.ensembl.org/Multi/Search/Results?q=" + seq + ";site=ensembl").text
    soup = BeautifulSoup(text, "lxml")
    # This comes back as an empty list for every class I have tried.
    soup.find_all("div", {"class": "table_result"})
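I'm trying to search the Ensembl website (https://uswest.ensembl.org/index.html) for unannotated genes that we found in our sequencing data, and then run some follow-up searches and data processing on them.
I just can't get my scraper to return anything from find_all(). I have tried every parser (html.parser, lxml, html5lib), both syntaxes for the class ({'class': 'name_of_class'} and class_='name_of_class'), and several different CSS classes. If I pass only an HTML element such as "div", it returns the expected output. I don't know why it won't return anything for the div/class combination specified above.
I have tried many different CSS classes; the one in the code above is just the last example I tried.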
Finally, here is my session information:
-----
bs4 4.9.1
requests 2.24.0
sinfo 0.3.1
-----
Python 3.7.3 (default, Apr 24 2020, 18:51:23) [Clang 11.0.3 (clang-1103.0.32.62)]
Darwin-19.5.0-x86_64-i386-64bit
8 logical CPU cores, i386
-----
Session information updated at 2020-07-13 16:18
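Solution
After going through the site's requests in the browser's network tab, I finally figured out how the site works. Basically, it makes four different API calls behind the scenes. They are as follows: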
- https://asia.ensembl.org/Multi/Ajax/search?q=name%3A%22CXCR4%22&rows=200&fq=feature_type%3AGene+AND+database_type%3Acore&facet.field=species&facet.mincount=1&facet=true
- https://asia.ensembl.org/Multi/Ajax/search?q=(+NOT+species%3Axxx+)+AND+(+CXCR4+)+AND+(+NOT+species%3Ayyy+)&fq=&rows=1&facet.field=species&facet.field=feature_type&facet.field=strain&facet.mincount=1&facet=true&facet.limit=-1
- https://asia.ensembl.org/Multi/Ajax/search?q=(+CXCR4%5E316+AND+species%3A%22CrossSpecies%22+)+OR+(+CXCR4%5E190+AND+species%3A%22Human%22+)+OR+(+CXCR4%5E80+AND+species%3A%22Mouse%22+)+OR+(+CXCR4+AND+species%3A%22Zebrafish%22+)&fq=(++(++species%3A%22CrossSpecies%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Human%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Mouse%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Zebrafish%22+AND+(+reference_strain%3A1+)++)++)&hl=true&hl.fl=_hr&hl.fl=content&hl.fl=description&hl.fragsize=500&rows=10&start=0
The search results are therefore the combined output of the four API calls above, which appear on different pages of the site.
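To search for other genes, the third URL above does not have to be edited by hand. Here is a minimal sketch of assembling it from a gene name with the standard library. The names build_search_url and SPECIES_BOOSTS are made up for this example, and the species list and boost factors are simply copied from the one request observed above. They are not documented constants, and the endpoint may behave differently for other queries.

```python
from urllib.parse import urlencode

# Endpoint taken from the calls listed above.
SEARCH_URL = "https://asia.ensembl.org/Multi/Ajax/search"

# (species, boost) pairs copied from the observed request; assumption:
# the same values are acceptable for other gene names.
SPECIES_BOOSTS = [
    ("CrossSpecies", 316),
    ("Human", 190),
    ("Mouse", 80),
    ("Zebrafish", None),  # no boost factor in the observed request
]


def build_search_url(gene, rows=10, start=0):
    """Return a search URL for `gene`, modelled on the third call above."""
    q_parts = []
    fq_parts = []
    for species, boost in SPECIES_BOOSTS:
        term = gene if boost is None else f"{gene}^{boost}"
        q_parts.append(f'( {term} AND species:"{species}" )')
        fq_parts.append(f'( species:"{species}" AND ( reference_strain:1 ) )')

    # A list of pairs (not a dict) so that hl.fl can be repeated.
    params = [
        ("q", " OR ".join(q_parts)),
        ("fq", "( " + " OR ".join(fq_parts) + " )"),
        ("hl", "true"),
        ("hl.fl", "_hr"),
        ("hl.fl", "content"),
        ("hl.fl", "description"),
        ("hl.fragsize", "500"),
        ("rows", str(rows)),
        ("start", str(start)),
    ]
    # urlencode percent-encodes the parentheses too; they decode to the
    # same query string as the hand-written URL.
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("Ccl2"))
```

The returned string can be passed to requests.get() in place of the hard-coded URL.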
import requests
res = requests.get("https://asia.ensembl.org/Multi/Ajax/search?q=(+CXCR4%5E316+AND+species%3A%22CrossSpecies%22+)+OR+(+CXCR4%5E190+AND+species%3A%22Human%22+)+OR+(+CXCR4%5E80+AND+species%3A%22Mouse%22+)+OR+(+CXCR4+AND+species%3A%22Zebrafish%22+)&fq=(++(++species%3A%22CrossSpecies%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Human%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Mouse%22+AND+(+reference_strain%3A1+)++)++OR++(++species%3A%22Zebrafish%22+AND+(+reference_strain%3A1+)++)++)&hl=true&hl.fl=_hr&hl.fl=content&hl.fl=description&hl.fragsize=500&rows=10&start=0", verify=False)
result = res.json()
print(result)
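Note: don't forget to pass verify=False in your requests call (it disables TLS certificate verification); otherwise the call fails with an SSL error (requests.exceptions.SSLError).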
Output:
{'error': '',
'result': {'highlighting': {'1d3be01c-f969-40de-a1f8-bfd5bbf40fc1': {},
'5b2accd3-cfef-4e2a-9d9c-2e70752e4a68': {'_hr': ['<strong><em>Cxcr4</em></strong>-001 (Vega transcript) is an external reference matched to Transcript ENSMUST00000052172']},
'd2f9e02b-f3f3-4823-9e39-3f727a265acb': {'_hr': ['GO:0031723 (GO record; description: <strong><em>CXCR4</em></strong> chemokine receptor binding,) is an external reference matched to Transcript ENST00000291526']},
'b66c389f-ade7-4bc6-bcd6-b7011e7bc10e': {'_hr': ['LRG_51t1 (LRG display in Ensembl transcript record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>) is an external reference matched to Transcript ENST00000409817']},
'dc70ef4d-7627-49d3-bfe8-f7e0c5fde994': {'_hr': ['<strong><em>Cxcr4</em></strong>-002 (Vega transcript) is an external reference matched to Transcript ENSMUST00000142893']},
'e7d394ec-fd37-4cc2-8a5c-81482299c695': {},
'8a02b397-ad39-420e-a4ed-89b709d4a3f5': {},
'2d5880cc-d9f6-4fec-a154-ce9b7ba3c590': {'_hr': ['LRG_51t1 (LRG display in Ensembl transcript record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>) is an external reference matched to Transcript ENST00000241393']},
'7f406926-0470-4c70-b8c7-f2bd8228be08': {'_hr': ['<strong><em>Cxcr4</em></strong>-001 (Vega transcript) is an external reference matched to Transcript ENSMUST00000052172']},
'cf47bc6b-6bd0-4690-a0ac-8feed5a5a112': {'_hr': ['LRG_51 (LRG display in Ensembl gene record; description: Locus Reference Genomic record for <strong><em>CXCR4</em></strong>,) is an external reference matched to Gene ENSG00000121966']}},
'responseHeader': {'QTime': 37,
'params': {'fq': '( ( species:"CrossSpecies" AND ( reference_strain:1 ) ) OR ( species:"Human" AND ( reference_strain:1 ) ) OR ( species:"Mouse" AND ( reference_strain:1 ) ) OR ( species:"Zebrafish" AND ( reference_strain:1 ) ) )',
'hl.fragsize': '500',
'hl.fl': ['_hr', 'content', 'description'],
'q': '( CXCR4^316 AND species:"CrossSpecies" ) OR ( CXCR4^190 AND species:"Human" ) OR ( CXCR4^80 AND species:"Mouse" ) OR ( CXCR4 AND species:"Zebrafish" )',
'hl': 'true',
'wt': 'json',
'start': ['0', '0'],
'rows': '10'},
'status': 0},
'response': {'numFound': 24,
'docs': [{'domain_url': 'homo_sapiens/Gene/Summary?g=ENSG00000121966&db=core',
'name': 'CXCR4',
'species': 'Human',
'ref_boost': 10,
'location': '2:136114349-136118149:-1',
'quick_links': ['orthologues:1'],
'db_boost': 40,
'website': 'http://www.ensembl.org',
'reference_strain': 1,
'id': 'ENSG00000121966',
'domain': 'http://www.ensembl.org',
'uid': 'cf47bc6b-6bd0-4690-a0ac-8feed5a5a112',
'feature_type': 'Gene',
'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
'score': 3.3581953,
'database_type': 'core'},
{'feature_type': 'Transcript',
'score': 2.238805,
'database_type': 'core',
'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
'reference_strain': 1,
'website': 'http://www.ensembl.org',
'db_boost': 40,
'uid': '2d5880cc-d9f6-4fec-a154-ce9b7ba3c590',
'domain': 'http://www.ensembl.org',
'id': 'ENST00000241393',
'name': 'CXCR4-201',
'location': '2:136114349-136118149:-1',
'quick_links': ['protein:1'],
'ref_boost': 10,
'species': 'Human',
'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000241393&db=core'},
{'feature_type': 'Transcript',
'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
'database_type': 'core',
'score': 2.238805,
'website': 'http://www.ensembl.org',
'db_boost': 40,
'reference_strain': 1,
'domain': 'http://www.ensembl.org',
'id': 'ENST00000409817',
'uid': 'b66c389f-ade7-4bc6-bcd6-b7011e7bc10e',
'name': 'CXCR4-202',
'location': '2:136114349-136116243:-1',
'quick_links': ['protein:1'],
'species': 'Human',
'ref_boost': 10,
'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000409817&db=core'},
{'name': 'CXCR4-203',
'quick_links': ['protein:0'],
'location': '2:136114637-136117737:-1',
'species': 'Human',
'ref_boost': 10,
'domain_url': 'homo_sapiens/Transcript/Summary?t=ENST00000466288&db=core',
'feature_type': 'Transcript',
'description': 'C-X-C motif chemokine receptor 4 [Source:HGNC Symbol;Acc:HGNC:2561]',
'database_type': 'core',
'score': 2.238805,
'website': 'http://www.ensembl.org',
'db_boost': 40,
'reference_strain': 1,
'domain': 'http://www.ensembl.org',
'id': 'ENST00000466288',
'uid': '1d3be01c-f969-40de-a1f8-bfd5bbf40fc1'},
{'domain_url': 'mus_musculus/Gene/Summary?g=ENSMUSG00000045382&db=core',
'strain': 'Mouse reference (CL57BL6)',
'name': 'Cxcr4',
'ref_boost': 10,
'species': 'Mouse',
'quick_links': ['orthologues:1'],
'location': '1:128588199-128592293:-1',
'db_boost': 40,
'website': 'http://www.ensembl.org',
'reference_strain': 1,
'id': 'ENSMUSG00000045382',
'domain': 'http://www.ensembl.org',
'uid': '5b2accd3-cfef-4e2a-9d9c-2e70752e4a68',
'feature_type': 'Gene',
'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
'score': 1.4139885,
'database_type': 'core'},
{'location': '1:128588199-128592290:-1',
'quick_links': ['protein:1'],
'species': 'Mouse',
'ref_boost': 10,
'name': 'Cxcr4-201',
'strain': 'Mouse reference (CL57BL6)',
'domain_url': 'mus_musculus/Transcript/Summary?t=ENSMUST00000052172&db=core',
'score': 0.9426663,
'database_type': 'core',
'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
'feature_type': 'Transcript',
'uid': '7f406926-0470-4c70-b8c7-f2bd8228be08',
'domain': 'http://www.ensembl.org',
'id': 'ENSMUST00000052172',
'reference_strain': 1,
'website': 'http://www.ensembl.org',
'db_boost': 40},
{'reference_strain': 1,
'website': 'http://www.ensembl.org',
'db_boost': 40,
'uid': 'dc70ef4d-7627-49d3-bfe8-f7e0c5fde994',
'domain': 'http://www.ensembl.org',
'id': 'ENSMUST00000142893',
'feature_type': 'Transcript',
'score': 0.9426663,
'database_type': 'core',
'description': 'chemokine (C-X-C motif) receptor 4 [Source:MGI Symbol;Acc:MGI:109563]',
'domain_url': 'mus_musculus/Transcript/Summary?t=ENSMUST00000142893&db=core',
'strain': 'Mouse reference (CL57BL6)',
'name': 'Cxcr4-202',
'location': '1:128589099-128592293:-1',
'quick_links': ['protein:1'],
'species': 'Mouse',
'ref_boost': 10},
{'reference_strain': 1,
'website': 'http://www.ensembl.org',
'uid': 'e7d394ec-fd37-4cc2-8a5c-81482299c695',
'id': 'Cxcr4',
'domain': 'http://www.ensembl.org',
'feature_type': 'Marker',
'database_type': 'core',
'score': 0.01179975,
'domain_url': 'mus_musculus/Marker/Details?m=Cxcr4',
'strain': 'Mouse reference (CL57BL6)',
'species': 'Mouse'},
{'domain_url': 'homo_sapiens/Gene/Summary?g=ENSG00000160181&db=core',
'ref_boost': 10,
'species': 'Human',
'quick_links': ['orthologues:1'],
'location': '21:42346357-42350997:-1',
'name': 'TFF2',
'id': 'ENSG00000160181',
'domain': 'http://www.ensembl.org',
'uid': 'd2f9e02b-f3f3-4823-9e39-3f727a265acb',
'db_boost': 40,
'website': 'http://www.ensembl.org',
'reference_strain': 1,
'description': 'trefoil factor 2 [Source:HGNC Symbol;Acc:HGNC:11756]',
'database_type': 'core',
'score': 0.0072125974,
'feature_type': 'Gene'},
{'feature_type': 'Protein Family',
'description': 'Ensembl protein family PTHR24227 [C C CHEMOKINE RECEPTOR TYPE C C CKR CC CKR CCR ANTIGEN]: 27 genes / 77 proteins in homo sapiens',
'score': 0.004816582,
'database_type': 'core',
'website': 'http://www.ensembl.org',
'reference_strain': 1,
'domain': 'http://www.ensembl.org',
'id': 'PTHR24227',
'uid': '8a02b397-ad39-420e-a4ed-89b709d4a3f5',
'name': 'PTHR24227',
'species': 'Human',
'domain_url': 'homo_sapiens/Gene/Family?family=PTHR24227;g=ENSG00000163464'}],
'start': 0,
'maxScore': 3.3581953}}}
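For the follow-up processing mentioned in the question, the interesting part of this response is the list under result → response → docs. As a rough sketch, the gene entries can be pulled out as below. The helper name summarize_docs is made up for this example, and the sample dictionary is a trimmed copy of the output above rather than a live response.

```python
def summarize_docs(result, feature_type="Gene"):
    """Return id, name, species and location for docs of one feature type."""
    rows = []
    for doc in result["result"]["response"]["docs"]:
        if doc.get("feature_type") != feature_type:
            continue
        rows.append({
            "id": doc["id"],
            # Not every doc has these keys (the Marker entry above has no
            # name or location), so use .get() instead of indexing.
            "name": doc.get("name"),
            "species": doc.get("species"),
            "location": doc.get("location"),
        })
    return rows


# Trimmed copy of the output above: three docs, most fields dropped.
sample = {
    "error": "",
    "result": {
        "response": {
            "numFound": 24,
            "docs": [
                {"id": "ENSG00000121966", "name": "CXCR4", "species": "Human",
                 "feature_type": "Gene", "location": "2:136114349-136118149:-1"},
                {"id": "ENST00000241393", "name": "CXCR4-201", "species": "Human",
                 "feature_type": "Transcript", "location": "2:136114349-136118149:-1"},
                {"id": "ENSMUSG00000045382", "name": "Cxcr4", "species": "Mouse",
                 "feature_type": "Gene", "location": "1:128588199-128592293:-1"},
            ],
        },
    },
}

for row in summarize_docs(sample):
    print(row["id"], row["name"], row["species"], row["location"])
```

With a real response, pass the dictionary returned by res.json() instead of sample.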