BeautifulSoup: extracting the text that follows a specific span

Question

I know there are many similar questions out there, but I can't work out my specific case.

On the page, I want to extract the number '121,320' from the following line: 'Mass (Da):121,320'

From the BeautifulSoup output I can see that this is the part I want:

</div><a class="show-link" href="#" id="O00203-show-link" style="display:none">Show »</a></div><div class="sequence-isoform-rightcol"><div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span><span>1,094</span></div><div><span class="sequence-field-header tooltiped" title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span><span>121,320</span>

This is what I am trying:

import urllib
import requests
import sys
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
        data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
        soup = BeautifulSoup(data.content, 'html.parser')


        #prints all spans
        print(soup.find_all('span'))

        #prints empty list
        print(soup.find_all('span',title_='The mass of the unprocessed protein, in Daltons.'))

The closest I have come is by following this answer on SO:

    div1 = soup.find("div", { "class" : "sequence-isoform-rightcol" }).findAll('span', { "class" : "sequence-field-header tooltiped" })
    for x in div1:
            print(x.text)

The problem is that this prints:

Length:
Mass (Da):

without the actual values.

How can I extract the mass from each page I have? In this case it is 121,320.

Tags: python, beautifulsoup

Answer


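You can search for the label text with a regular expression (the re module) and then call find_next('span') to reach the span that holds the value: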

import re

import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # Find the label span by its text, then step to the span that follows it.
    # (string= is the newer name for the text= keyword, available since bs4 4.4.)
    print(soup.find('span', string=re.compile("Mass")).find_next('span').text)

Output

121,320
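
To see why the loop in the question printed only the labels, it may help to run the same lookup offline. The following is a minimal sketch, assuming the HTML fragment quoted in the question (trimmed here to the two relevant rows, with closing tags added); it is not fetched from the live page. It shows that the value sits in the span after the label, and that filtering on the attribute also works when the keyword is spelled title rather than title_.

```python
import re
from bs4 import BeautifulSoup

# HTML fragment adapted from the question (trimmed, closing tags added).
html = (
    '<div class="sequence-isoform-rightcol">'
    '<div><span class="sequence-field-header tooltiped" '
    'title="Sequence length.">Length:</span><span>1,094</span></div>'
    '<div><span class="sequence-field-header tooltiped" '
    'title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span>'
    '<span>121,320</span></div>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# Match the label span by its text, then move to the next span in the document.
label = soup.find('span', string=re.compile(r'Mass'))
mass_text = label.find_next('span').text

# Matching on the attribute works as well, provided the keyword is `title`.
# Only `class_` gets the trailing-underscore treatment; `title_` would be
# treated as a filter on an attribute literally named "title_".
by_title = soup.find('span', title='The mass of the unprocessed protein, in Daltons.')
same_text = by_title.find_next_sibling('span').text

print(mass_text, same_text)
```

find_next('span') walks forward in document order, while find_next_sibling('span') only looks at elements sharing the same parent; for this fragment both land on the same span.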

Alternatively, if you have bs4 4.7 or later, you can use the following CSS selector.

import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # :contains() is a non-standard selector; newer Soup Sieve releases
    # prefer the name :-soup-contains().
    print(soup.select_one('span:contains("Mass (Da)")').find_next('span').text)

Output

121,320
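
The selector approach can also be tried offline. This is a sketch under a few assumptions: it uses the HTML fragment from the question rather than the live page, it assumes Soup Sieve 2.1 or later so that the :-soup-contains() spelling is available, and parse_mass is a hypothetical helper added here only to turn the extracted string into a number.

```python
from bs4 import BeautifulSoup

# HTML fragment adapted from the question (trimmed, closing tags added).
html = (
    '<div class="sequence-isoform-rightcol">'
    '<div><span class="sequence-field-header tooltiped" '
    'title="Sequence length.">Length:</span><span>1,094</span></div>'
    '<div><span class="sequence-field-header tooltiped" '
    'title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span>'
    '<span>121,320</span></div>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# `span:-soup-contains(...)` matches the label span by its text;
# `+ span` then selects the span that immediately follows it.
value = soup.select_one('span:-soup-contains("Mass (Da)") + span')


def parse_mass(text):
    """Hypothetical helper: convert a string such as '121,320' to an int."""
    return int(text.replace(',', ''))


# select_one returns None when nothing matches, so guard before using .text.
mass = parse_mass(value.text) if value is not None else None
print(mass)
```

Putting the sibling step into the selector avoids the separate find_next call, and the None check keeps the loop from raising on pages where the label is missing.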
