Web Scraping w/BeautifulSoup4 - How to filter tags that contain a specific string?

Problem description

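How can I filter the following HTML snippet so that span tags whose id contains "Codigo" are appended to list A, span tags whose id contains "Acao" to list B, and so on?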

Expected output:

List A: ['ABEV3', 'AZUL4']
List B: ['AMBEV S/A', 'AZUL']
List C: ['ON', 'PN']
List D: [4355174839, 326903173]
List E: [2.948, 0.432]
[...]
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
[...]

Tags: python, beautifulsoup, request

Solution


To get the various lists, you can use the CSS selector [id$="..."], which finds tags whose id= attribute ends with the specified string. For example:

from bs4 import BeautifulSoup


html_data = '''
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN      N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')

# [id$="..."] matches any tag whose id attribute ends with the given suffix.
list_a = [t.text for t in soup.select('[id$="_lblCodigo"]')]
list_b = [t.text for t in soup.select('[id$="_lblAcao"]')]
list_c = [t.text for t in soup.select('[id$="_lblTipo"]')]
# The numbers use '.' as the thousands separator and ',' as the decimal mark.
list_d = [int(t.text.replace('.', '')) for t in soup.select('[id$="_lblQtdeTeorica_Formatada"]')]
list_e = [float(t.text.replace(',', '.')) for t in soup.select('[id$="_lblPart_Formatada"]')]

print(list_a)
print(list_b)
print(list_c)
print(list_d)
print(list_e)

Prints:

['ABEV3', 'AZUL4']
['AMBEV S/A', 'AZUL']
['ON', 'PN      N2']
[4355174839, 326903173]
[2.948, 0.432]
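
Note that the expected output in the question shows 'PN' for the second type, while the code above prints 'PN      N2'. If only the first whitespace-separated token is wanted (an assumption, since the question does not say how the extra text should be handled), a minimal sketch could post-process the list like this. The helper name first_token is hypothetical and the input values are copied from the printed output above:

```python
# Values copied from the printed list_c above.
list_c = ['ON', 'PN      N2']


def first_token(text):
    """Return the first whitespace-separated token, or '' if the text is blank."""
    parts = text.split()  # split() with no argument collapses runs of whitespace
    return parts[0] if parts else ''


# Assumption: only the leading share-type code ('ON', 'PN') is of interest.
list_c_short = [first_token(t) for t in list_c]

print(list_c_short)  # ['ON', 'PN']
```

If the label may appear anywhere in the id rather than at its end, the CSS substring form [id*="Codigo"] matches ids that contain the string instead of ids that end with it.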
