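python - Web Scraping w/BeautifulSoup4 - How to filter tags containing a specific string?
Question
How can I filter the following HTML snippet so that the span tags containing "Codigo" are appended to List A, the span tags containing "Acao" to List B, and so on?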
Expected output:
List A: ['ABEV3', 'AZUL4']
List B: ['AMBEV S/A', 'AZUL']
List C: ['ON', 'PN']
List D: [4355174839, 326903173]
List E: [2.948, 0.432]
[...]
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
[...]
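Solution
To build the lists, you can use the CSS attribute selector [id$="..."], which matches tags whose id= attribute ends with the specified string.
For example: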
from bs4 import BeautifulSoup
html_data = '''
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
'''
soup = BeautifulSoup(html_data, 'html.parser')
# [id$="..."] selects every tag whose id attribute ends with the given suffix
list_a = [t.text for t in soup.select('[id$="_lblCodigo"]')]
list_b = [t.text for t in soup.select('[id$="_lblAcao"]')]
list_c = [t.text for t in soup.select('[id$="_lblTipo"]')]
# The numbers use '.' as the thousands separator and ',' as the decimal mark
list_d = [int(t.text.replace('.', '')) for t in soup.select('[id$="_lblQtdeTeorica_Formatada"]')]
list_e = [float(t.text.replace(',', '.')) for t in soup.select('[id$="_lblPart_Formatada"]')]
print(list_a)
print(list_b)
print(list_c)
print(list_d)
print(list_e)
Prints:
['ABEV3', 'AZUL4']
['AMBEV S/A', 'AZUL']
['ON', 'PN N2']
[4355174839, 326903173]
[2.948, 0.432]
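If you want to match ids that merely contain a string, as the question title asks, rather than end with it, a substring match should also work. The following is a minimal sketch, not part of the original answer: the ids are shortened, hypothetical stand-ins for the long ctl00_... ids above, and the variable names codes_css and names_re are made up for illustration. It shows two options: the CSS selector [id*="..."], and find_all() with a compiled regular expression on the id attribute.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical ids; the real page uses a long ctl00_... prefix.
html_data = '''
<span id="grd_ctl04_lblCodigo">ABEV3</span>,
<span id="grd_ctl04_lblAcao">AMBEV S/A</span>,
<span id="grd_ctl06_lblCodigo">AZUL4</span>,
<span id="grd_ctl06_lblAcao">AZUL</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')

# CSS substring match: any span whose id contains "Codigo" anywhere.
codes_css = [t.text for t in soup.select('span[id*="Codigo"]')]

# Same idea with find_all and a regex that is searched within the id value.
names_re = [t.text for t in soup.find_all('span', id=re.compile('Acao'))]

print(codes_css)
print(names_re)
```

The suffix selector [id$="..."] is stricter and is probably the safer choice when another id could contain the same word in the middle; the substring forms are handy when the ending of the id varies.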