python - Scraping specific child elements
Problem description
I want to scrape a page so that I end up with two lists:
ListA = ["Driver Convenience","Exterior Features"]
ListB = ["2 key fob;Collision mitigation braking system;","Body coloured plastic front bumper;Boulder grey exterior door handle;Boulder grey exterior door mirrorn;"]
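ListA will contain the text inside the h4 tags, and ListB will contain the text inside the li tags that follow each h4, up to the point where the next h4 tag is found.
Here is a sample of the HTML code: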
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item" rel="2-key-fob"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item" rel="collision-mitigation-braking-system">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item" rel="body-coloured-plastic-front-bumper">Body coloured plastic front bumper</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-handle">Boulder grey exterior door handle</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-mirror">Boulder grey exterior door mirrorn</li>
</ul>
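The real HTML has the same structure as this. I have tried many things but could not get it to work.
Solution
Use find_next_siblings('li') to find the li tags that come after each h4. Because that call also returns the li tags under later headings, check each one with find_previous_sibling('h4') and add it to the list only when its nearest preceding h4 is the current heading.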
from bs4 import BeautifulSoup

data = '''
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item" rel="2-key-fob"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item" rel="collision-mitigation-braking-system">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item" rel="body-coloured-plastic-front-bumper">Body coloured plastic front bumper</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-handle">Boulder grey exterior door handle</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-mirror">Boulder grey exterior door mirrorn</li>
</ul>'''

ListA = []  # heading text, one entry per h4
ListB = []  # ';'-terminated li texts, one string per h4

soup = BeautifulSoup(data, 'lxml')
for item in soup.find_all('h4'):
    lifinal = ""
    ListA.append(item.text)
    # Every li sibling after this h4, including those under later headings
    nextlis = item.find_next_siblings('li')
    for li in nextlis:
        # Keep only the li tags whose nearest preceding h4 is the current one
        if li.find_previous_sibling('h4').text == item.text:
            lifinal = lifinal + li.text.strip() + ";"
    ListB.append(lifinal)

print(ListA)
print(ListB)
Output:
['Driver Convenience', 'Exterior Features']
['2 key fob;Collision mitigation braking system;', 'Body coloured plastic front bumper;Boulder grey exterior door handle;Boulder grey exterior door mirrorn;']
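The sibling check above rescans the later li tags once per heading. As a hedged alternative, not part of the original answer, the same grouping can be sketched as a single pass over the direct children of the ul, starting a new group at every h4. The function name group_items_by_heading, the shortened sample markup, and the choice of the built-in 'html.parser' backend (so that lxml is not required) are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same structure as the HTML in the question
html = '''
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item">Body coloured plastic front bumper</li>
</ul>'''


def group_items_by_heading(markup):
    """Hypothetical helper: return (headings, joined_items) in one pass."""
    soup = BeautifulSoup(markup, 'html.parser')
    container = soup.find('ul', class_='c-list-table')
    headings, groups = [], []
    # recursive=False walks only the direct h4/li children, in document order
    for child in container.find_all(['h4', 'li'], recursive=False):
        if child.name == 'h4':
            headings.append(child.get_text(strip=True))
            groups.append([])  # start a new group for this heading
        elif groups:  # ignore any li that appears before the first h4
            groups[-1].append(child.get_text(strip=True))
    # Match the ';'-terminated format used for ListB in the question
    joined = [''.join(text + ';' for text in group) for group in groups]
    return headings, joined


list_a, list_b = group_items_by_heading(html)
print(list_a)
print(list_b)
```

Because each li is assigned to whichever heading was seen most recently, this version does not compare heading text at all, so two sections with identical titles would still be kept apart.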