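html - How do I extract the content of each df1 (merit, disadvantages, df_tit) with BeautifulSoup?
Problem description
I have a question about extracting tags. This is the structure of the HTML.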
<div class="content_body_ty1">
<div class="us_label_wrap">..</div>
<h2 class="us_label">
<dl class="tc_list">
<dt class = "merit">merit</dt>
<dd class = "df1">
<span> blah~~~~~~~~blah~~~~</span>
</dd>
<dt class = "disadvantages">disadvantages</dt>
<dd class = "df1">
<span> blah~~~~~~~~blah~~~~</span>
</dd>
<dt class = "df_tit">wish</dt>
<dd class = "df1">
<span> blah~~~~~~~~blah~~~~</span>
</dd>
I want to extract the tag contents with a for loop and then put each item into a list: 1) "merit", blah~~~~ 2) "disadvantages", blah~~~~ 3) "df_tit", blah~~~~
Here is my code:
maximum = 3
merit = []
disadv = []
tit = []
for page_number in range(1, maximum+1):
    URL = 'https://www.example.co.kr/companies/reviews/page={}'.format(page_number)
    response = client.get(URL)
    print(page_number)
    whole_source = response.content.decode('utf-8')
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.find_all('dl', class_ = 'tc_list'):
        if entry.find('dt', class_ = 'merit'):
            merit.append(entry.find('dd', class_ = 'df1'))
        elif entry.find('dt', class_ = 'disadvantage'):
            disadv.append(entry.find('dd', class_ = 'df1'))
        elif entry.find('dt', class_ = 'df_tit'):
            tit.append(entry.find('dd', class_ = 'df1'))
How can I extract the tag contents? Please take a look at this problem. Thanks!
Solution
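You can find the dd tags and look at each one's preceding dt sibling (previous_sibling, or find_previous_sibling('dt') when whitespace sits between the tags) to check which class the element belongs to.
See the code below: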
from bs4 import BeautifulSoup

# Sample markup from the question, minified into one string (no whitespace between tags)
html = '<div class="content_body_ty1"><div class="us_label_wrap">..</div><h2 class="us_label"><dl class="tc_list"><dt class = "merit">merit</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "disadvantages">disadvantages</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "df_tit">wish</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd></dl></div>'

merit = []
disadv = []
tit = []

soup = BeautifulSoup(html, 'html.parser')
dl_list = soup.find('dl', class_ = 'tc_list')
for dd in dl_list.find_all('dd',{'class':'df1'}):
    # previous_sibling is the <dt> here only because the tags are adjacent in the string
    if dd.previous_sibling:
        if 'merit' in dd.previous_sibling.get('class'):
            merit.append(dd.text)
        elif 'disadvantages' in dd.previous_sibling.get('class'):
            disadv.append(dd.text)
        elif 'df_tit' in dd.previous_sibling.get('class'):
            tit.append(dd.text)

print(merit)
print(disadv)
print(tit)
Result:
[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
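
On a real page the tags are usually separated by newlines, as in the HTML shown in the question. In that case previous_sibling returns the whitespace text node rather than the dt, and calling .get('class') on it fails. The following is a minimal sketch of one way around that, assuming the same dl/dt/dd layout. The function name extract_sections, the HTML constant, and the placeholder review texts are made up for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup

# Pretty-printed sample with newlines between tags; the texts are placeholders.
HTML = """
<div class="content_body_ty1">
  <dl class="tc_list">
    <dt class="merit">merit</dt>
    <dd class="df1"><span> pros text </span></dd>
    <dt class="disadvantages">disadvantages</dt>
    <dd class="df1"><span> cons text </span></dd>
    <dt class="df_tit">wish</dt>
    <dd class="df1"><span> wish text </span></dd>
  </dl>
</div>
"""


def extract_sections(html):
    """Group each <dd class="df1"> text under the class of the <dt> before it."""
    sections = {"merit": [], "disadvantages": [], "df_tit": []}
    soup = BeautifulSoup(html, "html.parser")
    for dl in soup.find_all("dl", class_="tc_list"):
        for dd in dl.find_all("dd", class_="df1"):
            # find_previous_sibling skips text nodes and returns the nearest <dt>.
            dt = dd.find_previous_sibling("dt")
            if dt is None:
                continue
            for name in dt.get("class", []):
                if name in sections:
                    sections[name].append(dd.get_text(strip=True))
    return sections


result = extract_sections(HTML)
print(result)
```

This also shows why the loop in the question put everything into one list: entry.find('dt', class_ = 'merit') is truthy for any dl that contains a merit dt, so only the first branch ever runs, and entry.find('dd', class_ = 'df1') always returns the first dd. Iterating over the dd tags and checking the dt that precedes each one avoids both problems.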