How do I use BeautifulSoup to extract the content of each df1 (merit, disadvantages, df_tit)?

Question

I have a question about extracting tags. This is the structure of the HTML.

<div class="content_body_ty1">
  <div class="us_label_wrap">..</div>
    <h2 class="us_label">
  <dl class="tc_list">
    <dt class = "merit">merit</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "disadvantages">disadvantages</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>
    <dt class = "df_tit">wish</dt>
    <dd class = "df1">
        <span> blah~~~~~~~~blah~~~~</span>
    </dd>

I want to extract the tag contents with a for loop and then put each element into a list: 1) "merit", blah~~~~ 2) "disadvantages", blah~~~~ 3) "df_tit", blah~~~~

Here is my code:

maximum = 3
merit = [] 
disadv = []
tit = []
for page_number in range(1, maximum+1):
    URL = 'https://www.example.co.kr/companies/reviews/page={}'.format(page_number) 
    response = client.get(URL)
    print(page_number)
    whole_source = response.content.decode('utf-8')
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.find_all('dl', class_ = 'tc_list'): 
        if entry.find('dt', class_ = 'merit'):
            merit.append(entry.find('dd', class_ = 'df1')) 
        elif entry.find('dt', class_ = 'disadvantage'):
            disadv.append(entry.find('dd', class_ = 'df1'))
        elif entry.find('dt', class_ = 'df_tit'):
            tit.append(entry.find('dd', class_ = 'df1'))

How can I extract the tag contents? Please take a look at this problem. Thanks!

Tags: html, python-3.x, for-loop, beautifulsoup

Solution


You can find the dd tags and use previous_sibling to check which category each one belongs to.
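As a quick illustration of what previous_sibling returns, the sketch below parses two tiny made-up snippets (the variable names and the text "good" are placeholders, not from the question). When the markup has no whitespace between the tags, previous_sibling is the dt tag itself; when the markup is indented, it is a whitespace text node, and find_previous_sibling('dt') is one way to skip over it.

```python
from bs4 import BeautifulSoup

# Compact markup: the <dd> directly follows its <dt>.
compact = '<dl><dt class="merit">merit</dt><dd class="df1">good</dd></dl>'
# Indented markup: a newline text node sits between <dt> and <dd>.
indented = '<dl>\n<dt class="merit">merit</dt>\n<dd class="df1">good</dd>\n</dl>'

dd_compact = BeautifulSoup(compact, 'html.parser').find('dd')
dd_indented = BeautifulSoup(indented, 'html.parser').find('dd')

# Here the previous sibling is the <dt> tag.
print(dd_compact.previous_sibling.name)
# Here the previous sibling is the newline between the tags.
print(repr(dd_indented.previous_sibling))
# find_previous_sibling('dt') skips text nodes and returns the nearest <dt>.
print(dd_indented.find_previous_sibling('dt').get('class'))
```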

See the full code below:

from bs4 import BeautifulSoup


html = '<div class="content_body_ty1"><div class="us_label_wrap">..</div><h2 class="us_label"><dl class="tc_list"><dt class = "merit">merit</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "disadvantages">disadvantages</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd><dt class = "df_tit">wish</dt><dd class = "df1"><span> blah~~~~~~~~blah~~~~</span></dd></dl></div>'


# One result list per category
merit = []
disadv = []
tit = []
soup = BeautifulSoup(html, 'html.parser')

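# Take the <dl class="tc_list"> block and walk its <dd class="df1"> entries
dl_list = soup.find('dl', class_ = 'tc_list')
for dd in dl_list.find_all('dd',{'class':'df1'}):
    # With no whitespace between the tags, previous_sibling is the <dt> label
    if dd.previous_sibling:
        # 'class' is multi-valued, so get() returns a list of class names
        classes = dd.previous_sibling.get('class', [])
        if 'merit' in classes:
            merit.append(dd.text)
        elif 'disadvantages' in classes:
            disadv.append(dd.text)
        elif 'df_tit' in classes:
            tit.append(dd.text)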

print(merit)
print(disadv)
print(tit)

Result:

[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
[' blah~~~~~~~~blah~~~~']
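
If the real pages are indented like the HTML in the question, there will be whitespace text nodes between each dt and dd, and previous_sibling will return those instead of the dt. A minimal sketch of a variant that tolerates this, assuming the same class names as in the question, is shown below. The function name extract_reviews and the sample review texts are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_reviews(html):
    """Group the text of each <dd class="df1"> by the class of its preceding <dt>."""
    groups = {'merit': [], 'disadvantages': [], 'df_tit': []}
    soup = BeautifulSoup(html, 'html.parser')
    # A page may hold several review blocks, so visit every <dl class="tc_list">.
    for dl in soup.find_all('dl', class_='tc_list'):
        for dd in dl.find_all('dd', class_='df1'):
            # Skips whitespace text nodes and returns the nearest preceding <dt>.
            dt = dd.find_previous_sibling('dt')
            if dt is None:
                continue
            for name in dt.get('class', []):
                if name in groups:
                    groups[name].append(dd.get_text(strip=True))
    return groups


# Hypothetical sample, indented the way the question's HTML is.
sample = '''
<dl class="tc_list">
  <dt class="merit">merit</dt>
  <dd class="df1"><span> good pay </span></dd>
  <dt class="disadvantages">disadvantages</dt>
  <dd class="df1"><span> long hours </span></dd>
  <dt class="df_tit">wish</dt>
  <dd class="df1"><span> more remote days </span></dd>
</dl>
'''
result = extract_reviews(sample)
print(result)
```

In the paginated loop from the question, the same function could presumably be called once per page on the decoded response body, extending the three lists with each page's results.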
