How do I get the text of all elements in a given HTML document in Python 3?

Problem description

How can I extract the text of every element from the following HTML:

from bs4 import BeautifulSoup


html3 = """
<div class="tab-cell l1">
    <span class="cyan-90">***</span>
    <h2 class="white-80">
        <a class="k-link" href="#" title="Jump">Jump</a>
    </h2>
    <h3 class="black-70">
        <span>Red</span>
        <span class="black-50">lock</span>
    </h3>
    <div class="l-block">
        <a class="lang-menu" href="#">A</a>
        <a class="lang-menu" href="#">B</a>
        <a class="lang-menu" href="#">C</a>
    </div>
    <div class="black-50">
        <div class="p-bold">Period</div>
        <div class="tab--cell">$</div><div class="white-90">Method</div>
        <div class="tab--cell">$</div><div class="tab--cell">Type</div>
    </div>
</div>
"""

soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            if des.find(class_='k-link'):
                print(des.a.string)
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)

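I only get the text of the first link; after that I get nothing. I want to walk through the elements one by one and pull out whatever I need. If anyone has any ideas, please let me know.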

Tags: python-3.x, web-scraping, beautifulsoup, python-requests

Solution


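Your own `if` conditions are what keep you from getting everything. You only print in the two cases guarded by the `class_=...` checks, so every element that matches neither check is silently skipped. Add a fallback that reports the rest: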

# html3 = see above 

from bs4 import BeautifulSoup
import lxml 

soup = BeautifulSoup(html3, "lxml")
if soup.find('div', attrs={'class': 'tab-cell l1'}):
    div_descendants = soup.div.descendants
    for des in div_descendants:
        if des.name is not None:
            print(des.name)
            found = False
            if des.find(class_='k-link'):
                print(des.a.string)
                found = True
            if des.find(class_='black-70'):
                print('span')
                print(des.span.text)
                found = True
            # find all others that are not already reported:
            if not found:
                print(f"Other {des.name}: {des.string}")

Output:

span
Other span: ***
h2
Jump
a
Other a: Jump
h3
Other h3: None
span
Other span: Red
span
Other span: lock
div
Other div: None
a
Other a: A
a
Other a: B
a
Other a: C
div
Other div: None 
div
Other div: Period
div
Other div: $
div
Other div: Method
div
Other div: $
div
Other div: Type
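
If the goal is simply the text of every element, the class checks can be dropped altogether. The following is a minimal sketch, not the original answer's code: `sample_html` is a shortened stand-in for `html3`, `element_texts` is a hypothetical helper name, and the built-in `"html.parser"` is assumed in place of `"lxml"` so that the example needs no extra install.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical stand-in for html3 from the question.
sample_html = """
<div class="tab-cell l1">
    <span class="cyan-90">***</span>
    <h2 class="white-80"><a class="k-link" href="#" title="Jump">Jump</a></h2>
    <h3 class="black-70">
        <span>Red</span>
        <span class="black-50">lock</span>
    </h3>
</div>
"""


def element_texts(markup, parser="html.parser"):
    """Return (tag name, text) pairs for each tag that directly holds text."""
    soup = BeautifulSoup(markup, parser)
    pairs = []
    for tag in soup.find_all(True):  # True matches every tag
        # Keep only the strings that are direct children of this tag,
        # so nested text is reported once, by the tag that owns it.
        own_text = "".join(
            child for child in tag.children if isinstance(child, NavigableString)
        ).strip()
        if own_text:
            pairs.append((tag.name, own_text))
    return pairs


# Option 1: every non-empty text fragment, in document order.
soup = BeautifulSoup(sample_html, "html.parser")
all_texts = list(soup.stripped_strings)
print(all_texts)

# Option 2: the same text, labelled with the tag that contains it.
for name, text in element_texts(sample_html):
    print(f"{name}: {text}")
```

`stripped_strings` is enough when only the text matters; the helper is useful when you also need to know which tag each piece of text came from.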
