Web scraping with BeautifulSoup: finding text inside a span within a td, ignoring child spans

Problem description

I am trying to scrape a website for some information, but I have run into trouble.

A sample HTML file:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <form>
        <table>
            <tbody>
                <tr id="dontMatter"></tr>
                <tr id="td_important_id_1">
                    <div class="dontCare"></div>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 1"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 2"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 3"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 4"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                </tr>
            </tbody>

        </table>
    </form>
</body>
</html>

Essentially, I want every "Text That I want #" string, but none of the text from the child spans.

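I tried filtering for the element with id "td_important_id_1", then for its child spans with class "important_class_1", and reading the text inside each of those spans without the text of any nested spans.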

What I have so far:

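from bs4 import BeautifulSoup
from selenium import webdriver

# Render the page in Chrome and hand the resulting HTML to BeautifulSoup.
driver = webdriver.Chrome(executable_path='path to driver')
driver.get('website_link')
soup = BeautifulSoup(driver.page_source, features="html.parser")

# Select the direct child spans of the element with the given id.
# Note: item.text concatenates the text of every descendant, nested spans included.
for item in soup.find("td", {"id" : "td_important_id_1"}).find_all("span", {"class" : "important_class_1"}, recursive=False):
    print(item.text)

driver.quit()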

But this gives me garbage along with the text I want. If anyone could help, that would be great.

Tags: python, html, python-3.x, web-scraping, beautifulsoup

Solution


Here is another solution, for reference only.

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<tr id="dontMatter"></tr>
<tr id="td_important_id_1">
    <div class="dontCare"></div>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 1"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
            <span class="garbagbe2">Text that I dont want</span>
            <span class="garbagbe3">Text that I dont want</span>
        </div>
    </span>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 2"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
            <span class="garbagbe2">Text that I dont want</span>
            <span class="garbagbe3">Text that I dont want</span>
        </div>
    </span>
</tr>
'''
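doc = SimplifiedDoc(html)
# Direct child spans of the row that carry the class of interest.
items = doc.selects('tr#td_important_id_1>span.important_class_1')
for item in items:
    # Text that follows the <input>: the wanted text.
    print(item.input.nextText())
    # Text of the nested spans, shown only for contrast.
    print([s.text for s in item.selects('div.label>span')])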

Result:

"Text That I want 1"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
"Text That I want 2"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
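
For comparison, here is a minimal sketch of the same extraction using only BeautifulSoup, the library from the question. It assumes the markup looks like the sample above: the id sits on the `tr`, and the wanted text is the text node immediately after the `input`. It also assumes `html.parser`, which keeps the `span` elements as direct children of the `tr`; other parsers may restructure this invalid table markup. The helper name `wanted_texts` and the trimmed `HTML` string are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample markup (assumption: the real page has the same shape).
HTML = '''
<tr id="td_important_id_1">
    <div class="dontCare"></div>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 1"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
        </div>
    </span>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 2"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
        </div>
    </span>
</tr>
'''


def wanted_texts(html):
    """Return the text node that directly follows the <input> in each target span."""
    soup = BeautifulSoup(html, "html.parser")
    # The id is on a <tr> in the sample, so searching for a <td> would return None.
    row = soup.find("tr", id="td_important_id_1")
    results = []
    # recursive=False restricts the search to direct children of the row.
    for span in row.find_all("span", class_="important_class_1", recursive=False):
        checkbox = span.find("input")
        node = checkbox.next_sibling if checkbox else None
        # Keep only plain text nodes; this skips the nested <div> and its spans.
        if isinstance(node, NavigableString):
            results.append(node.strip())
    return results


print(wanted_texts(HTML))
```

On the trimmed sample this prints both wanted strings, quotes included, because the quotes are part of the text in the HTML. Reading `next_sibling` of the `input` rather than `item.text` is what avoids the garbage: `.text` joins the text of every descendant, while a sibling text node belongs to the outer span only.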
