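python - Web scraping with BeautifulSoup: find the text inside a span within a td, ignoring child spans
Problem description
I'm trying to scrape a website for some information, but I've run into trouble.
A sample HTML file: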
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<form>
<table>
<tbody>
<tr id="dontMatter"></tr>
<tr id="td_important_id_1">
<div class="dontCare"></div>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 1"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 2"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 3"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 4"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
</tr>
</tbody>
</table>
</form>
</body>
</html>
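Essentially, I want every "Text That I want #" string, without the text of the child spans.
I tried to filter by the element with id "td_important_id_1" and its child spans with class "important_class_1", and then get the text inside each of those spans without any of the text from their child spans.
What I have now is: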
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='path to driver')
driver.get('website_link')
soup = BeautifulSoup(driver.page_source, features="html.parser")
for item in soup.find("td", {"id" : "td_important_id_1"}).find_all("span", {"class" : "important_class_1"}, recursive=False):
    print(item.text)
driver.quit()
But this gives me garbage. If anyone can help, that would be great.
Solution
Here is another solution, for reference only.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<tr id="dontMatter"></tr>
<tr id="td_important_id_1">
<div class="dontCare"></div>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 1"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 2"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
<span class="garbagbe2">Text that I dont want</span>
<span class="garbagbe3">Text that I dont want</span>
</div>
</span>
</tr>
'''
doc = SimplifiedDoc(html)
items = doc.selects('tr#td_important_id_1>span.important_class_1')
for item in items:
    print(item.input.nextText())
    print([s.text for s in item.selects('div.label>span')])
Result:
"Text That I want 1"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
"Text That I want 2"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
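For comparison, the same idea can probably be expressed with BeautifulSoup alone. The following is only a sketch under a few assumptions: the HTML is a shortened copy of the sample above, the wanted text is always the text node directly after the `<input>` inside each span, and the id sits on the `tr` as in the sample (the code in the question searches for a `td`, which the sample does not contain). `item.text` returns garbage because it joins the text of every descendant, including the child spans; reading only the sibling text node after the checkbox avoids that.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the sample markup from the question (assumption:
# the real page has the same shape).
html = '''
<tr id="td_important_id_1">
<div class="dontCare"></div>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 1"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
<span class="garbagbe1">Text that I dont want</span>
</div>
</span>
<span onClick="blah" class="important_class_1">
::before
<input type="checkBox" name="">
"Text That I want 2"
<div class="label">
<span class="garbagbe">Text that I dont want</span>
</div>
</span>
</tr>
'''

soup = BeautifulSoup(html, "html.parser")

wanted = []
# Select the spans with the important class anywhere under the row.
for span in soup.select("tr#td_important_id_1 span.important_class_1"):
    checkbox = span.find("input")
    if checkbox is None:
        continue  # span without a checkbox: nothing to read
    # The node right after the <input> is the bare text we are after.
    node = checkbox.next_sibling
    if isinstance(node, NavigableString):
        wanted.append(node.strip())

print(wanted)
```

The quotation marks are part of the text in the sample, so they are kept here; `node.strip().strip('"')` would drop them. With a live page, `html` would be replaced by `driver.page_source`, and the selector may need adjusting if the browser's DOM differs from the sample.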