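python - How to exclude paragraphs that have a specific tag as a child when scraping
Problem description
I am trying to get all the paragraphs of an article using BeautifulSoup and exclude paragraph tags that contain another tag, for example an <a> tag; or, if they do have a tag as a child, get only the text of the paragraph.
Here is part of the HTML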
<div class="entry-content clearfix">
<div class="entry-thumbnail>
<p> In as name to here them deny wise this. As rapid woody my he me which. </p>
<p> <a href="https://blabla"/> </p>
<p> Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment. </p>
<p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.</p>
</div>
This is what I have done so far
contents = []
content = soup.find('div', { "class": "entry-content clearfix"}).find_all("p")
for p in content:
    if not (p.find(findChildren("a"))):
        contents[p] = content
if (content):
    dic['content'] = content
else:
    print("ARTICLE:", i, "HAS NO content")
    dic['body'] = "No content"
Solution
Use the get_text() function. It extracts the text from the paragraphs and drops the tags. Reference: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
from bs4 import BeautifulSoup
contents = """<div class="entry-content clearfix">
<div class="entry-thumbnail>
<p> In as name to here them deny wise this. As rapid woody my he me which. </p>
<p> <a href="https://blabla"/> </p>
<p> Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment. </p>
<p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.</p>
</div>"""
soup = BeautifulSoup(contents, "lxml")
print(soup.get_text())
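Result (the first paragraph is missing, apparently because the unclosed quote in class="entry-thumbnail> swallows it during parsing):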
Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment.
When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.
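The snippet above prints all the text in the soup at once. If the goal is a list with one entry per paragraph, either skipping paragraphs that contain an <a> child or keeping only their text, something along these lines might work. This is only a sketch: the function name paragraph_texts and the skip_linked flag are made up for illustration, the sample HTML is a shortened, well-formed variant of the one in the question, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed variant of the question's HTML (an assumption:
# the stray quote is fixed and the sentences are abbreviated).
html = """
<div class="entry-content clearfix">
<p> In as name to here them deny wise this. </p>
<p> <a href="https://blabla"></a> </p>
<p> Performed suspicion in certainty. </p>
<p> <a href="https://blabla"></a> When be draw drew ye. </p>
</div>
"""


def paragraph_texts(html, skip_linked=False):
    """Return the non-empty text of each <p> inside the entry-content div.

    Hypothetical helper: with skip_linked=True, paragraphs that contain
    an <a> tag anywhere inside them are left out entirely.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="entry-content")
    if container is None:
        return []

    texts = []
    for p in container.find_all("p"):
        # Optionally drop any paragraph that has a link inside it.
        if skip_linked and p.find("a") is not None:
            continue
        # get_text() keeps only the text nodes; strip=True trims whitespace.
        text = p.get_text(" ", strip=True)
        if text:  # ignore paragraphs that hold nothing but a link
            texts.append(text)
    return texts


print(paragraph_texts(html))
print(paragraph_texts(html, skip_linked=True))
```

With the default skip_linked=False, a paragraph that holds only a link is dropped because its text is empty, while a paragraph with a link plus text keeps just the text. With skip_linked=True, every paragraph containing an <a> is excluded.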