python - BeautifulSoup: get the text from one tag but ignore the text in another tag
Problem description
I have some text that looks like this:
<item>
<title>What Music Do You Build Robots to?</title>
<dc:creator><![CDATA[@TaranMayer TaranMayer ]]></dc:creator>
<description><![CDATA[ <aside class="quote no-group" data-username="DanMantz" data-post="34" data-topic="84065" data-full="true">
<div class="title">
<div class="quote-controls"></div>
<img alt="" width="20" height="20" src="https://www.vexforum.com/user_avatar/www.vexforum.com/danmantz/40/2285_2.png" class="avatar"> DanMantz:</div>
<blockquote>
<p>Classic Rock and Motown. I didn’t even consider that there are other options… <img src="https://www.vexforum.com/images/emoji/apple/slight_smile.png?v=9" title=":slight_smile:" class="emoji" alt=":slight_smile:"></p>
</blockquote>
</aside>
<p>This implies that you do indeed build robots. May we see some of your creations?</p> ]]></description>
<link>https://www.vexforum.com/t/what-music-do-you-build-robots-to/84065/35</link>
<pubDate>Wed, 02 Sep 2020 17:24:19 +0000</pubDate>
<guid isPermaLink="false">www.vexforum.com-post-669073</guid>
</item>
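Using bs4, I want to get the text of everything inside the <description> tag, except for the content of the <blockquote> tag. I want to get this:
This implies that you do indeed build robots. May we see some of your creations?
How can I do this? I have searched for help but could not find what I need.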
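Solution
The content of <description> is wrapped in a CDATA section, so it has to be parsed as a separate document first. After that, you can use the .extract() method to remove the quoted <aside> (which contains the <blockquote>) and read the remaining text: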
from bs4 import BeautifulSoup, CData
txt = """<item>
<title>What Music Do You Build Robots to?</title>
<dc:creator><![CDATA[@TaranMayer TaranMayer ]]></dc:creator>
<description><![CDATA[ <aside class="quote no-group" data-username="DanMantz" data-post="34" data-topic="84065" data-full="true">
<div class="title">
<div class="quote-controls"></div>
<img alt="" width="20" height="20" src="https://www.vexforum.com/user_avatar/www.vexforum.com/danmantz/40/2285_2.png" class="avatar"> DanMantz:</div>
<blockquote>
<p>Classic Rock and Motown. I didn’t even consider that there are other options… <img src="https://www.vexforum.com/images/emoji/apple/slight_smile.png?v=9" title=":slight_smile:" class="emoji" alt=":slight_smile:"></p>
</blockquote>
</aside>
<p>This implies that you do indeed build robots. May we see some of your creations?</p> ]]></description>
<link>https://www.vexforum.com/t/what-music-do-you-build-robots-to/84065/35</link>
<pubDate>Wed, 02 Sep 2020 17:24:19 +0000</pubDate>
<guid isPermaLink="false">www.vexforum.com-post-669073</guid>
</item>"""
# Parse the whole item
soup = BeautifulSoup(txt, "html.parser")
# Find the CDATA node inside <description>
desc = soup.find("description").find_next(string=lambda t: isinstance(t, CData))
# Parse the CDATA content as its own HTML fragment
desc = BeautifulSoup(desc, "html.parser")
# Remove the quoted reply (the <aside> wraps the <blockquote>)
for a in desc.select("aside"):
    a.extract()
# Print the remaining text
print(desc.text.strip())
Prints:
This implies that you do indeed build robots. May we see some of your creations?
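The answer removes <aside> rather than <blockquote>, because the quoted author's name sits in the <aside> next to the <blockquote>. A minimal sketch of that difference follows, assuming the CDATA content has already been pulled out as a plain HTML string. The helper name text_without and the shortened html sample are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def text_without(html, selector):
    """Return the text of an HTML fragment, skipping elements that match a CSS selector."""
    fragment = BeautifulSoup(html, "html.parser")
    for tag in fragment.select(selector):
        # extract() detaches the tag and everything inside it from the tree
        tag.extract()
    # Join the remaining text nodes with single spaces, dropping blank ones
    return fragment.get_text(" ", strip=True)


# Shortened, hypothetical version of the <description> content from the question
html = (
    '<aside class="quote"><div class="title">DanMantz:</div>'
    "<blockquote><p>Classic Rock and Motown.</p></blockquote></aside>"
    "<p>May we see some of your creations?</p>"
)

print(text_without(html, "blockquote"))  # the quoted author's name is still included
print(text_without(html, "aside"))       # only the reply text remains
```

Passing "blockquote" leaves the "DanMantz:" label in the result, while passing "aside" drops the whole quoted block, which is the output asked for in the question.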