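python - How to slice and regroup text based on nested tags with BeautifulSoup?
Problem description
In the HTML below, I need to read all of the text in order and assemble a separate sentence for each span class.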
<label for="01">"The traveler, with his powerful "
<span class ="Wizard">"Storm"</span>
<span class ="Warrior">"Whirlwind"</span>
<span class ="Monk">"Prayer"</span>", took down the dark forces of evil. The "
<span class ="Wizard">"wizard"</span>
<span class ="Warrior">"warrior"</span>
<span class ="Monk">"monk"</span>" was exhausted afterwards and needed to take a rest."
</label>
In this case there should be 3 separate sentences, each paired with its class, in a list of lists, so the output would look like this:
[['Wizard', 'The traveler, with his powerful Storm, took down the dark forces of evil. The wizard was exhausted afterwards and needed to take a rest.'],
['Warrior', 'The traveler, with his powerful Whirlwind, took down the dark forces of evil. The warrior was exhausted afterwards and needed to take a rest.'],
['Monk', 'The traveler, with his powerful Prayer, took down the dark forces of evil. The monk was exhausted afterwards and needed to take a rest.']]
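I don't know how to approach this and couldn't find anything online, probably because I'm not sure how to phrase my question (if you have suggestions for phrasing it better, please comment and I will update it).
Thanks in advance!
Edit: I tried find(text=True) and find_all(text=True), but I couldn't work out where to go from there.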
Solution
You can use itertools.groupby:
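import bs4
from bs4 import BeautifulSoup as soup
from itertools import groupby
# content is assumed to hold the HTML string from the question
# group consecutive children of <label> into runs of text nodes (key True) and span tags (key False)
d = [(a, list(b)) for a, b in groupby(list(filter(lambda x:x != '\n', soup(content, 'html.parser').label.contents)), key=lambda x:isinstance(x, bs4.element.NavigableString))]
# users: the span runs transposed so that each tuple holds the spans of one class; _text: the shared text runs
users, _text = list(zip(*[b for a, b in d if not a])), [b for a, b in d if a]
# for each class, interleave the shared text runs with that class's spans, slicing off the surrounding quote characters
result = [[a[0]['class'][0], (lambda x:''.join(f'{j[1:-1]} {next(x).text[1:-1]}' if l < len(_text) - 1 else j[1:-2] for l, [j] in enumerate(_text)))(iter(a))] for a in users]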
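To see what the groupby step does in isolation, here is a minimal sketch on toy data (the list `items` is made up for illustration and is not part of the question). It splits a sequence into consecutive runs that share the same key, which is how the code above separates text nodes from span tags:

```python
from itertools import groupby

# Toy data: strings stand in for text nodes, integers for span tags.
items = ["a", "b", 1, 2, 3, "c", 4]

# groupby yields one (key, group) pair per run of consecutive items with the same key.
runs = [(is_text, list(group))
        for is_text, group in groupby(items, key=lambda x: isinstance(x, str))]

print(runs)
# [(True, ['a', 'b']), (False, [1, 2, 3]), (True, ['c']), (False, [4])]
```

Note that groupby only merges adjacent items, so the order of the original sequence is preserved.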
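The strings in result still contain some extra \n characters, which you can remove with re (a raw string is used for the pattern so that \s is not treated as a string escape):
import re
final_result = [[a, re.sub(r'"\n\s+', '', b)] for a, b in result]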
Output:
[['Wizard', 'The traveler, with his powerful Storm, took down the dark forces of evil. The wizard was exhausted afterwards and needed to take a rest.'],
['Warrior', 'The traveler, with his powerful Whirlwind, took down the dark forces of evil. The warrior was exhausted afterwards and needed to take a rest.'],
['Monk', 'The traveler, with his powerful Prayer, took down the dark forces of evil. The monk was exhausted afterwards and needed to take a rest.']]
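The one-liners above are dense, so as a more readable alternative here is a minimal sketch that walks the children of the label once per class. It is not the answer's method; it assumes the literal double quotes in the question's HTML are part of the text and should be dropped, and the helper names `unquote` and `sentences_by_class` are hypothetical:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''<label for="01">"The traveler, with his powerful "
<span class ="Wizard">"Storm"</span>
<span class ="Warrior">"Whirlwind"</span>
<span class ="Monk">"Prayer"</span>", took down the dark forces of evil. The "
<span class ="Wizard">"wizard"</span>
<span class ="Warrior">"warrior"</span>
<span class ="Monk">"monk"</span>" was exhausted afterwards and needed to take a rest."
</label>'''


def unquote(text):
    """Strip surrounding whitespace, then one pair of enclosing double quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return text


def sentences_by_class(html):
    label = BeautifulSoup(html, "html.parser").label

    # Collect the class names in order of first appearance.
    classes = []
    for span in label.find_all("span"):
        name = span["class"][0]
        if name not in classes:
            classes.append(name)

    result = []
    for name in classes:
        parts = []
        for node in label.children:
            if isinstance(node, NavigableString):
                # Shared text; whitespace-only nodes collapse to ''.
                parts.append(unquote(node))
            elif isinstance(node, Tag) and name in node.get("class", []):
                # Keep only the spans that belong to the current class.
                parts.append(unquote(node.get_text()))
        result.append([name, "".join(parts)])
    return result


final_result = sentences_by_class(html)
for row in final_result:
    print(row)
```

On the HTML from the question this produces the same three [class, sentence] pairs as the output above, without needing the regular-expression cleanup.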