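python - How to extract text without HTML tags using Python
Problem description
How can I extract each sentence without the HTML tags and then add the sentences to lists?
For example: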
without_bracket = ['Jomi Jomi, okuroro ni i soni da', 'Joosua, ajooko bi eni wogbe.', ...]
with_bracket = ["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', ...]
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>
Solution
Try something like this:
from bs4 import BeautifulSoup
import re
html = """
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>
"""
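soup = BeautifulSoup(html, 'html.parser')
# Take the text of the first <div>; the <br /> tags drop out and the line breaks remain.
text = soup.find('div').text.rstrip()
# Capture everything between "(" and the next ")": the English translations.
with_bracket = re.findall(r'\(([^)]+)', text)
print(with_bracket)
# Delete the parenthesised parts, then split on the leading "-" of each line.
# Note: this also splits on any hyphen that appears inside a sentence.
without_bracket = re.sub(r'\([^)]*\)', '', text)
without_bracket = without_bracket.split('-')
without_bracket = [s.rstrip() for s in without_bracket]
# The text before the first "-" is only whitespace, which rstrip() reduced to ''.
without_bracket.remove('')
print(without_bracket)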
Result:
["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', ' The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', " what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', ' there is no stone like potash, it is matchless.', ' The aeroplane has no business with a bad road']
[' Jomi Jomi, okuroro ni i soni da.. .', ' Joosua, ajooko bi eni wogbe. .', ' Ka gbekun yile, kii se egbe aja laelae .', 'Kaka ko san fun alajapa, pipa lori igun n pa. .', ' Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade..", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara..']
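The second list above still carries leading spaces and stray trailing periods, and `split('-')` would also break a sentence that contains a hyphen. A minimal sketch of a line-by-line alternative follows. It works on the plain text already obtained from `soup.find('div').text`, so it needs only the standard library. The helper name `split_proverbs`, the pattern `LINE_RE`, and the decision to strip trailing periods are assumptions made for illustration, not part of the original answer; the sample text is three lines taken from the question.

```python
import re

# Hypothetical pattern: "-", the proverb, then a parenthesised translation,
# optionally followed by periods or whitespace up to the end of the line.
LINE_RE = re.compile(
    r"^\s*-\s*(?P<proverb>.*?)\s*\(\s*(?P<meaning>[^)]*?)\s*\)[\s.]*$"
)


def split_proverbs(text):
    """Return (without_bracket, with_bracket) lists, one entry per matching line."""
    without_bracket, with_bracket = [], []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank lines and lines without "- ... ( ... )" are skipped
        # Assumption: trailing periods and spaces are noise and can be dropped.
        without_bracket.append(match.group("proverb").rstrip(" ."))
        with_bracket.append(match.group("meaning").rstrip(" ."))
    return without_bracket, with_bracket


# Three lines from the question, as they look after the tags are removed.
sample = """
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)
"""

without_bracket, with_bracket = split_proverbs(sample)
print(without_bracket)
print(with_bracket)
```

Because each line is matched as a whole, a hyphen inside a proverb no longer splits it in two. The pattern assumes one proverb per line with no nested parentheses in the translation; lines that do not fit are silently skipped, which may or may not be what you want.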