How to extract text without HTML tags using Python

Problem description

How can I extract each sentence without the HTML tags and then add the sentences to lists?

For example, I would like to get these two lists:

without_bracket = ['Jomi Jomi, okuroro ni i soni da', 'Joosua, ajooko bi eni wogbe.', ...]

with_bracket = ["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush']

from this HTML:

<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>

Tags: python, web-scraping, beautifulsoup

Solution

Try something like this:

from bs4 import BeautifulSoup
import re

html = """
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div> 
       """
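soup = BeautifulSoup(html, 'html.parser')
# Text of the outer div with the tags removed
text = soup.find('div').text.rstrip()

# Capture whatever sits between '(' and the next ')'
with_bracket = re.findall(r'\(([^)]+)', text)
print(with_bracket)

# Drop the parenthesized parts, then split on the '-' that starts each entry
without_bracket = re.sub(r'\([^)]*\)', '', text)
without_bracket = without_bracket.split('-')
without_bracket = [s.rstrip() for s in without_bracket]
without_bracket.remove('')  # the piece before the first '-' is empty
print(without_bracket)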

Result:

["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', ' The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', " what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', ' there is no stone like potash, it is matchless.', ' The aeroplane has no business with a bad road']
[' Jomi Jomi, okuroro ni i soni da.. .', ' Joosua, ajooko bi eni wogbe. .', ' Ka gbekun yile, kii se egbe aja laelae .', 'Kaka ko san fun alajapa, pipa lori igun n pa. .', ' Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade..", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara..']
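The lists above still carry leading spaces and leftover " ." fragments where the parentheses were removed, and splitting on '-' would break any sentence that contains a hyphen. One possible refinement is to parse the text line by line with a single pattern. This is only a sketch: the `text` literal stands in for the string returned by `soup.find('div').text`, shortened to two entries, and the `ENTRY` pattern and `split_entries` helper are hypothetical names that assume every entry sits on its own line, starts with '-', and contains exactly one parenthesized translation.

```python
import re

# Stand-in for the string returned by soup.find('div').text (two entries only).
text = """
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)
"""

# Assumed line shape: "-", the sentence, "(translation)", optional trailing periods.
ENTRY = re.compile(
    r"^-\s*(?P<sentence>.*?)\s*\(\s*(?P<translation>[^)]*?)\s*\)[\s.]*$"
)


def split_entries(text):
    """Return (without_bracket, with_bracket) with surrounding whitespace removed."""
    without_bracket, with_bracket = [], []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:  # lines that do not fit the assumed shape are skipped
            without_bracket.append(match.group("sentence"))
            with_bracket.append(match.group("translation"))
    return without_bracket, with_bracket


without_bracket, with_bracket = split_entries(text)
print(without_bracket)
print(with_bracket)
```

Punctuation that belongs to the sentence itself (such as the ".." after "soni da") is kept. Only the whitespace around each part and the periods after the closing parenthesis are dropped.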
