python-3.x - Removing embedded tweets from HTML when using Newspaper3k
Question
I am using Newspaper3k to extract text from online news articles.
from newspaper import Article
urlw = 'https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959'
article = Article(urlw)
article.download()
article.parse()
string1 = article.text
However, the text contains several embedded tweets that I don't need for my analysis. I tried to identify them as follows.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
However, I can't figure out how to remove them from string1.
Solution
Use Beautiful Soup to strip the unwanted HTML tags: find each tag and call extract() on it. Then use the same soup object to locate the article content.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
r.raise_for_status() # check for 4xx + 5xx status code
soup = BeautifulSoup(r.text, "html.parser")
for tweet in soup.find_all('div', class_='element-oembed'):
    tweet.extract() # remove div with class 'element-oembed'
articleTag = soup.find(id='article-content')
print(articleTag.text.strip())
Output:
'Traffic is backed up for about 9km after an incident near Spaghetti Junction. The incident happened about 12pm at the Southern Motorway link to the Northwestern Motorway, westbound. Drivers were asked to avoid the area and consider using an alternative route. The New Zealand Transport Agency said at 1.25pm the road had reopened but traffic remained heavy between Penrose and the State Highway 1 link - a journey of about 9km. Advertisement Advertise with NZME. "Consider delaying your journey if possible, or be prepared for delays." Police have cordoned off a section of footpath on Alex Evans Road, above the motorway, in relation to the incident.'
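The same idea can be tried offline. What follows is a minimal sketch, not a definitive implementation: SAMPLE_HTML is made-up, simplified markup standing in for the real page, and the helper name article_text_without_embeds is hypothetical. Only the class 'element-oembed' and the id 'article-content' come from the answer above. It uses decompose() instead of extract(), since the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real article page.
SAMPLE_HTML = """
<div id="article-content">
  <p>Traffic is backed up after an incident.</p>
  <div class="element-oembed">
    <blockquote class="twitter-tweet"><p dir="ltr">Embedded tweet text</p></blockquote>
  </div>
  <p>Drivers were asked to avoid the area.</p>
</div>
"""


def article_text_without_embeds(html, embed_class="element-oembed",
                                content_id="article-content"):
    """Return the article text with embedded-tweet containers removed."""
    soup = BeautifulSoup(html, "html.parser")
    for embed in soup.find_all("div", class_=embed_class):
        embed.decompose()  # drop the embed and everything nested inside it
    content = soup.find(id=content_id)
    if content is None:
        return ""  # page layout differs from what this sketch assumes
    # Join the remaining paragraphs with single spaces.
    return " ".join(p.get_text(strip=True) for p in content.find_all("p"))


print(article_text_without_embeds(SAMPLE_HTML))
```

If you would rather keep Newspaper3k for the extraction itself, the cleaned markup (str(soup)) can probably be handed to it through article.download(input_html=...) before calling article.parse(). That is an assumption worth checking against your installed version.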