Removing embedded tweets from HTML when using Newspaper3k

Problem description

I am using Newspaper3k to extract text from online news articles.

from newspaper import Article

urlw = 'https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959'
article = Article(urlw)
article.download()
article.parse()
string1 = article.text

However, the extracted text contains several embedded tweets that I don't need for my analysis. I tried to identify them as follows.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]

However, I can't figure out a way to remove them from string1.

Tags: python-3.x, string, replace

Solution


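Use Beautiful Soup to remove the unwanted HTML tags: find each tag that wraps an embedded tweet and call extract() on it. Then use the soup object to locate the article content.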

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
r.raise_for_status() # check for 4xx + 5xx status code
soup = BeautifulSoup(r.text, "html.parser")

# Embedded tweets are wrapped in <div class="element-oembed"> on this page.
for tweet in soup.find_all('div', class_='element-oembed'):
    tweet.extract()  # remove the div and everything inside it from the tree

article_tag = soup.find(id='article-content')
print(article_tag.text.strip())

Output:

'Traffic is backed up for about 9km after an incident near Spaghetti Junction.  The incident happened about 12pm at the Southern Motorway link to the Northwestern Motorway, westbound.  Drivers were asked to avoid the area and consider using an alternative route.     The New Zealand Transport Agency said at 1.25pm the road had reopened but traffic remained heavy between Penrose and the State Highway 1 link - a journey of about 9km.   Advertisement   Advertise with NZME.     "Consider delaying your journey if possible, or be prepared for delays."  Police have cordoned off a section of footpath on Alex Evans Road, above the motorway, in relation to the incident.'
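
If you would rather keep using Newspaper3k for the final text, one option is to strip the tweet wrappers from the HTML first and hand the cleaned markup to Newspaper3k afterwards. The following is only a minimal sketch of the stripping step, run on a small made-up HTML snippet instead of the live page. The helper name strip_embedded_tweets and the sample markup are my own illustrations, and the class name 'element-oembed' is taken from the code above, so it may differ on other sites.

```python
from bs4 import BeautifulSoup


def strip_embedded_tweets(html: str, css_class: str = "element-oembed") -> str:
    """Return the HTML with every <div class=css_class> removed.

    Hypothetical helper for illustration; the default class name is the one
    used on the page in the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")
    for embed in soup.find_all("div", class_=css_class):
        embed.decompose()  # drop the tag and everything inside it
    return str(soup)


# Made-up snippet standing in for the downloaded article HTML.
sample_html = (
    '<div id="article-content">'
    "<p>Traffic is backed up.</p>"
    '<div class="element-oembed"><p dir="ltr">A tweet</p></div>'
    "<p>The road has reopened.</p>"
    "</div>"
)

cleaned_html = strip_embedded_tweets(sample_html)
print(cleaned_html)
```

As far as I know, Newspaper3k's Article.download() accepts an input_html argument, so the cleaned string could then be passed as article.download(input_html=cleaned_html) before calling article.parse(), and article.text would be built from the tweet-free markup. I have not verified this against every Newspaper3k version, so treat it as an assumption to check.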
