python - Scraping news sites from a list of links in a JSON file only returns the first page
Problem Description
I am scraping news sites with the Python libraries newspaper and feedparser, following a tutorial I found (link).
It reads the links to process from a JSON file and then fetches articles from them. The problem is that it only gets articles from the first page of each site, without iterating on to the second, third, and so on. So I wrote a script that populates the JSON file with the first 50 pages of a site, e.g. www.site.com/page/x (a sketch of such a generator follows the example below):
{
    "site0" : { "link" : "https://sitedotcom/page/0/" },
    "site1" : { "link" : "https://sitedotcom/page/1/" },
    "site2" : { "link" : "https://sitedotcom/page/2/" },
    etc
}
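For reference, here is a minimal sketch of that generator script (the base URL is a placeholder with the same shape as above; the output file name matches the scraper below):

import json

BASE = "https://sitedotcom/page/{}/"  # placeholder base URL
PAGES = 50  # first 50 pages of the site

sites = {"site{}".format(i): {"link": BASE.format(i)} for i in range(PAGES)}

with open('thingie2.json', 'w') as f:
    json.dump(sites, f, indent=4)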
import csv
import json
from datetime import datetime
from time import mktime

import feedparser as fp
import newspaper
from newspaper import Article

# Set the limit for the number of articles to download
LIMIT = 1000000000

articles_array = []
data = {}
data['newspapers'] = {}

# Load the JSON file with the news sites
with open('thingie2.json') as data_file:
    companies = json.load(data_file)

count = 1
# Iterate through each news company
for company, value in companies.items():
    # If an RSS link is provided in the JSON file, it is the first choice,
    # because RSS feeds often give more consistent and correct data.
    # RSS (Rich Site Summary; originally RDF Site Summary; often called
    # Really Simple Syndication) is a type of web feed which allows users to
    # access updates to online content in a standardized, computer-readable
    # format. If you do not want to scrape from the RSS feed, just leave the
    # rss attribute empty in the JSON file.
    if 'rss' in value:
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Skip articles without a publish date. This keeps the data
            # consistent and keeps the script from crashing.
            if hasattr(entry, 'published'):
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download fails for some reason (e.g. a 404),
                    # continue with the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['text'] = content.text
                article['authors'] = content.authors
                article['top_image'] = content.top_image
                article['movies'] = content.movies
                newsPaper['articles'].append(article)
                articles_array.append(article)
                print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
    else:
        # Fallback if no RSS feed link is provided: use the newspaper
        # library to discover and extract articles from the site itself.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, an article without a publish date could
            # be skipped here; after 10 articles from the same newspaper
            # without a publish date, the company could be skipped entirely.
            article = {}
            article['title'] = content.title
            article['authors'] = content.authors
            article['text'] = content.text
            article['top_image'] = content.top_image
            article['movies'] = content.movies
            article['link'] = content.url
            article['published'] = content.publish_date
            newsPaper['articles'].append(article)
            articles_array.append(article)
            print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            # noneTypeCount = 0
    count = 1
    data['newspapers'][company] = newsPaper

# Finally, save the articles as a CSV file.
try:
    with open('Scraped_data_news_output2.csv', 'w', encoding='utf-8', newline='') as out_file:
        f = csv.writer(out_file)
        f.writerow(['Title', 'Authors', 'Text', 'Image', 'Videos', 'Link', 'Published_Date'])
        # Write one row per scraped article.
        for article in articles_array:
            f.writerow([article['title'], article['authors'], article['text'],
                        article['top_image'], article['movies'],
                        article['link'], article['published']])
except Exception as e:
    print(e)
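One thing worth noting: the script builds data['newspapers'] per company but never writes that structure anywhere. If the JSON output is wanted as well, something along these lines would do it (the output file name is a placeholder; default=str covers the datetime publish dates from the fallback branch):

with open('scraped_articles.json', 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file, indent=4, default=str)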
Navigating to these pages in my browser shows older, distinct articles as expected. But when I run the script over them, it returns the same articles regardless of the page number. Is there something I am doing wrong or have not considered?
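One way to narrow this down (a diagnostic sketch, separate from the script above; the URLs are placeholders): build two different page URLs with newspaper and compare the article URLs it discovers. If the two sets are identical, newspaper.build is resolving both pages to the same set of links regardless of the page number in the URL:

import newspaper

# Two different pages of the same site (placeholder URLs).
paper1 = newspaper.build("https://sitedotcom/page/1/", memoize_articles=False)
paper2 = newspaper.build("https://sitedotcom/page/2/", memoize_articles=False)

set1 = {a.url for a in paper1.articles}
set2 = {a.url for a in paper2.articles}

# Identical sets mean the page number has no effect on what gets discovered.
print(len(set1), len(set2), len(set1 & set2))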
Solution