python - How do I write scraped news headlines to a CSV file, and how do I append new records?
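Problem description
I'm new to Python and have built a web scraper that pulls new article headlines from CNN's top stories. When I print() the results, they come out one item per line. I'd like to write the results to a CSV file so that each headline is its own row. I'd also like an append version, so that each run adds to the file instead of overwriting it. The question is how to make the results look like this in the CSV file:
1) Headline 1 from the scraped data
2) Headline 2 from the scraped data
3) Headline 3 from the scraped data, and so on.
I've pasted my code below: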
from bs4 import BeautifulSoup
import requests
import csv

# Enter the website you wish to pull news articles from
res = requests.get('http://money.cnn.com/')
soup = BeautifulSoup(res.text, 'lxml')

# Find the <ul> class on the website by right-clicking and choosing "Inspect element"
news_box = soup.find('ul', {'class': '_6322dd28 ad271c3f'})

# Drill down into the <li> items; each should contain an <a>, which holds the headline of the article shown
all_news = news_box.find_all('a')

for news in all_news:
    test = (news.text)
    print(test)

with open('index.csv', 'w') as fobj:
    csvwriter = csv.writer(fobj, delimiter=',')
    for row in test:
        csvwriter.writerow(test)
Solution
You can use re.compile with BeautifulSoup.find_all:
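from bs4 import BeautifulSoup as soup
import requests, re
import csv

d = soup(requests.get('http://money.cnn.com/').text, 'html.parser')
# Take the text of every <span> whose class matches the pattern, drop empty strings, and skip the first two matches
articles = list(filter(None, [i.text for i in d.find_all('span', {'class': re.compile(r'^\w+ _\w+|^\w+$')})]))[2:]
# Mode 'a' appends to the file instead of overwriting it; newline='' prevents blank rows on Windows
with open('articles.csv', 'a', newline='') as f:
    write = csv.writer(f)
    # Wrap each headline in a one-element list so it becomes one row with one cell
    write.writerows([[i] for i in articles])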
Output:
What higher wages means for Domino's and McDonald's
'Jurassic World' sequel has big opening day amid a surging box office
Crying migrant girl: What the iconic photo says about press access
Chanel reveals earnings for the first time in its 108-year history
Why GE may need to stop paying its 119-year old dividend
A top Netflix executive is out after using the N-word
ZTE pays $1 billion fine to US over sanctions violations
Tariffs on European cars would hurt US auto jobs
Etsy sellers confront unknowns after Supreme Court ruling
Chipotle hopes quesadillas and milkshakes bring customers back
This group is getting ahead in America
OPEC strikes deal to increase oil production
Wall Street banks are healthier than ever
Self-driving Uber driver may have been streaming 'The Voice'
GM's new Chevy Blazer will be built in Mexico
"GM is bringing back the Chevy Blazer, an SUV classic "
...
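Because append mode adds every scraped headline on every run, repeated runs will store the same headline more than once. The following is a minimal sketch of one way to append only headlines that are not already in the file. The helper name append_new_headlines, the temporary file path, and the sample headlines are hypothetical and stand in for real scraped text; this is not part of the original answer.

```python
import csv
import os
import tempfile


def append_new_headlines(path, headlines):
    """Append headlines not already stored in the CSV at `path`.

    Hypothetical helper for illustration; returns the headlines written.
    """
    seen = set()
    if os.path.exists(path):
        with open(path, newline='', encoding='utf-8') as f:
            # Assumes one headline per row, in the first column.
            seen = {row[0] for row in csv.reader(f) if row}

    new_rows = []
    for headline in headlines:
        if headline and headline not in seen:
            seen.add(headline)
            # A one-element list becomes one row with one cell. Passing the
            # bare string to writerow would split it into characters.
            new_rows.append([headline])

    # Mode 'a' creates the file if it is missing and appends otherwise.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(new_rows)
    return [row[0] for row in new_rows]


# Usage with made-up headlines instead of a live request
demo_path = os.path.join(tempfile.mkdtemp(), 'articles.csv')
first_run = append_new_headlines(demo_path, ['Headline 1', 'Headline 2'])
second_run = append_new_headlines(demo_path, ['Headline 2', 'Headline 3'])

with open(demo_path, newline='', encoding='utf-8') as f:
    stored = [row[0] for row in csv.reader(f)]
```

In a real run, the articles list from the scraper would be passed in place of the sample lists. Reading the whole file back on each run is fine for a small headline log; a much larger file might call for a different approach.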