scrapy - How do I grab the headline titles from the Google News webpage with Scrapy?
Problem description
I saved an offline file of https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen
I'm having trouble working out how to grab the titles of the listed articles.
import scrapy


class newsSpider(scrapy.Spider):
    name = "news"
    start_urls = ['file:///127.0.0.1/home/toni/Desktop/crawldeez/googlenewsoffline.html/']

    def parse(self, response):
        for xrnccd in response.css('a.MQsxIb.xTewfe.R7GTQ.keNKEd.j7vNaf.Cc0Z5d.EjqUne'):
            yield {
                'ipQwMb.ekueJc.RD0gLb': xrnccd.css('h3.ipQwMb.ekueJc.RD0gLb::ipQwMb.ekueJc.RD0gLb').get(),
            }
Solution
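The problem appears to be that the page content is rendered dynamically with JavaScript, so it cannot be extracted from the HTML with the css or xpath methods. The titles are still present in the response body, however, so you can pull them out with a regular expression. Here is a Scrapy shell session showing how: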
$ scrapy shell "https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen"
...
>>> import re
>>> from pprint import pprint
>>>
>>> titles = re.findall(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>', response.text)
>>> pprint(titles)
['Amazon will no longer sell Chinese goods in China',
'YouTube is finally coming back to Amazon’s Fire TV devices',
'Amazon Plans to Use Digital Media to Expand Its Advertising Business',
'Amazon flooded with fake reviews; Learn how to spot them',
'How To Win in Today's Amazon World',
'Amazon Day: How to schedule Amazon deliveries',
'Bezos Disputes Amazon’s Market Power. But His Merchants Feel the Pinch',
'20 Best Action Movies to Stream on Amazon Prime',
...]
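The same extraction can be wrapped in a small helper so it is reusable outside the shell. This is only a minimal sketch: the function name `extract_titles` and the sample markup are made up for illustration, and the pattern assumes the page still uses the `<h3 class="..."><a ...>title</a>` layout seen in the session above, which Google may change at any time.

```python
import re
from html import unescape

# Same pattern as in the shell session: an <h3> with a class attribute,
# immediately followed by an <a> whose text is the headline.
TITLE_RE = re.compile(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>')


def extract_titles(html_text):
    """Return the headline strings found in html_text, with HTML entities decoded."""
    return [unescape(title) for title in TITLE_RE.findall(html_text)]


# Hypothetical sample markup, invented for this example only.
sample = (
    '<h3 class="ipQwMb ekueJc RD0gLb">'
    '<a href="./articles/1" class="DY5T1d">First &amp; foremost</a></h3>'
    '<h3 class="ipQwMb ekueJc RD0gLb">'
    '<a href="./articles/2" class="DY5T1d">Second headline</a></h3>'
)

titles = extract_titles(sample)
print(titles)
```

Inside a spider, the `parse` method could call `extract_titles(response.text)` and yield one item per title, since `response.text` holds the same body that the shell session searches.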