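xpath - Scrapy returns nothing when scraping the AWS blog site
Question
I am trying to scrape the list of URLs from the first page of the AWS blog site, but nothing is returned. I suspect my XPath is wrong, but I don't know how to fix it.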
import scrapy


class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//li[@class="m-card"]')
        for blog in blogs:
            url = blog.xpath('.//div[@class="m-card-title"]/a/@href').extract()
            print(url)
Attempt 2
import scrapy


class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//div[@class="aws-directories-container"]')
        for blog in blogs:
            url = blog.xpath('//li[@class="m-card"]/div[@class="m-card-title"]/a/@href').extract_first()
            print(url)
Log output:
2019-11-06 10:38:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-06 10:38:30 [scrapy.core.engine] INFO: Spider opened
2019-11-06 10:38:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-06 10:38:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/robots.txt> from <GET http://aws.amazon.com/robots.txt>
2019-11-06 10:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/robots.txt> (referer: None)
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/blogs/> from <GET http://aws.amazon.com/blogs/>
2019-11-06 10:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/blogs/> (referer: None)
2019-11-06 10:38:32 [scrapy.core.engine] INFO: Closing spider (finished)
Any help would be greatly appreciated!
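Answer
You are using the wrong tool for this page. The site loads the blog details dynamically through scripts, so they may not be present in the HTML that Scrapy downloads. View the page source to check whether the blog content is actually there.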
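One way to run that page-source check from code is to count the target elements in the raw HTML, before any JavaScript has run. The following is only a minimal sketch using the standard library. The helper name `count_tags_with_class` is hypothetical, and `li` / `m-card` are simply the tag and class taken from the XPath in the question.

```python
from html.parser import HTMLParser


class _ClassCounter(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or valueless, hence the fallback.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_tags_with_class(html, tag, css_class):
    """Hypothetical helper: how many <tag class="... css_class ..."> are in html."""
    parser = _ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count
```

Inside the spider's `parse` method this could be called as `count_tags_with_class(response.text, "li", "m-card")`. A result of 0 would suggest that the cards are injected by script and that no XPath against the raw response can match them.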
To get the data, use a technique that can fetch dynamically rendered content, such as:
1. Scrapy Splash
2. Selenium
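As an illustration of the Selenium option, here is a rough sketch rather than a tested solution for this site. It assumes Chrome and a matching driver are installed, and that the rendered page still uses the `m-card` and `m-card-title` classes from the question. The function names `fetch_blog_links` and `dedupe_links` are hypothetical.

```python
def dedupe_links(hrefs):
    """Drop empty values and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def fetch_blog_links(url, timeout=10):
    """Render the page in headless Chrome and return the blog card links."""
    # Imported here so dedupe_links stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the cards carry the "m-card" class seen in the question.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "li.m-card"))
        )
        anchors = driver.find_elements(
            By.CSS_SELECTOR, "li.m-card div.m-card-title a"
        )
        return dedupe_links(a.get_attribute("href") for a in anchors)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_blog_links("https://aws.amazon.com/blogs/")` would then return the links found after rendering. If the site's markup has changed since the question was asked, the wait will time out and the selectors will need adjusting.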