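python - Scrapy: How to use a scraped item as a variable in a dynamic URL
Question
I want to start scraping from the last pagination number, going from the highest page down to the lowest.
The page-2267 part is dynamic, so I first need to scrape the last page number from the thread, and the paginated URLs should then run page-2267, page-2266, and so on.
This is what I have done: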
class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']

    def parse(self, response):
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
        for item in response.css("[id^='fc-post-']"):
            last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page":last_page,"message":message,"datatime":datime}

        next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-' + str(TeslamotorsclubSpider.last_page)
        print(next_page)
        TeslamotorsclubSpider.last_page = int(TeslamotorsclubSpider.last_page)
        TeslamotorsclubSpider.last_page -= 1
        yield response.follow(next_page, callback=self.parse)
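I need to scrape the items from the highest page down to the lowest. Please help, thank you.
Answer
Your page has a handy element, link[rel=next]. So you can restructure your code this way: parse a page, follow the next link, parse that page, follow the next link, and so on.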
def parse(self, response):
    for item in response.css("[id^='fc-post-']"):
        datime = item.css("a.datePermalink span::attr(title)").get()
        message = item.css('div.messageContent blockquote').extract()
        datime = parser.parse(datime)
        yield {"message":message,"datatime":datime}

    next_page = response.css('link[rel=next]::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
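response.follow accepts relative URLs and resolves them against the URL of the current response, which is why the href can be passed to it unchanged. The sketch below shows the same kind of resolution using only the standard library. The href value is hypothetical; the attribute actually served by the site may look different.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (taken from start_urls).
current_url = (
    "https://teslamotorsclub.com/tmc/threads/"
    "tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/"
)

# Hypothetical href value, used for illustration only.
next_href = "page-2"

# Resolve the relative href against the current page URL.
next_url = urljoin(current_url, next_href)
print(next_url)
```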
UPD: Here is code that scrapes the data from the last page down to the first:
import scrapy
from dateutil import parser  # assumption: `parser` in the question is dateutil's parser

class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']
    next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}'

    def parse(self, response):
        # read the last page number from the pagination block
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
        if last_page and int(last_page):
            # iterate from last page down to first
            for i in range(int(last_page), 0, -1):
                url = self.next_page.format(i)
                yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        # the last page number is the same for every post on the page
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
        # parse data on page
        for item in response.css("[id^='fc-post-']"):
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page":last_page,"message":message,"datatime":datime}
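One thing to keep in mind: Scrapy downloads requests concurrently, so yielding them in descending order does not by itself guarantee that the pages are processed in that order. A possible refinement, sketched here under assumptions, is to attach a priority to each request so that higher page numbers are scheduled earlier. The helper name descending_pages is hypothetical and not part of Scrapy; the code is plain Python so it can be run without a crawl.

```python
from typing import Iterator, Tuple

URL_TEMPLATE = (
    "https://teslamotorsclub.com/tmc/threads/"
    "tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}"
)


def descending_pages(last_page: int, template: str = URL_TEMPLATE) -> Iterator[Tuple[str, int]]:
    """Yield (url, priority) pairs from last_page down to page 1."""
    for page in range(last_page, 0, -1):
        # Use the page number as the priority: higher pages get higher values.
        yield template.format(page), page


# Small demonstration with a hypothetical last page of 3.
pairs = list(descending_pages(3))
for url, priority in pairs:
    print(priority, url)
```

Inside parse, each pair could then be turned into a request with scrapy.Request(url, self.parse_page, priority=priority), since Scrapy schedules requests with a higher priority value earlier. Even then, responses may still arrive slightly out of order while several downloads are in flight; if strict ordering matters, lowering the CONCURRENT_REQUESTS setting to 1 is one option to consider.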