Scrapy: how to use a scraped item as a variable for dynamic URLs

Problem description

I want to start scraping from the last page of the thread, crawling from the highest page number down to the lowest:

https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-

The last page (currently page-2267) is dynamic, so I first need to scrape an item to determine the final page number; the paginated URLs should then proceed as page-2267, page-2266, and so on.
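Once the last page number is known, the countdown itself is simple; a minimal plain-Python sketch (the base URL is taken from the question, and `descending_page_urls` is a hypothetical helper name):

```python
# Base thread URL from the question; {} receives the page number.
BASE = ("https://teslamotorsclub.com/tmc/threads/"
        "tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}")

def descending_page_urls(last_page):
    """Yield page URLs from the highest page number down to page 1."""
    for n in range(last_page, 0, -1):
        yield BASE.format(n)

urls = list(descending_page_urls(3))
# urls[0] ends with "page-3", urls[-1] with "page-1"
```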

This is what I have done:

import scrapy
from dateutil import parser  # parser.parse() used below (python-dateutil)

class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']

    def parse(self, response):
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
        for item in response.css("[id^='fc-post-']"):
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page":last_page,"message":message,"datatime":datime}

        next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-' + str(TeslamotorsclubSpider.last_page)
        print(next_page)
        TeslamotorsclubSpider.last_page = int(TeslamotorsclubSpider.last_page)
        TeslamotorsclubSpider.last_page -= 1
        yield response.follow(next_page, callback=self.parse)   

I need to scrape the items from the highest page down to the lowest. Please help, thanks.

Tags: python, web-scraping, scrapy

Solution


Your page has a handy `link[rel=next]` element, so you can restructure your code this way: parse a page, follow the next link, parse that page, follow its next link, and so on.

def parse(self, response):
    for item in response.css("[id^='fc-post-']"):
        datime = item.css("a.datePermalink span::attr(title)").get()
        message = item.css('div.messageContent blockquote').extract()
        datime = parser.parse(datime)
        yield {"message":message,"datatime":datime}

    next_page = response.css('link[rel=next]::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)   
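For readers without a Scrapy shell at hand, the `link[rel=next]` lookup can be mimicked with the standard library alone; a sketch using `html.parser` (the sample HTML and the `NextLinkFinder` class are made up for illustration):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Records the href of a <link rel="next"> tag, mirroring the
    CSS selector link[rel=next]::attr(href) used in the spider."""
    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "next":
            self.next_href = a.get("href")

# Illustrative HTML fragment, not the real page source.
html = '<html><head><link rel="next" href="/tmc/threads/thread.139047/page-2"></head></html>'
finder = NextLinkFinder()
finder.feed(html)
# finder.next_href now holds the relative URL of the next page
```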

UPD: here is code that scrapes the data from the last page back to the first:

class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']
    next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}'

    def parse(self, response):
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
        if last_page and int(last_page):
            # iterate from last page down to first
            for i in range(int(last_page), 0, -1):
                url = self.next_page.format(i)
                yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        # parse data on page; the last-page number is the same for every
        # post on the page, so extract it once outside the loop
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
        for item in response.css("[id^='fc-post-']"):
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page": last_page, "message": message, "datatime": datime}
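Note that `parser.parse` in the snippets above comes from the third-party `python-dateutil` package. If only the standard library is available and the timestamp format is fixed, `datetime.strptime` does the job; the format string below is an assumption, so adjust it to whatever the site's `title` attribute actually contains:

```python
from datetime import datetime

def parse_post_time(raw):
    # Hypothetical timestamp format such as "Dec 18, 2018 at 1:37 AM";
    # the real forum titles may differ, so treat this pattern as an assumption.
    return datetime.strptime(raw, "%b %d, %Y at %I:%M %p")

stamp = parse_post_time("Dec 18, 2018 at 1:37 AM")
# stamp is a datetime for 2018-12-18 01:37
```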
