Scrapy follow vs follow_all

Problem description

The Scrapy tutorial has an example spider, QuotesSpider. When it gets to following links, it looks like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

This code will fetch all the pages.
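For comparison, the same tutorial also shows a longer equivalent of that last step; a minimal sketch of just the pagination part, using scrapy.Request plus response.urljoin() (follow() does the URL join itself, while scrapy.Request needs an absolute URL):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # the href is relative (e.g. "/page/2/"); scrapy.Request needs an
            # absolute URL, so join it first -- response.follow() would do
            # this join automatically
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)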
Alternatively:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

        yield from response.follow_all(css='li.next a', callback=self.parse)
        # or equivalently
        # urls = response.css("li.next a")
        # yield from response.follow_all(urls=urls, callback=self.parse)

But when I replace yield from response.follow_all(css='li.next a', callback=self.parse) with yield response.follow(css='li.next a', callback=self.parse), it only fetches page 1. Since response.css("li.next a") returns at most one selector, I expected the latter case to fetch all the pages as well. Why?
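For reference, per the Scrapy docs follow() accepts a single url string, Link object, or <a> Selector as its first positional argument, so a single-request version of the same step can be sketched like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        next_link = response.css('li.next a')  # SelectorList: 0 or 1 items here
        if next_link:
            # pass the Selector itself, not a css pattern string
            yield response.follow(next_link[0], callback=self.parse)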

Thanks in advance!

Tags: python, scrapy

Solution
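What follows is a hedged reading of Scrapy's documented API (assuming Scrapy 2.0+), not a confirmed fix: follow() builds exactly one Request from a single url, Link, or Selector passed positionally, and it has no css= parameter; the css=/xpath= shortcuts exist only on follow_all(), which returns an iterable of Requests. On that reading, response.follow(css='li.next a', ...) raises a TypeError inside parse() after page 1 has already been saved to disk; Scrapy logs the error, no new Request is ever yielded, and the crawl ends there. In miniature:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # follow_all() accepts the css= shortcut and returns an iterable of
        # Requests (zero or more), hence the "yield from":
        yield from response.follow_all(css='li.next a', callback=self.parse)

        # follow() takes one url/Link/Selector positionally and returns a
        # single Request; it has no css= keyword, so the line below would
        # raise "TypeError: follow() got an unexpected keyword argument
        # 'css'" -- which would explain why that variant never got past
        # page 1:
        # yield response.follow(css='li.next a', callback=self.parse)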

