Can't move to the next page using Scrapy

Problem description

I'm trying to tell Scrapy to move to the next page and scrape its content, but it stops after the first page.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CasaSpider(CrawlSpider):
    name = 'house'
    start_urls = ['https://www.casa.it/affitto/residenziale/napoli/montecalvario-avvocata-san-giuseppe-porto-pendino-mercato?sortType=date_desc']
       

    rules = [
        Rule(LinkExtractor(allow=(r'/immobili/.*'), deny=(r'/immagine-.*')),
             callback='parse', follow=False),
    ]

    def parse(self, response):
        yield {
            'title': response.xpath('//*[@id="__next"]/div[2]/div[2]/div[1]/div/h1/text()').get(),
            'price': response.xpath('//*[@id="__next"]/div[2]/div[2]/div[1]/div/ul/li[1]/text()').get()
        }

        next_page = response.css('a.paginator__page.tp-a--c.b-r--100.is-block.c-bg--w.tp-w--m.paginator__nav.next::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)

Do you know what I might be doing wrong? When I test the next_page selector in the shell, I get the correct result.

Thanks everyone for your help.

Tags: python, scrapy

Solution

Your only rule matches the /immobili/ detail pages with follow=False, so the paginated listing pages are never queued at all. Just add another rule that follows the pagination links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CasaSpider(CrawlSpider):
    name = 'house'
    start_urls = ['https://www.casa.it/affitto/residenziale/napoli/montecalvario-avvocata-san-giuseppe-porto-pendino-mercato?sortType=date_desc']

    rules = (
        Rule(LinkExtractor(allow=(r'/affitto/residenziale/napoli/montecalvario-avvocata-san-giuseppe-porto-pendino-mercato/*')), follow=True),
        Rule(LinkExtractor(allow=(r'/immobili/.*'), deny=(r'/immagine-.*')), callback='parse_item', follow=False),
    )

    # CrawlSpider uses parse() internally to drive its rules,
    # so the item callback needs a different name
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="__next"]/div[2]/div[2]/div[1]/div/h1/text()').get(),
            'price': response.xpath('//*[@id="__next"]/div[2]/div[2]/div[1]/div/ul/li[1]/text()').get()
        }
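As a quick sanity check of how the two rules divide the work, the allow/deny patterns can be exercised with plain `re` (the URLs below are hypothetical examples, not taken from the site):

```python
import re

# The same patterns used in the two rules above
LISTING = r'/affitto/residenziale/napoli/montecalvario-avvocata-san-giuseppe-porto-pendino-mercato/*'
DETAIL_ALLOW = r'/immobili/.*'
DETAIL_DENY = r'/immagine-.*'

def classify(url):
    """Mimic the rule order: follow listing pages, parse detail pages, skip image links."""
    if re.search(LISTING, url):
        return 'follow'   # first rule: paginated search pages
    if re.search(DETAIL_ALLOW, url) and not re.search(DETAIL_DENY, url):
        return 'parse'    # second rule: property detail pages
    return 'ignore'

# Hypothetical URLs for illustration only
urls = [
    'https://www.casa.it/affitto/residenziale/napoli/montecalvario-avvocata-san-giuseppe-porto-pendino-mercato?sortType=date_desc',
    'https://www.casa.it/immobili/12345678/',
    'https://www.casa.it/immobili/12345678/immagine-1',
]
for url in urls:
    print(classify(url), '<-', url)
```

This is only a sketch: LinkExtractor also canonicalizes URLs and restricts matching to link tags, but the partition of URLs into "follow", "parse", and "ignore" is the same idea.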
