Scraping loop for the next page

Problem description

Hello, I am trying to build a text scraper and crawler, but I don't understand why my code won't move on to the next page and keep looping.

import scrapy
from scrapy import *


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):

        allbuyers = response.xpath('//div[@class="company-details"]')

        for buyers in allbuyers:

            name = buyers.xpath('.//div/a/h2/text()').extract_first()
            email = buyers.xpath('.//p/a[contains(text(),"@")]/text()').extract_first()
            
            yield {
                'Name': name,
                'Email': email,
            }
        
        next_url = response.css('#main > div > nav > a.next.page-numbers')

        if next_url:
            print("test")
            url = response.xpath("href").extract()
            yield scrapy.Request(url, self.parse)

Tags: python, web-scraping, scrapy, scrapy-shell

Solution


What you are doing to get the next page does not make sense. Specifically, I mean this line: url = response.xpath("href").extract()
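That call treats the literal string "href" as an XPath expression evaluated against the whole page, so it never returns a URL. To read the attribute, select the link element first and then ask for its @href (or use a ::attr(href) CSS selector), for example with the pagination selector you already have in your own code:

    # Pull the href attribute out of the pagination link (selector taken from your spider)
    next_url = response.css('#main > div > nav > a.next.page-numbers')
    url = next_url.css('::attr(href)').get()   # or: next_url.xpath('./@href').get()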

Here is a modified version of your spider:

import scrapy


class HouseDirectorySpider(scrapy.Spider):
    name = 'thehousedirectory'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):
        # Yield the name and email for every company block on the page
        for buyers in response.xpath('//*[@class="company-details"]'):
            yield {
                'Name': buyers.xpath('.//*[@class="heading"]/a/h2/text()').get(),
                'Email': buyers.xpath('.//p/a[starts-with(@href,"mailto:")]/text()').get(),
            }

        # Follow the "Next Page" link with the same callback until no such link is left
        next_url = response.css('.custom-pagination > a.next:contains("Next Page")')
        if next_url:
            url = next_url.css("::attr(href)").get()
            yield scrapy.Request(url, callback=self.parse)
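If the pagination href ever comes back as a relative URL, scrapy.Request needs it made absolute first; response.follow (Scrapy 1.4+) handles that for you and can take the link selector directly. A small optional variant of the last lines, not part of the original answer:

    # Optional variant: response.follow resolves relative URLs and accepts the <a> selector itself
    next_link = response.css('.custom-pagination > a.next:contains("Next Page")')
    if next_link:
        yield response.follow(next_link[0], callback=self.parse)

You can then run the spider as usual, e.g. scrapy crawl thehousedirectory -o buyers.csv if it lives inside a Scrapy project.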
