CrawlSpider callback in Rule never called

Problem description

I want to scrape all the tags on http://quotes.toscrape.com/ with Scrapy 2.2; my code is below.

I get output like this:

DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/classic/page/1/> (referer: http://quotes.toscrape.com)

So the LinkExtractor is finding the links, but why is the callback never executed?

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

# tagitem is my Item subclass (defined in items.py, omitted here)

class MySpider(CrawlSpider):
    name = 'quotes'
    #allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    rules = (
        # extract links to the tag pages
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tags"]')), callback='parse_tag'),
        #Rule(LinkExtractor(allow=(r'/author/', )), callback='parse_author')
    )

    def parse_tag(self, response):
        taginfo = ItemLoader(item=tagitem(), response=response)
        taginfo.add_xpath('tag', '//h3/a/text()')
        taginfo.add_xpath('quote', '//span[@class="text"]/text()')
        return taginfo.load_item()

Tags: web-scraping, scrapy

Solution


This works for me; hopefully it helps you.

from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkfinderItem(Item):
    # define the fields for your item here like:
    url = Field()
    anchor_text = Field()

class MySpider(CrawlSpider):
    name = 'quotesx'
    #allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    rules = (
        #extract links to tag page
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tags"]')), callback='parse_tag'),
        #Rule(LinkExtractor(allow=(r'/author/', )), callback='parse_author')
    )
    def parse_tag(self, response):
        # every tag link inside a "tags" block on the page
        tags = response.css('div.tags a.tag')

        for tag in tags:
            # one item per tag link, using the link itself as the loader's selector
            l = ItemLoader(item=LinkfinderItem(), selector=tag)
            l.add_value('url', response.url)
            l.add_css('anchor_text', '::text')
            yield l.load_item()
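
If you want to try the spider outside of a full Scrapy project, a minimal standalone runner could look like the sketch below (the tags.json feed path is just an illustration; any FEEDS target works):

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    # write the scraped items to tags.json via the FEEDS setting (Scrapy >= 2.1)
    process = CrawlerProcess(settings={
        'FEEDS': {'tags.json': {'format': 'json'}},
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished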
