How to override sitemap_rules in a Scrapy SitemapSpider?

Problem Description

I am trying to add sitemap_rules to a spider dynamically in its __init__ method. I can change sitemap_urls the same way, but sitemap_rules is never overridden. Can anyone tell me what I am doing wrong? Here is my code:

# -*- coding: utf-8 -*-
from scrapy.spiders import SitemapSpider
from scrapy.selector import Selector
from myspider.items import MyItem
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class MySpider(SitemapSpider):
    sitemap_urls = []
    sitemap_rules = []
    name = "testspider"

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # Assign the rules and URLs after the base __init__ has already run
        self.sitemap_rules = [('*.Attraction_Review.*', 'parse_data'),]
        start_url = "http://tripadvisor-sitemaps.s3-website-us-east-1.amazonaws.com/att/en_IN/sitemap_en_IN_attraction_review_index.xml"
        self.sitemap_urls = [start_url]
        #dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse_data(self, response):
        # ...
        yield item

In the code above, parse_data is never called. If I put the same rules in the sitemap_rules class variable from the start, it works fine.
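For comparison, here is a minimal sketch of the class-attribute version that does work. The pattern is written as '.*Attraction_Review.*' here, since the '*.' prefix in the snippet above is not a valid regular expression and was presumably meant this way:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "testspider"
    sitemap_urls = [
        "http://tripadvisor-sitemaps.s3-website-us-east-1.amazonaws.com/att/en_IN/sitemap_en_IN_attraction_review_index.xml"
    ]
    # Class-level rules are compiled by SitemapSpider.__init__
    sitemap_rules = [('.*Attraction_Review.*', 'parse_data')]

    def parse_data(self, response):
        # placeholder callback
        self.logger.info("parsing %s", response.url)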

Tags: python, scrapy

Solution


I solved it by rebuilding the compiled rule list myself. SitemapSpider.__init__ compiles sitemap_rules into an internal list of (pattern, callback) pairs, self._cbs, so assigning self.sitemap_rules after super().__init__() has run has no effect: the rules have already been compiled. Here is the code change:

import six
from scrapy.spiders.sitemap import regex  # Scrapy's helper that compiles a pattern string

def __init__(self, *a, **kw):
    super(MySpider, self).__init__(*a, **kw)
    rules = [('https://www.tripadvisor.in/Attraction_Review.*', 'parse_data'),]
    # SitemapSpider already compiled its (empty) sitemap_rules into
    # self._cbs inside super().__init__(), so rebuild that list by hand.
    self._cbs = []
    for r, c in rules:
        if isinstance(c, six.string_types):
            c = getattr(self, c)  # resolve the callback name to a bound method
        self._cbs.append((regex(r), c))
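
An alternative that avoids touching the private _cbs attribute is to assign sitemap_rules before calling the base __init__, so that SitemapSpider compiles the rules itself. A minimal sketch, reusing the URL and pattern from this question:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "testspider"

    def __init__(self, *a, **kw):
        # Set the instance attributes first; SitemapSpider.__init__ reads
        # self.sitemap_rules and compiles them into self._cbs.
        self.sitemap_rules = [
            ('https://www.tripadvisor.in/Attraction_Review.*', 'parse_data'),
        ]
        self.sitemap_urls = [
            "http://tripadvisor-sitemaps.s3-website-us-east-1.amazonaws.com/att/en_IN/sitemap_en_IN_attraction_review_index.xml"
        ]
        super(MySpider, self).__init__(*a, **kw)

    def parse_data(self, response):
        self.logger.info("parsing %s", response.url)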
