How to break out of a crawl in Scrapy when a certain condition is met

Problem description

For the purposes of the application I'm developing, I need Scrapy to break out of the crawl and start crawling again from a specific, arbitrary URL.

The expected behavior is that, when a certain condition is met, Scrapy goes back to a specific URL that can be supplied as an argument.

I'm using a CrawlSpider but can't figure out how to achieve this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www\.)?" + domain + "/.*"]),
                 callback='parse_item', follow=True),
        )
        # rules are built in __init__, so compile them by hand
        super(MyCrawlSpider, self)._compile_rules()

    def parse_item(self, response):
        # some_condition stands in for whatever check should trigger the restart
        if some_condition is True:
            # force scrapy to go back to home page and recrawl
            print("Should break out")
        else:
            print("Just carry on")

I tried placing

return scrapy.Request(self.initial_url, callback=self.parse_item)

in the branch where some_condition is True, but without success. Some help would be much appreciated; I've been struggling with this for hours.
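For context (this is not part of the original question): one plausible reason the returned Request appears to be ignored is Scrapy's built-in duplicate filter, which silently drops a request to the already-visited initial_url unless the request is created with dont_filter=True. A minimal sketch of that idea, with some_condition still a placeholder:

    def parse_item(self, response):
        if some_condition is True:
            # bypass the dupefilter so the request back to the start URL
            # is not dropped as a duplicate
            return scrapy.Request(self.initial_url,
                                  callback=self.parse_item,
                                  dont_filter=True)
        print("Just carry on")

Note that this only re-enqueues the start page; it does not cancel the requests Scrapy has already scheduled, so it may or may not count as "breaking out" for this use case.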

Tags: python, scrapy, web-crawler

Solution


You could raise a custom exception and handle it appropriately, something like this...

Feel free to edit this with the proper syntax for CrawlSpider.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RestartException(Exception):
    pass


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www\.)?" + domain + "/.*"]),
                 callback='parse_item', follow=True),
        )
        # set self.rules first; CrawlSpider.__init__ compiles them
        super(MyCrawlSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        # some_condition stands in for whatever check should trigger the restart
        if some_condition is True:
            print("Should break out")
            raise RestartException("We're restarting now")
        else:
            print("Just carry on")


siteName = "http://whatever.com"
crawler = MyCrawlSpider(siteName)
while True:
    try:
        # idk how you start this thing, but do that
        crawler.run()
        break
    except RestartException as err:
        print(err.args)
        # "something" is a placeholder for passing the restart info back to the spider
        crawler.something = err.args
        continue

print("I'm done!")
