Restarting scrapy for each url in a list

Problem description

I am trying to run a scrapy bot that runs the spider repeatedly, once for each url given in a list. The code I have written so far is:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

def run_spider(url_list, allowed_list):
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl('scraper', start_urls=url_list, allowed_domains=allowed_list)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()



for start, allowed in zip(start_url, allowedUrl):
    url_list = []
    allowed_list = []
    url_list.append(start)
    allowed_list.append(allowed)
    print(type(url_list), type(allowed_list))
    run_spider(url_list, allowed_list)

The spider itself runs fine on the first url, but as soon as the loop reaches the second iteration it raises twisted.internet.error.ReactorNotRestartable. The full traceback is here:

Traceback (most recent call last):
  File "C:\brox\Crawler\main.py", line 34, in <module>
    run_spider(url_list,allowed_list)
  File "C:\brox\Crawler\main.py", line 24, in run_spider
    reactor.run()
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I am following the approach described in the documentation, but how do I restart the spider for each item in the loop? Any suggestion would be very helpful.

P.S.: The spider itself works fine when the allowed domains and start urls are simply passed in directly.

Tags: python, scrapy, twisted

Solution


To get your code working, you have to rearrange your reactor.run() and reactor.stop() logic. Here is an example of how you could solve the problem:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet.defer import gatherResults
from twisted.internet import reactor

def run_spider(url_list, allowed_list):
    """
    :returns: Deferred
    """
    runner = CrawlerRunner(get_project_settings())
    return runner.crawl('scraper', start_urls=url_list, allowed_domains=allowed_list)

d_list = []
for start, allowed in zip(start_url, allowedUrl):
    # ... your logic ...
    # Append the Deferred for this crawl to a list.
    d_list.append(run_spider(url_list, allowed_list))

# "Join" all the Deferreds into a single one.
results = gatherResults(d_list)
# Stop the reactor after all the sites are scraped or a failure occurs.
results.addBoth(lambda _: reactor.stop())

reactor.run()

run_spider() now returns a Deferred. In the loop, append each Deferred to a list, then "join" them all with gatherResults, which also stops processing if a failure occurs (read up on gatherResults). Once all the sites have been scraped, the reactor is stopped.
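If you would rather crawl the urls one at a time instead of concurrently, the same idea works by chaining the Deferreds sequentially with inlineCallbacks. A minimal sketch, assuming the same start_url and allowedUrl lists and the 'scraper' spider from the question:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_all():
    # Each yield waits for the previous crawl to finish before starting the next one.
    for start, allowed in zip(start_url, allowedUrl):
        yield runner.crawl('scraper', start_urls=[start], allowed_domains=[allowed])
    reactor.stop()

crawl_all()
reactor.run()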

Search the web for ReactorNotRestartable; it has been explained many times.
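As a side note, if you do not need to manage the reactor yourself, CrawlerProcess starts and stops it for you. A rough sketch under the same assumptions (the start_url/allowedUrl lists and the 'scraper' spider from the question):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for start, allowed in zip(start_url, allowedUrl):
    # Schedules one crawl per url; nothing runs until start() is called.
    process.crawl('scraper', start_urls=[start], allowed_domains=[allowed])
process.start()  # starts the reactor and blocks until every crawl has finished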

