Scrapy - multiple spiders - processing data from one spider while others are still running

Problem Description

I have a couple of spiders in my Scrapy project. Each of them collects data from various websites and stores it in the database (separately). After each spider finishes, I need to run code that does things to the data (let's call it the data processing subroutine). This takes a variable amount of time (up to an hour) depending on the spider/data.

My goal is to have a script that runs these spiders simultaneously and triggers the data processing subroutine for each spider as soon as its crawl finishes, without interfering with the spiders that are still running or with the data processing subroutines of other finished spiders. In other words, I want to do it all in the shortest amount of time.

I know I can run spiders simultaneously this way:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
I also know (or think) I could use the spider_closed signal inside each of the spiders to trigger the data processing subroutine.
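
A rough sketch of what I have in mind (SpiderOne, process_data, and the start URL are placeholders; my real spiders already live in the project and would all be wired up the same way):

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def process_data(spider):
        # Placeholder for the data processing subroutine (can take up to an hour).
        print(f"processing data collected by {spider.name}")


    class SpiderOne(scrapy.Spider):  # placeholder, stands in for my real spiders
        name = "spider_one"
        start_urls = ["https://example.com"]  # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(cls := crawler.spidercls, *args, **kwargs) if False else super().from_crawler(crawler, *args, **kwargs)
            # Hook the data processing subroutine to this spider's spider_closed signal.
            crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
            return spider

        def on_closed(self, spider):
            # NOTE: a long synchronous call here blocks the shared reactor,
            # which is exactly what question 1 below is about.
            process_data(spider)

        def parse(self, response):
            yield {"url": response.url}


    process = CrawlerProcess(get_project_settings())
    process.crawl(SpiderOne)
    # process.crawl(SpiderTwo)  # the other spiders, set up the same way
    process.start()  # blocks until every crawl has finished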

My questions are:

  1. Will this work as I imagine? Won't the data processing subroutines compete for resources since they are all in the same process?
  2. Is there a way to use actual multiprocessing and run each spider in a separate process? Or some other, better way to do this?

Thank you.

Tags: python, scrapy, multiprocessing

Solution


Won't the data processing subroutines compete for resources since they are all in the same process?

They will compete for resources, just like the spiders do. If that is not acceptable, you may need to use multiprocessing.

Is there a way to use actual multiprocessing and run each spider in a separate process?

See: Mix Python Twisted with multiprocessing?
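
A minimal sketch of that route, assuming placeholder spider classes SpiderOne and SpiderTwo importable from the project and the same process_data placeholder as in the question: each spider runs in its own child process with its own CrawlerProcess, and its data processing runs in that same child right after the crawl ends, so it cannot block the other spiders.

    import multiprocessing

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def process_data(spider_cls):
        # Placeholder for the (up to hour-long) data processing subroutine.
        print(f"processing data for {spider_cls.name}")


    def crawl_then_process(spider_cls):
        # Runs in a child process, so it gets its own fresh Twisted reactor.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_cls)
        process.start()           # blocks this child until the crawl finishes
        process_data(spider_cls)  # heavy work stays isolated in this process


    if __name__ == "__main__":
        from myproject.spiders import SpiderOne, SpiderTwo  # placeholder import path
        workers = [
            multiprocessing.Process(target=crawl_then_process, args=(cls,))
            for cls in (SpiderOne, SpiderTwo)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

With this layout the data processing subroutines only compete with each other at the operating-system level, not inside a single Python process.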

