python - Scrapy - multiple spiders - processing data from one spider while others are still running
Problem description
I have a couple of spiders in my scrapy project. Each of them collects data from various websites and stores it in the database (separately). After each spider finishes, I need to run code that operates on its data (let's call it the data processing subroutine). This takes a variable amount of time (up to an hour), depending on the spider/data.
My goal is to have a script which runs these spiders simultaneously and also triggers the data processing subroutine for each spider once its crawling is finished, without interfering with the other spiders that are still running or with the data processing subroutines of spiders that have already finished. In other words, I want to do it all in the shortest amount of time.
I know I can run spiders simultaneously this way:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
I also know/think I could use the spider_closed signal inside each of the spiders to trigger the data processing subroutine.
My questions are:
- Will this work as I imagine? Won't the data processing subroutines compete for resources, since they all live in the same process?
- Is there a way to use actual multiprocessing and run each spider in a separate process? Or is there some other, better way to do this?
Thank you.
Solution