web-scraping - Celery with Scrapy does not parse CSV file
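Problem description
The task itself starts immediately, but it finishes almost at once, I cannot see any result from it, and it never reaches the pipeline. When I run the same code with the scrapy crawl <spider_name> command, everything works fine. The problem only appears when I use Celery.
My Celery worker log: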
[2021-02-13 14:25:00,208: INFO/MainProcess] Received task: crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3]
[2021-02-13 16:25:00,867: INFO/ForkPoolWorker-1] Scrapy 2.4.0 started (bot: crawling)
[2021-02-13 16:25:00,869: INFO/ForkPoolWorker-1] Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.7 (default, Jan 12 2021, 17:06:28) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Linux-5.8.0-41-generic-x86_64-with-glibc2.2.5
[2021-02-13 16:25:00,869: DEBUG/ForkPoolWorker-1] Using reactor: twisted.internet.epollreactor.EPollReactor
[2021-02-13 16:25:00,879: INFO/ForkPoolWorker-1] Overridden settings:
{'BOT_NAME': 'crawling',
'DOWNLOAD_TIMEOUT': 600,
'DOWNLOAD_WARNSIZE': 267386880,
'NEWSPIDER_MODULE': 'crawling.crawling.spiders',
'SPIDER_MODULES': ['crawling.crawling.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64)'}
[2021-02-13 16:25:01,018: INFO/ForkPoolWorker-1] Telnet Password: d95c783294fc93df
[2021-02-13 16:25:01,064: INFO/ForkPoolWorker-1] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
[2021-02-13 16:25:01,151: INFO/ForkPoolWorker-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2021-02-13 16:25:01,172: INFO/ForkPoolWorker-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2021-02-13 16:25:01,183: INFO/ForkPoolWorker-1] Task crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3] succeeded in 0.9719750949989248s: None
[2021-02-13 16:25:01,285: INFO/ForkPoolWorker-1] Received SIGTERM, shutting down gracefully. Send again to force
I have the following spider:
from scrapy.spiders import CSVFeedSpider


class CopartSpider(CSVFeedSpider):
    name = '<spider_name>'
    allowed_domains = ['<allowed_domain>']
    start_urls = [
        'file:///code/autotracker/crawling/data/salesdata.cgi'
    ]
Part of my Scrapy settings (there is nothing else directly related to Scrapy):
BOT_NAME = 'crawling'
SPIDER_MODULES = ['crawling.crawling.spiders']
NEWSPIDER_MODULE = 'crawling.crawling.spiders'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'
ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 600  # 10 min
DOWNLOAD_WARNSIZE = 255 * 1024 * 1024  # 255 MB
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'crawling.pipelines.AutoPipeline': 1,
}
I have two files for the Celery configuration:
celery.py
from celery import Celery
from celery.schedules import crontab

BROKER_URL = 'redis://redis:6379/0'
app = Celery('crawling', broker=BROKER_URL)
app.conf.beat_schedule = {
    # Despite the entry name, this crontab fires every 5 minutes.
    'scrape-every-20-minutes': {
        'task': 'crawling.crawling.tasks.start_crawler_process',
        'schedule': crontab(minute='*/5'),
    }
}
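tasks.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# `app` is the Celery instance created in celery.py (its import is not shown in the original).


@app.task
def start_crawler_process():
    process = CrawlerProcess(get_project_settings())
    process.crawl('<spider_name>')
    process.start()  # starts the Twisted reactor and runs the crawl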
Solution
Cause: Scrapy does not allow running other processes.
Solution: I used my own script - https://github.com/dtalkachou/scrapy-crawler-script
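For readers who want to see the general idea without opening the repository, here is a minimal sketch of one common workaround: launching the crawl in its own child process from the task, so that Scrapy gets a fresh interpreter each time. This is not the linked script and may differ from it considerably. It assumes that the `scrapy` command is available on the worker's PATH and that `project_dir` (a hypothetical parameter) points at the directory containing `scrapy.cfg`. The helper names `build_crawl_command` and `run_command` are made up for illustration.

```python
import subprocess
import sys


def build_crawl_command(spider_name):
    """Return the argv list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_command(argv, cwd=None, timeout=None):
    """Run argv in a child process, wait for it, and return its exit code."""
    completed = subprocess.run(argv, cwd=cwd, timeout=timeout)
    if completed.returncode != 0:
        # Report the failure on stderr so it shows up in the worker log.
        print(f"command {argv!r} exited with {completed.returncode}", file=sys.stderr)
    return completed.returncode


def start_crawler_process(spider_name="<spider_name>", project_dir=None):
    # Hypothetical task body: in the real project this function would carry
    # the @app.task decorator from tasks.py.
    # Assumption: project_dir is the folder that holds scrapy.cfg.
    code = run_command(build_crawl_command(spider_name), cwd=project_dir)
    if code != 0:
        raise RuntimeError(f"scrapy crawl exited with status {code}")
    return code
```

Because the crawl runs through the same `scrapy crawl` command that already works from the shell, the Celery worker process never starts a Twisted reactor itself. The trade-off is that the task only sees the exit status, not the scraped items, so results still have to flow through the item pipeline.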