Celery with Scrapy does not parse the CSV file

Problem description

The task itself starts right away, but it finishes almost immediately and I see no results from it; nothing ever reaches the item pipeline. When I run the same code with the scrapy crawl <spider_name> command, everything works fine. The problem only appears when I run it through Celery.

My Celery worker log:

[2021-02-13 14:25:00,208: INFO/MainProcess] Received task: crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3]  
[2021-02-13 16:25:00,867: INFO/ForkPoolWorker-1] Scrapy 2.4.0 started (bot: crawling)
[2021-02-13 16:25:00,869: INFO/ForkPoolWorker-1] Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.7 (default, Jan 12 2021, 17:06:28) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Linux-5.8.0-41-generic-x86_64-with-glibc2.2.5
[2021-02-13 16:25:00,869: DEBUG/ForkPoolWorker-1] Using reactor: twisted.internet.epollreactor.EPollReactor
[2021-02-13 16:25:00,879: INFO/ForkPoolWorker-1] Overridden settings:
{'BOT_NAME': 'crawling',
 'DOWNLOAD_TIMEOUT': 600,
 'DOWNLOAD_WARNSIZE': 267386880,
 'NEWSPIDER_MODULE': 'crawling.crawling.spiders',
 'SPIDER_MODULES': ['crawling.crawling.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64)'}
[2021-02-13 16:25:01,018: INFO/ForkPoolWorker-1] Telnet Password: d95c783294fc93df
[2021-02-13 16:25:01,064: INFO/ForkPoolWorker-1] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
[2021-02-13 16:25:01,151: INFO/ForkPoolWorker-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2021-02-13 16:25:01,172: INFO/ForkPoolWorker-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2021-02-13 16:25:01,183: INFO/ForkPoolWorker-1] Task crawling.crawling.tasks.start_crawler_process[dece5127-bdfe-47d1-855e-ffc06d5481d3] succeeded in 0.9719750949989248s: None
[2021-02-13 16:25:01,285: INFO/ForkPoolWorker-1] Received SIGTERM, shutting down gracefully. Send again to force

I have the following spider:

from scrapy.spiders import CSVFeedSpider

class CopartSpider(CSVFeedSpider):
    name = '<spider_name>'
    allowed_domains = ['<allowed_domain>']
    start_urls = [
        'file:///code/autotracker/crawling/data/salesdata.cgi'
    ]
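
(For reference, the full spider is not shown above; a CSVFeedSpider also defines the CSV layout and a parse_row callback, and whatever parse_row yields is what reaches the item pipelines. A minimal, purely illustrative sketch with made-up field names, not the actual spider from this project:)

from scrapy.spiders import CSVFeedSpider


class ExampleCSVSpider(CSVFeedSpider):
    name = 'example_csv'
    start_urls = ['file:///tmp/sales.csv']  # illustrative local file

    delimiter = ','                              # column separator of the feed
    quotechar = '"'
    headers = ['lot', 'make', 'model', 'price']  # illustrative column names

    def parse_row(self, response, row):
        # Each CSV row arrives as a dict keyed by the headers above;
        # yielding a dict here sends it through the item pipelines.
        yield {
            'lot': row['lot'],
            'make': row['make'],
            'model': row['model'],
            'price': row['price'],
        }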

Part of my Scrapy settings (there is nothing else directly related to Scrapy):

BOT_NAME = 'crawling'

SPIDER_MODULES = ['crawling.crawling.spiders']
NEWSPIDER_MODULE = 'crawling.crawling.spiders'

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

ROBOTSTXT_OBEY = False

DOWNLOAD_TIMEOUT = 600    # 10 min
DOWNLOAD_WARNSIZE = 255 * 1024 * 1024    # 255 mb

DEFAULT_REQUEST_HEADERS = {
  'Accept': '*/*',
  'Accept-Language': 'en',
}

ITEM_PIPELINES = {
   'crawling.pipelines.AutoPipeline': 1,
}

I have two files for the Celery configuration:

celery.py

from celery import Celery
from celery.schedules import crontab

BROKER_URL = 'redis://redis:6379/0'
app = Celery('crawling', broker=BROKER_URL)

app.conf.beat_schedule = {
    'scrape-every-20-minutes': {
        'task': 'crawling.crawling.tasks.start_crawler_process',
        'schedule': crontab(minute='*/5'),
    }
}

tasks.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from .celery import app  # Celery app defined in celery.py (import path assumed)

@app.task
def start_crawler_process():
    process = CrawlerProcess(get_project_settings())
    process.crawl('<spider_name>')
    process.start()
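
(For debugging, the scheduled task can also be queued by hand, assuming the worker is running and the Redis broker is reachable; this is standard Celery usage, not something specific to this project:)

from crawling.crawling.tasks import start_crawler_process

result = start_crawler_process.delay()  # enqueue the task immediately
print(result.id)                        # task id, as it appears in the worker log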

Tags: web-scraping, scrapy, web-crawler, celery, celery-task

Solution


Reason: Scrapy's crawl cannot be run from inside another already-running process (here, the Celery worker process).

Solution: I used my own script - https://github.com/dtalkachou/scrapy-crawler-script
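
If bringing in an extra package is not desirable, a commonly suggested workaround for this situation is to launch the crawl in a fresh child process from the Celery task, so the Twisted reactor never has to start inside the worker process itself. A minimal sketch using subprocess and the regular scrapy CLI; the working directory and spider name below are assumptions, not taken from the question:

import subprocess

from .celery import app  # the Celery app from celery.py (import path assumed)


@app.task
def start_crawler_process():
    # Run the spider exactly as it already works manually ("scrapy crawl <spider_name>"),
    # but in a separate child process, keeping the Twisted reactor out of the worker.
    subprocess.run(
        ['scrapy', 'crawl', '<spider_name>'],
        cwd='/code/autotracker/crawling',  # assumption: directory containing scrapy.cfg
        check=True,                        # fail the task if the crawl exits non-zero
    )

Other variants often used for the same problem are wrapping CrawlerProcess in billiard.Process (Celery's fork of multiprocessing) or driving a CrawlerRunner with the crochet library.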

