
Problem Description

I'm running Scrapy from a script, using the Crochet library to block on the asynchronous code. Now I'm trying to dump the logs to a file, but for some reason they are being redirected to STDOUT instead. My gut suspicion is the Crochet library, but so far I have no clues.

  1. How do I debug this kind of problem? Please share your debugging tips.
  2. How do I fix it so the logs are written to the file?
import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})

@crochet.wait_for(timeout=None)
def crawl():
    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(BlogSpider)
    return deferred

log.configure_logging(settings={'LOG_FILE': 'my.log'})
logging.info("Starting...")
crawl()

Tags: python, logging, scrapy, twisted

Solution

The only change needed is to also pass the log settings to CrawlerRunner. When a crawl starts, Scrapy (re)installs its root log handler based on the crawler's own settings, so if LOG_FILE is missing from the settings given to the runner, logging falls back to the console even though configure_logging was called earlier:

import logging

import crochet
import scrapy
from scrapy import crawler
from scrapy.utils import log

crochet.setup()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for idx, title in enumerate(response.css('.post-header>h2')):
            if idx == 10:
                return
            logging.info({'title': title.css('a ::text').get()})

@crochet.wait_for(timeout=None)
def crawl():
    # Pass the same settings to CrawlerRunner: Scrapy re-installs its
    # root log handler from these settings when the crawl starts, so
    # LOG_FILE must be present here as well, not only in configure_logging.
    runner = crawler.CrawlerRunner(settings=log_settings)
    deferred = runner.crawl(BlogSpider)
    return deferred

log_settings = {'LOG_FILE': 'my.log'}
log.configure_logging(settings=log_settings)
logging.info("Starting...")
crawl()
