How to pass arguments to process.crawl in Python's Scrapy

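Problem description

I am trying to use Python's Scrapy library with IBM Cloud Functions. I want to pass arguments to my spider through process.crawl. How can I do that?

My code is as follows: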

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, make=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        init_url = "http://quotes.toscrape.com/"
        self.start_urls = [init_url]

    def parse(self, response):
        title = response.css(".header-box > div a::text").extract_first()
        yield {"title": title}


process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})

process.crawl(MySpider)  # <-- see "Explanation" below
process.start()

Explanation

I found here that it can be done like this:

process.crawl(MySpider, make="Audi")

But when I try this, my editor reports an error:

expected type 'dict' got 'str' instead

What am I doing wrong?

Update

I am using the Scrapy spider inside an IBM Cloud Function, so my code looks like this:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, make=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        print("Make {}".format(make))

    def parse(self, response):
        title = response.css(".header-box > div a::text").extract_first()
        yield {"title": title}


def main(params):
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})

    # My editor shows a warning on the next line: expected type 'dict' got 'str' instead
    process.crawl(MySpider, make="Audi")
    process.start()
    return {"joke": "Some shit joke"}

When I run main({}) from the console, I get the following error:

2018-06-22 08:42:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
Traceback (most recent call last):
  File "", line 1, in
  File "./main.py", line 30, in main
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 1260, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 1240, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/boris/Projects/IBM-cloud/virtualenv/lib/python3.6/site-packages/twisted/internet/base.py", line 748, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Tags: python, scrapy

Solution

