Setting Scrapy start_urls from outside the class

Problem description

I am new to Scrapy. How can I pass start_urls in from outside the class? I tried setting start_urls outside the class, but it didn't work. What I want to do is create a file whose name comes from a dictionary (search_dict) key, and use the corresponding value as the start URL for Scrapy:

import csv

import scrapy
from scrapy.crawler import CrawlerProcess

search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
               'my code': 'https://www.google.com/search?q=stackoverflow+questions',
               'test': 'https://www.google.com/search?q="test"'}

class googlescraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    #start_urls= ??
    found_items = []
    def parse(self, response):
        item = dict()
        #code here
        self.found_items.append(item)

for k, v in search_dict.items():
    with open(k, 'w') as csvfile:
        process = CrawlerProcess({
            'DOWNLOAD_DELAY': 0,
            'LOG_LEVEL': 'DEBUG',
            'DOWNLOAD_TIMEOUT': 30,
        })
        process.crawl(googlescraper)  # scrapy spider needs start url
        spider = next(iter(process.crawlers)).spider
        process.start()
        dict_writer = csv.DictWriter(csvfile, keys)
        dict_writer.writeheader()
        dict_writer.writerows(spider.found_items)

Tags: python, scrapy

Solution

The Scrapy documentation has an example of instantiating a spider with arguments: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments

You can pass your URLs like this:

# ...

class GoogleScraper(scrapy.Spider):
    # ...
    # Omit `start_urls` in the class definition
    # ...

process.crawl(GoogleScraper, start_urls=[
    # The URL you want to pass here
])

The kwargs in the process.crawl() call are passed on to the spider's initializer. The default initializer copies any kwargs onto the spider as instance attributes, so this is equivalent to setting start_urls in the class definition.
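
For illustration, here is a minimal sketch of that mechanism, instantiating the spider directly rather than through CrawlerProcess (the URL is just the first entry from search_dict):

import scrapy

class GoogleScraper(scrapy.Spider):
    name = "test"
    # No start_urls in the class body

# The default Spider.__init__ copies keyword arguments onto the instance
spider = GoogleScraper(start_urls=['https://www.google.com/search?q=hello+world'])
print(spider.start_urls)  # ['https://www.google.com/search?q=hello+world']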

The relevant section in the Scrapy docs: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess.crawl
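
Putting the pieces together for the CSV-per-search-term goal in the question, here is a minimal sketch, not a definitive implementation. The filename keyword argument and the extraction logic in parse() are assumptions for illustration; closed() is Scrapy's standard per-spider shutdown hook. Since CrawlerProcess.start() blocks and the underlying Twisted reactor cannot be restarted, all crawls are scheduled before the single start() call rather than starting the process inside the loop:

import csv

import scrapy
from scrapy.crawler import CrawlerProcess

search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
               'my code': 'https://www.google.com/search?q=stackoverflow+questions',
               'test': 'https://www.google.com/search?q="test"'}

class GoogleScraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    # No start_urls here; each crawl receives its own via process.crawl()

    def __init__(self, *args, filename=None, **kwargs):
        super().__init__(*args, **kwargs)  # copies start_urls onto the instance
        self.filename = filename  # hypothetical kwarg: output file for this crawl
        self.found_items = []

    def parse(self, response):
        # Placeholder extraction logic; replace with your real parsing
        item = {'url': response.url, 'title': response.css('title::text').get()}
        self.found_items.append(item)

    def closed(self, reason):
        # Runs when this spider finishes; write its items to its own CSV file
        if not self.found_items:
            return
        with open(self.filename, 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=self.found_items[0].keys())
            writer.writeheader()
            writer.writerows(self.found_items)

process = CrawlerProcess({
    'DOWNLOAD_DELAY': 0,
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOAD_TIMEOUT': 30,
})

# Schedule one crawl per search term, then start the reactor exactly once:
# CrawlerProcess.start() blocks, and the reactor cannot be restarted.
for k, v in search_dict.items():
    process.crawl(GoogleScraper, start_urls=[v], filename=k)

process.start()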

