Scrapy CrawlSpider not working

Problem Description

I'm using Scrapy and trying to crawl an entire website with a spider, but I get no output at all in my terminal.

PS: I'm running Scrapy from a script, not via the scrapy command-line tool.

Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'website.com'
    allowed_domains = ['website.com']
    start_urls = ['http://www.website.com']

    rules = (
        # Follow every internal link (allow everything under '/'),
        # skipping links that match 'subsection.php'.
        # No callback is given, so follow defaults to True.
        Rule(LinkExtractor(allow=(r'/', ), deny=(r'subsection\.php', ))),
    )

    def parse_item(self, response):
        print(response.css('title').extract())




process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()

Tags: python

Solution


You're missing the callback argument.

Simply change

Rule(LinkExtractor(allow=(r'/', ), deny=(r'subsection\.php', ))),

to

Rule(LinkExtractor(allow=(r'/', ), deny=(r'subsection\.php', )), callback='parse_item'),
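One caveat worth noting from the Rule documentation: follow defaults to True only when callback is None. Once you add a callback, follow defaults to False, so if you want the spider to keep crawling beyond the matched pages, pass follow=True as well.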

As the CrawlSpider documentation explains, the callback belongs on the Rule itself; without it, the rule only follows the extracted links and parse_item is never invoked, which is why your terminal stayed empty.
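Putting it together, here is a minimal runnable sketch of the corrected script (keeping the placeholder website.com domain and the settings from the question):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'website.com'
    allowed_domains = ['website.com']       # placeholder domain from the question
    start_urls = ['http://www.website.com']

    rules = (
        # Follow every internal link except 'subsection.php' pages
        # and hand each response to parse_item.
        Rule(
            LinkExtractor(allow=(r'/', ), deny=(r'subsection\.php', )),
            callback='parse_item',
            follow=True,  # with a callback set, follow defaults to False
        ),
    )

    def parse_item(self, response):
        # Note: CrawlSpider uses parse() internally, so the callback
        # must not be named 'parse'.
        print(response.css('title').extract())

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes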
