Why aren't my Scrapy rules (LinkExtractor) working?

Problem description

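This is my first question on Stack Overflow. I have started using Python at work to scrape data, and I have been using Scrapy for these tasks. I am trying to set up a scraper for a government website, but I get no output. Originally I had three rules in my rules variable, but my JSON file came out empty. The code looks fine to me, and I can't tell what is wrong. I'd appreciate any insight you can share. Have a nice day.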

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class DirSpider(CrawlSpider):
    name = 'di7'
    allowed_domains = ['transparencia.gob.sv']
    start_urls = ['https://www.transparencia.gob.sv/categories/2']
    

    rules = (
        Rule(LinkExtractor(restrict_css=".filtrable a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_css="a:nth-of-type(19)"), callback='parse_item', follow=True),
    )

    def parse(self, response):
        
        items = {}
        
        css_selector = response.css(".spaced .align-justify")
        
        for bureaucrat in css_selector:
            name = bureaucrat.css(".medium-11 a::text").extract_first()
            charge = bureaucrat.css(".medium-12::text").extract_first()
            status = bureaucrat.css(".medium-11 .text-mutted::text").extract_first()
            institution = response.css("small::text").extract()
            
            items['name'] = name
            items['charge'] = charge
            items['status'] = status
            items['institution'] = institution
            
            yield(items)

Tags: python, web-scraping, scrapy

Solution


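Try renaming your parse function to parse_item. CrawlSpider uses its own parse method internally to apply the rules, so overriding parse in your subclass stops the rules from ever running. Your rules also already name parse_item as their callback, and no method with that name existed in the spider.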

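def parse_item(self, response):
    # Named parse_item to match callback='parse_item' in the rules;
    # CrawlSpider keeps its own parse() for applying those rules.

    # Each matching block is expected to describe one official.
    css_selector = response.css(".spaced .align-justify")

    for bureaucrat in css_selector:
        # Build a fresh dict per record so yielded items do not share state.
        items = {}

        name = bureaucrat.css(".medium-11 a::text").extract_first()
        charge = bureaucrat.css(".medium-12::text").extract_first()
        status = bureaucrat.css(".medium-11 .text-mutted::text").extract_first()
        # Page-level lookup: a list of every <small> text node on the page.
        institution = response.css("small::text").extract()

        items['name'] = name
        items['charge'] = charge
        items['status'] = status
        items['institution'] = institution

        yield items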
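
To see why the rename matters, here is a minimal toy model in plain Python. It is not Scrapy's actual code: MiniCrawlSpider, is_institution_link, RULES and the page dictionary are all hypothetical stand-ins, and the only assumption carried over from Scrapy is that the base class's parse method is where the rules get applied.

```python
def is_institution_link(link):
    # Hypothetical stand-in for a LinkExtractor: decides which links to follow.
    return link.startswith("/institutions/")


# Each rule is (link predicate, callback name), loosely mirroring Rule(...).
RULES = ((is_institution_link, "parse_item"),)


class MiniCrawlSpider:
    """Toy rule-driven base class (a simplified model, not Scrapy itself)."""

    rules = ()

    def parse(self, page):
        # The base parse() is the rule engine: it matches links against the
        # rules and hands each match to the callback named in the rule.
        results = []
        for link in page["links"]:
            for matches, callback_name in self.rules:
                if matches(link):
                    callback = getattr(self, callback_name)
                    results.extend(callback(link))
                    break
        return results


class BrokenSpider(MiniCrawlSpider):
    rules = RULES

    def parse(self, page):
        # Overriding parse() replaces the rule engine, so no link is followed.
        # Only the start page is inspected, and it holds no records.
        return list(page["records"])


class FixedSpider(MiniCrawlSpider):
    rules = RULES

    def parse_item(self, link):
        # The base parse() stays intact and calls this for every matched link.
        yield {"scraped_from": link}


page = {"links": ["/institutions/1", "/about", "/institutions/2"], "records": []}
broken_output = BrokenSpider().parse(page)
fixed_output = FixedSpider().parse(page)
```

In this model, broken_output is empty, much like the empty JSON file in the question, while fixed_output holds one item per matched link.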
