Scrapy scraping duplicated data

Problem description

I'm new to Python, but I need to do some scraping for work-related reasons. After a week or two with Scrapy I was finally happy with it, except that the code below outputs each row of data five times instead of once. Here is an example (using only 1 URL):

import scrapy


class AdamSmithInstituteSpider(scrapy.Spider):
    name = "adamsmithinstitute"
    start_urls = [
        "https://www.adamsmith.org/research?month=March-2018",
    ]

    def parse(self, response):
        for quote in response.css('div.post'):
            yield {
                'author': response.css('post-author::text').extract(),
                'pdfs': response.selector.xpath('//div/div/div/div/div/div/div/p/a').extract(),
            }

        next_page = response.css("div.older a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The output in the scrapy shell is as follows:

2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:12 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}
2018-07-10 11:53:13 [scrapy.core.scraper] DEBUG: Scraped from <200 
https://www.adamsmith.org/research?month=March-2018>
{'author': [], 'pdfs': ['<a target="_blank" href="/s/Immigration1.pdf">Read 
the full paper</a>']}

I know the data is messy, since I only want the href link, but I'm familiar enough with it to figure that out myself. What I can't figure out is the duplication.

Any help would be greatly appreciated.

Tags: python, web-scraping, scrapy

Solution


Scrapy's built-in duplicate filter only deduplicates requests (URLs). When the scraped items themselves are duplicated, it is up to the developer to drop the duplicates.

Scrapy documents a duplicates-filter pipeline; see the Item Pipeline section of the Scrapy documentation.

In that example they treat the `id` field as unique; in your case the unique field may be different.
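A minimal sketch of that pipeline, adapted to your item shape: it keys on the `pdfs` field instead of the docs' `id` (an assumption, since `pdfs` is the value repeated in your output; pick whatever field is actually unique for your items):

```python
# Sketch of the Scrapy docs' duplicates-filter pipeline, keyed on 'pdfs'
# (assumption) rather than the docs' 'id' field.
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs standalone; in a real Scrapy project
    # this import always succeeds.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()  # keys of items already passed through

    def process_item(self, item, spider):
        # Lists are unhashable, so convert the field to a tuple for the set.
        key = tuple(item.get('pdfs', []))
        if key in self.seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.seen.add(key)
        return item
```

Enable it in `settings.py` with something like `ITEM_PIPELINES = {'myproject.pipelines.DuplicatesPipeline': 300}` (the module path `myproject.pipelines` is a placeholder for your project's layout).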

