Python and Scrapy missing some links

Problem Description

Hi guys, I'm new to Scrapy and a bit confused about how parsing works. First, with the following code and a single parse callback, I get 20 results:

def start_requests(self):
    url = 'https://news.detik.com/indeks/'
    date = '01/01/2020'

    assert type(url) is str 
    assert type(date) is str 

    max_page = 1
    
    for page in range(1, max_page + 1):
        complete_url = url + str(page) + '?date=' + date
        yield scrapy.Request(complete_url, self.parse)    

def parse(self, response):
    links = response.xpath('//*[@id="indeks-container"]/article//h3/a/@href').extract()
    
    for link in links:
        yield {'link' : link}

However, if I add a second parse method, the results drop to 18:

def start_requests(self):
    url = 'https://news.detik.com/indeks/'
    date = '01/01/2020'

    assert type(url) is str 
    assert type(date) is str 

    max_page = 1
    
    for page in range(1, max_page + 1):
        complete_url = url + str(page) + '?date=' + date
        yield scrapy.Request(complete_url, self.parse)    

def parse(self, response):
    links = response.xpath('//*[@id="indeks-container"]/article//h3/a/@href').extract()
    
    for link in links:
        yield scrapy.Request(link, callback=self.parse_content)

def parse_content(self, response):

    yield {
        'title': response.css('.detail__title::text').get().strip()
    }

My question is: what is going on here?

Tags: python, web-scraping, scrapy

Solution


The second case has two exceptions.

The titles of the following two articles are inside the .detail__text class rather than .detail__title:

"Bandara Halim Pastikan Penumpang Dapat Kompensasi 100 Persen" and "Kunjungi Posko Banjir Kemang, Anies Pastikan Kebutuhan Warga Terpenuhi"

For those two pages, response.css('.detail__title::text').get() returns None, so calling .strip() on it raises an AttributeError inside parse_content and no item is yielded — which is why you end up with 18 results instead of 20.
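One way to handle both layouts is to try the primary selector first and fall back to the alternative one. Below is a minimal sketch; the pick_title helper is a hypothetical name introduced here for illustration, and the selector strings are the ones from the question and answer above.

```python
def pick_title(candidates):
    """Return the first non-empty, stripped string from candidates,
    or None if every candidate is missing or blank."""
    for text in candidates:
        if text and text.strip():
            return text.strip()
    return None


# Inside the spider, parse_content would then look roughly like this
# (untested sketch, assuming Scrapy's response.css API):
#
# def parse_content(self, response):
#     title = pick_title([
#         response.css('.detail__title::text').get(),
#         response.css('.detail__text::text').get(),  # fallback layout
#     ])
#     yield {'title': title}
```

Because pick_title tolerates None, the two odd pages no longer crash the callback, and all 20 items are yielded.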

