Scrapy following links and extracting data

Problem description

Basically, I want to recursively follow each link and extract data from it. The problem I'm running into is that `finalTag` is a list of strings containing the URL of every link I want to visit. If I pass it to a Scrapy request as `request = scrapy.Request(finalTag, callback=self.parse2)`, Scrapy complains that it is not a string. I tried wrapping it in `str(finalTag)`, but that didn't work either.

Here is the code I have so far:

import scrapy

class RecursionSpider(scrapy.Spider):
    name = 'recursion'
    start_urls = ['https://www.jobbank.gc.ca/jobsearch/?fper=L&fper=P&fter=S&page=2&sort=M&fprov=ON#article-32316546']

    def parse(self, response):
        tag = response.xpath('//a/@href').extract()
        # Extracting all the href tags to the new links
        tag = [str for str in tag if '/jobsearch/jobposting' in str]
        finalTag = ['https://www.jobbank.gc.ca' + tag for tag in tag]
        request = scrapy.Request(finalTag, callback=self.parse2)
        yield request

    def parse2(self, response):
        # Extracting the content using css selectors
        vacancy = response.xpath('//span/text()').extract()
        status = response.css('span.attribute-value::text').extract()
        duration = response.css('span.attribute-value::text').extract()
        jobID = response.css('span::text').extract()

        vacancy = [str for str in vacancy if "Vacanc" in str]
        vacancy.remove('Vacancies')

        del status[1]

        del duration[0]
        duration = map(lambda s: s.strip(), duration)

        jobID = [str for str in jobID if "146" in str]

        for item in zip(vacancy, status, duration, jobID):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'vacancy' : item[0],
                'status' : item[1],
                'duration' : item[2],
                'job id' : item[3]
            }

            # Yield or give the scraped info to scrapy
            yield scraped_info

Tags: python, scrapy

Solution


`scrapy.Request()` takes a single URL string, not a list. If `finalTag` is a list, iterate over its members and call `scrapy.Request()` once per URL:

for url in finalTag:
    request = scrapy.Request(url, callback=self.parse2)
    yield request
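The same filter-join-iterate pattern can be sketched without Scrapy itself; the yielded URL string stands in for the `scrapy.Request(url, callback=self.parse2)` object the spider would yield (`build_requests` is a hypothetical helper name, not part of the original code):

```python
def build_requests(hrefs, base="https://www.jobbank.gc.ca"):
    # Keep only the job-posting links, as in the spider above.
    paths = [h for h in hrefs if "/jobsearch/jobposting" in h]
    # scrapy.Request accepts exactly one URL string, so we yield
    # one request per URL instead of passing the whole list at once.
    for path in paths:
        yield base + path  # in the spider: yield scrapy.Request(base + path, callback=self.parse2)

urls = list(build_requests(["/jobsearch/jobposting/123", "/about"]))
# urls == ["https://www.jobbank.gc.ca/jobsearch/jobposting/123"]
```

As a side note, on newer Scrapy versions (1.4+) you can also use `response.follow(url, callback=self.parse2)`, which accepts relative URLs directly, so the manual join with the base URL is not needed.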
