How to yield a nested item, populated from multiple pages/parsers, only once in Scrapy

Problem description

New to Scrapy here, trying to figure out how to yield an item only once, after it has been fully populated.

I'm trying to scrape a site that publishes swimmers' times; its pages are structured like this:

swimmer search page -> swimmer page with the list of swimming styles -> style page with all the times for that style

I'm using a set of nested items:

Swimmer -> [Styles] -> [Times]

The goal is to output one JSON dict per Swimmer, containing all the styles he/she has swum and all the times recorded in each style.
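
For reference, a minimal sketch of what the nested item definitions in tempusopen/items.py might look like; the actual file is not shown in the question, so the field names below are assumptions inferred from how the spider uses them:

import scrapy


class Time(scrapy.Item):
    # One recorded time on a style page (assumed fields, based on spider usage)
    time = scrapy.Field()
    date = scrapy.Field()
    competition = scrapy.Field()


class Style(scrapy.Item):
    # One swimming style and the list of Time items recorded for it
    name = scrapy.Field()
    times = scrapy.Field()


class Swimmer(scrapy.Item):
    # Top-level item: one per swimmer, holding a list of Style items
    id = scrapy.Field()
    name = scrapy.Field()
    team = scrapy.Field()
    status = scrapy.Field()
    born = scrapy.Field()
    license = scrapy.Field()
    styles = scrapy.Field()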

My problem is that this code yields the same item over and over again instead of just once (as I want and expect), which wastes a lot of work.

import scrapy
from tempusopen.settings import swimmers
from tempusopen.items import Swimmer, Time, Style
from scrapy.loader import ItemLoader


class BaseUrl(scrapy.Item):
    url = scrapy.Field()


class RecordsSpider(scrapy.Spider):
    name = 'records_spider'
    allowed_domains = ['www.tempusopen.fi']

    def start_requests(self):
        base_url = ('https://www.tempusopen.fi/index.php?r=swimmer/index&Swimmer[first_name]={firstname}&'
                    'Swimmer[last_name]={lastname}&Swimmer[searchChoice]=1&Swimmer[swimmer_club]={team}&'
                    'Swimmer[class]=1&Swimmer[is_active]=1')
        urls = [base_url.format_map(x) for x in swimmers]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Follow the first search result to the swimmer's detail page, handing
        # over a fresh (still empty) Swimmer item via meta.
        swimmer_url = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').get()
        swimmer = Swimmer()
        return response.follow(swimmer_url, callback=self.parse_records, meta={'swimmer': swimmer})

    def parse_records(self, response):
        # Fill in the swimmer's details, then follow every distance/style link,
        # passing the same Swimmer item along via meta each time.
        distances = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').extract()
        swimmer_data = response.xpath("//div[@class='container main']//"
                                      "div[@class='sixteen columns']//text()").extract()
        swimmer = response.meta['swimmer']
        swimmer['id'] = response.url.split('=')[-1]
        swimmer['name'] = swimmer_data[1]
        swimmer['team'] = swimmer_data[5].strip('\n').split(',')[0].split(':')[1].strip()
        swimmer['status'] = swimmer_data[5].split(',')[1:]
        swimmer_data = response.xpath("//div[@class='container main']//"
                                      "div[@class='clearfix']//div[@class='six columns']"
                                      "//text()").extract()
        swimmer['born'] = swimmer_data[2].strip('\n')
        swimmer['license'] = swimmer_data[4].strip('\n')
        for url in distances:
            yield response.follow(url, callback=self.parse_distances, meta={'swimmer': swimmer})

    def parse_distances(self, response):
        # Parse one style page: append a Style (with all of its Times) to the
        # shared Swimmer item that travels along in response.meta.
        swimmer = response.meta['swimmer']
        style = Style()
        if 'styles' not in swimmer:
            swimmer['styles'] = []
        distance = response.xpath('//div[@class="container main"]//p/text()').extract_first()
        distance = distance.strip().split('\n')[1]
        style['name'] = distance
        style['times'] = []
        swimmer['styles'].append(style)
        table_rows = response.xpath("//table//tbody/tr")
        for tr in table_rows:
            t = Time()
            t['time'] = tr.xpath("./td[1]/text()").extract_first().strip("\n\xa0")
            t['date'] = tr.xpath("./td[4]/text()").extract_first()
            t['competition'] = tr.xpath("./td[5]/text()").extract_first()
            style['times'].append(t)
        # Returned once per style page, so the same swimmer item is emitted
        # again and again while it is still being filled in.
        return swimmer

I suppose the problem is using yield and return the "right" way, but I can't figure out the correct solution.

With yield only, I can see each swimmer's JSON dict slowly being filled in. I also tried return swimmer at the end and yield everywhere else, but that just gives me the same JSON dict for each swimmer repeated endlessly...

The desired behavior is that the code outputs a single JSON dict for each swimmer I search for in the start_urls list (rather than the huge number of dicts I'm getting now).
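
For context, the JSON output is presumably produced with Scrapy's built-in feed export, i.e. a command along these lines (the output file name here is just an example, it is not given in the question):

scrapy crawl records_spider -o swimmers.json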

Any help is appreciated, thanks!

P.S. You can pull the code from here.

As an example of the swimmers dict, you can use this:

swimmers = [
# Add here the list of swimmers you are interested in scraping
{'firstname': 'Lenni', 'lastname': 'Parpola', 'team': ''},
{'firstname': 'Tommi', 'lastname': 'Kangas', 'team': ''},
]

Tags: python, scrapy

Solution


You have two options:

  1. Implement parse_distances with Scrapy Inline Requests (see the first sketch below).
  2. Don't change anything in the spider; instead, create a custom pipelines.py and work with each swimmer inside its process_item method, merging the new dict details as they arrive. You will then be able to emit all the results once, at the end of the spider (see the second sketch below).
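
A minimal sketch of option 1, assuming the scrapy-inline-requests package is installed (pip install scrapy-inline-requests). The @inline_requests decorator lets parse_records wait for each style page inline, so the fully populated swimmer is yielded exactly once; the extraction logic is copied from the question, the rest is illustrative rather than a drop-in final implementation:

import scrapy
from inline_requests import inline_requests
from tempusopen.items import Swimmer, Time, Style


class RecordsSpider(scrapy.Spider):
    # ... name, allowed_domains, start_requests and parse stay exactly as they are ...

    @inline_requests
    def parse_records(self, response):
        swimmer = response.meta['swimmer']
        swimmer['id'] = response.url.split('=')[-1]
        # ... fill in name, team, status, born and license exactly as in the question ...
        swimmer['styles'] = []
        distances = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').extract()
        for url in distances:
            # Yielding a Request inside an @inline_requests callback suspends the
            # generator until the response arrives, so every style page is
            # processed right here instead of in a separate callback.
            dist_response = yield response.follow(url)
            style = Style()
            distance = dist_response.xpath('//div[@class="container main"]//p/text()').extract_first()
            style['name'] = distance.strip().split('\n')[1]
            style['times'] = []
            for tr in dist_response.xpath("//table//tbody/tr"):
                t = Time()
                t['time'] = tr.xpath("./td[1]/text()").extract_first().strip("\n\xa0")
                t['date'] = tr.xpath("./td[4]/text()").extract_first()
                t['competition'] = tr.xpath("./td[5]/text()").extract_first()
                style['times'].append(t)
            swimmer['styles'].append(style)
        # Only now is the item complete, so it is yielded once per swimmer.
        yield swimmer

And a minimal sketch of option 2, which leaves the spider untouched. The class name SwimmersPipeline, the output file swimmers.json and the settings entry below are all assumptions. Because every duplicate yield refers to the same Swimmer object (it is passed around via meta and mutated in place), keeping one entry per id and writing everything out in close_spider produces exactly one complete record per swimmer:

import json

from scrapy.utils.serialize import ScrapyJSONEncoder


class SwimmersPipeline:
    def open_spider(self, spider):
        # One slot per swimmer id; repeated yields of the same item simply overwrite it.
        self.swimmers = {}

    def process_item(self, item, spider):
        self.swimmers[item['id']] = item
        return item

    def close_spider(self, spider):
        # By the time the spider closes, every swimmer item has been fully
        # populated, so each one is written out exactly once.
        # ScrapyJSONEncoder knows how to serialize nested Item objects.
        with open('swimmers.json', 'w') as f:
            json.dump(list(self.swimmers.values()), f, cls=ScrapyJSONEncoder, indent=2)

The pipeline then needs to be enabled in settings.py, for example:

ITEM_PIPELINES = {
    'tempusopen.pipelines.SwimmersPipeline': 300,
}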
