Create one output CSV file for each start_url

Problem description

I am very new to Scrapy spiders. I am extracting data from www.goodsearch.com.

The code below works fine without any errors:

import scrapy
class GoodsearchSpider(scrapy.Spider):
    name = 'goodsearch'
    allowed_domains = ['www.goodsearch.com']  # domains only, no URL paths
    start_urls = ['http://www.goodsearch.com/coupons/macys/']
    #start_urls = ['https://www.goodsearch.com/coupons/shutterfly']


    def parse(self, response):
        listings = response.xpath('//*[@id="main"]/div[1]/ul/li')
        for listing in listings:
            coupon_description = listing.xpath('.//span[@class="title"]/text()').extract_first()
            coupon_discount1 = listing.xpath('.//div[@class="top"]/text()').extract_first()
            coupon_discount2 = listing.xpath('.//div[@class="bottom"]/text()').extract_first()
            coupon_type = listing.xpath('.//div[@class="title"]/text()').extract_first()
            coupon_expire_data = listing.xpath('.//p/text()').extract_first()
            coupon_code = listing.xpath('.//div[1]/div[4]/span[1]/text()').extract_first()
            coupon_used_times = listing.xpath('.//span[@class="click-count"]/text()').extract_first()

            # Guard against missing values so the concatenation never raises TypeError
            if coupon_discount1 is None:
                coupon_discount1 = ""
            if coupon_discount2 is None:
                coupon_discount2 = ""
            coupon_discount = coupon_discount1 + coupon_discount2


            yield {'Coupon Description': coupon_description,
                   'Coupon Discount': coupon_discount,
                   'Coupon Type': coupon_type,
                   'Coupon Expire Data': coupon_expire_data,
                   'Coupon Code': coupon_code,
                   'Coupon Used Times': coupon_used_times,
                   }
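
For reference, with a single start_url the items yielded above can already be written to one CSV file from the command line using Scrapy's built-in feed export (the output filename here is just an example):

    scrapy crawl goodsearch -o macys_coupons.csv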

If I pass a single start_url, the spider works fine, as in the code above. Now I want to read the links from an input CSV file instead.

Input CSV file (goodsearch_inputfile.csv):

link,store_name    
https://www.goodsearch.com/coupons/amazon,Amazon
https://www.goodsearch.com/coupons/target,Target
https://www.goodsearch.com/coupons/bestbuy,BestBuy

We need to generate one output CSV file per link, which means three output files for the input above. Can you help me solve this?

I added the code below, but it did not work:

    with open("goodsearch/input_file/goodsearch_inputfile.csv", "r") as links:
        next(links)  # skip the header row (link,store_name)
        for link in links:
            url, name = link.strip().split(',')
            start_urls = [url.strip()]  # only rebinds a local name; the spider never sees it
            fname = name
            print('----------------------------------')
            print('name: {}, start urls: {}'.format(fname, start_urls))
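
The snippet above never changes what the spider crawls: start_urls is read once when the crawl starts, so rebinding it later has no effect. A minimal sketch of the usual alternative, overriding start_requests() to read the input file (the path and column names are the ones shown above):

import csv
import scrapy

class GoodsearchSpider(scrapy.Spider):
    name = 'goodsearch'
    allowed_domains = ['www.goodsearch.com']

    def start_requests(self):
        # Yield one request per row of the input file, carrying the
        # store name along in the request meta for later use.
        with open('goodsearch/input_file/goodsearch_inputfile.csv') as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row['link'].strip(),
                                     callback=self.parse,
                                     meta={'store_name': row['store_name'].strip()})

    def parse(self, response):
        store_name = response.meta['store_name']
        for listing in response.xpath('//*[@id="main"]/div[1]/ul/li'):
            yield {'Store Name': store_name,
                   'Coupon Description': listing.xpath('.//span[@class="title"]/text()').extract_first(),
                   # ... remaining fields exactly as in the working spider above ...
                   }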

Tags: python, web-scraping, web-crawler, scrapy-spider

Solution


Why not load the CSV file into a numpy ndarray instead of splitting lines from a plain file by hand? You should take advantage of the structure the CSV file already has.
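
As a sketch of that suggestion, the input file could be loaded with numpy like this (assuming numpy is installed; the path is the one from the question):

import numpy as np

# skiprows=1 drops the header line; each remaining row is [link, store_name]
rows = np.loadtxt('goodsearch/input_file/goodsearch_inputfile.csv',
                  dtype=str, delimiter=',', skiprows=1)
for url, store_name in rows:
    print(url, store_name)

To actually get one output CSV per link, one common approach (an illustration, not part of the original answer) is an item pipeline that routes each item to a CsvItemExporter keyed by the store name attached in parse():

from scrapy.exporters import CsvItemExporter

class PerStoreCsvPipeline(object):
    """Write each item to a CSV file named after its store."""

    def open_spider(self, spider):
        self.exporters = {}  # store name -> (file, exporter)

    def process_item(self, item, spider):
        store = item.pop('Store Name', 'unknown')
        if store not in self.exporters:
            f = open('%s.csv' % store, 'wb')  # CsvItemExporter expects a binary file
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.exporters[store] = (f, exporter)
        self.exporters[store][1].export_item(item)
        return item

    def close_spider(self, spider):
        for f, exporter in self.exporters.values():
            exporter.finish_exporting()
            f.close()

Enable the pipeline in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.PerStoreCsvPipeline': 300} (the module path is hypothetical).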

