Scraping products with different items using scrapy / sitemaps

Problem description

Greetings community members,

I am working on a project in Python 3 on a Jupyter Notebook, and I want to scrape products using a sitemap. So far I have fetched the URLs from the sitemap into a dataframe called df. Next I want to scrape each URL with XPath. Here is the structure of my code:


from scrapy.spiders import SitemapSpider

class ProductSpider(SitemapSpider):

    name = 'ProductSpider'

    sitemap_urls = ['the sitemap']
    sitemap_rules = [('products', 'parse_product')]

    def parse_product(self, response):
        print('parse_product url:', response.url)

        yield {'url': response.url}



from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file as CSV
    'FEED_FORMAT': 'csv',     # csv
    'FEED_URI': 'urls.csv', #
})
c.crawl(ProductSpider)
c.start()


import pandas as pd

df=pd.read_csv('urls.csv')
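As a side note, the shape of that CSV follows from the spider: since `parse_product` yields `{'url': response.url}`, the exported `urls.csv` has a single `url` column. A minimal sketch of reading it back with the standard library (sample rows stand in for the real file, which is not shown in the post):

```python
import csv
import io

# Hedged sketch: sample data stands in for the real urls.csv produced by
# the feed export above, which has one `url` column.
sample_csv = (
    "url\n"
    "https://example.com/products/1\n"
    "https://example.com/products/2\n"
)
with io.StringIO(sample_csv) as f:
    urls = [row["url"] for row in csv.DictReader(f)]

# `urls` now holds the same list that df.url holds after pd.read_csv('urls.csv')
```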

So far so good: I have my dataframe called df. Now I want to crawl each URL in the dataframe to scrape the products.

import scrapy



class MySpider(scrapy.Spider):
    name = 'MySpider'
    custom_settings = {'FEED_URI': 'products.csv'}
    allowed_domains = ['website']

    first_page = [df.url[1]]
    all_others = [df.url[i] for i in range(2, 400)]
    start_urls = first_page + all_others

    def parse(self, response):
        for product in response.xpath("//div[@class='container']"):
            yield {
                'title': product.xpath("//div[@class='title clearfix']/h1/text()").extract(),
                'img': product.xpath("//div/a/img[@class='image-slide-link']").extract(),
                'description': product.xpath("//div/p/ul/text()").extract(),
                'composition': product.xpath("//div[@class='c-product__content']/text()").extract(),
                'Id': product.xpath("//div[@class='product-json']/@Id/text()").extract(),
            }
            for item in zip(title, image, description, composition):
                scraped_info = {
                    'title': item[0],
                    'image': item[1],
                    'description': item[2],
                    'composition': [item[3]],
                }
                yield scraped_info


d = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',    
    'FEED_URI': 'products.csv', 
    })


d.crawl(MySpider)
d.start()

But in the end I get an empty products.csv file. Can anyone help me find a solution? I have been stuck on this problem for 3 weeks!

Thanks in advance!

Tags: python, web-scraping, scrapy

Solution


Problem solved: I made a silly mistake. I shouldn't have added the item lines, since all my information was stored in the zip and never made it into the CSV file. I just had to delete a few lines (more precisely, the item lines) and yield the scraped information directly.
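The fix described above can be sketched with Scrapy stubbed out by a plain generator: the parse callback yields the extracted fields in one dict, so the feed exporter writes them straight to products.csv. In the real spider the values come from `product.xpath(...).extract()`; the field values below are made-up stand-ins.

```python
# Minimal sketch of the fix, with Scrapy stubbed out: yield the extracted
# fields directly instead of re-zipping them into a second dict. In the
# original code, zip() referenced names (title, image, ...) that were never
# defined; Scrapy only logs exceptions raised inside a callback, which is
# why products.csv came out empty.

def parse(extracted):
    """Stand-in for MySpider.parse; `extracted` plays the role of the
    values pulled out with product.xpath(...).extract()."""
    yield {
        'title': extracted['title'],
        'img': extracted['img'],
        'description': extracted['description'],
        'composition': extracted['composition'],
    }

items = list(parse({
    'title': 'Blue T-shirt',
    'img': 'shirt.jpg',
    'description': 'A plain tee',
    'composition': '100% cotton',
}))
```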

