Trying to scrape a website with scrapy - not receiving any data

Problem description

For an assignment, I have to fetch data from the Kaercher web shop. The data I need to get is the product title, description and price.

In addition, I need to be able to fetch multiple product types (high-pressure cleaners, vacuum cleaners, etc.) with the same script, so I will probably need a .csv keyword file or something similar to adjust the URLs accordingly.

However, I can't seem to get any data with my current script.

Note: I will include my whole file structure and current code. I have only changed the actual spider file (karcher_crawler.py); the other files are mostly defaults.

My folder structure:

scrapy_karcher/ # Project root directory
    scrapy.cfg  # Contains the configuration information to deploy the spider
    scrapy_karcher/ # Project's python module
        __init__.py
        items.py      # Describes the definition of each item that we’re scraping
        middlewares.py  # Project middlewares
        pipelines.py     # Project pipelines file
        settings.py      # Project settings file
        spiders/         # All the spider code goes into this directory
            __init__.py
            karcher_crawler.py # The spider

My "karcher_crawler.py" code:

import scrapy

class KarcherCrawlerSpider(scrapy.Spider):
    name = 'karcher_crawler'
    start_urls = [
        'https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html'
    ]

    def parse(self, response):
        # Each search-result tile on the page
        products = response.xpath("//div[@class='col-sm-3 col-xs-6 fg-products-item']")
        # iterating over search results
        for product in products:
            # Defining the XPaths
            XPATH_PRODUCT_NAME = ".//div[@class='product-info']//h6[contains(@class,'product-label')]//a/text()"
            XPATH_PRODUCT_PRICE = ".//div[@class='product-info']//div[@class='product-price']//span/text()"
            XPATH_PRODUCT_DESCRIPTION = ".//div[@class='product-info']//div[@class='product-description']//a/text()"

            raw_product_name = product.xpath(XPATH_PRODUCT_NAME).extract()
            raw_product_price = product.xpath(XPATH_PRODUCT_PRICE).extract()
            raw_product_description = product.xpath(XPATH_PRODUCT_DESCRIPTION).extract()

            # cleaning the data
            product_name = ''.join(raw_product_name).strip() if raw_product_name else None
            product_price = ''.join(raw_product_price).strip() if raw_product_price else None
            product_description = ''.join(raw_product_description).strip() if raw_product_description else None

            yield {
                'product_name': product_name,
                'product_price': product_price,
                'product_description': product_description,
            }

My "items.py" code:

import scrapy


class ScrapyKarcherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

My "pipelines.py" code:

class ScrapyKarcherPipeline(object):
    def process_item(self, item, spider):
        return item

My "scrapy.cfg" code:

[settings]
default = scrapy_karcher.settings

[deploy]
#url = http://localhost:6800/
project = scrapy_karcher

Tags: scrapy

Solution


I managed to request the required data with the following code:

Spider file (.py):

import scrapy
from krc.items import KrcItem
import json

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def parse(self, response):
        data = json.loads(response.text)
        for company in data.get('products', []):
            # Create a fresh item for every product; reusing a single item
            # instance across iterations can overwrite fields of items that
            # are still being processed
            item = KrcItem()
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"]
            yield item
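The question also asks for a way to scrape several product categories with one script. A minimal sketch of a URL builder, assuming the numeric segment of the API URL above (20035386) is a category code and that other categories follow the same pattern; this is an unverified assumption about the kaercher.com API, not documented behaviour:

```python
# Hypothetical helper: build one API URL per category code and page.
# The path layout and query parameters are copied from the solution URL
# above; the category codes themselves would come from the Network tab
# (or a .csv keyword file, as the question suggests).
def build_search_urls(category_codes, pages=1, size=8, isocode="nl-NL"):
    base = "https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial"
    urls = []
    for code in category_codes:
        for page in range(1, pages + 1):
            urls.append(f"{base}/{code}?page={page}&size={size}&isocode={isocode}")
    return urls
```

The resulting list could then be assigned to `start_urls` in the spider, or yielded as requests from a `start_requests` method.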

Items file (.py):

import scrapy


class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()

Thanks to @gangabass, I managed to find the URLs containing the data I needed to extract. You can find them in the "Network" tab when you inspect the web page (press F12, or right-click anywhere and choose Inspect).
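The parse logic above can be sanity-checked offline against a hand-made payload. The field names (`products`, `id`, `name`, `description`, `priceFormatted`) are taken from the spider code above; the sample data below is invented purely for illustration:

```python
import json

def extract_products(payload_text):
    # Mirrors the spider's parse(): pull the same fields out of the JSON
    # body. Field names come from the solution code, not official API docs.
    data = json.loads(payload_text)
    items = []
    for company in data.get("products", []):
        items.append({
            "productid": company["id"],
            "name": company["name"],
            "description": company["description"],
            "price": company["priceFormatted"],
        })
    return items

# Invented sample payload in the shape the spider expects
sample = json.dumps({"products": [
    {"id": 1, "name": "K 5", "description": "Pressure washer",
     "priceFormatted": "249,00"},
]})
```

Feeding `sample` to `extract_products` returns a one-element list with those four fields, which matches what the spider yields per product.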

