Scrapy script doesn't get all the products on an ecommerce site page

Problem description

I am still new to Scrapy, and I am trying to scrape a product list page (from nordstromrack.com). I have used almost the same script on other sites without issues, but on this site it only gets me the first 6 items of the page I want to scrape. I tried different pages on the same site with the same result (e.g. https://www.nordstromrack.com/shop/Women/Clothing/Activewear). I also used scrapy shell to see if I would get different results, but I still only get the first 6 links, and the page source only shows 6 links as well, so I am a little confused about what the problem is. From what I have researched, the site apparently uses a script to load 6 products at a time. However, most of the answers I found say to look for the next page and scrape that (but that only applies to pages with infinite scrolling). Other solutions mention using Selenium, but I guess it would have the same issue, because the links I want to follow are not in the page source. Does anyone know how to solve this problem? Greatly appreciated.

Here is my script for this page: https://www.nordstromrack.com/clearance/Men/Accessories?priceRanges%5B%5D=100-200&sort=most_popular

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy import Spider
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from scrapy.spiders import Spider
from esourcing.items import EsourcingItem
from scrapy.http import Request
import re


class NrtestSpider(CrawlSpider):
    name = 'nrtest'
    allowed_domains = ['nordstromrack.com']
    start_urls = (
        'https://www.nordstromrack.com/clearance/Men/Accessories'
        '?priceRanges%5B%5D=100-200&sort=most_popular',
    )

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@class="product-grid"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'reference': response.css('.product-details__style-number::text')[0].extract(),
            'title': response.css('.product-details__title-name::text')[0].extract(),
            'brand': response.css('.product-details__title').xpath('.//text()')[0].extract(),
            'description': response.css('.product-details-section__definition-list').xpath('.//text()').extract(),
            'retail': response.css('.product-details__retail-price').xpath('.//text()')[0].extract(),
            'purchase': response.css('.product-details__sale').xpath('.//text()')[0].extract(),
            'image_urls': response.css('.image-zoom').xpath('.//img/@src')[0].extract(),
            'image_urls_extra': response.css('.product-thumbnail').xpath('.//img/@src').extract(),
            'size': response.css('.sku-option__items').xpath(
                './/*[@class="sku-item sku-item--available sku-item--text"]//text()').extract()
        }

Tags: python-3.x, scrapy

Solution


It's mostly because the data you are looking for is rendered with JavaScript and AJAX requests.

If you open the web inspector and click on the 2nd page of your URL, you can see an XHR request being made that fetches all of the product data as JSON (JavaScript later unpacks it into what you see on the page).

https://www.nordstromrack.com/clearance/Men/Accessories?page=2&sort=most_popular

Gives this AJAX: https://www.nordstromrack.com/api/search2/catalog/search?context=clearance&department=Accessories&division=Men&includeFlash=false&includePersistent=true&limit=99&page=2&sort=most_popular&experiment=control

Which has all of the product data in JSON format.

All you need to do in Scrapy is request the AJAX URL above instead of the URL you were scraping initially, then load the response with the json module and parse it as a normal dictionary:

$ scrapy shell "https://www.nordstromrack.com/api/search2/catalog/search?context=clearance&department=Accessories&division=Men&includeFlash=false&includePersistent=true&limit=99&page=2&sort=most_popular&experiment=control"
> import json
> data = json.loads(response.body_as_unicode())
> data['_embedded']  # your products
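
If you would rather do this from a spider than from the shell, a minimal sketch might look like the one below. It assumes only the API URL captured above and the _embedded key seen in the shell session; the spider name, the page range, and the way the products are yielded are placeholders to adjust once you have inspected the real JSON structure.

import json

import scrapy


class NrApiSpider(scrapy.Spider):
    """Sketch: query the search API directly instead of the HTML grid pages."""
    name = 'nr_api'
    allowed_domains = ['nordstromrack.com']

    # API URL captured from the network tab (see above); {page} is filled in below.
    api_url = (
        'https://www.nordstromrack.com/api/search2/catalog/search'
        '?context=clearance&department=Accessories&division=Men'
        '&includeFlash=false&includePersistent=true'
        '&limit=99&page={page}&sort=most_popular&experiment=control'
    )

    def start_requests(self):
        # Fetch the first couple of pages as an illustration; widen the range
        # (or derive it from the response) for a full crawl.
        for page in (1, 2):
            yield scrapy.Request(self.api_url.format(page=page))

    def parse(self, response):
        # response.text is the same as body_as_unicode() in older Scrapy versions.
        data = json.loads(response.text)
        # '_embedded' holds the product data (see the shell session above).
        # Its internal layout is not reproduced here -- explore it in the shell
        # and drill down to the actual product list before building your items.
        yield {'source_url': response.url, 'products': data.get('_embedded', {})}

Querying the API directly also means you can play with the limit and page parameters that are already visible in the URL, instead of crawling the HTML grid six products at a time.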
