python-3.x - Scrapy script doesn't get all the products on an ecommerce site page
Problem description
I am still new to Scrapy, and I am trying to scrape a product list page on nordstromrack.com. I have used almost the same script on other sites without issues, but on this site it only returns the first 6 items of the page I want to scrape. I tried other pages on the same site (e.g. https://www.nordstromrack.com/shop/Women/Clothing/Activewear) with the same result. In scrapy shell I also get only the first 6 links, and the page source shows only 6 links as well, so I am not sure what the problem is. From what I have read, the site may be using a script to load 6 products at a time. However, most of the answers I found say to find the next-page link and scrape that, which only applies to pages with infinite scrolling. Other answers suggest Selenium, but I suspect it would have the same issue, because the links I want to follow are not in the page source. Does anyone know how to solve this? Any help is greatly appreciated.
Here is my script for this page: https://www.nordstromrack.com/clearance/Men/Accessories?priceRanges%5B%5D=100-200&sort=most_popular
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy import Spider
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from esourcing.items import EsourcingItem
from scrapy.http import Request
import re


class NrtestSpider(CrawlSpider):
    name = 'nrtest'
    allowed_domains = ['nordstromrack.com']
    start_urls = ('https://www.nordstromrack.com/clearance/Men/Accessories?priceRanges%5B%5D=100-200&sort=most_popular',)

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@class="product-grid"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'reference': response.css('.product-details__style-number::text')[0].extract(),
            'title': response.css('.product-details__title-name::text')[0].extract(),
            'brand': response.css('.product-details__title').xpath('.//text()')[0].extract(),
            'description': response.css('.product-details-section__definition-list').xpath('.//text()').extract(),
            'retail': response.css('.product-details__retail-price').xpath('.//text()')[0].extract(),
            'purchase': response.css('.product-details__sale').xpath('.//text()')[0].extract(),
            'image_urls': response.css('.image-zoom').xpath('.//img/@src')[0].extract(),
            'image_urls_extra': response.css('.product-thumbnail').xpath('.//img/@src').extract(),
            'size': response.css('.sku-option__items').xpath('.//*[@class="sku-item sku-item--available sku-item--text"]//text()').extract()
        }
Solution
This is most likely because the data you are looking for is rendered with JavaScript from AJAX requests.
If you open the web inspector and click through to the 2nd page of your URL, you can see an XHR request being made to fetch all of the product data as JSON (JavaScript later unpacks it into what you see on the page).
https://www.nordstromrack.com/clearance/Men/Accessories?page=2&sort=most_popular
The response to that XHR request has all of the product data in JSON format.
All you need to do in Scrapy is request the AJAX URL seen in the inspector instead of the URL you are scraping initially, then load the response with the json module and parse it as a normal dictionary:
$ scrapy shell "https://www.nordstromrack.com/api/search2/catalog/search?context=clearance&department=Accessories&division=Men&includeFlash=false&includePersistent=true&limit=99&page=2&sort=most_popular&experiment=control"
> import json
> data = json.loads(response.text)  # response.text replaces the deprecated response.body_as_unicode()
> data['_embedded'] # your products
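To go beyond a single page, the same idea can be wrapped in two small helpers. This is only a sketch under some assumptions: the endpoint and query parameters are copied from the XHR URL above, the helper names `build_search_url` and `extract_embedded` are made up for illustration, and the layout of the data inside `_embedded` is not shown here, so the code returns it untouched instead of guessing at its keys.

```python
import json
from urllib.parse import urlencode

# Endpoint and query parameters copied from the XHR URL shown above.
API_ENDPOINT = "https://www.nordstromrack.com/api/search2/catalog/search"
BASE_PARAMS = {
    "context": "clearance",
    "department": "Accessories",
    "division": "Men",
    "includeFlash": "false",
    "includePersistent": "true",
    "limit": "99",
    "sort": "most_popular",
    "experiment": "control",
}


def build_search_url(page):
    """Return the search API URL for a page number (page=2 is the URL above)."""
    params = dict(BASE_PARAMS, page=str(page))
    return API_ENDPOINT + "?" + urlencode(params)


def extract_embedded(body):
    """Decode a JSON response body and return its '_embedded' entry.

    Returns None when the key is missing, which a caller could treat as
    a signal to stop paginating (an assumption, not confirmed site behaviour).
    """
    data = json.loads(body)
    return data.get("_embedded")
```

Inside a spider callback you could then call `extract_embedded(response.text)` and, while it returns something, yield `scrapy.Request(build_search_url(next_page), callback=self.parse)` for the following page. Check the actual JSON in the inspector first to see how the products are keyed.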