Scrapy JSON returns the same content

Problem description

I developed this Scrapy spider, which loops over 10 pages of one site. The loop runs fine, and the log shows the correct list of URLs:

2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=8&ajax=true>
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=9&ajax=true>

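But the result is always the same: every request returns the content of page 1. I tested the URLs in the shell and they work, and they also work from a browser. The problem only occurs when running the Scrapy crawler. I tried both start_urls and building the URLs in a method, and the problem is always the same.

Any ideas?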

import scrapy
import json
import time
import datetime
from staging.items import StagingItem

# Date of the crawl, formatted as YYYY-MM-DD
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d')

class QuotesSpider(scrapy.Spider):
    name = "lazada2"

    def start_requests(self):
        # range(1, 10) yields pages 1 through 9
        for i in range(1, 10):
            urls = 'https://www.lazada.vn/trang-diem/?page=%s&ajax=true' % i
            yield scrapy.Request(url=urls, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        # Value of mainInfo.page from the JSON (not used further here)
        next_page = data['mainInfo']['page']
        for product in data['mods']['listItems']:
            item = StagingItem()
            item['collector_sku'] = product['name']
            # The promo price is optional; fall back to an empty string
            item['collector_price_promo'] = product.get('originalPrice', '')
            item['collector_retailer'] = 'Lazada'
            item['collector_url'] = product['productUrl']
            item['collector_photo_url'] = product['image']
            item['collector_brand'] = product['brandName']
            item['collector_quantity'] = 'NA'
            item['collector_category'] = 'Makeup'
            item['collector_price'] = product['price']
            item['collector_timestamp'] = timestamp
            item['collector_local_id'] = ''
            item['collector_location_id'] = ''
            item['collector_location_name'] = ''
            item['collector_vendor_id'] = ''
            item['collector_vendor_name'] = ''
            yield item
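
One way to confirm the symptom from inside parse is to compare the page requested in the URL with the page the JSON reports. The sketch below assumes that data['mainInfo']['page'] holds the page number the server actually served, which the question does not state explicitly. The helper names and the sample payload are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def requested_page(url):
    """Return the value of the `page` query parameter as a string, or None."""
    values = parse_qs(urlparse(url).query).get("page")
    return values[0] if values else None


def page_matches(url, data):
    """True if the page reported in the JSON equals the page requested in the URL."""
    reported = data.get("mainInfo", {}).get("page")
    # Compare as strings because the JSON type of `page` is not known here
    return reported is not None and str(reported) == requested_page(url)


# Hypothetical payload shaped like the response used in parse()
sample = {"mainInfo": {"page": 1}, "mods": {"listItems": []}}
print(page_matches("https://www.lazada.vn/trang-diem/?page=8&ajax=true", sample))  # False
print(page_matches("https://www.lazada.vn/trang-diem/?page=1&ajax=true", sample))  # True
```

Inside parse, calling page_matches(response.url, data) and logging a warning when it returns False would show which requests were answered with the wrong page.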

Tags: json, scrapy

Solution

Send cookies and headers with each request. The snippet below goes inside the for loop of start_requests:

            headers = {
                "content-type": "application/json",
                "authority": "www.lazada.vn",
                "scheme": "https",
                "Accept-Language": "en-SG,en;q=0.9,en-US;q=0.8,zh-CN;q=0.7,zh;q=0.6,vi;q=0.5,fr;q=0.4",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
                "Accept": "*/*",
                "Path": "/trang-diem/?page=%s" % i,
                "Referer": "https://www.lazada.vn/trang-diem/?page=%s&ajax=true" % i,
                "accept-encoding": "gzip, deflate, br"
            }
            cookies = {
                "cookie": "_uab_collina=153864259681792402093714; _bl_uid=qpj7jm4CuXhcUk26er9n7hnhyRqd; t_fv=1538642596635; t_uid=mbei2vPUviVx0oPB6KjX1uVgASJvw7dA; lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; lzd_sid=1d8bf18519bb7fd8fb661ac558726c4d; _tb_token_=58e7f715a30eb; cna=O5A8FGGivzcCAXNPwzeoH+5y; hng=VN|vi|VND|704; userLanguageML=vi; cto_lwid=c9ad6486-acac-465f-ab05-6e0b3744d1dc; _ga=GA1.2.1435138343.1538642600; _gid=GA1.2.19901051.1538642600; cto_axid=zGni0uxNaRyv441RxQNq7EZ_LS8xiGmL; JSESSIONID=85306FF3F7612F91677FC6ED978B42E1; isg=BJ6eL8eUSXz4CZ0YqjCefDlu7zTqVCYsGgm5Z0gmm-DyaztFsOyk6OZNZi9CoFrx"
            }
            body = "?ajax=true&page=%s" % i
            urls = "https://www.lazada.vn/trang-diem/?ajax=true&page=%s" % i
            yield scrapy.Request(url=urls, body=body, cookies=cookies, headers=headers, callback=self.parse)
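
Scrapy's Request accepts cookies as a dict mapping cookie names to values (or as a list of dicts), whereas the snippet above passes the whole browser string under a single "cookie" key. If the intent is to send each cookie individually, a helper along these lines could split the raw string first. This is a minimal sketch, not part of the original answer. The function name parse_cookie_header is hypothetical, and the sample string reuses three cookies from the answer.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


raw = "t_fv=1538642596635; hng=VN|vi|VND|704; userLanguageML=vi"
cookies = parse_cookie_header(raw)
print(cookies["hng"])  # VN|vi|VND|704
```

The resulting dict can then be passed as cookies=parse_cookie_header(raw) to scrapy.Request in place of the single-key dict.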
