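json - Scrapy JSON returns the same content
Problem description
I wrote this Scrapy spider with a loop that fetches 10 pages from one site. The loop runs fine, and the log shows the correct list of URLs: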
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=8&ajax=true>
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=9&ajax=true>
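But the result is always the same: every response contains the content of page 1. The URLs work when I test them in the Scrapy shell, and they also work from a browser. The problem only appears when running the Scrapy crawler. I tried using start_urls and building the URLs in a method, always with the same issue.
Any ideas?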
import scrapy
import json
import urllib
import time
import datetime
import re
from re import sub
from decimal import Decimal
#from prod.items import ProdItem
from staging.items import StagingItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Date of the crawl, stamped on every item
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d')


class QuotesSpider(scrapy.Spider):
    name = "lazada2"

    def start_requests(self):
        # range(1, 10) yields pages 1 through 9
        for i in range(1, 10):
            urls = 'https://www.lazada.vn/trang-diem/?page=%s&ajax=true' % i
            yield scrapy.Request(url=urls, callback=self.parse)

    def parse(self, response):
        # The ajax endpoint returns JSON, not HTML
        data = json.loads(response.body)
        next_page = data['mainInfo']['page']
        for product in data['mods']['listItems']:
            item = StagingItem()
            item['collector_sku'] = product['name']
            if 'originalPrice' in product:
                item['collector_price_promo'] = product['originalPrice']
            else:
                item['collector_price_promo'] = ''
            item['collector_retailer'] = 'Lazada'
            item['collector_url'] = product['productUrl']
            item['collector_photo_url'] = product['image']
            item['collector_brand'] = product['brandName']
            item['collector_quantity'] = 'NA'
            item['collector_category'] = 'Makeup'
            item['collector_price'] = product['price']
            item['collector_timestamp'] = timestamp
            item['collector_local_id'] = ''
            item['collector_location_id'] = ''
            item['collector_location_name'] = ''
            item['collector_vendor_id'] = ''
            item['collector_vendor_name'] = ''
            yield item
Solution
Send cookies and headers with each request:
    def start_requests(self):
        for i in range(1, 10):
            # Headers copied from a browser request. "authority", "scheme" and
            # "Path" appear to mirror the pseudo-headers shown in the browser's
            # network panel.
            headers = {
                "content-type": "application/json",
                "authority": "www.lazada.vn",
                "scheme": "https",
                "Accept-Language": "en-SG,en;q=0.9,en-US;q=0.8,zh-CN;q=0.7,zh;q=0.6,vi;q=0.5,fr;q=0.4",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
                "Accept": "*/*",
                "Path": "/trang-diem/?page=%s" % i,
                "Referer": "https://www.lazada.vn/trang-diem/?page=%s&ajax=true" % i,
                "accept-encoding": "gzip, deflate, br"
            }
            # Cookie string copied from a browser session
            cookies = {
                "cookie": "_uab_collina=153864259681792402093714; _bl_uid=qpj7jm4CuXhcUk26er9n7hnhyRqd; t_fv=1538642596635; t_uid=mbei2vPUviVx0oPB6KjX1uVgASJvw7dA; lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; lzd_sid=1d8bf18519bb7fd8fb661ac558726c4d; _tb_token_=58e7f715a30eb; cna=O5A8FGGivzcCAXNPwzeoH+5y; hng=VN|vi|VND|704; userLanguageML=vi; cto_lwid=c9ad6486-acac-465f-ab05-6e0b3744d1dc; _ga=GA1.2.1435138343.1538642600; _gid=GA1.2.19901051.1538642600; cto_axid=zGni0uxNaRyv441RxQNq7EZ_LS8xiGmL; JSESSIONID=85306FF3F7612F91677FC6ED978B42E1; isg=BJ6eL8eUSXz4CZ0YqjCefDlu7zTqVCYsGgm5Z0gmm-DyaztFsOyk6OZNZi9CoFrx"
            }
            # The query string is also sent as the request body
            body = "?ajax=true&page=%s" % i
            urls = "https://www.lazada.vn/trang-diem/?ajax=true&page=%s" % i
            yield scrapy.Request(url=urls, body=body, cookies=cookies, headers=headers, callback=self.parse)
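Scrapy documents the `cookies` argument of `scrapy.Request` as a dict (or list of dicts) keyed by cookie name, so a raw `Cookie` header string copied from the browser can be split into name/value pairs first. A minimal sketch of that step, assuming the string uses the usual `name=value; name=value` layout; `parse_cookie_header` is a hypothetical helper name, not part of Scrapy:

```python
def parse_cookie_header(raw_cookie: str) -> dict:
    """Split a raw Cookie header value into a name -> value dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # A few of the cookies from the answer above, shortened for readability
    raw = "lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; hng=VN|vi|VND|704; userLanguageML=vi"
    print(parse_cookie_header(raw))
```

The resulting dict could then be passed as `cookies=parse_cookie_header(raw)` when building the `scrapy.Request`. Cookies copied from a browser session typically expire, so the values shown here will likely need to be refreshed.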