Scraping an AJAX page with Python, but after a few requests it returns fake values

Problem description

I am trying to scrape Shopee item information with Python.

Take https://shopee.com.my/All%20in%20one%20pc%20Intel%20core%20I3/I5/I7%20Dual-core%208G%20RAM%20128%20gb%20SSD%20With%20optical%20drive%20CD%2023.8%20Inch%20computer%20Office%20Desktop%20All-in-one%20desktop%20Support%20WiFi-i.206039726.5859069631 as an example.

Since the page loads its data via AJAX, I am trying to pull the item information from https://shopee.com.my/api/v2/item/get?itemid=5859069631&shopid=206039726 instead.

After a few requests, I noticed that the JSON it returns contains fake values (for example, the item's actual rating is 4.78 but it returns 0.24).

I tried to work around this by changing the headers and rotating the IP/proxy, but it still doesn't work.

Is there any other way to solve this?


import json
import requests
from fake_useragent import UserAgent

# referer (the product page URL) and get_daili() (returns a proxy address)
# are defined elsewhere in my script.

def get_info(url, itemurl):

    requests.adapters.DEFAULT_RETRIES = 5
    s = requests.session()      # note: this session is created but never used below
    s.keep_alive = False

    shop_info = None
    try:
        fake_ua = UserAgent()
        headers = {'User-Agent': fake_ua.random,
                   'Accept': '*/*',
                   'Accept-Language': 'en-US,en;q=0.5',
                   'X-Shopee-Language': 'en',
                   'X-Requested-With': 'XMLHttpRequest',
                   'X-API-SOURCE': 'pc',
                   'If-None-Match-': '55b03-2ff39563c299cbdc937f8ab86ef322ab',
                   'DNT': '1',
                   'Referer': referer,
                   'TE': 'Trailers'}
        ip = get_daili()
        proxies = {"https": ip}  # was {"proxies": {"https": ip}}, a shape requests silently ignores

        # fetch the shop endpoint
        response = requests.get(url, headers=headers, proxies=proxies, verify=False)
        #response = requests.request("GET", url, headers=headers, data=payload)
        if response.status_code == 200:
            shop_info = response.json()
    except requests.ConnectionError as e:
        print(f'{url} error', e.args)

    shop_name = shop_info['data']['name']
    followers = shop_info['data']['follower_count']
    ratinggood = shop_info['data']['rating_good']
    ratingbad = shop_info['data']['rating_bad']
    ratingnormal = shop_info['data']['rating_normal']

    item_info = None
    try:
        fake_ua = UserAgent()
        headers = {'User-Agent': fake_ua.random,
                   'Accept': '*/*',
                   'Accept-Language': 'en-US,en;q=0.5',
                   'X-Shopee-Language': 'en',
                   'X-Requested-With': 'XMLHttpRequest',
                   'X-API-SOURCE': 'pc',
                   'If-None-Match-': '55b03-2ff39563c299cbdc937f8ab86ef322ab',
                   'DNT': '1',
                   'Referer': referer,
                   'TE': 'Trailers'}
        ip = get_daili()
        proxies = {"https": ip}

        # fetch the item endpoint
        response = requests.get(itemurl, headers=headers, proxies=proxies, verify=False)
        #response = requests.request("GET", itemurl, headers=headers, data=payload)
        if response.status_code == 200:
            item_info = response.json()
    except requests.ConnectionError as e:
        print(f'{itemurl} error', e.args)

    #print(json.dumps(item_info, indent=4))

    print(itemurl)
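
(As a side note, unrelated to the fake values: the retry/proxy setup above can be expressed as one configured session instead of patching DEFAULT_RETRIES and leaving s unused. This is only a sketch; make_session is a hypothetical helper and the proxy address is a placeholder.)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(proxy_ip, user_agent):
    # one session that retries transient failures and carries headers/proxy on every call
    session = requests.Session()
    retry = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
    session.mount('https://', HTTPAdapter(max_retries=retry))
    session.headers.update({'User-Agent': user_agent,
                            'Accept': '*/*',
                            'X-Requested-With': 'XMLHttpRequest'})
    session.proxies = {'https': proxy_ip}   # correct shape for the proxies mapping
    return session

# usage (hypothetical proxy address):
# s = make_session('http://1.2.3.4:8080', UserAgent().random)
# item_info = s.get('https://shopee.com.my/api/v2/item/get?itemid=5859069631&shopid=206039726',
#                   verify=False, timeout=10).json()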

Tags: html, ajax, web-scraping, python-requests

Solution


I think this is an algorithm that protects their API service, so that people cannot abuse their servers.

Maybe you can try using Python with Selenium and Selenium Wire to capture the data, as in the sketch below.
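
A minimal sketch of that approach: let a real browser load the product page, then read the /api/v2/item/get response the page itself received. The product URL and API path come from the question; the decode step follows the Selenium Wire documentation and has not been tested against Shopee.

import json
from seleniumwire import webdriver          # pip install selenium-wire
from seleniumwire.utils import decode

product_url = ('https://shopee.com.my/All%20in%20one%20pc%20Intel%20core%20I3/I5/I7'
               '%20Dual-core%208G%20RAM%20128%20gb%20SSD%20With%20optical%20drive%20CD'
               '%2023.8%20Inch%20computer%20Office%20Desktop%20All-in-one%20desktop'
               '%20Support%20WiFi-i.206039726.5859069631')

driver = webdriver.Chrome()                 # a real browser session runs the site's JavaScript
try:
    driver.get(product_url)
    # wait for the XHR the page itself makes to the item API
    request = driver.wait_for_request('/api/v2/item/get', timeout=30)
    body = decode(request.response.body,
                  request.response.headers.get('Content-Encoding', 'identity'))
    item_info = json.loads(body)
    print(json.dumps(item_info, indent=4))
finally:
    driver.quit()

Because the JSON is captured from the request the page itself made, it should reflect the values actually rendered to a user rather than the obfuscated ones served to direct API callers.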

