Setting a store location for a web-scraping session

Problem description

I am trying to set the store location (Albany, STORE_ID=65defcf2-bc15-490e-a84f-1f13b769cd22) for a web-scraping session using Python requests. Product prices differ between stores, so I want to set the store at the start of the session and scrape only from that store.

My current code attempts to insert the store ID cookie into the request headers. The code successfully extracts prices, but the prices come from random stores (e.g. Whangarei, Kaitaia).

Can anyone point out what is wrong with my code?

The cookie information was taken from Chrome; see the cookie screenshot below.

My code is below, and is also available in this Google Colab Notebook.

import requests
from bs4 import BeautifulSoup as bs
import re

# Regex to parse a) grocery item prices and b) the store the prices come from
dollars_pattern = r'>([0-9][0-9]?)'
cents_pattern = r'>([0-9][0-9])'
Shopname_pattern = r"(PAK'nSAVE)\s[A-Z][a-z]*"

# URLs I want to scrape prices from
baseurl = ['https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli',
           'https://www.paknsaveonline.co.nz/product/5025171_ea_000pns?name=indian-style-tomatoes']

# My attempt to set a cookie in the header that will set the store to Albany (STORE_ID = 65defcf2-bc15-490e-a84f-1f13b769cd22)
header = {
    # Attempt to set the store cookie to Albany
    "Cookie": "STORE_ID_V2=65defcf2-bc15-490e-a84f-1f13b769cd22|False; expires=Sun, 14-Aug-2022 00:41:46 GMT; path=/; secure; HttpOnly; SameSite=None",
    # Set the referer to a made-up name so it is easy to spot in the returned headers
    "Referer": "https://MadeupURL.com",
    # Set a user agent so the request doesn't identify itself as python-requests
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# Establish a single requests session so the same connection (and cookie jar) is reused for every request
with requests.Session() as s:
    for i in baseurl:
        r = s.get(i, headers=header)
        soup = bs(r.content, 'html.parser')
        cents = str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
        dollars = str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
        centsprice = re.findall(cents_pattern, cents)
        dollarsprice = re.findall(dollars_pattern, dollars)
        storetext = soup.find_all("script", attrs={"data-cfasync": "false"})[3].contents[0]
        storename = re.search(Shopname_pattern, storetext)

        prod_url = [i]
        print("Store returned from scraping site:")
        print(storename.group())
        print(dollarsprice, centsprice, prod_url)
        # Check the request headers that were sent, looking for the store cookie
        print(r.request.headers)
        print(" ")

[screenshot]

A previous post on this topic that I thought would work - unfortunately it didn't.

Tags: python, web-scraping, python-requests

Solution
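
One thing worth noting about the question's code: the "Cookie" request header is being built by hand and still contains Set-Cookie attributes (expires, path, secure, HttpOnly, SameSite). Those attributes are only meaningful in the server's Set-Cookie response header; a Cookie request header should carry just name=value pairs, and a cookie sent this way is not tracked by the session's cookie jar either. A cleaner approach is to put the STORE_ID_V2 value into the session's cookie jar so requests sends it automatically with every request. The sketch below assumes the site selects the store from that cookie alone; if it also requires a separate store-selection request, this on its own may not be enough.

import requests
from bs4 import BeautifulSoup as bs

# Store ID for Albany, taken from the question
STORE_ID = "65defcf2-bc15-490e-a84f-1f13b769cd22"

urls = ['https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli',
        'https://www.paknsaveonline.co.nz/product/5025171_ea_000pns?name=indian-style-tomatoes']

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

with requests.Session() as s:
    s.headers.update(header)
    # Register only the name=value pair in the session's cookie jar; requests
    # will attach it to every matching request instead of relying on a
    # hand-built Cookie header that still carries Set-Cookie attributes.
    s.cookies.set("STORE_ID_V2", STORE_ID + "|False",
                  domain="www.paknsaveonline.co.nz", path="/")

    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'html.parser')
        dollars = soup.find('span', {'class': "fs-price-lockup__dollars"})
        cents = soup.find('span', {'class': "fs-price-lockup__cents"})
        print(url)
        print(dollars.get_text(strip=True) if dollars else "?",
              cents.get_text(strip=True) if cents else "?")

If the prices still come back from a different store, the store is probably being fixed server-side; in that case, watching the browser's network tab while switching stores on the site and replaying that request with the same session would be the next thing to try.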

