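python - Setting the store location for a web scraping session
Question
I am trying to use Python requests to set the store location for a web scraping session (store = Albany, STORE_ID = 65defcf2-bc15-490e-a84f-1f13b769cd22). Product prices differ between stores, so I want to set the store at the start of the session and scrape only from that store.
My current code tries to insert the store ID cookie into the request headers. The code extracts prices successfully, but the prices come from random stores (e.g. Whangarei, Kaitaia).
Can anyone point out what is wrong with my code?
The cookie information was taken from Chrome; see the cookie screenshot below.
My code is below and is also available in this Google Colab notebook.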
import requests
from bs4 import BeautifulSoup as bs
import re
# Regexes to parse (a) grocery item prices and (b) the store the prices come from
dollars_pattern = r'>([0-9][0-9]?)'
cents_pattern = r'>([0-9][0-9])'
Shopname_pattern = r"(PAK'nSAVE)\s[A-Z][a-z]*"
# URLs I want to scrape prices from
baseurl = ['https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli',
           'https://www.paknsaveonline.co.nz/product/5025171_ea_000pns?name=indian-style-tomatoes']
# My attempt to set a cookie in the header that specifies the Albany store
# (store ID = 65defcf2-bc15-490e-a84f-1f13b769cd22)
header = {
    # Attempt to set the store cookie to Albany
    "Cookie": "STORE_ID_V2=65defcf2-bc15-490e-a84f-1f13b769cd22|False; expires=Sun, 14-Aug-2022 00:41:46 GMT; path=/; secure; HttpOnly; SameSite=None",
    # Set the referer to a made-up name so it is easy to spot in the returned headers
    "Referer": "https://MadeupURL.com",
    # Set the user agent so the request does not identify itself as requests
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}
# Establish a single requests session object so one connection is reused for every URL
with requests.Session() as s:
    for i in baseurl:
        r = s.get(i, headers=header)
        soup = bs(r.content, 'html.parser')
        # Stringify the matching spans so the regexes can pull out the digits
        cents = str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
        dollars = str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
        centsprice = re.findall(cents_pattern, cents)
        dollarsprice = re.findall(dollars_pattern, dollars)
        # The store name is read from the fourth matching script tag
        storetext = soup.find_all("script", attrs={"data-cfasync": "false"})[3].contents[0]
        storename = re.search(Shopname_pattern, storetext)
        prod_url = [i]
        print("Store returned from scraping site:")
        print(storename.group() if storename else "(store name not found)")
        print(dollarsprice, centsprice, prod_url)
        # Check the headers that were sent with the request for the store cookie
        print(r.request.headers)
        print(" ")
I also tried earlier posts on this topic that I thought would work; unfortunately they did not.
Solution
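One thing that stands out in the question's code: the "Cookie" request header should carry only name=value pairs. Attributes such as expires, path, secure, HttpOnly and SameSite belong to the Set-Cookie response header that the server sends, so pasting the whole string from Chrome into a request header is unlikely to be read as intended. Below is a minimal sketch of one way to put just the store cookie into the session's cookie jar instead. The helper names build_store_session and cookie_header_for are hypothetical, and whether the site picks the store from this cookie alone is an assumption that has not been verified here. The sketch only prepares a request and inspects its Cookie header; it does not contact the site.

```python
import requests

# Values copied from the question. Whether the site selects the store from
# this cookie alone is an assumption, not something verified here.
STORE_COOKIE_NAME = "STORE_ID_V2"
STORE_COOKIE_VALUE = "65defcf2-bc15-490e-a84f-1f13b769cd22|False"
SITE_DOMAIN = "www.paknsaveonline.co.nz"


def build_store_session(cookie_value=STORE_COOKIE_VALUE):
    """Return a Session whose cookie jar holds the store cookie (hypothetical helper)."""
    session = requests.Session()
    # Only the name and value go into the jar; expires/secure/HttpOnly/SameSite
    # are Set-Cookie attributes and are never sent back to the server.
    session.cookies.set(STORE_COOKIE_NAME, cookie_value, domain=SITE_DOMAIN, path="/")
    return session


def cookie_header_for(session, url):
    """Prepare (but do not send) a GET request and return its Cookie header."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


session = build_store_session()
url = "https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli"
# Shows the Cookie header requests would send for this URL, without any network call
print(cookie_header_for(session, url))
```

With a session built this way, each s.get(url) call to that domain sends the cookie automatically, so the "Cookie" entry can be dropped from the headers dictionary. If the prices still come from other stores, the site may be choosing the store by some other means (for example additional cookies set during a store-selection request), which would need to be checked in the browser's network tab.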