web-scraping - web scraping gearbest with python
Problem description
EDITED: I've been trying to pull some data from Gearbest.com about several products, and I'm having real trouble pulling the shipping price. I'm working with requests and BeautifulSoup, and so far I've managed to get the name, link, and price. How can I get the shipping price?
The URLs are:
https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363
I've tried:
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong")
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong").text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text
soup is the return value of this function (url is each product link):
import requests
from bs4 import BeautifulSoup

def get_page(url):
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    soup = BeautifulSoup(client.content, 'lxml')
    return soup
Any ideas what I can do?
Solution
You want soup, not souo. Also, the content returned from the request seems to differ from what I see on the page.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363',
        'https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # or: soup.find("strong", class_="goodsIntro_shippingCost").text
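For the actual price, there appears to be a dynamic feed of prices in the network tab, where the value is stored under actualFee. So perhaps the shipping cost is updated dynamically based on location.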
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        # The endpoint returns JSON; the fee sits in the first shipping method
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])
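The lookup r['data']['shippingMethodList'][0]['actualFee'] raises an error if any key is missing or the list is empty. As a rough sketch (not part of the original answer), the helpers below show one way to guard that lookup and to assemble the query string from a dict instead of a hand-written URL. The names build_shipping_url and extract_actual_fee are hypothetical, the sample payload is made up to mirror the keys used above, and it is only an assumption that the server accepts a reduced set of query parameters.

```python
from urllib.parse import urlencode

# Endpoint copied from the URLs in the answer above.
SHIPPING_ENDPOINT = "https://www.gearbest.com/goods/goods-shipping"


def build_shipping_url(good_sn, good_price, country_code="GB", wh_code="1433363"):
    """Build a goods-shipping URL from a handful of query parameters.

    Assumption: only a subset of the parameters seen in the answer is
    sent here; the real endpoint may require the full set.
    """
    params = {
        "goodSn": good_sn,
        "countryCode": country_code,
        "realWhCode": wh_code,
        "goodPrice": good_price,
        "num": 1,
    }
    return SHIPPING_ENDPOINT + "?" + urlencode(params)


def extract_actual_fee(payload):
    """Return the first shipping method's actualFee, or None if it is absent."""
    if not isinstance(payload, dict):
        return None
    data = payload.get("data")
    if not isinstance(data, dict):
        return None
    methods = data.get("shippingMethodList")
    if not isinstance(methods, list) or not methods:
        return None
    first = methods[0]
    if not isinstance(first, dict):
        return None
    return first.get("actualFee")


# Hypothetical payload shaped like the keys used in the answer; the value is made up.
sample = {"data": {"shippingMethodList": [{"actualFee": "3.50"}]}}
print(extract_actual_fee(sample))
print(extract_actual_fee({"data": {}}))
print(build_shipping_url("455718101", "1729.99"))
```

In the loop above, extract_actual_fee(r) could stand in for the chained lookup, so a product without shipping data yields None instead of stopping the run.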