Web scraping Gearbest with Python

Problem description

EDITED: I've been trying to pull some data from Gearbest.com about several products, and I'm having real trouble pulling the shipping price. I'm working with requests and BeautifulSoup, and so far I've managed to get the name, link, and price. How can I get the shipping price?

The URLs are:

https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363

I've tried:

shipping = soup.find("span", class_="goodsIntro_attrText").get("strong")
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong").text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text

soup is the return value of this function (the URL passed in is each product link):

import requests
from bs4 import BeautifulSoup

def get_page(url):
    # Send a browser-like User-Agent with the request
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    # Parse the raw response bytes with the lxml parser
    soup = BeautifulSoup(client.content, 'lxml')
    return soup

Any ideas what I can do?

Tags: web-scraping, beautifulsoup, python-requests

Solution


You want to use soup, not souo. Also, the content returned by the request appears to differ from what I see on the page in the browser.

from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363','https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers = {'User-Agent':'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # soup.find("strong", class_="goodsIntro_shippingCost").text
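
Because the HTML returned by requests may differ from the rendered page, select_one can return None, and calling .text on it then raises AttributeError. Below is a minimal sketch of a guarded lookup. The helper name text_or_none and the sample HTML fragment are hypothetical, made up for illustration; only the two class names come from the code above.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical HTML fragment standing in for a product page response.
sample_html = """
<div>
  <span class="goodsIntro_price">$159.99</span>
  <span class="goodsIntro_attrText">
    <strong class="goodsIntro_shippingCost"> $2.50 </strong>
  </span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
price = text_or_none(soup, ".goodsIntro_price")
shipping = text_or_none(soup, ".goodsIntro_shippingCost")
missing = text_or_none(soup, ".not_in_the_page")  # selector with no match
```

With a real response, a None result for the shipping selector would suggest that the value is filled in after page load rather than present in the initial HTML.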

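As for the actual price, the browser's network tab appears to show a dynamic feed of prices, where the value is stored under actualFee. So the shipping cost is perhaps updated dynamically based on location.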

from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers = {'User-Agent':'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])
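
The long query strings above are easier to maintain as a dict of parameters, and the JSON lookup can be guarded in case the shipping list is empty. This is only a sketch under assumptions: the helper names build_shipping_url and first_actual_fee are hypothetical, the dict holds just a few of the parameters from the first URL (the endpoint may well require all of them), and the sample payload imitates only the keys the code above reads.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint path taken from the URLs above.
SHIPPING_ENDPOINT = "https://www.gearbest.com/goods/goods-shipping"


def build_shipping_url(params):
    """Encode a dict of query parameters onto the shipping endpoint."""
    return SHIPPING_ENDPOINT + "?" + urlencode(params)


def first_actual_fee(payload):
    """Return actualFee of the first shipping method, or None if absent."""
    methods = (payload.get("data") or {}).get("shippingMethodList") or []
    return methods[0].get("actualFee") if methods else None


# A few parameters copied from the first URL above (assumption: partial set).
params = {
    "goodSn": "455718101",
    "countryCode": "GB",  # destination country; change to compare fees
    "realWhCode": "1433363",
    "goodPrice": "1729.99",
    "num": "1",
}
url = build_shipping_url(params)

# Hypothetical payload with the same nesting the code above indexes into.
sample_payload = {"data": {"shippingMethodList": [{"actualFee": 3.21}]}}
fee = first_actual_fee(sample_payload)
```

In practice the payload would come from s.get(url, headers=...).json() as in the code above; changing countryCode is one way to check whether the fee really depends on location.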
