Scraping web data with BeautifulSoup - trouble getting the content I need

Problem description

I want to extract the ASIN number of each product on the webpage below. I can extract some of the other elements I need, but not the ASIN. The ASIN is stored in the "data-asin" attribute in Amazon's HTML. I then want to print it the same way as the other elements. Thanks in advance for your help.

import csv
from bs4 import BeautifulSoup
from selenium import webdriver

path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so backslashes are not treated as escapes
driver = webdriver.Chrome(path)

def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.co.uk/s?k={}&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ','+') 
    return template.format(search_term)

url = get_url('Ultrawide monitor')
print(url)

driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('div',{'data-component-type': 's-search-result'})

item = results[0]

atag = item.h2.a
description = atag.text.strip()
url = 'https://www.amazon.co.uk' + atag.get('href')
price_parent = item.find('span', 'a-price')
price = price_parent.find('span', 'a-offscreen').text
rating = item.i.text
review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text

print(description)
print(price)
print(rating)
print(review_count)
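For reference, the "data-asin" value is an attribute of the search-result div itself, so it can be read with `.get()` on each result. A minimal sketch, using a made-up HTML snippet that mirrors the structure the code above searches for:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like one Amazon search result.
html = """
<div data-component-type="s-search-result" data-asin="B08XYZ1234">
  <h2><a href="/dp/B08XYZ1234">Example Ultrawide Monitor</a></h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('div', {'data-component-type': 's-search-result'})

item = results[0]
# The attribute lives on the div, not on a child tag, so .get() reads it directly.
asin = item.get('data-asin')
print(asin)  # B08XYZ1234
```

The same `item.get('data-asin')` call would slot into the loop over `results` alongside the other fields.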

Tags: python, web, web-scraping, beautifulsoup

Solution

Andrej's solution is correct, but with a small change you can fetch the full dataset in one pass and then use json_normalize. This is just an alternative approach, so you can compare.

import math

import requests
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated

url = 'https://api-prod.footballindex.co.uk/football.allTradable24hrchanges'
per_page = 5000
cols = ['id', 'country', 'nationalTeam', 'nationality', 'team', 'price', 'scoreSell', 'penceChange']

# First request tells us the total record count, from which we derive the page count.
print('Gathering page: 1')
payload = {'page': '1', 'per_page': str(per_page), 'sort': 'asc'}
jsonData = requests.get(url, params=payload).json()
total_pages = math.ceil(jsonData['total'] / per_page)

frames = [json_normalize(jsonData['items'])[cols]]

for page in range(2, total_pages + 1):
    print('Gathering page: %s' % page)
    payload = {'page': str(page), 'per_page': str(per_page), 'sort': 'asc'}
    jsonData = requests.get(url, params=payload).json()
    frames.append(json_normalize(jsonData['items'])[cols])

# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead.
df = pd.concat(frames, ignore_index=True, sort=False)
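As a standalone illustration of what json_normalize contributes here: it flattens nested JSON records into a flat DataFrame, turning nested keys into dotted column names. The records below are made up for the example:

```python
from pandas import json_normalize

# Made-up records, loosely shaped like entries in the API's 'items' list.
items = [
    {'id': 'player-a', 'team': 'Team A', 'price': 1.25,
     'nationality': {'country': 'England'}},
    {'id': 'player-b', 'team': 'Team B', 'price': 0.90,
     'nationality': {'country': 'Spain'}},
]

df = json_normalize(items)
# The nested dict becomes a 'nationality.country' column alongside the flat keys.
print(df.columns.tolist())
print(df.shape)  # (2, 4)
```

That flattening is why the solution can select a plain list of column names with `df[cols]` right after normalizing.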
