How to scrape product information from a page

Problem description

I am trying to scrape the technical details table from the product information, but I get an empty list. The page whose table I am trying to scrape is https://www.amazon.com/Hammermill-Letter-Bright-Sheets-113640C/dp/B072FVQNWM/ref=sr_1_6?dchild=1&qid=1633771276&s=office-products&sr=1-6

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://www.amazon.com'
productlinks = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36', 'session': '141-2320098-4829807'}
cookies = {'session': '17ab96bd8ffbe8ca58a78657a918558'}

# Collect the product links from the category listing page.
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar', headers=headers, cookies=cookies)
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('a', class_="a-link-normal s-underline-text s-underline-link-text a-text-normal", href=True):
    productlinks.append(urljoin(base_url, link['href']))

# Visit each product page and read the rows of the technical details table.
results = []
for link in productlinks:
    r = requests.get(link, headers=headers, cookies=cookies)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr'):
            print(tr.text.strip())
            results.append(tr.text.strip())
    except AttributeError:
        # soup.find() returned None: this page has no such table.
        continue
print(results)

Tags: python, web-scraping, beautifulsoup

Solution


This is the output I get:

['ManufacturerAmazon Basics', 'BrandAmazon Basics', 'Item Weight41.6 pounds', 'Product Dimensions18 x 11.8 x 9 inches', 'Item model numberAMZN8RM', 'ColorWhite', 
'Material TypePaper', 'Number of Items8', 'Size8 Reams | 4000 Sheets', 'Sheet Size8.5-x-11-inch', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part NumberAMZN8RM', 'ManufacturerZebra Pen Corporation', 'BrandZebra Pen', 'Item Weight0.336 ounces', 'Product Dimensions1.1 x 6.5 x 7.5 inches', 'Item model number22218', 'Is Discontinued By ManufacturerNo', 'ColorBlack', 'ClosureRetractable', 'Grip TypeRubber', 'Material TypePlastic, Metal, Rubber', 'Number of Items18', 'Size18-Pack', 'Point TypeMedium', 'Line Size1.00 Pen', 'Ink ColorBlack', 'Manufacturer Part Number22218', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.68 ounces', 'Product Dimensions7.8 x 7.1 x 3 inches', 'Item model number142-6', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeSynthetic Rubber Resin', 'Number of Items1', 'Size6 Count', 'Manufacturer Part Number142-6', 'National Stock Number6520-01-356-3964, 5970-01-137-7860, 7530-00-598-7711', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight40 pounds', 'Product Dimensions17.25 x 11.75 x 8.25 inches', 'Item model number113640C', 'Is Discontinued By ManufacturerNo', 'Color8 Ream | 4000 Sheets', 'Cover MaterialPaper', 'Material TypePaper', 'Number of Items8', 'Size8 Ream | 4000 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113640C', 'ManufacturerNewell Rubbermaid Office', 'BrandEXPO', 'Item Weight2.4 ounces', 'Product Dimensions5.5 x 6.25 x 4.02 inches', 'Item model number1884309', 'Is Discontinued By ManufacturerNo', 'ColorAssorted', 'Grip TypeThumb', 'Material TypePlastic', 'Number of Items1', 'Size8-Count', 'Point TypeUltra Fine', 'Line Size0.5mm millimeters', 'Ink ColorMulticolor', 'Tip TypeFine point', 'Manufacturer Part Number1884309', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.06 pounds', 'Product Dimensions0.75 
x 8.9 x 11.4 inches', 'Item model numberTP3854-100', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeLaminate', 'Number of Items1', 'PackagingRetail', 'Size100-Pack', 'Paper FinishGlossy', 'Manufacturer Part NumberTP3854-100', 'ManufacturerScotch', 'BrandScotch', 'Item Weight10.6 ounces', 'Product Dimensions4.2 x 6.4 x 3.05 inches', 'Item model number6122', 'Is Discontinued By ManufacturerNo', 'ColorTransparent', 'Material TypePlastic', 'Number of Items1', 'Size6 Rolls', 'Manufacturer Part Number6122', 'Manufacturer\tGorilla Glue', 'Part Number\t7700104', 'Item Weight1.5 ounces', 'Product Dimensions1.25 x 3.38 x 6.63 inches', 'Item model number7700104', 'Is Discontinued By ManufacturerNo', 'Size1 Pack', 'ColorClear', 'Style1 - Pack', 'PatternSuper Glue', 'Item Package Quantity1', 'Included Components1 bottle glue', 'Batteries Included?No', 'Batteries Required?No', 'Warranty DescriptionNo', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'ManufacturerAimoh', 'BrandAimoh', 'Item Weight1.4 pounds', 'Product Dimensions9.7 x 4.3 x 2.2 inches', 'Item model number34100', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'ClosureSelf-Seal', 'Material TypePaper', 'Size100 Ct.', 'Sheet Size4.125-x-9.5-inch', 'Paper Weight24', 'Paper FinishWove', 
'Manufacturer Part Number34100', 'ManufacturerHP Papers', 'BrandHP Papers', 'Item Weight15 pounds', 'Product Dimensions11 x 8.5 x 6.25 inches', 'Item model number112090', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items1', 'Size3 Ream | 1500 Sheets', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number112090', 'ManufactureriBayam', 'BrandIBayam', 'Item Weight3.84 ounces', 'Product Dimensions6.6 x 6.2 x 0.6 inches', 'Item model number18 Pack', 'Is Discontinued By ManufacturerNo', 'ColorBlack, Grey, Red, Blue, Magenta, Pink, 
Purple, Violet, Pale Yellow, Yellow, Orange, Raw Sienna, Sap Green, C Green, O Green, Lake Blue, Burnt Sienna, Crimson', 'ClosurePush Button', 'Grip TypeContoured', 'Material TypePlastic', 'Number of Items18', 'Size18 Unique Colors', 'Point TypeFine', 'Manufacturer Part Number61', 'ManufacturerAmazon Basics', 'BrandAmazon 
Basics', 'Item Weight6.7 ounces', 'Product Dimensions7.4 x 0.3 x 0.3 inches', 'Item model numberPHB-30', 'ColorYellow', 'Pencil Lead Degree (Hardness)HB', 'Material TypeWood', 'Number of Items30', 'Size30 Count (Pack of 1)', 'Point TypeMedium', 'Manufacturer Part NumberPHB-30', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight15 pounds', 'Product Dimensions11.25 x 8.75 x 6.25 inches', 'Item model number113620', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items3', 'Size3 Ream | 1500 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113620', 'Manufacturer\tiBayam', 'Part Number\t5234', 'Item Weight1.44 ounces', 'Product Dimensions4 x 3 x 0.6 inches', 'Item model number2 Pack', 'ColorPink & Black', 'MaterialFiberglass', 'Item Package Quantity1', 'Plug ProfileSewing', 'Batteries Included?No', 'Batteries Required?No', 'ManufacturerHewlett Packard SOHO Consumables', 'BrandHP Papers', 'Item Weight6 pounds', 'Product Dimensions11 x 8.5 x 12 inches', 'Item model number203000', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'Number of Items1', 'Size1 Ream | 500 Sheets', 'Sheet Size8.5 x 11 inch', 'Brightness Rating97', 'Paper Weight24', 'Paper FinishMatte', 'Manufacturer Part Number203000']

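I simply append every row to the results list and print it. The for loop that reads all the tr elements sits inside a try/except because some of the links in productlinks lead to pages that have no such table.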

[...]
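results = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr'):
            # Drop the blank lines and left-to-right mark (U+200E) between label and value.
            res = "".join(tr.text.strip().split("\n\n\n\u200e"))
            print(res)
            results.append(res)
    except AttributeError:
        # soup.find() returned None: this page has no such table, so skip it.
        continue

print(results)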
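
The rows collected above fuse each label with its value (for example 'ManufacturerAmazon Basics'). If separate fields are wanted, one possible variation is sketched below. It assumes each row holds its label in a th cell and its value in a td cell, which may not hold for every product page, and it checks for a missing table explicitly instead of relying on try/except. The function name parse_tech_specs and the SAMPLE_HTML markup are hypothetical and exist only for illustration; the sketch runs on that inline sample rather than on a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that imitates the assumed table layout.
SAMPLE_HTML = """
<table id="productDetails_techSpec_section_1">
  <tr><th> Manufacturer </th><td> \u200eAmazon Basics </td></tr>
  <tr><th> Item Weight </th><td> \u200e41.6 pounds </td></tr>
</table>
"""


def parse_tech_specs(html):
    """Return {label: value} for the tech-spec table, or {} if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="productDetails_techSpec_section_1")
    if table is None:
        # Some product pages have no such table.
        return {}
    specs = {}
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            # Skip rows that do not follow the assumed th/td layout.
            continue
        label = th.get_text(strip=True)
        # Remove the left-to-right mark (U+200E) that precedes the value.
        value = td.get_text(strip=True).replace("\u200e", "").strip()
        specs[label] = value
    return specs


specs = parse_tech_specs(SAMPLE_HTML)
print(specs)
```

In the loop over productlinks, r.content could be passed to parse_tech_specs and each returned dictionary appended to results, giving one record per product instead of one flat list of strings.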
