Python web scraper returns duplicate listings on each page instead of iterating through the search result pages?

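Problem description

I built the web scraper below, but when I run it, it repeats listings from the search results page. I also can't figure out how to iterate through each search results page (SRP) without getting exactly the same results as the first SRP. Any ideas about what is wrong here?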

url = '''https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?zip=32805&inventorySearchWidgetType=PRICE&maxPrice=42500&maxMileage=50000&showNegotiable=false&sortDir=DESC&sourceContext=carGurusHomePageModel&distance=100&minPrice=0&sortType=PRICE&minMileage=0&sellerTypes=PRIVATE'''
listing_detail_url = 'https://www.cargurus.com/Cars/detailListingJson.action?inventoryListing={}&searchZip=&searchDistance=500&inclusionType=DEFAULT'

import json
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []
for a in soup.select('a[href^="#listing"]'):  # get all listings on the page
    listing_id = a['href'].split('=')[-1]
    json_data = requests.get(listing_detail_url.format(listing_id)).json()   
    listing_title = json_data['listing']['listingTitle']
    vehicle_id = json_data['listing']['id']
    price = json_data['listing']['price']
    make_name = json_data['listing']['makeName']
    model_name = json_data['listing']['modelName']
    mileage = json_data['listing']['mileage']
    # vin_id = json_data['listing']['vin']
    # ... other data

    # vin_id is left out of the tuple because its assignment above is commented out
    data.append((listing_title, vehicle_id, price, make_name, model_name, mileage))

Tags: python, json, web-scraping, pagination, duplicates

Solution


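You can also try this script, which gets car information from the other pages by stepping the offset parameter of the search results URL: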

import requests


page_url = 'https://www.cargurus.com/Cars/searchResults.action?zip=32805&offset={}&maxResults=15&distance=500'

data = []
offset = 0
while True:
    print('Offset {}...'.format(offset))
    json_data = requests.get(page_url.format(offset)).json()

    for listing in json_data:
        listing_title = listing['listingTitle']
        vehicle_id = listing['id']
        price = listing['price']
        make_name = listing['makeName']
        model_name = listing['modelName']
        mileage = listing['mileage']
        # ... other data

        print((listing_title, vehicle_id, price, make_name, model_name, mileage))
        data.append( (listing_title, vehicle_id, price, make_name, model_name, mileage) )

    # A page with a result count other than 15 (the maxResults value) is treated as the last one
    if len(json_data) != 15:
        break

    offset += 15

Prints:

...

('2018 Honda CR-V EX AWD', 273663888, 20875.0, 'Honda', 'CR-V', 36870)
('2019 Ford Ranger Lariat SuperCrew RWD', 277554768, 29995.0, 'Ford', 'Ranger', 4546)
('2015 Ford Edge SEL', 273107810, 9999.0, 'Ford', 'Edge', 99336)
('2020 RAM 1500 Limited Crew Cab 4WD', 279568758, 54895.0, 'RAM', '1500', 1903)
('2014 Volkswagen Passat TDI SE', 268214566, 9498.0, 'Volkswagen', 'Passat', 45235)
Offset 105...
('2017 Chevrolet Silverado 1500 High Country Crew Cab RWD', 273586618, 36500.0, 'Chevrolet', 'Silverado 1500', 27936)
('2017 Volkswagen Tiguan S', 273485901, 12495.0, 'Volkswagen', 'Tiguan', 24824)
('2019 Ford Explorer Limited', 277039894, 30400.0, 'Ford', 'Explorer', 26328)
('2014 Dodge Challenger SXT RWD', 274612168, 10750.0, 'Dodge', 'Challenger', 105362)
('2012 Volkswagen GTI 2.0T 4-Door FWD with Sunroof and Navigation', 277629553, 7500.0, 'Volkswagen', 'GTI', 106911)
('2013 Buick LaCrosse Premium II FWD', 279206632, 4991.0, 'Buick', 'LaCrosse', 169886)
('2017 Toyota RAV4 XLE', 273207166, 17500.0, 'Toyota', 'RAV4', 27197)
('2017 Ford Explorer XLT', 273452570, 21899.0, 'Ford', 'Explorer', 26523)

...
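
If repeated listings still show up across pages, one option is to track the listing ids already seen while paging. The following is only a minimal sketch of that idea, not part of the original answer: `collect_listings`, `fake_fetch_page`, and the `max_pages` cap are hypothetical names introduced for illustration, and the fake data stands in for the real endpoint so the example runs without network access.

```python
def collect_listings(fetch_page, page_size=15, max_pages=100):
    """Walk offset-based pages, keeping each listing id only once.

    fetch_page is any callable that takes an offset and returns a list of
    listing dicts (hypothetical interface, assumed for this sketch).
    """
    seen_ids = set()
    listings = []
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        page = fetch_page(offset)
        for listing in page:
            if listing['id'] not in seen_ids:  # skip ids collected on an earlier page
                seen_ids.add(listing['id'])
                listings.append(listing)
        if len(page) < page_size:  # a short (or empty) page means there are no more results
            break
        offset += page_size
    return listings


def fake_fetch_page(offset):
    """Hypothetical stand-in for the real request; serves 35 made-up listings."""
    all_rows = [{'id': i, 'listingTitle': 'Car {}'.format(i)} for i in range(35)]
    all_rows[20] = dict(all_rows[3])  # simulate a listing repeated on a later page
    return all_rows[offset:offset + 15]


result = collect_listings(fake_fetch_page)
print(len(result))  # 34: 35 rows served, one of them a duplicate
```

With the real site, `fetch_page` could be something like `lambda offset: requests.get(page_url.format(offset)).json()`, reusing `page_url` from the script above, assuming the endpoint still returns a JSON list.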
