python - Web scraping of detail pages
Problem description
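I am currently scraping this website to build a car dataset: https://www.marktplaats.nl/l/auto-s/p/1/#f:10882
My problem is that the parts that matter for my analysis (transmission, engine type, price, and so on) are on the more detailed page of each listing, for example: https://www.marktplaats.nl/a/auto-s/volkswagen/m1547281937-volkswagen-polo-1-0-tsi-highline-beats-edition-navi-xenon.html?c=df2f21f683612b45d62c413c0ca719df&previousPage=lr
I have managed to scrape the information from the paginated listing pages, but I do not know how to iterate over the detail pages and scrape the fields I need from them.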
Solution
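You have to go through the listing page first to find the URL of every car, then download each car's detail page and parse them one by one. I used the bs4 package (BeautifulSoup). The code below will need to be adapted to your needs, but the idea is here: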
import requests
import bs4
from urllib.parse import urljoin

url = 'https://www.marktplaats.nl/l/auto-s/p/1/#f:10882'

def downloading_and_parsing_url(url):
    # Download the webpage
    response = requests.get(url)
    # Parse the HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return soup

soup = downloading_and_parsing_url(url)
soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')

for car in soup_table.find_all('li'):
    # Find the URL for each car
    link = car.find('a')
    if link is None or not link.get('href'):
        continue  # skip list items that carry no link
    # urljoin avoids a doubled slash when the href starts with '/'
    sub_url = urljoin('https://www.marktplaats.nl/', link.get('href'))
    # Download each detail page
    sub_soup = downloading_and_parsing_url(sub_url)
    # Find the 'div' with id 'car-attributes'
    attributes = sub_soup.find('div', {'id': 'car-attributes'})
    if attributes is None:
        continue  # this page has no attribute table
    for car_item in attributes.find_all('div', {'class': 'spec-table-item'}):
        key = car_item.find('span', {'class': 'key'})
        value = car_item.find('span', {'class': 'value'})
        print(key.text, value.text)
    print('\n')
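The loop above covers a single listing page. To cover several pages, one option is to generate the listing URLs first and run the same loop on each. This is a minimal sketch, assuming the page number is the `/p/<n>/` path segment seen in the question's URL (that pattern is inferred from the URL, not confirmed for the site). `listing_page_urls` and `absolute_url` are hypothetical helper names. The `#f:10882` fragment is left out because URL fragments are never sent in the HTTP request.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.marktplaats.nl/'

def listing_page_urls(first_page, last_page):
    """Yield listing URLs for pages first_page..last_page (inclusive)."""
    for page in range(first_page, last_page + 1):
        # Assumed pattern, taken from the /p/1/ segment of the question's URL
        yield urljoin(BASE_URL, 'l/auto-s/p/{}/'.format(page))

def absolute_url(href):
    """Turn an href found in a listing into an absolute URL."""
    # urljoin handles hrefs with or without a leading slash
    return urljoin(BASE_URL, href)

pages = list(listing_page_urls(1, 3))
```

Each URL in `pages` can then be passed to `downloading_and_parsing_url`, ideally with a short pause between requests so the site is not overloaded.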
And the output:
Merk & Model: Lako
Bouwjaar: 1996
Uitvoering: 233 C
Carrosserie: Open wagen
Kenteken: OD-31-VD
APK tot: 29 juni 2020
Prijs: € 7.500,00
Merk & Model: RAM
Bouwjaar: 2020
Carrosserie: SUV of Terreinwagen
Brandstof: LPG
Kilometerstand: 70 km
Transmissie: Automaat
Prijs: Zie omschrijving
Motorinhoud: 5.700 cc
Opties:
Parkeersensor
Dodehoekdetectie
Elektrische achterklep
Metallic lak
Panoramadak
Radio
Mistlampen
Adaptive Cruise Control
Keyless entry
Airconditioning
Boordcomputer
Bekleding leder
Stoelverwarming
Trekhaak
Elektrische ramen
Climate control
Emergency brake assist
Isofix
Alarm
Spraakbediening
Navigatiesysteem
Elektrische buitenspiegels
Traction-control
...
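For analysis it is usually handier to keep each car as a dictionary and write all cars to a file instead of printing them. This is a sketch under stated assumptions: the two sample records are copied from the output above (with the trailing colon stripped from the keys), and `cars_to_csv` is a hypothetical helper, not part of the original answer. Because the listings do not all expose the same attributes, the column list is built as the union of all keys.

```python
import csv
import io

def cars_to_csv(cars):
    """Serialise a list of {attribute: value} dicts to CSV text."""
    # Collect every attribute name in first-seen order, since cars differ
    fieldnames = []
    for car in cars:
        for key in car:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills in attributes that a given car does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(cars)
    return buffer.getvalue()

# Sample records taken from the output shown above
cars = [
    {'Merk & Model': 'Lako', 'Bouwjaar': '1996', 'Prijs': '€ 7.500,00'},
    {'Merk & Model': 'RAM', 'Bouwjaar': '2020', 'Brandstof': 'LPG'},
]
csv_text = cars_to_csv(cars)
```

In the scraping loop, the `print(key.text, value.text)` call would be replaced by filling one dictionary per car and appending it to the `cars` list.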