Python BeautifulSoup: scraping car rankings

Question

I need to scrape car rankings from many websites.

For example:

https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/

  1. 2011 Toyota Camry
  2. 2013 Honda Civic ...

https://www.autoguide.com/auto-news/2019/10/top-10-best-cars-for-snow.html

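Dodge Charger AWD, Subaru Outback, Nissan Altima AWD ...

I can't detect the rankings reliably because every site structures them slightly differently. My goal is a script that automatically detects the ranking on any given car website and retrieves the data I need (the make and model of each ranked car) with reasonably high accuracy.

The data I want (make + model in the ranking) sometimes sits in an H2, H3, or H4 heading and sometimes in links. Sometimes it is written as "1. Brand1 Model1, 2. Brand2 Model2 ..." and sometimes as "Brand1 Model1, Brand2 Model2 ...". It depends on the site.

I am doing this in Python with BeautifulSoup.

What would be a good approach?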

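Edit:

To be clear, I am struggling with parsing the data, not with fetching it (see the comments below). For completeness, here is how I handle the first example above: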

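import requests
from bs4 import BeautifulSoup

urls = ["https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/"]
list_url, list_sub_heading = [], []

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")

    for sub_heading in soup.find_all("h2"):
        # Keep only headings that contain "1. " but not "11."
        if "1. " in sub_heading.text and "11." not in sub_heading.text:
            list_url.append(url)
            list_sub_heading.append(sub_heading.text)

print(list_sub_heading)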

Result: ['1. 2011 Toyota Camry']

Tags: python, web-scraping, beautifulsoup

Solution


import requests
from bs4 import BeautifulSoup


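def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # For every image with class "alignnone", take the text of the
    # nearest preceding <h3>
    goal = [item.find_previous("h3").text for item in soup.find_all(
        "img", class_="alignnone")]
    # Drop duplicates while preserving order
    mylist = list(dict.fromkeys(goal))
    print(mylist)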


main("https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/")

Output:

['1. 2011 Toyota Camry', '2. 2013 Honda Civic', '3. 2009 Toyota Avalon', '4. 2011 Honda Accord', '5. 2010 Toyota Prius', '6. 2012 Mazda Mazda3', '7. 2011 Toyota Corolla', '8. 2010 Subaru Outback', '9. 2013 Kia Soul', '10. 2012 Subaru Legacy']
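
The selector above is specific to one page. Since the question asks for something that works when the ranking sits in H2, H3, or H4 headings on different sites, one possible generalization is to collect the text of all candidate headings (for example with soup.find_all(["h2", "h3", "h4"])) and keep only the longest run that is numbered 1, 2, 3, ... in order. The sketch below is only an illustration of that idea, not part of the original answer; the function name find_ranking and the accepted separators are assumptions.

```python
import re

# Matches "1. Foo", "2) Bar", "10 - Baz": a rank, a separator, then the label.
NUMBERED = re.compile(r"^\s*(\d{1,2})\s*[.):-]\s*(.+?)\s*$")


def find_ranking(texts):
    """Return the labels of the longest run numbered 1, 2, 3, ... in order.

    `texts` is a list of strings, e.g. heading texts taken from a page.
    (Hypothetical helper, written for illustration only.)
    """
    best, current = [], []
    for text in texts:
        match = NUMBERED.match(text)
        if not match:
            continue  # unnumbered headings do not break a run
        rank, label = int(match.group(1)), match.group(2)
        if rank == len(current) + 1:
            current.append(label)  # next expected rank: extend the run
        elif rank == 1:
            current = [label]      # a new list starts here
        else:
            current = []           # out-of-order number: discard the run
        if len(current) > len(best):
            best = list(current)
    return best


headings = [
    "Best Used Cars",
    "1. 2011 Toyota Camry",
    "Why we like it",
    "2. 2013 Honda Civic",
    "3. 2009 Toyota Avalon",
    "11. Related articles",
]
print(find_ranking(headings))
# ['2011 Toyota Camry', '2013 Honda Civic', '2009 Toyota Avalon']
```

Requiring consecutive ranks replaces the manual '"11." not in text' filter from the question, and it also ignores stray numbered headings that are not part of a list.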

Regex version:

import requests
import re


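def main(url):
    r = requests.get(url)
    # Capture a rank such as "1." right after a tag, skip to the end of the
    # next tag, then capture the text up to the following "<"
    match = [f'{item.group(1)} {item.group(2)}'
             for item in re.finditer(r'>(\d+\.).+?>(.+?)<', r.text)]
    print(match)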


main("https://www.kbb.com/articles/best-cars/10-best-used-cars-under-10000/")

Output:

['1. 2011 Toyota Camry', '2. 2013 Honda Civic', '3. 2009 Toyota Avalon', '4. 2011 Honda Accord', '5. 2010 Toyota Prius', '6. 2012 Mazda Mazda3', '7. 2011 Toyota Corolla', '8. 2010 Subaru Outback', '9. 2013 Kia Soul', '10. 2012 Subaru Legacy']
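
Both versions rely on the items being numbered. For pages like the second example in the question, where the cars appear without numbers, one option is to match the text against a list of known makes and treat whatever follows each make as the model. The sketch below is a rough illustration under that assumption; KNOWN_MAKES is a deliberately tiny, hypothetical list and extract_cars is not part of the original answer.

```python
import re

# Hypothetical, deliberately short list; a real script would need a fuller one.
KNOWN_MAKES = ["Toyota", "Honda", "Dodge", "Subaru", "Nissan", "Mazda", "Kia"]


def extract_cars(text, makes=KNOWN_MAKES):
    """Return (make, model) pairs found in free text, in order of appearance."""
    make_re = re.compile(
        r"\b(" + "|".join(map(re.escape, makes)) + r")\b", re.IGNORECASE
    )
    hits = list(make_re.finditer(text))
    cars = []
    # Pair each make with the next one; the model is the text in between.
    for hit, nxt in zip(hits, hits[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        model = text[hit.end():end]
        # Cut the model at the first list separator, if any.
        model = re.split(r"[,;\n]", model, maxsplit=1)[0]
        # Drop a trailing rank ("2.") and/or year that belongs to the next item.
        model = re.sub(r"(\s+\d+\.)?(\s+(19|20)\d{2})?\s*$", "", model).strip()
        if model:
            cars.append((hit.group(1), model))
    return cars


print(extract_cars("Dodge Charger AWD Subaru Outback Nissan Altima AWD"))
# [('Dodge', 'Charger AWD'), ('Subaru', 'Outback'), ('Nissan', 'Altima AWD')]
print(extract_cars("1. 2011 Toyota Camry, 2. 2013 Honda Civic"))
# [('Toyota', 'Camry'), ('Honda', 'Civic')]
```

The accuracy of this approach depends entirely on how complete the list of makes is, and surrounding prose after the last make would end up in the last model, so it is best applied to short strings such as heading or link texts.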
