How to count all occurrences in a list and print the maximum value from the list

Question

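Hello, I have a question about web scraping. How can I print the maximum, minimum, and average values from the scraped data? I also don't know how to tie these to the number of times each title occurs. The final printed result should look like this: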

BMW - number of offers: ..., max price: ..., min price: ..., average price: ...

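I built a list from this data, but I don't know how to count the occurrences of each title or how to compute the maximum and the other values from it. Here is my code: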


    for car in carList:
        title = car.find('a', class_='offer-title__link').text.strip()
        price = car.find('span', class_='offer-price__number').text.strip()

        lista = [title, price]

        carFile.write(title + ',')
        carFile.write(price + ',')
        carFile.write('\n')

        print(lista)
        print(lista.count(title))

    carFile.close()

At the moment the count it prints is always 1.

Tags: python, list, printing

Answer


If you want to analyze the data, it is best to put all of it into a pandas.DataFrame.

First, append [title, int(price)] to an outer list named data:

data = []

for page in range(1, last_page+1):

    # ... code ...

    for car in car_list:

        # ... code ...

        data.append([title, int(price)])
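The int(price) call above only works once the scraped text contains nothing but digits. Here is a minimal sketch of one way to clean it, assuming prices are whole numbers that may carry spaces, newlines, or a currency suffix such as PLN. The helper name parse_price is hypothetical and the sample string is made up:

```python
import re


def parse_price(raw: str) -> int:
    """Turn scraped price text with spaces and a currency suffix into an int."""
    # Drop every character that is not a digit (spaces, newlines, 'PLN', ...).
    digits = re.sub(r"\D", "", raw)
    if not digits:
        raise ValueError(f"no digits found in price text: {raw!r}")
    return int(digits)


sample = "4 500\n        PLN"  # made-up example of scraped text
print(parse_price(sample))
```

Note that this would misread prices with a decimal part, for example '4 500,50', so it is only suitable if the site always shows whole amounts.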

Then convert it to a DataFrame:

df = pd.DataFrame(data, columns=['title', 'price'])

Then you can analyze it:

    cars = df[ df['title'].str.contains("BMW") ]

    print('count:', len(cars))
    print('price min    :', cars['price'].min())
    print('price average:', cars['price'].mean())
    print('price max    :', cars['price'].max())   

You can even run the same analysis in a for loop for several makes:

for name in ['BMW', 'Audi', 'Opel', 'Mercedes']:

    print('---', name, '---')

    cars = df[ df['title'].str.contains(name) ]

    print('count:', len(cars))
    print('price min    :', cars['price'].min())
    print('price average:', cars['price'].mean())
    print('price max    :', cars['price'].max()) 
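To get the single-line format asked for in the question, the same filtering can be wrapped in a small function. This is only a sketch: summarize is a hypothetical helper, the rows are made-up sample data standing in for the scraped list, and the case-insensitive plain-text match is an assumption about how titles should be matched.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped data.
data = [
    ["BMW 318", 4500],
    ["BMW 520", 3900],
    ["Audi A4", 4900],
    ["Audi A6", 4100],
    ["Opel Vectra", 3300],
]
df = pd.DataFrame(data, columns=["title", "price"])


def summarize(df, name):
    """Build one summary line for all offers whose title contains name."""
    # regex=False treats name as plain text; case=False ignores letter case.
    cars = df[df["title"].str.contains(name, case=False, regex=False)]
    if cars.empty:
        return f"{name} - number of offers: 0"
    prices = cars["price"]
    return (
        f"{name} - number of offers: {len(cars)}, "
        f"max price: {prices.max()}, min price: {prices.min()}, "
        f"average price: {prices.mean():.2f}"
    )


for name in ["BMW", "Audi", "Opel", "Mercedes"]:
    print(summarize(df, name))
```

Returning early for an empty selection avoids printing nan for makes that have no offers.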

You can also plot a price histogram to see which prices are most common:
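A minimal sketch of such a histogram, using made-up sample rows. The Agg backend, the bin count, the axis label, and the output file name price_hist.png are all assumptions chosen so the example runs without a display; in an interactive session you would call plt.show() instead of saving.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for the scraped data.
df = pd.DataFrame(
    [["BMW 318", 4500], ["Audi A4", 4900], ["Audi A6", 4100], ["Opel Vectra", 3300]],
    columns=["title", "price"],
)

# Plot only the numeric price column, split into 5 bins.
ax = df["price"].plot.hist(bins=5, title="All offers")
ax.set_xlabel("price (PLN)")

plt.savefig("price_hist.png")  # hypothetical output file name
plt.close()
```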


You can also easily save the data as CSV or Excel:

df.to_csv('carData.csv')
df.to_excel('carData.xlsx')
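By default to_csv also writes the row index as an extra first column. If that is not wanted, index=False leaves it out. The sketch below uses made-up rows and writes to an in-memory buffer instead of a file, purely so that it runs without touching the disk; with a real file you would pass a path such as 'carData.csv'. Writing .xlsx with to_excel additionally requires the openpyxl package to be installed.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the scraped data.
df = pd.DataFrame([["BMW 318", 4500], ["Audi A4", 4900]], columns=["title", "price"])

buffer = io.StringIO()          # in-memory stand-in for a file path
df.to_csv(buffer, index=False)  # index=False leaves out the 0, 1, 2... row labels
print(buffer.getvalue())

# Read the CSV text back to check that the columns survived the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```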

Minimal working code based on your previous code

It shows a histogram for each make; you have to close the plot window to see the histogram for the next one.

import requests
import bs4
import pandas as pd
import matplotlib.pyplot as plt

url = 'https://www.otomoto.pl/osobowe/seg-sedan/?search%5Bfilter_float_price%3Afrom%5D=3000&search%5Bfilter_float_price%3Ato%5D=5000&search%5Bfilter_float_engine_capacity%3Afrom%5D=2000&search%5Border%5D=created_at%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D='

response = requests.get(url)
response.raise_for_status()

# check how many pages are there
soup = bs4.BeautifulSoup(response.text, "lxml")
last_page = int(soup.select('.page')[-1].text)

print('last_page:', last_page)

data = []

for page in range(1, last_page+1):

    print('--- page:', page, '---')

    response = requests.get(url + '&page=' + str(page))
    response.raise_for_status()
    
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    all_offers = soup.select('article.offer-item')

    for offer in all_offers:
        # extract the title and price, then collect them in the data list

        title = offer.find('a', class_='offer-title__link').text.strip()
        price = offer.find('span', class_='offer-price__number').text.strip().replace(' ', '').replace('\nPLN', '')

        item = [title, int(price)]
        data.append(item)
        print(item)

# --- work with data ---

df = pd.DataFrame(data, columns=['title', 'price'])
df.to_csv('carData.csv')
#df.to_excel('carData.xlsx')

for name in ['BMW', 'Audi', 'Opel', 'Mercedes']:
    print('---', name, '---')
    cars = df[ df['title'].str.contains(name) ]
    print('count:', len(cars))
    print('price min    :', cars['price'].min())
    print('price average:', cars['price'].mean())
    print('price max    :', cars['price'].max())        
    
    cars.plot.hist(title=name)
    plt.show()

Result:

--- BMW ---
count: 3
price min    : 4500
price average: 4500.0
price max    : 4500
--- Audi ---
count: 12
price min    : 3900
price average: 4500.0
price max    : 4900
--- Opel ---
count: 12
price min    : 3300
price average: 4049.5
price max    : 4999
--- Mercedes ---
count: 27
price min    : 3000
price average: 4366.555555555556
price max    : 5000
