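python - How to count all occurrences in a list and print the maximum value in the list
Question
Hello, I have a question about web scraping. How can I print the maximum, minimum, and average values from the scraped data? I also don't know how to connect this with the number of times each title appears. The final printout should look like this:
BMW - number of offers: ..., max price: ..., min price: ..., average price: ...
I created a list with this data, but I don't know how to count the occurrences of each title and compute the maximum and other values from it. Here is my code: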
for car in carList:
    title = car.find('a', class_='offer-title__link').text.strip()
    price = car.find('span', class_='offer-price__number').text.strip()
    lista = [title, price,]
    carFile.write(title + ',')
    carFile.write(price + ',')
    carFile.write('\n')
    print( lista)
    print(lista.count(title))
carFile.close()
Right now the count I get is only ever 1.
Answer
If you want to analyze the data, it is best to put all of it into a pandas.DataFrame.
First, append [title, int(price)] to an external list named data
data = []
for page in range(1, last_page+1):
    # ... code ...
    for car in car_list:
        # ... code ...
        data.append( [title, int(price)] )
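Note that int(price) only works when price is a clean string of digits, and scraped price text often carries spaces and a currency suffix. Below is a minimal sketch of a cleaning step, assuming whole-number prices formatted like '4 500 PLN'. The helper name parse_price and the sample rows are made up for illustration.

```python
import re

def parse_price(text):
    # Keep only the digits, so '4 500 PLN' becomes 4500.
    # Assumption: prices are whole numbers (no decimal part).
    digits = re.sub(r'\D', '', text)
    if not digits:
        raise ValueError(f'no digits in price text: {text!r}')
    return int(digits)

# Made-up sample rows standing in for the scraped offers.
scraped = [('BMW 320', '4 500\nPLN'), ('Audi A4', '3 900 PLN')]

data = []
for title, raw_price in scraped:
    data.append([title, parse_price(raw_price)])

print(data)
```

If the site ever shows prices with a decimal part, this helper would merge the decimals into the number, so it would need adjusting.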
Then convert it to a DataFrame
df = pd.DataFrame(data, columns=['title', 'price'])
Then you can analyze it
cars = df[ df['title'].str.contains("BMW") ]
print('count:', len(cars))
print('price min :', cars['price'].min())
print('price average:', cars['price'].mean())
print('price max :', cars['price'].max())
You can even run it in a for loop to cover more cars
for name in ['BMW', 'Audi', 'Opel', 'Mercedes']:
    print('---', name, '---')
    cars = df[ df['title'].str.contains(name) ]
    print('count:', len(cars))
    print('price min :', cars['price'].min())
    print('price average:', cars['price'].mean())
    print('price max :', cars['price'].max())
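The loop above filters the DataFrame once per name. As an alternative sketch, you could derive a brand column and let groupby compute every statistic in one pass, then print it in the format asked for in the question. This assumes the brand is the first word of the title, which will not hold for every listing (two-word brands, for example). The sample rows and the column names brand, offers, max_price, min_price and avg_price are made up for illustration.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped data.
data = [
    ['BMW 320', 4500],
    ['BMW 525', 3500],
    ['Audi A4', 3900],
    ['Audi A6', 4900],
    ['Opel Vectra', 3300],
]
df = pd.DataFrame(data, columns=['title', 'price'])

# Assumption: the brand is the first word of the title.
df['brand'] = df['title'].str.split().str[0]

# One row per brand, with named aggregation columns.
summary = df.groupby('brand')['price'].agg(
    offers='count',
    max_price='max',
    min_price='min',
    avg_price='mean',
)

for row in summary.itertuples():
    print(f'{row.Index} - number of offers: {row.offers}, '
          f'max price: {row.max_price}, min price: {row.min_price}, '
          f'average price: {row.avg_price:.2f}')
```

Unlike str.contains, this counts each offer under exactly one brand, so an offer whose title mentions two brands is not counted twice.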
You can even plot a histogram of the prices to see which prices are most common
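A minimal sketch of such a histogram, using made-up prices in place of the scraped price column. The bin count, the axis labels and the file name price_hist.png are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend; remove this line to use plt.show() interactively
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices standing in for the scraped 'price' column.
df = pd.DataFrame({'price': [3000, 3300, 3900, 4500, 4500, 4500, 4900, 5000]})

# Five equal-width bins between the lowest and the highest price.
ax = df['price'].plot.hist(bins=5, title='All offers')
ax.set_xlabel('price (PLN)')
ax.set_ylabel('number of offers')

plt.savefig('price_hist.png')  # hypothetical file name
```

The tallest bar marks the price range with the most offers.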
You can also simply save the data as csv or excel
df.to_csv('carData.csv')
df.to_excel('carData.xlsx')
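By default to_csv also writes the DataFrame index as an extra first column. If you do not want it, pass index=False. The sketch below uses made-up rows and writes to an in-memory buffer instead of a real file, so that the resulting text can be inspected and read back.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the scraped data.
df = pd.DataFrame([['BMW 320', 4500], ['Audi A4', 3900]],
                  columns=['title', 'price'])

# Write without the index column; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same two columns.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Writing .xlsx files with to_excel additionally needs an Excel writer package such as openpyxl to be installed.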
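Minimal working code based on your previous code
It displays a histogram for each name, and you have to close the window to see the histogram for the next one.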
import requests
import bs4
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://www.otomoto.pl/osobowe/seg-sedan/?search%5Bfilter_float_price%3Afrom%5D=3000&search%5Bfilter_float_price%3Ato%5D=5000&search%5Bfilter_float_engine_capacity%3Afrom%5D=2000&search%5Border%5D=created_at%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D='
response = requests.get(url)
response.raise_for_status()
# check how many pages there are
soup = bs4.BeautifulSoup(response.text, "lxml")
last_page = int(soup.select('.page')[-1].text)
print('last_page:', last_page)
data = []
for page in range(1, last_page+1):
    print('--- page:', page, '---')
    response = requests.get(url + '&page=' + str(page))
    response.raise_for_status()
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    all_offers = soup.select('article.offer-item')
    for offer in all_offers:
        # get the interesting data and add it to the list
        title = offer.find('a', class_='offer-title__link').text.strip()
        # remove the spaces and the currency suffix so the price converts to int
        price = offer.find('span', class_='offer-price__number').text.strip().replace(' ', '').replace('\nPLN', '')
        item = [title, int(price)]
        data.append(item)
        print(item)
# --- work with data ---
df = pd.DataFrame(data, columns=['title', 'price'])
df.to_csv('carData.csv')
#df.to_excel('carData.xlsx')
for name in ['BMW', 'Audi', 'Opel', 'Mercedes']:
    print('---', name, '---')
    cars = df[ df['title'].str.contains(name) ]
    print('count:', len(cars))
    print('price min :', cars['price'].min())
    print('price average:', cars['price'].mean())
    print('price max :', cars['price'].max())
    # plt.show() blocks until the histogram window is closed
    cars.plot.hist(title=name)
    plt.show()
Result:
--- BMW ---
count: 3
price min : 4500
price average: 4500.0
price max : 4500
--- Audi ---
count: 12
price min : 3900
price average: 4500.0
price max : 4900
--- Opel ---
count: 12
price min : 3300
price average: 4049.5
price max : 4999
--- Mercedes ---
count: 27
price min : 3000
price average: 4366.555555555556
price max : 5000