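python - Aggregating unique values in Python
Problem description
I have a CSV file that I want to aggregate. My goal is to print the number of unique measurements for each day, sorted by date in descending order.
The .csv file looks like this: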
['endDate,weight\n',
' 2020-06-12 00:00:00+03:00 , 91.5,91.9,91.9,91.9,92.55,92.55,92.55,92.55,92.1,92.1,93.3,93.3 \n',
' 2020-06-13 00:00:00+03:00 , 91.6,91.6,92.85,92.85,92.85,92.85,92.3,92.3,92.1,92.1,94.1,94.1 \n',
' 2020-06-14 00:00:00+03:00 , 91.5,91.5,91.65,91.65,91.5,91.5,92.9,92.9 \n',
' 2020-06-15 00:00:00+03:00 , 91.85,91.85,91.6,91.6,91.85,91.85,92.55,92.55,92.4,92.4,93.7,93.7,93.35,93.35 \n',
' 2020-06-16 00:00:00+03:00 , 91.6,91.6,91.3,91.3,92.75,92.75,92.15,92.15,93.15,93.15,92.9,92.9 \n',
' 2020-06-17 00:00:00+03:00 , 91.05,91.05,91.85,91.85,92.4,92.4,92.4,92.4,94.0,94.0,93.7,93.7,93.05,93.05,93.05,93.05 \n',
' 2020-06-18 00:00:00+03:00 , 91.55,91.55,91.45,91.45,91.25,91.25,91.65,92.2,91.95 \n',
' 2020-06-19 00:00:00+03:00 , 91.3,91.6,92.45,92.05,91.8,93.1,92.7,93.5,93.15 \n',
' 2020-06-20 00:00:00+03:00 , 90.8,90.8,90.6,90.6,90.6,90.6,92.15,92.15,92.05,92.05,91.4,91.4 \n',
' 2020-06-21 00:00:00+03:00 ,\n']
The expected result is the number of unique measurements for each day. My code so far:
import re
import collections
with open("weights.csv") as myFile:
    formattedData = dict()
    for line in myFile:
        try:
            date , numbers = line.split(' , ')
            numbers = numbers.replace("\n","")
            numbers = numbers.split(',')
            formattedData[date] = len(list(set(numbers)))
        except:
            date = line
            formattedData[date]=0
formattedData
After splitting the data, my data looks like this:
{'endDate,weight\n': 0,
' 2020-06-12 00:00:00+03:00 , 91.5,91.9,91.9,91.9,92.55,92.55,92.55,92.55,92.1,92.1,93.3,93.3 \n': 0,
' 2020-06-13 00:00:00+03:00 , 91.6,91.6,92.85,92.85,92.85,92.85,92.3,92.3,92.1,92.1,94.1,94.1 \n': 0,
' 2020-06-14 00:00:00+03:00 , 91.5,91.5,91.65,91.65,91.5,91.5,92.9,92.9 \n': 0,
' 2020-06-15 00:00:00+03:00 , 91.85,91.85,91.6,91.6,91.85,91.85,92.55,92.55,92.4,92.4,93.7,93.7,93.35,93.35 \n': 0,
' 2020-06-16 00:00:00+03:00 , 91.6,91.6,91.3,91.3,92.75,92.75,92.15,92.15,93.15,93.15,92.9,92.9 \n': 0,
' 2020-06-17 00:00:00+03:00 , 91.05,91.05,91.85,91.85,92.4,92.4,92.4,92.4,94.0,94.0,93.7,93.7,93.05,93.05,93.05,93.05 \n': 0,
' 2020-06-18 00:00:00+03:00 , 91.55,91.55,91.45,91.45,91.25,91.25,91.65,92.2,91.95 \n': 0,
' 2020-06-19 00:00:00+03:00 , 91.3,91.6,92.45,92.05,91.8,93.1,92.7,93.5,93.15 \n': 0,
' 2020-06-20 00:00:00+03:00 , 90.8,90.8,90.6,90.6,90.6,90.6,92.15,92.15,92.05,92.05,91.4,91.4 \n': 0,
' 2020-06-21 00:00:00+03:00 ,\n': 0}
c=Counter(formattedData)
Solution
You can use defaultdict (imported with from collections import defaultdict).
Replace formattedData = dict()
with
formattedData = defaultdict(int)
Replace formattedData[date] = len(list(set(numbers)))
with
formattedData[date] += len(set(numbers))
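One detail worth checking, assuming the rows really look like the sample in the question: the last value on each line is followed by a space before the newline, so '93.3 ' and '93.3' would count as two different strings in the set. Below is a minimal sketch that strips each token before deduplicating. The helper name count_unique is made up for illustration and is not part of the original answer.

```python
def count_unique(numbers: str) -> int:
    """Count distinct measurements in a comma-separated string."""
    # Strip whitespace (including the trailing newline) around every token,
    # so "93.3 " and "93.3" collapse into a single value.
    tokens = [tok.strip() for tok in numbers.split(",")]
    # Drop empty tokens, which appear when a day has no measurements.
    return len({tok for tok in tokens if tok})


raw = " 91.5,91.9,91.9,91.9,92.55,92.55,92.55,92.55,92.1,92.1,93.3,93.3 \n"
print(count_unique(raw))  # 5
```

The values are still compared as strings here, so '91.5' and '91.50' would count separately. Converting each token with float() before building the set is one way around that, if the file mixes such spellings.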
Finally, create a new dictionary whose keys are sorted by count in descending order:
descending = {dt: count for dt, count in sorted(formattedData.items(), key=lambda item: item[1], reverse=True)}
print(descending)
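Putting the pieces together, here is a hedged end-to-end sketch. It assumes the first comma on each line separates the date from the measurements and that the header starts with endDate, as in the sample. The function name unique_counts and the use of str.partition instead of split(' , ') are choices made for this illustration, not the answer's original code.

```python
from collections import defaultdict


def unique_counts(lines):
    """Map each date string to its number of distinct measurements."""
    counts = defaultdict(int)
    for line in lines:
        # Split on the first comma only: date on the left, values on the right.
        date, sep, numbers = line.partition(",")
        date = date.strip()
        if not sep or not date or date == "endDate":
            continue  # skip the header and blank lines
        values = {tok.strip() for tok in numbers.split(",") if tok.strip()}
        counts[date] += len(values)
    return dict(counts)


lines = [
    "endDate,weight\n",
    " 2020-06-18 00:00:00+03:00 , 91.55,91.55,91.45,91.45,91.25,91.25,91.65,92.2,91.95 \n",
    " 2020-06-19 00:00:00+03:00 , 91.3,91.6,92.45,92.05,91.8,93.1,92.7,93.5,93.15 \n",
    " 2020-06-21 00:00:00+03:00 ,\n",
]
counts = unique_counts(lines)

# Sort by count, as in the answer above.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)
# Sort by date, as the question asks; timestamps in this format with the
# same UTC offset sort correctly as plain strings.
by_date = sorted(counts.items(), reverse=True)

print(by_count)
print(by_date)
```

An open file object can be passed in place of the list, for example unique_counts(myFile) inside the with open("weights.csv") block, since iterating over a file yields its lines. Note that += adds the per-line counts together if the same date appears on more than one line.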