python - Groupby multiple pandas columns, perform operations, and output a new summary df
Problem description
I have a dataframe:
file_name iteration x y z
0 File_1 A 0.000000 0.561140 1
1 File_1 A 0.285714 0.412811 3
2 File_1 A 0.571429 0.957206 7
3 File_1 A 0.857143 0.667926 3
4 File_1 B 1.142857 0.998238 5
5 File_1 B 1.428571 0.975049 3
6 File_1 B 1.714286 0.690555 4
7 File_1 B 2.000000 0.896192 9
8 File_2 C 0.000000 0.241320 0
9 File_2 C 0.285714 0.476498 0
10 File_2 C 0.571429 0.295382 9
11 File_2 C 0.857143 0.809189 6
12 File_2 D 1.142857 0.601656 6
13 File_2 D 1.428571 0.447752 0
14 File_2 D 1.714286 0.210794 3
15 File_2 D 2.000000 0.788367 6
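I would like to:
- Group by file name and iteration
- Filter the data
- Run some functions on the resulting grp2 dataframe
- Return a final dataframe with a summary of the results.
Below is a working example. However, once you add more groupby operations (and more lists to store the values in), it quickly becomes hard to read. My real code uses 3 groupby levels, and the functions all take a combination of columns and output a single value.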
import numpy as np
import pandas as pd

# Example functions I want to apply to my grouped df
# (each takes some columns as input and outputs one number per group)
def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Example dataframe (the real dataframe has 3 columns to group by)
data = {
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": list(np.random.random(16)),  # random floats in [0, 1)
    "z": list(np.random.randint(0, 10, 16)),
}
df = pd.DataFrame(data)

# --------------------------#
# How I am currently solving it: nested groupby,
# storing all values of the final df in lists
file_name_lst = []
iterations_name_lst = []
fun_a_lst = []
fun_b_lst = []

# grouping by file_name --> iteration --> applying the functions
# (scalar keys are used so that key1 and key2 are plain strings, not 1-tuples)
for key1, grp1 in df.groupby("file_name"):
    for key2, grp2 in grp1.groupby("iteration"):
        # filter grp2 based on column values
        # grp2 = grp2[grp2["x"] != "SOME_VALUE_I_AM_FILTERING_OUT"]
        # in some cases filter out the second half of grp2 (measurement repeated)
        # grp2 = grp2[:len(grp2) // 2]
        # sometimes filter out everything smaller than some value
        # grp2 = grp2[grp2["x"] > some_value]
        # keep only every 5th value (measurements have been repeated 5 times)
        # grp2 = grp2[::5]

        # append the corresponding key values to the lists
        file_name_lst.append(key1)
        iterations_name_lst.append(key2)
        # append the results of applying the functions to the lists
        fun_a_lst.append(fun_a(x=grp2["x"], y=grp2["y"]))
        fun_b_lst.append(fun_b(x=grp2["x"], z=grp2["z"]))

df_final = pd.DataFrame({
    "file_name": file_name_lst,
    "iteration": iterations_name_lst,
    "fun_a_res": fun_a_lst,
    "fun_b_res": fun_b_lst,
})

print(df)
print("\n summary dataframe: \n")
print(df_final)
Out:
file_name iteration x y z
0 File_1 A 0.000000 0.998349 1
1 File_1 A 0.285714 0.073390 5
2 File_1 A 0.571429 0.188568 9
3 File_1 A 0.857143 0.236174 3
4 File_1 B 1.142857 0.934034 4
5 File_1 B 1.428571 0.906336 6
6 File_1 B 1.714286 0.258148 4
7 File_1 B 2.000000 0.717714 9
8 File_2 C 0.000000 0.137624 1
9 File_2 C 0.285714 0.773759 6
10 File_2 C 0.571429 0.902748 9
11 File_2 C 0.857143 0.186516 8
12 File_2 D 1.142857 0.496885 6
13 File_2 D 1.428571 0.074975 1
14 File_2 D 1.714286 0.132376 3
15 File_2 D 2.000000 0.421217 2
summary dataframe:
file_name iteration fun_a_res fun_b_res
0 File_1 A 1.714286 -3.571429
1 File_1 B 6.285714 -3.428571
2 File_2 C 1.714286 -6.571429
3 File_2 D 6.285714 -0.928571
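Is there a more concise way to create a dataframe made up of the results of functions applied to a groupby of a previous dataframe?
Solution
It is still easier to filter out the unwanted rows before the main aggregation. Separating the filtering and aggregation steps also makes the code easier to maintain.
Code
1. Filtering
Pick one of (a), (b) or (c) for each run.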
# get the running number within each group and the group size
dfgp = df.groupby(["file_name", "iteration"])
df["cumcount"] = dfgp.cumcount()
df["size"] = dfgp["cumcount"].transform("size")
# (a) filter out the last 50% of each group
df = df[df["cumcount"] < df["size"] // 2]
# (b) filter out everything smaller than 0.5
df = df[df["x"] >= 0.5]
# (c) keep only every 2nd value
df = df[df["cumcount"] % 2 == 0]
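To make the two helper columns concrete, here is a minimal sketch on a small hand-made frame. The frame `toy` and the names `first_half` and `every_other` are made up for illustration and are not part of the original answer; the group size is also taken from the `x` column rather than from `cumcount`, which should give the same numbers.

```python
import pandas as pd

# Hypothetical toy frame: group A has 4 rows, group B has 2 rows.
toy = pd.DataFrame({
    "file_name": ["File_1"] * 6,
    "iteration": ["A"] * 4 + ["B"] * 2,
    "x": [0.0, 0.3, 0.6, 0.9, 1.2, 1.5],
})

gp = toy.groupby(["file_name", "iteration"])
toy["cumcount"] = gp.cumcount()          # 0-based position of each row inside its group
toy["size"] = gp["x"].transform("size")  # group length, repeated on every row of the group

# Same idea as filter (a): keep the first half of each group.
first_half = toy[toy["cumcount"] < toy["size"] // 2]

# Same idea as filter (c): keep every 2nd row of each group.
every_other = toy[toy["cumcount"] % 2 == 0]

print(toy)
print(first_half)
print(every_other)
```

Because both columns are computed per group, the masks are plain row-wise comparisons and no Python-level loop over the groups is needed.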
2. Aggregation
The groupby-apply construct is used to compute aggregations that involve multiple columns.
dfgp = df.groupby(["file_name", "iteration"])  # regroup on the new df
df_ans = pd.concat([
    dfgp[["x", "y"]].apply(lambda df_: fun_a(df_["x"], df_["y"])).rename("fun_a_res"),
    dfgp[["x", "z"]].apply(lambda df_: fun_b(df_["x"], df_["z"])).rename("fun_b_res")
], axis=1).reset_index()
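As a possible variation (not the construct used in this answer), all functions can be evaluated in a single `apply` by returning a `pd.Series` per group, which may scale more comfortably when there are many functions or a third grouping column. The helper name `summarize`, the variable `keys`, and the fixed random seed are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

def summarize(g):
    # Hypothetical helper: one labelled value per function for the group `g`.
    return pd.Series({
        "fun_a_res": fun_a(g["x"], g["y"]),
        "fun_b_res": fun_b(g["x"], g["z"]),
    })

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
df = pd.DataFrame({
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": rng.random(16),
    "z": rng.integers(0, 10, 16),
})

keys = ["file_name", "iteration"]  # a third grouping column could be appended here
df_ans = df.groupby(keys)[["x", "y", "z"]].apply(summarize).reset_index()
print(df_ans)
```

Each key in the returned Series becomes a column of the summary, so adding another function only means adding one more entry to the dictionary.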
Results
(a)
print(df_ans)
file_name iteration fun_a_res fun_b_res
0 File_1 A 0.285714 -1.857143
1 File_1 B 2.571428 -2.714286
2 File_2 C 0.285714 0.142857
3 File_2 D 2.571428 -1.714286
(b)
file_name iteration fun_a_res fun_b_res
0 File_1 A 1.428572 -4.285714
1 File_1 B 8.285714 -2.928571
2 File_2 C 1.428572 -6.785714
3 File_2 D 6.285714 -2.928571
(c)
file_name iteration fun_a_res fun_b_res
0 File_1 A 0.571429 -3.714285
1 File_1 B 2.857143 -3.071429
2 File_2 C 0.571429 -4.214285
3 File_2 D 2.857143 -3.071429
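As a sanity check, one could confirm that the groupby-apply construct gives the same summary as the loop-based approach from the question when no filter is applied. The sketch below is self-contained; the seed, the single loop over both keys, and the names `df_loop`, `df_apply` and `same` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

rng = np.random.default_rng(42)  # seed chosen only to make the sketch repeatable
df = pd.DataFrame({
    "file_name": ["File_1"] * 8 + ["File_2"] * 8,
    "iteration": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "x": list(np.linspace(0, 2, 8)) * 2,
    "y": rng.random(16),
    "z": rng.integers(0, 10, 16),
})
keys = ["file_name", "iteration"]

# Loop-based summary: the nested loops collapsed into one loop over both keys.
rows = []
for (file_name, iteration), grp in df.groupby(keys):
    rows.append({
        "file_name": file_name,
        "iteration": iteration,
        "fun_a_res": fun_a(grp["x"], grp["y"]),
        "fun_b_res": fun_b(grp["x"], grp["z"]),
    })
df_loop = pd.DataFrame(rows)

# groupby-apply summary, one apply per function, concatenated column-wise.
dfgp = df.groupby(keys)
df_apply = pd.concat([
    dfgp[["x", "y"]].apply(lambda g: fun_a(g["x"], g["y"])).rename("fun_a_res"),
    dfgp[["x", "z"]].apply(lambda g: fun_b(g["x"], g["z"])).rename("fun_b_res"),
], axis=1).reset_index()

# Compare the numeric result columns of both summaries.
res_cols = ["fun_a_res", "fun_b_res"]
same = np.allclose(df_loop[res_cols].to_numpy(), df_apply[res_cols].to_numpy())
print(same)
```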