Python: Groupby multiple pandas cols - perform operations and output a new summary df

Problem description

I have a dataframe:

     file_name iteration         x         y  z
    0     File_1         A  0.000000  0.561140  1
    1     File_1         A  0.285714  0.412811  3
    2     File_1         A  0.571429  0.957206  7
    3     File_1         A  0.857143  0.667926  3
    4     File_1         B  1.142857  0.998238  5
    5     File_1         B  1.428571  0.975049  3
    6     File_1         B  1.714286  0.690555  4
    7     File_1         B  2.000000  0.896192  9
    8     File_2         C  0.000000  0.241320  0
    9     File_2         C  0.285714  0.476498  0
    10    File_2         C  0.571429  0.295382  9
    11    File_2         C  0.857143  0.809189  6
    12    File_2         D  1.142857  0.601656  6
    13    File_2         D  1.428571  0.447752  0
    14    File_2         D  1.714286  0.210794  3
    15    File_2         D  2.000000  0.788367  6

What I want:

Below is a working code example. However, as soon as you add more groupby operations (and lists to store the values), it becomes hard to read very quickly. My real code uses 3 groupbys, and the functions all take a combination of columns and output a single value.

import numpy as np
import pandas as pd

# Example functions I want to apply to my grouped df
# (they take some columns as input and output a number)
def fun_a(x, y):
    # function that returns 1 value for each grp, example 1:
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Example dataframe (real dataframe has 3 columns to groupby on)
data = {
    "file_name" : ["File_1"]*8 + ["File_2"]*8,
    "iteration" : ["A"]*4 + ["B"]*4 + ["C"]*4 + ["D"]*4,
    "x"         : list(np.linspace(0, 2, 8))*2,
    "y"         : list(np.random.random(16)),
    "z"         : list(np.random.randint(0, 10, 16)),
}

df = pd.DataFrame(data)

#--------------------------#

# How I am currently solving it: nested groupby,
# storing all values of final_df in lists
file_name_lst = []
iterations_name_lst = []
fun_a_lst = []
fun_b_lst = []

# grouping by file_name --> iteration --> applying
for key1, grp1 in df.groupby(["file_name"]):
    for key2, grp2 in grp1.groupby(["iteration"]):
        # filter grp2 based on column values:
        # grp2 = grp2[grp2["x"] != "SOME_VALUE_I_AM_FILTERING_OUT"]
        # in some cases, filter out the second half of grp2 (measurement repeated):
        # grp2 = grp2[:len(grp2)//2]
        # sometimes filter out everything smaller than a value:
        # grp2 = grp2[grp2["x"] > some_value]
        # keep only every 5th value (measurements have been repeated 5 times):
        # grp2 = grp2[::5]

        # append the corresponding key values to lists
        file_name_lst.append(key1)
        iterations_name_lst.append(key2)
        # append the result of applying the functions to lists
        fun_a_lst.append(fun_a(x=grp2["x"], y=grp2["y"]))
        fun_b_lst.append(fun_b(x=grp2["x"], z=grp2["z"]))
 
df_final = pd.DataFrame({
    "file_name": file_name_lst,
    "iteration": iterations_name_lst,
    "fun_a_res": fun_a_lst,
    "fun_b_res": fun_b_lst,
})
print(df)
print("\nsummary dataframe:\n")
print(df_final)


Out:

   file_name iteration         x         y  z
0     File_1         A  0.000000  0.998349  1
1     File_1         A  0.285714  0.073390  5
2     File_1         A  0.571429  0.188568  9
3     File_1         A  0.857143  0.236174  3
4     File_1         B  1.142857  0.934034  4
5     File_1         B  1.428571  0.906336  6
6     File_1         B  1.714286  0.258148  4
7     File_1         B  2.000000  0.717714  9
8     File_2         C  0.000000  0.137624  1
9     File_2         C  0.285714  0.773759  6
10    File_2         C  0.571429  0.902748  9
11    File_2         C  0.857143  0.186516  8
12    File_2         D  1.142857  0.496885  6
13    File_2         D  1.428571  0.074975  1
14    File_2         D  1.714286  0.132376  3
15    File_2         D  2.000000  0.421217  2

 summary dataframe: 

  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   1.714286  -3.571429
1    File_1         B   6.285714  -3.428571
2    File_2         C   1.714286  -6.571429
3    File_2         D   6.285714  -0.928571

Is there a more concise way to create a dataframe consisting of the results of functions applied to a groupby of the previous dataframe?

Tags: python, pandas, dataframe, pandas-groupby

Solution


It is still easier to filter out the unwanted rows before the main aggregation. Separating the filtering and aggregation steps also makes the code more maintainable.

Code

1. Filtering

Pick one of (a), (b), or (c) for each run.

# get running number and group size
dfgp = df.groupby(["file_name", "iteration"])
df["cumcount"] = dfgp.cumcount()
df["size"] = dfgp["cumcount"].transform("size")

# (a) filter out the last 50% per group
df = df[df["cumcount"] < df["size"] // 2]

# (b) filter out everything smaller than 0.5
df = df[df["x"] >= 0.5]

# (c) keep only every 2nd value
df = df[df["cumcount"] % 2 == 0]
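The `cumcount`/`size` trick above can be checked in isolation. The following is a minimal, self-contained sketch of filter (a) on a made-up four-row-per-group frame (the data here is for illustration only, not from the question):

```python
import pandas as pd

# Tiny example frame: two groups of 4 rows each (invented data).
df = pd.DataFrame({
    "file_name": ["File_1"] * 4 + ["File_2"] * 4,
    "iteration": ["A"] * 4 + ["C"] * 4,
    "x": [0.0, 0.5, 1.0, 1.5] * 2,
})

dfgp = df.groupby(["file_name", "iteration"])
df["cumcount"] = dfgp.cumcount()                  # 0,1,2,3 within each group
df["size"] = dfgp["cumcount"].transform("size")   # 4 for every row here

# (a) keep only the first 50% of each group
first_half = df[df["cumcount"] < df["size"] // 2]
print(first_half[["file_name", "iteration", "x"]])
# keeps the rows with cumcount 0 and 1 in each group
```

Because `cumcount` restarts at 0 in every group, the same comparison also works when groups have different sizes.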

2. Aggregation

A groupby-apply construct is used to compute aggregates that involve multiple columns.

dfgp = df.groupby(["file_name", "iteration"])  # regroup on the new df
df_ans = pd.concat([
    dfgp[["x", "y"]].apply(lambda df_: fun_a(df_["x"], df_["y"])).rename("fun_a_res"),
    dfgp[["x", "z"]].apply(lambda df_: fun_b(df_["x"], df_["z"])).rename("fun_b_res")
], axis=1).reset_index()
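An alternative sketch (not from the answer above): a single `apply` that returns a `pd.Series` per group computes both results in one pass, which can be handier when many functions share the same grouping. The toy data below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Same example functions as in the question.
def fun_a(x, y):
    return sum(x) + sum(y) // 3 * max(x)

def fun_b(x, z):
    return np.mean(x) - np.median(z)

# Small invented frame with two groups.
df = pd.DataFrame({
    "file_name": ["File_1"] * 4,
    "iteration": ["A"] * 2 + ["B"] * 2,
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [1.0, 2.0, 3.0, 4.0],
    "z": [1, 2, 3, 4],
})

# One apply per group; the returned Series' index becomes the result columns.
df_ans = (
    df.groupby(["file_name", "iteration"])
      .apply(lambda g: pd.Series({
          "fun_a_res": fun_a(g["x"], g["y"]),
          "fun_b_res": fun_b(g["x"], g["z"]),
      }))
      .reset_index()
)
print(df_ans)
```

The trade-off: `pd.concat` of two separate applies keeps each function's column selection explicit, while the single-`apply` version touches each group only once.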

Results

(a)

print(df_ans)
  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   0.285714  -1.857143
1    File_1         B   2.571428  -2.714286
2    File_2         C   0.285714   0.142857
3    File_2         D   2.571428  -1.714286

(b)

  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   1.428572  -4.285714
1    File_1         B   8.285714  -2.928571
2    File_2         C   1.428572  -6.785714
3    File_2         D   6.285714  -2.928571

(c)

  file_name iteration  fun_a_res  fun_b_res
0    File_1         A   0.571429  -3.714285
1    File_1         B   2.857143  -3.071429
2    File_2         C   0.571429  -4.214285
3    File_2         D   2.857143  -3.071429
