Pandas: group by and transform with a condition, applied to the whole column

Problem description

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Name': ['John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John'],
    'City': ['A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'A'],
    'City2': ['C', 'D', 'C', 'D', 'C', 'D', 'D', 'C', 'D', 'C'],
})
df



     Value  Name  City  City2
    0   0   John    A   C
    1   1   John    B   D
    2   2   John    A   C
    3   3   John    B   D
    4   4   John    A   C
    5   5   John    B   D
    6   6   John    B   D
    7   7   John    A   C
    8   8   John    B   D
    9   9   John    A   C

I am trying to take the mean of Value where City2 equals "C", but have the result fill the entire new column.

I tried:

df['C_Average'] = df[df['City2'] == 'C'].groupby(['Name','City'])['Value'].transform(lambda v: v.nsmallest(11).mean())
df
     Value  Name  City City2 C_Average
    0   0   John    A   C   4.4
    1   1   John    B   D   NaN
    2   2   John    A   C   4.4
    3   3   John    B   D   NaN
    4   4   John    A   C   4.4
    5   5   John    B   D   NaN
    6   6   John    B   D   NaN
    7   7   John    A   C   4.4
    8   8   John    B   D   NaN
    9   9   John    A   C   4.4

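As you can see, the new column is added, but I want the value applied to the whole column, not only to the rows where City2 equals C. That is, the whole column should show 4.4. Any ideas?

Thanks!

Tags: python, pandas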

Solution


One trick is to replace the non-matching values with missing values instead of filtering the rows out:

print (df.assign(Value = df['Value'].where(df['City2']== 'C')))
   Value  Name City City2
0    0.0  John    A     C
1    NaN  John    B     D
2    2.0  John    A     C
3    NaN  John    B     D
4    4.0  John    A     C
5    NaN  John    B     D
6    NaN  John    B     D
7    7.0  John    A     C
8    NaN  John    B     D
9    9.0  John    A     C
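
To make the difference between filtering and masking concrete, here is a minimal sketch on the original sample data. The variable names `is_c`, `filtered` and `masked` are illustrative only and do not appear in the question or answer.

```python
import pandas as pd

# Original sample data from the question.
df = pd.DataFrame({
    'Value': list(range(10)),
    'Name': ['John'] * 10,
    'City': ['A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'A'],
    'City2': ['C', 'D', 'C', 'D', 'C', 'D', 'D', 'C', 'D', 'C'],
})

is_c = df['City2'] == 'C'

# Boolean filtering drops the non-matching rows entirely.
filtered = df.loc[is_c, 'Value']

# Series.where keeps every row and puts NaN where the condition is False.
masked = df['Value'].where(is_c)

print(len(filtered), len(masked))  # 5 10
```

Because `masked` keeps the full index, a later `groupby(...).transform(...)` on it returns one value per original row.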

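But the problem with the sample data is that the group (John, B) contains no row with City2 equal to C, so the mean for that group is NaN and you get the same output as before: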

df['C_Average'] = (df.assign(Value = df['Value'].where(df['City2']== 'C'))
                     .groupby(['Name','City'])['Value']
                     .transform(lambda v: v.nsmallest(11).mean()))

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     D        NaN
2      2  John    A     C        4.4
3      3  John    B     D        NaN
4      4  John    A     C        4.4
5      5  John    B     D        NaN
6      6  John    B     D        NaN
7      7  John    A     C        4.4
8      8  John    B     D        NaN
9      9  John    A     C        4.4
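
The NaN rows can be checked by counting the non-missing values per group after masking. The sketch below also shows one possible variation, grouping by Name only, which is an assumption about what the asker wants (4.4 on every row) and is not part of the original answer. It uses `transform('mean')` for brevity; with fewer than 11 rows per group this should match `nsmallest(11).mean()`.

```python
import pandas as pd

# Original sample data from the question.
df = pd.DataFrame({
    'Value': list(range(10)),
    'Name': ['John'] * 10,
    'City': ['A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'A'],
    'City2': ['C', 'D', 'C', 'D', 'C', 'D', 'D', 'C', 'D', 'C'],
})

masked = df['Value'].where(df['City2'] == 'C')

# Number of non-NaN (City2 == 'C') values in each (Name, City) group.
counts = masked.groupby([df['Name'], df['City']]).count()
print(counts)  # (John, A) has 5 values, (John, B) has none

# Assumed alternative: ignore City and average the C rows per Name,
# so every row of the same Name receives the same mean.
by_name = masked.groupby(df['Name']).transform('mean')
print(by_name.unique())
```

Whether grouping by Name alone is appropriate depends on whether City is meant to split the averages; the question does not say.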

If the data is changed so that group (John, B) has at least one C row, it works as expected:

df = pd.DataFrame({
    'Value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Name': ['John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John'],
    'City': ['A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'A'],
    'City2': ['C', 'C', 'C', 'D', 'C', 'D', 'D', 'C', 'D', 'C'],
})

print (df)
   Value  Name City City2
0      0  John    A     C
1      1  John    B     C <- one row for C for group John, B
2      2  John    A     C
3      3  John    B     D
4      4  John    A     C
5      5  John    B     D
6      6  John    B     D
7      7  John    A     C
8      8  John    B     D
9      9  John    A     C

df['C_Average'] = (df.assign(Value = df['Value'].where(df['City2']== 'C'))
                     .groupby(['Name','City'])['Value']
                     .transform(lambda v: v.nsmallest(11).mean()))

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     C        1.0
2      2  John    A     C        4.4
3      3  John    B     D        1.0
4      4  John    A     C        4.4
5      5  John    B     D        1.0
6      6  John    B     D        1.0
7      7  John    A     C        4.4
8      8  John    B     D        1.0
9      9  John    A     C        4.4

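Your solution gives a different result, because filtering first drops the non-C rows, and they come back as NaN when the result is assigned to the column: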

df['C_Average'] = df[df['City2'] == 'C'].groupby(['Name','City'])['Value'].transform(lambda v: v.nsmallest(11).mean())

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     C        1.0
2      2  John    A     C        4.4
3      3  John    B     D        NaN
4      4  John    A     C        4.4
5      5  John    B     D        NaN
6      6  John    B     D        NaN
7      7  John    A     C        4.4
8      8  John    B     D        NaN
9      9  John    A     C        4.4
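
The NaN values in the filtered version come from index alignment on assignment. The following sketch, using the changed data, makes that step explicit with `reindex`. The names `filtered` and `aligned` are illustrative, and `transform('mean')` stands in for the `nsmallest(11).mean()` lambda, which should give the same result here because no group has more than 11 rows.

```python
import pandas as pd

# The changed sample data (row 1 now has City2 == 'C').
df = pd.DataFrame({
    'Value': list(range(10)),
    'Name': ['John'] * 10,
    'City': ['A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'A'],
    'City2': ['C', 'C', 'C', 'D', 'C', 'D', 'D', 'C', 'D', 'C'],
})

# Filtering first: the result only carries the index labels of the C rows.
filtered = (df[df['City2'] == 'C']
            .groupby(['Name', 'City'])['Value']
            .transform('mean'))
print(filtered.index.tolist())  # [0, 1, 2, 4, 7, 9]

# Assigning to df['C_Average'] aligns on the index, which is equivalent
# to reindexing: labels missing from `filtered` become NaN.
aligned = filtered.reindex(df.index)
print(aligned.isna().sum())  # 4
```

Masking with `where` avoids this because the transformed Series already has one entry per original row.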
