Group by in Pandas based on a condition

Problem description

I have a DataFrame:

|phone_number|call_date|answered|attempt|
|123         |13thJune |1       |1      |
|234         |15thJune |0       |1      |
|234         |15thJune |0       |2      |
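
To reproduce the examples on this page, the table can be built as follows. This is a minimal sketch that assumes `call_date` holds plain strings exactly as shown (for example `'13thJune'`) and that the remaining columns are integers; the question itself does not state the dtypes.

```python
import pandas as pd

# Sample data copied from the table above; call_date is kept as plain strings
# (an assumption, since the question does not state the column dtypes).
df = pd.DataFrame({
    "phone_number": [123, 234, 234],
    "call_date": ["13thJune", "15thJune", "15thJune"],
    "answered": [1, 0, 0],
    "attempt": [1, 1, 2],
})
print(df)
```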

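I want to perform a groupby and get the maximum date on which a call was answered. If no call was answered (answered is 0), the maximum answered date should be blank.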

df.groupby(['phone_number'])['call_date'].max().reset_index()

This groupby should return a date only when answered is > 0; otherwise it should give me a blank.

How can I achieve this?

Expected df

phone_number | max_call_date
123 | 13th June
234 | NaN

Tags: python, pandas

Solution


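The first idea is to keep only the rows where answered is not 0, aggregate with max, and then add back the phone_number values that were filtered out, as NaN, with Series.reindex: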

df1 = (df[df['answered'].ne(0)]
          .groupby(['phone_number'])['call_date']
          .max()
          .reindex(df['phone_number'].unique())
          .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN

Alternatively, replace call_date with missing values where answered=0 and then aggregate with max:

df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
         .groupby(['phone_number'])['call_date'].max()
         .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN

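A last idea applies if the result should be NaN whenever at least one row in the group has answered=0 (assuming 0 is the minimum of answered): aggregate the max of call_date and the min of answered, then mask the date where that minimum is 0: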

df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})

df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN
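
On the sample data all three approaches agree, but they can differ when a phone number has both answered and unanswered calls. The sketch below illustrates this with a hypothetical extra number, 345, that is not part of the question's data; it compares the filter-and-reindex approach with the max/min aggregation.

```python
import pandas as pd

# Hypothetical data: 345 has one answered and one unanswered call.
df = pd.DataFrame({
    "phone_number": [123, 234, 345, 345],
    "call_date": ["13thJune", "15thJune", "10thJune", "20thJune"],
    "answered": [1, 0, 1, 0],
})

# Filter then reindex: unanswered rows are ignored, answered ones still count.
by_filter = (df[df["answered"].ne(0)]
             .groupby("phone_number")["call_date"]
             .max()
             .reindex(df["phone_number"].unique()))

# max/min aggregation: a single unanswered call blanks the whole group.
agg = df.groupby("phone_number").agg({"call_date": "max", "answered": "min"})
by_min = agg["call_date"].mask(agg["answered"].eq(0))

print(by_filter[345])  # 10thJune
print(by_min[345])     # missing value
```

Which behaviour is wanted depends on whether "blank" should mean "never answered" (first two approaches) or "at least one call not answered" (last approach).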

EDIT: To get the correct maximum from the date strings, the column must first be converted to datetimes:

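# Strip the ordinal suffix (st/nd/rd/th) only after a digit, so month names such as "August" stay intact
df['call_date'] = pd.to_datetime(df['call_date'].str.replace(r'(?<=\d)(?:st|nd|rd|th)', ' ', regex=True),
                                 format='%d %B')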
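
To see why the conversion matters, here is a small sketch with hypothetical values (not from the question) where string order and calendar order disagree. The lookbehind `(?<=\d)` in the pattern is an illustrative choice that removes the suffix only when it directly follows a digit; since the strings carry no year, the parsed dates default to 1900.

```python
import pandas as pd

# Hypothetical values chosen so that string order and date order disagree.
dates = pd.Series(["9thJune", "13thJune"])

# As strings, "9thJune" sorts after "13thJune" because "9" > "1".
print(dates.max())  # 9thJune

# Strip the ordinal suffix only when it directly follows a digit,
# then parse day and month name; the year defaults to 1900.
cleaned = dates.str.replace(r"(?<=\d)(?:st|nd|rd|th)", " ", regex=True)
parsed = pd.to_datetime(cleaned, format="%d %B")
print(parsed.max())  # 1900-06-13 00:00:00
```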


df1 = (df[df['answered'].ne(0)]
          .groupby(['phone_number'])['call_date']
          .max()
          .reindex(df['phone_number'].unique())
          .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT

df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
         .groupby(['phone_number'])['call_date'].max()
         .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT

df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})

df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT
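
If the missing dates should be shown as a literal blank, as the question asks, one option is to format the column as text and fill the gaps with an empty string. This is a sketch that assumes a text column is acceptable for display; it gives up the datetime dtype, and the `'%d %B'` output format is only an illustrative choice.

```python
import pandas as pd

# Result shaped like the output above: one parsed date and one missing value.
df1 = pd.DataFrame({
    "phone_number": [123, 234],
    "max_call_date": pd.to_datetime(["1900-06-13", None]),
})

# Format dates as text, then replace the missing entry with an empty string.
df1["max_call_date"] = df1["max_call_date"].dt.strftime("%d %B").fillna("")
print(df1)
```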
