Python pandas: How to find the time difference based on the maximum value of another column?

Problem description

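I have been trying to work out, for each person in this dataset, how much time they spent on the activity they were most involved in: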

              name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
5       Harry Kane     party 2017-08-04 07:00:00           68
4       Harry Kane     party 2017-08-04 11:00:00           90
3       Harry Kane  football 2017-08-04 13:00:00           80
11  Joey Tribbiani  football 2017-08-04 08:00:00           84
9   Joey Tribbiani     party 2017-08-04 09:00:00           54
10  Joey Tribbiani     party 2017-08-04 10:00:00           67
6         John Doe     beach 2017-08-04 07:00:00           63
7         John Doe     beach 2017-08-04 12:00:00           61
8         John Doe     beach 2017-08-04 14:00:00           65
12   Monica Geller    travel 2017-08-04 07:00:00           90
13   Monica Geller    travel 2017-08-04 08:00:00           96
14   Monica Geller    travel 2017-08-04 09:00:00           74
15   Phoebe Buffey    travel 2017-08-04 10:00:00           52
16   Phoebe Buffey    travel 2017-08-04 12:00:00           84
17   Phoebe Buffey  football 2017-08-04 15:00:00           58
18     Ross Geller     party 2017-08-04 09:00:00           96
19     Ross Geller     party 2017-08-04 11:00:00           81
20     Ross Geller    travel 2017-08-04 14:00:00           60

import pandas as pd

df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')

df # party day 2017-08-04 for some guys.
# find most involved activity and time spent on that activity per person.

Desired output:

                activity_num activity time_diff
name                                           
Chandler Bing            1.0    party  07:00:00
Harry Kane               2.0    party  04:00:00
Joey Tribbiani           2.0    party  02:00:00
John Doe                 1.0    beach  07:00:00
Monica Geller            1.0   travel  02:00:00
Phoebe Buffey            2.0   travel  03:00:00
Ross Geller              2.0   travel  03:00:00

Note: Harry Kane was at the party from 7 a.m. to 11 a.m., so his answer is 4 hours.

df.head()
              name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
3       Harry Kane  football 2017-08-04 13:00:00           80
4       Harry Kane     party 2017-08-04 11:00:00           90
5       Harry Kane     party 2017-08-04 07:00:00           68

My attempt:

df.groupby(['name','activity'])['timestamp'].max() # no idea

Tags: python, pandas

Solution


Try this:

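# Group the timestamps by person and activity.
gb = df.groupby(['name', 'activity'])['timestamp']

# Time span per (name, activity), longest first; keep each person's longest span.
print((gb.max() - gb.min()).sort_values(ascending=False).reset_index().drop_duplicates(subset='name'))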

Output:

             name activity timestamp
0        John Doe    beach  07:00:00
1   Chandler Bing    party  07:00:00
2      Harry Kane    party  04:00:00
3     Ross Geller    party  02:00:00
4   Phoebe Buffey   travel  02:00:00
5   Monica Geller   travel  02:00:00
6  Joey Tribbiani    party  01:00:00
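
The one-liner above keeps, for each person, the activity with the longest time span. If "most involved" is instead read as "the activity with the most rows", a variant along the following lines might be closer to the question. This is only a sketch under that assumption: it is not the answerer's method, the column name `count` and the three-person sample are chosen here for illustration, and ties are broken by whichever activity comes first alphabetically.

```python
import pandas as pd

# Small sample copied from the question's data (three people only).
df = pd.DataFrame({
    "name": ["Harry Kane"] * 3 + ["Joey Tribbiani"] * 3 + ["John Doe"] * 3,
    "activity": ["party", "party", "football",
                 "football", "party", "party",
                 "beach", "beach", "beach"],
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 08:00:00", "2017-08-04 09:00:00", "2017-08-04 10:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00", "2017-08-04 14:00:00",
    ]),
})

grouped = df.groupby(["name", "activity"])["timestamp"]

# Per (name, activity): number of rows and time between first and last timestamp.
stats = pd.DataFrame({
    "count": grouped.count(),
    "time_diff": grouped.max() - grouped.min(),
}).reset_index()

# For each person, keep the row whose activity has the most entries.
# idxmax returns the first maximum, so ties go to the first activity in sort order.
result = stats.loc[stats.groupby("name")["count"].idxmax()].set_index("name")
print(result)
```

Selecting by row count and selecting by longest span agree on this sample, but they can differ when a person has many short entries for one activity and a few widely spaced entries for another.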
