Pandas DataFrame: difference between all dates for each unique ID

Problem description

import pandas as pd

df = pd.DataFrame({
    'id': [44, 44, 44, 88, 88, 90, 95],
    'Status': ['Reject', 'Submit', 'Draft', 'Accept', 'Submit', 'Submit', 'Draft'],
    'Datetime': ['2018-11-24 08:56:02', '2018-10-24 18:12:02',
                 '2018-10-24 08:12:02', '2018-10-29 13:17:02',
                 '2018-10-24 10:12:02', '2018-12-30 08:43:12',
                 '2019-01-24 06:12:02'],
}, columns=['id', 'Status', 'Datetime'])
# Parse the strings into datetime64 values so they can be subtracted.
df['Datetime'] = pd.to_datetime(df['Datetime'])
print(df)

   id  Status            Datetime
0  44  Reject 2018-11-24 08:56:02
1  44  Submit 2018-10-24 18:12:02
2  44   Draft 2018-10-24 08:12:02
3  88  Accept 2018-10-29 13:17:02
4  88  Submit 2018-10-24 10:12:02
5  90  Submit 2018-12-30 08:43:12
6  95   Draft 2019-01-24 06:12:02

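What I want is another column, for example df['Time in Status'], holding the time each id spent in that status.

I have looked at df.groupby(), but I only found answers (this one, for example) that compute the difference between two dates, such as the first and the last, regardless of how many dates lie in between.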

df['Datetime'] = pd.to_datetime(df['Datetime'])
g = df.groupby('id')['Datetime']
# Last row minus first row within each id, in the current row order.
print(g.apply(lambda s: s.iloc[-1] - s.iloc[0]))

id
44   -32 days +23:16:00
88    -6 days +20:55:00
90      0 days 00:00:00
95      0 days 00:00:00
Name: Datetime, dtype: timedelta64[ns]

The closest I have come to the desired result is DataFrameGroupBy.diff:

df['Time in Status'] = df.groupby('id')['Datetime'].diff()
df
   id  Status            Datetime          Time in Status
0  44  Reject 2018-11-24 08:56:02                NaT
1  44  Submit 2018-10-24 18:12:02 -31 days +09:16:00
2  44   Draft 2018-10-24 08:12:02  -1 days +14:00:00
3  88  Accept 2018-10-29 13:17:02                NaT
4  88  Submit 2018-10-24 10:12:02  -6 days +20:55:00
5  90  Submit 2018-12-30 08:43:12                NaT
6  95   Draft 2019-01-24 06:12:02                NaT

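However, this has two problems. First, how can I start the calculation from the earliest date and work through to the end? For example, row 2 would then show 0 days 10:00:00 instead of -1 days +14:00:00. Or is this easier to solve by reordering the data beforehand?

The other problem is the NaT values. When there is no date to compare against, the current date (i.e. datetime.now) should be used instead. I can easily apply that afterwards, but I would like to know whether there is a better way to find and replace all NaT values.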

Tags: python, pandas, datetime, dataframe, group-by

Solution


Yes, you first need to sort by both columns with DataFrame.sort_values:

df = df.sort_values(['id', 'Datetime'])
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
print(df)
   id  Status            Datetime   Time in Status
2  44   Draft 2018-10-24 08:12:02              NaT
1  44  Submit 2018-10-24 18:12:02  0 days 10:00:00
0  44  Reject 2018-11-24 08:56:02 30 days 14:44:00
4  88  Submit 2018-10-24 10:12:02              NaT
3  88  Accept 2018-10-29 13:17:02  5 days 03:05:00
5  90  Submit 2018-12-30 08:43:12              NaT
6  95   Draft 2019-01-24 06:12:02              NaT
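
The output above still contains NaT for the first row of each id, which was the second part of the question. One possible way to handle it, sketched here under assumptions: rows without an earlier date in their id are compared against the current time. The helper name add_time_in_status and its now parameter are hypothetical and are not part of pandas or of the original answer; a fixed now is passed in the example only so that the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [44, 44, 44, 88, 88, 90, 95],
    'Status': ['Reject', 'Submit', 'Draft', 'Accept', 'Submit', 'Submit', 'Draft'],
    'Datetime': pd.to_datetime([
        '2018-11-24 08:56:02', '2018-10-24 18:12:02', '2018-10-24 08:12:02',
        '2018-10-29 13:17:02', '2018-10-24 10:12:02', '2018-12-30 08:43:12',
        '2019-01-24 06:12:02',
    ]),
})


def add_time_in_status(frame, now=None):
    """Hypothetical helper: per-id date differences with NaT filled from `now`."""
    if now is None:
        # Assumption: "current date" means the local wall-clock time.
        now = pd.Timestamp.now()
    out = frame.sort_values(['id', 'Datetime']).copy()
    # Difference to the previous row within the same id (NaT for the first row).
    out['Time in Status'] = out.groupby('id')['Datetime'].diff()
    # Where there was no earlier date to compare with, use now - Datetime instead.
    out['Time in Status'] = out['Time in Status'].fillna(now - out['Datetime'])
    return out


result = add_time_in_status(df, now=pd.Timestamp('2019-02-01 00:00:00'))
print(result)
```

Calling add_time_in_status(df) without the second argument uses the time at which the code runs. If "time in status" is instead meant as the time until the next status change, with the latest status of each id compared against now, the same pattern should work with the next row's date (for example via groupby('id')['Datetime'].shift(-1)) in place of diff(); that reading is an assumption about the question's intent, not something the original answer states.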
