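python - Time difference between rows in a DataFrame column based on a specific condition
Problem description
The "Age" column in the following dataframe is corrupted: for a fixed User_ID, the age is the same on every "Date". For each row, I want to subtract from the original age the difference in years between that row's date and the user's most recent date.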
import pandas as pd
df = pd.DataFrame({
"User_ID": [ "N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2" , "N1", "N1", "N1", "N2"],
"Date": [ "31/10/2021", "31/10/2020" , "31/10/2019", "24/10/2019", "22/10/2018", "15/10/2017", "14/10/2017", "13/10/2016", "12/10/2016", "11/10/2015", "2/10/2015", "1/10/2015" ],
"Age": [6,5,8,6,6,5,8,5,6,6,6,5]
})
So for the dataframe
    User_ID       Date  Age
0        N1 2021-10-31    6
1        N2 2020-10-31    5
2        N3 2019-10-31    8
3        N1 2019-10-24    6
4        N1 2018-10-22    6
5        N2 2017-10-15    5
6        N3 2017-10-14    8
7        N2 2016-10-13    5
8        N1 2016-10-12    6
9        N1 2015-10-11    6
10       N1 2015-10-02    6
11       N2 2015-10-01    5
the result should look like
    User_ID       Date  Age
0        N1 2021-10-31    6
1        N2 2020-10-31    5
2        N3 2019-10-31    8
3        N1 2019-10-24    4
4        N1 2018-10-22    3
5        N2 2017-10-15    2
6        N3 2017-10-14    6
7        N2 2016-10-13    1
8        N1 2016-10-12    1
9        N1 2015-10-11    0
10       N1 2015-10-02    0
11       N2 2015-10-01    0
Is there a quick way to do this?
Solution
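You can create a Series of years, y, from the Date column (which must have a datetime dtype for .dt.year to work). Use GroupBy.transform with GroupBy.first to get the first year in each User_ID group, take its difference from the original y, and subtract that difference from the Age column. This works here because each user's most recent date is also that user's first row: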
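# Dates are day/month/year strings; parse them so that .dt.year is available
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Calendar year of every row
y = df['Date'].dt.year
# First year per user (the most recent one in this row order) minus each row's year
df['Age'] = df['Age'].sub(y.groupby(df['User_ID']).transform('first').sub(y))
print(df)
    User_ID       Date  Age
0        N1 2021-10-31    6
1        N2 2020-10-31    5
2        N3 2019-10-31    8
3        N1 2019-10-24    4
4        N1 2018-10-22    3
5        N2 2017-10-15    2
6        N3 2017-10-14    6
7        N2 2016-10-13    1
8        N1 2016-10-12    1
9        N1 2015-10-11    0
10       N1 2015-10-02    0
11       N2 2015-10-01    0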
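The approach above picks the first row of each group, so it depends on the rows being ordered with each user's most recent date first. If that ordering cannot be guaranteed, a possible variant is to take the maximum year per user instead. The following is a minimal sketch under that assumption, not part of the original answer; the names shuffled, year, latest_year and result are illustrative only, and the rows are shuffled purely to show that the order no longer matters.

```python
import pandas as pd

df = pd.DataFrame({
    "User_ID": ["N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2", "N1", "N1", "N1", "N2"],
    "Date": ["31/10/2021", "31/10/2020", "31/10/2019", "24/10/2019", "22/10/2018",
             "15/10/2017", "14/10/2017", "13/10/2016", "12/10/2016", "11/10/2015",
             "2/10/2015", "1/10/2015"],
    "Age": [6, 5, 8, 6, 6, 5, 8, 5, 6, 6, 6, 5],
})

# Parse the day/month/year strings so that .dt.year is available.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Shuffle the rows (illustration only) so the latest date is no longer first per user.
shuffled = df.sample(frac=1, random_state=0)

year = shuffled["Date"].dt.year
# Latest year seen for each user, broadcast back onto every row of that user.
latest_year = year.groupby(shuffled["User_ID"]).transform("max")

# Subtract the number of calendar years between each row and the user's latest row.
shuffled["Age"] = shuffled["Age"] - (latest_year - year)

# Restore the original row order for display.
result = shuffled.sort_index()
print(result)
```

Like the original, this compares calendar years only, so two dates a few days apart but in different years count as one year apart.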