python - 基于 Pandas 中唯一编号的两个 csv 文件之间的日期差异
问题描述
我有 2 个 csv 文件,我希望根据 Mobile_Number 和类别将 csv_1 中的 App_Date 减去 csv_2 中最接近的 Last_Visit_Date。
注意:我不想考虑 App_Date == Last_Visit_Date、Last_Visit_Date < App_Date 的日期
csv_1
Category Mobile_Number App_Date
A 503477334 2018-10-18
B 503477334 2018-10-16
C 501022162 2018-10-16
A 503487338 2018-10-13
C 506012887 2018-10-21
E 503427339 2018-10-17
csv_2
Category Mobile_Number Last_Visit_Date
A 503477334 2018-10-08
B 503477334 2018-10-07
B 503477334 2018-10-09
C 501022162 2018-10-11
F 501428449 2018-10-18
C 506012887 2018-10-14
输出应如下所示
输出
Category Mobile_Number App_Date Last_Visit_Date Difference
A 503477334 2018-10-18 2018-10-08 10
B 503477334 2018-10-16 2018-10-09 7
C 501022162 2018-10-16 2018-10-11 5
A 503487338 2018-10-13 2018-10-13 NaN
C 506012887 2018-10-21 2018-10-14 7
E 503427339 2018-10-17 2018-10-17 NaN
编辑
找出差异,通过排除 App_Date == Last_Visit_Day 并减去上一次出现
csv_1
Category Mobile_Number App_Date
A 503477334 2018-10-18
B 503477334 2018-10-16
C 501022162 2018-10-16
A 503487338 2018-10-13
C 506012887 2018-10-21
E 503427339 2018-10-17
csv_2
Category Mobile_Number Last_Visit_Date
A 503477334 2018-10-18
A 503477334 2018-10-08
A 503477334 2018-10-06
B 503477334 2018-10-07
B 503477334 2018-10-09
C 501022162 2018-10-14
A 503487338 2018-10-13
A 503487338 2018-10-11
C 506012887 2018-10-15
E 503427339 2018-10-17
输出应如下所示
输出
Category Mobile_Number App_Date Last_Visit_Date Difference
A 503477334 2018-10-18 2018-10-08 10
B 503477334 2018-10-16 2018-10-09 7
C 501022162 2018-10-16 2018-10-14 2
A 503487338 2018-10-13 2018-10-11 2
C 506012887 2018-10-21 2018-10-15 6
E 503427339 2018-10-17 2018-10-17 NaN
解决方案
仅使用merge_asof
排序列 by datetimes
,然后获取差异 bySeries.sub
并在必要时最后替换Last_Visit_Date
by中的缺失值Series.fillna
:
csv_1['App_Date'] = pd.to_datetime(csv_1['App_Date'])
csv_2['Last_Visit_Date'] = pd.to_datetime(csv_2['Last_Visit_Date'])
csv_1 = csv_1.sort_values('App_Date')
csv_2 = csv_2.sort_values('Last_Visit_Date')
df = pd.merge_asof(csv_1,
csv_2,
left_on='App_Date',
right_on='Last_Visit_Date',
by=['Category','Mobile_Number'])
df['Difference'] = df['App_Date'].sub(df['Last_Visit_Date']).dt.days
df['Last_Visit_Date'] = df['Last_Visit_Date'].fillna(df['App_Date'])
print (df)
Category Mobile_Number App_Date Last_Visit_Date Difference
0 A 503487338 2018-10-13 2018-10-13 NaN
1 B 503477334 2018-10-16 2018-10-09 7.0
2 C 501022162 2018-10-16 2018-10-11 5.0
3 E 503427339 2018-10-17 2018-10-17 NaN
4 A 503477334 2018-10-18 2018-10-08 10.0
5 C 506012887 2018-10-21 2018-10-14 7.0
编辑:您需要将参数allow_exact_matches
设置False
为merge_asof
:
allow_exact_matches bool,默认 True
如果为 True,则允许匹配相同的 'on' 值(即小于或等于 / 大于或等于)
如果为 False,则不匹配相同的 'on ' 值(即严格小于/严格大于)。
csv_1['App_Date'] = pd.to_datetime(csv_1['App_Date'])
csv_2['Last_Visit_Date'] = pd.to_datetime(csv_2['Last_Visit_Date'])
csv_1 = csv_1.sort_values('App_Date')
csv_2 = csv_2.sort_values('Last_Visit_Date')
df = pd.merge_asof(csv_1,
csv_2,
left_on='App_Date',
right_on='Last_Visit_Date',
by=['Category','Mobile_Number'],
allow_exact_matches=False)
print (df)
Category Mobile_Number App_Date Last_Visit_Date
0 A 503487338 2018-10-13 2018-10-11
1 B 503477334 2018-10-16 2018-10-09
2 C 501022162 2018-10-16 2018-10-14
3 E 503427339 2018-10-17 NaT
4 A 503477334 2018-10-18 2018-10-08
5 C 506012887 2018-10-21 2018-10-15
df['Difference'] = df['App_Date'].sub(df['Last_Visit_Date']).dt.days
df['Last_Visit_Date'] = df['Last_Visit_Date'].fillna(df['App_Date'])
print (df)
Category Mobile_Number App_Date Last_Visit_Date Difference
0 A 503487338 2018-10-13 2018-10-11 2.0
1 B 503477334 2018-10-16 2018-10-09 7.0
2 C 501022162 2018-10-16 2018-10-14 2.0
3 E 503427339 2018-10-17 2018-10-17 NaN
4 A 503477334 2018-10-18 2018-10-08 10.0
5 C 506012887 2018-10-21 2018-10-15 6.0