首页 > 解决方案 > 基于 Pandas 中唯一编号的两个 csv 文件之间的日期差异

问题描述

我有 2 个 csv 文件,我希望根据 Mobile_Number 和类别将 csv_1 中的 App_Date 减去 csv_2 中最接近的 Last_Visit_Date。

注意:我不想考虑 App_Date == Last_Visit_Date、Last_Visit_Date < App_Date 的日期

csv_1

Category     Mobile_Number      App_Date

A              503477334       2018-10-18
B              503477334       2018-10-16
C              501022162       2018-10-16
A              503487338       2018-10-13
C              506012887       2018-10-21
E              503427339       2018-10-17

csv_2

Category     Mobile_Number    Last_Visit_Date    

A              503477334        2018-10-08
B              503477334        2018-10-07
B              503477334        2018-10-09
C              501022162        2018-10-11
F              501428449        2018-10-18
C              506012887        2018-10-14

输出应如下所示

输出

Category     Mobile_Number    App_Date       Last_Visit_Date   Difference

A              503477334     2018-10-18       2018-10-08       10
B              503477334     2018-10-16       2018-10-09       7
C              501022162     2018-10-16       2018-10-11       5
A              503487338     2018-10-13       2018-10-13       NaN
C              506012887     2018-10-21       2018-10-14       7 
E              503427339     2018-10-17       2018-10-17       NaN

编辑

找出差异,通过排除 App_Date == Last_Visit_Day 并减去上一次出现

csv_1

Category     Mobile_Number      App_Date

A              503477334       2018-10-18
B              503477334       2018-10-16
C              501022162       2018-10-16
A              503487338       2018-10-13
C              506012887       2018-10-21
E              503427339       2018-10-17

csv_2

Category     Mobile_Number    Last_Visit_Date    

A              503477334        2018-10-18
A              503477334        2018-10-08
A              503477334        2018-10-06
B              503477334        2018-10-07
B              503477334        2018-10-09
C              501022162        2018-10-14
A              503487338        2018-10-13
A              503487338        2018-10-11
C              506012887        2018-10-15
E              503427339        2018-10-17

输出应如下所示

输出

Category     Mobile_Number    App_Date       Last_Visit_Date   Difference

A              503477334     2018-10-18       2018-10-08       10
B              503477334     2018-10-16       2018-10-09       7
C              501022162     2018-10-16       2018-10-14       2
A              503487338     2018-10-13       2018-10-11       2
C              506012887     2018-10-21       2018-10-15       6 
E              503427339     2018-10-17       2018-10-17       NaN

标签: pythonpandas

解决方案


仅使用merge_asof排序列 by datetimes,然后获取差异 bySeries.sub并在必要时最后替换Last_Visit_Dateby中的缺失值Series.fillna

csv_1['App_Date'] = pd.to_datetime(csv_1['App_Date'])
csv_2['Last_Visit_Date'] = pd.to_datetime(csv_2['Last_Visit_Date'])
csv_1 = csv_1.sort_values('App_Date')
csv_2 = csv_2.sort_values('Last_Visit_Date')

df = pd.merge_asof(csv_1, 
                   csv_2, 
                   left_on='App_Date', 
                   right_on='Last_Visit_Date', 
                   by=['Category','Mobile_Number'])
df['Difference'] = df['App_Date'].sub(df['Last_Visit_Date']).dt.days
df['Last_Visit_Date'] = df['Last_Visit_Date'].fillna(df['App_Date'])
print (df)
  Category  Mobile_Number   App_Date Last_Visit_Date  Difference
0        A      503487338 2018-10-13      2018-10-13         NaN
1        B      503477334 2018-10-16      2018-10-09         7.0
2        C      501022162 2018-10-16      2018-10-11         5.0
3        E      503427339 2018-10-17      2018-10-17         NaN
4        A      503477334 2018-10-18      2018-10-08        10.0
5        C      506012887 2018-10-21      2018-10-14         7.0

编辑:您需要将参数allow_exact_matches设置Falsemerge_asof

allow_exact_matches bool,默认 True
如果为 True,则允许匹配相同的 'on' 值(即小于或等于 / 大于或等于)
如果为 False,则不匹配相同的 'on ' 值(即严格小于/严格大于)。

csv_1['App_Date'] = pd.to_datetime(csv_1['App_Date'])
csv_2['Last_Visit_Date'] = pd.to_datetime(csv_2['Last_Visit_Date'])
csv_1 = csv_1.sort_values('App_Date')
csv_2 = csv_2.sort_values('Last_Visit_Date')

df = pd.merge_asof(csv_1, 
                   csv_2, 
                   left_on='App_Date', 
                   right_on='Last_Visit_Date', 
                   by=['Category','Mobile_Number'],
                   allow_exact_matches=False)
print (df)
  Category  Mobile_Number   App_Date Last_Visit_Date
0        A      503487338 2018-10-13      2018-10-11
1        B      503477334 2018-10-16      2018-10-09
2        C      501022162 2018-10-16      2018-10-14
3        E      503427339 2018-10-17             NaT
4        A      503477334 2018-10-18      2018-10-08
5        C      506012887 2018-10-21      2018-10-15

df['Difference'] = df['App_Date'].sub(df['Last_Visit_Date']).dt.days
df['Last_Visit_Date'] = df['Last_Visit_Date'].fillna(df['App_Date'])
print (df)
  Category  Mobile_Number   App_Date Last_Visit_Date  Difference
0        A      503487338 2018-10-13      2018-10-11         2.0
1        B      503477334 2018-10-16      2018-10-09         7.0
2        C      501022162 2018-10-16      2018-10-14         2.0
3        E      503427339 2018-10-17      2018-10-17         NaN
4        A      503477334 2018-10-18      2018-10-08        10.0
5        C      506012887 2018-10-21      2018-10-15         6.0

推荐阅读