How to outer merge two dataframes where a column value falls within a range?

Problem description

This is a follow-up to an earlier question.

I have two dataframes

print(df_1)

  timestamp      A          B
0 2016-05-15     0.020228   0.026572
1 2016-05-15     0.057780   0.175499
2 2016-05-15     0.098808   0.620986
3 2016-05-17     0.158789   1.014819
4 2016-05-17     0.038129   2.384590
5 2018-05-17     0.011111   9.999999


print(df_2)

  start                end  event    
0 2016-05-14   2016-05-16   E1
1 2016-05-14   2016-05-16   E2
2 2016-05-17   2016-05-18   E3

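I want to merge the event column of df_2 into df_1 wherever the df_1 timestamp lies between start and end.

The complications, and the differences from the earlier question, are:

1) Events E1 and E2 have the same start and end.

2) The 6th row of df_1 does not fall in any interval.

In the end, I want a row for each of the two events, and NA for rows that have no event.

So I want my resulting dataframe to look like this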

  timestamp      A          B         event
0 2016-05-15     0.020228   0.026572  E1
1 2016-05-15     0.057780   0.175499  E1
2 2016-05-15     0.098808   0.620986  E1
3 2016-05-15     0.020228   0.026572  E2 
4 2016-05-15     0.057780   0.175499  E2
5 2016-05-15     0.098808   0.620986  E2
6 2016-05-17     0.158789   1.014819  E3
7 2016-05-17     0.038129   2.384590  E3
8 2018-05-17     0.011111   9.999999  NA
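To make the requirement concrete, the sketch below checks every df_1 timestamp against every df_2 interval and counts the matches per row. The names `timestamps`, `inside`, `membership` and `matches_per_row` are illustrative and not part of the question; the bounds are treated as inclusive, as the expected output implies.

```python
import pandas as pd

timestamps = pd.to_datetime(["2016-05-15", "2016-05-15", "2016-05-15",
                             "2016-05-17", "2016-05-17", "2018-05-17"])
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14", "2016-05-14", "2016-05-17"]),
    "end": pd.to_datetime(["2016-05-16", "2016-05-16", "2016-05-18"]),
    "event": ["E1", "E2", "E3"],
})

# Broadcast a (6, 1) column of timestamps against the 3 intervals -> (6, 3) mask.
ts = timestamps.to_numpy()[:, None]
inside = (ts >= df_2["start"].to_numpy()) & (ts <= df_2["end"].to_numpy())

# One boolean column per event: True where the timestamp lies in that interval.
membership = pd.DataFrame(inside, columns=df_2["event"])
matches_per_row = membership.sum(axis=1)
print(membership)
print(matches_per_row.tolist())
```

Rows with two matches must appear twice in the result, and a row with zero matches must still appear once with NA, which is why a plain inner join on the range condition is not enough.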

Tags: python, python-3.x, pandas

Solution


import pandas as pd

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})
df_1.timestamp = pd.to_datetime(df_1.timestamp, format='%Y-%m-%d')
df_2.start = pd.to_datetime(df_2.start, format='%Y-%m-%d')
df_2.end = pd.to_datetime(df_2.end, format='%Y-%m-%d')

# convert df_2 to long format with one row per event per day between start and end, then left-merge on timestamp
df_2_2 = pd.melt(df_2, id_vars='event', value_name='timestamp')
df_2_2.timestamp = pd.to_datetime(df_2_2.timestamp)
df_2_2.set_index('timestamp', inplace=True)
df_2_2.drop('variable', axis=1, inplace=True)

df_2_3 = df_2_2.groupby('event').resample('D').ffill().reset_index(level=0, drop=True).reset_index()

df_2 = pd.merge(df_2, df_2_3)
df_2 = df_2.drop(columns=['start', 'end'])

df_1 = df_1.merge(df_2,on='timestamp',  how='left')

print(df_1)
   timestamp  A event
0 2016-05-15  1    E1
1 2016-05-15  1    E2
2 2016-05-15  1    E1
3 2016-05-15  1    E2
4 2016-05-15  1    E1
5 2016-05-15  1    E2
6 2016-05-17  1    E3
7 2016-05-17  1    E3
8 2018-05-17  1   NaN
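The same idea (expand each interval into one row per day, then left-merge on the date) can also be sketched without the melt/groupby/resample steps. This is a minimal variant, assuming the timestamps have day granularity as in the question; the name `days` is illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-15", "2016-05-15", "2016-05-15",
                                 "2016-05-17", "2016-05-17", "2018-05-17"]),
    "A": [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14", "2016-05-14", "2016-05-17"]),
    "end": pd.to_datetime(["2016-05-16", "2016-05-16", "2016-05-18"]),
    "event": ["E1", "E2", "E3"],
})

# One (day, event) row for every day from start to end, inclusive.
days = pd.DataFrame(
    [(day, event)
     for start, end, event in df_2.itertuples(index=False)
     for day in pd.date_range(start, end, freq="D")],
    columns=["timestamp", "event"],
)

# The left merge keeps every df_1 row; rows without a matching day get NaN.
result = df_1.merge(days, on="timestamp", how="left")
print(result)
```

Timestamps that carry a time of day would need to be normalized to midnight before the merge, and very long intervals produce many helper rows, so this approach suits short day-based ranges.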

Credit to the original answer.

There is also this solution, but it drops the unmatched last row instead of giving NA for it:

import pandas as pd

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})   

df_try2 = pd.merge(df_1.assign(key=1), df_2.assign(key=1), on='key').query('timestamp >= start and timestamp <= end')    

print(df_try2)

   timestamp  A  key      start        end event
0  2016-05-15  1    1 2016-05-14 2016-05-16    E1
1  2016-05-15  1    1 2016-05-14 2016-05-16    E2
3  2016-05-15  1    1 2016-05-14 2016-05-16    E1
4  2016-05-15  1    1 2016-05-14 2016-05-16    E2
6  2016-05-15  1    1 2016-05-14 2016-05-16    E1
7  2016-05-15  1    1 2016-05-14 2016-05-16    E2
11 2016-05-17  1    1 2016-05-17 2016-05-18    E3
14 2016-05-17  1    1 2016-05-17 2016-05-18    E3
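The cross-join approach can be extended to keep the unmatched row: filter the pairs as before, then left-merge the hits back onto df_1 by row position. This is a sketch, assuming pandas 1.2 or later for `how="cross"`; the helper column `row` and the names `left`, `pairs` and `hits` are illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-15", "2016-05-15", "2016-05-15",
                                 "2016-05-17", "2016-05-17", "2018-05-17"]),
    "A": [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14", "2016-05-14", "2016-05-17"]),
    "end": pd.to_datetime(["2016-05-16", "2016-05-16", "2016-05-18"]),
    "event": ["E1", "E2", "E3"],
})

# Remember each df_1 row's position so unmatched rows can be restored later.
left = df_1.rename_axis("row").reset_index()

# Every df_1 row paired with every interval, then keep only pairs in range.
pairs = left.merge(df_2, how="cross")
in_range = pairs["timestamp"].between(pairs["start"], pairs["end"])
hits = pairs.loc[in_range, ["row", "event"]]

# Left merge on the row position: rows with no hit get NaN in event.
result = left.merge(hits, on="row", how="left").drop(columns="row")
print(result)
```

Unlike the day-expansion approach, this compares the actual values, so it also works for timestamps with a time of day, at the cost of building len(df_1) * len(df_2) intermediate rows.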
