首页 > 解决方案 > 合并两个具有条件日期时间的数据集并显示不匹配的值

问题描述

我有两个数据集:

df1 和 df2,如果日期时间在 df2 的 20 秒内,我希望列从 df1 合并在一起

df1


Connect                 Ended

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM
4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

df2

Start                   End

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM

期望的输出:

df3

Match_Start1             Match_End1                     Match_Start2              Match_End2

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM          4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM        3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM    

df4(不匹配)

Unmatched_Start         Unmatched_end  

4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

输入:

df1

structure(list(Connect = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:08 AM", 
"4/1/2020 10:00:05 PM", "4/6/2020 1:15:21 PM"), class = "factor"), 
Ended = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:10 AM", "4/1/2020 12:00:05 PM", 
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
 -13L))



df2

structure(list(Start = structure(2:1, .Label = c("3/31/2020 11:00:10 AM", 
"4/6/2020 1:15:21 PM"), class = "factor"), End = structure(2:1, .Label = c("3/31/2020 11:00:14 AM", 
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

我试过的:

pd.merge_asof(df1, df2, on="Connect", "Ended", by='ticker',       tolerance=pd.Timedelta('20 s'), direction='backward')

但是,如何合并 20 秒的条件,以及显示不匹配的数据集?

任何建议表示赞赏

标签: rmergedplyr

解决方案


我们可以使用crossing创建所有的组合df1并且df2只保留那些在 20 秒间隔内的行。

library(tidyr)
library(dplyr)
library(lubridate)

df3 <- crossing(df1, df2) %>%
          mutate_all(mdy_hms) %>%
          filter(abs(difftime(Connect, Start, units = "secs")) <= 20 &
                 abs(difftime(Ended, End, units = "secs")) <= 20)

df3
# A tibble: 2 x 4
#  Connect             Ended               Start               End                
#  <dttm>              <dttm>              <dttm>              <dttm>             
#1 2020-03-31 11:00:08 2020-03-31 11:00:10 2020-03-31 11:00:10 2020-03-31 11:00:14
#2 2020-04-06 13:15:21 2020-04-06 14:05:18 2020-04-06 13:15:21 2020-04-06 14:05:18

要获得df4,我们可以这样做:

df4 <- df1 %>% mutate_all(mdy_hms) %>% anti_join(df3, by = c('Connect', 'Ended'))

推荐阅读