首页 > 解决方案 > 为什么 data.table 连接不适用于日期?

问题描述

data.tablejoin 不是选择最大日期,而是最大值。请参见以下示例:

table1 <- fread(
  "individual_id | date       
       1             |  2018-01-06
       2             |  2018-01-06", 
  sep ="|"
)
table1$date = as.IDate(table1$date)
table2 <- fread(
  "individual_id | date_second       | company_id | value
       1             |  2018-01-02 |     62       |  1     
       1             |  2018-01-04 |     62       |  1.5 
       1             |  2018-01-05 |     63       |  1   
       2             |  2018-01-01 |     71       |  2     
       2             |  2018-01-02 |     74       |  1   
       2             |  2018-01-05 |     74       |  4",
  sep = "|"
)
table2$date_second = as.IDate(table2$date_second)

下面的连接应该通过公司 id 选择最大值,然后选择返回每个人返回的所有值的最大值。选择最大值的连接:

table2[table1, on=.(individual_id, date_second<=date), 
       #for each row of table1,
       by=.EACHI,
       # get the maximum value by company_id and the max of all of these
       max(.SD[,max(value), by=.(company_id)]$V1)]

输出:

   individual_id date_second  V1
1:             1  2018-01-06 1.5
2:             2  2018-01-06 4.0

相同的加入,选择最大日期:

table2[table1, on=.(individual_id, date_second<=date), 
       #for each row of table1,
       by=.EACHI,
       # get the maximum date by company_id and the max of all of these
       max(.SD[,max(date_second), by=.(company_id)]$V1)]

输出:

   individual_id date_second         V1
1:             1  2018-01-06 2018-01-02
2:             2  2018-01-06 2018-01-01

为什么它不像最大值那样返回最大日期?

标签: rdata.table

解决方案


我猜您正在寻找更新加入:

table1[table2
       , on = .(individual_id, date >= date_second)
       , by = .EACHI
       , second_date := max(i.date_second)][]

这使:

> table1
   individual_id       date second_date
1:             1 2018-01-06  2018-01-05
2:             2 2018-01-06  2018-01-05

推荐阅读