首页 > 解决方案 > data.table 合并 R 中不同列的部分匹配

问题描述

之前可能已经问过这个问题,但如果可能的话,我正在寻找一个 data.table 解决方案,而不使用其他包。我有一个 data.table DT1 作为参考:

> require(data.table)
> DT1 <- data.table(col1 = c("AA", "BA", "ABC", "ABC BC", "AB")
                  , col2 = c(1,4,5,3,2))
> DT1
     col1 col2
1:     AA    1
2:     BA    4
3:    ABC    5
4: ABC BC    3
5:     AB    2

我想根据 DT1 中的 col1 和 DT2 中的 col2 的部分匹配将第二个 data.table DT2 与 DT1 合并,在 DT2 中创建一个 col3。

> DT2 <- data.table(col1 = c(0,5,2,7,1,0)
                  , col2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
> DT2
   col1 col2
1:    0   BA
2:    5  ABC
3:    2   DC
4:    7   AA
5:    1   AB
6:    0  R AB

期望的输出

 > desired_output <- data.table(col1 = c(0,5,5,2,7,1,1,1,0)
                                 , col2 = c("BA", "ABC", "ABC", "DC", "AA",  "AB", "AB", "AB", "R AB")
                                 , col3 = c(4,5,3,NA,1,5,3,2,2))
> desired_output
   col1 col2 col3
1:    0   BA    4
2:    5  ABC    5
3:    5  ABC    3
4:    2   DC   NA
5:    7   AA    1
6:    1   AB    5
7:    1   AB    3
8:    1   AB    2
9:    0  R AB   2

有没有使用 data.table 操作的优雅方式来做到这一点?如果不是那么乐意考虑其他解决方案。这将在一个非常大的数据集上运行。


编辑:要指定部分匹配的条件,如果 DT1 中的 col1 字符串是 DT2 中 col2 字符串的子集,则匹配,反之亦然(DT2 中 col2 的字符串是 DT1 中 col1 字符串的子集)。两种方式的grepl?

col1/DT1    col2/DT2
  "AB"       "There is ABhere"    # it's a match
  "ABC"      "someABC"            # it's a match
  "ABC BC"   "ABC"                # it's a reverse match
  "DR"       "ADD"                # no match
  "BA"       "HABAHA"             # two matches

标签: rmergedata.tablematchingpartial

解决方案


Given the dimension of the problem (DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55)]), it is probably infeasible to do a CJ of the IDs before doing a two-way grepl.

Breaking down the different kind of matches/approx. matches, we can 1) first look for exact matches, 2) then approx. matches where substring in DT1 can be found in DT2 and then, 3) vice versa.

Finally, we row bind all the results and do a left join between original DT2 and the row-binded results to get desired output.

exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
    ID2 := ID1]

substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]), 
    by=1:DT2[,.N]][!is.na(VAL1), -1L]

substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]), 
    by=1:DT1[,.N]][!is.na(VAL2), -1L]

binded <- rbindlist(list(exactMatches, substr1in2, substr2in1), 
    use.names=TRUE, fill=TRUE)

binded[DT2, on=.(ID2, VAL2)]

output:

       ID1 VAL1 VAL2  ID2
 1:     BA    4    0   BA
 2:    ABC    5    5  ABC
 3: ABC BC    3    5  ABC
 4:     AB    2    5  ABC
 5:   <NA>   NA    2   DC
 6:     AA    1    7   AA
 7:     AB    2    1   AB
 8:    ABC    5    1   AB
 9: ABC BC    3    1   AB
10:     AB    2    0 R AB

I changed some of the column names to make the code more readable. Data:

DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"), 
    VAL1 = c(1,4,5,3,2))

DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
    ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))

推荐阅读