r - data.table 合并 R 中不同列的部分匹配
问题描述
之前可能已经问过这个问题,但如果可能的话,我正在寻找一个 data.table 解决方案,而不使用其他包。我有一个 data.table DT1 作为参考:
> require(data.table)
> DT1 <- data.table(col1 = c("AA", "BA", "ABC", "ABC BC", "AB")
, col2 = c(1,4,5,3,2))
> DT1
col1 col2
1: AA 1
2: BA 4
3: ABC 5
4: ABC BC 3
5: AB 2
我想根据 DT1 中的 col1 和 DT2 中的 col2 的部分匹配将第二个 data.table DT2 与 DT1 合并,在 DT2 中创建一个 col3。
> DT2 <- data.table(col1 = c(0,5,2,7,1,0)
, col2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
> DT2
col1 col2
1: 0 BA
2: 5 ABC
3: 2 DC
4: 7 AA
5: 1 AB
6: 0 R AB
期望的输出
> desired_output <- data.table(col1 = c(0,5,5,2,7,1,1,1,0)
, col2 = c("BA", "ABC", "ABC", "DC", "AA", "AB", "AB", "AB", "R AB")
, col3 = c(4,5,3,NA,1,5,3,2,2))
> desired_output
col1 col2 col3
1: 0 BA 4
2: 5 ABC 5
3: 5 ABC 3
4: 2 DC NA
5: 7 AA 1
6: 1 AB 5
7: 1 AB 3
8: 1 AB 2
9: 0 R AB 2
有没有使用 data.table 操作的优雅方式来做到这一点?如果不是那么乐意考虑其他解决方案。这将在一个非常大的数据集上运行。
编辑:要指定部分匹配的条件,如果 DT1 中的 col1 字符串是 DT2 中 col2 字符串的子集,则匹配,反之亦然(DT2 中 col2 的字符串是 DT1 中 col1 字符串的子集)。两种方式的grepl?
col1/DT1 col2/DT2
"AB" "There is ABhere" # it's a match
"ABC" "someABC" # it's a match
"ABC BC" "ABC" # it's a reverse match
"DR" "ADD" # no match
"BA" "HABAHA" # two matches
解决方案
Given the dimension of the problem (DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55)]), it is probably infeasible to do a CJ of the IDs before doing a two-way grepl
.
Breaking down the different kind of matches/approx. matches, we can 1) first look for exact matches, 2) then approx. matches where substring in DT1 can be found in DT2 and then, 3) vice versa.
Finally, we row bind all the results and do a left join between original DT2 and the row-binded results to get desired output.
exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
ID2 := ID1]
substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]),
by=1:DT2[,.N]][!is.na(VAL1), -1L]
substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]),
by=1:DT1[,.N]][!is.na(VAL2), -1L]
binded <- rbindlist(list(exactMatches, substr1in2, substr2in1),
use.names=TRUE, fill=TRUE)
binded[DT2, on=.(ID2, VAL2)]
output:
ID1 VAL1 VAL2 ID2
1: BA 4 0 BA
2: ABC 5 5 ABC
3: ABC BC 3 5 ABC
4: AB 2 5 ABC
5: <NA> NA 2 DC
6: AA 1 7 AA
7: AB 2 1 AB
8: ABC 5 1 AB
9: ABC BC 3 1 AB
10: AB 2 0 R AB
I changed some of the column names to make the code more readable. Data:
DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"),
VAL1 = c(1,4,5,3,2))
DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
推荐阅读
- c++ - 调用函数后,我的 char 数组的大小自动缩小到 1
- javascript - service-worker 是否使 web-worker 冗余或者它是否比 web-worker 更好(HTML5 功能)
- node.js - 使用 Node.js 客户端库将查询参数添加到 mailchimp 请求
- arrays - Slice Method regarding the return of a certain amount of letters in an array
- c# - How to configure registration/logIn into a existing project?
- audiocontext - have to remove audioContext or MediaElementAudioSourceNode
- c# - 是否可以使用对它一无所知的类的属性(甚至是类型和名称)?
- git - 列出拉取请求上的提交
- javascript - 如何使用 Laravel + React 检查上传文件的类型?
- javascript - 如何使用 Jest 测试 Node.js 脚本的“if (require.main === module)”部分中的代码?