首页 > 解决方案 > 比较 2 个数据表对并根据第二个数据表更正第一个数据表中的值

问题描述

让我们有以下2个数据表。第一个是收集的有错误的数据,第二个包含 x1 和 x2 的正确可能对:

第一个:

+----------+-----+------+
|    x1    |  x2 |  x3  |
+----------+-----+------+
| march    |  3  |  198 |
| april    |  4  | 4984 |
| february |  2  |  498 |
| march    |  35 |  984 |
| aripl    |  4  |  498 |
+----------+-----+------+

第二个:

+----------+----+
|    x1    | x2 |
+----------+----+
| january  |  1 |
| february |  2 |
| march    |  3 |
| april    |  4 |
| may      |  5 |
+----------+----+

我想在第一个表中找到不正确的行并根据第二个表更正它们。所以输出应该是这样的:

+----------+----+------+
|    x1    | x2 |  x3  |
+----------+----+------+
| march    |  3 |  198 |
| april    |  4 | 4984 |
| february |  2 |  498 |
| march    |  3 |  984 |
| april    |  4 |  498 |
+----------+----+------+

它应该检查 x1 是否是正确的名称,然后添加正确的数字,或者如果 x2 是正确的数字,则添加正确的名称。

对于第一部分,我想答案就在这里。虽然我需要以某种方式适应我的情况(所以任何帮助也将不胜感激)。对于第二部分,我所知道的只是使用“for”和“if”,这对于速度问题是不可接受的(甚至是不可能的)。

标签: rdataframedata.tablecompare

解决方案


您可以按名称合并找到正确的编号,然后您可以按编号合并找到正确的名称。

示例数据

library(data.table)

dt1 <- data.table(
    x1 = c("march", "april", "february", "march", "aripl"),
    x2 = c(3,4,2,35,4),
    x3 = c(198,4984,498,984,498)
)

dt2 <- data.table(
    x1 = c("january", "february", "march", "april", "may"),
    x2 = 1:5
)

解决方案:

# fix number by merging via name
result <- merge(dt1, dt2, by="x1", all.x=T)
result[ , corr_num := ifelse(is.na(x2.y), x2.x, x2.y)]
result[ , c("x2.x", "x2.y") := NULL]

# fix name by merging via number
result <- merge(result, dt2, by.x="corr_num", by.y="x2", all.x=T)
result[ , corr_name := x1.y]
result[ , c("x1.x", "x1.y") := NULL]

等效解决方案:

result <- merge(merge(dt1, dt2, by="x1", all.x=T), dt2, by.x="x2.x", by.y="x2", all.x=T)
result[ , corr_num  := ifelse(is.na(x2.y), x2.x, x2.y)]
result[ , corr_name := ifelse(is.na(x1.y), x1.x, x1.y)]
result[ , grep("\\.", names(result)) := NULL]

结果

> result[ , .(corr_name, corr_num, x3)]
   corr_name corr_num   x3
1:  february        2  498
2:     march        3  198
3:     april        4 4984
4:     april        4  498
5:     march        3  984

推荐阅读