Returning column differences when comparing rows in a grouped data frame

Problem description

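I want to make pairwise comparisons within each group and return the rows that do not match, along with which columns differ. Below is an example dataset that illustrates the problem; my actual data will have more rows and columns.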

data=structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), Common_1 = c("A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B"), Common_2 = c("C", "C", "C", "C", "C", "D", 
"D", "D", "D", "D", "C", "C", "C", "C", "C", "D", "D", "D", "D", 
"D"), Common_3 = c("X", "X", "X", "X", "X", "X", "X", "X", "X", 
"X", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"), G = c(0, 
1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0), var_1 = c(1, 
3, 3, 3, 3, 1, 3, 2, 4, 3, 5, 5, 3, 4, 5, 1, 3, 5, 1, 4), var_2 = c("lev1", 
"lev1", "lev2", "lev2", "lev1", "lev2", "lev2", "lev1", "lev1", 
"lev2", "lev2", "lev2", "lev2", "lev1", "lev1", "lev1", "lev1", 
"lev1", "lev2", "lev2"), var_3 = c("on", "on", "on", "off", "off", 
"on", "on", "on", "off", "off", "on", "on", "on", "off", "off", 
"on", "on", "on", "off", "off"), var_4 = c("up", "up", "down", 
"down", "up", "down", "up", "down", "up", "up", "up", "up", "down", 
"down", "up", "up", "up", "up", "down", "down")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

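ID is a unique identifier; Common_1, Common_2, and Common_3 are the grouping variables; G is the group I want to compare across; and the remaining columns var_1:var_4 are the columns used to determine differences. Within each group, the procedure compares every row with G=0 against every row with G=1. If any var column differs, it returns the mismatched ID combination and which columns differ.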

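Here is the desired result for Common_1=A, Common_2=C, Common_3=X. It has the ID of the G=0 row, all grouping variables, the ID of the mismatched G=1 row (ID_diff), and indicator variables showing which columns differ.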

results=structure(list(ID = c(1, 1, 3, 3, 4, 4), Common_1 = c("A", "A", 
"A", "A", "A", "A"), Common_2 = c("C", "C", "C", "C", "C", "C"
), Common_3 = c("X", "X", "X", "X", "X", "X"), G = c(0, 0, 0, 
0, 0, 0), var_1 = c(1, 1, 0, 0, 0, 0), var_2 = c(0, 0, 1, 1, 
1, 1), var_3 = c(0, 1, 0, 1, 1, 0), var_4 = c(0, 0, 1, 1, 1, 
1), ID_diff = c(2, 5, 2, 5, 2, 5)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Update: added an explanation of the results

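I am making pairwise comparisons from G=0 to G=1. The first two rows of the results are derived as follows, within the same overall group Common_1=A, Common_2=C, Common_3=X.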

First, compare ID=1 with ID=2.

var_1 differs, so a 1 is placed in the var_1 column and the rest are 0. ID_diff=2 because that is the ID that differs from ID=1.

Next, compare ID=1 with ID=5.

var_1 and var_3 differ, so a 1 is placed in each of those columns and the rest are 0. ID_diff=5 because that is the ID that differs from ID=1.
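The two comparisons just described can be reproduced with a small base R sketch. It rebuilds only the A / C / X rows from the example data; the names `grp`, `ref`, `others`, `flags`, and `first_two` are introduced here for illustration and are not part of the original question.

```r
# Rows of the Common_1=A, Common_2=C, Common_3=X group, copied from the example data
grp <- data.frame(
  ID    = c(1, 2, 3, 4, 5),
  G     = c(0, 1, 0, 0, 1),
  var_1 = c(1, 3, 3, 3, 3),
  var_2 = c("lev1", "lev1", "lev2", "lev2", "lev1"),
  var_3 = c("on", "on", "on", "off", "off"),
  var_4 = c("up", "up", "down", "down", "up"),
  stringsAsFactors = FALSE
)
var_cols <- grep("^var", names(grp), value = TRUE)

ref    <- grp[grp$ID == 1, var_cols]  # the single G = 0 row being compared
others <- grp[grp$G == 1, ]           # the G = 1 rows (ID 2 and ID 5)

# Compare one column at a time; each comparison gives one 0/1 flag per G = 1 row
flags <- sapply(var_cols, function(v) as.integer(others[[v]] != ref[[v]]))

first_two <- data.frame(ID = 1, flags, ID_diff = others$ID)
print(first_two)
```

The resulting two rows should correspond to the first two rows of the desired `results` object: only var_1 is flagged against ID 2, and var_1 and var_3 are flagged against ID 5.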

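I tried writing a function that loops over each case with G=0 and compares it with each case with G=1, but I am having trouble extracting the mismatch information. Thanks for your help.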

The solution from Ronak Shah works, but I can't get the results to display correctly.

> var_col <- grep('^var', names(data))
> 
> apply_fun <- function(tmp) {
+     df1 <- subset(tmp, G == 0)
+     df2 <- subset(tmp, G == 1)
+     lapply(seq(nrow(df1)), function(x) {
+         df3 <- df1[rep(x, nrow(df2)), ]
+         df3$ID_diff <- df2$ID
+         df3[var_col] <- +(df1[rep(x, nrow(df2)), var_col] != df2[var_col])
+         df3
+     })
+ }
> 
> 
> library(dplyr)
> data %>%
+     group_by(across(starts_with('Common'))) %>%
+     summarise(data = apply_fun(cur_data_all())) %>%
+     ungroup %>%
+     select(data) %>%
+     tidyr::unnest(data)
`summarise()` regrouping output by 'Common_1', 'Common_2', 'Common_3' (override with `.groups` argument)
# A tibble: 22 x 10
      ID Common_1 Common_2 Common_3     G var_1[,1]  [,2]  [,3]  [,4] var_2[,1]  [,2]  [,3]  [,4] var_3[,1]  [,2]  [,3]  [,4] var_4[,1]  [,2]
   <dbl> <chr>    <chr>    <chr>    <dbl>     <int> <int> <int> <int>     <int> <int> <int> <int>     <int> <int> <int> <int>     <int> <int>
 1     1 A        C        X            0         1     0     0     0         1     0     0     0         1     0     0     0         1     0
 2     1 A        C        X            0         1     0     1     0         1     0     1     0         1     0     1     0         1     0
 3     3 A        C        X            0         0     1     0     1         0     1     0     1         0     1     0     1         0     1
 4     3 A        C        X            0         0     1     1     1         0     1     1     1         0     1     1     1         0     1
 5     4 A        C        X            0         0     1     1     1         0     1     1     1         0     1     1     1         0     1
 6     4 A        C        X            0         0     1     0     1         0     1     0     1         0     1     0     1         0     1
 7     7 A        D        X            0         1     0     0     1         1     0     0     1         1     0     0     1         1     0
 8     8 A        D        X            0         1     1     0     0         1     1     0     0         1     1     0     0         1     1
 9     9 A        D        X            0         1     1     1     1         1     1     1     1         1     1     1     1         1     1
10    10 A        D        X            0         1     0     1     1         1     0     1     1         1     0     1     1         1     0
# ... with 12 more rows, and 3 more variables: [,3] <int>, [,4] <int>, ID_diff <dbl>

Tags: r, dataframe

Solution
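The printed output in the question shows headers such as `var_1[,1] [,2] [,3] [,4]`, which suggests that each var column ended up holding a whole comparison matrix rather than a single 0/1 vector. One possible way around this, sketched below in base R under that assumption, is to fill the indicator columns one at a time instead of assigning a matrix to several columns at once. This is an adaptation of the `apply_fun` idea from the question, not the original answer; `dat`, `common_cols`, `var_cols`, and `out` are names introduced here, and `dat` contains only the first ten rows of the example data as a plain data.frame.

```r
# First ten rows of the example data (groups A/C/X and A/D/X) as a plain data.frame
dat <- data.frame(
  ID       = 1:10,
  Common_1 = "A",
  Common_2 = rep(c("C", "D"), each = 5),
  Common_3 = "X",
  G        = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 0),
  var_1    = c(1, 3, 3, 3, 3, 1, 3, 2, 4, 3),
  var_2    = c("lev1", "lev1", "lev2", "lev2", "lev1",
               "lev2", "lev2", "lev1", "lev1", "lev2"),
  var_3    = c("on", "on", "on", "off", "off",
               "on", "on", "on", "off", "off"),
  var_4    = c("up", "up", "down", "down", "up",
               "down", "up", "down", "up", "up"),
  stringsAsFactors = FALSE
)
common_cols <- grep("^Common", names(dat), value = TRUE)
var_cols    <- grep("^var", names(dat), value = TRUE)

apply_fun <- function(tmp) {
  df1 <- tmp[tmp$G == 0, ]
  df2 <- tmp[tmp$G == 1, ]
  pieces <- lapply(seq_len(nrow(df1)), function(x) {
    # Repeat the x-th G = 0 row once per G = 1 row
    df3 <- df1[rep(x, nrow(df2)), ]
    # Fill each indicator column separately so it stays a plain integer vector
    for (v in var_cols) {
      df3[[v]] <- as.integer(df3[[v]] != df2[[v]])
    }
    df3$ID_diff <- df2$ID
    df3
  })
  do.call(rbind, pieces)
}

# Apply per group of the Common_* columns, then stack the pieces
groups <- split(dat, dat[common_cols], drop = TRUE)
out <- do.call(rbind, lapply(groups, apply_fun))
rownames(out) <- NULL
print(out)
```

With this subset, `out` should contain one row per G=0 / G=1 pair within each group, with ordinary integer columns var_1 to var_4 and a single ID_diff column. Whether the same change is enough inside the dplyr pipeline from the question has not been verified here.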


