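r - Return column differences when comparing rows in a grouped data frame
Question
I want to do pairwise comparisons within each group and return the rows that do not match, along with which columns differ. Below is a sample dataset to illustrate the problem; my actual data will have more rows and columns.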
data=structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Common_1 = c("A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Common_2 = c("C", "C", "C", "C", "C", "D",
"D", "D", "D", "D", "C", "C", "C", "C", "C", "D", "D", "D", "D",
"D"), Common_3 = c("X", "X", "X", "X", "X", "X", "X", "X", "X",
"X", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"), G = c(0,
1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0), var_1 = c(1,
3, 3, 3, 3, 1, 3, 2, 4, 3, 5, 5, 3, 4, 5, 1, 3, 5, 1, 4), var_2 = c("lev1",
"lev1", "lev2", "lev2", "lev1", "lev2", "lev2", "lev1", "lev1",
"lev2", "lev2", "lev2", "lev2", "lev1", "lev1", "lev1", "lev1",
"lev1", "lev2", "lev2"), var_3 = c("on", "on", "on", "off", "off",
"on", "on", "on", "off", "off", "on", "on", "on", "off", "off",
"on", "on", "on", "off", "off"), var_4 = c("up", "up", "down",
"down", "up", "down", "up", "down", "up", "up", "up", "up", "down",
"down", "up", "up", "up", "up", "down", "down")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
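ID is the unique identifier; Common_1, Common_2 and Common_3 are the grouping variables; G is the group I want to compare across; and the remaining columns, var_1:var_4, are the columns used to determine differences. The process compares each G=0 row with each G=1 row. If any var column differs, it returns the mismatched ID combination and which columns differ.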
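Here is the desired result for Common_1=A, Common_2=C, Common_3=X. It has the ID of the G=0 row, all grouping variables, the ID of the G=1 row that does not match (ID_diff), and indicator variables showing which columns differ.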
results=structure(list(ID = c(1, 1, 3, 3, 4, 4), Common_1 = c("A", "A",
"A", "A", "A", "A"), Common_2 = c("C", "C", "C", "C", "C", "C"
), Common_3 = c("X", "X", "X", "X", "X", "X"), G = c(0, 0, 0,
0, 0, 0), var_1 = c(1, 1, 0, 0, 0, 0), var_2 = c(0, 0, 1, 1,
1, 1), var_3 = c(0, 1, 0, 1, 1, 0), var_4 = c(0, 0, 1, 1, 1,
1), ID_diff = c(2, 5, 2, 5, 2, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
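Update: added an explanation of the results
I am doing pairwise comparisons of G=0 rows to G=1 rows. The first two rows of the results are derived as follows, within the same overall group Common_1=A, Common_2=C, Common_3=X:
ID=1 compared with ID=2: var_1 is different, so a 1 is placed in the var_1 column and 0 in the rest. ID_diff=2 because that is the ID that differs from ID=1.
ID=1 compared with ID=5: var_1 and var_3 are different, so a 1 is placed in each of those columns and 0 in the rest. ID_diff=5 because that is the ID that differs from ID=1.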
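The per-pair rule can be sketched for the ID=1 versus ID=5 comparison. This is only an illustration using values copied from the sample data; the names row_g0, row_g1 and diff_flags are made up for the example.

```r
# Values of var_1:var_4 for the two rows being compared (taken from the sample data)
row_g0 <- list(var_1 = 1, var_2 = "lev1", var_3 = "on",  var_4 = "up")  # ID = 1, G = 0
row_g1 <- list(var_1 = 3, var_2 = "lev1", var_3 = "off", var_4 = "up")  # ID = 5, G = 1

# Compare column by column: 1 where the values differ, 0 where they match
diff_flags <- as.integer(mapply(function(a, b) a != b, row_g0, row_g1))
names(diff_flags) <- names(row_g0)

print(diff_flags)
# var_1 var_2 var_3 var_4
#     1     0     1     0
```

The flags match the second row of the desired results, where var_1 and var_3 are 1 and ID_diff is 5.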
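I tried writing a function that loops through each G=0 case and compares it with each G=1 case, but I am having trouble extracting the mismatch information. Thanks for your help.
Ronak Shah's solution produces the right values, but I cannot get the results to display correctly.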
> var_col <- grep('^var', names(data))
>
> apply_fun <- function(tmp) {
+ df1 <- subset(tmp, G == 0)
+ df2 <- subset(tmp, G == 1)
+ lapply(seq(nrow(df1)), function(x) {
+ df3 <- df1[rep(x, nrow(df2)), ]
+ df3$ID_diff <- df2$ID
+ df3[var_col] <- +(df1[rep(x, nrow(df2)), var_col] != df2[var_col])
+ df3
+ })
+ }
>
>
> library(dplyr)
> data %>%
+ group_by(across(starts_with('Common'))) %>%
+ summarise(data = apply_fun(cur_data_all())) %>%
+ ungroup %>%
+ select(data) %>%
+ tidyr::unnest(data)
`summarise()` regrouping output by 'Common_1', 'Common_2', 'Common_3' (override with `.groups` argument)
# A tibble: 22 x 10
ID Common_1 Common_2 Common_3 G var_1[,1] [,2] [,3] [,4] var_2[,1] [,2] [,3] [,4] var_3[,1] [,2] [,3] [,4] var_4[,1] [,2]
<dbl> <chr> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 A C X 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
2 1 A C X 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
3 3 A C X 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1
4 3 A C X 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
5 4 A C X 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
6 4 A C X 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1
7 7 A D X 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0
8 8 A D X 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
9 9 A D X 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10 10 A D X 0 1 0 1 1 1 0 1 1 1 0 1 1 1 0
# ... with 12 more rows, and 3 more variables: [,3] <int>, [,4] <int>, ID_diff <dbl>
Solution
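One way to get the desired layout is sketched below in base R. It is a sketch under stated assumptions, not the answer originally posted: the function name compare_groups and the toy data frame are hypothetical, and toy holds only the Common_1=A, Common_2=C, Common_3=X rows of the sample data. The idea is to pair every G=0 row with every G=1 row of the same group using merge(), then build each indicator column separately, so no matrix is ever assigned into the data frame.

```r
# Toy subset of the question's data: the Common_1 = A, Common_2 = C, Common_3 = X group
toy <- data.frame(
  ID = 1:5,
  Common_1 = "A", Common_2 = "C", Common_3 = "X",
  G = c(0, 1, 0, 0, 1),
  var_1 = c(1, 3, 3, 3, 3),
  var_2 = c("lev1", "lev1", "lev2", "lev2", "lev1"),
  var_3 = c("on", "on", "on", "off", "off"),
  var_4 = c("up", "up", "down", "down", "up"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair G = 0 rows with G = 1 rows within each group
# and flag, per var column, whether the two rows differ.
compare_groups <- function(df, group_cols, var_cols) {
  left  <- df[df$G == 0, c("ID", group_cols, "G", var_cols)]
  right <- df[df$G == 1, c("ID", group_cols, var_cols)]
  names(right)[names(right) == "ID"] <- "ID_diff"

  # Many-to-many merge on the grouping columns gives all G0 x G1 pairs per group;
  # the G = 1 copies of the var columns get the suffix ".g1"
  pairs <- merge(left, right, by = group_cols, suffixes = c("", ".g1"))

  # Replace each var column with a 0/1 indicator, one column at a time
  for (v in var_cols) {
    g1_col <- paste0(v, ".g1")
    pairs[[v]] <- as.integer(pairs[[v]] != pairs[[g1_col]])
    pairs[[g1_col]] <- NULL
  }

  # Keep only pairs that differ in at least one var column
  pairs <- pairs[rowSums(pairs[var_cols]) > 0, ]
  pairs <- pairs[order(pairs$ID, pairs$ID_diff),
                 c("ID", group_cols, "G", var_cols, "ID_diff")]
  rownames(pairs) <- NULL
  pairs
}

res <- compare_groups(
  toy,
  group_cols = c("Common_1", "Common_2", "Common_3"),
  var_cols   = paste0("var_", 1:4)
)
print(res)
```

On the toy subset this gives six rows whose indicator columns and ID_diff values agree with the desired results shown in the question. Passing the full data instead of toy should work the same way, since merge() only pairs rows that share all three grouping columns.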