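r - How can I identify columns that hold the same values in a different order, or that are contained in another column?
Problem description
The data can be built with the following code: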
df<-structure(list(cxr.CSV = c("project", "Subject", "Site", "InstanceName",
"RecordPosition", "CXRDT", "CXRFIND", "CXRFNDSP", "CXRYN", NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), cy1.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "CYSHPYN",
"CYSHPDT", "CY1TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), cy2.CSV = c("project", "Subject",
"Site", "InstanceName", "RecordPosition", "CYSHPYN", "CYSHPDT",
"CY2TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), cy24.CSV = c("project", "Subject", "Site",
"InstanceName", "RecordPosition", "CYSHPYN", "CYSHPDT", "CY1TMPT",
"CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), cy3.CSV = c("project", "Subject", "Site", "InstanceName",
"RecordPosition", "CYSHPYN", "CYSHPDT", "CY3TMPT", "CYND", "CYNDSP",
"CYDT", "CYTM", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), cy6.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "CYSHPYN",
"CYSHPDT", "CY1TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), dlt.CSV = c("project", "Subject",
"Site", "InstanceName", "RecordPosition", "DLTYN", "DLTAE", "DLTSP",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dm.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "BRTHYR",
"DMAGE", "SEX", "SEXSP", "FEMCBP", "FEMCBPSP", "RACE", "RACESP",
"ETHNIC", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dov.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "DOVDT",
"DOVAE", "DOVCM", "DOVCP", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), dov_1.CSV = c("project", "Subject", "Site", "InstanceName",
"RecordPosition", "DOVDT", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), ds.CSV = c("project", "Subject", "Site",
"InstanceName", "RecordPosition", "DSDT", "DSREAS", "DSORTH",
"DSWCSP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ds_1.CSV = c("project", "Subject", "Site", "InstanceName",
"RecordPosition", "DSDT", "DSREAS", "DSWCSP", "DSORTH", NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dth.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "DTHFCDT",
"DTHDT", "DTHDUR", "DTHREAS", "DTHROTH", "DTHCOMM", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), dv.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "DVYN",
"DVVIS", "DVIDDAT", "DVSTDAT", "DVENDAT", "DVCAT", "DVCATSP",
"DVCATCD", "DVTERM", "REWFLAG", "REWCOMP", "DVACN", "DVMETRPT",
"DVCLSDAT", "DVCLS", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), tegu.CSV = c("project",
"Subject", "Site", "InstanceName", "RecordPosition", "EGYN",
"EGDT", "EGNOU", "EGTM", "EGORRES", "EGHR", "EGPR", "EGQRS",
"EGQTINT", "ECGRR", "EGQTCFC", "EGQTCBC", "EGQTCNS", "EGQTCO",
"EGQTCOSP", "EGRSAB01", "EGRSAB02", "EGRSAB03", "EGRSAB04",
"EGRSAB05", "EGRSAB06", "EGRSAB07", "EGRSAB08", "EGRSAB09",
"EGRSAB10", "EGRSAB11", "EGRSAB12", "EGRSAB13", "EGABNCOM",
"EGABNCS", "EGTMPT", "EGND"), tegu_1.CSV = c("project", "Subject",
"Site", "InstanceName", "RecordPosition", "EGYN", "EGNOU",
"EGND", "EGTMPT", "EGDT", "EGTM", "EGORRES", "EGHR", "EGPR",
"EGQRS", "EGQTINT", "ECGRR", "EGQTCFC", "EGQTCBC", "EGQTCNS",
"EGQTCO", "EGQTCOSP", "EGRSAB01", "EGRSAB02", "EGRSAB03",
"EGRSAB04", "EGRSAB05", "EGRSAB06", "EGRSAB07", "EGRSAB08",
"EGRSAB09", "EGRSAB10", "EGRSAB11", "EGRSAB12", "EGRSAB13",
"EGABNCOM", "EGABNCS")), row.names = c(NA, -37L), class = c("tbl_df",
"tbl", "data.frame"))
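I want to compare every column with every other column. If two datasets have the same variables, or one is completely contained in the other, they should be labelled with the same number. In the end, I would like a summary table that lists each dataset alongside its group number.
The output does not need to look exactly like that, as long as it captures the information. The tricky part is that tegu.CSV and tegu_1.CSV, as well as ds.CSV and ds_1.CSV, hold the same list of variables in a different order, while dov.CSV has every variable that dov_1.CSV has, and so on. These need to end up in the same group.
How can I achieve this?
Additional step: what if I only want datasets with exactly the same variables to share a group? In that case, dov.CSV and dov_1.CSV would be in different groups.
Solution
Here is one solution. It is not very elegant, but it may help you: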
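library(purrr)  # map(), map2(), map_dbl(), imap(); also re-exports %>%
library(dplyr)  # tibble(), bind_rows()

# Drop the NA padding so each element holds only a dataset's variable names
my_data <- df %>%
  map(~ .x[!is.na(.x)])

# Element-wise setdiff over two lists, so it can be passed to outer()
mySetDiff <- function(a, b) map2(a, b, setdiff)

# For each dataset, find every dataset that contains all of its variables
# (an empty setdiff), then put the datasets with the most matches first
my_data <- my_data %>%
  outer(., ., mySetDiff) %>%
  apply(1, function(x) colnames(df)[which(map_dbl(x, length) == 0)]) %>%
  .[order(map_dbl(., length), decreasing = TRUE)]

# Repeatedly take the first remaining dataset, collect everything linked to
# its matches into one group, and remove those matches from the pool
i <- 1
my_list <- list()
repeat {
  if (length(my_data) == 0) break
  my_list[[i]] <- my_data[my_data[[1]]] %>%
    unlist() %>%
    unique()
  my_data <- my_data[-which(names(my_data) %in% my_data[[1]])]
  i <- i + 1
}

# Stack the groups into a table of dataset names and group numbers
my_list %>%
  imap(~ tibble(Data = .x, Group = .y)) %>%
  bind_rows()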
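The test at the heart of this pipeline is whether one variable list is fully contained in another, regardless of order. A minimal base R sketch of that test is shown here, using four of the question's columns re-typed by hand without the NA padding. The helper name `is_contained` is hypothetical and not part of the original answer.

```r
# Variable lists copied from the dov.CSV, dov_1.CSV, ds.CSV and ds_1.CSV
# columns of the question's data, with the NA padding left out.
common <- c("project", "Subject", "Site", "InstanceName", "RecordPosition")
dov   <- c(common, "DOVDT", "DOVAE", "DOVCM", "DOVCP")
dov_1 <- c(common, "DOVDT")
ds    <- c(common, "DSDT", "DSREAS", "DSORTH", "DSWCSP")
ds_1  <- c(common, "DSDT", "DSREAS", "DSWCSP", "DSORTH")

# Hypothetical helper: TRUE when every variable in `a` also appears in `b`.
# An empty setdiff(a, b) means nothing in `a` is missing from `b`.
is_contained <- function(a, b) length(setdiff(a, b)) == 0

is_contained(dov_1, dov)  # TRUE: dov.CSV has every variable of dov_1.CSV
is_contained(dov, dov_1)  # FALSE: the reverse does not hold
setequal(ds, ds_1)        # TRUE: same variables, different order
```

Because `setdiff()` and `setequal()` treat their arguments as sets, the order of the variables has no effect on the result.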
Note that cy2.CSV and cy3.CSV contain CY2TMPT and CY3TMPT respectively, so they should not be in the same group as cy1.CSV, cy6.CSV, and cy24.CSV.
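For the additional step in the question (grouping only datasets whose variables match exactly), one possible approach is to give each column an order-independent signature and number the distinct signatures. This is a sketch that is not part of the original answer. The `toy` data frame is a shortened, hypothetical stand-in for the question's `df`, and the `"|"` separator assumes no variable name contains that character.

```r
# Shortened stand-in for the question's data: one column per dataset,
# padded with NA to a common length.
toy <- data.frame(
  ds.CSV    = c("Subject", "DSDT", "DSREAS", "DSORTH"),
  ds_1.CSV  = c("Subject", "DSREAS", "DSORTH", "DSDT"),
  dov.CSV   = c("Subject", "DOVDT", "DOVAE", NA),
  dov_1.CSV = c("Subject", "DOVDT", NA, NA),
  stringsAsFactors = FALSE
)

# Order-independent signature: drop NA, sort the names, collapse to one string.
sig <- vapply(
  toy,
  function(x) paste(sort(x[!is.na(x)]), collapse = "|"),
  character(1)
)

# Columns with an identical signature share a group number,
# numbered in order of first appearance.
exact_groups <- data.frame(
  Data  = names(sig),
  Group = match(sig, unique(sig)),
  row.names = NULL
)
exact_groups
```

With this toy input, ds.CSV and ds_1.CSV share group 1, while dov.CSV and dov_1.CSV fall into groups 2 and 3 because one only contains the other. Applying the same steps to the full `df` should work the same way, since only the column contents are used.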