How can I identify columns that are identical but in a different order, or that are contained in another column?

Problem description

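I have a dataset that captures the list of variables in each of several data files. Each column is one file, and shorter lists are padded with NA.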

It can be built with this code:

df<-structure(list(cxr.CSV = c("project", "Subject", "Site", "InstanceName", 
"RecordPosition", "CXRDT", "CXRFIND", "CXRFNDSP", "CXRYN", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), cy1.CSV = c("project", 
"Subject", "Site", "InstanceName", "RecordPosition", "CYSHPYN", 
"CYSHPDT", "CY1TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA), cy2.CSV = c("project", "Subject", 
"Site", "InstanceName", "RecordPosition", "CYSHPYN", "CYSHPDT", 
"CY2TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), cy24.CSV = c("project", "Subject", "Site", 
"InstanceName", "RecordPosition", "CYSHPYN", "CYSHPDT", "CY1TMPT", 
"CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA), cy3.CSV = c("project", "Subject", "Site", "InstanceName", 
"RecordPosition", "CYSHPYN", "CYSHPDT", "CY3TMPT", "CYND", "CYNDSP", 
"CYDT", "CYTM", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), cy6.CSV = c("project", 
"Subject", "Site", "InstanceName", "RecordPosition", "CYSHPYN", 
"CYSHPDT", "CY1TMPT", "CYND", "CYNDSP", "CYDT", "CYTM", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA), dlt.CSV = c("project", "Subject", 
"Site", "InstanceName", "RecordPosition", "DLTYN", "DLTAE", "DLTSP", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dm.CSV = c("project", 
"Subject", "Site", "InstanceName", "RecordPosition", "BRTHYR", 
"DMAGE", "SEX", "SEXSP", "FEMCBP", "FEMCBPSP", "RACE", "RACESP", 
"ETHNIC", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dov.CSV = c("project", 
"Subject", "Site", "InstanceName", "RecordPosition", "DOVDT", 
"DOVAE", "DOVCM", "DOVCP", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA), dov_1.CSV = c("project", "Subject", "Site", "InstanceName", 
"RecordPosition", "DOVDT", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), ds.CSV = c("project", "Subject", "Site", 
"InstanceName", "RecordPosition", "DSDT", "DSREAS", "DSORTH", 
"DSWCSP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    ds_1.CSV = c("project", "Subject", "Site", "InstanceName", 
    "RecordPosition", "DSDT", "DSREAS", "DSWCSP", "DSORTH", NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), dth.CSV = c("project", 
    "Subject", "Site", "InstanceName", "RecordPosition", "DTHFCDT", 
    "DTHDT", "DTHDUR", "DTHREAS", "DTHROTH", "DTHCOMM", NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), dv.CSV = c("project", 
    "Subject", "Site", "InstanceName", "RecordPosition", "DVYN", 
    "DVVIS", "DVIDDAT", "DVSTDAT", "DVENDAT", "DVCAT", "DVCATSP", 
    "DVCATCD", "DVTERM", "REWFLAG", "REWCOMP", "DVACN", "DVMETRPT", 
    "DVCLSDAT", "DVCLS", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), tegu.CSV = c("project", 
    "Subject", "Site", "InstanceName", "RecordPosition", "EGYN", 
    "EGDT", "EGNOU", "EGTM", "EGORRES", "EGHR", "EGPR", "EGQRS", 
    "EGQTINT", "ECGRR", "EGQTCFC", "EGQTCBC", "EGQTCNS", "EGQTCO", 
    "EGQTCOSP", "EGRSAB01", "EGRSAB02", "EGRSAB03", "EGRSAB04", 
    "EGRSAB05", "EGRSAB06", "EGRSAB07", "EGRSAB08", "EGRSAB09", 
    "EGRSAB10", "EGRSAB11", "EGRSAB12", "EGRSAB13", "EGABNCOM", 
    "EGABNCS", "EGTMPT", "EGND"), tegu_1.CSV = c("project", "Subject", 
    "Site", "InstanceName", "RecordPosition", "EGYN", "EGNOU", 
    "EGND", "EGTMPT", "EGDT", "EGTM", "EGORRES", "EGHR", "EGPR", 
    "EGQRS", "EGQTINT", "ECGRR", "EGQTCFC", "EGQTCBC", "EGQTCNS", 
    "EGQTCO", "EGQTCOSP", "EGRSAB01", "EGRSAB02", "EGRSAB03", 
    "EGRSAB04", "EGRSAB05", "EGRSAB06", "EGRSAB07", "EGRSAB08", 
    "EGRSAB09", "EGRSAB10", "EGRSAB11", "EGRSAB12", "EGRSAB13", 
    "EGABNCOM", "EGABNCS")), row.names = c(NA, -37L), class = c("tbl_df", 
"tbl", "data.frame"))

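I want to compare every column with every other column. If two datasets have the same variables, or one is completely contained in the other, they should be labeled with the same number. In the end, I would like a summary table that shows which group each dataset belongs to.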

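The output does not need to look exactly like that, as long as it captures the information. The tricky part: tegu.CSV and tegu_1.CSV, and likewise ds.CSV and ds_1.CSV, have the same list of variables in a different order, dov.CSV has all the variables that dov_1.CSV has, and so on. Each of these pairs needs to end up in the same group.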

How can I achieve this?

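Additional step: what if I only want datasets with exactly the same variables to share a group? In that case, dov.CSV and dov_1.CSV would be in different groups.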

Tags: r

Solution


Here is one solution. It is not very elegant, but it may help you:

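library(purrr)  # map(), map2(), map_dbl(), imap() and the %>% pipe
library(dplyr)  # tibble() and bind_rows()

# Drop the NA padding so each element holds only that dataset's variables
my_data <- df %>%
  map(~.x[!is.na(.x)])

# Element-wise setdiff for outer(): an empty result means a is contained in b
mySetDiff <- function(a, b) map2(a, b, setdiff)

# For every dataset, collect the datasets that contain all of its variables,
# then put the datasets with the most matches first
my_data <- my_data %>% 
  outer(., ., mySetDiff) %>%
  apply(1, function(x) colnames(df)[which(map_dbl(x, length) == 0)]) %>%
  .[order(map_dbl(., length), decreasing = TRUE)]

# Repeatedly take the first remaining dataset, gather everything linked to it
# into one group, and remove those datasets from the pool
i <- 1
my_list <- list()
repeat{
  if(length(my_data) == 0) break

  my_list[[i]] <- my_data[my_data[[1]]] %>% 
    unlist() %>%
    unique()

  my_data <- my_data[-which(names(my_data) %in% my_data[[1]])]

  i <- i + 1
}

# One row per dataset, together with its group number
my_list %>%
  imap(~tibble(Data = .x, Group = .y)) %>%
  bind_rows()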
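
The grouping above rests on one test: a variable list is contained in another when setdiff() between them is empty. A minimal base R sketch of that test, using the dov.CSV and dov_1.CSV lists from the question (is_contained is a hypothetical helper name, not part of the answer's code):

```r
# Variable lists copied from the question, with the NA padding removed
dov   <- c("project", "Subject", "Site", "InstanceName", "RecordPosition",
           "DOVDT", "DOVAE", "DOVCM", "DOVCP")
dov_1 <- c("project", "Subject", "Site", "InstanceName", "RecordPosition",
           "DOVDT")

# Hypothetical helper: TRUE when every element of a also appears in b.
# setdiff() ignores order, so reordered lists are contained in each other.
is_contained <- function(a, b) length(setdiff(a, b)) == 0

is_contained(dov_1, dov)  # TRUE: dov.CSV has everything dov_1.CSV has
is_contained(dov, dov_1)  # FALSE: DOVAE, DOVCM and DOVCP are missing
```

Because the test holds in only one direction for dov.CSV and dov_1.CSV, grouping on it puts a dataset together with its supersets, which is what the question asks for.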

Note that cy2.CSV and cy3.CSV contain CY2TMPT and CY3TMPT respectively, rather than CY1TMPT, so they should not be in the same group as cy1.CSV, cy6.CSV and cy24.CSV.
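
For the additional step in the question (grouping only datasets with exactly the same variables), one possible approach is to sort each variable list, collapse it into a single key string, and number the distinct keys. This is only a sketch under some assumptions: exact_groups and toy are hypothetical names, toy is a shortened stand-in for df, and variable names are assumed not to contain the "|" separator.

```r
# Toy stand-in for the question's df: four columns, NA-padded
toy <- data.frame(
  dov.CSV   = c("RecordPosition", "DOVDT", "DOVAE", "DOVCM", "DOVCP"),
  dov_1.CSV = c("RecordPosition", "DOVDT", NA, NA, NA),
  ds.CSV    = c("RecordPosition", "DSDT", "DSREAS", "DSORTH", "DSWCSP"),
  ds_1.CSV  = c("RecordPosition", "DSDT", "DSREAS", "DSWCSP", "DSORTH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: columns share a group only if their NA-free,
# sorted variable lists are identical
exact_groups <- function(data) {
  key <- vapply(
    data,
    function(x) paste(sort(x[!is.na(x)]), collapse = "|"),
    character(1)
  )
  data.frame(
    Data = names(key),
    Group = match(key, unique(key)),  # number keys in order of first appearance
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

exact_groups(toy)
# Group is 1, 2, 3, 3: dov.CSV and dov_1.CSV are split, ds.CSV and ds_1.CSV match
```

Calling exact_groups(df) on the full data from the question should work the same way, since sorting removes the order difference between tegu.CSV and tegu_1.CSV.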

