首页 > 解决方案 > 比较两个相似元数据的名称和选项

问题描述

我有两个类似的元数据大约。1000 条记录和 500 多列。我想检查两个数据框之间的一致性。现在我想创建一个新的数据框,它将在第一行中显示 df1 的所有列名,并与第二行中的 df2 列名相同,并且分别在第 3 列和第 4 列中显示它们的选项。然后改变新列以显示TRUEFALSE列名及其选项是否匹配。

基本上我必须检查 df1 中的列名是否与 df 2 完全一致,以及 df1 的所有列中的选项是否与 df2 完全匹配

df1 <- data.frame(ID =c("DEV2962","KTN2252","ANA2719","ITI2624","DEV2698","HRT2921",NA,"KTN2624","ANA2548","ITI2535","DEV2732","HRT2837","ERV2951","KTN2542","ANA2813","ITI2210"),
                  city=c("del","mum","nav","pun","bang","chen","triv","vish","del","mum","bang","vish","bhop","kol","noi","gurg"),
                  Name= c("dev,akash","singh,rahul","abbas,salman","lal,ram","singh,nkunj","garg,prabal","ali,sanu","singh,kunal","tomar,lakhan","thakur,praveen","ali,sarman","khan,zuber","singh,giriraj","sharma,lokesh","pawar,pooja","sharma,nikita"),
                  gender =c("m","f","m","f","m","m","m","m","m","m","m","f","m","f","m","m"))

df2 <- data.frame(ID =c("DEV2962","KTN2152","ANA2719","ITs2624","DEV2698","HRT2921",NA,"KTN2624","ANA2548","ITI2535","DEV2732","HRT2837","ERV2951","KTN2542","ANA2813","ITI2210"),
                  city=c("del","MUm"," nav","pun","bang","chen"," ddgy ","vish","del","mum","bang","vish","bhol","nhus","huay","gurg"),
                  Name= c("dev","singh,rahul","abbas,salman","lal,ram","singh,nkunj","huna,ghalak","khan,fhalt","singh,kunal","tomar,lakhan","thakur,praveen","ali,sarman","khan,zuber","singh,giriraj","sharma,lokesh","pawar,pooja","sharma,nikita"),
                  gender =c("m","f","m","f","m","m","m","male","m","male","m","f","m","f","m","m"))


df <- select(df1,matches("^[A-Z]"))
comp <- do.call(rbind, lapply(df[, 4:ncol(df)], function(option) as.data.frame(table(option)))) 
comp$variable <- gsub("[.](.*)","", rownames(comp)) 
rownames(comp) <- NULL
comp <- comp[, c(3,1,2)] 
comp <- comp[order(-comp$Freq), ]

我无法附加该文件,但我已在所需输出下方显示了图片。

输出应该是这样的。 在此处输入图像描述

在此处输入图像描述

标签: rdataframecompare

解决方案


一个类似于 Tarjae 的 tidyverse 选项。

library(dplyr)
library(tidyr)

one <-  pivot_longer(df1, everything()) %>%
  rename(names_df1 = name, options_df1 = value)

two <-  pivot_longer(df2, everything()) %>%
  rename(names_df2 = name, options_df2 = value)

one %>%
  bind_cols(two) %>%
  mutate(names_matching = names_df1 == names_df2,
         options_matching = options_df1 == options_df2) %>%
  arrange(names_df1, names_df2)

# # A tibble: 64 x 6
# names_df1 options_df1 names_df2 options_df2 names_matching options_matching
#   <chr>     <chr>       <chr>      <chr>      <lgl>          <lgl>           
# 1 city      del         city      "del"       TRUE           TRUE            
# 2 city      mum         city      "MUm"       TRUE           FALSE           
# 3 city      nav         city      " nav"      TRUE           FALSE           
# 4 city      pun         city      "pun"       TRUE           TRUE            
# 5 city      bang        city      "bang"      TRUE           TRUE            
# 6 city      chen        city      "chen"      TRUE           TRUE            
# 7 city      triv        city      " ddgy "    TRUE           FALSE           
# 8 city      vish        city      "vish"      TRUE           TRUE            
# 9 city      del         city      "del"       TRUE           TRUE

推荐阅读