How do I select just one of several variables that share the same prefix?

Problem description

This continues my previous question, "How to return multiple columns while ignoring NA values and group by other column names in R?"

Mexico_01 <- c(1,2,5,1,NA,1)
Mexico_02 <- c(3,NA,2,0,4,1)
Argentina_01 <- c(2,1,5,2,NA,2)
Argentina_02 <- c(2,3,NA,2,2,2)
Italy<- c(NA,10,10,10,NA,10)
Spain_01 <- c(2,NA,4,6,8,11)
Spain_02 <- c(3,4,NA,11,11,11)
England <- c(5,NA,10,NA,NA,12)
Germany <- c(1,NA,NA,NA,NA,10)
Data_Risk <- data.frame(Mexico_01, Mexico_02, Argentina_01, Argentina_02,
                        Italy, Spain_01, Spain_02, England, Germany)

# Load the packages before calling as.data.table()
library(data.table)
library(magrittr)
Data_Risk <- as.data.table(Data_Risk)
# (row, col) positions of every non-NA cell
all_variable <- as.data.table(which(!is.na(Data_Risk), arr.ind = TRUE))
all_variable[, .(colnm = names(Data_Risk)[col],
                 col = paste0('var', order(col))), by = row] %>%
  dcast(row ~ col, value.var = 'colnm')

   row      var1         var2         var3         var4     var5     var6     var7
1:   1 Mexico_01    Mexico_02 Argentina_01 Argentina_02 Spain_01 Spain_02  England
2:   2 Mexico_01 Argentina_01 Argentina_02        Italy Spain_02     <NA>     <NA>
3:   3 Mexico_01    Mexico_02 Argentina_01        Italy Spain_01  England     <NA>
4:   4 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02
5:   5 Mexico_02 Argentina_02     Spain_01     Spain_02     <NA>     <NA>     <NA>
6:   6 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02
      var8    var9
1: Germany    <NA>
2:    <NA>    <NA>
3:    <NA>    <NA>
4:    <NA>    <NA>
5:    <NA>    <NA>
6: England Germany

In this case, I only need one variable out of all the variables that share the same prefix. For example, instead of mexico_01 or mexico_02, select only mexico.

So the final table must look like this:

var1      var2         var3       var4       var5       var6
mexico    argentina    england    germany    null       null
mexico    argentina    italy      null       null       null
mexico    argentina    italy      spain      england    null
mexico    argentina    italy      spain      null       null
spain     null         null       null       null       null
mexico    argentina    italy      spain      england    germany

Tags: r, join

Solution


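We can split the column with tstrsplit, get the index of the duplicated entries based on the 'row' and 'V1' columns, assign NA to those elements of 'V1', and then do the dcast.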

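# Split each column name at "_" into a prefix (V1) and a suffix (V2)
out[, c("V1", "V2") := tstrsplit(colnm, "_")]
# Indices of entries that repeat a (row, V1) pair
i1 <- out[, .I[duplicated(.SD)], .SDcols = c('row', 'V1')]
# Set the repeated prefixes to NA
out[i1, V1 := NA_character_]
# Within each row, move the NA entries to the end
out[, V1 := V1[order(is.na(V1))], row]
# Reshape to wide format and drop the row column
dcast(out, row ~ col, value.var = "V1")[, row := NULL][]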

Data

out <-  all_variable [, .(colnm = names(Data_Risk)[col], 
         col = paste0('var',  order(col))) , by = row]
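
For readers who want to run the idea end to end, here is a minimal self-contained sketch of the same approach. It is not the answer's exact code: it assumes the prefix is everything before the first underscore, uses `sub()` and `unique(..., by = )` in place of `tstrsplit` and `duplicated`, and the names `long`, `prefix`, `var`, and `wide` are illustrative. The prefixes keep the capitalization of the original column names.

```r
library(data.table)

# Same sample data as in the question
Data_Risk <- data.table(
  Mexico_01    = c(1, 2, 5, 1, NA, 1),
  Mexico_02    = c(3, NA, 2, 0, 4, 1),
  Argentina_01 = c(2, 1, 5, 2, NA, 2),
  Argentina_02 = c(2, 3, NA, 2, 2, 2),
  Italy        = c(NA, 10, 10, 10, NA, 10),
  Spain_01     = c(2, NA, 4, 6, 8, 11),
  Spain_02     = c(3, 4, NA, 11, 11, 11),
  England      = c(5, NA, 10, NA, NA, 12),
  Germany      = c(1, NA, NA, NA, NA, 10)
)

# Long table of (row, col) positions that hold a non-NA value
long <- as.data.table(which(!is.na(as.matrix(Data_Risk)), arr.ind = TRUE))

# Prefix = column name with everything from the first "_" removed
long[, prefix := sub("_.*$", "", names(Data_Risk)[col])]

# Keep one entry per (row, prefix), preserving the original column order
long <- unique(long[order(row, col)], by = c("row", "prefix"))

# Number the surviving prefixes within each row, then reshape to wide
long[, var := paste0("var", seq_len(.N)), by = row]
wide <- dcast(long, row ~ var, value.var = "prefix")
print(wide)
```

Because duplicates are dropped before the columns are numbered, no NA entries need to be pushed to the end of each row afterwards; rows with fewer distinct prefixes are simply padded with NA by `dcast`.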
