Removing duplicate observations across columns within each row

Question

Here is a short example of the data frame I want to clean:

L3 <- LETTERS[1:5]    
fac<-c("fish", "meat", "chicken", "veg", "shrimp")

set.seed(1)
(d <- data.frame(code = sample(c(11:15)), 
      upc = sample(c(1:5)), desc = sample(fac), 
      desc1 = fac, desc2 = sample(fac), 
      desc3 = fac, desc4 = sample(fac) ))


  code upc    desc   desc1   desc2   desc3   desc4
1   12   5    meat    fish chicken    fish  shrimp
2   15   4    fish    meat  shrimp    meat    fish
3   14   2 chicken chicken     veg chicken    meat
4   13   3     veg     veg    fish     veg     veg
5   11   1  shrimp  shrimp    meat  shrimp chicken

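I am trying to write a general function (using a for loop and unique()) that checks, independently for each row, the entries in columns 3 through 7 and keeps only the values that are not repeated in another column (i.e., if a row contains "fish" in all of the desc columns, the new row should contain "fish" in only one column). More specifically, the desired result is: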

  code upc    desc desc1   desc2 desc3   desc4
1   12   5    meat  fish chicken        shrimp
2   15   4    fish  meat  shrimp              
3   14   2 chicken           veg          meat
4   13   3     veg          fish              
5   11   1  shrimp          meat       chicken

Tags: r, dataframe, for-loop, unique

Solution


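We can use duplicated() to find the elements that are repeated within each row of the "desc" columns and replace them with a blank string (""):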

nm1 <- grep('desc', names(d))
d[nm1] <- t(apply(d[nm1], 1, function(x) {replace(x, duplicated(x), "")}))
d
#  code upc    desc desc1   desc2 desc3   desc4
#1   12   5    meat  fish chicken        shrimp
#2   15   4    fish  meat  shrimp              
#3   14   2 chicken           veg          meat
#4   13   3     veg          fish              
#5   11   1  shrimp          meat       chicken
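
To see what happens inside the anonymous function, here is a minimal sketch on a single row. The vector row1 is written out by hand for illustration and is not taken from d. duplicated() flags every occurrence after the first, and replace() blanks those positions. Because apply() with MARGIN = 1 returns each row's result as a column, the answer wraps the call in t() to restore the original orientation.

```r
# One row of "desc" values, typed in by hand for illustration
row1 <- c(desc = "meat", desc1 = "fish", desc2 = "chicken",
          desc3 = "fish", desc4 = "shrimp")

# duplicated() marks every repeat after the first occurrence
dup <- duplicated(row1)

# replace() overwrites the flagged positions with an empty string
cleaned <- replace(row1, dup, "")
cleaned
```

The first "fish" (in desc1) is kept and only the later one (in desc3) is blanked, which matches the desired output in the question.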

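Or use a for loop (assuming the columns are of class character, or that the blank string has been added as one of the factor levels before the assignment):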

for(i in seq_len(nrow(d))) d[i, nm1] <- replace(d[i, nm1], 
                                     duplicated(unlist(d[i, nm1])), '')
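
If the "desc" columns are factors, one way to satisfy the assumption above is to convert them to character before the loop. The sketch below uses a small hand-made data frame d2 (a hypothetical example, not the d from the question) and sets stringsAsFactors = TRUE explicitly so the columns are factors regardless of the R version.

```r
# Hand-made example with factor columns (hypothetical data)
d2 <- data.frame(code  = c(12, 15),
                 desc  = c("meat", "fish"),
                 desc1 = c("fish", "fish"),
                 stringsAsFactors = TRUE)
cols <- grep("desc", names(d2))

# Convert the factor columns to character so that "" can be assigned
d2[cols] <- lapply(d2[cols], as.character)

# Same loop as in the answer, applied to the converted columns
for (i in seq_len(nrow(d2))) {
  d2[i, cols] <- replace(d2[i, cols],
                         duplicated(unlist(d2[i, cols])), "")
}
d2
```

Without the conversion, assigning "" to a factor column that lacks that level would produce NA with a warning instead of a blank.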
