Handling consecutive occurrences when there are empty spaces

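Problem description

This question is related to my previous question about identifying occurrences of values in a data frame for each id.

This time I am trying to identify, for each id, non-w measurements of length 3 or more (not necessarily consecutive). These non-w measurements occur after a run of consecutive w occurrences, and that run must be at least 3 long. I don't know how to handle the empty spaces. Even after replacing them with NAs it still does not work.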

      id t1 t2 t3 t4 t5 t6 t7 t8 t9
      1           w  w  w  r  s  r # empty space t1:t3; 3 consecutive occ. of w and 3 non-consec. occ. after the last w at t6
      2        w  w  e  w  w  w  w # empty space t1:t2; 4 consec. occ. of w start at t6 but no non-w occ. after the last w 
      3  w  w  w  w  w  w  s  s  s # no empty space; 6 consec. w occ.; 3 non-w occ. start at t7
      4     w  w  w  w  w  w  w  w # t1 empty space; 8 consec. w occ. but no non-w occ. after the last w
      5  w  w  w  w  w  w  r  s  w # no empty space; consec w occ. till t6; 2 non-w occ. but not after the last occ. of w and not 3 times
      6     s  w  r  w  r  w  w  s # t1 empty space; 2 consec. occ. of w and 1 non-w occ. after the last w.
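
One possible reason the NA replacement misbehaves, sketched here under the assumption that the runs are detected with `rle()` (the asker's own attempt is not shown): comparing `NA` with `"w"` gives `NA`, and `rle()` treats every `NA` as its own run instead of merging neighbouring ones. Keeping the blanks as `""`, or mapping `NA` to `FALSE` before calling `rle()`, avoids this. The vector `x` below is the row for id 1 typed in by hand.

```r
# Row for id 1 from the table above: three blanks, three w's, then r, s, r
x <- c("", "", "", "w", "w", "w", "r", "s", "r")

# With blanks kept as "", the comparison is a clean TRUE/FALSE vector
runs_blank <- rle(x == "w")
runs_blank$lengths   # one run of blanks, one run of w, one run of non-w

# With blanks turned into NA, the comparison yields NA, and rle() does not
# merge neighbouring NAs, so each NA becomes its own run of length 1
x_na <- replace(x, x == "", NA)
runs_na <- rle(x_na == "w")
runs_na$lengths

# One workaround: treat NA as "not w" before calling rle()
is_w <- !is.na(x_na) & x_na == "w"
runs_fixed <- rle(is_w)
runs_fixed$lengths
```

With the workaround, the run lengths match those obtained from the blank strings, so the rest of the logic does not need to know whether blanks were stored as `""` or `NA`.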

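Example.

Below is an example of a run of consecutive w occurrences of length 3. From t1:t3 there are empty spaces; from t4:t6 there are 3 consecutive occurrences of w, and from t7:t9 there are 3 non-w occurrences (whether or not they are consecutive).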

  id t1 t2 t3 t4 t5 t6 t7 t8 t9
   1           w  w  w  r  s  r 

I want to save the non-w occurrences in a df as:

 id  t6  t7 t8 t9 
  1   w  r  s  r 

What I don't know is:

Ex. How can I find the position of the last w, which is t6 here?

   id t1 t2 t3 t4 t5 t6 t7 t8 t9
   1           w  w  w  r  s  r 
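
A minimal sketch of one way to locate the last w in a single row, assuming the row is held as a named character vector (the names `x`, `w_pos` and `last_w` are made up for illustration). Blanks compare as `FALSE` against `"w"`, so they need no special handling here.

```r
# Row for id 1 as a named character vector
x <- c(t1 = "", t2 = "", t3 = "", t4 = "w", t5 = "w", t6 = "w",
       t7 = "r", t8 = "s", t9 = "r")

# Indices of every w in the row
w_pos <- which(x == "w")

# Position of the last w, or NA when the row contains no w at all
last_w <- if (length(w_pos) > 0) max(w_pos) else NA_integer_

# Column name at that position
last_w_name <- names(x)[last_w]
last_w_name
```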

Ex. How can I determine whether there are non-w occurrences after the position of the last w (t6), i.e. in t7:t9?

   id t1 t2 t3 t4 t5 t6 t7 t8 t9
   1           w  w  w  r  s  r 
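
A sketch of one way to check what follows the last w, again on a hand-typed row (variable names are illustrative). Everything after the last w is non-w by construction, so the question reduces to counting the non-blank values there. Indexing with `seq_along(x) > last_w` rather than `(last_w + 1):length(x)` is a deliberate choice: the latter counts backwards when the last w sits in the final column.

```r
x <- c("", "", "", "w", "w", "w", "r", "s", "r")   # row for id 1

last_w <- max(which(x == "w"))

# Values strictly after the last w; empty when the last w is in the final column
after <- x[seq_along(x) > last_w]

# Count the non-blank values and test the "at least 3" condition
n_non_w <- sum(after != "")
has_tail <- n_non_w >= 3

# Row for id 4 ends in w, so nothing follows the last w
x4 <- c("", "w", "w", "w", "w", "w", "w", "w", "w")
after4 <- x4[seq_along(x4) > max(which(x4 == "w"))]
n_non_w4 <- sum(after4 != "")
```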

Sample data:

df <- structure(list(id = 1:7, t1 = c("","","w","","w","","w"), t2 = c("","","w","w","w","s","w"), t3 = c("","w","w","w","w","w","w"),
                     t4 = c("w","w","w","w","w","r","w"), t5 = c("w","e","w","w","w","w","r"), t6 = c("w","w","w","w","w","r","s"),
                     t7 = c("r","w","s","w","r","w","t"), t8 = c("r","w","s","w","s","w","v"), t9 = c("r","w","s","w","w","z","s")), row.names = c(NA, -7L), class = "data.frame")

df

Desired output df:

 id  t6 t7 t8 t9
  1  w  r  s  r 
  3  w  s  s  s

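There is also a special case where the last w does not fall at the same time point for every id. For example, for id 7 in the df below, the last occurrence of w is at t4 rather than at t6 as in the other cases.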

  id t1 t2 t3 t4 t5 t6 t7 t8 t9
  1           w  w  w  r  r  r
  2        w  w  e  w  w  w  w
  3  w  w  w  w  w  w  s  s  s
  4     w  w  w  w  w  w  w  w
  5  w  w  w  w  w  w  r  s  w
  6     s  w  r  w  r  w  w  z
  7  w  w  w  w  r  s  t  v  s

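This output would be more complicated. Would it be easier to drop the w's when the consecutive run has length at least 3, and keep the second part of the sequence when its length is also at least 3?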

 id t4 t5 t6 t7 t8 t9
  1        w  r  s  r
  3        w  s  s  s
  7  w  r  s  t  v  s
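
The idea of dropping the w's and keeping only the second part could be sketched roughly as follows. `tail_after_w` is a hypothetical helper written for this illustration, not a base R function, and it returns a plain vector per row (or `NULL`) instead of a padded table, so rows with tails of different lengths need no alignment.

```r
# Hypothetical helper: return the values after the final run of w's,
# or NULL when either that run or the tail is shorter than `min_len`
tail_after_w <- function(x, min_len = 3) {
  runs <- rle(x == "w")
  if (!any(runs$values)) return(NULL)             # no w at all
  last_run <- max(which(runs$values))             # index of the final w run
  end_w <- sum(runs$lengths[seq_len(last_run)])   # position of the last w
  tail_vals <- x[seq_along(x) > end_w]            # everything after it
  if (runs$lengths[last_run] >= min_len && length(tail_vals) >= min_len) {
    tail_vals
  } else {
    NULL
  }
}

row1 <- c("", "", "", "w", "w", "w", "r", "s", "r")
row5 <- c("w", "w", "w", "w", "w", "w", "r", "s", "w")
row7 <- c("w", "w", "w", "w", "r", "s", "t", "v", "s")

res1 <- tail_after_w(row1)   # tail after the run t4:t6
res5 <- tail_after_w(row5)   # final run of w has length 1, so NULL
res7 <- tail_after_w(row7)   # tail after the run t1:t4
```

Whether this is easier depends on what the result is used for: a list of vectors is simple to build, while a rectangular data frame still needs padding for the shorter tails.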

Tags: r, dataframe

Solution

Use apply row-wise:

mat <- apply(df[-1], 1, function(x) {
  # rle to find consecutive occurrences of w (blanks compare as FALSE)
  a1 <- rle(x == 'w')
  # Rows without any w cannot qualify
  if (!any(a1$values)) return(NULL)
  # Find the position of the last run of 'w' in the rle output
  a2 <- max(which(a1$values))
  # Find the position of the last 'w' in x
  a3 <- sum(a1$lengths[1:a2])
  # If the last run of w has length >= 3 and
  # there are at least 3 values after the last w
  if (a1$lengths[a2] >= 3 && length(x) >= a3 + 3)
    # Keep the last w and the values after it
    x[a3:length(x)]
})
# Get the length of each list element (0 for rows that returned NULL)
n <- lengths(mat)
# Max n is the number of t columns in the final data frame
m <- max(n)
# Left-pad shorter elements with NA so that all have length m
new_mat <- t(sapply(mat[n > 0], function(x) c(rep(NA, m - length(x)), unname(x))))
# Label the columns with the last m time points
colnames(new_mat) <- tail(names(df)[-1], m)
# Create a new data frame
data.frame(id = df$id[n > 0], new_mat)
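
The padding step can be looked at in isolation. The sketch below uses two hand-typed vectors standing in for the pieces kept for ids 1 and 7 (the names `pieces` and `padded` are illustrative) and left-pads the shorter one with `NA` so both fit in one matrix.

```r
# Two ragged results, like those kept for ids 1 and 7
pieces <- list(c("w", "r", "r", "r"),
               c("w", "r", "s", "t", "v", "s"))

# Width of the widest piece
m <- max(lengths(pieces))

# sapply() builds one column per piece; t() turns the columns into rows
padded <- t(sapply(pieces, function(p) c(rep(NA, m - length(p)), p)))

dim(padded)    # one row per piece, m columns
padded[1, ]    # the shorter piece, with NA in its first two cells
```

Padding on the left keeps the values right-aligned, which matches the layout where every kept sequence ends at t9.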

Data

df <- structure(list(id = 1:7, t1 = c("", "", "w", "", "w", "", "w"
), t2 = c("", "", "w", "w", "w", "s", "w"), t3 = c("", "w", "w", 
"w", "w", "w", "w"), t4 = c("w", "w", "w", "w", "w", "r", "w"
), t5 = c("w", "e", "w", "w", "w", "w", "r"), t6 = c("w", "w", 
"w", "w", "w", "r", "s"), t7 = c("r", "w", "s", "w", "r", "w", 
"t"), t8 = c("r", "w", "s", "w", "s", "w", "v"), t9 = c("r", 
"w", "s", "w", "w", "z", "s")), class = "data.frame", row.names = c(NA,-7L))
