Filtering rows between multiple events per subject

Question

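I have a large data set and I am trying to filter the days that follow a specific event for each subject. The problem is that the "event" of interest may occur multiple times for some subjects, and for a few subjects it never occurs at all (in which case they can be dropped from the summarized data).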

Here is an example of the data and what I have tried:

library(tidyverse)

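set.seed(355)
# Five subjects (A-E) with 40 rows each
subject <- c(rep(LETTERS[1:4], each = 40), rep("E", times = 40))
# Subjects A-D get a rare event (5% chance per row); subject E never has one
event <- c(sample(0:1, size = length(subject) - 40, replace = TRUE, prob = c(0.95, 0.05)), rep(0, times = 40))
df <- data.frame(subject, event)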


df %>%
    filter(event == 1) %>%
    count(subject, event, sort = T)

# A tibble: 4 x 3
  subject event     n
  <fct>   <dbl> <int>
1 D           1     3
2 A           1     2
3 B           1     2
4 C           1     2

So we see that subject D had the event 3 times, while subjects A, B and C each had it 2 times. Subject E never had the event at all.

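My next step was to create an "event" label that identifies where each event occurs and is NA for all other rows. I also created an event sequence that numbers the rows between events because I thought it might be useful, but I ended up not using it.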

df_cleaned <- df %>%
    group_by(subject, event) %>%
    mutate(event_seq = seq_along(event == 1),
        event_detail = ifelse(event == 1, "event", NA)) %>%
    as.data.frame() 

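I tried two different approaches using filter() and between() to get each event and the 2 rows after it. Because there are multiple events within a subject, both approaches produce an error. I can't think of a good workaround.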

Approach 1:

df_cleaned %>%
    group_by(subject) %>%
    filter(., between(row_number(), 
        left = which(!is.na(event_detail)),
        right = which(!is.na(event_detail)) + 1))

Approach 2:

df_cleaned %>%
    group_by(subject) %>%
    mutate(event_group = cumsum(!is.na(event_detail))) %>%
    filter(., between(row_number(), left = which(event_detail == "event"), right = which(event_detail == "event") + 2))

Tags: r, filter, tidyverse

Answer


If you want the rows where event is 1 plus the two rows that follow each of them, you can do the following. Ananda Mahto, the author of the splitstackshape package, also wrote getMyRows() (in his SOfun package on GitHub), which handles this type of operation and returns a list. The range argument specifies which rows to take relative to each match. Here I used 0:2, which asks R to take each row with 1 in event and the following two rows. I used bind_rows() to combine the list into a data frame, but if you need to work with a list you can skip that step.

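# install.packages("remotes")  # if remotes is not installed yet
remotes::install_github("mrdwab/SOfun")
library(SOfun)
library(dplyr)

# Row positions (in the whole data frame) where an event occurs
ind <- which(x = df$event == 1)
# Take each event row plus the next two rows, then stack the list into one data frame
bind_rows(getMyRows(data = df, pattern = ind, range = 0:2))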

   subject event
1        A     1
2        A     0
3        A     0
4        A     1
5        A     0
6        A     0
7        B     1
8        B     0
9        B     0
10       B     1
11       B     0
12       B     0
13       C     1
14       C     0
15       C     0
16       C     1
17       C     0
18       C     0
19       D     1
20       D     0
21       D     0
22       D     1
23       D     0
24       D     0
25       D     1
26       D     0
27       D     0
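
If installing a package from GitHub is not an option, the same idea can be sketched with dplyr alone. This is only a rough alternative, not the method above: keep_event_windows() and the toy data frame are hypothetical names made up for illustration, and the helper assumes the columns are called subject and event as in the question. It works within each subject, so a window never runs into the next subject's rows (as far as I can tell, indexing on the whole data frame with which() does not know about subject boundaries), and subjects without any event drop out.

```r
library(dplyr)

# Hypothetical helper (not part of SOfun or dplyr): keep each event row and
# the `n_after` rows that follow it, separately for every subject.
keep_event_windows <- function(data, n_after = 2) {
  data %>%
    group_by(subject) %>%
    mutate(
      row_in_subject = row_number(),
      # Row position of the most recent event within this subject (0 if none yet)
      last_event_row = cummax(if_else(event == 1, row_in_subject, 0L))
    ) %>%
    # Keep rows at most `n_after` rows past the latest event
    filter(last_event_row > 0, row_in_subject - last_event_row <= n_after) %>%
    ungroup() %>%
    select(-row_in_subject, -last_event_row)
}

# Small made-up data set to try it on
toy <- data.frame(
  subject = rep(c("A", "B", "C"), each = 6),
  event   = c(0, 1, 0, 0, 0, 1,   # A: events at rows 2 and 6
              1, 1, 0, 0, 0, 0,   # B: events at rows 1 and 2 (windows overlap)
              0, 0, 0, 0, 0, 0)   # C: no event at all
)

result <- keep_event_windows(toy)
result
```

When two events are closer together than the window, each row is kept once rather than repeated, which may or may not be what you want. With the data from the question, keep_event_windows(df) should be the equivalent call.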
