Extracting rows around a pattern in R

Problem description

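I have a data.frame test, and for each id I want to identify the result that comes before and after each bar-foo pattern. The two rows of the pattern must be consecutive in timestamp order.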

For example, the following sample contains three occurrences of the bar-foo pattern:

> test
             timestamp id message   result
1  2019-01-01 00:00:21  1     bar negative
2  2019-01-01 00:00:58  1     bar positive
3  2019-01-01 00:01:35  1     foo positive
4  2019-01-01 00:03:02  1     bar negative
5  2019-01-01 00:06:42  1     baz positive
6  2019-01-01 00:07:16  1     baz positive
7  2019-01-01 00:07:39  1     bar positive
8  2019-01-01 00:09:14  2     bar negative
9  2019-01-01 00:09:56  2     foo negative
10 2019-01-01 00:10:56  2     foo positive
11 2019-01-01 00:11:13  2     foo negative
12 2019-01-01 00:11:32  2     foo positive
13 2019-01-01 00:11:49  2     bar negative
14 2019-01-01 00:12:18  2     foo positive
15 2019-01-01 00:15:28  2     bar positive

So the ideal output would look like this:

> output
    before    after id
1 negative negative  1
2     <NA> positive  2
3 positive positive  2

The code I applied below works, but it looks complicated and inefficient:

library(dplyr)

test %>%
  group_by(id) %>%
  # look one row ahead for the message, one row back and two rows ahead for the result
  mutate(next.message = lead(message, order_by = timestamp),
         previous.result = lag(result, order_by = timestamp),
         next.result = lead(result, n = 2, order_by = timestamp)) %>%
  # keep only the "bar" rows that are immediately followed by "foo"
  filter(message == 'bar', next.message == 'foo') %>%
  filter_all(any_vars(!is.na(.))) %>%
  select(-c(timestamp, message, result, next.message)) %>%
  rename(before = previous.result, after = next.result)

What would be a better way to solve this using dplyr or data.table functions?

Sample data:

dput(test)
structure(list(timestamp = structure(c(1546318821, 1546318858, 
1546318895, 1546318982, 1546319202, 1546319236, 1546319259, 1546319354, 
1546319396, 1546319456, 1546319473, 1546319492, 1546319509, 1546319538, 
1546319728), class = c("POSIXct", "POSIXt")), id = c(1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), message = c("bar", "bar", 
"foo", "bar", "baz", "baz", "bar", "bar", "foo", "foo", "foo", 
"foo", "bar", "foo", "bar"), result = c("negative", "positive", 
"positive", "negative", "positive", "positive", "positive", "negative", 
"negative", "positive", "negative", "positive", "negative", "positive", 
"positive")), row.names = c(NA, -15L), class = "data.frame")

Tags: r, dplyr, data.table, pattern-matching

Solution


Perhaps something like this with data.table:

library(data.table)
setDT(test)
# assumes test is already sorted by id and timestamp, so each id occupies a contiguous block of .I
test[,
    {
        # find the rows where message is bar and the next message is foo
        v <- .I[message == "bar" & shift(message, -1L, fill = "") == "foo"]

        # extract the previous result; use NA if it falls before the first row of the current id
        .(before = test[replace(v - 1L, v - 1L < min(.I), NA_integer_), result],

          # extract the result after foo; use NA if it falls past the last row of the current id
          after = test[replace(v + 2L, v + 2L > max(.I), NA_integer_), result])
    },
    by = .(id)]

Output:

   id   before    after
1:  1 negative negative
2:  2     <NA> positive
3:  2 positive positive
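
For comparison, the same idea can probably be expressed in dplyr as well. The following is only a sketch and not part of the answer above: it rebuilds the sample data from the question's dput() so that it runs on its own, and the helper column is_match and the result name out are names introduced here for illustration. It sorts by id and timestamp first, then relies on grouped lag() and lead() returning NA at the edges of each id, which should make the explicit boundary checks unnecessary.

```r
library(dplyr)

# Sample data rebuilt from the dput() in the question (the time zone only affects printing)
test <- data.frame(
  timestamp = as.POSIXct(
    c(1546318821, 1546318858, 1546318895, 1546318982, 1546319202,
      1546319236, 1546319259, 1546319354, 1546319396, 1546319456,
      1546319473, 1546319492, 1546319509, 1546319538, 1546319728),
    origin = "1970-01-01", tz = "UTC"),
  id = rep(c(1, 2), times = c(7, 8)),
  message = c("bar", "bar", "foo", "bar", "baz", "baz", "bar", "bar",
              "foo", "foo", "foo", "foo", "bar", "foo", "bar"),
  result = c("negative", "positive", "positive", "negative", "positive",
             "positive", "positive", "negative", "negative", "positive",
             "negative", "positive", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

out <- test %>%
  arrange(id, timestamp) %>%          # "consecutive" now means consecutive in time
  group_by(id) %>%                    # lag()/lead() never cross an id boundary
  mutate(
    # TRUE on a "bar" row whose next row (within the same id) is "foo"
    is_match = message == "bar" & lead(message, default = "") == "foo",
    before   = lag(result),           # result of the row just before the "bar"
    after    = lead(result, n = 2)    # result of the row just after the "foo"
  ) %>%
  ungroup() %>%
  filter(is_match) %>%
  select(id, before, after)

print(out)
```

On the sample data this should return the same three rows as the data.table version, with NA in before for the pattern that starts on the first row of id 2.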
