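r - Identify patterns based on start and end points
Problem description
I want to determine the duration of activities between the start t1 and the end t7. The start is t1, which records activity occurring at t1_1, t1_2, t1_3, and so on. For example, for id 12 an activity occurs from t1_2 to t3_1 (I want to keep all events). I want to identify the start and end of every activity that occurs at least 4 times in a row (e.g. the value 1 occurring 4 times), together with its duration and the most frequent activity. Zeros define the boundaries of a sequence (i.e. a sequence starts and ends with 1 and is delimited by 0).
Input: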
id t1_1 t1_2 t1_3 t2_1 t2_2 t2_3 t3_1 t3_2 t3_3 t4_1 t4_2 t4_3 t5_1 t5_2 t5_3 t6_1 t6_2 t6_3 t7_1 t7_2 t7_3
12 0 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 0 1
123 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Output for id 12
Id Start/End Duration Frequency
12 t1_2, t1_3, t2_1, t2_2, t2_3, t3_1 6 1
12 t6_1, t6_2, t6_3, t7_1 4 1
Sample data
df1 <- structure(list(serial = c(12L, 123L, 10L), t1_1 = c(0L, 0L, 1L),
t1_2 = c(1L, 0L, 1L), t1_3 = c(1L, 0L, 1L), t2_1 = c(0L,
1L, 1L), t2_2 = c(1L, 1L, 1L), t2_3 = c(0L, 1L, 1L), t3_1 = c(1L,
0L, 1L), t3_2 = c(0L, 0L, 1L), t3_3 = c(1L, 0L, 1L), t4_1 = c(0L,
1L, 1L), t4_2 = c(1L, 1L, 1L), t4_3 = c(0L, 1L, 1L), t5_1 = c(0L,
1L, 1L), t5_2 = c(1L, 1L, 1L), t5_3 = c(0L, 1L, 1L), t6_1 = c(1L,
0L, 1L), t6_2 = c(1L, 0L, 1L), t6_3 = c(1L, 0L, 1L), t7_1 = c(0L,
1L, 1L), t7_2 = c(0L, 1L, 1L), t7_3 = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
Code so far
library(data.table)
df1 <- melt(setDT(df1), id.vars = 'serial')
df1[, c('time', 'subtime') := tstrsplit(as.character(variable), "_", fixed = TRUE)]
df2 <- df1[, rle(value), by = .(serial, time)][lengths > 1 & values == 1, ]
df3 <- df1[df2, on = c('serial', 'time')]
df3 <- df3[, .(`Start/End` = paste0(time, '_', c(min(subtime), max(subtime)), collapse = " - "),
Duration = unique(lengths)),
by = .(serial, time)]
df3[, Frequency := .N, by = .(serial, `Start/End`)]
df3[, time := NULL]
df3[order(serial), ]
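In the attempt above, rle() is evaluated with by = .(serial, time), so a run can never cross from one t block into the next (for example from t1_3 into t2_1). The following is only a sketch of one possible alternative, not a definitive solution: it numbers runs per serial with data.table's rleid() and keeps runs of 1s of length 4 or more. The names cols, m, wide, long, slot, run and min_len are introduced here for illustration, the threshold of 4 is an assumption read off the expected output, and the rows are copied from the input table in the question (the dput() sample differs from that table for serial 12, e.g. t2_1 is 0 there).

```r
library(data.table)

# Rows copied from the input table in the question (not the dput() sample)
cols <- paste0("t", rep(1:7, each = 3), "_", rep(1:3, times = 7))
m <- rbind(
  c(0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  rep(1, 21)
)
colnames(m) <- cols
wide <- as.data.table(m)
wide[, serial := c(12L, 123L, 10L)]

# Long format: one row per serial and time slot, slots kept in column order
long <- melt(wide, id.vars = "serial", variable.name = "slot",
             variable.factor = FALSE)

# rleid() gives each run of identical consecutive values its own number;
# grouping by serial only lets a run span t1 -> t2 -> t3 ...
long[, run := rleid(value), by = serial]

min_len <- 4L  # assumed threshold, taken from the expected output

# Keep runs of 1s only, then summarise first slot, last slot and length
runs <- long[value == 1,
             .(Start = slot[1], End = slot[.N], Duration = .N),
             by = .(serial, run)][Duration >= min_len]
runs <- runs[order(serial), .(serial, Start, End, Duration)]
print(runs)
```

For serial 12 this yields the two runs listed in the expected output (t1_2 to t3_1 with Duration 6, and t6_1 to t7_1 with Duration 4). The Frequency column is not computed here, because its exact definition is not stated in the question.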
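Solution
I would suggest the following approach using tidyverse functions. You want to identify sequences, so the code below may be useful. The main idea is to reshape the data and split the time variable (t) so that you can create an ID for the sequences and then aggregate: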
library(tidyverse)
df1 %>% arrange(serial) %>% pivot_longer(cols = -serial) %>%
#Duplicate the variable with time
mutate(name2=name) %>%
#Split time so that you have categories by t1, t2,...
separate(name2,into = c('var1','var2'),sep = '_') %>%
#Group by main id, the categories and value
group_by(serial,var1,value) %>%
#Create an unique id for sequences
mutate(id=cur_group_id()) %>%
#Omit values in zero which are not patterns
ungroup() %>% filter(value!=0) %>%
#Aggregate with the new id
group_by(serial,id) %>%
#Compute outputs
summarise(chain=paste0(name,collapse = ','),Duration=n()) %>%
select(-id) -> dfprime
Output (I only include serial 12):
# A tibble: 7 x 3
# Groups: serial [1]
serial chain Duration
<int> <chr> <int>
1 12 t1_2,t1_3 2
2 12 t2_2 1
3 12 t3_1,t3_3 2
4 12 t4_2 1
5 12 t5_2 1
6 12 t6_1,t6_2,t6_3 3
7 12 t7_3 1
If you want other aggregations, you can work with the final data frame.
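As an illustration of such follow-up aggregations, here is a minimal sketch that takes the serial 12 rows printed above and derives the longest chain per serial and a count of how often each duration occurs. The names is_longest, longest and freq are hypothetical, and reading "Frequency" as "number of chains with the same duration" is an assumption, not something the question confirms. Base R is used only to keep the sketch free of extra dependencies.

```r
# Values copied from the serial 12 output shown above
dfprime <- data.frame(
  serial = 12L,
  chain = c("t1_2,t1_3", "t2_2", "t3_1,t3_3", "t4_2", "t5_2",
            "t6_1,t6_2,t6_3", "t7_3"),
  Duration = c(2L, 1L, 2L, 1L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Longest chain(s) per serial: compare each Duration with its group maximum
is_longest <- dfprime$Duration ==
  ave(dfprime$Duration, dfprime$serial, FUN = max)
longest <- dfprime[is_longest, ]

# Assumed meaning of "Frequency": how many chains share a given Duration
freq <- as.data.frame(
  table(serial = dfprime$serial, Duration = dfprime$Duration),
  responseName = "Frequency"
)

print(longest)
print(freq)
```

With the full dfprime from the pipeline, the same two steps apply unchanged because both are grouped by serial.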