首页 > 解决方案 > 根据条件对数据进行子集和重命名,并在 Dplyr 中查找持续时间

问题描述

我有一个数据集,df:

最终,我希望能够将数据分组为“块”,其中文件夹列包含字符串“Out”,确保考虑与之关联的日期。有没有办法为每个“Out”发生的实例创建一个块,同时计算它的持续时间。

  Folder               DATE
  Out                  9/9/2019 5:46:00                
  Out                  9/9/2019 5:46:01
  Out                  9/9/2019 5:46:02
  In                   9/9/2019 5:46:03
  In                   9/9/2019 5:46:04
  Out                  9/10/2019 6:00:01
  Out                  9/10/2019 6:00:02
  In                   9/11/2019 7:50:00
  In                   9/11/2019 7:50:01

我想要这个输出:

 New Variable        Duration
 Out1                 2 sec
 Out2                 1 sec

我已经包括了输入:

structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Outdata"), class = "factor"), Date =  structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L), .Label = c("9/10/2019 6:00", 
"9/11/2019 7:50", "9/9/2019 5:46"), class = "factor")), class = "data.frame", row.names = c(NA,      
-9L))

到目前为止我已经尝试过:

 #Loading appropriate libraries

 library(dplyr)
 library(lubridate)

创建将首先按字符串“Out”对文件夹进行分组的新变量

(但是,这是我不确定该怎么做的地方,因为我希望为每个“Out”组及其持续时间创建一个新变量,以最终将其绘制在直方图上。)

 newdf<-df %>%
 group_by(df$Folder) %>%            
 summarise(mutate(Duration = difftime(as.POSIXct(ss_EndTime, format = 
 "%m/%d/%Y %I:%M:%S %p"),as.POSIXct(ss_StartTime, 
 format = "%m/%d/%Y %I:%M:%S %p" ), units = "secs")))

我将继续研究,感谢所有建议。

标签: rdataframedatetimedplyrstringr

解决方案


我们可以将DATE列转换为POSIXct,根据值的连续出现创建一个分组列Folderfilter"Out"值并获取每组中的时间first差。last

library(dplyr)

df %>%
  mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M:%S"), 
         gr = cumsum(Folder != lag(Folder, default = TRUE))) %>%
   filter(Folder == "Out") %>%
   arrange(gr, DATE) %>%
   group_by(gr) %>%
   summarise(Duration = difftime(last(DATE), first(DATE))) %>%
   mutate(gr = paste0('Out', row_number()))

# A tibble: 2 x 2
#  gr    Duration
#  <chr> <drtn>  
#1 Out1  2 secs  
#2 Out2  1 secs  

数据

df <- structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Out"), class = "factor"), DATE = structure(c(5L, 
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L), .Label = c("9/10/2019 6:00:01", 
"9/10/2019 6:00:02", "9/11/2019 7:50:00", "9/11/2019 7:50:01", 
"9/9/2019 5:46:00", "9/9/2019 5:46:01", "9/9/2019 5:46:02", "9/9/2019 5:46:03", 
"9/9/2019 5:46:04"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))

推荐阅读