Group, apply multiple conditions, and calculate duration in R or Python

Problem description

Objective

I have a dataset, df, and I want to calculate durations based on certain conditions and display Recipient, starttime, endtime, Duration, and Length as the output.

Problem

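I need to first group the messages together if the following conditions apply: Folder == 'out' OR 'draft', Message == "", Edit == "T", and the contents of the Recipient and Length columns are the same across consecutive rows.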

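Ideally, this would give me Group A and its duration. For example, the first "chunk" of data would be labeled Group A, with a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM.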
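The grouping rule above can be sketched in plain Python. This is a minimal illustration under assumptions, not a definitive implementation: the rows are hypothetical dictionaries whose keys mirror the question's columns, the helper names `qualifies` and `chunk_key` are made up for this example, and a missing Message is treated the same as an empty string.

```python
from itertools import groupby

# Hypothetical sample rows; the keys mirror the columns in the question.
rows = [
    {"Recipient": "a@mail.com,b@mail.com", "Length": 80, "Folder": "out", "Message": "", "Edit": True},
    {"Recipient": "a@mail.com,b@mail.com", "Length": 80, "Folder": "out", "Message": "", "Edit": True},
    {"Recipient": "", "Length": None, "Folder": "", "Message": "", "Edit": None},
    {"Recipient": "a@mail.com,b@mail.com", "Length": 80, "Folder": "draft", "Message": "", "Edit": True},
]

def qualifies(row):
    """True when the row meets the Folder, Message, and Edit conditions."""
    return (
        row["Folder"] in ("out", "draft")
        and not row["Message"]   # assumption: "" and None both count as empty
        and row["Edit"] is True
    )

def chunk_key(row):
    """Key that stays constant within a chunk; None marks a non-qualifying row."""
    return (row["Recipient"], row["Length"]) if qualifies(row) else None

# groupby only merges *consecutive* rows with equal keys, which is what
# "consecutively the same" asks for. Runs of non-qualifying rows are dropped.
chunks = [list(group) for key, group in groupby(rows, key=chunk_key) if key is not None]
```

Here the first two rows form one chunk and the last row forms another, because the blank row between them does not qualify and therefore breaks the run.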

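In addition, I would like to "match" Group A with another "chunk" of data if the Subject, Re, and Length columns in its last row match the Subject, Re, and Length in the first row of the other chunk. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.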

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 
some
some
some
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        
no
no
no
no
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T                        
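The matching rule (last row of one chunk against the first row of another) could be expressed roughly as follows. This Python sketch is one interpretation of the rule, not a confirmed solution: `match_key`, `label_chunks`, and `row` are hypothetical names, the chunks are assumed to be already split, and at most 26 groups are supported.

```python
from string import ascii_uppercase

def match_key(row):
    """Columns compared when matching one chunk against another."""
    return (row["Subject"], row["Re"], row["Length"])

def label_chunks(chunks):
    """Give a chunk the label of an earlier chunk whose last row matches
    this chunk's first row; otherwise use the next unused letter."""
    label_by_last_key = {}
    labels = []
    next_index = 0
    for chunk in chunks:
        label = label_by_last_key.get(match_key(chunk[0]))
        if label is None:
            label = ascii_uppercase[next_index]  # assumes at most 26 groups
            next_index += 1
        label_by_last_key[match_key(chunk[-1])] = label
        labels.append(label)
    return labels

def row(subject):
    """Hypothetical helper that builds a sample row."""
    return {"Subject": subject, "Re": "a@mail.com,b@mail.com", "Length": 80}

chunks = [
    [row(""), row("hey")],  # "out" rows: blank Subject first, "hey" last
    [row("hey")],           # "draft" rows
    [row("hey")],           # later "out" rows
]
labels = label_chunks(chunks)
```

With the sample above, all three chunks receive the same label, which mirrors the three Group A rows in the desired output.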

Desired output

 Start                  End                        Duration          Group  Subject  Length
 1/2/2020 1:00:01 AM    1/2/2020 1:00:30 AM        29                A      hey       80
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A      hey       80
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A      hey       80
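The Duration column is the end time minus the start time of each chunk, in seconds. A minimal Python sketch of that arithmetic is shown here, assuming the timestamps are strings in month/day/year order with a 12-hour clock as in the table; `DATE_FORMAT` and `summarize` are names invented for this example.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "1/2/2020 1:00:01 AM"

def summarize(dates):
    """Return (start, end, duration in whole seconds) for one chunk."""
    parsed = [datetime.strptime(d, DATE_FORMAT) for d in dates]
    start, end = min(parsed), max(parsed)
    return start, end, int((end - start).total_seconds())

first_chunk = [
    "1/2/2020 1:00:01 AM",
    "1/2/2020 1:00:05 AM",
    "1/2/2020 1:00:10 AM",
    "1/2/2020 1:00:15 AM",
    "1/2/2020 1:00:30 AM",
]
start, end, duration = summarize(first_chunk)
print(duration)  # 29
```

The same function gives 5 and 10 seconds for the second and third chunks of the sample table.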

Input:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L, 
 4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", 
 "hey", "no", "some"), class = "factor"), Recipient = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 
2L), .Label = c("", " ", "a@mail.com,b@mail.com"), class = "factor"), 
Length = c(80L, 80L, 80L, 80L, 80L, NA, NA, NA, 80L, 80L, 
NA, NA, NA, NA, 80L, 80L, NA, NA), Folder = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("", "draft", "out"), class = "factor"), 
Message = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 1L), .Label = c("", 
"1/2/2020 1:00", "1/2/2020 1:02", "1/2/2020 1:03"), class = "factor"), 
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, NA, TRUE, 
TRUE, NA, NA, NA, NA, TRUE, TRUE, NA, NA)), class = "data.frame", row.names =   c(NA, 
  -18L))

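I am using the code below, but I want to keep the rows where the Subject is blank; I do not want them filtered out. As the first few rows of this example show, the Subject field is blank there, yet those rows should still be included in the first "chunk". When I remove this part: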

   filter(Subject != '') %>%

I get some errors. Should I remove another part of the code too? (Keep in mind, I still want to display the Subject output.) Any advice is appreciated.





 library(dplyr)

 df1 <- df %>%
   # Convert factor columns to character
   mutate_if(is.factor, as.character) %>%
   # Replace NA with "" and trim whitespace in Subject and Recipient
   mutate_at(c("Subject", "Recipient"), ~ if_else(is.na(.), "", stringr::str_trim(.))) %>%
   # This is the step that drops the blank-Subject rows
   filter(Subject != '') %>%
   mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
   # Flag rows that meet the Folder, Message, and Edit conditions
   mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
   # The segment counter increases at every row that fails the condition
   mutate(segment = cumsum(!cond)) %>%
   filter(cond) %>%
   group_by(Subject, Recipient, Length, segment) %>%
   summarize(Start = min(Date),
             End = max(Date),
             Duration = End - Start) %>%
   mutate(new_group = (Subject   != lag(Subject, 1, "")) *
                      (Recipient != lag(Recipient, 1, "")) *
                      (Length    != lag(Length, 1, ""))) %>%
   ungroup() %>%
   mutate(group = LETTERS[cumsum(new_group)])
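One possible way to keep the blank-Subject rows while still displaying a Subject is to leave Subject out of the grouping key and report the last non-blank Subject found in each chunk. The Python sketch below shows only that final step; `chunk_subject` is a hypothetical helper, and picking the last non-blank value is an assumption based on the desired output, where the first chunk is shown with Subject "hey".

```python
def chunk_subject(subjects):
    """Return the last non-blank subject in a chunk, or "" if all are blank."""
    for subject in reversed(subjects):
        if subject and subject.strip():
            return subject.strip()
    return ""

# Subjects of the first chunk in the sample table: two blank rows, then "hey".
first_chunk_subjects = ["", "", "hey", "hey", "hey"]
subject = chunk_subject(first_chunk_subjects)
```

The blank rows still contribute their timestamps to the chunk, so the start time stays at the first blank row rather than the first "hey" row.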

Tags: python, r, pandas, loops, dplyr

Solution

