首页 > 解决方案 > 如何根据特定条件将行分组在一起?(R 或 Python)

问题描述

客观的:

如果 Subject、Re 和 Length 列具有相同的连续值,并且如果 Folder 为 == "out" | 则将数据部分组合在一起 "drafts", Message == "", Edit is == "T" 并占用他们的持续时间。

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 



hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        




hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T                        

期望的输出

 Start                  End                        Duration          Group
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

输入:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"), 
 Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
 1L, 1L, 2L, 2L), .Label = c("", "sarah@mail.com,gee@mail.com"
 ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 
 NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L, 
 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
 "draft", "out"), class = "factor"), Message = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("", 
 "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM", 
 "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM", 
 "1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
 ), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
 NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA, 
-13L))

我知道我可以像这样过滤:

   df1<-df2 %>%
   mutate(Date = lubridate::mdy_hms(Date), 
    cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" &     Message == "" & Subject     ==  ?   & Re ==   ?     & Length == ?   

但不确定如何合并“如果有连续值”。我将继续研究,非常感谢任何帮助或建议。

标签: pythonrpandasdplyr

解决方案


您的structure外观与您发布的数据框略有不同:

> df
   Subject                   Recipient Length Folder Message                Date Edit
1                                          80    out      NA 1/2/2020 1:00:01 AM TRUE
2                                          80    out      NA 1/2/2020 1:00:05 AM TRUE
3      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:10 AM TRUE
4      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:15 AM TRUE
5      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:30 AM TRUE
6                                          NA             NA                       NA
7                                          NA             NA                       NA
8      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:00 AM TRUE
9      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:05 AM TRUE
10                                         NA             NA                       NA
11                                         NA             NA                       NA
12     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:00 AM TRUE
13     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:20 AM TRUE

此外,您所需的输出表明您希望Folder按以及按其他类别拆分组,但这不是您的描述所说的,所以我没有分组Folder。不过,如果你愿意,这很容易改变。

您可以通过使用运行长度编码来消除排序数据中相同连续值组的歧义,但rle在 R 中将其转换为数据框列有点棘手。我用这个答案来做到这一点

library(lubridate)
library(dplyr)

df %>%
  mutate(Date = mdy_hms(Date),
         Key = paste(Subject, Recipient, Length, sep = "_")) %>%
  arrange(Date) %>%
  filter(Folder == "out" | Folder == "draft" & Edit == TRUE) %>%
  mutate(RLE = {RLE = rle(Key) ; rep(seq_along(RLE$lengths), RLE$lengths)}) %>%
  group_by(RLE) %>%
  summarize(Start = first(Date),
            End = last(Date),
            Duration = as.numeric(End) - as.numeric(Start))

这会从 1:2、3:5+8:9 和 12:13 行创建组。这些组给出了这些持续时间:

# A tibble: 3 x 4
    RLE Start               End                 Duration
  <int> <dttm>              <dttm>                 <dbl>
1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
2     2 2020-01-02 01:00:10 2020-01-02 01:02:05      115
3     3 2020-01-02 01:03:00 2020-01-02 01:03:20       20

如果您想Folder包含在分组中,请将其添加到创建中包含的内容中Key。这使得分组为 1:2、3:5、8:9 和 12:13。这样做会产生以下结果:

# A tibble: 4 x 4
    RLE Start               End                 Duration
  <int> <dttm>              <dttm>                 <dbl>
1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
2     2 2020-01-02 01:00:10 2020-01-02 01:00:30       20
3     3 2020-01-02 01:02:00 2020-01-02 01:02:05        5
4     4 2020-01-02 01:03:00 2020-01-02 01:03:20       20

推荐阅读