python - 如何根据特定条件将行分组在一起?(R 或 Python)
问题描述
客观的:
如果 Subject、Re 和 Length 列具有相同的连续值,并且如果 Folder 为 == "out" | 则将数据部分组合在一起 "drafts", Message == "", Edit is == "T" 并占用他们的持续时间。
Subject Re Length Folder Message Date Edit
a@mail.com,b@mail.com 80 out 1/2/2020 1:00:01 AM T
a@mail.com,b@mail.com 80 out 1/2/2020 1:00:05 AM T
hey a@mail.com,b@mail.com 80 out 1/2/2020 1:00:10 AM T
hey a@mail.com,b@mail.com 80 out 1/2/2020 1:00:15 AM T
hey a@mail.com,b@mail.com 80 out 1/2/2020 1:00:30 AM T
hey a@mail.com,b@mail.com 80 draft 1/2/2020 1:02:00 AM T
hey a@mail.com,b@mail.com 80 draft 1/2/2020 1:02:05 AM T
hey a@mail.com,b@mail.com 80 out 1/2/2020 1:03:10 AM T
hey a@mail.com,b@mail.com 80 out 1/2/2020 1:03:20 AM T
期望的输出
Start End Duration Group
1/2/2020 1:00:10 AM 1/2/2020 1:00:30 AM 20 A
1/2/2020 1:02:00 AM 1/2/2020 1:02:05 AM 5 A
1/2/2020 1:03:10 AM 1/2/2020 1:03:20 AM 10 A
输入:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"),
Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 2L), .Label = c("", "sarah@mail.com,gee@mail.com"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L,
NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("",
"draft", "out"), class = "factor"), Message = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L,
3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("",
"1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM",
"1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM",
"1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE,
NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA,
-13L))
我知道我可以像这样过滤:
df1<-df2 %>%
mutate(Date = lubridate::mdy_hms(Date),
cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" & Message == "" & Subject == ? & Re == ? & Length == ?
但不确定如何合并“如果有连续值”。我将继续研究,非常感谢任何帮助或建议。
解决方案
您的structure
外观与您发布的数据框略有不同:
> df
Subject Recipient Length Folder Message Date Edit
1 80 out NA 1/2/2020 1:00:01 AM TRUE
2 80 out NA 1/2/2020 1:00:05 AM TRUE
3 hey sarah@mail.com,gee@mail.com 80 out NA 1/2/2020 1:00:10 AM TRUE
4 hey sarah@mail.com,gee@mail.com 80 out NA 1/2/2020 1:00:15 AM TRUE
5 hey sarah@mail.com,gee@mail.com 80 out NA 1/2/2020 1:00:30 AM TRUE
6 NA NA NA
7 NA NA NA
8 hey sarah@mail.com,gee@mail.com 80 draft NA 1/2/2020 1:02:00 AM TRUE
9 hey sarah@mail.com,gee@mail.com 80 draft NA 1/2/2020 1:02:05 AM TRUE
10 NA NA NA
11 NA NA NA
12 hey sarah@mail.com,gee@mail.com 100 draft NA 1/2/2020 1:03:00 AM TRUE
13 hey sarah@mail.com,gee@mail.com 100 draft NA 1/2/2020 1:03:20 AM TRUE
此外,您所需的输出表明您希望Folder
按以及按其他类别拆分组,但这不是您的描述所说的,所以我没有分组Folder
。不过,如果你愿意,这很容易改变。
您可以通过使用运行长度编码来消除排序数据中相同连续值组的歧义,但rle
在 R 中将其转换为数据框列有点棘手。我用这个答案来做到这一点。
library(lubridate)
library(dplyr)
df %>%
mutate(Date = mdy_hms(Date),
Key = paste(Subject, Recipient, Length, sep = "_")) %>%
arrange(Date) %>%
filter(Folder == "out" | Folder == "draft" & Edit == TRUE) %>%
mutate(RLE = {RLE = rle(Key) ; rep(seq_along(RLE$lengths), RLE$lengths)}) %>%
group_by(RLE) %>%
summarize(Start = first(Date),
End = last(Date),
Duration = as.numeric(End) - as.numeric(Start))
这会从 1:2、3:5+8:9 和 12:13 行创建组。这些组给出了这些持续时间:
# A tibble: 3 x 4
RLE Start End Duration
<int> <dttm> <dttm> <dbl>
1 1 2020-01-02 01:00:01 2020-01-02 01:00:05 4
2 2 2020-01-02 01:00:10 2020-01-02 01:02:05 115
3 3 2020-01-02 01:03:00 2020-01-02 01:03:20 20
如果您想Folder
包含在分组中,请将其添加到创建中包含的内容中Key
。这使得分组为 1:2、3:5、8:9 和 12:13。这样做会产生以下结果:
# A tibble: 4 x 4
RLE Start End Duration
<int> <dttm> <dttm> <dbl>
1 1 2020-01-02 01:00:01 2020-01-02 01:00:05 4
2 2 2020-01-02 01:00:10 2020-01-02 01:00:30 20
3 3 2020-01-02 01:02:00 2020-01-02 01:02:05 5
4 4 2020-01-02 01:03:00 2020-01-02 01:03:20 20
推荐阅读
- python - Python:仅当特定列不为空时才将列表写入 CSV
- deep-learning - 在多个图像上运行保存的训练模型
- excel - 当有多个工作簿打开时运行 auto_close
- c# - RemoteAttribute 对于 null/空数据的行为不符合预期
- javascript - 尽管我在对象方法中创建了一个新属性(键),但 console.log 返回“未定义”
- python - 请添加一个完整的日志,log_level = 2 任何一个答案租约?
- c++ - 如何让一个类访问另一个类的私有方法或属性?
- javascript - Vue.js 3 初始化我的数字变量的正确方法是什么?
- reactjs - 将状态传递给兄弟姐妹 onChange
- c# - 如何使用 C# 获取父类别?(包括点网小提琴)