python - Group by, apply multiple conditions, and calculate duration in R or Python
Problem description
Objective
I have a dataset, df, and I want to calculate durations based on certain conditions and display Recipient, starttime, endtime, Duration and Length in the output.
Problem
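I first need to group messages together when all of the following conditions hold: Folder == 'out' OR 'draft', Message == "", Edit == "T", and the values in the Recipient and Length columns are the same across consecutive rows.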
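Ideally, this would give me Group A and its duration. For example, the first "chunk" of data would be labeled "Group A", with a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM.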
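In addition, I want to "match" Group A with another "chunk" of data if the Subject, Re and Length values in its last row match the Subject, Re and Length values in the first row of the other chunk. So the second Group A chunk has a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.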
Subject  Re                     Length  Folder  Message  Date                 Edit
         a@mail.com,b@mail.com  80      out              1/2/2020 1:00:01 AM  T
         a@mail.com,b@mail.com  80      out              1/2/2020 1:00:05 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:10 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:15 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:30 AM  T
some
some
some
hey      a@mail.com,b@mail.com  80      draft            1/2/2020 1:02:00 AM  T
hey      a@mail.com,b@mail.com  80      draft            1/2/2020 1:02:05 AM  T
no
no
no
no
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:03:10 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:03:20 AM  T
Desired output
Start                End                  Duration  Group  Subject  Length
1/2/2020 1:00:01 AM  1/2/2020 1:00:30 AM  29        A      hey      80
1/2/2020 1:02:00 AM  1/2/2020 1:02:05 AM  5         A      hey      80
1/2/2020 1:03:10 AM  1/2/2020 1:03:20 AM  10        A      hey      80
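The Duration column above is the number of seconds between Start and End. As a minimal sketch in plain Python, assuming the timestamps always use the 12-hour month/day/year layout shown in the table (the names FMT and duration_seconds are illustrative, not from the question):

```python
from datetime import datetime

# Assumed timestamp layout, taken from the sample table:
# month/day/year followed by a 12-hour clock with AM/PM.
FMT = "%m/%d/%Y %I:%M:%S %p"

def duration_seconds(start, end):
    """Return the whole seconds elapsed between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

print(duration_seconds("1/2/2020 1:00:01 AM", "1/2/2020 1:00:30 AM"))  # 29
```

Note that a 12-hour timestamp needs the %I and %p directives; parsing it with %H alone would ignore the AM/PM marker.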
Input:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("",
"hey", "no", "some"), class = "factor"), Recipient = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L,
2L), .Label = c("", " ", "a@mail.com,b@mail.com"), class = "factor"),
Length = c(80L, 80L, 80L, 80L, 80L, NA, NA, NA, 80L, 80L,
NA, NA, NA, NA, 80L, 80L, NA, NA), Folder = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 1L), .Label = c("", "draft", "out"), class = "factor"),
Message = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 1L), .Label = c("",
"1/2/2020 1:00", "1/2/2020 1:02", "1/2/2020 1:03"), class = "factor"),
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, NA, TRUE,
TRUE, NA, NA, NA, NA, TRUE, TRUE, NA, NA)), class = "data.frame", row.names = c(NA,
-18L))
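I am using the code below, but I want to keep the rows where the Subject is blank; I don't want them filtered out. As the first few rows of the example show, even though the Subject field is blank, those rows should still be included in the first "chunk". When I remove this part: filter(Subject != '') %>%, I get some errors. Should I remove another part of the code too? (Keep in mind that I still want to display the Subject in the output.) Any advice is appreciated.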
library(dplyr)  # stringr must also be installed for str_trim()

df1 <- df %>%
  # Convert factors to character and blank out/trim the key text columns
  mutate_if(is.factor, as.character) %>%
  mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
  # This is the step that drops the blank-Subject rows
  filter(Subject != '') %>%
  mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
  # Flag qualifying rows; every non-qualifying row starts a new segment
  mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
  mutate(segment = cumsum(!cond)) %>%
  filter(cond) %>%
  # One summary row per block
  group_by(Subject, Recipient, Length, segment) %>%
  summarize(Start = min(Date),
            End = max(Date),
            Duration = End - Start) %>%
  # Assign a letter to each block
  mutate(new_group = (Subject != lag(Subject, 1, "")) *
                     (Recipient != lag(Recipient, 1, "")) *
                     (Length != lag(Length, 1, ""))) %>%
  ungroup() %>%
  mutate(group = LETTERS[cumsum(new_group)])
Solution
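One possible approach, sketched here in plain Python rather than dplyr, is to leave blank-Subject rows in place and split the data only on the qualifying conditions plus Recipient and Length. The sketch makes several assumptions: rows are dictionaries keyed by the question's column names, the sample rows are retyped from the table, each block reports its last non-blank Subject, and blocks sharing the same Subject, Recipient and Length get the same letter (a simplification of the "last row matches first row" rule). The helper names msg, filler, qualifies, block_key and summarize are hypothetical.

```python
from datetime import datetime
from itertools import groupby

RECIPIENT = "a@mail.com,b@mail.com"

def msg(subject, folder, date):
    """A message row with recipient, length and edit flag filled in."""
    return {"Subject": subject, "Recipient": RECIPIENT, "Length": 80,
            "Folder": folder, "Message": "", "Date": date, "Edit": "T"}

def filler(subject):
    """A row carrying only a subject, like the 'some' and 'no' rows."""
    return {"Subject": subject, "Recipient": "", "Length": None,
            "Folder": "", "Message": "", "Date": "", "Edit": ""}

# Sample rows retyped from the table in the question.
rows = (
    [msg("", "out", "1/2/2020 1:00:01 AM"),
     msg("", "out", "1/2/2020 1:00:05 AM"),
     msg("hey", "out", "1/2/2020 1:00:10 AM"),
     msg("hey", "out", "1/2/2020 1:00:15 AM"),
     msg("hey", "out", "1/2/2020 1:00:30 AM")]
    + [filler("some")] * 3
    + [msg("hey", "draft", "1/2/2020 1:02:00 AM"),
       msg("hey", "draft", "1/2/2020 1:02:05 AM")]
    + [filler("no")] * 4
    + [msg("hey", "out", "1/2/2020 1:03:10 AM"),
       msg("hey", "out", "1/2/2020 1:03:20 AM")]
)

def qualifies(r):
    """Conditions from the question: out/draft folder, empty message, edited."""
    return (r["Folder"] in ("out", "draft")
            and r["Message"] == "" and r["Edit"] == "T")

def block_key(r):
    # Non-qualifying rows get the key None. Qualifying rows are keyed by
    # Recipient and Length, so a run breaks whenever either value changes.
    return (r["Recipient"], r["Length"]) if qualifies(r) else None

def summarize(rows):
    blocks, labels = [], {}
    # groupby only merges consecutive rows with equal keys.
    for key, run in groupby(rows, key=block_key):
        if key is None:
            continue
        run = list(run)
        times = [datetime.strptime(r["Date"], "%m/%d/%Y %I:%M:%S %p")
                 for r in run]
        # Blank subjects stay in the block; report the last non-blank one.
        subject = next((r["Subject"] for r in reversed(run) if r["Subject"]), "")
        # Assumption: same Subject, Recipient and Length means same letter.
        ident = (subject, *key)
        label = labels.setdefault(ident, chr(ord("A") + len(labels)))
        blocks.append({
            "Start": min(times), "End": max(times),
            "Duration": int((max(times) - min(times)).total_seconds()),
            "Group": label, "Subject": subject, "Length": key[1],
        })
    return blocks

for block in summarize(rows):
    print(block)
```

Because Subject is not part of the run key, the two blank-Subject rows at the top fall into the first block instead of forming their own. The letter scheme only covers 26 distinct groups as written.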