r - Create timestamp columns based on multiple filter conditions (R, dplyr)
Question
I have a dataset, df:
Read  Box  ID     Time                  Subject
T     out         10/1/2019 9:00:01 AM
T     out         10/1/2019 9:00:02 AM  Re:
T     out         10/1/2019 9:00:03 AM  Re:
T     out         10/1/2019 9:02:59 AM  Re:
T     out         10/1/2019 9:03:00 AM
F                 10/1/2019 9:05:00 AM
T     out         10/1/2019 9:06:00 AM  Fwd:
T     out         10/1/2019 9:06:02 AM  Fwd:
T     in          10/1/2019 9:07:00 AM
T     in          10/1/2019 9:07:02 AM
T     out         10/1/2019 9:07:04 AM
T     out         10/1/2019 9:07:05 AM  Fw:
T     out         10/1/2019 9:07:06 AM  Fw:
           hello  10/1/2019 9:07:08 AM
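Based on certain conditions in this dataset, I would like to create a starttime column and an endtime column.
I would like to create a start time when all of the following hold over consecutive rows: the first word of the Subject column starts with RE:, re, FWD, or FW (in any letter case), Read == "T", Box == "out", and ID == "".
A start time is generated at the first row where these conditions hold. For this dataset, the first start time is 10/1/2019 9:00:02 AM, because that is where the conditions are first met (Subject starts with FW:, RE:, or FWD; Read = T; Box = out; ID = ""). An end time is created as soon as any one of these conditions stops holding. The first end time is therefore 10/1/2019 9:02:59 AM, at row 4, the last row before the conditions break. My ultimate goal is to create a duration column from these.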
This is my desired output when including RE, Fwd, and Fw:
starttime             endtime               duration
10/1/2019 9:00:02 AM  10/1/2019 9:02:59 AM  177 secs
10/1/2019 9:06:00 AM  10/1/2019 9:06:02 AM  2 secs
10/1/2019 9:07:05 AM  10/1/2019 9:07:06 AM  1 secs
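Also, how would I specify, in separate code, start and end times for these conditions: Read = T, Box = out, ID = "", and the first word of the Subject column does not contain Re, Fwd, or Fw?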
This is my desired output when excluding RE, Fwd, and Fw:
starttime             endtime               duration
10/1/2019 9:00:01 AM  10/1/2019 9:00:01 AM  0 secs
10/1/2019 9:03:00 AM  10/1/2019 9:03:00 AM  0 secs
10/1/2019 9:07:04 AM  10/1/2019 9:07:04 AM  0 secs
Input (dput output):
structure(list(Read = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 3L, 1L), .Label = c("", "F", "T", "T "), class = "factor"),
Box = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 2L,
3L, 3L, 3L, 1L), .Label = c("", "in", "out"), class = "factor"),
ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L), .Label = c("", "hello"), class = "factor"),
Time = structure(1:14, .Label = c("10/1/2019 9:00:01 AM",
"10/1/2019 9:00:02 AM", "10/1/2019 9:00:03 AM", "10/1/2019 9:02:59 AM",
"10/1/2019 9:03:00 AM", "10/1/2019 9:05:00 AM", "10/1/2019 9:06:00 AM",
"10/1/2019 9:06:02 AM", "10/1/2019 9:07:00 AM", "10/1/2019 9:07:02 AM",
"10/1/2019 9:07:04 AM", "10/1/2019 9:07:05 AM", "10/1/2019 9:07:06 AM",
"10/1/2019 9:07:08 AM"), class = "factor"), Subject = structure(c(1L,
4L, 4L, 4L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("",
"Fw:", "Fwd:", "Re:"), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
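The suggested code works, and I would also like to incorporate conditions on the Subject column: one version where Subject is FW, FWD, or RE (ignoring upper/lower case), and one where Subject is not FW, FWD, or RE (ignoring upper/lower case). Roughly, as pseudocode: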
library(dplyr)
df %>%
  mutate(Time = lubridate::mdy_hms(Time),
         cond = Read == "T" & Box == "out" & ID == "" & Subject == "FW" & Subject == "FWD" & Subject == "RE" (ignore.case = TRUE)
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time),
            endtime = last(Time),
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)
library(dplyr)
df %>%
  mutate(Time = lubridate::mdy_hms(Time),
         cond = Read == "T" & Box == "out" & ID == "" & Subject! == "FW" & Subject! == "FWD" & Subject! == "RE" (ignore.case = TRUE)
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time),
            endtime = last(Time),
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)
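Answer
Part of your question has already been answered in your other question (Create start and end time columns based on multiple conditions in R (dplyr, lubridate)). I know it can be hard, but next time please try to reduce your question to a smaller one by focusing on the part you don't know.
If you want to detect a substring, a good way is to use str_detect from the stringr package (part of the tidyverse):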
library(tidyverse)
library(lubridate)
df %>%
  mutate(Time = mdy_hms(Time),
         # cond = Read == "T" & Box == "out" & ID == "", #from the answer https://stackoverflow.com/a/60068929/3888000
         cond = Read == "T" & Box == "out" & ID == "" & str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #including those subjects
         # cond = Read == "T" & Box == "out" & ID == "" & !str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #excluding those subjects
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time),
            endtime = last(Time),
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)
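The grouping step may be easier to follow in isolation. The sketch below uses a made-up logical vector (an illustration, not the question's data) to show how cumsum(!cond) gives every uninterrupted run of TRUE rows its own group id: the counter only advances on FALSE rows, so consecutive TRUE rows share one value.

```r
# Hypothetical condition vector: TRUE where a row passes all filters.
cond <- c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# The running count of FALSE rows stays constant across a run of TRUE rows,
# so it can serve as a run identifier.
grp <- cumsum(!cond)

# Keep only the TRUE rows and list the row positions that fall in each run.
# Here rows 2-4 form one run (grp 1) and rows 7-8 another (grp 3).
runs <- split(which(cond), grp[cond])
print(runs)
```

The group ids are not consecutive, which does not matter here because grp is only used for grouping and is dropped at the end of the pipeline.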
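The str_detect call uses a regular expression (regex), which is well worth learning. This one is easy to read because it only uses the OR (|) operator, but the possibilities are endless.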
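One caveat: the pattern 'FW|FWD|RE' matches those letters anywhere in the subject (a subject such as "Hello there" contains "re"), while the question asks about the first word. As a sketch of one possible variant, the block below anchors the pattern at the start of the string and wraps the pipeline in a helper. The name run_times and its include argument are hypothetical, the data frame is typed in by hand from the dput output, and trimws() is an added assumption to cope with the "T " level (trailing space) that appears in that dput.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hand-typed copy of the question's data, including the "T " values
# (trailing space) found in rows 11 and 12 of the dput output.
df <- data.frame(
  Read = c("T", "T", "T", "T", "T", "F", "T", "T", "T", "T", "T ", "T ", "T", ""),
  Box = c("out", "out", "out", "out", "out", "", "out", "out", "in", "in",
          "out", "out", "out", ""),
  ID = c(rep("", 13), "hello"),
  Time = paste("10/1/2019",
               c("9:00:01", "9:00:02", "9:00:03", "9:02:59", "9:03:00",
                 "9:05:00", "9:06:00", "9:06:02", "9:07:00", "9:07:02",
                 "9:07:04", "9:07:05", "9:07:06", "9:07:08"),
               "AM"),
  Subject = c("", "Re:", "Re:", "Re:", "", "", "Fwd:", "Fwd:", "", "", "",
              "Fw:", "Fw:", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answer): summarise runs of
# rows that pass the Read/Box/ID filters and whose Subject does
# (include = TRUE) or does not (include = FALSE) start with Re, Fw, or Fwd.
run_times <- function(data, include = TRUE) {
  # "^" anchors at the start; "\\b" stops "Report" from counting as "Re".
  prefix <- regex("^\\s*(re|fwd?)\\b", ignore_case = TRUE)
  data %>%
    mutate(
      Time = mdy_hms(Time),
      is_reply = str_detect(as.character(Subject), prefix),
      cond = trimws(Read) == "T" & trimws(Box) == "out" & trimws(ID) == "" &
        (is_reply == include),
      grp = cumsum(!cond)
    ) %>%
    filter(cond) %>%
    group_by(grp) %>%
    summarise(
      starttime = first(Time),
      endtime = last(Time),
      duration = difftime(endtime, starttime, units = "secs")
    ) %>%
    select(-grp)
}

run_times(df, include = TRUE)   # runs whose Subject starts with Re/Fw/Fwd
run_times(df, include = FALSE)  # runs whose Subject does not
```

With this data, the first call should give the three intervals from the question (177, 2, and 1 secs) and the second the three zero-length ones. Without trimws(), rows 11 and 12 would fail the Read == "T" test and the last interval would differ.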