r - 过滤多个时间范围的多个特定列
问题描述
我是 QCing 数据,对于几个 tank/data_types,存在需要删除的错误数据,跨越多个时间范围。包含错误数据的 data_types、tank 和 time 范围已在单独的数据框中报告,其中的一个片段包含在此 QC_table 中:
structure(list(trial = c(1L, 1L, 1L, 1L, 2L, 2L), data_type = c("Temp",
"pH", "pH", "pH", "Temp", "Temp"), tank = c("29", "40", "40",
"40", "13", "29"), date_time_start = c("2021-03-31 8:30", "2021-03-31 7:50",
"2021-03-31 10:25", "2021-03-31 17:05", "2021-04-07 10:25", "2021-04-08 10:30"
), date_time_end = c("2021-03-31 18:00", "2021-03-31 8:15", "2021-03-31 10:40",
"2021-03-31 17:30", "2021-04-07 17:20", "2021-04-10 18:25"),
to.be.removed = c("yes ", "yes ", "yes ", "yes ", "yes ",
"yes "), reason = c("calibration error", "faulty probe",
"faulty probe", "faulty probe", "calibration error", "faulty probe"
), data_type_tank = c("WalchemTempTank29", "pH_Tank40", "pH_Tank40",
"pH_Tank40", "WalchemTempTank13", "WalchemTempTank29")), row.names = c(NA,
-6L), class = "data.frame")
对此还有额外的试验和坦克。我的方法是创建一个包含所有需要删除的数据的新数据框(基于 QC 表中的 data_type_tank 和 date_time_start/end 列),然后从原始数据框中删除该数据框。我不知道这是否最合乎逻辑,但我不知道如何从原始数据框中删除数据。
我使用以下方法构建一个新的数据框:
new_dataframe <- dataframe %>%
select(c(Measurement.time, Trial, contains(urchin_temp_pH_QC$data_type_tank))) %>% head(10)
structure(list(Measurement.time = c("2021-03-30 11:00", "2021-03-30 11:05",
"2021-03-30 11:10", "2021-03-30 11:15", "2021-03-30 11:20", "2021-03-30 11:25",
"2021-03-30 11:30", "2021-03-30 11:35", "2021-03-30 11:40", "2021-03-30 11:45"
), Trial = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WalchemTempTank29_Avg = c("18.8",
"18.67", "18.58", "18.48", "18.38", "18.29", "18.2", "18.12",
"18.03", "18"), WalchemTempTank29_Std = c("0.037", "0.025", "0.032",
"0.029", "0.017", "0.018", "0.026", "0.025", "0.024", "0.023"
), pH_Tank40_Avg = c("7.859", "7.863", "7.868", "7.866", "7.863",
"7.864", "7.865", "7.867", "7.869", "7.87"), pH_Tank40_Std = c("0.007",
"0.006", "0.002", "0.001", "0.002", "0.001", "0.002", "0.001",
"0.002", "0.004"), WalchemTempTank13_Avg = c("10.26", "10.22",
"10.21", "10.24", "10.27", "10.3", "10.32", "10.34", "10.37",
"10.4"), WalchemTempTank13_Std = c("0.01", "0.013", "0.005",
"0.01", "0.007", "0.006", "0.006", "0.008", "0.008", "0.005")), row.names = 4:13, class = "data.frame")
但是现在,基于 QC 表,我需要删除一些行(日期/时间)或为其设置子集,但仅限于特定列(即那些包含 data_type_tank 的列)。我想我可以手动执行此操作,使用下面的代码,然后在需要的地方绑定或连接列/行,但这似乎是一个艰巨的过程。
subset_row_1_QC_table <- dataframe %>% select(Measurement.time, contains("WalchemTempTank29")) %>%
subset(Measurement.time >= as.POSIXct("2021-03-31 08:30") & Measurement.time <= as.POSIXct("2021-03-31 18:00"))
有没有办法根据来自不同数据框的列来自动化这个过程,删除或子集列特定的行?我认为理想情况下我的数据框看起来像一个扩展版本,例如:
测量时间 | 审判 | WaterTempTank29_Avg | WaterTempTank29_Std | pH_Tank40_Avg | pH_Tank40_Std |
---|---|---|---|---|---|
2021-03-31 08:30 | 1 | 18.8 | 0.037 | 不适用 | 不适用 |
[所有 5 分钟间隔] | 不适用 | 不适用 | |||
2021-03-31 18:00 | 1 | 18.36 | 0.023 | 不适用 | 不适用 |
2021-03-31 07:50 | 1 | 不适用 | 不适用 | 7.854 | 0.001 |
[所有 5 分钟间隔] | 1 | 不适用 | 不适用 | 7.88 | 0.001 |
2021-03-31 08:15 | 1 | 不适用 | 不适用 | 7.84 | 0.001 |
2021-03-31 10:25 | 1 | 不适用 | 不适用 | 7.881 | 0.001 |
[所有 5 分钟间隔] | 1 | 不适用 | 不适用 | 7.804 | 0.001 |
2021-03-31 10:40 | 1 | 不适用 | 不适用 | 7.881 | 0 |
任何帮助将不胜感激!我希望我能够正确解释我的问题,StackOverflow 的第一次用户。
干杯,
编辑:感谢 r2evans 和 GuedesBF——希望现在更好/修复。
解决方案
我会这样做。
- 准备变量数据
CF
和df
CF = structure(list(trial = c(1L, 1L, 1L), data_type = c("Temp", "pH",
"pH"), tank = c("29", "40", "40"), date_time_start = structure(c(1617204600,
1617202200, 1617211500), tzone = "", class = c("POSIXct", "POSIXt"
)), date_time_end = structure(c(1617238800, 1617203700, 1617212400
), tzone = "", class = c("POSIXct", "POSIXt")), to.be.removed = c("yes ",
"yes ", "yes "), reason = c("calibration error", "faulty probe",
"faulty probe"), data_type_tank = c("WalchemTempTank29", "pH_Tank40",
"pH_Tank40")), row.names = c(NA, -3L), groups = structure(list(
data_type = c("pH", "Temp"), tank = c("40", "29"), .rows = structure(list(
2:3, 1L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
df = structure(list(Measurement.time = c("2021-03-30 11:00", "2021-03-30 11:05",
"2021-03-30 11:10", "2021-03-30 11:15", "2021-03-31 18:30", "2021-03-30 11:25",
"2021-03-30 11:30", "2021-03-30 11:35", "2021-03-31 17:00", "2021-03-31 19:28"
), Trial = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WalchemTempTank29_Avg = c("18.8",
"18.67", "18.58", "18.48", "18.38", "18.29", "18.2", "18.12",
"18.03", "18"), WalchemTempTank29_Std = c("0.037", "0.025", "0.032",
"0.029", "0.017", "0.018", "0.026", "0.025", "0.024", "0.023"
), pH_Tank40_Avg = c("7.859", "7.863", "7.868", "7.866", "7.863",
"7.864", "7.865", "7.867", "7.869", "7.87"), pH_Tank40_Std = c("0.007",
"0.006", "0.002", "0.001", "0.002", "0.001", "0.002", "0.001",
"0.002", "0.004"), WalchemTempTank13_Avg = c("10.26", "10.22",
"10.21", "10.24", "10.27", "10.3", "10.32", "10.34", "10.37",
"10.4"), WalchemTempTank13_Std = c("0.01", "0.013", "0.005",
"0.01", "0.007", "0.006", "0.006", "0.008", "0.008", "0.005")), row.names = 4:13, class = "data.frame")
请注意,我更改了变量中的一些数据以Measurement.time
将事件包含在CF
表中。
- 准备单独的表格
CFTemp
和CFpH
CFTemp = CF %>% ungroup() %>%
filter(data_type == "Temp") %>%
mutate(Temp = "Temp",
Temp_start = date_time_start,
Temp_end = date_time_end) %>%
select(Temp, Temp_start, Temp_end)
CFpH = CF %>% ungroup() %>%
filter(data_type == "pH") %>%
mutate(pH = "pH",
pH_start = date_time_start,
pH_end = date_time_end) %>%
select(pH, pH_start, pH_end)
- 准备两个函数,返回具有二进制值的向量,并为其删除索引数据。
fTemp = function(df) CFTemp %>% left_join(df, by="Temp") %>%
mutate(TempRm = Measurement.time>=Temp_start & Measurement.time<=Temp_end) %>%
group_by(ID) %>%
summarise(TempRm = any(TempRm)) %>%
pull(TempRm)
fpH = function(df) CFpH %>% left_join(df, by="pH") %>%
mutate(pHRm = Measurement.time>=pH_start & Measurement.time<=pH_end) %>%
group_by(ID) %>%
summarise(pHRm = any(pHRm)) %>%
pull(pHRm)
- 转换数据框
df1 = df %>% as_tibble() %>%
mutate(Measurement.time = as.POSIXct(Measurement.time),
ID = 1:nrow(.),
Temp = "Temp",
pH = "pH") %>%
mutate(
TmpRm = fTemp(.),
pHRm = fpH(.)
) %>%
mutate(
WalchemTempTank29_Avg = ifelse(TmpRm, NA, WalchemTempTank29_Avg),
WalchemTempTank29_Std = ifelse(TmpRm, NA, WalchemTempTank29_Std),
WalchemTempTank13_Avg = ifelse(TmpRm, NA, WalchemTempTank13_Avg),
WalchemTempTank13_Std = ifelse(TmpRm, NA, WalchemTempTank13_Std),
pH_Tank40_Avg = ifelse(pHRm, NA, pH_Tank40_Avg),
pH_Tank40_Std = ifelse(pHRm, NA, pH_Tank40_Std),
) %>%
select(Measurement.time:WalchemTempTank13_Std)
df1
输出
# A tibble: 10 x 8
Measurement.time Trial WalchemTempTank29_Avg WalchemTempTank29_Std pH_Tank40_Avg pH_Tank40_Std WalchemTempTank13_Avg WalchemTempTank13_Std
<dttm> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 2021-03-30 11:00:00 1 18.8 0.037 7.859 0.007 10.26 0.01
2 2021-03-30 11:05:00 1 18.67 0.025 7.863 0.006 10.22 0.013
3 2021-03-30 11:10:00 1 18.58 0.032 7.868 0.002 10.21 0.005
4 2021-03-30 11:15:00 1 18.48 0.029 7.866 0.001 10.24 0.01
5 2021-03-31 18:30:00 1 NA NA 7.863 0.002 NA NA
6 2021-03-30 11:25:00 1 18.29 0.018 7.864 0.001 10.3 0.006
7 2021-03-30 11:30:00 1 18.2 0.026 7.865 0.002 10.32 0.006
8 2021-03-30 11:35:00 1 18.12 0.025 7.867 0.001 10.34 0.008
9 2021-03-31 17:00:00 1 18.03 0.024 NA NA 10.37 0.008
10 2021-03-31 19:28:00 1 NA NA NA NA NA NA
就这样。
更新 1
library(tidyverse)
CFTemp = CF %>% ungroup() %>%
filter(data_type == "Temp") %>%
mutate(Temp = "Temp",
Temp_start = date_time_start,
Temp_end = date_time_end) %>%
select(Temp, tank, Temp_start, Temp_end)
CFpH = CF %>% ungroup() %>%
filter(data_type == "pH") %>%
mutate(pH = "pH",
pH_start = date_time_start,
pH_end = date_time_end) %>%
select(pH, pH_start, pH_end)
fTemp = function(df, Tank){
out = CFTemp %>% filter(tank==Tank) %>%
left_join(df, by="Temp") %>%
mutate(TempRm = Measurement.time>=Temp_start & Measurement.time<=Temp_end) %>%
group_by(ID) %>%
summarise(TempRm = any(TempRm)) %>%
pull(TempRm)
if(length(out)==0) FALSE else out
}
fpH = function(df) CFpH %>% left_join(df, by="pH") %>%
mutate(pHRm = Measurement.time>=pH_start & Measurement.time<=pH_end) %>%
group_by(ID) %>%
summarise(pHRm = any(pHRm)) %>%
pull(pHRm)
df1 = df %>% as_tibble() %>% #Step 1
mutate(Measurement.time = as.POSIXct(Measurement.time),
ID = 1:nrow(.),
Temp = "Temp",
pH = "pH") %>%
mutate( #Step 2
TmpRm29 = fTemp(., 29),
TmpRm13 = fTemp(., 13),
pHRm = fpH(.)
) %>%
mutate( #Step 3
WalchemTempTank29_Avg = ifelse(TmpRm29, NA, WalchemTempTank29_Avg),
WalchemTempTank29_Std = ifelse(TmpRm29, NA, WalchemTempTank29_Std),
WalchemTempTank13_Avg = ifelse(TmpRm13, NA, WalchemTempTank13_Avg),
WalchemTempTank13_Std = ifelse(TmpRm13, NA, WalchemTempTank13_Std),
pH_Tank40_Avg = ifelse(pHRm, NA, pH_Tank40_Avg),
pH_Tank40_Std = ifelse(pHRm, NA, pH_Tank40_Std),
) %>%
select(Measurement.time:WalchemTempTank13_Std)
df1
输出
# A tibble: 10 x 8
Measurement.time Trial WalchemTempTank29_Avg WalchemTempTank29_Std pH_Tank40_Avg pH_Tank40_Std WalchemTempTank13_Avg WalchemTempTank13_Std
<dttm> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 2021-03-30 11:00:00 1 18.8 0.037 7.859 0.007 10.26 0.01
2 2021-03-30 11:05:00 1 18.67 0.025 7.863 0.006 10.22 0.013
3 2021-03-30 11:10:00 1 18.58 0.032 7.868 0.002 10.21 0.005
4 2021-03-30 11:15:00 1 18.48 0.029 7.866 0.001 10.24 0.01
5 2021-03-31 18:30:00 1 NA NA 7.863 0.002 10.27 0.007
6 2021-03-30 11:25:00 1 18.29 0.018 7.864 0.001 10.3 0.006
7 2021-03-30 11:30:00 1 18.2 0.026 7.865 0.002 10.32 0.006
8 2021-03-30 11:35:00 1 18.12 0.025 7.867 0.001 10.34 0.008
9 2021-03-31 17:00:00 1 18.03 0.024 NA NA 10.37 0.008
10 2021-03-31 19:28:00 1 NA NA NA NA 10.4 0.005
推荐阅读
- wordpress - 如何在 Bitnami 中恢复 Apache 配置文件
- node.js - 无法在 heroku 上进行 GET / MERN 堆栈部署
- angular - 生产模式下的 Ngx Quill 错误:未捕获(承诺中):TypeError:ne 不是函数
- c++ - 如何通过 FetchContent 将 Eigen 库添加到 cmake c++ 项目
- java - Vaadin 14 - 文本字段未正确显示枚举
- css - Rails 6 未在开发中获取样式表更改
- c# - CLR 如何从程序集中加载文件?
- reactjs - ReactDataGrid 未正确显示行
- microsoft-teams - 如何使用 c# 代码删除 MSteams 中发布的自适应消息
- python - Discordpy 定时消息或禁令或角色,机器人重启 | 不和谐