Calculating the average of the last two observations when the relevant columns differ for every subject

Problem description

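I am new to R and this is my first post here. I am struggling with a problem and would appreciate your advice. I have a dataset with 3 groups of columns (observation values, data-quality flags, and observation dates), and I need to manipulate them to get the desired result: the average of the two most recent observations that are of good quality and that occurred after a cutoff date (e.g. 3/15/2018). The complication is that the columns that go into the average differ from case to case.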

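To calculate the average of the two most recent observations with good data quality:

  1. I first have to determine which observations are of good quality,

  2. then average the ones that occurred after 3/15. The average must be taken over exactly 2 observations, and they must be the two most recent ones.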

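My sample dataset is below. I tried doing this by hand in Excel and it was really tedious. I would like to do it in R and would greatly appreciate your feedback. Thanks!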

Here is my sample dataset: 
> dput(head(df,5))

structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7, 
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA, 
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE, 
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq = c(NA, 
NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577, 
17578), class = "Date"), obs_2_date = structure(c(17606, 17606, 
17608, 17598), class = "Date"), obs_3_date = structure(c(NA, 
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA, 
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5, 
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3", 
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date", 
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names = c(NA, 
4L), class = "data.frame")
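
The `Date` columns in the `dput` output are stored as day counts since 1970-01-01, which makes them hard to compare with the cutoff by eye. Below is a minimal sketch of the conversion. It assumes that "after the cutoff" means strictly later than 2018-03-15 (the answer's code makes the same choice by discarding dates on or before the cutoff), and the variable names are illustrative only.

```r
# Cutoff from the question, as a Date and as a day count.
cutoff <- as.Date("2018-03-15")
cutoff_days <- as.numeric(cutoff)   # 17605

# Subject 4's observation dates, copied from the dput output.
subject4_dates <- as.Date(c(17578, 17598, 17600, 17605), origin = "1970-01-01")

# None is strictly after the cutoff: the last one falls exactly on it.
after_cutoff <- subject4_dates > cutoff
after_cutoff   # FALSE FALSE FALSE FALSE
```

Under that reading, subject 4 has no qualifying observations at all, so its average is `NA`.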

Tags: r, conditional, average

Solution


See if this works for you. The code comments are brief.

df <- structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7, 
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA, 
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE, 
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =    
c(NA, NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577, 
17578), class = "Date"), obs_2_date = structure(c(17606, 17606, 
17608, 17598), class = "Date"), obs_3_date = structure(c(NA, 
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA, 
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5, 
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3", 
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date", 
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names = c(NA, 
4L), class = "data.frame")

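# Split the data frame into its three sections.
obs <- df[, 2:5]                        # observation values
dq  <- df[, 6:9]                        # data-quality flags
dt  <- sapply(df[, 10:13], as.numeric)  # dates as day counts, for easier arithmetic
cutoff <- as.numeric(as.Date("2018-03-15"))
# Drop observations flagged as bad quality.
obs[dq == FALSE] <- NA
# Drop observations dated on or before 2018-03-15.
obs[dt <= cutoff] <- NA
# Keep only the two most recent remaining dates in each row.
# (For rows with fewer than two dates left, max() warns and returns -Inf; the result is unaffected.)
dt[is.na(obs)] <- NA
dt <- t(apply(dt, 1, function(x) {x[x < max(x[x != max(x, na.rm=TRUE)], na.rm=TRUE)] <- NA; x}))
obs[is.na(dt)] <- NA
# Average each row, but only when at least two observations remain.
df$avg <- apply(obs, 1, function(x) ifelse(sum(!is.na(x)) >= 2, mean(x, na.rm=TRUE), NA))
df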
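
As a cross-check on the matrix approach, here is a minimal sketch of a row-by-row alternative. The helper `mean_last_two` and its arguments are hypothetical names, not part of the original answer, and the data frame is the question's sample rebuilt compactly without the `desired.average` column. The helper filters one subject's observations by quality and cutoff, sorts the survivors by date, and averages the two most recent. Because it sorts rather than comparing against the second-largest date, it should also cope with tied dates.

```r
# Hypothetical helper (not from the original answer): average of the two most
# recent good-quality observations strictly after `cutoff`, or NA if fewer
# than two observations qualify.
mean_last_two <- function(values, good, dates, cutoff) {
  keep <- !is.na(values) & !is.na(good) & good & !is.na(dates) & dates > cutoff
  if (sum(keep) < 2) return(NA_real_)
  newest <- order(dates[keep], decreasing = TRUE)[1:2]
  mean(values[keep][newest])
}

# Same sample data as in the question.
origin <- "1970-01-01"
df <- data.frame(
  subject_id = 1:4,
  obs_1 = c(5, 6, 7, 3), obs_2 = c(6, 8, 9, 4),
  obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA, NA, 15, 15),
  obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE, TRUE, TRUE, TRUE),
  obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq = c(NA, NA, TRUE, TRUE),
  obs_1_date = as.Date(c(17563, 17580, 17577, 17578), origin = origin),
  obs_2_date = as.Date(c(17606, 17606, 17608, 17598), origin = origin),
  obs_3_date = as.Date(c(NA, 17631, 17623, 17600), origin = origin),
  obs_4_date = as.Date(c(NA, NA, 17631, 17605), origin = origin)
)

# Pull the three sections out as matrices so each row is a plain vector.
obs_cols <- paste0("obs_", 1:4)
vals  <- as.matrix(df[obs_cols])
good  <- as.matrix(df[paste0(obs_cols, "_dq")])
dates <- sapply(df[paste0(obs_cols, "_date")], as.numeric)
cutoff <- as.numeric(as.Date("2018-03-15"))

df$avg <- sapply(seq_len(nrow(df)), function(i) {
  mean_last_two(vals[i, ], good[i, ], dates[i, ], cutoff)
})
df$avg  # expected: NA, 9.5, 12, NA, matching desired.average in the question
```

Selecting columns by name rather than by position is a design choice of this sketch; it keeps working if the column order changes, at the cost of assuming the `obs_N`, `obs_N_dq`, `obs_N_date` naming pattern.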

Edit: explanation

dt <- t(apply(dt, 1, function(x) {x[x < max(x[x != max(x, na.rm=TRUE)], na.rm=TRUE)] <- NA; x}))

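I think the part that needs explaining is the assignment inside the function. Setting na.rm to true drops the NA values. max(x[x != max(x)]) is the second-largest number, so the assignment amounts to x[x < 2nd_largest_num] = NA, which blanks out every number except the largest and the second largest. The function is then applied to the matrix row by row. The end result is that each row of dt contains only its two largest numbers (the most recent dates, in numeric form). All the "discarded" values (the NAs in dt) are then removed from obs by the next line, which sets obs to NA wherever dt is NA. After all this, each row of obs contains only its two most recent values. Note that this relies on the dates within a row being distinct; if two dates tie, more than two values can survive.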
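
To make the second-largest trick concrete, here is a small worked example on a single made-up row of date values. The vector and variable names are illustrative only.

```r
# One row of date values (days since 1970-01-01), with a missing entry.
x <- c(17577, 17608, NA, 17631)

largest <- max(x, na.rm = TRUE)                        # 17631
second_largest <- max(x[x != largest], na.rm = TRUE)   # 17608

# Blank out everything older than the second most recent date.
x[x < second_largest] <- NA
x  # NA 17608 NA 17631
```

Only the two most recent dates remain; the positions that became NA are the ones later blanked out in obs.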

