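r - Using aggregate() and keeping NA rows
Problem description
It has been years since I spent this much time on a single task.
There are several hints on SO, for example here or here, so it is tempting to call this a duplicate (I would even say so myself). But even with those examples and many attempts, I could not get the result I need.
Here is the complete example: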
x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))
x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))
lpp <- lapply(spl,
function(x) { r <- with(x,
data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
val_g_lab=cut(val, seq(0,1,0.1)))); r })
rd <- do.call(rbind, lpp); ord <- rd[order(rd$idx, decreasing = FALSE), ]; ord
aggregate(val ~ group + val_g_lab, ord,
FUN=function(x) c(mean(x, na.rm = FALSE),
sum(!is.na(x))), na.action=na.pass)
Desired output: I want the NA rows to be kept in the output of aggregate(). Currently aggregate() drops the rows with NA.
idx group val val_g val_g_lab
a.1 1 a 0.53789249 6 (0.5,0.6]
b.2 2 b 0.01729695 1 (0,0.1]
c.3 3 c 0.62295270 7 (0.6,0.7]
d.4 4 d 0.60291892 7 (0.6,0.7]
e.5 5 e 0.76422909 8 (0.7,0.8]
f.6 6 f 0.87433547 9 (0.8,0.9]
g.7 7 g NA NA <NA>
h.8 8 h 0.50590159 6 (0.5,0.6]
i.9 9 i 0.89084068 9 (0.8,0.9]
...... continued (the full data set is the ord object).
Solution
One workaround is simply not to use NA for the value group. First, initialize your data as above:
x <- data.frame(idx=1:30, group=rep(letters[1:10],3), val=runif(30))
x$val[sample.int(nrow(x), 5)] <- NA; x
spl <- with(x, split(x, group))
lpp <- lapply(spl,
function(x) { r <- with(x,
data.frame(x, val_g=cut(val, seq(0,1,0.1), labels = FALSE),
val_g_lab=cut(val, seq(0,1,0.1)))); r })
rd <- do.call(rbind, lpp);
ord <- rd[order(rd$idx, decreasing = FALSE), ];
Then simply convert the labels to character and replace the NAs with an arbitrary string literal:
# Convert to character
ord$val_g_lab <- as.character(ord$val_g_lab)
# Convert NAs
ord$val_g_lab[is.na(ord$val_g_lab)] <- "Unknown"
aggregate(val ~ group + val_g_lab, ord,
FUN=function(x) c(mean(x, na.rm = FALSE), sum(!is.na(x))),
na.action=na.pass)
# group val_g_lab val.1 val.2
#1 e (0,0.1] 0.02292533 1.00000000
#2 g (0.1,0.2] 0.16078353 1.00000000
#3 g (0.2,0.3] 0.20550228 1.00000000
#4 i (0.2,0.3] 0.26986665 1.00000000
#5 j (0.2,0.3] 0.23176149 1.00000000
#6 d (0.3,0.4] 0.39196441 1.00000000
#7 e (0.3,0.4] 0.39303518 1.00000000
#8 g (0.3,0.4] 0.35646994 1.00000000
#9 i (0.3,0.4] 0.35724889 1.00000000
#10 a (0.4,0.5] 0.48809261 1.00000000
#11 b (0.4,0.5] 0.40993166 1.00000000
#12 d (0.4,0.5] 0.42394859 1.00000000
# ...
#20 b (0.9,1] 0.99562918 1.00000000
#21 c (0.9,1] 0.92018049 1.00000000
#22 f (0.9,1] 0.91379088 1.00000000
#23 h (0.9,1] 0.93445802 1.00000000
#24 j (0.9,1] 0.93325098 1.00000000
#25 b Unknown NA 0.00000000
#26 c Unknown NA 0.00000000
#27 d Unknown NA 0.00000000
#28 i Unknown NA 0.00000000
#29 j Unknown NA 0.00000000
Does this do what you want?
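Because the data above is random, here is a smaller deterministic sketch of the same idea. The data frame d, its column bin, and the helper count_non_na are made up for illustration and are not part of the question's data; the point is only to compare the row counts with and without the recoding.

```r
# Hypothetical toy data: 'bin' plays the role of val_g_lab and is NA
# exactly where 'val' is NA.
d <- data.frame(
  group = c("a", "a", "b", "b"),
  bin   = c("low", NA, "high", NA),
  val   = c(1, NA, 3, NA),
  stringsAsFactors = FALSE
)

# Counts the non-missing values in each group.
count_non_na <- function(v) sum(!is.na(v))

# NA in a grouping column: those rows are omitted from the result,
# even though na.action = na.pass keeps them in the model frame.
res_na <- aggregate(val ~ group + bin, d,
                    FUN = count_non_na, na.action = na.pass)

# Recode the NA labels as an ordinary string, then aggregate again.
d2 <- d
d2$bin[is.na(d2$bin)] <- "Unknown"
res_chr <- aggregate(val ~ group + bin, d2,
                     FUN = count_non_na, na.action = na.pass)

nrow(res_na)   # only the groups whose label is not NA
nrow(res_chr)  # additionally contains the "Unknown" groups
```

In this sketch res_na should have two rows and res_chr four, the extra two being the "Unknown" groups with a count of 0.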
Edit:
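To answer your question in the comments: note that NaN and NA are not exactly the same (see here). Also note that both are very different from "NaN" and "NA", which are string literals (i.e. just text). In any case, NAs are special "atomic" elements that functions almost always treat as an exceptional case, so you have to check the documentation to see how a particular function handles them. In this case, the na.action argument applies to the values you aggregate, not to the "classes" (the grouping variables) in the formula. The drop=FALSE argument can also be used, but then you get (in this case) all combinations of the two classifications. Redefining NA as a string literal works because the new name is treated like any other class.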
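A short sketch of these distinctions, again on made-up data (the data frame d, the column bin and the helper count_non_na are hypothetical and only serve as illustration):

```r
# NA is a missing value, NaN is "not a number", "NA" is ordinary text.
is.na(NA)     # TRUE
is.na(NaN)    # TRUE: is.na() also flags NaN
is.nan(NA)    # FALSE: NA is not NaN
is.nan(NaN)   # TRUE
is.na("NA")   # FALSE: a string literal, not a missing value

# Toy data where the missing labels are already recoded as "Unknown".
d <- data.frame(
  group = c("a", "a", "b", "b"),
  bin   = c("low", "Unknown", "high", "Unknown"),
  val   = c(1, NA, 3, NA),
  stringsAsFactors = FALSE
)
count_non_na <- function(v) sum(!is.na(v))

# Default (drop = TRUE): only combinations that occur in the data.
res_used <- aggregate(val ~ group + bin, d,
                      FUN = count_non_na, na.action = na.pass)

# drop = FALSE asks aggregate() to also report combinations of
# group and bin that never occur in the data.
res_all <- aggregate(val ~ group + bin, d,
                     FUN = count_non_na, na.action = na.pass,
                     drop = FALSE)

nrow(res_used)
nrow(res_all)
```

Comparing the two row counts shows why drop=FALSE is usually more than you want here: it expands the result to the full grid of classes instead of adding only the missing-label groups.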