r - 在引导期间在 boot::boot() 函数中使用 dplyr::group_by() 时出错
问题描述
我目前正在尝试对一些数据进行引导分析,最终结果是获得围绕计数数据比例的引导置信区间。
例如,我尝试引导的当前数据将采用以下形式(字符):
> foo
notes
1 a
2 b
3 c
4 c
5 b
6 c
7 b
8 c
9 a
10 a
11 c
12 b
13 d
14 e
15 f
16 f
17 g
18 a
19 b
20 c
21 c
你可以带着它到这里dput()
structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c",
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
在尝试设置一个函数,该函数将输出一个与引导包正常运行所需的命名向量类似的函数(请参见此处的示例),我编写了以下使用dplyr
代码的函数:
library(dplyr)
notes_bootstrap <- function(d, i){
# get global set
global_set <- d %>% distinct()
# take random rows
sampler <- d#[i,]
proportion_table <- sampler %>%
count(.data$notes) %>%
mutate(proportion = n/sum(n)) %>%
ungroup()
# combine with full set to turn NAs to 0s
combined_table <- proportion_table %>% full_join(global_set)
final_table <- combined_table %>%
select(-n) %>%
mutate(proportion = if_else(is.na(proportion),0,proportion))
output <- setNames(final_table$proportion, final_table$notes)
return(output)
}
并且当这个版本的函数与 一起运行时boot()
,它运行得很好,关键问题是它只是对整个数据集进行采样(由于代码的注释部分而没有执行引导程序)。如果你运行这个,你会看到每个估计都是一样的。
bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
bootstrap_analysis$t
如果我确实使用为引导分析随机子集变量子集的部分运行该函数,如下面的代码所示(与上面相同,但删除了注释):
notes_bootstrap <- function(d, i){
# get global set
global_set <- d %>% distinct()
# take random rows
sampler <- d[i,]
proportion_table <- sampler %>%
count(.data$notes) %>%
mutate(proportion = n/sum(n)) %>%
ungroup()
# combine with full set to turn NAs to 0s
combined_table <- proportion_table %>% full_join(global_set)
final_table <- combined_table %>%
select(-n) %>%
mutate(proportion = if_else(is.na(proportion),0,proportion))
output <- setNames(final_table$proportion, final_table$notes)
return(output)
}
然后我收到以下错误:
> bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "character"
该问题的解决方案是运行此代码,以便引导分析按书面方式工作(可能是一个整洁的评估问题?),或者让某人建议一种更有效的方法来进行这种引导分析。
解决方案
有趣的问题!我试图在不使用引导包的情况下解决这个问题,而是使用基本功能(主要是出于透明目的)。
我可以弄清楚这一点:
#Assigning the provided structure to an object called "df"
df <- structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c",
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b",
"c", "c")),
class = "data.frame", row.names = c(NA, -21L))
#Specifying the bootstrap replications (as far as I know, it's, however,
#rather recommended to use 10K replications and more)
B <- 100
#Number of observations (i.e., 21 in this case)
N <- nrow(df)
#setting the seed to ensure pseudo-randomness for the samples
#we just want to generate and, of course, to ensure general reproducibility
set.seed(42, sample.kind = "Rounding")
#bootstrapping the proportion (aka mean) of the note "a"
boot_note_a <- replicate(B, {
#taking random samples of the same sample size of those notes
#and putting back the taken sample in the urn—for each iteration
notes_star <- sample(df$notes, N, replace = T)
#getting the proportion of the note "a" within each bootstrapped sample.
#Hence, we'll get B (100 in this case) times a proportion of the note "a"
#based on the respective bootstrapped sample.
mean(notes_star == "a")
})
#getting the confidence interval (at a confidence level of 95%) of the
#bootstrapped proportion of the note "a" in the bootstrapped sample
quantile(boot_note_a, prob = c(0.025, 0.975))
最后,我们可以像这样快速绘制(双关语)这个结果:
#calculating the binwidth according to Freedman & Diaconis (1981); see also
#Hyndman (1995)
binw <- 2 * IQR(boot_note_a) / length(boot_note_a)^(1/3)
#plotting
p1 <- qplot(boot_note_a, binwidth = binw, color = I("red") )
p2 <- qplot(sample = scale(boot_note_a), xlab = "theoretical", ylab = "sample")+
geom_abline()
gridExtra::grid.arrange(p1, p2, ncol = 2)
归根结底,我认为你会得到想要的结果——至少对于音符“a”(诚然,我们必须对剩下的六个音符重复这个策略)。因此,这种解决方案可能不是最有效的方式,但希望是透明的。如果这个求解策略运行良好,我们可以调整它并通过使用 apply-family 左右来提高它的效率。
干杯,邱!
推荐阅读
- c++ - 将递归模拟为堆栈返回错误
- reactjs - 如何在本机反应中垂直定位图像?
- postgresql - PostgreSQL 11 的 order by 子句的性能问题
- laravel - 使用 webpack 和 FontAwesome 5 Pro,生成一个奇怪的前缀
- apache-spark - 处理小消息文件的技术栈
- html - 如何将不同的图片添加到共享同一类的一行 div 中?
- android - Dagger/MissingBinding androidx.lifecycle.ViewModelProvider.Factory 不能在没有@Provides-annotated 方法的情况下提供
- javascript - 在没有 I/O 的情况下,javascript(在浏览器中)中的异步/承诺是否有益?
- python - 使用 PyInstaller 编译 python,用 Python
- ionic3 - Ionic4 与 Ionic3 中的离子段