首页 > 解决方案 > 一种运行许多方差分析和提取某些列的快速方法

问题描述

我的数据由响应变量 ( y) 和两个因素 (sextime) 组成,用于几个groups:

set.seed(1)
df <- data.frame(y = rnorm(26*18),
                 group = sort(rep(LETTERS,18)),
                 sex = rep(c(rep("F",9),rep("M",9)),26),
                 time = rep(rep(sort(rep(1:3,3)),2),26))
df$sex <- factor(df$sex, levels = c("M","F"))

我想使用R's anova, for each在模型之间进行测试group,并将它们全部组合在一起,对于我正在拟合的模型中的每个因素data.frame都有一列anova p-value,其中每一行都是groups I'米运行anova

这是我目前正在做的事情:

anova.df <- do.call(rbind,lapply(unique(df$group),function(i){
  an.df <- anova(lm(y ~ sex*time,data=df %>% dplyr::filter(group == i)))
  an.df <- data.frame(factor.name=rownames(an.df)[1:(nrow(an.df)-1)],p.value=an.df[1:(nrow(an.df)-1),which(colnames(an.df) == "Pr(>F)")]) %>%
    tidyr::spread(factor.name,p.value) %>%
    dplyr::mutate(group=i)
  return(an.df)
}))

但实际上我有大约 15,000group秒,所以我想知道有没有更快的方法来做到这一点。

标签: rdataframedplyranova

解决方案


我想purrr可以帮助你。
也许这不是最好的决定,但尝试这样的事情:

 df%>%
   group_by(group)%>%
   nest()%>%
   mutate(fit = map(data, ~ anova(lm(y ~ sex*time, data = .x))))%>%
   select(group,data,fit)%>%
   unnest(fit)%>%
   select(group,`Pr(>F)`)%>%
   na.omit()%>%
   mutate(var=rep(c("sex","time","sex:time"),times=nrow(.)/3))%>%
   spread(var,`Pr(>F)`)
# A tibble: 26 x 4
   group   sex `sex:time`  time
   <fct> <dbl>      <dbl> <dbl>
 1 A     0.840    0.284   0.498
 2 B     0.414    0.627   0.500
 3 C     0.642    0.469   0.430
 4 D     0.423    0.569   0.567
 5 E     0.169    0.904   0.625
 6 F     0.845    0.00390 0.869
 7 G     0.937    0.318   0.473
 8 H     0.329    0.663   0.609
 9 I     0.977    0.144   0.158
10 J     0.823    0.448   0.193
# ... with 16 more rows

microbenchmark::microbenchmark(x= df%>%
                                  group_by(group)%>%
                                  nest()%>%
                                  mutate(fit = map(data, ~ anova(lm(y ~ sex*time, data = .x))))%>%
                                  select(group,data,fit)%>%
                                  unnest(fit)%>%
                                  select(group,`Pr(>F)`)%>%
                                  na.omit()%>%
                                  mutate(var=rep(c("sex","time","sex:time"),times=nrow(.)/3))%>%
                                  spread(var,`Pr(>F)`),
                                y=anova.df <- do.call(rbind,lapply(unique(df$group),function(i){
                                  an.df <- anova(lm(y ~ sex*time,data=df %>% dplyr::filter(group == i)))
                                  an.df <- data.frame(factor.name=rownames(an.df)[1:(nrow(an.df)-1)],p.value=an.df[1:(nrow(an.df)-1),which(colnames(an.df) == "Pr(>F)")]) %>%
                                    tidyr::spread(factor.name,p.value) %>%
                                    dplyr::mutate(group=i)
                                  return(an.df)
                                })),times=50)
Unit: milliseconds
 expr       min        lq     mean    median        uq      max neval cld
    x  69.98061  71.02417  74.0585  72.45625  74.08786  89.4715    50  a 
    y 166.63844 168.22296 181.6709 171.08077 184.14635 434.8872    50   b

推荐阅读