A function to run multiple t-tests to find main effects

Problem description

df <- data.frame(rating1 = c(1,5,2,4,5),
                 rating2 = c(2,1,2,4,2),
                 rating3 = c(0,2,1,2,0),
                 race = c("black", "asian", "white","black","white"),
                 gender = c("male","female","female","male","female")
                 )

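I want to run a t-test comparing a group mean (e.g. the mean of rating1 for Asian respondents) with the overall mean of each rating (e.g. rating1). Here is my code for Asian respondents in rating1.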

asian_df <- df %>% 
   filter(race == "asian")
t.test(asian_df$rating1, mean(df$rating1)) 

Then for Black respondents in rating2, I would run

black_df <- df %>% 
   filter(race == "black")
t.test(black_df$rating2, mean(df$rating2))

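How can I write a function that runs the t-test for every group automatically? So far I have to change the variable names by hand, so that I effectively run it for each race, each gender, and each rating (rating1 to rating3). Thanks!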

Tags: r, statistics

Solution


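Running multiple t-tests increases the risk of a Type I error, and you need to adjust for multiple comparisons for your results to be valid/meaningful. You can run the t-tests by looping over the levels of a variable, e.g.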

library(tidyverse)
df <- data.frame (rating1  = c(5,8,7,8,9,6,9,7,8,5,8,5),
                  rating2  = c(2,7,8,4,9,3,6,1,7,3,9,1),
                  rating3  = c(0,6,1,2,7,2,9,1,6,2,3,1),
                  race = c("asian", "asian", "asian","black","asian","black","white","black","white","black","white","black"),
                  gender = c("male","female","female","male","female","male","female","male","female","male","female","male")
)

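# Loop over each race and compare that group's rating1 with the overall mean.
# The second sample is a constant vector (zero variance), so this Welch test
# gives the same t, df and p-value as a one-sample test against that mean.
for (rac in unique(df$race)){
  tmp_df <- df %>% 
    filter(race == rac)
  print(rac)
  print(t.test(tmp_df$rating1,
               rep(mean(df$rating1),
                   length(tmp_df$rating1))))
}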
[1] "asian"

    Welch Two Sample t-test

data:  tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 0.19518, df = 3, p-value = 0.8577
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.550864  2.884198
sample estimates:
mean of x mean of y 
 7.250000  7.083333 

[1] "black"

    Welch Two Sample t-test

data:  tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -1.5149, df = 4, p-value = 0.2044
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.5022651  0.7355985
sample estimates:
mean of x mean of y 
 6.200000  7.083333 

[1] "white"

    Welch Two Sample t-test

data:  tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.75, df = 2, p-value = 0.06433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1842176  2.6842176
sample estimates:
mean of x mean of y 
 8.333333  7.083333 


for (gend in unique(df$gender)){
  tmp_df <- df %>% 
    filter(gender == gend)
  print(gend)
  print(t.test(tmp_df$rating1,
               rep(mean(df$rating1),
                   length(tmp_df$rating1))))
}
[1] "male"

    Welch Two Sample t-test

data:  tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = -2.0979, df = 5, p-value = 0.09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4107761  0.2441094
sample estimates:
mean of x mean of y 
 6.000000  7.083333 

[1] "female"

    Welch Two Sample t-test

data:  tmp_df$rating1 and rep(mean(df$rating1), length(tmp_df$rating1))
t = 3.5251, df = 5, p-value = 0.01683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2933469 1.8733198
sample estimates:
mean of x mean of y 
 8.166667  7.083333 
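
The two loops above still cover only rating1, so here is a minimal sketch of a helper that runs the same comparison for every grouping variable, every level, and every rating column. The function name `group_vs_overall` and its arguments are my own (hypothetical), and it uses a one-sample `t.test(x, mu = ...)` rather than the constant-vector call above; the two give the same statistic here because the constant vector has zero variance. Note that the overall mean is treated as a fixed value even though it is estimated from the same data.

```r
# Data copied from the answer above.
df <- data.frame(
  rating1 = c(5,8,7,8,9,6,9,7,8,5,8,5),
  rating2 = c(2,7,8,4,9,3,6,1,7,3,9,1),
  rating3 = c(0,6,1,2,7,2,9,1,6,2,3,1),
  race = c("asian", "asian", "asian", "black", "asian", "black",
           "white", "black", "white", "black", "white", "black"),
  gender = c("male", "female", "female", "male", "female", "male",
             "female", "male", "female", "male", "female", "male")
)

# Hypothetical helper: one-sample t-test of each group's ratings against
# the overall mean of that rating column. Returns one row per test.
group_vs_overall <- function(data, rating_cols, group_cols) {
  rows <- list()
  for (g in group_cols) {
    for (lvl in unique(as.character(data[[g]]))) {
      in_group <- data[[g]] == lvl
      for (r in rating_cols) {
        x <- data[[r]][in_group]
        # t.test() fails with fewer than two values or constant data; skip those
        if (length(x) < 2 || sd(x) == 0) next
        overall <- mean(data[[r]])
        tt <- t.test(x, mu = overall)
        rows[[length(rows) + 1]] <- data.frame(
          grouping = g, level = lvl, rating = r, n = length(x),
          group_mean = mean(x), overall_mean = overall,
          t = unname(tt$statistic), p_value = tt$p.value
        )
      }
    }
  }
  do.call(rbind, rows)
}

results <- group_vs_overall(df,
                            rating_cols = c("rating1", "rating2", "rating3"),
                            group_cols = c("race", "gender"))
print(results)
```

Collecting the results in one data frame makes it easy to adjust the whole `p_value` column for multiple comparisons afterwards, which matters even more once 15 tests are run instead of 5.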

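Because of multiple testing (5 t-tests in this case), your chance of at least one false positive is 1 - (1 - 0.05)^5 = 22.62% (assuming the tests are independent), which is very high. To address this, you can apply the Bonferroni correction, which takes your desired significance threshold (here, p < 0.05) and divides it by the number of tests (i.e. the new threshold needed to reject the null is p < 0.01). When you apply this correction, even the "best" t-test result (gender, female group; p-value = 0.01683) is not statistically significant.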
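
The arithmetic above can be checked in a few lines. This is a small sketch that copies the five p-values from the printed t-test output and uses base R's `p.adjust()`; the variable names are my own.

```r
# p-values copied from the five t-test outputs above
p_raw <- c(asian = 0.8577, black = 0.2044, white = 0.06433,
           male = 0.09, female = 0.01683)
alpha <- 0.05
m <- length(p_raw)

# Chance of at least one false positive if the m tests were independent
fwer <- 1 - (1 - alpha)^m

# Bonferroni, option 1: compare the raw p-values with alpha / m
reject_threshold <- p_raw < alpha / m

# Bonferroni, option 2 (equivalent): multiply the p-values by m (capped at 1)
# and compare them with the original alpha
p_bonf <- p.adjust(p_raw, method = "bonferroni")
reject_adjusted <- p_bonf < alpha

print(round(fwer, 4))
print(p_bonf)
print(reject_adjusted)
```

Both options lead to the same decisions; working with adjusted p-values is often more convenient because `p.adjust()` also offers less conservative methods such as `"holm"`.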

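Another approach is to use ANOVA to compare the means across all conditions and then use Tukey's HSD to determine which groups differ. Tukey's HSD is a single post-hoc test that already adjusts for all pairwise comparisons within a factor, so you do not need a separate multiple-testing correction and your results are valid. Adapting this approach to your problem is probably the better route. For example, the model below uses the sum of the three ratings as the response, with race and gender as factors: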

anova_one_way <- aov(rating1 + rating2 + rating3 ~ race + gender, data = df)

summary(anova_one_way)

            Df Sum Sq Mean Sq F value  Pr(>F)   
race         2 266.70  133.35   14.01 0.00243 **
gender       1 140.08  140.08   14.72 0.00497 **
Residuals    8  76.13    9.52           
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


TukeyHSD(anova_one_way)

Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = rating1 + rating2 + rating3 ~ race + gender, data = df)

$race
                 diff        lwr       upr     p adj
black-asian -7.050000 -12.963253 -1.136747 0.0224905
white-asian  4.416667  -2.315868 11.149201 0.2076254
white-black 11.466667   5.029132 17.904201 0.0023910

$gender
                 diff       lwr       upr     p adj
male-female -3.416667 -7.523829 0.6904958 0.0913521
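
Since the model above pools the three ratings into one sum, here is a hedged sketch of how the same idea could be applied to each rating separately, which is closer to what the question asks for. It assumes the same data as above, converts `race` and `gender` to factors explicitly (so `TukeyHSD()` can find them), and the names `rating_cols` and `tukey_by_rating` are my own.

```r
# Data copied from the answer above.
df <- data.frame(
  rating1 = c(5,8,7,8,9,6,9,7,8,5,8,5),
  rating2 = c(2,7,8,4,9,3,6,1,7,3,9,1),
  rating3 = c(0,6,1,2,7,2,9,1,6,2,3,1),
  race = c("asian", "asian", "asian", "black", "asian", "black",
           "white", "black", "white", "black", "white", "black"),
  gender = c("male", "female", "female", "male", "female", "male",
             "female", "male", "female", "male", "female", "male")
)

# TukeyHSD() works on factors, so convert the grouping columns explicitly
df$race <- factor(df$race)
df$gender <- factor(df$gender)

rating_cols <- c("rating1", "rating2", "rating3")

# Fit one ANOVA per rating column and run Tukey's HSD on each fit
tukey_by_rating <- lapply(rating_cols, function(r) {
  fit <- aov(reformulate(c("race", "gender"), response = r), data = df)
  TukeyHSD(fit)
})
names(tukey_by_rating) <- rating_cols

print(tukey_by_rating$rating1)
```

Tukey's HSD controls the error rate within each model, but fitting three models is again a form of multiple testing, so you may still want to correct across the ratings.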
