Computing a weighted standard deviation for a large data frame

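Question

How can I compute the weighted standard deviation for each game round? The real data frame is very wide (many players: r001 through r100) and very long (many game rounds). The weights differ from one game round to the next.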

df <- data.frame(gameround = c("1_1", "1_2", "1_3"),
                 r001 = c(3, 5, 4), r002 = c(2, 3, 5), r003 = c(1, 2, 2),
                 weight001 = c(0.7, 0.8, 0.7), weight002 = c(0.6, 0.1, 0.6),
                 weight003 = c(0.2, 0.7, 0.2), weightedsd = NA)

 #gameround r001 r002 r003 weight001 weight002 weight003 weightedsd
 # 1_1       3    2    1       0.7       0.6       0.2         NA
 # 1_2       5    3    2       0.8       0.1       0.7         NA
 # 1_3       4    5    2       0.7       0.6       0.2         NA

Tags: r, dplyr, tidyverse, standard-deviation, weighted

Answer


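Two pivots are needed here:

  1. First, pivot the data to long format across the score and weight columns.
  2. Next, separate the scores and the weights into individual columns.
  3. This creates list columns; as a final preprocessing step, we unnest those.
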
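library(dplyr)
library(tidyr)
long = df %>%
    # keep only "r" or "weight" from each column name
    pivot_longer(matches('\\d{3}'), names_pattern = '(r|weight)\\d+') %>%
    # one `r` and one `weight` column, each holding a list per game round
    pivot_wider(values_fn = list) %>%
    unnest(c(r, weight))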
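
As a possible alternative to the two pivots above (not part of the original answer), tidyr's `.value` sentinel in `names_to` can produce the `r` and `weight` columns in a single `pivot_longer()` call, so no list columns appear. The sketch below assumes every score and weight column ends in a three-digit player id, as in the question; the names `long2` and `player` are chosen here for illustration only.

```r
library(tidyr)

# Data copied from the question
df <- data.frame(
  gameround = c("1_1", "1_2", "1_3"),
  r001 = c(3, 5, 4), r002 = c(2, 3, 5), r003 = c(1, 2, 2),
  weight001 = c(0.7, 0.8, 0.7), weight002 = c(0.6, 0.1, 0.6),
  weight003 = c(0.2, 0.7, 0.2), weightedsd = NA
)

# ".value" turns the first capture group ("r" or "weight") into output
# column names; the second group (the digits) goes into a column that is
# called "player" here (a hypothetical name, not from the original).
long2 <- pivot_longer(
  df,
  matches("\\d{3}"),
  names_to = c(".value", "player"),
  names_pattern = "(r|weight)(\\d+)"
)
```

The result has one row per player per game round, with the same `r` and `weight` columns that the summarising step expects.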

With the data in this shape, computing the weighted standard deviation directly from its definition is straightforward:

weighted_var = function (x, w) sum(w * (x - weighted.mean(x, w)) ^ 2) / sum(w)
weighted_sd = function (x, w) sqrt(weighted_var(x, w))
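
Written out, the two helper functions above correspond to the following formulas, where x_i are the scores within one game round and w_i the matching weights (the symbols are introduced here for illustration). This is the form normalised by the sum of the weights, exactly as in the code, with no small-sample correction applied.

```latex
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i},
\qquad
\sigma_w = \sqrt{\frac{\sum_i w_i \,(x_i - \bar{x}_w)^2}{\sum_i w_i}}
```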

long %>% group_by(gameround) %>% summarize(sd = weighted_sd(r, weight))
# A tibble: 3 x 2
#   gameround    sd
#   <chr>     <dbl>
# 1 1_1       0.699
# 2 1_2       1.46
# 3 1_3       0.957
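
For a frame as wide as the one described in the question, the same quantity can probably also be computed without reshaping. The following is a minimal base R sketch, not part of the original answer. It assumes the score columns match `^r\d{3}$`, the weight columns match `^weight\d{3}$`, and both sets appear in the same player order; the names `x`, `w` and `wm` are chosen here.

```r
# Data copied from the question
df <- data.frame(
  gameround = c("1_1", "1_2", "1_3"),
  r001 = c(3, 5, 4), r002 = c(2, 3, 5), r003 = c(1, 2, 2),
  weight001 = c(0.7, 0.8, 0.7), weight002 = c(0.6, 0.1, 0.6),
  weight003 = c(0.2, 0.7, 0.2), weightedsd = NA
)

# Scores and weights as matrices: one row per game round, one column per player.
# Assumption: column i of x and column i of w belong to the same player.
x <- as.matrix(df[, grep("^r\\d{3}$", names(df))])
w <- as.matrix(df[, grep("^weight\\d{3}$", names(df))])

# Row-wise weighted mean, one value per game round
wm <- rowSums(w * x) / rowSums(w)

# x - wm recycles wm down the columns, so each row has its own mean subtracted
df$weightedsd <- sqrt(rowSums(w * (x - wm)^2) / rowSums(w))

round(df$weightedsd, 3)
```

On the sample data this fills the `weightedsd` column with the same values as the grouped summary, which makes it a convenient cross-check.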
