Efficient way to compute weighted proportions for subgroups?

Problem description

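Goal: compute weighted proportions for several subgroups in a more efficient way (e.g., with a function). I need to subset on two variables (var1, var2) in all of their combinations and compute the weighted proportions of an outcome (var3). I am working in R (but Python solutions are also welcome).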

Reprex:

# Reprex
library(dplyr)
library(weights)

df <- data.frame(
  var1 = c(1, 1, 1, 2, 1, 2, 1, 2, 2, 1),
  var2 = c(1, 2, 2, 3, 3, 3, 2, 1, 2, 2),
  var3 = c("A", "B", "A", "A", "A", "B", "A", "B", "A", "A"),
  weight = rnorm(10)
)

# sub1 
sub <- filter(df, var1 == 1 & var2 == 3)
round(weights::wpct(sub$var3, weight = sub$weight), digits = 2)

# sub2
sub <- filter(df, var1 == 2)
round(weights::wpct(sub$var3, weight = sub$weight), digits = 2)

# sub3
sub <- filter(df, var2 == 2)
round(weights::wpct(sub$var3, weight = sub$weight), digits = 2)

# Looking for more efficient way to continue subgroups (with more vars and combinations)

Tags: python, r, dplyr

Solution


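This is straightforward with data.table's cube function, which evaluates an expression for every grouping of several variables as well as for the overall (marginal) groupings. There is one small problem: data.table expects a single value per group as output, whereas wpct returns one value for each level of x (var3 in our case). Luckily, wpct names its output, so wrapping the result in a list, result = list(weights::wpct(var3, weight)), keeps it in one column that we can later convert into a readable format.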

set.seed(1)
library(data.table)
library(weights)

df <- data.frame(
  var1 = c(1, 1, 1, 2, 1, 2, 1, 2, 2, 1),
  var2 = c(1, 2, 2, 3, 3, 3, 2, 1, 2, 2),
  var3 = c("A", "B", "A", "A", "A", "B", "A", "B", "A", "A"),
  weight = rnorm(10)
)
setDT(df)
# Note that I use list(weights::wpct(var3, weight)), 
#  because I want to keep the result in *one* column.
res <- cube(df, 
            j = c(list(result = list(weights::wpct(var3, weight)))), 
            by = c('var1', 'var2'))
res
## Output
    var1 var2                  result
 1:    1    1                       1
 2:    1    2    1.3907765,-0.3907765
 3:    2    3      2.058925,-1.058925
 4:    1    3                       1
 5:    2    1                       1
 6:    2    2                       1
 7:    1   NA    1.2394648,-0.2394648
 8:    2   NA  1.03932354,-0.03932354
 9:   NA    1     -5.599793, 6.599793
10:   NA    2   -0.7351568, 1.7351568
11:   NA    3    1.7429624,-0.7429624
12:   NA   NA   0.92322427,0.07677573
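
Since the question also welcomes Python, here is a minimal standard-library sketch of the same idea: compute the weighted share of each var3 level for every subset of the grouping variables. The helper names weighted_props and cube_props are hypothetical, and the weights are made-up positive numbers (an assumption), so the figures will not match the rnorm-based output above.

```python
from collections import defaultdict
from itertools import combinations

# Same var1/var2/var3 values as the reprex. The weights are illustrative
# positive numbers (an assumption), not the rnorm() draws used in the R code.
var1 = [1, 1, 1, 2, 1, 2, 1, 2, 2, 1]
var2 = [1, 2, 2, 3, 3, 3, 2, 1, 2, 2]
var3 = ["A", "B", "A", "A", "A", "B", "A", "B", "A", "A"]
weight = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0]
rows = [
    {"var1": a, "var2": b, "var3": c, "weight": w}
    for a, b, c, w in zip(var1, var2, var3, weight)
]


def weighted_props(rows, by, value="var3", weight="weight"):
    """Weighted share of each `value` level within each group defined by `by`."""
    sums = defaultdict(lambda: defaultdict(float))
    for r in rows:
        key = tuple(r[k] for k in by)
        sums[key][r[value]] += r[weight]
    out = {}
    for key, levels in sums.items():
        total = sum(levels.values())
        out[key] = {lvl: w / total for lvl, w in levels.items()}
    return out


def cube_props(rows, by):
    """Run weighted_props for every subset of `by`, including the grand total."""
    result = {}
    for n in range(len(by), -1, -1):
        for subset in combinations(by, n):
            for key, props in weighted_props(rows, subset).items():
                # None marks a variable that was aggregated over
                # (the role NA plays in the cube output).
                lookup = dict(zip(subset, key))
                result[tuple(lookup.get(k) for k in by)] = props
    return result


res = cube_props(rows, ["var1", "var2"])
for key, props in res.items():
    print(key, {lvl: round(p, 2) for lvl, p in props.items()})
```

Each key is a (var1, var2) pair, and each value is a dictionary of shares keyed by var3 level, which plays the same role as the named vector that wpct returns.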

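In the cube output, the groupings are shown in the var1 and var2 columns, and the overall groups are calculated as well (e.g. var1 = 1 with var2 = *any*, or var1, var2 = *any*); an aggregated variable appears as NA. However, as mentioned above, this result is hardly readable. We can fix that by using unnest_wider from tidyr to break the result column up into a nicer format: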

library(dplyr)
library(tidyr)
res %>% unnest_wider(result)
# A tibble: 12 x 4
    var1  var2      A       B
   <dbl> <dbl>  <dbl>   <dbl>
 1     1     1  1     NA     
 2     1     2  1.39  -0.391 
 3     2     3  2.06  -1.06  
 4     1     3  1     NA     
 5     2     1 NA      1     
 6     2     2  1     NA     
 7     1    NA  1.24  -0.239 
 8     2    NA  1.04  -0.0393
 9    NA     1 -5.60   6.60  
10    NA     2 -0.735  1.74  
11    NA     3  1.74  -0.743 
12    NA    NA  0.923  0.0768

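Now we have a readable format: the first two columns give the grouping, and the remaining columns give the result for each value of var3. Note that NA is returned if a value of var3 does not occur within a particular var1 + var2 grouping. The proportions fall outside [0, 1] here only because the example weights come from rnorm() and some of them are negative.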
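
For Python readers, the widening step can be sketched in the same spirit. This is only an illustration: the nested dictionary below is hypothetical example data shaped like the list column, and widen is a made-up helper name, not a tidyr or pandas function.

```python
# Hypothetical nested result, shaped like the list column above:
# one {var3 level: weighted share} dictionary per (var1, var2) group.
nested = {
    (1, 2): {"A": 0.6, "B": 0.4},
    (2, 1): {"B": 1.0},
    (None, None): {"A": 2 / 3, "B": 1 / 3},
}


def widen(nested, by):
    """Turn each group's dictionary into one flat row with a column per level."""
    levels = sorted({lvl for props in nested.values() for lvl in props})
    table = []
    for key, props in nested.items():
        row = dict(zip(by, key))
        for lvl in levels:
            # None when the level does not occur in this group
            # (the counterpart of NA in the tibble).
            row[lvl] = props.get(lvl)
        table.append(row)
    return table


wide = widen(nested, ["var1", "var2"])
for row in wide:
    print(row)
```

Collecting the set of levels first is what guarantees that every row has the same columns, even for groups where a level is missing.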

