How to loop summary statistics in R

Question

I have a dataset with about 60 variables (A, B, C, D, ...), and each variable has 3 corresponding columns (A, Group_A and WOE_A), as shown in the table below:

ID   A   Group_A  WOE_A  B  Group_B  WOE_B  C  Group_C  WOE_C  D      Group_D  WOE_D  Status
213  0   1        0.87   0  1        0.65   0  1        0.80   915.7  4        -0.30  1
321  12  5        0.08   4  4        -0.43  6  5        -0.20  85.3   2        0.26   0
32   0   1        0.87   0  1        0.65   0  1        0.80   28.6   2        0.26   1
13   7   4        -0.69  2  3        -0.82  4  4        -0.80  31.8   2        0.26   0
43   1   2        -0.04  1  2        -0.49  1  2        -0.22  51.7   2        0.26   0
656  2   3        -0.28  2  3        -0.82  2  3        -0.65  8.5    1        1.14   0
435  2   3        -0.28  0  1        0.65   0  1        0.80   39.8   2        0.26   0
65   8   4        -0.69  3  4        -0.43  5  4        -0.80  243.0  3        0.00   0
565  0   1        0.87   0  1        0.65   0  1        0.80   4.0    1        1.14   0
432  0   1        0.87   0  1        0.65   0  1        0.80   81.6   2        0.26   0

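I would like to print a table in R with some statistics for each variable: Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A, where Status=1), Count(Group_A, where Status=0). I tried the "dplyr" package, but I don't know how to reference all three columns (A, Group_A and WOE_A) that belong to a variable (A), or how to summarise all of the required statistics.

The code I started with is: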

df <- data
List <- list(df)
for (colname in colnames(df)) {
  List[[colname]]<- df %>%
    group_by(df[,colname]) %>%
    count()
}
List

This is how I would like the result to be printed:

Var A
Group   Min(A)  Max(A)  WOE_A   Count(Group_A)  Count_1(Group_A, where Status=1)  Count_0(Group_A, where Status=0)
1
2
3
4
5

Thanks a lot!

Laura

Tags: r, loops, dplyr, statistics, grouping

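Answer

Laura, as others have mentioned, working with a "long" data frame is better than working with a wide one.

Your initial idea of using dplyr's group_by() gets you almost there. Note: this is also a way to split the data per variable and then recombine it on the common columns, if going from wide to long is pushing the limits.

Let's start with this: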

library(dplyr)

#---------- extract all "A" measurements
df %>% 
   select(A, Group_A, WOE_A, Status) %>% 
#---------- grouped summary of multiple stats
   group_by(A) %>% 
   summarise(
       Min = min(A)
    ,  Max = max(A)
    ,  WOE_A = unique(WOE_A) 
    ,   Count = n()    # n() is a helper function of dplyr
    ,  CountStatus1 = sum(Status == 1)  # use sum() to count logical conditions
    ,  CountStatus0 = sum(Status == 0)
)

This yields:

# A tibble: 6 x 7
      A   Min   Max WOE_A Count CountStatus1 CountStatus0
  <dbl> <dbl> <dbl> <dbl> <int>        <int>        <int>
1     0     0     0  0.87     4            2            2
2     1     1     1 -0.04     1            0            1
3     2     2     2 -0.28     2            0            2
4     7     7     7 -0.69     1            0            1
5     8     8     8 -0.69     1            0            1
6    12    12    12  0.08     1            0            1
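
Because the summary above groups by the raw value A, Min and Max coincide in every row. The layout in the question has one row per Group_A bin, so a variant that groups by Group_A may be closer to what was asked. The following is a minimal sketch under two assumptions: the data frame `df_a` is a hand-typed copy of the A columns from the sample rows, and WOE_A is constant within each Group_A bin (which holds for the sample rows, but is not guaranteed for the full data). The names `df_a`, `by_group`, `Count_1` and `Count_0` are illustrative.

```r
library(dplyr)

# Sample rows from the question, restricted to the columns for variable A
df_a <- data.frame(
  ID      = c(213, 321, 32, 13, 43, 656, 435, 65, 565, 432),
  A       = c(0, 12, 0, 7, 1, 2, 2, 8, 0, 0),
  Group_A = c(1, 5, 1, 4, 2, 3, 3, 4, 1, 1),
  WOE_A   = c(0.87, 0.08, 0.87, -0.69, -0.04, -0.28, -0.28, -0.69, 0.87, 0.87),
  Status  = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

by_group <- df_a %>%
  group_by(Group_A) %>%
  summarise(
    Min     = min(A),
    Max     = max(A),
    WOE_A   = first(WOE_A),      # assumption: one WOE value per Group_A bin
    Count   = n(),
    Count_1 = sum(Status == 1),  # rows in the bin with Status == 1
    Count_0 = sum(Status == 0)   # rows in the bin with Status == 0
  )

print(by_group)
```

With this grouping, bin 4 contains the values 7 and 8, so Min and Max finally differ there.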

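OK. Turning a wide data frame into a long one is not trivial when both the measurement and the variable name are nested in the column names. On top of that, ID and Status are the id/key variables of each row.

The standard tool for converting wide to long is tidyr's pivot_longer(). Read up on it. In your particular case we want to push multiple columns into multiple targets. For this you need to understand the .value sentinel. The pivot_longer() help page is a good place to study this case.

To ease the pain of building a complex regular expression to decode the variable names, I renamed your group-id-label columns, e.g. A, B, to X_A, X_B. This ensures that all column names are built in the form what_letter!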
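
To make the .value sentinel less abstract, here is a small sketch on a toy data frame. The frame `toy` and its values are made up for illustration, and it uses names_sep instead of a regular expression because the toy column names split cleanly on the underscore.

```r
library(tidyr)

# Hypothetical toy data: two measurements (X, Group) for two variables (A, B)
toy <- data.frame(
  ID      = 1:2,
  X_A     = c(0, 12),
  Group_A = c(1, 5),
  X_B     = c(0, 4),
  Group_B = c(1, 4)
)

long <- pivot_longer(
  toy,
  cols      = -ID,
  names_to  = c(".value", "label"),  # first name piece becomes a column, second a label
  names_sep = "_"
)

print(long)
```

The part of each name matched by ".value" (X, Group) becomes an output column of its own, while the remaining part (A, B) lands in the label column, so each ID ends up with one row per variable.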
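
With about 60 variables, typing every rename by hand gets tedious. One possible shortcut, sketched here on a two-variable toy frame, is to recover the variable names from the Group_ columns and prefix them with rename_with(). This assumes every variable has a matching Group_<name> column, as in the question; `df_small`, `vars` and `df_renamed` are illustrative names.

```r
library(dplyr)

# Toy frame following the question's naming scheme (only two variables shown)
df_small <- data.frame(
  ID = c(213, 321),
  A = c(0, 12), Group_A = c(1, 5), WOE_A = c(0.87, 0.08),
  B = c(0, 4),  Group_B = c(1, 4), WOE_B = c(0.65, -0.43),
  Status = c(1, 0)
)

# Recover the variable names from the Group_<name> columns
vars <- sub("^Group_", "", grep("^Group_", names(df_small), value = TRUE))

# Prefix only the raw measurement columns with "X_"
df_renamed <- df_small %>%
  rename_with(~ paste0("X_", .x), .cols = all_of(vars))

names(df_renamed)
```

If the real variable names are longer than a single letter, any pattern that assumes the letters A to D has to be widened accordingly.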

library(tidyr)

    df %>% 
    # ----------- prepare variable names to be well-formed, you may do this upstream
      rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
     
    # ----------- call pivot longer with .value sentinel and names_pattern
    # ----------- that is an advanced use of the capabilities
      pivot_longer(
          cols = -c("ID","Status")         # apply to all cols besides ID and Status
       , names_to = c(".value", "label")   # target column names are based on origin names
                                           # and an individual label (think id, name as u like)
       , names_pattern = "(.*)(.*_[A-D]{1})$")  # regex for the origin column patterns
                                                # pattern is built of 2 parts (...)(...)
                                                # (.*) no or any symbol possibly multiple times
                                                # (.*_[A-D]{1}) as above, but ending with underscore and 1 letter 
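
To see how the pattern carves up a column name, it can be tried outside of pivot_longer(). The sketch below uses base R's regexec() with perl = TRUE purely as an illustration; it is not a claim about which regex engine tidyr uses internally. The second pattern is a suggested alternative, not part of the code above, that keeps the underscore out of the label.

```r
cols <- c("X_A", "Group_A", "WOE_A", "X_D")

# Pattern used above: the label keeps its leading underscore
pattern <- "(.*)(.*_[A-D]{1})$"
split_tbl <- do.call(rbind, regmatches(cols, regexec(pattern, cols, perl = TRUE)))
colnames(split_tbl) <- c("column", ".value", "label")
print(split_tbl)

# Alternative pattern (an assumption of this sketch): underscore outside the groups
alt_pattern <- "(.*)_([A-D])$"
alt_tbl <- do.call(rbind, regmatches(cols, regexec(alt_pattern, cols, perl = TRUE)))
colnames(alt_tbl) <- c("column", ".value", "label")
print(alt_tbl)
```

With the first pattern "Group_A" splits into "Group" and "_A"; with the alternative it splits into "Group" and "A".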

This gives you:

# A tibble: 40 x 6
      ID Status label     X Group   WOE
   <dbl>  <dbl> <chr> <dbl> <dbl> <dbl>
 1   213      1 _A      0       1  0.87
 2   213      1 _B      0       1  0.65
 3   213      1 _C      0       1  0.8 
 4   213      1 _D    916.      4 -0.3 
 5   321      0 _A     12       5  0.08
 6   321      0 _B      4       4 -0.43
 7   321      0 _C      6       5 -0.2 
 8   321      0 _D     85.3     2  0.26
 9    32      1 _A      0       1  0.87
10    32      1 _B      0       1  0.65

Putting it all together:

df %>% 
# ------------ prepare and make long
   rename(X_A = A, X_B = B, X_C = C, X_D = D) %>% 
   pivot_longer(cols = -c("ID","Status")
               , names_to = c(".value", "label")
               , names_pattern = "(.*)(.*_[A-D]{1})$") %>% 

# ------------- calculate stats on groups
  group_by(label, X) %>% 
  summarise(Min = min(X),  Max = max(X),  WOE = unique(WOE)
           ,Count = n(),  CountStatus1 = sum(Status == 1)
           , CountStatus0 = sum(Status == 0)
)

Voilà:

# A tibble: 27 x 8
# Groups:   label [4]
   label     X   Min   Max   WOE Count CountStatus1 CountStatus0
   <chr> <dbl> <dbl> <dbl> <dbl> <int>        <int>        <int>
 1 _A        0     0     0  0.87     4            2            2
 2 _A        1     1     1 -0.04     1            0            1
 3 _A        2     2     2 -0.28     2            0            2
 4 _A        7     7     7 -0.69     1            0            1
 5 _A        8     8     8 -0.69     1            0            1
 6 _A       12    12    12  0.08     1            0            1
 7 _B        0     0     0  0.65     5            2            3
 8 _B        1     1     1 -0.49     1            0            1
 9 _B        2     2     2 -0.82     2            0            2
10 _B        3     3     3 -0.43     1            0            1
# ... with 17 more rows
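
The question asked for one printed table per variable ("Var A", then Group 1 to 5). One way to get there from the long format is to summarise by label and Group, split the result by label and loop over the pieces. This is a sketch, not the only way: the data is re-typed from the sample rows, grouping is by Group rather than by X as above, and first(WOE) assumes one WOE value per group. The names `stats`, `tables`, `Count_1` and `Count_0` are illustrative.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question
df <- read.table(header = TRUE, text = "
ID  A  Group_A WOE_A B Group_B WOE_B C Group_C WOE_C D     Group_D WOE_D Status
213 0  1  0.87 0 1  0.65 0 1  0.80 915.7 4 -0.30 1
321 12 5  0.08 4 4 -0.43 6 5 -0.20 85.3  2  0.26 0
32  0  1  0.87 0 1  0.65 0 1  0.80 28.6  2  0.26 1
13  7  4 -0.69 2 3 -0.82 4 4 -0.80 31.8  2  0.26 0
43  1  2 -0.04 1 2 -0.49 1 2 -0.22 51.7  2  0.26 0
656 2  3 -0.28 2 3 -0.82 2 3 -0.65 8.5   1  1.14 0
435 2  3 -0.28 0 1  0.65 0 1  0.80 39.8  2  0.26 0
65  8  4 -0.69 3 4 -0.43 5 4 -0.80 243.0 3  0.00 0
565 0  1  0.87 0 1  0.65 0 1  0.80 4.0   1  1.14 0
432 0  1  0.87 0 1  0.65 0 1  0.80 81.6  2  0.26 0
")

stats <- df %>%
  rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
  pivot_longer(cols = -c("ID", "Status"),
               names_to = c(".value", "label"),
               names_pattern = "(.*)(.*_[A-D]{1})$") %>%
  group_by(label, Group) %>%
  summarise(Min = min(X), Max = max(X),
            WOE = first(WOE),            # assumption: one WOE per group
            Count = n(),
            Count_1 = sum(Status == 1),
            Count_0 = sum(Status == 0),
            .groups = "drop")

# One table per variable, printed in a loop
tables <- split(stats, stats$label)
for (lab in names(tables)) {
  cat("Var", sub("^_", "", lab), "\n")
  print(select(tables[[lab]], -label))
}
```

Keeping the per-variable tables in a list also makes it easy to write each one to a file or pass it on to further processing.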
