How to bin a summarized frequency table with dplyr

Problem description

I have the following data frame:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- nycflights13::flights %>% 
  select(distance) %>% 
  group_by(distance) %>% 
  summarise(n = n()) %>% 
  arrange(distance) %>% ungroup() 

df
#> # A tibble: 214 x 2
#>    distance     n
#>       <dbl> <int>
#>  1       17     1
#>  2       80    49
#>  3       94   976
#>  4       96   607
#>  5      116   443
#>  6      143   439
#>  7      160   376
#>  8      169   545
#>  9      173   221
#> 10      184  5504
#> # … with 204 more rows

I want to bin the distance column into bins of width 100 and sum the n column within each bin. How can I do this?

So you would get something like:

bin_distance sum_n
1-100       1633  #(1 + 49 + 976 + 607)
101-200     21344 # (443 + ... + 5327)
#etc

Tags: r, dplyr, tidyverse

Solution


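The simplest approach is to create the groups with cut and seq, one group for every 100 units of distance, and then sum the n values within each group.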

library(dplyr)

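# Round the top break up to the next multiple of 100; otherwise distances
# above the last break fall outside every interval and are binned as NA.
df %>%
  group_by(group = cut(distance, breaks = seq(0, ceiling(max(distance) / 100) * 100, 100))) %>%
  summarise(n = sum(n))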


#   group         n
#   <fct>       <int>
# 1 (0,100]      1633
# 2 (100,200]   21344
# 3 (200,300]   28310
# 4 (300,400]    7748
# 5 (400,500]   21292
# 6 (500,600]   26815
# 7 (600,700]    7846
# 8 (700,800]   48904
# 9 (800,900]    7574
#10 (900,1e+03] 18205
# ... with 17 more rows
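
The interval labels above, such as (0,100], differ from the 1-100, 101-200 style asked for in the question. As a minimal sketch, cut accepts a labels argument that can produce that style. The example uses only the first ten rows printed in the question so that it runs without nycflights13; the names df_small, breaks and bin_labels are illustrative and not part of the original answer. The labels assume whole-number distances.

```r
library(dplyr)

# First ten rows of the frequency table printed in the question
# (a hypothetical stand-in for the full df).
df_small <- tibble(
  distance = c(17, 80, 94, 96, 116, 143, 160, 169, 173, 184),
  n        = c(1L, 49L, 976L, 607L, 443L, 439L, 376L, 545L, 221L, 5504L)
)

# Breaks at 0, 100, 200, ...; the top break is rounded up so that
# max(distance) is covered by the last interval.
breaks <- seq(0, ceiling(max(df_small$distance) / 100) * 100, by = 100)

# One label per interval, e.g. "1-100" for (0,100].
bin_labels <- paste0(head(breaks, -1) + 1, "-", tail(breaks, -1))

binned <- df_small %>%
  group_by(bin_distance = cut(distance, breaks = breaks, labels = bin_labels)) %>%
  summarise(sum_n = sum(n))

binned
```

Because only ten rows are used here, the total for the 101-200 bin is smaller than in the full result shown above.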

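This can be translated to base R using aggregate, like so:

# The formula method of aggregate drops NA groups, so the top break is
# rounded up to a multiple of 100 to keep the largest distances.
aggregate(n ~ distance, 
 transform(df, distance = cut(distance, breaks = seq(0, ceiling(max(distance) / 100) * 100, 100))), sum)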
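
Because the bins all have the same width, the bin can also be computed arithmetically instead of with cut. This is a different technique from the one in the answer, shown only as a sketch. It again uses just the first ten rows printed in the question, and the names df_small, width and bin_upper are illustrative.

```r
# Same ten rows as printed in the question (base R only).
df_small <- data.frame(
  distance = c(17, 80, 94, 96, 116, 143, 160, 169, 173, 184),
  n        = c(1L, 49L, 976L, 607L, 443L, 439L, 376L, 545L, 221L, 5504L)
)

width <- 100

# Upper edge of the right-closed bin each distance falls in:
# 1..100 -> 100, 101..200 -> 200, and so on.
df_small$bin_upper <- ceiling(df_small$distance / width) * width

# Sum the counts per bin.
result <- aggregate(n ~ bin_upper, data = df_small, FUN = sum)
result
```

Each row of result is identified by the upper edge of its bin, so 100 stands for the range 1-100. Unlike cut, this approach yields a numeric column rather than a factor, and bins with no observations do not appear at all.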
