Adding an overall mean when using group_by

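Question

I am using the dplyr package to generate some tables, together with the adorn_totals("row") function (from the janitor package).

This works well when I want to sum the values within groups, but in some cases I want an overall mean rather than a sum. Is there an adorn_means function?

Example code: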

Regions2 <- Data %>%
  filter(!is.na(REGION)) %>%
  group_by(REGION) %>%
  summarise(NumberofPeople = length(Names)) %>%
  adorn_totals("row")

Here, my "Total" row is simply the sum of the people across all regions. This gives me:

REGION          NumberofPeople
East Midlands       578,943
East of England     682,917
London            1,247,540
North East          245,830
North West          742,886
South East          963,040
South West          623,684
West Midlands       653,335
Yorkshire           553,853
TOTAL             6,292,028

My next piece of code produces the average salary for each region, but I want the total row to show the overall average:

Regions3 <- Data %>%
  filter(!is.na(REGION))%>%
  filter(!is.na(AVGSalary))%>%
  group_by(REGION) %>%
  summarise(AverageSalary=mean(AVGSalary))

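If I use adorn_totals("row") as before, I just get the sum of the averages rather than the overall average of the dataset.

How do I get the overall average?

Update with some toy data:

Data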

people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150

So the group means are:

London      1000
South West  983.33
South East  900
Scotland    1020
East        600

and I would like to add an overall row at the bottom:

Total    941.11

Tags: r, dplyr

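Answer

1) The overall mean is a weighted mean of the group means, with each group weighted by its number of rows. It is not the plain mean of the means: here it is 941, not 901. We therefore keep an n column so that the overall mean can be computed correctly at the end. The data shown contains no NAs, but we use drop_na so that the same code also works on data that does. It removes every row that contains an NA.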

library(dplyr)
library(tidyr)

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary), n = n()) %>%
  ungroup %>%
  bind_rows(summarize(., region = "Overall Avg", 
                         avg = sum(avg * n) / sum(n), 
                         n = sum(n))) %>%
  select(-n)

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.
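
To see why the weights matter, here is a minimal base R sketch that compares the plain mean of the group means with the size-weighted mean. It assumes the toy salaries from the question, typed in by hand; the variable names are illustrative only.

```r
# Toy data from the question (entered by hand for this sketch)
salary <- c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150)
region <- c("London", "South West", "South East", "London", "Scotland",
            "South West", "East", "London", "South West")

group_avg <- tapply(salary, region, mean)    # per-region means
group_n   <- tapply(salary, region, length)  # per-region row counts

# Unweighted mean of the group means: every region counts equally
mean_of_means <- mean(group_avg)

# Mean of the group means weighted by group size
weighted_mean <- weighted.mean(group_avg, group_n)

# The weighted version agrees with the mean of the raw salaries
same_as_raw <- isTRUE(all.equal(weighted_mean, mean(salary)))
```

London and South West each contribute three rows, so they pull the weighted figure above the unweighted one.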

2) Another approach is to build the overall-mean row by going back to the original data:

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.

2a) If you object to referring to Region twice, try this:

Region_ <- Region %>% 
  drop_na

Region_ %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))

2b) Or as a single pipeline. Region_ is now local to the pipeline and is removed automatically once the pipeline completes:

Region %>%
  drop_na %>%
  { Region_ <- .
    Region_ %>%
      group_by(region) %>%
      summarize(avg = mean(salary)) %>%
      ungroup %>%
      bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
  }
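
If this pattern is needed in several places, approach (2) could be wrapped in a small function. The sketch below is only one possible way to do that: add_overall_mean and its arguments are hypothetical names, not part of dplyr or janitor, and it assumes dplyr 1.0 or later (for across and .groups). Column names are passed as strings, and the toy data frame is typed in by hand so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: per-group means plus one overall row that is
# computed from the raw rows, so no weights are needed.
add_overall_mean <- function(data, group_col, value_col, label = "Overall Avg") {
  clean <- drop_na(data)  # as in the answer: drop rows containing any NA

  by_group <- clean %>%
    group_by(across(all_of(group_col))) %>%
    summarize(avg = mean(.data[[value_col]]), .groups = "drop")

  overall <- clean %>%
    summarize(!!group_col := label, avg = mean(.data[[value_col]]))

  bind_rows(by_group, overall)
}

# Toy data from the question (entered by hand for this sketch)
toy <- data.frame(
  region = c("London", "South West", "South East", "London", "Scotland",
             "South West", "East", "London", "South West"),
  salary = c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150),
  stringsAsFactors = FALSE
)

result <- add_overall_mean(toy, "region", "salary")
```

The last row of result holds the label and the mean of all salaries; the rows above it are the per-region means.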

Note

We used this as input:

Lines <- "people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150"

library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$", 
  as.is = TRUE, skip = 1, strip.white = TRUE,
  col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))
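
If gsubfn is not available, the same toy data can be entered directly. This is a sketch of an assumed equivalent, not the input the answer was actually run with; in particular, salary is stored here as a double and may differ in type from what read.pattern produces.

```r
# Assumed stand-in for the Region data frame, built with base R only
Region <- data.frame(
  people = paste0("person", 1:9),
  region = c("London", "South West", "South East", "London", "Scotland",
             "South West", "East", "London", "South West"),
  salary = c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150),
  stringsAsFactors = FALSE  # keep region as character, not factor
)

# Quick check of the structure: 9 rows, 3 columns
str(Region)
```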
