Why does my dplyr statement create extra rows?

Problem description

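I expect "temp" to have 40 rows: one for each age from 1 to 20 for males, and one for each age from 1 to 20 for females. Instead, the code creates those 40 rows, then duplicates them and appends the copies, so "temp" ends up with 80 rows.

Why does it do this, and how can I stop it? I know I could delete rows 41-80 myself, but that is painful when working with large datasets.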

library(dplyr)
library(tidyr)

# draw 100 random genders and 100 random ages (1-20)
gender <- sample(c("male", "female"), 100, replace = TRUE)
age <- sample(1:20, 100, replace = TRUE)

df <- data.frame(gender, age)

temp <- df %>% group_by(gender, age) %>%
  summarise(count = n()) %>%
  complete(gender = c("male", "female"), age = 1:20, fill = list(count = 0))

Tags: r, dplyr, tidyr

Solution


From the dplyr vignette:

"When you group by multiple variables, each summary peels off one level of the grouping."

Here is the data frame that your code pipes into complete():

> df %>% group_by(gender, age) %>% summarise(count = n()) 
# A tibble: 24 x 3
# Groups:   gender [?]
   gender   age count
   <fct>  <int> <int>
 1 female     2     4
 2 female     3     2
 3 female     7     6
 4 female     9     5
 5 female    10     4
 6 female    11     2
 7 female    12     3
 8 female    13     4
 9 female    15     1
10 female    18     1
# ... with 14 more rows

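We can see that after one round of summarise(), the data frame is no longer grouped by age, but it is still grouped by gender. This means that in the next step, complete() works within each group and tries to fill in every combination of gender (male/female) and age (1-20), which produces 40 rows per gender group. With 2 genders, we get 40 x 2 = 80 rows in total.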
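
One way to check this yourself is to ask dplyr which grouping variables are left after summarise(). The following is a minimal sketch, assuming a seeded copy of the question's data; the seed and the name summarised are illustrative and not part of the original post. It uses group_vars() to list the remaining groups.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  gender = sample(c("male", "female"), 100, replace = TRUE),
  age    = sample(1:20, 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# summarise() peels off the last grouping level (age) and keeps gender
summarised <- df %>%
  group_by(gender, age) %>%
  summarise(count = n())

group_vars(summarised)           # "gender"
group_vars(ungroup(summarised))  # character(0): no grouping left
```

If group_vars() still reports a variable that you also pass to complete(), that is a sign the completion will happen within groups rather than across the whole table.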

The following approaches are equivalent in that each gives the expected result:

# explicitly remove all grouping
t1 <- df %>% 
  group_by(gender, age) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  complete(gender = c("male", "female"), 
           age = 1:20, 
           fill = list(count = 0))

# retain gender grouping, & only complete for different ages within each gender group
t2 <- df %>% 
  group_by(gender, age) %>%
  summarise(count = n()) %>%
  complete(age = 1:20, 
           fill = list(count = 0))

# use count(), which wraps group_by(), summarise(n = n()) & ungroup() in one line
# note: count() names its output column n by default, so it is renamed in the next step
# (recent dplyr versions also accept count(..., name = "count"))
t3 <- df %>%
  count(gender, age) %>%
  rename(count = n) %>%
  complete(gender = c("male", "female"), 
           age = 1:20, 
           fill = list(count = 0))

> all.equal(t1, t2)
[1] TRUE
> all.equal(t1, t3)
[1] TRUE
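
If your dplyr is version 1.0.0 or later, two shorter variants may also work: dropping the grouping inside summarise() with .groups = "drop", and naming the count column directly through the name argument of count(). This is a sketch under those version assumptions; the names t4 and t5 and the seed are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  gender = sample(c("male", "female"), 100, replace = TRUE),
  age    = sample(1:20, 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# drop all grouping as part of summarise(), so complete() sees an ungrouped table
t4 <- df %>%
  group_by(gender, age) %>%
  summarise(count = n(), .groups = "drop") %>%
  complete(gender = c("male", "female"),
           age = 1:20,
           fill = list(count = 0))

# count() returns an ungrouped result here; name = sets the output column name
t5 <- df %>%
  count(gender, age, name = "count") %>%
  complete(gender = c("male", "female"),
           age = 1:20,
           fill = list(count = 0))

nrow(t4)        # 40: one row per gender/age combination
sum(t4$count)   # 100: every original observation is counted once
```

Checking nrow() and the sum of the counts is a quick way to confirm that no combination was duplicated and no observation was lost.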
