How to use dplyr group_by and get an index for each distinct group?

Problem description

How can we use dplyr's group_by and then assign an index to each unique group, returning the original data.frame with that group index attached?

Example

df <- data.frame(
  user=c("Peter", "Peter", "Peter", "Paul", "Paul", "Mary", "Mary", "Mary"),
  purchase=c("Snickers", "Snickers", "Coke", "Pepsi", "Pepsi", "Snickers", "Pepsi", "Coke"),
  stringsAsFactors = FALSE
)

This works, but only because I hard-coded the answer by hand, i.e. c(1,2,1,1,2,3):

library(dplyr)
df %>% 
  group_by(user, purchase) %>% 
  distinct() %>% 
  cbind(., c(1,2,1,1,2,3)) %>% 
  left_join(df, ., by=(c("user", "purchase")))

   user purchase ...3
1 Peter Snickers    1
2 Peter Snickers    1
3 Peter     Coke    2
4  Paul    Pepsi    1
5  Paul    Pepsi    1
6  Mary Snickers    1
7  Mary    Pepsi    2
8  Mary     Coke    3

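How can we group_by and assign an index to each distinct group before ungrouping, so that the index is returned as an additional column of the original data.frame?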

Tags: r, dplyr

Solution


You can do:

df %>%
 group_by(user) %>%
 mutate(indices = cumsum(!duplicated(purchase)))

  user  purchase indices
  <chr> <chr>      <int>
1 Peter Snickers       1
2 Peter Snickers       1
3 Peter Coke           2
4 Paul  Pepsi          1
5 Paul  Pepsi          1
6 Mary  Snickers       1
7 Mary  Pepsi          2
8 Mary  Coke           3
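
A note on how this works, with a possible alternative. Within each user, !duplicated(purchase) is TRUE the first time a purchase value appears, and cumsum counts those first appearances, so the counter only goes up when a new value shows up. That matches the expected output here because repeated purchases sit next to each other. If a user's repeated purchases were not adjacent, a later repeat would pick up the most recent counter value instead of its own. The sketch below is one way to avoid that, assuming the index should follow the order of first appearance within each user: it uses match(purchase, unique(purchase)). The names df2, by_cumsum, by_match and the user "Ann" are made up for illustration and do not come from the question.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  user = c("Peter", "Peter", "Peter", "Paul", "Paul", "Mary", "Mary", "Mary"),
  purchase = c("Snickers", "Snickers", "Coke", "Pepsi", "Pepsi", "Snickers", "Pepsi", "Coke"),
  stringsAsFactors = FALSE
)

# Index each distinct purchase within a user by order of first appearance,
# then drop the grouping so a plain (ungrouped) tibble is returned.
result <- df %>%
  group_by(user) %>%
  mutate(indices = match(purchase, unique(purchase))) %>%
  ungroup()

# Hypothetical data where a repeated purchase is NOT adjacent to its first occurrence
df2 <- data.frame(
  user = c("Ann", "Ann", "Ann"),
  purchase = c("Snickers", "Coke", "Snickers"),
  stringsAsFactors = FALSE
)

# cumsum(!duplicated()) gives the second "Snickers" the index 2 ...
by_cumsum <- df2 %>%
  group_by(user) %>%
  mutate(indices = cumsum(!duplicated(purchase))) %>%
  ungroup()

# ... while match()/unique() gives it the index 1, the same as the first "Snickers"
by_match <- df2 %>%
  group_by(user) %>%
  mutate(indices = match(purchase, unique(purchase))) %>%
  ungroup()
```

On the question's data both approaches yield the same indices column. They differ only when the same purchase recurs after a different one for the same user, so which to use depends on whether that can happen in your data.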
