Select a random row in case of ties in a grouped df

Problem description

I have a data frame like the one below:

df <- data.frame(group_var = c("a", "a", "b", "b"),
           summ_var = c("x", "y", "z", "w"),
           val = c(100, 100, 150, 200))

df
  group_var summ_var val
1         a        x 100
2         a        y 100
3         b        z 150
4         b        w 200

For each group_var, I want to select exactly one summ_var with the minimum val. I have tried the following code:

library(dplyr)

df %>%
    group_by(group_var) %>%
    filter(val == min(val)) %>%
    ungroup()

  group_var summ_var   val
  <fct>     <fct>    <dbl>
1 a         x          100
2 a         y          100
3 b         z          150

This gives me multiple rows for group_var = a, because val == min(val) is TRUE for more than one value of summ_var. How do I randomly select one of the tied values of summ_var for group_var = a?

My desired output looks like the one below, where a random value of summ_var is picked in each group when there is a tie:

  group_var summ_var   val
  <fct>     <fct>    <dbl>
1 a         x          100
2 b         z          150

This is just a reproducible example; in reality, I may have more than 2 conflicting values, so I am looking for a generalised approach. Any help is appreciated.

Tags: r, dataframe, tidyverse, data-manipulation

Solution


With dplyr, you can do:

df %>%
 group_by(group_var) %>%
 slice(which.min(rank(val, ties.method = "random")))

  group_var summ_var   val
  <fct>     <fct>    <dbl>
1 a         x          100
2 b         z          150
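
To see why this works: rank() with ties.method = "random" gives tied values distinct ranks in a random order, so which.min() lands on one of the tied minima at random. A minimal base R sketch, where the vector val and the seed are made up for illustration:

```r
set.seed(1)  # illustrative seed, only to make the run repeatable

val <- c(100, 100, 150)  # hypothetical values with a tie at the minimum

# Tied values receive distinct ranks (1 and 2) in random order;
# the untied value 150 always receives rank 3.
r <- rank(val, ties.method = "random")
print(r)

# Position of rank 1, i.e. one of the tied minima chosen at random
picked <- which.min(r)
print(picked)
```

Because the ranks are always a permutation of 1..n, which.min() returns exactly one index per group, however many values are tied.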

Or:

df %>%
 group_by(group_var) %>%
 filter(val == min(val)) %>%
 sample_frac(1) %>%
 slice(1)
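
As a possible variant of the second approach, and assuming dplyr 1.0.0 or later is installed, the shuffle-then-take-first step might be written with slice_sample(), which draws n rows per group. This is a sketch rather than part of the original answer; the data frame is the one from the question:

```r
library(dplyr)

df <- data.frame(group_var = c("a", "a", "b", "b"),
                 summ_var = c("x", "y", "z", "w"),
                 val = c(100, 100, 150, 200))

result <- df %>%
  group_by(group_var) %>%
  filter(val == min(val)) %>%  # keep every row tied for the group minimum
  slice_sample(n = 1) %>%      # draw one of the tied rows at random
  ungroup()

print(result)
```

Calling set.seed() beforehand makes the random pick reproducible across runs.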
