mutate: select all rows except the current row in a grouped df (dplyr)

Problem description

I want to flag overlaps: players on the same team whose recovery time intervals overlap.

Here is my code:

library(tidyverse)
library(lubridate)


# data
df<- data.frame(times=c("01/01/2020","01/02/2020","01/07/2020","01/10/2020","01/01/2020","01/03/2020","01/05/2020","01/07/2020"))
starttime= mdy(df$times)


df <- tibble(team=c("A","A","A","A","B","B","B","B"),
             player=c(1,2,3,4,1,2,3,4),
             starttime= starttime,
             stoptime= starttime+days(1))


# find overlaps
df %>% 
  mutate(interval=lubridate::interval(starttime,stoptime)) %>% #calculate interval
  group_by(team) %>% 
  mutate(overlap_flag= case_when(
    sum(starttime %within% as.list(interval)) == 0 ~ 0, # I want to choose as.list(interval[except the current row])
    sum(starttime %within% as.list(interval)) > 0 ~ 1, # I want to choose as.list(interval[except the current row])
    TRUE ~ NA_real_))

This gives me:

# A tibble: 8 x 6
# Groups:   team [2]
  team  player starttime  stoptime   interval                       overlap_flag
  <chr>  <dbl> <date>     <date>     <Interval>                            <dbl>
1 A          1 2020-01-01 2020-01-02 2020-01-01 UTC--2020-01-02 UTC            1
2 A          2 2020-01-02 2020-01-03 2020-01-02 UTC--2020-01-03 UTC            1
3 A          3 2020-01-07 2020-01-08 2020-01-07 UTC--2020-01-08 UTC            1
4 A          4 2020-01-10 2020-01-11 2020-01-10 UTC--2020-01-11 UTC            1
5 B          1 2020-01-01 2020-01-02 2020-01-01 UTC--2020-01-02 UTC            1
6 B          2 2020-01-03 2020-01-04 2020-01-03 UTC--2020-01-04 UTC            1
7 B          3 2020-01-05 2020-01-06 2020-01-05 UTC--2020-01-06 UTC            1
8 B          4 2020-01-07 2020-01-08 2020-01-07 UTC--2020-01-08 UTC            1

What I want is:

# A tibble: 8 x 6
# Groups:   team [2]
  team  player starttime  stoptime   interval                       overlap_flag
  <chr>  <dbl> <date>     <date>     <Interval>                            <dbl>
1 A          1 2020-01-01 2020-01-02 2020-01-01 UTC--2020-01-02 UTC            1
2 A          2 2020-01-02 2020-01-03 2020-01-02 UTC--2020-01-03 UTC            1
3 A          3 2020-01-07 2020-01-08 2020-01-07 UTC--2020-01-08 UTC            0
4 A          4 2020-01-10 2020-01-11 2020-01-10 UTC--2020-01-11 UTC            0
5 B          1 2020-01-01 2020-01-02 2020-01-01 UTC--2020-01-02 UTC            0
6 B          2 2020-01-03 2020-01-04 2020-01-03 UTC--2020-01-04 UTC            0
7 B          3 2020-01-05 2020-01-06 2020-01-05 UTC--2020-01-06 UTC            0
8 B          4 2020-01-07 2020-01-08 2020-01-07 UTC--2020-01-08 UTC            0

I know there is probably a data.table solution, but I would like to know whether this can be done easily with dplyr.

Tags: r, dplyr

Solution


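We can loop over the rows with row_number() and use it as an index to drop the current row's "starttime" value before testing the remaining start times against that row's interval.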

library(dplyr)
library(lubridate)
library(purrr)
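df %>% 
  # store each row's interval as a list element so map2 can pass one at a time
  mutate(interval = as.list(interval(starttime, stoptime))) %>% 
  group_by(team) %>%
  # .x is the row's position within its team, .y is that row's interval;
  # starttime[-.x] drops the current row, and +() turns TRUE/FALSE into 1/0
  mutate(overlap_flag = +(map2_lgl(row_number(), 
        interval, ~ sum(starttime[-.x] %within% .y) > 0)))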
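
The check above is one-directional: it asks whether another player's starttime falls inside the current row's interval. A row whose own start lies inside a teammate's interval (player 2 of team A in the example) may therefore not be flagged, although the desired output marks it with 1. The following is a minimal sketch of a symmetric variant. It rebuilds the example data and compares plain Date columns instead of lubridate intervals, and it treats intervals as closed on both ends; both choices are assumptions made for illustration, not part of the original answer, and the name `flagged` is hypothetical.

```r
library(dplyr)
library(purrr)

# Example data rebuilt with base Dates (assumption: same values as in the question)
df <- tibble(
  team      = rep(c("A", "B"), each = 4),
  player    = rep(1:4, times = 2),
  starttime = as.Date(c("2020-01-01", "2020-01-02", "2020-01-07", "2020-01-10",
                        "2020-01-01", "2020-01-03", "2020-01-05", "2020-01-07")),
  stoptime  = starttime + 1
)

flagged <- df %>%
  group_by(team) %>%
  mutate(overlap_flag = map_dbl(row_number(), function(i) {
    # Two closed intervals overlap when each one starts no later than
    # the other one ends; [-i] leaves the current row out of the comparison.
    as.numeric(any(starttime[-i] <= stoptime[i] & stoptime[-i] >= starttime[i]))
  })) %>%
  ungroup()

flagged
```

Because the comparison is symmetric, both rows of an overlapping pair receive the flag. If touching endpoints should not count as an overlap, replacing `<=` and `>=` with `<` and `>` would be the corresponding change.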
