Merging overlapping date ranges with a hierarchy in R

Problem description

I want to merge overlapping date ranges according to a priority hierarchy. Here is an example. The code below

df = data.frame(Priority = c("Priority_2","Priority_1", "Priority_2"),
                Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
                End = as.Date(c("2018-01-04","2018-01-05","2018-01-09")))

gives you this table:

Priority      Start        End
Priority_2    2018-01-01   2018-01-04
Priority_1    2018-01-03   2018-01-05
Priority_2    2018-01-08   2018-01-09

I would like to merge the overlapping date ranges, but according to a priority order. The output table should look like this:

Priority      Start        End
Priority_2    2018-01-01   2018-01-02
Priority_1    2018-01-03   2018-01-05
NA            2018-01-06   2018-01-07
Priority_2    2018-01-08   2018-01-09

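So, for example, if Priority_1 overlaps with any other priority, the overlapping dates are assigned to Priority_1. Otherwise, if Priority_2 overlaps with any other priority, the dates are assigned to Priority_2. If a date range is not covered by any row, we get NA. I would also be fine with output that looks like this: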

Priority      Start        End
Priority_2    2018-01-01   2018-01-03
Priority_1    2018-01-03   2018-01-05
NA            2018-01-05   2018-01-08
Priority_2    2018-01-08   2018-01-09

Does anyone know how to do this in R?

Tags: r

Solution


This was a really fun task to work through (if you like puzzles). I'm interested to see what others come up with.

Let's do this using only the tidyverse, specifically dplyr, tidyr, and purrr.

library(dplyr) # For dataframe functions
library(tidyr) # For nesting and fill functions
library(purrr) # For map functions

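The first thing I'll do, in keeping with the tidy framework, is convert the data to a more machine-readable format: a long format in which every date is represented by its own row. No more Start and End.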

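To get all of the dates between Start and End, let's use map2 to nest a sequence in a new column. I'm nesting this data because sometimes there is only 1 date, sometimes 3, and sometimes some other number. Nesting lets me hold all of the dates in a single variable of the data frame.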

# Identify all dates in the range and nest them in a new column
df2 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start,End,seq,by = 'day')) %>%
  select(-Start,-End)
    Priority                       date
1 Priority_2 17532, 17533, 17534, 17535
2 Priority_1        17534, 17535, 17536
3 Priority_2               17539, 17540

Next, I'll unnest the date column so that each date gets its own row.

# Unnest the dates column so each item is a row
df2 <- df2 %>%
  unnest(date)
    Priority       date
1 Priority_2 2018-01-01
2 Priority_2 2018-01-02
3 Priority_2 2018-01-03
4 Priority_2 2018-01-04
5 Priority_1 2018-01-03
6 Priority_1 2018-01-04
7 Priority_1 2018-01-05
8 Priority_2 2018-01-08
9 Priority_2 2018-01-09

Next, let's group by date and summarise to select the highest priority.

# Now we can group_by each date! This means we can summarize to only
# select the highest priority
df2 <- df2 %>%
  group_by(date) %>%
  # min() returns the alphabetically smallest string, so Priority_1 wins over Priority_2
  summarise(Priority = min(Priority))
# A tibble: 7 x 2
  date       Priority  
  <date>     <chr>     
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-08 Priority_2
7 2018-01-09 Priority_2
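
One caveat with min() on character values: the comparison is alphabetical, so in typical locales a label such as Priority_10 would sort before Priority_2. A minimal sketch of one way around this, assuming the priority labels are known in advance, ranks each label by its position in an explicit ordering and keeps the best-ranked one per date. The priority_levels vector, the long data frame, and the Priority_10 label are illustrative assumptions, not part of the original question.

```r
library(dplyr)

# Assumed ordering, highest priority first (hypothetical labels)
priority_levels <- c("Priority_1", "Priority_2", "Priority_10")

# Small long-format example: one row per Priority and date
long <- data.frame(
  Priority = c("Priority_2", "Priority_10", "Priority_10", "Priority_1"),
  date = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"))
)

best <- long %>%
  # Position in priority_levels: 1 is the highest priority
  mutate(rank = match(Priority, priority_levels)) %>%
  group_by(date) %>%
  # Keep the label with the smallest rank on each date
  summarise(Priority = priority_levels[min(rank)])
```

With this ordering, 2018-01-01 resolves to Priority_2 rather than Priority_10, which is what a numeric reading of the labels would suggest.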

Shoot! Missing values. A simple full_join against all possible dates in the range takes care of that.

# Now for each date in the dataset we have only the highest priority
# but what about the missing values?
df2 <- df2 %>%
  # Join in a list of all days in the date range!
  # (naming the key with `by` avoids the "Joining with" message)
  full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
  arrange(date)
# A tibble: 9 x 2
  date       Priority  
  <date>     <chr>     
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-06 NA        
7 2018-01-07 NA        
8 2018-01-08 Priority_2
9 2018-01-09 Priority_2
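
As an alternative to the full_join, tidyr's complete() together with full_seq() can fill in the missing days directly. This is a sketch on a small made-up tibble (daily is an illustrative name), assuming the first and last dates already present mark the ends of the range, as they do in this example.

```r
library(dplyr)
library(tidyr)

# Two known days with a two-day gap between them
daily <- tibble(
  date = as.Date(c("2018-01-05", "2018-01-08")),
  Priority = c("Priority_1", "Priority_2")
)

filled <- daily %>%
  # full_seq() builds every day from the first to the last date (period = 1 day);
  # complete() adds the missing rows and leaves Priority as NA there
  complete(date = full_seq(date, period = 1))
```

The result has one row per day from 2018-01-05 to 2018-01-08, with NA in Priority for the two days that had no row.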

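Now we need to work out how to group together the runs where the same priority occurs several times in a row. If we convert the NA values to the string "NA", we can use the lag function to check whether a value is the same as the one before it. lag always returns NA for the first value (there is nothing before it to look at), so we need a small fix for that. Then we can use fill to fill in the blanks.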

# This is the data desired, but now it needs to be put back in the 
# human readable format it started in. Use lag to identify when Priority
# changes, then use fill to establish a group of the same Priority in a row.
df2 <- df2 %>%
  # Remove NA from Priority because it causes problems with !=
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)),1,Group))%>%
  # Now that the breaks are identified, fill in the rest of the group 
  # with the most recent value
  fill(Group)
# A tibble: 9 x 3
  date       Priority   Group
  <date>     <chr>      <dbl>
1 2018-01-01 Priority_2  1.00
2 2018-01-02 Priority_2  1.00
3 2018-01-03 Priority_1  3.00
4 2018-01-04 Priority_1  3.00
5 2018-01-05 Priority_1  3.00
6 2018-01-06 NA          6.00
7 2018-01-07 NA          6.00
8 2018-01-08 Priority_2  8.00
9 2018-01-09 Priority_2  8.00
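
The same run grouping can also be sketched with base R's rle(), which avoids the lag/fill combination. This is an alternative technique, not the approach used in this answer. rle() does not treat consecutive NA values as one run, so the sketch swaps them for a placeholder first; the "<missing>" placeholder and the daily data frame are illustrative assumptions.

```r
# Daily priorities matching the table above, with real NA values
daily <- data.frame(
  date = as.Date("2018-01-01") + 0:8,
  Priority = c("Priority_2", "Priority_2", "Priority_1", "Priority_1",
               "Priority_1", NA, NA, "Priority_2", "Priority_2")
)

# Replace NA with a placeholder so a gap counts as a single run
key <- ifelse(is.na(daily$Priority), "<missing>", daily$Priority)

# rle() gives the length of each run; expand that into one id per row
runs <- rle(key)
daily$Group <- rep(seq_along(runs$lengths), times = runs$lengths)
```

Here the Group ids are consecutive (1, 2, 3, 4) instead of row positions, but they serve the same purpose in a later group_by.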

From here it is quick to group by the Priority and Group columns, summarise to get the Start and End dates, and clean up the variables.

# Return the data to human readable form using group_by and summarize
df2 <- df2 %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  # Return "NA" values to NA
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority,Start,End)
# A tibble: 4 x 3
  Priority   Start      End       
  <chr>      <date>     <date>    
1 Priority_2 2018-01-01 2018-01-02
2 Priority_1 2018-01-03 2018-01-05
3 NA         2018-01-06 2018-01-07
4 Priority_2 2018-01-08 2018-01-09

Or... the whole thing in one go:

# Performing the whole thing in one go...
df3 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start,End,seq,by = 'day')) %>%
  select(-Start,-End) %>%
  unnest(date) %>%
  group_by(date) %>%
  summarise(Priority = min(Priority))  %>%
  full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
  arrange(date) %>%
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)),1,Group)) %>%
  fill(Group) %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority,Start,End)
