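r - Merging overlapping date ranges with a hierarchy in R
Problem description
I would like to resolve overlapping date ranges according to a priority hierarchy. Here is an example. The following code
# stringsAsFactors = FALSE keeps Priority as character, so min() works on it later
df <- data.frame(Priority = c("Priority_2", "Priority_1", "Priority_2"),
                 Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
                 End = as.Date(c("2018-01-04", "2018-01-05", "2018-01-09")),
                 stringsAsFactors = FALSE)
gives you this table: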
Priority    Start       End
Priority_2  2018-01-01  2018-01-04
Priority_1  2018-01-03  2018-01-05
Priority_2  2018-01-08  2018-01-09
I want to be able to merge the overlapping date ranges, but according to a priority order. The output table should look like this:
Priority    Start       End
Priority_2  2018-01-01  2018-01-02
Priority_1  2018-01-03  2018-01-05
NA          2018-01-06  2018-01-07
Priority_2  2018-01-08  2018-01-09
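So, for example, if Priority_1 overlaps with any other priority, the date range is assigned to Priority_1. Otherwise, if Priority_2 overlaps with any other priority, the date range is assigned to Priority_2. If a date range is not covered by any priority, we get NA. I would also be fine with output that looks like this: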
Priority    Start       End
Priority_2  2018-01-01  2018-01-03
Priority_1  2018-01-03  2018-01-05
NA          2018-01-05  2018-01-08
Priority_2  2018-01-08  2018-01-09
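Does anyone know how to do this in R?
Solution
This was a really fun task to work through (if you like puzzles). I'd be interested to see what others come up with.
Let's do it using only the tidyverse, specifically dplyr, tidyr, and purrr.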
library(dplyr) # For dataframe functions
library(tidyr) # For nesting and fill functions
library(purrr) # For map functions
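The first thing I'll do, in the tidy framework, is convert the data to a more machine-readable format: a long format where each date is represented by its own row. No more Start and End.
To get all the dates between Start and End, let's use map2 to nest the sequences in a new column. I'm nesting this data because sometimes there is only 1 date, sometimes 3, sometimes ???. By nesting, I can hold all the dates in a single variable of the data frame.
# Identify all the dates in each range and nest them in a new list column
df2 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start, End, seq, by = 'day')) %>%
  select(-Start, -End)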
    Priority                       date
1 Priority_2 17532, 17533, 17534, 17535
2 Priority_1        17534, 17535, 17536
3 Priority_2               17539, 17540
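To make the nesting step a bit more concrete, here is a minimal sketch of what map2 does for each row. The vectors `starts` and `ends` are hypothetical names that simply mirror the example data.

```r
library(purrr)

# Hypothetical vectors mirroring the Start and End columns of the example
starts <- as.Date(c("2018-01-01", "2018-01-03", "2018-01-08"))
ends   <- as.Date(c("2018-01-04", "2018-01-05", "2018-01-09"))

# map2() calls seq(starts[i], ends[i], by = "day") once per pair
dates <- map2(starts, ends, seq, by = "day")

# The elements have different lengths, which is why a list column is needed
lengths(dates)
```

Because the ranges cover a different number of days, the result cannot be stored as a regular atomic column, only as a list.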
Next, I'll unnest the date column so that each date gets its own row.
# Unnest the dates column so each item is a row
df2 <- df2 %>%
  unnest(date)
    Priority       date
1 Priority_2 2018-01-01
2 Priority_2 2018-01-02
3 Priority_2 2018-01-03
4 Priority_2 2018-01-04
5 Priority_1 2018-01-03
6 Priority_1 2018-01-04
7 Priority_1 2018-01-05
8 Priority_2 2018-01-08
9 Priority_2 2018-01-09
Next, let's group by date and summarise to pick the highest priority.
# Now we can group_by each date! This means we can summarise to
# select only the highest priority
df2 <- df2 %>%
  group_by(date) %>%
  # min() finds the lowest string. "Priority_1" sorts before "Priority_2"
  summarise(Priority = min(Priority))
# A tibble: 7 x 2
  date       Priority
  <date>     <chr>
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-08 Priority_2
7 2018-01-09 Priority_2
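Using min(Priority) depends on the labels sorting alphabetically in priority order. That holds for Priority_1 and Priority_2, but a label such as Priority_10 would typically sort before Priority_2. As a hedged alternative, here is a sketch that ranks labels against an explicit hierarchy. The `hierarchy` vector and the `long` data are hypothetical, not part of the question.

```r
library(dplyr)

# Hypothetical hierarchy, highest priority first (an assumption for illustration)
hierarchy <- c("Priority_1", "Priority_2", "Priority_10")

# Hypothetical long-format data: one row per date per priority
long <- tibble(
  date = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02")),
  Priority = c("Priority_2", "Priority_10", "Priority_10")
)

# match() gives each label its position in the hierarchy; the smallest position wins
winners <- long %>%
  group_by(date) %>%
  summarise(Priority = hierarchy[min(match(Priority, hierarchy))])

winners
```

The same idea would work with an ordered factor; match() is used here only because it keeps the column as plain character.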
Shoot! Missing values. A simple full_join against every possible date in the overall date range fixes this.
# Now for each date in the dataset we have only the highest priority,
# but what about the missing values?
df2 <- df2 %>%
  # Join in a list of all days in the date range!
  full_join(tibble(date = seq(min(df$Start), max(df$End), by = 'day')), by = "date") %>%
  arrange(date)
# A tibble: 9 x 2
  date       Priority
  <date>     <chr>
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-06 NA
7 2018-01-07 NA
8 2018-01-08 Priority_2
9 2018-01-09 Priority_2
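As a possible alternative to the full_join, tidyr's complete() can fill in the missing days directly. This is only a sketch on a small hypothetical tibble named `daily`, not the code the answer uses.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-day result with a two-day gap between the rows
daily <- tibble(
  date = as.Date(c("2018-01-05", "2018-01-08")),
  Priority = c("Priority_1", "Priority_2")
)

# complete() adds a row for every day in the sequence; new rows get NA in Priority
filled <- daily %>%
  complete(date = seq(min(date), max(date), by = "day"))

filled
```

Note that here the range is taken from the dates already present in the data, whereas the full_join version takes it from the original Start and End columns. For this problem the two give the same range.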
Now we need to figure out how to regroup the places where the same priority occurs several times in a row. If we convert the NA values to "NA", we can use the lag function to check whether a value is the same as the one before it. lag always returns NA for the first value (you can't see what came before the first value), so there's a small issue to work around. Then we can use fill to fill in all the blanks.
# This is the data desired, but now it needs to be put back in the
# human readable format it started in. Use lag to identify when Priority
# changes, then use fill to establish a group of the same Priority in a row.
df2 <- df2 %>%
  # Remove NA from Priority because it causes problems with !=
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority), 1:n(), NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)), 1, Group)) %>%
  # Now that the breaks are identified, fill in the rest of the group
  # with the most recent value
  fill(Group)
# A tibble: 9 x 3
  date       Priority   Group
  <date>     <chr>      <dbl>
1 2018-01-01 Priority_2  1.00
2 2018-01-02 Priority_2  1.00
3 2018-01-03 Priority_1  3.00
4 2018-01-04 Priority_1  3.00
5 2018-01-05 Priority_1  3.00
6 2018-01-06 NA          6.00
7 2018-01-07 NA          6.00
8 2018-01-08 Priority_2  8.00
9 2018-01-09 Priority_2  8.00
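The lag-then-fill step can also be written as a running count of changes. The sketch below is one possible variant on hypothetical data named `daily`, assuming the gaps have already been replaced by the string "NA" as described above.

```r
library(dplyr)

# Hypothetical per-day priorities, with gaps already replaced by the string "NA"
daily <- tibble(
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-09"), by = "day"),
  Priority = c("Priority_2", "Priority_2", "Priority_1", "Priority_1", "Priority_1",
               "NA", "NA", "Priority_2", "Priority_2")
)

runs <- daily %>%
  mutate(
    # TRUE wherever the label differs from the previous row;
    # the default makes the first row compare against itself, so it counts as no change
    changed = Priority != lag(Priority, default = first(Priority)),
    # The running total of changes gives one id per consecutive run: 0, 1, 2, ...
    Group = cumsum(changed)
  )

runs
```

The ids differ from the ones produced by lag and fill (0, 1, 2, 3 instead of 1, 3, 6, 8), but they identify the same runs, which is all the later group_by needs.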
From here it's a quick job to group by the Priority and Group columns, summarise to get the Start and End dates, and clean up the variables.
# Return the data to human readable form using group_by and summarise
df2 <- df2 %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  # Return "NA" values to NA
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority, Start, End)
# A tibble: 4 x 3
  Priority   Start      End
  <chr>      <date>     <date>
1 Priority_2 2018-01-01 2018-01-02
2 Priority_1 2018-01-03 2018-01-05
3 NA         2018-01-06 2018-01-07
4 Priority_2 2018-01-08 2018-01-09
Or... the whole thing in a single pipeline:
# Performing the whole thing in one go...
df3 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start, End, seq, by = 'day')) %>%
  select(-Start, -End) %>%
  unnest(date) %>%
  group_by(date) %>%
  summarise(Priority = min(Priority)) %>%
  full_join(tibble(date = seq(min(df$Start), max(df$End), by = 'day')), by = "date") %>%
  arrange(date) %>%
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority), 1:n(), NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)), 1, Group)) %>%
  fill(Group) %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority, Start, End)
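If this needs to be reused, the pipeline could be wrapped in a function. The following is only a sketch under my own assumptions: `resolve_ranges` and its `hierarchy` argument are hypothetical names, it ranks priorities with match() instead of min() on strings, and it fills gaps with complete() rather than full_join.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper (not from the answer above): resolves overlapping ranges
# using an explicit hierarchy vector, highest priority first.
resolve_ranges <- function(df, hierarchy) {
  df %>%
    # One row per day covered by each range
    mutate(date = map2(Start, End, seq, by = "day")) %>%
    select(Priority, date) %>%
    unnest(date) %>%
    # Keep the best (smallest) hierarchy position for each day
    group_by(date) %>%
    summarise(pos = min(match(Priority, hierarchy))) %>%
    # Add the uncovered days; they get NA in pos
    complete(date = seq(min(date), max(date), by = "day")) %>%
    mutate(
      Priority = hierarchy[pos],   # NA position gives NA priority
      pos = coalesce(pos, 0L),     # use 0 for gaps so != never returns NA
      Group = cumsum(pos != lag(pos, default = first(pos)))
    ) %>%
    # Collapse each consecutive run back into a Start/End pair
    group_by(Group, Priority) %>%
    summarise(Start = min(date), End = max(date), .groups = "drop") %>%
    select(Priority, Start, End)
}

# Usage with the example data from the question
df <- data.frame(Priority = c("Priority_2", "Priority_1", "Priority_2"),
                 Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
                 End = as.Date(c("2018-01-04", "2018-01-05", "2018-01-09")),
                 stringsAsFactors = FALSE)

result <- resolve_ranges(df, hierarchy = c("Priority_1", "Priority_2"))
result
```

On the example data this should give the same four rows as df3. The .groups argument assumes dplyr 1.0.0 or later.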