Create a variable that includes the lag and lead of a date-time variable

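Problem description

I have a dataset similar to the example below. I am trying to create a variable identical to the "expected" variable. That is, if date_time is "2020-01-02 12:00", create a variable that contains the lag, lead and level (current) values of date_time.

Thanks in advance.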

    library(tibble)

    df <- tibble(
      date_time = c("2020-01-01 10:00", "2020-01-01 11:00", "2020-01-01 12:00",
                    "2020-01-01 13:00", "2020-01-01 14:00", "2020-01-02 10:00",
                    "2020-01-02 11:00", "2020-01-02 12:00", "2020-01-03 10:00",
                    "2020-01-03 11:00"),
      date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
               "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"),
      x = seq(1, 5.5, by = 0.5),
      date_time_NA = c(NA, "2020-01-01 11:00", NA, NA, NA, NA, NA, "2020-01-02 12:00", NA, NA),
      expected = c("2020-01-01 10:00", "2020-01-01 11:00", "2020-01-01 12:00", NA, NA,
                   NA, "2020-01-02 11:00", "2020-01-02 12:00", "2020-01-03 10:00", NA)
    )
    df

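New example

I have changed the example above. What I am trying to do is tell R to find the rows where date_time_NA is not missing, and then create a variable containing the lag, lead and level values of date_time (like expected).

Update: another solution

I came up with a solution using the code shared by @akrun. It may not be a very clever solution, but I find it intuitive. The main idea is to use ifelse to find the rows where date_time_NA is not missing. We can then use row_number() again, adding 1 to find the lead position and subtracting 1 to find the lag position, and fill those positions with the values of date_time.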

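library(dplyr)

df %>%
  mutate(
    # row index where date_time_NA is present, NA elsewhere
    na_row = ifelse(!is.na(date_time_NA), row_number(), NA_integer_),
    # the row before each match (lag position)
    row_level_lag = ifelse(row_number() %in% (na_row - 1), date_time, NA_character_),
    # the matched row itself (level)
    row_level_now = ifelse(row_number() %in% na_row, date_time, NA_character_),
    # the row after each match (lead position)
    row_level_lead = ifelse(row_number() %in% (na_row + 1), date_time, NA_character_),
    # each helper column is either date_time or NA, so coalesce() returns
    # date_time wherever any of them matched and NA everywhere else
    date_time_expected = coalesce(row_level_lag, row_level_now, row_level_lead)
  )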
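
The same idea can also be sketched with dplyr's lag() and lead() applied to a logical flag, which avoids the helper columns. This is only a sketch under assumptions: the toy tibble and the names flag, hit, keep and new are made up for illustration and are not part of the original code.

```r
library(dplyr)

# Toy data (made up): `flag` plays the role of date_time_NA
toy <- tibble(
  date_time = c("10:00", "11:00", "12:00", "13:00", "14:00", "15:00"),
  flag      = c(NA, "11:00", NA, NA, NA, "15:00")
)

res <- toy %>%
  mutate(
    hit  = !is.na(flag),                 # rows where flag is present
    keep = hit |
      lag(hit, default = FALSE) |        # row right after a hit
      lead(hit, default = FALSE),        # row right before a hit
    new  = if_else(keep, date_time, NA_character_)
  ) %>%
  select(-hit, -keep)

res$new
```

Because the flag is shifted rather than the row numbers, a hit in the first or last row needs no special handling: the default = FALSE argument covers the missing neighbour.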

Tags: r, datetime

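Solution

We can use match to get the index of the relevant 'date_time' value, then add the sequence -1:1 to it to get the indices of the previous and next values. Use these indices to create a 'new' column holding the subset of 'date_time' values.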

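# index of the target value in date_time
i1 <- match("2020-01-02 12:00", df$date_time)
# indices of the previous, current and next rows
i2 <- i1 + (-1:1)
# drop indices outside the data (target in the first or last row)
i2 <- i2[i2 >= 1 & i2 <= nrow(df)]
df$new <- NA_character_
df$new[i2] <- df$date_time[i2]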

-output

df
# A tibble: 10 x 5
#   date_time        date           x expected         new             
#   <chr>            <chr>      <dbl> <chr>            <chr>           
# 1 2020-01-01 10:00 2020-01-01   1   <NA>             <NA>            
# 2 2020-01-01 11:00 2020-01-01   1.5 <NA>             <NA>            
# 3 2020-01-01 12:00 2020-01-01   2   <NA>             <NA>            
# 4 2020-01-01 13:00 2020-01-01   2.5 <NA>             <NA>            
# 5 2020-01-01 14:00 2020-01-01   3   <NA>             <NA>            
# 6 2020-01-02 10:00 2020-01-02   3.5 <NA>             <NA>            
# 7 2020-01-02 11:00 2020-01-02   4   2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02   4.5 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03   5   2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03   5.5 <NA>             <NA>            
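
The index arithmetic behind this approach can be traced on a small vector. The following is a minimal sketch; the vector x and the names j and new are made up for illustration. It also shows why indices falling outside the data may need to be dropped when the target sits in the first or last position.

```r
x <- c("a", "b", "c", "d", "e")   # made-up values standing in for date_time

i1 <- match("d", x)               # position of the target value
i2 <- i1 + (-1:1)                 # previous, current and next positions
i2

# A target in the last position yields an index past the end of x,
# so keep only the indices that actually exist
j <- match("e", x) + (-1:1)
j <- j[j >= 1 & j <= length(x)]
j

# Copy the selected values into an otherwise all-NA vector
new <- rep(NA_character_, length(x))
new[i2] <- x[i2]
new
```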

With dplyr, we can use replace to set to NA every value of 'date_time' that is not at the matched index or at the index just before or after it.

library(dplyr)

df %>%
  # keep date_time at the matched row and its two neighbours, NA elsewhere
  mutate(new = replace(
    date_time,
    !row_number() %in% (match("2020-01-02 12:00", date_time) + (-1:1)),
    NA_character_
  ))

-output

# A tibble: 10 x 5
#   date_time        date           x expected         new             
#   <chr>            <chr>      <dbl> <chr>            <chr>           
# 1 2020-01-01 10:00 2020-01-01   1   <NA>             <NA>            
# 2 2020-01-01 11:00 2020-01-01   1.5 <NA>             <NA>            
# 3 2020-01-01 12:00 2020-01-01   2   <NA>             <NA>            
# 4 2020-01-01 13:00 2020-01-01   2.5 <NA>             <NA>            
# 5 2020-01-01 14:00 2020-01-01   3   <NA>             <NA>            
# 6 2020-01-02 10:00 2020-01-02   3.5 <NA>             <NA>            
# 7 2020-01-02 11:00 2020-01-02   4   2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02   4.5 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03   5   2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03   5.5 <NA>             <NA>            
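
The negated mask passed to replace can look ambiguous, so here is a small sketch of how it evaluates. The vector x and the names idx, keep, mask and out are made up for illustration; idx plays the role of row_number().

```r
x    <- c("a", "b", "c", "d", "e")  # made-up values
idx  <- seq_along(x)                # stands in for row_number()
keep <- c(2, 3, 4)                  # indices whose values should survive

# %in% binds more tightly than !, so this is !(idx %in% keep)
mask <- !idx %in% keep
mask

# replace() overwrites the positions where mask is TRUE
out <- replace(x, mask, NA_character_)
out
```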

Update

If more than one element matches, we create a grouping column and apply the same code within each group.

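library(dplyr)
library(tidyr)

df %>%
    # group id that starts one row before each non-missing date_time_NA
    mutate(grp = lead(cumsum(!is.na(date_time_NA)))) %>%
    # lead() leaves NA in the last row; carry the previous id down
    fill(grp) %>%
    group_by(grp) %>%
    # within each group, keep the matched row and its two neighbours
    mutate(new = replace(date_time,
      !row_number() %in% (match(date_time_NA[complete.cases(date_time_NA)][1],
            date_time) + (-1:1)), NA_character_)) %>%
    ungroup() %>%
    select(-grp)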
# A tibble: 10 x 6
#   date_time        date           x date_time_NA     expected         new             
#   <chr>            <chr>      <dbl> <chr>            <chr>            <chr>           
# 1 2020-01-01 10:00 2020-01-01   1   <NA>             2020-01-01 10:00 2020-01-01 10:00
# 2 2020-01-01 11:00 2020-01-01   1.5 2020-01-01 11:00 2020-01-01 11:00 2020-01-01 11:00
# 3 2020-01-01 12:00 2020-01-01   2   <NA>             2020-01-01 12:00 2020-01-01 12:00
# 4 2020-01-01 13:00 2020-01-01   2.5 <NA>             <NA>             <NA>            
# 5 2020-01-01 14:00 2020-01-01   3   <NA>             <NA>             <NA>            
# 6 2020-01-02 10:00 2020-01-02   3.5 <NA>             <NA>             <NA>            
# 7 2020-01-02 11:00 2020-01-02   4   <NA>             2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02   4.5 2020-01-02 12:00 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03   5   <NA>             2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03   5.5 <NA>             <NA>             <NA>            
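
How the grouping column comes about may be easier to see step by step. The sketch below uses a made-up marker column in place of date_time_NA; the names toy, marker, csum and steps are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Made-up marker column standing in for date_time_NA
toy <- tibble(marker = c(NA, "b", NA, NA, "e", NA))

steps <- toy %>%
  mutate(
    csum = cumsum(!is.na(marker)),  # counter that increases at each marker
    grp  = lead(csum)               # shift up so the row before a marker joins its group
  ) %>%
  fill(grp)                         # lead() leaves NA in the last row; fill it downwards

steps
```

Each group then begins one row before its marker, so the matched row and both neighbours sit in the same group. This scheme relies on markers not falling on consecutive rows; when two markers are adjacent, the neighbours end up in different groups and some rows are not kept.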
