Complete a data frame with missing date ranges for multiple parameters

Problem description

I have the following data frame:

Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)

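I need to check whether the date range of each parameter starts on the first day of the year (2013-01-01) and ends on the last day of the year (2013-12-31). If it does not, I need to add an extra row at the beginning and at the end of each parameter so that each parameter covers the full date range. The result should look like this: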

Date_from    Date_to concentration Parameter
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         A
2013-05-10 2013-08-12           2.5         A
2013-08-13 2013-11-18           1.5         A
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           3.5         B
2013-05-10 2013-08-12           1.5         B
2013-08-13 2013-11-18           2.5         B
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         C
2013-05-10 2013-08-12           3.5         C
2013-08-13 2013-11-18           3.0         C
2013-11-19 2013-12-31            NA        NA

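Please note: the date ranges are identical across parameters in this example only to keep it simple.

Update: here is a snippet of my original data and code: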

sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ##Delete obsolete columns
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")     
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y") 
# Replace the comma decimal separator with a dot, because a comma is not recognised as part of a number.
# Note: gsub() turns every column into character, so the types are converted again below.
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)

Added, based on @jasbner's code:

cleaned_sm %>%
   group_by(Parameter) %>%
   do(add_row(.,
                 Date_from = ymd(max(Date_to))+1 ,
                 Date_to = ymd(paste(year(max(Date_to)),"1231")),
                 Parameter = .$Parameter[1])) %>%
   do(add_row(.,
                 Date_to = ymd(min(Date_from))-1, 
                 Date_from = ymd(paste(year(min(Date_from)),"0101")) ,
                 Parameter = .$Parameter[1],
                 .before = 0)) %>% 
   filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))

Tags: r

Solution


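My attempt with dplyr and lubridate. It is hacked together, but I think it should work. Note that this does not look for gaps in the middle of the date ranges. Basically, for each group, you add one row before and one row after that group. Then, if a group's date range already starts at the beginning of the year or ends at the end of the year, the corresponding added row is filtered out.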

library(dplyr)
library(lubridate)
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(.$Date_to))+1 ,
             Date_to = ymd(paste(year(max(.$Date_to)),"1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(.$Date_from))-1, 
             Date_from = ymd(paste(year(min(.$Date_from)),"0101")) ,
             Parameter = .$Parameter[1],
             .before = 0)) %>% 
  filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))  

# A tibble: 15 x 4
# Groups: Parameter [3]
#    Date_from  Date_to    concentration Parameter
#    <date>     <date>             <dbl> <chr>    
#  1 2013-01-01 2013-01-31         NA    A        
#  2 2013-02-01 2013-05-07          1.50 A        
#  3 2013-05-10 2013-08-12          2.50 A        
#  4 2013-08-13 2013-11-18          1.50 A        
#  5 2013-11-19 2013-12-31         NA    A        
#  6 2013-01-01 2013-01-31         NA    B        
#  7 2013-02-01 2013-05-07          3.50 B        
#  8 2013-05-10 2013-08-12          1.50 B        
#  9 2013-08-13 2013-11-18          2.50 B        
# 10 2013-11-19 2013-12-31         NA    B        
# 11 2013-01-01 2013-01-31         NA    C        
# 12 2013-02-01 2013-05-07          1.50 C        
# 13 2013-05-10 2013-08-12          3.50 C        
# 14 2013-08-13 2013-11-18          3.00 C        
# 15 2013-11-19 2013-12-31         NA    C 
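
For comparison, the same padding idea can be sketched in base R without dplyr or lubridate. This is only a minimal sketch under a few assumptions: it runs on the sample data frame y from the question (rebuilt here so the block is self-contained), it assumes the value column is named concentration, and pad_year() is a hypothetical helper name that does not come from any package. Like the answer above, it keeps the Parameter value on the added rows and does not look for gaps in the middle of the date ranges.

```r
# Sample data from the question, rebuilt so this block runs on its own.
y <- data.frame(
  Date_from = as.Date(rep(c("2013-02-01", "2013-05-10", "2013-08-13"), 3)),
  Date_to   = as.Date(rep(c("2013-05-07", "2013-08-12", "2013-11-18"), 3)),
  concentration = c(1.5, 2.5, 1.5, 3.5, 1.5, 2.5, 1.5, 3.5, 3),
  Parameter = rep(c("A", "B", "C"), each = 3),
  stringsAsFactors = FALSE
)

# pad_year() is a hypothetical helper (not from any package).
# It takes the rows of ONE parameter and adds a leading and/or trailing
# NA row when the range does not reach the start or end of the year.
pad_year <- function(d) {
  d <- d[order(d$Date_from), ]
  first_day  <- min(d$Date_from)
  last_day   <- max(d$Date_to)
  year_start <- as.Date(format(first_day, "%Y-01-01"))
  year_end   <- as.Date(format(last_day, "%Y-12-31"))

  # Template row: same columns and Parameter, but no measured value.
  blank <- d[1, ]
  blank$concentration <- NA_real_

  # Start with zero-row placeholders so rbind() works in every case.
  head_row <- blank[0, ]
  tail_row <- blank[0, ]

  if (first_day > year_start) {
    head_row <- blank
    head_row$Date_from <- year_start
    head_row$Date_to   <- first_day - 1
  }
  if (last_day < year_end) {
    tail_row <- blank
    tail_row$Date_from <- last_day + 1
    tail_row$Date_to   <- year_end
  }
  rbind(head_row, d, tail_row)
}

# Apply the helper to each parameter and stack the results.
padded <- do.call(rbind, lapply(split(y, y$Parameter), pad_year))
rownames(padded) <- NULL
print(padded)
```

Because the padding rows are only created when they are needed, no duplicate filtering step is required afterwards. For the original data, the column name would need to be changed to Concentration.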
