How to assign day of year values starting from an arbitrary date and take care of missing values?

Question

I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.

We have booking data for only 279 days in the year.

dplyr::arrange(df_demand, depdate)

           depdate bookings
    1   2017-11-02       43
    2   2017-11-03       27
    3   2017-11-05       27
    4   2017-11-06       22
    5   2017-11-07       39
    6   2017-11-08       48
    .
    .

   279  2018-11-01       60

I want to introduce another column day_of_year in the following way:

    depdate       day_of_year     bookings
1    2017-11-02        1              43
2    2017-11-03        2              27
3    2017-11-04        3              NA
4    2017-11-05        4              27
    .
    .
    .
365  2018-11-01      365              60

I am trying to find the best possible way to do this.

In Python, I could use something like:

df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1

I wanted to know about an R equivalent of the same.

When I run

typeof(df_demand_2$depdate)

the output is

"double"

Am I missing something?

Tags: r, date, dataframe

Answer

You can use the complete function from the tidyr package to create one row for each date.

First, I create a data frame with some sample data:

df <- data.frame(
  depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
  bookings = c(43, 27, 27)
)

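Next, I perform two operations. First, using tidyr::complete, I specify all the dates I want in my analysis. I can do that with seq.Date, which creates a sequence from the first day to the last day. Dates that were missing from the original data get a row with bookings set to NA.

Once that is done, the data frame has exactly one row per calendar day, so the day_of_year column is simply equal to the row number.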

df_complete <- tidyr::complete(df,
  depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)

df_complete$day_of_year <- seq_len(nrow(df_complete))

> df_complete
#> # A tibble: 4 x 3
#>   depdate    bookings day_of_year
#>   <date>        <dbl>       <int>
#> 1 2017-11-02       43           1
#> 2 2017-11-03       27           2
#> 3 2017-11-04       NA           3
#> 4 2017-11-05       27           4

An equivalent solution using the pipe operator and dplyr:

library(dplyr)
library(tidyr)

df %>%
  complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
  mutate(day_of_year = row_number())
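
For something closer to the Python one-liner in the question, the day number can also be computed by subtracting the first date, which does not rely on row order. The sketch below is only one possible approach: it swaps tidyr::complete for a base R merge against the full date sequence, and the names all_dates and df_full are made up for illustration. It also touches on the typeof question: a Date is stored internally as a double, so typeof returning "double" is expected, while class reports "Date".

```r
# Sample data with one missing day (2017-11-04)
df <- data.frame(
  depdate = as.Date(c("2017-11-02", "2017-11-03", "2017-11-05")),
  bookings = c(43, 27, 27)
)

# Every calendar day between the first and last date (hypothetical helper name)
all_dates <- data.frame(
  depdate = seq(min(df$depdate), max(df$depdate), by = "day")
)

# Left join onto the full sequence; missing days get bookings = NA
df_full <- merge(all_dates, df, by = "depdate", all.x = TRUE)

# Days elapsed since the first date, plus 1 so that the first day is 1
df_full$day_of_year <- as.integer(df_full$depdate - min(df_full$depdate)) + 1L

# Dates are stored as doubles, but they carry the "Date" class
typeof(df_full$depdate)
class(df_full$depdate)
```

Because day_of_year here is derived from the dates themselves, it stays correct even if the rows are later reordered or filtered.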
