Filling in missing years in panel data

Problem description

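I have what I think is a very simple problem, but I have not been able to find a solution anywhere online. I am trying to clean panel data, and I have a dataset in which observations were recorded at irregular time steps. The entities are states and the time unit is years. I want to fill in the missing years for each state, filling the new rows with NA.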

Here is part of the data frame:

dput(df)
structure(list(State = c("Alabama", "Alabama", "Alabama", "Alaska", 
"Alaska", "Alaska", "Arizona", "Arizona", "Arizona", "Arkansas", 
"Arkansas", "California", "California", "California", "California", 
"California", "California", "Colorado", "Colorado", "Colorado", 
"Colorado", "Colorado", "Connecticut", "Connecticut", "Connecticut", 
"Connecticut", "Connecticut", "Connecticut", "Connecticut", "Delaware", 
"Delaware", "District of Columbia", "District of Columbia", "District of Columbia", 
"Florida", "Florida", "Florida", "Georgia", "Georgia", "Georgia", 
"Hawaii", "Hawaii", "Hawaii", "Idaho", "Idaho", "Idaho", "Idaho", 
"Illinois", "Illinois", "Illinois", "Illinois", "Illinois", "Indiana", 
"Indiana", "Indiana", "Iowa", "Iowa", "Kansas", "Kansas", "Kentucky", 
"Kentucky", "Kentucky", "Louisiana", "Louisiana", "Louisiana", 
"Maine", "Maine", "Maine", "Maine", "Maine", "Maryland", "Maryland", 
"Maryland", "Maryland", "Maryland", "Massachusetts", "Massachusetts", 
"Massachusetts", "Massachusetts", "Michigan", "Michigan", "Michigan", 
"Minnesota", "Minnesota", "Minnesota", "Mississippi", "Mississippi", 
"Mississippi", "Missouri", "Missouri", "Montana", "Montana", 
"Nebraska", "Nebraska", "Nevada", "Nevada", "Nevada", "New Hampshire", 
"New Hampshire", "New Jersey", "New Jersey", "New Jersey", "New Jersey", 
"New Mexico", "New Mexico", "New Mexico", "New Mexico", "New Mexico", 
"New Mexico", "New York", "New York", "New York", "New York", 
"New York", "New York", "North Carolina", "North Carolina", "North Carolina", 
"North Carolina", "North Carolina", "North Dakota", "North Dakota", 
"North Dakota", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Oklahoma", 
"Oklahoma", "Oklahoma", "Oregon", "Oregon", "Oregon", "Oregon", 
"Pennsylvania", "Pennsylvania", "Rhode Island", "Rhode Island", 
"Rhode Island", "Rhode Island", "Rhode Island", "South Carolina", 
"South Carolina", "South Carolina", "South Dakota", "South Dakota", 
"Tennessee", "Tennessee", "Tennessee", "Tennessee", "Tennessee", 
"Texas", "Texas", "Texas", "Utah", "Utah", "Utah", "Utah", "Vermont", 
"Vermont", "Vermont", "Vermont", "Virginia", "Virginia", "Virginia", 
"Virginia", "Virginia", "Virginia", "Washington", "Washington", 
"Washington", "Washington", "West Virginia", "West Virginia", 
"West Virginia", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin", 
"Wisconsin", "Wyoming", "Wyoming"), Year = c(2001, 2015, 2016, 
2001, 2016, 2017, 2001, 2016, 2017, 2001, 2015, 2001, 2008, 2011, 
2014, 2015, 2016, 2001, 2013, 2015, 2016, 2017, 2001, 2003, 2012, 
2014, 2015, 2016, 2017, 2001, 2014, 2001, 2013, 2017, 2001, 2015, 
2016, 2001, 2014, 2017, 2001, 2016, 2017, 2001, 2015, 2016, 2017, 
2001, 2010, 2015, 2016, 2017, 2001, 2015, 2016, 2001, 2016, 2001, 
2017, 2001, 2013, 2015, 2001, 2015, 2016, 2001, 2014, 2015, 2016, 
2017, 2001, 2013, 2015, 2016, 2017, 2001, 2012, 2014, 2016, 2001, 
2014, 2017, 2001, 2014, 2016, 2001, 2015, 2017, 2001, 2016, 2001, 
2017, 2001, 2015, 2001, 2015, 2017, 2001, 2015, 2001, 2013, 2015, 
2017, 2001, 2009, 2014, 2015, 2016, 2017, 2001, 2006, 2007, 2014, 
2015, 2016, 2001, 2013, 2015, 2016, 2017, 2001, 2015, 2016, 2001, 
2014, 2015, 2016, 2017, 2001, 2013, 2014, 2001, 2013, 2014, 2016, 
2001, 2014, 2001, 2012, 2014, 2015, 2016, 2001, 2015, 2016, 2001, 
2016, 2001, 2014, 2015, 2016, 2017, 2001, 2015, 2016, 2001, 2014, 
2016, 2017, 2001, 2013, 2014, 2015, 2001, 2013, 2014, 2015, 2016, 
2017, 2001, 2010, 2012, 2015, 2001, 2015, 2016, 2001, 2014, 2015, 
2016, 2017, 2001, 2017), naaddressoaayn = c(0, 1, 1, 0, 1, 1, 
0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 
1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 
0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 
1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 
1, 1, 0, 1, 1, 1, 1, 0, 1)), row.names = c(NA, -183L), class = c("tbl_df", 
"tbl", "data.frame"))

I tried the code below, but it returns an error saying that I cannot declare my year variable as a date.

library(tidyverse)
library(lubridate)
library(tidyr)

df2 <- df %>%
  mutate(Year = as.Date(Year)) %>%
  complete(Year = seq.Date(min(Year), max(Year), by = "year"))

It returns the following error:

Error in as.Date.numeric(Year) : 'origin' must be supplied

If I remove the mutate call, as shown below, it returns a different error:

df2 <- df %>%
  complete(Year = seq.Date(min(Year), max(Year), by = "year"))

Error in seq.Date(min(Year), max(Year), by = "year") : 
  'from' must be a "Date" object

I have also tried the tsibble package, but the fill_gaps function does not work because the observations are not recorded at regular time steps.

library(tsibble)

df_tsbl <- df %>%
  as_tsibble(key = State, index = Year, regular = FALSE)

df2 <- df_tsbl %>%
  fill_gaps()

Error: Can't handle tsibble of irregular interval.
Call `rlang::last_error()` to see a backtrace

Finally, I also tried the padr package, with the following code:

library(padr)

df2 <- df_tsbl %>%
  pad(Year)

Error in pad(., Year) : object 'Year' not found

df2 <- df %>%
  pad(interval = year, start_val = 2001, end_val = 2017)

Error: start_val should be of class Date, POSIXlt, or POSIXct

df2 <- pad(df)

Error: x contains multiple variables of class Date, POSIXct, or POSIXlt.
        Please specify which variable to use in the by argument.

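I believe this is a very simple problem, but I cannot seem to find a solution. Please let me know what you think; I would really appreciate any suggestions.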

Tags: r, time-series, missing-data, panel-data

Solution


We can create a sequence from the min to the max Year for each State:

library(dplyr)

df %>%
 group_by(State) %>%
 tidyr::complete(Year = seq(min(Year), max(Year)))

#   State    Year naaddressoaayn
#   <chr>   <dbl>          <dbl>
# 1 Alabama  2001              0
# 2 Alabama  2002             NA
# 3 Alabama  2003             NA
# 4 Alabama  2004             NA
# 5 Alabama  2005             NA
# 6 Alabama  2006             NA
# 7 Alabama  2007             NA
# 8 Alabama  2008             NA
# 9 Alabama  2009             NA
#10 Alabama  2010             NA
# … with 819 more rows
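
To make the per-state behaviour easier to see, here is a minimal sketch on a small made-up panel. The data frame `toy` and its column `x` are hypothetical and are not part of the question's data. Because the data are grouped by State before `complete()` is called, each state is only filled between its own first and last observed year.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel (an assumption for illustration, not the asker's data):
# state "A" is observed in 2001 and 2004, state "B" in 2002 and 2003.
toy <- tibble(
  State = c("A", "A", "B", "B"),
  Year  = c(2001, 2004, 2002, 2003),
  x     = c(0, 1, 0, 1)
)

# Fill each state's own range of years; new rows get NA in x.
toy_filled <- toy %>%
  group_by(State) %>%
  complete(Year = seq(min(Year), max(Year))) %>%
  ungroup()

# "A" now has rows for 2001 to 2004; "B" still has only 2002 and 2003.
toy_filled
```

Calling `ungroup()` at the end is optional; it just avoids carrying the grouping into later steps.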

If we want all states to have the same start year and end year, we can use

tidyr::complete(df, State, Year = seq(min(Year), max(Year)))
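
As a rough illustration of the ungrouped call, the sketch below applies it to a small made-up panel. The data frame `toy` and its column `x` are hypothetical. It uses `tidyr::full_seq()` to build the year sequence, which should be equivalent to `seq(min(Year), max(Year))` for whole-number years.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel (an assumption for illustration, not the asker's data).
toy <- tibble(
  State = c("A", "A", "B", "B"),
  Year  = c(2001, 2004, 2002, 2003),
  x     = c(0, 1, 0, 1)
)

# Without group_by(), min and max are taken over the whole data set,
# so every state is expanded to the same range of years.
toy_global <- toy %>%
  complete(State, Year = full_seq(Year, period = 1))

# Both states now span 2001 to 2004; years a state never reported have NA in x.
toy_global
```

The trade-off is that states that started reporting late or stopped early get leading or trailing NA rows, which the grouped version avoids.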
