r - Filling in missing years in panel data
Problem description
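I have what I think is a very simple problem, but I have not been able to find a solution anywhere online. I am trying to clean panel data, and I have a dataset in which observations were recorded at irregular time steps. The entities are states and the time unit is years. I want to fill in the missing years for each state, with the new rows filled with NA.
Here is part of the data frame: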
dput(df)
structure(list(State = c("Alabama", "Alabama", "Alabama", "Alaska",
"Alaska", "Alaska", "Arizona", "Arizona", "Arizona", "Arkansas",
"Arkansas", "California", "California", "California", "California",
"California", "California", "Colorado", "Colorado", "Colorado",
"Colorado", "Colorado", "Connecticut", "Connecticut", "Connecticut",
"Connecticut", "Connecticut", "Connecticut", "Connecticut", "Delaware",
"Delaware", "District of Columbia", "District of Columbia", "District of Columbia",
"Florida", "Florida", "Florida", "Georgia", "Georgia", "Georgia",
"Hawaii", "Hawaii", "Hawaii", "Idaho", "Idaho", "Idaho", "Idaho",
"Illinois", "Illinois", "Illinois", "Illinois", "Illinois", "Indiana",
"Indiana", "Indiana", "Iowa", "Iowa", "Kansas", "Kansas", "Kentucky",
"Kentucky", "Kentucky", "Louisiana", "Louisiana", "Louisiana",
"Maine", "Maine", "Maine", "Maine", "Maine", "Maryland", "Maryland",
"Maryland", "Maryland", "Maryland", "Massachusetts", "Massachusetts",
"Massachusetts", "Massachusetts", "Michigan", "Michigan", "Michigan",
"Minnesota", "Minnesota", "Minnesota", "Mississippi", "Mississippi",
"Mississippi", "Missouri", "Missouri", "Montana", "Montana",
"Nebraska", "Nebraska", "Nevada", "Nevada", "Nevada", "New Hampshire",
"New Hampshire", "New Jersey", "New Jersey", "New Jersey", "New Jersey",
"New Mexico", "New Mexico", "New Mexico", "New Mexico", "New Mexico",
"New Mexico", "New York", "New York", "New York", "New York",
"New York", "New York", "North Carolina", "North Carolina", "North Carolina",
"North Carolina", "North Carolina", "North Dakota", "North Dakota",
"North Dakota", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Oklahoma",
"Oklahoma", "Oklahoma", "Oregon", "Oregon", "Oregon", "Oregon",
"Pennsylvania", "Pennsylvania", "Rhode Island", "Rhode Island",
"Rhode Island", "Rhode Island", "Rhode Island", "South Carolina",
"South Carolina", "South Carolina", "South Dakota", "South Dakota",
"Tennessee", "Tennessee", "Tennessee", "Tennessee", "Tennessee",
"Texas", "Texas", "Texas", "Utah", "Utah", "Utah", "Utah", "Vermont",
"Vermont", "Vermont", "Vermont", "Virginia", "Virginia", "Virginia",
"Virginia", "Virginia", "Virginia", "Washington", "Washington",
"Washington", "Washington", "West Virginia", "West Virginia",
"West Virginia", "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin",
"Wisconsin", "Wyoming", "Wyoming"), Year = c(2001, 2015, 2016,
2001, 2016, 2017, 2001, 2016, 2017, 2001, 2015, 2001, 2008, 2011,
2014, 2015, 2016, 2001, 2013, 2015, 2016, 2017, 2001, 2003, 2012,
2014, 2015, 2016, 2017, 2001, 2014, 2001, 2013, 2017, 2001, 2015,
2016, 2001, 2014, 2017, 2001, 2016, 2017, 2001, 2015, 2016, 2017,
2001, 2010, 2015, 2016, 2017, 2001, 2015, 2016, 2001, 2016, 2001,
2017, 2001, 2013, 2015, 2001, 2015, 2016, 2001, 2014, 2015, 2016,
2017, 2001, 2013, 2015, 2016, 2017, 2001, 2012, 2014, 2016, 2001,
2014, 2017, 2001, 2014, 2016, 2001, 2015, 2017, 2001, 2016, 2001,
2017, 2001, 2015, 2001, 2015, 2017, 2001, 2015, 2001, 2013, 2015,
2017, 2001, 2009, 2014, 2015, 2016, 2017, 2001, 2006, 2007, 2014,
2015, 2016, 2001, 2013, 2015, 2016, 2017, 2001, 2015, 2016, 2001,
2014, 2015, 2016, 2017, 2001, 2013, 2014, 2001, 2013, 2014, 2016,
2001, 2014, 2001, 2012, 2014, 2015, 2016, 2001, 2015, 2016, 2001,
2016, 2001, 2014, 2015, 2016, 2017, 2001, 2015, 2016, 2001, 2014,
2016, 2017, 2001, 2013, 2014, 2015, 2001, 2013, 2014, 2015, 2016,
2017, 2001, 2010, 2012, 2015, 2001, 2015, 2016, 2001, 2014, 2015,
2016, 2017, 2001, 2017), naaddressoaayn = c(0, 1, 1, 0, 1, 1,
0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 1, 1, 1, 1, 0, 1)), row.names = c(NA, -183L), class = c("tbl_df",
"tbl", "data.frame"))
I tried the code below, but it fails because my Year variable cannot be converted to a date this way.
library(tidyverse)
library(lubridate)
library(tidyr)
df2 <- df %>%
mutate(Year = as.Date(Year)) %>%
complete(Year = seq.Date(min(Year), max(Year), by = "year"))
It returns the following error:
Error in as.Date.numeric(Year) : 'origin' must be supplied
If I remove the mutate call, as shown below, it returns this error instead:
df2 <- df %>%
complete(Year = seq.Date(min(Year), max(Year), by = "year"))
Error in seq.Date(min(Year), max(Year), by = "year") :
'from' must be a "Date" object
I also tried the tsibble package, but the fill_gaps function does not work because the observations are not recorded at regular time steps.
library(tsibble)
df_tsbl <-df %>%
as_tsibble(key = State, index = Year, regular = FALSE)
df2 <- df_tsbl %>%
fill_gaps()
Error: Can't handle tsibble of irregular interval.
Call `rlang::last_error()` to see a backtrace
Finally, I also tried the padr package, with the following attempts:
library(padr)
df2 <- df_tsbl %>%
pad(Year)
Error in pad(., Year) : object 'Year' not found
df2 <- df %>%
pad(interval = year, start_val = 2001, end_val = 2017)
Error: start_val should be of class Date, POSIXlt, or POSIXct
df2 <- pad(df)
Error: x contains multiple variables of class Date, POSIXct, or POSIXlt.
Please specify which variable to use in the by argument.
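I am sure this is a very simple problem, but I cannot seem to find a solution. Please let me know what you think; I would really appreciate any suggestions.
Solution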
We can create a sequence from the min to the max of Year for each State:
library(dplyr)
df %>%
group_by(State) %>%
tidyr::complete(Year = seq(min(Year), max(Year)))
# State Year naaddressoaayn
# <chr> <dbl> <dbl>
# 1 Alabama 2001 0
# 2 Alabama 2002 NA
# 3 Alabama 2003 NA
# 4 Alabama 2004 NA
# 5 Alabama 2005 NA
# 6 Alabama 2006 NA
# 7 Alabama 2007 NA
# 8 Alabama 2008 NA
# 9 Alabama 2009 NA
#10 Alabama 2010 NA
# … with 819 more rows
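To check the behaviour on something small, here is a minimal sketch on a made-up two-state data frame. The names toy, toy_filled and the column x are illustrative assumptions, not part of the question's data. Because Year is a plain number, seq() is enough and no conversion to Date is needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel with irregular gaps (not the asker's data)
toy <- tibble(
  State = c("A", "A", "B", "B", "B"),
  Year  = c(2001, 2004, 2002, 2003, 2006),
  x     = c(0, 1, 0, 1, 1)
)

# Fill each state's own range, from its min(Year) to its max(Year);
# rows created by complete() get NA in every other column
toy_filled <- toy %>%
  group_by(State) %>%
  complete(Year = seq(min(Year), max(Year))) %>%
  ungroup()

toy_filled
```

State A spans 2001 to 2004 and state B spans 2002 to 2006, so the result has 4 + 5 = 9 rows, four of which have NA in x.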
If we want every state to have the same start and end year, we can use
tidyr::complete(df, State, Year = seq(min(Year), max(Year)))
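The balanced variant can be tried the same way. The sketch below uses a made-up data frame (toy, toy_balanced and x are illustrative names, not from the question) and also shows one possible way to build a real Date column afterwards, should one be needed. The choice of 1 January as the day is an assumption.

```r
library(tidyr)
library(tibble)

# Hypothetical toy panel with irregular gaps (not the asker's data)
toy <- tibble(
  State = c("A", "A", "B", "B", "B"),
  Year  = c(2001, 2004, 2002, 2003, 2006),
  x     = c(0, 1, 0, 1, 1)
)

# No grouping here, so min(Year) and max(Year) are taken over the whole
# data frame and every state receives every year in that common range
toy_balanced <- complete(toy, State, Year = seq(min(Year), max(Year)))

# as.Date() on a bare number such as 2001 treats it as a day count from an
# origin (older R versions required the origin explicitly, hence the error
# in the question). Building the date from a full string avoids that.
toy_balanced$Date <- as.Date(paste0(toy_balanced$Year, "-01-01"))

toy_balanced
```

With two states and a common range of 2001 to 2006, the result has 2 x 6 = 12 rows.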