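imputation - na.approx and na.locf not behaving correctly
Problem description
I am trying to compute imputed values for time series of different countries. This code used to work fine, but now the imputed values are all wrong. I can't figure out what the problem is, and I have tried everything I can think of.
Our rules are:
- Values missing at the end of a time series are given the last known value in the series.
- Values missing at the beginning of a time series are given the first known value in the series.
- If values are missing in the middle of a time series, linear interpolation is used.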
# load libraries for data manipulation and imputation
library(dplyr)
library(tidyr)
library(zoo)
# expand table to show NAs
output_table_imp <- expand(output_table, transport_mode, year, country_code)
output_table_imp <- full_join(output_table_imp, output_table)
# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>% # linear interpolation
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) # missing values at the start of a time series (copy first non-NA value)
My data frame consists of several columns: transport_mode, country_code, year, fatalities. I'm not sure how to share my data here; it is a large table with 3600 observations.
Solution
Your code looks overly complicated. I don't know the zoo details, but I'm fairly sure you could get it to work that way too.
With the imputeTS package you can pass in your whole data.frame
(it assumes each column is a separate time series) and the package performs imputation for each of these series.
(Unfortunately your question includes no data, but I guess this would be your output_table_imp data.frame
after expansion.)
Just like this:
library("imputeTS")
na_interpolation(output_table_imp, option = "linear")
Nothing needs to change for the NA treatment at the beginning and at the end, since your requirements are the default behavior of the na_interpolation function.
These were your requirements:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
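To make the three rules concrete, here is a minimal base R sketch built on approx() with rule = 2, which holds the nearest known value outside the observed range. The vector x and the names idx, known and filled are made up for illustration; this only demonstrates the behavior and is not a statement about how imputeTS is implemented.

```r
# Made-up series with NAs at the start, in the middle, and at the end
x <- c(NA, 4, NA, NA, 10, NA)

idx   <- seq_along(x)   # positions 1..n stand in for the time index
known <- !is.na(x)      # which observations are available

# method = "linear" interpolates between known points;
# rule = 2 repeats the first/last known value beyond them
filled <- approx(x = idx[known], y = x[known], xout = idx,
                 method = "linear", rule = 2)$y

print(filled)
# [1]  4  4  6  8 10 10
```

Note that approx() needs at least two known values, so a series with fewer would have to be handled separately.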
Here is a toy example:
# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
# Perform linear interpolation
na_interpolation(test, option = "linear")
# Result
# [1] 1 1 1 2 3 4 5 6 7 8 8 8
As you can see, this works fine.
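For comparison, a sketch of the zoo chain from the question applied to the same toy vector. On a plain vector that is already in time order, the three calls give the same result, which suggests (as an assumption, the question does not confirm it) that the wrong values come from how the rows are ordered or grouped rather than from na.approx or na.locf themselves. The names test_zoo and step1 to step3 are made up for this sketch.

```r
library(zoo)

# Same toy series: NAs at start, middle, end
test_zoo <- c(NA, NA, 1, 2, 3, NA, NA, 6, 7, 8, NA, NA)

# 1. Linear interpolation of interior gaps; leading/trailing NAs are kept
step1 <- na.approx(test_zoo, na.rm = FALSE)
# 2. Carry the last known value forward to fill the end
step2 <- na.locf(step1, na.rm = FALSE)
# 3. Carry the first known value backward to fill the start
step3 <- na.locf(step2, fromLast = TRUE, na.rm = FALSE)

print(step3)
# [1] 1 1 1 2 3 4 5 6 7 8 8 8
```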
It also works with a data.frame (as I said, each column is interpreted as a time series):
# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)
na_interpolation(df, option = "linear")
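The table in the question is in long format (one row per transport_mode, country_code and year), so each series sits in a group of rows rather than in its own column. A minimal sketch of how the imputation could be applied per group in that case, assuming the rows need to be sorted by year first. The table toy_table and its values are invented for illustration, and each group must contain at least two known values for linear interpolation to work.

```r
library(dplyr)
library(imputeTS)

# Invented long-format data: two countries, one transport mode,
# with the years deliberately out of order
toy_table <- data.frame(
  transport_mode = "road",
  country_code   = rep(c("AT", "BE"), each = 5),
  year           = rep(c(2003, 2001, 2005, 2002, 2004), times = 2),
  fatalities     = c(NA, NA, NA, 10, 30,   NA, 5, NA, NA, 8)
)

toy_imp <- toy_table %>%
  arrange(transport_mode, country_code, year) %>%  # time order within each series
  group_by(transport_mode, country_code) %>%       # one series per mode and country
  mutate(fatalities_imp = na_interpolation(fatalities, option = "linear")) %>%
  ungroup()

print(toy_imp$fatalities_imp)
# [1] 10 10 20 30 30  5  6  7  8  8
```

Without the arrange() step the interpolation would run over rows in whatever order the join produced, which is one plausible way to end up with wrong imputed values.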