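python - Pandas: impute a given number of missing values before/after a run of available values
Question
Suppose I have a time series where data is available for a span of consecutive years, but values are missing before and after that span, like this: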
import numpy as np
import pandas as pd
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
print(df)
year cakes eaten
0 2000 NaN
1 2001 NaN
2 2002 NaN
3 2003 3.0
4 2004 4.0
5 2005 5.0
6 2006 NaN
7 2007 NaN
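Is there a way to fill a given number of the missing values based on the trend seen in the available values?
Say I want to fill at most 2 values in each direction; the result should then look like this: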
year cakes eaten
0 2000 NaN
1 2001 1.0
2 2002 2.0
3 2003 3.0
4 2004 4.0
5 2005 5.0
6 2006 6.0
7 2007 7.0
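Additionally: is there a way to make sure this imputation is only carried out when enough values are available, for example only if at least 3 values are available (or, more generally, to fill n values only if n + m values are available)?
Solution
Thanks to @olv1do for showing me that interpolate() does what I need.
Using interpolate together with .first_valid_index and .last_valid_index makes it possible to implement the desired behavior: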
# Impute up to n values in each direction if at least m values are available.
# Note: method='spline' requires SciPy to be installed.
def interpolate(data, n, m):
    first_valid = data['cakes eaten'].first_valid_index()
    last_valid = data['cakes eaten'].last_valid_index()
    # Span between the first and last valid index (assumes the default integer index).
    if abs(first_valid - last_valid) + 1 >= m:
        data['imputed'] = data['cakes eaten'].interpolate(
            method='spline', order=1, limit_direction='both', limit=n)
    return data
For the example from the question:
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
interpolate(df, 2,3)
year cakes eaten imputed
0 2000 NaN NaN
1 2001 NaN 1.0
2 2002 NaN 2.0
3 2003 3.0 3.0
4 2004 4.0 4.0
5 2005 5.0 5.0
6 2006 NaN 6.0
7 2007 NaN 7.0
If fewer than m values are available, nothing is done (no imputed column is added):
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, np.nan, np.nan, np.nan]})
interpolate(df, 2,3)
year cakes eaten
0 2000 NaN
1 2001 NaN
2 2002 NaN
3 2003 3.0
4 2004 4.0
5 2005 NaN
6 2006 NaN
7 2007 NaN
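One thing worth keeping in mind: the check in interpolate() compares the distance between the first and last valid index, not the number of non-missing values. In the examples here the available values are contiguous, so both agree, but with gaps inside the span they can differ. A minimal sketch of a count-based guard follows; has_enough_values is a hypothetical helper name and is not part of the original answer.

```python
import numpy as np
import pandas as pd


def has_enough_values(series, m):
    """Hypothetical guard: True when at least m values are not NaN."""
    # Series.count() counts only the non-missing entries.
    return bool(series.count() >= m)


# A series with a gap between its two observed values.
gap = pd.Series([np.nan, 3, np.nan, np.nan, 5, np.nan])

first, last = gap.first_valid_index(), gap.last_valid_index()
span = abs(first - last) + 1

print(span)                       # 4: index span from position 1 to 4
print(has_enough_values(gap, 3))  # False: only 2 values are observed
```

Here the span-based check would pass for m = 3 even though only two values are actually available, so which guard fits depends on whether gaps inside the span can occur in your data.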
Moreover, the spline method also works well when the values are not perfectly linear as in my example:
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, 1, 4, 2, 3, np.nan, np.nan]})
interpolate(df, 1,4)
year cakes eaten imputed
0 2000 NaN NaN
1 2001 NaN 1.381040
2 2002 1.0 1.000000
3 2003 4.0 4.000000
4 2004 2.0 2.000000
5 2005 3.0 3.000000
6 2006 NaN 3.433167
7 2007 NaN NaN
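If SciPy is not available, the same idea (fill up to n values on each side, only when at least m values are observed) can be approximated with a plain least-squares line. This is a sketch of a different technique, not the spline-based method used above: extrapolate_linear and its parameters are hypothetical names, it relies on np.polyfit with degree 1, it assumes positions 0, 1, 2, ... as the x-axis, and it leaves gaps inside the observed span untouched. For non-linear data its results will differ from the spline output shown above.

```python
import numpy as np
import pandas as pd


def extrapolate_linear(values, n, m):
    """Fill up to n missing values on each side of the observed block.

    Hypothetical helper: fits one straight line to the observed points
    and only fills when at least m values are observed.
    """
    values = pd.Series(values, dtype="float64").reset_index(drop=True)
    observed = values.dropna()
    if len(observed) < max(m, 2):  # a line needs at least two points
        return values

    # Least-squares line through the observed (position, value) pairs.
    x = observed.index.to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, observed.to_numpy(), deg=1)

    first, last = int(observed.index[0]), int(observed.index[-1])
    filled = values.copy()
    # Positions up to n steps before the first and after the last observation.
    before = range(max(first - n, 0), first)
    after = range(last + 1, min(last + n, len(values) - 1) + 1)
    for pos in list(before) + list(after):
        filled.iloc[pos] = slope * pos + intercept
    return filled


cakes = [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]
df = pd.DataFrame({'year': [str(y) for y in range(2000, 2008)],
                   'cakes eaten': cakes})
df['imputed'] = extrapolate_linear(df['cakes eaten'], n=2, m=3)
print(df)
```

For the perfectly linear example from the question this fills 2001, 2002, 2006 and 2007 along the line through the observed values (up to floating-point rounding) and leaves 2000 as NaN, because it lies more than n = 2 steps before the first observation.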