How can I fill missing values by interpolating between the means of the values to the left and right?

Problem description

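I have a DataFrame in which missing values (single or consecutive) can occur within each group. I want to fill them as follows: take the mean of the 3 values to the left of the first NaN in a run, take the mean of the 3 values to the right of the last NaN in that run, and then interpolate the NaNs between those two means.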

+-------+-------+
| group | value |
+-------+-------+
| 1     | 1     |
+-------+-------+
| 1     | 1     |
+-------+-------+
| 1     | 2     |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | 4     |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | 6     |
+-------+-------+
| 1     | 4     |
+-------+-------+
| 1     | 3     |
+-------+-------+
| 1     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | 1     |
+-------+-------+
| 2     | 2     |
+-------+-------+
| 2     | 3     |
+-------+-------+
| 2     | 4     |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | NaN   |
+-------+-------+
| 2     | 6     |
+-------+-------+
| 2     | 8     |
+-------+-------+
| 2     | 9     |
+-------+-------+

Code to reproduce the DataFrame above

import numpy as np
import pandas as pd

nan = np.nan
d = {'group': {0: 1,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  11: 1,
  12: 2,
  13: 2,
  14: 2,
  15: 2,
  16: 2,
  17: 2,
  18: 2,
  19: 2,
  20: 2,
  21: 2,
  22: 2,
  23: 2},
 'value': {0: 1.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: nan,
  6: nan,
  7: 3.0,
  8: 6.0,
  9: 4.0,
  10: 3.0,
  11: nan,
  12: nan,
  13: nan,
  14: 1.0,
  15: 2.0,
  16: 3.0,
  17: 4.0,
  18: nan,
  19: nan,
  20: nan,
  21: 6.0,
  22: 8.0,
  23: 9.0}}

df = pd.DataFrame(d)

Expected output:

d = {'group': {0: 1,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  11: 1,
  12: 2,
  13: 2,
  14: 2,
  15: 2,
  16: 2,
  17: 2,
  18: 2,
  19: 2,
  20: 2,
  21: 2,
  22: 2,
  23: 2},
 'value': {0: 1.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 3.44444444,
  6: 3.88888889,
  7: 3.0,
  8: 6.0,
  9: 4.0,
  10: 3.0,
  11: 4.333333,
  12: 2.0,
  13: 2.0,
  14: 1.0,
  15: 2.0,
  16: 3.0,
  17: 4.0,
  18: 4.166667,
  19: 5.333333,
  20: 6.500000,
  21: 6.0,
  22: 8.0,
  23: 9.0}}
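
To make the rule concrete, the first gap in group 1 (rows 5 and 6) can be checked by hand. This is only a sketch of the arithmetic: it assumes the two means are placed on the rows immediately bordering the gap (rows 4 and 7), which is what the expected values above imply. The names `left_mean`, `right_mean` and `filled` are illustrative.

```python
import numpy as np

left_mean = np.mean([2.0, 3.0, 4.0])    # rows 2-4, the 3 values left of the gap
right_mean = np.mean([3.0, 6.0, 4.0])   # rows 7-9, the 3 values right of the gap

# Two NaNs sit between the two anchors, so there are 4 evenly spaced points
# in total; dropping both endpoints leaves the values for rows 5 and 6.
filled = np.linspace(left_mean, right_mean, num=4)[1:-1]
print(filled)
```

The two values printed correspond to 3.44444444 and 3.88888889 in the expected output.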

Is it possible to do this in pandas without using loops?

Tags: python, pandas

Solution


IIUC, here is one way to do it:

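df['updated_values'] = (
    df.groupby('group')
    .apply(
        lambda x: x['value'].fillna(
            x['value']
            .rolling(3)                  # 3-row window ending at each row
            .mean()                      # NaN where the window is incomplete or contains a NaN
            .bfill()                     # such rows take the next complete mean (the mean right of a gap)
            .where(~x['value'].isna())   # blank the originally missing rows again
            .interpolate()               # linear interpolation across each gap
            .bfill()                     # leading gap: use the first available mean
            .ffill()                     # trailing gap, if any is still unfilled
            )
    ).values  # positional assignment: assumes df is already ordered by group
)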

Output:

    group  value  updated_values
0       1    1.0        1.000000
1       1    1.0        1.000000
2       1    2.0        2.000000
3       1    3.0        3.000000
4       1    4.0        4.000000
5       1    NaN        3.444444
6       1    NaN        3.888889
7       1    3.0        3.000000
8       1    6.0        6.000000
9       1    4.0        4.000000
10      1    3.0        3.000000
11      1    NaN        4.333333
12      2    NaN        2.000000
13      2    NaN        2.000000
14      2    1.0        1.000000
15      2    2.0        2.000000
16      2    3.0        3.000000
17      2    4.0        4.000000
18      2    NaN        4.166667
19      2    NaN        5.333333
20      2    NaN        6.500000
21      2    6.0        6.000000
22      2    8.0        8.000000
23      2    9.0        9.000000
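
Because the answer assigns through `.values`, the result is matched to rows by position, not by index. As a possible variation (not part of the original answer), the same chain can be wrapped in a helper and passed to `groupby(...)['value'].transform`, which returns a Series aligned to the original index. The helper name `fill_gaps` and its `window` parameter are hypothetical, and the sample data is rebuilt in compact form so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "group": [1] * 12 + [2] * 12,
    "value": [1, 1, 2, 3, 4, nan, nan, 3, 6, 4, 3, nan,
              nan, nan, 1, 2, 3, 4, nan, nan, nan, 6, 8, 9],
})


def fill_gaps(s: pd.Series, window: int = 3) -> pd.Series:
    """Hypothetical helper: same steps as the answer, applied to one group."""
    anchors = (
        s.rolling(window).mean()  # mean of the `window` values ending at each row
        .bfill()                  # incomplete windows take the next complete mean
        .where(s.notna())         # keep anchors only on originally present rows
        .interpolate()            # linear fill across each gap
        .bfill()                  # leading gap
        .ffill()                  # trailing gap, if any is left
    )
    # Only the missing rows are replaced; present values stay untouched.
    return s.fillna(anchors)


# transform() keeps the original index, so no positional assignment is needed.
df["updated_values"] = df.groupby("group")["value"].transform(fill_gaps)
print(df)
```

On the sample data this should give the same `updated_values` column as shown above, and it does not depend on the rows being ordered by group.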
