首页 > 解决方案 > 熊猫插值条件

问题描述

我有一个这样的数据框

import pandas as pd
import numpy as np

df = pd.DataFrame({'date':pd.date_range(start='13/11/2021', periods=11),
                    'a': [1, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, 3],
                   'b': [4, np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 7],
                   }).set_index('date')

              a    b
date                
2021-11-13  1.0  4.0
2021-11-14  NaN  NaN
2021-11-15  NaN  NaN
2021-11-16  NaN  NaN
2021-11-17  NaN  NaN
2021-11-18  2.0  6.0
2021-11-19  NaN  NaN
2021-11-20  NaN  NaN
2021-11-21  NaN  NaN
2021-11-22  NaN  NaN
2021-11-23  3.0  7.0

如何将其线性插值到两个非 nan 值之间的 n% 间隔,然后用上限填充其余部分。
两个非 nan 值之间的间隔将在整个数据帧中保持不变。
现在例如,n = 0.5

                   a         b
date                          
2021-11-13  1.000000  4.000000  << #  original value --------------
2021-11-14  1.333333  4.666667                                     |
2021-11-15  1.666667  5.333333                                     |  50% linearly
2021-11-16  2.000000  6.000000  <- linear interpolation upto here   |  interpolated rest are
2021-11-17  2.000000  6.000000                                     |  filled.
2021-11-18  2.000000  6.000000  << # original value (upper bound)---
2021-11-19  2.333333  6.333333
2021-11-20  2.666667  6.666667
2021-11-21  3.000000  7.000000  <- linear interpolation upto here
2021-11-22  3.000000  7.000000
2021-11-23  3.000000  7.000000  << # original value (upper bound)

标签: pythonpandasdataframe

解决方案


我认为没有为此目的的熊猫功能。你必须创建你的:

def interpol_segment(df, r):
    nan_idx = df[pd.isna(df.a) | pd.isna(df.b)].index   # nan rows
    consecutive_nan = []        # a list of range (list)
    date_inter = [nan_idx[0]]   # range (list) of consecutive dates
    for i, date in enumerate(nan_idx[1:], start=1):
        if date - nan_idx[i-1] == pd.Timedelta(1, unit="day"):
            date_inter.append(date)
        else:
            consecutive_nan.append(date_inter)
            date_inter = [date]

    consecutive_nan.append(date_inter) # appending last interval
    # dates to be filled with upper bound:
    capped_date = [x[i] for x in consecutive_nan for i,_ in enumerate(x) if i >= int(r*len(x))]
    df_capped = df.loc[capped_date]
    # interpoling without the capped rows:
    df_inter = df.drop(capped_date).interpolate()
    # reassembling and filling with bfill:
    return pd.concat([df_inter,df_capped]).sort_index().bfill()


print(interpol_segment(df, 0.5))

输出:

                   a         b
date                          
2021-11-13  1.000000  4.000000
2021-11-14  1.333333  4.666667
2021-11-15  1.666667  5.333333
2021-11-16  2.000000  6.000000
2021-11-17  2.000000  6.000000
2021-11-18  2.000000  6.000000
2021-11-19  2.333333  6.333333
2021-11-20  2.666667  6.666667
2021-11-21  3.000000  7.000000
2021-11-22  3.000000  7.000000
2021-11-23  3.000000  7.000000

推荐阅读