How can I use an annual growth rate to impute missing values in Python?

Question

I have a dataset in the following format:

    Country  Code  Year      Value
0   ABC      32    2000        NaN
1   ABC      32    2001        NaN
2   ABC      32    2002        NaN
3   ABC      32    2003        NaN
4   ABC      32    2004  1000000.0
5   ABC      32    2005        NaN
6   ABC      32    2006        NaN
7   ABC      32    2007        NaN
8   ABC      32    2008        NaN
9   ABC      32    2009        NaN

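I am trying to replace the NaN values so that they reflect an annual growth rate of r% around the non-NaN value. In other words, for the sample data, Value[i] should equal 1000000 * (1+r)^x, where x is the index i minus the index of the non-NaN value (negative for earlier rows, positive for later ones).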
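To make the formula concrete, here is a minimal sketch that evaluates it for the sample data. It assumes r = 0.05 and that the known value sits at index 4; the names `known_value`, `known_index`, and `growth_factor` are illustrative only and do not come from the question.

```python
# Illustrative values: r = 5%, known value 1,000,000 at index 4
r = 0.05
known_value = 1_000_000.0
known_index = 4

def growth_factor(i, known_index=known_index, r=r):
    """Return (1 + r) ** x, where x = i - known_index."""
    return (1 + r) ** (i - known_index)

# One imputed value per row of the ten-row sample
imputed = [known_value * growth_factor(i) for i in range(10)]

# Rows before the known value are discounted, rows after it are compounded
print(round(imputed[3]))  # one year before the known value
print(round(imputed[5]))  # one year after the known value
```

Because the exponent is negative for earlier rows, the same expression handles both back-casting and forecasting.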

For this small set, the following code does the job:

df['imputed'] = ''
gr = 0.05  # growth rate

nx = df.Value.first_valid_index()  # index of the first non-NaN value
nv = df.Value[nx]                  # first non-NaN value
for i in range(len(df)):
    # .loc avoids the chained assignment df['imputed'][i] = ...
    df.loc[i, 'imputed'] = nv * (1 + gr) ** (i - nx)
df


    Country   Code      Year    Value       imputed
0   ABC       32        2000    NaN         822702
1   ABC       32        2001    NaN         863838
2   ABC       32        2002    NaN         907029
3   ABC       32        2003    NaN         952381
4   ABC       32        2004    1000000.0   1e+06
5   ABC       32        2005    NaN         1.05e+06
6   ABC       32        2006    NaN         1.1025e+06
7   ABC       32        2007    NaN         1.15763e+06
8   ABC       32        2008    NaN         1.21551e+06
9   ABC       32        2009    NaN         1.27628e+06

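However, the real dataset contains many Country/Code combinations that need the same calculation (note: each of these combinations has exactly one non-NaN value, just as above).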

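If I create a new DataFrame (df2) holding all the Country/Code combinations that need it, how can I apply the calculation above to each matching combination in the main df? Note that there are also many combinations that do not need this calculation.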

df2
    Country Code
0   ABC     32
1   DEF     27
2   GHI     19

Tags: python, pandas, imputation

Answer


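You can process the full dataset one filtered sub-DataFrame at a time (filtered by country, or by whatever key applies), and then append or merge the results back together. I only outline the approach here. Feel free to use the code below and adapt it into a more optimized solution.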

Code:

import numpy as np
import pandas as pd

gr = 0.05  # annual growth rate, as in the question

# Sample data: three Country/Code groups with ten years each
df2 = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'GHI'], 10),
    'Code': np.repeat([32, 27, 19], 10),
    'Year': np.arange(2000, 2010).tolist() * 3,
    'Value': np.nan,
})
df2.loc[[4, 14, 24], 'Value'] = [1000000.0, 2000000.0, 3000000.0]

# If Value was read in as strings such as 'NaN', convert it first:
# df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')

df2['imputed'] = 0.0

def process(df):
    # Expects a 0..n-1 index, so that labels and positions coincide
    nx = df.Value.first_valid_index()  # index of the first non-NaN value
    nv = df.Value.loc[nx]              # first non-NaN value
    for i in range(len(df)):
        df.loc[i, 'imputed'] = nv * ((1 + gr) ** (i - nx))
    return df


frames = []
for c in df2.Country.unique():
    cond = (df2.Country == c)
    p_df = df2[cond].copy()
    p_df.reset_index(drop=True, inplace=True)
    frames.append(process(p_df))

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
new_df = pd.concat(frames, ignore_index=True)

print(new_df)

Output:

   Country Code  Year      Value       imputed
0      ABC   32  2000        NaN  8.227025e+05
1      ABC   32  2001        NaN  8.638376e+05
2      ABC   32  2002        NaN  9.070295e+05
3      ABC   32  2003        NaN  9.523810e+05
4      ABC   32  2004  1000000.0  1.000000e+06
5      ABC   32  2005        NaN  1.050000e+06
6      ABC   32  2006        NaN  1.102500e+06
7      ABC   32  2007        NaN  1.157625e+06
8      ABC   32  2008        NaN  1.215506e+06
9      ABC   32  2009        NaN  1.276282e+06
10     DEF   27  2000        NaN  1.645405e+06
11     DEF   27  2001        NaN  1.727675e+06
12     DEF   27  2002        NaN  1.814059e+06
13     DEF   27  2003        NaN  1.904762e+06
14     DEF   27  2004  2000000.0  2.000000e+06
15     DEF   27  2005        NaN  2.100000e+06
16     DEF   27  2006        NaN  2.205000e+06
17     DEF   27  2007        NaN  2.315250e+06
18     DEF   27  2008        NaN  2.431013e+06
19     DEF   27  2009        NaN  2.552563e+06
20     GHI   19  2000        NaN  2.468107e+06
21     GHI   19  2001        NaN  2.591513e+06
22     GHI   19  2002        NaN  2.721088e+06
23     GHI   19  2003        NaN  2.857143e+06
24     GHI   19  2004  3000000.0  3.000000e+06
25     GHI   19  2005        NaN  3.150000e+06
26     GHI   19  2006        NaN  3.307500e+06
27     GHI   19  2007        NaN  3.472875e+06
28     GHI   19  2008        NaN  3.646519e+06
29     GHI   19  2009        NaN  3.828845e+06
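
The per-country loop can also be written without explicit iteration. The following is a sketch of one possible alternative, not the method used above. It assumes, as the question states, that each processed Country/Code group has exactly one non-NaN Value, and it measures the exponent in years (from the Year column) rather than in row positions, which gives the same result when the years are consecutive. The names `impute_growth` and `combos` are hypothetical; `combos` plays the role of the question's list of combinations that need the calculation.

```python
import numpy as np
import pandas as pd

def impute_growth(df, combos, gr=0.05, keys=("Country", "Code")):
    """Fill 'imputed' for rows whose key combination appears in `combos`.

    Assumes each processed group has exactly one non-NaN Value.
    """
    keys = list(keys)
    out = df.copy()

    # Flag rows whose (Country, Code) pair is listed in combos
    flagged = out[keys].merge(combos[keys].drop_duplicates(), on=keys,
                              how="left", indicator=True)
    wanted = (flagged["_merge"] == "both").to_numpy()

    # Broadcast each group's known value and its year to every row of the group
    base_value = out.groupby(keys)["Value"].transform("first")
    known_year = out["Year"].where(out["Value"].notna())
    base_year = known_year.groupby([out[k] for k in keys]).transform("first")

    imputed = base_value * (1 + gr) ** (out["Year"] - base_year)
    out["imputed"] = imputed.where(wanted)  # NaN for combinations not listed
    return out

# Usage with made-up data: XYZ/99 is deliberately left out of combos
df = pd.DataFrame({
    "Country": np.repeat(["ABC", "DEF", "XYZ"], 10),
    "Code": np.repeat([32, 27, 99], 10),
    "Year": np.tile(np.arange(2000, 2010), 3),
    "Value": np.nan,
})
df.loc[[4, 14, 24], "Value"] = [1000000.0, 2000000.0, 500000.0]

combos = pd.DataFrame({"Country": ["ABC", "DEF"], "Code": [32, 27]})
result = impute_growth(df, combos)
print(result)
```

Rows belonging to combinations that are not listed in `combos` keep NaN in the imputed column, which covers the case where many combinations do not need the calculation.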
