Filling missing values by probability based on another attribute

Problem description

I want to fill missing values using the probability distribution of known instances, conditioned on another attribute. Specifically:

Weather_Conditions         | Road_Surface | Date_Month
---------------------------|--------------|-----------
Fine without high winds    | NaN          | 9
Fine without high winds    | NaN          | 1
Raining without high winds | Wet/Damp     | 6
Fine without high winds    | Wet/Damp     | 1
Fine without high winds    | NaN          | 2
Fine without high winds    | NaN          | 1
Raining without high winds | Wet/Damp     | 7
Raining without high winds | Wet/Damp     | 1

If the month is January, all missing Road_Surface values should be filled in a 1:3 Frost:Wet ratio.

So far I have managed to create the array of values to fill in:

road_values_jan = np.random.choice(["Frost/Ice", "Wet/Damp"], random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].isnull().sum(), p=[0.25, 0.75])

# which outputs:
array(['Wet/Damp', 'Frost/Ice'], dtype='<U9')

The problem comes when I try to bind it to the original dataframe. I tried:

null_road = random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].isnull()

random_data.loc['null_road'] = np.random.choice(road_values_jan, road_values_jan.size)

from this thread, but it raises: ValueError: cannot set a row with mismatched columns

I also experimented with:

random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])] = random_data["Road_Surface_Conditions"][random_data['Date_Month'].isin(["01"])].fillna(pandas.Series(road_values_jan, index=random_data.index))

but this one gives me ValueError: Length of passed values is 2, index implies 8

How can I assign this two-value array to the NaN values, subject to the month condition?

Please find the data below in .csv format:

Weather_Conditions,Road_Surface_Conditions,Date_Month
Fine without high winds,NaN,9
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,6
Fine without high winds,Wet/Damp,1
Fine without high winds,NaN,2
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,7
Raining without high winds,Wet/Damp,1

Tags: python, pandas

Solution


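If I understand correctly, you can first create an array with a 25:75 distribution whose size equals the number of NaN values in Road_Surface_Conditions for month 1, then select those rows and fill them with the created array: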

m = (df['Road_Surface_Conditions'].isnull() & df['Date_Month'].eq(1)).sum()

s = np.random.choice(['Frost/Ice', 'Wet/Damp'],
                     p=[0.25, 0.75], 
                     size = m)
print(s)
['Wet/Damp' 'Frost/Ice']

df.loc[df['Road_Surface_Conditions'].isnull() & df['Date_Month'].eq(1), 
       'Road_Surface_Conditions'] = s

print(df)
           Weather_Conditions Road_Surface_Conditions  Date_Month
0     Fine without high winds                     NaN           9
1     Fine without high winds                Wet/Damp           1
2  Raining without high winds                Wet/Damp           6
3     Fine without high winds                Wet/Damp           1
4     Fine without high winds                     NaN           2
5     Fine without high winds               Frost/Ice           1
6  Raining without high winds                Wet/Damp           7
7  Raining without high winds                Wet/Damp           1
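
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. It loads the question's sample data from an in-memory string and builds the boolean mask only once. The names csv_data, mask, rng and fill_values, as well as the seed value 0, are assumptions introduced for illustration and do not appear in the original post. Because the values are sampled randomly, the exact fill will vary with the seed.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, loaded from an in-memory CSV string.
csv_data = """Weather_Conditions,Road_Surface_Conditions,Date_Month
Fine without high winds,NaN,9
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,6
Fine without high winds,Wet/Damp,1
Fine without high winds,NaN,2
Fine without high winds,NaN,1
Raining without high winds,Wet/Damp,7
Raining without high winds,Wet/Damp,1
"""
df = pd.read_csv(io.StringIO(csv_data))

# Build the mask once: surface is missing AND the month is January (1).
mask = df["Road_Surface_Conditions"].isna() & df["Date_Month"].eq(1)

# Seeded generator; the seed 0 is an arbitrary choice made only for reproducibility.
rng = np.random.default_rng(0)

# Draw one value per masked row with probabilities 0.25 / 0.75 (the 1:3 ratio).
fill_values = rng.choice(["Frost/Ice", "Wet/Damp"], size=mask.sum(), p=[0.25, 0.75])

# Assign the sampled values to exactly the masked rows of the target column.
df.loc[mask, "Road_Surface_Conditions"] = fill_values
print(df)
```

Reusing a single mask for both counting and assignment guarantees that the number of sampled values matches the number of rows being filled.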

Note that my dataframe is called df rather than random_data.
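
As for why the two attempts in the question fail, a likely reading is this: random_data.loc['null_road'] passes the string literal 'null_road' as a row label, so pandas tries to set a whole new row from two values, and pandas.Series(road_values_jan, index=random_data.index) pairs two values with an eight-label index. If you prefer to keep the fillna style, the sketch below indexes the replacement Series by only the rows being filled. The name replacements and the hard-coded road_values_jan array are stand-ins assumed for illustration.

```python
import numpy as np
import pandas as pd

# Reduced version of the question's data (only the two relevant columns).
df = pd.DataFrame({
    "Road_Surface_Conditions": [np.nan, np.nan, "Wet/Damp", "Wet/Damp",
                                np.nan, np.nan, "Wet/Damp", "Wet/Damp"],
    "Date_Month": [9, 1, 6, 1, 2, 1, 7, 1],
})

# Rows that are missing a surface value and fall in January.
mask = df["Road_Surface_Conditions"].isna() & df["Date_Month"].eq(1)

# Stand-in for the sampled array shown in the question (fixed here, not random).
road_values_jan = np.array(["Wet/Damp", "Frost/Ice"])

# Label the replacement values with the index of the rows to fill, so the
# Series has exactly as many labels as values and fillna can align on them.
replacements = pd.Series(road_values_jan, index=df.index[mask])

# fillna only touches NaN cells whose label appears in `replacements`;
# the NaN rows for months 9 and 2 are left unchanged.
df["Road_Surface_Conditions"] = df["Road_Surface_Conditions"].fillna(replacements)
print(df)
```

Both this variant and the df.loc assignment give the same result for the same values; the df.loc form is shorter, while the fillna form makes the index alignment explicit.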

