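pandas - Mode imputation by group in pandas (handling groups whose mode is NaN)
Problem description
I have a categorical column "WALLSMATERIAL_MODE" that contains NaN values, and I want to impute it with the mode of the groups defined by ['NAME_EDUCATION_TYPE', 'AGE_GROUP']: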
    NAME_EDUCATION_TYPE            AGE_GROUP  WALLSMATERIAL_MODE
20  Secondary / secondary special  45-60      Stone, brick
21  Secondary / secondary special  21-45      NaN
22  Secondary / secondary special  21-45      Panel
23  Secondary / secondary special  60-70      Mixed
24  Secondary / secondary special  21-45      Panel
25  Secondary / secondary special  45-60      Stone, brick
26  Secondary / secondary special  45-60      Wooden
27  Secondary / secondary special  21-45      NaN
28  Higher education               21-45      NaN
29  Higher education               21-45      Panel
Reproducible code
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME_EDUCATION_TYPE': {20: 'Secondary / secondary special',
21: 'Secondary / secondary special',
22: 'Secondary / secondary special',
23: 'Secondary / secondary special',
24: 'Secondary / secondary special',
25: 'Secondary / secondary special',
26: 'Secondary / secondary special',
27: 'Secondary / secondary special',
28: 'Higher education',
29: 'Higher education'},
'AGE_GROUP': {20: '45-60',
21: '21-45',
22: '21-45',
23: '60-70',
24: '21-45',
25: '45-60',
26: '45-60',
27: '21-45',
28: '21-45',
29: '21-45'},
'WALLSMATERIAL_MODE': {20: 'Stone, brick',
21: np.nan,
22: 'Panel',
23: 'Mixed',
24: 'Panel',
25: 'Stone, brick',
26: 'Wooden',
27: np.nan,
28: np.nan,
29: 'Panel'}})
I tried to adapt the following function from this post, which works for median imputation and handles groups whose median is NaN.
In:
def mode(s):
    if pd.isnull(s.mode()):
        return df['WALLSMATERIAL_MODE'].mode()
    return s.mode()

df['WALLSMATERIAL_MODE'] = df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']], dropna=False).apply(lambda x: x.fillna(mode(x)))
Out: the following error is raised when pd.isnull is called
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I don't understand this: when I apply pd.isnull to all of the group modes, it does not raise this error. See the group modes below.
In:
df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']]).agg(pd.Series.mode).to_dict()
Out:
{('Higher education', '60-70'): nan,
('Higher education', '45-60'): nan,
('Higher education', '21-45'): 'Panel',
('Higher education', '0-21'): nan,
('Secondary / secondary special', '60-70'): 'Mixed',
('Secondary / secondary special', '45-60'): 'Stone, brick',
('Secondary / secondary special', '21-45'): 'Panel',
('Secondary / secondary special', '0-21'): nan}
I would be grateful if someone could point out where the error is, or whether there is an efficient way to impute this column by group!
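The error message quoted above can be reproduced in isolation. The sketch below uses a small made-up Series (not the original data) and suggests that pd.isnull itself does not fail: Series.mode() returns a Series, pd.isnull returns a boolean Series of the same length, and it is the `if` statement's truth test on that Series that raises. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one group of the column.
s = pd.Series(["Panel", np.nan, "Panel"])

modes = s.mode()            # a Series, even when there is only one mode
is_null = pd.isnull(modes)  # element-wise check: a boolean Series, no error yet

try:
    if is_null:             # truth-testing a Series is what raises
        pass
    raised = False
except ValueError:
    raised = True

# Reducing to a single bool first makes the condition well defined.
no_usable_mode = modes.empty or bool(is_null.all())
```

This would also explain why calling pd.isnull on the aggregated group modes works: there it is applied to values without being used as an `if` condition.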
Solution
The code below seems to solve the problem by using try/except. I would rather avoid try/except, but I cannot think of a cleaner way.
def mode_cats(s):
    try:
        if pd.isnull(s.mode().any()):  # check if the mode of the subgroup is NaN or contains NaN
                                       # (mode() may indeed return a list of several modes)
            m = app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]  # returns the mode of the column
        else:
            m = s.mode().iloc[0]  # returns the mode of the subgroup
        return m
    except IndexError:  # mode returns an empty series if the subgroup consists of a single NaN value
                        # this causes s.mode().iloc[0] to raise an index error
        return app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]
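One possible way to avoid the try/except is to test whether mode() returned anything before indexing into it. The following is a minimal sketch under that assumption, not the author's method: the helper names fill_with_group_mode and first_mode and the small demo frame are hypothetical, and when several modes tie it simply takes the first one.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, column, keys):
    """Return `column` with NaN replaced by the group mode, else the column mode."""
    overall = df[column].mode()
    fallback = overall.iloc[0] if not overall.empty else np.nan

    def first_mode(s):
        m = s.mode()  # NaN is dropped by default, so m is empty for all-NaN groups
        return m.iloc[0] if not m.empty else fallback

    # transform broadcasts the scalar returned for each group back to its rows
    group_mode = df.groupby(keys)[column].transform(first_mode)
    return df[column].fillna(group_mode)

# Hypothetical demo data, including one group that contains only NaN.
demo = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Secondary", "Secondary", "Secondary", "Higher", "Higher", "Higher"],
    "AGE_GROUP": ["21-45", "21-45", "21-45", "21-45", "60-70", "21-45"],
    "WALLSMATERIAL_MODE": ["Panel", np.nan, "Panel", "Wooden", np.nan, np.nan],
})
filled = fill_with_group_mode(demo, "WALLSMATERIAL_MODE", ["NAME_EDUCATION_TYPE", "AGE_GROUP"])
```

In the demo, the ('Higher', '60-70') group has no observed value, so its row falls back to the mode of the whole column. Rows whose grouping keys are themselves missing would be left unfilled with the default groupby settings.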
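As @Ben.T pointed out, I had to use .iloc[0] with .mode(). However, I got IndexError: single positional indexer is out-of-bounds when .mode().iloc[0] was given an empty array as input. The traceback of the error:
- mode() is called on a subgroup consisting of a single row whose value is NaN. For this single-NaN subgroup, .mode() returns an empty array.
- pd.isnull is called on that empty array and returns an empty array.
- Calling .iloc[0] on the empty array raises the IndexError.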
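The three steps above can be checked on a one-element Series. This is a small illustration with made-up variable names, assuming the default dropna=True behaviour of Series.mode():

```python
import numpy as np
import pandas as pd

# A subgroup made of a single NaN value.
only_nan = pd.Series([np.nan], dtype=object)

m = only_nan.mode()        # NaN is dropped by default, leaving an empty Series
null_mask = pd.isnull(m)   # element-wise on an empty Series, so also empty

try:
    m.iloc[0]              # nothing at position 0
    index_error = False
except IndexError:
    index_error = True
```

Checking m.empty before indexing is one way to sidestep the IndexError without catching it.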