Group-wise mode imputation in pandas (handling groups whose mode is NaN)

Question

I have a categorical column "WALLSMATERIAL_MODE" containing NaN values, which I want to impute using the mode of the groups defined by ['NAME_EDUCATION_TYPE', 'AGE_GROUP']:

    NAME_EDUCATION_TYPE             AGE_GROUP   WALLSMATERIAL_MODE
20  Secondary / secondary special   45-60       Stone, brick
21  Secondary / secondary special   21-45       NaN
22  Secondary / secondary special   21-45       Panel
23  Secondary / secondary special   60-70       Mixed
24  Secondary / secondary special   21-45       Panel
25  Secondary / secondary special   45-60       Stone, brick
26  Secondary / secondary special   45-60       Wooden
27  Secondary / secondary special   21-45       NaN
28  Higher education                21-45       NaN
29  Higher education                21-45       Panel

Reproducible code

import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME_EDUCATION_TYPE': {20: 'Secondary / secondary special',
  21: 'Secondary / secondary special',
  22: 'Secondary / secondary special',
  23: 'Secondary / secondary special',
  24: 'Secondary / secondary special',
  25: 'Secondary / secondary special',
  26: 'Secondary / secondary special',
  27: 'Secondary / secondary special',
  28: 'Higher education',
  29: 'Higher education'},
 'AGE_GROUP': {20: '45-60',
  21: '21-45',
  22: '21-45',
  23: '60-70',
  24: '21-45',
  25: '45-60',
  26: '45-60',
  27: '21-45',
  28: '21-45',
  29: '21-45'},
 'WALLSMATERIAL_MODE': {20: 'Stone, brick',
  21: np.nan,
  22: 'Panel',
  23: 'Mixed',
  24: 'Panel',
  25: 'Stone, brick',
  26: 'Wooden',
  27: np.nan,
  28: np.nan,
  29: 'Panel'}})

I tried to adapt the following function from this post, which works for median imputation and handles groups whose median is NaN.

In:

def mode(s):
    if pd.isnull(s.mode()):
        return df['WALLSMATERIAL_MODE'].mode()
    return s.mode()
        
df['WALLSMATERIAL_MODE'] = df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']], dropna=False).apply(lambda x: x.fillna(mode(x)))

Out: the call to pd.isnull raises the following error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I don't understand this: when I apply pd.isnull to all of the group modes, it does not raise the error. See the group modes below.

In:

df['WALLSMATERIAL_MODE'].groupby([df['NAME_EDUCATION_TYPE'], df['AGE_GROUP']]).agg(pd.Series.mode).to_dict()

Out:

{('Higher education', '60-70'): nan,
 ('Higher education', '45-60'): nan,
 ('Higher education', '21-45'): 'Panel',
 ('Higher education', '0-21'): nan,
 ('Secondary / secondary special', '60-70'): 'Mixed',
 ('Secondary / secondary special', '45-60'): 'Stone, brick',
 ('Secondary / secondary special', '21-45'): 'Panel',
 ('Secondary / secondary special', '0-21'): nan}

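If anyone can tell me where the error is, or whether there is an efficient way to impute this column by group, I would greatly appreciate it!

Tags: pandas, pandas-groupby, nan, categorical-data, imputation

Answer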


The code below seems to solve the problem by using try/except. I would rather avoid try/except, but I cannot think of a cleaner way.

def mode_cats(s):
    # app_train_dash is presumably the full DataFrame (df in the question's sample)
    try:
        if pd.isnull(s.mode()).any():  # check if the mode of the subgroup is NaN or contains NaN
                                       # (mode() may indeed return several modes)
            m = app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]  # returns the mode of the column
        else:
            m = s.mode().iloc[0]  # returns the mode of the subgroup
        return m
    except IndexError:  # mode() returns an empty Series if the subgroup consists of a single NaN value;
                        # this causes s.mode().iloc[0] to raise an index error
        return app_train_dash['WALLSMATERIAL_MODE'].mode().iloc[0]
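For comparison, here is a minimal sketch of one way to get the same fallback without try/except, by checking whether the subgroup's mode is empty before indexing it. It is not the answerer's code. The small DataFrame, the helper name fill_with_group_mode and the variable global_mode are assumptions made for illustration, and the sketch assumes the column is not entirely NaN.

```python
import numpy as np
import pandas as pd

# Small illustrative sample shaped like the question's data (values are made up).
df = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Secondary", "Secondary", "Secondary", "Higher", "Higher"],
    "AGE_GROUP": ["21-45", "21-45", "21-45", "21-45", "60-70"],
    "WALLSMATERIAL_MODE": ["Panel", np.nan, "Panel", "Mixed", np.nan],
})

col = "WALLSMATERIAL_MODE"
# Column-wide fallback; assumes the column has at least one non-NaN value.
global_mode = df[col].mode().iloc[0]


def fill_with_group_mode(s):
    """Fill NaN in s with the group's mode, or the column mode if the group has none."""
    modes = s.mode()  # NaN is dropped by default, so this is empty for an all-NaN group
    fill_value = global_mode if modes.empty else modes.iloc[0]
    return s.fillna(fill_value)


# transform keeps the original index, so the result can be assigned straight back.
df[col] = df.groupby(["NAME_EDUCATION_TYPE", "AGE_GROUP"])[col].transform(fill_with_group_mode)
print(df)
```

When a group has several modes, this sketch takes the first one returned by mode(), just as .iloc[0] does in the function above.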

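As @Ben.T pointed out, I had to use .iloc[0] with .mode(). However, I then got "IndexError: single positional indexer is out-of-bounds" whenever .mode().iloc[0] was given an empty array as input. How the error arises:

  1. mode() is called on a subgroup consisting of a single row whose value is NaN. For this single-NaN subgroup, .mode() returns an empty array.
  2. pd.isnull is called on the empty array that was passed in and also returns an empty array.
  3. Calling .iloc[0] on the empty array raises the index error.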
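The three steps above, together with the "truth value of a Series is ambiguous" error from the question, can be reproduced in isolation. The following is a minimal sketch, and the variable names are illustrative only. The ValueError appears because pd.isnull applied to a Series returns a Series, and pandas refuses to convert a Series to a single boolean inside an if statement.

```python
import numpy as np
import pandas as pd

single_nan = pd.Series([np.nan], dtype=object)  # a subgroup made of one NaN row

# Step 1: mode() drops NaN by default, so the result is empty.
modes = single_nan.mode()
print(len(modes))

# Step 2: pd.isnull on an empty Series is itself an empty Series.
print(len(pd.isnull(modes)))

# Step 3: positional access on the empty result raises IndexError.
try:
    modes.iloc[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True
print(raised_index_error)

# The question's original error: pd.isnull(s.mode()) is a Series, and using
# a Series as an `if` condition raises ValueError.
try:
    if pd.isnull(pd.Series(["Panel", "Panel", np.nan]).mode()):
        pass
    raised_value_error = False
except ValueError:
    raised_value_error = True
print(raised_value_error)
```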
