
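Problem description

I am trying to write code that creates bins from a DataFrame (account_raw) that contains blank values. My problem is that Python classifies the blank values using my first bin label: 0 - 25k. What I want is a separate bin for the blank values. Any ideas on how to solve this? Thanks.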

Bucket = [0, 25000, 50000, 100000, 200000, 300000, 999999999999]
Label = ['0k to 25k', '25k - 50k', '50k - 100k', '100k - 200k', '200k - 300k', 'More than 300k']
account_raw['LoanGBVBuckets'] = pd.cut(account_raw['IfrsBalanceEUR'], bins=Bucket, labels=Label, include_lowest=True).astype(str)

Tags: python, pandas, nan

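Solution

I think the simplest approach is to post-process the values after pd.cut and set a custom category for the rows where the IfrsBalanceEUR column is missing: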

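# Bin the balances, then convert the categorical result to plain strings
account_raw['LoanGBVBuckets'] = pd.cut(account_raw['IfrsBalanceEUR'],
                                       bins=Bucket,
                                       labels=Label,
                                       include_lowest=True).astype(str)

# Overwrite the rows whose original balance is missing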

account_raw.loc[account_raw['IfrsBalanceEUR'].isna(), 'LoanGBVBuckets'] = 'missing values'
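
As a minimal, self-contained sketch of this first approach, the snippet below runs the same two steps on a small made-up DataFrame. The three sample balances and the helper names `buckets`, `is_missing`, and `result` are assumptions for illustration only. The mask is computed from the original numeric column, so the outcome does not depend on how `astype(str)` happens to render missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing balance and two valid ones
account_raw = pd.DataFrame({'IfrsBalanceEUR': [np.nan, 100, 100000]})

Bucket = [0, 25000, 50000, 100000, 200000, 300000, 999999999999]
Label = ['0k to 25k', '25k - 50k', '50k - 100k',
         '100k - 200k', '200k - 300k', 'More than 300k']

# Step 1: bin the balances and convert the categorical result to strings
buckets = pd.cut(account_raw['IfrsBalanceEUR'],
                 bins=Bucket,
                 labels=Label,
                 include_lowest=True).astype(str)

# Step 2: relabel the rows whose original balance is missing
is_missing = account_raw['IfrsBalanceEUR'].isna()
buckets[is_missing] = 'missing values'

account_raw['LoanGBVBuckets'] = buckets
result = account_raw['LoanGBVBuckets'].tolist()
print(result)
```

Because the column ends up holding plain strings, the bucket ordering carried by the categorical result of pd.cut is lost. That may or may not matter, depending on whether the buckets are later sorted or grouped.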

Edit:

Tested in pandas 0.25.0: for missing values, the output of pd.cut contains NaNs. To replace them, first add a new category with cat.add_categories, then call fillna:

import numpy as np
import pandas as pd

account_raw = pd.DataFrame({'IfrsBalanceEUR': [np.nan, 100, 100000]})

Bucket = [0, 25000, 50000, 100000, 200000, 300000, 999999999999]
Label = ['0k to 25k', '25k - 50k', '50k - 100k', 
         '100k - 200k', '200k - 300k', 'More than 300k']

account_raw['LoanGBVBuckets'] = pd.cut(account_raw['IfrsBalanceEUR'],
                                      bins=Bucket, 
                                      labels=Label, 
                                      include_lowest= True)
print (account_raw)
   IfrsBalanceEUR LoanGBVBuckets
0             NaN            NaN
1           100.0      0k to 25k
2        100000.0     50k - 100k

account_raw['LoanGBVBuckets'] = (account_raw['LoanGBVBuckets']
                                 .cat.add_categories('missing values')
                                 .fillna('missing values'))
print(account_raw)
   IfrsBalanceEUR  LoanGBVBuckets
0             NaN  missing values
1           100.0       0k to 25k
2        100000.0      50k - 100k
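
One reason to prefer this categorical variant is that the column keeps its category list, with the new label appended after the existing buckets. The sketch below checks that on a slightly larger made-up Series. The sample values and the names `balances`, `buckets`, `categories`, and `counts` are assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample balances, including two missing entries
balances = pd.Series([np.nan, 100, 100000, 30000, np.nan])

Bucket = [0, 25000, 50000, 100000, 200000, 300000, 999999999999]
Label = ['0k to 25k', '25k - 50k', '50k - 100k',
         '100k - 200k', '200k - 300k', 'More than 300k']

# Bin first, keeping the categorical dtype returned by pd.cut
buckets = pd.cut(balances, bins=Bucket, labels=Label, include_lowest=True)

# Register the extra label as a category, then fill the NaN rows with it
buckets = buckets.cat.add_categories('missing values').fillna('missing values')

# The new category is appended after the original bucket labels
categories = list(buckets.cat.categories)
print(categories)

# Count the rows per bucket
counts = buckets.value_counts()
print(counts)
```

Calling fillna with a label that is not yet a category fails on a categorical column, which is why add_categories comes first.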
