首页 > 解决方案 > 为什么在 Pandas 数据框中使用 Z 分数进行归一化会生成 NaN 列?

问题描述

我使用 Z-score fromscipy将我的数据集标准化如下:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore

df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers

print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())

但是,我将某些列更改为,NaN特别是如下所示,但在应用 Z 分数标准化之前它们是数字的。这是我的代码中的错误还是 Z 分数可以生成?mta_taxtrip_typeNaN

标准化之前:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean   1.794324e+00  1.000000e+00  1.106734e+02  1.285285e+02   
std    4.041947e-01  4.353414e-04  7.541486e+01  7.729142e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
25%    2.000000e+00  1.000000e+00  4.900000e+01  6.100000e+01   
50%    2.000000e+00  1.000000e+00  8.200000e+01  1.290000e+02   
75%    2.000000e+00  1.000000e+00  1.660000e+02  1.930000e+02   
max    2.000000e+00  2.000000e+00  2.650000e+02  2.650000e+02   

       passenger_count  trip_distance   fare_amount         extra     mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07  10552857.0   
mean      1.140647e+00   2.399851e+00  1.082419e+01  3.607218e-01         0.5   
std       4.436568e-01   2.014673e+00  6.464638e+00  3.797668e-01         0.0   
min       0.000000e+00   0.000000e+00  0.000000e+00 -6.700000e-01         0.5   
25%       1.000000e+00   1.000000e+00  6.000000e+00  0.000000e+00         0.5   
50%       1.000000e+00   1.700000e+00  9.000000e+00  5.000000e-01         0.5   
75%       1.000000e+00   3.120000e+00  1.350000e+01  5.000000e-01         0.5   
max       4.000000e+00   1.117000e+01  4.100000e+01  1.000000e+00         0.5   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07           1.055286e+07  1.055286e+07   
mean   1.028691e+00  5.512108e-02           3.000000e-01  1.312541e+01   
std    1.510206e+00  5.524008e-01           4.110357e-11  7.370554e+00   
min    0.000000e+00  0.000000e+00           3.000000e-01  0.000000e+00   
25%    0.000000e+00  0.000000e+00           3.000000e-01  7.800000e+00   
50%    0.000000e+00  0.000000e+00           3.000000e-01  1.080000e+01   
75%    1.860000e+00  0.000000e+00           3.000000e-01  1.630000e+01   
max    7.660000e+00  8.000000e+00           3.000000e-01  4.877000e+01   

       payment_type   trip_type  
count  1.055286e+07  10552857.0  
mean   1.501672e+00         1.0  
std    5.061254e-01         0.0  
min    1.000000e+00         1.0  
25%    1.000000e+00         1.0  
50%    1.000000e+00         1.0  
75%    2.000000e+00         1.0  
max    3.000000e+00         1.0 

标准化后:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean  -1.235870e-12  1.006184e-13 -3.819625e-14 -1.004818e-14   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00   
25%    5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01   
50%    5.088537e-01 -4.353414e-04 -3.802090e-01  6.100220e-03   
75%    5.088537e-01 -4.353414e-04  7.336298e-01  8.341353e-01   
max    5.088537e-01  2.297048e+03  2.046369e+00  1.765675e+00   

       passenger_count  trip_distance   fare_amount         extra  mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07      0.0   
mean     -3.942620e-14  -3.206434e-14 -4.744100e-13 -1.042732e-12      NaN   
std       1.000000e+00   1.000000e+00  1.000000e+00  1.000000e+00      NaN   
min      -2.571013e+00  -1.191187e+00 -1.674370e+00 -2.714092e+00      NaN   
25%      -3.170185e-01  -6.948283e-01 -7.462435e-01 -9.498508e-01      NaN   
50%      -3.170185e-01  -3.473773e-01 -2.821804e-01  3.667467e-01      NaN   
75%      -3.170185e-01   3.574519e-01  4.139144e-01  3.667467e-01      NaN   
max       6.444966e+00   4.353138e+00  4.667827e+00  1.683344e+00      NaN   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07             10552857.0  1.055286e+07   
mean   3.152945e-13 -2.877092e-14                   -1.0  2.081611e-14   
std    1.000000e+00  1.000000e+00                    0.0  1.000000e+00   
min   -6.811593e-01 -9.978459e-02                   -1.0 -1.780791e+00   
25%   -6.811593e-01 -9.978459e-02                   -1.0 -7.225258e-01   
50%   -6.811593e-01 -9.978459e-02                   -1.0 -3.155007e-01   
75%    5.504607e-01 -9.978459e-02                   -1.0  4.307119e-01   
max    4.390996e+00  1.438246e+01                   -1.0  4.836080e+00   

       payment_type  trip_type  
count  1.055286e+07        0.0  
mean   1.387184e-12        NaN  
std    1.000000e+00        NaN  
min   -9.912012e-01        NaN  
25%   -9.912012e-01        NaN  
50%   -9.912012e-01        NaN  
75%    9.845937e-01        NaN  
max    2.960389e+00        NaN 

谢谢

标签: pythonpython-3.xpandasstatistics

解决方案


要跟进评论,请查看以下内容:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })

请注意“b”和“d”是如何保持不变的,这意味着标准偏差为 0。应用zscore- 函数意味着减去平均值并除以标准误差。如果将数字除以 0,则结​​果未定义,Pandas 将显示 NaN。

df.apply(zscore)
Out[8]: 
          a   b         c   d
0 -1.224745 NaN -1.224745 NaN
1  0.000000 NaN  0.000000 NaN
2  1.224745 NaN  1.224745 NaN

您要么需要将zscore函数应用于选定的列,要么更改函数以使其省略常量列。要将函数仅应用于选定的列,您可以执行以下操作:

df[['a','c']] = df[['a','c']].apply(zscore)

df
Out[9]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

为了让函数检查列是否为常数,让我们使用一个lambda函数,如果列的标准差为 0,则返回未更改的列,否则返回标准化列。

df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

推荐阅读