python - 为什么在 Pandas 数据框中使用 Z 分数进行归一化会生成 NaN 列?
问题描述
我使用 Z-score fromscipy
将我的数据集标准化如下:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore
df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers
print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())
但是,我将某些列更改为,NaN
特别是如下所示,但在应用 Z 分数标准化之前它们是数字的。这是我的代码中的错误还是 Z 分数可以生成?mta_tax
trip_type
NaN
标准化之前:
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.794324e+00 1.000000e+00 1.106734e+02 1.285285e+02
std 4.041947e-01 4.353414e-04 7.541486e+01 7.729142e+01
min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 4.900000e+01 6.100000e+01
50% 2.000000e+00 1.000000e+00 8.200000e+01 1.290000e+02
75% 2.000000e+00 1.000000e+00 1.660000e+02 1.930000e+02
max 2.000000e+00 2.000000e+00 2.650000e+02 2.650000e+02
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 10552857.0
mean 1.140647e+00 2.399851e+00 1.082419e+01 3.607218e-01 0.5
std 4.436568e-01 2.014673e+00 6.464638e+00 3.797668e-01 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 -6.700000e-01 0.5
25% 1.000000e+00 1.000000e+00 6.000000e+00 0.000000e+00 0.5
50% 1.000000e+00 1.700000e+00 9.000000e+00 5.000000e-01 0.5
75% 1.000000e+00 3.120000e+00 1.350000e+01 5.000000e-01 0.5
max 4.000000e+00 1.117000e+01 4.100000e+01 1.000000e+00 0.5
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.028691e+00 5.512108e-02 3.000000e-01 1.312541e+01
std 1.510206e+00 5.524008e-01 4.110357e-11 7.370554e+00
min 0.000000e+00 0.000000e+00 3.000000e-01 0.000000e+00
25% 0.000000e+00 0.000000e+00 3.000000e-01 7.800000e+00
50% 0.000000e+00 0.000000e+00 3.000000e-01 1.080000e+01
75% 1.860000e+00 0.000000e+00 3.000000e-01 1.630000e+01
max 7.660000e+00 8.000000e+00 3.000000e-01 4.877000e+01
payment_type trip_type
count 1.055286e+07 10552857.0
mean 1.501672e+00 1.0
std 5.061254e-01 0.0
min 1.000000e+00 1.0
25% 1.000000e+00 1.0
50% 1.000000e+00 1.0
75% 2.000000e+00 1.0
max 3.000000e+00 1.0
标准化后:
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean -1.235870e-12 1.006184e-13 -3.819625e-14 -1.004818e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00
25% 5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01
50% 5.088537e-01 -4.353414e-04 -3.802090e-01 6.100220e-03
75% 5.088537e-01 -4.353414e-04 7.336298e-01 8.341353e-01
max 5.088537e-01 2.297048e+03 2.046369e+00 1.765675e+00
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 0.0
mean -3.942620e-14 -3.206434e-14 -4.744100e-13 -1.042732e-12 NaN
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 NaN
min -2.571013e+00 -1.191187e+00 -1.674370e+00 -2.714092e+00 NaN
25% -3.170185e-01 -6.948283e-01 -7.462435e-01 -9.498508e-01 NaN
50% -3.170185e-01 -3.473773e-01 -2.821804e-01 3.667467e-01 NaN
75% -3.170185e-01 3.574519e-01 4.139144e-01 3.667467e-01 NaN
max 6.444966e+00 4.353138e+00 4.667827e+00 1.683344e+00 NaN
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 10552857.0 1.055286e+07
mean 3.152945e-13 -2.877092e-14 -1.0 2.081611e-14
std 1.000000e+00 1.000000e+00 0.0 1.000000e+00
min -6.811593e-01 -9.978459e-02 -1.0 -1.780791e+00
25% -6.811593e-01 -9.978459e-02 -1.0 -7.225258e-01
50% -6.811593e-01 -9.978459e-02 -1.0 -3.155007e-01
75% 5.504607e-01 -9.978459e-02 -1.0 4.307119e-01
max 4.390996e+00 1.438246e+01 -1.0 4.836080e+00
payment_type trip_type
count 1.055286e+07 0.0
mean 1.387184e-12 NaN
std 1.000000e+00 NaN
min -9.912012e-01 NaN
25% -9.912012e-01 NaN
50% -9.912012e-01 NaN
75% 9.845937e-01 NaN
max 2.960389e+00 NaN
谢谢
解决方案
要跟进评论,请查看以下内容:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })
请注意“b”和“d”是如何保持不变的,这意味着标准偏差为 0。应用zscore
- 函数意味着减去平均值并除以标准误差。如果将数字除以 0,则结果未定义,Pandas 将显示 NaN。
df.apply(zscore)
Out[8]:
a b c d
0 -1.224745 NaN -1.224745 NaN
1 0.000000 NaN 0.000000 NaN
2 1.224745 NaN 1.224745 NaN
您要么需要将zscore
函数应用于选定的列,要么更改函数以使其省略常量列。要将函数仅应用于选定的列,您可以执行以下操作:
df[['a','c']] = df[['a','c']].apply(zscore)
df
Out[9]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8
为了让函数检查列是否为常数,让我们使用一个lambda
函数,如果列的标准差为 0,则返回未更改的列,否则返回标准化列。
df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8
推荐阅读
- azure-sql-database - 谁创建了这个扩展活动会话?
- css - 带有计数器标识符“list-item”的奇怪浏览器行为
- android - 使用 URI 的 Android Studio 深度链接
- snakemake - Snakemake 规则中的递归输入调用
- arrays - 如何使用 np.where 查找数组中某个元素的索引值?
- python - 我收到 NameError: name 'data' is not defined from selenium webscrape,但我已经在代码中定义了它,任何人都可以看到有什么问题吗?
- database - 使用 VBA 从前端访问表单重新链接后端 mdb
- php - laravel 检查用户是否创建了个人资料
- dart - 在 Flutter 中插入生成的列表
- amazon-web-services - 超过 AWS CloudFormation 速率