python - Pandas df.get_dummies() returns "ValueError: could not convert string to float"
Problem description
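I am trying to one-hot encode several categorical columns with Pandas' df.get_dummies(), and it returns an error I don't understand. The error says ValueError: could not convert string to float: 'Warm Cool'. What could be causing this, and how can I successfully one-hot encode all columns with dtype == object?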
My dataset comes from the DC_Properties.CSV file found here.
My code and error message:
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Import packages section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Read data section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df = pd.read_csv('DC_Properties.csv', index_col=0)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Preprocess data section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# remove rows without sales prices
df = df[df.PRICE.notnull()]
# create month sold column
df['MONTHSOLD'] = [i[:i.find('/')] if type(i) == str else i for i in df.SALEDATE]
# create year sold column
df['YEARSOLD'] = [i[-4:] if type(i) == str else i for i in df.SALEDATE]
# join GBA and Living GBA
df['GBA'] = df['GBA'].fillna(df['LIVING_GBA'])
# remove unused columns
unused_cols = ['SALEDATE',
'GIS_LAST_MOD_DTTM',
'CMPLX_NUM',
'LIVING_GBA',
'FULLADDRESS',
'CITY',
'STATE',
'NATIONALGRID',
'ASSESSMENT_SUBNBHD',
'CENSUS_TRACT',
'CENSUS_BLOCK',
'X',
'Y']
df = df.drop(unused_cols, axis=1)
# one-hot encode categorical variables
pd.get_dummies(df, dummy_na=True)
# standardize the data
scaler = StandardScaler()
dataset = scaler.fit_transform(df)
# specify x and y variables
x = dataset[:,-y_idx]
y = dataset[:,'PRICE']
# split data into a train and test set
np.random.seed(123)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-81-62c3931b3dfa> in <module>
33 # standardize the data
34 scaler = StandardScaler()
---> 35 dataset = scaler.fit_transform(df)
36
37 # specify x and y variables
~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
551 if y is None:
552 # fit method of arity 1 (unsupervised transformation)
--> 553 return self.fit(X, **fit_params).transform(X)
554 else:
555 # fit method of arity 2 (supervised transformation)
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit(self, X, y)
637 # Reset internal state before fitting
638 self._reset()
--> 639 return self.partial_fit(X, y)
640
641 def partial_fit(self, X, y=None):
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in partial_fit(self, X, y)
661 X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
662 estimator=self, dtype=FLOAT_DTYPES,
--> 663 force_all_finite='allow-nan')
664
665 # Even in the case of `with_mean=False`, we update the mean anyway
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
494 try:
495 warnings.simplefilter('error', ComplexWarning)
--> 496 array = np.asarray(array, dtype=dtype, order=order)
497 except ComplexWarning:
498 raise ValueError("Complex data not supported\n"
~\Anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Warm Cool'
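Solution
The error is actually raised by StandardScaler, because it encounters strings.
The reason is that you call pd.get_dummies but never assign the DataFrame it returns, so df still contains the original string columns when it reaches the scaler.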
# one-hot encode categorical variables
pd.get_dummies(df, dummy_na=True) # <------ is lost
To fix it, change it to:
# one-hot encode categorical variables
df = pd.get_dummies(df, dummy_na=True)
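The difference can be reproduced on a small scale. The sketch below is only an illustration: the two-column DataFrame and its values are made up rather than taken from DC_Properties.csv, and a plain float conversion stands in for the np.asarray call that appears in the traceback, so scikit-learn is not needed to run it.

```python
import pandas as pd

# Hypothetical toy data standing in for DC_Properties.csv (values are invented)
df = pd.DataFrame({
    "PRICE": [250000.0, 410000.0, 325000.0],
    "HEAT": ["Warm Cool", "Forced Air", None],
})

# get_dummies returns a new DataFrame; without assignment df is unchanged
pd.get_dummies(df, dummy_na=True)
still_has_strings = "HEAT" in df.columns

# Converting the unchanged frame to float fails, much like inside StandardScaler
try:
    df.to_numpy(dtype=float)
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True

# Keeping the result gives an all-numeric frame that converts cleanly
encoded = pd.get_dummies(df, dummy_na=True)
as_float = encoded.to_numpy(dtype=float)

print(still_has_strings, conversion_failed)
print(list(encoded.columns))
```

Only the string column is expanded into indicator columns; the numeric PRICE column passes through unchanged, so the encoded frame can then be handed to the scaler.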