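python - Scikit-Learn - One-hot encoding certain columns of a pandas dataframe
Problem description
I have a dataframe X containing integer, float, and string columns. I want to one-hot encode every column of "object" type, so I am trying this: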
encoding_needed = X.select_dtypes(include='object').columns
ohe = preprocessing.OneHotEncoder()
X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str)) #need astype bc I imputed with 0, so some rows have a mix of zeroes and strings.
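However, I end up with IndexError: tuple index out of range. I don't understand this, because according to the documentation the encoder expects X: array-like, shape [n_samples, n_features], so I should be able to pass a dataframe. How can I one-hot encode just the list of columns marked in encoding_needed?
Edit:
The data is confidential, so I can't share it, and I can't easily build a dummy dataset because it has 123 columns.
I can provide the following: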
X.shape: (40755, 123)
encoding_needed.shape: (81,) and is a subset of columns.
Full stack trace:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-90-6b3e9fdb6f91> in <module>()
1 encoding_needed = X.select_dtypes(include='object').columns
2 ohe = preprocessing.OneHotEncoder()
----> 3 X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str))
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3365 self._setitem_frame(key, value)
3366 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367 self._setitem_array(key, value)
3368 else:
3369 # set column
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3393 indexer = self.loc._convert_to_indexer(key, axis=1)
3394 self._check_setitem_copy()
-> 3395 self.loc._setitem_with_indexer((slice(None), indexer), value)
3396
3397 def _setitem_frame(self, key, value):
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
592 # GH 7551
593 value = np.array(value, dtype=object)
--> 594 if len(labels) != value.shape[1]:
595 raise ValueError('Must have equal len keys and value '
596 'when setting with an ndarray')
IndexError: tuple index out of range
Solution
# example data
import pandas as pd
X = pd.DataFrame({'int':[0,1,2,3],
                  'float':[4.0, 5.0, 6.0, 7.0],
                  'string1':list('abcd'),
                  'string2':list('efgh')})
int float string1 string2
0 0 4.0 a e
1 1 5.0 b f
2 2 6.0 c g
3 3 7.0 d h
Using pandas
Use pandas.get_dummies, which automatically selects your object columns and drops them, appending the one-hot encoded columns in their place:
pd.get_dummies(X)
int float string1_a string1_b string1_c string1_d string2_e \
0 0 4.0 1 0 0 0 1
1 1 5.0 0 1 0 0 0
2 2 6.0 0 0 1 0 0
3 3 7.0 0 0 0 1 0
string2_f string2_g string2_h
0 0 0 0
1 1 0 0
2 0 1 0
3 0 0 1
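If only a specific list of columns should be encoded, as with encoding_needed in the question, get_dummies also accepts a columns argument. The following is a minimal sketch on the example data, not a tested recipe for the confidential dataset. The explicit column list and the dtype=int argument are assumptions made for illustration; dtype=int is passed because newer pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Same example data as above
X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': list('abcd'),
                  'string2': list('efgh')})

# Hypothetical explicit list standing in for encoding_needed
encoding_needed = ['string1', 'string2']

# Only the listed columns are replaced by indicator columns;
# the remaining columns are kept unchanged and come first
encoded = pd.get_dummies(X, columns=encoding_needed, dtype=int)
print(encoded.columns.tolist())
```

Columns that mix imputed zeros with strings, as described in the question, could be cast with astype(str) before this call so that each column holds a single type.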
Using sklearn
Here, we have to specify that we only want the object columns:
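import pandas as pd
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
# transform returns a sparse matrix by default, so convert it to a dense array
codes = ohe.transform(X_object).toarray()
# scikit-learn >= 1.0; older releases used ohe.get_feature_names instead
feature_names = ohe.get_feature_names_out(['string1', 'string2'])
# put the untouched non-object columns next to the encoded ones
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes,columns=feature_names).astype(int)], axis=1)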
int float string1_a string1_b string1_c string1_d string2_e \
0 0 4.0 1 0 0 0 1
1 1 5.0 0 1 0 0 0
2 2 6.0 0 0 1 0 0
3 3 7.0 0 0 0 1 0
string2_f string2_g string2_h
0 0 0 0
1 1 0 0
2 0 1 0
3 0 0 1
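As for why the original attempt raised IndexError: the likely cause is that OneHotEncoder.fit_transform returns a SciPy sparse matrix by default, and the pandas code in the traceback wraps the assigned value with np.array(value, dtype=object) before reading value.shape[1]. The sketch below reproduces just that step on a two-column example; the variable names are illustrative and it assumes a default OneHotEncoder.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'string1': list('abcd'), 'string2': list('efgh')})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(X)

# A sparse matrix, not a 2-D ndarray
print(type(encoded), encoded.shape)

# The same wrapping pandas applied in the traceback: this does not yield
# a 2-D array (typically a 0-d object array), so shape[1] cannot be read
wrapped = np.array(encoded, dtype=object)
print(wrapped.shape)

# Converting to a dense array gives the expected 2-D shape
dense = encoded.toarray()
print(dense.shape)
```

Even with a dense array, assigning the result back to X[encoding_needed] would not work in general, because the encoded matrix has one column per category rather than one per original column. That is why the solution above builds a new frame with pd.concat.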