How do I fix this ValueError in one-hot encoding?

Problem description

When I try to run the following code in a Jupyter notebook, I get the error shown below:

dataset_train.drop_duplicates(inplace=True)
dataset_test.drop_duplicates(inplace=True)

#One-Hot-Encoding
enc = OneHotEncoder()
dataset_train_categorical_values_encenc = enc.fit_transform(dataset_train_categorical_values_enc)
dataset_train_cat_data = pd.DataFrame(dataset_train_categorical_values_encenc.toarray(),columns=dumcols)
# test set
dataset_test_categorical_values_encenc = enc.fit_transform(dataset_test_categorical_values_enc)
dataset_test_cat_data = pd.DataFrame(dataset_test_categorical_values_encenc.toarray(),columns=testdumcols)

Error: ValueError: Shape of passed values is (82332, 151), indices imply (82332, 155)

Here is all of the code that comes before the snippet above:

#Label Encoder

# insert code to get a list of categorical columns into a variable, categorical_columns
categorical_columns=['proto', 'service', 'state']
# Get the categorical values into a 2D numpy array
dataset_train_categorical_values = dataset_train[categorical_columns]
dataset_test_categorical_values = dataset_test[categorical_columns]

    
#Transform categorical features into numbers using LabelEncoder()
dataset_train = pd.read_csv('BMW_Theftprotection_trainer.csv')
dataset_test = pd.read_csv('BMW_Theftprotection_tester.csv') 

dataset_train_categorical_values_enc=dataset_train_categorical_values.apply(LabelEncoder().fit_transform)
print(dataset_train_categorical_values_enc.head())
# test set
dataset_test_categorical_values_enc=dataset_test_categorical_values.apply(LabelEncoder().fit_transform)

#Dummy Columns


# protocol type
unique_protocol=sorted(dataset_train.proto.unique())
string1 = 'proto_'
unique_protocol2=[string1 + x for x in unique_protocol]
# service
unique_service=sorted(dataset_train.service.unique())
string2 = 'service_'
unique_service2=[string2 + x for x in unique_service]
# flag
unique_flag=sorted(dataset_train.state.unique())
string3 = 'state_'
unique_flag2=[string3 + x for x in unique_flag]
# put together
dumcols=unique_protocol2 + unique_service2 + unique_flag2
print(dumcols)

#do same for test set
unique_service_test=sorted(dataset_test.service.unique())
unique_service2_test=[string2 + x for x in unique_service_test]
testdumcols=unique_protocol2 + unique_service2_test + unique_flag2

Does anyone know how to fix it?

Tags: python, data-science, data-mining

Solution


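This is most likely because one of your data frames contains category values that do not appear in the other, which changes the number of columns that one-hot encoding produces. The error message says as much: the encoded array has 151 columns, while the list of column names you pass has 155 entries. Note also that testdumcols combines the proto and state values of the training set with the service values of the test set, so it does not necessarily match what the encoder sees in either frame.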
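The effect can be reproduced with a small sketch. The frames below are made-up toy data (the `service` values are illustrative and not taken from the original CSV files); encoding each frame on its own yields a different number of columns.

```python
import pandas as pd

# Toy data, assumed for illustration only
train = pd.DataFrame({"service": ["http", "dns", "ftp", "http"]})
test = pd.DataFrame({"service": ["http", "dns", "ssh", "smtp", "pop3"]})

# Encoding each frame separately: one column per category seen in that frame
train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)

print(train_ohe.shape)  # (4, 3)
print(test_ohe.shape)   # (5, 5)

# Categories that appear in only one of the two frames
only_in_train = set(train["service"]) - set(test["service"])
only_in_test = set(test["service"]) - set(train["service"])
print(only_in_train, only_in_test)
```

Running the same set difference on your own `proto`, `service` and `state` columns should show which values cause the mismatch.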

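Concatenate the two data frames before one-hot encoding, then split them apart again afterwards. That gives both parts the same set of columns.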

import pandas as pd

dataframe_train = pd.DataFrame(
    {"one": ["a", "e", "i", "a", "a", "b"], "two": ["x", "x", "y", "x", "y", "y"] }, 
)
dataframe_test = pd.DataFrame(
    {"one": ["a", "e", "r"], "two": ["x", "x", "y"], },
)

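# Stack both frames and label every row with its origin ('test' or 'train')
train_test_df = pd.concat(
    [dataframe_test, dataframe_train],
    keys=['test', 'train']
).droplevel(level=1, axis=0)

# Encode once over the combined data so both parts share the same columns
ohe = pd.get_dummies(train_test_df)
# Split the encoded rows back apart using the origin label
test_ohe = ohe.loc['test', :].values
train_ohe = ohe.loc['train', :].values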

I used pandas for the one-hot encoding here because it makes the split afterwards much easier.

