Error: "Found input variables with inconsistent numbers of samples: [5114, 3409]"

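Problem description

I want to follow these steps:

  1. Load the data
  2. Split it into label and feature sets
  3. Normalize the data
  4. Split it into test and training sets
  5. Apply oversampling (SMOTE)

Is this the correct order of steps, or am I doing something wrong? I keep getting an error message saying "Found input variables with inconsistent numbers of samples: [5114, 3409]".

The error occurs on this line: X_train,Y_train = smote.fit_sample(X_train,Y_train)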

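import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Data loading
dataset = pd.read_csv('data.csv')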

#view data and check for null values
print(dataset.isnull().values.any())
print(dataset.shape)


# Dividing dataset into label and feature sets
X = dataset.drop('Bankrupt?', axis = 1) # Features
Y = dataset['Bankrupt?'] # Labels
print(type(X))
print(type(Y))
print(X.shape)
print(Y.shape)

# Normalizing numerical features so that each feature has mean 0 and variance 1
feature_scaler = StandardScaler()
X_scaled = feature_scaler.fit_transform(X)

# Dividing dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split( X_scaled, Y, test_size = 0.5, random_state = 100)

print(X_train.shape)
print(X_test.shape)
    
X = dataset.iloc[:,1:].values
y = dataset.iloc[:,0].values.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Implementing Oversampling to balance the dataset; 
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())

smote = SMOTE(random_state = 101)
X_train,Y_train = smote.fit_sample(X_train,Y_train)

print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())

Tags: python, data-analysis

Solution
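
The two numbers in the message appear to come from two different splits. The question's code calls train_test_split twice: first with test_size = 0.5, which defines Y_train, and then again with the default test_size (0.25), which overwrites X_train but assigns the labels to lowercase y_train. When SMOTE runs, X_train comes from the second split and Y_train from the first. The sketch below reproduces the mismatch with random stand-in data. The row count of 6819 is an assumption, chosen because it yields exactly 5114 and 3409, and the number of columns is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 6819 rows; random stand-in data instead of data.csv.
n_rows = 6819
rng = np.random.default_rng(0)
X = rng.normal(size=(n_rows, 4))
Y = rng.integers(0, 2, size=n_rows)

# First split (50/50): this is where Y_train is defined.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=100
)

# Second split (default test_size=0.25): X_train is overwritten,
# but the labels go to lowercase y_train, so Y_train keeps its old length.
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

print(X_train.shape[0], Y_train.shape[0])  # prints: 5114 3409
```

Removing the second split (the lines that redefine X, y, X_train and y_train) should leave X_train and Y_train with the same number of rows.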

