Using MLPClassifier, repeated calls to partial_fit produce worse accuracy than fit(), despite shuffling the data

Problem description

I am using sklearn's MLPClassifier in Python to build a neural network for a classification task. I want to plot accuracy against the number of epochs to see how many epochs I need to reach a certain level of accuracy. The only way I found to do this is to call partial_fit() in a loop. Here is the code:

from sklearn.preprocessing   import StandardScaler
from sklearn.decomposition   import PCA
from sklearn.neural_network  import MLPClassifier
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt

scaler = StandardScaler()
scaler.fit(df_train_sample)
X_train = scaler.transform(df_train_sample)
scaler.fit(df_val)
X_val = scaler.transform(df_val)

pca = PCA(pca_frac)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_val = pca.transform(X_val)

n_classes = np.unique(labels_train_sample)  # array of class labels, passed to partial_fit
n_train_sample = len(df_train_sample)

scores_train = []
scores_val = []

epoch = 0
while epoch < max_iter:
   
    random_perm = np.random.permutation(n_train_sample)
    mini_batch_index = 0

    while True:
        indices = random_perm[mini_batch_index:mini_batch_index + batch_size]
        mlpc.partial_fit(X_train[indices], labels_train_sample[indices], classes=n_classes)
        mini_batch_index += batch_size

        if mini_batch_index >= n_train_sample:
            break
    
    scores_train.append(mlpc.score(X_train, labels_train_sample))
    scores_val.append(mlpc.score(X_val, labels_val))

    epoch += 1

fig, ax = plt.subplots()

ax.plot(np.arange(1, max_iter + 1), scores_train, label = "Train")
ax.plot(np.arange(1, max_iter + 1), scores_val, label = "Validation")
ax.legend()

Here, max_iter is the number of epochs and mlpc is the classifier, defined as follows:

seed          = 123
hidden_layers = [30, 15]
activation    = "relu"
learning_rate = 5e-4
beta_1        = 0.99
epsilon       = 1e-4

batch_size    = 200 
max_iter      = 200 
tol           = 1e-4

warm_start    = True
shuffle       = True

mlpc = MLPClassifier(
    hidden_layer_sizes = hidden_layers,
    activation         = activation,
    batch_size         = batch_size,
    learning_rate_init = learning_rate,
    beta_1             = beta_1,
    epsilon            = epsilon,
    warm_start         = warm_start,
    shuffle            = shuffle,
    max_iter           = max_iter,
    tol                = tol,
    random_state       = seed
)
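A side note not in the original post: after a plain fit(), MLPClassifier records the per-epoch training loss in its loss_curve_ attribute, which offers a quick sanity check against any hand-rolled epoch loop. A sketch on toy data (the dataset and parameters here are illustrative, not the post's):

```python
# Sketch on toy data: loss_curve_ holds one training-loss value per epoch
# completed by fit(), so it can be inspected or plotted directly.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=[30, 15], max_iter=200, random_state=0)
clf.fit(X, y)

print(len(clf.loss_curve_))  # number of epochs actually run (<= max_iter)
```

This only covers the training loss, not validation accuracy, so it complements rather than replaces the per-epoch scoring loop.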

For completeness, here is how df_train_sample and labels_train_sample are constructed from the original dataframes:

df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)

where N is the number of rows to sample. df_val and labels_val are the validation data, read directly from .csv files without modification. Note that the labels are boolean.

The problem is that calling mlpc.fit() on the sampled dataset yields an accuracy of about 82%, while the code I posted reaches only 65%. Here is the plot: [plot: accuracy vs. epochs]

Searching online, I found that shuffling the data should help, but as you can see, the data is already shuffled at every epoch. Why is this happening? Is there another, more direct way to build this plot?
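On the "more direct way" part of the question: one alternative (a sketch on toy data, not from the original post) is to keep warm_start=True and set max_iter=1, so that each call to fit() resumes from the previous weights and runs exactly one epoch, with scikit-learn handling the shuffling and mini-batching internally:

```python
# Sketch of a per-epoch accuracy curve, assuming toy data in place of the
# real df_train_sample / df_val. With warm_start=True and max_iter=1,
# every fit() call continues from the previous weights and runs a single
# epoch, so no manual mini-batch loop is needed.
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)  # raised by max_iter=1

X, y = make_classification(n_samples=1000, n_features=20, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)

mlpc = MLPClassifier(hidden_layer_sizes=[30, 15], max_iter=1,
                     warm_start=True, shuffle=True, random_state=123)

scores_train, scores_val = [], []
for epoch in range(50):
    mlpc.fit(X_train, y_train)  # exactly one epoch per call
    scores_train.append(mlpc.score(X_train, y_train))
    scores_val.append(mlpc.score(X_val, y_val))
```

One caveat: with warm_start, some solver state (e.g. Adam's moment estimates) may be reinitialized on each fit() call, so the trajectory is not guaranteed to match a single long fit() exactly.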

Tags: python, machine-learning, scikit-learn, neural-network

Solution


I found the problem. It was not in partial_fit(), but in the way I built the sampled dataframe:

df_train_sample = df_train.sample(N, replace = False).reset_index(drop = True)
labels_train_sample = labels_train[df_train_sample.index].reset_index(drop = True)

In this snippet I reset the index of df_train_sample while constructing it, but then used that (already reset) index to select the corresponding rows from labels_train. This would have worked if I had not reset the index (which is what I did in a previous version).

The fix is simply to store the index before resetting it, like this:

df_train_sample = df_train.sample(N, replace = False)
train_index = df_train_sample.index
df_train_sample = df_train_sample.reset_index(drop = True)
labels_train_sample = labels_train[train_index].reset_index(drop = True)
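To make the mismatch concrete, here is a toy reproduction (illustrative data, not from the post): after reset_index(drop=True), the sample's index is just 0..N-1, so indexing labels_train with it selects the first N labels instead of the labels of the rows that were actually drawn.

```python
# Toy reproduction of the index mismatch, with made-up data.
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})  # original index 0..3
labels = pd.Series(["a", "b", "c", "d"])

# Buggy version: reset the index first, then use it to select labels.
sample = df.sample(2, random_state=1).reset_index(drop=True)
wrong = labels[sample.index]   # sample.index is now [0, 1] regardless of
                               # which rows were drawn -> wrong labels

# Fixed version: grab the original index before resetting.
sample2 = df.sample(2, random_state=1)
train_index = sample2.index
right = labels[train_index]    # labels of the rows that were actually drawn
```

With the fix, each row of the sample is paired with the label it had in the original dataframe, which is what the training loop silently assumed all along.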
