首页 > 解决方案 > 使用 keras/python 和 CSV 文件创建顺序模型,但准确性较差

问题描述

我想建立一个分类器,但我很难找到可以清楚地解释 keras 功能以及如何去做我想做的事情的来源。我想使用以下数据:

         0    1    2        3          4       5    6     7
0     Name  TRY  LOC   OUTPUT     TYPE_A   SIGNAL  A-B  SPOT
1    inc 1    2   20   TYPE-1    TORPEDO   ULTRA    A   -21
2    inc 2    3   16   TYPE-2    TORPEDO     ILH    B   -14
3    inc 3    2   20  BLACK47    TORPEDO    LION    A    49
4    inc 4    3   12   TYPE-2  CENTRALPA    LION    A    25
5    inc 5    3   10   TYPE-2      THREE    LION    A   -21
6    inc 6    2   20   TYPE-2        ATF    LION    A   -48
7    inc 7    4    2  NIVEA-1        ATF    LION    B   -23
8    inc 8    3   16  NIVEA-1        ATF    LION    B    18
9    inc 9    3   18  BLENDER  CENTRALPA    LION    B    48
10   inc 10   4   20    DELCO        ATF    LION    B   -26
11   inc 11   3   20    VE248        ATF    LION    B    44
12   inc 12   1   20   SILVER  CENTRALPA    LION    B   -35
13   inc 13   2   20  CALVIN3     SEVENX    LION    B   -20
14   inc 14   3   14  DECK-BT  CENTRALPA    LION    B   -38
15   inc 15   4    4  10-LEVI    BERWYEN     OWL    B   -29
16   inc 16   4   14   TYPE-2        ATF     NOV    B   -31
17   inc 17   4   10     NYNY    TORPEDO     NOV    B    21
18   inc 18   2   20  NIVEA-1  CENTRALPA     NOV    B    45
19   inc 19   3   27   FMRA97    TORPEDO     NOV    B   -26
20   inc 20   4   18   SILVER        ATF     NOV    B   -46

我想使用第 1、2、4、5、6、7 列作为输入,输出为 3(输出)。

我目前拥有的代码是:

import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot

df = pd.read_csv("file.csv")

df.drop('Name', axis=1, inplace=True)

obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)

target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())

X = df4[predictors].values
y = df4[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)

我不知道为什么这是我得到的结果:

Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04

我似乎也找不到与此问题相关的任何答案。我是否错误地使用了 keras 功能?这是我将对象类型转换为整数的方式吗?假设有 1250 个输出,我将如何修复这些层?任何提示或帮助将不胜感激。谢谢你。

标签: pythonpandastensorflowkeraskeras-layer

解决方案


正如我在评论中所说,这似乎是模型欠拟合的明显案例——对于模型本身的大小,您的数据太少。与其玩弄层的大小,不如先尝试 SVM 或 RandomForest 分类器,看看是否有可能对您的数据进行任何合理的分类。同样,对于如此大量的数据,神经网络也几乎不是一个好的选择。

所以改为这样做:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import pandas as pd

df = blablabla # This is your data
X = df.iloc[:, [i for i in range(8) if i != 3]]
y = df.iloc[:, 3]

X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

accuracy = accuracy_score(y_test, predictions)

如果这可行并且可以做出一些预测,那么您可以继续尝试调整您的顺序模型。

编辑:只需阅读您的评论,您总共有 1250 个类别标签和 5000 个样本。这可能不适用于大多数分类器。太多的类和太少的样本支持。


推荐阅读