python - TensorFlow performs much worse than scikit-learn for logistic regression
Problem description
I'm trying to implement a logistic regression classifier on a numeric dataset. The model I built in TensorFlow can't reach good accuracy or loss, so to check whether the problem was in the data, I tried scikit-learn's own LogisticRegression and got far better results. The gap is so large that I suspect I'm making some very basic mistake on the TF side...
Data preprocessing:
import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dt = pd.read_csv('data.csv', header=0)
npArray = np.array(dt)
xvals = npArray[:,1:].astype(float)
yvals = npArray[:,0]
x_proc = preprocessing.scale(xvals)
XTrain, XTest, yTrain, yTest = train_test_split(x_proc, yvals, random_state=1)
If I now run logistic regression with sklearn:
log_reg = LogisticRegression(class_weight='balanced')
log_reg.fit(XTrain, yTrain)
yPred = log_reg.predict(XTest)
print (metrics.classification_report(yTest, yPred))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, yPred),2))
...I get the following classification report:

             precision    recall  f1-score   support

          1       1.00      0.98      0.99        52
          2       0.96      1.00      0.98        52
          3       0.98      0.96      0.97        51
          4       0.98      0.97      0.97        58
          5       1.00      0.95      0.97        37
          6       0.93      1.00      0.96        65
          7       1.00      0.95      0.97        41
          8       0.94      0.98      0.96        50
          9       1.00      0.98      0.99        45
         10       1.00      0.98      0.99        49

avg / total       0.98      0.98      0.98       500

Overall Accuracy: 0.98
Great stuff, right? Here is the TensorFlow code from the same point after the split:
import tensorflow as tf

yTrain.resize(len(yTrain), 10)  # the labels are scores between 1 and 10
yTest.resize(len(yTest), 10)

tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, 8], name="input")
Y = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([8, 10]))
b = tf.Variable(tf.zeros([10]))
out = (tf.matmul(X, W) + b)
pred = tf.nn.softmax(out, name="output")

learning_rate = 0.001
training_epochs = 100
batch_size = 200
display_step = 1
L2_LOSS = 0.01

l2 = L2_LOSS * \
    sum(tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables())

# Minimize error using cross entropy
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=out, labels=Y)) + l2

# Gradient Descent
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

train_count = len(XTrain)

# defining accuracy
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# ---- Training the model ------------------------------------------
saver = tf.train.Saver()
history = dict(train_loss=[],
               train_acc=[],
               test_loss=[],
               test_acc=[])

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

for i in range(1, training_epochs + 1):
    for start, end in zip(range(0, train_count, batch_size),
                          range(batch_size, train_count + 1, batch_size)):
        sess.run(optimizer, feed_dict={X: XTrain[start:end],
                                       Y: yTrain[start:end]})

    _, acc_train, loss_train = sess.run([pred, accuracy, cost], feed_dict={
        X: XTrain, Y: yTrain})
    _, acc_test, loss_test = sess.run([pred, accuracy, cost], feed_dict={
        X: XTest, Y: yTest})

    history['train_loss'].append(loss_train)
    history['train_acc'].append(acc_train)
    history['test_loss'].append(loss_test)
    history['test_acc'].append(acc_test)

    if i != 1 and i % 10 != 0:
        continue
    print(f'epoch: {i} test accuracy: {acc_test} loss: {loss_test}')

predictions, acc_final, loss_final = sess.run([pred, accuracy, cost], feed_dict={X: XTest, Y: yTest})

print()
print(f'final results: accuracy: {acc_final} loss: {loss_final}')
And now I get the following:
epoch: 1 test accuracy: 0.41200000047683716 loss: 0.6921926140785217
epoch: 10 test accuracy: 0.5 loss: 0.6909801363945007
epoch: 20 test accuracy: 0.5180000066757202 loss: 0.6918861269950867
epoch: 30 test accuracy: 0.515999972820282 loss: 0.6927152872085571
epoch: 40 test accuracy: 0.5099999904632568 loss: 0.6933282613754272
epoch: 50 test accuracy: 0.5040000081062317 loss: 0.6937957406044006
epoch: 60 test accuracy: 0.5019999742507935 loss: 0.6941683292388916
epoch: 70 test accuracy: 0.5019999742507935 loss: 0.6944747567176819
epoch: 80 test accuracy: 0.4959999918937683 loss: 0.6947320103645325
epoch: 90 test accuracy: 0.46799999475479126 loss: 0.6949512958526611
epoch: 100 test accuracy: 0.4560000002384186 loss: 0.6951409578323364
final results: accuracy: 0.4560000002384186 loss: 0.6951409578323364
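Note also that the loss barely moves from ≈ 0.693 - which, unless I'm miscalculating, is exactly ln 2, the sigmoid cross-entropy of a completely uncertain prediction (p = 0.5) at every label position:

```python
import math

# Sigmoid cross-entropy at p = 0.5 for a single label position;
# the value is identical whether the target is 0 or 1
p = 0.5
loss_if_target_is_1 = -math.log(p)
loss_if_target_is_0 = -math.log(1.0 - p)
print(round(loss_if_target_is_1, 4), round(loss_if_target_is_0, 4))  # 0.6931 0.6931
```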
Thoughts? I've already tried initializing the weights (second answer here: How to do Xavier initialization on TensorFlow), and changing the learning rate, epochs, batch size, L2 loss, etc., with no real effect. Any help would be much appreciated...
Solution
I think I found the source of the problem - yTrain.resize and yTest.resize were silly both logically and mathematically, and once I replaced them with one-hot encoded arrays (with the help of: convert an array of indices to a 1-hot encoded numpy array) everything started looking much better. In the end I got the same accuracy as sk-learn (I think)!
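To illustrate the fix, here is a small sketch (with made-up labels standing in for yvals) contrasting what the in-place resize actually produced with a proper one-hot encoding via np.eye:

```python
import numpy as np

# Hypothetical label vector, scores 1..10, standing in for yvals
y = np.array([3, 1, 10, 7, 2, 5])

# What the in-place resize in the question really does: it keeps the data
# in C order and zero-pads, so all the label values end up crammed into
# the first row -- nothing like a one-hot encoding
y_bad = y.copy()
y_bad.resize(6, 10, refcheck=False)  # refcheck=False only to keep the snippet robust
print(y_bad[0])                      # [ 3  1 10  7  2  5  0  0  0  0]
print(y_bad[1])                      # all zeros

# Proper one-hot encoding: shift the 1-based scores to 0-based indices
# and pick rows out of an identity matrix
y_onehot = np.eye(10)[y - 1]
print(y_onehot.shape)                # (6, 10), exactly one 1 per row
```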