Retraining a loaded model does not seem to use the GPU properly (training is extremely slow)

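Problem description

I have no trouble training a model on the GPU, but when I load the model from an .h5 file and fit more data to it, training becomes extremely slow: each epoch goes from 28 seconds to 201 seconds.

Apart from creating the architecture and loading and saving the model, my code is the same as the examples below.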

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization

EPOCHS = 10
BATCH_SIZE = 32

model = Sequential()
model.add(LSTM(128, input_shape=(X_train_lstm.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(BatchNormalization())

model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.0005, decay=1e-6)  # `lr` is a deprecated alias of `learning_rate`

model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mean_absolute_error'])


history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))


model.save("model.h5")


Training the model from scratch seems to work well and makes full use of my GPU, as shown below:

2020-04-23 00:29:27.838398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.710605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:29:29.741211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.747629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.760441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.769311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.774739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.785242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.792404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.815372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.819504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:29.822104: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:29:29.827926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.835707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.839583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.843859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.847116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.851727: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.855977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.859217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.863804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:30.482013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:29:30.486022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-04-23 00:29:30.488195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-04-23 00:29:30.491115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll

34849/34849 [==============================] - 28s 808us/sample - loss: 0.3903 - mean_absolute_error: 0.4317 - val_loss: 0.0015 - val_mean_absolute_error: 0.0341

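However, when I load a saved model instead of training from scratch, training is very slow, and unlike in the log above, the dynamic library cudnn64_7.dll does not seem to be opened (near the bottom, right before training starts):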

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.models import load_model

EPOCHS = 10
BATCH_SIZE = 32

model = load_model("model.h5")

history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))

2020-04-23 00:37:30.650618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.426823: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:37:32.459919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.465038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.475213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.481768: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.485735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.493894: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.499155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.520215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.523377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:32.525383: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:37:32.530055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.538667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.542212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.545335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.549384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.552674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.556134: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.559841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.563358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:33.191288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:37:33.194467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-04-23 00:37:33.196314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-04-23 00:37:33.198647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll

34849/34849 [==============================] - 201s 6ms/sample - loss: 0.0074 - mean_absolute_error: 0.0604 - val_loss: 2.7296e-04 - val_mean_absolute_error: 0.0151
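
The difference between the two logs can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the console output of each run has been captured as text; the helper name `libs_opened_after_epoch` and the variables `fresh_log` and `loaded_log` are hypothetical, and the log excerpts are copied from the two runs above.

```python
import re

# Matches the library name in TensorFlow's dso_loader log lines.
_LIB_RE = re.compile(r"Successfully opened dynamic library (\S+)")


def libs_opened_after_epoch(log_text):
    """Return the set of libraries opened after the first 'Epoch' line."""
    seen_epoch = False
    libs = set()
    for line in log_text.splitlines():
        if line.startswith("Epoch "):
            seen_epoch = True
        elif seen_epoch:
            match = _LIB_RE.search(line)
            if match:
                libs.add(match.group(1))
    return libs


# Tail of the log from the run that trained from scratch.
fresh_log = """Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
"""

# Tail of the log from the run that used load_model("model.h5").
loaded_log = """Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
"""

missing = libs_opened_after_epoch(fresh_log) - libs_opened_after_epoch(loaded_log)
print(sorted(missing))  # ['cudnn64_7.dll']
```

For these two excerpts the only library missing from the second run is cudnn64_7.dll, which matches the observation that cuDNN is not loaded when training starts on the loaded model.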

How should I deal with this?
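
One thing that could be checked, as a diagnostic and not a confirmed cause: the Keras LSTM documentation for TensorFlow 2.x lists layer settings required for the fast cuDNN kernel (tanh activation, sigmoid recurrent activation, zero recurrent dropout, no unrolling, bias enabled). The sketch below compares a layer config dictionary against those settings. The function name `cudnn_blockers` and the `example` config are hypothetical, and a clean result does not prove the cuDNN kernel is actually selected.

```python
def cudnn_blockers(config):
    """List settings in an LSTM layer config that rule out the cuDNN kernel."""
    # Values the TF 2.x Keras LSTM docs give as requirements for cuDNN.
    expected = {
        "activation": "tanh",
        "recurrent_activation": "sigmoid",
        "recurrent_dropout": 0,
        "unroll": False,
        "use_bias": True,
    }
    return [
        f"{key}={config.get(key)!r} (cuDNN path needs {want!r})"
        for key, want in expected.items()
        if config.get(key) != want
    ]


# Hypothetical config for illustration, not taken from the model above.
example = {
    "activation": "tanh",
    "recurrent_activation": "hard_sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}
print(cudnn_blockers(example))

# With TensorFlow available, the same check could be run on the loaded model:
#   for layer in model.layers:
#       if isinstance(layer, tf.keras.layers.LSTM):
#           print(layer.name, cudnn_blockers(layer.get_config()))
```

If every LSTM layer of the loaded model returns an empty list, the layer configuration itself is probably not what prevents cuDNN from being used, and the loading path would be the next thing to look at.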

Tags: python, tensorflow, gpu

Solution

