python - Retraining a loaded model doesn't seem to use the GPU properly (training is extremely slow)
Problem description
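I can train my model on the GPU without any problems, but when I load the model from a .h5 file and fit it to more data, training becomes extremely slow: each epoch goes from 28 seconds to 201 seconds.
Apart from creating the architecture and loading and saving the model, my code is the same as the example below: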
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
EPOCHS = 10
BATCH_SIZE = 32
model = Sequential()
model.add(LSTM(128, input_shape=(X_train_lstm.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(lr=0.0005, decay=1e-6)
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mean_absolute_error'])
history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))
model.save("model.h5")
The initial training of my model seems to work fine and makes full use of my GPU, as shown below:
2020-04-23 00:29:27.838398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.710605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:29:29.741211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.747629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.760441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.769311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.774739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.785242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.792404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.815372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.819504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:29.822104: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:29:29.827926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:29:29.835707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:29:29.839583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:29.843859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:29:29.847116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:29:29.851727: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:29:29.855977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:29:29.859217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:29:29.863804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:29:30.482013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:29:30.486022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-23 00:29:30.488195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-23 00:29:30.491115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
34849/34849 [==============================] - 28s 808us/sample - loss: 0.3903 - mean_absolute_error: 0.4317 - val_loss: 0.0015 - val_mean_absolute_error: 0.0341
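However, when I load a saved model instead of training one from scratch, training is very slow, and unlike in the log above, dynamic library cudnn64_7.dll does not seem to be opened (near the bottom, just before training starts):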
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.models import load_model
EPOCHS = 10
BATCH_SIZE = 32
model = load_model("model.h5")
history = model.fit(X_train_lstm, y_train_lstm, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(X_test_lstm, y_test_lstm))
2020-04-23 00:37:30.650618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.426823: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-23 00:37:32.459919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.465038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.475213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.481768: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.485735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.493894: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.499155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.520215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.523377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:32.525383: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-23 00:37:32.530055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:09:00.0 name: GeForce GTX 1060 6GB computeCapability: 6.1
coreClock: 1.835GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2020-04-23 00:37:32.538667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-23 00:37:32.542212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:37:32.545335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-23 00:37:32.549384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-23 00:37:32.552674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-23 00:37:32.556134: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-23 00:37:32.559841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-23 00:37:32.563358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-23 00:37:33.191288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-23 00:37:33.194467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-04-23 00:37:33.196314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-04-23 00:37:33.198647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 34849 samples, validate on 7421 samples
Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
34849/34849 [==============================] - 201s 6ms/sample - loss: 0.0074 - mean_absolute_error: 0.0604 - val_loss: 2.7296e-04 - val_mean_absolute_error: 0.0151
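The difference between the two logs can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the console output has been captured as a string; the function name libs_opened_after_epoch_start is hypothetical and not part of TensorFlow. It lists the dynamic libraries reported as opened after the first "Epoch 1/" line, using the tail of each log shown above.

```python
import re

# Matches the dso_loader message that TensorFlow prints when it loads a library.
OPENED = re.compile(r"Successfully opened dynamic library (\S+)")


def libs_opened_after_epoch_start(log_text):
    """Return the libraries reported as opened after the first 'Epoch 1/' line."""
    lines = log_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("Epoch 1/")),
        None,
    )
    if start is None:
        return []
    opened = []
    for line in lines[start + 1:]:
        match = OPENED.search(line)
        if match:
            opened.append(match.group(1))
    return opened


# Tail of the log from the run that trained from scratch.
fresh_log = """Epoch 1/10
2020-04-23 00:29:35.276148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-23 00:29:35.530107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
34849/34849 [==============================] - 28s 808us/sample
"""

# Tail of the log from the run that loaded model.h5.
loaded_log = """Epoch 1/10
2020-04-23 00:37:37.688378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
34849/34849 [==============================] - 201s 6ms/sample
"""

fresh_libs = libs_opened_after_epoch_start(fresh_log)
loaded_libs = libs_opened_after_epoch_start(loaded_log)
print("only in fresh run:", sorted(set(fresh_libs) - set(loaded_libs)))
```

In both runs cudnn64_7.dll is opened during device setup; the only difference is whether it is opened again once training starts.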
How can I deal with this?
Solution
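One thing worth checking, offered as a guess rather than a confirmed diagnosis: the Keras documentation lists conditions under which tf.keras.layers.LSTM uses its fast cuDNN kernel (activation tanh, recurrent_activation sigmoid, recurrent_dropout 0, unroll False, use_bias True). If a layer restored from the .h5 file ends up with a different setting, it may fall back to the slower generic kernel, which would be consistent with the missing cudnn64_7.dll line. The sketch below compares a layer config dictionary, as returned by layer.get_config(), against those conditions. The names CUDNN_REQUIREMENTS and cudnn_blockers are hypothetical helpers, and the two example configs are made up for illustration, not taken from the model in the question.

```python
# Config-level conditions documented for the cuDNN LSTM kernel.
# Other documented conditions (right-padded masking, eager execution in the
# outermost context) are not visible in the layer config and are not checked.
CUDNN_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(layer_config):
    """Return {setting: actual value} for every setting that differs from the
    documented cuDNN conditions. A missing setting is reported as None."""
    return {
        key: layer_config.get(key)
        for key, required in CUDNN_REQUIREMENTS.items()
        if layer_config.get(key) != required
    }


# Hypothetical config matching the defaults of tf.keras.layers.LSTM.
default_config = {
    "units": 128,
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}

# Hypothetical config with a recurrent activation that rules out cuDNN.
other_config = dict(default_config, recurrent_activation="hard_sigmoid")

print(cudnn_blockers(default_config))
print(cudnn_blockers(other_config))

# With the loaded model from the question, the check could be applied as:
#   for layer in model.layers:
#       if isinstance(layer, LSTM):
#           print(layer.name, cudnn_blockers(layer.get_config()))
```

An empty result for every LSTM layer would rule out these config-level causes, but it would not by itself show which kernel is actually used at run time.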