TensorFlow/Keras: GPU not used during training

Problem description

I am using the Keras built into TensorFlow, in a conda env with tensorflow-gpu==1.10.0. I have CUDA 9.0 and cuDNN 7.

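I see my GPU go up to 30% for a few seconds at the start of training (I guess that is when it loads the images). Then it drops back to 1% or 2%, while the CPU runs at 20-25%.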

I tried setting CUDA_VISIBLE_DEVICES=-1, and my CPU went up to 94% (so TensorFlow is using the GPU...).
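
The question does not say how the variable was set. As a minimal sketch, assuming it is set from inside the Python script rather than in the shell, it has to be assigned before TensorFlow is imported so that CUDA never sees the GPU. The helper name `hide_gpus` is hypothetical.

```python
import os


def hide_gpus():
    """Hide every CUDA device from this process (hypothetical helper).

    Assumption: this runs before TensorFlow is imported, because the
    variable is read when CUDA is initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# import tensorflow  # import only after the variable has been set
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Comparing a run with and without this call is the CPU-only versus GPU test described above.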

Here is my model and how I train it:

import numpy as np
import pandas as pd
import time

import tensorflow
from tensorflow import keras

taille_image = (96,96)      
batch_size = 32 

model = keras.Sequential()

model.add(keras.layers.Conv2D(16, (3, 3), input_shape=(taille_image[0], taille_image[1], 3),padding = "same", activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.MaxPool2D(pool_size=(2,2)))

model.add(keras.layers.Conv2D(32, (3, 3), padding = "same", activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.MaxPool2D(pool_size=(2,2)))

model.add(keras.layers.Conv2D(64, (3, 3), padding = "same", activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.MaxPool2D(pool_size=(2,2)))

model.add(keras.layers.Conv2D(128, (3, 3), padding = "same", activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.MaxPool2D(pool_size=(2,2)))

model.add(keras.layers.Conv2D(256, (3, 3), padding = "same", activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.MaxPool2D(pool_size=(2,2)))

model.add(keras.layers.Flatten())                
model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(loss='categorical_crossentropy',
        optimizer='adam',
        metrics=['accuracy'])
#model.summary()


def Entrainer(texte, barre, nb_epochs):

        train_datagen = keras.preprocessing.image.ImageDataGenerator(
                rescale=1./255,
                shear_range=0.2,
                zoom_range=0.2,
                horizontal_flip=True,
                vertical_flip=True
        )
        valid_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

        test_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

        train_generator = train_datagen.flow_from_directory(
                directory='Images traitées/train',
                batch_size=batch_size,
                target_size=taille_image,
                color_mode="rgb",
                class_mode="categorical",
                shuffle=True,
                seed=42
        )

        validation_generator = valid_datagen.flow_from_directory(
                directory='Images traitées/validation',
                batch_size=1,
                target_size=taille_image,
                color_mode="rgb",
                class_mode="categorical",
                shuffle=True,
                seed=42
        )

        test_generator = test_datagen.flow_from_directory(
                directory='Images traitées/test',
                batch_size=1,
                target_size=taille_image,
                color_mode="rgb",
                class_mode=None,
                shuffle=False,
                seed=42
        )
        NAME = "16,32,64,128,256-conv-{}-batchs-{}".format(batch_size,int(time.time()))
        tensorboard = keras.callbacks.TensorBoard(log_dir = 'Graph/{}'.format(NAME))
        step_size_train = train_generator.n//train_generator.batch_size
        step_size_valid = validation_generator.n//validation_generator.batch_size
        step_size_test = test_generator.n//test_generator.batch_size

        model.fit_generator(
                generator=train_generator,
                steps_per_epoch=step_size_train,
                validation_data=validation_generator,
                validation_steps = step_size_valid,
                epochs=nb_epochs,
                callbacks = [tensorboard]
        )

It outputs:

Using TensorFlow backend.
Found 11085 images belonging to 3 classes.
Found 2787 images belonging to 4 classes.
Found 89 images belonging to 1 classes.
2019-02-07 10:18:08.258962: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-02-07 10:18:08.681742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038
pciBusID: 0000:01:00.0
totalMemory: 3.00GiB freeMemory: 2.48GiB
2019-02-07 10:18:08.686793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-02-07 10:18:09.293895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-07 10:18:09.296482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-02-07 10:18:09.302927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-02-07 10:18:09.306066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2169 MB memory)
-> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
Epoch 1/16
 20/346 [>.............................] - ETA: 2:50 - loss: 0.9647 - acc: 0.5500
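
The progress line above can be turned into a rough throughput figure. This is only a back-of-the-envelope sketch: every number is copied from the output, and the ETA is Keras' own running estimate, so the result is approximate.

```python
# Figures copied from the output above.
train_images = 11085            # "Found 11085 images belonging to 3 classes."
batch_size = 32                 # batch_size used for train_generator
steps_done = 20                 # "20/346"
eta_seconds = 2 * 60 + 50       # "ETA: 2:50"

# Same formula as step_size_train in the training function.
steps_per_epoch = train_images // batch_size
remaining_steps = steps_per_epoch - steps_done

seconds_per_step = eta_seconds / remaining_steps
images_per_second = batch_size / seconds_per_step

print(steps_per_epoch)                 # 346
print(round(seconds_per_step, 2))      # 0.52
print(round(images_per_second, 1))     # 61.4
```

Roughly half a second per batch of 32 small 96x96 images is the slowness the question is about.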

Training is slow, and when I check the GPU's performance it shows about 1% usage during training.
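
One way to narrow this down is to time how long it takes just to fetch batches, with no model involved. The sketch below rests on an assumption the question does not confirm: that CPU-side image loading and augmentation might be what keeps the GPU idle. `time_batches` and `fake_batches` are hypothetical names; `fake_batches` only stands in for a real generator so the example runs on its own.

```python
import time


def time_batches(batches, n_batches=20):
    """Return the average seconds spent fetching each of the first n_batches items."""
    iterator = iter(batches)
    start = time.perf_counter()
    fetched = 0
    for _ in range(n_batches):
        try:
            next(iterator)
        except StopIteration:
            break
        fetched += 1
    elapsed = time.perf_counter() - start
    return elapsed / fetched if fetched else 0.0


def fake_batches(delay=0.01):
    """Stand-in generator (hypothetical) that simulates slow preprocessing."""
    while True:
        time.sleep(delay)
        yield None


avg = time_batches(fake_batches(), n_batches=5)
print("average seconds per batch: {:.3f}".format(avg))
```

In the training function, the same helper could be called as `time_batches(train_generator)`. If fetching a batch takes about as long as a whole training step in the Keras progress bar, the GPU is probably waiting for data rather than computing.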

Tags: python, tensorflow

Solution

