ResourceExhaustedError: OOM when allocating tensor in Keras

Problem description

I am training a model with tf.keras that has the following summary:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 28, 28, 20)        520       
_________________________________________________________________
flatten (Flatten)            (None, 15680)             0         
_________________________________________________________________
dense (Dense)                (None, 5)                 78405     
=================================================================
Total params: 78,925
Trainable params: 78,925
Non-trainable params: 0
_________________________________________________________________ 

I call the fit method with

model.fit(X_train, y_train, batch_size=32, steps_per_epoch=125, epochs=5, use_multiprocessing=True)

where X_train is a TensorFlow variable of shape [900000, 32, 32, 1].

I get the following error:

Resource exhausted: OOM when allocating tensor with shape[900000,28,28,20] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_1/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics_2/acc/Identity/_53]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[900000,28,28,20] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_1/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

I cannot understand why it allocates a tensor of shape [900000, 28, 28, 20] when the batch size is 32. I expected [32, 28, 28, 20].


Full code:

IMAGE_SIZE = 32
BATCH_SIZE = 32

conv1_layer = keras.layers.Conv2D(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1), filters=20, kernel_size=[5, 5], activation='relu')
f = keras.layers.Flatten()
output_layer = keras.layers.Dense(units=5, activation='softmax')

model = keras.models.Sequential(layers=[conv1_layer, f, output_layer])
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=BATCH_SIZE, steps_per_epoch=int(X_train.shape[0])//BATCH_SIZE, epochs=5, use_multiprocessing=True)

Tags: tensorflow, keras

Solution


The problem most likely lies in use_multiprocessing=True, which expects the input to be a generator object (see the docs). Feeding it the X_train array does not raise an error immediately; instead, fit may treat the array as a generator, iterate along axis 0, and feed all 900,000 samples at once, hence the error. Try use_multiprocessing=False, or supply the data through an actual generator.

In addition, steps_per_epoch may be a contributing factor; try omitting it from fit. Also, evaluate your tf.Variable input (i.e. convert it to a concrete array) before passing it to fit, since fit does not handle that automatically.
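Putting the suggestions together, a minimal sketch of the corrected call might look like the following. Note the data here is small random stand-in data (the question's X_train is a 900,000-sample tf.Variable), and use_multiprocessing is simply left at its default of False:

```python
import numpy as np
from tensorflow import keras

IMAGE_SIZE = 32
BATCH_SIZE = 32

# Hypothetical stand-in data: plain NumPy arrays instead of a tf.Variable,
# and far fewer samples so the sketch runs quickly.
X_train = np.random.rand(1000, IMAGE_SIZE, IMAGE_SIZE, 1).astype("float32")
y_train = keras.utils.to_categorical(
    np.random.randint(0, 5, size=1000), num_classes=5)

# Same architecture as in the question.
model = keras.models.Sequential([
    keras.layers.Conv2D(input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1),
                        filters=20, kernel_size=[5, 5], activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(units=5, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fixes suggested above: NumPy input rather than a tf.Variable,
# no steps_per_epoch, and use_multiprocessing left off (the default),
# so fit slices the array into batches of BATCH_SIZE itself.
hist = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=1, verbose=0)
```

With batching handled by fit over a concrete array, only one batch of activations (shape [32, 28, 28, 20]) needs to be resident on the GPU at a time.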

