tensorflow - Out Of Memory when running multi-gpu cnn with TensorFlow
Problem description
I'm trying to run a simple CNN on CIFAR-10 by combining code from two examples:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/6_MultiGPU/multigpu_cnn.py
https://github.com/exelban/tensorflow-cifar-10
I'm getting OOM errors.
I first ran the complete CNN without multi-GPU support, and it worked fine. Next I ran the multi-GPU example on its own, and that worked too. Combining the two does not work.
with tf.device('/cpu:0'):
    tower_grads = []
    reuse_vars = False

    # tf Graph input
    X = tf.placeholder(tf.float32, shape=[None, _IMAGE_SIZE * _IMAGE_SIZE * _IMAGE_CHANNELS], name='Input')
    Y = tf.placeholder(tf.float32, shape=[None, _NUM_CLASSES], name='Output')
    phase = tf.placeholder(tf.bool, name='phase')
    # learning_rate = tf.placeholder(tf.float32, shape=[], name='learning_rate')
    keep_prob = tf.placeholder(tf.float32)
    global_step = tf.get_variable(name='global_step', trainable=False, initializer=0)

    # Loop over all GPUs and construct their own computation graph
    for i in range(_NUM_GPUS):
        with tf.device('/gpu:{}'.format(i)):
            # learning_rate = tf.placeholder(tf.float32, shape=[], name='learning_rate')
            # keep_prob = tf.placeholder(tf.float32)

            # Split data between GPUs
            _x = X[i * _BATCH_SIZE: (i+1) * _BATCH_SIZE]
            _y = Y[i * _BATCH_SIZE: (i+1) * _BATCH_SIZE]
            print("x shape:", _x.shape)
            print("y shape:", _y.shape)

            # Because Dropout have different behavior at training and prediction time, we
            # need to create 2 distinct computation graphs that share the same weights.
            _x = tf.reshape(_x, [-1, _IMAGE_SIZE, _IMAGE_SIZE, _IMAGE_CHANNELS], name='images')

            # Create a graph for training
            logits_train, y_pred_cls = feed_net(_x, _NUM_CLASSES, keep_prob, reuse=reuse_vars, is_training=True)
            # Create another graph for testing that reuse the same weights
            logits_test, y_pred_cls = feed_net(_x, _NUM_CLASSES, keep_prob, reuse=True, is_training=False)

            # Define loss and optimizer (with train logits, for dropout to take effect)
            loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_train, labels=_y))
            optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
            grads = optimizer.compute_gradients(loss_op)

            # Only first GPU compute accuracy
            if i == 0:
                # Evaluate model (with test logits, for dropout to be disabled)
                correct_pred = tf.equal(tf.argmax(logits_test, 1), tf.argmax(_y, 1))
                accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

            reuse_vars = True
            tower_grads.append(grads)

    tower_grads = average_gradients(tower_grads)
    train_op = optimizer.apply_gradients(tower_grads)
The error occurs when running with more than one GPU (I have 4), after about 90 iterations (less than one epoch):
ResourceExhaustedError: Ran out of GPU memory when allocating 0 bytes for
[[Node: softmax_cross_entropy_with_logits_sg_3 = SoftmaxCrossEntropyWithLogits[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:3"](softmax_cross_entropy_with_logits_sg_3/Reshape, softmax_cross_entropy_with_logits_sg_3/Reshape_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: main_params/map/while/Less_1/_206 = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1905_main_params/map/while/Less_1", _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_params/map/while/Less_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
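The hint in the traceback refers to a field on RunOptions. A minimal sketch of switching it on is shown here, assuming a TensorFlow build that ships the tensorflow.compat.v1 module. The tiny graph (x, total) is a hypothetical stand-in, not the network from the question.

```python
import tensorflow.compat.v1 as tf

# Graph-mode execution, as in the question's TF1-style code.
tf.disable_eager_execution()

# Hypothetical stand-in graph; replace with the real model and train_op.
x = tf.placeholder(tf.float32, shape=[None, 4], name='x')
total = tf.reduce_sum(x)

# Ask TensorFlow to list allocated tensors if an OOM occurs during this run.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    result = sess.run(total,
                      feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]},
                      options=run_options)
```

In the script above, the same options object would be passed to the sess.run call that executes train_op, so that the next OOM report includes the allocation list.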
UPDATE:
The problem was with how the data was divided across the GPUs.
I used tf.split(X, _NUM_GPUS) for the data and the labels, so I could assign each GPU its correct chunk of data.
Solution
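The problem was in how the data was divided across the GPUs. I used tf.split(X, _NUM_GPUS) for both the data and the labels, so that each GPU could be assigned its correct chunk. Also, only one GPU computes accuracy, so it needs to be given the full-size data.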
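To make the difference between the two ways of dividing a batch concrete, here is a small illustration in plain Python, with lists standing in for the rows of X. It assumes the fed batch holds only BATCH_SIZE rows in total. The post does not say what was actually fed, so treat that figure, the value 128, and the helper names as hypothetical.

```python
# Illustration only: plain Python lists stand in for the rows of X.
NUM_GPUS = 4
BATCH_SIZE = 128  # hypothetical value; the post does not state it


def slice_per_gpu(rows, num_gpus, batch_size):
    """Fixed-width slicing, as in the question: X[i * B:(i + 1) * B]."""
    return [rows[i * batch_size:(i + 1) * batch_size] for i in range(num_gpus)]


def split_evenly(rows, num_gpus):
    """Even split along the first axis, mirroring tf.split(X, num_gpus)."""
    if len(rows) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    shard = len(rows) // num_gpus
    return [rows[i * shard:(i + 1) * shard] for i in range(num_gpus)]


# Assumption: only BATCH_SIZE rows are fed per step.
fed_rows = list(range(BATCH_SIZE))

sliced_sizes = [len(s) for s in slice_per_gpu(fed_rows, NUM_GPUS, BATCH_SIZE)]
split_sizes = [len(s) for s in split_evenly(fed_rows, NUM_GPUS)]

print(sliced_sizes)  # [128, 0, 0, 0]
print(split_sizes)   # [32, 32, 32, 32]
```

Under that assumption, fixed-width slicing hands every row to the first GPU and nothing to the others, while an even split gives each GPU the same share. With tf.split, the fed batch size has to be divisible by the number of GPUs.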