tensorflow - How to call an optimizer's minimize() in a multi-GPU distribution
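Problem description
I am trying to adapt a deep learning (DL) model for MNIST so that it runs on multiple GPUs at the same time, but I cannot find a way to make it work. I am new to DL, so I do not fully understand all the logic behind this code. I have tried many things, but none of them worked. Here is the code: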
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    y_ = tf.placeholder(tf.float32, [None, 10])
    # LINES TO MAKE MODEL...
    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
    # Cross-entropy
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=y_conv))
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
I get this error:
RuntimeError: Use `_distributed_apply()` instead of `apply_gradients()` in a cross-replica context.
Changing the minimize call to use `_distributed_apply()` instead does not solve the problem. If I change
#train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
to
opt = tf.train.AdamOptimizer(1e-4)
grads_and_vars = opt.compute_gradients(cross_entropy)
train_step = opt._distributed_apply(strategy, grads_and_vars)
I get the following error:
Traceback (most recent call last):
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 108, in _make_tensor_into_per_replica
    device = input_tensor.device
AttributeError: 'NoneType' object has no attribute 'device'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "multigpu.py", line 97, in <module>
    train_step = opt._distributed_apply(strategy, grads_and_vars)
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/training/optimizer.py", line 665, in _distributed_apply
    ds_reduce_util.ReduceOp.SUM, grads_and_vars)
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1254, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs)
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 739, in _batch_reduce_to
    reduce_op, value_destination_pairs)
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 285, in batch_reduce
    value_destination_pairs)
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 133, in _normalize_value_destination_pairs
    per_replica = _make_tensor_into_per_replica(pair[0])
  File "/home/baq/.local/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 110, in _make_tensor_into_per_replica
    raise ValueError("Cannot convert `input_tensor` to a `PerReplica` object "
ValueError: Cannot convert `input_tensor` to a `PerReplica` object because it doesn't have device set.
Any ideas on how to solve this? Thanks.
Solution
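The first error says the optimizer was called in a cross-replica context: code placed directly under `strategy.scope()` runs there, while `minimize()` / `apply_gradients()` expect to run inside a per-replica function. One possible way around it is sketched below. It is not the asker's code: it assumes TensorFlow 2.x with eager execution, replaces the elided model with a small hypothetical Keras network, and feeds random stand-in data instead of MNIST. The names `replica_step`, `train_step`, and `GLOBAL_BATCH_SIZE` are illustrative only.

```python
import numpy as np
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # assumption: total batch size summed over all replicas

# Uses every visible GPU; falls back to a single CPU replica if there is none.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored on each replica.
    # Hypothetical stand-in for the model elided in the question.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),  # logits, playing the role of y_conv
    ])
    optimizer = tf.keras.optimizers.Adam(1e-4)


def replica_step(images, labels):
    """Runs once per replica, i.e. in a replica context."""
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        per_example_loss = tf.nn.softmax_cross_entropy_with_logits(
            labels=labels, logits=logits)
        # Divide by the global batch size so that summing over replicas
        # gives the mean loss of the whole batch.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    # Called in a replica context, so the public API is allowed here.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


@tf.function
def train_step(images, labels):
    # strategy.run executes replica_step on every replica.
    per_replica_loss = strategy.run(replica_step, args=(images, labels))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)


# Random stand-in data with MNIST-like shapes (not real MNIST).
rng = np.random.default_rng(0)
images = rng.random((256, 784), dtype=np.float32)
labels = tf.one_hot(rng.integers(0, 10, size=256), depth=10)
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.batch(GLOBAL_BATCH_SIZE)
# Splits each global batch across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

losses = [float(train_step(x, y)) for x, y in dist_dataset]
print(losses)
```

As for the second traceback, it fails on `pair[0]`, the gradient half of a `(gradient, variable)` pair, which suggests that `grads_and_vars` contained a `None` gradient when it was passed to the private `_distributed_apply()`. Staying on the public `minimize()` / `apply_gradients()` path inside a replica function avoids calling that private method directly. On TF 1.14, to my knowledge the counterpart of `strategy.run` is `strategy.experimental_run_v2`; the Keras `model.fit` and Estimator APIs are other commonly suggested routes.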