python - 无法使用 V100 GPU 运行分布式 TensorFlow
问题描述
无法使用 GPU 运行 TensorFlow。代码在 CPU 中工作。
Debian 9.8 版
- 1 GPU 英伟达Tesla V100
- TensorFlow-GPU 1.12
- 英伟达驱动: NVIDIA-Linux-x86_64-390.46.run
- CUDA:cuda_9.0.176_384.81_linux-run
- CuDNN: cudnn-9.0-linux-x64-v7.4.1.5.tgz
- NCCL: nccl_2.3.7-1+cuda9.0_x86_64.txz
更新:用 CuDNN 7.1.4 测试和同样的问题
补丁
- cuda_9.0.176.1_linux-运行
- cuda_9.0.176.2_linux-运行
- cuda_9.0.176.3_linux-运行
- cuda_9.0.176.4_linux-运行
错误:
et convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 237, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 196, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 149, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 77, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 119, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
代码在这里
图书馆:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
版本
CUDA
cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
CUDA Patch Version 9.0.176.1
CUDA Patch Version 9.0.176.2
CUDA Patch Version 9.0.176.3
CUDA Patch Version 9.0.176.4
深度神经网络
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
相似的:
解决方案
通过详细查看日志,我收到了 OOM 错误,然后我在 tf.train.Server 中更改了以下内容以使其正常工作:
config_proto = tf.ConfigProto(log_device_placement=True)
config_proto.gpu_options.allow_growth = True
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config_proto)
错误:
2019-02-20 04:27:30.580666: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 836.47M (877106944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-02-20 04:27:30.612909: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.619060: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.625466: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.630800: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.636172: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.641168: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.723663: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-02-20 04:27:30.726611: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 222, in main
feed_dict={features: batch[0], labels: batch[1], keep_prob: 1.0})
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 195, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 148, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 76, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 118, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
推荐阅读
- c# - 从 AvalonDock LayoutAnchorable 中移除停靠组合按钮
- vue.js - 从 Capacitor 配置文件中获取 appId
- mule - 为组件中的参数“资源”找到无效配置
- c++ - Visual Studio Code - 更改编译器会出错
- wordpress - 使用 url 重写规则防止管理目录重定向
- actions-on-google - @assistant/conversation 中 CollectionBrowse 中的循环语句有什么用?
- terraform - Terraform:模块输出
- flutter - 无论如何在颤动的文本表单字段上创建可忽略的工具提示
- javascript - 打字稿展平地图数组
- jmeter - Jmeter中的文件上传问题