Failed to alloc X bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Problem description

I am trying to run a TensorFlow project and I'm hitting memory problems on the university HPC cluster. I have to run prediction jobs for hundreds of inputs of different lengths. We have GPU nodes with different amounts of vmem, so I'm trying to set up the script in a way that doesn't crash on any combination of GPU node and input length.

After searching online for solutions, I used TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE and TF_FORCE_GPU_ALLOW_GROWTH, as well as TensorFlow's set_memory_growth. As I understood it, with unified memory I should be able to use more memory than the GPU itself has.

This is my final setup (only the relevant parts):

import os

os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understood, this is redundant with the set_memory_growth part :)

import tensorflow as tf    
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      print(gpu)
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

I submit it with --mem=30G (Slurm job scheduler) and --gres=gpu:1.
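Since the nodes have different amounts of GPU memory, one way to avoid hard-coding the fraction would be to derive it from the GPU the job is actually assigned. A minimal sketch of that idea (the nvidia-smi query and the target_mib value are assumptions, not part of my script); it would have to run before jax/tensorflow is imported, because the XLA flags are read when the client starts:

import os
import subprocess

# Total memory (MiB) of the first GPU visible to this job, queried via nvidia-smi.
total_mib = int(subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.total',
     '--format=csv,noheader,nounits'],
    text=True).splitlines()[0])

# Hypothetical target: how much memory the largest prediction is expected to need.
target_mib = 12 * 1024

# A fraction above 1.0 means oversubscription, which only works when unified
# memory is fully supported (see the answer below).
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = f'{target_mib / total_mib:.2f}'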

This is the error my code crashes with. As far as I understand, it does try to use unified memory, but fails for some reason.

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.


Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
    donated_invars=donated_invars, inline=inline)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
    *unsafe_map(arg_spec, args))
  File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, *args)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
    compiled = compile_or_get_cached(backend, built, options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
    return backend_compile(backend, computation, compile_options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

I would be glad for any ideas on how to get this working and use all the available memory.

Thanks!

Tags: tensorflow, out-of-memory, gpu, slurm

Solution


It looks like your GPU does not fully support unified memory. The support is limited, and in practice the GPU keeps all the data in its own memory.

The behaviour is described in this article: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

In particular:

On systems with pre-Pascal GPUs (such as the Tesla K80), calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows the pages are resident on that GPU.

And:

Since these older GPUs cannot page fault, all data must be resident on the GPU in case the kernel accesses it (even if it won't).

According to TechPowerUp, your GPU is Kepler-based: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549

As far as I know, TensorFlow should also print a warning about this. Something like:

Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
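Given that, one workable approach is to check the compute capability first and only request unified-memory oversubscription on Pascal-or-newer GPUs. A minimal sketch (the pick_memory_flags helper and the 0.9 fallback fraction are my assumptions, not something TensorFlow provides):

import os
import tensorflow as tf

def pick_memory_flags():
    """Hypothetical helper: choose memory flags based on the GPU's compute capability."""
    gpus = tf.config.list_physical_devices('GPU')
    if not gpus:
        return
    details = tf.config.experimental.get_device_details(gpus[0])
    major, _minor = details.get('compute_capability', (0, 0))
    if major >= 6:
        # Pascal or newer: unified-memory oversubscription is supported,
        # so a fraction above 1.0 can spill over into host memory.
        os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
        os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
    else:
        # Pre-Pascal (e.g. the Kepler-based GTX TITAN Black): all data must fit
        # in device memory, so stay below the physical limit instead.
        os.environ.pop('TF_FORCE_UNIFIED_MEMORY', None)
        os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '0.9'

pick_memory_flags()
# For JAX-based code (like AlphaFold) these flags are read when the XLA client
# starts, so they must be set before `import jax`.

That way the same script can run on both the Kepler nodes and the newer nodes without per-node changes.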

