
Problem description

I am building a Keras model to run some simple image-recognition tasks. If I do everything in raw Keras, I don't hit OOM. Strangely, though, when I run the same thing through a small framework I wrote (it is fairly simple, mostly there to let me track the hyperparameters and settings I use), I do hit OOM. Most of the execution should be identical to running raw Keras, so I suspect I made a mistake somewhere in my code. Note that the same mini-framework runs on CPU on my local laptop without problems. I guess I need to debug it, but before I do, does anyone have any general advice?

Here are a few lines of the error I got:

Epoch 1/50
2018-05-18 17:40:27.435366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 17:40:27.435906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 totalMemory: 11.17GiB freeMemory: 504.38MiB
2018-05-18 17:40:27.435992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-18 17:40:27.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 17:40:27.784675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-05-18 17:40:27.784724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-05-18 17:40:27.785072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 243 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-18 17:40:38.569609: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB.  Current allocation summary follows.
2018-05-18 17:40:38.569702: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256):   Total Chunks: 66, Chunks in use: 66. 16.5KiB allocated for chunks. 16.5KiB in use in bin. 2.3KiB client-requested in use in bin.
2018-05-18 17:40:38.569768: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512):   Total Chunks: 10, Chunks in use: 10. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 5.0KiB client- etc. etc

2018-05-18 17:40:38.573706: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[18432,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Tags: keras, google-colaboratory

Solution


This is caused by running out of GPU memory, as the warnings make clear.

The first workaround: if possible, allow GPU memory to grow on demand by building this config proto and passing it to tf.Session():

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

Then pass this config to the session that triggers the error, like so:

tf.Session(config=config)

If that doesn't help, you can disable the GPU entirely for the session that triggers the error, like this:

config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)

If you are using Keras, you can grab Keras's backend and apply these configs there by swapping in your own session.
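A minimal sketch of that last step, assuming a TF 1.x-era setup where standalone Keras exposes `keras.backend.set_session` (a session-configuration fragment, not tested against current library versions):

```python
import tensorflow as tf
from keras import backend as K

# Build a config that lets the GPU allocator grow on demand
# instead of grabbing all memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Create a session with that config and make Keras use it,
# so every subsequent Keras call runs under this config.
sess = tf.Session(config=config)
K.set_session(sess)
```

With this in place before you build your model, all Keras layers and `fit()` calls run in the growth-enabled session rather than Keras's default one.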

