Problem description

I am working on an object detection project, using the algorithm provided by Livox ( https://github.com/Livox-SDK/livox_detection ). I am running it on an Nvidia Jetson Xavier NX with 8 GB of memory, and for compatibility reasons I am using TensorFlow 1.15.4. The algorithm ran fine with TF 1.13.1 on a GTX 1050 card, but that was too slow.

When I start the algorithm, TensorFlow allocates some memory (for example 3000 MB). But as soon as the algorithm starts receiving data from the lidar sensor, memory usage keeps growing until it almost reaches the 7700 MB that are available. After about a minute all the tracebacks below have been printed and the program runs, but the inference time is about 300 ms (it was 50 ms on the GTX 1050, and it should be 24 ms). I suspect there is a problem with overwriting memory that is already in use. The program works with a pretrained model, and I cannot train a new model because I do not have a dataset.
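
For reference, this is roughly how I measure the inference time (a self-contained sketch: the one-layer dummy graph below only stands in for the real network, which is built in networks/model.py, and the warm-up run is there because the first sess.run also pays for cuDNN initialization):

    import time
    import numpy as np
    import tensorflow as tf

    # Dummy stand-in graph shaped like the (1, 1008, 224, 30) BEV input
    # that the script prints at startup; the real model is much deeper.
    inp = tf.compat.v1.placeholder(tf.float32, [1, 1008, 224, 30])
    out = tf.compat.v1.layers.conv2d(inp, 32, 3, padding="same")

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        data = np.zeros((1, 1008, 224, 30), dtype=np.float32)
        sess.run(out, feed_dict={inp: data})  # warm-up: excludes one-time setup cost
        t0 = time.time()
        sess.run(out, feed_dict={inp: data})
        print("inference time: %.1f ms" % ((time.time() - t0) * 1000.0))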

I will post the main traceback below.

xavier@xavier-desktop:~/livox_detection-master$ python3 livox_rosdetection.py 
2021-03-29 18:43:51.996530: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

1008 224 30
WARNING:tensorflow:From /home/xavier/livox_detection-master/networks/model.py:27: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

(1, 1008, 224, 30)
WARNING:tensorflow:From /home/xavier/livox_detection-master/networks/model.py:77: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.

WARNING:tensorflow:From livox_rosdetection.py:66: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From livox_rosdetection.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From livox_rosdetection.py:72: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-03-29 18:44:07.473530: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-03-29 18:44:07.474464: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x34229010 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 18:44:07.474578: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-29 18:44:07.524156: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-29 18:44:07.760361: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.760946: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x341f0ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-29 18:44:07.761005: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Xavier, Compute Capability 7.2
2021-03-29 18:44:07.761407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.761515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
pciBusID: 0000:00:00.0
2021-03-29 18:44:07.761579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-03-29 18:44:07.835484: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-29 18:44:07.854590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-29 18:44:07.894728: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-29 18:44:07.913741: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-29 18:44:07.932242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-03-29 18:44:07.938152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-03-29 18:44:07.938577: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.939043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.939263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1793] Adding visible gpu devices: 0
2021-03-29 18:44:07.943363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-03-29 18:44:14.396299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-29 18:44:14.396420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0 
2021-03-29 18:44:14.396449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N 
2021-03-29 18:44:14.396959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:14.397271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:14.397482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3882 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)

2021-03-29 18:47:25.321906: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-03-29 18:47:42.396094: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-29 18:48:05.615685: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.89GiB (rounded to 4179165184).  Current allocation summary follows.
2021-03-29 18:48:05.637466: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):   Total Chunks: 21, Chunks in use: 21. 5.2KiB allocated for chunks. 5.2KiB in use in bin. 4.6KiB client-requested in use in bin.

[... then more allocator lines like the one above ...]

2021-03-29 18:48:05.699020: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 291.68MiB
2021-03-29 18:48:05.699180: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 4071376896 memory_limit_: 4071376896 available bytes: 0 curr_region_allocation_bytes_: 8142753792
2021-03-29 18:48:05.699424: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                  4071376896
InUse:                   305850368
MaxInUse:               3307795968
NumAllocs:                     198
MaxAllocSize:           3073374208

2021-03-29 18:48:05.715474: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ********____________________________________________________________________________________________
2021-03-29 18:48:15.799330: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.89GiB (rounded to 4179165184).  Current allocation summary follows.

After that, there is sometimes a traceback like this one:

2021-03-28 14:37:45.895238: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
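
The warning mentions TF_ENABLE_GPU_GARBAGE_COLLECTION; if I wanted to try disabling that feature, my understanding is that the variable has to be set before TensorFlow initializes the GPU, for example like this (an untested sketch):

    import os

    # Must be set before TensorFlow creates the GPU allocator,
    # i.e. before importing tensorflow / creating the first Session.
    os.environ["TF_ENABLE_GPU_GARBAGE_COLLECTION"] = "false"

    import tensorflow as tf  # imported only after the variable is set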

I have already tried to limit how much memory is allocated with this line in "livox_rosdetection.py" from the GitHub link mentioned above:

    # config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = 0.5
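
For context, the surrounding setup looks roughly like this (a sketch of the relevant part of livox_rosdetection.py, matching the tf.ConfigProto / tf.Session deprecation warnings above; the exact file may differ). In my runs, allow_growth is commented out and the 0.5 memory fraction is active:

    import tensorflow as tf

    config = tf.ConfigProto()
    # Option A: let TF grow its GPU allocation on demand
    # config.gpu_options.allow_growth = True
    # Option B: hard-cap TF at a fraction of the 8 GB that CPU and GPU share
    config.gpu_options.per_process_gpu_memory_fraction = 0.5
    sess = tf.Session(config=config)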

One last thing to mention is that the memory is split between the CPU and the GPU. If more information is needed, I will post it.

Does anyone know how I could make the detection faster?

Thanks in advance.

Tags: python, tensorflow, deep-learning, nvidia-jetson
