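python - TensorFlow memory problem
Problem description
I am working on an object detection project using the algorithm provided by Livox ( https://github.com/Livox-SDK/livox_detection ). The project runs on an Nvidia Jetson Xavier NX with 8 GB of memory. For compatibility reasons I am using TensorFlow 1.15.4. The algorithm runs fine with TF 1.13.1 on a GTX 1050 graphics card, but on the Jetson it is too slow.
When I start the algorithm, TensorFlow allocates memory (for example 3000 MB). Once the algorithm starts receiving data from the lidar sensor, memory usage grows until it reaches almost all of the available 7700 MB. After about a minute all of the log output below has been printed and the program runs, but inference takes about 300 ms (it is 50 ms on the GTX 1050 and should be 24 ms). I suspect there is a problem with how already-used memory gets reused. The program works with a pre-trained model, which means I cannot train a new model because I do not have the dataset.
The main log output is posted below.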
xavier@xavier-desktop:~/livox_detection-master$ python3 livox_rosdetection.py
2021-03-29 18:43:51.996530: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
1008 224 30
WARNING:tensorflow:From /home/xavier/livox_detection-master/networks/model.py:27: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
(1, 1008, 224, 30)
WARNING:tensorflow:From /home/xavier/livox_detection-master/networks/model.py:77: The name tf.image.resize_bilinear is deprecated. Please use tf.compat.v1.image.resize_bilinear instead.
WARNING:tensorflow:From livox_rosdetection.py:66: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
WARNING:tensorflow:From livox_rosdetection.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From livox_rosdetection.py:72: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2021-03-29 18:44:07.473530: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-03-29 18:44:07.474464: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x34229010 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 18:44:07.474578: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-29 18:44:07.524156: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-29 18:44:07.760361: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.760946: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x341f0ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-29 18:44:07.761005: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Xavier, Compute Capability 7.2
2021-03-29 18:44:07.761407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.761515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
pciBusID: 0000:00:00.0
2021-03-29 18:44:07.761579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-03-29 18:44:07.835484: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-29 18:44:07.854590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-29 18:44:07.894728: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-29 18:44:07.913741: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-29 18:44:07.932242: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-03-29 18:44:07.938152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-03-29 18:44:07.938577: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.939043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:07.939263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1793] Adding visible gpu devices: 0
2021-03-29 18:44:07.943363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-03-29 18:44:14.396299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-29 18:44:14.396420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2021-03-29 18:44:14.396449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2021-03-29 18:44:14.396959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:14.397271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-03-29 18:44:14.397482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3882 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2021-03-29 18:47:25.321906: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-03-29 18:47:42.396094: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-29 18:48:05.615685: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.89GiB (rounded to 4179165184). Current allocation summary follows.
2021-03-29 18:48:05.637466: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 21, Chunks in use: 21. 5.2KiB allocated for chunks. 5.2KiB in use in bin. 4.6KiB client-requested in use in bin.
[... more allocator lines like the one above ...]
2021-03-29 18:48:05.699020: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 291.68MiB
2021-03-29 18:48:05.699180: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 4071376896 memory_limit_: 4071376896 available bytes: 0 curr_region_allocation_bytes_: 8142753792
2021-03-29 18:48:05.699424: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 4071376896
InUse: 305850368
MaxInUse: 3307795968
NumAllocs: 198
MaxAllocSize: 3073374208
2021-03-29 18:48:05.715474: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ********____________________________________________________________________________________________
2021-03-29 18:48:15.799330: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.89GiB (rounded to 4179165184). Current allocation summary follows.
After that, I sometimes get a message like this:
2021-03-28 14:37:45.895238: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
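A quick check of the numbers in the allocator messages suggests why the warning keeps repeating: the single request (3.89 GiB) is larger than the entire pool TensorFlow was allowed to use (about 3.79 GiB), so freeing other tensors inside that pool cannot satisfy it. The snippet below only redoes that arithmetic with values copied from the log; the variable names are illustrative.

```python
# Values copied verbatim from the bfc_allocator lines in the log above.
GIB = 1024 ** 3
MIB = 1024 ** 2

requested_bytes = 4179165184  # "trying to allocate 3.89GiB (rounded to 4179165184)"
limit_bytes = 4071376896      # "memory_limit_: 4071376896"
in_use_bytes = 305850368      # "InUse: 305850368"

# Memory still free inside TensorFlow's pool at the time of the warning.
free_in_pool = limit_bytes - in_use_bytes

# How far the single request overshoots the *whole* pool, not just the free part.
shortfall = requested_bytes - limit_bytes

print(f"requested : {requested_bytes / GIB:.2f} GiB")
print(f"pool limit: {limit_bytes / GIB:.2f} GiB")
print(f"in use    : {in_use_bytes / MIB:.2f} MiB")
print(f"free      : {free_in_pool / GIB:.2f} GiB")
print(f"request exceeds the whole pool by {shortfall / MIB:.1f} MiB")
```

The pool limit is roughly half of the approximately 7700 MB mentioned in the question, which is consistent with the 0.5 memory fraction set in the script further down.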
I have already tried to limit the amount of allocated memory with these lines in "livox_rosdetection.py" from the GitHub repository linked above:
# config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5
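For reference, both options above live in the `gpu_options` field of a TF 1.x `ConfigProto`. As a minimal sketch (not the Livox script's actual code), the helper below sets them on any config-like object that exposes `gpu_options`; `configure_gpu_memory` and its parameter names are hypothetical and introduced here only for illustration.

```python
def configure_gpu_memory(config, fraction=None, allow_growth=False):
    """Set GPU memory options on a TF 1.x ConfigProto-like object.

    Hypothetical helper for illustration only.

    fraction     -- upper bound on the share of GPU memory the process may
                    use (per_process_gpu_memory_fraction); left untouched
                    when None.
    allow_growth -- when True, TensorFlow starts with a small allocation and
                    grows it on demand instead of reserving memory up front.
    """
    if fraction is not None:
        config.gpu_options.per_process_gpu_memory_fraction = fraction
    config.gpu_options.allow_growth = allow_growth
    return config


# Assumed usage with TensorFlow 1.15 (requires TensorFlow, so shown as comments):
#   import tensorflow as tf
#   config = configure_gpu_memory(tf.compat.v1.ConfigProto(),
#                                 fraction=0.5, allow_growth=True)
#   sess = tf.compat.v1.Session(config=config)
```

On a board where the CPU and GPU share one physical memory, as described in the question, a smaller fraction leaves more memory for the rest of the system but also shrinks the pool available to the model, so the best value is something to measure rather than assume.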
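One last thing worth mentioning is that the board's memory is shared between the CPU and the GPU. If more information is needed, I will post it.
Does anyone know how I can make the detection faster?
Thanks in advance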