Results mismatch between convolution algorithms in Tensorflow/CUDA

Question

I'm training a convolutional autoencoder and noticed this warning:

Tensorflow: 2.5-gpu from pip
Driver: 460.80
cuda: 11.2.2
cudnn: 8.1.1
XLA: Yes
Mixed precision: Yes
26/27 [===========================>..] - ETA: 0s - loss: 1.0554 - pre_dense_out_loss: 0.9997 - de_conv1dtranspose_out_loss: 0.5578
2021-06-05 21:28:17.678118: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 0: 95.25 vs 80.8125
2021-06-05 21:28:17.678132: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 1: 95.6875 vs 81
2021-06-05 21:28:17.678136: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 2: 95.4375 vs 82.125
2021-06-05 21:28:17.678139: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 3: 95.3125 vs 80.5625
2021-06-05 21:28:17.678141: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 4: 95.375 vs 81.3125
2021-06-05 21:28:17.678145: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 5: 94.9375 vs 79.8125
2021-06-05 21:28:17.678148: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 6: 95.3125 vs 81
2021-06-05 21:28:17.678151: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 7: 95.625 vs 82
2021-06-05 21:28:17.678153: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 8: 94.75 vs 78.5625
2021-06-05 21:28:17.678156: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:682] Difference at 9: 95.25 vs 80.25
2021-06-05 21:28:17.678170: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:545] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%custom-call.20 = (f16[1,5,24,24]{2,1,0,3}, u8[0]{0}) custom-call(f16[3778,1,50,24]{3,2,1,0} %bitcast.237, f16[3778,1,10,24]{3,2,1,0} %arg45.46), window={size=1x5 stride=1x5}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="Conv2DBackpropFilter" op_name="gradient_tape/model/de_conv1dtranspose_2/conv1d_transpose/Conv2DBackpropFilter"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for 1+TC vs 0+TC
2021-06-05 21:28:17.678174: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:192] Device: GeForce RTX 3070
2021-06-05 21:28:17.678177: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:193] Platform: Compute Capability 8.6
2021-06-05 21:28:17.678180: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:194] Driver: 11020 (460.80.0)
2021-06-05 21:28:17.678182: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:195] Runtime: <undefined>
2021-06-05 21:28:17.678185: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:202] cudnn version: 8.1.1

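This is a fresh install on Ubuntu 20.04. I did not notice this warning earlier when running on an RTX 2060 on Windows. The input data is fairly large, so an MRE may be difficult. Does anyone know what this warning is about?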

Tags: tensorflow, deconvolution

Answer


This may be an effect of accumulating in a low-precision data type (e.g. FP16).

Which data types are you using? And which algorithms?

From: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html

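  1. Mixed Precision Numerical Accuracy

When the computation precision and the output precision are not the same, the numerical accuracy may vary from one algorithm to another.

For example, when the computation is performed in FP32 and the output is in FP16, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 (ALGO_0) has lower accuracy than CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (ALGO_1). This is because ALGO_0 does not use extra workspace and is forced to accumulate the intermediate results in FP16, i.e. half-precision float, which reduces the accuracy. ALGO_1, on the other hand, uses additional workspace to accumulate the intermediate values in FP32, i.e. full-precision float.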
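
The effect of the accumulator type can be reproduced outside cuDNN with a small NumPy sketch. This is only a toy model, not what the cuDNN kernels actually do: it sums the same FP16 values once with an FP16 running total and once with an FP32 running total, then rounds both results to FP16. The helper name `sequential_sum`, the value 0.025, and the count 3778 (borrowed from the batch dimension in the log, purely for scale) are assumptions made for illustration.

```python
import numpy as np

def sequential_sum(values, acc_dtype):
    """Add values one at a time, rounding the running total to acc_dtype."""
    total = acc_dtype(0)
    for v in values:
        total = acc_dtype(total + acc_dtype(v))
    return total

# Hypothetical inputs: many small FP16 terms, as in a reduction over a batch.
values = np.full(3778, 0.025, dtype=np.float16)

# Float64 reference for comparison.
reference = float(np.sum(values, dtype=np.float64))

# Accumulate in FP16, output in FP16 (loosely analogous to ALGO_0).
sum_fp16_acc = np.float16(sequential_sum(values, np.float16))

# Accumulate in FP32, then round the output to FP16 (loosely analogous to ALGO_1).
sum_fp32_acc = np.float16(sequential_sum(values, np.float32))

print("reference        :", reference)
print("FP16 accumulator :", float(sum_fp16_acc))
print("FP32 accumulator :", float(sum_fp32_acc))
```

With an FP16 running total, each small term is eventually lost to rounding once the total is large relative to it, so the sum stops growing and falls well short of the reference. The FP32 accumulator stays close to the reference, and only the final rounding to FP16 costs precision. A gap of this kind between two algorithms is what the buffer comparator reports.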

