python - Could not synchronize on CUDA context: CUDA_ERROR_NOT_INITIALIZED: tensorflow-gpu runs extremely slowly on an RTX 2070 Super
Problem description
Short version: I wrote a genetic algorithm with Keras in which every model of an epoch receives the same input and produces a different output that I can evaluate. For whatever reason, TensorFlow uses my GPU at barely 1% of its capability (but at least it uses all the RAM). The whole process is only about twice as fast as my CPU. So I wanted to use multiprocessing to train at least 100 models at the same time, but CUDA seems to have a problem with that.
The error output:
2020-09-29 09:23:45.040414: E tensorflow/stream_executor/cuda/cuda_driver.cc:951] could not synchronize on CUDA context: CUDA_ERROR_NOT_INITIALIZED: initialization error :: *** Begin stack trace ***
tensorflow::CurrentStackTrace()
stream_executor::gpu::GpuDriver::SynchronizeContext(stream_executor::gpu::GpuContext*)
stream_executor::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
tensorflow::BaseGPUDevice::Sync()
tensorflow::TensorHandle::CopyToDevice(tensorflow::EagerContext const&, tensorflow::Device*, tensorflow::Tensor*)
tensorflow::TensorHandle::Resolve(tensorflow::Status*)
TFE_TensorHandleResolve
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
_PyObject_Call_Prepend
_PyObject_MakeTpCall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
_PyObject_Call_Prepend
_PyObject_MakeTpCall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
PyEval_EvalCodeEx
PyEval_EvalCode
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
PyEval_EvalCodeEx
PyEval_EvalCode
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
Py_BytesMain
__libc_start_main
*** End stack trace ***
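For what it's worth, CUDA_ERROR_NOT_INITIALIZED under multiprocessing is the classic symptom of fork-based workers inheriting the parent's already-initialized CUDA context, which is not fork-safe. A common workaround (a minimal sketch; `train_one_model` is a hypothetical worker and the real TensorFlow work would go inside it) is to use the `spawn` start method and import TensorFlow only inside each worker process:

```python
import multiprocessing as mp

def train_one_model(model_index, timesteps):
    # In the real code, import tensorflow *here*, inside the worker,
    # so each spawned process creates its own fresh CUDA context.
    # import tensorflow as tf
    result = sum(range(timesteps))  # stand-in for the per-model work
    return model_index, result

if __name__ == "__main__":
    # "spawn" starts each worker as a fresh interpreter instead of
    # fork()ing the parent, so no half-initialized CUDA state is inherited.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        results = pool.starmap(train_one_model, [(i, 1000) for i in range(4)])
    print(sorted(results))  # [(0, 499500), (1, 499500), (2, 499500), (3, 499500)]
```

Note that even with this fix, several processes sharing one GPU will contend for its memory, so each worker would also need a per-process memory limit.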
Long version: First of all, thanks for reading. As I said, I'm writing a genetic algorithm with TensorFlow. For example, I have 5 epochs, each with 10 models and 1000 timesteps to train them on. The catch is that after every timestep (every time I feed something through a model) I use the model's output to run some code that affects the input of the next timestep. So my code looks something like this:
for e in range(len(epochs)):
    for m in range(len(models)):
        for t in range(len(timesteps)):
            output = current_pool[m].predict(x=neural_input, batch_size=1)
            do_something(output)
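One likely reason for the low utilization is `predict(..., batch_size=1)` in the innermost loop: `predict` is designed for large batches and sets up a fresh iteration loop on every call, so with a single sample almost all the time goes to per-call overhead rather than GPU compute. A sketch of the usual mitigation (the tiny model here is a hypothetical stand-in for one member of `current_pool`) is to call the model directly, optionally wrapped in a `tf.function`:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for one model of current_pool.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Calling the model directly skips predict()'s per-call setup;
# tf.function additionally compiles the step into a single graph.
@tf.function
def fast_step(x):
    return model(x, training=False)

neural_input = np.zeros((1, 4), dtype=np.float32)
output = fast_step(tf.constant(neural_input))
print(output.shape)  # (1, 2)
```

For a loop that runs 50,000 single-sample steps, removing that fixed per-call cost typically matters far more than raw GPU horsepower, which would also explain why a slower card with a different software stack can appear faster.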
I installed tensorflow-gpu and ran this test to see whether the GPU is actually being used:
print("GPU Available: ", tf.test.is_gpu_available())
The output was:
2020-09-29 10:29:34.023544: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-29 10:29:34.048725: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3392090000 Hz
2020-09-29 10:29:34.049098: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d402523970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-29 10:29:34.049123: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-29 10:29:34.051904: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-29 10:29:34.158278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.158838: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d4025bef00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-29 10:29:34.158855: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
2020-09-29 10:29:34.159028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.159447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-09-29 10:29:34.159484: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-29 10:29:34.160743: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-29 10:29:34.161988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-29 10:29:34.162231: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-29 10:29:34.163486: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-29 10:29:34.164267: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-29 10:29:34.167046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-29 10:29:34.167242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.167759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.168171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-29 10:29:34.168225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-29 10:29:34.522348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-29 10:29:34.522386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-09-29 10:29:34.522393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-09-29 10:29:34.522607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.523069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.523476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 7267 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
GPU Available: True
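As an aside, `tf.test.is_gpu_available()` is deprecated in TF 2.x; the documented replacement is `tf.config.list_physical_devices`:

```python
import tensorflow as tf

# Returns a (possibly empty) list of GPUs visible to TensorFlow.
gpus = tf.config.list_physical_devices('GPU')
print("GPU Available:", len(gpus) > 0)
```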
So my GPU is in use, good. Now my first problem is that my GPU (RTX 2070 Super) is very slow. It doesn't even use 1% of its power. My old GPU, a GTX 770, was 10x faster with essentially the same code. (I have to say it was quite difficult to get the GTX 770 running with tensorflow-gpu support because of its compute capability; I used a prebuilt TensorFlow wheel from GitHub with compute capability 3.0 support.)
With my RTX 2070 Super, I use a conda tensorflow-gpu environment on my Linux server so the GPU stays dedicated to it. I connect via SSH from my Windows PC to the environment on my Linux server (Ubuntu 18.04). But the GPU sitting on a Linux server shouldn't be a problem, or could it?
Below is everything installed in that tensorflow-gpu environment:
_libgcc_mutex 0.1 main
_tflow_select 2.1.0 gpu
absl-py 0.10.0 py38_0
astunparse 1.6.3 py_0
blas 1.0 mkl
blinker 1.4 py38_0
brotlipy 0.7.0 py38h7b6447c_1000
c-ares 1.16.1 h7b6447c_0
ca-certificates 2020.7.22 0
cachetools 4.1.1 py_0
certifi 2020.6.20 py38_0
cffi 1.14.3 py38he30daa8_0
chardet 3.0.4 py38_1003
click 7.1.2 py_0
cryptography 3.1 py38h1ba5d50_0
cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
cupti 10.1.168 0
cycler 0.10.0 py38_0
dbus 1.13.16 hb2f20db_0
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
gast 0.3.3 py_0
glib 2.65.0 h3eb4bd4_0
google-auth 1.21.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
grpcio 1.31.0 py38hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
h5py 2.10.0 py38hd6299e0_1
hdf5 1.10.6 hb1b8bf9_0
icu 58.2 he6710b0_3
idna 2.10 py_0
importlib-metadata 1.7.0 py38_0
intel-openmp 2020.2 254
jpeg 9b h024ee3a_2
keras 2.4.3 0
keras-base 2.4.3 py_0
keras-preprocessing 1.1.0 py_1
kiwisolver 1.2.0 py38hfd86e86_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.4 hd408876_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
lz4-c 1.9.2 he6710b0_1
markdown 3.2.2 py38_0
matplotlib 3.3.1 0
matplotlib-base 3.3.1 py38h817c723_0
mkl 2020.2 256
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.2.0 py38h23d657b_0
mkl_random 1.1.1 py38h0573a6f_0
ncurses 6.2 he6710b0_1
numpy 1.19.1 py38hbc911f0_0
numpy-base 1.19.1 py38hfa32c7d_0
oauthlib 3.1.0 py_0
olefile 0.46 py_0
openssl 1.1.1h h7b6447c_0
opt_einsum 3.1.0 py_0
pandas 1.1.1 py38he6710b0_0
pcre 8.44 he6710b0_0
pillow 7.2.0 py38hb39fc2d_0
pip 20.2.2 py38_0
protobuf 3.12.4 py38he6710b0_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.8 py_0
pycparser 2.20 py_2
pyjwt 1.7.1 py38_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py38h05f1152_4
pysocks 1.7.1 py38_0
python 3.8.5 h7579374_1
python-dateutil 2.8.1 py_0
pytz 2020.1 py_0
pyyaml 5.3.1 py38h7b6447c_1
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 py_0
rsa 4.6 py_0
scipy 1.5.2 py38h0b6359f_0
setuptools 49.6.0 py38_0
sip 4.19.13 py38he6710b0_0
six 1.15.0 py_0
sqlite 3.33.0 h62c20be_0
tensorboard 2.2.1 pyh532a8cf_0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 2.2.0 gpu_py38hb782248_0
tensorflow-base 2.2.0 gpu_py38h83e3d50_0
tensorflow-estimator 2.2.0 pyh208ff02_0
tensorflow-gpu 2.2.0 h0d30ee6_0
termcolor 1.1.0 py38_1
tk 8.6.10 hbc83047_0
tornado 6.0.4 py38h7b6447c_1
urllib3 1.25.10 py_0
werkzeug 1.0.1 py_0
wheel 0.35.1 py_0
wrapt 1.12.1 py38h7b6447c_1
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
So, to wrap up, I have two questions:
- Why is my GPU so slow when tensorflow-gpu is installed and TensorFlow is demonstrably using it, yet my old Nvidia GTX 770 was 10x faster than the far more powerful RTX 2070 Super?
- Why can't I use multiprocessing to run several models at once, even though the models of one epoch can be trained completely independently?
Thanks for your time; I hope you have some ideas that can help me. :) If you need more information about my system, just let me know.