python - Could not synchronize on CUDA context: CUDA_ERROR_NOT_INITIALIZED: tensorflow-gpu runs extremely slowly on an RTX 2070 Super
Problem description
Short version: I wrote a genetic algorithm with Keras in which every model of an epoch receives the same input and produces a different output that I can evaluate. For whatever reason, TensorFlow uses my GPU at barely 1% of its capability (but at least it uses all the RAM). The whole process is only about twice as fast as my CPU. So I wanted to use multiprocessing to train at least 100 models at the same time, but CUDA seems to have a problem with that.
The error output:
2020-09-29 09:23:45.040414: E tensorflow/stream_executor/cuda/cuda_driver.cc:951] could not synchronize on CUDA context: CUDA_ERROR_NOT_INITIALIZED: initialization error :: *** Begin stack trace ***
tensorflow::CurrentStackTrace()
stream_executor::gpu::GpuDriver::SynchronizeContext(stream_executor::gpu::GpuContext*)
stream_executor::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
tensorflow::BaseGPUDevice::Sync()
tensorflow::TensorHandle::CopyToDevice(tensorflow::EagerContext const&, tensorflow::Device*, tensorflow::Tensor*)
tensorflow::TensorHandle::Resolve(tensorflow::Status*)
TFE_TensorHandleResolve
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
_PyObject_Call_Prepend
_PyObject_MakeTpCall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
PyVectorcall_Call
_PyEval_EvalFrameDefault
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyObject_FastCallDict
_PyObject_Call_Prepend
_PyObject_MakeTpCall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
PyEval_EvalCodeEx
PyEval_EvalCode
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
PyEval_EvalCodeEx
PyEval_EvalCode
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
_PyEval_EvalCodeWithName
_PyFunction_Vectorcall
PyVectorcall_Call
Py_BytesMain
__libc_start_main
*** End stack trace ***
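For what it's worth, CUDA_ERROR_NOT_INITIALIZED under multiprocessing is the classic symptom of fork-based workers inheriting the parent's already-initialized CUDA context, which is not fork-safe. A common workaround (a minimal sketch; `train_one_model` is a hypothetical worker and the real TensorFlow work would go inside it) is to use the `spawn` start method and import TensorFlow only inside each worker process:

```python
import multiprocessing as mp

def train_one_model(model_index, timesteps):
    # In the real code, import tensorflow *here*, inside the worker,
    # so each spawned process creates its own fresh CUDA context.
    # import tensorflow as tf
    result = sum(range(timesteps))  # stand-in for the per-model work
    return model_index, result

if __name__ == "__main__":
    # "spawn" starts each worker as a fresh interpreter instead of
    # fork()ing the parent, so no half-initialized CUDA state is inherited.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        results = pool.starmap(train_one_model, [(i, 1000) for i in range(4)])
    print(sorted(results))  # [(0, 499500), (1, 499500), (2, 499500), (3, 499500)]
```

Note that even with this fix, several processes sharing one GPU will contend for its memory, so each worker would also need a per-process memory limit.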
Long version: First of all, thanks for reading. As I said, I'm writing a genetic algorithm with TensorFlow. For example, I have 5 epochs, each with 10 models and 1000 timesteps to train them on. The catch is that after every timestep (every time I feed something through a model) I use the model's output to run some code that affects the input of the next timestep. So my code looks something like this:
for e in range(len(epochs)):
    for m in range(len(models)):
        for t in range(len(timesteps)):
            output = current_pool[m].predict(x=neural_input, batch_size=1)
            do_something(output)
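One likely reason for the low utilization is `predict(..., batch_size=1)` in the innermost loop: `predict` is designed for large batches and sets up a fresh iteration loop on every call, so with a single sample almost all the time goes to per-call overhead rather than GPU compute. A sketch of the usual mitigation (the tiny model here is a hypothetical stand-in for one member of `current_pool`) is to call the model directly, optionally wrapped in a `tf.function`:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for one model of current_pool.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Calling the model directly skips predict()'s per-call setup;
# tf.function additionally compiles the step into a single graph.
@tf.function
def fast_step(x):
    return model(x, training=False)

neural_input = np.zeros((1, 4), dtype=np.float32)
output = fast_step(tf.constant(neural_input))
print(output.shape)  # (1, 2)
```

For a loop that runs 50,000 single-sample steps, removing that fixed per-call cost typically matters far more than raw GPU horsepower, which would also explain why a slower card with a different software stack can appear faster.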
I installed tensorflow-gpu and ran this test to see whether the GPU is actually being used:
print("GPU Available: ", tf.test.is_gpu_available())
The output was:
2020-09-29 10:29:34.023544: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-29 10:29:34.048725: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3392090000 Hz
2020-09-29 10:29:34.049098: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d402523970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-29 10:29:34.049123: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-29 10:29:34.051904: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-29 10:29:34.158278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.158838: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d4025bef00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-29 10:29:34.158855: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
2020-09-29 10:29:34.159028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.159447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-09-29 10:29:34.159484: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-29 10:29:34.160743: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-29 10:29:34.161988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-29 10:29:34.162231: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-29 10:29:34.163486: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-29 10:29:34.164267: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-29 10:29:34.167046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-29 10:29:34.167242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.167759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.168171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-29 10:29:34.168225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-29 10:29:34.522348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-29 10:29:34.522386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-09-29 10:29:34.522393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-09-29 10:29:34.522607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.523069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-29 10:29:34.523476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 7267 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
GPU Available: True
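As an aside, `tf.test.is_gpu_available()` is deprecated in TF 2.x; the documented replacement is `tf.config.list_physical_devices`:

```python
import tensorflow as tf

# Returns a (possibly empty) list of GPUs visible to TensorFlow.
gpus = tf.config.list_physical_devices('GPU')
print("GPU Available:", len(gpus) > 0)
```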
So my GPU is in use, good. Now my first problem is that my GPU (RTX 2070 Super) is very slow. It doesn't even use 1% of its power. My old GPU, a GTX 770, was 10x faster with essentially the same code. (I have to say it was quite difficult to get the GTX 770 running with tensorflow-gpu support because of its compute capability; I used a prebuilt TensorFlow wheel from GitHub with compute capability 3.0 support.)
With my RTX 2070 Super, I use a conda tensorflow-gpu environment on my Linux server so the GPU stays dedicated to it. I connect via SSH from my Windows PC to the environment on my Linux server (Ubuntu 18.04). But the GPU sitting on a Linux server shouldn't be a problem, or could it?
Below is everything installed in that tensorflow-gpu environment:
_libgcc_mutex 0.1 main
_tflow_select 2.1.0 gpu
absl-py 0.10.0 py38_0
astunparse 1.6.3 py_0
blas 1.0 mkl
blinker 1.4 py38_0
brotlipy 0.7.0 py38h7b6447c_1000
c-ares 1.16.1 h7b6447c_0
ca-certificates 2020.7.22 0
cachetools 4.1.1 py_0
certifi 2020.6.20 py38_0
cffi 1.14.3 py38he30daa8_0
chardet 3.0.4 py38_1003
click 7.1.2 py_0
cryptography 3.1 py38h1ba5d50_0
cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
cupti 10.1.168 0
cycler 0.10.0 py38_0
dbus 1.13.16 hb2f20db_0
expat 2.2.9 he6710b0_2
fontconfig 2.13.0 h9420a91_0
freetype 2.10.2 h5ab3b9f_0
gast 0.3.3 py_0
glib 2.65.0 h3eb4bd4_0
google-auth 1.21.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
grpcio 1.31.0 py38hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb31296c_0
h5py 2.10.0 py38hd6299e0_1
hdf5 1.10.6 hb1b8bf9_0
icu 58.2 he6710b0_3
idna 2.10 py_0
importlib-metadata 1.7.0 py38_0
intel-openmp 2020.2 254
jpeg 9b h024ee3a_2
keras 2.4.3 0
keras-base 2.4.3 py_0
keras-preprocessing 1.1.0 py_1
kiwisolver 1.2.0 py38hfd86e86_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.4 hd408876_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 he19cac6_1
lz4-c 1.9.2 he6710b0_1
markdown 3.2.2 py38_0
matplotlib 3.3.1 0
matplotlib-base 3.3.1 py38h817c723_0
mkl 2020.2 256
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.2.0 py38h23d657b_0
mkl_random 1.1.1 py38h0573a6f_0
ncurses 6.2 he6710b0_1
numpy 1.19.1 py38hbc911f0_0
numpy-base 1.19.1 py38hfa32c7d_0
oauthlib 3.1.0 py_0
olefile 0.46 py_0
openssl 1.1.1h h7b6447c_0
opt_einsum 3.1.0 py_0
pandas 1.1.1 py38he6710b0_0
pcre 8.44 he6710b0_0
pillow 7.2.0 py38hb39fc2d_0
pip 20.2.2 py38_0
protobuf 3.12.4 py38he6710b0_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.8 py_0
pycparser 2.20 py_2
pyjwt 1.7.1 py38_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyqt 5.9.2 py38h05f1152_4
pysocks 1.7.1 py38_0
python 3.8.5 h7579374_1
python-dateutil 2.8.1 py_0
pytz 2020.1 py_0
pyyaml 5.3.1 py38h7b6447c_1
qt 5.9.7 h5867ecd_1
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-oauthlib 1.3.0 py_0
rsa 4.6 py_0
scipy 1.5.2 py38h0b6359f_0
setuptools 49.6.0 py38_0
sip 4.19.13 py38he6710b0_0
six 1.15.0 py_0
sqlite 3.33.0 h62c20be_0
tensorboard 2.2.1 pyh532a8cf_0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 2.2.0 gpu_py38hb782248_0
tensorflow-base 2.2.0 gpu_py38h83e3d50_0
tensorflow-estimator 2.2.0 pyh208ff02_0
tensorflow-gpu 2.2.0 h0d30ee6_0
termcolor 1.1.0 py38_1
tk 8.6.10 hbc83047_0
tornado 6.0.4 py38h7b6447c_1
urllib3 1.25.10 py_0
werkzeug 1.0.1 py_0
wheel 0.35.1 py_0
wrapt 1.12.1 py38h7b6447c_1
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
So, to wrap up, I have two questions:
- Why is my GPU so slow when tensorflow-gpu is installed and TensorFlow is demonstrably using it, yet my old Nvidia GTX 770 was 10x faster than the far more powerful RTX 2070 Super?
- Why can't I use multiprocessing to run several models at once, even though the models of one epoch can be trained completely independently?
Thanks for your time; I hope you have some ideas that can help me. :) If you need more information about my system, just let me know.