CUDA runtime error: which CUDA version is compatible with running an NER task using BERT-NER?

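Problem description

I have installed all the packages listed in the requirements on my VM, and found that no NVIDIA GPU driver is installed. The requirements contain no instructions for installing an NVIDIA GPU driver. I would like to know which CUDA version, and which NVIDIA driver compatible with it, I need in order to resolve the error below.

GitHub link: github

Error log: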

  File "run_ner.py", line 594, in <module>
    main()
  File "run_ner.py", line 489, in main
    loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "run_ner.py", line 35, in forward
    valid_output = torch.zeros(batch_size,max_len,feat_dim,dtype=torch.float32,device='cuda')
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
    _check_driver()
  File "/home/pt3_gcp/BERT-NER/ber_ner/lib/python3.7/site-packages/torch/cuda/__init__.py", line 99, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
**Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
**

After installing the latest CUDA version from the following link (cuda), I get the following error:

06/04/2020 07:38:40 - INFO - __main__ -   ***** Running training *****
06/04/2020 07:38:40 - INFO - __main__ -     Num examples = 14041
06/04/2020 07:38:40 - INFO - __main__ -     Batch size = 32
06/04/2020 07:38:40 - INFO - __main__ -     Num steps = 2190
Epoch:   0%|                                                                                 | 0/5 [00:00<?, ?it/sTHCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detectedt/s]
Traceback (most recent call last):
  File "run_ner.py", line 594, in <module>
    main()
  File "run_ner.py", line 489, in main
    loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask)
  File "/home/pt3_gcp/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "run_ner.py", line 35, in forward
    valid_output = torch.zeros(batch_size,max_len,feat_dim,dtype=torch.float32,device='cuda')
  File "/home/pt3_gcp/.local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50

Tags: pytorch, named-entity-recognition, huggingface-transformers, bert-language-model

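Solution


I had the same problem a while ago. The following commands fixed it for me!

If you have done several installations and tried a lot of things by now, the leftovers will cause problems. Basically, remove everything: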

sudo apt-get purge 'nvidia-*'
sudo apt-get remove nvidia-cuda-toolkit
sudo apt autoremove --purge cuda-10-0  # you might have a different version; check it with nvcc --version

Also remove any existing files under /usr/local:

sudo rm -rf /usr/local/cuda*  # anything related to cuda
sudo rm -rf /usr/local/nvidia*  # anything related to nvidia
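
Before running the cleanup commands above, it can help to see which NVIDIA and CUDA packages dpkg actually knows about, so you can tell which version to purge. The following is a minimal Python sketch, assuming a Debian/Ubuntu VM where dpkg-query is available. The function names parse_packages and list_gpu_packages are illustrative and not part of the original answer.

```python
import shutil
import subprocess


def parse_packages(dpkg_output):
    """Return (name, version) pairs for NVIDIA/CUDA-related packages.

    Expects one "name<TAB>version" pair per line.
    """
    found = []
    for line in dpkg_output.splitlines():
        name, sep, version = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        if name.startswith(("nvidia", "cuda", "libnvidia", "libcuda")):
            found.append((name, version.strip()))
    return found


def list_gpu_packages():
    """Query dpkg for known packages; returns [] if dpkg-query is missing.

    Note: the listing may include packages that were removed but not purged.
    """
    if shutil.which("dpkg-query") is None:
        return []
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package}\t${Version}\n"],
        capture_output=True, text=True, check=False,
    )
    return parse_packages(result.stdout)


if __name__ == "__main__":
    for name, version in list_gpu_packages():
        print(name, version)
```

If this prints nothing after the purge, the package side of the cleanup is most likely complete.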

Now, finally, do a fresh install:

sudo apt-get update  # update your package lists

sudo apt search nvidia-driver  # find the latest driver version, then install it with

sudo apt install nvidia-driver-450  # or another number, depending on the latest version
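
Picking the highest driver number out of the search output by eye is easy to get wrong. As a rough sketch, the helper below extracts nvidia-driver-<N> package names from text shaped like package search output and returns the one with the largest N. The function name latest_driver is made up for illustration. The newest branch is not always the right one for a given GPU, so treat the result as a starting point.

```python
import re


def latest_driver(search_output):
    """Return the highest-numbered nvidia-driver-<N> package name, or None."""
    # Match names like "nvidia-driver-450" but not "nvidia-driver-450-server"
    # or names where "nvidia-driver" is only a suffix of a longer package name.
    pattern = r"(?<![\w-])nvidia-driver-(\d+)(?![\w-])"
    versions = {int(m.group(1)) for m in re.finditer(pattern, search_output)}
    if not versions:
        return None
    return f"nvidia-driver-{max(versions)}"


if __name__ == "__main__":
    import sys

    # Usage (assumed): sudo apt search nvidia-driver | python3 this_script.py
    print(latest_driver(sys.stdin.read()))
```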

You must reboot after installing!

sudo reboot

Once the machine is back up, nvidia-smi should work, and your GPUs should work too.
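
To confirm the fix from Python as well, a small check along these lines can be run after the reboot. It is a sketch, assuming nvidia-smi is on the PATH once the driver is installed and that PyTorch may or may not be importable in the current environment. gpu_status is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_status():
    """Report whether nvidia-smi and PyTorch can each see a GPU."""
    status = {"nvidia_smi_found": False, "nvidia_smi_ok": False, "torch_cuda": None}

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        status["nvidia_smi_found"] = True
        try:
            # "nvidia-smi -L" lists detected GPUs; a non-zero exit code
            # means the tool could not talk to the driver.
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            status["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass

    try:
        import torch
        status["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        pass  # PyTorch is not installed in this environment; leave as None

    return status


if __name__ == "__main__":
    print(gpu_status())
```

If torch_cuda is False while nvidia_smi_ok is True, the installed PyTorch build may be CPU-only or built against a CUDA version newer than the driver supports.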

