Manually connecting to Kaggle's GPU

Problem description

I know most users build models with TensorFlow or PyTorch, but I am trying to port a model written in Paddle (called ernie-doc) so it runs on Kaggle, and I suspect some GPU connection problem has occurred.

!pip install -q -U paddlepaddle-gpu
import paddle
import paddle.fluid as fluid
paddle.enable_static()
# as the documentation suggests, verify the installation
fluid.install_check.run_check()

It runs successfully:

Running Verify Fluid Program ... 
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid

However, when fitting the model, things get strange:

import os
import sys

sys.path.append(os.path.abspath("/kaggle/input/erniedoc/ernie-doc"))
from finetune.classifier import create_model, evaluate
...
print("use gpu...")
place = fluid.CUDAPlace(0)
startup_prog = fluid.Program()
train_program = fluid.Program()
origin_train_program = train_program
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
...
outputs = evaluate(exe, train_program, train_pyreader, graph_vars,
                   train_mems_vars, tower_mems_np,
                   "train", steps, trainer_id, dev_count, scheduled_lr,
                   use_vars=args.use_vars)
...
...

It complains:

RuntimeError                              Traceback (most recent call last)
<ipython-input-8-51b504e78714> in main(args)
    163                         outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars, 
    164                                         train_mems_vars, tower_mems_np,
--> 165                                        "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
    166                         tower_mems_np = outputs['tower_mems_np']
    167 

...

/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
   1230         else:
   1231             self._default_executor.run_prepared_ctx(ctx, scope, False, False,
-> 1232                                                     False)
   1233         arr = scope.find_var(fetch_var_name).get_fetch_list()
   1234         tensors = arr._move_to_list()

RuntimeError: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2   paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3   paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4   paddle::memory::allocation::Allocator::Allocate(unsigned long)
5   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10  paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15  paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
16  paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 432.000244MB memory on GPU 0, 15.811646GB memory has been allocated and available memory is only 89.750000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 

 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
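The numbers in the message are internally consistent: Paddle's allocator already holds about 15.81 GB of the 16 GB card, so the new 432 MB request cannot possibly fit. A quick sanity check of the arithmetic, using only the figures quoted in the error message:

```python
# Figures quoted from the error message above
allocated_mb = 15.811646 * 1024   # memory already held by Paddle's allocator, in MiB
available_mb = 89.75              # free memory the allocator reports
requested_mb = 432.000244         # size of the failed allocation

# The request is far larger than what remains free
print(requested_mb > available_mb)         # True

# Allocated + available accounts for essentially the whole 16 GB card
print(round(allocated_mb + available_mb))  # 16281
```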

What is going on here? The script was adapted from the official one, and some memory clearly was allocated, so I assume the GPU is connected and the script has no errors up to this point. But why does this happen? GPU 0 has 16 GB of memory and nothing else is running. Checking the GPU info afterwards:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    35W / 250W |  16191MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Should I stop some process, or do something else? Any advice would be greatly appreciated!

Tags: python, gpu, kaggle, paddle-paddle

Solution
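The nvidia-smi output shows 16191MiB occupied yet no process in the list, which on Kaggle means the notebook's own Paddle allocator is holding the memory rather than some other user's job. A minimal sketch of the usual remedies, assuming the standard Paddle environment flags (`FLAGS_fraction_of_gpu_memory_to_use` and `FLAGS_allocator_strategy` are real Paddle flags; the values shown are illustrative, not tuned for this model):

```python
import os

# These flags must be set BEFORE `import paddle`, or they are ignored.
# Cap the fraction of GPU memory Paddle reserves up front (illustrative value):
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.8"
# Grow allocations on demand instead of pre-reserving a large pool:
os.environ["FLAGS_allocator_strategy"] = "auto_growth"

# import paddle  # import only after the flags are in place
```

If the memory is still pinned from a previous run, restarting the Kaggle kernel releases it. Failing that, the error message's second suggestion applies: reduce the batch size passed to the ernie-doc run script, since a model this large can easily exhaust a 16 GB P100.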
