python - Manually connecting to Kaggle's GPU
Problem Description
I know most users work with TensorFlow or PyTorch as their modeling framework, but I am trying to get a model written in Paddle (called ernie-doc) running on Kaggle, and I suspect some GPU connection problem has occurred.
!pip install -q -U paddlepaddle-gpu
import paddle
import paddle.fluid as fluid
paddle.enable_static()
# as the document suggests, check
fluid.install_check.run_check()
It runs successfully:
Running Verify Fluid Program ...
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid
However, when fitting the model, things get weird:
import os
import sys

sys.path.append(os.path.abspath("/kaggle/input/erniedoc/ernie-doc"))
from finetune.classifier import create_model, evaluate
...
print("use gpu...")
place = fluid.CUDAPlace(0)
startup_prog = fluid.Program()
train_program = fluid.Program()
origin_train_program = train_program
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
...
outputs = evaluate(exe, train_program, train_pyreader, graph_vars,
                   train_mems_vars, tower_mems_np,
                   "train", steps, trainer_id, dev_count, scheduled_lr,
                   use_vars=args.use_vars)
...
It complains:
RuntimeError Traceback (most recent call last)
<ipython-input-8-51b504e78714> in main(args)
163 outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
164 train_mems_vars, tower_mems_np,
--> 165 "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
166 tower_mems_np = outputs['tower_mems_np']
167
...
/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
1230 else:
1231 self._default_executor.run_prepared_ctx(ctx, scope, False, False,
-> 1232 False)
1233 arr = scope.find_var(fetch_var_name).get_fetch_list()
1234 tensors = arr._move_to_list()
RuntimeError:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2 paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4 paddle::memory::allocation::Allocator::Allocate(unsigned long)
5 paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10 paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15 paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
16 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 432.000244MB memory on GPU 0, 15.811646GB memory has been allocated and available memory is only 89.750000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
What is going on here? The script was adapted from the official one, and some memory clearly did get allocated, so I assume the GPU is connected and the script has run correctly up to this point. But why does this happen? GPU 0 has 16GB of memory and nothing else is running. Checking the GPU info afterwards:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P0 35W / 250W | 16191MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Should I stop some process, or do something else? Any advice would be greatly appreciated!
Solution
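The error message itself narrows this down. The 15.8GB shown as allocated belongs to this notebook's own Paddle process: Kaggle runs the kernel in a container, so `nvidia-smi` hides the process list, and the 16191MiB of usage with an empty process table is the memory pool Paddle has already reserved, not another user's job. So suggestion 1 in the message ("stop other processes") does not apply; the fix is suggestion 2: restart the kernel to release the pool, then make the run fit inside 16GB. A minimal sketch of the two usual knobs follows, assuming the `FLAGS_*` environment variables behave as described in the Paddle fluid documentation; `shrink_batch` and its `args` are hypothetical stand-ins for the ernie-doc finetuning argument namespace:

```python
import os

# These flags only take effect if set before `import paddle`.
# Cap Paddle's GPU memory pool instead of letting it claim ~92% of the card.
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.5"
# Release temporary tensors eagerly rather than caching them (assumption:
# this eager-deletion flag is honored by the fluid executor in this version).
os.environ["FLAGS_eager_delete_tensor_gb"] = "0.0"


def shrink_batch(args):
    """Halve the batch size (hypothetical helper; `args` mirrors the
    ernie-doc argument object). Halving is the usual first response
    to a GPU out-of-memory error."""
    args.batch_size = max(1, args.batch_size // 2)
    return args
```

If the OOM persists even at batch size 1, the activation memory alone (ernie-doc carries long-document memories across segments) may simply exceed the P100's 16GB, and reducing the maximum sequence length or memory length would be the next thing to try.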