CUDA error 59: device-side assert triggered

Problem description

I get the above error when using PyTorch, along with the following assertion:

/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0]

Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

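I have seen other answers to this problem that attribute it to causes such as labels falling outside the range (0, num_classes-1). However, I have verified that my labels are within that range, and the error is raised while computing the hinge loss, as shown below: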

diff_hinge_loss+=  F.hinge_embedding_loss( neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)

Training runs fine at first, but after some epochs a CUDA runtime error is raised while computing the hinge loss.

Full error trace:

/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] 
Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Traceback (most recent call last):

  File "ours-vision.py", line 1106, in <module>
    penalty_erm, penalty_irm, penalty_ws, penalty_same_ctr, penalty_diff_ctr = train( train_dataset, data_match_tensor, label_match_tensor, phi, opt, opt_ws, scheduler, epoch, base_domain_idx, bool_erm, bool_ws, bool_ctr )

  File "ours-vision.py", line 688, in train
    diff_hinge_loss+=  F.hinge_embedding_loss( neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)

RuntimeError: CUDA error: device-side assert triggered

Tags: pytorch, gpu, hinge-loss

Solution
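
The assertion in the trace is raised from IndexKernel.cu, which implements tensor indexing, and CUDA kernels are launched asynchronously, so the Python line reported in the traceback is not necessarily the operation that failed. The sketch below shows one way to narrow this down, assuming the real cause is an out-of-range index tensor used somewhere before the loss is computed. It sets CUDA_LAUNCH_BLOCKING=1 so that errors are reported at the failing call, and it validates indices on the host before they are used. The names check_index_bounds, features, good_idx and bad_idx are hypothetical and do not come from the original script.

```python
import os

# Must be set before CUDA is initialized: kernel launches then run
# synchronously, so the traceback points at the op that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def check_index_bounds(index: torch.Tensor, size: int, name: str = "index") -> None:
    """Raise IndexError on the host if any entry of `index` is outside [-size, size).

    Hypothetical helper; mirrors the condition in the device-side assertion.
    """
    if index.numel() == 0:
        return
    lo = int(index.min())  # forces a device-to-host copy for CUDA tensors
    hi = int(index.max())
    if lo < -size or hi >= size:
        raise IndexError(
            f"{name}: values span [{lo}, {hi}] but the indexed dimension has size {size}"
        )


# Hypothetical stand-ins for the tensors indexed before the loss is computed.
features = torch.randn(5, 3)
good_idx = torch.tensor([0, 4, -5])  # all within [-5, 5)
bad_idx = torch.tensor([0, 5])       # 5 is out of range for size 5

check_index_bounds(good_idx, features.size(0), "good_idx")
selected = features[good_idx]  # safe to index after the check

try:
    check_index_bounds(bad_idx, features.size(0), "bad_idx")
except IndexError as err:
    message = str(err)  # names the offending tensor and its value range
```

Calling such a check before every indexing operation in train() (for example, wherever data_match_tensor or label_match_tensor are indexed) would turn the opaque device-side assert into a regular Python exception that names the offending tensor. Running the same batch on the CPU is another way to get a readable error message.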

