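gpu - PyTorch detach() function fails to execute on a different GPU server
Problem description
Our lab recently bought a new server with 9 GPUs, and I want to run my program on it. I did not change my code, which works elsewhere, yet I got the unexpected error shown below.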
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    p_img.copy_(netG(p_z).detach())
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 100, in forward
    out,p1 = self.attn1(out)
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 32, in forward
    energy = torch.bmm(proj_query,proj_key) # transpose check
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:411
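However, the same program runs successfully on our old machine with 4 GPUs. I am not sure what the problem is; the error seems to be caused by the detach() function. My code is as follows.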
z_b = torch.FloatTensor(opt.batch_size, opt.z_dim).to(device)
img_b = torch.FloatTensor(opt.batch_size, 3, 64, 64).to(device)
img_a = torch.FloatTensor(opt.batch_size, 3, 64, 64).to(device)
p_z = torch.FloatTensor(pool_size, opt.z_dim).to(device)
p_img = torch.FloatTensor(pool_size, 3, 64, 64).to(device)
## evaluation placeholder
show_z_b = torch.FloatTensor(100, opt.z_dim).to(device)
eval_z_b = torch.FloatTensor(250, opt.z_dim).to(device) # 250/batch * 120 --> 300000
optim_D = optim.Adam(netD.parameters(), lr=opt.lr_d) # other param?
optim_G = optim.Adam(netG.parameters(), lr=opt.lr_g) #?suitable
criterion_G = nn.MSELoss()
eta = 1
loss_GD = []
pre_loss = 0
cur_loss = 0
G_epoch = 1
for epoch in range(start_epoch, start_epoch + opt.num_epoch):
    print('Start epoch: %d' % epoch)
    ## input_pool: [pool_size, opt.z_dim] -> [pool_size, 32, 32]
    netD.train()
    netG.eval()
    p_z.normal_()
    # print(netG(p_z).detach().size())
    p_img.copy_(netG(p_z).detach())
    for t in range(opt.period):
        for _ in range(opt.dsteps):
            t = time.time()
            ### Update D
            netD.zero_grad()
            ## real
            real_img, _ = next(iter(dataloader)) # [batch_size, 1, 32, 32]
            img_b.copy_(real_img.squeeze().to(device))
            real_D_err = torch.log(1 + torch.exp(-netD(img_b))).mean()
            print("D real loss", netD(img_b).mean())
            # real_D_err.backward()
            ## fake
            z_b_idx = random.sample(range(pool_size), opt.batch_size)
            img_a.copy_(p_img[z_b_idx])
            fake_D_err = torch.log(1 + torch.exp(netD(img_a))).mean() # torch scalar[]
            loss_gp = calc_gradient_penalty(netD, img_b, img_a)
            total_loss = real_D_err + fake_D_err + loss_gp
            print("D fake loss", netD(img_a).mean())
            total_loss.backward()
            optim_D.step()
        ## update input pool
        p_img_t = p_img.clone().to(device)
        p_img_t.requires_grad_(True)
        if p_img_t.grad is not None:
            p_img_t.grad.zero_()
        fake_D_score = netD(p_img_t)
        fake_D_score.backward(torch.ones(len(p_img_t)).to(device))
        p_img = img_truncate(p_img + eta * p_img_t.grad)
        print("The mean of gradient", torch.mean(p_img_t.grad))
Solution
The error was caused by a version mismatch between the RTX GPU cards and the CUDA driver.
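Note that the traceback ends in torch.bmm inside netG's forward pass, so the failure happens before detach() is ever called. A rough way to check for this kind of mismatch is sketched below. It prints the CUDA version the installed PyTorch was built against, lists each GPU's compute capability, and runs a tiny torch.bmm on every card. The helper names (parse_cuda_version, build_too_old_for, report) are made up for this illustration, and the threshold is an assumption: Turing-generation RTX cards (compute capability 7.5) are taken to need a build against CUDA 10.0 or newer.

```python
"""Diagnostic sketch: compare PyTorch's CUDA build with what each GPU needs."""

# Assumption: Turing cards (compute capability 7.5) need a PyTorch binary
# built against CUDA 10.0 or newer. Newer generations need newer CUDA still.
TURING_CAPABILITY = (7, 5)
MIN_CUDA_FOR_TURING = (10, 0)


def parse_cuda_version(version):
    """Turn a string such as '9.0.176' into a (major, minor) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def build_too_old_for(capability, cuda_version):
    """Return True if the CUDA build looks too old for a Turing-or-newer card."""
    if cuda_version is None:  # CPU-only or non-CUDA build: nothing to compare
        return False
    return (tuple(capability) >= TURING_CAPABILITY
            and parse_cuda_version(cuda_version) < MIN_CUDA_FOR_TURING)


def report():
    """Collect one line per finding; also works when torch or CUDA is absent."""
    try:
        import torch
    except ImportError:
        return ["torch is not installed"]
    lines = ["torch %s, built against CUDA %s"
             % (torch.__version__, torch.version.cuda)]
    if not torch.cuda.is_available():
        return lines + ["CUDA is not available"]
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        lines.append("GPU %d: %s, compute capability %d.%d"
                     % (i, name, cap[0], cap[1]))
        if build_too_old_for(cap, torch.version.cuda):
            lines.append("  -> CUDA build is probably too old for this card")
        try:
            # Same cuBLAS-backed call that fails in the traceback, on tiny
            # inputs; .cpu() forces the GPU work to finish before we report.
            a = torch.randn(2, 3, 4, device="cuda:%d" % i)
            b = torch.randn(2, 4, 5, device="cuda:%d" % i)
            torch.bmm(a, b).cpu()
            lines.append("  -> torch.bmm OK")
        except RuntimeError as exc:
            lines.append("  -> torch.bmm failed: %s" % exc)
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

If the small torch.bmm call fails on the new server but passes on the old one, the training code is likely not at fault, and installing a PyTorch build that matches the cards and the driver is the thing to try.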