RuntimeError "storage has wrong size" when using torch.load

Problem description

I get this error when I call torch.load("pthfilename"). My model was trained on multiple GPUs, and I saved it with the following code:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
device = torch.device(arg.local_rank)
net = Net().to(device)
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[arg.local_rank])
torch.save(net.state_dict(), "0.pth")

The error is:

Traceback (most recent call last):
  File "/root/PycharmProjects/test.py", line 8, in <module>
    model_dict = torch.load("0.pth")
  File "torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected -4916312287391674656 got 24

Tags: load, torch

Solution


If you train your model with a multi-process setup such as DistributedDataParallel, you should save the checkpoint from only one process, identified by its local_rank (for example local_rank == 0). If every process writes to the same file at the same time, the file ends up corrupted, which is why torch.load later fails with the size error above.

Refer to this link, this, and this; hopefully this solution helps you.

def save_checkpoint(epoch, model, best_top5, optimizer,
                    is_best=False,
                    filename='checkpoint.pth.tar'):
    # Bundle everything needed to resume training into one dict.
    state = {
        'epoch': epoch + 1, 'state_dict': model.state_dict(),
        'best_top5': best_top5, 'optimizer': optimizer.state_dict(),
    }
    torch.save(state, filename)

# Only the process with local_rank 0 writes the file, so the other
# DistributedDataParallel workers cannot corrupt it.
if args.local_rank == 0:
    if is_best:
        save_checkpoint(epoch, model, best_top5, optimizer,
                        is_best=True, filename='model_best.pth.tar')
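
Once the file is written by a single process, loading it should work again. Below is a minimal loading sketch, assuming the checkpoint layout produced by save_checkpoint above and using the Net class from the question as a placeholder: if state_dict() was taken while the model was still wrapped in DistributedDataParallel, its keys carry a "module." prefix that has to be stripped before loading into a plain (unwrapped) model.

import torch

checkpoint = torch.load("model_best.pth.tar", map_location="cpu")
state_dict = checkpoint["state_dict"]

# Keys saved from a DDP-wrapped model are prefixed with "module.";
# strip that prefix (if present) so they match an unwrapped model.
state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}

model = Net()  # placeholder for the model class from the question
model.load_state_dict(state_dict)
model.eval()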
