PyTorch NCCL DDP freezes but Gloo works

Problem description

I am trying to figure out whether two Nvidia 2070S GPUs on the same Ubuntu 20.04 system can reach each other through NCCL with PyTorch 1.8.

My test script is based on the PyTorch documentation, with the backend changed from "gloo" to "nccl".

With the backend set to "gloo", the script finishes running in less than a minute:

$ time python test_ddp.py 
Running basic DDP example on rank 0.
Running basic DDP example on rank 1.

real    0m4.839s
user    0m4.980s
sys     0m1.942s

However, with the backend set to "nccl", the script hangs at the output below and never returns to the bash prompt:

$ python test_ddp.py 
Running basic DDP example on rank 1.
Running basic DDP example on rank 0.

The same problem occurs with InfiniBand (IB) disabled:

$ NCCL_IB_DISABLE=1 python test_ddp.py
Running basic DDP example on rank 1.
Running basic DDP example on rank 0.
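
To see where initialization stalls, a standard NCCL diagnostic (not part of the original post) is to rerun with debug logging enabled; NCCL_DEBUG is a documented NCCL environment variable:

$ NCCL_DEBUG=INFO python test_ddp.py

At the INFO level, NCCL logs which transport (P2P, shared memory, or network) each pair of ranks selects, which usually narrows down where the hang occurs.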

I am using these packages:

How can I fix the problem when using NCCL? Thanks!

Python code used to test NCCL:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    # NOTE: these variables only affect the default env:// rendezvous;
    # they are ignored when an explicit init_method is passed below.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # gloo: works
    # dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # nccl: hangs forever
    dist.init_process_group(
        "nccl", init_method="tcp://10.1.1.20:23456", rank=rank, world_size=world_size
    )


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 2)
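
For comparison, the stock DDP tutorial relies on the env:// rendezvous that the MASTER_ADDR/MASTER_PORT variables are meant for, instead of an explicit tcp:// address. A minimal sketch of that variant (a hypothetical modification for testing, not code from the original post):

def setup_env(rank, world_size):
    # Use the default env:// rendezvous so MASTER_ADDR/MASTER_PORT take effect.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Pin each process to its own GPU before creating the NCCL communicator,
    # so neither rank accidentally opens a context on GPU 0.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

If the script still hangs with this setup, the problem most likely sits below PyTorch, in NCCL's transport selection.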

Tags: python, pytorch, nvidia

Solution
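
A commonly reported workaround for NCCL hangs between consumer GeForce GPUs, offered here as a hedged suggestion rather than a confirmed fix for this machine, is to rule out the peer-to-peer (P2P) transport, which is known to stall when PCIe ACS or the IOMMU interferes with direct GPU-to-GPU transfers. NCCL_P2P_DISABLE is a documented NCCL environment variable:

$ NCCL_P2P_DISABLE=1 python test_ddp.py

If the script then completes, the hang is in the P2P path, and the usual follow-up is to check the BIOS IOMMU/ACS settings or leave P2P disabled.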

