How do I send data across multiple nodes with mpi4py?

Problem description

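Hi all: I am running my code on an HPC cluster, but I cannot transfer data between nodes. I wrote a simple program to test communication between cores across nodes. First, I used one node with 8 cores. My code is (test.py):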

from mpi4py import MPI
import sys
import numpy as np

def print_hello(rank, size, name):
    msg = "Hello World! I am process {0} of {1} on {2}.\n"
    sys.stdout.write(msg.format(rank, size, name))

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    name = MPI.Get_processor_name()
    if rank == 0:
        data = np.random.random((8,64,64,64))
        print(data.shape)
    else:
        data = None

    data = comm.scatter(data,root=0)
    print(data.shape)

    print_hello(rank, size, name)

I ran it with srun -N 1 -n 8 python3 test.py 2>&1 | tee out.txt, which behaves the same as mpirun -np 8 python3 test.py 2>&1 | tee out.txt. It finished in only 5 seconds, and the out.txt file is:

(64, 64, 64)
Hello World! I am process 4 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 5 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 6 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 7 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 1 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 2 of 8 on cn3478.
(64, 64, 64)
Hello World! I am process 3 of 8 on cn3478.
(8, 64, 64, 64)
(64, 64, 64)
Hello World! I am process 0 of 8 on cn3478.

Everything looks fine! However, when I use two nodes with 48 cores, something goes wrong. The file is (test48.py):

from mpi4py import MPI
import sys
import numpy as np

def print_hello(rank, size, name):
    msg = "Hello World! I am process {0} of {1} on {2}.\n"
    sys.stdout.write(msg.format(rank, size, name))

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    name = MPI.Get_processor_name()
    if rank == 0:
        data = np.random.random((48,64,64,64))
        print(data.shape)
    else:
        data = None

    data = comm.scatter(data,root=0)
    print(data.shape)

    print_hello(rank, size, name)

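I ran yhrun -N 2 -n 48 python3 test48.py 2>&1 | tee out2.txt. It ran for a long time (more than 1 hour) and printed nothing. I suspect the data transfer is failing, because when I comment out these two lines: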

    data = comm.scatter(data,root=0)
    print(data.shape)

the code finishes quickly, and the output is:

Hello World! I am process 25 of 48 on cn3598.
Hello World! I am process 28 of 48 on cn3598.
Hello World! I am process 30 of 48 on cn3598.
Hello World! I am process 31 of 48 on cn3598.
Hello World! I am process 32 of 48 on cn3598.
Hello World! I am process 40 of 48 on cn3598.
Hello World! I am process 41 of 48 on cn3598.
Hello World! I am process 44 of 48 on cn3598.
Hello World! I am process 24 of 48 on cn3598.
Hello World! I am process 26 of 48 on cn3598.
Hello World! I am process 27 of 48 on cn3598.
Hello World! I am process 29 of 48 on cn3598.
Hello World! I am process 33 of 48 on cn3598.
Hello World! I am process 34 of 48 on cn3598.
Hello World! I am process 35 of 48 on cn3598.
Hello World! I am process 36 of 48 on cn3598.
Hello World! I am process 37 of 48 on cn3598.
Hello World! I am process 38 of 48 on cn3598.
Hello World! I am process 42 of 48 on cn3598.
Hello World! I am process 43 of 48 on cn3598.
Hello World! I am process 45 of 48 on cn3598.
Hello World! I am process 46 of 48 on cn3598.
Hello World! I am process 47 of 48 on cn3598.
Hello World! I am process 39 of 48 on cn3598.
Hello World! I am process 1 of 48 on cn3597.
Hello World! I am process 3 of 48 on cn3597.
Hello World! I am process 4 of 48 on cn3597.
Hello World! I am process 9 of 48 on cn3597.
Hello World! I am process 12 of 48 on cn3597.
Hello World! I am process 15 of 48 on cn3597.
Hello World! I am process 16 of 48 on cn3597.
Hello World! I am process 17 of 48 on cn3597.
Hello World! I am process 20 of 48 on cn3597.
Hello World! I am process 2 of 48 on cn3597.
Hello World! I am process 5 of 48 on cn3597.
Hello World! I am process 6 of 48 on cn3597.
Hello World! I am process 7 of 48 on cn3597.
Hello World! I am process 8 of 48 on cn3597.
Hello World! I am process 10 of 48 on cn3597.
Hello World! I am process 11 of 48 on cn3597.
Hello World! I am process 13 of 48 on cn3597.
Hello World! I am process 14 of 48 on cn3597.
Hello World! I am process 18 of 48 on cn3597.
Hello World! I am process 19 of 48 on cn3597.
Hello World! I am process 21 of 48 on cn3597.
Hello World! I am process 22 of 48 on cn3597.
Hello World! I am process 23 of 48 on cn3597.
(48, 64, 64, 64)
Hello World! I am process 0 of 48 on cn3597.

What is wrong with the code? Or is data transfer between nodes not allowed? Any suggestions are welcome!

Tags: python, mpi, slurm, mpi4py

Solution
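As a starting point for reasoning about the test in the question, and not a confirmed diagnosis of the hang, the sketch below works out what the scatter call is being asked to move. It assumes that comm.scatter hands element i of the root's sequence to rank i (which matches the (64, 64, 64) shapes printed in the single-node run) and that np.random.random produces float64 values. The helper names scatter_plan and emulate_scatter are hypothetical, and the code uses only NumPy, with no MPI involved.

```python
import numpy as np


def scatter_plan(n_ranks, chunk_shape=(64, 64, 64), dtype=np.float64):
    """Return (bytes per rank, total bytes) when each rank gets one chunk.

    Hypothetical helper: it only does arithmetic on array sizes and does
    not account for any serialization overhead added by the MPI layer.
    """
    itemsize = np.dtype(dtype).itemsize
    per_rank = int(np.prod(chunk_shape)) * itemsize
    return per_rank, per_rank * n_ranks


def emulate_scatter(data, n_ranks):
    """Split data along axis 0, one item per rank (assumed scatter layout)."""
    if len(data) != n_ranks:
        raise ValueError("need exactly one item per rank")
    return [data[i] for i in range(n_ranks)]


# Small stand-in array so the example runs quickly; the question uses 64^3 chunks.
chunks = emulate_scatter(np.zeros((8, 4, 4, 4)), 8)
print(chunks[0].shape)

# Payload sizes for the two runs described in the question.
for n in (8, 48):
    per_rank, total = scatter_plan(n)
    print(f"{n} ranks: {per_rank / 2**20:.1f} MiB per rank, "
          f"{total / 2**20:.1f} MiB total")
```

Under these assumptions each rank receives 2 MiB in both runs, and the root holds 16 MiB for 8 ranks versus 96 MiB for 48 ranks. The per-rank message size is therefore the same in the working and the hanging case. What differs in the second run is the larger total on the root and the fact that the chunks for ranks on cn3598 have to leave cn3597.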

