Slowdown when running multiple concurrent MPI LAMMPS jobs

Problem description

I am running LAMMPS simulations on an AMD 2990WX (Ubuntu 18.04).

When I run just one LAMMPS job with mpirun, as shown below:

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf}

I have no problems and the simulation runs exactly as I intend.

But when I run the script below, each LAMMPS job takes almost 3 times as long as a single job run on its own, so I get no performance gain from running them concurrently (the 3 jobs each run at roughly 1/3 the speed of a single one).

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}
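
For reference, here is a minimal sketch of how the whole batch can be timed against a solo run (a hypothetical wrapper, not my actual script; it only reuses the mpirun line from above):

    #!/bin/sh
    # Hypothetical timing wrapper: measure the wall-clock time of the whole
    # concurrent batch so it can be compared with 3x the time of a solo run.
    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2
    Tf=0.30

    run_one () {
        # Same mpirun invocation as above; only MaxShear ($1) varies.
        $MPI_HOME/bin/mpirun -np 8 --hostfile my_host \
            $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing \
            -var MaxShear "$1" -var Tf ${Tf}
    }

    start=$(date +%s)
    run_one 0.020 &
    run_one 0.025 &
    run_one 0.030 &
    wait                                    # wait for all three background jobs
    echo "Batch wall-clock time: $(( $(date +%s) - start )) s"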

It is the same without the hostfile my_host. The hostfile looks like this:

    <hostname> slots=32 max-slots=32

I built OpenMPI with --with-cuda, FFTW with --enable-shared, and LAMMPS with a few packages.
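
The builds roughly follow the usual configure/make route; the sketch below shows what I mean (install prefixes, version numbers and the LAMMPS package named here are illustrative assumptions, not my exact commands):

    # Rough build sketch -- prefixes, versions and the LAMMPS package list are
    # illustrative assumptions, not my exact commands.
    cd openmpi-4.0.x
    ./configure --prefix=/APP/LIBS/OPENMPI2 --with-cuda
    make -j 32 && make install

    cd ../fftw-3.3.8
    ./configure --prefix=/APP/LIBS/FFTW3 --enable-shared
    make -j 32 && make install

    cd /APP/LAMMPS/src
    make yes-manybody      # example package; I enable a handful of these
    make mpi               # traditional in-src build, produces an lmp_mpi binary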

I have already tried OpenMPI v1.8, v3.0 and v4.0 with FFTW v3.3.8. There is plenty of RAM and plenty of storage. I have also checked the load average and core usage: while the second script is running, they show the machine using 24 cores (with the corresponding load). The same problem occurs when I run copies of the first script simultaneously in separate terminals (i.e. sh first.sh in each terminal).
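
For reference, the load/core check I mention is along these lines (mpstat comes from the sysstat package):

    # Watch the load average and per-core utilization while the jobs run
    uptime               # 1/5/15-minute load averages
    mpstat -P ALL 2 3    # per-core CPU usage: 3 samples, 2 seconds apart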

Is there something wrong with my bash scripts? Are there any known issues with mpirun (or LAMMPS) on Ryzen?

Update

I have tested the following script:

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun --cpu-set 0-7 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun --cpu-set 8-15 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun --cpu-set 16-23 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}

The result is shown below:

[<hostname>:09617] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 4 bound to socket 0[core 20[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../..]
[<hostname>:09619] MCW rank 5 bound to socket 0[core 21[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../..]
[<hostname>:09619] MCW rank 6 bound to socket 0[core 22[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../..]
[<hostname>:09619] MCW rank 7 bound to socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../..]
[<hostname>:09619] MCW rank 0 bound to socket 0[core 16[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 1 bound to socket 0[core 17[hwt 0-1]]: [../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 2 bound to socket 0[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 3 bound to socket 0[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 4 bound to socket 0[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 5 bound to socket 0[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 6 bound to socket 0[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 7 bound to socket 0[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 0 bound to socket 0[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 1 bound to socket 0[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 2 bound to socket 0[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 3 bound to socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../..]

I don't know much about MPI, but to me this doesn't show anything strange. What could be wrong here?
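
In case it is relevant, this is how the reported bindings can be cross-checked against the hardware topology (assuming numactl and hwloc are installed); I have not pasted that output here:

    # Cross-check the reported bindings against the machine topology
    numactl --hardware   # NUMA nodes and the CPUs/memory attached to each
    lstopo --no-io       # package/die/core layout as seen by hwloc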

Tags: shell, mpi, openmpi, amd-processor, lammps
