OpenMPI: MPI.Init hangs in Java - how to debug?

Problem description

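I have compiled OpenMPI with Java support locally, following https://www.open-mpi.org/faq/?category=java. On my local machine, which uses Oracle Java 8, this works fine, but on the cluster, which uses OpenJDK 8, the same approach causes MPI.Init to hang. Do you have any pointers on how to proceed from here? Tracing? Trying other Java versions? I could not find any documentation on which Java versions this interface supports.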

package com.acme.hello;
import mpi.*;

public class HelloMpi {
    public static void main(String[] args) throws Exception {
        System.out.println("attempting MPI init");
        // MPI.Init returns the (possibly updated) argument array
        args = MPI.Init(args);
        System.out.println("MPI init done");
        // Shut MPI down cleanly before the JVM exits
        MPI.Finalize();
    }
}

> java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

> ~/NQSIM/java$ mpirun -version
mpirun (Open MPI) 3.1.2

> ~/NQSIM/java$ mpirun -np 2 java -classpath "./target/test-classes/" com.acme.hello.HelloMpi
attempting MPI init
attempting MPI init
(hangs here forever)

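Edit: examples/hello_c shows the same behavior, so this is not related to Java. I guess it must be something in the transport. I had to build and install OpenMPI with user privileges only. There is an existing OpenMPI on the system, but it has no Java support. Any ideas on how to proceed?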

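Edit 2: Switching to a different byte transfer layer (BTL), e.g. with --mca btl vader,self, works. Below is the output with --mca btl_base_verbose up to the point where it hangs: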

[fdr4:33013] mca: base: components_register: registering framework btl components
[fdr4:33013] mca: base: components_register: found loaded component sm
[fdr4:33014] mca: base: components_register: registering framework btl components
[fdr4:33014] mca: base: components_register: found loaded component sm
[fdr4:33013] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: found loaded component self
[fdr4:33014] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: component self register function successful
[fdr4:33014] mca: base: components_register: found loaded component self
[fdr4:33013] mca: base: components_register: found loaded component tcp
[fdr4:33014] mca: base: components_register: component self register function successful
[fdr4:33013] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component tcp
[fdr4:33013] mca: base: components_register: found loaded component vader
[fdr4:33013] mca: base: components_register: component vader register function successful
[fdr4:33013] mca: base: components_register: found loaded component openib
[fdr4:33014] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component vader
[fdr4:33014] mca: base: components_register: component vader register function successful
[fdr4:33014] mca: base: components_register: found loaded component openib
[fdr4:33013] mca: base: components_register: component openib register function successful
[fdr4:33013] mca: base: components_open: opening btl components
[fdr4:33013] mca: base: components_open: found loaded component sm
[fdr4:33013] mca: base: components_open: component sm open function successful
[fdr4:33013] mca: base: components_open: found loaded component self
[fdr4:33013] mca: base: components_open: component self open function successful
[fdr4:33013] mca: base: components_open: found loaded component tcp
[fdr4:33013] mca: base: components_open: component tcp open function successful
[fdr4:33013] mca: base: components_open: found loaded component vader
[fdr4:33013] mca: base: components_open: component vader open function successful
[fdr4:33013] mca: base: components_open: found loaded component openib
[fdr4:33013] mca: base: components_open: component openib open function successful
[fdr4:33013] select: initializing btl component sm
[fdr4:33014] mca: base: components_register: component openib register function successful
[fdr4:33014] mca: base: components_open: opening btl components
[fdr4:33014] mca: base: components_open: found loaded component sm
[fdr4:33014] mca: base: components_open: component sm open function successful
[fdr4:33014] mca: base: components_open: found loaded component self
[fdr4:33014] mca: base: components_open: component self open function successful
[fdr4:33014] mca: base: components_open: found loaded component tcp
[fdr4:33014] mca: base: components_open: component tcp open function successful
[fdr4:33014] mca: base: components_open: found loaded component vader
[fdr4:33014] mca: base: components_open: component vader open function successful
[fdr4:33014] mca: base: components_open: found loaded component openib
[fdr4:33014] mca: base: components_open: component openib open function successful
[fdr4:33014] select: initializing btl component sm
[fdr4:33014] select: init of component sm returned success
[fdr4:33014] select: initializing btl component self
[fdr4:33014] select: init of component self returned success
[fdr4:33014] select: initializing btl component tcp
[fdr4:33013] select: init of component sm returned success
[fdr4:33013] select: initializing btl component self
[fdr4:33013] select: init of component self returned success
[fdr4:33013] select: initializing btl component tcp
[fdr4:33014] select: init of component tcp returned success
[fdr4:33014] select: initializing btl component vader
[fdr4:33013] select: init of component tcp returned success
[fdr4:33013] select: initializing btl component vader
[fdr4:33014] select: init of component vader returned success
[fdr4:33014] select: initializing btl component openib
[fdr4:33013] select: init of component vader returned success
[fdr4:33013] select: initializing btl component openib
[fdr4:33014] Checking distance from this process to device=mlx4_0
[fdr4:33013] Checking distance from this process to device=mlx4_0
[fdr4:33013] hwloc_distances->nbobjs=4
[fdr4:33013] hwloc_distances->latency[0]=1.000000
[fdr4:33013] hwloc_distances->latency[1]=2.000000
[fdr4:33013] hwloc_distances->latency[2]=3.000000
[fdr4:33014] hwloc_distances->nbobjs=4
[fdr4:33014] hwloc_distances->latency[0]=1.000000
[fdr4:33014] hwloc_distances->latency[1]=2.000000
[fdr4:33014] hwloc_distances->latency[2]=3.000000
[fdr4:33013] hwloc_distances->latency[3]=2.000000
[fdr4:33013] hwloc_distances->latency[4]=2.000000
[fdr4:33013] hwloc_distances->latency[5]=1.000000
[fdr4:33013] hwloc_distances->latency[6]=2.000000
[fdr4:33013] hwloc_distances->latency[7]=3.000000
[fdr4:33013] ibv_obj->logical_index=1
[fdr4:33014] hwloc_distances->latency[3]=2.000000
[fdr4:33014] hwloc_distances->latency[4]=2.000000
[fdr4:33014] hwloc_distances->latency[5]=1.000000
[fdr4:33014] hwloc_distances->latency[6]=2.000000
[fdr4:33014] hwloc_distances->latency[7]=3.000000
[fdr4:33014] ibv_obj->logical_index=1
[fdr4:33013] my_obj->logical_index=0
[fdr4:33013] Process is bound: distance to device is 2.000000
[fdr4:33014] my_obj->logical_index=0
[fdr4:33014] Process is bound: distance to device is 2.000000
[fdr4:33013] [rank=0] openib: using port mlx4_0:1
[fdr4:33013] select: init of component openib returned success
[fdr4:33014] [rank=1] openib: using port mlx4_0:1
[fdr4:33014] select: init of component openib returned success
[fdr4:33013] mca: bml: Using self btl for send to [[59315,1],0] on node fdr4
[fdr4:33014] mca: bml: Using self btl for send to [[59315,1],1] on node fdr4
[fdr4:33013] mca: bml: Using vader btl for send to [[59315,1],1] on node fdr4
[fdr4:33014] mca: bml: Using vader btl for send to [[59315,1],0] on node fdr4
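
For anyone wanting to repeat the comparison from Edit 2, here is a minimal sketch of a wrapper that launches the test program with an explicit BTL list. The function name run_hello and the MPIRUN variable are hypothetical conveniences, not part of Open MPI, and the verbosity level 100 is an assumption (the post does not say which level was used).

```shell
# Launcher is overridable so the command line can be inspected without
# running it (e.g. MPIRUN=echo). This variable is only an illustrative
# convention, not an Open MPI feature.
MPIRUN="${MPIRUN:-mpirun}"

run_hello() {
  # $1: comma-separated BTL list passed to --mca btl, e.g. "vader,self"
  # btl_base_verbose level 100 is an assumed value for maximum detail.
  "$MPIRUN" --mca btl "$1" --mca btl_base_verbose 100 -np 2 \
    java -classpath "./target/test-classes/" com.acme.hello.HelloMpi
}
```

Calling run_hello vader,self restricts the run to shared memory plus the loopback component, which is the combination reported to work above. Adding components back one at a time may help narrow down which one is involved in the hang.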

Tags: java, mpi, openmpi

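Solution

Solved. In this case the problem was one of the limits imposed on users. The server was configured with the default settings, but after changing the following in /etc/security/limits.conf it started working with the default byte transfer layer (I could not test this directly myself, so unfortunately I don't know which of the two settings was the culprit):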

*               -       memlock         unlimited
*               -       nofile          16384
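
To see which of these limits a process actually gets, the current values can be printed from the shell. This is a minimal sketch, assuming bash on Linux; show_limits is a hypothetical helper name. Keep in mind that processes started through a batch scheduler or a remote launcher may see different limits than an interactive login shell.

```shell
show_limits() {
  # ulimit -l: maximum locked memory ("memlock" in limits.conf),
  #            reported in KiB or as "unlimited"
  # ulimit -n: maximum number of open file descriptors ("nofile")
  # -S selects the soft limit, -H the hard limit.
  printf 'memlock soft=%s hard=%s\n' "$(ulimit -S -l)" "$(ulimit -H -l)"
  printf 'nofile  soft=%s hard=%s\n' "$(ulimit -S -n)" "$(ulimit -H -n)"
}

show_limits
```

Running the same check under the launcher, for example mpirun -np 1 bash -c 'ulimit -l; ulimit -n', shows the limits as seen by an MPI rank rather than by the login shell.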
