java - OpenMPI: MPI.Init hangs in Java - how to debug?
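Question
I compiled OpenMPI with Java support locally, following https://www.open-mpi.org/faq/?category=java. On my local machine with Oracle Java 8 this works fine, but on a cluster with OpenJDK 8 the same approach causes MPI.Init to hang. Do you have any pointers on how to proceed from here? Tracing? Trying other Java versions? I could not find any documentation on which Java versions this interface supports.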
package com.acme.hello;

import mpi.*;

public class HelloMpi {
    public static void main(String args[]) throws Exception {
        int me, size;
        System.out.println("attempting MPI init");
        args = MPI.Init(args);
        System.out.println("MPI init done");
    }
}
> java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> ~/NQSIM/java$ mpirun -version
mpirun (Open MPI) 3.1.2
> ~/NQSIM/java$ mpirun -np 2 java -classpath "./target/test-classes/" com.acme.hello.HelloMpi
attempting MPI init
attempting MPI init
(hangs here forever)
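Edit: examples/hello_c shows the same behavior, so this is not related to Java. I assume it must be something in the transport layer. I had to build and install OpenMPI with user privileges only. There is an existing OpenMPI on the system, but it has no Java support. Any ideas on how to proceed?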
Edit2: Switching to a different byte transfer layer (BTL), e.g. with --mca btl vader,self, works. Here is the --mca btl_base_verbose output up to the point where it hangs:
[fdr4:33013] mca: base: components_register: registering framework btl components
[fdr4:33013] mca: base: components_register: found loaded component sm
[fdr4:33014] mca: base: components_register: registering framework btl components
[fdr4:33014] mca: base: components_register: found loaded component sm
[fdr4:33013] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: found loaded component self
[fdr4:33014] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: component self register function successful
[fdr4:33014] mca: base: components_register: found loaded component self
[fdr4:33013] mca: base: components_register: found loaded component tcp
[fdr4:33014] mca: base: components_register: component self register function successful
[fdr4:33013] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component tcp
[fdr4:33013] mca: base: components_register: found loaded component vader
[fdr4:33013] mca: base: components_register: component vader register function successful
[fdr4:33013] mca: base: components_register: found loaded component openib
[fdr4:33014] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component vader
[fdr4:33014] mca: base: components_register: component vader register function successful
[fdr4:33014] mca: base: components_register: found loaded component openib
[fdr4:33013] mca: base: components_register: component openib register function successful
[fdr4:33013] mca: base: components_open: opening btl components
[fdr4:33013] mca: base: components_open: found loaded component sm
[fdr4:33013] mca: base: components_open: component sm open function successful
[fdr4:33013] mca: base: components_open: found loaded component self
[fdr4:33013] mca: base: components_open: component self open function successful
[fdr4:33013] mca: base: components_open: found loaded component tcp
[fdr4:33013] mca: base: components_open: component tcp open function successful
[fdr4:33013] mca: base: components_open: found loaded component vader
[fdr4:33013] mca: base: components_open: component vader open function successful
[fdr4:33013] mca: base: components_open: found loaded component openib
[fdr4:33013] mca: base: components_open: component openib open function successful
[fdr4:33013] select: initializing btl component sm
[fdr4:33014] mca: base: components_register: component openib register function successful
[fdr4:33014] mca: base: components_open: opening btl components
[fdr4:33014] mca: base: components_open: found loaded component sm
[fdr4:33014] mca: base: components_open: component sm open function successful
[fdr4:33014] mca: base: components_open: found loaded component self
[fdr4:33014] mca: base: components_open: component self open function successful
[fdr4:33014] mca: base: components_open: found loaded component tcp
[fdr4:33014] mca: base: components_open: component tcp open function successful
[fdr4:33014] mca: base: components_open: found loaded component vader
[fdr4:33014] mca: base: components_open: component vader open function successful
[fdr4:33014] mca: base: components_open: found loaded component openib
[fdr4:33014] mca: base: components_open: component openib open function successful
[fdr4:33014] select: initializing btl component sm
[fdr4:33014] select: init of component sm returned success
[fdr4:33014] select: initializing btl component self
[fdr4:33014] select: init of component self returned success
[fdr4:33014] select: initializing btl component tcp
[fdr4:33013] select: init of component sm returned success
[fdr4:33013] select: initializing btl component self
[fdr4:33013] select: init of component self returned success
[fdr4:33013] select: initializing btl component tcp
[fdr4:33014] select: init of component tcp returned success
[fdr4:33014] select: initializing btl component vader
[fdr4:33013] select: init of component tcp returned success
[fdr4:33013] select: initializing btl component vader
[fdr4:33014] select: init of component vader returned success
[fdr4:33014] select: initializing btl component openib
[fdr4:33013] select: init of component vader returned success
[fdr4:33013] select: initializing btl component openib
[fdr4:33014] Checking distance from this process to device=mlx4_0
[fdr4:33013] Checking distance from this process to device=mlx4_0
[fdr4:33013] hwloc_distances->nbobjs=4
[fdr4:33013] hwloc_distances->latency[0]=1.000000
[fdr4:33013] hwloc_distances->latency[1]=2.000000
[fdr4:33013] hwloc_distances->latency[2]=3.000000
[fdr4:33014] hwloc_distances->nbobjs=4
[fdr4:33014] hwloc_distances->latency[0]=1.000000
[fdr4:33014] hwloc_distances->latency[1]=2.000000
[fdr4:33014] hwloc_distances->latency[2]=3.000000
[fdr4:33013] hwloc_distances->latency[3]=2.000000
[fdr4:33013] hwloc_distances->latency[4]=2.000000
[fdr4:33013] hwloc_distances->latency[5]=1.000000
[fdr4:33013] hwloc_distances->latency[6]=2.000000
[fdr4:33013] hwloc_distances->latency[7]=3.000000
[fdr4:33013] ibv_obj->logical_index=1
[fdr4:33014] hwloc_distances->latency[3]=2.000000
[fdr4:33014] hwloc_distances->latency[4]=2.000000
[fdr4:33014] hwloc_distances->latency[5]=1.000000
[fdr4:33014] hwloc_distances->latency[6]=2.000000
[fdr4:33014] hwloc_distances->latency[7]=3.000000
[fdr4:33014] ibv_obj->logical_index=1
[fdr4:33013] my_obj->logical_index=0
[fdr4:33013] Process is bound: distance to device is 2.000000
[fdr4:33014] my_obj->logical_index=0
[fdr4:33014] Process is bound: distance to device is 2.000000
[fdr4:33013] [rank=0] openib: using port mlx4_0:1
[fdr4:33013] select: init of component openib returned success
[fdr4:33014] [rank=1] openib: using port mlx4_0:1
[fdr4:33014] select: init of component openib returned success
[fdr4:33013] mca: bml: Using self btl for send to [[59315,1],0] on node fdr4
[fdr4:33014] mca: bml: Using self btl for send to [[59315,1],1] on node fdr4
[fdr4:33013] mca: bml: Using vader btl for send to [[59315,1],1] on node fdr4
[fdr4:33014] mca: bml: Using vader btl for send to [[59315,1],0] on node fdr4
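Answer
Solved. In this case the problem was one of the limits imposed on users. The server had been configured with the default settings, but after the following was changed in /etc/security/limits.conf it started working with the default byte transfer layer (I could not test this directly myself, so unfortunately I do not know which of the two settings was the culprit):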
* - memlock unlimited
* - nofile 16384
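
To narrow down which of the two limits matters, it may help to print the limits a launched process actually sees. The following is a minimal sketch, not part of the original answer. It assumes a Linux node where /proc/self/limits is readable, and the class name LimitCheck and the helper parseSoftLimits are hypothetical names chosen for illustration. A low "Max locked memory" value would be consistent with the openib BTL having trouble registering memory, but that is an assumption to verify, not a confirmed diagnosis.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LimitCheck {
    // Parses lines in the /proc/<pid>/limits layout and returns
    // a map from limit name to its soft limit value.
    static Map<String, String> parseSoftLimits(List<String> lines) {
        Map<String, String> soft = new LinkedHashMap<>();
        for (String line : lines) {
            // Columns are separated by runs of at least two spaces;
            // limit names themselves contain only single spaces.
            String[] cols = line.trim().split("\\s{2,}");
            // Skip the header row, which does not start with "Max ".
            if (cols.length >= 3 && cols[0].startsWith("Max ")) {
                soft.put(cols[0], cols[1]);
            }
        }
        return soft;
    }

    public static void main(String[] args) throws IOException {
        // Linux-only assumption: /proc/self/limits describes this process.
        Map<String, String> soft =
            parseSoftLimits(Files.readAllLines(Paths.get("/proc/self/limits")));
        System.out.println("memlock soft limit: " + soft.get("Max locked memory"));
        System.out.println("nofile soft limit:  " + soft.get("Max open files"));
    }
}
```

Running it through the launcher, for example mpirun -np 1 java LimitCheck, shows the limits that MPI processes inherit, which can differ from those of an interactive login shell.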