mpi - Data unpack would read past end of buffer in file util/show_help.c at line 501
Problem description
I submitted a job through Slurm. The job ran for 12 hours and worked as expected, and then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501".
I often run into a similar error, "ORTE has lost communication with a remote daemon",
but usually right at the start of a job. That is annoying, but it does not waste time the way an error after 12 hours does. Is there a quick way to fix this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
Solution
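The warnings in the log show the openib BTL being initialized even though Open MPI 4.0 and later expect UCX to drive InfiniBand devices, and the help text itself points at the btl_openib_allow_ib MCA parameter. Below is a minimal sketch of how those MCA settings could be passed to mpirun; it assumes UCX is installed on the cluster, and the application name ./my_app is a hypothetical placeholder for your own binary.

# Option 1: route InfiniBand traffic through UCX and disable the openib BTL
mpirun --mca pml ucx --mca btl ^openib ./my_app

# Option 2: keep the openib BTL and explicitly allow InfiniBand ports,
# as the warning message itself suggests
mpirun --mca btl_openib_allow_ib true ./my_app

# To see every aggregated help/error message instead of the "N more processes
# have sent help message" summaries (as mentioned in the log):
mpirun --mca orte_base_help_aggregate 0 ./my_app

These settings address the openib warnings that the log reports; whether they also remove the "Data unpack would read past end of buffer" failure on this cluster would need to be verified by rerunning the job.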