mpi - Data unpack would read past end of buffer in file util/show_help.c at line 501

Problem description

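I submitted a job through Slurm. The job ran for 12 hours and worked as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get similar errors, such as "ORTE has lost communication with a remote daemon", but those usually appear right when the job starts. That is annoying, but it does not cost nearly as much time as an error after 12 hours. Is there a quick way to fix this? The Open MPI version is 4.0.1.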
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host:              barbun40
Local adapter:           mlx5_0
Local port:              1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host:   barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID:  252415
Peer host:  barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code:    9
--------------------------------------------------------------------------
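
The log itself names two MCA parameters, orte_base_help_aggregate and btl_openib_allow_ib. As a minimal sketch of how they could be set from the batch script, assuming the usual Open MPI convention that an environment variable named OMPI_MCA_<name> sets the MCA parameter <name>: the application name ./my_app below is hypothetical, and nothing here is known to fix the failure after 12 hours. It should only make every rank's help/error message visible on the next run.

```shell
#!/bin/sh
# Sketch only: set the MCA parameters mentioned in the log via environment
# variables before launching the job.

# Show every help/error message instead of the aggregated
# "127 more processes have sent help message ..." summary.
export OMPI_MCA_orte_base_help_aggregate=0

# Optional, taken from the first warning in the log: allow the openib BTL to
# use InfiniBand ports. Left commented out because the log says the intent
# in Open MPI 4.0 and later is to use UCX for these devices.
# export OMPI_MCA_btl_openib_allow_ib=true

# Hypothetical launch line; replace ./my_app with the real executable.
# mpirun ./my_app
echo "orte_base_help_aggregate=${OMPI_MCA_orte_base_help_aggregate}"
```

The same parameter can also be passed on the command line, for example mpirun --mca orte_base_help_aggregate 0 followed by the executable.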
Tags: mpi, openmpi