Error when running Linpack

Problem description

I am trying to run HPL Linpack on my personal laptop, inside a virtual machine running CentOS 8.

Cores allocated: 6

Memory: 12.5 GB

Nodes: 1

With small values of N it runs fine, but when I try to maximize CPU usage with larger values of N (aiming for 75-80% usage), I get a different error on every run.

Errors (each of the following appeared on a separate run):

[1617771807.179752] [localhost:3301 :0]           sock.c:344  UCX  ERROR recv(fd=28) failed: Bad address
[1617771807.188129] [localhost:3298 :0]           sock.c:344  UCX  ERROR recv(fd=27) failed: Connection reset by peer
[1617771807.249456] [localhost:3298 :0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
[localhost:03298] *** An error occurred in MPI_Send
[localhost:03298] *** reported by process [3696427009,2]
[localhost:03298] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
[localhost:03298] *** MPI_ERR_OTHER: known error not in list
[localhost:03298] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[localhost:03298] ***    and potentially your MPI job)

_________________________________________________________________________________________

malloc(): corrupted top size
[localhost:06009] *** Process received signal ***
[localhost:06009] Signal: Aborted (6)
[localhost:06009] Signal code:  (-6)
[localhost:06009] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f230e65cb20]
[localhost:06009] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f230e2be7ff]
[localhost:06009] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f230e2a8c35]
[localhost:06009] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f230e301987]
[localhost:06009] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f230e308d8c]
[localhost:06009] [ 5] /lib64/libc.so.6(+0x851f5)[0x7f230e30c1f5]
[localhost:06009] [ 6] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f230e30d412]
[localhost:06009] [ 7] ./xhpl[0x4232e3]
[localhost:06009] [ 8] ./xhpl[0x4202cd]
[localhost:06009] [ 9] ./xhpl[0x41168e]
[localhost:06009] [10] ./xhpl[0x408eff]
[localhost:06009] [11] ./xhpl[0x4018aa]
[localhost:06009] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f230e2aa7b3]
[localhost:06009] [13] ./xhpl[0x401cae]
[localhost:06009] *** End of error message ***


_________________________________________________________________________________________

corrupted size vs. prev_size
[localhost:05847] *** Process received signal ***
[localhost:05847] Signal: Aborted (6)
[localhost:05847] Signal code:  (-6)
[localhost:05847] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f07c812eb20]
[localhost:05847] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f07c7d907ff]
[localhost:05847] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f07c7d7ac35]
[localhost:05847] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f07c7dd3987]
[localhost:05847] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f07c7ddad8c]
[localhost:05847] [ 5] /lib64/libc.so.6(+0x825e6)[0x7f07c7ddb5e6]
[localhost:05847] [ 6] /lib64/libc.so.6(+0x83a1b)[0x7f07c7ddca1b]
[localhost:05847] [ 7] ./xhpl[0x423596]
[localhost:05847] [ 8] ./xhpl[0x4202a6]
[localhost:05847] [ 9] ./xhpl[0x41168e]
[localhost:05847] [10] ./xhpl[0x408eff]
[localhost:05847] [11] ./xhpl[0x4018aa]
[localhost:05847] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f07c7d7c7b3]
[localhost:05847] [13] ./xhpl[0x401cae]
[localhost:05847] *** End of error message ***

Formula used:

N = int((round(sqrt((memory_per_node * 1024 * 1024 * 1024 * nodes)/8))) * percent_usage)
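For reference, here is a minimal Python sketch that evaluates the formula exactly as written, using the VM figures from the question. The function names `problem_size` and `matrix_gib` are illustrative only, and `percent_usage = 0.80` is an assumed value taken from the upper end of the 75-80% target. The second helper reports how much memory an N x N matrix of 8-byte doubles occupies, so the result can be compared against the 12.5 GB assigned to the VM.

```python
import math

BYTES_PER_GIB = 1024 ** 3
BYTES_PER_DOUBLE = 8  # the formula divides by 8, i.e. one double per matrix element


def problem_size(memory_per_node_gib, nodes, percent_usage):
    """Evaluate the formula from the question as written (hypothetical helper)."""
    total_bytes = memory_per_node_gib * BYTES_PER_GIB * nodes
    return int(round(math.sqrt(total_bytes / BYTES_PER_DOUBLE)) * percent_usage)


def matrix_gib(n):
    """Memory taken by an n x n matrix of doubles, in GiB (hypothetical helper)."""
    return n * n * BYTES_PER_DOUBLE / BYTES_PER_GIB


# Assumed inputs: 12.5 GB, 1 node, 80% (upper end of the stated target).
n = problem_size(12.5, 1, 0.80)
print("N =", n)
print("matrix size in GiB =", matrix_gib(n))
```

Because `percent_usage` multiplies N after the square root, the matrix footprint scales with its square: with these inputs the matrix takes 8 GiB, which is 64% of 12.5 GiB rather than 80%. This sketch only checks the arithmetic; it does not account for memory used by the guest OS, MPI buffers, or HPL workspace.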

Tags: mpi, openmpi, blas, hpcc, linpack

Solution

