MPI memory corruption only on specific core counts

Problem description

For some background, I am using MPI to parallelize a basic PDE solver. The program takes a domain and assigns each processor a grid covering a portion of it. If I run with a single core or with four cores, the program works fine. However, if I run with two or three cores, I get a core dump like the following:

*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000018bd540 ***
======= Backtrace: =========
*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000022126e0 ***
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc1a63f77e5]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80dfb)[0x7fc1a6400dfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca753f77e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc1a640453c]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fca753fe9dc]
/usr/lib/libmpi.so.12(+0x25919)[0x7fc1a6d25919]
/lib/x86_64-linux-gnu/libc.so.6(+0x80678)[0x7fca75400678]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x52a9)[0x7fc198fe52a9]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca7540453c]
/usr/lib/libmpi.so.12(ompi_mpi_finalize+0x412)[0x7fc1a6d41a22]
/usr/lib/libmpi.so.12(+0x25919)[0x7fca75d25919]
MeshTest(_ZN15MPICommunicator7cleanupEv+0x26)[0x422e70]
/usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x4381)[0x7fca68844381]
MeshTest(main+0x364)[0x41af2a]
/usr/lib/libopen-pal.so.13(mca_base_component_close+0x19)[0x7fca74c88fe9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc1a63a0830]
/usr/lib/libopen-pal.so.13(mca_base_components_close+0x42)[0x7fca74c89062]
MeshTest(_start+0x29)[0x41aaf9]
/usr/lib/libmpi.so.12(+0x7d3b4)[0x7fca75d7d3b4]
======= Memory map: ========
<insert core dump>

I have been able to trace the error to the point where I create a new grid:

Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... Unrelated code ...

  // grid is already allocated and needs to be cleared.
  delete grid;
  grid = new Grid(bounds, shp, nghosts);
  return SUCCESS;
}

Grid::Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts){
  // ... Code unrelated to memory allocation ...

  // Construct the grid. Start by adding ghost points.
  shp[0] = sz[0] + 2*nghosts;
  shp[1] = sz[1] + 2*nghosts;
  try{
    points[0] = new double[shp[0]];
    points[1] = new double[shp[1]];
    for(int i = 0; i < shp[0]; i++){
      points[0][i] = grid_bounds[0][0] + (i - (int)nghosts)*dx;
    }
    for(int j = 0; j < shp[1]; j++){
      points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
    }
  }
  catch(std::bad_alloc& ba){
    std::cout << "Failed to allocate memory for grid.\n";
    shp[0] = 0;
    shp[1] = 0;
    dx = 0;
    points[0] = NULL;
    points[1] = NULL;
  }
}

Grid::~Grid(){
  delete[] points[0];
  delete[] points[1];
}

As far as I can tell, my MPI implementation is fine, and all of the MPI-dependent functionality in the Domain class appears to work correctly. My assumption is that something, somewhere, is illegally accessing memory outside its bounds, but I cannot see where; at this point, the code really just initializes MPI, loads some parameters, sets up the grid (the only memory access happens during its construction), then calls MPI_Finalize() and returns.
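Worth noting: an out-of-bounds write past a heap allocation usually goes undetected until a later free() stumbles over the damaged allocator metadata, which would explain why the abort above fires inside MPI_Finalize() rather than at the offending write. A minimal sketch of that failure mode (purely illustrative; the sizes and names here are made up and have nothing to do with the solver):

#include <cstddef>

int main(){
  const std::size_t nx = 4, ny = 8;   // mismatched lengths, like shp[0] != shp[1]
  double* xs = new double[nx];
  double* ys = new double[ny];
  for(std::size_t j = 0; j < ny; j++){
    xs[j] = (double)j;                // BUG: writes past the end of xs once j >= nx
    ys[j] = (double)j;
  }
  delete[] xs;                        // glibc may only notice the damage here,
  delete[] ys;                        // aborting with a heap-corruption error
  return 0;
}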

Tags: c++, mpi

Solution


It turns out that my Grid constructor had a bug when filling in the points (the loop that fills the y points read points[0][j] = ...), which I somehow caught and corrected while copying the code into my post but not in my actual source. The error only showed up in the 2- and 3-core runs because the grid is perfectly square in the 1- and 4-core runs, so shp[0] equals shp[1] and the stray writes stay in bounds. Thanks everyone for the suggestions. I feel a bit embarrassed now, seeing how simple the problem was.
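For reference, a reconstruction of the bug based on the description above; the in-source y loop indexed points[0] instead of points[1]. The indices shown are inferred from the answer, not copied from the actual source:

// Buggy in-source version (reconstructed): the y loop writes into points[0].
for(int j = 0; j < shp[1]; j++){
  points[0][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;   // out of bounds once j >= shp[0]
}

// Corrected version, as already shown in the posted constructor:
for(int j = 0; j < shp[1]; j++){
  points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
}

When shp[0] == shp[1] (the square 1- and 4-core decompositions), the stray writes stay inside the points[0] allocation, so nothing crashes (although points[0] would end up holding y values). Once shp[1] exceeds shp[0], the loop writes past the end of points[0] and corrupts the heap, which glibc then reports during MPI_Finalize().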
