CUDA parallel reduction: #6 works, reduction #7 fails

Problem description

I am using this code for the reduction:

http://www.math.nsysu.edu.tw/~lam/MPI/code/cuda/reduction.cu

It is based on Mark Harris's talk here:

http://www.math.nsysu.edu.tw/~lam/MPI/lecture/reduction.pdf

But with

#define blocksize 1024
#define gridsize  1024*8
#define size blocksize*gridsize

the kernel reduce6 works but reduce7 fails. Is that because reduce7 depends on the size, i.e. it needs an amount of shared memory as large as the "size" defined above?

The code snippet is here:

#define THR_PER_BLC 1024
#define BLC_PER_GRD  16
#define GRID_SIZE THR_PER_BLC * BLC_PER_GRD

template<unsigned int nThreads>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
    // I added GRID_SIZE myself so it can be volatile
    __shared__ volatile int sdata[GRID_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (nThreads * 2) + threadIdx.x;
    unsigned int gridSize = nThreads * 2 * gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i + nThreads];
        i += gridSize;
    }
    __syncthreads();
// reduction in shared memory
    if (nThreads >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
        __syncthreads();
    }
    if (nThreads >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; }
        __syncthreads();
    }
    if (nThreads >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; }
        __syncthreads();
    }
    if (tid < 32) {
        if (nThreads >= 64) sdata[tid] += sdata[tid + 32];
        if (nThreads >= 32) sdata[tid] += sdata[tid + 16];
        if (nThreads >= 16) sdata[tid] += sdata[tid + 8];
        if (nThreads >= 8) sdata[tid] += sdata[tid + 4];
        if (nThreads >= 4) sdata[tid] += sdata[tid + 2];
        if (nThreads >= 2) sdata[tid] += sdata[tid + 1];
// transfer of the result to global memory
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
}

This kernel is called from main like this:

threads = THR_PER_BLC / 2;

int gsize = BLC_PER_GRD / 8;

switch (threads) {
    case 512:
        reduce7<512><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 256:
        reduce7<256><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 128:
        reduce7<128><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 64:
        reduce7<64><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 32:
        reduce7<32><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 16:
        reduce7<16><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 8:
        reduce7<8><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 4:
        reduce7<4><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 2:
        reduce7<2><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
    case 1:
        reduce7<1><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
        break;
}
cudaThreadSynchronize();

Does that basically mean reduce7 cannot be called with a large GRID_SIZE?

Here are my test results:

#################################################################
6 Unroll the complete loop
Kernal elapsed time =      0.030(ms)
Elapsed time =      0.057(ms)
Sum = 8192, with BLC_PER_GRD 16 THR_PER_BLC 512
#################################################################
7 Final
Kernal elapsed time =      0.015(ms), band =
Elapsed time =      0.040(ms)
Sum = 8192, with BLC_PER_GRD 16 THR_PER_BLC 512
#################################################################

#################################################################
6 Unroll the complete loop
Kernal elapsed time =      0.031(ms)
Elapsed time =      0.057(ms)
Sum = 8192, with BLC_PER_GRD 8 THR_PER_BLC 1024
#################################################################
7 Final
Kernal elapsed time =      0.015(ms), band =
Elapsed time =      0.040(ms)
Sum = 8192, with BLC_PER_GRD 8 THR_PER_BLC 1024
#################################################################

#################################################################
6 Unroll the complete loop
Kernal elapsed time =      0.569(ms)
Elapsed time =     12.889(ms)
Sum = 8388608, with BLC_PER_GRD 8192 THR_PER_BLC 1024
#################################################################

And here is my graphics card:

a@M:/usr/local/cuda/samples/bin/x86_64/linux/release$ ./dev*Drv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version:                           9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6078 MBytes (6373572608 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Hmm, let's set 128 threads and size the grid like this:

#define MAX_SHM 49152
#define GRID_SIZE MAX_SHM / sizeof(int)

#define THR_PER_BLC 128
#define BLC_PER_GRD GRID_SIZE/THR_PER_BLC

Then reduce7 works. So does that mean reduce7 strictly depends on the maximum shared memory?

EDIT

It seems I was confused by this line: while (i < n) {, where n is GRID_SIZE. Right now I do not know what i means; it will take me a while to digest. But it is good to know that a block can only hold a certain number of threads, and in this case that has to match what the SM supports.
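For reference, a minimal host-side sketch of how the index i advances in the while (i < n) loop (it reuses the names and values from the kernel and launch above, just traced on the CPU):

#include <cstdio>

// Trace the grid-stride index pattern of reduce7 on the host.
// Values mirror the launch above: nThreads = 512, gridDim.x = gsize = 2, n = GRID_SIZE = 16384.
int main() {
    const unsigned int nThreads = 512;
    const unsigned int gridDimX = 2;
    const unsigned int n        = 1024 * 16;
    const unsigned int gridSize = nThreads * 2 * gridDimX;  // elements covered by the whole grid per pass

    // Follow a single thread: tid 0 of block 1.
    const unsigned int tid = 0, blockIdxX = 1;
    for (unsigned int i = blockIdxX * nThreads * 2 + tid; i < n; i += gridSize)
        printf("block %u, tid %u reads g_idata[%u] and g_idata[%u]\n",
               blockIdxX, tid, i, i + nThreads);
    return 0;
}

Each thread keeps accumulating those elements into its single sdata[tid] slot, which is why the shared array only needs one entry per thread of the block, not one per input element.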

Tags: cuda, gpu, reduce

Solution


First of all, the shared memory needed for this reduction only has to be as large as the needs of the block, not the grid. So asking for shared memory sized to the grid makes no sense.

Second, this asks for 64 Kbytes of statically allocated shared memory per block:

__shared__ volatile int sdata[GRID_SIZE];

That is not going to work, because of this:

Total amount of shared memory per block:       49152 bytes

In addition, this asks for 64KB of dynamically allocated shared memory per block:

case 128:
    reduce7<128><<<gsize, threads, GRID_SIZE * sizeof(int)>>>(g_idata, g_odata, GRID_SIZE);
    break;

So that combination (64K + 64K) is never going to work.
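Spelled out with the question's macro values (and the 49152-byte figure from the deviceQuery output above), a quick host-side sketch of the arithmetic:

#include <cstdio>

// Arithmetic behind the failure, using THR_PER_BLC = 1024 and BLC_PER_GRD = 16 from the question.
int main() {
    const size_t THR_PER_BLC = 1024;
    const size_t BLC_PER_GRD = 16;
    const size_t GRID_SIZE   = THR_PER_BLC * BLC_PER_GRD;   // 16384 elements
    const size_t limit       = 49152;                       // shared memory per block on the GTX 1060

    const size_t staticBytes  = GRID_SIZE * sizeof(int);    // sdata[GRID_SIZE]       -> 65536 bytes
    const size_t dynamicBytes = GRID_SIZE * sizeof(int);    // third launch parameter -> 65536 bytes
    const size_t neededBytes  = THR_PER_BLC * sizeof(int);  // one int per thread     ->  4096 bytes

    printf("requested: %zu static + %zu dynamic, needed: %zu, available per block: %zu\n",
           staticBytes, dynamicBytes, neededBytes, limit);
    return 0;
}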

You seem to be confused about how shared memory is used and how much is needed per block. The block only needs one quantity per thread (an int, in this case).

You may also be confused about the syntax and usage of statically allocated versus dynamically allocated shared memory. For this type of problem you would normally use one or the other, not both.
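For instance, a minimal sketch of the two allocation styles (the kernel names here are illustrative, not from your code):

// Static allocation: the size is fixed at compile time inside the kernel.
template <unsigned int nThreads>
__global__ void staticShmemKernel(int *out) {
    __shared__ int sdata[nThreads];      // one int per thread of the block
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
}

// Dynamic allocation: the size is supplied at launch time as the third <<< >>> parameter.
__global__ void dynamicShmemKernel(int *out) {
    extern __shared__ int sdata[];       // sized by the launch configuration
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
}

// Launches (shared memory sized to the block, never to the grid):
//   staticShmemKernel<256><<<grid, 256>>>(d_out);
//   dynamicShmemKernel<<<grid, 256, 256 * sizeof(int)>>>(d_out);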

I have no idea what this comment means:

 //I added GRID_SIZE myself so it can be volatile

The usual suggestion: any time you are having trouble with a CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck before asking others for help. Even if the sample code you started from does not have proper CUDA error checking, you should add it once you start modifying the code and run into trouble.
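A minimal error-checking sketch (the macro name is just illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; report and stop on the first error.
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                     \
                    cudaGetErrorString(err), __FILE__, __LINE__);            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

// After a kernel launch:
//   reduce7<512><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);
//   CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel runs

cuda-memcheck is then run against the built executable, e.g. cuda-memcheck ./a.out.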

Then reduce7 works. So does that mean reduce7 strictly depends on the maximum shared memory?

It means that reduce7 needs a certain amount of shared memory per block. That amount is one int per thread. That is all it needs. If you give it more, that is OK (sort of), as long as you do not exceed the maximum that can be given. If you exceed the maximum that can be given, the whole kernel launch will fail.

In other words, what you actually need is:

__shared__ volatile int sdata[THR_PER_BLC];
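With that declaration the per-block footprint is THR_PER_BLC * sizeof(int) = 4096 bytes no matter how many blocks you launch, and since the allocation is static the launch needs no third (dynamic shared memory) parameter at all. A sketch, reusing the names from your switch statement:

// 1024 ints * 4 bytes = 4 KB of shared memory per block, well under the 49152-byte limit,
// independent of grid size (so blocksize 1024 with gridsize 1024*8 is fine too).
reduce7<512><<<gsize, threads>>>(g_idata, g_odata, GRID_SIZE);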
