CuPy: Not seeing kernel concurrency

Problem description

I'm currently working on parallelizing a large matrix computation using CuPy RawKernels with asynchronous streams.

It seems like each RawKernel call waits for the previous kernel to finish, even though I specify that the stream is non-blocking.

[Visual Profiler screenshot: the kernel launches appear serialized, one after another]

Does anyone have an idea on what I'm doing wrong?

Here's a simple example that creates 32 streams. Each stream should copy a single slice of a 3D input array to a 3D output array.

import cupy

kernel = cupy.RawKernel(
    '''
    extern "C" __global__
    void simple_copy(float* iArr, float* oArr, int rows, int cols, int slice) {
        unsigned int col = blockDim.x * blockIdx.x + threadIdx.x;
        unsigned int row = blockDim.y * blockIdx.y + threadIdx.y;

        if (row < rows && col < cols) {
            // This loop is just extra work so the kernel launches are
            // easier to see in the visual profiler.
            for (int i = 0; i < 1000; i++) {
                oArr[rows*cols*slice + row*cols + col] = iArr[rows*cols*slice + row*cols + col];
            }
        }
    }
    ''', 'simple_copy')


device = cupy.cuda.Device()
# [x, y, z]
iArr1 = cupy.ones((32*32, 32*32, 32), dtype=cupy.float32)
oArr1 = cupy.zeros((32*32, 32*32, 32), dtype=cupy.float32)

n = 32
map_streams = []
for i in range(n):
    map_streams.append(cupy.cuda.stream.Stream(non_blocking=True))

# I want to run the kernel on each individual z-axis slice asynchronously
for i, stream in enumerate(map_streams):
    with stream:
        kernel((32, 32), (32, 32), (iArr1, oArr1, 32*32, 32*32, i))
device.synchronize()

Tags: python, cuda, cupy

Solution


It seems like each RawKernel call waits for the previous kernel to finish, even though I specify that the stream is non-blocking. ... Does anyone have an idea on what I'm doing wrong?

You are not doing anything wrong, beyond expecting something to happen that is not possible.

Concurrent kernel execution is only possible if there are sufficient resources available to run more than one kernel at once. All currently supported GPUs have a maximum of 2048 active threads per multiprocessor, and your blocks are 1024 threads each, so at most two blocks can be resident per multiprocessor. Depending on the size of your GPU, that means roughly 60 blocks at an absolute maximum could run simultaneously, while each of your launches enqueues a (32, 32) grid of 1024 blocks. Given that a single kernel launch "saturates" your GPU's capacity many times over, the chance of a second kernel actually having resources to run is extremely slim. That is why you see no overlap or concurrency between kernel launches.
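The arithmetic above can be sketched as a quick back-of-the-envelope check. The SM count here (30) is an assumed figure for a hypothetical mid-range GPU; substitute the value for your own device (e.g. from `cupy.cuda.Device().attributes`):

```python
# Occupancy back-of-the-envelope for the example launch.
# Assumed hardware figures (hypothetical 30-SM GPU):
THREADS_PER_BLOCK = 32 * 32        # block dim (32, 32) -> 1024 threads
MAX_THREADS_PER_SM = 2048          # active-thread limit per multiprocessor
NUM_SMS = 30                       # assumed SM count; device-dependent

# How many of these blocks can be resident on one SM at a time.
blocks_per_sm = MAX_THREADS_PER_SM // THREADS_PER_BLOCK   # 2

# Upper bound on blocks running concurrently across the whole GPU.
concurrent_blocks = blocks_per_sm * NUM_SMS               # ~60

# Each launch enqueues a (32, 32) grid of blocks.
blocks_per_launch = 32 * 32                               # 1024

# Number of "waves" needed to drain one launch (ceiling division).
waves = -(-blocks_per_launch // concurrent_blocks)

print(blocks_per_sm, concurrent_blocks, waves)
```

With these assumed numbers, one launch needs on the order of 18 full waves of blocks, so the GPU is completely occupied long before a kernel on another stream gets a chance to start.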
