python - CuPY: Not seeing kernel concurrency
Problem description
I'm currently working on parallelizing a huge matrix computation using CuPy RawKernels with async streams.
It seems like each RawKernel call is waiting for the previous kernel to finish, even though I specify that the stream is non-blocking.
Does anyone have an idea of what I'm doing wrong?
Here's a simple example that creates 32 streams. Each stream should copy a single slice of a 3D input array to a 3D output array.
import cupy

kernel = cupy.RawKernel(
'''
extern "C"
__global__ void simple_copy(float* iArr, float* oArr, int rows, int cols, int slice){
    unsigned int col = blockDim.x*blockIdx.x + threadIdx.x;
    unsigned int row = blockDim.y*blockIdx.y + threadIdx.y;
    if(row < rows && col < cols){
        // this for loop is just additional work to see kernel launches in visual profiler more easily
        for(int i=0; i<1000; i++){
            oArr[rows*cols*slice + row*cols + col] = iArr[rows*cols*slice + row*cols + col];
        }
    }
}
''', 'simple_copy')

device = cupy.cuda.Device()

# [x, y, z]
iArr1 = cupy.ones((32*32, 32*32, 32), dtype=cupy.float32)
oArr1 = cupy.zeros((32*32, 32*32, 32), dtype=cupy.float32)

n = 32
map_streams = []
for i in range(n):
    map_streams.append(cupy.cuda.stream.Stream(non_blocking=True))

# I want to run the kernel on individual z-axis slices asynchronously
for i, stream in enumerate(map_streams):
    with stream:
        kernel((32, 32), (32, 32), (iArr1, oArr1, 32*32, 32*32, i))

device.synchronize()
Solution
It seems like each RawKernel call is waiting for the previous kernel to finish, even though I specify that the stream is non-blocking. ... Does anyone have an idea of what I'm doing wrong?
You are not doing anything wrong, beyond expecting something to happen that is not possible.
Concurrent kernel execution is only possible if there are sufficient resources available to run more than one kernel at once. All currently supported GPUs have a maximum of 2048 active threads per multiprocessor, and your blocks are 1024 threads each. That means a maximum of two blocks can run per multiprocessor. Depending on the size of your GPU, that means fewer than roughly 60 blocks could run simultaneously at an absolute maximum. Given that one kernel launch (a 32x32 grid, i.e. 1024 blocks) will "saturate" your GPU's capacity many times over, the possibility of a second kernel actually having resources to run is extremely slim. That is why you see no overlap or concurrency between your kernel launches.