python - 像这样在 Numba 中实现 cuda gridsync() 是否安全

问题描述

Numba 缺少 cuda-C 命令 gridsync()，因此没有用于在整个网格中同步的固定方法。只有块级同步可用。

如果 cudaKernal1 的执行时间非常快，那么以下代码的运行速度会快 1000 倍

for i in range(10000):
   X = X + cudaKernel1[(100,100),(32,32)] (X)

通过将循环放入同一个内核中，以避免 gpu 内核设置时间。但是您不能，因为您需要在下一次迭代开始之前完成所有网格，并且 Numba 中没有 gridsync() 命令。

这是在 numba 中执行 gridsync() 的一种明显方法，因此您认为人们会使用此方法，但我找不到任何示例。

然而，我发现很多关于 stackoverflow 的评论——没有解释——试图使用原子计数器来同步网格中的块是没有意义的、不安全的或者会在竞争条件下死锁。相反，他们建议在两个步骤之间退出内核。但是，如果每个步骤都非常快，那么调用内核要比执行它花费更长的时间，因此如果您可以在不退出的情况下循环这些步骤，则速度可以快 1000 倍。

我无法弄清楚什么是不安全的，或者为什么会有一个会成为陷阱的竞争条件。

以下内容有什么问题。

@numba.cuda.jit('void()')
def gpu_initGridSync():
    if ( cuda.threadIdx.x == 0): 
        Global_u[0] = 0
        Global_u[1] = 0

@numba.cuda.jit('void(int32)'device=True)
def gpu_fakeGridSync(i):
    ###wait till the the entire grid has finished doSomething()
    # in Cuda-C we'd call gridsync()
    # but lack that in Numba so do the following instead.

    #Syncthreads in current block
    numba.cuda.syncthreads()

    #increment global counter, once per block
    if ( cuda.threadIdx.x == 0 ):  numba.atomic.add( Global_u, 0, 1 )

    # idle in a loop
    while ( Global_u[0] < (i+1)*cuda.gridDim.x-1 ) ):  pass   #2

    #regroup the block threads after the slow global memory reads.
    numba.cuda.syncthreads()

    # now, to avoid a race condition of blocks re-entering the above while
    # loop before other blocks have exited we do this global sync a second time

     #increment global counter, once per block
    if ( cuda.threadIdx.x == 0 ):  numba.atomic.add( Global_u,1, 1 )

    # idle in a loop
    while ( Global_u[1] > (i+2)*cuda.gridDim.x ) ):  pass   #2

    #regroup the block threads after the slow global memory reads.
    numba.cuda.syncthreads()

然后像这样使用它：

@numba.cuda.jit('void(float32[:])')):
def ReallyReallyFast(X):
    i = numba.cuda.grid(1)
    for h in range(1,40000,4):
        temp = calculateSomething(X)
        gpu_fakeGridSync(h)
        X[i] = X[i]+temp
        gpu_fakeGridSync(h+2)

gpu_initGridSync[(1,),(1,)]()
ReallyReallyFast[(1000,), (32,) ](X)


@numba.cuda.jit('float32(float32[:])',device=True):
def calculateSomething(X):  # A dummy example of a very fast kernel operation
    i = numba.cuda.grid(1)
    if (i>0):
        return (X[i]-X[i-1])/2.0
    return 0.0

在我看来，这在逻辑上是合理的。初始化全局计数器有一个微妙的步骤。这必须在它自己的内核调用中完成以避免竞争条件。但在那之后，我可以自由地调用 fakeGridSync 而无需重新初始化它。我必须跟踪我调用它的循环迭代方式（因此将参数传递给 gridSync）。

我承认我可以看到有一些浪费的努力，但这是一个交易杀手吗？例如，在语句 #2 中，这个 while 循环意味着所有已完成块中的所有线程都在徒劳地旋转它们的轮子。我想这可能会稍微减慢仍在尝试执行“doSomething”的网格块。但是，我不确定这种浪费的努力有多糟糕。对语句#2 的第二个挑剔是所有线程都在争夺相同的全局内存，因此它们访问它的速度会很慢。如果这意味着调度程序推迟它们的执行并让有用的线程更频繁地执行，那甚至可能是一件好事。可以通过仅在每个块中使用 thread(0) 检查该冲突是否是一个问题来改进这种幼稚的代码。

标签： pythoncudasynchronizationnumba

我认为 Robert Crovella 的评论指出了为什么这种方法会失败的正确答案。

我错误地假设调度程序执行了先发制人的多任务处理，以便所有块都有一个时间片来运行。

目前 Nvidia GPU 没有先发制人的多线程调度程序。作业运行完成。

因此，一旦有足够的块进入 while 循环等待，调度程序将不会启动剩余的块。因此等待循环将永远等待。

我看到有研究论文建议 Nvidia 如何使其调度程序先发制人。https://www.computer.org/csdl/proceedings/snpd/2012/2120/00/06299288.pdf 但显然现在情况并非如此。

我想知道 cuda-C 是如何成功执行 gridSync() 命令的。如果它可以在 C 中完成，那么必须有一些通用的方法来解决这些限制。 这是一个谜，希望有人在下面评论

将 1000 倍的加速摆在桌面上真的很可惜。

python - 像这样在 Numba 中实现 cuda gridsync() 是否安全

问题描述

解决方案

推荐阅读