Numba and guvectorize for CUDA target: code running slower than expected

Problem description

Notable details

First up is attempting to use guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply two of the arrays together. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:

文件 "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", >line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

After following up on everything I could find about this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be said that this error only shows up after it has run once and then crashed due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprof in the command line results in the following warning:

Warning: Unified memory profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. This can cause kernels that access unified memory to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (the two cards are identical, evga gtx 1080ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to restrict it to the other card, but I am left with the same result.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is still slower than target='parallel' and target='cpu'.

Why is cuda so slow, and what do these errors mean?

Thank you for the help!

EDIT: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I can run the code in cuda using a gtx 1080ti; however, it is much slower than running it in parallel or on the cpu. I have looked at other posts about guvectorize, but none of them have helped me understand what is and what isn't optimal to run in guvectorize. Is there some way to make this code "cuda friendly", or is simply doing multiplication across arrays too simple to see any benefit?

Tags: python, performance, cuda, numba, nvprof

Solution


First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that might run faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.
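A rough back-of-the-envelope estimate (my own addition, using the array sizes from the question) illustrates how low the arithmetic intensity is here:

# approximate arithmetic intensity of the operation in the question
multiplies = 100 * 4000 * 5                              # one multiply per output element = 2,000,000
bytes_moved = (100*4000*5 + 4000*5 + 100*4000*5) * 8     # discount in + multBy in + cv out, all float64
print(multiplies / bytes_moved)                          # ~0.06 multiplies per byte moved over PCIe

That is only about one multiply for every 16 bytes transferred between host and device, so the transfers dominate no matter how fast the kernel is.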

Secondly, it's important, I believe, to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way that you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here also.
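To illustrate that principle (a minimal sketch of my own, not taken from the question or the code below), the ufunc body is written for a single pair of scalar elements, and numba's ufunc machinery maps it over, and parallelizes across, whatever size array you actually pass in:

from numba import vectorize
import numpy as np

# the "worker" is defined for one scalar element; numba maps it over the whole array
@vectorize(['float64(float64, float64)'], target='parallel')
def scale(a, b):
    return a * b

x = np.arange(1000000, dtype=np.float64)
y = np.arange(1000000, dtype=np.float64)
z = scale(x, y)   # numba spins up workers, each handling a slice of the 1,000,000 elements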

So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is actually just a numba minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all the for-loops, in a serial fashion.

That's obviously not what anyone would intend for rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount:

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a runtime of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel durations in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2,000,000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really, only 1 thread) in action with your original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matching the other two arrays, and be done with it.
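For example (a sketch of my own, not from the original answer), multBy could be expanded to discount's shape with numpy broadcasting and then fed to the vectorize version from t15.py above:

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
# broadcast (4000,5) up to (100,4000,5) and take a contiguous copy so it can be
# transferred to the device like any other array
multBy_full = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))
cv = cVestDiscount(multBy_full, discount)   # the vectorize version from t15.py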

But the question remains: how to handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
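For reference, a minimal @cuda.jit sketch of my own (not part of the original answer, and written here as one thread per output element over a flattened 1D grid) might look something like this:

import numpy as np
from numba import cuda

@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    # one thread per element of the (dates, ids, vals) output array
    i = cuda.grid(1)
    dates, ids, vals = cv.shape
    if i < dates * ids * vals:
        d = i // (ids * vals)
        rem = i % (ids * vals)
        j = rem // vals
        k = rem % vals
        cv[d, j, k] = multBy[j, k] * discount[d, j, k]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
cv = np.zeros_like(discount)
threads = 128
blocks = (cv.size + threads - 1) // threads        # 15625 blocks for 2,000,000 elements
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)

That launch geometry is comparable to the vectorize version shown above, but of course it is still subject to the same host/device transfer cost.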

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give the possibility for multiple workers to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise if we slice along the first, or countRow dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see below for a worked example of how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on sizes of arrays we pass). The code could look something like this:

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional array types in the signature to 2-dimensional ones. We are, after all, taking a two-dimensional "slice" of the 3-dimensional data, and letting each worker operate on a slice.

  2. The kernel dimensions as reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization, there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization, there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.

  3. The kernel performance reported by the profiler of 47.4ms is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (at ~0.001s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are at here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.

Some additional comments on the python timing: I believe the exact behavior of how python is compiling the necessary kernels for the original, vectorize, and guvectorize improved versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise with the overall wall time and the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what worked to slice along the first dimension. One possible approach would be to rearrange the shape of the arrays so that the 4000 dimension was the first dimension, then remove that, similar to what we did in the previous treatment of guvectorize. Here's a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".

Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.
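One practical caveat (my own note, not from the original answer): if your discount data already lives in the original (100,4000,5) layout, it would have to be rearranged on the host before calling the t17.py version, and that transpose/copy itself costs some host time:

discount_r = np.ascontiguousarray(discount.transpose(1, 0, 2))   # (100,4000,5) -> (4000,100,5)
cv_r = np.zeros_like(discount_r)
cv_r = cVestDiscount(countRow, multBy, discount_r, n, cv_r)      # the t17.py kernel
cv = np.ascontiguousarray(cv_r.transpose(1, 0, 2))               # back to (100,4000,5) if needed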

As a final observation, let's revisit the statement I made earlier (and is similar to statements made in the comment and in the other answer):

"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."

We now have convenient test cases in either t16.py or t17.py that we could work with to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization, and copy of data to/from GPU. The overall walltime is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.

