python - Numba and guvectorize for CUDA target: code running slower than expected
Problem Description
Notable details
- Large datasets (10 million x 5), (200 x 10 million x 5)
- Mostly Numpy
- Takes longer after each run
- Using Spyder3
- Windows 10
First is an attempt to use guvectorize with the function below. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:
File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper debug=debug)
TypeError: __init__() got an unexpected keyword argument 'debug'
After chasing down everything I could find on this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be said that this error only shows up after the code has run once and then crashed due to memory overload.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...
All of the arrays are numpy arrays
@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]
Attempting to run the code with nvprofiler in the command line results in the following error:
Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support was detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. It can cause kernels that access unified memory to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
I realized that I am using SLI-enabled graphics cards (the two cards are identical, evga gtx 1080ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit to the other card, but I am left with the same results.
I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.
Why is cuda so much slower, and what do these errors mean?
Thanks for the help!
Edit: Here is a working example of the code:
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
I am able to run the code in cuda using a gtx 1080ti; however, it is much slower than running it in parallel or on the cpu. I have looked at other posts regarding guvectorize, but none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda friendly", or is simply doing multiplication across arrays too simple to see any benefit?
Solution
First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host.
It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that might run faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.
Secondly, it's important, I believe, to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way that you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here also.
So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect in fact that 64 is actually just a numba minimum threadblock size, and in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all for-loops, in a serial fashion.
That's obviously not what anyone would intend for rational use of vectorize or guvectorize.
So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a runtime of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel duration in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (which takes about 20 milliseconds total) and we note that the kernel dimensions are now 15625 blocks of 128 threads each for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) in action with your original code.
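That launch configuration arithmetic is worth checking explicitly (a trivial sanity check using the numbers from the profiler trace above):

```python
# Launch configuration reported by nvprof for the vectorize version:
blocks = 15625
threads_per_block = 128
total_workers = blocks * threads_per_block

# One multiply per output element of cv, shape (100, 4000, 5):
elements = 100 * 4000 * 5

# The grid exactly covers the work: one thread per multiplication.
assert total_workers == elements == 2000000
```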
Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matching the other two arrays, and be done with it.
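On the host side, that replication can be sketched with numpy broadcasting (the shapes below are the ones from the question; whether a broadcast view is acceptable to a given ufunc is an assumption, so a contiguous copy is materialized here to be safe):

```python
import numpy as np

# Shapes from the question: multBy is (4000, 5), discount is (100, 4000, 5).
multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# broadcast_to creates a read-only VIEW with the replicated shape (no copy);
# np.array() then materializes a real contiguous array for the ufunc.
multBy3d = np.array(np.broadcast_to(multBy, discount.shape))

# Now a simple elementwise ufunc (like the t15.py vectorize version) applies.
cv = multBy3d * discount
```

The cost is the extra host-to-device transfer of the larger multBy, which the profiler numbers above suggest is small next to the kernel-time savings.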
But the question remains: how to handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.
In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give the possibility for multiple workers to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise if we slice along the first, or countRow dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see below for a worked example of how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on sizes of arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
307.05ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
307.79ms 5.9293ms - - - - - 15.259MB 2.5131GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.34ms 1.3440us - - - - - 8B 5.6766MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.54ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
317.27ms 47.398ms (2 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms 7.3799ms - - - - - 15.259MB 2.0192GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Observations:
- The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified our 3-dimensional functions in the signature to two-dimensional. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on a slice.
- The kernel dimensions as reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization, there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization, there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.
- The kernel performance reported by the profiler of 47.4ms is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (at ~0.001s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are at here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.
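The 2-block grid numba chose can be reproduced with a one-line ceiling division (illustrative arithmetic, using the worker counts discussed above):

```python
# numba rounds the worker count up to whole 64-thread blocks:
workers = 100                                # one per countRow slice
threads_per_block = 64
blocks = -(-workers // threads_per_block)    # ceiling division
assert blocks == 2                           # matches the (2 1 1) grid in the profile
```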
Some additional comments on the python timing: I believe the exact behavior of how python is compiling the necessary kernels for the original, vectorize, and guvectorize improved versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"
We can be guided by what worked to slice along the first dimension. One possible approach would be to rearrange the shape of the arrays so that the 4000 dimension was the first dimension, then remove that, similar to what we did in the previous treatment of guvectorize. Here's a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".
Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.
As a final observation, let's revisit the statement I made earlier (and is similar to statements made in the comment and in the other answer):
"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."
We now have convenient test cases in either t16.py or t17.py that we could work with to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:
$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real 0m0.528s
user 0m0.474s
sys 0m0.047s
$
So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization, and copy of data to/from GPU. The overall walltime is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.
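To put a rough number on that "low arithmetic intensity" claim, here is a back-of-the-envelope estimate (illustrative arithmetic, not a measurement, and assuming the vectorize variant where all three float64 arrays are transferred at full size):

```python
# Work: one multiply per output element of the (100, 4000, 5) result.
elements = 100 * 4000 * 5          # 2,000,000
flops = elements

# Traffic: two float64 input arrays to the device, one output array back.
bytes_moved = 3 * elements * 8     # 48 MB

intensity = flops / bytes_moved    # FLOPs per byte of PCIe traffic
# ~0.04 FLOP/byte: transfer cost dwarfs the compute, so the GPU
# cannot amortize the copies for an operation this simple.
```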