python - CuPy concurrency
Problem description
I am using CuPy (7.0.0) and trying to get streams to run concurrently with a simple example script:
import cupy as cp
# creating streams
map_streams = []
for i in range(0, 100):
    map_streams.append(cp.cuda.stream.Stream(non_blocking=True))
asize = (1000, 100)
# creating arrays on the device
x = cp.ones(asize)
y = cp.ones(asize)
z = cp.ndarray(asize)
# do multiplications in the streams
for stream in map_streams:
    with stream:
        z = x * y
However, the multiplications are executed sequentially, one stream after another:
==8339== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
[...]
432.83ms 18.688us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 42 cupy_multiply__float64_float64_float64 [376]
433.01ms 19.391us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 43 cupy_multiply__float64_float64_float64 [381]
433.32ms 18.720us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 44 cupy_multiply__float64_float64_float64 [386]
433.52ms 19.936us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 45 cupy_multiply__float64_float64_float64 [391]
433.71ms 18.880us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 46 cupy_multiply__float64_float64_float64 [396]
433.89ms 19.680us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 47 cupy_multiply__float64_float64_float64 [401]
434.16ms 19.232us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 48 cupy_multiply__float64_float64_float64 [406]
[...]
Can anyone tell me what is wrong with my script?
Update:
Even when I increase the workload, the streams are still processed sequentially.
asize = (1000, 200)
x = cp.random.rand(asize[0], asize[1])
y = cp.random.rand(asize[0], asize[1])
z = cp.ndarray(asize)
for stream in map_streams:
    with stream:
        z = cp.fft.fft2(x*y)
The profiler output is as follows:
[...]
1.8e+10s 10.784us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 100 cupy_copy__float64_complex128 [5444]
1.8e+10s 20.384us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 100 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5491]
1.8e+10s 10.464us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 100 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5494]
1.8e+10s 29.055us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 100 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5496]
1.8e+10s 10.176us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 101 cupy_copy__float64_complex128 [5502]
1.8e+10s 20.896us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 101 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5549]
1.8e+10s 10.592us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 101 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5552]
1.8e+10s 28.831us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 101 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5554]
1.8e+10s 10.431us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 102 cupy_copy__float64_complex128 [5560]
1.8e+10s 20.959us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 102 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5607]
1.8e+10s 10.720us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 102 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5610]
1.8e+10s 28.640us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 102 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5612]
[...]
Solution
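No answer text survives on this page, so what follows is only a rough reading of the first trace, not a confirmed diagnosis. The sketch below redoes the arithmetic on the seven `cupy_multiply` rows shown above: it compares each kernel's duration with the idle time before the next kernel starts, and it compares the thread count of one launch with the number of threads a GPU can keep resident. The start times, durations, grid size, and block size are copied from the trace. The hardware limits (13 SMs and 2048 resident threads per SM for one GPU of a Tesla K80) are assumptions taken from NVIDIA's published specifications and are not stated in the question.

```python
# Start times (ms) and durations (us) copied from the first nvprof trace.
starts_ms = [432.83, 433.01, 433.32, 433.52, 433.71, 433.89, 434.16]
durations_us = [18.688, 19.391, 18.720, 19.936, 18.880, 19.680, 19.232]

# Idle time between the end of one kernel and the start of the next one.
gaps_us = [
    (next_start - start) * 1000.0 - duration
    for start, next_start, duration in zip(starts_ms, starts_ms[1:], durations_us)
]

# A negative gap would mean two kernels were running at the same time.
overlaps = sum(1 for gap in gaps_us if gap < 0)
mean_gap_us = sum(gaps_us) / len(gaps_us)
mean_duration_us = sum(durations_us) / len(durations_us)

# Launch geometry copied from the trace: grid (782 1 1), block (128 1 1).
grid_blocks, block_threads = 782, 128
threads_per_kernel = grid_blocks * block_threads
elements = 1000 * 100  # asize from the script

# ASSUMPTION: limits for one GPU of a Tesla K80, not given in the question.
sm_count = 13
max_threads_per_sm = 2048
resident_threads = sm_count * max_threads_per_sm

# How many "waves" of resident threads a single launch needs.
waves = threads_per_kernel / resident_threads

print(f"overlapping kernel pairs: {overlaps}")
print(f"mean kernel duration: {mean_duration_us:.1f} us")
print(f"mean idle gap between kernels: {mean_gap_us:.1f} us")
print(f"threads per launch: {threads_per_kernel} for {elements} elements")
print(f"waves needed per launch: {waves:.2f}")
```

If these numbers are read correctly, each kernel finishes long before the next one is even launched, and a single launch already asks for more threads than the assumed GPU can hold at once. Either effect alone could plausibly explain why no overlap appears in the trace, independently of how the streams are created. Much longer-running kernels with small grids would be one way to test that reading.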