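fortran - Sequential dot product in an OpenACC Fortran loop
Problem description
In a Fortran program, I have a large loop in which dot_product
is called several times on small vectors generated inside the loop: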
program test
implicit none
real :: array1(2, 2), array2(2, 2), res(2)
real :: subarray1(2), subarray2(2)
integer :: i
array1 = 1
array2 = 2
!$acc data copyin(array1, array2) copyout(res)
!$acc kernels
!$acc loop independent private(subarray1, subarray2)
do i = 1, 2
subarray1(:) = array1(:, i)
subarray2(:) = array2(:, i)
res(i) = dot_product(subarray1, subarray2)
enddo
!$acc end kernels
!$acc end data
print "(2(g0, x))", res
endprogram
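When compiled with the PGI compiler, the accelerated implementation of dot_product
appears to use an accelerated loop of its own, which prevents the main loop from being accelerated across both gang and vector: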
test:
11, Generating copyin(array1(:,:)) [if not already present]
Generating copyout(res(:)) [if not already present]
Generating copyin(array2(:,:)) [if not already present]
14, Loop is parallelizable
Generating Tesla code
14, !$acc loop gang ! blockidx%x
15, !$acc loop vector(32) ! threadidx%x
17, !$acc loop vector(32) ! threadidx%x
Generating implicit reduction(+:subarray1$r)
14, CUDA shared memory used for subarray2,subarray1
15, Loop is parallelizable
17, Loop is parallelizable
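As the log shows, it uses an implicit reduction and shared memory for the loop-private vectors.
Is there a way to force dot_product
to run sequentially?
Solution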
Is there a way to force dot_product to run sequentially?
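So long as you don't mind the array syntax being run sequentially as well, just add "gang vector" to the loop directive. With both levels of parallelism assigned to the outer loop, the inner array-syntax and dot_product loops are scheduled as "loop seq", and the private vectors go to local memory instead of CUDA shared memory, as the -Minfo output shows.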
% cat test.f90
program test
implicit none
real :: array1(2, 2), array2(2, 2), res(2)
real :: subarray1(2), subarray2(2)
integer :: i
array1 = 1
array2 = 2
!$acc data copyin(array1, array2) copyout(res)
!$acc kernels loop gang vector private(subarray1, subarray2)
do i = 1, 2
subarray1(:) = array1(:, i)
subarray2(:) = array2(:, i)
res(i) = dot_product(subarray1, subarray2)
enddo
!$acc end data
print "(2(g0, x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
11, Generating copyin(array1(:,:)) [if not already present]
Generating copyout(res(:)) [if not already present]
Generating copyin(array2(:,:)) [if not already present]
13, Loop is parallelizable
Generating Tesla code
13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
14, !$acc loop seq
16, !$acc loop seq
13, Local memory used for subarray2,subarray1
14, Loop is parallelizable
16, Loop is parallelizable
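If you would rather not rely on how the compiler schedules the intrinsic, another option is to write the dot product as an explicit inner loop and mark it "loop seq" yourself. The following is only a sketch under assumptions, not something taken from the answer above: the names test_seq, acc and j are made up for illustration, the private subarrays are dropped in favour of indexing array1 and array2 directly, and "parallel loop" is used in place of "kernels loop". It has not been checked against any particular compiler version.

```fortran
program test_seq
  implicit none
  real :: array1(2, 2), array2(2, 2), res(2)
  real :: acc          ! hypothetical per-iteration accumulator
  integer :: i, j

  array1 = 1
  array2 = 2

  !$acc data copyin(array1, array2) copyout(res)
  ! Outer loop takes both gang and vector parallelism.
  !$acc parallel loop gang vector private(acc)
  do i = 1, 2
    acc = 0.0
    ! Hand-written replacement for dot_product, explicitly sequential.
    !$acc loop seq
    do j = 1, 2
      acc = acc + array1(j, i) * array2(j, i)
    enddo
    res(i) = acc
  enddo
  !$acc end data

  ! Each element should be 1*2 + 1*2 = 4.
  print "(2(g0, 1x))", res
endprogram test_seq
```

Because the inner loop reads the columns of array1 and array2 in place, no private temporaries are needed, so there is nothing for the compiler to put in shared or local memory. Whether this is faster than the "gang vector" version depends on the real vector sizes and should be measured.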