vectorization - How should I interpret the speedup in an icc compiler optimization report?
Question
Environment:
icc version 19.0.0.117 (gcc version 5.4.0 compatibility)
Intel parallel studio XE cluster edition 2019
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Ubuntu 16.04
The compiler flags are:
-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all
I use OpenMP simd or Intel pragmas to vectorize my loops and get a speedup. In the optimization report generated by icc, I usually see results like this:
LOOP BEGIN at get_forces.c(3668,3)
remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access [ get_forces.c(3669,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3669,36) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3669,51) ]
remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access [ get_forces.c(3671,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3671,40) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3671,57) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.773
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 21
remark #15477: vector cost: 11.000
remark #15478: estimated potential speedup: 1.050
remark #15488: --- end vector cost summary ---
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25015: Estimate of max trip count of loop=1
LOOP END
My question is: I don't understand how the speedup is calculated from
normalized vectorization overhead 0.773
scalar cost: 21
vector cost: 11.000
Another, more extreme and more confusing case is
LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,19) ]
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,26) ]
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 1.857
remark #15448: unmasked aligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 3.500
remark #15478: estimated potential speedup: 0.770
remark #15488: --- end vector cost summary ---
remark #25436: completely unrolled by 3
LOOP END
Now, 3.5 + 1.857 = 5.357 < 7
So can I still apply simd to this loop and get a speedup, or should I trust the reported speedup of 0.770 and leave the loop unvectorized?
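Answer
"Scalar cost" is the cost of one iteration of the scalar loop.
"Vector cost" is the cost of one iteration of the vectorized loop divided by vector_length*unroll_factor, that is, the cost equivalent to one scalar iteration.
"Vectorization overhead" is the cost of vector initialization and finalization before and after the loop, normalized by the cost of one vector iteration.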
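To make the normalization concrete, here is a minimal C sketch that converts the figures from the second report into the same units as the scalar cost. The helper names are hypothetical, and the unroll factor of 1 is an assumption, since the report does not state the unroll factor of the vector loop.

```c
#include <assert.h>

/* Cost of one iteration of the vectorized loop, recovered from the
   reported "vector cost" (which is given per scalar-equivalent iteration).
   unroll_factor is an assumed input; the report does not print it. */
double vector_iteration_cost(double vector_cost, int vector_length,
                             int unroll_factor)
{
    assert(vector_length > 0 && unroll_factor > 0);
    return vector_cost * vector_length * unroll_factor;
}

/* Initialization/finalization cost expressed in the same units as
   "scalar cost", recovered from the normalized overhead. */
double absolute_overhead(double normalized_overhead, double vec_iter_cost)
{
    assert(vec_iter_cost > 0.0);
    return normalized_overhead * vec_iter_cost;
}
```

With vector cost 3.5, vector length 2 and an assumed unroll factor of 1, one vector iteration costs 7, and the normalized overhead of 1.857 corresponds to roughly 13 cost units. That is why the sum 3.5 + 1.857 is not meaningful: the two numbers are expressed in different units.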
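"Estimated potential speedup" is computed for the execution of the whole loop. It shows the potential gain of the vectorized loop execution, normalized by the scalar iteration cost, and it accounts for the peel, remainder, and main loops at the estimated loop trip count. It cannot be derived directly from the scalar and vector costs shown above.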
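A toy model can illustrate why a whole-loop estimate depends on the trip count. The sketch below is not Intel's actual cost model: the struct, the function name, the formula, and the choice to charge remainder iterations at the scalar cost are all assumptions made for illustration, and it is not expected to reproduce the numbers in the reports.

```c
#include <assert.h>

/* Hypothetical inputs, named after the fields of the icc report. */
typedef struct {
    double scalar_cost;    /* cost of one scalar iteration */
    double vector_cost;    /* vector loop cost per scalar-equivalent iteration */
    double overhead_norm;  /* setup/teardown, in units of one vector iteration */
    int    vector_length;  /* elements processed per vector iteration */
    int    unroll_factor;  /* assumed; not printed in the report */
} loop_costs;

/* Toy whole-loop speedup estimate: scalar total divided by
   (overhead + main vector loop + scalar remainder). */
double toy_speedup(loop_costs c, int trip_count)
{
    assert(c.vector_length > 0 && c.unroll_factor > 0 && trip_count > 0);

    int    step       = c.vector_length * c.unroll_factor;
    double vec_iter   = c.vector_cost * step;       /* one vector iteration */
    double overhead   = c.overhead_norm * vec_iter; /* fixed cost per loop run */
    int    main_iters = trip_count / step;          /* full vector iterations */
    int    remainder  = trip_count % step;          /* leftover scalar iterations */

    double vector_total = overhead
                        + main_iters * vec_iter
                        + remainder * c.scalar_cost;
    double scalar_total = trip_count * c.scalar_cost;

    return scalar_total / vector_total;
}
```

Feeding in the second report's figures (scalar cost 7, vector cost 3.5, normalized overhead 1.857, vector length 2, assumed unroll factor 1) with a trip count of 3 gives a value below 1, while a trip count of 1000 gives a value close to 2. In this model the fixed overhead dominates for short loops, which is consistent with the compiler judging vectorization inefficient for a loop that runs only a few iterations.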