OpenMP behavior: ICC and GCC produce significantly different run times


For a small OpenMP benchmark on an i7-6700K, I wrote the following code:

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0);

    auto t1 = high_resolution_clock::now();
    for(int k = 0; k < bench_rounds; ++k) {
        #pragma omp parallel for collapse(2)
        for(int j = 0; j < large_vec_size; ++j) {
            for(int i = 0; i < large_vec_size; ++i) {
                test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
            }
        }
    }
    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> ms_double = t2 - t1;
    std::cout << ms_double.count() << "ms\n";

    return 0;
}


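I compiled it in four different ways:

  1. With the latest Intel compiler, using icpc main.cpp -o test
  2. With the latest Intel compiler, using icpc -qopenmp main.cpp -o test -liomp5
  3. With GCC 11.2, using g++ main.cpp -o test
  4. With GCC 11.2, using g++ -fopenmp main.cpp -o test -lgomp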


The results, in the same order:

  1. Warning "unrecognized OpenMP #pragma #pragma omp parallel for collapse(2)", run time: 2490 ms
  2. No warning, run time: 14080 ms
  3. No warning, run time: 45550 ms
  4. No warning, run time: 13400 ms

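The GCC results are more or less what I expected: I am running on four cores, and the speedup is slightly more than three. For the Intel compiler, however, I do not understand the results: why is it so much faster, especially when the OpenMP pragma is ignored?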


g++ main.cpp -o test_gcc_clean
g++ -fopenmp main.cpp -o test_gcc_omp -lgomp
g++ -fopenmp -march=native -mavx -O3 main.cpp -o test_gcc_opt -lgomp
icpc main.cpp -o test_icc_clean
icpc -qopenmp main.cpp -o test_icc_omp -liomp5
icpc -qopenmp -march=native -mavx -O3 main.cpp -o test_icc_opt -liomp5


echo "Clean GCC"
./test_gcc_clean
echo "GCC with OpenMP"
./test_gcc_omp
echo "Optimized GCC"
./test_gcc_opt
echo "Clean ICC"
./test_icc_clean
echo "ICC with OpenMP"
./test_icc_omp
echo "Optimized ICC"
./test_icc_opt


Clean GCC
GCC with OpenMP
Optimized GCC
Clean ICC
ICC with OpenMP
Optimized ICC


icpc -march=native -mavx -O3 main.cpp -o test_icc_nomp

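The build above is faster still, with a run time of 1286 ms, but the compiler emits a warning during compilation saying that it does not recognize the OpenMP pragma.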


#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0),
    test_vec_II(large_vec_size * large_vec_size, 0),
    test_vec_III(large_vec_size * large_vec_size, 0),
    test_vec_IV(large_vec_size * large_vec_size, 0);

    // Variant "Coll": int counters with collapse(2)
    auto t1 = high_resolution_clock::now();
    for(int k = 0; k < bench_rounds; ++k) {
        #pragma omp parallel for collapse(2)
        for(int j = 0; j < large_vec_size; ++j) {
            for(int i = 0; i < large_vec_size; ++i) {
                test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
            }
        }
    }
    auto t2 = high_resolution_clock::now();

    // Variant "SIMD": int counters, parallel outer loop, simd inner loop
    auto t3 = high_resolution_clock::now();
    for(int k = 0; k < bench_rounds; ++k) {
        #pragma omp parallel for
        for(int j = 0; j < large_vec_size; ++j) {
            #pragma omp simd
            for(int i = 0; i < large_vec_size; ++i) {
                test_vec_II[j * large_vec_size + i] = i + j + test_vec_II[j * large_vec_size + i] * 1e-13;
            }
        }
    }
    auto t4 = high_resolution_clock::now();

    // Variant "CoST": size_t counters with collapse(2)
    auto t5 = high_resolution_clock::now();
    for(int k = 0; k < bench_rounds; ++k) {
        #pragma omp parallel for collapse(2)
        for(size_t j = 0; j < large_vec_size; ++j) {
            for(size_t i = 0; i < large_vec_size; ++i) {
                test_vec_III[j * large_vec_size + i] = i + j + test_vec_III[j * large_vec_size + i] * 1e-13;
            }
        }
    }
    auto t6 = high_resolution_clock::now();

    // Variant "SIST": size_t counters, parallel outer loop, simd inner loop
    auto t7 = high_resolution_clock::now();
    for(int k = 0; k < bench_rounds; ++k) {
        #pragma omp parallel for
        for(size_t j = 0; j < large_vec_size; ++j) {
            #pragma omp simd
            for(size_t i = 0; i < large_vec_size; ++i) {
                test_vec_IV[j * large_vec_size + i] = i + j + test_vec_IV[j * large_vec_size + i] * 1e-13;
            }
        }
    }
    auto t8 = high_resolution_clock::now();

    duration<double, std::milli> ms_double = t2 - t1,
    ms_double_simd = t4 - t3,
    ms_double_sizet = t6 - t5,
    ms_double_simd_sizet = t8 - t7;
    std::cout << "Coll: " << ms_double.count() << " ms\n";
    std::cout << "SIMD: " << ms_double_simd.count() << " ms\n";
    std::cout << "CoST: " << ms_double_sizet.count() << " ms\n";
    std::cout << "SIST: " << ms_double_simd_sizet.count() << " ms\n";

    // Check that every variant produced the same values as the first one
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_II.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_III.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_IV.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }

    return 0;
}


Clean GCC
Coll: 46281.8 ms
SIMD: 47917.9 ms
CoST: 44322 ms
SIST: 44275.4 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
GCC with OpenMP
Coll: 13799.6 ms
SIMD: 14546 ms
CoST: 12913.8 ms
SIST: 13113.1 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC
Coll: 4955.54 ms
SIMD: 5080.45 ms
CoST: 5203.64 ms
SIST: 5011.17 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC, no OpenMP
Coll: 5201.49 ms
SIMD: 5198.48 ms
CoST: 6148.23 ms
SIST: 6279.94 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Clean ICC
Coll: 2579.12 ms
SIMD: 5315.75 ms
CoST: 5296.52 ms
SIST: 6892.02 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
ICC with OpenMP
Coll: 14466 ms
SIMD: 4974.81 ms
CoST: 13539.5 ms
SIST: 4963.63 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC
Coll: 15753.4 ms
SIMD: 5114.96 ms
CoST: 13509.4 ms
SIST: 5100.88 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC, no OpenMP
Coll: 1302.34 ms
SIMD: 5200.3 ms
CoST: 5535.02 ms
SIST: 5565.15 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True


Tags: c++, openmp, icc


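The problem comes from the collapse(2) clause and is also related to auto-vectorization. Neither compiler is able to auto-vectorize the loop when it is collapsed, but ICC emits a very expensive idiv instruction in the middle of the hot loop (which is very bad), whereas GCC produces better code. The cause is that the collapse(2) clause is not well optimized (in many compilers). You can see this on Godbolt (Compiler Explorer). Note that optimizing a kernel that uses collapse(2) is not easy, because the compiler does not know the loop bounds, and therefore knows neither the associated cost nor the divisor for the modulus.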

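Without collapse(2), GCC vectorizes the loop successfully but, surprisingly, ICC does not. Fortunately, we can help ICC with the simd directive. Once it is used, both compilers generate relatively good code. It is still not optimal, because on mainstream x86-64 platforms size_t is generally 8 bytes and int is 4 bytes, and comparing loop counters against a bound of a different type makes the code harder to vectorize efficiently and harder to compile to the best scalar instructions. You can use a temporary variable to fix this. The resulting assembly code can be inspected to confirm the improvement.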

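Note that the assembly generated by ICC is very good once the code is fixed. The code is memory-bound, and the final code should saturate the RAM with only a few threads. Even the L3 cache should be saturated by the ICC-generated assembly if the input array fit in it.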


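The fixed loop, using a temporary bound with the same type as the loop counters:

for(int k = 0; k < bench_rounds; ++k) {
    // Copy the bound into an int so counters and bound share one type
    int limit = static_cast<int>(large_vec_size);
    #pragma omp parallel for
    for(int j = 0; j < limit; ++j) {
        #pragma omp simd
        for(int i = 0; i < limit; ++i) {
            test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
}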

