Poor C performance with pthread and printf

Problem description

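I am testing C code on Linux with large arrays to measure thread performance. The application scales very well as the thread count increases up to the maximum number of cores (8 logical cores on the Intel 4770), but this holds only for the pure math part of my code.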

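If I add the printf part for the result arrays, the run time becomes far too long, going from seconds to minutes, even when the output is redirected to a file. Printing those arrays should add only a few seconds.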

Code:

(gcc 7.5.0-Ubuntu 18.04)

Without the printf loop:

gcc -O3 -m64 exp_multi.c -pthread -lm 

With the printf loop:

gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];

/*    gcc -O3 -m64 exp_multi.c -pthread -lm    */


void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    double p;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {

      xv[i]=p;
      yv[i]=exp(p);

    p += 0.00001;
    }

  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

#ifdef PRINT_ARRAY
    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
    }
#endif

    return 0;
}


The malloc before pthread_create is there to pass an integer as the last argument, as suggested in this post.

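I tried the following without success: clang, adding a free(tid) instruction, avoiding malloc, reversing the loops, using only one one-dimensional array, and a single-threaded version without pthread.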

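EDIT2: I think the exp function is processor-resource intensive and is probably affected by the per-core cache or SIMD resources implemented in a given processor generation. The following sample code is based on licensed code posted on Stack Overflow.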

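This code runs fast with or without the printf loop. The exp in math.h was improved a few years ago and can be around 40x faster, at least on the Intel 4770 (Haswell). This link is a well-known test code that compares the math library with SSE2. The speed of the math library's exp should now be close to that of the AVX2 algorithm optimized for float and 8x parallel computation.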

Test results: expf versus other SSE2 algorithms (exp_ps):

sinf .. ->            55.5 millions of vector evaluations/second ->  12 cycles/value
cosf .. ->            57.3 millions of vector evaluations/second ->  11 cycles/value
sincos (x87) .. ->    9.1 millions of vector evaluations/second ->   71 cycles/value
expf .. ->            61.4 millions of vector evaluations/second ->  11 cycles/value
logf .. ->            55.6 millions of vector evaluations/second ->  12 cycles/value
cephes_sinf .. ->     52.5 millions of vector evaluations/second ->  12 cycles/value
cephes_cosf .. ->     41.9 millions of vector evaluations/second ->  15 cycles/value
cephes_expf .. ->     18.3 millions of vector evaluations/second ->  35 cycles/value
cephes_logf .. ->     20.2 millions of vector evaluations/second ->  32 cycles/value
sin_ps .. ->          54.1 millions of vector evaluations/second ->  12 cycles/value
cos_ps .. ->          54.8 millions of vector evaluations/second ->  12 cycles/value
sincos_ps .. ->       54.6 millions of vector evaluations/second ->  12 cycles/value
exp_ps .. ->          42.6 millions of vector evaluations/second ->  15 cycles/value
log_ps .. ->          41.0 millions of vector evaluations/second ->  16 cycles/value
/* Performance test exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

/*    gcc -O3 -m64 -Wall -mavx2 -march=haswell  expc.c -lm     */

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5


__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int r;
    int i;
    float p;
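    /* _mm256_store_ps requires 32-byte aligned addresses */
    static float xv[MAXSIZE] __attribute__((aligned(32)));
    static float yv[MAXSIZE] __attribute__((aligned(32)));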
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX;
    xp = xv;
    yp = yv;

    for (i = 0; i < MAXSIZE; i += 8) {

    __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
    __m256 y = exp256_ps(x);

    _mm256_store_ps(xp,x);
    _mm256_store_ps(yp,y);

    xp += 8;
    yp += 8;
    p += 0.00008;
    }

  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}


For comparison, this is the sample code with expf(x) from the math library, single-threaded and using float.

#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/*    gcc -O3 -m64 exp_st.c -lm    */


int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];


  for (r = 0; r < REIT; r++) {
    p = XXX;

    for (i = 0; i < MAXSIZE; i++) {

      xv[i]=p;
      yv[i]=expf(p);

    p += 0.00001;
    }
  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}

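Solution: As Andreas Wenzel said, the gcc compiler is smart enough to decide that it does not have to actually write the results to the arrays, so these writes were optimized away by the compiler. Before that, I had made several mistakes or assumed facts that were wrong. After new performance tests based on the new information, the results seem clearer: exp(double arg) and expf(float arg), which is 2x+ faster than exp(double arg), have been improved, but they are not as fast as the AVX2 algorithm (8x parallel float arg), which is around 6x faster than the SSE2 algorithm (4x parallel float arg). Here are some results, which are as expected for an Intel Hyper-Threading CPU, except for the SSE2 algorithm: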

exp (double arg) single thread: 18 min 46 sec

exp (double arg) 4 threads: 5 min 4 sec

exp (double arg) 8 threads: 4 min 28 sec

expf (float arg) single thread: 7 min 32 sec

expf (float arg) 4 threads: 1 min 58 sec

expf (float arg) 8 threads: 1 min 41 sec

Relative error**:

i           x                     y = expf(x)           double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059411496e-02   exp_dbl = 8.059411715e-02   rel_err =-2.716802480e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.867173195e-02   y = 9.620664716e-01   exp_dbl = 9.620664730e-01   rel_err =-1.481617428e-09
i = 620000  x = 1.201261759e+00   y = 3.324308872e+00   exp_dbl = 3.324308753e+00   rel_err = 3.571995830e-08
i = 744000  x = 2.441616058e+00   y = 1.149159718e+01   exp_dbl = 1.149159684e+01   rel_err = 2.955980805e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08

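* SSE2 algorithm by Julien Pommier: 6.8x speed increase from one thread to 8 threads. My performance test code uses aligned(16) on the typedef union of the vector/4-float array passed to the library, instead of aligning the whole float array. This may cause lower performance, at least compared with the AVX2 code. Its multithreading improvement also seems good for Intel Hyper-Threading, but at lower speed, with times between 2.5x and 1.5x longer. Perhaps the SSE2 code could be sped up with better array alignment, which I was not able to improve: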

exp_ps (4x parallel float arg) single thread: 12 min 7 sec

exp_ps (4x parallel float arg) 4 threads: 3 min 10 sec

exp_ps (4x parallel float arg) 8 threads: 1 min 46 sec

Relative error**:

i           x                     y = exp_ps(x)         double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059412241e-02   exp_dbl = 8.059411715e-02   rel_err = 6.527768787e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.977407143e-02   y = 9.610065222e-01   exp_dbl = 9.610065335e-01   rel_err =-1.174323454e-08
i = 620000  x = 1.200158238e+00   y = 3.320642233e+00   exp_dbl = 3.320642334e+00   rel_err =-3.054731957e-08
i = 744000  x = 2.441616058e+00   y = 1.149159622e+01   exp_dbl = 1.149159684e+01   rel_err =-5.342903415e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08

AVX2 algorithm (8x parallel float arg) single thread: 1 min 45 sec

AVX2 algorithm (8x parallel float arg) 4 threads: 28 sec

AVX2 algorithm (8x parallel float arg) 8 threads: 27 sec

Relative error**:

i           x                     y = exp256_ps(x)      double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.516632080e+00   y = 8.073104918e-02   exp_dbl = 8.073104510e-02   rel_err = 5.057888540e-08
i = 372000  x =-1.279417157e+00   y = 2.781994045e-01   exp_dbl = 2.781993997e-01   rel_err = 1.705288467e-08
i = 496000  x =-3.954863176e-02   y = 9.612231851e-01   exp_dbl = 9.612232069e-01   rel_err =-2.269774967e-08
i = 620000  x = 1.199879169e+00   y = 3.319715738e+00   exp_dbl = 3.319715775e+00   rel_err =-1.119642824e-08
i = 744000  x = 2.440370798e+00   y = 1.147729492e+01   exp_dbl = 1.147729571e+01   rel_err =-6.896860199e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.923286438e+00   y = 1.374535980e+02   exp_dbl = 1.374536045e+02   rel_err =-4.676466368e-08

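** The relative errors are the same because the SSE2 and AVX2 codes use the same algorithm, and the library function exp(x) most likely uses that algorithm as well.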

Source code of the multithreaded AVX2 algorithm

/* Performance test of a multithreaded exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

  /*    gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm    */

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4

/* _mm256_store_ps requires 32-byte aligned addresses */
typedef float  FLOAT32[MAXSIZE] __attribute__((aligned(32)));
static FLOAT32 xv;
static FLOAT32 yv;



__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}


void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    float p;
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
    xp = xv + single_val*MAXSIZE/num_threads;
    yp = yv + single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {

      __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
      __m256 y = exp256_ps(x);

      _mm256_store_ps(xp,x);
      _mm256_store_ps(yp,y);

      xp += 8;
      yp += 8;
      p += 0.00008;

    }
  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
    }

    return 0;
}

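Chart overview:

[Chart: exp (double arg) without the printf loop] Not real performance. As Andreas Wenzel found, gcc does not compute exp(x) when the results are not printed, and even the float version is much slower because its assembly instructions are different. The chart may still be useful for some assembly algorithms that use only the low-level CPU cache/registers.

[Chart: expf (float arg), real performance, with the printf loop]

[Chart: AVX2 algorithm] The best performance.

Performance test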

Tags: c, multithreading, performance, pthreads

Solution


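When you do not print the arrays at the end of your program, the gcc compiler is smart enough to realize that the results of the calculations have no observable effects. The compiler therefore decides that it does not have to actually write the results to the arrays, because they are never used. These writes are optimized away by the compiler.

Also, when you do not print the results, the library function exp has no observable effects, provided it is not called with input so high that it would cause a floating-point overflow (which would cause the function to raise a floating-point exception). This allows the compiler to optimize these function calls away, too.

As you can see in the assembly instructions that gcc emits for your code when it does not print the results, the compiled program does not unconditionally call the function exp. Instead it tests whether the input to exp would be higher than 7.09e2 (to ensure that no overflow will occur). The program jumps to the code that calls exp only if an overflow would occur. Here is the relevant assembly code: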

ucomisd xmm1, xmm3
jnb .L9

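In the assembly code above, the CPU register xmm3 contains the double-precision floating-point value 7.09e2. If the input is higher than this constant, the function exp would cause a floating-point overflow, because the result cannot be represented in a double-precision floating-point value.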

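Since your input is always valid and low enough not to cause a floating-point overflow, your program will never perform this jump, so it will never actually call the function exp.

This explains why your code is so much faster when you do not print the results. If you do not print the results, your compiler determines that the calculations have no observable effects and optimizes them away.

Therefore, if you want the compiler to actually perform the calculations, you must ensure that the calculations have some observable effect. This does not mean that you have to print all of the results (which are several megabytes large). It is sufficient to print a single line that depends on all of the results, for example the sum of all results.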

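However, if you replace the call to exp with a call to some other custom function, then, at least in my tests, the compiler is not smart enough to realize that the function call has no observable effects. In that case it cannot optimize the function call away, even if you do not print the results of the calculations.

For the reasons stated above, if you want to compare the performance of two functions, you must make sure that the calculations actually take place, by ensuring that the results have an observable effect. Otherwise, you run the risk of the compiler optimizing away at least some of the calculations, and the comparison would not be fair.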

