Why do tasks run in OpenMP threads actually take longer than when run serially?

Problem description

I wrote this code to estimate the value of an integral.

It is a plain, straightforward for() loop, parallelised with OpenMP.

No matter what I do, I cannot get the parallel run time below the serial run time.

What is the problem?

lenmuta, tol, cores and seed are 1, 0.01, number_of_threads and 0, respectively.

Here is the code:

// ================= Libraries
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>

int main( int argc, char* argv[] )
{
    float lenmuta = atof( argv[1] );
    float tol     = atof( argv[2] );
    int   cores   = atoi( argv[3] );
    int   seed    = atoi( argv[4] );

#define M  1 / ( tol*tol*0.01*lenmuta*lenmuta );

    unsigned int N = M;

    printf( "%10.5f \n", tol );
    printf(     "%d \n", N );

    omp_set_num_threads( cores );

    double sum2;
    int    Threadnum;
    float  rvalue;

    Threadnum = omp_get_max_threads();
    rvalue    = lenmuta / ( 1 + lenmuta * lenmuta );

    printf( "the true value is %f \n", rvalue );
    printf( "the number of threads we use is %d \n", Threadnum );

    struct random_data* state;
    double start1 = omp_get_wtime();
    int    k, i;
    double y;
    char   statebuf[32];

    state = malloc( sizeof( struct random_data ) );

    initstate_r( seed, statebuf, 32, state );
    srandom_r(   seed, state );

    // =========Parallel Region=======================

    #pragma omp parallel for private( i, y ) reduction( +:sum2 )
    for( i = 0; i < N; i++ )
    {
         y     = -( 1 / lenmuta ) * log( (double) rand() / ( (double) RAND_MAX ) );
         sum2 +=  cos( y );
    }
    sum2 = sum2 / N;

    printf( "Time: \t %f \n", omp_get_wtime() - start1 );
    printf( "the estimate value is %1.10f \n", sum2 );

    return 0;
}
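A build-and-run line consistent with the parameters above might look like the following ( the source-file name is only illustrative; -fopenmp and -lm are the usual gcc switches, -O3 optional ):

gcc -fopenmp -O3 -o integral integral.c -lm      # "integral.c" is a hypothetical file name
./integral 1 0.01 2 0                            # lenmuta tol cores seed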

Tags: performance, parallel-processing, openmp, parallelism-amdahl

Solution


Talking performance?
The code can run ~ 4x faster (!) for 1E6 loops.

Whether or not the OpenMP tooling gets used, let's start with the facts: OpenMP thread management ( instantiation, task distribution, results collection - the smart reduction(+:sum2) ) all comes at some add-on cost - see the amount ( the proportion ) of assembly instructions spent on it.

[figure: proportion of assembly instructions spent on the OpenMP thread-management overhead]

Given that your #pragma-decorated code has already paid all those add-on costs ( and it did, exactly as instructed ), you got almost nothing in return for the expense burnt: a reduced sum over 1E6 doubles. That reduction is almost but syntax sugar when compared with a zero-add-on-cost, pure-[SERIAL] code execution, where the very same sum costs ~ 18 [ms] if done smartly ( even the naive version stays under ~ 70 [ms] ), compared with the ~ 400 [ms] paid once the thread-management and task-distribution / results-collection overheads are added ( here demonstrated on a 2-core sandbox test ):

   0.01000 
1000000 
the true value is 0.500000      the number of threads we use is 2 

OpenMP as-is    Time:    0.467055     
SERIAL as-is    Time:    0.069820 <~~+            70 [ms] @ 1E6 loops
OpenMP Re-DEF'd Time:    0.328225    |            !!
SERIAL Re-DEF'd Time:    0.017899 <~~+~ 6x FASTER 18 [ms] @ 1E6 loops
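For completeness, the "SERIAL as-is" rows above are assumed to time the very same loop body, only without the #pragma and with the same omp_get_wtime() bracketing - a minimal sketch:

double start1 = omp_get_wtime();                  // identical timing calls
double sum2   = 0.;                               // accumulator starts defined
int    i;
double y;
for( i = 0; i < N; i++ )                          // the very same loop body, no #pragma
{
     y     = -( 1 / lenmuta ) * log( (double) rand() / ( (double) RAND_MAX ) );
     sum2 +=  cos( y );
}
sum2 = sum2 / N;
printf( "Time: \t %f \n", omp_get_wtime() - start1 );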

Erratum: mea culpa - the code avoids one fDIV for the bottom cases ( re-tests showed a ~ +10% speedup - see the code below ).

Testing with as few as 1E6 loops ( on a-few-GHz CPU cores ... ) produces more noise than hard facts. In any case, we can get faster in either of the strategies:

OpenMP as-is    Time:      0.375825     the estimate value is 0.5000982178 
SERIAL as-is    Time:      0.062920     the estimate value is 0.5004906150
                                |||
                               ~300 [us]FASTER--v
OpenMP Re-DEF'd Time:      0.062613     the estimate value is 0.4992401088
                              ~2    [ms]SLOWER--^
                               ||||
SERIAL Re-DEF'd Time:      0.060253     the estimate value is 0.4995912559 

To be fair, note that for longer loops the per-iteration looping overhead makes up a growing share of the total computing time, even with -O3, so the refactored code still shows the fastest results throughout, but the speedup factor shrinks to ~ 1.16x for 25E6 loops.

The core defect:
a way bad imbalance of costs vs. effects
damages the efficiency of any computation

In fact, there is almost nothing to compute here ( a few fADD-s, fMUL, fNEG, fDIV-s and an fLOG inside the most expensive syntax-constructor - not to mention the cost of the randomness itself ), which can hardly justify the costs spent on building up the whole OpenMP code-execution eco-system ( yet we will show it can even be cut down for a ~ 6x faster run ).

Why ever re-compute, a million-plus times, values that are constant?

So, let's weed out whatever must never appear in any performance-oriented section:

double C_LogRAND_MAX = log( (double) RAND_MAX );        // constant: log( RAND_MAX )
double C_1divLENMUTA = -1 / lenmuta;                    // constant: -1 / lenmuta
double C_2sub        = C_LogRAND_MAX * C_1divLENMUTA;   // constant: their product

and :

double sum2 = 0.;                                            // the reduction target must start defined
#pragma omp parallel for private( i, y ) reduction( +:sum2 )
for( i = 0; i < N; i++ )
{
     sum2 +=  cos( C_1divLENMUTA           // fMUL
                 * log( (double) rand() )  // +costs of keeping the seed-state management 
                 - C_2sub                  // fSUB
                   );
}
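The refactored loop computes the very same quantity - the constants were just hoisted out of it:

y = -( 1 / lenmuta ) * log( rand() / RAND_MAX )
  = -( 1 / lenmuta ) * ( log( rand() ) - log( RAND_MAX ) )
  =  C_1divLENMUTA * log( rand() ) - C_2sub

so, at the source level, each iteration drops the fDIV by RAND_MAX, the fDIV hidden in 1 / lenmuta and the fNEG, and keeps just one fMUL, one fSUB, the log() and the cos().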

Last but not least, the parallel sourcing of random sequences deserves another careful look, because these tools try to maintain their internal state, which can cause trouble "across" threads. The good news is that Stack Overflow has plenty of material on getting this performance issue right.
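A minimal sketch ( an assumption, not the code from this answer ) of one common way to keep the pseudo-random state per-thread instead of sharing rand()'s single hidden state: POSIX rand_r() works on a caller-owned seed, so each thread can own its own. The helper name and the seeding scheme are illustrative only; the headers are the same as in the question ( <stdlib.h>, <math.h>, <omp.h> ):

double estimate_with_private_rng( unsigned int N, float lenmuta, int seed )
{
    double sum2          = 0.;
    double C_1divLENMUTA = -1. / lenmuta;
    double C_2sub        = log( (double) RAND_MAX ) * C_1divLENMUTA;

    #pragma omp parallel reduction( +:sum2 )
    {   // each thread owns its seed, so no hidden shared RNG state is contended for
        unsigned int myseed = (unsigned int) seed + 1u + (unsigned int) omp_get_thread_num();

        #pragma omp for
        for( unsigned int i = 0; i < N; i++ )
             sum2 += cos( C_1divLENMUTA * log( (double) rand_r( &myseed ) )  // thread-local state
                        - C_2sub );
    }
    return sum2 / N;
}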


w/o -O3:                                                                                               =:_____________________________________________________________________:[ns]
SERIAL                     NOP       Time:     3.867352 DIV( 2000000000 ) ~   0.000000002     ~   2 [ns]:||:for(){...}loop-overhead                                           :
SERIAL                    RAND()     Time:    10.845900 DIV( 1000000000 ) ~   0.000000011     ~  +9 [ns]:  |||||||||:rand()                                                   :
SERIAL           (double) RAND()     Time:    10.923597 DIV( 1000000000 ) ~   0.000000011     ~  +0 [ns]:           :(double)                                                 :
SERIAL      LOG( (double) RAND() )   Time:    37.156017 DIV( 1000000000 ) ~   0.000000037     ~ +27 [ns]:           |||||||||||||||||||||||||||:log()                         :
SERIAL COS( LOG( (double) RAND() ) ) Time:    54.472115 DIV(  800000000 ) ~   0.000000068     ~ +31 [ns]:                                      |||||||||||||||||||||||||||||||:cos()
SERIAL COS( LOG( (double) RAND() ) ) Time:    55.320798 DIV(  800000000 ) ~   0.000000069               :                        w/o  -O3                                     :
w/-O3: :::( :::( (::::::) ::::() ) )          !!.       :::(  ::::::::: )              !!              =:____________:           :!                                           :
SERIAL COS( LOG( (double) RAND() ) ) Time:     9.305908 DIV(  800000000 ) ~   0.000000012     ~  12 [ns]:||||||||||||            with -O3                                     :
SERIAL COS( LOG( (double) RAND() ) ) Time:     2.135143 DIV(  200000000 ) ~   0.000000011               :                                                                     :                                                                       
SERIAL      LOG( (double) RAND() )   Time:     2.082406 DIV(  200000000 ) ~   0.000000010               :                                                                     :                                                                       
SERIAL           (double) RAND()     Time:     2.244600 DIV(  200000000 ) ~   0.000000011
SERIAL                    RAND()     Time:     2.101538 DIV(  200000000 ) ~   0.000000011
SERIAL                     NOP       Time:     0.000000 DIV(  200000000 ) ~   0.000000000
                                                                                       ^^
                                                                                       ||
                                                                                      !||
                                                                                      vvv
OpenMP COS( LOG( (double) RAND() ) ) Time:    33.336248 DIV(  100000000 ) ~   0.000000333  #pragma omp parallel num_threads(  2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.388479 DIV(    1000000 ) ~   0.000000388  #pragma omp parallel num_threads(  2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.636114 DIV(  100000000 ) ~   0.000000376  #pragma omp parallel num_threads(  2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    38.876272 DIV(  100000000 ) ~   0.000000389  #pragma omp parallel num_threads(  2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    44.226391 DIV(  100000000 ) ~   0.000000442  #pragma omp parallel num_threads(  2 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.333573 DIV(    1000000 ) ~   0.000000334  #pragma omp parallel num_threads(  4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.624111 DIV(  100000000 ) ~   0.000000356  #pragma omp parallel num_threads(  4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.820558 DIV(  100000000 ) ~   0.000000378  #pragma omp parallel num_threads(  4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    38.625498 DIV(  100000000 ) ~   0.000000386  #pragma omp parallel num_threads(  4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    39.782386 DIV(  100000000 ) ~   0.000000398  #pragma omp parallel num_threads(  4 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.317120 DIV(    1000000 ) ~   0.000000317  #pragma omp parallel num_threads(  8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    34.692555 DIV(  100000000 ) ~   0.000000347  #pragma omp parallel num_threads(  8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.360407 DIV(    1000000 ) ~   0.000000360  #pragma omp parallel num_threads(  8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     3.737517 DIV(   10000000 ) ~   0.000000374  #pragma omp parallel num_threads(  8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.380087 DIV(    1000000 ) ~   0.000000380  #pragma omp parallel num_threads(  8 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.354283 DIV(    1000000 ) ~   0.000000354  #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.984292 DIV(  100000000 ) ~   0.000000360  #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     3.654442 DIV(   10000000 ) ~   0.000000365  #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.233374 DIV(  100000000 ) ~   0.000000372  #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     4.112637 DIV(   10000000 ) ~   0.000000411  #pragma omp parallel num_threads( 16 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:    37.813872 DIV(  100000000 ) ~   0.000000378  #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.412896 DIV(    1000000 ) ~   0.000000413  #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    34.098855 DIV(  100000000 ) ~   0.000000341  #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.372660 DIV(  100000000 ) ~   0.000000354  #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    39.256430 DIV(  100000000 ) ~   0.000000393  #pragma omp parallel num_threads( 32 ) with -O3


/////////////////////////////////////////////////////////////////////////////////////////////
//
// -O3
// warning: iteration 2147483647 invokes undefined behavior [-Waggressive-loop-optimizations]
//     for( i = 0; i < N; i++ )
//     ^~~
********************************************************************************************/
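A side note on the -O3 warning quoted above: the loop counter i is a signed int while N is unsigned, so for N above 2147483647 the counter would have to overflow, which is undefined behavior and is exactly what the aggressive loop optimiser points at. One way to silence it ( an illustrative one-liner, not the original code ) is to give the counter a type that matches N:

for( unsigned int i = 0; i < N; i++ )   // counter type matches N, no signed overflow possible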
