首页 > 解决方案 > Why does not AVX further improve the performance compared with SSE2?

问题描述

I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.

#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>

void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
    for (unsigned long n = 0; n < N; n++) {
        c[n] = sqrt(a[n]) + sqrt(b[n]);
    }
}

void normal(float* a, float* b, float* c, unsigned long N) {
    for (unsigned long n = 0; n < N; n++) {
        c[n] = sqrt(a[n]) + sqrt(b[n]);
    }
}

void sse(float* a, float* b, float* c, unsigned long N) {
    __m128* a_ptr = (__m128*)a;
    __m128* b_ptr = (__m128*)b;

    for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
        __m128 asqrt = _mm_sqrt_ps(*a_ptr);
        __m128 bsqrt = _mm_sqrt_ps(*b_ptr);
        __m128 add_result = _mm_add_ps(asqrt, bsqrt);
        _mm_store_ps(&c[n], add_result);
    }
}

void avx(float* a, float* b, float* c, unsigned long N) {
    __m256* a_ptr = (__m256*)a;
    __m256* b_ptr = (__m256*)b;

    for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
        __m256 asqrt = _mm256_sqrt_ps(*a_ptr);
        __m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
        __m256 add_result = _mm256_add_ps(asqrt, bsqrt);
        _mm256_store_ps(&c[n], add_result);
    }
}

int main(int argc, char** argv) {
    unsigned long N = 1 << 30;

    auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
    auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
    auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));

    std::chrono::time_point<std::chrono::system_clock> start, end;
    for (unsigned long i = 0; i < N; ++i) {                                                                                                                                                                                   
        a[i] = 3141592.65358;           
        b[i] = 1234567.65358;                                                                                                                                                                            
    }

    start = std::chrono::system_clock::now();   
    for (int i = 0; i < 5; i++)                                                                                                                                                                              
        normal(a, b, c, N);                                                                                                                                                                                                                                                                                                                                                                                                            
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;

    start = std::chrono::system_clock::now();     
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                         
        normal_res(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;                                                                                                                                                                                 

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                              
        sse(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;   

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                              
        avx(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;   
    return 0;            
}

I compile my program by using g++ complier as the following.

g++ -msse -msse2 -mavx -mavx512f -O2

The results are as the following. It seems that there is no further improvement when I use more advanced 256 bits vectors.

normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302

I have two questions.

  1. Why does not AVX give me further improvement? Is it because the memory bandwidth?
  2. According to my experiment, the SSE2 perform 10 times faster than the naive version. Why is that? I expect the SSE2 can only be 4 times faster based on its 128 bits vectors with respect to single precision floating points. Thanks a lot.

标签: c++performancesseavxcpu-cache

解决方案


There are several issues here....

  1. Memory bandwidth is very likely to be important for these array sizes -- more notes below.
  2. Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
  3. The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)

Memory Bandwidth Notes:

  • With N = 1<<30 and float variables, each array is 4GiB.
  • Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
  • So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.

Instruction Throughput Notes:

  • The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
  • Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
  • As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same:
  • SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
  • VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
  • Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (randomly assuming a processor frequency).
  • Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.

推荐阅读