c++ - Why does not AVX further improve the performance compared with SSE2?
问题描述
I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.
#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>
void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void normal(float* a, float* b, float* c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void sse(float* a, float* b, float* c, unsigned long N) {
__m128* a_ptr = (__m128*)a;
__m128* b_ptr = (__m128*)b;
for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
__m128 asqrt = _mm_sqrt_ps(*a_ptr);
__m128 bsqrt = _mm_sqrt_ps(*b_ptr);
__m128 add_result = _mm_add_ps(asqrt, bsqrt);
_mm_store_ps(&c[n], add_result);
}
}
void avx(float* a, float* b, float* c, unsigned long N) {
__m256* a_ptr = (__m256*)a;
__m256* b_ptr = (__m256*)b;
for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
__m256 asqrt = _mm256_sqrt_ps(*a_ptr);
__m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
__m256 add_result = _mm256_add_ps(asqrt, bsqrt);
_mm256_store_ps(&c[n], add_result);
}
}
int main(int argc, char** argv) {
unsigned long N = 1 << 30;
auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
std::chrono::time_point<std::chrono::system_clock> start, end;
for (unsigned long i = 0; i < N; ++i) {
a[i] = 3141592.65358;
b[i] = 1234567.65358;
}
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal(a, b, c, N);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal_res(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
sse(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
avx(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
return 0;
}
I compile my program by using g++ complier as the following.
g++ -msse -msse2 -mavx -mavx512f -O2
The results are as the following. It seems that there is no further improvement when I use more advanced 256 bits vectors.
normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302
I have two questions.
- Why does not AVX give me further improvement? Is it because the memory bandwidth?
- According to my experiment, the SSE2 perform 10 times faster than the naive version. Why is that? I expect the SSE2 can only be 4 times faster based on its 128 bits vectors with respect to single precision floating points. Thanks a lot.
解决方案
There are several issues here....
- Memory bandwidth is very likely to be important for these array sizes -- more notes below.
- Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
- The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
Memory Bandwidth Notes:
- With N = 1<<30 and float variables, each array is 4GiB.
- Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
- So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
Instruction Throughput Notes:
- The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
- Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
- As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same:
- SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
- VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
- Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (randomly assuming a processor frequency).
- Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.
推荐阅读
- php - 在这里,我正在使用 Laravel 共享 API 的响应,其中我想在 'doctor_detail_data' 数组而不是 'doctor_requests' 中专门化名称
- machine-learning - 我们是否需要在 Pyspark ML 中分别保存和加载管道和模型?
- python - 如何将 API 数据保存到 csv 文件中?另外如何修复回溯错误?
- amazon-web-services - 通过 Kinesis Stream 或 Firehose 写入 S3
- amazon-web-services - 成功生成 S3 预签名 URL 后,为什么会收到“拒绝访问”?
- swiftui - 如何防止 SwiftUI 在 ForEach 循环中重新初始化 Initialized-View?
- python - 我需要使用 pandas 数据框根据第二个分类变量中的值来估算分类变量的缺失值
- javascript - 控制台中关于 Promise 的输出
- react-native - 使用 amazon chime 集成到现有 react-native 项目时导入 NativeMobileSDKBridgePackage 时出错
- git - Git 太大,无法上传到 github。除master外的所有分支都被删除,所有文件都被清除