首页 > 解决方案 > 如何对几行 C 编程代码进行基准测试?

问题描述

我最近听说了无分支编程的想法,我想尝试一下,看看它是否可以提高性能。我有以下 C 函数。

int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

删除 if 分支后,我有这个:

int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

现在我想知道无分支版本是否更快。我四处搜寻,发现了一个名为 hyperfine 的工具(https://github.com/sharkdp/hyperfine)。所以我写了下面main的函数,并squarehyperfine.

int main() {
    printf("%d\n", square(38));
    return 0;
}

问题是基于超精细结果,我无法确定哪个版本更好。在 C 编程中,人们通常如何确定函数的哪个版本更快?

下面是我的一些hyperfine结果。

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.4 ms ±   0.2 ms    [User: 2.2 ms, System: 3.2 ms]
  Range (min … max):     4.9 ms …   6.1 ms    230 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       6.1 ms ±   0.7 ms    [User: 2.2 ms, System: 3.7 ms]
  Range (min … max):     5.0 ms …   9.7 ms    225 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.5 ms ±   0.3 ms    [User: 2.1 ms, System: 3.5 ms]
  Range (min … max):     4.9 ms …   7.0 ms    211 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.6 ms ±   0.4 ms    [User: 2.0 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   7.0 ms    217 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.


C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.7 ms ±   0.3 ms    [User: 1.9 ms, System: 4.0 ms]
  Range (min … max):     5.0 ms …   6.6 ms    220 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.6 ms ±   0.3 ms    [User: 1.9 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   6.9 ms    219 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 1.5 ms, System: 4.0 ms]
  Range (min … max):     5.2 ms …   7.3 ms    224 runs

C:\my_projects\untitled>

标签: cperformanceperformance-testingbenchmarkingmicrobenchmark

解决方案


如何对几行 C 编程代码进行基准测试?

编译代码并检查编译器生成的程序集。

通常使用 Godbolt 并在那里检查生成的程序集。神螺栓链接

一种半不可靠的方法是计算执行的汇编指令。我不了解 Windows - 我在 linux 上工作。使用 gdb,我使用此问题中提供的代码,并使用:

// 1.c
#if MACRO
int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}
#else
int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}
#endif
// start-stop places for counting assembly instructions
// Adding attribute and a specific asm syntax that is a GNU extension
// So that the compiler will not optimize the functions out
__attribute__((__noinline__)) void begin() { __asm__("nop"); }
__attribute__((__noinline__)) void finish() { __asm__("nop"); }
// trying to use volatile so that compiler 
// wouldn't optimize the function completely out
volatile int arg = 38, res;
int main() {
    begin();
    res = square(arg);
    finish();
}

然后在 bash 中编译和基准测试:

# a short function to count number of instructions executed between "begin" and "finish" functions
$ b() { printf "%s\n" 'set editing off' 'set prompt' 'set confirm off' 'set pagination off' 'b begin' 'r' 'set $count=0' 'while ($pc != finish)' 'stepi' 'set $count=$count+1' 'end' 'printf "The count of instruction between begin and finish is: %d\n", $count' 'q' | gdb "$1" |& grep 'The count'; }

# then compile and measure
$ gcc -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 34
$ gcc -D MACRO=1 1.c ; b a.out
The count of instruction between begin and finish is: 22

看起来在我的平台上使用 gcc10 编译器没有任何选项而没有优化,第二个版本执行了 12 条短指令。但是将编译器输出与优化进行比较是没有意义的。启用优化后,一条指令有所不同:

$ gcc -O -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 11
$ gcc -O -D MACRO=1 1.c ; b a.out 
The count of instruction between begin and finish is: 10

笔记:

  • 您的代码square(38)可以优化为无操作。
  • 使用您的代码,hyperfine branchless.exe您正在比较printfie 的执行情况。刷新输出并打印它所花费的时间,而不是square().
  • 该答案中所述,您可以在可用时使用硬件计数器。

推荐阅读