20x performance difference between Interlocked.Read and Interlocked.CompareExchange, although both are implemented with lock cmpxchg

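Question

Problem

I want to write a small profiler class that lets me measure the run time of hot paths throughout my application. While doing so, I found an interesting performance difference between two possible implementations that I cannot explain but would like to understand.

Setup

The idea is as follows: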

// somewhere accessible
public static HotPathProfiler profiler = new HotPathProfiler("some name", enabled: true);

// within the program
long ticket = profiler.Enter();
... // hot path
var result = profiler.Exit(ticket: ticket);

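Since not many of these hot paths run in parallel, the idea is to implement this with an array that holds timestamps (0 when a slot is free) and to return the index (called the ticket) when Enter() is called. The class therefore looks like this: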

public class HotPathProfiler
{
    private readonly string _name;
    private readonly bool _enabled;
    private readonly long[] _ticketList;

    public HotPathProfiler(string name, bool enabled)
    {
        _name = name;
        _enabled = enabled;
        _ticketList = new long[128];
    }
}

If code calls Enter() and none of the 128 tickets is available, -1 is returned, which the Exit(ticket) function can handle by returning early.
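The question does not show Exit(ticket). As a rough sketch of the early return described above, the following assumes that Exit frees the slot by writing 0 back and returns the elapsed Stopwatch ticks. The class name TicketSlots, the simplified Enter(), and the return values are illustrative assumptions, not the author's code and not either of the two benchmarked variants.

```csharp
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-in for the profiler's ticket storage (not the author's class).
public class TicketSlots
{
    private readonly long[] _ticketList = new long[128];

    // Claims the first free slot (value 0) and returns its index, or -1 if all slots are taken.
    public long Enter()
    {
        long ts = Stopwatch.GetTimestamp();
        for (int i = 0; i < _ticketList.Length; i++)
        {
            if (Interlocked.CompareExchange(ref _ticketList[i], ts, 0) == 0)
            {
                return i;
            }
        }
        return -1;
    }

    // Returns the elapsed Stopwatch ticks and frees the slot.
    // Returns -1 early for an invalid ticket or a slot that was already free.
    public long Exit(long ticket)
    {
        if (ticket < 0 || ticket >= _ticketList.Length)
        {
            return -1;
        }
        long start = Interlocked.Exchange(ref _ticketList[ticket], 0);
        return start == 0 ? -1 : Stopwatch.GetTimestamp() - start;
    }
}
```

With this shape, a caller that received -1 from Enter() can pass it straight to Exit() without any extra check.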

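While thinking about how to implement the Enter() call, I came across the Interlocked.Read method, which reads a value atomically on 32-bit systems and which, according to the documentation, is unnecessary on 64-bit systems.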
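To illustrate the documentation remark, here is a minimal sketch of a read helper that only uses Interlocked.Read in a 32-bit process and otherwise falls back to Volatile.Read. The helper name AtomicLong is hypothetical, and the assumption that an aligned 64-bit load is atomic in a 64-bit process is taken from the documentation statement quoted above.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class AtomicLong
{
    // Assumption: in a 64-bit process an aligned 64-bit load is already atomic,
    // so a volatile read is enough; in a 32-bit process use Interlocked.Read.
    public static long Read(ref long location)
    {
        return Environment.Is64BitProcess
            ? Volatile.Read(ref location)
            : Interlocked.Read(ref location);
    }
}
```

Whether such a helper is worth having depends on the target platforms; it is shown here only to make the 32-bit versus 64-bit distinction concrete.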

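So I went ahead and implemented several variants of the Enter() method, including one with Interlocked.Read and one with Interlocked.CompareExchange, and compared them with BenchmarkDotNet. That is where I found a huge performance difference: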

|       Method |      Mean |    Error |   StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
|    SafeArray |  28.64 ns | 0.573 ns | 0.536 ns |     295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns |     248 B |

The benchmarks for the two look almost identical:

        [Benchmark]
        public void SafeArray()
        {
            // doesn't matter if 'i < 1' or 'i < 10'
            // performance differs by the same factor (approx. 20x)
            for (int i = 0; i < 1; i++)
            {
                _ticketArr[i] = _hpp_sa.EnterSafe(); 

                // SafeArrayCAS:
                // _ticketArr[i] = _hpp_sa_cas.EnterSafe(); 
            }
        }

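Implementation

Again, a free slot holds the value 0 and an occupied slot holds some other value (the timestamp). Enter() should return the index/ticket of the slot.

SafeArrayCAS (slow)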

        public long EnterSafe()
        {
            if (!_enabled)
            {
                return -1;
            }
            long last = 0;
            long ts = Stopwatch.GetTimestamp();
            long val;
            do
            {
                val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
                last++;
            } while (val != 0 && last < 128);
            return val == 0 ? last : -1;
        }

SafeArray (fast)

        public long EnterSafe()
        {
            if (!_enabled)
            {
                return -1;
            }
            long last = 0;
            long val;
            do
            {
                val = Interlocked.Read(ref _ticketList[last]);
                last++;
            } while (val != 0 && last < 128);
            if (val != 0)
            {
                return -1;
            }
            long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
            if (prev != 0)
            {
                return -1;
            }
            return last;
        }

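Into the rabbit hole

Now, one could say that the difference is not surprising, because the slow method always attempts a CAS on each entry, whereas the other only reads each entry lazily and then attempts a single CAS.

But apart from the fact that the benchmark performs only one Enter(), i.e. a single pass through the do { } while loop, which should not make such a large difference (20x), it becomes even harder to explain once you realize that the atomic read is itself implemented as a CAS: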
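In source form, a read expressed as a CAS can be sketched as follows. It matches the xor r8d,r8d / xor eax,eax / lock cmpxchg sequence that appears for Interlocked.Read in the SafeArray disassembly; the wrapper name ReadViaCas is hypothetical.

```csharp
using System.Threading;

// Hypothetical wrapper, for illustration only.
public static class ReadViaCas
{
    // Compare-and-swap with comparand 0 and new value 0:
    // if the slot holds 0, a 0 is written back (no visible change);
    // otherwise nothing is written. Either way the current value is returned.
    public static long Read(ref long location)
    {
        return Interlocked.CompareExchange(ref location, 0, 0);
    }
}
```

So at the instruction level the read loop and the CAS loop execute the same locked instruction per slot; they differ only in the value they try to store.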

SafeArrayCAS (slow)

        public long EnterSafe()
        {
            if (!_enabled)
[...] // omitted for brevity
            {
                return -1;
[...] // omitted for brevity
            }
            long last = 0;
00007FF82D048FCE  xor         edi,edi  
            long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0  lea         rcx,[rsp+28h]  
00007FF82D048FD5  call        CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)   
00007FF82D048FDA  mov         rsi,qword ptr [rsp+28h]  
00007FF82D048FDF  mov         rax,7FF88CF3E07Ch  
00007FF82D048FE9  cmp         dword ptr [rax],0  
00007FF82D048FEC  jne         HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)  
            long val;
            do
            {
                val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE  mov         rbx,qword ptr [rsp+50h]  
00007FF82D048FF3  mov         rax,qword ptr [rbx+10h]  
00007FF82D048FF7  mov         edx,dword ptr [rax+8]  
00007FF82D048FFA  movsxd      rdx,edx  
00007FF82D048FFD  cmp         rdi,rdx  
00007FF82D049000  jae         HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)  
00007FF82D049002  lea         rdx,[rax+rdi*8+10h]  
00007FF82D049007  xor         eax,eax  
00007FF82D049009  lock cmpxchg qword ptr [rdx],rsi  
                last++;
00007FF82D04900E  inc         rdi  
            } while (val != 0 && last < 128);
00007FF82D049011  test        rax,rax  
00007FF82D049014  je          HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)  
00007FF82D049016  cmp         rdi,80h  
00007FF82D04901D  mov         qword ptr [rsp+50h],rbx  
00007FF82D049022  jl          HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)  
     

SafeArray (fast)

        public long EnterSafe()
        {
            if (!_enabled)
[...] // omitted for brevity
            {
                return -1;
[...] // omitted for brevity
            }
            long last = 0;
00007FF82D046C74  xor         esi,esi  
            long val;
            do
            {
                val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76  mov         rax,qword ptr [rcx+10h]  
00007FF82D046C7A  mov         edx,dword ptr [rax+8]  
00007FF82D046C7D  movsxd      rdx,edx  
00007FF82D046C80  cmp         rsi,rdx  
00007FF82D046C83  jae         HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)  
00007FF82D046C89  lea         rdx,[rax+rsi*8+10h]  
00007FF82D046C8E  xor         r8d,r8d  
00007FF82D046C91  xor         eax,eax  
00007FF82D046C93  lock cmpxchg qword ptr [rdx],r8  
                last++;
00007FF82D046C98  inc         rsi  
            } while (val != 0 && last < 128);
00007FF82D046C9B  test        rax,rax  
00007FF82D046C9E  je          HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)  
00007FF82D046CA0  cmp         rsi,80h  
00007FF82D046CA7  jl          HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)  
            if (val != 0)
[...] // omitted for brevity
            {
                return -1;
[...] // omitted for brevity
            }
            long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF  mov         rcx,qword ptr [rcx+10h]  
00007FF82FBA6AE3  mov         eax,dword ptr [rcx+8]  
00007FF82FBA6AE6  movsxd      rax,eax  
00007FF82FBA6AE9  cmp         rsi,rax  
00007FF82FBA6AEC  jae         HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)  
00007FF82FBA6AEE  lea         rdi,[rcx+rsi*8+10h]  
00007FF82FBA6AF3  mov         qword ptr [rsp+28h],rdi  
00007FF82FBA6AF8  lea         rcx,[rsp+30h]  
00007FF82FBA6AFD  call        CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)  
00007FF82FBA6B02  mov         rdx,qword ptr [rsp+30h]  
00007FF82FBA6B07  xor         eax,eax  
00007FF82FBA6B09  mov         rdi,qword ptr [rsp+28h]  
00007FF82FBA6B0E  lock cmpxchg qword ptr [rdi],rdx  
00007FF82FBA6B13  mov         rdi,rax  
00007FF82FBA6B16  mov         rax,7FF88CF3E07Ch  
00007FF82FBA6B20  cmp         dword ptr [rax],0  
00007FF82FBA6B23  jne         HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)  
            if (prev != 0)
[...] // omitted for brevity

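Summary

I ran everything as an x64 Release build on Win10 on a Xeon E-2176G (6-core Coffee Lake) CPU. The assembler output comes from Visual Studio but matches that of BenchmarkDotNet's DisassemblyDiagnoser.

Leaving aside how and why I am doing this at all, I simply cannot explain the performance difference between the two methods. I would not expect it to be this large. Could it be BenchmarkDotNet itself? Am I missing something else?

It feels like I have a blind spot in my understanding of these low-level matters, and I would like to shed some light on it... Thanks!

PS:

What I have tried so far: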

Tags: c#, performance, assembly, x86-64, atomic
