首页 > 解决方案 > Is it worse in any aspect to use the CMPXCHG instruction on an 8-bit field than on a 32-bit field?

问题描述

I'd like to ask if using a CMPXCHG instruction on an 8-bit memory field would be worse in any aspect than using it on a 32-bit field.

I'm using C11 stdatomic.h to implement a couple of synchronization methods.

标签: cassemblyx86c11instruction-set

解决方案


No, there's no penalty for lock cmpxchg [mem], reg 8 vs. 32-bit. Modern x86 CPUs can load and store to their L1d cache with no penalty for a single byte vs. an aligned dword or qword. Can modern x86 hardware not store a single byte to memory? answer: it can with zero penalty1 because they spend the transistors to make even unaligned loads/stores fast.

The surrounding asm instructions dealing with a narrow integer in a register should also have negligible if any extra cost vs. [u]int32_t. See Why doesn't GCC use partial registers? - most compilers know how to be careful with partial registers, and modern CPUs (Haswell and later, and all non-Intel) don't rename the low 8 separately from the rest of the register so the only danger is false dependencies. Depending on exactly what you're doing, it might be best to use unsigned local temporaries with an _Atomic uint8_t, or it might be best to make your locals also uint8_t.

Footnote 1: Unlike on some non-x86 CPUs where a byte store actually is implemented with a cache RMW cycle (Are there any modern CPUs where a cached byte store is actually slower than a word store?). On those CPUs you'd hope that atomic xchg would be just as cheap for word vs. byte, but that's too much to hope for with cmpxchg. But almost all non-x86 ISAs have LL/SC instead of xchg / cmpxchg anyway, so even an atomic exchange is separate LL and SC instructions, and the SC would be take an RMW cycle to commit to cache.


推荐阅读