In x64, when using "pop [RAX]", where is the value temporarily stored?

Question

I've found answers explaining that on x86 a direct memory-to-memory copy is not possible without storing the value somewhere in between:

mov rax,[RSI]
mov [RDI],rax

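I make heavy use of 64-bit writes to memory using pop, which appears to copy the value directly from memory to memory, without any obvious "middle man".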

Where is the value held after it has been read but before it is written?

Tags: assembly, x86, x86-64, cpu-architecture

Answer


The temporary location is a buffer somewhere inside the CPU that isn't part of the architectural state.

On a modern x86 like Skylake, pop [mem] decodes as 2 uops, so presumably the first uop is a pop into an internal register, and the 2nd is a store.

We know that modern x86 CPUs have a few extra logical registers reserved for use by microcode and multi-uop instructions like this. They're renamed onto the physical register file the same way architectural registers are. For example, http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ mentions "some extra architectural registers for internal use". Henry calls them "architectural" registers, but that's potentially confusing terminology: he just means logical registers, as opposed to physical ones, in the same sense that the architectural registers are logical. These temporary registers aren't (AFAIK) used across instruction boundaries, only within one x86 instruction.
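
To make the two-step picture concrete, here is a minimal C model of the architectural effect of pop qword [mem]. It is only a sketch: model_cpu, model_pop_mem and tmp are hypothetical names invented for illustration, and real hardware uses a renamed internal register, not a C variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: the only state we track is the stack pointer,
   represented as a pointer into an array of 64-bit slots. */
typedef struct {
    const uint64_t *rsp;   /* stands in for RSP: points at the top of stack */
} model_cpu;

/* Models what `pop qword [dst]` does as seen by the program.
   `tmp` stands in for the hidden, non-architectural temporary. */
static void model_pop_mem(model_cpu *cpu, uint64_t *dst)
{
    assert(cpu != NULL && dst != NULL);

    uint64_t tmp = *cpu->rsp;  /* step 1: load the qword at [rsp] into the temporary */
    cpu->rsp += 1;             /* rsp advances by one 8-byte slot */
    *dst = tmp;                /* step 2: store the temporary to the destination */
}
```

The value lives in tmp between the load and the store, and the program has no way to observe it. That is the sense in which the temporary is not part of the architectural state.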

The original 8086 was non-pipelined (except for instruction prefetch), so the internal microcode or logic that implemented pop [mem] presumably just loaded into some special-purpose buffer and then stored from it. That's like add [mem], reg, but with a different address for the load vs. the store, and without feeding the value through the ALU.

"direct memory-to-memory copy is not possible on x86."

You're probably referring to things like the accepted answer on "Why IA32 does not allow memory to memory mov?". That explanation of the reason is unfortunately just plain wrong and very misleading.

It's an instruction-encoding limitation that makes mov [mem], [mem] impossible, not a limitation of CPU internals. See "What x86 instructions take two (or more) memory operands?"
pop [mem] is one of them because one of the memory operands is implicit. Even the original 8086 could do this.


"I make heavy use of 64bit writes to memory using pop"

If front-end uop throughput or port 2/3 pressure is a bottleneck, consider using 128-bit SSE loads from the stack, then store 64-bit halves with movlps and movhps. On current Intel CPUs (like Skylake), movhps [mem], xmm0 is a single-uop instruction. (Actually micro-fused; all stores are store-address + store-data. But anyway, no port 5 shuffle uop needed like for the useless memory-destination form of pextrq).
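
As a rough sketch of the movlps / movhps idea, here is one way it might look in C with SSE intrinsics. The helper name scatter_two_qwords is hypothetical, the instruction names in the comments are what these intrinsics typically map to (the compiler is free to pick something else), and the memcpy branch is only a fallback for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: read two adjacent qwords at src with one 128-bit
   load, then store each half to its own, unrelated destination. */
static void scatter_two_qwords(const uint64_t *src,
                               uint64_t *dst_lo, uint64_t *dst_hi)
{
    assert(src != NULL && dst_lo != NULL && dst_hi != NULL);
#if defined(__SSE2__)
    /* One unaligned 16-byte load (typically movdqu or movups). */
    __m128 v = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)src));
    _mm_storel_pi((__m64 *)dst_lo, v);   /* typically movlps [dst_lo], xmm */
    _mm_storeh_pi((__m64 *)dst_hi, v);   /* typically movhps [dst_hi], xmm */
#else
    /* Portable fallback with the same result, no SSE involved. */
    memcpy(dst_lo, src, sizeof *dst_lo);
    memcpy(dst_hi, src + 1, sizeof *dst_hi);
#endif
}
```

Compared with two pop [mem] instructions, this replaces two loads with one, so it only helps if the source qwords are adjacent in memory, as they are on the stack.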

Or if your destinations are contiguous, do 128-bit or 256-bit copies.

There are use-cases for pop [mem], but it's not wonderful, and typically not faster on mainstream Intel than pop reg / mov [mem], reg because it's still 2 uops. It does save code size and doesn't need a temporary register, though.

See https://agner.org/optimize/

