Align a local variable to a 16-byte boundary (x86 asm)

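Question

I'm having trouble allocating a 128-bit variable so that it is aligned on a 16-byte boundary (on the stack, not the heap). I have no control over whether the stack is aligned when my function is called, so I just assume that it isn't.

Here's what my function looks like (simplified):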

; start of stackframe
push ebp
mov ebp, esp

; space for our variable
sub esp, 0x10

; the 128-bit variable would be at [ebp - 0x10]
...

; end of stackframe
mov esp, ebp
pop ebp

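Now, I could align the variable by inserting and esp, 0xFFFF'FFF0 before the sub esp, 16, but then I would no longer be able to reference it as [ebp - 0x10], because ebp would still refer to the old, unaligned stack pointer.

With that in mind, I figure I need to align the stack before the mov ebp, esp instruction, so that I can align my variable manually. So in this example: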

; align esp
and esp, 0xFFFF'FFF0

; start of stackframe
push ebp
mov ebp, esp

; padding (because of the push ebp)
sub esp, 0xC

; space for our variable
sub esp, 0x10

; the 128-bit variable would be at [ebp - 0x1C] (ebp is 4 below the aligned address)
...

; end of stackframe
mov esp, ebp
pop ebp

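The problem is that we won't clean up the stack properly at the end of the stack frame (not sure). That's because we do the mov ebp, esp after aligning the stack, so the original ESP is lost.

I can't really think of a good way to do this. I feel like this should be a common problem because of the SSE alignment requirements, but I couldn't find much information on the topic. Also keep in mind that I have no control over the stack before my function is called, since this is shellcode.

Edit: I guess one solution would be to wrap my stack frame in another stack frame. So something like this: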

push ebp
mov ebp, esp

; align the stack
and esp, 0xFFFF'FFF0

; the "real" stackframe start
push ebp
mov ebp, esp

; padding due to the push ebp prior to this
sub esp, 0xC

; space for our variable
sub esp, 0x10

; our variable is now at [ebp - 0x1C] (i think)
...

; the "real" stackframe end
mov esp, ebp
pop ebp

mov esp, ebp
pop ebp

Tags: assembly, x86, memory-alignment, callstack

Answer


After aligning the stack, reference locals relative to ESP. Or if you don't need many integer regs, possibly just align EDI or something instead of ESP itself, and access memory relative to that.

   push   ebp
   mov    ebp, esp            ; or any register, doesn't really matter

   and    esp, -16            ; round ESP down to a multiple of 16, reserving 0 to 12 bytes
   sub    esp, 32             ; reserve 32 bytes we know are there for sure.

   mov    dword [esp+4], 1234 ; store a local

   xorps  xmm0, xmm0
   movaps [esp+16], xmm0      ; zero 16 bytes of space with an aligned store

   leave                      ; mov esp, ebp ; pop ebp
   ret
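
The other option mentioned at the top (leave ESP alone and align a different register) might look like the sketch below. The choice of EDI and the 32-byte reservation are assumptions for illustration: 16 bytes for the variable plus enough slack to round the pointer up.

```nasm
   bits 32

   push   edi                 ; EDI is call-preserved in the usual 32-bit conventions
   sub    esp, 32             ; 16 bytes for the variable + slack for rounding up

   lea    edi, [esp+15]
   and    edi, -16            ; EDI = lowest 16-byte-aligned address >= ESP

   xorps  xmm0, xmm0
   movaps [edi], xmm0         ; aligned store; [edi, edi+16) is inside the reserved 32 bytes

   add    esp, 32             ; ESP was never rounded, so a plain add undoes the sub
   pop    edi
   ret
```

The cost is one integer register tied up for as long as the variable is live; the benefit is that ESP can be restored with a plain add, with no saved copy needed.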

If you push args before a function call, remember that this changes ESP temporarily, so ESP-relative offsets to your locals shift until the args are popped. You might simplify by reserving enough space up front as part of the initial sub and simply storing args with mov, like GCC does with -faccumulate-outgoing-args.
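
A minimal sketch of storing args with mov instead of push. The callee and its two-dword cdecl-style signature are hypothetical, made up only to have something to call.

```nasm
   bits 32

func:
   push   ebp
   mov    ebp, esp
   and    esp, -16
   sub    esp, 32             ; [esp+0..7]: outgoing args, [esp+16..31]: aligned local

   xorps  xmm0, xmm0
   movaps [esp+16], xmm0      ; the local keeps this offset for the whole function

   mov    dword [esp], 1      ; first arg (lowest address)
   mov    dword [esp+4], 2    ; second arg
   call   callee              ; ESP is identical before and after the call

   movaps xmm1, [esp+16]      ; same offset is still valid here
   leave
   ret

callee:                       ; hypothetical helper: returns arg1 + arg2 in EAX
   mov    eax, [esp+4]
   add    eax, [esp+8]
   ret
```

Because nothing is pushed, there is nothing to pop after the call, and ESP is also 16-byte aligned at the call site.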


If you need access to incoming function args on the stack, you can still access them relative to EBP.

There are lots of ways to solve this problem depending on what you still need access to and what you don't. e.g. after aligning the stack you could stash the pointer to the pre-aligned value in memory somewhere, freeing up all 7 other registers. (In that case, you could load any stack args into registers before aligning the stack, so you don't need to keep a pointer to the top of your stack frame.)
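
For instance, stashing the pre-alignment ESP in memory could be sketched like this. The single stack arg and the slot offset [esp+12] are assumptions chosen for the example.

```nasm
   bits 32

func:
   mov    eax, [esp+4]        ; load the (hypothetical) stack arg before aligning
   mov    edx, esp            ; copy of the incoming ESP
   and    esp, -16
   sub    esp, 32
   mov    [esp+12], edx       ; stash old ESP at a known offset; EDX is free again

   ; ... EBP and every other integer register can be used here ...
   ; aligned 16-byte local lives at [esp+16]

   mov    esp, [esp+12]       ; restore the original ESP (address uses the old ESP value)
   ret
```

This only works if ESP at the restore point equals ESP at the stash point, so any pushes in between have to be balanced.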


Look at clang output, or GCC8 and later, when compiling C or C++ with alignas(32) for locals, e.g. on https://godbolt.org/. Those compilers (with -O2) do what I suggested and reference locals relative to ESP after aligning the stack.

The standard 32-bit Linux calling convention aligns ESP by 16 before a call pushes a return address, so a simple sub can always reach a known alignas(16) boundary. Depending on how your shellcode is reached, you might not be able to take advantage of that even if you're exploiting code that does have that guarantee. e.g. the ret at the end of a vulnerable function will restore 16-byte stack alignment if this is a classic code-injection exploit of a buffer overflow that simply overwrites the return address with a pointer directly to your code, as opposed to a chain of return addresses for a ROP attack.

Anyway, that's why you should use a higher alignas if you want to see how compilers handle it. The compilers on Godbolt other than MSVC are installed to target Linux. Many other 32-bit ABIs only guarantee 4-byte stack alignment.
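
When the 16-byte guarantee does hold (assumed here: the function is reached by an ordinary call from code that follows that Linux convention), ESP+4 is a multiple of 16 on entry, so a fixed sub is enough and no and is needed. A sketch:

```nasm
   bits 32

func:                         ; on entry: (ESP + 4) % 16 == 0
   sub    esp, 28             ; 28 + 4 (return address) = 32, so ESP % 16 == 0 again

   xorps  xmm0, xmm0
   movaps [esp], xmm0         ; aligned 16-byte local at [esp]
                              ; [esp+16 .. esp+27] is padding or room for other locals
   add    esp, 28
   ret
```

If the code is instead entered by the ret of a hijacked function, the entry alignment is different (ESP itself would be a multiple of 16), so the constant would have to change.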


In shellcode it might make more sense to just use movups loads and stores and not bother with stack alignment, even though that means you can't use memory source operands unless you use the AVX version. e.g. paddd xmm0, [esp+16] could fault if ESP isn't aligned by 16, but movups xmm1, [esp+16] can't, and neither can vpaddd xmm0, xmm0, [esp+16].
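
A sketch of the no-alignment approach with legacy SSE encodings, assuming nothing about ESP:

```nasm
   bits 32

   sub    esp, 16             ; ESP alignment unknown, and we don't care
   xorps  xmm0, xmm0
   movups [esp], xmm0         ; unaligned store: fine at any address

   movups xmm1, [esp]         ; separate unaligned load ...
   paddd  xmm0, xmm1          ; ... then a register-register op (SSE2)
   ; paddd xmm0, [esp]        ; would fault if ESP is not 16-byte aligned

   add    esp, 16
```

The extra movups load before each arithmetic instruction is the payload-size cost to weigh against the aligning prologue.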

You'll have to decide whether separate load instructions cost you more payload size than the prologue.

Also, [ESP] addressing modes always require a SIB byte, costing you 1 extra byte of code size per instruction. So that's one downside. For performance, saving uops is often worth it, but for code size it can be worth using the 3-byte setup sequence of push reg / mov reg, esp.
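
The byte counts can be checked by assembling a few lines and looking at the listing. The encodings in the comments below are what an assembler should emit in 32-bit mode; ESI is an arbitrary register choice for the example.

```nasm
   bits 32

   movaps [esp+16], xmm0      ; 0F 29 44 24 10   5 bytes: ModRM + SIB + disp8
   movaps [esi+16], xmm0      ; 0F 29 46 10      4 bytes: ModRM + disp8

   mov    eax, [esp+4]        ; 8B 44 24 04      4 bytes
   mov    eax, [esi+4]        ; 8B 46 04         3 bytes

   ; the 3-byte setup that buys the shorter encodings:
   push   esi                 ; 56               1 byte
   mov    esi, esp            ; 89 E6            2 bytes
```

So the setup only pays for itself in code size once there are more than three stack accesses that would otherwise go through ESP.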


If you don't need to return, just and esp, -16 and forget about it! e.g. do this at the very top of your shellcode, to whatever alignment you want, then take advantage of it for any calls/rets inside your payload. The entry point of your exploit isn't going to ret (right?), and you usually don't care what was on the stack above it, so you don't need to preserve the old value.
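
As a sketch, the top of such a payload could be as short as this (the label name and the 16-byte scratch slot are assumptions):

```nasm
   bits 32

entry:
   and    esp, -16            ; round ESP down; the old value is deliberately discarded
   sub    esp, 16             ; one aligned 16-byte scratch slot at [esp]

   xorps  xmm0, xmm0
   movaps [esp], xmm0         ; safe: ESP is a multiple of 16 from here on

   ; ... rest of the payload; it never returns to the original caller ...
```

Any call made from this point pushes a 4-byte return address, so code inside the payload that is reached by call sees ESP+4 as a multiple of 16 on entry.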

