Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Is there any way on Windows to work around the requirement that the XMM registers are preserved within a function call?(Aside from writing it all in assembly)

I have many AVX2 intrinsic functions that are unfortunately bloated by this.

As an example this will be placed at the top of the function by the compiler(MSVC):

00007FF9D0EBC602 vmovaps xmmword ptr [rsp+1490h],xmm6
00007FF9D0EBC60B vmovaps xmmword ptr [rsp+1480h],xmm7
00007FF9D0EBC614 vmovaps xmmword ptr [rsp+1470h],xmm8
00007FF9D0EBC61D vmovaps xmmword ptr [rsp+1460h],xmm9
00007FF9D0EBC626 vmovaps xmmword ptr [rsp+1450h],xmm10
00007FF9D0EBC62F vmovaps xmmword ptr [rsp+1440h],xmm11
00007FF9D0EBC638 vmovaps xmmword ptr [rsp+1430h],xmm12
00007FF9D0EBC641 vmovaps xmmword ptr [rsp+1420h],xmm13
00007FF9D0EBC64A vmovaps xmmword ptr [rsp+1410h],xmm14
00007FF9D0EBC653 vmovaps xmmword ptr [rsp+1400h],xmm15

And then at the end of the function..

00007FF9D0EBD6E6 vmovaps xmm6,xmmword ptr [r11-10h]
00007FF9D0EBD6EC vmovaps xmm7,xmmword ptr [r11-20h]
00007FF9D0EBD6F2 vmovaps xmm8,xmmword ptr [r11-30h]
00007FF9D0EBD6F8 vmovaps xmm9,xmmword ptr [r11-40h]
00007FF9D0EBD6FE vmovaps xmm10,xmmword ptr [r11-50h]
00007FF9D0EBD704 vmovaps xmm11,xmmword ptr [r11-60h]
00007FF9D0EBD70A vmovaps xmm12,xmmword ptr [r11-70h]
00007FF9D0EBD710 vmovaps xmm13,xmmword ptr [r11-80h]
00007FF9D0EBD716 vmovaps xmm14,xmmword ptr [r11-90h]
00007FF9D0EBD71F vmovaps xmm15,xmmword ptr [r11-0A0h]

That is 20 instructions that do nothing since I have no need to preserve the state of XMM. I have 100's of these functions that the compiler is bloating up like this. They are all invoked from the same call site via function pointers.

I tried changing the calling convention(__vectorcall/cdecl/fastcall) but that doesn't appear to do anything.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
160 views
Welcome To Ask or Share your Answers For Others

1 Answer

Use the x86-64 System V calling convention for your helper functions that you want to piece together via function pointers. In that calling convention, all of xmm/ymm0..15 and zmm0..31 are call-clobbered so even helper functions that need more than 5 vector registers don't have to save/restore any.

The outer interpreter function that calls them should still use Windows x64 fastcall or vectorcall, so from the outside it fully respects that calling convention.

This will hoist all the save/restore of XMM6..15 into that caller, instead of each helper function. This reduces static code size and amortizes the runtime cost over multiple calls through your function pointers.


AFAIK, MSVC doesn't support marking functions as using the x86-64 System V calling convention, only fastcall vs. vectorcall, so you'll have to use clang.

(ICC is buggy and fails to save/restore XMM6..15 around a call to a System V ABI function).

Windows GCC is buggy with 32-byte stack alignment for spilling __m256, so it's not in general safe to use GCC with -march= with anything that includes AVX.


Use __attribute__((sysv_abi)) or __attribute__((ms_abi)) on function and function-pointer declarations.

I think ms_abi is __fastcall, not __vectorcall. Clang may support __attribute__((vectorcall)) as well, but I haven't tried it. Google results are mostly feature requests/discussion.

void (*helpers[10])(float *, float*) __attribute__((sysv_abi));

__attribute__((ms_abi))
void outer(float *p) {
    helpers[0](p, p+10);
    helpers[1](p, p+10);
    helpers[2](p+20, p+30);
}

compiles as follows on Godbolt with clang 8.0 -O3 -march=skylake. (gcc/clang on Godbolt target Linux, but I used explicit ms_abi and sysv_abi on both the function and function-pointers so the code gen doesn't depend on the fact that the default is sysv_abi. Obviously you'd want to build your function with a Windows gcc or clang so calls to other functions would use the right calling convention. And a useful object-file format, etc.)

Notice that gcc/clang emit code for outer() that expects the incoming pointer arg in RCX (Windows x64), but pass it to the callees in RDI and RSI (x86-64 System V).

outer:                                  # @outer
        push    r14
        push    rsi
        push    rdi
        push    rbx
        sub     rsp, 168
        vmovaps xmmword ptr [rsp + 144], xmm15 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 128], xmm14 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 112], xmm13 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 96], xmm12 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 80], xmm11 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 64], xmm10 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 48], xmm9 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 32], xmm8 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 16], xmm7 # 16-byte Spill
        vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
        mov     rbx, rcx                            # save p 
        lea     r14, [rcx + 40]
        mov     rdi, rcx
        mov     rsi, r14
        call    qword ptr [rip + helpers]
        mov     rdi, rbx
        mov     rsi, r14
        call    qword ptr [rip + helpers+8]
        lea     rdi, [rbx + 80]
        lea     rsi, [rbx + 120]
        call    qword ptr [rip + helpers+16]
        vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
        vmovaps xmm7, xmmword ptr [rsp + 16] # 16-byte Reload
        vmovaps xmm8, xmmword ptr [rsp + 32] # 16-byte Reload
        vmovaps xmm9, xmmword ptr [rsp + 48] # 16-byte Reload
        vmovaps xmm10, xmmword ptr [rsp + 64] # 16-byte Reload
        vmovaps xmm11, xmmword ptr [rsp + 80] # 16-byte Reload
        vmovaps xmm12, xmmword ptr [rsp + 96] # 16-byte Reload
        vmovaps xmm13, xmmword ptr [rsp + 112] # 16-byte Reload
        vmovaps xmm14, xmmword ptr [rsp + 128] # 16-byte Reload
        vmovaps xmm15, xmmword ptr [rsp + 144] # 16-byte Reload
        add     rsp, 168
        pop     rbx
        pop     rdi
        pop     rsi
        pop     r14
        ret

GCC makes basically the same code. But Windows GCC is buggy with AVX.

ICC19 makes similar code, but without the save/restore of xmm6..15. This is a showstopper bug; if any of the callees do clobber those regs like they're allowed to, then returning from this function will violate its calling convention.

This leaves clang as the only compiler that you can use. That's fine; clang is very good.


If your callees don't need all the YMM registers, saving/restoring all of them in the outer function is overkill. But there's no middle ground with existing toolchains; you'd have to hand-write outer in asm to take advantage of knowing that none of your possible callees ever clobber XMM15 for example.


Note that calling other MS-ABI functions from inside outer() is totally fine. GCC / clang will (barring bugs) emit correct code for that too, and it's fine if a called function chooses not to destroy xmm6..15.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...