assembly - Using ymm registers as a "memory-like" storage location

Question

Ask a Question

Welcome To Ask or Share your Answers For Others

assembly - Using ymm registers as a "memory-like" storage location

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Consider the following loop in x86:

; on entry, rdi has the number of iterations
.top:
; some magic happens here to calculate a result in rax
mov [array + rdi * 8], rax ; store result in output array
dec rdi
jnz .top

It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order as we index with rdi.

I would like to transform the above loop not make any writes to memory (we can assume the non-shown calculation doesn't write to memory).

As long as the loop count in rdi is limited, I could use the ample space (512 bytes) provided by the ymm regs to save the values instead, but it seems awkward to actually do this, since you can't "index" an arbitrary register.

One approach would be to always shuffle the entire "array" of ymm registers by one element and then insert the element in the newly freed up position.

Something like this:

vpermq  ymm3, ymm3, 10_01_00_11b ; left rotate ymm by qword
vpermq  ymm2, ymm2, 10_01_00_11b ; left rotate ymm by qword
vpermq  ymm1, ymm1, 10_01_00_11b ; left rotate ymm by qword
vpermq  ymm0, ymm0, 10_01_00_11b ; left rotate ymm by qword

vblenddd ymm3, ymm3, ymm2, 3     ; promote one qword of ymm2 to ymm3
vblenddd ymm2, ymm2, ymm1, 3     ; promote one qword of ymm1 to ymm2
vblenddd ymm1, ymm1, ymm0, 3     ; promote one qword of ymm0 to ymm1

pinsrq   xmm0, rax, 0  ; playing with mixed-VEX mode fire (see Peter's answer)

This shows only handling four of the 16 registers, so evidently to do all 16 it will be a lot of code (32 instructions).

Is there a better way?

Unpredictable branches are undesirable, but we can still consider solutions that use them.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1.0k views

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:50:16+0000

You can't vpinsrq into a YMM register. Only an xmm destination is available, so it unavoidably zeros the upper lane of the full YMM register. It was introduced with AVX1 as the VEX version of the 128-bit instruction. AVX2 and AVX512 did not upgrade it to YMM/ZMM destinations. I'm guessing they didn't want to provide insert into high lanes, and it would have been odd to provide a YMM version that still only looked at the lowest bit of the imm8.

You're going to need a scratch register and then blend into a YMM with vpblendd. Or (on Skylake or AMD) use the legacy-SSE version to leave the upper bytes unchanged! On Skylake, writing an XMM reg with a legacy-SSE instruction has a false dependency on the full register. You want this false dependency. (I haven't tested this; it might trigger a merging uop of some sort). But you don't want this on Haswell where it saves the upper halves of all the YMM regs, going into "state C".

The obvious solution is to leave yourself a scratch reg to use for vmovq+vpblendd (instead of vpinsrq y,r,0). That's still 2 uops, but vpblendd doesn't need port 5 on Intel CPUs, in case that matters. (movq uses port 5). If you're really hard up for space, themm0..7` MMX registers are available.

Reducing the cost

With nested loops, we can split the work. With a small amount of unrolling of the inner loop, we can mostly remove that part of the cost.

For example, if we have an inner loop produce 4 results, we can use your brute-force stack approach on 2 or 4 registers in the inner loop, giving moderate overhead with no actual unrolling ("magic" payload appears only once). 3 or 4 uops, optionally with no loop-carried dep chain.

; on entry, rdi has the number of iterations
.outer:
    mov       r15d, 3
.inner:
; some magic happens here to calculate a result in rax

%if  AVOID_SHUFFLES
    vmovdqa   xmm3, xmm2
    vmovdqa   xmm2, xmm1
    vmovdqa   xmm1, xmm0
    vmovq     xmm0, rax
%else
    vpunpcklqdq  xmm2, xmm1, xmm2        ; { high=xmm2[0], low=xmm1[0] }
    vmovdqa   xmm1, xmm0
    vmovq     xmm0, rax
%endif

    dec   r15d
    jnz   .inner

    ;; Big block only runs once per 4 iters of the inner loop, and is only ~12 insns.
    vmovdqa  ymm15, ymm14
    vmovdqa  ymm13, ymm12
    ...

    ;; shuffle the new 4 elements into the lowest reg we read here (ymm3 or ymm4)

%if  AVOID_SHUFFLES       ; inputs are in low element of xmm0..3
    vpunpcklqdq  xmm1, xmm1, xmm0     ; don't write xmm0..2: longer false dep chain next iter.  Or break it.
    vpunpcklqdq  xmm4, xmm3, xmm2
    vinserti128  ymm4, ymm1, xmm4, 1  ; older values go in the top half
    vpxor        xmm1, xmm1, xmm1     ; shorten false-dep chains

%else                     ; inputs are in xmm2[1,0], xmm1[0], and xmm0[0]
    vpunpcklqdq  xmm3, xmm0, xmm1     ; [ 2nd-newest,  newest ]
    vinserti128  ymm3, ymm2, xmm3, 1
    vpxor        xmm2, xmm2,xmm2   ; break loop-carried dep chain for the next iter
    vpxor        xmm1, xmm1,xmm1   ; and this, which feeds into the loop-carried chain
%endif

    sub   rdi, 4
    ja   .outer

Bonus: this only requires AVX1 (and is cheaper on AMD, keeping 256-bit vectors out of the inner loop). We still get 12 x 4 qwords of storage instead of 16 x 4. That was an arbitrary number anyway.

Limited unrolling

We can unroll just the inner loop, like this:

.top:
    vmovdqa     ymm15, ymm14
    ...
    vmovdqa     ymm3, ymm2           ; 12x movdqa
    vinserti128 ymm2, ymm0, xmm1, 1

    magic
    vmovq       xmm0, rax
    magic
    vpinsrq     xmm0, rax, 1
    magic
    vmovq       xmm1, rax
    magic
    vpinsrq     xmm1, rax, 1

    sub         rdi, 4
    ja          .top

When we leave the loop, ymm15..2 and xmm1 and 0 full of valuable data. If they were at the bottom, they'd run the same number of times but ymm2 would be a copy of xmm0 and 1. A jmp to enter the loop without doing the vmovdqa stuff on the first iter is an option.

Per 4x magic, this costs us 6 uops for port 5 (movq + pinsrq), 12 vmovdqa (no execution unit), and 1x vinserti128 (port 5 again). So that's 19 uops per 4 magic, or 4.75 uops.

You can interleave the vmovdqa + vinsert with the first magic, or just split it before / after the first magic. You can't clobber xmm0 until after the vinserti128, but if you have a spare integer reg you can delay the vmovq.

More nesting

Another loop nesting level, or another unrolling, would greatly reduce the amount of vmovdqa instructions. Just getting the data shuffled into YMM regs at all has a minimum cost, though. Loading an xmm from GP regs.

AVX512 can give us cheaper int->xmm. (And it would allow writing to all 4 elements of a YMM). But I don't see it avoiding the need to unroll or nest loops to avoid touching all the registers every time.

PS:

My first idea for the shuffle accumulator was shuffling elements one to the left. But then I realized this ended up with 5 elements of state, not 4, because we had high and low in two regs, plus the newly-written xmm0. (And could have used vpalignr.)

Leaving here as an example of what you can do with vshufpd: move low to high in one register, and merge in the high from another as the new low.

    vshufpd   xmm2, xmm1,xmm2, 01b     ; xmm2[1]=xmm2[0], xmm2[0]=xmm1[1].  i.e. [ low(xmm2), high(xmm1) ]
    vshufpd   xmm1, xmm0,xmm1, 01b
    vmovq     xmm0, rax

AVX512: indexing vectors as memory

For the general case of writing to vector regs as memory, we can vpbroadcastq zmm0{k1}, rax and repeat for other zmm registers with a different k1 mask. Broadcasts with merge-masking (where the mask has a single bit set) give us an indexed store into vector registers, but we need one instruction for every possible destination register.

Creating the mask:

xor      edx, edx
bts      rdx, rcx          #  rdx = 1<<(rcx&63)
kmovq     k1, rdx
kshiftrq  k2, k1, 8
kshiftrq  k3, k1, 16
...

To read from a ZMM register:

vpcompressq zmm0{k1}{z}, zmm1    ; zero-masking: zeros whole reg if no bits set
vpcompressq zmm0{k2},    zmm2    ; merge-masking
... repeat as many times as you have possible source regs

vmovq       rax, zmm0

(See the docs for vpcompressq: with zero-masking it zeros all elements above the one it writes)

To hide vpcompressq latency, you can do multiple dep chains into multiple tmp vectors, then vpor xmm0, xmm0, xmm1 at the end. (One of the vectors will be all zero, the other will have the selected element.)

On SKX it has 3c latency and 2c throughput, according to this instatx64 report.

Categories

assembly - Using ymm registers as a "memory-like" storage location

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Reducing the cost

Limited unrolling

More nesting

AVX512: indexing vectors as memory

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags