If your address is an assemble-time constant, not link-time, this is super easy. It's just an integer, and you can split it up manually.
I asked gcc and clang to compile unsigned abs_addr() { return 0x12345678; } (Godbolt):
    // gcc8.2 -O3
    abs_addr():
        mov     w0, 0x5678              // low half
        movk    w0, 0x1234, lsl 16      // high half
        ret
(Writing w0 implicitly zero-extends into 64-bit x0, same as x86-64.)
Or if your constant is only a link-time constant and you need to generate relocations in the .o for the linker to fill in, the GAS manual documents what you can do, in the AArch64 machine-specific section:
    Relocations for ‘MOVZ’ and ‘MOVK’ instructions can be generated by prefixing the label with #:abs_g2: etc. For example to load the 48-bit absolute address of foo into x0:
        movz x0, #:abs_g2:foo     // bits 32-47, overflow check
        movk x0, #:abs_g1_nc:foo  // bits 16-31, no overflow check
        movk x0, #:abs_g0_nc:foo  // bits 0-15, no overflow check
The GAS manual's example is sub-optimal; going low to high is more efficient on at least some AArch64 CPUs (see below). For a 32-bit constant, follow the same pattern that gcc used for a numeric literal:
    movz x0, #:abs_g0_nc:foo  // bits 0-15, no overflow check
    movk x0, #:abs_g1:foo     // bits 16-31, overflow check
#:abs_g1:foo is known to have its possibly-set bits in the 16-31 range, so the assembler knows to use an lsl 16 when encoding movk. You should not use an explicit lsl 16 here.
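Contrast that with a plain numeric literal, where the shift does have to be written out, as in the gcc output above:

    movz w0, #0x5678            // bits 0-15, shift of 0
    movk w0, #0x1234, lsl #16   // explicit shift required for a numeric immediate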
I chose x0 instead of w0 because that's what gcc does for unsigned long long. Performance is probably identical on all CPUs, and code size is identical.
    .text
    func:
        // efficient
        movz x0, #:abs_g0_nc:foo   // bits 0-15, no overflow check
        movk x0, #:abs_g1:foo      // bits 16-31, overflow check

        // inefficient but does assemble + link
        // movz x1, #:abs_g1:foo     // bits 16-31, overflow check
        // movk x1, #:abs_g0_nc:foo  // bits 0-15, no overflow check

    .data
    foo: .word 123   // .data will be in a different page than .text
With GCC: aarch64-linux-gnu-gcc -nostdlib aarch-reloc.s to build and link (just to prove we can; it will crash if you actually run it), and then aarch64-linux-gnu-objdump -drwC a.out:
    a.out:     file format elf64-littleaarch64

    Disassembly of section .text:

    000000000040010c <func>:
      40010c:   d2802280   mov   x0, #0x114            // #276
      400110:   f2a00820   movk  x0, #0x41, lsl #16
Clang appears to have a bug here, making it unusable: it only assembles #:abs_g1_nc:foo (no check for the high half) and #:abs_g0:foo (overflow check for the low half). This is backwards, and results in a linker error (g0 overflow) when foo has a 32-bit address. I'm using clang version 7.0.1 on x86-64 Arch Linux.
    $ clang -target aarch64 -c aarch-reloc.s
    aarch-reloc.s:5:15: error: immediate must be an integer in range [0, 65535].
        movz x0, #:abs_g0_nc:foo
                 ^
As a workaround, g1_nc instead of g1 is fine: you can live without overflow checks. But you need g0_nc, unless you have a linker where checking can be disabled. (Or maybe some clang installs come with a linker that's bug-compatible with the relocations clang emits?) I was testing with GNU ld (GNU Binutils) 2.31.1 and GNU gold (GNU Binutils 2.31.1) 1.16.
    $ aarch64-linux-gnu-ld.bfd aarch-reloc.o
    aarch64-linux-gnu-ld.bfd: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
    aarch64-linux-gnu-ld.bfd: aarch-reloc.o: in function `func':
    (.text+0x0): relocation truncated to fit: R_AARCH64_MOVW_UABS_G0 against `.data'

    $ aarch64-linux-gnu-ld.gold aarch-reloc.o
    aarch-reloc.o(.text+0x0): error: relocation overflow in R_AARCH64_MOVW_UABS_G0
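For reference, this is the only ordering clang 7.0.1 will assemble (a sketch of the buggy behaviour described above, not a recommendation). The overflow check landing on the low half is exactly what produces the R_AARCH64_MOVW_UABS_G0 errors shown above once foo has an address wider than 16 bits:

    movz x0, #:abs_g0:foo      // bits 0-15, WITH overflow check: fails to link
                               // unless foo's address fits in 16 bits
    movk x0, #:abs_g1_nc:foo   // bits 16-31, no overflow check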
MOVZ vs. MOVK vs. MOVN
movz = move-zero puts a 16-bit immediate into a register with a left-shift of 0, 16, 32 or 48 (and clears the rest of the bits). You always want to start a sequence like this with a movz, and then movk the rest of the bits. (movk = move-keep: move a 16-bit immediate into a register, keeping the other bits unchanged.)
mov is sort of a pseudo-instruction that can pick movz, but I just tested with GNU binutils and clang, and you need an explicit movz (not mov) with an immediate like #:abs_g0:foo. Apparently the assembler won't infer that it needs movz there, unlike with a numeric literal.
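A minimal sketch of that behaviour (assuming GNU as, consistent with the test just described):

    mov  w0, #0x5678           // numeric literal: the assembler picks movz itself
    movz x0, #:abs_g0_nc:foo   // relocation prefix: movz must be explicit
    // mov x0, #:abs_g0_nc:foo // rejected: mov won't infer movz here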
For a narrow immediate, e.g. 0xFF000, which has non-zero bits in two aligned 16-bit chunks of the value, mov w0, #0xFF000 would pick the bitmask-immediate form of mov, which is actually an alias for ORR-immediate with the zero register. AArch64 bitmask-immediates use a powerful encoding scheme for repeated patterns of bit-ranges. (So e.g. and x0, x1, 0x5555555555555555 (keep only the even bits) can be encoded in a single 32-bit-wide instruction, great for bit-hacks.)
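For example (both of these encode as single instructions; how a disassembler prints the alias may vary):

    mov w0, #0xFF000                  // bitmask immediate: alias of orr w0, wzr, #0xFF000
    and x0, x1, #0x5555555555555555   // repeating 01 bit-pattern, one instruction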
There's also movn (move not), which flips the bits. This is useful for negative values, allowing you to have all the upper bits set to 1. There's even a relocation for it, according to AArch64 relocation prefixes.
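A quick sketch (the mov alias picks movn for these):

    mov  x0, #-2          // alias of movn x0, #1: x0 = ~1 = 0xFFFFFFFFFFFFFFFE
    movn x0, #0x1233      // x0 = ~0x1233 = -0x1234, all upper bits set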
Performance: movz low16; movk high16, in that order
The Cortex-A57 optimization manual, section 4.14 Fast literal generation, says:
    Cortex-A57 r1p0 and later revisions support optimized literal generation for 32- and 64-bit code

        MOV wX, #bottom_16_bits
        MOVK wX, #top_16_bits, lsl #16

    [and other examples]

    ... If any of these sequences appear sequentially and in the described order in program code, the two instructions can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.
The sequences include movz low16 + movk high16, into x or w registers, in that order. (And also back-to-back movk to set the high 32 bits, again in low, high order.) According to the manual, both instructions have to use w registers, or both have to use x registers.
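So a fusion-friendly 64-bit literal looks like this (an illustrative constant of my own, not one from the manual):

    movz x0, #0x5678                // bits 0-15
    movk x0, #0x1234, lsl #16       // bits 16-31: can pair with the movz
    movk x0, #0xDEF0, lsl #32       // bits 32-47
    movk x0, #0x9ABC, lsl #48       // bits 48-63: can pair with the previous movk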
Without special support, the movk would have to wait for the movz result to be ready as an input for an ALU operation to replace that 16-bit chunk. Presumably at some point in the pipeline, the two instructions merge into a single 32-bit-immediate movz or movk, removing the dependency chain.