armlint examines AArch64 machine code to find suboptimal instruction
sequences. For example, building the constant 0x66666666 as
movz w0, #0x6666
movk w0, #0x6666, lsl #16
is two instructions where one would do, because 0x66666666 is encodable
as an AArch64 logical (bitmask) immediate:
mov w0, #0x66666666 ; orr w0, wzr, #0x66666666
armlint helps compiler writers and assembly authors generate tighter code, and documents corners of the A64 instruction set.
- suboptimal MOVZ/MOVK sequence
movz w0, #0x6666 ; movk w0, #0x6666, lsl #16instead ofmov w0, #0x66666666(single bitmask-immediate ORR)
- LSL foldable into shifted-register form
lsl w0, w1, #3 ; add w0, w2, w0instead ofadd w0, w2, w1, lsl #3. Same for SUB, AND, ORR, EOR (and flag-setting variants). Conservative: v1 only flags when the consumer overwrites the LSL destination, guaranteeing the LSL result is dead.
- compare-zero branch foldable into CBZ/CBNZ
cmp w0, #0 ; b.eq targetinstead ofcbz w0, target. Same forb.ne->cbnz. Also matches the equivalent zero-test idiomscmp Rn, xzr(SUBS XZR, Rn, XZR) andtst Rn, Rn(ANDS XZR, Rn, Rn): all three setZ=1iffRn==0and so fold identically.- Soundness:
CBZ/CBNZdoes not write NZCV, butCMP Rn, #0writes all four flags. Folding is unsound if subsequent code reads N, C, or V (e.g.ADCS,CSEL,B.LT,CCMP). armlint runs a forward NZCV-liveness scan on the fall-through path: the finding is emitted only after seeing an instruction that overwrites NZCV without reading them (ADDS/SUBS/ANDS/BICS/FCMP) or a terminator that makes prior flags unobservable (RET, BL, BLR). The scan suppresses on any flag-reader, an unsafe terminator (B unconditional, BR), or after a 16-instruction window with no decision. The branch-target path is not scanned; full soundness would require basic-block analysis.
- compare-zero signed-branch foldable into TBZ/TBNZ
cmp wn, #0 ; b.lt target(orb.ge/b.mi/b.pl) folds totbnz wn, #(datasize-1), target(ortbz). AfterCMP Rn, #0/CMP Rn, ZR/TST Rn, Rn,V == 0, soB.LT(N != V) reduces to "N == 1" -- exactly a test of the sign bit.B.MIdirectly testsN;B.GE/B.PLare the inverse.- Range:
TBZ/TBNZuse a 14-bit signed offset (~32 KB reach), vs.B.cond's 19-bit (~1 MB). The fold is suggested only when the target fits in the TBZ encoding. - Soundness: same NZCV-liveness scan as the CMP-branch check above. The rewrite drops the CMP/TST, so downstream code that observes N/C/V before they're overwritten would see different values; the scan suppresses on any flag-reader. Shares the existing CMP/TST pending slot, which is sufficient because the sign-only and EQ/NE conditions are mutually exclusive at the same B.cond.
- TST single-bit + B.EQ/NE foldable into TBZ/TBNZ
tst w0, #(1<<5) ; b.eq targetinstead oftbz w0, #5, target. Same forb.ne->tbnz. Only the immediate-formTSTis matched (ANDS XZR, Rn, #imm) and only when the immediate is a single power-of-two bit.- Range:
TBZ/TBNZuse a 14-bit signed offset (~32 KB reach), much shorter thanB.cond's 19-bit (~1 MB). The fold is suggested only when the target fits in the TBZ encoding. - Soundness: same NZCV-liveness scan as the CMP-branch check.
- bitfield op via two shifts foldable into UBFX/SBFX or UBFIZ/SBFIZ
lsl wd, ws, #a ; lsr wd, wd, #bfolds depending on the relationship betweenaandb:b >= a: extraction.ubfx wd, ws, #(b-a), #(datasize-b). Withasrit folds tosbfx(sign-extending).b < a: insertion.ubfiz wd, ws, #(a-b), #(datasize-a)-- placesws[datasize-a-1 .. 0]atwd[datasize-b-1 .. a-b]with bits belowa-bzeroed. Withasrit folds tosbfiz(sign-extending the high bits fromws[datasize-a-1]).- Same for X-form.
- Compilers occasionally leave the two-shift form when bit-tracking can't prove the LSL invariant; armlint flags such pairs as a missed combine.
- v1 requires the consumer's
RdandRnto equal the LSL'sRdso the LSL result is dead after the rewrite.
- shift-and-mask bitfield extraction foldable into UBFX
lsr wd, ws, #n ; and wd, wd, #((1<<w)-1)extracts bitsws[n+w-1 .. n]; equivalent toubfx wd, ws, #n, #w(cappingwatdatasize-nwhen the mask is wider than the LSR-fillable bits). Same for X-form.- Mask must be a contiguous run of low bits with no rotation
(
immr=0and(N, imms)encodingS+1=wones at the appropriate element size); rotated/non-contiguous masks like#0x6are correctly skipped.
- redundant zero-extension after a producer that already zeroed those bits
- Generalises the previous "redundant UXTW after W-form ALU" rule
to size-aware producer/consumer pairs. The check tracks the
threshold
Pat which the producer guaranteesRt[63:P] == 0: 32 for W-form data-processing orLDR Wt/LDRSB Wt/LDRSH Wt, 16 forLDRH Wt, 8 forLDRB Wt. A consumer that clears bits aboveCis redundant whenP <= C. - Recognised consumers:
UXTB / UXTH / UXTW(W- or X-form UBFM aliases withimmr=0,imms ∈ {7,15,31},N=sf), AND-imm with mask0xff/0xffff/0xffffffff(both forms in W and X register variants), andMOV Wd, Wd(ORR Wd, WZR, WdwithRm = Rd; the W-form register MOV writes back through the W register and so clears X[63:32], givingC = 32). - Example flags:
add w0,w1,w2 ; uxtw x0,w0,ldrh w0,[x1] ; uxth w0,w0,ldrb w8,[x9] ; and w8,w8,#0xff. Counter-example (P > C, not flagged):ldr w0,[x1] ; uxth w0,w0-- LDR W loads 32 valid bits, so UXTH would actually clear bits 31..16. - v1 still requires
Rd == Rn == producer.Rdso the consumer is purely dead. Producer set covers all W-form DP classes (ADD/SUBimmediate/shifted/extended, logical immediate,MOVZ/MOVN/MOVK, bitfieldSBFM/BFM/UBFM,EXTR, logical shifted register,ADC/SBC, conditional select, DP-3/2/1-source) plus integer loads withWtdestination in any addressing mode.
- Generalises the previous "redundant UXTW after W-form ALU" rule
to size-aware producer/consumer pairs. The check tracks the
threshold
MOV Xd, Xdis a literal no-op- The X-form register MOV alias (
ORR Xd, XZR, Xm, LSL #0) withRm = RdreadsXdand writes the same 64 bits back; the instruction has no architectural effect and can be removed. Modern optimisers almost never emit this, but it shows up occasionally in hand-written assembly and in legacy object code. - The W-form
MOV Wd, Wdis NOT a no-op: writing throughWdclearsX[63:32]. It is handled instead as a consumer of the redundant-zero-extension check above, where it fires only when a preceding producer already zeroed those bits.
- The X-form register MOV alias (
- redundant sign-extension after a producer that already replicated the sign
- Mirror of the zero-extension framework above. The check tracks two
thresholds
(S, W): the producer guaranteesRd[W-1:S] = sign(Rd[S-1]). A consumerSXTB / SXTH / SXTWwith thresholds(S_c, W_c)is redundant iffS_p <= S_cANDW_p == W_cANDRd == Rn == producer.Rd. - Recognised producers:
SXTB / SXTH / SXTW(SBFM aliases withimmr = 0,imms in {7, 15, 31}); the sign-extending integer loadsLDRSB / LDRSH / LDRSWin any addressing mode; andASR Rd, Rn, #k(SBFM withimms = datasize-1,immr = k > 0), which replicatesRn[datasize-1]through bits[datasize-k, datasize)ofRd, soS = datasize-k. (S, W) maps for the canonical SXT* pairs:LDRSB Wt/SXTB Wd,Wn-> (8, 32);LDRSH Wt/SXTH Wd,Wn-> (16, 32);LDRSB Xt/SXTB Xd,Wn-> (8, 64);LDRSH Xt/SXTH Xd,Wn-> (16, 64);LDRSW Xt/SXTW Xd,Wn-> (32, 64). Example flagged ASR pair:asr w0, w1, #24 ; sxtb w0, w0(S_p=8 = S_c=8);asr x0, x1, #48 ; sxth x0, w0. W_p == W_c(not<=) because a W-form consumer writes back throughWdand zerosX[63:32], which differs from an X-form producer's sign-extended upper half. Example flagged:ldrsb w0, [x1] ; sxtb w0, w0;ldrsh x0, [x1] ; sxth x0, w0;ldrsb w0, [x1] ; sxth w0, w0(S_p=8 subsumes S_c=16 within W=32). Counter-example:ldrsb w0, [x1] ; sxtb x0, w0-- producer leftX[63:32] = 0, consumer would set those bits to sign of byte; not redundant.- Same producer state also feeds a "dead sign-extension" path: if
the next instruction is a zero-ext consumer (
UXTB/UXTH/UXTW,ANDwith low-mask, orMOV Wd, Wd) that clears bits>= C_cwithC_c <= S_p, the consumer overwrites every sign-extended bit and the sign-extension is dead. No width-matching constraint is needed -- when widths mismatch, the W-form auto-zero ofX[63:32]covers the upper half. Example flagged:sxtb w0, w1 ; uxtb w0, w0(replace withuxtb w0, w1);ldrsb w0, [x1] ; uxtb w0, w0(replace withldrb w0, [x1]);ldrsw x0, [x1] ; uxtw x0, w0(replace withldr w0, [x1]).
- Mirror of the zero-extension framework above. The check tracks two
thresholds
- self-op identities (
AND/ORR/EOR/SUB/BIC/ORN/EON Rd, Rs, Rs)AND Rd, Rs, RsandORR Rd, Rs, Rscollapse toMOV Rd, Rs(identity).EOR Rd, Rs, Rs,SUB Rd, Rs, Rs, andBIC Rd, Rs, Rs(=Rs AND NOT Rs) collapse toMOV Rd, XZR(zero).ORN Rd, Rs, Rs(=Rs OR NOT Rs) andEON Rd, Rs, Rs(=Rs XOR NOT Rs) collapse toMOV Rd, #-1/MOVN Rd, #0(all-ones). Both W- and X-form.- The flag-setting variants
ANDS Rd, Rs, Rs,SUBS Rd, Rs, Rs, andBICS Rd, Rs, Rsare deliberately NOT flagged: writingRdwhile setting flags is the user's intent (combined zero-test + register copy or register zero). Rd = 31(result discarded) andRn = 31(ZRsource, not a real self-op) are excluded.- On uarches with move elimination,
MOV Rd, Rsis zero-cycle whileAND/ORR Rd, Rs, Rsgoes through the ALU.EOR Rd, Rs, Rsis the canonical x86 zero idiom that occasionally bleeds into AArch64 toolchain output; the canonical AArch64 form isMOV Rd, XZR(eight occurrences observed in dyld at the time of writing).
- adjacent LDR/STR foldable into LDP/STP
- Two unsigned-offset
LDR Wt, [Rn, #imm12*4](or X-form, scale 8) to consecutive scaled offsets fold into a singleLDP Wt1, Wt2, [Rn, #imm7*4]. Analogous for stores ->STP. Both W- and X-form supported; load+load and store+store only (no mixing). - v1 supports only the unsigned-offset form. The scaled imm12
guarantees natural alignment to the access size, which the LDP
encoding requires.
LDUR(unscaled) and pre-/post-indexed forms are deferred: their byte offsets aren't constrained to be a multiple of the access size, andLDP/STPon unaligned addresses has implementation-defined behaviour on AArch64 (some cores fault even when singleLDR/STRworks underSCTLR_EL1.A = 0). - Constraints checked: same base register
Rn; same access size (both W or both X); same direction (load/load or store/store); consecutive offsets (imm12_2 = imm12_1 + 1in scaled units);Rt1 != Rt2(LDP/STP requires distinct destinations); for loads,Rt1 != Rn(else the first load clobbers the base before the second load reads it). The first imm12 must also fit LDP's signed-7-bit imm7 (i.e., be at most 63 for non-negative unsigned-offset sources). - Four consecutive LDR/STRs fold into TWO non-overlapping LDP/STPs (after firing, the state resets so the second LDR isn't also used as the first of a new pair).
- Atomicity caveat: a single LDP is NOT atomic across its two halves (AArch64 doesn't guarantee single-copy atomicity for pairs), but neither are two separate LDRs. So the rewrite doesn't change ordering or atomicity guarantees -- acquire / release variants use different opcodes.
- Two unsigned-offset
- BFXIL synthesis via AND-AND-ORR
AND Rd, Rd, #~mask ; AND Rt, Rs, #mask ; ORR Rd, Rd, Rt(withmask = (1<<w)-1, the two ANDs in either order, and the ORR's second-and-third operands in either order) is equivalent to a singleBFXIL Rd, Rs, #0, #w. Both W- and X-form. The check detects the 3-instruction window with strict adjacency.- The high-mask
AND Rd, Rd, #~maskis identified by the bitmask-immediate encodingimmr = imms + 1(S =datasize-w-1, R =datasize-w, esize = datasize). The low-mask consumer reuses the existingdecode_and_imm_lowmaskdecoder. The ORR is logical-shifted-register with LSL #0. - Aliasing constraints needed for the BFXIL rewrite to be
semantically equivalent:
Rt != clear.Rd(else the isolate clobbers the cleared register in place),Rt != Rs(else the isolate modifies the source -- BFXIL leaves the source unchanged), andRs != clear.Rd(the degenerate case where Rs is the just-cleared register yields the wrong result -- the original sequence zeros Rd's low bits, but BFXIL Rd, Rd, ... is a no-op). - Modern compilers fuse this pattern themselves; 0 hits in the system-binary sample. Useful for hand-written assembly and legacy object code.
- CSEL same-operand identity (
CSEL Rd, Rn, Rn, cond)- When the CSEL's
Rn == Rm, both branches produceRn, so the cond is irrelevant and the instruction is equivalent toMOV Rd, Rn. The CSEL also reads NZCV for no reason. Both W- and X-form. - Only
CSEL(op2 = 00) is flagged. The other members of the conditional-select family --CSINC,CSINV,CSNEG-- have different "else" branches (Rn+1, ~Rn, -Rn) and are NOT identities whenRn == Rm. The decoder enforces(op & 0x7FE00C00) == 0x1A800000, which fixes op2 = 00. Rd = 31(result discarded) andRn = 31(ZRsource) are excluded for consistency with the other self-op identity check.
- When the CSEL's
- ADD/SUB #0 is redundant
- The non-flag-setting
ADD Rd, Rn, #0orSUB Rd, Rn, #0is a no-op whenRd == Rnand is equivalent toMOV Rd, RnwhenRd != Rn. Modern toolchains usually emit theMOV(ORR-alias) form directly, but the explicitADD #0shows up occasionally in real code, notably as a way to set up a function argument from a callee-saved register. - The
ADDS/SUBSflag-setting variants are not flagged: writingRdand settingZ = (Rn == 0)may both be wanted. The SP encoding (Rd = 31orRn = 31) is also excluded because that's the canonicalMOV (to/from SP)alias and the only way to spellMOV X0, SP/MOV SP, X0. - The
Rd == Rncase is further suppressed when immediately preceded byADR/ADRPwith the sameRd: that's a page-relative addressing pair (adrp x8, page ; add x8, x8, #pageoff) where the linker happened to resolvepageoffto 0. Removing theADDrequires re-linking, not an assembler rewrite, so it's not actionable.
- The non-flag-setting
- redundant zero-CMP/TST after a flag-setting ALU
adds/subs/ands/bics/adcs/sbcs Rd, ... ; cmp Rd, #0 ; b.eq/b.ne L-- the S-variant ALU already setZ = (Rd == 0), so theCMP/TSTis recomputing the sameZ. TheB.EQ/B.NEcan read the S-variant's flags directly; theCMP/TSTis dead.- v1 requires the full three-instruction window: S-variant
immediately followed by
CMP Rd, #0/CMP Rd, ZR/TST Rd, Rd, immediately followed byB.EQ/B.NE. The same forward NZCV-liveness scan as the CMP+B.cond check confirms that downstream code does not observe N/C/V (which the S-variant sets differently from theCMP). - Combines with the CMP+B.cond -> CBZ/CBNZ check above: both fire
on the matching pattern, giving the user a choice between
dropping the
CMP(and keeping theB.cond) or folding theCMP+B.condpair into aCBZ/CBNZ. Both rewrites have identical downstream behaviour.
armlint depends on Capstone and uses
pkg-config to locate it. On macOS:
brew install capstone
On Debian/Ubuntu:
apt install libcapstone-dev pkg-config
Build:
git clone https://github.com/gaul/armlint.git armlint
cd armlint
make all
armlint is intended to be part of compiler test suites which should
#include "armlint.h" and link libarmlint.a. It can also read
arbitrary AArch64 binaries (ELF, thin Mach-O, or universal/fat Mach-O):
./armlint /path/to/aarch64/binary
./armlint /bin/ls
- Arm Architecture Reference Manual - A64 instruction set
- Arm Cortex-A optimization guides - per-microarchitecture tuning notes
- Capstone disassembly framework - library to parse instructions
- x86lint - x86-64 equivalent of asmlint
Copyright (C) 2026 Andrew Gaul
Licensed under the Apache License, Version 2.0