Add AoCO 2025 Day 05 Study Notes #35
Merged
299 changes: 299 additions & 0 deletions
_posts/2026-02-28-Advent-of-Compiler-Optimisations-Study-Notes-05.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,299 @@ | ||
| --- | ||
| layout: default | ||
| title: "Study Notes: ARM's barrel shifter tricks, Advent of Compiler Optimisations 2025" | ||
| date: 2026-02-28 | ||
| tag: compiler | ||
| --- | ||
|
|
||
| ## Study Notes: ARM's barrel shifter tricks, Advent of Compiler Optimisations 2025 | ||
|
|
||
| These notes are based on the post [**ARM's barrel shifter tricks**](https://xania.org/202512/05-barrel-shifting-with-arm) and the YouTube video [**[AoCO 5/25] Multiplying with a Constant**](https://www.youtube.com/watch?v=TZubUyr2UEY&list=PL2HVqYf7If8cY4wLk7JUQ2f0JXY_xMQm2&index=6) which are Day 5 of the [Advent of Compiler Optimisations 2025](https://xania.org/AoCO2025-archive) Series by [Matt Godbolt](https://xania.org/MattGodbolt). | ||
|
|
||
| My notes focus on reproducing and verifying [Matt Godbolt](https://xania.org/MattGodbolt)'s teaching in a local development environment using the `LLVM` toolchain on `Ubuntu`. | ||
|
|
||
| Selected technical insights from the YouTube comment section are reproduced at the end of these notes to provide additional context. | ||
|
|
||
| Written by me and assisted by AI, proofread by me and assisted by AI. | ||
|
|
||
| #### Development Environment | ||
| ``` | ||
| $ lsb_release -d | ||
| Description: Ubuntu 24.04.3 LTS | ||
|
|
||
| $ clang -v | ||
| Ubuntu clang version 18.1.8 | ||
|
|
||
| $ llvm-objdump -v | ||
| Ubuntu LLVM version 18.1.8 | ||
|
|
||
| $ nvim --version | ||
| NVIM v0.11.5 | ||
|
|
||
| $ echo $SHELL | ||
| /usr/bin/fish | ||
| ``` | ||
|
|
||
| ## Introduction | ||
|
|
||
| Following the Day 05 material, I compiled constant multiplications from `x * 2` through `x * 20` for the AArch64 target. | ||
|
|
||
| After examining the assembly output, I identified six distinct code-generation strategies, which are walked through case by case below. | ||
|
|
||
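Editing `mul.c` once per constant works, but all nineteen cases can also be generated in a single translation unit. The sketch below is only a convenience, not the workflow used in these notes; the `DEFINE_MUL` macro and the `mul_N` function names are hypothetical names introduced for illustration.

```c
/* Hypothetical helper: DEFINE_MUL(n) expands to a function mul_n that
   multiplies its argument by the constant n. One compile then yields one
   symbol per constant in the disassembly. */
#define DEFINE_MUL(n) int mul_##n(int x) { return x * n; }

DEFINE_MUL(2)  DEFINE_MUL(3)  DEFINE_MUL(4)  DEFINE_MUL(5)
DEFINE_MUL(6)  DEFINE_MUL(7)  DEFINE_MUL(8)  DEFINE_MUL(9)
DEFINE_MUL(10) DEFINE_MUL(11) DEFINE_MUL(12) DEFINE_MUL(13)
DEFINE_MUL(14) DEFINE_MUL(15) DEFINE_MUL(16) DEFINE_MUL(17)
DEFINE_MUL(18) DEFINE_MUL(19) DEFINE_MUL(20)
```

Compiling this file with the same `clang -O2 -target aarch64-linux-gnu -c` command and disassembling it with `llvm-objdump -d` should list each `mul_N` function separately, which makes the strategies easier to compare.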
| ## Case 01 : `x * 2` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 2; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 531f7800 lsl w0, w0, #1 | ||
| 4: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instruction: `lsl <Rd>, <Rn>, #<shift>` | ||
|
|
||
| The compiler utilizes a Logical Shift Left (`lsl`) to perform multiplication by powers of two. | ||
| Here, `w0` is the destination (`Rd`), the original `w0` is the source (`Rn`), and `#1` is the immediate shift value. | ||
| Shifting a register left by 1 bit is equivalent to multiplying by 2. | ||
|
|
||
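The equivalence can be modelled in C as a minimal sketch. The function name `mul2_lsl` is my own, and the model uses `uint32_t` on purpose: left-shifting a negative `int` is undefined behaviour in C, whereas the `lsl` instruction simply discards the bits shifted out of the 32-bit register.

```c
#include <stdint.h>

/* Models `lsl w0, w0, #1`: a 32-bit left shift by one bit.
   Bits shifted past bit 31 are dropped, so the result is 2x modulo 2^32. */
uint32_t mul2_lsl(uint32_t w0) {
    return w0 << 1;
}
```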
| ## Case 02 : `x * 3` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 3; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 0b000400 add w0, w0, w0, lsl #1 | ||
| 4: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instruction: `add <Rd>, <Rn>, <Rm>, lsl #<shift>` | ||
|
|
||
| AArch64 supports shifted-register operands within arithmetic instructions. | ||
| This `add` instruction shifts the second source register (`Rm`) left by 1 bit before the addition. | ||
| The operation computes `w0 = w0 + (w0 << 1) = w0 + 2 * w0 = 3 * w0`, which matches `return x * 3;` in the C source. | ||
|
|
||
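A minimal C model of the shifted-operand `add` is sketched below. The function name `mul3_add_lsl` is a placeholder of mine, and unsigned arithmetic is assumed so that wraparound matches the 32-bit register behaviour.

```c
#include <stdint.h>

/* Models `add w0, w0, w0, lsl #1`: the second source operand is shifted
   left by one bit and added to the first, all in a single instruction. */
uint32_t mul3_add_lsl(uint32_t w0) {
    return w0 + (w0 << 1);  /* x + 2x = 3x (modulo 2^32) */
}
```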
| ## Case 03 : `x * 6` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 6; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 0b000408 add w8, w0, w0, lsl #1 | ||
| 4: 531f7900 lsl w0, w8, #1 | ||
| 8: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instructions: | ||
| - `add <Rd>, <Rn>, <Rm>, lsl #<shift>` | ||
| - `lsl <Rd>, <Rn>, #<shift>` | ||
|
|
||
| The multiplication by 6 is decomposed into two stages. | ||
| First, the compiler calculates `w8 = w0 + (w0 << 1) = w0 + 2 * w0 = 3 * w0`. | ||
| Second, it calculates `w0 = (w8 << 1) = 2 * w8 = 2 * (3 * w0) = 6 * w0`. | ||
|
|
||
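The two stages map directly onto two C statements. This is a sketch under the same assumptions as before (unsigned 32-bit arithmetic, a function name of my own choosing), with the local variable named after the scratch register in the disassembly.

```c
#include <stdint.h>

/* Models the two-instruction sequence for x * 6. */
uint32_t mul6_add_then_lsl(uint32_t w0) {
    uint32_t w8 = w0 + (w0 << 1);  /* add w8, w0, w0, lsl #1  -> 3x */
    return w8 << 1;                /* lsl w0, w8, #1          -> 6x */
}
```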
| ## Case 04 : `x * 7` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 7; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 531d7008 lsl w8, w0, #3 | ||
| 4: 4b000100 sub w0, w8, w0 | ||
| 8: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instructions: | ||
| - `lsl <Rd>, <Rn>, #<shift>` | ||
| - `sub <Rd>, <Rn>, <Rm>` | ||
|
|
||
| The compiler implements a shift-and-subtract strategy for constants near powers of two. | ||
| To compute `7x`, it first executes `w8 = w0 << 3 = 8 * w0`. | ||
| It then performs `w0 = w8 - w0 = 8 * w0 - w0 = 7 * w0`. | ||
|
|
||
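The shift-and-subtract pair can be modelled the same way. The function name `mul7_lsl_sub` is hypothetical, and the arithmetic is unsigned so that the intermediate `8x` may wrap exactly as it does in a 32-bit register.

```c
#include <stdint.h>

/* Models the two-instruction sequence for x * 7. */
uint32_t mul7_lsl_sub(uint32_t w0) {
    uint32_t w8 = w0 << 3;  /* lsl w8, w0, #3  -> 8x */
    return w8 - w0;         /* sub w0, w8, w0  -> 7x */
}
```

Even when `8x` overflows, the subtraction still yields `7x` modulo 2^32, which is why the compiler can apply this rewrite unconditionally.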
| ## Case 05 : `x * 11` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 11; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 52800168 mov w8, #0xb // =11 | ||
| 4: 1b087c00 mul w0, w0, w8 | ||
| 8: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instructions: | ||
| - `mov <Rd>, <Imm>` | ||
| - `mul <Rd>, <Rn>, <Rm>` | ||
|
|
||
| Here the compiler emits `mov` followed by `mul` instead of a shift-and-add sequence. | ||
| Two-instruction decompositions of `11` do exist, for example `5x = x + (x << 2)` followed by `11x = x + (5x << 1)`, | ||
| so the choice presumably reflects the backend's cost model for the default target rather than an impossibility. | ||
|
|
||
| A more obvious shift-and-subtract decomposition takes three instructions: | ||
| ``` | ||
| add w8, w0, w0, lsl #1 // w8 = x + 2x = 3x | ||
| lsl w8, w8, #2 // w8 = w8 << 2 = 3x << 2 = 3x * 4 = 12x | ||
| sub w0, w8, w0 // w0 = 12x - x = 11x | ||
| ``` | ||
| That sequence is three instructions long, whereas `mov` followed by `mul` produces the same result in two. | ||
|
|
||
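As a cross-check of the arithmetic, the sketch below models the three-instruction sequence shown above and, for comparison, a two-step shifted-add form (`5x`, then `x + 10x`). The second form is my own illustration and is not something the compiler emitted in these tests; both function names are placeholders.

```c
#include <stdint.h>

/* Three-instruction variant: 3x, then 12x, then 12x - x. */
uint32_t mul11_three_step(uint32_t w0) {
    uint32_t w8 = w0 + (w0 << 1);  /* add w8, w0, w0, lsl #1  -> 3x  */
    w8 = w8 << 2;                  /* lsl w8, w8, #2          -> 12x */
    return w8 - w0;                /* sub w0, w8, w0          -> 11x */
}

/* Two-instruction alternative (illustration only, not compiler output). */
uint32_t mul11_two_step(uint32_t w0) {
    uint32_t w8 = w0 + (w0 << 2);  /* add w8, w0, w0, lsl #2  -> 5x  */
    return w0 + (w8 << 1);         /* add w0, w0, w8, lsl #1  -> 11x */
}
```

In the two-step form the second `add` depends on the result of the first and also uses a shifted operand, so it is not necessarily faster than `mov` plus `mul` on a core with a quick multiplier.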
| ## Case 06 : `x * 14` | ||
|
|
||
| ``` | ||
| $ nvim mul.c | ||
| ``` | ||
|
|
||
| ``` | ||
| int mul(int x) { | ||
| return x * 14; | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| $ rm -f (path filter *.o); clang -O2 -target aarch64-linux-gnu -c mul.c; llvm-objdump -d mul.o | ||
| ``` | ||
|
|
||
| ``` | ||
| mul.o: file format elf64-littleaarch64 | ||
|
|
||
| Disassembly of section .text: | ||
|
|
||
| 0000000000000000 <mul>: | ||
| 0: 531c6c08 lsl w8, w0, #4 | ||
| 4: 4b000500 sub w0, w8, w0, lsl #1 | ||
| 8: d65f03c0 ret | ||
| ``` | ||
|
|
||
| ARM Instructions: | ||
| - `lsl <Rd>, <Rn>, #<shift>` | ||
| - `sub <Rd>, <Rn>, <Rm>, lsl #<shift>` | ||
|
|
||
| The computation of `14x` demonstrates the flexibility of the `sub` instruction with shifted operands. | ||
| The compiler first calculates `w8 = w0 << 4 = 16 * w0`. | ||
| Subsequently, it performs `w0 = w8 - (w0 << 1) = w8 - w0 * 2 = 16 * w0 - 2 * w0 = 14 * w0`. | ||
|
|
||
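The same sequence expressed as a C model, again a sketch that assumes unsigned 32-bit arithmetic and uses a function name of my own:

```c
#include <stdint.h>

/* Models the two-instruction sequence for x * 14. */
uint32_t mul14_lsl_sub_lsl(uint32_t w0) {
    uint32_t w8 = w0 << 4;   /* lsl w8, w0, #4          -> 16x */
    return w8 - (w0 << 1);   /* sub w0, w8, w0, lsl #1  -> 16x - 2x = 14x */
}
```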
| ## YouTube Comment Insights | ||
|
|
||
| Since YouTube does not currently support generating direct permanent links to individual comments, | ||
| I have reproduced the relevant technical insight below in its entirety to ensure both accuracy and proper attribution. | ||
|
|
||
| ``` | ||
| @kruador | ||
| @ciberman Yes, it means 'ARMv8'. That's not quite right because ARM Ltd enhanced the 32-bit instruction set (which they now call A32) | ||
| as well as adding the 64-bit instruction set (A64) in version 8. | ||
|
|
||
| They also refer to 'AArch32' and 'AArch64' for extra confusion. I think 'AArch32' means 'the architectural state of a 32-bit ARM processor' | ||
| because you can use the alternative "Thumb" instruction set (which ARM Ltd renamed to T32 with ARMv8, in their documentation at least) instead of A32. | ||
| The embedded ARM Cortex-M only support T32, not A32. | ||
|
|
||
| There is no equivalent of Thumb for AArch64 (no 'T64'), at least not as of yet (probably not ever), so 'AArch64' and 'A64' are virtually interchangeable. | ||
| And most people just say 'arm64' because 'AArch64' is unpronounceable while 'A64' is too ambiguous. | ||
|
|
||
| @tlhIngan | ||
| ARMv8 was designed to be more streamlined for modern superscalar architectures so it jettisoned a lot of ARM stuff that was responsible | ||
| for causing pipeline stalls and dependencies in favor of simpler instructions that can run faster. | ||
| When AArch64 was being introduced I remember seeing the ARM presentations on why the instruction set dumped a lot of it. | ||
| It's why an ARMv8 core only beats an ARMv7 core by about 10% in AArch32 mode but running the same code in AArch64 mode you can achieve a 50+% speedup. | ||
| Losing RSB for a two instruction LSB/SUB combination was deemed far superior in simplifying ALU operations. | ||
|
|
||
| @kruador | ||
| I think RSB was only really useful for this kind of operation. If you're not doing a shift on one of the operands, you can just swap which register is which. | ||
| But the 32-bit ARM architecture only supports the shift on operand 2, | ||
| so you have to have an instruction that does say Rdest := operand2 - Rn instead of Rdest := Rm - operand2. | ||
|
|
||
| ARM1 didn't even have a multiply instruction. Adds, shifts and subtracts were the only options out there. | ||
| No room for a multiplier in only 25,000 transistors! So RSB was really helpful there. However, these days there an abundance of transistors available: | ||
| even the lowly ARM Cortex-M0 (a 32-bit ARMv6 architecture core that only supports the Thumb instruction set, and not all of that) can be configured with a single-cycle multiplier. | ||
|
|
||
| The main issue wasn't simplifying the ALU operations, I don't think, but simply releasing bits to be able to encode more different operations and more registers. | ||
| AArch64 needs three bits more per instruction for register mapping - one for the destination register and one for each source - because it has twice as many registers as AArch32 (32 vs 16). | ||
| ``` | ||
|
|
||
| ## References | ||
| 1. https://developer.arm.com/documentation/dui0473/m/overview-of-the-arm-architecture/access-to-the-inline-barrel-shifter | ||
| 2. https://www.davespace.co.uk/arm/introduction-to-arm/barrel-shifter.html | ||
| 3. https://www.d.umn.edu/~gshute/logic/barrel-shifter.html | ||
| 4. https://community.element14.com/technologies/fpga-group/b/blog/posts/systemverilog-study-notes-barrel-shifter-rtl-combinational-circuit | ||
For better clarity and to directly relate back to the C code
return x * 3;, you could show the final result of the computation in the explanation.