From 6ee4277c05be89ccfc83348ad03ec76875a79782 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:23:46 +0200 Subject: [PATCH 1/5] [Skills] Add mojo-optimizations skill New skill capturing performance optimization patterns for Mojo code. Layers on top of mojo-syntax and is triggered when profiling, benchmarking, tuning latency, or porting performance-sensitive code to Mojo. Covers hot-path inlining, unsafe pointer access in inner loops, pre-allocation and lazy containers, struct layout for cache efficiency, views over owned strings, ref over var in loops, _Global lazy caches, init_pointee_move for heap fields, hash-keyed caching, comptime specialization, nibble-based SIMD byte scanning, prefiltering strategies, numeric accumulation, fast-path dispatch by input shape, and guidance on what not to optimize. --- README.md | 10 + mojo-optimizations/SKILL.md | 490 ++++++++++++++++++++++++++++++++++++ 2 files changed, 500 insertions(+) create mode 100644 mojo-optimizations/SKILL.md diff --git a/README.md b/README.md index 6775602..736afdf 100644 --- a/README.md +++ b/README.md @@ -82,6 +82,16 @@ triggered when Python types are used Mojo or a Python module needs to interact with Mojo code. Many capabilities of Mojo - Python interoperability are fairly new, and existing coding agents don't handle them correctly without guidance. +### `mojo-optimizations` + +[This skill](mojo-optimizations/SKILL.md) captures performance optimization +patterns for Mojo. It layers on top of `mojo-syntax` and is triggered when +profiling, benchmarking, tuning latency, or porting performance-sensitive code +to Mojo. It covers hot-path inlining, pre-allocation, unsafe pointer access in +hot loops, struct layout, view-vs-owned types, lazy-initialized global caches, +hash-keyed caching, `comptime` specialization, nibble-based SIMD byte scanning, +prefiltering strategies, and fast-path dispatch. + ## Examples Once these skills are installed, you can use them for many common tasks. diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md new file mode 100644 index 0000000..76b735d --- /dev/null +++ b/mojo-optimizations/SKILL.md @@ -0,0 +1,490 @@ +--- +name: mojo-optimizations +description: Performance optimization patterns for Mojo code. Use this skill in addition to mojo-syntax when writing or refactoring Mojo code that needs to be fast, when profiling shows a hot path, when the user mentions benchmarks, regressions, latency, throughput, "make it faster", or when porting performance-sensitive code (parsers, matchers, numeric loops, byte-level scanners) to Mojo. Use to overcome misconceptions about where Mojo spends cycles and which idioms compile to tight code. +--- + + + +Apply these patterns **on top of `mojo-syntax`** — they assume you already know +modern Mojo syntax. These rules are extracted from real optimization work on a +production Mojo codebase where a matcher hot path went from ~4 µs to ~1.3 µs +(3x) by composing them. They compound: isolated use gives small wins, chained +use is where the order-of-magnitude speedups come from. + +**Measure before and after every change.** Never guess. + +## The profile-driven workflow + +Before optimizing, build a benchmark harness that isolates the hot path: + +```mojo +from std.time import perf_counter_ns + +# 1. Build once, outside the timing loop. Never include setup in measurements. +var compiled = compile_once(pattern) + +# 2. Warmup — first N iterations are discarded (cache cold, JIT artifacts). 
+for _ in range(WARMUP_ITERATIONS): + _ = compiled.run(input) + +# 3. Auto-calibrate iters so each sample takes >= 1 ms (OS jitter dominates +# sub-ms samples). +var iters = initial_iters +var cal_start = perf_counter_ns() +for _ in range(iters): + _ = compiled.run(input) +if perf_counter_ns() - cal_start < 1_000_000: + iters *= (1_000_000 // (perf_counter_ns() - cal_start)) + 1 + +# 4. Collect samples until ~500 ms total, take **median** (not mean). +``` + +**Common mistakes a pretrained model will make**: including pattern compilation +inside the timing loop; reporting mean (outliers skew it); running <100 ms +total; skipping warmup. All of these were real issues fixed in the source +project (and re-introduced by naive rewrites). + +## Inlining hot-path trampolines + +Wrapper methods that just forward to an inner engine are "trampolines". If a +small input (e.g., a 16-byte `StringSlice`) flows through 4 trampoline levels +without `@always_inline`, LLVM can't fold the call chain and the fast path pays +3-4 call-frame costs per invocation. Mark **every level** of the dispatch chain +`@always_inline`: + +```mojo +# CompiledRegex -> HybridMatcher -> DFAMatcher -> DFAEngine : all @always_inline +struct DFAMatcher: + @always_inline + def is_match(self, text: ImmSlice, start: Int = 0) -> Bool: + return self.engine_ptr[].is_match(text, start) + + @always_inline + def match_first(self, text: ImmSlice, start: Int = 0) -> Optional[Match]: + return self.engine_ptr[].match_first(text, start) +``` + +- One link missing `@always_inline` breaks the fold — check the whole chain. +- `@always_inline("nodebug")` additionally strips debug info for tiny helpers + (accessors, byte reads) so they don't clutter stack traces. +- Use `@no_inline` on **cold** paths inside a hot function to keep the hot loop + small and reduce I-cache pressure (error handlers, first-time-setup paths). +- Don't blindly inline large functions — `@always_inline` on a 200-line + function bloats call sites. The rule is: inline thin forwarders and small + leaf helpers, not big functions. + +## Unsafe pointer access in inner loops + +`List`/`Span`/`StringSlice` indexing emits a bounds check on every access. For +loops that run millions of times, hoist the bounds check out and use +`unsafe_ptr()` for the actual reads/writes: + +```mojo +# WRONG — bounds check per iteration +for i in range(len(states)): + if states[i].is_active: # len check every step + ... + +# CORRECT — one bounds proof, then unchecked access +var states_ptr = states.unsafe_ptr() +var n = len(states) +for i in range(n): + if states_ptr[i].is_active: + ... +``` + +Apply the same to `StringSlice.unsafe_ptr()` for byte scans, and to +`List[T].unsafe_ptr()` inside inner DFA/state-machine loops. + +**Also audit the loop for checks that are always true for the input type**: +`uint8_val >= 0` is always true; `char_code < 256` is always true for +`UInt8`-typed inputs. Such dead conditions still cost instructions — delete +them. 
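+
+As a concrete example, the same hoisting applied to a byte scan over string
+data (a sketch; `count_spaces` is illustrative, and `ImmSlice` is the
+`StringSlice[ImmutAnyOrigin]` alias introduced in the views section below):
+
+```mojo
+def count_spaces(text: ImmSlice) -> Int:
+    var p = text.unsafe_ptr()       # hoist the pointer once
+    var n = text.byte_length()
+    var space = UInt8(ord(" "))
+    var count = 0
+    for i in range(n):
+        if p[i] == space:           # unchecked read inside the hot loop
+            count += 1
+    return count
+```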
+ +## Pre-allocate collections; lazy-allocate when zero is common + +| Situation | Pattern | +|---------------------------------------|--------------------------------------------| +| Known upper bound N | `List[T](capacity=N)` | +| Known lower bound, growable | `var xs = List[T](); xs.reserve(estimate)` | +| Zero is the common case (`findall`) | Lazy: don't allocate until first append | +| Fixed compile-time size | `InlineArray[T, N]` (stack, no heap) | +| Small hot container, size never grows | `SIMD[DType.uint8, 128]` as a dense bitset | + +```mojo +# Known bound — pre-size at construction +var elements = List[Node](capacity=len(tokens)) + +# Lazy container — defer allocation until first insert +struct LazyList[T: Copyable & Movable]: + var _data: UnsafePointer[T, MutAnyOrigin] + var _len: Int + var _capacity: Int + + def __init__(out self): + self._data = UnsafePointer[T, MutAnyOrigin]() + self._len = 0 + self._capacity = 0 + + def append(mut self, value: T): + if self._capacity == 0: + self._realloc(8) # first-use reservation + elif self._len == self._capacity: + self._realloc(self._capacity * 2) + (self._data + self._len).init_pointee_move(value) + self._len += 1 +``` + +**Pitfall**: `List` grows by doubling. A loop that appends 1000 items triggers +~10 reallocations + memcpys if you start from zero. Pre-sizing is usually a +1.2-2x win on append-heavy code. + +## Struct layout for cache efficiency + +Small, trivially-copyable structs pass in registers and avoid memcpy. Aim to +keep hot-iterated structs under a cache line (64 bytes), ideally 32 bytes. + +```mojo +struct Match(Copyable, Movable, TrivialRegisterPassable): + comptime __copy_ctor_is_trivial = True # LLVM elides the copy entirely + var group_id: Int # 8 + var start_idx: Int # 8 + var end_idx: Int # 8 + var text_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 + # Total: 32 bytes — fits in 4 registers, no stack ops on copy. +``` + +- Store `UnsafePointer[Byte]` + offsets, not full `StringSlice` fields, when + every byte of the struct matters. (A `StringSlice` is ~16 bytes — two pointer + payloads — which doubles the struct.) +- `TrivialRegisterPassable` requires all fields to be trivially copyable; + adding a `String` or `List` field silently drops the trait. +- `@fieldwise_init` synthesises the constructor without you typing field + assignments — use it on any plain data struct. + +## Pass views, not owned strings + +`String` is heap-allocated; copying or building one from a literal allocates. +Public APIs that only read text should take a `StringSlice` (view) instead. +Define a module alias so every layer agrees on the same type: + +```mojo +comptime ImmSlice = StringSlice[ImmutAnyOrigin] + +def search(pattern: ImmSlice, text: ImmSlice) raises -> Optional[Match]: + ... # callers pass literals, zero alloc +``` + +`Span[Byte]` plays the same role for raw byte views. Prefer `Span[Byte]` over +`(ptr: UnsafePointer[Byte], len: Int)` parameter pairs — it's the same two +machine words but carries provenance and can't go out of sync. + +## Avoid copies in loops: `ref` over `var` + +Iterating a container of large structs (`ASTNode`, `Match`, records) with +`var x = container[i]` **copies** on every step. 
Use `ref` to alias: + +```mojo +# WRONG — full struct copy per iteration +for i in range(len(nodes)): + var node = nodes[i] # copy + process(node) + +# CORRECT — zero-copy reference +for i in range(len(nodes)): + ref node = nodes[i] # alias + process(node) + +# Also for local bindings to nested fields: +ref matchers = matchers_ptr[] # instead of `var matchers = matchers_ptr[]` +``` + +`^` transfer: when you *do* want ownership but won't use the source again, use +`value^` to move instead of copy: `engine_ptr.init_pointee_move(engine^)`. + +## Lazy-initialized global caches + +For precomputed lookup tables, matcher dictionaries, or any expensive +build-once value, use `_Global` from `std.ffi`. It guarantees single-shot +lazy initialization behind a pointer you can mutate through: + +```mojo +from std.ffi import _Global +from std.os import abort + +comptime MatcherCache = Dict[Int, SomeMatcher] +comptime _MATCHER_CACHE = _Global["MatcherCache", _init_matcher_cache] + +def _init_matcher_cache() -> MatcherCache: + return MatcherCache() + +def _get_matcher_cache() -> UnsafePointer[MatcherCache, MutAnyOrigin]: + try: + return _MATCHER_CACHE.get_or_create_ptr() + except e: + abort[prefix="ERROR:"](String(e)) + +@always_inline +def get_matcher(key: Int) raises -> SomeMatcher: + var cache_ptr = _get_matcher_cache() + ref cache = cache_ptr[] + if key not in cache: + cache[key] = build_matcher(key) + return cache[key] +``` + +- The string key in `_Global["MatcherCache", ...]` is the **global identity** — + must be unique per cache across the whole program. +- Return `UnsafePointer` for interior mutability even when the caller is + `read`-self: this is how you cache through a trait method that demands + immutable `self`. + +## Heap-allocated fields: use `init_pointee_move`, not assignment + +`alloc[T](1)` returns **uninitialized** memory. Assigning `ptr[] = T(...)` +invokes *move-assignment* into that uninitialized storage, which runs +destructor logic on garbage — a classic flaky double-free at process exit: + +```mojo +# WRONG — move-assign into uninitialized memory, undefined behavior +self._lazy_dfa_ptr = alloc[LazyDFA](1) +self._lazy_dfa_ptr[] = LazyDFA(vm^) # UB + +# CORRECT — construct in place +self._lazy_dfa_ptr = alloc[LazyDFA](1) +self._lazy_dfa_ptr.init_pointee_move(LazyDFA(vm^)) +``` + +Same rule applies in `__copyinit__` when copying heap-owned fields: + +```mojo +def __copyinit__(out self, copy: Self): + self._ptr = alloc[Self.T](1) + self._ptr.init_pointee_move(copy._ptr[].copy()) +``` + +A `__del__` that calls `.free()` on the pointer is mandatory to pair with +these. + +## Cache by hash, not by string + +When a cache key is a string, keying the `Dict` on `String` forces callers to +allocate just to check cache membership. Key on `hash(slice)` instead — cache +hits become zero-allocation: + +```mojo +comptime RegexCache = Dict[UInt64, CompiledRegex] + +def compile_regex(pattern: ImmSlice) raises -> CompiledRegex: + var cache_ptr = _get_regex_cache() + var key = hash(pattern) + if key in cache_ptr[]: + var cached = cache_ptr[][key] + if cached.pattern == pattern: # collision guard: byte-compare + return cached + # miss or collision — allocate String once for the stored copy + var compiled = CompiledRegex(String(pattern)) + cache_ptr[][key] = compiled + return compiled +``` + +Always keep the collision guard (`cached.pattern == pattern`) — 64-bit hash +collisions are astronomically rare but not zero. On collision, fall through +to a fresh compile. 
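+
+The `_get_regex_cache()` helper above follows the `_Global` pattern from the
+lazy-initialized cache section. A sketch of what it looks like, mirroring that
+example (names are illustrative; it reuses the `_Global` and `abort` imports
+shown there):
+
+```mojo
+comptime _REGEX_CACHE = _Global["RegexCache", _init_regex_cache]
+
+def _init_regex_cache() -> RegexCache:
+    return RegexCache()
+
+def _get_regex_cache() -> UnsafePointer[RegexCache, MutAnyOrigin]:
+    try:
+        return _REGEX_CACHE.get_or_create_ptr()
+    except e:
+        abort[prefix="ERROR:"](String(e))
+```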
+ +## Comptime specialization for fast-path code + +`comptime if`/`comptime for` generate **distinct code per specialization** and +have zero runtime cost. Use them to pick between SIMD widths, architectures, or +unrolled loop bodies: + +```mojo +comptime SIMD_WIDTH = simd_width_of[DType.uint8]() + +def match_chunk[size: Int](self, chunk: SIMD[DType.uint8, size]) -> SIMD[DType.bool, size]: + comptime if size == 16: + # Fast path: one pshufb pair + var lo = self.low_lut._dynamic_shuffle(chunk & 0x0F) + var hi = self.high_lut._dynamic_shuffle((chunk >> 4) & 0x0F) + return (lo & hi) != 0 + else: + # Generic path: process in 16-byte sub-chunks + var result = SIMD[DType.bool, size](False) + comptime for offset in range(0, size, 16): + ... + return result +``` + +Also precompute lookup tables as `comptime` so they end up as `.rodata`, not +generated at runtime: + +```mojo +comptime DIGIT_LUT = _build_digit_lut() # runs at compile time +``` + +## SIMD byte scanning: nibble-based lookup + +For byte-level character class scans (matching `[a-z]`, `\d`, `\w`, arbitrary +byte sets), the naive 256-entry lookup table is too big for `_dynamic_shuffle`. +Decompose each byte into two **nibbles** (4-bit halves) and use two 16-entry +tables — this fits exactly in a single `pshufb`/`vpshufb` instruction: + +```mojo +# Two 16-entry tables, precomputed at compile time +var lo = low_nibble_lut._dynamic_shuffle(chunk & 0x0F) +var hi = high_nibble_lut._dynamic_shuffle((chunk >> 4) & 0x0F) +var matches: SIMD[DType.bool, 16] = (lo & hi) != 0 +# matches[i] is True iff chunk[i] is in the class +``` + +Nibble lookup is typically **20-100x faster** than per-byte scalar dispatch on +hot paths. For *contiguous* ranges like `[a-z]`, `[0-9]`, an even cheaper path +exists: + +```mojo +# Range check via unsigned subtract — no lookup table needed +var offset = chunk - SIMD[DType.uint8, 32](range_start) +var matches = offset <= SIMD[DType.uint8, 32](range_end - range_start) +``` + +Record `range_start`/`range_end` on the matcher struct at construction time so +the hot path can pick this fast path. Non-contiguous classes fall back to the +nibble tables. + +## Prefilters: cheap scan before the expensive match + +Full engine invocation per byte is the wrong granularity. Extract a *cheap* +signal from the pattern — a required literal substring, a first-byte set, a +fixed prefix — and use the fastest available scan primitive to locate +candidate positions first. The expensive matcher only runs where the prefilter +says "maybe": + +| Prefilter | Scan primitive | Use when | +|--------------------------|----------------------------------|--------------------------------| +| Required literal | `StringSlice.find` | Pattern contains a fixed substr| +| Last-literal for `.*LIT` | `String.rfind` (single pass) | `.*literal` prefix pattern | +| First-byte set | SIMD equality sweep + bitmask | Small (<8) set of start bytes | +| Byte class (`\d`, `\w`) | Nibble SIMD scan | Start is a character class | + +**Critical anti-pattern**: using repeated forward `find` to locate the *last* +occurrence is O(N × occurrences). Use `rfind` for a single reverse O(N) pass: + +```mojo +# WRONG — O(N * k) +var pos = 0 +while True: + var next = text.find(literal, pos) + if next == -1: break + pos = next + 1 +var last = pos - 1 + +# CORRECT — O(N) +var last = text.rfind(literal) +``` + +Real PR impact: this one change took `.*@example\.com` on a large text from +**39x slower** to **10x faster** than Python. 
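+
+Wiring the two stages together, a minimal sketch of the candidate loop
+(`engine_match_at` is a hypothetical full-verification entry point; only the
+positions the cheap `find` reports ever reach it):
+
+```mojo
+def find_first(text: ImmSlice, literal: ImmSlice) raises -> Optional[Match]:
+    var pos = 0
+    while True:
+        var candidate = text.find(literal, pos)   # cheap prefilter scan
+        if candidate == -1:
+            break                                 # literal absent: no match
+        var m = engine_match_at(text, candidate)  # expensive check, runs rarely
+        if m:
+            return m
+        pos = candidate + 1                       # false positive: keep scanning
+    return None
+```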
+ +## Numeric accumulation over string concat + +When parsing numbers or building values byte by byte, accumulate into an `Int` +directly — don't build a `String` and parse at the end: + +```mojo +# WRONG — N allocations for an N-digit number +var num_str = String("") +while is_digit(text[i]): + num_str += String(chr(text[i])) + i += 1 +var num = Int(num_str) + +# CORRECT — zero allocation +var num = 0 +while is_digit(text[i]): + num = num * 10 + (Int(text[i]) - Int(ord("0"))) + i += 1 +``` + +Same pattern applies to checksum accumulation, hash building, and any +"consume-bytes-and-fold" logic. + +## Fast-path dispatch by input shape + +At construction time (not match time), classify the input and record which +optimized path can run. Cheap patterns (single literal, fixed-length sequence, +anchored prefix) should bypass the general engine entirely: + +```mojo +struct CompiledRegex: + var _simple_literal: Bool # pattern is a fixed string + var _literal: String # extracted literal if any + var _has_dotstar_prefix: Bool # .*LITERAL + var _engine: Engine # general fallback + + def __init__(out self, var ast: ASTNode, pattern: String): + self._simple_literal = _is_simple_literal(ast) + ... # analyze once at compile time + self._engine = build_engine(ast) + + @always_inline + def match_first(self, text: ImmSlice) -> Optional[Match]: + if self._simple_literal: + return _literal_search(text, self._literal) # 100x faster path + if self._has_dotstar_prefix: + return _dotstar_literal_path(text, self._literal) + return self._engine.match_first(text) +``` + +Per-match analysis overhead is amortized across every call on the same +compiled object. Keep the analysis in `__init__`; keep the hot path branchy +but cheap. + +## Unlikely-branch hoisting + +When a fast path dominates, put the check **first** and **return immediately**, +so the fallback isn't inlined into the hot prologue: + +```mojo +@always_inline +def is_match(self, text: ImmSlice) -> Bool: + if len(text) == 0: # cold edge case + return self._matches_empty + if self._simple_literal: # hot path, short-circuits + return _literal_eq(text, self._literal) + return self._engine.is_match(text) # general path, not inlined hot +``` + +Pair with `@no_inline` on the general-path helper if it's large, so the inlined +caller stays small. + +## What NOT to optimize + +- **Don't** replace clear code with micro-optimizations before profiling. All + of the above earned their place by moving a measured hot path; applied + elsewhere they're just noise. +- **Don't** pre-allocate containers you'll use once or twice — the `List` + default growth strategy is already fine for small cases. +- **Don't** blanket-`@always_inline` functions over ~50 lines; you'll bloat + every caller and slow compile times. Inline thin forwarders, not whole + engines. +- **Don't** `unsafe_ptr()` outside hot loops — you lose bounds checks without + payoff, and debugging the next segfault costs far more than the saved + microseconds. +- **Don't** cache aggressively without measuring cache hit rate. An + infrequently-hit cache wastes memory and adds a branch. + +The single most reliable workflow: profile, pick one pattern from this list, +apply it, re-benchmark. Commit that delta. Repeat. Stop when the hot path is +no longer hot. 
From 30931aa62a3d323ce36f574855dfbcac8209d6d5 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:34:03 +0200 Subject: [PATCH 2/5] [Skills] Use std.benchmark in mojo-optimizations profile-driven workflow Replace the hand-rolled perf_counter_ns harness with the stdlib std.benchmark idiom (Bench, Bencher, BenchConfig, BenchId, keep, ThroughputMeasure). Matches the pattern used in mojo/stdlib/benchmarks/ so users copy-paste from the reference tree instead of reinventing warmup and calibration logic. --- mojo-optimizations/SKILL.md | 114 +++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 22 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index 76b735d..ef9a713 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -27,34 +27,104 @@ use is where the order-of-magnitude speedups come from. ## The profile-driven workflow -Before optimizing, build a benchmark harness that isolates the hot path: +Use `std.benchmark` — **never** hand-roll a `perf_counter_ns` harness. +`Bench`/`Bencher` handles warmup, auto-calibration, repetitions, and +statistical summary for you. The stdlib `mojo/stdlib/benchmarks/` tree is the +reference — copy-paste from there when starting a new file. ```mojo -from std.time import perf_counter_ns - -# 1. Build once, outside the timing loop. Never include setup in measurements. -var compiled = compile_once(pattern) - -# 2. Warmup — first N iterations are discarded (cache cold, JIT artifacts). -for _ in range(WARMUP_ITERATIONS): - _ = compiled.run(input) +from std.benchmark import ( + Bench, BenchConfig, Bencher, BenchId, + BenchMetric, ThroughputMeasure, + keep, black_box, +) + +# One benchmark = an @parameter def that takes `mut b: Bencher`. +# The inner @always_inline @parameter closure is what gets measured. +@parameter +def bench_match_first(mut b: Bencher) raises: + var compiled = compile_regex(PATTERN) # setup OUTSIDE the timed body + @always_inline + @parameter + def call_fn() raises: + for _ in range(1000): # batch to amortize per-iter overhead + var r = compiled.match_first(TEXT) + keep(r) # prevent dead-code elimination + b.iter[call_fn]() + keep(Bool(compiled)) # keep the setup alive past the loop + +# Parametric benchmarks use compile-time params; the harness calls them +# once per specialization. +@parameter +def bench_insert[size: Int](mut b: Bencher) raises: + var items = make_dict[size]() + @always_inline + @parameter + def call_fn() raises: + for k in range(size, size + 10): + items[k] = k + b.iter[call_fn]() + keep(Bool(items)) + +def main() raises: + var m = Bench(BenchConfig(num_repetitions=5)) + m.bench_function[bench_match_first](BenchId("match_first")) + comptime for size in (10, 100, 1_000, 10_000): + m.bench_function[bench_insert[size]]( + BenchId(String("insert[", size, "]")) + ) + print(m) # prints the results table +``` -# 3. Auto-calibrate iters so each sample takes >= 1 ms (OS jitter dominates -# sub-ms samples). -var iters = initial_iters -var cal_start = perf_counter_ns() -for _ in range(iters): - _ = compiled.run(input) -if perf_counter_ns() - cal_start < 1_000_000: - iters *= (1_000_000 // (perf_counter_ns() - cal_start)) + 1 +Key rules the harness enforces, so you don't: + +- **Setup outside `call_fn`.** Anything built inside the inner closure is + rebuilt on every iteration. Compile regex, allocate buffers, load fixtures + before `b.iter[...]`. 
+- **`keep(value)` every result.** Without it the optimizer deletes the call + you're trying to measure. `keep(Bool(container))` at the end of the outer + function keeps setup alive past the timed region. `black_box(x)` is the + stronger sibling — use it on inputs you want to force through memory. +- **Batch inside `call_fn`** (e.g., `for _ in range(1000)`) when a single + call is sub-microsecond. The harness auto-calibrates, but batching further + reduces timer overhead for very fast ops. +- **`num_repetitions > 1`** when you care about stability; the harness + reports min/mean/max across repetitions. +- **Throughput units**: pass `ThroughputMeasure` so results are normalized to + GElems/s, GB/s, or GFLOPS/s instead of raw time. Use `bench_with_input` to + pipe a fixture to a parametric bench fn: -# 4. Collect samples until ~500 ms total, take **median** (not mean). +```mojo +m.bench_with_input[InputT, bench_fn]( + BenchId("atof", filename), + input_data, + [ + ThroughputMeasure(BenchMetric.elements, len(input_data)), + ThroughputMeasure(BenchMetric.bytes, total_bytes), + ], +) ``` -**Common mistakes a pretrained model will make**: including pattern compilation -inside the timing loop; reporting mean (outliers skew it); running <100 ms -total; skipping warmup. All of these were real issues fixed in the source -project (and re-introduced by naive rewrites). +- **`iter_custom`** is the escape hatch when the thing you're measuring needs + its own context (e.g., GPU dispatch). Pass a closure taking an iteration + count and returning elapsed ns. Use this only when `iter[...]` can't + express the setup. + +`BenchConfig` defaults are sensible: `num_warmup_iters=10`, +`max_runtime_secs=1.0`, `max_iters=1_000`. Override `num_repetitions` (for +stability) and `max_runtime_secs` (for precision) first; leave the rest alone +unless you've measured a reason. + +File layout: name benchmark files `bench_.mojo`, mirror the source +tree, and put them under a `benchmarks/` directory. Stdlib tooling keys on +the `bench_` prefix. + +**Common mistakes a pretrained model will make**: hand-rolling +`perf_counter_ns` loops; omitting `keep(...)` so the compiler deletes the +work; constructing inputs inside `call_fn`; reporting a single run instead of +`num_repetitions>1`; forgetting `@parameter` on the bench function or +`@always_inline @parameter` on the inner closure; using `Bench()` without +printing `print(m)` at the end (nothing renders otherwise). ## Inlining hot-path trampolines From 524ca9c9c4b360d8d99c6f8925c5f08637b483a1 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:34:41 +0200 Subject: [PATCH 3/5] [Skills] Correct unsafe pointer access section with verified stdlib behavior The previous text incorrectly claimed List, Span, and StringSlice all emit a bounds check on every __getitem__ call. Verified against the stdlib (std/collections/_index_normalization.mojo and std/builtin/debug_assert.mojo): List and Span both pass assert_always=False to normalize_index, so their bounds check compiles out in default (ASSERT=safe) release builds. Only StringSlice[byte=i] emits the check by default, plus a UTF-8 start-byte debug_assert. Replace the section with a per-type table of actual costs, explain what unsafe_ptr() reliably buys you in default release (negative-index branch, trap-free loop optimization, parity with -D ASSERT=all builds), and mention list.unsafe_get(idx) as a safer middle ground. 
--- mojo-optimizations/SKILL.md | 44 +++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index ef9a713..d034a1f 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -157,17 +157,35 @@ struct DFAMatcher: ## Unsafe pointer access in inner loops -`List`/`Span`/`StringSlice` indexing emits a bounds check on every access. For -loops that run millions of times, hoist the bounds check out and use -`unsafe_ptr()` for the actual reads/writes: +Know what `__getitem__` actually costs — it differs by type. All three go +through `normalize_index`, but with different assert modes: + +| Type | Default-release check | Extra per-access work | +|-----------------------|----------------------------------------------------|-----------------------| +| `List[T][i]` | **None** (`assert_mode="none"`, compiled out) | Negative-index normalization branch | +| `Span[T][i]` | **None** (`assert_mode="none"`, compiled out) | Negative-index normalization branch | +| `StringSlice[byte=i]` | **Bounds check + UTF-8 start-byte assert** (`"safe"`) | Negative-index normalization branch | + +The global `ASSERT` mode defaults to `safe`. Under `-D ASSERT=all` or a +debug build, **all three** types emit bounds checks on every access. So +whether `unsafe_ptr()` actually saves a branch depends on the build mode. + +What `unsafe_ptr()` reliably buys you, even in default release: + +1. Skips the negative-index normalization branch (a conditional on every + access for signed index types). +2. Removes the `StringSlice[byte=i]` bounds check + UTF-8 assert. +3. Enables more aggressive loop optimization — with no possible trap, LLVM + can vectorize, unroll, and hoist more freely. +4. Makes `-D ASSERT=all` debug builds as fast as release on the hot path. ```mojo -# WRONG — bounds check per iteration +# Safe but slow in -D ASSERT=all, and still branches per iter on sign check for i in range(len(states)): - if states[i].is_active: # len check every step + if states[i].is_active: ... -# CORRECT — one bounds proof, then unchecked access +# Pointer-based hot loop — no normalization, no possible trap var states_ptr = states.unsafe_ptr() var n = len(states) for i in range(n): @@ -175,13 +193,15 @@ for i in range(n): ... ``` -Apply the same to `StringSlice.unsafe_ptr()` for byte scans, and to -`List[T].unsafe_ptr()` inside inner DFA/state-machine loops. +For `List` specifically, `list.unsafe_get(idx)` is a safer middle ground — +it asserts in debug (`assert_mode` default) but still avoids negative-index +handling. Use `unsafe_ptr()` only when you also want the raw-pointer loop +form (e.g., to feed `+ offset` arithmetic or SIMD loads). -**Also audit the loop for checks that are always true for the input type**: -`uint8_val >= 0` is always true; `char_code < 256` is always true for -`UInt8`-typed inputs. Such dead conditions still cost instructions — delete -them. +**Separately, audit the loop for checks that are always true for the input +type**: `uint8_val >= 0` is always true; `char_code < 256` is always true +for `UInt8`-typed inputs. Such dead conditions still cost instructions — +delete them. 
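+
+A sketch of the `unsafe_get` middle ground mentioned above, on the same loop
+(a debug-build assert remains, but the negative-index branch is gone):
+
+```mojo
+var n = len(states)
+for i in range(n):
+    if states.unsafe_get(i).is_active:
+        ...
+```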
## Pre-allocate collections; lazy-allocate when zero is common From b1802a61c2534c499b16ec12529cf6c46fc616b2 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:45:16 +0200 Subject: [PATCH 4/5] [Skills] Simplify hash-cache section with a generic example Replace the regex-specific example (CompiledRegex, _get_regex_cache, ImmSlice) with a neutral get_or_build pattern that applies to any expensive value keyed by a string. Add a brief list of use cases (parsed configs, compiled templates, resolved paths, interned symbols, SQL plans) so readers see the shape beyond regex. --- mojo-optimizations/SKILL.md | 39 +++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index d034a1f..3134ed4 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -370,29 +370,30 @@ these. ## Cache by hash, not by string -When a cache key is a string, keying the `Dict` on `String` forces callers to -allocate just to check cache membership. Key on `hash(slice)` instead — cache -hits become zero-allocation: +`Dict[String, V]` forces callers to allocate a `String` just to check cache +membership. Hash the slice instead — cache hits become zero-allocation. +Works for any "expensive value keyed by a string": parsed configs, compiled +templates, resolved file paths, interned symbols, SQL plans. ```mojo -comptime RegexCache = Dict[UInt64, CompiledRegex] - -def compile_regex(pattern: ImmSlice) raises -> CompiledRegex: - var cache_ptr = _get_regex_cache() - var key = hash(pattern) - if key in cache_ptr[]: - var cached = cache_ptr[][key] - if cached.pattern == pattern: # collision guard: byte-compare - return cached - # miss or collision — allocate String once for the stored copy - var compiled = CompiledRegex(String(pattern)) - cache_ptr[][key] = compiled - return compiled +# Generic pattern: expensive T built from a string key, memoized. +comptime Cache = Dict[UInt64, Entry] + +def get_or_build(key: StringSlice, mut cache: Cache) raises -> Entry: + var h = hash(key) + if h in cache: + ref hit = cache[h] + if hit.source == key: # collision guard: byte-compare + return hit.copy() + var built = build(String(key)) # allocate the String once, on miss + cache[h] = built + return built ``` -Always keep the collision guard (`cached.pattern == pattern`) — 64-bit hash -collisions are astronomically rare but not zero. On collision, fall through -to a fresh compile. +The collision guard is mandatory. 64-bit hash collisions are astronomically +rare but not zero, and silently returning the wrong value under a collided +key is the worst possible failure mode. On mismatch, fall through to a fresh +build. 
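+
+A usage sketch, with the cache held behind a `_Global` accessor as in the
+lazy-initialized cache section (`_get_template_cache` and the entry type are
+placeholders for whatever expensive value you memoize):
+
+```mojo
+var cache_ptr = _get_template_cache()       # _Global-backed Dict[UInt64, Entry]
+var entry = get_or_build("user_row.html", cache_ptr[])  # no String built on a hit
+```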
## Comptime specialization for fast-path code From b2b5ee164b94c0e3d6c600ee9a3e920cfd9a8891 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:56:58 +0200 Subject: [PATCH 5/5] [Skills] Diversify examples away from regex-specific domain Replace regex-flavored examples (CompiledRegex, DFAMatcher, Match, LazyDFA, compile_regex, match_first, is_match, .*literal) with a variety of neutral domains: - Benchmark: parse_json - Inlining trampolines: JsonParser -> Tokenizer -> ByteScanner - Struct layout: Token (tokenizer output) - Views: tokenize(source) - ref over var: rows, table - Global caches: SymbolTable, intern() - init_pointee_move: Arena - SIMD scanning: JSON whitespace, CSV delimiters, URL-safe chars - Prefilters: log scanning, filename extraction, JSON value detection - Fast-path dispatch: QueryPlan (pk lookup, sequential scan, general) - Unlikely-branch hoisting: validate(input) --- mojo-optimizations/SKILL.md | 204 ++++++++++++++++++------------------ 1 file changed, 102 insertions(+), 102 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index 3134ed4..c7765ee 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -18,10 +18,9 @@ These same principles apply to any files this skill references. --> Apply these patterns **on top of `mojo-syntax`** — they assume you already know -modern Mojo syntax. These rules are extracted from real optimization work on a -production Mojo codebase where a matcher hot path went from ~4 µs to ~1.3 µs -(3x) by composing them. They compound: isolated use gives small wins, chained -use is where the order-of-magnitude speedups come from. +modern Mojo syntax. These patterns compound: any one of them in isolation +gives a small win, but chaining several on the same hot path is where +order-of-magnitude speedups come from. **Measure before and after every change.** Never guess. @@ -42,16 +41,16 @@ from std.benchmark import ( # One benchmark = an @parameter def that takes `mut b: Bencher`. # The inner @always_inline @parameter closure is what gets measured. @parameter -def bench_match_first(mut b: Bencher) raises: - var compiled = compile_regex(PATTERN) # setup OUTSIDE the timed body +def bench_parse_json(mut b: Bencher) raises: + var source = load_fixture("large.json") # setup OUTSIDE the timed body @always_inline @parameter def call_fn() raises: for _ in range(1000): # batch to amortize per-iter overhead - var r = compiled.match_first(TEXT) + var r = parse_json(source) keep(r) # prevent dead-code elimination b.iter[call_fn]() - keep(Bool(compiled)) # keep the setup alive past the loop + keep(Bool(source)) # keep the setup alive past the loop # Parametric benchmarks use compile-time params; the harness calls them # once per specialization. @@ -68,7 +67,7 @@ def bench_insert[size: Int](mut b: Bencher) raises: def main() raises: var m = Bench(BenchConfig(num_repetitions=5)) - m.bench_function[bench_match_first](BenchId("match_first")) + m.bench_function[bench_parse_json](BenchId("parse_json")) comptime for size in (10, 100, 1_000, 10_000): m.bench_function[bench_insert[size]]( BenchId(String("insert[", size, "]")) @@ -79,7 +78,7 @@ def main() raises: Key rules the harness enforces, so you don't: - **Setup outside `call_fn`.** Anything built inside the inner closure is - rebuilt on every iteration. Compile regex, allocate buffers, load fixtures + rebuilt on every iteration. Load fixtures, allocate buffers, pre-compile before `b.iter[...]`. 
- **`keep(value)` every result.** Without it the optimizer deletes the call you're trying to measure. `keep(Bool(container))` at the end of the outer @@ -135,15 +134,15 @@ without `@always_inline`, LLVM can't fold the call chain and the fast path pays `@always_inline`: ```mojo -# CompiledRegex -> HybridMatcher -> DFAMatcher -> DFAEngine : all @always_inline -struct DFAMatcher: +# JsonParser -> Tokenizer -> ByteScanner : all @always_inline +struct Tokenizer: @always_inline - def is_match(self, text: ImmSlice, start: Int = 0) -> Bool: - return self.engine_ptr[].is_match(text, start) + def peek(self) -> UInt8: + return self.scanner_ptr[].peek() @always_inline - def match_first(self, text: ImmSlice, start: Int = 0) -> Optional[Match]: - return self.engine_ptr[].match_first(text, start) + def next_token(mut self) -> Token: + return self.scanner_ptr[].next_token() ``` - One link missing `@always_inline` breaks the fold — check the whole chain. @@ -247,12 +246,12 @@ Small, trivially-copyable structs pass in registers and avoid memcpy. Aim to keep hot-iterated structs under a cache line (64 bytes), ideally 32 bytes. ```mojo -struct Match(Copyable, Movable, TrivialRegisterPassable): +struct Token(Copyable, Movable, TrivialRegisterPassable): comptime __copy_ctor_is_trivial = True # LLVM elides the copy entirely - var group_id: Int # 8 - var start_idx: Int # 8 - var end_idx: Int # 8 - var text_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 + var kind: Int # 8 + var start: Int # 8 + var length: Int # 8 + var source_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 # Total: 32 bytes — fits in 4 registers, no stack ops on copy. ``` @@ -273,7 +272,7 @@ Define a module alias so every layer agrees on the same type: ```mojo comptime ImmSlice = StringSlice[ImmutAnyOrigin] -def search(pattern: ImmSlice, text: ImmSlice) raises -> Optional[Match]: +def tokenize(source: ImmSlice) raises -> List[Token]: ... # callers pass literals, zero alloc ``` @@ -283,30 +282,30 @@ machine words but carries provenance and can't go out of sync. ## Avoid copies in loops: `ref` over `var` -Iterating a container of large structs (`ASTNode`, `Match`, records) with +Iterating a container of large structs (records, tree nodes, tokens) with `var x = container[i]` **copies** on every step. Use `ref` to alias: ```mojo # WRONG — full struct copy per iteration -for i in range(len(nodes)): - var node = nodes[i] # copy - process(node) +for i in range(len(rows)): + var row = rows[i] # copy + process(row) # CORRECT — zero-copy reference -for i in range(len(nodes)): - ref node = nodes[i] # alias - process(node) +for i in range(len(rows)): + ref row = rows[i] # alias + process(row) # Also for local bindings to nested fields: -ref matchers = matchers_ptr[] # instead of `var matchers = matchers_ptr[]` +ref table = table_ptr[] # instead of `var table = table_ptr[]` ``` `^` transfer: when you *do* want ownership but won't use the source again, use -`value^` to move instead of copy: `engine_ptr.init_pointee_move(engine^)`. +`value^` to move instead of copy: `ptr.init_pointee_move(value^)`. ## Lazy-initialized global caches -For precomputed lookup tables, matcher dictionaries, or any expensive +For precomputed lookup tables, interned symbol caches, or any expensive build-once value, use `_Global` from `std.ffi`. 
It guarantees single-shot lazy initialization behind a pointer you can mutate through: @@ -314,28 +313,28 @@ lazy initialization behind a pointer you can mutate through: from std.ffi import _Global from std.os import abort -comptime MatcherCache = Dict[Int, SomeMatcher] -comptime _MATCHER_CACHE = _Global["MatcherCache", _init_matcher_cache] +comptime SymbolTable = Dict[Int, InternedSymbol] +comptime _SYMBOL_TABLE = _Global["SymbolTable", _init_symbol_table] -def _init_matcher_cache() -> MatcherCache: - return MatcherCache() +def _init_symbol_table() -> SymbolTable: + return SymbolTable() -def _get_matcher_cache() -> UnsafePointer[MatcherCache, MutAnyOrigin]: +def _get_symbol_table() -> UnsafePointer[SymbolTable, MutAnyOrigin]: try: - return _MATCHER_CACHE.get_or_create_ptr() + return _SYMBOL_TABLE.get_or_create_ptr() except e: abort[prefix="ERROR:"](String(e)) @always_inline -def get_matcher(key: Int) raises -> SomeMatcher: - var cache_ptr = _get_matcher_cache() - ref cache = cache_ptr[] - if key not in cache: - cache[key] = build_matcher(key) - return cache[key] +def intern(id: Int) raises -> InternedSymbol: + var table_ptr = _get_symbol_table() + ref table = table_ptr[] + if id not in table: + table[id] = InternedSymbol(id) + return table[id] ``` -- The string key in `_Global["MatcherCache", ...]` is the **global identity** — +- The string key in `_Global["SymbolTable", ...]` is the **global identity** — must be unique per cache across the whole program. - Return `UnsafePointer` for interior mutability even when the caller is `read`-self: this is how you cache through a trait method that demands @@ -349,12 +348,12 @@ destructor logic on garbage — a classic flaky double-free at process exit: ```mojo # WRONG — move-assign into uninitialized memory, undefined behavior -self._lazy_dfa_ptr = alloc[LazyDFA](1) -self._lazy_dfa_ptr[] = LazyDFA(vm^) # UB +self._arena_ptr = alloc[Arena](1) +self._arena_ptr[] = Arena(capacity^) # UB # CORRECT — construct in place -self._lazy_dfa_ptr = alloc[LazyDFA](1) -self._lazy_dfa_ptr.init_pointee_move(LazyDFA(vm^)) +self._arena_ptr = alloc[Arena](1) +self._arena_ptr.init_pointee_move(Arena(capacity^)) ``` Same rule applies in `__copyinit__` when copying heap-owned fields: @@ -427,22 +426,23 @@ comptime DIGIT_LUT = _build_digit_lut() # runs at compile time ## SIMD byte scanning: nibble-based lookup -For byte-level character class scans (matching `[a-z]`, `\d`, `\w`, arbitrary -byte sets), the naive 256-entry lookup table is too big for `_dynamic_shuffle`. -Decompose each byte into two **nibbles** (4-bit halves) and use two 16-entry -tables — this fits exactly in a single `pshufb`/`vpshufb` instruction: +For byte-level membership tests — JSON whitespace (`\t \n \r ' '`), CSV +delimiters, URL-safe characters, digit ranges, any fixed byte set — the naive +256-entry lookup table is too big for `_dynamic_shuffle`. Decompose each byte +into two **nibbles** (4-bit halves) and use two 16-entry tables — this fits +exactly in a single `pshufb`/`vpshufb` instruction: ```mojo # Two 16-entry tables, precomputed at compile time var lo = low_nibble_lut._dynamic_shuffle(chunk & 0x0F) var hi = high_nibble_lut._dynamic_shuffle((chunk >> 4) & 0x0F) var matches: SIMD[DType.bool, 16] = (lo & hi) != 0 -# matches[i] is True iff chunk[i] is in the class +# matches[i] is True iff chunk[i] is in the byte set ``` Nibble lookup is typically **20-100x faster** than per-byte scalar dispatch on -hot paths. 
For *contiguous* ranges like `[a-z]`, `[0-9]`, an even cheaper path -exists: +hot paths. For *contiguous* ranges (e.g., `a`-`z`, `0`-`9`), an even cheaper +path exists: ```mojo # Range check via unsigned subtract — no lookup table needed @@ -450,43 +450,44 @@ var offset = chunk - SIMD[DType.uint8, 32](range_start) var matches = offset <= SIMD[DType.uint8, 32](range_end - range_start) ``` -Record `range_start`/`range_end` on the matcher struct at construction time so -the hot path can pick this fast path. Non-contiguous classes fall back to the +Record `range_start`/`range_end` on the scanner struct at construction time so +the hot path can pick this fast path. Non-contiguous sets fall back to the nibble tables. -## Prefilters: cheap scan before the expensive match +## Prefilters: cheap scan before the expensive path -Full engine invocation per byte is the wrong granularity. Extract a *cheap* -signal from the pattern — a required literal substring, a first-byte set, a -fixed prefix — and use the fastest available scan primitive to locate -candidate positions first. The expensive matcher only runs where the prefilter -says "maybe": +Running the full processing pipeline per byte is the wrong granularity. +Extract a *cheap* signal — a required literal substring, a lead-byte set, +a fixed prefix — and use the fastest scan primitive to locate candidate +positions first. The expensive logic only runs where the prefilter says +"maybe". This applies to parsers, validators, search engines, log scanners, +packet decoders, etc. -| Prefilter | Scan primitive | Use when | -|--------------------------|----------------------------------|--------------------------------| -| Required literal | `StringSlice.find` | Pattern contains a fixed substr| -| Last-literal for `.*LIT` | `String.rfind` (single pass) | `.*literal` prefix pattern | -| First-byte set | SIMD equality sweep + bitmask | Small (<8) set of start bytes | -| Byte class (`\d`, `\w`) | Nibble SIMD scan | Start is a character class | +| Prefilter | Scan primitive | Example | +|--------------------|-------------------------------|----------------------------------------| +| Required literal | `StringSlice.find` | Scan for `"error"` before parsing line | +| Last occurrence | `String.rfind` (single pass) | Find last `/` to extract filename | +| Lead-byte set | SIMD equality sweep + bitmask | Scan for `{`, `[`, `"` to find JSON values | +| Byte range/class | Nibble SIMD scan | Skip to first digit before number parse| **Critical anti-pattern**: using repeated forward `find` to locate the *last* -occurrence is O(N × occurrences). Use `rfind` for a single reverse O(N) pass: +occurrence is O(N x occurrences). Use `rfind` for a single reverse O(N) pass: ```mojo # WRONG — O(N * k) var pos = 0 while True: - var next = text.find(literal, pos) + var next = text.find(delimiter, pos) if next == -1: break pos = next + 1 var last = pos - 1 # CORRECT — O(N) -var last = text.rfind(literal) +var last = text.rfind(delimiter) ``` -Real PR impact: this one change took `.*@example\.com` on a large text from -**39x slower** to **10x faster** than Python. +In practice, switching from repeated-`find` to `rfind` for a "find last +suffix" operation has delivered **8-40x speedups** on multi-KB inputs. ## Numeric accumulation over string concat @@ -513,34 +514,33 @@ Same pattern applies to checksum accumulation, hash building, and any ## Fast-path dispatch by input shape -At construction time (not match time), classify the input and record which -optimized path can run. 
Cheap patterns (single literal, fixed-length sequence, -anchored prefix) should bypass the general engine entirely: +At construction time (not execution time), classify the workload and record +which optimized path can run. Simple cases should bypass the general engine +entirely — the analysis cost is paid once and amortized across every call: ```mojo -struct CompiledRegex: - var _simple_literal: Bool # pattern is a fixed string - var _literal: String # extracted literal if any - var _has_dotstar_prefix: Bool # .*LITERAL - var _engine: Engine # general fallback +struct QueryPlan: + var _is_pk_lookup: Bool # equality on primary key + var _is_simple_scan: Bool # single-table, no joins + var _executor: GeneralExecutor # general fallback - def __init__(out self, var ast: ASTNode, pattern: String): - self._simple_literal = _is_simple_literal(ast) - ... # analyze once at compile time - self._engine = build_engine(ast) + def __init__(out self, query: Query): + self._is_pk_lookup = _has_pk_equality(query) + self._is_simple_scan = _is_single_table(query) + self._executor = plan_general(query) @always_inline - def match_first(self, text: ImmSlice) -> Optional[Match]: - if self._simple_literal: - return _literal_search(text, self._literal) # 100x faster path - if self._has_dotstar_prefix: - return _dotstar_literal_path(text, self._literal) - return self._engine.match_first(text) + def execute(self, db: Database) -> ResultSet: + if self._is_pk_lookup: + return _pk_index_lookup(db, self._executor.key) # O(1) path + if self._is_simple_scan: + return _sequential_scan(db, self._executor.table) + return self._executor.run(db) ``` -Per-match analysis overhead is amortized across every call on the same -compiled object. Keep the analysis in `__init__`; keep the hot path branchy -but cheap. +Keep the classification in `__init__`; keep the hot path branchy but cheap. +The same pattern applies to parsers (literal vs. complex grammar), formatters +(fixed-width vs. general), and serializers (flat struct vs. nested). ## Unlikely-branch hoisting @@ -549,12 +549,12 @@ so the fallback isn't inlined into the hot prologue: ```mojo @always_inline -def is_match(self, text: ImmSlice) -> Bool: - if len(text) == 0: # cold edge case - return self._matches_empty - if self._simple_literal: # hot path, short-circuits - return _literal_eq(text, self._literal) - return self._engine.is_match(text) # general path, not inlined hot +def validate(self, input: ImmSlice) -> Bool: + if len(input) == 0: # cold edge case + return self._empty_valid + if self._exact_mode: # hot path, short-circuits + return _byte_eq(input, self._expected) + return self._general.check(input) # general path, not inlined hot ``` Pair with `@no_inline` on the general-path helper if it's large, so the inlined