From 6ee4277c05be89ccfc83348ad03ec76875a79782 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:23:46 +0200 Subject: [PATCH 1/5] [Skills] Add mojo-optimizations skill New skill capturing performance optimization patterns for Mojo code. Layers on top of mojo-syntax and is triggered when profiling, benchmarking, tuning latency, or porting performance-sensitive code to Mojo. Covers hot-path inlining, unsafe pointer access in inner loops, pre-allocation and lazy containers, struct layout for cache efficiency, views over owned strings, ref over var in loops, _Global lazy caches, init_pointee_move for heap fields, hash-keyed caching, comptime specialization, nibble-based SIMD byte scanning, prefiltering strategies, numeric accumulation, fast-path dispatch by input shape, and guidance on what not to optimize. --- README.md | 10 + mojo-optimizations/SKILL.md | 490 ++++++++++++++++++++++++++++++++++++ 2 files changed, 500 insertions(+) create mode 100644 mojo-optimizations/SKILL.md diff --git a/README.md b/README.md index 6775602..736afdf 100644 --- a/README.md +++ b/README.md @@ -82,6 +82,16 @@ triggered when Python types are used Mojo or a Python module needs to interact with Mojo code. Many capabilities of Mojo - Python interoperability are fairly new, and existing coding agents don't handle them correctly without guidance. +### `mojo-optimizations` + +[This skill](mojo-optimizations/SKILL.md) captures performance optimization +patterns for Mojo. It layers on top of `mojo-syntax` and is triggered when +profiling, benchmarking, tuning latency, or porting performance-sensitive code +to Mojo. It covers hot-path inlining, pre-allocation, unsafe pointer access in +hot loops, struct layout, view-vs-owned types, lazy-initialized global caches, +hash-keyed caching, `comptime` specialization, nibble-based SIMD byte scanning, +prefiltering strategies, and fast-path dispatch. + ## Examples Once these skills are installed, you can use them for many common tasks. diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md new file mode 100644 index 0000000..76b735d --- /dev/null +++ b/mojo-optimizations/SKILL.md @@ -0,0 +1,490 @@ +--- +name: mojo-optimizations +description: Performance optimization patterns for Mojo code. Use this skill in addition to mojo-syntax when writing or refactoring Mojo code that needs to be fast, when profiling shows a hot path, when the user mentions benchmarks, regressions, latency, throughput, "make it faster", or when porting performance-sensitive code (parsers, matchers, numeric loops, byte-level scanners) to Mojo. Use to overcome misconceptions about where Mojo spends cycles and which idioms compile to tight code. +--- + + + +Apply these patterns **on top of `mojo-syntax`** — they assume you already know +modern Mojo syntax. These rules are extracted from real optimization work on a +production Mojo codebase where a matcher hot path went from ~4 µs to ~1.3 µs +(3x) by composing them. They compound: isolated use gives small wins, chained +use is where the order-of-magnitude speedups come from. + +**Measure before and after every change.** Never guess. + +## The profile-driven workflow + +Before optimizing, build a benchmark harness that isolates the hot path: + +```mojo +from std.time import perf_counter_ns + +# 1. Build once, outside the timing loop. Never include setup in measurements. +var compiled = compile_once(pattern) + +# 2. Warmup — first N iterations are discarded (cache cold, JIT artifacts). 
+for _ in range(WARMUP_ITERATIONS): + _ = compiled.run(input) + +# 3. Auto-calibrate iters so each sample takes >= 1 ms (OS jitter dominates +# sub-ms samples). +var iters = initial_iters +var cal_start = perf_counter_ns() +for _ in range(iters): + _ = compiled.run(input) +if perf_counter_ns() - cal_start < 1_000_000: + iters *= (1_000_000 // (perf_counter_ns() - cal_start)) + 1 + +# 4. Collect samples until ~500 ms total, take **median** (not mean). +``` + +**Common mistakes a pretrained model will make**: including pattern compilation +inside the timing loop; reporting mean (outliers skew it); running <100 ms +total; skipping warmup. All of these were real issues fixed in the source +project (and re-introduced by naive rewrites). + +## Inlining hot-path trampolines + +Wrapper methods that just forward to an inner engine are "trampolines". If a +small input (e.g., a 16-byte `StringSlice`) flows through 4 trampoline levels +without `@always_inline`, LLVM can't fold the call chain and the fast path pays +3-4 call-frame costs per invocation. Mark **every level** of the dispatch chain +`@always_inline`: + +```mojo +# CompiledRegex -> HybridMatcher -> DFAMatcher -> DFAEngine : all @always_inline +struct DFAMatcher: + @always_inline + def is_match(self, text: ImmSlice, start: Int = 0) -> Bool: + return self.engine_ptr[].is_match(text, start) + + @always_inline + def match_first(self, text: ImmSlice, start: Int = 0) -> Optional[Match]: + return self.engine_ptr[].match_first(text, start) +``` + +- One link missing `@always_inline` breaks the fold — check the whole chain. +- `@always_inline("nodebug")` additionally strips debug info for tiny helpers + (accessors, byte reads) so they don't clutter stack traces. +- Use `@no_inline` on **cold** paths inside a hot function to keep the hot loop + small and reduce I-cache pressure (error handlers, first-time-setup paths). +- Don't blindly inline large functions — `@always_inline` on a 200-line + function bloats call sites. The rule is: inline thin forwarders and small + leaf helpers, not big functions. + +## Unsafe pointer access in inner loops + +`List`/`Span`/`StringSlice` indexing emits a bounds check on every access. For +loops that run millions of times, hoist the bounds check out and use +`unsafe_ptr()` for the actual reads/writes: + +```mojo +# WRONG — bounds check per iteration +for i in range(len(states)): + if states[i].is_active: # len check every step + ... + +# CORRECT — one bounds proof, then unchecked access +var states_ptr = states.unsafe_ptr() +var n = len(states) +for i in range(n): + if states_ptr[i].is_active: + ... +``` + +Apply the same to `StringSlice.unsafe_ptr()` for byte scans, and to +`List[T].unsafe_ptr()` inside inner DFA/state-machine loops. + +**Also audit the loop for checks that are always true for the input type**: +`uint8_val >= 0` is always true; `char_code < 256` is always true for +`UInt8`-typed inputs. Such dead conditions still cost instructions — delete +them. 
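+
+As a concrete example, the same hoisting applied to a byte scan over string
+data (a sketch; `count_spaces` is illustrative, and `ImmSlice` is the
+`StringSlice[ImmutAnyOrigin]` alias introduced in the views section below):
+
+```mojo
+def count_spaces(text: ImmSlice) -> Int:
+    var p = text.unsafe_ptr()       # hoist the pointer once
+    var n = text.byte_length()
+    var space = UInt8(ord(" "))
+    var count = 0
+    for i in range(n):
+        if p[i] == space:           # unchecked read inside the hot loop
+            count += 1
+    return count
+```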
+ +## Pre-allocate collections; lazy-allocate when zero is common + +| Situation | Pattern | +|---------------------------------------|--------------------------------------------| +| Known upper bound N | `List[T](capacity=N)` | +| Known lower bound, growable | `var xs = List[T](); xs.reserve(estimate)` | +| Zero is the common case (`findall`) | Lazy: don't allocate until first append | +| Fixed compile-time size | `InlineArray[T, N]` (stack, no heap) | +| Small hot container, size never grows | `SIMD[DType.uint8, 128]` as a dense bitset | + +```mojo +# Known bound — pre-size at construction +var elements = List[Node](capacity=len(tokens)) + +# Lazy container — defer allocation until first insert +struct LazyList[T: Copyable & Movable]: + var _data: UnsafePointer[T, MutAnyOrigin] + var _len: Int + var _capacity: Int + + def __init__(out self): + self._data = UnsafePointer[T, MutAnyOrigin]() + self._len = 0 + self._capacity = 0 + + def append(mut self, value: T): + if self._capacity == 0: + self._realloc(8) # first-use reservation + elif self._len == self._capacity: + self._realloc(self._capacity * 2) + (self._data + self._len).init_pointee_move(value) + self._len += 1 +``` + +**Pitfall**: `List` grows by doubling. A loop that appends 1000 items triggers +~10 reallocations + memcpys if you start from zero. Pre-sizing is usually a +1.2-2x win on append-heavy code. + +## Struct layout for cache efficiency + +Small, trivially-copyable structs pass in registers and avoid memcpy. Aim to +keep hot-iterated structs under a cache line (64 bytes), ideally 32 bytes. + +```mojo +struct Match(Copyable, Movable, TrivialRegisterPassable): + comptime __copy_ctor_is_trivial = True # LLVM elides the copy entirely + var group_id: Int # 8 + var start_idx: Int # 8 + var end_idx: Int # 8 + var text_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 + # Total: 32 bytes — fits in 4 registers, no stack ops on copy. +``` + +- Store `UnsafePointer[Byte]` + offsets, not full `StringSlice` fields, when + every byte of the struct matters. (A `StringSlice` is ~16 bytes — two pointer + payloads — which doubles the struct.) +- `TrivialRegisterPassable` requires all fields to be trivially copyable; + adding a `String` or `List` field silently drops the trait. +- `@fieldwise_init` synthesises the constructor without you typing field + assignments — use it on any plain data struct. + +## Pass views, not owned strings + +`String` is heap-allocated; copying or building one from a literal allocates. +Public APIs that only read text should take a `StringSlice` (view) instead. +Define a module alias so every layer agrees on the same type: + +```mojo +comptime ImmSlice = StringSlice[ImmutAnyOrigin] + +def search(pattern: ImmSlice, text: ImmSlice) raises -> Optional[Match]: + ... # callers pass literals, zero alloc +``` + +`Span[Byte]` plays the same role for raw byte views. Prefer `Span[Byte]` over +`(ptr: UnsafePointer[Byte], len: Int)` parameter pairs — it's the same two +machine words but carries provenance and can't go out of sync. + +## Avoid copies in loops: `ref` over `var` + +Iterating a container of large structs (`ASTNode`, `Match`, records) with +`var x = container[i]` **copies** on every step. 
Use `ref` to alias: + +```mojo +# WRONG — full struct copy per iteration +for i in range(len(nodes)): + var node = nodes[i] # copy + process(node) + +# CORRECT — zero-copy reference +for i in range(len(nodes)): + ref node = nodes[i] # alias + process(node) + +# Also for local bindings to nested fields: +ref matchers = matchers_ptr[] # instead of `var matchers = matchers_ptr[]` +``` + +`^` transfer: when you *do* want ownership but won't use the source again, use +`value^` to move instead of copy: `engine_ptr.init_pointee_move(engine^)`. + +## Lazy-initialized global caches + +For precomputed lookup tables, matcher dictionaries, or any expensive +build-once value, use `_Global` from `std.ffi`. It guarantees single-shot +lazy initialization behind a pointer you can mutate through: + +```mojo +from std.ffi import _Global +from std.os import abort + +comptime MatcherCache = Dict[Int, SomeMatcher] +comptime _MATCHER_CACHE = _Global["MatcherCache", _init_matcher_cache] + +def _init_matcher_cache() -> MatcherCache: + return MatcherCache() + +def _get_matcher_cache() -> UnsafePointer[MatcherCache, MutAnyOrigin]: + try: + return _MATCHER_CACHE.get_or_create_ptr() + except e: + abort[prefix="ERROR:"](String(e)) + +@always_inline +def get_matcher(key: Int) raises -> SomeMatcher: + var cache_ptr = _get_matcher_cache() + ref cache = cache_ptr[] + if key not in cache: + cache[key] = build_matcher(key) + return cache[key] +``` + +- The string key in `_Global["MatcherCache", ...]` is the **global identity** — + must be unique per cache across the whole program. +- Return `UnsafePointer` for interior mutability even when the caller is + `read`-self: this is how you cache through a trait method that demands + immutable `self`. + +## Heap-allocated fields: use `init_pointee_move`, not assignment + +`alloc[T](1)` returns **uninitialized** memory. Assigning `ptr[] = T(...)` +invokes *move-assignment* into that uninitialized storage, which runs +destructor logic on garbage — a classic flaky double-free at process exit: + +```mojo +# WRONG — move-assign into uninitialized memory, undefined behavior +self._lazy_dfa_ptr = alloc[LazyDFA](1) +self._lazy_dfa_ptr[] = LazyDFA(vm^) # UB + +# CORRECT — construct in place +self._lazy_dfa_ptr = alloc[LazyDFA](1) +self._lazy_dfa_ptr.init_pointee_move(LazyDFA(vm^)) +``` + +Same rule applies in `__copyinit__` when copying heap-owned fields: + +```mojo +def __copyinit__(out self, copy: Self): + self._ptr = alloc[Self.T](1) + self._ptr.init_pointee_move(copy._ptr[].copy()) +``` + +A `__del__` that calls `.free()` on the pointer is mandatory to pair with +these. + +## Cache by hash, not by string + +When a cache key is a string, keying the `Dict` on `String` forces callers to +allocate just to check cache membership. Key on `hash(slice)` instead — cache +hits become zero-allocation: + +```mojo +comptime RegexCache = Dict[UInt64, CompiledRegex] + +def compile_regex(pattern: ImmSlice) raises -> CompiledRegex: + var cache_ptr = _get_regex_cache() + var key = hash(pattern) + if key in cache_ptr[]: + var cached = cache_ptr[][key] + if cached.pattern == pattern: # collision guard: byte-compare + return cached + # miss or collision — allocate String once for the stored copy + var compiled = CompiledRegex(String(pattern)) + cache_ptr[][key] = compiled + return compiled +``` + +Always keep the collision guard (`cached.pattern == pattern`) — 64-bit hash +collisions are astronomically rare but not zero. On collision, fall through +to a fresh compile. 
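+
+The `_get_regex_cache()` helper above follows the `_Global` pattern from the
+lazy-initialized cache section. A sketch of what it looks like, mirroring that
+example (names are illustrative; it reuses the `_Global` and `abort` imports
+shown there):
+
+```mojo
+comptime _REGEX_CACHE = _Global["RegexCache", _init_regex_cache]
+
+def _init_regex_cache() -> RegexCache:
+    return RegexCache()
+
+def _get_regex_cache() -> UnsafePointer[RegexCache, MutAnyOrigin]:
+    try:
+        return _REGEX_CACHE.get_or_create_ptr()
+    except e:
+        abort[prefix="ERROR:"](String(e))
+```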
+ +## Comptime specialization for fast-path code + +`comptime if`/`comptime for` generate **distinct code per specialization** and +have zero runtime cost. Use them to pick between SIMD widths, architectures, or +unrolled loop bodies: + +```mojo +comptime SIMD_WIDTH = simd_width_of[DType.uint8]() + +def match_chunk[size: Int](self, chunk: SIMD[DType.uint8, size]) -> SIMD[DType.bool, size]: + comptime if size == 16: + # Fast path: one pshufb pair + var lo = self.low_lut._dynamic_shuffle(chunk & 0x0F) + var hi = self.high_lut._dynamic_shuffle((chunk >> 4) & 0x0F) + return (lo & hi) != 0 + else: + # Generic path: process in 16-byte sub-chunks + var result = SIMD[DType.bool, size](False) + comptime for offset in range(0, size, 16): + ... + return result +``` + +Also precompute lookup tables as `comptime` so they end up as `.rodata`, not +generated at runtime: + +```mojo +comptime DIGIT_LUT = _build_digit_lut() # runs at compile time +``` + +## SIMD byte scanning: nibble-based lookup + +For byte-level character class scans (matching `[a-z]`, `\d`, `\w`, arbitrary +byte sets), the naive 256-entry lookup table is too big for `_dynamic_shuffle`. +Decompose each byte into two **nibbles** (4-bit halves) and use two 16-entry +tables — this fits exactly in a single `pshufb`/`vpshufb` instruction: + +```mojo +# Two 16-entry tables, precomputed at compile time +var lo = low_nibble_lut._dynamic_shuffle(chunk & 0x0F) +var hi = high_nibble_lut._dynamic_shuffle((chunk >> 4) & 0x0F) +var matches: SIMD[DType.bool, 16] = (lo & hi) != 0 +# matches[i] is True iff chunk[i] is in the class +``` + +Nibble lookup is typically **20-100x faster** than per-byte scalar dispatch on +hot paths. For *contiguous* ranges like `[a-z]`, `[0-9]`, an even cheaper path +exists: + +```mojo +# Range check via unsigned subtract — no lookup table needed +var offset = chunk - SIMD[DType.uint8, 32](range_start) +var matches = offset <= SIMD[DType.uint8, 32](range_end - range_start) +``` + +Record `range_start`/`range_end` on the matcher struct at construction time so +the hot path can pick this fast path. Non-contiguous classes fall back to the +nibble tables. + +## Prefilters: cheap scan before the expensive match + +Full engine invocation per byte is the wrong granularity. Extract a *cheap* +signal from the pattern — a required literal substring, a first-byte set, a +fixed prefix — and use the fastest available scan primitive to locate +candidate positions first. The expensive matcher only runs where the prefilter +says "maybe": + +| Prefilter | Scan primitive | Use when | +|--------------------------|----------------------------------|--------------------------------| +| Required literal | `StringSlice.find` | Pattern contains a fixed substr| +| Last-literal for `.*LIT` | `String.rfind` (single pass) | `.*literal` prefix pattern | +| First-byte set | SIMD equality sweep + bitmask | Small (<8) set of start bytes | +| Byte class (`\d`, `\w`) | Nibble SIMD scan | Start is a character class | + +**Critical anti-pattern**: using repeated forward `find` to locate the *last* +occurrence is O(N × occurrences). Use `rfind` for a single reverse O(N) pass: + +```mojo +# WRONG — O(N * k) +var pos = 0 +while True: + var next = text.find(literal, pos) + if next == -1: break + pos = next + 1 +var last = pos - 1 + +# CORRECT — O(N) +var last = text.rfind(literal) +``` + +Real PR impact: this one change took `.*@example\.com` on a large text from +**39x slower** to **10x faster** than Python. 
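+
+Wiring the two stages together, a minimal sketch of the candidate loop
+(`engine_match_at` is a hypothetical full-verification entry point; only the
+positions the cheap `find` reports ever reach it):
+
+```mojo
+def find_first(text: ImmSlice, literal: ImmSlice) raises -> Optional[Match]:
+    var pos = 0
+    while True:
+        var candidate = text.find(literal, pos)   # cheap prefilter scan
+        if candidate == -1:
+            break                                 # literal absent: no match
+        var m = engine_match_at(text, candidate)  # expensive check, runs rarely
+        if m:
+            return m
+        pos = candidate + 1                       # false positive: keep scanning
+    return None
+```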
+ +## Numeric accumulation over string concat + +When parsing numbers or building values byte by byte, accumulate into an `Int` +directly — don't build a `String` and parse at the end: + +```mojo +# WRONG — N allocations for an N-digit number +var num_str = String("") +while is_digit(text[i]): + num_str += String(chr(text[i])) + i += 1 +var num = Int(num_str) + +# CORRECT — zero allocation +var num = 0 +while is_digit(text[i]): + num = num * 10 + (Int(text[i]) - Int(ord("0"))) + i += 1 +``` + +Same pattern applies to checksum accumulation, hash building, and any +"consume-bytes-and-fold" logic. + +## Fast-path dispatch by input shape + +At construction time (not match time), classify the input and record which +optimized path can run. Cheap patterns (single literal, fixed-length sequence, +anchored prefix) should bypass the general engine entirely: + +```mojo +struct CompiledRegex: + var _simple_literal: Bool # pattern is a fixed string + var _literal: String # extracted literal if any + var _has_dotstar_prefix: Bool # .*LITERAL + var _engine: Engine # general fallback + + def __init__(out self, var ast: ASTNode, pattern: String): + self._simple_literal = _is_simple_literal(ast) + ... # analyze once at compile time + self._engine = build_engine(ast) + + @always_inline + def match_first(self, text: ImmSlice) -> Optional[Match]: + if self._simple_literal: + return _literal_search(text, self._literal) # 100x faster path + if self._has_dotstar_prefix: + return _dotstar_literal_path(text, self._literal) + return self._engine.match_first(text) +``` + +Per-match analysis overhead is amortized across every call on the same +compiled object. Keep the analysis in `__init__`; keep the hot path branchy +but cheap. + +## Unlikely-branch hoisting + +When a fast path dominates, put the check **first** and **return immediately**, +so the fallback isn't inlined into the hot prologue: + +```mojo +@always_inline +def is_match(self, text: ImmSlice) -> Bool: + if len(text) == 0: # cold edge case + return self._matches_empty + if self._simple_literal: # hot path, short-circuits + return _literal_eq(text, self._literal) + return self._engine.is_match(text) # general path, not inlined hot +``` + +Pair with `@no_inline` on the general-path helper if it's large, so the inlined +caller stays small. + +## What NOT to optimize + +- **Don't** replace clear code with micro-optimizations before profiling. All + of the above earned their place by moving a measured hot path; applied + elsewhere they're just noise. +- **Don't** pre-allocate containers you'll use once or twice — the `List` + default growth strategy is already fine for small cases. +- **Don't** blanket-`@always_inline` functions over ~50 lines; you'll bloat + every caller and slow compile times. Inline thin forwarders, not whole + engines. +- **Don't** `unsafe_ptr()` outside hot loops — you lose bounds checks without + payoff, and debugging the next segfault costs far more than the saved + microseconds. +- **Don't** cache aggressively without measuring cache hit rate. An + infrequently-hit cache wastes memory and adds a branch. + +The single most reliable workflow: profile, pick one pattern from this list, +apply it, re-benchmark. Commit that delta. Repeat. Stop when the hot path is +no longer hot. 
From 30931aa62a3d323ce36f574855dfbcac8209d6d5 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:34:03 +0200 Subject: [PATCH 2/5] [Skills] Use std.benchmark in mojo-optimizations profile-driven workflow Replace the hand-rolled perf_counter_ns harness with the stdlib std.benchmark idiom (Bench, Bencher, BenchConfig, BenchId, keep, ThroughputMeasure). Matches the pattern used in mojo/stdlib/benchmarks/ so users copy-paste from the reference tree instead of reinventing warmup and calibration logic. --- mojo-optimizations/SKILL.md | 114 +++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 22 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index 76b735d..ef9a713 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -27,34 +27,104 @@ use is where the order-of-magnitude speedups come from. ## The profile-driven workflow -Before optimizing, build a benchmark harness that isolates the hot path: +Use `std.benchmark` — **never** hand-roll a `perf_counter_ns` harness. +`Bench`/`Bencher` handles warmup, auto-calibration, repetitions, and +statistical summary for you. The stdlib `mojo/stdlib/benchmarks/` tree is the +reference — copy-paste from there when starting a new file. ```mojo -from std.time import perf_counter_ns - -# 1. Build once, outside the timing loop. Never include setup in measurements. -var compiled = compile_once(pattern) - -# 2. Warmup — first N iterations are discarded (cache cold, JIT artifacts). -for _ in range(WARMUP_ITERATIONS): - _ = compiled.run(input) +from std.benchmark import ( + Bench, BenchConfig, Bencher, BenchId, + BenchMetric, ThroughputMeasure, + keep, black_box, +) + +# One benchmark = an @parameter def that takes `mut b: Bencher`. +# The inner @always_inline @parameter closure is what gets measured. +@parameter +def bench_match_first(mut b: Bencher) raises: + var compiled = compile_regex(PATTERN) # setup OUTSIDE the timed body + @always_inline + @parameter + def call_fn() raises: + for _ in range(1000): # batch to amortize per-iter overhead + var r = compiled.match_first(TEXT) + keep(r) # prevent dead-code elimination + b.iter[call_fn]() + keep(Bool(compiled)) # keep the setup alive past the loop + +# Parametric benchmarks use compile-time params; the harness calls them +# once per specialization. +@parameter +def bench_insert[size: Int](mut b: Bencher) raises: + var items = make_dict[size]() + @always_inline + @parameter + def call_fn() raises: + for k in range(size, size + 10): + items[k] = k + b.iter[call_fn]() + keep(Bool(items)) + +def main() raises: + var m = Bench(BenchConfig(num_repetitions=5)) + m.bench_function[bench_match_first](BenchId("match_first")) + comptime for size in (10, 100, 1_000, 10_000): + m.bench_function[bench_insert[size]]( + BenchId(String("insert[", size, "]")) + ) + print(m) # prints the results table +``` -# 3. Auto-calibrate iters so each sample takes >= 1 ms (OS jitter dominates -# sub-ms samples). -var iters = initial_iters -var cal_start = perf_counter_ns() -for _ in range(iters): - _ = compiled.run(input) -if perf_counter_ns() - cal_start < 1_000_000: - iters *= (1_000_000 // (perf_counter_ns() - cal_start)) + 1 +Key rules the harness enforces, so you don't: + +- **Setup outside `call_fn`.** Anything built inside the inner closure is + rebuilt on every iteration. Compile regex, allocate buffers, load fixtures + before `b.iter[...]`. 
+- **`keep(value)` every result.** Without it the optimizer deletes the call + you're trying to measure. `keep(Bool(container))` at the end of the outer + function keeps setup alive past the timed region. `black_box(x)` is the + stronger sibling — use it on inputs you want to force through memory. +- **Batch inside `call_fn`** (e.g., `for _ in range(1000)`) when a single + call is sub-microsecond. The harness auto-calibrates, but batching further + reduces timer overhead for very fast ops. +- **`num_repetitions > 1`** when you care about stability; the harness + reports min/mean/max across repetitions. +- **Throughput units**: pass `ThroughputMeasure` so results are normalized to + GElems/s, GB/s, or GFLOPS/s instead of raw time. Use `bench_with_input` to + pipe a fixture to a parametric bench fn: -# 4. Collect samples until ~500 ms total, take **median** (not mean). +```mojo +m.bench_with_input[InputT, bench_fn]( + BenchId("atof", filename), + input_data, + [ + ThroughputMeasure(BenchMetric.elements, len(input_data)), + ThroughputMeasure(BenchMetric.bytes, total_bytes), + ], +) ``` -**Common mistakes a pretrained model will make**: including pattern compilation -inside the timing loop; reporting mean (outliers skew it); running <100 ms -total; skipping warmup. All of these were real issues fixed in the source -project (and re-introduced by naive rewrites). +- **`iter_custom`** is the escape hatch when the thing you're measuring needs + its own context (e.g., GPU dispatch). Pass a closure taking an iteration + count and returning elapsed ns. Use this only when `iter[...]` can't + express the setup. + +`BenchConfig` defaults are sensible: `num_warmup_iters=10`, +`max_runtime_secs=1.0`, `max_iters=1_000`. Override `num_repetitions` (for +stability) and `max_runtime_secs` (for precision) first; leave the rest alone +unless you've measured a reason. + +File layout: name benchmark files `bench_.mojo`, mirror the source +tree, and put them under a `benchmarks/` directory. Stdlib tooling keys on +the `bench_` prefix. + +**Common mistakes a pretrained model will make**: hand-rolling +`perf_counter_ns` loops; omitting `keep(...)` so the compiler deletes the +work; constructing inputs inside `call_fn`; reporting a single run instead of +`num_repetitions>1`; forgetting `@parameter` on the bench function or +`@always_inline @parameter` on the inner closure; using `Bench()` without +printing `print(m)` at the end (nothing renders otherwise). ## Inlining hot-path trampolines From 524ca9c9c4b360d8d99c6f8925c5f08637b483a1 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:34:41 +0200 Subject: [PATCH 3/5] [Skills] Correct unsafe pointer access section with verified stdlib behavior The previous text incorrectly claimed List, Span, and StringSlice all emit a bounds check on every __getitem__ call. Verified against the stdlib (std/collections/_index_normalization.mojo and std/builtin/debug_assert.mojo): List and Span both pass assert_always=False to normalize_index, so their bounds check compiles out in default (ASSERT=safe) release builds. Only StringSlice[byte=i] emits the check by default, plus a UTF-8 start-byte debug_assert. Replace the section with a per-type table of actual costs, explain what unsafe_ptr() reliably buys you in default release (negative-index branch, trap-free loop optimization, parity with -D ASSERT=all builds), and mention list.unsafe_get(idx) as a safer middle ground. 
--- mojo-optimizations/SKILL.md | 44 +++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index ef9a713..d034a1f 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -157,17 +157,35 @@ struct DFAMatcher: ## Unsafe pointer access in inner loops -`List`/`Span`/`StringSlice` indexing emits a bounds check on every access. For -loops that run millions of times, hoist the bounds check out and use -`unsafe_ptr()` for the actual reads/writes: +Know what `__getitem__` actually costs — it differs by type. All three go +through `normalize_index`, but with different assert modes: + +| Type | Default-release check | Extra per-access work | +|-----------------------|----------------------------------------------------|-----------------------| +| `List[T][i]` | **None** (`assert_mode="none"`, compiled out) | Negative-index normalization branch | +| `Span[T][i]` | **None** (`assert_mode="none"`, compiled out) | Negative-index normalization branch | +| `StringSlice[byte=i]` | **Bounds check + UTF-8 start-byte assert** (`"safe"`) | Negative-index normalization branch | + +The global `ASSERT` mode defaults to `safe`. Under `-D ASSERT=all` or a +debug build, **all three** types emit bounds checks on every access. So +whether `unsafe_ptr()` actually saves a branch depends on the build mode. + +What `unsafe_ptr()` reliably buys you, even in default release: + +1. Skips the negative-index normalization branch (a conditional on every + access for signed index types). +2. Removes the `StringSlice[byte=i]` bounds check + UTF-8 assert. +3. Enables more aggressive loop optimization — with no possible trap, LLVM + can vectorize, unroll, and hoist more freely. +4. Makes `-D ASSERT=all` debug builds as fast as release on the hot path. ```mojo -# WRONG — bounds check per iteration +# Safe but slow in -D ASSERT=all, and still branches per iter on sign check for i in range(len(states)): - if states[i].is_active: # len check every step + if states[i].is_active: ... -# CORRECT — one bounds proof, then unchecked access +# Pointer-based hot loop — no normalization, no possible trap var states_ptr = states.unsafe_ptr() var n = len(states) for i in range(n): @@ -175,13 +193,15 @@ for i in range(n): ... ``` -Apply the same to `StringSlice.unsafe_ptr()` for byte scans, and to -`List[T].unsafe_ptr()` inside inner DFA/state-machine loops. +For `List` specifically, `list.unsafe_get(idx)` is a safer middle ground — +it asserts in debug (`assert_mode` default) but still avoids negative-index +handling. Use `unsafe_ptr()` only when you also want the raw-pointer loop +form (e.g., to feed `+ offset` arithmetic or SIMD loads). -**Also audit the loop for checks that are always true for the input type**: -`uint8_val >= 0` is always true; `char_code < 256` is always true for -`UInt8`-typed inputs. Such dead conditions still cost instructions — delete -them. +**Separately, audit the loop for checks that are always true for the input +type**: `uint8_val >= 0` is always true; `char_code < 256` is always true +for `UInt8`-typed inputs. Such dead conditions still cost instructions — +delete them. 
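+
+A sketch of the `unsafe_get` middle ground mentioned above, on the same loop
+(a debug-build assert remains, but the negative-index branch is gone):
+
+```mojo
+var n = len(states)
+for i in range(n):
+    if states.unsafe_get(i).is_active:
+        ...
+```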
## Pre-allocate collections; lazy-allocate when zero is common From b1802a61c2534c499b16ec12529cf6c46fc616b2 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:45:16 +0200 Subject: [PATCH 4/5] [Skills] Simplify hash-cache section with a generic example Replace the regex-specific example (CompiledRegex, _get_regex_cache, ImmSlice) with a neutral get_or_build pattern that applies to any expensive value keyed by a string. Add a brief list of use cases (parsed configs, compiled templates, resolved paths, interned symbols, SQL plans) so readers see the shape beyond regex. --- mojo-optimizations/SKILL.md | 39 +++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index d034a1f..3134ed4 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -370,29 +370,30 @@ these. ## Cache by hash, not by string -When a cache key is a string, keying the `Dict` on `String` forces callers to -allocate just to check cache membership. Key on `hash(slice)` instead — cache -hits become zero-allocation: +`Dict[String, V]` forces callers to allocate a `String` just to check cache +membership. Hash the slice instead — cache hits become zero-allocation. +Works for any "expensive value keyed by a string": parsed configs, compiled +templates, resolved file paths, interned symbols, SQL plans. ```mojo -comptime RegexCache = Dict[UInt64, CompiledRegex] - -def compile_regex(pattern: ImmSlice) raises -> CompiledRegex: - var cache_ptr = _get_regex_cache() - var key = hash(pattern) - if key in cache_ptr[]: - var cached = cache_ptr[][key] - if cached.pattern == pattern: # collision guard: byte-compare - return cached - # miss or collision — allocate String once for the stored copy - var compiled = CompiledRegex(String(pattern)) - cache_ptr[][key] = compiled - return compiled +# Generic pattern: expensive T built from a string key, memoized. +comptime Cache = Dict[UInt64, Entry] + +def get_or_build(key: StringSlice, mut cache: Cache) raises -> Entry: + var h = hash(key) + if h in cache: + ref hit = cache[h] + if hit.source == key: # collision guard: byte-compare + return hit.copy() + var built = build(String(key)) # allocate the String once, on miss + cache[h] = built + return built ``` -Always keep the collision guard (`cached.pattern == pattern`) — 64-bit hash -collisions are astronomically rare but not zero. On collision, fall through -to a fresh compile. +The collision guard is mandatory. 64-bit hash collisions are astronomically +rare but not zero, and silently returning the wrong value under a collided +key is the worst possible failure mode. On mismatch, fall through to a fresh +build. 
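+
+A usage sketch, with the cache held behind a `_Global` accessor as in the
+lazy-initialized cache section (`_get_template_cache` and the entry type are
+placeholders for whatever expensive value you memoize):
+
+```mojo
+var cache_ptr = _get_template_cache()       # _Global-backed Dict[UInt64, Entry]
+var entry = get_or_build("user_row.html", cache_ptr[])  # no String built on a hit
+```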
## Comptime specialization for fast-path code From b2b5ee164b94c0e3d6c600ee9a3e920cfd9a8891 Mon Sep 17 00:00:00 2001 From: Manuel Saelices Date: Sun, 12 Apr 2026 00:56:58 +0200 Subject: [PATCH 5/5] [Skills] Diversify examples away from regex-specific domain Replace regex-flavored examples (CompiledRegex, DFAMatcher, Match, LazyDFA, compile_regex, match_first, is_match, .*literal) with a variety of neutral domains: - Benchmark: parse_json - Inlining trampolines: JsonParser -> Tokenizer -> ByteScanner - Struct layout: Token (tokenizer output) - Views: tokenize(source) - ref over var: rows, table - Global caches: SymbolTable, intern() - init_pointee_move: Arena - SIMD scanning: JSON whitespace, CSV delimiters, URL-safe chars - Prefilters: log scanning, filename extraction, JSON value detection - Fast-path dispatch: QueryPlan (pk lookup, sequential scan, general) - Unlikely-branch hoisting: validate(input) --- mojo-optimizations/SKILL.md | 204 ++++++++++++++++++------------------ 1 file changed, 102 insertions(+), 102 deletions(-) diff --git a/mojo-optimizations/SKILL.md b/mojo-optimizations/SKILL.md index 3134ed4..c7765ee 100644 --- a/mojo-optimizations/SKILL.md +++ b/mojo-optimizations/SKILL.md @@ -18,10 +18,9 @@ These same principles apply to any files this skill references. --> Apply these patterns **on top of `mojo-syntax`** — they assume you already know -modern Mojo syntax. These rules are extracted from real optimization work on a -production Mojo codebase where a matcher hot path went from ~4 µs to ~1.3 µs -(3x) by composing them. They compound: isolated use gives small wins, chained -use is where the order-of-magnitude speedups come from. +modern Mojo syntax. These patterns compound: any one of them in isolation +gives a small win, but chaining several on the same hot path is where +order-of-magnitude speedups come from. **Measure before and after every change.** Never guess. @@ -42,16 +41,16 @@ from std.benchmark import ( # One benchmark = an @parameter def that takes `mut b: Bencher`. # The inner @always_inline @parameter closure is what gets measured. @parameter -def bench_match_first(mut b: Bencher) raises: - var compiled = compile_regex(PATTERN) # setup OUTSIDE the timed body +def bench_parse_json(mut b: Bencher) raises: + var source = load_fixture("large.json") # setup OUTSIDE the timed body @always_inline @parameter def call_fn() raises: for _ in range(1000): # batch to amortize per-iter overhead - var r = compiled.match_first(TEXT) + var r = parse_json(source) keep(r) # prevent dead-code elimination b.iter[call_fn]() - keep(Bool(compiled)) # keep the setup alive past the loop + keep(Bool(source)) # keep the setup alive past the loop # Parametric benchmarks use compile-time params; the harness calls them # once per specialization. @@ -68,7 +67,7 @@ def bench_insert[size: Int](mut b: Bencher) raises: def main() raises: var m = Bench(BenchConfig(num_repetitions=5)) - m.bench_function[bench_match_first](BenchId("match_first")) + m.bench_function[bench_parse_json](BenchId("parse_json")) comptime for size in (10, 100, 1_000, 10_000): m.bench_function[bench_insert[size]]( BenchId(String("insert[", size, "]")) @@ -79,7 +78,7 @@ def main() raises: Key rules the harness enforces, so you don't: - **Setup outside `call_fn`.** Anything built inside the inner closure is - rebuilt on every iteration. Compile regex, allocate buffers, load fixtures + rebuilt on every iteration. Load fixtures, allocate buffers, pre-compile before `b.iter[...]`. 
- **`keep(value)` every result.** Without it the optimizer deletes the call you're trying to measure. `keep(Bool(container))` at the end of the outer @@ -135,15 +134,15 @@ without `@always_inline`, LLVM can't fold the call chain and the fast path pays `@always_inline`: ```mojo -# CompiledRegex -> HybridMatcher -> DFAMatcher -> DFAEngine : all @always_inline -struct DFAMatcher: +# JsonParser -> Tokenizer -> ByteScanner : all @always_inline +struct Tokenizer: @always_inline - def is_match(self, text: ImmSlice, start: Int = 0) -> Bool: - return self.engine_ptr[].is_match(text, start) + def peek(self) -> UInt8: + return self.scanner_ptr[].peek() @always_inline - def match_first(self, text: ImmSlice, start: Int = 0) -> Optional[Match]: - return self.engine_ptr[].match_first(text, start) + def next_token(mut self) -> Token: + return self.scanner_ptr[].next_token() ``` - One link missing `@always_inline` breaks the fold — check the whole chain. @@ -247,12 +246,12 @@ Small, trivially-copyable structs pass in registers and avoid memcpy. Aim to keep hot-iterated structs under a cache line (64 bytes), ideally 32 bytes. ```mojo -struct Match(Copyable, Movable, TrivialRegisterPassable): +struct Token(Copyable, Movable, TrivialRegisterPassable): comptime __copy_ctor_is_trivial = True # LLVM elides the copy entirely - var group_id: Int # 8 - var start_idx: Int # 8 - var end_idx: Int # 8 - var text_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 + var kind: Int # 8 + var start: Int # 8 + var length: Int # 8 + var source_ptr: UnsafePointer[Byte, ImmutAnyOrigin] # 8 # Total: 32 bytes — fits in 4 registers, no stack ops on copy. ``` @@ -273,7 +272,7 @@ Define a module alias so every layer agrees on the same type: ```mojo comptime ImmSlice = StringSlice[ImmutAnyOrigin] -def search(pattern: ImmSlice, text: ImmSlice) raises -> Optional[Match]: +def tokenize(source: ImmSlice) raises -> List[Token]: ... # callers pass literals, zero alloc ``` @@ -283,30 +282,30 @@ machine words but carries provenance and can't go out of sync. ## Avoid copies in loops: `ref` over `var` -Iterating a container of large structs (`ASTNode`, `Match`, records) with +Iterating a container of large structs (records, tree nodes, tokens) with `var x = container[i]` **copies** on every step. Use `ref` to alias: ```mojo # WRONG — full struct copy per iteration -for i in range(len(nodes)): - var node = nodes[i] # copy - process(node) +for i in range(len(rows)): + var row = rows[i] # copy + process(row) # CORRECT — zero-copy reference -for i in range(len(nodes)): - ref node = nodes[i] # alias - process(node) +for i in range(len(rows)): + ref row = rows[i] # alias + process(row) # Also for local bindings to nested fields: -ref matchers = matchers_ptr[] # instead of `var matchers = matchers_ptr[]` +ref table = table_ptr[] # instead of `var table = table_ptr[]` ``` `^` transfer: when you *do* want ownership but won't use the source again, use -`value^` to move instead of copy: `engine_ptr.init_pointee_move(engine^)`. +`value^` to move instead of copy: `ptr.init_pointee_move(value^)`. ## Lazy-initialized global caches -For precomputed lookup tables, matcher dictionaries, or any expensive +For precomputed lookup tables, interned symbol caches, or any expensive build-once value, use `_Global` from `std.ffi`. 
It guarantees single-shot lazy initialization behind a pointer you can mutate through: @@ -314,28 +313,28 @@ lazy initialization behind a pointer you can mutate through: from std.ffi import _Global from std.os import abort -comptime MatcherCache = Dict[Int, SomeMatcher] -comptime _MATCHER_CACHE = _Global["MatcherCache", _init_matcher_cache] +comptime SymbolTable = Dict[Int, InternedSymbol] +comptime _SYMBOL_TABLE = _Global["SymbolTable", _init_symbol_table] -def _init_matcher_cache() -> MatcherCache: - return MatcherCache() +def _init_symbol_table() -> SymbolTable: + return SymbolTable() -def _get_matcher_cache() -> UnsafePointer[MatcherCache, MutAnyOrigin]: +def _get_symbol_table() -> UnsafePointer[SymbolTable, MutAnyOrigin]: try: - return _MATCHER_CACHE.get_or_create_ptr() + return _SYMBOL_TABLE.get_or_create_ptr() except e: abort[prefix="ERROR:"](String(e)) @always_inline -def get_matcher(key: Int) raises -> SomeMatcher: - var cache_ptr = _get_matcher_cache() - ref cache = cache_ptr[] - if key not in cache: - cache[key] = build_matcher(key) - return cache[key] +def intern(id: Int) raises -> InternedSymbol: + var table_ptr = _get_symbol_table() + ref table = table_ptr[] + if id not in table: + table[id] = InternedSymbol(id) + return table[id] ``` -- The string key in `_Global["MatcherCache", ...]` is the **global identity** — +- The string key in `_Global["SymbolTable", ...]` is the **global identity** — must be unique per cache across the whole program. - Return `UnsafePointer` for interior mutability even when the caller is `read`-self: this is how you cache through a trait method that demands @@ -349,12 +348,12 @@ destructor logic on garbage — a classic flaky double-free at process exit: ```mojo # WRONG — move-assign into uninitialized memory, undefined behavior -self._lazy_dfa_ptr = alloc[LazyDFA](1) -self._lazy_dfa_ptr[] = LazyDFA(vm^) # UB +self._arena_ptr = alloc[Arena](1) +self._arena_ptr[] = Arena(capacity^) # UB # CORRECT — construct in place -self._lazy_dfa_ptr = alloc[LazyDFA](1) -self._lazy_dfa_ptr.init_pointee_move(LazyDFA(vm^)) +self._arena_ptr = alloc[Arena](1) +self._arena_ptr.init_pointee_move(Arena(capacity^)) ``` Same rule applies in `__copyinit__` when copying heap-owned fields: @@ -427,22 +426,23 @@ comptime DIGIT_LUT = _build_digit_lut() # runs at compile time ## SIMD byte scanning: nibble-based lookup -For byte-level character class scans (matching `[a-z]`, `\d`, `\w`, arbitrary -byte sets), the naive 256-entry lookup table is too big for `_dynamic_shuffle`. -Decompose each byte into two **nibbles** (4-bit halves) and use two 16-entry -tables — this fits exactly in a single `pshufb`/`vpshufb` instruction: +For byte-level membership tests — JSON whitespace (`\t \n \r ' '`), CSV +delimiters, URL-safe characters, digit ranges, any fixed byte set — the naive +256-entry lookup table is too big for `_dynamic_shuffle`. Decompose each byte +into two **nibbles** (4-bit halves) and use two 16-entry tables — this fits +exactly in a single `pshufb`/`vpshufb` instruction: ```mojo # Two 16-entry tables, precomputed at compile time var lo = low_nibble_lut._dynamic_shuffle(chunk & 0x0F) var hi = high_nibble_lut._dynamic_shuffle((chunk >> 4) & 0x0F) var matches: SIMD[DType.bool, 16] = (lo & hi) != 0 -# matches[i] is True iff chunk[i] is in the class +# matches[i] is True iff chunk[i] is in the byte set ``` Nibble lookup is typically **20-100x faster** than per-byte scalar dispatch on -hot paths. 
For *contiguous* ranges like `[a-z]`, `[0-9]`, an even cheaper path -exists: +hot paths. For *contiguous* ranges (e.g., `a`-`z`, `0`-`9`), an even cheaper +path exists: ```mojo # Range check via unsigned subtract — no lookup table needed @@ -450,43 +450,44 @@ var offset = chunk - SIMD[DType.uint8, 32](range_start) var matches = offset <= SIMD[DType.uint8, 32](range_end - range_start) ``` -Record `range_start`/`range_end` on the matcher struct at construction time so -the hot path can pick this fast path. Non-contiguous classes fall back to the +Record `range_start`/`range_end` on the scanner struct at construction time so +the hot path can pick this fast path. Non-contiguous sets fall back to the nibble tables. -## Prefilters: cheap scan before the expensive match +## Prefilters: cheap scan before the expensive path -Full engine invocation per byte is the wrong granularity. Extract a *cheap* -signal from the pattern — a required literal substring, a first-byte set, a -fixed prefix — and use the fastest available scan primitive to locate -candidate positions first. The expensive matcher only runs where the prefilter -says "maybe": +Running the full processing pipeline per byte is the wrong granularity. +Extract a *cheap* signal — a required literal substring, a lead-byte set, +a fixed prefix — and use the fastest scan primitive to locate candidate +positions first. The expensive logic only runs where the prefilter says +"maybe". This applies to parsers, validators, search engines, log scanners, +packet decoders, etc. -| Prefilter | Scan primitive | Use when | -|--------------------------|----------------------------------|--------------------------------| -| Required literal | `StringSlice.find` | Pattern contains a fixed substr| -| Last-literal for `.*LIT` | `String.rfind` (single pass) | `.*literal` prefix pattern | -| First-byte set | SIMD equality sweep + bitmask | Small (<8) set of start bytes | -| Byte class (`\d`, `\w`) | Nibble SIMD scan | Start is a character class | +| Prefilter | Scan primitive | Example | +|--------------------|-------------------------------|----------------------------------------| +| Required literal | `StringSlice.find` | Scan for `"error"` before parsing line | +| Last occurrence | `String.rfind` (single pass) | Find last `/` to extract filename | +| Lead-byte set | SIMD equality sweep + bitmask | Scan for `{`, `[`, `"` to find JSON values | +| Byte range/class | Nibble SIMD scan | Skip to first digit before number parse| **Critical anti-pattern**: using repeated forward `find` to locate the *last* -occurrence is O(N × occurrences). Use `rfind` for a single reverse O(N) pass: +occurrence is O(N x occurrences). Use `rfind` for a single reverse O(N) pass: ```mojo # WRONG — O(N * k) var pos = 0 while True: - var next = text.find(literal, pos) + var next = text.find(delimiter, pos) if next == -1: break pos = next + 1 var last = pos - 1 # CORRECT — O(N) -var last = text.rfind(literal) +var last = text.rfind(delimiter) ``` -Real PR impact: this one change took `.*@example\.com` on a large text from -**39x slower** to **10x faster** than Python. +In practice, switching from repeated-`find` to `rfind` for a "find last +suffix" operation has delivered **8-40x speedups** on multi-KB inputs. ## Numeric accumulation over string concat @@ -513,34 +514,33 @@ Same pattern applies to checksum accumulation, hash building, and any ## Fast-path dispatch by input shape -At construction time (not match time), classify the input and record which -optimized path can run. 
Cheap patterns (single literal, fixed-length sequence, -anchored prefix) should bypass the general engine entirely: +At construction time (not execution time), classify the workload and record +which optimized path can run. Simple cases should bypass the general engine +entirely — the analysis cost is paid once and amortized across every call: ```mojo -struct CompiledRegex: - var _simple_literal: Bool # pattern is a fixed string - var _literal: String # extracted literal if any - var _has_dotstar_prefix: Bool # .*LITERAL - var _engine: Engine # general fallback +struct QueryPlan: + var _is_pk_lookup: Bool # equality on primary key + var _is_simple_scan: Bool # single-table, no joins + var _executor: GeneralExecutor # general fallback - def __init__(out self, var ast: ASTNode, pattern: String): - self._simple_literal = _is_simple_literal(ast) - ... # analyze once at compile time - self._engine = build_engine(ast) + def __init__(out self, query: Query): + self._is_pk_lookup = _has_pk_equality(query) + self._is_simple_scan = _is_single_table(query) + self._executor = plan_general(query) @always_inline - def match_first(self, text: ImmSlice) -> Optional[Match]: - if self._simple_literal: - return _literal_search(text, self._literal) # 100x faster path - if self._has_dotstar_prefix: - return _dotstar_literal_path(text, self._literal) - return self._engine.match_first(text) + def execute(self, db: Database) -> ResultSet: + if self._is_pk_lookup: + return _pk_index_lookup(db, self._executor.key) # O(1) path + if self._is_simple_scan: + return _sequential_scan(db, self._executor.table) + return self._executor.run(db) ``` -Per-match analysis overhead is amortized across every call on the same -compiled object. Keep the analysis in `__init__`; keep the hot path branchy -but cheap. +Keep the classification in `__init__`; keep the hot path branchy but cheap. +The same pattern applies to parsers (literal vs. complex grammar), formatters +(fixed-width vs. general), and serializers (flat struct vs. nested). ## Unlikely-branch hoisting @@ -549,12 +549,12 @@ so the fallback isn't inlined into the hot prologue: ```mojo @always_inline -def is_match(self, text: ImmSlice) -> Bool: - if len(text) == 0: # cold edge case - return self._matches_empty - if self._simple_literal: # hot path, short-circuits - return _literal_eq(text, self._literal) - return self._engine.is_match(text) # general path, not inlined hot +def validate(self, input: ImmSlice) -> Bool: + if len(input) == 0: # cold edge case + return self._empty_valid + if self._exact_mode: # hot path, short-circuits + return _byte_eq(input, self._expected) + return self._general.check(input) # general path, not inlined hot ``` Pair with `@no_inline` on the general-path helper if it's large, so the inlined