6x faster inference via batched dgemm + parallel hull #1
Open
trulite wants to merge 1 commit into Percepta-Core:main
Conversation
…ries

After sequential generation completes, re-runs the forward pass with two optimizations:

1. Batch all token projections into dgemm (matrix-matrix) calls instead of per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically uses all CPU cores + AMX.
2. Parallelize hull insert+query across attention heads with OpenMP. Each head's hull is independent, so no synchronization is needed. Loop nesting is flipped (outer = heads, inner = positions) so there is one OpenMP fork per layer instead of one per position.

Results on M4 Pro (Sudoku, 980K tokens):

Sequential: 37.7s (26K tok/s)
Batched: 6.0s (165K tok/s), 6.3x faster

The build system detects libomp on macOS (Homebrew) and links OpenBLAS + libgomp on Linux. It falls back gracefully when neither is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
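For illustration, a minimal C++ sketch of the two optimizations follows. All names here (`project_all`, `attend_layer`, `Hull`, the key/query accessors in comments) are hypothetical stand-ins, not the engine's actual API; it assumes row-major buffers and `cblas_dgemm` from Accelerate/OpenBLAS.

```cpp
// Sketch under assumptions: row-major buffers, cblas_dgemm from
// Accelerate/OpenBLAS, and a per-head Hull type with independent state.
#include <cblas.h>
#include <vector>

// 1) One dgemm over all T positions instead of T separate dgemv calls:
//    Y (T x d_out) = X (T x d_in) * W^T, with W stored d_out x d_in.
//    The BLAS library parallelizes this internally.
void project_all(const double* X, const double* W, double* Y,
                 int T, int d_in, int d_out) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in,
                1.0, X, d_in, W, d_in,
                0.0, Y, d_out);
}

// 2) The parallel loop runs over heads (one OpenMP fork per layer),
//    while the order-dependent scan over positions stays sequential
//    inside each thread. No locks: each thread owns its hull.
struct Hull {
    void insert(double kx, double ky, int pos) { /* add key to hull */ }
    int  query(double qx, double qy) const { return 0; /* argmax pos */ }
};

void attend_layer(std::vector<Hull>& hulls, int n_heads, int T) {
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < n_heads; ++h) {
        for (int t = 0; t < T; ++t) {
            // hulls[h].insert(key_x(h, t), key_y(h, t), t);
            // best[h][t] = hulls[h].query(query_x(h, t), query_y(h, t));
        }
    }
}
```

Flipping the nesting matters because a fork inside the position loop would pay OpenMP scheduling overhead T times per layer; per-head, it is paid once.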
Author
Hi, do you guys have any policy for PRs?
trulite added a commit to trulite/transformer-vm that referenced this pull request on Mar 27, 2026
The weight builder now classifies each attention head:

0 = lookup (needs hull for key-value search)
1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++ engine. In the batched verification, passthrough heads copy V[t] and gather heads compute a single array index: both O(1) per position, with no hull insert/query needed.

Mathematical proof (gather): for keys on the parabola (2s, -s²), the score -(s-q)² + q² is uniquely maximized at s = q. The quadratic penalty (≥1 for integer keys) dominates the tiebreak perturbation (<0.62). The query value q is always an integer (a sum of integer cumsums), and q ≤ t (causality by construction).

Results on M4 Pro (Sudoku, 980K tokens):

PR Percepta-Core#1 (hull only): 5.96s (6.3× over sequential)
+ passthrough bypass: 5.50s (6.6×)
+ gather bypass: 3.96s (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely. The remaining 9 active lookup heads have small key sets (K ≤ 200) and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
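As a rough illustration of how such a dispatch could look in the C++ engine: only the head_type values 0/1/2 and the V[t] / round(qx/qy) rules come from the commit message above; `Hull`, `run_head`, and the buffer layout are hypothetical.

```cpp
#include <cmath>
#include <cstring>

enum HeadType { LOOKUP = 0, PASSTHROUGH = 1, GATHER = 2 };

struct Hull {                           // per-head convex hull (stub)
    void insert(const double* key, int pos) { /* ... */ }
    int  query(const double* q) const { return 0; /* argmax position */ }
};

// One head, all positions. Passthrough and gather heads resolve in O(1)
// per position; only LOOKUP heads pay for hull insert+query.
void run_head(HeadType type, Hull& hull,
              const double* V, double* out, int T, int d,
              const double* qx, const double* qy,
              const double* keys, const double* queries) {
    for (int t = 0; t < T; ++t) {
        int src;
        switch (type) {
        case PASSTHROUGH:
            src = t;                    // score(t,t) is proven maximal
            break;
        case GATHER:
            // Keys on the parabola (2s, -s^2) make the score
            // -(s-q)^2 + q^2 peak at s = q, so the lookup collapses
            // to a single array index.
            src = (int)llround(qx[t] / qy[t]);
            break;
        default:                        // LOOKUP
            hull.insert(keys + 2 * t, t);
            src = hull.query(queries + 2 * t);
            break;
        }
        std::memcpy(out + (size_t)t * d, V + (size_t)src * d,
                    d * sizeof(double));
    }
}
```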
oaustegard referenced this pull request in oaustegard/transformer-vm on Apr 25, 2026
The four per-layer projections (qkv, out, fi, fo) in transformer.cpp go
through dense matvec, but the analytically-constructed weights in weights.py
are sparse by construction. The existing SparseMatrix infrastructure is
already used for the output head — wiring it into the per-layer projections
is the natural extension.
Changes:
* transformer.cpp:
- Add Linux OpenBLAS branch via -DUSE_OPENBLAS so non-Mac builds get a
fair dense baseline (the original #else fell through to a naive nested
loop, which makes any sparse comparison trivially favorable).
- Add SparseLayer + sparse_matvec, gated on -DUSE_SPARSE_PROJ. When set,
the four per-layer projections build CSR forms at load time and the
hot loop calls sparse_matvec instead of matvec (see the CSR sketch
after this list).
- Add --max-gen=N runtime flag so benchmarks can bound the run regardless
of regen mode.
- Build matrix: default (naive), -DUSE_OPENBLAS (fair dense), and
-DUSE_SPARSE_PROJ (the patch). All three preserve the original behavior.
* scripts/make_synthetic_model.py:
- torch-free generator that emits a model.bin in transformer.cpp::load()
format with a controllable sparsity knob. Lets us measure dense-vs-sparse
crossover without standing up the full Python build pipeline.
* scripts/bench.py:
- Sweep harness: builds one synthetic model per sparsity level, runs all
three binaries, verifies they emit byte-identical token streams, reports
tok/s and projection-time fraction. Forces single-thread BLAS for
fairness at small per-token problem sizes.
* results/sweep.tsv:
- Numbers from the sandbox run at D=128 L=8 H=4 F=256 V=128. Crossover at
~75% sparsity; sparse wins 5x at 90%, 30x at 99%. See issue #1 for the
handoff describing what's left to confirm on a real machine.
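For readers unfamiliar with CSR, a minimal sparse_matvec of the kind described could look like the sketch below; the SparseLayer field names are illustrative, not the actual transformer.cpp layout.

```cpp
#include <vector>

// CSR (compressed sparse row) storage: row_ptr[r]..row_ptr[r+1] indexes
// the nonzeros of row r in col_idx/vals.
struct SparseLayer {
    std::vector<int>    row_ptr;   // size rows + 1
    std::vector<int>    col_idx;   // size nnz
    std::vector<double> vals;      // size nnz
    int rows = 0, cols = 0;
};

// y = A * x, touching only the stored nonzeros.
void sparse_matvec(const SparseLayer& A, const double* x, double* y) {
    for (int r = 0; r < A.rows; ++r) {
        double acc = 0.0;
        for (int i = A.row_ptr[r]; i < A.row_ptr[r + 1]; ++i)
            acc += A.vals[i] * x[A.col_idx[i]];
        y[r] = acc;
    }
}
```

Work scales with nnz rather than rows × cols, which is consistent with the reported dense-vs-sparse crossover around ~75% sparsity: below that, CSR's indirection overhead outweighs the skipped multiplies.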