
6x faster inference via batched dgemm + parallel hull #1

Open

trulite wants to merge 1 commit into Percepta-Core:main from trulite:batched-verification

Conversation

@trulite trulite commented Mar 26, 2026

Summary

  • Batch all per-token matrix-vector projections (dgemv) into single matrix-matrix calls (dgemm), letting Accelerate/OpenBLAS use all CPU cores + AMX automatically (see the sketch after this list)
  • Parallelize hull insert+query across attention heads with OpenMP — each head's hull is independent, so the loop is flipped to one fork per layer instead of one per position
  • Auto-detect libomp on macOS and link OpenBLAS + libgomp on Linux, falling back gracefully when neither is available
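
A minimal sketch of the dgemv-to-dgemm batching from the first bullet, assuming row-major activations with one token per row; the function and variable names are illustrative, not the PR's actual code:

```cpp
// Batch T per-token projections y[t] = W * x[t] into a single dgemm Y = X * W^T.
// X: T x d_in (token activations, one row per token), W: d_out x d_in, Y: T x d_out.
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   // CBLAS via Accelerate on macOS
#else
#include <cblas.h>                   // OpenBLAS on Linux
#endif

void project_batched(const double* X, const double* W, double* Y,
                     int T, int d_in, int d_out) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in,
                1.0, X, d_in,    // A = X, leading dimension d_in
                     W, d_in,    // B = W (transposed), leading dimension d_in
                0.0, Y, d_out);  // C = Y, leading dimension d_out
}
```

Handing the BLAS library one large T-row problem lets it tile the work across cores (and AMX on Apple silicon), which T separate dgemv calls cannot.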

Benchmark (M4 Pro, Sudoku 980K tokens)

Sequential:  37.7s  (26K tok/s)
Batched:      6.0s  (165K tok/s)  — 6.3x faster

Breakdown:

|             | Sequential   | Batched     |
|-------------|--------------|-------------|
| Projections | 14.3s (37%)  | 1.2s (20%)  |
| Hull        | 21.4s (56%)  | 4.8s (80%)  |

Test plan

  • Spot-check verification passes on hello, collatz, sudoku
  • Correct Sudoku solution output matches sequential
  • Test on Linux with OpenBLAS + libgomp
  • Test fallback build without OpenMP

🤖 Generated with Claude Code

…ries

After sequential generation completes, re-runs the forward pass with two
optimizations:

1. Batch all token projections into dgemm (matrix-matrix) instead of
   per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically
   uses all CPU cores + AMX.

2. Parallelize hull insert+query across attention heads with OpenMP.
   Each head's hull is independent — no synchronization needed. Loop
   nesting is flipped (outer=heads, inner=positions) so there's one
   OpenMP fork per layer instead of one per position.
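
A minimal sketch of the flipped loop nesting, assuming a per-head Hull type with insert()/query(); the names are placeholders rather than the engine's actual structures:

```cpp
#include <vector>

struct Hull {                 // stand-in for the incremental hull kept per head
    void insert(double key);
    double query(double q);   // best-scoring stored key for this query
};

void run_layer(std::vector<Hull>& hulls,
               const std::vector<std::vector<double>>& keys,
               const std::vector<std::vector<double>>& queries,
               std::vector<std::vector<double>>& best) {
    const int num_heads = (int)hulls.size();
    // Outer loop over heads: one OpenMP fork per layer instead of one per position.
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < num_heads; ++h) {
        Hull& hull = hulls[h];                        // no head shares a hull, so no locks
        const int num_positions = (int)keys[h].size();
        for (int t = 0; t < num_positions; ++t) {     // positions stay sequential per head
            hull.insert(keys[h][t]);
            best[h][t] = hull.query(queries[h][t]);
        }
    }
}
```

Because no two iterations of the outer loop touch the same hull, the parallel region needs no synchronization and the fork/join overhead is paid once per layer.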

Results on M4 Pro (Sudoku, 980K tokens):
  Sequential:  37.7s  (26K tok/s)
  Batched:      6.0s  (165K tok/s)  — 6.3x faster

The build system detects libomp on macOS (homebrew) and links OpenBLAS +
libgomp on Linux. Falls back gracefully when neither is available.
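
One way the graceful fallback can work at the source level (an assumption about the mechanism, not a quote from the patch): OpenMP pragmas are simply ignored when the compiler is invoked without OpenMP support, so only direct calls into the runtime need guarding.

```cpp
// Guard direct omp_* calls; `#pragma omp ...` lines need no guard because a
// compiler invoked without -fopenmp skips unknown pragmas.
#ifdef _OPENMP
#include <omp.h>
#else
static inline int omp_get_max_threads() { return 1; }   // single-threaded fallback
static inline int omp_get_thread_num()  { return 0; }
#endif
```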

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ryvn-technologies

Ryvn Preview

Creating preview prerelease-Percepta-Core-transformer-vm for this pull request.


This comment will be automatically updated with preview details.

trulite commented Mar 27, 2026

Hi - do you guys have any policy for PRs?

trulite added a commit to trulite/transformer-vm that referenced this pull request Mar 27, 2026
The weight builder now classifies each attention head:
  0 = lookup (needs hull for key-value search)
  1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
  2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++
engine. In the batched verification, passthrough heads copy V[t]
and gather heads compute a single array index — both O(1) per
position, no hull insert/query needed.
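
A hypothetical sketch of that dispatch; the 0/1/2 codes mirror the classification above, while hull_lookup() and the flat value layout are placeholders:

```cpp
#include <cmath>
#include <vector>

enum HeadType { LOOKUP = 0, PASSTHROUGH = 1, GATHER = 2 };

double hull_lookup(int head, int t);   // the existing insert+query path (not shown)

// Output of one head at position t, given its value sequence V and query (qx, qy).
double head_output(HeadType type, int head, int t,
                   const std::vector<double>& V, double qx, double qy) {
    switch (type) {
    case PASSTHROUGH:                            // output = V[t], no hull needed
        return V[t];
    case GATHER:                                 // output = V[round(qx / qy)]
        return V[(size_t)std::llround(qx / qy)];
    case LOOKUP:
    default:
        return hull_lookup(head, t);             // only the active lookup heads hit this
    }
}
```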

Mathematical proof (gather): for keys on the parabola (2s, -s²),
the score -(s-q)² + q² is uniquely maximized at s=q. The quadratic
penalty (≥1 for integer keys) dominates the tiebreak perturbation
(<0.62). The query value q is always an integer (sum of integer
cumsums) and q ≤ t (causality by construction).
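
For readers checking the argument, the score formula follows from a single dot product, under the assumption that the query vector built for value q is (q, 1):

$$\mathrm{score}(s) = (2s,\,-s^2)\cdot(q,\,1) = 2sq - s^2 = q^2 - (s-q)^2$$

so every integer s != q pays a penalty (s-q)^2 >= 1, larger than the 0.62 tiebreak bound, and the maximum is attained only at s = q <= t.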

Results on M4 Pro (Sudoku, 980K tokens):
  PR Percepta-Core#1 (hull only):     5.96s  (6.3× over sequential)
  + passthrough bypass:  5.50s  (6.6×)
  + gather bypass:       3.96s  (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely.
The remaining 9 active lookup heads have small key sets (K ≤ 200)
and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
oaustegard referenced this pull request in oaustegard/transformer-vm Apr 25, 2026
The four per-layer projections (qkv, out, fi, fo) in transformer.cpp go
through dense matvec, but the analytically-constructed weights in weights.py
are sparse by construction. The existing SparseMatrix infrastructure is
already used for the output head — wiring it into the per-layer projections
is the natural extension.
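
As a rough illustration of the CSR path being wired in (field and function names are guesses, not the actual SparseLayer in the patch):

```cpp
#include <vector>

struct SparseLayer {                 // CSR storage for one projection matrix
    std::vector<int>    row_ptr;     // size rows + 1
    std::vector<int>    col_idx;     // column index of each nonzero
    std::vector<double> val;         // nonzero values
    int rows = 0;
};

// y = A * x, touching only the stored nonzeros.
void sparse_matvec(const SparseLayer& A, const double* x, double* y) {
    for (int r = 0; r < A.rows; ++r) {
        double acc = 0.0;
        for (int i = A.row_ptr[r]; i < A.row_ptr[r + 1]; ++i)
            acc += A.val[i] * x[A.col_idx[i]];
        y[r] = acc;
    }
}
```

The work scales with the nonzero count rather than rows * cols, which is what the sparsity sweep below measures.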

Changes:
* transformer.cpp:
  - Add Linux OpenBLAS branch via -DUSE_OPENBLAS so non-Mac builds get a
    fair dense baseline (the original #else fell through to a naive nested
    loop, which makes any sparse comparison trivially favorable).
  - Add SparseLayer + sparse_matvec, gated on -DUSE_SPARSE_PROJ. When set,
    the four per-layer projections build CSR forms at load time and the
    hot loop calls sparse_matvec instead of matvec.
  - Add --max-gen=N runtime flag so benchmarks can bound the run regardless
    of regen mode.
  - Build matrix: default (naive), -DUSE_OPENBLAS (fair dense), and
    -DUSE_SPARSE_PROJ (the patch). All three preserve the original behavior.

* scripts/make_synthetic_model.py:
  - torch-free generator that emits a model.bin in transformer.cpp::load()
    format with a controllable sparsity knob. Lets us measure dense-vs-sparse
    crossover without standing up the full Python build pipeline.

* scripts/bench.py:
  - Sweep harness: builds one synthetic model per sparsity level, runs all
    three binaries, verifies they emit byte-identical token streams, reports
    tok/s and projection-time fraction. Forces single-thread BLAS for
    fairness at small per-token problem sizes.

* results/sweep.tsv:
  - Numbers from the sandbox run at D=128 L=8 H=4 F=256 V=128. Crossover at
    ~75% sparsity; sparse wins 5x at 90%, 30x at 99%. See issue #1 for the
    handoff describing what's left to confirm on a real machine.