
6x faster inference via batched dgemm + parallel hull #1

Open

trulite wants to merge 1 commit into Percepta-Core:main from trulite:batched-verification

Conversation

@trulite trulite commented Mar 26, 2026

Summary

  • Batch all per-token matrix-vector projections (dgemv) into single matrix-matrix calls (dgemm), letting Accelerate/OpenBLAS use all CPU cores + AMX automatically (see the sketch after this list)
  • Parallelize hull insert+query across attention heads with OpenMP — each head's hull is independent, so the loop is flipped to one fork per layer instead of one per position
  • Auto-detect libomp on macOS and link OpenBLAS + libgomp on Linux, falling back gracefully when neither is available
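
A minimal sketch of the dgemv-to-dgemm batching from the first bullet, assuming row-major activations with one token per row; the function and variable names are illustrative, not the PR's actual code:

```cpp
// Batch T per-token projections y[t] = W * x[t] into a single dgemm Y = X * W^T.
// X: T x d_in (token activations, one row per token), W: d_out x d_in, Y: T x d_out.
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   // CBLAS via Accelerate on macOS
#else
#include <cblas.h>                   // OpenBLAS on Linux
#endif

void project_batched(const double* X, const double* W, double* Y,
                     int T, int d_in, int d_out) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in,
                1.0, X, d_in,    // A = X, leading dimension d_in
                     W, d_in,    // B = W (transposed), leading dimension d_in
                0.0, Y, d_out);  // C = Y, leading dimension d_out
}
```

Handing the BLAS library one large T-row problem lets it tile the work across cores (and AMX on Apple silicon), which T separate dgemv calls cannot.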

Benchmark (M4 Pro, Sudoku 980K tokens)

Sequential:  37.7s  (26K tok/s)
Batched:      6.0s  (165K tok/s)  — 6.3x faster

Breakdown:

|             | Sequential   | Batched     |
|-------------|--------------|-------------|
| Projections | 14.3s (37%)  | 1.2s (20%)  |
| Hull        | 21.4s (56%)  | 4.8s (80%)  |

Test plan

  • Spot-check verification passes on hello, collatz, sudoku
  • Correct Sudoku solution output matches sequential
  • Test on Linux with OpenBLAS + libgomp
  • Test fallback build without OpenMP

🤖 Generated with Claude Code

…ries

After sequential generation completes, re-runs the forward pass with two
optimizations:

1. Batch all token projections into dgemm (matrix-matrix) instead of
   per-token dgemv (matrix-vector). Accelerate/OpenBLAS automatically
   uses all CPU cores + AMX.

2. Parallelize hull insert+query across attention heads with OpenMP.
   Each head's hull is independent — no synchronization needed. Loop
   nesting is flipped (outer=heads, inner=positions) so there's one
   OpenMP fork per layer instead of one per position.
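
A minimal sketch of the flipped loop nesting, assuming a per-head Hull type with insert()/query(); the names are placeholders rather than the engine's actual structures:

```cpp
#include <vector>

struct Hull {                 // stand-in for the incremental hull kept per head
    void insert(double key);
    double query(double q);   // best-scoring stored key for this query
};

void run_layer(std::vector<Hull>& hulls,
               const std::vector<std::vector<double>>& keys,
               const std::vector<std::vector<double>>& queries,
               std::vector<std::vector<double>>& best) {
    const int num_heads = (int)hulls.size();
    // Outer loop over heads: one OpenMP fork per layer instead of one per position.
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < num_heads; ++h) {
        Hull& hull = hulls[h];                        // no head shares a hull, so no locks
        const int num_positions = (int)keys[h].size();
        for (int t = 0; t < num_positions; ++t) {     // positions stay sequential per head
            hull.insert(keys[h][t]);
            best[h][t] = hull.query(queries[h][t]);
        }
    }
}
```

Because no two iterations of the outer loop touch the same hull, the parallel region needs no synchronization and the fork/join overhead is paid once per layer.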

Results on M4 Pro (Sudoku, 980K tokens):
  Sequential:  37.7s  (26K tok/s)
  Batched:      6.0s  (165K tok/s)  — 6.3x faster

The build system detects libomp on macOS (homebrew) and links OpenBLAS +
libgomp on Linux. Falls back gracefully when neither is available.
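
One way the graceful fallback can work at the source level (an assumption about the mechanism, not a quote from the patch): OpenMP pragmas are simply ignored when the compiler is invoked without OpenMP support, so only direct calls into the runtime need guarding.

```cpp
// Guard direct omp_* calls; `#pragma omp ...` lines need no guard because a
// compiler invoked without -fopenmp skips unknown pragmas.
#ifdef _OPENMP
#include <omp.h>
#else
static inline int omp_get_max_threads() { return 1; }   // single-threaded fallback
static inline int omp_get_thread_num()  { return 0; }
#endif
```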

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ryvn-technologies

Ryvn Preview

Creating preview prerelease-Percepta-Core-transformer-vm for this pull request.


This comment will be automatically updated with preview details.

trulite commented Mar 27, 2026

Hi - do you guys have any policy for PRs?

trulite added a commit to trulite/transformer-vm that referenced this pull request Mar 27, 2026
The weight builder now classifies each attention head:
  0 = lookup (needs hull for key-value search)
  1 = passthrough (output = V[t], proven: score(t,t) > score(t,s))
  2 = gather (position-keyed lookup, output = V[round(qx/qy)])

The head_type metadata is saved to model.bin and read by the C++
engine. In the batched verification, passthrough heads copy V[t]
and gather heads compute a single array index — both O(1) per
position, no hull insert/query needed.
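
A hypothetical sketch of that dispatch; the 0/1/2 codes mirror the classification above, while hull_lookup() and the flat value layout are placeholders:

```cpp
#include <cmath>
#include <vector>

enum HeadType { LOOKUP = 0, PASSTHROUGH = 1, GATHER = 2 };

double hull_lookup(int head, int t);   // the existing insert+query path (not shown)

// Output of one head at position t, given its value sequence V and query (qx, qy).
double head_output(HeadType type, int head, int t,
                   const std::vector<double>& V, double qx, double qy) {
    switch (type) {
    case PASSTHROUGH:                            // output = V[t], no hull needed
        return V[t];
    case GATHER:                                 // output = V[round(qx / qy)]
        return V[(size_t)std::llround(qx / qy)];
    case LOOKUP:
    default:
        return hull_lookup(head, t);             // only the active lookup heads hit this
    }
}
```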

Mathematical proof (gather): for keys on the parabola (2s, -s²),
the score -(s-q)² + q² is uniquely maximized at s=q. The quadratic
penalty (≥1 for integer keys) dominates the tiebreak perturbation
(<0.62). The query value q is always an integer (sum of integer
cumsums) and q ≤ t (causality by construction).
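
For readers checking the argument, the score formula follows from a single dot product, under the assumption that the query vector built for value q is (q, 1):

$$\mathrm{score}(s) = (2s,\,-s^2)\cdot(q,\,1) = 2sq - s^2 = q^2 - (s-q)^2$$

so every integer s != q pays a penalty (s-q)^2 >= 1, larger than the 0.62 tiebreak bound, and the maximum is attained only at s = q <= t.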

Results on M4 Pro (Sudoku, 980K tokens):
  PR Percepta-Core#1 (hull only):     5.96s  (6.3× over sequential)
  + passthrough bypass:  5.50s  (6.6×)
  + gather bypass:       3.96s  (9.6×)

17 passthrough + 15 gather = 32 heads skip the hull entirely.
The remaining 9 active lookup heads have small key sets (K ≤ 200)
and run the hull in microseconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
oaustegard referenced this pull request in oaustegard/transformer-vm Apr 25, 2026
The four per-layer projections (qkv, out, fi, fo) in transformer.cpp go
through dense matvec, but the analytically-constructed weights in weights.py
are sparse by construction. The existing SparseMatrix infrastructure is
already used for the output head — wiring it into the per-layer projections
is the natural extension.
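
As a rough illustration of the CSR path being wired in (field and function names are guesses, not the actual SparseLayer in the patch):

```cpp
#include <vector>

struct SparseLayer {                 // CSR storage for one projection matrix
    std::vector<int>    row_ptr;     // size rows + 1
    std::vector<int>    col_idx;     // column index of each nonzero
    std::vector<double> val;         // nonzero values
    int rows = 0;
};

// y = A * x, touching only the stored nonzeros.
void sparse_matvec(const SparseLayer& A, const double* x, double* y) {
    for (int r = 0; r < A.rows; ++r) {
        double acc = 0.0;
        for (int i = A.row_ptr[r]; i < A.row_ptr[r + 1]; ++i)
            acc += A.val[i] * x[A.col_idx[i]];
        y[r] = acc;
    }
}
```

The work scales with the nonzero count rather than rows * cols, which is what the sparsity sweep below measures.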

Changes:
* transformer.cpp:
  - Add Linux OpenBLAS branch via -DUSE_OPENBLAS so non-Mac builds get a
    fair dense baseline (the original #else fell through to a naive nested
    loop, which makes any sparse comparison trivially favorable).
  - Add SparseLayer + sparse_matvec, gated on -DUSE_SPARSE_PROJ. When set,
    the four per-layer projections build CSR forms at load time and the
    hot loop calls sparse_matvec instead of matvec.
  - Add --max-gen=N runtime flag so benchmarks can bound the run regardless
    of regen mode.
  - Build matrix: default (naive), -DUSE_OPENBLAS (fair dense), and
    -DUSE_SPARSE_PROJ (the patch). All three preserve the original behavior.

* scripts/make_synthetic_model.py:
  - torch-free generator that emits a model.bin in transformer.cpp::load()
    format with a controllable sparsity knob. Lets us measure dense-vs-sparse
    crossover without standing up the full Python build pipeline.

* scripts/bench.py:
  - Sweep harness: builds one synthetic model per sparsity level, runs all
    three binaries, verifies they emit byte-identical token streams, reports
    tok/s and projection-time fraction. Forces single-thread BLAS for
    fairness at small per-token problem sizes.

* results/sweep.tsv:
  - Numbers from the sandbox run at D=128 L=8 H=4 F=256 V=128. Crossover at
    ~75% sparsity; sparse wins 5x at 90%, 30x at 99%. See issue #1 for the
    handoff describing what's left to confirm on a real machine.