Skip to content

cpu perf#2

Merged
richiejp merged 14 commits into
masterfrom
cpu-perf
Jun 16, 2026
Merged

cpu perf#2
richiejp merged 14 commits into
masterfrom
cpu-perf

Conversation

@richiejp

Copy link
Copy Markdown
Contributor
  • bench: PyTorch reference + ggml GEMM microbench, exact-length pf-bench
  • perf: portable AVX-512 CPU build via GGML_CPU_ALL_VARIANTS
  • perf: CPU ablation profiler (PF_PROF) + optional Q8_0 expert requantizer
  • docs: CPU performance analysis (SSE-trap root cause, CPU_ALL_VARIANTS fix)
  • perf: flash attention (default) for CPU and Vulkan
  • bench: PF_WINDOW knob + memory columns in pf-bench
  • perf: prototype O(n*band) banded sliding-window attention
  • perf: integrate banded sliding-window attention (PF_BANDED)
  • perf: chunk the MoE FFN over tokens (PF_MOE_CHUNK) to lift the large-window cap
  • docs: MoE chunking + single-window result
  • perf: enable banded attention + MoE chunking by default (length-gated)
  • docs: refresh README Bench with the HF-vs-ours comparison
  • docs: add speedup column to Bench tables, trim prose

richiejp and others added 14 commits June 16, 2026 11:57
- scripts/bench_torch.py: HF transformers reference forward throughput at
  matched token lengths (cpu/cuda, fp16/bf16/fp32, graceful OOM), mirroring
  tools/bench so the tables line up.
- pf-bench: optional [lengths] arg -- exact token counts (synth text tokenized
  then truncated) so both engines bench identical lengths. Backward compatible.
- pf-gemm-bench (bench/gemm_microbench.cpp): ggml CPU mul_mat GFLOP/s by
  dtype/shape, to compare kernel throughput against a BLAS-backed framework.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CPU backend was running with no SIMD. Under Nix the cc-wrapper strips
-march=native (NIX_ENFORCE_NO_NATIVE), so a GGML_NATIVE=ON build compiles
ggml-cpu as plain SSE (objdump: zero zmm/ymm/vfmadd), and the CI's GGML_NATIVE=OFF
has none either. That was the whole CPU deficit vs PyTorch/MKL: with real AVX-512,
ggml-f16 is ~10x faster and beats the reference (512 tok: 280 -> ~3000 vs 1935),
no quantization needed.

Fix is llama.cpp-style runtime dispatch: GGML_CPU_ALL_VARIANTS + GGML_BACKEND_DL
compile libggml-cpu-<isa>.so per ISA level and load the best-scoring one at
runtime (zen4 here) -- portable, and immune to -march stripping. New preset
release-portable (tools emitted to bin/ beside the variants so load_all() finds
them). Engine: ggml_backend_load_all() + threads via the registry proc address
(the cpu-specific symbol now lives in the variant .so); no-ops in a static build.
PF_NTHREADS overrides thread count (a sweep confirms the default is optimal).
CI builds release-portable and asserts the AVX-512 variant actually has %zmm, so
the SSE-only trap can't silently return. test_graph_blocks uses the backend API
(not ggml_graph_compute_with_ctx) so it links under BACKEND_DL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PF_PROF=noattn|nomoe skips a block at graph-build time so the wall-time delta
attributes its cost (512 tok, AVX-512: MoE 64%, attention 34%). Input-set calls
are guarded on ->buffer since an ablated block orphans its inputs; no-op unset.

scripts/requant_q8.py copies a GGUF verbatim except chosen weights (default the
MoE experts), quantized to Q8_0. With AVX-512 already winning, Q8 is a minor
optional lever (~15% over f16) at a precision cost that misses the f16 parity
gate on long inputs, so it would need its own tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fix)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the SIMD fix, attention dominated at length on both backends (PF_PROF: CPU
8192 tok 72%, Vulkan 2k-32k ~69%) -- the engine built the full [n,n] score matrix
then masked it to the sliding window, O(n^2) for an O(n*256) receptive field.

Replace it with ggml_flash_attn_ext: fused QK/softmax/V, no materialized scores,
sinks via add_sinks, sliding-window mask, F32 accumulate. Numerically exact here
(passes the f32 cos>=0.99999 parity gate and window-stitch), and ~1.8-2.4x faster
where attention dominates (CPU 8192: 798 -> 1928 tok/s; Vulkan: ~2.3x at 8k-131k).
Default on; PF_NOFLASH selects the explicit path (reference / tap debugging).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PF_WINDOW sets tokens-per-forward (the pf_set_window knob) so the
memory/throughput tradeoff of the processing window can be measured. The table
now reports the gallocr compute-buffer size (per-forward activation memory =
Vulkan VRAM / CPU RAM, bounded by the window) and host RSS, and the header prints
the weights-buffer size. Shows that at the default W the compute buffer is flat
across document length (the windowing's point) and that raising W grows it
O(n^2) via the sliding-window mask.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pf-banded-proto (bench/banded_attn_proto.cpp) validates a block-local form of the
sliding-window attention: group tokens into blocks of B >= radius, each query
block attends only to blocks {i-1,i,i+1}. The mask becomes a [3B,B,n_blocks] band
(O(n*B)) instead of [n,n] (O(n^2)), and attention compute drops to O(n*band) --
while being bit-identical to the full masked attention (max|d|=0 across sizes and
block sizes, since it computes the same dot products locally).

This is the last O(n^2) term: flash attention removed the materialized scores but
the sliding-window mask is still [n,n], which OOMs Vulkan by W=16384. The band
mask is 21x smaller at 16k tokens, 85x at 64k -- unlocking large processing
windows (fewer halo recomputes) and cutting the compute buffer. Model integration
(GQA + sinks + parity) is the next step; docs/cpu-perf.md has the memory study.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Block-local attention in the model graph (src/model.cpp): tokens group into
blocks of B=256 (>= the 128 sliding-window radius); each query block flash-attends
to blocks {i-1,i,i+1} with an O(n*B) F16 band mask carrying the sliding window +
attention sinks. GQA broadcasts over the head dim; sequences are padded to a block
multiple and masked/trimmed. Replaces the O(n^2) [n,n] mask and full-window
attention with O(n*band) compute and memory -- bit-identical math.

Validated: passes the f32 cos>=0.99999 parity gate and window-stitch on both CPU
and Vulkan. Speedup over flash at the default window: Vulkan ~2.5x at 8k-32k
tokens (the flash kernel computes the whole window; banded only the band), CPU
~1.1x at length; a slight loss at very short inputs (block padding), so it's
opt-in via PF_BANDED. Compute buffer is also lower (Vulkan 32768: 166 vs 208 MiB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…window cap

Banded attention removed the O(n^2) attention term, but large single windows
still OOM'd Vulkan on the MoE expert matmul's activation scratch
(mul_mat_id y_sz > maxStorageBufferRange). The MoE is purely per-token, so
PF_MOE_CHUNK processes it in token-chunks (router+experts+combine per slice,
concatenated) -- exact, no halo. The graph node bound scales with the chunk
count.

With banded + MoE chunking a 131072-token document runs in ONE window instead of
windowing at W=4096: ~1.28x faster (no halo recompute) at higher memory (compute
buffer 2389 vs 166 MiB) -- the throughput/VRAM tradeoff the window knob now
exposes. Exact: passes the f32 cos>=0.99999 parity gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Banded attention now defaults on for sequences >= 2048 tokens (the measured
crossover: a slight loss below from the B=256 block padding, neutral-to-faster at
and above -- CPU 1.0x@2048 -> 1.1x@4096, Vulkan 1.1x -> 2.5x). PF_BANDED=1/0 still
forces it. MoE chunking defaults to the forward window size, so it's inert at the
default window (n <= W) yet keeps a raised window (single-pass long docs) from
OOMing on the Vulkan mul_mat_id scratch.

Net: long inputs get the banded speedup automatically (the long-3k fixture now
exercises it by default) with no CPU regression and no memory change at the
default window; short inputs keep the full flash path. Default parity (CPU +
Vulkan), window-stitch, and the fast suite all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the stale single-config table (whose CPU numbers were the pre-fix
SSE-only build) with the forward-tok/s comparison against stock HF Transformers
on GPU and CPU, the release-portable build/run commands, and the flat-memory
note. Faster on both devices at every length (7-18x GPU, 1.6-7.7x CPU); HF OOMs
past ~16k tokens where ours holds ~2.8 GiB out to 131k.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`cmake --build -j` is unbounded on Makefiles; release-portable compiles all 14
ggml-cpu ISA variants, so uncapped parallelism exhausted the 16 GB runner (build
killed with SIGTERM/143 mid-compile). Cap to -j4 (runner core count).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@richiejp richiejp merged commit 646342f into master Jun 16, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant