Conversation
richiejp
commented
Jun 16, 2026
Contributor
- bench: PyTorch reference + ggml GEMM microbench, exact-length pf-bench
- perf: portable AVX-512 CPU build via GGML_CPU_ALL_VARIANTS
- perf: CPU ablation profiler (PF_PROF) + optional Q8_0 expert requantizer
- docs: CPU performance analysis (SSE-trap root cause, CPU_ALL_VARIANTS fix)
- perf: flash attention (default) for CPU and Vulkan
- bench: PF_WINDOW knob + memory columns in pf-bench
- perf: prototype O(n*band) banded sliding-window attention
- perf: integrate banded sliding-window attention (PF_BANDED)
- perf: chunk the MoE FFN over tokens (PF_MOE_CHUNK) to lift the large-window cap
- docs: MoE chunking + single-window result
- perf: enable banded attention + MoE chunking by default (length-gated)
- docs: refresh README Bench with the HF-vs-ours comparison
- docs: add speedup column to Bench tables, trim prose
- scripts/bench_torch.py: HF transformers reference forward throughput at matched token lengths (cpu/cuda, fp16/bf16/fp32, graceful OOM), mirroring tools/bench so the tables line up. - pf-bench: optional [lengths] arg -- exact token counts (synth text tokenized then truncated) so both engines bench identical lengths. Backward compatible. - pf-gemm-bench (bench/gemm_microbench.cpp): ggml CPU mul_mat GFLOP/s by dtype/shape, to compare kernel throughput against a BLAS-backed framework. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CPU backend was running with no SIMD. Under Nix the cc-wrapper strips -march=native (NIX_ENFORCE_NO_NATIVE), so a GGML_NATIVE=ON build compiles ggml-cpu as plain SSE (objdump: zero zmm/ymm/vfmadd), and the CI's GGML_NATIVE=OFF has none either. That was the whole CPU deficit vs PyTorch/MKL: with real AVX-512, ggml-f16 is ~10x faster and beats the reference (512 tok: 280 -> ~3000 vs 1935), no quantization needed. Fix is llama.cpp-style runtime dispatch: GGML_CPU_ALL_VARIANTS + GGML_BACKEND_DL compile libggml-cpu-<isa>.so per ISA level and load the best-scoring one at runtime (zen4 here) -- portable, and immune to -march stripping. New preset release-portable (tools emitted to bin/ beside the variants so load_all() finds them). Engine: ggml_backend_load_all() + threads via the registry proc address (the cpu-specific symbol now lives in the variant .so); no-ops in a static build. PF_NTHREADS overrides thread count (a sweep confirms the default is optimal). CI builds release-portable and asserts the AVX-512 variant actually has %zmm, so the SSE-only trap can't silently return. test_graph_blocks uses the backend API (not ggml_graph_compute_with_ctx) so it links under BACKEND_DL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PF_PROF=noattn|nomoe skips a block at graph-build time so the wall-time delta attributes its cost (512 tok, AVX-512: MoE 64%, attention 34%). Input-set calls are guarded on ->buffer since an ablated block orphans its inputs; no-op unset. scripts/requant_q8.py copies a GGUF verbatim except chosen weights (default the MoE experts), quantized to Q8_0. With AVX-512 already winning, Q8 is a minor optional lever (~15% over f16) at a precision cost that misses the f16 parity gate on long inputs, so it would need its own tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the SIMD fix, attention dominated at length on both backends (PF_PROF: CPU 8192 tok 72%, Vulkan 2k-32k ~69%) -- the engine built the full [n,n] score matrix then masked it to the sliding window, O(n^2) for an O(n*256) receptive field. Replace it with ggml_flash_attn_ext: fused QK/softmax/V, no materialized scores, sinks via add_sinks, sliding-window mask, F32 accumulate. Numerically exact here (passes the f32 cos>=0.99999 parity gate and window-stitch), and ~1.8-2.4x faster where attention dominates (CPU 8192: 798 -> 1928 tok/s; Vulkan: ~2.3x at 8k-131k). Default on; PF_NOFLASH selects the explicit path (reference / tap debugging). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PF_WINDOW sets tokens-per-forward (the pf_set_window knob) so the memory/throughput tradeoff of the processing window can be measured. The table now reports the gallocr compute-buffer size (per-forward activation memory = Vulkan VRAM / CPU RAM, bounded by the window) and host RSS, and the header prints the weights-buffer size. Shows that at the default W the compute buffer is flat across document length (the windowing's point) and that raising W grows it O(n^2) via the sliding-window mask. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pf-banded-proto (bench/banded_attn_proto.cpp) validates a block-local form of the
sliding-window attention: group tokens into blocks of B >= radius, each query
block attends only to blocks {i-1,i,i+1}. The mask becomes a [3B,B,n_blocks] band
(O(n*B)) instead of [n,n] (O(n^2)), and attention compute drops to O(n*band) --
while being bit-identical to the full masked attention (max|d|=0 across sizes and
block sizes, since it computes the same dot products locally).
This is the last O(n^2) term: flash attention removed the materialized scores but
the sliding-window mask is still [n,n], which OOMs Vulkan by W=16384. The band
mask is 21x smaller at 16k tokens, 85x at 64k -- unlocking large processing
windows (fewer halo recomputes) and cutting the compute buffer. Model integration
(GQA + sinks + parity) is the next step; docs/cpu-perf.md has the memory study.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Block-local attention in the model graph (src/model.cpp): tokens group into
blocks of B=256 (>= the 128 sliding-window radius); each query block flash-attends
to blocks {i-1,i,i+1} with an O(n*B) F16 band mask carrying the sliding window +
attention sinks. GQA broadcasts over the head dim; sequences are padded to a block
multiple and masked/trimmed. Replaces the O(n^2) [n,n] mask and full-window
attention with O(n*band) compute and memory -- bit-identical math.
Validated: passes the f32 cos>=0.99999 parity gate and window-stitch on both CPU
and Vulkan. Speedup over flash at the default window: Vulkan ~2.5x at 8k-32k
tokens (the flash kernel computes the whole window; banded only the band), CPU
~1.1x at length; a slight loss at very short inputs (block padding), so it's
opt-in via PF_BANDED. Compute buffer is also lower (Vulkan 32768: 166 vs 208 MiB).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…window cap Banded attention removed the O(n^2) attention term, but large single windows still OOM'd Vulkan on the MoE expert matmul's activation scratch (mul_mat_id y_sz > maxStorageBufferRange). The MoE is purely per-token, so PF_MOE_CHUNK processes it in token-chunks (router+experts+combine per slice, concatenated) -- exact, no halo. The graph node bound scales with the chunk count. With banded + MoE chunking a 131072-token document runs in ONE window instead of windowing at W=4096: ~1.28x faster (no halo recompute) at higher memory (compute buffer 2389 vs 166 MiB) -- the throughput/VRAM tradeoff the window knob now exposes. Exact: passes the f32 cos>=0.99999 parity gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Banded attention now defaults on for sequences >= 2048 tokens (the measured crossover: a slight loss below from the B=256 block padding, neutral-to-faster at and above -- CPU 1.0x@2048 -> 1.1x@4096, Vulkan 1.1x -> 2.5x). PF_BANDED=1/0 still forces it. MoE chunking defaults to the forward window size, so it's inert at the default window (n <= W) yet keeps a raised window (single-pass long docs) from OOMing on the Vulkan mul_mat_id scratch. Net: long inputs get the banded speedup automatically (the long-3k fixture now exercises it by default) with no CPU regression and no memory change at the default window; short inputs keep the full flash path. Default parity (CPU + Vulkan), window-stitch, and the fast suite all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the stale single-config table (whose CPU numbers were the pre-fix SSE-only build) with the forward-tok/s comparison against stock HF Transformers on GPU and CPU, the release-portable build/run commands, and the flat-memory note. Faster on both devices at every length (7-18x GPU, 1.6-7.7x CPU); HF OOMs past ~16k tokens where ours holds ~2.8 GiB out to 131k. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`cmake --build -j` is unbounded on Makefiles; release-portable compiles all 14 ggml-cpu ISA variants, so uncapped parallelism exhausted the 16 GB runner (build killed with SIGTERM/143 mid-compile). Cap to -j4 (runner core count). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.