cpu perf by richiejp · Pull Request #2 · localai-org/privacy-filter.cpp

richiejp · 2026-06-16T11:01:13Z

bench: PyTorch reference + ggml GEMM microbench, exact-length pf-bench
perf: portable AVX-512 CPU build via GGML_CPU_ALL_VARIANTS
perf: CPU ablation profiler (PF_PROF) + optional Q8_0 expert requantizer
docs: CPU performance analysis (SSE-trap root cause, CPU_ALL_VARIANTS fix)
perf: flash attention (default) for CPU and Vulkan
bench: PF_WINDOW knob + memory columns in pf-bench
perf: prototype O(n*band) banded sliding-window attention
perf: integrate banded sliding-window attention (PF_BANDED)
perf: chunk the MoE FFN over tokens (PF_MOE_CHUNK) to lift the large-window cap
docs: MoE chunking + single-window result
perf: enable banded attention + MoE chunking by default (length-gated)
docs: refresh README Bench with the HF-vs-ours comparison
docs: add speedup column to Bench tables, trim prose

- scripts/bench_torch.py: HF transformers reference forward throughput at matched token lengths (cpu/cuda, fp16/bf16/fp32, graceful OOM), mirroring tools/bench so the tables line up.
- pf-bench: optional [lengths] arg -- exact token counts (synth text tokenized then truncated) so both engines bench identical lengths. Backward compatible.
- pf-gemm-bench (bench/gemm_microbench.cpp): ggml CPU mul_mat GFLOP/s by dtype/shape, to compare kernel throughput against a BLAS-backed framework.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The CPU backend was running with no SIMD. Under Nix the cc-wrapper strips -march=native (NIX_ENFORCE_NO_NATIVE), so a GGML_NATIVE=ON build compiles ggml-cpu as plain SSE (objdump: zero zmm/ymm/vfmadd), and the CI's GGML_NATIVE=OFF has none either. That was the whole CPU deficit vs PyTorch/MKL: with real AVX-512, ggml-f16 is ~10x faster and beats the reference (512 tok: 280 -> ~3000 vs 1935), no quantization needed.

Fix is llama.cpp-style runtime dispatch: GGML_CPU_ALL_VARIANTS + GGML_BACKEND_DL compile libggml-cpu-<isa>.so per ISA level and load the best-scoring one at runtime (zen4 here) -- portable, and immune to -march stripping.

- New preset release-portable (tools emitted to bin/ beside the variants so load_all() finds them).
- Engine: ggml_backend_load_all() + threads via the registry proc address (the cpu-specific symbol now lives in the variant .so); no-ops in a static build.
- PF_NTHREADS overrides thread count (a sweep confirms the default is optimal).
- CI builds release-portable and asserts the AVX-512 variant actually has %zmm, so the SSE-only trap can't silently return.
- test_graph_blocks uses the backend API (not ggml_graph_compute_with_ctx) so it links under BACKEND_DL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
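As a rough illustration of the "load the best-scoring variant" idea, here is a minimal Python sketch. It is not ggml's loader: the variant table, feature names and scores are hypothetical stand-ins (in ggml each variant library reports its own score), and only the selection rule is modeled.

```python
# Hypothetical variant table: name -> (CPU features required, preference score).
# The names and feature sets are illustrative, not ggml's actual list.
VARIANTS = {
    "sse42": ({"sse4_2"}, 1),
    "haswell": ({"sse4_2", "avx2", "fma"}, 2),
    "zen4": ({"sse4_2", "avx2", "fma", "avx512f", "avx512bw", "avx512vnni"}, 3),
}


def pick_variant(host_features):
    """Return the highest-scoring variant whose requirements the host meets."""
    usable = [
        (score, name)
        for name, (required, score) in VARIANTS.items()
        if required <= host_features  # subset test: every required feature present
    ]
    if not usable:
        return None  # no variant can run on this host
    return max(usable)[1]


if __name__ == "__main__":
    print(pick_variant({"sse4_2", "avx2", "fma"}))
```

Because the choice is made from the features the host reports at run time, it does not depend on which -march flag survived the compiler wrapper at build time.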

PF_PROF=noattn|nomoe skips a block at graph-build time so the wall-time delta attributes its cost (512 tok, AVX-512: MoE 64%, attention 34%). Input-set calls are guarded on ->buffer since an ablated block orphans its inputs; no-op unset.

scripts/requant_q8.py copies a GGUF verbatim except chosen weights (default the MoE experts), quantized to Q8_0. With AVX-512 already winning, Q8 is a minor optional lever (~15% over f16) at a precision cost that misses the f16 parity gate on long inputs, so it would need its own tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
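For readers unfamiliar with Q8_0, the sketch below shows the general scheme: blocks of 32 values, one scale per block, signed 8-bit quants. It is a pure-Python illustration, not scripts/requant_q8.py. Two simplifications are assumptions of this sketch: the scale is kept as a Python float (ggml stores it as fp16), and Python's round() is used for rounding.

```python
QK8_0 = 32  # values per Q8_0 block


def quantize_q8_0(values):
    """Quantize a flat list of floats into (scale, int8 quants) blocks."""
    assert len(values) % QK8_0 == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(values), QK8_0):
        block = values[start:start + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0              # per-block scale
        inv = 1.0 / d if d else 0.0   # an all-zero block quantizes to zeros
        quants = [round(v * inv) for v in block]
        blocks.append((d, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [d * q for d, quants in blocks for q in quants]


# Small deterministic example: 64 values -> 2 blocks.
values = [((i * 37) % 101 - 50) / 10 for i in range(64)]
blocks = quantize_q8_0(values)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(values, restored))
```

The round-trip error per element is bounded by half the block scale, which is the precision cost the commit weighs against the throughput gain.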

… fix)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After the SIMD fix, attention dominated at length on both backends (PF_PROF: CPU 8192 tok 72%, Vulkan 2k-32k ~69%) -- the engine built the full [n,n] score matrix then masked it to the sliding window, O(n^2) for an O(n*256) receptive field.

Replace it with ggml_flash_attn_ext: fused QK/softmax/V, no materialized scores, sinks via add_sinks, sliding-window mask, F32 accumulate. Numerically exact here (passes the f32 cos>=0.99999 parity gate and window-stitch), and ~1.8-2.4x faster where attention dominates (CPU 8192: 798 -> 1928 tok/s; Vulkan: ~2.3x at 8k-131k).

Default on; PF_NOFLASH selects the explicit path (reference / tap debugging).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
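The difference between the explicit path and a fused one can be sketched in a few lines of Python. This is a toy single-query, single-head model, not the ggml kernel: it assumes a symmetric window of the given radius and omits attention sinks, and all names are made up for the illustration. The first function materializes a full masked score row; the second streams over the in-window keys with a running max and running sum (the online-softmax trick fused attention kernels rely on).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def explicit_attention(q, keys, vals, i, radius):
    """Build the full score row, mask it to the window, then softmax."""
    scores = [dot(q, k) if abs(i - j) <= radius else -math.inf
              for j, k in enumerate(keys)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # masked entries become 0.0
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, vals)) / z
            for d in range(len(vals[0]))]


def streaming_attention(q, keys, vals, i, radius):
    """Same result without a score row: running max, running sum, running output."""
    m, z = -math.inf, 0.0
    acc = [0.0] * len(vals[0])
    lo, hi = max(0, i - radius), min(len(keys), i + radius + 1)
    for j in range(lo, hi):
        s = dot(q, keys[j])
        m_new = max(m, s)
        rescale = math.exp(m - m_new)  # 0.0 on the first key, when m is -inf
        p = math.exp(s - m_new)
        z = z * rescale + p
        acc = [a * rescale + p * v for a, v in zip(acc, vals[j])]
        m = m_new
    return [a / z for a in acc]


# Deterministic toy data.
n, dim, radius = 12, 4, 3
keys = [[math.sin(0.7 * j + d) for d in range(dim)] for j in range(n)]
vals = [[math.cos(0.3 * j - d) for d in range(dim)] for j in range(n)]
queries = [[math.sin(1.1 * i + 0.5 * d) for d in range(dim)] for i in range(n)]
max_diff = max(
    abs(a - b)
    for i in range(n)
    for a, b in zip(explicit_attention(queries[i], keys, vals, i, radius),
                    streaming_attention(queries[i], keys, vals, i, radius))
)
```

The two paths agree up to floating-point rounding, while the streaming version keeps only O(1) state per query instead of a length-n score row.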

PF_WINDOW sets tokens-per-forward (the pf_set_window knob) so the memory/throughput tradeoff of the processing window can be measured. The table now reports the gallocr compute-buffer size (per-forward activation memory = Vulkan VRAM / CPU RAM, bounded by the window) and host RSS, and the header prints the weights-buffer size. Shows that at the default W the compute buffer is flat across document length (the windowing's point) and that raising W grows it O(n^2) via the sliding-window mask.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pf-banded-proto (bench/banded_attn_proto.cpp) validates a block-local form of the sliding-window attention: group tokens into blocks of B >= radius, each query block attends only to blocks {i-1,i,i+1}. The mask becomes a [3B,B,n_blocks] band (O(n*B)) instead of [n,n] (O(n^2)), and attention compute drops to O(n*band) -- while being bit-identical to the full masked attention (max|d|=0 across sizes and block sizes, since it computes the same dot products locally).

This is the last O(n^2) term: flash attention removed the materialized scores but the sliding-window mask is still [n,n], which OOMs Vulkan by W=16384. The band mask is 21x smaller at 16k tokens, 85x at 64k -- unlocking large processing windows (fewer halo recomputes) and cutting the compute buffer.

Model integration (GQA + sinks + parity) is the next step; docs/cpu-perf.md has the memory study.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
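Why the block-local form loses nothing can be checked with a small Python sketch. It is a toy single-head model, not bench/banded_attn_proto.cpp, and the helper names are hypothetical. When B >= radius, the keys reachable through blocks {b-1, b, b+1} and inside the window are exactly the keys inside the window, so both forms compute the same dot products. The mask-size ratio n^2 / (3*B*n) is also shown; with B=256 (the block size the integration commit uses), and assuming 16k = 16384 and 64k = 65536, it comes out at about 21x and 85x.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, vals, allowed):
    """Softmax attention of one query over the key indices in `allowed`."""
    scores = [dot(q, keys[j]) for j in allowed]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wj * vals[j][d] for wj, j in zip(w, allowed)) / z
            for d in range(len(vals[0]))]


def window_keys(i, n, radius):
    """Keys a full [n,n] sliding-window mask lets query i see."""
    return [j for j in range(n) if abs(i - j) <= radius]


def band_keys(i, n, radius, block):
    """Keys query i sees when limited to blocks {b-1, b, b+1} plus the window."""
    b = i // block
    lo, hi = max(0, (b - 1) * block), min(n, (b + 2) * block)
    return [j for j in range(lo, hi) if abs(i - j) <= radius]


def mask_ratio(n, block):
    """Full-mask elements divided by band-mask elements ([3B, B, n_blocks])."""
    return (n * n) / (3 * block * block * (n // block))


# Deterministic toy data; block >= radius, as the commit requires.
n, dim, radius, block = 20, 3, 4, 4
keys = [[math.sin(0.5 * j + d) for d in range(dim)] for j in range(n)]
vals = [[math.cos(0.2 * j - d) for d in range(dim)] for j in range(n)]
queries = [[math.sin(0.9 * i + 0.3 * d) for d in range(dim)] for i in range(n)]

full = [attend(queries[i], keys, vals, window_keys(i, n, radius)) for i in range(n)]
banded = [attend(queries[i], keys, vals, band_keys(i, n, radius, block)) for i in range(n)]
```

With a block smaller than the radius (for example block=2 here) the band no longer covers the window and the two key sets differ, which is why the constraint B >= radius matters.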

Block-local attention in the model graph (src/model.cpp): tokens group into blocks of B=256 (>= the 128 sliding-window radius); each query block flash-attends to blocks {i-1,i,i+1} with an O(n*B) F16 band mask carrying the sliding window + attention sinks. GQA broadcasts over the head dim; sequences are padded to a block multiple and masked/trimmed. Replaces the O(n^2) [n,n] mask and full-window attention with O(n*band) compute and memory -- bit-identical math.

Validated: passes the f32 cos>=0.99999 parity gate and window-stitch on both CPU and Vulkan.

Speedup over flash at the default window: Vulkan ~2.5x at 8k-32k tokens (the flash kernel computes the whole window; banded only the band), CPU ~1.1x at length; a slight loss at very short inputs (block padding), so it's opt-in via PF_BANDED. Compute buffer is also lower (Vulkan 32768: 166 vs 208 MiB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…window cap

Banded attention removed the O(n^2) attention term, but large single windows still OOM'd Vulkan on the MoE expert matmul's activation scratch (mul_mat_id y_sz > maxStorageBufferRange). The MoE is purely per-token, so PF_MOE_CHUNK processes it in token-chunks (router+experts+combine per slice, concatenated) -- exact, no halo. The graph node bound scales with the chunk count.

With banded + MoE chunking a 131072-token document runs in ONE window instead of windowing at W=4096: ~1.28x faster (no halo recompute) at higher memory (compute buffer 2389 vs 166 MiB) -- the throughput/VRAM tradeoff the window knob now exposes. Exact: passes the f32 cos>=0.99999 parity gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
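The "purely per-token, so chunking is exact" argument can be illustrated with a toy MoE in Python. Everything here is an assumption made for the sketch: the router is a dot product per expert, routing is a softmax over the top-k logits, and each "expert" is just an elementwise scale vector. Only the chunk-and-concatenate structure mirrors what the commit describes.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moe_forward(tokens, router, experts, top_k):
    """Toy MoE FFN: route each token to its top_k experts and mix their outputs."""
    out = []
    for x in tokens:
        logits = [dot(r, x) for r in router]
        top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
        m = max(logits[e] for e in top)
        w = [math.exp(logits[e] - m) for e in top]
        z = sum(w)
        y = [0.0] * len(x)
        for wi, e in zip(w, top):
            for d in range(len(x)):
                y[d] += (wi / z) * experts[e][d] * x[d]  # toy expert: elementwise scale
        out.append(y)
    return out


def moe_chunked(tokens, router, experts, top_k, chunk):
    """Run the same MoE over token slices and concatenate the results."""
    out = []
    for start in range(0, len(tokens), chunk):
        out.extend(moe_forward(tokens[start:start + chunk], router, experts, top_k))
    return out


# Deterministic toy data: 10 tokens, 4 experts, hidden size 3.
dim, n_experts, n_tokens = 3, 4, 10
tokens = [[math.sin(0.4 * t + d) for d in range(dim)] for t in range(n_tokens)]
router = [[math.cos(0.9 * e + d) for d in range(dim)] for e in range(n_experts)]
experts = [[1.0 + 0.1 * e + 0.01 * d for d in range(dim)] for e in range(n_experts)]

reference = moe_forward(tokens, router, experts, top_k=2)
```

No token's output depends on any other token, so no halo is needed at chunk boundaries and any chunk size gives the same result; the chunk size only bounds how much scratch a single expert matmul needs.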

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Banded attention now defaults on for sequences >= 2048 tokens (the measured crossover: a slight loss below from the B=256 block padding, neutral-to-faster at and above -- CPU 1.0x@2048 -> 1.1x@4096, Vulkan 1.1x -> 2.5x). PF_BANDED=1/0 still forces it.

MoE chunking defaults to the forward window size, so it's inert at the default window (n <= W) yet keeps a raised window (single-pass long docs) from OOMing on the Vulkan mul_mat_id scratch.

Net: long inputs get the banded speedup automatically (the long-3k fixture now exercises it by default) with no CPU regression and no memory change at the default window; short inputs keep the full flash path. Default parity (CPU + Vulkan), window-stitch, and the fast suite all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
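The length gate described above amounts to a small decision rule, sketched here in Python. The function and constant names are hypothetical and the engine itself is C++. The 2048-token threshold and the PF_BANDED=1/0 override come from the commit message; treating PF_MOE_CHUNK as an integer token count is an assumption of this sketch.

```python
BANDED_MIN_TOKENS = 2048  # measured crossover quoted in the commit message


def use_banded(n_tokens, env):
    """Banded attention: forced by PF_BANDED=1/0, otherwise gated on length."""
    forced = env.get("PF_BANDED")
    if forced == "1":
        return True
    if forced == "0":
        return False
    return n_tokens >= BANDED_MIN_TOKENS


def moe_chunk_size(window, env):
    """MoE chunk size: PF_MOE_CHUNK if set (assumed integer), else the window."""
    override = env.get("PF_MOE_CHUNK")
    return int(override) if override else window


if __name__ == "__main__":
    import os
    print(use_banded(3000, os.environ), moe_chunk_size(4096, os.environ))
```

With the chunk size equal to the window, any forward of n <= W tokens is a single chunk, which is why the default is inert until the window is raised.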

Replace the stale single-config table (whose CPU numbers were the pre-fix SSE-only build) with the forward-tok/s comparison against stock HF Transformers on GPU and CPU, the release-portable build/run commands, and the flat-memory note. Faster on both devices at every length (7-18x GPU, 1.6-7.7x CPU); HF OOMs past ~16k tokens where ours holds ~2.8 GiB out to 131k.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`cmake --build -j` is unbounded on Makefiles; release-portable compiles all 14 ggml-cpu ISA variants, so uncapped parallelism exhausted the 16 GB runner (build killed with SIGTERM/143 mid-compile). Cap to -j4 (runner core count).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

richiejp and others added 14 commits June 16, 2026 11:57

docs: CPU performance analysis (SSE-trap root cause, CPU_ALL_VARIANTS…

caebe08

… fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: MoE chunking + single-window result

064d840

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: add speedup column to Bench tables, trim prose

165bcf6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

richiejp merged commit 646342f into master Jun 16, 2026
6 checks passed

richiejp merged 14 commits into master from cpu-perf
