localai-org · richiejp · Jun 16, 2026 · Jun 16, 2026
diff --git a/README.md b/README.md
@@ -15,6 +15,59 @@ Convert your own from a HF checkpoint with
 [`scripts/convert.py`](scripts/convert.py) — self-contained, no llama.cpp
 dependency (see [Convert](#convert)).
 
+## Bench
+
+A "redaction race" against stock HF Transformers on the same hardware:
+
+**CPU — 8k-token document, real time.** Both finish; ours is 7.7× faster.
+
+![CPU redaction race: privacy-filter.cpp vs HF Transformers on an 8k-token document](demo/out/pii_duel_cpu.gif)
+
+**GPU — 131k-token document (4× slow-mo).** Ours runs flat to 131k tokens; HF
+hits the 16 GiB memory wall and OOMs at ~16k.
+
+![GPU redaction race: privacy-filter.cpp runs to 131k tokens while HF OOMs](demo/out/pii_duel_gpu.gif)
+
+Full-quality MP4s: [CPU](demo/out/pii_duel_cpu_final.mp4) · [GPU](demo/out/pii_duel_gpu_final.mp4).
+
+Single forward-pass latency and throughput vs stock HF Transformers (transformers
+5.9, eager), Ryzen 9 7900 (12 threads) + RTX 5070 Ti, f16/fp16, matched token
+counts ([scripts/bench_torch.py](scripts/bench_torch.py)). `tokens` is the input
+sequence length classified in one forward pass (the whole document at once, not
+generation); latency is `tokens ÷ tok/s`.
+
+GPU — ours (Vulkan) vs HF (CUDA):
+
+| tokens | HF (tok/s) | HF (ms) | ours (tok/s) | ours (ms) | speedup |
+|-------:|-----------:|--------:|-------------:|----------:|--------:|
+|    512 |      5 526 |      93 |      100 503 |         5 |     18× |
+|  2 048 |     16 427 |     125 |      145 481 |        14 |    8.9× |
+|  8 192 |     14 154 |     579 |      105 034 |        78 |    7.4× |
+| 32 768 |        OOM |     OOM |       83 519 |       392 |       — |
+| 131072 |        OOM |     OOM |       81 105 |     1 616 |       — |
+
+CPU — ours vs HF (fp32):
+
+| tokens | HF (tok/s) | HF (s) | ours (tok/s) | ours (s) | speedup |
+|-------:|-----------:|-------:|-------------:|---------:|--------:|
+|    512 |      2 171 |   0.24 |        3 564 |     0.14 |    1.6× |
+|  2 048 |        978 |   2.09 |        3 490 |     0.59 |    3.6× |
+|  8 192 |        304 |  26.95 |        2 332 |     3.51 |    7.7× |
+
+The speedup widens with length because HF's full self-attention is O(n²) while
+ours is banded/near-linear, so our tok/s stays roughly flat as HF's collapses.
+Our memory use stays flat at ~2.8 GiB VRAM up to 131k tokens, on a 16 GiB GPU.
+`release-portable` runtime-dispatches the best ggml-cpu ISA (AVX-512
+without `-march=native`); flash + banded attention default on. See
+[docs/cpu-perf.md](docs/cpu-perf.md).
+
+Reproduce the numbers:
+
+```sh
+cmake --preset release-portable && cmake --build --preset release-portable -j
+build/release-portable/bin/pf-bench model.gguf [cpu|vulkan] [iters] [lengths]
+```
+
 ## Build
 
 ```sh
@@ -126,37 +179,3 @@ cmake --preset fuzz && cmake --build --preset fuzz -j
 PF_GGUF=model.gguf ./build/fuzz/fuzz_tokenizer corpus_tok/
 ./build/fuzz/fuzz_gguf corpus_gguf/
 ```
-
-## Bench
-
-```sh
-cmake --preset release-portable && cmake --build --preset release-portable -j
-build/release-portable/bin/pf-bench model.gguf [cpu|vulkan] [iters] [lengths]
-```
-
-Forward tok/s vs stock HF Transformers (transformers 5.9, eager), Ryzen 9 7900 (12
-threads) + RTX 5070 Ti, f16/fp16, matched token counts
-([scripts/bench_torch.py](scripts/bench_torch.py)):
-
-GPU — ours (Vulkan) vs HF (CUDA):
-
-| tokens |      HF |    ours |    × |
-|-------:|--------:|--------:|-----:|
-|    512 |   5 526 | 100 503 |  18× |
-|  2 048 |  16 427 | 145 481 | 8.9× |
-|  8 192 |  14 154 | 105 034 | 7.4× |
-| 32 768 |     OOM |  83 519 |    — |
-| 131072 |     OOM |  81 105 |    — |
-
-CPU — ours vs HF (fp32):
-
-| tokens |    HF |  ours |    × |
-|-------:|------:|------:|-----:|
-|    512 | 2 171 | 3 564 | 1.6× |
-|  2 048 |   978 | 3 490 | 3.6× |
-|  8 192 |   304 | 2 332 | 7.7× |
-
-Memory is flat ~2.8 GiB VRAM / ~3 GiB RAM to 131k tokens; HF OOMs past ~16k on a 16
-GiB GPU. `release-portable` runtime-dispatches the best ggml-cpu ISA (AVX-512
-without `-march=native`); flash + banded attention default on. See
-[docs/cpu-perf.md](docs/cpu-perf.md).
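
The latency and speedup columns in the new Bench tables follow from the throughput columns via `tokens ÷ tok/s`. As a quick cross-check, the minimal Python sketch below recomputes them from the figures quoted in the README. The helper names `latency_s` and `speedup` are illustrative only and are not part of the repository.

```python
# Figures copied from the README Bench tables: (tokens, HF tok/s, ours tok/s).
GPU_ROWS = [(512, 5526, 100503), (2048, 16427, 145481), (8192, 14154, 105034)]
CPU_ROWS = [(512, 2171, 3564), (2048, 978, 3490), (8192, 304, 2332)]


def latency_s(tokens, tok_per_s):
    """Single forward-pass latency in seconds: tokens / throughput."""
    return tokens / tok_per_s


def speedup(hf_tps, ours_tps):
    """Throughput ratio, ours over HF."""
    return ours_tps / hf_tps


for name, rows in (("GPU", GPU_ROWS), ("CPU", CPU_ROWS)):
    for tokens, hf, ours in rows:
        print(f"{name} {tokens:>6}: HF {latency_s(tokens, hf):8.3f}s  "
              f"ours {latency_s(tokens, ours):7.3f}s  {speedup(hf, ours):5.1f}x")
```

The printed values should agree with the table cells after rounding, for example 26.95 s and 7.7× for the 8 192-token CPU row.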
diff --git a/demo/.gitignore b/demo/.gitignore
@@ -0,0 +1,4 @@
+__pycache__/
+*.pyc
+out/*.tmp
+out/.trim*
diff --git a/demo/document.txt b/demo/document.txt
@@ -0,0 +1 @@
+Thanks for getting back to me so quickly about the prior-authorization request. I want to make sure everything is in order before the review board meets next week, since the last batch got held up over a missing signature. The patient, John Doe, has been with our practice for years and is in good health, so we expect this one to be routine. If anything looks incomplete on your end, please call our billing office at +1 555-0112 rather than replying here. I have attached the updated history, and the relevant appointment was on 2026-05-12. If you still need the original signed consent form, you can email me directly at jane.roe@northside-clinic.org. Appreciate you helping get claim 4471 across the line before the deadline.
diff --git a/demo/gen_corpus.py b/demo/gen_corpus.py
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+"""Generate the scrolling-demo corpora for pii_duel.py.
+
+The document is a readable, PII-sparse prose paragraph (demo/document.txt) tiled
+to the target token count -- the same tile-to-length method the benchmark uses
+(scripts/bench_torch.py) -- so "tokens" on screen means the same thing as in the
+README. One tile = 160 tokens (pf-cli --tok-batch), giving an exact char<->token
+mapping. Entity spans are the REAL pf-cli --classify output for the tile,
+replicated per tile (each tile is identical, so the model finds the same spans).
+
+Writes demo/traces/<scene>/{content.json,engines.json} for scene in cpu,gpu.
+"""
+import json
+from pathlib import Path
+
+HERE = Path(__file__).resolve().parent
+SEED = (HERE / "document.txt").read_text().rstrip("\n") + "\n\n"  # 731 chars
+TOKENS_PER_TILE = 160                                             # pf-cli --tok-batch
+
+# Real spans from: pf-cli --classify <model> 0.5 <<< "<tile>"  (offsets into SEED).
+SEED_ENTITIES = [
+    {"type": "person", "start": 236, "end": 244},   # John Doe
+    {"type": "phone",  "start": 419, "end": 430},   # +1 555-0112
+    {"type": "date",   "start": 531, "end": 541},   # 2026-05-12
+    {"type": "email",  "start": 624, "end": 653},   # jane.roe@northside-clinic.org
+]
+
+L = len(SEED)
+
+
+def build(target_tokens):
+    tiles = round(target_tokens / TOKENS_PER_TILE)
+    doc = SEED * tiles
+    ents = []
+    for t in range(tiles):
+        off = t * L
+        for e in SEED_ENTITIES:
+            ents.append({"type": e["type"], "start": e["start"] + off, "end": e["end"] + off})
+    return doc, ents, tiles * TOKENS_PER_TILE
+
+
+def write_scene(scene, target_tokens, note, engines):
+    doc, ents, n_tokens = build(target_tokens)
+    d = HERE / "traces" / scene
+    d.mkdir(parents=True, exist_ok=True)
+    content = {"document": doc, "n_tokens": n_tokens, "note": note, "entities": ents}
+    json.dump(content, open(d / "content.json", "w"), ensure_ascii=False)
+    json.dump(engines, open(d / "engines.json", "w"), indent=2, ensure_ascii=False)
+    chars = len(doc)
+    print(f"{scene}: {len(ents):,} entities, {n_tokens:,} tokens, {chars:,} chars "
+          f"({chars/n_tokens:.2f} chars/tok)")
+    for e in engines:
+        if e.get("oom_at_tokens"):
+            print(f"   {e['label']:<20} OOM at {e['oom_at_tokens']:,} tok "
+                  f"(~{e['oom_at_tokens']/e['tps']:.2f}s) @ {e['tps']:,} tok/s")
+        else:
+            print(f"   {e['label']:<20} {n_tokens/e['tps']:.2f}s @ {e['tps']:,} tok/s")
+
+
+# CPU: 8k doc, both finish (README CPU 8 192 row: ours 2 332, HF 304 tok/s).
+write_scene("cpu", 8192, "8k-token document · CPU (Ryzen 9 7900)", [
+    {"key": "ours", "label": "privacy-filter.cpp", "device": "CPU", "tps": 2332, "hero": True},
+    {"key": "hf",   "label": "HF Transformers",    "device": "CPU", "tps": 304,  "hero": False},
+])
+
+# GPU: 132k doc. Ours (Vulkan) runs the whole thing (README 131 072 row: 81 105
+# tok/s). HF (CUDA) OOMs past ~16k on a 16 GiB GPU -> dies at 16 384 tokens; use
+# its measured 8k throughput (14 154 tok/s) as the rate up to the wall.
+write_scene("gpu", 131072, "132k-token document · GPU (RTX 5070 Ti, 16 GiB)", [
+    {"key": "ours", "label": "privacy-filter.cpp", "device": "GPU · Vulkan", "tps": 81105, "hero": True},
+    {"key": "hf",   "label": "HF Transformers",    "device": "GPU · CUDA",   "tps": 14154,
+     "hero": False, "oom_at_tokens": 16384},
+])
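
Because `build()` rounds the target to whole 160-token tiles, the generated documents come out slightly shorter than the nominal 8 192 and 131 072 tokens. The sketch below re-implements the same arithmetic for illustration. It uses a short stand-in seed and a single hand-written span rather than demo/document.txt and the pf-cli output, and `tiled_tokens` is a hypothetical helper name. It prints the resulting token counts and checks that the replicated spans slice back to the same text in every tile.

```python
TOKENS_PER_TILE = 160  # same constant as gen_corpus.py

# Stand-in seed and span for illustration only; the real script uses
# demo/document.txt and the spans reported by pf-cli.
SEED = "The patient, John Doe, is well.\n\n"
SEED_ENTITIES = [{"type": "person", "start": 13, "end": 21}]


def tiled_tokens(target_tokens):
    """Token count after rounding the target to whole tiles."""
    return round(target_tokens / TOKENS_PER_TILE) * TOKENS_PER_TILE


def build(tiles):
    """Tile the seed and shift each seed span by its tile's character offset."""
    doc = SEED * tiles
    ents = [
        {"type": e["type"],
         "start": e["start"] + t * len(SEED),
         "end": e["end"] + t * len(SEED)}
        for t in range(tiles)
        for e in SEED_ENTITIES
    ]
    return doc, ents


doc, ents = build(3)
print(tiled_tokens(8192), tiled_tokens(131072))
print([doc[e["start"]:e["end"]] for e in ents])
```

With these inputs the CPU scene works out to 8 160 tokens (51 tiles) and the GPU scene to 131 040 tokens (819 tiles).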
diff --git a/demo/make.sh b/demo/make.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Build a privacy-filter "redaction race" demo video end to end: render the
+# pii_duel scrolling TUI for a scene, record it with recorder-for-agents, trim
+# the dead lead-in, append a branding outro.
+#
+#   SCENE=cpu ./make.sh            # 8k-token doc, CPU, real time, both finish
+#   SCENE=gpu DILATE=4 ./make.sh   # 132k-token doc, GPU, HF OOMs
+#
+# env: RECORDER, SCENE(cpu|gpu), DILATE, WIDTH/HEIGHT/FPS/FONTSIZE, LINK
+set -euo pipefail
+HERE=$(cd "$(dirname "$0")" && pwd)
+RECORDER=${RECORDER:-/home/rich/python/recorder-for-agents}
+SCENE=${SCENE:-cpu}
+OUT=${1:-pii_duel_${SCENE}.mp4}
+DILATE=${DILATE:-1}
+W=${WIDTH:-1280}; H=${HEIGHT:-720}; FS=${FONTSIZE:-16}; FPS=${FPS:-30}
+LINK=${LINK:-github.com/richiejp/privacy-filter.cpp}
+SDIR="$HERE/traces/$SCENE"
+
+[ -d "$SDIR" ] || { echo "no scene at $SDIR (run gen_corpus.py)" >&2; exit 1; }
+[ -x "$RECORDER/record.sh" ] || { echo "recorder not found at $RECORDER" >&2; exit 1; }
+
+# capture length: start delay + slowest engine (scaled) + settle + slice of card
+DUR=$(python3 - "$SDIR" "$DILATE" <<'PY'
+import json, sys, math
+from pathlib import Path
+d = Path(sys.argv[1]); dil = float(sys.argv[2])
+content = json.load(open(d / "content.json")); eng = json.load(open(d / "engines.json"))
+n = content["n_tokens"]
+slow = max((e.get("oom_at_tokens") or n) / e["tps"] for e in eng)
+print(int(math.ceil(1.0 + slow * dil + 1.4 + 0.7 + 2.0)))
+PY
+)
+echo "[make] scene=$SCENE ${W}x${H}@${FPS} fs=${FS} dilate=${DILATE} duration=${DUR}s -> out/$OUT"
+
+WORK="$HERE" BG="#0d1117" FG="#d7dde5" FONTSIZE="$FS" DURATION="$DUR" \
+  WIDTH="$W" HEIGHT="$H" FPS="$FPS" START_DELAY=1.0 END_HOLD=0.2 \
+  "$RECORDER/record.sh" "python3 pii_duel.py --scene traces/$SCENE --dilate $DILATE --link '$LINK'" "$OUT"
+
+RAW="$HERE/out/$OUT"; NOEXT="${OUT%.mp4}"
+if [ -f "$RECORDER/examples/duel/trim_lead.sh" ]; then
+  bash "$RECORDER/examples/duel/trim_lead.sh" "$RAW" "$HERE/out/.trim_$SCENE.mp4" \
+    && mv "$HERE/out/.trim_$SCENE.mp4" "$RAW"
+fi
+if [ -f "$RECORDER/examples/duel/outro.sh" ]; then
+  OW="$W" OH="$H" TITLE="privacy-filter.cpp" \
+  LINK1="$LINK" \
+  LINK2="real NER spans · stock ggml · see README Bench" \
+    bash "$RECORDER/examples/duel/outro.sh" "$RAW" "$HERE/out/${NOEXT}_final.mp4"
+  echo "-> $HERE/out/${NOEXT}_final.mp4"
+else
+  echo "-> $RAW"
+fi
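
The capture length in make.sh is computed by an inline Python snippet. To make that arithmetic easier to follow, here is a standalone sketch of the same formula. The function name `capture_seconds` is illustrative, and the token counts assume `n_tokens` of 8 160 and 131 040, which are the 8 192 and 131 072 targets rounded to whole 160-token tiles.

```python
import math


def capture_seconds(n_tokens, engines, dilate=1.0):
    """Mirror of the inline duration formula in make.sh."""
    # An engine that OOMs stops at oom_at_tokens; otherwise it runs all n_tokens.
    slowest = max((e.get("oom_at_tokens") or n_tokens) / e["tps"] for e in engines)
    # 1.0 s start delay + scaled run time + 1.4 + 0.7 + 2.0 s of trailing time,
    # using the same constants as the script.
    return math.ceil(1.0 + slowest * dilate + 1.4 + 0.7 + 2.0)


# Engine throughputs as written to engines.json by gen_corpus.py.
CPU = [{"tps": 2332}, {"tps": 304}]
GPU = [{"tps": 81105}, {"tps": 14154, "oom_at_tokens": 16384}]

print(capture_seconds(8160, CPU))              # SCENE=cpu
print(capture_seconds(131040, GPU, dilate=4))  # SCENE=gpu DILATE=4
```

Under these assumptions the CPU scene is bounded by HF's run (about 26.8 s), while in the GPU scene the slowest engine is ours, since HF stops at its 16 384-token OOM point.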
diff --git a/demo/out/pii_duel_cpu.gif b/demo/out/pii_duel_cpu.gif
diff --git a/demo/out/pii_duel_cpu.mp4 b/demo/out/pii_duel_cpu.mp4
diff --git a/demo/out/pii_duel_cpu_final.mp4 b/demo/out/pii_duel_cpu_final.mp4
diff --git a/demo/out/pii_duel_gpu.gif b/demo/out/pii_duel_gpu.gif
diff --git a/demo/out/pii_duel_gpu.mp4 b/demo/out/pii_duel_gpu.mp4
diff --git a/demo/out/pii_duel_gpu_final.mp4 b/demo/out/pii_duel_gpu_final.mp4