Merged
87 changes: 53 additions & 34 deletions README.md
@@ -15,6 +15,59 @@ Convert your own from a HF checkpoint with
[`scripts/convert.py`](scripts/convert.py) — self-contained, no llama.cpp
dependency (see [Convert](#convert)).

## Bench

A "redaction race" against stock HF Transformers on the same hardware:

**CPU — 8k-token document, real time.** Both finish; ours is 7.7× faster.

![CPU redaction race: privacy-filter.cpp vs HF Transformers on an 8k-token document](demo/out/pii_duel_cpu.gif)

**GPU — 131k-token document (4× slow-mo).** Ours runs to 131k tokens with flat
memory use; HF hits the 16 GiB memory wall and OOMs at ~16k.

![GPU redaction race: privacy-filter.cpp runs to 131k tokens while HF OOMs](demo/out/pii_duel_gpu.gif)

Full-quality MP4s: [CPU](demo/out/pii_duel_cpu_final.mp4) · [GPU](demo/out/pii_duel_gpu_final.mp4).

Single forward-pass latency and throughput vs stock HF Transformers (transformers
5.9, eager), Ryzen 9 7900 (12 threads) + RTX 5070 Ti, f16/fp16, matched token
counts ([scripts/bench_torch.py](scripts/bench_torch.py)). `tokens` is the input
sequence length classified in one forward pass (the whole document at once, not
generation); latency is `tokens ÷ tok/s`.

GPU — ours (Vulkan) vs HF (CUDA):

| tokens | HF (tok/s) | HF (ms) | ours (tok/s) | ours (ms) | speedup |
|-------:|-----------:|--------:|-------------:|----------:|--------:|
| 512 | 5 526 | 93 | 100 503 | 5 | 18× |
| 2 048 | 16 427 | 125 | 145 481 | 14 | 8.9× |
| 8 192 | 14 154 | 579 | 105 034 | 78 | 7.4× |
| 32 768 | OOM | OOM | 83 519 | 392 | — |
| 131 072 | OOM | OOM | 81 105 | 1 616 | — |

CPU — ours vs HF (fp32):

| tokens | HF (tok/s) | HF (s) | ours (tok/s) | ours (s) | speedup |
|-------:|-----------:|-------:|-------------:|---------:|--------:|
| 512 | 2 171 | 0.24 | 3 564 | 0.14 | 1.6× |
| 2 048 | 978 | 2.09 | 3 490 | 0.59 | 3.6× |
| 8 192 | 304 | 26.95 | 2 332 | 3.51 | 7.7× |
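
The latency and speedup columns follow directly from the measured throughputs. A minimal sketch that recomputes them for the CPU rows above; the helper names `latency_s` and `speedup` are illustrative and are not part of the repository:

```python
# Recompute the derived columns of the CPU table from the measured tok/s.
# The helper names below are illustrative, not part of the repository.

CPU_ROWS = [
    # (tokens, HF tok/s, ours tok/s), copied from the CPU table
    (512, 2171, 3564),
    (2048, 978, 3490),
    (8192, 304, 2332),
]


def latency_s(tokens: int, tok_per_s: float) -> float:
    """Latency of one forward pass: tokens divided by throughput."""
    return tokens / tok_per_s


def speedup(hf_tps: float, ours_tps: float) -> float:
    """Ratio of our throughput to HF's at the same token count."""
    return ours_tps / hf_tps


for tokens, hf_tps, ours_tps in CPU_ROWS:
    print(f"{tokens:>6} tokens: HF {latency_s(tokens, hf_tps):6.2f} s, "
          f"ours {latency_s(tokens, ours_tps):5.2f} s, "
          f"speedup {speedup(hf_tps, ours_tps):.1f}x")
```

The same two functions apply to the GPU table; multiply the latency by 1000 to get milliseconds.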

On CPU the speedup widens with length because HF's full self-attention is O(n²)
while ours is banded/near-linear, so our tok/s drops far less than HF's. On GPU
the ratio instead narrows from 18× at 512 tokens to 7.4× at 8 192, but HF OOMs
before 32 768 tokens while ours keeps running to 131 072. Our memory use stays
flat at ~2.8 GiB VRAM on a 16 GiB GPU. `release-portable` runtime-dispatches the
best ggml-cpu ISA (AVX-512 without `-march=native`); flash + banded attention
are on by default. See [docs/cpu-perf.md](docs/cpu-perf.md).
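
The O(n²) versus near-linear contrast can be made concrete by counting how many query-key scores each scheme evaluates. A rough sketch, assuming a symmetric band of half-width `w`; the value `w = 128` is a placeholder for illustration, not the model's actual window size:

```python
# Count attention scores for full vs banded attention.
# W is a hypothetical band half-width chosen for illustration only.

def full_scores(n: int) -> int:
    """Full self-attention scores every query against every key."""
    return n * n


def banded_scores(n: int, w: int) -> int:
    """Banded attention: query i only sees keys j with |i - j| <= w.

    The band is clipped at both ends of the sequence.
    """
    return sum(min(n - 1, i + w) - max(0, i - w) + 1 for i in range(n))


W = 128  # hypothetical half-width
for n in (512, 2048, 8192):
    ratio = full_scores(n) / banded_scores(n, W)
    print(f"n={n:>5}: full={full_scores(n):>10,} "
          f"banded={banded_scores(n, W):>10,} ratio={ratio:.1f}x")
```

For a fixed `w` the banded count grows roughly as `n * (2w + 1)`, so the ratio to the full count grows about linearly with `n`.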

Reproduce the numbers:

```sh
cmake --preset release-portable && cmake --build --preset release-portable -j
build/release-portable/bin/pf-bench model.gguf [cpu|vulkan] [iters] [lengths]
```

## Build

```sh
@@ -126,37 +179,3 @@ cmake --preset fuzz && cmake --build --preset fuzz -j
PF_GGUF=model.gguf ./build/fuzz/fuzz_tokenizer corpus_tok/
./build/fuzz/fuzz_gguf corpus_gguf/
```

## Bench

```sh
cmake --preset release-portable && cmake --build --preset release-portable -j
build/release-portable/bin/pf-bench model.gguf [cpu|vulkan] [iters] [lengths]
```

Forward tok/s vs stock HF Transformers (transformers 5.9, eager), Ryzen 9 7900 (12
threads) + RTX 5070 Ti, f16/fp16, matched token counts
([scripts/bench_torch.py](scripts/bench_torch.py)):

GPU — ours (Vulkan) vs HF (CUDA):

| tokens | HF | ours | × |
|-------:|--------:|--------:|-----:|
| 512 | 5 526 | 100 503 | 18× |
| 2 048 | 16 427 | 145 481 | 8.9× |
| 8 192 | 14 154 | 105 034 | 7.4× |
| 32 768 | OOM | 83 519 | — |
| 131072 | OOM | 81 105 | — |

CPU — ours vs HF (fp32):

| tokens | HF | ours | × |
|-------:|------:|------:|-----:|
| 512 | 2 171 | 3 564 | 1.6× |
| 2 048 | 978 | 3 490 | 3.6× |
| 8 192 | 304 | 2 332 | 7.7× |

Memory is flat ~2.8 GiB VRAM / ~3 GiB RAM to 131k tokens; HF OOMs past ~16k on a 16
GiB GPU. `release-portable` runtime-dispatches the best ggml-cpu ISA (AVX-512
without `-march=native`); flash + banded attention default on. See
[docs/cpu-perf.md](docs/cpu-perf.md).
4 changes: 4 additions & 0 deletions demo/.gitignore
@@ -0,0 +1,4 @@
__pycache__/
*.pyc
out/*.tmp
out/.trim*
1 change: 1 addition & 0 deletions demo/document.txt
@@ -0,0 +1 @@
Thanks for getting back to me so quickly about the prior-authorization request. I want to make sure everything is in order before the review board meets next week, since the last batch got held up over a missing signature. The patient, John Doe, has been with our practice for years and is in good health, so we expect this one to be routine. If anything looks incomplete on your end, please call our billing office at +1 555-0112 rather than replying here. I have attached the updated history, and the relevant appointment was on 2026-05-12. If you still need the original signed consent form, you can email me directly at jane.roe@northside-clinic.org. Appreciate you helping get claim 4471 across the line before the deadline.
73 changes: 73 additions & 0 deletions demo/gen_corpus.py
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
"""Generate the scrolling-demo corpora for pii_duel.py.

The document is a readable, PII-sparse prose paragraph (demo/document.txt) tiled
to the target token count -- the same tile-to-length method the benchmark uses
(scripts/bench_torch.py) -- so "tokens" on screen means the same thing as in the
README. One tile = 160 tokens (pf-cli --tok-batch), giving an exact char<->token
mapping. Entity spans are the REAL pf-cli --classify output for the tile,
replicated per tile (each tile is identical, so the model finds the same spans).

Writes demo/traces/<scene>/{content.json,engines.json} for scene in cpu,gpu.
"""
import json
from pathlib import Path

HERE = Path(__file__).resolve().parent
SEED = open(HERE / "document.txt").read().rstrip("\n") + "\n\n" # 731 chars
TOKENS_PER_TILE = 160 # pf-cli --tok-batch

# Real spans from: pf-cli --classify <model> 0.5 <<< "<tile>" (offsets into SEED).
SEED_ENTITIES = [
    {"type": "person", "start": 236, "end": 244},  # John Doe
    {"type": "phone", "start": 419, "end": 430},  # +1 555-0112
    {"type": "date", "start": 531, "end": 541},  # 2026-05-12
    {"type": "email", "start": 624, "end": 653},  # jane.roe@northside-clinic.org
]

L = len(SEED)


def build(target_tokens):
    tiles = round(target_tokens / TOKENS_PER_TILE)
    doc = SEED * tiles
    ents = []
    for t in range(tiles):
        off = t * L
        for e in SEED_ENTITIES:
            ents.append({"type": e["type"], "start": e["start"] + off, "end": e["end"] + off})
    return doc, ents, tiles * TOKENS_PER_TILE


def write_scene(scene, target_tokens, note, engines):
    doc, ents, n_tokens = build(target_tokens)
    d = Path(__file__).resolve().parent / "traces" / scene
    d.mkdir(parents=True, exist_ok=True)
    content = {"document": doc, "n_tokens": n_tokens, "note": note, "entities": ents}
    json.dump(content, open(d / "content.json", "w"), ensure_ascii=False)
    json.dump(engines, open(d / "engines.json", "w"), indent=2, ensure_ascii=False)
    chars = len(doc)
    print(f"{scene}: {len(ents):,} entities, {n_tokens:,} tokens, {chars:,} chars "
          f"({chars/n_tokens:.2f} chars/tok)")
    for e in engines:
        if e.get("oom_at_tokens"):
            print(f"  {e['label']:<20} OOM at {e['oom_at_tokens']:,} tok "
                  f"(~{e['oom_at_tokens']/e['tps']:.2f}s) @ {e['tps']:,} tok/s")
        else:
            print(f"  {e['label']:<20} {n_tokens/e['tps']:.2f}s @ {e['tps']:,} tok/s")


# CPU: 8k doc, both finish (README CPU 8 192 row: ours 2 332, HF 304 tok/s).
write_scene("cpu", 8192, "8k-token document · CPU (Ryzen 9 7900)", [
    {"key": "ours", "label": "privacy-filter.cpp", "device": "CPU", "tps": 2332, "hero": True},
    {"key": "hf", "label": "HF Transformers", "device": "CPU", "tps": 304, "hero": False},
])

# GPU: 132k doc. Ours (Vulkan) runs the whole thing (README 131 072 row: 81 105
# tok/s). HF (CUDA) OOMs past ~16k on a 16 GiB GPU -> dies at 16 384 tokens; use
# its measured 8k throughput (14 154 tok/s) as the rate up to the wall.
write_scene("gpu", 131072, "132k-token document · GPU (RTX 5070 Ti, 16 GiB)", [
    {"key": "ours", "label": "privacy-filter.cpp", "device": "GPU · Vulkan", "tps": 81105, "hero": True},
    {"key": "hf", "label": "HF Transformers", "device": "GPU · CUDA", "tps": 14154,
     "hero": False, "oom_at_tokens": 16384},
])
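
The tile-and-shift step in `build` can be checked in isolation: every replicated span should slice the same text out of the tiled document as the original span does out of the seed. A small sketch with a stand-in seed string and hand-counted offsets; `tile` is a hypothetical helper written for this check, not a function in `gen_corpus.py`:

```python
# Stand-in seed and spans for illustration; the real script reads
# demo/document.txt and uses spans reported by pf-cli.
SEED = "Call John Doe at +1 555-0112.\n\n"
SEED_ENTITIES = [
    {"type": "person", "start": 5, "end": 13},   # John Doe
    {"type": "phone", "start": 17, "end": 28},   # +1 555-0112
]


def tile(seed, entities, tiles):
    """Repeat the seed and shift each span by one seed length per tile."""
    doc = seed * tiles
    shifted = [
        {"type": e["type"],
         "start": e["start"] + t * len(seed),
         "end": e["end"] + t * len(seed)}
        for t in range(tiles)
        for e in entities
    ]
    return doc, shifted


doc, ents = tile(SEED, SEED_ENTITIES, 3)
expected = {"person": "John Doe", "phone": "+1 555-0112"}
for e in ents:
    # Each shifted span must still cover the original entity text.
    assert doc[e["start"]:e["end"]] == expected[e["type"]]
print(f"{len(ents)} spans verified across {len(doc)} chars")
```

The check relies on every tile being byte-identical, which is why the seed has a fixed trailing separator.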
53 changes: 53 additions & 0 deletions demo/make.sh
@@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Build a privacy-filter "redaction race" demo video end to end: render the
# pii_duel scrolling TUI for a scene, record it with recorder-for-agents, trim
# the dead lead-in, append a branding outro.
#
# SCENE=cpu ./make.sh # 8k-token doc, CPU, real time, both finish
# SCENE=gpu DILATE=4 ./make.sh # 132k-token doc, GPU, HF OOMs
#
# env: RECORDER, SCENE(cpu|gpu), DILATE, WIDTH/HEIGHT/FPS/FONTSIZE, LINK
set -euo pipefail
HERE=$(cd "$(dirname "$0")" && pwd)
RECORDER=${RECORDER:-/home/rich/python/recorder-for-agents}
SCENE=${SCENE:-cpu}
OUT=${1:-pii_duel_${SCENE}.mp4}
DILATE=${DILATE:-1}
W=${WIDTH:-1280}; H=${HEIGHT:-720}; FS=${FONTSIZE:-16}; FPS=${FPS:-30}
LINK=${LINK:-github.com/richiejp/privacy-filter.cpp}
SDIR="$HERE/traces/$SCENE"

[ -d "$SDIR" ] || { echo "no scene at $SDIR (run gen_corpus.py)"; exit 1; }
[ -x "$RECORDER/record.sh" ] || { echo "recorder not found at $RECORDER"; exit 1; }

# capture length: start delay + slowest engine (scaled) + settle + slice of card
DUR=$(python3 - "$SDIR" "$DILATE" <<'PY'
import json, sys, math
from pathlib import Path
d = Path(sys.argv[1]); dil = float(sys.argv[2])
content = json.load(open(d / "content.json")); eng = json.load(open(d / "engines.json"))
n = content["n_tokens"]
slow = max((e.get("oom_at_tokens") or n) / e["tps"] for e in eng)
print(int(math.ceil(1.0 + slow * dil + 1.4 + 0.7 + 2.0)))
PY
)
echo "[make] scene=$SCENE ${W}x${H}@${FPS} fs=${FS} dilate=${DILATE} duration=${DUR}s -> out/$OUT"

WORK="$HERE" BG="#0d1117" FG="#d7dde5" FONTSIZE="$FS" DURATION="$DUR" \
  WIDTH="$W" HEIGHT="$H" FPS="$FPS" START_DELAY=1.0 END_HOLD=0.2 \
  "$RECORDER/record.sh" "python3 pii_duel.py --scene traces/$SCENE --dilate $DILATE --link '$LINK'" "$OUT"

RAW="$HERE/out/$OUT"; NOEXT="${OUT%.mp4}"
if [ -f "$RECORDER/examples/duel/trim_lead.sh" ]; then
  bash "$RECORDER/examples/duel/trim_lead.sh" "$RAW" "$HERE/out/.trim_$SCENE.mp4" \
    && mv "$HERE/out/.trim_$SCENE.mp4" "$RAW"
fi
if [ -f "$RECORDER/examples/duel/outro.sh" ]; then
  OW="$W" OH="$H" TITLE="privacy-filter.cpp" \
    LINK1="github.com/richiejp/privacy-filter.cpp" \
    LINK2="real NER spans · stock ggml · see README Bench" \
    bash "$RECORDER/examples/duel/outro.sh" "$RAW" "$HERE/out/${NOEXT}_final.mp4"
  echo "-> $HERE/out/${NOEXT}_final.mp4"
else
  echo "-> $RAW"
fi
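
As a worked example of the capture-length formula in the `DUR` heredoc, the sketch below applies it to the two scenes that `gen_corpus.py` writes. The function name `capture_seconds` is hypothetical. The token counts are what `build` yields after rounding to whole 160-token tiles: 51 tiles for the CPU scene and 819 for the GPU scene.

```python
import math


def capture_seconds(n_tokens, engines, dilate):
    """Same arithmetic as the DUR heredoc in make.sh.

    An engine that OOMs stops at oom_at_tokens; otherwise it runs all
    n_tokens. The slowest engine sets the scaled run time, and the fixed
    terms are the constants used in the script.
    """
    slow = max((e.get("oom_at_tokens") or n_tokens) / e["tps"] for e in engines)
    return int(math.ceil(1.0 + slow * dilate + 1.4 + 0.7 + 2.0))


CPU_ENGINES = [{"tps": 2332}, {"tps": 304}]
GPU_ENGINES = [{"tps": 81105}, {"tps": 14154, "oom_at_tokens": 16384}]

print(capture_seconds(51 * 160, CPU_ENGINES, 1.0))   # 32
print(capture_seconds(819 * 160, GPU_ENGINES, 4.0))  # 12
```

In the GPU scene the slowest engine is ours, not HF: HF stops at 16 384 tokens after about 1.16 s, while ours takes about 1.62 s to finish the whole document.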
Binary file added demo/out/pii_duel_cpu.gif
Binary file added demo/out/pii_duel_cpu.mp4
Binary file not shown.
Binary file added demo/out/pii_duel_cpu_final.mp4
Binary file not shown.
Binary file added demo/out/pii_duel_gpu.gif
Binary file added demo/out/pii_duel_gpu.mp4
Binary file not shown.
Binary file added demo/out/pii_duel_gpu_final.mp4
Binary file not shown.