demo/redaction race bench by richiejp · Pull Request #5 · localai-org/privacy-filter.cpp

richiejp · 2026-06-18T09:53:11Z

docs: add redaction-race demo videos, move Bench to top of README
perf: use all cores on non-SMT CPUs (ARM/Apple) by default
build: cross-compile for arm64 via docker buildx + qemu
demo: on-device PII-scan video for Raspberry Pi 5 (CPU, q8)
bench: add --compile/--warmup/--iters flags to bench_torch.py
feat: publish q8 (experts-only) GGUFs for both models

The CPU thread default halved hardware_concurrency() to skip x86 Hyper-Threading siblings, but ARM (the Raspberry Pi 5's Cortex-A76) and Apple silicon have no SMT, so the /2 silently ran on only half the cores. Detect SMT via /sys/devices/system/cpu/smt/active and halve only when it is actually on; otherwise use all logical cores. On a Pi 5 this gives ~1.8x (2 -> 4 threads, 88% scaling). PF_NTHREADS still overrides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
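The SMT-aware default described above can be sketched in Python. This is not the project's code (that is C++ around std::thread::hardware_concurrency()); smt_active() and default_threads() are hypothetical names, and treating PF_NTHREADS as a positive integer string is an assumption.

```python
import os
from pathlib import Path
from typing import Optional

# Linux reports "1" here when SMT (e.g. Intel Hyper-Threading) is enabled, "0" otherwise.
SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")


def smt_active(path: Path = SMT_ACTIVE) -> bool:
    """Return True only if the kernel reports SMT as active."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # Missing file (macOS, non-Linux, or a kernel without SMT control):
        # treat as "no SMT" so every logical core is used.
        return False


def default_threads(logical: int, smt: bool, override: Optional[str] = None) -> int:
    """Pick a worker-thread count following the rule in the commit message."""
    if override:
        # Explicit override wins (assumed here to be a positive integer string).
        return max(1, int(override))
    if smt:
        # Assume 2-way SMT, as the original /2 did: skip the sibling threads.
        return max(1, logical // 2)
    # No SMT (ARM, Apple silicon): every logical core is a physical core.
    return max(1, logical)


if __name__ == "__main__":
    print(default_threads(os.cpu_count() or 1, smt_active(), os.environ.get("PF_NTHREADS")))
```

On a 4-core Pi 5 this yields 4 whether the sysfs file is missing or reads 0; on a 4-core/8-thread x86 machine with SMT on, it yields 4 as before.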

Static, self-contained aarch64 build targeting the Raspberry Pi 5 (armv8.2-a+dotprod+fp16). The ubuntu:24.04 base matches the Pi's glibc 2.39 ABI, so the binary drops straight on, and .dockerignore keeps the build context lean.

    docker buildx build --platform linux/arm64 -f docker/Dockerfile.arm64 \
      --target export --output type=local,dest=build/arm64 .

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Single-document NER scan (pii_scan.py): the document scrolls on the left, with PII redacted as the scan frontier passes it; the right pane is the live NER feed (category + byte range). All data is real: spans and token count come from pf-cli, and the 360 tok/s was measured on the Pi (Cortex-A76 @ 1.5 GHz, q8). The scan covers 1,360 tokens and finds 107 spans across 22 categories in 3.8 s; q8 output is span-for-span identical to f16. gen_scan.py builds the trace, and the README Bench section gains the clip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
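A minimal sketch of the redaction effect the demo animates, assuming spans are non-overlapping half-open byte ranges [start, end) and that a span is masked once the frontier has passed its end. The Span type, the [CATEGORY] mask, and the example categories are hypothetical; this is not gen_scan.py or pii_scan.py.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    category: str  # hypothetical label, e.g. "NAME"
    start: int     # byte offset, inclusive
    end: int       # byte offset, exclusive


def redact_upto(doc: bytes, spans: List[Span], frontier: int) -> bytes:
    """Mask every span the scan frontier has fully passed; leave the rest as-is.

    Assumes non-overlapping spans, so sorting by start also sorts by end.
    """
    out = bytearray()
    pos = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.end > frontier:
            break  # this span (and every later one) is not fully scanned yet
        out += doc[pos:span.start]
        out += b"[" + span.category.encode() + b"]"
        pos = span.end
    out += doc[pos:]
    return bytes(out)


if __name__ == "__main__":
    doc = b"Call Alice at 555-0100."
    spans = [Span("NAME", 5, 10), Span("PHONE", 14, 22)]
    for frontier in (0, 12, len(doc)):
        print(frontier, redact_upto(doc, spans, frontier).decode())
```

Sweeping frontier from 0 to len(doc) reproduces the left-pane behaviour: each span flips to its mask once the scan has moved past it. Offsets always refer to the original document, so the masks' differing lengths do not shift later spans.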

Adds --compile (torch.compile, off by default) for the HF-side comparison, --warmup for a real warm-up loop (default 5) so small, fast input lengths are not timed cold, and --iters for a configurable iteration count (default 10). Compilation happens only during warm-up, so compile time stays out of the timed loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
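The warm-up/timed-loop split can be sketched generically. This is not bench_torch.py: bench() is a hypothetical helper whose defaults mirror the commit's 5 warm-up and 10 timed iterations. torch.compile compiles lazily on the first call of the compiled module, which is why running warm-up calls first keeps compile time out of the measurements.

```python
import statistics
import time
from typing import Callable, List


def bench(fn: Callable[[], object], warmup: int = 5, iters: int = 10) -> List[float]:
    """Call fn `warmup` times untimed, then return `iters` wall-clock timings in seconds."""
    for _ in range(warmup):
        # Untimed: absorbs one-off costs such as lazy compilation and cache warm-up.
        fn()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    t = bench(lambda: sum(range(100_000)))
    print(f"median {statistics.median(t) * 1e3:.3f} ms over {len(t)} iters")
```

For GPU runs, a torch.cuda.synchronize() before each clock read would also be needed, since CUDA kernels launch asynchronously.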

publish_hf.py gains --quant {f16,q8}; the q8 file lands alongside the f16 in the same HF repo. Both model cards document the q8 variant (experts-only Q8_0, ~1.6 GB) with measured parity against f16 (top-1 agreement + KL), framed plainly: reducing bits is almost never a free lunch, so f16 stays the reference and q8 is a deliberate size/speed tradeoff to validate on your own data. The q8 GGUFs were produced from the f16 ones by requant_q8.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
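To make the q8 description concrete: in GGML's Q8_0 format, each block of 32 weights shares one f16 scale d = max|x| / 127 and stores q = round(x / d) as int8. The sketch below round-trips one block and computes the two parity metrics the model cards cite, top-1 agreement and mean KL(f16 || q8), from per-token logits. It is not requant_q8.py: it does not choose which tensors count as "experts" and does no GGUF I/O, and all function names are hypothetical.

```python
import math
import struct
from typing import List, Sequence, Tuple

QK8_0 = 32  # weights per Q8_0 block in GGML


def to_f16(x: float) -> float:
    """Round to IEEE 754 half precision, the storage type of the Q8_0 scale."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(block: Sequence[float]) -> Tuple[float, List[int]]:
    """Quantize one 32-weight block to (f16 scale, int8 codes)."""
    if len(block) != QK8_0:
        raise ValueError(f"expected {QK8_0} weights, got {len(block)}")
    amax = max(abs(v) for v in block)
    d = amax / 127.0
    inv = 1.0 / d if d else 0.0
    # Round half away from zero, like C roundf(); codes land in [-127, 127].
    qs = [int(math.copysign(math.floor(abs(v * inv) + 0.5), v)) for v in block]
    return to_f16(d), qs


def dequantize_q8_0(d: float, qs: Sequence[int]) -> List[float]:
    return [d * q for q in qs]


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def parity(ref: Sequence[Sequence[float]], quant: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Top-1 agreement and mean KL(ref || quant) over per-token logit vectors."""
    if len(ref) != len(quant) or not ref:
        raise ValueError("need the same, non-zero number of tokens")
    agree, kl_sum = 0, 0.0
    for r, q in zip(ref, quant):
        lp, lq = log_softmax(r), log_softmax(q)
        if max(range(len(r)), key=r.__getitem__) == max(range(len(q)), key=q.__getitem__):
            agree += 1
        # KL(P || Q) = sum_i p_i * (log p_i - log q_i)
        kl_sum += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return agree / len(ref), kl_sum / len(ref)


if __name__ == "__main__":
    block = [math.sin(i) for i in range(QK8_0)]
    d, qs = quantize_q8_0(block)
    err = max(abs(a - b) for a, b in zip(block, dequantize_q8_0(d, qs)))
    print(f"scale={d:.6f} max_abs_err={err:.6f}")
    print(parity([[2.0, 1.0, 0.0]], [[1.9, 1.1, 0.0]]))
```

Per-weight error is bounded by roughly d/2 plus the f16 rounding of d, which is why Q8_0 tends to track f16 closely; the KL and top-1 numbers are what show whether that holds on a given model and dataset.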

richiejp and others added 5 commits June 18, 2026 10:59

richiejp force-pushed the demo/redaction-race-bench branch from c34accb to 07a2a56 June 18, 2026 09:59

richiejp merged commit 61a30dc into master Jun 18, 2026
3 checks passed

richiejp merged 5 commits into master from demo/redaction-race-bench

richiejp commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

richiejp commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant