demo/redaction race bench#5
Merged
richiejp (Contributor) commented Jun 18, 2026
- docs: add redaction-race demo videos, move Bench to top of README
- perf: use all cores on non-SMT CPUs (ARM/Apple) by default
- build: cross-compile for arm64 via docker buildx + qemu
- demo: on-device PII-scan video for Raspberry Pi 5 (CPU, q8)
- bench: add --compile/--warmup/--iters flags to bench_torch.py
- feat: publish q8 (experts-only) GGUFs for both models
The CPU thread default halved hardware_concurrency() to skip x86 Hyper-Threading siblings, but ARM (Raspberry Pi 5, Cortex-A76) and Apple silicon have no SMT, so the /2 silently ran on half the cores. Detect SMT via /sys/devices/system/cpu/smt/active and only halve when it is actually on; otherwise use all logical cores. On a Pi 5 this is ~1.8x (2 -> 4 threads, 88% scaling). PF_NTHREADS still overrides.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
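The selection logic described above can be sketched in Python for illustration. This is not the project's code (which calls hardware_concurrency()); the function names, the treatment of a missing sysfs file as "no SMT", and the handling of a malformed PF_NTHREADS value are all assumptions.

```python
from pathlib import Path
from typing import Optional

# Linux sysfs node named in the commit message; absent on other OSes.
SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")


def smt_is_active(path: Path = SMT_ACTIVE) -> bool:
    """Return True only when sysfs reports SMT as on ("1")."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # Assumption: a missing or unreadable file means no SMT.
        return False


def default_threads(logical: int, smt_active: bool,
                    override: Optional[str] = None) -> int:
    """Pick the CPU thread count (hypothetical helper).

    An explicit override wins; otherwise halve the logical core count
    only when SMT is on, and use every logical core when it is off.
    """
    if override:
        try:
            n = int(override)
            if n > 0:
                return n
        except ValueError:
            pass  # Assumption: ignore a malformed override and fall through.
    if smt_active:
        return max(1, logical // 2)
    return max(1, logical)
```

Under these assumptions, a typical call would be default_threads(os.cpu_count() or 1, smt_is_active(), os.environ.get("PF_NTHREADS")): a 4-core Pi 5 with no SMT gets 4 threads, while an 8-thread SMT x86 part gets 4.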
Static, self-contained aarch64 build targeting the Raspberry Pi 5
(armv8.2-a+dotprod+fp16); base ubuntu:24.04 matches the Pi's glibc 2.39 ABI so
the binary drops straight on. .dockerignore keeps the context lean.
    docker buildx build --platform linux/arm64 -f docker/Dockerfile.arm64 \
        --target export --output type=local,dest=build/arm64 .
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single-document NER scan (pii_scan.py): the document scrolls on the left with PII redacted as the scan frontier passes; the right pane is the live NER feed (category + byte range). All data is real -- spans and token count from pf-cli, 360 tok/s measured on the Pi (Cortex-A76 @ 1.5 GHz, q8). 1,360 tokens, 107 spans across 22 categories in 3.8 s; q8 output is span-for-span identical to f16. gen_scan.py builds the trace; README Bench gains the clip.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
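To make "PII redacted as the scan frontier passes" concrete, here is a minimal sketch of byte-range redaction. It is not taken from pii_scan.py or pf-cli: the (category, start, end) tuple layout, half-open byte offsets, the "[CATEGORY]" placeholder, the category names, and the skip-on-overlap rule are all assumptions.

```python
from typing import Iterable, Optional, Tuple

Span = Tuple[str, int, int]  # (category, start byte, end byte), end exclusive (assumed)


def redact(doc: bytes, spans: Iterable[Span],
           frontier: Optional[int] = None) -> bytes:
    """Replace each span with a "[CATEGORY]" placeholder.

    When `frontier` is given, only spans that end at or before that byte
    offset are redacted, mimicking a scan that has not reached the rest.
    """
    out = bytearray()
    pos = 0
    for category, start, end in sorted(spans, key=lambda s: s[1]):
        if frontier is not None and end > frontier:
            break  # the scan frontier has not passed this span yet
        if start < pos:
            continue  # assumption: drop a span that overlaps an earlier one
        out += doc[pos:start]
        out += f"[{category}]".encode()
        pos = end
    out += doc[pos:]
    return bytes(out)
```

Offsets are applied to the raw bytes rather than to decoded text, so multi-byte UTF-8 characters before a span do not shift the range.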
Adds torch.compile (--compile, off by default) for the HF-side comparison, a real warm-up loop (--warmup, default 5) so small/fast lengths are not timed cold, and configurable iterations (--iters, default 10). Compile time stays out of the timed loop: it is absorbed by the warm-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
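The warm-up/timed-loop split can be sketched with the standard library alone. This is an illustration of the pattern, not the contents of bench_torch.py; the bench function and its signature are hypothetical, with defaults mirroring the ones stated above.

```python
import time
from typing import Callable, List


def bench(fn: Callable[[], object], warmup: int = 5, iters: int = 10) -> List[float]:
    """Run `fn` untimed `warmup` times, then return `iters` wall-clock timings.

    One-off costs (caches, lazy initialisation, and compilation when `fn`
    wraps a compiled model) land in the warm-up calls, not in the results.
    """
    for _ in range(warmup):
        fn()  # untimed
    times: List[float] = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times
```

Returning the raw per-iteration timings leaves the choice of summary (median, minimum, mean) to the caller. When timing GPU work, `fn` would also need to synchronise the device before returning, since kernel launches are asynchronous.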
publish_hf.py gains --quant {f16,q8}; the q8 file lands alongside the f16 in the
same HF repo. Both model cards document the q8 variant (experts-only Q8_0, ~1.6
GB) with measured f16-parity (top-1 agreement + KL), framed plainly: reducing
bits is almost never a free lunch -- f16 stays the reference, q8 is a deliberate
size/speed tradeoff to validate on your own data. The q8 GGUFs were produced by
requant_q8.py from the f16.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
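For readers who want to validate q8 on their own data, the two parity metrics named above (top-1 agreement and KL) can be computed along these lines. This is a sketch, not the script behind the model-card numbers: the function names, the per-position list-of-logits input, and the KL direction (f16 as reference, KL(f16 || q8)) are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def parity(ref_logits: Sequence[Sequence[float]],
           test_logits: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (top-1 agreement, mean KL(ref || test)) over all positions."""
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("need two non-empty logit sequences of equal length")
    agree = 0
    kl_sum = 0.0
    for ref, test in zip(ref_logits, test_logits):
        top_ref = max(range(len(ref)), key=ref.__getitem__)
        top_test = max(range(len(test)), key=test.__getitem__)
        agree += int(top_ref == top_test)
        lp, lq = log_softmax(ref), log_softmax(test)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        kl_sum += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    n = len(ref_logits)
    return agree / n, kl_sum / n
```

Identical logits give an agreement of 1.0 and a KL of 0; any drift introduced by quantization shows up as lower agreement, higher KL, or both.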
c34accb to 07a2a56