localai-org · richiejp · Jun 16, 2026 · Jun 14, 2026 · Jun 15, 2026 · Jun 15, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -54,6 +54,9 @@ jobs:
         id: cache
         uses: actions/cache@v4
         with:
+          # HF safetensors download + reference fixtures only. The GGUF is
+          # (re)converted from these on every run, so it always reflects the
+          # current scripts/convert.py rather than a stale cached artifact.
           path: |
             ~/models/privacy-filter-multilingual
             tests/fixtures/hf
@@ -63,24 +66,38 @@ jobs:
           python3 -m venv .venv
           .venv/bin/pip install -q torch --index-url https://download.pytorch.org/whl/cpu
           .venv/bin/pip install -q -r scripts/requirements.txt
-      - name: fetch model + convert + dump fixtures
+      - name: fetch model + dump fixtures
         if: steps.cache.outputs.cache-hit != 'true'
         run: |
           .venv/bin/pip install -q "huggingface_hub[cli]"
           .venv/bin/hf download OpenMed/privacy-filter-multilingual \
             --local-dir ~/models/privacy-filter-multilingual
-          # GGUF conversion lives in the llama.cpp fork (same files serve both
-          # engines); the cache is seeded with pf-rope2-f16.gguf + pf-f32.gguf
-          # once, manually. Without them the model-label tests skip (exit 77).
           .venv/bin/python scripts/hf_dump.py \
             --model ~/models/privacy-filter-multilingual --out tests/fixtures/hf
+      - name: convert HF -> GGUF
+        run: |
+          # Conversion is part of the tested path: the parity suite below gates
+          # these freshly converted GGUFs against the HF reference fixtures, so a
+          # scripts/convert.py regression fails CI. ~/ggufs is deliberately
+          # outside the cached paths so every run reconverts with the current
+          # script. (The f16 is the shipped artifact, published at
+          # huggingface.co/LocalAI-io; the f32 adds the tight exact-rotation
+          # parity gate, cos >= 0.99999, that isolates conversion errors from
+          # f16 rounding.)
+          mkdir -p ~/ggufs
+          .venv/bin/python scripts/convert.py \
+            --model ~/models/privacy-filter-multilingual \
+            --outfile ~/ggufs/pf-rope2-f16.gguf --outtype f16
+          .venv/bin/python scripts/convert.py \
+            --model ~/models/privacy-filter-multilingual \
+            --outfile ~/ggufs/pf-f32.gguf --outtype f32
       - name: build
         run: cmake --preset release -DGGML_NATIVE=OFF && cmake --build --preset release -j
       - name: parity suite
-        run: PF_GGUF_DIR=~/models/privacy-filter-multilingual ctest --preset release -L model
+        run: PF_GGUF_DIR=~/ggufs ctest --preset release -L model
       - name: fuzz smoke (5 min/target)
         run: |
           cmake --preset fuzz && cmake --build --preset fuzz -j --target fuzz_tokenizer fuzz_gguf
-          PF_GGUF=~/models/privacy-filter-multilingual/pf-rope2-f16.gguf \
+          PF_GGUF=~/ggufs/pf-rope2-f16.gguf \
             ./build/fuzz/fuzz_tokenizer -max_total_time=300 -max_len=4096
           ./build/fuzz/fuzz_gguf -max_total_time=300 -max_len=8192
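The workflow comment above describes the f32 gate as a tight cosine threshold (cos >= 0.99999) against the HF reference fixtures. As a rough illustration of what such a gate checks, here is a minimal Python sketch. The real gate lives in the C++ `ctest` suite, so the function names `cosine`, `argmax`, and `parity_ok` and the list-of-lists logit layout are assumptions made for this example only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length float sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("zero vector")
    return dot / (na * nb)


def argmax(row):
    """Index of the largest value in a non-empty sequence."""
    return max(range(len(row)), key=row.__getitem__)


def parity_ok(ref_logits, got_logits, min_cos=0.99999):
    """Hypothetical gate: per-token argmax must match and the flattened
    logits must reach the cosine threshold.

    ref_logits / got_logits are per-token lists of label logits
    (an assumed layout, not the fixture format used by the repo).
    """
    if len(ref_logits) != len(got_logits):
        return False
    same_labels = all(argmax(r) == argmax(g)
                      for r, g in zip(ref_logits, got_logits))
    flat_ref = [v for row in ref_logits for v in row]
    flat_got = [v for row in got_logits for v in row]
    return same_labels and cosine(flat_ref, flat_got) >= min_cos
```

Checking the argmax separately from the cosine matters for a classifier: two logit tensors can be nearly parallel overall while a single token still flips its label.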
diff --git a/README.md b/README.md
@@ -8,8 +8,12 @@ PII/NER entity spans with exact UTF-8 byte offsets. Stock upstream ggml — no
 patches; the model's YaRN `truncate=false` frequencies are computed at load
 time and fed to `ggml_rope_ext` as `freq_factors`.
 
-Uses the same GGUF files as the llama.cpp-based path (arch
-`openai-privacy-filter`, converted by the llama.cpp-fork converter).
+Pre-converted GGUFs (arch `openai-privacy-filter`):
+[`LocalAI-io/privacy-filter-multilingual-GGUF`](https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF)
+and [`LocalAI-io/privacy-filter-GGUF`](https://huggingface.co/LocalAI-io/privacy-filter-GGUF).
+Convert your own from a HF checkpoint with
+[`scripts/convert.py`](scripts/convert.py) — self-contained, no llama.cpp
+dependency (see [Convert](#convert)).
 
 ## Build
 
@@ -34,6 +38,23 @@ echo "Contact John Doe at jdoe@example.com" | \
   build/release/pf-cli --classify model.gguf 0.5       # [cpu|cuda|vulkan]
 ```
 
+## Convert
+
+Pre-converted GGUFs are linked above. To convert an `OpenAIPrivacyFilter` HF
+checkpoint yourself:
+
+```sh
+pip install -r scripts/requirements.txt   # torch + safetensors + gguf
+python scripts/convert.py --model <hf-model-dir> --outfile model-f16.gguf
+python scripts/convert.py --model <hf-model-dir> --outfile model-f32.gguf --outtype f32
+```
+
+[`scripts/convert.py`](scripts/convert.py) reads `config.json` +
+`model.safetensors` + `tokenizer.json` and emits the GGUF directly — it does
+**not** depend on llama.cpp or its converter. The nightly CI converts the model
+this way and gates the result against the HF reference logits, so the converter
+stays in parity (`.github/workflows/ci.yml`).
+
 ## C API
 
 Flat C API in [`include/pf.h`](include/pf.h): an opaque `pf_ctx` handle and
@@ -80,9 +101,11 @@ pf_free(ctx);
 
 ```sh
 ctest --preset debug -LE model            # fast suite, sanitizers, no assets
-# reference fixtures (one-time, pinned env: scripts/requirements.txt):
+# reference fixtures + GGUF (one-time, pinned env: scripts/requirements.txt):
 python scripts/hf_dump.py --model <hf-model-dir> --out tests/fixtures/hf
-PF_GGUF_DIR=<dir-with-ggufs> ctest --preset release          # full parity
+python scripts/convert.py --model <hf-model-dir> --outfile ggufs/pf-rope2-f16.gguf
+python scripts/convert.py --model <hf-model-dir> --outfile ggufs/pf-f32.gguf --outtype f32
+PF_GGUF_DIR=ggufs ctest --preset release                     # full parity (f16 + tight f32)
 PF_DEVICE=vulkan PF_GGUF_DIR=... ctest --preset release -L model   # on GPU
 ```
 

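Before pointing `PF_GGUF_DIR` at freshly converted files, it can help to confirm that each output is at least a structurally plausible GGUF. The sketch below reads only the fixed-size header (magic, version, tensor count, metadata key/value count) and assumes the little-endian GGUF v2+ layout with 64-bit counts. `read_gguf_header` is a hypothetical helper for illustration and is not part of `scripts/convert.py`.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, kv_count) for a GGUF file.

    Assumes the little-endian v2+ header layout; raises ValueError when
    the file is too short or does not start with the GGUF magic.
    """
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path}: not a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:HEADER_SIZE])
    return version, n_tensors, n_kv
```

A header check like this only catches truncated or misnamed files; it says nothing about tensor contents, which is what the parity suite is for.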
diff --git a/fuzz/fuzz_tokenizer.cpp b/fuzz/fuzz_tokenizer.cpp
@@ -4,7 +4,9 @@
 //   - encode: valid ids, start < end, non-decreasing starts, every byte
 //     covered (offsets are widened to UTF-8 char boundaries, so tokens may
 //     overlap on a multibyte char but never leave gaps)
-// With PF_GGUF set the full encode path runs; otherwise pretokenize only.
+// With PF_GGUF set to a loadable GGUF, the full encode path runs; with it
+// unset, only pretokenize runs. PF_GGUF set to a missing file is a hard error
+// (exit 1): setting it requests full-encode fuzzing, so the file must exist.
 #include "tokenizer.h"
 
 #include <cstdio>
@@ -19,6 +21,17 @@ extern "C" int LLVMFuzzerInitialize(int *, char ***) {
         std::fprintf(stderr, "fuzz_tokenizer: PF_GGUF unset, pretokenize-only mode\n");
         return 0;
     }
+    // PF_GGUF was set, so full-encode fuzzing was requested: the GGUF is a hard
+    // requirement. Exit cleanly (exit 1, not abort -> no core dump) when it's
+    // missing, so CI fails loudly instead of silently fuzzing pretokenize-only.
+    // CI generates it with scripts/convert.py; a missing file means misconfig.
+    // A file that exists but won't load is a real bug, so that path aborts below.
+    if (FILE * f = std::fopen(gguf, "rb")) {
+        std::fclose(f);
+    } else {
+        std::fprintf(stderr, "fuzz_tokenizer: PF_GGUF set but missing: %s\n", gguf);
+        std::exit(1);
+    }
     ggml_context * gctx = nullptr;
     gguf_init_params params = { /*no_alloc =*/ true, &gctx };
     gguf_context * g = gguf_init_from_file(gguf, params);

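The header comment of `fuzz_tokenizer.cpp` lists the encode invariants the fuzzer asserts: valid ids, `start < end`, non-decreasing starts, and every byte covered, with overlap allowed on a multibyte character. A minimal Python sketch of the same checks follows. The `(id, start, end)` tuple layout with an exclusive `end`, and the name `check_encode_invariants`, are assumptions for illustration; the actual checks are implemented in C++.

```python
def check_encode_invariants(tokens, n_bytes, vocab_size):
    """Return True when a token list satisfies the encode invariants.

    tokens: iterable of (token_id, start, end) byte offsets into the
    input, with end exclusive (assumed layout).
    """
    covered = 0      # bytes [0, covered) are covered so far
    prev_start = 0
    for tok_id, start, end in tokens:
        if not 0 <= tok_id < vocab_size:
            return False            # invalid id
        if not start < end <= n_bytes:
            return False            # empty span or span past the input
        if start < prev_start:
            return False            # starts must be non-decreasing
        if start > covered:
            return False            # gap: some byte is left uncovered
        covered = max(covered, end)
        prev_start = start
    return covered == n_bytes
```

Tracking the furthest covered byte, rather than requiring `start == previous end`, is what lets two tokens share a widened multibyte character without being reported as an error.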
diff --git a/model-cards/privacy-filter-multilingual.md b/model-cards/privacy-filter-multilingual.md
@@ -0,0 +1,161 @@
+---
+license: apache-2.0
+base_model: OpenMed/privacy-filter-multilingual
+base_model_relation: quantized
+pipeline_tag: token-classification
+library_name: gguf
+tags:
+  - gguf
+  - privacy-filter.cpp
+  - llama-cpp
+  - localai
+  - token-classification
+  - pii
+  - ner
+  - privacy
+  - redaction
+  - multilingual
+  - openai-privacy-filter
+language:
+  - ar
+  - bn
+  - de
+  - en
+  - es
+  - fr
+  - hi
+  - it
+  - ja
+  - ko
+  - nl
+  - pt
+  - te
+  - tr
+  - vi
+  - zh
+---
+
+# privacy-filter-multilingual — GGUF (F16)
+
+GGUF conversion of [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual),
+a multilingual PII **token-classification** model (a fine-tune of
+[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)). It labels every
+token with a BIOES tag over **54 PII categories (217 classes)** across **16 languages**, so
+it can be served locally with **no Python** as the encoder/NER tier of a PII redactor.
+
+For the full model description, label space, evaluation, limitations, and citations, see the
+**[source model card](https://huggingface.co/OpenMed/privacy-filter-multilingual)** — this
+card only covers the GGUF packaging and how to run it.
+
+## Runtimes
+
+This GGUF uses a **custom architecture, `openai-privacy-filter`**, that is not (yet) part of
+upstream llama.cpp. It runs on:
+
+1. **[privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp)** *(recommended)* —
+   a small standalone GGML engine for exactly this model family, on **stock upstream ggml with
+   no patches** (CPU / CUDA / Vulkan). This is the reference runtime and what the parity numbers
+   below are measured against.
+
+   ```sh
+   # build (see the repo README for CUDA/Vulkan)
+   cmake --preset release && cmake --build --preset release -j
+   # run
+   echo "Contact John Doe at jdoe@example.com" | \
+     build/release/pf-cli --classify privacy-filter-multilingual-f16.gguf 0.5
+   ```
+
+   It exposes a flat C API (`pf_load` / `pf_classify` → entity spans with UTF-8 byte offsets;
+   `pf_tokenize` / `pf_logits`) shaped for FFI — see the repo README.
+
+2. **[LocalAI](https://github.com/mudler/LocalAI)** — install from the model gallery; LocalAI
+   serves it behind the gRPC `TokenClassify` RPC and runs the constrained BIOES Viterbi decode,
+   returning entity spans. LocalAI drives it through the **`privacy-filter` backend** (which
+   wraps privacy-filter.cpp); older builds used a llama.cpp-patched path. The model is **not** a
+   chat/completion model — it is a PII detector that other models opt into.
+
+   ```bash
+   local-ai models install privacy-filter-multilingual
+   ```
+
+   The gallery entry carries the detection policy in a `pii_detection:` block (default: mask
+   everything detected; block credentials / financial-secrets / crypto). Other models opt in by
+   listing it under `pii.detectors`:
+
+   ```yaml
+   # any chat or cloud-proxy model — opt in and reference the detector(s)
+   name: my-assistant
+   pii:
+     enabled: true
+     detectors:
+       - privacy-filter-multilingual
+   ```
+
+3. **llama.cpp — only with a patch.** Stock `llama.cpp`, `llama-cpp-python`, Ollama, and
+   LM Studio will **fail to load** this file (`unknown model architecture:
+   'openai-privacy-filter'`). The arch can be added with carry-patches (TOKEN_CLS pooling, the
+   architecture + HF→GGUF converter, the bidirectional banded-attention graph, and an all-SWA
+   no-cache mask fix; TOKEN_CLS pooling tracks the still-open
+   [PR #19725](https://github.com/ggml-org/llama.cpp/pull/19725)). Until that support lands
+   upstream, the patched path is carried by LocalAI; `privacy-filter.cpp` above is the
+   patch-free alternative.
+
+> **Pooling note (llama.cpp path only):** the model must be loaded with **TOKEN_CLS pooling**
+> (the GGUF's default). If you drive `llama-embedding` directly for testing, do **not** pass
+> `--pooling none` — that overrides the default and yields raw hidden states instead of label
+> logits. privacy-filter.cpp handles this automatically.
+
+## Files
+
+| File | Precision | Size | Notes |
+|---|---|---|---|
+| `privacy-filter-multilingual-f16.gguf` | F16 | ~2.7 GB | 217 `classifier.output_labels`; `pooling_type = TOKEN_CLS`. Validated artifact. |
+
+F16 is the validated, shipped precision. Quantized variants are deferred until they can be
+evaluated with a **task metric (span-F1 per language) + KL-vs-F16** — perplexity is meaningless
+for a classifier, so a naively-quantized GGUF is not published here yet.
+
+## Architecture & conversion
+
+gpt-oss-style sparse **MoE** (8 layers, `d_model=640`, 128 experts, top-4 routing, ~50M active
+per token), **bidirectional banded attention** (symmetric sliding window 128, attention sinks
+retained), **interleaved (GPT-J) RoPE** with YaRN (θ=150000, factor 32), o200k (`o200k_base`)
+tokenizer, and a 217-way token-classification head (`score` → `cls.output`).
+
+The conversion reproduces the HF reference **to F16 precision**: token-for-token argmax match on
+the parity prompt set, **full-logit cosine = 1.0**, and every layer's residual-stream cosine = 1.0
+(relerr ≈ 2e-4, i.e. F16 rounding). The two load-bearing conversion choices — the expert
+`gate_up` `chunk(2)` split and the `n_swa = 2·sliding_window` window mapping — are both
+confirmed by that parity. privacy-filter.cpp re-derives the YaRN `truncate=false` frequencies at
+load time (fed to `ggml_rope_ext` as `freq_factors`) so the same GGUF is interchangeable across
+runtimes.
+
+This GGUF was produced by [`scripts/convert.py`](https://github.com/localai-org/privacy-filter.cpp/blob/master/scripts/convert.py)
+— a self-contained HF→GGUF converter (no llama.cpp dependency). Nightly CI re-runs it and gates
+the output against the HF reference logits, so the published artifact stays in parity.
+
+## Label space
+
+`O` plus `B-`/`I-`/`E-`/`S-` for each of 54 categories (1 + 54×4 = 217), spanning identity,
+contact, address, dates/time, government IDs, financial, crypto, vehicle, digital, and auth
+entities. The ordered `id2label` table is embedded in the GGUF (`classifier.output_labels`).
+See the [source card](https://huggingface.co/OpenMed/privacy-filter-multilingual#label-space-54-categories)
+for the full list.
+
+## Limitations & intended use
+
+Identical to the [source model](https://huggingface.co/OpenMed/privacy-filter-multilingual#limitations--intended-use):
+multilingual but uneven (strongest on de/es/fr/it/hi/te/en; weaker on CJK), trained on
+synthetic AI4Privacy data, **not** a substitute for legal/compliance review, and **not** a
+clinical PHI model. Use it as one tier behind deterministic regex pre-filters and human review.
+
+## License
+
+**Apache-2.0**, inherited from `openai/privacy-filter` and `OpenMed/privacy-filter-multilingual`.
+
+## Credits & citation
+
+Conversion and runtime support by the **LocalAI** project (`privacy-filter.cpp`). The model
+itself is by **OpenMed**, fine-tuned from **OpenAI**'s `privacy-filter`, on **AI4Privacy**
+datasets — please cite all of them (BibTeX in the
+[source card](https://huggingface.co/OpenMed/privacy-filter-multilingual#citation)).
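The label-space arithmetic in the model card (1 + 54×4 = 217) and the step from per-token BIOES tags to entity spans can be sketched in a few lines of Python. This is an illustrative sketch only: the category names are placeholders, the label ordering here is not necessarily the `id2label` order embedded in the GGUF, and the decode is a simple greedy pass over already well-formed tags rather than the constrained Viterbi decode that LocalAI runs.

```python
def build_labels(categories):
    """'O' plus B-/I-/E-/S- for each category (illustrative ordering)."""
    labels = ["O"]
    for cat in categories:
        labels.extend(f"{prefix}-{cat}" for prefix in "BIES")
    return labels


def bioes_to_spans(tags, offsets):
    """Greedy decode of BIOES tags into (category, start, end) spans.

    tags: one tag string per token; offsets: matching (start, end) byte
    offsets. Inconsistent sequences drop the partial span instead of
    being repaired, which is where this differs from a Viterbi decode.
    """
    spans = []
    open_cat, open_start = None, None
    for tag, (start, end) in zip(tags, offsets):
        prefix, _, cat = tag.partition("-")
        if prefix == "S":
            spans.append((cat, start, end))
            open_cat = None
        elif prefix == "B":
            open_cat, open_start = cat, start
        elif prefix == "I" and cat == open_cat:
            continue                 # still inside the open span
        elif prefix == "E" and cat == open_cat:
            spans.append((cat, open_start, end))
            open_cat = None
        else:                        # "O" or an inconsistent tag
            open_cat = None
    return spans
```

With 54 placeholder categories, `build_labels` yields 217 labels, matching the count stated in the card.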