Merged
29 changes: 23 additions & 6 deletions .github/workflows/ci.yml
@@ -54,6 +54,9 @@ jobs:
id: cache
uses: actions/cache@v4
with:
# HF safetensors download + reference fixtures only. The GGUF is
# (re)converted from these on every run, so it always reflects the
# current scripts/convert.py rather than a stale cached artifact.
path: |
~/models/privacy-filter-multilingual
tests/fixtures/hf
@@ -63,24 +66,38 @@ jobs:
python3 -m venv .venv
.venv/bin/pip install -q torch --index-url https://download.pytorch.org/whl/cpu
.venv/bin/pip install -q -r scripts/requirements.txt
- - name: fetch model + convert + dump fixtures
+ - name: fetch model + dump fixtures
if: steps.cache.outputs.cache-hit != 'true'
run: |
.venv/bin/pip install -q "huggingface_hub[cli]"
.venv/bin/hf download OpenMed/privacy-filter-multilingual \
--local-dir ~/models/privacy-filter-multilingual
- # GGUF conversion lives in the llama.cpp fork (same files serve both
- # engines); the cache is seeded with pf-rope2-f16.gguf + pf-f32.gguf
- # once, manually. Without them the model-label tests skip (exit 77).
.venv/bin/python scripts/hf_dump.py \
--model ~/models/privacy-filter-multilingual --out tests/fixtures/hf
- name: convert HF -> GGUF
run: |
# Conversion is part of the tested path: the parity suite below gates
# these freshly converted GGUFs against the HF reference fixtures, so a
# scripts/convert.py regression fails CI. ~/ggufs is deliberately
# outside the cached paths so every run reconverts with the current
# script. (The f16 is the shipped artifact, published at
# huggingface.co/LocalAI-io; the f32 adds the tight exact-rotation
# parity gate, cos >= 0.99999, that isolates conversion errors from
# f16 rounding.)
mkdir -p ~/ggufs
.venv/bin/python scripts/convert.py \
--model ~/models/privacy-filter-multilingual \
--outfile ~/ggufs/pf-rope2-f16.gguf --outtype f16
.venv/bin/python scripts/convert.py \
--model ~/models/privacy-filter-multilingual \
--outfile ~/ggufs/pf-f32.gguf --outtype f32
- name: build
run: cmake --preset release -DGGML_NATIVE=OFF && cmake --build --preset release -j
- name: parity suite
- run: PF_GGUF_DIR=~/models/privacy-filter-multilingual ctest --preset release -L model
+ run: PF_GGUF_DIR=~/ggufs ctest --preset release -L model
- name: fuzz smoke (5 min/target)
run: |
cmake --preset fuzz && cmake --build --preset fuzz -j --target fuzz_tokenizer fuzz_gguf
- PF_GGUF=~/models/privacy-filter-multilingual/pf-rope2-f16.gguf \
+ PF_GGUF=~/ggufs/pf-rope2-f16.gguf \
./build/fuzz/fuzz_tokenizer -max_total_time=300 -max_len=4096
./build/fuzz/fuzz_gguf -max_total_time=300 -max_len=8192
31 changes: 27 additions & 4 deletions README.md
@@ -8,8 +8,12 @@ PII/NER entity spans with exact UTF-8 byte offsets. Stock upstream ggml — no
patches; the model's YaRN `truncate=false` frequencies are computed at load
time and fed to `ggml_rope_ext` as `freq_factors`.

- Uses the same GGUF files as the llama.cpp-based path (arch
- `openai-privacy-filter`, converted by the llama.cpp-fork converter).
+ Pre-converted GGUFs (arch `openai-privacy-filter`):
+ [`LocalAI-io/privacy-filter-multilingual-GGUF`](https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF)
+ and [`LocalAI-io/privacy-filter-GGUF`](https://huggingface.co/LocalAI-io/privacy-filter-GGUF).
+ Convert your own from a HF checkpoint with
+ [`scripts/convert.py`](scripts/convert.py) — self-contained, no llama.cpp
+ dependency (see [Convert](#convert)).

## Build

@@ -34,6 +38,23 @@ echo "Contact John Doe at jdoe@example.com" | \
build/release/pf-cli --classify model.gguf 0.5 # [cpu|cuda|vulkan]
```

## Convert

Pre-converted GGUFs are linked above. To convert an `OpenAIPrivacyFilter` HF
checkpoint yourself:

```sh
pip install -r scripts/requirements.txt # torch + safetensors + gguf
python scripts/convert.py --model <hf-model-dir> --outfile model-f16.gguf
python scripts/convert.py --model <hf-model-dir> --outfile model-f32.gguf --outtype f32
```

[`scripts/convert.py`](scripts/convert.py) reads `config.json` +
`model.safetensors` + `tokenizer.json` and emits the GGUF directly — it does
**not** depend on llama.cpp or its converter. The nightly CI converts the model
this way and gates the result against the HF reference logits, so the converter
stays in parity (`.github/workflows/ci.yml`).

## C API

Flat C API in [`include/pf.h`](include/pf.h): an opaque `pf_ctx` handle and
@@ -80,9 +101,11 @@ pf_free(ctx);

```sh
ctest --preset debug -LE model # fast suite, sanitizers, no assets
- # reference fixtures (one-time, pinned env: scripts/requirements.txt):
+ # reference fixtures + GGUF (one-time, pinned env: scripts/requirements.txt):
python scripts/hf_dump.py --model <hf-model-dir> --out tests/fixtures/hf
- PF_GGUF_DIR=<dir-with-ggufs> ctest --preset release # full parity
+ python scripts/convert.py --model <hf-model-dir> --outfile ggufs/pf-rope2-f16.gguf
+ python scripts/convert.py --model <hf-model-dir> --outfile ggufs/pf-f32.gguf --outtype f32
+ PF_GGUF_DIR=ggufs ctest --preset release # full parity (f16 + tight f32)
PF_DEVICE=vulkan PF_GGUF_DIR=... ctest --preset release -L model # on GPU
```

15 changes: 14 additions & 1 deletion fuzz/fuzz_tokenizer.cpp
@@ -4,7 +4,9 @@
// - encode: valid ids, start < end, non-decreasing starts, every byte
// covered (offsets are widened to UTF-8 char boundaries, so tokens may
// overlap on a multibyte char but never leave gaps)
- // With PF_GGUF set the full encode path runs; otherwise pretokenize only.
+ // With PF_GGUF set to a loadable GGUF the full encode path runs; unset runs
+ // pretokenize-only. PF_GGUF set but missing is a hard error (exit 1) — setting
+ // it requests full-encode fuzzing, so the file has to be there.
#include "tokenizer.h"

#include <cstdio>
@@ -19,6 +21,17 @@ extern "C" int LLVMFuzzerInitialize(int *, char ***) {
std::fprintf(stderr, "fuzz_tokenizer: PF_GGUF unset, pretokenize-only mode\n");
return 0;
}
// PF_GGUF was set, so full-encode fuzzing was requested: the GGUF is a hard
// requirement. Exit cleanly (exit 1, not abort -> no core dump) when it's
// missing, so CI fails loudly instead of silently fuzzing pretokenize-only.
// CI generates it with scripts/convert.py; a missing file means misconfig.
// A file that exists but won't load is a real bug, so that path aborts below.
if (FILE * f = std::fopen(gguf, "rb")) {
std::fclose(f);
} else {
std::fprintf(stderr, "fuzz_tokenizer: PF_GGUF set but missing: %s\n", gguf);
std::exit(1);
}
ggml_context * gctx = nullptr;
gguf_init_params params = { /*no_alloc =*/ true, &gctx };
gguf_context * g = gguf_init_from_file(gguf, params);
161 changes: 161 additions & 0 deletions model-cards/privacy-filter-multilingual.md
@@ -0,0 +1,161 @@
---
license: apache-2.0
base_model: OpenMed/privacy-filter-multilingual
base_model_relation: quantized
pipeline_tag: token-classification
library_name: gguf
tags:
- gguf
- privacy-filter.cpp
- llama-cpp
- localai
- token-classification
- pii
- ner
- privacy
- redaction
- multilingual
- openai-privacy-filter
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---

# privacy-filter-multilingual — GGUF (F16)

GGUF conversion of [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual),
a multilingual PII **token-classification** model (a fine-tune of
[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)). It labels every
token with a BIOES tag over **54 PII categories (217 classes)** across **16 languages**, so
it can be served locally with **no Python** as the encoder/NER tier of a PII redactor.

For the full model description, label space, evaluation, limitations, and citations, see the
**[source model card](https://huggingface.co/OpenMed/privacy-filter-multilingual)** — this
card only covers the GGUF packaging and how to run it.

## Runtimes

This GGUF uses a **custom architecture, `openai-privacy-filter`**, that is not (yet) part of
upstream llama.cpp. It runs on:

1. **[privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp)** *(recommended)* —
a small standalone GGML engine for exactly this model family, on **stock upstream ggml with
no patches** (CPU / CUDA / Vulkan). This is the reference runtime and what the parity numbers
below are measured against.

```sh
# build (see the repo README for CUDA/Vulkan)
cmake --preset release && cmake --build --preset release -j
# run
echo "Contact John Doe at jdoe@example.com" | \
build/release/pf-cli --classify privacy-filter-multilingual-f16.gguf 0.5
```

It exposes a flat C API (`pf_load` / `pf_classify` → entity spans with UTF-8 byte offsets;
`pf_tokenize` / `pf_logits`) shaped for FFI — see the repo README.

2. **[LocalAI](https://github.com/mudler/LocalAI)** — install from the model gallery; LocalAI
serves it behind the gRPC `TokenClassify` RPC and runs the constrained BIOES Viterbi decode,
returning entity spans. LocalAI drives it through the **`privacy-filter` backend** (which
wraps privacy-filter.cpp); older builds used a llama.cpp-patched path. The model is **not** a
chat/completion model — it is a PII detector that other models opt into.

```bash
local-ai models install privacy-filter-multilingual
```

The gallery entry carries the detection policy in a `pii_detection:` block (default: mask
everything detected; block credentials / financial-secrets / crypto). Other models opt in by
listing it under `pii.detectors`:

```yaml
# any chat or cloud-proxy model — opt in and reference the detector(s)
name: my-assistant
pii:
enabled: true
detectors:
- privacy-filter-multilingual
```

3. **llama.cpp — only with a patch.** Stock `llama.cpp`, `llama-cpp-python`, Ollama, and
LM Studio will **fail to load** this file (`unknown model architecture:
'openai-privacy-filter'`). The arch can be added with carry-patches (TOKEN_CLS pooling, the
architecture + HF→GGUF converter, the bidirectional banded-attention graph, and an all-SWA
no-cache mask fix; TOKEN_CLS pooling tracks the still-open
[PR #19725](https://github.com/ggml-org/llama.cpp/pull/19725)). Until that support lands
upstream, the patched path is carried by LocalAI; `privacy-filter.cpp` above is the
patch-free alternative.

> **Pooling note (llama.cpp path only):** the model must be loaded with **TOKEN_CLS pooling**
> (the GGUF's default). If you drive `llama-embedding` directly for testing, do **not** pass
> `--pooling none` — that overrides the default and yields raw hidden states instead of label
> logits. privacy-filter.cpp handles this automatically.

## Files

| File | Precision | Size | Notes |
|---|---|---|---|
| `privacy-filter-multilingual-f16.gguf` | F16 | ~2.7 GB | 217 `classifier.output_labels`; `pooling_type = TOKEN_CLS`. Validated artifact. |

F16 is the validated, shipped precision. Quantized variants are deferred until they can be
evaluated with a **task metric (span-F1 per language) + KL-vs-F16** — perplexity is meaningless
for a classifier, so a naively-quantized GGUF is not published here yet.
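
To make the proposed evaluation concrete, here is a rough sketch of the two quantities named above. It assumes KL is taken per token between the softmax label distributions of the F16 reference and a quantized candidate, and that span-F1 is exact-match over `(start, end, category)` triples; neither definition is confirmed by this card, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's label logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_token_kl(ref_rows, test_rows):
    """Average per-token KL over a sequence of logit rows."""
    return sum(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)) / len(ref_rows)

def span_f1(pred_spans, gold_spans):
    """Exact-match F1 over sets of (start_byte, end_byte, category) triples."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred and not gold:
        return 1.0  # assumption: nothing to find and nothing predicted counts as perfect
    tp = len(pred & gold)
    return 2 * tp / (len(pred) + len(gold))
```

Running `span_f1` separately on each language's evaluation set gives the per-language breakdown, while `mean_token_kl` against the F16 logits indicates how far a quantized file has drifted even where the argmax label is unchanged.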

## Architecture & conversion

gpt-oss-style sparse **MoE** (8 layers, `d_model=640`, 128 experts, top-4 routing, ~50M active
per token), **bidirectional banded attention** (symmetric sliding window 128, attention sinks
retained), **interleaved (GPT-J) RoPE** with YaRN (θ=150000, factor 32), o200k (`o200k_base`)
tokenizer, and a 217-way token-classification head (`score` → `cls.output`).

The conversion reproduces the HF reference **exactly at F16**: token-for-token argmax match on
the parity prompt set, **full-logit cosine = 1.0**, every layer's residual-stream cosine = 1.0
(relerr ≈ 2e-4, i.e. F16 rounding). The two load-bearing conversion choices — the expert
`gate_up` `chunk(2)` split and the `n_swa = 2·sliding_window` window mapping — are both
confirmed by that parity. privacy-filter.cpp re-derives the YaRN `truncate=false` frequencies at
load time (fed to `ggml_rope_ext` as `freq_factors`) so the same GGUF is interchangeable across
runtimes.
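
The parity figures quoted above can be reproduced in spirit with a few lines of Python. This sketch is not the repository's test code: it assumes the reference and candidate logits are available as plain lists of floats, and it assumes `relerr` means the L2 norm of the difference divided by the L2 norm of the reference, which this card does not define. The default threshold of 0.99999 is the F32 gate mentioned in the CI workflow comment.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rel_err(ref, test):
    """Assumed definition: ||test - ref||_2 / ||ref||_2."""
    diff = math.sqrt(sum((t - r) ** 2 for r, t in zip(ref, test)))
    return diff / math.sqrt(sum(r * r for r in ref))

def argmax_match(ref_rows, test_rows):
    """True if every token's top label index agrees between the two runs."""
    def top(row):
        return max(range(len(row)), key=row.__getitem__)
    return all(top(r) == top(t) for r, t in zip(ref_rows, test_rows))

def parity_ok(ref_rows, test_rows, min_cos=0.99999):
    """Gate: full-logit cosine above the threshold and token-for-token argmax agreement."""
    flat_ref = [x for row in ref_rows for x in row]
    flat_test = [x for row in test_rows for x in row]
    return cosine(flat_ref, flat_test) >= min_cos and argmax_match(ref_rows, test_rows)
```

Checking argmax agreement alongside cosine matters for a classifier: two logit tensors can be nearly parallel overall while a single close call between two labels flips.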

This GGUF was produced by [`scripts/convert.py`](https://github.com/localai-org/privacy-filter.cpp/blob/master/scripts/convert.py)
— a self-contained HF→GGUF converter (no llama.cpp dependency). Nightly CI re-runs it and gates
the output against the HF reference logits, so the published artifact stays in parity.

## Label space

`O` plus `B-`/`I-`/`E-`/`S-` for each of 54 categories (1 + 54×4 = 217), spanning identity,
contact, address, dates/time, government IDs, financial, crypto, vehicle, digital, and auth
entities. The ordered `id2label` table is embedded in the GGUF (`classifier.output_labels`).
See the [source card](https://huggingface.co/OpenMed/privacy-filter-multilingual#label-space-54-categories)
for the full list.
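
The class count follows directly from the tagging scheme, as this short sketch shows. The category names are placeholders and the ordering is an assumption made for illustration; the authoritative ordered table is the `classifier.output_labels` array embedded in the GGUF.

```python
def build_bioes_labels(categories):
    """Return 'O' followed by B-/I-/E-/S- tags for each category (ordering assumed)."""
    labels = ["O"]
    for cat in categories:
        labels.extend(f"{prefix}-{cat}" for prefix in "BIES")
    return labels

# Hypothetical placeholder names; the real 54 categories are listed in the source card.
categories = [f"CATEGORY_{i:02d}" for i in range(54)]
labels = build_bioes_labels(categories)
print(len(labels))  # 1 + 54 * 4
```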

## Limitations & intended use

Identical to the [source model](https://huggingface.co/OpenMed/privacy-filter-multilingual#limitations--intended-use):
multilingual but uneven (strongest on de/es/fr/it/hi/te/en; weaker on CJK), trained on
synthetic AI4Privacy data, **not** a substitute for legal/compliance review, and **not** a
clinical PHI model. Use it as one tier behind deterministic regex pre-filters and human review.

## License

**Apache-2.0**, inherited from `openai/privacy-filter` and `OpenMed/privacy-filter-multilingual`.

## Credits & citation

Conversion and runtime support by the **LocalAI** project (`privacy-filter.cpp`). The model
itself is by **OpenMed**, fine-tuned from **OpenAI**'s `privacy-filter`, on **AI4Privacy**
datasets — please cite all of them (BibTeX in the
[source card](https://huggingface.co/OpenMed/privacy-filter-multilingual#citation)).