The art of masking: concealing identity, preserving the essentials.
carnaval is a reversible Python framework for text-document anonymization. It masks sensitive entities (people, organizations, emails, phone numbers, bank identifiers, etc.) before a document is sent to a cloud LLM, then restores the original values in the structured response (JSON or XML) on the way back.
- License: Apache 2.0
- Stack: Python 3.11 / 3.12 / 3.13, GLiNER (zero-shot NER), regex, AES-256-GCM, PyMuPDF
- No external PII framework (no Presidio, no spaCy NER)
- 187 tests passing, ~95% coverage, mypy-checked, CI on every push
- Used internally in production at one enterprise (anonymization of supplier acknowledgments before LLM extraction). Public API may evolve until v1.0.
pip install carnaval

# 1. Clone the repository
git clone <repo>
cd carnaval
# 2. Set up virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or: .\.venv\Scripts\activate # Windows
# 3. Install in editable mode
pip install -e .

Create and edit your .env file to set your vault encryption password:

cp .env.example .env
# Edit .env and set CARNAVAL_VAULT_PASSWORD=<32+ characters>

Anonymize a document using one of the pre-configured business profiles:

python anonymize.py inbox/my_document.txt --profile acknowledge

Restore the original sensitive values in the LLM's response (e.g. a JSON or XML structure):

python reinject.py response_llm.json --vault outbox/vault/my_document_vault.enc

Raw TXT  --> S1 Intake
         --> S2 Preprocess (language, normalization)
         --> S3 Detect (regex + denylist + GLiNER)
         --> S4 Resolve (dedup, arbitration)
         --> S5 Mask (placeholders + encrypted vault)
         --> S6 Output (6 formats: txt/json/jsonl/xml/conll/html)
JSON/XML --> S7 Reinject --> JSON/XML with original values
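
The mask-then-reinject round trip in the pipeline above can be illustrated with a minimal sketch. This is not carnaval's implementation or API: the `PATTERNS` table, the `mask` and `reinject` functions, and the `[LABEL_n]` placeholder format are all hypothetical, detection is reduced to two toy regexes (no denylist, no GLiNER, no arbitration), and the vault is a plain in-memory dict rather than an encrypted file.

```python
import json
import re

# Hypothetical stand-ins for stage S3: two toy regexes only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected values with placeholders; return masked text and vault."""
    vault: dict[str, str] = {}     # placeholder -> original value
    seen: dict[str, str] = {}      # original value -> placeholder
    counters: dict[str, int] = {}  # label -> last index used

    def substitute(label: str, match: re.Match[str]) -> str:
        value = match.group(0)
        if value not in seen:
            # First occurrence: allocate the next index for this label.
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            seen[value] = placeholder
            vault[placeholder] = value
        # Repeated occurrences reuse the same placeholder.
        return seen[value]

    for label, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, label=label: substitute(label, m), text)
    return text, vault


def reinject(node: object, vault: dict[str, str]) -> object:
    """Walk a parsed JSON value and restore originals inside every string."""
    if isinstance(node, str):
        for placeholder, original in vault.items():
            node = node.replace(placeholder, original)
        return node
    if isinstance(node, list):
        return [reinject(item, vault) for item in node]
    if isinstance(node, dict):
        return {key: reinject(value, vault) for key, value in node.items()}
    return node


document = (
    "Contact jane.doe@acme.example or +33 1 23 45 67 89. "
    "Cc: jane.doe@acme.example"
)
masked, vault = mask(document)

# Simulated LLM output that only ever saw the placeholders.
llm_response = '{"contact": {"email": "[EMAIL_1]", "phones": ["[PHONE_1]"]}}'
restored = reinject(json.loads(llm_response), vault)
print(masked)
print(restored)
```

Keeping a reverse map from value to placeholder is what makes repeated occurrences of the same entity share one index, so the LLM can still tell that two mentions refer to the same thing.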
| Profile | Document Type |
|---|---|
| acknowledge | Supplier order acknowledgment |
| invoice | Invoice / professional fee note |
| email | B2B professional email |
Private profiles built from real client data live in profiles_private/, which is git-ignored.
| Doc | Topic |
|---|---|
| docs/00_overview.md | Overview, principles |
| docs/01_architecture_etages.md | The 7 stages in detail |
| docs/02_install.md | Installation |
| docs/03_deploiement_production.md | Production |
| docs/04_configuration.md | YAML config + profiles |
| docs/05_extension_listes.md | Adding entities to mask |
| docs/06_extension_recognizers.md | Coding a new recognizer |
| docs/07_securite.md | Vault, password, audit |
| docs/08_format_entree_sortie.md | Supported formats |
| docs/09_troubleshooting.md | Common errors |
| docs/10_api_reference.md | Python API |
pytest # Run standard tests (182 passing, 5 slow/AI deselected)
pytest -m slow # Run neural network tests (downloads GLiNER multi-PII model ~500 MB)
pytest --cov=src/carnaval # Run with coverage report (~95% coverage)

The test suite comprises 187 tests that validate carnaval against a dedicated corpus of hundreds of fake documents. The corpus reproduces the worst real-world B2B data-quality and privacy cases encountered in production, covering:
- Case & Accent Variations (e.g. Stephanie / Stéphanie / STEPHANIE)
- Valid & Invalid Identifiers (IBAN/BIC checks with mod-97 verification)
- Country-Specific Identifiers (NIR, VAT/TVA, SIREN / SIRET)
- Overlapping Entities (Stage 4 arbitration for emails with subdomains, etc.)
- Punctuation & Complex Layouts (Names attached to punctuation, "LASTNAME Firstname", "Mr. LASTNAME")
- Dirty PDF Extractions (Noisy text containing pipe characters like "Chi | mieBERTAUX")
- Multi-Occurrence Consistency (Mapping identical entities to the same placeholder index)
- Tricky False Positives (Business terms mistaken for BICs, e.g. "PARC")
- Multilingual long documents
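
Two of the cases listed above lend themselves to a short illustration: folding case and accent variants to one key, and the mod-97 check used for IBANs. The sketch below uses only the standard library. The function names `fold_name` and `iban_is_valid` are hypothetical and are not taken from carnaval's code; the IBAN check applies the generic ISO 13616 rule and does not verify country-specific lengths.

```python
import unicodedata


def fold_name(name: str) -> str:
    """Strip accents and case so spelling variants share one lookup key."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def iban_is_valid(candidate: str) -> bool:
    """Apply the ISO 13616 mod-97 rule (country-specific lengths not checked)."""
    iban = "".join(candidate.split()).upper()
    if not iban.isascii() or not iban.isalnum():
        return False
    if not 15 <= len(iban) <= 34:
        return False
    # Two-letter country code followed by two check digits.
    if not (iban[:2].isalpha() and iban[2:4].isdigit()):
        return False
    # Move the first four characters to the end.
    rearranged = iban[4:] + iban[:4]
    # Letters map to 10..35 (A=10, ..., Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(fold_name("Stéphanie") == fold_name("STEPHANIE"))
print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))
print(iban_is_valid("GB83 WEST 1234 5698 7654 32"))
```

A checksum of this kind is one way to reject look-alike strings: a candidate that matches an IBAN-shaped pattern but fails mod-97 can be left unmasked rather than treated as a bank identifier.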
You can find programmatic library usage examples in the examples/ directory:
- examples/quickstart_api.py: A short, commented Python script that uses the library programmatically to anonymize data and reinject the original values into simulated LLM output.
Contributions are welcome! Please read CONTRIBUTING.md and our CODE_OF_CONDUCT.md before getting started.
- Issues and PRs: Make sure no personal or client data is included in public fixtures (use fictitious entities such as Acme Corp, Globex, or Initech).
- Security Policy: To report a security vulnerability, follow the responsible-disclosure process described in SECURITY.md (reports are made by email).
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.