carnaval

The art of masking: concealing identity, preserving the essentials.

carnaval is a reversible Python framework for text-document anonymization. It masks sensitive entities (people, organizations, emails, phone numbers, bank identifiers, etc.) before sending them to a cloud LLM, and restores the original values in the structured response (JSON or XML) on the way back.

Status: Stable (Beta) - v0.2.3

  • License: Apache 2.0
  • Stack: Python 3.11 / 3.12 / 3.13, GLiNER (zero-shot NER), regex, AES-256-GCM, PyMuPDF
  • No external PII framework (no Presidio, no spaCy NER)
  • 187 tests passing, ~95% coverage, mypy-checked, CI on every push
  • Used internally in production at one enterprise (anonymization of supplier acknowledgments before LLM extraction). Public API may evolve until v1.0.

Installation

Standard Installation (from PyPI)

pip install carnaval

Development / Local Source Installation

# 1. Clone the repository
git clone <repo>
cd carnaval

# 2. Set up virtual environment
python -m venv .venv
source .venv/bin/activate       # Linux/macOS
# or: .\.venv\Scripts\activate  # Windows

# 3. Install in editable mode
pip install -e .

Quick Start

1. Configuration

Create and edit your .env file to set your vault encryption password:

cp .env.example .env
# Edit .env and set CARNAVAL_VAULT_PASSWORD=<32+ characters>
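
The length requirement can be checked before any vault is written. The following is a minimal sketch, not carnaval's own code: it assumes the .env values have already been loaded into the process environment, and the function name read_vault_password is hypothetical. Only the variable name and the 32-character minimum come from the step above.

```python
import os
from collections.abc import Mapping

ENV_VAR = "CARNAVAL_VAULT_PASSWORD"
MIN_PASSWORD_LENGTH = 32  # the .env comment above asks for 32+ characters


def read_vault_password(env: Mapping[str, str] = os.environ) -> str:
    """Return the vault password, or raise if it is missing or too short.

    Hypothetical helper for illustration; carnaval may validate differently.
    """
    password = env.get(ENV_VAR, "")
    if len(password) < MIN_PASSWORD_LENGTH:
        raise ValueError(
            f"{ENV_VAR} must be set to at least {MIN_PASSWORD_LENGTH} characters "
            f"(got {len(password)})"
        )
    return password
```

Failing fast here avoids producing a vault protected by a weak or empty password.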

2. Anonymization

Anonymize a document using one of the pre-configured business profiles:

python anonymize.py inbox/my_document.txt --profile acknowledge

3. Reinjection

Restore the original sensitive values in the LLM's structured response (JSON or XML):

python reinject.py response_llm.json --vault outbox/vault/my_document_vault.enc
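
Conceptually, reinjection walks the structured response and swaps each placeholder for its original value. The sketch below illustrates the idea on JSON using only the standard library. It assumes the vault has already been decrypted into a plain dict, and the "[LABEL_N]" placeholder shape and the reinject function are hypothetical; carnaval's actual placeholder format and API may differ.

```python
import json
import re
from typing import Any

# Hypothetical placeholder shape, e.g. "[PERSON_1]"; carnaval's real format may differ.
PLACEHOLDER = re.compile(r"\[[A-Z]+_\d+\]")


def reinject(node: Any, vault: dict[str, str]) -> Any:
    """Recursively replace known placeholders in every string value of a JSON tree."""
    if isinstance(node, str):
        # Unknown placeholders are left untouched rather than guessed.
        return PLACEHOLDER.sub(lambda m: vault.get(m.group(0), m.group(0)), node)
    if isinstance(node, list):
        return [reinject(item, vault) for item in node]
    if isinstance(node, dict):
        return {key: reinject(value, vault) for key, value in node.items()}
    return node  # numbers, booleans and null pass through unchanged


# Assumed: the decrypted vault is a placeholder -> original value mapping.
vault = {"[PERSON_1]": "Stéphanie Martin", "[EMAIL_1]": "s.martin@acme.example"}
llm_response = json.loads(
    '{"contact": "[PERSON_1]", "cc": ["[EMAIL_1]"], "note": "Call [PERSON_1]", "lines": 3}'
)
restored = reinject(llm_response, vault)
print(json.dumps(restored, ensure_ascii=False, indent=2))
```

Only string values are rewritten here; keys and non-string values are kept as the LLM returned them.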

7-Stage Architecture

Raw TXT --> S1 Intake
        --> S2 Preprocess (language, normalization)
        --> S3 Detect (regex + denylist + GLiNER)
        --> S4 Resolve (dedup, arbitration)
        --> S5 Mask (placeholders + encrypted vault)
        --> S6 Output (6 formats: txt/json/jsonl/xml/conll/html)

JSON/XML --> S7 Reinject --> JSON/XML with original values
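
To make stages S3 to S5 concrete, here is a toy sketch that detects with two overlapping regexes, resolves overlaps by keeping the longest span, and masks with placeholders that stay stable across repeated values. Everything in it is an assumption for illustration: the Span class, the function names, the "longest span wins" rule and the "[LABEL_N]" placeholder shape are not taken from carnaval, whose detectors (regex, denylist, GLiNER) and arbitration logic are richer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


# Two deliberately overlapping toy detectors (S3): the bare-domain pattern
# also matches inside every email address.
DETECTORS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DOMAIN": re.compile(r"[\w-]+(?:\.[\w-]+)+"),
}


def detect(text: str) -> list[Span]:
    """S3: collect every match of every detector, overlaps included."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in DETECTORS.items()
        for m in pattern.finditer(text)
    ]


def resolve(spans: list[Span]) -> list[Span]:
    """S4: longest span wins; shorter spans overlapping a kept one are dropped."""
    kept: list[Span] = []
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def mask(text: str, spans: list[Span]) -> tuple[str, dict[str, str]]:
    """S5: same value -> same placeholder; expects sorted, non-overlapping spans."""
    by_value: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}
    parts: list[str] = []
    cursor = 0
    for span in spans:
        value = text[span.start:span.end]
        key = (span.label, value)
        if key not in by_value:
            counters[span.label] = counters.get(span.label, 0) + 1
            by_value[key] = f"[{span.label}_{counters[span.label]}]"
        parts += [text[cursor:span.start], by_value[key]]
        cursor = span.end
    parts.append(text[cursor:])
    mapping = {placeholder: value for (_, value), placeholder in by_value.items()}
    return "".join(parts), mapping


text = "Write to anna@mail.acme.example or bob@acme.example; anna@mail.acme.example replies fast."
masked, mapping = mask(text, resolve(detect(text)))
print(masked)
```

In this sketch the mapping is a plain in-memory dict; in carnaval the equivalent data is stored in the encrypted vault that S7 later reads.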

Out-of-the-box Business Profiles

Profile       Document Type
acknowledge   Supplier order acknowledgment
invoice       Invoice / professional fee note
email         B2B professional email
Private profiles (built on real client data) live in profiles_private/, which is git-ignored.

Documentation

Doc                                  Topic
docs/00_overview.md                  Overview, principles
docs/01_architecture_etages.md       The 7 stages in detail
docs/02_install.md                   Installation
docs/03_deploiement_production.md    Production
docs/04_configuration.md             YAML config + profiles
docs/05_extension_listes.md          Adding entities to mask
docs/06_extension_recognizers.md     Coding a new recognizer
docs/07_securite.md                  Vault, password, audit
docs/08_format_entree_sortie.md      Supported formats
docs/09_troubleshooting.md           Common errors
docs/10_api_reference.md             Python API

Tests & Validation Corpus

pytest                          # Run standard tests (182 passing, 5 slow/AI deselected)
pytest -m slow                  # Run neural network tests (downloads GLiNER multi-PII model ~500 MB)
pytest --cov=src/carnaval       # Run with coverage report (~95% coverage)

The suite's 187 tests validate carnaval against a dedicated corpus of hundreds of fake documents. The corpus reproduces the worst real-world B2B data-quality and privacy cases encountered in production, covering:

  • Case & Accent Variations (e.g. Stephanie / Stéphanie / STEPHANIE)
  • Valid & Invalid Identifiers (IBAN/BIC checks with mod-97 verification)
  • Country-Specific Identifiers (NIR, VAT/TVA, SIREN / SIRET)
  • Overlapping Entities (Stage 4 arbitration for emails with subdomains, etc.)
  • Punctuation & Complex Layouts (Names attached to punctuation, "LASTNAME Firstname", "Mr. LASTNAME")
  • Dirty PDF Extractions (Noisy text containing pipe characters like "Chi | mieBERTAUX")
  • Multi-Occurrence Consistency (Mapping identical entities to the same placeholder index)
  • Tricky False Positives (Business terms mistaken for BICs, e.g. "PARC")
  • Multilingual long documents
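
Two of the cases above are easy to illustrate with the standard library: the mod-97 check used to tell valid IBANs from invalid ones, and a case- and accent-insensitive key under which Stephanie, Stéphanie and STEPHANIE compare equal. This is a sketch under stated assumptions, not carnaval's implementation: the function names iban_is_valid and fold are hypothetical, and the IBAN check covers only the generic checksum, not country-specific lengths or formats.

```python
import unicodedata


def iban_is_valid(iban: str) -> bool:
    """Generic IBAN checksum: move the first 4 chars to the end, map A-Z to 10-35,
    and require the resulting integer to be congruent to 1 modulo 97."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0"-"9" to 0-9 and "A"-"Z" to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def fold(name: str) -> str:
    """Case- and accent-insensitive comparison key (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))
print(iban_is_valid("GB82 WEST 1234 5698 7654 33"))
print({fold(n) for n in ("Stephanie", "Stéphanie", "STEPHANIE")})
```

A checksum of this kind is one way to reject look-alike strings before they are masked as bank identifiers.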

Examples

You can find programmatic library usage examples in the examples/ directory:

  • examples/quickstart_api.py: A short, commented Python script that uses the library programmatically to anonymize data and reinject the original values into simulated LLM output.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and our CODE_OF_CONDUCT.md before getting started.

  • Issues and PRs: Make sure public fixtures contain no personal or client data (use fictitious entities such as Acme Corp, Globex, or Initech).
  • Security Policy: To report a security vulnerability, follow the responsible-disclosure process (by email) described in SECURITY.md.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

About

Reversible PII anonymization for LLM pipelines: mask names, emails and bank details before text reaches a cloud LLM, then restore them in the response. Local-first, encrypted vault, 6 languages. Python, Apache-2.0.
