carnaval

The art of masking: concealing identity, preserving the essentials.

carnaval is a Python framework for reversible anonymization of text documents. It masks sensitive entities (people, organizations, emails, phone numbers, bank identifiers, etc.) before a document is sent to a cloud LLM, and restores the original values in the structured response (JSON or XML) on the way back.

Status: Stable (Beta) - v0.2.3

License: Apache 2.0
Stack: Python 3.11 / 3.12 / 3.13, GLiNER (zero-shot NER), regex, AES-256-GCM, PyMuPDF
No external PII framework (no Presidio, no spaCy NER)
187 tests passing, ~95% coverage, mypy-checked, CI on every push
Used internally in production at one enterprise (anonymization of supplier acknowledgments before LLM extraction). Public API may evolve until v1.0.

Installation

Standard Installation (from PyPI)

pip install carnaval

Development / Local Source Installation

# 1. Clone the repository
git clone <repo>
cd carnaval

# 2. Set up virtual environment
python -m venv .venv
source .venv/bin/activate       # Linux/macOS
# or: .\.venv\Scripts\activate  # Windows

# 3. Install in editable mode
pip install -e .

Quick Start

1. Configuration

Create and edit your .env file to set your vault encryption password:

cp .env.example .env
# Edit .env and set CARNAVAL_VAULT_PASSWORD=<32+ characters>
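The 32-character minimum is easy to check before any vault is written. The following is a minimal sketch that assumes the variable is already exported into the process environment; `load_vault_password` is a hypothetical helper for illustration, not part of carnaval's API, and carnaval may validate the password differently.

```python
import os

MIN_PASSWORD_LENGTH = 32  # matches the "32+ characters" note in .env.example


def load_vault_password(env=None):
    """Return CARNAVAL_VAULT_PASSWORD, or raise if it is missing or too short.

    Hypothetical helper: the name and behavior are assumptions for illustration.
    """
    env = os.environ if env is None else env
    password = env.get("CARNAVAL_VAULT_PASSWORD", "")
    if len(password) < MIN_PASSWORD_LENGTH:
        raise ValueError(
            f"CARNAVAL_VAULT_PASSWORD must be at least {MIN_PASSWORD_LENGTH} characters"
        )
    return password
```

Passing a plain dict as `env` keeps the check testable without touching the real environment.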

2. Anonymization

Anonymize a document using one of the pre-configured business profiles:

python anonymize.py inbox/my_document.txt --profile acknowledge

3. Reinjection

Restore the original sensitive values in the LLM's response (e.g. a JSON or XML structure):

python reinject.py response_llm.json --vault outbox/vault/my_document_vault.enc
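Conceptually, reinjection walks the structured response and swaps each placeholder for its original value. The sketch below illustrates that idea on parsed JSON only. The placeholder syntax (`[PERSON_1]`), the plain-dict vault, and the `reinject` function are assumptions made for illustration; the real vault is encrypted and carnaval's actual placeholder format and API may differ.

```python
import json
import re

# Hypothetical placeholder syntax for illustration, e.g. "[PERSON_1]".
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def reinject(node, vault):
    """Recursively replace placeholders in the values of a parsed JSON document."""
    if isinstance(node, str):
        # Unknown placeholders are left untouched rather than dropped.
        return PLACEHOLDER.sub(lambda m: vault.get(m.group(0), m.group(0)), node)
    if isinstance(node, list):
        return [reinject(item, vault) for item in node]
    if isinstance(node, dict):
        # Only values are rewritten; keys are kept as-is.
        return {key: reinject(value, vault) for key, value in node.items()}
    return node  # numbers, booleans, None


# A decrypted vault is modeled here as a simple placeholder -> value mapping.
vault = {"[PERSON_1]": "Jane Doe", "[ORG_1]": "Acme Corp"}
llm_response = json.loads(
    '{"contact": "[PERSON_1]", "supplier": {"name": "[ORG_1]"},'
    ' "lines": ["Ack by [PERSON_1]", 3]}'
)
restored = reinject(llm_response, vault)
```

Leaving unknown placeholders in place makes a vault mismatch visible in the output instead of silently losing data.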

7-Stage Architecture

Raw TXT --> S1 Intake
        --> S2 Preprocess (language, normalization)
        --> S3 Detect (regex + denylist + GLiNER)
        --> S4 Resolve (dedup, arbitration)
        --> S5 Mask (placeholders + encrypted vault)
        --> S6 Output (6 formats: txt/json/jsonl/xml/conll/html)

JSON/XML --> S7 Reinject --> JSON/XML with original values
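To make stages S4 and S5 concrete, here is a small sketch of overlap arbitration followed by masking. It assumes a "longest span wins" rule and a `[LABEL_n]` placeholder format; both are common choices used here for illustration and are not confirmed to be carnaval's actual arbitration rule or placeholder syntax. `Span`, `resolve`, and `mask` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int  # exclusive
    label: str


def resolve(spans):
    """S4-style arbitration (assumed rule): keep the longest non-overlapping spans."""
    kept = []
    # Longest candidates first; ties go to the earliest start.
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def mask(text, spans):
    """S5-style masking: identical values share one placeholder.

    Expects non-overlapping spans sorted by start; returns (masked_text, vault).
    """
    vault = {}     # placeholder -> original value
    by_value = {}  # (label, value) -> placeholder
    counters = {}  # label -> last index used
    parts, cursor = [], 0
    for span in spans:
        value = text[span.start:span.end]
        key = (span.label, value)
        if key not in by_value:
            counters[span.label] = counters.get(span.label, 0) + 1
            placeholder = f"[{span.label}_{counters[span.label]}]"
            by_value[key] = placeholder
            vault[placeholder] = value
        parts.append(text[cursor:span.start])
        parts.append(by_value[key])
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts), vault


email = "jane@mail.acme.example"
text = f"Write to {email}; again: {email}"
first = text.index(email)
second = text.index(email, first + 1)
candidates = [
    Span(first, first + len(email), "EMAIL"),
    # A shorter candidate on the domain part, overlapping the first email.
    Span(first + len("jane@"), first + len(email), "URL"),
    Span(second, second + len(email), "EMAIL"),
]
masked, vault = mask(text, resolve(candidates))
```

In this example the domain-only candidate loses to the full email, and both occurrences of the address map to the same placeholder, so the vault holds a single entry.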

Out-of-the-box Business Profiles

Profile	Document Type
`acknowledge`	Supplier order acknowledgment
`invoice`	Invoice / professional fee note
`email`	B2B professional email

Private profiles containing real client data live in profiles_private/, which is git-ignored.

Documentation

Doc	Topic
docs/00_overview.md	Overview, principles
docs/01_architecture_etages.md	The 7 stages in detail
docs/02_install.md	Installation
docs/03_deploiement_production.md	Production
docs/04_configuration.md	YAML config + profiles
docs/05_extension_listes.md	Adding entities to mask
docs/06_extension_recognizers.md	Coding a new recognizer
docs/07_securite.md	Vault, password, audit
docs/08_format_entree_sortie.md	Supported formats
docs/09_troubleshooting.md	Common errors
docs/10_api_reference.md	Python API

Tests & Validation Corpus

pytest                          # Run standard tests (182 passing, 5 slow/AI deselected)
pytest -m slow                  # Run neural network tests (downloads GLiNER multi-PII model ~500 MB)
pytest --cov=src/carnaval       # Run with coverage report (~95% coverage)

The test suite consists of 187 total tests validating carnaval against a dedicated corpus of hundreds of fake documents. This corpus represents the worst real-world B2B data quality and privacy cases encountered in production, covering:

Case & Accent Variations (e.g. Stephanie / Stéphanie / STEPHANIE)
Valid & Invalid Identifiers (IBAN/BIC checks with mod-97 verification)
Country-Specific Identifiers (NIR, VAT/TVA, SIREN / SIRET)
Overlapping Entities (Stage 4 arbitration for emails with subdomains, etc.)
Punctuation & Complex Layouts (Names attached to punctuation, "LASTNAME Firstname", "Mr. LASTNAME")
Dirty PDF Extractions (Noisy text containing pipe characters like "Chi | mieBERTAUX")
Multi-Occurrence Consistency (Mapping identical entities to the same placeholder index)
Tricky False Positives (Business terms mistaken as BICs, e.g. "PARC")
Multilingual long documents
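The mod-97 verification mentioned in the list above is the standard IBAN checksum: move the first four characters to the end, map letters to numbers (A=10 ... Z=35), and check that the resulting integer leaves a remainder of 1 when divided by 97. The sketch below is a generic, standalone implementation of that check; `iban_is_valid` is a hypothetical name and this is not carnaval's recognizer code. It skips per-country length rules.

```python
def iban_is_valid(iban):
    """Structural mod-97 check only; does not prove the account exists."""
    compact = "".join(iban.split()).upper()
    if not compact.isascii() or not compact.isalnum():
        return False
    if not 15 <= len(compact) <= 34:
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0"-"9" to 0-9 and "A"-"Z" to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A checksum like this is what lets a detector reject look-alike strings, while a string that passes is only well-formed, not necessarily a real account.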

Examples

You can find programmatic library usage examples in the examples/ directory:

examples/quickstart_api.py: A simple, commented Python script that walks through using the library programmatically to anonymize data and reinject the original values into simulated LLM output.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and our CODE_OF_CONDUCT.md before getting started.

Issues and PRs: Welcome! Please ensure no personal or client data is included in public fixtures (use fictitious entities like Acme Corp, Globex, Initech, etc.).
Security Policy: To report a security vulnerability, follow the responsible-disclosure process described in SECURITY.md (reports are made by email).

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
