This project provides a complete pipeline for Named Entity Recognition (NER) focused on identifying Personally Identifiable Information (PII) in Polish text. It includes utilities for data conversion, label generalization, model training using Hugging Face Transformers, and a deployment-ready Flask API with a simple web interface.
Repository: https://github.com/radlab-dev-group/anonymizer-model
- Data Processing CLI: Tools to convert CONLL/IOB formats to JSONL, generalize labels, and generate distribution reports.
- Training Pipeline: A configurable trainer based on AutoModelForTokenClassification with Weights & Biases (W&B) integration.
- Advanced Inference: A predictor that handles sub-token merging, punctuation cleaning, and gap preservation to return human-readable entities.
- REST API: A Flask-based service to serve multiple model versions with optional dynamic quantization for faster inference.
- Web Tester: A lightweight HTML/JS interface for real-time PII detection testing.
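To illustrate the sub-token merging mentioned under Advanced Inference, here is a minimal sketch. It assumes WordPiece-style tokens where continuation pieces start with `##` and labels follow the B-/I-/O scheme. The function name `merge_subtokens` and the input format are hypothetical; the project's predictor may work differently (for example, by using character offsets, which is what gap preservation suggests).

```python
from typing import List, Optional, Tuple


def merge_subtokens(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge sub-tokens into (entity_text, entity_type) pairs.

    Assumptions (not taken from the project): continuation pieces start
    with "##", and labels follow the B-/I-/O scheme.
    """
    # Step 1: glue sub-tokens back into whole words; a word keeps the
    # label of its first piece.
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            text, first_label = words[-1]
            words[-1] = (text + token[2:], first_label)
        else:
            words.append((token, label))

    # Step 2: group consecutive words of the same entity type.
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, label in words:
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and current_words and ent_type == current_type:
            current_words.append(word)  # the entity continues
            continue
        # Any other label closes the entity in progress, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a start
            current_words, current_type = [word], ent_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


tokens = ["Jan", "Kowal", "##ski", "mieszka", "w", "Wrocław", "##iu"]
labels = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "I-LOCATION"]
print(merge_subtokens(tokens, labels))
```

Joining words with a single space is a simplification; a production predictor would more likely slice the original text so that spacing and punctuation are preserved exactly.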
Ensure you have Python installed, then clone the repository and install the package:
git clone https://github.com/radlab-dev-group/anonymizer-model.git
cd anonymizer-model
pip install .
The project is designed to work with datasets like clarin-pl/kpwr-ner.
- Download Dataset: Store IOB files in dataset/kpwr/raw/.
- Convert to JSONL:
pii-classifier convert \
-i dataset/kpwr/raw/kpwr-ner-n82-train-tune.iob \
-o dataset/kpwr/converted/specific/kpwr-ner-n82-train-tune.jsonl
pii-classifier convert \
-i dataset/kpwr/raw/kpwr-ner-n82-test.iob \
-o dataset/kpwr/converted/specific/kpwr-ner-n82-test.jsonl
- Generalize Labels: Map fine-grained labels to general categories using a mapping file (e.g., config/mappings/kpwr-ner.json).
pii-classifier generalise \
-i dataset/kpwr/converted/specific/kpwr-ner-n82-train-tune.jsonl \
dataset/kpwr/converted/specific/kpwr-ner-n82-test.jsonl \
-m config/mappings/kpwr-ner.json \
-o dataset/kpwr/converted/generalised/kpwr-ner-general-whole.jsonl
- Generate Report: Create an Excel report to analyze class distribution.
pii-classifier report \
-i dataset/kpwr/converted/generalised/kpwr-ner-general-whole.jsonl \
-o dataset/kpwr/converted/generalised/kpwr-ner-general-whole-report.xlsx
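The generalisation and reporting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping is taken to be a flat dictionary from fine-grained label to general category (the real format of config/mappings/kpwr-ner.json may differ), unmapped labels are assumed to collapse to `O`, and the function names and example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def generalise_label(label: str, mapping: Dict[str, str]) -> str:
    """Map a fine-grained IOB label to a general one, keeping the B-/I- prefix."""
    if label == "O":
        return label
    prefix, _, fine = label.partition("-")
    # Assumption: labels missing from the mapping are not PII and become "O".
    general = mapping.get(fine)
    return f"{prefix}-{general}" if general else "O"


def label_distribution(sentences: List[List[str]]) -> Counter:
    """Count entities per class by counting B- tags (one per entity)."""
    counts: Counter = Counter()
    for sentence_labels in sentences:
        counts.update(lab[2:] for lab in sentence_labels if lab.startswith("B-"))
    return counts


# Illustrative mapping and labels, not copied from the project's files.
mapping = {"nam_liv_person": "PERSON", "nam_loc_gpe_city": "LOCATION"}
fine_labels = ["B-nam_liv_person", "I-nam_liv_person", "O",
               "B-nam_loc_gpe_city", "B-nam_eve"]

general_labels = [generalise_label(lab, mapping) for lab in fine_labels]
print(general_labels)
print(label_distribution([general_labels]))
```

Counting only `B-` tags gives a per-entity distribution rather than a per-token one, which is usually the more useful view when checking class balance.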
Training is driven by a JSON configuration file located in config/training/.
To start training, run the training script:
python pii_classification/trainer/train.py
Key Training Features:
- Configurable: Hyperparameters (learning rate, batch size, epochs) are managed via kpwr-ner-config.json.
- W&B Integration: Logs metrics and hyperparameters to Weights & Biases.
- Automatic Export: Saves the best model (based on f1_macro) into a final_model directory.
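For readers unfamiliar with the selection metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so rare PII classes count as much as frequent ones. The sketch below computes it at token level in plain Python; the function name is hypothetical, and the project may compute f1_macro differently (for instance, at entity level or through a metrics library).

```python
from typing import List


def f1_macro(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


y_true = ["PERSON", "PERSON", "O", "O"]
y_pred = ["PERSON", "O", "O", "O"]
# Per-class F1: O = 4/5, PERSON = 2/3, so the macro average is 11/15.
print(round(f1_macro(y_true, y_pred), 4))
```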
The API allows you to load multiple model versions and perform predictions.
python3 -m pii_classification.api.app
Endpoints:
- GET /models: Returns a list of available models and the default model.
- POST /predict: Accepts JSON with text and an optional model name. Returns a list of tokens and their predicted PII labels.
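A minimal client for the prediction endpoint might look like the sketch below, using only the standard library. Several details are assumptions rather than facts from the project: the base URL (Flask's default development port 5000 is used here), the JSON field names `text` and `model`, and the helper names. The response is returned as parsed JSON without assuming its exact shape.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:5000"  # assumption: Flask's default dev address


def build_predict_request(text: str, model: Optional[str] = None,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /predict request without sending it."""
    payload = {"text": text}
    if model is not None:
        payload["model"] = model  # assumed field name for the model choice
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, model: Optional[str] = None,
            base_url: str = BASE_URL) -> Any:
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, model, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the API to be running):
#   predict("Jan Kowalski mieszka we Wrocławiu.")
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.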
Open pii_classification/ui/index.html in a browser to interact with the API. The UI allows you to select a model,
input Polish text, and see highlighted PII entities.
Alternatively, serve the UI with Python's built-in HTTP server:
cd anonymizer-model/pii_classification/ui
python3 -m http.server
>> Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
The pii-classifier command provides the following sub-commands:
| Command | Description | Required Arguments |
|---|---|---|
| convert | Convert CONLL/IOB file to JSONL | -i (input), -o (output) |
| generalise | Map labels using a JSON map | -i (input files), -m (mapping), -o (output) |
| report | Generate Excel distribution report | -i (input files), -o (output) |
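As a rough picture of what the convert sub-command does, the sketch below turns CoNLL-style IOB lines into JSONL records. It rests on assumptions that may not match the project: each non-empty line holds whitespace-separated columns with the token first and the IOB label last, blank lines (or `-DOCSTART` markers) separate sentences, and each output record uses the hypothetical keys `tokens` and `labels`.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iob_to_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one {"tokens": [...], "labels": [...]} record per sentence."""
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART"):
            # A blank line or document marker ends the current sentence.
            if tokens:
                yield {"tokens": tokens, "labels": labels}
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the token
        labels.append(columns[-1])  # last column: the IOB label
    if tokens:  # flush the final sentence if the input lacks a trailing blank line
        yield {"tokens": tokens, "labels": labels}


def to_jsonl(records: Iterable[Dict[str, List[str]]]) -> str:
    """Serialize records as JSON Lines, keeping Polish characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


sample = ["Jan B-PERSON", "Kowalski I-PERSON", "", "Wrocław B-LOCATION"]
records = list(iob_to_records(sample))
print(to_jsonl(records))
```

Passing `ensure_ascii=False` keeps diacritics such as "ł" as literal characters in the output instead of `\u` escapes, which makes the JSONL easier to inspect by eye.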
├── config/
│ ├── mappings/ # Label mapping JSONs
│ └── training/ # Training hyperparameter configs
├── pii_classification/
│ ├── analysis/ # Label generalization and reporting logic
│ ├── api/ # Flask API implementation
│ ├── cli/ # CLI entry point
│ ├── converters/ # Format conversion utilities
│ ├── inference/ # Model prediction and post-processing logic
│ ├── trainer/ # Training scripts and data processors
│ └── ui/ # Frontend tester (HTML/JS)
└── pyproject.toml # Project dependencies and metadata