PII Classification Model

This project provides a complete pipeline for Named Entity Recognition (NER) focused on identifying Personally Identifiable Information (PII) in Polish text. It includes utilities for data conversion, label generalization, model training using Hugging Face Transformers, and a deployment-ready Flask API with a simple web interface.

Repository: https://github.com/radlab-dev-group/anonymizer-model

Features

Data Processing CLI: Tools to convert CoNLL/IOB files to JSONL, generalize labels, and generate label distribution reports.
Training Pipeline: A configurable trainer based on AutoModelForTokenClassification with Weights & Biases (W&B) integration.
Advanced Inference: A predictor that handles sub-token merging, punctuation cleaning, and gap preservation to return human-readable entities.
REST API: A Flask-based service to serve multiple model versions with optional dynamic quantization for faster inference.
Web Tester: A lightweight HTML/JS interface for real-time PII detection testing.
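
The sub-token merging mentioned under Advanced Inference can be illustrated with a minimal sketch. It is not the project's predictor: it assumes WordPiece-style pieces marked with a "##" prefix and BIO labels, and the function names `merge_subtokens` and `group_entities` are hypothetical. Punctuation cleaning and gap preservation are left out.

```python
from typing import List, Optional, Tuple


def merge_subtokens(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join "##" continuation pieces onto the previous word.

    The merged word keeps the label of its first piece (an assumption).
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            prev_text, prev_label = words[-1]
            words[-1] = (prev_text + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def group_entities(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged words into (entity_text, entity_type)."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for text, label in words:
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts (a stray I- tag is treated as a start).
            flush()
            current_words, current_type = [text], entity_type
        elif prefix == "I":
            current_words.append(text)
        else:
            # "O" closes any open entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities


tokens = ["Jan", "Kowal", "##ski", "mieszka", "w", "Krakow", "##ie"]
labels = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "I-LOCATION"]
entities = group_entities(merge_subtokens(tokens, labels))
print(entities)
```

Tokenizers that mark word boundaries differently (for example with an end-of-word suffix) would need a different merge rule; the grouping step stays the same.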

Installation

Ensure you have Python 3.9 or later installed.

git clone https://github.com/radlab-dev-group/anonymizer-model.git
cd anonymizer-model
pip install .

Data Preparation

The project is designed to work with datasets like clarin-pl/kpwr-ner.

Download Dataset: Store IOB files in dataset/kpwr/raw/.
Convert to JSONL:

pii-classifier convert \
     -i dataset/kpwr/raw/kpwr-ner-n82-train-tune.iob \
     -o dataset/kpwr/converted/specific/kpwr-ner-n82-train-tune.jsonl


pii-classifier convert \
     -i dataset/kpwr/raw/kpwr-ner-n82-test.iob \
     -o dataset/kpwr/converted/specific/kpwr-ner-n82-test.jsonl
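
For orientation, the conversion step can be sketched in a few lines of Python. This is only an approximation under stated assumptions: each non-empty input line is taken to hold whitespace-separated columns with the token first and the IOB label last, a blank line ends a sentence, and the output schema (`tokens` and `labels` keys) is hypothetical. The actual `convert` command may handle extra columns or document markers differently.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iob_to_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one record per sentence from CoNLL/IOB-style lines."""
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                yield {"tokens": tokens, "labels": labels}
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # assumed: token is the first column
        labels.append(columns[-1])  # assumed: label is the last column
    if tokens:
        yield {"tokens": tokens, "labels": labels}


def records_to_jsonl(records: Iterable[Dict[str, List[str]]]) -> str:
    """Serialize records as JSONL, keeping Polish characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


sample = [
    "Jan\tB-nam_liv_person",
    "Kowalski\tI-nam_liv_person",
    "",
    "Kraków\tB-nam_loc_gpe_city",
]
records = list(iob_to_records(sample))
print(records_to_jsonl(records))
```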

Generalize Labels: Map fine-grained labels to general categories using a mapping file (e.g., config/mappings/kpwr-ner.json).

pii-classifier generalise \
     -i dataset/kpwr/converted/specific/kpwr-ner-n82-train-tune.jsonl \
     dataset/kpwr/converted/specific/kpwr-ner-n82-test.jsonl \
     -m config/mappings/kpwr-ner.json \
     -o dataset/kpwr/converted/generalised/kpwr-ner-general-whole.jsonl
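
The idea behind label generalization can be sketched as follows. The mapping shown here is made up for illustration and does not reproduce config/mappings/kpwr-ner.json; the helper name `generalise_label` and the choice to turn unmapped labels into "O" are likewise assumptions.

```python
from typing import Dict, List

# Hypothetical mapping from fine-grained to general categories.
EXAMPLE_MAPPING: Dict[str, str] = {
    "nam_liv_person": "PERSON",
    "nam_loc_gpe_city": "LOCATION",
}


def generalise_label(label: str, mapping: Dict[str, str], default: str = "O") -> str:
    """Map one IOB label to its general category, keeping the B-/I- prefix."""
    if label == "O":
        return "O"
    prefix, _, fine_grained = label.partition("-")
    general = mapping.get(fine_grained)
    if general is None:
        # Assumed policy: labels without a mapping are treated as non-PII.
        return default
    return f"{prefix}-{general}"


def generalise_labels(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Apply the mapping to a whole label sequence."""
    return [generalise_label(label, mapping) for label in labels]


generalised = generalise_labels(
    ["B-nam_liv_person", "I-nam_liv_person", "O", "B-nam_eve"],
    EXAMPLE_MAPPING,
)
print(generalised)
```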

Generate Report: Create an Excel report to analyze class distribution.

pii-classifier report \
     -i dataset/kpwr/converted/generalised/kpwr-ner-general-whole.jsonl \
     -o dataset/kpwr/converted/generalised/kpwr-ner-general-whole-report.xlsx
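
The kind of statistic such a report contains can be approximated with a short counting sketch. It assumes each JSONL line has a `labels` list of IOB tags (a hypothetical schema) and counts one entity per `B-` tag; writing the Excel file is omitted, and the `report` command may compute additional figures.

```python
import json
from collections import Counter
from typing import Iterable


def entity_distribution(jsonl_lines: Iterable[str]) -> Counter:
    """Count entities per class, one per B- tag."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip empty lines
        record = json.loads(line)
        for label in record["labels"]:
            if label.startswith("B-"):
                counts[label[2:]] += 1
    return counts


sample_lines = [
    '{"tokens": ["Jan", "Kowalski"], "labels": ["B-PERSON", "I-PERSON"]}',
    '{"tokens": ["Anna", "z", "Gdańska"], "labels": ["B-PERSON", "O", "B-LOCATION"]}',
]
distribution = entity_distribution(sample_lines)
for entity_type, count in distribution.most_common():
    print(entity_type, count)
```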

Training

Training is driven by a JSON configuration file located in config/training/.

To start training, run the training script:

python pii_classification/trainer/train.py

Key Training Features:

Configurable: Hyperparameters (learning rate, batch size, epochs) are managed via kpwr-ner-config.json.
W&B Integration: Logs metrics and hyperparameters to Weights & Biases.
Automatic Export: Saves the best model (based on f1_macro) into a final_model directory.
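
Because the best model is selected by f1_macro, it may help to see how a macro-averaged F1 score is formed: F1 is computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. The sketch below works at token level on plain label lists; whether the trainer scores tokens or whole entities is not stated here, so treat this as an illustration of the metric only.

```python
from typing import List


def f1_macro(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["PERSON", "O", "LOCATION", "O"]
y_pred = ["PERSON", "O", "O", "O"]
# Per-class F1: PERSON = 1.0, O = 0.8, LOCATION = 0.0
score = f1_macro(y_true, y_pred)
print(round(score, 3))
```

In this example a single missed LOCATION token pulls the score down to 0.6 even though three of four tokens are correct, which is why a macro average is a reasonable selection criterion for imbalanced PII classes.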

Inference & API

Running the API

The API allows you to load multiple model versions and perform predictions.

python3 -m pii_classification.api.app

Endpoints:

GET /models: Returns a list of available models and the default model.
POST /predict: Accepts JSON with text and optional model name. Returns a list of tokens and their predicted PII labels.
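
A client call to POST /predict might look like the sketch below, which uses only the standard library. Several details are assumptions rather than documented behavior: the base URL (Flask's default port 5000), the JSON field names `text` and `model`, and the helper names. Check the API implementation for the actual request and response schema.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:5000"  # assumption: Flask default host and port


def build_predict_request(
    text: str, model: Optional[str] = None, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    payload = {"text": text}        # assumed field name
    if model is not None:
        payload["model"] = model    # assumed field name
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, model: Optional[str] = None, base_url: str = BASE_URL) -> Any:
    """Send the request and decode the JSON response (requires a running API)."""
    request = build_predict_request(text, model, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `predict("Jan Kowalski mieszka w Krakowie.")` would return the decoded response; omitting `model` leaves the choice to the server's default model.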

Web Interface (under development)

Open pii_classification/ui/index.html in a browser to interact with the API. The UI allows you to select a model, input Polish text, and see highlighted PII entities.

Alternatively, serve the UI directory with Python's built-in HTTP server:

cd anonymizer-model/pii_classification/ui
python3 -m http.server

>> Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

CLI Reference

The pii-classifier command provides the following sub-commands:

| Command | Description | Required Arguments |
|---|---|---|
| `convert` | Convert a CoNLL/IOB file to JSONL | `-i` (input), `-o` (output) |
| `generalise` | Map labels using a JSON mapping file | `-i` (input files), `-m` (mapping), `-o` (output) |
| `report` | Generate an Excel distribution report | `-i` (input files), `-o` (output) |

Project Structure

├── config/
│   ├── mappings/       # Label mapping JSONs
│   └── training/       # Training hyperparameter configs
├── pii_classification/
│   ├── analysis/       # Label generalization and reporting logic
│   ├── api/            # Flask API implementation
│   ├── cli/            # CLI entry point
│   ├── converters/     # Format conversion utilities
│   ├── inference/      # Model prediction and post-processing logic
│   ├── trainer/        # Training scripts and data processors
│   └── ui/             # Frontend tester (HTML/JS)
└── pyproject.toml      # Project dependencies and metadata
