Merged
2 changes: 1 addition & 1 deletion .github/workflows/checks.yml
@@ -14,7 +14,7 @@ jobs:
- name: setup - python
uses: actions/setup-python@v4
with:
python-version: 3.12
python-version: 3.13
- name: Install Global Dependencies
run: pip install -U pip && pip install uv
- name: install
2 changes: 1 addition & 1 deletion .github/workflows/draft-pdf.yml
@@ -6,7 +6,7 @@ jobs:
name: Paper Draft
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Build draft PDF
uses: openjournals/openjournals-draft-action@master
with:
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -21,7 +21,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.12
python-version: 3.13
- name: Checkout branch "main"
uses: actions/checkout@v4
with:
5 changes: 2 additions & 3 deletions .github/workflows/test.yml
@@ -17,12 +17,11 @@ jobs:
name: Run Tests
strategy:
matrix:
python-version: [ "3.9", "3.10", "3.11", "3.12" ]
python-version: [ "3.10", "3.11", "3.12" ] # 3.13: thinc/spacy not yet compatible (C API _PyLong_AsByteArray)
os: [ubuntu-latest, macos-latest]
runs-on: ${{ matrix.os }}

# Checkout the code, install poetry, install dependencies,
# and run test with coverage
# Checkout, install deps with uv, run tests with coverage
steps:
- name: Environment Setup
uses: actions/checkout@v4
2 changes: 1 addition & 1 deletion .zenodo.json
@@ -1,6 +1,6 @@
{
"access_right": "open",
"version": "0.5.0",
"version": "0.6.0",
"creators": [
{
"orcid": "0000-0003-0665-098X",
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -5,6 +5,6 @@ authors:
given-names: Eidan J.
orcid: https://orcid.org/0000-0003-0665-098X
title: "pii-codex: a Python library for PII detection, categorization, and severity assessment"
version: 0.5.0
version: 0.6.0
doi: 10.5281/zenodo.7212576
date-released: 2025-12-16
date-released: 2026-02-13
4 changes: 1 addition & 3 deletions Makefile
@@ -6,9 +6,7 @@ test: lint test.all
test.cov: test.coverage

install:
@uv sync
@uv sync --all-extras
@uv sync --extra dev
@uv sync --extra dev --extra detections
$(MAKE) install.pre_commit
@echo "Installation complete!"

31 changes: 10 additions & 21 deletions README.md
@@ -21,7 +21,7 @@ Author: Eidan Rosado - [@EdyVision](https://github.com/EdyVision) <br/>
Affiliation: Nova Southeastern University, College of Computing and Engineering

## Project Background
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment (to be publicly released later in 2023). There was a need to not only detect PII in text, but also identify its severity, associated categorizations in cybersecurity research and policy documentation, and provide a way for others in similar research efforts to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016) while the ranking is a result of category severities on the scale provided by Schwartz and Solove (2012) from Non-Identifiable, Semi-Identifiable, and Identifiable.
The <em>PII Codex</em> project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment. There was a need to not only detect PII in text, but also identify its severity, associated categorizations in cybersecurity research and policy documentation, and provide a way for others in similar research efforts to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016) while the ranking is a result of category severities on the scale provided by Schwartz and Solove (2012) from Non-Identifiable, Semi-Identifiable, and Identifiable.

The outputs of the primary PII Codex analysis and adapter functions are AnalysisResult or AnalysisResultSet objects that will provide a listing of detections, severities, mean risk scores for each string processed, and summary statistics on the analysis made. The final outputs do not contain the original texts but instead will provide where to find the detections should the end-user care for this information in their analysis.

@@ -37,44 +37,33 @@ Potential usages include sanitizing of dataset strings (e.g. a collection of soc
<hr/>

## Running Locally with uv
This project uses `uv` for dependency management. To run this project, install `uv` and proceed to follow the instructions under `/docs/LOCAL_SETUP.md`.

`Note: This project has only been tested with Ubuntu and MacOS and with Python versions 3.11 and 3.12. You may need to upgrade pip ahead of installation.`

## Installing with PIP
Video capture of install provided in LOCAL_SETUP.md file. Make sure you set up a virtual environment with either python 3.11 or 3.12 and upgrade pip with:
This project uses `uv` for dependency management. Install [uv](https://docs.astral.sh/uv/) then clone the repo and run:

```bash
pip install --upgrade pip
pip install -U pip uv # only needed if you haven't already done so
make install
```

Before adding `pii-codex` on your project, download the spaCy `en_core_web_lg` model:
This runs `uv sync --extra dev --extra detections` so you get the base package, dev tools (pytest, black, pylint, etc.), and detection extras (spaCy, Presidio Analyzer/Anonymizer). The spaCy model `en_core_web_lg` is included in the `detections` extra and is installed automatically; you do not need to run `spacy download` yourself. If for some reason the model is missing at runtime, the code will attempt to install it (via `spacy download` or, in uv-managed venvs without pip, via `uv pip install` and a known wheel URL).

```bash
pip install -U spacy
python3 -m spacy download en_core_web_lg
```
For more detail, see [docs/LOCAL_SETUP.md](docs/LOCAL_SETUP.md). This project has been tested on Ubuntu and macOS with Python 3.11 and 3.12.

For more details on spaCy installation and usage, refer to their <a href="https://spacy.io/usage">docs</a>.

The repository releases are hosted on PyPi and can be installed with:
## Installing as a dependency (PyPI or uv)
Releases are on PyPI. To use PII Codex in another project:

```bash
pip install pii-codex
pip install "pii-codex[detections]"
```

`Note: The extras installed with pii-codex[detections] are the spaCy, Micrisoft Presidio Analyzer, and Microsoft Anonymzer packages.`

Using uv:
With uv:

```bash
uv sync
uv add pii-codex
uv add "pii-codex[detections]"
```

The `[detections]` extra installs spaCy, Microsoft Presidio Analyzer and Anonymizer, and the `en_core_web_lg` model (via a direct wheel URL), so detection works out of the box. If you install without the extra and later use detection features, the code will try to install the model on first use when possible.

For those using Google Collab, check out the example notebook:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/EdyVision/802ce21aab21eb5d9afa9e43d301eef7/pii-codex-sample-notebook.ipynb)
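
Note on the README change above: it says that when `en_core_web_lg` is missing at runtime, the code attempts to install it. The diff shown here does not include that logic, so the following is only a minimal sketch of how such a check could look, assuming the model is installed as an importable Python package and that `python -m spacy download` is available. The helper names (`model_is_installed`, `download_command`, `ensure_model`) are hypothetical and not taken from pii-codex, and the uv wheel-URL fallback mentioned in the README is not covered.

```python
import importlib.util
import subprocess
import sys

MODEL_NAME = "en_core_web_lg"


def model_is_installed(name: str = MODEL_NAME) -> bool:
    """Return True if the spaCy model is importable as a Python package."""
    return importlib.util.find_spec(name) is not None


def download_command(name: str = MODEL_NAME) -> list[str]:
    """Build the `python -m spacy download <name>` command for this interpreter."""
    return [sys.executable, "-m", "spacy", "download", name]


def ensure_model(name: str = MODEL_NAME) -> bool:
    """Install the model only when it is missing; return True if a download ran."""
    if model_is_installed(name):
        return False
    # Assumption: spaCy is installed and pip is usable in this environment.
    subprocess.run(download_command(name), check=True)
    return True
```

Calling `ensure_model()` once before the first `spacy.load(MODEL_NAME)` keeps the download off the normal path when the `detections` extra has already installed the model.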
10 changes: 6 additions & 4 deletions codecov.yml
@@ -2,9 +2,11 @@ coverage:
status:
project:
default:
target: 90% # the required coverage value
threshold: 1% # the leniency in hitting the target
target: 90%
threshold: 1%
informational: true # report only; do not fail build or turn red when below target
patch:
default:
target: 90% # the required coverage value
threshold: 1% # the leniency in hitting the target
target: 90%
threshold: 1%
informational: true # report only; do not fail build or turn red when below target
2 changes: 1 addition & 1 deletion pii_codex/__init__.py
@@ -1 +1 @@
__version__ = "0.5.0"
__version__ = "0.6.0"
28 changes: 28 additions & 0 deletions pii_codex/models/common.py
@@ -112,6 +112,34 @@ class PIIType(Enum):
AU_COMPANY_NUMBER = "AU_COMPANY_NUMBER"
AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICAL_ACCOUNT_NUMBER"
AU_TAX_FILE_NUMBER = "AU_TAX_FILE_NUMBER"
# Presidio extended (global and regional)
MAC_ADDRESS = "MAC_ADDRESS"
US_MBI = "US_MBI" # US Medicare Beneficiary Identifier
UK_NHS = "UK_NHS"
UK_NINO = "UK_NINO"
ES_NIF = "ES_NIF"
ES_NIE = "ES_NIE"
IT_FISCAL_CODE = "IT_FISCAL_CODE"
IT_DRIVER_LICENSE = "IT_DRIVER_LICENSE"
IT_VAT_CODE = "IT_VAT_CODE"
IT_PASSPORT = "IT_PASSPORT"
IT_IDENTITY_CARD = "IT_IDENTITY_CARD"
PL_PESEL = "PL_PESEL"
SG_NRIC_FIN = "SG_NRIC_FIN"
SG_UEN = "SG_UEN"
IN_PAN = "IN_PAN"
IN_AADHAAR = "IN_AADHAAR"
IN_VEHICLE_REGISTRATION = "IN_VEHICLE_REGISTRATION"
IN_VOTER = "IN_VOTER"
IN_PASSPORT = "IN_PASSPORT"
IN_GSTIN = "IN_GSTIN"
FI_PERSONAL_IDENTITY_CODE = "FI_PERSONAL_IDENTITY_CODE"
KR_DRIVER_LICENSE = "KR_DRIVER_LICENSE"
KR_FRN = "KR_FRN"
KR_PASSPORT = "KR_PASSPORT"
KR_BRN = "KR_BRN"
KR_RRN = "KR_RRN"
TH_TNIN = "TH_TNIN"


class NISTCategory(Enum):
29 changes: 28 additions & 1 deletion pii_codex/models/microsoft_presidio_pii.py
@@ -30,8 +30,35 @@ class MSFTPresidioPIIType(Enum):
US_PASSPORT_NUMBER = "US_PASSPORT"
US_INDIVIDUAL_TAXPAYER_IDENTIFICATION = "US_ITIN"
INTERNATIONAL_BANKING_ACCOUNT_NUMBER = "IBAN_CODE"
# UK_NATIONAL_HEALTH_NUMBER = "UK_NHS" # To be added in future versions
AU_BUSINESS_NUMBER = "AU_ABN"
AU_COMPANY_NUMBER = "AU_ACN"
AU_MEDICAL_ACCOUNT_NUMBER = "AU_MEDICARE"
AU_TAX_FILE_NUMBER = "AU_TFN"
# New in latest Presidio (global and regional)
MAC_ADDRESS = "MAC_ADDRESS"
US_MBI = "US_MBI"
UK_NHS = "UK_NHS"
UK_NINO = "UK_NINO"
ES_NIF = "ES_NIF"
ES_NIE = "ES_NIE"
IT_FISCAL_CODE = "IT_FISCAL_CODE"
IT_DRIVER_LICENSE = "IT_DRIVER_LICENSE"
IT_VAT_CODE = "IT_VAT_CODE"
IT_PASSPORT = "IT_PASSPORT"
IT_IDENTITY_CARD = "IT_IDENTITY_CARD"
PL_PESEL = "PL_PESEL"
SG_NRIC_FIN = "SG_NRIC_FIN"
SG_UEN = "SG_UEN"
IN_PAN = "IN_PAN"
IN_AADHAAR = "IN_AADHAAR"
IN_VEHICLE_REGISTRATION = "IN_VEHICLE_REGISTRATION"
IN_VOTER = "IN_VOTER"
IN_PASSPORT = "IN_PASSPORT"
IN_GSTIN = "IN_GSTIN"
FI_PERSONAL_IDENTITY_CODE = "FI_PERSONAL_IDENTITY_CODE"
KR_DRIVER_LICENSE = "KR_DRIVER_LICENSE"
KR_FRN = "KR_FRN"
KR_PASSPORT = "KR_PASSPORT"
KR_BRN = "KR_BRN"
KR_RRN = "KR_RRN"
TH_TNIN = "TH_TNIN"
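
Note on the two enum changes: the new members use the same names in `PIIType` and `MSFTPresidioPIIType`, and for the new entries the Presidio label equals the common value. Older entries differ in value (for example `AU_TAX_FILE_NUMBER` is `"AU_TFN"` on the Presidio side). The sketch below illustrates one way a Presidio label could be resolved to the common type through the shared member name. It uses trimmed stand-in enums with three members copied from the diff, and `to_common_type` is a hypothetical helper, not the library's actual mapping code.

```python
from enum import Enum


class PIIType(Enum):
    # Stand-in subset of pii_codex.models.common.PIIType
    MAC_ADDRESS = "MAC_ADDRESS"
    UK_NHS = "UK_NHS"
    AU_TAX_FILE_NUMBER = "AU_TAX_FILE_NUMBER"


class MSFTPresidioPIIType(Enum):
    # Stand-in subset; values are the entity labels Presidio reports
    MAC_ADDRESS = "MAC_ADDRESS"
    UK_NHS = "UK_NHS"
    AU_TAX_FILE_NUMBER = "AU_TFN"


def to_common_type(presidio_label: str) -> PIIType:
    """Hypothetical helper: map a Presidio label to PIIType via the shared member name."""
    member = MSFTPresidioPIIType(presidio_label)  # lookup by value, ValueError if unknown
    return PIIType[member.name]  # lookup by name, KeyError if the names drift apart
```

With name-based lookup, a member added to one enum but not the other surfaces as a `KeyError` rather than a silently unmapped detection.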