The All-in-One Local AI Data Cleaner.
Documentation: nxank4.github.io/loclean
Loclean bridges the gap between data engineering and local AI. It is designed for production pipelines where privacy and stability are non-negotiable.
Leverage the power of small language models (SLMs) including Phi-3, Qwen, Gemma, DeepSeek, TinyLlama, and LFM2.5 running locally via llama.cpp. Clean sensitive PII, medical records, or proprietary data without a single byte leaving your infrastructure. See the available models section for the full list.
Forget about "hallucinations" or parsing loose text. Loclean uses GBNF grammars and Pydantic V2 to force the LLM to output valid, type-safe JSON. If it breaks the schema, it doesn't pass.
Extract structured data from unstructured text with guaranteed schema compliance:
from pydantic import BaseModel
import loclean
class Product(BaseModel):
name: str
price: int
color: str
# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(item.name) # "t-shirt"
print(item.price) # 50000
# Extract from dataframe (default: structured dict for performance)
import polars as pl
df = pl.DataFrame({"description": ["Selling red t-shirt for 50k"]})
result = loclean.extract(df, schema=Product, target_col="description")
# Query with Polars struct (vectorized operations)
result.filter(pl.col("description_extracted").struct.field("price") > 50000)

The extract() function ensures 100% compliance with your Pydantic schema through:
- Dynamic GBNF grammar generation: Automatically converts Pydantic schemas to GBNF grammars
- JSON repair: Automatically fixes malformed JSON output from LLMs (see the sketch after this list)
- Retry logic: Retries with adjusted prompts when validation fails
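As a rough mental model of the JSON-repair step, here is a minimal sketch; it is illustrative only, not Loclean's actual implementation, and salvage_json is a hypothetical helper:

import json

def salvage_json(raw: str) -> dict:
    # Hypothetical repair pass: strip any chatter around the outermost
    # JSON object, then re-parse. Loclean's real logic is more thorough.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(raw[start : end + 1])

# A model reply wrapped in prose still yields a parseable dict:
salvage_json('Sure! {"name": "t-shirt", "price": 50000, "color": "red"}')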
Loclean also provides clean() for general data cleaning and scrub() for privacy-preserving PII redaction. Explore the examples and documentation to discover more features.
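For a quick feel of those two entry points, here is an illustrative sketch. The argument names are assumptions borrowed from extract() above, so check the documentation for the exact signatures:

import polars as pl
import loclean

# Redact PII from free text before it leaves your machine (illustrative call)
safe = loclean.scrub("Contact Jane Doe at jane.doe@example.com")

# General-purpose cleaning of a messy column (illustrative call)
df = pl.DataFrame({"description": ["  Selling RED t-shirt!!  50k  "]})
cleaned = loclean.clean(df, target_col="description")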
Built on Narwhals, Loclean supports Pandas, Polars, PyArrow, Modin, cuDF, and other backends natively. The library automatically detects your dataframe backend and uses the most efficient operations for each (see the sketch after this list).
- Running Polars? We keep it lazy.
- Running Pandas? We handle it seamlessly.
- No heavy dependency lock-in.
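Because dispatch goes through Narwhals, the extract() call from the quick-start runs unchanged on every supported backend. A small sketch, reusing the Product schema defined above:

import pandas as pd
import polars as pl
import loclean

rows = {"description": ["Selling red t-shirt for 50k"]}

# The identical call on two different backends; Loclean detects each one.
out_pd = loclean.extract(pd.DataFrame(rows), schema=Product, target_col="description")
out_pl = loclean.extract(pl.DataFrame(rows), schema=Product, target_col="description")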
For advanced usage patterns, caching strategies, batch processing, parallel execution, and performance optimization tips, check out the documentation.
- Python 3.10 to 3.15
- No GPU required (runs on CPU by default)
Using pip (recommended):
pip install loclean

The basic installation includes local inference support (via llama-cpp-python).
Installation notice:
- Fast (30-60 seconds): Pre-built wheels are available for most platforms (Linux x86_64, macOS, Windows)
- Slow (5-10 minutes): If you see "Building wheels for collected packages: llama-cpp-python", it's building from source. This is normal and only happens when no pre-built wheel is available for your platform. Please be patient - this is not an error!
To ensure fast installation:
pip install --upgrade pip setuptools wheel
pip install loclean

This ensures pip can find and use pre-built wheels when available.
Using uv (alternative, often faster):
uv pip install loclean

Using conda/mamba:
conda install -c conda-forge loclean
# or
mamba install -c conda-forge loclean

The basic installation includes local inference support. Loclean uses Narwhals for backend-agnostic dataframe operations, so if you already have Pandas, Polars, or PyArrow installed, the basic installation is sufficient.
Install dataframe libraries (if not already present):
If you don't have any dataframe library installed, or want to ensure you have all supported backends:
pip install loclean[data]

This installs: pandas>=2.3.3, polars>=0.20.0, pyarrow>=22.0.0
For cloud API support (OpenAI, Anthropic, Gemini):
Cloud API support is planned for future releases. Currently, only local inference is available:
pip install loclean[cloud]

For privacy features (Faker integration):
pip install loclean[privacy]

This installs: faker>=20.0.0 for fake data generation in privacy scrubbing.
Install all optional dependencies:
pip install loclean[all]

This installs loclean[data], loclean[cloud], and loclean[privacy]. Useful for production environments where you want all features available.
Note for developers: If you're contributing to Loclean, use the Development installation section below (git clone + uv sync --dev), not loclean[all].
To contribute or run tests locally:
# Clone the repository
git clone https://github.com/nxank4/loclean.git
cd loclean
# Install with development dependencies (using uv)
uv sync --dev
# Or using pip
pip install -e ".[dev]"Loclean automatically downloads models on first use, but you can pre-download them using the command line:
# Download a specific model
loclean model download --name phi-3-mini
# List available models
loclean model list
# Check download status
loclean model status

Available models:
- phi-3-mini: Microsoft Phi-3 Mini (3.8B, 4K context) - Default, balanced
- tinyllama: TinyLlama 1.1B - Smallest, fastest
- gemma-2b: Google Gemma 2B Instruct - Balanced performance
- qwen3-4b: Qwen3 4B - Higher quality
- gemma-3-4b: Gemma 3 4B - Larger context
- deepseek-r1: DeepSeek R1 - Reasoning model
- lfm2.5: Liquid LFM2.5-1.2B Instruct (1.17B, 32K context) - Best in class at the 1B scale, optimized for agentic tasks and data extraction
Models are cached in ~/.cache/loclean by default. You can specify a custom cache directory using the --cache-dir option.
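For example, to keep models on a larger volume (the path here is illustrative):

loclean model download --name phi-3-mini --cache-dir /data/loclean-models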
Loclean is best learned by example. We provide a set of Jupyter notebooks to help you get started:
- 01-quick-start.ipynb: Core features, structured extraction, and privacy scrubbing.
- 02-data-cleaning.ipynb: Comprehensive data cleaning strategies.
- 03-privacy-scrubbing.ipynb: Deep dive into PII redaction.
- 04-structured-extraction.ipynb: Advanced structured extraction patterns.
- 05-debug-mode.ipynb: Debugging and verbose mode usage.
Check out the examples/ directory for more details.
We love contributions! Loclean is fully open source under the Apache 2.0 License.
Please read our contributing guide for details on how to set up your development environment, run tests, and submit pull requests.
Built for the data community.
