MLX Knife Testing Guide

Overview

MLX Knife uses a 3-category test strategy designed for safety, speed, and reproducibility on Apple Silicon. Most tests run in complete isolation without requiring models or network access.

For current test counts, version-specific details, and complete file listings, see TESTING-DETAILS.md.

Test Philosophy

Core Principles:

  • Isolated by default - User cache stays pristine with sentinel protection
  • Opt-in live tests - Network/model tests require explicit markers/environment
  • Mock-heavy - MLX stubs enable fast testing without model downloads
  • Fast feedback - 500+ tests run in seconds on any Apple Silicon Mac

Cache Architecture:

  • User Cache (Singleton): ONE permanent cache per system - READ-ONLY in tests
  • Isolated Cache (Factory): NEW temporary cache PER test - full read/write
  • Sentinel Safety: Automatic protection prevents accidental User Cache deletion

See TESTING-DETAILS.md → Fundamental Definitions for complete cache architecture and safety mechanisms.

Safety First:

  • Tests use temporary caches with TEST_SENTINEL protection
  • Delete operations fail if not in test cache (MLXK2_STRICT_TEST_DELETE=1)
  • Live tests never modify user cache without explicit environment variables
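
The strict-delete guard described above can be sketched as follows. This is an illustration only: the sentinel filename and function names are assumptions, not MLX Knife's actual internals.

```python
import os
from pathlib import Path

TEST_SENTINEL = ".mlxk2_test_sentinel"  # hypothetical sentinel filename

def assert_is_test_cache(cache_dir: Path) -> None:
    """Raise unless the cache carries the test sentinel marker."""
    if not (cache_dir / TEST_SENTINEL).exists():
        raise RuntimeError(f"Refusing to touch non-test cache: {cache_dir}")

def delete_model(cache_dir: Path, model: str) -> None:
    """Delete a cached model, guarded in strict test mode."""
    if os.environ.get("MLXK2_STRICT_TEST_DELETE") == "1":
        assert_is_test_cache(cache_dir)  # user cache has no sentinel -> raises
    # ... actual deletion logic would go here ...
```

Because the user cache never contains the sentinel file, any delete routed through this guard fails fast instead of touching permanent models.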

Unit Test Limitations:

MLX Knife has two test categories:

  1. Unit tests (~500 tests, fast, mocked) - verify code structure
  2. Live E2E tests (real models, slow) - verify actual functionality

Why both are needed: When dependencies like transformers or mlx-lm update their APIs, unit tests (which mock these libraries) continue to pass, but real model loading breaks. Only live E2E tests catch these issues.

Example: transformers 5.0 changed tokenizer initialization - unit tests passed (mocked API), but vision models failed to load in production. Live E2E tests caught the issue immediately.

Quick Start

# Install package + development tools (text-only tests)
pip install -e ".[dev,test]"

# Run default test suite (isolated, no live downloads)
pytest -v

# Before committing
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v

That's it! Default tests use isolated caches and MLX stubs - no model downloads required.

Vision + Audio Tests: For complete development setup including Vision and Audio, see README.md → Development Installation.

Running All Real Tests

Single command (recommended):

./scripts/test-wet-umbrella.sh

This runs all real tests in the correct order. For details on test categories, see TESTING-DETAILS.md.

Manual execution (advanced):

# Portfolio-compatible tests
pytest -m wet -v

# Isolated Cache WRITE tests
MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v

Test Categories

Category 1: Isolated Cache (Default)

User cache stays pristine - Tests use temporary caches with sentinel protection

What's tested:

  • JSON API contracts (list, show, health)
  • Human output formatting
  • Model resolution and naming
  • Push operations (offline: --check-only, --dry-run)
  • Clone operations (offline: APFS validation, CoW workflow)
  • Run command and generation (with MLX stubs)
  • Server API endpoints (minimal, no real models)
  • Schema validation and spec compliance

How to run:

pytest -v  # Runs all isolated tests

Technical pattern:

def test_something(isolated_cache):
    # Complete isolation with sentinel protection
    assert_is_test_cache(isolated_cache)
    # Test implementation
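
A fixture following this pattern might be defined roughly like below. The helper and sentinel names are illustrative assumptions, not MLX Knife's actual conftest.py code.

```python
import pytest
from pathlib import Path

TEST_SENTINEL = ".mlxk2_test_sentinel"  # hypothetical sentinel filename

def make_test_cache(base: Path) -> Path:
    """Create a fresh, sentinel-marked cache directory under `base`."""
    cache = base / "hf_cache"
    cache.mkdir()
    (cache / TEST_SENTINEL).touch()  # marks this cache as safely deletable
    return cache

@pytest.fixture
def isolated_cache(tmp_path: Path, monkeypatch) -> Path:
    """Factory pattern: a brand-new isolated HF cache per test."""
    cache = make_test_cache(tmp_path)
    monkeypatch.setenv("HF_HOME", str(cache))  # redirect HF lookups for this test
    return cache
```

Each test gets its own temporary cache via `tmp_path`, and `monkeypatch` restores `HF_HOME` afterwards, so the user cache is never visible to the test.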

Category 2: Live Tests (Opt-in)

Require explicit environment setup - Network or user cache dependent

What's tested:

  • Real HuggingFace push operations
  • APFS same-volume clone workflows
  • Stop token validation with real models
  • Framework detection with private/org models
  • Multi-shard model health validation

Markers: live_push, live_clone, live_list, live_stop_tokens, live_e2e, live_run, issue27

How to run:

# Live stop tokens (requires models in cache or HF_HOME)
pytest -m live_stop_tokens -v

# Live push (requires credentials + workspace)
export MLXK2_ENABLE_ALPHA_FEATURES=1
export MLXK2_LIVE_PUSH=1
export HF_TOKEN=...
export MLXK2_LIVE_REPO=org/model
export MLXK2_LIVE_WORKSPACE=/path/to/workspace
pytest -m live_push -v

See TESTING-DETAILS.md for complete environment setup instructions.

Category 3: Server Tests (Default)

Basic server functionality - Lightweight API validation

What's tested:

  • OpenAI-compatible endpoints
  • SSE streaming functionality
  • Model loading and error handling
  • Token limit enforcement

How to run:

pytest -k server -v  # Optional, included in default suite

Note: Basic server tests use MLX stubs and run by default. Comprehensive E2E tests with real models are available via live_e2e marker (ADR-011).
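
A unit-level sketch of what such a contract test checks, with a stand-in response builder (the handler and response shape here are illustrative, not the real server code):

```python
import json

def list_models_handler(cached_models: list) -> dict:
    """Illustrative OpenAI-compatible /v1/models response builder."""
    return {
        "object": "list",
        "data": [{"id": m, "object": "model"} for m in cached_models],
    }

def test_models_endpoint_contract():
    body = list_models_handler(["mlx-community/example-4bit"])
    assert body["object"] == "list"
    assert all("id" in m for m in body["data"])
    json.dumps(body)  # response must be JSON-serializable

test_models_endpoint_contract()
```

Tests of this shape validate the API contract (field names, list structure, serializability) without ever loading a real model.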

Test Structure

tests_2.0/
├── conftest.py              # Isolated cache, safety sentinel, core fixtures
├── conftest_runner.py       # Runner-specific fixtures/mocks
├── stubs/                   # Minimal MLX/MLX-LM stubs for unit tests
│   ├── mlx/core.py
│   └── mlx_lm/...
├── spec/                    # JSON API spec/contract validation
│   ├── test_cli_commands_json_flag.py
│   ├── test_spec_version_sync.py
│   └── ...
├── live/                    # Opt-in live tests (markers required)
│   ├── test_push_live.py
│   ├── test_clone_live.py
│   └── test_list_human_live.py
├── test_*.py               # Core test files
└── test_*.py.disabled      # Intentionally disabled (WIP)

Legend:

  • spec/ - API contract validation (stays in sync with docs/schema)
  • live/ - User Cache READ only - Portfolio Discovery tests (parametrized across many models)
  • stubs/ - Lightweight MLX replacements for unit tests
  • conftest.py - Isolated HF cache (temp), safety sentinel, fixtures

CRITICAL RULE: NEVER write to User Cache

Test organization by cache strategy:

  • User Cache READ → tests_2.0/live/ (Portfolio Discovery with many models)
  • Isolated Cache WRITE → tests_2.0/ (fresh downloads, mock creation)
  • Isolated Cache READ → tests_2.0/ (safety copies from User Cache)
  • Schema validation → tests_2.0/spec/ (mocks, fast)
  • Workspace operations → tmp_path fixture (Clone/Push tests, separate from cache)

Note: Workspace is semantically distinct from Cache - see TESTING-DETAILS.md → Workspace for details.

See TESTING-DETAILS.md → Truth Table for complete categorization and decision tree.

MLX Stubs (Fast Testing Without Model Downloads)

Purpose: Unit tests run without loading real models

How it works:

  • conftest.py prepends tests_2.0/stubs/ to sys.path
  • import mlx / import mlx_lm resolve to minimal stubs
  • Tests use mock models (~50KB fake files instead of 50GB real models)
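
The sys.path trick above can be sketched as a few lines in conftest.py (variable names here are illustrative, not MLX Knife's actual code):

```python
# conftest.py (sketch): make `import mlx` / `import mlx_lm` resolve to the stubs.
import sys
from pathlib import Path

STUBS_DIR = Path(__file__).resolve().parent / "stubs"

# Prepend (not append) so the stubs shadow any real mlx/mlx_lm installation
# for the duration of the test session.
sys.path.insert(0, str(STUBS_DIR))
```

Because Python resolves imports in `sys.path` order, prepending is essential: appending would let an installed real `mlx` win and silently defeat the stubs.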

Benefits:

  • Fast test runs (seconds instead of minutes)
  • Low RAM usage (default suite: 16GB sufficient)
  • No model downloads required
  • Deterministic behavior

Limitations:

  • Tests requiring real mlx-lm integration use @requires_mlx_lm marker
  • Production CLI/server still use real packages (stubs not installed)

Common Test Commands

# Default suite (isolated, fast)
pytest -v

# Specific categories
pytest -m spec -v              # Only spec/schema tests
pytest -m "not spec" -v        # Exclude spec tests
pytest -k push -v              # Push tests (offline)
pytest -k server -v            # Server tests

# Live tests (opt-in)
pytest -m live_stop_tokens -v  # Stop token validation
pytest -m live_push -v         # Real HF push
pytest -m live_clone -v        # APFS clone workflow

# Development
pytest --durations=10          # Show slowest tests
pytest -k "test_name" -v       # Run specific test

Test Prerequisites

Required Setup

  1. Apple Silicon Mac (M1/M2/M3) - Required (MLX uses Metal)
  2. Python 3.10 or newer (3.9 not supported since 2.0.4)
  3. RAM Requirements:
    • Default suite: 16GB minimum (isolated tests, mock models)
    • Live E2E tests: 32GB minimum (real models, Portfolio Discovery)
    • Full suite (wet-umbrella): 64GB recommended
      • Wet umbrella Phase 4 (Vision→Geo pipe): ~29GB peak observed (M2 Max)
      • Sequential loading: Vision unloads before text model loads (not parallel)
      • Portfolio Discovery selects largest eligible models for quality
      • Tested: M2 Max 64GB (comfortable headroom)
      • Untested: M1 Max 32GB (theoretically viable but Metal limits unknown)
      • Note: Metal memory limits may vary by chip generation
  4. ~10-20MB disk space for test temp files (default suite)
  5. Test dependencies:
    pip install -e .[test]

Default suite (16GB): Mock models, fast, no downloads needed. Full suite (64GB): Real models, comprehensive validation, recommended for development.
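
The Apple Silicon requirement is the kind of condition a test guard can check up front; a minimal sketch of such a check (illustrative, not MLX Knife's actual code):

```python
import platform

def on_apple_silicon() -> bool:
    """True on an Apple Silicon Mac, where MLX's Metal backend is available."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```

A suite can pair this with `pytest.mark.skipif` so MLX-dependent tests skip cleanly on unsupported hardware instead of failing with Metal errors.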

Optional Setup (Live Tests)

Live tests require additional environment setup:

🔍 Show which models would be tested:

HF_HOME=/path/to/cache pytest -m show_model_portfolio -s

This displays all models that would be used in E2E tests (no actual testing).

E2E tests (ADR-011):

# Full E2E test suite with real models
HF_HOME=/path/to/cache pytest -m live_e2e -v

Stop token validation (ADR-009):

pytest -m live_stop_tokens -v
# Uses Portfolio Discovery if models found, else fallback models
# See TESTING-DETAILS.md "Required Models for Live Tests"

Push/Clone tests (alpha features):

# See TESTING-DETAILS.md for complete environment setup

Environment & Caches

User cache (persistent):

  • Real cache for manual operations
  • Example: export HF_HOME="/Volumes/SSD/models"
  • Safe ops: list, health, show

Test cache (isolated):

  • Ephemeral via fixtures
  • Default tests never touch user cache
  • Deletion safety: MLXK2_STRICT_TEST_DELETE=1

Best practice:

  • Use isolated tests for development (default pytest)
  • Use live tests for validation (opt-in with markers)
  • Set HF_HOME to external SSD for live tests

Python Version Compatibility

Tests validated on Python 3.10-3.12 (Python 3.9 not supported since 2.0.4)

Multi-version testing:

# Automated script
./test-multi-python.sh

# Manual verification
python3.10 -m venv test_310
source test_310/bin/activate
pip install -e .[test] && pytest

See TESTING-DETAILS.md for version-specific results.

Code Quality

# Install tools
pip install -e .[dev]

# Code formatting and linting
ruff check mlxk2/ --fix

# Type checking
mypy mlxk2/

# Complete workflow
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest

Test Markers

MLX Knife uses pytest markers to organize tests by category:

  • Default suite (pytest -v): Unit tests with mocks (fast, offline, no real models)
  • Spec tests (-m spec): API contract/schema validation
  • Live tests (-m live_*): Tests with real models or network (opt-in)

Common commands:

# Default test suite (fast, offline)
pytest -v

# API spec/contract tests only
pytest -m spec -v

# Live tests with real models (examples)
pytest -m live_stop_tokens -v  # Stop token validation (ADR-009)
pytest -m live_e2e -v          # E2E server/HTTP/CLI tests (ADR-011)

For complete marker reference, environment requirements, and detailed usage, see TESTING-DETAILS.md.

Symbol Legend:

  • 🔒 Marker-required: Must use -m marker (skipped by default pytest -v)
  • Skip-unless-env: Collected but skipped without required environment
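
The skip-unless-env pattern can be expressed with a reusable `skipif` marker; the marker and test names below are illustrative, not MLX Knife's actual test code:

```python
import os
import pytest

# Collected by default, but skipped unless the opt-in env var is set.
live_push = pytest.mark.skipif(
    os.environ.get("MLXK2_LIVE_PUSH") != "1",
    reason="set MLXK2_LIVE_PUSH=1 to enable live push tests",
)

@live_push
def test_push_to_real_repo():
    ...  # would talk to HuggingFace; only runs with explicit opt-in
```

Skipped tests still show up in the report with the reason string, which makes the opt-in requirement discoverable from a plain `pytest -v` run.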

Troubleshooting

Tests hang forever:

pytest --timeout=60

Import errors:

pip install -e .[test]

Cache conflicts:

export HF_HOME="/tmp/test_cache"
pytest --cache-clear

Debug specific test:

pytest path/to/test.py::test_name -v -s

Contributing Tests

When submitting PRs with test changes, please document in the PR description:

  1. Test environment (macOS version, Apple Silicon chip, Python version)
  2. Test results (passed/skipped/failed counts)
  3. Any issues encountered and resolutions

See TESTING-DETAILS.md for the current official test environment and results as an example.

Development Workflow

Before committing:

# 1. Code style
ruff check mlxk2/ --fix

# 2. Type checking
mypy mlxk2/

# 3. Run tests
pytest -v

# Or combined
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v

Summary

MLX Knife Testing:

  • Isolated by default - User cache stays pristine
  • Fast feedback - 500+ tests run in seconds without model downloads
  • Low requirements - 16GB RAM, ~20MB disk, no HF cache needed
  • Opt-in live tests - Real models/network when needed
  • Multi-Python support - Verified on Python 3.10-3.12

For detailed information including current test counts, complete file structure, version history, and implementation specifics, see TESTING-DETAILS.md.


MLX-Knife 2.0 Testing Framework