MLX Knife Testing Guide

Overview

MLX Knife uses a 3-category test strategy designed for safety, speed, and reproducibility on Apple Silicon. Most tests run in complete isolation without requiring models or network access.

For current test counts, version-specific details, and complete file listings, see TESTING-DETAILS.md.

Test Philosophy

Core Principles:

  • Isolated by default - User cache stays pristine with sentinel protection
  • Opt-in live tests - Network/model tests require explicit markers/environment
  • Mock-heavy - MLX stubs enable fast testing without model downloads
  • Fast feedback - 500+ tests run in seconds on any Apple Silicon Mac

Cache Architecture:

  • User Cache (Singleton): ONE permanent cache per system - READ-ONLY in tests
  • Isolated Cache (Factory): NEW temporary cache PER test - full read/write
  • Sentinel Safety: Automatic protection prevents accidental User Cache deletion

See TESTING-DETAILS.md → Fundamental Definitions for complete cache architecture and safety mechanisms.

Safety First:

  • Tests use temporary caches with TEST_SENTINEL protection
  • Delete operations fail if not in test cache (MLXK2_STRICT_TEST_DELETE=1)
  • Live tests never modify user cache without explicit environment variables
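
The strict-delete guard described above can be sketched as follows. This is an illustration only: the sentinel filename and function names are assumptions, not MLX Knife's actual internals.

```python
import os
from pathlib import Path

TEST_SENTINEL = ".mlxk2_test_sentinel"  # hypothetical sentinel filename

def assert_is_test_cache(cache_dir: Path) -> None:
    """Raise unless the cache carries the test sentinel marker."""
    if not (cache_dir / TEST_SENTINEL).exists():
        raise RuntimeError(f"Refusing to touch non-test cache: {cache_dir}")

def delete_model(cache_dir: Path, model: str) -> None:
    """Delete a cached model, guarded in strict test mode."""
    if os.environ.get("MLXK2_STRICT_TEST_DELETE") == "1":
        assert_is_test_cache(cache_dir)  # user cache has no sentinel -> raises
    # ... actual deletion logic would go here ...
```

Because the user cache never contains the sentinel file, any delete routed through this guard fails fast instead of touching permanent models.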

Unit Test Limitations:

MLX Knife has two test categories:

  1. Unit tests (~500 tests, fast, mocked) - verify code structure
  2. Live E2E tests (real models, slow) - verify actual functionality

Why both are needed: When dependencies like transformers or mlx-lm update their APIs, unit tests (which mock these libraries) continue to pass, but real model loading breaks. Only live E2E tests catch these issues.

Example: transformers 5.0 changed tokenizer initialization - unit tests passed (mocked API), but vision models failed to load in production. Live E2E tests caught the issue immediately.

Quick Start

# Install package + development tools (text-only tests)
pip install -e ".[dev,test]"

# Run default test suite (isolated, no live downloads)
pytest -v

# Before committing
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v

That's it! Default tests use isolated caches and MLX stubs - no model downloads required.

Vision + Audio Tests: For complete development setup including Vision and Audio, see README.md → Development Installation.

Running All Real Tests

Single command (recommended):

./scripts/test-wet-umbrella.sh

This runs all real tests in the correct order. For details on test categories, see TESTING-DETAILS.md.

Manual execution (advanced):

# Portfolio-compatible tests
pytest -m wet -v

# Isolated Cache WRITE tests
MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v

Test Categories

Category 1: Isolated Cache (Default)

User cache stays pristine - Tests use temporary caches with sentinel protection

What's tested:

  • JSON API contracts (list, show, health)
  • Human output formatting
  • Model resolution and naming
  • Push operations (offline: --check-only, --dry-run)
  • Clone operations (offline: APFS validation, CoW workflow)
  • Run command and generation (with MLX stubs)
  • Server API endpoints (minimal, no real models)
  • Schema validation and spec compliance

How to run:

pytest -v  # Runs all isolated tests

Technical pattern:

def test_something(isolated_cache):
    # Complete isolation with sentinel protection
    assert_is_test_cache(isolated_cache)
    # Test implementation
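
A fixture following this pattern might be defined roughly like below. The helper and sentinel names are illustrative assumptions, not MLX Knife's actual conftest.py code.

```python
import pytest
from pathlib import Path

TEST_SENTINEL = ".mlxk2_test_sentinel"  # hypothetical sentinel filename

def make_test_cache(base: Path) -> Path:
    """Create a fresh, sentinel-marked cache directory under `base`."""
    cache = base / "hf_cache"
    cache.mkdir()
    (cache / TEST_SENTINEL).touch()  # marks this cache as safely deletable
    return cache

@pytest.fixture
def isolated_cache(tmp_path: Path, monkeypatch) -> Path:
    """Factory pattern: a brand-new isolated HF cache per test."""
    cache = make_test_cache(tmp_path)
    monkeypatch.setenv("HF_HOME", str(cache))  # redirect HF lookups for this test
    return cache
```

Each test gets its own temporary cache via `tmp_path`, and `monkeypatch` restores `HF_HOME` afterwards, so the user cache is never visible to the test.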

Category 2: Live Tests (Opt-in)

Require explicit environment setup - Network or user cache dependent

What's tested:

  • Real HuggingFace push operations
  • APFS same-volume clone workflows
  • Stop token validation with real models
  • Framework detection with private/org models
  • Multi-shard model health validation

Markers: live_push, live_clone, live_list, live_stop_tokens, live_e2e, live_run, issue27

How to run:

# Live stop tokens (requires models in cache or HF_HOME)
pytest -m live_stop_tokens -v

# Live push (requires credentials + workspace)
export MLXK2_ENABLE_ALPHA_FEATURES=1
export MLXK2_LIVE_PUSH=1
export HF_TOKEN=...
export MLXK2_LIVE_REPO=org/model
export MLXK2_LIVE_WORKSPACE=/path/to/workspace
pytest -m live_push -v

See TESTING-DETAILS.md for complete environment setup instructions.

Category 3: Server Tests (Default)

Basic server functionality - Lightweight API validation

What's tested:

  • OpenAI-compatible endpoints
  • SSE streaming functionality
  • Model loading and error handling
  • Token limit enforcement

How to run:

pytest -k server -v  # Optional, included in default suite

Note: Basic server tests use MLX stubs and run by default. Comprehensive E2E tests with real models are available via live_e2e marker (ADR-011).
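
A unit-level sketch of what such a contract test checks, with a stand-in response builder (the handler and response shape here are illustrative, not the real server code):

```python
import json

def list_models_handler(cached_models: list) -> dict:
    """Illustrative OpenAI-compatible /v1/models response builder."""
    return {
        "object": "list",
        "data": [{"id": m, "object": "model"} for m in cached_models],
    }

def test_models_endpoint_contract():
    body = list_models_handler(["mlx-community/example-4bit"])
    assert body["object"] == "list"
    assert all("id" in m for m in body["data"])
    json.dumps(body)  # response must be JSON-serializable

test_models_endpoint_contract()
```

Tests of this shape validate the API contract (field names, list structure, serializability) without ever loading a real model.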

Test Structure

tests_2.0/
├── conftest.py              # Isolated cache, safety sentinel, core fixtures
├── conftest_runner.py       # Runner-specific fixtures/mocks
├── stubs/                   # Minimal MLX/MLX-LM stubs for unit tests
│   ├── mlx/core.py
│   └── mlx_lm/...
├── spec/                    # JSON API spec/contract validation
│   ├── test_cli_commands_json_flag.py
│   ├── test_spec_version_sync.py
│   └── ...
├── live/                    # Opt-in live tests (markers required)
│   ├── test_push_live.py
│   ├── test_clone_live.py
│   └── test_list_human_live.py
├── test_*.py               # Core test files
└── test_*.py.disabled      # Intentionally disabled (WIP)

Legend:

  • spec/ - API contract validation (stays in sync with docs/schema)
  • live/ - User Cache READ only - Portfolio Discovery tests (parametrized across many models)
  • stubs/ - Lightweight MLX replacements for unit tests
  • conftest.py - Isolated HF cache (temp), safety sentinel, fixtures

CRITICAL RULE: NEVER write to User Cache

Test organization by cache strategy:

  • User Cache READ → tests_2.0/live/ (Portfolio Discovery with many models)
  • Isolated Cache WRITE → tests_2.0/ (fresh downloads, mock creation)
  • Isolated Cache READ → tests_2.0/ (safety copies from User Cache)
  • Schema validation → tests_2.0/spec/ (mocks, fast)
  • Workspace operations → tmp_path fixture (Clone/Push tests, separate from cache)

Note: Workspace is semantically distinct from Cache - see TESTING-DETAILS.md → Workspace for details.

See TESTING-DETAILS.md → Truth Table for complete categorization and decision tree.

MLX Stubs (Fast Testing Without Model Downloads)

Purpose: Unit tests run without loading real models

How it works:

  • conftest.py prepends tests_2.0/stubs/ to sys.path
  • import mlx / import mlx_lm resolve to minimal stubs
  • Tests use mock models (~50KB fake files instead of 50GB real models)
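
The sys.path trick above can be sketched as a few lines in conftest.py (variable names here are illustrative, not MLX Knife's actual code):

```python
# conftest.py (sketch): make `import mlx` / `import mlx_lm` resolve to the stubs.
import sys
from pathlib import Path

STUBS_DIR = Path(__file__).resolve().parent / "stubs"

# Prepend (not append) so the stubs shadow any real mlx/mlx_lm installation
# for the duration of the test session.
sys.path.insert(0, str(STUBS_DIR))
```

Because Python resolves imports in `sys.path` order, prepending is essential: appending would let an installed real `mlx` win and silently defeat the stubs.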

Benefits:

  • Fast test runs (seconds instead of minutes)
  • Low RAM usage (default suite: 16GB sufficient)
  • No model downloads required
  • Deterministic behavior

Limitations:

  • Tests requiring real mlx-lm integration use @requires_mlx_lm marker
  • Production CLI/server still use real packages (stubs not installed)

Common Test Commands

# Default suite (isolated, fast)
pytest -v

# Specific categories
pytest -m spec -v              # Only spec/schema tests
pytest -m "not spec" -v        # Exclude spec tests
pytest -k push -v              # Push tests (offline)
pytest -k server -v            # Server tests

# Live tests (opt-in)
pytest -m live_stop_tokens -v  # Stop token validation
pytest -m live_push -v         # Real HF push
pytest -m live_clone -v        # APFS clone workflow

# Development
pytest --durations=10          # Show slowest tests
pytest -k "test_name" -v       # Run specific test

Test Prerequisites

Required Setup

  1. Apple Silicon Mac (M1/M2/M3) - Required (MLX uses Metal)
  2. Python 3.10 or newer (3.9 not supported since 2.0.4)
  3. RAM Requirements:
    • Default suite: 16GB minimum (isolated tests, mock models)
    • Live E2E tests: 32GB minimum (real models, Portfolio Discovery)
    • Full suite (wet-umbrella): 64GB recommended
      • Wet umbrella Phase 4 (Vision→Geo pipe): ~29GB peak observed (M2 Max)
      • Sequential loading: Vision unloads before text model loads (not parallel)
      • Portfolio Discovery selects largest eligible models for quality
      • Tested: M2 Max 64GB (comfortable headroom)
      • Untested: M1 Max 32GB (theoretically viable but Metal limits unknown)
      • Note: Metal memory limits may vary by chip generation
  4. ~10-20MB disk space for test temp files (default suite)
  5. Test dependencies:
    pip install -e .[test]

Default suite (16GB): Mock models, fast, no downloads needed. Full suite (64GB): Real models, comprehensive validation, recommended for development.
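
The Apple Silicon requirement is the kind of condition a test guard can check up front; a minimal sketch of such a check (illustrative, not MLX Knife's actual code):

```python
import platform

def on_apple_silicon() -> bool:
    """True on an Apple Silicon Mac, where MLX's Metal backend is available."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```

A suite can pair this with `pytest.mark.skipif` so MLX-dependent tests skip cleanly on unsupported hardware instead of failing with Metal errors.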

Optional Setup (Live Tests)

Live tests require additional environment setup:

🔍 Show which models would be tested:

HF_HOME=/path/to/cache pytest -m show_model_portfolio -s

This displays all models that would be used in E2E tests (no actual testing).

E2E tests (ADR-011):

# Full E2E test suite with real models
HF_HOME=/path/to/cache pytest -m live_e2e -v

Stop token validation (ADR-009):

pytest -m live_stop_tokens -v
# Uses Portfolio Discovery if models found, else fallback models
# See TESTING-DETAILS.md "Required Models for Live Tests"

Push/Clone tests (alpha features):

# See TESTING-DETAILS.md for complete environment setup

Environment & Caches

User cache (persistent):

  • Real cache for manual operations
  • Example: export HF_HOME="/Volumes/SSD/models"
  • Safe ops: list, health, show

Test cache (isolated):

  • Ephemeral via fixtures
  • Default tests never touch user cache
  • Deletion safety: MLXK2_STRICT_TEST_DELETE=1

Best practice:

  • Use isolated tests for development (default pytest)
  • Use live tests for validation (opt-in with markers)
  • Set HF_HOME to external SSD for live tests

Python Version Compatibility

Tests validated on Python 3.10-3.12 (Python 3.9 not supported since 2.0.4)

Multi-version testing:

# Automated script
./test-multi-python.sh

# Manual verification
python3.10 -m venv test_310
source test_310/bin/activate
pip install -e .[test] && pytest

See TESTING-DETAILS.md for version-specific results.

Code Quality

# Install tools
pip install -e .[dev]

# Code formatting and linting
ruff check mlxk2/ --fix

# Type checking
mypy mlxk2/

# Complete workflow
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest

Test Markers

MLX Knife uses pytest markers to organize tests by category:

  • Default suite (pytest -v): Unit tests with mocks (fast, offline, no real models)
  • Spec tests (-m spec): API contract/schema validation
  • Live tests (-m live_*): Tests with real models or network (opt-in)

Common commands:

# Default test suite (fast, offline)
pytest -v

# API spec/contract tests only
pytest -m spec -v

# Live tests with real models (examples)
pytest -m live_stop_tokens -v  # Stop token validation (ADR-009)
pytest -m live_e2e -v          # E2E server/HTTP/CLI tests (ADR-011)

For complete marker reference, environment requirements, and detailed usage, see TESTING-DETAILS.md.

Symbol Legend:

  • 🔒 Marker-required: Must use -m marker (skipped by default pytest -v)
  • Skip-unless-env: Collected but skipped without required environment
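
The skip-unless-env pattern can be expressed with a reusable `skipif` marker; the marker and test names below are illustrative, not MLX Knife's actual test code:

```python
import os
import pytest

# Collected by default, but skipped unless the opt-in env var is set.
live_push = pytest.mark.skipif(
    os.environ.get("MLXK2_LIVE_PUSH") != "1",
    reason="set MLXK2_LIVE_PUSH=1 to enable live push tests",
)

@live_push
def test_push_to_real_repo():
    ...  # would talk to HuggingFace; only runs with explicit opt-in
```

Skipped tests still show up in the report with the reason string, which makes the opt-in requirement discoverable from a plain `pytest -v` run.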

Troubleshooting

Tests hang forever:

pytest --timeout=60

Import errors:

pip install -e .[test]

Cache conflicts:

export HF_HOME="/tmp/test_cache"
pytest --cache-clear

Debug specific test:

pytest path/to/test.py::test_name -v -s

Contributing Tests

When submitting PRs with test changes, please document in the PR description:

  1. Test environment (macOS version, Apple Silicon chip, Python version)
  2. Test results (passed/skipped/failed counts)
  3. Any issues encountered and resolutions

See TESTING-DETAILS.md for the current official test environment and results as an example.

Development Workflow

Before committing:

# 1. Code style
ruff check mlxk2/ --fix

# 2. Type checking
mypy mlxk2/

# 3. Run tests
pytest -v

# Or combined
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v

Summary

MLX Knife Testing:

  • Isolated by default - User cache stays pristine
  • Fast feedback - 500+ tests run in seconds without model downloads
  • Low requirements - 16GB RAM, ~20MB disk, no HF cache needed
  • Opt-in live tests - Real models/network when needed
  • Multi-Python support - Verified on Python 3.10-3.12

For detailed information including current test counts, complete file structure, version history, and implementation specifics, see TESTING-DETAILS.md.


MLX-Knife 2.0 Testing Framework