MLX Knife uses a 3-category test strategy designed for safety, speed, and reproducibility on Apple Silicon. Most tests run in complete isolation without requiring models or network access.
For current test counts, version-specific details, and complete file listings, see TESTING-DETAILS.md.
Core Principles:
- Isolated by default - User cache stays pristine with sentinel protection
- Opt-in live tests - Network/model tests require explicit markers/environment
- Mock-heavy - MLX stubs enable fast testing without model downloads
- Fast feedback - 500+ tests run in seconds on any Apple Silicon Mac
Cache Architecture:
- User Cache (Singleton): ONE permanent cache per system - READ-ONLY in tests
- Isolated Cache (Factory): NEW temporary cache PER test - full read/write
- Sentinel Safety: Automatic protection prevents accidental User Cache deletion
See TESTING-DETAILS.md → Fundamental Definitions for complete cache architecture and safety mechanisms.
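The factory-plus-sentinel idea can be sketched in a few lines. This is an illustration only, not the project's `conftest.py`: it assumes the sentinel is a marker file named `TEST_SENTINEL` at the cache root, and `make_isolated_cache` and `is_test_cache` are hypothetical names. `assert_is_test_cache` is borrowed from the test pattern in this guide, but its body here is a guess.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Assumption: the sentinel is a marker file; the real suite may use another scheme.
SENTINEL_NAME = "TEST_SENTINEL"


@contextmanager
def make_isolated_cache():
    """Yield a fresh temporary cache directory that carries a sentinel file."""
    with tempfile.TemporaryDirectory(prefix="mlxk2_test_cache_") as tmp:
        cache = Path(tmp)
        (cache / SENTINEL_NAME).write_text("test cache, safe to delete\n")
        yield cache  # the directory is removed when the block exits


def is_test_cache(path):
    """A directory counts as a test cache only if the sentinel is present."""
    return (Path(path) / SENTINEL_NAME).is_file()


def assert_is_test_cache(path):
    """Fail loudly when a path lacks the sentinel (hypothetical implementation)."""
    if not is_test_cache(path):
        raise AssertionError(f"{path} has no sentinel; not a test cache")
```

Because every call to the factory creates a new directory, two tests never share state, while a real user cache (which has no sentinel) fails the check.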
Safety First:
- Tests use temporary caches with `TEST_SENTINEL` protection
- Delete operations fail if not in test cache (`MLXK2_STRICT_TEST_DELETE=1`)
- Live tests never modify user cache without explicit environment variables
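A strict delete guard along these lines might look as follows. This is a sketch under stated assumptions, not MLX Knife's actual code: `delete_model_dir` is a hypothetical helper, and the sentinel is assumed to be a file named `TEST_SENTINEL` at the cache root.

```python
import os
import shutil
from pathlib import Path

SENTINEL_NAME = "TEST_SENTINEL"  # assumption: sentinel is a marker file


def delete_model_dir(cache_root, model_dir):
    """Delete model_dir; in strict mode only inside a sentinel-marked cache."""
    strict = os.environ.get("MLXK2_STRICT_TEST_DELETE") == "1"
    cache_root = Path(cache_root).resolve()
    model_dir = Path(model_dir).resolve()

    # Never delete anything that lives outside the given cache root.
    if cache_root not in model_dir.parents:
        raise ValueError(f"{model_dir} is not inside {cache_root}")
    # Strict mode: refuse unless the cache carries the test sentinel.
    if strict and not (cache_root / SENTINEL_NAME).is_file():
        raise PermissionError(f"strict mode: {cache_root} is not a test cache")
    shutil.rmtree(model_dir)
```

Checking the sentinel before touching the filesystem means a misconfigured `HF_HOME` produces an error instead of a deleted model.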
Unit Test Limitations:
At the broadest level, MLX Knife tests fall into two groups:
- Unit tests (~500 tests, fast, mocked) - verify code structure
- Live E2E tests (real models, slow) - verify actual functionality
Why both are needed:
When dependencies like transformers or mlx-lm update their APIs, unit tests (which mock these libraries) continue to pass, but real model loading breaks. Only live E2E tests catch these issues.
Example: transformers 5.0 changed tokenizer initialization - unit tests passed (mocked API), but vision models failed to load in production. Live E2E tests caught the issue immediately.
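The effect is easy to reproduce in miniature. The sketch below is not the transformers API: `real_load_tokenizer` is a hypothetical stand-in for a library function whose signature gained a mandatory keyword argument, and `load_for_model` is hypothetical application code still written against the old signature.

```python
from unittest.mock import MagicMock


def real_load_tokenizer(path, *, trust_remote_code):
    """Hypothetical library call: the keyword argument is now mandatory."""
    return f"tokenizer:{path}"


def load_for_model(load_tokenizer, path):
    """Application code still using the old one-argument call."""
    return load_tokenizer(path)


# Unit-test style: the dependency is mocked, so any call shape is accepted.
mocked = MagicMock(return_value="tokenizer:fake")
assert load_for_model(mocked, "org/model") == "tokenizer:fake"

# Live style: the real callable rejects the outdated call.
try:
    load_for_model(real_load_tokenizer, "org/model")
    live_result = "passed"
except TypeError:
    live_result = "failed"
print(live_result)
```

The mocked path passes while the real call raises `TypeError`, which is the gap the live E2E tests are meant to close.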
# Install package + development tools (text-only tests)
pip install -e ".[dev,test]"
# Run default test suite (isolated, no live downloads)
pytest -v
# Before committing
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
That's it! Default tests use isolated caches and MLX stubs - no model downloads required.
Vision + Audio Tests: For complete development setup including Vision and Audio, see README.md → Development Installation.
Single command (recommended):
./scripts/test-wet-umbrella.sh
This runs all real tests in the correct order. For details on test categories, see TESTING-DETAILS.md.
Manual execution (advanced):
# Portfolio-compatible tests
pytest -m wet -v
# Isolated Cache WRITE tests
MLXK2_TEST_RESUMABLE_DOWNLOAD=1 pytest -m live_resumable -v
User cache stays pristine - Tests use temporary caches with sentinel protection
What's tested:
- JSON API contracts (list, show, health)
- Human output formatting
- Model resolution and naming
- Push operations (offline: `--check-only`, `--dry-run`)
- Clone operations (offline: APFS validation, CoW workflow)
- Run command and generation (with MLX stubs)
- Server API endpoints (minimal, no real models)
- Schema validation and spec compliance
How to run:
pytest -v # Runs all isolated tests
Technical pattern:
def test_something(isolated_cache):
    # Complete isolation with sentinel protection
    assert_is_test_cache(isolated_cache)
    # Test implementation
Require explicit environment setup - Network or user cache dependent
What's tested:
- Real HuggingFace push operations
- APFS same-volume clone workflows
- Stop token validation with real models
- Framework detection with private/org models
- Multi-shard model health validation
Markers: live_push, live_clone, live_list, live_stop_tokens, live_e2e, live_run, issue27
How to run:
# Live stop tokens (requires models in cache or HF_HOME)
pytest -m live_stop_tokens -v
# Live push (requires credentials + workspace)
export MLXK2_ENABLE_ALPHA_FEATURES=1
export MLXK2_LIVE_PUSH=1
export HF_TOKEN=...
export MLXK2_LIVE_REPO=org/model
export MLXK2_LIVE_WORKSPACE=/path/to/workspace
pytest -m live_push -v
See TESTING-DETAILS.md for complete environment setup instructions.
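One way to implement "skip unless the environment is set" is a small helper that reports which variables are missing. This is a hedged sketch, not the suite's actual gating: `missing_live_push_env` is a hypothetical name, and treating all five variables from the live push example as mandatory is an assumption.

```python
import os

# Variables taken from the live push example; assuming all are mandatory.
REQUIRED_LIVE_PUSH_ENV = (
    "MLXK2_ENABLE_ALPHA_FEATURES",
    "MLXK2_LIVE_PUSH",
    "HF_TOKEN",
    "MLXK2_LIVE_REPO",
    "MLXK2_LIVE_WORKSPACE",
)


def missing_live_push_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_LIVE_PUSH_ENV if not env.get(name)]
```

A test module could pass `bool(missing_live_push_env())` to `pytest.mark.skipif` together with a reason that lists the missing names, so a skipped test explains what to export.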
Basic server functionality - Lightweight API validation
What's tested:
- OpenAI-compatible endpoints
- SSE streaming functionality
- Model loading and error handling
- Token limit enforcement
How to run:
pytest -k server -v # Optional, included in default suite
Note: Basic server tests use MLX stubs and run by default. Comprehensive E2E tests with real models are available via the `live_e2e` marker (ADR-011).
tests_2.0/
├── conftest.py # Isolated cache, safety sentinel, core fixtures
├── conftest_runner.py # Runner-specific fixtures/mocks
├── stubs/ # Minimal MLX/MLX-LM stubs for unit tests
│ ├── mlx/core.py
│ └── mlx_lm/...
├── spec/ # JSON API spec/contract validation
│ ├── test_cli_commands_json_flag.py
│ ├── test_spec_version_sync.py
│ └── ...
├── live/ # Opt-in live tests (markers required)
│ ├── test_push_live.py
│ ├── test_clone_live.py
│ └── test_list_human_live.py
├── test_*.py # Core test files
└── test_*.py.disabled # Intentionally disabled (WIP)
Legend:
- `spec/` - API contract validation (stays in sync with `docs/schema`)
- `live/` - User Cache READ only - Portfolio Discovery tests (parametrized across many models)
- `stubs/` - Lightweight MLX replacements for unit tests
- `conftest.py` - Isolated HF cache (temp), safety sentinel, fixtures
  - Parent `conftest.py` applies globally
  - Subdirectory `conftest.py` (live/, spec/) MUST limit scope to own directory only
  - See TESTING-DETAILS.md → conftest.py Scope Rules
CRITICAL RULE: ❌ NEVER write to User Cache ❌
Test organization by cache strategy:
- User Cache READ → `tests_2.0/live/` (Portfolio Discovery with many models)
- Isolated Cache WRITE → `tests_2.0/` (fresh downloads, mock creation)
- Isolated Cache READ → `tests_2.0/` (safety copies from User Cache)
- Schema validation → `tests_2.0/spec/` (mocks, fast)
- Workspace operations → `tmp_path` fixture (Clone/Push tests, separate from cache)
Note: Workspace is semantically distinct from Cache - see TESTING-DETAILS.md → Workspace for details.
See TESTING-DETAILS.md → Truth Table for complete categorization and decision tree.
Purpose: Unit tests run without loading real models
How it works:
- `conftest.py` prepends `tests_2.0/stubs/` to `sys.path`
- `import mlx` / `import mlx_lm` resolve to minimal stubs
- Tests use mock models (~50KB fake files instead of 50GB real models)
Benefits:
- Fast test runs (seconds instead of minutes)
- Low RAM usage (default suite: 16GB sufficient)
- No model downloads required
- Deterministic behavior
Limitations:
- Tests requiring real mlx-lm integration use `@requires_mlx_lm` marker
- Production CLI/server still use real packages (stubs not installed)
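The import-shadowing mechanism behind the stubs can be demonstrated without MLX at all. The sketch below assumes nothing about the real `conftest.py` beyond "prepend a directory to `sys.path`": `install_stub_dir` is a hypothetical helper, and `fake_mlx_demo` is a made-up module name chosen to avoid colliding with a real package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

STUB_MODULE = "fake_mlx_demo"  # hypothetical name, not a real package


def install_stub_dir(stub_dir):
    """Put stub_dir first on sys.path so its modules win import resolution."""
    sys.path.insert(0, str(stub_dir))
    importlib.invalidate_caches()


with tempfile.TemporaryDirectory() as tmp:
    stub_dir = Path(tmp)
    # A one-line stub module standing in for a heavy dependency.
    (stub_dir / f"{STUB_MODULE}.py").write_text("IS_STUB = True\n")
    install_stub_dir(stub_dir)
    module = importlib.import_module(STUB_MODULE)
    is_stub = module.IS_STUB
    # Clean up so the demo leaves the interpreter state untouched.
    sys.path.remove(str(stub_dir))
    sys.modules.pop(STUB_MODULE, None)
```

Because Python searches `sys.path` in order, a stub directory at index 0 takes precedence over an installed package of the same name, which is why the prepend has to happen before the first import.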
# Default suite (isolated, fast)
pytest -v
# Specific categories
pytest -m spec -v # Only spec/schema tests
pytest -m "not spec" -v # Exclude spec tests
pytest -k push -v # Push tests (offline)
pytest -k server -v # Server tests
# Live tests (opt-in)
pytest -m live_stop_tokens -v # Stop token validation
pytest -m live_push -v # Real HF push
pytest -m live_clone -v # APFS clone workflow
# Development
pytest --durations=10 # Show slowest tests
pytest -k "test_name" -v # Run specific test
- Apple Silicon Mac (M1/M2/M3) - Required (MLX uses Metal)
- Python 3.10 or newer (Python 3.9 not supported since 2.0.4)
- RAM Requirements:
  - Default suite: 16GB minimum (isolated tests, mock models)
  - Live E2E tests: 32GB minimum (real models, Portfolio Discovery)
  - Full suite (wet-umbrella): 64GB recommended
    - Wet umbrella Phase 4 (Vision→Geo pipe): ~29GB peak observed (M2 Max)
    - Sequential loading: Vision unloads before text model loads (not parallel)
    - Portfolio Discovery selects largest eligible models for quality
    - Tested: M2 Max 64GB (comfortable headroom)
    - Untested: M1 Max 32GB (theoretically viable but Metal limits unknown)
    - Note: Metal memory limits may vary by chip generation
- ~10-20MB disk space for test temp files (default suite)
- Test dependencies: `pip install -e ".[test]"` (quoted so that zsh does not treat the brackets as a glob pattern)
Default suite (16GB): Mock models, fast, no downloads needed. Full suite (64GB): Real models, comprehensive validation, recommended for development.
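As a rough aid for picking a suite, the thresholds can be encoded in a small lookup. This is a sketch only: `suites_for_ram` and `total_ram_gb` are hypothetical helpers, and the 64GB figure is a recommendation in this guide, not a hard minimum.

```python
import os

# Thresholds in GB, copied from the RAM requirements in this guide.
# Note: 64 is "recommended" for the full suite, treated here as a threshold.
SUITE_MIN_RAM_GB = {"default": 16, "live_e2e": 32, "wet-umbrella": 64}


def suites_for_ram(ram_gb):
    """Return the suites whose documented RAM threshold is met."""
    return [name for name, minimum in SUITE_MIN_RAM_GB.items() if ram_gb >= minimum]


def total_ram_gb():
    """Best-effort physical RAM in GiB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, ValueError, OSError):
        return None
```

For example, `suites_for_ram(32)` yields the default and live E2E suites but not the full wet-umbrella run. Actual headroom still depends on which models Portfolio Discovery selects.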
Live tests require additional environment setup:
🔍 Show which models would be tested:
HF_HOME=/path/to/cache pytest -m show_model_portfolio -s
This displays all models that would be used in E2E tests (no actual testing).
E2E tests (ADR-011):
# Full E2E test suite with real models
HF_HOME=/path/to/cache pytest -m live_e2e -v
Stop token validation (ADR-009):
pytest -m live_stop_tokens -v
# Uses Portfolio Discovery if models found, else fallback models
# See TESTING-DETAILS.md "Required Models for Live Tests"
Push/Clone tests (alpha features):
# See TESTING-DETAILS.md for complete environment setup
User cache (persistent):
- Real cache for manual operations
- Example: `export HF_HOME="/Volumes/SSD/models"`
- Safe ops: `list`, `health`, `show`
Test cache (isolated):
- Ephemeral via fixtures
- Default tests never touch user cache
- Deletion safety: `MLXK2_STRICT_TEST_DELETE=1`
Best practice:
- Use isolated tests for development (default `pytest`)
- Use live tests for validation (opt-in with markers)
- Set `HF_HOME` to external SSD for live tests
Tests validated on Python 3.10-3.12 (Python 3.9 not supported since 2.0.4)
Multi-version testing:
# Automated script
./test-multi-python.sh
# Manual verification
python3.10 -m venv test_310
source test_310/bin/activate
pip install -e ".[test]" && pytest
See TESTING-DETAILS.md for version-specific results.
# Install tools
pip install -e ".[dev]"
# Code formatting and linting
ruff check mlxk2/ --fix
# Type checking
mypy mlxk2/
# Complete workflow
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest
MLX Knife uses pytest markers to organize tests by category:
- Default suite (`pytest -v`): Unit tests with mocks (fast, offline, no real models)
- Spec tests (`-m spec`): API contract/schema validation
- Live tests (`-m live_*`): Tests with real models or network (opt-in)
Common commands:
# Default test suite (fast, offline)
pytest -v
# API spec/contract tests only
pytest -m spec -v
# Live tests with real models (examples)
pytest -m live_stop_tokens -v # Stop token validation (ADR-009)
pytest -m live_e2e -v # E2E server/HTTP/CLI tests (ADR-011)
For complete marker reference, environment requirements, and detailed usage, see TESTING-DETAILS.md.
Symbol Legend:
- 🔒 Marker-required: Must use `-m marker` (skipped by default `pytest -v`)
- Skip-unless-env: Collected but skipped without required environment
Tests hang forever:
pytest --timeout=60 # requires the pytest-timeout plugin
Import errors:
pip install -e ".[test]"
Cache conflicts:
export HF_HOME="/tmp/test_cache"
pytest --cache-clear
Debug specific test:
pytest path/to/test.py::test_name -v -s
When submitting PRs with test changes, please document in the PR description:
- Test environment (macOS version, Apple Silicon chip, Python version)
- Test results (passed/skipped/failed counts)
- Any issues encountered and resolutions
See TESTING-DETAILS.md for the current official test environment and results as an example.
Before committing:
# 1. Code style
ruff check mlxk2/ --fix
# 2. Type checking
mypy mlxk2/
# 3. Run tests
pytest -v
# Or combined
ruff check mlxk2/ --fix && mypy mlxk2/ && pytest -v
MLX Knife Testing:
- ✅ Isolated by default - User cache stays pristine
- ✅ Fast feedback - 500+ tests run in seconds without model downloads
- ✅ Low requirements - 16GB RAM, ~20MB disk, no HF cache needed
- ✅ Opt-in live tests - Real models/network when needed
- ✅ Multi-Python support - Verified on Python 3.10+ (3.9 not supported since 2.0.4; see TESTING-DETAILS.md for version-specific results)
For detailed information including current test counts, complete file structure, version history, and implementation specifics, see TESTING-DETAILS.md.
MLX-Knife 2.0 Testing Framework