Production-grade, task-agnostic ML inference backend. Serve any trained model — PyTorch, sklearn, ONNX, or anything else — over HTTP without changing the engine's core.
Deploying a trained model to production means writing the same glue code every time: HTTP routing, job tracking, auth, rate limiting, async queues, observability. Inference Engine handles all of that so you only write the model logic.
Plug in a trained artifact. The engine handles the rest.
- HTTP inference serving — sync, batch, and async endpoints out of the box
- LLM-assisted deployment CLI — deploy a .pkl, .onnx, or PyTorch model in one command
- Model versioning + routing — static, canary, and A/B routing strategies
- Async job queue — arq + Redis with graceful in-process fallback
- Multiple execution backends — CPU thread pool, ONNX Runtime, Triton Inference Server
- Authentication + scopes — API key auth with per-tenant rate limiting
- Observability — Prometheus metrics, structured JSON logs, OpenTelemetry tracing
- Zero-dependency quickstart — runs with SQLite + in-process async, no Docker required
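
As a rough illustration of the three routing strategies listed above, the sketch below shows one way static, canary, and A/B selection could work. It is not the engine's actual implementation: the function names, the `canary_fraction` parameter, and the idea of keying A/B assignment on a tenant ID are all assumptions made for this example.

```python
import hashlib
import random
from typing import Mapping, Optional


def route_static(version: str) -> str:
    """Static routing: every request goes to the one configured version."""
    return version


def route_canary(
    stable: str,
    canary: str,
    canary_fraction: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Canary routing: send roughly `canary_fraction` of requests to the canary."""
    source = rng or random
    return canary if source.random() < canary_fraction else stable


def route_ab(tenant_id: str, variants: Mapping[str, float]) -> str:
    """A/B routing: the same tenant always lands on the same variant.

    The tenant ID is hashed to a point in [0, 1), and the variant whose
    cumulative weight range contains that point is selected.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(variants.values())
    cumulative = 0.0
    chosen = next(iter(variants))
    for version, weight in variants.items():
        chosen = version
        cumulative += weight / total
        if point < cumulative:
            break
    # If float rounding leaves the point past the last boundary,
    # the last variant is returned.
    return chosen
```

Hashing the tenant ID rather than drawing a random number keeps A/B assignment sticky across requests, which matters when comparing variants per tenant. Canary routing uses a random draw because it only needs an approximate traffic split.
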
git clone <repo>
cd inference-engine
uv sync # or: pip install -e .
uvicorn app.adapters.http.app:app --reload
curl -X POST http://localhost:8000/predict \
-H "X-API-Key: dev-key" \
-H "Content-Type: application/json" \
-d '{"model": "echo", "version": "v1", "data": "hello"}'
# → {"result": "hello"}

No Docker required. SQLite and an in-process thread pool handle everything locally.
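
The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library. The helper names `build_predict_request` and `send_predict` are made up for this example, and the 10-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from typing import Any


def build_predict_request(
    base_url: str, api_key: str, model: str, version: str, data: Any
) -> urllib.request.Request:
    """Build the POST /predict request shown in the curl example."""
    body = json.dumps({"model": model, "version": version, "data": data})
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body.encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send_predict(
    base_url: str, api_key: str, model: str, version: str, data: Any
) -> Any:
    """Send the request and decode the JSON response body."""
    request = build_predict_request(base_url, api_key, model, version, data)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the quickstart server running, `send_predict("http://localhost:8000", "dev-key", "echo", "v1", "hello")` should return the same JSON body as the curl call.
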
uv sync --extra cli
export GROQ_API_KEY=<your-key>
inference-engine deploy ./sentiment.pkl

The CLI inspects the artifact, generates load() and predict() via an LLM, validates the pipeline, and writes the definition file, so no boilerplate is required.
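
To give a feel for what a load() and predict() pair might look like, here is a hand-written sketch. The real generated definition file may differ: the signatures below are assumptions, and the "model" is a toy pickled dictionary of word weights rather than a real sklearn estimator.

```python
import pickle
from pathlib import Path
from typing import Dict


def load(artifact_path: str) -> Dict[str, float]:
    """Load the pickled artifact once, at model start-up.

    Hypothetical signature; the toy artifact here is a dict that maps
    words to sentiment weights. Only unpickle files you trust.
    """
    with Path(artifact_path).open("rb") as fh:
        return pickle.load(fh)


def predict(model: Dict[str, float], data: str) -> str:
    """Score the input text and map the score to a label.

    Hypothetical signature; unknown words contribute a weight of zero.
    """
    score = sum(model.get(word, 0.0) for word in data.lower().split())
    return "positive" if score > 0 else "negative"
```

Keeping load() separate from predict() means the artifact is read from disk once, and every request reuses the in-memory model.
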
Non-interactive (CI):
inference-engine deploy ./sentiment.pkl \
--name sentiment --version v1 \
--device cpu --routing static \
--sample-input "this movie was great"

cp .env.example .env
bash dev.sh

Starts Docker services, runs the DB migration, launches the arq worker, and starts uvicorn, all in one command.

| Section | Contents |
|---|---|
| Quickstart | Install, run, first request |
| Guides | Task-based workflows |
| CLI | Deploy and fix commands |
| API Reference | Endpoint schemas |
| Concepts | Architecture and design |
| Configuration | Environment variables |
| Integrations | Redis, Postgres, Triton, ONNX |
| Observability | Metrics, logs, tracing |
| Development | Contributing and testing |

| | Inference Engine | BentoML | Ray Serve | SageMaker |
|---|---|---|---|---|
| Self-hosted | ✓ | ✓ | ✓ | ✗ |
| LLM-assisted deploy | ✓ | ✗ | ✗ | ✗ |
| Zero-dependency quickstart | ✓ | ✗ | ✗ | ✗ |
| Built-in auth + rate limiting | ✓ | ✗ | ✗ | ✓ |
| Async job queue | ✓ | partial | ✓ | ✓ |
