FastAPI service that exposes one POST /v1/responses API across local CT2, ExLlamaV3, GGUF/llama.cpp, and OpenAI-compatible remote Chat Completions backends. It includes runtime metrics, per-model scheduling, replica routing, and admin endpoints for loading, unloading, and inspecting models.
- Overview
- HTTP API
- Inference Example
- Request Fields
- Local Overrides
- Remote Backends
- Replicas
- Timing Metrics
- Test
- Design Notes
- Acknowledgments
- License
- one inference API across local and remote model backends
- local CT2, ExLlamaV3, and GGUF/`llama.cpp` runtime adapters
- OpenAI-compatible remote Chat Completions adapter
- JSON responses and SSE streaming from the same endpoint
- runtime metrics included in inference responses
- admin endpoints for inspecting models and loading or unloading them at runtime
- an in-process scheduler/executor layer in front of inference
- per-model queueing, runtime inflight tracking, and configurable target inflight
- request routing across multiple identical internal replicas for the same model id
- request-level `allow_remote` gate for external or paid model calls
| Endpoint | Purpose |
|---|---|
| `POST /v1/responses` | Run inference. `stream: false` returns one JSON response; `stream: true` returns Server-Sent Events (SSE). |
| `GET /v1/models` | List currently loaded model ids. |
| `GET /v1/admin/models` | List configured model ids plus aggregate runtime, replica, queue, and load state. |
| `GET /v1/admin/gpu-memory` | Return current GPU memory usage plus per-model VRAM estimates. |
| `POST /v1/admin/models/{model_name}/load` | Load one configured model at runtime with optional runtime-only backend-specific overrides. |
| `POST /v1/admin/models/{model_name}/unload` | Gracefully unload one loaded model. |
See runtime-admin-api.md for the full admin API.
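As a rough illustration of calling the endpoints in the table, the sketch below builds requests with the Python standard library. The base URL (host and port) is an assumption, not something this README specifies, and the helper names are hypothetical. The requests are only constructed here; sending them needs a running service.

```python
import json
import urllib.request

# Assumption: the service listens here; adjust to your service.host / service.port.
BASE_URL = "http://127.0.0.1:8000"


def build_inference_request(model: str, text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST /v1/responses request."""
    body = {"model": model, "input": text, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/v1/responses",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_list_models_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a GET /v1/models request."""
    return urllib.request.Request(url=f"{base_url}/v1/models", method="GET")


def send_json(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a service running, `send_json(build_inference_request(model_id, text))` would return the JSON response envelope.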
Example request:
{
"model": "google_gemma-4-26B-A4B-it-Q4_K_M-gguf",
"input": "The weather is pleasant today, and I would like to take a walk in the park after lunch.",
"instructions": "Translate to Dutch. Return only the translation.",
"stream": false,
"decoding": {
"beam_size": 1,
"top_k": 1,
"top_p": 1.0,
"temperature": 0.1,
"repetition_penalty": 1.0,
"max_tokens": 256
}
}

Example response:
{
"id": "resp_123",
"object": "response",
"model": "google_gemma-4-26B-A4B-it-Q4_K_M-gguf",
"output": [
{
"type": "output_text",
"text": "Het weer is aangenaam vandaag en ik zou na de lunch graag een wandeling in het park willen maken."
}
],
"output_text": "Het weer is aangenaam vandaag en ik zou na de lunch graag een wandeling in het park willen maken.",
"metrics": {
"backend_inference_wall_ms": 138.1,
"engine_total_wall_ms": 138.6,
"engine_outside_backend_wall_ms": 0.5,
"pool_total_wall_ms": 139.0,
"engine_tokenize_ms": null,
"gpu_time_to_first_token_ms": null,
"gpu_generate_total_ms": 138.4,
"gpu_decode_after_first_token_ms": null,
"engine_prompt_tokens": 47,
"engine_output_tokens": 23,
"engine_tokens_per_second": 166.2
}
}

`POST /v1/responses` supports:
- `stream: false` for one JSON response envelope
- `stream: true` for SSE events: `response.created`, `response.output_text.delta`, `response.metrics`, `response.completed`
Current SSE note:
- `stream: true` still uses the current service-side SSE event path
- it is not yet a true backend-native live token stream from every runtime implementation
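A client consuming the streaming mode needs to split the SSE text into events. The sketch below is a minimal parser for the standard `event:` / `data:` line format. It assumes each event's data is a JSON document; the payload shape of each event is not documented in this README, so the parser returns the decoded data without interpreting its fields.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, object]]:
    """Yield (event_name, decoded_data) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event or "message", json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # SSE comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:
        # Flush a final event that was not followed by a blank line.
        yield event or "message", json.loads("\n".join(data))
```

A caller could collect text by handling `response.output_text.delta` events and stop at `response.completed`.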
Currently supported API request fields:
| Field | Type | Required | Default if omitted | Notes |
|---|---|---|---|---|
| `model` | string | yes | none | Must match a currently loaded model id. |
| `input` | string | yes | none | Main input text. |
| `instructions` | string \| null | no | `null` | Optional high-level guidance. If omitted, the pool falls back to an internal default instruction prompt. Ignored by `translategemma_template`; omit it there. |
| `source_lang_code` | string \| null | no | `null` | Required for models using `prompt_format: "translategemma_template"`. |
| `target_lang_code` | string \| null | no | `null` | Required for models using `prompt_format: "translategemma_template"`. |
| `stream` | boolean | no | `false` | `false` returns one JSON response; `true` returns SSE events. |
| `allow_remote` | boolean | no | `false` | Required for requests to `openai_compatible` remote models. |
| `decoding` | object | no | `{}` | Omitted subfields fall back to `engine.decoding` server defaults. |
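The field rules in the table can be checked on the client side before a request is sent. The following is a minimal sketch of such a check, not the service's own validation; the function name and error messages are hypothetical, and the caller is assumed to know the target model's `prompt_format`.

```python
from typing import Any, Dict, List


def validate_request(body: Dict[str, Any], prompt_format: str = "") -> List[str]:
    """Return a list of problems found in a /v1/responses request body."""
    errors: List[str] = []
    # model and input are the only unconditionally required fields.
    for field in ("model", "input"):
        if not isinstance(body.get(field), str):
            errors.append(f"{field} is required and must be a string")
    # stream and allow_remote are optional booleans that default to false.
    for field in ("stream", "allow_remote"):
        if not isinstance(body.get(field, False), bool):
            errors.append(f"{field} must be a boolean")
    # translategemma_template models need both language codes.
    if prompt_format == "translategemma_template":
        for field in ("source_lang_code", "target_lang_code"):
            if not body.get(field):
                errors.append(f"{field} is required for translategemma_template")
    return errors
```

An empty list means the body satisfies the rules listed in the table.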
Currently supported decoding fields:
| Field | Type | Required | Default if omitted | CT2 | ExLlamaV3 | GGUF | Notes |
|---|---|---|---|---|---|---|---|
| `beam_size` | int | no | server default, usually `1` | used | accepted, ignored | accepted, ignored | ExLlamaV3 and GGUF log it and continue. |
| `top_k` | int | no | server default, usually `1` | used | used | used | Sampling control. |
| `top_p` | float | no | server default, usually `1.0` | used | used | used | Sampling control. |
| `temperature` | float | no | server default, usually `0.1` | used | used | used | Sampling control. |
| `repetition_penalty` | float | no | server default, usually `1.0` | used | used | used | Repetition penalty. |
| `max_tokens` | int | no | server default, usually `256` | used | used | used | Maximum generated output tokens. |
| `stop` | list[string] | no | server default extra stop list, often empty | used | used | used | Optional extra stop strings. Model-internal stop/eos tokens are handled by the pool/backend. |
Remote OpenAI-compatible models map temperature, top_p, max_tokens, and stop to Chat Completions requests. They accept beam_size, top_k, and repetition_penalty for request schema compatibility, but log and ignore them.
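The fallback and remote mapping rules above can be sketched as follows. The default values are the "usually" values from the table and are illustrative only; the function names are hypothetical, and dropping an empty `stop` list is a choice of this sketch rather than documented behavior.

```python
from typing import Any, Dict, List, Optional, Tuple

# Illustrative engine.decoding defaults, taken from the "usually" column above.
SERVER_DEFAULTS: Dict[str, Any] = {
    "beam_size": 1, "top_k": 1, "top_p": 1.0, "temperature": 0.1,
    "repetition_penalty": 1.0, "max_tokens": 256, "stop": [],
}
REMOTE_MAPPED = ("temperature", "top_p", "max_tokens", "stop")
REMOTE_IGNORED = ("beam_size", "top_k", "repetition_penalty")


def resolve_decoding(request_decoding: Optional[Dict[str, Any]],
                     defaults: Dict[str, Any] = SERVER_DEFAULTS) -> Dict[str, Any]:
    """Request-level values win; omitted subfields fall back to server defaults."""
    resolved = dict(defaults)
    resolved.update({k: v for k, v in (request_decoding or {}).items() if v is not None})
    return resolved


def to_chat_completions_params(resolved: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split resolved decoding into forwarded params and accepted-but-ignored names."""
    params = {k: resolved[k] for k in REMOTE_MAPPED if k in resolved}
    if not params.get("stop"):
        params.pop("stop", None)  # sketch choice: do not forward an empty stop list
    ignored = sorted(k for k in resolved if k in REMOTE_IGNORED)
    return params, ignored
```

The `ignored` list is what a remote adapter would log before continuing.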
You can keep shared defaults in config/settings.json and put machine-local overrides in config/local.json.
When present, local.json is merged over settings.json (override wins per key).
Settings files can also define service.host, service.port, service.log_level, and global engine.decoding defaults.
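The merge of `local.json` over `settings.json` can be pictured with the sketch below. It assumes the merge recurses into nested objects so that an override replaces only the keys it names; whether the service merges deeply or only at the top level is not stated here. The function name and the port values in the usage note are hypothetical.

```python
from typing import Any, Dict


def merge_settings(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override applied on top; the override wins per key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: nested objects are merged key by key, not replaced whole.
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, merging `{"service": {"port": 9000}}` over a base that sets `service.port` to 8000 and `service.log_level` to `"info"` changes only the port under this assumption.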
Per model, you can set model_path, device, compute_type, prompt_format, enable_thinking, enabled, replicas, replica_max, target_inflight, and optionally override the backend:
{
"engine": {
"backend": "gguf",
"models": {
"gemma-4-26B-A4B-it-exl3-5.10bpw": {
"model_path": "/home/gunnar/models/gemma-4-26B-A4B-it-exl3-5.10bpw",
"backend": "exllamav3",
"prompt_format": "gemma4_template",
"enable_thinking": false,
"enabled": true,
"replicas": 1,
"replica_max": 1,
"target_inflight": 1,
"exllama_cache_size": 16384,
"exllama_cache_quant": "8,8",
"exllama_tensor_parallel": true,
"exllama_gpu_split": "24,24"
},
"google_gemma-4-26B-A4B-it-Q4_K_M-gguf": {
"model_path": "/home/gunnar/models/google_gemma-4-26B-A4B-it-Q4_K_M/google_gemma-4-26B-A4B-it-Q4_K_M.gguf",
"backend": "gguf",
"device": "cuda",
"prompt_format": "gemma4_template",
"enable_thinking": false,
"enabled": true,
"replicas": 3,
"replica_max": 4,
"target_inflight": 1,
"gguf_n_gpu_layers": -1,
"gguf_n_ctx": 4096,
"gguf_flash_attn": "auto",
"gguf_type_k": "q8_0",
"gguf_type_v": "q4_0"
}
}
}
}

Notes:
- Models without a `backend` field use the global `engine.backend`.
- `enabled` controls whether a model is loaded by the pool at startup.
- A configured model with `enabled: false` may still be loaded later through the admin API.
- `replicas` is the default replica count that will be started for that model id when it is loaded.
- `replica_max` is the maximum allowed replica count for that model id.
- `target_inflight` is configured per model id and applied per loaded replica through the scheduler.
- `enable_thinking` is an optional per-model template setting for formats that expose a thinking toggle.
- Request-level decoding values override `engine.decoding` defaults when provided.
- ExLlamaV3 models also support `exllama_tp_backend`, `exllama_max_batch_size`, `exllama_max_chunk_size`, `exllama_max_q_size`, and `exllama_max_rq_tokens`.
- GGUF models also support `gguf_n_gpu_layers`, `gguf_n_ctx`, `gguf_flash_attn`, `gguf_type_k`, and `gguf_type_v`.
- OpenAI-compatible remote models support `remote_api_kind`, `remote_base_url`, `remote_api_key_env`, `remote_model`, `remote_timeout_s`, `remote_health_check`, `remote_max_retries`, and `remote_thinking`.
- Requests to GGUF models using `prompt_format: "translategemma_template"` must include `source_lang_code` and `target_lang_code`; put only the source text in `input` and omit `instructions`.
- The admin API may temporarily override backend-specific load settings at runtime without modifying `settings.json` or `local.json`.
- The exact load override fields, allowed values, defaults, and recommended presets are documented in runtime-admin-api.md.
- ExLlamaV3 dependencies are loaded lazily and required only when an ExLlamaV3 model is configured.
- GGUF dependencies are loaded lazily and required only when a GGUF model is configured.
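The backend fallback, `enabled`, and replica notes can be combined into a small startup plan. This is a sketch under assumptions: the function name is hypothetical, and the defaults used when `enabled` or `replicas` are omitted (`true` and `1` here) are not stated in this README.

```python
from typing import Any, Dict, List, Tuple


def startup_plan(engine: Dict[str, Any]) -> List[Tuple[str, str, int]]:
    """Return (model_id, backend, replicas) for each model loaded at startup."""
    default_backend = engine.get("backend")
    plan = []
    for model_id, cfg in engine.get("models", {}).items():
        if not cfg.get("enabled", True):  # assumption: enabled defaults to true
            continue  # still loadable later through the admin API
        backend = cfg.get("backend", default_backend)  # global engine.backend fallback
        replicas = cfg.get("replicas", 1)  # assumption: one replica by default
        replica_max = cfg.get("replica_max", replicas)
        if replicas > replica_max:
            raise ValueError(f"{model_id}: replicas {replicas} exceeds replica_max {replica_max}")
        plan.append((model_id, backend, replicas))
    return plan
```

Applied to the configuration above, the GGUF model would start with three replicas and the ExLlamaV3 model with one.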
TranslateGemma request example:
{
"model": "translategemma-12b-it-q5-k-m-gguf",
"input": "Ach, hij is gewoon een ouwe brombeer, maar hij bedoelt het goed.",
"source_lang_code": "nl",
"target_lang_code": "en"
}

Optional env vars:
- `LLM_POOL_SETTINGS_PATH`: explicit base settings file path.
- `LLM_POOL_LOCAL_SETTINGS_PATH`: explicit local override file path.
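A minimal sketch of how these variables could select the two settings files follows. It assumes that the fallbacks are `config/settings.json` and `config/local.json` relative to the working directory; the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple

# Assumption: default locations when the env vars are unset.
DEFAULT_SETTINGS = Path("config/settings.json")
DEFAULT_LOCAL = Path("config/local.json")


def settings_paths(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (base_settings_path, local_override_path), honoring the env vars."""
    env = os.environ if env is None else env
    base = Path(env.get("LLM_POOL_SETTINGS_PATH") or DEFAULT_SETTINGS)
    local = Path(env.get("LLM_POOL_LOCAL_SETTINGS_PATH") or DEFAULT_LOCAL)
    return base, local
```

Passing a plain dict instead of `os.environ` makes the lookup easy to exercise in tests.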
Remote models use the same public model contract as local models, but activate an upstream API route instead of loading model weights. For V1, the supported remote backend is openai_compatible with Chat Completions.
Example:
{
"engine": {
"models": {
"frontier-large": {
"backend": "openai_compatible",
"remote_api_kind": "chat_completions",
"remote_base_url": "https://api.example.com/v1",
"remote_api_key_env": "EXAMPLE_API_KEY",
"remote_model": "provider-model",
"remote_timeout_s": 120,
"remote_health_check": "config_only",
"remote_max_retries": 0,
"remote_thinking": "disabled",
"target_inflight": 1,
"enabled": false
}
}
}
}

Remote backend notes:
- `model_path` is not required for `openai_compatible` models.
- The API key is read from the environment variable named by `remote_api_key_env`.
- Callers must set `allow_remote: true`; otherwise the request is rejected before it enters the scheduler.
- `target_inflight` controls local submission concurrency. It is not a guarantee that the upstream provider runs requests concurrently.
- `remote_thinking` is an explicit provider extension for APIs that expose a Chat Completions `thinking` field. It currently accepts `"enabled"` or `"disabled"`.
- Remote calls may incur provider costs. Phase 1 only provides the explicit `allow_remote` request gate; local budget enforcement and cost ledgers are still design-note work. See remote-openai-compatible-backend-notes.md.
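The request gate and the API key lookup described in these notes can be sketched as below. The exception class and function names are hypothetical, and the real service may report these conditions differently, for example as an HTTP error response.

```python
import os
from typing import Any, Dict, Mapping, Optional


class RemoteNotAllowed(Exception):
    """Raised when a remote model is requested without allow_remote: true."""


def check_remote_gate(backend: str, allow_remote: bool) -> None:
    """Reject remote calls up front, before any scheduling would happen."""
    if backend == "openai_compatible" and not allow_remote:
        raise RemoteNotAllowed("set allow_remote: true to call a remote model")


def read_api_key(model_cfg: Dict[str, Any], env: Optional[Mapping[str, str]] = None) -> str:
    """Read the API key from the env var named by remote_api_key_env."""
    env = os.environ if env is None else env
    var_name = model_cfg["remote_api_key_env"]
    key = env.get(var_name)
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key
```

Keeping the key out of the settings files means only the variable name, such as `EXAMPLE_API_KEY` in the example above, is stored in configuration.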
- Clients send only the model id that appears in the API and config.
- A single model id may map to multiple identical internal replicas.
- `/v1/models` returns model ids, not internal replica ids.
- `/v1/admin/models` returns one aggregate row per model id.
- `replicas` is the default replica count for that model id when it is loaded.
- `replica_max` is the maximum allowed replica count for that model id.
- Replicas are only for identical runtime instances. Different context sizes or cache settings require different model ids.
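To make the routing idea concrete, the sketch below picks a replica for one model id. The selection policy (least inflight, limited by `target_inflight` per replica) is an assumption for illustration; this README does not specify how the scheduler chooses among replicas, and the `Replica` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    replica_id: str  # internal id, never exposed to clients
    inflight: int = 0  # requests currently running on this replica


def pick_replica(replicas: List[Replica], target_inflight: int) -> Optional[Replica]:
    """Return the least-loaded replica with spare capacity, or None to keep queueing."""
    candidates = [r for r in replicas if r.inflight < target_inflight]
    if not candidates:
        return None  # every replica is at its target; the request stays queued
    return min(candidates, key=lambda r: r.inflight)
```

Because all replicas of a model id are identical, the choice affects only load distribution, not the result a client sees.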
The response metrics payload uses nested timers:
- `backend_inference_wall_ms`: time spent inside the model runtime itself generating the result
- `engine_total_wall_ms`: backend inference plus queueing, scheduling, and other engine work around it
- `pool_total_wall_ms`: total time spent inside the `llm-pool` request handler
In other words:
- `Inference` is the smallest boundary
- `Engine` wraps `Inference`
- `Pool` wraps `Engine`
The payload may also include runtime-specific counters and sub-timers:
- `engine_queue_wait_ms`: time spent waiting in the per-model scheduler queue before backend work starts
- `engine_tokenize_ms`: prompt tokenization time when the backend reports it separately
- `gpu_time_to_first_token_ms`: time from generation start to first generated token, when available
- `gpu_generate_total_ms`: backend-reported generation time
- `gpu_decode_after_first_token_ms`: generation time after the first token, when available
- `engine_prompt_tokens` / `engine_output_tokens`: prompt and generated token counts, when available
- `engine_tokens_per_second`: generated output tokens divided by the measured generation wall time
Some fields are backend-dependent and may be null.
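Because the timers nest, the overhead of each layer can be derived by subtraction. The sketch below does this for a metrics payload; `pool_outside_engine_wall_ms` is a hypothetical name introduced here, and using `gpu_generate_total_ms` as the denominator for throughput is an assumption that happens to be consistent with the example response earlier in this README.

```python
from typing import Any, Dict, Optional


def layer_overheads(metrics: Dict[str, Any]) -> Dict[str, float]:
    """Derive per-layer overhead from the nested wall-clock timers."""
    backend = metrics["backend_inference_wall_ms"]
    engine = metrics["engine_total_wall_ms"]
    pool = metrics["pool_total_wall_ms"]
    return {
        "engine_outside_backend_wall_ms": round(engine - backend, 1),
        "pool_outside_engine_wall_ms": round(pool - engine, 1),  # hypothetical name
    }


def tokens_per_second(metrics: Dict[str, Any]) -> Optional[float]:
    """Output tokens divided by generation time; None when a field is null."""
    tokens = metrics.get("engine_output_tokens")
    generate_ms = metrics.get("gpu_generate_total_ms")
    if tokens is None or not generate_ms:
        return None  # backend-dependent fields may be null
    return round(tokens / (generate_ms / 1000.0), 1)
```

Checking for `None` before dividing matters because several of these fields are optional per backend.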
python3 -m unittest discover -s tests

The repo also includes design notes and trackers in various stages of completion:
- runtime-scheduler-notes.md: Captures the broader scheduler design space beyond the current in-process implementation.
- runtime-scheduler-tracker.md: Tracks the current scheduler MVP implementation status and remaining deferred work.
- model-replica-routing-notes.md: Captures the client-visible model and replica-routing semantics.
- remote-openai-compatible-backend-notes.md: Captures the proposed remote OpenAI-compatible backend shape, including cost-control notes.
- runtime-subprocess-notes.md: Captures the intended process-isolation model for loaded runtimes and how that should fit behind the same scheduler/runtime adapter boundary.
This pool builds on a number of excellent upstream projects:
- FastAPI
- Uvicorn
- Pydantic
- CTranslate2
- Transformers
- ExLlamaV3
- llama-cpp-python
- llama.cpp
Licensed under the Apache License, Version 2.0. See LICENSE.