A framework for downloading and hosting Liquid AI LFM models locally. Provides a CLI for model management, a FastAPI-based inference server with an OpenAI-compatible API, a built-in chat UI, and MCP tool integration.
Supports two inference backends:
- Transformers: HuggingFace native weights (full precision or automatic dtype)
- llama.cpp: GGUF quantized weights via `llama-cpp-python` (faster on CPU, smaller footprint)
Requirements:
- Python 3.10+
- (Optional) A CUDA-capable GPU for faster inference; CPU works fine for smaller models
Clone the repo and create a virtual environment:
cd liquid-host
python3 -m venv .env
source .env/bin/activate
Install the package:
pip install -e .
For development (adds pytest and httpx):
pip install -e ".[dev]"
For GPU-accelerated attention (requires a compatible NVIDIA GPU):
pip install -e ".[flash-attn]"
Verify the installation:
liquid-host --help
List the available models:
liquid-host list
This shows all 15 supported models across the LFM2 and LFM2.5 families. Filter by family with `--family lfm2` or `--family lfm2.5`.
Option A: HuggingFace native weights (transformers backend)
liquid-host download lfm2.5-1.2b-instruct
Models are cached to `~/.cache/liquid-host/models/` by default. Use `--cache-dir /path/to/dir` to change the location.
Option B: GGUF quantized weights (llama.cpp backend, recommended)
leap-bundle download LFM2.5-1.2B-Thinking --quantization=Q4_0 \
  --output-path ~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0
Available quantizations: Q4_0, Q4_K_M, Q5_K_M (default), Q8_0. Smaller quantizations are faster and use less memory at the cost of some quality.
To see what you've already downloaded:
liquid-host downloaded
Start the server with a GGUF model (llama.cpp backend):
liquid-host serve \
  --gguf ~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0
Or with a HuggingFace model (transformers backend):
liquid-host serve --model lfm2.5-1.2b-instruct
This loads the model into memory and starts an HTTP server on http://localhost:8000.
Server options:
| Flag | Default | Description |
|---|---|---|
| `--model` | None | HuggingFace model to preload (transformers backend) |
| `--gguf` | None | Path to GGUF file or directory to preload (llama.cpp backend) |
| `--host` | `0.0.0.0` | Bind address |
| `--port` | `8000` | Port |
| `--device-map` | `auto` | Device placement: `auto`, `cpu`, `cuda:0`, etc. (transformers only) |
| `--dtype` | model default | Override dtype: `float16`, `bfloat16`, `float32` (transformers only) |
| `--flash-attn` | off | Enable Flash Attention 2 (transformers only, requires the `flash-attn` extra) |
| `--n-ctx` | `4096` | Context window size (GGUF only) |
| `--n-gpu-layers` | `0` | Number of layers to offload to GPU (GGUF only) |
| `--cache-dir` | `~/.cache/liquid-host/models/` | Custom cache directory |
| `--mcp-config` | `./mcp_servers.json` | Path to MCP server config |
| `--workers` | `1` | Number of uvicorn workers |
The server exposes an OpenAI-compatible chat completions endpoint, so any OpenAI SDK client works out of the box.
curl:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is a liquid neural network?"}],
    "max_tokens": 256
  }'
Python (openai SDK):
from openai import OpenAI
# Point the client at the local server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is a liquid neural network?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
    model="LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "Explain mixture of experts."}],
    stream=True,
)
for chunk in stream:
    # Print only the chunks that carry text.
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
The server includes a built-in chat interface. Once the server is running, open your browser to:
http://localhost:8000
The server includes three web interfaces:
- Chat (`/`): streaming chat with the model, multi-turn conversation, MCP tool calling with real-time status, configurable settings (temperature, max tokens, system prompt, tool toggle)
- Tool Debugger (`/tools`): test any MCP tool directly with a parameter form, see raw results, copy cURL commands
- Training Data Editor (`/training`): browse, edit, add, and delete training examples, sync changes back to HuggingFace Hub
All pages are cross-linked via nav in the header.
When MCP tools are enabled and connected, the chat UI shows real-time status messages as the model calls tools (e.g., "Finding recent NFLX earnings..."), then streams the final answer token-by-token.
For quick experimentation without running a server:
liquid-host run lfm2.5-1.2b-instruct
Pass a system prompt with `--system "You are a helpful assistant."`. Type your messages and press Enter. Press Ctrl+C to quit.
The transformers backend uses full-precision or automatic-dtype weights from the HuggingFace Hub. It is best for GPU inference or when you need the exact model weights.
liquid-host serve --model lfm2.5-1.2b-instruct --device-map auto
The llama.cpp backend uses quantized GGUF weights via `llama-cpp-python`. It is significantly faster on CPU, has a smaller memory footprint, and supports several quantization levels.
# Download a quantized model
leap-bundle download LFM2.5-1.2B-Thinking --quantization=Q4_0 \
--output-path ~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0
# Serve it
liquid-host serve \
--gguf ~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0 \
--n-ctx 4096
Available quantizations via leap-bundle:
| Quantization | Description |
|---|---|
| `Q4_0` | 4-bit, smallest size, fastest inference |
| `Q4_K_M` | 4-bit k-quant, better quality |
| `Q5_K_M` | 5-bit k-quant (default) |
| `Q8_0` | 8-bit, highest quality |
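As a rough illustration of the size trade-off in the table, the sketch below estimates weight storage from the llama.cpp block layouts for `Q4_0` and `Q8_0`. The helper name `estimate_gib` is made up for this example, the k-quant formats are left out because their layouts mix block types, and real GGUF files also carry metadata and some tensors at higher precision, so treat the numbers as lower bounds.

```python
# Bits per weight implied by the llama.cpp block layouts:
# Q4_0 stores 32 weights in 18 bytes, Q8_0 stores 32 weights in 34 bytes.
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,
    "Q8_0": 34 * 8 / 32,
}


def estimate_gib(n_params: float, quant: str) -> float:
    """Rough weight size in GiB (hypothetical helper; ignores metadata)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


# A 1.2B-parameter model at each quantization level.
for quant in BITS_PER_WEIGHT:
    print(quant, round(estimate_gib(1.2e9, quant), 2))
```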
To offload layers to GPU for faster inference:
liquid-host serve \
--gguf ~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0 \
--n-gpu-layers 32
Liquid Host can connect to MCP (Model Context Protocol) servers, giving the model access to external tools and data sources at runtime.
Create or edit mcp_servers.json in the project root:
{
"servers": [
{
"name": "aiera",
"url": "https://mcp-pub.dev.aiera.com?api_key=YOUR_API_KEY",
"transport": "streamable_http",
"enabled": true,
"description": "Aiera financial data and research platform"
}
]
}
Supported transports: `sse`, `streamable_http` (or `streamable-http`, `http`).
Set "enabled": false to disable a server without removing it. Add multiple entries to connect to several servers simultaneously.
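To make the config format concrete, here is a minimal sketch that parses such a file, keeps only the enabled entries, and normalizes the transport name. It is not Liquid Host's actual loader: the function name `load_enabled_servers`, the choice to treat `streamable-http` and `http` as aliases of `streamable_http`, and the default of "enabled" when the key is missing are all assumptions made for illustration.

```python
import json

# Transport names from the list above, mapped to one canonical spelling.
# Assumption: "streamable-http" and "http" are aliases of "streamable_http".
TRANSPORT_ALIASES = {
    "sse": "sse",
    "streamable_http": "streamable_http",
    "streamable-http": "streamable_http",
    "http": "streamable_http",
}


def load_enabled_servers(config_text: str) -> list:
    """Return enabled server entries with a normalized transport (hypothetical helper)."""
    config = json.loads(config_text)
    servers = []
    for entry in config.get("servers", []):
        # Assumption: an entry without an "enabled" key counts as enabled.
        if not entry.get("enabled", True):
            continue
        transport = entry.get("transport", "")
        if transport not in TRANSPORT_ALIASES:
            raise ValueError(f"unsupported transport: {transport!r}")
        servers.append({**entry, "transport": TRANSPORT_ALIASES[transport]})
    return servers


example = """
{
  "servers": [
    {"name": "aiera", "url": "https://mcp-pub.dev.aiera.com?api_key=YOUR_API_KEY",
     "transport": "http", "enabled": true},
    {"name": "disabled-example", "url": "https://example.invalid",
     "transport": "sse", "enabled": false}
  ]
}
"""
for server in load_enabled_servers(example):
    print(server["name"], server["transport"])
```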
- On server startup, all enabled MCP servers are connected and their tools are discovered
- When a chat request comes in with tools enabled, the model receives the tool definitions and can emit tool calls
- Tool calls are executed against the MCP server, results are fed back, and the model generates a final answer (up to 5 rounds)
- The entire workflow streams status updates to the UI in real time
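The steps above can be sketched as a small loop. This is an illustration only: `generate` and `call_tool` are stand-ins for the model and the MCP client, the reply format (`{"tool_call": ...}` or `{"content": ...}`) is invented for the example, and raising an error when the round limit is hit is an assumption, since the behavior at the limit is not described here.

```python
from typing import Callable

MAX_TOOL_ROUNDS = 5  # the round limit stated above


def run_tool_loop(
    messages: list,
    generate: Callable[[list], dict],
    call_tool: Callable[[str, dict], str],
) -> str:
    """Alternate model turns and tool executions until the model answers in text."""
    for _ in range(MAX_TOOL_ROUNDS):
        reply = generate(messages)
        tool_call = reply.get("tool_call")
        if tool_call is None:
            return reply["content"]  # final answer, no tool requested
        result = call_tool(tool_call["name"], tool_call["arguments"])
        # Feed the call and its result back so the next turn can use them.
        messages.append({"role": "assistant", "content": str(tool_call)})
        messages.append({"role": "tool", "content": result})
    # Assumption: this sketch raises once the round budget is used up.
    raise RuntimeError("tool round limit reached")


def fake_generate(messages):
    # Request a tool first; answer once a tool result is in the history.
    if any(m["role"] == "tool" for m in messages):
        return {"content": "Found 1 event."}
    return {"tool_call": {"name": "find_events", "arguments": {"event_type": "earnings"}}}


def fake_call_tool(name, arguments):
    return '{"events": ["example"]}'


answer = run_tool_loop(
    [{"role": "user", "content": "Any earnings events?"}],
    fake_generate,
    fake_call_tool,
)
print(answer)
```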
| Endpoint | Method | Description |
|---|---|---|
| `/mcp/status` | GET | Shows connected servers and available tools |
| `/mcp/reconnect` | POST | Re-reads `mcp_servers.json` and reconnects (no restart needed) |
To use a config file at a different path:
liquid-host serve --gguf /path/to/model --mcp-config /path/to/mcp_servers.json
The server exposes the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web chat UI |
| `/tools` | GET | Tool debugger UI |
| `/training` | GET | Training data editor UI |
| `/health` | GET | Health check |
| `/status` | GET | Server status, loaded model, and MCP info |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (OpenAI-compatible, supports streaming) |
| `/load` | POST | Load a HuggingFace model: `{"model": "lfm2-24b-a2b"}` |
| `/load-gguf` | POST | Load a GGUF model: `{"path": "/path/to/model", "n_ctx": 4096}` |
| `/unload` | POST | Unload the current model and free memory |
| `/download` | POST | Download a model: `{"model": "lfm2-1.2b"}` |
| `/mcp/status` | GET | MCP connection status and tool list |
| `/mcp/reconnect` | POST | Reconnect to MCP servers (re-reads config) |
| `/api/tools` | GET | List all MCP tools with JSON schemas |
| `/api/tools/call` | POST | Execute a tool: `{"name": "find_events", "arguments": {...}}` |
| `/api/training/config` | GET/POST | Get or set HF dataset config for training data |
| `/api/training/examples` | GET | Pull training examples from HF dataset |
| `/api/training/sync` | POST | Push edited examples back to HF Hub |
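The POST endpoints in the table take small JSON bodies, so they are easy to call from a script. The sketch below builds the requests with the standard library without sending them; `build_post` is a hypothetical helper, the base URL is the local default, and the response format is not documented here, so the commented-out call simply prints whatever JSON comes back.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # default address of a local server


def build_post(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for one of the endpoints above."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_post("/load-gguf", {"path": "/path/to/model", "n_ctx": 4096})
unload_req = build_post("/unload")
print(load_req.get_method(), load_req.full_url)

# To send a request against a running server:
# with urllib.request.urlopen(load_req) as resp:
#     print(json.load(resp))
```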
You can load and unload models at runtime without restarting the server:
# Load a GGUF model
curl -X POST http://localhost:8000/load-gguf \
-H "Content-Type: application/json" \
-d '{"path": "~/.cache/liquid-host/models/gguf/LFM2.5-1.2B-Thinking-Q4_0", "n_ctx": 4096}'
# Load a HuggingFace model
curl -X POST http://localhost:8000/load \
-H "Content-Type: application/json" \
-d '{"model": "lfm2-1.2b", "device_map": "cpu"}'
# Unload to free memory
curl -X POST http://localhost:8000/unload
Supported models in the LFM2 family:
| Key | Params | Active | Architecture | Notes |
|---|---|---|---|---|
| `lfm2-350m` | 350M | 350M | Dense hybrid | Ultra-edge |
| `lfm2-700m` | 700M | 700M | Dense hybrid | Edge devices |
| `lfm2-1.2b` | 1.2B | 1.2B | Dense hybrid | 10 conv + 6 attn layers |
| `lfm2-2.6b` | 2.6B | 2.6B | Dense hybrid | Mid-range |
| `lfm2-2.6b-exp` | 2.6B | 2.6B | Dense hybrid | Pure RL-trained |
| `lfm2-8b-a1b` | 8B | 1B | MoE | 8B total, 1B active |
| `lfm2-24b-a2b` | 24B | 2.3B | MoE | Largest open model, fits in 32GB RAM |
Supported models in the LFM2.5 family:
| Key | Params | Active | Architecture | Notes |
|---|---|---|---|---|
| `lfm2.5-1.2b-base` | 1.2B | 1.2B | Dense hybrid | Pre-trained base (28T tokens) |
| `lfm2.5-1.2b-instruct` | 1.2B | 1.2B | Dense hybrid | Instruction-tuned (recommended) |
| `lfm2.5-1.2b-thinking` | 1.2B | 1.2B | Dense hybrid | Reasoning / chain-of-thought |
| `lfm2.5-1.2b-jp` | 1.2B | 1.2B | Dense hybrid | Japanese-optimized |
| `lfm2.5-vl-1.6b` | 1.6B | 1.6B | Dense hybrid | Vision-language |
| `lfm2.5-audio-1.5b` | 1.5B | 1.5B | Dense hybrid | Audio-language |
# See what's downloaded (shows both HuggingFace and GGUF models)
liquid-host downloaded
# Remove a HuggingFace model
liquid-host delete lfm2-1.2b
# Re-download (force)
liquid-host download lfm2-1.2b --force
The default cache location is `~/.cache/liquid-host/models/`. GGUF models downloaded via leap-bundle are stored in `~/.cache/liquid-host/models/gguf/`. Override the cache location with `--cache-dir` on any command.
Liquid Host supports LoRA/QLoRA fine-tuning of LFM models on chat-format JSONL data. Training can run locally (requires a GPU) or remotely on HuggingFace Spaces.
Training data is a JSONL file in which each line is a JSON object with a `messages` array. The supported roles are `user`, `assistant`, `system`, and `tool`:
{"messages": [{"role": "user", "content": "What were AAPL's Q3 results?"}, {"role": "assistant", "content": "<think>Look up AAPL earnings.</think>\nLooking up recent AAPL earnings...\n[find_events(bloomberg_ticker='AAPL:US', event_type='earnings')]"}, {"role": "tool", "content": "{\"events\": [...]}"}, {"role": "assistant", "content": "Apple reported Q3 revenue of $85.8B..."}]}
See data/training/aiera_tools_v6.jsonl for 208 examples covering all 41 Aiera MCP tools (earnings, financials, filings, transcripts, conferences, indexes, watchlists, sectors, Third Bridge, company docs, research metadata, web search, and more).
For detailed instructions on creating, updating, and maintaining training data — including style requirements, citation standards, and the automated update script — see docs/TRAINING_DATA.md.
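For a quick structural check of a training file before running a job, something like the following sketch may help. It only encodes the rules stated above (one JSON object per line, a `messages` array, four allowed roles) plus one assumption, that every message has a string `content`. It is not the logic behind `liquid-host data validate`, and `validate_jsonl` is a name chosen for this example.

```python
import json

ALLOWED_ROLES = {"user", "assistant", "system", "tool"}  # roles listed above


def validate_jsonl(text: str) -> list:
    """Return a list of problems; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' array")
            continue
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"line {number}: message {index} is not an object")
                continue
            role = message.get("role")
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append(f"line {number}: message {index} has role {role!r}")
            # Assumption: content must be a string.
            if not isinstance(message.get("content"), str):
                problems.append(f"line {number}: message {index} has no string 'content'")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "narrator", "content": "Hi"}]}'
print(validate_jsonl(good + "\n" + bad))
```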
For local training:
pip install -e ".[training]"
For remote training (no GPU needed):
pip install -e ".[remote-training]"
Local training requires a CUDA GPU with sufficient VRAM.
liquid-host finetune lfm2-24b-a2b data/training/aiera_tools_v4.jsonl \
  --output ./finetune-output \
  --epochs 3 \
  --lora-rank 16 \
  --lora-alpha 32 \
  --max-seq-length 2048
For QLoRA (4-bit quantized base model, reduces VRAM usage):
liquid-host finetune lfm2-24b-a2b data/training/aiera_tools_v4.jsonl \
  --quantize-4bit
The adapter is saved to `./finetune-output/adapter/`.
Remote training needs no local GPU. It launches a custom HF Space with a GPU backend that runs the training job, then pushes the adapter to the Hub.
export HF_TOKEN=hf_your_token_here
liquid-host finetune lfm2-24b-a2b data/training/aiera_tools_v4.jsonl \
  --remote \
  --backend a100-large \
  --quantize-4bit \
  --project-name my-finetune
Available GPU backends:
| Backend | GPU | Notes |
|---|---|---|
| `t4-small` | NVIDIA T4 (16GB) | Budget option |
| `t4-medium` | NVIDIA T4 (16GB) | More CPU/RAM |
| `a10g-small` | NVIDIA A10G (24GB) | Good for 1-3B models |
| `a10g-large` | NVIDIA A10G (24GB) | More CPU/RAM |
| `l4x1` | NVIDIA L4 (24GB) | Recommended for LFM2.5 |
| `l4x4` | 4x NVIDIA L4 | Multi-GPU |
| `l40sx1` | NVIDIA L40S (48GB) | Larger models |
| `a100-large` | NVIDIA A100 (80GB) | Maximum capability |
Monitor training at the Space URL printed in the output. When complete, the adapter is pushed to <username>/<project-name> on the Hub.
| Flag | Default | Description |
|---|---|---|
| `--output`, `-o` | `./finetune-output` | Local output directory |
| `--epochs` | `3` | Number of training epochs |
| `--batch-size` | `4` | Per-device batch size |
| `--lr` | `2e-4` | Learning rate |
| `--lora-rank` | `16` | LoRA rank (higher = more capacity) |
| `--lora-alpha` | `32` | LoRA alpha scaling |
| `--lora-dropout` | `0.05` | LoRA dropout |
| `--max-seq-length` | `2048` | Maximum sequence length |
| `--quantize-4bit` | off | Use QLoRA (4-bit quantized base model) |
| `--target-modules` | auto-detect | Comma-separated LoRA target modules |
| `--gradient-accumulation` | `4` | Gradient accumulation steps |
| `--remote` | off | Train on HuggingFace Spaces |
| `--hf-token` | `$HF_TOKEN` | HuggingFace API token |
| `--hf-username` | auto-detect | HuggingFace username |
| `--project-name` | `liquid-host-finetune` | Hub project name for remote training |
| `--backend` | `l4x1` | GPU backend for remote training |
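Two quantities follow from these defaults and are worth keeping in mind when changing them. The arithmetic below assumes the trainer follows the usual conventions: samples per optimizer step equal per-device batch size times accumulation steps, and the LoRA update is scaled by alpha divided by rank. Neither convention is confirmed for this project.

```python
# Defaults taken from the flag table above.
batch_size = 4              # --batch-size
gradient_accumulation = 4   # --gradient-accumulation
lora_rank = 16              # --lora-rank
lora_alpha = 32             # --lora-alpha

# Samples contributing to each optimizer step on a single device.
effective_batch = batch_size * gradient_accumulation

# Conventional LoRA scaling factor applied to the low-rank update.
lora_scale = lora_alpha / lora_rank

print(effective_batch, lora_scale)
```

Under these assumptions, doubling `--lora-rank` without also doubling `--lora-alpha` halves the scale applied to the adapter's update.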
Serve from a local adapter:
liquid-host serve --model lfm2-24b-a2b --adapter ./finetune-output/adapter
Serve from a Hub adapter (after remote training):
liquid-host serve --model lfm2-24b-a2b --adapter username/my-finetune
The adapter is merged into the base model at startup, so it adds no inference overhead.
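Why merging removes the overhead can be shown with generic LoRA arithmetic: the adapter contributes a low-rank update that can be folded into the weight matrix once, after which inference is a single matrix multiply. The toy example below uses plain Python lists and made-up values. It illustrates the standard formula W + (alpha / rank) * B * A, not this project's merge code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # shape (2, rank)
A = [[0.5, -0.5]]     # shape (rank, 2)
x = [[3.0], [1.0]]    # one input column vector

merged = merge_lora(W, B, A, alpha=2, rank=1)

# Unmerged path: base output plus the scaled adapter output (extra matmuls).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged path: one matmul produces the same output.
print(matmul(merged, x) == unmerged)
```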
Deploy Liquid Host as a custom Docker container on HuggingFace Inference Endpoints with GPU support.
Prerequisites:
- HuggingFace account with a valid API token
- A container registry (GitHub Container Registry, Docker Hub, etc.)
- Docker installed locally
# Authenticate with your container registry
echo $GHCR_TOKEN | docker login ghcr.io -u YOUR_USERNAME --password-stdin
# Build for linux/amd64
docker build --platform linux/amd64 -t ghcr.io/healeyengineering/liquid-host:latest .
# Push
docker push ghcr.io/healeyengineering/liquid-host:latest
# Authenticate with HuggingFace
pip install huggingface-hub
huggingface-cli login
# Deploy (base model only)
python deploy/deploy_hf.py \
--image ghcr.io/healeyengineering/liquid-host:latest \
--instance-type nvidia-a100 --instance-size x1
# Deploy with a fine-tuned LoRA adapter and training data editor
python deploy/deploy_hf.py \
--image ghcr.io/healeyengineering/liquid-host:latest \
--instance-type nvidia-a100 --instance-size x1 \
--adapter YOUR_USERNAME/my-finetune \
--hf-token $HF_TOKEN \
--training-repo YOUR_USERNAME/my-finetune-data
Deploy script options:
| Flag | Default | Description |
|---|---|---|
| `--image` | (required) | Docker image URL |
| `--name` | `liquid-host-lfm` | Endpoint name |
| `--repo` | `LiquidAI/LFM2-24B-A2B` | HF model repo (mounted at `/repository`) |
| `--instance-type` | `nvidia-l4` | GPU type (`nvidia-t4`, `nvidia-l4`, `nvidia-a10g`, `nvidia-a100`) |
| `--instance-size` | `x1` | Instance size (`x1`, `x2`, `x4`) |
| `--region` | `us-east-1` | Cloud region |
| `--vendor` | `aws` | Cloud vendor (`aws`, `azure`) |
| `--adapter` | None | HF Hub adapter repo to load at startup |
| `--hf-token` | None | HF token (needed for private adapter repos) |
| `--training-repo` | None | HF dataset repo for the training data editor |
| `--scale-to-zero` | `15` | Minutes before scaling to zero (0 to disable) |
| `--namespace` | your username | HF namespace or org |
The container accepts these environment variables (set via HF portal or deploy script):
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `/repository` | Path to model weights |
| `MODEL_KEY` | `lfm2-24b-a2b` | Model registry key |
| `ADAPTER_PATH` | None | HF Hub adapter repo ID or local path |
| `PORT` | `80` | Server port |
| `DEVICE_MAP` | `auto` | Device placement |
| `DTYPE` | None | Override torch dtype |
| `USE_FLASH_ATTN` | `false` | Enable Flash Attention 2 |
| `MCP_CONFIG` | None | Path to MCP server config |
| `N_GPU_LAYERS` | `99` | GPU layers for GGUF backend |
| `N_CTX` | `128000` | Context window for GGUF backend |
| `HF_HOME` | `/tmp/hf_cache` | HuggingFace cache directory |
| `TRAINING_HF_REPO` | None | HF dataset repo for training data editor |
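For testing a configuration outside the container, the defaults in the table can be mirrored in a few lines. This is a sketch of how such settings might be resolved, not the container's actual entrypoint: `read_settings` is a hypothetical helper, and the rule that `true` or `1` enables Flash Attention is an assumption.

```python
def read_settings(env: dict) -> dict:
    """Resolve a few container settings from an environment mapping (hypothetical helper)."""
    return {
        "model_path": env.get("MODEL_PATH", "/repository"),
        "model_key": env.get("MODEL_KEY", "lfm2-24b-a2b"),
        "port": int(env.get("PORT", "80")),
        "device_map": env.get("DEVICE_MAP", "auto"),
        # Assumption: "true" or "1" (any case) turns the flag on.
        "use_flash_attn": env.get("USE_FLASH_ATTN", "false").lower() in {"true", "1"},
        "n_gpu_layers": int(env.get("N_GPU_LAYERS", "99")),
        "n_ctx": int(env.get("N_CTX", "128000")),
        "adapter_path": env.get("ADAPTER_PATH"),  # None when unset
    }


defaults = read_settings({})
overridden = read_settings({"PORT": "8080", "USE_FLASH_ATTN": "True"})
print(defaults["port"], overridden["port"], overridden["use_flash_attn"])
```

In a real process the mapping would be `os.environ` rather than a literal dict.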
| Model | Min GPU | Recommended |
|---|---|---|
| LFM2.5-1.2B (no MCP tools) | T4 (16GB) | T4 |
| LFM2.5-1.2B (with MCP tools) | L4 (24GB) | L4 |
| LFM2-24B-A2B (with MCP tools) | A100 (80GB) | A100 (80GB) |
The 29 Aiera MCP tool schemas add ~13K tokens to the system prompt. The 24B MoE model requires an A100 (80GB) when MCP tools are enabled. Smaller models (1.2B) work on L4 (24GB).
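The token figure above also matters for the context window, not only for GPU memory. As a back-of-the-envelope check, assuming every request carries the full set of tool schemas, the sketch below compares that overhead with the two context sizes mentioned in this document: the `--n-ctx` default of 4096 and the container's `N_CTX` default of 128000. The function name is illustrative and the 13,000 figure is the approximate value quoted above.

```python
TOOL_SCHEMA_TOKENS = 13_000   # approximate figure quoted above
DEFAULT_N_CTX = 4096          # default for --n-ctx
CONTAINER_N_CTX = 128_000     # default N_CTX in the container


def remaining_context(n_ctx: int, prompt_overhead: int = TOOL_SCHEMA_TOKENS) -> int:
    """Tokens left for conversation and output after the fixed prompt overhead."""
    return n_ctx - prompt_overhead


# A negative result means the schemas alone would not fit in the window.
print(remaining_context(DEFAULT_N_CTX))
print(remaining_context(CONTAINER_N_CTX))
```

Under that assumption, a GGUF server with MCP tools enabled would need `--n-ctx` raised well above the default.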
Once deployed, the endpoint URL hosts both the API and the web UI:
- Web UI: https://YOUR_ENDPOINT_URL/
- Health check: https://YOUR_ENDPOINT_URL/health
- Chat API: https://YOUR_ENDPOINT_URL/v1/chat/completions
curl https://YOUR_ENDPOINT_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "stream": true}'
The CLI includes commands to manage training data on HuggingFace Hub:
# Push local JSONL to HF Hub
liquid-host data push data/training/aiera_tools_v4.jsonl --repo username/my-dataset
# Pull from HF Hub to local file
liquid-host data pull --repo username/my-dataset --output data/training/pulled.jsonl
# List/preview a dataset
liquid-host data list --repo username/my-dataset
# Validate a local JSONL file
liquid-host data validate data/training/aiera_tools_v4.jsonl
The web-based training data editor (`/training`) provides a UI for browsing, editing, and syncing training examples. Set the `TRAINING_HF_REPO` and `HF_TOKEN` environment variables (or pass `--training-repo` and `--hf-token` to the deploy script) to auto-configure it.
Add -v to any command for debug-level logging:
liquid-host -v serve --gguf /path/to/model
This shows detailed logs for every step: tokenization, model generation (with token counts and tok/s), tool calls, MCP communication, and more.