Chat with any codebase. Locally. Privately. Free.
The open-source alternative to Cursor and Copilot — runs entirely on your machine with Ollama.
Plus: smart routing, fine-tuning, AI agents, MCP server, and full observability.
Ask • Gateway • Smart Routing • Agents & MCP • Fine-tuning • Observability • Prove the Value • Editor • Docs
brew install mara-werils/llmstack/llmstack # or: pipx install llmstack-cli
llmstack quickstart # sizes a model to your machine, proves it works
llmstack ask -i ./src/ # then chat with your codebase
quickstart is the 30-second path to first value: it detects your hardware,
recommends and pulls a local model, and prints a real local completion so you
see it work. No API key, no Docker -- the only prerequisite is Ollama.
(Docker is optional, used only for the full gateway stack via llmstack up.)
llmstack ask "How does authentication work?" ./src/
One command. No API keys. No cloud. No Docker. No $20/month subscription. Just Ollama + your files.
llmstack ask model=llama3.2 embeddings=nomic-embed-text
Git: main (15 recent commits)
Index cached: 847 chunks (0 files changed)
Embeddings loaded from cache
Answer:
Authentication works through API key validation in the FastAPI gateway
middleware. Each request must include an `Authorization: Bearer <key>`
header. The middleware validates keys against the stored list in
llmstack.yaml [src/gateway/middleware/auth.py:23-45]. Rate limiting
is tied to the API key — each key gets its own token bucket tracked
in Redis [src/gateway/middleware/rate_limit.py:12-38].
┌─────────────── Sources ───────────────┐
│ File Lines Score │
│ gateway/middleware/auth.py 23-45 0.0142 │
│ gateway/middleware/rate_limit.py 12-38 0.0098 │
│ config/schema.py 89-102 0.0076 │
└───────────────────────────────────────┘
| | llmstack ask | Cursor | Copilot | Aider | Khoj |
|---|---|---|---|---|---|
| AST-aware code chunking | Yes | Yes | - | Partial | No |
| Hybrid search (BM25 + vector) | Yes | ? | - | No | No |
| Persistent incremental index | Yes | Yes | - | No | Yes |
| Git-aware context | Yes | Yes | - | Yes | No |
| Interactive conversation | Yes | Yes | - | Yes | Yes |
| 20+ file types (PDF, DOCX, logs...) | Yes | No | No | No | Yes |
| 100% local, 100% private | Yes | No | No | No | Yes |
| 100% free, forever | Yes | $20/mo | $10/mo | API costs | Free |
| Zero config CLI | Yes | IDE only | IDE only | Config needed | Server needed |
Persistent index — first query indexes your project (~30s). Every query after that: ~0.1s. Only re-embeds files that changed (SHA-256 hash diff).
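The hash-diff step can be illustrated with a minimal sketch. It assumes the cache is a plain mapping from relative path to SHA-256 digest; the function names and the cache layout are hypothetical, not llmstack's actual index format.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(root: Path, cached: dict[str, str]) -> list[str]:
    """List files under root whose digest differs from the cached digest.

    `cached` maps a POSIX-style relative path to the digest recorded at the
    last indexing run (hypothetical layout). New files count as changed.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            if cached.get(rel) != file_digest(path):
                changed.append(rel)
    return changed
```

Only the paths returned by `changed_files` would need re-embedding; everything else can be served from the cached index.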
AST-aware chunking — Python files split by functions and classes using the ast module. Large classes (>50 lines) split into individual methods. JS/TS/Go/Rust/Java use regex boundary detection. No more broken chunks mid-function.
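As a rough illustration of the Python path, the sketch below uses the standard `ast` module to emit one chunk per top-level function or class, and one chunk per method when a class exceeds the line limit. The function name and return shape are assumptions for illustration; module-level statements and class attributes are ignored here for brevity.

```python
import ast


def chunk_python(source: str, max_class_lines: int = 50) -> list[tuple[str, int, int]]:
    """Return (name, start_line, end_line) chunks for top-level definitions."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            chunks.append((node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            length = node.end_lineno - node.lineno + 1
            methods = [n for n in node.body if isinstance(n, func_types)]
            if length > max_class_lines and methods:
                # Large class: emit one chunk per method instead of the whole class.
                for m in methods:
                    chunks.append((f"{node.name}.{m.name}", m.lineno, m.end_lineno))
            else:
                chunks.append((node.name, node.lineno, node.end_lineno))
    return chunks
```

Because chunk boundaries come from the parse tree, a chunk never starts or ends in the middle of a function body.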
Hybrid search — combines BM25 keyword matching (catches exact function names, error messages) with vector cosine similarity (catches meaning and intent). Merged via Reciprocal Rank Fusion. Better recall than either alone.
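Reciprocal Rank Fusion itself is small enough to sketch. Each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The constant `k = 60` is the value commonly used in the literature; whether llmstack uses the same constant is an assumption.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking.

    Ranks are 1-based. Ties are broken by document id so the output is stable.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative rankings: one from keyword search, one from vector search.
bm25_hits = ["auth.py", "rate_limit.py", "schema.py"]
vector_hits = ["rate_limit.py", "cache.py", "auth.py"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

A file that ranks well in both lists (here `rate_limit.py`) rises above one that tops only a single list, which is where the recall benefit comes from.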
Git-aware — the LLM sees your current branch, recent commits, and changed files. Ask "what changed this week?" and get real answers.
Interactive mode — multi-turn conversation with your codebase. Context preserved across questions.
# Interactive conversation with your project
llmstack ask -i ./src/
# You: How does the cache work?
# Assistant: The cache uses Redis with SHA-256 keys...
# You: What happens when Redis goes down?
# Assistant: There's an in-memory fallback in rate_limit.py...
# Single question
llmstack ask "Find security vulnerabilities" ./src/ --model llama3.1:70b
# Ask about any file type
llmstack ask "Summarize the key findings" report.pdf
llmstack ask "What went wrong at 3am?" error.log
cat contract.pdf | llmstack ask "Are there any risks?"
# Skip cache for fresh re-index
llmstack ask "What's new?" ./src/ --no-cache
# Without git context
llmstack ask "Explain the architecture" ./src/ --no-git
20+ file types: Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, PDF, DOCX, Markdown, HTML, JSON, YAML, TOML, CSV, logs, and more.
Private by default — everything runs on your machine. llmstack verify-private audits your config and fails loudly if anything (a cloud provider, a webhook, a network-capable agent tool, wide-open CORS) could send code or prompts off the box:
llmstack verify-private # human-readable report, non-zero exit if not private
llmstack verify-private --json # machine-readable, for CI gates
llmstack verify-private --live # also probes the *running* gateway, not just llmstack.yaml
In your editor — the VS Code / OpenVSX extension brings Ask and Explain commands to VS Code, Cursor, and Windsurf, all routed through your local gateway.
Prove the value — llmstack savings tallies the cloud bill you didn't pay, and llmstack benchmark produces a reproducible cost + latency + zero-egress report. The numbers are yours, generated locally. See below.
# Install (pick one)
curl -LsSf https://raw.githubusercontent.com/mara-werils/llmstack/main/install.sh | sh # one-liner
brew install mara-werils/llmstack/llmstack # macOS / Linux (Homebrew)
pipx install llmstack-cli # isolated, no venv to manage
uv tool install llmstack-cli # same, via uv
pip install llmstack-cli # plain pip
# Chat with your codebase (just needs Ollama)
llmstack ask -i ./src/
# Full LLM stack with smart routing
llmstack init # interactive setup wizard
llmstack up
| Method | Command | Best for |
|---|---|---|
| One-liner | `curl -LsSf https://raw.githubusercontent.com/mara-werils/llmstack/main/install.sh \| sh` | Fastest start (auto-picks uv/pipx/pip) |
| Homebrew | `brew install mara-werils/llmstack/llmstack` | macOS / Linux, auto-updates |
| pipx | `pipx install llmstack-cli` | Isolated CLI, no venv juggling |
| uv | `uv tool install llmstack-cli` | uv users |
| pip | `pip install llmstack-cli` | Inside an existing environment |
| Docker | `docker run -p 8000:8000 ghcr.io/mara-werils/llmstack:latest` | Running the gateway as a server |
| Editor | Search LLMStack in VS Code, Cursor, Windsurf, or VSCodium (Open VSX) | AI in your editor — guide |
Route every request through a single OpenAI-compatible endpoint. llmstack picks the best provider and model automatically.
6 cloud providers + local inference:
| Provider | Models | Pricing tracked |
|---|---|---|
| OpenAI | GPT-4o, GPT-4.1, o3, o4-mini, GPT-4.1-nano | Per-token |
| Anthropic | Claude Opus 4, Sonnet 4, Haiku 4 | Per-token |
| Google | Gemini 2.5 Pro/Flash, Gemini 2.0 Flash | Per-token |
| Groq | Llama 3.3 70B, Llama 3.1 8B, Mixtral | Per-token |
| Together | Llama 405B, DeepSeek R1/V3, Qwen 72B | Per-token |
| Mistral | Mistral Large/Small, Codestral, Pixtral | Per-token |
| Local | Ollama / vLLM (any GGUF model) | Free |
Fallback chains: if OpenAI returns 429/503, the request automatically retries on Anthropic, then falls back to local.
# llmstack.yaml
providers:
  enabled: true
  strategy: cost # cost | quality | balanced | latency
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      models:
        - name: gpt-4.1-nano
          tier: simple
          cost_per_m_input: 0.10
        - name: gpt-4o
          tier: medium
          cost_per_m_input: 2.50
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local
Response headers tell you exactly what happened:
X-Provider: openai
X-Model-Router: gpt-4.1-nano
X-Query-Tier: simple
X-Cost-USD: 0.000003
X-Cache: MISS
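The fallback chain described above can be pictured as a loop over providers that moves on whenever one reports a retryable failure. This is a sketch under stated assumptions: `ProviderUnavailable` and `complete_with_fallback` are hypothetical names, and the provider calls are stand-in functions rather than real API clients.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised by a provider call on a retryable status such as 429 or 503."""


def complete_with_fallback(
    prompt: str, chain: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (provider_name, response)."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Remember why this provider was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The provider name returned alongside the response is the kind of value a gateway could surface in a header such as `X-Provider`.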
The classifier scores every request in < 2ms using 7 heuristic signals (no ML model needed), then picks the cheapest adequate model.
Request → Classify (7 signals, <2ms) → Route to optimal model + provider
| Signal | What it measures |
|---|---|
| Token count | Message length |
| Task markers | "hello" vs "implement distributed consensus" |
| Code detection | Code blocks, programming terms |
| Conversation depth | Turn count |
| System prompt | Complexity of instructions |
| Language mix | Multilingual content |
| Question patterns | Simple fact vs multi-constraint reasoning |
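To make the idea of heuristic scoring concrete, here is a sketch that covers four of the seven signals. The marker words, thresholds, and tier cut-offs are invented for illustration and are not llmstack's actual rules.

```python
import re

# Hypothetical task markers that hint at a non-trivial request.
TASK_MARKERS = ("write", "implement", "design", "refactor", "optimize")
CODE_PATTERN = re.compile(r"\b(def|class|function|import)\b")


def classify(messages: list[dict[str, str]]) -> str:
    """Return 'simple', 'medium', or 'complex' from cheap text heuristics."""
    text = " ".join(m["content"] for m in messages if m["role"] == "user")
    lowered = text.lower()
    score = 0
    score += int(len(text.split()) > 50)                  # token count (word-count proxy)
    score += int(any(w in lowered for w in TASK_MARKERS))  # task markers
    score += int(bool(CODE_PATTERN.search(lowered)))       # code detection
    score += int(len(messages) > 6)                        # conversation depth
    if score == 0:
        return "simple"
    return "medium" if score == 1 else "complex"
```

Everything here is string inspection and counting, which is why this style of classifier can run in well under a millisecond without a model.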
Real results (CPU-only, no GPU):
| Query | Tier | Model | Latency |
|---|---|---|---|
| "Hello!" | Simple | llama3.2:1b | 1.6s |
| "What is 2+2?" | Simple | llama3.2:1b | 5.9s |
| "Write binary search in Python" | Medium | llama3.2:3b | 52.2s |
71% of requests routed to the small model. 71% compute savings.
With cloud providers, cost savings are even bigger — simple queries go to gpt-4.1-nano ($0.10/M) instead of gpt-4o ($2.50/M).
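The price gap is easy to check by hand. The helper below is a hypothetical illustration that covers input tokens only, since those are the prices quoted above.

```python
def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of `tokens` input tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


nano_cost = input_cost_usd(30, 0.10)   # gpt-4.1-nano at $0.10/M
gpt4o_cost = input_cost_usd(30, 2.50)  # gpt-4o at $2.50/M
```

For a 30-token prompt that is about $0.000003 against $0.000075, a 25x difference on the input side.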
llmstack agent "Find all TODO comments in this repo and summarize them"
The agent uses a ReAct loop — it plans, calls tools, observes results, and iterates until the task is done.
6 built-in tools: read_file, write_file, list_directory, grep, shell, http_get
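A ReAct loop can be sketched in a few lines. The text protocol used here (`CALL <tool> <argument>` and `FINAL: <answer>`) is a made-up convention for the example, and `llm` is any callable that maps the transcript so far to the model's next reply; neither reflects llmstack's internal prompt format.

```python
from typing import Callable


def react_loop(
    task: str,
    llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate model replies and tool calls until the model gives a final answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript).strip()
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        parts = reply.split(" ", 2)
        if len(parts) == 3 and parts[0] == "CALL" and parts[1] in tools:
            observation = tools[parts[1]](parts[2])
        else:
            observation = "error: expected 'CALL <tool> <argument>' with a known tool"
        # Feed the action and its result back so the next step can build on it.
        transcript += f"{reply}\nObservation: {observation}\n"
    return "stopped: step limit reached"
```

The step limit matters in practice: without it, a model that never emits a final answer would loop forever.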
# Use specific tools only
llmstack agent "Check if tests pass" --tools shell,read_file
# Use a larger model for complex tasks
llmstack agent "Refactor auth.py to use JWT tokens" --model llama3.1:70b
Connect any MCP-compatible AI client (Claude Code, Cursor, VS Code) to your local LLM:
llmstack mcp --model llama3.2
// .claude/claude_desktop_config.json
{
  "mcpServers": {
    "llmstack": {
      "command": "llmstack",
      "args": ["mcp", "--model", "llama3.2"]
    }
  }
}
8 tools exposed via MCP: all agent tools + llmstack_chat (LLM inference) + llmstack_ask (file RAG with citations).
Fine-tune a model on your data in one command. No Jupyter. No boilerplate. No ML expertise.
llmstack finetune data.jsonl --base llama3.2:1b --export-ollama my-model
What happens:
- Auto data prep — detects format (CSV/JSON/JSONL/TXT/Parquet), auto-maps columns (instruction/output, prompt/completion, question/answer, chat messages), splits train/eval
- Auto hyperparameters — epochs, LoRA rank, batch size, learning rate all auto-selected based on dataset size and model
- Training — LoRA/QLoRA via unsloth (2x faster) or HuggingFace PEFT
- Export — GGUF conversion + ollama create → model ready to serve
# Override any hyperparameter
llmstack finetune data.csv --base llama3.2:1b --epochs 5 --lr 1e-4 --lora-r 32
# Export to GGUF with custom quantization
llmstack finetune data.jsonl --base llama3.2:1b --export-gguf --quant q5_k_m
# Full pipeline: train + export + register in Ollama
llmstack finetune emails.jsonl --base llama3.2:1b --export-ollama email-assistant
# → ollama run email-assistant
Auto hyperparameter selection:
| Dataset size | Epochs | LoRA rank | Learning rate |
|---|---|---|---|
| < 100 | 5 | 8 | 1e-4 |
| 100–500 | 3 | 16 | 2e-4 |
| 500–5K | 2 | 16 | 2e-4 |
| 5K+ | 1 | 32+ | 2e-4 |
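Read as code, the table is a simple lookup on dataset size. The sketch below is one possible reading: the function name is hypothetical, the overlapping boundaries (500 appears in two rows) are resolved by treating each upper bound as exclusive, and `32+` is taken as 32.

```python
def auto_hyperparameters(n_examples: int) -> dict[str, float]:
    """Pick epochs, LoRA rank, and learning rate from the dataset size."""
    if n_examples < 100:
        return {"epochs": 5, "lora_r": 8, "lr": 1e-4}
    if n_examples < 500:
        return {"epochs": 3, "lora_r": 16, "lr": 2e-4}
    if n_examples < 5000:
        return {"epochs": 2, "lora_r": 16, "lr": 2e-4}
    return {"epochs": 1, "lora_r": 32, "lr": 2e-4}
```

The pattern is the usual one for small-data fine-tuning: fewer examples get more epochs and a smaller adapter, larger datasets get a single pass with more capacity.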
Every response is scored in real-time. Quality drift triggers alerts. Compare models with A/B testing.
5 metrics scored on every non-streaming response:
| Metric | What it measures |
|---|---|
| Coherence | Structural quality (length, sentences, formatting) |
| Relevance | Does the response address the query? |
| Refusal | "I can't help with that" detection |
| Toxicity | Harmful content flags |
| Repetition | Looping / repetitive output |
Quality drops below 0.4 → CRITICAL alert
Quality trending negative over 50 requests → WARNING alert
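The two alert rules can be sketched as follows. How llmstack computes the trend is not stated here, so this example assumes it is the mean of the newer half of the window minus the mean of the older half, compared against a drift threshold of -0.1; the function name is hypothetical.

```python
from statistics import mean
from typing import Optional


def check_quality(
    scores: list[float],
    alert_threshold: float = 0.4,
    drift_threshold: float = -0.1,
    window: int = 50,
) -> Optional[str]:
    """Return 'CRITICAL', 'WARNING', or None for a chronological score history."""
    if not scores:
        return None
    if scores[-1] < alert_threshold:
        return "CRITICAL"
    recent = scores[-window:]
    if len(recent) == window:
        half = window // 2
        # Assumed trend definition: newer half mean minus older half mean.
        trend = mean(recent[half:]) - mean(recent[:half])
        if trend < drift_threshold:
            return "WARNING"
    return None
```

The warning fires even though every individual score is still acceptable, which is the point of drift detection: it catches a slow decline before it reaches the critical threshold.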
# Check live quality from the gateway
llmstack eval --gateway-url http://localhost:8000
┌──────────── Quality Summary ────────────┐
│ Metric Mean Recent Trend Count │
│ overall 0.7821 0.7534 -0.02 1042 │
│ coherence 0.8912 0.8845 +0.01 1042 │
│ relevance 0.6834 0.6223 -0.06 1042 │ ← trending down
│ refusal 0.0124 0.0098 -0.00 1042 │
│ repetition 0.0231 0.0187 -0.00 1042 │
└─────────────────────────────────────────┘
# Create a test via API
curl -X POST http://localhost:8000/v1/observe/ab-test \
-H "Content-Type: application/json" \
-d '{"name":"gpt4o-vs-sonnet","model_a":"gpt-4o","model_b":"claude-sonnet-4-20250514","traffic_split":0.5}'
# Check results
curl http://localhost:8000/v1/observe/ab-test/gpt4o-vs-sonnet
{
  "winner": "claude-sonnet-4-20250514",
  "confidence": "high",
  "avg_quality_a": 0.7821,
  "avg_quality_b": 0.8234,
  "requests_a": 523,
  "requests_b": 519,
  "avg_cost_a_usd": 0.000034,
  "avg_cost_b_usd": 0.000089
}
Every request is traced end-to-end:
GET /v1/observe/traces?model=gpt-4o&limit=10
Each trace captures: prompt, routing decision, provider, model, response, latency, tokens, cost, quality scores.
llmstack's two headline claims — the open alternative to Cursor/Copilot and it saves you money — ship as numbers you generate locally, not as marketing.
llmstack savings # cumulative savings + months of Copilot/Cursor covered
llmstack savings --plan cursor-pro --json
The gateway books a saving on every locally-served request — the cloud price you
didn't pay — into a local ledger (~/.llmstack/savings.json, read offline, no
network). It's valued against dated, sourced pricing using
gpt-4o-mini (one of the cheapest mainstream cloud models) as the baseline, so the
figure is conservative. Cloud-routed (paid) requests are never counted.
GET /v1/savings/summary # running total + subscription-months covered
GET /v1/savings/pricing # the exact dated pricing the math is based on
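The arithmetic behind a booked saving can be sketched as below. The function names are hypothetical, and the prices are passed in as parameters because the real figures come from the dated pricing table; the numbers in the usage lines are illustrative only.

```python
def booked_saving_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cloud price that a locally served request would have cost at the baseline."""
    return (
        prompt_tokens * input_price_per_m + completion_tokens * output_price_per_m
    ) / 1_000_000


def months_covered(total_saved_usd: float, plan_price_per_month: float) -> float:
    """How many months of a paid subscription the cumulative saving would cover."""
    return total_saved_usd / plan_price_per_month


# Illustrative prices only, not the ledger's dated pricing.
one_request = booked_saving_usd(1000, 500, input_price_per_m=0.15, output_price_per_m=0.60)
covered = months_covered(45.0, plan_price_per_month=20.0)
```

Pricing against a cheap baseline keeps the per-request figure small, so the cumulative total understates rather than overstates what a pricier model would have cost.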
llmstack benchmark # default suite vs local Ollama
llmstack benchmark -b gpt-4o -o report.md
llmstack benchmark --mock # deterministic, no model required
Each report measures latency (mean/p50/p95/p99), throughput, the cloud cost for the same tokens, and a zero-egress proof (the run executes under the egress monitor). Every report carries a methodology hash — a fingerprint of the suite version + tasks + baseline + pricing, but not the machine-specific latencies — so two people can confirm they ran the identical benchmark. The harness never prints a cloud latency number it can't reproduce; latency is your measured number.
A CI gate (examples/benchmark_proof.py) runs the suite in deterministic mode and
fails the build on any external connection. See the
benchmark guide.
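One way to build a methodology hash like the one described above is to serialize the inputs canonically and hash the result. The field names and the choice of SHA-256 over canonical JSON are assumptions for this sketch, not the harness's documented format.

```python
import hashlib
import json


def methodology_hash(
    suite_version: str, tasks: list[str], baseline: str, pricing: dict[str, float]
) -> str:
    """Fingerprint the benchmark definition, excluding any measured latencies."""
    payload = {
        "suite_version": suite_version,
        "tasks": tasks,
        "baseline": baseline,
        "pricing": pricing,
    }
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because measured latencies are left out of the payload, two machines with very different speeds still produce the same hash when they run the same suite against the same baseline and pricing.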
See the top of this README for the full feature breakdown. Under the hood:
Files → AST chunker (functions/classes) → Embed (Ollama) → Persistent SQLite index
↓
Question → BM25 keyword search ──┐
├── Reciprocal Rank Fusion → Top-K context → LLM → Streamed answer
Question → Vector cosine search ─┘
↑
Git context (branch, commits, diff)
llmstack up
│
├── Qdrant (vector DB) :6333
├── Redis (cache + rate limit) :6379
├── Ollama / vLLM (inference) :11434
├── TEI (embeddings) :8002
├── Gateway :8000
│ ├── Smart Router (<2ms classification)
│ ├── Provider Registry (6 cloud + local)
│ ├── Semantic Cache (Redis, <1ms hit)
│ ├── Circuit Breaker (3-state, exponential backoff)
│ ├── Rate Limiter (token bucket, Redis + Lua)
│ ├── Quality Scorer (5 metrics, every response)
│ ├── Trace Store (5K rolling window)
│ ├── RAG Pipeline (ingest + query)
│ └── Web UI (chat, RAG, dashboard)
├── Prometheus (metrics)
└── Grafana (dashboard) :8080
Auto hardware detection:
| Hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ | vLLM | PagedAttention, max throughput |
| NVIDIA GPU < 16GB | Ollama | Lower memory overhead |
| Apple Silicon | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
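The table reduces to a single rule: vLLM for NVIDIA GPUs with at least 16 GB of memory, Ollama otherwise. A sketch, assuming the hardware has already been detected and is passed in (the function name and argument shape are hypothetical):

```python
from typing import Optional


def pick_backend(gpu_vendor: Optional[str], vram_gb: float = 0.0) -> str:
    """Choose an inference backend from detected hardware.

    `gpu_vendor` is 'nvidia', 'apple', or None for CPU-only machines.
    """
    if gpu_vendor == "nvidia" and vram_gb >= 16:
        return "vllm"
    return "ollama"
```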
| Command | Description |
|---|---|
| `llmstack quickstart` | Zero-key local AI in ~30s: size a model to your hardware, pull it, prove a real completion |
| `llmstack ready [--json]` | Fast first-run readiness check; exits non-zero when not ready (CI/script friendly) |
| `llmstack ask <question> [path]` | Ask questions about local files (persistent index, hybrid search) |
| `llmstack ask -i [path]` | Interactive conversation with your codebase |
| `llmstack init [--preset]` | Create config (presets: chat, rag, agent) |
| `llmstack up` | Start all services |
| `llmstack down` | Stop all services |
| `llmstack status` | Health check |
| `llmstack chat` | Interactive terminal chat |
| `llmstack agent <task>` | Run an AI agent with tools |
| `llmstack mcp` | Start MCP server for AI clients |
| `llmstack finetune <data>` | Fine-tune a model on your data |
| `llmstack eval` | Evaluate model quality |
| `llmstack bench` | Benchmark routing performance |
| `llmstack benchmark` | Reproducible cost + latency + zero-egress report vs cloud |
| `llmstack savings` | Show money saved running locally vs paid alternatives |
| `llmstack export` | Generate docker-compose.yml |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
| `llmstack verify-private` | Audit config for any external data egress |
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat (auto-routed across providers) |
| `POST /v1/embeddings` | Text embeddings |
| `GET /v1/models` | List models from all providers (with pricing) |
| `GET /v1/onboarding` | First-run readiness: recommended models, what's installed, next steps |
| `POST /v1/rag/ingest` | Ingest documents for RAG |
| `POST /v1/rag/query` | RAG query with citations |
| `GET /v1/observe/traces` | Request traces with quality scores |
| `GET /v1/observe/quality` | Quality summary with drift detection |
| `GET /v1/observe/alerts` | Active quality alerts |
| `POST /v1/observe/ab-test` | Create A/B test |
| `GET /v1/observe/ab-test/{name}` | A/B test results |
| `GET /healthz` | System health |
| `GET /metrics` | Prometheus metrics |
| | llmstack ask | Cursor | Aider | Khoj | Simon's llm |
|---|---|---|---|---|---|
| AST code chunking | Yes | Yes | Partial | No | No |
| Hybrid search (BM25 + vector) | Yes | ? | No | No | No |
| Persistent incremental index | Yes | Yes | No | Yes | Manual |
| Git-aware context | Yes | Yes | Yes | No | No |
| Interactive conversation | Yes | Yes | Yes | Yes | No |
| 20+ file types | Yes | No | No | Yes | No |
| 100% local + free | Yes | No | No | Yes | Yes |
| Zero config CLI | Yes | No | No | No | Yes |
| | llmstack | Ollama | LiteLLM | LocalAI | LangSmith |
|---|---|---|---|---|---|
| Multi-provider gateway | Yes | - | Yes | - | - |
| Smart cost-aware routing | Yes | - | - | - | - |
| Fallback chains | Yes | - | Yes | - | - |
| AI quality scoring | Yes | - | - | - | Yes |
| Drift detection + alerts | Yes | - | - | - | Yes |
| A/B testing | Yes | - | - | - | Yes |
| One-command fine-tuning | Yes | - | - | - | - |
| AI agents with tools | Yes | - | - | - | - |
| MCP server | Yes | - | - | - | - |
| Local inference | Yes | Yes | - | Yes | - |
| Self-hosted / free | Yes | Yes | Partial | Yes | Paid |
# llmstack.yaml
version: "1"
models:
  chat:
    name: llama3.2
    backend: auto
  embeddings:
    name: bge-m3
providers:
  enabled: true
  strategy: cost
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local
observe:
  quality_tracking: true
  alert_threshold: 0.4
  drift_threshold: -0.1
gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
- Python 3.11+
- llmstack ask: Ollama running locally. No Docker needed.
- Full stack (llmstack up): Docker
- Fine-tuning: pip install llmstack-cli[finetune] (adds PyTorch, PEFT, TRL)
See CONTRIBUTING.md for development setup. PRs welcome.
Apache-2.0
