llmstack

Chat with any codebase. Locally. Privately. Free.

The open-source alternative to Cursor and Copilot — runs entirely on your machine with Ollama.
Plus: smart routing, fine-tuning, AI agents, MCP server, and full observability.

Ask • Gateway • Smart Routing • Agents & MCP • Fine-tuning • Observability • Prove the Value • Editor • Docs

brew install mara-werils/llmstack/llmstack   # or: pipx install llmstack-cli
llmstack quickstart                            # sizes a model to your machine, proves it works
llmstack ask -i ./src/                         # then chat with your codebase

quickstart is the 30-second path to first value: it detects your hardware, recommends and pulls a local model, and prints a real local completion so you see it work. No API key, no Docker -- the only prerequisite is Ollama. (Docker is optional, used only for the full gateway stack via llmstack up.)

Ask Your Codebase Anything

llmstack ask "How does authentication work?" ./src/

One command. No API keys. No cloud. No Docker. No $20/month subscription. Just Ollama + your files.

  llmstack ask  model=llama3.2  embeddings=nomic-embed-text

  Git: main (15 recent commits)
  Index cached: 847 chunks (0 files changed)
  Embeddings loaded from cache

  Answer:
  Authentication works through API key validation in the FastAPI gateway
  middleware. Each request must include an `Authorization: Bearer <key>`
  header. The middleware validates keys against the stored list in
  llmstack.yaml [src/gateway/middleware/auth.py:23-45]. Rate limiting
  is tied to the API key — each key gets its own token bucket tracked
  in Redis [src/gateway/middleware/rate_limit.py:12-38].

  ┌─────────────── Sources ───────────────┐
  │ File                  Lines   Score    │
  │ gateway/middleware/auth.py  23-45  0.0142  │
  │ gateway/middleware/rate_limit.py  12-38  0.0098  │
  │ config/schema.py      89-102  0.0076  │
  └───────────────────────────────────────┘

Why this is better than Cursor/Copilot/Aider

	llmstack ask	Cursor	Copilot	Aider	Khoj
AST-aware code chunking	Yes	Yes	-	Partial	No
Hybrid search (BM25 + vector)	Yes	?	-	No	No
Persistent incremental index	Yes	Yes	-	No	Yes
Git-aware context	Yes	Yes	-	Yes	No
Interactive conversation	Yes	Yes	-	Yes	Yes
20+ file types (PDF, DOCX, logs...)	Yes	No	No	No	Yes
100% local, 100% private	Yes	No	No	No	Yes
100% free, forever	Yes	$20/mo	$10/mo	API costs	Free
Zero config CLI	Yes	IDE only	IDE only	Config needed	Server needed

Key features

Persistent index — first query indexes your project (~30s). Every query after that: ~0.1s. Only re-embeds files that changed (SHA-256 hash diff).

AST-aware chunking — Python files split by functions and classes using the ast module. Large classes (>50 lines) split into individual methods. JS/TS/Go/Rust/Java use regex boundary detection. No more broken chunks mid-function.

Hybrid search — combines BM25 keyword matching (catches exact function names, error messages) with vector cosine similarity (catches meaning and intent). Merged via Reciprocal Rank Fusion. Better recall than either alone.

Git-aware — the LLM sees your current branch, recent commits, and changed files. Ask "what changed this week?" and get real answers.

Interactive mode — multi-turn conversation with your codebase. Context preserved across questions.

# Interactive conversation with your project
llmstack ask -i ./src/
# You: How does the cache work?
# Assistant: The cache uses Redis with SHA-256 keys...
# You: What happens when Redis goes down?
# Assistant: There's an in-memory fallback in rate_limit.py...

# Single question
llmstack ask "Find security vulnerabilities" ./src/ --model llama3.1:70b

# Ask about any file type
llmstack ask "Summarize the key findings" report.pdf
llmstack ask "What went wrong at 3am?" error.log
cat contract.pdf | llmstack ask "Are there any risks?"

# Skip cache for fresh re-index
llmstack ask "What's new?" ./src/ --no-cache

# Without git context
llmstack ask "Explain the architecture" ./src/ --no-git

20+ file types: Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, PDF, DOCX, Markdown, HTML, JSON, YAML, TOML, CSV, logs, and more.

Private by default — everything runs on your machine. llmstack verify-private audits your config and fails loudly if anything (a cloud provider, a webhook, a network-capable agent tool, wide-open CORS) could send code or prompts off the box:

llmstack verify-private          # human-readable report, non-zero exit if not private
llmstack verify-private --json   # machine-readable, for CI gates
llmstack verify-private --live   # also probes the *running* gateway, not just llmstack.yaml

In your editor — the VS Code / OpenVSX extension brings Ask and Explain commands to VS Code, Cursor, and Windsurf, all routed through your local gateway.

Prove the value — llmstack savings tallies the cloud bill you didn't pay, and llmstack benchmark produces a reproducible cost + latency + zero-egress report. The numbers are yours, generated locally. See below.

Quick Start

# Install (pick one)
curl -LsSf https://raw.githubusercontent.com/mara-werils/llmstack/main/install.sh | sh   # one-liner
brew install mara-werils/llmstack/llmstack   # macOS / Linux (Homebrew)
pipx install llmstack-cli                     # isolated, no venv to manage
uv tool install llmstack-cli                  # same, via uv
pip install llmstack-cli                       # plain pip

# Chat with your codebase (just needs Ollama)
llmstack ask -i ./src/

# Full LLM stack with smart routing
llmstack init            # interactive setup wizard
llmstack up

Install options

Method	Command	Best for
One-liner	`curl -LsSf https://raw.githubusercontent.com/mara-werils/llmstack/main/install.sh \| sh`	Fastest start (auto-picks uv/pipx/pip)
Homebrew	`brew install mara-werils/llmstack/llmstack`	macOS / Linux, auto-updates
pipx	`pipx install llmstack-cli`	Isolated CLI, no venv juggling
uv	`uv tool install llmstack-cli`	uv users
pip	`pip install llmstack-cli`	Inside an existing environment
Docker	`docker run -p 8000:8000 ghcr.io/mara-werils/llmstack:latest`	Running the gateway as a server
Editor	Search LLMStack in VS Code, Cursor, Windsurf, or VSCodium (Open VSX)	AI in your editor — guide

Universal Gateway

Route every request through a single OpenAI-compatible endpoint. llmstack picks the best provider and model automatically.

6 cloud providers + local inference:

Provider	Models	Pricing tracked
OpenAI	GPT-4o, GPT-4.1, o3, o4-mini, GPT-4.1-nano	Per-token
Anthropic	Claude Opus 4, Sonnet 4, Haiku 4	Per-token
Google	Gemini 2.5 Pro/Flash, Gemini 2.0 Flash	Per-token
Groq	Llama 3.3 70B, Llama 3.1 8B, Mixtral	Per-token
Together	Llama 405B, DeepSeek R1/V3, Qwen 72B	Per-token
Mistral	Mistral Large/Small, Codestral, Pixtral	Per-token
Local	Ollama / vLLM (any GGUF model)	Free

Fallback chains: if OpenAI returns 429/503, the request automatically retries on Anthropic, then falls back to local.

# llmstack.yaml
providers:
  enabled: true
  strategy: cost          # cost | quality | balanced | latency
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      models:
        - name: gpt-4.1-nano
          tier: simple
          cost_per_m_input: 0.10
        - name: gpt-4o
          tier: medium
          cost_per_m_input: 2.50
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local

Response headers tell you exactly what happened:

X-Provider: openai
X-Model-Router: gpt-4.1-nano
X-Query-Tier: simple
X-Cost-USD: 0.000003
X-Cache: MISS

Smart Routing

The classifier scores every request in < 2ms using 7 heuristic signals (no ML model needed), then picks the cheapest adequate model.

Request → Classify (7 signals, <2ms) → Route to optimal model + provider

Signal	What it measures
Token count	Message length
Task markers	"hello" vs "implement distributed consensus"
Code detection	Code blocks, programming terms
Conversation depth	Turn count
System prompt	Complexity of instructions
Language mix	Multilingual content
Question patterns	Simple fact vs multi-constraint reasoning

Real results (CPU-only, no GPU):

Query	Tier	Model	Latency
"Hello!"	Simple	llama3.2:1b	1.6s
"What is 2+2?"	Simple	llama3.2:1b	5.9s
"Write binary search in Python"	Medium	llama3.2:3b	52.2s

71% of requests routed to the small model. 71% compute savings.

With cloud providers, cost savings are even bigger — simple queries go to gpt-4.1-nano ($0.10/M) instead of gpt-4o ($2.50/M).

AI Agents & MCP

Agents with tool use

llmstack agent "Find all TODO comments in this repo and summarize them"

The agent uses a ReAct loop — it plans, calls tools, observes results, and iterates until the task is done.

6 built-in tools: read_file, write_file, list_directory, grep, shell, http_get

# Use specific tools only
llmstack agent "Check if tests pass" --tools shell,read_file

# Use a larger model for complex tasks
llmstack agent "Refactor auth.py to use JWT tokens" --model llama3.1:70b

MCP Server

Connect any MCP-compatible AI client (Claude Code, Cursor, VS Code) to your local LLM:

llmstack mcp --model llama3.2

// .claude/claude_desktop_config.json
{
  "mcpServers": {
    "llmstack": {
      "command": "llmstack",
      "args": ["mcp", "--model", "llama3.2"]
    }
  }
}

8 tools exposed via MCP: all agent tools + llmstack_chat (LLM inference) + llmstack_ask (file RAG with citations).

Fine-tuning

Fine-tune a model on your data in one command. No Jupyter. No boilerplate. No ML expertise.

llmstack finetune data.jsonl --base llama3.2:1b --export-ollama my-model

What happens:

Auto data prep — detects format (CSV/JSON/JSONL/TXT/Parquet), auto-maps columns (instruction/output, prompt/completion, question/answer, chat messages), splits train/eval
Auto hyperparameters — epochs, LoRA rank, batch size, learning rate all auto-selected based on dataset size and model
Training — LoRA/QLoRA via unsloth (2x faster) or HuggingFace PEFT
Export — GGUF conversion + ollama create → model ready to serve

# Override any hyperparameter
llmstack finetune data.csv --base llama3.2:1b --epochs 5 --lr 1e-4 --lora-r 32

# Export to GGUF with custom quantization
llmstack finetune data.jsonl --base llama3.2:1b --export-gguf --quant q5_k_m

# Full pipeline: train + export + register in Ollama
llmstack finetune emails.jsonl --base llama3.2:1b --export-ollama email-assistant
# → ollama run email-assistant

Auto hyperparameter selection:

Dataset size	Epochs	LoRA rank	Learning rate
< 100	5	8	1e-4
100–500	3	16	2e-4
500–5K	2	16	2e-4
5K+	1	32+	2e-4

AI Observability

Every response is scored in real-time. Quality drift triggers alerts. Compare models with A/B testing.

Quality scoring (every response, < 1ms)

5 metrics scored on every non-streaming response:

Metric	What it measures
Coherence	Structural quality (length, sentences, formatting)
Relevance	Does the response address the query?
Refusal	"I can't help with that" detection
Toxicity	Harmful content flags
Repetition	Looping / repetitive output

Drift detection & alerts

Quality drops below 0.4 → CRITICAL alert
Quality trending negative over 50 requests → WARNING alert

# Check live quality from the gateway
llmstack eval --gateway-url http://localhost:8000

┌──────────── Quality Summary ────────────┐
│ Metric     Mean    Recent  Trend  Count  │
│ overall    0.7821  0.7534  -0.02   1042  │
│ coherence  0.8912  0.8845  +0.01   1042  │
│ relevance  0.6834  0.6223  -0.06   1042  │  ← trending down
│ refusal    0.0124  0.0098  -0.00   1042  │
│ repetition 0.0231  0.0187  -0.00   1042  │
└─────────────────────────────────────────┘

A/B testing

# Create a test via API
curl -X POST http://localhost:8000/v1/observe/ab-test \
  -d '{"name":"gpt4o-vs-sonnet","model_a":"gpt-4o","model_b":"claude-sonnet-4-20250514","traffic_split":0.5}'

# Check results
curl http://localhost:8000/v1/observe/ab-test/gpt4o-vs-sonnet

{
  "winner": "claude-sonnet-4-20250514",
  "confidence": "high",
  "avg_quality_a": 0.7821,
  "avg_quality_b": 0.8234,
  "requests_a": 523,
  "requests_b": 519,
  "avg_cost_a_usd": 0.000034,
  "avg_cost_b_usd": 0.000089
}

Request tracing

Every request is traced end-to-end:

GET /v1/observe/traces?model=gpt-4o&limit=10

Each trace captures: prompt, routing decision, provider, model, response, latency, tokens, cost, quality scores.

Prove the Value — Savings & Reproducible Benchmarks

llmstack's two headline claims — the open alternative to Cursor/Copilot and it saves you money — ship as numbers you generate locally, not as marketing.

Savings: the cloud bill you didn't pay

llmstack savings                 # cumulative savings + months of Copilot/Cursor covered
llmstack savings --plan cursor-pro --json

The gateway books a saving on every locally-served request — the cloud price you didn't pay — into a local ledger (~/.llmstack/savings.json, read offline, no network). It's valued against dated, sourced pricing using gpt-4o-mini (one of the cheapest mainstream cloud models) as the baseline, so the figure is conservative. Cloud-routed (paid) requests are never counted.

GET /v1/savings/summary    # running total + subscription-months covered
GET /v1/savings/pricing    # the exact dated pricing the math is based on

Benchmarks: cost + latency + privacy, reproducibly

llmstack benchmark                  # default suite vs local Ollama
llmstack benchmark -b gpt-4o -o report.md
llmstack benchmark --mock           # deterministic, no model required

Each report measures latency (mean/p50/p95/p99), throughput, the cloud cost for the same tokens, and a zero-egress proof (the run executes under the egress monitor). Every report carries a methodology hash — a fingerprint of the suite version + tasks + baseline + pricing, but not the machine-specific latencies — so two people can confirm they ran the identical benchmark. The harness never prints a cloud latency number it can't reproduce; latency is your measured number.

A CI gate (examples/benchmark_proof.py) runs the suite in deterministic mode and fails the build on any external connection. See the benchmark guide.

More about `llmstack ask`

See the top of this README for the full feature breakdown. Under the hood:

Files → AST chunker (functions/classes) → Embed (Ollama) → Persistent SQLite index
                                                                    ↓
Question → BM25 keyword search ──┐
                                 ├── Reciprocal Rank Fusion → Top-K context → LLM → Streamed answer
Question → Vector cosine search ─┘
                                                                    ↑
                                                          Git context (branch, commits, diff)

Full Stack Architecture

llmstack up
    │
    ├── Qdrant (vector DB)          :6333
    ├── Redis (cache + rate limit)  :6379
    ├── Ollama / vLLM (inference)   :11434
    ├── TEI (embeddings)            :8002
    ├── Gateway                     :8000
    │   ├── Smart Router (<2ms classification)
    │   ├── Provider Registry (6 cloud + local)
    │   ├── Semantic Cache (Redis, <1ms hit)
    │   ├── Circuit Breaker (3-state, exponential backoff)
    │   ├── Rate Limiter (token bucket, Redis + Lua)
    │   ├── Quality Scorer (5 metrics, every response)
    │   ├── Trace Store (5K rolling window)
    │   ├── RAG Pipeline (ingest + query)
    │   └── Web UI (chat, RAG, dashboard)
    ├── Prometheus (metrics)
    └── Grafana (dashboard)         :8080

Auto hardware detection:

Hardware	Backend	Why
NVIDIA GPU 16GB+	vLLM	PagedAttention, max throughput
NVIDIA GPU < 16GB	Ollama	Lower memory overhead
Apple Silicon	Ollama	Metal acceleration
CPU only	Ollama	GGUF quantized models

CLI Reference

Command	Description
`llmstack quickstart`	Zero-key local AI in ~30s: size a model to your hardware, pull it, prove a real completion
`llmstack ready [--json]`	Fast first-run readiness check; exits non-zero when not ready (CI/script friendly)
`llmstack ask <question> [path]`	Ask questions about local files (persistent index, hybrid search)
`llmstack ask -i [path]`	Interactive conversation with your codebase
`llmstack init [--preset]`	Create config (presets: chat, rag, agent)
`llmstack up`	Start all services
`llmstack down`	Stop all services
`llmstack status`	Health check
`llmstack chat`	Interactive terminal chat
`llmstack agent <task>`	Run an AI agent with tools
`llmstack mcp`	Start MCP server for AI clients
`llmstack finetune <data>`	Fine-tune a model on your data
`llmstack eval`	Evaluate model quality
`llmstack bench`	Benchmark routing performance
`llmstack benchmark`	Reproducible cost + latency + zero-egress report vs cloud
`llmstack savings`	Show money saved running locally vs paid alternatives
`llmstack export`	Generate docker-compose.yml
`llmstack logs <service>`	Stream service logs
`llmstack doctor`	Diagnose system issues
`llmstack verify-private`	Audit config for any external data egress

API Endpoints

Endpoint	Description
`POST /v1/chat/completions`	Chat (auto-routed across providers)
`POST /v1/embeddings`	Text embeddings
`GET /v1/models`	List models from all providers (with pricing)
`GET /v1/onboarding`	First-run readiness: recommended models, what's installed, next steps
`POST /v1/rag/ingest`	Ingest documents for RAG
`POST /v1/rag/query`	RAG query with citations
`GET /v1/observe/traces`	Request traces with quality scores
`GET /v1/observe/quality`	Quality summary with drift detection
`GET /v1/observe/alerts`	Active quality alerts
`POST /v1/observe/ab-test`	Create A/B test
`GET /v1/observe/ab-test/{name}`	A/B test results
`GET /healthz`	System health
`GET /metrics`	Prometheus metrics

Comparison

Codebase Q&A

	llmstack ask	Cursor	Aider	Khoj	Simon's llm
AST code chunking	Yes	Yes	Partial	No	No
Hybrid search (BM25 + vector)	Yes	?	No	No	No
Persistent incremental index	Yes	Yes	No	Yes	Manual
Git-aware context	Yes	Yes	Yes	No	No
Interactive conversation	Yes	Yes	Yes	Yes	No
20+ file types	Yes	No	No	Yes	No
100% local + free	Yes	No	No	Yes	Yes
Zero config CLI	Yes	No	No	No	Yes

LLM Platform

	llmstack	Ollama	LiteLLM	LocalAI	LangSmith
Multi-provider gateway	Yes	-	Yes	-	-
Smart cost-aware routing	Yes	-	-	-	-
Fallback chains	Yes	-	Yes	-	-
AI quality scoring	Yes	-	-	-	Yes
Drift detection + alerts	Yes	-	-	-	Yes
A/B testing	Yes	-	-	-	Yes
One-command fine-tuning	Yes	-	-	-	-
AI agents with tools	Yes	-	-	-	-
MCP server	Yes	-	-	-	-
Local inference	Yes	Yes	-	Yes	-
Self-hosted / free	Yes	Yes	Partial	Yes	Paid

Configuration

# llmstack.yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto
  embeddings:
    name: bge-m3

providers:
  enabled: true
  strategy: cost
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local

observe:
  quality_tracking: true
  alert_threshold: 0.4
  drift_threshold: -0.1

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min

Requirements

Python 3.11+
llmstack ask: Ollama running locally. No Docker needed.
Full stack (llmstack up): Docker
Fine-tuning: pip install llmstack-cli[finetune] (adds PyTorch, PEFT, TRL)

Contributing

See CONTRIBUTING.md for development setup. PRs welcome.

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 696 Commits
.github		.github
assets		assets
deploy		deploy
docs		docs
editors/vscode		editors/vscode
examples		examples
packaging/homebrew		packaging/homebrew
sdks/typescript		sdks/typescript
src/llmstack		src/llmstack
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
ADOPTION_PLAN.md		ADOPTION_PLAN.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
PRODUCT_STRATEGY.md		PRODUCT_STRATEGY.md
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
VALUE_PROOF_PLAN.md		VALUE_PROOF_PLAN.md
demo.gif		demo.gif
demo.tape		demo.tape
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.gpu.yml		docker-compose.gpu.yml
docker-compose.yml		docker-compose.yml
install.sh		install.sh
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

llmstack

Ask Your Codebase Anything

Why this is better than Cursor/Copilot/Aider

Key features

Quick Start

Install options

Universal Gateway

Smart Routing

AI Agents & MCP

Agents with tool use

MCP Server

Fine-tuning

AI Observability

Quality scoring (every response, < 1ms)

Drift detection & alerts

A/B testing

Request tracing

Prove the Value — Savings & Reproducible Benchmarks

Savings: the cloud bill you didn't pay

Benchmarks: cost + latency + privacy, reproducibly

More about llmstack ask

Full Stack Architecture

CLI Reference

API Endpoints

Comparison

Codebase Q&A

LLM Platform

Configuration

Requirements

Contributing

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

More about `llmstack ask`

Packages