Summary
Ollama is the most popular self-hosted LLM runtime (30K+ GitHub stars, Windows/Linux/macOS). Users with powerful local GPUs (e.g. RX 7900 XTX, RTX 4090) want to run everything locally — embeddings AND the LLM for compression/summarization. Currently there's no path to use Ollama as a compression/summarization provider without running a separate translation proxy (LiteLLM).
Current state
- Embeddings already work with Ollama via EMBEDDING_PROVIDER=openai + OPENAI_BASE_URL: Ollama speaks the OpenAI embeddings API natively
- LLM compression/summarization requires the Anthropic protocol; Ollama does not speak it
- ANTHROPIC_BASE_URL exists and supports proxies, but adding LiteLLM as another container adds ~200MB and operational complexity
- No Ollama issues have been filed — this is a gap
Proposed solution (simplest path)
Add an Ollama provider directly to src/config.ts that uses raw fetch() (no SDK dependency) to call POST http://host:11434/api/chat with an Ollama-compatible payload. The MiniMax provider already follows this pattern: it uses raw fetch to avoid the SDK's stainless headers.
OLLAMA_BASE_URL=http://host.docker.internal:11434 # default: http://localhost:11434
OLLAMA_MODEL=qwen3:14b # default: llama3.2
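A minimal sketch of what the raw-fetch call could look like, assuming the two environment variables above and Ollama's non-streaming /api/chat request shape (model, messages, stream: false). The helper names (buildOllamaChatRequest, extractOllamaText, ollamaChat) and the FetchLike type are hypothetical and not taken from the agentmemory source. The fetch implementation is passed in so the sketch does not depend on any ambient typings.

```typescript
// Sketch only: the helper names below are hypothetical, not agentmemory's API.

type Env = Record<string, string | undefined>;

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal structural type so the global fetch (or a test stub) can be injected.
type FetchLike = (
  url: string,
  init: HttpRequest,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const DEFAULT_BASE_URL = "http://localhost:11434";
const DEFAULT_MODEL = "llama3.2";

// Builds the URL and request options for POST /api/chat from env settings.
function buildOllamaChatRequest(
  env: Env,
  system: string,
  prompt: string,
): { url: string; init: HttpRequest } {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const baseUrl = (env.OLLAMA_BASE_URL || DEFAULT_BASE_URL).replace(/\/+$/, "");
  const messages: ChatMessage[] = [
    { role: "system", content: system },
    { role: "user", content: prompt },
  ];
  return {
    url: `${baseUrl}/api/chat`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks Ollama for a single JSON object instead of chunks.
      body: JSON.stringify({
        model: env.OLLAMA_MODEL || DEFAULT_MODEL,
        messages,
        stream: false,
      }),
    },
  };
}

// Pulls the assistant text out of a non-streaming /api/chat response.
function extractOllamaText(payload: unknown): string {
  const content = (payload as { message?: { content?: unknown } } | null)
    ?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Unexpected Ollama /api/chat response shape");
  }
  return content;
}

// Sends the request and returns the generated text, or throws on HTTP errors.
async function ollamaChat(
  fetchImpl: FetchLike,
  env: Env,
  system: string,
  prompt: string,
): Promise<string> {
  const { url, init } = buildOllamaChatRequest(env, system, prompt);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Ollama request failed with HTTP ${res.status}`);
  return extractOllamaText(await res.json());
}
```

In Node 18+ this could be called as ollamaChat(fetch, process.env, "You summarize text.", text). Splitting request building from sending keeps the payload logic testable without a running Ollama server.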
Provider detection in detectProvider():
- Check OLLAMA_BASE_URL or presence of Ollama on localhost
- Return provider: "ollama" with raw-fetch-based chat completion
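The two detection steps above might be sketched as follows. This is an illustration under assumptions, not the existing detectProvider() code: the names detectOllama, isOllamaReachable, OllamaProvider, and ProbeFetch are hypothetical, and the localhost probe uses Ollama's GET /api/tags endpoint (which lists installed models) as a cheap liveness check. Where this branch should sit relative to the other providers is left open.

```typescript
// Sketch only: names and shapes are hypothetical, not agentmemory's actual code.

type Env = Record<string, string | undefined>;

interface OllamaProvider {
  provider: "ollama";
  baseUrl: string;
  model: string;
}

// Minimal structural type so the global fetch (or a stub) can be injected.
type ProbeFetch = (url: string) => Promise<{ ok: boolean }>;

const LOCAL_OLLAMA_URL = "http://localhost:11434";

// Resolves to true when an Ollama server answers at baseUrl.
// GET /api/tags lists locally installed models and needs no request body.
async function isOllamaReachable(
  fetchImpl: ProbeFetch,
  baseUrl: string,
): Promise<boolean> {
  try {
    const res = await fetchImpl(`${baseUrl}/api/tags`);
    return res.ok;
  } catch {
    return false; // connection refused, DNS failure, and similar
  }
}

// Pure decision step: an explicit OLLAMA_BASE_URL wins, otherwise fall back
// to a local server if one was found; return null when neither applies.
function detectOllama(env: Env, localReachable: boolean): OllamaProvider | null {
  const explicit = env.OLLAMA_BASE_URL?.trim();
  if (!explicit && !localReachable) return null;
  return {
    provider: "ollama",
    baseUrl: (explicit || LOCAL_OLLAMA_URL).replace(/\/+$/, ""),
    model: env.OLLAMA_MODEL || "llama3.2",
  };
}
```

A caller could run isOllamaReachable(fetch, "http://localhost:11434") once at startup and pass the result into detectOllama(process.env, reachable). Keeping the decision step synchronous and free of I/O makes it straightforward to unit test.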
Why this is the right approach
- Same pattern as MiniMax: MiniMax already uses provider: "minimax" with raw fetch for Anthropic-compatible APIs. Ollama would follow the same raw-fetch pattern against its own /api/chat endpoint.
- Zero new dependencies: just fetch() calls to the Ollama /api/chat endpoint
- Completes the self-hosted story: users can already self-host embeddings with EMBEDDING_PROVIDER=openai → Ollama. Adding the LLM provider closes the loop.
- No proxy needed: eliminates the LiteLLM requirement entirely
Environment
- OS: Windows 11
- Hardware: Ryzen 9 5950X, RX 7900 XTX (24GB)
- agentmemory: v0.9.4 (Docker, iii-engine 0.11.6)
- Ollama: v0.23.1 with nomic-embed-text (embeddings working)