fix(llm): forward num_ctx to Ollama (local-model context truncation) #22
Merged
…truncated Ollama defaults num_ctx to 2048; Perspicacite never set it, so RAG synthesis prompts (assembled up to ~context.max_tokens) overflowed the local window and Mistral/Llama produced empty output. Add LLMConfig.ollama_num_ctx (default 8192) and forward it via the LiteLLM completion call for the ollama provider only (no-op for other providers). Hermetic test + config.example.yml doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Problem
Local Ollama models (e.g. Mistral) silently fail to produce output on long RAG queries. Root cause: Ollama defaults num_ctx to 2048 tokens, and Perspicacité never sets it. RAG synthesis prompts are assembled up to context.max_tokens (default 8000), far past 2048, so the local model only sees a truncated tail and returns empty/garbage. (Reported by a user whose Ollama-Mistral synthesis stage produced nothing on multi-chunk answers.)
Fix
Add LLMConfig.ollama_num_ctx (default 8192, documented in config.example.yml).
AsyncLLMClient._provider_extra_params(provider) returns {"num_ctx": ...} for the ollama provider only, merged into the LiteLLM completion call (both non-streaming and streaming paths). No-op for every other provider.
A larger num_ctx uses more RAM; users can tune it down.
Why this is the right layer
LiteLLM forwards num_ctx to Ollama's options.num_ctx. Setting it from config means local users get a usable window out of the box instead of the silent 2048 default. It does not change behaviour for API providers.
Test Plan
tests/unit/test_ollama_num_ctx.py: config default/override, num_ctx forwarded for ollama, empty for openai/anthropic/deepseek/minimax (5 passed, hermetic)
Branched off latest main (#18). Unrelated to the docling PR (#12).
🤖 Generated with Claude Code
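
For readers without the diff open, the forwarding described under Fix can be sketched roughly as below. This is not the merged code: LLMConfigSketch, provider_extra_params and build_completion_kwargs are hypothetical stand-ins for LLMConfig and AsyncLLMClient._provider_extra_params, and the "provider/model" model string is an assumption about how the client names models. Nothing here calls LiteLLM; it only shows how the extra kwargs might be assembled before the completion call.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LLMConfigSketch:
    """Hypothetical stand-in for LLMConfig; only the new field is modelled."""

    ollama_num_ctx: int = 8192  # default stated in the PR description


def provider_extra_params(provider: str, config: LLMConfigSketch) -> dict[str, Any]:
    """Return provider-specific kwargs to merge into the completion call.

    Only "ollama" receives num_ctx. Every other provider gets an empty
    dict, so merging the result is a no-op for them.
    """
    if provider == "ollama":
        return {"num_ctx": config.ollama_num_ctx}
    return {}


def build_completion_kwargs(
    provider: str,
    model: str,
    messages: list[dict[str, str]],
    config: LLMConfigSketch,
    stream: bool = False,
) -> dict[str, Any]:
    """Assemble kwargs for a completion call (streaming or not).

    The "provider/model" string is an assumed naming scheme for this sketch.
    """
    kwargs: dict[str, Any] = {
        "model": f"{provider}/{model}",
        "messages": messages,
        "stream": stream,
    }
    # Same merge on both paths, so streaming and non-streaming stay aligned.
    kwargs.update(provider_extra_params(provider, config))
    return kwargs
```

Keeping the provider check in one helper means the streaming and non-streaming paths cannot drift apart, and a user who is short on RAM only has to lower one config value.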