
Add local model support via Ollama #44

Open
lmassaron wants to merge 1 commit into huggingface:main from lmassaron:ollama-local

Conversation

@lmassaron

Summary

This PR adds support for running the agent with local models via Ollama. It enables users to use the intern without incurring API costs or requiring external API keys for inference.

Key Changes

  • Ollama Integration: Integrated local model support using Ollama's OpenAI-compatible endpoint for improved
    tool-calling reliability.
  • Readiness Checks: The agent now automatically verifies that ollama serve is running and that the requested
    model is available locally (see the sketch after this list).
  • Interactive Model Management:
    • Added /add-model <id> command to dynamically add new local models during a session.
    • Added an interactive prompt to pull missing models directly from the CLI.
  • Optional Authentication: Modified initialization to make the HF_TOKEN optional when using local models (prefixed
    with ollama/).
  • Robust Initialization: Improved the startup loop to handle local environment failures without hanging.
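
As a rough sketch of the readiness check described above (the function name, helper structure, and default base URL are illustrative assumptions, not the exact implementation), the agent can probe Ollama's /api/tags endpoint to confirm the server is up and the model is pulled:

import httpx

OLLAMA_BASE_URL = "http://localhost:11434"  # Ollama's default local address

async def ensure_ollama_readiness(model: str) -> bool:
    """Check that `ollama serve` is reachable and that `model` is pulled locally."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            # GET /api/tags lists the models available on the local Ollama server
            resp = await client.get(f"{OLLAMA_BASE_URL}/api/tags")
            resp.raise_for_status()
            tags = resp.json().get("models", [])
    except httpx.HTTPError:
        return False  # server not running or unreachable
    local = {m["name"] for m in tags}
    # Ollama tag names usually carry a ":latest" suffix, so match the bare name too
    return model in local or f"{model}:latest" in local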

How to test

  1. Ensure Ollama is running (ollama serve).
  2. Run the agent: ml-intern.
  3. Switch to a local model: /model ollama/qwen3.6 (or use /add-model).

@lmassaron
Author

lmassaron commented Apr 24, 2026

Thank you @statikkkkk for the review. I have addressed the security concerns and the documentation feedback. I also took the opportunity to reapply these features on top of the latest upstream main to align with the recent architectural refactoring.

  1. SSRF Mitigation & Security
    I have implemented a dual-layered defense to ensure local inference remains strictly local and cannot be abused
    in hosted environments:
  • Hostname Validation: In agent/core/llm_params.py, I added strict validation for OLLAMA_API_BASE. It is now
    restricted exclusively to localhost or 127.0.0.1. Any other hostname raises a ValueError, preventing the
    agent from being used to probe internal network services (see the sketch after this list).
  • Backend Guard: In backend/routes/agent.py, I added a security guard to the /api/config/model endpoint. It explicitly rejects any model ID starting with ollama/. This makes Ollama support a CLI-only feature,
    ensuring it cannot be reached by remote users via the web API.
  2. Context Window Documentation
  • Default Fallback Note: Added a documentation note in agent/core/session.py clarifying that while
    ollama/qwen3.6 has a specific 128k override, all other Ollama models fall back to the project's standard
    200k default context window.
  3. Alignment with Latest Refactor
  • Architecture Update: I have adapted the Ollama logic to the new upstream structure. This includes
    integrating readiness checks into the new agent/core/model_switcher.py module and updating the asynchronous
    interactive loop in agent/main.py.
  • Feature Persistence: Verified that /add-model, automated model pulling, and server reachability checks are
    all fully functional within the new async architecture.
  4. Verification Results
    I have run a full regression test suite:
  • Headless/CLI: Confirmed local reasoning and tool-calling with ollama/qwen3.6.
  • Security: Verified via script that the backend API correctly rejects local model requests with a 400 Bad Request.
  • Robustness: Confirmed that the agent handles unreachable Ollama servers gracefully with clear error
    messages.
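
A minimal sketch of the hostname validation described in item 1 (the function and default values here are assumptions; the actual code lives in agent/core/llm_params.py):

import os
from urllib.parse import urlparse

def _validated_ollama_api_base() -> str:
    """Restrict OLLAMA_API_BASE to loopback addresses so local inference stays local."""
    # Ollama's OpenAI-compatible endpoint defaults to port 11434 under /v1
    base = os.environ.get("OLLAMA_API_BASE", "http://localhost:11434/v1")
    host = urlparse(base).hostname
    if host not in ("localhost", "127.0.0.1"):
        raise ValueError(
            f"OLLAMA_API_BASE must point at localhost or 127.0.0.1, got {host!r}"
        )
    return base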

The branch has been force-pushed with a clean, rebased history.

@akseljoonas
Collaborator

@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

@claude

claude Bot commented Apr 24, 2026

Claude finished @akseljoonas's task in 2m 24s


PR Review: Add local model support via Ollama

  • Post initial comment
  • Read CLAUDE.md and repo structure
  • Read all changed files
  • Analyze security, correctness, and routing impact
  • Post final review

Verdict: NEEDS CHANGES — The SSRF mitigation in _resolve_llm_params is solid, but the backend guard has a meaningful gap, there are two correctness bugs in ollama_utils.py, and blocking I/O in an async function will stall the event loop during model pulls.


Security

[MEDIUM] Backend guard is incomplete — ollama/ models are still reachable via session creation

ollama/qwen3.6 is listed in AVAILABLE_MODELS (backend/routes/agent.py:68-74), so the valid_ids set in POST /api/session and POST /api/session/{id}/model includes it. The ollama/ rejection guard was added only to POST /api/config/model (line 234). A web user can bypass it:

# POST /api/session  — no ollama guard, ollama/qwen3.6 is in valid_ids
valid_ids = {m["id"] for m in AVAILABLE_MODELS}   # ← includes ollama/qwen3.6
if model and model not in valid_ids:              # passes if model == "ollama/qwen3.6"
    raise HTTPException(...)

Same gap in POST /api/session/{session_id}/model (lines 444-447). Either remove ollama/qwen3.6 from AVAILABLE_MODELS entirely (the cleanest fix, since it's CLI-only by design), or add the same guard to both session endpoints. Fix this →
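
A shared guard along these lines (a FastAPI-style sketch; the helper name and error text are assumptions) would close the gap in both session endpoints:

from fastapi import HTTPException

def reject_local_only_models(model: str | None) -> None:
    """Local Ollama models are CLI-only and must not be selectable via the web API."""
    if model and model.startswith("ollama/"):
        raise HTTPException(
            status_code=400,
            detail="Local Ollama models are only available from the CLI.",
        )

# Call this at the top of POST /api/session and POST /api/session/{session_id}/model,
# before the valid_ids membership check.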

[LOW] ollama_utils.py doesn't validate OLLAMA_API_BASE hostname

get_ollama_base_url() reads the env var without checking it's localhost. This means is_ollama_running(), is_model_available(), and pull_ollama_model() can all send requests to an arbitrary URL before _resolve_llm_params ever gets to apply its validation. In a pure CLI context the user controls their env, but it's inconsistent with the protection added in llm_params.py. Fix this →


Correctness

[HIGH] Blocking I/O in async pull_ollama_model will freeze the event loop

pull_ollama_model is async def but uses synchronous requests.post() with streaming (for line in response.iter_lines()). This blocks the event loop for the entire duration of a model pull (potentially minutes for large models). is_ollama_running() and is_model_available() have the same issue — they're called from ensure_ollama_readiness which is awaited directly. These should use asyncio.to_thread or be replaced with httpx.AsyncClient. Fix this →
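
A non-blocking variant could look roughly like this (a sketch assuming httpx and Ollama's documented /api/pull streaming API; the base-URL constant and progress printing are illustrative):

import json
import httpx

OLLAMA_BASE_URL = "http://localhost:11434"  # assumed default; validate before use

async def pull_ollama_model(model: str) -> None:
    """Stream a model pull from Ollama without blocking the event loop."""
    async with httpx.AsyncClient(timeout=None) as client:
        # POST /api/pull streams newline-delimited JSON status objects
        async with client.stream(
            "POST", f"{OLLAMA_BASE_URL}/api/pull", json={"model": model}
        ) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line:
                    continue
                data = json.loads(line)
                completed, total = data.get("completed"), data.get("total")
                # guard: some status messages carry total without completed
                if total and completed is not None:
                    print(f"{data.get('status', '')}: {completed / total:.0%}")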

[MEDIUM] None-divide crash in progress display (ollama_utils.py:58-65)

completed = data.get("completed")   # can be None
total = data.get("total")
if status != last_status:
    if total:
        percent = (completed / total) * 100   # ← TypeError if completed is None

Ollama emits total without completed in some status messages (e.g. manifest download). Guard it: if total and completed is not None. Fix this →

[LOW] Unrendered Rich markup in ensure_ollama_readiness (ollama_utils.py:82-83)

print(f"\n[bold red]Error:[/bold red] Ollama server is not reachable.")

print() doesn't process Rich markup — this will render literally as [bold red]Error:[/bold red] in the terminal. Use get_console().print(...) from terminal_display or strip the markup. Fix this →
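
The markup-aware fix is a one-line change; a standalone sketch using Rich directly (in the repo this would go through terminal_display's get_console() instead of a fresh Console):

from rich.console import Console

# Console.print understands Rich markup, unlike the built-in print()
Console().print("\n[bold red]Error:[/bold red] Ollama server is not reachable.")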


LiteLLM / Bedrock routing impact

No regressions. The ollama/ branch is inserted first in _resolve_llm_params and has a clean prefix that doesn't collide with bedrock/, anthropic/, openai/, or the HF fallback. The model is forwarded as openai/<name> with a custom api_base, which is the standard LiteLLM pattern. Bedrock routing is untouched.

The _MAX_TOKENS_MAP in session.py is checked before litellm.get_model_info, so the 128k override for qwen3.6 works correctly without interfering with the litellm catalog for other models.
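
For reference, the effective LiteLLM call for an ollama/-prefixed model would look roughly like this (a sketch; the exact parameter plumbing in _resolve_llm_params may differ):

import litellm

# "ollama/qwen3.6" is rewritten to LiteLLM's openai/ provider with a local api_base,
# which targets Ollama's OpenAI-compatible endpoint.
response = litellm.completion(
    model="openai/qwen3.6",
    api_base="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the OpenAI-style client expects one
    messages=[{"role": "user", "content": "Hello"}],
)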


Minor nits

  • headless_main imports PromptSession from prompt_toolkit inside the function (main.py:1107) but never uses it — the headless Ollama check is non-interactive. Dead import.
  • /add-model always prefixes ollama/ (main.py:765) — intentional, but the help text (/add-model <id>) doesn't mention this. Worth documenting.

Summary

The core feature is well-structured and the SSRF concern from the first review was thoughtfully addressed in _resolve_llm_params. The two blocking issues for merge are: (1) the incomplete backend guard that lets web users set an Ollama model via session creation, and (2) the blocking I/O in pull_ollama_model that will lock the async event loop. The completed-is-None crash in the progress display and the unrendered Rich markup are smaller but will produce observable bugs in practice.

@lmassaron
Author

lmassaron commented Apr 24, 2026

I have addressed all the security and correctness concerns in the latest
commits:

  1. Robust Backend Guarding
  • Listing Removal: I removed ollama/qwen3.6 from AVAILABLE_MODELS in the backend route entirely.
  • Entry-Point Guards: Added explicit startswith('ollama/') checks to POST /api/session, POST /api/session/restore-summary, and POST /api/session/{session_id}/model. These local models are now strictly unreachable via the web API.
  2. Async I/O & Correctness
  • Non-blocking Network: Refactored ollama_utils.py to use httpx.AsyncClient. All Ollama interactions (status
    checks, tag listing, and streaming pulls) are now fully asynchronous and won't block the event loop.
  • Progress Display Fix: Added a guard for None values in the pull progress logic to prevent TypeError crashes
    during progress calculation.
  • Rich Rendering: Switched to get_console().print() in the utility module so that Rich formatting is correctly
    rendered in the terminal.
  3. Security & Consistency
  • Hostname Validation: Added hostname checks to get_ollama_base_url() consistent with _resolve_llm_params, restricting it to loopback addresses.
  • CLI Consistency: Fixed a bug where the --model flag was ignored in interactive mode. It now correctly overrides
    the default model for both headless and interactive use.
  • Documentation: Updated the /help command text to explicitly mention the ollama/ prefix behavior.

I have verified these fixes with a regression test suite covering headless execution, interactive-mode overrides, and the backend guards (exercised via curl). The branch has been force-pushed with the updated history.

lmassaron force-pushed the ollama-local branch 3 times, most recently from a21980d to 48e4bb9 on April 27, 2026 at 15:37