Skip to content

feat(inference): per-request VLM model override#64

Merged
Liuhaai merged 1 commit into
mainfrom
feat/per-request-vlm-model-override
May 7, 2026
Merged

feat(inference): per-request VLM model override#64
Liuhaai merged 1 commit into
mainfrom
feat/per-request-vlm-model-override

Conversation

@Liuhaai

@Liuhaai Liuhaai commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds an optional model field to DescribeRequest and CropDescribeRequest so callers can pick a different upstream model per request without restarting the server. The cortex client uses this to route VLM scene-describe and segmentation ground-region requests to different upstream Qwen models from a single trio-core instance (e.g. qwen3-vl-flash for scenes, qwen3.6-plus for segmentation).

Type of Change

  • New feature (non-breaking change that adds functionality)

Changes

  • DescribeRequest / CropDescribeRequest: add optional model: str | None (default None)
  • /api/inference/describe and /api/inference/crop-describe handlers: forward req.model into engine.analyze_frame(...)
  • TrioCore.analyze_video / analyze_frame: accept and forward model
  • BaseBackend.generate / stream_generate: gain optional model kwarg
  • RemoteHTTPBackend: uses model or self._remote_model in the upstream chat.completions.create call; log line includes the effective model
  • Local backends (transformers, mlx, tome, compressed): accept the kwarg and call BaseBackend._warn_model_override_once(model) — one warning per backend instance, then ignore (local backends can't swap models per-request)

When the request omits model, behavior is identical to today.

Test Plan

  • All existing tests pass (python -m pytest tests/ -v for backends/engine/config/inference_router → 60 passed)
  • New tests added for new functionality
    • test_generate_uses_per_request_model_override — asserts the OpenAI client receives the per-request model
    • test_generate_falls_back_to_configured_model — asserts fallback to _remote_model when override is None
  • Regression test passes (python examples/run_regression.py) — not run; this PR doesn't change inference math (only routes the model name through), and remote inference depends on a live DashScope endpoint
  • Manual sanity: targeted tests cover the changed code path; cortex-side caller wiring in a separate PR has been validated end-to-end against the new request schema

Checklist

  • My code follows the project's code style (type hints, no unnecessary abstractions)
  • I have not introduced any new AGPL-licensed dependencies
  • This PR contains no breaking API changes — model is optional with default None and existing callers are unaffected
  • I have not committed secrets, API keys, or credentials

🤖 Generated with Claude Code

Adds an optional `model` field to `DescribeRequest` and `CropDescribeRequest`
so callers can pick a different upstream model per request without a
server reload. Threaded through `TrioCore.analyze_video/analyze_frame`
and `BaseBackend.generate/stream_generate`.

`RemoteHTTPBackend` honors the override on the OpenAI-compatible
chat.completions call (falls back to the configured `remote_vlm_model`
when omitted). Local backends (transformers, mlx, tome, compressed)
log a one-shot warning and ignore the override since they cannot swap
models per request — `BaseBackend._warn_model_override_once` centralizes
this.

Use case: route VLM scene-describe and segmentation ground-region
requests to different upstream Qwen models from a single trio-core
instance.

Tests: two new RemoteHTTPBackend tests cover the override path and the
fallback to the configured default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Liuhaai Liuhaai merged commit 5c9e0d3 into main May 7, 2026
7 checks passed
@Liuhaai Liuhaai deleted the feat/per-request-vlm-model-override branch May 7, 2026 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant