
ModelMetadata refactor: declarative struct, no Option<>, adapter queries its own source #917

@joelteply

Background

Joel called out a pattern in our inference layer (2026-04-17 chat session):

"context window limits are defined BY the model as are features such as audio or vision. this is probably why those attempts also failed"
"if you need a var require it / pass the entire struct around / has all the info you need / or grab it"
"make it declarative"

Today we have parallel model-info plumbing:

  • system/shared/ModelContextWindows.ts — TS lookup tables (getContextWindow, getInferenceSpeed, isSlowLocalModel, getLatencyAwareTokenLimit)
  • workers/continuum-core/src/ai/types.rs::ModelInfo — Rust struct with Option<> on max_output_tokens and cost_per_1k_tokens
  • 21 hardcoded ModelInfo {…} constructions across openai_adapter.rs, candle_adapter.rs, anthropic_adapter.rs, embedding.rs — each adapter maintains a static catalog instead of querying its source
  • system/core/src/models/mod.rs — yet another parallel Option<u32> max_output_tokens definition
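
Simplified, the duplicated shape looks like the sketch below; the real definitions live in the files above, and the cost field's type here is an assumption for illustration:

// Today, roughly: optional fields let "unknown" leak all the way downstream.
pub struct ModelInfo {
    pub id: String,
    pub context_window: u32,
    pub max_output_tokens: Option<u32>,   // the Option<> the acceptance grep targets
    pub cost_per_1k_tokens: Option<f32>,  // type simplified for illustration
}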

Symptoms this caused, visible in today's M5 verification of PR #914:

  • ChatRAGBuilder computed totalBudget = floor(contextWindow × 0.75), which for Qwen3.5-4b's 262k window comes to ~196k tokens. RAG actually filled only ~14k tokens per request, yet llama-server allocated the full 262k KV cache per persona slot: com.docker.llama-server sat at 20.87 GB resident on the M5, 44 GB total against 32 GB physical RAM = swap.
  • Vision/audio attempts fail silently whenever the hardcoded TS table claims a capability the actual model lacks.
  • getInferenceSpeed is a TS const; by construction it cannot reflect throughput measured at runtime.

Scope

Single coherent refactor, ~25 files, its own branch. Not to be sprinkled into other PRs.

1. ModelMetadata (replaces ModelInfo), all fields required

use serde::{Deserialize, Serialize};
use ts_rs::TS;

#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/ModelMetadata.ts")]
#[serde(rename_all = "camelCase")]
pub struct ModelMetadata {
    pub id: String,
    pub name: String,
    pub provider: String,
    pub capabilities: Vec<ModelCapability>,
    pub context_window: u32,
    pub max_output_tokens: u32,
    pub cost_per_1k_tokens: CostPer1kTokens,  // local = {0,0}
    pub tokens_per_second: f32,
    pub supports_streaming: bool,
    pub supports_tools: bool,
}

No Option<>. Local-cost = {0,0} is still a declaration, not an absence.
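
The struct references two types this issue doesn't pin down; here is a minimal sketch of one plausible shape for them (names, variants, and export paths are assumptions, mirroring the conventions above):

#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/ModelCapability.ts")]
#[serde(rename_all = "camelCase")]
pub enum ModelCapability {
    Text,
    Vision,
    Audio,
    Embedding,
}

#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/CostPer1kTokens.ts")]
#[serde(rename_all = "camelCase")]
pub struct CostPer1kTokens {
    pub input: f64,
    pub output: f64,
}

impl CostPer1kTokens {
    /// Local models declare zero cost explicitly; the field is never absent.
    pub const LOCAL: CostPer1kTokens = CostPer1kTokens { input: 0.0, output: 0.0 };
}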

2. Adapters query their source, not hardcoded vec![ModelInfo {…}]

  • DMR: GET http://localhost:12434/engines/v1/models returns the live catalog. docker model inspect <id> exposes GGUF metadata for fields the catalog doesn't.
  • OpenAI / Anthropic / DeepSeek / etc.: their /v1/models endpoint. Cache at adapter initialize().
  • Candle: GGUF metadata directly from the loaded file.

Delete the 21 hardcoded literals.
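
For the OpenAI-style providers, the catalog fetch could look roughly like this; reqwest/anyhow, the response shape, and the metadata_for helper are illustrative assumptions, not settled API:

use std::collections::HashMap;

/// Minimal shape of an OpenAI-style GET /v1/models response (assumed).
#[derive(serde::Deserialize)]
struct ModelsResponse { data: Vec<ModelEntry> }

#[derive(serde::Deserialize)]
struct ModelEntry { id: String }

/// Hypothetical helper: fills the remaining ModelMetadata fields from a
/// provider detail call or GGUF metadata (docker model inspect for DMR).
fn metadata_for(entry: ModelEntry) -> anyhow::Result<ModelMetadata> {
    todo!("provider-specific detail lookup")
}

/// Runs once in the adapter's initialize(); the catalog is cached, never rebuilt per request.
async fn fetch_catalog(base_url: &str, api_key: &str) -> anyhow::Result<HashMap<String, ModelMetadata>> {
    let resp: ModelsResponse = reqwest::Client::new()
        .get(format!("{base_url}/v1/models"))
        .bearer_auth(api_key)
        .send().await?
        .error_for_status()?
        .json().await?;
    resp.data.into_iter()
        .map(|entry| Ok((entry.id.clone(), metadata_for(entry)?)))
        .collect()
}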

3. AIProviderAdapter::model_metadata(model_id) returns the full struct

fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata>;  // None ONLY when not in adapter's live catalog
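
With the catalog cached at initialize(), the method becomes a plain lookup. A sketch, assuming the adapter holds the map built above (struct and field names are illustrative):

impl AIProviderAdapter for OpenAiAdapter {
    fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata> {
        // None means exactly one thing: this id is not in the adapter's live
        // catalog. No fallback tables, no partially-filled structs.
        self.catalog.get(model_id).cloned()
    }
    // …other trait methods unchanged
}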

4. Thread ModelMetadata through the chain

  • PersonaResponseGenerator receives ModelMetadata at request entry.
  • ChatRAGBuilder.buildContext(model: ModelMetadata, …) reads model.context_window, model.tokens_per_second, model.capabilities directly.
  • Vision attachment and tool injection are gated by model.capabilities and model.supports_tools.
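
The budget math that bit PR #914 then reads straight off the struct. A sketch in Rust of the intended clamp (ChatRAGBuilder would mirror this in TS); the 0.75 headroom factor and the latency parameter are illustrative values, not decided ones:

/// Per-request token budget: bounded by the context window AND by what the
/// model can actually push through within the latency target, so a 262k
/// window no longer translates into a 262k-token KV cache allocation.
fn token_budget(model: &ModelMetadata, max_latency_secs: f32) -> u32 {
    let window_budget = (model.context_window as f32 * 0.75) as u32;  // headroom factor: assumption
    let latency_budget = (model.tokens_per_second * max_latency_secs) as u32;
    window_budget.min(latency_budget)
}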

5. Delete the lookup-helper layer

  • system/shared/ModelContextWindows.ts — fully deletable.
  • system/core/src/models/mod.rs — collapse into ai/types.rs.

Acceptance

  • grep -r "Option<u32>" workers/continuum-core/src/ai/ returns zero hits.
  • grep -rn "ModelMetadata {" workers/continuum-core/src/ only matches ai/types.rs (the definition itself); none of the 21 hardcoded adapter literals survive.
  • system/shared/ModelContextWindows.ts deleted.
  • ChatRAGBuilder and PersonaResponseGenerator take ModelMetadata; never reconstruct it from loose strings.
  • Live test on M5: persona chat sends prompts that respect model.context_window AND the latency budget derived from model.tokens_per_second. KV cache pressure drops from 20+ GB into the single-digit-GB range.

Why separate

Touching 21+ adapter sites, the consumer chain, the IPC export, and the TS plumbing has to land atomically. Landing half of it sprinkled across other PRs would leave the codebase worse than it started.
