Background
Joel called out a pattern in our inference layer (2026-04-17 chat session):
"context window limits are defined BY the model as are features such as audio or vision. this is probably why those attempts also failed"
"if you need a var require it / pass the entire struct around / has all the info you need / or grab it"
"make it declarative"
Today we have parallel model-info plumbing:
- system/shared/ModelContextWindows.ts — TS lookup tables (getContextWindow, getInferenceSpeed, isSlowLocalModel, getLatencyAwareTokenLimit)
- workers/continuum-core/src/ai/types.rs::ModelInfo — Rust struct with Option<> on max_output_tokens and cost_per_1k_tokens
- 21 hardcoded ModelInfo {…} constructions across openai_adapter.rs, candle_adapter.rs, anthropic_adapter.rs, embedding.rs — each adapter maintains a static catalog instead of querying its source
- system/core/src/models/mod.rs — yet another parallel Option<u32> max_output_tokens definition
Symptoms this caused (visible on M5 PR #914 verification today):
- ChatRAGBuilder computed totalBudget = floor(contextWindow × 0.75). For Qwen3.5-4b's 262k window that is 196k tokens. RAG actually filled only ~14k per request, yet llama-server allocated the full 262k KV cache per persona slot → com.docker.llama-server at 20.87 GB resident on M5, 44 GB total vs 32 GB physical = swap.
- Vision/audio attempts have failed silently when the hardcoded TS table claims a model supports a capability the actual model doesn't.
- getInferenceSpeed is a TS const — fundamentally can't reflect what's measured at runtime.
Scope
Single coherent refactor, ~25 files, its own branch. Not to be sprinkled into other PRs.
1. ModelMetadata (replaces ModelInfo), all fields required
#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/ModelMetadata.ts")]
#[serde(rename_all = "camelCase")]
pub struct ModelMetadata {
pub id: String,
pub name: String,
pub provider: String,
pub capabilities: Vec<ModelCapability>,
pub context_window: u32,
pub max_output_tokens: u32,
pub cost_per_1k_tokens: CostPer1kTokens, // local = {0,0}
pub tokens_per_second: f32,
pub supports_streaming: bool,
pub supports_tools: bool,
}
No Option<>. Local-cost = {0,0} is still a declaration, not an absence.
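The struct leans on two supporting types this doc doesn't pin down. A minimal sketch of what they could look like, assuming ModelCapability stays a flat enum and CostPer1kTokens a two-field struct — the variant and field names here are illustrative, not decided:

#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/ModelCapability.ts")]
#[serde(rename_all = "camelCase")]
pub enum ModelCapability {
    Text,
    Vision,
    Audio,
    Embedding,
}

#[derive(Debug, Clone, Serialize, Deserialize, TS)]
#[ts(export, export_to = "../../../shared/generated/ai/CostPer1kTokens.ts")]
#[serde(rename_all = "camelCase")]
pub struct CostPer1kTokens {
    pub input: f32,  // USD per 1k input tokens; 0.0 for local models
    pub output: f32, // USD per 1k output tokens; 0.0 for local models
}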
2. Adapters query their source, not hardcoded vec![ModelInfo {…}]
- DMR: GET http://localhost:12434/engines/v1/models returns the live catalog; docker model inspect <id> exposes GGUF metadata for fields the catalog doesn't. (See the sketch after this list.)
- OpenAI / Anthropic / DeepSeek / etc.: their /v1/models endpoint. Cache at adapter initialize().
- Candle: GGUF metadata directly from the loaded file.
Delete the 21 hardcoded literals.
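A minimal sketch of the DMR side at initialize(), assuming a reqwest-based client (with the json feature) and the OpenAI-compatible list shape ({ "data": [ { "id": … } ] }). The local types and the function name are illustrative; the mapping from catalog entry to ModelMetadata is elided because it depends on what docker model inspect exposes:

use std::collections::HashMap;
use serde::Deserialize;

/// Wire format of GET /engines/v1/models (OpenAI-compatible list response).
#[derive(Debug, Deserialize)]
struct ModelList {
    data: Vec<ModelEntry>,
}

#[derive(Debug, Deserialize)]
struct ModelEntry {
    id: String,
}

/// Fetch the live DMR catalog once at adapter initialize() and cache it,
/// instead of shipping a hardcoded vec![ModelInfo {…}].
async fn fetch_dmr_catalog(
    client: &reqwest::Client,
) -> Result<HashMap<String, ModelEntry>, reqwest::Error> {
    let list: ModelList = client
        .get("http://localhost:12434/engines/v1/models")
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;

    // Keyed by model id so model_metadata(model_id) is a plain map lookup later.
    Ok(list.data.into_iter().map(|m| (m.id.clone(), m)).collect())
}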
3. AIProviderAdapter::model_metadata(model_id) returns the full struct
fn model_metadata(&self, model_id: &str) -> Option<ModelMetadata>; // None ONLY when not in adapter's live catalog
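At the call site, a None should surface as an explicit error, never a silent fallback to a guessed window. A sketch of that contract — AIError::UnknownModel and the helper name are hypothetical:

/// Hypothetical orchestration-layer helper: resolve metadata up front, fail loudly on a miss.
fn resolve_model(
    adapter: &dyn AIProviderAdapter,
    model_id: &str,
) -> Result<ModelMetadata, AIError> {
    adapter
        .model_metadata(model_id)
        // Not in the adapter's live catalog → explicit error, no default context window.
        .ok_or_else(|| AIError::UnknownModel(model_id.to_string()))
}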
4. Thread ModelMetadata through the chain
- PersonaResponseGenerator receives ModelMetadata at request entry.
- ChatRAGBuilder.buildContext(model: ModelMetadata, …) reads model.context_window, model.tokens_per_second, model.capabilities directly — see the budget sketch after this list.
- Vision attachment, tool injection — gated by model.capabilities and model.supports_tools.
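A sketch of the kind of derivation this threading enables (shown in Rust; the TS side mirrors it through the generated type). The function names, the latency parameter, and the exact formula are illustrative, not decided here:

/// Derive a prompt-token budget for one request from declared model metadata.
/// Illustrative only: the window share and latency target are placeholders.
fn prompt_token_budget(model: &ModelMetadata, max_latency_secs: f32) -> u32 {
    // Upper bound from the declared context window, leaving room for output tokens.
    let window_budget = model.context_window.saturating_sub(model.max_output_tokens);

    // Upper bound from how many tokens this model can process within the
    // latency target, using the declared tokens_per_second.
    let latency_budget = (model.tokens_per_second * max_latency_secs) as u32;

    window_budget.min(latency_budget)
}

/// Capability gating: attach images only when the model declares Vision.
fn supports_vision(model: &ModelMetadata) -> bool {
    model.capabilities.iter().any(|c| matches!(c, ModelCapability::Vision))
}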
5. Delete the lookup-helper layer
- system/shared/ModelContextWindows.ts — fully deletable.
- system/core/src/models/mod.rs — collapse into ai/types.rs.
Acceptance
grep -r "Option<u32>" workers/continuum-core/src/ai/ returns zero hits.
grep -rn "ModelInfo {" workers/continuum-core/src/ only matches ai/types.rs (the definition itself).
system/shared/ModelContextWindows.ts deleted.
ChatRAGBuilder and PersonaResponseGenerator take ModelMetadata; never reconstruct it from loose strings.
- Live test on M5: persona chat sends prompts that respect
model.context_window AND the latency budget derived from model.tokens_per_second. KV cache pressure drops from 20+ GB to single-GB range.
Why separate
Touching 21+ adapter sites + consumer chain + IPC export + TS plumbing has to land atomically. Half of it sprinkled into other PRs leaves the codebase worse than it started.