feat(middleware): Model routing, PII filtering, Cloud model proxies #9802
Open
richiejp wants to merge 38 commits into
Introduces core/services/routing/{contract,billing} as the foundation
for the routing module. The billing recorder is wired through the
existing UsageMiddleware and runs unconditionally — a no-auth single-
user box now records token usage under a synthetic "local" user, where
previously the middleware short-circuited on a nil auth DB and zero
stats were captured.
- StatsBackend interface with three impls (gorm, in-memory ring,
  disabled) selected at startup; Recorder fans out to backend + Prom
  counters from a single increment site so DB and metrics cannot
  diverge (see the sketch after this list).
- UsageRecord schema extended with RequestedModel/ServedModel,
Pre/PostFilterPromptTokens, pricing version, cost, and correlation/
router/PII foreign keys (all nullable; AutoMigrate handles existing
deployments).
- Synthetic LocalUser persisted to ${DataPath}/.local_user_id so usage
history aggregates across restarts in single-user mode.
- contract.Invariant emits localai_invariant_violation_total and panics
under -tags=routing_strict for nightly E2E surfacing.
- --disable-stats opt-out for ephemeral CI runs.
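A minimal sketch of the fan-out shape, assuming simplified names (the real StatsBackend, Recorder, and UsageRecord carry more fields and labels than shown here):

```go
package billing

import "github.com/prometheus/client_golang/prometheus"

// UsageRecord is a trimmed stand-in for the persisted schema.
type UsageRecord struct {
	UserID           string
	ServedModel      string
	PromptTokens     int
	CompletionTokens int
}

// StatsBackend abstracts the gorm / in-memory ring / disabled impls.
type StatsBackend interface {
	Append(rec UsageRecord) error
}

// Recorder is the single increment site: the backend row and the
// Prometheus counters are updated from one call, so they can't drift
// apart across call sites.
type Recorder struct {
	backend StatsBackend
	tokens  *prometheus.CounterVec // labels: model, kind
}

func (r *Recorder) Record(rec UsageRecord) {
	if err := r.backend.Append(rec); err != nil {
		// real code would tick an unrecorded-usage counter here
		return
	}
	r.tokens.WithLabelValues(rec.ServedModel, "prompt").Add(float64(rec.PromptTokens))
	r.tokens.WithLabelValues(rec.ServedModel, "completion").Add(float64(rec.CompletionTokens))
}
```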
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the billing recorder from the previous commit into user-facing
surfaces. Before this, the Recorder collected data but no endpoint
queried it without auth, the UI hid the Usage page in single-user mode,
and there was no MCP tool to read stats. After:
- New REST endpoints GET /api/usage and /api/usage/all that go through
  application.StatsRecorder() and fall back to the synthetic local user
  when auth is off. Old /api/auth/usage stays as the auth-only alias.
  Both new endpoints carry swagger annotations under the "usage" tag.
- Sidebar drops authOnly:true on the Usage entry; Usage.jsx picks the
  endpoint based on authEnabled and skips the empty-state-bail when
  auth is off.
- /api/instructions registry gains a "usage-and-billing" entry so
  agents discover the surface; the existing reachability test bumps to
  13 instructions and asserts the new name is present.
- New MCP tool get_usage_stats with read-only semantics, registered
  under the existing localaitools server. coverage_test.go
  ::TestToolHTTPRouteMappingComplete documents the route pairing;
  expectedFullCatalog and expectedReadOnlyCatalog include the tool.
  Both inproc and httpapi clients implement GetUsageStats; the inproc
  client picks up the StatsRecorder + FallbackUser at construction in
  application.go.
- Playwright e2e spec usage-dashboard.spec.js asserts (a) the Usage
  link is visible without auth, (b) the page renders /api/usage data
  without bailing, and (c) auth-on still routes to /api/auth/usage.
Verified end-to-end against tests/e2e-ui/ui-test-server:
/api/auth/status reports authEnabled:false, /api/usage returns the
local user with a stable UUID, /api/usage/all admits the local user as
admin.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Subsystem 3 of the routing module. The regex tier is the cheap,
deterministic layer; the encoder NER tier (TokenClassify gRPC) is
follow-up work.
Pattern set: email, phone, SSN, credit card with Luhn verification,
IPv4 (with octet bounds-check), and common API key prefixes (sk-,
pk-, xoxb-, ghp_, github_pat_). Each pattern has one of three
actions (a sketch follows this list):
- mask: replace the matched span with [REDACTED:<id>] before the
request reaches the backend. Default for everything except
api_key_prefix.
- block: short-circuit the request with HTTP 400 and a pii_blocked
error type. The matched value is never echoed back to the client.
Default for api_key_prefix — leaked credentials are higher harm
than other PII.
- route_local: leave the text intact but flag the echo context so a
future content router refuses cloud-proxy candidates. Useful for
deployments that trust local models with sensitive data but not
external providers.
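A rough sketch of how the three actions might hang together; the Action values, the [REDACTED:<id>] format, and the hash-prefix idea come from this PR, while the helper names and single-match handling are illustrative:

```go
package pii

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

type Action string

const (
	ActionMask       Action = "mask"
	ActionBlock      Action = "block"
	ActionRouteLocal Action = "route_local"
)

type Pattern struct {
	ID     string
	Re     *regexp.Regexp
	Action Action
}

// hashPrefix is the 8-char sha256 prefix stored on events so admins can
// dedupe recurring leaks without the store ever holding the value.
func hashPrefix(match string) string {
	sum := sha256.Sum256([]byte(match))
	return hex.EncodeToString(sum[:])[:8]
}

// Apply handles the first match only; the real redactor walks all spans.
func Apply(p Pattern, text string) (out string, blocked, routeLocal bool) {
	loc := p.Re.FindStringIndex(text)
	if loc == nil {
		return text, false, false
	}
	_ = hashPrefix(text[loc[0]:loc[1]]) // recorded on the PIIEvent, never the value
	switch p.Action {
	case ActionBlock:
		return text, true, false // 400 pii_blocked; value never echoed back
	case ActionRouteLocal:
		return text, false, true // text intact; cloud-proxy candidates refused
	default:
		return text[:loc[0]] + fmt.Sprintf("[REDACTED:%s]", p.ID) + text[loc[1]:], false, false
	}
}
```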
Wiring:
- core/services/routing/pii: types, regex compile, redactor, in-
memory event ring buffer, YAML config loader, request middleware.
- core/services/routing/piiadapter: per-API-shape adapter (OpenAI
today; Anthropic when needed) so the schema package never imports
pii.
- core/http/routes/openai.go: wires pii.RequestMiddleware as the
innermost middleware in the chat slice — runs after the request
is parsed, mutates the request body in place when masking, returns
400 when blocking.
- core/http/routes/pii.go: GET /api/pii/patterns, GET /api/pii/events,
POST /api/pii/test (admin-or-local-user; events filterable by
correlation_id, user_id, pattern_id).
- pkg/mcp/localaitools: list_pii_patterns, get_pii_events,
test_pii_redaction tools with full route map coverage in
coverage_test.go.
- core/http/endpoints/localai/api_instructions.go: pii-filtering
instructions entry; reachability test bumps to 14.
- --pii-config / --disable-pii flags; pii.yaml format overrides
per-id action with unknown-id rejection at startup.
PIIEvent records never carry the matched value — only the byte
offset, length, and an 8-char sha256 prefix so admins can dedupe
recurring leaks during audit. The contract.Invariant
"pii.event_per_span" asserts every redacted span produces an event
record.
Verified end-to-end against ui-test-server: GET /api/pii/patterns
returns the 6 defaults with correct actions; POST /api/pii/test with
"contact alice@example.com" returns
'redacted="contact [REDACTED:email] about it"' and a span with
hash_prefix=ff8d9819; same with "sk-..." returns blocked=true.
Streaming response filter (the buffered-emit invariant) is in the
plan as a separate slice and not in this commit.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Streaming chat completions weren't producing UsageRecords because the
middleware only parsed token counts from the response body — and OpenAI
clients rarely set stream_options.include_usage, while Anthropic uses a
different shape entirely. Handlers now stamp the canonical token counts
on the echo context via middleware.StampUsage; UsageMiddleware reads the
stamp first and only falls back to body-parse for proxy/foreign
endpoints. The body-parse fallback gains an Anthropic shape so
passthrough proxies for /v1/messages still work.
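A sketch of the stamp-first contract, assuming echo's context API and hypothetical key and type names (usageStampKey, UsageStamp):

```go
package middleware

import "github.com/labstack/echo/v4"

// usageStampKey is a hypothetical context key; the real name may differ.
const usageStampKey = "usage_stamp"

type UsageStamp struct {
	PromptTokens     int
	CompletionTokens int
}

// StampUsage lets handlers record canonical token counts as they go,
// so the usage middleware never depends on the client requesting
// stream_options.include_usage.
func StampUsage(c echo.Context, s UsageStamp) {
	c.Set(usageStampKey, s)
}

// ReadUsageStamp is what the middleware tries first; body-parsing is
// only the fallback for proxy/foreign endpoints.
func ReadUsageStamp(c echo.Context) (UsageStamp, bool) {
	s, ok := c.Get(usageStampKey).(UsageStamp)
	return s, ok
}
```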
Billing's Prometheus counters were never reaching /metrics because the
monitoring service that calls otel.SetMeterProvider was created later
than billing.NewRecorder, leaving the counters bound to the no-op global
provider. The metrics service now initialises in application.start()
before any counter is registered, exposes its meter via Application
.MetricsService(), and hands it directly to billing via SetMeter() so
the order-of-operations dependency is explicit rather than racy.
The synthetic local user is now wired unconditionally when stats are
enabled (not just when authDB is nil), so internal/system callers under
auth-on still attribute correctly. The /app/users React route is
guarded by a new RequireAuthEnabled component that redirects to /app
when auth is off, defending against direct URL access of an admin-only
page that has nothing to manage in single-user mode.
A new localai_usage_unrecorded_total{endpoint,reason} counter ticks
whenever a request finishes without producing a record, so silent
billing misses are observable rather than invisible.
Verified end-to-end: chat (streaming + non-streaming), embeddings, and
Anthropic messages (streaming + non-streaming) each produce one
UsageRecord and one Prom counter increment in no-auth mode.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Move PII filtering from a global opt-out to a per-model opt-in: local
models bypass redaction by default, while backends matching `proxy-*`
default to on (forward-compatible with the cloud-passthrough subsystem).
A new ModelConfig.PII block lets a model opt in (`enabled: true`) and
upgrade or downgrade individual pattern actions without touching global
config. The middleware reads the resolved config from the echo context
and short-circuits when disabled, so a chat to a local model pays no
regex-scan cost.
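The per-model gate might resolve roughly like this; the struct shape and yaml tags are guesses from the commit text, and only PIIIsEnabled and the proxy-* default are named in it:

```go
package config

import "strings"

// PIIPatternOverride mirrors one row of the per-model pii.patterns list.
type PIIPatternOverride struct {
	Pattern string `yaml:"pattern"`
	Action  string `yaml:"action"`
}

// PIIConfig is a guess at the ModelConfig.PII block's shape.
type PIIConfig struct {
	Enabled  *bool                `yaml:"enabled"`  // nil = inherit the backend default
	Patterns []PIIPatternOverride `yaml:"patterns"` // per-id action overrides
}

// PIIIsEnabled resolves the per-model default: proxy-* backends are on
// unless the model opts out, local backends are off unless it opts in.
func (c PIIConfig) PIIIsEnabled(backend string) bool {
	if c.Enabled != nil {
		return *c.Enabled
	}
	return strings.HasPrefix(backend, "proxy-")
}
```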
The Anthropic /v1/messages route gains the same redaction path via a
new piiadapter.Anthropic() that walks AnthropicRequest.Messages —
identical shape to the OpenAI adapter, so a future passthrough proxy
gets PII for free.
A new admin page at /app/middleware (System section, admin-only)
surfaces the live state. Three tabs: Filtering shows the pattern
catalogue with action editors plus every model's resolved enabled state
and overrides; Routing is a placeholder until subsystem 2 lands; Events
renders recent PIIEvents (correlation id, pattern id, action, hash
prefix — the redacted content is never stored or displayed). The page
reads /api/middleware/status (a single-round-trip aggregator) and
mutates pattern actions via PUT /api/pii/patterns/:id (transient,
restored from --pii-config on restart). MCP exposes the same surface as
get_middleware_status and set_pii_pattern_action so an agent can
introspect or tune the filter without code access. The drift detector
in pkg/mcp/localaitools/coverage_test.go still passes — both new tools
ship with their HTTP route mappings.
Behaviour change for existing deployments: local models no longer
receive global PII redaction without an explicit `pii: { enabled: true }`
in their YAML. Documented in the new middleware-admin instructions
registry entry.
End-to-end verified against tests/e2e-ui/ui-test-server (which gains a
--pii-yaml flag for injecting per-model PII config into the auto-
generated mock-model.yaml): default-off produces no events; explicit
opt-in produces a mask event; per-model action override produces an
HTTP 400 pii_blocked response.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add the routing subsystem's content-router tier: a Router config block
on ModelConfig turns a model into a smart-router that classifies each
request and rewrites input.Model to one of its candidates. The standard
model-resolution path then runs ACL, disabled-state, and per-model PII
against the chosen target; the router only does *model* selection, not
node selection (SmartRouter still owns the latter in distributed mode).
The classifier interface lives in core/services/routing/router with one
shipped implementation: a feature classifier that picks a candidate by
prompt length and code-fence presence. The router.Probe shape is
schema-agnostic; per-API-shape extractors (OpenAIProbe, AnthropicProbe)
in core/http/middleware translate parsed requests into probes without
dragging the schema package into the router. The interface deliberately
doesn't depend on core/config; callers translate RouterCandidate slices
into FeatureCandidate slices at construction time.
The new RouteModel middleware runs after SetModelAndConfig + body parse
but before the PII filter. When the resolved config has a Router block,
the middleware invokes the classifier, looks up the matched label in
the candidate table, reloads the target model's config, asserts depth-1
(the candidate must NOT itself be a router; chained routers turn
dispatch into a graph), and swaps MODEL_CONFIG + input.Model in place.
RequestedModel/ServedModel get stamped on the context so the usage log
records the routing. Classifier failures and unknown labels fall
through to Router.Fallback; fallback-empty errors return 503 rather
than silently bypassing.
The decision log is a ring-buffer in core/services/routing/router that
mirrors the PII event log: in-memory by default, capped at 5k records,
filterable by correlation_id / user_id / router_model. New REST
endpoints surface it: GET /api/router/decisions (admin-only) and an
updated GET /api/router/status that lists configured router models +
their classifier configs. The /api/middleware/status aggregator pulls
the same data so the React Middleware page renders the Routing tab with
active routers and recent decisions side-by-side.
MCP gains a get_router_decisions tool. The coverage drift detector
catches the new tool; its HTTP route is documented in the same map. The
new instructions registry entry "intelligent-routing" explains the
Router block, the depth-1 rule, and points at the decisions endpoint.
Total instructions count → 16.
End-to-end verified: configured mock-model as a smart-router with a
small (max_prompt_length=30) and a large candidate; a 5-char prompt
routes to small-model and a 100-char prompt routes to large-model; both
decisions appear in /api/router/decisions and /api/middleware/status
reflects the active config.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
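A sketch of the feature classifier described in the commit above, under assumed type names (Probe, FeatureCandidate) and deliberately simplified rules:

```go
package router

import "strings"

// Probe is the schema-agnostic view an extractor builds per API shape.
type Probe struct {
	Prompt string
}

// FeatureCandidate pairs a label with the rules the feature classifier
// evaluates.
type FeatureCandidate struct {
	Label           string
	MaxPromptLength int // 0 = unbounded
	RequiresCode    bool
}

// Classify returns the first candidate whose rules all hold; "" sends
// the middleware to Router.Fallback.
func Classify(p Probe, cands []FeatureCandidate) string {
	fence := "\x60\x60\x60" // the ``` code-fence marker, escaped for clarity
	hasCode := strings.Contains(p.Prompt, fence)
	n := len([]rune(p.Prompt)) // rune count, per the later self-review fix
	for _, c := range cands {
		if c.RequiresCode && !hasCode {
			continue
		}
		if c.MaxPromptLength > 0 && n > c.MaxPromptLength {
			continue
		}
		return c.Label
	}
	return ""
}
```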
Closes the output-side gap in the PII subsystem: until now, redaction
only ran on incoming chat requests. A model could generate "your key
is sk-..." and stream it straight to the client. The new StreamFilter
intercepts the OpenAI chat completion stream's content deltas, applies
the same regex tier the request-side middleware uses, and masks
matches that span chunk boundaries.
The buffered-emit invariant: for any active pattern with bounded
max-length L, the filter holds back the trailing L-1 characters of
the cumulative input. New text disambiguates the boundary; the stream
close (Drain) flushes whatever is left. This is what guarantees the
mask survives an arbitrarily-split chunk sequence — alice@example.com
arriving as "alice@" + "example.com" still becomes [REDACTED:email].
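In miniature, the hold-back rule for a single bounded pattern might look like this; the real filter tracks multiple patterns, avoids re-scanning already-redacted text, and snaps the cut to a rune boundary (see the later self-review commit):

```go
package pii

import "regexp"

// StreamFilter withholds the trailing maxLen-1 bytes of the cumulative
// input so a match split across chunk boundaries is always seen whole
// before any of it is emitted.
type StreamFilter struct {
	re     *regexp.Regexp
	maxLen int    // bounded max match length L for the pattern
	tail   string // withheld suffix, not yet safe to emit
}

// Push ingests a chunk and returns the prefix that can no longer be
// part of a future match.
func (f *StreamFilter) Push(chunk string) string {
	buf := f.re.ReplaceAllString(f.tail+chunk, "[REDACTED]")
	if len(buf) < f.maxLen {
		f.tail = buf
		return ""
	}
	cut := len(buf) - (f.maxLen - 1)
	out := buf[:cut]
	f.tail = buf[cut:]
	return out
}

// Drain flushes the withheld tail at stream close, applying one final
// (idempotent) redaction pass for a match sitting entirely in the tail.
func (f *StreamFilter) Drain() string {
	out := f.re.ReplaceAllString(f.tail, "[REDACTED]")
	f.tail = ""
	return out
}
```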
Action handling differs from the request side: earlier chunks are
already on the wire by the time later chunks scan, so a "block" can't
actually reject. The filter remaps block to mask for redaction while
recording PIIEvent rows with action=block so audits surface the
original intent ("the model would have leaked X here, suppressed in
flight"). route_local on output is a no-op (the routing decision was
made at request time).
A property test feeds the redactor every corpus input across 10
random chunkings and asserts (a) no secret value ever appears in the
emitted output and (b) the streamed output equals what a single-shot
redaction would produce on the unsplit text.
Wiring: the OpenAI chat endpoint constructs a per-stream filter when
the resolved ModelConfig has PIIIsEnabled — the same gate the
request-side middleware reads, so a model with PII off pays no
streaming cost either. ChatEndpoint signature gains *pii.Redactor and
pii.EventStore parameters; the legacy /v1/mcp/chat/completions wires
nil values (kept for backward compatibility, request-side filter on
the main route still applies).
The mock-backend gains a MOCK_LEAK_EMAIL prompt sentinel that emits a
response containing alice@example.com — used by the end-to-end test:
streaming chat against a mock-model with pii.enabled=true produces a
data chunk containing [REDACTED:email] and an /api/pii/events row
with direction=out and action=mask.
Anthropic /v1/messages and the bare /v1/completions path are NOT yet
wired; their streaming surfaces will get the same filter in a follow-
up. The StreamFilter type is schema-agnostic so wiring is a small
patch per route.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The per-model pii.patterns field was being rendered as a generic
JSON-editor textarea, leaving users to discover the schema by trial and
error. Replace it with a dedicated component that fetches the live
pattern catalog from /api/pii/patterns and presents pattern + action as
two select dropdowns per row, with a separate "add" picker that hides
patterns already overridden.
The pattern catalog is loaded at render time, so new built-in patterns
(when added to DefaultPatterns) surface in the UI automatically without
schema duplication. Unknown IDs already in the YAML still render so
hand-edited configs aren't lost on first load.
Also gives pii.enabled a proper label and description in the config
metadata registry so the toggle isn't an opaque "Enabled" entry under
"Other".
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…/completions
Closes the streaming-coverage gap flagged in 8d421453. The StreamFilter
type is wire-format-agnostic, so wiring it into the remaining streaming
surfaces is a per-route patch:
- Anthropic /v1/messages: text_delta is the only content surface that
  carries model output; wrap each emit (token-callback path, ChatDeltas
  path, autoparse fallback) so a pattern split across SSE chunks still
  gets masked. Drain the buffered tail before any content_block_stop on
  the text block (normal close, tool-call transitions, autoparse), so
  trailing residue isn't silently truncated when the model pivots into
  a tool_use block. Block→mask remap and per-model action overrides
  follow the same gating as the OpenAI chat path.
- /v1/completions: response-side only; the endpoint has no chat message
  structure for request-side scanning, but a model trained on PII can
  still emit it. Filter Choices[0].Text per chunk and drain the residue
  into one final text-bearing chunk just before the stop chunk +
  [DONE]. Same per-model gate as elsewhere: PII off for non-proxy
  backends by default, on for proxy-* / explicit pii.enabled = true.
  The filter is nil when disabled, so the flow is untouched.
Subsystem 3 (PII) is now feature-complete for the MVP scope across both
directions on chat/completions/messages. Encoder NER tier
(TokenClassify gRPC) remains as a follow-up.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds wire-format-faithful HTTP+SSE forwarding for models whose Backend
starts with `proxy-` and whose `proxy.upstream_url` is set. The chat
and messages handlers fork to the proxy before any local templating or
gRPC dispatch, so the upstream sees the request body the client sent
(with only the top-level `model` field optionally rewritten).
The streaming PII filter rides on top: per-token text is extracted from
each SSE chunk, pushed through pii.StreamFilter, and spliced back into
the original envelope so the upstream's event names and metadata pass
through untouched. PII residue flushes before the provider's terminal
marker ([DONE] / message_stop) so clients that stop reading on the
marker don't lose the tail.
Auth is provider-aware (OpenAI Bearer, Anthropic x-api-key +
anthropic-version header). API keys are read from env vars named in
config so secrets stay out of YAML and the admin UI.
No request-shape translation in the MVP: a client posting OpenAI-shaped
requests to a proxy-anthropic model gets a confused upstream.
Cross-shape forwarding is deliberately deferred; tool-call argument
round-tripping and reasoning-content passthrough deserve their own
review.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a copy-paste-ready model config template for both proxy-openai and
proxy-anthropic, covering API key handling via env vars, model name
rewriting, request timeout, and the per-model PII gate. Includes a
section on combining proxy models with the intelligent router so a
single LocalAI instance can mix local and cloud candidates behind one
classifier.
Documents the MVP limitations explicitly (no request-shape translation,
no output-side PII for buffered responses, no retry) so users don't hit
them as surprises.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an HTTPS forward proxy that selectively MITMs traffic for
allowlisted LLM API hosts so LocalAI can apply per-request PII
redaction to clients authenticating via OAuth / subscription rather
than via API keys held by LocalAI. Hosts outside the allowlist get a
plain CONNECT tunnel, so OAuth flows, telemetry, and unrelated HTTPS
keep working without depending on the CA being trusted.
Components:
- mitm.CA: ECDSA-P256 CA, generated once and persisted (key 0600)
- mitm leaf cache: per-SNI leaf certs minted on demand, cached in-mem
- mitm.Server: CONNECT-aware HTTP server, hijacks the conn, mints a
  leaf, terminates TLS, parses HTTP/1.1 requests, dispatches
- mitm PII handler: re-uses the existing piiadapter for request
  redaction and pii.StreamFilter for SSE response redaction; runs only
  on /v1/messages and /v1/chat/completions paths (others pass through
  verbatim, preserving Anthropic-OAuth and OpenAI-Codex auth flows
  untouched)
- Application wiring: --mitm-listen / --mitm-ca-dir /
  --mitm-intercept-hosts CLI flags. Off by default. CA cert exposed
  unauthenticated at GET /api/middleware/proxy-ca.crt for client
  trust-store install.
Primary use case: redact PII from Claude Code sessions running against
a Claude Pro/Max subscription, where LocalAI doesn't hold (and can't
use) an API key. Codex CLI works the same way.
HTTP/1.1 only; HTTP/2 deferred (most CLIs negotiate down without
issue).
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
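A sketch of the on-demand leaf minting the commit above describes, using only the standard library; the function name, validity window, and serial scheme are illustrative:

```go
package mitm

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// MintLeaf issues a short-lived cert for the SNI the client asked for,
// signed by the persisted CA; the real server caches the result per host.
func MintLeaf(caCert *x509.Certificate, caKey *ecdsa.PrivateKey, host string) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}
```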
Previously the MITM proxy terminated TLS as HTTP/1.1 only. Modern
LLM-API clients (Claude Code, Codex CLI) and the Anthropic / OpenAI
APIs themselves all speak HTTP/2; h2 multiplexing is what makes
streaming responses cheap. Forcing h1.1 in the middle of the path
worked but cost a measurable per-request overhead and would have broken
any future client that drops h1 support.
Changes:
- proxy.go: TLS NextProtos = ["h2", "http/1.1"]; after the handshake,
  branch on NegotiatedProtocol. The h2 path uses http2.Server.ServeConn
  with the InterceptHandler wrapped as an http.Handler. The h1.1 path
  retains the manual request-loop with connResponseWriter as a fallback
  for legacy clients.
- handler.go: outbound http.Transport explicitly configured with
  http2.ConfigureTransport so the upstream leg also negotiates h2.
- go.mod: promote golang.org/x/net to a direct dependency (was indirect
  via websocket).
- New tests: TestProxy_NegotiatesHTTP2 verifies resp.Proto ==
  "HTTP/2.0", TestProxy_HTTP2Streaming covers SSE-over-h2 with
  per-frame flush, TestProxy_HTTP1Fallback locks the legacy path.
The InterceptHandler signature is unchanged: h2 streams map 1:1 to
http.Request, just like h1, so handlers don't have to know which
protocol is on the wire.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
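The ALPN branch in the commit above could look roughly like this; serveTLS is an assumed name and serveHTTP1 stands in for the manual request loop:

```go
package mitm

import (
	"crypto/tls"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

// serveTLS branches on the negotiated protocol after the handshake, so
// the handler never has to know which protocol is on the wire.
func serveTLS(raw net.Conn, cfg *tls.Config, h http.Handler) error {
	cfg.NextProtos = []string{"h2", "http/1.1"}
	conn := tls.Server(raw, cfg)
	if err := conn.Handshake(); err != nil {
		return err
	}
	if conn.ConnectionState().NegotiatedProtocol == "h2" {
		(&http2.Server{}).ServeConn(conn, &http2.ServeConnOpts{Handler: h})
		return nil
	}
	return serveHTTP1(conn, h)
}

func serveHTTP1(conn net.Conn, h http.Handler) error {
	// real code runs a manual http.ReadRequest loop with a
	// connResponseWriter; elided in this sketch
	return conn.Close()
}
```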
…e and comments
- New core/services/cloudproxy/ssewire package owns the SSE scanner
and the per-provider rewrite/terminal/residual helpers; cloudproxy
and mitm both import it. Removes ~150 lines of literal duplication
between mitm/sse.go and cloudproxy/{sse,proxy}.go.
- handler.go: replace dispatchPIIIntercept (8 positional params) with
a piiDispatcher struct built once at NewPIIHandler time. Hoists the
pattern→action map out of the per-request hot path, fixes a PII
event-ID collision when one request triggered multiple spans of
the same pattern (now uses an atomic seq), and stops silently
dropping store.Record errors.
- proxy.go: cache streaming(body) result instead of re-parsing JSON.
- ca.go: drop the redundant certDER field; use cert.Raw, the byte-
identical buffer x509.ParseCertificate already populates.
- Trim package docs and over-narrating per-declaration comments to
match the project style guide (only WHY when non-obvious).
No behaviour change. All existing tests pass.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two starter YAMLs to the Import Model page's Power → YAML view:
"OpenAI proxy" and "Anthropic proxy". Clicking either fills the editor
with a working proxy-* skeleton: backend, upstream URL, api_key_env (so
the secret stays out of YAML), upstream_model, request_timeout_seconds,
and a sensible per-model PII gate.
Templates appear next to the Copy button so they're discoverable
without leaving the editor. The user fills in their own model name,
upstream URL, and env-var name and submits.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
This reverts commit f11c533ceb9b7c164023ca27e21259d29196bd95.
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds two template cards to the Add Model page (/app/model-editor in
create mode): "OpenAI Proxy" and "Anthropic Proxy". Picking either
pre-fills the form with backend, upstream URL, api_key_env,
upstream_model placeholder, request timeout, and pii.enabled; the user
fills in the model name, the env-var name, and the upstream model and
saves.
This is the right home for the proxy starter; the Import Model page is
reserved for fetching artefacts from HF / Ollama / OCI and the proxy
doesn't fit that pattern.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds MITMListen and MITMInterceptHosts to RuntimeSettings so an admin
can flip the cloudproxy MITM listener on/off and edit the intercept
allowlist via /api/settings (already admin-gated; locked down by
--disable-runtime-settings when the operator wants no runtime mutation
at all). The CA dir stays startup-only: the persisted CA is the trust
anchor for every already-installed client, and rotating it from a REST
endpoint would orphan them.
Editing the listen address or allowlist reuses the same CA via
Application.RestartMITM, which stops the old listener (if any), reads
the current config, and starts a new one.
Also adds a "mitm" section to GET /api/middleware/status so the admin
page can render running state, configured vs bound listen address,
allowlist, and the CA download URL.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a "MITM Proxy" tab to /app/middleware. Shows running state + bound listen address; renders Apply/Discard form for the listen address and intercept-host allowlist (which writes through to /api/settings, already admin-gated and watchable by --disable-runtime-settings); offers a one-click CA cert download plus a brief client-setup recipe (NODE_EXTRA_CA_CERTS + HTTPS_PROXY) so an admin can stand up Claude Code / Codex without leaving the page. Backend bits already shipped in 76e3b5fe — this turns the data into a working control surface. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com>
- ProxyTab: gate the server→local sync useEffect on !dirty so Refresh /
  post-save refetch can't clobber mid-typed input. The intercept_hosts
  array reference changes per fetchAll(), so the previous deps[]
  silently re-fired every poll.
- Switch ProxyTab.save to settingsApi.save, the same path Settings.jsx
  uses. Drops the raw fetch + handcrafted JSON.
- Move mitmMutex from a package-level var onto Application, matching
  p2pMutex / watchdogMutex. Add stopMITMLocked for symmetry with
  startMITMLocked; RestartMITM now reads as stopLocked → bail-on-empty
  → startLocked.
- Add BackendProxyOpenAI / BackendProxyAnthropic constants in cloudproxy
  and use them in providerName. Test-data sites stay as literals so a
  typo'd constant rename still fails the tests.
- Trim a buildMITMStatus comment that just narrated the field names.
No behaviour change.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Repurpose the PII event store as a shared middleware audit log: add an
EventKind discriminator (pii | proxy_connect | proxy_traffic) and
proxy-specific fields (Host, Intercepted, BytesSent, BytesReceived,
StatusCode, DurationMS) to the existing PIIEvent record. Keep request
contents out of the store; bodies live in API/backend traces only.
The MITM Server records a proxy_connect row for every CONNECT (with
Host + Intercepted=true|false) so admins can see which hostnames a
client tried to reach and whether the proxy terminated TLS or tunneled
through. The PIIHandler wraps its ResponseWriter to count bytes
downstream and records a proxy_traffic row at request end with
sent/received byte counts, status code, and duration.
The /api/pii/events endpoint accepts a kind= filter. The Middleware
admin page Events tab gains a Kind column, a kind filter row, and
per-kind detail formatting (host + intercept decision for connects;
HTTP status, byte counts, and duration for traffic). The MCP
get_pii_events tool stays scoped to kind=pii so the LLM-facing audit
isn't polluted by proxy rows with empty PII fields.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Add a Go test for the tunneled CONNECT path: a non-allowlisted host
must record a proxy_connect with Intercepted=false and zero
proxy_traffic events (since tunneled bytes never reach the dispatcher).
Extend the Playwright spec for the Middleware page Events tab. The mock
event feed now includes a pii row, two proxy_connect rows (intercept
and tunnel decisions), and one proxy_traffic row. New test cases:
- proxy_connect rows show "intercepted" / "tunneled" labels
- proxy_traffic row shows HTTP status, byte counts, and duration
- the kind filter buttons narrow the table to a single kind
- the Kind column header and per-kind badges render
Note: Playwright runs failed in the local sandbox (the bundled
chrome-headless-shell can't load libglib on this NixOS host); the specs
are authored against the rendered DOM and will run in CI.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
loadRuntimeSettingsFromFile applied every other persisted runtime
setting (branding, watchdog, P2P, agent pool, ...) back into options
on startup but skipped the MITM fields. So when an admin configured
the listener via /api/settings, runtime_settings.json on the mounted
volume held mitm_listen + mitm_intercept_hosts, but on restart options
came up empty and the start-MITM gate at startup never fired.
Two changes:
- loadRuntimeSettingsFromFile now copies MITMListen and
MITMInterceptHosts from the file when no CLI flag set them. Like
branding, the file is the only source — there are no env vars for
these — so an explicit --mitm-listen still wins, but a /api/settings
save round-trips correctly.
- The startMITMProxy call moves to after loadRuntimeSettingsFromFile.
Previously it ran before the file load, so even with the loader
fix in place options.MITMListen would be empty when the gate
fired. The watchdog and other restartable subsystems already
initialize after the load — MITM now matches.
Tests pin the contract:
- core/config: WritePersistedSettings + ReadPersistedSettings round-trip
preserves both MITM fields.
- core/application: loadRuntimeSettingsFromFile populates MITMListen
and MITMInterceptHosts from a fixture file, and an explicit CLI
flag wins over the file value.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Self-review pass on the routing-stats slice. Each finding is paired
with test coverage; one refactor (atomic.Pointer for the MITM
accessors) matches the existing agentPoolService precedent in the same
struct.
Logic fixes:
- pii/stream.go: snap emitBoundary to a rune start so the held tail
  never contains a split UTF-8 codepoint. Multibyte corpus added to the
  buffered-emit invariant test.
- pii/redactor.go: SetAction publishes a fresh patterns slice
  (slices.Clone) instead of mutating r.patterns[i].Action in place; Go
  strings are not atomic two-word values, so concurrent Redact callers
  iterating an older snapshot would race on the field even under
  RWMutex. Race-stress test added.
- pii/openai adapter: new bit-24 sentinel + 24-bit block field
  (idxWholeStringFlag/idxBlockMask) replaces the 0xFFFF sentinel that
  collided with a real block index of 65535.
- mitm/proxy.go: fail closed if SetDeadline errors before the TLS
  handshake; proceeding into the protocol switch on an unhandshaken
  conn is worse than dropping the connection.
- mitm/response.go: Connection: close compared with EqualFold so any
  casing triggers the post-response disconnect (RFC 9110 §7.6.1).
- application: MITMServer/MITMCA accessors are now
  atomic.Pointer-backed (matches agentPoolService); readers no longer
  race RestartMITM on pointer swap. mitmMutex is retained only to
  serialize Stop+Start.
- router/feature.go: prompt length predicates use rune count, not byte
  count; operators reason in characters, not UTF-8 bytes. Cached once
  per Classify call rather than recomputed per candidate.
- mcp/localaitools/inproc: GetUsageStats(All=true, UserID=…) honours
  the UserID filter, matching the REST endpoint's ?user_id param, so
  the same MCP call now returns the same data over either transport.
- react-ui middleware spec: bytes_received mock changed from 1280 to
  1228 so formatBytes returns the asserted 1.2KB string.
Test coverage added:
- pii: race-detector test for SetAction, multibyte UTF-8 corpus.
- ssewire: direct unit tests for Scanner edge cases (CRLF, leading
  blanks, mid-event EOF) and IsTerminalMarker per provider.
- mitm: Stop idempotency, restart cycle with allowlist swap.
- middleware/route_model: classifier-success, fallback,
  depth-1-invariant, no-fallback-503, unsupported-classifier paths +
  OpenAIProbe/AnthropicProbe extractors.
- anthropic/messages: drainStreamPIIToText covers nil-filter no-op,
  empty-drain no-op, residual emit shape, idempotence, and
  end-of-stream redaction.
- application: symmetric MITMInterceptHosts CLI-wins loader test.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Four UX changes that together move per-host MITM control out of the
global runtime_settings.json and into model YAML, where PII overrides
already lived. The MITM model template + the Add Model picker entry
mirror how the Talk page surfaces pipeline models.
A. Per-pattern PII enable + persist
- pii.Pattern gains a Disabled flag; Redactor.RedactWithOverrides
skips disabled patterns. SetDisabled mutates via slices.Clone for
the same race-free publish SetAction uses.
- PUT /api/pii/patterns/:id accepts {action?, disabled?} (one or
both). New POST /api/pii/patterns/persist snapshots the live
redactor's deltas vs --pii-config defaults into a new
pii_pattern_overrides map in runtime_settings.json; the boot
loader applies it after redactor construction.
- React: per-row Enabled checkbox + a "Save to disk" button on the
Filtering tab. PUT toasts note the change is transient until
persist is clicked.
- MCP: PIIPatternActionUpdate.Disabled is optional; new
persist_pii_patterns tool. Coverage map + full-catalog test
updated.
B. Model-config link buttons
- Per-model row in the Filtering tab gets an Edit button linking to
/app/model-editor/<name>. Mirrors the same pattern used elsewhere
for navigating to a config from a status surface.
D2. Model configs own MITM hosts
- New mitm: { hosts: [...] } block on ModelConfig. Loader gains
MITMHostOwners() returning {Owners, Conflicts}; ANY duplicate host
across model configs is a critical error that disables the MITM
listener until resolved (strict 1-to-1 invariant the dispatcher
relies on).
- startMITMLocked validates ownership before binding; conflicts are
published on Application.mitmHostConflicts and surfaced via
/api/middleware/status with a clear error message and links to
the colliding configs in the React banner.
- Allowlist is now exactly the set of hosts claimed by model configs
— the global MITMInterceptHosts list and MITMHostsWithPIIDisabled
list are removed from RuntimeSettings, ApplicationConfig, the CLI
flag, and runtime_settings.json. Per-host PII gate inherits from
each owner config's pii.enabled.
- New "MITM Intercept" template in modelTemplates.js (default name
mitm-anthropic, default host api.anthropic.com, pii.enabled: true,
empty pii.patterns: [] for an immediately-visible override editor).
Registered in core/config/meta/registry.go as a string-list field
so the model editor renders it.
- /api/middleware/status MITM payload gains models: a list of every
config that owns at least one MITM host (name, hosts, pii_enabled,
backend), plus host_owners, host_conflicts. The MITM Proxy tab
renders this as a top-level "MITM Models" table with an Add MITM
model button.
Test: ModelConfigLoader.MITMHostOwners cross-config conflict
detection, host-normalisation, and intra-config duplicate handling.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Pull the local-store gRPC backend's KV+KNN logic into a reusable
pkg/store/local library so other in-process callers (notably the
routing module's KNN classifier) share one implementation. The
backend/go/local-store binary becomes a thin pb<->[]float32/[]byte
translation wrapper. Shared WrapKeys/WrapValues/UnwrapKeys helpers move
to pkg/store/proto.go.
Regression test suite covers normalization invariants, sort/merge
correctness, delete, KNN top-k ordering, and the 0xFFFF block-index
boundary that previously aliased.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Lands two routing subsystems behind the existing router config:
KNN classifier — embeds candidate exemplars on first Classify
(atomic.Pointer[local.Store] + loadMu for safe lazy load), picks
the candidate whose nearest exemplar is closest to the probe.
Threshold via router.min_score. Reuses the extracted local-store
library so the same KNN search runs in both the gRPC backend and
in-process router.
LLM classifier — asks a small instruct model to pick a label
from natural-language descriptions. Longest-first label match,
RWMutex-guarded prompt memo cache (size from
router.classifier_cache_size, default 1024), TrimSpace+ToLower
cache key.
EmbedderFactory / LLMCallerFactory adapter pattern on Application
keeps the router package free of HTTP/backend imports. Per-router
sync.Map cache in the middleware avoids re-embedding exemplars on
every request.
Admission control (subsystem 5) — per-model semaphore limiter
(sync.Map[modelName]chan struct{}) gates concurrent in-flight
requests by ModelConfig.Limits.MaxConcurrent. On rejection:
HTTP 503 + Retry-After + audit row via new pii.KindAdmission
event kind + JSON body { error.type: admission_rejected }.
Cap is fixed at first Acquire per model — admin restarts to
resize, matching the rest of the model config lifecycle.
Middleware runs after RouteModel so a router fanout that lands
on a saturated downstream is rejected even when the router-model
itself has slack. /api/middleware/status gains an admission
section listing each gated model's max_concurrent / in_flight /
retry_after_seconds. The Events tab in the Middleware page knows
about admission rows.
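The per-model semaphore described above, sketched with assumed names; the real middleware also records the audit row and sets Retry-After on rejection:

```go
package admission

import "sync"

// limiter gates in-flight requests per model; the cap is fixed at first
// Acquire, matching the restart-to-resize semantics described above.
type limiter struct {
	sems sync.Map // model name -> chan struct{}
}

// Acquire reports whether the request may proceed. On false the caller
// answers HTTP 503 + Retry-After and writes the admission audit row.
func (l *limiter) Acquire(model string, maxConcurrent int) bool {
	v, _ := l.sems.LoadOrStore(model, make(chan struct{}, maxConcurrent))
	sem := v.(chan struct{})
	select {
	case sem <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release must pair with every successful Acquire.
func (l *limiter) Release(model string) {
	if v, ok := l.sems.Load(model); ok {
		<-v.(chan struct{})
	}
}
```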
Single-source-of-truth constants (ClassifierFeature / KNN / LLM,
LabelFallback) and an errDecision helper de-duplicate the
classifier surface.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The earlier extraction to pkg/store/local was the wrong shape: it pulled
the in-memory KV+KNN into the main process so the KNN router
classifier could call it directly. That undermines the point of
having a vector-store backend — admins should be able to swap in
qdrant, pinecone, or any other pluggable store backend without
changes to the routing code.
Reverts pkg/store/local and inlines the implementation back into
backend/go/local-store as package main. KNN now consumes a
router.VectorStore interface (Set / Find), with the production
adapter at core/application/embedder.go wrapping pkg/store's gRPC
client (SetCols / Find) over a backend resolved from
core/backend.StoreBackend — exactly how face/voice recognition
consume the same surface.
RouterConfig gains a store_model field naming the chosen backend
(empty = default local-store). Each router model uses its own
namespace ("router-knn-<routerModelName>") so two routers sharing
a backend can't see each other's exemplars; ModelLoader's
per-(backend, namespace) process isolation does the rest.
The router package gains no core/backend dependency — the
VectorStore interface lives alongside Embedder and LLMCaller and
is wired from the application layer the same way.
Algorithm coverage (sort/merge, normalised fast path, KNN top-K,
dimension enforcement) stays where it belongs — in
backend/go/local-store/store_test.go — exercised through the
gRPC service surface that downstream consumers actually use.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds a TokenClassify gRPC method for token-classification (NER) models
and implements it in the Python transformers backend. The PII redactor
will consume this in a follow-up to add an ML-based detection tier on
top of the regex tier.
Proto surface:
- TokenClassifyRequest { text, threshold }
- TokenClassifyEntity { entity_group, start, end, score, text }
- TokenClassifyResponse { repeated entities }
Byte offsets are into the original UTF-8 text so the consumer can slice
without re-tokenising. entity_group follows HuggingFace's aggregated-tag
convention (PER, LOC, ORG, ... or PII-specific labels depending on the
model).
Go wiring: Client / embedBackend / ConnectionEvictingClient gain
TokenClassify; Backend interface includes it. Generated stubs are
gitignored and regenerated at build time via `make protogen-go`.
Python backend: a new `Type=TokenClassification` model-load branch
loads via `transformers.pipeline("token-classification",
aggregation_strategy="simple")`. The aggregated-strategy pipeline gives
us span-merged entities with byte offsets out of the box. TokenClassify
RPC runs the pipeline, filters by threshold, and returns the entities.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The streaming PII filter wiring referenced auth.GetUser to attribute
events to the request's user, but the import line was dropped during
rebase. The result was a build failure: "undefined: auth" at
chat.go:709 and :1453.
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Adds an optional encoder-based detection tier on top of the existing
regex tier. NER catches the long tail (unformatted names, locations,
mixed-language PII) that regex can't express, while regex keeps the
cheap path for formatted hits (emails, SSNs, credit cards).
The redactor exposes a new RedactWithNER(ctx, text, overrides,
NERConfig) that runs both tiers and merges hits through the same
overlap-resolution as before: when an entity span overlaps a regex hit,
the stronger action wins (block > route_local > mask). NER pattern IDs
are namespaced "ner:<entity_group>" so audit rows and event-tab filters
distinguish them from regex hits, and admins can disable a single
entity type with the same Disabled-pattern machinery.
NERConfig is per-request: each call site supplies the loaded detector +
per-group action map + minimum confidence, so the same Redactor
instance can serve different models with different NER preferences
without per-model redactor instances.
Fail-open semantics: a detector error returns the regex-only Result
alongside the error. The caller decides whether to surface the failure
(fail-closed: refuse the request) or log and proceed (fail-open: ship
regex-tier protection only). The regex tier itself never errors.
Regex hit-collection / overlap-merge / output emission are now factored
into collectRegexHits + mergeAndEmit so the regex-only
RedactWithOverrides and the new RedactWithNER share one implementation.
Out of scope (follow-up commits):
- core/application adapter from gRPC TokenClassify to NERDetector
- per-model PIIConfig.NER block + middleware wiring
- React middleware page surface for NER entity types
- gallery model entry for a recommended NER model
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
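A sketch of the block > route_local > mask precedence on overlapping spans from the commit above; Hit and resolveOverlap are illustrative names, only the namespacing and the ordering come from the commit:

```go
package pii

// Hit is a unified span shape both tiers could produce; NER IDs are
// namespaced "ner:<entity_group>" per the commit above.
type Hit struct {
	Start, End int
	PatternID  string
	Action     string // "mask" | "route_local" | "block"
}

// strength orders actions so the stronger one wins on overlap.
var strength = map[string]int{"mask": 0, "route_local": 1, "block": 2}

// resolveOverlap merges two overlapping hits: union of the spans, with
// the stronger action kept (block > route_local > mask).
func resolveOverlap(a, b Hit) Hit {
	if strength[b.Action] > strength[a.Action] {
		a.Action = b.Action
		a.PatternID = b.PatternID
	}
	if b.Start < a.Start {
		a.Start = b.Start
	}
	if b.End > a.End {
		a.End = b.End
	}
	return a
}
```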
The Routing tab now has an explicit affordance for creating a new
routing model — matches the pattern already used on the MITM Proxy
tab. Empty state shows a primary "Create routing model" button;
populated state adds an "Add routing model" button next to the
Active routers header.
Both link to /app/model-editor?template=router. A new template in
modelTemplates.js seeds the editor with the feature classifier and
two empty candidate rows (one for 'code' with requires_code, one for
'chat') so admins fill in candidate models + a fallback and save.
The model editor wouldn't render the router fields until they were
registered, so registry.go gains entries for:
- router.classifier (select: feature / knn / llm)
- router.fallback (model-select chat)
- router.embedding_model (model-select models — for KNN)
- router.store_model (model-select models — for KNN's vector store)
- router.min_score (number)
- router.classifier_model (model-select chat — for LLM)
- router.classifier_cache_size (number)
- router.candidates (code-editor — array of {label, model, rules})
All under the "other" section alongside mitm.hosts, ordered after
the MITM entry.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The router template seeds router.candidates with an array of
{label, model, rules} objects. CodeMirror's EditorState.create({ doc })
requires a string — passing the array crashed inside CM's Text class
with "(intermediate value).split is not a function", surfaced as an
"Unexpected Application Error" overlay the moment the template
loaded.
Adds a StructuredCodeEditor wrapper that:
- YAML-stringifies the structured value for display so CodeMirror
always sees a string,
- parses the user's text back to a structured value on every edit
(using YAML.parse) so the editor form state holds the canonical
shape, ready for unflattenConfig + YAML.stringify on save,
- holds the last-published structured value steady while the YAML
buffer is mid-edit and temporarily invalid (the CM YAML linter
surfaces the syntax error inline).
ConfigFieldRenderer routes code-editor fields through the wrapper
when the form value is non-string; plain text blobs (Go templates
etc.) still use the original CodeEditor with no behaviour change.
Playwright regression test pins:
- The Routing tab's "Create routing model" button navigates to
/app/model-editor?template=router.
- Loading that URL doesn't render the "Unexpected Application Error"
overlay, and the Router Candidates / Classifier fields are visible.
A page.on('pageerror') hook surfaces any uncaught render error so a
future regression fails with a useful message rather than silently
passing.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…usecase
Two follow-ups for the routing template:
1. router.candidates moves from raw YAML to a dedicated structured
editor. Each candidate is a card with:
- label and model (model picker — no more typing model names from
memory),
- optional description (LLM classifier hint),
- collapsible Rules section: max/min prompt length, requires_code
toggle, and an Examples textarea for KNN exemplars (one per
line).
Empty rule values are stripped from the output so the saved YAML
doesn't carry zero-valued junk. New "router-candidates" component
in the field registry routes to RouterCandidatesEditor; everything
else (regex tier, KNN factory, classifier dispatch) was already
wired against the same structured shape, so the YAML this editor
produces round-trips cleanly.
2. Proxy templates (proxy-openai, proxy-anthropic) ship with
known_usecases: ['chat']. Without it the proxy model wasn't
surfacing in router fallback / candidate pickers (or any chat
capability selector) because pickers filter by FLAG_CHAT and
backends with no explicit usecase list don't pass.
Updated regression test to assert the structured editor's "Add
candidate" button is present — if the field gets reverted to raw
YAML, the test fails loudly instead of silently passing on the
"didn't crash" check alone.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The KNN exemplars field shipped as a single textarea with "one per
line" semantics. That broke for any prompt that itself contained a
newline (a realistic case for the multi-line prompts admins want to
paste in verbatim from real traffic) and gave them a tiny 3-row box for
what's often the most consequential field on the form.
New ExamplesEditor renders one resizable textarea per exemplar with
add / remove buttons. Each exemplar can hold arbitrary text including
line breaks; the array on the wire stays a plain []string that the KNN
classifier already consumes unchanged.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Wires the consuming side of the LLMRouter-style data pipeline: the
KNN classifier can now load exemplars from a JSONL file alongside
(or instead of) hand-written candidate.examples. The benchmarker
itself isn't shipped yet — this lets the consumer be ready when it
lands, and lets admins drop in third-party datasets (LLMRouter's
own outputs etc.) directly.
JSONL shape (one row per query):
{"_meta": {"embedding_model": "longformer-base-4096",
"embedding_dim": 768,
"judge": "claude-opus",
"judge_method": "pairwise_winrate"}}
{"query": "fix the bug in this function",
"best_model": "qwen-coder",
"scores": {"qwen-coder": 0.92, "qwen-chat": 0.45},
"embedding": [0.12, ...]}
{"query": "hello", "best_model": "qwen-chat"}
The _meta header is optional. embedding/scores per row are optional.
Blank lines and "#" comments are skipped.
Loader (core/services/routing/router/routing_data.go; sketched after
this list):
- LoadRoutingDataset(path) parses JSONL, validates {query, best_model}
on each row, returns RoutingDataset{Meta, Rows}.
- 8MB per-line buffer so 4096-D Longformer rows fit.
- FilterByCandidates(modelNames) drops rows whose best_model isn't
configured — admins can share one benchmark across deployments
with different lineups.
- EmbeddingsMatch(name, dim) reports whether stored embeddings can
be used verbatim (saving 10-100x cold-start cost on large
datasets).
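A sketch of the loader's scan loop under the assumptions above; RoutingRow and LoadRoutingRows are illustrative names, and the header detection here is deliberately crude:

```go
package router

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// RoutingRow mirrors one JSONL data row; embedding and scores optional.
type RoutingRow struct {
	Query     string             `json:"query"`
	BestModel string             `json:"best_model"`
	Scores    map[string]float64 `json:"scores,omitempty"`
	Embedding []float32          `json:"embedding,omitempty"`
}

// LoadRoutingRows skips blanks and # comments, tolerates the optional
// _meta header row, and validates the two required fields.
func LoadRoutingRows(path string) ([]RoutingRow, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 8*1024*1024) // 8MB per-line cap

	var rows []RoutingRow
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.Contains(line, `"_meta"`) { // crude header check for the sketch
			continue
		}
		var r RoutingRow
		if err := json.Unmarshal([]byte(line), &r); err != nil {
			return nil, err
		}
		if r.Query == "" || r.BestModel == "" {
			return nil, fmt.Errorf("row missing query/best_model")
		}
		rows = append(rows, r)
	}
	return rows, sc.Err()
}
```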
KNN integration:
- KNNCandidate gains a Model field; the loader maps row.best_model →
candidate.Label.
- NewKNNClassifier signature gains a trailing KNNOptions{Dataset,
EmbeddingModelName}; existing call sites pass KNNOptions{}.
- Seeding has two passes — hand-written Examples first, then dataset
rows. Empty-Examples candidates are no longer a constructor panic;
with no dataset and no examples, the seed step fails on first
Classify and the middleware falls back (the right failure mode).
- Pre-computed embeddings are honoured iff the dataset's
_meta.embedding_model matches the configured embedder; otherwise
re-embed (different embedders → different vector spaces).
- Rows referencing models the router doesn't know about are silently
dropped.
Config:
- RouterConfig.ExemplarsFile (yaml: exemplars_file) names the JSONL.
Relative paths resolve against models dir.
- Field registered in core/config/meta/registry.go so the model
editor renders it as a path input next to the candidates editor.
Tests cover: meta header parsing, optional header, blank/comment
lines, missing-field validation, malformed JSON, missing file,
candidate filter, embeddings-match check; KNN seeds from dataset,
drops unknown models, uses precomputed embeddings when aligned,
re-embeds when mismatched, combines hand-written + dataset
exemplars.
Out of scope: the benchmarking CLI that produces these files.
Discussed as a separate slice — for general use the recommended
shape is pairwise-LLM-judge over a sampled traffic subset with
LocalAI's PII redactor in front of the judge call.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
llama.cpp doesn't support Longformer's sliding-window + global
attention pattern: confirmed by grepping convert_hf_to_gguf.py for
LongformerModel (not present; supported encoder archs are Bert,
DistilBert, Roberta, XLMRoberta, NomicBert, JinaBert, ModernBert,
NeoBERT, EuroBert). For routing, the dataset schema is encoder-agnostic;
we just need SOME long-context sentence encoder.
nomic-embed-text-v1.5 (NomicBert arch, 8192 native context, GGUF
available, already in gallery/index.yaml) fits the bill and runs on the
existing llama-cpp embedding path. Updates the model-editor description
for router.embedding_model to surface nomic-embed-text-v1.5 as the
default suggestion, with modernbert-embed-base / jina-embeddings-v3 as
alternatives.
Also corrects an inaccurate comment in routing_data.go that conflated
Longformer's context length (4096 tokens) with embedding dimensionality
(768) when justifying the 8MB scanner buffer.
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
…ache
Replace the prior feature/knn/llm router classifiers with a single
score-based classifier that asks an Arch-Router-style model to rank
every policy label as a continuation of the routing prompt and reads
off the softmax distribution. Multi-label routing falls out of this
naturally: the middleware activates every label whose probability
crosses a softmax threshold and picks the first candidate whose
labels are a superset of the active set.
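The score-and-activate step in miniature: softmax over per-label scores, threshold activation, then the first label-superset candidate wins. All names here are illustrative:

```go
package router

import "math"

// Candidate pairs a target model with the policy labels it satisfies.
type Candidate struct {
	Model  string
	Labels []string
}

// softmax turns per-label continuation scores (e.g. log-probs from the
// backend's Score RPC) into a probability distribution.
func softmax(scores map[string]float64) map[string]float64 {
	max := math.Inf(-1)
	for _, v := range scores {
		if v > max {
			max = v
		}
	}
	out := make(map[string]float64, len(scores))
	var sum float64
	for k, v := range scores {
		out[k] = math.Exp(v - max)
		sum += out[k]
	}
	for k := range out {
		out[k] /= sum
	}
	return out
}

// pickCandidate activates every label whose probability clears the
// threshold, then returns the first candidate whose label set is a
// superset of the active set.
func pickCandidate(probs map[string]float64, threshold float64, cands []Candidate) (string, bool) {
	active := map[string]bool{}
	for label, p := range probs {
		if p >= threshold {
			active[label] = true
		}
	}
	for _, c := range cands {
		have := map[string]bool{}
		for _, l := range c.Labels {
			have[l] = true
		}
		ok := true
		for l := range active {
			if !have[l] {
				ok = false
				break
			}
		}
		if ok {
			return c.Model, true
		}
	}
	return "", false // middleware falls back to Router.Fallback
}
```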
Wiring summary:
- backend.proto adds Score(ScoreRequest) → ScoreResponse. The
llama-cpp C++ backend implements Score on top of force-decoded
candidates against a freshly-cleared KV cache (prompt-KV sharing
optimisation is on the perf TODO list); vLLM uses prompt_logprobs.
Other backends return UNIMPLEMENTED.
- core/services/routing/router/score.go is the classifier. It builds
the ChatML routing prompt once at construction, scores every
policy label as a continuation, and applies an activation
threshold (default 0.15; 0.40 is a better empirical default on
Arch-Router-1.5B per the eval in features/middleware.md).
- RouterConfig grows Policies, ActivationThreshold, and an optional
EmbeddingCache nested struct. RouterCandidate collapses to
{Model, Labels[]} — labels are the matching contract, descriptions
live on the policy.
- The dead feature/knn/llm/routing_data files are removed.
L2 embedding cache (lookup sketched after this list):
- core/services/routing/router/embedding_cache.go wraps a Classifier
decorator that embeds each probe, KNN-searches the per-router
local-store collection, returns a cached decision if the cosine
similarity passes a threshold (default 0.80, lowered from 0.92
after the eval against nomic-embed-text-v1.5 paraphrases). Low-
confidence decisions are deliberately not cached so they can't
poison future paraphrases.
- Stats include hits, misses, near_misses, low_confidence, and a
10-bin similarity histogram so admins can see where the cosine
distribution sits relative to the configured threshold.
- Registry tracks built classifiers by fingerprint of the
RouterConfig YAML, so config edits invalidate the cache wrapper
automatically while the on-disk vectors persist.
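A sketch of the cache-hit path, assuming a flat exemplar list in memory (the real implementation KNN-searches the per-router local-store collection):

```go
package router

import "math"

// Decision is whatever the wrapped classifier produced; cached verbatim.
type Decision struct {
	Model string
	Score float64
}

type cachedExemplar struct {
	vec      []float32 // embedding of a past probe
	decision Decision
}

// CacheLookup returns a prior decision when the nearest cached probe
// embedding clears the cosine threshold (0.80 by default per this
// commit). Vectors are assumed to share one dimensionality.
func CacheLookup(probe []float32, cache []cachedExemplar, threshold float64) (Decision, bool) {
	best, bestSim := Decision{}, -1.0
	for _, e := range cache {
		if sim := cosine(probe, e.vec); sim > bestSim {
			best, bestSim = e.decision, sim
		}
	}
	return best, bestSim >= threshold
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```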
UI:
- The model-editor schema is rewritten: dead KNN/LLM fields gone,
policies/activation_threshold/embedding_cache.* added with proper
descriptions, sliders, and component bindings.
- RouterCandidatesEditor is rewritten for {model, labels[]} with
multi-select label chips populated from router.policies via a new
FormContext.
- RouterPoliciesEditor is the structured editor for the label
vocabulary, with duplicate-label detection via a memoised set.
- The Routing tab on /app/middleware renders the embedding-cache
histogram inline with a threshold marker.
Verification:
- Unit tests cover the score classifier (multi-label activation,
fallback, depth-1) and the embedding cache (hit, near-miss,
low-confidence skip, embedder/store error fallthrough, histogram
population).
- Refreshed e2e specs (router-template.spec.js, middleware-page.spec.js)
pass under make test-ui-e2e-docker: 133/135 passing with the two
failures unrelated to this slice.
- End-to-end eval against the LocalAGI stack with a 30-prompt corpus
+ 3 paraphrases each produced 35% steady-state hit rate at 0.80
threshold (53% of caching-eligible decisions), 15ms p50 cache-hit
latency vs 246ms classifier round-trip — a ~16× speedup on hits.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Allows analyzing requests, then routing, filtering, and transforming them.
Chat requests can be classified and labelled as requiring particular capabilities,
then routed to a model that satisfies all of them. Requests that require fewer
capabilities can naturally be handled by smaller, specialized models; the
classifier also activates more capabilities the more uncertain it is, routing
difficult requests to larger general-purpose models.
Classification is already fast, but once requests have been classified, their
embeddings can be used to skip classifying similar requests: the embeddings of
past requests are labelled, and new requests are matched against them with a
cosine-similarity search.
Private information can be detected in the request; when found, the request can
be modified to redact it, routed differently, or blocked.
Cloud models and a MITM proxy can be configured and take part in filtering and
routing, so easy requests go to smaller local models and hard ones to cloud
models. The MITM proxy lets you use Claude Code or Codex subscriptions (OAuth)
with the PII filter and potentially even with routing (although this is limited
by the cloud providers' ToS).
Routing classifies requests using a model such as Arch-Router, which labels a
request: each request is scored against the capabilities it may require, and a
model is picked that has all of the capabilities whose scores sit towards the
top of the distribution.
The ability to score multiple choices is an interesting feature in its own
right: it lets you very quickly check with what probability an LLM would produce
a particular answer.