LLMKube should generate and reconcile Envoy AI Gateway (Gateway API) resources from InferenceService and ModelRouter state, so the gateway becomes fleet-aware: routes appear and disappear with InferenceServices, backend health flows from the operator into the data plane, and policy is declared once in LLMKube CRDs instead of hand-written across the Gateway API resource set.
The gateway ecosystem is excellent at moving bytes (auth, token metering, budgets, audit, failover) but knows nothing about models as workloads: it cannot see that a backend is still downloading a model, is GPU-queued, or that one logical model is served across heterogeneous tiers (in-cluster CUDA pods plus off-cluster Metal hosts). LLMKube owns exactly that state. The integration is the bridge.
Problem Statement
As a platform operator running a mixed LLMKube fleet (CUDA InferenceServices plus metal-agent Macs), I want the operator to program an Envoy AI Gateway for me so that every model gets a policy-controlled, observable endpoint without hand-maintaining gateway YAML.
Fronting a homelab fleet with Envoy AI Gateway works well today. Verified end to end: Keycloak JWT auth, per-team model allowlists, exact streamed token metering, token-budget 429s, audit access logs to Loki, two-tier failover, and ~3ms added TTFT. But it requires hand-writing roughly eight resource kinds per deployment (Gateway, EnvoyProxy, AIGatewayRoute, Backend + AIServiceBackend pairs, SecurityPolicy, BackendTrafficPolicy, InferencePool + endpoint-picker Deployment), and the configuration has real footguns. For example: retry/fallback and rate limiting must live in ONE BackendTrafficPolicy or the newer policy silently no-ops; InferencePool refs are namespace-locked to their pods; and the oldest route's catch-all rule gates auth for the whole listener.
Worse, the gateway is blind to backend lifecycle. Measured: an abruptly killed backend behind a ClusterIP-resolved Backend stalls in-flight requests for the full 60s per-attempt timeout with no failover, while pod-backed InferencePool endpoints fail over new requests in 2-4ms. Off-cluster Metal endpoints have no pool equivalent (pools are pods-only), so half the fleet gets the slow path.
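To make the gap concrete, here is a minimal sketch of the view the gateway lacks and the operator already has. The phase names (`Downloading`, `GpuQueued`, `Ready`), the tier labels, and the `BackendState` type are hypothetical illustrations, not LLMKube's actual status fields.

```python
from dataclasses import dataclass

# Hypothetical lifecycle phases; the real LLMKube status fields may differ.
ROUTABLE_PHASES = {"Ready"}


@dataclass(frozen=True)
class BackendState:
    name: str
    tier: str   # illustrative label, e.g. "cuda-pod" or "metal-host"
    phase: str  # illustrative phase, e.g. "Downloading", "GpuQueued", "Ready"


def routable_backends(states):
    """Return the names of backends that are safe to route traffic to."""
    return [s.name for s in states if s.phase in ROUTABLE_PHASES]


fleet = [
    BackendState("llama-cuda-0", "cuda-pod", "Ready"),
    BackendState("llama-cuda-1", "cuda-pod", "Downloading"),
    BackendState("llama-mac-0", "metal-host", "GpuQueued"),
    BackendState("llama-mac-1", "metal-host", "Ready"),
]

print(routable_backends(fleet))
```

The filter treats pod-backed and off-cluster backends identically, which is the property the gateway cannot provide on its own today.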
Proposed Solution
Three pieces, all driven by existing LLMKube state:
InferenceService opt-in gateway exposure: emit the Backend/AIServiceBackend pair (and optionally InferencePool + endpoint picker for pod-backed runtimes) plus the model-matched AIGatewayRoute rule, lifecycle-bound to the InferenceService.
ModelRouter gains a gateway data-plane mode: compile its rules, budgets, and audit policy into AIGatewayRoute/SecurityPolicy/BackendTrafficPolicy instead of running router-proxy, with the known footguns made structurally impossible by the compiler.
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: fleet-router
spec:
  dataPlane: Gateway # NEW: compile to Gateway API resources instead of router-proxy
  gatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  # existing rules/budgets/audit stanzas unchanged
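One way the compiler could make the "ONE BackendTrafficPolicy" footgun structurally impossible is to merge every policy fragment by target before emitting anything. The sketch below is an assumption about how that might look, not the proposed implementation: the fragment tuple shape, the section keys, the settings fields, and the simplified output dict are all hypothetical and do not mirror the real Envoy Gateway schema.

```python
from collections import defaultdict


def compile_traffic_policies(fragments):
    """Merge policy fragments into exactly one policy per target.

    Each fragment is (target, section, settings). Section keys such as
    "retry" and "rateLimit" are illustrative, not the upstream schema.
    """
    merged = defaultdict(dict)
    for target, section, settings in fragments:
        if section in merged[target]:
            # Two fragments for the same section would shadow each other.
            raise ValueError(f"duplicate {section!r} for target {target!r}")
        merged[target][section] = settings
    # Simplified output shape: one dict per target, never more.
    return [
        {"kind": "BackendTrafficPolicy", "target": target, "spec": spec}
        for target, spec in sorted(merged.items())
    ]


fragments = [
    ("fleet-router", "retry", {"attempts": 2}),              # hypothetical field
    ("fleet-router", "rateLimit", {"tokensPerHour": 100000}),  # hypothetical field
]
policies = compile_traffic_policies(fragments)
print(policies)
```

Because retry and rate limiting for a target always land in the same emitted object, a second policy that silently no-ops cannot be produced.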
Alternatives Considered
Keep router-proxy as the only data plane: duplicates what the Envoy ecosystem now does better (HA, ecosystem, sub-3-percent overhead with a full policy stack).
KServe: brings a different serving stack rather than integrating LLMKube's CRDs, and its gateway story is pod-centric.
Hand-maintained gateway YAML (today's approach): works for a couple of models, does not scale, and cannot react to backend lifecycle at runtime.
Similar features in other projects: KServe LLMInferenceService + Gateway API Inference Extension (in-cluster pods only)
Workarounds you're currently using: hand-written manifest set, validated on Envoy Gateway v1.8.1 + Envoy AI Gateway v0.7.0 + GIE v1.0.x with llama.cpp backends, including metal-agent selectorless-Service Macs via fqdn Backends. Upstream constraints to design around: InferencePool refs cannot mix with AIServiceBackend refs in one rule (no cross-tier fallback under one model name) and pool refs are namespace-locked.
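The upstream constraints above could be checked before any resource is emitted. This is a rough sketch under stated assumptions: the ref dict shape and the `validate_rule_refs` helper are hypothetical, and the namespace check assumes "namespace-locked" means a pool ref must share the route's namespace, which is only one reading of the constraint.

```python
def validate_rule_refs(route_namespace, backend_refs):
    """Return a list of human-readable errors for one route rule."""
    errors = []
    kinds = {ref["kind"] for ref in backend_refs}
    # Constraint: pool refs and AIServiceBackend refs cannot share a rule.
    if {"InferencePool", "AIServiceBackend"} <= kinds:
        errors.append("rule mixes InferencePool and AIServiceBackend refs")
    # Assumed reading of "namespace-locked": no cross-namespace pool refs.
    for ref in backend_refs:
        ns = ref.get("namespace", route_namespace)
        if ref["kind"] == "InferencePool" and ns != route_namespace:
            errors.append(f"InferencePool {ref['name']!r} is in namespace {ns!r}")
    return errors


mixed = [
    {"kind": "InferencePool", "name": "llama-pool"},
    {"kind": "AIServiceBackend", "name": "llama-mac"},
]
print(validate_rule_refs("ai-gateway", mixed))
```

A compiler that hits the first error would have to split the tiers into separate rules, which is why cross-tier fallback under one model name needs design work rather than a straight translation.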