Skip to content

[FEATURE] First-class Envoy AI Gateway integration: make the gateway fleet-aware #661

@Defilan

Description

@Defilan

Feature Description

LLMKube should generate and reconcile Envoy AI Gateway (Gateway API) resources from InferenceService and ModelRouter state, so the gateway becomes fleet-aware: routes appear and disappear with InferenceServices, backend health flows from the operator into the data plane, and policy is declared once in LLMKube CRDs instead of hand-written across the Gateway API resource set.

The gateway ecosystem is excellent at moving bytes (auth, token metering, budgets, audit, failover) but knows nothing about models as workloads: it cannot see that a backend is still downloading a model, is GPU-queued, or that one logical model is served across heterogeneous tiers (in-cluster CUDA pods plus off-cluster Metal hosts). LLMKube owns exactly that state. The integration is the bridge.

Problem Statement

As a platform operator running a mixed LLMKube fleet (CUDA InferenceServices plus metal-agent Macs), I want the operator to program an Envoy AI Gateway for me so that every model gets a policy-controlled, observable endpoint without hand-maintaining gateway YAML.

Fronting a homelab fleet with Envoy AI Gateway today works well (verified end to end: Keycloak JWT auth, per-team model allowlists, exact streamed token metering, token-budget 429s, audit access logs to Loki, two-tier failover, ~3ms added TTFT) but requires hand-writing roughly eight resource kinds per deployment (Gateway, EnvoyProxy, AIGatewayRoute, Backend + AIServiceBackend pairs, SecurityPolicy, BackendTrafficPolicy, InferencePool + endpoint-picker Deployment), and the configuration has real footguns, for example: retry/fallback and rate limiting must live in ONE BackendTrafficPolicy or the newer policy silently no-ops, InferencePool refs are namespace-locked to their pods, and the oldest route's catch-all rule gates auth for the whole listener.

Worse, the gateway is blind to backend lifecycle. Measured: an abruptly killed backend behind a ClusterIP-resolved Backend stalls in-flight requests for the full 60s per-attempt timeout with no failover, while pod-backed InferencePool endpoints fail over new requests in 2-4ms. Off-cluster Metal endpoints have no pool equivalent (pools are pods-only), so half the fleet gets the slow path.

Proposed Solution

Three pieces, all driven by existing LLMKube state:

  1. InferenceService opt-in gateway exposure: emit the Backend/AIServiceBackend pair (and optionally InferencePool + endpoint picker for pod-backed runtimes) plus the model-matched AIGatewayRoute rule, lifecycle-bound to the InferenceService.
  2. ModelRouter gains a gateway data-plane mode: compile its rules, budgets, and audit policy into AIGatewayRoute/SecurityPolicy/BackendTrafficPolicy instead of running router-proxy, with the known footguns made structurally impossible by the compiler.
  3. Backend health bridging: operator/metal-agent health state ejects and restores gateway backends (the endpoint-picker story, extended to endpoints that are not pods). Tracked separately: [FEATURE] metal-agent: gateway-aware endpoint health (eject and restore off-cluster backends) #662.

Example YAML (if applicable):

apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: fleet-router
spec:
  dataPlane: Gateway   # NEW: compile to Gateway API resources instead of router-proxy
  gatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  # existing rules/budgets/audit stanzas unchanged

Alternatives Considered

  • Keep router-proxy as the only data plane: duplicates what the Envoy ecosystem now does better (HA, ecosystem, sub-3-percent overhead with a full policy stack).
  • KServe: brings a different serving stack rather than integrating LLMKube's CRDs, and its gateway story is pod-centric.
  • Hand-maintained gateway YAML (today's approach): works for a couple of models, does not scale, and cannot react to backend lifecycle at runtime.

Additional Context

  • Related issues: [FEATURE] metal-agent: gateway-aware endpoint health (eject and restore off-cluster backends) #662 (metal-agent endpoint health)
  • Similar features in other projects: KServe LLMInferenceService + Gateway API Inference Extension (in-cluster pods only)
  • Workarounds you're currently using: hand-written manifest set, validated on Envoy Gateway v1.8.1 + Envoy AI Gateway v0.7.0 + GIE v1.0.x with llama.cpp backends, including metal-agent selectorless-Service Macs via fqdn Backends. Upstream constraints to design around: InferencePool refs cannot mix with AIServiceBackend refs in one rule (no cross-tier fallback under one model name) and pool refs are namespace-locked.

Priority

  • High - Would significantly improve my workflow

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions