[FEATURE] First-class Envoy AI Gateway integration: make the gateway fleet-aware #661

## Feature Description

LLMKube should generate and reconcile Envoy AI Gateway (Gateway API) resources from InferenceService and ModelRouter state, so the gateway becomes fleet-aware: routes appear and disappear with InferenceServices, backend health flows from the operator into the data plane, and policy is declared once in LLMKube CRDs instead of hand-written across the Gateway API resource set.

The gateway ecosystem is excellent at moving bytes (auth, token metering, budgets, audit, failover) but knows nothing about models as workloads: it cannot see that a backend is still downloading a model, is GPU-queued, or that one logical model is served across heterogeneous tiers (in-cluster CUDA pods plus off-cluster Metal hosts). LLMKube owns exactly that state. The integration is the bridge.

## Problem Statement

> As a platform operator running a mixed LLMKube fleet (CUDA InferenceServices plus metal-agent Macs), I want the operator to program an Envoy AI Gateway for me so that every model gets a policy-controlled, observable endpoint without hand-maintaining gateway YAML.

Fronting a homelab fleet with Envoy AI Gateway works well today. Verified end to end: Keycloak JWT auth, per-team model allowlists, exact streamed token metering, token-budget 429s, audit access logs to Loki, two-tier failover, and ~3ms added TTFT. But each deployment requires hand-writing roughly eight resource kinds:

- Gateway
- EnvoyProxy
- AIGatewayRoute
- Backend + AIServiceBackend pairs
- SecurityPolicy
- BackendTrafficPolicy
- InferencePool + endpoint-picker Deployment

The configuration also has real footguns, for example:

- Retry/fallback and rate limiting must live in ONE BackendTrafficPolicy, or the newer policy silently no-ops.
- InferencePool refs are namespace-locked to their pods.
- The oldest route's catch-all rule gates auth for the whole listener.
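To make the single-policy constraint concrete, here is a rough sketch of one BackendTrafficPolicy carrying both retry and token-based rate limiting. The resource names, the `x-team-id` header, the budget value, and the retry triggers are illustrative assumptions, not taken from the validated setup; field names follow the Envoy Gateway `v1alpha1` API as I understand it and should be checked against the installed CRD versions.

```yaml
# Sketch only: names, header, and limits are hypothetical.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: fleet-traffic          # hypothetical name
  namespace: ai-gateway
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway
  # Retry/fallback lives in the SAME policy as rate limiting.
  retry:
    numRetries: 2              # assumed value
    perRetry:
      timeout: 60s             # the per-attempt timeout mentioned in this issue
    retryOn:
      triggers:
        - connect-failure
        - retriable-status-codes
      httpStatusCodes:
        - 503
  # Token budget: charge each response by its reported token usage.
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-team-id   # hypothetical per-team header
                  type: Distinct
          limit:
            requests: 100000        # assumed token budget per hour
            unit: Hour
          cost:
            request:
              from: Number
              number: 0             # requests cost nothing up front
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

A compiler that always emits exactly one such policy per target would make the "second policy silently no-ops" case unreachable by construction.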

Worse, the gateway is blind to backend lifecycle. Measured: an abruptly killed backend behind a ClusterIP-resolved Backend stalls in-flight requests for the full 60s per-attempt timeout with no failover, while pod-backed InferencePool endpoints fail over new requests in 2-4ms. Off-cluster Metal endpoints have no pool equivalent (pools are pods-only), so half the fleet gets the slow path.

## Proposed Solution

Three pieces, all driven by existing LLMKube state:

1. InferenceService opt-in gateway exposure: emit the Backend/AIServiceBackend pair (and optionally InferencePool + endpoint picker for pod-backed runtimes) plus the model-matched AIGatewayRoute rule, lifecycle-bound to the InferenceService.
2. ModelRouter gains a gateway data-plane mode: compile its rules, budgets, and audit policy into AIGatewayRoute/SecurityPolicy/BackendTrafficPolicy instead of running router-proxy, with the known footguns made structurally impossible by the compiler.
3. Backend health bridging: operator/metal-agent health state ejects and restores gateway backends (the endpoint-picker story, extended to endpoints that are not pods). Tracked separately: #662.

**Example YAML:**
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: fleet-router
spec:
  dataPlane: Gateway   # NEW: compile to Gateway API resources instead of router-proxy
  gatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  # existing rules/budgets/audit stanzas unchanged
```
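For piece 1, the resources the operator would emit for a single off-cluster Metal endpoint might look roughly like the following. This is a sketch under assumptions: the resource names, hostname, and model name are hypothetical, port 8080 is the llama.cpp server default, and the API versions and field names reflect my understanding of the Envoy Gateway and Envoy AI Gateway CRDs, so verify them against the versions you run.

```yaml
# Sketch only: names, hostname, and model are hypothetical.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: llama-metal
  namespace: ai-gateway
spec:
  endpoints:
    - fqdn:
        # Selectorless Service fronting a metal-agent Mac (assumed DNS name)
        hostname: llama-metal.llmkube.svc.cluster.local
        port: 8080
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: llama-metal
  namespace: ai-gateway
spec:
  schema:
    name: OpenAI               # llama.cpp serves an OpenAI-compatible API
  backendRef:
    name: llama-metal
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: fleet-route
  namespace: ai-gateway
spec:
  parentRefs:
    - name: ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # Model-matched rule: one per exposed InferenceService
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: llama-3-8b   # hypothetical model name
      backendRefs:
        - name: llama-metal
```

One design point to settle: Kubernetes owner references cannot cross namespaces, so if these resources live in the gateway namespace while the InferenceService lives elsewhere, lifecycle binding would need a finalizer or label-based cleanup rather than `ownerReferences`.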

## Alternatives Considered

- Keep router-proxy as the only data plane: duplicates what the Envoy ecosystem now does better (HA, ecosystem, sub-3-percent overhead with a full policy stack).
- KServe: brings a different serving stack rather than integrating LLMKube's CRDs, and its gateway story is pod-centric.
- Hand-maintained gateway YAML (today's approach): works for a couple of models, does not scale, and cannot react to backend lifecycle at runtime.

## Additional Context

- Related issues: #662 (metal-agent endpoint health)
- Similar features in other projects: KServe LLMInferenceService + Gateway API Inference Extension (in-cluster pods only)
- Workarounds you're currently using: hand-written manifest set, validated on Envoy Gateway v1.8.1 + Envoy AI Gateway v0.7.0 + GIE v1.0.x with llama.cpp backends, including metal-agent selectorless-Service Macs via fqdn Backends. Upstream constraints to design around: InferencePool refs cannot mix with AIServiceBackend refs in one rule (no cross-tier fallback under one model name) and pool refs are namespace-locked.

## Priority

- [x] High - Would significantly improve my workflow

## Willingness to Contribute

- [x] Yes, I can submit a PR

