Description
Name
Cloud-Native Foundations for Distributed Agentic Systems
Short description
Formalise principles, reference patterns and ecosystem strategy for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.
Responsible group
TOC
Does the initiative belong to a subproject?
Yes
Subproject name
TOC Artificial Intelligence Initiatives
Primary contact
Vincent Caldeira (vincent.caldeira@gmail.com)
Additional contacts
Ricardo Aravena (raravena80@gmail.com)
Initiative description
The purpose of this initiative is to have a CNCF AI WG sub-stream that formalises principles, reference patterns and ecosystem gaps for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.
Scope Definition
The group will focus on architectural guidance and on identifying where standards are needed, not on defining a new runtime model or deep-diving into specific frameworks and implementations.
- Protocol interoperability: Can the community converge on Model Context Protocol (MCP) as the default agent-tool and agent-agent wire spec, and how should it be integrated into cloud-native systems? What auth, discovery and streaming extensions are required for cluster and multi-cluster use?
- Agentic Gateway / Data Plane: Traditional REST-centric proxies can’t handle MCP & A2A session fan-out, bidirectional SSE, protocol negotiation or per-agent tenancy. How do we specify a gateway pattern that is session-aware, JSON-RPC–aware, secure and resource-efficient? What minimum behaviours (multiplexing, retries, streaming, auth, tracing) are required for conformance?
- Runtime abstraction: How should an Agent be modelled in cloud-native terms (Pod, CRD, or sidecar-less process)? What lifecycle hooks and retry semantics are necessary for autonomous, long-running tasks?
- State & memory: Which back-ends (object store, vector DB, Redis) and API shapes are suitable for short-term and long-term agent memory? How do we ensure consistency and garbage-collection across thousands of agents?
- Fault Tolerance: Patterns for handling fault-tolerance in agentic systems based not solely on execution but also on output quality.
- Observability & Policy Management: Define OpenTelemetry spans and policy CRDs (Kyverno/Gatekeeper) so SREs can trace, limit and audit autonomous behaviour.
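As an illustration of the session-aware, JSON-RPC-aware gateway question above, the following is a minimal sketch only, not a proposed design. It assumes the gateway already has a session identifier for each message (however the transport conveys it) and uses it for sticky routing. The backend names and the `parse_jsonrpc`, `pick_backend` and `route` helpers are hypothetical; the only protocol facts relied on are that JSON-RPC 2.0 messages carry `"jsonrpc": "2.0"` and that notifications omit `id`.

```python
import hashlib
import json

# Hypothetical upstream agent pools; a real gateway would discover these.
BACKENDS = ["agent-pool-0", "agent-pool-1", "agent-pool-2"]


def parse_jsonrpc(body: str) -> dict:
    """Parse and minimally validate a JSON-RPC 2.0 request or notification."""
    msg = json.loads(body)
    if not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if not isinstance(msg.get("method"), str):
        raise ValueError("missing or invalid 'method'")
    return msg


def pick_backend(session_id: str, backends: list = BACKENDS) -> str:
    """Map a session ID to a backend deterministically (sticky routing)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def route(session_id: str, body: str) -> dict:
    """Return a routing decision for one message in a session."""
    msg = parse_jsonrpc(body)
    return {
        "backend": pick_backend(session_id),
        "method": msg["method"],
        # JSON-RPC notifications carry no "id", so no response is expected.
        "expects_response": "id" in msg,
    }
```

Simple modulo hashing reshuffles sessions whenever the backend list changes; a conformance discussion would likely need to consider consistent hashing or an explicit session table instead.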
Why it matters to CNCF
- Next wave of workloads: Agentic AI shifts compute from monolithic LLM calls to dynamic swarms of small, interactive tasks. This stresses the very areas (scalability, resilience, observability) where Kubernetes and cloud-native projects excel.
- Avoid one-off silos: Vendors are already shipping proprietary agent platforms. A neutral CNCF framework guiding and normalising these different approaches can prevent fragmentation and foster portability, just as OCI normalised container images.
- Fills an enterprise adoption gap: A purpose-built approach for agentic workloads at the gateway layer may be required because auth, tenancy and traffic shaping are missing. Providing a CNCF-blessed spec/blueprint could support standard ways of addressing this through cloud-native traffic management.
- Leverages Envoy heritage while staying protocol-neutral: A common spec lets Envoy-style extensions, Rust-based proxies, or service-mesh datapaths compete while preserving interoperability.
- Attracts new contributors: Identifying gaps (e.g., agent memory APIs, MCP-K8s discovery) invites fresh projects to join the landscape and advances CNCF’s leadership in AI infrastructure.
Key technologies & projects involved
- Communication Protocols: Model Context Protocol (MCP), gRPC, CloudEvents
- Agent-to-Agent Gateway: Agent Gateway, Envoy-MCP filter POC, Gloo AI Gateway, A2A protocol
- Runtime coordination: Dapr Agents, Kagent
- Scheduling / scaling: Kubernetes Scheduler, KEDA, Kueue, Dynamic Resource Allocation (DRA)
- State / memory: Dapr state components, vector-DB operators (Chroma, Milvus), S3/GCS
- Eventing & workflows: Knative Eventing, Argo Workflows, Temporal
- Observability & Policy Management: OpenTelemetry, Kyverno/Gatekeeper, SPIFFE/SPIRE
Deliverable(s) or exit criteria
- Publish “Foundations for Distributed AI Agents” whitepaper (≤ 12 pp): Describes protocol, runtime, state, scheduling and safety patterns; maps research challenges.
- Produce reference architecture & pattern catalogue
- Standards & API proposals, including a draft enhancement for “MCP-for-Clusters” (auth, discovery, streaming) and a high-level sketch of an "Agent CRD" schema and lifecycle states for WG App-Delivery & SIG-Apps review.
- Gap analysis & incubation map identifying where new projects (e.g., AgentMemory API, AgentBench-CN) or SIG plugins are needed.
- Cross-WG alignment providing formal liaisons with WG Serving (routing/benchmarks), Device-Management (GPU partitioning for agents), TAG Security (tool-scope policy) and SIG Autoscaling (agent-aware HPA) on agentic topics.
- Investigate an approach for a conformance/observability spec defining a minimal OpenTelemetry schema for agent spans and cost/energy labels.
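To make the "Agent CRD" lifecycle-states deliverable more concrete, here is a purely illustrative sketch of what such a state machine might look like. Every name in it (`Phase`, `AgentStatus`, the phase values, `max_retries`, `min_quality`) is an assumption made for this example and not part of any existing or proposed schema. It also shows one possible reading of quality-based fault tolerance: a task that runs to completion but scores below a threshold is treated like a failure and retried.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    # Hypothetical lifecycle phases for an agent task.
    PENDING = "Pending"
    RUNNING = "Running"
    RETRYING = "Retrying"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions; SUCCEEDED and FAILED are terminal.
ALLOWED = {
    Phase.PENDING: {Phase.RUNNING, Phase.FAILED},
    Phase.RUNNING: {Phase.SUCCEEDED, Phase.RETRYING, Phase.FAILED},
    Phase.RETRYING: {Phase.RUNNING, Phase.FAILED},
    Phase.SUCCEEDED: set(),
    Phase.FAILED: set(),
}


@dataclass
class AgentStatus:
    phase: Phase = Phase.PENDING
    attempts: int = 0
    max_retries: int = 3  # assumed retry budget

    def transition(self, target: Phase) -> None:
        """Move to `target`, rejecting transitions the table does not allow."""
        if target not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase.value} -> {target.value}")
        self.phase = target

    def on_task_error(self) -> None:
        """Retry while the budget remains, otherwise fail permanently."""
        if self.attempts < self.max_retries:
            self.attempts += 1
            self.transition(Phase.RETRYING)
        else:
            self.transition(Phase.FAILED)

    def on_result(self, quality: float, min_quality: float = 0.7) -> None:
        """Treat a low-quality output like an execution error."""
        if quality >= min_quality:
            self.transition(Phase.SUCCEEDED)
        else:
            self.on_task_error()
```

In a real CRD these phases would live in the resource's status and be driven by a controller; the sketch only captures the transition rules and retry budget that such a schema would need to define.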