[Initiative]: Cloud-Native Foundations for Distributed Agentic Systems #1746

### Name

Cloud-Native Foundations for Distributed Agentic Systems

### Short description

Formalise principles, reference patterns and ecosystem strategy for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

### Responsible group

TOC

### Does the initiative belong to a subproject?

Yes

### Subproject name

TOC Artificial Intelligence Initiatives

### Primary contact

Vincent Caldeira (vincent.caldeira@gmail.com)

### Additional contacts

Ricardo Aravena (raravena80@gmail.com)

### Initiative description

The purpose of this initiative is to establish a CNCF AI WG sub-stream that formalises principles and reference patterns, and identifies ecosystem gaps, for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

**Scope Definition**

The group will focus on architectural guidance and on identifying where standards are needed, not on defining a new runtime model or deep-diving into frameworks and implementations.

- **Protocol interoperability:** Can the community converge on Model Context Protocol (MCP) as the default agent-tool and agent-agent wire spec, and how should it be integrated into cloud-native systems? What auth, discovery and streaming extensions are required for cluster and multi-cluster use?
- **Agentic Gateway / Data Plane:** Traditional REST-centric proxies can’t handle MCP & A2A session fan-out, bidirectional SSE, protocol negotiation or per-agent tenancy. How do we specify a gateway pattern that is session-aware, JSON-RPC–aware, secure and resource-efficient? What minimum behaviours (multiplexing, retries, streaming, auth, tracing) are required for conformance?
- **Runtime abstraction:** How should an Agent be modelled in cloud-native terms (Pod? CRD? Sidecar-less process)? What lifecycle hooks and retry semantics are necessary for autonomous, long-running tasks?
- **State & memory:** Which back-ends (object store, vector DB, Redis) and API shapes are suitable for short-term and long-term agent memory? How do we ensure consistency and garbage-collection across thousands of agents?
- **Fault Tolerance:** Patterns for fault tolerance in agentic systems that treat failure as a matter of output quality, not solely of execution.
- **Observability & Policy Management:** Define OpenTelemetry spans and policy CRDs (Kyverno/Gatekeeper) so SREs can trace, limit and audit autonomous behaviour.
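
To make the gateway question above more concrete, here is a minimal sketch of what "JSON-RPC-aware" and "session-aware" could mean in practice. It assumes JSON-RPC 2.0 framing (which MCP uses) and a session identifier that the gateway has already extracted from the transport. `classify`, `SessionRouter` and the backend names are hypothetical illustrations, not part of any existing gateway or of this proposal.

```python
import json
from dataclasses import dataclass, field


def classify(raw: str) -> str:
    """Classify a JSON-RPC 2.0 message as request, notification or response."""
    msg = json.loads(raw)
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "method" in msg:
        # A request carries an id; a notification expects no reply.
        return "request" if "id" in msg else "notification"
    if "result" in msg or "error" in msg:
        return "response"
    raise ValueError("unrecognised JSON-RPC message")


@dataclass
class SessionRouter:
    """Hypothetical sticky router: pins each session to a single backend."""

    backends: list
    _pinned: dict = field(default_factory=dict)

    def route(self, session_id: str) -> str:
        if session_id not in self._pinned:
            # Pick the backend with the fewest pinned sessions.
            # Ties go to the backend listed first.
            load = {b: 0 for b in self.backends}
            for backend in self._pinned.values():
                load[backend] += 1
            self._pinned[session_id] = min(self.backends, key=lambda b: load[b])
        return self._pinned[session_id]


router = SessionRouter(["agent-a", "agent-b"])
kind = classify('{"jsonrpc": "2.0", "id": 1, "method": "tools/call"}')
target = router.route("session-1")
```

A real gateway would also need to cover the other behaviours listed above (multiplexing, retries, streaming, auth, tracing); this sketch only shows why a proxy that treats traffic as opaque HTTP cannot make these decisions.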

**Why it matters to CNCF**

- **Next wave of workloads:** Agentic AI shifts compute from monolithic LLM calls to dynamic swarms of small, interactive tasks. This stresses the very areas (scalability, resilience, observability) where Kubernetes and cloud-native projects excel.
- **Avoid one-off silos:** Vendors are already shipping proprietary agent platforms. A neutral CNCF framework guiding and normalising these different approaches can prevent fragmentation and foster portability, just as OCI normalised container images.
- **Fills an enterprise adoption gap:** Agentic workloads may need a purpose-built approach at the gateway layer because auth, tenancy and traffic shaping are missing today. A CNCF-blessed spec or blueprint could offer standard ways of addressing this through cloud-native traffic management.
- **Leverages Envoy heritage while staying protocol-neutral:** A common spec lets Envoy-style extensions, Rust-based proxies and service-mesh datapaths compete while preserving interoperability.
- **Attracts new contributors:** Identifying gaps (e.g., agent memory APIs, MCP-K8s discovery) invites fresh projects to join the landscape and advances CNCF’s leadership in AI infrastructure.

**Key technologies & projects involved**

- Communication Protocols: Model Context Protocol (MCP), gRPC, CloudEvents
- Agent-to-Agent Gateway: Agent Gateway, Envoy-MCP filter POC, Gloo AI Gateway, A2A protocol
- Runtime coordination: Dapr Agents, Kagent
- Scheduling / scaling: Kubernetes Scheduler, KEDA, Kueue, Dynamic Resource Allocation (DRA)
- State / memory: Dapr state components, vector-DB operators (Chroma, Milvus), S3/GCS
- Eventing & workflows: Knative Eventing, Argo Workflows, Temporal
- Observability & Policy Management: OpenTelemetry, Kyverno/Gatekeeper, SPIFFE/SPIRE

### Deliverable(s) or exit criteria

1. Publish “Foundations for Distributed AI Agents” whitepaper (≤ 12 pp): Describes protocol, runtime, state, scheduling and safety patterns; maps research challenges.
2. Produce a reference architecture and pattern catalogue.
3. Standards & API proposals, including a draft enhancement for “MCP-for-Clusters” (auth, discovery, streaming) and a high-level sketch of an “Agent CRD” schema and lifecycle states for WG App-Delivery & SIG-Apps review.
4. Gap analysis & incubation map identifying where new projects (e.g., AgentMemory API, AgentBench-CN) or SIG plugins are needed.
5. Cross-WG alignment, providing formal liaisons on agentic topics with WG Serving (routing/benchmarks), Device-Management (GPU partitioning for agents), TAG Security (tool-scope policy) and SIG Autoscaling (agent-aware HPA).
6. Explore an approach for a conformance/observability spec defining a minimal OpenTelemetry schema for agent spans and cost/energy labels.
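
As a starting point for discussing the “Agent CRD” lifecycle states in deliverable 3, the sketch below models one possible set of phases and the transitions between them. The phase names (`Pending`, `Running`, `Waiting`, `Succeeded`, `Failed`) and the transition table are assumptions made for illustration only; the initiative has not defined them.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical lifecycle phases for an agent resource."""

    PENDING = "Pending"
    RUNNING = "Running"
    WAITING = "Waiting"  # e.g. blocked on a tool call or an approval
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions (assumed for illustration, not a proposed standard).
ALLOWED = {
    Phase.PENDING: {Phase.RUNNING, Phase.FAILED},
    Phase.RUNNING: {Phase.WAITING, Phase.SUCCEEDED, Phase.FAILED},
    Phase.WAITING: {Phase.RUNNING, Phase.FAILED},
    Phase.SUCCEEDED: set(),          # terminal
    Phase.FAILED: {Phase.PENDING},   # a retry re-queues the task
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return the new phase, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


phase = transition(Phase.PENDING, Phase.RUNNING)
phase = transition(phase, Phase.WAITING)
```

Making the transition table explicit is one way to let reviewers debate retry semantics and terminal states before any schema is drafted.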
