[Initiative]: Cloud-Native Foundations for Distributed Agentic Systems #1746

@caldeirav

Name

Cloud-Native Foundations for Distributed Agentic Systems

Short description

Formalise principles, reference patterns and ecosystem strategy for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

Responsible group

TOC

Does the initiative belong to a subproject?

Yes

Subproject name

TOC Artificial Intelligence Initiatives

Primary contact

Vincent Caldeira (vincent.caldeira@gmail.com)

Additional contacts

Ricardo Aravena (raravena80@gmail.com)

Initiative description

This initiative establishes a CNCF AI WG sub-stream that formalises principles and reference patterns, and identifies ecosystem gaps, for running massively distributed systems of collaborating AI agents on Kubernetes and other cloud-native substrates.

Scope Definition

The group will focus on architectural guidance and on identifying where standards are needed, not on defining a new runtime model or deep-diving into specific frameworks and implementations.

  • Protocol interoperability: Can the community converge on Model Context Protocol (MCP) as the default agent-tool and agent-agent wire spec, and how should it be integrated into cloud-native systems? What auth, discovery and streaming extensions are required for cluster and multi-cluster use?
  • Agentic Gateway / Data Plane: Traditional REST-centric proxies can’t handle MCP & A2A session fan-out, bidirectional SSE, protocol negotiation or per-agent tenancy. How do we specify a gateway pattern that is session-aware, JSON-RPC–aware, secure and resource-efficient? What minimum behaviours (multiplexing, retries, streaming, auth, tracing) are required for conformance?
  • Runtime abstraction: How should an Agent be modelled in cloud-native terms (Pod? CRD? Side-carless process)? What lifecycle hooks and retry semantics are necessary for autonomous, long-running tasks?
  • State & memory: Which back-ends (object store, vector DB, Redis) and API shapes are suitable for short-term and long-term agent memory? How do we ensure consistency and garbage-collection across thousands of agents?
  • Fault Tolerance: Patterns for handling fault-tolerance in agentic systems based not solely on execution but also on output quality.
  • Observability & Policy Management: Define OpenTelemetry spans and policy CRDs (Kyverno/Gatekeeper) so SREs can trace, limit and audit autonomous behaviour.
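
As an illustration of the fault-tolerance question above, here is a minimal sketch of a retry loop that treats a low-quality output as a failure alongside execution errors. Everything in it is an assumption made for illustration: the names `run_with_quality_retry`, `AttemptResult`, `flaky_agent` and `length_score`, the 0.8 threshold, and the idea of scoring output with a simple function are hypothetical and not part of any proposed spec.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AttemptResult:
    output: Optional[str]  # best output seen; None if every call raised
    score: float           # quality score of that output, in [0, 1]
    attempts: int          # number of attempts actually made


def run_with_quality_retry(
    task: Callable[[int], str],
    score_output: Callable[[str], float],
    min_score: float = 0.8,   # hypothetical quality threshold
    max_attempts: int = 3,
) -> AttemptResult:
    """Retry on execution errors and on outputs scoring below min_score."""
    best_output: Optional[str] = None
    best_score = 0.0
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        try:
            output = task(attempt)
        except Exception:
            # Execution failure: the classic retry path.
            continue
        score = score_output(output)
        if best_output is None or score > best_score:
            best_output, best_score = output, score
        if score >= min_score:
            # Quality gate passed: stop retrying.
            break
    return AttemptResult(best_output, best_score, attempts)


# Hypothetical agent: times out once, answers poorly once, then answers well.
def flaky_agent(attempt: int) -> str:
    if attempt == 1:
        raise TimeoutError("tool call timed out")
    return "ok" if attempt == 2 else "a complete, well-formed answer"


# Hypothetical scorer: longer answers score higher, capped at 1.0.
def length_score(text: str) -> float:
    return min(len(text) / 20.0, 1.0)


result = run_with_quality_retry(flaky_agent, length_score)
print(result.attempts, result.score)  # 3 1.0
```

In a real system the scorer would more plausibly be an evaluator model or a schema check, and the loop would need back-off and a budget; the sketch only shows that execution failure and quality failure can share one retry policy.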

Why it matters to CNCF

  • Next wave of workloads: Agentic AI shifts compute from monolithic LLM calls to dynamic swarms of small, interactive tasks. This stresses the very areas (scalability, resilience, observability) where Kubernetes and cloud-native projects excel.
  • Avoid one-off silos: Vendors are already shipping proprietary agent platforms. A neutral CNCF framework guiding and normalising these different approaches can prevent fragmentation and foster portability, just as OCI normalised container images.
  • Fills an enterprise adoption gap: A purpose-built approach for agentic workloads at the gateway layer may be required because auth, tenancy and traffic shaping are missing. Providing a CNCF-blessed spec/blueprint could support standard ways of addressing this through cloud-native traffic management.
  • Leverages Envoy heritage while staying protocol-neutral: A common spec lets Envoy-style extensions, Rust-based proxies, or service-mesh datapaths compete while preserving interoperability.
  • Attracts new contributors: Identifying gaps (e.g., agent memory APIs, MCP-K8s discovery) invites fresh projects to join the landscape and advances CNCF’s leadership in AI infrastructure.

Key technologies & projects involved

  • Communication Protocols: Model Context Protocol (MCP), gRPC, CloudEvents
  • Agent-to-Agent Gateway: Agent Gateway, Envoy-MCP filter POC, Gloo AI Gateway, A2A protocol
  • Runtime coordination: Dapr Agents, Kagent
  • Scheduling / scaling: Kubernetes Scheduler, KEDA, Kueue, Dynamic Resource Allocation (DRA)
  • State / memory: Dapr state components, vector-DB operators (Chroma, Milvus), S3/GCS
  • Eventing & workflows: Knative Eventing, Argo Workflows, Temporal
  • Observability & Policy Management: OpenTelemetry, Kyverno/Gatekeeper, SPIFFE/SPIRE

Deliverable(s) or exit criteria

  1. Publish “Foundations for Distributed AI Agents” whitepaper (≤ 12 pp): Describes protocol, runtime, state, scheduling and safety patterns; maps research challenges.
  2. Produce reference architecture & pattern catalogue.
  3. Standards & API proposals including draft enhancement for “MCP-for-Clusters” (auth, discovery, streaming) and high-level sketch of "Agent CRD" schema and lifecycle states for WG App-Delivery & SIG-Apps review.
  4. Gap analysis & incubation map identifying where new projects (e.g., AgentMemory API, AgentBench-CN) or SIG plugins are needed.
  5. Cross-WG alignment providing formal liaisons with WG Serving (routing/benchmarks), Device-Management (GPU partitioning for agents), TAG Security (tool-scope policy) and SIG Autoscaling (agent-aware HPA) around agentic topics.
  6. Investigate an approach for a conformance/observability spec defining a minimal OpenTelemetry schema for agent spans and cost/energy labels.
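
To make the "Agent CRD" lifecycle states in deliverable 3 more concrete, the sketch below models one possible set of phases and the transitions between them. The phase names (`Pending`, `Running`, `Waiting`, `Succeeded`, `Failed`) and the transition table are purely hypothetical, loosely inspired by Pod phases; the initiative has not defined any such schema.

```python
from enum import Enum


class AgentPhase(Enum):
    # Hypothetical phases; not taken from any existing CRD.
    PENDING = "Pending"
    RUNNING = "Running"
    WAITING = "Waiting"      # blocked on a tool call or another agent
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Allowed transitions, keyed by current phase (an assumption for illustration).
ALLOWED = {
    AgentPhase.PENDING: {AgentPhase.RUNNING, AgentPhase.FAILED},
    AgentPhase.RUNNING: {AgentPhase.WAITING, AgentPhase.SUCCEEDED, AgentPhase.FAILED},
    AgentPhase.WAITING: {AgentPhase.RUNNING, AgentPhase.FAILED},
    AgentPhase.SUCCEEDED: set(),              # terminal
    AgentPhase.FAILED: {AgentPhase.PENDING},  # a retry re-queues the agent
}


def transition(current: AgentPhase, target: AgentPhase) -> AgentPhase:
    """Return the target phase, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


# Walk a long-running agent through one tool call to completion.
phase = AgentPhase.PENDING
for nxt in (AgentPhase.RUNNING, AgentPhase.WAITING,
            AgentPhase.RUNNING, AgentPhase.SUCCEEDED):
    phase = transition(phase, nxt)
print(phase.value)  # Succeeded
```

An explicit transition table like this is one way a controller could reject invalid status updates; whether retries should re-enter `Pending` or a dedicated phase is exactly the kind of question the schema review would need to settle.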

Metadata

Labels

kind/initiative: An initiative or an item related to initiative processes
tag/workloads-foundation: TAG Workloads Foundation
toc/initiative/AI: TOC Artificial Intelligence Initiative

Projects

Status

New

Status

status/accepted
