Skip to content

silogen/cluster-forge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2,330 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Cluster-Forge

Important

Instructions for installing the AMD Enterprise AI reference stack (for most users) are here

A Kubernetes platform automation tool that deploys AMD Enterprise AI reference stack with complete GitOps infrastructure.

Overview

Cluster-Forge bundles third-party, community, and in-house components into a single, GitOps-managed stack deployable in Kubernetes clusters. It automates the deployment of a complete AI/ML compute platform with all essential services pre-configured and integrated.

Using a bootstrap-first deployment model, Cluster-Forge establishes GitOps infrastructure (ArgoCD, Gitea, OpenBao) before deploying the complete application stack via ArgoCD's app-of-apps pattern.

Ideal for:

  • AI/ML Engineers - Unified platform for model training, serving, and orchestration
  • Platform Engineers - Infrastructure automation with GitOps patterns
  • DevOps Teams - Consistent deployment across development, staging, and production
  • Research Teams - Ephemeral test clusters for experimentation

πŸš€ Quick Start

Single-Command Deployment

./scripts/bootstrap.sh <domain> [--cluster-size=small|medium|large]

Size-Aware Deployment Examples

# Small cluster (1-5 users, development/testing)
./scripts/bootstrap.sh dev.example.com --cluster-size=small

# Medium cluster (5-20 users, team production) [DEFAULT]
./scripts/bootstrap.sh team.example.com --cluster-size=medium

# Large cluster (10s-100s users, enterprise scale)
./scripts/bootstrap.sh prod.example.com --cluster-size=large

# Deploy only specific components
./scripts/bootstrap.sh dev.example.com --apps=argocd,gitea,cluster-forge

# Deploy from specific branch/tag
./scripts/bootstrap.sh prod.example.com --target-revision=v1.8.0

# Install everything except AIRM and its infra dependencies
./scripts/bootstrap.sh prod.example.com --disabled-apps=airm,airm-infra-*

# Combine --apps and --disabled-apps (disabled takes priority)
./scripts/bootstrap.sh dev.example.com --apps=airm,keycloak --disabled-apps=airm

For detailed deployment instructions, see the Bootstrap Guide.

πŸ“‹ Architecture

Bootstrap-First Deployment

Cluster-Forge uses a three-phase bootstrap process:

Phase 1: Pre-Cleanup

  • Detects and removes previous installations when applicable
  • Ensures clean state for fresh deployments

Phase 2: GitOps Foundation Bootstrap (Manual Helm Templates)

  1. ArgoCD (v8.3.5) - GitOps controller deployed via helm template
  2. Gitea (v12.3.0) - Git server with initialization job

Phase 3: App-of-Apps Deployment (ArgoCD-Managed)

  • Creates cluster-forge Application pointing to root/ helm chart
  • ArgoCD syncs all remaining applications including OpenBao from enabledApps list
  • Applications deployed in wave order (-70 to 0) based on dependencies
  • OpenBao (v0.18.2) managed via ArgoCD with openbao-init job

Dual Repository GitOps Pattern

Local Mode (Default) - Self-contained cluster-native GitOps:

  • Uses local Gitea for both cluster-forge and cluster-values repositories
  • Zero external dependencies once bootstrapped
  • Initialization handled by gitea-init-job

External Mode - Traditional GitHub-based GitOps:

  • Points to external GitHub repository
  • Supports custom branch selection for testing

See Values Inheritance Pattern for detailed architecture.

πŸ› οΈ Components

Layer 1: GitOps Foundation

  • ArgoCD 8.3.5 - GitOps continuous deployment controller
  • Gitea 12.3.0 - Self-hosted Git server with SQLite backend
  • OpenBao 0.18.2 - Vault-compatible secrets management
  • External Secrets 0.15.1 - Secrets synchronization operator

Layer 2: Core Infrastructure

Networking & Security:

  • Gateway API v1.3.0 - Kubernetes standard ingress API
  • KGateway v2.1.0-main - Gateway API implementation with WebSocket support
  • MetalLB v0.15.2 - Bare metal load balancer
  • Cert-Manager v1.18.2 - Automated TLS certificate management
  • Kyverno 3.5.1 - Policy engine with modular policy system

Storage & Database:

  • CNPG Operator 0.26.0 - CloudNativePG PostgreSQL operator
  • SeaweedFS Operator - S3-compatible object storage operator
  • SeaweedFS Config - S3 storage deployment with default-bucket, models, and datasets buckets

Layer 3: Observability

  • Prometheus Operator CRDs 23.0.0 - Metrics infrastructure
  • OpenTelemetry Operator 0.93.1 - Telemetry collection
  • OTEL-LGTM Stack v1.0.7 - Integrated observability (Loki, Grafana, Tempo, Mimir)

Layer 4: Identity & Access

  • Keycloak (keycloak-old chart) - Enterprise IAM with AIRM realm
  • Cluster-Auth 0.5.0 - Kubernetes RBAC integration

Layer 5: AI/ML Compute Stack

GPU & Scheduling:

  • AMD GPU Operator v1.4.1 - GPU device plugin and drivers
  • KubeRay Operator 1.4.2 - Ray distributed computing framework
  • Kueue 0.13.0 - Job queueing with multi-framework support
  • AppWrapper v1.1.2 - Application-level resource scheduling
  • KEDA 2.18.1 - Event-driven autoscaling

ML Serving & Inference:

  • KServe v0.16.0 - Model serving platform (Standard deployment mode)

Workflow & Messaging:

  • Kaiwo v0.2.0-rc11 - AI workload orchestration
  • RabbitMQ v2.15.0 - Message broker for async processing

Layer 6: AIRM Application

  • AIRM 0.3.2 - AMD Resource Manager application suite
  • AIM Cluster Model Source - Cluster resource models for AIRM
  • Configurable Image Repositories - Supports custom container registries via cluster-bloom AIRM_IMAGE_REPOSITORY parameter

οΏ½ Configuration

Cluster Sizing

Three cluster profiles with inheritance-based resource optimization:

Small Clusters (1-5 users, dev/test):

  • Single replica deployments
  • Reduced resource limits (ArgoCD controller: 2 CPU, 4Gi RAM)
  • Adds kyverno-policies-storage-local-path for RWXβ†’RWO PVC mutation
  • SeaweedFS volume storage: 250Gi
  • Suitable for: Local workstations, development environments

Medium Clusters (5-20 users, team production):

  • Requires a minimum of 20 CPU cores
  • Single replica with moderate resource allocation
  • Same storage policies as small (local-path support)
  • ArgoCD controller: 2 CPU, 4Gi RAM
  • Default configuration for balanced performance
  • Suitable for: Small teams, staging environments

Large Clusters (10s-100s users, enterprise scale):

  • Requires a minimum of 20 CPU cores
  • OpenBao HA: 3 replicas with Raft consensus
  • No local-path policies (assumes distributed storage)
  • SeaweedFS volume storage: 500Gi
  • Production-grade resource allocation
  • Suitable for: Production deployments, multi-tenant environments

See Cluster Size Configuration for detailed specifications.

Values Files

Configuration follows a streamlined inheritance pattern:

  • Base: Common applications with alpha-sorted enabledApps
  • Size-specific: Only override differences from base (DRY principle)
  • Runtime: Domain and cluster-specific parameters injected during bootstrap

The bootstrap script uses YAML merge semantics where size-specific values override base values.yaml settings.

πŸ“š Documentation

Comprehensive documentation is available in the /docs folder:

Topic Documentation
Getting Started Bootstrap Guide
Configuration Cluster Size Configuration
Architecture Values Inheritance Pattern
Policy System Kyverno Modular Design
Storage Policies Kyverno Access Mode Policy
Operations Backup and Restore
CI/CD Workflow Documentation

Additional documentation:

  • SBOM: See /sbom folder for software bill of materials generation and validation

πŸ“ License

Cluster-Forge is licensed under the Apache License, Version 2.0. See the LICENSE file for details.


Give Cluster-Forge a try and let us know how it works for you!

About

Kubernetes operator which sets up all platform tools to have a cluster ready for applications to run.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors