Infrastructure-as-code repository for ChainLearn, a Stellar-based AI learning platform.
This repository contains all infrastructure configurations for deploying and managing the ChainLearn platform across development, staging, and production environments.
┌────────────────────────────────────────────────────────────────────┐
│                             AWS Cloud                              │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                             VPC                              │  │
│  │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │  │
│  │   │    Public    │    │   Private    │    │   Private    │   │  │
│  │   │   Subnets    │    │   Subnets    │    │   Subnets    │   │  │
│  │   │              │    │              │    │              │   │  │
│  │   │  ┌────────┐  │    │  ┌────────┐  │    │  ┌────────┐  │   │  │
│  │   │  │  ALB   │  │    │  │  ECS   │  │    │  │  RDS   │  │   │  │
│  │   │  └────────┘  │    │  │ Cluster│  │    │  │Postgres│  │   │  │
│  │   │              │    │  └────────┘  │    │  └────────┘  │   │  │
│  │   └──────────────┘    │              │    │              │   │  │
│  │                       │  ┌────────┐  │    │  ┌────────┐  │   │  │
│  │                       │  │ Elasti │  │    │  │ Redis  │  │   │  │
│  │                       │  │ Cache  │  │    │  └────────┘  │   │  │
│  │                       │  └────────┘  │    └──────────────┘   │  │
│  │                       └──────────────┘                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│ │ CloudWatch  │  │   Grafana   │  │ Prometheus  │  │     SNS     │ │
│ └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└────────────────────────────────────────────────────────────────────┘
| Service | Port | Description |
|---|---|---|
| API | 3001 | Main REST API service (Node.js) |
| AI | 8000 | AI/ML service for course generation (Python) |
| Indexer | 3002 | Stellar blockchain indexer (Node.js) |
| Frontend | 3000 | Next.js web application |
chainlearn-infra/
├── terraform/                        # Infrastructure as Code
│   ├── modules/                      # Reusable Terraform modules
│   │   ├── networking/               # VPC, subnets, security groups
│   │   ├── compute/                  # ECS Fargate services
│   │   ├── database/                 # RDS PostgreSQL, ElastiCache Redis
│   │   └── monitoring/               # CloudWatch, Grafana, Prometheus
│   ├── environments/                 # Environment-specific configs
│   │   ├── dev/                      # Development environment
│   │   ├── staging/                  # Staging environment
│   │   └── prod/                     # Production environment
│   └── main.tf                       # Root module
│
├── kubernetes/                       # Kubernetes manifests
│   ├── base/                         # Base Kustomize resources
│   │   ├── api-deployment.yml
│   │   ├── ai-deployment.yml
│   │   ├── indexer-deployment.yml
│   │   ├── frontend-deployment.yml
│   │   ├── ingress.yml
│   │   └── namespace.yml
│   └── overlays/                     # Environment overlays
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── docker/                           # Docker configurations
│   ├── docker-compose.dev.yml        # Local development stack
│   └── docker-compose.prod.yml       # Production-like local stack
│
├── scripts/                          # Utility scripts
│   ├── setup-stellar-testnet.sh
│   ├── rotate-secrets.sh
│   └── backup-db.sh
│
└── monitoring/                       # Monitoring configurations
    ├── grafana/dashboards/           # Grafana dashboard definitions
    └── prometheus/                   # Prometheus configuration
- Terraform >= 1.5.0
- AWS CLI >= 2.0
- kubectl >= 1.28
- Docker >= 24.0
- kustomize >= 5.0
- Node.js >= 18 (for Stellar SDK)
- Stellar CLI (for Soroban contracts)
- Soroban CLI (for contract deployment)
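The version floors listed above can be checked before running anything else. The following is a minimal sketch, not part of this repository: the function names `version_ge`, `extract_version`, `check_tool`, and `check_prerequisites` are hypothetical, and it assumes `sort -V` is available (GNU coreutils and recent BSD/macOS).

```bash
#!/usr/bin/env bash
# Hypothetical helper functions; not shipped with this repository.

# True when version $1 is >= version $2 (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Pull the first x.y or x.y.z number out of a tool's version output on stdin.
extract_version() {
  grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# check_tool <label> <minimum> <command...>
# Runs the command, extracts its version, and compares it to the minimum.
check_tool() {
  local label="$1" min="$2" found
  shift 2
  found="$("$@" 2>/dev/null | extract_version)" || true
  if [ -z "$found" ]; then
    echo "$label: not found"
    return 1
  fi
  if version_ge "$found" "$min"; then
    echo "$label $found: ok (need >= $min)"
  else
    echo "$label $found: too old (need >= $min)"
    return 1
  fi
}

# Check every tool with a version floor in the prerequisites list.
check_prerequisites() {
  local failed=0
  check_tool terraform 1.5.0 terraform version || failed=1
  check_tool aws-cli 2.0 aws --version || failed=1
  check_tool kubectl 1.28 kubectl version --client || failed=1
  check_tool docker 24.0 docker --version || failed=1
  check_tool kustomize 5.0 kustomize version || failed=1
  check_tool node 18.0 node --version || failed=1
  return "$failed"
}
```

Sourcing the file and running `check_prerequisites` prints one line per tool and returns non-zero if any tool is missing or too old. The Stellar and Soroban CLIs are left out because the list gives no minimum version for them.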
- Create an IAM user with appropriate permissions
- Configure AWS CLI:
aws configure
- Create S3 bucket for Terraform state:
aws s3api create-bucket \
  --bucket chainlearn-terraform-state \
  --region us-east-1
- Create DynamoDB table for state locking:
aws dynamodb create-table \
  --table-name chainlearn-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
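Once the bucket and lock table exist, Terraform has to be pointed at them. The sketch below shows one way to do that with partial backend configuration passed on the command line. It assumes each environment declares an empty `backend "s3" {}` block and that state is stored under the key `<env>/terraform.tfstate`; both are assumptions, as are the function names `backend_config_args` and `init_backend`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the state key layout is an assumption.

# Print one -backend-config flag per line for the given environment.
backend_config_args() {
  local env="$1"
  printf -- '-backend-config=%s\n' \
    "bucket=chainlearn-terraform-state" \
    "key=${env}/terraform.tfstate" \
    "region=us-east-1" \
    "dynamodb_table=chainlearn-terraform-locks" \
    "encrypt=true"
}

# Run `terraform init` for one environment with the flags above.
init_backend() {
  local env="$1" line args=()
  while IFS= read -r line; do
    args+=("$line")
  done < <(backend_config_args "$env")
  terraform -chdir="terraform/environments/${env}" init "${args[@]}"
}
```

For example, `init_backend dev` initializes the dev environment against the shared bucket and lock table. Keeping the values on the command line avoids repeating them in every environment directory, at the cost of requiring the wrapper for each `init`.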
- Clone the repository:
git clone https://github.com/your-org/chainlearn-infra.git
cd chainlearn-infra
- Set up Stellar testnet accounts:
./scripts/setup-stellar-testnet.sh testnet
- Start the local development stack:
cd docker
cp .env.example .env  # Edit with your values
docker compose -f docker-compose.dev.yml up -d
- Access the services:
  - Frontend: http://localhost:3000
  - API: http://localhost:3001
  - AI: http://localhost:8000
  - Indexer: http://localhost:3002
  - Adminer (DB): http://localhost:8080
  - Mailhog: http://localhost:8025
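Containers can take a while to accept connections after `docker compose up -d` returns. The sketch below polls the application ports listed above until each one answers. The function names are hypothetical, and because the source does not document health endpoints, it probes the root URL and treats any HTTP response as "up".

```bash
#!/usr/bin/env bash
# Hypothetical helpers; ports are taken from the service list above.

# Map a service name to its local port.
service_port() {
  case "$1" in
    frontend) echo 3000 ;;
    api)      echo 3001 ;;
    indexer)  echo 3002 ;;
    ai)       echo 8000 ;;
    *)        return 1 ;;
  esac
}

# wait_for_service <name> [attempts]
# Polls every 2 seconds; returns 0 once the port answers, 1 on timeout,
# and 2 for an unknown service name.
wait_for_service() {
  local name="$1" attempts="${2:-30}" port i
  port="$(service_port "$name")" || {
    echo "unknown service: $name" >&2
    return 2
  }
  for ((i = 1; i <= attempts; i++)); do
    # Without -f, curl succeeds on any HTTP status, including 404.
    if curl -s -o /dev/null --max-time 2 "http://localhost:${port}"; then
      echo "$name is up on port $port"
      return 0
    fi
    sleep 2
  done
  echo "$name did not respond on port $port" >&2
  return 1
}
```

A typical use is `for s in api ai indexer frontend; do wait_for_service "$s" || break; done` right after starting the stack.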
- Initialize Terraform:
cd terraform/environments/dev
terraform init
- Review the plan:
terraform plan
- Apply the configuration:
terraform apply
- Configure kubectl:
aws eks update-kubeconfig \
  --region us-east-1 \
  --name chainlearn-dev
- Deploy to Kubernetes:
kubectl apply -k kubernetes/overlays/dev/
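The steps above can be chained into one function so they always run in the same order. This is a sketch under stated assumptions: the `run` and `deploy_env` names are hypothetical, the cluster name `chainlearn-<env>` is extrapolated from the dev example, and the plan is saved to a file so that `apply` executes exactly what was reviewed. Setting `DRY_RUN=1` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the documented deployment steps.

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# deploy_env <dev|staging|prod>, run from the repository root.
deploy_env() {
  local env="$1" tf_dir
  case "$env" in
    dev|staging|prod) ;;
    *)
      echo "usage: deploy_env dev|staging|prod" >&2
      return 2
      ;;
  esac
  tf_dir="terraform/environments/${env}"

  # Each step runs only if the previous one succeeded.
  # The cluster name chainlearn-<env> is an assumption based on the dev example.
  run terraform -chdir="$tf_dir" init &&
    run terraform -chdir="$tf_dir" plan -out=tfplan &&
    run terraform -chdir="$tf_dir" apply tfplan &&
    run aws eks update-kubeconfig --region us-east-1 --name "chainlearn-${env}" &&
    run kubectl apply -k "kubernetes/overlays/${env}/"
}
```

Running `DRY_RUN=1 deploy_env staging` first is a cheap way to confirm the paths and cluster name before touching real infrastructure.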
Development:
- Single NAT gateway (cost saving)
- Smaller instance sizes
- Single replicas for services
- Testnet Stellar network
- Debug logging enabled

Staging:
- Single NAT gateway
- Medium instance sizes
- 2 replicas for critical services
- Testnet Stellar network
- Info logging

Production:
- NAT gateway per AZ (high availability)
- Large instance sizes
- 3 replicas for critical services
- Mainnet Stellar network
- Warning-level logging
- Multi-AZ RDS
- Redis cluster with failover
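The per-environment differences listed above can be captured in a small lookup that scripts consult instead of hardcoding values. The sketch below is illustrative only: `env_setting`, `assert_network`, and the key names `replicas`, `network`, and `log_level` are hypothetical, and the values are copied from the lists above.

```bash
#!/usr/bin/env bash
# Hypothetical lookup of environment settings; values mirror the lists above.

# env_setting <dev|staging|prod> <replicas|network|log_level>
env_setting() {
  local env="$1" key="$2"
  case "${env}:${key}" in
    dev:replicas)                echo 1 ;;
    staging:replicas)            echo 2 ;;
    prod:replicas)               echo 3 ;;
    dev:network|staging:network) echo testnet ;;
    prod:network)                echo mainnet ;;
    dev:log_level)               echo debug ;;
    staging:log_level)           echo info ;;
    prod:log_level)              echo warning ;;
    *)
      echo "unknown setting: ${env}:${key}" >&2
      return 1
      ;;
  esac
}

# Succeeds only when the requested Stellar network matches the environment,
# which guards against pointing dev or staging at mainnet by mistake.
assert_network() {
  local env="$1" requested="$2"
  [ "$(env_setting "$env" network)" = "$requested" ]
}
```

For example, a deployment script could call `assert_network "$ENVIRONMENT" "$STELLAR_NETWORK" || exit 1` before doing anything that spends funds.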
The networking module creates the VPC infrastructure:
- VPC with configurable CIDR
- Public and private subnets across multiple AZs
- Internet gateway and NAT gateways
- Security groups for all services
- VPC flow logs
module "networking" {
  source       = "../../modules/networking"
  project_name = "chainlearn"
  environment  = "dev"
  vpc_cidr     = "10.0.0.0/16"
  az_count     = 2
}

The compute module manages ECS Fargate services:
- ECS cluster with Fargate capacity providers
- Task definitions for API, AI, Indexer, Frontend
- Application Load Balancer
- Service discovery
- Auto-scaling policies
module "compute" {
  source             = "../../modules/compute"
  project_name       = "chainlearn"
  environment        = "dev"
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
  # ... other variables
}

The database module provisions database infrastructure:
- RDS PostgreSQL with encryption
- ElastiCache Redis cluster
- Automated backups
- Performance Insights
- CloudWatch alarms
module "database" {
  source             = "../../modules/database"
  project_name       = "chainlearn"
  environment        = "dev"
  private_subnet_ids = module.networking.private_subnet_ids
  # ... other variables
}

The monitoring module sets up observability:
- CloudWatch dashboards
- Grafana workspace (Amazon Managed Grafana)
- Prometheus workspace (Amazon Managed Service for Prometheus)
- SNS alerts
- Log metric filters
module "monitoring" {
  source           = "../../modules/monitoring"
  project_name     = "chainlearn"
  environment      = "dev"
  ecs_cluster_name = module.compute.ecs_cluster_name
  # ... other variables
}

The base Kubernetes manifests define:
- Deployments with 2 replicas, resource limits, health checks
- Services (ClusterIP for internal, LoadBalancer for frontend)
- Ingress with TLS termination and rate limiting
- PodDisruptionBudgets for high availability
- ServiceAccounts
Each environment has a kustomization overlay that:
- Adjusts replica counts
- Modifies resource limits
- Changes environment-specific configurations
- Updates domain names and TLS certificates
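Because an overlay changes replica counts, limits, and domains, it is worth rendering it and comparing it with the live cluster before applying. The sketch below uses `kubectl kustomize` and `kubectl diff -k` for that. The `preview_overlay` name and the optional repository-root argument are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper for previewing an overlay before applying it.

# preview_overlay <dev|staging|prod> [repo_root]
preview_overlay() {
  local env="$1" root="${2:-.}"
  local dir="${root}/kubernetes/overlays/${env}"
  if [ ! -d "$dir" ]; then
    echo "no overlay directory: ${dir}" >&2
    return 1
  fi
  # Render the overlay locally; nothing is sent to the cluster.
  kubectl kustomize "$dir" || return 1
  # Compare with the live cluster. kubectl diff exits 1 when differences exist,
  # so a non-zero status here does not necessarily mean an error.
  kubectl diff -k "$dir"
}
```

Running `preview_overlay prod` from the repository root prints the rendered manifests followed by the pending changes, which makes unexpected edits visible before they reach production.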
Apply an overlay:
# Development
kubectl apply -k kubernetes/overlays/dev/
# Staging
kubectl apply -k kubernetes/overlays/staging/
# Production
kubectl apply -k kubernetes/overlays/prod/

`setup-stellar-testnet.sh` sets up Stellar testnet accounts and deploys Soroban contracts:
./scripts/setup-stellar-testnet.sh [network]
# Examples:
./scripts/setup-stellar-testnet.sh testnet
./scripts/setup-stellar-testnet.sh standalone

`rotate-secrets.sh` rotates secrets in AWS Secrets Manager and updates Kubernetes:
./scripts/rotate-secrets.sh [environment] [secret-name]
# Examples:
./scripts/rotate-secrets.sh dev all
./scripts/rotate-secrets.sh prod database
./scripts/rotate-secrets.sh staging redis

`backup-db.sh` backs up PostgreSQL to S3 with encryption:
./scripts/backup-db.sh [environment]
# Examples:
./scripts/backup-db.sh dev
./scripts/backup-db.sh prod

Cron job (daily at 2 AM):
0 2 * * * /path/to/chainlearn-infra/scripts/backup-db.sh prod >> /var/log/chainlearn-backup.log 2>&1

Two pre-configured dashboards:
- API Metrics (`api-metrics.json`):
  - Request rate and latency
  - Error rates by status code
  - CPU and memory usage
  - Stellar contract calls
  - Course completions
- Contract Metrics (`contract-metrics.json`):
  - Contract call rates by function
  - Contract latency percentiles
  - Rewards distributed
  - Achievements unlocked
  - Indexer metrics
Prometheus is configured to scrape metrics from:
- All ChainLearn services
- Node exporter (host metrics)
- Redis exporter
- PostgreSQL exporter
- Blackbox exporter (endpoint monitoring)
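To confirm that Prometheus is actually reaching the targets listed above, its HTTP API exposes target health at `/api/v1/targets`. The sketch below counts targets that are not `up`. The function names are hypothetical, and the default base URL `http://localhost:9090` is an assumption (Prometheus's default port), not something this repository states.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the base URL is an assumption.

# Read a /api/v1/targets JSON response on stdin and print how many targets
# report a health other than "up". Pattern matching on compact JSON is
# fragile; a real JSON parser would be more robust.
count_unhealthy_targets() {
  grep -o '"health":"[a-z]*"' | grep -vc '"health":"up"' || true
}

# check_scrape_targets [base_url]
# Returns 0 when every target is up, 1 when some are not, 2 when the
# Prometheus API cannot be reached.
check_scrape_targets() {
  local base_url="${1:-http://localhost:9090}" response unhealthy
  if ! response="$(curl -fsS "${base_url}/api/v1/targets")"; then
    echo "could not reach Prometheus at ${base_url}" >&2
    return 2
  fi
  unhealthy="$(printf '%s' "$response" | count_unhealthy_targets)"
  echo "unhealthy scrape targets: ${unhealthy}"
  [ "$unhealthy" -eq 0 ]
}
```

The distinct exit codes let a CI job tell "Prometheus is down" apart from "an exporter is down".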
Pre-configured alarms for:
- API 5xx errors
- High API latency
- ECS CPU utilization
- Database CPU and storage
- Redis CPU and memory
- Error rate spikes
- Private subnets for all services
- Security groups with minimal required access
- NAT gateways for outbound internet access
- VPC flow logs enabled
- RDS encryption at rest
- ElastiCache encryption at rest and in transit
- Secrets stored in AWS Secrets Manager
- TLS for all external endpoints
- IAM roles with least privilege
- Kubernetes RBAC
- Service accounts per deployment
- No hardcoded credentials
Development:
- Single NAT gateway
- Smaller instance sizes (t3.micro/small)
- Single replicas
- Shorter log retention (30 days)

Production:
- NAT gateway per AZ (high availability)
- Right-sized instances
- Auto-scaling enabled
- Reserved instances for predictable workloads
- S3 lifecycle policies for backups
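As an illustration of the S3 lifecycle policies mentioned above, the sketch below builds a rule that moves backups to Glacier and later expires them, then applies it with `aws s3api put-bucket-lifecycle-configuration`. The function names, the rule ID, the `backups/` prefix, and the 30-day and 365-day thresholds are all assumptions. The source does not name the backup bucket, so it is passed as an argument.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; prefix and retention periods are assumptions.

# backup_lifecycle_json <prefix> <days_until_glacier> <days_until_expiry>
# Prints a lifecycle configuration document on stdout.
backup_lifecycle_json() {
  local prefix="$1" glacier_days="$2" expire_days="$3"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "expire-old-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "${prefix}" },
      "Transitions": [ { "Days": ${glacier_days}, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": ${expire_days} }
    }
  ]
}
EOF
}

# apply_backup_lifecycle <bucket>
# Note: this call replaces any lifecycle configuration already on the bucket.
apply_backup_lifecycle() {
  local bucket="$1"
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$bucket" \
    --lifecycle-configuration "$(backup_lifecycle_json "backups/" 30 365)"
}
```

Printing the JSON with `backup_lifecycle_json backups/ 30 365` before applying makes it easy to review the retention periods in a pull request.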
Terraform state lock:
terraform force-unlock <lock-id>

ECS service not starting:
aws ecs describe-services \
--cluster chainlearn-dev \
--services chainlearn-api
aws logs get-log-events \
--log-group-name /ecs/chainlearn-dev \
--log-stream-name api/<task-id>

Kubernetes pod CrashLoopBackOff:
kubectl logs -n chainlearn-dev -l app.kubernetes.io/name=chainlearn-api --previous
kubectl describe pod -n chainlearn-dev <pod-name>

Database connection issues:
# Check RDS status
aws rds describe-db-instances \
--db-instance-identifier chainlearn-dev
# Test connection
psql -h <endpoint> -U chainlearn_admin -d chainlearn

- Create a feature branch from `main`
- Make your changes
- Test locally with `docker compose`
- Submit a pull request
- Wait for the CI/CD pipeline to pass
Proprietary - ChainLearn Team
For infrastructure issues, contact the DevOps team:
- Email: devops@chainlearn.io
- Slack: #chainlearn-infra