Configure AWS:
aws configure

Login to ECR (only needed for pushing a public image; FedGraph already provides a public Docker image that includes all of the environment dependencies):
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

Build the Docker image for the amd64 architecture on the cloud builder and push it to ECR:
# You can modify the cloud builder using the CLI, with the docker buildx create command.
docker buildx create --driver cloud ryanli3/fedgraph
# Set your new cloud builder as default on your local machine.
docker buildx use cloud-ryanli3-fedgraph --global
# Build and push image to ECR
docker buildx build --platform linux/amd64 -t public.ecr.aws/i7t1s5i1/fedgraph:img . --push

Create an EKS Cluster with eksctl:
eksctl create cluster -f eks_cluster_config.yaml --timeout=60m

Once the cluster setup finishes, update your kubeconfig so kubectl can manage the cluster:
# --region and --name correspond to the metadata section of eks_cluster_config.yaml:
# metadata:
#   name: user
#   region: us-west-2
aws eks --region us-west-2 update-kubeconfig --name mlarge
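For reference, the eks_cluster_config.yaml consumed by eksctl above might look like the following minimal sketch. The metadata block matches the comments above; the nodeGroups section (instance type and sizes) is an assumption included only to illustrate the shape of the file:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: user
  region: us-west-2

# Illustrative node group; size it to match the resources
# requested in ray_kubernetes_cluster.yaml.
nodeGroups:
  - name: ray-workers
    instanceType: m5.large
    desiredCapacity: 3
```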
Optional: check or switch the current cluster context if multiple clusters are running at the same time:
kubectl config current-context
kubectl config use-context arn:aws:eks:us-west-2:312849146674:cluster/large
Clone the KubeRay Repository and Install the Prometheus and Grafana Servers:
git clone https://github.com/ray-project/kuberay.git
cd kuberay
./install/prometheus/install.sh

Add the KubeRay Helm Repository and Install the KubeRay Operator:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1

Navigate to the Example Configurations Directory:
cd docs/examples/configs

Apply the Ray Kubernetes Cluster and Ingress Configurations:
kubectl apply -f ray_kubernetes_cluster.yaml
kubectl apply -f ray_kubernetes_ingress.yaml

Check that every pod is running correctly:
kubectl get pods
# NAME                                             READY   STATUS    RESTARTS   AGE
# kuberay-operator-7d7998bcdb-bzpkj                1/1     Running   0          35m
# raycluster-autoscaler-head-47mzs                 2/2     Running   0          35m
# raycluster-autoscaler-worker-large-group-grw8w   1/1     Running   0          35m

If a pod's status is Pending, ray_kubernetes_cluster.yaml requests more resources than the cluster can provide. Delete the RayCluster, reduce the resource requests in the config, and reapply it:
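To confirm the resource shortfall before deleting anything, the pod's scheduling events can be inspected; a small sketch (the pod name in the usage comment is only an example):

```shell
# pending_reason POD: print the Events section of `kubectl describe pod`.
# For a Pending pod this usually ends with a FailedScheduling event such as
# "Insufficient cpu" or "Insufficient memory".
pending_reason() {
  kubectl describe pod "$1" | sed -n '/^Events:/,$p'
}
# Usage: pending_reason raycluster-autoscaler-worker-large-group-grw8w
```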
kubectl delete -f ray_kubernetes_cluster.yaml
kubectl apply -f ray_kubernetes_cluster.yaml

Forward Ports for the Ray Dashboard, Prometheus, and Grafana:
kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265
# Replace raycluster-autoscaler-head-47mzs with your head pod's name
kubectl port-forward raycluster-autoscaler-head-47mzs 8080:8080
kubectl port-forward prometheus-prometheus-kube-prometheus-prometheus-0 -n prometheus-system 9090:9090
kubectl port-forward deployment/prometheus-grafana -n prometheus-system 3000:3000

Final Check:
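Beyond eyeballing the listing below, a small helper can flag any pod that is not Running or Completed (a sketch; it parses the STATUS column of the kubectl output):

```shell
# check_pods: exit nonzero and print any pod whose STATUS is not
# Running or Completed. With --all-namespaces and --no-headers the
# STATUS field is column 4 (NAMESPACE NAME READY STATUS RESTARTS AGE).
check_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { bad = 1; print } END { exit bad }'
}
# Usage: check_pods && echo "all pods healthy"
```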
kubectl get pods --all-namespaces -o wide

Submit a Ray Job:
cd fedgraph
ray job submit --runtime-env-json '{
"working_dir": "./",
"excludes": [".git"]
}' --address http://localhost:8265 -- python3 run.py
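The inline --runtime-env-json string is easy to mistype. One option is to keep the runtime environment in a file and validate it before submitting (a sketch; the file name runtime_env.json is an assumption):

```shell
# Write the runtime environment to a file so it can be validated and reused.
cat > runtime_env.json <<'EOF'
{
  "working_dir": "./",
  "excludes": [".git"]
}
EOF

# Exits nonzero with a parse error if the JSON is malformed.
python3 -m json.tool runtime_env.json > /dev/null && echo "runtime env OK"
```

The validated file can then be passed to the submit command as --runtime-env-json "$(cat runtime_env.json)".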
Stop a Ray Job:
# raysubmit_xxx is the job ID, which can be found via `ray job list` or on the Ray dashboard
ray job stop raysubmit_m5PN9xqV6drJQ8k2 --address http://localhost:8265

Delete the RayCluster Custom Resource:
cd docs/examples/configs
kubectl delete -f ray_kubernetes_cluster.yaml
kubectl delete -f ray_kubernetes_ingress.yaml

Confirm that the RayCluster Pods are Terminated:
kubectl get pods
# Ensure the output shows no Ray pods except kuberay-operator

Finally, delete the nodes first and then delete the EKS cluster:
kubectl get nodes -o name | xargs kubectl delete
eksctl delete cluster --region us-west-2 --name user

If you set "save: True" in node classification tasks, log in to the Hugging Face Hub CLI (if you have not already done so):
huggingface-cli login