Skip to content

alpha-hack-program/openshift-ai-odyssey

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

11 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Red Hat OpenShift AI Odyssey

Your space suit for Red Hat OpenShift AI. Embark on a hands-on journey from zero to MLOps hero through practical missions and challenges! πŸš€πŸͺ

OpenShift AI Odyssey learning path

A hands-on path for small teams. Pair people from different backgrounds so each mission covers both platform and application work.

Follow the missions from the Sun outward β€” through the inner planets, across the asteroid belt, and into the outer system.

Like the real solar system, there is a divide between the inner and outer planets:

Zone Missions Who leads
Sun + inner system β˜€οΈ Sun β†’ Mars (+ optional Moon) Mostly Systems Engineering βš™οΈ β€” install, configure, and prepare the cluster
Asteroid belt β˜„οΈ Between Mars and Jupiter Transition β€” foundation is ready; experimentation and science take the wheel
Outer system πŸͺ Jupiter β†’ Neptune Mostly Science Crew πŸ’» β€” pipelines, training, evaluation, MaaS, and agentic apps
Deep Space 🌌 Bonus explorations Stretch goals for teams that finish early
Crew Background You own…
Systems Engineering βš™οΈ OpenShift infra β€” installation, nodes, storage, operators Cluster plumbing that makes AI workloads run
Science Crew πŸ’» CI/CD, middleware, app dev Projects, notebooks, pipelines, models, and APIs

Important

Red Hatters: Order your lab on the Red Hat Demo Platform β€” select OpenShift on AWS Sandbox. Key settings: activity Practice / Enablement, uncheck cert-manager, enable Configure Authentication, region eu-central-1, OCP version 4.20, 1 control plane node, instance type m6a.4xlarge. See πŸ“‚ Sun mission dossier for the full ordering checklist.

Prerequisites: OpenShift 4.20 on AWS with cluster-admin access and the OpenShift AI 3.4 docs handy.


β˜€οΈ Sun Β· OpenShift + OpenShift AI (shared Β· start here)

Everything orbits from here. Goal: OCP 4.20 on AWS IPI with RHOAI 3.4 running on top.

Systems Engineering βš™οΈ

  • Log in to the cluster (kubeadmin credentials provided by RHDP or your cluster admin)
  • Verify cluster health: oc get nodes and oc get clusteroperators
  • Install the Red Hat OpenShift AI Operator (stable-3.x channel β€” latest GA)
  • Create a DataScienceCluster with core components (dashboard, workbenches, aipipelines, kserve) β€” other components are activated in later missions
  • Verify storage classes and PVC provisioning for user workloads

Science Crew πŸ’»

  • Log in to the OpenShift AI dashboard
  • Create a project
  • Confirm workbenches and pipelines appear in the UI

Telemetry

  • oc get datasciencecluster shows phase Ready
  • OpenShift AI dashboard is reachable at its route
  • A project appears in the dashboard with workbench and pipeline sections visible

Flight Notes

  • oc get csv -n redhat-ods-operator β€” verify operator is Succeeded
  • Dashboard route: oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}'

βœ… Done when: Dashboard is reachable and your team project exists.

πŸ“‚ Full Mission Dossier


Mission 1 β€” Mercury Β· Notebook playground (inner system β˜€οΈ)

Systems Engineering βš™οΈ

  • Confirm notebook image pull and default workbench sizes
  • Verify PVC/storage for home directories
  • Check SCCs and resource quotas in the project

Science Crew πŸ’»

  • Launch a workbench
  • Run a short Python notebook (e.g. pandas, a simple plot)
  • Install one extra package
  • Restart the workbench and confirm your work persists

πŸ“– Getting started β€” workbenches

Telemetry

  • Workbench pod shows Running in the OpenShift console
  • Notebook cell executes without error and renders output
  • After restart, the notebook file is still present in the home directory

Flight Notes

  • oc get notebooks -n <your-project> β€” check workbench CR status
  • If image pull fails, verify the ImageStream exists in redhat-ods-applications

βœ… Done when: A notebook runs code and survives a restart.

πŸ“‚ Full Mission Dossier


Mission 2 β€” Venus Β· GPU power-up (inner system β˜€οΈ)

Pick one track per cluster (or run both on separate node pools).

Track A β€” Real GPUs on AWS (Systems Engineering βš™οΈ leads)

Systems Engineering βš™οΈ

  • Add GPU worker nodes (e.g. g6 instances)
  • Install NVIDIA GPU Operator + Node Feature Discovery
  • Label and taint GPU nodes
  • Confirm nvidia-smi from a GPU pod

Science Crew πŸ’»

  • Verify the GPU appears in the OpenShift AI dashboard (accelerator profile)
  • Launch a workbench with a GPU resource request
  • Confirm the device is visible inside the pod

Track B β€” Fake GPUs for learning (Systems Engineering βš™οΈ leads)

Systems Engineering βš™οΈ

  • Deploy fake-gpu-operator on selected worker nodes
  • Confirm nodes advertise nvidia.com/gpu

Science Crew πŸ’»

  • Schedule a GPU-requesting workload on a fake-GPU node
  • Confirm scheduling succeeds and the dashboard still shows GPU capacity

Telemetry

  • oc describe node <gpu-node> | grep nvidia.com/gpu shows allocatable capacity
  • Accelerator profile appears in the RHOAI dashboard under Settings
  • Workbench with GPU request starts without Pending due to resource constraints

Flight Notes

  • oc get clusterpolicy β€” verify NVIDIA GPU Operator ClusterPolicy is Ready
  • For Track B: fake-gpu resources behave like real ones for scheduling; CUDA workloads won't run

βœ… Done when: At least one node pool exposes GPUs and a workload can request them.

πŸ“‚ Full Mission Dossier


Mission 3 β€” Earth Β· Deploy models (inner system β˜€οΈ)

Learn model serving in general β€” not tied to one specific model yet.

Systems Engineering βš™οΈ

  • Enable model serving (KServe) on the DataScienceCluster
  • Ensure GPU nodes and enough PVC/object storage for model weights
  • Open network routes for inference endpoints
  • Deploy or configure serving runtimes: vLLM, KServe, and llm-d (pick at least two)

Science Crew πŸ’»

  • Open the model catalog
  • Pick a small model suitable for your hardware
  • Deploy it with a serving runtime (vLLM or KServe)
  • Send a test prompt and capture the response
  • (Optional) Compare inference behaviour with llm-d routing

Telemetry

  • InferenceService CR shows READY: True
  • curl to the inference endpoint returns a valid JSON response
  • Model appears as Deployed in the RHOAI dashboard

Flight Notes

  • oc get inferenceservice -n <your-project> β€” check serving status
  • Small CPU-friendly models (e.g. TinyLlama-1B) work without GPU for initial testing

βœ… Done when: At least one model serves inference from the catalog using a chosen runtime.

πŸ“‚ Full Mission Dossier


πŸŒ™ Moon Β· GPT-OSS-20B (optional Β· orbits Earth)

Optional deep dive once Earth is complete. Deploy the Red Hat validated GPT-OSS-20B model specifically.

Systems Engineering βš™οΈ

  • Confirm GPU capacity and storage for GPT-OSS-20B weights
  • Review Performance Insights for your hardware

Science Crew πŸ’»

  • Find GPT-OSS-20B in the model catalog
  • Deploy with vLLM (or your preferred runtime)
  • Run a few representative prompts and save baseline responses for later missions

Telemetry

  • GPT-OSS-20B InferenceService shows READY: True
  • A prompt returns a coherent response within a reasonable latency

Flight Notes

  • GPT-OSS-20B requires significant GPU VRAM β€” check Performance Insights before deploying
  • Save a few baseline prompt/response pairs; you'll use them on Saturn (EvalHub)

βœ… Done when: GPT-OSS-20B serves inference (skip if your hardware cannot fit this model).

πŸ“‚ Full Mission Dossier


Mission 4 β€” Mars Β· Advanced platform (inner system β˜€οΈ β€” last stop before the belt)

Systems Engineering βš™οΈ

  • Enable the training operator, Kueue, and Ray on the DataScienceCluster
  • Configure hardware profiles / accelerator profiles for GPU workloads
  • Set up Kueue quotas and queue management for distributed jobs

Science Crew πŸ’»

  • Submit a sample distributed training job (PyTorchJob or Training Hub + Kubeflow Trainer)
  • Observe how Kueue queues and admits the workload
  • Confirm the job runs on the expected hardware profile

πŸ“– Managing distributed workloads

Telemetry

  • oc get clusterqueue shows quota capacity and admitted workloads
  • PyTorchJob or TrainingJob CR reaches Succeeded status
  • Hardware profile appears selectable in the RHOAI dashboard workbench launcher

Flight Notes

  • oc get localqueue -n <your-project> β€” namespace-level queue must exist before submitting jobs
  • Kueue admission can be observed with oc get workloads -n <your-project> -w

βœ… Done when: A distributed workload is queued, scheduled, and completes on GPU nodes.

πŸ“‚ Full Mission Dossier


β˜„οΈ Asteroid belt β€” crossing point

You made it through the inner system. OpenShift AI is installed, GPUs are ready, models are serving, and advanced platform features are enabled.

From here on, Science Crew πŸ’» takes the lead. Systems Engineering βš™οΈ still supports (storage, routes, operators), but the missions are about experimenting and building on what you stood up.


Mission 5 β€” Jupiter Β· Experimenting (outer system πŸͺ)

Broad experimentation mission β€” connect the dots between pipelines, tracking, and training.

Systems Engineering βš™οΈ

  • Ensure the pipelines and MLflow components are enabled
  • Provide S3-compatible storage for pipeline artifacts (OpenShift Data Foundation, MinIO, or AWS S3)
  • Wire connection secrets and MLFLOW_TRACKING_URI into the namespace

Science Crew πŸ’»

  • Build or import a Kubeflow Pipeline and run it end-to-end
  • Run an interactive Training Hub experiment in a workbench (SFT or LoRA on a small model)
  • Track runs in MLflow β€” compare parameters and metrics across experiments
  • (Stretch) Chain Training Hub into a pipeline step for a reproducible fine-tuning workflow

πŸ“– Working with AI pipelines Β· Training Hub

Telemetry

  • Pipeline run appears as Succeeded in the RHOAI Pipelines UI
  • MLflow experiment shows at least two runs with logged metrics
  • Training Hub job completes and a model checkpoint is saved to the configured storage

Flight Notes

  • oc get pipelinerun -n <your-project> β€” check Tekton pipeline run status
  • MLflow UI is accessible via its route in the redhat-ods-applications namespace

βœ… Done when: A pipeline run and a Training Hub experiment both appear in MLflow.

πŸ“‚ Full Mission Dossier


Mission 6 β€” Saturn Β· TrustyAI (outer system πŸͺ)

Systems Engineering βš™οΈ

  • Deploy EvalHub via the TrustyAI Operator
  • Deploy and configure NeMo Guardrails for your model endpoint
  • Configure MLflow tracking for EvalHub (if not already done on Jupiter)
  • Expose EvalHub and Guardrails routes

Science Crew πŸ’»

  • Submit an EvalHub job against your model endpoint (REST API, notebook, or evalhub CLI)
  • Pick a small benchmark collection and review metrics
  • Send prompts with and without NeMo Guardrails β€” compare blocked vs allowed responses
  • Verify guardrails catch at least one unsafe or off-policy input

πŸ“– Evaluate LLMs with EvalHub

Telemetry

  • EvalHub job reaches completed status and metrics appear in MLflow
  • NeMo Guardrails route is reachable and returns 200 for a safe prompt
  • A known unsafe prompt returns a blocked/refused response through Guardrails

Flight Notes

  • oc get evalhub -n redhat-ods-applications β€” check EvalHub CR status
  • Use the evalhub CLI: evalhub jobs list --url <evalhub-route>

βœ… Done when: An EvalHub job completes and NeMo Guardrails demonstrably filters a test prompt.

πŸ“‚ Full Mission Dossier


Mission 7 β€” Uranus Β· Models-as-a-Service (MaaS) (outer system πŸͺ)

Systems Engineering βš™οΈ

  • Follow the MaaS prerequisites: database secret, TLS, and dashboard config flags
  • Verify the MaaS gateway is healthy

Science Crew πŸ’»

  • Publish a model through MaaS (GPT-OSS-20B if you completed the Moon mission)
  • Create a subscription/token for a consumer
  • Call the governed endpoint with curl or a small app

Telemetry

  • MaaS gateway pod is Running in redhat-ods-applications
  • A published model appears in the MaaS catalog in the dashboard
  • curl with a valid token returns a model response; without a token returns 401

Flight Notes

  • oc get modelmesh -n redhat-ods-applications β€” verify gateway health
  • Token-based auth: include Authorization: Bearer <token> in curl requests

βœ… Done when: A model is consumed through the MaaS API with authentication.

πŸ“‚ Full Mission Dossier


Mission 8 β€” Neptune Β· Agentic applications with RAG and OGX (outer system πŸͺ)

Build an agentic app that retrieves from your own documents and orchestrates tool calls through OGX β€” an OpenAI-compatible agentic API server.

Systems Engineering βš™οΈ

  • Deploy OGX on the cluster (or enable it as a model-serving backend)
  • Provide vector-store backing storage (e.g. Milvus, PGVector, or OGX vector stores)
  • Expose OGX and any dependent services via Routes
  • Wire your model or MaaS endpoint as the LLM backend for OGX

Science Crew πŸ’»

  • Upload documents to a vector store via the OGX Files / Vector Stores API
  • Build a RAG flow using the OGX Responses API (agentic file search + tool calling)
  • Try an agentic starter kit (e.g. LangGraph agentic RAG) on OpenShift
  • Ask questions that require retrieving from your documents β€” confirm the agent cites the right sources

πŸ“– Enterprise RAG chatbot on OpenShift AI

Telemetry

  • OGX /health endpoint returns 200
  • A document uploaded to the vector store is retrievable via the Files API
  • An agentic query returns an answer that cites content from your uploaded documents

Flight Notes

  • curl <ogx-route>/v1/models β€” verify OGX sees your LLM backend
  • Use OPENAI_BASE_URL=<ogx-route>/v1 with any OpenAI-compatible client

βœ… Done when: An agentic RAG application answers from your own data through OGX.

πŸ“‚ Full Mission Dossier


Deep Space 🌌 β€” bonus explorations

Optional missions once you've completed the solar system. Split up and report back β€” each person picks one area.

Model Registry

  • Systems Engineering βš™οΈ: Enable the component, provision backend storage
  • Science Crew πŸ’»: Register a model version from a pipeline or training run

Feature store (Feast)

  • Systems Engineering βš™οΈ: Enable Feast operator
  • Science Crew πŸ’»: Connect a notebook to an online feature store

Model catalog deep dive

  • Science Crew πŸ’»: Use Model Performance view to compare validated models for a workload type

TrustyAI β€” bias & drift

  • Science Crew πŸ’»: Run a bias or data-drift check on a model beyond EvalHub benchmarks

πŸ“– OpenShift AI 3.4 documentation hub

βœ… Done when: Each teammate demos one bonus feature in 5 minutes.


Tips for Mission Control

  • Mix the crews βš™οΈ + πŸ’» throughout β€” but respect the zones: Systems Engineering βš™οΈ leads from the Sun through Mars; Science Crew πŸ’» leads beyond the asteroid belt.
  • Venus track B (fake GPU) is great when AWS GPU quota is tight; switch to track A before Earth.
  • Earth before Moon β€” learn model serving in general first; GPT-OSS-20B on the Moon is optional and hardware-dependent.
  • The asteroid belt β˜„οΈ is your checkpoint: don't cross to Jupiter until Mars (distributed training + Kueue) is working.
  • Jupiter β†’ Saturn β†’ Uranus β†’ Neptune β€” experiment, evaluate and guardrail, govern with MaaS, then build agentic apps.
  • Deep Space topics are optional stretch goals for teams that finish early.
  • For automation examples, see alvarolop/rhoai-gitops.

About

Your space suit for Red Hat OpenShift AI. Embark on a hands-on journey from zero to MLOps hero through practical missions and challenges! πŸš€πŸͺ

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors