Your space suit for Red Hat OpenShift AI. Embark on a hands-on journey from zero to MLOps hero through practical missions and challenges! ππͺ
A hands-on path for small teams. Pair people from different backgrounds so each mission covers both platform and application work.
Follow the missions from the Sun outward β through the inner planets, across the asteroid belt, and into the outer system.
Like the real solar system, there is a divide between the inner and outer planets:
| Zone | Missions | Who leads |
|---|---|---|
| Sun + inner system βοΈ | Sun β Mars (+ optional Moon) | Mostly Systems Engineering βοΈ β install, configure, and prepare the cluster |
| Asteroid belt βοΈ | Between Mars and Jupiter | Transition β foundation is ready; experimentation and science take the wheel |
| Outer system πͺ | Jupiter β Neptune | Mostly Science Crew π» β pipelines, training, evaluation, MaaS, and agentic apps |
| Deep Space π | Bonus explorations | Stretch goals for teams that finish early |
| Crew | Background | You own⦠|
|---|---|---|
| Systems Engineering βοΈ | OpenShift infra β installation, nodes, storage, operators | Cluster plumbing that makes AI workloads run |
| Science Crew π» | CI/CD, middleware, app dev | Projects, notebooks, pipelines, models, and APIs |
Important
Red Hatters: Order your lab on the Red Hat Demo Platform β select OpenShift on AWS Sandbox. Key settings: activity Practice / Enablement, uncheck cert-manager, enable Configure Authentication, region eu-central-1, OCP version 4.20, 1 control plane node, instance type m6a.4xlarge. See π Sun mission dossier for the full ordering checklist.
Prerequisites: OpenShift 4.20 on AWS with cluster-admin access and the OpenShift AI 3.4 docs handy.
Everything orbits from here. Goal: OCP 4.20 on AWS IPI with RHOAI 3.4 running on top.
Systems Engineering βοΈ
- Log in to the cluster (kubeadmin credentials provided by RHDP or your cluster admin)
- Verify cluster health:
oc get nodesandoc get clusteroperators - Install the Red Hat OpenShift AI Operator (
stable-3.xchannel β latest GA) - Create a
DataScienceClusterwith core components (dashboard,workbenches,aipipelines,kserve) β other components are activated in later missions - Verify storage classes and PVC provisioning for user workloads
Science Crew π»
- Log in to the OpenShift AI dashboard
- Create a project
- Confirm workbenches and pipelines appear in the UI
Telemetry
oc get datascienceclustershows phaseReady- OpenShift AI dashboard is reachable at its route
- A project appears in the dashboard with workbench and pipeline sections visible
Flight Notes
oc get csv -n redhat-ods-operatorβ verify operator isSucceeded- Dashboard route:
oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}'
β Done when: Dashboard is reachable and your team project exists.
π Full Mission Dossier
Systems Engineering βοΈ
- Confirm notebook image pull and default workbench sizes
- Verify PVC/storage for home directories
- Check SCCs and resource quotas in the project
Science Crew π»
- Launch a workbench
- Run a short Python notebook (e.g.
pandas, a simple plot) - Install one extra package
- Restart the workbench and confirm your work persists
π Getting started β workbenches
Telemetry
- Workbench pod shows
Runningin the OpenShift console - Notebook cell executes without error and renders output
- After restart, the notebook file is still present in the home directory
Flight Notes
oc get notebooks -n <your-project>β check workbench CR status- If image pull fails, verify the
ImageStreamexists inredhat-ods-applications
β Done when: A notebook runs code and survives a restart.
π Full Mission Dossier
Pick one track per cluster (or run both on separate node pools).
Systems Engineering βοΈ
- Add GPU worker nodes (e.g.
g6instances) - Install NVIDIA GPU Operator + Node Feature Discovery
- Label and taint GPU nodes
- Confirm
nvidia-smifrom a GPU pod
Science Crew π»
- Verify the GPU appears in the OpenShift AI dashboard (accelerator profile)
- Launch a workbench with a GPU resource request
- Confirm the device is visible inside the pod
Systems Engineering βοΈ
- Deploy fake-gpu-operator on selected worker nodes
- Confirm nodes advertise
nvidia.com/gpu
Science Crew π»
- Schedule a GPU-requesting workload on a fake-GPU node
- Confirm scheduling succeeds and the dashboard still shows GPU capacity
Telemetry
oc describe node <gpu-node> | grep nvidia.com/gpushows allocatable capacity- Accelerator profile appears in the RHOAI dashboard under Settings
- Workbench with GPU request starts without
Pendingdue to resource constraints
Flight Notes
oc get clusterpolicyβ verify NVIDIA GPU Operator ClusterPolicy isReady- For Track B: fake-gpu resources behave like real ones for scheduling; CUDA workloads won't run
β Done when: At least one node pool exposes GPUs and a workload can request them.
π Full Mission Dossier
Learn model serving in general β not tied to one specific model yet.
Systems Engineering βοΈ
- Enable model serving (KServe) on the
DataScienceCluster - Ensure GPU nodes and enough PVC/object storage for model weights
- Open network routes for inference endpoints
- Deploy or configure serving runtimes: vLLM, KServe, and llm-d (pick at least two)
Science Crew π»
- Open the model catalog
- Pick a small model suitable for your hardware
- Deploy it with a serving runtime (vLLM or KServe)
- Send a test prompt and capture the response
- (Optional) Compare inference behaviour with llm-d routing
Telemetry
InferenceServiceCR showsREADY: Truecurlto the inference endpoint returns a valid JSON response- Model appears as
Deployedin the RHOAI dashboard
Flight Notes
oc get inferenceservice -n <your-project>β check serving status- Small CPU-friendly models (e.g. TinyLlama-1B) work without GPU for initial testing
β Done when: At least one model serves inference from the catalog using a chosen runtime.
π Full Mission Dossier
Optional deep dive once Earth is complete. Deploy the Red Hat validated GPT-OSS-20B model specifically.
Systems Engineering βοΈ
- Confirm GPU capacity and storage for GPT-OSS-20B weights
- Review Performance Insights for your hardware
Science Crew π»
- Find GPT-OSS-20B in the model catalog
- Deploy with vLLM (or your preferred runtime)
- Run a few representative prompts and save baseline responses for later missions
Telemetry
- GPT-OSS-20B
InferenceServiceshowsREADY: True - A prompt returns a coherent response within a reasonable latency
Flight Notes
- GPT-OSS-20B requires significant GPU VRAM β check Performance Insights before deploying
- Save a few baseline prompt/response pairs; you'll use them on Saturn (EvalHub)
β Done when: GPT-OSS-20B serves inference (skip if your hardware cannot fit this model).
π Full Mission Dossier
Systems Engineering βοΈ
- Enable the training operator, Kueue, and Ray on the
DataScienceCluster - Configure hardware profiles / accelerator profiles for GPU workloads
- Set up Kueue quotas and queue management for distributed jobs
Science Crew π»
- Submit a sample distributed training job (PyTorchJob or Training Hub + Kubeflow Trainer)
- Observe how Kueue queues and admits the workload
- Confirm the job runs on the expected hardware profile
π Managing distributed workloads
Telemetry
oc get clusterqueueshows quota capacity and admitted workloads- PyTorchJob or TrainingJob CR reaches
Succeededstatus - Hardware profile appears selectable in the RHOAI dashboard workbench launcher
Flight Notes
oc get localqueue -n <your-project>β namespace-level queue must exist before submitting jobs- Kueue admission can be observed with
oc get workloads -n <your-project> -w
β Done when: A distributed workload is queued, scheduled, and completes on GPU nodes.
π Full Mission Dossier
You made it through the inner system. OpenShift AI is installed, GPUs are ready, models are serving, and advanced platform features are enabled.
From here on, Science Crew π» takes the lead. Systems Engineering βοΈ still supports (storage, routes, operators), but the missions are about experimenting and building on what you stood up.
Broad experimentation mission β connect the dots between pipelines, tracking, and training.
Systems Engineering βοΈ
- Ensure the pipelines and MLflow components are enabled
- Provide S3-compatible storage for pipeline artifacts (OpenShift Data Foundation, MinIO, or AWS S3)
- Wire connection secrets and
MLFLOW_TRACKING_URIinto the namespace
Science Crew π»
- Build or import a Kubeflow Pipeline and run it end-to-end
- Run an interactive Training Hub experiment in a workbench (SFT or LoRA on a small model)
- Track runs in MLflow β compare parameters and metrics across experiments
- (Stretch) Chain Training Hub into a pipeline step for a reproducible fine-tuning workflow
π Working with AI pipelines Β· Training Hub
Telemetry
- Pipeline run appears as
Succeededin the RHOAI Pipelines UI - MLflow experiment shows at least two runs with logged metrics
- Training Hub job completes and a model checkpoint is saved to the configured storage
Flight Notes
oc get pipelinerun -n <your-project>β check Tekton pipeline run status- MLflow UI is accessible via its route in the
redhat-ods-applicationsnamespace
β Done when: A pipeline run and a Training Hub experiment both appear in MLflow.
π Full Mission Dossier
Systems Engineering βοΈ
- Deploy EvalHub via the TrustyAI Operator
- Deploy and configure NeMo Guardrails for your model endpoint
- Configure MLflow tracking for EvalHub (if not already done on Jupiter)
- Expose EvalHub and Guardrails routes
Science Crew π»
- Submit an EvalHub job against your model endpoint (REST API, notebook, or
evalhubCLI) - Pick a small benchmark collection and review metrics
- Send prompts with and without NeMo Guardrails β compare blocked vs allowed responses
- Verify guardrails catch at least one unsafe or off-policy input
π Evaluate LLMs with EvalHub
Telemetry
- EvalHub job reaches
completedstatus and metrics appear in MLflow - NeMo Guardrails route is reachable and returns
200for a safe prompt - A known unsafe prompt returns a blocked/refused response through Guardrails
Flight Notes
oc get evalhub -n redhat-ods-applicationsβ check EvalHub CR status- Use the
evalhubCLI:evalhub jobs list --url <evalhub-route>
β Done when: An EvalHub job completes and NeMo Guardrails demonstrably filters a test prompt.
π Full Mission Dossier
Systems Engineering βοΈ
- Follow the MaaS prerequisites: database secret, TLS, and dashboard config flags
- Verify the MaaS gateway is healthy
Science Crew π»
- Publish a model through MaaS (GPT-OSS-20B if you completed the Moon mission)
- Create a subscription/token for a consumer
- Call the governed endpoint with
curlor a small app
Telemetry
- MaaS gateway pod is
Runninginredhat-ods-applications - A published model appears in the MaaS catalog in the dashboard
curlwith a valid token returns a model response; without a token returns401
Flight Notes
oc get modelmesh -n redhat-ods-applicationsβ verify gateway health- Token-based auth: include
Authorization: Bearer <token>incurlrequests
β Done when: A model is consumed through the MaaS API with authentication.
π Full Mission Dossier
Build an agentic app that retrieves from your own documents and orchestrates tool calls through OGX β an OpenAI-compatible agentic API server.
Systems Engineering βοΈ
- Deploy OGX on the cluster (or enable it as a model-serving backend)
- Provide vector-store backing storage (e.g. Milvus, PGVector, or OGX vector stores)
- Expose OGX and any dependent services via Routes
- Wire your model or MaaS endpoint as the LLM backend for OGX
Science Crew π»
- Upload documents to a vector store via the OGX Files / Vector Stores API
- Build a RAG flow using the OGX Responses API (agentic file search + tool calling)
- Try an agentic starter kit (e.g. LangGraph agentic RAG) on OpenShift
- Ask questions that require retrieving from your documents β confirm the agent cites the right sources
π Enterprise RAG chatbot on OpenShift AI
Telemetry
- OGX
/healthendpoint returns200 - A document uploaded to the vector store is retrievable via the Files API
- An agentic query returns an answer that cites content from your uploaded documents
Flight Notes
curl <ogx-route>/v1/modelsβ verify OGX sees your LLM backend- Use
OPENAI_BASE_URL=<ogx-route>/v1with any OpenAI-compatible client
β Done when: An agentic RAG application answers from your own data through OGX.
π Full Mission Dossier
Optional missions once you've completed the solar system. Split up and report back β each person picks one area.
Model Registry
- Systems Engineering βοΈ: Enable the component, provision backend storage
- Science Crew π»: Register a model version from a pipeline or training run
Feature store (Feast)
- Systems Engineering βοΈ: Enable Feast operator
- Science Crew π»: Connect a notebook to an online feature store
Model catalog deep dive
- Science Crew π»: Use Model Performance view to compare validated models for a workload type
TrustyAI β bias & drift
- Science Crew π»: Run a bias or data-drift check on a model beyond EvalHub benchmarks
π OpenShift AI 3.4 documentation hub
β Done when: Each teammate demos one bonus feature in 5 minutes.
- Mix the crews βοΈ + π» throughout β but respect the zones: Systems Engineering βοΈ leads from the Sun through Mars; Science Crew π» leads beyond the asteroid belt.
- Venus track B (fake GPU) is great when AWS GPU quota is tight; switch to track A before Earth.
- Earth before Moon β learn model serving in general first; GPT-OSS-20B on the Moon is optional and hardware-dependent.
- The asteroid belt βοΈ is your checkpoint: don't cross to Jupiter until Mars (distributed training + Kueue) is working.
- Jupiter β Saturn β Uranus β Neptune β experiment, evaluate and guardrail, govern with MaaS, then build agentic apps.
- Deep Space topics are optional stretch goals for teams that finish early.
- For automation examples, see alvarolop/rhoai-gitops.
