LCORE-941: Use proper vLLM image for prow RHOAI tests #1014
Conversation
- Enabled GPU for vLLM model serving in the Prow e2e test environment.
- Added GPU node tolerations to ensure pods are scheduled on GPU-enabled nodes.
- Used the dynamic llama-stack image built from the LCS repository Containerfile.
- Used the latest vLLM image from the Red Hat template.
Walkthrough
This PR extends the RHOAI end-to-end test infrastructure with GPU support: it adds Nvidia GPU Operator and Node Feature Discovery manifests, introduces a GPU-enabled vLLM runtime and inference service, parameterizes container images, disables OpenAI in favor of vLLM with Llama-3.2-1B-Instruct, and enhances the pipeline scripts with comprehensive GPU setup and verification.
Changes
Sequence Diagram
sequenceDiagram
participant Pipeline as Pipeline (Main)
participant Bootstrap as Bootstrap<br/>(Operators & OLM)
participant GPUSetup as GPU Setup<br/>(NFD & GPU Operator)
participant Cluster as Kubernetes<br/>Cluster
participant OLM as OLM<br/>Operator Hub
participant DeployVLLM as Deploy vLLM<br/>(GPU)
Pipeline->>Bootstrap: Run bootstrap.sh<br/>(apply operators)
Bootstrap->>Cluster: Apply OperatorGroups<br/>(nvidia-gpu-operator, openshift-nfd)
Bootstrap->>OLM: Create Subscriptions<br/>(gpu-operator, nfd)
OLM-->>Cluster: Create InstallPlan &<br/>Install Operators
Bootstrap-->>Pipeline: Operators ready
Pipeline->>GPUSetup: Run gpu-setup.sh
GPUSetup->>Cluster: Apply NFD DaemonSet manifest
GPUSetup->>Cluster: Wait for nfd-worker pods
GPUSetup->>Cluster: Patch NFD with GPU tolerations
GPUSetup->>Cluster: Apply GPU ClusterPolicy
GPUSetup->>Cluster: Wait for GPU operator pods
GPUSetup->>Cluster: Wait for GPU node labels<br/>(nvidia.com/gpu.present)
GPUSetup->>Cluster: Monitor GPU capacity
GPUSetup-->>Pipeline: GPU infrastructure ready
Pipeline->>DeployVLLM: Run deploy-vllm.sh
DeployVLLM->>Cluster: Wait for GPU nodes<br/>& capacity > 0
DeployVLLM->>Cluster: Apply vllm-runtime-gpu.yaml
DeployVLLM->>Cluster: Apply vllm-inference-service-gpu.yaml
DeployVLLM-->>Pipeline: vLLM service deployed
Pipeline->>Cluster: Verify service readiness<br/>& run API tests
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml`:
- Around line 17-21: In vllm-runtime-gpu.yaml remove the trailing spaces from
the vLLM CLI flags so Kubernetes doesn't pass malformed args: fix the args
entries that contain "--tool-call-parser " and "--chat-template " to be
"--tool-call-parser" and "--chat-template" (and ensure their corresponding
values like "llama3_json" and
"/mnt/chat-template/tool_chat_template_llama3.2_json.jinja" remain separate
array items or properly paired) so vLLM's argparse recognizes the flags and the
container can start.
In `@tests/e2e-prow/rhoai/scripts/bootstrap.sh`:
- Around line 56-80: The retry after recreating the subscription only checks
once after a 10s sleep, so reset the wait loop state and re-enter the same until
loop to give the recreated subscription the full timeout; specifically, inside
the branch where you delete and reapply the subscription (the block manipulating
subscription gpu-operator-certified and applying
"$BASE_DIR/manifests/operators/operators.yaml"), set elapsed=0 (or set a retried
flag and use continue) so execution jumps back into the existing until oc get
installplan -n nvidia-gpu-operator ... | grep -q . loop and retries for the full
timeout instead of a single one-time check.
- Around line 23-45: operatorgroup.yaml does not create the nvidia-gpu-operator
namespace, so calling oc wait namespace/nvidia-gpu-operator can fail; before
running oc wait or applying operatorgroup.yaml, apply the manifest that creates
the nvidia-gpu-operator namespace (the Namespace manifest that declares
metadata.name: nvidia-gpu-operator), ensure the namespace exists (e.g., oc apply
<namespace-manifest> and optionally oc wait --for=condition=Created
namespace/nvidia-gpu-operator), then continue to apply operatorgroup.yaml,
operators.yaml and the existing oc wait/operatorgroup polling logic.
In `@tests/e2e-prow/rhoai/scripts/deploy-vllm.sh`:
- Around line 32-78: The loop incorrectly checks only the first node via
jsonpath '.items[0].status.capacity.nvidia.com/gpu', causing false timeouts in
multi-node clusters; change the until condition and the capacity assignment to
inspect all nodes (e.g., use jsonpath that ranges over .items[*] or pipe to jq)
and succeed if any node has a non-empty, non-zero GPU value. Update the two oc
calls in the until condition and the capacity=$(...) assignment to collect all
GPU capacities (e.g., '{range .items[*]}{.status.capacity.nvidia\.com/gpu}{"
"}{end}') and test that at least one token != "" and != "0"; keep the existing
debug oc get nodes output unchanged.
In `@tests/e2e-prow/rhoai/scripts/gpu-setup.sh`:
- Around line 163-165: The loop uses an until condition that only checks
capacity via oc get nodes ... status.capacity.nvidia.com/gpu; update that
condition to also check allocatable (status.allocatable.nvidia.com/gpu) is
present and non-zero, so the until stops only when both capacity and allocatable
are non-empty and not "0". Modify the existing until expression (the oc get
nodes calls in the loop) to include the allocatable checks alongside the
capacity checks, leaving the elapsed/timeout break logic (variables elapsed and
timeout) and the existing echo/error handling unchanged.
🧹 Nitpick comments (4)
tests/e2e-prow/rhoai/scripts/gpu-setup.sh (1)
57-74: Avoid overwriting existing NFD tolerations.
Line 57–74 replaces the tolerations array if it already exists, which can drop preconfigured tolerations. Consider appending instead.♻️ Safer append/initialize approach
+# Ensure tolerations array exists, then append
+if ! oc get daemonset nfd-worker -n openshift-nfd -o jsonpath='{.spec.template.spec.tolerations}' | grep -q .; then
+  oc patch daemonset nfd-worker -n openshift-nfd --type=json -p='[
+    { "op": "add", "path": "/spec/template/spec/tolerations", "value": [] }
+  ]'
+fi
+oc patch daemonset nfd-worker -n openshift-nfd --type=json -p='[
+  { "op": "add", "path": "/spec/template/spec/tolerations/-",
+    "value": { "key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule" } },
+  { "op": "add", "path": "/spec/template/spec/tolerations/-",
+    "value": { "key": "gpu", "operator": "Exists", "effect": "NoSchedule" } }
+]'
-oc patch daemonset nfd-worker -n openshift-nfd --type=json -p='[
-  {
-    "op": "add",
-    "path": "/spec/template/spec/tolerations",
-    "value": [
-      {
-        "key": "nvidia.com/gpu",
-        "operator": "Exists",
-        "effect": "NoSchedule"
-      },
-      {
-        "key": "gpu",
-        "operator": "Exists",
-        "effect": "NoSchedule"
-      }
-    ]
-  }
-]'
tests/e2e-prow/rhoai/pipeline-services.sh (1)
5-9: Guard required env vars before envsubst.
If LLAMA_STACK_IMAGE (or any substituted var) is unset, envsubst will blank it and apply a broken manifest. Consider a fast-fail check.
🛠️ Suggested fix
+: "${LLAMA_STACK_IMAGE:?LLAMA_STACK_IMAGE must be set}"
 envsubst < "$BASE_DIR/manifests/lightspeed/llama-stack.yaml" | oc apply -f -
tests/e2e-prow/rhoai/pipeline.sh (2)
13-21: Consider extracting the last FROM for multi-stage Containerfiles.
grep -m1 '^FROM' selects the first stage; if the Containerfile is multi-stage, this may return a builder image instead of the runtime image. Using the last FROM is safer.
🛠️ Suggested fix
-LLAMA_STACK_IMAGE=$(curl -sL https://raw.githubusercontent.com/lightspeed-core/lightspeed-stack/main/test.containerfile | grep -m1 '^FROM' | awk '{print $2}')
+LLAMA_STACK_IMAGE=$(curl -sL https://raw.githubusercontent.com/lightspeed-core/lightspeed-stack/main/test.containerfile | awk '/^FROM/ {img=$2} END {print img}')
107-154: Recreate the curl pod if it exists but isn't Running.
If a previous run left vllm-test-curl in Failed/Completed/CrashLoop, oc exec will fail repeatedly. Consider verifying the phase and recreating the pod when it isn't Running.
🛠️ Suggested fix
-  # Create a temporary pod for testing (if it doesn't exist)
-  if ! oc get pod vllm-test-curl -n "$NAMESPACE" &>/dev/null; then
+  # Recreate the test pod if it exists but isn't Running
+  if oc get pod vllm-test-curl -n "$NAMESPACE" &>/dev/null; then
+    phase=$(oc get pod vllm-test-curl -n "$NAMESPACE" -o jsonpath='{.status.phase}')
+    if [[ "$phase" != "Running" ]]; then
+      oc delete pod vllm-test-curl -n "$NAMESPACE" --ignore-not-found=true
+    fi
+  fi
+
+  # Create a temporary pod for testing (if it doesn't exist)
+  if ! oc get pod vllm-test-curl -n "$NAMESPACE" &>/dev/null; then
     oc run vllm-test-curl --image=curlimages/curl:latest \
       --restart=Never -n "$NAMESPACE" -- sleep 3600
     oc wait --for=condition=Ready pod/vllm-test-curl -n "$NAMESPACE" --timeout=60s
   fi
- --enable-auto-tool-choice
- --tool-call-parser
- llama3_json
- --chat-template
- /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and examine the exact content at lines 17-21
if [ -f "tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml" ]; then
echo "=== File content lines 17-21 with special chars shown ==="
sed -n '17,21p' "tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml" | cat -A
echo ""
echo "=== Full context lines 10-25 ==="
sed -n '10,25p' "tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml" | cat -n
else
echo "File not found"
find . -name "vllm-runtime-gpu.yaml" -type f
fi
Repository: lightspeed-core/lightspeed-stack
Length of output: 885
🌐 Web query:
vLLM argument parser trailing spaces CLI flags parsing behavior
💡 Result:
Short answer: vLLM’s CLI uses a FlexibleArgumentParser built on Python argparse, so flag-parsing follows standard argparse rules — trailing whitespace inside an argument value is preserved by the parser but the shell will usually strip/unpack whitespace unless you quote the value. For file/path arguments and complex JSON-style args, you must quote or escape spaces; vLLM has reported issues when unescaped spaces appear in local file paths. References and guidance below.
- vLLM’s parser: vLLM provides FlexibleArgumentParser / argparse helpers (behaves like argparse). Preserve/stripping of trailing spaces is therefore determined by the shell and how you quote the value. [1]
- Reported problems: there is a vLLM GitHub issue where inference failed when there was a space character in a local file path — shows you must quote/escape paths. [2]
- CLI docs: many vLLM flags accept JSON or values; docs recommend usual CLI quoting (use --flag="value with space" or --flag=value). Boolean flags are store_true/store_false and are unaffected by trailing-space values. [3]
Practical tips:
- Quote values containing spaces: --model "/path/with space/model" or --model="/path/with space/model".
- Use the equals form to avoid shell word-splitting: --json-arg='{"k":"v with space"}'.
- For boolean flags (store_true/store_false) trailing spaces aren’t meaningful.
- If you see a bug when quoting correctly, report it with a repro; check/open vLLM issues (the file-path/space issue has been reported). [2]
Sources:
[1] vLLM argparse utilities documentation.
[2] vLLM GitHub issue: "Inference fails when there is space character in local file path."
[3] vLLM CLI reference / flags documentation.
Remove trailing spaces in vLLM flags to prevent startup failure.
Lines 19 and 21 contain trailing spaces after --tool-call-parser and --chat-template. Kubernetes passes the args array directly to the container without shell processing, so these flags become --tool-call-parser and --chat-template (with spaces). The vLLM argparse parser won't recognize these as valid flags and will reject them as unrecognized arguments, causing the container to fail on startup.
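To make the failure mode concrete, here is a minimal sketch using plain argparse via a python3 heredoc; vLLM's FlexibleArgumentParser builds on argparse, so this is illustrative rather than a vLLM reproduction, and the option name is reused only for the example:

```bash
# Illustration only: plain argparse, not vLLM's FlexibleArgumentParser.
python3 - <<'EOF'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--tool-call-parser")

# Clean flag: matched and consumed as expected.
ns, extras = parser.parse_known_args(["--tool-call-parser", "llama3_json"])
print(ns.tool_call_parser, extras)   # -> llama3_json []

# Flag with a trailing space, exactly as Kubernetes passes it from the args array:
# the token no longer matches the registered option, so both items are left over
# (parse_args would exit with "unrecognized arguments").
ns, extras = parser.parse_known_args(["--tool-call-parser ", "llama3_json"])
print(ns.tool_call_parser, extras)   # -> None ['--tool-call-parser ', 'llama3_json']
EOF
```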
Fix
- - --tool-call-parser
+ - --tool-call-parser
- llama3_json
- - --chat-template
+ - --chat-template
- /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- --enable-auto-tool-choice
- --tool-call-parser
- llama3_json
- --chat-template
- /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
🤖 Prompt for AI Agents
In `@tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml` around lines 17 -
21, In vllm-runtime-gpu.yaml remove the trailing spaces from the vLLM CLI flags
so Kubernetes doesn't pass malformed args: fix the args entries that contain
"--tool-call-parser " and "--chat-template " to be "--tool-call-parser" and
"--chat-template" (and ensure their corresponding values like "llama3_json" and
"/mnt/chat-template/tool_chat_template_llama3.2_json.jinja" remain separate
array items or properly paired) so vLLM's argparse recognizes the flags and the
container can start.
| echo "--> Applying OperatorGroups from operatorgroup.yaml..." | ||
| oc apply -f "$BASE_DIR/manifests/operators/operatorgroup.yaml" | ||
|
|
||
| sleep 10 | ||
|
|
||
| echo "--> Applying Operator Subscriptions from operators.yaml..." | ||
| oc apply -f "$BASE_DIR/manifests/operators/operators.yaml" | ||
|
|
||
| sleep 10 | ||
|
|
||
| # WAIT FOR GPU OPERATOR NAMESPACE AND OPERATORGROUP | ||
| echo "--> Ensuring GPU Operator namespace and OperatorGroup are ready..." | ||
| oc wait --for=jsonpath='{.status.phase}'=Active namespace/nvidia-gpu-operator --timeout=60s | ||
| echo " -> Waiting for GPU OperatorGroup to be created..." | ||
| until oc get operatorgroup nvidia-gpu-operator-group -n nvidia-gpu-operator &>/dev/null; do | ||
| echo " ...still waiting for OperatorGroup" | ||
| sleep 2 | ||
| done | ||
| echo " -> GPU OperatorGroup ready" | ||
|
|
||
| # Give OLM a moment to process the OperatorGroup before checking subscriptions | ||
| sleep 5 | ||
|
|
🧩 Analysis chain
🏁 Script executed:
find . -name "bootstrap.sh" -o -name "operatorgroup.yaml" -o -name "operators.yaml" | head -20Repository: lightspeed-core/lightspeed-stack
Length of output: 239
🏁 Script executed:
cat -n ./tests/e2e-prow/rhoai/scripts/bootstrap.sh | head -50
Repository: lightspeed-core/lightspeed-stack
Length of output: 2193
🏁 Script executed:
cat -n ./tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
Repository: lightspeed-core/lightspeed-stack
Length of output: 802
🏁 Script executed:
find ./tests/e2e-prow/rhoai/manifests -name "*namespace*" -o -name "*namespaces*" 2>/dev/null | head -20
Repository: lightspeed-core/lightspeed-stack
Length of output: 119
🏁 Script executed:
ls -la ./tests/e2e-prow/rhoai/manifests/namespaces/
Repository: lightspeed-core/lightspeed-stack
Length of output: 310
🏁 Script executed:
cat -n ./tests/e2e-prow/rhoai/manifests/namespaces/nvidia-operator.yaml
Repository: lightspeed-core/lightspeed-stack
Length of output: 211
Ensure the GPU operator namespace exists before oc wait.
oc wait namespace/nvidia-gpu-operator will fail if the namespace doesn't exist. Since operatorgroup.yaml only applies OperatorGroup resources and doesn't create the namespace, apply the namespace manifest first.
🛠️ Suggested fix
echo "--> Ensuring GPU Operator namespace and OperatorGroup are ready..."
+if ! oc get namespace nvidia-gpu-operator &>/dev/null; then
+ oc apply -f "$BASE_DIR/manifests/namespaces/nvidia-operator.yaml"
+fi
oc wait --for=jsonpath='{.status.phase}'=Active namespace/nvidia-gpu-operator --timeout=60s
🤖 Prompt for AI Agents
In `@tests/e2e-prow/rhoai/scripts/bootstrap.sh` around lines 23 - 45,
operatorgroup.yaml does not create the nvidia-gpu-operator namespace, so calling
oc wait namespace/nvidia-gpu-operator can fail; before running oc wait or
applying operatorgroup.yaml, apply the manifest that creates the
nvidia-gpu-operator namespace (the Namespace manifest that declares
metadata.name: nvidia-gpu-operator), ensure the namespace exists (e.g., oc apply
<namespace-manifest> and optionally oc wait --for=condition=Created
namespace/nvidia-gpu-operator), then continue to apply operatorgroup.yaml,
operators.yaml and the existing oc wait/operatorgroup polling logic.
# Verify GPU operator InstallPlan was created before waiting for CSV
echo " -> Verifying GPU Operator InstallPlan was created..."
timeout=120
elapsed=0
until oc get installplan -n nvidia-gpu-operator --no-headers 2>/dev/null | grep -q .; do
  if [ $elapsed -ge $timeout ]; then
    echo " ❌ No InstallPlan created for GPU Operator - this is an OLM issue"
    echo " Attempting to fix by recreating subscription..."
    oc delete subscription gpu-operator-certified -n nvidia-gpu-operator
    sleep 5
    oc apply -f "$BASE_DIR/manifests/operators/operators.yaml"
    sleep 10
    # Try one more time
    if ! oc get installplan -n nvidia-gpu-operator --no-headers 2>/dev/null | grep -q .; then
      echo " ❌ Still no InstallPlan - manual intervention required"
      exit 1
    fi
    break
  fi
  echo " ...waiting for InstallPlan ($elapsed/$timeout seconds)"
  sleep 5
  elapsed=$((elapsed + 5))
done
echo " -> InstallPlan created successfully"
🧩 Analysis chain
🏁 Script executed:
# Check if file exists and show the relevant section
wc -l tests/e2e-prow/rhoai/scripts/bootstrap.sh
head -80 tests/e2e-prow/rhoai/scripts/bootstrap.sh | tail -40
Repository: lightspeed-core/lightspeed-stack
Length of output: 1872
Give the InstallPlan retry a real wait window.
After recreating the subscription, the script checks for an InstallPlan only once after 10s. OLM can take longer, so this can still fail even when it would succeed shortly after. Reset elapsed to 0 and loop back to the wait condition to give the retry the same full timeout window as the initial attempt, similar to how the fix suggests using a retried flag and continue.
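A minimal sketch of that pattern, assuming the script's existing variable names ($timeout, $elapsed, $BASE_DIR) and manifests; this is an illustrative rewrite, not the committed code:

```bash
# Sketch only: reuse the same until loop and reset the clock after recreating the subscription.
retried=false
elapsed=0
until oc get installplan -n nvidia-gpu-operator --no-headers 2>/dev/null | grep -q .; do
  if [ "$elapsed" -ge "$timeout" ]; then
    if [ "$retried" = false ]; then
      echo " No InstallPlan yet - recreating subscription and retrying..."
      oc delete subscription gpu-operator-certified -n nvidia-gpu-operator
      oc apply -f "$BASE_DIR/manifests/operators/operators.yaml"
      retried=true
      elapsed=0      # give the recreated subscription the full timeout window
      continue
    fi
    echo " Still no InstallPlan after retry - manual intervention required"
    exit 1
  fi
  sleep 5
  elapsed=$((elapsed + 5))
done
```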
🤖 Prompt for AI Agents
In `@tests/e2e-prow/rhoai/scripts/bootstrap.sh` around lines 56 - 80, The retry
after recreating the subscription only checks once after a 10s sleep, so reset
the wait loop state and re-enter the same until loop to give the recreated
subscription the full timeout; specifically, inside the branch where you delete
and reapply the subscription (the block manipulating subscription
gpu-operator-certified and applying
"$BASE_DIR/manifests/operators/operators.yaml"), set elapsed=0 (or set a retried
flag and use continue) so execution jumps back into the existing until oc get
installplan -n nvidia-gpu-operator ... | grep -q . loop and retries for the full
timeout instead of a single one-time check.
# Wait for GPU nodes to be labeled by NFD
echo "Waiting for GPU nodes to be labeled by NFD..."
timeout=600 # 10 minutes
elapsed=0
until oc get nodes -l nvidia.com/gpu.present=true --no-headers 2>/dev/null | grep -q .; do
  if [ $elapsed -ge $timeout ]; then
    echo "❌ Timeout waiting for GPU nodes to be labeled"
    exit 1
  fi
  echo "No GPU nodes found yet. Waiting... ($elapsed/$timeout seconds)"
  sleep 10
  elapsed=$((elapsed + 10))
done
echo "✅ GPU nodes detected."

# Wait for GPU capacity to be available
echo "Waiting for GPU capacity to be available on nodes..."
timeout=600 # 10 minutes
elapsed=0
until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \
      [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do
  if [ $elapsed -ge $timeout ]; then
    echo "❌ Timeout waiting for GPU capacity"
    echo "DEBUG: Checking GPU status..."
    oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{"capacity: "}{.status.capacity.nvidia\.com/gpu}{"\t"}{"allocatable: "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
    exit 1
  fi
  capacity=$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null || echo "0")
  echo "GPU capacity: $capacity. Waiting... ($elapsed/$timeout seconds)"
  sleep 10
  elapsed=$((elapsed + 10))
done
echo "✅ GPU capacity available."

# Display GPU node info
echo "GPU nodes ready:"
oc get nodes -l nvidia.com/gpu.present=true -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\\.com/gpu,INSTANCE:.metadata.labels.node\\.kubernetes\\.io/instance-type

echo "Applying vLLM manifests..."

envsubst < "$BASE_DIR/manifests/vllm/vllm-runtime-gpu.yaml" | oc apply -f -

# Wait a moment for the ServingRuntime to be fully persisted before creating the InferenceService
echo "Waiting for ServingRuntime to be ready..."
sleep 5

oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
oc apply -f "$BASE_DIR/manifests/vllm/vllm-inference-service-gpu.yaml"
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and read the relevant lines
if [ -f "tests/e2e-prow/rhoai/scripts/deploy-vllm.sh" ]; then
echo "=== File found. Reading lines 32-78 ==="
sed -n '32,78p' tests/e2e-prow/rhoai/scripts/deploy-vllm.sh
else
echo "File not found"
fi
Repository: lightspeed-core/lightspeed-stack
Length of output: 2232
🏁 Script executed:
# Also check the file size to understand context
wc -l tests/e2e-prow/rhoai/scripts/deploy-vllm.sh
Repository: lightspeed-core/lightspeed-stack
Length of output: 122
Avoid single-node GPU capacity gating.
The capacity loop only inspects the first matching node (.items[0]). In a multi-node GPU setup, if that node hasn't yet reported capacity while another node is ready, the script will timeout unnecessarily. Check for non-zero capacity on any node instead.
🛠️ Suggested fix
-until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \
- [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do
+until oc get nodes -l nvidia.com/gpu.present=true \
+ -o jsonpath='{range .items[*]}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null \
+ | grep -qE '^[1-9]'; do
if [ $elapsed -ge $timeout ]; then
echo "❌ Timeout waiting for GPU capacity"
@@
- capacity=$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null || echo "0")
- echo "GPU capacity: $capacity. Waiting... ($elapsed/$timeout seconds)"
+ capacities=$(oc get nodes -l nvidia.com/gpu.present=true \
+ -o jsonpath='{range .items[*]}{.metadata.name}:{.status.capacity.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null || true)
+ echo -e "GPU capacity:\n$capacities\nWaiting... ($elapsed/$timeout seconds)"
sleep 10
elapsed=$((elapsed + 10))
 done
🤖 Prompt for AI Agents
In `@tests/e2e-prow/rhoai/scripts/deploy-vllm.sh` around lines 32 - 78, The loop
incorrectly checks only the first node via jsonpath
'.items[0].status.capacity.nvidia.com/gpu', causing false timeouts in multi-node
clusters; change the until condition and the capacity assignment to inspect all
nodes (e.g., use jsonpath that ranges over .items[*] or pipe to jq) and succeed
if any node has a non-empty, non-zero GPU value. Update the two oc calls in the
until condition and the capacity=$(...) assignment to collect all GPU capacities
(e.g., '{range .items[*]}{.status.capacity.nvidia\.com/gpu}{" "}{end}') and test
that at least one token != "" and != "0"; keep the existing debug oc get nodes
output unchanged.
| until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \ | ||
| [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do | ||
| if [ $elapsed -ge $timeout ]; then |
🧩 Analysis chain
🏁 Script executed:
find . -name "gpu-setup.sh" -type fRepository: lightspeed-core/lightspeed-stack
Length of output: 119
🏁 Script executed:
cat -n ./tests/e2e-prow/rhoai/scripts/gpu-setup.sh | sed -n '150,180p'
Repository: lightspeed-core/lightspeed-stack
Length of output: 1944
🏁 Script executed:
cat -n ./tests/e2e-prow/rhoai/scripts/gpu-setup.sh | sed -n '180,220p'
Repository: lightspeed-core/lightspeed-stack
Length of output: 1686
Add allocatable checks to the GPU capacity wait loop.
The until condition on lines 163–164 only checks capacity, but the echo message on line 187 indicates the script intends to wait for both capacity and allocatable to be available. The loop exits as soon as capacity is non-zero, even if allocatable remains 0, which prevents GPU scheduling and causes later failures. Kubernetes requires non-zero allocatable for pods to be schedulable.
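As a quick sanity check alongside the fix (an illustrative command, not part of the suggested change), capacity and allocatable can be compared side by side with a custom-columns query:

```bash
# Show capacity vs. allocatable GPUs per labeled node; allocatable can lag capacity
# (e.g. right after the device plugin registers or restarts), which is the window
# this wait loop should cover.
oc get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.nvidia\\.com/gpu,ALLOCATABLE:.status.allocatable.nvidia\\.com/gpu
```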
Proposed fix
-until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \
- [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do
+until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \
+ [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ] && \
+ [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \
+ [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \ | |
| [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do | |
| if [ $elapsed -ge $timeout ]; then | |
| until [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \ | |
| [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}' 2>/dev/null)" != "0" ] && \ | |
| [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)" != "" ] && \ | |
| [ "$(oc get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null)" != "0" ]; do | |
| if [ $elapsed -ge $timeout ]; then |
🤖 Prompt for AI Agents
In `@tests/e2e-prow/rhoai/scripts/gpu-setup.sh` around lines 163 - 165, The loop
uses an until condition that only checks capacity via oc get nodes ...
status.capacity.nvidia.com/gpu; update that condition to also check allocatable
(status.allocatable.nvidia.com/gpu) is present and non-zero, so the until stops
only when both capacity and allocatable are non-empty and not "0". Modify the
existing until expression (the oc get nodes calls in the loop) to include the
allocatable checks alongside the capacity checks, leaving the elapsed/timeout
break logic (variables elapsed and timeout) and the existing echo/error handling
unchanged.
tisnik left a comment
LGTM, thank you
Description
Note: some tests fail because the current logic that restarts the lightspeed-stack Docker container needs to be updated (tracked in a separate ticket).
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
PTAL here to see the prow job rehearse logs.
Summary by CodeRabbit
New Features
Configuration
✏️ Tip: You can customize this high-level summary in your review settings.