From 0df3a981eb9334ed0faf3573758200c85f77cbc4 Mon Sep 17 00:00:00 2001
From: Doug Holt
Date: Thu, 28 May 2026 11:51:00 -0600
Subject: [PATCH] docs: clarify legacy CI and virtual lab status

---
 README.md                   |  2 +-
 docs/deepops/testing.md     | 21 ++++++++++++---------
 virtual/README.md           | 15 +++++++++------
 virtual/vagrant_startup.sh  |  4 ++--
 workloads/jenkins/README.md | 12 ++++++++----
 5 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
index 68df95a20..2a5a0b7ee 100644
--- a/README.md
+++ b/README.md
@@ -83,7 +83,7 @@ The `nvidia-dgx` role can install NVIDIA DGX platform software on supported DGX
 
 ### Virtual
 
-To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.
+To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This path is useful for learning and local experimentation, but it is a legacy/community-supported lab path and should not be treated as release-grade validation for current GPU clusters.
 
 Consult the [Virtual DeepOps Deployment Guide](virtual/README.md) to build a GPU-enabled virtual cluster with DeepOps.
 
diff --git a/docs/deepops/testing.md b/docs/deepops/testing.md
index c42b116f6..c4fc6c3b6 100644
--- a/docs/deepops/testing.md
+++ b/docs/deepops/testing.md
@@ -27,25 +27,28 @@ This can be useful for excluding specific roles that have known issues or are st
 
 ## DeepOps end-to-end testing
 
-The DeepOps project leverages a private Jenkins server to run continuous integration tests. Testing is done using the [virtual](../../virtual) deployment mechanism. Several Vagrant VMs are created, the cluster is deployed, tests are executed, and then the VMs are destroyed.
+Public DeepOps pull requests are validated with GitHub Actions for setup, linting, CodeQL, and selected Molecule role tests. Those checks catch many packaging and role-regression issues, but they do not replace deployment validation on real GPU systems.
 
-The goal of the DeepOps CI is to prevent bugs from being introduced into the code base and to identify when changes in 3rd party platforms have occurred or impacted the DeepOps deployment mechanisms. In general, K8s and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are 3rd party open source tools that may silently fail or suddenly change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.
+DeepOps also retains a legacy Jenkins/Vagrant test harness in the [jenkins](../../workloads/jenkins) and [virtual](../../virtual) directories. Treat those files as community-supported reference material unless maintainers explicitly say a Jenkins job is still authoritative. New release validation should record the exact environment, operating system, GPU stack, and workload checks used for the pull request or release.
 
-### Testing Method
+The goal of DeepOps validation is to prevent bugs from being introduced into the code base and to identify when changes in third-party platforms have affected the DeepOps deployment mechanisms. In general, Kubernetes and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are third-party open source tools that may silently fail or change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.
 
-DeepOps CI contains two types of automated tests:
+### Testing Method
 
-- Nightly tests. These are more exhaustive and run on a nightly basis against the `master` branch.
+DeepOps currently uses these testing layers:
 
-- PR tests. These are faster and are executed against every open PR when commits are made to `master`. They are also when a commit is made to any DeepOps branch (`release-20.12`, `master`, etc.). Results are integrated into GitHub.
+- GitHub Actions for public pull request checks, including setup, linting, CodeQL, and selected Molecule role tests.
+- Focused local validation for changed playbooks, roles, scripts, and documentation before opening or updating a pull request.
+- GPU-backed deployment validation for changes that affect Slurm, Kubernetes, drivers, container runtimes, DGX platform software, or workload examples.
+- Legacy Jenkins/Vagrant jobs as reference material for operators who still run that harness.
 
-In addition to the automated tests, we also provide developers the a method to manually kick off a test run against one or more deployment configurations in parallel from the below testing matrix through the [Jenkins-matrix](../../workloads/jenkins/Jenkinsfile-matrix) Jenkinsfile.
+If a change requires GPU-backed validation, document the validation environment and results in the pull request. If that validation cannot be run, state the gap explicitly instead of relying on the legacy Jenkins matrix.
 
 ### Tests
 
-A short description of the nightly testing is outlined below. The full suit of tests can be reviewed in the [jenkins](../../workloads/jenkins) directory. Additional details can be found [here](../../workloads/jenkins/README.md).
+A short description of the historical Jenkins test matrix is outlined below. The full suite of legacy jobs can be reviewed in the [jenkins](../../workloads/jenkins) directory. These rows are not a promise of current public CI coverage; check the pull request's GitHub Actions and validation notes for current status.
 
-**Testing Matrix**
+**Legacy Jenkins Testing Matrix**
 
 | Test | [PR](../../workloads/jenkins/Jenkinsfile) | [Nightly](../../workloads/jenkins/Jenkinsfile-nightly) | [Nightly Multi-node](../../workloads/jenkins/Jenkinsfile-multi-nightly) | Comments |
 | --------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------ |
diff --git a/virtual/README.md b/virtual/README.md
index 72d4156b6..19db58a21 100644
--- a/virtual/README.md
+++ b/virtual/README.md
@@ -1,9 +1,11 @@
 # DeepOps Virtual
 
-Set up a virtual cluster with DeepOps. Useful for...
+Set up a virtual cluster with DeepOps. This is a legacy/community-supported lab
+path for learning and local experimentation; it is not the release validation
+path for current GPU clusters. Useful for...
 
 1. Learning how to deploy DeepOps on limited hardware
-2. Testing new features in DeepOps
+2. Testing small changes in a local lab before validating on real systems
 3. Tailoring DeepOps in a local environment before deploying it to the production cluster
 
 ## Requirements
@@ -26,6 +28,8 @@ If deploying kubeflow or another resource-intensive application in this environm
 
 Running DeepOps virtually assumes that the host machine's OS is an approved OS. If this is not the case, the scripts used in the steps below may be modified to work with a different OS.
 
+The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Newer operating systems should be validated on real target systems unless the Vagrantfiles have been refreshed and tested for that release.
+
 Also, using VMs and optionally GPU passthrough assumes that the host machine has been configured to enable virtualization in the BIOS. For instructions on how to accomplish this, refer to the sections at the bottom of this README: [Enabling virtualization and GPU passthrough](#enabling-virtualization-and-gpu-passthrough).
 
 ## Start the Virtual Cluster
@@ -41,7 +45,7 @@ Also, using VMs and optionally GPU passthrough assumes that the host machine has
 2. In the virtual directory, startup vagrant. This will start 3 VMs by default.
 
 ```sh
 # NOTE: The default VM OS is Ubuntu 20.04. If you wish the VMs to spawn CentOS,
 # configure the DEEPOPS_VAGRANT_FILE variable accordingly...
 # export DEEPOPS_VAGRANT_FILE=$(pwd)/Vagrantfile-centos
 # NOTE: virtual-gpu01 requires GPU passthrough, by default it is not enabled
@@ -135,7 +139,7 @@ $ lspci -nnk | grep NVIDIA
 
 In this example, the GPU at `08:00.0` is chosen.
 
-In the `Vagrantfile` there is a "magic string" `#BUS-GPU01` that is utilized in Jenkins automation. This can be updated manually.
+In the `Vagrantfile` there is a "magic string" `#BUS-GPU01` that was used by the legacy Jenkins automation. This can be updated manually.
 
 Uncomment the `#BUS-GPU01 v.pci` configuration and update it with a mapping to the bus discovered with `lspci`...
 
@@ -159,7 +163,7 @@ The default Vagrantfiles create VMs that are very minimal in terms of resources
 
 ### Specify the cluster Operating System
 
-By default, all virtual nodes will deploy with Ubuntu 18.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Supported OS versions are Ubuntu 18.04, Ubuntu 20.04, CentOS 7, and CentOS 8.
+By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8.
 
 ```sh
 export DEEPOPS_VAGRANT_OS=centos
@@ -355,4 +359,3 @@ $ lspci -nnk -d 10de:1db1
 ```
 
 
-
diff --git a/virtual/vagrant_startup.sh b/virtual/vagrant_startup.sh
index f7f9bf1b2..6dd3aaf38 100755
--- a/virtual/vagrant_startup.sh
+++ b/virtual/vagrant_startup.sh
@@ -5,8 +5,8 @@ set -xe
 # Get absolute path for script, and convenience vars for virtual and root
 VIRT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
 
-# The default Vagrant Operating System is Ubuntu 18.04
-# To override thise, change these variables to a supported OS
+# The default Vagrant operating system is Ubuntu 20.04.
+# To override this, change these variables to a supported OS.
 DEEPOPS_VAGRANT_OS=${DEEPOPS_VAGRANT_OS:-ubuntu}
 DEEPOPS_OS_VERSION=${DEEPOPS_OS_VERSION:-20.04}
 
diff --git a/workloads/jenkins/README.md b/workloads/jenkins/README.md
index 9f8ea1abb..c97f719f4 100644
--- a/workloads/jenkins/README.md
+++ b/workloads/jenkins/README.md
@@ -1,8 +1,13 @@
 # Jenkins Files
 
-We have several Jenkinsfiles. There is one that is meant to be a lightweight verification that quickly runs against all PRs.
+This directory contains the legacy Jenkins/Vagrant test harness. Current public
+pull request checks run through GitHub Actions; do not assume these Jenkins jobs
+are authoritative unless maintainers explicitly enable and reference them for a
+specific validation run.
 
-In addition to that we have several which are meant to run nightly and be more robust checks on functionality to check if dependencies have broken.
+We have several Jenkinsfiles. There is one that was meant to be a lightweight verification that quickly runs against all PRs.
+
+In addition to that we have several which were meant to run nightly and be more robust checks on functionality to check if dependencies have broken.
 
 ## Configuration
 
@@ -14,7 +19,7 @@ lock(resource: null, label: 'gpu', quantity: 1, variable: 'GPUDATA')
 
 ## Jenkinsfile
 
-This is the original Jenkinsfile that runs every time a PR is created. It does a quick test to verify:
+This is the original Jenkinsfile that ran every time a PR was created. It does a quick test to verify:
 
 * K8S deploys
 * Slurm Deploys
@@ -36,4 +41,3 @@ This does everything `Jenkinsfile-nightly` does in addition to:
 * Deploys 3 management nodes
 * Deploys 2 GPU nodes
 * Runs a multi-node GPU Verification
-