Merged
2 changes: 1 addition & 1 deletion README.md
@@ -83,7 +83,7 @@ The `nvidia-dgx` role can install NVIDIA DGX platform software on supported DGX

### Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.
To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This path is useful for learning and local experimentation, but it is a legacy/community-supported lab path and should not be treated as release-grade validation for current GPU clusters.

Consult the [Virtual DeepOps Deployment Guide](virtual/README.md) to build a GPU-enabled virtual cluster with DeepOps.

21 changes: 12 additions & 9 deletions docs/deepops/testing.md
@@ -27,25 +27,28 @@ This can be useful for excluding specific roles that have known issues or are st

## DeepOps end-to-end testing

The DeepOps project leverages a private Jenkins server to run continuous integration tests. Testing is done using the [virtual](../../virtual) deployment mechanism. Several Vagrant VMs are created, the cluster is deployed, tests are executed, and then the VMs are destroyed.
Public DeepOps pull requests are validated with GitHub Actions for setup, linting, CodeQL, and selected Molecule role tests. Those checks catch many packaging and role-regression issues, but they do not replace deployment validation on real GPU systems.

The goal of the DeepOps CI is to prevent bugs from being introduced into the code base and to identify when changes in 3rd party platforms have occurred or impacted the DeepOps deployment mechanisms. In general, K8s and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are 3rd party open source tools that may silently fail or suddenly change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.
DeepOps also retains a legacy Jenkins/Vagrant test harness in the [jenkins](../../workloads/jenkins) and [virtual](../../virtual) directories. Treat those files as community-supported reference material unless maintainers explicitly say a Jenkins job is still authoritative. New release validation should record the exact environment, operating system, GPU stack, and workload checks used for the pull request or release.

### Testing Method
The goal of DeepOps validation is to prevent bugs from being introduced into the code base and to identify when changes in third-party platforms have affected the DeepOps deployment mechanisms. In general, Kubernetes and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are third-party open source tools that may silently fail or change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.

DeepOps CI contains two types of automated tests:
### Testing Method

- Nightly tests. These are more exhaustive and run on a nightly basis against the `master` branch.
DeepOps currently uses these testing layers:

- PR tests. These are faster and are executed against every open PR when commits are made to `master`. They are also when a commit is made to any DeepOps branch (`release-20.12`, `master`, etc.). Results are integrated into GitHub.
- GitHub Actions for public pull request checks, including setup, linting, CodeQL, and selected Molecule role tests.
- Focused local validation for changed playbooks, roles, scripts, and documentation before opening or updating a pull request.
- GPU-backed deployment validation for changes that affect Slurm, Kubernetes, drivers, container runtimes, DGX platform software, or workload examples.
- Legacy Jenkins/Vagrant jobs as reference material for operators who still run that harness.

In addition to the automated tests, we also provide developers the a method to manually kick off a test run against one or more deployment configurations in parallel from the below testing matrix through the [Jenkins-matrix](../../workloads/jenkins/Jenkinsfile-matrix) Jenkinsfile.
If a change requires GPU-backed validation, document the validation environment and results in the pull request. If that validation cannot be run, state the gap explicitly instead of relying on the legacy Jenkins matrix.
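
As an illustration only (this snippet is not part of the change set), the environment details requested above could be captured with a small shell helper and pasted into the pull request. This is a minimal sketch: the function name `collect_validation_env` and the output layout are hypothetical, and it assumes a Linux host where `/etc/os-release` and, on GPU nodes, `nvidia-smi` are available.

```shell
#!/bin/sh
# Hypothetical helper (not a DeepOps script): print a short summary of the
# validation environment for inclusion in a pull request description.
collect_validation_env() {
    # Operating system name and version, read in a subshell so the
    # variables from /etc/os-release do not leak into the caller.
    if [ -r /etc/os-release ]; then
        os="$(. /etc/os-release && printf '%s' "${PRETTY_NAME:-unknown}")"
    else
        os="unknown"
    fi
    printf 'os: %s\n' "$os"
    printf 'kernel: %s\n' "$(uname -r)"

    # GPU model and driver version. If no NVIDIA stack is present,
    # state the gap explicitly instead of omitting the line.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null |
            sed 's/^/gpu: /'
    else
        printf 'gpu: not available (nvidia-smi not found)\n'
    fi
}

collect_validation_env
```

Workload checks (for example, which Slurm or Kubernetes jobs were run) would still need to be listed by hand; the sketch only covers the host and GPU stack.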

### Tests

A short description of the nightly testing is outlined below. The full suit of tests can be reviewed in the [jenkins](../../workloads/jenkins) directory. Additional details can be found [here](../../workloads/jenkins/README.md).
A short description of the historical Jenkins test matrix is outlined below. The full suite of legacy jobs can be reviewed in the [jenkins](../../workloads/jenkins) directory. These rows are not a promise of current public CI coverage; check the pull request's GitHub Actions and validation notes for current status.

**Testing Matrix**
**Legacy Jenkins Testing Matrix**

| Test | [PR](../../workloads/jenkins/Jenkinsfile) | [Nightly](../../workloads/jenkins/Jenkinsfile-nightly) | [Nightly Multi-node](../../workloads/jenkins/Jenkinsfile-multi-nightly) | Comments |
| --------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------ |
15 changes: 9 additions & 6 deletions virtual/README.md
@@ -1,9 +1,11 @@
# DeepOps Virtual

Set up a virtual cluster with DeepOps. Useful for...
Set up a virtual cluster with DeepOps. This is a legacy/community-supported lab
path for learning and local experimentation; it is not the release validation
path for current GPU clusters. Useful for...

1. Learning how to deploy DeepOps on limited hardware
2. Testing new features in DeepOps
2. Testing small changes in a local lab before validating on real systems
3. Tailoring DeepOps in a local environment before deploying it to the production cluster

## Requirements
@@ -26,6 +28,8 @@ If deploying kubeflow or another resource-intensive application in this environm

Running DeepOps virtually assumes that the host machine's OS is an approved OS. If this is not the case, the scripts used in the steps below may be modified to work with a different OS.

The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Newer operating systems should be validated on real target systems unless the Vagrantfiles have been refreshed and tested for that release.

Also, using VMs and optionally GPU passthrough assumes that the host machine has been configured to enable virtualization in the BIOS. For instructions on how to accomplish this, refer to the sections at the bottom of this README: [Enabling virtualization and GPU passthrough](#enabling-virtualization-and-gpu-passthrough).

## Start the Virtual Cluster
@@ -41,7 +45,7 @@ Also, using VMs and optionally GPU passthrough assumes that the host machine has
2. In the virtual directory, startup vagrant. This will start 3 VMs by default.

```sh
# NOTE: The default VM OS is Ubuntu. If you wish the VMs to spawn CentOS,
# NOTE: The default VM OS is Ubuntu 20.04. If you wish the VMs to spawn CentOS,
# configure the DEEPOPS_VAGRANT_FILE variable accordingly...
# export DEEPOPS_VAGRANT_FILE=$(pwd)/Vagrantfile-centos
# NOTE: virtual-gpu01 requires GPU passthrough, by default it is not enabled
@@ -135,7 +139,7 @@ $ lspci -nnk | grep NVIDIA

In this example, the GPU at `08:00.0` is chosen.

In the `Vagrantfile` there is a "magic string" `#BUS-GPU01` that is utilized in Jenkins automation. This can be updated manually.
In the `Vagrantfile` there is a "magic string" `#BUS-GPU01` that was used by the legacy Jenkins automation. This can be updated manually.

Uncomment the `#BUS-GPU01 v.pci` configuration and update it with a mapping to the bus discovered with `lspci`...

@@ -159,7 +163,7 @@ The default Vagrantfiles create VMs that are very minimal in terms of resources

### Specify the cluster Operating System

By default, all virtual nodes will deploy with Ubuntu 18.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Supported OS versions are Ubuntu 18.04, Ubuntu 20.04, CentOS 7, and CentOS 8.
By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8.

```sh
export DEEPOPS_VAGRANT_OS=centos
@@ -355,4 +359,3 @@ $ lspci -nnk -d 10de:1db1
```



4 changes: 2 additions & 2 deletions virtual/vagrant_startup.sh
@@ -5,8 +5,8 @@ set -xe
# Get absolute path for script, and convenience vars for virtual and root
VIRT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

# The default Vagrant Operating System is Ubuntu 18.04
# To override thise, change these variables to a supported OS
# The default Vagrant operating system is Ubuntu 20.04.
# To override this, change these variables to a supported OS.
DEEPOPS_VAGRANT_OS=${DEEPOPS_VAGRANT_OS:-ubuntu}
DEEPOPS_OS_VERSION=${DEEPOPS_OS_VERSION:-20.04}

12 changes: 8 additions & 4 deletions workloads/jenkins/README.md
@@ -1,8 +1,13 @@
# Jenkins Files

We have several Jenkinsfiles. There is one that is meant to be a lightweight verification that quickly runs against all PRs.
This directory contains the legacy Jenkins/Vagrant test harness. Current public
pull request checks run through GitHub Actions; do not assume these Jenkins jobs
are authoritative unless maintainers explicitly enable and reference them for a
specific validation run.

In addition to that we have several which are meant to run nightly and be more robust checks on functionality to check if dependencies have broken.
We have several Jenkinsfiles. There is one that was meant to be a lightweight verification that quickly runs against all PRs.

In addition to that we have several which were meant to run nightly and be more robust checks on functionality to check if dependencies have broken.

## Configuration

@@ -14,7 +19,7 @@ lock(resource: null, label: 'gpu', quantity: 1, variable: 'GPUDATA')

## Jenkinsfile

This is the original Jenkinsfile that runs every time a PR is created. It does a quick test to verify:
This is the original Jenkinsfile that ran every time a PR was created. It does a quick test to verify:

* K8S deploys
* Slurm Deploys
@@ -36,4 +41,3 @@ This does everything `Jenkinsfile-nightly` does in addition to:
* Deploys 3 management nodes
* Deploys 2 GPU nodes
* Runs a multi-node GPU Verification
