20 changes: 11 additions & 9 deletions README.md
@@ -36,20 +36,22 @@ It is recommended to use the latest release branch for stable code (linked above

### Provisioning System

The provisioning system is used to orchestrate the running of all playbooks and one will be needed when instantiating Kubernetes or Slurm clusters. Supported operating systems which are tested and supported include:
The provisioning system is used to orchestrate the running of all playbooks and one will be needed when instantiating Kubernetes or Slurm clusters. Current release validation focuses on:

- NVIDIA DGX OS 4, 5, 6, 7
- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
- CentOS 7, 8
- Ubuntu 22.04 LTS and 24.04 LTS
- NVIDIA DGX OS 6 and 7

DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, and CentOS 7/8. Treat those paths as compatibility references unless your site validates them for the release you deploy.

### Cluster System

The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system but it is not required.
The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system but it is not required. Current release validation focuses on:

- Ubuntu 22.04 LTS and 24.04 LTS for generic Kubernetes and Slurm deployments
- NVIDIA DGX OS 6 and 7 for DGX systems
- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation through the `nvidia-dgx` role

- NVIDIA DGX OS 4, 5, 6, 7
- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
- CentOS 7, 8
- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for the DGX software stack through the `nvidia-dgx` role
DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, CentOS 7/8, and the historical DGX EL7 stack. Treat those paths as compatibility references unless your site validates them for the release you deploy.

You may also install a supported operating system on all servers via a 3rd-party solution such as [MAAS](https://maas.io/) or [Foreman](https://www.theforeman.org/), or via an existing site-standard automated installer.
For new Ubuntu 24.04 or DGX OS 7 deployments, prefer Ubuntu autoinstall/cloud-init or MAAS and then apply DeepOps roles after the OS is present.
29 changes: 9 additions & 20 deletions docs/deepops/testing.md
@@ -48,16 +48,16 @@ If a change requires GPU-backed validation, document the validation environment

A short description of the historical Jenkins test matrix is outlined below. The full suite of legacy jobs can be reviewed in the [jenkins](../../workloads/jenkins) directory. These rows are not a promise of current public CI coverage; check the pull request's GitHub Actions and validation notes for current status.

**Legacy Jenkins Testing Matrix**
**Validation Matrix**

| Test | [PR](../../workloads/jenkins/Jenkinsfile) | [Nightly](../../workloads/jenkins/Jenkinsfile-nightly) | [Nightly Multi-node](../../workloads/jenkins/Jenkinsfile-multi-nightly) | Comments |
| --------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------ |
| Ubuntu 18.04 | x | x | x | |
| Ubuntu 20.04 | | x | x | |
| Ubuntu 18.04 | x | x | x | Legacy Jenkins/Vagrant reference only |
| Ubuntu 20.04 | | x | x | Legacy Jenkins/Vagrant reference only |
| Ubuntu 22.04 | | | | setup.sh and Molecule GitHub Actions |
| Ubuntu 24.04 | | | | setup.sh and Molecule GitHub Actions |
| CentOS 7 | | x | x | |
| CentOS | | | x | |
| CentOS 7 | | x | x | Legacy Jenkins/Vagrant reference only |
| CentOS 8 | | | x | Legacy Jenkins/Vagrant reference only |
| DGX OS | | | | Syntax-checked only; full validation requires DGX hardware |
| RHEL | | | | DGX software-stack role syntax-checked only; full validation requires DGX hardware and subscriptions |
| 1 mgmt node | x | x | | |
@@ -123,30 +123,19 @@ molecule init scenario -r <your-role> --driver-name docker
```

4. In the file `molecule/default/molecule.yml`, define the list of platforms to be tested.
DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, EL7, and EL8.
The DGX software stack role also supports Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation.
To test these stacks, the following `platforms` stanza can be used.
DeepOps currently uses Ubuntu 22.04 and Ubuntu 24.04 for setup and Molecule GitHub Actions.
Add Red Hat family images only for roles that explicitly support them, and validate the image choice for that role.
Keep Ubuntu 18.04, Ubuntu 20.04, CentOS 7, and CentOS 8 scenarios in separately named legacy test scenarios when maintaining older compatibility paths.
To test the current Ubuntu stacks, the following `platforms` stanza can be used.

```yaml
platforms:
- name: ubuntu-1804
image: geerlingguy/docker-ubuntu1804-ansible
pre_build_image: true
- name: ubuntu-2004
image: geerlingguy/docker-ubuntu2004-ansible
pre_build_image: true
- name: ubuntu-2204
image: geerlingguy/docker-ubuntu2204-ansible
pre_build_image: true
- name: ubuntu-2404
image: geerlingguy/docker-ubuntu2404-ansible
pre_build_image: true
- name: centos-7
image: geerlingguy/docker-centos7-ansible
pre_build_image: true
- name: centos-8
image: geerlingguy/docker-centos8-ansible
pre_build_image: true
```

5. If you haven't already, define your role's metadata in the file `meta/main.yml`.
9 changes: 5 additions & 4 deletions docs/ngc-ready/README.md
@@ -14,10 +14,11 @@ These instructions assume the following:

- You have a NGC-Ready server. To determine if your server is NGC-Ready, please review the list of validated servers at the NGC-Ready Server documentation page - https://docs.nvidia.com/certification-programs/ngc-ready-systems/index.html
- Your NGC-Ready Server has a compatible Linux distribution installed:
- Ubuntu Server 20.04 LTS
- Ubuntu Server 22.04 LTS
- Ubuntu Server 24.04 LTS
- CentOS 7
- Red Hat Enterprise Linux / Rocky Linux 8 or 9 when the referenced roles are validated for your server

Legacy Ubuntu 20.04 and CentOS 7 environments may still work for existing deployments, but they are not current release validation targets.

## Setup

@@ -41,7 +42,7 @@ This process will install the latest NVIDIA GPU Drivers, and Docker with the NVI
# <ip-of-host>: IP of NGC-Ready server, or localhost. The trailing comma is required
# If SSH requires a password, add: -k
# If sudo requires a password, add: -K
ansible-playbook -u <ssh-user> -i <ip-of-host>, playbooks/ngc-ready.yml
ansible-playbook -u <ssh-user> -i <ip-of-host>, playbooks/ngc-ready-server.yml
```

## Testing
@@ -55,5 +56,5 @@ This process will test the functionality of the NGC-Ready server by running a fu
# <ip-of-host>: IP of NGC-Ready server, or localhost. The trailing comma is required
# If SSH requires a password, add: -k
# If sudo requires a password, add: -K
ansible-playbook -u <ssh-user> -i <ip-of-host>, playbooks/ngc-ready.yml --tags test
ansible-playbook -u <ssh-user> -i <ip-of-host>, playbooks/ngc-ready-server.yml --tags test
```
2 changes: 1 addition & 1 deletion docs/slurm-cluster/slurm-single-node.md
@@ -15,7 +15,7 @@ Single Node Slurm Deployment Guide

## Introduction

The general requirements and procedure for Slurm setup via deepops is documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline the steps to deviate from the general setup to enable single node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date in a stable state with GPU drivers already installed and functional. The supported operating systems are Ubuntu 18.04, 20.04, 22.04, and 24.04; CentOS 7 and 8; and RHEL 7 and 8, with RHEL 8 preferred among the RHEL paths.
The general requirements and procedure for Slurm setup via deepops is documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline the steps to deviate from the general setup to enable single node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date in a stable state with GPU drivers already installed and functional. Current release validation should target Ubuntu 22.04 or Ubuntu 24.04. Older Ubuntu, CentOS, and RHEL paths are historical compatibility references and should be validated locally before use.
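
The validation guidance above could be checked on a host before deployment with a small shell sketch like the one below. `classify_os` is a hypothetical helper, not part of DeepOps. It assumes the host exposes `ID` and `VERSION_ID` in `/etc/os-release`, and it encodes only the Ubuntu 22.04 and 24.04 targets named in this guide.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DeepOps): classify a distro ID and
# version against the Ubuntu release validation targets named in this guide.
classify_os() {
    os_id="$1"
    os_version="$2"
    case "${os_id}:${os_version}" in
        ubuntu:22.04|ubuntu:24.04) echo "validated" ;;
        *) echo "legacy-or-unvalidated" ;;
    esac
}

# Read the local host's identity in a subshell so that the variables defined
# by /etc/os-release do not leak into the calling shell.
if [ -r /etc/os-release ]; then
    (
        . /etc/os-release
        classify_os "${ID:-unknown}" "${VERSION_ID:-unknown}"
    )
fi
```

A result of `legacy-or-unvalidated` does not mean the deployment will fail. It only signals that the path should be validated locally before use.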

## Deployment Procedure

4 changes: 2 additions & 2 deletions playbooks/bootstrap/bootstrap-python.yml
@@ -35,13 +35,13 @@
state: present
when: ansible_python.version.major == 3

- name: install epel on EL7
- name: legacy EL7 - install epel
package:
name: epel-release
state: present
when: (ansible_python.version.major == 2) and (ansible_os_family == "RedHat") and (ansible_distribution_major_version == "7")

- name: install python 2 libraries on EL7
- name: legacy EL7 - install python 2 libraries
package:
name:
- python-setuptools
2 changes: 1 addition & 1 deletion roles/singularity_wrapper/tasks/main.yml
@@ -1,5 +1,5 @@
---
- name: centos 8 - ensure powertools installed
- name: legacy CentOS 8 - ensure powertools installed
block:
- name: ensure prereq packages installed
yum:
10 changes: 3 additions & 7 deletions virtual/README.md
@@ -23,12 +23,9 @@ If deploying kubeflow or another resource-intensive application in this environm

### Operating System Requirements

* Ubuntu 18.04 (or greater)
* CentOS 7.6 (or greater)
Running DeepOps virtually assumes that the host machine's OS is suitable for Vagrant, libvirt, and any optional GPU passthrough configuration. Ubuntu 22.04 LTS is the preferred host path for this legacy lab workflow.

Running DeepOps virtually assumes that the host machine's OS is an approved OS. If this is not the case, the scripts used in the steps below may be modified to work with a different OS.

The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Newer operating systems should be validated on real target systems unless the Vagrantfiles have been refreshed and tested for that release.
The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Treat these Vagrantfiles as legacy/community-supported lab references; validate current release work on real target systems unless the Vagrantfiles have been refreshed and tested for that release.

Also, using VMs and optionally GPU passthrough assumes that the host machine has been configured to enable virtualization in the BIOS. For instructions on how to accomplish this, refer to the sections at the bottom of this README: [Enabling virtualization and GPU passthrough](#enabling-virtualization-and-gpu-passthrough).

@@ -163,7 +160,7 @@ The default Vagrantfiles create VMs that are very minimal in terms of resources

### Specify the cluster Operating System

By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8.
By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. These operating system choices are maintained as legacy lab fixtures, not as release validation targets.

```sh
export DEEPOPS_VAGRANT_OS=centos
@@ -358,4 +355,3 @@ $ lspci -nnk -d 10de:1db1
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia_vgpu_vfio, nvidia
```

