diff --git a/README.md b/README.md
index c2dde8826..fd3f8ea84 100644
--- a/README.md
+++ b/README.md
@@ -36,20 +36,22 @@ It is recommended to use the latest release branch for stable code (linked above
 
 ### Provisioning System
 
-The provisioning system is used to orchestrate the running of all playbooks and one will be needed when instantiating Kubernetes or Slurm clusters. Supported operating systems which are tested and supported include:
+The provisioning system is used to orchestrate the running of all playbooks and one will be needed when instantiating Kubernetes or Slurm clusters. Current release validation focuses on:
 
-- NVIDIA DGX OS 4, 5, 6, 7
-- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
-- CentOS 7, 8
+- Ubuntu 22.04 LTS and 24.04 LTS
+- NVIDIA DGX OS 6 and 7
+
+DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, and CentOS 7/8. Treat those paths as compatibility references unless your site validates them for the release you deploy.
 
 ### Cluster System
 
-The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system but it is not required.
+The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system but it is not required. Current release validation focuses on:
+
+- Ubuntu 22.04 LTS and 24.04 LTS for generic Kubernetes and Slurm deployments
+- NVIDIA DGX OS 6 and 7 for DGX systems
+- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation through the `nvidia-dgx` role
 
-- NVIDIA DGX OS 4, 5, 6, 7
-- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
-- CentOS 7, 8
-- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for the DGX software stack through the `nvidia-dgx` role
+DeepOps still retains legacy/community-maintained paths for older environments such as DGX OS 4/5, Ubuntu 18.04/20.04, CentOS 7/8, and the historical DGX EL7 stack. Treat those paths as compatibility references unless your site validates them for the release you deploy.
 
 You may also install a supported operating system on all servers via a 3rd-party solution such as [MAAS](https://maas.io/) or [Foreman](https://www.theforeman.org/), or via an existing site-standard automated installer. For new Ubuntu 24.04 or DGX OS 7 deployments, prefer Ubuntu autoinstall/cloud-init or MAAS and then apply DeepOps roles after the OS is present.
 
diff --git a/docs/deepops/testing.md b/docs/deepops/testing.md
index c4fc6c3b6..95125287c 100644
--- a/docs/deepops/testing.md
+++ b/docs/deepops/testing.md
@@ -48,16 +48,16 @@ If a change requires GPU-backed validation, document the validation environment
 
 A short description of the historical Jenkins test matrix is outlined below. The full suite of legacy jobs can be reviewed in the [jenkins](../../workloads/jenkins) directory. These rows are not a promise of current public CI coverage; check the pull request's GitHub Actions and validation notes for current status.
 
-**Legacy Jenkins Testing Matrix**
+**Validation Matrix**
 
 | Test | [PR](../../workloads/jenkins/Jenkinsfile) | [Nightly](../../workloads/jenkins/Jenkinsfile-nightly) | [Nightly Multi-node](../../workloads/jenkins/Jenkinsfile-multi-nightly) | Comments |
 | --------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------ |
-| Ubuntu 18.04 | x | x | x | |
-| Ubuntu 20.04 | | x | x | |
+| Ubuntu 18.04 | x | x | x | Legacy Jenkins/Vagrant reference only |
+| Ubuntu 20.04 | | x | x | Legacy Jenkins/Vagrant reference only |
 | Ubuntu 22.04 | | | | setup.sh and Molecule GitHub Actions |
 | Ubuntu 24.04 | | | | setup.sh and Molecule GitHub Actions |
-| CentOS 7 | | x | x | |
-| CentOS | | | x | |
+| CentOS 7 | | x | x | Legacy Jenkins/Vagrant reference only |
+| CentOS 8 | | | x | Legacy Jenkins/Vagrant reference only |
 | DGX OS | | | | Syntax-checked only; full validation requires DGX hardware |
 | RHEL | | | | DGX software-stack role syntax-checked only; full validation requires DGX hardware and subscriptions |
 | 1 mgmt node | x | x | | |
@@ -123,30 +123,19 @@ molecule init scenario -r --driver-name docker
 ```
 
 4. In the file `molecule/default/molecule.yml`, define the list of platforms to be tested.
-   DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, EL7, and EL8.
-   The DGX software stack role also supports Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation.
-   To test these stacks, the following `platforms` stanza can be used.
+   DeepOps currently uses Ubuntu 22.04 and Ubuntu 24.04 for setup and Molecule GitHub Actions.
+   Add Red Hat family images only for roles that explicitly support them, and validate the image choice for that role.
+   Keep Ubuntu 18.04, Ubuntu 20.04, CentOS 7, and CentOS 8 scenarios in separately named legacy test scenarios when maintaining older compatibility paths.
+   To test the current Ubuntu stacks, the following `platforms` stanza can be used.
 
 ```yaml
 platforms:
-  - name: ubuntu-1804
-    image: geerlingguy/docker-ubuntu1804-ansible
-    pre_build_image: true
-  - name: ubuntu-2004
-    image: geerlingguy/docker-ubuntu2004-ansible
-    pre_build_image: true
   - name: ubuntu-2204
     image: geerlingguy/docker-ubuntu2204-ansible
     pre_build_image: true
   - name: ubuntu-2404
     image: geerlingguy/docker-ubuntu2404-ansible
     pre_build_image: true
-  - name: centos-7
-    image: geerlingguy/docker-centos7-ansible
-    pre_build_image: true
-  - name: centos-8
-    image: geerlingguy/docker-centos8-ansible
-    pre_build_image: true
 ```
 
 5. If you haven't already, define your role's metadata in the file `meta/main.yml`.
diff --git a/docs/ngc-ready/README.md b/docs/ngc-ready/README.md
index 04b0e2e61..e7618d07b 100644
--- a/docs/ngc-ready/README.md
+++ b/docs/ngc-ready/README.md
@@ -14,10 +14,11 @@ These instructions assume the following:
 
 - You have a NGC-Ready server. To determine if your server is NGC-Ready, please review the list of validated servers at the NGC-Ready Server documentation page - https://docs.nvidia.com/certification-programs/ngc-ready-systems/index.html
 - Your NGC-Ready Server has a compatible Linux distribution installed:
-  - Ubuntu Server 20.04 LTS
   - Ubuntu Server 22.04 LTS
   - Ubuntu Server 24.04 LTS
-  - CentOS 7
+  - Red Hat Enterprise Linux / Rocky Linux 8 or 9 when the referenced roles are validated for your server
+
+Legacy Ubuntu 20.04 and CentOS 7 environments may still work for existing deployments, but they are not current release validation targets.
 
 ## Setup
 
@@ -41,7 +42,7 @@ This process will install the latest NVIDIA GPU Drivers, and Docker with the NVI
 # : IP of NGC-Ready server, or localhost. The trailing comma is required
 # If SSH requires a password, add: -k
 # If sudo requires a password, add: -K
-ansible-playbook -u -i , playbooks/ngc-ready.yml
+ansible-playbook -u -i , playbooks/ngc-ready-server.yml
 ```
 
 ## Testing
@@ -55,5 +56,5 @@ This process will test the functionality of the NGC-Ready server by running a fu
 # : IP of NGC-Ready server, or localhost. The trailing comma is required
 # If SSH requires a password, add: -k
 # If sudo requires a password, add: -K
-ansible-playbook -u -i , playbooks/ngc-ready.yml --tags test
+ansible-playbook -u -i , playbooks/ngc-ready-server.yml --tags test
 ```
diff --git a/docs/slurm-cluster/slurm-single-node.md b/docs/slurm-cluster/slurm-single-node.md
index 18174b52d..fda8d8dde 100644
--- a/docs/slurm-cluster/slurm-single-node.md
+++ b/docs/slurm-cluster/slurm-single-node.md
@@ -15,7 +15,7 @@ Single Node Slurm Deployment Guide
 
 ## Introduction
 
-The general requirements and procedure for Slurm setup via deepops is documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline the steps to deviate from the general setup to enable single node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date in a stable state with GPU drivers already installed and functional. The supported operating systems are Ubuntu 18.04, 20.04, 22.04, and 24.04; CentOS 7 and 8; and RHEL 7 and 8, with RHEL 8 preferred among the RHEL paths.
+The general requirements and procedure for Slurm setup via deepops is documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline the steps to deviate from the general setup to enable single node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date in a stable state with GPU drivers already installed and functional. Current release validation should target Ubuntu 22.04 or Ubuntu 24.04. Older Ubuntu, CentOS, and RHEL paths are historical compatibility references and should be validated locally before use.
 
 ## Deployment Procedure
 
diff --git a/playbooks/bootstrap/bootstrap-python.yml b/playbooks/bootstrap/bootstrap-python.yml
index 1885f8baa..241381de8 100644
--- a/playbooks/bootstrap/bootstrap-python.yml
+++ b/playbooks/bootstrap/bootstrap-python.yml
@@ -35,13 +35,13 @@
         state: present
       when: ansible_python.version.major == 3
 
-    - name: install epel on EL7
+    - name: legacy EL7 - install epel
       package:
         name: epel-release
         state: present
       when: (ansible_python.version.major == 2) and (ansible_os_family == "RedHat") and (ansible_distribution_major_version == "7")
 
-    - name: install python 2 libraries on EL7
+    - name: legacy EL7 - install python 2 libraries
       package:
         name:
           - python-setuptools
diff --git a/roles/singularity_wrapper/tasks/main.yml b/roles/singularity_wrapper/tasks/main.yml
index 82336c7db..f4e8890f3 100644
--- a/roles/singularity_wrapper/tasks/main.yml
+++ b/roles/singularity_wrapper/tasks/main.yml
@@ -1,5 +1,5 @@
 ---
-- name: centos 8 - ensure powertools installed
+- name: legacy CentOS 8 - ensure powertools installed
   block:
     - name: ensure prereq packages installed
       yum:
diff --git a/virtual/README.md b/virtual/README.md
index 19db58a21..b1f3809a0 100644
--- a/virtual/README.md
+++ b/virtual/README.md
@@ -23,12 +23,9 @@ If deploying kubeflow or another resource-intensive application in this environm
 
 ### Operating System Requirements
 
-* Ubuntu 18.04 (or greater)
-* CentOS 7.6 (or greater)
+Running DeepOps virtually assumes that the host machine's OS is suitable for Vagrant, libvirt, and any optional GPU passthrough configuration. Ubuntu 22.04 LTS is the preferred host path for this legacy lab workflow.
 
-Running DeepOps virtually assumes that the host machine's OS is an approved OS. If this is not the case, the scripts used in the steps below may be modified to work with a different OS.
-
-The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Newer operating systems should be validated on real target systems unless the Vagrantfiles have been refreshed and tested for that release.
+The Vagrantfiles currently cover Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. The startup script defaults to Ubuntu 20.04. Treat these Vagrantfiles as legacy/community-supported lab references; validate current release work on real target systems unless the Vagrantfiles have been refreshed and tested for that release.
 
 Also, using VMs and optionally GPU passthrough assumes that the host machine has been configured to enable virtualization in the BIOS. For instructions on how to accomplish this, refer to the sections at the bottom of this README: [Enabling virtualization and GPU passthrough](#enabling-virtualization-and-gpu-passthrough).
 
@@ -163,7 +160,7 @@ The default Vagrantfiles create VMs that are very minimal in terms of resources
 
 ### Specify the cluster Operating System
 
-By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8.
+By default, all virtual nodes deploy with Ubuntu 20.04. This can be changed by overriding the environment variables `DEEPOPS_VAGRANT_OS` and `DEEPOPS_OS_VERSION`. Available Vagrantfiles include Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, CentOS 7, and CentOS 8. These operating system choices are maintained as legacy lab fixtures, not as release validation targets.
 
 ```sh
 export DEEPOPS_VAGRANT_OS=centos
@@ -358,4 +355,3 @@ $ lspci -nnk -d 10de:1db1
 	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia_vgpu_vfio, nvidia
 ```
 
-