mirror of https://github.com/xcat2/xcat-core.git synced 2026-05-05 08:39:08 +00:00

Merge pull request #7548 from VersatusHPC/fix/update-cuda-docs

docs: update NVIDIA CUDA documentation for modern OS support
Markus Hilger
2026-05-05 09:30:21 +02:00
committed by GitHub
14 changed files with 333 additions and 555 deletions


@@ -4,17 +4,17 @@ Deploy CUDA nodes
 Diskful
 -------
 
-* To provision diskful nodes using osimage ``rhels7.5-ppc64le-install-cudafull``: ::
+Provision diskful nodes using the CUDA osimage::
 
-    nodeset <noderange> osimage=rhels7.5-ppc64le-install-cudafull
+    nodeset <noderange> osimage=<osver>-<arch>-install-cuda
     rsetboot <noderange> net
     rpower <noderange> boot
 
 Diskless
 --------
 
-* To provision diskless nodes using osimage ``rhels7.5-ppc64le-netboot-cudafull``: ::
+Provision diskless nodes using the CUDA osimage::
 
-    nodeset <noderange> osimage=rhels7.5-ppc64le-netboot-cudafull
+    nodeset <noderange> osimage=<osver>-<arch>-netboot-cuda
     rsetboot <noderange> net
    rpower <noderange> boot
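With concrete values substituted for the placeholders, the osimage names are purely mechanical to assemble. A hedged sketch (the ``rhel9``/``x86_64`` pair below is illustrative, not the only supported combination):

```shell
# Expand the <osver>-<arch> placeholders for a hypothetical
# rhel9 x86_64 cluster; the values are illustrative only.
osver=rhel9
arch=x86_64
diskful="${osver}-${arch}-install-cuda"
diskless="${osver}-${arch}-netboot-cuda"
echo "$diskful"    # name passed to: nodeset <noderange> osimage=...
echo "$diskless"
```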


@@ -5,16 +5,84 @@ CUDA (Compute Unified Device Architecture) is a parallel computing platform and
 For more information, see NVIDIA's website: https://developer.nvidia.com/cuda-zone
 
-xCAT supports CUDA installation for Ubuntu 14.04.3 and RHEL 7.5 on PowerNV (Non-Virtualized) for both diskful and diskless nodes.
-Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime`` package.
+xCAT supports CUDA installation for both diskful and diskless nodes using the ``otherpkgs`` mechanism. The following OS and architecture combinations are supported by NVIDIA's CUDA repository:
+
+.. list-table::
+   :header-rows: 1
+
+   * - OS family
+     - x86_64
+     - ppc64le
+     - sbsa (ARM)
+   * - RHEL 6
+     - Yes
+     -
+     -
+   * - RHEL 7
+     - Yes
+     - Yes
+     -
+   * - RHEL 8
+     - Yes
+     - Yes
+     - Yes
+   * - RHEL 9
+     - Yes
+     -
+     - Yes
+   * - RHEL 10
+     - Yes
+     -
+     - Yes
+   * - SLES 11
+     - Yes
+     -
+     -
+   * - SLES 12
+     - Yes
+     -
+     -
+   * - SLES 15
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 14.04
+     - Yes
+     - Yes
+     -
+   * - Ubuntu 16.04
+     - Yes
+     - Yes
+     -
+   * - Ubuntu 18.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 20.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 22.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 24.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 26.04
+     - Yes
+     -
+     - Yes
+
+Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime-<major>-<minor>`` package (e.g., ``cuda-runtime-13-2``).
 
 .. toctree::
    :maxdepth: 2
 
-   repo/index.rst
-   osimage/index.rst
-   deploy_cuda_node.rst
-   verify_cuda_install.rst
-   management.rst
-   update_nvidia_driver.rst
+   repo_setup
+   osimage_setup
+   deploy_cuda_node
+   verify_cuda_install
+   management
+   update_nvidia_driver


@@ -1,11 +0,0 @@
Create osimage definitions
==========================
Generate ``osimage`` definitions to provision the compute nodes with the NVIDIA CUDA toolkit installed.
.. toctree::
:maxdepth: 2
rhels.rst
ubuntu.rst
postscripts.rst


@@ -1,35 +0,0 @@
Postscripts
===========
The following sections demonstrate how to use xCAT to configure post-installation steps.
Setting PATH and LD_LIBRARY_PATH
--------------------------------
NVIDIA recommends various post-installation actions that should be performed to properly configure the nodes. A sample script, ``config_cuda``, is provided by xCAT for this purpose and can be modified to fit your specific installation.
Add this script to your node object using the ``chdef`` command: ::
chdef -t node -o <noderange> -p postscripts=config_cuda
Setting GPU Configurations
--------------------------
NVIDIA allows for changing GPU attributes using the ``nvidia-smi`` commands. These settings do not persist when a compute node is rebooted. One way to set these attributes is to use an xCAT postscript to set the values every time the node is rebooted.
* Set the power limit to 175W: ::
# set the power limit to 175W
nvidia-smi -pl 175
* Set the GPUs to persistence mode to increase performance: ::
# nvidia-smi -pm 1
Enabled persistence mode for GPU 0000:03:00.0.
Enabled persistence mode for GPU 0000:04:00.0.
Enabled persistence mode for GPU 0002:03:00.0.
Enabled persistence mode for GPU 0002:04:00.0.
All done.


@@ -1,209 +0,0 @@
RHEL 7.5
========
xCAT provides sample package list (pkglist) files for CUDA. You can find them at:
* Diskful: ``/opt/xcat/share/xcat/install/rh/cuda*``
* Diskless: ``/opt/xcat/share/xcat/netboot/rh/cuda*``
Diskful images
--------------
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-install-compute`` osimage.
**[Note]**: There is a requirement to reboot the machine after the CUDA drivers are installed. To satisfy this requirement, the CUDA software is installed in the ``pkglist`` attribute of the osimage definition where a reboot will happen after the Operating System is installed.
cudafull
^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudafull`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
pkglist=/opt/xcat/share/xcat/install/rh/cudafull.rhels7.ppc64le.pkglist
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudaruntime`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime \
pkglist=/opt/xcat/share/xcat/install/rh/cudaruntime.rhels7.ppc64le.pkglist
Diskless images
---------------
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-netboot-compute`` osimage.
**[Note]**: For diskless, the install of the CUDA packages MUST be done in the ``otherpkglist`` and **NOT** the ``pkglist`` as with diskful. The requirement for rebooting the machine is not applicable in diskless nodes because the image is loaded on each reboot.
cudafull
^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudafull:/' \
| mkdef -z
#. Verify that the CUDA repo created in the previous step is available in the directory specified by the ``otherpkgdir`` attribute.
The ``otherpkgdir`` directory can be obtained by running lsdef on the osimage: ::
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudafull -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudafull
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
Create a symbolic link to the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudafull
#. Create a custom pkglist file to install additional operating system packages for your CUDA node.
#. Copy the default compute pkglist file as a starting point: ::
mkdir -p /install/custom/netboot/rh/
cp /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist \
/install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
#. Edit the pkglist file and append any packages you desire to be installed. For example: ::
vi /install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
...
# Additional packages for CUDA
pciutils
#. Set the new file as the ``pkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
pkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
#. Create the ``otherpkg.pkglist`` file to do the install of the CUDA full packages:
#. Create the otherpkg.pkglist file for cudafull: ::
vi /install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
# add the following packages
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda
#. Set the ``otherpkg.pkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
otherpkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.5-ppc64le-netboot-cudafull
#. Package the image: ::
packimage rhels7.5-ppc64le-netboot-cudafull
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudaruntime:/' \
| mkdef -z
#. Verify that the CUDA repo created previously is available in the directory specified by the ``otherpkgdir`` attribute.
#. Obtain the ``otherpkgdir`` directory using the ``lsdef`` command: ::
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudaruntime -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudaruntime
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
#. Create a symbolic link to the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudaruntime
#. Create the ``otherpkg.pkglist`` file to do the install of the CUDA runtime packages:
#. Create the otherpkg.pkglist file for cudaruntime: ::
vi /install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
# Add the following packages:
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda-runtime-9-2
#. Set the ``otherpkg.pkglist`` attribute for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
otherpkglist=/install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.5-ppc64le-netboot-cudaruntime
#. Package the image: ::
packimage rhels7.5-ppc64le-netboot-cudaruntime
POWER9 Setup
------------
The NVIDIA POWER9 CUDA driver needs some additional setup. Refer to the URL below for details.
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
xCAT includes an example script, ``cuda_power9_setup``, to help users handle this situation.
Diskful osimage
^^^^^^^^^^^^^^^
For diskful deployment, there is no need to change the osimage definition. Instead, add this postscript to your compute node postscripts list. ::
chdef p9compute -p postscripts=cuda_power9_setup
Diskless osimage
^^^^^^^^^^^^^^^^
For diskless deployment, the script needs to be added to the postinstall script of the osimage so that it runs in the chroot environment. Refer to the following commands as an example. ::
mkdir -p /install/custom/netboot/rh
cp /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall /install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall
cat >>/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall <<-EOF
/install/postscripts/cuda_power9_setup
EOF
chdef -t osimage rhels7.5-ppc64le-netboot-cudafull postinstall=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall


@@ -1,146 +0,0 @@
Ubuntu 14.04.3
==============
Diskful images
---------------
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``ubuntu14.04.3-ppc64el-install-compute`` osimage.
xCAT provides sample package list files for CUDA. You can find them at:
* ``/opt/xcat/share/xcat/install/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist``
* ``/opt/xcat/share/xcat/install/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist``
**[diskful note]**: There is a requirement to reboot the machine after the CUDA drivers are installed. To satisfy this requirement, the CUDA software is installed in the ``pkglist`` attribute of the osimage definition where the reboot happens after the Operating System is installed.
cudafull
^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-install-compute \
| sed 's/install-compute:/install-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``pkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudafull \
-p pkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Use the provided ``cudafull`` pkglist to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudafull \
pkglist=/opt/xcat/share/xcat/install/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-install-compute \
| sed 's/install-compute:/install-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute:
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``pkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudaruntime \
-p pkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Use the provided ``cudaruntime`` pkglist to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudaruntime \
pkglist=/opt/xcat/share/xcat/install/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist
Diskless images
---------------
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``ubuntu14.04.3-ppc64el-netboot-compute`` osimage.
xCAT provides sample package list files for CUDA. You can find them at:
* ``/opt/xcat/share/xcat/netboot/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist``
* ``/opt/xcat/share/xcat/netboot/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist``
**[diskless note]**: For diskless images, the requirement for rebooting the machine is not applicable because the image is loaded on each reboot. The install of the CUDA packages is required to be done in the ``otherpkglist``, **NOT** the ``pkglist``.
cudafull
^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-netboot-compute \
| sed 's/netboot-compute:/netboot-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``otherpkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``otherpkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudafull \
otherpkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
#. Add the provided ``cudafull`` otherpkg.pkglist file to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudafull \
otherpkglist=/opt/xcat/share/xcat/netboot/ubuntu/cudafull.otherpkgs.pkglist
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Verify that ``acpid`` is installed on the Management Node or on the Ubuntu host where you are generating the diskless image: ::
apt-get install -y acpid
#. Generate the image: ::
genimage ubuntu14.04.3-ppc64el-netboot-cudafull
#. Package the image: ::
packimage ubuntu14.04.3-ppc64el-netboot-cudafull
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-netboot-compute \
| sed 's/netboot-compute:/netboot-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``otherpkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``otherpkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudaruntime \
otherpkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
#. Add the provided ``cudaruntime`` otherpkg.pkglist file to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudaruntime \
otherpkglist=/opt/xcat/share/xcat/netboot/ubuntu/cudaruntime.otherpkgs.pkglist
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Verify that ``acpid`` is installed on the Management Node or on the Ubuntu host where you are generating the diskless image: ::
apt-get install -y acpid
#. Generate the image: ::
genimage ubuntu14.04.3-ppc64el-netboot-cudaruntime
#. Package the image: ::
packimage ubuntu14.04.3-ppc64el-netboot-cudaruntime


@@ -0,0 +1,153 @@
CUDA osimage configuration
==========================
CUDA packages are installed through xCAT's ``otherpkgs``. Replace
``<osver>``, ``<arch>``, and ``<distro>`` below with your values
(e.g., ``rocky10.1``, ``x86_64``, ``rhel10``).
Diskful nodes (RHEL)
--------------------
#. Create a copy of the base install osimage for CUDA::
lsdef -t osimage -z <osver>-<arch>-install-compute \
| sed 's/install-compute:/install-cuda:/' \
| mkdef -z
#. Add the CUDA repository to the ``pkgdir`` attribute.
For online setups, use the NVIDIA repository URL directly::
chdef -t osimage <osver>-<arch>-install-cuda -p \
pkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups with a local mirror::
chdef -t osimage <osver>-<arch>-install-cuda -p \
pkgdir=/install/cuda/<distro>/<arch>
#. Create a pkglist file for the CUDA packages::
mkdir -p /install/custom/install/rh
echo "cuda" > /install/custom/install/rh/cuda.pkglist
Or for runtime-only installations::
echo "cuda-runtime-13-2" > /install/custom/install/rh/cuda-runtime.pkglist
#. Set the pkglist on the osimage::
chdef -t osimage <osver>-<arch>-install-cuda \
pkglist=/install/custom/install/rh/cuda.pkglist
.. note::
For diskful installations, the CUDA packages should be installed via the
``pkglist`` attribute so that the required reboot after driver installation
happens naturally at the end of the OS install.
Diskful nodes (Ubuntu)
----------------------
#. Create a copy of the base install osimage::
lsdef -t osimage -z <osver>-<arch>-install-compute \
| sed 's/install-compute:/install-cuda:/' \
| mkdef -z
#. Add the CUDA repository.
For online setups::
chdef -t osimage <osver>-<arch>-install-cuda -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups::
chdef -t osimage <osver>-<arch>-install-cuda -p \
otherpkgdir=/install/cuda/<distro>/<arch>
#. Create an otherpkgs.pkglist file::
mkdir -p /install/custom/install/ubuntu
echo "cuda" > /install/custom/install/ubuntu/cuda.otherpkgs.pkglist
#. Set it on the osimage::
chdef -t osimage <osver>-<arch>-install-cuda \
otherpkglist=/install/custom/install/ubuntu/cuda.otherpkgs.pkglist
Diskless nodes
--------------
For diskless (stateless) nodes, the CUDA packages must be installed via
``otherpkglist`` (not ``pkglist``). The reboot requirement for CUDA drivers
does not apply since diskless nodes reload the image on each boot.
#. Create a copy of the netboot osimage::
lsdef -t osimage -z <osver>-<arch>-netboot-compute \
| sed 's/netboot-compute:/netboot-cuda:/' \
| mkdef -z
#. Add the CUDA repo to ``otherpkgdir``.
For online setups::
chdef -t osimage <osver>-<arch>-netboot-cuda -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups with a local mirror::
chdef -t osimage <osver>-<arch>-netboot-cuda -p \
otherpkgdir=/install/cuda/<distro>/<arch>
#. Create an otherpkgs.pkglist::
mkdir -p /install/custom/netboot/rh
echo "cuda" > /install/custom/netboot/rh/cuda.otherpkgs.pkglist
#. Set it and rebuild the image::
chdef -t osimage <osver>-<arch>-netboot-cuda \
otherpkglist=/install/custom/netboot/rh/cuda.otherpkgs.pkglist
genimage <osver>-<arch>-netboot-cuda
packimage <osver>-<arch>-netboot-cuda
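The pkglist step above can be sketched end to end. A hedged example; the temporary directory stands in for ``/install/custom/netboot/rh``, and the extra ``dkms`` line is only needed when the driver is built via DKMS (see the repository setup notes on EPEL dependencies):

```shell
# Build an otherpkgs.pkglist for a diskless CUDA image.
# Paths and package names are illustrative.
dir=$(mktemp -d)            # stand-in for /install/custom/netboot/rh
{
  echo "dkms"               # EPEL dependency, only for DKMS-built drivers
  echo "cuda"               # full toolkit; use cuda-runtime-<maj>-<min> for runtime-only
} > "$dir/cuda.otherpkgs.pkglist"
cat "$dir/cuda.otherpkgs.pkglist"
```

The resulting file is what ``otherpkglist`` is pointed at before running ``genimage``.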
POWER9 setup
-------------
NVIDIA POWER9 CUDA drivers need additional configuration. See:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
xCAT includes a sample script ``cuda_power9_setup`` to handle this.
For diskful nodes::
chdef <noderange> -p postscripts=cuda_power9_setup
For diskless nodes, add it to the osimage postinstall script::
cp /opt/xcat/share/xcat/netboot/rh/compute.<osver>.<arch>.postinstall \
/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
echo "/install/postscripts/cuda_power9_setup" >> \
/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
chdef -t osimage <osver>-<arch>-netboot-cuda \
postinstall=/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
Post-installation configuration
--------------------------------
NVIDIA recommends setting PATH and LD_LIBRARY_PATH for CUDA. xCAT provides
a sample postscript ``config_cuda`` for this::
chdef <noderange> -p postscripts=config_cuda
To set GPU attributes on each boot (these do not persist across reboots),
create a postscript that runs ``nvidia-smi`` commands. For example, to enable
persistence mode::
nvidia-smi -pm 1
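The ``nvidia-smi`` settings can be bundled into one postscript. A minimal sketch; the script is written to a temporary path here, whereas on a real management node it would live in ``/install/postscripts``, and the 175 W power cap is an example value taken from the earlier postscript discussion (tune it per GPU model):

```shell
# Generate a hypothetical config-gpu postscript that re-applies
# non-persistent GPU settings on every boot.
script=$(mktemp)            # stand-in for /install/postscripts/config-gpu
cat > "$script" <<'EOF'
#!/bin/sh
# Skip quietly on nodes without the NVIDIA tools installed.
command -v nvidia-smi >/dev/null 2>&1 || exit 0
nvidia-smi -pm 1      # enable persistence mode
nvidia-smi -pl 175    # example power cap in watts; adjust per GPU
EOF
chmod +x "$script"
```

It would then be attached with ``chdef <noderange> -p postscripts=<scriptname>``.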


@@ -1,13 +0,0 @@
Create CUDA software repository
===============================
The NVIDIA CUDA Toolkit is available to download at http://developer.nvidia.com/cuda-downloads.
Download the toolkit and prepare the software repository on the xCAT Management Node to serve the NVIDIA CUDA files.
.. toctree::
:maxdepth: 2
rhels.rst
ubuntu.rst


@@ -1,27 +0,0 @@
RHEL 7.5
========
#. Create a repository on the Management Node for installing the CUDA Toolkit: ::
# For cuda toolkit name: /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm
# extract the contents from the rpm
mkdir -p /tmp/cuda
cd /tmp/cuda
rpm2cpio /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm | cpio -i -d
# Create the repo directory under xCAT /install dir for cuda 9.2
mkdir -p /install/cuda-9.2/ppc64le/cuda-core
cp /tmp/cuda/var/cuda-repo-9-2-local/*.rpm /install/cuda-9.2/ppc64le/cuda-core
# Create the yum repo files
createrepo /install/cuda-9.2/ppc64le/cuda-core
#. The NVIDIA CUDA Toolkit contains rpms that have dependencies on other external packages (such as ``DKMS``). These are provided by EPEL. It's up to the system administrator to obtain the dependency packages and add those to the ``cuda-deps`` directory: ::
mkdir -p /install/cuda-9.2/ppc64le/cuda-deps
# Copy the DKMS rpm to this directory
cp /path/to/dkms-2.4.0-1.20170926git959bd74.el7.noarch.rpm /install/cuda-9.2/ppc64le/cuda-deps
# Execute createrepo in this directory
createrepo /install/cuda-9.2/ppc64le/cuda-deps


@@ -1,37 +0,0 @@
Ubuntu 14.04.3
==============
NVIDIA supports two types of debian repositories that can be used to install the CUDA Toolkit: **local** and **network**. You can download the installers from https://developer.nvidia.com/cuda-downloads.
Local
-----
A local package repo will contain all of the CUDA packages. Extract the CUDA packages into ``/install/cuda-repo/ppc64el``: ::
# For CUDA toolkit: /root/cuda-repo-ubuntu1404-7-5-local_7.5-18_ppc64el.deb
# Create the repo directory under xCAT /install dir
mkdir -p /install/cuda-repo/ppc64el
# extract the package
dpkg -x /root/cuda-repo-ubuntu1404-7-5-local_7.5-18_ppc64el.deb /install/cuda-repo/ppc64el
Network
-------
The online package repo provides a source list entry pointing to a URL containing the CUDA packages. This can be used directly on the Compute Nodes.
The ``sources.list`` entry may look similar to: ::
deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/ppc64el /
Authorize the CUDA repo
-----------------------
In order to access the CUDA repository you must import the CUDA GPG key into the ``apt`` trusted key list. xCAT provides a sample postscript ``/install/postscripts/addcudakey`` to help with this task: ::
chdef -t node -o <noderange> -p postscripts=addcudakey


@@ -0,0 +1,79 @@
CUDA repository setup
=====================
NVIDIA hosts package repositories at::
https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/
Where ``<distro>`` is one of ``rhel6``, ``rhel7``, ``rhel8``, ``rhel9``,
``rhel10``, ``sles11``, ``sles12``, ``sles15``, ``ubuntu1404``, ``ubuntu1604``,
``ubuntu1804``, ``ubuntu2004``, ``ubuntu2204``, ``ubuntu2404``, ``ubuntu2604``
and ``<arch>`` is ``x86_64``, ``ppc64le`` (RHEL 7-8, Ubuntu 14.04-16.04), or
``sbsa`` (ARM).
.. note::
Older Ubuntu releases (14.04, 16.04) use ``ppc64el`` instead of
``ppc64le`` in the repository URL path.
Online setup
------------
If nodes have network access, point ``otherpkgdir`` at the NVIDIA URL directly::
chdef -t osimage <osimage> -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
The ``otherpkgs`` postscript will configure this as a package repository on
the node during provisioning.
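Assembling the repository URL from the placeholders is purely mechanical. A sketch with illustrative values:

```shell
# Compose the NVIDIA repo URL from <distro>/<arch> placeholders.
# rhel9/x86_64 are example values, not the only supported pair.
distro=rhel9
arch=x86_64
url="https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}"
echo "$url"    # value passed to chdef ... otherpkgdir=...
```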
Offline setup (air-gapped clusters)
------------------------------------
For clusters without internet access, mirror the NVIDIA repository to a
local directory under ``/install`` on the management node.
RHEL
^^^^
Use ``dnf download`` (or ``yumdownloader`` on RHEL 7) on a system with internet
access to download the CUDA packages and their dependencies::
mkdir -p /install/cuda/<distro>/<arch>
dnf download --resolve --destdir /install/cuda/<distro>/<arch> cuda
createrepo /install/cuda/<distro>/<arch>
For EPEL dependencies such as ``dkms``::
dnf download --resolve --destdir /install/cuda/<distro>/<arch> dkms
createrepo /install/cuda/<distro>/<arch>
SLES
^^^^
Use ``zypper download`` on a system with internet access::
mkdir -p /install/cuda/<distro>/<arch>
zypper --pkg-cache-dir /install/cuda/<distro>/<arch> download cuda
createrepo /install/cuda/<distro>/<arch>
For a runtime-only installation, replace ``cuda`` with
``cuda-runtime-<major>-<minor>`` (e.g., ``cuda-runtime-13-2``).
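The runtime package name follows directly from the CUDA version string. A sketch using the ``13.2`` value from the example above:

```shell
# Derive the cuda-runtime package name from a CUDA version string.
version=13.2
major=${version%%.*}            # "13"
minor=${version#*.}             # "2"
pkg="cuda-runtime-${major}-${minor}"
echo "$pkg"
```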
Ubuntu
^^^^^^
Use ``apt download`` on a system with internet access::
mkdir -p /install/cuda/<distro>/<arch>
cd /install/cuda/<distro>/<arch>
apt download cuda $(apt-cache depends --recurse --no-recommends \
--no-suggests --no-conflicts --no-breaks --no-replaces \
--no-enhances cuda | grep "^\w" | sort -u)
dpkg-scanpackages . /dev/null | gzip -9c > Packages.gz
.. note::
The offline approach requires downloading packages on a system running
the same OS version and architecture as the target nodes. Transfer the
resulting directory to the management node under ``/install``.
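On the node, a flat mirror like the one produced above would be consumed via an apt source entry. A hedged sketch of generating one; the management node IP, the ``[trusted=yes]`` flag (needed because a hand-built flat repo is unsigned), and the path are all illustrative assumptions:

```shell
# Write a sources.list entry for a flat local mirror served over
# HTTP from the management node (IP and path are illustrative).
entry="deb [trusted=yes] http://10.0.0.1/install/cuda/<distro>/<arch> ./"
f=$(mktemp)                  # stand-in for /etc/apt/sources.list.d/cuda.list
echo "$entry" > "$f"
cat "$f"
```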


@@ -1,21 +1,21 @@
 Update NVIDIA Driver
 =====================
 
-If the user wants to update the newer NVIDIA driver on the system, follow the :doc:`Create CUDA software repository </advanced/gpu/nvidia/repo/index>` document to create another repository for the new driver.
+To update to a newer NVIDIA driver on the system, follow the :doc:`CUDA repository setup </advanced/gpu/nvidia/repo_setup>` document to create another repository for the new driver.
 
-The following example assumes the new driver is in ``/install/cuda-9.2/ppc64le/nvidia_new``.
+The following example assumes the new driver is in ``/install/cuda/<distro>/<arch>/nvidia_new``.
 
 Diskful
 -------
 
 #. Change pkgdir for the cuda image: ::
 
-    chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
-        pkgdir=/install/cuda-9.2/ppc64le/nvidia_new,/install/cuda-9.2/ppc64le/cuda-deps
+    chdef -t osimage -o <osver>-<arch>-install-cuda \
+        pkgdir=/install/cuda/<distro>/<arch>/nvidia_new
 
 #. Use xdsh command to remove all the NVIDIA rpms: ::
 
-    xdsh <noderange> "yum remove *nvidia* -y"
+    xdsh <noderange> "dnf remove *nvidia* -y"
 
 #. Run updatenode command to update NVIDIA driver on the compute node: ::
 
@@ -35,4 +35,4 @@ Diskless
 To update a new NVIDIA driver on diskless compute nodes, re-generate the osimage pointing to the new NVIDIA driver repository and reboot the node to load the diskless image.
 
-Refer to :doc:`Create osimage definitions </advanced/gpu/nvidia/osimage/index>` for specific instructions.
+Refer to :doc:`CUDA osimage configuration </advanced/gpu/nvidia/osimage_setup>` for specific instructions.


@@ -1,80 +1,36 @@
 Verify CUDA Installation
 ========================
 
-**The following verification steps only apply to the ``cudafull`` installations.**
+The following verification steps only apply to the ``cuda`` (full) installations and require nodes with physical NVIDIA GPU hardware.
 
 #. Verify driver version by looking at ``/proc/driver/nvidia/version``: ::
 
-    # cat /proc/driver/nvidia/version
-    NVRM version: NVIDIA UNIX ppc64le Kernel Module 352.39 Fri Aug 14 17:10:41 PDT 2015
-    GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
+    cat /proc/driver/nvidia/version
 
 #. Verify the CUDA Toolkit version: ::
 
-    # nvcc -V
-    nvcc: NVIDIA (R) Cuda compiler driver
-    Copyright (c) 2005-2015 NVIDIA Corporation
-    Built on Tue_Aug_11_14:31:50_CDT_2015
-    Cuda compilation tools, release 7.5, V7.5.17
+    nvcc -V
 
 #. Verify running CUDA GPU jobs by compiling the samples and executing the ``deviceQuery`` or ``bandwidthTest`` programs.
 
-   * Compile the samples:
+   * Compile the samples: ::
 
-     **[RHEL]:** ::
-
-       cd ~/
-       cuda-install-samples-7.5.sh .
-       cd NVIDIA_CUDA-7.5_Samples
+       git clone https://github.com/NVIDIA/cuda-samples.git
+       cd cuda-samples/Samples/1_Utilities/deviceQuery
        make
 
-     **[Ubuntu]:** ::
-
-       cd ~/
-       apt-get install cuda-samples-7-0 -y
-       cd /usr/local/cuda-7.0/samples
-       make
-
   * Run the ``deviceQuery`` sample: ::
 
-      # ./bin/ppc64le/linux/release/deviceQuery
-      ./deviceQuery Starting...
-      CUDA Device Query (Runtime API) version (CUDART static linking)
-      Detected 4 CUDA Capable device(s)
-      Device 0: "Tesla K80"
-        CUDA Driver Version / Runtime Version          7.5 / 7.5
-        CUDA Capability Major/Minor version number:    3.7
-        Total amount of global memory:                 11520 MBytes (12079136768 bytes)
-        (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
-        GPU Max Clock rate:                            824 MHz (0.82 GHz)
-        Memory Clock rate:                             2505 Mhz
-        Memory Bus Width:                              384-bit
-        L2 Cache Size:                                 1572864 bytes
-      ............
-      deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 4, Device0 = Tesla K80, Device1 = Tesla K80, Device2 = Tesla K80, Device3 = Tesla K80
-      Result = PASS
+      ./deviceQuery
+
+    A successful run will end with ``Result = PASS``.
 
   * Run the ``bandwidthTest`` sample: ::
 
-      # ./bin/ppc64le/linux/release/bandwidthTest
-      [CUDA Bandwidth Test] - Starting...
-      Running on...
-      Device 0: Tesla K80
-      Quick Mode
-      Host to Device Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     7765.1
-      Device to Host Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     7759.6
-      Device to Device Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     141485.3
-      Result = PASS
+      cd ../bandwidthTest
+      make
+      ./bandwidthTest
+
+    A successful run will end with ``Result = PASS``.
 
 NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
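The first two checks above can be wrapped in a small script that degrades gracefully on machines without a GPU or toolkit. A hedged sketch (the warning strings are illustrative):

```shell
# Collect the same version info as the manual verification steps,
# without failing on machines that lack a GPU or the toolkit.
if [ -r /proc/driver/nvidia/version ]; then
  driver_info=$(cat /proc/driver/nvidia/version)
else
  driver_info="WARN: NVIDIA kernel module not loaded"
fi
if command -v nvcc >/dev/null 2>&1; then
  toolkit_info=$(nvcc -V)
else
  toolkit_info="WARN: nvcc not found in PATH"
fi
echo "$driver_info"
echo "$toolkit_info"
```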


@@ -77,7 +77,7 @@ The following software kits will be used to install the IBM HPC software stack o
 The ESSL software kit has an *external dependency* to the ``libxlf`` which is provided in the XLF software kit. Since it's already added in the above step, there is no action needed here.
 
-If CUDA toolkit is being used, ESSL has a runtime dependency on the CUDA rpms. The administrator needs to create the repository for the CUDA 7.5 toolkit or a runtime error will occur when provisioning the node. See the :doc:`/advanced/gpu/nvidia/repo/index` section for more details about setting up the CUDA repository on the xCAT management node. ::
+If CUDA toolkit is being used, ESSL has a runtime dependency on the CUDA rpms. The administrator needs to create the repository for the CUDA 7.5 toolkit or a runtime error will occur when provisioning the node. See the :doc:`/advanced/gpu/nvidia/repo_setup` section for more details about setting up the CUDA repository on the xCAT management node. ::
 
 #
 # Assuming that the cuda repo has been configured at: