mirror of https://github.com/xcat2/xcat-core.git synced 2026-05-05 08:39:08 +00:00

Merge pull request #7548 from VersatusHPC/fix/update-cuda-docs

docs: update NVIDIA CUDA documentation for modern OS support
Markus Hilger
2026-05-05 09:30:21 +02:00
committed by GitHub
14 changed files with 333 additions and 555 deletions


@@ -4,17 +4,17 @@ Deploy CUDA nodes
 Diskful
 -------
 
-* To provision diskful nodes using osimage ``rhels7.5-ppc64le-install-cudafull``: ::
+Provision diskful nodes using the CUDA osimage::
 
-    nodeset <noderange> osimage=rhels7.5-ppc64le-install-cudafull
+    nodeset <noderange> osimage=<osver>-<arch>-install-cuda
     rsetboot <noderange> net
     rpower <noderange> boot
 
 Diskless
 --------
 
-* To provision diskless nodes using osimage ``rhels7.5-ppc64le-netboot-cudafull``: ::
+Provision diskless nodes using the CUDA osimage::
 
-    nodeset <noderange> osimage=rhels7.5-ppc64le-netboot-cudafull
+    nodeset <noderange> osimage=<osver>-<arch>-netboot-cuda
     rsetboot <noderange> net
    rpower <noderange> boot
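With concrete values substituted for the placeholders, the osimage names are purely mechanical to assemble. A hedged sketch (the ``rhel9``/``x86_64`` pair below is illustrative, not the only supported combination):

```shell
# Expand the <osver>-<arch> placeholders for a hypothetical
# rhel9 x86_64 cluster; the values are illustrative only.
osver=rhel9
arch=x86_64
diskful="${osver}-${arch}-install-cuda"
diskless="${osver}-${arch}-netboot-cuda"
echo "$diskful"    # name passed to: nodeset <noderange> osimage=...
echo "$diskless"
```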


@@ -5,16 +5,84 @@ CUDA (Compute Unified Device Architecture) is a parallel computing platform and
 For more information, see NVIDIA's website: https://developer.nvidia.com/cuda-zone
 
-xCAT supports CUDA installation for Ubuntu 14.04.3 and RHEL 7.5 on PowerNV (Non-Virtualized) for both diskful and diskless nodes.
-Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime`` package.
+xCAT supports CUDA installation for both diskful and diskless nodes using the ``otherpkgs`` mechanism. The following OS and architecture combinations are supported by NVIDIA's CUDA repository:
+
+.. list-table::
+   :header-rows: 1
+
+   * - OS family
+     - x86_64
+     - ppc64le
+     - sbsa (ARM)
+   * - RHEL 6
+     - Yes
+     -
+     -
+   * - RHEL 7
+     - Yes
+     - Yes
+     -
+   * - RHEL 8
+     - Yes
+     - Yes
+     - Yes
+   * - RHEL 9
+     - Yes
+     -
+     - Yes
+   * - RHEL 10
+     - Yes
+     -
+     - Yes
+   * - SLES 11
+     - Yes
+     -
+     -
+   * - SLES 12
+     - Yes
+     -
+     -
+   * - SLES 15
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 14.04
+     - Yes
+     - Yes
+     -
+   * - Ubuntu 16.04
+     - Yes
+     - Yes
+     -
+   * - Ubuntu 18.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 20.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 22.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 24.04
+     - Yes
+     -
+     - Yes
+   * - Ubuntu 26.04
+     - Yes
+     -
+     - Yes
+
+Within the NVIDIA CUDA Toolkit, installing the ``cuda`` package will install both the ``cuda-runtime`` and the ``cuda-toolkit``. The ``cuda-toolkit`` is intended for developing CUDA programs and monitoring CUDA jobs. If your particular installation requires only running GPU jobs, it's recommended to install only the ``cuda-runtime-<major>-<minor>`` package (e.g., ``cuda-runtime-13-2``).
 
 .. toctree::
    :maxdepth: 2
 
-   repo/index.rst
-   osimage/index.rst
-   deploy_cuda_node.rst
-   verify_cuda_install.rst
-   management.rst
-   update_nvidia_driver.rst
+   repo_setup
+   osimage_setup
+   deploy_cuda_node
+   verify_cuda_install
+   management
+   update_nvidia_driver


@@ -1,11 +0,0 @@
Create osimage definitions
==========================
Generate ``osimage`` definitions to provision the compute nodes with the NVIDIA CUDA toolkit installed.
.. toctree::
:maxdepth: 2
rhels.rst
ubuntu.rst
postscripts.rst


@@ -1,35 +0,0 @@
Postscripts
===========
The following sections demonstrate how to use xCAT to configure post-installation steps.
Setting PATH and LD_LIBRARY_PATH
--------------------------------
NVIDIA recommends various post-installation actions that should be performed to properly configure the nodes. A sample script, ``config_cuda``, is provided by xCAT for this purpose and can be modified to fit your specific installation.
Add this script to your node object using the ``chdef`` command: ::
chdef -t node -o <noderange> -p postscripts=config_cuda
Setting GPU Configurations
--------------------------
NVIDIA allows for changing GPU attributes using the ``nvidia-smi`` commands. These settings do not persist when a compute node is rebooted. One way to set these attributes is to use an xCAT postscript to set the values every time the node is rebooted.
* Set the power limit to 175W: ::
# set the power limit to 175W
nvidia-smi -pl 175
* Set the GPUs to persistence mode to increase performance: ::
# nvidia-smi -pm 1
Enabled persistence mode for GPU 0000:03:00.0.
Enabled persistence mode for GPU 0000:04:00.0.
Enabled persistence mode for GPU 0002:03:00.0.
Enabled persistence mode for GPU 0002:04:00.0.
All done.


@@ -1,209 +0,0 @@
RHEL 7.5
========
xCAT provides sample package list (pkglist) files for CUDA. You can find them at:
* Diskful: ``/opt/xcat/share/xcat/install/rh/cuda*``
* Diskless: ``/opt/xcat/share/xcat/netboot/rh/cuda*``
Diskful images
--------------
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-install-compute`` osimage.
**[Note]**: There is a requirement to reboot the machine after the CUDA drivers are installed. To satisfy this requirement, the CUDA software is installed in the ``pkglist`` attribute of the osimage definition where a reboot will happen after the Operating System is installed.
cudafull
^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudafull`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
pkglist=/opt/xcat/share/xcat/install/rh/cudafull.rhels7.ppc64le.pkglist
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.5-ppc64le-install-compute \
| sed 's/install-compute:/install-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime -p \
pkgdir=/install/cuda-9.2/ppc64le/cuda-core,/install/cuda-9.2/ppc64le/cuda-deps
#. Use the provided ``cudaruntime`` pkglist to install the CUDA packages: ::
chdef -t osimage -o rhels7.5-ppc64le-install-cudaruntime \
pkglist=/opt/xcat/share/xcat/install/rh/cudaruntime.rhels7.ppc64le.pkglist
Diskless images
---------------
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``rhels7.5-ppc64le-netboot-compute`` osimage.
**[Note]**: For diskless, the install of the CUDA packages MUST be done in the ``otherpkglist`` and **NOT** the ``pkglist`` as with diskful. The requirement for rebooting the machine is not applicable in diskless nodes because the image is loaded on each reboot.
cudafull
^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudafull:/' \
| mkdef -z
#. Verify that the CUDA repo created in the previous step is available in the directory specified by the ``otherpkgdir`` attribute.
The ``otherpkgdir`` directory can be obtained by running lsdef on the osimage: ::
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudafull -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudafull
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
Create a symbolic link to the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudafull
#. Create a custom pkglist file to install additional operating system packages for your CUDA node.
#. Copy the default compute pkglist file as a starting point: ::
mkdir -p /install/custom/netboot/rh/
cp /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist \
/install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
#. Edit the pkglist file and append any packages you desire to be installed. For example: ::
vi /install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
...
# Additional packages for CUDA
pciutils
#. Set the new file as the ``pkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
pkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.pkglist
#. Create the ``otherpkg.pkglist`` file to do the install of the CUDA full packages:
#. Create the otherpkg.pkglist file for cudafull: ::
vi /install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
# add the following packages
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda
#. Set the ``otherpkg.pkglist`` attribute for the cudafull osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudafull \
otherpkglist=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.5-ppc64le-netboot-cudafull
#. Package the image: ::
packimage rhels7.5-ppc64le-netboot-cudafull
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z rhels7.5-ppc64le-netboot-compute \
| sed 's/netboot-compute:/netboot-cudaruntime:/' \
| mkdef -z
#. Verify that the CUDA repo created previously is available in the directory specified by the ``otherpkgdir`` attribute.
#. Obtain the ``otherpkgdir`` directory using the ``lsdef`` command: ::
# lsdef -t osimage rhels7.5-ppc64le-netboot-cudaruntime -i otherpkgdir
Object name: rhels7.5-ppc64le-netboot-cudaruntime
otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le
#. Create a symbolic link to the CUDA repository in the directory specified by ``otherpkgdir`` ::
ln -s /install/cuda-9.2 /install/post/otherpkgs/rhels7.5/ppc64le/cuda-9.2
#. Change the ``rootimgdir`` for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
rootimgdir=/install/netboot/rhels7.5/ppc64le/cudaruntime
#. Create the ``otherpkg.pkglist`` file to do the install of the CUDA runtime packages:
#. Create the otherpkg.pkglist file for cudaruntime: ::
vi /install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
# Add the following packages:
cuda-9.2/ppc64le/cuda-deps/dkms
cuda-9.2/ppc64le/cuda-core/cuda-runtime-9-2
#. Set the ``otherpkg.pkglist`` attribute for the cudaruntime osimage: ::
chdef -t osimage -o rhels7.5-ppc64le-netboot-cudaruntime \
otherpkglist=/install/custom/netboot/rh/cudaruntime.rhels7.ppc64le.otherpkgs.pkglist
#. Generate the image: ::
genimage rhels7.5-ppc64le-netboot-cudaruntime
#. Package the image: ::
packimage rhels7.5-ppc64le-netboot-cudaruntime
POWER9 Setup
------------
The NVIDIA POWER9 CUDA driver needs some additional setup. Refer to the URL below for details.
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
xCAT includes an example script, ``cuda_power9_setup``, to help users handle this situation.
Diskful osimage
^^^^^^^^^^^^^^^
For diskful deployment, there is no need to change the osimage definition. Instead, add this postscript to your compute node postscripts list. ::
chdef p9compute -p postscripts=cuda_power9_setup
Diskless osimage
^^^^^^^^^^^^^^^^
For diskless deployment, the script needs to be added to the postinstall script of the osimage so that it runs in the chroot environment. Refer to the following commands as an example. ::
mkdir -p /install/custom/netboot/rh
cp /opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall /install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall
cat >>/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall <<-EOF
/install/postscripts/cuda_power9_setup
EOF
chdef -t osimage rhels7.5-ppc64le-netboot-cudafull postinstall=/install/custom/netboot/rh/cudafull.rhels7.ppc64le.postinstall


@@ -1,146 +0,0 @@
Ubuntu 14.04.3
==============
Diskful images
---------------
The following examples will create diskful images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``ubuntu14.04.3-ppc64el-install-compute`` osimage.
xCAT provides sample package list files for CUDA. You can find them at:
* ``/opt/xcat/share/xcat/install/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist``
* ``/opt/xcat/share/xcat/install/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist``
**[diskful note]**: There is a requirement to reboot the machine after the CUDA drivers are installed. To satisfy this requirement, the CUDA software is installed in the ``pkglist`` attribute of the osimage definition where the reboot happens after the Operating System is installed.
cudafull
^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-install-compute \
| sed 's/install-compute:/install-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``pkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudafull \
-p pkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Use the provided ``cudafull`` pkglist to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudafull \
pkglist=/opt/xcat/share/xcat/install/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``install-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-install-compute \
| sed 's/install-compute:/install-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``pkgdir`` attribute:
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``pkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudaruntime \
-p pkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Use the provided ``cudaruntime`` pkglist to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-install-cudaruntime \
pkglist=/opt/xcat/share/xcat/install/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist
Diskless images
---------------
The following examples will create diskless images for ``cudafull`` and ``cudaruntime``. The osimage definitions will be created from the base ``ubuntu14.04.3-ppc64el-netboot-compute`` osimage.
xCAT provides sample package list files for CUDA. You can find them at:
* ``/opt/xcat/share/xcat/netboot/ubuntu/cudafull.ubuntu14.04.3.ppc64el.pkglist``
* ``/opt/xcat/share/xcat/netboot/ubuntu/cudaruntime.ubuntu14.04.3.ppc64el.pkglist``
**[diskless note]**: For diskless images, the requirement for rebooting the machine is not applicable because the image is loaded on each reboot. The install of the CUDA packages is required to be done in the ``otherpkglist``, **NOT** the ``pkglist``.
cudafull
^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudafull``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-netboot-compute \
| sed 's/netboot-compute:/netboot-cudafull:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``otherpkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``otherpkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudafull \
otherpkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
#. Add the provided ``cudafull`` otherpkg.pkglist file to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudafull \
otherpkglist=/opt/xcat/share/xcat/netboot/ubuntu/cudafull.otherpkgs.pkglist
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Verify that ``acpid`` is installed on the Management Node or on the Ubuntu host where you are generating the diskless image: ::
apt-get install -y acpid
#. Generate the image: ::
genimage ubuntu14.04.3-ppc64el-netboot-cudafull
#. Package the image: ::
packimage ubuntu14.04.3-ppc64el-netboot-cudafull
cudaruntime
^^^^^^^^^^^
#. Create a copy of the ``netboot-compute`` image and label it ``cudaruntime``: ::
lsdef -t osimage -z ubuntu14.04.3-ppc64el-netboot-compute \
| sed 's/netboot-compute:/netboot-cudaruntime:/' \
| mkdef -z
#. Add the CUDA repo created in the previous step to the ``otherpkgdir`` attribute.
If your Management Node IP is 10.0.0.1, the URL for the repo would be ``http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local``. Add it to the ``otherpkgdir``::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudaruntime \
otherpkgdir=http://10.0.0.1/install/cuda-repo/ppc64el/var/cuda-repo-7-5-local
#. Add the provided ``cudaruntime`` otherpkg.pkglist file to install the CUDA packages: ::
chdef -t osimage -o ubuntu14.04.3-ppc64el-netboot-cudaruntime \
otherpkglist=/opt/xcat/share/xcat/netboot/ubuntu/cudaruntime.otherpkgs.pkglist
**TODO:** Need to add Ubuntu Port? "http://ports.ubuntu.com/ubuntu-ports trusty main,http://ports.ubuntu.com/ubuntu-ports trusty-updates main"
#. Verify that ``acpid`` is installed on the Management Node or on the Ubuntu host where you are generating the diskless image: ::
apt-get install -y acpid
#. Generate the image: ::
genimage ubuntu14.04.3-ppc64el-netboot-cudaruntime
#. Package the image: ::
packimage ubuntu14.04.3-ppc64el-netboot-cudaruntime


@@ -0,0 +1,153 @@
CUDA osimage configuration
==========================
CUDA packages are installed through xCAT's ``otherpkgs``. Replace
``<osver>``, ``<arch>``, and ``<distro>`` below with your values
(e.g., ``rocky10.1``, ``x86_64``, ``rhel10``).
Diskful nodes (RHEL)
--------------------
#. Create a copy of the base install osimage for CUDA::
lsdef -t osimage -z <osver>-<arch>-install-compute \
| sed 's/install-compute:/install-cuda:/' \
| mkdef -z
#. Add the CUDA repository to the ``pkgdir`` attribute.
For online setups, use the NVIDIA repository URL directly::
chdef -t osimage <osver>-<arch>-install-cuda -p \
pkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups with a local mirror::
chdef -t osimage <osver>-<arch>-install-cuda -p \
pkgdir=/install/cuda/<distro>/<arch>
#. Create a pkglist file for the CUDA packages::
mkdir -p /install/custom/install/rh
echo "cuda" > /install/custom/install/rh/cuda.pkglist
Or for runtime-only installations::
echo "cuda-runtime-13-2" > /install/custom/install/rh/cuda-runtime.pkglist
#. Set the pkglist on the osimage::
chdef -t osimage <osver>-<arch>-install-cuda \
pkglist=/install/custom/install/rh/cuda.pkglist
.. note::
For diskful installations, the CUDA packages should be installed via the
``pkglist`` attribute so that the required reboot after driver installation
happens naturally at the end of the OS install.
Diskful nodes (Ubuntu)
----------------------
#. Create a copy of the base install osimage::
lsdef -t osimage -z <osver>-<arch>-install-compute \
| sed 's/install-compute:/install-cuda:/' \
| mkdef -z
#. Add the CUDA repository.
For online setups::
chdef -t osimage <osver>-<arch>-install-cuda -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups::
chdef -t osimage <osver>-<arch>-install-cuda -p \
otherpkgdir=/install/cuda/<distro>/<arch>
#. Create an otherpkgs.pkglist file::
mkdir -p /install/custom/install/ubuntu
echo "cuda" > /install/custom/install/ubuntu/cuda.otherpkgs.pkglist
#. Set it on the osimage::
chdef -t osimage <osver>-<arch>-install-cuda \
otherpkglist=/install/custom/install/ubuntu/cuda.otherpkgs.pkglist
Diskless nodes
--------------
For diskless (stateless) nodes, the CUDA packages must be installed via
``otherpkglist`` (not ``pkglist``). The reboot requirement for CUDA drivers
does not apply since diskless nodes reload the image on each boot.
#. Create a copy of the netboot osimage::
lsdef -t osimage -z <osver>-<arch>-netboot-compute \
| sed 's/netboot-compute:/netboot-cuda:/' \
| mkdef -z
#. Add the CUDA repo to ``otherpkgdir``.
For online setups::
chdef -t osimage <osver>-<arch>-netboot-cuda -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
For offline setups with a local mirror::
chdef -t osimage <osver>-<arch>-netboot-cuda -p \
otherpkgdir=/install/cuda/<distro>/<arch>
#. Create an otherpkgs.pkglist::
mkdir -p /install/custom/netboot/rh
echo "cuda" > /install/custom/netboot/rh/cuda.otherpkgs.pkglist
#. Set it and rebuild the image::
chdef -t osimage <osver>-<arch>-netboot-cuda \
otherpkglist=/install/custom/netboot/rh/cuda.otherpkgs.pkglist
genimage <osver>-<arch>-netboot-cuda
packimage <osver>-<arch>-netboot-cuda
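The pkglist step above can be sketched end to end. A hedged example; the temporary directory stands in for ``/install/custom/netboot/rh``, and the extra ``dkms`` line is only needed when the driver is built via DKMS (see the repository setup notes on EPEL dependencies):

```shell
# Build an otherpkgs.pkglist for a diskless CUDA image.
# Paths and package names are illustrative.
dir=$(mktemp -d)            # stand-in for /install/custom/netboot/rh
{
  echo "dkms"               # EPEL dependency, only for DKMS-built drivers
  echo "cuda"               # full toolkit; use cuda-runtime-<maj>-<min> for runtime-only
} > "$dir/cuda.otherpkgs.pkglist"
cat "$dir/cuda.otherpkgs.pkglist"
```

The resulting file is what ``otherpkglist`` is pointed at before running ``genimage``.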
POWER9 setup
-------------
NVIDIA POWER9 CUDA drivers need additional configuration. See:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
xCAT includes a sample script ``cuda_power9_setup`` to handle this.
For diskful nodes::
chdef <noderange> -p postscripts=cuda_power9_setup
For diskless nodes, add it to the osimage postinstall script::
cp /opt/xcat/share/xcat/netboot/rh/compute.<osver>.<arch>.postinstall \
/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
echo "/install/postscripts/cuda_power9_setup" >> \
/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
chdef -t osimage <osver>-<arch>-netboot-cuda \
postinstall=/install/custom/netboot/rh/cuda.<osver>.<arch>.postinstall
Post-installation configuration
--------------------------------
NVIDIA recommends setting PATH and LD_LIBRARY_PATH for CUDA. xCAT provides
a sample postscript ``config_cuda`` for this::
chdef <noderange> -p postscripts=config_cuda
To set GPU attributes on each boot (these do not persist across reboots),
create a postscript that runs ``nvidia-smi`` commands. For example, to enable
persistence mode::
nvidia-smi -pm 1
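The ``nvidia-smi`` settings can be bundled into one postscript. A minimal sketch; the script is written to a temporary path here, whereas on a real management node it would live in ``/install/postscripts``, and the 175 W power cap is an example value taken from the earlier postscript discussion (tune it per GPU model):

```shell
# Generate a hypothetical config-gpu postscript that re-applies
# non-persistent GPU settings on every boot.
script=$(mktemp)            # stand-in for /install/postscripts/config-gpu
cat > "$script" <<'EOF'
#!/bin/sh
# Skip quietly on nodes without the NVIDIA tools installed.
command -v nvidia-smi >/dev/null 2>&1 || exit 0
nvidia-smi -pm 1      # enable persistence mode
nvidia-smi -pl 175    # example power cap in watts; adjust per GPU
EOF
chmod +x "$script"
```

It would then be attached with ``chdef <noderange> -p postscripts=<scriptname>``.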


@@ -1,13 +0,0 @@
Create CUDA software repository
===============================
The NVIDIA CUDA Toolkit is available to download at http://developer.nvidia.com/cuda-downloads.
Download the toolkit and prepare the software repository on the xCAT Management Node to serve the NVIDIA CUDA files.
.. toctree::
:maxdepth: 2
rhels.rst
ubuntu.rst


@@ -1,27 +0,0 @@
RHEL 7.5
========
#. Create a repository on the Management Node for installing the CUDA Toolkit: ::
# For cuda toolkit name: /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm
# extract the contents from the rpm
mkdir -p /tmp/cuda
cd /tmp/cuda
rpm2cpio /path/to/cuda-repo-rhel7-9-2-local-9.2.64-1.ppc64le.rpm | cpio -i -d
# Create the repo directory under xCAT /install dir for cuda 9.2
mkdir -p /install/cuda-9.2/ppc64le/cuda-core
cp /tmp/cuda/var/cuda-repo-9-2-local/*.rpm /install/cuda-9.2/ppc64le/cuda-core
# Create the yum repo files
createrepo /install/cuda-9.2/ppc64le/cuda-core
#. The NVIDIA CUDA Toolkit contains rpms that have dependencies on other external packages (such as ``DKMS``). These are provided by EPEL. It's up to the system administrator to obtain the dependency packages and add those to the ``cuda-deps`` directory: ::
mkdir -p /install/cuda-9.2/ppc64le/cuda-deps
# Copy the DKMS rpm to this directory
cp /path/to/dkms-2.4.0-1.20170926git959bd74.el7.noarch.rpm /install/cuda-9.2/ppc64le/cuda-deps
# Execute createrepo in this directory
createrepo /install/cuda-9.2/ppc64le/cuda-deps


@@ -1,37 +0,0 @@
Ubuntu 14.04.3
==============
NVIDIA supports two types of debian repositories that can be used to install the CUDA Toolkit: **local** and **network**. You can download the installers from https://developer.nvidia.com/cuda-downloads.
Local
-----
A local package repo will contain all of the CUDA packages. Extract the CUDA packages into ``/install/cuda-repo/ppc64el``: ::
# For CUDA toolkit: /root/cuda-repo-ubuntu1404-7-5-local_7.5-18_ppc64el.deb
# Create the repo directory under xCAT /install dir
mkdir -p /install/cuda-repo/ppc64el
# extract the package
dpkg -x /root/cuda-repo-ubuntu1404-7-5-local_7.5-18_ppc64el.deb /install/cuda-repo/ppc64el
Network
-------
The online package repo provides a source list entry pointing to a URL containing the CUDA packages. This can be used directly on the Compute Nodes.
The ``sources.list`` entry may look similar to: ::
deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/ppc64el /
Authorize the CUDA repo
-----------------------
In order to access the CUDA repository you must import the CUDA GPG key into the ``apt`` trusted key list. xCAT provides a sample postscript ``/install/postscripts/addcudakey`` to help with this task: ::
chdef -t node -o <noderange> -p postscripts=addcudakey


@@ -0,0 +1,79 @@
CUDA repository setup
=====================
NVIDIA hosts package repositories at::
https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/
Where ``<distro>`` is one of ``rhel6``, ``rhel7``, ``rhel8``, ``rhel9``,
``rhel10``, ``sles11``, ``sles12``, ``sles15``, ``ubuntu1404``, ``ubuntu1604``,
``ubuntu1804``, ``ubuntu2004``, ``ubuntu2204``, ``ubuntu2404``, ``ubuntu2604``
and ``<arch>`` is ``x86_64``, ``ppc64le`` (RHEL 7-8, Ubuntu 14.04-16.04), or
``sbsa`` (ARM).
.. note::
Older Ubuntu releases (14.04, 16.04) use ``ppc64el`` instead of
``ppc64le`` in the repository URL path.
Online setup
------------
If nodes have network access, point ``otherpkgdir`` at the NVIDIA URL directly::
chdef -t osimage <osimage> -p \
otherpkgdir=https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>
The ``otherpkgs`` postscript will configure this as a package repository on
the node during provisioning.
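Assembling the repository URL from the placeholders is purely mechanical. A sketch with illustrative values:

```shell
# Compose the NVIDIA repo URL from <distro>/<arch> placeholders.
# rhel9/x86_64 are example values, not the only supported pair.
distro=rhel9
arch=x86_64
url="https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}"
echo "$url"    # value passed to chdef ... otherpkgdir=...
```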
Offline setup (air-gapped clusters)
------------------------------------
For clusters without internet access, mirror the NVIDIA repository to a
local directory under ``/install`` on the management node.
RHEL
^^^^
Use ``dnf download`` (or ``yumdownloader`` on RHEL 7) on a system with internet
access to download the CUDA packages and their dependencies::
mkdir -p /install/cuda/<distro>/<arch>
dnf download --resolve --destdir /install/cuda/<distro>/<arch> cuda
createrepo /install/cuda/<distro>/<arch>
For EPEL dependencies such as ``dkms``::
dnf download --resolve --destdir /install/cuda/<distro>/<arch> dkms
createrepo /install/cuda/<distro>/<arch>
SLES
^^^^
Use ``zypper download`` on a system with internet access::
mkdir -p /install/cuda/<distro>/<arch>
zypper --pkg-cache-dir /install/cuda/<distro>/<arch> download cuda
createrepo /install/cuda/<distro>/<arch>
For a runtime-only installation, replace ``cuda`` with
``cuda-runtime-<major>-<minor>`` (e.g., ``cuda-runtime-13-2``).
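The runtime package name follows directly from the CUDA version string. A sketch using the ``13.2`` value from the example above:

```shell
# Derive the cuda-runtime package name from a CUDA version string.
version=13.2
major=${version%%.*}            # "13"
minor=${version#*.}             # "2"
pkg="cuda-runtime-${major}-${minor}"
echo "$pkg"
```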
Ubuntu
^^^^^^
Use ``apt download`` on a system with internet access::
mkdir -p /install/cuda/<distro>/<arch>
cd /install/cuda/<distro>/<arch>
apt download cuda $(apt-cache depends --recurse --no-recommends \
--no-suggests --no-conflicts --no-breaks --no-replaces \
--no-enhances cuda | grep "^\w" | sort -u)
dpkg-scanpackages . /dev/null | gzip -9c > Packages.gz
.. note::
The offline approach requires downloading packages on a system running
the same OS version and architecture as the target nodes. Transfer the
resulting directory to the management node under ``/install``.
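On the node, a flat mirror like the one produced above would be consumed via an apt source entry. A hedged sketch of generating one; the management node IP, the ``[trusted=yes]`` flag (needed because a hand-built flat repo is unsigned), and the path are all illustrative assumptions:

```shell
# Write a sources.list entry for a flat local mirror served over
# HTTP from the management node (IP and path are illustrative).
entry="deb [trusted=yes] http://10.0.0.1/install/cuda/<distro>/<arch> ./"
f=$(mktemp)                  # stand-in for /etc/apt/sources.list.d/cuda.list
echo "$entry" > "$f"
cat "$f"
```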


@@ -1,21 +1,21 @@
 Update NVIDIA Driver
 =====================
 
-If the user wants to update the newer NVIDIA driver on the system, follow the :doc:`Create CUDA software repository </advanced/gpu/nvidia/repo/index>` document to create another repository for the new driver.
+To update to a newer NVIDIA driver on the system, follow the :doc:`CUDA repository setup </advanced/gpu/nvidia/repo_setup>` document to create another repository for the new driver.
 
-The following example assumes the new driver is in ``/install/cuda-9.2/ppc64le/nvidia_new``.
+The following example assumes the new driver is in ``/install/cuda/<distro>/<arch>/nvidia_new``.
 
 Diskful
 -------
 
 #. Change pkgdir for the cuda image: ::
 
-    chdef -t osimage -o rhels7.5-ppc64le-install-cudafull \
-        pkgdir=/install/cuda-9.2/ppc64le/nvidia_new,/install/cuda-9.2/ppc64le/cuda-deps
+    chdef -t osimage -o <osver>-<arch>-install-cuda \
+        pkgdir=/install/cuda/<distro>/<arch>/nvidia_new
 
 #. Use xdsh command to remove all the NVIDIA rpms: ::
 
-    xdsh <noderange> "yum remove *nvidia* -y"
+    xdsh <noderange> "dnf remove *nvidia* -y"
 
 #. Run updatenode command to update NVIDIA driver on the compute node: ::
 
@@ -35,4 +35,4 @@ Diskless
 To update a new NVIDIA driver on diskless compute nodes, re-generate the osimage pointing to the new NVIDIA driver repository and reboot the node to load the diskless image.
 
-Refer to :doc:`Create osimage definitions </advanced/gpu/nvidia/osimage/index>` for specific instructions.
+Refer to :doc:`CUDA osimage configuration </advanced/gpu/nvidia/osimage_setup>` for specific instructions.


@@ -1,80 +1,36 @@
 Verify CUDA Installation
 ========================
 
-**The following verification steps only apply to the ``cudafull`` installations.**
+The following verification steps only apply to the ``cuda`` (full) installations and require nodes with physical NVIDIA GPU hardware.
 
 #. Verify driver version by looking at ``/proc/driver/nvidia/version``: ::
 
-    # cat /proc/driver/nvidia/version
-    NVRM version: NVIDIA UNIX ppc64le Kernel Module 352.39 Fri Aug 14 17:10:41 PDT 2015
-    GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
+    cat /proc/driver/nvidia/version
 
 #. Verify the CUDA Toolkit version: ::
 
-    # nvcc -V
-    nvcc: NVIDIA (R) Cuda compiler driver
-    Copyright (c) 2005-2015 NVIDIA Corporation
-    Built on Tue_Aug_11_14:31:50_CDT_2015
-    Cuda compilation tools, release 7.5, V7.5.17
+    nvcc -V
 
 #. Verify running CUDA GPU jobs by compiling the samples and executing the ``deviceQuery`` or ``bandwidthTest`` programs.
 
-   * Compile the samples:
+   * Compile the samples: ::
 
-     **[RHEL]:** ::
-
-       cd ~/
-       cuda-install-samples-7.5.sh .
-       cd NVIDIA_CUDA-7.5_Samples
+       git clone https://github.com/NVIDIA/cuda-samples.git
+       cd cuda-samples/Samples/1_Utilities/deviceQuery
        make
 
-     **[Ubuntu]:** ::
-
-       cd ~/
-       apt-get install cuda-samples-7-0 -y
-       cd /usr/local/cuda-7.0/samples
-       make
-
   * Run the ``deviceQuery`` sample: ::
 
-      # ./bin/ppc64le/linux/release/deviceQuery
-      ./deviceQuery Starting...
-      CUDA Device Query (Runtime API) version (CUDART static linking)
-      Detected 4 CUDA Capable device(s)
-      Device 0: "Tesla K80"
-        CUDA Driver Version / Runtime Version          7.5 / 7.5
-        CUDA Capability Major/Minor version number:    3.7
-        Total amount of global memory:                 11520 MBytes (12079136768 bytes)
-        (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
-        GPU Max Clock rate:                            824 MHz (0.82 GHz)
-        Memory Clock rate:                             2505 Mhz
-        Memory Bus Width:                              384-bit
-        L2 Cache Size:                                 1572864 bytes
-      ............
-      deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 4, Device0 = Tesla K80, Device1 = Tesla K80, Device2 = Tesla K80, Device3 = Tesla K80
-      Result = PASS
+      ./deviceQuery
+
+    A successful run will end with ``Result = PASS``.
 
   * Run the ``bandwidthTest`` sample: ::
 
-      # ./bin/ppc64le/linux/release/bandwidthTest
-      [CUDA Bandwidth Test] - Starting...
-      Running on...
-      Device 0: Tesla K80
-      Quick Mode
-      Host to Device Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     7765.1
-      Device to Host Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     7759.6
-      Device to Device Bandwidth, 1 Device(s)
-      PINNED Memory Transfers
-        Transfer Size (Bytes)        Bandwidth(MB/s)
-        33554432                     141485.3
-      Result = PASS
+      cd ../bandwidthTest
+      make
+      ./bandwidthTest
+
+    A successful run will end with ``Result = PASS``.
 
 NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
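The first two checks above can be wrapped in a small script that degrades gracefully on machines without a GPU or toolkit. A hedged sketch (the warning strings are illustrative):

```shell
# Collect the same version info as the manual verification steps,
# without failing on machines that lack a GPU or the toolkit.
if [ -r /proc/driver/nvidia/version ]; then
  driver_info=$(cat /proc/driver/nvidia/version)
else
  driver_info="WARN: NVIDIA kernel module not loaded"
fi
if command -v nvcc >/dev/null 2>&1; then
  toolkit_info=$(nvcc -V)
else
  toolkit_info="WARN: nvcc not found in PATH"
fi
echo "$driver_info"
echo "$toolkit_info"
```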


@@ -77,7 +77,7 @@ The following software kits will be used to install the IBM HPC software stack o
 The ESSL software kit has an *external dependency* to the ``libxlf`` which is provided in the XLF software kit. Since it's already added in the above step, there is no action needed here.
 
-If CUDA toolkit is being used, ESSL has a runtime dependency on the CUDA rpms. The administrator needs to create the repository for the CUDA 7.5 toolkit or a runtime error will occur when provisioning the node. See the :doc:`/advanced/gpu/nvidia/repo/index` section for more details about setting up the CUDA repository on the xCAT management node. ::
+If CUDA toolkit is being used, ESSL has a runtime dependency on the CUDA rpms. The administrator needs to create the repository for the CUDA 7.5 toolkit or a runtime error will occur when provisioning the node. See the :doc:`/advanced/gpu/nvidia/repo_setup` section for more details about setting up the CUDA repository on the xCAT management node. ::
 
 #
 # Assuming that the cuda repo has been configured at: