Resources

Installation

UJ machines are managed by Puppet. The CUDA class is here on the group GitHub account. Ubuntu 22.04 machines use the developer.download.nvidia.com deb repository. The psi-cluster machine should act as an HTTP proxy for apt, which speeds up downloads. Puppet installs the cuda, cudnn and datacenter-gpu-manager packages via apt.
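Once Puppet has applied the class, a quick check that the driver can see the GPUs can be useful. The following is a minimal sketch, assuming the nvidia-smi tool that normally ships with the NVIDIA driver is on the PATH; the helper name list_gpus is illustrative and is not part of the Puppet class.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, driver_version' string per GPU.

    Returns an empty list if nvidia-smi is missing or fails
    (assumption: nvidia-smi is installed alongside the driver).
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus:
        for gpu in gpus:
            print(gpu)
    else:
        print("No GPUs reported (nvidia-smi missing or driver not loaded).")
```

An empty result usually means the driver is not installed or not loaded, rather than a problem with the cuda or cudnn packages themselves.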

Examples

Keras distributed notebook

JupyterLab can run Keras notebooks using only PyPI packages (plus the cuda-drivers package). The notebook for this example shows how to train and evaluate an ML model across multiple GPUs on a single host machine. See the online version here. To set up with pip:

pip install 'tensorflow[and-cuda]' keras jupyterlab matplotlib tensorflow-addons tensorflow-datasets
wget https://raw.githubusercontent.com/tensorflow/docs/master/site/en/tutorials/distribute/keras.ipynb
jupyter lab --ip=0.0.0.0

This starts JupyterLab listening on all network interfaces. If you need access from outside the machine's network, see SSHTunnels, in particular how to run a dynamic SOCKS proxy.

Note: While evaluating the notebook, you should see some errors about CUDA plugins already having been registered. This is expected and seems to have no impact on execution. It will hopefully be fixed in TensorFlow 2.16, which will be the first version to support Keras 3.
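For orientation, the core pattern behind single-host multi-GPU training can be sketched outside the notebook. This is a minimal sketch, not the notebook's code: it assumes tf.distribute.MirroredStrategy, uses random synthetic data in place of the notebook's dataset, and the function names global_batch_size and train_on_all_gpus, the layer sizes and the per-replica batch size of 64 are illustrative choices.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Each replica (GPU) handles per_replica_batch examples per step."""
    return per_replica_batch * num_replicas


def train_on_all_gpus(epochs=1):
    # Imported lazily so the helper above works without TensorFlow installed.
    import numpy as np
    import tensorflow as tf

    # Mirrors variables across all visible GPUs; falls back to CPU if none.
    strategy = tf.distribute.MirroredStrategy()
    batch = global_batch_size(64, strategy.num_replicas_in_sync)

    # Synthetic stand-in data (assumption: illustrative only).
    x = np.random.rand(1024, 28, 28, 1).astype("float32")
    y = np.random.randint(0, 10, size=(1024,))
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(batch)

    # Variables must be created inside the strategy scope to be mirrored.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    model.fit(dataset, epochs=epochs)
    return model.evaluate(dataset, return_dict=True)


if __name__ == "__main__":
    print(train_on_all_gpus())
```

Scaling the batch size by the number of replicas keeps the per-GPU workload constant as GPUs are added. The learning rate may need retuning for larger global batches.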