PSIgroup | Computing / GPUAccessNotes

Note: Angle brackets, < >, in the following notes mark placeholders that you should replace with your own relevant details.

Synchronising GPU access

I have put a Perl script called solo in /usr/local/bin/. It prevents two programs from running at the same time by binding to a specified port: only one process can hold a given port, so a second invocation on the same port fails. The format is:

solo -port=<port> <command>

This is easiest to use wrapped in a bash script, e.g. run.sh like this:

#!/bin/bash
solo -port=3801 ". /home/daq/miniconda3/etc/profile.d/conda.sh &&
conda activate py3ml && python <python_script.py>"

If we all use the same port, 3801, then only one job started through solo can use the GPU at a time. The scheme is cooperative: a program started without solo is not blocked.

If someone else is currently using the GPU, you will get this message: solo(3801): Address already in use.
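If you would rather wait for the GPU than rerun by hand, a retry loop around the command is one option. The sketch below is an illustration and not part of solo: retry_until_free is a hypothetical helper, and it assumes solo exits with a non-zero status when the port is taken. Because solo hands over to your command, a failure of your own script also gives a non-zero status and would be retried too, so the number of attempts is capped.

```shell
#!/bin/bash
# retry_until_free: hypothetical helper, not part of solo.
# Usage: retry_until_free <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it exits with status 0 or the attempts run out.
retry_until_free() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    if (( attempt < max_attempts )); then
      echo "attempt $attempt of $max_attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
  done
  echo "giving up after $max_attempts attempts" >&2
  return 1
}

# Example (commented out so that sourcing this file runs nothing):
# retry_until_free 30 60 ./run.sh
```

With the run.sh shown above, the commented example would try once a minute for up to half an hour.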

I looked at other process synchronisation alternatives, such as acquiring a file lock, creating a lock folder in bash, or using the lockrun C program. All of these had various disadvantages, but solo seems to work well.

Easily running code on Wolf GPU

I have written a bash script that I am using to copy data and code to Wolf, run it, and copy the results back.

First up, set up ssh keys to log in to wolf.

1. Generate a key (if you do not already have one).

 
ssh-keygen -t rsa -C "your_email@example.com"

Pick a name for the key that will identify you so it is easier to keep track. There is no need to use a passphrase, though you can if you want.

2. Copy the public key to wolf

ssh-copy-id -i ~/.ssh/<keyname> daq@196.24.76.207

If you don't have ssh-copy-id, manually copy the .pub file that ssh-keygen created to wolf and append it to ~/.ssh/authorized_keys (creating the file if necessary). Then make sure the permissions are correct: on wolf, the home directory must not be writable by group or others, ~/.ssh should be mode 700 and ~/.ssh/authorized_keys mode 600. If ssh using keys does not work, directory permissions are the most common culprit.
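The manual route can be sketched as below. This is a minimal illustration assuming standard OpenSSH behaviour with its default StrictModes setting, under which sshd ignores keys if the relevant files or directories are writable by others. install_pubkey is a hypothetical helper name, and the home directory is an optional argument so the function can be tried on a scratch directory first.

```shell
#!/bin/bash
# install_pubkey: hypothetical helper, to be run on wolf after copying the .pub file there.
# Usage: install_pubkey <public_key_file> [home_directory]
install_pubkey() {
  local pubkey=$1 home_dir=${2:-$HOME}
  local ssh_dir="$home_dir/.ssh"
  local auth_file="$ssh_dir/authorized_keys"

  [ -f "$pubkey" ] || { echo "no such key file: $pubkey" >&2; return 1; }

  mkdir -p "$ssh_dir"
  touch "$auth_file"

  # Append the key only if this exact line is not already present.
  if ! grep -qxF -- "$(cat "$pubkey")" "$auth_file"; then
    cat "$pubkey" >> "$auth_file"
  fi

  # Tighten permissions: nothing writable by group or others.
  chmod go-w "$home_dir"
  chmod 700 "$ssh_dir"
  chmod 600 "$auth_file"
}

# Example (commented out): install_pubkey ~/<keyname>.pub
```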

3. Add an entry to ssh config file

On your own computer, add an entry to ~/.ssh/config like this, where <username> is the account on wolf that you copied the key to in step 2 (daq in that example):

Host wolf
Hostname 196.24.76.207
User <username>
IdentityFile ~/.ssh/<keyname>

At this point, you should be able to log in with ssh wolf without being asked for a password at any point. Debug it until this works.
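One way to check this non-interactively is ssh's BatchMode option, which makes ssh fail instead of prompting. The helper below is a sketch: check_key_login is a hypothetical name and the 10-second timeout is an arbitrary choice. Note that BatchMode also disables passphrase prompts, so a passphrase-protected key that is not loaded in an agent will be reported as a failure.

```shell
#!/bin/bash
# check_key_login: hypothetical helper that tests key-based login to a host alias.
# BatchMode=yes makes ssh fail rather than ask for a password or passphrase.
check_key_login() {
  local host=${1:-wolf}
  if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
    echo "key login to $host works"
  else
    echo "key login to $host failed; rerun with 'ssh -v $host' for details" >&2
    return 1
  fi
}

# Example (commented out): check_key_login wolf
```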

The bash script

Replace everything within angle brackets < > as necessary:

#!/bin/bash
echo "#####################################################################################"
echo "  Copying files to wolf"
echo "#####################################################################################"
if rsync --progress -auvx <file1_or_dir1> \
<file2_or_dir2> \
<file3_or_dir3> \
<...> \
--exclude '__pycache__' wolf:<destination_folder> ; then

  echo "done."
  echo
  echo "#####################################################################################"
  echo "  Running code"
  echo "#####################################################################################"


  if ssh wolf 'cd <destination_folder> && . run.sh' ; then

    echo "done."
    echo
    echo "#####################################################################################"
    echo "  Copying results back"
    echo "#####################################################################################"
    rsync --progress -auvx wolf:<destination_folder>/<results> .
  fi
fi

Some comments:

This script assumes you have a folder in your home directory on wolf (referred to as <destination_folder> above) containing a script called run.sh that sets up the environment, takes out a lock on the GPU and runs your code, like the example at the top.

You can list several files to rsync on different lines by escaping the newline character. That is, the \ must be the very last character on the line, with no trailing spaces after it.
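A bash array is an alternative way to lay out a long file list that avoids the trailing-backslash pitfall. The sketch below only builds and prints the argument list; the file names and the destination folder experiments are hypothetical placeholders, not paths from these notes.

```shell
#!/bin/bash
# Hypothetical file list; replace with your own paths.
sources=(
  "train.py"      # comments are allowed here, unlike after a backslash
  "models"
  "data/my set"   # quoting keeps a path with spaces as one argument
)

# Full argument list for rsync; "experiments" is a hypothetical destination folder.
rsync_args=(--progress -auvx --exclude '__pycache__' "${sources[@]}" "wolf:experiments")

# Print the command one argument per line instead of running it.
printf '%s\n' rsync "${rsync_args[@]}"
```

To run the transfer for real, the last line would become rsync "${rsync_args[@]}".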

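rsync is used because it only copies files that have changed, which helps for repeated runs with small tweaks on the same data set. Note that the -x flag (--one-file-system) does not compress anything; it only stops rsync from crossing filesystem boundaries. To compress data in transit, which is useful for large files that are not already compressed, use -z instead (for example -auvz).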

--exclude '__pycache__' is useful when copying whole folders, so that the bytecode cache files Python creates on execution are not copied across.

The if statements are included so that if one step fails, the script stops rather than continuing with the next step.
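The same stop-on-first-failure behaviour can also be written without nesting, by putting each stage in a function and running them in order. This is a sketch of an alternative layout, not the script above: run_steps and the three step functions are hypothetical names, and the step bodies are placeholders for the rsync and ssh commands.

```shell
#!/bin/bash
# run_steps: hypothetical helper that runs the named functions in order
# and stops at the first one that fails.
run_steps() {
  local step
  for step in "$@"; do
    echo "== $step"
    if ! "$step"; then
      echo "step '$step' failed; skipping the remaining steps" >&2
      return 1
    fi
  done
}

# Hypothetical step functions; fill in the rsync and ssh commands from the script above.
copy_to_wolf() { echo "rsync to wolf goes here"; }
run_on_wolf()  { echo "ssh wolf ... goes here"; }
copy_back()    { echo "rsync from wolf goes here"; }

# Example (commented out): run_steps copy_to_wolf run_on_wolf copy_back
```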