High Performance Computing: A Cheat Sheet
How to connect to, set up and use a cluster
It seems strange to say now, but during my PhD I didn’t – and couldn’t – code. I had the romantic idea that pen and paper physics was more ‘real’ than getting a computer to do the work for me, and I just wasn’t interested in ‘becoming a programmer’. Since my first postdoc, however, I’ve been moving in a more and more numerical direction and at this point the majority of what I do is numerical in nature. Over the years I’ve acquired quite a bit of experience in working with all sorts of setups, from workstations to clusters to national supercomputers.
As part of my job, I’ve often had to train graduate students in how to work with cluster computers, and in particular how to remotely connect and run code over the command line. This post is compiled from a series of tutorial documents I’ve written over the years, and I’m basically posting it here to have a list of handy terminal commands all in one place that I can point students to in the future.
If you’ve ever used a cluster computer before, this likely won’t be for you, but if you’re new to the concept or if you stumbled across this blog while searching for information on how to use SLURM, hopefully you’ll find something useful here!
Connecting: SSH and SFTP
Let’s assume you’re already logged into the university network, either because you’re connecting from on-site or have a VPN active. (If not, you may have to first SSH into the university’s external-facing servers, but I won’t cover that here…!)
We’ll connect to the server using the `ssh` (Secure Shell Protocol) command. Note that if you’re using Windows, you may need to install a separate program such as PuTTY in order to use `ssh`, but on Linux or Mac it will work straight out of the box from your computer’s Terminal.
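The basic command looks like this, where `[username]` and `[server name]` are placeholders for your cluster account and the server’s address:

```
ssh [username]@[server name]
```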
You will then be asked for your password. This gets you onto the HPC network from outside.
Once you connect using SSH, you can access the computer’s file system. You can edit files, create new files, copy or delete files, and remotely launch Python scripts.
Typing in your username and password every single time isn’t very convenient, so if you wish, you can use `ssh-keygen` to speed up the process. This works by creating a pair of files known as a public and private key. You upload the public key to the server you wish to connect to, and the private key always stays on your computer. This allows you to connect without having to enter your password every time. You can find the full instructions for how to do this here.
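As a rough sketch, generating a key pair and copying the public key to the server can be done with the standard `ssh-keygen` and `ssh-copy-id` tools (the `ed25519` key type is just one common choice):

```
ssh-keygen -t ed25519
ssh-copy-id [username]@[server name]
```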
Once you’ve done this, to set up an even simpler login process, you can navigate to your SSH configuration file using the command line (which on a Mac can be found at `~/.ssh/config`), and enter the following text:
```
Host [nickname]
    Hostname [server name]
    User [username]

Host *
    AddKeysToAgent yes
    UseKeychain yes
    IdentityFile [path to private key]
```
This now allows you to connect simply by typing `ssh [nickname]`, bypassing the need to enter your full username or type in your password. If you have to add more cluster computers to the config file in the future, just add a new `Host [nickname]` entry with the same parameters as the top one - there’s no need to add the `Host *` part again.
To upload files to/from a system, we instead need to use SFTP (the Secure File Transfer Protocol). The syntax is the same:
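That is, using the same placeholders as before:

```
sftp [username]@[server name]
```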
Or, if you’ve set up the fast login method above, simply type
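Here `[nickname]` is whatever alias you set in your SSH config file:

```
sftp [nickname]
```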
Then to upload files, use the command:
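This is done with SFTP’s built-in `put` command:

```
put [filename]
```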
And to download files to your computer:
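This uses the corresponding `get` command:

```
get [filename]
```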
To upload/download an entire folder, add the recursive `-r` flag:

```
put -r [folder name]
get -r [folder name]
```
This works fine for uploading and downloading a handful of files, but when downloading a large number of files from a system it’s often easier to use `rsync`. This synchronises a folder on your local drive with a folder on the remote machine, only downloading new files and skipping any that have not changed since the last time you connected. If you’re downloading a large amount of data, `rsync` is far more efficient than `sftp`, which requires you to re-download all files in a directory regardless of whether or not you already have them. If, for example, you only need a handful of new files but they’re in a directory with 10 GB of other data, clearly you don’t want to download all 10 GB again. That’s where `rsync` comes in, which can be run using the following command:
```
rsync -ave ssh [username]@[servername]:[remote directory]/ [local directory]/
```
Or with the fast login method explained above:

```
rsync -ave ssh [nickname]:[remote directory]/ [local directory]/
```
(Note that the `-ave` flags are just the options I like to use - you can find the full list in the `rsync` documentation. Briefly, `-a` is ‘archive’ mode, `-v` gives verbose output, and `-e ssh` tells `rsync` to connect over SSH.)
Navigating the filesystem

Once you have connected to a system via either SSH or SFTP, you can use the following commands to navigate the filesystem. If you’re used to the command line on your local computer, most of these should be familiar. (Unless you’re using Windows, in which case a few of them may be new!)
`ls` - list all files in your current directory
`ls -hralt` - list all files in long format, sorted so the most recently modified files appear at the bottom
`ls -hralt | tail -5` - list only the 5 most recently modified files
`ls -l | wc -l` - count the number of files in a folder
`du -h` - show the disk usage of all folders and subfolders in human-readable units
`top` - show all currently running processes on the computer (then press ‘q’ to quit)
`cd [folder name]` - change directory to the specified folder, e.g. `cd SYK` to move into a folder called SYK
`cd ..` - go back up one directory, e.g. from the folder `SYK/data`, this command will take you to the folder `SYK`
`rm [filename]` - delete the specified file
`rm -r [folder name]` - delete the specified folder and everything in it
`vim [filename]` - open a file in the vim text editor, or create it if it does not exist (e.g. `vim main.py` to open a Python script called main.py)
`nano [filename]` - an alternative to the above, opens the file in the nano text editor
From inside the `vim` text editor, you can use the following:

`a` - enter editing (insert) mode
`Esc` - stop editing
`:wq` - write and close the file
`:q!` - close the file without writing
Setting up Python

Here I’ll assume that we want to install Python. Some clusters will come with all necessary libraries already installed, some will require you to make a request to the local admin, and others may allow you to install libraries yourself.
All things being equal, for Python I find Miniconda to be the smoothest, simplest way to get up and running on a cluster. Some admins will disagree (our local admin is ’not a fan’ of ‘bloated’ Conda distributions, for example!) but if you’re just getting started and don’t want to waste a lot of time setting things up yourself, to my mind Miniconda is the best compromise between ease of use and efficiency, as it’ll get you up and running very quickly. As you gain experience, you may find that a more fine-tuned approach is better, but here let’s take the easiest option.
As most clusters are Linux-based, you can install the latest version of Miniconda by typing the following command into the console:
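A typical install (assuming an x86-64 Linux machine - check the Miniconda website for the right installer for your system) looks like:

```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```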
It will take a little while to download and run, but when it’s finished, you’ll have Python on your cluster!
Launching a Python script using SLURM
To launch a Python script on a workstation/desktop, you can use:
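For a script called `main.py`, that’s simply:

```
python main.py
```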
For the cluster, we can’t do it like this – never type `python main.py` on the cluster, or you’ll run your script on the master node, which may cause problems for other users1. Every cluster I’ve used has handled queueing using a workload manager called SLURM. Depending on how your local admin has set up your cluster, jobs might be handled on a first-come-first-served basis, or there might be a priority system where users are granted diminishing priority the more they use the cluster, ensuring that anyone who submits thousands of jobs will not block other users from being able to run theirs.
To get a job to run on a cluster using SLURM, we need to submit a ‘batch file’ (with extension .sh) so that our job can be properly queued and sent to the right part of the cluster. An example batch file looks like the following:
```
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --partition=[partition name]
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5

##############################################
# INITIALISATION STEPS
##############################################

python main.py
```
This submission script requests 5 CPU cores on a single node in order to run the file `main.py`. Of course, the Python script itself must be set up to use parallel processing for this to be useful, but that’s between you and your code.
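What ‘set up to use parallel processing’ looks like will depend on your problem, but as a minimal sketch using only Python’s standard library (the function and variable names here are purely illustrative):

```python
# main.py - illustrative sketch of a script that can use several CPU cores.
from multiprocessing import Pool

def simulate(seed):
    # Stand-in for an expensive computation (here just a trivial sum).
    return sum(i * seed for i in range(1000))

if __name__ == "__main__":
    # Match the number of workers to the cores requested in the batch file
    # (e.g. #SBATCH --cpus-per-task=5).
    with Pool(processes=5) as pool:
        results = pool.map(simulate, range(10))
    print(results)
```

Each call to `simulate` runs in a separate worker process, so the ten calls are spread across the five cores the batch file requested.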
You can create this file using `vim run.sh`, then paste the above text, then press `Esc` followed by `:wq` to save and exit the file. (Note that you might have to change the filename from the above to match whatever your script is called…!) Of course, if what you want to run isn’t Python, just swap out the `python main.py` line for whatever is the right command for the software you want to use. Some clusters will ask you to specify other things too, such as the total memory required for the job or the maximum time the job is allowed to run before the system will cancel it. Exactly what’s required will depend on how the cluster in question is set up by the admin in charge, but it’s good practice to submit a job requesting the minimum resources you need in order to run your code. Requesting more resources won’t hurt – and it’s good to give yourself a bit of a margin in case your estimates are off – but if you’re using a busy cluster, it may mean that your job spends longer in the queue waiting for the required amount of resources to be freed up.
Once you have a batch file, you can launch a job by using:
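The SLURM command for this is `sbatch`:

```
sbatch [batch file name]
```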
e.g. for our script `run.sh`, we type:
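In full, that’s:

```
sbatch run.sh
```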
This will put your job into the queue for the cluster. It will run as soon as a node is available. You can see all current jobs on the cluster from all users by typing:
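SLURM’s queue can be inspected with the `squeue` command:

```
squeue
```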
And you can see just your own current jobs on the cluster by typing:
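Here `[username]` is your own cluster username:

```
squeue -u [username]
```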
You can cancel a job by typing:

```
scancel [job number]
```
And you can get a status report on the cluster by typing:
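The standard command for this is `sinfo`, which lists the state of each partition and its nodes:

```
sinfo
```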
Depending on how heavily used your cluster is, it may take some time before sufficient resources are available to run your job. Using the above commands, you can get some idea of how busy the cluster is, though there’s typically no way to know when resources will be freed unless you know precisely how long other users’ jobs will take.
One small note before we finish up: if you’re going to be generating a lot of data, it’s good practice not to save it all in your `/home` directory, as filling it up can slow down the cluster for everyone. Normally there will be a second storage drive, conventionally called `/scratch`, and unless you want to annoy your local admin, this is where you’ll want to save all your large datasets.
Pro tip: Make sure that there is in fact a save command in your code, and don’t make the mistake of running it for days/weeks/months only to find that you forgot to include one, or commented it out for testing purposes2. If possible, before running a big job always do a small test run that will finish quickly, just to check that it executes with no issues and generates the expected output.
While hardly complete, this brief guide should be enough to get you up and running even if you’ve never connected to a cluster computer before. The best thing to do is to dive in and start playing with these commands. Start small, with a short script that runs in a few seconds/minutes on your local machine - that way, if anything goes wrong, you won’t cause too much of a problem for other users.
It’s also worth keeping in mind that clusters come with a cost, both monetary (paying for the electricity, cooling, etc) and environmental (also the electricity, cooling, etc…!). While you may not see this cost up front, it is still there, so be mindful of what you’re running on a cluster, don’t abuse your access by running frivolous things and consider making a note of the carbon cost of your work for the purposes of reporting or offsetting it. You can check out the Scientific CO2nduct framework for more on estimating the carbon cost of scientific research.
Anything I missed? Are there any tricks I don’t know about, or any bad advice contained above? Let me know by dropping me an e-mail or hitting me up on Twitter!