In the lab where I used to work, I managed training neural network models on a GPU server without any job scheduler: `ssh` in, create a Python virtual environment, and train inside a `screen` session. That simple!
I recently changed labs and suddenly found myself working with a cluster that requires considerably more preparation to train models. However, as I gradually learnt more about it, it became clear that the workflow is actually very easy and extremely efficient when resources are shared.
This short blog post aims to a) give a complete newbie the most important steps for using a cluster, and b) serve as a quick reference for future use!
One important basic distinction to make is the difference between a server and a cluster. A server, also known as a node, is a single computer, pretty much like the one you are using to read this post. A set of nodes connected together that can coordinate among themselves in a scheduled and systematic way, on the other hand, is called a cluster. Therefore, you can imagine that clusters are more robust and efficient for running intensive jobs.
The following figure shows a cluster.
As you see in the figure, the user initially connects to the login nodes where jobs are prepared and the requirements for running a specific code are set up. Login nodes serve as a staging area to submit jobs to the batch scheduler in the cluster.
Once your code, data and whatever else you need to run your job are ready on the login nodes, it is time to submit your job to the batch scheduler. Once your work is queued, the batch scheduler schedules your job depending on your requirements in terms of memory, GPU usage and time, and on those of the existing jobs on the cluster. In other words, a batch scheduler is a referee that makes sure everyone is treated fairly by reducing conflicts in resource allocation and increasing the accessibility of computing resources to the users.
In this tutorial, we will learn how to use SLURM, a popular job scheduling system for large and small Linux clusters.
NOTE: How things are organized in the cluster that you use may be different from mine. Make sure you get familiar with your cluster beforehand.
Let’s delve more into how to use a cluster in practice.
Say you want to train a model in Python on a set of datasets. Normally, you'd do the following:
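For instance, something like this (a minimal sketch; `train.py`, the dataset path and the flags are hypothetical placeholders):

```bash
# Activate the environment and run the training script directly
source venv/bin/activate
python train.py --data ./datasets --epochs 50
```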
Now, if you want a job scheduling system like SLURM to take care of training your model, you should forget about running your code directly. Rather, you should submit a job by creating a configuration in SLURM that specifies what you need from the cluster.
Your configuration in SLURM looks like this (notice the comments that I have left with ##):
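Here is a minimal sketch; the resource limits and file names are placeholders, `train.py` is the hypothetical script from above, and the exact directives your cluster expects may differ:

```bash
#!/bin/bash

## Part 1: resource requests, read by SLURM
## Name of the job as it appears in the queue
#SBATCH --job-name=train-model
## File to write logs to (%j is replaced by the job id)
#SBATCH --output=train-model-%j.out
## Maximum wall-clock time (hh:mm:ss)
#SBATCH --time=04:00:00
## Memory for the job
#SBATCH --mem=16G
## Request one GPU
#SBATCH --gres=gpu:1
## CPU cores, e.g. for data loading
#SBATCH --cpus-per-task=4

## Part 2: the commands that actually run the job
source venv/bin/activate
python train.py --data ./datasets --epochs 50
```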
The configuration has two main parts: the `#SBATCH` directives that tell the scheduler which resources you need, and the shell commands that actually run your job.
Save this as `config.sh` or `config.slurm` and make sure it's in your working directory on the cluster.
Even though it is possible to use command-line editors like Vim or Emacs, it is sometimes easier to handle some configurations locally. So, you can set up most of your project locally; you can always apply tiny modifications on the server later.
Once your code and data are available, transfer everything to your desired directory on the server. For this, `scp` is usually used as follows:
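(A sketch with placeholder names; swap in your own username, host and paths.)

```bash
# Recursively copy the local project directory to the cluster
scp -r ./my_project my_id@domain:/directory
```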
You should make sure that `my_id@domain:/directory` points to the correct directory in your profile on the cluster. As a side note, you can also transfer data back from the server to your local machine as follows:
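(Again with placeholders; `results` is a hypothetical output directory.)

```bash
# Recursively copy results from the cluster into the current local directory
scp -r my_id@domain:/directory/results ./
```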
And, there you go. Just submit the job using `sbatch` as follows:
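```bash
# Submit the job script to the scheduler
sbatch config.sh
```

SLURM answers with the id of the newly queued job, e.g. `Submitted batch job 123456`.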
Now, you should let SLURM allocate resources to your job and train your model. You can check your jobs as follows:
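```bash
# List your pending and running jobs (replace my_id with your username)
squeue -u my_id
```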
This is pretty much all!
A quick tip: when transferring large datasets, it helps to compress them first. To compress, use this:
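(Assuming your data lives in a directory called `datasets`.)

```bash
# Bundle the directory into a single compressed archive
tar -czvf datasets.tar.gz ./datasets
```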
and to decompress, use this:
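```bash
# Extract the archive in place
tar -xzvf datasets.tar.gz
```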
Before submitting your job, make sure that all your required packages are installed in a virtual environment. I personally use `venv` as follows:
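(A sketch; `requirements.txt` is a hypothetical list of your dependencies.)

```bash
# Create and activate a virtual environment, then install dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```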
and activate it within my SLURM configuration as follows:
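```bash
## In the SLURM script, before the training command
source venv/bin/activate
```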
If you want to run programs in batches, rather than creating multiple projects and setting up the parameters for each one individually, try creating one single script and passing arguments to work with different setups. That way, you can also use SLURM job arrays (the `--array` option of `sbatch`) to pass arguments per array task.
The body of your configuration would look like the following if you use a job array:
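A minimal sketch; the learning rates and the script name are hypothetical placeholders:

```bash
#!/bin/bash
## Run four array tasks, with indices 0 to 3
#SBATCH --array=0-3
## Separate log per task (%A: array job id, %a: task index)
#SBATCH --output=train-array-%A_%a.out
#SBATCH --time=04:00:00
#SBATCH --mem=16G
#SBATCH --gres=gpu:1

source venv/bin/activate

## Pick a learning rate based on the task index
LEARNING_RATES=(0.1 0.01 0.001 0.0001)
python train.py --data ./datasets --lr "${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}"
```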
This way, instead of submitting many jobs, you can submit only one and let SLURM iterate over the configurations using the array indices.
It’s a bit difficult to get used to SLURM, I must admit. However, once you learn how to use it, it’s truly enjoyable! There are many ways to make your SLURM jobs more efficient. Here are some useful links:
Last updated on 28 January 2023.