IGE clusters#
IGE computing servers are ige-calcul[1-4]: ige-calcul1, ige-calcul2, ige-calcul3, ige-calcul4.
You can replace calcul1 with calcul2, calcul3 or calcul4 throughout this documentation, depending on the cluster you choose.
Features of the clusters#
Server | ige-calcul1 | ige-calcul2 | ige-calcul3 | ige-calcul4 |
CPU | 2 CPU 24c (48c) | 2 CPU 18c (36c) | 2 CPU 28c (56c) | 2 CPU 16c (32c) |
Memory | 256 GB | 512 GB | 768 GB | 256 GB |
GPU | NVIDIA A40 (46G) | None | NVIDIA RTX 6000 (?) | 2 NVIDIA RTX A4500 (2x20G) |
Connection to the server#
Before using slurm, make sure that you are able to connect to the server:
ssh your_agalan_login@ige-calcul1.u-ga.fr
If you want to connect without using a password and from outside the lab, add these 4 lines to the file $HOME/.ssh/config (create it if you don’t have it):
Host calcul1
    ProxyCommand ssh -qX your_agalan_login@ige-ssh.u-ga.fr nc -w 60 ige-calcul1.u-ga.fr 22
    User your_agalan_login
    GatewayPorts yes
Then create your SSH keys and copy them to the server:
ssh-keygen -t rsa (press Enter twice without providing a password)
ssh-copy-id your_agalan_login@ige-ssh.u-ga.fr
ssh-copy-id calcul1
Now, you should be able to connect without any password:
ssh calcul1
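Since the same four lines apply to every server with only the number changed, the entries for all four clusters can be generated in one go. The following is a minimal sketch: the function name ige_ssh_config is hypothetical, and it only prints the entries so that you can review them before adding them to $HOME/.ssh/config.

```shell
# Print one ssh config entry per IGE server (calcul1 to calcul4).
# Hypothetical helper: it only writes to standard output.
ige_ssh_config() {
    login="$1"
    for n in 1 2 3 4; do
        printf 'Host calcul%s\n' "$n"
        printf '    ProxyCommand ssh -qX %s@ige-ssh.u-ga.fr nc -w 60 ige-calcul%s.u-ga.fr 22\n' "$login" "$n"
        printf '    User %s\n' "$login"
        printf '    GatewayPorts yes\n\n'
    done
}

ige_ssh_config your_agalan_login
```

Once the output looks right, it can be appended with ige_ssh_config your_agalan_login >> $HOME/.ssh/config, after which ssh-copy-id has to be run once per server.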
Then you should ask for a storage space and a slurm account.
Available slurm accounts are:
Please send an email to mondher.chekki@uXXXX-gYYYY-aZZZZ.fr to request storage under /workdir and a Slurm account, giving the name of your team and the space you need (1G, 10G, 100G, 1TB).
Available software#
- MATLAB (through modules, e.g. module load matlab)
Slurm: Submit a job on the cluster#
Slurm is the open-source workload manager and job scheduler used on the IGE clusters. It is the intermediary between the login nodes and the compute nodes.
The Slurm scheduler is therefore the gateway through which users on the login nodes submit jobs to the compute nodes for processing.
Command | Description |
sbatch | Submit a batch script to Slurm for processing. |
squeue | Show information about jobs in the queue. When run without the -u flag, it lists your job(s) and all other jobs in the queue. |
srun | Run jobs interactively on the cluster. |
srun | Run MPI jobs on the cluster. |
scancel | End or cancel a queued job. |
sacct | Show information about current and previous jobs (see the Job Accounting section for an example). |
scontrol | Show more details about a running job. |
sinfo | Get information about the resources on the available nodes that make up the HPC cluster. |
seff | Provide statistics on the efficiency of resource usage by a completed job. |
Available working directories#
There are 2 working directories available on the clusters
/workdir (only on ige-calcul1 and ige-calcul4)
/workdir2 (available on all clusters: SUMMER STORAGE)
Running jupyter notebooks on the clusters#
Please refer to the document "How to run jupyter notebooks on Ige clusters".
Running python code on the clusters#
We recommend that you use micromamba instead of conda/miniconda.
Micromamba is simply faster than conda.
Check how to set up your Python environment with micromamba.
Job submission example#
Suppose you have a script in a programming language such as Python, MATLAB, C, Fortran or Java. How would you execute it using Slurm?
The following section explains, step by step, how to create and submit a simple job. The SBATCH script it builds can run either a Python script or Fortran code.
Prepare your data/code/script
Copy your files to the server with rsync:
rsync -rav YOUR_DIRECTORY calcul1:/workdir/your_slurm_account/your_agalan_login/
Then write your python script or compile your fortran code.
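Example of an MPI Hello World program, hello_mpi.f90:
PROGRAM hello_world_mpi
  implicit none
  include 'mpif.h'
  integer :: process_Rank, size_Of_Cluster, ierror, resultlen
  character(len=MPI_MAX_PROCESSOR_NAME) :: pname
  ! Start MPI, then query the number of processes and the rank of this one
  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)
  ! Get the name of the node this process runs on
  call MPI_GET_PROCESSOR_NAME(pname, resultlen, ierror)
  print *, 'Hello World from process: ', process_Rank, 'of ', size_Of_Cluster
  print *, 'Hello World from node: ', pname(1:resultlen)
  call MPI_FINALIZE(ierror)
END PROGRAM hello_world_mpi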
Compile the code using mpif90:
mpif90 -o hello_mpi hello_mpi.f90
Now you have an executable hello_mpi that you can run using slurm.
Create your submission job
A job consists of two parts: resource requests and job steps.
Resource requests specify the number of CPUs, the expected computing duration, the amount of RAM or disk space, etc.
Job steps describe the tasks that must be done and the software that must be run.
The typical way of creating a job is to write a submission script. A submission script is a shell script whose comments, when prefixed with SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch) or with sbatch -h.
In this example, job.sh contains the resource requests (lines starting with #SBATCH) and the command that runs the executable generated previously.
#!/bin/bash
#SBATCH -J helloMPI
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --account=cryodyn
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output helloMPI.%j.output
#SBATCH --error helloMPI.%j.error
cd /workdir/$USER/
## Run an MPI program
srun --mpi=pmix -N 1 -n 4 ./hello_mpi
## Run a python script
# python script.py
This script requests 4 cores for 1 hour, along with 4000 MB of RAM, in the default queue.
The account is important because it is used to compute statistics on the number of CPU hours consumed within the account: make sure you belong to an account before submitting any job.
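As a rough guide to what a request can cost the account, a job that runs up to its time limit consumes the number of cores multiplied by that limit. The sketch below does this arithmetic; cpu_hours is a hypothetical helper, not a Slurm command, and it assumes a time limit written as HH:MM:SS.

```shell
# Upper bound on the CPU hours a job can consume:
# number of cores multiplied by the time limit (HH:MM:SS).
cpu_hours() {
    echo "$2" | awk -F: -v cores="$1" \
        '{ printf "%.2f\n", cores * ($1 * 3600 + $2 * 60 + $3) / 3600 }'
}

# Values from job.sh: --ntasks=4 and --time=01:00:00
cpu_hours 4 01:00:00
```

For the example script this prints 4.00, that is, at most 4 CPU hours. A job that finishes early is presumably accounted for its actual run time rather than the limit.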
When started, the job runs the hello_mpi program using 4 cores in parallel. To run the job.sh script, use the sbatch command, then squeue to see the state of the job:
chekkim@ige-calcul1:~$ sbatch job.sh
Submitted batch job 51
chekkim@ige-calcul1:~$ squeue
51 calcul helloMPI chekkim R 0:02 1 ige-calcul1
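In scripts it is often handy to capture the job number that sbatch prints. The sketch below extracts it from the message shown above; job_id_from_sbatch is a hypothetical helper, and the sample line is fed in with echo so that the snippet also runs on a machine without Slurm.

```shell
# Read sbatch output on standard input and print only the job number.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration with the message from the example above.
jobid=$(echo "Submitted batch job 51" | job_id_from_sbatch)
echo "$jobid"
```

On the cluster the same function would be used as jobid=$(sbatch job.sh | job_id_from_sbatch), followed for instance by squeue -j "$jobid". The sbatch option --parsable is an alternative that makes sbatch print the job number without the surrounding text.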
GPU support#
To use GPUs in a job, add the following line to your submission file:
#SBATCH --gres=gpu:1
or, for 2 GPUs:
#SBATCH --gres=gpu:2
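To confirm from inside a job how many GPUs were actually granted, one option is to count the entries of CUDA_VISIBLE_DEVICES. This is a sketch under an assumption: Slurm commonly sets that variable for jobs submitted with --gres=gpu, but this depends on the cluster configuration. The name gpu_count is a hypothetical helper.

```shell
# Count the GPU indices listed in CUDA_VISIBLE_DEVICES (e.g. "0,1" gives 2).
# Prints 0 when the variable is unset or empty.
gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
    fi
}

echo "GPUs visible to this job: $(gpu_count)"
```

Placed in the submission file before the main program, this line leaves a trace in the job output that can be checked against the --gres request.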
Use the interactive mode#
For interactive mode you should use the srun or salloc commands.
The most common way is to use srun followed by --pty bash -i. Then you can run any program you need.
srun --nodes=1 --ntasks=4 --mem=40000 --time=01:00:00 --pty bash -i
To use a GPU, add the --gres=gpu option:
srun --nodes=1 --gres=gpu:1 --ntasks=4 --mem=40000 --time=01:00:00 --pty bash -i
or to use 2 gpus
srun --nodes=1 --gres=gpu:2 --ntasks=4 --mem=40000 --time=01:00:00 --pty bash -i
If you use srun followed by your program (without running the previous command), it will allocate the resources, run the program and exit.
An equivalent to job.sh would be:
Run the MPI hello example with 4 cores
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 hello_mpi
==> This will run and exit once it is done
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 --pty bash -i
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 hello_mpi
==> This will keep the resources even when the program is done
Run QGIS with 8 threads (graphical interface)
srun --mpi=pmix -n 1 -c 8 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 qgis
Run Jupyter notebook with 4 threads
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 jupyter notebook
Run matlab with 4 threads
module load matlab/R2022b
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop -r "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop -batch "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop < test.m
Example of job_matlab.sh:
#!/bin/bash
#SBATCH -J matlab
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --account=cryodyn
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output matlab.%j.output
#SBATCH --error matlab.%j.error
cd /workdir/$USER/
## Run MATLAB (keep only one of the three srun lines uncommented)
module load matlab/R2022b
srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop -r "MATLAB_command"
# or
# srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop -batch "MATLAB_command"
# or
# srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop < test.m
How to check the available resources#
If your job is pending, you need to wait for the resources or adapt your submission file. You can get the available memory and CPUs on the cluster with the sinfo command:
chekkim@ige-calcul3:~$ sinfo -o "%20N %15c %15C %10m %20e %30G " | awk -F "/" '{print $1, $2, $4}'
NODELIST | CPUS | CPUS(A I T) | MEMORY | FREE_MEM | GRES |
ige-calcul3 | 112 | 4 108 112 | 740000 | 525031 | (null) |
A: allocated (used), I: idle (free), T: total
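The %C field of sinfo is reported as allocated/idle/other/total, and the awk filter above keeps only three of these numbers. As a sketch, the raw field can also be turned into labelled values. The name cpu_states is a hypothetical helper, and the sample value 4/108/0/112 assumes that no CPU was in the "other" state when the output above was produced.

```shell
# Split a sinfo %C value (allocated/idle/other/total) into labelled fields.
cpu_states() {
    echo "$1" | awk -F/ '{ printf "allocated=%d idle=%d other=%d total=%d\n", $1, $2, $3, $4 }'
}

cpu_states 4/108/0/112
```

On a cluster this could be fed directly from sinfo, for example cpu_states "$(sinfo -h -o "%C")", where -h suppresses the header line.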
If your job is slow, check the CPU_LOAD and make sure it is close to the number of allocated CPUs. In this example, the CPU_LOAD is 101 while 0 CPUs are allocated, which means that some programs are running in the background without using Slurm. This is not normal.
chekkim@ige-calcul3:~$ sinfo -o "%20N %10O| %10c %15C %10m %20e %30G " | awk -F "/" '{print $1, $2, $4}'
NODELIST | CPU_LOAD | CPUS | CPUS(A I T) | MEMORY | FREE_MEM | GRES |
ige-calcul3 | 101.33 | 112 | 0 112 112 | 740000 | 539958 | (null) |
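The comparison described above can be automated. The sketch below flags a node whose CPU load is well above the number of CPUs allocated through Slurm; check_load is a hypothetical helper, and the default margin of 2 is an arbitrary assumption.

```shell
# Compare the CPU load (sinfo %O) with the CPUs allocated by Slurm.
# Prints "suspicious" when the load exceeds the allocation by more than
# the margin (third argument, default 2), and "ok" otherwise.
check_load() {
    awk -v load="$1" -v alloc="$2" -v margin="${3:-2}" \
        'BEGIN { if (load > alloc + margin) print "suspicious"; else print "ok" }'
}

# Values from the example above: CPU_LOAD 101.33 with 0 CPUs allocated
check_load 101.33 0
```

With the values of the example the function prints suspicious, matching the diagnosis that programs are running outside Slurm.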
Code efficiency on the cluster#
Once the job is finished, you can get direct statistics using the seff command (for more statistics, refer to the Job Accounting section below). Here is the output for another, more memory-consuming job, for which we requested 20000 MB (about 19.53 GB):
[ige-calcul1 /home/chekkim ]$ seff 501759
Job ID: 501759
Cluster: ige-calcul1
User/Group: chekkim/ige-cryodyn
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:01:16
CPU Efficiency: 49.35% of 00:02:34 core-walltime
Job Wall-clock time: 00:01:17
Memory Utilized: 16.00 GB
Memory Efficiency: 81.94% of 19.53 GB (19.53 GB/node)
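The CPU efficiency reported by seff is the CPU time used divided by the core-walltime, that is, the number of cores multiplied by the wall-clock time. The sketch below reproduces the figure from the output above; to_seconds and cpu_efficiency are hypothetical helpers that expect times written as HH:MM:SS.

```shell
# Convert HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# CPU efficiency in percent: CPU time / (cores * wall-clock time).
cpu_efficiency() {
    awk -v cpu="$(to_seconds "$1")" -v wall="$(to_seconds "$2")" -v cores="$3" \
        'BEGIN { printf "%.2f\n", 100 * cpu / (cores * wall) }'
}

# Values from the seff output: 00:01:16 used, 00:01:17 elapsed, 2 cores
cpu_efficiency 00:01:16 00:01:17 2
```

This prints 49.35, the value shown by seff (76 s of CPU time over 2 x 77 s of core-walltime). An efficiency close to 50% on 2 cores suggests that only one core was busy most of the time, so a single core would probably have been enough for this job.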
If you request less memory, say 10G in this case, you will get the following message:
srun: error: ige-calcul1: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=501757.0
slurmstepd-ige-calcul1: error: Detected 1 oom-kill event(s) in StepId=501757.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
Job Accounting#
You can get near-realtime information about your running program (memory consumption, etc.) with the sstat command:
sstat -j JOBID
It is possible to get information and statistics about your jobs after they have finished using the sacct or sreport commands (run sacct -e to list the available fields):
chekkim@ige-calcul1:~$ sacct -j 51 --format="Account,JobID,JobName,NodeList,CPUTime,MaxRSS,State%20"
Account JobID JobName NodeList CPUTime MaxRSS State
---------- ------------ ---------- --------------- ---------- ---------- --------------------
cryodyn 51 helloMPI ige-calcul1 00:00:20 COMPLETED
cryodyn 51.batch batch ige-calcul1 00:00:20 132K COMPLETED
cryodyn 51.0 hello_mpi ige-calcul1 00:00:12 3564K COMPLETED
MaxRSS: maximum RAM used by the job. You can also get the maximum RAM used by a given task.
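sacct prints MaxRSS with a unit suffix, K in the example above, whereas the --mem request is given in MB. A conversion along the following lines may help when comparing the two; rss_to_mb is a hypothetical helper and assumes that the suffixes K, M and G are powers of 1024.

```shell
# Convert a MaxRSS value such as 3564K, 12M or 2G to megabytes.
rss_to_mb() {
    echo "$1" | awk '{
        value = $1 + 0                       # numeric part, suffix ignored
        unit = substr($1, length($1), 1)     # last character of the field
        if (unit == "K") { value = value / 1024 }
        else if (unit == "G") { value = value * 1024 }
        printf "%.2f\n", value
    }'
}

# Step 51.0 of the example above
rss_to_mb 3564K
```

For step 51.0 this gives 3.48 MB, far below the 4000 MB requested in job.sh, which indicates that the memory request could be reduced for this program.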