IGE clusters#
The IGE computing servers are ige-calcul1, ige-calcul2, ige-calcul3, and ige-calcul4.
You can replace calcul1 with calcul2, calcul3, or calcul4 in the documentation below, depending on which server you use.
Slurm#
Slurm is the open-source workload manager/scheduler used on the IGE computing servers. It acts as the intermediary between the login nodes and the compute nodes: users on the login nodes submit their work/jobs to the Slurm scheduler, which dispatches them to the compute nodes for processing.
Connection to the server#
Before using Slurm, make sure that you are able to connect to the server:
ssh your_agalan_login@ige-calcul1.u-ga.fr
If you want to connect without using a password and from outside the lab, add these 4 lines to the file $HOME/.ssh/config (create it if you don’t have it):
Host calcul1
ProxyCommand ssh -qX your_agalan_login@ige-ssh.u-ga.fr nc -w 60 ige-calcul1.u-ga.fr 22
User your_agalan_login
GatewayPorts yes
Then create your SSH keys and copy them to the servers:
ssh-keygen -t rsa (press Enter twice to leave the passphrase empty)
ssh-copy-id your_agalan_login@ige-ssh.u-ga.fr
ssh-copy-id calcul1
Now, you should be able to connect without any password:
ssh calcul1
Then you should ask for a storage space and a slurm account.
Available slurm accounts are:
cryodyn
meom
phyrev
hmcis
hydrimz
c2h
ecrins
ice3
chianti
Please send an email to mondher.chekki@uXXXX-gYYYY-aZZZZ.fr or ige-support@uXXXX-gYYYY-aZZZZ.fr, asking for storage under /workdir and a Slurm account, and provide the name of your team and the space you need (1G, 10G, 100G, 1TB).
Available software#
- NCO
- CDO
- FERRET
- NCVIEW
- QGIS
- MATLAB (through modules, e.g. module load matlab)
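Software provided through environment modules can be listed and loaded as follows (a minimal sketch; the exact module names and versions may differ on your server, check with module avail):
module avail              # list the modules available on the server
module load matlab/R2022b # load a specific version
module list               # show the currently loaded modules
module unload matlab      # unload the module when you are done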
Commands#
| Command | Description |
|---|---|
| sbatch | Submit a batch script to Slurm for processing. |
| squeue | Show information about jobs in the queue. With the -u flag it lists only your jobs; without it, it shows your jobs and all other jobs in the queue. |
| srun | Run jobs interactively on the cluster. |
| srun | Run MPI jobs on the cluster. |
| scancel | End or cancel a queued job. |
| sacct | Show information about current and previous jobs (see Job Accounting below for an example). |
| scontrol | Show more details about a running job. |
| sinfo | Get information about the resources on the available nodes that make up the HPC cluster. |
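For example, with a job ID of 51 (as in the submission example below), these commands can be used as follows (a sketch; the exact output will differ):
sinfo                 # resources and state of the available nodes
squeue -u $USER       # only your jobs in the queue
scontrol show job 51  # detailed information about job 51
scancel 51            # cancel job 51
sacct -j 51           # accounting information once the job has started or finished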
Job submission example#
Suppose you have a script in a programming language such as Python, MATLAB, C, Fortran, or Java. How would you execute it using Slurm?
The following section explains, step by step, how to create and submit a simple job. An SBATCH script is created and used to execute a Python script or Fortran code.
Prepare your data/code/script
Copy your files to the server with rsync:
rsync -rav YOUR_DIRECTORY calcul1:/workdir/your_slurm_account/your_agalan_login/
Then write your Python script or compile your Fortran code.
Example of an MPI Hello World program, hello_mpi.f90:
PROGRAM hello_world_mpi
include 'mpif.h'
integer process_Rank, size_Of_Cluster, ierror, tag
CHARACTER(256) PNAME
INTEGER RESULTLEN
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)
call MPI_GET_PROCESSOR_NAME( PNAME, RESULTLEN, ierror)
print *, 'Hello World from process: ', process_Rank, 'of ', size_Of_Cluster
print *, 'Hello World from process: ', PNAME(1:RESULTLEN)
call MPI_FINALIZE(ierror)
END PROGRAM
Compile the code using mpif90:
mpif90 -o hello_mpi hello_mpi.f90
Now you have an executable hello_mpi that you can run using slurm.
Create your submission job
A job consists of two parts: resource requests and job steps.
Resource requests specify the number of CPUs, the expected computing duration, the amount of RAM or disk space required, etc.
Job steps describe the tasks that must be done and the software that must be run.
The typical way of creating a job is to write a submission script. A submission script is a shell script whose comment lines prefixed with #SBATCH are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch or sbatch -h).
In this example, job.sh contains the resource requests (lines starting with #SBATCH) and the command that runs the previously generated executable.
#!/bin/bash
#SBATCH -J helloMPI
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --account=cryodyn
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output helloMPI.%j.output
#SBATCH --error helloMPI.%j.error
cd /workdir/$USER/
## Run an MPI program
srun --mpi=pmix -N 1 -n 4 ./hello_mpi
## Run a python script
# python script.py
job.sh requests 4 cores for 1 hour, along with 4000 MB of RAM, in the default queue.
The account is important in order to get statistics about the number of CPU hours consumed within each account: make sure you are part of an account before submitting any jobs; you can check your account membership as shown below.
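To check which Slurm accounts you are associated with, you can query the Slurm accounting database (a sketch using the standard sacctmgr command; the output depends on your associations):
sacctmgr show associations user=$USER format=Account,User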
When started, the job runs the hello_mpi program using 4 cores in parallel. Submit the job.sh script with the sbatch command and use squeue to see the state of the job:
chekkim@ige-calcul1:~$ sbatch job.sh
Submitted batch job 51
chekkim@ige-calcul1:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51 calcul helloMPI chekkim R 0:02 1 ige-calcul1
Interactive mode
For interactive mode, use the srun/salloc commands.
Either you request the resources with srun followed by --pty bash -i, which gives you an interactive shell in which you can run any program you need,
or you pass your program directly to srun, which allocates the resources, runs the program, and exits.
An equivalent to job.sh would be:
Run the MPI hello example with 4 cores:
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 hello_mpi
==> This runs and exits once it is done
or
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 --pty bash -i
srun --mpi=pmix -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 hello_mpi
==> This keeps the resources allocated even after the program is done
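salloc can be used in a similar way: it reserves the resources and opens a shell from which you launch job steps with srun (a sketch equivalent to the examples above; adjust the account and resources to your case):
salloc -n 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00
srun --mpi=pmix -n 4 ./hello_mpi   # runs inside the allocation
exit                               # release the allocation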
Run QGIS with 8 threads (graphical interface):
srun --mpi=pmix -n 1 -c 8 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 qgis
Run a Jupyter notebook with 4 threads:
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 jupyter notebook
Run MATLAB with 4 threads:
module load matlab/R2022b
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop -r "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop -batch "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 --account=cryodyn --mem=4000 --time=01:00:00 matlab -nodisplay -nosplash -nodesktop < test.m
Example of job_matlab.sh:
#!/bin/bash
#SBATCH -J matlab
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --account=cryodyn
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output matlab.%j.output
#SBATCH --error matlab.%j.error
cd /workdir/$USER/
## Run on Matlab
module load matlab/R2022b
srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop -r "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop -batch "MATLAB_command"
# or
srun --mpi=pmix -n 1 -c 4 -N 1 matlab -nodisplay -nosplash -nodesktop < test.m
For Python users
We recommend using micromamba instead of conda/miniconda: micromamba is simply faster than conda.
Check here how to set up your Python environment with micromamba.
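As a minimal sketch (the environment name and packages below are only examples, and this assumes micromamba is already initialized in your shell as described in the setup page), you can create an environment and use it in your jobs like this:
micromamba create -n myenv python=3.11 numpy   # hypothetical environment and packages
micromamba activate myenv
srun -n 1 --account=cryodyn --mem=4000 --time=01:00:00 python script.py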
Job Accounting
You can get near-real-time information about your running job (memory consumption, etc.) with the sstat command:
sstat -j JOBID
You can get information and statistics about your jobs after they finish using the sacct/sreport commands (sacct -e lists the available output fields):
chekkim@ige-calcul1:~$ sacct -j 51 --format="Account,JobID,JobName,NodeList,CPUTime,MaxRSS,State%20"
Account JobID JobName NodeList CPUTime MaxRSS State
---------- ------------ ---------- --------------- ---------- ---------- --------------------
cryodyn 51 helloMPI ige-calcul1 00:00:20 COMPLETED
cryodyn 51.batch batch ige-calcul1 00:00:20 132K COMPLETED
cryodyn 51.0 hello_mpi ige-calcul1 00:00:12 3564K COMPLETED
MaxRSS: maximum RAM used by the job; you can also get the maximum RAM used by a given task (job step).
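sreport can also summarize the CPU hours consumed per account over a period, for example (a sketch; adjust the account and the dates to your case):
sreport cluster AccountUtilizationByUser account=cryodyn start=2024-01-01 end=2024-02-01 -t hours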