Bigfoot#

Bigfoot is the name of the GRICAD GPU cluster.

Several types of nodes are available:

1. Nodes with 4 NVIDIA Tesla V100 GPUs (NVLink)
2. Nodes with 2 NVIDIA A100 GPUs
3. A Grace Hopper node with 1 GH200 chip (ARM CPU plus GPU with HBM3e memory)

Here is a complete list:

chekkim@bigfoot:~$ recap.py
======================================================================================
|   node    |  cpumodel       | gpumodel  | gpus | cpus | cores| mem | gpumem  | MIG |
======================================================================================
|bigfoot1   | intel  Gold 6130| V100      |   4  |   2  |   32 | 192 |    32   |  NO |
|    [ + 1  more node(s) ]                                                           |
|bigfoot3   | intel  Gold 6130| V100      |   4  |   2  |   32 | 192 |    32   |  NO |
|bigfoot4   | intel Gold 5218R| V100      |   4  |   2  |   40 | 192 |    32   |  NO |
|    [ + 1  more node(s) ]                                                           |
|bigfoot6   | intel Gold 5218R| V100      |   4  |   2  |   40 | 192 |    32   |  NO |
|bigfoot7   | amd    EPYC 7452| A100      |   2  |   2  |   64 | 192 |    40   | YES |
|bigfoot8   | intel Gold 5218R| V100      |   4  |   2  |   40 | 192 |    32   |  NO |
|bigfoot9   | amd    EPYC 7452| A100      |   2  |   2  |   64 | 192 |    40   |  NO |
|    [ + 2  more node(s) ]                                                           |
|bigfoot12  | amd    EPYC 7452| A100      |   2  |   2  |   64 | 192 |    40   |  NO |
|bigfoot-gh1| arm        ARMv9| GH200     |   1  |   1  |   72 | 480 |   480   |  NO |
======================================================================================
# of GPUs: 10 A100, 28 V100, 1 GH200

To connect, refer to the dahu page to create your SSH keys; the connection procedure is the same, with the following differences:

In the file $HOME/.ssh/config:

Host bigfoot
    ProxyCommand ssh -qX login_gricad@trinity.u-ga.fr nc bigfoot.u-ga.fr 22
    User login_gricad
    GatewayPorts yes

Warning

Replace login_gricad with your own GRICAD login.

Next, set the correct permissions on the config file:

chmod ugo-rwx .ssh/config 
chmod u+rw .ssh/config

Warning

Keep read/write permissions for the user only.
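
As a quick check, you can list the file; the mode should be -rw------- (read/write for the user only):

ls -l $HOME/.ssh/config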

Then, copy the SSH keys:

ssh-copy-id login_gricad@trinity.u-ga.fr 

Enter your Agalan password.

Then:

ssh-copy-id bigfoot

Enter your Agalan password again.

You should then be able to log in without a password in future sessions.
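
To check that everything works, you can open and close a connection from your workstation; no password should be requested:

ssh bigfoot
exit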

Submit a job#

You should use OAR to submit your job, as described for the dahu cluster.

The differences compared to the dahu cluster are the following:

Replace cores with gpu in the resource request (at most 4 GPUs per node, on the V100 nodes):

#OAR -l nodes=1/gpu=1,walltime=00:10:00

Ask for the GPU model you need (A100 or V100):

#OAR -p gpumodel='A100'
or
#OAR -p gpumodel='V100'
or, to accept either of them:
#OAR -p gpumodel='A100' or gpumodel='V100'

Note

There is only one GH200 node and it is experimental for now. To use it, add the following instead of a gpumodel constraint:

#OAR -t gh
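
Putting these directives together, a minimal submission script could look like the sketch below; my-project, my_script.sh and my_gpu_program are placeholders to adapt:

#!/bin/bash
#OAR -l /nodes=1/gpu=1,walltime=00:10:00
#OAR -p gpumodel='V100'
#OAR --project my-project

# the lines below run on the allocated GPU node
nvidia-smi                 # check which GPU was allocated
./my_gpu_program           # placeholder: your actual program

Submit it with oarsub -S ./my_script.sh so that the #OAR directives in the script are taken into account.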

Make a reservation#

Reservations are mainly useful to avoid waiting in the queue, or to book resources for a training session.

In this example, the reservation is made for the Grace Hopper GH200 node (only one is available):

chekkim@bigfoot:~$ oarsub -r '2025-01-20 10:40:00' -t container=testres -l /nodes=1/gpu=1,walltime=1:00:00  --project sno-elmerice -t gh
[ADMISSION RULE] Adding gpubrand=nvidia constraint by default
[ADMISSION RULE] Use -t amd if you want to use AMD GPUs
[ADMISSION RULE] Adding Grace Hopper constraint
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=2870075
Reservation mode: waiting validation...
Reservation valid --> OK

If you need to reserve more GPUs at once, use the A100 or V100 models and share them between users:

chekkim@bigfoot:~$ oarsub -r '2025-01-26 08:00:00' -t container=testres -l /nodes=1/gpu=2,walltime=1:00:00  --project sno-elmerice -p "gpumodel='A100' or gpumodel='V100'"

Now check the status of the reservation; if its state is R (running), you are good to go:

chekkim@bigfoot:~$ oarstat -u
Job id    S User     Duration   System message
--------- - -------- ---------- ------------------------------------------------
2870075   R chekkim     0:07:45 R=72,W=0:59:53,J=R,P=sno-elmerice,T=container=testres|gh
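
If you no longer need a reservation, you can cancel it with the standard OAR command oardel, giving the job id shown by oarstat:

oardel 2870075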

Use the reservation#

Now connect directly to the node and start using the resources:

chekkim@bigfoot:~$ oarsub -I -t inner=2870075   -l /gpu=1,walltime=00:05:00 --project sno-elmerice -t gh
[ADMISSION RULE] Adding gpubrand=nvidia constraint by default
[ADMISSION RULE] Use -t amd if you want to use AMD GPUs
[ADMISSION RULE] Adding Grace Hopper constraint
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=2870114
Interactive mode: waiting...
Starting...
Connect to OAR job 2870114 via the node bigfoot-gh1
oarsh: Using systemd
chekkim@bigfoot-gh1:~$
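
Once logged in on the node, you can for example check that the GPU is visible before launching your own code:

nvidia-smi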