Contact

[Portrait: Dr. Henrik Schulz. Photo: HZDR / Oliver Killig]

Dr. Henrik Schulz

Head of IT Infrastructure
h.schulz@hzdr.de
Phone: +49 351 260 3268

Access to the clusters

  • Access to the HPC resources at HZDR is restricted and has to be granted by the IT Infrastructure division.
  • The hemera cluster is accessible within the HZDR-LAN through the login nodes hemera4 or hemera5.
  • Batch jobs on hemera have to be submitted using SLURM commands (see the example batch script after this list).
  • The disk space on the cluster is sufficient only for running jobs. It is strongly recommended to store data on the GSS file server (/bigdata).
  • Information on the state of the queues, on submitted jobs and on the state of the compute nodes is provided by the X11 client sview.
  • Running jobs that consume large amounts of resources on the login nodes is not allowed. Graphical analysis and interactive programs can be started via interactive jobs on the compute nodes.
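
A minimal batch script might look like the sketch below. The partition, resource and path values are illustrative examples taken from the overviews further down, not site defaults, and your project may require additional settings (for example an account):

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --partition=defq        # choose a partition from the queues overview below
    #SBATCH --nodes=1
    #SBATCH --ntasks=40             # a csk node has 40 cores
    #SBATCH --time=24:00:00         # must stay below the partition's walltime limit
    #SBATCH --output=slurm-%j.out

    # Keep large input and output data on the GSS file server (/bigdata),
    # as recommended above; the path below is a placeholder.
    cd /bigdata/<your_group>/<your_project>

    srun ./my_application

The script is submitted with "sbatch job.sh" and can be monitored with "squeue -u $USER" or sview. For interactive work on a compute node (instead of the login nodes), a session can be requested along these lines:

    srun --partition=defq --ntasks=1 --cpus-per-task=4 --time=02:00:00 --pty bash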

Configuration of the HPC clusters at the HZDR

hemera

Nodes overview

Quantity | Type | Name | CPU Cores | CPU Type | Main Memory | GPUs per Node | GPU Type | GPU Memory per GPU
2 | head node | hemera1/hemera2 | 32 | Intel 16-Core Xeon 3.2 GHz | 256 GB | | |
2 | login and submit node | hemera4/hemera5 | 32 | Intel 16-Core Xeon 2.1 GHz | 256 GB | | |
90 | compute node | csk001 - csk068, csk077 - csk098 | 40 | Intel 20-Core Xeon 2.4 GHz | 384 GB | | |
8 | compute node | csk069 - csk076 | 40 | Intel 20-Core Xeon 2.4 GHz | 768 GB | | |
28 | compute node | cro001 - cro028 | 128 | AMD 64-Core Epyc 7702 2.0 GHz | 512 GB | | |
6 | compute node | cmi001 - cmi006 | 128 | AMD 64-Core Epyc 7713 2.0 GHz | 512 GB | | |
26 | compute node | cmi007 - cmi032 | 128 | AMD 64-Core Epyc 7713 2.0 GHz | 1024 GB | | |
26 | compute node | cge001 - cge026 | 192 | AMD 96-Core Epyc 9654 2.4 GHz | 1536 GB | | |
10 | GPU compute node | gp001 - gp010 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 4 | Nvidia Tesla P100 | 16 GB
32 | GPU compute node | gv001 - gv032 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 4 | Nvidia Tesla V100 | 32 GB
5 | GPU compute node | ga001 - ga005 | 64 | AMD 32-Core Epyc 7282 2.8 GHz | 512 GB | 4 | Nvidia Tesla A100 | 40 GB
4 | GPU compute node | ga006 - ga009 | 32 | AMD 16-Core Epyc 7302 3.0 GHz | 1024 GB | 8 | Nvidia Tesla A100 | 40 GB
6 | GPU compute node | ga010 - ga015 | 128 | AMD 64-Core Epyc 7763 2.4 GHz | 4096 GB | 4 | Nvidia Tesla A100 | 80 GB
1 | GPU hotel | h001 | 24 | Intel 12-Core Xeon 3.0 GHz | 96 GB | max. 4 | various |
1 | FPGA compute node | h002 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 2 | Xilinx Alveo U200 |
4 | compute node | intel015 - intel018 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | |
20 | compute node | intel019 - intel038 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | |
11 | compute node | fluid021 - fluid031 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | |
10 | compute node | ion027 - ion036 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | |
1 | compute node | ion039 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | |
12 | compute node | fluid033 - fluid044 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | |
2 | compute node | chem001 - chem002 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | |
7 | compute node | reac007 - reac013 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | |
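
The current node and partition layout can also be queried directly on the cluster. The calls below are a sketch using standard SLURM commands and format specifiers (partition, node count, CPUs per node, memory in MB, generic resources such as GPUs, and node state); the node name is an example taken from the table above:

    sinfo -o "%20P %6D %5c %8m %20G %10T"
    scontrol show node csk001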

Queues overview

Partition * | Walltime (max) | Nodes | Reservation / Access | max jobs per user | max CPUs per user | Start Priority
defq | 96:00:00 | csk001-csk068, csk077-csk098 | free | 128 ** | 960 ** |
mem768 | 96:00:00 | csk069-csk076 | free | 128 ** | 960 ** |
rome | 96:00:00 | cro001-cro028 | free | 128 ** | 960 ** |
reac2 | 96:00:00 | cmi001-cmi012 | FWOR | | 1536 |
milan | 96:00:00 | cmi013-cmi032 | free | 128 ** | 960 ** |
genoa | 96:00:00 | cge001-cge002 | free | 128 ** | 960 ** |
casus_genoa | 96:00:00 | cge003-cge026 | FWU | 128 ** | 960 ** |
gpu_p100 | 48:00:00 | gp001-gp010 | free | | 32 GPUs |
gpu_v100 | 48:00:00 | gv025 | free | | 4 GPUs |
hotel | 48:00:00 | h001 | on request | | |
fpga | 48:00:00 | h002 | FWC | | |
intel, intel_32 | 96:00:00 | intel015-intel038, fluid021-fluid044, ion027-ion036, chem001-chem002, reac007-reac013 | free | 128 ** | 960 ** |
casus | 48:00:00 | gv001-gv021, gv023-gv024 | FWU | 23 | 92 GPUs |
fwkt_v100 | 24:00:00 | gv001-gv021, gv023-gv024 | FWKT | 23 | 92 GPUs |
fwkh_v100 | 24:00:00 | gv001-gv021, gv023-gv024 | FWKH | 23 | 92 GPUs |
hlab | 48:00:00 | gv026-gv032 | FWKT | 7 | 28 GPUs |
haicu_v100 | 48:00:00 | gv022 | FWCC | | 4 GPUs |
haicu_a100 | 48:00:00 | ga001-ga003 | FWCC | | 12 GPUs |
circ_a100 | 48:00:00 | ga006-ga009 | FWG | | 32 GPUs |
casus_a100 | 48:00:00 | ga010-ga015 | FWU | | 24 GPUs |

* For the partitions defq, intel, gpu, k20 and k80 there are corresponding partitions defq_low, intel_low, gpu_low, k20_low and k80_low, to which jobs with a longer walltime can be submitted. Jobs in these partitions will be cancelled when their resources are needed in the main partitions. Users are responsible for implementing checkpoint/restart functionality in their jobs.

** For the partitions defq, rome and intel, the stated limits on jobs per user and CPUs per user apply to the sum over these partitions, not to each partition separately.
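
As a sketch of how the partition limits above translate into a job request, a job for one of the GPU partitions could look like the following; the GRES name gpu and the resource values are illustrative and may need to be adapted to your project and access rights:

    #!/bin/bash
    #SBATCH --partition=gpu_v100
    #SBATCH --nodes=1
    #SBATCH --ntasks=6
    #SBATCH --gres=gpu:1            # request one of the node's V100 GPUs
    #SBATCH --time=12:00:00         # within the 48:00:00 limit of gpu_v100

    srun ./my_gpu_application

For the *_low partitions mentioned above, jobs should write checkpoints regularly so that a cancelled job can be resubmitted and continue from its last checkpoint.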

Installed software

All applications, compilers and libraries are accessible using the modules environment. The command "module avail" returns a list of the installed software.
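
A typical workflow, sketched here with illustrative module names (the actual names and versions differ and can be found via module avail), looks like this:

    module avail                  # list all installed applications, compilers and libraries
    module load gcc               # load a compiler module (name and version are examples)
    module list                   # show the currently loaded modules
    module purge                  # unload all modules, e.g. before switching toolchains

The same module load commands can also be placed in a batch script before the srun call so that the job runs with the intended software environment.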