Access to the clusters
- Access to the HPC resources at HZDR is restricted. It has to be granted by the IT Infrastructure division.
- The hemera cluster is accessible within the HZDR LAN through the login nodes hemera4 or hemera5.
- Batch jobs on hemera have to be submitted using the SLURM commands (see the examples after this list).
- The disk space on the cluster is sufficient only for running jobs. It is strongly recommended to store data on the gss-fileserver (/bigdata).
- Information on the states of the queues, on jobs submitted to the queues and on the states of the compute nodes is provided by the X11 client sview.
- Running jobs that consume large amounts of resources on the login nodes is not allowed. Graphical analysis and interactive programs can be started via interactive jobs on the compute nodes.
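The following is a minimal sketch of the usual workflow. The application my_program, the project directory under /bigdata and all resource values are placeholders; partition names and walltimes have to be adapted to the tables in the next section.

```bash
#!/bin/bash
# job.sh -- minimal batch script sketch; all values are placeholders to adapt
#SBATCH --job-name=example
#SBATCH --partition=defq             # see the queues overview below
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --time=24:00:00
#SBATCH --output=/bigdata/your_project/example_%j.out   # keep results on the gss-fileserver

srun ./my_program                    # placeholder application
```

Submitting, monitoring and starting an interactive session then look roughly like this:

```bash
sbatch job.sh                 # submit the batch job from a login node (hemera4/hemera5)
squeue -u "$USER"             # list your jobs and their states
sinfo -p defq                 # show the state of the nodes in a partition

# interactive job on a compute node, e.g. for graphical analysis
# (--x11 only works if X11 forwarding is enabled for your SSH session)
srun --partition=defq --nodes=1 --ntasks=4 --time=02:00:00 --x11 --pty bash
```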
Configuration of the HPC clusters at the HZDR
hemera
Nodes overview
Quantity | Type | Name | CPU Cores | CPU Type | Main Memory | GPUs per Node | GPU Type | GPU Memory per GPU |
---|---|---|---|---|---|---|---|---|
2 | head node | hemera1/hemera2 | 32 | Intel 16-Core Xeon 3.2 GHz | 256 GB | | | |
2 | login and submit node | hemera4/hemera5 | 32 | Intel 16-Core Xeon 2.1 GHz | 256 GB | | | |
90 | compute node | csk001 - csk068, csk077 - csk098 | 40 | Intel 20-Core Xeon 2.4 GHz | 384 GB | | | |
8 | compute node | csk069 - csk076 | 40 | Intel 20-Core Xeon 2.4 GHz | 768 GB | | | |
28 | compute node | cro001 - cro028 | 128 | AMD 64-Core Epyc 7702 2.0 GHz | 512 GB | | | |
6 | compute node | cmi001 - cmi006 | 128 | AMD 64-Core Epyc 7713 2.0 GHz | 512 GB | | | |
26 | compute node | cmi007 - cmi032 | 128 | AMD 64-Core Epyc 7713 2.0 GHz | 1024 GB | | | |
26 | compute node | cge001 - cge026 | 192 | AMD 96-Core Epyc 9654 2.4 GHz | 1536 GB | | | |
10 | GPU compute node | gp001 - gp010 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 4 | Nvidia Tesla P100 | 16 GB |
32 | GPU compute node | gv001 - gv032 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 4 | Nvidia Tesla V100 | 32 GB |
5 | GPU compute node | ga001 - ga005 | 64 | AMD 32-Core Epyc 7282 2.8 GHz | 512 GB | 4 | Nvidia Tesla A100 | 40 GB |
4 | GPU compute node | ga006 - ga009 | 32 | AMD 16-Core Epyc 7302 3.0 GHz | 1024 GB | 8 | Nvidia Tesla A100 | 40 GB |
6 | GPU compute node | ga010 - ga015 | 128 | AMD 64-Core Epyc 7763 2.4 GHz | 4096 GB | 4 | Nvidia Tesla A100 | 80 GB |
1 | GPU hotel | h001 | 24 | Intel 12-Core Xeon 3.0 GHz | 96 GB | max. 4 | various | |
1 | FPGA compute node | h002 | 24 | Intel 12-Core Xeon 3.0 GHz | 384 GB | 2 | Xilinx Alveo U200 | |
4 | compute node | intel015 - intel018 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | | |
20 | compute node | intel019 - intel038 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | | |
11 | compute node | fluid021 - fluid031 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | | |
10 | compute node | ion027 - ion036 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | | |
1 | compute node | ion039 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | | |
12 | compute node | fluid033 - fluid044 | 32 | Intel 16-Core Xeon 2.3 GHz | 128 GB | | | |
2 | compute node | chem001 - chem002 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | | |
7 | compute node | reac007 - reac013 | 32 | Intel 16-Core Xeon 2.3 GHz | 256 GB | | | |
Queues overview
Partition * | Walltime (max) | Nodes (Reservation) | Access | max jobs per user | max CPUs/GPUs per user | Start Priority |
---|---|---|---|---|---|---|
defq | 96:00:00 | csk001-csk068, csk077-csk098 | free | 128 ** | 960 ** | |
mem768 | 96:00:00 | csk069-csk076 | free | 128 ** | 960 ** | |
rome | 96:00:00 | cro001-cro028 | free | 128 ** | 960 ** | |
reac2 | 96:00:00 | cmi001-cmi012 | FWOR | | 1536 | |
milan | 96:00:00 | cmi013-cmi032 | free | 128 ** | 960 ** | |
genoa | 96:00:00 | cge001-cge002 | free | 128 ** | 960 ** | |
casus_genoa | 96:00:00 | cge003-cge026 | FWU | 128 ** | 960 ** | |
gpu_p100 | 48:00:00 | gp001-gp010 | free | | 32 GPUs | |
gpu_v100 | 48:00:00 | gv025 | free | | 4 GPUs | |
hotel | 48:00:00 | h001 | on request | | | |
fpga | 48:00:00 | h002 | FWC | | | |
intel,intel_32 | 96:00:00 | intel015-intel038, fluid021-fluid044, ion027-ion036, chem001-chem002, reac007-reac013 | free | 128 ** | 960 ** | |
casus | 48:00:00 | gv001-gv021, gv023-gv024 | FWU | 23 | 92 GPUs | |
fwkt_v100 | 24:00:00 | gv001-gv021, gv023-gv024 | FWKT | 23 | 92 GPUs | |
fwkh_v100 | 24:00:00 | gv001-gv021, gv023-gv024 | FWKH | 23 | 92 GPUs | |
hlab | 48:00:00 | gv026-gv032 | FWKT | 7 | 28 GPUs | |
haicu_v100 | 48:00:00 | gv022 | FWCC | | 4 GPUs | |
haicu_a100 | 48:00:00 | ga001-ga003 | FWCC | | 12 GPUs | |
circ_a100 | 48:00:00 | ga006-ga009 | FWG | | 32 GPUs | |
casus_a100 | 48:00:00 | ga010-ga015 | FWU | | 24 GPUs | |
* For the partitions defq, intel, gpu, k20 and k80 there are corresponding low-priority partitions (defq_low, intel_low, gpu_low, k20_low and k80_low) to which jobs with a longer walltime can be submitted. Jobs in these partitions will be cancelled when their resources are needed in the main partitions. It is the user's responsibility to implement checkpoint/restart functionality in such jobs (a sketch follows below).
** For the partitions defq, rome and intel, the stated limits on jobs per user and CPUs per user apply to all of these partitions combined, not to each partition individually.
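For jobs in the *_low partitions, a common pattern is to let SLURM send a warning signal shortly before the job ends and to write a checkpoint at that point. The script below is only a sketch under assumptions: defq_low, the paths and the application my_solver (assumed to checkpoint on SIGUSR1 and to resume from its own restart files) are placeholders, and how signals are delivered on preemption depends on the cluster's SLURM configuration.

```bash
#!/bin/bash
# Sketch of a checkpoint-aware job for a *_low partition.
# Partition, resources, paths and the application (my_solver) are placeholders.
#SBATCH --job-name=ckpt_example
#SBATCH --partition=defq_low
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --time=168:00:00
#SBATCH --requeue                      # let SLURM requeue the job after preemption
#SBATCH --signal=B:USR1@600            # SIGUSR1 to the batch shell ~10 min before the job ends
#SBATCH --output=/bigdata/your_project/logs/%x_%j.out

# Forward the warning signal to the application so it can write its own checkpoint.
forward_signal() {
    echo "SIGUSR1 received, notifying the application ..."
    kill -USR1 "$APP_PID" 2>/dev/null   # assumes the application checkpoints on SIGUSR1
}
trap forward_signal USR1

srun ./my_solver --resume-if-checkpoint-exists &   # placeholder application and option
APP_PID=$!
wait "$APP_PID"        # interrupted when the trap fires ...
wait "$APP_PID"        # ... wait again so the checkpoint can complete
```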