Linux-Cluster (rzcluster)

Access authorization: Students: no | Employees: yes | Faculties: yes | Student unions: no

rzcluster (2016)

For serial and moderately parallel computations, the University Computing Centre operates a separate Linux cluster (rzcluster). The system currently consists of 142 compute nodes with different processor architectures and a total of 1948 cores. The main memory per node varies between 32GB and 256GB. In addition to the Gigabit Ethernet interconnect, some of the compute nodes are equipped with an InfiniBand network for multi-node computations.


 

Information on the new BeeGFS file system is collected here.

 

Hardware

1 Front end (login node)

  • two 6-Core Intel Xeon X5650 processors (2.7GHz)
  • 48GB main memory

 

142 Compute nodes (batch nodes)

  • main memory: 32GB to 256GB
  • different processor architectures (AMD and Intel)
  • partly interconnected with an InfiniBand network (QDR)

 

Operating system

  • Scientific Linux 6.x

 

Access

User account

System access

  • On the rzcluster, computations should primarily be performed in batch mode; only short interactive tasks (such as compiling or testing programs and scripts) are allowed on the front end. If your computations require longer interactive run times and/or a lot of main memory, please contact us in advance.
  • Access to the front end rzcluster.rz.uni-kiel.de is possible only via an SSH-connection established within the internal network of Kiel University:

    $ ssh -X <username>@rzcluster.rz.uni-kiel.de

  • Please note that the preceding $-sign represents the command line prompt and is not part of the input!
  • The additional option -X (uppercase X) activates X11 forwarding.
  • Suitable SSH clients for a Windows PC are PuTTY, X-Win32 or MobaXterm; MobaXterm and WinSCP can be used for data transfer. A command-line transfer example is sketched below.
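From a Linux or macOS machine, data can also be transferred on the command line with scp. A minimal sketch with placeholder file names; the per-user work directory /work_beegfs/<username> is an assumption and may differ for your account:

    # copy a local file to the work directory on the cluster (placeholder paths)
    scp results.tar.gz <username>@rzcluster.rz.uni-kiel.de:/work_beegfs/<username>/
    # copy a file from the cluster back to the current local directory
    scp <username>@rzcluster.rz.uni-kiel.de:/work_beegfs/<username>/results.tar.gz .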

 

File systems

For distinct tasks, the rzcluster provides different file systems:

Home directory

  • Accessible via the environment variable $HOME
  • Global file system with daily data backup
  • Directory for scripts, programs and small amounts of results

 

Work directory

  • Global file system without data backup
  • Every user has a work directory. These directories are distributed over different file servers to achieve better performance. For new users, a work directory is created automatically on the parallel BeeGFS file system /work_beegfs.
  • Batch computations should be performed only in this directory.

 

Local disk space

  • Local disk space is available on each batch node via the environment variable $TMPDIR.
  • Particularly I/O-intensive computations should always be performed in this directory; see the sketch after this list.
  • Attention: The local disk space is only available during a running batch job, i.e., all data on the local disk will be removed automatically after job termination.
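A minimal job script sketch for using the local disk space, assuming a hypothetical program test.ex that reads input.dat and writes output.dat (adapt all names and resources to your application):

    #!/bin/bash
    #SBATCH --job-name=localdisk
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1000
    #SBATCH --time=01:00:00
    #SBATCH --partition=small

    # copy the input data from the submit (work) directory to the fast local disk
    cp input.dat $TMPDIR/
    cd $TMPDIR
    # run the computation on the local disk
    $SLURM_SUBMIT_DIR/test.ex
    # copy the results back before the job ends, since $TMPDIR is cleaned up automatically
    cp output.dat $SLURM_SUBMIT_DIR/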

 

Tape library

  • Files which are currently not in use should be transferred to the additional file system /nfs/tape_cache as archived data (for example as a tar file); see the sketch after this list.
  • Files under /nfs/tape_cache are automatically stored on tape after a while. Nevertheless, the data can be copied back to the home or work directory at any time.
  • Not suitable for storing many small files
  • Recommended size of an archive file: 3GB to 50GB (the maximum size of a single tar file should not exceed 1TB)
  • Data transfer to and from the tape library must not be performed with the rsync command
  • Attention: Slow access speed. Avoid working directly with files on the tape library; instead, copy files back to the work directory before further processing (including unpacking).
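A minimal archiving sketch; the directory names are placeholders, and whether a per-user subdirectory exists under /nfs/tape_cache is an assumption to be checked for your account:

    # pack a finished project directory into one tar file (recommended size 3-50GB)
    tar -czf project1_results.tar.gz project1_results/
    # copy the archive to the tape cache (do not use rsync here)
    cp project1_results.tar.gz /nfs/tape_cache/<username>/

    # later: copy the archive back to the work directory and unpack it there
    cp /nfs/tape_cache/<username>/project1_results.tar.gz .
    tar -xzf project1_results.tar.gz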

 

Software

Compiler

  • For the compilation of serial and parallel programs, several compilers are available on the rzcluster (see the compile sketch after this list):
    • GNU compilers: gfortran, gcc and g++
    • Intel compilers: ifort, icc and icpc (for initialization use the command module load intel16.0.0)
    • Portland compilers: pgf90, pgcc and pgCC (for initialization use the command module load pgi16.161)
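A brief compile sketch, assuming a hypothetical Fortran source file hello.f90; the module name intel16.0.0 is taken from the list above:

    # GNU Fortran (no module load is listed above)
    gfortran -O2 -o hello.ex hello.f90

    # Intel Fortran, after loading the corresponding compiler module
    module load intel16.0.0
    ifort -O2 -o hello.ex hello.f90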

 

MPI parallelization

  • For the development and execution of MPI-parallelized programs, the rzcluster provides the Intel MPI environment; a compile-and-run sketch follows this list.
  • For initialization use the command: module load intelmpi16.0.0
  • Compilers:
    • mpiifort, mpiicc and mpiicpc (based on the Intel compilers)
    • mpif90, mpigcc and mpigxx (based on the GNU compilers)
  • Multi-node computations:
    • For multi-node MPI computations, the involved batch nodes need to communicate without a password prompt. To achieve this, each user has to perform the following two steps once on the login node of the rzcluster:
      1. Create an SSH key pair with the command:
        ssh-keygen -t rsa  (confirm all prompts simply with Return)
      2. Copy the public key with the command:
        cp $HOME/.ssh/id_rsa.pub $HOME/.ssh/authorized_keys
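A compile-and-run sketch for an MPI program, assuming a hypothetical source file hello_mpi.c; the module name and compiler wrapper are taken from the list above, and the mpirun call belongs inside a batch job (see the templates below):

    # set up the Intel MPI environment and compile with the Intel wrapper
    module load intelmpi16.0.0
    mpiicc -O2 -o hello_mpi.ex hello_mpi.c

    # inside a batch job the program is then started with mpirun,
    # e.g. with 24 processes as in the multi-node template below
    mpirun -np 24 ./hello_mpi.ex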

 

Libraries

  • netcdf, hdf5, fftw, mkl, ...
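The exact module names of these libraries differ between versions and compilers; a quick way to look them up (netcdf serves only as an example, and the placeholder module name must be replaced by one of the listed results):

    # list all installed netcdf modules and load the matching one
    module avail netcdf
    module load <netcdf-module>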

 

User software

  • Turbomole, Python, R, SPSS, Matlab, ...
  • For licensing reasons, there are different Matlab versions for members of Kiel University and employees of GEOMAR. If you belong to the latter, please load the corresponding module with the suffix _geomar.

 

Module concept

  • Compilers, libraries, software and specific tools are provided via a system-wide module concept.
  • An overview of the installed programs can be obtained by entering the following command: module avail
  • Further commands for software usage:
    Command                Explanation
    module load <name>     Loads the module <name>, i.e., performs all settings which are required for using the program
    module unload <name>   Removes the module, i.e., resets all settings
    module list            Lists all modules which are currently loaded
    module show <name>     Displays the settings which are performed by the module
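A short example session combining these commands; intel16.0.0 serves as an example module taken from the compiler section above:

    module avail               # list all installed modules
    module load intel16.0.0    # prepare the environment for the Intel compilers
    module list                # verify which modules are currently loaded
    module show intel16.0.0    # inspect the settings performed by the module
    module unload intel16.0.0  # reset these settings again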

 

Batch processing

Since December 2016, the workload on the rzcluster has been managed by the batch system Slurm (http://slurm.schedmd.com/).

For information regarding the transition from PBSPro to Slurm, see here.

 

Batch classes

For interactive and regular batch computations, there currently exist the following batch classes on the rzcluster:

Batch class (Slurm partition)   Max. runtime (walltime)   Max. main memory   Max. cores per node   Node details
express                         3h                        32-48GB            8-12                  AMD and Intel-Westmere
small                           24h                       32-48GB            8-12                  AMD and Intel-Westmere
medium                          240h                      32-48GB            8-12                  AMD and Intel-Westmere
long                            480h                      32-48GB            8-12                  AMD and Intel-Westmere
test (max. 2 nodes per job)     30min                     32-256GB           4-48                  different node types
  • A user should always specify a walltime with the Slurm directive --time=[hh:mm:ss]. If no walltime is given, the maximum walltime of the batch class is set.
  • In addition, there are batch classes which require an extra validation (angus, fermion, fobigmem, focean, ikmb_a, msb, spin). For these, the default walltime is 48h.
  • To request a batch class, use the Slurm directive --partition=<partition>; for examples see below.
  • All AMD nodes are equipped with 8 cores and 32GB main memory, all Intel-Westmere nodes with 12 cores and 48GB main memory.
  • To explicitly request the Intel-Westmere nodes, use the additional Slurm directive --constraint=westmere, as in the sketch after this list.
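As an illustration, the following #SBATCH lines request the medium partition, a walltime of 48 hours and explicitly the Intel-Westmere nodes; the chosen values are only an example:

    #SBATCH --partition=medium
    #SBATCH --time=48:00:00
    #SBATCH --constraint=westmere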

 

Performing batch computations

To execute a computation in batch mode, it is not only necessary to tell the batch system which program to run, but also to specify the resources the program requires (such as computation time and main memory). These resources are written, together with the program call, into a batch or job script, which is then submitted to the batch system with the command

$ sbatch <jobscript>

(Note that the $-sign only represents the command line prompt.)

  • Note that every batch script starts with the directive #!/bin/bash on the first line. The subsequent lines then contain the directive #SBATCH, followed by a specific resource request or some other job information; for examples see below.
  • The most important job parameters are summarized in the following table.

 

Job parameters

Explanation                                               Parameter
Batch script directive                                    #SBATCH
Batch class (Slurm partition)                             --partition=<partition> or -p <partition>
Job name                                                  --job-name=<jobname> or -J <jobname>
Stdout file                                               --output=<filename> or -o <filename>
Stderr file                                               --error=<filename> or -e <filename>
Stdout and stderr into the same file                      --output=<filename> or -o <filename> (default if no --error is specified)
Number of nodes                                           --nodes=<n> or -N <n>
Number of (MPI) processes per node                        --tasks-per-node=<m>
Number of CPUs (cores) per task                           --cpus-per-task=<m> or -c <m>
Main memory per node (in MB)                              --mem=<memory>
... in GB                                                 --mem=1G
Walltime                                                  --time=[hh:mm:ss] or -t [hh:mm:ss]
Never requeue the job                                     --no-requeue
Email address                                             --mail-user=<address>
Email notifications                                       --mail-type=BEGIN, --mail-type=END, --mail-type=FAIL or --mail-type=ALL
Use of a node feature                                     --constraint=<feature>
Use of a "quality of service" (special nodes only:        --qos=<quality-of-service>
angus, fermion, fobigmem, focean, ikmb_a, msb or spin)

 

  • For interactive usage of the batch nodes without X11 support, there are currently two possibilities:
    1. For an interactive session in which the prompt of the login node is kept and commands are executed remotely only when prefixed with srun (e.g., srun hostname), use

      salloc --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=small
    2. For an interactive session in which all commands are executed directly on the compute node, use:

      srun --pty --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=small /bin/bash

 

Commands for job control

Explanation                        Command
Submit a batch job                 sbatch <jobscript>
Delete or terminate a batch job    scancel <jobid>
List all jobs in the system        squeue
List own jobs                      squeue -u <userid>
Show status of a specific job      squeue -j <jobid>
Show details of a specific job     scontrol show job <jobid>
Information about batch classes    sinfo

 

  • For a running job, detailed resource information can be gathered with the command sstat -j <jobid>.batch .

 

Environment variables

Explanation                                         Environment variable
Job ID                                              $SLURM_JOBID
Job name                                            $SLURM_JOB_NAME
Job user                                            $SLURM_JOB_USER
Directory from which the job has been submitted     $SLURM_SUBMIT_DIR
Node list                                           $SLURM_NODELIST

 

  • In Slurm, it is not necessary to explicitly change into the directory $SLURM_SUBMIT_DIR, as jobs automatically start in the directory from which they have been submitted.
  • The environment variable $TMPDIR for accessing the local disk space of a batch node has the following structure: /scratch/SlurmTMP/$SLURM_JOB_USER.$SLURM_JOBID .
  • Important: During job submission, Slurm transfers all environment variables that are defined at that point to the batch job. While this is convenient at first glance, it can cause the same job script to behave differently depending on which environment variables happen to be set. For this reason it is advisable to set the following directive: #SBATCH --export=NONE . A small test script illustrating these variables is sketched below.
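A small test script that prints the variables listed above; it uses --export=NONE, so any environment settings the job needs (for example loaded modules) have to be re-established inside the script. The partition express and the resource values are chosen here only as an example:

    #!/bin/bash
    #SBATCH --job-name=envtest
    #SBATCH --output=envtest.out
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --time=00:05:00
    #SBATCH --partition=express
    #SBATCH --export=NONE

    # print the job information that Slurm provides via environment variables
    echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) of user $SLURM_JOB_USER"
    echo "Submitted from: $SLURM_SUBMIT_DIR"
    echo "Allocated nodes: $SLURM_NODELIST"
    echo "Local scratch directory: $TMPDIR"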

 

 

Batch script templates

  • Example for a serial calculation:
    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.out
    #SBATCH --error=test.err
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1000
    #SBATCH --time=01:00:00
    #SBATCH --partition=small
    
    export OMP_NUM_THREADS=1
    ./test.ex
  • Example for a shared-memory calculation (thread or OpenMP parallelization, which is restricted to a single node):
    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.out
    #SBATCH --error=test.err
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=10000
    #SBATCH --time=01:00:00
    #SBATCH --partition=small
    
    export OMP_NUM_THREADS=8
    ./test.ex
    
  • Example for a parallel multi-node MPI calculation:
    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.out
    #SBATCH --error=test.err
    #SBATCH --nodes=2
    #SBATCH --tasks-per-node=12
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=10000
    #SBATCH --time=01:00:00
    #SBATCH --partition=small
    
    export OMP_NUM_THREADS=1
    module load intelmpi16.0.0
    mpirun -np 24 ./test.ex
    
  • Example for a parallel multi-node hybrid (MPI+OpenMP) calculation:
    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --output=test.out
    #SBATCH --error=test.err
    #SBATCH --nodes=2
    #SBATCH --tasks-per-node=6
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=10000
    #SBATCH --time=01:00:00
    #SBATCH --partition=small
    
    export OMP_NUM_THREADS=2
    module load intelmpi16.0.0
    mpirun -np 12 ./test.ex
  • Example for a job array:
    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --array=0-9
    #SBATCH --output=test-%A_%a.out
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1000
    #SBATCH --time=00:10:00
    #SBATCH --partition=small
    
    echo "Hi, I am step $SLURM_ARRAY_TASK_ID in the array job $SLURM_ARRAY_JOB_ID"


Support and Consulting

HPC-Support-Team: hpcsupport@rz.uni-kiel.de
Responsible contact persons at the Computing Centre:
Please see HPC-Support and Consulting.