NEC SX-Aurora TSUBASA Vector System

Students: no | Employees: yes | Faculties: yes | Student Unions: no


The NEC SX-Aurora TSUBASA vector system is part of the hybrid NEC HPC System at the University Computing Centre. As the successor to the NEC SX-ACE, the SX-Aurora TSUBASA vector system comprises a total of 512 SX vector cores distributed over 8 nodes, so-called vector hosts (VH), each of which is equipped with 8 vector engines (VE) connected to the VH via PCIe. Each VE offers 48 GB of main memory and a memory bandwidth of 1.2 TB/s. Batch jobs are managed by the batch system NQSV.


 

Hardware

NEC SX-Aurora TSUBASA batch nodes (A300-8)

  • 8 vector hosts [neshve00 ... neshve07] each with
    • 24 Intel Skylake x86 cores (2.6 GHz) and 192 GB main memory per VH
    • 8 vector engines per VH (type 10B, 1.4 GHz)
      • 8 SX vector cores and 48 GB main memory per VE
      • 64 logical vector registers per core (length: 256 x 64 bits)
      • 1.2 TB/s memory bandwidth
    • Within a VH node, VEs communicate with each other through PCI express (PCIe)
  • Interconnect technology between nodes: Infiniband

Frontend

  • 4 nodes [nesh-fe.rz.uni-kiel.de] each with
    • 2 Intel Xeon Gold 6130 (Skylake-SP) processors (2.1 GHz)
    • 32 cores per node and 768 GB main memory

Operating system

  • VH: Red Hat Enterprise Linux 7
  • VE: Operating system functionality is offloaded entirely to the VH (realized by VEOS kernel modules and daemons)

 

Access

  • SSH command line access to the frontend from within the University network:

    ssh -X <username>@nesh-fe.rz.uni-kiel.de

  • The available filesystems (in particular $HOME and $WORK) are the same as for the NEC HPC Linux-Cluster.
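  • Data can be copied to and from the frontend with standard SSH-based tools; a minimal sketch (assuming scp is permitted alongside SSH, with placeholder file and directory names):

    scp results.tar <username>@nesh-fe.rz.uni-kiel.de:<target directory>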

 

 

File systems

The front end and all batch nodes share a global 5 PB file system containing the Home and Work directories. In addition, data can be transferred to the tape library.

Home directory

  • Accessible via the environment variable $HOME
  • Globally available on all nodes (front end and batch nodes)
  • Daily data backup
  • Suitable for saving scripts, programs and small results

 

Work directory

  • Accessible on the front end via the environment variable $WORK
  • Globally available on all nodes (front end and batch nodes)
  • File system without data backup
  • User quotas for disk space and inodes apply; the current usage can be displayed with the workquota command
  • Batch computations should only be performed in this directory
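  • Both environment variables can be used directly in interactive shells and in batch scripts; a quick orientation might look like the following sketch (the exact output and arguments of workquota depend on the local installation):

    echo $HOME $WORK   # show the personal home and work paths
    cd $WORK           # change into the work directory before starting batch work
    workquota          # display the current disk space and inode usage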

 

 

Tape library

 

  • Accessible via the environment variable $TAPE_CACHE
  • Available on the login node or via the batch class feque
  • Files that are currently not in use should be transferred to the additional file system /nfs/tape_cache as archived data (for example, as a tar file).
  • Files under /nfs/tape_cache are automatically migrated to tape after a while; nevertheless, they can be copied back to the home or work directory via the login node at any time.
  • Not suitable for storing many small files
  • Recommended size of an archived file: 3 GB to 50 GB (the size of a single tar file should not exceed 1 TB)
  • Data transfer to and from the tape library must not be performed with the rsync command
  • Attention: slow access speed. Avoid working directly with files on the tape library; instead, copy them back to the work directory before further processing such as unpacking (see the example below).
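A typical archiving workflow could look like the following sketch (directory and file names are placeholders; the recommended archive size from above applies):

    # pack data that is no longer actively used into a single tar file
    cd $WORK
    tar -cf project2023.tar project2023/

    # transfer the archive to the tape cache (do not use rsync)
    cp project2023.tar $TAPE_CACHE/

    # later: copy the archive back to the work directory before unpacking it
    cp $TAPE_CACHE/project2023.tar $WORK/
    tar -xf $WORK/project2023.tar -C $WORK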

 

Software and program compilation

  • In order to run on the VE, source code needs to be cross-compiled (a minimal compile example is given at the end of this section).
  • Each process on the VE has a shadow process on the VH, which handles I/O and other administrative tasks for the process on the VE. The VE itself does not run an operating system; instead, the so-called VEOS runs on the VH and a corresponding library part is linked to the application running on the VE. All I/O is executed by the VH on behalf of the VE and can therefore access any file system mounted on the VH.
  • Cross-compilers:

    nfort / ncc / nc++ ...

  • Cross-compilers to build MPI programs:

    module load necmpi

    mpinfort / mpincc / mpinc++ ...

  • The software development kit moreover contains a collection of numeric libraries (NLC) that have been optimized for the VE: BLAS, SBLAS, LAPACK, SCALAPACK, ASL, Heterosolver. Documentation: https://www.hpc.nec/documents/sdk/SDK_NLC/UsersGuide/main/en/index.html
  • Example for using OpenMP and a VE-optimized LAPACK library:

    module load necnlc; source /opt/nec/ve/nlc/1.0.0/bin/nlcvars.sh [args]

    nfort test.f90 -fopenmp -llapack -lblas_sequential

    nc++ -O2 test.cpp -fopenmp -llapack -lblas_sequential -lnfort
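  • Minimal compile sketch for a serial and an MPI program on the VE (assuming Fortran sources test.f90 and mpi_test.f90; executable names are placeholders):

    # serial VE executable built with the cross-compiler
    nfort -O2 test.f90 -o test.ex

    # MPI VE executable built with the MPI cross-compiler wrapper
    module load necmpi
    mpinfort -O2 mpi_test.f90 -o mpi_test.ex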

 

Batch processing

  • For resource management on the NEC SX-Aurora TSUBASA, we deploy the batch system NQSV (Network Queuing System V).

Batch classes

  • Currently, the following batch classes are available:
    batch class       max. runtime (walltime) / default   max. number of VHs [neshve##]
    veexpress         2 h / 15 min                        8
    vequeue           48 h / 30 min                       8
    veinteractive     1 h / 30 min                        1

NQSV options

  • The most important NQSV options for job submissions on the NEC SX-Aurora TSUBASA are the following:
    NQSV option                      Explanation
    #!/bin/bash                      defines the shell
    #PBS -T necmpi                   specifies the job type (NEC-MPI); only required for parallel calculations over multiple VHs
    #PBS --use-hca=1                 enables Infiniband communication to other VHs; only required for parallel calculations over multiple VHs
    #PBS -b 2                        number of requested vector hosts (here 2; max. 8)
    #PBS --venum-lhost=1             number of requested vector engines per VH (max. 8)
    #PBS -l elapstim_req=01:00:00    requested walltime (here 1 h)
    #PBS -N test                     name of the batch job (here test)
    #PBS -o test.out                 file for the standard output (here test.out)
    #PBS -e test.err                 file for the standard error (here test.err)
    #PBS -j o                        joins the standard output and error in one single file
    #PBS -q vequeue                  requested batch class (here vequeue)
    #PBS -m abe                      email notification when the job begins (b), ends (e) or aborts (a)
    #PBS -M <email address>          email address for job notifications (see -m option)

Batch script examples

  • Example for a serial calculation (using one core on one VE):
    #!/bin/bash
    #PBS -b 1 
    #PBS --venum-lhost=1
    #PBS -l elapstim_req=01:00:00 
    #PBS -N test
    #PBS -o test.out 
    #PBS -j o 
    #PBS -q veexpress 
    
    # print performance information after the job
    # the code has to be compiled with the -proginf option
    export VE_PROGINF=DETAIL
    
    # change into the qsub directory
    cd $PBS_O_WORKDIR
    
    # program call
    ./test.ex
  • Example for an OpenMP-parallel calculation (using all 8 cores on one VE):
    #!/bin/bash
    #PBS -b 1 
    #PBS --venum-lhost=1
    #PBS -l elapstim_req=01:00:00 
    #PBS -N test
    #PBS -o test.out 
    #PBS -j o 
    #PBS -q veexpress 
    
    # change into the qsub directory
    cd $PBS_O_WORKDIR
    
    # program call (max. 8 parallel threads per VE)
    export OMP_NUM_THREADS=8
    ./test.ex 
    
  • Example for an MPI-parallel calculation (using all 8 VEs on one VH):
    #!/bin/bash
    #PBS -b 1 
    #PBS --venum-lhost=8
    #PBS -l elapstim_req=01:00:00 
    #PBS -N test
    #PBS -o test.out 
    #PBS -j o 
    #PBS -q veexpress 
    
    # change into the qsub directory
    cd $PBS_O_WORKDIR
    
    # program call
    mpirun -nn 1 -nnp 64 -ve 0-7 ./test.ex 
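  • Sketch of an MPI-parallel calculation across two VHs, combining the -T necmpi and --use-hca options from the table above (the mpirun process placement is extrapolated from the single-VH example; please check the NEC MPI documentation for the exact options):
    #!/bin/bash
    #PBS -T necmpi
    #PBS --use-hca=1
    #PBS -b 2
    #PBS --venum-lhost=8
    #PBS -l elapstim_req=01:00:00
    #PBS -N test
    #PBS -o test.out
    #PBS -j o
    #PBS -q vequeue
    
    # change into the qsub directory
    cd $PBS_O_WORKDIR
    
    # program call (assumption: 64 processes per VH, i.e. 8 per VE, on both VHs)
    mpirun -nn 2 -nnp 64 -ve 0-7 ./test.ex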
    

Commands for job submission and control

  • $ qsub <nqs_script> submits a batch job
  • $ qstat shows status information about own jobs
  • $ qdel <jobid> terminates a running or removes a waiting job
  • $ qstat -f <jobid> shows detailed information about the specified job
  • $ qalter <jobid> ... alters a job resource of a waiting job
  • $ qcat -n <number> -o <jobid> displays the last <number> lines of the standard output that a specified job has produced so far
  • $ qlogin -q veinteractive -l cpunum_job=1 -l elapstim_req=1:00:00 interactive batch job; working directly on one vector host
  • To monitor the processes running on the vector engines of a specific vector host, we provide the following command:

    vetop <vh>

    where the argument <vh> is the name of the vector host, e.g., neshve00 [00 ... 07].
  • To show the resources used by a running job (CPU time and main memory) on the vector engine, use the command:

    qstat -J -e
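  • A typical command sequence could look like the following sketch (<nqs_script> and <jobid> are placeholders as above):

    qsub <nqs_script>        # submit the batch script; NQSV returns the assigned request ID
    qstat                    # check the status of your own jobs
    qcat -n 20 -o <jobid>    # show the last 20 lines of standard output produced so far
    qdel <jobid>             # terminate a running job or remove a waiting one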

 

Documentation

 


Support and Consulting

HPC Support Team: hpcsupport@rz.uni-kiel.de
Responsible contact persons at the Computing Centre:
See HPC Support and Consulting.