Parallel BeeGFS Filesystem

Access: Students: no | Employees: yes | Institutions: yes | Student groups: no

Both the rzcluster and the caucluster are connected to a BeeGFS filesystem that provides fast workspace for applications with high I/O demands. The filesystem is mounted at /work_beegfs on all compute nodes and on the caucluster and rzcluster login nodes. It has the following basic properties:

  • 350 TB usable space (managed by user quotas)
  • hosted on four fileservers for parallel access
  • separate metadata server, hosting the metadata on fast SSDs
  • access via Infiniband (if available on the nodes)

 

Hints and Rules for using the BeeGFS Installation

  • Currently, new users get their WORKDIR on /work_beegfs; existing users will not automatically get a directory there. Please ask via hpcsupport@rz.uni-kiel.de to get access.
  • Space and number of files on /work_beegfs will be managed via user quotas.
  • Initially, all users get a quota of 1 TB and 1 million "chunks" (roughly 250k files; see #Quotas below).
  • If you need more, this can be increased on request, provided we are not running low on free space.
  • As with the NFS working directories there will be no backup of data stored on BeeGFS.
  • The filesystem is designed as fast workspace for "hot" data - it is not intended as permanent storage for large amounts of "cold" data. Please try to move huge input or output files you do not need for active calculations/projects off the cluster workspace - either to the tape library or your local resources.
  • The I/O rates you can expect from the filesystem vary with the network available on the different nodes (a quick way to check the rate from your node is sketched after this list):
    • Nodes with QDR Infiniband - 'focean' and 'angus' partitions, caucluster nodes, data mover 'rzcldata1': A single thread can reach between 600 MB/s and 1.5 GB/s, several threads writing to different files can reach up to 3.8 GB/s.
    • Nodes with DDR Infiniband - 'spin' partition, some of the AMD nodes in the public partition, the rzcluster login node 'rzcl00b': A single thread can reach between 500 MB/s and 1 GB/s, several threads writing to different files can reach up to 1.9 GB/s.
    • Nodes with 10 GbE - rzcluster login node 'ikmbhead': A single thread can reach between 300 MB/s and 600 MB/s, several threads writing to different files can reach up to 0.9 GB/s.
    • Nodes with 1 GbE - all remaining rzcluster nodes, caucluster login node: A single thread can reach up to 100 MB/s (which is already the network limit).
  • Filesystem I/O is a shared cluster resource: no matter which part of the cluster you use it from, your I/O affects everyone else. Transfer speeds can therefore be much lower if someone else is doing lots of I/O at the same time as you.
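
A quick way to get a rough impression of the write rate available from the node you are currently on is a simple dd test. This is only a minimal sketch: it assumes your working directory is /work_beegfs/<username> (adapt the path), it writes a temporary 2 GB test file, and the file should be removed afterwards:

# write 2 GB of zeros in 1 MB blocks; dd reports the transfer rate when it finishes
dd if=/dev/zero of=/work_beegfs/<username>/ddtest.bin bs=1M count=2048 conv=fsync

# remove the test file again
rm /work_beegfs/<username>/ddtest.bin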

 

New feature: Data mover/Data staging node

  • There is now a node called rzcldata1, which is the only node in the SLURM partition 'data' (see 'sinfo'). This node has both 10 GbE and QDR Infiniband interfaces wired directly to the filesystem.
  • It has access to both the BeeGFS and all NFS servers available to the cluster (if we missed one, drop us a note).
  • It can be used to stage your data from one of the old NFS fileservers (or the Isilon of the IKMB) to the BeeGFS. You can do this both interactively and with batch jobs submitted to the 'data' partition.
  • Doing I/O to /work_beegfs on this node has only minimal impact on the I/O performance for the rest of the cluster, as it has a dedicated network link to the filesystem - so if you want to move large amounts of data, this is the ideal place to do it. (Note: you will still impact the performance of the NFS server you are copying from or to.)
  • Example workflow (a sketch with SLURM job dependencies follows this list):
    • You need to analyse a huge amount of data from the Isilon of the IKMB (available via NFS over a 10 GbE interface) on the 'ikmb' nodes (only equipped with 1 GbE links).
    • First, submit a 'staging job' to the data node that copies all needed input from the Isilon to the BeeGFS.
    • After this has finished, the worker jobs can start to use the data from BeeGFS for the analysis. While they are still limited to accessing BeeGFS at the network limit of 100 MB/s, they do not overload the Isilon with parallel requests - and while they are working on the data, the Isilon can already be used at full bandwidth for the next staging job.
    • After the workers have finished, another staging job can push the results back from the data node to the Isilon.
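
A minimal sketch of how such a staged workflow could be chained with SLURM job dependencies is shown below. All paths, the program name 'my_analysis', and the worker partition name are placeholders you need to adapt; only the 'data' partition is taken from above. The '--dependency=afterok' option ensures that each step only starts after the previous job has completed successfully.

# 1) staging job on the data mover: copy the input from the Isilon NFS mount to BeeGFS
jid1=$(sbatch --parsable --partition=data \
      --wrap="rsync -a /path/to/isilon/input/ /work_beegfs/<username>/project/input/")

# 2) analysis job on the worker nodes, started only after the staging job finished successfully
jid2=$(sbatch --parsable --partition=<worker_partition> --dependency=afterok:$jid1 \
      --wrap="./my_analysis /work_beegfs/<username>/project/input /work_beegfs/<username>/project/output")

# 3) staging job back to the Isilon, again on the data mover
sbatch --partition=data --dependency=afterok:$jid2 \
      --wrap="rsync -a /work_beegfs/<username>/project/output/ /path/to/isilon/output/"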

 

Quotas

The space on the BeeGFS filesystem is accounted and limited by user quotas. The default limit for all users is initially 1 TB, which can be increased upon request. The number of files is also limited via the number of "chunks" you are allowed to create: by default, BeeGFS spreads each file over four different storage targets, so a single file usually exists as four "chunks" on the physical disks. (Very small files may need fewer than four, as the minimum size of a chunk is 1 MB; a file smaller than 3 MB therefore needs three chunks or fewer.) The default limit of 1 million chunks will normally allow about 250k files. Again, this can be increased upon request.
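
If you want to see how an individual file or directory is striped, and therefore how many chunks a file occupies, beegfs-ctl can display the stripe pattern. The path below is only an example; use one of your own files:

# show the stripe pattern (chunk size and number of storage targets) of a file or directory
beegfs-ctl --getentryinfo /work_beegfs/<username>/somefile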

To check your currently used quota, use the command

beegfs-ctl --getquota --uid <username>

Note that the quota checks only run every few minutes, so you might overshoot your quota if you do lots of I/O while crossing your limit. For the same reason, it can take a few minutes before you can write again after you have reduced your usage below your quota.

 

Tips for I/O and data handling

  • If you have to use a node without Infiniband and only a 1 GbE link, it can still be faster to use the local disks of the cluster node for I/O, as the network would otherwise limit you to 100 MB/s.
  • BeeGFS will not perform very well if you need to access a huge number of small files, or if you repeatedly open and close files to read or write small amounts of data.
  • For optimal performance, try to do I/O to only a few files and in large blocks (i.e. do not write a single result value to several different files in each iteration, but collect results and write them to a single file). A small sketch combining this with node-local scratch space follows this list.
  • If your workload is very I/O intensive, do not hesitate to check back with us (hpcsupport@rz.uni-kiel.de) to see if there are tuning options available to optimize the performance of your application.
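
As a rough illustration of the last two tips, a job script can use node-local scratch space for intermediate files and copy a single collected result file back to BeeGFS at the end. This is only a sketch: the scratch path, the program name 'my_analysis' with its options, and the assumption that it writes intermediate results into a 'results' subdirectory are placeholders; check which local scratch area is available on your nodes.

# use the node-local disk for intermediate I/O (adapt the path to the local scratch area of your node)
SCRATCH=/tmp/<username>/$SLURM_JOB_ID
mkdir -p "$SCRATCH"

# run the analysis with its working directory on the fast local disk
./my_analysis --input /work_beegfs/<username>/input --workdir "$SCRATCH"

# collect all results into a single archive and copy it back to BeeGFS in one large transfer
tar -czf "$SCRATCH/results.tar.gz" -C "$SCRATCH" results
cp "$SCRATCH/results.tar.gz" /work_beegfs/<username>/
rm -rf "$SCRATCH"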