Why use a Cluster?


  • High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.
  • These other systems can be used to do work that would either be impossible or much slower on smaller systems.
  • HPC resources are shared by multiple users.
  • The standard method of interacting with such systems is via a command line interface.

Connecting to a remote HPC system


  • An HPC system is a set of networked machines.
  • HPC systems typically provide login nodes and a set of worker nodes.
  • The resources found on independent (worker) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network-mounted filesystems, etc.).
  • Files saved to a shared (network-mounted) filesystem on one node are available on all nodes.
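To see how nodes differ, you can print a quick summary of the machine your shell is running on. This is a minimal sketch that assumes Linux nodes (it reads /proc/meminfo). The function name node_summary is only illustrative. Run it once on a login node and once inside a job, then compare the core counts, memory, and mounted filesystems.

```bash
#!/bin/bash
# Print a short summary of the node this shell is running on.
node_summary() {
    echo "Host:         $(uname -n)"
    echo "Architecture: $(uname -m)"
    # nproc counts the cores this process is allowed to use, which inside
    # a scheduled job may be fewer than the node physically has.
    echo "CPU cores:    $(nproc)"
    # /proc/meminfo is Linux-specific and reports MemTotal in kB.
    awk '/^MemTotal:/ { printf "Memory:       %.1f GiB\n", $2 / 1048576 }' /proc/meminfo
    # Mounted filesystems, including any network-mounted ones.
    df -h
}

node_summary
```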

Scheduler Fundamentals


  • The scheduler handles how compute resources are shared between users.
  • A job is just a shell script.
  • Request slightly more resources than you will need.
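A minimal job script is sketched below, assuming the scheduler is Slurm. Bash treats the #SBATCH lines as ordinary comments, so the file is still just a shell script. The job name and resource values are illustrative. Choose them by setting the values a little above what a test run actually used.

```bash
#!/bin/bash
# Minimal batch job script, assuming the Slurm scheduler.
# Bash treats every "#SBATCH" line as a comment; Slurm reads them as
# resource requests when the script is submitted with "sbatch job.sh".
#SBATCH --job-name=example-job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00

# Everything below is ordinary shell and runs on the allocated worker node.
host=$(uname -n)
echo "Running on ${host}"
# SLURM_JOB_ID is set by Slurm inside a job; show a placeholder otherwise.
echo "Job ID: ${SLURM_JOB_ID:-not running under Slurm}"
```

Submitted with sbatch, the script runs on a worker node. Run directly with bash, it runs the same commands on the current machine.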

Accessing software via Modules


  • Load software with module load softwareName.
  • Unload software with module unload softwareName.
  • The module system handles software versioning and package conflicts for you automatically.
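A typical module session is sketched below. The avail, load, list and unload subcommands exist in both Lmod and Environment Modules. The module name python is hypothetical, so check module avail for the real names on your system. module is a shell function that the cluster sets up, not a standalone program, so this sketch only runs the commands where that function is defined.

```bash
#!/bin/bash
# 'module' is a shell function provided by Lmod or Environment Modules,
# so only run the demo where it is defined.
if type module >/dev/null 2>&1; then
    module avail python      # list available versions ("python" is a hypothetical name)
    module load python       # add the default version to your environment
    module list              # show which modules are currently loaded
    module unload python     # remove it again
else
    echo "No module system on this machine" >&2
fi
```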

Transferring files with remote computers


  • wget and curl -O download a file from the internet.
  • scp and rsync transfer files to and from your computer.
  • You can use an SFTP client like FileZilla to transfer files through a GUI.

Using the Research Data Warehouse


  • cp and rsync transfer files between RDW and HPC.
  • Try a dry run of rsync (rsync --dry-run) first to avoid accidental duplications or deletions.
  • Re-run large rsync commands to confirm that the transfer completed.
  • Write the output to a log file to keep a record.
  • Group permissions on RDW can’t be changed from Linux.
  • RDW shares have a pre-set ‘modify’ group of campus users.
  • Some RDW shares also have a pre-set ‘read’ group of campus users.
  • RDW has a rollback feature in case of accidents.

Find out more about where to store data on Comet: https://hpc.researchcomputing.ncl.ac.uk/dokuwiki/doku.php?id=started:filesystems

Running a parallel job


  • Parallel programming allows applications to take advantage of parallel hardware.
  • The queuing system facilitates executing parallel tasks.
  • Performance improvements from parallel execution do not scale linearly.
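Amdahl's law is one common model of why speedup is not linear. If a fraction p of a program's run time can be parallelised, the best possible speedup on n cores is 1 / ((1 - p) + p / n). The sketch below evaluates it with awk. The value p = 0.9 is purely illustrative.

```bash
#!/bin/bash
# Amdahl's law: best-case speedup on n cores when a fraction p of the
# run time is parallelisable. Usage: amdahl p n
amdahl() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

for n in 1 2 4 8 16 32; do
    echo "$n cores: speedup $(amdahl 0.9 "$n")"
done
```

With p = 0.9, 16 cores give a speedup of about 6.4 and 32 cores only about 7.8, and no number of cores can exceed 10. Real jobs also pay communication and coordination costs that this model ignores, so measured speedups are often lower still.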

Using resources effectively


  • Accurate job scripts help the queuing system efficiently allocate shared resources.
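One way to make a job script accurate is to time a short, representative test run and then request that time plus a modest margin. The helper below is hypothetical. It turns a measured run time in seconds into an HH:MM:SS value, the form Slurm's --time option accepts, and adds a 20% margin by default. The sleep 1 is only a stand-in for your real test run.

```bash
#!/bin/bash
# Hypothetical helper: convert a measured run time (seconds) into an
# HH:MM:SS wall-time request with a percentage safety margin (default 20%).
suggest_walltime() {
    local seconds=$1 margin_pct=${2:-20}
    # Integer arithmetic, rounded up so the margin is never lost.
    local padded=$(( (seconds * (100 + margin_pct) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $((padded / 3600)) $((padded % 3600 / 60)) $((padded % 60))
}

# Time a short test run ('sleep 1' stands in for the real program).
start=$(date +%s)
sleep 1
end=$(date +%s)
echo "Measured $((end - start)) s; request --time=$(suggest_walltime $((end - start)))"
```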

Using shared resources responsibly


  • Be careful how you use the login node.
  • Your data on the system is your responsibility.
  • Plan and test large data transfers.
  • It is often best to convert many files to a single archive file before transferring.
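Below is a minimal sketch of bundling many small files into one archive before a transfer. A temporary directory of 100 generated files stands in for real output. Transferring one large file avoids the per-file overhead that many small files incur. After copying the archive with scp or rsync, it can be unpacked on the other side with tar -xzf.

```bash
#!/bin/bash
# Create a directory of many small files as a stand-in for real output.
workdir=$(mktemp -d)
mkdir -p "$workdir/results"
for i in $(seq 1 100); do
    echo "value $i" > "$workdir/results/file_$i.txt"
done

# Bundle them into one compressed archive: c = create, z = gzip, f = file name.
# -C makes the paths inside the archive start at "results/".
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# Check the archive before transferring it (t = list contents).
echo "Archive holds $(tar -tzf "$workdir/results.tar.gz" | grep -c '\.txt$') files"
```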