Using the Research Data Warehouse

Last updated on 2025-11-10

Overview

Questions

  • How do I transfer files to (and from) the cluster?
  • What is the best way to back up research data?

Objectives

  • Understand how to use Newcastle University’s Research Data Warehouse (also known as RDW or Campus Filestore) with Comet HPC

Transferring files to and from Campus Storage for Research Data (RDW)


RDW is mounted on Comet at /rdw. Although it’s a separate physical system, it’s located in the same data centre as Comet and connected via fast ethernet. You can use scp and rsync to transfer data to RDW in the same way as copying to any other directory on Comet.
RDW is intended for data storage and is NOT suitable for interactive use or software installation.
Working data should be in your home or project directory. User-installed software should be in your home directory.

Using cp to copy to RDW

Because /rdw is a mounted filesystem, we can use cp instead of scp:

BASH

[user@cometlogin01(comet) ~] pwd

OUTPUT

/mnt/nfs/home/user

BASH

[user@cometlogin01(comet) ~] touch file.txt
[user@cometlogin01(comet) ~] ls /rdw/04/rse-training/
[user@cometlogin01(comet) ~] mkdir /rdw/04/rse-training/user
[user@cometlogin01(comet) ~] cp file.txt /rdw/04/rse-training/user/
[user@cometlogin01(comet) ~] cd /rdw/04/rse-training/user/
[user@cometlogin02(comet) rse-training]$ pwd

OUTPUT

/rdw/04/rse-training/user

BASH

[user@cometlogin02(comet) rse-training]$ ls

OUTPUT

file.txt
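
The same approach works for whole directories. The sketch below is a minimal illustration, assuming two temporary directories stand in for a work area and an RDW share (the names work, share, results, run1.txt and run2.txt are hypothetical). It copies a directory with cp -r and then uses diff -r to confirm that the copy matches the original.

```bash
# Hypothetical stand-ins: temporary directories play the roles of a
# work area and an RDW share, so this can be tried anywhere.
work=$(mktemp -d)
share=$(mktemp -d)

mkdir "$work/results"
echo "sample data" > "$work/results/run1.txt"
echo "more data"   > "$work/results/run2.txt"

# -r copies the directory and everything inside it.
cp -r "$work/results" "$share/"

# diff -r compares the two trees; it prints nothing and exits 0 when they match.
if diff -r "$work/results" "$share/results"; then
    copy_ok=yes
else
    copy_ok=no
fi
echo "copy verified: $copy_ok"
```

On Comet, the second directory would be a path under /rdw rather than a temporary directory.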

Using rsync to copy to RDW

As you gain experience with transferring files, you may find the scp command limiting. The rsync utility provides advanced features for file transfer and is typically faster than both scp and sftp. It is especially useful for transferring large files or many files, and for creating synced backup folders. The syntax is similar to cp and scp, and rsync can be used on a locally mounted filesystem or a remote filesystem.

Transfer to RDW from your work area on Comet

Try out a dry run:

BASH

[user@cometlogin01(comet) ~] cd /nobackup/proj/training/user/
[user@cometlogin01(comet) ~] mkdir TestDir
[user@cometlogin01(comet) ~] touch TestDir/testfile1
[user@cometlogin01(comet) ~] touch TestDir/testfile2
[user@cometlogin01(comet) ~] rsync -av TestDir /rdw/04/rse-training/user --dry-run

OUTPUT

sending incremental file list
TestDir/
TestDir/testfile1
TestDir/testfile2

sent 121 bytes  received 26 bytes  294.00 bytes/sec
total size is 0  speedup is 0.00 (DRY RUN)

Run ‘for real’:

BASH

[user@cometlogin01(comet) ~] rsync -av TestDir /rdw/04/rse-training/user

OUTPUT

sending incremental file list
created directory /rdw/04/rse-training/user
rsync: chgrp "/rdw/04/rse-training/user/TestDir" failed: Invalid argument (22)
TestDir/
TestDir/testfile1
TestDir/testfile2
rsync: chgrp "/rdw/04/rse-training/user/TestDir/.testfile1.ofeRqX" failed: Invalid argument (22)
rsync: chgrp "/rdw/04/rse-training/user/TestDir/.testfile2.fS1m6j" failed: Invalid argument (22)

sent 197 bytes  received 415 bytes  408.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2]

What happened? rsync returned an error: ‘some files/attrs were not transferred’. This is because RDW doesn’t ‘know’ about Comet’s groups. The transfer was successful though! Only the ‘group’ attribute of each file couldn’t be transferred. RDW has ‘trumped’ our local permissions and imposed its own standard permissions. This isn’t a problem: the correct user keeps ownership of the files.

BASH

[user@cometlogin01(comet) ~] ls -l TestDir/

OUTPUT

total 0
-rw------- 1 user comet_training 0 Mar 11 20:06 testfile1
-rw------- 1 user comet_training 0 Mar 11 20:06 testfile2

BASH

[user@cometlogin01(comet) ~] ls -l /rdw/04/rse-training/user/TestDir/

OUTPUT

total 33
-rwxrwx--- 1 user domainusers 0 Mar 11 20:10 testfile1
-rwxrwx--- 1 user domainusers 0 Mar 11 20:10 testfile2
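
If rsync is run from a script, the exit code reported in the error above (23) can be told apart from a clean run or a more serious failure. The sketch below is one possible way to do that; describe_rsync_status is a hypothetical helper written for this example, not part of rsync, and it only distinguishes the codes discussed in this lesson.

```bash
# describe_rsync_status is a hypothetical helper, not part of rsync.
# It turns an rsync exit code into a short human-readable message.
describe_rsync_status() {
    case "$1" in
        0)  echo "success" ;;
        23) echo "partial transfer: some files or attributes were not transferred" ;;
        *)  echo "failed with exit code $1" ;;
    esac
}

# After a real transfer the helper would be called with rsync's exit code, e.g.
#   rsync -av TestDir /rdw/04/rse-training/user; describe_rsync_status $?
msg_ok=$(describe_rsync_status 0)
msg_partial=$(describe_rsync_status 23)
echo "$msg_ok"
echo "$msg_partial"
```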

It’s still easier to read output without errors that we have to ignore, so let’s remove that error.

The -a (archive) option is shorthand for -rlptgoD, which includes preserving the group (-g). This is why we see the chgrp errors above.
For Comet and RDW, replace -av with -rltv:
-r = recurse through subdirectories
-l = copy symlinks
-t = preserve timestamps
-v = verbose

BASH

[user@cometlogin01(comet) ~] rsync -rltv TestDir/ /rdw/04/rse-training/user/

OUTPUT

sending incremental file list
./
testfile1
testfile2

sent 150 bytes  received 57 bytes  414.00 bytes/sec
total size is 0  speedup is 0.00

Challenge

Spot the difference

Can you spot the difference between the two previous rsync commands? Try ls -R on the destination.

BASH

[user@cometlogin01(comet) ~] ls -R /rdw/04/rse-training/user/

OUTPUT

/rdw/04/rse-training/user/:
TestDir  testfile1  testfile2

/rdw/04/rse-training/user/TestDir:
testfile1  testfile2

We now have too many files! The first rsync command copied the TestDir directory itself because there was no trailing / on the source.
The second rsync command copied only the contents of TestDir because of the trailing / on the source.
We could have spotted this in the output of --dry-run, but it also shows that it’s a good idea to check the destination after you copy.
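
The trailing-slash behaviour can be tried safely away from RDW. The sketch below is meant to be saved and run as a script; it assumes rsync is installed and uses a temporary directory with hypothetical destinations dest_a and dest_b in place of a real share.

```bash
#!/usr/bin/env bash
# Skip the demonstration if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync is not installed; skipping"; exit 0; }

# Hypothetical sandbox: nothing here touches /rdw.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/TestDir" "$sandbox/dest_a" "$sandbox/dest_b"
touch "$sandbox/TestDir/testfile1" "$sandbox/TestDir/testfile2"

# No trailing slash on the source: the directory itself is copied.
rsync -rlt "$sandbox/TestDir" "$sandbox/dest_a/"

# Trailing slash on the source: only the contents are copied.
rsync -rlt "$sandbox/TestDir/" "$sandbox/dest_b/"

# dest_a now holds TestDir/, while dest_b holds the two files directly.
ls -R "$sandbox/dest_a" "$sandbox/dest_b"
```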

Large data copies


When copying large amounts of data, rsync really comes into its own. When you’re copying a lot of data, it’s important to keep track in case the copy is interrupted. Rsync is great because it can pick up where it left off, rather than starting the copy all over again. It’s also useful to output to a log so you can see what was transferred and find any errors that need to be addressed.

Fast Connections

Transfers from Comet to RDW don’t leave our fast data centre network. If you’re using rsync with a fast network or disk to disk in the same machine:

  • DON’T use compression -z
  • DO use --inplace

Why? Compression uses a lot of CPU, and rsync usually writes each file to a temporary file at the destination before moving it into place. For fast transfers, this places too much load on the CPU and the disk.
--inplace tells rsync not to create the temporary file but to write the data directly to the destination file. If the transfer is interrupted, re-run the same command: rsync skips the files that have already been copied. Always re-run the transfer command to ensure nothing was missed. The second run should be very fast, just checking the files and not copying anything.

Slow Connections

For a slow connection like the internet:

  • DO use compression -z
  • DON’T use --inplace.
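
The two sets of recommendations above can be captured in a small helper so the right options are easy to reuse. This is only a sketch: choose_rsync_flags and the labels fast and slow are hypothetical names invented for this example, not rsync options.

```bash
# choose_rsync_flags is a hypothetical helper; "fast" and "slow" are labels
# invented for this sketch, not rsync options.
choose_rsync_flags() {
    case "$1" in
        fast) echo "-rltv --inplace" ;;   # same data centre, or disk to disk
        slow) echo "-rltvz" ;;            # e.g. over the internet: compress, no --inplace
        *)    echo "usage: choose_rsync_flags fast|slow" >&2; return 1 ;;
    esac
}

fast_flags=$(choose_rsync_flags fast)
slow_flags=$(choose_rsync_flags slow)
echo "Comet to RDW:      rsync $fast_flags SOURCE DEST"
echo "Over the internet: rsync $slow_flags SOURCE DEST"
```
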
Challenge

Large Transfer to RDW

RDW has a super-fast connection to Comet, which means that it takes more resources to compress and uncompress the data than it does to do the transfer. What command would be best for backing up a large amount of data from Comet to RDW?

BASH

[user@login01 ~]$ rsync -rltv --inplace --size-only DataDir /rdw/04/rse-training/user/

  • --inplace saves resources by not creating temporary files
  • --size-only saves time by only checking whether a file’s size has changed (and not its last-modified time)
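
One consequence of --size-only is worth seeing in action: a file whose contents change but whose size stays the same is skipped. The sketch below is meant to be saved and run as a script; it assumes rsync is installed and uses a temporary directory with hypothetical src and dst folders rather than a real share.

```bash
#!/usr/bin/env bash
# Skip the demonstration if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync is not installed; skipping"; exit 0; }

# Hypothetical sandbox: nothing here touches /rdw.
sandbox=$(mktemp -d)
mkdir "$sandbox/src" "$sandbox/dst"
printf 'aaaa' > "$sandbox/src/data.txt"

# First copy: the file is new at the destination, so it is transferred.
rsync -rlt --size-only "$sandbox/src/" "$sandbox/dst/"

# Change the contents but keep the size at 4 bytes.
printf 'bbbb' > "$sandbox/src/data.txt"

# Second copy: the sizes match, so --size-only skips the file.
rsync -rlt --size-only "$sandbox/src/" "$sandbox/dst/"

echo "source:      $(cat "$sandbox/src/data.txt")"
echo "destination: $(cat "$sandbox/dst/data.txt")"
```

For typical research data, where an edited file almost always changes size, this trade-off is usually acceptable, but it is worth keeping in mind for fixed-size files.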

Discussion

Large Transfer to RDW (continued)

See man rsync and https://rsync.samba.org/ for more examples.

--delete removes files from the destination that are not present in the source, which is very useful for tidying up when files have been duplicated. However, it should be used with care! Perhaps a collaborator has placed additional files in the directory you are syncing to.

Use --dry-run --progress --stats to check before you run.

Accidental deletions on RDW can be rolled back using Windows File Explorer. Log a ticket with NUIT for help with rollback.
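
A dry run with --delete lists what would be removed without removing it. The sketch below is meant to be saved and run as a script; it assumes rsync is installed and uses a temporary directory in place of a real share. The file names keep.txt and collaborator-notes.txt are hypothetical.

```bash
#!/usr/bin/env bash
# Skip the demonstration if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync is not installed; skipping"; exit 0; }

# Hypothetical sandbox: nothing here touches /rdw.
sandbox=$(mktemp -d)
mkdir "$sandbox/src" "$sandbox/dst"
touch "$sandbox/src/keep.txt"
# The destination holds an extra file that is not in the source.
touch "$sandbox/dst/keep.txt" "$sandbox/dst/collaborator-notes.txt"

# --dry-run reports what --delete would remove without touching anything.
preview=$(rsync -rltv --delete --dry-run "$sandbox/src/" "$sandbox/dst/")
echo "$preview"

# The extra file is still there because this was only a dry run.
ls "$sandbox/dst"
```

Any line in the preview that starts with "deleting" names a file that the real run would remove from the destination.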

Challenge

Add a dry run and a log file

Try out a dry run:

BASH

rsync --dry-run -rltv --inplace --itemize-changes --progress --stats  --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log1

Run ‘for real’:

BASH

rsync -rltv --inplace --itemize-changes --progress --stats  --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log2

  • --inplace --size-only speed up the transfer, and --inplace prevents rsync filling up space with large temporary files
  • --itemize-changes --progress --stats for more informative output
  • Remember | from the Unix Shell workshop?
    | tee sends output both to the screen and to a log file
  • Many options have both a short form like -v and a long form like --verbose; others, such as --inplace, only have a long form. Use man rsync to craft your favourite arguments list.
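
Once a log exists, it can be searched for problems instead of being read from top to bottom. This matters because, by default, the exit status of a pipeline such as rsync ... | tee is that of tee, not rsync. The sketch below is a minimal illustration: it builds a small sample log from lines shown earlier in this lesson (the temporary log file is a stand-in for your own) and picks out the lines that rsync itself printed as errors, which start with "rsync".

```bash
# A temporary file stands in for a real log such as meaningful-log-name.log2.
log=$(mktemp)
cat > "$log" <<'EOF'
sending incremental file list
TestDir/
rsync: chgrp "/rdw/04/rse-training/user/TestDir" failed: Invalid argument (22)
TestDir/testfile1
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2]
EOF

# Count the lines that rsync printed as warnings or errors.
problems=$(grep -c '^rsync' "$log")
echo "lines needing attention: $problems"

# Show those lines with their line numbers in the log.
grep -n '^rsync' "$log"
```
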
Key Points
  • cp and rsync transfer files between RDW and HPC.
  • try a dry-run of rsync to avoid accidental duplications or deletions
  • re-run large rsync commands to confirm success
  • output to a log to keep a record
  • group permissions on RDW can’t be changed from Linux
  • RDW shares have a pre-set ‘modify’ group of campus users
  • some RDW shares have a pre-set ‘read’ group of campus users
  • RDW has a roll-back feature in case of accidents

Find out more about where to store data on Comet: https://hpc.researchcomputing.ncl.ac.uk/dokuwiki/doku.php?id=started:filesystems