3.1. DataLad on High Throughput or High Performance Compute Clusters

For efficient computing of large analyses, to comply with best computing practices, or to fulfil requirements imposed by responsible system administrators, users may turn to computational clusters such as high-performance computing (HPC) or high-throughput computing (HTC) infrastructure for data analysis, back-up, or storage.

This chapter is a collection of useful resources and examples that aims to help you get started with DataLad-centric workflows on clusters. We hope to grow this chapter further, so please get in touch if you want to share your use case or seek more advice.

3.1.1. Pointers to content in other chapters

To find out more about centralized storage solutions, you may want to check out the use case Building a scalable data storage for scientific computing or the section Remote indexed archives for dataset storage and backup.

3.1.2. DataLad installation on a cluster

Users of a compute cluster generally do not have administrative privileges (sudo rights) and thus cannot install software as easily as on their own, private machine. In order to get DataLad and its underlying tools installed, you can either bribe (kindly ask) your system administrator[1] or install everything for your own user only, following the instructions in the paragraph Linux-machines with no root access (e.g. HPC systems) of the installation page. If you opt for the former, your administrator can install DataLad version 0.18.4 via EasyBuild <https://github.com/easybuilders>, which is a tool for building software reproducibly and is common on clusters that use a module system. The caveat this introduces, of course, is that you will need to load the module every time you want to use DataLad on your cluster.
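The two installation routes described above might look as follows in practice. This is a hedged sketch: the exact module name and version depend on what your administrator installed, and the pip-based route assumes Python 3 and pip are available on the cluster.

```shell
# Route 1 (admin-installed via EasyBuild): load the DataLad module in
# each shell session before use. The module name/version is an example;
# list what is actually available first.
module avail datalad
module load datalad

# Route 2 (user-only install, no root required): install into your home
# directory with pip's --user scheme.
python3 -m pip install --user datalad

# --user installs land in ~/.local/bin, which may not be on your PATH:
export PATH="$HOME/.local/bin:$PATH"

# Verify the installation either way:
datalad --version
```

A `module load` line can be added to your shell startup file (e.g. `~/.bashrc`) or to job submission scripts so that DataLad is available in batch jobs as well.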