OHBM 2020 Open Science Room: Reproducible Research Objects with DataLad

This is a detailed write-up of the OHBM Open Science Room demonstration on Reproducible Research Objects with DataLad. You can find the slides here. If you attended the talk and are looking for resources to read up on at your own pace, then you’re in the right place. You will find the code snippets executed in the talk together with detailed background descriptions and references to the appropriate handbook chapters. If you’re new to DataLad and want to find out more, the DataLad handbook could be exactly what you need.

Workflow description

The software demonstration at the OHBM Open Science Room is part of the open workflows theme. Therefore, this demonstration walks through a complete data consumption, data analysis, and data publication routine.

Workflow software requirements

This workflow requires DataLad version 0.13 or higher. In particular, this version requirement stems from the remote indexed archive (RIA) store used in this demonstration. You can find installation instructions in the section Installation and configuration.
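
Whether an installed DataLad meets this requirement can be checked before starting. The sketch below is one possible way to do so in a POSIX shell. The helper name version_at_least is made up for this example and is not part of DataLad, and the guarded call assumes that datalad --version prints the version number as the last word of its output.

```shell
# Hypothetical helper (not part of DataLad): succeeds if version $1 >= version $2.
# Components are compared numerically, so 0.9 counts as older than 0.13.
version_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        n = split(have, h, "[.]")
        m = split(need, r, "[.]")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = h[i] + 0   # missing components count as 0
            b = r[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Only query DataLad if it is installed.
# Assumption: the version number is the last word printed by `datalad --version`.
if command -v datalad >/dev/null 2>&1; then
    installed=$(datalad --version 2>&1 | awk '{print $NF; exit}')
    if version_at_least "$installed" 0.13; then
        echo "DataLad $installed is recent enough for RIA stores"
    else
        echo "DataLad $installed is too old, please upgrade to 0.13 or higher"
    fi
fi
```

Comparing component by component avoids the pitfall of a plain string comparison, which would rank 0.9 above 0.13.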

Step 1: Setting up a dataset

When using DataLad, everything happens in datasets, DataLad’s core data structure. You can find out more about datasets in the section DataLad datasets.

In short, a dataset is a directory on your computer that is managed by DataLad and is, at the same time, a Git/git-annex repository. Datasets can

  • version control their contents,

  • be shared and installed, and

  • be nested in order to link them.

The datalad create <path> command (manual) creates a new dataset:

$ datalad create OSRdemo
  [INFO   ] Creating a new annex repo at /home/adina/scratch/OSRdemo
  create(ok): /home/adina/scratch/OSRdemo (dataset)

When using DataLad for analysis projects, a set of principles called “The YODA principles” can help to create a modular project structure. Those principles are detailed and explained in the section YODA: Best practices for data analyses in a dataset. Typical analysis datasets will hold results and code, consume or create so-called “toolbox datasets” with containerized and appropriately configured analysis pipelines, and have all relevant datasets linked to the analysis as subdatasets.

Step 2: Linking data

One of the analysis components for this and most other workflows is data. DataLad makes it easy to “install” data as if it were software, and the DataLad 0.13 release adds some even more exciting features for data consumption.

For example, the Human Connectome Project (HCP) data now exists as a DataLad dataset on GitHub. You can find and install it at github.com/datalad-datasets/human-connectome-project-openaccess. If you are interested in the creation of this dataset, the use case Scaling up: Managing 80TB and 15 million files from the HCP release covers the details. Beyond access to the full HCP data, subsets of the HCP data are also being created and transformed into BIDS-like formats, and the newly introduced feature of RIA stores makes it possible to install these HCP data subsets in specific versions, for example BIDS-formatted. You can read up on this new feature in the section Remote indexed archives for dataset storage and backup. Here is how to install the “structural preprocessed” subset of the HCP dataset, which has been transformed into a BIDS-like format, from a public DataLad RIA store into a directory called .source:

$ datalad clone -d . 'ria+https://store.datalad.org#~hcp-structural-preprocessed@bids' .source
[INFO   ] Configured RIA store not found at ria+file:///ds/hcp/srv
install(ok): .source (dataset)
add(ok): .source (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)
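
The RIA URL used in the clone call bundles three pieces of information: the store location, the dataset (here given as an alias, marked by the leading ~), and a version identifier. As a rough illustration, the following sketch takes the URL apart with plain shell parameter expansion. It only mirrors the visible structure of the URL and is not how DataLad itself parses it; the variable names are chosen for this example.

```shell
url='ria+https://store.datalad.org#~hcp-structural-preprocessed@bids'

store=${url%%#*}          # everything before the first '#'
store=${store#ria+}       # drop the 'ria+' prefix to get the plain store URL
fragment=${url#*#}        # everything after the '#': dataset and optional version
dataset=${fragment%%@*}   # part before '@': dataset ID, or an alias if it starts with '~'

# The version part is optional, so only extract it if an '@' is present.
case $fragment in
    *@*) version=${fragment#*@} ;;
    *)   version='' ;;
esac

echo "store:   $store"
echo "dataset: $dataset"
echo "version: $version"
```

For the URL above, this prints https://store.datalad.org as the store, ~hcp-structural-preprocessed as the dataset, and bids as the version.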

Why is it called “.source”?

By installing data into a hidden directory (anything that starts with a .), the input data is linked, but doesn’t show up at first sight when browsing the dataset. This is not a requirement, but helpful if you want to be extraordinarily organized.
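
The effect of the leading dot can be tried out without DataLad. The sketch below builds a throwaway directory whose layout only mimics a dataset; the directory names are illustrative.

```shell
# Throwaway directory that mimics a dataset layout (names are illustrative only).
demo=$(mktemp -d)
mkdir "$demo/.source" "$demo/code"

ls "$demo"       # plain listing omits names that start with a dot
ls -A "$demo"    # -A also lists hidden entries such as .source
```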

Step 3: Space for outputs

To later link code execution, input data, and results, the results should be saved in the analysis dataset, i.e., in the top-level dataset OSRdemo. If the number of files becomes too large (beyond ~100-200k files), datasets can struggle, though. Large-scale analyses thus may require splitting outputs across subdatasets in order to create several smaller-sized datasets. More information on this can be found in the chapter Go big or go home. Here, in anticipation of up to 500,000 files, two output subdatasets are used to collect the results.

Usually, you can just create a new, empty dataset with datalad create as in datalad create -d <root-to-dataset> <path>. In the demonstration, empty, pre-created datasets (fmriprep for fmriprep outputs, freesurfer for freesurfer outputs) are installed from GIN.

# Note: don't execute these lines - the datasets are not empty anymore!
$ datalad clone -d . git@gin.g-node.org:/adswa/OSRfmriprep.git fmriprep
$ datalad clone -d . git@gin.g-node.org:/adswa/OSRfreesurfer.git freesurfer

Why install empty datasets?

The choice to install empty datasets was a convenience hack for easy publishing routines. At the end of the workflow, a recursive datalad push (manual) was able to publish all results and the complete hierarchy of datasets in one go. This was only so easy because the installed fmriprep and freesurfer subdatasets already had a sibling configuration pointing to GIN. Had the subdatasets been created from scratch, each subdataset would have required setting up a sibling beforehand, in the same way it was done with the top-level dataset. You can read more about this in the section Walk-through: Dataset hosting on GIN, in particular the paragraph Subdataset publishing.
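
For comparison, the from-scratch route described above might look roughly like the following dry run. It only prints the commands instead of executing them, so it runs without DataLad. The helper plan_subdataset and the GIN user name are placeholders invented for this sketch, and each repository would have to exist on GIN before the printed commands could succeed.

```shell
# Dry run: print, rather than execute, the commands for the from-scratch route.
# GIN_USER and the repository names are placeholders, not real repositories.
GIN_USER=${GIN_USER:-yourname}

plan_subdataset() {
    # $1: subdataset path inside the analysis dataset, $2: repository name on GIN
    echo "datalad create -d . $1"
    echo "datalad siblings add -d $1 --name origin --url git@gin.g-node.org:/$GIN_USER/$2.git"
}

plan_subdataset fmriprep OSRfmriprep
plan_subdataset freesurfer OSRfreesurfer
```

Each subdataset needs its own create and siblings add step, which is the extra work that installing pre-created datasets avoided.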

Step 4: Linking software

Containerized pipelines can be linked to datasets. One can either create such a toolbox dataset from scratch, or consume pre-existing containers from github.com/ReproNim/containers. This ReproNim Webinar walks through a complete workflow, if you are interested in more details on how to use them.

Here is how to create a custom fmriprep toolbox:

First, create a dataset for it:

$ datalad create -c text2git fmriprep_toolbox

Because fmriprep requires a freesurfer license file, you can add this file to your dataset:

$ cd fmriprep_toolbox
$ cp ../license.txt .
$ datalad save -m "add freesurfer license file"

Note that due to the text2git configuration of the dataset, which stores text files in Git rather than in git-annex, this file will be available right away after this dataset is cloned and does not need to be retrieved in an extra step. To understand this, the chapter Under the hood: git-annex is very much recommended, and to find out more about procedures such as text2git, read on in the section Configurations to go.

Finally, a Docker or Singularity container that contains the pipeline or required software can be added. Note that this requires the DataLad extension datalad-container. A demonstration can be found in the section Computational reproducibility with software containers. The pipeline can be configured with a custom call specification via the --call-fmt option. In this demonstration, the call format sets up bind mounts programmatically: whenever this toolbox is installed as a subdataset called .tools, running the container in this dataset will bind-mount the parent directory (i.e., the analysis dataset) and the license file inside of the toolbox. Note that the placeholders {img} and {cmd} will expand to the container image and the command line call given to datalad containers-run (manual).

$ datalad containers-add fmriprep \
--url shub://ReproNim/containers:bids-fmriprep--20.1.1 \
--call-fmt 'singularity run --cleanenv -B $PWD,$PWD/.tools/license.txt {img} {cmd}'
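
To make the role of the two placeholders concrete, the following sketch mimics the substitution in plain shell. It is an illustration only and not how datalad-container implements it; the image path and the shortened fmriprep call are assumed example values.

```shell
# Call format as it would be registered with --call-fmt. The single quotes keep
# $PWD literal here, so it is only resolved once the final command is executed.
call_fmt='singularity run --cleanenv -B $PWD,$PWD/.tools/license.txt {img} {cmd}'

# Assumed example values: an image location and a shortened fmriprep call.
img='.tools/.datalad/environments/fmriprep/image'
cmd='.source . participant --participant-label 170631'

# Replace both placeholders; '|' is the sed delimiter because the values contain '/'.
expanded=$(printf '%s\n' "$call_fmt" | sed -e "s|{img}|$img|" -e "s|{cmd}|$cmd|")
echo "$expanded"
```

The printed line is the full container invocation: the fixed singularity options and bind mounts, followed by the image and then the command passed through to the pipeline.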

This toolbox dataset can be added as a subdataset of the analysis, just like data. In the code block below, the dataset is installed from a local path.

$ datalad clone -d . ~/fmriprep_toolbox .tools

Could I share such a toolbox dataset?

Note that unlike the fmriprep and freesurfer subdatasets, this subdataset of OSRdemo will not be pushed anywhere public when the results are published later. This is because it does not have a sibling on Gin or a similar web-based hosting service. As this dataset also contains a personal license file, it isn’t intended for publication anywhere. This toolbox dataset serves an individual user or a group of users on shared infrastructure as a standard fmriprep analysis toolbox. In order to share such a toolbox, create a sibling for the dataset on a repository-hosting service like GitHub or Gin, make sure that no personal files are included, and publish as demonstrated later in this workflow.

Step 5: Running an analysis

With the containerized pipeline and the data set up, the datalad containers-run command can be used to execute fmriprep preprocessing on the data. This command will retrieve and use the container linked inside the toolbox, retrieve any input data that is specified with the --input flag(s), and execute the final command inside of the container (the {cmd} placeholder in the call format specification will be replaced by it). In the example below, a single subject is preprocessed:

$ datalad containers-run -n .tools/fmriprep \
    -m "preprocess exemplary subject with fmriprep" \
    --input .source/sub-170631 \
    --output fmriprep \
    --output freesurfer \
    ".source . participant --participant-label 170631 --skip-bids-validation --anat-only -w /tmp --fs-license-file .tools/license.txt"

The results will be saved into the two subdatasets because these were conveniently named after the output directories that fmriprep produces. In addition, the command will produce and save a machine-readable, re-executable run record that allows you or others to rerun the computation, for example if the data has been updated. A complete, small-stepped introduction to datalad run (manual) can be found in chapter DataLad, run!.

Step 6: Result publication

The results and their provenance (on which data, with which software, with what command line call, by whom, and when were the results created?) can be published. This demonstration focuses on public access and therefore shows a publication routine to the repository-hosting service Gin. A complete write-up on dataset publishing to Gin is in section Walk-through: Dataset hosting on GIN.

To publish a dataset, an empty repository needs to be created via Gin’s web interface. The SSH URL of this repository can be given to datalad siblings add (manual) to register this repository as a sibling.

$ datalad siblings add --name origin --url git@gin.g-node.org:/adswa/OSRdemo.git

Note that the name of the sibling is origin. This is because the two subdatasets have siblings of the same name. A recursive datalad push will traverse down the dataset hierarchy and publish datasets to their origin sibling if they have one. With a single operation from the dataset root, the superdataset and the two subdatasets that hold the results are published in one go.

$ datalad push --recursive --to origin

Step 7: Result retrieval

From GIN, you or others can now access this dataset via anonymous HTTP access, without needing an account on GIN. For this, datalad clone (manual) only needs the HTTPS URL of the dataset, as found in the web browser’s address bar. You can take a look at the data for yourself, if you want to:

$ datalad clone https://gin.g-node.org/adswa/OSRdemo
$ cd OSRdemo
# retrieve data using datalad get
$ datalad get fmriprep

Tada! We’re done!

Hopefully, this workflow gave you an idea of how DataLad can be helpful in reproducible and open workflows. It wouldn’t be surprising if you are feeling a bit overwhelmed by this dense write-up. It was a very concise walk through a large number of basic and advanced DataLad principles and commands. But if you found it intriguing and want to learn more, then stay right here in the handbook and find out more about DataLad. At the end of the “Basics” part of this book, you should have all of the knowledge you need to perform a similar workflow on your own. If you also read the linked sections from the “Advanced” and “Usecases” parts of the handbook, you will have a thorough understanding of everything that has happened in this workflow.

If you run into problems or have questions, don’t hesitate to get in touch.