An automatically reproducible analysis of public neuroimaging data

This use case sketches the basics of an analysis that can be automatically reproduced by anyone:

  1. Public open data stems from the DataLad superdataset ///.

  2. Automatic data retrieval can be ensured by using DataLad’s commands in the analysis scripts, or the --input specification of datalad run,

  3. Analyses are executed using datalad run and datalad rerun commands to capture everything relevant to reproduce the analysis.

  4. The final dataset can be kept as lightweight as possible by dropping input that can be easily re-obtained.

  5. A complete reproduction of the computation (including input retrieval), is possible with a single datalad rerun command.

This use case is a specialization of Writing a reproducible paper: It is a data analysis that requires and creates large data files, uses specialized analysis software, and is fully automated using solely DataLad commands and tools. While exact data types, analysis methods, and software mentioned in this use case belong to the scientific field of neuroimaging, the basic workflow is domain-agnostic.

The Challenge

Creating reproducible (scientific) analyses seems to require so much: One needs to share data, scripts, results, and instructions on how to use data and scripts to obtain the results. A researcher at any stage of their career can struggle to remember which script needs to be run in which order, or to create comprehensible instructions for others on where and how to obtain data and how to run which script at what point in time. This leads to failed replications, a loss of confidence in results, and major time requirements for anyone trying to reproduce others or even their own analyses.

The DataLad Approach

Scientific studies should be reproducible, and with the increasing accessibility of data, there is not much excuse for a lack of reproducibility anymore. DataLad can help with the technical aspects of reproducible science.

For neuroscientific studies, the DataLad superdataset /// provides unified access to a large amount of data. Using it to install datasets into an analysis-superdataset makes it easy to share this data together with the analysis. By ensuring that all relevant data is downloaded via datalad get via DataLad’s command line tools in the analysis scripts, or --input specifications in a datalad run, an analysis can retrieve all required inputs fully automatically during execution. Recording executed commands with datalad run allows to rerun complete analysis workflows with a single command, even if input data does not exist locally. Combining these three steps allows to share fully automatically reproducible analyses as lightweight datasets.

Step-by-Step

It always starts with a dataset:

$ datalad create -c yoda demo
[INFO] Creating a new annex repo at /home/me/usecases/repro/demo 
[INFO] Running procedure cfg_yoda 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/usecases/repro/demo (dataset)

For this demo we are using two public brain imaging datasets that were published on OpenFMRI.org, and are available from the DataLad superdataset /// (datasets.datalad.org). When installing datasets from this superdataset, we can use its abbreviation ///. The two datasets, ds000001 and ds000002, are installed into the subdirectory inputs/.

$ cd demo
$ datalad install -d . -s ///openfmri/ds000001 inputs/ds000001
[INFO] Cloning http://datasets.datalad.org/openfmri/ds000001 [1 other candidates] into '/home/me/usecases/repro/demo/inputs/ds000001' 
install(ok): inputs/ds000001 (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)
$ cd demo
$ datalad install -d . -s ///openfmri/ds000002 inputs/ds000002
[INFO] Cloning http://datasets.datalad.org/openfmri/ds000002 [1 other candidates] into '/home/me/usecases/repro/demo/inputs/ds000002' 
install(ok): inputs/ds000002 (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)

Both datasets are now registered as subdatasets, and their precise versions (e.g. in the form of the commit shasum of the lastest commit) are on record:

$ datalad --output-format '{path}: {gitshasum}' subdatasets
/home/me/usecases/repro/demo/inputs/ds000001: f7fe2e38852915e7042ca1755775fcad0ff166e5
/home/me/usecases/repro/demo/inputs/ds000002: 6b16eff0c9e8d443ee551784981ddd954f657071

DataLad datasets are fairly lightweight in size, they only contain pointers to data and history information in their minimal form. Thus, so far very little data were actually downloaded:

$ du -sh inputs/
14M	inputs/

Both datasets would actually be several gigabytes in size, once the dataset content gets downloaded:

$ datalad -C inputs/ds000001 status --annex
$ datalad -C inputs/ds000002 status --annex
130 annex'd files (2.3 GB recorded total size)
274 annex'd files (2.7 GB recorded total size)

Both datasets contain brain imaging data, and are compliant with the BIDS standard. This makes it really easy to locate particular images and perform analysis across datasets.

Here we will use a small script that performs ‘brain extraction’ using FSL as a stand-in for a full analysis pipeline. The script will be stored inside of the code/ directory that the yoda-procedure created that at the time of dataset-creation.

$ cat << EOT > code/brain_extraction.sh
# enable FSL
. /etc/fsl/5.0/fsl.sh

# obtain all inputs
datalad get \$@
# perform brain extraction
count=1
for nifti in \$@; do
   subdir="sub-\$(printf %03d \$count)"
   mkdir -p \$subdir
   echo "Processing \$nifti"
   bet \$nifti \$subdir/anat -m
   count=\$((count + 1))
done
EOT

Note that this script uses the datalad get command which automatically obtains the required files from their remote source – we will see this in action shortly.

We are saving this script in the dataset. This way, we will know exactly which code was used for the analysis. Everything inside of code/ is tracked with Git thanks to the yoda-procedure, so we can see more easily how it was edited over time. In addition, we will “tag” this state of the dataset with the tag setup_done to mark the repository state at which the analysis script was completed. This is optional, but it can help to identify important milestones more easily.

$ datalad save --version-tag setup_done -m "Brain extraction script" code/brain_extraction.sh
add(ok): code/brain_extraction.sh (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Now we can run our analysis code to produce results. However, instead of running it directly, we will run it with DataLad – this will automatically create a record of exactly how this script was executed.

For this demo we will just run it on the structural images (T1w) of the first subject (sub-01) from each dataset. The uniform structure of the datasets makes this very easy. Of course we could run it on all subjects; we are simply saving some time for this demo. While the command runs, you should notice a few things:

  1. We run this command with ‘bash -e’ to stop at any failure that may occur

  2. You’ll see the required data files being obtained as they are needed – and only those that are actually required will be downloaded (because of the appropriate --input specification of the datalad run – but as a datalad get is also included in the bash script, forgetting an --input specification would not be problem).

$ datalad run -m "run brain extract workflow" \
  --input "inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz" \
  --output "sub-*/anat" \
  bash -e code/brain_extraction.sh inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
action summary:
  get (notneeded: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO] == Command exit (modification check follows) ===== 
get(ok): inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
get(ok): inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
save(ok): . (dataset)
action summary:
  add (ok: 4)
  get (ok: 2)
  save (notneeded: 2, ok: 1)

The analysis step is done, all generated results were saved in the dataset. All changes, including the command that caused them are on record:

$ git show --stat
commit 6318a4fbd6a480bcbc8b69d937837e8c5d461455
Author: Elena Piscopia <elena@example.net>
Date:   Tue Oct 8 15:32:11 2019 +0200

    [DATALAD RUNCMD] run brain extract workflow
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash -e code/brain_extraction.sh inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz",
     "dsid": "f084b1ac-e9cf-11e9-b64e-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz"
     ],
     "outputs": [
      "sub-*/anat"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

 sub-001/anat.nii.gz      | 1 +
 sub-001/anat_mask.nii.gz | 1 +
 sub-002/anat.nii.gz      | 1 +
 sub-002/anat_mask.nii.gz | 1 +
 4 files changed, 4 insertions(+)

DataLad has enough information stored to be able to re-run a command.

On command exit, it will inspect the results and save them again, but only if they are different. In our case, the re-run yields bit-identical results, hence nothing new is saved.

$ datalad rerun
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
action summary:
  get (notneeded: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO] == Command exit (modification check follows) ===== 
unlock(ok): sub-001/anat.nii.gz (file)
unlock(ok): sub-001/anat_mask.nii.gz (file)
unlock(ok): sub-002/anat.nii.gz (file)
unlock(ok): sub-002/anat_mask.nii.gz (file)
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
action summary:
  add (ok: 4)
  get (notneeded: 2)
  save (notneeded: 3)
  unlock (notneeded: 4, ok: 4)

Now that we are done, and have checked that we can reproduce the results ourselves, we can clean up. DataLad can easily verify if any part of our input dataset was modified since we configured our analysis, using datalad diff and the tag we provided:

$ datalad diff setup_done inputs

Nothing was changed.

With DataLad with don’t have to keep those inputs around – without losing the ability to reproduce an analysis. Let’s uninstall them, and check the size on disk before and after.

$ du -sh
26M	.
$ datalad uninstall inputs/*
drop(ok): /home/me/usecases/repro/demo/inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [checking http://openneuro.s3.amazonaws.com/ds000002/ds000002_R2.0.0/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=vXK2.bQ360phhPqbVV_n6RMYqaWAy4Dg...]
drop(ok): /home/me/usecases/repro/demo/inputs/ds000002 (directory)
uninstall(ok): /home/me/usecases/repro/demo/inputs/ds000002 (dataset)
drop(ok): /home/me/usecases/repro/demo/inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [checking http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/highres001.nii.gz?versionId=8TJ17W9WInNkQPdiQ9vS7wo8ZJ9llF80...]
drop(ok): /home/me/usecases/repro/demo/inputs/ds000001 (directory)
uninstall(ok): /home/me/usecases/repro/demo/inputs/ds000001 (dataset)
action summary:
  drop (ok: 4)
  uninstall (ok: 2)
$ du -sh
3.1M	.

The dataset is substantially smaller as all inputs are gone…

$ ls inputs/*
inputs/ds000001:

inputs/ds000002:

But as these inputs were registered in the dataset when we installed them, getting them back is very easy. Only the remaining data (our code and the results) need to be kept and require a backup for long term archival. Everything else can be re-obtained as needed, when needed.

As DataLad knows everything needed about the inputs, including where to get the right version, we can re-run the analysis with a single command. Watch how DataLad re-obtains all required data, re-runs the code, and checks that none of the results changed and need saving.

$ datalad rerun
[INFO] Making sure inputs are available (this may take some time) 
[INFO] Cloning http://datasets.datalad.org/openfmri/ds000001/.git into '/home/me/usecases/repro/demo/inputs/ds000001' 
[INFO] Cloning http://datasets.datalad.org/openfmri/ds000002/.git into '/home/me/usecases/repro/demo/inputs/ds000002' 
[INFO] == Command start (output follows) ===== 
action summary:
  get (notneeded: 2)
Processing inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz
Processing inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz
[INFO] == Command exit (modification check follows) ===== 
install(ok): inputs/ds000001 (dataset) [Installed subdataset in order to get /home/me/usecases/repro/demo/inputs/ds000001]
install(ok): inputs/ds000002 (dataset) [Installed subdataset in order to get /home/me/usecases/repro/demo/inputs/ds000002]
get(ok): inputs/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
get(ok): inputs/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
unlock(ok): sub-001/anat.nii.gz (file)
unlock(ok): sub-001/anat_mask.nii.gz (file)
unlock(ok): sub-002/anat.nii.gz (file)
unlock(ok): sub-002/anat_mask.nii.gz (file)
add(ok): sub-001/anat.nii.gz (file)
add(ok): sub-001/anat_mask.nii.gz (file)
add(ok): sub-002/anat.nii.gz (file)
add(ok): sub-002/anat_mask.nii.gz (file)
action summary:
  add (ok: 4)
  get (ok: 2)
  install (ok: 2)
  save (notneeded: 3)
  unlock (notneeded: 4, ok: 4)

Reproduced!

This dataset could now be published and shared as a lightweight yet fully reproducible resource and enable anyone to replicate the exact same analysis – with a single command. Public data and reproducible execution for the win!