An automatically and computationally reproducible neuroimaging analysis from scratch

This use case sketches the basics of a portable analysis that can be automatically computationally reproduced, starting from the acquisition of a neuroimaging dataset with a magnetic resonance imaging (MRI) scanner up to complete data analysis results:

  1. Two extension packages, datalad-container and datalad-neuroimaging extend DataLad’s functionality with the ability to work with computational containers and neuroimaging data workflows.

  2. The analysis is conducted in a way that leaves comprehensive provenance (including software environments) all the way from the raw data, and structures study components in a way that facilitates reuse.

  3. It starts with preparing a raw data (dicom) dataset, and subsequently uses the prepared data for a general linear model (GLM) based analysis.

  4. After completion, data and results are archived, and disk usage of the dataset is maximally reduced.

This use case is adapted from the ReproIn/DataLad tutorial by Michael Hanke and Yaroslav Halchenko, given at the 2018 OHBM training course ran by ReproNim.

The Challenge

Allan is an exemplary neuroscientist and researcher. He has spent countless hours diligently learning not only the statistical methods for his research questions and the software tools for his computations, but also taught himself about version control and data standards in neuroimaging, such as the brain imaging data structure (BIDS). For his final PhD project, he patiently acquires functional MRI data of many participants, and prepares it according to the BIDS standard afterwards. It takes him a full week of time and two failed attempts, but he eventually has a BIDS-compliant dataset.

When he writes his analysis scripts he takes extra care to responsibly version control every change. He happily notices how much cleaner his directories are, and how he and others can transparently see how his code evolved. Once everything is set up, he runs his analysis using large and complex neuroscientific software packages that he installed on his computer a few years back. Finally, he writes a paper and publishes his findings in a prestigious peer-reviewed journal. His data and code can be accessed by others easily, as he makes them publicly available. Colleagues and supervisors admire him for his wonderful contribution to open science.

However, a few months after publication, Allan starts to get emails by that report that his scripts do not produce the same results as the ones reported in the publication. Startled and confused he investigates what may be the issue. After many sleepless nights he realizes: The software he used was fairly old! More recent versions of the same software compute results slightly different, changed function’s names, or fixed discovered bugs in the underlying source code. Shocked, he realizes that his scripts are even incompatible with the most recent release of the software package he used and throw an error. Luckily, he can quickly fix this by adding information about the required software versions to the README of his project, and he is grateful for colleagues and other scientists that provide adjusted versions of his code for more recent software releases. In the end, his results prove to be robust regardless of software version. But while Allen shared code and data, not including any information about his software environment prevented his analysis from becoming computationally reproducible.

The DataLad Approach

Even if an analysis workflow is fully captured and version-controlled, and data and code are being linked, an analysis may not reproduce. Comprehensive computational reproducibility requires that also the software involved in an analysis and its precise versions need to be known. DataLad can help with this. Using the datalad-containers extension, complete software environments can be captured in computational containers, added to (and thus shared together with) datasets, and linked with commands and outputs they were used for.

Step-by-Step

The first part of this Step-by-Step guide details how to computationally reproducibly and automatically reproducibly perform data preparation from raw DICOM files to BIDS-compliant NIfTI images. The actual analysis, a first-level GLM for a localization task, is performed in the second part. A final paragraph shows how to prepare the dataset for the afterlife.

For this use case, two DataLad extensions are required:

You can install them via pip like this:

$ pip install datalad-neuroimaging datalad-container

Data Preparation

We start by creating a home for the raw data:

$ datalad create localizer_scans
$ cd localizer_scans
[INFO] Creating a new annex repo at /home/me/usecases/repro2/localizer_scans
create(ok): /home/me/usecases/repro2/localizer_scans (dataset)

For this example, we use a number of publicly available DICOM files. Luckily, at the time of data acquisition, these DICOMs were already equipped with the relevant metadata: Their headers contain all necessary information to identify the purpose of individual scans and encode essential properties to create a BIDS compliant dataset from them. The DICOMs are stored on Github (as a Git repository1), so they can be installed as a subdataset. As they are the raw inputs of the analysis, we store them in a directory we call inputs/raw.

$ datalad clone --dataset . \
 https://github.com/datalad/example-dicom-functional.git  \
 inputs/rawdata
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/localizer_scans/inputs/rawdata)
[INFO] Attempting to clone from https://github.com/datalad/example-dicom-functional.git to /home/me/usecases/repro2/localizer_scans/inputs/rawdata
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/localizer_scans/inputs/rawdata)
install(ok): inputs/rawdata (dataset)
add(ok): inputs/rawdata (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)

The datalad subdatasets reports the installed dataset to be indeed a subdataset of the superdataset localizer_scans:

$ datalad subdatasets
subdataset(ok): inputs/rawdata (dataset)

Given that we have obtained raw data, this data is not yet ready for data analysis. Prior to performing actual computations, the data needs to be transformed into appropriate formats and standardized to an intuitive layout. For neuroimaging, a useful transformation is a transformation from DICOMs into the NIfTI format, a format specifically designed for scientific analyses of brain images. An intuitive layout is the BIDS standard. Performing these transformations and standardizations, however, requires software. For the task at hand, HeudiConv, a DICOM converter, is our software of choice. Beyond converting DICOMs, it also provides assistance in converting a raw data set to the BIDS standard, and it integrates with DataLad to place converted and original data under Git/Git-annex version control, while automatically annotating files with sensitive information (e.g., non-defaced anatomicals, etc).

To take extra care to know exactly what software is used both to be able to go back to it at a later stage should we have the need to investigate an issue, and to capture full provenance of the transformation process, we are using a software container that contains the relevant software setup. A ready-made singularity container is available from singularity-hub at shub://ReproNim/ohbm2018-training:heudiconvn.

Using the datalad containers-add command we can add this container to the localizer_scans superdataset. We are giving it the name heudiconv.

$ datalad containers-add heudiconv --url shub://ReproNim/ohbm2018-training:heudiconvn
[INFO] Initiating special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/usecases/repro2/localizer_scans/.datalad/environments/heudiconv/image (file)
action summary:
  add (ok: 1)
  containers_add (ok: 1)
  save (ok: 1)

The command datalad containers-list can verify that this worked:

$ datalad containers-list
heudiconv -> .datalad/environments/heudiconv/image

Great. The dataset now tracks all of the input data and the computational environment for the DICOM conversion. Thus far, we have a complete record of all components. Let’s stay transparent, but also automatically reproducible in the actual data conversion by wrapping the necessary heudiconv command seen below:

$ heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . \
 -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms

within a datalad containers-run command. To save time, we will only transfer one subjects data (sub-02, hence the subject identifier -s 02 in the command). Note that the output below is how it indeed should look like – the software we are using in this example produces very wordy output.

$ datalad containers-run -m "Convert sub-02 DICOMs into BIDS" \
  --container-name heudiconv \
  'heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms'
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
INFO: Running heudiconv version 0.5.2-dev
INFO: Analyzing 5460 dicoms
INFO: Filtering out 0 dicoms based on their filename
WARNING: dcmstack without support of pydicom >= 1.0 is detected. Adding a plug
INFO: Generated sequence info for 1 studies with 1 entries total
INFO: Processing sequence infos to deduce study/session
INFO: Study session for {'locator': 'Hanke/Stadler/0083_transrep2', 'session': None, 'subject': '02'}
INFO: Need to process 1 study sessions
INFO: PROCESSING STARTS: {'outdir': '/tmp/heudiconv.sub-02/', 'session': None, 'subject': '02'}
INFO: Processing 1 pre-sorted seqinfo entries
INFO: Processing 1 seqinfo entries
INFO: Doing conversion using dcm2niix
INFO: Converting ./sub-02/func/sub-02_task-oneback_run-01_bold (5460 DICOMs) -> ./sub-02/func . Converter: dcm2niix . Output types: ('nii.gz', 'dicom')
INFO: Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
220413-10:40:47,152 nipype.workflow INFO:
	 [Node] Setting-up "convert" in "/tmp/dcm2niixzov3_x0n/convert".
INFO: [Node] Setting-up "convert" in "/tmp/dcm2niixzov3_x0n/convert".
220413-10:40:48,206 nipype.workflow INFO:
	 [Node] Running "convert" ("nipype.interfaces.dcm2nii.Dcm2niix"), a CommandLine Interface with command:
dcm2niix -b y -z y -x n -t n -m n -f func -o . -s n -v n /tmp/dcm2niixzov3_x0n/convert
INFO: [Node] Running "convert" ("nipype.interfaces.dcm2nii.Dcm2niix"), a CommandLine Interface with command:
dcm2niix -b y -z y -x n -t n -m n -f func -o . -s n -v n /tmp/dcm2niixzov3_x0n/convert
220413-10:40:50,350 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:Chris Rorden's dcm2niiX version v1.0.20180622 GCC6.3.0 (64-bit Linux)
INFO: stdout 2022-04-13T10:40:50.350426:Chris Rorden's dcm2niiX version v1.0.20180622 GCC6.3.0 (64-bit Linux)
220413-10:40:50,350 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:Found 5460 DICOM file(s)
INFO: stdout 2022-04-13T10:40:50.350426:Found 5460 DICOM file(s)
220413-10:40:50,350 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:swizzling 3rd and 4th dimensions (XYTZ -> XYZT), assuming interslice distance is 3.300000
INFO: stdout 2022-04-13T10:40:50.350426:swizzling 3rd and 4th dimensions (XYTZ -> XYZT), assuming interslice distance is 3.300000
220413-10:40:50,350 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:Warning: Images sorted by instance number  [0020,0013](1..5460), but AcquisitionTime [0008,0032] suggests a different order (160423..160223)
INFO: stdout 2022-04-13T10:40:50.350426:Warning: Images sorted by instance number  [0020,0013](1..5460), but AcquisitionTime [0008,0032] suggests a different order (160423..160223)
220413-10:40:50,350 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:Using RWVSlope:RWVIntercept = 4.00757:0
INFO: stdout 2022-04-13T10:40:50.350426:Using RWVSlope:RWVIntercept = 4.00757:0
220413-10:40:50,351 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426: Philips Scaling Values RS:RI:SS = 4.00757:0:0.0132383 (see PMC3998685)
INFO: stdout 2022-04-13T10:40:50.350426: Philips Scaling Values RS:RI:SS = 4.00757:0:0.0132383 (see PMC3998685)
220413-10:40:50,351 nipype.interface INFO:
	 stdout 2022-04-13T10:40:50.350426:Convert 5460 DICOM as ./func (80x80x35x156)
INFO: stdout 2022-04-13T10:40:50.350426:Convert 5460 DICOM as ./func (80x80x35x156)
220413-10:40:51,205 nipype.interface INFO:
	 stdout 2022-04-13T10:40:51.205022:compress: "/usr/bin/pigz" -n -f -6 "./func.nii"
INFO: stdout 2022-04-13T10:40:51.205022:compress: "/usr/bin/pigz" -n -f -6 "./func.nii"
220413-10:40:51,205 nipype.interface INFO:
	 stdout 2022-04-13T10:40:51.205022:Conversion required 2.930881 seconds (2.127429 for core code).
INFO: stdout 2022-04-13T10:40:51.205022:Conversion required 2.930881 seconds (2.127429 for core code).
220413-10:40:51,391 nipype.workflow INFO:
	 [Node] Finished "convert".
INFO: [Node] Finished "convert".
INFO: Populating template files under ./
INFO: PROCESSING DONE: {'outdir': '/tmp/heudiconv.sub-02/', 'session': None, 'subject': '02'}
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/localizer_scans (dataset) [singularity exec -B /home/me/usecases/re...]
add(ok): CHANGES (file)
add(ok): README (file)
add(ok): dataset_description.json (file)
add(ok): participants.tsv (file)
add(ok): sourcedata/README (file)
add(ok): sourcedata/sub-02/func/sub-02_task-oneback_run-01_bold.dicom.tgz (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_bold.json (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
add(ok): sub-02/sub-02_scans.tsv (file)
add(ok): task-oneback_bold.json (file)
save(ok): . (dataset)
action summary:
  add (ok: 11)
  get (notneeded: 1)
  run (ok: 1)
  save (notneeded: 1, ok: 1)

Find out what changed after this command by comparing the most recent commit by DataLad (i.e., HEAD) to the previous one (i.e., HEAD~1) with datalad diff:

$ datalad diff -f HEAD~1
    added: CHANGES (symlink)
    added: README (symlink)
    added: dataset_description.json (symlink)
    added: participants.tsv (symlink)
    added: sourcedata/README (symlink)
    added: sourcedata/sub-02/func/sub-02_task-oneback_run-01_bold.dicom.tgz (symlink)
    added: sub-02/func/sub-02_task-oneback_run-01_bold.json (symlink)
    added: sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (symlink)
    added: sub-02/func/sub-02_task-oneback_run-01_events.tsv (symlink)
    added: sub-02/sub-02_scans.tsv (symlink)
    added: task-oneback_bold.json (symlink)

As expected, DICOM files of one subject were converted into NIfTI files, and the outputs follow the BIDS standard’s layout and naming conventions! But what’s even better is that this action and the relevant software environment was fully recorded.

There is only one thing missing before the functional imaging data can be analyzed: A stimulation protocol, so that we know what stimulation was done at which point during the scan. Thankfully, the data was collected using an implementation that exported this information directly in the BIDS events.tsv format. The file came with our DICOM dataset and can be found at inputs/rawdata/events.tsv. All we need to do is copy it to the right location under the BIDS-mandated name. To keep track of where this file came from, we will also wrap the copying into a datalad run command. The {inputs} and {outputs} placeholders can help to avoid duplication in the command call:

$ datalad run -m "Import stimulation events" \
  --input inputs/rawdata/events.tsv \
  --output sub-02/func/sub-02_task-oneback_run-01_events.tsv \
  cp {inputs} {outputs}
[INFO] Making sure inputs are available (this may take some time)
unlock(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/localizer_scans (dataset) [cp 'inputs/rawdata/events.tsv' 'sub-02/f...]
add(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
save(ok): . (dataset)

git log shows what information DataLad captured about this command’s execution:

$ git log -n 1
commit fdb48651884f37ed35dfedf482f4d9be330d4f73
Author: Elena Piscopia <elena@example.net>
Date:   Wed Apr 13 12:41:11 2022 +0200

    [DATALAD RUNCMD] Import stimulation events

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "cp '{inputs}' '{outputs}'",
     "dsid": "9d3a8e3b-aa03-4c6f-972c-a78739009a5e",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "inputs/rawdata/events.tsv"
     ],
     "outputs": [
      "sub-02/func/sub-02_task-oneback_run-01_events.tsv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

Analysis execution

Since the raw data are reproducibly prepared in BIDS standard we can now go further and conduct an analysis. For this example, we will implement a very basic first-level GLM analysis using FSL that takes only a few minutes to run. As before, we will capture all provenance: inputs, computational environments, code, and outputs.

Following the YODA principles2, the analysis is set up in a new dataset, with the input dataset localizer_scans as a subdataset:

# get out of localizer_scans
$ cd ../

$ datalad create glm_analysis
$ cd glm_analysis
[INFO] Creating a new annex repo at /home/me/usecases/repro2/glm_analysis
create(ok): /home/me/usecases/repro2/glm_analysis (dataset)

We install localizer_scans by providing its path as a --source to datalad install:

$ datalad clone -d . \
  ../localizer_scans \
  inputs/rawdata
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
[INFO] Attempting to clone from ../localizer_scans to /home/me/usecases/repro2/glm_analysis/inputs/rawdata
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
install(ok): inputs/rawdata (dataset)
add(ok): inputs/rawdata (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)

datalad subdatasets reports the number of installed subdatasets again:

$ datalad subdatasets
subdataset(ok): inputs/rawdata (dataset)

We almost forgot something really useful: Structuring the dataset with the help of DataLad! Luckily, procedures such as yoda can not only be applied upon creating of a dataset (as in Create a dataset), but also with the run-procedure command (as in Configurations to go)

$ datalad run-procedure cfg_yoda
subdataset(ok): inputs/rawdata (dataset)
subdataset(ok): inputs/rawdata (dataset)
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [/home/adina/env/handbook2/bin/python /ho...]

The analysis obviously needs custom code. For the simple GLM analysis with FSL we use:

  1. A small script to convert BIDS-formatted events.tsv files into the EV3 format FSL understands, available at https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh

  2. An FSL analysis configuration template script, available at https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf

These script should be stored and tracked inside the dataset within code/. The datalad download-url command downloads these scripts and records where they were obtained from:

$ datalad download-url  --path code/ \
  https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh \
  https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf
[INFO] Downloading 'https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh' into '/home/me/usecases/repro2/glm_analysis/code/'
download_url(ok): /home/me/usecases/repro2/glm_analysis/code/events2ev3.sh (file)
[INFO] Downloading 'https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf' into '/home/me/usecases/repro2/glm_analysis/code/'
download_url(ok): /home/me/usecases/repro2/glm_analysis/code/ffa_design.fsf (file)
add(ok): code/events2ev3.sh (file)
add(ok): code/ffa_design.fsf (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  download_url (ok: 2)
  save (ok: 1)

The commit message that DataLad created shows the URL where each script has been downloaded from:

$ git log -n 1
commit 4ebfb362bb46877ad3a9b5f472a9ef4ae7fdd73d
Author: Elena Piscopia <elena@example.net>
Date:   Wed Apr 13 12:41:21 2022 +0200

    [DATALAD] Download URLs

    URLs:
      https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh
      https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf

Prior to the actual analysis, we need to run the events2ev3.sh script to transform inputs into the format that FSL expects. The datalad run makes this maximally reproducible and easy, as the files given as --inputs and --outputs are automatically managed by DataLad.

$ datalad run -m 'Build FSL EV3 design files' \
  --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv \
  --output 'sub-02/onsets' \
  bash code/events2ev3.sh sub-02 {inputs}
[INFO] Making sure inputs are available (this may take some time)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv (file) [from origin...]
[INFO] == Command start (output follows) =====
sub-02
1
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash code/events2ev3.sh sub-02 'inputs/r...]
add(ok): sub-02/onsets/run-1/body.txt (file)
add(ok): sub-02/onsets/run-1/face.txt (file)
add(ok): sub-02/onsets/run-1/house.txt (file)
add(ok): sub-02/onsets/run-1/object.txt (file)
add(ok): sub-02/onsets/run-1/scene.txt (file)
add(ok): sub-02/onsets/run-1/scramble.txt (file)
save(ok): . (dataset)

The dataset now contains and manages all of the required inputs, and we’re ready for FSL. Since FSL is not a simple program, we make sure to record the precise software environment for the analysis with datalad containers-run. First, we get a container with FSL in the version we require:

$ datalad containers-add fsl --url shub://mih/ohbm2018-training:fsl
[INFO] Initiating special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/usecases/repro2/glm_analysis/.datalad/environments/fsl/image (file)
action summary:
  add (ok: 1)
  containers_add (ok: 1)
  save (ok: 1)

As the analysis setup is now complete, let’s label this state of the dataset:

$ datalad save --version-tag ready4analysis
save(ok): . (dataset)

All we have left is to configure the desired first-level GLM analysis with FSL. At this point, the template contains placeholders for the basepath and the subject ID, and they need to be replaced. The following command uses the arcane, yet powerful sed editor to do this. We will again use datalad run to invoke our command so that we store in the history how this template was generated (so that we may audit, alter, or regenerate this file in the future — fearlessly).

$ datalad run \
 -m "FSL FEAT analysis config script" \
 --output sub-02/1stlvl_design.fsf \
 bash -c 'sed -e "s,##BASEPATH##,{pwd},g" -e "s,##SUB##,sub-02,g" \
 code/ffa_design.fsf > {outputs}'
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash -c 'sed -e "s,##BASEPATH##,/home/me...]
add(ok): sub-02/1stlvl_design.fsf (file)
save(ok): . (dataset)

To compute the analysis, a simple feat sub-02/1stlvl_design.fsf command is wrapped into a datalad containers-run command with appropriate --input and --output specification:

$ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
  --input sub-02/1stlvl_design.fsf \
  --input sub-02/onsets \
  --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
  --output sub-02/1stlvl_glm.feat \
  feat {inputs[0]}
[INFO] Making sure inputs are available (this may take some time)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [from origin...]
[INFO] == Command start (output follows) =====
To view the FEAT progress and final report, point your web browser at /home/me/usecases/repro2/glm_analysis/sub-02/1stlvl_glm.feat/report_log.html
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [singularity exec -B /home/me/usecases/re...]
add(ok): sub-02/1stlvl_glm.feat/tsplot/tsplotc_zstat1p.png (file)
save(ok): . (dataset)
action summary:
  add (ok: 344)
  get (notneeded: 4, ok: 1)
  run (ok: 1)
  save (notneeded: 1, ok: 1)

Once this command finishes, DataLad will have captured the entire FSL output, and the dataset will contain a complete record all the way from the input BIDS dataset to the GLM results. The BIDS subdataset in turn has a complete record of all processing down from the raw DICOMs onwards.

Many files need more planning

See how many files were created and added in this computation of a single participant? If your study has many participants, analyses like the one above could inflate your dataset. Please check out the chapter Go big or go home. in particular the section Calculate in greater numbers for tips and tricks on how to create analyses datasets that scale.

Archive data and results

After study completion it is important to properly archive data and results, for example for future inquiries by reviewers or readers of the associated publication. Thanks to the modularity of the study units, this tasks is easy and avoids needless duplication.

The raw data is tracked in its own dataset (localizer_scans) that only needs to be archived once, regardless of how many analysis are using it as input. This means that we can “throw away” this subdataset copy within this analysis dataset. DataLad can re-obtain the correct version at any point in the future, as long as the recorded location remains accessible.

To make sure we’re not deleting information we are not aware of, datalad diff and git log can help to verify that the subdataset is in the same state as when it was initially added:

$ datalad diff -- inputs

The command does not show any output, thus indicating that there is indeed no difference. git log confirms that the only action that was performed on inputs/ was the addition of it as a subdataset:

$ git log -- inputs
commit 4f06fffe8ad7bbbab18eaaba5944c26e5bbb6f74
Author: Elena Piscopia <elena@example.net>
Date:   Wed Apr 13 12:41:15 2022 +0200

    [DATALAD] Added subdataset

Since the state of the subdataset is exactly the state of the original localizer_scans dataset it is safe to uninstall it.

$ datalad uninstall --dataset . inputs --recursive
drop(ok): inputs/rawdata (key)
drop(ok): inputs/rawdata (key)
uninstall(ok): inputs/rawdata (dataset)
action summary:
  drop (notneeded: 1, ok: 2)
  uninstall (ok: 1)

Prior to archiving the results, we can go one step further and verify their computational reproducibility. DataLad’s rerun command is capable of “replaying” any recorded command. The following command re-executes the FSL analysis by re-running everything since the dataset was tagged as ready4analysis). It will record the recomputed results in a separate Git branch named verify. Afterwards, we can automatically compare these new results to the original ones in the master branch. We will see that all outputs can be reproduced in bit-identical form. The only changes are observed in log files that contain volatile information, such as time steps.

$ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
[INFO] checkout commit 70749c5;
[INFO] run commit 4dc7c45; (FSL FEAT analysis...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash -c 'sed -e "s,##BASEPATH##,/home/me...]
add(ok): sub-02/1stlvl_design.fsf (file)
save(ok): . (dataset)
[INFO] run commit b8fcebf; (sub-02 1st-level GLM)
[INFO] Making sure inputs are available (this may take some time)
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
[INFO] Attempting to clone from ../localizer_scans to /home/me/usecases/repro2/glm_analysis/inputs/rawdata
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [from origin...]
[INFO] == Command start (output follows) =====
To view the FEAT progress and final report, point your web browser at /home/me/usecases/repro2/glm_analysis/sub-02/1stlvl_glm.feat/report_log.html
[INFO] == Command exit (modification check follows) =====
add(ok): sub-02/1stlvl_glm.feat/tsplot/tsplotc_zstat1p.png (file)
save(ok): . (dataset)
action summary:
  add (ok: 345)
  get (notneeded: 4, ok: 1)
  run (ok: 2)
  save (notneeded: 1, ok: 2)
# check that we are now on the new `verify` branch
$ git branch
  git-annex
  master
* verify
# compare which files have changes with respect to the original results
$ git diff master --stat
 sub-02/1stlvl_glm.feat/logs/feat0                                      | 2 +-
 sub-02/1stlvl_glm.feat/logs/{feat0_init.e582363 => feat0_init.e588071} | 0
 sub-02/1stlvl_glm.feat/logs/{feat0_init.o582363 => feat0_init.o588071} | 0
 sub-02/1stlvl_glm.feat/logs/feat1                                      | 2 +-
 sub-02/1stlvl_glm.feat/logs/{feat2_pre.e582446 => feat2_pre.e588161}   | 0
 sub-02/1stlvl_glm.feat/logs/{feat2_pre.o582446 => feat2_pre.o588161}   | 0
 sub-02/1stlvl_glm.feat/logs/{feat3_film.e582954 => feat3_film.e588680} | 0
 sub-02/1stlvl_glm.feat/logs/{feat3_film.o582954 => feat3_film.o588680} | 0
 sub-02/1stlvl_glm.feat/logs/{feat4_post.e583460 => feat4_post.e589246} | 0
 sub-02/1stlvl_glm.feat/logs/{feat4_post.o583460 => feat4_post.o589246} | 0
 sub-02/1stlvl_glm.feat/logs/{feat5_stop.e584094 => feat5_stop.e589883} | 0
 sub-02/1stlvl_glm.feat/logs/{feat5_stop.o584094 => feat5_stop.o589883} | 0
 sub-02/1stlvl_glm.feat/report.html                                     | 2 +-
 sub-02/1stlvl_glm.feat/report_log.html                                 | 2 +-
 14 files changed, 4 insertions(+), 4 deletions(-)
# switch back to the master branch and remove the `verify` branch
$ git checkout master
$ git branch -D verify
Switched to branch 'master'
Deleted branch verify (was 62ede95).

The outcome of this usecase can be found as a dataset on Github here.

Footnotes

1

“Why can such data exist as a Git repository, shouldn’t large files be always stored outside of Git?” you may ask. The DICOMs exist in a Git-repository for a number of reasons: First, it makes them easily available for demonstrations and tutorials without involving DataLad at all. Second, the DICOMs are comparatively small: 21K per file. Importantly, the repository is not meant to version control those files and future states or derivatives and results obtained from them – this would bring a Git repositories to its knees.

2

To re-read everything about the YODA principles, checkout out section YODA: Best practices for data analyses in a dataset.