An automatically and computationally reproducible neuroimaging analysis from scratch¶
This use case sketches the basics of a portable analysis that can be automatically and computationally reproduced, starting from the acquisition of a neuroimaging dataset with a magnetic resonance imaging (MRI) scanner up to the complete data analysis results:
- Two extension packages, datalad-container and datalad-neuroimaging, extend DataLad's functionality with the ability to work with computational containers and neuroimaging data workflows.
- The analysis is conducted in a way that leaves comprehensive provenance (including software environments) all the way from the raw data, and structures study components in a way that facilitates reuse.
- It starts with preparing a raw data (DICOM) dataset, and subsequently uses the prepared data for a general linear model (GLM) based analysis.
- After completion, data and results are archived, and disk usage of the dataset is maximally reduced.
This use case is adapted from the ReproIn/DataLad tutorial by Michael Hanke and Yaroslav Halchenko, given at the 2018 OHBM training course run by ReproNim.
The Challenge¶
Allan is an exemplary neuroscientist and researcher. He has not only spent countless hours diligently learning the statistical methods for his research questions and the software tools for his computations, but has also taught himself about version control and data standards in neuroimaging, such as the Brain Imaging Data Structure (BIDS). For his final PhD project, he patiently acquires functional MRI data of many participants, and afterwards prepares it according to the BIDS standard. It takes him a full week and two failed attempts, but he eventually has a BIDS-compliant dataset.
When he writes his analysis scripts he takes extra care to responsibly version control every change. He happily notices how much cleaner his directories are, and how he and others can transparently see how his code evolved. Once everything is set up, he runs his analysis using large and complex neuroscientific software packages that he installed on his computer a few years back. Finally, he writes a paper and publishes his findings in a prestigious peer-reviewed journal. His data and code can be accessed by others easily, as he makes them publicly available. Colleagues and supervisors admire him for his wonderful contribution to open science.
However, a few months after publication, Allan starts to get emails reporting that his scripts do not produce the same results as those reported in the publication. Startled and confused, he investigates what the issue may be. After many sleepless nights he realizes: the software he used was fairly old! More recent versions of the same software compute results slightly differently, have renamed functions, or have fixed bugs in the underlying source code. Shocked, he realizes that his scripts are even incompatible with the most recent release of the software package he used and throw an error. Luckily, he can quickly fix this by adding information about the required software versions to the README of his project, and he is grateful for colleagues and other scientists who provide adjusted versions of his code for more recent software releases. In the end, his results prove to be robust regardless of software version. But while Allan shared code and data, not including any information about his software environment prevented his analysis from becoming computationally reproducible.
The DataLad Approach¶
Even if an analysis workflow is fully captured and version-controlled, and data and code are linked, an analysis may not reproduce. Comprehensive computational reproducibility also requires that the software involved in an analysis, including its precise versions, is known. DataLad can help with this. Using the datalad-container extension, complete software environments can be captured in computational containers, added to (and thus shared together with) datasets, and linked to the commands and outputs they were used for.
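In a nutshell, the pattern looks like this. The sketch below uses a hypothetical container name, URL, and command; the concrete invocations for this use case follow in the Step-by-Step part.
# register a container image with the dataset (name and URL are placeholders)
$ datalad containers-add mysoftware --url shub://some-org/some-image:sometag
# execute a command inside that container, capturing inputs, outputs, and provenance
$ datalad containers-run --container-name mysoftware 'some_command --some-option'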
Step-by-Step¶
The first part of this Step-by-Step guide details how to perform the data preparation from raw DICOM files to BIDS-compliant NIfTI images in an automatically and computationally reproducible way. The actual analysis, a first-level GLM for a localization task, is performed in the second part. A final paragraph shows how to prepare the dataset for the afterlife.
For this use case, two DataLad extensions are required: datalad-neuroimaging and datalad-container. You can install them via pip like this:
$ pip install datalad-neuroimaging datalad-container
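If you want to make sure both extensions ended up in your Python environment, a simple check of the installed packages suffices (a sanity check only, not required for the workflow):
$ pip list | grep datalad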
Data Preparation¶
We start by creating a home for the raw data:
$ datalad create localizer_scans
$ cd localizer_scans
create(ok): /home/me/usecases/repro2/localizer_scans (dataset)
For this example, we use a number of publicly available DICOM files. Luckily, at the time of data acquisition, these DICOMs were already equipped with the relevant metadata: their headers contain all necessary information to identify the purpose of individual scans and encode essential properties to create a BIDS-compliant dataset from them. The DICOMs are stored on GitHub (as a Git repository1), so they can be installed as a subdataset. As they are the raw inputs of the analysis, we store them in a directory called inputs/rawdata.
$ datalad clone --dataset . \
https://github.com/datalad/example-dicom-functional.git \
inputs/rawdata
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/localizer_scans/inputs/rawdata)
[INFO] Attempting to clone from https://github.com/datalad/example-dicom-functional.git to /home/me/usecases/repro2/localizer_scans/inputs/rawdata
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/localizer_scans/inputs/rawdata)
install(ok): inputs/rawdata (dataset)
add(ok): inputs/rawdata (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
install (ok: 1)
save (ok: 2)
The datalad subdatasets command (manual) reports that the installed dataset is indeed a subdataset of the superdataset localizer_scans:
$ datalad subdatasets
subdataset(ok): inputs/rawdata (dataset)
The raw data we have obtained is not yet ready for analysis, though. Prior to performing actual computations, the data needs to be converted into appropriate formats and standardized to an intuitive layout. For neuroimaging, a useful conversion is from DICOMs into the NIfTI format, a format specifically designed for scientific analyses of brain images; an intuitive layout is the BIDS standard. Performing these conversions and standardizations, however, requires software. For the task at hand, HeudiConv, a DICOM converter, is our software of choice. Beyond converting DICOMs, it also provides assistance in converting a raw data set to the BIDS standard, and it integrates with DataLad to place converted and original data under Git/git-annex version control, while automatically annotating files with sensitive information (e.g., non-defaced anatomicals).
To know exactly what software is used, both to be able to go back to it at a later stage should an issue need investigating, and to capture the full provenance of the conversion process, we use a software container that contains the relevant software setup. A ready-made Singularity container is available from Singularity Hub at shub://ReproNim/ohbm2018-training:heudiconvn.
Using the datalad containers-add (manual) command we can add this container to the localizer_scans superdataset. We are giving it the name heudiconv.
$ datalad containers-add heudiconv --url shub://ReproNim/ohbm2018-training:heudiconvn
[INFO] Initializing special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/usecases/repro2/localizer_scans/.datalad/environments/heudiconv/image (file)
action summary:
add (ok: 1)
containers_add (ok: 1)
save (ok: 1)
The command datalad containers-list
(manual) can verify that this worked:
$ datalad containers-list
heudiconv -> .datalad/environments/heudiconv/image
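Under the hood, the container image is tracked as an ordinary annexed file in the dataset. If you are curious where its content can be retrieved from, a standard git-annex query can show this (a sketch; the exact output depends on your setup):
$ git annex whereis .datalad/environments/heudiconv/image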
Great! The dataset now tracks all of the input data and the computational environment for the DICOM conversion. Thus far, we have a complete record of all components. Let's make the actual data conversion not only transparent but also automatically reproducible by wrapping the necessary heudiconv command seen below:
$ heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . \
-o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms
within a datalad containers-run (manual) command. To save time, we will only convert the data of one subject (sub-02, hence the subject identifier -s 02 in the command). Note that the output below is indeed how it should look: the software we are using in this example produces very verbose output.
$ datalad containers-run -m "Convert sub-02 DICOMs into BIDS" \
--container-name heudiconv \
'heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms'
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
INFO: Running heudiconv version 0.5.2-dev
INFO: Analyzing 5460 dicoms
INFO: Filtering out 0 dicoms based on their filename
WARNING: dcmstack without support of pydicom >= 1.0 is detected. Adding a plug
INFO: Generated sequence info for 1 studies with 1 entries total
INFO: Processing sequence infos to deduce study/session
INFO: Study session for {'session': None, 'locator': 'Hanke/Stadler/0083_transrep2', 'subject': '02'}
INFO: Need to process 1 study sessions
INFO: PROCESSING STARTS: {'session': None, 'subject': '02', 'outdir': '/tmp/heudiconv.sub-02/'}
INFO: Processing 1 pre-sorted seqinfo entries
INFO: Reloading existing filegroup.json because /tmp/heudiconv.sub-02/.heudiconv/02/02.edit.txt exists
INFO: Doing conversion using dcm2niix
INFO: Converting ./sub-02/func/sub-02_task-oneback_run-01_bold (5460 DICOMs) -> ./sub-02/func . Converter: dcm2niix . Output types: ('nii.gz', 'dicom')
INFO: Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
221214-17:24:59,972 nipype.workflow INFO:
[Node] Setting-up "convert" in "/tmp/dcm2niix5rxd8wjy/convert".
INFO: [Node] Setting-up "convert" in "/tmp/dcm2niix5rxd8wjy/convert".
221214-17:25:01,83 nipype.workflow INFO:
[Node] Running "convert" ("nipype.interfaces.dcm2nii.Dcm2niix"), a CommandLine Interface with command:
dcm2niix -b y -z y -x n -t n -m n -f func -o . -s n -v n /tmp/dcm2niix5rxd8wjy/convert
INFO: [Node] Running "convert" ("nipype.interfaces.dcm2nii.Dcm2niix"), a CommandLine Interface with command:
dcm2niix -b y -z y -x n -t n -m n -f func -o . -s n -v n /tmp/dcm2niix5rxd8wjy/convert
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:Chris Rorden's dcm2niiX version v1.0.20180622 GCC6.3.0 (64-bit Linux)
INFO: stdout 2022-12-14T17:25:03.167924:Chris Rorden's dcm2niiX version v1.0.20180622 GCC6.3.0 (64-bit Linux)
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:Found 5460 DICOM file(s)
INFO: stdout 2022-12-14T17:25:03.167924:Found 5460 DICOM file(s)
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:swizzling 3rd and 4th dimensions (XYTZ -> XYZT), assuming interslice distance is 3.300000
INFO: stdout 2022-12-14T17:25:03.167924:swizzling 3rd and 4th dimensions (XYTZ -> XYZT), assuming interslice distance is 3.300000
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:Warning: Images sorted by instance number [0020,0013](1..5460), but AcquisitionTime [0008,0032] suggests a different order (160423..160223)
INFO: stdout 2022-12-14T17:25:03.167924:Warning: Images sorted by instance number [0020,0013](1..5460), but AcquisitionTime [0008,0032] suggests a different order (160423..160223)
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:Using RWVSlope:RWVIntercept = 4.00757:0
INFO: stdout 2022-12-14T17:25:03.167924:Using RWVSlope:RWVIntercept = 4.00757:0
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924: Philips Scaling Values RS:RI:SS = 4.00757:0:0.0132383 (see PMC3998685)
INFO: stdout 2022-12-14T17:25:03.167924: Philips Scaling Values RS:RI:SS = 4.00757:0:0.0132383 (see PMC3998685)
221214-17:25:03,168 nipype.interface INFO:
stdout 2022-12-14T17:25:03.167924:Convert 5460 DICOM as ./func (80x80x35x156)
INFO: stdout 2022-12-14T17:25:03.167924:Convert 5460 DICOM as ./func (80x80x35x156)
221214-17:25:03,921 nipype.interface INFO:
stdout 2022-12-14T17:25:03.921288:compress: "/usr/bin/pigz" -n -f -6 "./func.nii"
INFO: stdout 2022-12-14T17:25:03.921288:compress: "/usr/bin/pigz" -n -f -6 "./func.nii"
221214-17:25:03,921 nipype.interface INFO:
stdout 2022-12-14T17:25:03.921288:Conversion required 2.767616 seconds (2.072800 for core code).
INFO: stdout 2022-12-14T17:25:03.921288:Conversion required 2.767616 seconds (2.072800 for core code).
221214-17:25:04,125 nipype.workflow INFO:
[Node] Finished "convert".
INFO: [Node] Finished "convert".
INFO: Populating template files under ./
INFO: PROCESSING DONE: {'session': None, 'subject': '02', 'outdir': '/tmp/heudiconv.sub-02/'}
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/localizer_scans (dataset) [singularity exec -B /home/me/usecases/re...]
add(ok): CHANGES (file)
add(ok): README (file)
add(ok): dataset_description.json (file)
add(ok): participants.tsv (file)
add(ok): sourcedata/README (file)
add(ok): sourcedata/sub-02/func/sub-02_task-oneback_run-01_bold.dicom.tgz (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_bold.json (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file)
add(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
add(ok): sub-02/sub-02_scans.tsv (file)
add(ok): task-oneback_bold.json (file)
save(ok): . (dataset)
action summary:
add (ok: 11)
get (notneeded: 1)
run (ok: 1)
save (notneeded: 1, ok: 1)
Find out what changed after this command by comparing the most recent commit by DataLad (i.e., HEAD) to the previous one (i.e., HEAD~1) with datalad diff (manual):
$ datalad diff -f HEAD~1
added: CHANGES (symlink)
added: README (symlink)
added: dataset_description.json (symlink)
added: participants.tsv (symlink)
added: sourcedata/README (symlink)
added: sourcedata/sub-02/func/sub-02_task-oneback_run-01_bold.dicom.tgz (symlink)
added: sub-02/func/sub-02_task-oneback_run-01_bold.json (symlink)
added: sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (symlink)
added: sub-02/func/sub-02_task-oneback_run-01_events.tsv (symlink)
added: sub-02/sub-02_scans.tsv (symlink)
added: task-oneback_bold.json (symlink)
As expected, DICOM files of one subject were converted into NIfTI files, and the outputs follow the BIDS standard's layout and naming conventions! But what's even better is that this action and the relevant software environment were fully recorded.
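For a quick look at the resulting BIDS layout, a plain directory listing of the new subject folder is enough (a sketch, no DataLad machinery involved):
$ ls sub-02/func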
There is only one thing missing before the functional imaging data can be analyzed: a stimulation protocol, so that we know what stimulation was done at which point during the scan. Thankfully, the data was collected using an implementation that exported this information directly in the BIDS events.tsv format. The file came with our DICOM dataset and can be found at inputs/rawdata/events.tsv. All we need to do is copy it to the right location under the BIDS-mandated name. To keep track of where this file came from, we will also wrap the copying into a datalad run (manual) command. The {inputs} and {outputs} placeholders can help to avoid duplication in the command call:
$ datalad run -m "Import stimulation events" \
--input inputs/rawdata/events.tsv \
--output sub-02/func/sub-02_task-oneback_run-01_events.tsv \
cp {inputs} {outputs}
[INFO] Making sure inputs are available (this may take some time)
[INFO] Unlocking files
unlock(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
[INFO] Recording unlocked state in git
[INFO] Completed unlocking files
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/localizer_scans (dataset) [cp 'inputs/rawdata/events.tsv' 'sub-02/f...]
add(ok): sub-02/func/sub-02_task-oneback_run-01_events.tsv (file)
save(ok): . (dataset)
git log
shows what information DataLad captured about this command’s
execution:
$ git log -n 1
commit dd8b5e1331bb278774d08be3d5c748566b681e7f
Author: Elena Piscopia <elena@example.net>
Date: Wed Dec 14 17:25:23 2022 +0100
[DATALAD RUNCMD] Import stimulation events
=== Do not change lines below ===
{
"chain": [],
"cmd": "cp '{inputs}' '{outputs}'",
"dsid": "c25e7e1f-e9e9-4521-8644-6aaf5c3d77b1",
"exit": 0,
"extra_inputs": [],
"inputs": [
"inputs/rawdata/events.tsv"
],
"outputs": [
"sub-02/func/sub-02_task-oneback_run-01_events.tsv"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
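Because this machine-readable record is part of the commit, DataLad could re-execute the exact same command at any later point. As a sketch (we will make real use of this capability at the end of this use case, so there is no need to run it now):
# replay the command recorded in the most recent commit
$ datalad rerun HEAD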
Analysis execution¶
Since the raw data are now reproducibly prepared in the BIDS standard, we can go further and conduct an analysis. For this example, we will implement a very basic first-level GLM analysis using FSL that takes only a few minutes to run. As before, we will capture all provenance: inputs, computational environments, code, and outputs.
Following the YODA principles2, the analysis is set up in a new
dataset, with the input dataset localizer_scans
as a subdataset:
# get out of localizer_scans
$ cd ../
$ datalad create glm_analysis
$ cd glm_analysis
create(ok): /home/me/usecases/repro2/glm_analysis (dataset)
We install localizer_scans by providing its path to datalad clone (manual):
$ datalad clone -d . \
../localizer_scans \
inputs/rawdata
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
[INFO] Attempting to clone from ../localizer_scans to /home/me/usecases/repro2/glm_analysis/inputs/rawdata
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
install(ok): inputs/rawdata (dataset)
add(ok): inputs/rawdata (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
install (ok: 1)
save (ok: 2)
Again, datalad subdatasets reports the newly installed subdataset:
$ datalad subdatasets
subdataset(ok): inputs/rawdata (dataset)
We almost forgot something really useful: structuring the dataset with the help of DataLad! Luckily, procedures such as yoda can not only be applied upon creation of a dataset (as in Create a dataset), but also later on, with the datalad run-procedure (manual) command (as in Configurations to go):
$ datalad run-procedure cfg_yoda
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [/home/adina/env/handbook/bin/python /hom...]
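If you are curious what the procedure changed, the most recent commit reveals the files and configuration it added (a quick inspection sketch):
$ git log -n 1 --stat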
The analysis obviously needs custom code. For the simple GLM analysis with FSL we use:
- A small script to convert BIDS-formatted events.tsv files into the EV3 format FSL understands, available at https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh
- An FSL analysis configuration template script, available at https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf
These scripts should be stored and tracked inside the dataset within code/.
The datalad download-url
(manual) command downloads these scripts and
records where they were obtained from:
$ datalad download-url --path code/ \
https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh \
https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf
[INFO] Downloading 'https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh' into '/home/me/usecases/repro2/glm_analysis/code/'
download_url(ok): /home/me/usecases/repro2/glm_analysis/code/events2ev3.sh (file)
[INFO] Downloading 'https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf' into '/home/me/usecases/repro2/glm_analysis/code/'
download_url(ok): /home/me/usecases/repro2/glm_analysis/code/ffa_design.fsf (file)
add(ok): code/events2ev3.sh (file)
add(ok): code/ffa_design.fsf (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
download_url (ok: 2)
save (ok: 1)
The commit message that DataLad created shows the URL where each script has been downloaded from:
$ git log -n 1
commit 9ae9716f14b90a74a2269a435c89ea7156399c5c
Author: Elena Piscopia <elena@example.net>
Date: Wed Dec 14 17:25:28 2022 +0100
[DATALAD] Download URLs
URLs:
https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh
https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf
Prior to the actual analysis, we need to run the events2ev3.sh script to transform inputs into the format that FSL expects. The datalad run command makes this maximally reproducible and easy, as the files given as --input and --output are automatically managed by DataLad.
$ datalad run -m 'Build FSL EV3 design files' \
--input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv \
--output 'sub-02/onsets' \
bash code/events2ev3.sh sub-02 {inputs}
[INFO] Making sure inputs are available (this may take some time)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv (file) [from origin...]
[INFO] == Command start (output follows) =====
sub-02
1
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash code/events2ev3.sh sub-02 'inputs/r...]
add(ok): sub-02/onsets/run-1/body.txt (file)
add(ok): sub-02/onsets/run-1/face.txt (file)
add(ok): sub-02/onsets/run-1/house.txt (file)
add(ok): sub-02/onsets/run-1/object.txt (file)
add(ok): sub-02/onsets/run-1/scene.txt (file)
add(ok): sub-02/onsets/run-1/scramble.txt (file)
save(ok): . (dataset)
The dataset now contains and manages all of the required inputs, and we're ready for FSL. Since FSL is not a simple program, we make sure to record the precise software environment for the analysis with datalad containers-run. First, we get a container with FSL in the version we require:
$ datalad containers-add fsl --url shub://mih/ohbm2018-training:fsl
[INFO] Initializing special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/usecases/repro2/glm_analysis/.datalad/environments/fsl/image (file)
action summary:
add (ok: 1)
containers_add (ok: 1)
save (ok: 1)
As the analysis setup is now complete, let’s label this state of the dataset:
$ datalad save --version-tag ready4analysis
save(ok): . (dataset)
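The tag is an ordinary Git tag, so standard Git tooling can confirm that it was created (a quick sanity check):
$ git tag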
All we have left is to configure the desired first-level GLM analysis with
FSL. At this point, the template contains placeholders for the basepath
and the subject ID, and they need to be replaced.
The following command uses the arcane, yet powerful sed editor to do
this. We will again use datalad run
to invoke our command so that
we store in the history how this template was generated (so that we may
audit, alter, or regenerate this file in the future — fearlessly).
$ datalad run \
-m "FSL FEAT analysis config script" \
--output sub-02/1stlvl_design.fsf \
bash -c 'sed -e "s,##BASEPATH##,{pwd},g" -e "s,##SUB##,sub-02,g" \
code/ffa_design.fsf > {outputs}'
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash -c 'sed -e "s,##BASEPATH##,/home/me...]
add(ok): sub-02/1stlvl_design.fsf (file)
save(ok): . (dataset)
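To double-check that no placeholders remain in the generated configuration, a quick search for the placeholder markers can serve as a sanity check (a sketch; it should find nothing if the substitution worked):
$ grep '##' sub-02/1stlvl_design.fsf || echo "all placeholders replaced"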
To compute the analysis, a simple feat sub-02/1stlvl_design.fsf
command
is wrapped into a datalad containers-run
command with appropriate
--input
and --output
specification:
$ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
--input sub-02/1stlvl_design.fsf \
--input sub-02/onsets \
--input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output sub-02/1stlvl_glm.feat \
feat {inputs[0]}
[INFO] Making sure inputs are available (this may take some time)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [from origin...]
[INFO] == Command start (output follows) =====
To view the FEAT progress and final report, point your web browser at /home/me/usecases/repro2/glm_analysis/sub-02/1stlvl_glm.feat/report_log.html
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [singularity exec -B /home/me/usecases/re...]
add(ok): sub-02/1stlvl_glm.feat/tsplot/tsplotc_zstat1p.png (file)
save(ok): . (dataset)
action summary:
add (ok: 344)
get (notneeded: 4, ok: 1)
run (ok: 1)
save (notneeded: 1, ok: 1)
Once this command finishes, DataLad will have captured the entire FSL output, and the dataset will contain a complete record all the way from the input BIDS dataset to the GLM results. The BIDS subdataset in turn has a complete record of all processing back to the raw DICOMs.
Many files need more planning
See how many files were created and added in this computation for a single participant? If your study has many participants, analyses like the one above could inflate your dataset. Please check out the chapter Go big or go home, in particular the section Calculate in greater numbers, for tips and tricks on how to create analysis datasets that scale.
Archive data and results¶
After study completion it is important to properly archive data and results, for example for future inquiries by reviewers or readers of the associated publication. Thanks to the modularity of the study units, this task is easy and avoids needless duplication.
The raw data is tracked in its own dataset (localizer_scans) that only needs to be archived once, regardless of how many analyses use it as input. This means that we can "throw away" this subdataset copy within the analysis dataset. DataLad can re-obtain the correct version at any point in the future, as long as the recorded location remains accessible.
To make sure we’re not deleting information we are not aware of,
datalad diff
and git log
(manual) can help to verify that the
subdataset is in the same state as when it was initially added:
$ datalad diff -- inputs
The command does not show any output, thus indicating that there is indeed no
difference. git log
confirms that the only action that was performed on
inputs/
was the addition of it as a subdataset:
$ git log -- inputs
commit 742a5f69860a021f098413895eba1419f9674b4d
Author: Elena Piscopia <elena@example.net>
Date: Wed Dec 14 17:25:25 2022 +0100
[DATALAD] Added subdataset
Since the state of the subdataset is exactly the state of the original localizer_scans dataset, it is safe to uninstall it.
$ datalad uninstall --dataset . inputs --recursive
drop(ok): inputs/rawdata (key)
drop(ok): inputs/rawdata (key)
uninstall(ok): inputs/rawdata (dataset)
action summary:
drop (notneeded: 1, ok: 2)
uninstall (ok: 1)
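Should the input data ever be needed again, the subdataset can be re-obtained from its recorded location; a sketch (the -n flag installs the subdataset without fetching file content):
$ datalad get -n inputs/rawdata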
Prior to archiving the results, we can go one step further and verify their computational reproducibility. DataLad's rerun command is capable of "replaying" any recorded command. The following command re-executes the FSL analysis by re-running everything since the dataset was tagged as ready4analysis. It will record the recomputed results in a separate Git branch named verify. Afterwards, we can automatically compare these new results to the original ones in the master branch. We will see that all outputs can be reproduced in bit-identical form. The only changes are observed in log files that contain volatile information, such as time stamps.
$ datalad rerun --branch verify --onto ready4analysis --since ready4analysis
[INFO] checkout commit dbfd914;
[INFO] run commit e7f9bab; (FSL FEAT analysis...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/usecases/repro2/glm_analysis (dataset) [bash -c 'sed -e "s,##BASEPATH##,/home/me...]
add(ok): sub-02/1stlvl_design.fsf (file)
save(ok): . (dataset)
[INFO] run commit 61f1316; (sub-02 1st-level GLM)
[INFO] Making sure inputs are available (this may take some time)
[INFO] Cloning dataset to Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
[INFO] Attempting to clone from ../localizer_scans to /home/me/usecases/repro2/glm_analysis/inputs/rawdata
[INFO] Completed clone attempts for Dataset(/home/me/usecases/repro2/glm_analysis/inputs/rawdata)
get(ok): inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [from origin...]
[INFO] == Command start (output follows) =====
To view the FEAT progress and final report, point your web browser at /home/me/usecases/repro2/glm_analysis/sub-02/1stlvl_glm.feat/report_log.html
[INFO] == Command exit (modification check follows) =====
add(ok): sub-02/1stlvl_glm.feat/tsplot/tsplotc_zstat1p.png (file)
save(ok): . (dataset)
action summary:
add (ok: 345)
get (notneeded: 4, ok: 1)
run (ok: 2)
save (notneeded: 1, ok: 2)
# check that we are now on the new `verify` branch
$ git branch
git-annex
master
* verify
# compare which files have changes with respect to the original results
$ git diff master --stat
sub-02/1stlvl_glm.feat/logs/feat0 | 2 +-
sub-02/1stlvl_glm.feat/logs/{feat0_init.e657173 => feat0_init.e660126} | 0
sub-02/1stlvl_glm.feat/logs/{feat0_init.o657173 => feat0_init.o660126} | 0
sub-02/1stlvl_glm.feat/logs/feat1 | 2 +-
sub-02/1stlvl_glm.feat/logs/{feat2_pre.e657253 => feat2_pre.e660207} | 0
sub-02/1stlvl_glm.feat/logs/{feat2_pre.o657253 => feat2_pre.o660207} | 0
sub-02/1stlvl_glm.feat/logs/{feat3_film.e657707 => feat3_film.e660654} | 0
sub-02/1stlvl_glm.feat/logs/{feat3_film.o657707 => feat3_film.o660654} | 0
sub-02/1stlvl_glm.feat/logs/{feat4_post.e658119 => feat4_post.e661021} | 0
sub-02/1stlvl_glm.feat/logs/{feat4_post.o658119 => feat4_post.o661021} | 0
sub-02/1stlvl_glm.feat/logs/{feat5_stop.e658751 => feat5_stop.e661653} | 0
sub-02/1stlvl_glm.feat/logs/{feat5_stop.o658751 => feat5_stop.o661653} | 0
sub-02/1stlvl_glm.feat/report.html | 2 +-
sub-02/1stlvl_glm.feat/report_log.html | 2 +-
14 files changed, 4 insertions(+), 4 deletions(-)
# switch back to the master branch and remove the `verify` branch
$ git checkout master
$ git branch -D verify
Switched to branch 'master'
Deleted branch verify (was 29c21ca).
The outcome of this use case can be found as a dataset on GitHub here.
Footnotes
- 1
“Why can such data exist as a Git repository, shouldn’t large files always be stored outside of Git?” you may ask. The DICOMs exist in a Git repository for a number of reasons: First, it makes them easily available for demonstrations and tutorials without involving DataLad at all. Second, the DICOMs are comparatively small: 21K per file. Importantly, the repository is not meant to version control future states of those files, or derivatives and results obtained from them – that would bring a Git repository to its knees.
- 2
To re-read everything about the YODA principles, check out the section YODA: Best practices for data analyses in a dataset.