A Neuroimaging Dataset

Todo

Currently, this is a leftover. Later, we can rework this into something, but it's unclear yet what ;-)

This section is a concise demonstration of what a DataLad dataset is, showcased with a dataset from the field of neuroimaging. A DataLad dataset is the core data type of DataLad. We will explore its concepts with one public example dataset, the studyforrest phase 2 data (studyforrest.org). Note that this is just one type and use of a DataLad dataset; throughout the Basics and the upcoming use cases you will encounter many more flavors of using DataLad datasets.

Please follow along and run the commands below in your own terminal for a hands-on experience.

$ datalad install https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
[INFO] Cloning https://github.com/psychoinformatics-de/studyforrest-data-phase2.git [1 other candidates] into '/home/me/usecases/studyforrest/studyforrest-data-phase2'
[INFO]   Remote origin not usable by git-annex; setting annex-ignore 
install(ok): /home/me/usecases/studyforrest/studyforrest-data-phase2 (dataset)

Once installed, a DataLad dataset looks like any other directory on your filesystem:

$ cd studyforrest-data-phase2
$ ls # output below is only an excerpt from ls
Makefile
participants.tsv
README.rst
recording-cardresp_physio.json
recording-eyegaze_physio.json
src
stimuli
sub-01
sub-02
sub-03
sub-04

However, all files and directories within the DataLad dataset can be tracked (should you want them to be tracked), regardless of their size. Large content is tracked in an annex that is automatically created and handled by DataLad. Whether text files or larger files change, all of these changes can be written to your DataLad dataset's history.
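As a quick, hands-on sketch (the file name and commit message below are made up for illustration), recording a new or changed file in the dataset's history boils down to a datalad save:

$ echo "my first note" > notes.txt   # hypothetical new file
$ datalad status                     # reports notes.txt as untracked
$ datalad save -m "Add a demonstration note"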

Note for Git users

A DataLad dataset is a Git repository. Large file content in the dataset's annex is tracked with git-annex. An ls -a reveals that Git is secretly working in the background:

$ ls -a # show also hidden files (excerpt)
code
.datalad
dataset_description.json
.git
.gitattributes
.gitignore
.gitmodules
participants.tsv
README.rst
recording-cardresp_physio.json
recording-eyegaze_physio.json
src
stimuli
sub-01
sub-02
sub-03
sub-04
sub-05

Users can create new DataLad datasets from scratch, or install existing DataLad datasets from paths, URLs, or open-data collections. This makes sharing and accessing data fast and easy. Moreover, when sharing or installing a DataLad dataset, all copies also include the dataset's history. An installed DataLad dataset knows the dataset it was installed from, and if changes happen in this original DataLad dataset, the installed dataset can simply be updated.
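For example, pulling changes from the original dataset into your installed copy is a single command (a minimal sketch; the exact option names can differ slightly between DataLad versions):

$ datalad update --merge   # fetch changes from the dataset's origin and merge them into your copy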

You can view the DataLad dataset's history with tools of your choice. The code block below illustrates the history and is an excerpt from git log.

$ git log --oneline --graph --decorate
* a6623bff (HEAD -> master, origin/master, origin/HEAD) [DATALAD] dataset aggregate metadata update
* 72d535d5 Enable DataLad metadata extractors
* 9c15094e [DATALAD] new dataset
* d97455f5 [DATALAD] Set default backend for all files to be MD5E
* e2a2cf1c Update changelog for 1.5
* 6da25fb6 BF: Re-import respiratory trace after bug fix in converter (fixes gh-11)
* 131edcb7 Fix type in physio log converter (fixes gh-11)
* fbb5d619 ENH: Report per-stimulus events (fixes gh-6)
* d6f3fcd2 Add BIDS-compatible stimuli/ directory (with symlinks)

Dataset content identity and availability information

Upon installation of a DataLad dataset, DataLad retrieves only (small) metadata information about the dataset. This exposes the dataset's file hierarchy for exploration, and speeds up the installation of a DataLad dataset of many TB in size to a few seconds. Just after installation, the dataset is small in size:

$ du -sh
17M	.

This is because only small files are present locally. To see this for yourself, you can try opening both a small .tsv file in the root of the dataset and a larger compressed NIfTI file (nii.gz) in one of the subdirectories of this dataset. The small .tsv file (1.9K) is present and can be opened locally, but the content of the large, compressed NIfTI file is not. In this state, one cannot open or work with the NIfTI file, but you can explore which files exist without the potentially large download.

$ ls participants.tsv  sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
participants.tsv
sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz

The retrieval of the actual, potentially large file content can happen at any later time, for the full dataset or for subsets of files. Let's get the NIfTI file:

$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]

Wasn’t this easy?
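In the same spirit - shown here as a hedged sketch without output - content can be retrieved for whole directories at once, and local copies can be dropped again when disk space is needed; the files remain known to the dataset and can be re-obtained later:

$ datalad get sub-01/ses-movie/func/   # retrieve all functional data of sub-01
$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz   # free local space; 'get' can retrieve it again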

Dataset Nesting

Within DataLad datasets one can nest other DataLad datasets arbitrarily deep. This does not seem particularly spectacular - after all, any directory on a filesystem can have other directories inside it. The possibility for nested datasets, however, is one of many advantages DataLad datasets have: Any lower-level DataLad dataset (the subdataset) has a stand-alone history. The top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used.

By taking advantage of dataset nesting, one can take a dataset such as the studyforrest phase-2 data and install it as a subdataset within a superdataset that contains analysis code and results computed from the studyforrest data. Should the studyforrest data get extended or changed, the subdataset can easily be updated to include those changes. More detailed examples of this can be found in the use cases in the last section (for example in Writing a reproducible paper).
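As a rough sketch of such a setup (the superdataset name and layout are made up for illustration), one could create an analysis superdataset and register the studyforrest data as a subdataset inside it:

$ datalad create my-analysis   # hypothetical superdataset for code and results
$ cd my-analysis
$ datalad install -d . -s https://github.com/psychoinformatics-de/studyforrest-data-phase2.git inputs/studyforrest-data-phase2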

The figure below illustrates dataset nesting in a neuroimaging context schematically:

Figure: Virtual directory tree of a nested DataLad dataset

Creating your own dataset

Anyone can create, populate, and optionally share a new DataLad dataset. A new DataLad dataset is always created empty, even if the target directory already contains additional files or directories. After creation, arbitrarily large amounts of data can be added. Once files are added and saved to the dataset, any changes to these data files can be saved to the history.
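A minimal sketch of this workflow (the dataset and file names are made up) could look like this:

$ datalad create my-dataset            # create a new, empty dataset
$ cd my-dataset
$ cp /path/to/acquired/data.nii.gz .   # hypothetical data file to be added
$ datalad save -m "Add first data file"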

Note for Git users

Creation of datasets relies on the git init and git annex init commands.

As already shown, existing DataLad datasets can simply be installed from a URL or path, or from the DataLad open-data collection.
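For instance (a hedged example), the open-data collection hosted at datasets.datalad.org can itself be installed as a superdataset via the /// shortcut, and datasets of interest can then be obtained from within it:

$ datalad install ///   # install the superdataset of the DataLad open-data collection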

Note for Git users

datalad install uses the git clone command.