A Neuroimaging Dataset

This section is a concise demonstration of what a DataLad dataset is, showcased on a dataset from the field of neuroimaging. A DataLad dataset is the core data type of DataLad. We will explore its concepts with one public example dataset, the studyforrest phase 2 data (studyforrest.org). Note that this is just one type and use of a DataLad dataset, and you will encounter many more flavors of using DataLad datasets throughout the basics and in upcoming use cases.

Please follow along and run the commands below in your own terminal for a hands-on experience.

$ datalad install https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
[INFO] Cloning dataset to Dataset(/home/me/usecases/studyforrest/studyforrest-data-phase2) 
[INFO] Attempting to clone from https://github.com/psychoinformatics-de/studyforrest-data-phase2.git to /home/me/usecases/studyforrest/studyforrest-data-phase2 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start receiving objects 
[INFO] Start resolving deltas 
[INFO] Completed clone attempts for Dataset(/home/me/usecases/studyforrest/studyforrest-data-phase2) 
[INFO] scanning for annexed files (this may take some time) 
[INFO] Remote origin not usable by git-annex; setting annex-ignore 
[INFO] https://github.com/psychoinformatics-de/studyforrest-data-phase2.git/config download failed: Not Found 
[INFO] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.deria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times.  -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f96af4ea260>: Failed to establish a new connection: [Errno -2] Name or service not known')) 
install(ok): /home/me/usecases/studyforrest/studyforrest-data-phase2 (dataset)

Once installed, a DataLad dataset looks like any other directory on your file system:

$ cd studyforrest-data-phase2
$ ls # output below is only an excerpt from ls
LICENSE
Makefile
participants.tsv
README.rst
recording-cardresp_physio.json
recording-eyegaze_physio.json
src
stimuli
sub-01
sub-02
sub-03

However, all files and directories within the DataLad dataset can be tracked (should you want them to be tracked), regardless of their size. Large content is tracked in an annex that is automatically created and handled by DataLad. Whether text files or larger files change, all of these changes can be written to your DataLad dataset's history.

Large-file tracking

A DataLad dataset is a Git repository. Large file content is kept in the dataset's annex and tracked with git-annex. Running ls -a reveals that Git is secretly working in the background:

$ ls -a # show also hidden files (excerpt)
CHANGES
code
datacite.yml
.datalad
dataset_description.json
.git
.gitattributes
LICENSE
Makefile
participants.tsv
README.rst
recording-cardresp_physio.json
recording-eyegaze_physio.json
src
stimuli
sub-01
sub-02
sub-03

Users can create new DataLad datasets from scratch, or install existing DataLad datasets from paths, URLs, or open-data collections. This makes sharing and accessing data fast and easy. Moreover, when sharing or installing a DataLad dataset, all copies also include the dataset's history. An installed DataLad dataset knows the dataset it was installed from, and if changes in this original DataLad dataset happen, the installed dataset can simply be updated.

You can view the DataLad dataset's history with tools of your choice. The code block below illustrates the history and is an excerpt from git log.

$ git log --oneline --graph --decorate
* 01ed4601 (HEAD -> master, origin/master, origin/HEAD) Fix  author list (all git & paper authors)
*   974614cb Merge pull request #21 from psychoinformatics-de/christian-monch-patch-1
|\  
| * 547c519c Add authors revealed by git-log
|/  
*   72f835ad Merge remote-tracking branch 'github/master'
|\  
| *   fb36de08 Merge pull request #20 from psychoinformatics-de/christian-moench-gin-datacite
| |\  

Dataset content identity and availability information

Upon installation of a DataLad dataset, DataLad retrieves only (small) metadata information about the dataset. This exposes the dataset's file hierarchy for exploration, and speeds up the installation of a DataLad dataset of many TB in size to a few seconds. Just after installation, the dataset is small in size:

$ du -sh
20M	.

This is because only small files are present locally. To see this for yourself, try opening both a small .tsv file in the root of the dataset and a larger compressed nifti file (nii.gz) in one of the subdirectories in this dataset. The small .tsv file (1.9K) exists and can be opened locally, but the large, compressed nifti file cannot. In this state, you cannot open or work with the nifti file, but you can explore which files exist without the potentially large download.

$ ls participants.tsv  sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
participants.tsv
sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
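Why ls lists a file that cannot be opened can be sketched in a few lines of shell. The sketch assumes the dataset uses git-annex's default symlink mode, in which an annexed file is a symbolic link into .git/annex/objects that dangles until its content is retrieved. The helper name content_present, the scratch directory, and the link target are hypothetical stand-ins, not part of DataLad or of the studyforrest dataset.

```shell
set -eu

# Succeeds only if the path resolves to actual content,
# i.e. it is not a dangling symbolic link.
content_present() {
    [ -e "$1" ]
}

# Simulated stand-ins (hypothetical names) for a freshly installed dataset.
demo=$(mktemp -d)
# A small file whose content is present locally.
echo "participant_id" > "$demo/participants.tsv"
# An annexed file whose content has not been retrieved: a dangling link.
ln -s .git/annex/objects/missing-key "$demo/bold.nii.gz"

for f in participants.tsv bold.nii.gz; do
    if content_present "$demo/$f"; then
        echo "$f: content present"
    else
        echo "$f: listed, but content not retrieved"
    fi
done
```

Under the same assumption, running the helper on the nifti file in the installed dataset should report missing content at this point in the walkthrough.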

The retrieval of the actual, potentially large file content can happen at any later time for the full dataset or subsets of files. Let’s get the nifti file:

$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]

Wasn’t this easy?

Dataset nesting

Within DataLad datasets one can nest other DataLad datasets arbitrarily deep. This does not seem particularly spectacular; after all, any directory on a file system can have other directories inside it. The possibility for nested datasets, however, is one of many advantages DataLad datasets have: Any lower-level DataLad dataset (the subdataset) has a stand-alone history. The top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used.

By taking advantage of dataset nesting, one can take a dataset such as the studyforrest phase-2 data and install it as a subdataset within a superdataset containing analysis code and results computed from the studyforrest data. Should the studyforrest data get extended or changed, its subdataset can be updated to include the changes easily. More detailed examples of this can be found in the use cases in the last section (for example in Writing a reproducible paper).
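The version pointer that a superdataset keeps for a subdataset can be made visible with plain Git, since DataLad registers subdatasets as Git submodules. The following is a minimal sketch using two throwaway repositories. The directory names (data, analysis, inputs/data) and the demo identity are hypothetical, and the protocol.file.allow setting is only needed because the sketch clones from a local path.

```shell
set -eu
work=$(mktemp -d)
# Throwaway identity so that commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the data dataset, with its own stand-alone history.
git init -q "$work/data"
echo "sub-01" > "$work/data/participants.tsv"
git -C "$work/data" add participants.tsv
git -C "$work/data" commit -q -m "Add participants"
data_commit=$(git -C "$work/data" rev-parse HEAD)

# Stand-in for the analysis superdataset that nests the data.
git init -q "$work/analysis"
git -C "$work/analysis" -c protocol.file.allow=always \
    submodule --quiet add "$work/data" inputs/data
git -C "$work/analysis" commit -q -m "Register data as subdataset"

# The superdataset records one commit of the subdataset (mode 160000),
# not the subdataset's files or history.
recorded=$(git -C "$work/analysis" ls-tree HEAD inputs/data)
echo "$recorded"
```

The printed entry names the commit of the nested repository, which is the "version currently used" mentioned above.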

The figure below illustrates dataset nesting in a neuroimaging context schematically:

Figure: Virtual directory tree of a nested DataLad dataset

Creating your own dataset

Anyone can create, populate, and optionally share a new DataLad dataset. A new DataLad dataset is always created empty, even if the target directory already contains additional files or directories. After creation, arbitrarily large amounts of data can be added. Once files are added and saved to the dataset, any changes made to these data files can be saved to the history.

Create internals

Creation of datasets relies on the git init and git annex init commands.
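As a rough illustration of these two underlying calls, and of why a new dataset starts out empty even in a populated directory, the sketch below initializes a repository in a directory that already holds a file. It is not what DataLad itself runs: DataLad does additional bookkeeping (such as the .datalad directory seen in the listing earlier), and the git annex init step is skipped here if git-annex is not installed. The file name notes.txt and the description "demo" are hypothetical.

```shell
set -eu
# Throwaway identity in case git-annex needs to commit to its own branch.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A target directory that already contains a file.
ds=$(mktemp -d)
echo "pre-existing notes" > "$ds/notes.txt"

# The two underlying initialization steps.
git init -q "$ds"
if command -v git-annex >/dev/null 2>&1; then
    git -C "$ds" annex init "demo" >/dev/null 2>&1
fi

# The pre-existing file is reported as untracked ("??"):
# nothing is part of the history until it is explicitly added and saved.
status=$(git -C "$ds" status --porcelain)
echo "$status"
```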

As already shown, an existing DataLad dataset can simply be installed from a URL or path, or from the DataLad open-data collection.

Install internals

datalad install uses the git clone command.
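Because installation builds on git clone, two properties described earlier follow from Git itself: the copy carries the full history, and it remembers where it came from, so it can be updated later. The sketch below shows this with plain Git only, not with DataLad's own commands. The repository names original and installed and the demo identity are hypothetical.

```shell
set -eu
work=$(mktemp -d)
# Throwaway identity so that commits work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical "original" dataset with one commit.
git init -q "$work/original"
echo "v1" > "$work/original/README"
git -C "$work/original" add README
git -C "$work/original" commit -q -m "Initial version"

# Cloning copies the history and records the source as "origin".
git clone -q "$work/original" "$work/installed"
origin_url=$(git -C "$work/installed" remote get-url origin)

# A later change in the original ...
echo "v2" > "$work/original/README"
git -C "$work/original" commit -q -am "Update README"

# ... can be brought into the clone from its recorded origin.
git -C "$work/installed" pull -q --ff-only
n_commits=$(git -C "$work/installed" rev-list --count HEAD)
echo "origin: $origin_url, commits: $n_commits"
```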