An introduction to DataLad at the Open Science Office Hour

Welcome to this introduction to DataLad!

On this website you will find the code from the live demonstrations together with a few additional pointers and explanations, so that you can work through the materials later in your own time, or find out more about a workflow, command, or concept. If you have all relevant software installed, open a terminal on your computer and copy-paste the code snippets in this section into it to code along (if you hover over the right corner of a snippet, you can copy it to your clipboard).

Introduction & set-up

In order to code along, you should have …

a recent DataLad version, e.g., 0.19.4 installed,

If you need them, installation, updating, and configuration instructions for DataLad and Git are in the section Installation and configuration. If you are unsure about your version of DataLad, you can check it using the following command:

datalad --version

a configured Git identity (you may have done this during the introduction to version control with Git)

If you are unsure if you have configured your Git identity already, you can check if your name and email are printed to the terminal when you run

git config --get user.name
git config --get user.email

If nothing is returned, you need to configure your Git identity (Installation and configuration shows you how).

the DataLad extension “datalad-container” installed, as well as the Python package black.

In order to install datalad-container and/or black, use a package manager such as pip:

pip install datalad-container black

an SSH key and account on gin.g-node.org

Beyond software usage, this tutorial will show you how to publish data. For this, we will be using Gin, a free dataset hosting service. If you want to code along to this part of the tutorial, you may want to create a free user account and upload your SSH key – but worry not, you can also do this at a later stage. Detailed instructions are in section Walk-through: Dataset hosting on GIN.
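
The software checks in the list above can also be bundled into one small script. The following is a minimal sketch using only the Python standard library. The function name check_prerequisites is made up for illustration, and the sketch assumes that datalad-container is importable as the Python module datalad_container. It only reports what it finds and changes nothing on your system:

```python
import importlib.util
import shutil
import subprocess


def check_prerequisites():
    """Report which tutorial prerequisites appear to be available (hypothetical helper)."""
    report = {
        # is the datalad executable on the PATH?
        "datalad": shutil.which("datalad") is not None,
        # are the two Python packages importable in this interpreter?
        "datalad-container": importlib.util.find_spec("datalad_container") is not None,
        "black": importlib.util.find_spec("black") is not None,
    }
    # a Git identity counts as configured if both values are non-empty
    for key in ("user.name", "user.email"):
        configured = False
        if shutil.which("git") is not None:
            result = subprocess.run(
                ["git", "config", "--get", key],
                capture_output=True,
                text=True,
            )
            configured = result.returncode == 0 and bool(result.stdout.strip())
        report[f"git {key}"] = configured
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Note that the package checks refer to the Python interpreter that runs the script, which may differ from the one pip installed into.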

How to use DataLad

DataLad is a command line tool and it has a Python API. It is operated in your terminal using the command line (as done above), or used in scripts such as shell scripts, Python scripts, Jupyter Notebooks, and so forth. This is how you would import DataLad’s Python API:

ipython       # if not installed, use python
>>> import datalad.api as dl
>>> dl.create(path='mydataset')
>>> exit()

In scripts using other programming languages, DataLad commands can be invoked via system calls. Here is an example with R:

R       # or use in RStudio
> system("datalad create mydataset")
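
The same system-call approach works from Python when you prefer the command line interface over datalad.api. The sketch below uses only the standard library. The helper name datalad_cli and its dry_run switch are hypothetical and not part of DataLad:

```python
import shutil
import subprocess


def datalad_cli(*args, dry_run=False):
    """Build a datalad command line and optionally run it (hypothetical helper)."""
    argv = ["datalad", *args]
    if dry_run:
        # only show what would be executed
        return argv
    if shutil.which("datalad") is None:
        raise RuntimeError("datalad was not found on the PATH")
    # check=True raises CalledProcessError if datalad exits with an error
    subprocess.run(argv, check=True)
    return argv


# mirrors system("datalad create mydataset") from the R example, without running it
print(datalad_cli("create", "mydataset", dry_run=True))
```

Passing the arguments as a list avoids shell quoting issues that a single command string would have.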

DataLad datasets

Everything happens in or involves DataLad datasets - DataLad’s core data structure.

You either create datasets yourself, or clone an existing dataset. Creating a dataset from scratch is done with the datalad create command.

You can also turn an existing directory into a dataset by navigating into it and running datalad create -f . (manual) (with the -f/--force option). Section Transitioning existing projects into DataLad provides more info on how to transform existing directories into DataLad datasets. It is advised, though, to learn a bit of DataLad Basics first, so stay tuned.

datalad create only needs a name, and it will subsequently create a new directory under this name and instruct DataLad to manage it. Here, the command also has an additional option, the -c text2git option. With the -c option, datasets can be configured in a certain way at the time of creation, and text2git is a so-called run procedure:

datalad create -c text2git my-analysis

The my-analysis dataset is now a new directory, and you can “change directories” (cd) into it:

cd my-analysis

The “text2git” procedure pre-created a useful dataset configuration that will make version control workflows with files of varying sizes and types easier. It will also help us later to understand the two version control tools involved in DataLad datasets.

Version control

Version controlling a file means recording its changes over time, associating those changes with an author, date, and identifier, creating a lineage of file content, and being able to revert changes or restore previous file versions. DataLad datasets make use of two established version control tools, Git and git-annex, to version control files regardless of size or type.

Let’s build a dataset for an analysis by adding a README. The command below writes a simple header into a new file README.md:

echo "# My example DataLad dataset" > README.md

datalad status (manual) can report on the state of a dataset: What has changed, compared to the last saved version? As we added a new file, README.md shows up as being “untracked”:

datalad status

Procedurally, version control with DataLad commands can be simpler than what you might be used to: In order to save any new file or modification to an existing file in a dataset you use the datalad save (manual) command. The -m/--message option lets you attach a concise summary of your changes. Such a commit message makes it easier for others and your later self to understand a dataset’s history:

datalad save -m "Create a short README"

Let us modify this file by extending the description a bit further. The command below appends a short description to the existing contents of the README:

echo "This dataset contains a toy data analysis" >> README.md

If you want to, you can also use git or git-annex commands in DataLad datasets. Git commands such as git status or git diff are equally able to tell you that the file now differs from its last saved state and is thus “modified”:

git diff

Let’s save this modification with a helpful message again:

datalad save -m "Add information on the dataset contents to the README"

When run without any file constraints, datalad save will save all modifications in the dataset at once - every untracked file and every modification made to existing files. If you have several unrelated modifications, it is advisable to save them individually. To do this, you can supply the command with a path to the file (or files) you want to save, e.g., datalad save -m "adding raw data" raw/

With each saved change, you build up your dataset’s revision history. Tools such as git log (manual) allow you to interrogate this history, and if you want to, you can use this history to find out what has been done in a dataset, reset it to previous states, and much more:

git log
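
For scripted workflows, the steps of this section could also be expressed with DataLad’s Python API. The following is a sketch rather than a tested recipe: it assumes that DataLad is installed, that a Git identity is configured, and that the cfg_proc argument of create mirrors the -c option on the command line. The function name build_readme_history is made up for illustration:

```python
from pathlib import Path

try:
    import datalad.api as dl
except ImportError:  # DataLad is not installed in this Python environment
    dl = None


def build_readme_history(path="my-analysis"):
    """Create a dataset and save two versions of a README (illustrative helper)."""
    if dl is None:
        raise RuntimeError("this sketch needs the datalad Python package")
    # equivalent of: datalad create -c text2git my-analysis
    ds = dl.create(path=path, cfg_proc="text2git")
    readme = Path(ds.path) / "README.md"

    # equivalent of: echo "..." > README.md, followed by datalad save -m "..."
    readme.write_text("# My example DataLad dataset\n")
    ds.save(message="Create a short README")

    # equivalent of: echo "..." >> README.md, followed by datalad save -m "..."
    with readme.open("a") as f:
        f.write("This dataset contains a toy data analysis\n")
    ds.save(message="Add information on the dataset contents to the README")
    return ds
```

Calling the function on a new directory name would leave you with a dataset whose history contains both saves, in addition to the commits made at dataset creation.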

Importantly, you can version control data of any size - yes, even if the data reaches the size of the Human Connectome Project, of the UK Biobank, or even larger. datalad save is all you need.

And version control does not stop at research data - as long as something is a digital file, you can save it to a DataLad dataset. This includes software containers, such as Docker or Singularity containers.

Software containers are useful to capture, share, and use a specific software environment for an analysis. The DataLad extension datalad-container therefore equips DataLad with commands that go beyond version controlling software containers and add convenience for reproducible science. datalad containers-add, for example, can register a container from a path or a URL inside a dataset in a way that allows us to perform a provenance-captured data analysis inside of it.

The following command will add a prepared Singularity container from a remote source and register it under the name nilearn (as the container we would use contains a Python environment with nilearn inside):

datalad containers-add nilearn \
     --url shub://adswa/nilearn-container:latest

If your own system supports Docker rather than Singularity, you can get the very same container from Docker Hub by running:

datalad containers-add nilearn \
      --url dhub://djarecka/nilearn:yale

If you are interested in using containers for your data analysis, check out github.com/repronim/containers, a curated DataLad dataset with a variety of neuroimaging-related software containers ready for you to use.

The command datalad containers-list can show you which containers are registered in your datasets:

datalad containers-list

Data consumption and dataset nesting

DataLad makes data consumption very convenient: The datalad clone (manual) command allows you to install datasets from local or remote sources. And there are many public dataset sources, such as all of OpenNeuro’s datasets (github.com/OpenNeuroDatasets), the Human Connectome Project’s open access data (github.com/datalad-datasets/human-connectome-project-openaccess), or other collections of Open Neuroimaging data (datasets.datalad.org), giving you streamlined access to several hundred terabytes of neuroscientific data.

While you can clone datasets ‘as is’ as standalone data packages, you can also link datasets into one another in superdataset-subdataset hierarchies, a process we call “nesting”.

Among several advantages, nesting helps to link datasets together as modular units, and maximizes the potential for reuse of the individual datasets. In the context of data analysis, it is especially helpful to link input data to an analysis dataset – it helps to reuse data in multiple analyses, to link input data in a precise version, and to create an intuitively structured dataset layout.

Let’s get input data for our analysis by cloning some BIDS-structured data under the name input. We make sure to link it to the dataset by running the command inside of the dataset and pointing the -d/--dataset argument to its root - this will register the input data as a subdataset of it:

# clone a remote dataset and register it as a subdataset named "input"
datalad clone -d . \
 https://gin.g-node.org/adswa/bids-data \
 input

The last commit will shed some light on how this linkage works:

git show

It records the dataset’s origin, and importantly, also the dataset’s version state. This allows the analysis dataset to track exactly where the input data came from and which version of the data was used. The subdataset’s own version history is not impacted by this, and you could inspect it to learn how exactly the input dataset evolved.

Data transport

The input dataset contains functional MRI data in BIDS format from a single subject. While we cloned the dataset, you probably noticed that this process did not take long enough to involve downloads of sizeable neuroimaging data. Indeed, after cloning, the resulting dataset typically takes up only a fraction of the total size of the data that it tracks. However, you can browse the directory tree to discover available files:

ls input/sub-02/func

And you can get the file content of files, directories, or entire datasets on demand via the command datalad get (manual):

datalad get input/sub-02

If you don’t need a file anymore, you can drop its content to free up disk space again:

datalad drop input/sub-02

This mechanism gives you access to data without the necessity to store all of the data locally. Your analysis dataset links the exact data it requires in just a few bytes, with actionable access to retrieve the data on demand, and your computer can have access to more data than your hard drive can store.

Digital provenance

Digital provenance is information on how a file came to be and an essential element in the FAIR principles. Version control already captures some digital provenance, such as the date, time, and author of a file or file modification. DataLad can add additional provenance. One useful piece of provenance information is the origin of files.

Imagine that you are getting a script from a colleague to perform your analysis, but they email it to you or upload it to a random place for you to download:

# download a script without provenance information
wget -P code/ \
   https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py

The wget command downloaded a script for extracting a brain mask from the web into a code directory:

datalad status

You can save it into your dataset to have the script ready for your analysis:

datalad save -m "Adding a nilearn-based script for brain masking"

But… in a year’s time, would you remember where you downloaded this from?

Let’s use a DataLad command to download and save a file, and also register the original location of this file internally:

# in addition to a nilearn-based script, let's get a nilearn tutorial
datalad download-url -m "Add a tutorial on nilearn" \
   -O code/nilearn-tutorial.pdf \
   https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf

This command downloads a file from the web, saves it under the provided commit message, and, internally, registers the original location of this file. We will see in a short while how this location provenance information is actionable, and can be used to automatically re-retrieve it.

# download-url spares you a save - the dataset state is already clean
datalad status

A different useful piece of provenance is information on processes that generated or modified files, such as the information that executing a specific script generates a specific figure. DataLad has a set of commands for reproducible execution and re-execution: The datalad run (manual) command can run any command in a way that links the command or script to the results it produces. This provenance, similar to the provenance download-url stores internally, is actionable, and the datalad rerun (manual) command can take this recorded provenance and recompute the command automatically.

Let’s imagine that the script you got from your colleague does not follow the formatting guidelines you typically use, so you let black, a Python code formatter, run over the code to reformat it.

Without DataLad, you would run it like this: black code/get_brainmask.py. But if you wrap it into a basic datalad run command you can capture the changes of the command execution automatically, and record provenance about it:

datalad run -m "Reformat code with black" \
 "black code/get_brainmask.py"

The resulting commit captured the formatting changes:

git show

And the provenance, saved in a structured record in the commit message, allows automatic re-execution:

datalad rerun

Computational reproducibility

We have all the building blocks for a reproducible analysis, so let’s get started. If you are on a system that supports container execution, you can skip the next code block and use datalad containers-run as shown in the important note below.

Otherwise, we’ll stick to datalad run and parameterize it with a few more helpful options. Those are the -i/--input and -o/--output parameters. These flags have two purposes: For one, they add provenance information on inputs and outputs to the structured provenance. More importantly, they help command execution whenever handling annexed files: the contents of --input files will be retrieved prior to command execution, and --output files will be unlocked prior to command execution, allowing changes in the outputs over multiple reruns to save new versions of these files:

datalad run -m "Compute brain mask" \
  --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
  --output "figures/*" \
  --output "sub-02*" \
  "python code/get_brainmask.py"

If you are on a system that supports container execution, you can now use datalad containers-run (manual) in order to perform a containerized and provenance-tracked analysis, executing the script inside of the software environment the container provides. In addition to the arguments of datalad run, datalad containers-run needs a specification of which container should be used. Other than that, the commands get the same arguments:

datalad containers-run -m "Compute brain mask" \
     -n nilearn \
     --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
     --output "figures/*" \
     --output "sub-02*" \
     "python code/get_brainmask.py"
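
Because datalad run and datalad containers-run share their arguments, a script that launches many such analyses can assemble both command lines from the same description. The sketch below only builds the argument list and does not execute anything. build_run_command is a hypothetical helper, while the flags are the ones shown above:

```python
import shlex


def build_run_command(message, command, inputs=(), outputs=(), container=None):
    """Assemble a datalad run / containers-run call (hypothetical helper)."""
    if container is None:
        argv = ["datalad", "run"]
    else:
        # containers-run additionally needs the name of a registered container
        argv = ["datalad", "containers-run", "-n", container]
    argv += ["-m", message]
    for path in inputs:
        argv += ["--input", path]
    for pattern in outputs:
        argv += ["--output", pattern]
    # the command to execute goes last, as a single argument
    argv.append(command)
    return argv


argv = build_run_command(
    "Compute brain mask",
    "python code/get_brainmask.py",
    inputs=["input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz"],
    outputs=["figures/*", "sub-02*"],
    container="nilearn",
)
# shlex.join quotes arguments such as "figures/*" so that the shell does not expand them
print(shlex.join(argv))
```

Leaving out the container argument yields the plain datalad run variant with otherwise identical arguments.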

You can now ask how an individual file came to be…

git log sub-02_brain-mask.nii.gz

… and the computation can be redone automatically and checked for computational reproducibility based on the recorded provenance using datalad rerun:

datalad rerun

Data publication

Afterwards, you could publish your analysis for others to consume or collaborate with you. You can choose from a variety of places, and even if the amount of data you want to share is sizeable, you will likely find a free solution to do it in the chapter Third party infrastructure.

If the annexed files in your repository, e.g., the nilearn tutorial, the figures, or the brain mask file, carry appropriate provenance to either reobtain them from public sources or recompute them automatically, you could even skip the publication of annexed data and use repository hosting services without support for annexed contents. For example, if you have a GitHub account and an SSH key set up, you could run datalad create-sibling-github --access-protocol ssh my-analysis followed by a datalad push to create a sibling repository on GitHub and publish the Git part of your repository to it. To get an overview on publishing datasets, however, it is best to go to Beyond shared infrastructure first, or view one of the many data publication tutorials on YouTube.

Another convenient way is Gin, a free hosting service for DataLad datasets.

First, you need to head over to gin.g-node.org, log in, and upload an SSH key. Then, under your user account, create a new repository, and copy its SSH URL. Step-by-step instructions with screenshots are in the section Walk-through: Dataset hosting on GIN:

datalad create-sibling-gin \
 example-analysis \
 --access-protocol ssh

The Gin repository is now a known sibling dataset to which you can publish data:

datalad siblings

Note that Gin is a particularly handy hosting service because it has annex support. This means that you can publish your complete dataset, including all data, to it in one command:

datalad push --to gin

Your data is now published! If you make your repository public (it is private by default), anyone can clone your dataset via its https URL. If you keep it private, you can invite your collaborators via the Gin web interface.

By the way: Now that your data is stored in a second place, you can drop the local copies to save disk space. If necessary, you can reobtain the data from Gin again via datalad get.

A look under the hood…

Whenever a file’s content is not available after cloning a dataset, this file is internally managed by the second version control tool, git-annex.

Git will never know an annexed file’s content; it will only know its content identity (to ensure data integrity at all times) and all the locations where this file’s content exists. So when you clone a dataset, Git will show you the file name, and datalad get will retrieve the file contents on demand from wherever they are stored.
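
To make the idea of a “content identity” concrete, here is a simplified illustration of a checksum-based key in the style that git-annex uses: a backend name, the file size, a hash of the content, and the file extension. It is not git-annex’s implementation. The function name annex_style_key is made up, the hash algorithm is a parameter because the backend depends on the dataset’s configuration, and details such as extension handling are deliberately simplified:

```python
import hashlib
import tempfile
from pathlib import Path


def annex_style_key(path, algorithm="sha256"):
    """Derive a key from file content only (illustrative, simplified)."""
    path = Path(path)
    content = path.read_bytes()
    digest = hashlib.new(algorithm, content).hexdigest()
    # e.g. "SHA256E" for sha256; the trailing E marks that the extension is kept
    backend = algorithm.upper() + "E"
    return f"{backend}-s{len(content)}--{digest}{path.suffix}"


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "tutorial.txt"
    copy = Path(tmp) / "renamed-copy.txt"
    original.write_bytes(b"same content")
    copy.write_bytes(b"same content")
    # identical content and extension give identical keys, whatever the file name
    print(annex_style_key(original) == annex_style_key(copy))
```

Any change to the content changes the key, which is what allows retrieved file content to be verified against what the dataset recorded.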

Consider the nilearn tutorial we added to the dataset. This file is annexed, and its location information is kept internally. If you run the following command, you will see a list of known file content locations where the content can be re-retrieved from if you drop it locally:

git annex whereis code/nilearn-tutorial.pdf

Just as your dataset can have multiple linked clones (in DataLad’s terms, siblings), each annexed file can have multiple possible registered sources, from web sources, cloud infrastructure, scientific clusters to USB-sticks. This decentralized approach to data management has advantages for data consumers and producers: You can create a resilient, decentralized network where several data sources can provide access even if some sources fail, and regardless of where data is hosted, data retrieval is streamlined and works with the same command. As long as there is one location where data is available from (a dataset on a shared cluster, a web source, cloud storage, a USB-stick, …) and this source is known, there is no need for storing data when it is not in use. Moreover, this mechanism allows you to exert fine-grained access control over files. You can share datasets publicly, but only authorized actors might be able to get certain file contents.
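
The resilience argument can be pictured in a few lines: retrieval succeeds as long as any one registered source can deliver the content. This is a conceptual model with made-up source names and a hypothetical retrieve function, not a description of how git-annex is implemented:

```python
def retrieve(key, sources):
    """Try each registered source in turn until one provides the content."""
    failures = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except OSError as error:
            # remember the failure and fall through to the next source
            failures.append(f"{name}: {error}")
    raise OSError("no source could provide the content: " + "; ".join(failures))


def offline_cluster(key):
    # stands in for a source that is currently unavailable
    raise OSError("cluster unreachable")


def web_mirror(key):
    # stands in for a source that can deliver the content
    return b"file content for " + key.encode()


source, content = retrieve(
    "some-key",
    [("cluster", offline_cluster), ("web", web_mirror)],
)
print(source)
```

The caller uses the same call regardless of which source ends up serving the content, which mirrors how a single datalad get works across hosting locations.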

Cleaning up

The lecture wouldn’t have the term “data management” in its title if we were to leave clutter in your home directory. This gives us the chance to take a look at how to remove files or datasets, which, given that there are version control tools at work that protect your data, can be a challenging task (Spoiler: if you rm a file and save the deletion, the file can be brought back to life easily, and an rm -rf on a dataset with annexed files will cause an explosion of permission errors).

Two commands, datalad drop (manual) and datalad remove (manual), come into play for this. datalad drop is the antagonist of datalad get. By default, everything that drop does can be undone with a get.

You already know that datalad drop drops file contents from the dataset to free up disk space:

datalad drop input/sub-02

But drop can also uninstall subdatasets:

datalad drop --what all input

Importantly, datalad get can find information where that dataset came from and reinstall it:

datalad get --no-data input

In order to permanently wipe a subdataset, you need remove (which internally uses a destructively parametrized drop). remove is the antagonist to clone, and will leave no trace of the dataset:

datalad remove input

However, both commands have built-in security checks. They require that dropped files can be reobtained to prevent accidental data loss, and that removed datasets could be re-cloned in their most recent version from other places, i.e., that there is a sibling that has all revisions that exist locally.

Dropping one of the just computed figures will fail because of this check:

datalad drop figures/sub-02_mean-epi.png

But it can be overridden with the --reckless parameter’s availability mode:

datalad drop figures/sub-02_mean-epi.png --reckless availability

Likewise, removing the top level dataset with remove will fail the availability check:

cd ../
datalad remove -d my-analysis

But it can be overridden the very same way:

datalad remove -d my-analysis --reckless availability
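
The logic of the availability check for file content described above can be pictured as a small decision function. This is a conceptual model only: the function may_drop and the location names are placeholders, and the checks DataLad actually performs are more thorough than a simple comparison of known locations:

```python
def may_drop(known_locations, here="here", reckless=None):
    """Decide whether local file content may be dropped (conceptual model)."""
    if reckless == "availability":
        # the user explicitly accepted the risk of losing the last copy
        return True
    # dropping is safe only if some other location still holds the content
    return any(location != here for location in known_locations)


# a freshly computed figure exists only locally
print(may_drop({"here"}))
# input data that can be re-retrieved from where it was cloned from
print(may_drop({"here", "origin"}))
# the override used in the commands above
print(may_drop({"here"}, reckless="availability"))
```

The first call reports that dropping is refused, while the other two permit it, matching the behaviour of the drop commands in this section.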

And with this, we’re done! Thanks for following along, and reach out with any questions you might have!
