OHBM Brainhack TrainTrack: DataLad

This code belongs to the 2020 OHBM Brainhack TrainTrack session on DataLad. Copy-paste the snippets into your terminal to follow along.

Introduction & set-up

DataLad is a command line tool and it has a Python API. I therefore operate it in a terminal using the command line, or use it in scripts such as shell scripts, Python scripts, Jupyter Notebooks, and so forth. In the command line, this always starts with the general datalad command:

datalad

For example, I can type datalad --help to find out more about the available commands.

To use its Python interface, I import datalad.api as dl:

ipython
import datalad.api as dl
exit

You can find more details about how to install DataLad and its dependencies on all operating systems in the DataLad handbook, in the section Installation and configuration. It also details how to install DataLad on shared machines on which you don’t have administrative privileges (sudo rights), such as high performance compute clusters. If you already have DataLad installed, make sure that it is a recent version, 0.12 or higher:

datalad --version

The very first thing to do if you haven’t done so yet is to configure your Git identity. Don’t worry if you have never used Git. The identity you are configuring consists of your name and email address, so that the changes that you make to a project can be associated with you as their author:

git config --global --add user.name "Adina Wagner"
git config --global --add user.email "adina.wagner@t-online.de"

Creating a dataset is done with the datalad create command. This command only needs a name, and it will subsequently create a new directory under this name and instruct DataLad to manage it. Here, the command also has an additional option, the -c text2git option. With the -c option, datasets can be configured in a certain way at the time of creation. You can find out about the details of the text2git configuration in the DataLad handbook in the section Configurations to go, but in general this configuration is a very useful standard configuration for datasets:

datalad create -c text2git DataLad-101

Right after dataset creation, there is a new directory on the computer called DataLad-101:

cd DataLad-101
ls # ls does not show any output, because the dataset is empty

Datasets have the exciting feature that they can record everything that is done inside of them, version control all content given to DataLad, regardless of the size of this content, and have a complete history that you can interact with. This history is already present, although it is very short at this point in time. Let’s check it out nevertheless.

This history exists thanks to Git. Therefore, you can access the history of a dataset with any tool that shows you Git history. We’ll stay basic and just use Git’s built-in git log command, but you could also use tools with graphical user interfaces if you want to, for example tig:

git log

Version control workflows

I’ll start by creating a books directory with the mkdir command, and then I will download two books from the internet. Here, I’m using the command line tool wget so that everything happens in the command line. But you can also just download the books manually and save them into the dataset with a file manager if you are more comfortable doing it this way. Remember, a dataset is just a directory on your computer:

mkdir books
cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf -O byte-of-python.pdf && cd ../

The tree command can visualize the directory hierarchy:

tree

Use the datalad status command to find out what happened in the dataset. This command is very helpful and reports on the current state of your dataset. Any content that is new or changed will be highlighted. If nothing has changed, datalad status will report what is called a clean dataset state. In general, it is very useful to always have a clean dataset state:

datalad status

Any content that we want DataLad to manage needs to be explicitly given to DataLad; it is not enough to simply put it inside of the dataset. To give new or changed content to DataLad, we need to save it using datalad save. This is the first time that we need to specify a commit message, and this is done with the -m option of the command:

datalad save -m "add books on Python and Unix to read later"

With git log -p -n 1 you can take a look at the most recent commit in the history, including the changes it introduced:

git log -p -n 1

“datalad save” saved all untracked contents to the dataset. Sometimes this is inconvenient. One great advantage of a dataset’s history is that it allows you to revert changes you are not happy with, but this is only easily possible in units of single commits. So if one save commits several unrelated files or changes, they are hard to disentangle if you ever want to revert some of those changes. If you provide a path to the file you want to save, however, you can specify more precisely what will be saved together:

cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../
datalad status

Attach a path to the next datalad save command:

datalad save -m "add reference book about git" books/progit.pdf
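Why single commits matter for reverting can be tried out safely in a throwaway repository. The following is only a minimal sketch and not part of the original session: it uses plain Git instead of DataLad so that it runs anywhere, and the temporary directory, the file names, and the identity are made up for illustration:

```shell
# Throwaway repository, so nothing here touches DataLad-101.
demo=$(mktemp -d)
git init -q "$demo"
git -C "$demo" config user.name "Demo User"          # hypothetical identity
git -C "$demo" config user.email "demo@example.com"

# Two unrelated changes, saved as two separate commits
echo "notes on Python" > "$demo/python.txt"
git -C "$demo" add python.txt
git -C "$demo" commit -q -m "add Python notes"

echo "notes on Git" > "$demo/git.txt"
git -C "$demo" add git.txt
git -C "$demo" commit -q -m "add Git notes"

# Undo only the most recent commit; the earlier one is left alone
git -C "$demo" revert --no-edit HEAD > /dev/null

ls "$demo"   # python.txt is still there, git.txt is gone
```

Had both files been saved in one commit, reverting that commit would have removed both of them.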

Let’s take a look at files that are frequently modified, such as code or text. To try this, I will create a file and modify it. I do this with a heredoc, but you can also write the note with an editor of your choice. If you execute this code snippet, make sure you copy-paste everything, starting with cat and ending with the second EOT:

cat << EOT > notes.txt
One can create a new dataset with 'datalad create PATH'.
The dataset is created empty


EOT

datalad status will, as expected, say that there is a new untracked file in the dataset:

datalad status

We can save it with the datalad save command and a helpful commit message. As it’s the only change in the dataset, there is no need to provide a path:

datalad save -m "Add notes on datalad create"

Let’s now add another note to modify this file:

cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

EOT

datalad status no longer reports the file as untracked. Because it now differs from the state it was saved under, it is reported as modified:

datalad status

Let’s save this:

datalad save -m "add note on datalad save"

If you take a look at the history of this file with git log, the history neatly summarizes all of the changes that have been made:

git log -p -n 2

Dataset consumption and nesting

First, create a new subdirectory to be organized:

mkdir recordings

Afterwards, I’ll install the dataset I am interested in, either from a path or a URL. The dataset I want to install lives on GitHub, so in order to get it, I will provide its URL to the datalad clone command. I’m also attaching to this call a path where I want to have it installed. Importantly, I am installing this dataset as a subdataset of DataLad-101; in other words, I will nest the two datasets inside of each other. This is done with the --dataset flag:

datalad clone --dataset . \
https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow

There are new directories in my DataLad-101 dataset, and in these new directories, there are hundreds of mp3 files:

tree -d # we limit the output to directories
cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
ls

Here is the crucial and incredibly handy feature of DataLad datasets: At this point, after cloning, the dataset has small files, for example the README, but larger files in it don’t have any file content yet. It only retrieved what we in a simplified way call file availability metadata and shows that as the file hierarchy in the dataset. So while I can read the file names and find out what the dataset contains, I don’t have the file contents yet. If I tried to play one of the recordings with the vlc player, this would fail:

vlc 2003_11_15__Brian_Eno__The_Long_Now.mp3

This is a curious behavior, but there are many advantages to this. One is speed, and another one is small disk usage. Here is the total size of this dataset:

cd ../ # in longnow/
du -sh  # Unix command to show size of contents

It’s tiny! But we can also find out how large the dataset would be if we had all of its contents with datalad status and the --annex flag. In total, there are more than 15GB of podcasts you now have access to:

datalad status --annex

You can get individual files, groups of files, directories, or datasets with the datalad get command. This command retrieves the content for you:

datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

Content that is already present is not re-retrieved:

datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

If you don’t need the data locally anymore, you can drop the content from your dataset to save disk space:

datalad drop Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

Afterwards, as long as DataLad knows where a file came from, its content can be retrieved again:

datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

Dataset nesting

Let’s take a look into the history of the longnow subdataset: We can see that it has preserved its history completely. This means that the data we retrieved preserved all of its provenance:

git log --reverse

How does this look in the top-level dataset? If we query DataLad-101’s history, there will be no commit about mp3 files or any of the commits we have seen in the subdataset. Instead, we can see that the superdataset recorded the recordings/longnow dataset as a subdataset. This means that it recorded where this dataset came from and what version it is in:

cd ../../
git log -p -n 1

The subproject commit registered the most recent commit of the subdataset, and thus the subdataset version:

cd recordings/longnow
git log --oneline
cd ../../
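The link between the two histories can also be checked directly. The two helper functions below are hypothetical names introduced for this sketch; they only wrap git rev-parse, which prints the commit a superdataset has recorded for a path and the commit a subdataset is currently at:

```shell
# Print the subdataset commit that a superdataset has recorded for a path.
# $1: path to the superdataset, $2: path of the subdataset inside it
recorded_version() {
    git -C "$1" rev-parse "HEAD:$2"
}

# Print the commit a (sub)dataset is currently at.
# $1: path to the dataset
current_version() {
    git -C "$1" rev-parse HEAD
}
```

Run from the root of DataLad-101, recorded_version . recordings/longnow and current_version recordings/longnow should print the same hash right after cloning.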

More on data versioning, nesting, and a glimpse into a reproducible paper

We’ll clone a repository for a paper that shares manuscript, code, and data:

cd ../
datalad clone git@github.com:psychoinformatics-de/paper-remodnav.git

The top-level dataset has many subdatasets. One of them, remodnav, is a dataset that contains the source code for a Python package called remodnav used in eyetracking analyses:

cd paper-remodnav
datalad subdatasets

After cloning a dataset, its subdatasets will be known, but just as content is not yet retrieved for files in datasets, subdatasets of datasets are not yet installed. If I navigate into an uninstalled subdataset it will appear like an empty directory:

cd remodnav
ls

In order to install a subdataset, I use datalad get:

datalad get --recursive --recursion-limit 2 -n .
ls

This command doesn’t only retrieve file contents, but it also installs subdatasets. So if you want to be really lazy, just run datalad get --recursive -n in the root of a dataset to install all subdatasets that are available. The -n option prevents get from downloading any data, so that only subdatasets are installed. Here, the depth of recursion is limited for two reasons. For one, it would take a while to install all subdatasets. For another, the very raw eye tracking dataset contains subject IDs that should not be shared, and therefore this subdataset is not accessible: if you try to install all subdatasets, the source eyetracking data will throw an error, because it is not publicly available.

Afterwards, you can see that the remodnav subdataset also contains further subdatasets. In this case, these subdatasets contain data that is used for testing and validating software performance:

datalad subdatasets

One of the validation data subdatasets came from another lab that shared their data. When I was almost finished with my paper, I found another paper that reported a mistake in this data. The mistake was still present in the data I was using, though. By inspecting the history of this dataset you can see that at one point, I contributed a fix that changed the data:

cd remodnav/tests/data/anderson_etal
git log -n 3

But because I can link subdatasets in precise versions, I can consciously decide and openly record which version of the data I am using, or even test how much my results change by resetting the subdataset to an earlier state or updating it to a more recent version.
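What resetting a subdataset to an earlier state amounts to can be sketched in a throwaway setup. This is an illustration with plain Git only, not the workflow used for the paper: all directory names, file contents, and the identity are made up, and in a real DataLad dataset the final step would be a datalad save in the superdataset:

```shell
# Throwaway superdataset with one nested repository; plain Git only.
super=$(mktemp -d)
git init -q "$super"
git init -q "$super/sub"
for repo in "$super" "$super/sub"; do
    git -C "$repo" config user.name "Demo User"       # hypothetical identity
    git -C "$repo" config user.email "demo@example.com"
done

# Two versions of the data in the nested repository
echo "1,2,3" > "$super/sub/data.csv"
git -C "$super/sub" add data.csv
git -C "$super/sub" commit -q -m "original data"
old=$(git -C "$super/sub" rev-parse HEAD)

echo "1,2,4" > "$super/sub/data.csv"
git -C "$super/sub" commit -q -a -m "fix a value"

# The superdataset first records the newest version ...
git -C "$super" add sub 2>/dev/null
git -C "$super" commit -q -m "register sub at the fixed version"

# ... and is then deliberately moved back to the earlier one
git -C "$super/sub" checkout -q "$old"
git -C "$super" add sub 2>/dev/null
git -C "$super" commit -q -m "use the original data again"

git -C "$super" rev-parse HEAD:sub   # prints the hash stored in $old
```

Both decisions end up as commits in the superdataset, so its history shows when the data version was changed and in which direction.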

Reproducible analyses

Not only can I version control data and consume data with datalad, I can also create datasets with data analyses in a way that my future self and others can easily and automatically recompute what was done:

cd ../../../../ # get out of the paper repository

First, create a new dataset, in this case with the yoda configuration:

datalad create -c yoda myanalysis

This sets up a helpful structure for my dataset with a code directory and some README files, and applies helpful configurations:

cd myanalysis
tree

Read up more about the YODA principles and the yoda configuration in the section YODA: Best practices for data analyses in a dataset.

Next, install input data as a subdataset. For this, I created a dataset with the Iris data and published it on GitHub. Here, we’re installing it into a directory called input:

datalad clone -d . git@github.com:datalad-handbook/iris_data.git input/

The last thing is code to run on the data and produce results. For this, here is a k-nearest neighbors classification analysis script written in Python. You can find this analysis in more detail in the section YODA-compliant data analysis projects:

cat << EOT > code/script.py

import pandas as pd
import seaborn as sns
import datalad.api as dl
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = "input/iris.csv"

# make sure that the data are obtained (get will also install linked sub-ds!):
dl.get(data)

# prepare the data as a pandas dataframe
df = pd.read_csv(data)
attributes = ["sepal_length", "sepal_width", "petal_length","petal_width", "class"]
df.columns = attributes

# create a pairplot to plot pairwise relationships in the dataset
plot = sns.pairplot(df, hue='class', palette='muted')
plot.savefig('pairwise_relationships.png')

# perform a K-nearest-neighbours classification with scikit-learn
# Step 1: split data in test and training dataset (20:80)
array = df.values
X = array[:,0:4]
Y = array[:,4]
test_size = 0.20
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y,
                                                                    test_size=test_size,
                                                                    random_state=seed)
# Step 2: Fit the model and make predictions on the test dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)

# Step 3: Save the classification report
report = classification_report(Y_test, predictions, output_dict=True)
pd.DataFrame(report).transpose().to_csv('prediction_report.csv')

EOT

So far the script is untracked:

datalad status

Let’s save it with a datalad save command and also attach an identifier with the --version-tag flag:

datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py
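A version tag is useful because later commands can refer to the tagged state by name instead of by commit hash. The sketch below illustrates this in a throwaway repository with plain Git; it assumes that the identifier is stored as an ordinary Git tag, and the directory, file names, and identity are made up:

```shell
# Throwaway repository; nothing here touches the myanalysis dataset.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo User"          # hypothetical identity
git -C "$repo" config user.email "demo@example.com"

echo "print('analysis')" > "$repo/script.py"
git -C "$repo" add script.py
git -C "$repo" commit -q -m "add analysis script"
git -C "$repo" tag ready4analysis     # name the current state

echo "result" > "$repo/result.csv"
git -C "$repo" add result.csv
git -C "$repo" commit -q -m "add result"

# Which files changed since the tagged state?
git -C "$repo" diff --name-only ready4analysis HEAD
```

In the myanalysis dataset, git tag should list ready4analysis after the save above, and the same git diff call can be used later to see what was added since the script was ready.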

The challenge DataLad helps me to accomplish is running this script in a way that links the script to the results it produces and the data it was computed from. I can do this with the datalad run command. In principle, it is simple. You start with a clean dataset:

datalad status

Then, give the command you would execute to datalad run, in this case python3 code/script.py. DataLad will take the command, run it, and save all of the changes in the dataset that this leads to under the commit message specified with the -m option. Thus, it associates the script with the results. But it can be even more helpful. Here, we also specify the input data the command needs, and DataLad will get the data beforehand. And we also specify the output of the command. To understand fully what this does, please read the chapters DataLad, Run! and Under the hood: git-annex, but specifying the outputs will allow me to rerun the command later and update outdated results:

datalad run -m "analyze iris data with classification analysis" \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"

DataLad creates a commit in my history. This commit has my commit message as a human readable summary of what was done, it contains the produced output, and it has a machine readable record that contains information on the input data, the results, and the command that was run to create this result:

git log -n 1
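To look at the machine readable part on its own, the commit message can be filtered. The helper below is a hypothetical function written for this sketch. It assumes that the record is a JSON block enclosed by the two "Do not change lines" marker lines that appear in the commit message above; if a DataLad version formats the message differently, the patterns need adjusting:

```shell
# Print only the machine readable record of a run commit.
# $1: path to the dataset, $2: commit to inspect (defaults to HEAD)
run_record() {
    git -C "$1" log -n 1 --format=%B "${2:-HEAD}" |
        # keep the block between the two marker lines ...
        sed -n '/^=== Do not change lines below ===$/,/Do not change lines above/p' |
        # ... and drop the marker lines themselves
        sed '1d;$d'
}
```

Calling run_record . in the dataset right after the datalad run above should print the JSON with the command, inputs, and outputs. For a commit that was not created by datalad run it prints nothing.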

This machine readable record is particularly helpful, because I can now instruct DataLad to rerun this command. I don’t have to memorize what I had done, and people I share my dataset with don’t need to ask me how this result was produced, but can simply let DataLad tell them.

This is done with the datalad rerun command. For this demonstration, I have prepared this analysis dataset and published it to GitHub at github.com/adswa/myanalysis:

cd ../
git clone git@github.com:adswa/myanalysis.git analysis_clone
cd analysis_clone

I can clone this repository and give, for example, the commit hash of the run commit to the datalad rerun command. DataLad will read the machine readable record of what was done and recompute the exact same thing:

datalad rerun 71cb8c5

This allows others to rerun my computations very easily, it spares me the need to remember how I executed my script, and I can ask results where they came from:

git log pairwise_relationships.png

Computational reproducibility

If you don’t have the required Python packages available, running the script and computing the results will fail. In order to be computationally reproducible, I need to attach the software that is necessary for a computation to this execution record:

cd ../myanalysis

The way I can do this is with a DataLad extension called datalad-container. You can install this extension with pip by running pip install datalad-container. This extension allows me to attach software containers such as Singularity images to my dataset and execute my commands inside of these containers. Thus, I can share data, code, code execution, and software.

Here is how this works: First, I attach a software container to my dataset using datalad containers-add with a name for the container (here I call it software) and a URL or path where to find this container, here Singularity Hub. This records the software in the dataset:

datalad containers-add software --url shub://adswa/resources:1

Note: You need to have singularity installed to run this!

Afterwards, rerun the analysis in the software container with the datalad containers-run command. This command works just like the run command before, I only need to specify the container name. If you were to rerun such an analysis, DataLad would not only retrieve the input data but also the software container:

datalad containers-run -m "rerun analysis in container" \
--container-name software \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"

Read more about this in the section Computational reproducibility with software containers.

Done! Thanks for coding along!