OHBM Brainhack TrainTrack: DataLad¶
This code belongs to the 2020 OHBM Brainhack Traintrack session on DataLad. Copy-paste them into your terminal to follow along.
Introduction & set-up¶
DataLad is a command line tool and it has a Python API. Whenever I use it,
I thus operate it in a terminal using the command line, or I use it in scripts
such as shell scripts, Python scripts, Jupyter Notebooks, and so forth.
In the command line, this always starts with the general
For example, I can type
datalad --help to find out more about
the available commands.
To use its python interface, I import the
datalad.api as dl:
ipython import datalad.api as dl exit
You can find more details about how to install DataLad and its dependencies on all operating systems in the DataLad handbook, in the section Installation and configuration. It also details how to install DataLad on shared machines that you don’t have administrative privileges (sudo rights) on, such as high performance compute clusters. If you already have datalad installed, make sure that it is a recent version, at least 0.12 or higher:
The very first thing to do if you haven’t done so yet is to configure your Git identity. Don’t worry if you have never used Git. The identity you are configuring consists of your name and email-adress so that the changes that you do to a project can be associated with you as an author of the changes:
git config --global --add user.name "Adina Wagner" git config --global --add user.email "email@example.com"
Creating a dataset is done with the
datalad create command. This command only needs a name, and
it will subsequently create a new directory under this name and instruct DataLad
to manage it. Here, the command also has an additional option, the -c text2git
option. With the -c option, datasets can be configured in a certain way at the
time of creation. You can find out about the details of the text2git configuration
in the datalad handbook in sections Configurations to go, but in general this configuration is a very useful
standard configuration for datasets:
datalad create -c text2git DataLad-101
Right after dataset creation, there is a new directory on the computer called
cd DataLad-101 ls # ls does not show any output, because the dataset is empty
Datasets have the exciting features that they can record everything that is done inside of them, version control all content given to Datalad, regardless of the size of this content, and have a complete history that you can interact with. This history is already present, although it is very short at this point in time. Let’s check it out nevertheless.
This history exists thanks to Git. Therefore, you can access the history of a
dataset with any tool that shows you git
history. We’ll stay basic and just use gits built-in
git log command, but you
could also use tools with graphical user interfaces if you want to, for example
Version control workflows¶
I’ll start by creating a books directory with the mkdir command, and then I will download two books from the internet. Here, I’m using the command line tool wget to do this in order to do everything from the commandline. But you can also just download the book manually and save it into the dataset with a file manager if you are more comfortable doing it this way. Remember, a dataset is just a directory on your computer:
mkdir books cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf -O byte-of-python.pdf && cd ../
The tree command can visualize the directory hierarchy:
Use the datalad status command to find out what happened in the dataset. This command is very helpful and reports on the current state of your dataset. Any content that is new or changed will be highlighted. If nothing has changed, a datalad status will report what is called a clean dataset state. And in general it is very useful to always have a clean dataset state:
Any content that we want DataLad to manage needs to be explicitly given to DataLad, it is not enough to simply put it inside of the dataset. To give new or changed content to DataLad, we need to save it using datalad save. This is the first time that we need to specify a commit message, and this is done with the -m option of the command:
datalad save -m "add books on Python and Unix to read later"
With git log -n 1 you can take a look at the most recent commit in the history:
git log -p -n 1
“datalad save” saved all untracked contents to the dataset. Sometimes this is inconvenient. One great advantage of a datasets history is that it allows you to revert changes you are not happy with, but this is only easily possible in the units of single commits. So if one save commits several unrelated files or changes, they are hard to disentangle if you ever want to revert some of those changes. But if you for example provide a path to the file you want to save you can specify more precisely what will be saved together:
cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../ datalad status
Attach a path to the next
datalad save command:
datalad save -m "add reference book about git" books/progit.pdf
Let’s take a look at files that are
frequently modified such as code or text. To try this, I will create a file
and modify it. I do this with a here doc,
but you can also write the note with an editor of your choice. If you execute this
code snippet, make sure you copy-paste everything, starting with
cat and ending
with the second
cat << EOT > notes.txt One can create a new dataset with '"'"'datalad create PATH'"'"'. The dataset is created empty EOT
Datalad status will, as expected, say that there is a new untracked file in the dataset:
We can save it with datalad save command and a helpful commit message. As its the only change in the dataset, there is no need to provide a path:
datalad save -m "Add notes on datalad create"
Let’s now add another note to modify this file:
cat << EOT >> notes.txt The command "datalad save [-m] PATH" saves the file (modifications) to history. Note to self: Always use informative, concise commit messages. EOT
A datalad status reports the file not to be untracked, but because it differs now from the state it was saved under it is reported to be modified:
Let’s save this:
datalad save -m "add note on datalad save"
If you take a look at the history of this file with git log, the history neatly summarizes all of the changes that have been done:
git log -p -n 2
Dataset consumption and nesting¶
First, create a new subdirectory to be organized:
Afterwards, I’ll install the dataset I am interested in, either from a path or a URL. The dataset I want to install lives on GitHub, so in order to get it, I will provide its URL to the datalad clone command. I’m also attaching a path to where I want to have it installed to this call. Importantly I am installing this dataset as a subdataset of DataLad-101, in other words I will nest the two datasets inside of each other. This is done with the –dataset flag:
datalad clone --dataset . \ https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
There are new directories in my DataLad/101 dataset, and in these new directories, there are hundreds of mp3 files:
tree -d # we limit the output to directories cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking ls
here is the crucial and incredibly handy feature of DataLad datasets: At this point, after cloning, the dataset has small files, for example the README, but larger files in it don’t have any file content yet. It only retrieved what we in a simplified way call file availability metadata and shows that as the file hierarchy in the dataset. So while I can read the file names and find out what the dataset contains, I don’t have the file contents yet. If I would try to play one of the recordings with the vlc player, this would fail:
This is a curious behavior, but there are many advantages to this. One is speed, and another one is small disk usage. Here is the total size of this dataset:
cd ../ # in longnow/ du -sh # Unix command to show the size of contents
It’s tiny! But we can also find out how large the dataset would be if we had all of its contents with datalad status and the –annex flag. In total, there are more than 15GB of podcasts you now have access to:
datalad status --annex
You can get individual or groups of files, directories, or datasets with the datalad get command. This command retrieves the content for you:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
Content that is already present is not re-retrieved:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
If you don’t need the data locally anymore you can drop the content from your dataset to save disk space:
datalad drop Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3
Afterwards, as long as DataLad knows where a file came from, its content can be retrieved again:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3
Let’s take a look into the history of the longnow subdataset: We can see that it has preserved its history completely. This means that the data we retrieved preserved all of its provenance:
git log --reverse
How does this look in the top-level dataset? If we query DataLad-101s history, there will be no commit about mp3 files or any of the commits we have seen in the subdataset. Instead, we can see that the super dataset recorded the recordings|longnow dataset as a subdataset. This means that it recorded where this dataset came from and what version it is in:
cd ../../ git log -p -n 1
The subproject commit registered the most recent commit of the subdataset, and thus the subdataset version:
cd recordings/longnow git log --oneline cd ../../
More on data versioning, nesting, and a glimpse into reproducible paper¶
We’ll clone a repository for a paper that shares the manuscript, code, and data:
cd ../ datalad clone firstname.lastname@example.org:psychoinformatics-de/paper-remodnav.git
The top-level dataset has many subdatasets. One of it, remodnav, is a dataset that contains the source code for a Python package called remodnav used in eye tracking analyses:
cd paper-remodnav datalad subdatasets
After cloning a dataset, its subdatasets will be known, but just as content is not yet retrieved for files in datasets, subdatasets of datasets are not yet installed. If I navigate into an uninstalled subdataset it will appear like an empty directory:
cd remodnav ls
In order to install a subdataset, I use datalad get:
datalad get --recursive --recursion-limit 2 -n . ls
This command doesn’t only retrieve file contents, but it also installs subdatasets. So if you want to be really lazy, just run datalad get –recursive -n in the root of a dataset to install all subdatasets that are available. The -n option prevents get from downloading any data, so that only subdataset are installed, but no data is downloaded. Here, the depth of recursion is limited. For one, it would take a while to install all subdatasets, but the very raw eye tracking dataset contains subject IDs that should not be shared, and therefore, this subdataset is not accessible - if you try to install all subdatasets, the source eyetracking data will throw an error, because it is not made publicly available.
Afterwards, you can see that the remodnav subdataset also contains further subdatasets. In this case, these subdatasets contain data that is used for testing and validating software performance:
One of the validation data subdatasets came form another lab that shared their data. After I was almost finished with my paper, I found another paper that reported a mistake in this data. The mistake was still present in the data I was using, though. So by inspecting the history of this dataset you can see that at one point, I contributed a fix that changed the data:
cd remodnav/tests/data/anderson_etal git log -n 3
But because I can link subdatasets in precise version I can consciously decide and openly record which version of the data I am using or even test how much my results change by resetting the subdataset to an earlier state or updating the dataset to a more recent version.
Not only can I version control data and consume data with datalad, I can also create datasets with data analyses in a way that my future self and others can easily and automatically recompute what was done:
cd ../../../../ # get out of the paper repository
First, create a new dataset, in this case with the yoda configuration:
datalad create -c yoda myanalysis
This sets up a helpful structure for my dataset with a code directory and some README files, and applies helpful configurations:
cd myanalysis tree
Read up more about the YODA principles and the yoda configuration in the section YODA: Best practices for data analyses in a dataset.
Next, install input data as a subdataset. For this, I created a
dataset with the Iris data and published it on Github. Here, we’re installing it
into a directory
datalad clone -d . email@example.com:datalad-handbook/iris_data.git input/
The last thing is code to run on the data and produce results. For this, here is a k-means classification analysis script written in Python. You can find this analysis in more detail in the section YODA-compliant data analysis projects:
cat << EOT > code/script.py import pandas as pd import seaborn as sns import datalad.api as dl from sklearn import model_selection from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import classification_report data = "input/iris.csv" # make sure that the data are obtained (get will also install linked sub-ds!): dl.get(data) # prepare the data as a pandas dataframe df = pd.read_csv(data) attributes = ["sepal_length", "sepal_width", "petal_length","petal_width", "class"] df.columns = attributes # create a pairplot to plot pairwise relationships in the dataset plot = sns.pairplot(df, hue='"'"'class'"'"', palette='"'"'muted'"'"') plot.savefig('"'"'pairwise_relationships.png'"'"') # perform a K-nearest-neighbours classification with scikit-learn # Step 1: split data in test and training dataset (20:80) array = df.values X = array[:,0:4] Y = array[:,4] test_size = 0.20 seed = 7 X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed) # Step 2: Fit the model and make predictions on the test dataset knn = KNeighborsClassifier() knn.fit(X_train, Y_train) predictions = knn.predict(X_test) # Step 3: Save the classification report report = classification_report(Y_test, predictions, output_dict=True) df_report = pd.DataFrame(report).transpose().to_csv('"'"'prediction_report.csv'"'"') EOT
So far the script is untracked:
Let’s save it with a datalad save command and also attach an identifier with the
datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py
The challenge DataLad helps me to accomplish is running this script in a way that links the script to the results it produces and the data it was computed from. I can do this with the datalad run command. In principle, it is simple. You start with a clean dataset:
Then, give the command you would execute to datalad run, in this case
Datalad will take the command, run it, and save all of the changes in the
dataset that leads it to under the commit message specified with
the -m option. Thus, it associates the script with the results.
But it can be even more helpful. Here, we also specify the input data the command
needs and datalad will get the data beforehand. And we also specify the output
of the command. To understand fully what this does, please read chapters
DataLad, Run! and Under the hood: git-annex, but specifying the outputs will allow me later to rerun
the command and let me update outdated results:
datalad run -m "analyze iris data with classification analysis" \ --input "input/iris.csv" \ --output "prediction_report.csv" \ --output "pairwise_relationships.png" \ "python3 code/script.py"
Datalad creates a commit in my history. This commit has my commit message as a human readable summary of what was done, it contains the produced output, and it has a machine readable record that contains information on the input data, the results, and the command that was ran to create this result:
git log -n 1
This machine readable record is particularly helpful, because I can now instruct datalad to rerun this command so that I don’t have to memorize what I had done and people I share my dataset with don’t need to ask me how this result was produced, by can simply let DataLad tell them.
This is done with the
datalad rerun command. For this demonstration, I have
prepared this analysis dataset and published it to GitHub at
cd ../ git clone firstname.lastname@example.org:adswa/myanalysis.git analysis_clone
I can clone this repository and give for example the checksum of the run command
datalad rerun command. DataLad will read the machine readable record of
what was done and recompute the exact same thing:
datalad rerun 71cb8c5
This allows others to very easily rerun my computations, but it also spares me the need to remember how I executed my script, and I can ask for results where they came from:
git log pairwise_relationships.png
If you don’t have the required python packages available, running the script and computing the results will fail. In order to be computationally reproducible I need to attach the software that is necessary for computation to this execution record:
And the way I can do this is with a datalad extension called datalad containers.
You can install this extension with pip by running
pip install datalad-containers.
This extension allows attaching software containers such as singularity
images to my dataset and execute my commands inside of these containers. Thus, I
can share data, code, code execution, and software.
Here is how this works: First, I attach a software container to my dataset using
datalad containers-add with a name of the container (here I call it
and a url or path where to find this container, here it is singularity hub. This
records the software in the dataset:
datalad containers-add software --url shub://adswa/resources:2
Note: You need to have singularity installed to run this!
Afterwards, rerun the analysis in the software container with the
datalad containers-run command. This container works just as the run command before, I
only need to specify the container name. If you were to rerun such an analysis,
datalad would not only retrieve the input data but also the software container:
datalad containers-run -m "rerun analysis in container" \ --container-name software \ --input "input/iris.csv" \ --output "prediction_report.csv" \ --output "pairwise_relationships.png" \ "python3 code/script.py"
Read more about this in the section Computational reproducibility with software containers.
Done! Thanks for coding along!