The DataLad Handbook

Virtual directory tree of a nested DataLad dataset

Welcome to the DataLad handbook!

This handbook is a living resource about why and – more importantly – how to use DataLad. It aims to provide novices and advanced users of all backgrounds with both the basics of DataLad and start-to-end use cases of specific applications. If you want to get hands-on experience and learn DataLad, the Basics part of this book will teach you. If you want to know what is possible, the use cases will show you. And if you want to help others get started with DataLad, the companion repository provides free and open-source teaching material tailored to the handbook.

Before you read on, please note that the handbook is based on DataLad version 0.12. If you do not currently have DataLad 0.12 or higher installed, the section Installation and configuration will set you up with what you need. If you’re new here, please start the handbook here.


The handbook is currently in beta. If you are willing to provide feedback on its contents, please get in touch.

Basics 1 – DataLad datasets

Basics 2 – DataLad, Run!

How DataLad records provenance of dataset modifications

Basics 3 – Under the hood: git-annex

A closer look at how and why things work

Basics 4 – Collaboration

Basics 5 – Tuning datasets to your needs

Various types and methods for dataset configurations

Basics 6 – Make the most out of datasets

Basics 7 – One step further

Basics 8 – Help yourself

Basics 9 – Third party infrastructure

Leverage third party services to share datasets

Basics 10 – Further options

Small pieces of advice and helpful additional options


Use case I – Collaboration

Bob uses public data for a data analysis project and collaborates with Alice to develop his analysis scripts.

Use case II – Provenance

Track the provenance of dataset contents, from their origin in web sources to the scripts or commands that used, modified, or produced them.

Use case III – Reproducible Research Objects

Share code, data, and computational workflows of your science, and allow others to not only recompute your results, but also build a PDF manuscript from scratch.

Use case IV – Supervision

Use DataLad in a supervision setup and benefit from easy data management workflows and automatic log keeping.

Use case V – Reproducible Analysis

A thorough demonstration of data management and data analysis practices to create reproducible (neuro-)science.

Use case VI – Infrastructure

An implementation of a domain-agnostic data store that enables scalable scientific computing.