What you really need to know

DataLad is a data management multitool that can assist you in handling the entire life cycle of digital objects. It is a command-line tool, free and open source, and available for all major operating systems.

This document is the 10,000-foot overview of the important concepts, commands, and capabilities of DataLad. Each section briefly highlights one type of functionality or concept and the associated commands, and the upcoming Basics chapters demonstrate in detail how to use them.

DataLad datasets

Every command affects or uses DataLad datasets, the core data structure of DataLad. A dataset is a directory on a computer that DataLad manages.

You can create new, empty datasets from scratch and populate them, or transform existing directories into datasets.
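As a rough sketch of both routes, the snippet below assembles the corresponding `datalad create` calls as argument lists and only prints them; nothing is executed unless `dry_run=False` is passed. The directory names `my-dataset` and `existing-project` are placeholders, and `run_all` is an illustrative helper, not part of DataLad.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


create_commands = [
    # Create a new, empty dataset in a directory called my-dataset (placeholder name).
    ["datalad", "create", "my-dataset"],
    # Turn an existing, already populated directory into a dataset (placeholder name).
    ["datalad", "create", "--force", "existing-project"],
]

printed = run_all(create_commands)
```

Files that already exist in a directory converted with `--force` remain untracked until they are saved.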

Simplified local version control workflows

Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets.

Thus, you can keep track of revisions of data of any size, and view, interact with or restore any version of your dataset’s history.
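A minimal local workflow might look like the following dry-run sketch. It assumes the commands are run from inside a dataset (the hypothetical `my-dataset`) and that `data/raw.dat` is a placeholder file; inspecting and restoring history uses plain Git here, which works because a dataset is also a Git repository.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


history_commands = [
    # Show which files are new or modified.
    ["datalad", "status"],
    # Record the current state of one file with a commit message.
    ["datalad", "save", "-m", "Add raw measurements", "data/raw.dat"],
    # List the revisions that touched this file.
    ["git", "log", "--oneline", "--", "data/raw.dat"],
    # Bring back the file as it was one commit earlier.
    ["git", "checkout", "HEAD~1", "--", "data/raw.dat"],
]

# cwd only matters when dry_run=False; my-dataset is a placeholder.
printed = run_all(history_commands, cwd="my-dataset")
```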

Consumption and collaboration

DataLad lets you consume datasets provided by others, and collaborate with them. You can install existing datasets and update them from their sources, or create sibling datasets that you can publish updates to and pull updates from for collaboration and data sharing.

Additionally, you can get access to publicly available open data collections with the DataLad superdataset ///.
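The consumption and collaboration steps described in this section could be sketched as below. The `example.com` URLs, the sibling name `shared`, and all paths are hypothetical; `datalad update --how merge` assumes a reasonably recent DataLad release. The helper only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


# Obtain datasets: someone else's dataset (hypothetical URL) and the
# DataLad superdataset, addressed by the shortcut ///.
clone_commands = [
    ["datalad", "clone", "https://example.com/their-dataset.git", "their-dataset"],
    ["datalad", "clone", "///"],
]

# Commands meant to be run from inside the clone.
collaborate_commands = [
    # Retrieve the content of one file on demand.
    ["datalad", "get", "data/raw.dat"],
    # Fetch and merge updates from the dataset's source.
    ["datalad", "update", "--how", "merge"],
    # Register another location as a sibling, then publish to it
    # (this requires write access to that location).
    ["datalad", "siblings", "add", "-s", "shared",
     "--url", "https://example.com/shared.git"],
    ["datalad", "push", "--to", "shared"],
]

printed = run_all(clone_commands) + run_all(collaborate_commands, cwd="their-dataset")
```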

Dataset linkage

Datasets can contain other datasets (subdatasets), nested arbitrarily deep. Each dataset has an independent revision history, but can be registered at a precise version in higher-level datasets. This allows you to combine datasets and to perform commands recursively across a hierarchy of datasets, and it is the basis for advanced provenance capture abilities.
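To make nesting concrete, here is a dry-run sketch of commands that register subdatasets and then operate recursively. It assumes they are run from the root of a superdataset (the hypothetical `my-dataset`); the URL and the names `inputs` and `derivatives` are placeholders.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


nesting_commands = [
    # Clone an existing dataset and register it as a subdataset of the current one.
    ["datalad", "clone", "-d", ".", "https://example.com/inputs.git", "inputs"],
    # Create a new, empty subdataset inside the current dataset.
    ["datalad", "create", "-d", ".", "derivatives"],
    # Install all registered subdatasets recursively, without file content.
    ["datalad", "get", "-n", "-r", "."],
    # Save changes across the whole hierarchy in one go.
    ["datalad", "save", "-r", "-m", "Update all nested datasets"],
]

printed = run_all(nesting_commands, cwd="my-dataset")
```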

Full provenance capture and reproducibility

DataLad allows you to capture full provenance: the origin of datasets, the origin of files obtained from web sources, and complete machine-readable and automatically reproducible records of how files were created (including software environments).

You or your collaborators can thus re-obtain or reproducibly recompute content with a single command, and make use of extensive provenance of dataset content (who created it, when, and how?).
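The following dry-run sketch illustrates one way such records can come about: a file is downloaded with its origin recorded, a computation is wrapped in `datalad run`, and `datalad rerun` repeats the most recently recorded command. The URL, the script `code/summarize.py`, and the file names are hypothetical, and the commands are assumed to be run from inside a dataset.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


provenance_commands = [
    # Download a file and record where it came from.
    ["datalad", "download-url", "-m", "Add input data",
     "https://example.com/data.csv"],
    # Execute a command and record its inputs, outputs, and exact invocation.
    ["datalad", "run", "-m", "Summarize data",
     "--input", "data.csv", "--output", "summary.txt",
     "python code/summarize.py data.csv summary.txt"],
    # Re-execute the command recorded in the most recent commit.
    ["datalad", "rerun"],
]

printed = run_all(provenance_commands, cwd="my-dataset")
```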

Third party service integration

Export datasets to third party services such as GitHub, GitLab, or Figshare with built-in commands.

Alternatively, DataLad datasets are compatible with a multitude of other third party services, such as Dropbox, Google Drive, Amazon S3, ownCloud, and many more.
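For the built-in exports mentioned at the start of this section, a dry-run sketch could look as follows. The repository name `my-dataset` is a placeholder, the sibling name `github` is set explicitly, and the commands are assumed to be run from inside a dataset with valid credentials for the respective service.

```python
import shlex
import subprocess


def run_all(commands, cwd=None, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    lines = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, lines):
        print(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return lines


export_commands = [
    # Create a repository on GitHub and register it as a sibling named "github".
    ["datalad", "create-sibling-github", "-s", "github", "my-dataset"],
    # Publish the dataset to that sibling.
    ["datalad", "push", "--to", "github"],
    # Export the dataset's content as an archive to Figshare.
    ["datalad", "export-to-figshare"],
]

printed = run_all(export_commands, cwd="my-dataset")
```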

Metadata handling

Extract, aggregate, and query dataset metadata. This allows you to automatically obtain metadata according to different metadata standards (EXIF, XMP, ID3, BIDS, DICOM, NIfTI1, …), store this metadata in a portable format, share it, and search dataset contents.

All in all…

You can use DataLad for a variety of use cases. At its core, it is a domain-agnostic and self-effacing tool: DataLad allows you to improve your data management without custom data structures or the need for central infrastructure or third party services. If you are interested in more high-level information on DataLad, you can find answers to common questions in the section Frequently Asked Questions, and a concise command cheat sheet in the section DataLad cheat sheet.

But enough of the introduction now: let’s dive into the Basics!