5. What you really need to know
DataLad is a data management multitool that can assist you in handling the entire life cycle of digital objects. It is a command-line tool, free and open source, and available for all major operating systems.
This document is the 10,000-foot overview of the important concepts, commands, and capabilities of DataLad. Each section briefly highlights one type of functionality or concept and the associated commands, and the upcoming Basics chapters will demonstrate in detail how to use them.
5.1. DataLad datasets
Every command affects or uses DataLad datasets, the core data structure of DataLad. A dataset is a directory on a computer that DataLad manages.
You can create new, empty datasets from scratch and populate them, or transform existing directories into datasets.
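As a minimal sketch of both routes, the commands below create a new dataset and turn an existing directory into one. The directory names (`my-dataset`, `existing-project`) and the scratch location are illustrative assumptions, and the sketch assumes that DataLad, git-annex, and a configured Git identity are available.

```shell
# Work in a throwaway directory so nothing else on the system is touched
work=$(mktemp -d)
cd "$work"

# Route 1: create a new, empty dataset from scratch
datalad create my-dataset

# Route 2: turn an existing, non-empty directory into a dataset.
# --force is needed because the directory already contains files.
mkdir existing-project
echo "some notes" > existing-project/notes.txt
datalad create --force existing-project

# The pre-existing file is still untracked at this point;
# saving it puts it under version control
datalad save -d existing-project -m "Add pre-existing files"
```

After this, both directories contain a `.datalad` subdirectory, which marks them as datasets.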
5.2. Simplified local version control workflows
Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets.
Thus, you can keep track of revisions of data of any size, and view, interact with or restore any version of your dataset’s history.
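The following sketch shows what such a workflow might look like: two revisions are saved, the history is listed, and an earlier state is inspected. File names and commit messages are made up for illustration. Looking at an old version with plain Git commands is only one possible way to do it, relying on the fact that a dataset is also a Git repository. The sketch assumes a Unix-like system with a configured Git identity.

```shell
work=$(mktemp -d)
datalad create "$work/my-dataset"
cd "$work/my-dataset"

# Add a file and record it in the dataset history
echo "first draft" > notes.txt
datalad save -m "Add notes"

# Add another file and record a second revision
echo "1,2,3" > data.csv
datalad save -m "Add data"

# View the revision history
git log --oneline

# Inspect the previous version of the dataset (data.csv is absent there),
# then return to the most recent version
git checkout --quiet HEAD~1
ls
git checkout --quiet -
```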
5.3. Consumption and collaboration
DataLad lets you consume datasets provided by others, and collaborate with them. You can install existing datasets and update them from their sources, or create sibling datasets that you can publish updates to and pull updates from for collaboration and data sharing.
Additionally, you can get access to publicly available open data collections with the DataLad superdataset ///.
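A rough sketch of the consume-and-update cycle, using a local directory as a stand-in for a dataset provided by someone else (in practice the source would typically be a URL). The names `source-dataset` and `my-copy` are hypothetical, and the `--how merge` option assumes a reasonably recent DataLad version.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for a dataset that somebody else provides
datalad create source-dataset
echo "v1" > source-dataset/README.txt
datalad save -d source-dataset -m "Add README"

# Consume it: clone the dataset from its source (a path here, often a URL)
datalad clone source-dataset my-copy

# File content is retrieved on demand
datalad get -d my-copy my-copy/README.txt

# List the siblings the clone knows about (its source among them)
datalad siblings -d my-copy

# The source gains new content ...
echo "more" > source-dataset/CHANGES.txt
datalad save -d source-dataset -m "Add changelog"

# ... and the clone pulls in the update from its source
datalad update -d my-copy --how merge
```

After the update, `CHANGES.txt` is known to the clone, and its content can again be fetched with `datalad get` when needed.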
5.4. Dataset linkage
Datasets can contain other datasets (subdatasets), nested arbitrarily deep. Each dataset has an independent revision history, but can be registered at a precise version in higher-level datasets. This lets you combine datasets and perform commands recursively across a hierarchy of datasets, and it is the basis for advanced provenance capture abilities.
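Nesting could look like the sketch below, in which a dataset `project` receives a subdataset `raw-data` (both names are placeholders) and a single recursive save records changes in both. It assumes DataLad, git-annex, and a configured Git identity.

```shell
work=$(mktemp -d)
cd "$work"

# Create a superdataset, then a subdataset registered inside it
datalad create project
datalad create -d project project/raw-data

# New content in the subdataset becomes part of the subdataset's own history;
# a recursive save also records the subdataset's new version in the superdataset
echo "1,2,3" > project/raw-data/values.csv
datalad save -d project -r -m "Add values"

# List the subdatasets registered in the superdataset
datalad subdatasets -d project
```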
5.5. Full provenance capture and reproducibility
DataLad lets you capture full provenance: the origin of datasets, the origin of files obtained from web sources, and complete machine-readable, automatically reproducible records of how files were created (including software environments).
You or your collaborators can thus re-obtain or reproducibly recompute content with a single command, and make use of extensive provenance of dataset content (who created it, when, and how?).
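To illustrate the recompute part, here is a small sketch in which a toy command is executed and recorded with `datalad run`, and later re-executed with `datalad rerun`. The dataset name, the output file, and the command itself are invented for the example; capturing software environments is not covered here.

```shell
work=$(mktemp -d)
datalad create "$work/analysis"
cd "$work/analysis"

# Execute a command and record it together with the file it produces.
# Declaring the output lets DataLad prepare the file for modification on reruns.
datalad run -m "Compute result" --output result.txt "echo 42 > result.txt"

# The commit message holds a machine-readable record of the command
git log -1

# Re-execute the command recorded in the most recent commit
datalad rerun
```

Because the toy command is deterministic, the rerun reproduces the same file content.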
5.6. Third party service integration
Export datasets to third-party services such as GitHub, GitLab, or Figshare with built-in commands.
Alternatively, DataLad datasets are compatible with a multitude of other third-party services, such as Dropbox, Google Drive, Amazon S3, ownCloud, and many more.
5.7. Metadata handling
Extract, aggregate, and query dataset metadata. This lets you automatically obtain metadata according to different metadata standards (EXIF, XMP, ID3, BIDS, DICOM, NIfTI1, …), store this metadata in a portable format, share it, and search dataset contents.
5.8. All in all…
You can use DataLad for a variety of use cases. At its core, it is a domain-agnostic and self-effacing tool: DataLad lets you improve your data management without custom data structures or the need for central infrastructure or third-party services. If you are interested in more high-level information on DataLad, you can find answers to common questions in the section Frequently Asked Questions, and a concise command cheat sheet in the section DataLad cheat sheet.
But enough of the introduction now: let’s dive into the Basics.