Frequently Asked Questions

This section answers frequently asked questions about high-level DataLad concepts or commands. If you have a question you want to see answered in here, please create an issue or a pull request.

What is Git?

Git is a free and open source distributed version control system. In a directory that is initialized as a Git repository, it can track small-sized files and the modifications done to them. Git thinks of its data like a series of snapshots – it basically takes a picture of what all files look like whenever a modification in the repository is saved. It is a powerful and yet small and fast tool with many features such as branching and merging for independent development, checksumming of contents for integrity, and easy collaborative workflows thanks to its distributed nature.

DataLad uses Git underneath the hood. Every DataLad dataset is a Git repository, and you can use any Git command within a DataLad dataset. Based on the configurations in .gitattributes, file content can be version controlled by Git or managed by git-annex, based on path pattern, file types, or file size. The section More on DIY configurations details how these configurations work. This chapter gives a comprehensive overview on what Git is.

Where is Git’s “staging area” in DataLad datasets?

As mentioned in Populate a dataset, a local version control workflow with DataLad “skips” the staging area (that is typical for Git workflows) from the user’s point of view.

What is git-annex?

git-annex (https://git-annex.branchable.com/) is a distributed file synchronization system written by Joey Hess. It can share and synchronize large files independent from a commercial service or a central server. It does so by managing all file content in a separate directory (the annex, object tree, or key-value-store in .git/annex/objects/), and placing only file names and metadata into version control by Git. Among many other features, git-annex can ensure sufficient amounts of file copies to prevent accidental data loss and enables a variety of data transfer mechanisms. DataLad uses git-annex underneath the hood for file content tracking and transport logistics. git-annex offers an astonishing range of functionality that DataLad tries to expose in full. That being said, any DataLad dataset (with the exception of datasets configured to be pure Git repositories) is fully compatible with git-annex – you can use any git-annex command inside a DataLad dataset.

The chapter III - Under the hood: git-annex can give you more insights into how git-annex takes care of your data. git-annex’s website can give you a complete walk-through and detailed technical background information.

What does DataLad add to Git and git-annex?

DataLad sits on top of Git and git-annex and tries to integrate and expose their functionality fully. While DataLad thus is a “thin layer” on top of these tools and tries to minimize the use of unique/idiosyncratic functionality, it also tries to simplify working with repositories and adds a range of useful concepts and functions:

  • Both Git and git-annex are made to work with a single repository at a time. For example, while nesting pure Git repositories is possible via Git submodules (that DataLad also uses internally), cleaning up after placing a random file somewhere into this repository hierarchy can be very painful. A key advantage that DataLad brings to the table is that it makes the boundaries between repositories vanish from a user’s point of view. Most core commands have a --recursive option that will discover and traverse any subdatasets and do-the-right-thing. Whereas git and git-annex would require the caller to first cd to the target repository, DataLad figures out which repository the given paths belong to and then works within that repository. datalad save . --recursive will solve the subdataset problem above for example, no matter what was changed/added, no matter where in a tree of subdatasets.

  • DataLad provides users with the ability to act on “virtual” file paths. If software needs data files that are carried in a subdataset (in Git terms: submodule) for a computation or test, a datalad get will discover if there are any subdatasets to install at a particular version to eventually provide the file content.

  • DataLad adds metadata facilities for metadata extraction in various flavors, and can store extracted and aggregated metadata under .datalad/metadata.

  • Todo

    more here.

Does DataLad host my data?

No, DataLad manages your data, but it does not host it. When publishing a dataset with annexed data, you will need to find a place that the large file content can be stored in – this could be a web server, a cloud service such as Dropbox, an S3 bucket, or many other storage solutions – and set up a publication dependency on this location. This gives you all the freedom to decide where your data lives, and who can have access to it. Once this set up is complete, publishing and accessing a published dataset and its data are as easy as if it would lie on your own machine.

Todo

sketch a typical workflow, link a yet to write section on sharing on third party infrastructure

How does GitHub relate to DataLad?

DataLad can make good use of GitHub, if you have figured out storage for your large files otherwise. You can make DataLad publish file content to one location and afterwards automatically push an update to GitHub, such that users can install directly from GitHub and seemingly also obtain large file content from GitHub. GitHub is also capable of resolving submodule/subdataset links to other GitHub repos, which makes for a nice UI.

What is the difference between a superdataset, a subdataset, and a dataset?

Conceptually and technically, there is no difference between a dataset, a subdataset, or a superdataset. The only aspect that makes a dataset a sub- or superdataset is whether it is registered in another dataset (by means of an entry in the .gitmodules, automatically performed upon an appropriate datalad install -d or datalad create -d command) or contains registered datasets.

How can I convert/import/transform an existing Git or git-annex repository into a DataLad dataset?

You can transform any existing Git or git-annex repository of yours into a DataLad dataset by running:

$ datalad create -f

inside of it. Afterwards, you may want to tweak settings in .gitattributes according to your needs (see sections DIY configurations and More on DIY configurations for additional insights on this).

How can I cite DataLad?

There is no official paper on DataLad (yet). To cite it, please use the latest zenodo entry found here: https://zenodo.org/record/3512712.

How can I help others get started with a shared dataset?

If you want to share your dataset with users that are not already familiar with DataLad, it is helpful to include some information on how to interact with DataLad datasets in your dataset’s README (or similar) file. Below, we provide a standard text block that you can use (and adapt as you wish) for such purposes:

Find out more: Textblock in .rst format:

DataLad datasets and how to use them
------------------------------------

This repository is a `DataLad <https://www.datalad.org/>`__ dataset. It provides
fine-grained data access down to the level of individual files, and allows for
tracking future updates. In order to use this repository for data retrieval,
`DataLad <https://www.datalad.org>`_ is required.
It is a free and open source command line tool, available for all
major operating systems, and builds up on Git and `git-annex
<https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and
version controlling collections of large files. You can find information on
how to install DataLad at `handbook.datalad.org/en/latest/intro/installation.html
<http://handbook.datalad.org/en/latest/intro/installation.html>`_.

Get the dataset
^^^^^^^^^^^^^^^

A DataLad dataset can be ``cloned`` by running::

   datalad clone <url>

Once a dataset is cloned, it is a light-weight directory on your local machine.
At this point, it contains only small metadata and information on the
identity of the files in the dataset, but not actual *content* of the
(sometimes large) data files.

Retrieve dataset content
^^^^^^^^^^^^^^^^^^^^^^^^

After cloning a dataset, you can retrieve file contents by running::

   datalad get <path/to/directory/or/file>

This command will trigger a download of the files, directories, or
subdatasets you have specified.

DataLad datasets can contain other datasets, so called *subdatasets*. If you
clone the top-level dataset, subdatasets do not yet contain metadata and
information on the identity of files, but appear to be empty directories. In
order to retrieve file availability metadata in subdatasets, run::

   datalad get -n <path/to/subdataset>

Afterwards, you can browse the retrieved metadata to find out about
subdataset contents, and retrieve individual files with ``datalad get``. If you
use ``datalad get <path/to/subdataset>``, all contents of the subdataset will
be downloaded at once.

Stay up-to-date
^^^^^^^^^^^^^^^

DataLad datasets can be updated. The command ``datalad update`` will *fetch*
updates and store them on a different branch (by default
``remotes/origin/master``). Running::

   datalad update --merge

will *pull* available updates and integrate them in one go.

More information
^^^^^^^^^^^^^^^^

More information on DataLad and how to use it can be found in the DataLad Handbook at
`handbook.datalad.org <http://handbook.datalad.org/en/latest/index.html>`_. The
chapter "DataLad datasets" can help you to familiarize yourself with the
concept of a dataset.

Find out more: Textblock in markdown format

[![made-with-datalad](https://www.datalad.org/badges/made_with.svg)](https://datalad.org)

## DataLad datasets and how to use them

This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
fine-grained data access down to the level of individual files, and allows for
tracking future updates. In order to use this repository for data retrieval,
[DataLad](https://www.datalad.org/) is required. It is a free and
open source command line tool, available for all major operating
systems, and builds up on Git and [git-annex](https://git-annex.branchable.com/)
to allow sharing, synchronizing, and version controlling collections of
large files. You can find information on how to install DataLad at
[handbook.datalad.org/en/latest/intro/installation.html](http://handbook.datalad.org/en/latest/intro/installation.html).

### Get the dataset

A DataLad dataset can be `cloned` by running

```
datalad clone <url>
```

Once a dataset is cloned, it is a light-weight directory on your local machine.
At this point, it contains only small metadata and information on the
identity of the files in the dataset, but not actual *content* of the
(sometimes large) data files.

### Retrieve dataset content

After cloning a dataset, you can retrieve file contents by running

```
datalad get <path/to/directory/or/file>`
```

This command will trigger a download of the files, directories, or
subdatasets you have specified.

DataLad datasets can contain other datasets, so called *subdatasets*.
If you clone the top-level dataset, subdatasets do not yet contain
metadata and information on the identity of files, but appear to be
empty directories. In order to retrieve file availability metadata in
subdatasets, run

```
datalad get -n <path/to/subdataset>
```

Afterwards, you can browse the retrieved metadata to find out about
subdataset contents, and retrieve individual files with `datalad get`.
If you use `datalad get <path/to/subdataset>`, all contents of the
subdataset will be downloaded at once.

### Stay up-to-date

DataLad datasets can be updated. The command `datalad update` will
*fetch* updates and store them on a different branch (by default
`remotes/origin/master`). Running

```
datalad update --merge
```

will *pull* available updates and integrate them in one go.

### More information

More information on DataLad and how to use it can be found in the DataLad Handbook at
[handbook.datalad.org](http://handbook.datalad.org/en/latest/index.html). The chapter
"DataLad datasets" can help you to familiarize yourself with the concept of a dataset.

Find out more: Textblock without formatting

DataLad datasets and how to use them

This repository is a DataLad (https://www.datalad.org/) dataset. It provides fine-grained data access down to the level of individual files, and allows for tracking future updates. In order to use this repository for data retrieval, DataLad (https://www.datalad.org/) is required. It is a free and open source command line tool, available for all major operating systems, and builds up on Git and git-annex (https://git-annex.branchable.com/) to allow sharing, synchronizing, and version controlling collections of large files. You can find information on how to install DataLad at http://handbook.datalad.org/en/latest/intro/installation.html.

Get the dataset

A DataLad dataset can be “cloned” by running ‘datalad clone <url>’. Once a dataset is cloned, it is a light-weight directory on your local machine. At this point, it contains only small metadata and information on the identity of the files in the dataset, but not actual content of the (sometimes large) data files.

Retrieve dataset content

After cloning a dataset, you can retrieve file contents by running ‘datalad get <path/to/directory/or/file>’

This command will trigger a download of the files, directories, or subdatasets you have specified.

DataLad datasets can contain other datasets, so called “subdatasets”. If you clone the top-level dataset, subdatasets do not yet contain metadata and information on the identity of files, but appear to be empty directories. In order to retrieve file availability metadata in subdatasets, run ‘datalad get -n <path/to/subdataset>’

Afterwards, you can browse the retrieved metadata to find out about subdataset contents, and retrieve individual files with datalad get. If you use ‘datalad get <path/to/subdataset>’, all contents of the subdataset will be downloaded at once.

Stay up-to-date

DataLad datasets can be updated. The command ‘datalad update’ will “fetch” updates and store them on a different branch (by default ‘remotes/origin/master’). Running ‘datalad update –merge’ will “pull” available updates and integrate them in one go.

More information

More information on DataLad and how to use it can be found in the DataLad Handbook at http://handbook.datalad.org/en/latest/index.html. The chapter “DataLad datasets” can help you to familiarize yourself with the concept of a dataset.

What is the difference between DataLad, Git LFS, and Flywheel?

Flywheel is an informatics platform for biomedical research and collaboration.

Git Large File Storage (Git LFS) is a command line tool that extends Git with the ability to manage large files. In that it appears similar to git-annex.

Todo

this.

A more elaborate delineation from related solutions can be found in the DataLad developer documentation.

DataLad version-controls my large files – great. But how much is saved in total?

Todo

this.

How can I copy data out of a DataLad dataset?

Moving or copying data out of a DataLad dataset is always possible and works in many cases just like in any regular directory. The only caveat exists in the case of annexed data: If file content is managed with git-annex and stored in the object-tree, what appears to be the file in the dataset is merely a symlink (please read section Data integrity for details). Moving or copying this symlink will not yield the intended result – instead you will have a broken symlink outside of your dataset.

When using the terminal command cp1, it is sufficient to use the -L/--dereference option. This will follow symbolic links, and make sure that content gets moved instead of symlinks. With tools other than cp (e.g., graphical file managers), to copy or move annexed content, make sure it is unlocked first: After a datalad unlock copying and moving contents will work fine. A subsequent datalad save in the dataset will annex the content again.

Is there Python 2 support for DataLad?

No, Python 2 support has been dropped in September 2019.

Is there a graphical user interface for DataLad?

No, DataLad’s functionality is available in the command line or via it’s Python API.

Todo

maybe update this by mentioning the DataLad webapp extension

How does DataLad interface with OpenNeuro?

OpenNeuro is a free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data. It publishes hosted data as DataLad datasets on GitHub. The entire collection can be found at github.com/OpenNeuroDatasets. You can obtain the datasets just as any other DataLad datasets with datalad clone or datalad install.

What is the git-annex branch?

If your DataLad dataset contains an annex, there is also a git-annex branch that is created, used, and maintained solely by git-annex. It is completely unconnected to any other branches in your dataset, and contains different types of log files. The contents of this branch are used for git-annex internal tracking of the dataset and its annexed contents. For example, git-annex stores information where file content can be retrieved from in a .log file for each object, and if the object was obtained from web-sources (e.g., with datalad download-url), a .log.web file stores the URL. Other files in this branch store information about the known remotes of the dataset and their description, if they have one. You can find out much more about the git-annex branch and its contents in the documentation. This branch, however, is managed by git-annex, and you should not temper with it.

Footnotes

1

The absolutely amazing Midnight Commander mc can also follow symlinks.