Together with your room mate you have just discovered how to share, update, and collaborate on a DataLad dataset on a shared file system. Thus, you have glimpsed into the principles and advantages of sharing a dataset with a simple example.
To obtain a dataset, one can also use datalad clone with a path. Potential subdatasets will not be installed right away. As they are registered in the superdataset, you can do datalad get -n/--no-data, or specify the -r/--recursive option (datalad get -n -r <subds>) with a sensible -R/--recursion-limit value to install them afterwards.
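As a rough sketch of this workflow, the script below builds a throwaway superdataset with one subdataset, clones it from a path, and then installs the subdataset without retrieving any file content. The directory names (original, sub, copy), the temporary working directory, and the demo Git identity are assumptions made for illustration and are not part of DataLad-101. The script exits early if DataLad is not installed.

```shell
#!/bin/sh
# Sketch: clone a dataset from a path, then install a subdataset without data.
set -eu

# Skip gracefully where DataLad is not available.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }

# Git needs an identity to commit; this one is a placeholder for the demo only.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Work in a throwaway directory (hypothetical layout).
work=$(mktemp -d)
cd "$work"

# A superdataset with one registered subdataset.
datalad create original
datalad create -d original original/sub

# Clone from a path. The subdataset is registered in the clone,
# but its directory is still empty.
datalad clone original copy
cd copy

# -n: install without retrieving file content
# -r: recurse into subdatasets below the given path
# -R 1: limit that recursion to one level
datalad get -n -r -R 1 sub
```

After the last command, copy/sub is an installed dataset, while any annexed file content in it would still need a datalad get.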
The configuration of the original dataset determines which types of files will have their content available right after the installation of the dataset, and which types of files need to be retrieved via datalad get: Any file content stored in Git will be available right away, while all file content that is annexed only has small metadata about its availability attached to it. The original DataLad-101 dataset used the text2git configuration template to store text files such as code/list_titles.sh in Git – these files’ content is therefore available right after installation.
Annexed content can be retrieved via datalad get from the file content sources.
git annex whereis PATH will list all locations known to contain file content for a particular file. These locations are where git-annex will attempt to retrieve file content from, and each is described with the --description provided during a datalad create. It is a very helpful command to find out where file content resides, and how many locations with copies exist.
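To see these two points side by side, here is a minimal sketch that creates a dataset with a description, clones it, asks git-annex where the content of an annexed file lives, and then retrieves it. The file name data.dat, the description text, the directory names, and the demo Git identity are made up for this example. It assumes a dataset created with the default configuration, in which new files are annexed.

```shell
#!/bin/sh
# Sketch: locate and retrieve annexed file content in a clone.
set -eu

# Skip gracefully where DataLad is not available.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }

# Placeholder Git identity for the demo commits.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# The description given here is what whereis later reports for this location.
datalad create --description "original dataset (demo)" original
printf 'some data\n' > original/data.dat
datalad save -d original -m "Add a data file"

datalad clone original copy
cd copy

# Right after cloning, the clone only knows where the content is stored.
git annex whereis data.dat

# Retrieve the content from one of the listed locations.
datalad get data.dat

# The clone itself now shows up as an additional location.
git annex whereis data.dat
```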
A shared copy of a dataset includes the dataset's history. If well made, datalad run commands can then easily be rerun.
Because an installed dataset knows its origin – the place it was originally installed from – it can be kept up-to-date with the datalad update command. This command will query the origin of the dataset for updates, and a datalad update --how merge will integrate these changes into the dataset copy.
Thus, using DataLad, data can be easily shared and kept up to date with only two commands: datalad clone and datalad update.
By configuring a dataset as a sibling, collaboration becomes easy.
To avoid integrating conflicting modifications of a sibling dataset into your own dataset, a datalad update -s SIBLINGNAME will “fetch” modifications and store them on a different branch of your dataset. The commands datalad diff and git diff can subsequently help to find out what changes have been made in the sibling.
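The fetch-inspect-merge sequence could look roughly like the sketch below. It uses origin as the sibling name because that is what a fresh clone already has; a sibling you configured yourself would work the same way under its own name. The directory names, the file notes.txt, and the demo Git identity are assumptions for illustration, and the branch name is looked up from the original dataset rather than hard-coded.

```shell
#!/bin/sh
# Sketch: fetch changes from a sibling, inspect them, then merge them.
set -eu

# Skip gracefully where DataLad is not available.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }

# Placeholder Git identity for the demo commits.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

datalad create original
datalad clone original copy

# Change the original after the clone was made.
echo "a new note" > original/notes.txt
datalad save -d original -m "Add a note"

# Look up the branch name used by the original (e.g. master or main).
branch=$(git -C original rev-parse --abbrev-ref HEAD)

cd copy

# Fetch only: the changes are stored on the sibling's branch,
# and the files in the working tree are left as they were.
datalad update -s origin

# Compare the current state with the fetched branch.
datalad diff --to "remotes/origin/$branch"
git diff --stat HEAD "remotes/origin/$branch"

# Once the changes look fine, integrate them.
datalad update --how merge -s origin
```

Keeping the fetch and the merge as separate steps gives you a chance to look at the sibling's changes before they touch your own branch.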
4.6.1. Now what can I do with that?
Most importantly, you have experienced the first way of sharing and updating a dataset. The example here may strike you as too simplistic, but in later parts of the book you will see examples in which datasets are shared on the same file system in surprisingly useful ways.
Simultaneously, you have observed dataset properties you already knew (for example how annexed files need to be retrieved via datalad get), but you have also seen novel aspects of a dataset – for example that subdatasets are not installed by default, how git annex whereis can help you find out where file content might be stored, how useful commands that capture provenance about the origin or creation of files (such as datalad run or datalad download-url) are, or how a shared dataset can be updated to reflect changes that were made to the original dataset.
Also, you have successfully demonstrated a large number of DataLad dataset principles to your room mate: how content stored in Git is present right away and how annexed content first needs to be retrieved, how easy a datalad rerun is if the original datalad run command was well specified, and how a dataset's history is shared and not only its data.
Lastly, with the configuration of a sibling, you have experienced one way to collaborate in a dataset, and with datalad update --how merge and datalad update, you also glimpsed into more advanced aspects of Git, namely the concept of a branch.
Therefore, these last few sections have hopefully been a good review of what you already knew, but also a big knowledge gain, and cause joyful anticipation of collaboration in a real-world setting of one of your own use cases.