Without pointing it out explicitly, the previous section demonstrated another core principle and feature of DataLad datasets: nesting.
Within DataLad datasets, one can nest other DataLad datasets arbitrarily deep. For example, we just installed one dataset, the longnow podcasts, into another dataset, DataLad-101. This was done by supplying the -d flag in the command call.
At first glance, nesting does not seem particularly spectacular – after all, any directory on a file system can have other directories inside of it.
The possibility of nesting datasets, however, is one of many advantages DataLad datasets have:
One aspect of nested datasets is that any lower-level DataLad dataset (the subdataset) has a stand-alone history. The top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used.
Let’s dive into that.
Remember how we had to navigate into
recordings/longnow to see the history,
and how this history was completely independent of the
superdataset history? This was the subdataset’s own history.
Apart from stand-alone histories of super- or subdatasets, this highlights another very important advantage that nesting provides: Note that the longnow dataset is a completely independent, standalone dataset that was once created and published. Nesting allows for a modular re-use of any other DataLad dataset, and this re-use is possible and simple precisely because all of the information is kept within a (sub)dataset.
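The stand-alone histories described above can be reproduced on a small scale with plain Git, since the .gitmodules entry in the superdataset shows that a subdataset is registered as a Git submodule. The following is a minimal sketch, not the DataLad workflow itself: the repository names super and sub, the helper g, and the demo identity are all hypothetical, and everything happens in a throwaway temporary directory.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; the names "super" and "sub" are hypothetical stand-ins.
sandbox=$(mktemp -d)

# Helper: run git with a demo identity, and allow cloning from a local path
# (newer Git versions restrict file-based submodule clones by default).
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# A stand-alone repository with two commits of its own.
g init -q "$sandbox/sub"
g -C "$sandbox/sub" commit -q --allow-empty -m "sub: first"
g -C "$sandbox/sub" commit -q --allow-empty -m "sub: second"

# A second repository that registers the first one as a nested repository.
g init -q "$sandbox/super"
g -C "$sandbox/super" submodule --quiet add "$sandbox/sub" sub
g -C "$sandbox/super" commit -q -m "super: register sub"

# Each repository counts only its own commits.
super_commits=$(git -C "$sandbox/super" rev-list --count HEAD)
sub_commits=$(git -C "$sandbox/super/sub" rev-list --count HEAD)
echo "super: $super_commits commit(s), sub: $sub_commits commit(s)"

rm -rf "$sandbox"
```

The nested repository keeps both of its commits, while the outer one gains a single commit for registering it. The nested history is not copied into the outer history.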
But now let’s also check out what the superdataset’s (DataLad-101) history looks like after the addition of a subdataset. To do this, make sure you are outside of the subdataset longnow. Note that the first commit is our recent change to notes.txt, so we’ll look at the second most recent commit in this excerpt.
$ git log -p -n 2
commit 2fcef51461b5c8e0c1fcf1ed6511035fb3c79509
Author: Elena Piscopia <firstname.lastname@example.org>
Date:   Thu Jan 9 07:51:52 2020 +0100

    [DATALAD] Recorded changes

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..1b59b8c
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "recordings/longnow"]
+    path = recordings/longnow
+    url = https://github.com/datalad-datasets/longnow-podcasts.git
+    datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
diff --git a/recordings/longnow b/recordings/longnow
new file mode 160000
index 0000000..dcc34fb
--- /dev/null
+++ b/recordings/longnow
@@ -0,0 +1 @@
+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f
The important part of this rather long commit summary is its very end, which starts with Subproject commit.
Note that you cannot see any .mp3s being added to the dataset, as was previously the case when we datalad saved PDFs. Instead, DataLad stores what it calls a subproject commit of the subdataset.
The cryptic character sequence in this line is the shasum we have briefly
mentioned before, and it is
how DataLad internally identifies files and changes to files. Exactly this
shasum is what describes the state of the subdataset.
Navigate back into longnow and try to find this shasum in the subdataset’s history:
$ cd recordings/longnow
$ git log --oneline
dcc34fb Update aggregated metadata
36a30a1 [DATALAD RUNCMD] Update from feed
bafdc04 Uniformize JSON-LD context with DataLad's internal extractors
004e484 [DATALAD RUNCMD] .datalad/maint/make_readme.py
7ee3ded Sort episodes newest-first
e829615 Link to the handbook as a source of wisdom
4b37790 Fix README generator to parse correct directory
43fdea1 Add script to generate a README from DataLad metadata
997e07a Update aggregated metadata
8031017 Consolidate all metadata-related files under .datalad
8053eed Add annexed feed logos
1a396a6 Prepare to annex big feed logos
75d7f3f Rename metadata directory
5dd7772 Manually place extracted metadata in Git
b9c517e Make sure extracted metadata is directly in Git
0553111 content removed from git annex
39226e9 Update aggregated metadata
740fa14 [DATALAD RUNCMD] Update from feed
61f46fc Add base dataset metadata
3e96466 More diff-able
979bd25 Single update maintainer script
ead809e Be resilient with different delimiters
9bece59 Add duration to the metadata
f0831b9 Script to convert the RSS feed metadata into JSON-LD metadata
e64d00f Prepare for addition of RSS feed metadata on episodes
e1bf31e [DATALAD RUNCMD] Update SALT series
21d9290 [DATALAD RUNCMD] Update Interval seminar series
7f36dea Update from feed
ff00713 Update from feed
a052af9 Include publication date in the filename
9f3127f Import Interval feed
b81bdea Import SALT feed
3d0dc8f [DATALAD] new dataset
8df130b [DATALAD] Set default backend for all files to be MD5E
We can see that it is the most recent commit shasum of the subdataset (albeit we can see only the first seven characters here; a git log would show you the full shasum). Thus, your dataset does not only know the origin of its subdataset, but also its version, i.e., it has an identifier of the stage of the subdataset’s evolution. This is what is meant by “the top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used”.
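Instead of comparing shasums by eye, one can also ask Git for both values and compare them. The sketch below rebuilds the situation in a throwaway sandbox with plain Git, so it runs without the real dataset. The source repository longnow-src, the helper g, and the demo identity are hypothetical; only the path recordings/longnow mirrors the layout used here.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox standing in for the real datasets (hypothetical names).
sandbox=$(mktemp -d)

# Helper: run git with a demo identity, and allow cloning from a local path.
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# Stand-in for the published dataset, with a commit of its own.
g init -q "$sandbox/longnow-src"
g -C "$sandbox/longnow-src" commit -q --allow-empty -m "Update aggregated metadata"

# Stand-in superdataset that registers it under recordings/longnow.
g init -q "$sandbox/super"
g -C "$sandbox/super" submodule --quiet add "$sandbox/longnow-src" recordings/longnow
g -C "$sandbox/super" commit -q -m "Register subdataset"

# The commit the superdataset has recorded for the nested repository ...
recorded=$(git -C "$sandbox/super" rev-parse HEAD:recordings/longnow)
# ... and the commit that is currently checked out inside it.
checked_out=$(git -C "$sandbox/super/recordings/longnow" rev-parse HEAD)
# The tree entry itself: mode 160000 marks a commit reference, not a directory.
entry=$(git -C "$sandbox/super" ls-tree HEAD recordings/longnow)

echo "recorded:    $recorded"
echo "checked out: $checked_out"
echo "tree entry:  $entry"

rm -rf "$sandbox"
```

In the real DataLad-101 dataset, the same two rev-parse calls (run from the dataset root, without the sandbox prefix) should print the full shasum that appears after Subproject commit in the log above.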
Importantly, once we learn how to make use of the history of a dataset, we can set subdatasets to previous states, or update them.
Find out more: Do I have to navigate into the subdataset to see its history?
Previously, we used cd to navigate into the subdataset, and subsequently opened the Git log. This is necessary because a git log in the superdataset would only return the superdataset’s history.
While moving around with cd is straightforward, you may have found it slightly annoying to use the cd command so often and to remember which directory you are currently in. There is an alternative: git -C (note that it is a capital C) lets you perform any Git command in a provided path. Providing this option together with a path to a Git command lets you run the command as if Git was started in this path instead of the current working directory.
Thus, from the root of
DataLad-101, this command would have given you the
subdataset’s history as well:
$ git -C recordings/longnow log --oneline
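To see that git -C and a preceding cd yield the same result, the following sketch runs both variants against a small throwaway repository and compares their output. The sandbox, the helper g, and the demo identity are hypothetical; the nested path only imitates the layout used in this section.

```shell
#!/bin/sh
set -eu

# Throwaway repository at a nested path (hypothetical stand-in).
sandbox=$(mktemp -d)
repo="$sandbox/recordings/longnow"
mkdir -p "$repo"

# Helper: run git with a demo identity so that commits succeed anywhere.
g() {
    git -c user.name=demo -c user.email=demo@example.com "$@"
}

g init -q "$repo"
g -C "$repo" commit -q --allow-empty -m "first"
g -C "$repo" commit -q --allow-empty -m "second"

start_dir=$(pwd)

# Variant 1: change into the directory (in a subshell), then run git log.
via_cd=$(cd "$repo" && git log --oneline)

# Variant 2: stay where we are and point Git at the path with -C.
via_C=$(git -C "$repo" log --oneline)

end_dir=$(pwd)

echo "$via_C"

rm -rf "$sandbox"
```

Both variables hold the same log, and the working directory of the calling shell is unchanged afterwards, which is the convenience git -C offers.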
In the upcoming sections, we’ll experience the perks of dataset nesting frequently, and everything that might seem vague at this point will become clearer. To conclude this demonstration, the figure below illustrates the current state of the dataset and nesting schematically:
Thus, without being consciously aware of it, by taking advantage of dataset nesting, we took a dataset longnow and installed it as a subdataset within the superdataset DataLad-101.
If you have executed the above code snippets, make sure to go back into the root of the dataset again:
$ cd ../../