Dataset nesting

Without explicitly pointing it out, the previous section demonstrated another core principle and feature of DataLad datasets: nesting.

DataLad datasets can contain other DataLad datasets, nested arbitrarily deep. For example, we just installed one dataset, the longnow podcasts, into another dataset, the DataLad-101 dataset. This does not seem particularly spectacular. After all, any directory on a file system can have other directories inside of it.

The possibility of nesting datasets, however, is one of the many advantages of DataLad datasets:

One aspect of nested datasets is that any lower-level DataLad dataset (the subdataset) has a stand-alone history. The top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used.

Let’s dive into that. Remember how we had to navigate into recordings/longnow to see the history, and how this history was completely independent of the DataLad-101 superdataset history? This was the subdataset’s own history.

But now let’s also check what the superdataset’s (DataLad-101) history looks like after the installation of a subdataset. To do this, make sure you are outside of the subdataset longnow. Note that the most recent commit is our addition to notes.txt, so the commit shown in this excerpt is the second most recent one.

$ git log -p -n 2
commit 49e85d196f5bfaf09567c10008fc71c212cdbb1d
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:08 2019 +0100

    [DATALAD] Recorded changes

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..1b59b8c
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "recordings/longnow"]
+	path = recordings/longnow
+	url = https://github.com/datalad-datasets/longnow-podcasts.git
+	datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
diff --git a/recordings/longnow b/recordings/longnow
new file mode 160000
index 0000000..dcc34fb
--- /dev/null
+++ b/recordings/longnow
@@ -0,0 +1 @@
+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f

The important part of this rather long commit summary is its last line. Note that you cannot see any .mp3s being added to the dataset, as was previously the case when we datalad saved PDFs that we downloaded into books/. Instead, DataLad stores what is called a subproject commit of the subdataset. The cryptic character sequence in this line is the shasum we have briefly mentioned before, and it is how DataLad internally identifies files and changes to files. Exactly this shasum is what describes the state of the subdataset.

This highlights a different aspect as well: Note that the longnow dataset is a completely independent, standalone dataset that was once created and published. Nesting allows for a modular re-use of any other DataLad dataset, and this re-use is possible and simple precisely because all of the information is kept within a (sub)dataset.

Navigate back into longnow and try to find the shasum from the Subproject commit line in the subdataset’s history:

$ cd recordings/longnow
$ git log --oneline
dcc34fb Update aggregated metadata
36a30a1 [DATALAD RUNCMD] Update from feed
bafdc04 Uniformize JSON-LD context with DataLad's internal extractors
004e484 [DATALAD RUNCMD] .datalad/maint/make_readme.py
7ee3ded Sort episodes newest-first
e829615 Link to the handbook as a source of wisdom
4b37790 Fix README generator to parse correct directory
43fdea1 Add script to generate a README from DataLad metadata
997e07a Update aggregated metadata
8031017 Consolidate all metadata-related files under .datalad
8053eed Add annexed feed logos
1a396a6 Prepare to annex big feed logos
75d7f3f Rename metadata directory
5dd7772 Manually place extracted metadata in Git
b9c517e Make sure extracted metadata is directly in Git
0553111 content removed from git annex
39226e9 Update aggregated metadata
740fa14 [DATALAD RUNCMD] Update from feed
61f46fc Add base dataset metadata
3e96466 More diff-able
979bd25 Single update maintainer script
ead809e Be resilient with different delimiters
9bece59 Add duration to the metadata
f0831b9 Script to convert the RSS feed metadata into JSON-LD metadata
e64d00f Prepare for addition of RSS feed metadata on episodes
e1bf31e [DATALAD RUNCMD] Update SALT series
21d9290 [DATALAD RUNCMD] Update Interval seminar series
7f36dea Update from feed
ff00713 Update from feed
a052af9 Include publication date in the filename
9f3127f Import Interval feed
b81bdea Import SALT feed
3d0dc8f [DATALAD] new dataset
8df130b [DATALAD] Set default backend for all files to be MD5E

We can see that it is the most recent commit shasum of the subdataset, although only the first seven characters are shown here; a git log without --oneline would show you the full shasum. This is what is meant by “the top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used”.
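
The pointer mechanism can also be tried out in isolation. The following is a minimal sketch that uses plain Git in a throwaway directory instead of DataLad-101. The names demo, sub, and super, as well as the commit identity, are made up for illustration, and the protocol.file.allow setting is only included because recent Git versions refuse to clone submodules from local paths by default.

```shell
# Scratch area; nothing here touches DataLad-101.
demo=$(mktemp -d)

# A stand-alone repository with its own history (plays the role of longnow).
git init -q "$demo/sub"
git -C "$demo/sub" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit in sub"

# A second repository that registers the first one as a nested repository.
git init -q "$demo/super"
git -C "$demo/super" -c protocol.file.allow=always \
    submodule --quiet add "$demo/sub" sub
git -C "$demo/super" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "register sub"

# The version recorded in the outer repository ...
recorded=$(git -C "$demo/super" rev-parse HEAD:sub)
# ... and the most recent commit of the nested repository.
actual=$(git -C "$demo/super/sub" rev-parse HEAD)
echo "recorded: $recorded"
echo "actual:   $actual"

# The outer repository holds a single entry of the form
# "160000 commit <shasum>", not the files of the nested repository.
git -C "$demo/super" ls-tree HEAD sub
```

Both echoed shasums should be identical. In DataLad-101, the analogous comparison would be between git rev-parse HEAD:recordings/longnow and git -C recordings/longnow rev-parse HEAD.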

Importantly, once we learn how to make use of the history of a dataset, we can set subdatasets to previous states, or update them.

Find out more: Do I have to navigate into the subdataset to see its history?

Previously, we used cd to navigate into the subdataset, and subsequently opened the Git log. This is necessary because a git log in the superdataset would only return the superdataset’s history. There is one trick, though: git -C lets you perform any Git command in a provided path. Thus, from the root of DataLad-101, this command would have given you the subdataset’s history as well:

$ git -C recordings/longnow log --oneline
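
To see that git -C really behaves like a temporary change of directory, here is a small sketch in a throwaway repository. The path nested/repo and the commit identity are placeholders chosen for this example and are not part of DataLad-101.

```shell
# Throwaway repository with a single commit.
demo=$(mktemp -d)
git init -q "$demo/nested/repo"
git -C "$demo/nested/repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "only commit"

before=$(pwd)
# Run Git as if it had been started inside the given path ...
via_c=$(git -C "$demo/nested/repo" log --oneline)
# ... which gives the same result as changing directory in a subshell first.
via_cd=$(cd "$demo/nested/repo" && git log --oneline)
after=$(pwd)

echo "$via_c"
```

Both variants report the same history, and your shell stays in the directory it started in, so there is no need to cd back afterwards.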

In the upcoming sections, we’ll experience the perks of dataset nesting frequently, and everything that might seem vague at this point will become clearer. To conclude this demonstration, the figure below illustrates the current state of the dataset and nesting schematically:

Virtual directory tree of a nested DataLad dataset

Thus, without being consciously aware of it, we took advantage of dataset nesting: we took the dataset longnow and installed it as a subdataset within the superdataset DataLad-101.

If you have executed the above code snippets, make sure to go back into the root of the dataset again:

$ cd ../../