1.5. Dataset nesting
Without noticing, the previous section demonstrated another core principle and feature of DataLad datasets: nesting. Within DataLad datasets, one can nest other DataLad datasets arbitrarily deep. We, for example, just installed one dataset, the longnow podcasts, into another dataset, the DataLad-101 dataset. This was done by supplying the --dataset/-d flag in the command call.
At first glance, nesting does not seem particularly spectacular – after all, any directory on a file system can contain other directories. The possibility of nested datasets, however, is one of the many advantages DataLad datasets have:
One aspect of nested datasets is that any DataLad dataset, be it a subdataset or a superdataset, keeps its own stand-alone history. The top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used, by means of an identifier.
Let’s dive into that.
Remember how we had to navigate into recordings/longnow to see its history, and how this history was completely independent of the DataLad-101 superdataset's history? That was the subdataset's own history.
Apart from the stand-alone histories of super- and subdatasets, this highlights another very important advantage that nesting provides: the longnow dataset is a completely independent, standalone dataset that was once created and published. Nesting allows for modular reuse of any other DataLad dataset, and this reuse is possible and simple precisely because all of the information is kept within the (sub)dataset.
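The mechanics behind this can be sketched with plain Git, on which DataLad's nesting is built: a superdataset registers its subdatasets with Git's submodule machinery, as the `.gitmodules` diff further below shows. The following throwaway sketch (repository names `super` and `sub` are hypothetical, and the `-c user.*`/`protocol.file.allow` settings are only there to make it run anywhere) nests one repository in another while both keep fully independent histories:

```shell
set -e
cd "$(mktemp -d)"
# a stand-alone dataset with its own history (hypothetical name "sub")
git init -q sub
git -C sub -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -qm "sub: stand-alone history starts here"
# a second dataset that will become the superdataset
git init -q super
# nest "sub" inside "super"; DataLad subdatasets use this submodule mechanism
git -C super -c protocol.file.allow=always submodule add ../sub sub
git -C super -c user.name=demo -c user.email=demo@example.com \
    commit -qm "[DATALAD] Added subdataset"
# the two histories remain completely independent:
git -C super log --oneline
git -C super/sub log --oneline
```

Each `git log` call reports only that repository's own commits – nesting does not merge the histories, it only records a link from the outer repository to one state of the inner one.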
But now let's also check out what the superdataset's (DataLad-101) history looks like after the addition of a subdataset. To do this, make sure you are outside of the subdataset longnow. Note that the first commit in the log is our recent addition to notes.txt, so we'll look at the second most recent commit in this excerpt.
$ git log -p -n 3
commit 3c016f73✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD] Added subdataset
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..9bc9ee9
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "recordings/longnow"]
+ path = recordings/longnow
+ url = https://github.com/datalad-datasets/longnow-podcasts.git
+ datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
+ datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git
diff --git a/recordings/longnow b/recordings/longnow
new file mode 160000
index 0000000..dcc34fb
--- /dev/null
+++ b/recordings/longnow
@@ -0,0 +1 @@
+Subproject commit dcc34fbe✂SHA1
commit e310b465✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
add note on datalad save
diff --git a/notes.txt b/notes.txt
index 3a7a1fe..0142412 100644
--- a/notes.txt
+++ b/notes.txt
@@ -1,3 +1,7 @@
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty
+The command "datalad save [-m] PATH" saves the file (modifications) to
+history.
We have highlighted the important part of this rather long commit summary. Note that you cannot see any .mp3 files being added to the dataset, as was previously the case when we datalad save'd the PDFs that we downloaded into books/. Instead, DataLad stores what it calls a subproject commit of the subdataset. The cryptic character sequence in this line is the shasum we have briefly mentioned before, and it is the identifier that DataLad internally uses to identify the files, and the changes to the files, in the subdataset. Exactly this shasum is what identifies the state of the subdataset.
Navigate back into longnow and try to find the highlighted shasum in the subdataset's history:
$ cd recordings/longnow
$ git log --oneline
dcc34fb Update aggregated metadata
36a30a1 [DATALAD RUNCMD] Update from feed
bafdc04 Uniformize JSON-LD context with DataLad's internal extractors
004e484 [DATALAD RUNCMD] .datalad/maint/make_readme.py
7ee3ded Sort episodes newest-first
e829615 Link to the handbook as a source of wisdom
4b37790 Fix README generator to parse correct directory
We can see that it is the shasum of the most recent commit of the subdataset (albeit we can see only its first seven characters here – a full git log would show you the complete shasum). Thus, your dataset not only knows the origin of its subdataset, but also which version of the subdataset to use, i.e., it has the identifier of the stage/version in the subdataset's evolution that is to be used. This is what is meant by "the top-level DataLad dataset (the superdataset) only stores which version of the subdataset is currently used through an identifier".
Importantly, once we learn how to make use of the history of a dataset, we can set subdatasets to previous states, or update them.
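As a taste of what that looks like under the hood, here is a minimal Git-only sketch of the submodule mechanism that DataLad builds on (repository names `super` and `sub` are hypothetical; the extra `-c` settings only make the snippet runnable anywhere). Checking out a previous state inside the subdataset and committing in the superdataset updates the recorded identifier:

```shell
set -e
cd "$(mktemp -d)"
# build a toy subdataset with two states (hypothetical names)
git init -q sub
git -C sub -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -qm "state v1"
git -C sub -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -qm "state v2"
# register it in a superdataset; the current commit (v2) gets recorded
git init -q super
git -C super -c protocol.file.allow=always submodule add ../sub sub
git -C super -c user.name=demo -c user.email=demo@example.com \
    commit -qm "add subdataset at v2"
# set the subdataset to a previous state and record the new identifier
git -C super/sub checkout -q HEAD~1
git -C super -c user.name=demo -c user.email=demo@example.com \
    commit -aqm "pin subdataset to v1"
# the superdataset now points at the v1 commit of the subdataset
git -C super submodule status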
Do I have to navigate into the subdataset to see its history?
Previously, we used cd to navigate into the subdataset, and subsequently opened the Git log. This is necessary, because a git log in the superdataset would only return the superdataset's history. While moving around with cd is straightforward, you also found it slightly annoying from time to time to use the cd command so often, and to remember which directory you are currently in. There is one trick, though: git -C and datalad -C (note that it is a capital C) let you perform any Git or DataLad command in a provided path. Supplying this option together with a path lets you run the command as if it was started in that path instead of the current working directory.
Thus, from the root of DataLad-101, this command would have given you the subdataset's history as well:
$ git -C recordings/longnow log --oneline
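To convince yourself that -C really behaves like running the command from inside the directory, here is a tiny self-contained check with a throwaway repository (the name `demo` and the identity settings are hypothetical, chosen only so the snippet runs anywhere):

```shell
set -e
cd "$(mktemp -d)"
git init -q demo
git -C demo -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -qm "initial commit"
# identical output, with and without changing the working directory:
git -C demo log --oneline
( cd demo && git log --oneline )
```

Both invocations print the same single-line log entry, but only the second one requires leaving your current directory.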
In the upcoming sections, we'll experience the perks of dataset nesting frequently, and everything that might seem vague at this point will become clearer. To conclude this demonstration, Fig. 1.1 illustrates the current state of our dataset, DataLad-101, with its nested subdataset.
Thus, without being consciously aware of it, by taking advantage of dataset nesting we took a dataset, longnow, and installed it as a subdataset within the superdataset DataLad-101.
If you have executed the above code snippets, make sure to go back into the root of the dataset again:
$ cd ../../