7.1. More on dataset nesting

You may have noticed how working in the subdataset felt as if you were working in an independent dataset – there was no information or influence at all from the top-level DataLad-101 superdataset, and you built up a completely stand-alone history:

$ git log --oneline
6fc0d0f Provide project description
9aadac7 [DATALAD RUNCMD] analyze iris data with classification analysis
ca0c747 add script for kNN classification and plotting
4f945ed [DATALAD] Added subdataset
18f4a98 Apply YODA dataset setup
bf231d5 [DATALAD] new dataset

In principle, this is not news to you. From section Dataset nesting and the YODA principles you already know that nesting allows for a modular reuse of any other DataLad dataset, and that this reuse is possible and simple precisely because all of the information is kept within a (sub)dataset.

What is new now, however, is that you applied changes to the dataset. While you already explored the look and feel of the longnow subdataset in previous sections, you now modified the contents of the midterm_project subdataset. How does this influence the superdataset, and what does this look like in the superdataset’s history? You know from section Dataset nesting that the superdataset only stores the state of the subdataset. Upon creation of the subdataset, its very first, initial state was thus recorded in the superdataset. But now, after you finished your project, your subdataset has evolved. Let’s ask the superdataset what it thinks about this.

$ # move into the superdataset
$ cd ../
$ datalad status
 modified: midterm_project (dataset)

From the superdataset’s perspective, the subdataset appears as “modified”. Note how it is not individual files that show up as “modified”, but the complete subdataset as a single entity.

What this shows you is that the modifications you performed in the subdataset are not automatically recorded in the superdataset. This makes sense: after all, it should be up to you to decide whether you want to record something or not. But it is worth repeating: If you modify a subdataset, you will need to save this in the superdataset in order to have a clean superdataset status.
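
Under the hood, a DataLad subdataset is registered in its superdataset as a Git submodule, so this behavior can be reproduced with plain Git. The following is a minimal sketch under that assumption, not a DataLad workflow: it builds two throwaway repositories in a temporary directory, commits a change in the nested one, and asks the outer one for its status. All names (super, sub, file.txt, a.txt, b.txt) are hypothetical, and for brevity the nested repository is recorded with git add only, so no .gitmodules file is written.

```shell
#!/usr/bin/env bash
# Plain-Git sketch: an outer repository notices that a nested repository
# has moved on. Names are hypothetical; DataLad itself is not used here.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Helper: commit with a throwaway identity so no global Git config is needed.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q "$@"; }

# Build an outer repository with a nested repository inside it.
git init -q super
git init -q super/sub
cd super/sub
echo "first" > file.txt
git add file.txt
commit -m "initial state of sub"
cd ..

# Record the nested repository's current commit in the outer repository.
git -c advice.addEmbeddedRepo=false add sub
commit -m "register sub"

# Change the nested repository: two new files, saved in one new commit.
cd sub
echo "a" > a.txt
echo "b" > b.txt
git add a.txt b.txt
commit -m "add two files"
cd ..

# Ask the outer repository for its status.
status_output=$(git status --porcelain)
echo "$status_output"
```

The outer repository reports a single modified entry for sub, not the two files that changed inside it, which mirrors the "modified: midterm_project (dataset)" report above.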

Let’s save the modification of the subdataset into the history of the superdataset. To avoid confusion, you can specify explicitly which dataset you want to save a modification to. -d . specifies the current dataset, i.e., DataLad-101, as the dataset to save to:

$ datalad save -d . -m "finished my midterm project" midterm_project
add(ok): midterm_project (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

More on how ‘datalad save’ can operate on nested datasets

In a superdataset with subdatasets, datalad save (manual) by default tries to figure out on its own which of the available datasets’ histories a save should be written to. However, being very explicit in the command call and telling DataLad where to save which modifications can reduce confusion and enables specific operations.

If you want to save the current state of the subdataset into the superdataset (as necessary here), start a save from the superdataset and have the -d/--dataset option point to its root:

$ # in the root of the superds
$ datalad save -d . -m "update subdataset"

If you are in the superdataset, and you want to save an unsaved modification in a subdataset to the subdataset’s history, let -d/--dataset point to the subdataset:

$ # in the superds
$ datalad save -d path/to/subds -m "modified XY"

The --recursive option allows you to save any content underneath the specified directory, and to recurse into any potential subdatasets:

$ datalad save . --recursive

Let’s check which subproject commit is now recorded in the superdataset:

$ git log -p -n 1
commit c5c90178✂SHA1
Author: Elena Piscopia <elena@example.net>
Date:   Tue Jun 18 16:13:00 2019 +0000

    finished my midterm project

diff --git a/midterm_project b/midterm_project
index 18f4a98..6fc0d0f 160000
--- a/midterm_project
+++ b/midterm_project
@@ -1 +1 @@
-Subproject commit 18f4a981✂SHA1
+Subproject commit 6fc0d0f5✂SHA1

As you can see in the log entry, the subproject commit changed from the first commit hash in the subdataset history to the most recent one. With this change, therefore, your superdataset tracks the most recent version of the midterm_project dataset, and your dataset’s status is clean again.
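
The "Subproject commit" line is a pointer that Git stores for the subdataset, and it can be inspected directly. The sketch below is a plain-Git illustration under the assumption that the nested repository behaves like a DataLad subdataset at the Git level; it does not call DataLad, and all names (super, sub, report.txt) are hypothetical. It compares the commit recorded in the outer repository with the nested repository's actual HEAD, before and after the outer repository saves the new state.

```shell
#!/usr/bin/env bash
# Plain-Git sketch: the commit recorded for a nested repository only
# changes when the outer repository commits it. Names are hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Helper: commit with a throwaway identity so no global Git config is needed.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q "$@"; }

# Outer repository with a nested repository registered at its first commit.
git init -q super
git init -q super/sub
cd super/sub
echo "draft" > report.txt
git add report.txt
commit -m "start project"
cd ..
git -c advice.addEmbeddedRepo=false add sub
commit -m "register sub"

# Advance the nested repository by one commit.
cd sub
echo "final" > report.txt
commit -a -m "finish project"
sub_head=$(git rev-parse HEAD)
cd ..

# The pointer stored in the outer repository still names the old commit.
recorded_before=$(git rev-parse HEAD:sub)

# Save the new state of the nested repository in the outer repository.
git -c advice.addEmbeddedRepo=false add sub
commit -m "update sub"
recorded_after=$(git rev-parse HEAD:sub)

echo "sub HEAD:        $sub_head"
echo "recorded before: $recorded_before"
echo "recorded after:  $recorded_after"
```

Only after the second commit in the outer repository does the recorded hash match the nested repository's HEAD. In DataLad-101, running git rev-parse HEAD:midterm_project in the superdataset should likewise print the hash that git log shows as the latest commit inside midterm_project.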

This point in DataLad-101 is a convenient moment to dive a bit deeper into the functions of the datalad status (manual) command. If you are interested in this, check out the dedicated Find-out-more.

More on ‘datalad status’

First of all, let’s start with a quick overview of the different content types and content states various datalad status commands in the course of DataLad-101 have shown up to this point. You have seen the following content types:

  • file, e.g., notes.txt: any file (or symlink that is a placeholder to an annexed file)

  • directory, e.g., books: any directory that does not qualify for the dataset type

  • symlink, e.g., the .jpg that was manually unlocked in section Input and output: any symlink that is not used as a placeholder for an annexed file

  • dataset, e.g., the midterm_project: any top-level dataset, or any subdataset that is properly registered in the superdataset

And you have seen the following content states: modified and untracked. The section Miscellaneous file system operations will show you many instances of the deleted content state as well.

But beyond understanding the report of datalad status, there is also additional functionality: datalad status can handle status reports for a whole hierarchy of datasets, and it can report on a subset of the content across any number of datasets in this hierarchy by providing selected paths. This is useful as soon as datasets become more complex and contain subdatasets with changing contents.

When performed without any arguments, datalad status will report the state of the current dataset. However, you can specify a path to any sub- or superdataset with the --dataset option. In order to demonstrate this a bit better, we will make sure that not only the state of the subdataset within the superdataset is modified, but also that the subdataset contains a modification. For this, let’s add an empty text file into the midterm_project subdataset:

$ touch midterm_project/an_empty_file

If you are in the root of DataLad-101, but interested in the status within the subdataset, simply provide a path (relative to your current location) to the command:

$ datalad status midterm_project
untracked: midterm_project/an_empty_file (file)

Alternatively, to achieve the same, specify the superdataset as the --dataset and provide a path to the subdataset with a trailing path separator like this:

$ datalad status -d . midterm_project/
untracked: midterm_project/an_empty_file (file)

Note that both of these commands return only the untracked file and not the modified subdataset because we’re explicitly querying only the subdataset for its status. If, however, you want to know about the subdataset record in the superdataset without causing a status query for the state within the subdataset itself (as done outside of this Find-out-more), you can also provide an explicit path to the dataset (without a trailing path separator). This can be used to single out a specific subdataset in a dataset with many subdatasets:

$ datalad status -d . midterm_project
 modified: midterm_project (dataset)

But if you are interested in both the state within the subdataset, and the state of the subdataset within the superdataset, you can combine the two paths:

$ datalad status -d . midterm_project midterm_project/
 modified: midterm_project (dataset)
untracked: midterm_project/an_empty_file (file)

Finally, if these subtle differences in the paths are not easy to memorize, the -r/--recursive option will also report both status aspects:

$ datalad status --recursive
 modified: midterm_project (dataset)
untracked: midterm_project/an_empty_file (file)

Importantly, the regular output from a datalad status command on the command line is “condensed” to the most important information by a tailored result renderer. You can, however, also get the unfiltered, full output of status by switching the -f/--output-format from tailored (the default) to json or, for the same information as json but better readability, json_pp:

$ datalad -f json_pp status -d . midterm_project
{
  "action": "status",
  "gitshasum": "6fc0d0f5✂SHA1",
  "parentds": "/home/me/dl-101/DataLad-101",
  "path": "/home/me/dl-101/DataLad-101/midterm_project",
  "prev_gitshasum": "6fc0d0f5✂SHA1",
  "refds": "/home/me/dl-101/DataLad-101",
  "state": "modified",
  "status": "ok",
  "type": "dataset"
}
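
Machine-readable output like this is mostly useful when another program consumes it. The following is a minimal sketch of that idea, assuming python3 is available and relying on the json format emitting one JSON object per line. To stay runnable without a dataset, it feeds in sample records through a hypothetical sample_results function: the first record reuses fields from the output above, the second is a made-up companion record for the untracked file. In a real dataset you would pipe in datalad -f json status instead.

```shell
#!/usr/bin/env bash
# Sketch: condense line-based JSON status records into a tab-separated table.
set -eu

# Hypothetical stand-in for `datalad -f json status -d . --recursive`.
# The second record is invented for illustration.
sample_results() {
cat <<'EOF'
{"action": "status", "path": "/home/me/dl-101/DataLad-101/midterm_project", "state": "modified", "status": "ok", "type": "dataset"}
{"action": "status", "path": "/home/me/dl-101/DataLad-101/midterm_project/an_empty_file", "state": "untracked", "status": "ok", "type": "file"}
EOF
}

# Parse each line as JSON and print state, type, and path.
summary=$(sample_results | python3 -c '
import json, sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines between records
    record = json.loads(line)
    print(record["state"], record["type"], record["path"], sep="\t")
')
echo "$summary"
```

Parsing the records with a JSON library instead of matching text keeps the sketch robust if fields are reordered or new ones are added.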

This is still not all of the available functionality of the datalad status command. You could, for example, adjust whether and how untracked dataset content should be reported with the --untracked option, or get additional information from annexed content with the --annex option (especially powerful when combined with -f json_pp). To get a complete overview of what you could do, check out the technical documentation of datalad status.

Before we leave this Find-out-more, let’s undo the modification of the subdataset by removing the untracked file:

$ rm midterm_project/an_empty_file
$ datalad status --recursive
nothing to save, working tree clean