Keeping track

In previous examples, all changes that happened to the dataset or the files it contains were saved to the dataset’s history by hand. We added larger and smaller files and saved them, and we also modified smaller file contents and saved these modifications.

Often, however, files get changed by shell commands or by scripts. Consider a data scientist. She has data files with numeric data, and code scripts in Python, R, Matlab or any other programming language that will use the data to compute results or figures. Such output is stored in new files, or modifies existing files.

But only a few weeks after these scripts were executed, she finds it hard to remember which script was modified for which reason or created which output. How did this result come to be? Which script would she need to run again on which data to produce this particular figure?

In this section we will experience how DataLad can help to record the changes in a dataset after execution of a script from the shell.

Let’s say, for example, that you enjoyed the longnow podcasts a lot, and you start a podcast-night with friends to wind down from all of the exciting DataLad lectures. They propose to make a list of speakers and titles to cross out what they’ve already listened to, and ask you to prepare such a list.

“Mhh… probably there is a DataLad way to do this… wasn’t there also a note about metadata extraction at some point?” But as we’re not that far into the lectures, you decide to write a simple shell script to generate a text file that lists speaker and title name instead.

To do this, we’re following a best practice that will reappear in the later section on YODA principles: collect all additional scripts that work with the content of a subdataset outside of this subdataset, in a dedicated code/ directory, and collate the output of these scripts outside of the subdataset as well, so that the subdataset itself is never modified.

The motivation behind this will become clear in later sections, but for now we’ll start with best-practice building. Therefore, create a subdirectory code/ in the DataLad-101 superdataset:

$ mkdir code
$ tree -d
.
├── books
├── code
└── recordings
    └── longnow
        ├── Long_Now__Conversations_at_The_Interval
        └── Long_Now__Seminars_About_Long_term_Thinking

6 directories

Inside of DataLad-101/code, create a simple shell script list_titles.sh. This script will carry out a simple task: It will loop through the file names of the .mp3 files and write out speaker names and talk titles in a very simple fashion. The content of this script is shown below; the cat command writes it into the file.

$ cat << EOT > code/list_titles.sh
for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
   # get the filename
   base=\$(basename "\$i");
   # strip the extension
   base=\${base%.mp3};
   # date as yyyy-mm-dd
   printf "\${base%%__*}\t" | tr '_' '-';
   # name and title without underscores
   printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
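
To see what the parameter expansions in the script do, the following sketch applies them step by step to a single file name. The file name is an assumption, chosen to follow the date__speaker__title pattern the script relies on; the variable names date_part, rest, date_fmt and title_fmt are illustrative only and do not appear in the script.

```shell
# Assumed example path (hypothetical; any file following the
# date__speaker__title pattern behaves the same way).
i="recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3"

# get the filename without its directory
base=$(basename "$i")
# strip the extension: %.mp3 removes the shortest matching suffix
base=${base%.mp3}
# everything before the first "__": %% removes the longest suffix matching __*
date_part=${base%%__*}
# everything after the first "__": # removes the shortest prefix matching *__
rest=${base#*__}

# date as yyyy-mm-dd, name and title without underscores
date_fmt=$(printf '%s' "$date_part" | tr '_' '-')
title_fmt=$(printf '%s' "$rest" | tr '_' ' ')

printf '%s\t%s\n' "$date_fmt" "$title_fmt"
```

Note that the double underscore between speaker and title becomes two spaces, because tr replaces each underscore individually.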

Save this script to the dataset.

$ datalad status
untracked: code (directory)
$ datalad save -m "Add simple script to write a list of podcast speakers and titles"
add(ok): code/list_titles.sh (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

When we run this script, it simply prints dates, names and titles to the terminal. We can save its output to a new file recordings/podcasts.tsv in the superdataset by redirecting it with bash code/list_titles.sh > recordings/podcasts.tsv.

Obviously, we could create this file, and subsequently save it to the superdataset. However, just as in the example about the data scientist, in a bit of time, we will forget how this file came into existence, or that the script code/list_titles.sh is associated with this file, and can be used to update it later on.

The datalad run command (datalad-run manual) can help with this. It records a command’s impact on a dataset.

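Let’s try the simplest way to use this command: datalad run, followed by a commit message (-m "a concise summary"), and the command that executes the script from the shell: bash code/list_titles.sh. It is helpful to enclose the command in quotation marks: here they ensure that the redirection (>) is recorded as part of the command instead of being interpreted by your shell before datalad run starts.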

Note that we execute the command from the root of the superdataset. It is recommended to use datalad run in the root of the dataset you want to record the changes in, so make sure to run this command from the root of DataLad-101.

$ datalad run -m "create a list of podcast titles" "bash code/list_titles.sh > recordings/podcasts.tsv"
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (notneeded: 1, ok: 1)
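
The recommendation to call datalad run from the dataset root can be turned into a small guard. The sketch below relies on the fact that the root of a DataLad dataset contains a .datalad directory; the helper name at_dataset_root is hypothetical and not part of DataLad, and the check only tells you that you are at the root of some dataset, not which one.

```shell
# Hypothetical helper (not a DataLad command): succeeds only if the
# current directory contains a .datalad directory, as a dataset root does.
at_dataset_root() {
    [ -d .datalad ]
}

if at_dataset_root; then
    echo "OK: $(pwd) looks like a dataset root"
else
    echo "Not at a dataset root: change directory before calling datalad run" >&2
fi
```

In DataLad-101, the first branch should be reached in the dataset's top-level directory, and the second one in plain subdirectories such as code/ or recordings/.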

Find out more: Why is there a “notneeded” in the command summary?

If you have stumbled across the command execution summary save (notneeded: 1, ok: 1) and wondered what “notneeded” means: the datalad save at the end of a datalad run queries all potential subdatasets recursively for modifications, and as there are no modifications in the longnow subdataset, this part of save returns a “notneeded” summary. Thus, after a datalad run, the execution summary contains one “notneeded” for every subdataset without modifications.

Let’s take a look into the history:

$ git log -p -n 1
commit ecde4e4b41d219728bc3f2b28adf2f33b366c533
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:21 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
new file mode 100644
index 0000000..f691b53
--- /dev/null
+++ b/recordings/podcasts.tsv
@@ -0,0 +1,206 @@
+2003-11-15	Brian Eno  The Long Now
+2003-12-13	Peter Schwartz  The Art Of The Really Long View
+2004-01-10	George Dyson  There s Plenty of Room at the Top  Long term Thinking About Large scale Computing
+2004-02-14	James Dewar  Long term Policy Analysis

The commit message we have supplied with -m directly after datalad run appears in our history as a short summary. Additionally, the output of the command, recordings/podcasts.tsv, was saved right away.

But there is more in this log entry: a section in between the markers === Do not change lines below === and ^^^ Do not change lines above ^^^.

This is the so-called run record: a recording of all of the information in the datalad run command, generated by DataLad. In this case, it is a very simple summary. One part is particularly informative: "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv" is the command that was run in the terminal. This information therefore maps the command, and with it the script, to the output file, in one commit. Nice, isn’t it?

Arguably, the run record is not the most human-readable way to display information. This representation, however, is meant less for the human user (who should rely on an informative commit message) than for DataLad, in particular for the datalad rerun command, which you will see in action shortly.
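
Because the run record is plain JSON between two fixed marker lines, it can also be pulled out of a commit message with standard tools. The following is only a sketch of that idea, not how DataLad itself parses the record: the message is pasted in by hand from the log entry above (with the indentation removed), and the variable names msg, record and cmd are illustrative.

```shell
# Commit message copied from the log entry above, without indentation.
msg='[DATALAD RUNCMD] create a list of podcast titles

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^'

# Print the range from the first marker to the second one,
# then drop the two marker lines themselves.
record=$(printf '%s\n' "$msg" \
    | sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' \
    | sed '1d;$d')

# Extract the value of the "cmd" key from the record.
cmd=$(printf '%s\n' "$record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
printf '%s\n' "$cmd"
```

In a real dataset, the message of the most recent commit could be obtained with git log -1 --format=%B instead of pasting it in; a proper JSON parser would be more robust than the sed expression used here.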

Note that any datalad run command that does not result in any changes in a dataset (no modification of existing content; no additional files) will not produce any record in the dataset’s history. Try to run the exact same command as before, and check whether anything in your log changes:

$ datalad run -m "Try again to create a list of podcast titles" "bash code/list_titles.sh > recordings/podcasts.tsv"
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
action summary:
  save (notneeded: 2)
$ git log --oneline
ecde4e4 [DATALAD RUNCMD] create a list of podcast titles
ef8b726 Add simple script to write a list of podcast speakers and titles
e61c229 Add note on datalad install
49e85d1 [DATALAD] Recorded changes

The most recent commit is still the datalad run command from before, and no second datalad run commit was created.
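
The reason nothing was recorded is that the script is deterministic: running it again rewrites the file with exactly the same content, so there is no modification to save. The sketch below illustrates this with a stand-in command in a temporary directory; the generate function and the checksum comparison are assumptions made for illustration and are not how DataLad detects changes internally.

```shell
# Stand-in for "bash code/list_titles.sh" (hypothetical): a command
# that always produces the same output.
generate() {
    printf '2003-11-15\tBrian Eno  The Long Now\n'
}

workdir=$(mktemp -d)

# First run: create the file and remember its checksum.
generate > "$workdir/podcasts.tsv"
before=$(cksum < "$workdir/podcasts.tsv")

# Second run: overwrite the file with the output of the same command.
generate > "$workdir/podcasts.tsv"
after=$(cksum < "$workdir/podcasts.tsv")

if [ "$before" = "$after" ]; then
    result="unchanged: nothing to save"
else
    result="modified: a new commit would be recorded"
fi
echo "$result"

rm -rf "$workdir"
```

If the script or the files it loops over had changed between the two runs, the content would differ and the second datalad run would have produced a new commit.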

The datalad run command can therefore help you to keep track of what you are doing in a dataset. The next sections will demonstrate how to make use of this information, and also how to extend the command with additional arguments that will prove to be helpful over the course of this chapter.