2.1. Keeping track
In previous examples, with the exception of datalad download-url, all
changes that happened to the dataset or the files it contains were
saved to the dataset’s history by hand. We added larger and smaller
files and saved them, and we also modified the contents of smaller files and
saved these modifications.
Often, however, files get changed by shell commands or by scripts. Consider a data scientist. She has data files with numeric data, and code scripts in Python, R, Matlab or any other programming language that will use the data to compute results or figures. Such output is stored in new files, or modifies existing files.
But only a few weeks after these scripts were executed, she finds it hard to remember which script was modified for which reason, or which script created which output. How did this result come to be? Which script would she need to run again on which data to produce this particular figure?
In this section we will experience how DataLad can help
to record the changes in a dataset after executing a script
from the shell. Just as datalad download-url was able to associate
a file with its origin and store this information, we want to be
able to associate a particular file with the commands, scripts, and inputs
it was produced from, and thus capture and store full provenance.
Let’s say, for example, that you enjoyed the longnow podcasts a lot, and you start a podcast-night with friends to wind down from all of the exciting DataLad lectures. They propose to make a list of speakers and titles to cross out what they’ve already listened to, and ask you to prepare such a list.
“Mhh… probably there is a DataLad way to do this… wasn’t there also a note about metadata extraction at some point?” But as we are not that far into the lectures, you decide to write a short shell script to generate a text file that lists speaker and title name instead.
To do this, we are following a best practice that will reappear in the
later section on YODA principles: collecting all
additional scripts that work with the content of a subdataset outside
of this subdataset, in a dedicated code/ directory,
and collating the output of these scripts
outside of the subdataset as well, so that the subdataset itself
is never modified.
The motivation behind this will become clear in later sections,
but for now we’ll start with best-practice building.
Therefore, create a subdirectory code/ in the DataLad-101 superdataset:
$ mkdir code
$ tree -d
.
├── books
├── code
└── recordings
    └── longnow
        ├── Long_Now__Conversations_at_The_Interval
        └── Long_Now__Seminars_About_Long_term_Thinking
6 directories
Inside of DataLad-101/code, create a simple shell script list_titles.sh.
This script will carry out a simple task:
It will loop through the file names of the .mp3 files and
write out speaker names and talk titles in a very basic fashion.
The cat command will write the script content into code/list_titles.sh.
Here’s a script for Windows users
Please use an editor of your choice to create a file list_titles.sh inside of the code directory.
These should be the contents:
for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
    # get the filename
    base=$(basename "$i");
    # strip the extension
    base=${base%.mp3};
    # date as yyyy-mm-dd
    printf "${base%%__*}\t" | tr '_' '-';
    # name and title without underscores
    printf "${base#*__}\n" | tr '_' ' ';
done
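Note that this is not identical to the script in the text: it lacks the \ characters in front of each $, which is a meaningful difference. The backslashes are only needed when the script is written to a file with cat, where they keep the shell from expanding the variables before the file is written.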
Be mindful of hidden extensions when creating files!
By default, Windows does not show common file extensions when you view directory contents with a file explorer.
Instead, it only displays the base of the file name and indicates the file type with the display icon.
You can see if this is the case for you, too, by opening the books\ directory in a file explorer, and checking if the file extension (.pdf) is a part of the file name displayed underneath its PDF icon.
Hidden file extensions can be a confusing source of errors, because some Windows editors (for example, Notepad) automatically add a .txt extension to your files. When you save the script above under the name list_titles.sh, your editor may add an extension (list_titles.sh.txt), and the file explorer displays your file as list_titles.sh (hiding the .txt extension).
To prevent confusion, configure the file explorer to always show you the file extension. For this, open the Explorer, click on the “View” tab, and tick the box “File name extensions”.
Beyond this, double check the correct naming of your file, ideally in the terminal.
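Such a check in the terminal could look like the following minimal sketch. It simulates the problem in a temporary directory instead of touching your dataset; the variable names workdir and result are made up for illustration, and the sketch assumes a POSIX-like shell (for example Git Bash on Windows).

```shell
# Simulate an editor that silently appended ".txt" to the script name
workdir=$(mktemp -d)
mkdir "$workdir/code"
: > "$workdir/code/list_titles.sh.txt"

# Unlike the file explorer, the terminal always shows the full file name
ls "$workdir/code"

# Rename the file if the unwanted extension is present
if [ -f "$workdir/code/list_titles.sh.txt" ]; then
    mv "$workdir/code/list_titles.sh.txt" "$workdir/code/list_titles.sh"
fi

result=$(ls "$workdir/code")
echo "$result"
rm -r "$workdir"
```

In your own dataset, running ls code from the root of DataLad-101 is enough to see whether the file name is correct.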
$ cat << EOT > code/list_titles.sh
for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
    # get the filename
    base=\$(basename "\$i");
    # strip the extension
    base=\${base%.mp3};
    # date as yyyy-mm-dd
    printf "\${base%%__*}\t" | tr '_' '-';
    # name and title without underscores
    printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
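To see what the individual steps of the script do, here is a minimal sketch that applies them to a single file name. The file name is a hypothetical example modeled on the naming pattern of the podcast files, and the variables date_part and title_part are made up for illustration. Because the lines are typed directly instead of being written with cat, the $ characters need no backslashes here.

```shell
# Hypothetical file name following the pattern <date>__<speaker>__<title>.mp3
i="recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3"

# get the filename without its directory
base=$(basename "$i")
# strip the extension
base=${base%.mp3}
# "%%__*" removes everything from the first double underscore onwards: the date remains
date_part=$(printf '%s' "${base%%__*}" | tr '_' '-')
# "#*__" removes everything up to and including the first double underscore
title_part=$(printf '%s' "${base#*__}" | tr '_' ' ')

printf '%s\t%s\n' "$date_part" "$title_part"
```

The double underscore between speaker and title turns into two spaces, because tr replaces each underscore individually.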
Save this script to the dataset.
$ datalad status
untracked: code (directory)
$ datalad save -m "Add short script to write a list of podcast speakers and titles"
add(ok): code/list_titles.sh (file)
save(ok): . (dataset)
Once we run this script, it will simply print dates, names and titles to
your terminal. We can save its outputs to a new file recordings/podcasts.tsv
in the superdataset by redirecting these
outputs with bash code/list_titles.sh > recordings/podcasts.tsv.
Obviously, we could create this file, and subsequently save it to the superdataset.
However, just as in the example about the data scientist,
in a bit of time, we will forget how this file came into existence, or
that the script code/list_titles.sh is associated with this file, and
can be used to update it later on.
The datalad run command
can help with this. Put simply, it records a command’s impact on a dataset. Put
more technically, it will record a shell command, and datalad save all changes
this command triggered in the dataset, be that new files or changes to existing
files.
Let’s try the simplest way to use this command: datalad run,
followed by a commit message (-m "a concise summary"), and the
command that executes the script from the shell: bash code/list_titles.sh > recordings/podcasts.tsv.
It is helpful to enclose the command in quotation marks, so that the
redirection is passed on to datalad run instead of being interpreted by your shell.
Note that we execute the command from the root of the superdataset.
It is recommended to use datalad run in the root of the dataset
you want to record the changes in, so make sure to run this
command from the root of DataLad-101.
$ datalad run -m "create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [bash code/list_titles.sh > recordings/po...]
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
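The quotation marks around the command matter because of how the shell treats the > redirection. The following sketch illustrates this with a hypothetical stand-in function, show_args, that merely reports the arguments it receives; it is not part of DataLad, and all file and variable names are made up for illustration.

```shell
# Stand-in for a wrapper command: it only reports the arguments it was given
show_args() {
    printf 'received %d argument(s):' "$#"
    printf ' [%s]' "$@"
    printf '\n'
}

tmp=$(mktemp -d)

# Unquoted: the calling shell handles ">" itself, so the wrapper never
# sees the redirection, and its own output ends up in the file
show_args echo hello > "$tmp/unquoted.txt"
unquoted=$(cat "$tmp/unquoted.txt")

# Quoted: the complete command line, redirection included, is one argument
quoted=$(show_args "echo hello > out.txt")

echo "$unquoted"
echo "$quoted"
rm -r "$tmp"
```

Only in the quoted form does the wrapper get to see, and thus record, the redirection as part of the command.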
Let’s take a look into the history:
$ git log -p -n 1 # On Windows, you may just want to type "git log".
commit e37c9fc9✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
"chain": [],
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
"dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
new file mode 100644
index 0000000..f691b53
--- /dev/null
+++ b/recordings/podcasts.tsv
@@ -0,0 +1,206 @@
+2003-11-15 Brian Eno The Long Now
+2003-12-13 Peter Schwartz The Art Of The Really Long View
+2004-01-10 George Dyson There s Plenty of Room at the Top Long term Thinking About Large scale Computing
+2004-02-14 James Dewar Long term Policy Analysis
The commit message we have supplied with -m directly after datalad run appears
in our history as a short summary.
Additionally, the output of the command, recordings/podcasts.tsv,
was saved right away.
But there is more in this log entry, a section in between the markers
=== Do not change lines below === and ^^^ Do not change lines above ^^^.
This is the so-called run record: a recording of all of the
information in the datalad run command, generated by DataLad.
In this case, it is a very simple summary. One part is particularly
informative:
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv"
is the command that was run in the terminal.
This information therefore maps the command, and with it the script,
to the output file, in one commit. Nice, isn’t it?
Arguably, the run record is not the most human-readable way to display information.
This representation, however, is meant less for the human user (who should
rely on their informative commit message) than for DataLad, in particular for the
datalad rerun command, which you will see in action shortly. This
run record is machine-readable provenance that associates an output with
the command that produced it.
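To get a feeling for what “machine-readable” means here, the following sketch pulls the recorded command out of a commit message with standard shell tools. The message is a copy of the log entry shown earlier. The sed-based extraction and the variable names msg, record, and cmd are assumptions made for illustration; they are not how DataLad itself reads the record.

```shell
# Sample commit message, copied from the log entry shown earlier.
# In a real dataset, `git log -1 --format=%B` would print it instead.
msg=$(cat <<'EOF'
[DATALAD RUNCMD] create a list of podcast titles

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOF
)

# Keep only the lines between the two markers, then drop the markers themselves
record=$(printf '%s\n' "$msg" \
    | sed -n '/Do not change lines below/,/Do not change lines above/p' \
    | sed '1d;$d')

# Pull the recorded command out of the "cmd" field
cmd=$(printf '%s\n' "$record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
echo "$cmd"
```

Because the record is plain JSON between two fixed markers, any tool can recover the exact command that produced the output file.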
You have probably already guessed that every datalad run command
ends with a datalad save. A logical consequence of this fact is that any
datalad run that does not result in any changes in a dataset (no modification
of existing content; no additional files) will not produce any record in the
dataset’s history (just as a datalad save with no modifications present
will not create a history entry). Try to run the exact same
command as before, and check whether anything in your log changes:
$ datalad run -m "Try again to create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [bash code/list_titles.sh > recordings/po...]
$ git log --oneline
e37c9fc [DATALAD RUNCMD] create a list of podcast titles
e799b6b Add short script to write a list of podcast speakers and titles
87609a3 Add note on datalad clone
3c016f7 [DATALAD] Added subdataset
The most recent commit is still the datalad run command from before,
and there was no second datalad run commit created.
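The reason is that the second execution wrote exactly the same content, so there was nothing to save. The sketch below reproduces this idea with plain Git in a temporary repository. It is an analogy under stated assumptions, not a description of DataLad’s internals: the file content, the committer identity, and the variables repo, before, changes, and after are all made up for illustration.

```shell
# Plain-Git analogy: identical output leaves nothing to commit
repo=$(mktemp -d)
git -C "$repo" init -q

# First "run": create the output file and commit it
printf '2003-11-15\tBrian Eno\n' > "$repo/podcasts.tsv"
git -C "$repo" add podcasts.tsv
git -C "$repo" -c user.name="Example" -c user.email="example@example.net" \
    commit -q -m "first run"
before=$(git -C "$repo" rev-list --count HEAD)

# Second "run": the same command writes the same content again
printf '2003-11-15\tBrian Eno\n' > "$repo/podcasts.tsv"
changes=$(git -C "$repo" status --porcelain)

# Only commit if something actually changed
if [ -n "$changes" ]; then
    git -C "$repo" add podcasts.tsv
    git -C "$repo" -c user.name="Example" -c user.email="example@example.net" \
        commit -q -m "second run"
fi
after=$(git -C "$repo" rev-list --count HEAD)

echo "commits before: $before, after: $after"
rm -rf "$repo"
```

The number of commits stays the same, just as the log above still ends with the first datalad run commit.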
The datalad run command can therefore help you to keep track of what you are doing
in a dataset and capture provenance of your files: When, by whom, and how exactly
was a particular file created or modified?
The next sections will demonstrate how to make use of this information,
and also how to extend the command with additional arguments that will prove to
be helpful over the course of this chapter.