2.2. DataLad, rerun!
So far, you created a .tsv file of all speakers and talk titles in the longnow/ podcasts subdataset.
Let’s actually take a look into this file now:
$ less recordings/podcasts.tsv
2003-11-15 Brian Eno The Long Now
2003-12-13 Peter Schwartz The Art Of The Really Long View
2004-02-14 James Dewar Long term Policy Analysis
2004-03-13 Rusty Schweickart The Asteroid Threat Over the Next 100 000 Years
2004-04-10 Daniel Janzen Third World Conservation It s ALL Gardening
-✂--✂-
Not too bad, and certainly good enough for the podcast night people.
What’s been cool about creating this file is that it was created with
a script within a datalad run (manual) command. Thanks to datalad run,
the output file podcasts.tsv is associated with the script that generated it.
Upon reviewing the list you realized that you made a mistake, though: you only
listed the talks in the SALT series (the
Long_Now__Seminars_About_Long_term_Thinking/
directory), but not
in the Long_Now__Conversations_at_The_Interval/
directory.
Let’s fix this in the script. Replace the contents in code/list_titles.sh
with the following, fixed script:
Here’s a script adjustment for Windows users
Please use an editor of your choice to replace the contents of list_titles.sh
inside of the code
directory with the following:
for i in recordings/longnow/Long_Now*/*.mp3; do
# get the filename
base=$(basename "$i");
# strip the extension
base=${base%.mp3};
# date as yyyy-mm-dd
printf "${base%%__*}\t" | tr '_' '-';
# name and title without underscores
printf "${base#*__}\n" | tr '_' ' ';
done
$ cat << EOT >| code/list_titles.sh
for i in recordings/longnow/Long_Now*/*.mp3; do
# get the filename
base=\$(basename "\$i");
# strip the extension
base=\${base%.mp3};
# date as yyyy-mm-dd
printf "\${base%%__*}\t" | tr '_' '-';
# name and title without underscores
printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
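To see what the two parameter expansions in the script do, here is a minimal sketch that applies them to a single file name. The file name is a hypothetical example modeled on the naming scheme of the podcast files, and the variable names date_part and title_part are introduced only for this illustration.

```shell
# Hypothetical file name, modeled on the naming scheme of the podcast files
i="recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3"
# get the filename and strip the extension, as in the script
base=$(basename "$i")
base=${base%.mp3}
# %%__* removes everything from the first double underscore onwards (the date remains)
date_part=$(printf '%s' "${base%%__*}" | tr '_' '-')
# #*__ removes everything up to and including the first double underscore
title_part=$(printf '%s' "${base#*__}" | tr '_' ' ')
printf '%s\t%s\n' "$date_part" "$title_part"
```

The double underscore between speaker and title turns into two spaces, because tr replaces every single underscore.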
Because the script is now modified, save the modifications to the dataset. We can use the shorthand “BF” to denote “Bug fix” in the commit message.
$ datalad status
modified: code/list_titles.sh (file)
$ datalad save -m "BF: list both directories content" \
code/list_titles.sh
add(ok): code/list_titles.sh (file)
save(ok): . (dataset)
What we could do is run the same datalad run
command as before to recreate
the file, but now with all of the contents:
$ # do not execute this!
$ datalad run -m "create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
However, think about a situation where the command is longer than this,
or where many months have passed since the first execution. It would not be easy to remember
the command, nor would it be very convenient to copy it from the run record.
Luckily, a fellow student remembered the DataLad way of re-executing
a run
command, and he’s eager to show it to you.
“In order to re-execute a datalad run
command,
find the commit and use its shasum (or a tag, or anything else that Git
understands) as an argument for the
datalad rerun
(manual) command! That’s it!”,
he says happily.
So you go ahead and find the commit shasum in your history:
$ git log -n 2
commit f7ea9f3d✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
BF: list both directories content
commit e37c9fc9✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] create a list of podcast titles
Take that shasum and paste it after datalad rerun
(the first 6-8 characters of the shasum would be sufficient,
here we are using all of them).
$ datalad rerun e37c9fc9✂SHA1
[INFO] run commit e37c9fc; (create a list of ...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [bash code/list_titles.sh > recordings/po...]
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
run (ok: 1)
save (notneeded: 1, ok: 1)
unlock (notneeded: 1)
Now DataLad has made use of the run record
, and
re-executed the original command based on the information in it.
Because we updated the script, the output podcasts.tsv
has changed and now contains the podcast
titles of both subdirectories.
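Instead of scanning the log by eye, a run commit can also be looked up by its commit message. The following is a minimal sketch with plain Git in a throwaway repository. The repository, its two empty commits, and the variable names are made up for illustration, and it assumes that run commits carry the "[DATALAD RUNCMD]" prefix seen in the log above.

```shell
# Throwaway repository with two empty commits that mimic the history above
repo=$(mktemp -d)
git -C "$repo" init -q
for msg in "[DATALAD RUNCMD] create a list of podcast titles" \
           "BF: list both directories content"; do
    git -C "$repo" -c user.name="Elena Piscopia" -c user.email="elena@example.net" \
        commit -q --allow-empty -m "$msg"
done
# Most recent commit whose message contains the run marker
run_commit=$(git -C "$repo" log --grep='DATALAD RUNCMD' --format=%H -n 1)
# In this toy history, that is the commit right before the latest one
parent=$(git -C "$repo" rev-parse HEAD~1)
echo "run commit: $run_commit"
rm -rf "$repo"
```

In a real dataset, the shasum found this way could then be handed to datalad rerun.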
You’ve probably already guessed it, but the easiest way
to check whether a datalad rerun has changed the desired output file is
to check whether the rerun command appears in the dataset’s history:
If a datalad rerun does not add or change any content in the dataset,
it will also not be recorded in the history.
$ git log -n 1
commit 08120c38✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
"chain": [
"e37c9fc9✂SHA1"
],
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
"dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
In the dataset’s history,
we can see that a new datalad run
was recorded. This action is
committed by DataLad under the original commit message of the run
command, and looks just like the previous datalad run
commit.
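Because the run record sits between two fixed marker lines in the commit message, it can be pulled out with standard text tools. The sketch below works on a copy of the commit message shown above (including the abbreviated shasum as it appears in the listing) rather than on a real repository. The variable names and the sed expressions are assumptions made for this illustration, not a DataLad interface.

```shell
# Copy of the commit message from the listing above
msg=$(cat << 'EOT'
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
 "chain": [
  "e37c9fc9✂SHA1"
 ],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOT
)
# Keep the block between the two marker lines, then drop the markers themselves
record=$(printf '%s\n' "$msg" \
    | sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' \
    | sed '1d;$d')
# Extract the command that was recorded
cmd=$(printf '%s\n' "$record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
printf '%s\n' "$cmd"
```

For anything beyond a quick look, a proper JSON parser would be a more robust choice than sed.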
Two cool tools that go beyond the git log
(manual)
are the datalad diff
(manual) and git diff
(manual) commands.
Both commands can report differences between two states of
a dataset. Thus, you can get an overview of what changed between two commits.
Both commands have a similar, but not identical structure: datalad diff
compares one state (a commit specified with -f/--from,
by default the latest change)
with another state from the dataset’s history (a commit specified with
-t/--to). Let’s do a datalad diff between the current state
of the dataset and the previous commit (called “HEAD~1” in Git terminology[1]):
please use ‘datalad diff --from main --to HEAD~1’
While this example works on Unix file systems, it will not provide the same output on Windows. This is due to different file handling on Windows. When executing this command, you will see all files being modified between the most recent and the second-most recent commit. On a technical level, this is correct given the underlying file handling on Windows, and chapter Under the hood: git-annex will shed light on why that is.
For now, to get the same output as shown in the code snippet below, use the following command where main
(or master
) is the name of your default branch:
$ datalad diff --from main --to HEAD~1
The --from argument specifies a different starting point for the comparison: the main
or master branch, which would be the starting point on most Unix-based systems.
$ datalad diff --to HEAD~1
modified: recordings/podcasts.tsv (file)
This indeed shows the output file as “modified”. However, we do not know
what exactly changed. This is a task for git diff
(get out of the
diff view by pressing q
):
$ git diff HEAD~1
diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
index f691b53..d77891d 100644
--- a/recordings/podcasts.tsv
+++ b/recordings/podcasts.tsv
@@ -1,3 +1,31 @@
+2017-06-09 How Digital Memory Is Shaping Our Future Abby Smith Rumsey
+2017-06-09 Pace Layers Thinking Stewart Brand Paul Saffo
+2017-06-09 Proof The Science of Booze Adam Rogers
+2017-06-09 Seveneves at The Interval Neal Stephenson
+2017-06-09 Talking with Robots about Architecture Jeffrey McGrew
+2017-06-09 The Red Planet for Real Andy Weir
+2017-07-03 Transforming Perception One Sense at a Time Kara Platoni
+2017-08-01 How Climate Will Evolve Government and Society Kim Stanley Robinson
+2017-09-01 Envisioning Deep Time Jonathon Keats
+2017-10-01 Thinking Long term About the Evolving Global Challenge The Refugee Reality
+2017-11-01 The Web In An Eye Blink Jason Scott
+2017-12-01 Ideology in our Genes The Biological Basis for Political Traits Rose McDermott
+2017-12-07 Can Democracy Survive the Internet Nathaniel Persily
+2018-01-02 The New Deal You Don t Know Louis Hyman
This output shows the precise changes between the contents created
with the first version of the script and those created with the fixed script.
All lines that were added once the second directory
was queried as well are shown in the diff, preceded by a +.
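git diff prints changes in the unified diff format, the same format that the standard diff -u tool produces. As a minimal sketch of how to read it, the following compares two throwaway files with plain diff. The file names and the variable names are made up for this illustration, and the rows are sample lines taken from the listings in this section.

```shell
workdir=$(mktemp -d)
# "Old" state: one talk from the SALT series
printf '2003-11-15\tBrian Eno  The Long Now\n' > "$workdir/old.tsv"
# "New" state: the same talk plus one from the second directory
{
    printf '2003-11-15\tBrian Eno  The Long Now\n'
    printf '2017-06-09\tSeveneves at The Interval  Neal Stephenson\n'
} > "$workdir/new.tsv"
# diff exits with status 1 when the files differ, which is not an error here
diff_out=$(diff -u "$workdir/old.tsv" "$workdir/new.tsv" || true)
printf '%s\n' "$diff_out"
# Added lines start with a single "+" (the "+++" line is the file header)
added=$(printf '%s\n' "$diff_out" | grep -c '^+[^+]')
echo "added lines: $added"
rm -rf "$workdir"
```

Removed lines would show up the same way with a leading "-".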
Quickly create a note about these two helpful commands in notes.txt
:
$ cat << EOT >> notes.txt
There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.
EOT
Finally, save this note.
$ datalad save -m "add note datalad and git diff"
add(ok): notes.txt (file)
save(ok): . (dataset)
Note that datalad rerun can re-execute the run records of both a datalad run
and a datalad rerun command,
but it cannot re-execute any other type of DataLad command in your history,
such as a datalad save (manual) on results or outputs after you executed a script.
Therefore, make it a
habit to record the execution of scripts by plugging them into datalad run.
This very basic example of a datalad run
is as simple as it can get, but it
is already
convenient from a memory-load perspective: Now you do not need to
remember the commands or scripts involved in creating an output. DataLad kept track
of what you did, and you can instruct it to “rerun
” it.
Also, incidentally, we have generated provenance information. It is
now recorded in the history of the dataset how the output podcasts.tsv came
into existence. And we can interact with and use this provenance information with
tools other than the machine-readable run record.
For example, to find out who (or what) created or modified a file,
give the file path to git log
(prefixed by --
):
use ‘git log main -- recordings/podcasts.tsv’
A previous Windows Wit already advised appending main
or master, the common “default branch”, to any command that starts with git log.
Here, the last part of the command specifies a file (-- recordings/podcasts.tsv
).
Please append main
or master
to git log
, prior to the file specification.
$ git log -- recordings/podcasts.tsv
commit 08120c38✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
"chain": [
"e37c9fc9✂SHA1"
],
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
"dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
commit e37c9fc9✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
"chain": [],
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
"dsid": "e3e70682-c209-4cac-629f-6fbed82c07cd",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
Neat, isn’t it?
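The effect of the path argument can be tried out safely in a throwaway Git repository. In this minimal sketch, the repository, the file contents, the second commit message, and the helper function commit_as_elena are all hypothetical. Only the git log -- PATH form itself is taken from the example above.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
# Hypothetical helper: commit staged changes with a fixed identity
commit_as_elena() {
    git -C "$repo" -c user.name="Elena Piscopia" -c user.email="elena@example.net" \
        commit -q -m "$1"
}
# First commit touches podcasts.tsv
echo "placeholder content" > "$repo/podcasts.tsv"
git -C "$repo" add podcasts.tsv
commit_as_elena "create a list of podcast titles"
# Second commit touches a different file
echo "a note" > "$repo/notes.txt"
git -C "$repo" add notes.txt
commit_as_elena "add a note"
# Restricting the log to a path lists only the commits that changed that file
touched=$(git -C "$repo" log --format=%s -- podcasts.tsv)
total=$(git -C "$repo" rev-list --count HEAD)
echo "commits touching podcasts.tsv: $touched"
rm -rf "$repo"
```

Although the toy history holds two commits, only the one that changed the file is reported.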
Still, this datalad run
was very simple.
The next section will demonstrate how datalad run
becomes handy in
more complex standard use cases: situations with locked contents.
But prior to that, make a note about datalad run
and datalad rerun
in your
notes.txt
file.
$ cat << EOT >> notes.txt
The datalad run command can record the impact a script or command has
on a Dataset. In its simplest form, datalad run only takes a commit
message and the command that should be executed.
Any datalad run command can be re-executed by using its commit shasum
as an argument in datalad rerun CHECKSUM. DataLad will take
information from the run record of the original commit, and re-execute
it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!
EOT
Finally, save this note.
$ datalad save -m "add note on basic datalad run and datalad rerun"
add(ok): notes.txt (file)
save(ok): . (dataset)
Footnotes