DataLad, Re-Run!

So far, you have created a simple .tsv file of all speakers and talk titles in the longnow/ podcasts subdataset. Let’s actually take a look at this file now:

$ less recordings/podcasts.tsv
2003-11-15	Brian Eno  The Long Now
2003-12-13	Peter Schwartz  The Art Of The Really Long View
2004-01-10	George Dyson  There s Plenty of Room at the Top  Long term Thinking About Large scale Computing
2004-02-14	James Dewar  Long term Policy Analysis
2004-03-13	Rusty Schweickart  The Asteroid Threat Over the Next 100 000 Years
2004-04-10	Daniel Janzen  Third World Conservation  It s ALL Gardening
2004-05-15	David Rumsey  Mapping Time
2004-06-12	Bruce Sterling  The Singularity  Your Future as a Black Hole
2004-07-10	Jill Tarter  The Search for Extra terrestrial Intelligence  Necessarily a Long term Strategy
2004-08-14	Phillip Longman  The Depopulation Problem
2004-09-11	Danny Hillis  Progress on the 10 000 year Clock
2004-10-16	Paul Hawken  The Long Green
2004-11-13	Michael West  The Prospects of Human Life Extension
2004-12-04	Ken Dychtwald  The Consequences of Human Life Extension
2005-01-15	James Carse  Religious War In Light of the Infinite Game
2005-02-26	Roger Kennedy  The Political History of North America from 25 000 BC to 12 000 AD
2005-03-12	Spencer Beebe  Very Long term Very Large scale Biomimicry
2005-04-09	Stewart Brand  Cities   Time
2005-06-11	Robert Neuwirth  The 21st Century Medieval City
2005-07-16	Jared Diamond  How Societies Fail And Sometimes Succeed
2005-08-13	Robert Fuller  Patient Revolution  Human Rights Past and Future
2005-09-24	Ray Kurzweil  Kurzweil s Law
2005-10-06	Esther Dyson  Freeman Dyson  George Dyson  The Difficulty of Looking Far Ahead
2005-11-15	Clay Shirky  Making Digital Durable  What Time Does to Categories
2005-12-10	Sam Harris  The View from the End of the World
2006-01-14	Ralph Cavanagh  Peter Schwartz  Nuclear Power  Climate Change and the Next 10 000 Years
2006-02-14	Stephen Lansing  Perfect Order  A Thousand Years in Bali
2006-03-11	Kevin Kelly  The Next 100 Years of Science  Long term Trends in the Scientific Method.
2006-04-15	Jimmy Wales  Vision  Wikipedia and the Future of Free Culture

Not too bad, and certainly good enough for the podcast night people. What’s been cool about creating this file is that it was created with a script within a datalad run command. Thanks to datalad run, the output file podcasts.tsv is associated with the script that generated it.

Upon reviewing the list you realized that you made a mistake, though: you only listed the talks in the SALT series (the Long_Now__Seminars_About_Long_term_Thinking/ directory), but not in the Long_Now__Conversations_at_The_Interval/ directory. Let’s fix this in the script. Replace the contents in code/list_titles.sh with the following, fixed script:

$ cat << EOT >| code/list_titles.sh
for i in recordings/longnow/Long_Now*/*.mp3; do
   # get the filename
   base=\$(basename "\$i");
   # strip the extension
   base=\${base%.mp3};
   printf "\${base%%__*}\t" | tr '_' '-';
   # name and title without underscores
   printf "\${base#*__}\n" | tr '_' ' ';

done
EOT
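
If the parameter expansions in the script look cryptic, a minimal sketch like the following may help. It applies the same expansions to a single file name. Note that the file name is an assumption, reconstructed from the first line of podcasts.tsv, and that the variable names date_part and title_part are made up for this illustration.

```shell
# Hypothetical file name, reconstructed from the first line of podcasts.tsv
base="2003_11_15__Brian_Eno__The_Long_Now.mp3"

# strip the extension
base=${base%.mp3}

# everything before the first double underscore is the date;
# %%__* removes the longest suffix that starts with "__"
date_part=$(printf '%s' "${base%%__*}" | tr '_' '-')

# everything after the first double underscore is speaker and title;
# #*__ removes the shortest prefix that ends with "__"
title_part=$(printf '%s' "${base#*__}" | tr '_' ' ')

printf '%s\t%s\n' "$date_part" "$title_part"
```

Unlike the script, the sketch hands the text to printf through a '%s' format, so a percent sign in a title would not be interpreted as a format specifier. The double underscore between speaker and title is why the output contains two spaces there.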

Because the script is now modified, save the modifications to the dataset. We can use the shorthand “BF” to denote “Bug fix” in the commit message.

$ datalad status
 modified: code/list_titles.sh (file)
$ datalad save -m "BF: list both directories content" code/list_titles.sh
add(ok): code/list_titles.sh (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

What we could do is run the same datalad run command as before to recreate the file, but now with all of the contents:

# do not execute this!
$ datalad run -m "create a list of podcast titles" "bash code/list_titles.sh > recordings/podcasts.tsv"

However, think about any situation where the command is longer than this one, or where many months have passed since its first execution. It would not be easy to remember the command, nor would it be very convenient to copy it from the run record.

Luckily, a fellow student remembered the DataLad way of re-executing a run command, and he’s eager to show it to you.

“In order to re-execute a datalad run command, find the commit and use its shasum (or a tag, or anything else that Git understands) as an argument for the datalad rerun command (datalad-rerun manual)! That’s it!”, he says happily.

So you go ahead and find the commit shasum in your history:

$ git log -n 2
commit d04c4afe95721b7f23bddd391a343995b071b8b5
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:22 2019 +0100

    BF: list both directories content

commit ecde4e4b41d219728bc3f2b28adf2f33b366c533
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:21 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles

Take that shasum and paste it after datalad rerun (the first 6-8 characters of the shasum would be sufficient, here we’re using all of them).

$ datalad rerun ecde4e4b41d219728bc3f2b28adf2f33b366c533
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (notneeded: 1, ok: 1)
  unlock (notneeded: 1)
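
Instead of reading the shasum off the log by eye, you could also let Git search for it. The following is a minimal sketch in plain Git, assuming that run commits can be recognized by the "[DATALAD RUNCMD]" marker seen in the log above. It builds a throwaway repository so that it can be tried without touching your dataset; the commit helper and the identity it uses are made up for the demo.

```shell
# Throwaway repository for the demo (plain Git, no DataLad needed)
demo=$(mktemp -d)
git -C "$demo" init -q

# Hypothetical helper: create an empty commit without relying on a global Git identity
commit() {
    git -C "$demo" -c user.name="Demo" -c user.email="demo@example.net" \
        commit -q --allow-empty -m "$1"
}
commit "[DATALAD RUNCMD] create a list of podcast titles"
commit "BF: list both directories content"

# -F treats the pattern as a fixed string, so the brackets are matched literally
run_commit=$(git -C "$demo" log -F --grep='[DATALAD RUNCMD]' -n 1 --format=%H)

# An abbreviated shasum is enough to identify the commit
short=$(git -C "$demo" rev-parse --short "$run_commit")
echo "datalad rerun $short"
```

In your own dataset, you would drop the demo setup and run the git log line directly in the dataset.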

Now DataLad has made use of the run record, and re-executed the original command based on the information in it. Because we updated the script, the output podcasts.tsv has changed and now contains the podcast titles of both subdirectories. You’ve probably already guessed it, but the easiest way to check whether a datalad rerun has changed the desired output file is to check whether the rerun command appears in the dataset’s history: if a datalad rerun does not add or change any content in the dataset, it will also not be recorded in the history.

$ git log -n 1
commit d26a29bd97a4cb8dd5af34bd3c0ce11626b4c2d9
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:23 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles
    
    === Do not change lines below ===
    {
     "chain": [
      "ecde4e4b41d219728bc3f2b28adf2f33b366c533"
     ],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

In the dataset’s history, we can see that a new datalad run was recorded. This action is committed by DataLad under the original commit message of the run command, and looks just like the previous datalad run commit apart from the execution time and the "chain" entry of the run record, which now lists the shasum of the original commit.
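
The check "did the rerun add a commit?" can also be done without reading the log: note where HEAD points before the rerun and compare afterwards. The sketch below shows the idea in a throwaway repository. As an assumption made for the demo, a plain Git commit stands in for the datalad rerun step, so the example runs without DataLad; the commit helper and its identity are hypothetical.

```shell
# Throwaway repository for the demo
demo=$(mktemp -d)
git -C "$demo" init -q

# Hypothetical helper: create an empty commit without relying on a global Git identity
commit() {
    git -C "$demo" -c user.name="Demo" -c user.email="demo@example.net" \
        commit -q --allow-empty -m "$1"
}
commit "[DATALAD RUNCMD] create a list of podcast titles"

# Remember the state before the rerun
before=$(git -C "$demo" rev-parse HEAD)

# Stand-in for the rerun: in a real dataset this line would be "datalad rerun <shasum>"
commit "[DATALAD RUNCMD] create a list of podcast titles"

# Compare with the state afterwards
after=$(git -C "$demo" rev-parse HEAD)
if [ "$before" != "$after" ]; then
    result="rerun recorded a new commit"
else
    result="rerun changed nothing"
fi
echo "$result"
```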

Two cool tools that go beyond the git log are the datalad diff (datalad-diff manual) and git diff commands. Both commands can report differences between two states of a dataset. Thus, you can get an overview of what changed between two commits. Both commands have a similar, but not identical structure: datalad diff compares one state (a commit specified with -f/--from, by default the latest change) and another state from the dataset’s history (a commit specified with -t/--to). Let’s do a datalad diff between the current state of the dataset and the previous commit (called “HEAD~1” in Git terminology):

$ datalad diff --to HEAD~1
 modified: recordings/podcasts.tsv (file)

This indeed shows the output file as “modified”. However, we do not know what exactly changed. This is a task for git diff (get out of the diff view by pressing q):

$ git diff HEAD~1
diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
index f691b53..d77891d 100644
--- a/recordings/podcasts.tsv
+++ b/recordings/podcasts.tsv
@@ -1,3 +1,31 @@
+2017-06-09	How Digital Memory Is Shaping Our Future  Abby Smith Rumsey
+2017-06-09	Pace Layers Thinking  Stewart Brand  Paul Saffo
+2017-06-09	Proof  The Science of Booze  Adam Rogers
+2017-06-09	Seveneves at The Interval  Neal Stephenson
+2017-06-09	Talking with Robots about Architecture  Jeffrey McGrew
+2017-06-09	The Red Planet for Real  Andy Weir
+2017-07-03	Transforming Perception  One Sense at a Time  Kara Platoni
+2017-08-01	How Climate Will Evolve Government and Society  Kim Stanley Robinson
+2017-09-01	Envisioning Deep Time  Jonathon Keats
+2017-10-01	Thinking Long term About the Evolving Global Challenge  The Refugee Reality
+2017-11-01	The Web In An Eye Blink  Jason Scott
+2017-12-01	Ideology in our Genes  The Biological Basis for Political Traits  Rose McDermott
+2017-12-07	Can Democracy Survive the Internet   Nathaniel Persily
+2018-01-02	The New Deal You Don t Know  Louis Hyman
+2018-02-01	Humanity and the Deep Ocean  James Nestor
+2018-03-01	Our Future in Algorithm Farming  Mike Kuniavsky
+2018-04-18	The Organized Pursuit of Knowledge  Margaret Levi
+2018-08-15	Facts  Feelings and Stories  How to Motivate Action on Climate Change  Shahzeen Attari
+2019-03-26	Charting the High Frontier of Space  Ed Lu
+2019-04-04	The Science of Climate Fiction  Can Stories Lead to Social Action   James Holland Jones
+2019-04-10	The Spirit Singularity  Science and the Afterlife at the Turn of the 20th Century  Hannu Rajaniemi
+2019-04-18	The Evolving Science of Behavior Change  Christopher Bryan
+2019-04-30	Siberia  A Journey to the Mammoth Steppe  Stewart Brand  Kevin Kelly  Alexander Rose
+2019-05-06	Can Nationalism be a Resource for Democracy   Maya Tudor
+2019-05-14	Growing Up Ape  The Long term Science of Studying Our Closest Living Relatives  Elizabeth  Lonsdorf
+2019-05-21	Time Poverty Amidst Digital Abundance  Judy Wajcman
+2019-06-07	A Foundation of Trust  Building a Blockchain Future  Brian Behlendorf
+2019-07-12	Learning From Le Guin  Kim Stanley Robinson
 2003-11-15	Brian Eno  The Long Now
 2003-12-13	Peter Schwartz  The Art Of The Really Long View
 2004-01-10	George Dyson  There s Plenty of Room at the Top  Long term Thinking About Large scale Computing

This output actually shows the precise changes between the contents created with the first version of the script and the second script with the bug fix. All of the titles that were added once the second directory was queried as well are shown in the diff, preceded by a +.
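
When a diff gets as long as this one, a summary can be easier to read than the full listing. The sketch below uses two standard git diff options, --stat and --numstat, on a tiny stand-in file in a throwaway repository. The repository, the save helper, and the shortened file contents are assumptions made for the demo.

```shell
# Throwaway repository for the demo
demo=$(mktemp -d)
git -C "$demo" init -q

# Hypothetical helper: stage everything and commit without a global Git identity
save() {
    git -C "$demo" add -A &&
    git -C "$demo" -c user.name="Demo" -c user.email="demo@example.net" \
        commit -q -m "$1"
}

# First version: one title only
printf '2003-11-15\tBrian Eno  The Long Now\n' > "$demo/podcasts.tsv"
save "first version"

# Second version: one additional title
printf '2017-06-09\tPace Layers Thinking  Stewart Brand  Paul Saffo\n2003-11-15\tBrian Eno  The Long Now\n' > "$demo/podcasts.tsv"
save "second version"

# --stat prints a per-file summary of the change
git -C "$demo" diff --stat HEAD~1 -- podcasts.tsv

# --numstat prints "added<TAB>deleted<TAB>path"; keep the first column
added=$(git -C "$demo" diff --numstat HEAD~1 -- podcasts.tsv | cut -f1)
echo "lines added: $added"
```

In your dataset, the same options would be used as "git diff --stat HEAD~1" or, restricted to one file, "git diff --numstat HEAD~1 -- recordings/podcasts.tsv".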

Quickly create a note about these two helpful commands in notes.txt:

$ cat << EOT >> notes.txt
There are two useful commands to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

EOT

Finally, save this note.

$ datalad save -m "add note datalad and git diff"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Note that datalad rerun can re-execute the run records of both a datalad run and a datalad rerun command, but not those of any other type of datalad command in your history, such as a datalad save on results or outputs after you executed a script. Therefore, make it a habit to record the execution of scripts by plugging them into datalad run.

This very basic example of a datalad run is as simple as it can get, but it is already convenient from a memory-load perspective: Now you do not need to remember the commands or scripts involved in creating an output. DataLad kept track of what you did, and you can instruct it to just “rerun” it. Also, incidentally, we have generated provenance information. It is now recorded in the history of the dataset how the output podcasts.tsv came into existence.

For example, to find out who (or what) created or modified a file, give the file path to git log (prefixed by --):

$ git log -- recordings/podcasts.tsv
commit d26a29bd97a4cb8dd5af34bd3c0ce11626b4c2d9
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:23 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles
    
    === Do not change lines below ===
    {
     "chain": [
      "ecde4e4b41d219728bc3f2b28adf2f33b366c533"
     ],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

commit ecde4e4b41d219728bc3f2b28adf2f33b366c533
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:21 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

Neat, isn’t it?
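
Because the run record is stored as JSON between two marker lines in the commit message, it can also be picked apart with standard tools. The following is a rough sketch, not a robust parser: it assumes the record looks exactly like the one shown above and would stumble over commands containing escaped quotes. To keep it self-contained, the commit message is pasted into a variable; in a dataset you could obtain it with "git log -1 --format=%B COMMIT" instead.

```shell
# Commit message of the first run, copied from the git log output above
msg=$(cat <<'EOF'
[DATALAD RUNCMD] create a list of podcast titles

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "dsid": "686444f8-0555-11ea-98f1-e86a64c8054c",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOF
)

# Print the block from the first marker line to the second, then drop both markers
record=$(printf '%s\n' "$msg" |
    sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
    sed '1d;$d')

# Pull the value of the "cmd" field out of the record
cmd=$(printf '%s\n' "$record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
echo "$cmd"
```

This answers the "what created this file" question with a single line, the command that was executed.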

Still, this datalad run was very simple. The next section will demonstrate how datalad run becomes handy in more complex standard use cases: situations with locked contents.

But prior to that, make a note about datalad run and datalad rerun in your notes.txt file.

$ cat << EOT >> notes.txt
The datalad run command can record the impact a script or command has on a dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

EOT

Finally, save this note.

$ datalad save -m "add note on basic datalad run and datalad rerun"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)