Basic provenance tracking

This use case demonstrates how the provenance of downloaded and generated files can be captured with DataLad by

  1. downloading a data file from an arbitrary URL on the web,

  2. making changes to this data file, and

  3. capturing provenance for all of this.

Note

Along the way, this section uses advanced Git commands and concepts that are not covered in the book. If you want to learn more about the Git commands shown here, the ProGit book is an excellent resource.

The Challenge

Rob needs to turn in an art project at the end of the high school year. He wants to make it as easy as possible and decides to just make a photomontage of some pictures from the internet. When he submits the project, he does not remember where he got the input data from, nor the exact steps to create his project, even though he tried to take notes.

The DataLad Approach

Rob starts his art project as a DataLad dataset. When downloading the images he wants to use for his project, he tracks where they come from. And when he changes or creates output, he tracks how, when, and why this was done, using standard DataLad commands. This will make it easy for him to find out or remember what he has done in his project, and how it has been done, a long time after he finished the project, without any note taking.

Step-by-Step

Rob starts by creating a dataset, because everything in a dataset can be version controlled and tracked:

$ datalad create artproject && cd artproject
[INFO] Creating a new annex repo at /home/me/usecases/provenance/artproject 
create(ok): /home/me/usecases/provenance/artproject (dataset)

For his art project, Rob decides to download a mosaic image composed of flowers from Wikimedia. As a first step, he extracts some of the flowers into individual files to reuse them later. He uses the datalad download-url command to get the resource straight from the web, capture all provenance automatically, and save the resource in his dataset together with a useful commit message:

$ mkdir sources
$ datalad download-url -m "Added flower mosaic from wikimedia" \
  https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
  --path sources/flowers.jpg
[INFO] Downloading 'https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg' into '/home/me/usecases/provenance/artproject/sources/flowers.jpg'
download_url(ok): /home/me/usecases/provenance/artproject/sources/flowers.jpg (file)
add(ok): sources/flowers.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

If he later wants to find out where he obtained this file, a git annex whereis [1] command will tell him:

$ git annex whereis sources/flowers.jpg
whereis sources/flowers.jpg (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
	bedd4a73-1508-4db9-9ccc-a85962c7a482 -- me@muninn:~/usecases/provenance/artproject [here]

  web: https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg
ok

To extract some image parts for the first step of his project, he uses the -extract option of ImageMagick’s convert tool to cut out the St. Bernard’s Lily from the upper left corner, and the pimpernel from the upper right corner. Each command takes the Wikimedia poster as input and produces an output file from it. To capture provenance for these actions, Rob wraps them in datalad run [2] commands:

$ datalad run -m "extract st-bernard lily" \
 --input "sources/flowers.jpg" \
 --output "st-bernard.jpg" \
 "convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
$ datalad run -m "extract pimpernel" \
  --input "sources/flowers.jpg" \
  --output "pimpernel.jpg" \
  "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
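
The geometry argument passed to -extract has the form WIDTHxHEIGHT+X+Y: the size of the region, followed by the offset of its top-left corner from the top-left corner of the image. As a minimal sketch, the two regions used above can be decoded in Python as follows. The helper name parse_geometry is hypothetical, and only this plain form with non-negative offsets is handled, not ImageMagick's full geometry syntax.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A rectangular crop region in pixel coordinates."""
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


# Hypothetical helper: handles only the plain WIDTHxHEIGHT+X+Y form used above.
_GEOMETRY = re.compile(r"^(\d+)x(\d+)\+(\d+)\+(\d+)$")


def parse_geometry(geometry: str) -> Region:
    """Turn a 'WxH+X+Y' string into the box of pixels it selects."""
    match = _GEOMETRY.match(geometry)
    if match is None:
        raise ValueError(f"unsupported geometry: {geometry!r}")
    width, height, x, y = (int(group) for group in match.groups())
    return Region(left=x, top=y, right=x + width, bottom=y + height)


if __name__ == "__main__":
    for name, geometry in [("st-bernard.jpg", "1522x1522+0+0"),
                           ("pimpernel.jpg", "1522x1522+1470+1470")]:
        print(name, parse_geometry(geometry))
```

Both regions are 1522 pixels square. The second one starts 1470 pixels to the right of and 1470 pixels below the top-left corner of the poster.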

He continues to process the images, capturing all provenance with DataLad. Later, he can always find out which commands produced or changed which file. This information is easily accessible within the history of his dataset, both with Git and DataLad commands such as git log or datalad diff.

$ git log --oneline HEAD~3..HEAD
ed908ca [DATALAD RUNCMD] extract pimpernel
b20f959 [DATALAD RUNCMD] extract st-bernard lily
cbf8e0e Added flower mosaic from wikimedia
$ datalad diff -f HEAD~3
    added: pimpernel.jpg (file)
    added: sources/flowers.jpg (file)
    added: st-bernard.jpg (file)
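
The commit messages of datalad run commits carry more than the one-line summary shown by git log --oneline: the full message also embeds a machine-readable record of the command, its inputs, and its outputs. The following Python sketch shows how such a record could be read back. It assumes that the record is stored as JSON between the two marker lines defined below. The sample message is constructed for illustration and modeled on the "extract pimpernel" commit, and its "dsid" value is a placeholder rather than a real dataset ID.

```python
import json

# Marker lines that (by assumption) delimit the JSON run record in the
# commit message of a `datalad run` commit.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded in a commit message."""
    lines = commit_message.splitlines()
    try:
        start = lines.index(START)
        end = lines.index(END, start + 1)
    except ValueError:
        raise ValueError("no run record found in commit message") from None
    return json.loads("\n".join(lines[start + 1:end]))


# Illustrative message; the "dsid" value is a placeholder.
SAMPLE_MESSAGE = """[DATALAD RUNCMD] extract pimpernel

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg",
 "dsid": "PLACEHOLDER",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["sources/flowers.jpg"],
 "outputs": ["pimpernel.jpg"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

if __name__ == "__main__":
    record = extract_run_record(SAMPLE_MESSAGE)
    print(record["cmd"])
    print("inputs:", record["inputs"], "outputs:", record["outputs"])
```

In a real dataset, the full message of a commit can be obtained with git log -1 --format=%B followed by the commit's checksum, and then passed to such a function.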

Based on this information, he can always reconstruct how and when any data file came to be, across the entire lifetime of a project.

He decides that one image manipulation for his art project will be to displace pixels of an image by a random amount to blur the image:

$ datalad run -m "blur image" \
   --input "st-bernard.jpg" \
   --output "st-bernard-displaced.jpg" \
   "convert -spread 10 st-bernard.jpg st-bernard-displaced.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)

Because he is not completely satisfied with the first random pixel displacement, he decides to retry the operation. Because everything was wrapped in datalad run, he can rerun the command. Rerunning the command will produce a new commit, because the displacement is random and the output file changes slightly from its previous version.

$ git log -1 --oneline HEAD
14a43c2 [DATALAD RUNCMD] blur image
$ datalad rerun 14a43c23e30c9a57e09655e9b7c714ba2aa33e73
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
unlock(ok): st-bernard-displaced.jpg (file)
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
  unlock (ok: 1)

This blur also does not yet fulfill Rob’s expectations, so he decides to discard the change, using standard Git tools [3].

$ git reset --hard HEAD~1
HEAD is now at 14a43c2 [DATALAD RUNCMD] blur image

He knows that within a DataLad dataset, he can also rerun a range of commands with the --since flag, and even specify an alternative starting point for rerunning them with the --onto flag. Every command from commits reachable from the specified checksum back to --since (but not including --since) will be re-executed. For example, datalad rerun --since=HEAD~5 will re-execute any commands recorded in the last five commits.

--onto indicates where to start rerunning the commands from. The default is HEAD, but anything other than HEAD will be checked out prior to execution, such that re-execution happens in a detached HEAD state, or on the new branch specified by the --branch flag.

If --since is an empty string, every command from the first commit that contains a recorded command is rerun. If --onto is an empty string, re-execution is performed on top of the parent of the first run commit in the revision list specified with --since. Setting both arguments to empty strings therefore means “rerun all commands with HEAD at the parent of the first commit with a recorded command”. In other words, Rob can “replay” the entire history of his art project in a single command. Using the --branch option of datalad rerun, he does so on a new branch that he names replay:

$ datalad rerun --since= --onto= --branch=replay
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  save (ok: 3)
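
The selection rules for --since and --onto described above can be illustrated with a toy model in Python. This is only a sketch of the described behavior for a strictly linear history, not DataLad's implementation. The Commit class and the functions select_reruns and resolve_onto are hypothetical names, and the short checksums are the ones from Rob's master branch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    """One commit of a linear history; `cmd` is None for non-run commits."""
    sha: str
    message: str
    cmd: Optional[str] = None


def select_reruns(history: List[Commit], since: str) -> List[Commit]:
    """Run commits to re-execute; `history` is oldest first and ends at HEAD.

    since == "" selects from the start of the history, otherwise every
    commit after (but not including) `since` is considered.
    """
    if since == "":
        candidates = history
    else:
        shas = [commit.sha for commit in history]
        candidates = history[shas.index(since) + 1:]
    return [commit for commit in candidates if commit.cmd is not None]


def resolve_onto(history: List[Commit], reruns: List[Commit],
                 onto: Optional[str]) -> Commit:
    """Commit on top of which re-execution starts (None models the default)."""
    if onto is None:
        return history[-1]  # HEAD
    if onto == "":
        if not reruns:
            raise ValueError("no run commits selected")
        first = history.index(reruns[0])
        if first == 0:
            raise ValueError("first run commit has no parent")
        return history[first - 1]  # parent of the first run commit
    for commit in history:
        if commit.sha == onto:
            return commit
    raise ValueError(f"unknown commit: {onto}")


# Rob's master branch, oldest commit first.
HISTORY = [
    Commit("452093d", "[DATALAD] new dataset"),
    Commit("cbf8e0e", "Added flower mosaic from wikimedia"),
    Commit("b20f959", "extract st-bernard lily", cmd="convert -extract ..."),
    Commit("ed908ca", "extract pimpernel", cmd="convert -extract ..."),
    Commit("14a43c2", "blur image", cmd="convert -spread ..."),
]

if __name__ == "__main__":
    reruns = select_reruns(HISTORY, since="")
    print("rerun:", [commit.sha for commit in reruns])
    print("onto:", resolve_onto(HISTORY, reruns, onto="").sha)
```

In this model, empty --since and --onto values select the three run commits and place them on top of cbf8e0e, the commit that added the flower mosaic. The download itself is not re-executed, because it was not recorded with datalad run.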

Now he is on a new branch of his project, which contains “replayed” history.

$ git log --oneline --graph master replay
* 0b0428a [DATALAD RUNCMD] blur image
* 3c25f9a [DATALAD RUNCMD] extract pimpernel
* 1c60430 [DATALAD RUNCMD] extract st-bernard lily
| * 14a43c2 [DATALAD RUNCMD] blur image
| * ed908ca [DATALAD RUNCMD] extract pimpernel
| * b20f959 [DATALAD RUNCMD] extract st-bernard lily
|/  
* cbf8e0e Added flower mosaic from wikimedia
* 452093d [DATALAD] new dataset

He can even compare the two branches:

$ datalad diff -t master -f replay
 modified: st-bernard-displaced.jpg (file)

He can see that the blurring, which involved a random element, produced different results. Because both versions live on branches of the same dataset, he can also compare them using normal Git operations. The next command, for example, marks which commits are “patch-equivalent” between the branches. Notice that all commits are marked as equivalent (=) except the ‘blur image’ ones, which involve the random spread.

$ git log --oneline --left-right --cherry-mark master...replay
> 0b0428a [DATALAD RUNCMD] blur image
= 3c25f9a [DATALAD RUNCMD] extract pimpernel
= 1c60430 [DATALAD RUNCMD] extract st-bernard lily
< 14a43c2 [DATALAD RUNCMD] blur image
= ed908ca [DATALAD RUNCMD] extract pimpernel
= b20f959 [DATALAD RUNCMD] extract st-bernard lily
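
Roughly speaking, two commits are patch-equivalent when they introduce the same change, regardless of their checksum, parent, or date. The following Python sketch is a toy model of that idea and not Git's actual patch-ID algorithm. The function names are hypothetical, and content identifiers such as content-A are placeholders that stand in for the content of each file.

```python
import hashlib
from typing import Dict

# A commit is modeled by what it changes: path -> identifier of the new content.
Changes = Dict[str, str]


def patch_id(changes: Changes) -> str:
    """Hash only the change itself; commit checksum, parent and date play no role."""
    digest = hashlib.sha1()
    for path in sorted(changes):
        digest.update(f"{path}\0{changes[path]}\n".encode())
    return digest.hexdigest()


def cherry_mark(left: Dict[str, Changes], right: Dict[str, Changes]) -> Dict[str, str]:
    """Mark commits '=' if the other side has an equivalent patch, else '<' or '>'."""
    left_ids = {patch_id(changes) for changes in left.values()}
    right_ids = {patch_id(changes) for changes in right.values()}
    marks = {}
    for sha, changes in left.items():
        marks[sha] = "=" if patch_id(changes) in right_ids else "<"
    for sha, changes in right.items():
        marks[sha] = "=" if patch_id(changes) in left_ids else ">"
    return marks


# Placeholder content identifiers: the two extractions are deterministic,
# the random blur yields different content on each branch.
MASTER = {
    "b20f959": {"st-bernard.jpg": "content-A"},
    "ed908ca": {"pimpernel.jpg": "content-B"},
    "14a43c2": {"st-bernard-displaced.jpg": "content-C"},
}
REPLAY = {
    "1c60430": {"st-bernard.jpg": "content-A"},
    "3c25f9a": {"pimpernel.jpg": "content-B"},
    "0b0428a": {"st-bernard-displaced.jpg": "content-D"},
}

if __name__ == "__main__":
    for sha, mark in cherry_mark(MASTER, REPLAY).items():
        print(mark, sha)
```

The extraction commits get different checksums on the two branches, but they add identical files and are therefore treated as equivalent. Only the two blur commits differ in what they introduce.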

Rob can continue processing images, and will turn in a successful art project. Long after he finishes high school, he finds his dataset on his old computer again and remembers this small project fondly.

Footnotes

[1] If you want to learn more about git annex whereis, re-read the section “Where’s Waldo?”.

[2] If you want to learn more about datalad run, read on from the section “Keeping track”.

[3] Find out more about working with the history of a dataset with Git in the section “Miscellaneous file system operations”.