Basic provenance tracking

This use case demonstrates how the provenance of downloaded and generated files can be captured with DataLad by

  1. downloading a data file from an arbitrary URL on the web,

  2. making changes to this data file, and

  3. capturing provenance for all of this.

How to become a Git pro

Along the way, this section uses advanced Git commands and concepts that are not covered in the book. If you want to learn more about the Git commands shown here, the ProGit book is an excellent resource.

The Challenge

Rob needs to turn in an art project at the end of the high school year. He wants to make it as easy as possible and decides to just make a photomontage of some pictures from the internet. When he submits the project, he does not remember where he got the input data from, nor the exact steps to create his project, even though he tried to take notes.

The DataLad Approach

Rob starts his art project as a DataLad dataset. When downloading the images he wants to use for his project, he tracks where they come from. And when he changes or creates output, he tracks how, when, and why this was done, using standard DataLad commands. This will make it easy for him to find out or remember what he has done in his project, and how it has been done, a long time after he finished the project, without any note taking.

Step-by-Step

Rob starts by creating a dataset, because everything in a dataset can be version controlled and tracked:

$ datalad create artproject && cd artproject
[INFO] Creating a new annex repo at /home/me/usecases/provenance/artproject 
[INFO] scanning for unlocked files (this may take some time)
create(ok): /home/me/usecases/provenance/artproject (dataset)

For his art project, Rob decides to download a mosaic image composed of flowers from Wikimedia. As a first step, he plans to extract some of the flowers into individual files to reuse them later. He uses the datalad download-url command to get the resource straight from the web, capture all provenance automatically, and save the resource in his dataset together with a useful commit message:

$ mkdir sources
$ datalad download-url -m "Added flower mosaic from wikimedia" \
  https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
  --path sources/flowers.jpg
[INFO] Downloading 'https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg' into '/home/me/usecases/provenance/artproject/sources/flowers.jpg'
download_url(ok): /home/me/usecases/provenance/artproject/sources/flowers.jpg (file)
add(ok): sources/flowers.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

If he later wants to find out where he obtained this file from, a git annex whereis [1] command will tell him:

$ git annex whereis sources/flowers.jpg
whereis sources/flowers.jpg (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
	6a5e5a27-06b0-4987-b028-c9fcfa0d13c4 -- me@muninn:~/usecases/provenance/artproject [here]

  web: https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg
ok
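
For use in scripts, the URL can be filtered out of this output. The following is a minimal sketch, assuming the plain-text layout shown above, in which registered URLs appear on lines starting with "web:". The function name web_urls is made up for this illustration and is not part of git-annex or DataLad.

```shell
#!/bin/sh
# Hypothetical helper: print only the URLs that `git annex whereis`
# lists for the "web" remote. It reads whereis output on stdin and
# assumes that URLs appear on lines of the form "  web: <url>".
web_urls() {
    sed -n 's/^[[:space:]]*web: //p'
}

# Intended use inside the dataset (requires git-annex):
#   git annex whereis sources/flowers.jpg | web_urls
```

Because the helper only filters text, it depends on the output layout staying as shown here.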

To extract some image parts for the first step of his project, he uses the convert tool from ImageMagick with its -extract option to cut out the St. Bernard’s Lily from the upper left corner, and the pimpernel from the upper right corner. The commands will take the Wikimedia poster as an input and produce output files from it. To capture provenance on this action, Rob wraps it into datalad run [2] commands.

$ datalad run -m "extract st-bernard lily" \
 --input "sources/flowers.jpg" \
 --output "st-bernard.jpg" \
 "convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
$ datalad run -m "extract pimpernel" \
  --input "sources/flowers.jpg" \
  --output "pimpernel.jpg" \
  "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)

He continues to process the images, capturing all provenance with DataLad. Later, he can always find out which commands produced or changed which file. This information is easily accessible within the history of his dataset, both with Git and DataLad commands such as git log or datalad diff.

$ git log --oneline HEAD~3..HEAD
ee60856 [DATALAD RUNCMD] extract pimpernel
604e0ec [DATALAD RUNCMD] extract st-bernard lily
8abceb1 Added flower mosaic from wikimedia
$ datalad diff -f HEAD~3
    added: pimpernel.jpg (file)
    added: sources/flowers.jpg (file)
    added: st-bernard.jpg (file)

Based on this information, he can always reconstruct how and when any data file came to be, across the entire lifetime of a project.
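
The exact command behind each [DATALAD RUNCMD] commit is stored in the commit message as a machine-readable run record. As a minimal sketch, assuming the record is the JSON block enclosed by the marker lines "=== Do not change lines below ===" and "^^^ Do not change lines above ^^^", a small helper can print it. The function name run_record is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the JSON run record embedded in a
# "[DATALAD RUNCMD]" commit message that is read from stdin.
# Assumption: the record sits between the two marker lines below.
run_record() {
    awk '
        $0 == "^^^ Do not change lines above ^^^" { inside = 0 }
        inside { print }
        $0 == "=== Do not change lines below ===" { inside = 1 }
    '
}

# Intended use inside the dataset:
#   git log -1 --format=%B ee60856 | run_record
```

The helper prints nothing for commits that carry no run record, such as the commit created by datalad download-url.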

He decides that one image manipulation for his art project will be to displace pixels of an image by a random amount to blur the image:

$ datalad run -m "blur image" \
   --input "st-bernard.jpg" \
   --output "st-bernard-displaced.jpg" \
   "convert -spread 10 st-bernard.jpg st-bernard-displaced.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)

Because he is not completely satisfied with the first random pixel displacement, he decides to retry the operation. Because everything was wrapped in datalad run, he can rerun the command. Rerunning the command will produce a commit, because the displacement is random and the output file changes slightly from its previous version.

$ git log -1 --oneline HEAD
548ba71 [DATALAD RUNCMD] blur image
$ datalad rerun 548ba71b3f319d7f88b85ba899081245c567b31c
[INFO] run commit 548ba71; (blur image)
[INFO] Making sure inputs are available (this may take some time) 
unlock(ok): st-bernard-displaced.jpg (file)
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
  unlock (ok: 1)

This blur also does not yet fulfill Rob's expectations, so he decides to discard the change, using standard Git tools [3].

$ git reset --hard HEAD~1
HEAD is now at 548ba71 [DATALAD RUNCMD] blur image

He knows that within a DataLad dataset, he can also rerun a range of commands with the --since flag, and even specify an alternative starting point for rerunning them with the --onto flag. Every command from commits reachable from the specified revision (HEAD by default), back to but not including --since, will be re-executed. For example, datalad rerun --since=HEAD~5 will re-execute any commands recorded in the last five commits.

--onto indicates where to start rerunning the commands from. The default is HEAD, but anything other than HEAD will be checked out prior to execution, such that re-execution happens in a detached HEAD state, or on the new branch specified with the --branch flag.

If --since is an empty string, every command from the first commit that contains a recorded command is rerun. If --onto is an empty string, re-execution is performed on top of the parent of the first run commit in the revision list specified with --since. Setting both arguments to empty strings therefore means “rerun all commands with HEAD at the parent of the first commit that contains a recorded command”. In other words, Rob can “replay” the entire history of his art project in a single command. Using the --branch option of datalad rerun, he does it on a new branch he names replay:

$ datalad rerun --since= --onto= --branch=replay
[INFO] checkout commit 8abceb1;
[INFO] run commit 604e0ec; (extract st-bernar...)
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)
[INFO] run commit ee60856; (extract pimpernel)
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)
[INFO] run commit 548ba71; (blur image)
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  save (ok: 3)
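
The revision range that --since selects can be illustrated with plain Git. The sketch below does not call DataLad. It only shows, in a throwaway repository, which commits are reachable from HEAD but not from a given revision. The helper name commits_since and the commit messages are made up for this illustration.

```shell
#!/bin/sh
# Illustration with plain Git only (no DataLad involved):
# list the subjects of all commits reachable from HEAD but not from the
# given revision, oldest first. The given revision itself is excluded.
commits_since() {
    git -C "$1" log --reverse --format=%s "$2"..HEAD
}

# Build a throwaway repository with four empty commits.
repo=$(mktemp -d)
git init -q "$repo"
for msg in "new dataset" "step one" "step two" "step three"; do
    git -C "$repo" -c user.name=rob -c user.email=rob@example.com \
        commit -q --allow-empty -m "$msg"
done

# Everything after HEAD~2, not including HEAD~2 itself:
# prints "step two" and "step three", one per line.
commits_since "$repo" HEAD~2
```

HEAD~2 is the "step one" commit here, so only the two commits after it are listed, in the order in which they would be replayed.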

Now he is on a new branch of his project, which contains “replayed” history.

$ git log --oneline --graph master replay
* 5a31ed6 [DATALAD RUNCMD] blur image
* b67c608 [DATALAD RUNCMD] extract pimpernel
* 5a861fa [DATALAD RUNCMD] extract st-bernard lily
| * 548ba71 [DATALAD RUNCMD] blur image
| * ee60856 [DATALAD RUNCMD] extract pimpernel
| * 604e0ec [DATALAD RUNCMD] extract st-bernard lily
|/  
* 8abceb1 Added flower mosaic from wikimedia
* 62829d6 [DATALAD] new dataset

He can even compare the two branches:

$ datalad diff -t master -f replay
 modified: st-bernard-displaced.jpg (file)

He can see that the blurring, which involved a random element, produced different results. Because his dataset contains two branches, he can compare them using normal Git operations. The next command, for example, marks which commits are “patch-equivalent” between the branches. Notice that all commits are marked as equivalent (=) except the two ‘blur image’ commits, which used the random spread.

$ git log --oneline --left-right --cherry-mark master...replay
> 5a31ed6 [DATALAD RUNCMD] blur image
= b67c608 [DATALAD RUNCMD] extract pimpernel
= 5a861fa [DATALAD RUNCMD] extract st-bernard lily
< 548ba71 [DATALAD RUNCMD] blur image
= ee60856 [DATALAD RUNCMD] extract pimpernel
= 604e0ec [DATALAD RUNCMD] extract st-bernard lily

Rob can continue processing images, and will turn in a successful art project. Long after he finishes high school, he finds his dataset on his old computer again and remembers this small project fondly.

Footnotes

1

If you want to learn more about git annex whereis, re-read section Where’s Waldo?.

2

If you want to learn more about datalad run, read on from section Keeping track.

3

Find out more about working with the history of a dataset with Git in section Miscellaneous file system operations.