Basic provenance tracking

This use case demonstrates how DataLad can capture the provenance of downloaded and generated files by

downloading a data file from an arbitrary URL on the web,
making changes to this data file, and
capturing provenance for all of this.

Along the way, this section uses some advanced Git commands and concepts that are not covered in the book. If you want to learn more about the Git commands shown here, the ProGit book is an excellent resource.

The Challenge

Rob needs to turn in an art project at the end of the high school year. He wants to make it as easy as possible and decides to just make a photomontage of some pictures from the internet. When he submits the project, he does not remember where he got the input data from, nor the exact steps to create his project, even though he tried to take notes.

The DataLad Approach

Rob starts his art project as a DataLad dataset. When downloading the images he wants to use for his project, he tracks where they come from. And when he changes or creates output, he tracks how, when, and why this was done, using standard DataLad commands. This will make it easy for him to find out or remember what he has done in his project, and how it has been done, a long time after he finished the project, without any note taking.

Step-by-Step

Rob starts by creating a dataset, because everything in a dataset can be version controlled and tracked:

$ datalad create artproject && cd artproject
create(ok): /home/me/usecases/provenance/artproject (dataset)

For his art project, Rob decides to download a mosaic image composed of flowers from Wikimedia. As a first step, he wants to extract some of the flowers into individual files to reuse them later. To obtain the mosaic, he uses the datalad download-url (manual) command, which gets the resource straight from the web, captures all provenance automatically, and saves the resource in his dataset together with a useful commit message:

$ mkdir sources
$ datalad download-url -m "Added flower mosaic from wikimedia" \
  https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
  --path sources/flowers.jpg
[INFO] Downloading 'https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg' into '/home/me/usecases/provenance/artproject/sources/flowers.jpg' 
download_url(ok): /home/me/usecases/provenance/artproject/sources/flowers.jpg (file)
add(ok): sources/flowers.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

If he later wants to find out where he obtained this file from, a git annex whereis (manual)[1] command will tell him:

$ git annex whereis sources/flowers.jpg
whereis sources/flowers.jpg (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	0e5a10bb-b3d5-4c84-9f9c-f72f65ed0cae -- me@muninn:~/usecases/provenance/artproject [here]

  web: https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg
ok
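In a script, it can be handy to pull just the recorded URL out of this report. The following is a minimal sketch that filters the human-readable output shown above. The function name web_urls is made up for illustration, and the sketch assumes the "web: <URL>" line format seen in this example rather than a stable interface.

```shell
# Hypothetical helper: print the URLs that `git annex whereis` lists for
# the special "web" remote. It reads the report on standard input and
# assumes the "  web: <URL>" line format shown in the output above.
web_urls() {
  sed -n 's/^[[:space:]]*web: //p'
}

# Sample report copied from the example above, so the sketch runs without
# a dataset. In a real dataset one would instead pipe
# `git annex whereis sources/flowers.jpg` into the function.
report='whereis sources/flowers.jpg (2 copies)
  00000000-0000-0000-0000-000000000001 -- web
  0e5a10bb-b3d5-4c84-9f9c-f72f65ed0cae -- me@muninn:~/usecases/provenance/artproject [here]

  web: https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg
ok'

printf '%s\n' "$report" | web_urls
```

The line ending in "-- web" is not matched, because only lines that begin with "web: " after optional whitespace are kept.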

To extract some image parts for the first step of his project, he uses the -extract option of ImageMagick's convert tool to extract the St. Bernard’s Lily from the upper left corner, and the pimpernel from the upper right corner. The commands will take the Wikimedia poster as an input and produce output files from it. To capture provenance on this action, Rob wraps it into datalad run (manual)[2] commands.

$ datalad run -m "extract st-bernard lily" \
  --input "sources/flowers.jpg" \
  --output "st-bernard.jpg" \
  "convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -extract 1522x1522+0+0 sources/f...]
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)

$ datalad run -m "extract pimpernel" \
  --input "sources/flowers.jpg" \
  --output "pimpernel.jpg" \
  "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -extract 1522x1522+1470+1470 sou...]
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)

He continues to process the images, capturing all provenance with DataLad. Later, he can always find out which commands produced or changed which file. This information is easily accessible within the history of his dataset, both with Git and DataLad commands such as git log (manual) or datalad diff (manual).

$ git log --oneline HEAD~3..HEAD
bd6a92e [DATALAD RUNCMD] extract pimpernel
92145d5 [DATALAD RUNCMD] extract st-bernard lily
4816a3e Added flower mosaic from wikimedia

$ datalad diff -f HEAD~3
    added: pimpernel.jpg (symlink)
    added: sources/flowers.jpg (symlink)
    added: st-bernard.jpg (symlink)

Based on this information, he can always reconstruct how and when any data file came to be – across the entire lifetime of a project.
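The commits created by datalad run carry a machine-readable run record in their commit message, which is what makes this reconstruction possible. As a rough sketch, the record can be cut out of a commit message with standard tools. The helper name run_record is hypothetical, the sample message is an abbreviated illustration rather than real output from Rob's dataset, and the marker lines are assumed to match those DataLad writes.

```shell
# Hypothetical helper: print the JSON run record embedded in a
# `datalad run` commit message read from standard input. It assumes the
# record sits between the two marker lines matched below.
run_record() {
  awk '
    /^=== Do not change lines below ===$/ { inside = 1; next }
    /^\^\^\^ Do not change lines above \^\^\^$/ { inside = 0 }
    inside { print }
  '
}

# Abbreviated, illustrative commit message (not real output). In a dataset,
# one would pipe `git log -1 --format=%B <commit>` into the function.
message='[DATALAD RUNCMD] extract pimpernel

=== Do not change lines below ===
{
 "cmd": "convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg",
 "inputs": ["sources/flowers.jpg"],
 "outputs": ["pimpernel.jpg"]
}
^^^ Do not change lines above ^^^'

printf '%s\n' "$message" | run_record
```

The printed JSON names the executed command together with its declared inputs and outputs.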

He decides that one image manipulation for his art project will be to displace pixels of an image by a random amount to blur the image:

$ datalad run -m "blur image" \
  --input "st-bernard.jpg" \
  --output "st-bernard-displaced.jpg" \
  "convert -spread 10 st-bernard.jpg st-bernard-displaced.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -spread 10 st-bernard.jpg st-ber...]
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)

Because he is not completely satisfied with the first random pixel displacement, he decides to retry the operation. Because everything was wrapped in datalad run, he can rerun the command. Rerunning the command will produce a commit, because the displacement is random and the output file changes slightly from its previous version.

$ git log -1 --oneline HEAD
4086a10 [DATALAD RUNCMD] blur image

$ datalad rerun 4086a10f2b436bc20df4950d911df188b733ed46
[INFO] run commit 4086a10; (blur image) 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] Unlocking files 
unlock(ok): st-bernard-displaced.jpg (file)
[INFO] Recording unlocked state in git 
[INFO] Completed unlocking files 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -spread 10 st-bernard.jpg st-ber...]
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  run (ok: 1)
  save (ok: 1)
  unlock (ok: 1)

This blur also does not yet fulfill Rob's expectations, so he decides to discard the change, using standard Git tools[3].

$ git reset --hard HEAD~1
HEAD is now at 4086a10 [DATALAD RUNCMD] blur image

He knows that within a DataLad dataset, he can also rerun a range of commands with the --since flag, and even specify alternative starting points for rerunning them with the --onto flag. Every command from commits reachable from the specified checksum until --since (but not including --since) will be re-executed. For example, datalad rerun --since=HEAD~5 will re-execute any commands in the last five commits.

--onto indicates where to start rerunning the commands from. The default is HEAD, but anything other than HEAD will be checked out prior to execution, such that re-execution happens in a detached HEAD state, or on the new branch specified by the --branch flag.

If --since is an empty string, every command from the first commit that contains a recorded command is rerun. If --onto is an empty string, re-execution is performed on top of the parent of the first run commit in the revision list specified with --since. Setting both arguments to empty strings therefore means “rerun all commands with HEAD at the parent of the first commit with a command”. In other words, Rob can “replay” all the history of his artproject in a single command. Using the --branch option of datalad rerun (manual), he does it on a new branch he names replay:

$ datalad rerun --since= --onto= --branch=replay
[INFO] checkout commit 4816a3e; 
[INFO] run commit 92145d5; (extract st-bernar...) 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -extract 1522x1522+0+0 sources/f...]
add(ok): st-bernard.jpg (file)
save(ok): . (dataset)
[INFO] run commit bd6a92e; (extract pimpernel) 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -extract 1522x1522+1470+1470 sou...]
add(ok): pimpernel.jpg (file)
save(ok): . (dataset)
[INFO] run commit 4086a10; (blur image) 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
run(ok): /home/me/usecases/provenance/artproject (dataset) [convert -spread 10 st-bernard.jpg st-ber...]
add(ok): st-bernard-displaced.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  run (ok: 3)
  save (ok: 3)

Now he is on a new branch of his project, which contains “replayed” history.

$ git log --oneline --graph main replay
* f9a70df [DATALAD RUNCMD] blur image
* 445a54b [DATALAD RUNCMD] extract pimpernel
* f8a3d9f [DATALAD RUNCMD] extract st-bernard lily
| * 4086a10 [DATALAD RUNCMD] blur image
| * bd6a92e [DATALAD RUNCMD] extract pimpernel
| * 92145d5 [DATALAD RUNCMD] extract st-bernard lily
|/  
* 4816a3e Added flower mosaic from wikimedia
* 67e0c24 [DATALAD] new dataset
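To get a feel for which commits a given --since value covers, the range can be previewed with plain Git. The sketch below assumes that datalad rerun --since=X considers the commits in the range X..HEAD, in line with the description of --since earlier in this section. It builds a throwaway repository with made-up commit messages, and the function name commits_since is hypothetical.

```shell
# Hypothetical helper: list the subjects of all commits reachable from HEAD
# but not from the revision given as "$2", oldest first. "$1" is the
# repository path.
commits_since() {
  git -C "$1" log --reverse --format=%s "$2..HEAD"
}

# Throwaway repository with five empty commits (all names are made up).
demo_repo=$(mktemp -d)
git init --quiet "$demo_repo"
for n in 1 2 3 4 5; do
  git -C "$demo_repo" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "step $n"
done

# HEAD~3 itself ("step 2") is excluded; the three commits after it are listed.
commits_since "$demo_repo" HEAD~3
```

Only commits that actually contain a run record would be re-executed by datalad rerun. The plain Git listing shows every commit in the range.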

He can even compare the two branches:

$ datalad diff -t main -f replay
 modified: st-bernard-displaced.jpg (symlink)

He can see that the blurring, which involved a random element, produced different results. Because his dataset contains two branches, he can compare the two branches using normal Git operations. The next command, for example, marks which commits are “patch-equivalent” between the branches. Notice that all commits are marked as equivalent (=) except the ‘random spread’ ones.

$ git log --oneline --left-right --cherry-mark main...replay
> f9a70df [DATALAD RUNCMD] blur image
= 445a54b [DATALAD RUNCMD] extract pimpernel
= f8a3d9f [DATALAD RUNCMD] extract st-bernard lily
< 4086a10 [DATALAD RUNCMD] blur image
= bd6a92e [DATALAD RUNCMD] extract pimpernel
= 92145d5 [DATALAD RUNCMD] extract st-bernard lily
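To list only the commits that differ between the branches, the marked log can be filtered on its first column. This is a small sketch that uses the hypothetical helper name non_equivalent and takes the log output above as sample input:

```shell
# Hypothetical helper: keep only the lines of a
# `git log --left-right --cherry-mark` listing that are not marked "=",
# i.e. commits without a patch-equivalent counterpart on the other side.
non_equivalent() {
  awk '$1 != "="'
}

# Listing copied from the example above, so the sketch runs without a
# dataset. In a real dataset one would pipe the git log command into
# the function instead.
listing='> f9a70df [DATALAD RUNCMD] blur image
= 445a54b [DATALAD RUNCMD] extract pimpernel
= f8a3d9f [DATALAD RUNCMD] extract st-bernard lily
< 4086a10 [DATALAD RUNCMD] blur image
= bd6a92e [DATALAD RUNCMD] extract pimpernel
= 92145d5 [DATALAD RUNCMD] extract st-bernard lily'

printf '%s\n' "$listing" | non_equivalent
```

Git can also do this filtering itself: replacing --cherry-mark with --cherry-pick omits the equivalent commits from the log.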

Rob can continue processing images, and will turn in a successful art project. Long after he finishes high school, he finds his dataset on his old computer again and remembers this small project fondly.

Footnotes
