4.2. Where’s Waldo?

So far, you and your room mate have created a copy of the DataLad-101 dataset in a different place on the same file system by installing it from a path.

You have observed that the -r/--recursive option needs to be given to datalad get [-n/--no-data] (manual) in order to install further potential subdatasets in one go. Only then is the subdataset’s file content availability metadata present, which lets you explore the file hierarchy within the subdataset. Alternatively, datalad get -n <subds> installs exactly the specified registered subdataset.

And you have mesmerized your room mate by showing him how git-annex retrieved large file contents from the original dataset. Your room mate is excited by this magical command. You however begin to wonder: how does DataLad know where to look for that original content?

This information comes from git-annex. Before getting another PDF, let’s ask git-annex where the content of a file is stored:

$ # navigate back into the clone of DataLad-101
$ cd ../mock_user/DataLad-101
$ git annex whereis books/TLCL.pdf
whereis books/TLCL.pdf (1 copy)
  0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]

Oh, another cryptic character sequence. This time, however, it is not a symlink but an annex UUID. “That’s hard to read. What is it?” your room mate asks. After the UUID, you can recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer. “This”, you exclaim, excited about your own realization, “is the location of the dataset that I’m sharing with you!”
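To make the pieces of such a line concrete, the following sketch takes the location line from the output above apart with plain shell parameter expansion. The variable names are made up for illustration, and the layout (UUID, then ` -- `, then the location, then a remote name in square brackets) is only assumed from the output shown here, not taken from a format specification.

```shell
# One location line, copied from the whereis output above
line='0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]'

uuid=${line%% -- *}            # everything before the first " -- "
rest=${line#* -- }             # everything after it
location=${rest% \[*}          # strip the trailing " [remote]" part
remote=${rest##*\[}            # text after the last "["
remote=${remote%\]}            # without the closing "]"

user_host=${location%%:*}      # part before the first ":"
repo_path=${location#*:}       # part after the first ":"

printf 'UUID:      %s\n' "$uuid"
printf 'user@host: %s\n' "$user_host"
printf 'path:      %s\n' "$repo_path"
printf 'remote:    %s\n' "$remote"
```

The sketch assumes that the line ends in a bracketed remote name, as it does for [origin] here. A line without one would need an extra check.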

What is this location, and what if I provided a description?

Back in the very first section of the Basics, Create a dataset, a Find-out-more mentioned the --description option of datalad create (manual). With this option, you can provide a description of the dataset location.

The git annex whereis (manual) command, finally, is where such a description comes in handy: If you had created the dataset with

$ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101

the command would show course on DataLad-101 on my private laptop after the UUID, and thus a more human-readable description of where file content is stored. This becomes especially useful when the number of repository copies increases. If you have only one other dataset, it may be easy to remember what and where it is. But once you have one back-up of your dataset on a USB stick, one dataset shared with Dropbox, and a third one on your institution’s GitLab instance, you will be grateful for the descriptions you gave these locations.

The current report of the location of the dataset is in the format user@host:path.

If the physical location of a dataset is not relevant, ambiguous, or volatile, or if it has an annex that could move within the foreseeable lifetime of the dataset, a custom description with the relevant information on the dataset is superior. Otherwise, decide for yourself whether to use the --description option for future datasets, depending on what you find more readable: a self-made location description or the automatic user@host:path information.
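A description can also be attached after a dataset has been created, using the git annex describe command. The following is a minimal sketch under stated assumptions: it needs git and git-annex to be installed, it works in a throwaway repository rather than in DataLad-101, and the file name, identity, and description text are placeholders chosen for this example.

```shell
# Sketch only: requires git and git-annex; names and texts are placeholders.
if command -v git-annex >/dev/null 2>&1; then
    demo_dir=$(mktemp -d)      # throwaway repository, not DataLad-101
    (
        cd "$demo_dir" || exit 1
        git init --quiet
        # local identity so that commits work in this throwaway repository
        git config user.name "demo"
        git config user.email "demo@example.com"
        git annex init --quiet
        echo "some content" > book.pdf
        git annex add --quiet book.pdf
        git commit --quiet -m "add a file to the annex"

        # before: the location is reported as user@host:path
        git annex whereis book.pdf
        # attach a description to this repository ("here")
        git annex describe here "course on DataLad-101 on my private laptop"
        # after: the description is shown next to the UUID
        git annex whereis book.pdf
    )
else
    echo "git-annex not found, nothing to demonstrate"
fi
```

In an existing dataset, only the git annex describe line would be needed. Everything else in the sketch merely sets up a repository to try it in.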

The message further informs you that there is only “(1 copy)” of this file content. This makes sense: There is only your own, original DataLad-101 dataset in which this book is saved.

To retrieve file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content. It uses the UUID to identify these locations. Every copy of a dataset gets a UUID as a unique identifier. Note, however, that git-annex knowing a location where content once was does not guarantee that retrieval will work. If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including dataset content), and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations. As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content. Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system. If you shared the dataset with anyone without access to your file system, datalad get would not work, because it could not access your files.
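The idea that one reachable location is enough can be illustrated without git-annex at all. The sketch below is a conceptual stand-in, not git-annex’s actual implementation: three made-up locations are simulated with temporary directories, only one of which still holds the file, and a hypothetical try_locations function copies the content from the first location that can deliver it.

```shell
# Simulate three known locations; all names are invented for this example
workdir=$(mktemp -d)
mkdir -p "$workdir/usb_stick" "$workdir/wiped_drive" "$workdir/laptop"
rmdir "$workdir/usb_stick"                          # the USB stick is "unplugged"
echo "book content" > "$workdir/laptop/TLCL.pdf"    # only the laptop has the file

# Try each location in turn and stop at the first one that delivers the file
try_locations() {
    file=$1
    target=$2
    shift 2
    for location in "$@"; do
        if [ -r "$location/$file" ] && cp "$location/$file" "$target"; then
            echo "retrieved $file from $location"
            return 0
        fi
        echo "not available: $location"
    done
    echo "failed: no accessible location holds $file"
    return 1
}

try_locations TLCL.pdf "$workdir/retrieved.pdf" \
    "$workdir/usb_stick" "$workdir/wiped_drive" "$workdir/laptop"
```

The first two locations are reported as not available, and the file is then copied from the third. If none of the given locations held the file, the function would return a non-zero exit status, which mirrors the case in which retrieval fails.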

But there is one book that does not suffer from this restriction: bash_guide.pdf. This book was not manually downloaded and saved to the dataset with wget (which would have kept DataLad in the dark about where it came from), but was obtained with the datalad download-url (manual) command. This registered the book’s original source in the dataset, and here is why that is useful:

$ git annex whereis books/bash_guide.pdf
whereis books/bash_guide.pdf (2 copies)
  00000000-0000-0000-0000-000000000001 -- web
  0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]

  web: https://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf

Unlike the TLCL.pdf book, this book has two sources, and one of them is web. The line starting with web: specifies the precise URL you downloaded the file from. Thus, for this book, your room mate is always able to obtain it (as long as the URL remains valid), even if you deleted your DataLad-101 dataset.

We can also see which source git-annex retrieves the content from if we look at the very end of the get summary.

$ datalad get books/TLCL.pdf
get(ok): books/TLCL.pdf (file) [from origin...]
$ datalad get books/bash_guide.pdf
get(ok): books/bash_guide.pdf (file) [from origin...]

Both of these files were retrieved “from origin...”. Origin is Git terminology for the place a dataset was copied from. Here, origin is the original DataLad-101 dataset, from which file content can be retrieved very fast.

If your room mate did not have access to the same file system, or if you deleted your DataLad-101 dataset, this output would look different. The datalad get command would fail on the TLCL.pdf book, which has no known second source, and bash_guide.pdf would be retrieved “from web...”, the registered second source, its original download URL. Let’s see a retrieval from web in action for another file. The .mp3 files in the longnow seminar series have registered web URLs[1].

$ # navigate into the subdirectory
$ cd recordings/longnow
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (2 copies)
  00000000-0000-0000-0000-000000000001 -- web
  ✂UUID✂ -- mih@medusa:/tmp/seminars-on-longterm-thinking

  web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]

As you can see at the end of the get result, the file has been retrieved “from web...”. Quite useful, this provenance, right? Let’s add a note on the git annex whereis command. Again, do this in the original DataLad-101 directory, and do not forget to save it.

$ # navigate back:
$ cd ../../../../DataLad-101

$ # write the note
$ cat << EOT >> notes.txt
The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to
retrieve file content, those repositories will be queried.

EOT

$ datalad status
 modified: notes.txt (file)
$ datalad save -m "add note on git annex whereis"
add(ok): notes.txt (file)
save(ok): . (dataset)