4.2. Where’s Waldo?
So far, you and your room mate have created a copy of the DataLad-101 dataset on the same file system, but in a different place, by installing it from a path.
You have observed that the -r/--recursive option needs to be given to datalad get [-n/--no-data] (manual) in order to install further potential subdatasets in one go. Only then is the subdataset’s file content availability metadata present, which you need in order to explore the file hierarchy within the subdataset.
Alternatively, datalad get -n <subds> takes care of installing exactly the specified registered subdataset.
And you have mesmerized your room mate by showing him how git-annex retrieved large file contents from the original dataset. Your room mate is excited by this magical command. You, however, begin to wonder: how does DataLad know where to look for that original content?
This information comes from git-annex. Before getting another PDF, let’s ask git-annex where its content is stored:
$ # navigate back into the clone of DataLad-101
$ cd ../mock_user/DataLad-101
$ git annex whereis books/TLCL.pdf
whereis books/TLCL.pdf (1 copy)
0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
ok
Oh, another cryptic character sequence. This time, however, it is not a symlink but an annex UUID. “That’s hard to read. What is it?” your room mate asks. You recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer. “This”, you exclaim, excited about your own realization, “is the location I’m sharing my dataset from!”
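If you ever want to pick such a report apart in a script, a few lines of shell are enough. The following is a minimal sketch, not part of DataLad or git-annex: list_locations is a hypothetical helper name, and it assumes the report format shown above, where each location line reads "UUID -- description". The sample text is copied from the output above so that the snippet runs without a dataset.

```shell
# Sample report copied from the run above. In a real clone you would pipe
# `git annex whereis books/TLCL.pdf` into the helper instead.
whereis_sample='whereis books/TLCL.pdf (1 copy)
    0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
ok'

# list_locations is a hypothetical helper, not a git-annex command.
# It prints "UUID<TAB>description" for every line that contains " -- ".
list_locations() {
    awk '{
        i = index($0, " -- ")
        if (i > 0) {
            uuid = substr($0, 1, i - 1)
            sub(/^[ \t]+/, "", uuid)   # drop the indentation
            print uuid "\t" substr($0, i + 4)
        }
    }'
}

printf '%s\n' "$whereis_sample" | list_locations
```

Scraping human-readable text is fragile, though. For anything more serious, the --json option of git annex whereis is likely the better starting point.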
What is this location, and what if I provided a description?
Back in the very first section of the Basics, Create a dataset, a Find-out-more mentioned the --description option of datalad create (manual).
With this option, you can provide a description of the dataset location.
The git annex whereis (manual) command, finally, is where such a description comes in handy: If you had created the dataset with
$ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
the command would show course on DataLad-101 on my private laptop after the UUID, and thus a more human-readable description of where file content is stored.
This becomes especially useful when the number of repository copies increases. If you have only one other dataset, it may be easy to remember what and where it is. But once you have one back-up of your dataset on a USB stick, one dataset shared with Dropbox, and a third one on your institution’s GitLab instance, you will be grateful for the descriptions you provided these locations with.
The current report of the location of the dataset is in the format user@host:path.
If the physical location of a dataset is not relevant, ambiguous, or volatile, or if it has an annex that could move within the foreseeable lifetime of a dataset, a custom description with the relevant information on the dataset is superior. If this is not the case, decide for yourself whether you want to use the --description option for future datasets, depending on what you find more readable: a self-made location description, or the automatic user@host:path information.
The message further informs you that there is only “(1 copy)” of this file content.
This makes sense: There is only your own, original DataLad-101 dataset in which this book is saved.
To retrieve file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content.
It uses the UUID to identify these locations.
Every copy of a dataset will get a UUID as a unique identifier.
Note, however, that the fact that git-annex knows a location where content once was does not guarantee that retrieval will work.
If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including dataset content), and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations.
As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content.
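The “try the known locations until one works” idea can be pictured with a small toy model. This is a sketch under loud assumptions: fetch_from_first_available is a hypothetical helper, plain directories stand in for dataset locations, and nothing here is how git-annex is actually implemented (the real tool, for example, also weighs the cost of each remote). It only illustrates that one accessible copy is enough.

```shell
# fetch_from_first_available is a hypothetical helper for illustration only.
# Usage: fetch_from_first_available FILE DESTDIR LOCATION...
# It copies FILE from the first LOCATION that actually holds a readable copy.
fetch_from_first_available() {
    file=$1
    dest=$2
    shift 2
    for location in "$@"; do
        if [ -r "$location/$file" ]; then
            cp "$location/$file" "$dest/$file" && echo "from $location" && return 0
        fi
    done
    echo "no accessible location for $file" >&2
    return 1
}

# Throwaway directories play the roles from the text above:
#   usb_stick   - not plugged in (the directory does not exist)
#   wiped_drive - reachable, but its contents were deleted
#   origin      - reachable and still holding the file
demo=$(mktemp -d)
mkdir -p "$demo/wiped_drive" "$demo/origin" "$demo/clone"
echo "book content" > "$demo/origin/TLCL.pdf"

fetch_from_first_available TLCL.pdf "$demo/clone" \
    "$demo/usb_stick" "$demo/wiped_drive" "$demo/origin"
```

The first two locations are known but useless, and the copy still succeeds because the third one is accessible.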
Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system.
If you shared the dataset with anyone without access to your file system, datalad get would not work, because it could not access your files.
But there is one book that does not suffer from this restriction: bash_guide.pdf.
This book was not manually downloaded and saved to the dataset with wget (thus keeping DataLad in the dark about where it came from), but it was obtained with the datalad download-url (manual) command.
This registered the book’s original source in the dataset, and here is why that is useful:
$ git annex whereis books/bash_guide.pdf
whereis books/bash_guide.pdf (2 copies)
00000000-0000-0000-0000-000000000001 -- web
0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
web: https://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf
ok
Unlike the TLCL.pdf book, this book has two sources, and one of them is web.
The second to last line specifies the precise URL you downloaded the file from.
Thus, for this book, your room mate is always able to obtain it (as long as the URL remains valid), even if you deleted your DataLad-101 dataset.
The source that git-annex retrieves the content from is also reported at the very end of the get summary.
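Should you want to get at the registered URL itself, for example to check whether it is still valid, it can be filtered out of the report. A minimal sketch, assuming the report format shown above in which the URL follows a "web:" prefix; list_web_urls is a hypothetical helper name, and the sample text is copied from the output above so that the snippet runs on its own:

```shell
# Sample report copied from the run above. In a real clone you would pipe
# `git annex whereis books/bash_guide.pdf` into the helper instead.
whereis_sample='whereis books/bash_guide.pdf (2 copies)
    00000000-0000-0000-0000-000000000001 -- web
    0c450dc0-c48c-4057-a231-e2654b689600 -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
    web: https://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf
ok'

# list_web_urls is a hypothetical helper, not a git-annex command.
# It prints the URL from every line whose first word is "web:".
list_web_urls() {
    awk '$1 == "web:" { print $2 }'
}

url=$(printf '%s\n' "$whereis_sample" | list_web_urls)
echo "$url"
```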
$ datalad get books/TLCL.pdf
get(ok): books/TLCL.pdf (file) [from origin...]
$ datalad get books/bash_guide.pdf
get(ok): books/bash_guide.pdf (file) [from origin...]
Both of these files were retrieved “from origin...”.
Origin is Git terminology for “where the dataset was copied from”. Origin therefore is the original DataLad-101 dataset, from which file content can be retrieved very fast.
If your room mate did not have access to the same file system or you deleted your DataLad-101 dataset, this output would look different.
The datalad get command would fail on the TLCL.pdf book without a known second source, and bash_guide.pdf would be retrieved “from web...”, the registered second source, its original download URL.
Let’s see a retrieval from web in action for another file.
The .mp3 files in the longnow seminar series have registered web URLs[1].
$ # navigate into the subdirectory
$ cd recordings/longnow
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (2 copies)
00000000-0000-0000-0000-000000000001 -- web
✂UUID✂ -- mih@medusa:/tmp/seminars-on-longterm-thinking
web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
ok
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]
As you can see at the end of the get result, the file has been retrieved “from web...”.
Quite useful, this provenance, right?
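When many files are retrieved at once, it can be handy to condense the result lines into a short “source: file” overview. The following is a minimal sketch, assuming result lines of exactly the form shown in the outputs above (this format may differ between DataLad versions); summarize_sources is a hypothetical helper name, and the sample lines are copied from the runs above:

```shell
# Result lines copied from the runs above. In a real dataset you would pipe
# the output of `datalad get ...` into the helper instead.
get_sample='get(ok): books/TLCL.pdf (file) [from origin...]
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]'

# summarize_sources is a hypothetical helper, not a DataLad command.
# It rewrites every successful result line as "SOURCE: PATH" and
# silently skips lines that do not match the expected pattern.
summarize_sources() {
    sed -n 's/^get(ok): \(.*\) (file) \[from \(.*\)\.\.\.]$/\2: \1/p'
}

printf '%s\n' "$get_sample" | summarize_sources
```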
Let’s add a note on the git annex whereis command.
Again, do this in the original DataLad-101 directory, and do not forget to save it.
$ # navigate back:
$ cd ../../../../DataLad-101
$ # write the note
$ cat << EOT >> notes.txt
The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to
retrieve file content, those repositories will be queried.
EOT
$ datalad status
modified: notes.txt (file)
$ datalad save -m "add note on git annex whereis"
add(ok): notes.txt (file)
save(ok): . (dataset)
Footnotes