4.2. Where’s Waldo?
So far, you and your room mate have created a copy of the
DataLad-101 dataset on the same file system, but in a different place,
by installing it from a path.
You have observed that the -r/--recursive
option needs to be given to datalad get [-n/--no-data]
in order to install further potential subdatasets in one go. Only then
is the subdataset's file content availability metadata present to explore
the file hierarchy available within the subdataset.
Alternatively, a datalad get -n <subds> takes care of installing
exactly the specified registered subdataset.
And you have mesmerized your room mate by showing him how git-annex retrieved large file contents from the original dataset.
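The two ways of installing subdatasets recapped above can be sketched as a small shell helper. This is only an illustration: the function name install_subdatasets is hypothetical, and it assumes that -r is the short form of the recursive option of datalad get and that it is run from within the superdataset.

```shell
# Hypothetical helper (not part of DataLad): install subdatasets
# without retrieving any file content.
install_subdatasets() {
    if [ -n "$1" ]; then
        # install exactly the one registered subdataset given as $1
        datalad get -n "$1"
    else
        # assumption: -n -r on the current directory installs all
        # registered subdatasets recursively, without file content
        datalad get -n -r .
    fi
}

# Example call (left commented out so nothing is installed by accident):
# install_subdatasets recordings/longnow
```

Calling the helper without an argument covers the "in one go" case, while passing a path restricts the installation to that single subdataset.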
Let’s now see the git annex whereis command in more detail,
and find out how git-annex knows where file content can be obtained from.
Within the original
DataLad-101 dataset, you retrieved some of the
files via datalad get, but not others. How will this influence the
output of git annex whereis, you wonder?
Together with your room mate, you decide to find out. You navigate back into the installed dataset, and run git annex whereis on a file that you once retrieved file content for, and on a file that you did not yet retrieve file content for. Here is the output for the retrieved file:
# navigate back into the clone of DataLad-101
$ cd ../mock_user/DataLad-101
# navigate into the subdirectory
$ cd recordings/longnow
# file content exists in original DataLad-101 for this file
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (3 copies)
    00000000-0000-0000-0000-000000000001 -- web
    da3bf937-5bd2-43ea-a07b-bcbe71f3b875 -- mih@medusa:/tmp/seminars-on-longterm-thinking
    eb47bf12-2366-495d-80a6-2861eb665f06 -- me@muninn:~/dl-101/DataLad-101/recordings/longnow [origin]

  web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
ok
And here is the output for a file that you did not yet retrieve
content for in your original DataLad-101 dataset:
# but not for this:
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (2 copies)
    00000000-0000-0000-0000-000000000001 -- web
    da3bf937-5bd2-43ea-a07b-bcbe71f3b875 -- mih@medusa:/tmp/seminars-on-longterm-thinking

  web: http://podcast.longnow.org/salt/redirect/salt-020050114-carse-podcast.mp3
ok
As you can see, the file content previously downloaded with a datalad get has a third source, your original dataset on your computer. The file we did not yet retrieve in the original dataset has only two sources.
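To compare the sources of several files without reading the full listing, the human-readable whereis output can be filtered. The following is a minimal sketch: the function name list_sources is made up for illustration, the sample text is copied from the output of the retrieved file, and the assumption is that every copy is reported on a line of the form "UUID -- description". Other git-annex versions may format this differently.

```shell
# Sample whereis output, copied from the listing for the retrieved file.
whereis_output='whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (3 copies)
    00000000-0000-0000-0000-000000000001 -- web
    da3bf937-5bd2-43ea-a07b-bcbe71f3b875 -- mih@medusa:/tmp/seminars-on-longterm-thinking
    eb47bf12-2366-495d-80a6-2861eb665f06 -- me@muninn:~/dl-101/DataLad-101/recordings/longnow [origin]

  web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
ok'

# list_sources (hypothetical helper): read whereis text on stdin and
# print the description of every copy, one per line. A line counts as
# a copy if it starts with a UUID followed by " -- ".
list_sources() {
    sed -n -E 's/^[[:space:]]*[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12} -- (.*)$/\2/p'
}

# Print all known sources of the file.
printf '%s\n' "$whereis_output" | list_sources

# Check whether the dataset this clone was installed from holds the content.
if printf '%s\n' "$whereis_output" | list_sources | grep -q '\[origin\]$'; then
    echo "content is available from origin"
fi
```

In a real dataset, the sample variable would be replaced by piping the command itself into the helper, as in git annex whereis PATH | list_sources.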
Let’s see how this affects a datalad get:
# get the first file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]

# get the second file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (file) [from web...]
The most important thing to note is: It worked in both cases, regardless of whether the original
DataLad-101 dataset contained the file content or not.
We can see that git-annex used two different sources to retrieve the content from,
though, if we look at the very end of the output of each command.
The first file was retrieved “from origin...”.
Origin is Git terminology
for “from where the dataset was copied from”;
origin therefore is the original DataLad-101 dataset.
The second file was retrieved “
from web...”, and thus from a different source.
This source is called
web because it actually is a URL through which this particular
podcast episode is made available in the first place. You might also have noticed that the
download from web took longer than the retrieval from the directory on the same
file system. But we will get into the details
of this type of content source later.
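The source that served a file can also be read off programmatically from the end of a result line. The sketch below assumes that the result line ends in a suffix of the form "[from NAME...]", as in the output shown above; the function name retrieval_source is hypothetical, and the sample line is copied from the second datalad get call.

```shell
# retrieval_source (hypothetical helper): read a datalad get result line
# on stdin and print the name found in a trailing "[from NAME...]" suffix.
# Nothing is printed if the line carries no such suffix.
retrieval_source() {
    sed -n 's/.*\[from \(.*\)\.\.\.]$/\1/p'
}

# Result line copied from the second datalad get call.
result='get(ok): Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (file) [from web...]'

printf '%s\n' "$result" | retrieval_source
```

A line ending in "[from origin...]" would yield origin in the same way, which makes it easy to tell a fast local retrieval from a download.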
Let’s for now add a note on the git annex whereis command. Again, do
this in the original
DataLad-101 directory, and do not forget to save it.
# navigate back:
$ cd ../../../../DataLad-101
# write the note
$ cat << EOT >> notes.txt
The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to
retrieve file content, those repositories will be queried.

EOT

$ datalad status
modified: notes.txt (file)

$ datalad save -m "add note on git annex whereis"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
Maybe you wonder what the location
mih@medusa is. It is a copy of the data on an account belonging to user
mih on the host
medusa. Because we have neither the host's address nor log-in credentials for this user, we cannot retrieve content from this location. However, somebody else (for example the user