4.2. Where’s Waldo?
So far, you and your room mate have created a copy of the DataLad-101
dataset in a different place on the same file system by installing
it from a path.
You have observed that the -r/--recursive option needs to be given to
datalad get [-n/--no-data] (manual) in order to install further potential
subdatasets in one go. Only then is the subdataset’s file content availability
metadata present, which is what lets you explore the file hierarchy within the
subdataset.
Alternatively, a datalad get -n <subds> takes care of installing
exactly the specified registered subdataset.
And you have mesmerized your room mate by showing him how git-annex retrieved large file contents from the original dataset.
Let’s now see the git annex whereis (manual) command in more detail,
and find out how git-annex knows where file content can be obtained from.
Within the original DataLad-101 dataset, you retrieved some of the .mp3
files via datalad get, but not others. How will this influence the
output of git annex whereis, you wonder?
Together with your room mate, you decide to find out. You navigate
back into the installed dataset, and run git annex whereis on a
file that you once retrieved file content for, and on a file
that you did not yet retrieve file content for.
Here is the output for the retrieved file:
# navigate back into the clone of DataLad-101
$ cd ../mock_user/DataLad-101
# navigate into the subdirectory
$ cd recordings/longnow
# file content exists in original DataLad-101 for this file
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (2 copies)
00000000-0000-0000-0000-000000000001 -- web
✂UUID✂ -- mih@medusa:/tmp/seminars-on-longterm-thinking
web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
ok
And here is the output for a file that you did not yet retrieve
content for in your original DataLad-101 dataset.
# but not for this:
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (2 copies)
00000000-0000-0000-0000-000000000001 -- web
✂UUID✂ -- mih@medusa:/tmp/seminars-on-longterm-thinking
web: http://podcast.longnow.org/salt/redirect/salt-020050114-carse-podcast.mp3
ok
As you can see, the file content previously downloaded with a
datalad get has a third source, your original dataset on your computer.
The file we did not yet retrieve in the original dataset
has only two sources.
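To compare the sources of several files at a glance, it can help to reduce the whereis output to just the repository descriptions. The following is a minimal sketch under assumptions: list_sources is a hypothetical helper (not part of git-annex or DataLad), and it relies on repository lines having the form "<UUID> -- <description>", as in the output shown above. The sample input is copied from that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a git-annex or DataLad command):
# read "git annex whereis" output on stdin and print one
# repository description per line.
list_sources() {
    # Repository lines look like "<UUID> -- <description>".
    # Splitting on " -- " leaves the description in field 2;
    # all other lines have only one field and are skipped.
    awk -F' -- ' 'NF >= 2 { print $2 }'
}

# Sample input: the whereis output for the first file, as shown above.
sample_output='whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (2 copies)
00000000-0000-0000-0000-000000000001 -- web
✂UUID✂ -- mih@medusa:/tmp/seminars-on-longterm-thinking
web: http://podcast.longnow.org/salt/redirect/salt-020031114-eno-podcast.mp3
ok'

# Print the sources found in the sample output
printf '%s\n' "$sample_output" | list_sources
```

For the sample input, this prints the two descriptions web and mih@medusa:/tmp/seminars-on-longterm-thinking. Inside a dataset, the same helper could be fed directly, for example with git annex whereis PATH | list_sources.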
Let’s see how this affects a datalad get:
# get the first file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]
# get the second file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (file) [from web...]
The most important thing to note is: It worked in both cases, regardless of whether the original
DataLad-101 dataset contained the file content or not.
We can see that git-annex used two different sources to retrieve the content from,
though, if we look at the very end of the get summary.
The first file was retrieved “from origin...”. Origin is Git terminology
for “the place the dataset was copied from”; here, origin is the
original DataLad-101 dataset.
The second file was retrieved “from web...”, and thus from a different source.
This source is called web because it actually is a URL through which this particular
podcast episode is made available in the first place. You might also have noticed that the
download from web took longer than the retrieval from the directory on the same
file system. But we will get into the details of this type of content source
once we cover the importfeed and add-url functions [1].
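When many files are retrieved at once, reading the source off the end of every summary line by eye gets tedious. Below is a small sketch, assuming that each result line ends in "[from <source>...]" as in the summaries above; get_source is a hypothetical helper name, not a DataLad command, and the sample line is copied from the output shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a DataLad command): read "datalad get"
# result lines on stdin and print the source named at the end of
# each line. Assumes the "[from <source>...]" suffix shown above.
get_source() {
    # Capture the text between "[from " and the trailing "...]";
    # lines without that suffix produce no output.
    sed -n 's/.*\[from \(.*\)\.\.\.]$/\1/p'
}

# Sample input: the result line for the second file, as shown above.
result_line='get(ok): Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3 (file) [from web...]'

# Print the source named in the sample line
printf '%s\n' "$result_line" | get_source
```

For the sample line this prints web; a line ending in "[from origin...]" would yield origin instead.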
Let’s for now add a note on the git annex whereis command. Again, do
this in the original DataLad-101 directory, and do not forget to save it.
# navigate back:
$ cd ../../../../DataLad-101
# write the note
$ cat << EOT >> notes.txt
The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to
retrieve file content, those repositories will be queried.
EOT
$ datalad status
modified: notes.txt (file)
$ datalad save -m "add note on git annex whereis"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
Footnotes
[1] Maybe you wonder what the location mih@medusa is. It is a copy of the data on an account belonging to the user mih on the host named medusa. Because we have neither the host’s address nor log-in credentials for this user, we cannot retrieve content from this location. However, somebody else (for example the user mih) could.