4.1. Looking without touching
Only now, several weeks into the DataLad-101 course, does your room mate realize that he has enrolled in the course as well, but has not yet attended at all. “Oh man, can you help me catch up?” he asks you one day. “Just sharing your notes would already be really cool for a start!”
“Sure thing”, you say, and decide that it’s probably best if he gets all of the DataLad-101 course dataset. Sharing datasets was something you wanted to look into soon, anyway.
This is one exciting aspect of DataLad datasets that has so far been missing from this course: How does one share a dataset? In this section, we will cover the simplest way of sharing a dataset: on a local or shared file system, via an installation with a path as a source.
More on public data sharing
Interested in sharing datasets publicly? Read this chapter to get a feel for all relevant basic concepts of sharing datasets. Afterwards, head over to chapter Third party infrastructure to find out how to share a dataset on third-party infrastructure.
In this scenario, multiple people can access the very same files at the same time, often on the same machine (e.g., a shared workstation, or a server that people can “SSH” into). You might think: “What do I need DataLad for, if everyone can already access everything?” However, universal, unrestricted access can easily lead to chaos. DataLad can help facilitate collaboration without requiring ultimate trust in the reliability of all participants. Essentially, with a shared dataset, collaborators can look at and use your dataset without ever touching it.
To demonstrate how to share a DataLad dataset on a common file system, we will pretend that your personal computer can be accessed by other users. Let’s say that your room mate has access, and you’re making sure that there is a DataLad-101 dataset in a different place on the file system for him to access and work with. This is indeed a common real-world use case: two users on a shared file system sharing a dataset with each other. But as we cannot easily simulate a second user in this handbook, for now, you will have to share your dataset with yourself.
This endeavor serves several purposes: For one, you will experience a very easy way of sharing a dataset. Secondly, it will show you how a dataset can be obtained from a path (instead of a URL, as shown in the section Install datasets). Thirdly, DataLad-101 is a dataset that can already showcase many different properties of a dataset, but it will be an additional learning experience to see how the different parts of the dataset – text files, larger files, a subdataset, datalad run commands – appear upon installation when shared. And lastly, you will likely “share a dataset with yourself” whenever you use a particular dataset of your own creation as input for one or more projects.
“Awesome!” exclaims your room mate as you take out your laptop to share the dataset. “You’re really saving my ass here. I’ll make up for it when we prepare for the final”, he promises.
To install DataLad-101 into a different part of your file system, navigate out of DataLad-101, and – for simplicity – create a new directory, mock_user, right next to it:
$ cd ../
$ mkdir mock_user
For simplicity, pretend that this is a second user’s – your room mate’s – home directory. Furthermore, let’s disregard anything about permissions for now. In a real-world example you likely would not be able to read from and write to a different user’s directories, but we will talk about permissions later.
After creation, navigate into mock_user and install the dataset DataLad-101. To do this, use datalad clone and provide a path to your original dataset. Here is what it looks like:
$ cd mock_user
$ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101
[INFO] Cloning dataset to Dataset(/home/me/dl-101/mock_user/DataLad-101)
[INFO] Attempting to clone from ../DataLad-101 to /home/me/dl-101/mock_user/DataLad-101
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/mock_user/DataLad-101)
install(ok): /home/me/dl-101/mock_user/DataLad-101 (dataset)
This will install your dataset DataLad-101 into your room mate’s home directory. Note that we have given this new dataset a description about its location as well. Note further that we have not provided the optional destination path to datalad clone, and hence it installed the dataset under its original name in the current directory.
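If you preferred a different directory name, datalad clone also accepts an optional destination path as a second argument. A minimal sketch – the target name DataLad-101-copy is made up for illustration:

$ datalad clone ../DataLad-101 DataLad-101-copy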
Together with your room mate, you go ahead and see what this dataset looks like. Before running the command, try to predict what you will see.
$ cd DataLad-101
$ tree
.
├── books
│ ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
│ ├── byte-of-python.pdf -> ../.git/annex/objects/P5/qK/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf
│ ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│ └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│ └── list_titles.sh
├── notes.txt
└── recordings
├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg
├── longnow
├── podcasts.tsv
└── salt_logo_small.jpg -> ../.git/annex/objects/fZ/wg/MD5E-s76402--87da732ff6d9a92c6afcaed7fefb133f.jpg/MD5E-s76402--87da732ff6d9a92c6afcaed7fefb133f.jpg
4 directories, 9 files
There are a number of interesting things, and your room mate is the first to notice them:
“Hey, can you explain some things to me?”, he asks. “This directory here, “longnow”, why is it empty?” True, the subdataset has a directory name, but apart from this, the longnow directory appears empty.
“Also, why do the PDFs in books/ and the .jpg files appear so weird? They have this cryptic path right next to them, and look, if I try to open one of them, it fails! Did something go wrong when we installed the dataset?” he worries.
Indeed, at first sight the PDFs and pictures appear just as they did in the original dataset: They are symlinks pointing to some location in the object tree. To reassure your room mate that everything is fine, you quickly explain to him the concept of a symlink and the object tree of git-annex.
“But why does the PDF not open when I try to open it?” he repeats.
True, these files cannot be opened. This mimics our experience when installing the longnow subdataset: Right after installation, the .mp3 files could not be opened either, because their file content was not yet retrieved. You begin to explain to your room mate how DataLad retrieves only minimal metadata about which files actually exist in a dataset upon a datalad clone. “It’s really handy”, you tell him. “This way you can decide which book you want to read, and then retrieve what you need. Everything that is annexed is retrieved on demand. Note though that the text files’ contents are present, and those files can be opened – this is because they are stored in Git. So you already have my notes, and you can decide for yourself whether you want to get the books.”
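You can see this for yourself: listing one of the books shows nothing but a symlink into the object tree, whose target does not exist yet because the content has not been retrieved (output shortened and illustrative):

$ ls -l books/TLCL.pdf
lrwxrwxrwx [...] books/TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf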
To demonstrate this, you decide to examine the PDFs further. “Try to get one of the books”, you instruct your room mate:
$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from origin...]
“Opening this file will work, because the content was retrieved from the original dataset”, you explain, proud that this worked just as you thought it would. Your room mate is excited by this magical command. You, however, begin to wonder: how does DataLad know where to look for that original content?
This information comes from git-annex. Before getting the next PDF, let’s query git-annex where its content is stored:
$ git annex whereis books/TLCL.pdf
whereis books/TLCL.pdf (1 copy)
REDACTED-UUID -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
ok
Oh, another cryptic identifier – this time, however, not in a symlink… “That’s hard to read – what is it?” your room mate asks. You can recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer. “This”, you exclaim, excited about your own realization, “is the location of my dataset – the place I’m sharing it from!”
What is this location, and what if I provided a description?
Back in the very first section of the Basics, Create a dataset, a hidden section mentioned the --description option of datalad create. With this option, you can provide a description about the location of your dataset. The git annex whereis command, finally, is where such a description comes in handy: If you had created the dataset with
$ datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101
the command would show course on DataLad-101 on my private Laptop after the UUID – and thus a more human-readable description of where file content is stored.
This becomes especially useful when the number of repository copies increases. If you have only one other dataset, it may be easy to remember what and where it is. But once you have one backup of your dataset on a USB stick, one dataset shared via Dropbox, and a third one on your institution’s GitLab instance, you will be grateful for the descriptions you provided these locations with.
The current report of the location of the dataset is in the format user@host:path. As one of the computers this book is being built on is called “muninn” and its user “me”, it could look like this: me@muninn:~/dl-101/DataLad-101.
If the physical location of a dataset is not relevant, is ambiguous or volatile, or if the dataset has an annex that could move within its foreseeable lifetime, a custom description with the relevant information on the dataset is superior. If this is not the case, decide for yourself whether you want to use the --description option for future datasets, depending on what you find more readable – a self-made location description, or the automatic user@host:path information.
The message further informs you that there is only “(1 copy)” of this file content. This makes sense: There is only your own, original DataLad-101 dataset in which this book is saved.
To retrieve the file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content. It identifies these locations by their unique IDs: every copy of a dataset gets such a unique ID (the UUID you saw in the whereis output above).
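If you are curious, you can look these identifiers up yourself. A quick sketch, assuming a standard git-annex setup:

# the UUID of the dataset copy you are currently in
$ git config annex.uuid
# all repositories git-annex knows about, with their UUIDs and descriptions
$ git annex info --fast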
Note, however, that just because git-annex knows a certain location where content once was, this does not guarantee that retrieval will work: If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including dataset content), and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations. As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content.
Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system. If you shared the dataset with anyone without access to your file system, datalad get would not work, because it cannot access your files.
But there is one book that does not suffer from this restriction: bash_guide.pdf. This book was not manually downloaded and saved to the dataset with wget (which would have kept DataLad in the dark about where it came from), but obtained with the datalad download-url command. This registered the book’s original source in the dataset, and here is why that is useful:
$ git annex whereis books/bash_guide.pdf
whereis books/bash_guide.pdf (2 copies)
00000000-0000-0000-0000-000000000001 -- web
REDACTED-UUID -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
web: http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf
ok
Unlike the TLCL.pdf book, this book has two sources, and one of them is web. The second-to-last line specifies the precise URL you downloaded the file from. Thus, for this book, your room mate is always able to obtain it (as long as the URL remains valid), even if you deleted your DataLad-101 dataset. Quite useful, this provenance, right?
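For reference, this is the kind of command that put the provenance record in place when the dataset was originally populated – a sketch; the commit message is illustrative:

$ datalad download-url http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf --dataset . -m "add beginners guide on bash" -O books/bash_guide.pdf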
Let’s now turn to the fact that the subdataset longnow contains neither file content nor the file metadata information needed to explore the contents of the dataset: there are no subdirectories or files under recordings/longnow/ at all. This is behavior that you have not observed until now.
To fix this and obtain file availability metadata, you have to run a somewhat unexpected command:
$ datalad get -n recordings/longnow
[INFO] Cloning dataset to Dataset(/home/me/dl-101/mock_user/DataLad-101/recordings/longnow)
[INFO] Attempting to clone from https://github.com/datalad-datasets/longnow-podcasts.git to /home/me/dl-101/mock_user/DataLad-101/recordings/longnow
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/mock_user/DataLad-101/recordings/longnow)
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] https://github.com/datalad-datasets/longnow-podcasts.git/config download failed: Not Found
install(ok): /home/me/dl-101/mock_user/DataLad-101/recordings/longnow (dataset) [Installed subdataset in order to get /home/me/dl-101/mock_user/DataLad-101/recordings/longnow]
The section below will elaborate on datalad get and the -n/--no-data option, but for now, let’s first see what has changed after running the above command (excerpt):
$ tree
.
├── books
│ ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
│ ├── byte-of-python.pdf -> ../.git/annex/objects/P5/qK/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf
│ ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│ └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│ └── list_titles.sh
├── notes.txt
└── recordings
├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg
├── longnow
│ ├── Long_Now__Conversations_at_The_Interval
│ │ ├── 2017_06_09__How_Digital_Memory_Is_Shaping_Our_Future__Abby_Smith_Rumsey.mp3 -> ../.git/annex/objects/8j/kQ/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3
│ │ ├── 2017_06_09__Pace_Layers_Thinking__Stewart_Brand__Paul_Saffo.mp3 -> ../.git/annex/objects/Qk/9M/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3
│ │ ├── 2017_06_09__Proof__The_Science_of_Booze__Adam_Rogers.mp3 -> ../.git/annex/objects/FP/96/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3
│ │ ├── 2017_06_09__Seveneves_at_The_Interval__Neal_Stephenson.mp3 -> ../.git/annex/objects/Wf/5Q/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3
│ │ ├── 2017_06_09__Talking_with_Robots_about_Architecture__Jeffrey_McGrew.mp3 -> ../.git/annex/objects/Fj/9V/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3
│ │ ├── 2017_06_09__The_Red_Planet_for_Real__Andy_Weir.mp3 -> ../.git/annex/objects/xq/Q3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3
Interesting! The file metadata information is now present, and we can explore the file hierarchy. The file content, however, is not present yet.
What has happened here?
When DataLad installs a dataset, it will by default only obtain the superdataset, and not any subdatasets. The superdataset contains the information that a subdataset exists though – the subdataset is registered in the superdataset. This is why the subdataset name exists as a directory. A subsequent datalad get -n path/to/longnow will install the registered subdataset again, just as we did in the example above.
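If you want to check which subdatasets are registered in a superdataset, and whether they are installed, the datalad subdatasets command reports them – a sketch with illustrative output:

$ datalad subdatasets
subdataset(ok): recordings/longnow (dataset)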
But what about the -n option for datalad get? Previously, we used datalad get to get file content. However, get can operate on more than just the level of files or directories – it can also operate on the level of datasets. Regardless of whether it is given a single file (such as books/TLCL.pdf) or a registered subdataset (such as recordings/longnow), get will operate on it to 1) install it – if it is a not-yet-installed subdataset – and 2) retrieve the contents of any files. That makes it very easy to get your file content, regardless of how your dataset may be structured – it is always the same command, and DataLad blurs the boundaries between superdatasets and subdatasets.
In the above example, we called datalad get with the option -n/--no-data. This option prevents get from obtaining the data of individual files or directories, thus limiting its scope to the level of datasets: only a datalad clone is performed. Without this option, the command would have retrieved all of the subdataset’s contents right away. But with -n/--no-data, it only installed the subdataset to retrieve the metadata about file availability.
To explicitly install all potential subdatasets recursively, that is, all of the subdatasets inside a subdataset as well, one can give the -r/--recursive option to get:
datalad get -n -r <subds>
This would install the subds subdataset and all potential further subdatasets inside of it, and the metadata about file hierarchies would be available right away for every subdataset inside of subds. If you had several subdatasets and provided not a path to a single dataset but, say, the current directory (. as in datalad get -n -r .), it would clone all registered subdatasets recursively.
So why is a recursive get not the default behavior? In Dataset nesting we learned that datasets can be nested arbitrarily deep. Upon getting the metadata of one dataset, you might not want to also install a few dozen levels of nested subdatasets right away.
However, there is a middle way [1]: The --recursion-limit option lets you specify how many levels of subdatasets should be installed together with the first subdataset:
datalad get -n -r --recursion-limit 1 <subds>
To summarize what you learned in this section, write a note on how to install a dataset using a path as a source on a common file system. Write this note in “your own” (the original) DataLad-101 dataset, though!
# navigate back into the original dataset
$ cd ../../DataLad-101
# write the note
$ cat << EOT >> notes.txt
A source to install a dataset from can also be a path, for example as
in "datalad clone ../DataLad-101".
Just as in creating datasets, you can add a description on the
location of the new dataset clone with the -D/--description option.
Note that subdatasets will not be installed by default, but are only
registered in the superdataset -- you will have to do a
"datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
availability meta data. The -n/--no-data option prevents file
contents from also being downloaded.
Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer way to
proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"
EOT
Save this note.
$ datalad save -m "add note about cloning from paths and recursive datalad get"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
Get a clone
A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.
Footnotes
[1] Another alternative to a recursion limit for datalad get -n -r is a dataset configuration that specifies subdatasets that should not be cloned recursively unless explicitly given to the command with a path. With this configuration, a superdataset’s maintainer can safeguard users and prevent potentially large numbers of subdatasets from being cloned. You can learn more about this configuration in the section More on DIY configurations.