4.1. Looking without touching

Only now, several weeks into the DataLad-101 course, does your room mate realize that he has enrolled in the course as well, but has not yet attended at all. “Oh man, can you help me catch up?” he asks you one day. “Sharing just your notes would be really cool for a start already!”

“Sure thing”, you say, and decide that it’s probably best if he gets all of the DataLad-101 course dataset. Sharing datasets was something you wanted to look into soon, anyway.

This is one exciting aspect of DataLad datasets that has so far been missing from this course: How does one share a dataset? In this section, we will cover the simplest way of sharing a dataset: on a local or shared file system, via an installation with a path as a source.

Interested in sharing datasets publicly? Read this chapter to get a feel for all relevant basic concepts of sharing datasets. Afterwards, head over to chapter Third party infrastructure to find out how to share a dataset on third-party infrastructure.

In this scenario, multiple people can access the very same files at the same time, often on the same machine (e.g., a shared workstation, or a server that people can “SSH” into). You might think: “What do I need DataLad for, if everyone can already access everything?” However, universal, unrestricted access can easily lead to chaos. DataLad can help facilitate collaboration without requiring ultimate trust and reliability of all participants. Essentially, with a shared dataset, collaborators can look at and use your dataset without ever touching it.

To demonstrate how to share a DataLad dataset on a common file system, we will pretend that your personal computer can be accessed by other users. Let’s say that your room mate has access, and you’re making sure that there is a DataLad-101 dataset in a different place on the file system for him to access and work with.

This is indeed a common real-world use case: Two users on a shared file system sharing a dataset with each other. But as we cannot easily simulate a second user in this handbook, for now, you will have to share your dataset with yourself. This endeavor serves several purposes: For one, you will experience a very easy way of sharing a dataset. Secondly, it will show you how a dataset can be obtained from a path (instead of a URL as shown in the section Install datasets). Thirdly, DataLad-101 is a dataset that can showcase many different properties of a dataset already, but it will be an additional learning experience to see how the different parts of the dataset – text files, larger files, a subdataset, datalad run commands – will appear upon installation when shared. And lastly, you will likely “share a dataset with yourself” whenever you use a particular dataset of your own creation as input for one or more projects.

“Awesome!” exclaims your room mate as you take out your laptop to share the dataset. “You’re really saving my ass here. I’ll make up for it when we prepare for the final”, he promises.

To install DataLad-101 into a different part of your file system, navigate out of DataLad-101, and – for simplicity – create a new directory, mock_user, right next to it:

$ cd ../
$ mkdir mock_user

For simplicity, pretend that this is a second user’s – your room mate’s – home directory. Furthermore, let’s for now disregard anything about permissions. In a real-world example you likely would not be able to read and write to a different user’s directories, but we will talk about permissions later.

After creation, navigate into mock_user and install the dataset DataLad-101. To do this, use datalad clone, and provide a path to your original dataset. Here is what it looks like:

$ cd mock_user
$ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101
[INFO] Cloning dataset to Dataset(/home/me/dl-101/mock_user/DataLad-101)
[INFO] Attempting to clone from ../DataLad-101 to /home/me/dl-101/mock_user/DataLad-101
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/mock_user/DataLad-101)
install(ok): /home/me/dl-101/mock_user/DataLad-101 (dataset)

This will install your dataset DataLad-101 into your room mate’s home directory. Note that we have given this new dataset a description about its location as well. Note further that we have not provided the optional destination path to datalad clone, and hence it installed the dataset under its original name in the current directory.

Together with your room mate, you go ahead and see what this dataset looks like. Before running the command, try to predict what you will see.

$ cd DataLad-101
$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/P5/qK/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg
    ├── longnow
    ├── podcasts.tsv
    └── salt_logo_small.jpg -> ../.git/annex/objects/fZ/wg/MD5E-s76402--87da732ff6d9a92c6afcaed7fefb133f.jpg/MD5E-s76402--87da732ff6d9a92c6afcaed7fefb133f.jpg

4 directories, 9 files

There are a number of interesting things, and your room mate is the first to notice them:

“Hey, can you explain some things to me?”, he asks. “This directory here, “longnow”, why is it empty?” True, the subdataset has a directory name but apart from this, the longnow directory appears empty.

“Also, why do the PDFs in books/ and the .jpg files appear so weird? They have this cryptic path right next to them, and look, if I try to open one of them, it fails! Did something go wrong when we installed the dataset?” he worries. Indeed, the PDFs and pictures appear just as they did in the original dataset at first sight: They are symlinks pointing to some location in the object tree. To reassure your room mate that everything is fine, you quickly explain to him the concept of a symlink and the object tree of git-annex.

“But why does the PDF not open when I try to open it?” he repeats. True, these files cannot be opened. This mimics our experience when installing the longnow subdataset: Right after installation, the .mp3 files also could not be opened, because their file content was not yet retrieved. You begin to explain to your room mate how DataLad retrieves only minimal metadata about which files actually exist in a dataset upon a datalad clone. “It’s really handy”, you tell him. “This way you can decide which book you want to read, and then retrieve what you need. Everything that is annexed is retrieved on demand. Note though that the text files’ contents are present, and the files can be opened – this is because these files are stored in Git. So you already have my notes, and you can decide for yourself whether you want to get the books.”
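If you want to check which annexed files already have their content, one possibility is to test whether each symlink resolves to an existing file. The following is a minimal sketch in plain shell, not a DataLad or git-annex command: the function name list_missing_content is made up for this illustration, and it assumes a dataset in which annexed files are symlinks, as in the tree output above.

```shell
# Hypothetical helper (not part of DataLad or git-annex): print every
# symlink below a directory whose target does not exist, i.e. files
# whose content has not been retrieved.
list_missing_content() {
    # skip .git itself; -type l selects symlinks
    find "${1:-.}" -path '*/.git' -prune -o -type l -print |
    while IFS= read -r link; do
        # [ -e ] follows the symlink and fails if the target is absent
        [ -e "$link" ] || printf '%s\n' "$link"
    done
}
```

Run in the fresh clone, list_missing_content books would be expected to list the PDFs, since none of their content has been retrieved yet, while files stored in Git (such as notes.txt) are regular files and never show up.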

To demonstrate this, you decide to examine the PDFs further. “Try to get one of the books”, you instruct your room mate:

$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from origin...]

“Opening this file will work, because the content was retrieved from the original dataset”, you explain, proud that this worked just as you thought it would. Your room mate is excited by this magical command. You, however, begin to wonder: how does DataLad know where to look for that original content?

This information comes from git-annex. Before getting the next PDF, let’s query git-annex where its content is stored:

$ git annex whereis books/TLCL.pdf
whereis books/TLCL.pdf (1 copy)
  	REDACTED-UUID -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]
ok

Oh, another cryptic string of characters! This time, however, it is not in a symlink, and it is not a checksum either: it is a UUID that identifies a dataset location. “That’s hard to read – what is it?” your room mate asks. You can recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer. “This”, you exclaim, excited about your own realization, “is the location of my dataset, the one I’m sharing it from!”

Back in the very first section of the Basics, Create a dataset, a hidden section mentioned the --description option of datalad create. With this option, you can provide a description about the location of your dataset.

The git annex whereis command, finally, is where such a description can come in handy: If you had created the dataset with

$ datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101

the command would show course on DataLad-101 on my private Laptop after the UUID – and thus a more human-readable description of where file content is stored. This becomes especially useful when the number of repository copies increases. If you have only one other dataset it may be easy to remember what and where it is. But once you have one back-up of your dataset on a USB stick, one dataset shared with Dropbox, and a third one on your institution’s GitLab instance, you will be grateful for the descriptions you provided these locations with.

The current report of the location of the dataset is in the format user@host:path. As one computer this book is being built on is called “muninn” and its user “me”, it could look like this: me@muninn:~/dl-101/DataLad-101.
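To make the three parts of this format explicit, here is a small sketch that takes such a string apart with POSIX parameter expansion. The variable names are chosen for this illustration only, and the example value is the one given above.

```shell
# Split a default location description of the form user@host:path.
location='me@muninn:~/dl-101/DataLad-101'
user=${location%%@*}     # everything before the first "@"
rest=${location#*@}      # everything after the first "@"
host=${rest%%:*}         # everything before the first ":"
ds_path=${rest#*:}       # everything after the first ":"
printf 'user=%s host=%s path=%s\n' "$user" "$host" "$ds_path"
```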

If the physical location of a dataset is not relevant, ambiguous, or volatile, or if it has an annex that could move within the foreseeable lifetime of a dataset, a custom description with the relevant information on the dataset is superior. If this is not the case, decide for yourself whether you want to use the --description option for future datasets or not depending on what you find more readable – a self-made location description, or an automatic user@host:path information.

The message further informs you that there is only “(1 copy)” of this file content. This makes sense: There is only your own, original DataLad-101 dataset in which this book is saved.

To retrieve file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content. It identifies these locations by their UUIDs, as seen in the output above: every copy of a dataset gets such a unique ID. Note, however, that just because git-annex knows a certain location where content once was, this does not guarantee that retrieval will work. If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including dataset content), and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations. As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content. Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system. If you shared the dataset with anyone without access to your file system, datalad get would not work, because it could not access your files.
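The idea of “try every known location, succeed if at least one is accessible” can be sketched in a few lines of plain shell. This is an analogy only and not how git-annex is implemented: the function name and the notion of a location as a simple directory are assumptions made for this illustration.

```shell
# Analogy only (not git-annex code): copy a file from the first
# candidate location that is currently readable; fail if none is.
fetch_from_first_available() {
    name=$1
    dest=$2
    shift 2
    for location in "$@"; do
        if [ -r "$location/$name" ]; then
            # report which location provided the content
            cp "$location/$name" "$dest" && { printf '%s\n' "$location"; return 0; }
        fi
    done
    return 1   # no known location was accessible
}
```

Called with an unplugged USB stick as the first location and a reachable directory as the second, the function would skip the first and copy from the second, just as retrieval succeeds in your room mate’s clone because your original dataset is reachable on the same file system.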

But there is one book that does not suffer from this restriction: The bash_guide.pdf. This book was not manually downloaded and saved to the dataset with wget (thus keeping DataLad in the dark about where it came from), but it was obtained with the datalad download-url command. This registered the book’s original source in the dataset, and here is why that is useful:

$ git annex whereis books/bash_guide.pdf
whereis books/bash_guide.pdf (2 copies)
  	00000000-0000-0000-0000-000000000001 -- web
   	REDACTED-UUID -- me@appveyor-vm:~/dl-101/DataLad-101 [origin]

  web: http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf
ok

Unlike the TLCL.pdf book, this book has two sources, and one of them is web. The second to last line specifies the precise URL you downloaded the file from. Thus, for this book, your room mate is always able to obtain it (as long as the URL remains valid), even if you deleted your DataLad-101 dataset. Quite useful, this provenance, right?

Let’s now turn to the fact that the subdataset longnow contains neither file content nor file metadata information to explore the contents of the dataset: there are no subdirectories or any files under recordings/longnow/. This is behavior that you have not observed until now.

To fix this and obtain file availability metadata, you have to run a somewhat unexpected command:

$ datalad get -n recordings/longnow
[INFO] Cloning dataset to Dataset(/home/me/dl-101/mock_user/DataLad-101/recordings/longnow)
[INFO] Attempting to clone from https://github.com/datalad-datasets/longnow-podcasts.git to /home/me/dl-101/mock_user/DataLad-101/recordings/longnow
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/mock_user/DataLad-101/recordings/longnow)
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] https://github.com/datalad-datasets/longnow-podcasts.git/config download failed: Not Found
install(ok): /home/me/dl-101/mock_user/DataLad-101/recordings/longnow (dataset) [Installed subdataset in order to get /home/me/dl-101/mock_user/DataLad-101/recordings/longnow]

The section below will elaborate on datalad get and the -n/--no-data option, but for now, let’s first see what has changed after running the above command (excerpt):

$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/P5/qK/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg/MD5E-s70348--4b2ec0db16882082d6bddffbabcfc45b.jpg
    ├── longnow
    │   ├── Long_Now__Conversations_at_The_Interval
    │   │   ├── 2017_06_09__How_Digital_Memory_Is_Shaping_Our_Future__Abby_Smith_Rumsey.mp3 -> ../.git/annex/objects/8j/kQ/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3
    │   │   ├── 2017_06_09__Pace_Layers_Thinking__Stewart_Brand__Paul_Saffo.mp3 -> ../.git/annex/objects/Qk/9M/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3
    │   │   ├── 2017_06_09__Proof__The_Science_of_Booze__Adam_Rogers.mp3 -> ../.git/annex/objects/FP/96/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3
    │   │   ├── 2017_06_09__Seveneves_at_The_Interval__Neal_Stephenson.mp3 -> ../.git/annex/objects/Wf/5Q/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3
    │   │   ├── 2017_06_09__Talking_with_Robots_about_Architecture__Jeffrey_McGrew.mp3 -> ../.git/annex/objects/Fj/9V/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3
    │   │   ├── 2017_06_09__The_Red_Planet_for_Real__Andy_Weir.mp3 -> ../.git/annex/objects/xq/Q3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3

Interesting! The file metadata information is now present, and we can explore the file hierarchy. The file content, however, is not present yet.

What has happened here?

When DataLad installs a dataset, it will by default only obtain the superdataset, and not any subdatasets. The superdataset contains the information that a subdataset exists though – the subdataset is registered in the superdataset. This is why the subdataset name exists as a directory. A subsequent datalad get -n path/to/longnow will install the registered subdataset again, just as we did in the example above.

But what about the -n option for datalad get? Previously, we used datalad get to get file content. However, get can operate on more than just the level of files or directories. Instead, it can also operate on the level of datasets. Regardless of whether it is a single file (such as books/TLCL.pdf) or a registered subdataset (such as recordings/longnow), get will operate on it to 1) install it – if it is a not yet installed subdataset – and 2) retrieve the contents of any files. That makes it very easy to get your file content, regardless of how your dataset may be structured – it is always the same command, and DataLad blurs the boundaries between superdatasets and subdatasets.

In the above example, we called datalad get with the option -n/--no-data. This option prevents get from obtaining the data of individual files or directories, thus limiting its scope to the level of datasets, as only a datalad clone is performed. Without this option, the command would have retrieved all of the subdataset’s contents right away. But with -n/--no-data, it only installed the subdataset to retrieve the metadata about file availability.

To explicitly install all potential subdatasets recursively, that is, all of the subdatasets inside it as well, one can give the -r/--recursive option to get:

$ datalad get -n -r <subds>

This would install the subds subdataset and all potential further subdatasets inside of it, and the metadata about file hierarchies would be available right away for every subdataset inside of subds. If you had several subdatasets and did not provide a path to a single dataset, but, say, the current directory (. as in datalad get -n -r .), it would clone all registered subdatasets recursively.

So why is a recursive get not the default behavior? In Dataset nesting we learned that datasets can be nested arbitrarily deep. Upon getting the metadata of one dataset you might not want to also install a few dozen levels of nested subdatasets right away.

However, there is a middle way [1]: The --recursion-limit option lets you specify how many levels of subdatasets should be installed together with the first subdataset:

$ datalad get -n -r --recursion-limit 1 <subds>
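If you find yourself typing this combination of options often, you could wrap it in a small shell function. The sketch below is only a convenience idea: get_structure and the DRY_RUN variable are hypothetical names that are not part of DataLad, and only the options shown above (-n, -r, --recursion-limit) are used.

```shell
# Hypothetical wrapper (not part of DataLad): install subdatasets
# without file content, down to a given number of levels.
# With DRY_RUN=1 the command is printed instead of executed.
get_structure() {
    target=${1:-.}    # dataset or subdataset path, default: current directory
    limit=${2:-1}     # number of subdataset levels, default: 1
    set -- datalad get -n -r --recursion-limit "$limit" "$target"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 get_structure recordings/longnow 2 prints the command that would be run, which makes it easy to check before anything is cloned.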

You may remember from section Install datasets that DataLad has two commands to obtain datasets, datalad clone and datalad install. The command structures of datalad install and datalad clone are almost identical:

$ datalad install [-d/--dataset PATH] [-D/--description] --source PATH/URL [DEST-PATH ...]
$ datalad clone [-d/--dataset PATH] [-D/--description] SOURCE-PATH/URL [DEST-PATH]

Both commands are also often interchangeable: To create a copy of your DataLad-101 dataset for your room mate, or to obtain the longnow subdataset in section Install datasets, you could have used datalad install as well. From a user’s perspective, the only difference is whether you’d need -s/--source in the command call:

$ datalad install --source ../DataLad-101
# versus
$ datalad clone ../DataLad-101

On a technical layer, datalad clone is a subset (or rather: the underlying function) of the datalad install command. Whenever you use datalad install, it will call datalad clone under the hood. datalad install, however, adds to datalad clone in that it has slightly more complex functionality. Thus, while the command structure of clone is more intuitive, its capacities are also slightly more limited than those of install. Unlike datalad clone, datalad install provides a -r/--recursive operation, i.e., it can obtain (clone) a dataset and potential subdatasets right at the time of superdataset installation. You can pick for yourself which command you are more comfortable with. In the handbook, we use clone for its more intuitive behavior, but you will often note that we use the terms “installed dataset” and “cloned dataset” interchangeably.

To summarize what you learned in this section, write a note on how to install a dataset using a path as a source on a common file system.

Write this note in “your own” (the original) DataLad-101 dataset, though!

# navigate back into the original dataset
$ cd ../../DataLad-101
# write the note
$ cat << EOT >> notes.txt
A source to install a dataset from can also be a path, for example as
in "datalad clone ../DataLad-101".

Just as in creating datasets, you can add a description on the
location of the new dataset clone with the -D/--description option.

Note that subdatasets will not be installed by default, but are only
registered in the superdataset -- you will have to do a
"datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
availability meta data. The -n/--no-data option prevents file
contents from also being downloaded.

Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer way to
proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"

EOT

Save this note.

$ datalad save -m "add note about cloning from paths and recursive datalad get"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.
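For readers who know Git, the following sketch shows the plain Git side of this equivalence: a clone made from a path records its source as a remote, which is where the [origin] label in the git annex whereis output above comes from. It uses only Git commands on throwaway directories; the directory names are arbitrary and chosen for this illustration.

```shell
# Plain Git, no DataLad involved: clone a repository from a path and
# inspect the remote that the clone records for its source.
workdir=$(mktemp -d)
git init -q "$workdir/original"
# cloning an empty repository only emits a warning, silenced here
git clone -q "$workdir/original" "$workdir/copy" 2>/dev/null
remote_name=$(git -C "$workdir/copy" remote)
remote_url=$(git -C "$workdir/copy" remote get-url origin)
printf '%s -> %s\n' "$remote_name" "$remote_url"
```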

Footnotes

1: Another alternative to a recursion limit for datalad get -n -r is a dataset configuration that specifies subdatasets that should not be cloned recursively, unless explicitly given to the command with a path. With this configuration, a superdataset’s maintainer can safeguard users and prevent potentially large numbers of subdatasets from being cloned. You can learn more about this configuration in the section More on DIY configurations.