4.1. Looking without touching

Only now, several weeks into the DataLad-101 course, does your room mate realize that he has enrolled in the course as well, but has not yet attended at all. “Oh man, can you help me catch up?” he asks you one day. “Sharing just your notes would be really cool for a start already!”

“Sure thing”, you say, and decide that it’s probably best if he gets all of the DataLad-101 course dataset. Sharing datasets was something you wanted to look into soon, anyway.

This is one exciting aspect of DataLad datasets that has so far been missing from this course: How does one share a dataset? In this section, we will cover the simplest way of sharing a dataset: on a local or shared file system, via an installation with a path as a source.

More on public data sharing

Interested in sharing datasets publicly? Read this chapter to get a feel for all relevant basic concepts of sharing datasets. Afterwards, head over to chapter Third party infrastructure to find out how to share a dataset on third-party infrastructure.

In this scenario, multiple people can access the very same files at the same time, often on the same machine (e.g., a shared workstation, or a server that people can “SSH” into). You might think: “What do I need DataLad for, if everyone can already access everything?” However, universal, unrestricted access can easily lead to chaos. DataLad can help facilitate collaboration without requiring ultimate trust and reliability of all participants. Essentially, with a shared dataset, collaborators can see and use your dataset without any danger of undesired or uncontrolled modification.

To demonstrate how to share a DataLad dataset on a common file system, we will pretend that your personal computer can be accessed by other users. Let’s say that your room mate has access, and you are making sure that there is a DataLad-101 dataset in a different place on the file system for him to access and work with.

This is indeed a common real-world use case: Two users on a shared file system sharing a dataset with each other. But as we cannot easily simulate a second user in this handbook, for now, you will have to share your dataset with yourself. This endeavor serves several purposes: For one, you will experience a very easy way of sharing a dataset. Secondly, it will show you how a dataset can be obtained from a path, instead of a URL as shown in section Install datasets. Thirdly, DataLad-101 is a dataset that can showcase many different properties of a dataset already, but it will be an additional learning experience to see how the different parts of the dataset – text files, larger files, subdatasets, run records – will appear upon installation when shared. And lastly, you will likely “share a dataset with yourself” whenever you use a particular dataset of your own creation as input for one or more projects.

“Awesome!” exclaims your room mate as you take out your laptop to share the dataset. “You are really saving my ass here. I’ll make up for it when we prepare for the final”, he promises.

To install DataLad-101 into a different part of your file system, navigate out of DataLad-101, and – for simplicity – create a new directory, mock_user, right next to it:

$ cd ../
$ mkdir mock_user

For simplicity, pretend that this is a second user’s – your room mate’s – home directory. Furthermore, let’s for now disregard anything about permissions. In a real-world example you likely would not be able to read and write to a different user’s directories, but we will talk about permissions later.

After creation, navigate into mock_user and install the dataset DataLad-101. To do this, use datalad clone (manual), and provide a path to your original dataset:

$ cd mock_user
$ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101
install(ok): /home/me/dl-101/mock_user/DataLad-101 (dataset)

This will install your dataset DataLad-101 into your room mate’s home directory. Note that we have given this new dataset a description about its location. Note further that we have not provided the optional destination path to datalad clone, and hence it installed the dataset under its original name in the current directory.

Together with your room mate, you go ahead and see what this dataset looks like. Before running the command, try to predict what you will see.

$ cd DataLad-101
$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/✂/MD5E-s1198170--0ab2c121✂MD5.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/xF/42/✂/MD5E-s4161086--c832fc13✂MD5.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/✂/MD5E-s12465653--05cd7ed5✂MD5.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/✂/MD5E-s2120211--06d1efcb✂MD5.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/✂/MD5E-s70348--4b2ec0db✂MD5.jpg
    ├── longnow
    ├── podcasts.tsv
    └── salt_logo_small.jpg -> ../.git/annex/objects/fZ/wg/✂/MD5E-s76402--87da732f✂MD5.jpg

4 directories, 9 files

There are a number of interesting things, and your room mate is the first to notice them:

“Hey, can you explain some things to me?”, he asks. “This directory here, “longnow”, why is it empty?” True, the subdataset has a directory name but apart from this, the longnow directory appears empty.

“Also, why do the PDFs in books/ and the .jpg files appear so weird? They have this cryptic path right next to them, and look, if I try to open one of them, it fails! Did something go wrong when we installed the dataset?” he worries. Indeed, the PDFs and pictures appear just as they did in the original dataset at first sight: They are symlinks pointing to some location in the object tree. To reassure your room mate that everything is fine, you quickly explain to him the concept of a symlink and the object tree of git-annex.

“But why does the PDF not open when I try to open it?” he repeats. True, these files cannot be opened. This mimics our experience when installing the longnow subdataset: Right after installation, the .mp3 files also could not be opened, because their file content was not yet retrieved. You begin to explain to your room mate how DataLad retrieves only minimal metadata about which files actually exist in a dataset upon a datalad clone. “It’s really handy”, you tell him. “This way you can decide which book you want to read, and then retrieve what you need. Everything that is annexed is retrieved on demand. Note though that the contents of the text files are present, and the files can be opened – this is because these files are stored in Git. So you already have my notes, and you can decide for yourself whether you want to get the books.”
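
The difference between a file that is merely listed and a file whose content is present can be reproduced without DataLad. The following is a minimal sketch, not what DataLad runs internally: it builds a throwaway directory in which books/book.pdf is a symlink to a target that does not exist yet, next to a regular text file. All names in it (demo, placeholder-key.pdf, content_state) are made up for illustration.

```shell
#!/bin/sh
# Mock-up of a fresh clone. All paths are hypothetical stand-ins.
demo=$(mktemp -d)
mkdir -p "$demo/books" "$demo/.git/annex/objects"

# An "annexed" file: a symlink into the object tree, target not there yet.
ln -s ../.git/annex/objects/placeholder-key.pdf "$demo/books/book.pdf"
# A file "stored in Git": a regular file with its content in place.
echo "notes are stored in Git" > "$demo/notes.txt"

# Report whether the content behind a path can actually be read.
content_state() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "missing"    # the symlink exists, its target does not
    else
        echo "present"
    fi
}

pdf_before=$(content_state "$demo/books/book.pdf")
notes_state=$(content_state "$demo/notes.txt")

# Imitate content retrieval by putting a file at the symlink target.
echo "placeholder content" > "$demo/.git/annex/objects/placeholder-key.pdf"
pdf_after=$(content_state "$demo/books/book.pdf")

echo "book.pdf before: $pdf_before, after: $pdf_after; notes.txt: $notes_state"
```

In a real clone, it is datalad get that places the content at the symlink target; the second echo line above only imitates that step.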

To demonstrate this, you decide to examine the PDFs further. “Try to get one of the books”, you instruct your room mate:

$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from origin...]

“Opening this file will work, because the content was retrieved from the original dataset”, you explain, proud that this worked just as you thought it would.

Let’s now turn to the fact that the subdataset longnow contains neither file content nor file metadata information to explore the contents of the dataset: there are no subdirectories or any files under recordings/longnow/. This is behavior that you have not observed until now. To fix this and obtain file availability metadata, you have to run a somewhat unexpected command:

$ datalad get -n recordings/longnow
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): /home/me/dl-101/mock_user/DataLad-101/recordings/longnow (dataset) [Installed subdataset in order to get /home/me/dl-101/mock_user/DataLad-101/recordings/longnow]

Before we look further into datalad get (manual) and the -n/--no-data option, let’s first see what has changed after running the above command (excerpt):

$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/✂/MD5E-s1198170--0ab2c121✂MD5.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/xF/42/✂/MD5E-s4161086--c832fc13✂MD5.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/✂/MD5E-s12465653--05cd7ed5✂MD5.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/✂/MD5E-s2120211--06d1efcb✂MD5.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/✂/MD5E-s70348--4b2ec0db✂MD5.jpg
    ├── longnow
    │   ├── Long_Now__Conversations_at_The_Interval
    │   │   ├── 2017_06_09__How_Digital_Memory_Is_Shaping_Our_Future__Abby_Smith_Rumsey.mp3 -> ../.git/annex/objects/8j/kQ/✂/MD5E-s66305442--c723d53d✂MD5.mp3
    │   │   ├── 2017_06_09__Pace_Layers_Thinking__Stewart_Brand__Paul_Saffo.mp3 -> ../.git/annex/objects/Qk/9M/✂/MD5E-s112801659--00a42a1a✂MD5.mp3
    │   │   ├── 2017_06_09__Proof__The_Science_of_Booze__Adam_Rogers.mp3 -> ../.git/annex/objects/FP/96/✂/MD5E-s60091960--6e48eceb✂MD5.mp3
    │   │   ├── 2017_06_09__Seveneves_at_The_Interval__Neal_Stephenson.mp3 -> ../.git/annex/objects/Wf/5Q/✂/MD5E-s66431897--aff90c83✂MD5.mp3
    │   │   ├── 2017_06_09__Talking_with_Robots_about_Architecture__Jeffrey_McGrew.mp3 -> ../.git/annex/objects/Fj/9V/✂/MD5E-s61491081--c4e88ea0✂MD5.mp3
    │   │   ├── 2017_06_09__The_Red_Planet_for_Real__Andy_Weir.mp3 -> ../.git/annex/objects/xq/Q3/✂/MD5E-s136924472--0d107210✂MD5.mp3

Interesting! The file metadata information is now present, and we can explore the file hierarchy. The file content, however, is not present yet.

What has happened here?

When DataLad installs a dataset, it will by default only obtain the superdataset, and not any subdatasets. The superdataset contains the information that a subdataset exists though – the subdataset is registered in the superdataset. This is why the subdataset name exists as a directory. A subsequent datalad get -n path/to/longnow installs the registered subdataset, just as we did in the example above.
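
Under the hood, a DataLad subdataset is registered as a Git submodule, so the “registered, but not installed” state can be observed with plain Git. The sketch below is an analogy under that assumption, not the commands DataLad executes. The repository names (sub, super, clone) and the helper g are hypothetical, and the protocol.file.allow setting is only there because recent Git versions restrict submodules that are cloned from local paths.

```shell
#!/bin/sh
# Throwaway repositories; all names are made up for illustration.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Helper: run git with a throwaway identity and local-path submodules allowed.
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# "sub" plays the role of the subdataset.
g init -q sub
echo "episode list" > sub/episodes.txt
g -C sub add episodes.txt
g -C sub commit -q -m "add episode list"

# "super" registers it, like a superdataset registers a subdataset.
g init -q super
g -C super submodule --quiet add "$demo/sub" sub
g -C super commit -q -m "register sub"

# A fresh clone carries the registration, but only an empty directory.
g clone -q super clone
files_before=$(ls -A clone/sub | wc -l)
status_before=$(g -C clone submodule status)

# Installing the registered submodule populates the directory.
g -C clone submodule --quiet update --init
files_after=$(ls -A clone/sub | wc -l)

echo "status before: $status_before"
echo "entries before: $files_before, after: $files_after"
```

The leading minus sign in the git submodule status output marks a submodule that is registered but not initialized, which corresponds to the empty longnow directory right after the clone.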

But what about the -n option for datalad get? Previously, we used datalad get to get file content. However, datalad get operates on more than just the level of files or directories. It can also operate on the level of datasets. Regardless of whether it is a single file (such as books/TLCL.pdf) or a registered subdataset (such as recordings/longnow), datalad get will operate on it to 1) install it – if it is a subdataset that is not yet installed – and 2) retrieve the contents of any files. That makes it very easy to get your file content, regardless of how your dataset may be structured – it is always the same command, and DataLad blurs the boundaries between superdatasets and subdatasets.

In the above example, we called datalad get with the option -n/--no-data. This option prevents datalad get from obtaining the data of individual files or directories, thus limiting its scope to the level of datasets: only a datalad clone is performed. Without this option, the command would have retrieved all of the subdataset’s contents right away. But with -n/--no-data, it only installed the subdataset to retrieve the metadata about file availability.

To install not only a subdataset but also all of the subdatasets inside it, one can give the -r/--recursive option to datalad get:

$ datalad get -n -r <subds>

This would install the subds subdataset and all potential further subdatasets inside of it, and the metadata about file hierarchies would be available right away for every subdataset inside of subds. If you had several subdatasets and did not provide a path to a single dataset, but, say, the current directory (. as in datalad get -n -r .), it would clone all registered subdatasets recursively.

So why is a recursive get not the default behavior? In Dataset nesting we learned that datasets can be nested arbitrarily deep. Upon getting the metadata of one dataset you might not want to also install a few dozen levels of nested subdatasets right away.

However, there is a middle way[1]: The --recursion-limit option lets you specify how many levels of subdatasets should be installed together with the first subdataset:

$ datalad get -n -r --recursion-limit 1 <subds>

To summarize what you learned in this section, write a note on how to install a dataset using a path as a source on a common file system.

Write this note in “your own” (the original) DataLad-101 dataset, though!

$ # navigate back into the original dataset
$ cd ../../DataLad-101
$ # write the note
$ cat << EOT >> notes.txt
A source to install a dataset from can also be a path, for example as
in "datalad clone ../DataLad-101".

Just as in creating datasets, you can add a description on the
location of the new dataset clone with the -D/--description option.

Note that subdatasets will not be installed by default, but are only
registered in the superdataset -- you will have to do a
"datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
availability meta data. The -n/--no-data option prevents file
contents from also being downloaded.

Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer way to
proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"

EOT

Save this note.

$ datalad save -m "add note about cloning from paths and recursive datalad get"
add(ok): notes.txt (file)
save(ok): . (dataset)

Get a clone

A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.
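
Because of this equivalence, the behavior of a path as a clone source can be tried out with plain Git. The following is a rough sketch that uses git clone instead of datalad clone, so it involves neither a description nor annexed content. The directory names original and my-copy are made up for illustration, and mock_user only mirrors the directory used in this section.

```shell
#!/bin/sh
# Throwaway playground; all names are hypothetical.
demo=$(mktemp -d)
cd "$demo" || exit 1

# A small repository that stands in for the dataset to be shared.
git init -q original
echo "some notes" > original/notes.txt
git -C original add notes.txt
git -C original -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add notes"

# A directory that stands in for another user's home directory.
mkdir mock_user
cd mock_user || exit 1

# Without a destination path, the clone is named after its source ...
git clone -q ../original
# ... with a destination path, it gets the name given.
git clone -q ../original my-copy

ls
```

Both clones contain notes.txt and differ only in their directory name, which is the same naming rule observed above for datalad clone when no destination path was given.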

Footnotes