Looking without touching

Only now, several weeks into the DataLad-101 course, does your room mate realize that he has enrolled in the course as well, but has not yet attended at all. “Oh man, can you help me catch up?” he asks you one day. “Just sharing your notes would already be really cool for a start!”

“Sure thing”, you say, and decide that it’s probably best if he gets all of the DataLad-101 course dataset. Sharing datasets was something you wanted to look into soon, anyway.

This is one exciting aspect of DataLad datasets that has been missing from this course so far: How does one share a dataset? In this section, we will cover the simplest way of sharing a dataset: on a local or shared file system, via an installation with a path as a source.

In this scenario multiple people can access the very same files at the same time, often on the same machine (e.g., a shared workstation, or a server that people can “SSH” into). You might think: “What do I need DataLad for, if everyone can already access everything?” However, universal, unrestricted access can easily lead to chaos. DataLad can help facilitate collaboration without requiring ultimate trust and reliability of all participants. Essentially, with a shared dataset, collaborators can look at and use your dataset without ever touching it.

To demonstrate how to share a DataLad dataset on a common file system, we will pretend that your personal computer can be accessed by other users. Let’s pretend that your room mate has access, and you’re installing the DataLad-101 dataset in a different place in the file system for him to access and work with.

This is indeed a common real-world use case: Two users on a shared file system sharing a dataset with each other. But as we cannot easily simulate a second user in this handbook, for now, you will have to share your dataset with yourself. This endeavor serves several purposes: For one, you will experience a very easy way of sharing a dataset. Secondly, it will show you the installation of a dataset from a path (instead of a URL as shown in the section Install datasets). Thirdly, DataLad-101 is a dataset that can already showcase many different properties of a dataset, but it will be an additional learning experience to see how the different parts of the dataset – text files, larger files, a DataLad subdataset, datalad run commands – appear upon installation when shared. And lastly, you will likely “share a dataset with yourself” whenever you use a particular dataset of your own creation as input for one or more projects.

“Awesome!” exclaims your room mate as you take out your laptop to share the dataset. “You’re really saving my ass here. I’ll make up for it when we prepare for the final”, he promises.

To install DataLad-101 with datalad install into a different part of your file system, navigate out of DataLad-101, and – for simplicity – create a new directory, mock_user, right next to it:

$ cd ../
$ mkdir mock_user

For simplicity, pretend that this is a second user’s – your room mate’s – home directory. Furthermore, let’s for now disregard anything about permissions. In a real-world example you likely would not be able to read and write to a different user’s directories, but we will talk about permissions later.

After creation, navigate into mock_user and install the dataset DataLad-101 by specifying its path as a --source (remember, the shorter option -s would work as well).

$ cd mock_user
$ datalad install --source ../DataLad-101 --description "DataLad-101 in mock_user"
[INFO] Cloning ../DataLad-101 into '/home/me/dl-101/mock_user/DataLad-101' 
install(ok): /home/me/dl-101/mock_user/DataLad-101 (dataset)

This will install your dataset DataLad-101 in your room mate’s home directory. Note that we have given this new dataset a description about its location as well. Note further that we have not provided a path to datalad install, and hence it installed the dataset under its original name in the current directory.
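Note that datalad install also accepts an explicit target path as its last argument. A hedged sketch (the target name course-copy is made up for illustration):

```shell
# Hypothetical alternative: supplying a path installs the dataset
# under that name instead of its original name "DataLad-101"
$ datalad install --source ../DataLad-101 course-copy
```

Without such a path, as above, the clone simply keeps the name of its source.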

Together with your room mate, you go ahead and see what this dataset looks like. Before running the command, try to predict what you will see.

$ cd DataLad-101
$ tree
├── books
│   ├── byte-of-python.pdf -> ../.git/annex/objects/ZZ/f1/MD5E-s4407669--32e6b03a08a6edda12ad42eb7bb06a5c.pdf/MD5E-s4407669--32e6b03a08a6edda12ad42eb7bb06a5c.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/36/jF/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg
    ├── longnow
    ├── podcasts.tsv
    └── salt_logo_small.jpg -> ../.git/annex/objects/xJ/4G/MD5E-s260607--4e695af0f3e8e836fcfc55f815940059.jpg/MD5E-s260607--4e695af0f3e8e836fcfc55f815940059.jpg

4 directories, 8 files

There are a number of interesting things, and your room mate is the first to notice them:

“Hey, can you explain some things to me?”, he asks. “This directory here, “longnow”, why is it empty?” True, the subdataset has a directory name but apart from this, the longnow directory appears empty.

“Also, why do the PDFs in books/ and the .jpg files appear so weird? They have this cryptic path right next to them, and look, if I try to open one of them, it fails! Did something go wrong when we installed the dataset?” he worries. Indeed, at first sight the PDFs and pictures appear just as they did in the original dataset: They are symlinks pointing to some location in the object tree. To reassure your room mate that everything is fine, you quickly explain to him the concept of a symlink and the object tree of Git-annex.

“But why does the PDF not open when I try to open it?” he repeats. True, these files cannot be opened. This mimics our experience when installing the longnow subdataset: Right after installation, the .mp3 files also could not be opened, because their file content was not yet retrieved. You begin to explain to your room mate how DataLad retrieves only minimal metadata about which files actually exist in a dataset upon a datalad install. “It’s really handy”, you tell him. “This way you can decide which book you want to read, and then retrieve what you need. Everything that is annexed is retrieved on demand. Note though that the text files’ contents are present, and these files can be opened – this is because they are stored in Git. So you already have my notes, and you can decide for yourself whether you want to get the books.”
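A quick way to see this distinction yourself is to check the file type: annexed files are symbolic links into .git/annex/objects, while files committed to Git directly are regular files. A minimal sketch (the helper function is_annexed is made up for illustration; run it inside the installed dataset):

```shell
# Distinguish annexed files from files stored directly in Git:
# annexed files are symlinks into .git/annex/objects, while
# Git-stored files are regular files whose content is always present.
is_annexed () {
    if [ -L "$1" ]; then
        echo "$1: annexed (symlink; content may still be absent)"
    else
        echo "$1: stored in Git (regular file; content is present)"
    fi
}

is_annexed books/TLCL.pdf
is_annexed notes.txt
```

In this dataset, the first call would report an annexed file, the second a Git-stored one.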

To demonstrate this, you decide to examine the PDFs further. “Try to get one of the books”, you instruct your room mate:

$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from origin...]

“Opening this file will work, because the content was retrieved from the original dataset.”, you explain, proud that this worked just as you thought it would. Your room mate is excited by this magical command. You however begin to wonder: how does DataLad know where to look for that original content?

This information comes from Git-annex. Before getting the next PDF, let’s query Git-annex where its content is stored:

$ git annex whereis books/TLCL.pdf
whereis books/TLCL.pdf (1 copy) 
	8c154892-26d4-4130-bbe0-51c49154ad55 -- course on DataLad-101 on my private Laptop [origin]

Oh, another cryptic identifier! This time, however, not in a symlink… “That’s hard to read – what is it?” your room mate asks. Luckily, there is a human-readable description next to it: “course on DataLad-101 on my private Laptop”. “This”, you exclaim, excited about your own realization, “is my dataset’s location that I’m sharing it from!”

This is, finally, where the description provided to datalad create in section Create a dataset becomes handy: It is a human-readable description of where file content is stored. This becomes especially useful as the number of repositories increases. If you have only one other dataset it may be easy to remember what and where it is. But once you have one backup of your dataset on a USB stick, one dataset shared via Dropbox, and a third one on your institution’s GitLab instance, you will be grateful for the descriptions you provided these locations with.
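If a description was forgotten at creation time or needs updating, git-annex lets you (re)set it at any point with its describe command. A sketch (the description text is just an example; “here” refers to the current clone):

```shell
$ git annex describe here "DataLad-101 on the lab workstation"
```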

The message further informs you that there is only “(1 copy)” of this file content. This makes sense: There is only your own, original DataLad-101 dataset in which this book is saved.
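This count changes as content spreads: since you just retrieved books/progit.pdf into the mock_user clone, that clone itself now holds a copy of the file’s content, and git-annex would report two copies for it – the original dataset and “here”. You can check this yourself:

```shell
$ git annex whereis books/progit.pdf
```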

To retrieve file content of an annexed file such as one of these PDFs, Git-annex will try to obtain it from the locations it knows to contain this content. It uses unique identifiers (UUIDs) such as the one shown above to distinguish these locations – every clone of a dataset gets such an ID. Note, however, that just because Git-annex knows a certain location where content once was, this does not guarantee that retrieval will work: If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including the dataset’s content), and another location is a web server that you currently cannot reach, Git-annex will not succeed in retrieving contents from these locations. As long as there is at least one location that contains the file and is accessible, though, Git-annex will get the content.

Let’s now turn to the fact that the subdataset longnow contains neither file content nor the file metadata information needed to explore the contents of the dataset: There are no subdirectories or files under recordings/longnow/ at all. This is behavior that you have not observed until now.

To fix this and obtain file availability metadata, you have to run a somewhat unexpected command:

$ datalad install recordings/longnow
[INFO] Cloning /home/me/dl-101/DataLad-101/recordings/longnow [2 other candidates] into '/home/me/dl-101/mock_user/DataLad-101/recordings/longnow' 
install(ok): /home/me/dl-101/mock_user/DataLad-101/recordings/longnow (dataset) [Installed subdataset in order to get /home/me/dl-101/mock_user/DataLad-101/recordings/longnow]

Let’s see what has changed (excerpt):

$ tree
├── books
│   ├── byte-of-python.pdf -> ../.git/annex/objects/ZZ/f1/MD5E-s4407669--32e6b03a08a6edda12ad42eb7bb06a5c.pdf/MD5E-s4407669--32e6b03a08a6edda12ad42eb7bb06a5c.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/36/jF/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg
    ├── longnow
    │   ├── Long_Now__Conversations_at_The_Interval
    │   │   ├── 2017_06_09__How_Digital_Memory_Is_Shaping_Our_Future__Abby_Smith_Rumsey.mp3 -> ../.git/annex/objects/8j/kQ/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3/MD5E-s66305442--c723d53d207e6d82dd64c3909a6a93b0.mp3
    │   │   ├── 2017_06_09__Pace_Layers_Thinking__Stewart_Brand__Paul_Saffo.mp3 -> ../.git/annex/objects/Qk/9M/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3/MD5E-s112801659--00a42a1a617485fb2c03cbf8482c905c.mp3
    │   │   ├── 2017_06_09__Proof__The_Science_of_Booze__Adam_Rogers.mp3 -> ../.git/annex/objects/FP/96/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3/MD5E-s60091960--6e48eceb5c54d458164c2d0f47b540bc.mp3
    │   │   ├── 2017_06_09__Seveneves_at_The_Interval__Neal_Stephenson.mp3 -> ../.git/annex/objects/Wf/5Q/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3/MD5E-s66431897--aff90c838a1c4a363bb9d83a46fa989b.mp3
    │   │   ├── 2017_06_09__Talking_with_Robots_about_Architecture__Jeffrey_McGrew.mp3 -> ../.git/annex/objects/Fj/9V/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3/MD5E-s61491081--c4e88ea062c0afdbea73d295922c5759.mp3
    │   │   ├── 2017_06_09__The_Red_Planet_for_Real__Andy_Weir.mp3 -> ../.git/annex/objects/xq/Q3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3/MD5E-s136924472--0d1072105caa56475df9037670d35a06.mp3
    │   │   ├── 2017_07_03__Transforming_Perception__One_Sense_at_a_Time__Kara_Platoni.mp3 -> ../.git/annex/objects/J6/88/MD5E-s62941770--77ae65e0f84c4b1fbefe74183284c305.mp3/MD5E-s62941770--77ae65e0f84c4b1fbefe74183284c305.mp3
    │   │   ├── 2017_08_01__How_Climate_Will_Evolve_Government_and_Society__Kim_Stanley_Robinson.mp3 -> ../.git/annex/objects/kw/PF/MD5E-s60929439--86a30b6bab51e59af52ca8aa6684498f.mp3/MD5E-s60929439--86a30b6bab51e59af52ca8aa6684498f.mp3
    │   │   ├── 2017_09_01__Envisioning_Deep_Time__Jonathon_Keats.mp3 -> ../.git/annex/objects/W4/2q/MD5E-s57113552--82a985abe7fa362e29e4ffa3a9951cc3.mp3/MD5E-s57113552--82a985abe7fa362e29e4ffa3a9951cc3.mp3
    │   │   ├── 2017_10_01__Thinking_Long_term_About_the_Evolving_Global_Challenge__The_Refugee_Reality.mp3 -> ../.git/annex/objects/81/qF/MD5E-s78362767--5b077807c50d1fa02bebd399ec1431e0.mp3/MD5E-s78362767--5b077807c50d1fa02bebd399ec1431e0.mp3
    │   │   ├── 2017_11_01__The_Web_In_An_Eye_Blink__Jason_Scott.mp3 -> ../.git/annex/objects/03/4v/MD5E-s64398689--049e8d1c9288d201b275331afb71b316.mp3/MD5E-s64398689--049e8d1c9288d201b275331afb71b316.mp3
    │   │   ├── 2017_12_01__Ideology_in_our_Genes__The_Biological_Basis_for_Political_Traits__Rose_McDermott.mp3 -> ../.git/annex/objects/x0/2j/MD5E-s59979926--05127d163371d1152b72d98263d7848a.mp3/MD5E-s59979926--05127d163371d1152b72d98263d7848a.mp3
    │   │   ├── 2017_12_07__Can_Democracy_Survive_the_Internet___Nathaniel_Persily.mp3 -> ../.git/annex/objects/5M/Pv/MD5E-s64541470--64960bf95544bc76ed564b541ebb36bc.mp3/MD5E-s64541470--64960bf95544bc76ed564b541ebb36bc.mp3
    │   │   ├── 2018_01_02__The_New_Deal_You_Don_t_Know__Louis_Hyman.mp3 -> ../.git/annex/objects/MZ/MP/MD5E-s61802477--8c3056079a4d3bfe1adbbf0195d57f3c.mp3/MD5E-s61802477--8c3056079a4d3bfe1adbbf0195d57f3c.mp3
    │   │   ├── 2018_02_01__Humanity_and_the_Deep_Ocean__James_Nestor.mp3 -> ../.git/annex/objects/3G/5v/MD5E-s55707819--6bb054946ca3e3e95fd1b1792693706c.mp3/MD5E-s55707819--6bb054946ca3e3e95fd1b1792693706c.mp3
    │   │   ├── 2018_03_01__Our_Future_in_Algorithm_Farming__Mike_Kuniavsky.mp3 -> ../.git/annex/objects/GJ/J2/MD5E-s70246964--5a9f4538aa4d7bc3163067a9e7f093ca.mp3/MD5E-s70246964--5a9f4538aa4d7bc3163067a9e7f093ca.mp3
    │   │   ├── 2018_04_18__The_Organized_Pursuit_of_Knowledge__Margaret_Levi.mp3 -> ../.git/annex/objects/p3/k2/MD5E-s73210485--7c9aae9a75d469ba7f3f66c53b666537.mp3/MD5E-s73210485--7c9aae9a75d469ba7f3f66c53b666537.mp3

Interesting! The file metadata information is now present, and we can explore the file hierarchy. The file content, however, is not present yet.

What has happened here?

When DataLad installs a dataset, it will by default only install the superdataset, and not any subdatasets. The superdataset contains the information that a subdataset exists though – the subdataset is registered in the superdataset. This is why the subdataset name exists as a directory. A subsequent datalad install in recordings/longnow/ or a datalad install PATH/TO/longnow will install the registered dataset without the need to specify the source again, just as we did in the example above.
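To see which subdatasets are registered in a dataset – whether installed or not – you can query the superdataset with the datalad subdatasets command, run from the root of the dataset:

```shell
$ datalad subdatasets
```

In DataLad-101, this would report the registered subdataset recordings/longnow.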

To explicitly install a dataset recursively right away, that is, together with all of the subdatasets inside it, one has to specify the -r/--recursive option:

$ datalad install --source ../DataLad-101 -r --description "DataLad-101 in mock_user"

would have installed the longnow subdataset as well, and the metadata about file hierarchies would have been available right from the start.

So why is this behavior disabled by default? In Dataset nesting we learned that datasets can be nested arbitrarily deep. Upon installing a dataset you might not want to also install a few dozen levels of nested subdatasets right away.

However, there is a middle way: The --recursion-limit option lets you specify how many levels of subdatasets should be installed together with the superdataset:

$ datalad install -s ../DataLad-101 --description "DataLad-101 in mock_user" -r --recursion-limit 1

Hence, this alternative command would have installed the subdataset right away.

To summarize what you learned in this section, write a note on how to install a dataset using a path as a source on a common file system. Include the options -r/--recursive and --recursion-limit.

Write this note in “your own” (the original) DataLad-101 dataset, though!

# navigate back into the original dataset
$ cd ../../DataLad-101
# write the note
$ cat << EOT >> notes.txt
A source to install a dataset from can also be a path,
for example as in "datalad install -s ../DataLad-101".
As when installing datasets before, make sure to add a
description on the location of the dataset to be
installed, and, if you want, a path to where the dataset
should be installed under which name.

Note that subdatasets will not be installed by default --
you will have to do a plain
"datalad install PATH/TO/SUBDATASET", or specify the
-r/--recursive option in the install command:
"datalad install -s ../DataLad-101 -r".

A recursive installation would however install all
installed subdatasets, so a safer way to proceed is to
set a decent --recursion-limit:
"datalad install -s ../DataLad-101 -r --recursion-limit 2"
EOT


Save this note.

$ datalad save -m "add note about installing from paths and recursive installations"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Note for Git users

A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.