4.1. Looking without touching
Only now, several weeks into the DataLad-101 course, does your room mate realize that he has enrolled in the course as well, but has not yet attended at all. “Oh man, can you help me catch up?” he asks you one day. “Sharing just your notes would be really cool for a start already!”
“Sure thing”, you say, and decide that it’s probably best if he gets all of the DataLad-101 course dataset. Sharing datasets was something you wanted to look into soon, anyway.
This is one exciting aspect of DataLad datasets that has so far been missing from this course: How does one share a dataset? In this section, we will cover the simplest way of sharing a dataset: on a local or shared file system, via an installation with a path as a source.
More on public data sharing
Interested in sharing datasets publicly? Read this chapter to get a feel for all relevant basic concepts of sharing datasets. Afterwards, head over to chapter Third party infrastructure to find out how to share a dataset on third-party infrastructure.
In this scenario, multiple people can access the very same files at the same time, often on the same machine (e.g., a shared workstation, or a server that people can “SSH” into). You might think: “What do I need DataLad for, if everyone can already access everything?” However, universal, unrestricted access can easily lead to chaos. DataLad can help facilitate collaboration without requiring ultimate trust and reliability of all participants. Essentially, with a shared dataset, collaborators can see and use your dataset without any danger of undesired or uncontrolled modification.
To demonstrate how to share a DataLad dataset on a common file system, we will pretend that your personal computer can be accessed by other users. Let’s say that your room mate has access, and you are making sure that there is a DataLad-101 dataset in a different place on the file system for him to access and work with.
This is indeed a common real-world use case: Two users on a shared file system sharing a dataset with each other. But as we cannot easily simulate a second user in this handbook, for now, you will have to share your dataset with yourself.
This endeavor serves several purposes: For one, you will experience a very easy way of sharing a dataset. Secondly, it will show you how a dataset can be obtained from a path, instead of a URL as shown in section Install datasets. Thirdly, DataLad-101 is a dataset that can showcase many different properties of a dataset already, but it will be an additional learning experience to see how the different parts of the dataset – text files, larger files, subdatasets, run records – will appear upon installation when shared. And lastly, you will likely “share a dataset with yourself” whenever you use a particular dataset of your own creation as input for one or more projects.
“Awesome!” exclaims your room mate as you take out your laptop to share the dataset. “You are really saving my ass here. I’ll make up for it when we prepare for the final”, he promises.
To install DataLad-101 into a different part of your file system, navigate out of DataLad-101, and – for simplicity – create a new directory, mock_user, right next to it:
$ cd ../
$ mkdir mock_user
For simplicity, pretend that this is a second user’s – your room mate’s – home directory. Furthermore, let’s for now disregard anything about permissions. In a real-world example you likely would not be able to read and write to a different user’s directories, but we will talk about permissions later.
After creation, navigate into mock_user and install the dataset DataLad-101. To do this, use datalad clone, and provide a path to your original dataset:
$ cd mock_user
$ datalad clone --description "DataLad-101 in mock_user" ../DataLad-101
install(ok): /home/me/dl-101/mock_user/DataLad-101 (dataset)
This will install your dataset DataLad-101 into your room mate’s home directory. Note that we have given this new dataset a description about its location. Note further that we have not provided the optional destination path to datalad clone, and hence it installed the dataset under its original name in the current directory.
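If you would rather give the clone a different directory name, the optional destination path can be passed as a second argument to datalad clone. The following is a minimal sketch, assuming it is run from inside mock_user; the destination name roommate-copy is a hypothetical choice that is not used elsewhere in this course, and the guard simply skips the clone when DataLad or the source dataset is not available.

```shell
#!/bin/sh
# Sketch: clone from a path into an explicitly named destination.
# "roommate-copy" is a hypothetical name chosen for illustration only.
src="../DataLad-101"
dest="roommate-copy"

if command -v datalad >/dev/null 2>&1 && [ -d "$src" ]; then
    # Same call as before, with the destination path as a second argument.
    if datalad clone --description "DataLad-101 in mock_user" "$src" "$dest"; then
        clone_status="cloned"
    else
        clone_status="failed"
    fi
else
    clone_status="skipped"
    echo "datalad or $src not found; skipping the clone"
fi
echo "clone status: $clone_status"
```

Without the second argument, the clone is named after the last component of the source path, which is what happened in the command shown earlier.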
Together with your room mate, you go ahead and see what this dataset looks like. Before running the command, try to predict what you will see.
$ cd DataLad-101
$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/✂/MD5E-s1198170--0ab2c121✂MD5.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/xF/42/✂/MD5E-s4161086--c832fc13✂MD5.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/✂/MD5E-s12465653--05cd7ed5✂MD5.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/✂/MD5E-s2120211--06d1efcb✂MD5.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/✂/MD5E-s70348--4b2ec0db✂MD5.jpg
    ├── longnow
    ├── podcasts.tsv
    └── salt_logo_small.jpg -> ../.git/annex/objects/fZ/wg/✂/MD5E-s76402--87da732f✂MD5.jpg
4 directories, 9 files
There are a number of interesting things, and your room mate is the first to notice them:
“Hey, can you explain some things to me?”, he asks. “This directory here, “longnow”, why is it empty?” True, the subdataset has a directory name but apart from this, the longnow directory appears empty.
“Also, why do the PDFs in books/ and the .jpg files appear so weird? They have this cryptic path right next to them, and look, if I try to open one of them, it fails! Did something go wrong when we installed the dataset?” he worries. Indeed, at first sight the PDFs and pictures appear just as they did in the original dataset: They are symlinks pointing to some location in the object tree. To reassure your room mate that everything is fine, you quickly explain to him the concept of a symlink and the object tree of git-annex.
“But why does the PDF not open when I try to open it?” he repeats.
True, these files cannot be opened. This mimics our experience when installing the longnow subdataset: Right after installation, the .mp3 files also could not be opened, because their file content was not yet retrieved. You begin to explain to your room mate how DataLad retrieves only minimal metadata about which files actually exist in a dataset upon a datalad clone. “It’s really handy”, you tell him. “This way you can decide which book you want to read, and then retrieve what you need. Everything that is annexed is retrieved on demand. Note though that the contents of the text files are present, and the files can be opened – this is because these files are stored in Git. So you already have my notes, and you can decide for yourself whether you want to get the books.”
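The difference between a file whose content is present and an annexed file whose content has not been retrieved can be checked from the shell: the latter is a symlink whose target does not exist. The following is a minimal sketch that mimics this situation in a throwaway directory rather than in the actual clone; the shortened object path and the helper function content_status are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch: mimic an annexed file whose content has not been retrieved yet.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/books"

# A symlink to a target that does not exist (hypothetical, shortened key).
ln -s ../.git/annex/objects/XX/YY/KEY/KEY "$demo_dir/books/progit.pdf"
# A regular text file, as it would appear when stored in Git.
echo "my notes" > "$demo_dir/notes.txt"

# -L is true for symlinks; -e follows the link and is true only if
# the target exists. A symlink without a target has no content yet.
content_status() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "content missing"
    else
        echo "content present"
    fi
}

pdf_status=$(content_status "$demo_dir/books/progit.pdf")
notes_status=$(content_status "$demo_dir/notes.txt")
echo "books/progit.pdf: $pdf_status"
echo "notes.txt: $notes_status"
rm -rf "$demo_dir"
```

In the real clone, the same two tests applied to a file under books/ would tell you whether its content still needs to be retrieved.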
To demonstrate this, you decide to examine the PDFs further. “Try to get one of the books”, you instruct your room mate:
$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from origin...]
“Opening this file will work, because the content was retrieved from the original dataset”, you explain, proud that this worked just as you thought it would.
Let’s now turn to the fact that the subdataset longnow contains neither file content nor file metadata information to explore the contents of the dataset: there are no subdirectories or any files under recordings/longnow/. This is behavior that you have not observed until now.
To fix this and obtain file availability metadata, you have to run a somewhat unexpected command:
$ datalad get -n recordings/longnow
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): /home/me/dl-101/mock_user/DataLad-101/recordings/longnow (dataset) [Installed subdataset in order to get /home/me/dl-101/mock_user/DataLad-101/recordings/longnow]
Before we look further into datalad get and the -n/--no-data option, let’s first see what has changed after running the above command (excerpt):
$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/✂/MD5E-s1198170--0ab2c121✂MD5.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/xF/42/✂/MD5E-s4161086--c832fc13✂MD5.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/✂/MD5E-s12465653--05cd7ed5✂MD5.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/✂/MD5E-s2120211--06d1efcb✂MD5.pdf
├── code
│   └── list_titles.sh
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg -> ../.git/annex/objects/pw/Mf/✂/MD5E-s70348--4b2ec0db✂MD5.jpg
    ├── longnow
    │   ├── Long_Now__Conversations_at_The_Interval
    │   │   ├── 2017_06_09__How_Digital_Memory_Is_Shaping_Our_Future__Abby_Smith_Rumsey.mp3 -> ../.git/annex/objects/8j/kQ/✂/MD5E-s66305442--c723d53d✂MD5.mp3
    │   │   ├── 2017_06_09__Pace_Layers_Thinking__Stewart_Brand__Paul_Saffo.mp3 -> ../.git/annex/objects/Qk/9M/✂/MD5E-s112801659--00a42a1a✂MD5.mp3
    │   │   ├── 2017_06_09__Proof__The_Science_of_Booze__Adam_Rogers.mp3 -> ../.git/annex/objects/FP/96/✂/MD5E-s60091960--6e48eceb✂MD5.mp3
    │   │   ├── 2017_06_09__Seveneves_at_The_Interval__Neal_Stephenson.mp3 -> ../.git/annex/objects/Wf/5Q/✂/MD5E-s66431897--aff90c83✂MD5.mp3
    │   │   ├── 2017_06_09__Talking_with_Robots_about_Architecture__Jeffrey_McGrew.mp3 -> ../.git/annex/objects/Fj/9V/✂/MD5E-s61491081--c4e88ea0✂MD5.mp3
    │   │   ├── 2017_06_09__The_Red_Planet_for_Real__Andy_Weir.mp3 -> ../.git/annex/objects/xq/Q3/✂/MD5E-s136924472--0d107210✂MD5.mp3
Interesting! The file metadata information is now present, and we can explore the file hierarchy. The file content, however, is not present yet.
What has happened here?
When DataLad installs a dataset, it will by default only obtain the superdataset, and not any subdatasets. The superdataset contains the information that a subdataset exists though – the subdataset is registered in the superdataset. This is why the subdataset name exists as a directory. A subsequent datalad get -n path/to/longnow will install the registered subdataset again, just as we did in the example above.
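Because DataLad registers subdatasets as Git submodules, the registration is recorded in the superdataset’s .gitmodules file, while an uninstalled subdataset is just an empty directory. The following is a minimal sketch of how one could spot “registered but not installed” subdatasets; it works on a mock directory instead of the real clone, the URL is a hypothetical placeholder, and the sed-based parsing is a simplification that is adequate only for this mock file.

```shell
#!/bin/sh
# Sketch: find subdatasets that are registered but not yet installed.
# Everything happens in a throwaway mock directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/recordings/longnow"   # empty, as after a fresh clone

# Minimal mock of the superdataset's .gitmodules (URL is a placeholder).
printf '[submodule "recordings/longnow"]\n\tpath = recordings/longnow\n\turl = https://example.com/longnow.git\n' \
    > "$demo_dir/.gitmodules"

# Extract every registered path (simplified parsing for this mock file).
registered=$(sed -n 's/^[[:space:]]*path = //p' "$demo_dir/.gitmodules")

not_installed=""
for sub in $registered; do
    # An installed subdataset directory contains at least a .git entry,
    # so an empty directory means the subdataset is not installed.
    if [ -z "$(ls -A "$demo_dir/$sub")" ]; then
        not_installed="${not_installed:+$not_installed }$sub"
        echo "$sub: registered, not installed"
    else
        echo "$sub: installed"
    fi
done
rm -rf "$demo_dir"
```

In an actual dataset, the datalad subdatasets command reports the registered subdatasets and is the more robust way to obtain this information.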
But what about the -n option for datalad get? Previously, we used datalad get to get file content. However, datalad get operates on more than just the level of files or directories. Instead, it can also operate on the level of datasets. Regardless of whether it is a single file (such as books/TLCL.pdf) or a registered subdataset (such as recordings/longnow), datalad get will operate on it to 1) install it – if it is a not yet installed subdataset – and 2) retrieve the contents of any files. That makes it very easy to get your file content, regardless of how your dataset may be structured – it is always the same command, and DataLad blurs the boundaries between superdatasets and subdatasets.
In the above example, we called datalad get with the option -n/--no-data. This option prevents datalad get from obtaining the data of individual files or directories, thus limiting its scope to the level of datasets, as only a datalad clone is performed. Without this option, the command would have retrieved all of the subdataset’s contents right away. But with -n/--no-data, it only installed the subdataset to retrieve the metadata about file availability.
To install a subdataset recursively, that is, together with all of the subdatasets inside it, one can give the -r/--recursive option to datalad get:
$ datalad get -n -r <subds>
This would install the subds subdataset and all potential further subdatasets inside of it, and the metadata about file hierarchies would be available right away for every subdataset inside of subds. If you had several subdatasets and did not provide a path to a single dataset, but, say, the current directory (. as in datalad get -n -r .), it would clone all registered subdatasets recursively.
So why is a recursive get not the default behavior? In Dataset nesting we learned that datasets can be nested arbitrarily deep. Upon getting the metadata of one dataset you might not want to also install a few dozen levels of nested subdatasets right away.
However, there is a middle way[1]: The --recursion-limit option lets you specify how many levels of subdatasets should be installed together with the first subdataset:
$ datalad get -n -r --recursion-limit 1 <subds>
To summarize what you learned in this section, write a note on how to install a dataset using a path as a source on a common file system.
Write this note in “your own” (the original) DataLad-101 dataset, though!
$ # navigate back into the original dataset
$ cd ../../DataLad-101
$ # write the note
$ cat << EOT >> notes.txt
A source to install a dataset from can also be a path, for example as
in "datalad clone ../DataLad-101".
Just as in creating datasets, you can add a description on the
location of the new dataset clone with the -D/--description option.
Note that subdatasets will not be installed by default, but are only
registered in the superdataset -- you will have to do a
"datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
availability meta data. The -n/--no-data option prevents file
contents from also being downloaded.
Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer way to
proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"
EOT
Save this note.
$ datalad save -m "add note about cloning from paths and recursive datalad get"
add(ok): notes.txt (file)
save(ok): . (dataset)
Get a clone
A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.
Footnotes