1.4. Install datasets

So far, we have created a DataLad-101 course dataset. We saved some additional readings into the dataset, and have carefully made and saved notes on the DataLad commands we discovered. Up to this point, we therefore know the typical, local workflow to create and populate a dataset from scratch.

But we’ve been told that with DataLad we could very easily get vast amounts of data to our computer. Rumor has it that this takes only a single command in the terminal! Therefore, everyone excitedly awaits today’s lecture topic: Installing datasets.

“With DataLad, users can install clones of existing DataLad datasets from paths, URLs, or open-data collections”, our lecturer begins. “This makes accessing data fast and easy. A dataset that others could install can be created by anyone, without a need for additional software. Your own datasets, for example, can be installed by others, should you want that. Therefore, not only accessing data becomes fast and easy, but also sharing it.” “That’s so cool!”, you think. “Exam preparation will be a piece of cake if all of us can share our mid-term and final projects easily!” “But today, let’s only focus on how to install a dataset”, she continues. “Damn it! Can we not have longer lectures?”, you think, and set alarms for all of the upcoming lecture dates in your calendar. There is so much exciting stuff to come, you cannot miss a single one.

“Psst!” a student from the row behind reaches over. “There are a bunch of audio recordings of a really cool podcast, and they have been shared in the form of a DataLad dataset! Shall we try whether we can install that?”

“Perfect! What a great way to learn how to install a dataset. Doing it now instead of looking at slides for hours is my preferred type of learning anyway”, you think as you fire up your terminal and navigate into your DataLad-101 dataset.

In this demonstration, we are using one of the many openly available datasets that DataLad provides in a public registry that anyone can access. One of these datasets is a collection of audio recordings of a great podcast, the longnow seminar series[2]: recordings about long-term thinking. While the DataLad-101 course is not a long-term thinking seminar, those recordings are nevertheless a good addition to the large stash of yet-to-read textbooks we have piled up. Let’s get this dataset into our existing DataLad-101 dataset.

To keep the DataLad-101 dataset neat and organized, we first create a new directory, called recordings.

$ # we are in the root of DataLad-101
$ mkdir recordings

The command that can be used to obtain a dataset is datalad clone, but we often refer to the process of cloning a dataset as installing it. Let’s install the longnow podcasts in this new directory.

The datalad clone command takes the location of an existing dataset to clone. This source can be a URL, a path to a local directory, or an SSH server[1]. The dataset to be installed lives on GitHub, at https://github.com/datalad-datasets/longnow-podcasts.git, and we can give its GitHub URL as the first positional argument. Optionally, the command also takes a second positional argument, the destination path, i.e., where we want to install the dataset to. In this case, it is recordings/longnow. Because we are installing a dataset (the podcasts) into an existing dataset (the DataLad-101 dataset), we also supply a -d/--dataset flag to the command. This specifies the dataset to perform the operation on, and allows us to install the podcasts as a subdataset of DataLad-101. Because we are in the root of the DataLad-101 dataset, the pointer to the dataset is a . (which is the Unix way of saying “current directory”).

As before with long commands, we break the line with a \. You can copy the command into your terminal as it is presented here, but in your own work you can write such commands on a single line.

$ datalad clone --dataset . \
 https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): recordings/longnow (dataset)
add(ok): recordings/longnow (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

This command copied the repository found at the URL https://github.com/datalad-datasets/longnow-podcasts into the existing DataLad-101 dataset, into the directory recordings/longnow. The optional destination is helpful: Had we not specified recordings/longnow as a destination for the dataset clone, the command would have installed the dataset into the root of the DataLad-101 dataset, under the name of the remote repository, “longnow-podcasts”, instead of longnow. But the coolest feature of datalad clone is not yet visible: The command also recorded where this dataset came from, thus capturing its origin as provenance. Even though this is not obvious at this point in time, later chapters in this handbook will demonstrate how useful this information can be.
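
Because this provenance record is plain Git metadata, you can already peek at it yourself: the superdataset stores the clone’s URL in a .gitmodules file, and inside the clone, Git keeps the source as a remote named origin. A quick, optional peek (output omitted here):

$ # from the root of DataLad-101:
$ cat .gitmodules                        # the registered subdataset and its URL
$ git -C recordings/longnow remote -v    # the clone's origin remote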

Clone internals

The datalad clone command uses git clone under the hood. A dataset that is installed from an existing source, e.g., a path or URL, is the DataLad equivalent of a clone in Git.

Do I have to install from the root of a dataset?

No. Instead of from the root of the DataLad-101 dataset, you could also have installed the dataset from within the recordings or books directory. When installing datasets into existing datasets, however, you need to adjust the path given with the -d/--dataset option: -d needs to specify the path to the root of the dataset. This is important to keep in mind whenever you do not execute the datalad clone command from the root of this dataset. Luckily, there is a shortcut: -d^ will always point to the root of the topmost dataset. For example, if you navigate into recordings, the command would be:

$ datalad clone -d^ https://github.com/datalad-datasets/longnow-podcasts.git longnow

What if I do not install into an existing dataset?

If you do not install into an existing dataset, you only need to omit the -d/--dataset option. You can try:

$ datalad clone https://github.com/datalad-datasets/longnow-podcasts.git

anywhere outside of your DataLad-101 dataset to install the podcast dataset into a new directory called longnow-podcasts. You could even do this inside of an existing dataset. However, whenever you install datasets into other datasets, the -d/--dataset option is necessary to not only install the dataset, but also to register it automatically in the higher-level superdataset. The upcoming section will elaborate on this.
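
Back in our DataLad-101 dataset, we can verify that the podcasts were indeed registered as a subdataset: the datalad subdatasets command lists all registered subdatasets (the exact report format may vary slightly between DataLad versions):

$ # from the root of DataLad-101:
$ datalad subdatasets
subdataset(ok): recordings/longnow (dataset)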

Here is the repository structure:

Using tree on Windows

The Windows version of tree requires different parametrization: if you are on Windows, please run tree instead of tree -d.

$ tree -d   # we limit the output to directories
.
├── books
└── recordings
    └── longnow
        ├── Long_Now__Conversations_at_The_Interval
        └── Long_Now__Seminars_About_Long_term_Thinking

5 directories

We can see that recordings has one subdirectory, our newly installed longnow dataset with two subdirectories. If we navigate into one of them and list its content, we’ll see many .mp3 files (here is an excerpt).

$ cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
$ ls
2003_11_15__Brian_Eno__The_Long_Now.mp3
2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3
2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
2004_02_14__James_Dewar__Long_term_Policy_Analysis.mp3
2004_03_13__Rusty_Schweickart__The_Asteroid_Threat_Over_the_Next_100_000_Years.mp3
2004_04_10__Daniel_Janzen__Third_World_Conservation__It_s_ALL_Gardening.mp3
2004_05_15__David_Rumsey__Mapping_Time.mp3
2004_06_12__Bruce_Sterling__The_Singularity__Your_Future_as_a_Black_Hole.mp3
2004_07_10__Jill_Tarter__The_Search_for_Extra_terrestrial_Intelligence__Necessarily_a_Long_term_Strategy.mp3
2004_08_14__Phillip_Longman__The_Depopulation_Problem.mp3
2004_09_11__Danny_Hillis__Progress_on_the_10_000_year_Clock.mp3
2004_10_16__Paul_Hawken__The_Long_Green.mp3
2004_11_13__Michael_West__The_Prospects_of_Human_Life_Extension.mp3

1.4.1. Dataset content identity and availability information

Surprised, you turn to your fellow student and wonder how the dataset could be installed so fast. Should a download of that many .mp3 files not take much more time?

Here you can see another important feature of DataLad datasets and the datalad clone command: Upon installation of a DataLad dataset, DataLad retrieves only small files (for example, text files or markdown files) and (small) metadata about the dataset. It does not, however, download any large files (yet). The metadata exposes the dataset’s file hierarchy for exploration (note how you are able to list the dataset contents with ls), and downloading only this metadata speeds up the installation of a DataLad dataset of many TB in size to a few seconds. Right now, just after installing, the dataset is small in size:

$ cd ../      # in longnow/
$ du -sh      # Unix command to show size of contents
3.7M	.

This is tiny indeed!

If you executed the previous ls command in your own terminal, you might have seen the .mp3 files highlighted in a different color than usual. On your computer, try to open one of the .mp3 files. You will notice that you cannot open any of the audio files. This is not your fault: None of these files exist on your computer yet.

Wait, what?

This sounds strange, but it has many advantages. Apart from a fast installation, it allows you to retrieve precisely the content you need, instead of all the contents of a dataset. Thus, even if you install a dataset that is many TB in size, it takes up only a few MB of space after the install, and you can retrieve only those components of the dataset that you need.
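
If you wonder what such a not-yet-retrieved file looks like on disk: on most Unix systems, annexed files are symbolic links into the dataset’s internal annex, and a link’s target only comes into existence once the file content is retrieved. A quick, optional peek (assuming a standard, non-Windows setup; the section Data integrity explains the details):

$ # inside of longnow/
$ ls -l Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3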

Let’s see how large the dataset would be in total if all of the files were present. For this, we supply an additional option to datalad status. Make sure to be (somewhere) inside of the longnow dataset to execute the following command:

$ datalad status --annex
236 annex'd files (15.4 GB recorded total size)
nothing to save, working tree clean

Woah! More than 200 files, totaling more than 15 GB? You begin to appreciate that DataLad did not download all of this data right away! That would have taken hours given the crappy internet connection in the lecture hall, and you are not even sure whether your hard drive has much space left…

But you are nevertheless curious about how to actually listen to one of these .mp3s now. So how does one actually “get” the files?

The command to retrieve file content is datalad get. You can specify one or more individual files, or get the content of the entire dataset by running datalad get . in the root directory of the dataset (with . denoting “current directory”).

First, we get one of the recordings in the dataset – take any one of your choice (here, it’s the first).

$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]

Try to open it – it will now work.
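
In case you wonder where DataLad retrieved the file content from: git-annex, which DataLad uses under the hood (see the “Get internals” note below), keeps track of all known content locations, and you can query them. An optional peek (output omitted):

$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3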

If you wanted to get the rest of the missing data, instead of specifying all files individually, you could use . to refer to all of the dataset, like this:

$ datalad get .

However, with a total size of more than 15 GB, this might take a while, so do not do that now. If you did execute the command above, interrupt it by pressing CTRL + C – do not worry, this will not break anything.
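
For future reference: should you ever want the complete dataset, the --jobs option of datalad get can parallelize the downloads. A sketch only – do not run it now, and note that the value 4 is an arbitrary choice:

$ datalad get --jobs 4 .   # retrieve everything, with 4 downloads in parallel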

Isn’t that easy? Let’s see how much content is now present locally. For this, datalad status --annex all has a nice summary:

$ datalad status --annex all
236 annex'd files (35.7 MB/15.4 GB present/total size)
nothing to save, working tree clean

This shows you how much of the total content is present locally. With one file, it is only a fraction of the total size.

Let’s get a few more recordings, just because it was so mesmerizing to watch DataLad’s fancy progress bars.

$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3 (file) [from web...]
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 (file) [from web...]
action summary:
  get (notneeded: 1, ok: 2)

Note that any data that is already retrieved (the first file) is not downloaded again. DataLad summarizes the outcome of the execution of get in the end and informs that the download of one file was notneeded and the retrieval of the other files was ok.
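
By the way, you do not need to enumerate files one by one: datalad get also accepts directories and retrieves everything underneath them. A sketch (not run here, as this directory holds several GB of recordings):

$ datalad get Long_Now__Conversations_at_The_Interval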

Get internals

datalad get uses git annex get under the hood.

1.4.2. Keep whatever you like

“Oh shit, oh shit, oh shit…” you hear from right behind you. Your fellow student apparently downloaded the full dataset accidentally. “Is there a way to get rid of file contents in a dataset, too?”, they ask. “Yes”, the lecturer responds, “you can remove file contents by using datalad drop. This is really helpful to save disk space for data you can easily reobtain, for example”.

The datalad drop command removes file contents completely from your dataset. You should only use this command to remove contents that you can datalad get again, or generate again (for example, with the next chapter’s datalad run command), or that you really do not need anymore. By default, drop checks that the content is still available from at least one other known location (such as the web) before removing it locally, so you do not accidentally delete the last copy.

Let’s remove the content of one of the files that we have downloaded, and check what this does to the total size of the dataset. Here is the current amount of retrieved data in this dataset:

$ datalad status --annex all
236 annex'd files (135.1 MB/15.4 GB present/total size)
nothing to save, working tree clean

We drop a single recording’s content that we previously downloaded with datalad get:

$ datalad drop Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
drop(ok): Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3 (file)

… and check the size of the dataset again:

$ datalad status --annex all
236 annex'd files (93.5 MB/15.4 GB present/total size)
nothing to save, working tree clean

Dropping the file content of one mp3 file saved roughly 40 MB of disk space. Whenever you need the recording again, it is easy to re-retrieve it:

$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3 (file) [from web...]

Reobtained!

This was only a quick digression into datalad drop. The main principles of this command will become clear after chapter Under the hood: git-annex, and its precise use is shown in the paragraph on removing file contents. At this point, however, you already know that datasets allow you to datalad drop file contents flexibly. If you want to, you could have more podcasts (or other data) on your computer than you have disk space available by using DataLad datasets – and that really is a cool feature to have.

1.4.3. Dataset archeology

You have now experienced how easy it is to (re)obtain shared data with DataLad. But DataLad datasets hold more than data: When you share or install a DataLad dataset, all copies also include the dataset’s history.

For example, we can find out who created the dataset in the first place (the output shows an excerpt of git log --reverse, which displays the history from first to most recent commit):

$ git log --reverse
commit 8df130bb✂SHA1
Author: Michael Hanke <michael.hanke@gmail.com>
Date:   Mon Jul 16 16:08:23 2018 +0200

    [DATALAD] Set default backend for all files to be MD5E

commit 3d0dc8f5✂SHA1
Author: Michael Hanke <michael.hanke@gmail.com>
Date:   Mon Jul 16 16:08:24 2018 +0200

    [DATALAD] new dataset

But that’s not all. The seminar series is ongoing, and more recordings can get added to the original repository shared on GitHub. Because an installed dataset knows the dataset it was installed from, your local dataset clone can be updated from its origin, and thus receive the new recordings, should there be any. Later in this handbook, we will see examples of this.
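
Updating is covered in detail later in the handbook, but as a preview, it is a single command as well. A sketch, run anywhere inside the installed dataset (depending on your DataLad version, the option may be spelled --how merge instead of --merge):

$ datalad update --merge   # fetch new changes from origin and apply them locally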

Now you can not only create datasets and work with them locally, you can also consume existing datasets by installing them. Because that’s cool, and because you will use this command frequently, make a note of it in your notes.txt, and datalad save the modification.

$ # in the root of DataLad-101:
$ cd ../../
$ cat << EOT >> notes.txt
The command 'datalad clone URL/PATH [PATH]' installs a dataset from
e.g., a URL or a path. If you install a dataset into an existing
dataset (as a subdataset), remember to specify the root of the
superdataset with the '-d' option.

EOT
$ datalad save -m "Add note on datalad clone"
add(ok): notes.txt (file)
save(ok): . (dataset)

Empty files can be confusing

Listing files directly after the installation of a dataset will work if done in a terminal with ls. However, certain file managers (such as OSX’s Finder[3]) may fail to display files that are not yet present locally (i.e., before a datalad get was run). Therefore, be mindful when exploring a dataset hierarchy with a file manager – it might not show you the available but not yet retrieved files. Consider browsing datasets with the DataLad Gooey to be on the safe side. More about why this is the case will be explained in the section Data integrity.

Footnotes