Create a dataset

We are about to start the educational course DataLad-101. In order to follow along and organize course content, let us create a directory on our computer to collate the materials, assignments, and notes in.

Since this is DataLad-101, let’s do it as a Datalad dataset. You might associate the term “dataset” with a large spreadsheet containing variables and data. But for DataLad, as we learned in the first lecture, a dataset is the core data type.

As noted in A brief overview of DataLad, a dataset is a collection of files in folders, and a file is the smallest unit any dataset can contain. While it at its core this is a very simple concept, datasets come with many useful features. Because experiencing is more insightful than just reading, we will explore the concepts of DataLad datasets together by creating one.

Find a nice place on your computers file system to put a dataset for DataLad-101, and create a fresh, empty dataset with the datalad create command (datalad-create manual). In a bit of time, you will thank yourself for adding a description about the location of your dataset with the optional --description flag. (At the moment, we will not dive into the details of where that description ends up or becomes useful in the future, but in more advanced parts of the book we will see how this gets handy.) Note the command structure of datalad create (optional bits are enclosed in [ ]):

datalad create [--description "..."] [-c <config options>] PATH
$ datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101
[INFO] Creating a new annex repo at /home/me/dl-101/DataLad-101 
[INFO] Running procedure cfg_text2git 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/dl-101/DataLad-101 (dataset)

This will create a dataset called DataLad-101 in the directory you are currently in. For now, disregard -c text2git. It applies a configuration template, but there will be other parts of this book to explain this in detail.

Once created, a DataLad dataset looks like any other directory on your file system. Currently, it seems empty.

$ cd DataLad-101
$ ls    # ls does not show any output, because the dataset is empty.

However, all files and directories you store within the DataLad dataset can be tracked (should you want them to be tracked). Tracking in this context means that edits done to a file are automatically associated with information about the change, the author of the edit, and the time of this change. This is already informative on its own, but what is especially helpful is that previous states of files or directories can be restored. Remember the last time you accidentally deleted content in a file, but only realized after you saved it? With DataLad, no mistakes are forever. We will see many examples of this later in the book, and such information is stored in what we will refer to as the history of a dataset.

This history is almost as small as it can be at the current state, but let’s take a look at it. For looking at the history, the code examples will use git log, a built-in git command1. Your git log might be opened in a terminal pager that lets you scroll up and down with your arrow keys, but not enter any more commands. If this happens, you can get out of git log by pressing q.

$ git log
commit 1a45fd4b9fa4698ea3397d3461599a19cffdd255
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:04:57 2019 +0100

    Instruct annex to add text files to Git

commit e85b9505f2ca1f2a5634a871a8f9f71c4179d32d
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:04:57 2019 +0100

    [DATALAD] new dataset

We can see two commits in the history of the repository. Each of them is identified by a unique 40 character sequence, called a shasum. Highlighted in this output is information about the author and about the time, as well as a commit message that summarizes the performed action concisely. In this case, both commit messages were written by DataLad itself. The most recent change is on the top. The first commit written to the history therefore states that a new dataset was created, and the second commit to the history is related to -c text2git which uses a configuration template to instruct DataLad to store text files in Git (but more on this later). Even though these commits were produced by DataLad, in most other cases, you will have to create the commit and an informative commit message yourself.

Note for Git users

datalad create uses git init and git-annex init. Therefore, the DataLad dataset is a Git repository. Large file content in the dataset in the annex is tracked with Git-annex. An ls -a reveals that Git has secretly done its work:

$ ls -a # show also hidden files
.
..
.datalad
.git
.gitattributes

For non-Git-Users: these hidden dot-directories are necessary for all git magic to work. Please do not temper with them, and, importantly, do not delete them.

Congratulations, you just created your first DataLad dataset! Let us now put some content inside.

Footnotes

1

A nice and easy tool we can recommend as an alternative to git log is tig. Once installed, exchange any git log command you see here with the single word tig.