1.1. Create a dataset
We are about to start the educational course DataLad-101.
In order to follow along and organize course content, let us create
a directory on our computer to collate the materials, assignments, and
notes in. Since this is
DataLad-101, let’s do it as a DataLad dataset.
You might associate the term “dataset” with a large spreadsheet containing
variables and data.
But for DataLad, a dataset is the core data type:
As noted in A brief overview of DataLad, a dataset is a collection of files
in folders, and a file is the smallest unit any dataset can contain.
Although this is a very simple concept, datasets come with many useful features.
Because experiencing is more insightful than just reading, we will explore the
concepts of DataLad datasets together by creating one.
Find a nice place on your computer’s file system to put a dataset for
DataLad-101, and create a fresh, empty dataset with the
datalad create (manual) command.
Note the command structure of
datalad create (optional bits are enclosed in [ ]):
datalad create [--description "..."] [-c <config options>] PATH
What is the description option of ‘datalad create’?
The --description flag allows you to provide a short description of
the location of your dataset, for example with
$ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
If you want, use the above command instead to provide a description. Its use will not be immediately clear now, but the chapter Collaboration shows where this description ends up and how it may be useful.
$ datalad create -c text2git DataLad-101
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [VIRTUALENV/bin/python /home/a...]
create(ok): /home/me/dl-101/DataLad-101 (dataset)
This will create a dataset called
DataLad-101 in the directory you are currently
in. For now, disregard
-c text2git. It applies a configuration template, but there
will be other parts of this book to explain this in detail.
Once created, a DataLad dataset looks like any other directory on your file system. Currently, it seems empty.
$ cd DataLad-101
$ ls    # ls does not show any output, because the dataset is empty.
However, all files and directories you store within the DataLad dataset can be tracked (should you want them to be tracked). Tracking in this context means that edits done to a file are automatically associated with information about the change, the author of the edit, and the time of this change. This is already informative on its own: the provenance captured with this can, for example, be used to learn about a file’s lineage, and can establish trust in it. But what is especially helpful is that previous states of files or directories can be restored. Remember the last time you accidentally deleted content in a file, but only realized after you saved it? With DataLad, no mistakes are forever. We will see many examples of this later in the book, and such information is stored in what we will refer to as the history of a dataset.
This history is almost as small as it can be at the current state, but let’s take
a look at it. For looking at the history, the code examples will use
git log (manual),
a built-in Git command that works right in your terminal. Your log
might be opened in a terminal pager
that lets you scroll up and down with your arrow keys, but not enter any more commands.
If this happens, you can get out of
git log by pressing q.
$ git log
commit e0ff3a73✂SHA1
Author: Elena Piscopia <email@example.com>
Date:   Tue Jun 18 16:13:00 2019 +0000

    Instruct annex to add text files to Git

commit 4ce681d6✂SHA1
Author: Elena Piscopia <firstname.lastname@example.org>
Date:   Tue Jun 18 16:13:00 2019 +0000

    [DATALAD] new dataset
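If the pager gets in the way, Git can also print the log directly to the terminal. The following is a minimal sketch using standard Git options (--no-pager, --oneline, -n). It builds a scratch repository in a temporary directory so that it can be run anywhere; the variable name scratch, the committer identity, and the commit messages are made up for illustration. Inside your own dataset, only the last two commands are relevant.

```shell
# Build a throwaway Git repository (illustrative assumption; this is not a DataLad dataset)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Demo User"            # hypothetical identity, local to this repository
git config user.email "demo@example.com"
git commit -q --allow-empty -m "first change"
git commit -q --allow-empty -m "second change"

# Print the full log straight to the terminal, without opening a pager
git --no-pager log

# Print a compact one-line-per-commit view of the two most recent commits
git --no-pager log --oneline -n 2
```

Because no pager is opened, there is nothing to quit afterwards and the prompt returns immediately.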
Your Git log may be more extensive - use ‘git log main’ instead!
The output of
git log shown in the handbook and the output you will see in your own datasets when executing the same commands may not always match – many times you might see commits about a “git-annex adjusted branch” in your history.
This is expected, and if you want to read up more about this, please progress on to chapter Under the hood: git-annex and afterwards take a look at this part of git-annex documentation.
In order to get a similar experience in your dataset, please add the name of your default branch (it will likely have the name
main or master) to every
git log command.
This should display the same output that the handbook displays.
The reason behind this is that datasets use a special branch to be functional on Windows.
This branch’s history differs from the history that would be in the default branch.
With this workaround, you will be able to display the dataset history from the same branch that the handbook and all other operating systems display.
Thus, whenever a handbook code snippet contains a line that starts with
git log, copy it and append the term
main or master, whichever is appropriate.
If you are eager to help to improve the handbook, you could do us a favor by reporting any places with mismatches between Git logs on Windows and in the handbook. Get in touch!
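Whether main or master is the appropriate name can be checked from the command line instead of guessed. The sketch below is one way to do this with plain Git, under the assumption that the default branch carries one of these two names. The scratch repository, the variable names, and the branch setup are made up for illustration; git show-ref --verify is used only to test whether a branch exists.

```shell
# Scratch repository standing in for a dataset (illustrative assumption)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main" for this demo
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "new dataset"

# Pick whichever of the two usual default branch names exists locally
default_branch=""
for candidate in main master; do
    if git show-ref --verify --quiet "refs/heads/$candidate"; then
        default_branch=$candidate
        break
    fi
done
echo "Default branch: $default_branch"

# Show the history of that branch, regardless of which branch is checked out
git --no-pager log "$default_branch"
```

In a real dataset you would skip the setup lines and run only the loop and the final git log command.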
The git log output of the dataset, shown earlier, contains information about the author and about
the time, as well as a commit message that summarizes the
performed action concisely. In this case, both commit messages were written by
DataLad itself. The most recent change is on the top. The first commit
written to the history therefore states that a new dataset was created,
and the second commit is related to the
-c text2git option (which
uses a configuration template to instruct DataLad to store text files
in Git, but more on this later).
While these commits were produced and described by DataLad,
in most other cases, you will have to create the commit and
an informative commit message yourself.
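The three pieces of information discussed here (author, time, and commit message) can also be printed in a compact form with git log --format. This is a sketch using Git's standard format placeholders (%h for the abbreviated hash, %an for the author name, %ad for the author date, %s for the first line of the commit message). The scratch repository and the identity are made up so that the example runs on its own, and the two commit messages are copied from the log above only to mimic it.

```shell
# Scratch repository with two empty commits (illustrative assumption, not created by DataLad)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "[DATALAD] new dataset"
git commit -q --allow-empty -m "Instruct annex to add text files to Git"

# One line per commit: hash | author | date | commit message (most recent commit first)
git --no-pager log --format='%h | %an | %ad | %s'

# Extract only the message of the most recent commit
newest_subject=$(git --no-pager log -n 1 --format='%s')
echo "Most recent commit message: $newest_subject"
```

The most recent commit is listed first, which matches the ordering described above.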
datalad create uses
git init (manual) and
git annex init (manual). Therefore,
the DataLad dataset is a Git repository.
Large file content in the
dataset is tracked with git-annex. An ls -a
reveals that Git has secretly done its work:
$ ls -a # show also hidden files
.  ..  .datalad  .git  .gitattributes
For non-Git-Users: these hidden dot-directories and dot-files are necessary for all Git magic to work. Please do not tamper with them, and, importantly, do not delete them.
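To confirm that a directory is a Git repository without touching the hidden files, you can ask Git directly. A minimal sketch with plain Git follows. It uses a scratch repository created with git init, so only .git will appear there; the .datalad directory and the .gitattributes file shown in the listing above will not. The variable names are made up for illustration.

```shell
# Scratch repository created with plain Git (illustrative assumption)
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# The hidden .git directory is what makes this directory a repository
ls -a

# Prints "true" when run anywhere inside a Git working tree
inside=$(git rev-parse --is-inside-work-tree)
echo "$inside"
```

Run inside DataLad-101, the last command should likewise report true, since the dataset is a Git repository.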
Congratulations, you just created your first DataLad dataset! Let us now put some content inside.