1.1. Create a dataset

We are about to start the educational course DataLad-101. In order to follow along and organize course content, let us create a directory on our computer to collate the materials, assignments, and notes in.

Since this is DataLad-101, let’s do it as a DataLad dataset. You might associate the term “dataset” with a large spreadsheet containing variables and data. But for DataLad, a dataset is the core data type: As noted in A brief overview of DataLad, a dataset is a collection of files in folders, and a file is the smallest unit any dataset can contain. Although this is a very simple concept, datasets come with many useful features. Because experiencing is more insightful than just reading, we will explore the concepts of DataLad datasets together by creating one.

Find a nice place on your computer’s file system to put a dataset for DataLad-101, and create a fresh, empty dataset with the datalad create command (datalad-create manual).

Note the command structure of datalad create (optional bits are enclosed in [ ]):

datalad create [--description "..."] [-c <config options>] PATH

The optional --description flag allows you to provide a short description of the location of your dataset, for example with

datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101

If you want, use the above command instead of the create command below to provide a description. Its use will not be immediately clear, but the chapter Collaboration will show you where this description ends up and how it may be useful.

Let’s start:

Hey there! If you are using Windows 10 with a native (i.e., not Windows Subsystem for Linux (WSL)-based) installation of DataLad and its underlying tools, and you are not using the custom git-annex installer from http://datasets.datalad.org/datalad/packages/windows/, starting into this narrative will be slightly different.

We’re really sorry about that - as foreshadowed in section Installation and configuration, Windows comes with a range of file system issues, and one of them concerns the very first command.

Instead of running datalad create -c text2git DataLad-101, please remove the configuration -c text2git from the command and run only datalad create DataLad-101:

$ datalad create DataLad-101
[INFO] Creating a new annex repo at C:\Users\mih\DataLad
[INFO] Detected a filesystem without fifo support.
[INFO] Disabling ssh connection caching.
[INFO] Detected a crippled filesystem.
[INFO] Scanning for unlocked files (this may take some time)
[INFO] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO] Switched to branch 'adjusted/master(unlocked)'
create(ok): C:\Users\mih\DataLad (dataset)

This creates the DataLad-101 dataset without the text2git configuration (which is problematic on Windows).

In its place, we will need to create a configuration by hand. For now, please just follow the instructions given here and paste the following lines of text into the (hidden) .gitattributes file in your dataset. The details of and reason for this will become clear in chapter 5, Tuning datasets to your needs.

Here are lines that need to be appended to the existing lines in .gitattributes and will mimic the configuration -c text2git would apply:

*.txt annex.largefiles=nothing
code/** annex.largefiles=nothing

You can achieve this by copy-pasting the following command into your terminal (but you can also add the lines using a text editor of your choice):

$ echo\ >> .gitattributes && echo *.txt annex.largefiles=nothing >> .gitattributes && echo code/** annex.largefiles=nothing >> .gitattributes

Afterwards, these should be the contents of .gitattributes:

$ cat .gitattributes
  * annex.backend=MD5E
  **/.git* annex.largefiles=nothing
  *.txt annex.largefiles=nothing
  code/** annex.largefiles=nothing
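The one-liner above is written for the Windows command prompt. If you happen to work in a POSIX-style shell instead (Git Bash is one example; this is an assumption about your setup, not something the handbook requires), unquoted patterns such as *.txt and code/** may be expanded by the shell before they reach the file. Below is a minimal sketch of the same append step with quoting. It writes to a stand-in .gitattributes in a scratch directory, so it does not touch your dataset; the variable names workdir and attrs are made up for this illustration.

```shell
# Sketch only: operates on a scratch file, not on your real dataset.
workdir=$(mktemp -d)
attrs="$workdir/.gitattributes"

# Stand-in for the two lines that are already in the file (copied from the listing above).
printf '%s\n' '* annex.backend=MD5E' '**/.git* annex.largefiles=nothing' > "$attrs"

# Append the two rules; single quotes keep the shell from expanding the patterns.
printf '%s\n' '*.txt annex.largefiles=nothing' 'code/** annex.largefiles=nothing' >> "$attrs"

# Confirm that each rule is present as a complete line.
grep -qxF '*.txt annex.largefiles=nothing' "$attrs" && echo "txt rule present"
grep -qxF 'code/** annex.largefiles=nothing' "$attrs" && echo "code rule present"
```

In a real dataset you would run only the second printf line, with .gitattributes in place of "$attrs".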

Lastly, run this piece of code to save your changes:

$ datalad save -m "Windows-workaround: custom config to place text into Git" .gitattributes

This should set you up with everything you need for most of the Basics. Other parts of the handbook that are influenced by this workaround will be marked with a similar “Windows Workaround” note, but for the majority of upcoming content, you should be good.

Note: Please do not execute the datalad create command below. Instead, start coding along with the cd command that follows it.

$ datalad create -c text2git DataLad-101
[INFO] Creating a new annex repo at /home/me/dl-101/DataLad-101 
[INFO] Scanning for unlocked files (this may take some time)
[INFO] Running procedure cfg_text2git 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/dl-101/DataLad-101 (dataset)

This will create a dataset called DataLad-101 in the directory you are currently in. For now, disregard -c text2git. It applies a configuration template, but there will be other parts of this book to explain this in detail.

Once created, a DataLad dataset looks like any other directory on your file system. Currently, it seems empty.

$ cd DataLad-101
$ ls    # ls does not show any output, because the dataset is empty.

However, all files and directories you store within the DataLad dataset can be tracked (should you want them to be tracked). Tracking in this context means that edits done to a file are automatically associated with information about the change, the author of the edit, and the time of this change. This is already informative on its own: the provenance captured with this can, for example, be used to learn about a file’s lineage, and can establish trust in it. But what is especially helpful is that previous states of files or directories can be restored. Remember the last time you accidentally deleted content in a file, but only realized after you saved it? With DataLad, no mistakes are forever. We will see many examples of this later in the book, and such information is stored in what we will refer to as the history of a dataset.

This history is almost as small as it can be at the current state, but let’s take a look at it. For looking at the history, the code examples will use git log, a built-in Git command [1] that works right in your terminal. Your log might be opened in a terminal pager that lets you scroll up and down with your arrow keys, but not enter any more commands. If this happens, you can get out of git log by pressing q.

$ git log
commit dcc11c1c1fc330949c2ed6ef951bed019047abff
Author: Elena Piscopia <elena@example.net>
Date:   Fri Jan 29 08:37:36 2021 +0100

    Instruct annex to add text files to Git

commit a82a9ceabe29efb67ad4ccd0c224addfff277d3f
Author: Elena Piscopia <elena@example.net>
Date:   Fri Jan 29 08:37:35 2021 +0100

    [DATALAD] new dataset

We can see two commits in the history of the repository. Each of them is identified by a unique 40-character sequence, called a shasum.
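To convince yourself that a commit identifier really is 40 characters long, you can ask Git for the full identifier and count its characters. The sketch below does this in a throwaway repository created with plain git init, so it does not depend on your dataset. The author name, email, and commit message are placeholders, and the count assumes Git's default SHA-1 object format.

```shell
# Sketch only: a scratch Git repository stands in for a dataset.
repo=$(mktemp -d)
git -C "$repo" init -q

# Placeholder identity, stored only in the scratch repository.
git -C "$repo" config user.name "Example Author"
git -C "$repo" config user.email "author@example.net"

# One empty commit, so that there is something to identify.
git -C "$repo" commit -q --allow-empty -m "example change"

# Full identifier of the most recent commit, and its length in characters
# (40 under the assumed default SHA-1 object format).
hash=$(git -C "$repo" rev-parse HEAD)
echo "$hash"
echo "${#hash}"
```

Inside DataLad-101, running git rev-parse HEAD on its own prints the identifier of the topmost commit shown by git log.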

The output of git log shown in the handbook and the output you will see in your own datasets when executing the same commands may not always match: many times you might see commits about a “git-annex adjusted branch” in your history. This is expected, and if you want to read up more about this, please progress on to chapter 3 and afterwards take a look at the git-annex documentation on adjusted branches.

In order to get a similar experience in your dataset, please add the term master to every git log command. This should display the same output that the handbook displays. The reason behind this is that datasets use a special branch to be functional on Windows. This branch’s history differs from the history of the default branch, which is in most cases called master. With this workaround, you will be able to display the dataset history from the same branch that the handbook and all other operating systems display. Thus, whenever a handbook code snippet contains a line that starts with git log, copy it and append the term master.
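The effect of appending master can be sketched in a throwaway repository. Note the assumptions: the example uses plain Git only, and an ordinary branch with one extra commit stands in for the adjusted branch that git-annex would create on Windows. The branch content, author identity, and commit messages are made up for the illustration.

```shell
# Sketch only: an ordinary branch stands in for git-annex's adjusted branch.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Example Author"
git -C "$repo" config user.email "author@example.net"

# Make sure the first commit lands on a branch called master,
# whatever the local default branch name is.
git -C "$repo" symbolic-ref HEAD refs/heads/master
git -C "$repo" commit -q --allow-empty -m "commit on master"

# Hypothetical stand-in for the adjusted branch, carrying one extra commit.
git -C "$repo" checkout -q -b "adjusted/master(unlocked)"
git -C "$repo" commit -q --allow-empty -m "extra commit on the adjusted branch"

# Plain git log follows the branch that is checked out: both commits are listed.
git -C "$repo" --no-pager log --format=%s

# With master appended, only the history of master is listed.
git -C "$repo" --no-pager log --format=%s master
```

The second listing omits the commit that exists only on the stand-in branch, which is the reason the workaround brings your output closer to the one printed in the handbook.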

If you are eager to help to improve the handbook, you could do us a favor by reporting any places with mismatches between Git logs on Windows and in the handbook. Get in touch!

This output contains information about the author and about the time, as well as a commit message that summarizes the performed action concisely. In this case, both commit messages were written by DataLad itself. The most recent change is at the top. The first commit written to the history therefore states that a new dataset was created, and the second commit is related to the -c text2git option (which uses a configuration template to instruct DataLad to store text files in Git, but more on this later). Even though these commits were produced by DataLad, in most other cases, you will have to create the commit and an informative commit message yourself.
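If you want to look at only these pieces of information, git log accepts a --format option with placeholders for the author name (%an), the author date (%ad), and the first line of the commit message (%s). The sketch below tries this in a throwaway repository with two placeholder commits; the identity and messages are invented for the illustration and are not what DataLad writes.

```shell
# Sketch only: a scratch Git repository with two placeholder commits.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Example Author"
git -C "$repo" config user.email "author@example.net"
git -C "$repo" commit -q --allow-empty -m "older change"
git -C "$repo" commit -q --allow-empty -m "newer change"

# One line per commit: author name, author date, and commit message subject.
git -C "$repo" --no-pager log --format='%an | %ad | %s'

# The topmost entry of the log is the most recent commit.
newest=$(git -C "$repo" log -1 --format=%s)
echo "$newest"
```

The same --format string works unchanged inside DataLad-101, where it condenses each commit of the dataset history to a single line.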

Note for Git users

datalad create uses git init and git-annex init. Therefore, the DataLad dataset is a Git repository. Large file content in the dataset is tracked with git-annex. An ls -a reveals that Git has secretly done its work:

$ ls -a # show also hidden files
.
..
.datalad
.git
.gitattributes

For non-Git users: these hidden dot-directories and dot-files are necessary for all Git magic to work. Please do not tamper with them and, importantly, do not delete them.

Congratulations, you just created your first DataLad dataset! Let us now put some content inside.

Footnotes

1: A tool we can recommend as an alternative to git log is tig. Once installed, exchange any git log command you see here with the single word tig.