5.1. DIY configurations

Back in section Data safety, you already learned that there are dataset configurations, and that these configurations can be modified, for example, with the -c text2git option. This option applies a configuration template to store text files in Git instead of git-annex, and thereby modifies the DataLad dataset’s default configuration to store every file in git-annex.

The lecture today focuses entirely on the topic of configurations, and aims to equip everyone with the basics to configure their general and dataset specific setup to their needs. This is not only a handy way to tune a dataset to one’s wishes, but also helpful to understand potential differences in command execution and file handling between two users, computers, or datasets.

“First of all, when we talk about configurations, we have to differentiate between different scopes of configuration, and different tools the configuration belongs or applies to”, our lecturer starts. “In DataLad datasets, different tools can have a configuration: Git, git-annex, and DataLad itself. Because these tools are all combined by DataLad to help you manage your data, it is important to understand how the configuration of one software is used by or influences a second tool, or the overall dataset performance.”

“Oh crap, one of these theoretical lectures again” mourns a student from the row behind you. Personally, you’d also be much more excited about any hands-on lecture filled with commands. But the recent lecture about git-annex and the object-tree was surprisingly captivating, so you are actually looking forward to today. “Shht! I want to hear this!”, you shush him with a wink.

“We will start by looking into the very first configuration you did, already before the course started: The global Git configuration.” the lecturer says.

At one point in time, you likely followed instructions such as in Installation and configuration and configured your Git identity with the commands:

$ git config --global --add user.name "Elena Piscopia"
$ git config --global --add user.email elena@example.net

“What the above commands do is very simple: They search for a specific configuration file, and set the variables specified in the command – in this case user name and user email address – to the values provided with the command.” she explains.

“This general procedure, specifying a value for a configuration variable in a configuration file, is how you can configure the different tools to your needs. The configuration, therefore, is really easy. Even if you are only used to ticking boxes in the settings tab of a software tool so far, it’s intuitive to understand how a configuration file in principle works and also how to use it. The only piece of information you will need are the necessary files, or the command that writes to them, and the available options for configuration, that’s it. And what’s really cool is that all tools we’ll be looking at – Git, git-annex, and DataLad – can be configured using the git config (manual) command[1]. Therefore, once you understand the syntax of this command, you already know half of what’s relevant. The other half is understanding what you are doing. Now then, let’s learn how to configure settings, but also understand what we are doing with these configurations.”

“This seems easy enough”, you think. Let’s see what types of configurations there are.

5.1.1. Git config files

The user name and email configuration is a user-specific configuration (called global configuration by Git), and therefore applies to your user account. Wherever on your computer you run a Git, git-annex, or DataLad command, this global configuration will associate the name and email address you supplied in the git config commands above with this action. For example, whenever you datalad save, the information in this file is used for the history entry about commit author and email.

Apart from global Git configurations, there are also system-wide[2] and repository configurations. Each of these configurations resides in its own file. The global configuration is stored in a file called .gitconfig in your home directory. Among your name and email address, this file can store general per-user configurations, such as a default editor[3], or highlighting options.

The repository-specific configurations apply to each individual repository. Their scope is more limited than the global configuration (namely to a single repository), but it can overrule global configurations: The more specific the scope of a configuration file is, the more important it is, and the variables in the more specific configuration will take precedence over variables in less specific configuration files. One could, for example, have vim configured to be the default editor on a global scope, but could overrule this by setting the editor to nano in a given repository. For this reason, the repository-specific configuration does not reside in a file in your home directory, but in .git/config within every Git repository (and thus DataLad dataset).

Thus, there are three different scopes of Git configuration, and each is defined in a config file in a different location. The configurations will determine how Git behaves. In principle, all of these files can configure the same variables differently, but more specific scopes take precedence over broader scopes. Conveniently, not only can DataLad and git-annex be configured with the same command as Git, but in many cases they will also use exactly the same files as Git for their own configurations.

Let’s find out how the repository-specific configuration file in the DataLad-101 superdataset looks like:

$ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[annex]
	uuid = 46b169aa-bb91-42d6-be06-355d957fb4f7
	version = 10
[filter "annex"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f
	process = git-annex filter-process
[submodule "recordings/longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
	url = ../mock_user/DataLad-101
	fetch = +refs/heads/*:refs/remotes/roommate/*
	annex-uuid = ✂UUID✂
	annex-ignore = false

This file consists of so called “sections” with the section names in square brackets (e.g., core). Occasionally, a section can have subsections: This is indicated by subsection names in quotation marks after the section name. For example, roommate is a subsection of the section remote. Within each section, variable = value pairs specify configurations for the given (sub)section.

The first section is called core – as the name suggests, this configures core Git functionality. There are many more configurations than the ones in this config file, but they are related to Git, and less related or important to the configuration of a DataLad dataset. We will use this section to showcase the anatomy of the git config command. If, for example, you would want to specifically configure nano to be the default editor in this dataset, you can do it like this:

$ git config --local --add core.editor "nano"

The command consists of the base command git config, a specification of the scope of the configuration with the --local flag, a name specification consisting of section and key with the notation section.variable (here: core.editor), and finally the value specification "nano".

Let’s see what has changed:

$ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	editor = nano
[annex]
	uuid = 46b169aa-bb91-42d6-be06-355d957fb4f7
	version = 10
[filter "annex"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f
	process = git-annex filter-process
[submodule "recordings/longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
	url = ../mock_user/DataLad-101
	fetch = +refs/heads/*:refs/remotes/roommate/*
	annex-uuid = ✂UUID✂
	annex-ignore = false

With this additional line in your repository’s Git configuration, nano will be used as a default editor regardless of the configuration in your global or system-wide configuration. Note that the flag --local applies the configuration to your repository’s .git/config file, whereas --global would apply it as a user specific configuration, and --system as a system-wide configuration. If you would want to change this existing line in your .git/config file, you would replace --add with --replace-all such as in:

$ git config --local --replace-all core.editor "vim"

to configure vim to be your default editor. Note that while being a good toy example, it is not a common thing to configure repository-specific editors.

This example demonstrated the structure of a git config command. By specifying the name option with section.variable (or section.subsection.variable if there is a subsection), and a value, one can configure Git, git-annex, and DataLad. Most of these configurations will be written to a config file of Git, depending on the scope (local, global, system-wide) specified in the command.

If things go wrong during Git config

If something goes wrong during the git config command, for example, you end up having two keys of the same name because you added a key instead of replacing an existing one, you can use the --unset option to remove the line. Alternatively, you can also open the config file in an editor and remove or change sections by hand.

The only information you need, therefore, is the name of a section and variable to configure, and the value you want to specify. But in many cases it is also useful to find out which configurations are already set in which way and where. For this, the git config --list --show-origin is useful. It will display all configurations and their location:

$ git config --list --show-origin
file:/home/bob/.gitconfig   user.name=Bob McBobface
file:/home/bob/.gitconfig   user.email=bob@mcbobface.com
file:.git/config    annex.uuid=1f83595e-bcba-4226-aa2c-6f0153eb3c54
file:.git/config    annex.backends=MD5E
file:.git/config    submodule.recordings/longnow.url=https://github.com/✂
file:.git/config    submodule.recordings/longnow.active=true
file:.git/config    remote.roommate.url=../mock_user/onemoredir/DataLad-101
file:.git/config    remote.roommate.annex-uuid=a5ae24de-1533-4b09-98b9-cd9ba6bf303c
file:.git/config    submodule.longnow.url=https://github.com/✂
file:.git/config    submodule.longnow.active=true
...

This example shows some configurations in the global .gitconfig file, and the configurations within DataLad-101/.git/config. The command is very handy to display all configurations at once to identify configuration problems, find the right configuration file to make a change to, or simply remind oneself of the existing configurations, and it is a useful helper to keep in the back of your head.

At this point you may feel like many of these configurations or the configuration file inside of DataLad-101 do not appear to be intuitively understandable enough to confidently apply changes to them, or identify necessary changes. And indeed, most of the sections and variables or values in there are irrelevant for understanding the book, your dataset, or DataLad, and can just be left as they are. The previous section merely served to de-mystify the git config command and the configuration files. Nevertheless, it might be helpful to get an overview about the meaning of the remaining sections in that file, and the that dissects this config file further can give you a glimpse of this.

Dissecting a Git config file further

Let’s walk through the Git config file of DataLad-101: As mentioned above, git-annex will use the Git config file for some of its configurations, such as the second section. It lists the repository version and annex UUID[4] (git annex whereis (manual) displays information about where the annexed content is with these UUIDs).

You may recognize the fourth part of the configuration, the subsection "recordings/longnow" in the section submodule. Clearly, this is a reference to the longnow podcasts we cloned as a subdataset. The name submodule is Git terminology, and describes a Git repository inside of another Git repository – just like the super- and subdataset principles you discovered in the section Dataset nesting. When you clone a DataLad dataset as a subdataset, it gets registered in this file. For each subdataset, an individual submodule entry will store the information about the subdataset’s --source or origin (the “url”). Thus, every subdataset in your dataset will be listed in this file. If you want, go back to section Install datasets to see that the “url” is the same URL we cloned the longnow dataset from, and go back to section Looking without touching to remind yourself of how cloning a dataset with subdatasets looked and felt like.

Another interesting part is the last section, “remote”. Here we can find the sibling “roommate” we defined in Networking. The term remote is Git-terminology and is used to describe other repositories or DataLad datasets that the repository knows about. This file, therefore, is where DataLad registered the sibling with datalad siblings add (manual), and thanks to it you can collaborate with your room mate. The value to the url variable is a path. If at any point either your superdataset or the remote moves on your file system, the association between the two datasets breaks – this can be fixed by adjusting this path, and a demonstration of this is in section Miscellaneous file system operations. fetch contains a specification which parts of the repository are updated – in this case everything (all of the branches). Lastly, the annex-ignore = false configuration allows git-annex to query the remote when it tries to retrieve data from annexed content.

5.1.2. The datalad configuration command

Although this section put a focus on the git config command, it is important to mention that there also is a datalad configuration (manual) command. It is not identical to git config, but while it lacks some feature of git config, such as the ability to set system-wide configuration, it has additional features. Beyond the local and global scopes, it also supports branch specific configurations in the .datalad/config file (further discussed in the next section), setting configurations recursively through dataset hierarchies, and multi-configuration queries (such as datalad configuration get user.name user.email). By default, datalad configuration will dump (list) the effective configuration including relevant DATALAD_* environment variables, and also annotate the purpose of many common configuration items. The subcommands datalad configuration get or datalad configuration set perform queries or set configurations. You can find out more information on this command in the command documentation.

5.1.3. .git/config versus other (configuration) files

One crucial aspect distinguishes the .git/config file from many other files in your dataset: Even though it is part of your dataset, it won’t be shared together with the dataset. The reason for this is that this file is not version controlled, as it lies within the .git directory. Repository-specific configurations within your .git/config file are thus not written to history. Any local configuration in .git/config applies to the dataset, but it does not stick to the dataset. One can have the misconception that because the configurations were made in the dataset, these configurations will also be shared together with the dataset. .git/config, however, behaves just as your global or system-wide configurations. These configurations are in effect on a system, or for a user, or for a dataset, but are not shared. A datalad clone (manual) command of someone’s dataset will not get you their editor configuration, should they have included one in their config file. Instead, upon a datalad clone, a new config file will be created.

This means, however, that configurations that should “stick” to a dataset[5] need to be defined in different files – files that are version controlled. The next section will talk about them.

Footnotes