5.1. DIY configurations¶
Back in section Data safety, you already learned that there
are dataset configurations, and that these configurations can
be modified, for example with the -c text2git
option.
This option applies a configuration template to store text
files in Git instead of git-annex, and thereby
modifies the DataLad dataset’s default configuration to store
every file in git-annex.
The lecture today focuses entirely on the topic of configurations, and aims to equip everyone with the basics to configure their general and dataset specific setup to their needs. This is not only a handy way to tune a dataset to one’s wishes, but also helpful to understand potential differences in command execution and file handling between two users, computers, or datasets.
“First of all, when we talk about configurations, we have to differentiate between different scopes of configuration, and different tools the configuration belongs or applies to”, our lecturer starts. “In DataLad datasets, different tools can have a configuration: Git, git-annex, and DataLad itself. Because these tools are all combined by DataLad to help you manage your data, it is important to understand how the configuration of one software is used by or influences a second tool, or the overall dataset performance.”
“Oh crap, one of these theoretical lectures again” mourns a student from the row behind you. Personally, you’d also be much more excited about any hands-on lecture filled with commands. But the recent lecture about git-annex and the object-tree was surprisingly captivating, so you’re actually looking forward to today. “Shht! I want to hear this!”, you shush him with a wink.
“We will start by looking into the very first configuration you did, already before the course started: The global Git configuration.” the lecturer says.
At one point in time, you likely followed instructions such as in Installation and configuration and configured your Git identity with the commands:
git config --global --add user.name "Elena Piscopia"
git config --global --add user.email elena@example.net
“What the above commands do is very simple: They search for a specific configuration file, and set the variables specified in the command – in this case user name and user email address – to the values provided with the command.” she explains.
“This general procedure, specifying a value for a configuration
variable in a configuration file, is how you can configure the
different tools to your needs. The configuration, therefore,
is really easy. Even if you are only used to ticking boxes
in the settings
tab of a software tool so far, it’s intuitive
to understand how a configuration file in principle works and also
how to use it. The only piece of information you will need
are the necessary files, or the command that writes to them, and
the available options for configuration, that’s it. And what’s
really cool is that all tools we’ll be looking at – Git, git-annex,
and DataLad – can be configured using the git config
command1. Therefore, once you understand the syntax of this
command, you already know half of what’s relevant. The other half
is understanding what you’re doing. Now then, let’s learn how
to configure settings, but also understand what we’re doing
with these configurations.”
“This seems easy enough”, you think. Let’s see what types of configurations there are.
5.1.1. Git config files¶
The user name and email configuration
is a user-specific configuration (called global
configuration by Git), and therefore applies to your user account.
Wherever on your computer you run a Git, git-annex, or DataLad
command, this global configuration will
associate the name and email address you supplied in
the git config commands above with this action.
For example, whenever you
datalad save
, the information in this file is used for the
history entry about commit author and email.
Apart from global Git configurations, there are also system-wide2
and repository configurations. Each of these configurations
resides in its own file. The global configuration is stored in a file called
.gitconfig
in your home directory. Among
your name and email address, this file can store general
per-user configurations, such as a default editor3, or highlighting
options.
The repository-specific configurations apply to each individual
repository. Their scope is more limited than the global
configuration (namely to a single repository), but it can overrule global
configurations: The more specific the scope of a configuration file is, the more
important it is, and the variables in the more specific configuration
will take precedence over variables in less specific configuration files.
One could for example have vim configured to be the default editor
on a global scope, but could overrule this by setting the editor to nano
in a given repository. For this reason, the repository-specific configuration
does not reside in a file in your home directory, but in .git/config
within every Git repository (and thus DataLad dataset).
Thus, there are three different scopes of Git configuration, and each is defined
in a config
file in a different location. The configurations will determine
how Git behaves. In principle, all of these files can configure
the same variables differently, but more specific scopes take precedence over broader
scopes. Conveniently, not only can DataLad and git-annex be configured with
the same command as Git, but in many cases they will also use exactly the same
files as Git for their own configurations.
Let’s find out how the repository-specific configuration file in the DataLad-101
superdataset looks like:
$ cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[annex]
uuid = 95c315c1-f3a5-411a-ad87-e594b92bf9da
version = 8
backends = MD5E
[filter "annex"]
smudge = git-annex smudge -- %f
clean = git-annex smudge --clean -- %f
[submodule "recordings/longnow"]
active = true
url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
url = ../mock_user/DataLad-101
fetch = +refs/heads/*:refs/remotes/roommate/*
annex-uuid = c9dab8de-7445-4c0b-94b1-dde183b91035
annex-ignore = false
This file consists of so called “sections” with the section names
in square brackets (e.g., core
). Occasionally, a section can have
subsections: This is indicated by subsection names in
quotation marks after the section name. For example, roommate
is a subsection
of the section remote
.
Within each section, variable = value
pairs specify configurations
for the given (sub)section.
The first section is called core
– as the name suggests,
this configures core Git functionality. There are
many more
configurations than the ones in this config file, but
they are related to Git, and less related or important to the configuration of
a DataLad dataset. We will use this section to showcase the anatomy of the
git config command. If for example you would want to specifically
configure nano to be the default editor in this dataset, you
can do it like this:
$ git config --local --add core.editor "nano"
The command consists of the base command git config,
a specification of the scope of the configuration with the --local
flag, a name
specification consisting of section and key with the
notation section.variable
(here: core.editor
), and finally the value
specification "nano"
.
Let’s see what has changed:
$ cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
editor = nano
[annex]
uuid = 95c315c1-f3a5-411a-ad87-e594b92bf9da
version = 8
backends = MD5E
[filter "annex"]
smudge = git-annex smudge -- %f
clean = git-annex smudge --clean -- %f
[submodule "recordings/longnow"]
active = true
url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
url = ../mock_user/DataLad-101
fetch = +refs/heads/*:refs/remotes/roommate/*
annex-uuid = c9dab8de-7445-4c0b-94b1-dde183b91035
annex-ignore = false
With this additional line in your repositories Git configuration, nano will
be used as a default editor regardless of the configuration in your global
or system-wide configuration. Note that the flag --local
applies the
configuration to your repository’s .git/config
file, whereas --global
would apply it as a user specific configuration, and --system
as a
system-wide configuration.
If you would want to change this existing line in your .git/config
file, you would replace --add
with --replace-all
such as in:
git config --local --replace-all core.editor "vim"
to configure vim to be your default editor.
(Note that while being a good toy example, it is not a common thing to configure repository-specific editors)
This example demonstrated the structure of a git config
command. By specifying the name
option with section.variable
(or section.subsection.variable
if there is a subsection), and
a value, one can configure Git, git-annex, and DataLad.
Most of these configurations will be written to a config
file
of Git, depending on the scope (local, global, system-wide)
specified in the command.
If things go wrong
If something goes wrong during the git config command,
for example you end up having two keys of the same name because you
added a key instead of replacing an existing one, you can use the
--unset
option to remove the line. Alternatively, you can also open
the config file in an editor and remove or change sections by hand.
The only information you need, therefore, is the name of a section and variable to configure, and the value you want to specify. But in many cases it is also useful to find out which configurations are already set in which way and where. For this, the git config --list --show-origin is useful. It will display all configurations and their location:
$ git config --list --show-origin
file:/home/bob/.gitconfig user.name=Bob McBobface
file:/home/bob/.gitconfig user.email=bob@mcbobface.com
file:/home/bob/.gitconfig core.editor=vim
file:/home/bob/.gitconfig annex.security.allowed-url-schemes=http https file
file:.git/config core.repositoryformatversion=0
file:.git/config core.filemode=true
file:.git/config core.bare=false
file:.git/config core.logallrefupdates=true
file:.git/config annex.uuid=1f83595e-bcba-4226-aa2c-6f0153eb3c54
file:.git/config annex.version=5
file:.git/config annex.backends=MD5E
file:.git/config submodule.recordings/longnow.url=https://github.com/datalad-datasets/longnow-podcasts.git
file:.git/config submodule.recordings/longnow.active=true
file:.git/config remote.roommate.url=../mock_user/onemoredir/DataLad-101
file:.git/config remote.roommate.fetch=+refs/heads/*:refs/remotes/roommate/*
file:.git/config remote.roommate.annex-uuid=a5ae24de-1533-4b09-98b9-cd9ba6bf303c
file:.git/config remote.roommate.annex-ignore=false
file:.git/config submodule.longnow.url=https://github.com/datalad-datasets/longnow-podcasts.git
file:.git/config submodule.longnow.active=true
This example shows some configurations in the global .gitconfig
file, and the configurations within DataLad-101/.git/config
.
The command is very handy to display all configurations at once to identify
configuration problems, find the right configuration file to make a change to,
or simply remind oneself of the existing configurations, and it is a useful
helper to keep in the back of your head.
At this point you may feel like many of these configurations or the configuration file
inside of DataLad-101
do not appear to be
intuitively understandable enough to confidently apply changes to them,
or identify necessary changes. And indeed, most of the sections and variables
or values in there are irrelevant for understanding the book, your dataset,
or DataLad, and can just be left as they are. The previous section merely served
to de-mystify the git config command and the configuration files.
Nevertheless, it might be helpful to get an overview about the meaning of the
remaining sections in that file, and the following hidden section can give
you a glimpse of this.
More on this config file
The second section of .git/config
is a git-annex configuration.
As mentioned above, git-annex will use the
Git config file for some of its configurations.
For example, it lists the repository as a
“version 5 repository”, and gives the dataset its own git-annex
UUID. While the “annex-uuid”4 looks like yet another cryptic
random string of characters, you have seen a UUID like this before:
A git annex whereis displays information about where the
annexed content in a dataset is with these UUIDs.
This section also specifies the supported backends in this dataset.
If you have read the hidden section in the section
Data integrity you will recognize the name “MD5E”. This is the
hash function used to generate the annexed files keys and thus
paths in the object tree. All backends specified in this file (it
can be a list) can be used to hash your files.
You may recognize the third part of the configuration, the subsection
"recordings/longnow"
in the section submodule
.
Clearly, this is a reference to the longnow
podcasts
we cloned as a subdataset. The name submodule is Git
terminology, and describes a Git repository inside of
another Git repository – just like
the super- and subdataset principles you discovered in the
section Dataset nesting. When you clone a DataLad dataset
as a subdataset, it gets registered in this file.
For each subdataset, an individual submodule entry
will store the information about the subdataset’s
--source
or origin (the “url”).
Thus, every subdataset (and sub-subdataset, and so forth) in your dataset
will be listed in this file.
If you want, go back to section Install datasets to see that the
“url” is the same URL we cloned the longnow dataset from, and
go back to section Looking without touching to remind yourself of
how cloning a dataset with subdatasets looked and felt like.
Another interesting part is the last section, “remote”.
Here we can find the sibling “roommate” we defined
in Networking. The term remote is Git-terminology and is
used to describe other repositories or DataLad datasets that the
repository knows about and tracks.
This file, therefore, is where DataLad registered the sibling
with datalad siblings add, and thanks to it you can
collaborate with your room mate.
Note the path given as a value to the url
variable. If at any point
either your superdataset or the remote moves on your file system,
the association between the two datasets breaks – this can be fixed by adjusting this
path, and a demonstration of this is in section Miscellaneous file system operations.
fetch contains a specification which parts of the repository are
updated – in this case everything (all of the branches).
Lastly, the annex-ignore = false
configuration allows git-annex
to query the remote when it tries to retrieve data from annexed content.
5.1.2. .git/config
versus other (configuration) files¶
One crucial aspect distinguishes the .git/config
file from many other files
in your dataset: Even though it is part of your dataset, it won’t be shared together
with the dataset. The reason for this is that this file is not version
controlled, as it lies within the .git
directory.
Repository-specific configurations within your .git/config
file are thus not written to history. Any local configuration in .git/config
applies to the dataset, but it does not stick to the dataset.
One can have the misconception that because the configurations were made in
the dataset, these configurations will also be shared together with the dataset.
.git/config
, however, behaves just as your global or system-wide configurations.
These configurations are in effect on a system, or for a user, or for a dataset,
but are not shared.
A datalad clone command of someone’s dataset will not get your their
editor configuration, should they have included one in their config file.
Instead, upon a datalad clone, a new config file will be created.
This means, however, that configurations that should “stick” to a dataset5 need to be defined in different files – files that are version controlled. The next section will talk about them.
Footnotes
- 1
As an alternative to a
git config
command, you could also run configuration templates or procedures (see Configurations to go) that apply predefined configurations or in some cases even add the information to the configuration file by hand and save it using an editor of your choice.- 2
The third scope of a Git configuration are the system wide configurations. These are stored (if they exist) in
/etc/gitconfig
and contain settings that would apply to every user on the computer you are using. These configurations are not relevant for DataLad-101, and we will thus skip them. You can read more about Git’s configurations and different files here.- 3
If your default editor is vim and you do not like this, now can be the time to change it! Chose either of two options:
1) Open up the file with an editor for your choice (e.g., nano):
nano ~/.gitconfig
and either paste the following configuration or edit it if it already exists:
[core] editor = nano
run the following command, but exchange
nano
with an editor of your choice:git config --global --add core.editor "nano"
- 4
A UUID is a universally unique identifier – a 128-bit number that unambiguously identifies information.
- 5
Please note that not all configurations can be written to files other than
.git/config
. Some of the files introduced in the next section will not be queried by Git, and in principle, it is a good thing that one can not share arbitrary configurations together with a dataset, as this could be a potential security threat. In those cases where you need dataset clones to inherit certain non-sticky configurations, it is advised to write a custom procedure and distribute it together with the dataset. The next two sections contain concrete usecases and tutorials.