5.2. More on DIY configurations

As the last section already suggest, within a Git repository, .git/config is not the only configuration file. There are also .gitmodules and .gitattributes, and in DataLad datasets there also is a .datalad/config file.

All of these files store configurations, but have an important difference: They are version controlled, and upon sharing a dataset these configurations will be shared as well. An example for a shared configuration is the one that the text2git configuration template applied: In the shared copy of your dataset, text files are also saved with Git, and not git-annex (see section Networking). The configuration responsible for this behavior is in a .gitattributes file, and we’ll start this section by looking into it.

5.2.1. .gitattributes

This file lies right in the root of your superdataset:

$ cat .gitattributes

* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=((mimeencoding=binary)and(largerthan=0))

This looks neither spectacular nor pretty. Also, it does not follow the section-option-value organization of the .git/config file anymore. Instead, there are three lines, and all of these seem to have something to do with the configuration of git-annex. There even is one key word that you recognize: MD5E. If you have read the hidden section in Data integrity you will recognize it as a reference to the type of key used by git-annex to identify and store file content in the object-tree. The first row, * annex.backend=MD5E, therefore translates to “Everything in this directory should be hashed with a MD5E hash function”. But what is the rest? We’ll start with the last row:

* annex.largefiles=((mimeencoding=binary)and(largerthan=0))

Uhhh, cryptic. The lecturer explains:

“git-annex will annex, that is, store in the object-tree, anything it considers to be a “large file”. By default, anything in your dataset would be a “large file”, that means anything would be annexed. However, in section Data integrity I already mentioned that exceptions to this behavior can be defined based on

  1. file size

  2. and/or path/pattern, and thus for example file extensions, or names, or file types (e.g., text files, as with the text2git configuration template).

“In .gitattributes, you can define what a large file and what is not by simply telling git-annex by writing such rules.”

What you can see in this .gitattributes file is a rule based on file types: With (mimeencoding=binary))1, the text2git configuration template configured git-annex to regard all files of type “binary” as a large file. Thanks to this little line, your text files are not annexed, but stored directly in Git.

The patterns * and ** are so-called “wildcards” used in globbing. * matches any file or directory in the current directory, and ** matches all files and directories in the current directory and subdirectories. In technical terms, ** matches recursively. The third row therefore translates to “Do not annex anything that is a text file in this directory” for git-annex.

However, rules can be even simpler. The second row simply takes a complete directory (.git) and instructs git-annex to regard nothing in it as a “large file”. The second row, **/.git* annex.largefiles=nothing therefore means that no .git repository in this directory or a subdirectory should be considered a “large file”. This way, the .git repositories are protected from being annexed. If you had a single file (myfile.pdf) you would not want annexed, specifying a rule such as:

myfile.pdf annex.largefiles=nothing

will keep it stored in Git. To see an example of this, navigate into the longnow subdataset, and view this dataset’s .gitattributes file:

$ cat recordings/longnow/.gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
README.md annex.largefiles=nothing

The relevant part is README.md annex.largefiles=nothing. This instructs git-annex to specifically not annex README.md.

Lastly, if you wanted to configure a rule based on size, you could add a row such as:

** annex.largefiles(largerthan=20kb)

to store only files exceeding 20KB in size in git-annex2.

As you may have noticed, unlike .git/config files, there can be multiple .gitattributes files within a dataset. So far, you have seen one in the root of the superdataset, and in the root of the longnow subdataset. In principle, you can add one to every directory-level of your dataset. For example, there is another .gitattributes file within the .datalad directory:

$ cat .datalad/.gitattributes

config annex.largefiles=nothing
metadata/aggregate* annex.largefiles=nothing
metadata/objects/** annex.largefiles=(anything)

As with Git configuration files, more specific or lower-level configurations take precedence over more general or higher-level configurations. Specifications in a subdirectory can therefore overrule specifications made in the .gitattributes file of the parent directory.

In summary, the .gitattributes files will give you the possibility to configure what should be annexed and what should not be annexed up to individual file level. This can be very handy, and allows you to tune your dataset to your custom needs. For example, files you will often edit by hand could be stored in Git if they are not too large to ease modifying them3. Once you know the basics of this type of configuration syntax, writing your own rules is easy. For more tips on how configure git-annex’s content management in .gitattributes, take a look at this page of the git-annex documentation. Later however you will see preconfigured DataLad procedures such as text2git that can apply useful configurations for you, just as text2git added the last line in the root .gitattributes file.

5.2.2. .gitmodules

On last configuration file that Git creates is the .gitmodules file. There is one right in the root of your dataset:

$ cat .gitmodules
[submodule "recordings/longnow"]
	path = recordings/longnow
	url = https://github.com/datalad-datasets/longnow-podcasts.git
	datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
	datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git

Based on these contents, you might have already guessed what this file stores. .gitmodules is a configuration file that stores the mapping between your own dataset and any subdatasets you have installed in it. There will be an entry for each submodule (subdataset) in your dataset. The name submodule is Git terminology, and describes a Git repository inside of another Git repository, i.e., the super- and subdataset principles. Upon sharing your dataset, the information about subdatasets and where to retrieve them from is stored and shared with this file.

Section Looking without touching already mentioned one additional configuration option in a footnote: The datalad-recursiveinstall key. This key is defined on a per subdataset basis, and if set to “skip”, the given subdataset will not be recursively installed unless it is explicitly specified as a path to datalad get [-n/--no-data] -r. If you are a maintainer of a superdataset with monstrous amounts of subdatasets, you can set this option and share it together with the dataset to prevent an accidental, large recursive installation in particularly deeply nested subdatasets. Below is a minimally functional example on how to apply the configuration and how it works:

Let’s create a dataset hierarchy to work with (note that we concatenate multiple commands into a single line using bash’s “and” && operator):

# create a superdataset with two subdatasets
$ datalad create superds && cd superds && datalad create -d . subds1 && datalad create -d . subds2
[INFO   ] Creating a new annex repo at /tmp/superds
create(ok): /tmp/superds (dataset)
[INFO   ] Creating a new annex repo at /tmp/superds/subds1
add(ok): subds1 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subds1 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)
[INFO   ] Creating a new annex repo at /tmp/superds/subds2
add(ok): subds2 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subds2 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)

Next, we create subdatasets in the subdatasets:

# create two subdatasets in subds1
$ cd subds1 && datalad create -d . subsubds1 && datalad create -d . subsubds2 && cd ../
[INFO   ] Creating a new annex repo at /tmp/superds/subds1/subsubds1
add(ok): subsubds1 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subsubds1 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)
[INFO   ] Creating a new annex repo at /tmp/superds/subds1/subsubds2
add(ok): subsubds2 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subsubds2 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)


# create two subdatasets in subds2
$ cd subds2 && datalad create -d . subsubds1 && datalad create -d . subsubds2
[INFO   ] Creating a new annex repo at /tmp/superds/subds2/subsubds1
add(ok): subsubds1 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subsubds1 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)
[INFO   ] Creating a new annex repo at /tmp/superds/subds2/subsubds2
add(ok): subsubds2 (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): subsubds2 (dataset)
action summary:
  add (ok: 2)
  create (ok: 1)
  save (ok: 1)

Here is the directory structure:

$ cd ../ && tree
.
├── subds1
│   ├── subsubds1
│   └── subsubds2
└── subds2
    ├── subsubds1
    └── subsubds2

# save in the superdataset
datalad save -m "add a few sub and subsub datasets"
add(ok): subds1 (file)
add(ok): subds2 (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)

Now, we can apply the datalad-recursiveinstall configuration to skip recursive installations for subds1

$ git config -f .gitmodules --add submodule.subds1.datalad-recursiveinstall skip

# save this configuration
$ datalad save -m "prevent recursion into subds1, unless explicitly given as path"
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

If the dataset is cloned, and someone runs a recursive datalad get, the subdatasets of subds1 will not be installed, the subdatasets of subds2, however, will be.

# clone the dataset somewhere else
$ cd ../ && datalad clone superds clone_of_superds
[INFO   ] Cloning superds into '/tmp/clone_of_superds'
install(ok): /tmp/clone_of_superds (dataset)

# recursively get all contents (without data)
$ cd clone_of_superds && datalad get -n -r .
[INFO   ] Installing <Dataset path=/tmp/clone_of_superds> underneath /tmp/clone_of_superds recursively
[INFO   ] Cloning /tmp/superds/subds2 into '/tmp/clone_of_superds/subds2'
get(ok): /tmp/clone_of_superds/subds2 (dataset)
[INFO   ] Cloning /tmp/superds/subds2/subsubds1 into '/tmp/clone_of_superds/subds2/subsubds1'
get(ok): /tmp/clone_of_superds/subds2/subsubds1 (dataset)
[INFO   ] Cloning /tmp/superds/subds2/subsubds2 into '/tmp/clone_of_superds/subds2/subsubds2'
get(ok): /tmp/clone_of_superds/subds2/subsubds2 (dataset)
action summary:
  get (ok: 3)

# only subsubds of subds2 are installed, not of subds1:
$ tree
.
├── subds1
└── subds2
    ├── subsubds1
    └── subsubds2

4 directories, 0 files

Nevertheless, if subds1 is provided with an explicit path, its subdataset subsubds will be cloned, essentially overriding the configuration:

$  datalad get -n -r subds1 && tree
[INFO   ] Cloning /tmp/superds/subds1 into '/tmp/clone_of_superds/subds1'
install(ok): /tmp/clone_of_superds/subds1 (dataset) [Installed subdataset in order to get /tmp/clone_of_superds/subds1]
[INFO   ] Installing <Dataset path=/tmp/clone_of_superds> underneath /tmp/clone_of_superds/subds1 recursively
.
├── subds1
│   ├── subsubds1
│   └── subsubds2
└── subds2
    ├── subsubds1
    └── subsubds2

6 directories, 0 files

5.2.3. .datalad/config

DataLad adds a repository-specific configuration file as well. It can be found in the .datalad directory, and just like .gitattributes and .gitmodules it is version controlled and is thus shared together with the dataset. One can configure many options, but currently, our .datalad/config file only stores a dataset ID. This ID serves to identify a dataset as a unit, across its entire history and flavors. In a geeky way, this is your dataset’s social security number: It will only exist one time on this planet.

$ cat .datalad/config
[datalad "dataset"]
	id = 8e04afb0-af85-4070-be29-858d30d85017

Note, though, that local configurations within a Git configuration file will take precedence over configurations that can be distributed with a dataset. Otherwise, dataset updates with datalad update (or, for Git-users, git pull) could suddenly and unintentionally alter local DataLad behavior that was specifically configured. Also, Git and git-annex will not query this file for configurations, so please store only sticky options that are specific to DataLad (i.e., under the datalad.* namespace) in it.

5.2.4. Writing to configuration files other than .git/config

“Didn’t you say that knowing the git config command is already half of what I need to know?” you ask. “Now there are three other configuration files, and I do not know with which command I can write into these files.”

“Excellent question”, you hear in return, “but in reality, you do know: it’s also the git config command. The only part of it you need to adjust is the -f, --file parameter. By default, the command writes to a Git config file. But it can write to a different file if you specify it appropriately. For example

git config --file=.gitmodules --replace-all submodule."name".url "new URL"

will update your submodule’s URL. Keep in mind though that you would need to commit this change, as .gitmodules is version controlled”.

Let’s try this:

$ git config --file=.gitmodules --replace-all submodule."recordings/longnow".url "git@github.com:datalad-datasets/longnow-podcasts.git"

This command will replace the submodule’s https URL with an SSH URL. The latter is often used if someone has an SSH key pair and added the public key to their GitHub account (you can read more about this here). We will revert this change shortly, but use it to show the difference between a git config on a .git/config file and on a version controlled file:

$ datalad status
 modified: .gitmodules (file)
$ git diff
diff --git a/.gitmodules b/.gitmodules
index 9bc9ee9..11273e1 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,5 +1,5 @@
 [submodule "recordings/longnow"]
 	path = recordings/longnow
-	url = https://github.com/datalad-datasets/longnow-podcasts.git
+	url = git@github.com:datalad-datasets/longnow-podcasts.git
 	datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
	datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git

As these two commands show, the .gitmodules file is modified. The https URL has been deleted (note the -, and a SSH URL has been added. To keep these changes, we would need to datalad save them. However, as we want to stay with https URLs, we will just checkout this change – using a Git tool to undo an unstaged modification.

$ git checkout .gitmodules
$ datalad status
Updated 1 path from the index
nothing to save, working tree clean

Note, though, that the .gitattributes file can not be modified with a git config command. This is due to its different format that does not comply to the section.variable.value structure of all other configuration files. This file, therefore, has to be edited by hand, with an editor of your choice.

5.2.5. Environment variables

An environment variable is a variable set up in your shell that affects the way the shell or certain software works – for example the environment variables HOME, PWD, or PATH. Configuration options that determine the behavior of Git, git-annex, and DataLad that could be defined in a configuration file can also be set (or overridden) by the associated environment variables of these configuration options. Many configuration items have associated environment variables. If this environment variable is set, it takes precedence over options set in configuration files, thus providing both an alternative way to define configurations as well as an override mechanism. For example, the user.name configuration of Git can be overridden by its associated environment variable, GIT_AUTHOR_NAME. Likewise, one can define the environment variable instead of setting the user.name configuration in a configuration file.

Git, git-annex, and DataLad have more environment variables than anyone would want to remember. Here is a good overview on Git’s most useful available environment variables for a start. All of DataLad’s configuration options can be translated to their associated environment variables. Any environment variable with a name that starts with DATALAD_ will be available as the corresponding datalad. configuration variable, replacing any __ (two underscores) with a hyphen, then any _ (single underscore) with a dot, and finally converting all letters to lower case. The datalad.log.level configuration option thus is the environment variable DATALAD_LOG_LEVEL.

Some more general information on environment variables

Names of environment variables are often all-uppercase. While the $ is not part of the name of the environment variable, it is necessary to refer to the environment variable: To reference the value of the environment variable HOME for example you would need to use echo $HOME and not echo HOME. However, environment variables are set without a leading $. There are several ways to set an environment variable (note that there are no spaces before and after the = !), leading to different levels of availability of the variable:

  • THEANSWER=42 <command> makes the variable THEANSWER available for the process in <command>. For example, DATALAD_LOG_LEVEL=debug datalad get <file> will execute the datalad get command (and only this one) with the log level set to “debug”.

  • export THEANSWER=42 makes the variable THEANSWER available for other processes in the same session, but it will not be available to other shells.

  • echo 'export THEANSWER=42' >> ~/.bashrc will write the variable definition in the .bashrc file and thus available to all future shells of the user (i.e., this will make the variable permanent for the user)

To list all of the configured environment variables, type env into your terminal.

5.2.6. Summary

This has been an intense lecture, you have to admit. One definite take-away from it has been that you now know a second reason why the hidden .git and .datalad directory contents and also the contents of .gitmodules and .gitattributes should not be carelessly tampered with – they contain all of the repositories configurations.

But you now also know how to modify these configurations with enough care and background knowledge such that nothing should go wrong once you want to work with and change them. You can use the git config command for Git configuration files on different scopes, and even the .gitmodules or datalad/config files. Of course you do not yet know all of the available configuration options. However, you already know some core Git configurations such as name, email, and editor. Even more important, you know how to configure git-annex’s content management based on largefile rules, and you understand the majority of variables within .gitmodules or the sections in .git/config. Slowly, you realize with pride, you’re more and more becoming a DataLad power-user.

Write a note about configurations in datasets into notes.txt.

$ cat << EOT >> notes.txt
Configurations for datasets exist on different levels (systemwide,
global, and local), and in different types of files (not version
controlled (git)config files, or version controlled .datalad/config,
.gitattributes, or gitmodules files), or environment variables.
With the exception of .gitattributes, all configuration files share a
common structure, and can be modified with the git config command, but
also with an editor by hand.

Depending on whether a configuration file is version controlled or
not, the configurations will be shared together with the dataset.
More specific configurations and not-shared configurations will always
take precedence over more global or hared configurations, and
environment variables take precedence over configurations in files.

The git config --list --show-origin command is a useful tool to give
an overview over existing configurations. Particularly important may
be the .gitattributes file, in which one can set rules for git-annex
about which files should be version-controlled with Git instead of
being annexed.

EOT
$ datalad save -m "add note on configurations and git config"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Footnotes

1

When opening any file on a UNIX system, the file does not need to have a file extension (such as .txt, .pdf, .jpg) for the operating system to know how to open or use this file (in contrast to Windows, which does not know how to open a file without an extension). To do this, Unix systems rely on a file’s MIME type – an information about a file’s content. A .txt file for example has MIME type text/plain as does a bash script (.sh), a Python script has MIME type text/x-python, a .jpg file is image/jpg, and a .pdf file has MIME type application/pdf. You can find out the MIME type of a file by running:

$ file --mime-type path/to/file
2

Specifying annex.largefiles in your .gitattributes file will make the configuration “portable” – shared copies of your dataset will retain these configurations. You could however also set largefiles rules in your .git/config file. Rules specified in there take precedence over rules in .gitattributes. You can set them using the git config command:

$ git config annex.largefiles 'largerthan=100kb and not (include=*.c or include=*.h)'

The above command annexes files larger than 100KB, and will never annex files with a .c or .h extension.

3

Should you ever need to, this file is also where one would change the git-annex backend in order to store new files with a new backend. Switching the backend of all files (new as well as existing ones) requires the git annex migrate command (see the documentation for more information on this command).