5.3. Configurations to go

The past two sections should have given you a comprehensive overview of the different configuration options that Git, git-annex, and DataLad provide. They not only showed you how to configure everything you may need to configure, but also explained what the configuration options actually mean.

But figuring out which configurations are useful and how to apply them is not the easiest of tasks. Therefore, some clever people decided to assist with these tasks, and created pre-configured procedures that process datasets in a particular way. These procedures can be shipped within DataLad or its extensions, reside on a system, or be shared together with datasets.

One such procedure is the text2git configuration. In order to learn about procedures in general, let’s demystify what exactly the text2git procedure is: It is nothing more than a simple script that

  • writes the relevant configuration (annex_largefiles = '((mimeencoding=binary)and(largerthan=0))', i.e., “Do not put anything that is a text file in the annex”) to the .gitattributes file of a dataset (the resulting line is shown after this list), and

  • saves this modification with the commit message “Instruct annex to add text files to Git”.
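
Concretely, once the procedure has run, the dataset’s .gitattributes file contains a line like the following, which instructs git-annex to annex only files that are binary and larger than zero bytes – everything else, i.e., all text files, is committed to Git:

 * annex.largefiles=((mimeencoding=binary)and(largerthan=0))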

Why this configuration does not work for Windows users

If you’re on a Windows 10 machine with a native (i.e., non-WSL-based) installation of DataLad and did not use the custom git-annex installer from http://datasets.datalad.org/datalad/packages/windows/ at the start of the Basics, the text2git configuration will lead to errors upon a datalad save. This is because MagicMime (used by mimeencoding=binary to determine the type of any given file by searching for magic numbers) is not natively available on Windows.

This particular procedure lives in a script called cfg_text2git in the source code of DataLad. The amount of code in this script is not large, and the relevant lines are the ones that define and apply the annex_largefiles configuration:

 import sys
 import os.path as op

 from datalad.distribution.dataset import require_dataset

 ds = require_dataset(
     sys.argv[1],
     check_installed=True,
     purpose='configuration')

 # the relevant configuration:
 annex_largefiles = '((mimeencoding=binary)and(largerthan=0))'
 attrs = ds.repo.get_gitattributes('*')
 if not attrs.get('*', {}).get(
         'annex.largefiles', None) == annex_largefiles:
     ds.repo.set_gitattributes([
         ('*', {'annex.largefiles': annex_largefiles})])

 git_attributes_file = op.join(ds.path, '.gitattributes')
 ds.save(
     git_attributes_file,
     message="Instruct annex to add text files to Git",
 )

Just like cfg_text2git, all DataLad procedures are executables (such as a script, or compiled code). In principle, they can be written in any language, and perform any task inside of a dataset. The text2git configuration, for example, applies a configuration for how git-annex treats different file types. Other procedures not only modify .gitattributes, but can also populate a dataset with particular content, or automate routine tasks such as synchronizing dataset content with certain siblings. What makes them a particularly versatile and flexible tool is that anyone can write their own procedures: If a workflow is a standard in a team and needs to be applied often, turning it into a script can save time and effort. By pointing DataLad to the location in which procedures reside, they can be applied, and by including them in a dataset, they can even be shared. And even if the script is simple, it is very handy to have pre-configured procedures that can be run in a single command line call. In the case of text2git, all text files in a dataset will be stored in Git – a useful configuration that is applicable to a wide range of datasets. It is a shortcut that spares new users the need to learn about the .gitattributes file when setting up a dataset.

To find out which procedures are available, the command datalad run-procedure --discover (datalad-run-procedure manual) is helpful. This command makes DataLad search the default location for procedures in a dataset, the source code of DataLad and installed DataLad extensions, and the default locations for procedures on the system:

$ datalad run-procedure --discover
cfg_bids (/home/adina/env/handbook2/lib/python3.9/site-packages/datalad_neuroimaging/resources/procedures/cfg_bids.py) [python_script]
cfg_hirni (/home/adina/env/handbook2/lib/python3.9/site-packages/datalad_hirni/resources/procedures/cfg_hirni.py) [python_script]
cfg_metadatatypes (/home/adina/repos/datalad/datalad/resources/procedures/cfg_metadatatypes.py) [python_script]
cfg_text2git (/home/adina/repos/datalad/datalad/resources/procedures/cfg_text2git.py) [python_script]
cfg_yoda (/home/adina/repos/datalad/datalad/resources/procedures/cfg_yoda.py) [python_script]

The output shows that in this particular dataset, on the particular system the book is written on, there are five procedures available: two (cfg_bids and cfg_hirni) come from installed DataLad extensions, and three – cfg_metadatatypes, cfg_text2git, and cfg_yoda – are part of the source code of DataLad [1]. The output also lists where each procedure is stored.

  • cfg_yoda configures a dataset according to the YODA principles – the section YODA: Best practices for data analyses in a dataset talks about this in detail.

  • cfg_text2git configures text files to be stored in Git.

  • cfg_metadatatypes lets users configure additional metadata types – more about this in a later section on DataLad’s metadata handling.
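
To find out what any of these procedures does without reading its source code, you can ask for its registered description with the --help-proc flag (it is used again at the very end of this section), for example:

$ datalad run-procedure --help-proc cfg_text2git
# prints the procedure's location and its short help text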

5.3.1. Applying procedures

datalad run-procedure not only discovers but also executes procedures. If given the name of a procedure, this command will apply the procedure to the current dataset, or the dataset that is specified with the -d/--dataset flag:

datalad run-procedure [-d <PATH>] cfg_text2git

The typical workflow is to create a dataset and apply a procedure afterwards. However, procedures that are shipped with DataLad or its extensions and carry a cfg_ prefix can also be applied right at the creation of a dataset, using the -c/--cfg-proc <name> option of a datalad create command. This is a peculiarity of these procedures: by convention, all of them are written to require no arguments. The command structure looks like this:

datalad create -c text2git DataLad-101

Note that the cfg_ prefix of the procedures is omitted in these calls to keep it extra simple and short. The available procedures in this example (cfg_yoda, cfg_text2git) could thus be applied within a datalad create as

  • datalad create -c yoda <DSname>

  • datalad create -c text2git <DSname>

Applying multiple procedures

If you want to apply several configurations at once, feel free to do so, for example like this:

$ datalad create -c yoda -c text2git

Applying procedures in subdatasets

Procedures can be applied in datasets on any level in the dataset hierarchy, i.e., also in subdatasets. Note, though, that a subdataset will show up as being modified in datalad status in the superdataset after applying a procedure. This is expected, and it would also be the case with any other modification (saved or not) in the subdataset, as the version of the subdataset that is tracked in the superdataset simply changed. A datalad save in the superdataset will make sure that the version of the subdataset gets updated in the superdataset. The section More on Dataset nesting will elaborate on this general principle later in the handbook.
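
Here is a minimal sketch of this workflow, assuming a superdataset that contains a subdataset called sub (a made-up name); the commands are run from the superdataset root:

$ datalad run-procedure -d sub cfg_text2git
$ datalad status      # the subdataset sub is now reported as modified
$ datalad save -m "apply text2git configuration in subdataset" sub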

As a general note, it can be useful to apply procedures early in the life of a dataset. Procedures such as cfg_yoda (explained in detail in section YODA: Best practices for data analyses in a dataset) create files, change .gitattributes, or apply other configurations. If many other (possibly complex) configurations are already in place, or if files with the same names as the ones created by a procedure already exist, this can lead to unexpected problems or failures, especially for inexperienced users. Applying cfg_text2git to a default dataset in which many text files have already been saved (and thus, per default, added to the annex) will not move the existing, saved files into Git – only text files created after the configuration was applied end up in Git.
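
The following hypothetical shell session sketches this pitfall (the dataset and file names are made up):

$ datalad create late-config && cd late-config
$ echo "some text" > old.txt
$ datalad save -m "old.txt is saved before text2git and goes into the annex"
$ datalad run-procedure cfg_text2git
$ echo "more text" > new.txt
$ datalad save -m "new.txt is saved after text2git and goes into Git"
# old.txt stays in the annex (a symlink); only new.txt is stored in Git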

Write your own procedures

Procedures can come with DataLad or its extensions, but anyone can also write their own, and deploy them on individual machines, or ship them within DataLad datasets. This makes it possible to automate routine configurations or tasks in a dataset, or to share configurations that would otherwise not “stick” to the dataset. Some general rules for creating a custom procedure are outlined below:

  • A procedure can be any executable. Executables must have the appropriate permissions and, in the case of a script, must contain an appropriate shebang (a minimal shell procedure that follows these rules is sketched after this list).

    • If a procedure is not executable, but its filename ends with .sh, it is automatically executed via bash.

  • Procedures can implement any argument handling, but must be capable of taking at least one positional argument (the absolute path to the dataset they shall operate on).

  • Custom procedures rely heavily on configurations in .datalad/config (or the associated environment variables). Within .datalad/config, each procedure should get an individual entry that contains at least a short “help” description of what the procedure does. Below is a minimal .datalad/config entry for a custom procedure:

    [datalad "procedures.<NAME>"]
       help = "This is a string to describe what the procedure does"
    
  • By default, on GNU/Linux systems, DataLad will search for system-wide procedures (i.e., procedures on the system level) in /etc/xdg/datalad/procedures, for user procedures (i.e., procedures on the global level) in ~/.config/datalad/procedures, and for dataset procedures (i.e., the local level [2]) in .datalad/procedures relative to a dataset root. Note that .datalad/procedures does not exist by default, and the procedures directory needs to be created first.

    • As an alternative to the default locations, DataLad can be pointed to the location of a procedure with a configuration in .datalad/config (or with the help of the associated environment variables). The appropriate configuration keys for .datalad/config are datalad.locations.system-procedures (for changing the system default), datalad.locations.user-procedures (for changing the global default), and datalad.locations.dataset-procedures (for changing the local default). An example .datalad/config entry for the local scope is shown below.

      [datalad "locations"]
          dataset-procedures = relative/path/from/dataset-root
      
  • By default, DataLad will call a procedure with a standard template defined by a format string:

    interpreter {script} {ds} {arguments}
    

    where arguments can be any additional command line arguments a script (procedure) takes or requires. This default format string can be customized within .datalad/config in datalad.procedures.<NAME>.call-format. An example .datalad/config entry with a changed call format string is shown below.

    [datalad "procedures.<NAME>"]
       help = "This is a string to describe what the procedure does"
       call-format = "python {script} {ds} {somearg1} {somearg2}"
    
  • By convention, procedures should leave a dataset in a clean state.
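
To make these rules concrete, here is a minimal, hypothetical shell procedure that follows them: it takes the absolute dataset path as its first positional argument, creates a file, and leaves the dataset in a clean state by saving the modification. Because its (made-up) file name ends in .sh, it would be executed via bash even without the executable bit set:

 #!/bin/bash
 # hypothetical procedure, stored as .datalad/procedures/add_readme.sh
 # $1 is the absolute path to the dataset the procedure operates on
 set -e
 echo "This dataset was set up by a custom procedure." > "$1/README.md"
 # by convention, leave the dataset in a clean state
 datalad save -d "$1" -m "Add a README via a custom procedure" "$1/README.md"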

Therefore, in order to create a custom procedure, an executable script in the appropriate location is all it takes. Placing a script myprocedure into .datalad/procedures allows running datalad run-procedure myprocedure in your dataset, and because the script is part of the dataset, the procedure can also be distributed with it. Below is a toy example of a custom procedure:

$ datalad create somedataset; cd somedataset
[INFO] Creating a new annex repo at /home/me/procs/somedataset 
[INFO] Scanning for unlocked files (this may take some time)
create(ok): /home/me/procs/somedataset (dataset)
$ mkdir .datalad/procedures
$ cat << EOT > .datalad/procedures/example.py
"""A simple procedure to create a file 'example' and store
it in Git, and a file 'example2' and annex it. The contents
of 'example' must be defined with a positional argument."""

import sys
import os.path as op
from datalad.distribution.dataset import require_dataset
from datalad.utils import create_tree

ds = require_dataset(
    sys.argv[1],
    check_installed=True,
    purpose='showcase an example procedure')

# this is the content for file "example"
content = """\
This file was created by a custom procedure! Neat, huh?
"""

# create a directory structure template
tmpl = {
    'somedir': {
        'example': content,
    },
    'example2': sys.argv[2] if len(sys.argv) > 2 else "got no input"
}

# actually create the structure in the dataset
create_tree(ds.path, tmpl)

# rule to store 'example' in Git
ds.repo.set_gitattributes([('example', {'annex.largefiles': 'nothing'})])

# save the dataset modifications
ds.save(message="Apply custom procedure")

EOT
$ datalad save -m "add custom procedure"
add(ok): .datalad/procedures/example.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

At this point, the dataset contains the custom procedure example. This is how it can be executed and what it does:

$ datalad run-procedure example "this text will be in the file 'example2'"
[INFO] Running procedure example 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
# the directory structure has been created
$ tree
.
├── example2 -> .git/annex/objects/G6/zw/MD5E-s40--2ed1bce0db9f376c277a1ba6418f3ddd/MD5E-s40--2ed1bce0db9f376c277a1ba6418f3ddd
└── somedir
    └── example

1 directory, 2 files
# let's check the contents of the files
$ cat example2  && echo '' && cat somedir/example
this text will be in the file 'example2'
This file was created by a custom procedure! Neat, huh?
$ git config -f .datalad/config datalad.procedures.example.help "A toy example"
$ datalad save -m "add help description"
add(ok): .datalad/config (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

To find out more about a given procedure, you can ask for help:

$ datalad run-procedure --help-proc example
example (.datalad/procedures/example.py)
A toy example

Summing up, DataLad’s run-procedure command is a handy tool that ships with useful pre-configured procedures, yet offers much flexibility for your own DIY procedure scripts. With the information from the last three sections you should be able to write and understand the necessary configurations, but you can also rely on existing, pre-configured templates in the form of procedures, and even write and distribute your own.

Therefore, envision procedures as helper tools that can minimize technical complexities in a dataset – users can concentrate on the actual task while the dataset is set up, structured, processed, or configured automatically with the help of a procedure. Especially in the case of trainees and new users, applying procedures instead of performing the relevant routines “by hand” can ease working with the dataset, as the use case Student supervision in a research project showcases. Apart from being run by users, procedures can also be triggered to run automatically after any command execution if a command’s results match a specific requirement. If you are interested in finding out more about this, read on in section DataLad’s result hooks.

Finally, write a note about running procedures in notes.txt:

$ cat << EOT >> notes.txt
It can be useful to use pre-configured procedures that can apply
configurations, create files or file hierarchies, or perform
arbitrary tasks in datasets. They can be shipped with DataLad,
its extensions, or datasets, and you can even write your own
procedures and distribute them. The "datalad run-procedure"
command is used to apply such a procedure to a dataset. Procedures
shipped with DataLad or its extensions starting with a "cfg" prefix
can also be applied at the creation of a dataset with
"datalad create -c <PROC-NAME> <PATH>" (omitting the "cfg" prefix).

EOT
$ datalad save -m "add note on DataLad's procedures"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Footnotes

[1] In theory, because procedures can exist on different levels, and because anyone can create (and thus name) their own procedures, there can be name conflicts. The order of precedence in such cases is: user-level, system-level, dataset, DataLad extension, DataLad, i.e., local procedures take precedence over those coming from “outside” via datasets or DataLad extensions. If procedures in a higher-level dataset and a subdataset have the same name, the procedure closer to the dataset run-procedure is operating on takes precedence.

[2] Note that we simplify the level of procedures that exist within a dataset by calling them local. Even though they apply to a dataset just as local Git configurations do, unlike Git’s local configurations in .git/config, the procedures and procedure configurations in .datalad/config are committed and can be shared together with a dataset. The procedure level local therefore does not exactly correspond to the local scope in the sense that Git uses it.