9.5. Gists

The more complex and larger your DataLad project, the more difficult it is to do efficient housekeeping. This section is a selection of code snippets tuned to perform specific, non-trivial tasks in datasets. Often, they are not limited to single commands of the version control tools you know, but combine other helpful command line tools and general Unix command line magic. Just like GitHub gists, it's a collection of lightweight and easily accessible tips and tricks. For a more basic command overview, take a look at the DataLad cheat sheet. The tips collection of git-annex is also a very valuable resource.


9.5.1. Parallelize subdataset processing

DataLad cannot yet parallelize processes that are performed independently over a large number of subdatasets. Pushing across a dataset hierarchy, for example, is performed one subdataset after the other. Unix, however, has a few tools such as xargs or the parallel tool from moreutils that can assist.

Here is an example of pushing all subdatasets (and their respective subdatasets) recursively to their (identically named) siblings:

$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d

datalad -f '{path}' subdatasets discovers the paths of all subdatasets, and xargs hands them individually (-n 1) to a (recursive) datalad push (manual); each path is appended to the trailing -d, so every push operates on a single subdataset, and -P 10 runs ten of these operations in parallel, thus achieving parallelization.

Here is an example of cross-dataset download parallelization:

$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d

Operations like this can safely be attempted for all commands that are independent across subdatasets.
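
The same pattern works for other commands as well. As a hypothetical illustration (assuming every subdataset has a sibling configured to update from), updates could be fetched and merged across all subdatasets in parallel like this:

$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad update --merge -d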

9.5.2. Check whether all file content is present locally

In order to check if all the files in a dataset have their file contents locally available, you can ask git-annex:

$ git annex find --not --in=here

Any file that does not have its contents locally available will be listed. If there are subdatasets you want to recurse into, use the following command:

$ git submodule foreach --quiet --recursive \
 'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'

Alternatively, to get very comprehensive output, you can use

$ datalad -f json status --recursive --annex availability

The output will be returned as json, and the key has_content indicates local content availability (true or false). To filter through it, the command line tool jq works well:

$ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
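
The command above lists the paths of files whose content is present locally. Inverting the filter in the same fashion lists only the files whose content is missing:

$ datalad -f json status --recursive --annex all | jq '. | select(.has_content == false).path'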

9.5.3. Drop annexed files from all past commits

If there is annexed file content that is no longer used (i.e., data in the annex that no files in any branch point to anymore, such as corrupt files), you can find and remove this file content from your dataset (i.e., completely and irrecoverably delete it) with git-annex’s commands git annex unused (manual) and git annex dropunused (manual).

Find out which file contents are unused (not referenced by any current branch):

$ git annex unused
 unused . (checking for unused data...)
   Some annexed data is no longer used by any files in the repository.
     NUMBER  KEY
     1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
     2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
   (To see where data was previously used, try: git log --stat -S'KEY')
   (To remove unwanted data: git-annex dropunused NUMBER)
 ok

Remove a single unused file by specifying its number in the listing above:

$ git annex dropunused 1
 dropunused 1 ok

Or a range of unused data with

$ git annex dropunused 1-1000

Or all

$ git annex dropunused all

9.5.4. Getting single file sizes prior to downloading from the Python API and the CLI

For a single file, datalad status --annex -- myfile (manual) will report on the size of the file prior to a datalad get (manual).

If you want to do it in Python, try this approach:

import datalad.api as dl

ds = dl.Dataset("/path/to/some/dataset")
results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
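
The size can then be read from the returned result records. As a sketch, assuming that (as in the command line report) the size of an annexed file is carried in a bytesize field, you could print it as shown below; inspect one record to confirm the key names for your DataLad version:

# print path and reported size (in bytes) per result record;
# 'bytesize' is an assumption -- check an actual record to confirm
for res in results:
    print(res.get("path"), res.get("bytesize"))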

9.5.5. Check whether a dataset contains an annex

Datasets can either be GitRepos (i.e., plain Git repositories; this happens when they are created with the --no-annex flag, for example) or AnnexRepos (i.e., datasets that contain an annex). Information about the type of repository is stored in the dataset report of datalad wtf (manual) under the key repo. Here is a one-liner to get this information:

$ datalad -f'{infos[dataset][repo]}' wtf
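
If you prefer a Python-side check, a minimal sketch is below. It relies on DataLad's internal AnnexRepo class, which is not part of the stable public API, so treat it as an assumption rather than a guaranteed interface:

import datalad.api as dl
from datalad.support.annexrepo import AnnexRepo

ds = dl.Dataset("/path/to/some/dataset")
# True if the dataset's repository has an annex
print(isinstance(ds.repo, AnnexRepo))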

9.5.6. Backing up datasets

In order to back up datasets, you can publish them to a Remote Indexed Archive (RIA) store or to a sibling dataset. The former solution does not require Git, git-annex, or DataLad to be installed on the machine that the backup is pushed to; the latter does.

To find out more about RIA stores, check out the online handbook. A sketch of how to implement a sibling for backups is below:

$ # create a backup sibling
$ datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
$ # publish a full backup of the current branch
$ datalad publish --to=myserver -r
$ # subsequently, publish updates to be backed up with
$ datalad publish --to=myserver -r --since= --missing=inherit

In order to push not only the current branch but all refs, add the option --publish-by-default "refs/*" to the datalad create-sibling (manual) call. Should you want to back up all annexed data, even past versions of files, use git annex sync (manual) to push to the sibling:

$ git annex sync --all --content <sibling-name>

For an in-depth explanation and example take a look at the GitHub issue that raised this question.

9.5.7. Retrieve partial content from a hierarchy of (uninstalled) datasets

In order to datalad get dataset content across a range of subdatasets, a bit of Unix command line magic can increase the efficiency of your command.

Example: consider retrieving all ribbon.nii.gz files for all subjects in the HCP open access dataset (a dataset with about 4500 subdatasets – read more about it in Scaling up: Managing 80TB and 15 million files from the HCP release). If all subject subdatasets are installed (e.g., with datalad get -n -r for a recursive installation without file retrieval), globbing with the shell works fine:

$ datalad get HCP1200/*/T1w/ribbon.nii.gz

The Gist Parallelize subdataset processing can show you how to parallelize this. If the subdatasets are not yet installed, globbing will not work, because the shell can’t expand non-existent paths. As an alternative, you can pipe the output of an (arbitrarily complex) datalad search (manual) command into datalad get:

$ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get

However, if you know the file locations within the dataset hierarchy and they are predictably named and consistent, you can create a file containing all paths to be retrieved and pipe that into datalad get as well:

$ # create file with all file paths
$ for sub in HCP1200/*; do echo ${sub}/T1w/ribbon.nii.gz; done > toget.txt
$ # pipe it into datalad get
$ cat toget.txt | xargs -n 100 datalad get

9.5.8. Speed up status reports in large datasets

In datasets with deep dataset hierarchies or large numbers of files, datalad status calls can be expensive. Handily, the command provides options that can boost performance by limiting what is being tested and reported. In order to speed up subdataset state evaluation, -e/--eval-subdataset-state can be set to commit or no. Instead of checking recursively for uncommitted modifications in subdatasets, this leads status to only compare the most recent commit shasum of the subdataset against the subdataset state recorded in the superdataset (commit), or to skip subdataset state evaluation completely (no). In order to speed up file type evaluation, the option -t/--report-filetype can be set to raw. This skips the evaluation of whether a symlink points to annexed file content (upon which, if true, the symlink would be reported as type “file”). Instead, all symlinks are reported as being of type “symlink”.
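
For example, the two options can be used (separately or combined) like this:

$ # compare only the recorded subdataset commits instead of checking for modifications
$ datalad status -r -e commit
$ # skip subdataset state evaluation entirely and report symlinks as symlinks
$ datalad status -r -e no -t raw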

9.5.9. Squashing git-annex history

A large number of commits in the git-annex branch (think: thousands rather than hundreds) can inflate your repository and increase the size of the .git directory, which can lead to slower cloning operations. There are, however, ways to shrink the commit history in the annex branch.

In order to squash the entire git-annex history into a single commit, run

$ git annex forget --drop-dead --force

Afterwards, if your dataset has a sibling, the branch needs to be force-pushed. If you attempt an operation to shrink your git-annex history, also check out this thread for more information on shrinking git-annex’s history as well as helpful safeguards and potential caveats.
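
A minimal sketch of such a force-push, assuming the sibling is a regular Git remote named origin (adjust the name to your setup):

$ # force-push the rewritten git-annex branch to the sibling
$ git push --force origin git-annex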