2.3. Fixing up too-large datasets

The previous section highlighted problems of too-large monorepos and advised strategies to prevent them. This section introduces some strategies to clean up and fix datasets that have gotten out of hand size-wise. If there are use cases you would like to see discussed here, or solutions you would like to propose, please get in touch.

2.3.1. Getting contents out of Git

Let’s say you did a datalad run (manual) with an analysis that put too many files under version control by Git, and you want to see them gone. Sticking to the FSL FEAT analysis example from earlier, you may, for example, want to get rid of every tsplot directory, as it contains results that are irrelevant for you.
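To gauge how much such files inflate the pure Git part of a dataset, one option is to inspect Git’s object store with a standard Git command before and after the clean-up:

$ git count-objects --verbose --human-readable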

Note that there is no way to drop the files, as they are stored in Git rather than in git-annex. Removing the files with plain file system or Git operations (rm, git rm) does not shrink your dataset either. The files are snapshotted, and even though they no longer exist in the current state of your dataset, they still exist in, and thus clutter, your dataset’s history. In order to really get committed files out of Git, you need to rewrite history. And for this you need heavy machinery: git-filter-repo[1]. It is a powerful and potentially dangerous tool to rewrite Git history. Treat this tool like a chainsaw: very helpful for heavy-duty tasks, but also life-threatening. The command git-filter-repo <path-specification> --force will “filter out”, i.e., remove, all files but the ones specified in <path-specification> from the dataset’s history. Before you use it, please make sure to read its help page thoroughly.
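As a minimal illustration of this default behavior, using a hypothetical directory name: the following call would keep only the files under code/ in the dataset’s history and remove everything else:

$ git-filter-repo --path code/ --force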

Installing git-filter-repo

git-filter-repo is not part of Git, but needs to be installed separately. Its GitHub repository contains detailed installation instructions; it can be installed via pip (pip install git-filter-repo), and it is available via standard package managers for macOS and some Linux distributions (mostly rpm-based ones).
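For example, a pip-based installation with a subsequent sanity check could look like this:

$ pip install git-filter-repo
$ git-filter-repo --version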

The general procedure you should follow is this (a condensed sketch of all steps is shown after the list):

  1. datalad clone (manual) the repository. This is a safeguard to protect your dataset should something go wrong. The clone you are creating will be your new, cleaned-up dataset.

  2. datalad get (manual) all the dataset contents by running datalad get . in the clone.

  3. git-filter-repo what you don’t want anymore (see below).

  4. Run git annex unused and a subsequent git annex dropunused all to remove stale file contents that are not referenced anymore.

  5. Finally, do some aggressive garbage collection with git gc --aggressive.
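Put together, with hypothetical dataset locations and the placeholder <path-specification> standing in for your actual filter, the procedure could look like this sketch:

$ datalad clone path/to/original-dataset cleaned-dataset
$ cd cleaned-dataset
$ datalad get .
$ git-filter-repo <path-specification> --force
$ git annex unused
$ git annex dropunused all
$ git gc --aggressive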

In order to get the hang of the git-filter-repo step, consider a directory structure similar to this exemplary run-wise FEAT analysis output structure:

$ tree
sub-*/run-*_<task>-<level>.feat
    ├── custom_timing_files
    ├── logs
    ├── reg
    ├── reg_standard
    │   ├── reg
    │   └── stats
    ├── stats
    └── tsplot

Each such sub-* directory contains about 3000 files, and the majority of them are irrelevant text files in tsplot/. In order to remove them from the dataset’s history for all subjects and runs, the following command can be used:

$ git-filter-repo --path-regex '^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$' --invert-paths --force

The --path-regex option and the regular expression '^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$'[2] match all file paths inside of the tsplot/ directories of all subjects and runs. The option --invert-paths then inverts this path specification, so that only the files in tsplot/ are filtered out. Note that non-regex path specifications are also possible, for example with the options --path-match or --path-glob, or with a specification placed in a file. Please see the manual of git-filter-repo for more information.
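For instance, a roughly equivalent glob-based specification for the FEAT example could look like the sketch below. Keep in mind that glob and regular expression matching semantics differ, so double-check that the pattern captures exactly the paths you expect:

$ git-filter-repo --path-glob 'sub-*/run-*.feat/tsplot/*' --invert-paths --force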

Footnotes