2.3. Fixing up too-large datasets¶
The previous section highlighted problems of too-large monorepos and advised strategies to prevent them. This section introduces some strategies to clean and fix up datasets that got out of hand size-wise. If there are use cases you would like to see discussed here, or solutions you would like to propose, please get in touch.
2.3.1. Getting contents out of Git¶
Let’s say you did a datalad run (manual) with an analysis that put too many files under version control by Git, and you want to see them gone. Sticking to the FSL FEAT analysis example from earlier, you may, for example, want to get rid of every tsplot directory, as it contains results that are irrelevant for you.
Note that there is no way to drop the files, as they are in Git instead of git-annex. Removing the files with plain file system or Git operations (rm, git rm) also does not shrink your dataset. The files are snapshotted, and even though they don’t exist in the current state of your dataset anymore, they still exist in – and thus clutter – your dataset’s history. In order to really get committed files out of Git, you need to rewrite history. And for this you need heavy machinery: git-filter-repo[1].
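That a file deleted with git rm lives on in history can be seen in a small toy repository (a sketch; the directory and file names are made-up examples):

```shell
# Toy repository: a file removed with 'git rm' is still reachable
# from the earlier commit, so the repository does not shrink.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "demo@example.com"
git config user.name "Demo"
echo "bulky result" > tsplot.txt
git add tsplot.txt && git commit -q -m "add result"
git rm -q tsplot.txt && git commit -q -m "remove result"
# the blob is still reachable from the first commit:
git show HEAD~1:tsplot.txt
```

The last command still prints the file content, because the first commit continues to reference it.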
It is a powerful and potentially dangerous tool to rewrite Git history. Treat this tool like a chainsaw: very helpful for heavy-duty tasks, but also life-threatening. The command

git-filter-repo <path-specification> --force

will “filter out”, i.e., remove, all files but the ones specified in <path-specification> from the dataset’s history. Before you use it, please make sure to read its help page thoroughly.
Installing git-filter-repo

git-filter-repo is not part of Git but needs to be installed separately. Its GitHub repository contains more detailed instructions, but it is possible to install via pip (pip install git-filter-repo), and it is available via standard package managers for macOS and some Linux distributions (mostly rpm-based ones).
The general procedure you should follow is the following:

1. datalad clone (manual) the repository. This is a safeguard to protect your dataset should something go wrong. The clone you are creating will be your new, cleaned-up dataset.
2. datalad get (manual) all the dataset contents by running datalad get . in the clone.
3. git-filter-repo what you don’t want anymore (see below).
4. Run git annex unused and a subsequent git annex dropunused all to remove stale file contents that are not referenced anymore.
5. Finally, do some aggressive garbage collection with git gc --aggressive.
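Put together, the steps above could look like this (a sketch only; the dataset location and the path specification are placeholders you need to adapt to your own dataset):

```shell
# Sketch of the clean-up procedure; source location and path
# specification are placeholders, not real values.
# 1. Clone the dataset to work on a safeguard copy
datalad clone /path/to/original-dataset cleaned-dataset
cd cleaned-dataset
# 2. Retrieve all annexed file contents
datalad get .
# 3. Rewrite history, filtering out the unwanted paths (see below)
git-filter-repo --path-regex '<path-specification>' --invert-paths --force
# 4. Drop annexed contents no longer referenced by any commit
git annex unused
git annex dropunused all
# 5. Aggressively garbage-collect loose Git objects
git gc --aggressive
```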
In order to get the hang of the git-filter-repo step, consider a directory structure similar to this exemplary run-wise FEAT analysis output structure:
$ tree
sub-*/run-*_<task>-<level>.feat
├── custom_timing_files
├── logs
├── reg
├── reg_standard
│ ├── reg
│ └── stats
├── stats
└── tsplot
Each of these sub-* directories contains about 3000 files, and the majority of them are irrelevant text files in tsplot/.
In order to remove them for all subjects and runs from the dataset’s history, the following command can be used:

$ git-filter-repo --path-regex '^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$' --invert-paths --force

The --path-regex option and the regular expression '^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$'[2] match all file paths inside of the tsplot/ directories of all subjects and runs.
The option --invert-paths then inverts this path specification, so that only the files in tsplot/ are filtered out. Note that non-regex-based path specifications are also possible, for example with the options --path-match or --path-glob, or with a specification placed in a file. Please see the manual of git-filter-repo for more information.
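Before rewriting anything, it can be reassuring to check what a given expression actually matches by running it over a sample path listing with grep -E (a sketch; the two file paths below are made-up examples following the FEAT layout above):

```shell
# Made-up example paths; only the one under tsplot/ should match.
printf '%s\n' \
  'sub-01/run-1_task-level.feat/tsplot/tsplot_zstat1.txt' \
  'sub-01/run-1_task-level.feat/stats/zstat1.nii.gz' \
  | grep -E '^sub-[0-9]{2}/run-[0-9].*\.feat/tsplot/.*$'
```

Only the tsplot/ path is printed; the stats/ path does not match and would thus survive the --invert-paths filtering.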
Footnotes