9.5. Gists
The more complex and larger your DataLad project, the more difficult it is to do efficient housekeeping. This section is a selection of code snippets tuned to perform specific, non-trivial tasks in datasets. Often, they are not limited to single commands of the version control tools you know, but combine other helpful command line tools and general Unix command line magic. Just like GitHub gists, it’s a collection of lightweight and easily accessible tips and tricks. For a more basic command overview, take a look at the DataLad cheat sheet. The tips collection of git-annex is also a very valuable resource.
9.5.1. Parallelize subdataset processing
DataLad cannot yet parallelize processes that are performed independently over a large number of subdatasets. Pushing across a dataset hierarchy, for example, is performed one subdataset after the other. Unix, however, has a few tools, such as xargs or the parallel tool of moreutils, that can assist.
Here is an example of pushing all subdatasets (and their respective subdatasets) recursively to their (identically named) siblings:
$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d
datalad -f '{path}' subdatasets discovers the paths of all subdatasets, and xargs hands them individually (-n 1) to a (recursive) datalad push (manual), but performs 10 of these operations in parallel (-P 10), thus achieving parallelization.
Here is an example of cross-dataset download parallelization:
$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d
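If you have GNU parallel installed (note that this is a different tool than the parallel from moreutils mentioned above), a sketch of the same cross-dataset download could look like this:
$ datalad -f '{path}' subdatasets | parallel -j 10 datalad get -d {}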
Operations like this can safely be attempted for all commands that are independent across subdatasets.
9.5.2. Check whether all file content is present locally
In order to check if all the files in a dataset have their file contents locally available, you can ask git-annex:
$ git annex find --not --in=here
Any file that does not have its contents locally available will be listed. If there are subdatasets you want to recurse into, use the following command
$ git submodule foreach --quiet --recursive \
'git annex find --not --in=here --format=$displaypath/$\{file\}\\n'
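If you only need a quick count of how many files lack local content in the top-level dataset, you can pipe the listing through wc:
$ git annex find --not --in=here | wc -l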
Alternatively, to get very comprehensive output, you can use
$ datalad -f json status --recursive --annex availability
The output will be returned as JSON, and the key has_content indicates local content availability (true or false). To filter through it, the command line tool jq works well:
$ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
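Conversely, to list only the paths of files whose content is not locally present, select the records in which has_content is false:
$ datalad -f json status --recursive --annex availability | jq '. | select(.has_content == false).path'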
9.5.3. Drop annexed files from all past commits
If there is annexed file content that is not used anymore (i.e., data in the annex that no files in any branch point to anymore, such as corrupt files), you can find out about it and remove this file content from your dataset (i.e., completely and irrecoverably delete it) with git-annex’s commands git annex unused (manual) and git annex dropunused (manual).
Find out which file contents are unused (not referenced by any current branch):
$ git annex unused
unused . (checking for unused data...)
Some annexed data is no longer used by any files in the repository.
NUMBER KEY
1 SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
2 SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
(To see where data was previously used, try: git log --stat -S'KEY')
(To remove unwanted data: git-annex dropunused NUMBER)
ok
Remove a single unused file by specifying its number in the listing above:
$ git annex dropunused 1
dropunused 1 ok
Or a range of unused data with
$ git annex dropunused 1-1000
Or all
$ git annex dropunused all
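git-annex can run the same check against a sibling instead of the local repository. A sketch, assuming a sibling named <sibling-name>:
$ git annex unused --from=<sibling-name>
$ git annex dropunused --from=<sibling-name> 1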
9.5.4. Getting single file sizes prior to downloading from the Python API and the CLI
For a single file, datalad status --annex -- myfile (manual) will report on the size of the file prior to a datalad get (manual).
If you want to do it in Python, try this approach:
import datalad.api as dl
ds = dl.Dataset("/path/to/some/dataset")
# each result record for an annexed file should report its size in bytes
# (look for a 'bytesize' field in the returned dictionaries)
results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
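On the command line, the same information can be extracted from the JSON-formatted result record with jq; this sketch assumes that the size of the annexed file is reported under the bytesize key:
$ datalad -f json status --annex basic -- myfile | jq '.bytesize'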
9.5.5. Check whether a dataset contains an annex
Datasets can be either GitRepos (i.e., sole Git repositories; this happens when they are created with the --no-annex flag, for example), or AnnexRepos (i.e., datasets that contain an annex). Which kind of repository a dataset is can be found in the dataset report of datalad wtf (manual) under the key repo.
Here is a one-liner to get this info:
$ datalad -f'{infos[dataset][repo]}' wtf
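As a DataLad-free rule of thumb, you can also check for the presence of a local git-annex branch, which only exists in repositories with an initialized annex (a heuristic sketch):
$ git show-ref --verify --quiet refs/heads/git-annex && echo "has an annex" || echo "plain Git repository"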
9.5.6. Backing-up datasets
In order to back up datasets you can publish them to a Remote Indexed Archive (RIA) store or to a sibling dataset. The former solution does not require Git, git-annex, or DataLad to be installed on the machine that the back-up is pushed to; the latter does. To find out more about RIA stores, check out the online version of the handbook. A sketch of how to implement a sibling for backups is below:
# create a backup sibling
datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
# publish a full backup of the current branch
datalad publish --to=myserver -r
# subsequently, publish updates to be backed up with
datalad publish --to=myserver -r --since= --missing=inherit
In order to push not only the current branch, but all refs, add the option --publish-by-default "refs/*" to the datalad create-sibling (manual) call. Should you want to back up all annexed data, even past versions of files, use git annex sync (manual) to push to the sibling:
$ git annex sync --all --content <sibling-name>
For an in-depth explanation and example take a look at the GitHub issue that raised this question.
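If you opt for the RIA store route mentioned at the beginning of this section, a minimal sketch could look like the following, assuming an SSH-accessible store at myserver:/path/to/ria-store (depending on your DataLad version, you may need to add --new-store-ok if the store does not yet exist):
# create a sibling in a RIA store and push to it
$ datalad create-sibling-ria -s backup ria+ssh://myserver/path/to/ria-store
$ datalad push --to backup -r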
9.5.7. Retrieve partial content from a hierarchy of (uninstalled) datasets
In order to datalad get dataset content across a range of subdatasets, a bit of Unix command line magic can increase the efficiency of your command.
Example: consider retrieving all ribbon.nii.gz files for all subjects in the HCP open access dataset (a dataset with about 4500 subdatasets – read more about it in Scaling up: Managing 80TB and 15 million files from the HCP release). If all subject-subdatasets are installed (e.g., with datalad get -n -r for a recursive installation without file retrieval), globbing with the shell works fine:
$ datalad get HCP1200/*/T1w/ribbon.nii.gz
The Gist Parallelize subdataset processing can show you how to parallelize this.
If the subdatasets are not yet installed, globbing will not work, because the shell can’t expand non-existent paths. As an alternative, you can pipe the output of an (arbitrarily complex) datalad search (manual) command into datalad get:
$ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get
However, if you know the file locations within the dataset hierarchy and they are predictably named and consistent, you can create a file containing all paths to be retrieved and pipe that into datalad get as well:
# create file with all file paths
$ for sub in HCP1200/*; do echo ${sub}/T1w/ribbon.nii.gz; done > toget.txt
# pipe it into datalad get
$ cat toget.txt | xargs -n 100 datalad get
9.5.8. Speed up status reports in large datasets
In datasets with deep dataset hierarchies or large numbers of files, datalad status calls can be expensive. Handily, the command provides options that can boost performance by limiting what is being tested and reported. In order to speed up subdataset state evaluation, -e/--eval-subdataset-state can be set to commit or no. Instead of checking recursively for uncommitted modifications in subdatasets, this leads status to only compare the most recent commit shasum in the subdataset against the recorded subdataset state in the superdataset (commit), or to skip subdataset state evaluation completely (no). In order to speed up file type evaluation, the option -t/--report-filetype can be set to raw. This skips the evaluation of whether symlinks are pointers to annexed files (upon which, if true, the symlink would be reported as type “file”). Instead, all symlinks will be reported as being of type “symlink”.
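Combining both options, a sped-up status call could look like this (a sketch; option availability may depend on your DataLad version):
$ datalad status -r -e commit -t raw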
9.5.9. Squashing git-annex history
A large number of commits in the git-annex branch (think: thousands rather than hundreds) can inflate your repository and increase the size of the .git directory, which can lead to slower cloning operations. There are, however, ways to shrink the commit history in the git-annex branch. In order to squash the entire git-annex history into a single commit, run
$ git annex forget --drop-dead --force
Afterwards, if your dataset has a sibling, the git-annex branch needs to be force-pushed to it. If you attempt an operation to shrink your git-annex history, also check out this thread for more information on shrinking git-annex’s history and helpful safeguards and potential caveats.
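A sketch of the required force-push of the git-annex branch, assuming a sibling named <sibling-name>:
$ git push --force <sibling-name> git-annex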