8.5. Walk-through: Git LFS as a special remote on GitHub

Some repository hosting services provide for-pay support for large files, and can thus be used as special remotes as well. GitHub and GitLab, for example, support Git Large File Storage (Git LFS) for managing data files using Git. A free GitHub subscription allows up to 1GB of free storage and up to 1GB of bandwidth monthly. As such, it might be sufficient for some use cases, and could be configured quite easily.

In order to store annexed dataset contents on GitHub, we need first to create a repository on GitHub:

$ datalad create-sibling-github test-github-lfs --access-protocol ssh
.: github(-) [git@github.com:yarikoptic/test-github-lfs.git (git)]
'git@github.com:yarikoptic/test-github-lfs.git' configured as sibling 'github' for <Dataset path=/tmp/test-github-lfs>

and then initialize a special remote of type git-lfs, pointing to the same GitHub repository:

$ git annex initremote github-lfs type=git-lfs url=git@github.com:yarikoptic/test-github-lfs autoenable=true encryption=none embedcreds=no

If you would like to compress data in Git LFS, you need to take a detour via encryption during git annex initremote (manual) – this has compression as a convenient side effect. Here is an example initialization:

$ git annex initremote --force github-lfs type=git-lfs url=git@github.com:yarikoptic/test-github-lfs autoenable=true encryption=shared

With this single step it becomes possible to transfer contents to GitHub:

$ git annex copy --to=github-lfs file.dat
copy file.dat (to github-lfs...)
(recording state in git...)

and the entire dataset to the same GitHub repository:

$ datalad push --to=github
[INFO   ] Publishing <Dataset path=/tmp/test-github-lfs> to github
publish(ok): . (dataset) [pushed to github: ['[new branch]', '[new branch]']]

Alternatively, to make publication even easier for you, the dataset provider, you can establish a publication dependency such that a datalad push (manual) performs the data transfer to git-lfs automatically:

$ datalad siblings configure -s github --publish-depends github-lfs
$ # afterwards, only datalad push is needed to publish dataset contents and history
$ datalad push --to github

Consumers of your dataset should be able to retrieve files right after cloning the dataset without a siblings enable command, as shown in section Walk-through: Dropbox as a special remote, because of the autoenable=true configuration for the special remote.

No drop from LFS

Unfortunately, it is impossible to datalad drop (manual) contents from Git LFS: help.github.com/en/github/managing-large-files