2.4. Summary

If you want to go big, DataLad is a suitable tool and can overcome shortcomings of Git and git-annex, if used correctly. Scaling up involves some thought, and in some instances compromise, though.

  • The general mechanism that allows scaling up is nesting datasets. This process can be done by hand or programmatically. Recursive operations ease working across a hierarchy of datasets and create a monorepo-like experience

  • Beware of accidentally placing to many (even small) files into Git’s version control in a single dataset! .gitignore files can keep irrelevant files out of version control, the explicit option datalad run (manual) may be helpful, and custom largefile rules in .gitattributes may be necessary to override dataset configurations such as text2git.

  • Don’t consider only the limits of version control software, but also the limits of your file system. Too many files in single directories can become problematic even without version control.

  • If things go wrong, it’s not all lost. There are ways to clean up your dataset if it ever gets clogged, although they are the software equivalent of a blowtorch and should be handled with care.

2.4.1. Now what can I do with it?

Go big, if you want to. Distribute 80TB of files or more, or version control large analyses with minimized performance loss of your version control tools.