7.3. Summary

The last two sections have, first of all, extended your knowledge of dataset nesting:

  • When subdatasets are created or installed, they are registered in the superdataset in their current version state (as identified by the hash of their most recent commit). For a freshly created subdataset, the most recent commit is also its first commit.

  • Once the subdataset evolves, the superdataset recognizes this as a modification of the subdataset’s version state. If you want to record this, you need to datalad save (manual) it in the superdataset:

    $ datalad save -m "a short summary of changes in subds" <path to subds>
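In practice, recording a subdataset’s evolution is a two-step save: first in the subdataset itself, then in the superdataset. A sketch with a hypothetical subdataset path (`subds`) and commit messages:

```shell
# inside the subdataset: save the changes in its own history
$ cd subds
$ datalad save -m "tweak analysis script"

# back in the superdataset: record the subdataset's new version state
$ cd ..
$ datalad save -m "a short summary of changes in subds" subds
```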

But more than nesting concepts, they have also extended your knowledge on reproducible analyses with datalad run (manual) and you have experienced for yourself why and how software containers can go hand-in-hand with DataLad:

  • A software container encapsulates a complete software environment, independent of the environment of the computer it runs on. This allows you to create or use a secluded software environment and share it together with your analysis to ensure computational reproducibility. The DataLad extension datalad containers makes this possible.

  • The command datalad containers-add (manual) registers a container image from a path or URL in your dataset.

  • If you use datalad containers-run (manual) instead of datalad run, you can reproducibly execute a command of your choice within the software environment.

  • A datalad rerun (manual) of a commit produced with datalad containers-run will re-execute the command in the same software environment.
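Put together, a container-based provenance workflow could look like the following sketch. The container name, image URL, and analysis command are placeholders, not the handbook’s actual example:

```shell
# register a container image under a chosen name (URL is a placeholder)
$ datalad containers-add my-software --url shub://some-user/some-container

# run an analysis command inside that environment, capturing provenance
$ datalad containers-run -n my-software \
    -m "run analysis in container" \
    python code/script.py

# later, or on another machine: re-execute in the same environment
$ datalad rerun
```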

7.3.1. Now what can I do with it?

For one, you will not be surprised if you ever see a subdataset shown as modified by datalad status (manual): You now know that if a subdataset evolves, its most recent state needs to be explicitly saved to the superdataset’s history.
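As a sketch, this is how such a modification would surface and be resolved; the subdataset path is hypothetical, and the exact status output wording may differ between DataLad versions:

```shell
# after the subdataset 'subds' gained new commits:
$ datalad status
# typically reports the subdataset as modified, along the lines of:
#   modified: subds (dataset)

# record the new subdataset state in the superdataset's history
$ datalad save -m "update subdataset to latest state" subds
```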

On a different matter, you are now able to capture and share analysis provenance that includes the relevant software environment. This not only makes your analysis projects automatically reproducible, but automatically computationally reproducible - you can make sure that your analysis runs on any computer with Singularity, regardless of the software environment on that computer. Even if you are unsure how to wrap up an environment into a software container image at this point, you can make use of hundreds of publicly available images on Singularity-Hub and Docker-Hub.
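For example, a publicly available image can be registered straight from a registry URL. The names and URLs below are placeholders; the `docker://` and `shub://` schemes select the registry:

```shell
# register a public image from Docker-Hub (docker:// URL scheme)
$ datalad containers-add my-env --url docker://someuser/someimage:latest

# or from Singularity-Hub (shub:// URL scheme)
$ datalad containers-add my-other-env --url shub://someuser/somecollection
```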

With this, you have also gotten a first glimpse of a DataLad extension: a Python module, installable with Python package managers such as pip, that extends DataLad’s functionality.
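The extension used in this section, for instance, is distributed on PyPI under the name datalad-container and can be installed like any other Python package:

```shell
$ pip install datalad-container
```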