Computational reproducibility with software containers

Just after submitting your midterm data analysis project, you get together with your friends. “I’m curious: So what kind of analyses did y’all carry out?” you ask. The variety of methods and datasets the others used is huge, and one analysis interests you in particular. Later that day, you decide to install this particular analysis dataset to learn more about the methods used in there. However, when you re-run your friend’s analysis script, it throws an error. Hastily, you call her – maybe she can quickly fix her script and resubmit the project with only minor delays. “I don’t know what you mean”, you hear in return. “On my machine, everything works fine!”

On their own, DataLad datasets can contain almost anything that is relevant to ensure reproducibility: Data, code, human-readable analysis descriptions (e.g., README.md files), provenance on the origin of all files obtained from elsewhere, and machine-readable records that link generated outputs to the commands, scripts, and data they were created from.

This, however, may not be sufficient to ensure that an analysis reproduces (i.e., produces the same or highly similar results), let alone works on a computer different from the one it was initially composed on. This is because the analysis depends not only on data and code, but also on the software environment that it is conducted in.

A lack of information about the operating system of the computer, the precise versions of installed software, or their configurations may make it impossible to replicate your analysis on a different machine, or even on your own machine once a new software update is installed. Therefore, it is important to communicate all details about the computational environment for an analysis as thoroughly as possible. Luckily, DataLad provides an extension that can link computational environments to datasets, the datalad containers extension [1].
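To get a feeling for how much there is to communicate, here is a minimal sketch that collects a few of these details by hand. The function name describe_environment and the selection of tools are illustrative assumptions, not part of DataLad, and a report like this captures only a small fraction of what a container preserves.

```shell
# Hypothetical helper: print a short description of the current software environment.
describe_environment() {
    # Kernel name and release of the machine the analysis runs on
    echo "kernel: $(uname -sr)"
    # Version of each tool, or a note if it is missing (the tool list is an assumption)
    for tool in python3 git singularity; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
        else
            echo "$tool: not installed"
        fi
    done
}

env_report=$(describe_environment)
echo "$env_report"
```

Even with many more entries, such a list only describes an environment. It does not let anyone recreate it, which is the gap that containers fill.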

This section will give a quick overview of what containers are and demonstrate how datalad-containers helps to capture the full provenance of an analysis by linking containers to datasets and analyses.

Containers

To put it simply, computational containers are cut-down virtual machines that allow you to package all software libraries and their dependencies (all in the precise version your analysis requires) into a bundle you can share with others. On your own and others’ machines, the container constitutes a secluded software environment that

  • contains the exact software environment that you specified, ready to run analyses in

  • does not affect any software outside of the container

Unlike virtual machines, software containers do not have their own operating system. Instead, they use basic services of the underlying operating system of the computer they run on (in a read-only fashion). This makes them lightweight and portable. By sharing software environments with containers, others (and also yourself) have easy access to the correct software without the need to modify the software environment of the machine the container runs on. Thus, containers are ideal to encapsulate the software environment and share it together with the analysis code and data to ensure computational reproducibility of your analyses, or to create a suitable software environment on a computer that you do not have permissions to deploy software on.

There are a number of different tools to create and use containers, with Docker being one of the most well-known of them. Although it is a powerful tool, it is only rarely used on high performance computing (HPC) infrastructure [2]. An alternative is Singularity. Both of these tools share core terminology:

Recipe

A text file template that lists all required components of the computational environment. It is made by a human user.

Image

This is built from the recipe file. It is a static filesystem inside a file, populated with the software specified in the recipe, and some initial configuration.

Container

A running instance of an Image that you can actually use for your computations. If you want to create and run your own software container, you start by writing a recipe file and then build an Image from it. Alternatively, you can also pull an Image built from a publicly shared recipe from the Hub of the tool you are using.

Hub

A storage resource to share and consume Images. There is Singularity Hub and Docker Hub. Both are optional, additional services not required to use software containers, but a convenient way to share recipes and have Images built from them by a service (instead of building them manually and locally).

Note that as of now, the datalad-containers extension supports Singularity and Docker images. Singularity furthermore is compatible with Docker – you can use Docker Images as a basis for Singularity Images, or run Docker Images with Singularity (even without having Docker installed).

Note

In order to use Singularity containers (and thus datalad containers), you have to install the software singularity.

Using datalad containers

One core feature of the datalad containers extension is that it registers computational containers to a dataset. This is done with the datalad containers-add command. Once a container is registered, arbitrary commands can be executed inside of it, i.e., in the precise software environment the container encapsulates. All this takes is to swap the datalad run command introduced in section Keeping track for the datalad containers-run command.

Let’s see this in action for the midterm_analysis dataset by rerunning the analysis you did for the midterm project within a Singularity container. We start by registering a container to the dataset. For this, we will pull an Image from Singularity Hub. This Image was made for the handbook, and it contains the relevant Python setup for the analysis. Its recipe lives in the handbook’s resources repository, and the Image is built from the recipe via Singularity Hub. If you’re curious how to create a Singularity Image, the hidden section below has some pointers:

Find out more: How to make a Singularity Image

Singularity containers are built from Image files, often called “recipes”, that hold a “definition” of the software container and its contents and components. The Singularity documentation has its own tutorial on how to build such Images from scratch. An alternative to writing the Image file by hand is to use Neurodocker. This command-line program can help you generate custom Singularity recipes (and also Dockerfiles, from which Docker Images are built). A wonderful tutorial on how to use Neurodocker is this introduction by Michael Notter.

Once a recipe exists, the command

sudo singularity build <NAME> <RECIPE>

will build a container (called <NAME>) from the recipe. Note that this command requires root privileges (“sudo”). You can build the container on any machine, though, not necessarily the one that will later run the analysis: for example, you could build it on your own laptop and run it on a compute cluster. Alternatively, Singularity Hub integrates with GitHub and builds containers from Images pushed to repositories on GitHub. The docs give you a set of instructions on how to do this.
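As a rough sketch of what such a recipe can look like, the snippet below writes a small Singularity definition file into a temporary directory. The base image, the package list, and all file names are assumptions chosen for illustration; they are not the contents of the handbook’s actual recipe.

```shell
# Write a hypothetical recipe into a temporary directory.
recipe_dir=$(mktemp -d)
recipe="$recipe_dir/Singularity.def"

# The header (Bootstrap/From) names a Docker image to use as the basis;
# python:3.8-slim is an assumed example, not the image used by the handbook.
cat > "$recipe" <<'EOF'
Bootstrap: docker
From: python:3.8-slim

%post
    # Install the Python packages the analysis needs (assumed list)
    pip install --no-cache-dir pandas seaborn scikit-learn

%runscript
    # Default action of the container: hand all arguments to Python
    exec python3 "$@"
EOF

recipe_text=$(cat "$recipe")
echo "$recipe_text"

# Building needs Singularity and root privileges, so the command is only printed here.
echo "To build: sudo singularity build midterm-software.sif $recipe"
```

The %post section runs once at build time, which is where the software versions get fixed. Pinning exact package versions there (for example pandas==<version>) would make the resulting Image more predictable.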

The datalad containers-add command takes an arbitrary name to give to the container, and a path or URL to a container Image:

# we are in the midterm_project subdataset
$ datalad containers-add midterm-software --url shub://adswa/resources:1
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/dl-101/DataLad-101/midterm_project/.datalad/environments/midterm-software/image (file)
action summary:
  add (ok: 1)
  containers_add (ok: 1)
  save (ok: 1)

This command downloaded the container from Singularity Hub, added it to the midterm_project dataset, and recorded basic information on the container under its name “midterm-software” in the dataset’s configuration at .datalad/config.

Find out more: What has been added to .datalad/config?

$ cat .datalad/config
[datalad "dataset"]
	id = 9efd113c-32ac-11ea-b7a4-e86a64c8054c
[datalad "containers.midterm-software"]
	updateurl = shub://adswa/resources:1
	image = .datalad/environments/midterm-software/image
	cmdexec = singularity exec {img} {cmd}

This recorded the Image’s origin on Singularity Hub and the location of the Image in the dataset under .datalad/environments/<NAME>/image, and it specifies the way in which the container should be used: The line

cmdexec = singularity exec {img} {cmd}

can be read as: “If this container is used, take the cmd (what you wrap in a datalad containers-run command) and plug it into a singularity exec command.” The mode of calling Singularity, namely exec, means that the command will be executed inside of the container.

Note that the Image is saved under .datalad/environments and the configuration is done in .datalad/config – as these files are version controlled and shared together with a dataset, your software container and the information where it can be re-obtained from are linked to your dataset.
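To make the placeholder logic of cmdexec tangible, here is a small sketch that imitates the substitution with sed. It is only an illustration of the idea; how the extension performs the expansion internally is not shown here, and the variable names are made up for this example.

```shell
# Template as recorded in .datalad/config, plus the values that get plugged in.
cmdexec='singularity exec {img} {cmd}'
img='.datalad/environments/midterm-software/image'
cmd='python3 code/script.py'

# Replace each placeholder once; "|" is the sed delimiter because the values contain "/".
expanded=$(printf '%s\n' "$cmdexec" | sed -e "s|{img}|$img|" -e "s|{cmd}|$cmd|")
echo "$expanded"
```

The result is the full singularity exec call with the Image path and your command filled in, which is the form in which a containers-run execution is recorded in the dataset history.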

This is how the containers-add command is recorded in your history:

$ git log -n 1 -p
commit 4464d562134fc9a8e4a34344326d86c711f3d72f
Author: Elena Piscopia <elena@example.net>
Date:   Thu Jan 9 07:53:22 2020 +0100

    [DATALAD] Configure containerized environment 'midterm-software'

diff --git a/.datalad/config b/.datalad/config
index b233709..6bfd89e 100644
--- a/.datalad/config
+++ b/.datalad/config
@@ -1,2 +1,6 @@
 [datalad "dataset"]
	id = 9efd113c-32ac-11ea-b7a4-e86a64c8054c
+[datalad "containers.midterm-software"]
+	updateurl = shub://adswa/resources:1
+	image = .datalad/environments/midterm-software/image
+	cmdexec = singularity exec {img} {cmd}
diff --git a/.datalad/environments/midterm-software/image b/.datalad/environments/midterm-software/image
new file mode 120000
index 0000000..800282a
--- /dev/null
+++ b/.datalad/environments/midterm-software/image
@@ -0,0 +1 @@
+../../../.git/annex/objects/zJ/8f/MD5E-s232214559--49dcb6ac1a5787636c9897c4d4df7e10/MD5E-s232214559--49dcb6ac1a5787636c9897c4d4df7e10
\ No newline at end of file

Now that we have a complete computational environment linked to the midterm_project dataset, we can execute commands in this environment. Let us for example try to repeat the datalad run command from the section YODA-compliant data analysis projects as a datalad containers-run command.

The previous run command looked like this:

$ datalad run -m "analyze iris data with classification analysis" \
  --input "input/iris.csv" \
  --output "prediction_report.csv" \
  --output "pairwise_relationships.png" \
  "python3 code/script.py"

What would it look like as a containers-run command?

$ datalad containers-run -m "rerun analysis in container" \
  --container-name midterm-software \
  --input "input/iris.csv" \
  --output "prediction_report.csv" \
  --output "pairwise_relationships.png" \
  "python3 code/script.py"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
unlock(ok): pairwise_relationships.png (file)
unlock(ok): prediction_report.csv (file)
add(ok): pairwise_relationships.png (file)
add(ok): prediction_report.csv (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 4)
  save (notneeded: 1, ok: 1)
  unlock (ok: 2)

Almost exactly like a datalad run command! The only additional parameter is --container-name. At this point, though, the --container-name flag is optional because there is only a single container registered to the dataset. But if your dataset contains more than one container, you will need to specify the name of the container you want to use in your command. The complete command’s structure looks like this:

$ datalad containers-run --container-name <containername> [-m ...] [--input ...] [--output ...] <COMMAND>

Find out more: How can I list available containers or remove them?

The command datalad containers-list will list all containers in the current dataset:

$ datalad containers-list
midterm-software -> .datalad/environments/midterm-software/image

The command datalad containers-remove will remove a container from the dataset if a container with the given name exists. Note that this will remove not only the Image from the dataset, but also the configuration for it in .datalad/config.
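Because registered containers are recorded in .datalad/config, you can also get a rough picture of them without DataLad at hand. The sketch below recreates the configuration shown earlier in a temporary directory and extracts the container names with sed; the temporary location and the variable names are assumptions for illustration only.

```shell
# Work in a throwaway directory so that no real dataset is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/.datalad"
cat > "$workdir/.datalad/config" <<'EOF'
[datalad "dataset"]
	id = 9efd113c-32ac-11ea-b7a4-e86a64c8054c
[datalad "containers.midterm-software"]
	updateurl = shub://adswa/resources:1
	image = .datalad/environments/midterm-software/image
	cmdexec = singularity exec {img} {cmd}
EOF

# Each registered container has its own section named containers.<NAME>.
registered=$(sed -n 's/^\[datalad "containers\.\(.*\)"]$/\1/p' "$workdir/.datalad/config")
count=$(printf '%s\n' "$registered" | grep -c .)
echo "registered containers: $registered"

# With a single container, the --container-name flag of containers-run can be left out.
if [ "$count" -eq 1 ]; then
    echo "only one container registered, --container-name is optional"
else
    echo "$count containers registered, --container-name is required"
fi

rm -r "$workdir"
```

In an actual dataset you would point sed at the existing .datalad/config instead of writing a copy, or simply rely on datalad containers-list.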

Here is what the history entry looks like:

$ git log -p -n 1
commit 936952fd146be26992fa172aa127e5a86f93861d
Author: Elena Piscopia <elena@example.net>
Date:   Thu Jan 9 07:53:28 2020 +0100

    [DATALAD RUNCMD] rerun analysis in container
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "singularity exec .datalad/environments/midterm-software/image python3 code/script.py",
     "dsid": "9efd113c-32ac-11ea-b7a4-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [
      ".datalad/environments/midterm-software/image"
     ],
     "inputs": [
      "input/iris.csv"
     ],
     "outputs": [
      "prediction_report.csv",
      "pairwise_relationships.png"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/pairwise_relationships.png b/pairwise_relationships.png
index 2f69f64..6d00014 120000
--- a/pairwise_relationships.png
+++ b/pairwise_relationships.png
@@ -1 +1 @@
-.git/annex/objects/Pz/Xm/MD5E-s175662--8a9a3e225267f327e1671cf6d01b1957.png/MD5E-s175662--8a9a3e225267f327e1671cf6d01b1957.png
\ No newline at end of file
+.git/annex/objects/z3/23/MD5E-s176597--87d8a72f5f7b1f4f191d0be1bfd15288.png/MD5E-s176597--87d8a72f5f7b1f4f191d0be1bfd15288.png
\ No newline at end of file
diff --git a/prediction_report.csv b/prediction_report.csv
index 42d194b..b46a2d5 120000
--- a/prediction_report.csv
+++ b/prediction_report.csv
@@ -1 +1 @@
-.git/annex/objects/8q/6M/MD5E-s345--a88cab39b1a5ec59ace322225cc88bc9.csv/MD5E-s345--a88cab39b1a5ec59ace322225cc88bc9.csv
\ No newline at end of file
+.git/annex/objects/VF/27/MD5E-s347--7d984f53676358222aa7aa55980f205b.csv/MD5E-s347--7d984f53676358222aa7aa55980f205b.csv
\ No newline at end of file

If you were to rerun this commit, the command would be re-executed in the software container registered to the dataset. If you shared the dataset with a friend and they reran this commit, the Image would first be obtained from its registered URL, so your friend would get the correct execution environment automatically.
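If you want to check what exactly a rerun would execute, you can look at the machine-readable record in the commit message. The following sketch works on a copy of the commit message from the history entry above and pulls out the recorded command with sed. The variable names are made up for this example, and the pattern matching is a simplification: a proper JSON parser would be more robust.

```shell
# Commit message copied from the history entry (without git log's indentation).
commit_msg=$(cat <<'EOF'
[DATALAD RUNCMD] rerun analysis in container

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "singularity exec .datalad/environments/midterm-software/image python3 code/script.py",
 "dsid": "9efd113c-32ac-11ea-b7a4-e86a64c8054c",
 "exit": 0,
 "extra_inputs": [
  ".datalad/environments/midterm-software/image"
 ],
 "inputs": [
  "input/iris.csv"
 ],
 "outputs": [
  "prediction_report.csv",
  "pairwise_relationships.png"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOF
)

# Keep only the JSON record between the two marker lines, then drop the markers.
run_record=$(printf '%s\n' "$commit_msg" |
    sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
    sed '1d;$d')

# Extract the value of the "cmd" field with a simple pattern.
recorded_cmd=$(printf '%s\n' "$run_record" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
echo "$recorded_cmd"
```

In a real dataset, git log -1 --format=%B <commit> prints the raw commit message, so its output could take the place of the copied text. Note that the Image also shows up under "extra_inputs", which is how the record ties the software environment to the results.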

Note that because this new containers-run command modified the midterm_project subdataset, we also need to save the most recent state of the subdataset to the superdataset DataLad-101.

$ cd ../
$ datalad status
 modified: midterm_project (dataset)
$ datalad save -d . -m "add container and execute analysis within container" midterm_project
add(ok): midterm_project (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Software containers, the datalad-containers extension, and DataLad thus work well together to make your analysis completely reproducible – by not only linking code, data, and outputs, but also the software environment of an analysis. And this does not only benefit your future self, but also whomever you share your dataset with, as the information about the container is shared together with the dataset. How cool is that?

If you are interested in more, you can read about another example of datalad containers-run in the usecase An automatically reproducible analysis of public neuroimaging data.

Footnotes

1

To read more about DataLad’s extensions, see section DataLad’s extensions.

2

The main reason why Docker is not deployed on HPC systems is that it grants users “superuser privileges”. On multi-user systems such as HPC, users should not have those privileges, as it would enable them to tamper with others’ or shared data and resources, posing a severe security threat.