7.2. Computational reproducibility with software containers¶
Just after submitting your midterm data analysis project, you get together
with your friends. “I’m curious: So what kind of analyses did y’all carry out?”
you ask. The variety of methods and datasets the others used is huge, and
one analysis interests you in particular. Later that day, you decide to
install this particular analysis dataset to learn more about the methods used
in it. However, when you datalad rerun
(manual) your friend's analysis script,
it throws an error. Hastily, you call her – maybe she can quickly fix her
script and resubmit the project with only minor delays. “I don’t know what
you mean”, you hear in return.
“On my machine, everything works fine!”
On their own, DataLad datasets can contain almost anything that is relevant to
ensure reproducibility: data, code, human-readable analysis descriptions
(e.g., README.md
files), provenance on the origin of all files
obtained from elsewhere, and machine-readable records that link generated
outputs to the commands, scripts, and data they were created from.
This, however, may not be sufficient to ensure that an analysis reproduces (i.e., produces the same or highly similar results), let alone runs on a computer different from the one it was initially composed on. This is because an analysis depends not only on data and code, but also on the software environment it is conducted in.
A lack of information about the operating system of the computer, the precise versions of installed software, or their configurations may make it impossible to replicate your analysis on a different machine, or even on your own machine once a new software update is installed. Therefore, it is important to communicate all details about the computational environment for an analysis as thoroughly as possible. Luckily, DataLad provides an extension that can link computational environments to datasets: the datalad-container extension.
This section will give a quick overview of what containers are and
demonstrate how datalad-container
helps to capture full provenance of an
analysis by linking containers to datasets and analyses.
Install the datalad-container extension
This section uses the DataLad extension datalad-container
.
Like other extensions, it is a stand-alone Python package, and can be installed using pip:
$ pip install datalad-container
As with DataLad and other Python packages, you might want to do the installation in a virtual environment.
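A minimal sketch of such a setup with Python's built-in venv module could look like this (the environment location is arbitrary):
$ python3 -m venv ~/.venvs/datalad
$ source ~/.venvs/datalad/bin/activate
$ pip install datalad-container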
7.2.1. Containers¶
To put it simply, computational containers are cut-down virtual machines that allow you to package all software libraries and their dependencies (in the precise versions your analysis requires) into a bundle you can share with others. On your own and others' machines, the container constitutes a secluded software environment that

- contains the exact software environment that you specified, ready to run analyses, and
- does not affect any software outside of the container.
Unlike virtual machines, software containers do not run a full operating system on virtualized hardware. Instead, they use basic services of the host operating system (in a read-only fashion). This makes them lightweight and still portable. By sharing software environments with containers, others (and also yourself) have easy access to the correct software without the need to modify the software environment of the machine the container runs on. Thus, containers are ideal to encapsulate the software environment and share it together with the analysis code and data to ensure computational reproducibility of your analyses, or to create a suitable software environment on a computer that you do not have permissions to deploy software on.
There are a number of different tools to create and use containers, with Docker being one of the most well-known. While it is a powerful tool, it is only rarely used on high performance computing (HPC) infrastructure[1]. An alternative is Singularity. Both of these tools share core terminology:
- container recipe
A text file that lists all required components of the computational environment. It is made by a human user.
- container image
This is built from the recipe file. It is a static file system inside a file, populated with the software specified in the recipe, and some initial configuration.
- container
A running instance of an image that you can actually use for your computations. If you want to create and run your own software container, you start by writing a recipe file and build an image from it. Alternatively, you can also pull an image built from a publicly shared recipe from the hub of the tool you are using.
- hub
A storage resource to share and consume images. Examples are Singularity-Hub, Docker-Hub, and Amazon ECR, which hosts Docker images.
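To make this terminology concrete, here is a hypothetical Singularity session (the image choice is an arbitrary example, and a recent Singularity version is assumed) that pulls an image file from a hub and then runs a container from it:
$ # download the image; Singularity stores it as alpine_latest.sif
$ singularity pull docker://alpine:latest
$ # run a container from the image and execute a command inside it
$ singularity exec alpine_latest.sif cat /etc/os-release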
Note that as of now, the datalad-container
extension supports
Singularity and Docker images.
Singularity furthermore is compatible with Docker – you can use
Docker images as a basis for Singularity images, or run Docker images with
Singularity (even without having Docker installed).
See the Windows-wit on Docker for installation options.
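For example, the following (hypothetical) invocation lets Singularity pull a Docker image and execute a command inside the resulting container, without Docker being installed:
$ singularity exec docker://python:3.9-slim python3 --version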
Additional requirement: Singularity
To use Singularity containers you have to install the software singularity.
Docker installation Windows
The software singularity is not available for Windows.
Windows users therefore need to install Docker.
The currently recommended way to do so is to install Docker Desktop and use its “WSL2” backend (a choice one can set during the installation).
In the case of an “outdated WSL kernel version” issue, run wsl --update
in a regular Windows Command Prompt (CMD).
After the installation, run Docker Desktop, and wait several minutes for it to start the Docker engine in the background.
To verify that everything works as it should, run docker ps
in a Windows Command Prompt (CMD).
If it reports an error asking “Is the docker daemon running?”, give it a few more minutes to let Docker Desktop start it.
If it can’t find the docker command, something went wrong during installation.
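If docker ps works, an optional end-to-end check is to run Docker's minimal test image from the same prompt:
$ docker run --rm hello-world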
7.2.2. Using datalad containers¶
One core feature of the datalad-container
extension is that it registers
computational containers with a dataset. This is done with the
datalad containers-add
(manual) command.
Once a container is registered, arbitrary commands can be executed inside of
it, i.e., in the precise software environment the container encapsulates. All it
takes is swapping the datalad run
(manual) command introduced in the
section Keeping track for the datalad containers-run
(manual) command.
Let’s see this in action for the midterm_analysis
dataset by rerunning
the analysis you did for the midterm project within a Singularity container.
We start by registering a container to the dataset.
For this, we will pull an image from Singularity-Hub. This image was made
for the handbook, and it contains the relevant Python setup for
the analysis. Its recipe lives in the handbook’s
resources repository.
If you are curious how to create a Singularity image, the Find-out-more on this topic has some pointers:
How to make a Singularity image
Singularity images are built from recipe files, often
called “definition files”, that hold a description of the software container and its
contents and components. The
singularity documentation
has its own tutorial on how to build such images from scratch.
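To give a rough impression, here is a minimal, hypothetical definition file; the base image and packages are just examples, not the recipe used for this section:
Bootstrap: docker
From: python:3.9-slim

%post
    # install the Python libraries the analysis requires (example choices)
    pip install --no-cache-dir pandas seaborn scikit-learn

%runscript
    # forward all arguments to Python inside the container
    exec python "$@"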
An alternative to writing the recipe file by hand is to use
Neurodocker. This
command-line program can help you generate custom Singularity recipes (and
also Dockerfiles
, from which Docker images are built). A wonderful tutorial
on how to use Neurodocker is
this introduction
by Michael Notter.
Once a recipe exists, the command
$ sudo singularity build <NAME> <RECIPE>
will build a container image (called <NAME>
) from the recipe. Note that this
command requires root
privileges (”sudo
”). You can build the image
on any machine, though, not necessarily the one that is later supposed to
actually run the analysis, e.g., your own laptop rather than a compute cluster.
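As a concrete sketch, assuming the recipe is stored in a file called Singularity and the image should be named midterm.sif (both names are examples):
$ sudo singularity build midterm.sif Singularity
$ # quick sanity check: query the software inside the image
$ singularity exec midterm.sif python3 --version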
The datalad containers-add
command takes an arbitrary
name to give to the container, and a path or URL to a container image:
$ # we are in the midterm_project subdataset
$ datalad containers-add midterm-software --url shub://adswa/resources:2
[INFO] Initializing special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/dl-101/DataLad-101/midterm_project/.datalad/environments/midterm-software/image (file)
How do I add an image from Docker-Hub, Amazon ECR, or a local container?
Should the image you want to use sit on Docker-Hub, specify the --url
option prefixed with docker://
or dhub://
instead of shub://
:
$ datalad containers-add midterm-software --url docker://adswa/resources:2
If your image lives on Amazon ECR, use a dhub://
prefix followed by the AWS ECR URL as in
$ datalad containers-add --url dhub://12345678.dkr.ecr.us-west-2.amazonaws.com/maze-code/data-import:latest data-import
If you want to add a container that exists locally, specify the path to it like this:
$ datalad containers-add midterm-software --url path/to/container
This command downloaded the container from Singularity-Hub, added it to
the midterm_project
dataset, and recorded basic information on the
container under its name “midterm-software” in the dataset’s configuration at
.datalad/config
. You can find out more about these records in a dedicated find-out-more on additional container configurations.
What changes in .datalad/config when one adds a container?
$ cat .datalad/config
[datalad "dataset"]
id = d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
[datalad "containers.midterm-software"]
image = .datalad/environments/midterm-software/image
cmdexec = singularity exec {img} {cmd}
This recorded the image’s origin on Singularity-Hub, the location of the
image in the dataset under .datalad/environments/<NAME>/image
, and the
way in which the container should be used: The line
cmdexec = singularity exec {img} {cmd}
can be read as: “If this container is used, take the cmd
(what you wrap in a
datalad containers-run
command) and plug it into a
singularity exec
command.” The mode of calling Singularity,
namely exec
, means that the command will be executed inside of the container.
You can configure this call format by modifying it in the config file, or calling datalad containers-add
with the option --call-fmt <alternative format>
.
This can be useful to, for example, automatically bind-mount the current working directory in the container.
In the alternative call format, the placeholders {img}
, {cmd}
, and {img_dspath}
(a relative path to the dataset containing the image) are available.
Any other variable in curly brackets needs to be escaped with a second pair of curly brackets.
Here is an example call format that bind-mounts the current working directory (and thus the dataset) automatically:
$ datalad containers-add <name> --call-fmt 'singularity exec -B {{pwd}} --cleanenv {img} {cmd}'
Note that the image is saved under .datalad/environments
and the
configuration is done in .datalad/config
– as these files are version
controlled and shared together with a dataset, your software
container and the information on where it can be re-obtained are linked
to your dataset.
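In a clone of the dataset, the image therefore behaves like any other annexed file. For example, it could be fetched on demand like this (a sketch; datalad containers-run will also retrieve it automatically when needed):
$ datalad get .datalad/environments/midterm-software/image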
This is how the containers-add
command is recorded in your history:
$ git log -n 1 -p
commit 54aad5de✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD] Configure containerized environment 'midterm-software'
diff --git a/.datalad/config b/.datalad/config
index e99ec14..ad3e5d8 100644
--- a/.datalad/config
+++ b/.datalad/config
@@ -1,2 +1,5 @@
[datalad "dataset"]
id = d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
+[datalad "containers.midterm-software"]
+ image = .datalad/environments/midterm-software/image
+ cmdexec = singularity exec {img} {cmd}
diff --git a/.datalad/environments/midterm-software/image b/.datalad/environments/midterm-software/image
new file mode 120000
index 0000000..75c8b41
--- /dev/null
+++ b/.datalad/environments/midterm-software/image
@@ -0,0 +1 @@
+../../../.git/annex/objects/F1/K3/✂/MD5E-s230694943--944b0300✂MD5
\ No newline at end of file
Such configurations can, among other things, be important to ensure correct container invocation on specific systems or across systems. One example is bind-mounting directories into containers, i.e., making a specific directory and its contents available inside a container. Different containerization software (versions) or configurations of those determine default bind-mounts on a given system. Thus, depending on the system and the location of the dataset on this system, a shared dataset may be automatically bind-mounted or not. To ensure that the dataset is correctly bind-mounted on all systems, let’s add a call-format specification with a bind-mount to the current working directory following the information in the find-out-more on additional container configurations.
$ git config -f .datalad/config datalad.containers.midterm-software.cmdexec 'singularity exec -B {{pwd}} {img} {cmd}'
$ datalad save -m "Modify the container call format to bind-mount the working directory"
add(ok): .datalad/config (file)
save(ok): . (dataset)
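If you want to double-check the modified call format, you can query the configuration directly (an optional sanity check):
$ git config -f .datalad/config --get datalad.containers.midterm-software.cmdexec
singularity exec -B {{pwd}} {img} {cmd}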
Now that we have a complete computational environment linked to the midterm_project
dataset, we can execute commands in this environment. Let us, for example, try to repeat
the datalad run
command from the section YODA-compliant data analysis projects as a
datalad containers-run
command.
The previous run
command looked like this:
$ datalad run -m "analyze iris data with classification analysis" \
--input "input/iris.csv" \
--output "pairwise_relationships.png" \
--output "prediction_report.csv" \
"python3 code/script.py {inputs} {outputs}"
How would it look as a containers-run
command?
$ datalad containers-run -m "rerun analysis in container" \
--container-name midterm-software \
--input "input/iris.csv" \
--output "pairwise_relationships.png" \
--output "prediction_report.csv" \
"python3 code/script.py {inputs} {outputs}"
unlock(ok): pairwise_relationships.png (file)
unlock(ok): prediction_report.csv (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101/midterm_project (dataset) [singularity exec -B /home/me/dl-101/Data...]
add(ok): pairwise_relationships.png (file)
add(ok): prediction_report.csv (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
get (notneeded: 4)
run (ok: 1)
save (notneeded: 1, ok: 1)
unlock (ok: 2)
Almost exactly like a datalad run
command! The only additional parameter
is --container-name
. At this point, though, this
flag is even optional, because there is only a single container registered to the dataset.
But if your dataset contains more than one container, you will need to specify
the name of the container you want to use in your command.
The complete command’s structure looks like this:
$ datalad containers-run --container-name <containername> [-m ...] [--input ...] [--output ...] <COMMAND>
How can I list available containers or remove them?
The command datalad containers-list
(manual) will list all containers in
the current dataset:
$ datalad containers-list
midterm-software -> .datalad/environments/midterm-software/image
The command datalad containers-remove
(manual) will remove a container
from the dataset, if a container with the name given to the
command exists. Note that this will remove not only the image from the dataset,
but also the configuration for it in .datalad/config
.
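For example, the container registered above could be removed like this (not executed here, as the analysis still needs it):
$ datalad containers-remove midterm-software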
Here is what the history entry looks like:
$ git log -p -n 1
commit 4f00ad07✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0000
[DATALAD RUNCMD] rerun analysis in container
=== Do not change lines below ===
{
"chain": [],
"cmd": "singularity exec -B {pwd} .datalad/environments/midterm-software/image python3 code/script.py {inputs} {outputs}",
"dsid": "d95bafc8-f2a4-d27b-dcf4-bb99f4bea973",
"exit": 0,
"extra_inputs": [
".datalad/environments/midterm-software/image"
],
"inputs": [
"input/iris.csv"
],
"outputs": [
"pairwise_relationships.png",
"prediction_report.csv"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
diff --git a/pairwise_relationships.png b/pairwise_relationships.png
index a24e6b9..963d5a8 120000
--- a/pairwise_relationships.png
+++ b/pairwise_relationships.png
@@ -1 +1 @@
-.git/annex/objects/G3/Mg/✂/MD5E-s260649--127313ad✂MD5.png
\ No newline at end of file
+.git/annex/objects/q1/gp/✂/MD5E-s261062--025dc493✂MD5.png
\ No newline at end of file
If you datalad rerun
this commit, it will be re-executed in the
software container registered to the dataset. If you share the dataset
with a friend and they datalad rerun
this commit, the image will first
be obtained from its registered URL, and thus your
friend obtains the correct execution environment automatically.
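As a sketch, since the containers-run commit is the most recent one in the subdataset's history, re-executing it boils down to:
$ # from within midterm_project; HEAD refers to the latest commit
$ datalad rerun HEAD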
Note that because this new datalad containers-run
command modified the
midterm_project
subdirectory, we also need to save
the most recent state of the subdataset to the superdataset DataLad-101
.
$ cd ../
$ datalad status
modified: midterm_project (dataset)
$ datalad save -d . -m "add container and execute analysis within container" midterm_project
add(ok): midterm_project (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
Software containers, the datalad-container
extension, and DataLad thus work well together
to make your analysis completely reproducible – by linking not only code, data,
and outputs, but also the software environment of an analysis. And this benefits
not only your future self, but also anyone you share your dataset with, as
the information about the container is shared together with the dataset. How cool
is that?
Footnotes