DataLad for reproducible machine-learning analyses

This use case demonstrates an automatically and computationally reproducible analysis in the context of a machine learning (ML) project. Using an example image classification project, it demonstrates how one can

  • link data, models, parametrization, software and results using datalad containers-run

  • keep track of results and compare them across models or parametrizations

  • stay computationally reproducible, transparent, and importantly, intuitive and clear

The Challenge

Chad is a recent college graduate and has just started at a wicked start-up that prides itself on using “AI and ML for individualized medicine” in the Bay Area. Even though he’s extraordinarily motivated, the fast pace and pressure to deliver at his job are still stressful. For his first project, he’s tasked with training a machine learning model to detect cancerous tissue in computed tomography (CT) images. Excited and eager to impress, he builds his first image classification ML model with state-of-the-art Python libraries and a stochastic gradient descent (SGD) classifier. “Not too bad”, he thinks, when he shares the classification accuracy with his team lead, “way higher than chance level!” “Fantastic, Chad, but listen, we really need a higher accuracy than this. Our customers deserve that. Turn up the number of iterations. Also, try a random forest classification instead. And also, I need that done by tomorrow morning at the latest, Chad. Take a bag of organic sea-weed-kale crisps from the kitchen, oh, and also, you’re coming to our next project pitch at the roof-top bar on Sunday?”

Hastily, Chad pulls an all-nighter to adjust his models by dawn. Increase iterations here, switch the classifier there, oh no, did this increase or decrease the overall accuracy? Tune some parameters here and there, re-do that previous run one more time just to be sure. A quick two-hour nap on the office couch, and he is ready for the daily scrum in the morning. “Shit, which accuracy belonged to which parametrization again?”, he thinks to himself as he pitches his analysis and presents his results. But everyone is already rushing to the next project.

A week later, when a senior colleague is tasked with checking his analyses, Chad needs to spend a few hours with them to guide them through his chaotic analysis directory full of Jupyter notebooks. They struggle to figure out which Python libraries to install on the colleague’s computer, have to adjust hard-coded absolute paths, and fail to reproduce the results that he presented.

The DataLad Approach

Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of what data, and which optimization turned out to be ideal. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.

DataLad has many concepts and tools that assist in creating transparent, computationally and automatically reproducible analyses. They range from general principles on how to structure analysis projects, to linking and versioning software and data alongside code, to capturing analysis executions as re-executable run-records. To make a machine-learning project intuitively structured and transparent, Chad applies DataLad’s YODA principles to his work. He keeps the training and testing data as a reusable, standalone component, installed as a subdataset, and keeps his analysis dataset completely self-contained with relative paths in all his scripts. Later, he can share his dataset without the need to adjust paths. Chad also attaches a software container to his dataset, so that others don’t need to recreate his Python environment. And lastly, he wraps every command that he executes in a datalad containers-run call, such that others don’t need to rely on his brain to understand the analysis, but can have a computer recompute every analysis step in the correct software environment. Using concise commit messages and tags, Chad creates a transparent and intuitive dataset history. With these measures in place, he can experiment flexibly with various models and data, and not only has the means to compare his models, but can also reset his dataset to the state in which his most preferred model is ready to be used.

Step-by-Step

Required software

The analysis requires the Python packages scikit-learn, scikit-image, pandas, and numpy. We have built a Singularity software container with all relevant software, and the code below will use the datalad-container extension1 to download the container from Singularity-Hub and execute all analysis steps in this software environment. If you do not want to install the datalad-container extension or Singularity, you can instead create a virtual environment with all necessary software2 and exchange the datalad containers-run commands below with datalad run commands.
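
If you go the container-free route, the swap is mechanical: drop the -n/--container-name option and use datalad run instead of datalad containers-run. As a minimal sketch, assuming a virtual environment with the packages above is active in your shell, the data preparation step shown further below would become:

$ datalad run \
  -m "Prepare the data for categories golf balls and parachutes" \
  --input 'data/raw/imagenette2-160/train/n03445777' \
  --input 'data/raw/imagenette2-160/val/n03445777' \
  --input 'data/raw/imagenette2-160/train/n03888257' \
  --input 'data/raw/imagenette2-160/val/n03888257' \
  --output 'data/train.csv' \
  --output 'data/test.csv' \
  "python3 code/prepare.py"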

Let’s start with an overview of the analysis plan: We’re aiming for an image classification analysis. In this type of ML analysis, a classifier is trained on a subset of data, the training set, and is then used for predictions on a previously unseen subset of data, the test set. Its task is to label the test data with one of several class attributes it is trained to classify, such as “cancerous” or “non-cancerous” with medical data, “cat” or “dog” with your pictures of pets, or “spam” versus “not spam” in your emails. In most cases, classification analyses are supervised learning methods: The correct class attributes are known, and the classifier is trained on a labeled set of training data. Its classification accuracy is calculated by comparing its predictions on the (to it unlabeled) test set with the correct labels. As a first analysis step, training and testing data therefore need to be labeled – both to allow model training and model evaluation. In a second step, a classifier needs to be trained on the labeled training data. It learns which features are to be associated with which class attribute. In a final step, the trained classifier classifies the test data, and its results are evaluated against the true labels.

Below, we will go through an image classification analysis on the Imagenette dataset, a smaller subset of the Imagenet dataset, one of the most widely used large-scale datasets for benchmarking image classification algorithms. It contains images from ten categories (tench (a type of fish), English springer (a type of dog), cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute). We will prepare a subset of the data, and train and evaluate different types of classifiers. The analysis is based on this tutorial.

First, let’s create a dataset for the input data. Later, this dataset will be installed as a subdataset of the analysis dataset. This complies with the YODA principles and helps to keep the input data modular, reusable, and transparent.

$ datalad create imagenette
[INFO] Creating a new annex repo at /home/me/usecases/imagenette 
[INFO] Scanning for unlocked files (this may take some time) 
create(ok): /home/me/usecases/imagenette (dataset)

The Imagenette dataset can be downloaded as an archive from Amazon, and datalad download-url --archive not only extracts and saves the data from its S3 URL, but also registers the dataset’s origin such that the data can be re-retrieved on demand from the S3 bucket.

$ cd imagenette
# 0.12.2 <= datalad < 0.13.4  needs the configuration option -c datalad.runtime.use-patool=1 to handle .tgz
$ datalad download-url \
  --archive \
  --message "Download Imagenette dataset" \
  'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz'
[INFO] Downloading 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' into '/home/me/usecases/imagenette/' 
[INFO] Adding content of the archive /home/me/usecases/imagenette/imagenette2-160.tgz into annex AnnexRepo(/home/me/usecases/imagenette) 
[INFO] Initiating special remote datalad-archives 
[INFO] Finished adding /home/me/usecases/imagenette/imagenette2-160.tgz: Files processed: 13394, +annex: 13394 
download_url(ok): /home/me/usecases/imagenette/imagenette2-160.tgz (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

Next, let’s create an analysis dataset. For a pre-structured and pre-configured starting point, the dataset can be created with the yoda and text2git run procedures3. These configurations create a code/ directory, place README placeholders in appropriate locations, and make sure that all text files, e.g. scripts or evaluation results, are kept in Git to allow for easier modifications.

Note for Windows-Users

Hey there! If you are using Windows 10 (not Windows Subsystem for Linux (WSL)) without the custom-built git-annex installer mentioned in the installation section, you need a work-around.

Instead of running datalad create -c text2git -c yoda ml-project, please remove the configuration -c text2git from the command and run only datalad create -c yoda ml-project:

$ datalad create -c yoda ml-project
[INFO] Creating a new annex repo at C:\Users\mih\ml-project
[INFO] Detected a filesystem without fifo support.
[INFO] Disabling ssh connection caching.
[INFO] Detected a crippled filesystem.
[INFO] Scanning for unlocked files (this may take some time)
[INFO] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO] Switched to branch 'adjusted/master(unlocked)'
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
create(ok): C:\Users\mih\ml-project (dataset)

Instead of the text2git configuration, you need to create a configuration by hand by pasting the following lines of text into the (hidden) .gitattributes file in your newly created dataset. The section Tuning datasets to your needs explains the details of this procedure.

Here are lines that need to be appended to the existing lines in .gitattributes and will mimic the configuration -c text2git would apply:

*.json annex.largefiles=nothing

You can achieve this by copy-pasting the following code snippets into your terminal (but you can also add them using a text editor of your choice):

$ echo\ >> .gitattributes && echo *.json annex.largefiles=nothing >> .gitattributes

Afterwards, these should be the contents of .gitattributes:

$ cat .gitattributes
  * annex.backend=MD5E
  **/.git* annex.largefiles=nothing
  CHANGELOG.md annex.largefiles=nothing
  README.md annex.largefiles=nothing
  *.json annex.largefiles=nothing

Lastly, run this piece of code to save your changes:

$ datalad save -m "Windows-workaround: custom config to place text into Git" .gitattributes
$ cd ../
$ datalad create -c text2git -c yoda ml-project
[INFO] Creating a new annex repo at /home/me/usecases/ml-project 
[INFO] Scanning for unlocked files (this may take some time) 
[INFO] Running procedure cfg_text2git 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
[INFO] Running procedure cfg_yoda 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/usecases/ml-project (dataset)

Afterwards, the input dataset can be installed from a local path as a subdataset, using datalad clone with the -d/--dataset flag and a . to denote the current dataset:

$ cd ml-project
$ mkdir -p data
# install the dataset into data/
$ datalad clone -d . ../imagenette data/raw
[INFO] Cloning dataset to Dataset(/home/me/usecases/ml-project/data/raw) 
[INFO] Attempting to clone from ../imagenette to /home/me/usecases/ml-project/data/raw 
[INFO] Completed clone attempts for Dataset(/home/me/usecases/ml-project/data/raw) 
[INFO] Scanning for unlocked files (this may take some time) 
install(ok): data/raw (dataset)
add(ok): data/raw (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)
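
To confirm that the input dataset was registered as a subdataset of the analysis dataset, you can query the superdataset with the standard datalad subdatasets command (output omitted here):

$ datalad subdatasets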

Here are the dataset contents up to now:

# show the directory hierarchy
$ tree -d
.
├── code
└── data
    └── raw
        └── imagenette2-160
            ├── train
            │   ├── n01440764
            │   ├── n02102040
            │   ├── n02979186
            │   ├── n03000684
            │   ├── n03028079
            │   ├── n03394916
            │   ├── n03417042
            │   ├── n03425413
            │   ├── n03445777
            │   └── n03888257
            └── val
                ├── n01440764
                ├── n02102040
                ├── n02979186
                ├── n03000684
                ├── n03028079
                ├── n03394916
                ├── n03417042
                ├── n03425413
                ├── n03445777
                └── n03888257

26 directories

Next, let’s add the necessary software to the dataset. This is done using the datalad-container extension and the datalad containers-add command. This command takes an arbitrary name and a path or URL to a software container, registers the container’s origin, and adds it under the specified name to the dataset. If used with a public URL, for example one pointing to Singularity-Hub, others that you share your dataset with can retrieve the container as well1.

$ datalad containers-add software --url shub://adswa/python-ml:1
[INFO] Initiating special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/me/usecases/ml-project/.datalad/environments/software/image (file)
action summary:
  add (ok: 1)
  containers_add (ok: 1)
  save (ok: 1)
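
If you want to double-check which containers are registered in a dataset, the extension provides a listing command (shown here without its output):

$ datalad containers-list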

At this point, with input data and software set up, we can start with the first step: data preparation. The Imagenette dataset is structured into train/ and val/ folders, and each folder contains one sub-folder per image category. To prepare the dataset for training and testing a classifier, we create a mapping between file names and image categories.

Let’s start small and only use two categories, “golf balls” (subdirectory “n03445777”) and “parachutes” (subdirectory “n03888257”). The following script creates two files, data/train.csv and data/test.csv, from the input data. Each contains file names and category associations for the files in those subdirectories. Note how, in accordance with the YODA principles, the script only contains relative paths to make the dataset portable.

$ cat << EOT > code/prepare.py
#!/usr/bin/env python3

import pandas as pd
from pathlib import Path

FOLDERS_TO_LABELS = {"n03445777": "golf ball",
                     "n03888257": "parachute"}


def get_files_and_labels(source_path):
    images = []
    labels = []
    for image_path in source_path.rglob("*/*.JPEG"):
        filename = image_path
        folder = image_path.parent.name
        if folder in FOLDERS_TO_LABELS:
            images.append(filename)
            label = FOLDERS_TO_LABELS[folder]
            labels.append(label)
    return images, labels


def save_as_csv(filenames, labels, destination):
    data_dictionary = {"filename": filenames, "label": labels}
    data_frame = pd.DataFrame(data_dictionary)
    data_frame.to_csv(destination)


def main(repo_path):
    data_path = repo_path / "data"
    train_path = data_path / "raw/imagenette2-160/train"
    test_path = data_path / "raw/imagenette2-160/val"
    train_files, train_labels = get_files_and_labels(train_path)
    test_files, test_labels = get_files_and_labels(test_path)
    save_as_csv(train_files, train_labels, data_path / "train.csv")
    save_as_csv(test_files, test_labels, data_path / "test.csv")


if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

Executing the here document in the code block above has created a script code/prepare.py:

$ datalad status
untracked: code/prepare.py (file)

We add it to the dataset using datalad save:

$ datalad save -m "Add script for data preparation for 2 categories" code/prepare.py
add(ok): code/prepare.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

This script can now be used to prepare the data. Note how it, in accordance with the YODA principles, saves the files into the superdataset and leaves the input dataset untouched. When run, it will create files with the following structure:

,filename,label
0,data/raw/imagenette2-160/val/n03445777/n03445777_20061.JPEG,golf ball
1,data/raw/imagenette2-160/val/n03445777/n03445777_9740.JPEG,golf ball
2,data/raw/imagenette2-160/val/n03445777/n03445777_3900.JPEG,golf ball
3,data/raw/imagenette2-160/val/n03445777/n03445777_5862.JPEG,golf ball
4,data/raw/imagenette2-160/val/n03445777/n03445777_4172.JPEG,golf ball
5,data/raw/imagenette2-160/val/n03445777/n03445777_14301.JPEG,golf ball
6,data/raw/imagenette2-160/val/n03445777/n03445777_2951.JPEG,golf ball
7,data/raw/imagenette2-160/val/n03445777/n03445777_8732.JPEG,golf ball
8,data/raw/imagenette2-160/val/n03445777/n03445777_5810.JPEG,golf ball
9,data/raw/imagenette2-160/val/n03445777/n03445777_3132.JPEG,golf ball
[...]

To capture all provenance and perform the computation in the correct software environment, this is best done in a datalad containers-run command:

$ datalad containers-run -n software \
  -m "Prepare the data for categories golf balls and parachutes" \
  --input 'data/raw/imagenette2-160/train/n03445777' \
  --input 'data/raw/imagenette2-160/val/n03445777' \
  --input 'data/raw/imagenette2-160/train/n03888257' \
  --input 'data/raw/imagenette2-160/val/n03888257' \
  --output 'data/train.csv' \
  --output 'data/test.csv' \
  "python3 code/prepare.py"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) ===== 
save(ok): . (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 2, ok: 2704)
  save (notneeded: 1, ok: 1)

Beyond the script execution and container name (-n/--container-name), this command can take a human-readable commit message to summarize the operation (-m/--message) and input and output specifications (-i/--input, -o/--output). DataLad will make sure to retrieve everything labeled as --input prior to running the command, and specifying --output ensures that the files can be updated should the command be rerun at a later point4. It saves the results of this command together with a machine-readable run-record into the dataset history.
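
To see the captured provenance for yourself, take a look at the commit this command just created; the run-record is appended to its commit message as a structured, machine-readable block. A quick sketch using standard Git commands (output omitted):

# show the latest commit, including the run-record embedded in its commit message
$ git log -n 1
# show which files were added or changed by this step
$ git show --stat HEAD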

Next, the first model can be trained.

$ cat << EOT > code/train.py
#!/usr/bin/env python3

from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.linear_model import SGDClassifier


def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list


def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list


def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped


def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels


def main(repo_path):
    train_csv_path = repo_path / "data/train.csv"
    train_data, labels = load_data(train_csv_path)
    sgd = SGDClassifier(max_iter=10)
    trained_model = sgd.fit(train_data, labels)
    dump(trained_model, repo_path / "model.joblib")


if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

This script trains a stochastic gradient descent classifier on the training data. The files listed in train.csv are read, preprocessed into the same shape, and an SGD model is fitted to predict the image labels from the data. The trained model is then saved into a model.joblib file – this allows the classifier to be transparently cached as a Python object on disk. Later, the cached model can be applied to various data without the need to retrain the classifier. Let’s save the script.

$ datalad save -m "Add SGD classification script" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

The last analysis step is to test the trained classifier. We will use the following script for this:

$ cat << EOT > code/evaluate.py
#!/usr/bin/env python3

from joblib import load
import json
from pathlib import Path

from sklearn.metrics import accuracy_score

from train import load_data


def main(repo_path):
    test_csv_path = repo_path / "data/test.csv"
    test_data, labels = load_data(test_csv_path)
    model = load(repo_path / "model.joblib")
    predictions = model.predict(test_data)
    accuracy = accuracy_score(labels, predictions)
    metrics = {"accuracy": accuracy}
    print(metrics)
    accuracy_path = repo_path / "accuracy.json"
    accuracy_path.write_text(json.dumps(metrics))


if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

It will load the trained and dumped model and use it to test its prediction performance on the yet unseen test data. To evaluate the model performance, it calculates the accuracy of the prediction, i.e., the proportion of correctly labeled images, prints it to the terminal, and saves it into a JSON file in the superdataset. As this script constitutes the last analysis step, let’s save it with a tag. It’s entirely optional to do this, but just as commit messages are an easier way for humans to get an overview of a commit’s contents, a tag is an easier way for humans to identify a change than a commit hash. With this script set up, we’re ready for analysis, and thus can tag this state ready4analysis to identify it more easily later.

$ datalad save -m "Add script to evaluate model performance" --version-tag "ready4analysis" code/evaluate.py
add(ok): code/evaluate.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
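
Tags live in the dataset history right next to the commits they point to, so standard Git commands can show you which dataset state a tag refers to (a quick sketch; output omitted):

# list all tags created so far
$ git tag
# show recent commits with tag and branch annotations
$ git log --oneline --decorate -n 5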

Afterwards, we can train the first model:

$ datalad containers-run -n software \
  -m "Train an SGD classifier on the data" \
  --input 'data/raw/imagenette2-160/train/n03445777' \
  --input 'data/raw/imagenette2-160/train/n03888257' \
  --output 'model.joblib' \
  "python3 code/train.py"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_stochastic_gradient.py:573: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  ConvergenceWarning)
[INFO] == Command exit (modification check follows) ===== 
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 4)
  save (notneeded: 1, ok: 1)

And finally, we’re ready to find out how well the model did and run the last script:

$ datalad containers-run -n software \
  -m "Evaluate SGD classifier on test data" \
  --input 'data/raw/imagenette2-160/val/n03445777' \
  --input 'data/raw/imagenette2-160/val/n03888257' \
  --output 'accuracy.json' \
  "python3 code/evaluate.py"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
{'accuracy': 0.6121673003802282}
save(ok): . (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 4)
  save (notneeded: 1, ok: 1)

Now this initial accuracy isn’t yet fully satisfying. What could have gone wrong? The model would probably benefit from a few more training iterations for a start. Instead of 10, the patch below increases the number of iterations to 100. Note that the code block below does this change with the stream editor sed for the sake of automatically executed code in the handbook, but you could also apply this change with a text editor “by hand”.

$ sed -i 's/SGDClassifier(max_iter=10)/SGDClassifier(max_iter=100)/g' code/train.py

Here’s what has changed:

$ git diff
diff --git a/code/train.py b/code/train.py
index 3b309e1..017a6bf 100644
--- a/code/train.py
+++ b/code/train.py
@@ -39,7 +39,7 @@ def load_data(data_path):
 def main(repo_path):
     train_csv_path = repo_path / "data/train.csv"
     train_data, labels = load_data(train_csv_path)
-    sgd = SGDClassifier(max_iter=10)
+    sgd = SGDClassifier(max_iter=100)
     trained_model = sgd.fit(train_data, labels)
     dump(trained_model, repo_path / "model.joblib")
 

Let’s save the change…

$ datalad save -m "Increase the amount of iterations to 100" --version-tag "SGD-100" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

… and try again.

As we need to retrain the classifier and re-evaluate its performance, we rerun every run-record created between the ready4analysis tag and the current state of the dataset (tagged SGD-100). This will update both the model.joblib and the accuracy.json files, but their past versions are still in the dataset history. One way to do this is to specify a range between the two tags, but likewise, commit hashes would work, or a specification using --since5.

$ datalad rerun -m "Recompute classification with more iterations" ready4analysis..SGD-100
[INFO] run commit c081642; (Train an SGD clas...)
[INFO] Making sure inputs are available (this may take some time) 
unlock(ok): model.joblib (file)
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): model.joblib (file)
save(ok): . (dataset)
[INFO] run commit c174771; (Evaluate SGD clas...)
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
{'accuracy': 0.7477820025348543}
[INFO] == Command exit (modification check follows) ===== 
add(ok): accuracy.json (file)
add(ok): code/__pycache__/train.cpython-37.pyc (file)
save(ok): . (dataset)
[INFO] skip-or-pick commit c341b43; c341b43 does not have a command; skipping or cherry picking
run(ok): /home/me/usecases/ml-project (dataset) [c341b43 does not have a command; skipping]
action summary:
  add (ok: 3)
  get (notneeded: 8)
  run (ok: 1)
  save (notneeded: 2, ok: 2)
  unlock (notneeded: 2, ok: 1)

Any better? Mhh, not so much. Maybe a different classifier does the job better. Let’s switch from SGD to a random forest classification. The code block below rewrites the script with the relevant changes.

$ cat << EOT >| code/train.py
#!/usr/bin/env python3

from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

We need to save this change:

$ datalad save -m "Switch to random forest classification" --version-tag "random-forest" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

And now we can retrain and re-evaluate again. This time, in order to have very easy access to the trained models and results of the evaluation, we rerun the sequence of run-records on a new branch6. This way, we have access to a trained random forest model or a trained SGD model and their respective results by simply switching branches.

$ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready4analysis..SGD-100
[INFO] checkout commit b92dc98;
[INFO] run commit c081642; (Train an SGD clas...)
[INFO] Making sure inputs are available (this may take some time) 
unlock(ok): model.joblib (file)
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): model.joblib (file)
save(ok): . (dataset)
[INFO] run commit c174771; (Evaluate SGD clas...)
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
{'accuracy': 0.8086185044359949}
[INFO] == Command exit (modification check follows) ===== 
add(ok): accuracy.json (file)
add(ok): code/__pycache__/train.cpython-37.pyc (file)
save(ok): . (dataset)
[INFO] skip-or-pick commit c341b43; c341b43 does not have a command; skipping or cherry picking
run(ok): /home/me/usecases/ml-project (dataset) [c341b43 does not have a command; skipping]
action summary:
  add (ok: 3)
  get (notneeded: 8)
  run (ok: 1)
  save (notneeded: 2, ok: 2)
  unlock (notneeded: 2, ok: 1)

This updated the model.joblib file to a trained random forest classifier, and also updated accuracy.json with the current model’s evaluation. The difference in accuracy between models can now, for example, be inspected with a git diff of the contents of accuracy.json against the master branch:

$ git diff master -- accuracy.json
diff --git a/accuracy.json b/accuracy.json
index 8e62cc5..41634f2 100644
--- a/accuracy.json
+++ b/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7477820025348543}
\ No newline at end of file
+{"accuracy": 0.8086185044359949}
\ No newline at end of file

And if you decide to rather do more work on the SGD classifier, you can go back to the previous master branch:

$ git checkout master
$ cat accuracy.json
Switched to branch 'master'
{"accuracy": 0.7477820025348543}

Your Git history becomes a log of everything you did, as well as a means to go back and forth between analysis states. And this is not only useful for yourself; it also makes your analyses and results transparent to others that you share your dataset with. If you cache your trained models, there is no need to retrain them when traveling to past states of your dataset. And if any aspect of your dataset changes – from changes to the input data to changes to your trained model or code – you can rerun these analysis stages automatically. The attached software container makes sure that your analysis will always be rerun in the correct software environment, even if the dataset is shared with collaborators whose systems lack a Python installation.

References

The analysis is adapted from the chapter Reproducible machine learning analyses: DataLad as DVC, which in turn is based on this tutorial at RealPython.org.

Footnotes

1

You can install the datalad-container extension from PyPI via pip install datalad-container. You can find out more about extensions in general in the section DataLad extensions, and you can read more about computationally reproducible analyses using datalad containers-run in the chapter Computational reproducibility with software containers and the use case An automatically and computationally reproducible neuroimaging analysis from scratch.

2

Unsure how to create a virtual environment? You can find a tutorial using pip and the virtualenv module in the Python docs.
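
As a rough sketch, creating and activating an environment and installing the packages listed in the Required software section could look like this (using Python’s built-in venv module; the environment path is just an example):

# create and activate a virtual environment
$ python3 -m venv ~/env/ml-project
$ source ~/env/ml-project/bin/activate
# install the packages required by the analysis scripts
$ pip install scikit-learn scikit-image pandas numpy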

3

To re-read about run procedures, check out section Configurations to go.

4

The chapter DataLad, Run! introduces the options of datalad run and demonstrates their use. Note that --output does not need to point to individual files; directories or globbing terms work as well.
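
For illustration only, the two CSV outputs of the data preparation step above could have been declared with a single globbing term (--input options omitted here for brevity):

$ datalad containers-run -n software \
  -m "Prepare the data for categories golf balls and parachutes" \
  --output 'data/*.csv' \
  "python3 code/prepare.py"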

5

In order to re-execute any run-record in the last five commits, you could use --since=HEAD~5, for example. You could also, however, rerun the previous run commands sequentially, with datalad rerun <commit-hash>.

6

Rerunning on a different branch is optional but handy. Alternatively, you could check out a previous state in the dataset’s history to get access to a previous version of a file, reset the dataset to a previous state, or use commands like git cat-file to read out a file that is not currently checked out. The section Back and forth in time summarizes a number of common Git operations to interact with the dataset history.