5.1. Reproducible machine learning analyses: DataLad as DVC

Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of what data, and which optimization turned out to be ideal. But just like any data analysis project, machine learning projects become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.

DataLad provides the functionality to achieve this, and previous sections have given some demonstrations on how to do it. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too. One of the most well-known is DVC (Data Version Control), a “version control system for machine learning projects”. This section compares the two tools and demonstrates workflows for data versioning, data sharing, and analysis execution in the context of a machine learning project with DVC and DataLad. While they share a number of similarities and goals, their respective workflows are quite distinct.

The workflows showcased here are based on a DVC tutorial. This tutorial consists of the following steps:

A dataset with pictures of 10 classes of objects (Imagenette) is version controlled with DVC.
The data is pushed to a “storage remote” on a local path.
The data is analyzed using various ML models in DVC pipelines.

This handbook section demonstrates how DataLad could be used as an alternative to DVC. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow. The use case DataLad for reproducible machine-learning analyses demonstrates a similar analysis in a completely DataLad-centric fashion. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section can point you to relevant chapters for more insights on a command or concept. If you have never used DVC, its documentation (including the command reference) can answer further questions.

If you are not a Git user

DVC relies heavily on Git workflows. Understanding the DVC workflows requires a solid understanding of branches, Git’s concepts of Working tree, Index (“Staging Area”), and Repository, and some basic Git commands such as add, commit, and checkout. The Turing Way has an excellent chapter on version control with Git if you want to catch up on those basics first.

Be mindful: DVC (like DataLad) comes with a range of commands and concepts that have the same names as their Git namesakes but differ in functionality. Make sure to read the DVC documentation for each command to get more information on what it does.

5.1.1. Setup

The DVC tutorial comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, the repository needs to be forked (requires a GitHub account) and cloned from your own fork[1].

### DVC
# please clone this repository from your own fork when coding along
$ git clone https://github.com/datalad-handbook/data-version-control DVC
Cloning into 'DVC'...

The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories model and metrics, and a set of Python scripts for a machine learning analysis in src/.

### DVC
$ tree DVC
DVC
├── data
│   ├── prepared
│   └── raw
├── LICENSE
├── metrics
├── model
├── README.md
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py

6 directories, 5 files

For comparison, we will recreate a similarly structured DataLad dataset. For greater compliance with DataLad’s YODA principles, the dataset structure will differ marginally in that scripts will be kept in code/ instead of src/. We create the dataset with two configurations, yoda and text2git[2].

### DVC-DataLad
$ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
create(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Afterwards, we make sure to get the same scripts.

### DVC-DataLad
# get the scripts
$ datalad download-url -m "download scripts for ML analysis" \
  https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
  -O 'code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): code/evaluate.py (file)
add(ok): code/prepare.py (file)
add(ok): code/train.py (file)
save(ok): . (dataset)

Here’s the final directory structure:

### DVC-DataLad
$ tree
.
├── CHANGELOG.md
├── code
│   ├── evaluate.py
│   ├── prepare.py
│   ├── README.md
│   └── train.py
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── README.md

6 directories, 6 files

In order to code along, DVC, scikit-learn, scikit-image, pandas, and numpy are required. All tools are available via pip or conda. We recommend installing them in a virtual environment – the DVC tutorial has step-by-step instructions.

5.1.2. Version controlling data

In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.

Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar: (Large) data is stored outside of Git – Git only tracks information on where this data can be found.

In DataLad datasets, (large) data is handled by git-annex. Data content is hashed and only the hash (represented as the original file name) is stored in Git[3]. Actual data is stored in the annex of the dataset, and annexed data can be transferred from and to a large number of storage solutions using either DataLad or git-annex commands. Information on where data is available from is stored in an internal representation of git-annex.

In DVC repositories, (large) data is also supposed to be stored in external remotes such as Google Drive. For internal representation of where files are available from, DVC uses one .dvc text file for each data file or directory given to DVC. The .dvc files contain information on the path to the data in the repository, where the associated data file is available from, and a hash, and those files should be committed to Git.
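The idea shared by both tools can be sketched in a few lines of Python. The snippet below is a simplified illustration and not the actual implementation of git-annex or DVC: the function name `store_by_hash`, the cache layout, and the shape of the pointer record are assumptions made for this example, and MD5 is used only for brevity. It hashes a file's content, moves the content into a cache location named after the hash, and returns a small record that could be tracked by Git instead of the data itself.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_by_hash(data_file: Path, cache_dir: Path) -> dict:
    """Move a file into a hash-named cache location and return a pointer record."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Hypothetical layout: the first two hex characters form a subdirectory
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(data_file), str(target))
    # Only this small record would need to be tracked by Git
    return {"path": data_file.name, "md5": digest, "size": target.stat().st_size}


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    image = tmp / "image.jpeg"
    image.write_bytes(b"pretend this is a large image")
    pointer = store_by_hash(image, tmp / "cache")
    print(pointer)
```

Because the record contains the hash, any later change to the content would lead to a different record, which is what makes the data version detectable from within Git.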

5.1.2.1. DVC workflow

Prior to adding and version controlling data, a “DVC project” needs to be initialized in the Git repository:

### DVC
$ cd ../DVC
$ dvc init
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

This populates the repository with a range of staged files – most of them are internal directories and files for DVC’s configuration.

### DVC
$ git status
On branch master
Your branch is up to date with 'github/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

As they are only staged but not committed, we need to commit them (into Git):

### DVC
$ git commit -m "initialize dvc"
[master ae6d2e1] initialize dvc
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

The DVC project is now ready to version control data. In the tutorial, data comes from the “Imagenette” dataset. This data is available from an Amazon S3 bucket as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the Open Science Framework (OSF). We’ll download it and extract it into the data/raw/ directory of the repository.

### DVC
# download the data
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz

The data directories in data/raw are then version controlled with the dvc add command that can place files or complete directories under version control by DVC.

### DVC
$ dvc add data/raw/train
$ dvc add data/raw/val

To track the changes with git, run:

	git add data/raw/train.dvc data/raw/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

To track the changes with git, run:

	git add data/raw/val.dvc data/raw/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

Here is what this command has accomplished: The data files were copied into a cache in .dvc/cache (a non-human readable directory structure based on hashes similar to .git/annex/objects used by git-annex), data file names were added to a .gitignore[4] file to become invisible to Git, and two .dvc files, train.dvc and val.dvc, were created[5]. git status (manual) shows these changes:

### DVC
$ git status
On branch master
Your branch is ahead of 'github/master' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   data/raw/.gitignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/raw/train.dvc
	data/raw/val.dvc

no changes added to commit (use "git add" and/or "git commit -a")

In order to complete the version control workflow, Git needs to know about the .dvc files, and forget about the data directories. For this, the modified .gitignore file and the untracked .dvc files need to be added to Git:

### DVC
$ git add --all

Finally, we commit.

### DVC
$ git commit -m "control data with DVC"
[master 9420e0e] control data with DVC
 3 files changed, 14 insertions(+)
 create mode 100644 data/raw/train.dvc
 create mode 100644 data/raw/val.dvc

The data is now version controlled with DVC.

When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective .dvc file. If any file in the directory changes, this hash would change, and the dvc status command would report the directory to be “changed”. To demonstrate this, we pretend to accidentally delete a single file:

# if one or more files in the val/ data changes, dvc status reports a change
$ dvc status
data/raw/val.dvc:
    changed outs:
        modified:           data/raw/val

Important: Detecting a data modification requires the dvc status command – git status will not be able to detect changes in this directory, as it is git-ignored!
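Why a single deleted file marks the complete directory as changed can be illustrated with a small sketch. It assumes, for illustration only, that a directory hash is computed from the sorted relative paths and content hashes of all files it contains; DVC's actual algorithm may differ in detail, and `directory_hash` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_hash(directory: Path) -> str:
    """Combine relative paths and content hashes of all files into one hash."""
    combined = hashlib.md5()
    for file in sorted(p for p in directory.rglob("*") if p.is_file()):
        combined.update(file.relative_to(directory).as_posix().encode())
        combined.update(hashlib.md5(file.read_bytes()).hexdigest().encode())
    return combined.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    val = Path(tmp) / "val"
    val.mkdir()
    for i in range(3):
        (val / f"img_{i}.jpeg").write_bytes(f"image {i}".encode())

    recorded = directory_hash(val)   # what would be stored when adding the data
    (val / "img_1.jpeg").unlink()    # pretend to accidentally delete a file
    current = directory_hash(val)

    print("changed" if current != recorded else "unchanged")
```

A single directory-level hash can only say that something changed, not which file, which matches the directory-level report of dvc status shown above.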

5.1.2.2. DataLad workflow

DataLad has means to get data or data archives from web sources and store this availability information within git-annex. This has several advantages: For one, the original OSF file URL is known and stored as a location to re-retrieve the data from. This enables reliable data access for yourself and others that you share the dataset with. Beyond this, the data is also automatically extracted and saved, and thus put under version control. Note that this strays slightly from DataLad’s YODA principles in a DataLad-centric workflow, where data should become a standalone, reusable dataset that would be linked as a subdataset into a study/analysis specific dataset. Here, we stick to the project organization of DVC though.

### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad download-url \
  --archive \
  --message "Download Imagenette dataset" \
  https://osf.io/d6qbz/download \
  -O 'data/raw/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): data/raw/imagenette2-160.tgz (file)
save(ok): . (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad)
[INFO] Initializing special remote datalad-archives
[INFO] Extracting archive
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 2701, renamed: 2701, +annex: 2701
[INFO] Finished extraction
add-archive-content(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

At this point, the data is already version controlled[6], and we have the following directory tree:

$ tree
.
├── code
│   └── [...]
├── data
│   └── raw
│          ├── train
│          │   ├──[...]
│          └── val
│              ├── [...]
├── metrics
└── model

29 directories

As DataLad always tracks files individually, datalad status (manual) (or, alternatively, git status or git annex status (manual)) will show modifications on the level of individual files:

$ datalad status
  deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)

$ git status
  On branch main
  Your branch is ahead of 'origin/main' by 2 commits.
    (use "git push" to publish your local commits)

  Changes not staged for commit:
    (use "git add/rm <file>..." to update what will be committed)
    (use "git restore <file>..." to discard changes in working directory)
      deleted:    data/raw/val/n01440764/n01440764_12021.JPEG

$ git annex status
  D data/raw/val/n01440764/n01440764_12021.JPEG

5.1.4. Data analysis

DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.

In this general overview, we will only rush through the analysis: In short, it consists of three steps, each associated with a script. src/prepare.py creates two .csv files with mappings of file names in train/ and val/ to image categories. Later, these files will be used to train and test the classifiers. src/train.py loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the model/ directory as model.joblib. The final script, src/evaluate.py is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file metrics/accuracy.json. There are more detailed insights and explanations of the actual analysis code in the Tutorial if you’re interested in finding out more.
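To make the first step more concrete, here is a minimal sketch of what a preparation step of this kind might look like. It is not the tutorial's prepare.py: the function name `write_label_csv`, the made-up class names, and the assumption that each class is a subdirectory whose name serves as the label are hypothetical. Only the column names `filename` and `label` are taken from the source, as these are the columns that train.py reads.

```python
import csv
import tempfile
from pathlib import Path


def write_label_csv(image_dir: Path, csv_path: Path) -> int:
    """Write one row per image: its path, and its parent directory name as label."""
    rows = [
        {"filename": str(path), "label": path.parent.name}
        for path in sorted(image_dir.glob("*/*.JPEG"))
    ]
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["filename", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw" / "train"
    for label in ("classA", "classB"):  # made-up class names
        (raw / label).mkdir(parents=True)
        (raw / label / f"{label}_1.JPEG").touch()
    n_rows = write_label_csv(raw, Path(tmp) / "data" / "prepared" / "train.csv")
    print(n_rows)
```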

For workflow management, DVC has the concept of a “DVC pipeline”. A pipeline consists of multiple stages, which are set up and executed using a dvc stage add [--run] command. Each stage has three components: “deps”, “outs”, and “command”. Each of the scripts in the repository will be represented by a stage in the DVC pipeline.

DataLad does not have any workflow management functions. The closest equivalents are datalad run (manual) to record any command execution or analysis, datalad rerun (manual) to recompute such an analysis, and datalad containers-run (manual) to perform and record a command execution or analysis inside of a tracked software container[10].

5.1.4.1. DVC workflow

Model 1: SGD classifier

Each model will be analyzed in a different branch of the repository. Therefore, we start by creating a new branch.

### DVC
$ cd ../DVC
$ git checkout -b sgd-pipeline
Switched to a new branch 'sgd-pipeline'

The first stage in the pipeline is data preparation (performed by the script prepare.py). The following command sets up the stage:

### DVC
$ dvc stage add -n prepare \
  -d src/prepare.py -d data/raw \
  -o data/prepared/train.csv -o data/prepared/test.csv \
  --run \
  python src/prepare.py
Added stage 'prepare' in 'dvc.yaml'
Running stage 'prepare':
> python src/prepare.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/prepared/.gitignore dvc.lock dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true

The -n parameter gives the stage a name, the -d parameter passes the dependencies – the raw data – to the command, and the -o parameter defines the outputs of the command – the CSV files that prepare.py will create. python src/prepare.py is the command that will be executed in the stage.

The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml data/prepared/.gitignore dvc.lock

The dvc stage add command with --run has executed the stage’s command, and also created two YAML files, dvc.yaml and dvc.lock. They contain the pipeline description, which currently comprises only the first stage:

### DVC
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv

The lock file tracks the versions of all relevant files via MD5 hashes. This allows DVC to track all dependencies and outputs and detect if any of these files change.

### DVC
$ cat dvc.lock
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: data/raw
      hash: md5
      md5: 3f163676✂MD5.dir
      size: 16711951
      nfiles: 2704
    - path: src/prepare.py
      hash: md5
      md5: ef804f35✂MD5
      size: 1231
    outs:
    - path: data/prepared/test.csv
      hash: md5
      md5: 0b90b0e8✂MD5
      size: 62023
    - path: data/prepared/train.csv
      hash: md5
      md5: 360a73ac✂MD5
      size: 155128

The command also added the results from the stage, train.csv and test.csv, to a .gitignore file.

The next pipeline stage is training, in which train.py will be used to train a classifier on the data. Initially, this classifier is an SGD classifier. The following command sets it up:

$ dvc stage add -n train \
   -d src/train.py -d data/prepared/train.csv \
   -o model/model.joblib \
   --run \
   python src/train.py
Added stage 'train' in 'dvc.yaml'
Running stage 'train':
> python src/train.py
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock model/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

Afterwards, train.py has been executed, and the pipelines have been updated with a second stage. The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml model/.gitignore dvc.lock

Finally, we create the last stage, model evaluation. The following command sets it up:

$ dvc stage add -n evaluate \
         -d src/evaluate.py -d model/model.joblib \
         -M metrics/accuracy.json \
         --run \
         python src/evaluate.py
Added stage 'evaluate' in 'dvc.yaml'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

### DVC
$ git add dvc.yaml dvc.lock

Instead of “outs”, this final stage uses the -M flag to denote a “metric”. This type of flag can be used if floating-point or integer values that summarize model performance (e.g. accuracies, receiver operating characteristics, or area under the curve values) are saved in hierarchical files (JSON, YAML). DVC can then read from these files to display model performances and comparisons:

### DVC
$ dvc metrics show
Path                   accuracy
metrics/accuracy.json  0.67934

The complete pipeline now consists of preparation, training, and evaluation. It now needs to be committed, tagged, and pushed:

### DVC
$ git add --all
$ git commit -m "Add SGD pipeline"
$ dvc commit
$ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline 5aefac4] Add SGD pipeline
 5 files changed, 83 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 create mode 100644 metrics/accuracy.json
To /home/me/pushes/data-version-control
 * [new branch]      sgd-pipeline -> sgd-pipeline
branch 'sgd-pipeline' set up to track 'origin/sgd-pipeline'.
To /home/me/pushes/data-version-control
 * [new tag]         sgd -> sgd
3 files pushed

Model 2: random forest classifier

In order to explore a second model, a random forest classifier, we start with a new branch.

### DVC
$ git checkout -b random-forest
Switched to a new branch 'random-forest'

To switch from SGD to a random forest classifier, a few lines of code within train.py need to be changed. The following here document overwrites the script accordingly; the relevant changes concern the RandomForestClassifier import and its use in main():

### DVC
$ cat << EOT >| src/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT
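The reshaping in preprocess() and the concatenation in load_data() rely on simple shape arithmetic: an image of 100 × 100 pixels with 3 color channels has 100 · 100 · 3 = 30000 values. The following sketch checks this with random arrays standing in for resized images. It uses only NumPy, and the number of images (5) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Five random arrays stand in for images already resized to 100 x 100 x 3
fake_images = [rng.random((100, 100, 3)) for _ in range(5)]

# Same flattening as in preprocess(): one row of 30000 features per image
rows = [image.reshape((1, 30000)) for image in fake_images]

# Same stacking as in load_data(): the rows are joined along the first axis
data = np.concatenate(rows, axis=0)
print(data.shape)  # (5, 30000)
```

The resulting two-dimensional array, with one row per image, is the layout that the classifier's fit() method receives together with the list of labels.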

Afterwards, since train.py is changed, dvc status will realize that one dependency of the pipeline stage “train” has changed:

### DVC
$ dvc status
train:
	changed deps:
		modified:           src/train.py

Since the code change (stage 2) will likely affect the metric (stage 3), it is best to reproduce the whole chain. You can reproduce a complete DVC pipeline file with the dvc repro <stagename> command:

### DVC
$ dvc repro evaluate
'data/raw/val.dvc' didn't change, skipping
'data/raw/train.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

DVC checks the dependencies of the pipeline and re-executes commands that need to be executed again. Compared to the branch sgd-pipeline, the workspace in the current random-forest branch contains a changed script (src/train.py), a changed trained classifier (model/model.joblib), and a changed metric (metrics/accuracy.json). All these changes need to be committed, tagged, and pushed now.
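The decision about which stages to re-execute can be sketched as follows. This is a simplified model and not DVC's implementation: the `Stage` class and the `stages_to_rerun` function are hypothetical, changes are passed in as a plain set of paths instead of being detected by comparing hashes against the lock file, and every re-run stage is assumed to change its outputs. A stage is re-run if one of its dependencies changed or was produced by a stage that is itself re-run.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    deps: list = field(default_factory=list)
    outs: list = field(default_factory=list)


def stages_to_rerun(stages, changed_paths):
    """Return names of stages (given in execution order) that need re-execution."""
    dirty = set(changed_paths)
    rerun = []
    for stage in stages:
        if dirty.intersection(stage.deps):
            rerun.append(stage.name)
            dirty.update(stage.outs)  # outputs of a re-run stage count as changed
    return rerun


pipeline = [
    Stage("prepare", deps=["src/prepare.py", "data/raw"],
          outs=["data/prepared/train.csv", "data/prepared/test.csv"]),
    Stage("train", deps=["src/train.py", "data/prepared/train.csv"],
          outs=["model/model.joblib"]),
    Stage("evaluate", deps=["src/evaluate.py", "model/model.joblib"],
          outs=["metrics/accuracy.json"]),
]

print(stages_to_rerun(pipeline, {"src/train.py"}))  # ['train', 'evaluate']
```

With only src/train.py modified, the sketch skips the prepare stage and selects train and evaluate, mirroring the dvc repro output above.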

### DVC
$ git add --all
$ git commit -m "Train Random Forest classifier"
$ dvc commit
$ git push --set-upstream origin random-forest
$ git tag -a randomforest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random-forest 523abff] Train Random Forest classifier
 3 files changed, 11 insertions(+), 17 deletions(-)
To /home/me/pushes/data-version-control
 * [new branch]      random-forest -> random-forest
branch 'random-forest' set up to track 'origin/random-forest'.
To /home/me/pushes/data-version-control
 * [new tag]         randomforest -> randomforest
1 file pushed

At this point, you can compare metrics across multiple tags:

### DVC
$ dvc metrics show -T
Revision      Path                   accuracy
workspace     metrics/accuracy.json  0.79848
randomforest  metrics/accuracy.json  0.79848
sgd           metrics/accuracy.json  0.67934

Done!

5.1.4.2. DataLad workflow

For a direct comparison to DVC, we’ll try to mimic the DVC workflow as closely as possible with DataLad.

Model 1: SGD classifier

### DVC-DataLad
$ cd ../DVC-DataLad

As there is no workflow manager in DataLad[9], each script execution needs to be done separately. To record the execution, get all relevant inputs, and recompute outputs at later points, we can set up a datalad run call[10]. Later on, we can rerun a range of datalad run calls at once to recompute the relevant aspects of the analysis. To harmonize execution and to assist with reproducibility of the results, we generally recommend creating a container (Docker or Singularity), adding it to the repository as well, and using a datalad containers-run call[11] that can then be rerun, but we’ll stay basic here.

Let’s start with data preparation. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.

### DVC-DataLad
$ datalad run --message "Prepare the train and testing data" \
   --input "data/raw/*" \
   --output "data/prepared/*" \
   python code/prepare.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/prepare.py]
save(ok): . (dataset)

The results of this computation are automatically saved and associated with their inputs and command execution. This information isn’t stored in a separate file, but in the Git history, and saved with the commit message we have attached to the datalad run command.
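How such a record could be read back from a commit message can be sketched in Python. The sample message below is abridged and illustrative: it is modeled on the layout of run commits, in which a JSON block sits between two marker lines, but the exact set of fields is an assumption, and real records contain further entries. `extract_run_record` is a hypothetical helper, not part of DataLad.

```python
import json

# Illustrative, abridged commit message; real run records contain further fields
commit_message = """[DATALAD RUNCMD] Prepare the train and testing data

=== Do not change lines below ===
{
 "cmd": "python code/prepare.py",
 "exit": 0,
 "inputs": ["data/raw/*"],
 "outputs": ["data/prepared/*"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""


def extract_run_record(message: str) -> dict:
    """Parse the JSON block between the two marker lines of a run commit message."""
    start = message.index("=== Do not change lines below ===")
    end = message.index("^^^ Do not change lines above ^^^")
    # Drop the opening marker line itself, keep everything up to the closing marker
    json_text = message[start:end].split("\n", 1)[1]
    return json.loads(json_text)


record = extract_run_record(commit_message)
print(record["cmd"])
```

Because the record is machine-readable, a command such as datalad rerun can use it to repeat the execution without the user having to retype the command, inputs, or outputs.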

To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could also use a range of other identifiers (such as commit hashes) to identify this computation. As we have now set up our data and are ready for the analysis, we will name the first tag “ready-for-analysis”. This can be done with git tag (manual), but also with datalad save (manual).

### DVC-DataLad
$ datalad save --version-tag ready-for-analysis
save(ok): . (dataset)

Let’s continue with training by running code/train.py on the prepared data.

### DVC-DataLad
$ datalad run --message "Train an SGD classifier" \
   --input "data/prepared/*" \
   --output "model/model.joblib" \
   python code/train.py
[INFO] == Command start (output follows) =====
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)

As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.

As a last step, we evaluate the first model:

### DVC-DataLad
$ datalad run --message "Evaluate SGD classifier model" \
   --input "model/model.joblib" \
   --output "metrics/accuracy.json" \
   python code/evaluate.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)

At this point, the first accuracy metric is saved in metrics/accuracy.json. Let’s add a tag to declare that it belongs to the SGD classifier.

### DVC-DataLad
$ datalad save --version-tag SGD
save(ok): . (dataset)

Let’s now change the training script to use a random forest classifier as before:

### DVC-DataLad
$ cat << EOT >| code/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

We need to save this change:

$ datalad save -m "Switch to random forest classification" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)

Afterwards, we can rerun all run records between the tags ready-for-analysis and SGD using datalad rerun. If we wanted to, we could automatically compute this on a different branch by using the --branch option:

$ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready-for-analysis..SGD
[INFO] checkout commit c4e30d5;
[INFO] run commit 4759bf7; (Train an SGD clas...)
unlock(ok): model/model.joblib (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
[INFO] run commit 8a9a2a0; (Evaluate SGD clas...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  run (ok: 2)
  save (ok: 2)
  unlock (notneeded: 2, ok: 1)

Done! The difference in accuracies between models could now, for example, be compared with a git diff:

$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index 74a1ee15..f6e7ded9 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7629911280101395}
\ No newline at end of file
+{"accuracy": 0.8124207858048162}
\ No newline at end of file
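Because the metric is stored as plain JSON, such a comparison can also be scripted. The sketch below assumes that the two versions of metrics/accuracy.json have already been obtained as strings (for example with `git show SGD:metrics/accuracy.json`); the values are copied from the diff above, and `accuracy_gain` is a hypothetical helper.

```python
import json


def accuracy_gain(old_json: str, new_json: str) -> float:
    """Return the difference in accuracy between two JSON metric files."""
    return json.loads(new_json)["accuracy"] - json.loads(old_json)["accuracy"]


sgd = '{"accuracy": 0.7629911280101395}'
random_forest = '{"accuracy": 0.8124207858048162}'

gain = accuracy_gain(sgd, random_forest)
print(f"Random forest changes accuracy by {gain:+.3f}")
```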

Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.

5.1.5. Summary

DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses. DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines. Despite their similar purpose, the look, feel, and functions of both tools are different, and it is a personal decision which one you feel more comfortable with. Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and .gitignore files. But despite the reliance on Git, DVC barely integrates with Git: changes done to files in DVC cannot be detected by Git and vice versa, DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have distinct command functions and concepts that nevertheless share the same name. Thus, DVC users need to master Git and DVC workflows and intertwine them correctly. In return, DVC provides users with workflow management and reporting tuned to machine learning analyses. Compared to git-annex as used by DataLad, it also provides a somewhat more lightweight approach to “data version control” that behaves more uniformly across operating systems and file systems.
