5.1. Reproducible machine learning analyses: DataLad as DVC

Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of which data, and which optimization turned out to work best. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.

[Figure: ../_images/ML.svg]

DataLad provides the functionality to achieve this, and previous sections have given some demonstrations on how to do it. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too. One of the most well-known is DVC (Data Version Control), a “version control system for machine learning projects”. This section compares the two tools and demonstrates workflows for data versioning, data sharing, and analysis execution in the context of a machine learning project with DVC and DataLad. While they share a number of similarities and goals, their respective workflows are quite distinct.

The workflows showcased here are based on a DVC tutorial. This tutorial consists of the following steps:

  • A data set with pictures of 10 classes of objects (Imagenette) is version controlled with DVC

  • the data is pushed to a “storage remote” on a local path

  • the data are analyzed using various ML models in DVC pipelines

This handbook section demonstrates how DataLad could be used as an alternative to DVC. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow. The use case DataLad for reproducible machine-learning analyses demonstrates a similar analysis in a completely DataLad-centric fashion. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section can point you to relevant chapters for more insights on a command or concept. If you have never used DVC, its documentation (including the command reference) can answer further questions.

If you are not a Git user

DVC relies heavily on Git workflows. Understanding the DVC workflows requires a solid understanding of branches, Git’s concepts of Working tree, Index (“Staging Area”), and Repository, and some basic Git commands such as add, commit, and checkout. The Turing Way has an excellent chapter on version control with Git if you want to catch up on those basics first.

Terminology

Be mindful: DVC (like DataLad) comes with a range of commands and concepts that share names with their Git counterparts, but differ from them in functionality. Make sure to read the DVC documentation for each command to get more information on what it does.

5.1.1. Setup

The DVC tutorial comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, the repository needs to be forked (requires a GitHub account) and cloned from your own fork[1].

### DVC
# please clone this repository from your own fork when coding along
$ git clone https://github.com/datalad-handbook/data-version-control DVC
Cloning into 'DVC'...

The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories model and metrics, and a set of Python scripts for a machine learning analysis in src/.

### DVC
$ tree DVC
DVC
├── data
│   ├── prepared
│   └── raw
├── LICENSE
├── metrics
├── model
├── README.md
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py

6 directories, 5 files

For a comparison, we will recreate a similarly structured DataLad dataset. For greater compliance with DataLad’s YODA principles, the dataset structure will differ marginally in that scripts will be kept in code/ instead of src/. We create the dataset with two configurations, yoda and text2git[2].

### DVC-DataLad
$ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
create(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Afterwards, we make sure to get the same scripts.

### DVC-DataLad
# get the scripts
$ datalad download-url -m "download scripts for ML analysis" \
  https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
  -O 'code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): code/evaluate.py (file)
add(ok): code/prepare.py (file)
add(ok): code/train.py (file)
save(ok): . (dataset)

Here’s the final directory structure:

### DVC-DataLad
$ tree
.
├── CHANGELOG.md
├── code
│   ├── evaluate.py
│   ├── prepare.py
│   ├── README.md
│   └── train.py
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── README.md

6 directories, 6 files

Required software for coding along

In order to code along, DVC, scikit-learn, scikit-image, pandas, and numpy are required. All tools are available via pip or conda. We recommend installing them in a virtual environment – the DVC tutorial has step-by-step instructions.
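
If you want a quick start, the following commands sketch one possible setup in a fresh virtual environment (the environment path is just an example; adjust it and the package versions as needed):

### setup (sketch)
# create and activate a virtual environment
$ python3 -m venv ~/.venvs/dvc-vs-datalad
$ source ~/.venvs/dvc-vs-datalad/bin/activate
# install the required tools
$ pip install dvc scikit-learn scikit-image pandas numpy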

5.1.2. Version controlling data

In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.

Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar: (Large) data is stored outside of Git – Git only tracks information on where this data can be found.

In DataLad datasets, (large) data is handled by git-annex. Data content is hashed, and only a reference to this hash, under the original file name, is stored in Git[3]. Actual data is stored in the annex of the dataset, and annexed data can be transferred from and to a large number of storage solutions using either DataLad or git-annex commands. Information on where data is available from is stored in an internal representation of git-annex.
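
To make this concrete, here is a sketch of how an annexed file could be inspected in any DataLad dataset (the file name is a placeholder; the symlink target and the listed locations will differ):

### DVC-DataLad (sketch)
# an annexed file is a symlink that points into the dataset's annex
$ ls -l data/raw/train/<some-image>.JPEG
# git-annex reports from which locations the file content can be obtained
$ git annex whereis data/raw/train/<some-image>.JPEG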

In DVC repositories, (large) data is also supposed to be stored in external remotes such as Google Drive. For internal representation of where files are available from, DVC uses one .dvc text file for each data file or directory given to DVC. The .dvc files contain information on the path to the data in the repository, where the associated data file is available from, and a hash, and those files should be committed to Git.
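
For orientation, such a .dvc file is a small YAML file. For a tracked directory it is roughly of the following form, mirroring the fields that also appear in the dvc.lock file shown later in this section (all values below are placeholders):

outs:
- md5: 1a2b3c4d✂MD5.dir     # hash of the directory listing
  size: 12345678
  nfiles: 2000
  path: train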

5.1.2.1. DVC workflow

Prior to adding and version controlling data, a “DVC project” needs to be initialized in the Git repository:

### DVC
$ cd ../DVC
$ dvc init
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

This populates the repository with a range of staged files – most of them are internal directories and files for DVC’s configuration.

### DVC
$ git status
On branch master
Your branch is up to date with 'github/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

As they are only staged but not committed, we need to commit them (into Git):

### DVC
$ git commit -m "initialize dvc"
[master ae6d2e1] initialize dvc
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

The DVC project is now ready to version control data. In the tutorial, data comes from the “Imagenette” dataset. This data is available from an Amazon S3 bucket as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the Open Science Framework (OSF). We’ll download it and extract it into the data/raw/ directory of the repository.

### DVC
# download the data
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz

The data directories in data/raw are then version controlled with the dvc add command, which places files or complete directories under DVC's version control.

### DVC
$ dvc add data/raw/train
$ dvc add data/raw/val

To track the changes with git, run:

	git add data/raw/train.dvc data/raw/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

To track the changes with git, run:

	git add data/raw/val.dvc data/raw/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

Here is what this command has accomplished: The data files were copied into a cache in .dvc/cache (a non-human readable directory structure based on hashes similar to .git/annex/objects used by git-annex), data file names were added to a .gitignore[4] file to become invisible to Git, and two .dvc files, train.dvc and val.dvc, were created[5]. git status (manual) shows these changes:

### DVC
$ git status
On branch master
Your branch is ahead of 'github/master' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   data/raw/.gitignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/raw/train.dvc
	data/raw/val.dvc

no changes added to commit (use "git add" and/or "git commit -a")

In order to complete the version control workflow, Git needs to know about the .dvc files, and forget about the data directories. For this, the modified .gitignore file and the untracked .dvc files need to be added to Git:

### DVC
$ git add --all

Finally, we commit.

### DVC
$ git commit -m "control data with DVC"
[master 9420e0e] control data with DVC
 3 files changed, 14 insertions(+)
 create mode 100644 data/raw/train.dvc
 create mode 100644 data/raw/val.dvc

The data is now version controlled with DVC.

How does DVC represent modifications to data?

When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective .dvc file. If any file in the directory changes, this hash would change, and the dvc status command would report the directory to be “changed”. To demonstrate this, we pretend to accidentally delete a single file:

# if one or more files in the val/ data changes, dvc status reports a change
$ dvc status
data/raw/val.dvc:
    changed outs:
        modified:           data/raw/val

Important: Detecting a data modification requires the dvc status command – git status will not be able to detect changes in this directory, as it is git-ignored!

5.1.2.2. DataLad workflow

DataLad has means to get data or data archives from web sources and store their availability information within git-annex. This has several advantages: For one, the original OSF file URL is known and stored as a location to re-retrieve the data from. This enables reliable data access for yourself and others that you share the dataset with. Beyond this, the data is also automatically extracted and saved, and thus put under version control. Note that this strays slightly from DataLad’s YODA principles: in a DataLad-centric workflow, the data would become a standalone, reusable dataset that is linked as a subdataset into a study- or analysis-specific dataset. Here, though, we stick to the project organization of DVC.

### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad download-url \
  --archive \
  --message "Download Imagenette dataset" \
  https://osf.io/d6qbz/download \
  -O 'data/raw/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): data/raw/imagenette2-160.tgz (file)
save(ok): . (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad)
[INFO] Initializing special remote datalad-archives
[INFO] Extracting archive
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 2701, renamed: 2701, +annex: 2701
[INFO] Finished extraction
add-archive-content(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

At this point, the data is already version controlled[6], and we have the following directory tree:

$ tree
.
├── code
│   └── [...]
├── data
│   └── raw
│          ├── train
│          │   ├──[...]
│          └── val
│              ├── [...]
├── metrics
└── model

29 directories

How does DataLad represent modifications to data?

As DataLad always tracks files individually, datalad status (manual) (or, alternatively, git status or git annex status (manual)) will show modifications on the level of individual files:

$ datalad status
  deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)

$ git status
  On branch main
  Your branch is ahead of 'origin/main' by 2 commits.
    (use "git push" to publish your local commits)

  Changes not staged for commit:
    (use "git add/rm <file>..." to update what will be committed)
    (use "git restore <file>..." to discard changes in working directory)
      deleted:    data/raw/val/n01440764/n01440764_12021.JPEG

$ git annex status
  D data/raw/val/n01440764/n01440764_12021.JPEG

5.1.3. Sharing data

In the second part of the tutorial, the versioned data is transferred to a local directory to demonstrate data sharing.

The general mechanisms of DVC and DataLad data sharing are similar: (Large) data files are kept somewhere where potentially large files can be stored. They can be retrieved on demand as the location information is stored in Git. DVC uses the term “data remote” to refer to external storage locations for (large) data, whereas DataLad would refer to them as (storage-) siblings.

Both DVC and DataLad support a range of hosting solutions, from local paths and SSH servers to providers such as S3 or GDrive. For DVC, every supported remote is pre-implemented, which restricts the number of available services (a list is here), but results in a convenient, streamlined procedure for adding remotes based on URL schemes. DataLad, largely thanks to the “external special remotes” mechanism of git-annex, has more storage options (in addition, for example, Dropbox, the Open Science Framework (OSF), Git LFS, Figshare, GIN, or RIA stores), but depending on the selected storage provider, the procedure to add a sibling may differ. In addition, DataLad is able to store complete datasets (annexed data and Git repository) in certain services (e.g., OSF, GIN, GitHub if used with Git LFS, Dropbox, …), enabling a clone from, for example, Google Drive. And while DVC can never keep data in Git repository hosting services, DataLad can do so if the hosting service supports hosting annexed data (the default on GIN, and possible with GitHub, GitLab, or Bitbucket if used with Git LFS).
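
To illustrate the difference in procedure (not executed in this walk-through), adding, for example, an S3 bucket as a DVC remote is a one-liner based on the URL scheme, whereas DataLad typically uses a dedicated sibling command for a given provider, here from the datalad-osf extension (bucket name and sibling title are placeholders):

# DVC: the remote type is inferred from the URL scheme
$ dvc remote add -d s3remote s3://mybucket/dvcstore
# DataLad: e.g., with the datalad-osf extension installed, create a sibling on the OSF
$ datalad create-sibling-osf --title my-analysis -s osf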

5.1.3.1. DVC workflow

Step 1: Set up a remote

The DVC tutorial demonstrates data sharing via a local data remote[7]. As a first step, a directory that can serve as a remote needs to exist, so we create a new directory:

### DVC
# go back to DVC (we were in DVC-Datalad)
$ cd ../DVC
# create a directory somewhere else
$ mkdir ../dvc-remote

Afterwards, the new, empty directory can be added as a data remote using dvc remote add. The -d option sets it as the default remote, which simplifies pushing later on:

### DVC
$ dvc remote add -d remote_storage ../dvc-remote
Setting 'remote_storage' as a default remote.

The location of the remote is written into a config file:

### DVC
$ cat .dvc/config
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = ../../dvc-remote

Note that dvc remote add only modifies the config file, and it still needs to be added and committed to Git:

### DVC
$ git status
On branch master
Your branch is ahead of 'github/master' by 2 commits.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .dvc/config

no changes added to commit (use "git add" and/or "git commit -a")
### DVC
$ git add .dvc/config
$ git commit -m "add local remote"
[master e8ef28d] add local remote
 1 file changed, 4 insertions(+)

Remotes

The DVC and Git concepts of a “remote” are related, but not identical. Therefore, DVC remotes are invisible to git remote (manual), and likewise, Git remotes are invisible to the dvc remote list command.
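
You can verify this yourself by listing both kinds of remotes side by side:

### DVC
# Git remotes (e.g., 'origin') ...
$ git remote -v
# ... versus DVC data remotes (here, 'remote_storage')
$ dvc remote list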

Step 2: Push data to the remote

Once the remote is set up, the data that is managed by DVC can be pushed from the cache of the project to the remote. During this operation, all data for which .dvc files exist will be copied from .dvc/cache to the remote storage.

### DVC
$ dvc push
2703 files pushed

Step 3: Push Git history

At this point, all changes that were committed to Git (such as the .dvc files) still need to be pushed to a Git repository hosting service.

### DVC
# this will only work if you have cloned from your own fork
$ git push origin master
To /home/me/pushes/data-version-control
 * [new branch]      master -> master

Step 4: Data retrieval

In DVC projects, there are several ways to retrieve data into its original location or the project cache. In order to demonstrate this, we start by deleting a data directory (in its original location, data/raw/val/).

### DVC
$ rm -rf data/raw/val

Status

Do note that this deletion would not be detected by git status – you have to use dvc status instead.

At this point, a copy of the data still resides in the cache of the repository. These data are copied back to val/ with the dvc checkout command:

### DVC
$ dvc checkout data/raw/val.dvc
A       data/raw/val/

If the cache of the repository were empty, the data could be re-retrieved into the cache from the data remote. To demonstrate this, let’s look at a repository with an empty cache by cloning this repository from GitHub into a new location.

### DVC
# clone the repo into a new location for demonstration purposes:
$ cd ../
$ git clone https://github.com/datalad-handbook/data-version-control DVC-2
Cloning into 'DVC-2'...
done.

Retrieving the data from the data remote to repopulate the cache is done with the dvc fetch command:

### DVC
$ cd DVC-2
$ dvc fetch data/raw/val.dvc
790 files fetched

Afterwards, another dvc checkout will copy the files from the cache back to val/. Alternatively, the command dvc pull performs fetch (get data into the cache) and checkout (copy data from the cache to its original location) in a single command.

Unless DVC is used on a small subset of file systems (Btrfs, XFS, OCFS2, or APFS), copying data between its original location and the cache is the default. This results in a “built-in data duplication” on most current file systems[8]. An alternative is to switch from copies to symlinks (as done by git-annex) or hardlinks.
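
If your file system supports it, you could, for example, switch the cache link type to symlinks and re-link the workspace. This is a sketch based on DVC's cache.type configuration (consult the DVC documentation for the options available on your system):

### DVC
# use symlinks instead of copies between workspace and cache
$ dvc config cache.type symlink
$ dvc checkout --relink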

5.1.3.2. DataLad workflow

Because the OSF archive containing the raw data is known and stored in the dataset, it strictly speaking isn’t necessary to create a storage sibling to push the data to – DataLad already treats the original web location as storage. Currently, the dataset can thus be shared via GitHub or similar hosting services, and the data can be retrieved using datalad get (manual).

Really?

Sure. Let’s demonstrate this. First, we create a sibling on GitHub for this dataset and push its contents to the sibling:

### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad create-sibling-github DVC-DataLad --github-organization datalad-handbook
[INFO   ] Successfully obtained information about organization datalad-handbook using UserPassword(name='github', url='https://github.com/login') credential
 .: github(-) [https://github.com/datalad-handbook/DVC-DataLad.git (git)]
 'https://github.com/datalad-handbook/DVC-DataLad.git' configured as sibling 'github' for Dataset(/home/me/DVCvsDL/DVC-DataLad)
$ datalad push --to github
  Update availability for 'github': [...] [00:00<00:00, 28.9k Steps/s]Username for 'https://github.com': <user>
  Password for 'https://adswa@github.com': <password>
  publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
  publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]

Next, we can clone this dataset, and retrieve the files:

### DVC-DataLad
# outside of a dataset
$ datalad clone https://github.com/datalad-handbook/DVC-DataLad.git DVC-DataLad-2
$ cd DVC-DataLad-2
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): /home/me/DVCvsDL/DVC-DataLad-2 (dataset)
### DVC-DataLad2
$ datalad get data/raw/val
[INFO] To obtain some keys we need to fetch an archive of size 15.1 MB
[INFO] datalad-archives special remote is using an extraction cache under /home/me/DVCvsDL/DVC-DataLad-2/.git/datalad/tmp/archives/8f2938add6. Remove it with DataLad's 'clean' command to save disk space.
get(ok): data/raw/val (directory)

The data was retrieved by re-downloading the original archive from OSF and extracting the required files.
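
If you are curious where git-annex will obtain the data from, you can query the known locations of any annexed file, for example:

### DVC-DataLad2
# list the registered locations (in this dataset, the datalad-archives special
# remote that points to the OSF tarball)
$ git annex whereis data/raw/val/n01440764/n01440764_12021.JPEG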

Here’s an example of pushing a dataset to a local sibling nevertheless:

Step 1: Set up the sibling

The easiest way to share data is via a local sibling[7]. This will not only share the annexed data, but push everything, including the Git history of the dataset. First, we need to create a local sibling:

### DVC-DataLad
$ cd DVC-DataLad
$ datalad create-sibling --name mysibling ../datalad-sibling
[INFO] Considering to create a target dataset /home/me/DVCvsDL/DVC-DataLad at /home/me/DVCvsDL/datalad-sibling of localhost
update(ok): . (dataset)
[INFO] Adjusting remote git configuration
[INFO] Running post-update hooks in all created siblings
create_sibling(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Step 2: Push the data

Afterwards, the dataset contents can be pushed using datalad push (manual).

### DVC-DataLad
$ datalad push --to mysibling
publish(ok): . (dataset) [refs/heads/git-annex->mysibling:refs/heads/git-annex a19cf78a..2592d518]
publish(ok): . (dataset) [refs/heads/main->mysibling:refs/heads/main [new branch]]

This pushed all of the annexed data and the Git history of the dataset.

Step 3: Retrieve the data

The data in the dataset (complete directories or individual files) can be dropped using datalad drop (manual), and reobtained using datalad get.

### DVC-DataLad
$ datalad drop data/raw/val
drop(ok): data/raw/val (directory)
### DVC-DataLad
$ datalad get data/raw/val
get(ok): data/raw/val (directory)

5.1.4. Data analysis

DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.

In this general overview, we will only rush through the analysis: In short, it consists of three steps, each associated with a script. src/prepare.py creates two .csv files with mappings of file names in train/ and val/ to image categories. Later, these files will be used to train and test the classifiers. src/train.py loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the model/ directory as model.joblib. The final script, src/evaluate.py, is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file metrics/accuracy.json. More detailed insights into and explanations of the actual analysis code can be found in the Tutorial.
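
To give an impression without reproducing all scripts here, the evaluation step boils down to something like the following sketch (illustrative only – the repository ships its own src/evaluate.py):

# sketch of an evaluation step: load the trained model, score it on the test split,
# and write the accuracy into metrics/accuracy.json (paths follow the repository layout)
from json import dump
from pathlib import Path

from joblib import load
from sklearn.metrics import accuracy_score

from train import load_data  # reuses the CSV/image loading helpers defined in train.py


def main(repo_path):
    test_csv_path = repo_path / "data/prepared/test.csv"
    test_data, labels = load_data(test_csv_path)
    model = load(repo_path / "model/model.joblib")
    predictions = model.predict(test_data)
    accuracy = accuracy_score(labels, predictions)
    with open(repo_path / "metrics/accuracy.json", "w") as f:
        dump({"accuracy": accuracy}, f)


if __name__ == "__main__":
    main(Path(__file__).parent.parent)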

For workflow management, DVC has the concept of a “DVC pipeline”. A pipeline consists of multiple stages, which are set up and executed using a dvc stage add [--run] command. Each stage has three components: “deps”, “outs”, and “command”. Each of the scripts in the repository will be represented by a stage in the DVC pipeline.

DataLad does not have any workflow management functions. The closest equivalents are datalad run (manual) to record any command execution or analysis, datalad rerun (manual) to recompute such an analysis, and datalad containers-run (manual) to perform and record a command execution or analysis inside of a tracked software container[10].

5.1.4.1. DVC workflow

Model 1: SGD classifier

Each model will be analyzed in a different branch of the repository. Therefore, we start by creating a new branch.

### DVC
$ cd ../DVC
$ git checkout -b sgd-pipeline
Switched to a new branch 'sgd-pipeline'

The first stage in the pipeline is data preparation (performed by the script prepare.py). The following command sets up the stage:

### DVC
$ dvc stage add -n prepare \
  -d src/prepare.py -d data/raw \
  -o data/prepared/train.csv -o data/prepared/test.csv \
  --run \
  python src/prepare.py
Added stage 'prepare' in 'dvc.yaml'
Running stage 'prepare':
> python src/prepare.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/prepared/.gitignore dvc.lock dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true

The -n parameter gives the stage a name, the -d parameter passes the dependencies – the raw data – to the command, and the -o parameter defines the outputs of the command – the CSV files that prepare.py will create. python src/prepare.py is the command that will be executed in the stage.

The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml data/prepared/.gitignore dvc.lock

The dvc stage add --run command not only executed the stage, it also created two YAML files, dvc.yaml and dvc.lock. They contain the pipeline description and its current state, which at this point comprises only the first stage:

### DVC
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv

The lock file records the versions of all relevant files via MD5 hashes. This allows DVC to track all dependencies and outputs, and to detect if any of these files change.

### DVC
$ cat dvc.lock
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: data/raw
      hash: md5
      md5: 3f163676✂MD5.dir
      size: 16711951
      nfiles: 2704
    - path: src/prepare.py
      hash: md5
      md5: ef804f35✂MD5
      size: 1231
    outs:
    - path: data/prepared/test.csv
      hash: md5
      md5: 0b90b0e8✂MD5
      size: 62023
    - path: data/prepared/train.csv
      hash: md5
      md5: 360a73ac✂MD5
      size: 155128

The command also added the results of the stage, train.csv and test.csv, to a .gitignore file.

The next pipeline stage is training, in which train.py will be used to train a classifier on the data. Initially, this classifier is an SGD classifier. The following command sets it up:

$ dvc stage add -n train \
   -d src/train.py -d data/prepared/train.csv \
   -o model/model.joblib \
   --run \
   python src/train.py
Added stage 'train' in 'dvc.yaml'
Running stage 'train':
> python src/train.py
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock model/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

Afterwards, train.py has been executed, and the pipeline has been updated with a second stage. The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml model/.gitignore dvc.lock

Finally, we create the last stage, model evaluation. The following command sets it up:

$ dvc stage add -n evaluate \
         -d src/evaluate.py -d model/model.joblib \
         -M metrics/accuracy.json \
         --run \
         python src/evaluate.py
Added stage 'evaluate' in 'dvc.yaml'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
### DVC
$ git add dvc.yaml dvc.lock

Instead of “outs”, this final stage uses the -M flag to denote a “metric”. This type of flag can be used if floating-point or integer values that summarize model performance (e.g. accuracies, receiver operating characteristics, or area under the curve values) are saved in hierarchical files (JSON, YAML). DVC can then read from these files to display model performances and comparisons:

### DVC
$ dvc metrics show
Path                   accuracy
metrics/accuracy.json  0.67934

The complete pipeline now consists of preparation, training, and evaluation. It needs to be committed, tagged, and pushed:

### DVC
$ git add --all
$ git commit -m "Add SGD pipeline"
$ dvc commit
$ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline c400246] Add SGD pipeline
 5 files changed, 83 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 create mode 100644 metrics/accuracy.json
To /home/me/pushes/data-version-control
 * [new branch]      sgd-pipeline -> sgd-pipeline
branch 'sgd-pipeline' set up to track 'origin/sgd-pipeline'.
To /home/me/pushes/data-version-control
 * [new tag]         sgd -> sgd
3 files pushed

Model 2: random forest classifier

In order to explore a second model, a random forest classifier, we start with a new branch.

### DVC
$ git checkout -b random-forest
Switched to a new branch 'random-forest'

To switch from SGD to a random forest classifier, a few lines of code within train.py need to be changed. The following heredoc overwrites the script accordingly (the relevant change is the use of RandomForestClassifier instead of the SGD classifier):

### DVC
$ cat << EOT >| src/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

Afterwards, since train.py has changed, dvc status will report that one dependency of the pipeline stage “train” has changed:

### DVC
$ dvc status
train:
	changed deps:
		modified:           src/train.py

Since the code change (stage 2) will likely affect the metric (stage 3), it is best to reproduce the whole chain. You can reproduce a complete DVC pipeline file with the dvc repro <stagename> command:

### DVC
$ dvc repro evaluate
'data/raw/val.dvc' didn't change, skipping
'data/raw/train.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

DVC checks the dependencies of the pipeline and re-executes commands that need to be executed again. Compared to the branch sgd-pipeline, the workspace in the current random-forest branch contains a changed script (src/train.py), a changed trained classifier (model/model.joblib), and a changed metric (metric/accuracy.json). All these changes need to be committed, tagged, and pushed now.

### DVC
$ git add --all
$ git commit -m "Train Random Forest classifier"
$ dvc commit
$ git push --set-upstream origin random-forest
$ git tag -a randomforest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random-forest c565b15] Train Random Forest classifier
 3 files changed, 11 insertions(+), 17 deletions(-)
To /home/me/pushes/data-version-control
 * [new branch]      random-forest -> random-forest
branch 'random-forest' set up to track 'origin/random-forest'.
To /home/me/pushes/data-version-control
 * [new tag]         randomforest -> randomforest
1 file pushed

At this point, you can compare metrics across multiple tags:

### DVC
$ dvc metrics show -T
Revision      Path                   accuracy
workspace     metrics/accuracy.json  0.79848
randomforest  metrics/accuracy.json  0.79848
sgd           metrics/accuracy.json  0.67934

Done!

5.1.4.2. DataLad workflow

For a direct comparison to DVC, we’ll try to mimic the DVC workflow as closely as possible with DataLad.

Model 1: SGD classifier

### DVC-DataLad
$ cd ../DVC-DataLad

As there is no workflow manager in DataLad[9], each script execution needs to be done separately. To record the execution, retrieve all relevant inputs, and be able to recompute outputs at later points, we can set up a datalad run call[10]. Later on, we can rerun a range of datalad run calls at once to recompute the relevant aspects of the analysis. To harmonize execution and to assist with reproducibility of the results, we generally recommend creating a software container (Docker or Singularity), adding it to the dataset as well, and using datalad containers-run[11], which can likewise be rerun – but we’ll stay basic here.
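
For completeness, such a containerized setup could look roughly like the sketch below, using the datalad-container extension (not executed in this walk-through; the container URL is a placeholder):

### DVC-DataLad (sketch)
# register a container image in the dataset (URL is a placeholder)
$ datalad containers-add analysis-env --url shub://<user>/<image>
# execute and record the preparation step inside the container
$ datalad containers-run -n analysis-env \
   --input "data/raw/*" \
   --output "data/prepared/*" \
   "python code/prepare.py"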

Let’s start with data preparation. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.

### DVC-DataLad
$ datalad run --message "Prepare the train and testing data" \
   --input "data/raw/*" \
   --output "data/prepared/*" \
   python code/prepare.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/prepare.py]
save(ok): . (dataset)

The results of this computation are automatically saved and associated with their inputs and command execution. This information isn’t stored in a separate file, but in the Git history, and saved with the commit message we have attached to the datalad run command.
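
You can convince yourself of this by looking at the most recent commit – the run record, a machine-readable JSON section, is embedded in its commit message:

### DVC-DataLad
# show the full message of the latest commit, including the re-executable run record
$ git log -1 --format=%B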

To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could also use a range of other identifiers (such as commit hashes) to identify this computation. As we at this point have set up our data and are ready for the analysis, we will name the first tag “ready-for-analysis”. This can be done with git tag (manual), but also with datalad save (manual).

### DVC-DataLad
$ datalad save --version-tag ready-for-analysis
save(ok): . (dataset)

Let’s continue with training by running code/train.py on the prepared data.

### DVC-DataLad
$ datalad run --message "Train an SGD classifier" \
   --input "data/prepared/*" \
   --output "model/model.joblib" \
   python code/train.py
[INFO] == Command start (output follows) =====
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)

As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.

As a last step, we evaluate the first model:

### DVC-DataLad
$ datalad run --message "Evaluate SGD classifier model" \
   --input "model/model.joblib" \
   --output "metrics/accuracy.json" \
   python code/evaluate.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)

At this point, the first accuracy metric is saved in metrics/accuracy.json. Let’s add a tag to declare that it belongs to the SGD classifier.

### DVC-DataLad
$ datalad save --version-tag SGD
save(ok): . (dataset)

Let’s now change the training script to use a random forest classifier as before:

### DVC-DataLad
$ cat << EOT >| code/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

We need to save this change:

$ datalad save -m "Switch to random forest classification" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)

Afterwards, we can rerun all run records between the tags ready-for-analysis and SGD using datalad rerun. We could automatically compute this on a different branch if we wanted to by using the --branch option:

$ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready-for-analysis..SGD
[INFO] checkout commit 1b3b757;
[INFO] run commit 88a6e86; (Train an SGD clas...)
unlock(ok): model/model.joblib (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
[INFO] run commit 2d07713; (Evaluate SGD clas...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  run (ok: 2)
  save (ok: 2)
  unlock (notneeded: 2, ok: 1)

Done! The difference in accuracies between models could now, for example, be compared with a git diff:

$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index 74a1ee15..f6e7ded9 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7629911280101395}
\ No newline at end of file
+{"accuracy": 0.8124207858048162}
\ No newline at end of file

Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.

5.1.5. Summary

DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses. DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines. Despite their similar purpose, the look, feel, and functioning of the two tools differ, and it is a personal decision which one you feel more comfortable with. Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and .gitignore files. But despite this reliance on Git, DVC barely integrates with it – changes to files handled by DVC cannot be detected by Git and vice versa, the DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have commands and concepts with distinct functionality that nevertheless share the same name. Thus, DVC users need to master Git and DVC workflows and intertwine them correctly. In return, DVC provides users with workflow management and reporting tuned to machine learning analyses. It also provides an approach to “data version control” that is somewhat more lightweight, and more uniform across operating systems and file systems, than the git-annex-based approach used by DataLad.

Footnotes