5.1. Reproducible machine learning analyses: DataLad as DVC

Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of which data, and which optimization produced the best results. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.

[Figure: ../_images/ML.svg]

DataLad provides the functionality to achieve this, and previous sections have given some demonstrations on how to do it. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too. One of the most well-known is DVC (Data Version Control), a “version control system for machine learning projects”. This section compares the two tools and demonstrates workflows for data versioning, data sharing, and analysis execution in the context of a machine learning project with DVC and DataLad. While they share a number of similarities and goals, their respective workflows are quite distinct.

The workflows showcased here are based on a DVC tutorial. This tutorial consists of the following steps:

  • a dataset with pictures of 10 classes of objects (Imagenette) is version controlled with DVC,

  • the data is pushed to a “storage remote” on a local path,

  • the data is analyzed using various ML models in DVC pipelines.

This handbook section demonstrates how DataLad could be used as an alternative to DVC. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow. The use case DataLad for reproducible machine-learning analyses demonstrates a similar analysis in a completely DataLad-centric fashion. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section point you to relevant chapters for more insights on a command or concept. If you have never used DVC, its technical docs or the collection of third-party tutorials can answer further questions.

If you are not a Git user

DVC relies heavily on Git workflows. Understanding the DVC workflows requires a solid understanding of branches, Git’s concepts of Working tree, Index (“Staging Area”), and Repository, and some basic Git commands such as add, commit, and checkout. The Turing Way has an excellent chapter on version control with Git if you want to catch up on those basics first.

Note for Git users

Be mindful: DVC (like DataLad) comes with a range of commands and concepts that share names with Git commands but differ in functionality from their Git namesakes. Make sure to read the DVC documentation for each command to get more information on what it does.

5.1.1. Setup

The DVC tutorial comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, the repository needs to be forked (requires a GitHub account) and cloned from your own fork1.

### DVC
# please clone this repository from your own fork when coding along
$ git clone git@github.com:datalad-handbook/data-version-control.git DVC
Cloning into 'DVC'...

The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories model and metrics, and a set of Python scripts for a machine learning analysis in src/.

### DVC
$ tree DVC
DVC
├── data
│   ├── prepared
│   └── raw
├── LICENSE
├── metrics
├── model
├── README.md
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py

6 directories, 5 files

For comparison, we will recreate a similarly structured DataLad dataset. For greater compliance with DataLad’s YODA principles, the dataset structure will differ marginally in that scripts are kept in code/ instead of src/. We create the dataset with two configurations, yoda and text2git2.

### DVC-DataLad
$ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Creating a new annex repo at /home/me/DVCvsDL/DVC-DataLad 
[INFO] Scanning for unlocked files (this may take some time) 
[INFO] Running procedure cfg_text2git 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
[INFO] Running procedure cfg_yoda 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Afterwards, we make sure to get the same scripts.

### DVC-DataLad
# get the scripts
$ datalad download-url -m "download scripts for ML analysis" \
  https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
  -O 'code/'
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/train.py' into '/home/me/DVCvsDL/DVC-DataLad/code/' 
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/prepare.py' into '/home/me/DVCvsDL/DVC-DataLad/code/' 
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/evaluate.py' into '/home/me/DVCvsDL/DVC-DataLad/code/' 
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 3)
  download_url (ok: 3)
  save (ok: 1)

Here’s the final directory structure:

### DVC-DataLad
$ tree
.
├── CHANGELOG.md
├── code
│   ├── evaluate.py
│   ├── prepare.py
│   ├── README.md
│   └── train.py
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── README.md

6 directories, 6 files

Required software for coding along

In order to code along, DVC, scikit-learn, scikit-image, pandas, and numpy are required. All tools are available via pip or conda. We recommend installing them in a virtual environment – the DVC tutorial has step-by-step instructions.

5.1.2. Version controlling data

In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.

Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar: (Large) data is stored outside of Git – Git only tracks information on where this data can be found.

In DataLad datasets, (large) data is handled by git-annex. Data content is hashed and only the hash (represented as the original file name) is stored in Git3. Actual data is stored in the annex of the dataset, and annexed data can be transferred from and to a large number of storage solutions using either DataLad or git-annex commands. Information on where data is available from is stored in an internal representation of git-annex.
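A key of the shape git-annex produces appears verbatim in a datalad get output later in this section (MD5E-s98948031--0edfc972b5c9817ac36517c0057f3869.tgz). As a minimal Python sketch (an illustration of the naming scheme only, not git-annex’s actual implementation, which also covers other key backends and multi-part extensions), such an MD5E key combines the file size, the MD5 checksum of the content, and the file extension:

```python
# Illustrative sketch of a git-annex "MD5E" key: file size, MD5
# checksum of the content, and original extension, joined into one name.
# Not git-annex's actual implementation.
import hashlib

def md5e_key(content: bytes, extension: str) -> str:
    checksum = hashlib.md5(content).hexdigest()
    return "MD5E-s{}--{}.{}".format(len(content), checksum, extension)

print(md5e_key(b"hello", "txt"))
# MD5E-s5--5d41402abc4b2a76b9719d911017c592.txt
```

Because the key is derived from the content, two files with identical content map to the same annex object, and any content change produces a different key.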

In DVC repositories, (large) data is also supposed to be stored outside of Git, in external remotes such as Google Drive. For the internal representation of where files are available from, DVC uses one .dvc text file for each data file or directory given to DVC. Each .dvc file contains the path to the data within the repository, information on where the associated data is available from, and a hash; these small text files are committed to Git.
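As a sketch of such a .dvc file for a tracked directory (field layout as DVC uses it, hash value invented for illustration):

```yaml
# hypothetical contents of data/raw/train.dvc (hash value invented)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir   # hash over the directory
  path: train
```

The actual data behind this hash lives in DVC’s cache or on a data remote; Git only ever sees this small text file.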

5.1.2.1. DVC workflow

Prior to adding and version controlling data, a “DVC project” needs to be initialized in the Git repository:

### DVC
$ cd ../DVC
$ dvc init

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

This populates the repository with a range of staged files – most of them are internal directories and files for DVC’s configuration.

### DVC
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvc/plots/confusion.json
	new file:   .dvc/plots/default.json
	new file:   .dvc/plots/scatter.json
	new file:   .dvc/plots/smooth.json
	new file:   .dvcignore

As they are only staged but not committed, we need to commit them (into Git):

### DVC
$ git commit -m "initialize dvc"
[master 15adabf] initialize dvc
 7 files changed, 131 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 .dvcignore

The DVC project is now ready to version control data. In the tutorial, data comes from the “Imagenette” dataset. First, the data needs to be downloaded from an Amazon S3 bucket as a compressed tarball and extracted into the data/raw/ directory of the repository.

### DVC
# download the data
$ curl -s https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz \
          -o imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ cp -r imagenette2-160/train data/raw/
$ cp -r imagenette2-160/val data/raw/
# remove the archive and extracted folder
$ rm -rf imagenette2-160
$ rm imagenette2-160.tgz

The data directories in data/raw are then version controlled with the dvc add command, which places files or complete directories under DVC’s version control.

### DVC
$ dvc add data/raw/train
$ dvc add data/raw/val

To track the changes with git, run:

	git add data/raw/.gitignore data/raw/train.dvc

To track the changes with git, run:

	git add data/raw/.gitignore data/raw/val.dvc

Here is what this command has accomplished: The data files were copied into a cache in .dvc/cache (a non-human readable directory structure based on hashes similar to .git/annex/objects used by git-annex), data file names were added to a .gitignore4 file to become invisible to Git, and two .dvc files, train.dvc and val.dvc, were created5. git status shows these changes:

### DVC
$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   data/raw/.gitignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/raw/train.dvc
	data/raw/val.dvc

no changes added to commit (use "git add" and/or "git commit -a")

In order to complete the version control workflow, Git needs to know about the .dvc files, and forget about the data directories. For this, the modified .gitignore file and the untracked .dvc files need to be added to Git:

### DVC
$ git add --all

Finally, we commit.

### DVC
$ git commit -m "control data with DVC"
[master 60067f9] control data with DVC
 3 files changed, 8 insertions(+)
 create mode 100644 data/raw/train.dvc
 create mode 100644 data/raw/val.dvc

The data is now version controlled with DVC.

How does DVC represent modifications to data?

When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective .dvc file. If any file in the directory changes, this hash would change, and the dvc status command would report the directory to be “changed”. To demonstrate this, we pretend to accidentally delete a single file:

# if one or more files in the val/ data changes, dvc status reports a change
$ dvc status
data/raw/val.dvc:
    changed outs:
        modified:           data/raw/val

Important: Detecting a data modification requires the dvc status command – git status will not be able to detect changes in this directory, as it is git-ignored!
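The directory-level change detection described above can be sketched in a few lines of Python (a conceptual illustration, not DVC’s actual implementation): hash every file, combine the per-file hashes into one directory hash, and compare it to the recorded one.

```python
# Conceptual sketch of directory-level change detection (not DVC's code):
# one hash summarizes a directory; any added, removed, or modified file
# changes that hash.
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def directory_hash(directory: Path) -> str:
    # sort entries so the combined hash is independent of listing order
    entries = sorted(
        (str(p.relative_to(directory)), file_md5(p))
        for p in directory.rglob("*") if p.is_file()
    )
    return hashlib.md5(repr(entries).encode()).hexdigest()
```

Deleting a single file changes the combined hash, which is why the whole directory is reported as “modified” even though no remaining file changed.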

5.1.2.2. DataLad workflow

DataLad has means to get data or data archives from web sources and store this availability information within git-annex. This has several advantages: For one, the original S3 bucket is known and stored as a location to re-retrieve the data from. This enables reliable data access for yourself and others that you share the dataset with. Beyond this, the data is also automatically extracted and saved, and thus put under version control. Note that this strays slightly from DataLad’s YODA principles in a DataLad-centric workflow, where data should become a standalone, reusable dataset that would be linked as a subdataset into a study/analysis specific dataset. Here, we stick to the project organization of DVC though.

### DVC-DataLad
$ cd ../DVC-DataLad
# datalad > 0.13.4 needs the configuration option -c datalad.runtime.use-patool=1
$ datalad -c datalad.runtime.use-patool=1 download-url \
  --archive \
  --message "Download Imagenette dataset" \
  'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' \
  -O 'data/raw/'
[INFO] Downloading 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz' into '/home/me/DVCvsDL/DVC-DataLad/data/raw/' 
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad) 
[INFO] Initiating special remote datalad-archives 
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 13394, renamed: 13394, +annex: 13394 
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)

At this point, the data is already version controlled6, but the directory structure doesn’t resemble that of the DVC dataset yet – the extracted directory adds one unnecessary directory layer:

.
├── code
│   └── [...]
├── data
│   └── raw
│         └── imagenette2-160
│              ├── train
│              │   ├──[...]
│              └── val
│                  ├── [...]
├── metrics
└── model

29 directories

To make the scripts work, we move the raw data up one level. This move needs to be saved.

### DVC-DataLad
$ mv data/raw/imagenette2-160/* data/raw/ && rmdir data/raw/imagenette2-160
$ datalad save -m "Move data into preferred locations"
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 13394)
  delete (ok: 13394)
  save (ok: 1)

How does DataLad represent modifications to data?

As DataLad always tracks files individually, datalad status (or, alternatively, git status or git annex status) will show modifications on the level of individual files:

$ datalad status
  deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)

$ git status
  On branch master
  Your branch is ahead of 'origin/master' by 2 commits.
    (use "git push" to publish your local commits)

  Changes not staged for commit:
    (use "git add/rm <file>..." to update what will be committed)
    (use "git restore <file>..." to discard changes in working directory)
      deleted:    data/raw/val/n01440764/n01440764_12021.JPEG

$ git annex status
  D data/raw/val/n01440764/n01440764_12021.JPEG

5.1.3. Sharing data

In the second part of the tutorial, the versioned data is transferred to a local directory to demonstrate data sharing.

The general mechanisms of DVC and DataLad data sharing are similar: (Large) data files are kept outside of Git, in a storage location that can accommodate them. They can be retrieved on demand, as the location information is stored in Git. DVC uses the term “data remote” to refer to external storage locations for (large) data, whereas DataLad refers to them as (storage) siblings.

Both DVC and DataLad support a range of hosting solutions, from local paths and SSH servers to providers such as S3 or GDrive. For DVC, every supported remote is pre-implemented, which restricts the number of available services (a list is here), but results in a convenient, streamlined procedure for adding remotes based on URL schemes. DataLad, largely thanks to the “external special remote” mechanism of git-annex, has more storage options (for example DropBox, the Open Science Framework (OSF), Git LFS, Figshare, GIN, or RIA stores in addition), but the procedure to add a sibling may differ depending on the selected storage provider. In addition, DataLad is able to store complete datasets (annexed data and Git repository) in certain services (e.g., OSF, GIN, GitHub if used with Git LFS, Dropbox, …), enabling a clone from, for example, Google Drive. And while DVC can never keep data in Git repository hosting services, DataLad can do so if the hosting service supports annexed data (the default on GIN, and possible with GitHub, GitLab, or BitBucket if used with Git LFS).

5.1.3.1. DVC workflow

Step 1: Set up a remote

The DVC tutorial demonstrates data sharing via a local data remote7. As a first step, a directory that can serve as the remote needs to exist, so we create one:

### DVC
# go back to DVC (we were in DVC-Datalad)
$ cd ../DVC
# create a directory somewhere else
$ mkdir ../dvc_remote

Afterwards, the new, empty directory can be added as a data remote using dvc remote add. The -d option sets it as the default remote, which simplifies pushing later on:

### DVC
$ dvc remote add -d remote_storage ../dvc_remote
Setting 'remote_storage' as a default remote.

The location of the remote is written into a config file:

### DVC
$ cat .dvc/config
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = ../../dvc_remote

Note that dvc remote add only modifies the config file, and it still needs to be added and committed to Git:

### DVC
$ git status
On branch master
Your branch is ahead of 'origin/master' by 2 commits.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .dvc/config

no changes added to commit (use "git add" and/or "git commit -a")
### DVC
$ git add .dvc/config
$ git commit -m "add local remote"
[master adbaee9] add local remote
 1 file changed, 4 insertions(+)

Note for Git users

The DVC and Git concepts of a “remote” are related, but not identical. Therefore, DVC remotes are invisible to git remote, and likewise, Git remotes are invisible to the dvc remote list command.

Step 2: Push data to the remote

Once the remote is set up, the data that is managed by DVC can be pushed from the cache of the project to the remote. During this operation, all data for which .dvc files exist will be copied from .dvc/cache to the remote storage.

### DVC
$ dvc push
13396 files pushed

Step 3: Push Git history

At this point, all changes that were committed to Git (such as the .dvc files) still need to be pushed to a Git repository hosting service.

### DVC
# this will only work if you have cloned from your own fork
$ git push origin master
To github.com:datalad-handbook/data-version-control.git
   b796ba1..adbaee9  master -> master

Step 4: Data retrieval

In DVC projects, there are several ways to retrieve data into its original location or the project cache. In order to demonstrate this, we start by deleting a data directory (in its original location, data/raw/val/).

### DVC
$ rm -rf data/raw/val

Note for Git users

Do note that this deletion would not be detected by git status – you have to use dvc status instead.

At this point, a copy of the data still resides in the cache of the repository. These data are copied back to val/ with the dvc checkout command:

### DVC
$ dvc checkout data/raw/val.dvc
A	data/raw/val/

If the cache of the repository were empty, the data could be re-retrieved into the cache from the data remote. To demonstrate this, let’s look at a repository with an empty cache by cloning this repository from GitHub into a new location.

### DVC
# clone the repo into a new location for demonstration purposes:
$ cd ../
$ git clone git@github.com:datalad-handbook/data-version-control.git DVC-2
Cloning into 'DVC-2'...

Retrieving the data from the data remote to repopulate the cache is done with the dvc fetch command:

### DVC
$ cd DVC-2
$ dvc fetch data/raw/val.dvc
3925 files fetched

Afterwards, another dvc checkout will copy the files from the cache back to val/. Alternatively, the command dvc pull performs fetch (get data into the cache) and checkout (copy data from the cache to its original location) in a single command.

Unless DVC is used on a small subset of file systems (Btrfs, XFS, OCFS2, or APFS), copying data between its original location and the cache is the default. This results in a “built-in data duplication” on most current file systems8. An alternative is to switch from copies to symlinks (as done by git-annex) or hardlinks.
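The difference between copying and linking can be demonstrated with Python’s standard library (a generic file-system illustration, not DVC-specific code): a copy occupies new space under its own inode, while a hardlink points at the very same inode and therefore stores the content only once.

```python
# Copy vs. hardlink: a copy gets a new inode (content duplicated),
# a hardlink shares the inode with the original (no duplication).
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
cache_file = os.path.join(workdir, "cached.dat")
with open(cache_file, "wb") as f:
    f.write(b"x" * 1024)

copy_file = os.path.join(workdir, "copy.dat")
shutil.copy(cache_file, copy_file)      # duplicates the content

link_file = os.path.join(workdir, "link.dat")
os.link(cache_file, link_file)          # shares the content

print(os.stat(copy_file).st_ino == os.stat(cache_file).st_ino)  # False
print(os.stat(link_file).st_ino == os.stat(cache_file).st_ino)  # True
```

Reflink-capable file systems (the ones listed above) offer a third option: copy-on-write clones that behave like independent copies but share storage until modified.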

5.1.3.2. DataLad workflow

Because the S3 bucket of the raw data is known and stored in the dataset, it strictly speaking isn’t necessary to create a storage sibling to push the data to – DataLad already treats the original S3 bucket as storage. Currently, the dataset can thus be shared via GitHub or similar hosting services, and the data can be retrieved using datalad get.

Really?

Sure. Let’s demonstrate this. First, we create a sibling on GitHub for this dataset and push its contents to the sibling:

### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad create-sibling-github DVC-DataLad --github-organization datalad-handbook
[INFO   ] Successfully obtained information about organization datalad-handbook using UserPassword(name='github', url='https://github.com/login') credential
 .: github(-) [https://github.com/datalad-handbook/DVC-DataLad.git (git)]
 'https://github.com/datalad-handbook/DVC-DataLad.git' configured as sibling 'github' for Dataset(/home/me/DVCvsDL/DVC-DataLad)
$ datalad push --to github
  Update availability for 'github': [...] [00:00<00:00, 28.9k Steps/s]Username for 'https://github.com': <user>
  Password for 'https://adswa@github.com': <password>
  publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
  publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]

Next, we can clone this dataset, and retrieve the files:

### DVC-DataLad
# outside of a dataset
$ datalad clone git@github.com:datalad-handbook/DVC-DataLad.git DVC-DataLad-2
$ cd DVC-DataLad-2
[INFO] Cloning dataset to Dataset(/home/me/DVCvsDL/DVC-DataLad-2) 
[INFO] Attempting to clone from git@github.com:datalad-handbook/DVC-DataLad.git to /home/me/DVCvsDL/DVC-DataLad-2 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start receiving objects 
[INFO] Start resolving deltas 
[INFO] Completed clone attempts for Dataset(/home/me/DVCvsDL/DVC-DataLad-2) 
[INFO] Scanning for unlocked files (this may take some time) 
[INFO] Unable to parse git config from origin 
[INFO] Remote origin does not have git-annex installed; setting annex-ignore
|   This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
install(ok): /home/me/DVCvsDL/DVC-DataLad-2 (dataset)
### DVC-DataLad
$ datalad get data/raw/val
[INFO] To obtain some keys we need to fetch an archive of size 98.9 MB 
[INFO] PROGRESS-JSON: {"command":"get","note":"from web...\nchecksum...","success":true,"key":"MD5E-s98948031--0edfc972b5c9817ac36517c0057f3869.tgz","file":null} 
get(ok): /home/me/DVCvsDL/DVC-DataLad-2/data/raw/val (directory)
action summary:
  get (ok: 3926)

The data was retrieved by re-downloading the original archive from S3 and extracting the required files.

Here’s an example of pushing a dataset to a local sibling nevertheless:

Step 1: Set up the sibling

The easiest way to share data is via a local sibling7. This will not only share the annexed data, but push everything, including the Git aspect of the dataset. First, we need to create a local sibling:

### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad create-sibling --name my-sibling ../datalad-sibling
[INFO] Considering to create a target dataset /home/me/DVCvsDL/DVC-DataLad at /home/me/DVCvsDL/datalad-sibling of localhost 
[INFO] Fetching updates for Dataset(/home/me/DVCvsDL/DVC-DataLad) 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Adjusting remote git configuration 
[INFO] Running post-update hooks in all created siblings 
create_sibling(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Step 2: Push the data

Afterwards, the dataset contents can be pushed using datalad push.

### DVC-DataLad
$ datalad push --to my-sibling
[INFO] Publishing Dataset(/home/me/DVCvsDL/DVC-DataLad) to my-sibling 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start writing objects 
[INFO] Start resolving deltas 
publish(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [pushed to my-sibling: ['6afeac804..c6b99789a', '[new branch]']]

This pushed all of the annexed data and the Git history of the dataset.

Step 3: Retrieve the data

The data in the dataset (complete directories or individual files) can be dropped using datalad drop, and reobtained using datalad get.

### DVC-DataLad
$ datalad drop data/raw/val
drop(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/val (directory)
action summary:
  drop (ok: 3926)
### DVC-DataLad
$ datalad get data/raw/val
get(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/val (directory)
action summary:
  get (ok: 3926)

5.1.4. Data analysis

DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.

In this general overview, we will only rush through the analysis: In short, it consists of three steps, each associated with a script. src/prepare.py creates two .csv files with mappings of file names in train/ and val/ to image categories. Later, these files will be used to train and test the classifiers. src/train.py loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the model/ directory as model.joblib. The final script, src/evaluate.py is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file metrics/accuracy.json. There are more detailed insights and explanations of the actual analysis code in the Tutorial if you’re interested in finding out more.

For workflow management, DVC has the concept of a “DVC pipeline”. A pipeline consists of multiple stages and is executed using a dvc run command. Each stage has three components: “deps”, “outs”, and “command”. Each of the scripts in the repository will be represented by a stage in the DVC pipeline.

DataLad does not have any workflow management functions. The closest equivalents are datalad run to record any command execution or analysis, datalad rerun to recompute such an analysis, and datalad containers-run to perform and record a command execution or analysis inside of a tracked software container10.
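As a sketch of what such provenance capture could look like for the preparation step of this analysis (command form only, not executed here; it assumes the script’s input and output paths fit the layout of the DataLad dataset created above):

```
$ datalad run -m "prepare the data for training" \
    --input "data/raw" \
    --output "data/prepared" \
    "python code/prepare.py"
```

A later datalad rerun of the resulting commit re-executes the recorded command, retrieving the declared inputs and unlocking the declared outputs as needed.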

5.1.4.1. DVC workflow

Model 1: SGD classifier

Each model will be analyzed in a different branch of the repository. Therefore, we start by creating a new branch.

### DVC
$ cd ../DVC
$ git checkout -b sgd-pipeline
Switched to a new branch 'sgd-pipeline'

The first stage in the pipeline is data preparation (performed by the script prepare.py). The following command sets up the stage:

### DVC
$ dvc run -n prepare \
  -d src/prepare.py -d data/raw \
  -o data/prepared/train.csv -o data/prepared/test.csv \
  python src/prepare.py
Running stage 'prepare' with command:
	python src/prepare.py
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/prepared/.gitignore dvc.lock

The -n parameter gives the stage a name, the -d parameter passes the dependencies – the preparation script and the raw data – to the command, and the -o parameter defines the outputs of the command – the CSV files that prepare.py will create. python src/prepare.py is the command that will be executed in the stage.

The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml data/prepared/.gitignore dvc.lock

The dvc run command executed python src/prepare.py, and also created two YAML files, dvc.yaml and dvc.lock. They contain the pipeline description, which currently comprises the first stage:

### DVC
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv

The lock file tracks the versions of all relevant files via MD5 hashes. This allows DVC to track all dependencies and outputs and detect if any of these files change.

### DVC
$ cat dvc.lock
prepare:
  cmd: python src/prepare.py
  deps:
  - path: data/raw
    md5: a8a5252d9b14ab2c1be283822a86981a.dir
  - path: src/prepare.py
    md5: ef804f358e00edcfe52c865b471f8f55
  outs:
  - path: data/prepared/test.csv
    md5: 536fe137c83d7119c45f5d978335425b
  - path: data/prepared/train.csv
    md5: 0bad47e2449d20d62df6fd9fdbeaa32b
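The staleness check that this lock file enables can be sketched conceptually (illustrative Python, not DVC’s implementation, with hypothetical file contents): a stage needs to be re-run as soon as the freshly computed hash of any dependency no longer matches the recorded one.

```python
# Conceptual staleness check based on recorded hashes (not DVC's code).
import hashlib

def md5(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

# hashes recorded when the stage last ran (hypothetical contents)
recorded = {"src/prepare.py": md5(b"print('version 1')")}

def needs_rerun(current: dict, recorded: dict) -> bool:
    """current maps dependency paths to their present file contents."""
    return any(md5(data) != recorded.get(path)
               for path, data in current.items())

print(needs_rerun({"src/prepare.py": b"print('version 1')"}, recorded))  # False
print(needs_rerun({"src/prepare.py": b"print('version 2')"}, recorded))  # True
```

This is also why unchanged stages can be skipped on re-execution: matching hashes prove that inputs, code, and outputs are all still the ones from the recorded run.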

The command also added the results of the stage, train.csv and test.csv, to a .gitignore file.

The next pipeline stage is training, in which train.py will be used to train a classifier on the data. Initially, this classifier is an SGD classifier. The following command sets it up:

$ dvc run -n train \
   -d src/train.py -d data/prepared/train.csv \
   -o model/model.joblib \
   python src/train.py
Running stage 'train' with command:
	python src/train.py
/home/adina/env/handbook/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:570: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn("Maximum number of iteration reached before "
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml model/.gitignore dvc.lock

Afterwards, train.py has been executed, and the pipelines have been updated with a second stage. The resulting changes can be added to Git:

### DVC
$ git add dvc.yaml model/.gitignore dvc.lock

Finally, we create the last stage, model evaluation. The following command sets it up:

$ dvc run -n evaluate \
         -d src/evaluate.py -d model/model.joblib \
         -M metrics/accuracy.json \
         python src/evaluate.py
Running stage 'evaluate' with command:
	python src/evaluate.py
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock
### DVC
$ git add dvc.yaml dvc.lock

Instead of “outs”, this final stage uses the -M flag to denote a “metric”. This flag is useful when floating-point or integer values that summarize model performance (e.g., accuracies, receiver operating characteristics, or area-under-the-curve values) are saved in hierarchical files (JSON, YAML). DVC can then read from these files to display model performances and comparisons:

### DVC
$ dvc metrics show
	metrics/accuracy.json:
		accuracy: 0.7858048162230672
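Under the hood, such a metrics file is plain JSON, so any tool can consume it. A minimal sketch of what reading one amounts to (the file name and value are made up for illustration):

```python
import json
from pathlib import Path

# Write a toy metrics file of the kind an evaluation script produces
metrics_file = Path("accuracy.json")
metrics_file.write_text(json.dumps({"accuracy": 0.7858}))

# "dvc metrics show" essentially reads and pretty-prints such files
metrics = json.loads(metrics_file.read_text())
print(f"accuracy: {metrics['accuracy']}")  # → accuracy: 0.7858
```

Keeping metrics in a plain, hierarchical format is what makes the later cross-tag comparisons possible without any bespoke tooling.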

The complete pipeline now consists of preparation, training, and evaluation. It needs to be committed, tagged, and pushed:

### DVC
$ git add --all
$ git commit -m "Add SGD pipeline"
$ dvc commit
$ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd-pipeline -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline 67fc8cb] Add SGD pipeline
 5 files changed, 60 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 create mode 100644 metrics/accuracy.json
error: src refspec sgd-pipeline matches more than one
error: failed to push some refs to 'github.com:datalad-handbook/data-version-control.git'
fatal: tag 'sgd-pipeline' already exists
Received disconnect from 140.82.121.3 port 22:11: Bye Bye
Disconnected from 140.82.121.3 port 22
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
3 files pushed

Model 2: random forest classifier

In order to explore a second model, a random forest classifier, we start with a new branch.

### DVC
$ git checkout -b random_forrest
Switched to a new branch 'random_forrest'

To switch from SGD to a random forest classifier, a few lines of code within train.py need to be changed: the classifier import and its instantiation in main(). The following heredoc rewrites the script accordingly:

### DVC
$ cat << EOT >| src/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

Afterwards, since train.py has changed, dvc status will detect that one dependency of the pipeline stage “train” was modified:

### DVC
$ dvc status
train:
	changed deps:
		modified:           src/train.py

Since the code change (stage 2) will likely affect the metric (stage 3), it’s best to reproduce the whole chain. You can reproduce a complete DVC pipeline with the dvc repro <stagename> command:

### DVC
$ dvc repro evaluate
'data/raw/val.dvc' didn't change, skipping
'data/raw/train.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train' with command:
	python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate' with command:
	python src/evaluate.py
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
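The dependency propagation that dvc repro performs here can be sketched as follows: walk the stages in order, and mark a stage for re-execution if any of its dependencies changed, including outputs of an upstream stage that was itself re-run. This is a simplified, hypothetical model of that behavior, not DVC's actual implementation:

```python
# Simplified model of dvc repro's dependency propagation.
# Stage, dependency, and output names mirror the tutorial's pipeline.
stages = [
    {"name": "prepare",  "deps": ["data/raw", "src/prepare.py"],
     "outs": ["data/prepared/train.csv", "data/prepared/test.csv"]},
    {"name": "train",    "deps": ["src/train.py", "data/prepared/train.csv"],
     "outs": ["model/model.joblib"]},
    {"name": "evaluate", "deps": ["src/evaluate.py", "model/model.joblib"],
     "outs": ["metrics/accuracy.json"]},
]

def stages_to_rerun(stages, changed_files):
    """Return the names of stages that must be re-executed."""
    dirty = set(changed_files)
    rerun = []
    for stage in stages:
        if any(dep in dirty for dep in stage["deps"]):
            rerun.append(stage["name"])
            dirty.update(stage["outs"])  # outputs become stale downstream
    return rerun

# Editing train.py triggers 'train' and, transitively, 'evaluate':
print(stages_to_rerun(stages, {"src/train.py"}))  # → ['train', 'evaluate']
```

This matches the output above: “prepare” is skipped, while “train” and “evaluate” are re-executed.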

DVC checks the dependencies of the pipeline and re-executes only those commands that need to run again. Compared to the branch sgd-pipeline, the workspace in the current random_forrest branch contains a changed script (src/train.py), a changed trained classifier (model/model.joblib), and a changed metric (metrics/accuracy.json). All of these changes need to be committed, tagged, and pushed now.

### DVC
$ git add --all
$ git commit -m "Train Random Forrest classifier"
$ dvc commit
$ git push --set-upstream origin random-forest
$ git tag -a random-forest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random_forrest ff5b449] Train Random Forrest classifier
 3 files changed, 8 insertions(+), 14 deletions(-)
Everything up-to-date
fatal: tag 'random-forest' already exists
Everything up-to-date
1 file pushed

At this point, you can compare metrics across multiple tags:

### DVC
$ dvc metrics show -T
workspace:
	metrics/accuracy.json:
		accuracy: 0.8136882129277566
random-forest:
	metrics/accuracy.json:
		accuracy: 0.8187579214195184
sgd-pipeline:
	metrics/accuracy.json:
		accuracy: 0.7427122940430925

Done!

5.1.4.2. DataLad workflow

For a direct comparison to DVC, we’ll try to mimic the DVC workflow as closely as possible with DataLad.

Model 1: SGD classifier

### DVC-DataLad
$ cd ../DVC-DataLad

As there is no workflow manager in DataLad9, each script execution needs to be set up separately. To record an execution, retrieve all relevant inputs, and be able to recompute outputs at a later point, we wrap it in a datalad run call10. Later on, we can rerun a range of datalad run calls at once to recompute the relevant parts of the analysis. To harmonize execution environments and further assist with reproducibility of the results, we generally recommend creating a container (Docker or Singularity), adding it to the repository as well, and using datalad containers-run calls11, but we’ll stay basic here.

Let’s start with data preparation. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.

### DVC-DataLad
$ datalad run --message "Prepare the train and testing data" \
   --input "data/raw/*" \
   --output "data/prepared/*" \
   python code/prepare.py
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): /home/me/DVCvsDL/DVC-DataLad/data/prepared/test.csv (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/data/prepared/train.csv (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 2)
  save (ok: 1)

The results of this computation are automatically saved and associated with their inputs and command execution. This information isn’t stored in a separate file, but in the Git history, and saved with the commit message we have attached to the run command.
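Concretely, datalad run embeds a machine-readable JSON run record inside the commit message, delimited by marker lines, and datalad rerun later parses it to re-execute the command. A rough sketch of extracting such a record; the message below is a shortened, made-up example of the format, and the exact delimiter lines should be treated as an assumption about current DataLad versions:

```python
import json
import re

# A shortened, made-up commit message in the style datalad run produces
commit_message = """[DATALAD RUNCMD] Prepare the train and testing data

=== Do not change lines below ===
{
 "cmd": "python code/prepare.py",
 "inputs": ["data/raw/*"],
 "outputs": ["data/prepared/*"]
}
^^^ Do not change lines above ^^^
"""

def extract_run_record(message):
    """Pull the embedded JSON run record out of a commit message."""
    match = re.search(
        r"=== Do not change lines below ===\n(.*)\n"
        r"\^\^\^ Do not change lines above \^\^\^",
        message, flags=re.DOTALL)
    return json.loads(match.group(1)) if match else None

record = extract_run_record(commit_message)
print(record["cmd"])  # → python code/prepare.py
```

In practice you would never parse this by hand; datalad rerun does it for you. The point is that the provenance is ordinary, inspectable Git history rather than a separate pipeline file.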

To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could just as well use a range of other identifiers, for example commit hashes. As we have now set up our data and are ready for the analysis, we name the first tag “ready-for-analysis”. This can be done with git tag, but also with datalad save.

### DVC-DataLad
$ datalad save --version-tag ready-for-analysis
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Let’s continue with training by running code/train.py on the prepared data.

### DVC-DataLad
$ datalad run --message "Train an SGD classifier" \
   --input "data/prepared/*" \
   --output "model/model.joblib" \
   python code/train.py
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
/home/adina/env/handbook/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:570: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn("Maximum number of iteration reached before "
[INFO] == Command exit (modification check follows) ===== 
add(ok): /home/me/DVCvsDL/DVC-DataLad/model/model.joblib (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 2)
  save (ok: 1)

As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.

As a last step, we evaluate the first model:

### DVC-DataLad
$ datalad run --message "Evaluate SGD classifier model" \
   --input "model/model.joblib" \
   --output "metrics/accuracy.json" \
   python code/evaluate.py
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/__pycache__/train.cpython-38.pyc (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/metrics/accuracy.json (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 1)
  save (ok: 1)

At this point, the first accuracy metric is saved in metrics/accuracy.json. Let’s add a tag to declare that it belongs to the SGD classifier.

### DVC-DataLad
$ datalad save --version-tag SGD
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)

Let’s now change the training script to use a random forest classifier, as before:

### DVC-DataLad
$ cat << EOT >| code/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT

We need to save this change:

$ datalad save -m "Switch to random forrest classification" code/train.py
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Afterwards, we can rerun all run records between the tags ready-for-analysis and SGD using datalad rerun. If we wanted to, we could perform this recomputation on a different branch by using the --branch option:

$ datalad rerun --branch="randomforrest" -m "Recompute classification with random forrest classifier" ready-for-analysis..SGD
[INFO] checkout commit e228e11; 
[INFO] run commit 076981f; (Train an SGD clas...) 
[INFO] Making sure inputs are available (this may take some time) 
unlock(ok): /home/me/DVCvsDL/DVC-DataLad/model/model.joblib (file)
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): /home/me/DVCvsDL/DVC-DataLad/model/model.joblib (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
[INFO] run commit c5053b8; (Evaluate SGD clas...) 
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): /home/me/DVCvsDL/DVC-DataLad/code/__pycache__/train.cpython-38.pyc (file)
add(ok): /home/me/DVCvsDL/DVC-DataLad/metrics/accuracy.json (file)
save(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
  add (ok: 3)
  get (notneeded: 3)
  save (ok: 2)
  unlock (notneeded: 2, ok: 1)

Done! The difference in accuracy between the models can now be compared, for example with a git diff:

$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index 9c9e139d4..380449534 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7896070975918885}
\ No newline at end of file
+{"accuracy": 0.8136882129277566}
\ No newline at end of file
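Since the metric files are plain JSON, the comparison can also be scripted rather than eyeballed from a diff. A small sketch, using toy files with the accuracy values from the diff above (the file names are hypothetical stand-ins for metrics/accuracy.json at the two tags):

```python
import json
from pathlib import Path

# Toy stand-ins for metrics/accuracy.json at the SGD and random forest tags
Path("accuracy_sgd.json").write_text(json.dumps({"accuracy": 0.7896070975918885}))
Path("accuracy_rf.json").write_text(json.dumps({"accuracy": 0.8136882129277566}))

sgd = json.loads(Path("accuracy_sgd.json").read_text())["accuracy"]
rf = json.loads(Path("accuracy_rf.json").read_text())["accuracy"]
print(f"random forest improves accuracy by {rf - sgd:.4f}")  # → ... by 0.0241
```

In a real workflow you would read the two versions of the file from the respective tags (e.g., via git show SGD:metrics/accuracy.json) instead of from separate files.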

Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.

5.1.5. Summary

DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses. DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines. Despite their similar purpose, the look, feel, and functionality of the two tools differ, and it is a personal decision which one you feel more comfortable with. Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and .gitignore files. But despite this reliance on Git, DVC barely integrates with it – changes made to files under DVC’s control cannot be detected by Git and vice versa, the DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have distinct commands and concepts that nevertheless share the same name. Thus, DVC users need to master both Git and DVC workflows and intertwine them correctly. In return, DVC provides workflow management and reporting tuned to machine learning analyses, and a “data version control” approach that is somewhat more lightweight and more uniform across operating and file systems than the git-annex-based approach used by DataLad.

Footnotes

1

Instructions on forking and cloning the repo are in the README of the repository: github.com/realpython/data-version-control.

2

The two procedures provide the dataset with useful structures and configurations for its purpose: yoda creates a dataset structure with a code directory and makes sure that everything kept in code will be committed to Git (thus allowing for direct sharing of code). text2git makes sure that any other text file in the dataset will be stored in Git as well. The sections Data safety and The YODA procedure explain the two configurations in detail.

3

To re-read about how git-annex handles versioning of (large) files, go back to section Data integrity.

4

You can read more about .gitignore files in the section How to hide content from DataLad.

5

If you are curious about why data is duplicated in a cache or why the paths to the data are placed into a .gitignore file, this section in the DVC tutorial has more insights on the internals of this process.

6

The sections Populate a dataset and Modify content introduce the concepts of saving and modifying files in DataLad datasets.

7(1,2)

A similar procedure for sharing data on a local file system for DataLad is shown in the chapter Looking without touching.

8

In DataLad datasets, data duplication is usually avoided as git-annex uses symlinks. Only on file systems that lack support for symlinks or for removing write permissions from files (so called “crippled file systems” such as /sdcard on Android, FAT or NTFS) git-annex needs to duplicate data.

9

yet.

10(1,2)

To re-read about datalad run and datalad rerun, check out the chapter DataLad, Run!.

11

To re-read about joining code, execution, data, results, and software environment in a re-executable record with datalad containers-run, check out the section Computational reproducibility with software containers.