5.1. Reproducible machine learning analyses: DataLad as DVC
Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of what data and which has been the ideal optimization. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.
DataLad provides the functionality to achieve this, and previous sections have given some demonstrations on how to do it. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too. One of the most well-known is DVC (Data Version Control), a “version control system for machine learning projects”. This section compares the two tools and demonstrates workflows for data versioning, data sharing, and analysis execution in the context of a machine learning project with DVC and DataLad. While they share a number of similarities and goals, their respective workflows are quite distinct.
The workflows showcased here are based on a DVC tutorial. This tutorial consists of the following steps:
A dataset with pictures of 10 classes of objects (Imagenette) is version controlled with DVC
The data is pushed to a “storage remote” on a local path
The data is analyzed using various ML models in DVC pipelines
This handbook section demonstrates how DataLad could be used as an alternative to DVC. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow. The use case DataLad for reproducible machine-learning analyses demonstrates a similar analysis in a completely DataLad-centric fashion. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section can point you to relevant chapters for more insights on a command or concept. If you have never used DVC, its documentation (including the command reference) can answer further questions.
If you are not a Git user
DVC relies heavily on Git workflows.
Understanding the DVC workflows requires a solid understanding of branches, Git’s concepts of Working tree, Index (“Staging Area”), and Repository, and some basic Git commands such as add, commit, and checkout.
The Turing Way has an excellent chapter on version control with Git if you want to catch up on those basics first.
Terminology
Be mindful: DVC (like DataLad) comes with a range of commands and concepts that share their names with Git commands and concepts but differ from their Git namesakes in functionality. Make sure to read the DVC documentation for each command to get more information on what it does.
5.1.1. Setup
The DVC tutorial comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, the repository needs to be forked (requires a GitHub account) and cloned from your own fork[1].
### DVC
# please clone this repository from your own fork when coding along
$ git clone https://github.com/datalad-handbook/data-version-control DVC
Cloning into 'DVC'...
The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories model and metrics, and a set of Python scripts for a machine learning analysis in src/.
### DVC
$ tree DVC
DVC
├── data
│   ├── prepared
│   └── raw
├── LICENSE
├── metrics
├── model
├── README.md
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py
6 directories, 5 files
For a comparison, we will recreate a similarly structured DataLad dataset.
For greater compliance with DataLad’s YODA principles, the dataset structure will differ marginally in that scripts will be kept in code/ instead of src/.
We create the dataset with two configurations, yoda and text2git[2].
### DVC-DataLad
$ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
create(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
Afterwards, we make sure to get the same scripts.
### DVC-DataLad
# get the scripts
$ datalad download-url -m "download scripts for ML analysis" \
https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
-O 'code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): code/evaluate.py (file)
add(ok): code/prepare.py (file)
add(ok): code/train.py (file)
save(ok): . (dataset)
Here’s the final directory structure:
### DVC-DataLad
$ tree
.
├── CHANGELOG.md
├── code
│   ├── evaluate.py
│   ├── prepare.py
│   ├── README.md
│   └── train.py
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── README.md
6 directories, 6 files
Required software for coding along
In order to code along, DVC, scikit-learn, scikit-image, pandas, and numpy are required. All tools are available via pip or conda. We recommend installing them in a virtual environment – the DVC tutorial has step-by-step instructions.
5.1.2. Version controlling data
In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.
Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar: (Large) data is stored outside of Git – Git only tracks information on where this data can be found.
In DataLad datasets, (large) data is handled by git-annex. Data content is hashed and only the hash (represented as the original file name) is stored in Git[3]. Actual data is stored in the annex of the dataset, and annexed data can be transferred from and to a large number of storage solutions using either DataLad or git-annex commands. Information on where data is available from is stored in an internal representation of git-annex.
In DVC repositories, (large) data is also supposed to be stored in external remotes such as Google Drive.
For internal representation of where files are available from, DVC uses one .dvc text file for each data file or directory given to DVC.
The .dvc files contain information on the path to the data in the repository, where the associated data file is available from, and a hash, and those files should be committed to Git.
5.1.2.1. DVC workflow
Prior to adding and version controlling data, a “DVC project” needs to be initialized in the Git repository:
### DVC
$ cd ../DVC
$ dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
This populates the repository with a range of staged files – most of them are internal directories and files for DVC’s configuration.
### DVC
$ git status
On branch master
Your branch is up to date with 'github/master'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvcignore
As they are only staged but not committed, we need to commit them (into Git):
### DVC
$ git commit -m "initialize dvc"
[master ae6d2e1] initialize dvc
3 files changed, 6 insertions(+)
create mode 100644 .dvc/.gitignore
create mode 100644 .dvc/config
create mode 100644 .dvcignore
The DVC project is now ready to version control data.
In the tutorial, data comes from the “Imagenette” dataset.
This data is available from an Amazon S3 bucket as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the Open Science Framework (OSF).
We’ll download it and extract it into the data/raw/ directory of the repository.
### DVC
# download the data
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz
The data directories in data/raw are then version controlled with the dvc add command that can place files or complete directories under version control by DVC.
### DVC
$ dvc add data/raw/train
$ dvc add data/raw/val
To track the changes with git, run:
git add data/raw/train.dvc data/raw/.gitignore
To enable auto staging, run:
dvc config core.autostage true
To track the changes with git, run:
git add data/raw/val.dvc data/raw/.gitignore
To enable auto staging, run:
dvc config core.autostage true
Here is what this command has accomplished:
The data files were copied into a cache in .dvc/cache (a non-human readable directory structure based on hashes similar to .git/annex/objects used by git-annex), data file names were added to a .gitignore[4] file to become invisible to Git, and two .dvc files, train.dvc and val.dvc, were created[5].
git status (manual) shows these changes:
### DVC
$ git status
On branch master
Your branch is ahead of 'github/master' by 1 commit.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: data/raw/.gitignore
Untracked files:
(use "git add <file>..." to include in what will be committed)
data/raw/train.dvc
data/raw/val.dvc
no changes added to commit (use "git add" and/or "git commit -a")
In order to complete the version control workflow, Git needs to know about the .dvc files, and forget about the data directories.
For this, the modified .gitignore file and the untracked .dvc files need to be added to Git:
### DVC
$ git add --all
Finally, we commit.
### DVC
$ git commit -m "control data with DVC"
[master 9420e0e] control data with DVC
3 files changed, 14 insertions(+)
create mode 100644 data/raw/train.dvc
create mode 100644 data/raw/val.dvc
The data is now version controlled with DVC.
How does DVC represent modifications to data?
When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective .dvc file.
If any file in the directory changes, this hash would change, and the dvc status command would report the directory to be “changed”.
To demonstrate this, we pretend to accidentally delete a single file:
# if one or more files in the val/ data changes, dvc status reports a change
$ dvc status
data/raw/val.dvc:
    changed outs:
        modified: data/raw/val
Important: Detecting a data modification requires the dvc status command – git status will not be able to detect changes in this directory, as it is git-ignored!
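Why a single deleted file marks the whole directory as changed can be illustrated with a small sketch. It assumes a simplified scheme in which the directory hash is a digest over the sorted listing of per-file hashes; the file names and contents are hypothetical, and this is not DVC's actual implementation.

```python
import hashlib
from typing import Dict


def file_hash(content: bytes) -> str:
    """Digest of a single file's content."""
    return hashlib.md5(content).hexdigest()


def directory_hash(files: Dict[str, bytes]) -> str:
    """Hash a directory as a whole: digest over the sorted (path, file hash) listing."""
    listing = "\n".join(
        f"{path} {file_hash(content)}" for path, content in sorted(files.items())
    )
    return hashlib.md5(listing.encode()).hexdigest()


# hypothetical stand-in for the content of data/raw/val
val = {
    "class_a/1.JPEG": b"image one",
    "class_a/2.JPEG": b"image two",
    "class_b/3.JPEG": b"image three",
}
before = directory_hash(val)

# "accidentally" delete a single file
del val["class_a/2.JPEG"]
after = directory_hash(val)

print("directory changed:", before != after)
```

With one hash per directory, the comparison can only say that something changed, not which file; finding the affected file requires comparing the per-file listings.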
5.1.2.2. DataLad workflow
DataLad has means to get data or data archives from web sources and store this availability information within git-annex. This has several advantages: For one, the original OSF file URL is known and stored as a location to re-retrieve the data from. This enables reliable data access for yourself and others that you share the dataset with. Beyond this, the data is also automatically extracted and saved, and thus put under version control. Note that this strays slightly from DataLad’s YODA principles in a DataLad-centric workflow, where data should become a standalone, reusable dataset that would be linked as a subdataset into a study/analysis specific dataset. Here, we stick to the project organization of DVC though.
### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad download-url \
--archive \
--message "Download Imagenette dataset" \
https://osf.io/d6qbz/download \
-O 'data/raw/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): data/raw/imagenette2-160.tgz (file)
save(ok): . (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad)
[INFO] Initializing special remote datalad-archives
[INFO] Extracting archive
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 2701, renamed: 2701, +annex: 2701
[INFO] Finished extraction
add-archive-content(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
At this point, the data is already version controlled[6], and we have the following directory tree:
$ tree
.
├── code
│   └── [...]
├── data
│   └── raw
│       ├── train
│       │   ├──[...]
│       └── val
│           ├── [...]
├── metrics
└── model
29 directories
How does DataLad represent modifications to data?
As DataLad always tracks files individually, datalad status (manual) (or, alternatively, git status or git annex status (manual)) will show modifications on the level of individual files:
$ datalad status
deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)
$ git status
On branch main
Your branch is ahead of 'origin/main' by 2 commits.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: data/raw/val/n01440764/n01440764_12021.JPEG
$ git annex status
D data/raw/val/n01440764/n01440764_12021.JPEG
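The per-file reporting shown above can be pictured as a comparison of two mappings from file path to content identifier, one recorded in the last saved state and one describing the working tree. The following is a conceptual sketch only; the function `per_file_status` and the placeholder identifiers are hypothetical and do not reflect how DataLad, Git, or git-annex compute their status reports internally.

```python
from typing import Dict, List, Tuple


def per_file_status(
    recorded: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Compare two path -> identifier mappings and report changes per file."""
    changes = []
    for path in sorted(set(recorded) | set(current)):
        if path not in current:
            changes.append(("deleted", path))
        elif path not in recorded:
            changes.append(("untracked", path))
        elif recorded[path] != current[path]:
            changes.append(("modified", path))
    return changes


# placeholder identifiers; real tools would use content hashes or keys
recorded = {
    "data/raw/val/n01440764/n01440764_12021.JPEG": "key-1",
    "code/train.py": "key-2",
}
current = {"code/train.py": "key-2"}

changes = per_file_status(recorded, current)
for state, path in changes:
    print(f"{state}: {path}")
```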
5.1.4. Data analysis
DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.
In this general overview, we will only rush through the analysis:
In short, it consists of three steps, each associated with a script.
src/prepare.py creates two .csv files with mappings of file names in train/ and val/ to image categories.
Later, these files will be used to train and test the classifiers.
src/train.py loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the model/ directory as model.joblib.
The final script, src/evaluate.py, is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file metrics/accuracy.json.
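To make the preparation step more tangible, here is a minimal sketch of how such a mapping of file names to categories could be built. It is not the tutorial's prepare.py: the helper name `write_mapping`, the toy directory names, and the convention that each category has its own subdirectory are assumptions. Only the column names `filename` and `label` are taken from what train.py reads.

```python
import csv
import tempfile
from pathlib import Path


def write_mapping(image_dir: Path, csv_path: Path) -> int:
    """Write one 'filename,label' row per image; the label is the parent directory name."""
    rows = [
        (str(image), image.parent.name)
        for image in sorted(image_dir.glob("*/*.JPEG"))
    ]
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # build a toy directory tree with two hypothetical categories
    for label, name in [("class_a", "1.JPEG"), ("class_a", "2.JPEG"), ("class_b", "3.JPEG")]:
        (root / "raw" / "train" / label).mkdir(parents=True, exist_ok=True)
        (root / "raw" / "train" / label / name).touch()
    n_rows = write_mapping(root / "raw" / "train", root / "prepared" / "train.csv")
    print(n_rows, "images listed")
```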
There are more detailed insights and explanations of the actual analysis code in the tutorial if you’re interested in finding out more.
For workflow management, DVC has the concept of a “DVC pipeline”.
A pipeline consists of multiple stages, which are set up and executed using a dvc stage add [--run] command.
Each stage has three components: “deps”, “outs”, and “command”.
Each of the scripts in the repository will be represented by a stage in the DVC pipeline.
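The three stage components can be modelled as a small data structure, and the order of stages follows from matching one stage's outputs to another stage's dependencies. The sketch below is a simplified illustration with hypothetical names (`Stage`, `execution_order`); it assumes an acyclic pipeline, treats the metrics file as an ordinary output, and is not how DVC represents pipelines internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    name: str
    cmd: str
    deps: List[str] = field(default_factory=list)
    outs: List[str] = field(default_factory=list)


def execution_order(stages: List[Stage]) -> List[str]:
    """Order stages so that each one runs after the stages producing its deps."""
    producer: Dict[str, str] = {out: s.name for s in stages for out in s.outs}
    by_name = {s.name: s for s in stages}
    ordered: List[str] = []

    def visit(name: str) -> None:
        if name in ordered:
            return
        for dep in by_name[name].deps:
            if dep in producer:  # dep is generated by another stage
                visit(producer[dep])
        ordered.append(name)

    for stage in stages:
        visit(stage.name)
    return ordered


# the three stages of this section, deliberately listed out of order
stages = [
    Stage("evaluate", "python src/evaluate.py",
          deps=["src/evaluate.py", "model/model.joblib"],
          outs=["metrics/accuracy.json"]),
    Stage("train", "python src/train.py",
          deps=["src/train.py", "data/prepared/train.csv"],
          outs=["model/model.joblib"]),
    Stage("prepare", "python src/prepare.py",
          deps=["src/prepare.py", "data/raw"],
          outs=["data/prepared/train.csv", "data/prepared/test.csv"]),
]
order = execution_order(stages)
print(order)
```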
DataLad does not have any workflow management functions.
The closest to it are datalad run (manual) to record any command execution or analysis, datalad rerun (manual) to recompute such an analysis, and datalad containers-run (manual) to perform and record a command execution or analysis inside of a tracked software container[10].
5.1.4.1. DVC workflow
Model 1: SGD classifier
Each model will be analyzed in a different branch of the repository. Therefore, we start by creating a new branch.
### DVC
$ cd ../DVC
$ git checkout -b sgd-pipeline
Switched to a new branch 'sgd-pipeline'
The first stage in the pipeline is data preparation (performed by the script prepare.py).
The following command sets up the stage:
### DVC
$ dvc stage add -n prepare \
-d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
--run \
python src/prepare.py
Added stage 'prepare' in 'dvc.yaml'
Running stage 'prepare':
> python src/prepare.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add data/prepared/.gitignore dvc.lock dvc.yaml
To enable auto staging, run:
dvc config core.autostage true
The -n parameter gives the stage a name, the -d parameter passes the dependencies – the script and the raw data – to the command, and the -o parameter defines the outputs of the command – the CSV files that prepare.py will create.
python src/prepare.py is the command that will be executed in the stage.
The resulting changes can be added to Git:
### DVC
$ git add dvc.yaml data/prepared/.gitignore dvc.lock
The dvc stage add call ran the command and also created two YAML files, dvc.yaml and dvc.lock.
They contain the pipeline description, which currently consists of the first stage:
### DVC
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv
The lock file tracks the versions of all relevant files via MD5 hashes. This allows DVC to track all dependencies and outputs and detect if any of these files change.
### DVC
$ cat dvc.lock
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: data/raw
      hash: md5
      md5: 3f163676✂MD5.dir
      size: 16711951
      nfiles: 2704
    - path: src/prepare.py
      hash: md5
      md5: ef804f35✂MD5
      size: 1231
    outs:
    - path: data/prepared/test.csv
      hash: md5
      md5: 0b90b0e8✂MD5
      size: 62023
    - path: data/prepared/train.csv
      hash: md5
      md5: 360a73ac✂MD5
      size: 155128
The command also added the results from the stage, train.csv and test.csv, into a .gitignore file.
The next pipeline stage is training, in which train.py will be used to train a classifier on the data.
Initially, this classifier is an SGD classifier.
The following command sets it up:
$ dvc stage add -n train \
-d src/train.py -d data/prepared/train.csv \
-o model/model.joblib \
--run \
python src/train.py
Added stage 'train' in 'dvc.yaml'
Running stage 'train':
> python src/train.py
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.yaml dvc.lock model/.gitignore
To enable auto staging, run:
dvc config core.autostage true
Afterwards, train.py has been executed, and the pipeline has been updated with a second stage.
The resulting changes can be added to Git:
### DVC
$ git add dvc.yaml model/.gitignore dvc.lock
Finally, we create the last stage, model evaluation. The following command sets it up:
$ dvc stage add -n evaluate \
-d src/evaluate.py -d model/model.joblib \
-M metrics/accuracy.json \
--run \
python src/evaluate.py
Added stage 'evaluate' in 'dvc.yaml'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.yaml dvc.lock
To enable auto staging, run:
dvc config core.autostage true
### DVC
$ git add dvc.yaml dvc.lock
Instead of “outs”, this final stage uses the -M flag to denote a “metric”.
This type of flag can be used if floating-point or integer values that summarize model performance (e.g. accuracies, receiver operating characteristics, or area under the curve values) are saved in hierarchical files (JSON, YAML).
DVC can then read from these files to display model performances and comparisons:
### DVC
$ dvc metrics show
Path accuracy
metrics/accuracy.json 0.67934
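Because metrics files are plain hierarchical text files, their numeric values can be collected with very little code. The sketch below flattens a nested JSON document into dotted names, roughly the shape needed for a table like the one above. The helper `flatten_metrics` and the nested `auc` values are made up for illustration (only the accuracy value comes from the output above), and this is not DVC's implementation.

```python
import json
from typing import Dict, Union

Number = Union[int, float]


def flatten_metrics(metrics: dict, prefix: str = "") -> Dict[str, Number]:
    """Collect numeric leaves of a nested mapping under dotted names."""
    flat: Dict[str, Number] = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


# hypothetical metrics document; the "auc" entries are invented for the example
example = json.loads('{"accuracy": 0.67934, "auc": {"train": 0.9, "val": 0.7}}')
flat = flatten_metrics(example)
for name, value in flat.items():
    print(f"{name:<12}{value}")
```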
The complete pipeline now consists of preparation, training, and evaluation. It now needs to be committed, tagged, and pushed:
### DVC
$ git add --all
$ git commit -m "Add SGD pipeline"
$ dvc commit
$ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline c400246] Add SGD pipeline
5 files changed, 83 insertions(+)
create mode 100644 dvc.lock
create mode 100644 dvc.yaml
create mode 100644 metrics/accuracy.json
To /home/me/pushes/data-version-control
* [new branch] sgd-pipeline -> sgd-pipeline
branch 'sgd-pipeline' set up to track 'origin/sgd-pipeline'.
To /home/me/pushes/data-version-control
* [new tag] sgd -> sgd
3 files pushed
Model 2: random forest classifier
In order to explore a second model, a random forest classifier, we start with a new branch.
### DVC
$ git checkout -b random-forest
Switched to a new branch 'random-forest'
To switch from SGD to a random forest classifier, a few lines of code within train.py need to be changed.
The following here doc changes the script accordingly (the changes are the import of RandomForestClassifier and its use in main()):
### DVC
$ cat << EOT >| src/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT
Afterwards, since train.py is changed, dvc status will realize that one dependency of the pipeline stage “train” has changed:
### DVC
$ dvc status
train:
    changed deps:
        modified: src/train.py
Since the code change (stage 2) will likely affect the metric (stage 3), it is best to reproduce the whole chain.
You can reproduce a complete DVC pipeline file with the dvc repro <stagename> command:
### DVC
$ dvc repro evaluate
'data/raw/val.dvc' didn't change, skipping
'data/raw/train.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
DVC checks the dependencies of the pipeline and re-executes commands that need to be executed again.
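The decision which stages to re-execute can be sketched as a comparison of recorded and current dependency hashes, walking the stages in execution order. This is a simplified illustration with hypothetical names and placeholder hashes: it reruns every stage downstream of a rerun stage, whereas a real tool could skip a stage whose regenerated inputs turn out to be identical.

```python
from typing import Dict, List, Set, Tuple

StageSpec = Tuple[str, List[str], List[str]]  # (name, deps, outs)


def stages_to_rerun(
    pipeline: List[StageSpec],
    recorded: Dict[str, str],
    current: Dict[str, str],
) -> List[str]:
    """Return the stages whose deps changed directly or via an upstream rerun."""
    stale: List[str] = []
    regenerated: Set[str] = set()
    for name, deps, outs in pipeline:  # assumed to be in execution order
        changed = any(
            dep in regenerated or recorded.get(dep) != current.get(dep)
            for dep in deps
        )
        if changed:
            stale.append(name)
            regenerated.update(outs)
    return stale


pipeline = [
    ("prepare", ["src/prepare.py", "data/raw"], ["data/prepared/train.csv"]),
    ("train", ["src/train.py", "data/prepared/train.csv"], ["model/model.joblib"]),
    ("evaluate", ["src/evaluate.py", "model/model.joblib"], ["metrics/accuracy.json"]),
]
# placeholder hashes standing in for the lock file (recorded) and the workspace (current)
recorded = {
    "src/prepare.py": "h1", "data/raw": "h2", "data/prepared/train.csv": "h3",
    "src/train.py": "h4", "model/model.joblib": "h5", "src/evaluate.py": "h6",
}
current = dict(recorded, **{"src/train.py": "h4-modified"})

rerun = stages_to_rerun(pipeline, recorded, current)
print(rerun)
```

In this toy setup only the training script differs from its recorded hash, so the preparation stage is skipped while training and evaluation are selected, mirroring the transcript above.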
Compared to the branch sgd-pipeline, the workspace in the current random-forest branch contains a changed script (src/train.py), a changed trained classifier (model/model.joblib), and a changed metric (metrics/accuracy.json).
All these changes need to be committed, tagged, and pushed now.
### DVC
$ git add --all
$ git commit -m "Train Random Forest classifier"
$ dvc commit
$ git push --set-upstream origin random-forest
$ git tag -a randomforest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random-forest c565b15] Train Random Forest classifier
3 files changed, 11 insertions(+), 17 deletions(-)
To /home/me/pushes/data-version-control
* [new branch] random-forest -> random-forest
branch 'random-forest' set up to track 'origin/random-forest'.
To /home/me/pushes/data-version-control
* [new tag] randomforest -> randomforest
1 file pushed
At this point, you can compare metrics across multiple tags:
### DVC
$ dvc metrics show -T
Revision Path accuracy
workspace metrics/accuracy.json 0.79848
randomforest metrics/accuracy.json 0.79848
sgd metrics/accuracy.json 0.67934
Done!
5.1.4.2. DataLad workflow
For a direct comparison to DVC, we’ll try to mimic the DVC workflow as closely as it is possible with DataLad.
Model 1: SGD classifier
### DVC-DataLad
$ cd ../DVC-DataLad
As there is no workflow manager in DataLad[9], each script execution needs to be done separately.
To record the execution, get all relevant inputs, and recompute outputs at later points, we can set up a datalad run call[10].
Later on, we can rerun a range of datalad run calls at once to recompute the relevant aspects of the analysis.
To harmonize execution and to assist with reproducibility of the results, we generally recommend creating a container (Docker or Singularity), adding it to the repository as well, and using a datalad containers-run call[11] that can be rerun later, but we’ll stay basic here.
Let’s start with data preparation. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.
### DVC-DataLad
$ datalad run --message "Prepare the train and testing data" \
--input "data/raw/*" \
--output "data/prepared/*" \
python code/prepare.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/prepare.py]
save(ok): . (dataset)
The results of this computation are automatically saved and associated with their inputs and command execution.
This information isn’t stored in a separate file, but in the Git history, and saved with the commit message we have attached to the datalad run command.
To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could also use a range of other identifiers (such as commit hashes) to identify this computation.
As we have now set up our data and are ready for the analysis, we will name the first tag “ready-for-analysis”.
This can be done with git tag (manual), but also with datalad save (manual).
### DVC-DataLad
$ datalad save --version-tag ready-for-analysis
save(ok): . (dataset)
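Since the provenance of the preparation step lives in the commit message, it can be read back with generic tools. The sketch below extracts a machine-readable record from a commit message string. The message layout (a JSON document between two delimiter lines) and the selection of fields are assumptions based on how run records commonly appear in `git log` output; real records contain further fields such as a dataset ID, so inspect your own dataset's history before relying on this.

```python
import json

# assumed delimiter lines around the JSON record
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON record embedded between the two delimiter lines."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END)
    return json.loads(commit_message[start:end])


# illustrative commit message, modelled on the preparation step above
message = """[DATALAD RUNCMD] Prepare the train and testing data

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/prepare.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["data/raw/*"],
 "outputs": ["data/prepared/*"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["inputs"], record["outputs"])
```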
Let’s continue with training by running code/train.py on the prepared data.
### DVC-DataLad
$ datalad run --message "Train an SGD classifier" \
--input "data/prepared/*" \
--output "model/model.joblib" \
python code/train.py
[INFO] == Command start (output follows) =====
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.
As a last step, we evaluate the first model:
### DVC-DataLad
$ datalad run --message "Evaluate SGD classifier model" \
--input "model/model.joblib" \
--output "metrics/accuracy.json" \
python code/evaluate.py
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
At this point, the first accuracy metric is saved in metrics/accuracy.json.
Let’s add a tag to declare that it belongs to the SGD classifier.
### DVC-DataLad
$ datalad save --version-tag SGD
save(ok): . (dataset)
Let’s now change the training script to use a random forest classifier as before:
### DVC-DataLad
$ cat << EOT >| code/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier

def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list

def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list

def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped

def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels

def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")

if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT
We need to save this change:
$ datalad save -m "Switch to random forest classification" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)
Afterwards, we can rerun all run records between the tags ready-for-analysis and SGD using datalad rerun.
We could automatically compute this on a different branch if we wanted to by using the --branch option:
$ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready-for-analysis..SGD
[INFO] checkout commit 1b3b757;
[INFO] run commit 88a6e86; (Train an SGD clas...)
unlock(ok): model/model.joblib (file)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
[INFO] run commit 2d07713; (Evaluate SGD clas...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
get (notneeded: 3)
run (ok: 2)
save (ok: 2)
unlock (notneeded: 2, ok: 1)
Done!
The difference in accuracies between models could now, for example, be compared with a git diff:
$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index 74a1ee15..f6e7ded9 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7629911280101395}
\ No newline at end of file
+{"accuracy": 0.8124207858048162}
\ No newline at end of file
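The same comparison can be scripted once the metrics file of each revision is available as text, for example as obtained with `git show <revision>:metrics/accuracy.json`. The sketch below uses the two JSON documents from the diff above; the helper name `accuracy_table` is hypothetical.

```python
import json
from typing import Dict


def accuracy_table(metrics_by_revision: Dict[str, str]) -> Dict[str, float]:
    """Parse one metrics JSON document per revision and return revision -> accuracy."""
    return {
        revision: json.loads(text)["accuracy"]
        for revision, text in metrics_by_revision.items()
    }


# JSON contents as they appear in the diff above
table = accuracy_table({
    "SGD": '{"accuracy": 0.7629911280101395}',
    "randomforest": '{"accuracy": 0.8124207858048162}',
})
delta = table["randomforest"] - table["SGD"]

for revision, accuracy in table.items():
    print(f"{revision:<14}{accuracy:.5f}")
print(f"{'difference':<14}{delta:+.5f}")
```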
Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.
5.1.5. Summary
DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses.
DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines.
Despite their similar purpose, the look, feel, and functions of the two tools differ, and it is a personal decision which one you feel more comfortable with.
Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and .gitignore files.
But despite the reliance on Git, DVC barely integrates with Git – changes done to files in DVC cannot be detected by Git and vice versa, DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have distinct command functions and concepts that nevertheless share the same name.
Thus, DVC users need to master Git and DVC workflows and intertwine them correctly.
In return, DVC provides users with workflow management and reporting tuned to machine learning analyses. It also provides an approach to “data version control” that is somewhat more lightweight, and more uniform across operating and file systems, than the git-annex-based approach used by DataLad.
Footnotes