5.1. Reproducible machine learning analyses: DataLad as DVC
Machine learning analyses are complex: Beyond data preparation and general scripting, they typically consist of training and optimizing several different machine learning models and comparing them based on performance metrics. This complexity can jeopardize reproducibility – it is hard to remember or figure out which model was trained on which version of what data, and which optimization turned out to be best. But just like any data analysis project, machine learning projects can become easier to understand and reproduce if they are intuitively structured, appropriately version controlled, and if analysis executions are captured with enough (ideally machine-readable and re-executable) provenance.
DataLad provides the functionality to achieve this, and previous sections have given some demonstrations on how to do it. But in the context of machine learning analyses, other domain-specific tools and workflows exist, too. One of the most well-known is DVC (Data Version Control), a “version control system for machine learning projects”. This section compares the two tools and demonstrates workflows for data versioning, data sharing, and analysis execution in the context of a machine learning project with DVC and DataLad. While they share a number of similarities and goals, their respective workflows are quite distinct.
The workflows showcased here are based on a DVC tutorial. This tutorial consists of the following steps:
A dataset with pictures of 10 classes of objects (Imagenette) is version controlled with DVC.
The data is pushed to a “storage remote” on a local path.
The data is analyzed using various ML models in DVC pipelines.
This handbook section demonstrates how DataLad could be used as an alternative to DVC. We demonstrate each step with DVC according to their tutorial, and then recreate a corresponding DataLad workflow. The usecase DataLad for reproducible machine-learning analyses demonstrates a similar analysis in a completely DataLad-centric fashion. If you want to, you can code along, or simply read through the presentation of DVC and DataLad commands. Some familiarity with DataLad can be helpful, but if you have never used DataLad, footnotes in each section can point you to relevant chapters for more insights on a command or concept. If you have never used DVC, its documentation (including the command reference) can answer further questions.
If you are not a Git user
DVC relies heavily on Git workflows.
Understanding the DVC workflows requires a solid understanding of branches, Git’s concepts of Working tree, Index (“Staging Area”), and Repository, and some basic Git commands such as add, commit, and checkout.
The Turing Way has an excellent chapter on version control with Git if you want to catch up on those basics first.
Terminology
Be mindful: DVC (like DataLad) comes with a range of commands and concepts that share their names with Git commands and concepts, but differ in functionality from their Git namesakes. Make sure to read the DVC documentation for each command to get more information on what it does.
5.1.1. Setup
The DVC tutorial comes with a pre-made repository that is structured for DVC machine learning analyses. If you want to code along, the repository needs to be forked (requires a GitHub account) and cloned from your own fork [1].
### DVC
# please clone this repository from your own fork when coding along
$ git clone https://github.com/datalad-handbook/data-version-control DVC
Cloning into 'DVC'...
The resulting Git repository is already pre-structured in a way that aids DVC ML analyses: It has the directories model and metrics, and a set of Python scripts for a machine learning analysis in src/.
### DVC
$ tree DVC
DVC
├── data
│   ├── prepared
│   └── raw
├── LICENSE
├── metrics
├── model
├── README.md
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py
6 directories, 5 files
For a comparison, we will recreate a similarly structured DataLad dataset.
For greater compliance with DataLad’s YODA principles, the dataset structure will differ marginally in that scripts will be kept in code/ instead of src/.
We create the dataset with two configurations, yoda and text2git [2].
### DVC-DataLad
$ datalad create -c text2git -c yoda DVC-DataLad
$ cd DVC-DataLad
$ mkdir -p data/{raw,prepared} model metrics
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [VIRTUALENV/bin/python /home/a...]
create(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
create (ok: 1)
run (ok: 2)
Afterwards, we make sure to get the same scripts.
### DVC-DataLad
# get the scripts
$ datalad download-url -m "download scripts for ML analysis" \
https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/{train,prepare,evaluate}.py \
-O 'code/'
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/train.py' into '/home/me/DVCvsDL/DVC-DataLad/code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/train.py (file)
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/prepare.py' into '/home/me/DVCvsDL/DVC-DataLad/code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/prepare.py (file)
[INFO] Downloading 'https://raw.githubusercontent.com/datalad-handbook/data-version-control/master/src/evaluate.py' into '/home/me/DVCvsDL/DVC-DataLad/code/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/code/evaluate.py (file)
add(ok): code/evaluate.py (file)
add(ok): code/prepare.py (file)
add(ok): code/train.py (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
download_url (ok: 3)
save (ok: 1)
Here’s the final directory structure:
### DVC-DataLad
$ tree
.
├── CHANGELOG.md
├── code
│   ├── evaluate.py
│   ├── prepare.py
│   ├── README.md
│   └── train.py
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── README.md
6 directories, 6 files
Required software for coding along
In order to code along, DVC, scikit-learn, scikit-image, pandas, and numpy are required. All tools are available via pip or conda. We recommend installing them in a virtual environment – the DVC tutorial has step-by-step instructions.
5.1.2. Version controlling data
In the first part of the tutorial, the directory tree will be populated with data that should be version controlled.
Although the implementation of version control for (large) data is very different between DataLad and DVC, the underlying concept is very similar: (Large) data is stored outside of Git – Git only tracks information on where this data can be found.
In DataLad datasets, (large) data is handled by git-annex. Data content is hashed and only the hash (represented as the original file name) is stored in Git [3]. Actual data is stored in the annex of the dataset, and annexed data can be transferred from and to a large number of storage solutions using either DataLad or git-annex commands. Information on where data is available from is stored in an internal representation of git-annex.
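To make the idea of hash-based storage more concrete, here is a minimal Python sketch of content-addressed storage. It is an illustration under simplifying assumptions, not git-annex's implementation: the function name annex_like_add, the objects directory, and the use of a plain SHA-256 digest as the key are made up for this example, and git-annex's real key format and directory layout are more involved.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def annex_like_add(file_path: Path, object_dir: Path) -> str:
    """Move a file into a hash-named object store and leave a symlink behind.

    Illustration only: key format and layout are hypothetical and much
    simpler than what git-annex really uses.
    """
    content = file_path.read_bytes()
    key = hashlib.sha256(content).hexdigest()
    object_dir.mkdir(parents=True, exist_ok=True)
    target = object_dir / key
    if not target.exists():
        # first time this content is seen: move it into the store
        file_path.replace(target)
    else:
        # identical content is already stored, keep only one copy
        file_path.unlink()
    # the original file name now points to the content-addressed object
    file_path.symlink_to(os.path.relpath(target, file_path.parent))
    return key


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    image = root / "image.jpeg"
    image.write_bytes(b"pretend this is image data")
    key = annex_like_add(image, root / "objects")
    link_is_symlink = image.is_symlink()
    content_via_link = image.read_bytes()
    print(key[:12], link_is_symlink)
```

In this sketch, only the small symlink would need to be tracked by Git, while the file content lives in the object store and can be looked up by its hash.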
In DVC repositories, (large) data is also supposed to be stored in external remotes such as Google Drive.
For internal representation of where files are available from, DVC uses one .dvc text file for each data file or directory given to DVC.
The .dvc files contain information on the path to the data in the repository, where the associated data file is available from, and a hash, and those files should be committed to Git.
5.1.2.1. DVC workflow
Prior to adding and version controlling data, a “DVC project” needs to be initialized in the Git repository:
### DVC
$ cd ../DVC
$ dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
This populates the repository with a range of staged files – most of them are internal directories and files for DVC’s configuration.
### DVC
$ git status
On branch master
Your branch is up to date with 'github/master'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvcignore
As they are only staged but not committed, we need to commit them (into Git):
### DVC
$ git commit -m "initialize dvc"
[master ae6d2e1] initialize dvc
3 files changed, 6 insertions(+)
create mode 100644 .dvc/.gitignore
create mode 100644 .dvc/config
create mode 100644 .dvcignore
The DVC project is now ready to version control data.
In the tutorial, data comes from the “Imagenette” dataset.
This data is available from an Amazon S3 bucket as a compressed tarball, but to keep the download fast, there is a smaller two-category version of it on the Open Science Framework (OSF).
We’ll download it and extract it into the data/raw/ directory of the repository.
### DVC
# download the data
$ wget -q https://osf.io/d6qbz/download -O imagenette2-160.tgz
# extract it
$ tar -xzf imagenette2-160.tgz
# move it into the directories
$ mv train data/raw/
$ mv val data/raw/
# remove the archive
$ rm -rf imagenette2-160.tgz
The data directories in data/raw are then version controlled with the dvc add command, which can place files or complete directories under version control by DVC.
### DVC
$ dvc add data/raw/train
$ dvc add data/raw/val
To track the changes with git, run:
git add data/raw/.gitignore data/raw/train.dvc
To enable auto staging, run:
dvc config core.autostage true
To track the changes with git, run:
git add data/raw/.gitignore data/raw/val.dvc
To enable auto staging, run:
dvc config core.autostage true
Here is what this command has accomplished:
The data files were copied into a cache in .dvc/cache (a non-human readable directory structure based on hashes similar to .git/annex/objects used by git-annex), data file names were added to a .gitignore [4] file to become invisible to Git, and two .dvc files, train.dvc and val.dvc, were created [5].
git status shows these changes:
### DVC
$ git status
On branch master
Your branch is ahead of 'github/master' by 1 commit.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: data/raw/.gitignore
Untracked files:
(use "git add <file>..." to include in what will be committed)
data/raw/train.dvc
data/raw/val.dvc
no changes added to commit (use "git add" and/or "git commit -a")
In order to complete the version control workflow, Git needs to know about the .dvc files, and forget about the data directories.
For this, the modified .gitignore file and the untracked .dvc files need to be added to Git:
### DVC
$ git add --all
Finally, we commit.
### DVC
$ git commit -m "control data with DVC"
[master c233d53] control data with DVC
3 files changed, 12 insertions(+)
create mode 100644 data/raw/train.dvc
create mode 100644 data/raw/val.dvc
The data is now version controlled with DVC.
How does DVC represent modifications to data?
When adding data directories, they (i.e., the complete directory) are hashed, and this hash is stored in the respective .dvc file.
If any file in the directory changes, this hash would change, and the dvc status command would report the directory to be “changed”.
To demonstrate this, we pretend to accidentally delete a single file:
# if one or more files in the val/ data changes, dvc status reports a change
$ dvc status
data/raw/val.dvc:
changed outs:
modified: data/raw/val
Important: Detecting a data modification requires the dvc status command – git status will not be able to detect changes in this directory, as it is git-ignored!
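The effect of hashing a complete directory can be sketched in a few lines of Python. This is a hypothetical scheme for illustration, assuming that a directory hash is derived from a sorted listing of per-file MD5 hashes; DVC's actual .dir hash is computed differently, and the function names and file names below are invented.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 hash of a single file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def directory_hash(directory: Path) -> str:
    """Combine the hashes of all files into one hash for the directory.

    Hypothetical scheme, not DVC's real algorithm.
    """
    entries = sorted(
        (p.relative_to(directory).as_posix(), file_md5(p))
        for p in directory.rglob("*")
        if p.is_file()
    )
    listing = "\n".join(f"{name} {digest}" for name, digest in entries)
    return hashlib.md5(listing.encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    val = Path(tmp) / "val"
    val.mkdir()
    for i in range(3):
        (val / f"img_{i}.JPEG").write_bytes(f"image {i}".encode())

    recorded = directory_hash(val)       # what a .dvc file would store
    unchanged = directory_hash(val) == recorded
    (val / "img_1.JPEG").unlink()        # "accidentally" delete one file
    changed = directory_hash(val) != recorded
    print("changed" if changed else "unchanged")
```

Because only one hash is recorded for the whole directory, such a scheme can tell that something changed, but not which file, without further bookkeeping.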
5.1.2.2. DataLad workflow
DataLad has means to get data or data archives from web sources and store this availability information within git-annex. This has several advantages: For one, the original OSF file URL is known and stored as a location to re-retrieve the data from. This enables reliable data access for yourself and others that you share the dataset with. Beyond this, the data is also automatically extracted and saved, and thus put under version control. Note that this strays slightly from DataLad’s YODA principles in a DataLad-centric workflow, where data should become a standalone, reusable dataset that would be linked as a subdataset into a study/analysis specific dataset. Here, we stick to the project organization of DVC though.
### DVC-DataLad
$ cd ../DVC-DataLad
$ datalad download-url \
--archive \
--message "Download Imagenette dataset" \
https://osf.io/d6qbz/download \
-O 'data/raw/'
[INFO] Downloading 'https://osf.io/d6qbz/download' into '/home/me/DVCvsDL/DVC-DataLad/data/raw/'
download_url(ok): /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz (file)
add(ok): data/raw/imagenette2-160.tgz (file)
save(ok): . (dataset)
[INFO] Adding content of the archive /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz into annex AnnexRepo(/home/me/DVCvsDL/DVC-DataLad)
[INFO] Initializing special remote datalad-archives
[INFO] Extracting archive
[INFO] Finished adding /home/me/DVCvsDL/DVC-DataLad/data/raw/imagenette2-160.tgz: Files processed: 2701, renamed: 2701, +annex: 2701
[INFO] Finished extraction
add-archive-content(ok): /home/me/DVCvsDL/DVC-DataLad (dataset)
action summary:
add (ok: 1)
add-archive-content (ok: 1)
download_url (ok: 1)
save (ok: 1)
At this point, the data is already version controlled [6], and we have the following directory tree:
$ tree
.
├── code
│   └── [...]
├── data
│   └── raw
│       ├── train
│       │   ├──[...]
│       └── val
│           ├── [...]
├── metrics
└── model
29 directories
How does DataLad represent modifications to data?
As DataLad always tracks files individually, datalad status (or, alternatively, git status or git annex status) will show modifications on the level of individual files:
$ datalad status
deleted: /home/me/DVCvsDL/DVC-DataLad/data/raw/val/n01440764/n01440764_12021.JPEG (symlink)
$ git status
On branch master
Your branch is ahead of 'origin/master' by 2 commits.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: data/raw/val/n01440764/n01440764_12021.JPEG
$ git annex status
D data/raw/val/n01440764/n01440764_12021.JPEG
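Per-file tracking can be pictured as comparing two snapshots that map every file to a hash of its content. The following Python sketch illustrates that concept only; it is not how DataLad, Git, or git-annex compute their status reports, and the functions snapshot and compare as well as the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(directory: Path) -> dict[str, str]:
    """Map every file (as a relative path) to a hash of its content."""
    return {
        p.relative_to(directory).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in directory.rglob("*")
        if p.is_file()
    }


def compare(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report changes per file, loosely like a status command would."""
    return {
        "deleted": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "modified": sorted(f for f in set(old) & set(new) if old[f] != new[f]),
    }


with tempfile.TemporaryDirectory() as tmp:
    val = Path(tmp) / "val"
    (val / "class_a").mkdir(parents=True)
    for name in ("one.JPEG", "two.JPEG", "three.JPEG"):
        (val / "class_a" / name).write_bytes(name.encode())

    before = snapshot(val)
    (val / "class_a" / "two.JPEG").unlink()              # delete one file
    (val / "class_a" / "three.JPEG").write_bytes(b"x")   # modify another
    report = compare(before, snapshot(val))
    print(report)
```

In contrast to a single directory-level hash, this kind of bookkeeping names the exact files that were deleted or modified.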
5.1.4. Data analysis
DVC is tuned towards machine learning analyses and comes with convenience commands and workflow management to build, compare, and reproduce machine learning pipelines. The tutorial therefore runs an SGD classifier and a random forest classifier on the data and compares the two models. For this, the pre-existing preparation, training, and evaluation scripts are used on the data we have downloaded and version controlled in the previous steps. DVC has means to transform such a structured ML analysis into a workflow, reproduce this workflow on demand, and compare it across different models or parametrizations.
In this general overview, we will only rush through the analysis. In short, it consists of three steps, each associated with a script:
src/prepare.py creates two .csv files with mappings of file names in train/ and val/ to image categories. Later, these files will be used to train and test the classifiers.
src/train.py loads the training CSV file prepared in the previous stage, trains a classifier on the training data, and saves the classifier into the model/ directory as model.joblib.
The final script, src/evaluate.py, is used to evaluate the trained classifier on the validation data and write the accuracy of the classification into the file metrics/accuracy.json.
There are more detailed insights and explanations of the actual analysis code in the tutorial if you’re interested in finding out more.
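As a rough idea of what the preparation step produces, here is a minimal Python sketch that writes a CSV file mapping image file names to labels. It is not the tutorial's prepare.py: the function write_label_csv is hypothetical, the label is simply assumed to be the name of the folder an image lives in, and the toy folder names are invented. Only the column names filename and label are taken from the training script shown later in this section.

```python
import csv
import tempfile
from pathlib import Path


def write_label_csv(image_root: Path, csv_path: Path) -> int:
    """Write one row per image: its path and the name of its parent folder."""
    rows = [
        (p.as_posix(), p.parent.name)
        for p in sorted(image_root.rglob("*.JPEG"))
    ]
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    # toy stand-in for data/raw/train with one image per category
    train = Path(tmp) / "train"
    for category in ("class_a", "class_b"):
        (train / category).mkdir(parents=True)
        (train / category / f"{category}_1.JPEG").write_bytes(b"")

    csv_path = Path(tmp) / "train.csv"
    n_rows = write_label_csv(train, csv_path)
    with csv_path.open(newline="") as f:
        table = list(csv.DictReader(f))
    print(n_rows, [row["label"] for row in table])
```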
For workflow management, DVC has the concept of a “DVC pipeline”. A pipeline consists of multiple stages and is executed using a dvc run command. Each stage has three components: “deps”, “outs”, and “command”. Each of the scripts in the repository will be represented by a stage in the DVC pipeline.
DataLad does not have any workflow management functions. The closest to it are datalad run to record any command execution or analysis, datalad rerun to recompute such an analysis, and datalad containers-run to perform and record a command execution or analysis inside of a tracked software container [10].
5.1.4.1. DVC workflow
Model 1: SGD classifier
Each model will be analyzed in a different branch of the repository. Therefore, we start by creating a new branch.
### DVC
$ cd ../DVC
$ git checkout -b sgd-pipeline
Switched to a new branch 'sgd-pipeline'
The first stage in the pipeline is data preparation (performed by the script prepare.py).
The following command sets up the stage:
### DVC
$ dvc run -n prepare \
-d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
python src/prepare.py
Running stage 'prepare':
> python src/prepare.py
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock data/prepared/.gitignore dvc.yaml
To enable auto staging, run:
dvc config core.autostage true
The -n parameter gives the stage a name, the -d parameter passes the dependencies – the script and the raw data – to the command, and the -o parameter defines the outputs of the command – the CSV files that prepare.py will create.
python src/prepare.py is the command that will be executed in the stage.
The resulting changes can be added to Git:
### DVC
$ git add dvc.yaml data/prepared/.gitignore dvc.lock
The dvc run call executes the stage's command, and also creates two YAML files, dvc.yaml and dvc.lock.
They contain the pipeline description, which currently consists of the first stage:
### DVC
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv
The lock file tracks the versions of all relevant files via MD5 hashes. This allows DVC to track all dependencies and outputs and detect if any of these files change.
### DVC
$ cat dvc.lock
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: data/raw
      md5: d39907b06425b95b440a692eb1af5ba4.dir
      size: 16711927
      nfiles: 2704
    - path: src/prepare.py
      md5: ef804f358e00edcfe52c865b471f8f55
      size: 1231
    outs:
    - path: data/prepared/test.csv
      md5: 0b90b0e8d6c62a0dc38b1ab63de6e06d
      size: 62023
    - path: data/prepared/train.csv
      md5: 360a73acb93776091959dda916331021
      size: 155128
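The principle behind such a lock file, recording hashes and comparing them against the current state of the files, can be sketched in Python as follows. This is a simplified illustration and not DVC's code: the functions record_lock and changed_files are hypothetical, and the lock is kept as a plain dictionary instead of a YAML file.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """MD5 hash of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def record_lock(paths: list[Path]) -> dict[str, str]:
    """Record the current MD5 hash of every tracked file."""
    return {str(p): md5_of(p) for p in paths}


def changed_files(lock: dict[str, str]) -> list[str]:
    """Return tracked files whose content no longer matches the lock."""
    return [
        name for name, digest in lock.items()
        if not Path(name).exists() or md5_of(Path(name)) != digest
    ]


with tempfile.TemporaryDirectory() as tmp:
    # toy stand-ins for a script and a data file
    script = Path(tmp) / "train.py"
    data = Path(tmp) / "train.csv"
    script.write_text("print('sgd')\n")
    data.write_text("filename,label\n")

    lock = record_lock([script, data])
    before = changed_files(lock)           # nothing has changed yet
    script.write_text("print('random forest')\n")
    after = changed_files(lock)            # now the script is reported
    print(len(before), len(after))
```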
The command also added the results from the stage, train.csv and test.csv, into a .gitignore file.
The next pipeline stage is training, in which train.py will be used to train a classifier on the data.
Initially, this classifier is an SGD classifier.
The following command sets it up:
$ dvc run -n train \
-d src/train.py -d data/prepared/train.csv \
-o model/model.joblib \
python src/train.py
Running stage 'train':
> python src/train.py
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:704: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock dvc.yaml model/.gitignore
To enable auto staging, run:
dvc config core.autostage true
Afterwards, train.py has been executed, and the pipeline files have been updated with a second stage.
The resulting changes can be added to Git:
### DVC
$ git add dvc.yaml model/.gitignore dvc.lock
Finally, we create the last stage, model evaluation. The following command sets it up:
$ dvc run -n evaluate \
-d src/evaluate.py -d model/model.joblib \
-M metrics/accuracy.json \
python src/evaluate.py
Running stage 'evaluate':
> python src/evaluate.py
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock dvc.yaml
To enable auto staging, run:
dvc config core.autostage true
### DVC
$ git add dvc.yaml dvc.lock
Instead of “outs”, this final stage uses the -M flag to denote a “metric”.
This type of flag can be used if floating-point or integer values that summarize model performance (e.g. accuracies, receiver operating characteristics, or area under the curve values) are saved in hierarchical files (JSON, YAML).
DVC can then read from these files to display model performances and comparisons:
### DVC
$ dvc metrics show
Path accuracy
metrics/accuracy.json 0.78074
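To illustrate what a metric file holds and how a tool can read it back, here is a small Python sketch. It is not the tutorial's evaluate.py and not DVC's implementation: the helper functions and the toy labels are made up, and only the JSON layout with a single accuracy key follows the file shown in this section.

```python
import json
import tempfile
from pathlib import Path


def accuracy(true_labels: list[str], predicted: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted))
    return hits / len(true_labels)


def write_metric(path: Path, value: float) -> None:
    """Store the accuracy in a small JSON file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"accuracy": value}))


def show_metrics(path: Path) -> str:
    """Render every metric in the file as a 'file  name  value' line."""
    metrics = json.loads(path.read_text())
    return "\n".join(f"{path.name}  {k}  {v:.5f}" for k, v in metrics.items())


with tempfile.TemporaryDirectory() as tmp:
    metric_file = Path(tmp) / "metrics" / "accuracy.json"
    # made-up labels: three of four predictions are correct
    acc = accuracy(["fish", "fish", "dog", "dog"], ["fish", "dog", "dog", "dog"])
    write_metric(metric_file, acc)
    stored = json.loads(metric_file.read_text())
    shown = show_metrics(metric_file)
    print(shown)
```

Because the metric is stored as structured text, it can be versioned like any other small file and compared across revisions.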
The complete pipeline now consists of preparation, training, and evaluation. It now needs to be committed, tagged, and pushed:
### DVC
$ git add --all
$ git commit -m "Add SGD pipeline"
$ dvc commit
$ git push --set-upstream origin sgd-pipeline
$ git tag -a sgd -m "Trained SGD as DVC pipeline."
$ git push origin --tags
$ dvc push
[sgd-pipeline 75992db] Add SGD pipeline
5 files changed, 73 insertions(+)
create mode 100644 dvc.lock
create mode 100644 dvc.yaml
create mode 100644 metrics/accuracy.json
To /home/me/pushes/data-version-control
* [new branch] sgd-pipeline -> sgd-pipeline
branch 'sgd-pipeline' set up to track 'origin/sgd-pipeline'.
To /home/me/pushes/data-version-control
* [new tag] sgd -> sgd
3 files pushed
Model 2: random forest classifier
In order to explore a second model, a random forest classifier, we start with a new branch.
### DVC
$ git checkout -b random-forest
Switched to a new branch 'random-forest'
To switch from SGD to a random forest classifier, a few lines of code within train.py need to be changed.
The following heredoc overwrites the script accordingly; the changes are the import of RandomForestClassifier and the lines in main() that create and fit the classifier:
### DVC
$ cat << EOT >| src/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier


def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list


def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list


def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped


def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels


def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")


if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT
Afterwards, since train.py is changed, dvc status will realize that one dependency of the pipeline stage “train” has changed:
### DVC
$ dvc status
train:
changed deps:
modified: src/train.py
Since the code change (stage 2) will likely affect the metric (stage 3), it's best to reproduce the whole chain. You can reproduce a complete DVC pipeline file with the dvc repro <stagename> command:
### DVC
$ dvc repro evaluate
'data/raw/val.dvc' didn't change, skipping
'data/raw/train.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
DVC checks the dependencies of the pipeline and re-executes commands that need to be executed again.
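The logic of deciding which stages need to run again can be sketched as follows. This is a simplified illustration rather than DVC's algorithm: it assumes that the stages are listed in execution order and that every re-executed stage produces changed outputs. The Stage class and the function stages_to_rerun are hypothetical; the dependencies and outputs are copied from the dvc run calls above, with the metric file treated as an ordinary output.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    deps: list[str]
    outs: list[str] = field(default_factory=list)


# deps and outs as given to the dvc run calls in this section
PIPELINE = [
    Stage("prepare", ["src/prepare.py", "data/raw"],
          ["data/prepared/train.csv", "data/prepared/test.csv"]),
    Stage("train", ["src/train.py", "data/prepared/train.csv"],
          ["model/model.joblib"]),
    Stage("evaluate", ["src/evaluate.py", "model/model.joblib"],
          ["metrics/accuracy.json"]),
]


def stages_to_rerun(pipeline: list[Stage], modified: set[str]) -> list[str]:
    """Walk the stages in order and collect those affected by a change.

    Simplification: a re-executed stage is assumed to change all of its
    outputs, so every stage depending on them is re-executed, too.
    """
    dirty = set(modified)
    rerun = []
    for stage in pipeline:
        if dirty.intersection(stage.deps):
            rerun.append(stage.name)
            dirty.update(stage.outs)
    return rerun


print(stages_to_rerun(PIPELINE, {"src/train.py"}))  # ['train', 'evaluate']
```

With only src/train.py modified, the sketch skips the prepare stage and selects train and evaluate, which mirrors the dvc repro output shown above.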
Compared to the branch sgd-pipeline, the workspace in the current random-forest branch contains a changed script (src/train.py), a changed trained classifier (model/model.joblib), and a changed metric (metrics/accuracy.json).
All these changes need to be committed, tagged, and pushed now.
### DVC
$ git add --all
$ git commit -m "Train Random Forest classifier"
$ dvc commit
$ git push --set-upstream origin random-forest
$ git tag -a randomforest -m "Random Forest classifier with 80.99% accuracy."
$ git push origin --tags
$ dvc push
[random-forest 8509e4d] Train Random Forest classifier
3 files changed, 11 insertions(+), 17 deletions(-)
To /home/me/pushes/data-version-control
* [new branch] random-forest -> random-forest
branch 'random-forest' set up to track 'origin/random-forest'.
To /home/me/pushes/data-version-control
[new tag] randomforest -> randomforest
1 file pushed
At this point, you can compare metrics across multiple tags:
### DVC
$ dvc metrics show -T
Revision Path accuracy
workspace metrics/accuracy.json 0.81115
randomforest metrics/accuracy.json 0.81115
sgd metrics/accuracy.json 0.6654
Done!
5.1.4.2. DataLad workflow
For a direct comparison to DVC, we’ll try to mimic the DVC workflow as closely as possible with DataLad.
Model 1: SGD classifier
### DVC-DataLad
$ cd ../DVC-DataLad
As there is no workflow manager in DataLad [9], each script execution needs to be done separately. To record the execution, get all relevant inputs, and recompute outputs at later points, we can set up a datalad run call [10]. Later on, we can rerun a range of datalad run calls at once to recompute the relevant aspects of the analysis. To harmonize execution and to assist with reproducibility of the results, we generally recommend creating a container (Docker or Singularity), adding it to the repository as well, and using a datalad containers-run call [11] that can later be rerun, but we’ll stay basic here.
Let’s start with data preparation. Instead of creating a pipeline stage and giving it a name, we attach a meaningful commit message.
### DVC-DataLad
$ datalad run --message "Prepare the train and testing data" \
--input "data/raw/*" \
--output "data/prepared/*" \
python code/prepare.py
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/prepare.py]
save(ok): . (dataset)
The results of this computation are automatically saved and associated with their inputs and command execution. This information isn’t stored in a separate file, but in the Git history, and saved with the commit message we have attached to the run command.
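Since the provenance lives in the commit message, it can be read back programmatically. The sketch below assumes that a run record is a JSON object enclosed between two marker lines in the commit message; the sample message is hand-written for illustration (it is not copied from this dataset's history and omits fields a real record may contain), and extract_run_record is a hypothetical helper, not a DataLad function.

```python
import json

# marker lines assumed to enclose the JSON run record
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"

# hand-written sample, modeled on the datalad run call above
SAMPLE_MESSAGE = """[DATALAD RUNCMD] Prepare the train and testing data

=== Do not change lines below ===
{
 "cmd": "python code/prepare.py",
 "exit": 0,
 "inputs": ["data/raw/*"],
 "outputs": ["data/prepared/*"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded in a commit message."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END)
    return json.loads(commit_message[start:end])


record = extract_run_record(SAMPLE_MESSAGE)
print(record["cmd"], record["inputs"], record["outputs"])
```

In a real dataset, the commit message could be obtained with git log or git show before being passed to such a helper.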
To stay close to the DVC tutorial, we will also work with tags to identify analysis versions, but DataLad could also use a range of other identifiers, for example commit hashes, to identify this computation. As we have now set up our data and are ready for the analysis, we will name the first tag “ready-for-analysis”. This can be done with git tag, but also with datalad save.
### DVC-DataLad
$ datalad save --version-tag ready-for-analysis
save(ok): . (dataset)
Let’s continue with training by running code/train.py on the prepared data.
### DVC-DataLad
$ datalad run --message "Train an SGD classifier" \
--input "data/prepared/*" \
--output "model/model.joblib" \
python code/train.py
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
VIRTUALENV/lib/python3.8/site-packages/sklearn/linear_model/_stochastic_gradient.py:702: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
As before, the results of this computation are saved, and the Git history connects computation, results, and inputs.
As a last step, we evaluate the first model:
### DVC-DataLad
$ datalad run --message "Evaluate SGD classifier model" \
--input "model/model.joblib" \
--output "metrics/accuracy.json" \
python code/evaluate.py
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
At this point, the first accuracy metric is saved in metrics/accuracy.json.
Let’s add a tag to declare that it belongs to the SGD classifier.
### DVC-DataLad
$ datalad save --version-tag SGD
save(ok): . (dataset)
Let’s now change the training script to use a random forest classifier as before:
### DVC-DataLad
$ cat << EOT >| code/train.py
from joblib import dump
from pathlib import Path

import numpy as np
import pandas as pd
from skimage.io import imread_collection
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier


def load_images(data_frame, column_name):
    filelist = data_frame[column_name].to_list()
    image_list = imread_collection(filelist)
    return image_list


def load_labels(data_frame, column_name):
    label_list = data_frame[column_name].to_list()
    return label_list


def preprocess(image):
    resized = resize(image, (100, 100, 3))
    reshaped = resized.reshape((1, 30000))
    return reshaped


def load_data(data_path):
    df = pd.read_csv(data_path)
    labels = load_labels(data_frame=df, column_name="label")
    raw_images = load_images(data_frame=df, column_name="filename")
    processed_images = [preprocess(image) for image in raw_images]
    data = np.concatenate(processed_images, axis=0)
    return data, labels


def main(repo_path):
    train_csv_path = repo_path / "data/prepared/train.csv"
    train_data, labels = load_data(train_csv_path)
    rf = RandomForestClassifier()
    trained_model = rf.fit(train_data, labels)
    dump(trained_model, repo_path / "model/model.joblib")


if __name__ == "__main__":
    repo_path = Path(__file__).parent.parent
    main(repo_path)
EOT
We need to save this change:
$ datalad save -m "Switch to random forest classification" code/train.py
add(ok): code/train.py (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
Afterwards, we can rerun all run records between the tags ready-for-analysis and SGD using datalad rerun.
We could automatically compute this on a different branch if we wanted to by using the --branch option:
$ datalad rerun --branch="randomforest" -m "Recompute classification with random forest classifier" ready-for-analysis..SGD
[INFO] checkout commit fa4f16d;
[INFO] run commit 1e265e4; (Train an SGD clas...)
[INFO] Making sure inputs are available (this may take some time)
[INFO] Unlocking files
unlock(ok): model/model.joblib (file)
[INFO] Recording unlocked state in git
[INFO] Completed unlocking files
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/train.py]
add(ok): model/model.joblib (file)
save(ok): . (dataset)
[INFO] run commit 812bbc0; (Evaluate SGD clas...)
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/DVCvsDL/DVC-DataLad (dataset) [python code/evaluate.py]
add(ok): code/__pycache__/train.cpython-38.pyc (file)
add(ok): metrics/accuracy.json (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
get (notneeded: 3)
run (ok: 2)
save (ok: 2)
unlock (notneeded: 2, ok: 1)
Done!
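To make the notion of a "run record" more concrete: datalad run stores the executed command as a small JSON document inside the commit message, and datalad rerun reads it back from there. The following is a minimal sketch of that lookup, assuming the marker lines that datalad run writes around the JSON block. The function name extract_run_record and the example message (including its inputs and outputs) are hypothetical and for illustration only; this is not DataLad's own implementation.

```python
import json

# Marker lines that surround the JSON run record in a commit message
# (assumed to match what `datalad run` writes).
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message):
    """Return the run record embedded in a commit message, or None."""
    start = commit_message.find(START)
    if start == -1:
        return None
    end = commit_message.find(END, start + len(START))
    if end == -1:
        return None
    # Everything between the two markers is the JSON record.
    return json.loads(commit_message[start + len(START):end])


# Hypothetical commit message, made up for illustration.
example_message = """[DATALAD RUNCMD] Example command

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/train.py",
 "exit": 0,
 "inputs": ["data/raw/train"],
 "outputs": ["model/model.joblib"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

if __name__ == "__main__":
    record = extract_run_record(example_message)
    print(record["cmd"], "->", record["outputs"])
```

Commits without such a block, like the plain datalad save of code/train.py above, yield None and are not re-executed. The messages of all commits in a range can be listed with git log --format=%B ready-for-analysis..SGD.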
The difference in accuracies between models could now, for example, be compared with a git diff:
$ git diff SGD -- metrics/accuracy.json
diff --git a/metrics/accuracy.json b/metrics/accuracy.json
index 917ce2ce..9279f676 100644
--- a/metrics/accuracy.json
+++ b/metrics/accuracy.json
@@ -1 +1 @@
-{"accuracy": 0.7084917617237009}
\ No newline at end of file
+{"accuracy": 0.8149556400506971}
\ No newline at end of file
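If the comparison should feed into a script rather than be read off a diff, the metrics file can also be loaded at two revisions and compared numerically. The sketch below assumes that metrics/accuracy.json is stored in Git rather than in the annex (as it is here, since the diff above shows its content), and that git is available on the PATH. The helper names read_metric_at and metric_change are hypothetical.

```python
import json
import subprocess


def read_metric_at(ref, path="metrics/accuracy.json", key="accuracy", repo="."):
    """Return one metric value as recorded at a given Git ref (tag, branch, commit)."""
    # `git show <ref>:<path>` prints the file content stored at that revision.
    content = subprocess.run(
        ["git", "-C", str(repo), "show", f"{ref}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(content)[key]


def metric_change(old_json, new_json, key="accuracy"):
    """Return (old, new, new - old) for one key of two JSON documents."""
    old = json.loads(old_json)[key]
    new = json.loads(new_json)[key]
    return old, new, new - old


if __name__ == "__main__":
    # The two file versions shown in the diff above.
    old, new, delta = metric_change(
        '{"accuracy": 0.7084917617237009}',
        '{"accuracy": 0.8149556400506971}',
    )
    print(f"SGD: {old:.4f}, random forest: {new:.4f}, change: {delta:+.4f}")
```

Inside the dataset, read_metric_at("SGD") and read_metric_at("randomforest") would fetch the two values directly from the tag and the branch used in this section.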
Even though there is no one-to-one correspondence between a DVC and a DataLad workflow, a DVC workflow can also be implemented with DataLad.
5.1.5. Summary
DataLad and DVC aim to solve the same problems: version controlling data, sharing data, and enabling reproducible analyses.
DataLad provides generic solutions to these issues, while DVC is tuned for machine-learning pipelines.
Despite their similar purpose, the look, feel, and functions of the two tools are different, and it is a personal decision which one you feel more comfortable with.
Using DVC requires solid knowledge of Git, because DVC workflows heavily rely on effective Git practices, such as branching, tags, and .gitignore files.
But despite the reliance on Git, DVC barely integrates with it: changes done to files in DVC cannot be detected by Git and vice versa, the DVC and Git aspects of a repository have to be handled in parallel by the user, and DVC and Git have distinct command functions and concepts that nevertheless share the same names.
Thus, DVC users need to master Git and DVC workflows and intertwine them correctly.
In return, DVC provides users with workflow management and reporting tuned to machine learning analyses. Its approach to “data version control” is also somewhat more lightweight, and more uniform across operating systems and file systems, than the git-annex-based approach used by DataLad.
Footnotes
- 1
Instructions on forking and cloning the repo are in the README of the repository: github.com/realpython/data-version-control.
- 2
The two procedures provide the dataset with useful structures and configurations for its purpose: yoda creates a dataset structure with a code directory and makes sure that everything kept in code will be committed to Git (thus allowing for direct sharing of code). text2git makes sure that any other text file in the dataset will be stored in Git as well. The sections Data safety and The YODA procedure explain the two configurations in detail.
- 3
To re-read about how git-annex handles versioning of (large) files, go back to section Data integrity.
- 4
You can read more about .gitignore files in the section How to hide content from DataLad.
- 5
If you are curious about why data is duplicated in a cache or why the paths to the data are placed into a .gitignore file, this section in the DVC tutorial has more insights on the internals of this process.
- 6
The sections Populate a dataset and Modify content introduce the concepts of saving and modifying files in DataLad datasets.
- 7
A similar procedure for sharing data on a local file system for DataLad is shown in the chapter Looking without touching.
- 8
In DataLad datasets, data duplication is usually avoided as git-annex uses symlinks. Only on file systems that lack support for symlinks or for removing write permissions from files (so-called “crippled file systems” such as /sdcard on Android, FAT or NTFS) git-annex needs to duplicate data.
- 9
yet.
- 10
To re-read about datalad run and datalad rerun, check out the chapter DataLad, Run!.
- 11
To re-read about joining code, execution, data, results and software environment in a re-executable record with datalad container-run, check out the section Computational reproducibility with software containers.