A typical collaborative data management workflow

This use case sketches the basics of a common, collaborative data management workflow for an analysis:

  1. A 3rd party dataset is obtained to serve as input for an analysis.

  2. Data processing is collaboratively performed by two colleagues.

  3. Upon completion, the results are published alongside the original data for further consumption.

The data types and methods mentioned in this use case belong to the scientific field of neuroimaging, but the basic workflow is domain-agnostic.

The Challenge

Bob is a new PhD student about to work on his first analysis. He wants to use an open dataset as the input for his analysis, so he asks a friend who has worked with the same dataset for the data and gets it on a hard drive. Later, he gets stuck with his analysis. Luckily, Alice, a senior grad student in the same lab, offers to help him. He sends his script to her via email and hopes she finds the solution to his problem. She responds a week later with the fixed script, but in the meantime Bob has already made some miscellaneous changes to his script as well. Identifying and integrating her fix into his slightly changed script takes him half a day. When he finally finishes his analysis, he wants to publish code and data online, but cannot find a way to share his data together with his code.

The DataLad Approach

Bob creates his analysis project as a DataLad dataset. Complying with the YODA principles, he keeps his scripts in a dedicated code/ directory, and clones the open dataset as a standalone DataLad subdataset within a dedicated subdirectory. To collaborate with the senior grad student Alice, he shares the dataset on the lab’s SSH server, and the two can collaborate on the version-controlled dataset almost in real time, without Bob having to spend much time integrating the fix that Alice provides. Bob executes his scripts with datalad run commands that capture all provenance of his results. After completion, he shares the whole project by creating a sibling on a webserver and pushing his entire dataset, including the input data, to this sibling, for everyone to access and recompute.

Step-by-Step

Bob creates a DataLad dataset for his analysis project to live in. Because he knows about the YODA principles, he configures the dataset to be a YODA dataset right at the time of creation:

$ datalad create -c yoda --description "my 1st phd project on work computer" myanalysis
[INFO] Creating a new annex repo at /home/me/usecases/collab/myanalysis 
[INFO] Running procedure cfg_yoda 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
create(ok): /home/me/usecases/collab/myanalysis (dataset)

After creation, there already is a code/ directory, and thanks to the yoda procedure, all of its contents will be version-controlled by Git instead of git-annex:

$ cd myanalysis
$ tree
.
├── CHANGELOG.md
├── code
│   └── README.md
└── README.md

1 directory, 3 files
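
If Bob wants to verify this configuration, he can ask Git for the effective annex.largefiles attribute of a file inside code/ (output not shown here; the exact attribute rules the procedure sets up may vary between DataLad versions). A value of nothing means that git-annex will not treat the file as a large file, so it is committed to Git directly:

$ git check-attr annex.largefiles -- code/README.md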

Bob knows that a DataLad dataset can contain other datasets. He also knows that, because any content of a dataset is tracked and its precise state is recorded, this is a powerful method to specify, and later resolve, data dependencies, and that including the dataset as a standalone data component will also make it easier to keep his analysis organized and to share it later. The dataset that Bob wants to work with is structural brain imaging data from the studyforrest project, a public data resource that the original authors share as a DataLad dataset through GitHub. This means that Bob can simply clone the relevant dataset from this service into his own dataset. To do that, he clones it as a subdataset into a directory he calls src/, as he wants to make it obvious which parts of his analysis steps and code require 3rd party data:

$ datalad clone -d . https://github.com/psychoinformatics-de/studyforrest-data-structural.git src/forrest_structural
[INFO] Cloning https://github.com/psychoinformatics-de/studyforrest-data-structural.git [1 other candidates] into '/home/me/usecases/collab/myanalysis/src/forrest_structural' 
[INFO]   Remote origin not usable by git-annex; setting annex-ignore 
add(ok): src/forrest_structural (file)
save(ok): . (dataset)
install(ok): src/forrest_structural (dataset)
action summary:
  add (ok: 1)
  install (ok: 1)
  save (ok: 1)
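
A quick way to confirm that the subdataset is now registered in the top-level dataset is to list the registered subdatasets (output not shown here):

$ datalad subdatasets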

Now that the subdataset is installed, Bob has access to the entire dataset content, and the precise version of the dataset is linked to his top-level dataset myanalysis. However, no data has actually been downloaded (yet). Bob very much appreciates that DataLad datasets primarily contain information on a dataset’s content and where to obtain it: the clone above completed quickly, and a dataset stays relatively lean even if it tracks several hundred GBs of data. He knows that the relevant data can be obtained on demand if he wraps the execution of his script into a datalad run command, and therefore does not need to care about getting the data yet. Instead, he focuses on writing his script, code/run_analysis.py, and saves his progress with frequent datalad save commands.
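
For illustration, such a script could be as simple as the following sketch, which lists the structural scans of one subject and records their sizes in results.txt (the specific analysis, subject, and summary are made up for this example):

$ cat << 'EOT' > code/run_analysis.py
# hypothetical minimal analysis: list the structural scans of sub-01
# and record their file sizes in results.txt
from pathlib import Path

anat_dir = Path("src/forrest_structural") / "sub-01" / "anat"

with open("results.txt", "w") as out:
    for scan in sorted(anat_dir.glob("*.nii.gz")):
        # datalad run retrieves the annexed file content before execution,
        # so the size on disk reflects the actual data
        out.write(f"{scan.name}: {scan.stat().st_size / 1e6:.1f} MB\n")
EOT

The first of these saves captures the new script: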

$ datalad save -m "First steps: start analysis script" code/run_analysis.py
add(ok): code/run_analysis.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Once Bob’s analysis script is ready, he can wrap its execution into datalad run. To ease execution, he first makes his script executable by adding a shebang line that specifies Python as the interpreter at the start of the script, and by giving it executable permissions:

$ chmod +x code/run_analysis.py
$ datalad save -m "make script executable"
add(ok): code/run_analysis.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Importantly, he specifies the necessary inputs in the datalad run call, such that DataLad can take care of data retrieval for him:

$ datalad run -m "run first part of analysis workflow" \
  --input "src/forrest_structural" \
  --output results.txt \
  "code/run_analysis.py"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
get(ok): src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz (file) [from mddatasrc...]
action summary:
  get (notneeded: 1, ok: 1)
  save (notneeded: 2)

This takes care of retrieving the data, running Bob’s script, and saving all outputs.
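
The exact command, its inputs, and its outputs are now recorded in the dataset’s history. If Bob wants to revisit this provenance later, he can inspect the most recent commit, whose message carries the machine-readable run record (output not shown here):

$ git log -n 1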

Some time later, Bob needs help with his analysis. He turns to his senior grad student Alice for help. Alice and Bob both work on the same computing server. Bob has told Alice in which directory he keeps his analysis dataset, and the directory is configured with permissions that allow read access for all lab members, so Alice can obtain Bob’s work directly from his home directory:

$ datalad clone /myanalysis bobs_analysis
[INFO] Cloning myanalysis into '/home/me/usecases/collab/bobs_analysis' 
install(ok): /home/me/usecases/collab/bobs_analysis (dataset)
$ cd bobs_analysis
# ... make contributions, and save them
$ [...]
$ datalad save -m "you're welcome, bob"
add(ok): code/run_analysis.py (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
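
Because Bob captured the computation with datalad run, Alice can also recompute the results in her clone, for example to check that her fix behaves as expected. A sketch of such a call, assuming that Bob’s run record sits in the parent commit of her fix:

$ datalad rerun HEAD~1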

Alice can get the studyforrest data Bob used as input as well as the result file, and, as shown above, she can even recompute his analysis. With her fix to Bob’s script saved, Bob wants to integrate her changes into his own dataset. To do so, he registers Alice’s dataset as a sibling:

# in Bob's home directory
$ datalad siblings add -s alice --url '../bobs_analysis'
.: alice(+) [../bobs_analysis (git)]

Afterwards, he can get her changes with a datalad update --merge command:

$ datalad update -s alice --merge
[INFO] Fetching updates for <Dataset path=/home/me/usecases/collab/myanalysis> 
[INFO] Applying updates to <Dataset path=/home/me/usecases/collab/myanalysis> 
update(ok): . (dataset)

Finally, when Bob is ready to share his results with the world or a remote collaborator, he makes his dataset available by uploading it to a webserver via SSH. Bob does so by creating a sibling for the dataset on the server, to which the dataset can be published and later also updated.

# this generates a sibling for the dataset and all subdatasets
$ datalad create-sibling --recursive -s public "$SERVER_URL"

Once the remote sibling is created and registered under the name “public”, Bob can publish his dataset to it.

$ datalad publish -r --to public .

This workflow allowed Bob to obtain data, collaborate with Alice, and publish or share his dataset with others with ease. He cannot wait for his next project, now that this workflow has made his life so simple.