4.1. Transitioning existing projects into DataLad

DataLad offers exciting and useful features that make it worthwhile to transition existing projects into DataLad datasets – and in most cases, transforming your project into one or many DataLad datasets is easy. This section outlines the basic steps to do so, and offers examples as well as advice and caveats.

4.1.1. Important: Your safety net

Chances are high that you are reading this section of the handbook after stumbling across DataLad, being intrigued by its features, and now looking for a quick way to get going. If you haven’t read much of the handbook but are planning to DataLad-ify the gigantic project you have been working on for the past months or years, this first paragraph is a warning, a piece of advice, and a call for safety nets to prevent the unexpected misery that can arise from transitioning to a new tool. While DataLad can do amazing things, you shouldn’t blindly trust it to do everything you think it can or should do – gain some familiarity with it first.

If you’re a DataLad novice, we highly recommend that you read through the Basics part of the handbook. This part of the book provides you with a solid understanding of DataLad’s functionality and a playground to experience working with DataLad. If you’re really pressed for time because your dog is sick, your toddler keeps eating your papers and your boss is behind you with a whip, the findoutmore below summarizes the most important sections from the Basics for you to read:

The Basics for the impatient

To get a general idea about DataLad, please read sections A brief overview of DataLad and What you really need to know from the introduction (reading time: 15 min).

To gain a good understanding of some important parts of DataLad, please read the chapters DataLad datasets, DataLad, run!, and Under the hood: git-annex (reading time: 60 minutes).

To become confident in using DataLad, the sections How to get help and Miscellaneous file system operations can be very useful. Depending on your aim, Collaboration (for collaborative workflows), Third party infrastructure (for data sharing), or Make the most out of datasets (for data analysis) may contain the relevant background for you.

Prior to transforming your project, regardless of how advanced a user you are, we recommend creating a copy of it. We don’t believe there is much that can go wrong from the software side of things, but data is precious and backups are a necessity, so better safe than sorry.
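A plain copy to a separate location is already a sufficient safety net. The paths below are placeholders for your own project:

# hypothetical paths – substitute the location of your actual project
$ cp -r ~/projects/myproject ~/projects/myproject-backup
# or preserve permissions and timestamps while copying to another drive
$ rsync -a ~/projects/myproject /mnt/backup/myproject-backup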

4.1.2. Step 1: Planning

The first step to DataLad-ify your project is to turn it into one or several nested datasets. Whether you turn a project into a single dataset or several depends on the current size of your project and how much you expect it to grow over time, but also on its contents. You can find guidance on this in the paragraphs below.

The next step is to save dataset contents. You should take your time and invest thought into this, as it determines the look and feel of your dataset, in particular the decision on which contents should be saved into Git and which into git-annex. The section Data integrity should give you the necessary background information, and the chapter Tuning datasets to your needs the relevant skills to configure your dataset appropriately. You should consider the size, file type, and modification frequency of files in your decision, as well as potential plans to share a dataset via a particular third party infrastructure.

4.1.3. Step 2: Dataset creation

Transforming a directory into a dataset is done with datalad create --force (manual). The -f/--force option enforces dataset creation in non-empty directories. Consider running procedures with -c <procedure-name> to apply configurations that suit your use case.
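For example, to turn an existing project directory into a dataset while keeping text files in Git via the text2git procedure (a common, but not universally appropriate, choice – the directory name is a placeholder):

# run from within your existing project directory
$ cd /path/to/myproject
$ datalad create --force -c text2git .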

What if my directory is already a Git repository?

If you want to transform a Git repository to a DataLad dataset, a datalad create -f is the way to go, too, and completely safe. Your Git history will stay intact and will not be tampered with.

If you want to transform a series of nested directories into nested datasets, continue with datalad create -f commands in all further subdirectories.

One or many datasets?

In deciding how many datasets you need, try to follow the benchmarks in chapter Go big or go home and the YODA principles in section YODA: Best practices for data analyses in a dataset. Two simple questions can help you make a decision:

  1. Do you have independently reusable components in your directory, such as data from several studies, or data and code/results? If yes, make each individual component a dataset.

  2. How large is each individual component? If it exceeds 100k files, split it up into smaller datasets. The decision on where to place subdataset boundaries can be guided by the existing directory structure or by common access patterns, for example, based on data type (raw, processed, …) or subject association. One straightforward organization may be a top-level superdataset and subject-specific subdatasets, mimicking the structure chosen in the use case Scaling up: Managing 80TB and 15 million files from the HCP release.

You can automate this with bash loops, if you want.

Example bash loops

Consider a directory structure that follows a naming standard such as BIDS:

# create a mock-directory structure:
$ mkdir -p study/sub-0{1,2,3,4,5}/{anat,func}
$ tree study
study
  ├── sub-01
  │   ├── anat
  │   └── func
  ├── sub-02
  │   ├── anat
  │   └── func
  ├── sub-03
  │   ├── anat
  │   └── func
  ├── sub-04
  │   ├── anat
  │   └── func
  └── sub-05
      ├── anat
      └── func

Consider further that you have transformed the toplevel study directory into a dataset and now want to transform all sub-* directories into further subdatasets, registered in study. Here is a line that would do this for the example above:

$ for dir in study/sub-0{1,2,3,4,5}; do datalad -C $dir create -d^. --force .; done

4.1.4. Step 3: Saving dataset contents

Any existing content in your newly created dataset(s) still needs to be saved into its respective dataset at this point (unless it was already under version control with Git). This can be done with the datalad save (manual) command – either “in one go” using a plain datalad save (which saves all untracked files and modifications to a dataset – by default into the dataset annex), or step-by-step by attaching paths to the save command. Make sure to run datalad status (manual) frequently to check what has already been saved and what is still untracked or modified.
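Here is a sketch of both approaches; the paths and commit messages are placeholders for your own project:

# check what is untracked or modified
$ datalad status
# save everything in one go
$ datalad save -m "Add all existing project content"
# ... or save selected paths step by step, with dedicated commit messages
$ datalad save -m "Add analysis code" code/
$ datalad save -m "Add raw data" data/raw/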

Save things to Git or to git-annex?

By default, all dataset contents are saved into git-annex. Depending on your data and use case, this may or may not be useful for all files. Here are a few things to keep in mind:

  • Large files, in particular binary files, should almost always go into git-annex. If you have a pure data dataset made up of large files, put its contents into the dataset annex.

  • Small files, especially text files that undergo frequent modifications (e.g., code, manuscripts, notes), are best put under version control with Git.

  • If you plan to publish a dataset to a repository hosting site without annex support such as GitHub or GitLab, and do not intend to set up third party storage for annexed contents, be aware that only contents placed in Git will be available to others after cloning your repository. At the same time, be mindful of file size limits the services impose. The largest file size GitHub allows is 100MB – a dataset with files exceeding 100MB in size in Git will be rejected by GitHub. GIN is an alternative hosting service with annex support, and the Open Science Framework (OSF) may also be a suitable option to share datasets including their annexed files.

You can find guidance on how to create configurations for your dataset (which need to be in place and saved prior to saving contents!) in the chapter Tuning datasets to your needs, in particular section More on DIY configurations.
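As one illustration (not a recommendation for every project), the following sketch writes a .gitattributes rule set that annexes everything larger than 10MB and keeps Markdown files in Git; the patterns and the threshold are assumptions you should adapt to your own data:

# run in the dataset root, before saving other contents
$ cat >> .gitattributes << EOF
* annex.largefiles=(largerthan=10mb)
*.md annex.largefiles=nothing
EOF
$ datalad save -m "Configure what goes into Git versus git-annex" .gitattributes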

Create desired subdatasets first

Be mindful during saving if a directory should hold further, not yet created datasets down its hierarchy, as a plain datalad save will save all of its files and directories into the current dataset! It is best to first create all subdatasets, and only then save their contents.

If you are operating in a hierarchy of datasets, running a recursive save from the top-most dataset (datalad save -r) will save you time: All contents are saved to their respective datasets, all subdatasets are registered to their respective superdatasets.
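For instance, starting from the top of the study hierarchy sketched above, and assuming study is the top-most dataset (the commit message is just an example):

$ datalad -C study save -r -m "Add existing contents to all datasets"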

4.1.5. Step 4: Rerunning analyses reproducibly

If you are transforming a complete data analysis into a dataset, you may also want to rerun any computation with DataLad’s run commands. You can compose any datalad run (manual) or datalad containers-run (manual)[1] command to recreate and capture your previous analysis. Make sure to specify your previous results as --output in order to unlock them[2].
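A minimal sketch, assuming a hypothetical Python analysis script and directory layout – adapt the input, output, and command to your actual analysis:

# script, paths, and message are placeholders
$ datalad run -m "Rerun group analysis reproducibly" \
    --input "data/raw" \
    --output "results" \
    "python code/analysis.py"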

4.1.6. Summary

Existing projects and analyses can be DataLad-ified with a few standard commands. Be mindful about dataset sizes and about whether you save contents into Git or git-annex, though, as these choices can potentially spoil your DataLad experience. The sections Miscellaneous file system operations and Fixing up too-large datasets can help you undo unwanted changes, but it is better to do things right than to have to fix them up afterwards. If you can, read up on the DataLad Basics to understand what you are doing, and create a backup in case things do not go as planned in your first attempts.

Footnotes