YODA: Best practices for data analyses in a dataset

The last requirement for the midterm projects reads “needs to comply to the YODA principles”. “What are the YODA principles?” you ask, as you have never heard of this before. “The topic of today’s lecture: Organizational principles of data analyses in DataLad datasets. This lecture will show you the basic principles behind creating, sharing, and publishing reproducible, understandable, and open data analysis projects with DataLad.”

The starting point…

Data analyses projects are very common, both in science and industry. But it can be very difficult to produce a reproducible, let alone comprehensible data analysis project. Many data analysis projects do not start out with a stringent organization, or fail to keep the structural organization of a directory intact as the project develops. Often, this can be due a lack of version-control. In these cases, a project will quickly end up with many almost-identical scripts suffixed with “_version_xyz”, or a chaotic results structure split between various directories with names such as results/, results_August19/, results_revision/ and now_with_nicer_plots/. Something like this for example is a very common shape a data science project may take after a while:

├── code/
│   ├── code_final/
│   │   ├── final_2/
│   │   │   ├── main_script_fixed.py
│   │   │   └──takethisscriptformostthingsnow.py
│   │   ├── utils_new.py
│   │   ├── main_script.py
│   │   ├── utils_new.py
│   │   ├── utils_2.py
│   │   └── main_analysis_newparameters.py
│   └── main_script_DONTUSE.py
├── data/
│   ├── data_updated/
│   │    └── dataset1/
│   │        ├── datafile_a

│   ├── dataset1/
│   │   ├── datafile_a
│   ├── outputs/
│   │   ├── figures/
│   │   │   ├── figures_new.py
│   │   │   └── figures_final_forreal.py
│   │   ├── important_results/
│   │   ├── random_results_file.tsv
│   │   ├── results_for_paper/
│   │   ├── results_for_paper_revised/
│   │   └── results_new_data/
│   ├── random_results_file.tsv
│   ├── random_results_file_v2.tsv

[...]

All data analysis endeavours in directories like this can work, for a while, if there is a person who knows the project well, and works on it all the time. But it inevitably will get messy once anyone tries to collaborate on a project like this, or simply goes on a two-week vacation and forgets whether the function in main_analysis_newparameters.py or the one in takethisscriptformostthingsnow.py was the one that created a particular figure.

But even if a project has an intuitive structure, and is version controlled, in many cases an analysis script will stop working, or maybe worse, will produce different results, because the software and tools used to conduct the analysis in the first place got an update. This update may have come with software changes that made functions stop working, or work differently than before. In the same vein, recomputing an analysis project on a different machine than the one the analysis was developed on can fail if the necessary software in the required versions is not installed or available on this new machine. The analysis might depend on software that runs on a Linux machine, but the project was shared with a Windows user. The environment during analysis development used Python 2, but the new system has only Python 3 installed. Or one of the dependent libraries needs to be in version X, but is installed as version Y.

The YODA principles are a clear set of organizational standards for datasets used for data analysis projects that aim to overcome issues like the ones outlined above. The name stands for “YODAs Organigram on Data Analysis”1. The principles outlined in YODA set simple rules for directory names and structures, best-practices for version-controlling dataset elements and analyses, facilitate usage of tools to improve the reproducibility and accountability of data analysis projects, and make collaboration easier. They are summarized in three basic principles, that translate to both dataset structures and best practices regarding the analysis:

As you will see, complying to these principles is easy if you use DataLad. Let’s go through them one by one:

P1: One thing, one dataset

Whenever a particular collection of files could be useful in more than one context, make them a standalone, modular component. In the broadest sense, this means to structure your study elements (data, code, computational environments, results, …) in dedicated directories. For example:

  • Store input data for an analysis in a dedicated inputs/ directory. Keep different formats or processing-stages of your input data as individual, modular components: Do not mix raw data, data that is already structured following community guidelines of the given field, or preprocessed data, but create one data component for each of them. And if your analysis relies on two or more data collections, these collections should each be an individual component, not combined into one.

  • Store scripts or code used for the analysis of data in a dedicated code/ directory, outside of the data component of the dataset.

  • Collect results of an analysis in a dedicated outputs/ directory, and leave the input data of an analysis untouched by your computations.

  • Include a place for complete execution environments, for example singularity images or docker containers2, in the form of an envs/ directory, if relevant for your analysis.

  • And if you conduct multiple different analyses, create a dedicated project for each analysis, instead of conflating them.

This, for example, would be a directory structure from the root of a superdataset of a very comprehensive3 data analysis project complying to the YODA principles:

├── ci/                         # continuous integration configuration
│   └── .travis.yml
├── code/                       # your code
│   ├── tests/                  # unit tests to test your code
│   │   └── test_myscript.py
│   └── myscript.py
├── docs                        # documentation about the project
│   ├── build/
│   └── source/
├── envs                        # computational environments
│   └── Singularity
├── inputs/                     # dedicated inputs/, will not be changed by an analysis
│   └─── data/
│       ├── dataset1/           # one stand-alone data component
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── outputs/                    # outputs away from the input data
│   └── important_results/
│       └── figures/
├── CHANGELOG.md                # notes for fellow humans about your project
├── HOWTO.md
└── README.md

There are many advantages to this modular way of organizing contents. Having input data as independent components that are not altered (only consumed) by an analysis does not conflate the data for an analysis with the results or the code, thus assisting understanding the project for anyone unfamiliar with it. But more than just structure, this organization aids modular reuse or publication of the individual components, for example data. In a YODA-compliant dataset, any processing stage of a data component can be reused in a new project or published and shared. The same is true for a whole analysis dataset. At one point you might also write a scientific paper about your analysis in a paper project, and the whole analysis project can easily become a modular component in a paper project, to make sharing paper, code, data, and results easy. The usecase Writing a reproducible paper contains a step-by-step instruction on how to build and share such a reproducible paper, if you want to learn more.

Modular structure of a data analysis project

Fig. 6 Data are modular components that can be re-used easily.

The directory tree above and Figure 3 highlight different aspects of this principle. The directory tree illustrates the structure of the individual pieces on the file system from the point of view of a single top-level dataset with a particular purpose. It for example could be an analysis dataset created by a statistician for a scientific project, and it could be shared between collaborators or with others during development of the project. In this superdataset, code is created that operates on input data to compute outputs, and the code and outputs are captured, version- controlled, and linked to the input data. Each input data in turn is a (potentially nested) subdataset, but this is not visible in the directory hierarchy. Figure 3 in comparison emphasizes a process view on a project and the nested structure of input subdataset: You can see how the preprocessed data that serves as an input for the analysis datasets evolves from raw data to standardized data organization to its preprocessed state. Within the data/ directory of the file system hierarchy displayed above one would find data datasets with their previous version as a subdataset, and this is repeated recursively until one reaches the raw data as it was originally collected at one point. A finished analysis project in turn can be used as a component (subdataset) in a paper project, such that the paper is a fully reproducible research object that shares code, analysis results, and data, as well as the history of all of these components.

Principle 1, therefore, encourages to structure data analysis projects in a clear and modular fashion that makes use of nested DataLad datasets, yielding comprehensible structures and re-usable components. Having each component version-controlled – regardless of size – will aid keeping directories clean and organized, instead of piling up different versions of code, data, or results.

P2: Record where you got it from, and where it is now

Its good to have data, but its even better if you and anyone you collaborate or share the project or its components with can find out where the data came from, or how it is dependent on or linked to other data. Therefore, this principle aims to attach this information to the components of your data analysis project.

Luckily, this is a no-brainer with DataLad, because the core data structure of DataLad, the dataset, and many of the DataLad commands already covered up to now fulfill this principle.

If data components of a project are DataLad datasets, they can be included in an analysis superdataset as subdatasets. Thanks to datalad install, information on the source of these subdatasets is stored in the history of the analysis superdataset, and they can even be updated from those sources if the original data dataset gets extended or changed. If you are including a file, for example code from Github, the datalad download-url command (introduced in section Networking) will record the source of it safely in the dataset’s history. And if you add anything to your dataset, from simple incremental coding progress in your analysis scripts up to files that a colleague sent you via email, a plain datalad save with a helpful commit message goes a very long way to fulfill this principle on its own already.

One core aspect of this principle is linking between re-usable data resource units (i.e., DataLad subdatasets containing pure data). You will be happy to hear that this is achieved by simply installing datasets as subdatasets. This part of this principle will therefore be absolutely obvious to you because you already know how to install and nest datasets within datasets. “I might just overcome my impostor syndrome if I experience such advanced reproducible analysis concepts as being obvious”, you think with a grin.

Datasets are installed as subdatasets

Fig. 7 Schematic illustration of two standalone data datasets installed as subdatasets into an analysis project.

But more than linking datasets in a superdataset, linkage also needs to be established between components of your dataset. Scripts inside of your code/ directory should point to data not as absolute paths that would only work on your system, but instead as relative paths that will work in any shared copy of your dataset. The next section on DataLads Python API will show concrete examples of this.

Lastly, this principle also includes moving, sharing, and publishing your datasets or its components. It is usually costly to collect data, and economically unfeasible4 to keep it locked in a drawer (or similarly out of reach behind complexities of data retrieval or difficulties in understanding the data structure). But conducting several projects on the same dataset yourself, sharing it with collaborators, or publishing it is easy if the project is a DataLad dataset that can be installed and retrieved on demand, and is kept clean from everything that is not part of the data according to principle 1. Conducting transparent open science is easier if you can link code, data, and results within a dataset, and share everything together. In conjunction with principle 1, this means that you can distribute your analysis projects (or parts of it) in a comprehensible form.

A full data analysis workflow complying with YODA principles

Fig. 8 In a dataset that complies to the YODA principles, modular components (data, analysis results, papers) can be shared or published easily.

Principle 2, therefore, facilitates transparent linkage of datasets and their components to other components, their original sources, or shared copies. With the DataLad tools you learned to master up to this point, you have all the necessary skills to comply to it already.

P3: Record what you did to it, and with what

This last principle is about capturing how exactly the content of every file came to be that was not obtained from elsewhere. For example, this relates to results generated from inputs by scripts or commands. The section Keeping track already outlined the problem of associating a result with an input and a script. It can be difficult to link a figure from your data analysis project with an input data file or a script, even if you created this figure yourself. The datalad run command however mitigates these difficulties, and captures the provenance of any output generated with a datalad run call in the history of the dataset. Thus, by using datalad run in analysis projects, your dataset knows which result was generated when, by which author, from which inputs, and by means of which command.

With another DataLad command one can even go one step further: The command datalad containers-run (it will be introduced in a later part of the book) performs a command execution within a configured containerized environment. Thus, not only inputs, outputs, command, time, and author, but also the software environment are captured as provenance of a dataset component such as a results file, and, importantly, can be shared together with the dataset in the form of a software container.

A very cute YODA

Fig. 9 “Feel the force!”

With this last principle, your dataset collects and stores provenance of all the contents you created in the wake of your analysis project. This established trust in your results, and enables others to understand where files derive from.

The YODA procedure

There is one tool that can make starting a yoda-compliant data analysis easier: DataLads yoda procedure. Just as the text2git procedure from section Create a dataset, the yoda procedure can be included in a datalad create command and will apply useful configurations to your dataset:

$ datalad create -c yoda "my_analysis"

[INFO   ] Creating a new annex repo at /home/adina/repos/testing/my_analysis
create(ok): /home/adina/repos/testing/my_analysis (dataset)
[INFO   ] Running procedure cfg_yoda
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====

Let’s take a look at what configurations and changes come with this procedure:

$ tree -a

.
├── .gitattributes
├── CHANGELOG.md
├── code
│   ├── .gitattributes
│   └── README.md
└── README.md

Let’s take a closer look into the .gitattributes files:

$ less .gitattributes

**/.git* annex.largefiles=nothing
CHANGELOG.md annex.largefiles=nothing
README.md annex.largefiles=nothing

$ less code/.gitattributes

* annex.largefiles=nothing

Summarizing these two glimpses into the dataset, this configuration has

  1. included a code directory in your dataset

  2. included three files for human consumption (README.md, CHANGELOG.md)

  3. configured everything in the code/ directory to be tracked by Git, not Git-annex5

  4. and configured README.md and CHANGELOG.md in the root of the dataset to be tracked by Git.

Your next data analysis project can thus get a headstart with useful configurations and the start of a comprehensible directory structure by applying the yoda procedure.

Sources

This section is based on this comprehensive poster and these publicly available slides about the YODA principles.

Footnotes

1

“Why does the acronym contain itself?” you ask confused. “That’s because it’s a recursive acronym, where the first letter stands recursively for the whole acronym.” you get in response. “This is a reference to the recursiveness within a DataLad dataset – all principles apply recursively to all the subdatasets a dataset has.” “And what does all of this have to do with Yoda?” you ask mildly amused. “Oh, well. That’s just because the DataLad team is full of geeks.”

2

If you want to learn more about Docker and Singularity, or general information about containerized computational environments for reproducible data science, check out this section in the wonderful book The Turing Way, a comprehensive guide to reproducible data science.

3

This directory structure is very comprehensive, and displays many best-practices for reproducible data science. For example,

  1. Within code/, it is best practice to add tests for the code. These tests can be run to check whether the code still works.

  2. It is even better to further use automated computing, for example continuous integration (CI) systems, to test the functionality of your functions and scripts automatically. If relevant, the setup for continuous integration frameworks (such as Travis) lives outside of code/, in a dedicated ci/ directory.

  3. Include documents for fellow humans: Notes in a README.md or a HOWTO.md, or even proper documentation (for example using in a dedicated docs/ directory. Within these documents, include all relevant metadata for your analysis. If you are conducting a scientific study, this might be authorship, funding, change log, etc.

If writing tests for analysis scripts or using continuous integration is a new idea for you, but you want to learn more, check out this excellent chapter on testing in the book The Turing Way.

4

Substitute unfeasible with wasteful, impractical, or simply stupid if preferred.

5

To re-read how .gitattributes work, go back to section DIY configurations, and to remind yourself about how this worked for the text2git configuration, go back to section Data safety.