6.2. YODA: Best practices for data analyses in a dataset
The last requirement for the midterm projects reads “needs to comply with the YODA principles”. “What are the YODA principles?” you ask, as you have never heard of them before. “The topic of today’s lecture: Organizational principles of data analyses in DataLad datasets. This lecture will show you the basic principles behind creating, sharing, and publishing reproducible, understandable, and open data analysis projects with DataLad,” you hear in return.
6.2.1. The starting point…
Data analysis projects are very common, both in science and industry.
But it can be very difficult to produce a reproducible, let alone
comprehensible data analysis project.
Many data analysis projects do not start out with
a stringent organization, or fail to keep the structural organization of a
directory intact as the project develops. Often, this can be due to a lack of
version control. In these cases, a project will quickly end up with many
almost-identical scripts suffixed with “_version_xyz”,
or a chaotic results structure split between various directories with names
like now_with_nicer_plots/. Something like this is a very
common shape a data science project may take after a while:
├── code/
│   ├── code_final/
│   │   ├── final_2/
│   │   │   ├── main_script_fixed.py
│   │   │   └── takethisscriptformostthingsnow.py
│   │   ├── utils_new.py
│   │   ├── main_script.py
│   │   ├── utils_new.py
│   │   ├── utils_2.py
│   │   └── main_analysis_newparameters.py
│   └── main_script_DONTUSE.py
├── data/
│   ├── data_updated/
│   │   └── dataset1/
│   │       ├── datafile_a
│   ├── dataset1/
│   │   ├── datafile_a
│   ├── outputs/
│   │   ├── figures/
│   │   │   ├── figures_new.py
│   │   │   └── figures_final_forreal.py
│   │   ├── important_results/
│   │   ├── random_results_file.tsv
│   │   ├── results_for_paper/
│   │   ├── results_for_paper_revised/
│   │   └── results_new_data/
│   ├── random_results_file.tsv
│   ├── random_results_file_v2.tsv
[...]
All data analysis endeavors in directories like this can work, for a while,
if there is a person who knows the project well, and works on it all the time.
But it inevitably will get messy once anyone tries to collaborate on a project
like this, or simply goes on a two-week vacation and forgets whether
the function in
main_analysis_newparameters.py or the one in
takethisscriptformostthingsnow.py was the one that created a particular figure.
But even if a project has an intuitive structure, and is version controlled, in many cases an analysis script will stop working, or maybe worse, will produce different results, because the software and tools used to conduct the analysis in the first place got an update. This update may have come with software changes that made functions stop working, or work differently than before. In the same vein, recomputing an analysis project on a different machine than the one the analysis was developed on can fail if the necessary software in the required versions is not installed or available on this new machine. The analysis might depend on software that runs on a Linux machine, but the project was shared with a Windows user. The environment during analysis development used Python 2, but the new system has only Python 3 installed. Or one of the dependent libraries needs to be in version X, but is installed as version Y.
The YODA principles are a clear set of organizational standards for datasets used for data analysis projects that aim to overcome issues like the ones outlined above. The name stands for “YODAs Organigram on Data Analysis” [1]. The principles outlined in YODA set simple rules for directory names and structures, establish best practices for version-controlling dataset elements and analyses, facilitate usage of tools to improve the reproducibility and accountability of data analysis projects, and make collaboration easier. They are summarized in three basic principles that translate to both dataset structures and best practices regarding the analysis:
- P1: One thing, one dataset
- P2: Record where you got it from, and where it is now
- P3: Record what you did to it, and with what
As you will see, complying with these principles is easy if you use DataLad. Let’s go through them one by one:
6.2.2. P1: One thing, one dataset
Whenever a particular collection of files could be useful in more than one context, make them a standalone, modular component. In the broadest sense, this means structuring your study elements (data, code, computational environments, results, …) in dedicated directories. For example:
- Store input data for an analysis in a dedicated inputs/ directory. Keep different formats or processing-stages of your input data as individual, modular components: Do not mix raw data, data that is already structured following community guidelines of the given field, or preprocessed data, but create one data component for each of them. And if your analysis relies on two or more data collections, these collections should each be an individual component, not combined into one.
- Store scripts or code used for the analysis of data in a dedicated code/ directory, outside of the data component of the dataset.
- Collect results of an analysis in a dedicated outputs/ directory, and leave the input data of an analysis untouched by your computations.
- And if you conduct multiple different analyses, create a dedicated project for each analysis, instead of conflating them.
This, for example, would be a directory structure from the root of a superdataset of a very comprehensive data analysis project complying with the YODA principles:
├── ci/                         # continuous integration configuration
│   └── .travis.yml
├── code/                       # your code
│   ├── tests/                  # unit tests to test your code
│   │   └── test_myscript.py
│   └── myscript.py
├── docs                        # documentation about the project
│   ├── build/
│   └── source/
├── envs                        # computational environments
│   └── Singularity
├── inputs/                     # dedicated inputs/, will not be changed by an analysis
│   └── data/
│       ├── dataset1/           # one stand-alone data component
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── outputs/                    # outputs away from the input data
│   └── important_results/
│       └── figures/
├── CHANGELOG.md                # notes for fellow humans about your project
├── HOWTO.md
└── README.md
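To make the layout above concrete, here is a minimal Python sketch that creates the same skeleton as plain directories and empty files. It is only an illustration of the structure, not how DataLad sets up a project: in a real YODA project the root would be a DataLad dataset and the entries under inputs/data/ would be subdatasets. The project name my_analysis and the helper make_skeleton are assumptions made for this example.

```python
from pathlib import Path
import tempfile

# Directories from the example tree (plain directories only; in a real YODA
# project the data components under inputs/data/ would be DataLad subdatasets).
YODA_DIRS = [
    "ci",
    "code/tests",
    "docs/build",
    "docs/source",
    "envs",
    "inputs/data",
    "outputs/important_results/figures",
]
YODA_FILES = ["CHANGELOG.md", "HOWTO.md", "README.md", "code/myscript.py"]


def make_skeleton(root: Path) -> None:
    """Create the empty directory and file layout below `root` (hypothetical helper)."""
    for d in YODA_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in YODA_FILES:
        (root / f).touch()


# Build the skeleton in a throwaway location and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_analysis"
    make_skeleton(project)
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print("\n".join(created))
```

The point of the sketch is the separation it encodes: code, inputs, and outputs never share a directory, so each can later become its own reusable component.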
You can find some non-DataLad-related advice for structuring your directories in the note below on best practices for analysis organization.
More best practices for organizing contents in directories
The exemplary YODA directory structure is very comprehensive, and displays many best practices for reproducible data science. For example,
- Within code/, it is best practice to add tests for the code. These tests can be run to check whether the code still works.
- It is even better to further use automated computing, for example continuous integration (CI) systems, to test the functionality of your functions and scripts automatically. If relevant, the setup for continuous integration frameworks (such as Travis) lives outside of code/, in a dedicated ci/ directory.
- Include documents for fellow humans: Notes in a README.md or a HOWTO.md, or even proper documentation (for example, built with a documentation generator) in a dedicated docs/ directory. Within these documents, include all relevant metadata for your analysis. If you are conducting a scientific study, this might be authorship, funding, change log, etc.
If writing tests for analysis scripts or using continuous integration is a new idea for you, but you want to learn more, check out this chapter on testing.
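As a rough illustration of what such a test could look like, the sketch below pairs a small analysis function with two tests. The function mean and both test functions are hypothetical; in the example layout they would live in code/myscript.py and code/tests/test_myscript.py respectively, and a test runner such as pytest would collect the test_ functions automatically. They are shown in one block here so the example runs on its own.

```python
# code/myscript.py (hypothetical analysis function)
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


# code/tests/test_myscript.py (hypothetical tests)
def test_mean():
    # Known inputs with known results guard against silent behavior changes.
    assert mean([1, 2, 3]) == 2
    assert mean([2.5]) == 2.5


def test_mean_rejects_empty_input():
    # The function should fail loudly instead of returning a misleading value.
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    test_mean()
    test_mean_rejects_empty_input()
    print("all tests passed")
```

Running tests like these after a software update is a cheap way to notice that a function started behaving differently.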
There are many advantages to this modular way of organizing contents. Keeping input data as independent components that are not altered (only consumed) by an analysis avoids conflating the data for an analysis with the results or the code, which helps anyone unfamiliar with the project understand it. But more than just structure, this organization aids modular reuse or publication of the individual components, for example data. In a YODA-compliant dataset, any processing stage of a data component can be reused in a new project or published and shared. The same is true for a whole analysis dataset. At some point you might also write a scientific paper about your analysis, and the whole analysis project can easily become a modular component in a paper project, to make sharing paper, code, data, and results easy. The usecase Writing a reproducible paper contains step-by-step instructions on how to build and share such a reproducible paper, if you want to learn more.
The directory tree above and Figure 6.2 highlight different aspects
of this principle. The directory tree illustrates the structure of
the individual pieces on the file system from the point of view of
a single top-level dataset with a particular purpose. It could, for example,
be an analysis dataset created by a statistician for a scientific
project, and it could be shared between collaborators or
with others during development of the project. In this
superdataset, code is created that operates on input data to
compute outputs, and the code and outputs are captured,
version-controlled, and linked to the input data. Each input data component in turn
is a (potentially nested) subdataset, but this is not visible
in the directory hierarchy.
Figure 6.2, in comparison, emphasizes a process view on a project and
the nested structure of input subdatasets:
You can see how the preprocessed data that serves as an input for
the analysis datasets evolves from raw data to
standardized data organization to its preprocessed state. Within the
data/ directory of the file system hierarchy displayed
above one would find data datasets with their previous version as
a subdataset, and this is repeated recursively until one reaches
the raw data as it was originally collected at one point. A finished
analysis project in turn can be used as a component (subdataset) in
a paper project, such that the paper is a fully reproducible research
object that shares code, analysis results, and data, as well as the
history of all of these components.
Principle 1, therefore, encourages structuring data analysis projects in a clear and modular fashion that makes use of nested DataLad datasets, yielding comprehensible structures and re-usable components. Having each component version-controlled, regardless of size, helps keep directories clean and organized, instead of piling up different versions of code, data, or results.
6.2.3. P2: Record where you got it from, and where it is now
It is good to have data, but it is even better if you and anyone you collaborate or share the project or its components with can find out where the data came from, or how it is dependent on or linked to other data. Therefore, this principle aims to attach this information, the data’s provenance, to the components of your data analysis project.
Luckily, this is a no-brainer with DataLad, because the core data structure of DataLad, the dataset, and many of the DataLad commands already covered up to now fulfill this principle.
If data components of a project are DataLad datasets, they can be included in an analysis superdataset as subdatasets. Thanks to datalad clone, information on the source of these subdatasets is stored in the history of the analysis superdataset, and they can even be updated from those sources if the original data dataset gets extended or changed. If you are including a file, for example code from GitHub, the datalad download-url command (introduced in section Populate a dataset) will record the source of it safely in the dataset’s history. And if you add anything to your dataset, from simple incremental coding progress in your analysis scripts up to files that a colleague sent you via email, a plain datalad save with a helpful commit message goes a very long way to fulfill this principle on its own already.
One core aspect of this principle is linking between re-usable data resource units (i.e., DataLad subdatasets containing pure data). You will be happy to hear that this is achieved by simply installing datasets as subdatasets. This part of this principle will therefore be absolutely obvious to you because you already know how to install and nest datasets within datasets. “I might just overcome my impostor syndrome if I experience such advanced reproducible analysis concepts as being obvious”, you think with a grin.
But more than linking datasets in a superdataset, linkage also needs to
be established between components of your dataset. Scripts inside of your
code/ directory should point to data not as absolute paths
that would only work on your system, but instead as relative paths
that will work in any shared copy of your dataset. The next section
demonstrates a YODA data analysis project and will show concrete examples of this.
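One possible way to avoid hard-coded absolute paths is to derive every location from the position of the script itself. The sketch below assumes the script sits directly in code/ and the data lives under inputs/data/, as in the example layout; the helper names dataset_root and input_file and the example path are hypothetical. An equally valid alternative is to always run scripts from the dataset root and use plain relative paths such as inputs/data/dataset1/datafile_a.

```python
from pathlib import Path


def dataset_root(script_path: str) -> Path:
    """Return the dataset root, assuming the script lives directly in code/."""
    return Path(script_path).resolve().parent.parent


def input_file(script_path: str, *parts: str) -> Path:
    """Build a path below inputs/data/ relative to the dataset root."""
    return dataset_root(script_path) / "inputs" / "data" / Path(*parts)


# Inside a real script one would pass __file__; a made-up location is used
# here so that the example runs anywhere.
example = input_file(
    "/home/me/my_analysis/code/myscript.py", "dataset1", "datafile_a"
)
print(example)
```

Because nothing in the script names a user or machine, the same code keeps working in any clone of the dataset, wherever it is placed on disk.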
Lastly, this principle also includes moving, sharing, and publishing your datasets or their components. It is usually costly to collect data, and economically unfeasible [3] to keep it locked in a drawer (or similarly out of reach behind complexities of data retrieval or difficulties in understanding the data structure). But conducting several projects on the same dataset yourself, sharing it with collaborators, or publishing it is easy if the project is a DataLad dataset that can be installed and retrieved on demand, and is kept clean from everything that is not part of the data according to principle 1. Conducting transparent open science is easier if you can link code, data, and results within a dataset, and share everything together. In conjunction with principle 1, this means that you can distribute your analysis projects (or parts of them) in a comprehensible form.
Principle 2, therefore, facilitates transparent linkage of datasets and their components to other components, their original sources, or shared copies. With the DataLad tools you learned to master up to this point, you have all the necessary skills to comply with it already.
6.2.4. P3: Record what you did to it, and with what
This last principle is about capturing how exactly the content of
every file came to be that was not obtained from elsewhere. For example,
this relates to results generated from inputs by scripts or commands.
The section Keeping track already outlined the problem of associating
a result with an input and a script. It can be difficult to link a
figure from your data analysis project with an input data file or a
script, even if you created this figure yourself.
The datalad run command, however, mitigates these difficulties,
and captures the provenance of any output generated with a
datalad run call in the history of the dataset. Thus, by using
datalad run in analysis projects, your dataset knows
which result was generated when, by which author, from which inputs,
and by means of which command.
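To show which pieces of information go into such a call, the following sketch assembles the argument list for a datalad run invocation with a commit message, declared inputs, and declared outputs. It only builds and prints the command line; it does not execute DataLad. The helper datalad_run_args, the commit message, and the file names are hypothetical and chosen to match the example layout from principle 1.

```python
import shlex


def datalad_run_args(message, command, inputs=(), outputs=()):
    """Assemble the argument list for a `datalad run` call (hypothetical helper)."""
    args = ["datalad", "run", "-m", message]
    for path in inputs:
        args += ["--input", path]    # files the command reads
    for path in outputs:
        args += ["--output", path]   # files the command creates or modifies
    args.append(command)             # the command whose execution is recorded
    return args


args = datalad_run_args(
    message="Compute summary figure",
    command="python code/myscript.py",
    inputs=["inputs/data/dataset1/datafile_a"],
    outputs=["outputs/important_results/figures/summary.png"],
)
print(shlex.join(args))

# To execute the call inside a dataset with DataLad installed (not done here):
# import subprocess; subprocess.run(args, check=True)
```

Declaring inputs and outputs explicitly is what lets the dataset link a result file to the data and the command that produced it.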
With another DataLad command one can even go one step further: The command datalad containers-run (it will be introduced in section Computational reproducibility with software containers) performs a command execution within a configured containerized environment. Thus, not only inputs, outputs, command, time, and author, but also the software environment are captured as provenance of a dataset component such as a results file, and, importantly, can be shared together with the dataset in the form of a software container.
Tip: Make use of the --dry-run option of datalad run to craft your run-command (see Dry-running your run call)!
With this last principle, your dataset collects and stores provenance of all the contents you created in the wake of your analysis project. This establishes trust in your results, and enables others to understand where files derive from.
6.2.5. The YODA procedure
There is one tool that can make starting a yoda-compliant data analysis
easier: the yoda procedure. Just like the procedure applied in
section Create a dataset, the
yoda procedure can be included in a
datalad create command and will apply useful configurations
to your dataset:
$ datalad create -c yoda "my_analysis"
[INFO ] Creating a new annex repo at /home/me/repos/testing/my_analysis
create(ok): /home/me/repos/testing/my_analysis (dataset)
[INFO ] Running procedure cfg_yoda
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
Let’s take a look at what configurations and changes come with this procedure:
$ tree -a
.
├── .gitattributes
├── CHANGELOG.md
├── code
│   ├── .gitattributes
│   └── README.md
└── README.md
Let’s take a closer look into the .gitattributes files:
$ less .gitattributes
**/.git* annex.largefiles=nothing
CHANGELOG.md annex.largefiles=nothing
README.md annex.largefiles=nothing

$ less code/.gitattributes
* annex.largefiles=nothing
Summarizing these two glimpses into the dataset, this configuration has
- included a code directory in your dataset
- included three files for human consumption (README.md and CHANGELOG.md in the root of the dataset, and code/README.md)
- configured everything in the code/ directory to be tracked by Git, not git-annex [4]
- configured README.md and CHANGELOG.md in the root of the dataset to be tracked by Git.
Your next data analysis project can thus get a head start with useful configurations
and the start of a comprehensible directory structure by applying the
yoda procedure.
[1] “Why does the acronym contain itself?” you ask confused. “That’s because it’s a recursive acronym, where the first letter stands recursively for the whole acronym,” you get in response. “This is a reference to the recursiveness within a DataLad dataset: all principles apply recursively to all the subdatasets a dataset has.” “And what does all of this have to do with Yoda?” you ask mildly amused. “Oh, well. That’s just because the DataLad team is full of geeks.”
[2] If you want to learn more about Docker and Singularity, or general information about containerized computational environments for reproducible data science, check out this section in the wonderful book The Turing Way, a comprehensive guide to reproducible data science, or read about it in section Computational reproducibility with software containers.
[3] Substitute unfeasible with wasteful, impractical, or simply stupid if preferred.