Scaling up: Managing 80TB and 15 million files from the HCP release

This usecase outlines how a large data collection can be version controlled and published in an accessible manner with DataLad in a remote indexed archive (RIA) data store. Using the Human Connectome Project (HCP) data as an example, it shows how large-scale datasets can be managed with the help of modular nesting, and how access to data that is contingent on usage agreements and external service credentials is possible via DataLad without circumventing or breaching the data provider’s terms:

  1. The datalad addurls command is used to automatically aggregate files and information about their sources from public AWS S3 bucket storage into small-sized, modular DataLad datasets.

  2. Modular datasets are structured into a hierarchy of nested datasets, with a single HCP superdataset at the top. This modularizes storage and access, and mitigates performance problems that would arise in oversized standalone datasets, but maintains access to any subdataset from the top-level dataset.

  3. Individual datasets are stored in a remote indexed archive (RIA) store at store.datalad.org under their dataset ID. This setup constitutes a flexible, domain-agnostic, and scalable storage solution, while dataset configurations enable seamless automatic dataset retrieval from the store.

  4. The top-level dataset is published to GitHub as a public access point for the full HCP dataset. As the RIA store contains datasets with only file source information instead of hosting data contents, a datalad get retrieves file contents from the original AWS S3 sources.

  5. With DataLad’s authentication management, users authenticate only once (and must accept the HCP project’s terms to obtain valid credentials); subsequent datalad get commands work without logging in again.

  6. The datalad copy-file command can be used to subsample special-purpose datasets for faster access.

The Challenge

The Human Connectome Project aims to provide an unparalleled compilation of neural data through a customized database. Its largest open access data collection is the WU-Minn HCP1200 Data. It is made available via a public AWS S3 bucket and includes high-resolution 3T magnetic resonance scans from young healthy adult twins and non-twin siblings (ages 22-35) using four imaging modalities: structural images (T1w and T2w), resting-state fMRI (rfMRI), task-fMRI (tfMRI), and high angular resolution diffusion imaging (dMRI). It further includes behavioral and other individual subject measure data for all, and magnetoencephalography data and 7T MR data for a subset of subjects (twin pairs). In total, the data release encompasses around 80TB of data in 15 million files, and is of immense value to the field of neuroscience.

Its large amount of data, however, also constitutes a data management challenge: Such amounts of data are difficult to store, structure, access, and version control. Even tools such as DataLad, and its foundations, Git and git-annex, will struggle or fail with datasets of this size or number of files. Simply transforming the complete data release into a single DataLad dataset would at best lead to severe performance issues, but quite likely result in software errors and crashes. Moreover, access to the HCP data is contingent on consent to the data usage agreement of the HCP project and requires valid AWS S3 credentials. Instead of hosting this data or providing otherwise unrestrained access to it, an HCP DataLad dataset would need to enable data retrieval from the original sources, conditional on the user agreeing to the HCP usage terms.

The DataLad Approach

Using the datalad addurls command, the HCP data release is aggregated into a large number (N ≈ 4500) of datasets. A lean top-level dataset combines all datasets into a nested dataset hierarchy that recreates the original HCP data release’s structure. The topmost dataset contains one subdataset per subject with the subject’s release notes, and within each subject’s subdataset, each additional available subdirectory is another subdataset. This preserves the original structure of the HCP data release, but builds it up from sensible components that resemble standalone dataset units. As with any DataLad dataset, dataset nesting and operations across dataset boundaries are seamless, which makes it easy to retrieve data on a subject, modality, or file level.

The highly modular structure has several advantages. For one, with barely any data in the superdataset, the top-level dataset is very lean. It mainly consists of an impressive .gitmodules file [1] with almost 1200 registered (subject-level) subdatasets. The superdataset is published to GitHub at github.com/datalad-datasets/human-connectome-project-openaccess to expose this superdataset and allow anyone to install it with a single datalad clone command in a few seconds. Secondly, the modularity from splitting the data release into several thousand subdatasets has performance advantages. If Git or git-annex repositories exceed a certain size (either in terms of file sizes or the number of files), performance can drop severely [2]. By dividing the vast amount of data into many subdatasets, this can be prevented: Subdatasets are small-sized units that are combined to the complete HCP dataset structure, and nesting comes with no additional costs or difficulties, as DataLad can work smoothly across hierarchies of subdatasets.

To simplify access to the data without providing access that could circumvent the HCP license terms, DataLad does not host any HCP data. Instead, thanks to datalad addurls, each data file knows its source (the public AWS S3 bucket of the HCP project), and a datalad get will retrieve HCP data from this bucket. With this setup, anyone who wants to obtain the data will still need to consent to data usage terms and retrieve AWS credentials from the HCP project, but can afterwards obtain the data solely with DataLad commands from the command line or in scripts. Only the first datalad get requires authentication with AWS credentials provided by the HCP project: DataLad will prompt any user at the time of retrieval of the first file content of the dataset. Afterwards, no further authentication is needed, unless the credentials become invalid or need to be updated for other reasons. Thus, in order to retrieve HCP data down to the level of single files with DataLad, users only need to:

  • datalad clone the superdataset from GitHub

  • create an account at db.humanconnectome.org to agree to the project’s data usage terms and obtain AWS credentials

  • use datalad get [-n] [-r] PATH to retrieve file, directory, or subdataset contents on demand, authenticating once at the time of the first get

The HCP data release, despite its large size, can thus be version controlled and easily distributed with DataLad. In order to speed up data retrieval, subdataset installation can be parallelized, and the full HCP dataset can be subsampled into special-purpose datasets using DataLad’s copy-file command (introduced with DataLad version 0.13.0).

Step-by-Step

Building and publishing a DataLad dataset with HCP data consists of several steps: 1) Creating all necessary datasets, 2) publishing them to a RIA store, and 3) creating an access point to all files in the HCP data release. The upcoming subsections detail each of these.

Dataset creation with datalad addurls

The datalad addurls command (datalad-addurls manual) allows you to create (and update) potentially nested DataLad datasets from a list of download URLs that point to the HCP files in the S3 buckets. By supplying subject-specific .csv files that contain S3 download links, a subject ID, a file name, and a version specification per file in the HCP dataset, as well as information on where subdataset boundaries are, datalad addurls can download all subjects’ files and create (nested) datasets to store them in. With the help of a few bash commands, this task can be automated, and with the help of a job scheduler, it can also be parallelized. As soon as files are downloaded and saved to a dataset, their content can be dropped with datalad drop: The origin of the file was successfully recorded, and a datalad get can now retrieve file contents on demand. Thus, shortly after a complete download of the HCP project data, the datasets in which it has been aggregated are small in size, and yet provide access to the HCP data for anyone who has valid AWS S3 credentials.

At the end of this step, there is one nested dataset per subject in the HCP data release. If you are interested in the details of this process, check out the section below.

How exactly did the datasets come to be?

All code and tables necessary to generate the HCP datasets can be found on GitHub at github.com/TobiasKadelka/build_hcp.

The datalad addurls command is capable of building all necessary nested subject datasets automatically; it only needs an appropriate specification of its tasks. We’ll approach the function of datalad addurls and how exactly it was invoked to build the HCP dataset by looking at the information it needs. Below are excerpts of the .csv table of one subject (100206) that illustrate how addurls works:

Listing 1 Table header and some of the release note files
"original_url","subject","filename","version"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc.txt","100206","release-notes/Structural_preproc.txt","OeUYjysiX5zR7nRMixCimFa_6yQ3IKqf"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc_extended.txt","100206","release-notes/Structural_preproc_extended.txt","cyP8G5_YX5F30gO9Yrpk8TADhkLltrNV"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_unproc.txt","100206","release-notes/Structural_unproc.txt","AyW6GmavML6I7LfbULVmtGIwRGpFmfPZ"

Listing 2 Some files in the MNINonLinear directory
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec","100206","MNINonLinear//100206.164k_fs_LR.wb.spec","JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","sP4uw8R1oJyqCWeInSd9jmOBjfOCtN4D"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","yD88c.HfsFwjyNXHQQv2SymGIsSYHQVZ"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii","100206","MNINonLinear

The .csv table contains one row per file, and includes the columns original_url, subject, filename, and version. original_url is an S3 URL pointing to an individual file in the S3 bucket, subject is the subject’s ID (here: 100206), filename is the path to the file within the dataset that will be built, and version is an S3-specific file version identifier. The first table excerpt thus specifies a few files in the directory release-notes in the dataset of subject 100206. For datalad addurls, the column headers serve as placeholders for fields in each row. If this table excerpt is given to a datalad addurls call as shown below, it will create a dataset and download and save the precise version of each file in it:

$ datalad addurls -d <Subject-ID> <TABLE> '{original_url}?versionId={version}' '{filename}'

This command translates to “create a dataset with the name of the subject ID (-d <Subject-ID>) and use the provided table (<TABLE>) to assemble the dataset contents. Iterate through the table rows, and perform one download per row. Generate the download URL from the original_url and version fields of the table ('{original_url}?versionId={version}'), and save the downloaded file under the name specified in the filename field ('{filename}')”.
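To make the placeholder mechanism concrete, here is a minimal Python sketch that mimics the substitution with str.format. It only illustrates how the two format strings expand for each row; it is not how datalad addurls is implemented, and the helper name expand_rows is made up for this example. The rows are copied from Listing 1.

```python
import csv
import io

# Header and two rows copied from the release-notes table excerpt.
TABLE = '''"original_url","subject","filename","version"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
'''

# The two format strings passed to the addurls call shown above.
URL_FORMAT = "{original_url}?versionId={version}"
FILENAME_FORMAT = "{filename}"


def expand_rows(table_text, url_format, filename_format):
    """Yield one (download_url, target_path) pair per table row."""
    reader = csv.DictReader(io.StringIO(table_text))
    for row in reader:
        # Column headers act as placeholder names for the fields of a row.
        yield url_format.format(**row), filename_format.format(**row)


pairs = list(expand_rows(TABLE, URL_FORMAT, FILENAME_FORMAT))
for url, path in pairs:
    print(url, "->", path)
```

Each pair corresponds to one download: the versioned URL on the left is fetched and saved under the path on the right.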

If the file name contains a double slash (//), as seen in the second table excerpt in "MNINonLinear//..., this file will be created underneath a subdataset of the name in front of the double slash. The rows in the second table thus translate to “save these files into the subdataset MNINonLinear, and if this subdataset does not exist, create it”.
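The double-slash convention can be sketched in a few lines of Python. The function split_subdataset is a hypothetical helper written for illustration, not part of DataLad, and it only looks at the first double slash in a name.

```python
def split_subdataset(filename):
    """Split an addurls-style file name at its first double slash.

    Returns (subdataset, path_inside_subdataset). The subdataset is None
    when the name has no '//' and the file belongs to the dataset itself.
    """
    if "//" in filename:
        subdataset, rest = filename.split("//", 1)
        return subdataset, rest
    return None, filename


# File names taken from the two table excerpts above.
in_subdataset = split_subdataset("MNINonLinear//100206.164k_fs_LR.wb.spec")
in_dataset = split_subdataset("release-notes/ReleaseNotes.txt")
print(in_subdataset)
print(in_dataset)
```

The release-notes file stays in the subject dataset, while the MNINonLinear file is assigned to a subdataset of that name.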

Thus, with a single subject’s table, a nested, subject-specific dataset is built. Here is how the directory hierarchy looks for this particular subject once datalad addurls has worked through its table:

100206
├── MNINonLinear     <- subdataset
├── release-notes
├── T1w              <- subdataset
└── unprocessed      <- subdataset

This is all there is to assembling subject-specific datasets. The interesting question is: How can this be automated as much as possible?

How to create subject-specific tables

One crucial part of the process is the set of subject-specific tables for datalad addurls. The information on the file URL, its name, and its version can be queried with the datalad ls command (datalad-ls manual). It is a DataLad-specific version of the Unix ls command and can be used to list summary information about S3 URLs and datasets. With this command, the public S3 bucket can be queried and the command will output the relevant information.

The datalad ls command is a rather old command and less user-friendly than other commands demonstrated in the handbook. One problem for automation is that the command is made for interactive use, and it outputs information in a non-structured fashion. In order to retrieve the relevant information, a custom Python script was used to split its output and extract it. This script can be found in the GitHub repository as code/create_subject_table.py.
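The final step of such a script, writing the table, might look like the following sketch. It is not the code from code/create_subject_table.py: the output format of datalad ls is not reproduced here, so the input is assumed to be already-extracted (relative path, version) pairs. The rule that every top-level directory except release-notes becomes a subdataset is also an assumption, inferred from the listings and directory tree above, and it only covers one level of nesting.

```python
import csv
import io

BUCKET_PREFIX = "s3://hcp-openaccess/HCP_1200"
# Assumption: release-notes stays a plain directory, as in the tree above.
PLAIN_DIRECTORIES = {"release-notes"}


def table_row(subject, relative_path, version):
    """Build one table row for a file below the subject's bucket directory."""
    top, _, rest = relative_path.partition("/")
    if rest and top not in PLAIN_DIRECTORIES:
        # Mark the subdataset boundary with a double slash.
        filename = f"{top}//{rest}"
    else:
        filename = relative_path
    url = f"{BUCKET_PREFIX}/{subject}/{relative_path}"
    return [url, subject, filename, version]


def write_table(subject, files, out):
    """Write header and rows for (relative_path, version) pairs to out."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["original_url", "subject", "filename", "version"])
    for relative_path, version in files:
        writer.writerow(table_row(subject, relative_path, version))


buffer = io.StringIO()
write_table(
    "100206",
    [
        ("release-notes/ReleaseNotes.txt", "RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"),
        ("MNINonLinear/100206.164k_fs_LR.wb.spec", "JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"),
    ],
    buffer,
)
table_lines = buffer.getvalue().splitlines()
print("\n".join(table_lines))
```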

How to schedule datalad addurls commands for all tables

Once the subject-specific tables exist, datalad addurls can start to aggregate the files into datasets. To do this efficiently, the work can be run in parallel with a job scheduler. On the compute cluster where the datasets were aggregated, this was HTCondor.

The jobs (per subject) performed by HTCondor consisted of

  • a datalad addurls command to generate the (nested) dataset and retrieve content once [3]:

    datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" '{original_url}?versionId={version}' '{filename}'
    
  • a subsequent datalad drop command to remove file contents as soon as they were saved to the dataset to save disk space (this is possible since the S3 source of the file is known, and content can be reobtained using get):

    datalad drop -d "$outds" -r --nocheck
    
  • a few (Git) commands to clean up well afterwards, as the system the HCP dataset was downloaded to had a strict 5TB limit on disk usage.
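As a rough illustration of one such per-subject job, the following Python sketch assembles the two DataLad calls listed above and runs them in order. The function names, the injectable run parameter, and the example paths are hypothetical; the actual jobs were shell scripts submitted to HTCondor, and the Git cleanup commands are not specified here, so they are left out.

```python
import subprocess


def job_commands(outds, subj_table):
    """Return the two DataLad invocations of one subject job, in order."""
    addurls = [
        "datalad", "-l", "warning", "addurls",
        "-d", outds, "-c", "hcp_dataset", subj_table,
        "{original_url}?versionId={version}", "{filename}",
    ]
    # Drop file contents again; their S3 sources are recorded in the dataset.
    drop = ["datalad", "drop", "-d", outds, "-r", "--nocheck"]
    return [addurls, drop]


def run_job(outds, subj_table, run=subprocess.run):
    """Run addurls, then drop; stop at the first failing command."""
    for command in job_commands(outds, subj_table):
        run(command, check=True)
```

A call such as run_job("100206", "tables/100206.csv") would process one subject, where the table path is a made-up example. Passing a different run callable makes it possible to log or dry-run the commands without executing DataLad.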

Summary

Thus, in order to download the complete HCP project and aggregate it into nested subject level datasets (on a system with much less disk space than the complete HCP project’s size!), only two DataLad commands, one custom configuration, and some scripts to parse terminal output into .csv tables and create subject-wise HTCondor jobs were necessary. With all tables set up, the jobs ran over the Christmas break and finished before everyone went back to work. Getting 15 million files into datasets? Check!

Using a Remote Indexed Archive Store for dataset hosting

All datasets were built on a scientific compute cluster. In this location, however, datasets would only be accessible to users with an account on this system. Therefore, everything was subsequently published with datalad push to the publicly available store.datalad.org, a remote indexed archive (RIA) store.

A RIA store is a flexible and scalable data storage solution for DataLad datasets. While its layout may look confusing at first glance, a RIA store is nothing but a clever storage solution, and users never consciously interact with the store to get the HCP datasets. On the lowest level, store.datalad.org is a directory on a publicly accessible server that holds a great number of datasets stored as bare Git repositories. The only important aspect of it for this usecase is that datasets are stored and identified via their dataset ID instead of by their names (e.g., 100206). The datalad clone command can understand this layout and install datasets from a RIA store based on their ID.

What would a datalad clone from a RIA store look like?

In order to get a dataset from a RIA store, datalad clone needs a RIA URL. It is built from the following components:

  • a ria+ identifier

  • a path/URL to the store in question. For store.datalad.org, this is http://store.datalad.org, but it could also be an SSH URL, such as ssh://juseless.inm7.de/data/group/psyinf/dataset_store

  • a pound sign (#)

  • the dataset ID

  • and optionally a version or branch specification (appended with a leading @)

Here is what a valid datalad clone command from the data store for one dataset would look like:

datalad clone 'ria+http://store.datalad.org#d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561' subj-01
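The composition rules listed above can be written down as a small Python helper. The function ria_url is a hypothetical name used only for this sketch; it simply concatenates the components and performs no validation.

```python
def ria_url(store, dataset_id, version=None):
    """Compose a RIA URL from a store location and a dataset ID.

    An optional version or branch is appended with a leading '@'.
    """
    url = f"ria+{store}#{dataset_id}"
    if version is not None:
        url += f"@{version}"
    return url


example = ria_url(
    "http://store.datalad.org",
    "d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561",
)
print(example)
```

The printed URL is the one used in the clone command above.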

But worry not! To get the HCP data, no-one will ever need to compose clone commands to RIA stores apart from DataLad itself.

A RIA store is used because, among other advantages, its layout makes the store flexible and scalable. With datasets the size of the HCP project, scalability in particular becomes an important factor. If you are interested in finding out why, you can find more technical details on RIA stores, their advantages, and even how to create and use one yourself in the section Remote Indexed Archives for dataset storage and backup.

Making the datasets accessible

At this point, roughly 1200 nested datasets were created and published to a publicly accessible RIA store. This modularized the HCP dataset and prevented performance issues that would arise in oversized datasets. In order to make the complete dataset available and accessible from one central point, the only thing missing is a single superdataset.

For this, a new dataset, human-connectome-project-openaccess, was created. It contains a README file with short instructions on how to use it, a text-based copy of the HCP project’s data usage agreement, and each subject dataset as a subdataset. The .gitmodules file [1] of this superdataset is thus impressive. Here is an excerpt:

[submodule "100206"]
    path = HCP1200/100206
    url = ./HCP1200/100206
    branch = master
    datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
[submodule "100307"]
    path = HCP1200/100307
    url = ./HCP1200/100307
    branch = master
    datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
[submodule "100408"]
    path = HCP1200/100408
    url = ./HCP1200/100408
    branch = master
    datalad-id = d3fa72e4-2c2b-11ea-948f-0025904abcb0
[...]

For each subdataset (named after subject IDs), there is one entry (note that the individual URLs of the subdatasets are not needed: As will be demonstrated shortly, DataLad resolves each subdataset ID from the common store automatically). Thus, this superdataset combines all individual datasets to the original HCP dataset structure. This (and only this) superdataset is published to a public GitHub repository that anyone can datalad clone [4].

Data retrieval and interacting with the repository

HCP dataset version requirements

Using this dataset requires DataLad version 0.12.2 or higher. Upgrading an existing DataLad installation is detailed in section Installation and configuration.

Procedurally, getting data from this dataset is almost as simple as with any other public DataLad dataset: One needs to clone the repository and use datalad get [-n] [-r] PATH to retrieve any file, directory, or subdataset (content). But because the data will be downloaded from the HCP’s AWS S3 bucket, users will need to create an account at db.humanconnectome.org to agree to the project’s data usage terms and get credentials. When performing the first datalad get for file contents, DataLad will prompt for these credentials interactively from the terminal. Once supplied, all subsequent get commands will retrieve data right away.

Resetting AWS credentials

In case one mistypes their AWS credentials or needs to reset them, this can easily be done using the Python keyring package. For more information on keyring and DataLad’s authentication process, see the Basic process section in Configure custom data access.

After launching Python, import the keyring package and use the set_password() function. This function takes 3 arguments:

  • system: “datalad-hcp-s3” in this case

  • username: “key_id” if modifying the AWS access key ID or “secret_id” if modifying the secret access key

  • password: the access key itself

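import keyring

# Replace the placeholder strings with the credentials issued by the HCP project.
keyring.set_password("datalad-hcp-s3", "key_id", "<access-key-id>")
keyring.set_password("datalad-hcp-s3", "secret_id", "<secret-access-key>")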

Alternatively, one can set their credentials using environment variables. For more details on this method, see this Findoutmore.

$ export DATALAD_hcp_s3_key_id=<password>
$ export DATALAD_hcp_s3_secret_id=<password>

Internally, DataLad manages the crucial aspect of data retrieval: linking registered subdatasets to the correct dataset in the RIA store. If you inspect the GitHub repository, you will find that the subdataset links in it will not resolve if you click on them, because none of the subdatasets were published to GitHub [5]; they lie in the RIA store instead. Dataset or file content retrieval will nevertheless work automatically with datalad get: Each .gitmodules entry lists the subdataset’s dataset ID. Based on a “subdataset-source-candidate” configuration in .datalad/config of the superdataset, the subdataset ID is assembled into a RIA URL that get uses to retrieve the correct dataset from the store:

 $ cat .datalad/config
 [datalad "dataset"]
     id = 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
 [datalad "get"]
     subdataset-source-candidate-origin = "ria+http://store.datalad.org#{id}"

This configuration allows get to flexibly generate RIA URLs from the base URL in the config file and the dataset IDs listed in .gitmodules. In the superdataset, it needed to be done “by hand” via the git config command. Because the configuration should be shared together with the dataset, the configuration needed to be set in .datalad/config [6]:

$ git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+http://store.datalad.org#{id}"

With this configuration, get will retrieve all subdatasets from the RIA store. Any subdataset that is obtained from a RIA store in turn gets the very same configuration automatically into .git/config. Thus, the configuration that makes seamless subdataset retrieval from RIA stores possible is propagated throughout the dataset hierarchy. With this in place, anyone can clone the topmost dataset from GitHub, and, given they have valid credentials, get any file in the HCP dataset hierarchy.
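The following Python sketch illustrates the idea of combining the two pieces of information: the dataset IDs from .gitmodules and the URL template from .datalad/config. It imitates the substitution with a small hand-written parser and str.format; how DataLad performs this internally is not shown here, and the function names are made up for the example.

```python
import re

# Two entries copied from the .gitmodules excerpt shown earlier.
GITMODULES = '''\
[submodule "100206"]
    path = HCP1200/100206
    url = ./HCP1200/100206
    branch = master
    datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
[submodule "100307"]
    path = HCP1200/100307
    url = ./HCP1200/100307
    branch = master
    datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
'''

# Value of datalad.get.subdataset-source-candidate-origin from above.
TEMPLATE = "ria+http://store.datalad.org#{id}"

SECTION = re.compile(r'^\[submodule "(.+)"\]$')


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} for a .gitmodules text."""
    modules = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = SECTION.match(line)
        if header:
            current = modules.setdefault(header.group(1), {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


def source_candidates(text, template):
    """Map each subdataset path to the RIA URL built from its dataset ID."""
    return {
        props["path"]: template.format(id=props["datalad-id"])
        for props in parse_gitmodules(text).values()
    }


candidates = source_candidates(GITMODULES, TEMPLATE)
for path, url in candidates.items():
    print(path, "->", url)
```

Note that the url field of each entry plays no role in the result; only the path and the datalad-id are used.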

Speeding operations up

At this point in time, the HCP dataset is a single, published superdataset with ~4500 subdatasets that are hosted in a remote indexed archive (RIA) store at store.datalad.org. This makes the HCP data accessible via DataLad and its download easier. One downside to gigantic nested datasets like this one, though, is the time it takes to retrieve all of it. Some tricks can help to mitigate this: Contents can either be retrieved in parallel, or, in the case of general need for subsets of the dataset, subsampled datasets can be created with datalad copy-file.

If the complete HCP dataset is required, subdataset installation and data retrieval can be sped up by parallelizing. The gists Parallelize subdataset processing and Retrieve partial content from a hierarchy of (uninstalled) datasets can shed some light on how to do this. If you are interested in learning about datalad copy-file, check out the section Subsample datasets using datalad copy-file.
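One way to think about parallel subdataset installation is sketched below in Python. This is not the content of the gists mentioned above: it is a minimal sketch that runs one datalad get -n call per subdataset path in a thread pool, and the function name, the jobs default, and the injectable run parameter are assumptions made for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def install_subdatasets(paths, jobs=4, run=subprocess.run):
    """Install subdatasets (without file content) with parallel workers.

    Each worker runs 'datalad get -n PATH' from the current directory,
    which is expected to be the superdataset. Returns the paths in input
    order once all installations have finished.
    """
    def install(path):
        run(["datalad", "get", "-n", path], check=True)
        return path

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(install, paths))
```

For example, install_subdatasets(["HCP1200/100206", "HCP1200/100307"]) would install two subject datasets from within a clone of the superdataset. Threads are sufficient here because the actual work happens in the external datalad processes.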

Summary

This usecase demonstrated how it is possible to version control and distribute datasets of sizes that would otherwise be unmanageably large for version control systems. With the public HCP dataset available as a DataLad dataset, data access is simplified, data analyses that use the HCP data can link it (in precise versions) to their scripts and even share it, and the complete HCP release can be stored at a fraction of its total size for on-demand retrieval.

Footnotes

1

If you want to read up on how DataLad stores information about registered subdatasets in .gitmodules, check out the section More on DIY configurations.

2

Precise performance will always be dependent on the details of the repository, software setup, and hardware, but to get a feeling for the possible performance issues in oversized datasets, imagine a mere git status or datalad status command taking several minutes up to hours in a clean dataset.

3

Note that this command is more complex than the previously shown datalad addurls command. In particular, it has an additional loglevel configuration for the main command, and creates the datasets with an hcp_dataset configuration. The logging level was set (to warning) to help with post-execution diagnostics in HTCondor’s log files. The configuration can be found in code/cfg_hcp_dataset and enables a special remote in the resulting dataset.

4

To re-read about publishing datasets to hosting services such as GitHub or GitLab, go back to Publishing the dataset to GitHub.

5

If you coded along in the Basics part of the book and published your dataset to Gin, you have experienced in Subdataset publishing how the links to unpublished subdatasets in a published dataset do not resolve in the web interface: The subdataset’s path points to a URL underneath the superdataset, but there is no published subdataset at that location on the hosting platform!

6

To re-read on configurations of datasets, go back to sections DIY configurations and More on DIY configurations.