Scaling up: Managing 80TB and 15 million files from the HCP release¶
This use case outlines how a large data collection can be version controlled and published in an accessible manner with DataLad in a remote indexed archive (RIA) data store. Using the Human Connectome Project (HCP) data as an example, it shows how large-scale datasets can be managed with the help of modular nesting, and how access to data that is contingent on usage agreements and external service credentials is possible via DataLad without circumventing or breaching the data provider’s terms:
The datalad addurls command is used to automatically aggregate files and information about their sources from public AWS S3 bucket storage into small-sized, modular DataLad datasets.
Modular datasets are structured into a hierarchy of nested datasets, with a single HCP superdataset at the top. This modularizes storage and access, and mitigates performance problems that would arise in oversized standalone datasets, but maintains access to any subdataset from the top-level dataset.
Individual datasets are stored in a remote indexed archive (RIA) store at store.datalad.org under their dataset ID. This setup constitutes a flexible, domain-agnostic, and scalable storage solution, while dataset configurations enable seamless automatic dataset retrieval from the store.
The top-level dataset is published to GitHub as a public access point for the full HCP dataset. As the RIA store contains datasets with only file source information instead of hosting data contents, a datalad get retrieves file contents from the original AWS S3 sources.
With DataLad’s authentication management, users authenticate only once – and are thus required to accept the HCP project’s terms to obtain valid credentials – while all subsequent datalad get commands work swiftly without logging in.
The datalad copy-file command can be used to subsample special-purpose datasets for faster access.
The Challenge¶
The Human Connectome Project aims to provide an unparalleled compilation of neural data through a customized database. Its largest open access data collection is the WU-Minn HCP1200 Data. It is made available via a public AWS S3 bucket and includes high-resolution 3T magnetic resonance scans from young healthy adult twins and non-twin siblings (ages 22-35) using four imaging modalities: structural images (T1w and T2w), resting-state fMRI (rfMRI), task-fMRI (tfMRI), and high angular resolution diffusion imaging (dMRI). It further includes behavioral and other individual subject measure data for all, and magnetoencephalography data and 7T MR data for a subset of subjects (twin pairs). In total, the data release encompasses around 80TB of data in 15 million files, and is of immense value to the field of neuroscience.
Its large amount of data, however, also constitutes a data management challenge: Such amounts of data are difficult to store, structure, access, and version control. Even tools such as DataLad, and its foundations, Git and git-annex, will struggle or fail with datasets of this size or number of files. Simply transforming the complete data release into a single DataLad dataset would at best lead to severe performance issues, but quite likely result in software errors and crashes. Moreover, access to the HCP data is contingent on consent to the data usage agreement of the HCP project and requires valid AWS S3 credentials. Instead of hosting this data or providing otherwise unrestrained access to it, an HCP DataLad dataset would need to enable data retrieval from the original sources, conditional on the user agreeing to the HCP usage terms.
The DataLad Approach¶
Using the datalad addurls command, the HCP data release is aggregated into a large number (~4500) of datasets. A lean top-level dataset combines all datasets into a nested dataset hierarchy that recreates the original HCP data release’s structure. The topmost dataset contains one subdataset per subject with the subject’s release notes, and within each subject’s subdataset, each additional available subdirectory is another subdataset. This preserves the original structure of the HCP data release, but builds it up from sensible components that resemble standalone dataset units. As with any DataLad dataset, dataset nesting and operations across dataset boundaries are seamless, and make it easy to retrieve data at the level of subjects, modalities, or individual files.
The highly modular structure has several advantages. For one, with barely any data in the superdataset, the top-level dataset is very lean. It mainly consists of an impressive .gitmodules file [1] with almost 1200 registered (subject-level) subdatasets. The superdataset is published to GitHub at github.com/datalad-datasets/human-connectome-project-openaccess, where anyone can install it with a single datalad clone command in a few seconds.
Secondly, the modularity from splitting the data release into several thousand subdatasets has performance advantages. If Git or git-annex repositories exceed a certain size (either in terms of file sizes or the number of files), performance can drop severely [2]. Dividing the vast amount of data into many subdatasets prevents this: Subdatasets are small-sized units that are combined into the complete HCP dataset structure, and nesting comes with no additional costs or difficulties, as DataLad works smoothly across hierarchies of subdatasets.
In order to simplify access to the data without circumventing the HCP license agreement, DataLad does not host any HCP data. Instead, thanks to datalad addurls, each data file knows its source (the public AWS S3 bucket of the HCP project), and a datalad get will retrieve HCP data from this bucket. With this setup, anyone who wants to obtain the data still needs to consent to the data usage terms and retrieve AWS credentials from the HCP project, but can afterwards obtain the data solely with DataLad commands, from the command line or in scripts. Only the first datalad get requires authentication with the AWS credentials provided by the HCP project: DataLad will prompt any user at the time of retrieval of the first file content of the dataset. Afterwards, no further authentication is needed, unless the credentials become invalid or need to be updated for other reasons. Thus, in order to retrieve HCP data down to the level of individual files with DataLad, users only need to take three steps (a short sketch of the resulting workflow follows the list):
datalad clone the superdataset from GitHub (github.com/datalad-datasets/human-connectome-project-openaccess)
Create an account at http://db.humanconnectome.org to accept data use terms and obtain AWS credentials
Use datalad get [-n] [-r] PATH to retrieve file, directory, or subdataset contents on demand. Authentication is necessary only once (at the time of the first datalad get).
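A minimal sketch of this workflow from a user’s perspective (the subject ID and file path below are only examples, taken from the dataset excerpts shown later in this section):

$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git hcp
$ cd hcp
# the first retrieval of file content will prompt for the HCP AWS credentials
$ datalad get HCP1200/100206/release-notes/ReleaseNotes.txt
# install a subject's subdataset hierarchy without downloading file content
$ datalad get -n -r HCP1200/100206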
The HCP data release, despite its large size, can thus be version controlled and easily distributed with DataLad. In order to speed up data retrieval, subdataset installation can be parallelized, and the full HCP dataset can be subsampled into special-purpose datasets using DataLad’s copy-file command (introduced with DataLad version 0.13.0).
Step-by-Step¶
Building and publishing a DataLad dataset with HCP data consists of several steps: 1) Creating all necessary datasets, 2) publishing them to a RIA store, and 3) creating an access point to all files in the HCP data release. The upcoming subsections detail each of these.
Dataset creation with datalad addurls¶
The datalad addurls command (datalad-addurls manual) allows you to create (and update) potentially nested DataLad datasets from a list of download URLs that point to the HCP files in the S3 bucket. By supplying subject-specific .csv files that contain, per file in the HCP dataset, an S3 download link, a subject ID, a file name, and a version specification, as well as information on where subdataset boundaries are, datalad addurls can download all subjects’ files and create (nested) datasets to store them in. With the help of a few bash commands, this task can be automated, and with the help of a job scheduler, it can also be parallelized.
As soon as files are downloaded and saved to a dataset, their content can be
dropped with datalad drop: The origin of the file was successfully
recorded, and a datalad get can now retrieve file contents on demand.
Thus, shortly after a complete download of the HCP project data, the datasets in
which it has been aggregated are small in size, and yet provide access to the HCP
data for anyone who has valid AWS S3 credentials.
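As a brief illustration of this drop-and-re-get cycle (the subject ID and file path are only examples; the exact drop invocation used during dataset creation is shown further below):

# remove local file content; the recorded S3 sources stay in place
$ datalad drop -d 100206 -r --nocheck
# content can later be re-obtained on demand (requires valid HCP credentials)
$ datalad get 100206/release-notes/ReleaseNotes.txt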
At the end of this step, there is one nested dataset per subject in the HCP data release. If you are interested in the details of this process, check out the hidden section below.
How exactly did the datasets come to be?
All code and tables necessary to generate the HCP datasets can be found on GitHub at github.com/TobiasKadelka/build_hcp.
The datalad addurls command is capable of building all necessary nested subject datasets automatically; it only needs an appropriate specification of its tasks. We’ll approach the function of datalad addurls and how exactly it was invoked to build the HCP dataset by looking at the information it needs. Below are excerpts of the .csv table of one subject (100206) that illustrate how addurls works:
"original_url","subject","filename","version"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc.txt","100206","release-notes/Structural_preproc.txt","OeUYjysiX5zR7nRMixCimFa_6yQ3IKqf"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc_extended.txt","100206","release-notes/Structural_preproc_extended.txt","cyP8G5_YX5F30gO9Yrpk8TADhkLltrNV"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_unproc.txt","100206","release-notes/Structural_unproc.txt","AyW6GmavML6I7LfbULVmtGIwRGpFmfPZ"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec","100206","MNINonLinear//100206.164k_fs_LR.wb.spec","JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","sP4uw8R1oJyqCWeInSd9jmOBjfOCtN4D"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","yD88c.HfsFwjyNXHQQv2SymGIsSYHQVZ"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii","100206","MNINonLinear
The .csv table contains one row per file, with the columns original_url, subject, filename, and version. original_url is an S3 URL pointing to an individual file in the S3 bucket, subject is the subject’s ID (here: 100206), filename is the path the file will have within the dataset that is about to be built, and version is an S3-specific file version identifier.
The first table excerpt thus specifies a few files in the directory release-notes in the dataset of subject 100206. For datalad addurls, the column headers serve as placeholders for the fields in each row. If this table excerpt is given to a datalad addurls call as shown below, it will create a dataset and download and save the precise version of each file in it:
$ datalad addurls -d <Subject-ID> <TABLE> '{original_url}?versionId={version}' '{filename}'
This command translates to “create a dataset with the name of the subject ID (-d <Subject-ID>) and use the provided table (<TABLE>) to assemble the dataset contents. Iterate through the table rows, and perform one download per row. Generate the download URL from the original_url and version fields of the table ('{original_url}?versionId={version}'), and save the downloaded file under the name specified in the filename field ('{filename}')”.
If a file name contains a double slash (//), as for example in "MNINonLinear//..." in the second table excerpt, the file will be created underneath a subdataset named after the part in front of the double slash. The rows in the second table excerpt thus translate to “save these files into the subdataset MNINonLinear, and if this subdataset does not exist, create it”.
Thus, with a single subject’s table, a nested, subject-specific dataset is built. Here is how the directory hierarchy looks for this particular subject once datalad addurls has worked through its table:
100206
├── MNINonLinear <- subdataset
├── release-notes
├── T1w <- subdataset
└── unprocessed <- subdataset
This is all there is to assembling subject-specific datasets. The interesting question is: How can this be done in as automated a fashion as possible?
How to create subject-specific tables
One crucial part of the process is the subject-specific tables for datalad addurls. The information on each file’s URL, name, and version can be queried with the datalad ls command (datalad-ls manual). It is a DataLad-specific version of the Unix ls command and can be used to list summary information about S3 URLs and datasets. With this command, the public S3 bucket can be queried, and it will output the relevant information.
The datalad ls command is a rather old command and less user-friendly than other commands demonstrated in the handbook. One problem for automation is that the command is made for interactive use, and it outputs information in a non-structured fashion. In order to retrieve the relevant information, a custom Python script was used to split its output and extract it. This script can be found in the GitHub repository as code/create_subject_table.py.
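To make the shape of these tables more concrete, here is a minimal, hypothetical sketch (not the actual script from the repository) that turns an already extracted listing of "<s3-url> <version-id>" pairs for one subject into an addurls table, marking the top-level directories that should become subdatasets with the double-slash convention described above (the input file name and the list of subdataset directories are assumptions):

subj=100206
printf '"original_url","subject","filename","version"\n' > "${subj}.csv"
while read -r url version; do
    # path of the file inside the subject dataset
    fname=${url#s3://hcp-openaccess/HCP_1200/${subj}/}
    # insert '//' after top-level directories that should become subdatasets
    case "$fname" in
        MNINonLinear/*|T1w/*|unprocessed/*)
            fname="${fname%%/*}//${fname#*/}" ;;
    esac
    printf '"%s","%s","%s","%s"\n' "$url" "$subj" "$fname" "$version" >> "${subj}.csv"
done < "${subj}_s3_listing.txt"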
How to schedule datalad addurls commands for all tables
Once the subject-specific tables exist, datalad addurls can start to aggregate the files into datasets. To make this efficient, the work can be parallelized with a job scheduler. On the computer cluster on which the datasets were aggregated, this was HTCondor.
The jobs (per subject) performed by HTCondor consisted of:
a datalad addurls command to generate the (nested) dataset and retrieve content once [3]:
datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" '{original_url}?versionId={version}' '{filename}'
a subsequent datalad drop command to remove file contents as soon as they were saved to the dataset, in order to save disk space (this is possible since the S3 source of each file is known, and content can be re-obtained using datalad get):
datalad drop -d "$outds" -r --nocheck
a few (Git) commands to clean up afterwards, as the system the HCP dataset was downloaded to had a strict 5TB limit on disk usage. A sketch of a per-subject job script combining these steps is shown below.
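A rough sketch of what such a per-subject job script could look like (the variable values and table location are assumptions; the two DataLad calls are the ones listed above, and the final cleanup commands are omitted):

#!/bin/bash
# run by the scheduler once per subject ID
subj=$1                        # e.g. 100206
subj_table=tables/${subj}.csv  # assumed location of the addurls table
outds=${subj}                  # the subject dataset to be created

# build the nested subject dataset and download its content from S3
datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" \
    '{original_url}?versionId={version}' '{filename}'

# immediately drop the file content again to stay within the disk quota
datalad drop -d "$outds" -r --nocheck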
Summary
Thus, in order to download the complete HCP project and aggregate it into
nested subject level datasets (on a system with much less disk space than the
complete HCP project’s size!), only two DataLad commands, one custom configuration,
and some scripts to parse terminal output into .csv
tables and create
subject-wise HTCondor jobs were necessary. With all tables set up, the jobs
ran over the Christmas break and finished before everyone went back to work.
Getting 15 million files into datasets? Check!
Using a Remote Indexed Archive Store for dataset hosting¶
All datasets were built on a scientific compute cluster. In this location, however, the datasets would only be accessible to users with an account on that system. Therefore, everything was subsequently published with datalad push to the publicly available store.datalad.org, a remote indexed archive (RIA) store.
A RIA store is a flexible and scalable data storage solution for DataLad datasets.
While its layout may look confusing if one were to take a look at it, a RIA store
is nothing but a clever storage solution, and users never consciously interact
with the store to get the HCP datasets.
On the lowest level, store.datalad.org is a directory on a publicly accessible server that holds a great number of datasets stored as bare Git repositories. The only important aspect of it for this use case is that datasets are stored and identified not by their names (e.g., 100206), but by their dataset ID. The datalad clone command can understand this layout and install datasets from a RIA store based on their ID.
What would a datalad clone from a RIA store look like?
In order to get a dataset from a RIA store, datalad clone needs a RIA URL. It is built from the following components:
a ria+ identifier
a path or URL to the store in question. For store.datalad.org, this is http://store.datalad.org, but it could also be an SSH URL, such as ssh://juseless.inm7.de/data/group/psyinf/dataset_store
a pound sign (#)
the dataset ID
and, optionally, a version or branch specification (appended with a leading @)
Here is what a valid datalad clone command from the data store for one dataset would look like:
datalad clone 'ria+http://store.datalad.org#d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561' subj-01
But worry not! To get the HCP data, no-one will ever need to compose clone commands to RIA stores apart from DataLad itself.
A RIA store is used because, among other advantages, its layout makes the store flexible and scalable. With datasets of the size of the HCP project, scalability in particular becomes an important factor. If you are interested in finding out why, you can find more technical details on RIA stores, their advantages, and even how to create and use one yourself in the section Remote Indexed Archives for dataset storage and backup.
Making the datasets accessible¶
At this point, roughly 1200 nested (subject-level) datasets were created and published to a publicly accessible RIA store. This modularized the HCP dataset and prevented performance issues that would arise in oversized datasets. In order to make the complete dataset available and accessible from one central point, the only thing missing is a single superdataset.
For this, a new dataset, human-connectome-project-openaccess, was created. It contains a README file with short instructions on how to use it, a text-based copy of the HCP project’s data usage agreement, and each subject dataset as a subdataset. The .gitmodules file [1] of this superdataset is thus impressive. Here is an excerpt:
[submodule "100206"]
path = HCP1200/100206
url = ./HCP1200/100206
branch = master
datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
[submodule "100307"]
path = HCP1200/100307
url = ./HCP1200/100307
branch = master
datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
[submodule "100408"]
path = HCP1200/100408
url = ./HCP1200/100408
branch = master
datalad-id = d3fa72e4-2c2b-11ea-948f-0025904abcb0
[...]
For each subdataset (named after a subject ID), there is one entry. Note that the individual url entries of the subdatasets are pointless and not needed: As will be demonstrated shortly, DataLad resolves each subdataset from the common store via its dataset ID automatically.
Thus, this superdataset combines all individual datasets into the original HCP dataset structure. This (and only this) superdataset is published to a public GitHub repository that anyone can datalad clone [4].
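To see this wiring for yourself after cloning the superdataset (an illustrative sketch, not a required step), the recorded dataset ID of any subject dataset can be read straight from .gitmodules, and all registered subdatasets can be listed with DataLad:

$ git config -f .gitmodules --get submodule.100206.datalad-id
346a3ae0-2c2e-11ea-a27d-002590496000
$ datalad subdatasets    # reports every registered subject dataset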
Data retrieval and interacting with the repository¶
Procedurally, getting data from this dataset is almost as simple as with any other public DataLad dataset: One needs to clone the repository and use datalad get [-n] [-r] PATH to retrieve any file, directory, or subdataset (content). But because the data will be downloaded from the HCP’s AWS S3 bucket, users will need to create an account at db.humanconnectome.org to agree to the project’s data usage terms and get credentials. When performing the first datalad get for file contents, DataLad will prompt for these credentials interactively from the terminal. Once supplied, all subsequent get commands will retrieve data right away.
Resetting AWS credentials
In case one enters their AWS credentials incorrectly or needs to reset them, this can easily be done using the Python keyring package. For more information on keyring and DataLad’s authentication process, see the Basic process section in Configure custom data access.
After launching Python, import the keyring package and use the set_password() function. This function takes 3 arguments:
system: "datalad-hcp-s3" in this case
username: "key_id" if modifying the AWS access key ID, or "secret_id" if modifying the secret access key
password: the access key itself
import keyring
keyring.set_password("datalad-hcp-s3", "key_id", <password>)
keyring.set_password("datalad-hcp-s3", "secret_id", <password>)
Alternatively, one can set their credentials using environment variables. For more details on this method, see this Findoutmore.
$ export DATALAD_hcp_s3_key_id=<password>
$ export DATALAD_hcp_s3_secret_id=<password>
Internally, DataLad cleverly manages the crucial aspects of data retrieval: linking registered subdatasets to the correct dataset in the RIA store. If you inspect the GitHub repository, you will find that the subdataset links in it will not resolve if you click on them, because none of the subdatasets were published to GitHub [5]; they lie in the RIA store instead. Dataset and file content retrieval will nevertheless work automatically with datalad get: Each .gitmodules entry lists the subdataset’s dataset ID. Based on a “subdataset-source-candidate” configuration in .datalad/config of the superdataset, the subdataset ID is assembled into a RIA URL from which get retrieves the correct dataset from the store:
$ cat .datalad/config
[datalad "dataset"]
id = 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
[datalad "get"]
subdataset-source-candidate-origin = "ria+http://store.datalad.org#{id}"
This configuration allows get to flexibly generate RIA URLs from the base URL in the config file and the dataset IDs listed in .gitmodules. In the superdataset, this configuration needed to be done “by hand” via the git config command. Because it should be shared together with the dataset, it needed to be set in .datalad/config [6]:
$ git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+http://store.datalad.org#{id}"
With this configuration, get will retrieve all subdatasets from the
RIA store. Any subdataset that is obtained from a RIA store in turn gets the very
same configuration automatically into .git/config
. Thus, the configuration
that makes seamless subdataset retrieval from RIA stores possible is propagated
throughout the dataset hierarchy.
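A small sketch of how one could observe this mechanism (illustrative only; it assumes the propagated key carries the same name as in the superdataset’s configuration shown above):

# the superdataset ships the URL template for subdataset retrieval ...
$ git config -f .datalad/config --get datalad.get.subdataset-source-candidate-origin
ria+http://store.datalad.org#{id}
# ... and after installing a subject dataset from the store, the same
# template should be available in the subdataset's local configuration
$ datalad get -n HCP1200/100206
$ git -C HCP1200/100206 config --get datalad.get.subdataset-source-candidate-origin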
With this in place, anyone can clone the topmost dataset from GitHub and – given they have valid credentials – get any file in the HCP dataset hierarchy.
Speeding operations up¶
At this point in time, the HCP dataset is a single published superdataset with ~4500 subdatasets that are hosted in a remote indexed archive (RIA) store at store.datalad.org. This makes the HCP data accessible via DataLad and simplifies its download. One downside to gigantic nested datasets like this one, though, is the time it takes to retrieve all of it. Some tricks can help to mitigate this: Contents can either be retrieved in parallel, or, if subsets of the data are needed routinely, subsampled datasets can be created with datalad copy-file.
If the complete HCP dataset is required, subdataset installation and data retrieval can be sped up by parallelization. The gists Parallelize subdataset processing and Retrieve partial content from a hierarchy of (uninstalled) datasets shed some light on how to do this, and a small sketch follows below. If you are interested in learning about datalad copy-file, check out the section Subsample datasets using datalad copy-file.
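As a rough, hypothetical sketch of such a parallelization (the subject directories exist as empty mountpoints right after cloning the superdataset; the number of parallel jobs is arbitrary):

# install all subject subdatasets (without file content) with 8 workers
$ ls HCP1200 | xargs -P 8 -I {} datalad get -n HCP1200/{}
# alternatively, recent DataLad versions can parallelize within a single call
$ datalad get -n -r -J 8 HCP1200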
Summary¶
This use case demonstrated how it is possible to version control and distribute datasets of sizes that would otherwise be unmanageably large for version control systems. With the public HCP dataset available as a DataLad dataset, data access is simplified, data analyses that use the HCP data can link it (in precise versions) to their scripts and even share it, and the complete HCP release can be stored at a fraction of its total size for on-demand retrieval.
Footnotes
[1] If you want to read up on how DataLad stores information about registered subdatasets in .gitmodules, check out the section More on DIY configurations.
[2] Precise performance will always depend on the details of the repository, software setup, and hardware, but to get a feeling for the possible performance issues in oversized datasets, imagine a mere git status or datalad status command taking several minutes up to hours in a clean dataset.
[3] Note that this command is more complex than the previously shown datalad addurls command. In particular, it has an additional log-level configuration for the main command, and it creates the datasets with an hcp_dataset configuration. The logging level was set (to warning) to help with post-execution diagnostics in the HTCondor log files. The configuration can be found in code/cfg_hcp_dataset and enables a special remote in the resulting dataset.
[4] To re-read about publishing datasets to hosting services such as GitHub or GitLab, go back to Publishing the dataset to GitHub.
[5] If you coded along in the Basics part of the book and published your dataset to Gin, you have seen in Subdataset publishing how links to unpublished subdatasets in a published dataset do not resolve in the web interface: The link points to a location underneath the superdataset’s URL, but no subdataset has been published there.
[6] To re-read about configurations of datasets, go back to the sections DIY configurations and More on DIY configurations.