Scaling up: Managing 80TB and 15 million files from the HCP release

This usecase outlines how a large data collection can be version controlled and published in an accessible manner with DataLad in a remote indexed archive (RIA) data store. Using the Human Connectome Project (HCP) data as an example, it shows how large-scale datasets can be managed with the help of modular nesting, and how access to data that is contingent on usage agreements and external service credentials is possible via DataLad without circumventing or breaching the data provider's terms:

  1. The datalad addurls command is used to automatically aggregate files and information about their sources from public AWS S3 bucket storage into small-sized, modular DataLad datasets.

  2. Modular datasets are structured into a hierarchy of nested datasets, with a single HCP superdataset at the top. This modularizes storage and access, and mitigates performance problems that would arise in oversized standalone datasets, while maintaining access to any subdataset from the top-level dataset.

  3. Individual datasets are stored in a remote indexed archive (RIA) store at store.datalad.org under their dataset ID. This setup constitutes a flexible, domain-agnostic, and scalable storage solution, while dataset configurations enable seamless automatic dataset retrieval from the store.

  4. The top-level dataset is published to GitHub as a public access point for the full HCP dataset. As the RIA store contains datasets with only file source information instead of hosting data contents, a datalad get retrieves file contents from the original AWS S3 sources.

  5. With DataLad's authentication management, users authenticate only once (and are thus required to accept the HCP project's terms to obtain valid credentials); subsequent datalad get commands work swiftly without renewed login.

  6. The datalad copy-file command can be used to subsample special-purpose datasets for faster access.

The Challenge

The Human Connectome Project aims to provide an unparalleled compilation of neural data through a customized database. Its largest open access data collection is the WU-Minn HCP1200 Data. It is made available via a public AWS S3 bucket and includes high-resolution 3T magnetic resonance scans from young healthy adult twins and non-twin siblings (ages 22-35) using four imaging modalities: structural images (T1w and T2w), resting-state fMRI (rfMRI), task-fMRI (tfMRI), and high angular resolution diffusion imaging (dMRI). It further includes behavioral and other individual subject measures for all subjects, as well as magnetoencephalography data and 7T MR data for a subset of subjects (twin pairs). In total, the data release encompasses around 80TB of data in 15 million files, and is of immense value to the field of neuroscience.

Its large amount of data, however, also constitutes a data management challenge: Such amounts of data are difficult to store, structure, access, and version control. Even tools such as DataLad, and its foundations, Git and git-annex, will struggle or fail with datasets of this size or number of files. Simply transforming the complete data release into a single DataLad dataset would at best lead to severe performance issues, but quite likely result in software errors and crashes. Moreover, access to the HCP data is contingent on consent to the data usage agreement of the HCP project and requires valid AWS S3 credentials. Instead of hosting this data or providing otherwise unrestrained access to it, an HCP DataLad dataset would need to enable data retrieval from the original sources, conditional on the user agreeing to the HCP usage terms.

The DataLad Approach

Using the datalad addurls command, the HCP data release is aggregated into a large number (N ≈ 4500) of datasets. A lean top-level dataset combines all datasets into a nested dataset hierarchy that recreates the original HCP data release's structure. The topmost dataset contains one subdataset per subject with the subject's release notes, and within each subject's subdataset, each additional available subdirectory is another subdataset. This preserves the original structure of the HCP data release, but builds it up from sensible components that resemble standalone dataset units. As with any DataLad dataset, dataset nesting and operations across dataset boundaries are seamless, and make it easy to retrieve data at the subject, modality, or single-file level.

The highly modular structure has several advantages. For one, with barely any data in the superdataset, the top-level dataset is very lean. It mainly consists of an impressive .gitmodules file [1] with almost 1200 registered (subject-level) subdatasets. The superdataset is published to GitHub at github.com/datalad-datasets/human-connectome-project-openaccess, allowing anyone to install it with a single datalad clone command in a few seconds. Secondly, the modularity from splitting the data release into several thousand subdatasets has performance advantages. If Git or git-annex repositories exceed a certain size (either in terms of file sizes or the number of files), performance can drop severely [2]. Dividing the vast amount of data into many subdatasets prevents this: subdatasets are small-sized units that are combined into the complete HCP dataset structure, and nesting comes with no additional costs or difficulties, as DataLad works smoothly across hierarchies of subdatasets.

In order to simplify data access without circumventing the HCP license agreement, DataLad does not host any HCP data. Instead, thanks to datalad addurls, each data file knows its source (the public AWS S3 bucket of the HCP project), and a datalad get will retrieve HCP data from this bucket. With this setup, anyone who wants to obtain the data still needs to consent to the data usage terms and retrieve AWS credentials from the HCP project, but can afterwards obtain the data solely with DataLad commands from the command line or in scripts. Only the first datalad get requires authentication with the AWS credentials provided by the HCP project: DataLad will prompt the user at the time of retrieval of the first file content of the dataset. Afterwards, no further authentication is needed, unless the credentials become invalid or need to be updated for other reasons. Thus, in order to retrieve HCP data down to the level of individual files, users only need to obtain credentials from the HCP project once, clone the superdataset, and run datalad get on the content they need.

The HCP data release, despite its large size, can thus be version controlled and easily distributed with DataLad. In order to speed up data retrieval, subdataset installation can be parallelized, and the full HCP dataset can be subsampled into special-purpose datasets using DataLad's copy-file command (introduced with DataLad version 0.13.0).

Step-by-Step

Building and publishing a DataLad dataset with HCP data consists of several steps: 1) creating all necessary datasets, 2) publishing them to a RIA store, and 3) creating an access point to all files in the HCP data release. The upcoming subsections detail each of these.

Dataset creation with datalad addurls

The datalad addurls command (datalad-addurls manual) allows you to create (and update) potentially nested DataLad datasets from a list of download URLs that point to the HCP files in the S3 bucket. By supplying subject-specific .csv files that contain an S3 download link, a subject ID, a file name, and a version specification per file in the HCP dataset, as well as information on where subdataset boundaries are, datalad addurls can download all subjects' files and create (nested) datasets to store them in. With the help of a few bash commands, this task can be automated, and with the help of a job scheduler, it can also be parallelized. As soon as files are downloaded and saved to a dataset, their content can be dropped with datalad drop: each file's origin has been recorded, and a datalad get can retrieve file contents on demand. Thus, shortly after a complete download of the HCP project data, the datasets in which it has been aggregated are small in size, and yet provide access to the HCP data for anyone who has valid AWS S3 credentials.

At the end of this step, there is one nested dataset per subject in the HCP data release. If you are interested in the details of this process, check out the hidden section below.

How exactly did the datasets come to be?

Note

All code and tables necessary to generate the HCP datasets can be found on GitHub at github.com/TobiasKadelka/build_hcp.

The datalad addurls command is capable of building all necessary nested subject datasets automatically; it only needs an appropriate specification of its tasks. We'll approach how datalad addurls works and how exactly it was invoked to build the HCP datasets by looking at the information it needs. Below are excerpts of the .csv table of one subject (100206) that illustrate how addurls works:

Listing 1 Table header and some of the release note files
"original_url","subject","filename","version"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Diffusion_unproc.txt","100206","release-notes/Diffusion_unproc.txt","j9bm9Jvph3EzC0t9Jl51KVrq6NFuoznu"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/ReleaseNotes.txt","100206","release-notes/ReleaseNotes.txt","RgG.VC2mzp5xIc6ZGN6vB7iZ0mG7peXN"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc.txt","100206","release-notes/Structural_preproc.txt","OeUYjysiX5zR7nRMixCimFa_6yQ3IKqf"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_preproc_extended.txt","100206","release-notes/Structural_preproc_extended.txt","cyP8G5_YX5F30gO9Yrpk8TADhkLltrNV"
"s3://hcp-openaccess/HCP_1200/100206/release-notes/Structural_unproc.txt","100206","release-notes/Structural_unproc.txt","AyW6GmavML6I7LfbULVmtGIwRGpFmfPZ"
Listing 2 Some files in the MNINonLinear directory
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.164k_fs_LR.wb.spec","100206","MNINonLinear//100206.164k_fs_LR.wb.spec","JSZJhZekZnMhv1sDWih.khEVUNZXMHTE"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii","sP4uw8R1oJyqCWeInSd9jmOBjfOCtN4D"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","100206","MNINonLinear//100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii","yD88c.HfsFwjyNXHQQv2SymGIsSYHQVZ"
"s3://hcp-openaccess/HCP_1200/100206/MNINonLinear/100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii","100206","MNINonLinear

The .csv table contains one row per file, and includes the columns original_url, subject, filename, and version. original_url is an S3 URL pointing to an individual file in the S3 bucket, subject is the subject's ID (here: 100206), filename is the path to the file within the dataset that will be built, and version is an S3-specific file version identifier. The first table excerpt thus specifies a few files in the directory release-notes in the dataset of subject 100206. For datalad addurls, the column headers serve as placeholders for fields in each row. If this table excerpt is given to a datalad addurls call as shown below, it will create a dataset and download and save the precise version of each file in it:

$ datalad addurls -d <Subject-ID> <TABLE> '{original_url}?versionId={version}' '{filename}'

This command translates to: "create a dataset with the name of the subject ID (-d <Subject-ID>) and use the provided table (<TABLE>) to assemble the dataset contents. Iterate through the table rows, and perform one download per row. Generate the download URL from the original_url and version fields of the table ('{original_url}?versionId={version}'), and save the downloaded file under the name specified in the filename field ('{filename}')".

If a file name contains a double slash (//), as seen for example in the second table excerpt ("MNINonLinear//..."), this file will be created underneath a subdataset whose name is given in front of the double slash. The rows in the second table excerpt thus translate to "save these files into the subdataset MNINonLinear, and if this subdataset does not exist, create it".

Thus, with a single subject's table, a nested, subject-specific dataset is built. Here is how the directory hierarchy looks for this particular subject once datalad addurls has worked through its table:

100206
├── MNINonLinear     <- subdataset
├── release-notes
├── T1w              <- subdataset
└── unprocessed      <- subdataset

This is all there is to assembling subject-specific datasets. The interesting question is: how can this be automated as much as possible?

How to create subject-specific tables

One crucial part of the process is the subject-specific tables for datalad addurls. The information on the file URL, its name, and its version can be queried with the datalad ls command (datalad-ls manual). It is a DataLad-specific version of the Unix ls command and can be used to list summary information about S3 URLs and datasets. With it, the public S3 bucket can be queried, and the command outputs the relevant information.

Note

The datalad ls command is a rather old command and less user-friendly than other commands demonstrated in the handbook. One problem for automation is that the command is made for interactive use, and it outputs information in a non-structured fashion. In order to retrieve the relevant information, a custom Python script was used to split its output and extract it. This script can be found in the GitHub repository as code/create_subject_table.py.

How to schedule datalad addurls commands for all tables

Once the subject-specific tables exist, datalad addurls can start to aggregate the files into datasets. To do this efficiently, the work can be parallelized with a job scheduler; on the compute cluster on which the datasets were aggregated, this was HTCondor.

The jobs (one per subject) performed by HTCondor consisted of the following steps (a sketch of a minimal job script follows the list):

  • a datalad addurls command to generate the (nested) dataset and retrieve content once [3]:

    datalad -l warning addurls -d "$outds" -c hcp_dataset "$subj_table" '{original_url}?versionId={version}' '{filename}'
    
  • a subsequent datalad drop command to remove file contents as soon as they were saved to the dataset in order to save disk space (this is possible since the S3 source of each file is known, and content can be re-obtained with datalad get):

    datalad drop -d "$outds" -r --nocheck
    
  • a few (Git) commands to clean up afterwards, as the system the HCP dataset was downloaded to had a strict 5TB limit on disk usage.
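Combined, the first two steps can be wrapped into a small per-subject job script for HTCondor to execute. The following is only a sketch: the two datalad calls are the ones shown above, but the variable values and the location of the tables (here a hypothetical tables/ directory) depend on the actual setup, and the cleanup commands are omitted.

#!/bin/bash
# hypothetical per-subject job script; HTCondor runs it once per subject ID
set -e -u

subj="$1"                          # subject ID, e.g. 100206
subj_table="tables/${subj}.csv"    # subject-specific addurls table (assumed location)
outds="HCP1200/${subj}"            # dataset to be created for this subject

# build the nested subject dataset and download its content
datalad -l warning addurls -d "$outds" -c hcp_dataset \
    "$subj_table" '{original_url}?versionId={version}' '{filename}'

# drop the file contents again to stay within the disk quota; they remain
# retrievable from the S3 bucket via 'datalad get'
datalad drop -d "$outds" -r --nocheck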

Summary

Thus, in order to download the complete HCP project and aggregate it into nested subject level datasets (on a system with much less disk space than the complete HCP project’s size!), only two DataLad commands, one custom configuration, and some scripts to parse terminal output into .csv tables and create subject-wise HTCondor jobs were necessary. With all tables set up, the jobs ran over the Christmas break and finished before everyone went back to work. Getting 15 million files into datasets? Check!

Using a Remote Indexed Archive Store for dataset hosting

All datasets were built on a scientific compute cluster. There, however, the datasets would only be accessible to users with an account on that system. Therefore, everything was subsequently published with datalad push to the publicly available store.datalad.org, a remote indexed archive (RIA) store.

A RIA store is a flexible and scalable data storage solution for DataLad datasets. While its layout may look confusing if one were to take a look at it, a RIA store is nothing but a clever storage solution, and users never consciously interact with the store to get the HCP datasets. On the lowest level, store.datalad.org is a directory on a publicly accessible server that holds a great number of datasets stored as bare Git repositories. The only aspect important for this usecase is that datasets are stored and identified not by their names (e.g., 100206), but by their dataset ID. The datalad clone command can understand this layout and install datasets from a RIA store based on their ID.

What would a datalad clone from a RIA store look like?

In order to get a dataset from a RIA store, datalad clone needs a RIA URL. It is built from the following components:

  • a ria+ identifier

  • a path or URL to the store in question. For store.datalad.org, this is http://store.datalad.org, but it could also be an SSH URL, such as ssh://juseless.inm7.de/data/group/psyinf/dataset_store

  • a pound sign (#)

  • the dataset ID

  • and optionally a version or branch specification (appended with a leading @)

Here is what a valid datalad clone command from the data store would look like for one dataset:

datalad clone 'ria+http://store.datalad.org#d1ca308e-3d17-11ea-bf3b-f0d5bf7b5561' subj-01

But worry not! To get the HCP data, no one will ever need to compose clone commands to RIA stores apart from DataLad itself.

A RIA store is used because, among other advantages, its layout makes the store flexible and scalable. With datasets of sizes like the HCP project, scalability in particular becomes an important factor. If you are interested in finding out why, you can find more technical details on RIA stores, their advantages, and even how to create and use one yourself in the section Remote Indexed Archives for dataset storage and backup.

Making the datasets accessible

At this point, roughly 1200 nested datasets were created and published to a publicly accessible RIA store. This modularized the HCP dataset and prevented performance issues that would arise in oversized datasets. In order to make the complete dataset available and accessible from one central point, the only thing missing is a single superdataset.

For this, a new dataset, human-connectome-project-openaccess, was created. It contains a README file with short instructions on how to use it, a text-based copy of the HCP project's data usage agreement, and each subject dataset as a subdataset. The .gitmodules file [1] of this superdataset is thus impressive. Here is an excerpt:

[submodule "100206"]
    path = HCP1200/100206
    url = ./HCP1200/100206
    branch = master
    datalad-id = 346a3ae0-2c2e-11ea-a27d-002590496000
[submodule "100307"]
    path = HCP1200/100307
    url = ./HCP1200/100307
    branch = master
    datalad-id = a51b84fc-2c2d-11ea-9359-0025904abcb0
[submodule "100408"]
    path = HCP1200/100408
    url = ./HCP1200/100408
    branch = master
    datalad-id = d3fa72e4-2c2b-11ea-948f-0025904abcb0
[...]

For each subdataset (named after a subject ID), there is one entry (note that the individual URLs of the subdatasets are pointless and not needed: as will be demonstrated shortly, DataLad resolves each subdataset ID from the common store automatically). Thus, this superdataset combines all individual datasets into the original HCP dataset structure. This (and only this) superdataset is published to a public GitHub repository that anyone can datalad clone [4].

Data retrieval and interacting with the repository

Note

Using this dataset requires DataLad version 0.12.2 or higher. Upgrading an existing DataLad installation is detailed in section Installation and configuration.

Procedurally, getting data from this dataset is almost as simple as with any other public DataLad dataset: One needs to clone the repository and use datalad get [-n] [-r] PATH to retrieve any file, directory, or subdataset (content). But because the data will be downloaded from the HCP’s AWS S3 bucket, users will need to create an account at db.humanconnectome.org to agree to the project’s data usage terms and get credentials. When performing the first datalad get for file contents, DataLad will prompt for these credentials interactively from the terminal. Once supplied, all subsequent get commands will retrieve data right away.
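As a sketch, such a workflow could look like the following, using one subject dataset and one of its release note files as an arbitrary example (cloned here via the repository's HTTPS URL):

$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git hcp
$ cd hcp
# install one subject dataset without retrieving file content (-n)
$ datalad get -n HCP1200/100206
# retrieve a single file; the first content retrieval prompts for the
# HCP AWS credentials
$ datalad get HCP1200/100206/release-notes/ReleaseNotes.txt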

Resetting AWS credentials

In case one enters their AWS credentials incorrectly or needs to reset them, this can easily be done using the Python keyring package. For more information on keyring and DataLad's authentication process, see the Basic process section in Configure custom data access.

After launching Python, import the keyring package and use the set_password() function. This function takes 3 arguments:

  • system: “datalad-hcp-s3” in this case

  • username: “key_id” if modifying the AWS access key ID or “secret_id” if modifying the secret access key

  • password: the value of the credential itself (the access key ID or the secret access key, respectively)

import keyring

keyring.set_password("datalad-hcp-s3", "key_id", <password>)
keyring.set_password("datalad-hcp-s3", "secret_id", <password>)

Alternatively, one can set their credentials using environment variables:

$ export DATALAD_hcp_s3_key_id=<password>
$ export DATALAD_hcp_s3_secret_id=<password>

Internally, DataLad cleverly manages the crucial aspect of data retrieval: linking registered subdatasets to the correct datasets in the RIA store. If you inspect the GitHub repository, you will find that the subdataset links in it do not resolve if you click on them, because none of the subdatasets were published to GitHub [5], but lie in the RIA store instead. Dataset or file content retrieval will nevertheless work automatically with datalad get: each .gitmodules entry lists the subdataset's dataset ID. Based on a "subdataset-source-candidate" configuration in .datalad/config of the superdataset, get assembles the subdataset ID into a RIA URL and retrieves the correct dataset from the store:

 $ cat .datalad/config
 [datalad "dataset"]
     id = 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
 [datalad "get"]
     subdataset-source-candidate-origin = "ria+http://store.datalad.org#{id}"

This configuration allows get to flexibly generate RIA URLs from the base URL in the config file and the dataset IDs listed in .gitmodules. In the superdataset, this configuration needed to be set "by hand" via the git config command. Because the configuration should be shared together with the dataset, it was set in .datalad/config [6]:

$ git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+http://store.datalad.org#{id}"

With this configuration, get will retrieve all subdatasets from the RIA store. Any subdataset obtained from a RIA store in turn automatically gets the very same configuration into its .git/config. Thus, the configuration that makes seamless subdataset retrieval from RIA stores possible is propagated throughout the dataset hierarchy. With this in place, anyone can clone the topmost dataset from GitHub and, given valid credentials, get any file in the HCP dataset hierarchy.
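To see this propagation, one could, for example, inspect the local Git configuration of an installed subject dataset. This is only an illustration; the exact configuration key name recorded in a clone may differ between DataLad versions:

# inside a clone of the superdataset, after installing a subject dataset
$ cd HCP1200/100206
$ git config --get-regexp 'subdataset-source-candidate'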

Parallel operations and subsampled datasets using datalad copy-file

At this point in time, the HCP dataset is a single, published superdataset with ~4500 subdatasets that are hosted in a remote indexed archive (RIA) store at store.datalad.org. This makes the HCP data accessible via DataLad and simplifies its download. One downside to gigantic nested datasets like this one, though, is the time it takes to retrieve all of it. Some tricks can help to mitigate this: contents can either be retrieved in parallel, or, in the case of general need for subsets of the dataset, subsampled datasets can be created with datalad copy-file.

If the complete HCP dataset is required, subdataset installation and data retrieval can be sped up by parallelizing. The gists Parallelize subdataset processing and Retrieve partial content from a hierarchy of (uninstalled) datasets can shed some light on how to do this.
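As a minimal sketch (the gists above describe more robust variants), subdataset installation could be parallelized with nothing more than xargs, assuming a fresh clone in which each registered subject subdataset is represented by an (empty) directory, and running up to eight concurrent datalad get -n processes:

# inside a clone of the superdataset: install subject datasets in parallel,
# without retrieving any file content
$ ls -d HCP1200/* | xargs -n 1 -P 8 datalad get -n -r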

If there is a need for only a subset of files, it can be helpful to create or use special-purpose datasets that contain a subset of all available files, built with the datalad copy-file command (datalad-copy-file manual). Consider the following example: A large number of scientists need to access the HCP dataset for structural connectivity analyses. Should they all clone the complete superdataset, the installation of all subdatasets will take them around 90 minutes, if parallelized (and a complete night if performed serially). The files that they need, however, are only a handful of files per subject. In order to simplify their lives, a structural connectivity subset can be created as a singular dataset and published for easy access. The following findoutmore details how this is done.

Note

datalad copy-file requires DataLad version 0.13.0 or higher.

How to create subsampled datasets with datalad copy-file

For a structural connectivity subset of the HCP dataset, only eleven files per subject are relevant:

- <sub>/T1w/Diffusion/nodif_brain_mask.nii.gz
- <sub>/T1w/Diffusion/bvecs
- <sub>/T1w/Diffusion/bvals
- <sub>/T1w/Diffusion/data.nii.gz
- <sub>/T1w/Diffusion/grad_dev.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

To access these files in the full HCP dataset, one would need to install all subject subdatasets and each subject’s T1w and unprocessed subdatasets. In order to spare researchers the time and effort to install roughly 3500 subdatasets, a one-time effort can create a subsampled, single dataset of those files using the datalad copy-file command. The result of this can be found on GitHub at github.com/datalad-datasets/hcp-structural-connectivity.

datalad copy-file is able to copy files with their availability metadata into other datasets. The content of the files does not need to be retrieved in order to do this. Because the subset of relevant files is small, all structural connectivity related files can be copied into a single dataset. This speeds up installation significantly, and reduces the confusion that the concept of subdatasets can bring to DataLad novices. The result is a dataset with a subset of files (following the original directory structure of the HCP dataset), created reproducibly with complete provenance capture. Access to the files inside of the subsampled dataset works via valid AWS credentials just as it does for the full dataset. In order to understand how it was done for the dataset in question, the first findoutmore below starts by explaining the basics of datalad copy-file. The second then details the process that led to the finished subsampled dataset.

The Basics of copy-file

This short demonstration gives an overview of the functionality of datalad copy-file. Feel free to follow along by copy-pasting the commands into your terminal. Let's start by cloning a dataset to work with:

$ datalad clone git@github.com:datalad-datasets/human-connectome-project-openaccess.git hcp
[INFO] Cloning dataset to Dataset(/home/me/usecases/HCP/hcp) 
[INFO] Attempting to clone from git@github.com:datalad-datasets/human-connectome-project-openaccess.git to /home/me/usecases/HCP/hcp 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start receiving objects 
[INFO] Start resolving deltas 
[INFO] Completed clone attempts for Dataset(/home/me/usecases/HCP/hcp) 
install(ok): /home/me/usecases/HCP/hcp (dataset)

In order to use copy-file, we need to install a few subdatasets. We will install 9 subject subdatasets recursively. Note that we don't retrieve any data. (The output of this command is omitted; it is quite lengthy, as 36 subdatasets are being installed.)

$ cd hcp
$ datalad get -n -r HCP1200/130*
[INFO] Cloning dataset to Dataset(/home/me/usecases/HCP/hcp/HCP1200/130013) 

Afterwards, we can create a new dataset to copy any files into:

$ cd ..
$ datalad create dataset-to-copy-to
[INFO] Creating a new annex repo at /home/me/usecases/HCP/dataset-to-copy-to 
create(ok): /home/me/usecases/HCP/dataset-to-copy-to (dataset)

With the prerequisites set up, we can start to copy files. The command datalad copy-file works as follows: by providing a path to a file to be copied (which can be annexed, not annexed, or not version-controlled at all) and either a second path (the destination path), a target directory inside of a dataset, or a dataset specification, datalad copy-file copies the file and all of its availability metadata into the specified dataset. Let's copy a single file from the hcp dataset into dataset-to-copy-to:

$ datalad copy-file hcp/HCP1200/130013/T1w/Diffusion/bvals -d dataset-to-copy-to
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130013/T1w/Diffusion/bvals [/home/me/usecases/HCP/dataset-to-copy-to/bvals]
save(ok): . (dataset)
action summary:
  copy_file (ok: 1)
  save (ok: 1)

Providing the -d/--dataset argument instead of a target directory or a destination path leads to the file being saved in the new dataset. If a target directory or a destination path is given for a file, the copied file will not be saved:

$ datalad copy-file hcp/HCP1200/130013/T1w/Diffusion/bvecs -t dataset-to-copy-to
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130013/T1w/Diffusion/bvecs [/home/me/usecases/HCP/dataset-to-copy-to/bvecs]

Note how the file is added, but not saved afterwards:

$ cd dataset-to-copy-to
$ datalad status
    added: bvecs (file)

Providing a second path as a destination path allows one to copy the file under a different name, but it will also not save the new file in the destination dataset unless -d/--dataset is specified as well:

$ datalad copy-file hcp/HCP1200/130013/T1w/Diffusion/bvecs dataset-to-copy-to/anothercopyofbvecs
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130013/T1w/Diffusion/bvecs [/home/me/usecases/HCP/dataset-to-copy-to/anothercopyofbvecs]
$ cd dataset-to-copy-to
$ datalad status
    added: anothercopyofbvecs (file)
    added: bvecs (file)

Let’s save those two unsaved files:

$ datalad save
save(ok): . (dataset)

With the -r/--recursive flag enabled, the command can copy complete subdirectory (not subdataset!) hierarchies. Let's copy a complete directory:

$ cd ..
$ datalad copy-file hcp/HCP1200/130114/T1w/Diffusion/* -r \
  -d dataset-to-copy-to \
  -t dataset-to-copy-to/130114/T1w/Diffusion
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/bvals [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/bvals]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/bvecs [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/bvecs]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/data.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/data.nii.gz]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_stdev_map [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_stdev_map]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_map [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_map]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_report [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_report]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_movement_rms [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_movement_rms]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_restricted_movement_rms [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_restricted_movement_rms]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/grad_dev.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/grad_dev.nii.gz]
copy_file(ok): /home/me/usecases/HCP/hcp/HCP1200/130114/T1w/Diffusion/nodif_brain_mask.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130114/T1w/Diffusion/nodif_brain_mask.nii.gz]
save(ok): . (dataset)
action summary:
  copy_file (ok: 13)
  save (ok: 1)

Here is what the dataset that we copied files into looks like at the moment:

$ tree dataset-to-copy-to
dataset-to-copy-to
├── 130114
│   └── T1w
│       └── Diffusion
│           ├── bvals -> ../../../.git/annex/objects/w8/VX/MD5E-s1344--4c9ca43cc986f388bcf716b4ba7321cc/MD5E-s1344--4c9ca43cc986f388bcf716b4ba7321cc
│           ├── bvecs -> ../../../.git/annex/objects/61/80/MD5E-s9507--24793fb936e9e18419325af9b6152458/MD5E-s9507--24793fb936e9e18419325af9b6152458
│           ├── data.nii.gz -> ../../../.git/annex/objects/K0/mJ/MD5E-s1468805393--f8077751ddc2802a853d1199ff762a00.nii.gz/MD5E-s1468805393--f8077751ddc2802a853d1199ff762a00.nii.gz
│           ├── eddylogs
│           │   ├── eddy_unwarped_images.eddy_movement_rms -> ../../../../.git/annex/objects/xX/GF/MD5E-s15991--287c3e06ece5b883a862f79c478b7b69/MD5E-s15991--287c3e06ece5b883a862f79c478b7b69
│           │   ├── eddy_unwarped_images.eddy_outlier_map -> ../../../../.git/annex/objects/87/Xx/MD5E-s127363--919aed21eb51a77ca499cdc0a5560592/MD5E-s127363--919aed21eb51a77ca499cdc0a5560592
│           │   ├── eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map -> ../../../../.git/annex/objects/PP/GX/MD5E-s523738--1bd90e1e7a86b35695d8039599835435/MD5E-s523738--1bd90e1e7a86b35695d8039599835435
│           │   ├── eddy_unwarped_images.eddy_outlier_n_stdev_map -> ../../../../.git/annex/objects/qv/0F/MD5E-s520714--f995a46ec8ddaa5c7b33d71635844609/MD5E-s520714--f995a46ec8ddaa5c7b33d71635844609
│           │   ├── eddy_unwarped_images.eddy_outlier_report -> ../../../../.git/annex/objects/Xq/xV/MD5E-s10177--2934d2c7b316b86cde6d6d938bb3da37/MD5E-s10177--2934d2c7b316b86cde6d6d938bb3da37
│           │   ├── eddy_unwarped_images.eddy_parameters -> ../../../../.git/annex/objects/60/gf/MD5E-s141201--9a94e9fa805446ddb5ff8f76207fc1d2/MD5E-s141201--9a94e9fa805446ddb5ff8f76207fc1d2
│           │   ├── eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters -> ../../../../.git/annex/objects/kJ/0W/MD5E-s2171--c2e0deca2a5e84d119002032d87cd762/MD5E-s2171--c2e0deca2a5e84d119002032d87cd762
│           │   └── eddy_unwarped_images.eddy_restricted_movement_rms -> ../../../../.git/annex/objects/6K/X6/MD5E-s16134--5321d11df307f8452c8a5e92647ec73a/MD5E-s16134--5321d11df307f8452c8a5e92647ec73a
│           ├── grad_dev.nii.gz -> ../../../.git/annex/objects/zz/51/MD5E-s46820650--13be960cd99e48e21e25635d1390c1c5.nii.gz/MD5E-s46820650--13be960cd99e48e21e25635d1390c1c5.nii.gz
│           └── nodif_brain_mask.nii.gz -> ../../../.git/annex/objects/0Q/Kk/MD5E-s67280--9042713a11d557df58307ba85d51285a.nii.gz/MD5E-s67280--9042713a11d557df58307ba85d51285a.nii.gz
├── anothercopyofbvecs -> .git/annex/objects/X0/Vg/MD5E-s9507--f4cf263de8c3fb11f739467bf15e80ec/MD5E-s9507--f4cf263de8c3fb11f739467bf15e80ec
├── bvals -> .git/annex/objects/Fj/Wg/MD5E-s1344--843688799692be0ab485fe746e0f9241/MD5E-s1344--843688799692be0ab485fe746e0f9241
└── bvecs -> .git/annex/objects/X0/Vg/MD5E-s9507--f4cf263de8c3fb11f739467bf15e80ec/MD5E-s9507--f4cf263de8c3fb11f739467bf15e80ec

4 directories, 16 files

Importantly, none of the copied files had their contents retrieved yet. The copy-file process, however, also copied the files' availability metadata to their new location. Retrieving file contents works just as it would in the full HCP dataset via datalad get (the authentication step is omitted in the output below):

$ cd dataset-to-copy-to
$ datalad get bvals anothercopyofbvecs 130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters
get(ok): bvals (file) [from datalad...]
get(ok): anothercopyofbvecs (file) [from datalad...]
get(ok): 130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters (file) [from datalad...]
action summary:
  get (ok: 3)

What’s especially helpful for automation is that copy-file can take source and (optionally) destination paths from a file or from stdin with the option --specs-from <source>. In the case of specifications from a file, <source> is a path to this file.
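As a hypothetical sketch of the file-based variant (run from the directory that contains both the hcp clone and dataset-to-copy-to; the spec.txt file name and the chosen paths are merely illustrative), a spec file with one source path per line could be used like this:

$ cat > spec.txt << EOF
hcp/HCP1200/130013/T1w/Diffusion/nodif_brain_mask.nii.gz
hcp/HCP1200/130013/T1w/Diffusion/grad_dev.nii.gz
EOF
$ datalad copy-file -d dataset-to-copy-to --specs-from spec.txt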

In order to use stdin for specification, such as the output of a find command that is piped into datalad copy-file with a Unix pipe (|), <source> needs to be a dash (-). Below is an example find command:

$ cd hcp
$ find HCP1200/130013/T1w/ -maxdepth 1 -name T1w*.nii.gz
HCP1200/130013/T1w/T1w_acpc_dc.nii.gz
HCP1200/130013/T1w/T1w_acpc_dc_restore_1.25.nii.gz
HCP1200/130013/T1w/T1wDividedByT2w.nii.gz
HCP1200/130013/T1w/T1wDividedByT2w_ribbon.nii.gz
HCP1200/130013/T1w/T1w_acpc_dc_restore_brain.nii.gz
HCP1200/130013/T1w/T1w_acpc_dc_restore.nii.gz

Here is how the resulting paths can be given as source paths to datalad copy-file:

# inside of hcp
$ find HCP1200/130013/T1w/ -maxdepth 1 -name T1w*.nii.gz \
  | datalad copy-file -d ../dataset-to-copy-to --specs-from -
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1w_acpc_dc.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore_1.25.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1w_acpc_dc_restore_1.25.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1wDividedByT2w.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1wDividedByT2w.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1wDividedByT2w_ribbon.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1wDividedByT2w_ribbon.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore_brain.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1w_acpc_dc_restore_brain.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/T1w_acpc_dc_restore.nii.gz]
save(ok): . (dataset)
action summary:
  copy_file (ok: 6)
  save (ok: 1)

This copied all files into the root of dataset-to-copy-to:

$ ls ../dataset-to-copy-to
130114
anothercopyofbvecs
bvals
bvecs
T1w_acpc_dc.nii.gz
T1w_acpc_dc_restore_1.25.nii.gz
T1w_acpc_dc_restore_brain.nii.gz
T1w_acpc_dc_restore.nii.gz
T1wDividedByT2w.nii.gz
T1wDividedByT2w_ribbon.nii.gz

To preserve the directory structure, a target directory (-t ../dataset-to-copy-to/130013/T1w/) or a destination path could be given.
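For example, a hypothetical variant of the command above could keep the original layout by pointing -t at a matching directory inside the destination dataset:

# inside of hcp
$ find HCP1200/130013/T1w/ -maxdepth 1 -name T1w*.nii.gz \
  | datalad copy-file -d ../dataset-to-copy-to \
    -t ../dataset-to-copy-to/130013/T1w/ --specs-from -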

How to specify files with source and destination paths for --specs-from

To only specify source paths (i.e., paths to files or directories that should be copied), simply create a file, or use a command like find, that lists those files line by line.

Specifying source and destination paths comes with a twist: Source and destination paths need to go into the same line, but need to be separated by a nullbyte. One way this can be done is by using the stream editor sed. Here is how to pipe source and destination paths into datalad copy-file:

$ find HCP1200/130518/T1w/ -maxdepth 1 -name T1w*.nii.gz \
  | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
  | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -

As always, the regular expressions used for sed are a bit hard to grasp upon first sight. Here is what this command does:

  • In general, sed's s (substitute) command will take a string specified between the first set of #'s (\(HCP1200\)\(.*\)) and replace it with what is between the second and third # (\1\2\x0../dataset-to-copy-to\2).

  • The first part splits the paths find returns (such as HCP1200/130518/T1w/T1w_acpc_dc.nii.gz) into two groups:

    • The start of the path (HCP1200), and

    • the remaining path (/130518/T1w/T1w_acpc_dc.nii.gz).

  • The second part then prints the first and the second group (\1\2, the source path), a nullbyte (\x0), and a relative path to the destination dataset together with the second group only (../dataset-to-copy-to\2, the destination path).

Here is what the output of find piped into sed looks like:

$ find HCP1200/130518/T1w -maxdepth 1 -name T1w*.nii.gz \
 | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#'
HCP1200/130518/T1w/T1w_acpc_dc.nii.gz../dataset-to-copy-to/130518/T1w/T1w_acpc_dc.nii.gz
HCP1200/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz../dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz
HCP1200/130518/T1w/T1wDividedByT2w.nii.gz../dataset-to-copy-to/130518/T1w/T1wDividedByT2w.nii.gz
HCP1200/130518/T1w/T1wDividedByT2w_ribbon.nii.gz../dataset-to-copy-to/130518/T1w/T1wDividedByT2w_ribbon.nii.gz
HCP1200/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz../dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz
HCP1200/130518/T1w/T1w_acpc_dc_restore.nii.gz../dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore.nii.gz
HCP1200/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz../dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz

Note how the nullbyte is not visible to the naked eye in the output. To visualize it, you could redirect this output into a file and open it with an editor like vim. Let’s now see a copy-file from stdin in action:

$ find HCP1200/130518/T1w -maxdepth 1 -name T1w*.nii.gz \
 | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
 | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1w_acpc_dc.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1wDividedByT2w.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1wDividedByT2w.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1wDividedByT2w_ribbon.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1wDividedByT2w_ribbon.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz [/home/me/usecases/HCP/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz]
save(ok): . (dataset)
action summary:
  copy_file (ok: 7)
  save (ok: 1)

Now that you know the basics of datalad copy-file, the upcoming findoutmore on how the actual dataset was created will be much easier to understand.

Copying reproducibly

Note

You should have read the previous findoutmore!

To capture the provenance of subsampled dataset creation, the copy-file command can be wrapped into a datalad run call. Here is a sketch on how it was done:

Step 1: Create a dataset

$ datalad create hcp-structural-connectivity

Step 2: Install the full dataset as a subdataset

$ datalad clone -d . \
  git@github.com:datalad-datasets/human-connectome-project-openaccess.git \
  .hcp

Step 3: Install all subdatasets of the full dataset with datalad get -n -r
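In concrete terms, and possibly parallelized as sketched earlier, this step could amount to:

# inside of hcp-structural-connectivity
$ datalad get -n -r .hcp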

Step 4: Inside of the new dataset, draft a find command that returns all desired files, and a subsequent sed substitution command that returns a nullbyte separated source and destination path. For this subsampled dataset, this one would work:

$ find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
  -o -path '*/T1w/Diffusion/*' -name 'b*' \
  -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
 | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#'

Step 5: Pipe the results into datalad copy-file, and wrap everything into a datalad run. Note that -d/--dataset is not specified for copy-file – this way, datalad run will save everything in one go at the end:

$ datalad run \
  -m "Assemble HCP dataset subset for structural connectivity data. \


   Specifically, these are the files:

   - T1w/Diffusion/nodif_brain_mask.nii.gz
   - T1w/Diffusion/bvecs
   - T1w/Diffusion/bvals
   - T1w/Diffusion/data.nii.gz
   - T1w/Diffusion/grad_dev.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
   - unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

   for each participant. The structure of the directory tree and file names
   are kept identical to the full HCP dataset." \
        " find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
                 -o -path '*/T1w/Diffusion/*' -name 'b*' \
                 -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
         | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#' \
         | datalad copy-file -r --specs-from -"

Step 6: Publish the dataset to a RIA store and to GitHub or similar hosting services to allow others to clone it easily and get fast access to a subset of files.

Afterwards, the slimmed down structural connectivity dataset can be installed completely within seconds. Because of the reduced number of files it contains, it is easier to transform the data into BIDS format. Such a conversion can be done on a different branch of the dataset. Because RIA stores allow cloning of datasets in specific versions (such as a branch or tag as an identifier), a single command can clone a BIDS-ified, slimmed down HCP dataset for structural connectivity analyses:

$ datalad clone ria+http://store.datalad.org#~hcp-structural-connectivity@bids

Summary

This usecase demonstrated how it is possible to version control and distribute datasets of sizes that would otherwise be unmanageably large for version control systems. With the public HCP dataset available as a DataLad dataset, data access is simplified, data analyses that use the HCP data can link it (in precise versions) to their scripts and even share it, and the complete HCP release can be stored at a fraction of its total size for on-demand retrieval.

Footnotes

[1]

If you want to read up on how DataLad stores information about registered subdatasets in .gitmodules, check out the section More on DIY configurations.

[2]

Precise performance will always be dependent on the details of the repository, software setup, and hardware, but to get a feeling for the possible performance issues in oversized datasets, imagine a mere git status or datalad status command taking several minutes up to hours in a clean dataset.

[3]

Note that this command is more complex than the previously shown datalad addurls command. In particular, it has an additional loglevel configuration for the main command, and creates the datasets with an hcp_dataset configuration. The logging level was set (to warning) to help with post-execution diagnostics in the HTCondor log files. The configuration can be found in code/cfg_hcp_dataset and enables a special remote in the resulting dataset.

[4]

To re-read about publishing datasets to hosting services such as GitHub or GitLab, go back to Publishing the dataset to GitHub.

[5]

If you coded along in the Basics part of the book and published your dataset to Gin, you have experienced in Subdataset publishing how the links to unpublished subdatasets in a published dataset do not resolve in the web interface: their paths point to URLs underneath the superdataset, but no published subdataset exists there on the hosting platform!

[6]

To re-read on configurations of datasets, go back to sections DIY configurations and More on DIY configurations.