1.6. Subsample datasets using datalad copy-file

If there is a need for a dataset that contains only a subset of the files of one or more other datasets, it can be helpful to create subsampled, special-purpose datasets with the datalad copy-file (manual) command. This command can transfer files from different datasets, or from locations outside of a dataset, into a new dataset, unlocking them if necessary, and preserving and copying their availability information. As such, the command is a superior, albeit more technical, alternative to copying dereferenced files out of datasets.

This section demonstrates the command based on published data: a subset of the Human Connectome Project dataset that is subsampled for structural connectivity analysis. This dataset can be found on GitHub at github.com/datalad-datasets/hcp-structural-connectivity.

1.6.1. Copy-file in action with the HCP dataset

Consider a real-life example: A large number of scientists use the Human Connectome Project (HCP) dataset for structural connectivity analyses. This dataset contains data from more than 1000 subjects, and exceeds 80 million files. As such, as explained in more detail in the chapter Go big or go home, it is split up into a hierarchy of roughly 4500 subdatasets[1]. The installation of all subdatasets takes around 90 minutes if parallelized, and a complete night if performed serially. However, for a structural connectivity analysis, only eleven files per subject are relevant:

- <sub>/T1w/Diffusion/nodif_brain_mask.nii.gz
- <sub>/T1w/Diffusion/bvecs
- <sub>/T1w/Diffusion/bvals
- <sub>/T1w/Diffusion/data.nii.gz
- <sub>/T1w/Diffusion/grad_dev.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
- <sub>/unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

In order to spare others the time and effort to install thousands of subdatasets, a one-time effort can create and publish a subsampled, single dataset of those files using the datalad copy-file command.

datalad copy-file is able to copy files with their availability metadata into other datasets. The content of the files does not need to be retrieved in order to do this. Because the subset of relevant files is small, all structural connectivity related files can be copied into a single dataset. This speeds up the installation time significantly, and reduces the confusion that the concept of subdatasets can bring to DataLad novices. The result is a dataset with a subset of files (following the original directory structure of the HCP dataset), created reproducibly with complete provenance capture. Access to the files inside of the subsampled dataset works via valid AWS credentials just as it does for the full dataset[1].

The Basics of copy-file

This short demonstration gives an overview of the functionality of datalad copy-file. Feel free to follow along by copy-pasting the commands into your terminal. Let’s start by cloning a dataset to work with:

$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git hcp
install(ok): /home/me/beyond_basics/HPC/hcp (dataset)

In order to use datalad copy-file, we need to install a few subdatasets, and thus install 9 subject subdatasets recursively. Note that we don’t retrieve any data, using -n/--no-data. (The output of this command is omitted – it is quite lengthy as 36 subdatasets are being installed)

$ cd hcp
$ datalad get -n -r HCP1200/130*
install(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130013 (dataset) [Installed subdataset in order to get /home/me/beyond_basics/HPC/hcp/HCP1200/130013]

Afterwards, we can create a new dataset to copy any files into. This dataset will later hold the relevant subset of the data in the HCP dataset.

$ cd ..
$ datalad create dataset-to-copy-to
create(ok): /home/me/beyond_basics/HPC/dataset-to-copy-to (dataset)

With the prerequisites set up, we can start to copy files. The command datalad copy-file works as follows: By providing a path to a file to be copied (which can be annex’ed, not annex’ed, or not version-controlled at all) and either a second path (the destination path), a target directory inside of a dataset, or a dataset specification, datalad copy-file copies the file and all of its availability metadata into the specified dataset. Let’s copy a single file (hcp/HCP1200/130013/T1w/Diffusion/bvals) from the hcp dataset into dataset-to-copy-to:

$ datalad copy-file \
   hcp/HCP1200/130013/T1w/Diffusion/bvals  \
   -d dataset-to-copy-to
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130013/T1w/Diffusion/bvals [/home/me/beyond_basics/HPC/dataset-to-copy-to/bvals]
save(ok): . (dataset)

When the -d/--dataset argument is provided instead of a target directory or a destination path, the copied file will be saved in the new dataset. If a target directory or a destination path is given for a file, however, the copied file will not be saved:

$ datalad copy-file \
   hcp/HCP1200/130013/T1w/Diffusion/bvecs \
   -t dataset-to-copy-to
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130013/T1w/Diffusion/bvecs [/home/me/beyond_basics/HPC/dataset-to-copy-to/bvecs]

Note how, instead of specifying dataset-to-copy-to as a dataset, we specify it as a target path, and how the file is added, but not saved afterwards:

$ cd dataset-to-copy-to
$ datalad status
    added: bvecs (symlink)

Providing a second path as a destination path allows one to copy the file under a different name, but it will also not save the new file in the destination dataset unless -d/--dataset is specified as well:

$ cd ..
$ datalad copy-file \
   hcp/HCP1200/130013/T1w/Diffusion/bvecs \
   dataset-to-copy-to/anothercopyofbvecs
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130013/T1w/Diffusion/bvecs [/home/me/beyond_basics/HPC/dataset-to-copy-to/anothercopyofbvecs]
$ cd dataset-to-copy-to
$ datalad status
    added: anothercopyofbvecs (symlink)
    added: bvecs (symlink)

Those were the minimal basics of the command syntax: the original location, a specification of where the file should be copied to, and an indication of whether the file should be saved or not. Let’s save those two unsaved files:

$ datalad save
save(ok): . (dataset)

With the -r/--recursive flag enabled, the command can copy complete subdirectory (not subdataset!) hierarchies. Let’s copy a complete directory, and save it in its target dataset:

$ cd ..
$ datalad copy-file hcp/HCP1200/130114/T1w/Diffusion/* \
 -r \
 -d dataset-to-copy-to \
 -t dataset-to-copy-to/130114/T1w/Diffusion
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/bvals [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/bvals]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/bvecs [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/bvecs]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/data.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/data.nii.gz]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_movement_rms [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_movement_rms]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_restricted_movement_rms [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_restricted_movement_rms]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_report [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_report]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_map [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_map]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_stdev_map [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_outlier_n_stdev_map]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/grad_dev.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/grad_dev.nii.gz]
copy_file(ok): /home/me/beyond_basics/HPC/hcp/HCP1200/130114/T1w/Diffusion/nodif_brain_mask.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130114/T1w/Diffusion/nodif_brain_mask.nii.gz]
save(ok): . (dataset)

Here is what the dataset that we copied files into looks like at the moment:

$ tree dataset-to-copy-to
├── 130114
│   └── T1w
│       └── Diffusion
│           ├── bvals -> ../../../.git/annex/objects/w8/VX/✂/MD5E-s1344--4c9ca43c✂MD5
│           ├── bvecs -> ../../../.git/annex/objects/61/80/✂/MD5E-s9507--24793fb9✂MD5
│           ├── data.nii.gz -> ../../../.git/annex/objects/K0/mJ/✂/MD5E-s1468805393--f8077751✂MD5.nii.gz
│           ├── eddylogs
│           │   ├── eddy_unwarped_images.eddy_movement_rms -> ../../../../.git/annex/objects/xX/GF/✂/MD5E-s15991--287c3e06✂MD5
│           │   ├── eddy_unwarped_images.eddy_outlier_map -> ../../../../.git/annex/objects/87/Xx/✂/MD5E-s127363--919aed21✂MD5
│           │   ├── eddy_unwarped_images.eddy_outlier_n_sqr_stdev_map -> ../../../../.git/annex/objects/PP/GX/✂/MD5E-s523738--1bd90e1e✂MD5
│           │   ├── eddy_unwarped_images.eddy_outlier_n_stdev_map -> ../../../../.git/annex/objects/qv/0F/✂/MD5E-s520714--f995a46e✂MD5
│           │   ├── eddy_unwarped_images.eddy_outlier_report -> ../../../../.git/annex/objects/Xq/xV/✂/MD5E-s10177--2934d2c7✂MD5
│           │   ├── eddy_unwarped_images.eddy_parameters -> ../../../../.git/annex/objects/60/gf/✂/MD5E-s141201--9a94e9fa✂MD5
│           │   ├── eddy_unwarped_images.eddy_post_eddy_shell_alignment_parameters -> ../../../../.git/annex/objects/kJ/0W/✂/MD5E-s2171--c2e0deca✂MD5
│           │   └── eddy_unwarped_images.eddy_restricted_movement_rms -> ../../../../.git/annex/objects/6K/X6/✂/MD5E-s16134--5321d11d✂MD5
│           ├── grad_dev.nii.gz -> ../../../.git/annex/objects/zz/51/✂/MD5E-s46820650--13be960c✂MD5.nii.gz
│           └── nodif_brain_mask.nii.gz -> ../../../.git/annex/objects/0Q/Kk/✂/MD5E-s67280--9042713a✂MD5.nii.gz
├── anothercopyofbvecs -> .git/annex/objects/X0/Vg/✂/MD5E-s9507--f4cf263d✂MD5
├── bvals -> .git/annex/objects/Fj/Wg/✂/MD5E-s1344--84368879✂MD5
└── bvecs -> .git/annex/objects/X0/Vg/✂/MD5E-s9507--f4cf263d✂MD5

4 directories, 16 files

Importantly, none of the copied files had their contents retrieved beforehand. The copy-file process, however, also copied the files’ availability metadata to their new location. Retrieving file contents therefore works just as it would in the full HCP dataset via datalad get (manual) (the authentication step is omitted in the output below):

$ cd dataset-to-copy-to
$ datalad get bvals anothercopyofbvecs 130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters
get(ok): anothercopyofbvecs (file) [from datalad...]
get(ok): bvals (file) [from datalad...]
get(ok): 130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters (file) [from datalad...]

What’s especially helpful for automation of this operation is that datalad copy-file can take source and (optionally) destination paths from a file or from stdin with the option --specs-from <source>. In the case of specifications from a file, <source> is a path to this file.
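As a minimal sketch of the file-based variant, the snippet below writes the output of a find command into a specification file. The mock directory, its file names, and the file name specs.txt are assumptions made for illustration; they only mimic the layout of a subject directory so that the snippet runs without the HCP dataset being installed.

```shell
# Hypothetical mock layout standing in for an installed subject directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/HCP1200/130013/T1w"
touch "$workdir/HCP1200/130013/T1w/T1w_acpc_dc.nii.gz" \
      "$workdir/HCP1200/130013/T1w/T1w_acpc_dc_restore.nii.gz" \
      "$workdir/HCP1200/130013/T1w/T2w_acpc_dc.nii.gz"

# Write one source path per line into the specification file.
# Quoting the pattern keeps the shell from expanding it before find sees it.
(cd "$workdir" && find HCP1200/130013/T1w -maxdepth 1 -name 'T1w*.nii.gz' | sort) \
  > "$workdir/specs.txt"

specs=$(cat "$workdir/specs.txt")
n_specs=$(wc -l < "$workdir/specs.txt" | tr -d ' ')
printf '%s\n' "$specs"
echo "$n_specs source paths"

# In a real dataset, the file would then be handed to copy-file (not run here):
#   datalad copy-file -d ../dataset-to-copy-to --specs-from specs.txt
rm -rf "$workdir"
```

Writing the list to a file first makes it possible to inspect or edit the selection before anything is copied. The commented-out call assumes it is run from the directory that the relative paths in specs.txt refer to.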

In order to use stdin for specification, such as the output of a find command that is piped into datalad copy-file with a Unix pipe (|), <source> needs to be a dash (-). Below is an example find command:

$ cd hcp
$ find HCP1200/130013/T1w/ -maxdepth 1 -name 'T1w*.nii.gz'

This uses find to get a list of all files matching the specified pattern in the specified directory. And here is how the outputted paths can be given as source paths to datalad copy-file, copying all of the found files into a new dataset:

# inside of hcp
$ find HCP1200/130013/T1w/ -maxdepth 1 -name 'T1w*.nii.gz' \
  | datalad copy-file -d ../dataset-to-copy-to --specs-from -
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore_brain.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1w_acpc_dc_restore_brain.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1wDividedByT2w_ribbon.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1wDividedByT2w_ribbon.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore_1.25.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1w_acpc_dc_restore_1.25.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1w_acpc_dc.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1wDividedByT2w.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1wDividedByT2w.nii.gz]
copy_file(ok): HCP1200/130013/T1w/T1w_acpc_dc_restore.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/T1w_acpc_dc_restore.nii.gz]
save(ok): . (dataset)

To preserve the directory structure, a target directory (-t ../dataset-to-copy-to/130013/T1w/) or a destination path would have to be given. Without one, the above command copied all files into the root of dataset-to-copy-to:

$ ls ../dataset-to-copy-to

With this trick, you can use simple search commands to assemble a list of files as a <source> for datalad copy-file: simply create a file, or use a command like find, that specifies the relevant files or directories line by line. --specs-from can take information on both <source> and <destination>, though.

Specify files with source AND destination paths for --specs-from

Specifying source and destination paths comes with a twist: Source and destination paths need to go into the same line, but need to be separated by a nullbyte. This is not a straightforward concept, but trying it out and seeing it in action will help.
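To get a first look at such a line, the following sketch builds a single specification line with printf and inspects it. The two paths are example paths in the style of this section, and the temporary file exists only for illustration; nothing is copied.

```shell
# Example source and destination paths, for illustration only.
src='HCP1200/130518/T1w/T1w_acpc_dc.nii.gz'
dst='../dataset-to-copy-to/130518/T1w/T1w_acpc_dc.nii.gz'

spec_line=$(mktemp)
# printf interprets \0 in its format string as a null byte.
printf '%s\0%s\n' "$src" "$dst" > "$spec_line"

# Printed as-is, the two paths appear to run together ...
cat "$spec_line"
# ... while od -c displays the separating byte as \0.
od -c "$spec_line"

# Count the null bytes in the line: tr -cd keeps only the listed character.
nul_count=$(tr -cd '\0' < "$spec_line" | wc -c | tr -d ' ')
echo "null bytes: $nul_count"
rm -f "$spec_line"
```

Each specification line therefore consists of the source path, exactly one null byte, the destination path, and a trailing newline.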

One way it can be done is by using the stream editor sed. Here is how to pipe source AND destination paths into datalad copy-file:

$ find HCP1200/130518/T1w/ -maxdepth 1 -name 'T1w*.nii.gz' \
  | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
  | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -

As always, the regular expressions used for sed are a bit hard to grasp at first sight. Here is what this command does:

  • In general, sed's s (substitute) command will take a string specified between the first and second # (\(HCP1200\)\(.*\)) and replace it with what is between the second and third # (\1\2\x0../dataset-to-copy-to\2).

  • The first part splits the paths find returns (such as HCP1200/130518/T1w/T1w_acpc_dc.nii.gz) into two groups:

    • The start of the path (HCP1200), and

    • the remaining path (/130518/T1w/T1w_acpc_dc.nii.gz).

  • The second part then prints the first and the second group (\1\2, the source path), a nullbyte (\x0), and a relative path to the destination dataset together with the second group only (../dataset-to-copy-to\2, the destination path).
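The \x0 escape is not part of the POSIX specification of sed, so the substitution may not work with every sed flavor. As an alternative sketch, the same source and destination pairs can be produced with a small shell function. This is not the approach used in this section, and make_specs is a hypothetical helper name.

```shell
# Sketch: build "source<NUL>destination" lines without sed.
make_specs() {
  # Reads source paths from stdin, one per line.
  while IFS= read -r src; do
    # Strip the leading "HCP1200"; what remains corresponds to group \2.
    rest=${src#HCP1200}
    printf '%s\0%s\n' "$src" "../dataset-to-copy-to$rest"
  done
}

# Try it on one sample path; tr makes the null byte visible as "|".
visible=$(printf '%s\n' 'HCP1200/130518/T1w/T1w_acpc_dc.nii.gz' \
  | make_specs | tr '\0' '|')
printf '%s\n' "$visible"
# prints: HCP1200/130518/T1w/T1w_acpc_dc.nii.gz|../dataset-to-copy-to/130518/T1w/T1w_acpc_dc.nii.gz
```

In a pipeline, make_specs would take the place of the sed call, between find and datalad copy-file.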

Here is what the output of find piped into sed looks like:

$ find HCP1200/130518/T1w -maxdepth 1 -name 'T1w*.nii.gz' \
      | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#'

Note how the nullbyte is not visible to the naked eye in the output. To visualize it, you could redirect this output into a file and open it with an editor like vim. Let’s now see a datalad copy-file from stdin in action:

$ find HCP1200/130518/T1w -maxdepth 1 -name 'T1w*.nii.gz' \
 | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
 | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.05.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_brain.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1wDividedByT2w_ribbon.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1wDividedByT2w_ribbon.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore_1.25.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1w_acpc_dc.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1wDividedByT2w.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1wDividedByT2w.nii.gz]
copy_file(ok): HCP1200/130518/T1w/T1w_acpc_dc_restore.nii.gz [/home/me/beyond_basics/HPC/dataset-to-copy-to/130518/T1w/T1w_acpc_dc_restore.nii.gz]
save(ok): . (dataset)

Done! This is a complex-looking command with regular expressions and Unix pipes, but it does powerful things in a single line.

Copying reproducibly

To capture the provenance of subsampled dataset creation, the datalad copy-file command can be wrapped into a datalad run (manual) call. Here is a sketch of how it was done for the subsampled structural connectivity dataset:

Step 1: Create a dataset

$ datalad create hcp-structural-connectivity

Step 2: Install the full dataset as a subdataset

$ cd hcp-structural-connectivity
$ datalad clone -d . \
  https://github.com/datalad-datasets/human-connectome-project-openaccess.git \
  .hcp

Step 3: Install all subdatasets of the full dataset with datalad get -n -r

Step 4: Inside of the new dataset, draft a find command that returns all 11 desired files, and a subsequent sed substitution command that returns a nullbyte-separated source and destination path. For this subsampled dataset, this one would work:

$ find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
 -o -path '*/T1w/Diffusion/*' -name 'b*' \
 -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
 | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#'
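Before running such a command across thousands of subdatasets, it can help to check the find expression on a small mock directory tree. The sketch below assumes a hypothetical subject ID (100001) and creates empty placeholder files; only the find expression is taken from the step above.

```shell
# Mock tree for a single, hypothetical subject (100001).
mock=$(mktemp -d)
sub="$mock/.hcp/HCP1200/100001"
mkdir -p "$sub/T1w/Diffusion/eddylogs" "$sub/unprocessed/3T/T1w_MPR1"

# The five diffusion files, plus one log file that should NOT be selected.
for f in nodif_brain_mask.nii.gz bvecs bvals data.nii.gz grad_dev.nii.gz; do
  touch "$sub/T1w/Diffusion/$f"
done
touch "$sub/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters"

# The six unprocessed structural files.
for f in BIAS_32CH AFI BIAS_BC FieldMap_Magnitude FieldMap_Phase T1w_MPR1; do
  touch "$sub/unprocessed/3T/T1w_MPR1/100001_3T_$f.nii.gz"
done

# Same find expression as in Step 4, run from the root of the mock dataset.
selected=$(cd "$mock" && find .hcp/HCP1200 -maxdepth 5 \
     -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
  -o -path '*/T1w/Diffusion/*' -name 'b*' \
  -o -path '*/T1w/Diffusion/*' -name '*.nii.gz')

n_selected=$(printf '%s\n' "$selected" | wc -l | tr -d ' ')
printf '%s\n' "$selected"
echo "$n_selected files selected"
rm -rf "$mock"
```

For this mock subject, the expression selects the eleven listed files and skips the file in eddylogs, because that file matches neither the b* nor the *.nii.gz name pattern.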

Step 5: Pipe the results into datalad copy-file, and wrap everything into a datalad run. Note that -d/--dataset is not specified for datalad copy-file – this way, datalad run will save everything in one go at the end.

$ datalad run \
  -m "Assemble HCP dataset subset for structural connectivity data. \

     Specifically, these are the files:

     - T1w/Diffusion/nodif_brain_mask.nii.gz
     - T1w/Diffusion/bvecs
     - T1w/Diffusion/bvals
     - T1w/Diffusion/data.nii.gz
     - T1w/Diffusion/grad_dev.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
     - unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

     for each participant. The structure of the directory tree and file names
     are kept identical to the full HCP dataset." \
     "find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
       -o -path '*/T1w/Diffusion/*' -name 'b*' \
       -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
     | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#' \
     | datalad copy-file -r --specs-from -"

Step 6: Publish the dataset to GitHub or similar hosting services to allow others to clone it easily and get fast access to a relevant subset of files.

Afterwards, the slimmed-down structural connectivity dataset can be installed completely within seconds. Because of the reduced number of files it contains, it is easier to transform the data into BIDS format. Such a conversion can be done on a different branch of the dataset. If you have published your subsampled dataset into a RIA store, as was done with this specific subset, a single command can clone a BIDS-ified, slimmed-down HCP dataset for structural connectivity analyses, because RIA stores allow cloning of datasets in specific versions (such as a branch or tag as an identifier):

$ datalad clone ria+https://store.datalad.org#~hcp-structural-connectivity@bids

1.6.2. Summary

datalad copy-file is a useful command to create datasets from content of other datasets. Although it requires some Unix-y command line magic, it can be automated for larger tasks, and, when combined with a datalad run, produce suitable provenance records of where files have been copied from.