1.4. Remote indexed archives for dataset storage and backup

Remote indexed archive (RIA) stores are dataset storage locations that allow access to and collaboration on DataLad datasets. They may be a suitable solution if DataLad datasets should be backed up, made available for collaboration with others, or stored or managed in a central location. RIA stores are flat, flexible, file-system-based repository representations of any number of datasets. They can exist on all standard computing infrastructure, be it personal computers, servers, compute clusters, or even supercomputing infrastructure, including machines that do not have DataLad installed.

1.4.1. Technical details

RIA stores can be created or extended with a single command from within any dataset. DataLad datasets can subsequently be published into the datastore as a means of backing up a dataset or creating a dataset sibling to collaborate on with others. Alternatively, datasets can be cloned and updated from a RIA store just as from any other dataset location. The subsection RIA store workflows a few paragraphs down will demonstrate RIA-store related functionality. But prior to introducing the user-facing commands, this section starts by explaining the layout and general concept of a RIA store.

1.4.1.1. Layout

RIA stores store DataLad datasets. Both the layout of the RIA store and the layout of the datasets in the RIA store are different from typical dataset layouts, though. If one were to take a look inside of a RIA store as it is set up by default, one would see a directory that contains a flat subdirectory tree with datasets represented as bare Git repositories and an annex. Usually, looking inside of RIA stores is not necessary for RIA-related workflows, but it can help to grasp the concept of these stores.

The first level of subdirectories in this RIA store tree consists of the first three characters of the dataset IDs of the datasets that lie in the store, and the second level of subdirectories contains the remaining characters of the dataset IDs. Thus, the first two levels of subdirectories in the tree are split dataset IDs of the datasets that are stored in them[1]. The code block below illustrates what a single DataLad dataset looks like in a RIA store. The dataset ID of the dataset (946e8cac-432b-11ea-aac8-f0d5bf7b5561) is split across the first two directory levels:

 /path/to/my_riastore
 ├── 946
 │   └── e8cac-432b-11ea-aac8-f0d5bf7b5561
 │       ├── annex
 │       │   └── objects
 │       │       ├── 6q
 │       │       │   └── mZ
 │       │       │       └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
 │       │       │           └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
 │       │       ├── 6v
 │       │       │   └── zK
 │       │       │       └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
 │       │       │           └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
 │       │       ├── [...]
 │       │       └── [...]
 │       ├── archives
 │       │   └── archive.7z
 │       ├── branches
 │       ├── config
 │       ├── description
 │       ├── HEAD
 │       ├── hooks
 │       │   ├── applypatch-msg.sample
 │       │   ├── [...]
 │       │   └── update.sample
 │       ├── info
 │       │   └── exclude
 │       ├── objects
 │       │   ├── 05
 │       │   │   └── 3d25959223e8173497fa7f747442b72c31671c
 │       │   ├── 0b
 │       │   │   └── 8d0edbf8b042998dfeb185fa2236d25dd80cf9
 │       │   ├── [...]
 │       │   │   └── [...]
 │       │   ├── info
 │       │   └── pack
 │       ├── refs
 │       │   ├── heads
 │       │   │   ├── git-annex
 │       │   │   └── main
 │       │   └── tags
 │       ├── ria-layout-version
 │       └── ria-remote-ebce196a-b057-4c96-81dc-7656ea876234
 │           └── transfer
 ├── error_logs
 └── ria-layout-version

If a second dataset gets published to the RIA store, it will be represented in a similar tree structure underneath its individual dataset ID. If subdatasets of a dataset are published into a RIA store, they are not represented underneath their superdataset, but are stored on the same hierarchy level as any other dataset. Thus, the dataset representation in a RIA store is completely flat[2]. With this hierarchy-free setup, the location of a particular dataset in the RIA store depends only on its dataset ID. As the dataset ID is universally unique, gets assigned to a dataset at the time of creation, and does not change across the lifetime of a dataset, no two different datasets can have the same location in a RIA store.
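The mapping from dataset ID to store location described above can be sketched in a few lines of POSIX shell. The function name ria_dataset_path is made up for this illustration and is not part of DataLad; the store path and dataset ID are the ones from the example tree.

```shell
# Hypothetical helper (not a DataLad command): print the directory in which
# a RIA store keeps the dataset with the given ID.
ria_dataset_path() {
    store=$1
    id=$2
    prefix=$(printf '%s' "$id" | cut -c1-3)   # first three characters of the ID
    rest=$(printf '%s' "$id" | cut -c4-)      # remaining characters of the ID
    printf '%s/%s/%s\n' "$store" "$prefix" "$rest"
}

ria_dataset_path /path/to/my_riastore 946e8cac-432b-11ea-aac8-f0d5bf7b5561
# prints /path/to/my_riastore/946/e8cac-432b-11ea-aac8-f0d5bf7b5561
```

Because the function uses nothing but the ID, a superdataset and its subdatasets end up as siblings on the same level of the store.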

The directory underneath the two dataset-ID-based subdirectories contains a bare Git repository (visible in the tree above as well) that is a clone of the dataset.

What is a bare Git repository?

A bare Git repository is a repository that contains the contents of the .git directory of regular DataLad datasets or Git repositories, but no worktree or checkout. This has advantages: The repository is leaner, it is easier for administrators to perform garbage collections, and it is required if you want to push to it at all times. You can find out more on what bare repositories are and how to use them here.

Note that bare Git repositories can be cloned, and the clone of a bare Git repository will have a checkout and a worktree, thus resuming the shape that you are familiar with.

Inside of the bare Git repository, the annex directory, just as in any standard dataset or repository, contains the dataset’s keystore (object tree) under annex/objects[3]. In conjunction, keystore and bare Git repository are the original dataset, just represented differently: there is no working tree (i.e., the directory hierarchy that exists in the original dataset), and the dataset is stored under its dataset ID instead of the name it was created under.

If necessary, the keystore (annex) can be stored as a 7-zip archive, optionally compressed, under archives/, either for compression gains or for use on HPC systems with inode limitations[4]. Despite being 7zipped, those archives can be indexed and support relatively fast random read access. Thus, the entire keystore can be put into an archive, re-using the exact same directory structure, and remains fully accessible while only using a handful of inodes, regardless of file number and size. If the dataset contains only annexed files, a complete dataset can be represented in about 25 inodes. A detailed example and utility script can be found at knowledge-base.psychoinformatics.de/kbi/0024.
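Whether archiving is worthwhile depends on how many file system entries a keystore occupies. The following sketch counts the entries below a directory with find; the helper name count_entries and the mock key names are assumptions made for this illustration, and the mock keystore is created in a temporary directory rather than in a real store.

```shell
# Hypothetical helper, not part of DataLad: count the file system entries
# (files and directories, each typically using one inode) below a directory,
# including the directory itself.
count_entries() {
    find "$1" | wc -l | tr -d ' '
}

# Throwaway mock of a keystore with two keys (placeholder names, not real keys).
demo=$(mktemp -d)
for key in KEY-a KEY-b; do
    mkdir -p "$demo/annex/objects/xx/yy/$key"
    : > "$demo/annex/objects/xx/yy/$key/$key"
done

# objects + xx + yy + 2 key directories + 2 key files = 7 entries
count_entries "$demo/annex/objects"
rm -rf "$demo"
```

In the unarchived layout, every annexed key occupies at least a key directory and a key file, so the count grows with the number of files in the dataset. A single archive file replaces all of these entries.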

Taking all of the above information together, on an infrastructural level, a RIA store is fully self-contained, and is plain file system storage, not a database. Everything inside of a RIA store is either a file, a directory, or a zipped archive. It can thus be set up on any infrastructure that has a file system with directory and file representation, and has barely any additional software requirements (see below). Access to datasets in the store can be managed by using file system permissions. With these attributes, a RIA store is a suitable solution for a number of use cases (back-up, single or multi-user dataset storage, central point for collaborative workflows, …), be that on private workstations, web servers, compute clusters, or other IT infrastructure.

Software Requirements

On the RIA store hosting infrastructure, only 7z needs to be installed, and only if the archive feature is desired. Specifically, no Git, no git-annex, and no other running daemons are necessary. If the RIA store is set up remotely, the server needs to be SSH-accessible.

On the client side, you need DataLad.

1.4.1.2. git-annex ORA-remote special remotes

On a technical level, beyond being a directory tree of datasets, a RIA store is by default a git-annex ORA-remote (optional remote access) special remote of a dataset. This makes it possible to store not only the history of a dataset, but also all annexed contents.

What is a special remote?

A special-remote is an extension to Git’s concept of remotes, and can enable git-annex to transfer data to and from places that are not Git repositories (e.g., cloud services or external machines such as an HPC system). Don’t envision a special-remote as a physical place or location – a special-remote is just a protocol that defines the underlying transport of your files to and from a specific location.

The git-annex ora-remote special remote is referred to as a “storage sibling” of the original dataset. It is similar to git-annex’s built-in directory special remote (but works remotely and uses the hashdir_mixed[2] keystore layout). Thanks to the git-annex ora-remote, RIA stores can provide regular git-annex key storage, and keys can be retrieved from (compressed) 7z archives in the RIA store. Put simply, annexed contents of datasets can only be pushed into RIA stores if they have a git-annex ora-remote.

Certain applications will not require special remote features. The use case Scaling up: Managing 80TB and 15 million files from the HCP release shows an example where git-annex key storage is explicitly not wanted. Other applications may require only the special remote, such as cases where Git isn’t installed on the RIA store hosting infrastructure. For most storage or back-up scenarios, special remote capabilities are useful, though, and thus the default.

The command datalad create-sibling-ria (manual) can create both datasets in RIA stores and the RIA stores themselves. However, if it does not find a RIA store under the provided URL, it sets up a new one only when the parameter --new-store-ok is passed. By default, the command will automatically create a dataset representation in a RIA store and configure a sibling to allow publishing to the RIA store and updating from it. With special remote capabilities enabled, the command will automatically create the special remote as a storage sibling and link it to the RIA sibling. With the sibling and special remote set up, upon an invocation of datalad push --to <sibling> (manual), the complete dataset contents, including annexed contents, will be published to the RIA store, with no further setup or configuration required[5].

To disable the storage sibling completely, invoke datalad create-sibling-ria with the argument --storage-sibling=off. To create a RIA store with only special remote storage, you can invoke datalad create-sibling-ria with the argument --storage-sibling=only.

1.4.1.3. Advantages of RIA stores

Storing datasets in RIA stores has a number of advantages that align well with the demands of central dataset management on shared compute infrastructure, but are also well suited for most back-up and storage applications. In a RIA store layout, the first two levels of subdirectories can host any number of keystores and bare repositories. As datasets are identified via ID and stored next to each other underneath the top-level RIA store directory, the store is completely flexible and extendable, and regardless of the number or nature of datasets inside of the store, a RIA store keeps a homogeneous directory structure. This aids the handling of large numbers of repositories, because unique locations are derived from dataset/repository properties (their ID) rather than a dataset name or a location in a complex dataset hierarchy. Because the dataset representation in the RIA store is a bare repository, “house-keeping” as well as query tasks can be automated or performed by data management personnel with no domain-specific knowledge about dataset contents. Short maintenance scripts can be used to automate basically any task that is of interest and possible in a dataset, but across the full RIA store. A few examples are:

  • Copy or move annex objects into a 7z archive.

  • Find dataset dependencies across all stored datasets by returning the dataset IDs of subdatasets recorded in each dataset.

  • Automatically return the number of commits in each repository.

  • Automatically return the author and time of the last dataset update.

  • Find all datasets associated with specific authors.

  • Clean up unnecessary files and minimize one (or all) repositories with Git’s garbage collection (gc) command.

The use case Building a scalable data storage for scientific computing demonstrates the advantages of this in a large scientific institute with central data management. Due to the git-annex ora-remote special remote, datasets can be exported and stored as archives to save disk space.
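A store-wide maintenance script of the kind listed above could start from a loop like the following sketch. The helper names list_dataset_ids and count_commits are hypothetical and not part of DataLad. The sketch assumes the default layout described earlier (a three-character directory, then a directory named after the rest of the ID) and uses an empty mock store, so count_commits is only defined here, not run.

```shell
# Hypothetical helper: print the dataset ID of every dataset in a RIA store
# by joining the two ID-derived directory levels.
list_dataset_ids() {
    store=$1
    for dir in "$store"/???/*/; do
        [ -d "$dir" ] || continue     # skip if the glob matched nothing
        dir=${dir%/}
        rest=${dir##*/}               # second level: rest of the ID
        parent=${dir%/*}
        prefix=${parent##*/}          # first level: first three characters
        printf '%s%s\n' "$prefix" "$rest"
    done
}

# Example per-dataset task: number of commits reachable from any ref of one
# bare repository. Requires Git on the machine that runs the script.
count_commits() {
    git --git-dir="$1" rev-list --all --count
}

# Mock store with the two dataset IDs used in this section.
store=$(mktemp -d)
mkdir -p "$store/e3e/70682-c209-4cac-629f-6fbed82c07cd" \
         "$store/d95/bafc8-f2a4-d27b-dcf4-bb99f4bea973" \
         "$store/alias" "$store/error_logs"
list_dataset_ids "$store"
rm -rf "$store"
```

The three-character glob skips the alias and error_logs directories. On a real store, the loop body could call count_commits, git gc, or any other Git command on each dataset directory.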

1.4.2. RIA store workflows

The user-facing commands for interactions with a RIA store are barely different from standard DataLad workflows. The paragraphs below detail how to create and populate a RIA store, how to clone datasets and retrieve data from it, and also how to handle permissions or hide technicalities.

1.4.2.1. Creating or publishing to RIA stores

A dataset can be added to a RIA store, whether the store already exists or not, by running the datalad create-sibling-ria command, and subsequently published into the store using datalad push. Just as with the datalad siblings add (manual) command, datalad create-sibling-ria requires an arbitrary sibling name (with the -s/--name option) and a URL to the location of the store (as a positional argument). In the case of RIA stores, the URL takes the form of a ria+ URL, and the form of this URL depends on where the RIA store exists or should be created, or rather, which file transfer protocol (SSH or file) is used:

  • A URL to an SSH-accessible server has a ria+ssh:// prefix, followed by user and hostname specification and an absolute path: ria+ssh://[user@]hostname/absolute/path/to/ria-store

  • A URL to a store on a local file system has a ria+file:// prefix, followed by an absolute path: ria+file:///absolute/path/to/ria-store

RIA stores with HTTP access

Setting up a RIA store with access via HTTP requires additional server-side configurations for Git. Git’s http-backend documentation can point you to the relevant configurations for your web server and use case.

Note that it is always required to specify an absolute path in the URL!
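The two URL forms and the absolute-path requirement can be captured in a small helper. This is only a sketch: the function name ria_url and the host me@example.com are made up for illustration, and the function merely assembles a string without contacting any store.

```shell
# Hypothetical helper: build a ria+ URL from a protocol (file or ssh),
# a [user@]hostname (ignored for file), and an absolute path to the store.
ria_url() {
    proto=$1 host=$2 store_path=$3
    case $store_path in
        /*) ;;   # absolute path: fine
        *)  echo "path must be absolute: $store_path" >&2; return 1 ;;
    esac
    case $proto in
        file) printf 'ria+file://%s\n' "$store_path" ;;
        ssh)  printf 'ria+ssh://%s%s\n' "$host" "$store_path" ;;
        *)    echo "unsupported protocol: $proto" >&2; return 1 ;;
    esac
}

ria_url file "" /home/me/myriastore
# prints ria+file:///home/me/myriastore
ria_url ssh me@example.com /data/ria-store
# prints ria+ssh://me@example.com/data/ria-store
```

The three slashes in the file URL come from the empty host part followed by the leading slash of the absolute path.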

In addition, as a convenience for cloning, you can supply an --alias parameter with a name under which the dataset can later be cloned from the store.

If you code along, make sure to check the next findoutmore!

The upcoming demonstration of RIA stores uses the DataLad-101 dataset that was created throughout the Basics of this handbook. If you want to execute these code snippets on a DataLad-101 dataset you created, the modification described in the findoutmore below needs to be done first.

If necessary, adjust the submodule path!

Back in Subdataset publishing, in order to appropriately reference and link subdatasets on hosting sites such as GitHub, we adjusted the submodule path of the subdataset in .gitmodules to point to a published subdataset on GitHub:

# in DataLad-101
$ cat .gitmodules
[submodule "recordings/longnow"]
	path = recordings/longnow
	url = https://github.com/datalad-datasets/longnow-podcasts.git
	datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
	datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git
[submodule "midterm_project"]
	path = midterm_project
	url = https://github.com/adswa/midtermproject
	datalad-id = d95bafc8-f2a4-d27b-dcf4-bb99f4bea973

Later in this demonstration we would like to publish the subdataset to a RIA store and retrieve it automatically from this store – retrieval is only attempted from a store, however, if no other working source is known. Therefore, we will remove the reference to the published dataset prior to this demonstration and replace it with the path it was originally referenced under.

# in DataLad-101
$ datalad subdatasets --contains midterm_project --set-property url ./midterm_project
add(ok): .gitmodules (file)
save(ok): . (dataset)
subdataset(ok): midterm_project (dataset)

To demonstrate the basic process, we will create a RIA store on a local file system to publish the DataLad-101 dataset from the handbook’s “Basics” section to. In the example below, the RIA sibling gets the name ria-backup. The URL uses the file protocol and points with an absolute path to the not yet existing directory myriastore. Make sure that the --new-store-ok parameter is set to allow the creation of a new store.

# inside of the dataset DataLad-101
$ datalad create-sibling-ria -s ria-backup --alias dl-101 --new-store-ok "ria+file:///home/me/myriastore"
[INFO] Creating a new RIA store at /home/me/myriastore
[INFO] create siblings 'ria-backup' and 'ria-backup-storage' ...
update(ok): . (dataset)
update(ok): . (dataset)
[INFO] Configure additional publication dependency on "ria-backup-storage"
configure-sibling(ok): . (sibling)
create-sibling-ria(ok): /home/me/dl-101/DataLad-101 (dataset)

Afterwards, the dataset has two additional siblings: ria-backup, and ria-backup-storage.

$ datalad siblings
.: here(+) [git]
.: roommate(+) [../mock_user/DataLad-101 (git)]
.: ria-backup(-) [/home/me/myriastore/e3e/70682-c209-4cac-629f-6fbed82c07cd (git)]
.: gin(+) [/home/me/pushes/DataLad-101 (git)]
.: ria-backup-storage(+) [ora]

The storage sibling is the git-annex ora-remote and is set up automatically – unless datalad create-sibling-ria is run with --storage-sibling=off. By default, it has the name of the RIA sibling, suffixed with -storage, but alternative names can be supplied with the --storage-name option.

Take a look into the store

Right after running this command, a RIA store has been created in the specified location:

$ tree /home/me/myriastore
/home/me/myriastore
├── alias
│   └── dl-101 -> ../e3e/70682-c209-4cac-629f-6fbed82c07cd
├── e3e
│   └── 70682-c209-4cac-629f-6fbed82c07cd
│       ├── annex
│       │   └── objects
│       ├── archives
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── HEAD
│       ├── hooks
│       │   ├── applypatch-msg.sample
│       │   ├── commit-msg.sample
│       │   ├── fsmonitor-watchman.sample
│       │   ├── post-update.sample
│       │   ├── pre-applypatch.sample
│       │   ├── pre-commit.sample
│       │   ├── pre-merge-commit.sample
│       │   ├── prepare-commit-msg.sample
│       │   ├── pre-push.sample
│       │   ├── pre-rebase.sample
│       │   ├── pre-receive.sample
│       │   ├── push-to-checkout.sample
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── info
│       │   └── pack
│       ├── refs
│       │   ├── heads
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version

17 directories, 20 files

Note that there is one dataset represented in the RIA store. The two-directory structure it is represented under corresponds to the dataset ID of DataLad-101:

# The dataset ID is stored in .datalad/config
$ cat .datalad/config
[datalad "dataset"]
	id = e3e70682-c209-4cac-629f-6fbed82c07cd

In order to publish the dataset’s history and all its contents into the RIA store, a single datalad push to the RIA sibling suffices:

$ datalad push --to ria-backup
copy(ok): books/TLCL.pdf (file) [to ria-backup-storage...]
copy(ok): books/bash_guide.pdf (file) [to ria-backup-storage...]
copy(ok): books/byte-of-python.pdf (file) [to ria-backup-storage...]
copy(ok): books/progit.pdf (file) [to ria-backup-storage...]
publish(ok): . (dataset) [refs/heads/main->ria-backup:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->ria-backup:refs/heads/git-annex [new branch]]

Take another look into the store

Now that dataset contents have been pushed to the RIA store, the bare repository contains them, although their representation is not human-readable. But worry not – this representation only exists in the RIA store. When cloning this dataset from the RIA store, the clone will be in its standard human-readable format.

$ tree /home/me/myriastore
/home/me/myriastore
├── alias
│   └── dl-101 -> ../e3e/70682-c209-4cac-629f-6fbed82c07cd
├── e3e
│   └── 70682-c209-4cac-629f-6fbed82c07cd
│       ├── annex
│       │   └── objects
│       │       ├── G6
│       │       │   └── Gj
│       │       │       └── MD5E-s12465653--05cd7ed5✂MD5.pdf
│       │       │           └── MD5E-s12465653--05cd7ed5✂MD5.pdf
│       │       ├── jf
│       │       │   └── 3M
│       │       │       └── MD5E-s2120211--06d1efcb✂MD5.pdf
│       │       │           └── MD5E-s2120211--06d1efcb✂MD5.pdf
│       │       ├── WF
│       │       │   └── Gq
│       │       │       └── MD5E-s1198170--0ab2c121✂MD5.pdf
│       │       │           └── MD5E-s1198170--0ab2c121✂MD5.pdf
│       │       └── xF
│       │           └── 42
│       │               └── MD5E-s4161086--c832fc13✂MD5.pdf
│       │                   └── MD5E-s4161086--c832fc13✂MD5.pdf
│       ├── archives
│       ├── [...]
│       ├── hooks
│       │   ├── [...]
│       │   ├── pre-merge-commit.sample
│       │   ├── prepare-commit-msg.sample
│       │   ├── pre-push.sample
│       │   ├── pre-rebase.sample
│       │   ├── pre-receive.sample
│       │   ├── push-to-checkout.sample
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── info
│       │   └── pack
│       │       ├── pack-a02f2380✂SHA1.idx
│       │       └── pack-a02f2380✂SHA1.pack
│       ├── ora-remote-46b169aa-bb91-42d6-be06-355d957fb4f7
│       │   └── transfer
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── main
│       │   └── tags
│       └── ria-layout-version
├── error_logs
└── ria-layout-version

31 directories, 28 files

A second dataset can be added and published to the store in the very same way. As a demonstration, we’ll do it for the midterm_project subdataset:

$ cd midterm_project
$ datalad create-sibling-ria -s ria-backup ria+file:///home/me/myriastore
[INFO] Creating a new RIA store at /home/me/myriastore
[INFO] create siblings 'ria-backup' and 'ria-backup-storage' ...
update(ok): . (dataset)
update(ok): . (dataset)
[INFO] Configure additional publication dependency on "ria-backup-storage"
configure-sibling(ok): . (sibling)
create-sibling-ria(ok): /home/me/dl-101/DataLad-101/midterm_project (dataset)
$ datalad push --to ria-backup
copy(ok): .datalad/environments/midterm-software/image (file) [to ria-backup-storage...]
copy(ok): pairwise_relationships.png (file) [to ria-backup-storage...]
copy(ok): prediction_report.csv (file) [to ria-backup-storage...]
publish(ok): . (dataset) [refs/heads/main->ria-backup:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->ria-backup:refs/heads/git-annex [new branch]]

Take a look into the RIA store after a second dataset has been added

By creating a RIA sibling to the RIA store and publishing the contents of the midterm_project subdataset to the store, a second dataset has been added to the datastore. Note how it is represented on the same hierarchy level as the previous dataset, underneath its dataset ID (note that the output is cut off for readability):

$ cat .datalad/config
[datalad "dataset"]
	id = d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
[datalad "containers.midterm-software"]
	image = .datalad/environments/midterm-software/image
	cmdexec = singularity exec -B {{pwd}} {img} {cmd}
$ tree /home/me/myriastore
/home/me/myriastore
├── alias
│   └── dl-101 -> ../e3e/70682-c209-4cac-629f-6fbed82c07cd
├── d95
│   └── bafc8-f2a4-d27b-dcf4-bb99f4bea973
│       ├── annex
│       │   └── objects
│       │       ├── 8q
│       │       │   └── 6M
│       │       │       └── MD5E-s345--a88cab39✂MD5.csv
│       │       │           └── MD5E-s345--a88cab39✂MD5.csv
│       │       ├── F1
│       │       │   └── K3
│       │       │       └── MD5E-s230694943--944b0300✂MD5
│       │       │           └── MD5E-s230694943--944b0300✂MD5
│       │       └── q1
│       │           └── gp
│       │               └── MD5E-s261062--025dc493✂MD5.png
│       │                   └── MD5E-s261062--025dc493✂MD5.png
│       ├── archives
│       ├── branches
│       ├── config
│       ├── config.dataladlock
│       ├── description
│       ├── [...]
│       ├── hooks
│       │   ├── [...]
│       │   ├── pre-receive.sample
│       │   ├── push-to-checkout.sample
│       │   └── update.sample
│       ├── info
│       │   └── exclude
│       ├── objects
│       │   ├── info
│       │   └── pack
│       │       ├── pack-e79ad660✂SHA1.idx
│       │       └── pack-e79ad660✂SHA1.pack
│       ├── ora-remote-aa9031bf-0855-4611-aa21-da0698cce020
│       │   └── transfer
│       ├── refs
│       │   ├── heads
│       │   │   ├── git-annex
│       │   │   └── main
│       │   └── tags
│       └── ria-layout-version
├── e3e
│   └── 70682-c209-4cac-629f-6fbed82c07cd
│       ├── annex

Thus, in order to create and populate RIA stores, only the commands datalad create-sibling-ria and datalad push are required.

1.4.2.2. Cloning and updating from RIA stores

Cloning from RIA stores is done via datalad clone (manual) from a ria+ URL, suffixed with a dataset identifier. Depending on the protocol being used, the URLs are composed in the same way as during sibling creation:

  • A URL to a RIA store on an SSH-accessible server takes the same format as before: ria+ssh://[user@]hostname/absolute/path/to/ria-store

  • A URL to a RIA store on a local file system also looks the same as during sibling creation: ria+file:///absolute/path/to/ria-store

  • A URL for read (without annex) access to a store via http (e.g., to a RIA store like store.datalad.org, through which the HCP dataset is published) looks like this: ria+https://store.datalad.org:/absolute/path/to/ria-store

The appropriate ria+ URL needs to be suffixed with a # sign and a dataset identifier. One way this can be done is via the dataset ID. Here is how to clone the DataLad-101 dataset from the RIA store using its dataset ID:

$ datalad clone ria+file:///home/me/myriastore#e3e70682-c209-4cac-629f-6fbed82c07cd myclone
[INFO] Configure additional publication dependency on "ria-backup-storage"
configure-sibling(ok): . (sibling)
install(ok): /home/me/beyond_basics/myclone (dataset)

There are two downsides to this method: For one, a dataset ID is hard to know, remember, and type. Secondly, if no additional path is given to datalad clone, the resulting dataset clone would be named after its ID. An alternative, therefore, is to use an alias for the dataset. This is an alternative dataset identifier that a dataset in a RIA store can be configured with, either with a parameter at the time of running datalad create-sibling-ria as done above, or manually afterwards. For example, given that the dataset also has an alias dl-101, the above call would simplify to

$ datalad clone ria+file:///home/me/myriastore#~dl-101

Configure an alias for a dataset manually

In order to define an alias for an individual dataset in a store, one needs to create an alias/ directory in the root of the datastore and place a symlink of the desired name to the dataset inside of it. Here is how it is done, for the midterm project dataset:

First, create an alias/ directory in the store, if it doesn’t yet exist:

$ mkdir /home/me/myriastore/alias

Afterwards, place a symlink with a name of your choice to the dataset inside of it. Here, we create a symlink called midterm_project:

$ ln -s /home/me/myriastore/d95/bafc8-f2a4-d27b-dcf4-bb99f4bea973 /home/me/myriastore/alias/midterm_project

Here is what it looks like inside of this directory. You can see both the automatically created alias and the newly created manual one:

$ tree /home/me/myriastore/alias
/home/me/myriastore/alias
├── dl-101 -> ../e3e/70682-c209-4cac-629f-6fbed82c07cd
└── midterm_project -> /home/me/myriastore/d95/bafc8-f2a4-d27b-dcf4-bb99f4bea973

2 directories, 0 files

Afterwards, the alias name, prefixed with a ~, can be used as a dataset identifier:

$ datalad clone ria+file:///home/me/myriastore#~midterm_project
[INFO] Configure additional publication dependency on "ria-backup-storage"
configure-sibling(ok): . (sibling)
install(ok): /home/me/beyond_basics/midterm_project (dataset)

This makes it easier for others to clone the dataset and will provide a sensible default name for the clone if no additional path is provided in the command.

Note that it is even possible to create “aliases of aliases”: symlinking an existing alias symlink (in the example above midterm_project) under another name in the alias/ directory is no problem. This could be useful if the same dataset needs to be accessible via several aliases, or to safeguard against common spelling errors in alias names.
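The alias mechanism is plain symlinking, so it can be tried out on a mock store without DataLad. In the sketch below, the helper name add_alias and the second alias name midterm are made up for illustration. The relative link target mirrors the form of the automatically created dl-101 alias shown above.

```shell
# Hypothetical helper: create alias NAME for TARGET, where TARGET is a
# dataset directory or an existing alias, given relative to alias/.
add_alias() {
    store=$1 target=$2 name=$3
    mkdir -p "$store/alias"
    ln -s "$target" "$store/alias/$name"
}

# Mock store in a temporary directory; only the directories matter here.
store=$(mktemp -d)
mkdir -p "$store/d95/bafc8-f2a4-d27b-dcf4-bb99f4bea973"

add_alias "$store" ../d95/bafc8-f2a4-d27b-dcf4-bb99f4bea973 midterm_project
add_alias "$store" midterm_project midterm    # an alias of an alias

# Both names resolve to the same dataset directory.
( cd "$store/alias/midterm_project" && pwd -P )
( cd "$store/alias/midterm" && pwd -P )
rm -rf "$store"
```

Relative link targets have the general advantage that they stay valid if the store directory is moved or mounted under a different path.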

The dataset clone is just like any other dataset clone. Contents stored in Git are present right after cloning, while the contents of annexed files are not yet retrieved from the store and can be obtained with a datalad get (manual).

$ cd myclone
$ tree
.
├── books
│   ├── bash_guide.pdf -> ../.git/annex/objects/WF/Gq/✂/MD5E-s1198170--0ab2c121✂MD5.pdf
│   ├── byte-of-python.pdf -> ../.git/annex/objects/xF/42/✂/MD5E-s4161086--c832fc13✂MD5.pdf
│   ├── progit.pdf -> ../.git/annex/objects/G6/Gj/✂/MD5E-s12465653--05cd7ed5✂MD5.pdf
│   └── TLCL.pdf -> ../.git/annex/objects/jf/3M/✂/MD5E-s2120211--06d1efcb✂MD5.pdf
├── code
│   ├── list_titles.sh
│   └── nested_repos.sh
├── Gitjoke2.txt
├── midterm_project
├── notes.txt
└── recordings
    ├── interval_logo_small.jpg
    ├── longnow
    ├── podcasts.tsv
    └── salt_logo_small.jpg

5 directories, 11 files

To demonstrate file retrieval from the store, let’s get an annexed file:

$ datalad get books/progit.pdf
get(ok): books/progit.pdf (file) [from ria-backup-storage...]

What about creating RIA stores and cloning from RIA stores with different protocols?

Consider setting up and populating a RIA store on a server via the file protocol, but cloning a dataset from that store to a local computer via SSH protocol. Will this be a problem for file content retrieval? No, in all standard situations, DataLad will adapt to this. Upon cloning the dataset with a different URL than it was created under, enabling the special remote will initially fail, but DataLad will adaptively try out other URLs (including changes in hostname, path, or protocol) to enable the ora-remote and retrieve file contents.

Just as expected, the subdatasets are not pre-installed. How will subdataset installation work for datasets that exist in a RIA store as well, like midterm_project? Just as with any other subdataset! DataLad cleverly handles subdataset installations from RIA stores in the background: The location of the subdataset in the RIA store is discovered and used automatically:

$ datalad get -n midterm_project
[INFO] Configure additional publication dependency on "ria-backup-storage"
install(ok): /home/me/beyond_basics/myclone/midterm_project (dataset) [Installed subdataset in order to get /home/me/beyond_basics/myclone/midterm_project]

More technical insights into the automatic ria+ URL generation are outlined in the findoutmore below:

On cloning datasets with subdatasets from RIA stores

The use case Scaling up: Managing 80TB and 15 million files from the HCP release details a RIA-store based publication of a large dataset, split into a nested dataset hierarchy with about 4500 subdatasets in total. But how can links to subdatasets work, if datasets in a RIA store are stored in a flat hierarchy, with no nesting?

The key to this lies in flexibly regenerating subdatasets’ URLs based on their ID and a path to the RIA store. The datalad get command is capable of generating RIA URLs to subdatasets on its own, if the higher-level dataset contains a datalad get configuration on subdataset-source-candidate-origin that points to the RIA store the subdataset is published in. Here is what the .datalad/config configuration looks like for the top-level dataset of the HCP dataset:

[datalad "get"]
    subdataset-source-candidate-origin = "ria+https://store.datalad.org#{id}"

With this configuration, a datalad get can use the URL and insert the dataset ID in question into the {id} placeholder to clone directly from the RIA store.

This configuration is automatically added to a dataset that is cloned from a RIA store, but it can also be done by hand with a git config (manual) command[6].
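The placeholder substitution can be mimicked in shell to see which URL results for a given subdataset. This is an illustrative sketch only: fill_candidate is a hypothetical helper, not DataLad’s implementation, and the template pointing at the local example store is an assumption modeled on the HCP configuration above.

```shell
# Hypothetical helper: fill the {id} placeholder of a source-candidate
# template with a dataset ID. IDs consist of hexadecimal characters and
# hyphens, so they are safe to use in the sed replacement.
fill_candidate() {
    template=$1
    id=$2
    printf '%s\n' "$template" | sed "s/{id}/$id/g"
}

# Template of the same shape as above, pointing at the local example store,
# filled with the dataset ID of midterm_project.
fill_candidate 'ria+file:///home/me/myriastore#{id}' d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
# prints ria+file:///home/me/myriastore#d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
```

The subdataset ID is recorded in the superdataset’s .gitmodules (the datalad-id entry shown earlier), so the superdataset holds everything needed to build this URL.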

Beyond straightforward access to datasets, RIA stores also allow very fine-grained cloning operations: Datasets in RIA stores can be cloned in specific versions.

Cloning specific dataset versions

Optionally, datasets can be cloned in a specific version, such as a tag or branch, by appending @<version-identifier> after the dataset ID or the dataset alias. Here is how to clone the BIDS version of the structural preprocessed subset of the HCP dataset that exists on the branch bids of this dataset:

$ datalad clone ria+https://store.datalad.org#~hcp-structural-preprocessed@bids

If you are interested in finding out how this dataset came into existence, check out the use case Scaling up: Managing 80TB and 15 million files from the HCP release.
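The pieces of a clone URL (store URL, # plus dataset ID or ~alias, and an optional @version) can be put together as in the following sketch. The function name ria_clone_url is hypothetical; it only assembles the string that would be passed to datalad clone and does not call DataLad.

```shell
# Hypothetical helper: compose the URL argument for `datalad clone` from a
# store URL, a dataset identifier (dataset ID or ~alias), and an optional
# version (branch or tag).
ria_clone_url() {
    store_url=$1
    identifier=$2
    version=${3:-}
    if [ -n "$version" ]; then
        printf '%s#%s@%s\n' "$store_url" "$identifier" "$version"
    else
        printf '%s#%s\n' "$store_url" "$identifier"
    fi
}

# Aliases are quoted so that the shell does not attempt tilde expansion.
ria_clone_url ria+file:///home/me/myriastore e3e70682-c209-4cac-629f-6fbed82c07cd
ria_clone_url ria+file:///home/me/myriastore '~dl-101'
ria_clone_url ria+https://store.datalad.org '~hcp-structural-preprocessed' bids
# last call prints ria+https://store.datalad.org#~hcp-structural-preprocessed@bids
```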

Updating datasets works with the datalad update (manual) and datalad update --merge commands introduced in chapter Collaboration. And because a RIA store hosts bare Git repositories, collaborating becomes easy. Anyone with access can clone the dataset from the store, add changes, and push them back – this is the same workflow as for datasets hosted on sites such as GitHub, GitLab, or Gin.

1.4.2.3. Permission management

In order to limit or grant access to datasets in datastores, permissions can be set at the time of RIA sibling creation with the --shared option. If it is given, this option configures the permissions in the RIA store for multi-user access. Possible values for this option are identical to those of git init --shared and are described in its documentation. In order for the dataset to be accessible to everyone, for example, --shared all could be specified. If access should be limited to a particular Unix group (--shared group), the group name needs to be specified with the --group option.

1.4.2.4. Configurations and tricks to hide technical layers

In setups with a central, DataLad-centric data management, in order to spare users knowing about RIA stores, custom configurations can be distributed via DataLad’s run-procedures to simplify workflows further and hide the technical layers of the RIA setup. For example, custom procedures provided at dataset creation could automatically perform a sibling setup in a RIA store, and also create an associated GitLab repository with a publication dependency to the RIA store to ease publishing data or cloning the dataset. The use case Building a scalable data storage for scientific computing details the setup of RIA stores in a scientific institute and demonstrates this example.

To simplify repository access beyond using aliases, the datasets stored in a RIA store can be installed under human-readable names in a single superdataset. Cloning the superdataset exposes the underlying datasets under a non-dataset-ID name. Users can thus get data from datasets hosted in a datastore without any knowledge about the dataset IDs or the need to construct ria+ URLs, just as it was done in the use cases Scaling up: Managing 80TB and 15 million files from the HCP release and Building a scalable data storage for scientific computing. From a user’s perspective, the RIA store would thus stay completely hidden.

Data stewards with knowledge about RIA stores and access to them can perform standard maintenance tasks easily or even in an automated fashion. The use case Building a scalable data storage for scientific computing showcases some examples of those operations.

1.4.3. Summary

RIA stores are useful, lean, and undemanding storage locations for DataLad datasets. Their properties make them suitable solutions to back-up, central data management, or collaboration use cases. They can be set up with minimal effort, and the few technical details a user may face such as cloning from dataset IDs can be hidden with minimal configurations of the store like aliases or custom procedures.

Footnotes