8.2. Dataset hosting on GIN
GIN (G-Node infrastructure) is a free data management system designed for comprehensive and reproducible management of scientific data. It is a web-based repository store that provides fine-grained access control for sharing data. GIN builds on Git and git-annex, and is an easy alternative to other third-party services for hosting and sharing your DataLad datasets [1]. It lets you share datasets and their contents with selected collaborators, or make them publicly and anonymously available.
Note
If you reached this section to find out how to access a DataLad dataset shared on GIN, please skip to the section Sharing and accessing the dataset.
8.2.1. Prerequisites
In order to use GIN for hosting and sharing your datasets, you need to
- register,
- upload your public SSH key for SSH access, and
- create an empty repository on GIN and publish your dataset to it.
Todo
Revise this last step once there is a datalad create-sibling-gin command: https://github.com/datalad/datalad/issues/2680
Once you have registered an account on the GIN server by providing your e-mail address, affiliation, and name, and selecting a user name and password, you should upload your SSH key to allow SSH access.
What is an SSH key and how can I create one?
An SSH key is an access credential in the SSH protocol that can be used to log in from one system to remote servers and services, such as from your private computer to an SSH server. For repository hosting services such as GIN, GitHub, or GitLab, it can be used to connect and authenticate without supplying your username or password for each action.
This tutorial by GitHub is a detailed step-by-step instruction to generate and use SSH keys for authentication. It also shows you how to add your public SSH key to your GitHub account so that you can install or clone datasets or Git repositories via SSH (in addition to the http protocol); the same procedure applies to GitLab and GIN.
Don’t be intimidated if you have never done this before – it is fast and easy:
- First, you need to create a private and a public key (an SSH key pair). All this takes is a single command in the terminal. The resulting files are text files that look like someone spilled alphabet soup in them, but constitute a secure password procedure.
- You keep the private key on your own machine (the system you are connecting from, and that only you have access to), and copy the public key to the system or service you are connecting to.
- On the remote system or service, you make the public key an authorized key to allow authentication via the SSH key pair instead of your password. This takes either a single command in the terminal or a few clicks in a web interface.
- You should protect your SSH keys on your machine with a passphrase to prevent others (e.g., in case of theft) from logging in to servers or services with SSH authentication [2], and configure an SSH agent to handle this passphrase for you with a single command. How to do all of this is detailed in the above tutorial.
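The key-creation step described above can be sketched with OpenSSH's ssh-keygen. This is a minimal illustration rather than a replacement for the tutorial: the scratch directory, the file name demo_gin_key, the key comment, and the demo passphrase are assumptions chosen so that the example runs without prompts and without touching ~/.ssh.

```shell
#!/bin/sh
# Illustrative only: create a throwaway SSH key pair in a scratch directory.
set -eu

keydir=$(mktemp -d)               # scratch location (assumption), not ~/.ssh
keyfile="$keydir/demo_gin_key"    # hypothetical file name

# -t ed25519: key type
# -C: free-text comment stored in the public key
# -N: passphrase (a demo value here; choose your own for a real key)
# -f: output file for the private key; the public key gets a .pub suffix
# -q: suppress progress output
ssh-keygen -q -t ed25519 -C "demo key for GIN" -N "demo-passphrase" -f "$keyfile"

# The private key stays on your machine. The public key (.pub) is the part
# you copy to the hosting service.
ls "$keydir"
cat "$keyfile.pub"
```

For a real key, leaving out -N and -f lets ssh-keygen prompt for a passphrase and suggest its default location. Running ssh-add with the path to the private key then registers it with a running SSH agent.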
To upload your public key, visit the settings of your user account on GIN. On the left-hand side, select the tab “SSH Keys”, and click the button “Add Key”.
You should copy the contents of your public key file into the field labeled content, and enter an arbitrary but informative Key Name, such as “My private work station”. Afterwards, you are done!
8.2.2. Publishing your dataset to GIN
To publish an existing dataset to GIN, first create a new, empty repository on GIN. Unlike with datalad create-sibling-github (which does this step automatically for you on GitHub), this needs to be done via the web interface.
Afterwards, add this repository as a sibling of your dataset. To do this, use the datalad siblings add command and the SSH URL of the repository as shown below. Note that since this is the first time you will be connecting to the GIN server via SSH, you will likely be asked to confirm the connection. This is a safety measure, and you can type “yes” to continue:
$ datalad siblings add -d . --name gin --url git@gin.g-node.org:/adswa/DataLad-101.git
The authenticity of host 'gin.g-node.org (141.84.41.219)' can't be established.
ECDSA key fingerprint is SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
Are you sure you want to continue connecting (yes/no)? yes
[INFO ] Failed to enable annex remote gin, could be a pure git or not accessible
[WARNING] Failed to determine if gin carries annex.
.: gin(-) [git@gin.g-node.org:/adswa/DataLad-101.git (git)]
Afterwards, you can publish your dataset with datalad push. As the repository on GIN supports a dataset annex, no publication dependency on an external data hosting service is necessary, and the dataset contents stored in Git and in git-annex are published to the same place:
$ datalad push --to gin
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
[INFO] Start resolving deltas
[INFO] Transfer data
[INFO] Start annex operation
[INFO] books/TLCL.pdf
[INFO] books/bash_guide.pdf
[INFO] books/byte-of-python.pdf
[INFO] books/progit.pdf
[INFO] Finished annex copy
[INFO] Update availability information
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
[INFO] Start resolving deltas
[INFO] Finished push of Dataset(/home/me/dl-101/DataLad-101)
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master [new branch]]
copy(ok): books/TLCL.pdf (file) [to gin...]
copy(ok): books/bash_guide.pdf (file) [to gin...]
copy(ok): books/byte-of-python.pdf (file) [to gin...]
copy(ok): books/progit.pdf (file) [to gin...]
copy(ok): flowers.jpg (file) [to gin...]
publish(ok): . (dataset) [refs/heads/git-annex->gin:refs/heads/git-annex fa73e9b..adc3814]
If you refresh the GIN web interface afterwards, you will find all of your dataset – including annexed contents! – on GIN. What is especially cool is that the GIN web interface (unlike GitHub) can even preview your annexed contents.
8.2.3. Sharing and accessing the dataset
Once your dataset is published, you can point collaborators and friends to it. If it is a public repository, retrieving the dataset and getting access to all published data contents (in a read-only fashion) is done by cloning the repository’s https URL. This does not require a user account on GIN.
Important: Take the URL in the browser, not the copy-paste URL
Please note that you need to use the browser URL of the repository, not the copy-paste URL on the upper right hand side of the repository if you want to get anonymous HTTPS access!
The two URLs differ only by a .git extension:
- Browser bar: https://gin.g-node.org/<user>/<repo>
- Copy-paste “HTTPS clone”: https://gin.g-node.org/<user>/<repo>.git
A dataset cloned from https://gin.g-node.org/<user>/<repo>.git, however, cannot retrieve annexed files!
$ datalad clone https://gin.g-node.org/adswa/DataLad-101
[INFO] Cloning dataset to Dataset(/home/me/dl-101/clone_of_dl-101/DataLad-101)
[INFO] Attempting to clone from https://gin.g-node.org/adswa/DataLad-101 to /home/me/dl-101/clone_of_dl-101/DataLad-101
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/clone_of_dl-101/DataLad-101)
[INFO] Scanning for unlocked files (this may take some time)
install(ok): /home/me/dl-101/clone_of_dl-101/DataLad-101 (dataset)
Subsequently, datalad get calls will be able to retrieve all annexed file contents that have been published to the repository.
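The URL rule from the box above can be captured in a small helper. This is only a sketch: the function name gin_browser_url is hypothetical, and it does nothing more than strip a trailing .git from the copy-paste URL to obtain the browser URL that permits anonymous access to annexed contents.

```shell
#!/bin/sh
# Hypothetical helper: turn a copy-paste "HTTPS clone" URL into the browser URL.
gin_browser_url() {
    # ${1%.git} removes a trailing ".git" if present and leaves other URLs as-is
    printf '%s\n' "${1%.git}"
}

url=$(gin_browser_url "https://gin.g-node.org/adswa/DataLad-101.git")
echo "$url"   # prints https://gin.g-node.org/adswa/DataLad-101

# With DataLad installed, the result could then be used as in the text:
#   datalad clone "$url"
#   cd DataLad-101 && datalad get books/TLCL.pdf
```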
If it is a private dataset, cloning the dataset from GIN requires a user name and password for anyone you want to share your dataset with. The “Collaboration” tab under Settings lets you set fine-grained access rights, and it is possible to share datasets with collaborators who are not registered on GIN by using the provided Guest accounts. In order to get access to annexed contents, cloning requires setting up an SSH key as detailed above, and cloning via the SSH URL:
$ datalad clone git@gin.g-node.org:/adswa/DataLad-101.git
How do I know if my repository is private?
Private repositories are marked with a lock sign. To make a repository public, untick the “Private” box, found under “Settings”.
In order to publish changes to a GIN repository, the repository needs to be cloned via its SSH URL.
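If you only have the browser URL at hand, the SSH URL follows the pattern used throughout this section (git@gin.g-node.org:/<user>/<repo>.git). The following is a minimal sketch of that conversion. The function name gin_ssh_url is hypothetical, and the sketch assumes the input has exactly the form https://gin.g-node.org/<user>/<repo>, with an optional .git suffix.

```shell
#!/bin/sh
# Hypothetical helper: derive the SSH URL of a GIN repository from its HTTPS URL.
gin_ssh_url() {
    path=${1#https://gin.g-node.org/}   # drop scheme and host, leaving <user>/<repo>[.git]
    path=${path%.git}                   # drop an optional .git suffix
    printf 'git@gin.g-node.org:/%s.git\n' "$path"
}

gin_ssh_url "https://gin.g-node.org/adswa/DataLad-101"
# prints git@gin.g-node.org:/adswa/DataLad-101.git
```

The printed URL is the one to pass to datalad clone when you intend to push changes back.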
8.2.4. Subdataset publishing
Just as the input subdataset iris_data in your published midterm_project was referencing its source on GitHub, the longnow subdataset in your published DataLad-101 dataset directly references the original dataset on GitHub. If you click on recordings and then longnow, you will be redirected to the podcast’s original dataset.
The subdataset midterm_project, however, is not successfully referenced. If you click on it, you get a 404 error page. The crucial difference between this subdataset and the longnow dataset is its entry in the .gitmodules file of DataLad-101:
$ cat .gitmodules
[submodule "recordings/longnow"]
path = recordings/longnow
url = https://github.com/datalad-datasets/longnow-podcasts.git
datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
[submodule "midterm_project"]
path = midterm_project
url = ./midterm_project
datalad-id = e5a3d370-223d-11ea-af8b-e86a64c8054c
While the podcast subdataset is referenced with a valid URL to GitHub, the midterm project’s URL is a relative path from the root of the superdataset. This is because the longnow subdataset was installed with datalad clone -d . (which records the source of the subdataset), and the midterm_project dataset was created as a subdataset with datalad create -d . midterm_project. Since there is no repository at https://gin.g-node.org/<USER>/DataLad-101/midterm_project (which this submodule entry would resolve to), accessing the subdataset fails.
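How a relative .gitmodules entry ends up pointing at a non-existent repository can be sketched in a few lines. This is a simplified illustration, not Git's actual implementation: the function name resolve_submodule_url is hypothetical, and it only handles ./ and ../ prefixes, which it resolves against the URL the superdataset was cloned from.

```shell
#!/bin/sh
# Hypothetical helper: resolve a submodule URL against the superdataset's URL.
resolve_submodule_url() {
    base=${1%/}    # URL of the superdataset, without a trailing slash
    rel=$2         # url value taken from .gitmodules
    case "$rel" in
        ./*|../*) ;;                          # relative: resolved below
        *) printf '%s\n' "$rel"; return ;;    # absolute URL: used unchanged
    esac
    while :; do
        case "$rel" in
            ./*)  rel=${rel#./} ;;                      # "./" stays at the same level
            ../*) rel=${rel#../}; base=${base%/*} ;;    # "../" moves one level up
            *)    break ;;
        esac
    done
    printf '%s/%s\n' "$base" "$rel"
}

resolve_submodule_url "https://gin.g-node.org/adswa/DataLad-101" "./midterm_project"
# prints https://gin.g-node.org/adswa/DataLad-101/midterm_project

resolve_submodule_url "https://gin.g-node.org/adswa/DataLad-101" \
    "https://github.com/datalad-datasets/longnow-podcasts.git"
# prints https://github.com/datalad-datasets/longnow-podcasts.git
```

The first call reproduces the broken link of midterm_project, while the absolute longnow URL passes through untouched.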
However, since you have already published this dataset (to GitHub), you could update the submodule entry and provide the accessible GitHub URL instead. This can be done via the --set-property <NAME> <VALUE> option of datalad subdatasets [3] (replace the URL shown here with the URL your dataset was published to; likely, you only need to change the user name):
$ datalad subdatasets --contains midterm_project --set-property url https://github.com/adswa/midtermproject
subdataset(ok): midterm_project (dataset)
$ cat .gitmodules
[submodule "recordings/longnow"]
path = recordings/longnow
url = https://github.com/datalad-datasets/longnow-podcasts.git
datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
[submodule "midterm_project"]
path = midterm_project
url = https://github.com/adswa/midtermproject
datalad-id = 1a39abf2-3795-4796-a70f-818ecc4041f9
Handily, the datalad subdatasets command saved this change to the .gitmodules file automatically and the state of the dataset is clean:
$ datalad status
nothing to save, working tree clean
Afterwards, publish these changes to gin and see for yourself how this fixed the problem:
$ datalad push --to gin
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Transfer data
[INFO] Start annex operation
[INFO] Finished
[INFO] Update availability information
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master b27c1f9..b5780e1]
[INFO] Finished push of Dataset(/home/me/dl-101/DataLad-101)
If the subdataset was not published before, you could publish the subdataset to a location of your choice, and modify the .gitmodules entry accordingly.
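To see what such a modification does to the file without touching your dataset, the edit can be tried on a scratch copy of .gitmodules. The sketch below assumes Git is installed; the scratch directory and the target URL https://example.com/me/midterm_project are placeholders, not real locations. It uses the plain git config route, which does not save the change for you.

```shell
#!/bin/sh
# Illustrative only: edit a submodule URL in a scratch .gitmodules file.
set -eu

scratch=$(mktemp -d)    # scratch directory (assumption), not a real dataset
cd "$scratch"

# A reduced copy of the problematic entry from the text
cat > .gitmodules <<'EOF'
[submodule "midterm_project"]
    path = midterm_project
    url = ./midterm_project
EOF

# Show the current, relative URL
git config -f .gitmodules --get submodule.midterm_project.url

# Replace it with the placeholder location
git config -f .gitmodules --replace-all submodule.midterm_project.url \
    "https://example.com/me/midterm_project"

# Read the value back to confirm the change
new_url=$(git config -f .gitmodules --get submodule.midterm_project.url)
echo "$new_url"
```

In a real dataset, a change made this way still has to be saved manually before it can be pushed.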
Footnotes
[1] GIN looks and feels similar to GitHub, and among a number of advantages, it can assign a DOI to your dataset, making it citable. Moreover, its web interface and client are useful tools with a variety of features that are worth checking out as well.
[2] Your private SSH key is incredibly valuable, and it is important to keep it secret! Anyone who gets your private key has access to anything that the public key is protecting. If the private key does not have a passphrase, simply copying this file grants a person access!
[3] Alternatively, you can configure the subdataset’s URL with git config:
$ git config -f .gitmodules --replace-all submodule.midterm_project.url https://github.com/adswa/midtermproject
Remember, though, that this command modifies .gitmodules without an automatic, subsequent save, so you will have to save this change manually.