8.6. Walk-through: Dataset hosting on GIN
GIN (G-Node infrastructure) is a free data management system designed for comprehensive and reproducible management of scientific data. It is a web-based repository store and provides fine-grained access control to share data. GIN builds on Git and git-annex, and is an easy alternative to other third-party services to host and share your DataLad datasets [1]. It allows you to share datasets and their contents with selected collaborators, or to make them publicly and anonymously available. And even if you prefer to expose and share your datasets via GitHub, you can still use Gin to host your data.
Go further for dataset access from GIN
If you reached this section to find out how to access a DataLad dataset shared on Gin, please skip to the section Sharing and accessing the dataset.
In order to use GIN for hosting and sharing your datasets, you need to register an account and upload your public SSH key for SSH access.
Once you have registered an account on the GIN server by providing your e-mail address, affiliation, and name, and selecting a user name and password, you should upload your SSH key to allow SSH access (you can find an explanation of what SSH keys are and how you can create one in a Find-out-more in the general section Publishing datasets to Git repository hosting). To do this, visit the settings of your user account. On the left-hand side, select the tab “SSH Keys”, and click the button “Add Key”.
You should copy the contents of your public key file into the field labeled content, and enter an arbitrary but informative Key Name, such as “My private work station”. Afterwards, you are done!
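A common mistake at this step is pasting the private key instead of the public one. The following is a minimal sketch of a sanity check, assuming an OpenSSH-style key; the function name `looks_like_public_key` and the key path `~/.ssh/id_ed25519.pub` are illustrative assumptions, not part of GIN or DataLad.

```shell
#!/usr/bin/env bash
# Rough check that a string looks like an OpenSSH *public* key.
# Hypothetical helper for illustration only.
looks_like_public_key() {
    local key="$1"
    case "$key" in
        *"PRIVATE KEY"*) return 1 ;;                            # private key material: never upload
        ssh-ed25519\ *|ssh-rsa\ *|ecdsa-sha2-*\ *) return 0 ;;  # common public key prefixes
        *) return 1 ;;
    esac
}

# Assumed location of an Ed25519 public key; adjust to your setup.
pubkey_file="${HOME}/.ssh/id_ed25519.pub"
if [ -r "$pubkey_file" ] && looks_like_public_key "$(cat "$pubkey_file")"; then
    cat "$pubkey_file"    # this is the text to paste into the "content" field
fi
```

The check is deliberately coarse: it only looks at the key type prefix and refuses anything that contains a private key header.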
8.6.2. Publishing your dataset to GIN
As outlined in the section Publishing datasets to Git repository hosting, there are two ways in which you can publish your dataset to Gin: either 1) by creating a new, empty repository on GIN via the web interface, or 2), if you use DataLad version 0.16 or higher, via the create-sibling-gin command (datalad-create-sibling-gin manual).
1) Via the web interface: If you choose to create a new repository via Gin’s web interface, make sure not to initialize it with a README.
Afterwards, add this repository as a sibling of your dataset. To do this, use the datalad siblings add command and the SSH URL of the repository as shown below. Note that since this is the first time you will be connecting to the GIN server via SSH, you will likely be asked to confirm the connection. This is a safety measure, and you can type “yes” to continue:
$ datalad siblings add -d . \
    --name gin \
    --url git@gin.g-node.org:/adswa/DataLad-101.git
The authenticity of host 'gin.g-node.org (18.104.22.168)' can't be established.
ECDSA key fingerprint is SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
Are you sure you want to continue connecting (yes/no)? yes
[INFO ] Failed to enable annex remote gin, could be a pure git or not accessible
[WARNING] Failed to determine if gin carries annex.
.: gin(-) [git@gin.g-node.org:/adswa/DataLad-101.git (git)]
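SSH URLs on GIN follow the pattern `git@gin.g-node.org:/<user>/<repository>.git`, while the web address of the same repository is `https://gin.g-node.org/<user>/<repository>`. As a small sketch, the two hypothetical helper functions below (their names are made up for this example) assemble both forms from a user name and a repository name:

```shell
#!/usr/bin/env bash
# Hypothetical helpers that assemble GIN repository URLs.
# $1: GIN user (or organization) name, $2: repository name

gin_ssh_url() {
    # SSH URL, as used for push access
    printf 'git@gin.g-node.org:/%s/%s.git\n' "$1" "$2"
}

gin_https_url() {
    # HTTPS URL of the repository page, without a .git suffix
    printf 'https://gin.g-node.org/%s/%s\n' "$1" "$2"
}

gin_ssh_url adswa DataLad-101     # prints git@gin.g-node.org:/adswa/DataLad-101.git
gin_https_url adswa DataLad-101   # prints https://gin.g-node.org/adswa/DataLad-101
```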
2) Via the command line: If you choose to use the create-sibling-gin command, supply the command with a name for the repository, and optionally add a -s/--siblingname [NAME] parameter and an --access-protocol [https|ssh|https-ssh] parameter.
The command has a number of additional useful parameters, so make sure to take a look at datalad-create-sibling-gin.
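To make the shape of such a call concrete, here is a minimal sketch that only prints a create-sibling-gin command line instead of running it. The helper name `build_create_sibling_gin` and its default arguments (sibling name `gin`, protocol `ssh`) are assumptions of this example, not documented DataLad defaults; the actual command needs a GIN account and credentials.

```shell
#!/usr/bin/env bash
# Print (do not run) a create-sibling-gin call.
# Hypothetical helper; its default values are choices made for this sketch only.
# $1: repository name, $2: sibling name, $3: access protocol
build_create_sibling_gin() {
    local repo="$1" sibling="${2:-gin}" protocol="${3:-ssh}"
    case "$protocol" in
        https|ssh|https-ssh) ;;   # the three values named in the text
        *) echo "unsupported access protocol: $protocol" >&2; return 1 ;;
    esac
    printf 'datalad create-sibling-gin %s -s %s --access-protocol %s\n' \
        "$repo" "$sibling" "$protocol"
}

build_create_sibling_gin DataLad-101
# prints: datalad create-sibling-gin DataLad-101 -s gin --access-protocol ssh
```

Printing the command first makes it easy to review the repository name and access protocol before anything is created on the server.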
Afterwards, you can publish your dataset with datalad push. As the repository on GIN supports a dataset annex, no publication dependency on an external data hosting service is necessary, and the dataset contents stored in Git and in git-annex are published to the same place:
$ datalad push --to gin
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Transfer data
copy(ok): books/TLCL.pdf (file) [to gin...]
copy(ok): books/bash_guide.pdf (file) [to gin...]
copy(ok): books/byte-of-python.pdf (file) [to gin...]
copy(ok): books/progit.pdf (file) [to gin...]
[INFO] Update availability information
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
[INFO] Start resolving deltas
publish(ok): . (dataset) [refs/heads/git-annex->gin:refs/heads/git-annex FROM..TO (REDACTED)]
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master [new branch]]
[INFO] Finished push of Dataset(/home/me/dl-101/DataLad-101)
action summary:
  copy (ok: 4)
  publish (ok: 2)
On the GIN web interface you will find all of your dataset – including annexed contents! What is especially cool is that the GIN web interface (unlike GitHub) can even preview your annexed contents.
8.6.4. Subdataset publishing
Just as the input subdataset iris_data in your published dataset was referencing its source on GitHub, the longnow subdataset in your DataLad-101 dataset directly references the original dataset on GitHub. If you click on recordings and then longnow in GIN’s web interface, you will be redirected to the podcast’s original dataset.
midterm_project, however, is not successfully referenced. If you click on it, you get to a 404 error page. The crucial difference between this subdataset and the longnow dataset is its entry in the .gitmodules file of the superdataset:
$ cat .gitmodules
[submodule "recordings/longnow"]
    path = recordings/longnow
    url = https://github.com/datalad-datasets/longnow-podcasts.git
    datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
[submodule "midterm_project"]
    path = midterm_project
    url = ./midterm_project
    datalad-id = e5a3d370-223d-11ea-af8b-e86a64c8054c
While the longnow subdataset is referenced with a valid URL to GitHub, the midterm
project’s URL is a relative path from the root of the superdataset. This is because
the longnow subdataset was installed with datalad clone -d .
(which records the source of the subdataset), whereas the midterm_project dataset
was created as a subdataset with datalad create -d . midterm_project.
Since there is no repository at
https://gin.g-node.org/<USER>/DataLad-101/midterm_project (which this submodule
entry would resolve to), accessing the subdataset fails.
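The resolution described here can be modeled with a few lines of shell. This is a deliberately simplified sketch, not Git's actual resolution logic (it ignores `../` paths, for example); the function name `resolve_submodule_url` is made up for this illustration, and `adswa` stands in for `<USER>`.

```shell
#!/usr/bin/env bash
# Simplified model of how a submodule URL is resolved relative to the
# URL of the superdataset. Hypothetical helper for illustration only.
# $1: URL of the superdataset, $2: url value from .gitmodules
resolve_submodule_url() {
    local super_url="${1%/}" sub_url="$2"
    case "$sub_url" in
        ./*)         printf '%s/%s\n' "$super_url" "${sub_url#./}" ;;  # relative path
        *://*|*@*:*) printf '%s\n' "$sub_url" ;;                       # absolute URL: used as-is
        *)           printf '%s/%s\n' "$super_url" "$sub_url" ;;
    esac
}

resolve_submodule_url "https://gin.g-node.org/adswa/DataLad-101" "./midterm_project"
# prints https://gin.g-node.org/adswa/DataLad-101/midterm_project

resolve_submodule_url "https://gin.g-node.org/adswa/DataLad-101" \
    "https://github.com/datalad-datasets/longnow-podcasts.git"
# prints https://github.com/datalad-datasets/longnow-podcasts.git
```

The first call yields an address at which no repository exists, which is why the link on GIN ends in a 404 page; the second call shows that an absolute URL is unaffected by where the superdataset is hosted.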
However, since you have already published this dataset (to GitHub), you could
update the submodule entry and provide the accessible GitHub URL instead. This
can be done via the --set-property <NAME> <VALUE> option of
datalad subdatasets [2] (replace the URL shown here with the URL
your dataset was published to; likely, you only need to change the user name):
$ datalad subdatasets --contains midterm_project \
    --set-property url https://github.com/adswa/midtermproject
add(ok): .gitmodules (file)
save(ok): . (dataset)
subdataset(ok): midterm_project (dataset)
$ cat .gitmodules
[submodule "recordings/longnow"]
    path = recordings/longnow
    url = https://github.com/datalad-datasets/longnow-podcasts.git
    datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
    datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git
[submodule "midterm_project"]
    path = midterm_project
    url = https://github.com/adswa/midtermproject
    datalad-id = d95bafc8-f2a4-d27b-dcf4-bb99f4bea973
Handily, the datalad subdatasets command saved this change to the
.gitmodules file automatically and the state of the dataset is clean:
$ datalad status
nothing to save, working tree clean
Afterwards, publish these changes to gin and see for yourself how this fixed the problem:
$ datalad push --to gin
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Transfer data
[INFO] Update availability information
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
[INFO] Start writing objects
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master 4c2df2d..e546d0d]
[INFO] Finished push of Dataset(/home/me/dl-101/DataLad-101)
action summary:
  publish (notneeded: 1, ok: 1)
If the subdataset was not published before, you could publish the subdataset to
a location of your choice, and modify the
.gitmodules entry accordingly.
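Before publishing a superdataset, it can help to check which subdataset entries still carry a relative URL. The sketch below uses plain `git config` to read a `.gitmodules` file; the function name `relative_submodule_urls` is an assumption of this example, and the demonstration file is written to a temporary directory rather than a real dataset.

```shell
#!/usr/bin/env bash
# List submodule entries in a .gitmodules file whose URL is a relative path.
# Hypothetical helper; $1 is the path to the file (default: ./.gitmodules).
relative_submodule_urls() {
    local gitmodules="${1:-.gitmodules}" key url
    git config -f "$gitmodules" --get-regexp '^submodule\..*\.url$' |
    while read -r key url; do
        case "$url" in
            ./*|../*) printf '%s %s\n' "$key" "$url" ;;   # relative: not resolvable on GIN
        esac
    done
}

# Demonstration with a throwaway copy of the entries discussed in this section
tmp="$(mktemp -d)"
cat > "$tmp/.gitmodules" <<'EOF'
[submodule "recordings/longnow"]
	path = recordings/longnow
	url = https://github.com/datalad-datasets/longnow-podcasts.git
[submodule "midterm_project"]
	path = midterm_project
	url = ./midterm_project
EOF

relative_submodule_urls "$tmp/.gitmodules"
# prints: submodule.midterm_project.url ./midterm_project
rm -rf "$tmp"
```

Every entry reported this way is a candidate for being published separately and then updated with its new URL.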
8.6.5. Using Gin as a data source behind the scenes
Even if you do not want to point collaborators to yet another hosting site but want to be able to expose your datasets via services they use and know already (such as GitHub or GitLab), Gin can be very useful: You can let Gin perform data hosting in the background by using it as an “autoenabled data source” that a dataset sibling (even if it is published to GitHub or GitLab) can retrieve data from. You will need to have a Gin account and SSH key setup, so please take a look at the first part of this section if you do not yet know how to do this.
Then, follow these steps:
First, create a new repository on Gin (see step by step instructions above).
In your to-be-published dataset, add this repository as a sibling, this time setting the --url and --pushurl arguments explicitly. Make sure to configure an SSH URL as the --pushurl but an HTTPS URL as the --url. Please also note that the HTTPS URL written after --url DOES NOT have the .git suffix. Here is the command:
$ datalad siblings add \
    -d . \
    --name gin \
    --pushurl git@gin.g-node.org:/studyforrest/aggregate-fmri-timeseries.git \
    --url https://gin.g-node.org/studyforrest/aggregate-fmri-timeseries
Run git config --unset-all remote.gin.annex-ignore to prevent git-annex from ignoring this new dataset.
Push your data to the repository on Gin (datalad push --to gin). This pushes the actual state of the repository, including content, but also adjusts the git-annex configuration.
Configure this sibling as a “common data source”. Use the same name as previously in --name (to indicate which sibling you are configuring) and give a new, different name after --as-common-datasrc:
$ datalad siblings configure \
    --name gin \
    --as-common-datasrc gin-src
Push to the repository on Gin again (datalad push --to gin) to make the configuration change known to the Gin sibling.
Publish your dataset to GitHub/GitLab/…, or update an existing published dataset.
Afterwards, datalad get retrieves files from Gin, even if the dataset has been cloned from GitHub.
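The Gin-related steps above can be collected into one small script. This is a sketch under stated assumptions: the function names `run` and `setup_gin_datasource` are made up for this example, the repository is expected to exist on Gin already, and by default the script only prints the commands (dry run) instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch of the Gin steps above. With DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 to execute them in a dataset.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1: Gin user or organization, $2: repository name,
# $3: sibling name, $4: name of the common data source (illustrative defaults)
setup_gin_datasource() {
    local user="$1" repo="$2" sibling="${3:-gin}" srcname="${4:-gin-src}"
    # SSH URL for pushing, HTTPS URL (no .git suffix) for retrieval
    run datalad siblings add -d . --name "$sibling" \
        --pushurl "git@gin.g-node.org:/${user}/${repo}.git" \
        --url "https://gin.g-node.org/${user}/${repo}" || return
    # git reports a non-zero status if the option was never set; that is harmless
    run git config --unset-all "remote.${sibling}.annex-ignore" || true
    run datalad push --to "$sibling" || return
    run datalad siblings configure --name "$sibling" \
        --as-common-datasrc "$srcname" || return
    # second push makes the configuration change known to the Gin sibling
    run datalad push --to "$sibling"
}

setup_gin_datasource studyforrest aggregate-fmri-timeseries
```

Publishing to GitHub or GitLab is left out on purpose, since that step depends on the hosting service you choose.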
Siblings as a common data source
The --as-common-datasrc <name> option configures a sibling as a common data source, which is, in technical terms, an auto-enabled git-annex special remote.
[1] GIN looks and feels similar to GitHub, and among a number of advantages, it can assign a DOI to your dataset, making it citable. Moreover, its web interface and client are useful tools with a variety of features that are worth checking out as well.
[2] Alternatively, you can configure the subdataset's URL with git config:
$ git config -f .gitmodules --replace-all submodule.midterm_project.url https://github.com/adswa/midtermproject
Remember, though, that this command modifies .gitmodules without an automatic, subsequent save, so you will have to save this change manually.