8.3. Walk-through: Dropbox as a special remote

Let’s say you’d like to share your complete DataLad-101 dataset with a friend overseas. After all you know about DataLad, you’d like to let more people know about its capabilities. You and your friend, however, do not have access to the same computational infrastructure, and there are also many annexed files, e.g., the PDFs in your dataset, that you’d like your friend to have but that can’t be simply computed or automatically obtained from web sources. What you would like to do is to provide your friend with a URL to install a dataset from and successfully run datalad get (manual), just as with the many publicly available DataLad datasets such as the longnow podcasts.

As an example, let’s walk through all necessary steps to publish the DataLad-101 dataset to GitHub, and its file contents to Dropbox. To make this as convenient as possible, we will also set up a publication dependency between the two.

To set up Dropbox as a third party storage provide you need to configure a special-remote called git-annex-remote-rclone. It is a command line program to sync files and directories to and from a large number of commercial providers[1].

  • The first step is to install rclone on your computer. The installation instructions are straightforward and the installation is quick if you are on a Unix-based system (macOS or any Linux distribution).

  • Afterwards, run rclone config from the command line to configure rclone to work with Dropbox. Running this command will a guide you with an interactive prompt through a ~2 minute configuration of the remote (here we will name the remote “dropbox-for-friends” – the name will be used to refer to it later during the configuration of the dataset we want to publish). The interactive dialog is outlined below, and all parts that require user input are highlighted.

$ rclone config
 2019/09/06 13:43:58 NOTICE: Config file "/home/me/.config/rclone/rclone.conf" not found - using defaults
 No remotes found - make a new one
 n) New remote
 s) Set configuration password
 q) Quit config
 n/s/q> n
 name> dropbox-for-friends
 Type of storage to configure.
 Enter a string value. Press Enter for the default ("").
 Choose a number from below, or type in your own value
  1 / 1Fichier
    \ "fichier"
  2 / Alias for an existing remote
    \ "alias"
  8 / Dropbox
    \ "dropbox"
 31 / premiumize.me
    \ "premiumizeme"
 Storage> dropbox
 ** See help for dropbox backend at: https://rclone.org/dropbox/ **

 Dropbox App Client Id
 Leave blank normally.
 Enter a string value. Press Enter for the default ("").
 Dropbox App Client Secret
 Leave blank normally.
 Enter a string value. Press Enter for the default ("").
 Edit advanced config? (y/n)
 y) Yes
 n) No
 y/n> n
 If your browser doesn't open automatically go to the following link:
 Log in and authorize rclone for access
 Waiting for code...
  • At this point, this will open a browser and ask you to authorize rclone to manage your Dropbox, or any other third-party service you have selected in the interactive prompt. Accepting will bring you back into the terminal to the final configuration prompts:

Got code
type = dropbox
token = {"access_token":"meVHyc[...]",
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
dropbox-for-friends  dropbox

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
  • Once this is done, install git-annex-remote-rclone. It is a wrapper around rclone that makes any destination supported by rclone usable with git-annex. If you are on a recent version of Debian or Ubuntu (or have enabled the NeuroDebian repository), you can get it conveniently via your package manager, e.g., with sudo apt-get install git-annex-remote-rclone. Alternatively, git clone the git-annex-remote-rclone repository to your machine (do not clone it into DataLad-101 but somewhere else on your computer), and copy the path to this repository into your $PATH variable. If you clone into /home/user-bob/repos, the command would look like this[2]:

    $ git clone https://github.com/DanielDent/git-annex-remote-rclone.git
    $ export PATH="/home/user-bob/repos/git-annex-remote-rclone:$PATH"
  • Finally, in the dataset you want to share, run the git annex initremote (manual) command. Give the remote a name (it is dropbox-for-friends here), and specify the name of the remote you configured with rclone with the target parameters:

$ git annex initremote dropbox-for-friends type=external externaltype=rclone chunk=50MiB encryption=none target=dropbox-for-friends prefix=my_awesome_dataset

initremote dropbox-for-friends ok
(recording state in git...)

What has happened up to this point is that we have configured Dropbox as a third-party storage service for the annexed contents in the dataset. On a conceptual, dataset level, your Dropbox folder is now a sibling – the sibling name is the first positional argument after initremote, i.e., “dropbox-for-friends”:

$ datalad siblings
 .: here(+) [git]
 .: dropbox-for-friends(+) [rclone]
 .: roommate(+) [../mock_user/DataLad-101 (git)]

On Dropbox, a new folder will be created for your annexed files. By default, this folder will be called git-annex, but it can be configured using the --prefix=<whatitshouldbecalled> option, as done above. However, this directory on Dropbox is not the location you would refer your friend or a collaborator to. The representation of the files in the special-remote is not human-readable – it is a tree of annex objects, and thus looks like a bunch of very weirdly named folders and files to anyone. Through this design it becomes possible to chunk files into smaller units (see the git-annex documentation for more on this), optionally encrypt content on its way from a local machine to a storage service (see the git-annex documentation for more on this), and avoid leakage of information via file names. Therefore, the Dropbox remote is not a places a real person would take a look at, instead they are only meant to be managed and accessed via DataLad/git-annex.

To actually share your dataset with someone, you need to publish it to Github, Gitlab, or a similar hosting service.

You could, for example, create a sibling of the DataLad-101 dataset on GitHub with the command datalad create-sibling-github (manual). This will create a new GitHub repository called “DataLad-101” under your account, and configure this repository as a sibling of your dataset called github (exactly like you have done in YODA-compliant data analysis projects with the midterm_project subdataset). However, in order to be able to link the contents stored in Dropbox, you also need to configure a publication dependency to the dropbox-for-friends sibling – this is done with the publish-depends <sibling> option.

$ datalad create-sibling-github -d . DataLad-101 \
  --publish-depends dropbox-for-friends
  [INFO   ] Configure additional publication dependency on "dropbox-for-friends"
  .: github(-) [https://github.com/<user-name>/DataLad-101.git (git)]
  'https://github.com/<user-name>/DataLad-101.git' configured as sibling 'github' for <Dataset path=/home/me/dl-101/DataLad-101>

datalad siblings (manual) will again list all available siblings:

$ datalad siblings
 .: here(+) [git]
 .: dropbox-for-friends(+) [rclone]
 .: roommate(+) [../mock_user/DataLad-101 (git)]
 .: github(-) [https://github.com/<user-name>/DataLad-101.git (git)]

Note that each sibling has either a + or - attached to its name. This indicates the presence (+) or absence (-) of a remote data annex at this remote. You can see that your github sibling indeed does not have a remote data annex. Therefore, instead of “only” publishing to this GitHub repository (as done in section YODA-compliant data analysis projects), in order to also publish annex contents, we made publishing to GitHub dependent on the dropbox-for-friends sibling (that has a remote data annex), so that annexed contents are published there first.

Publication dependencies are strictly local configuration

Note that the publication dependency is only established for your own dataset, it is not shared with clones of the dataset. Internally, this configuration is a key value pair in the section of your remote in .git/config:

[remote "github"]
   annex-ignore = true
   url = https://github.com/<user-name>/DataLad-101.git
   fetch = +refs/heads/*:refs/remotes/github/*
   datalad-publish-depends = dropbox-for-friends

With this setup, we can publish the dataset to GitHub. Note how the publication dependency is served first:

$ datalad push --to github
[INFO   ] Transferring data to configured publication dependency: 'dropbox-for-friends'
[INFO   ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> data to dropbox-for-friends
publish(ok): books/TLCL.pdf (file)
publish(ok): books/byte-of-python.pdf (file)
publish(ok): books/progit.pdf (file)
publish(ok): recordings/interval_logo_small.jpg (file)
publish(ok): recordings/salt_logo_small.jpg (file)
[INFO   ] Publishing to configured dependency: 'dropbox-for-friends'
[INFO   ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> data to dropbox-for-friends
[INFO   ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> to github
Username for 'https://github.com': <user-name>
Password for 'https://<user-name>@github.com':
publish(ok): . (dataset) [pushed to github: ['[new branch]', '[new branch]']]
action summary:
  publish (ok: 6)

Afterwards, your dataset can be found on GitHub, and cloned or installed.

8.3.1. From the perspective of whom you share your dataset with…

If your friend would now want to get your dataset including the annexed contents, and you made sure that they can access the Dropbox folder with the annexed files (e.g., by sharing an access link), here is what they would have to do:

If the repository is on GitHub, a datalad clone (manual) with the URL will install the dataset:

$ datalad clone https://github.com/<user-name>/DataLad-101.git
[INFO   ] Cloning https://github.com/<user-name>/DataLad-101.git [1 other candidates] into '/Users/awagner/Documents/DataLad-101'
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] access to 1 dataset sibling dropbox-for-friends not auto-enabled, enable with:
|         datalad siblings -d "/Users/awagner/Documents/DataLad-101" enable -s dropbox-for-friends
install(ok): /Users/awagner/Documents/DataLad-101 (dataset)

Pay attention to one crucial information in this output:

[INFO   ] access to 1 dataset sibling dropbox-for-friends not auto-enabled, enable with:
|         datalad siblings -d "/Users/<user-name>/Documents/DataLad-101" enable -s dropbox-for-friends

This means that someone who wants to access the data from dropbox needs to enable the special remote. For this, this person first needs to install and configure rclone as well: Since rclone is the protocol with which annexed data can be transferred from and to Dropbox, anyone who needs annexed data from Dropbox needs this special remote. Therefore, the first steps are the same as before:

  • Install rclone (as described above).

  • Run rclone config to configure rclone to work with Dropbox (as described above). It is important to name the remote identically - in the example above, it would need to be “dropbox-for-friends”. This means: You need to communicate the name of your special remote to your friend, and they have to give it the same name as the one configured in the dataset). (There are efforts towards extracting this information automatically from datasets, but for the time being, this is an important detail to keep in mind).

  • install git-annex-remote-rclone (as described above).

After this is done, you can execute what DataLad’s output message suggests to “enable” this special remote (inside of the installed DataLad-101):

$ datalad siblings -d "/Users/awagner/Documents/DataLad-101" \
  enable -s dropbox-for-friends
.: dropbox-for-friends(?) [git]

And once this is done, you can get any annexed file contents, for example, the books, or the cropped logos from chapter DataLad, run!:

$ datalad get books/TLCL.pdf
get(ok): /home/some/other/user/DataLad-101/books/TLCL.pdf (file) [from dropbox-for-friends]