8.3. Walk-through: Dropbox as a special remote
Let’s say you’d like to share your complete DataLad-101 dataset with a friend overseas. After all you know about DataLad, you’d like to let more people know about its capabilities. You and your friend, however, do not have access to the same computational infrastructure, and there are also many annexed files, e.g., the PDFs in your dataset, that you’d like your friend to have but that can’t be simply computed or automatically obtained from web sources.
What you would like to do is to provide your friend with a URL from which they can install the dataset and then successfully run datalad get, just as with the many publicly available DataLad datasets such as the longnow podcasts.
As an example, let’s walk through all necessary steps to publish the DataLad-101
dataset to GitHub, and its file contents to Dropbox.
To make this as convenient as possible, we will also set up a publication dependency between the two.
To set up Dropbox as a third-party storage provider, you need to configure a special remote that uses rclone. rclone is a command line program to sync files and directories to and from a large number of commercial providers [1].
The first step is to install rclone on your computer. The installation instructions are straightforward, and the installation is quick if you are on a Unix-based system (macOS or any Linux distribution). Afterwards, run rclone config from the command line to configure rclone to work with Dropbox. Running this command will guide you with an interactive prompt through a ~2 minute configuration of the remote (here we will name the remote “dropbox-for-friends”; the name will be used to refer to it later during the configuration of the dataset we want to publish). The interactive dialog is outlined below, and all parts that require user input are highlighted.
$ rclone config
2019/09/06 13:43:58 NOTICE: Config file "/home/me/.config/rclone/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> dropbox-for-friends
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
1 / 1Fichier
\ "fichier"
2 / Alias for an existing remote
\ "alias"
[...]
8 / Dropbox
\ "dropbox"
[...]
31 / premiumize.me
\ "premiumizeme"
Storage> dropbox
** See help for dropbox backend at: https://rclone.org/dropbox/ **
Dropbox App Client Id
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_id>
Dropbox App Client Secret
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_secret>
Edit advanced config? (y/n)
y) Yes
n) No
y/n> n
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
At this point, a browser window will open and ask you to authorize rclone to manage your Dropbox (or whichever third-party service you selected in the interactive prompt). Accepting will bring you back to the terminal for the final configuration prompts:
Got code
--------------------
[dropbox-for-friends]
type = dropbox
token = {"access_token":"meVHyc[...]",
"token_type":"bearer",
"expiry":"0001-01-01T00:00:00Z"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
dropbox-for-friends dropbox
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
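Before moving on, it can help to confirm that the new remote really ended up in the rclone configuration. The following is a minimal sketch, not part of the original walk-through: it relies on `rclone listremotes` printing one configured remote per line, each followed by a colon, and the helper name `remote_is_configured` is made up for illustration.

```shell
# Hypothetical helper: read a remote listing from stdin and succeed
# if the remote named $1 is part of it.
remote_is_configured() {
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$1:"
}

# Only query rclone when it is actually installed.
if command -v rclone >/dev/null 2>&1; then
    if rclone listremotes | remote_is_configured "dropbox-for-friends"; then
        echo "rclone remote 'dropbox-for-friends' is configured"
    else
        echo "rclone remote 'dropbox-for-friends' not found; rerun 'rclone config'"
    fi
fi
```

Matching the whole line keeps a similarly named remote (say, one with an extra suffix) from being mistaken for the one configured above.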
Once this is done, install git-annex-remote-rclone. It is a wrapper around rclone that makes any destination supported by rclone usable with git-annex. If you are on a recent version of Debian or Ubuntu (or have enabled the NeuroDebian repository), you can get it conveniently via your package manager, e.g., with sudo apt-get install git-annex-remote-rclone. Alternatively, git clone the git-annex-remote-rclone repository to your machine (do not clone it into DataLad-101 but somewhere else on your computer), and add the path to this repository to your $PATH variable. If you clone into /home/user-bob/repos, the commands would look like this [2]:
$ git clone https://github.com/DanielDent/git-annex-remote-rclone.git
$ export PATH="/home/user-bob/repos/git-annex-remote-rclone:$PATH"
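Because an export only affects the current shell session, it is worth checking that the wrapper is actually discoverable before continuing. The sketch below is an illustration under assumptions: `path_contains` is a hypothetical helper, and the directory is the example location /home/user-bob/repos used in this walk-through, so adjust it to wherever you cloned the repository.

```shell
# Hypothetical helper: succeed if directory $1 is one of the
# colon-separated entries of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example location from this walk-through; adjust to your own clone.
tool_dir="/home/user-bob/repos/git-annex-remote-rclone"

if path_contains "$tool_dir"; then
    echo "$tool_dir is on PATH"
else
    echo "$tool_dir is not on PATH in this shell"
fi

# git-annex looks up the program by name, so this is the check that matters.
if command -v git-annex-remote-rclone >/dev/null 2>&1; then
    echo "git-annex-remote-rclone found"
else
    echo "git-annex-remote-rclone not found"
fi
```

If the program is only found in the shell where you ran the export, adding the same export line to your shell's startup file (for example ~/.bashrc, assuming you use bash) should make the change persistent.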
Finally, in the dataset you want to share, run the git annex initremote command. Give the remote a name (it is dropbox-for-friends here), and specify the name of the remote you configured with rclone with the target parameter:
$ git annex initremote dropbox-for-friends type=external externaltype=rclone chunk=50MiB encryption=none target=dropbox-for-friends prefix=my_awesome_dataset
initremote dropbox-for-friends ok
(recording state in git...)
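To make it clearer which parts of this call are fixed and which are your own choices, here is a small sketch that only prints the command instead of running it. The function name `initremote_command` is hypothetical. `type=external externaltype=rclone` selects the git-annex-remote-rclone program, `target` names the remote created with rclone config, `prefix` is the folder on Dropbox, `chunk` is the chunk size, and `encryption=none` stores content unencrypted.

```shell
# Hypothetical helper: print the initremote call for an rclone-backed
# special remote.
#   $1: sibling name, also used as the name of the rclone remote (target)
#   $2: folder prefix on the storage service
#   $3: chunk size (optional, defaults to 50MiB as used above)
initremote_command() {
    printf 'git annex initremote %s type=external externaltype=rclone chunk=%s encryption=none target=%s prefix=%s\n' \
        "$1" "${3:-50MiB}" "$1" "$2"
}

# Reproduces the command from this walk-through.
initremote_command dropbox-for-friends my_awesome_dataset
```

The sketch reuses one name for both the sibling and the rclone remote, as the walk-through does; the two names do not have to be identical.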
What has happened up to this point is that we have configured Dropbox
as a third-party storage service for the annexed contents in the dataset.
On a conceptual, dataset level, your Dropbox folder is now a sibling. The sibling name is the first positional argument after initremote, i.e., “dropbox-for-friends”:
$ datalad siblings
.: here(+) [git]
.: dropbox-for-friends(+) [rclone]
.: roommate(+) [../mock_user/DataLad-101 (git)]
On Dropbox, a new folder will be created for your annexed files.
By default, this folder will be called git-annex, but it can be configured using the prefix=<whatitshouldbecalled> parameter, as done above.
However, this directory on Dropbox is not the location you would refer your friend or a collaborator to.
The representation of the files in the special-remote is not human-readable –
it is a tree of annex objects, and thus looks like a bunch of very weirdly named
folders and files to anyone.
Through this design it becomes possible to chunk files into smaller units (see
the git-annex documentation for more on this),
optionally encrypt content on its way from a local machine to a storage service
(see the git-annex documentation for more on this),
and avoid leakage of information via file names. Therefore, the Dropbox remote is
not a place a real person would take a look at; instead, it is only meant to
be managed and accessed via DataLad/git-annex.
To actually share your dataset with someone, you need to publish it to GitHub, GitLab, or a similar hosting service.
You could, for example, create a sibling of the DataLad-101 dataset on GitHub with the command datalad create-sibling-github.
This will create a new GitHub repository called “DataLad-101” under your account, and configure this repository as a sibling of your dataset called github (exactly like you have done in YODA-compliant data analysis projects with the midterm_project subdataset).
However, in order to be able to link the contents stored in Dropbox, you also need to configure a publication dependency to the dropbox-for-friends sibling. This is done with the --publish-depends <sibling> option.
$ datalad create-sibling-github -d . DataLad-101 \
--publish-depends dropbox-for-friends
[INFO ] Configure additional publication dependency on "dropbox-for-friends"
.: github(-) [https://github.com/<user-name>/DataLad-101.git (git)]
'https://github.com/<user-name>/DataLad-101.git' configured as sibling 'github' for <Dataset path=/home/me/dl-101/DataLad-101>
datalad siblings will again list all available siblings:
$ datalad siblings
.: here(+) [git]
.: dropbox-for-friends(+) [rclone]
.: roommate(+) [../mock_user/DataLad-101 (git)]
.: github(-) [https://github.com/<user-name>/DataLad-101.git (git)]
Note that each sibling has either a + or - attached to its name. This indicates the presence (+) or absence (-) of a remote data annex at this remote. You can see that your github sibling indeed does not have a remote data annex.
Therefore, instead of “only” publishing to this GitHub repository (as done in section YODA-compliant data analysis projects), in order to also publish annex contents, we made publishing to GitHub dependent on the dropbox-for-friends sibling (that has a remote data annex), so that annexed contents are published there first.
Publication dependencies are strictly local configuration
Note that the publication dependency is only established for your own dataset; it is not shared with clones of the dataset. Internally, this configuration is a key-value pair in the section of your remote in .git/config:
[remote "github"]
annex-ignore = true
url = https://github.com/<user-name>/DataLad-101.git
fetch = +refs/heads/*:refs/remotes/github/*
datalad-publish-depends = dropbox-for-friends
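Since this is ordinary Git configuration, it can be read back with git config. The sketch below is an illustration rather than part of the original walk-through: `publish_depends` is a hypothetical helper name, and the key is simply the section and variable shown in the excerpt above.

```shell
# Hypothetical helper: print the publication dependencies recorded for
# sibling $1 in the Git configuration file $2.
publish_depends() {
    # --get-all prints every value in case several dependencies are set
    git config --file "$2" --get-all "remote.$1.datalad-publish-depends"
}

# When run from the root of the dataset, inspect its local configuration.
if [ -f .git/config ]; then
    publish_depends github .git/config \
        || echo "no publication dependency recorded for 'github'"
fi
```

For the configuration shown above, the helper should print dropbox-for-friends. Because the value lives only in the local .git/config, a fresh clone of the dataset would report no dependency.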
With this setup, we can publish the dataset to GitHub. Note how the publication dependency is served first:
$ datalad push --to github
[INFO ] Transferring data to configured publication dependency: 'dropbox-for-friends'
[INFO ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> data to dropbox-for-friends
publish(ok): books/TLCL.pdf (file)
publish(ok): books/byte-of-python.pdf (file)
publish(ok): books/progit.pdf (file)
publish(ok): recordings/interval_logo_small.jpg (file)
publish(ok): recordings/salt_logo_small.jpg (file)
[INFO ] Publishing to configured dependency: 'dropbox-for-friends'
[INFO ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> data to dropbox-for-friends
[INFO ] Publishing <Dataset path=/home/me/dl-101/DataLad-101> to github
Username for 'https://github.com': <user-name>
Password for 'https://<user-name>@github.com':
publish(ok): . (dataset) [pushed to github: ['[new branch]', '[new branch]']]
action summary:
publish (ok: 6)
Afterwards, your dataset can be found on GitHub, and cloned or installed.