8.9. The datalad push command
Previous sections on publishing DataLad datasets each covered crucial aspects of dataset publishing with
datalad push
(manual). This section brings them all together.
8.9.1. The general overview
datalad push
is the command to turn to when you want to publish datasets.
It is capable of publishing all dataset content, i.e., files stored in Git,
and files stored with git-annex, to a known dataset sibling.
Push internals
The datalad push command uses git push and git annex copy under the hood. Publication targets need to be either configured remote Git repositories or git-annex special remotes (if they support data upload).
In order to publish a dataset, the dataset needs to have a sibling to push to. This, for instance, can be a GitHub, GitLab, or GIN repository, but it can also be a Remote Indexed Archive (RIA) store for backup or storage of datasets[1], or a regular clone.
all of the ways to configure siblings
Add an existing repository as a sibling with the datalad siblings (manual) command. Here are common examples:
$ # to a remote repository
$ datalad siblings add --name github-repo --url <url.to.github>
$ # to a local path
$ datalad siblings add --name local-sibling --url /path/to/sibling/ds
$ # to a clone on an SSH-accessible machine
$ datalad siblings add --name server-sibling --url [user@]hostname:/path/to/sibling/ds
Create a sibling on an external hosting service from scratch, right from within your repository. This can be done with the commands datalad create-sibling-github (manual) (for GitHub), datalad create-sibling-gitlab (manual) (for GitLab), or datalad create-sibling-ria (manual) (for a Remote Indexed Archive (RIA) store). Note that datalad create-sibling-ria can add an existing store as a sibling or create a new one from scratch.
Create a sibling on a local or SSH-accessible Unix machine with datalad create-sibling (manual).
In order to publish dataset content, DataLad needs to know to which sibling
content shall be pushed. This can be specified with the --to
option directly
from the command line:
$ datalad push --to <sibling>
If you have more than one branch in your dataset, note that a
datalad push
command will by default update only the current branch.
If updating multiple branches is relevant for your workflow, please check out
the find-out-more about this.
By default, datalad push
will make the last saved state of the dataset
available. Consequently, if the sibling is in the same state as the dataset,
no push is attempted.
Additionally, datalad push
will attempt to automatically decide what type
of dataset contents are going to be published. With a sibling that has a
special remote configured as a publication dependency,
or a sibling that contains an annex (such as a GIN repository or a
Remote Indexed Archive (RIA) store), both the contents
stored in Git (i.e., a dataset’s history) as well as file contents stored in
git-annex will be published unless dataset configurations overrule this.
Alternatively, one can enforce particular operations or push a subset of dataset
contents. For one, when specifying a path in the datalad push
command,
only data or changes for those paths are considered for a push.
Additionally, one can select a particular mode of operation with the --data
option.
Several different modes are possible:
nothing: With this option, annexed contents are not published. This means that the sibling will have information on the annexed files’ names, but file contents will not be available, and thus datalad get calls in the sibling would fail.
anything: Transfer all annexed contents.
auto: With this option, the decision which data is transferred is based on configurations that can determine rules on a per-file and per-sibling level. On a technical level, the git annex copy call to publish file contents is called with its --auto option. With this option, only data that satisfies specific git-annex configurations gets transferred. Those configurations could be numcopies settings (the number of copies available at different remotes), or wanted settings (preferred contents for a specific remote), and need to be created by a user[2] with git-annex commands. If you have files you want to keep private, or do not need published, these configurations are very useful.
auto-if-wanted (default): Unless a wanted or numcopies configuration exists in the dataset, all content is published. Should a wanted or numcopies configuration exist, the command enables --auto in the underlying git annex copy call.
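The data modes and the path restriction described above can be sketched as small shell helpers. This is a minimal sketch and not part of DataLad: the function names are hypothetical, the sibling name is passed as the first argument, and the preferred-content expression include=*.pdf is only an illustrative assumption.

```shell
# Hypothetical helper functions; the first argument is always the sibling name.

# Publish only the dataset history; annexed file content is not transferred
push_history_only() {
    datalad push --to "$1" --data nothing
}

# Publish all annexed file content
push_all_content() {
    datalad push --to "$1" --data anything
}

# Consider only the given paths for the push
push_paths() {
    sibling=$1
    shift
    datalad push --to "$sibling" "$@"
}

# Declare preferred content for the sibling with git-annex (illustrative
# expression), then let that configuration decide what gets transferred
push_only_pdfs() {
    git annex wanted "$1" "include=*.pdf"
    datalad push --to "$1" --data auto
}
```

Inside a dataset, a call such as push_paths roommate books/TLCL.pdf would then push only changes and data for that file to the sibling roommate.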
Beyond different modes of transferring data, the -f/--force option allows you to force specific publishing operations with three different modes.
Be careful when using it, as its modes may overrule safety protections or optimizations:
checkdatapresent: With this option, the underlying git annex copy call to publish file contents is invoked without a --fast option. Usually, the --fast option increases the speed of the operation, as it disables a check whether the sibling already has content. This, however, might skip copying content in some cases. Therefore, --force checkdatapresent is a slower, but more fail-safe option to publish annexed file contents.
gitpush: This option triggers a git push --force. Be very careful using this option! If the changes on the dataset conflict with the changes that exist in the sibling, the changes in the sibling will be overwritten.
all: The final mode, all, combines all force modes, thus attempting to really get your dataset contents published by any means.
datalad push
can publish available subdatasets recursively if the
-r/--recursive
flag is specified. Note that this requires that all subdatasets
that should be published have sibling names identical to the sibling specified in
the top-level datalad push
command, or that appropriate default publication
targets are configured throughout the dataset hierarchy.
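As a brief sketch of the safer force mode and of recursive publishing described above (the helper names are hypothetical and the sibling name is passed as the first argument):

```shell
# Hypothetical helper functions; the first argument is the sibling name.

# Check whether the sibling already has each file's content (slower, but safer)
push_with_presence_check() {
    datalad push --to "$1" --force checkdatapresent
}

# Publish the dataset together with all available subdatasets; this assumes
# each subdataset has a sibling of the same name or a default target configured
push_recursively() {
    datalad push --to "$1" --recursive
}
```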
Pushing more than the current branch
If you have more than one branch in your
dataset, a datalad push --to <sibling>
will by default only push
the current branch, unless you provide configurations that alter
this default. Here are two ways in which this can be achieved:
Option 1: Setting the push.default
configuration variable from
simple
(the default) to matching
will configure the dataset such that
datalad push
pushes all branches to the sibling.
A concrete example: On a dataset level, this can be done using
$ git config --local push.default matching
Option 2: Tweaking the default push refspec for the dataset allows you to select a range of branches that should be pushed. Git's documentation on the refspec gives a thorough introduction. For a hands-on example, consider how it is done for the published DataLad-101 dataset:
The published version of the handbook is known to the local handbook dataset
as a remote called public
, and each section of the book is identified
with a custom branch name that corresponds to the section name. Whenever an
update to the public dataset is pushed, in addition to the main
branch, all branches starting with the section identifier sct
are pushed
automatically as well. This configuration was achieved by specifying these branches
(using globbing with *
) in the push
specification of this remote:
$ git config --local remote.public.push 'refs/heads/sct*'
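To get a feeling for which branches such a glob selects, the following sketch builds a throwaway repository with plain Git and lists the local branches matching the configured pattern. The repository, the branch names, and the commit identity are all hypothetical, and git for-each-ref is used here only as a preview; Git's own refspec matching during a push may differ in corner cases.

```shell
# Throwaway repository with hypothetical branch names
workdir=$(mktemp -d)
git init -q "$workdir/handbook"
cd "$workdir/handbook"
git symbolic-ref HEAD refs/heads/main
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial state"
git branch sct_create_a_dataset
git branch sct_populate_a_dataset
git branch draft

# Same configuration as above, for a remote called "public"
git config --local remote.public.push 'refs/heads/sct*'

# Read the setting back and list the local branches that match the glob
configured=$(git config --get remote.public.push)
selected=$(git for-each-ref --format='%(refname:short)' "$configured")
echo "$selected"
```

In this sketch, only the two sct branches are listed; draft and main do not match the glob.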
8.9.2. Pushing errors
If you are unfamiliar with Git, please be aware that cloning a dataset to a different place and subsequently pushing to it can lead to Git error messages if changes are pushed to a currently checked out branch of the sibling (in technical Git terms: When pushing to a checked-out branch of a non-bare repository remote).
As an example, consider what happens if we attempt a datalad push
to the sibling roommate
that we created in the chapter Collaboration:
$ datalad push --to roommate
copy(ok): books/TLCL.pdf (file) [to roommate...]
copy(ok): books/bash_guide.pdf (file) [to roommate...]
copy(ok): books/byte-of-python.pdf (file) [to roommate...]
publish(ok): . (dataset) [refs/heads/git-annex->roommate:refs/heads/git-annex ✂FROM✂..✂TO✂]
publish(error): . (dataset) [refs/heads/main->roommate:refs/heads/main [remote rejected] (branch is currently checked out)]
action summary:
copy (ok: 3)
publish (error: 1, ok: 1)
Publishing fails with the error message [remote rejected] (branch is currently checked out).
This can be prevented with configuration settings in Git versions 2.3 or higher, or by pushing to a branch of the sibling that is currently not checked out.
For more information on this, and other error messages during push, please check out the section How to get help.
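The rejection and one possible configuration fix can be reproduced with plain Git in throwaway directories. This is a minimal sketch under assumptions: the directory names and commit identity are hypothetical, and receive.denyCurrentBranch with the value updateInstead (available since Git 2.3) is assumed to be the kind of setting meant above. It has to be set in the sibling, not in the dataset that is pushed from.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A non-bare repository standing in for the sibling, with "main" checked out
git init -q sibling
git -C sibling symbolic-ref HEAD refs/heads/main
echo "shared notes" > sibling/README
git -C sibling add README
git -C sibling -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial state"

# A clone standing in for the dataset we push from, with one new change
git clone -q sibling dataset
echo "a new note" > dataset/notes.txt
git -C dataset add notes.txt
git -C dataset -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add notes"

# By default, Git refuses to update the checked-out branch of the sibling
if git -C dataset push -q origin main 2>/dev/null; then
    first_push=accepted
else
    first_push=rejected
fi

# In the sibling: allow pushes to update the checked-out branch and its worktree
git -C sibling config receive.denyCurrentBranch updateInstead

if git -C dataset push -q origin main 2>/dev/null; then
    second_push=accepted
else
    second_push=rejected
fi
echo "first push: $first_push, second push: $second_push"
```

Note that updateInstead only works as long as the working tree of the sibling is clean; with uncommitted changes in the sibling, Git still rejects the push.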
Footnotes