8.3. Overview: Publishing datasets

The sections YODA-compliant data analysis projects, Beyond shared infrastructure, and Dataset hosting on GIN have each shown you crucial aspects of publishing datasets with datalad push. This section brings them all together.

Note

datalad push requires DataLad version 0.13.0 or higher. Older DataLad versions need to use the datalad publish command. For details on datalad publish, please check out the hidden section at the end of this page.

8.3.1. The general overview

datalad push is the command to turn to when you want to publish datasets. It is capable of publishing all dataset content, i.e., files stored in Git, and files stored with git-annex, to a known dataset sibling.

Note for Git users

datalad push uses git push and git annex copy under the hood. Publication targets need to be either configured remote Git repositories or git-annex special remotes (if they support data upload).

In order to publish a dataset, the dataset needs to have a sibling to push to. This can be, for instance, a GitHub, GitLab, or Gin repository, but it can also be a Remote Indexed Archive (RIA) store for backup or storage of datasets [1], or a regular clone [2].

all of the ways to configure siblings

  • Add an existing repository as a sibling with the datalad siblings command. Here are common examples:

    # to a remote repository
    $ datalad siblings add --name github-repo --url <url.to.github>
    # to a local path
    $ datalad siblings add --name local-sibling --url /path/to/sibling/ds
    # to a clone on an SSH-accessible machine
    $ datalad siblings add --name server-sibling --url [user@]hostname:/path/to/sibling/ds
    
  • Create a sibling on an external hosting service from scratch, right from within your repository: This can be done with the commands create-sibling-github (for GitHub), create-sibling-gitlab (for GitLab), or create-sibling-ria (for a remote indexed archive dataset store [1]). Note that create-sibling-ria can add an existing store as a sibling or create a new one from scratch.

  • Create a sibling on a local or SSH-accessible Unix machine with datalad create-sibling (see the datalad-create-sibling manual).

In order to publish dataset content, DataLad needs to know to which sibling content shall be pushed. This can be specified with the --to option directly from the command line:

$ datalad push --to <sibling>

If you have more than one branch in your dataset, note that a datalad push command will by default update only the current branch. If pushing more than one branch is relevant for your workflow, please check out the hidden section Pushing more than the current branch below.

By default, push will make the last saved state of the dataset available. Consequently, if the sibling is in the same state as the dataset, no push is attempted. Additionally, push will attempt to automatically decide what type of dataset contents are going to be published. With a sibling that has a special remote configured as a publication dependency, or a sibling that contains an annex (such as a Gin repository or a Remote Indexed Archive (RIA) store), both the contents stored in Git (i.e., a dataset’s history) as well as file contents stored in git-annex will be published.

Alternatively, one can enforce particular operations or push a subset of dataset contents. For one, when specifying a path in the datalad push command, only data or changes for those paths are considered for a push. Additionally, one can select a particular mode of operation with the -f/--force option. Several different modes are possible:

  • no-datatransfer: With this option, annexed contents are not published. This means that the sibling will have information on the annexed files’ names, but file contents will not be available, and thus datalad get calls in the sibling would fail.

  • datatransfer: With this option, the underlying git annex copy call to publish file contents is invoked without a --fast option. Usually, the --fast option increases the speed of the operation, as it disables a check whether the sibling already has content. This, however, might skip copying content in some cases. Therefore, --force datatransfer is a slower but more fail-safe option to publish annexed file contents.

  • gitpush: This option triggers a git push --force. Be very careful with this option: if the changes you push conflict with changes that exist in the sibling, the changes that exist in the sibling will be overwritten.

  • all: The final mode, all, combines modes gitpush and datatransfer, thus attempting to really get your dataset contents published.

datalad push can publish available subdatasets recursively if the -r/--recursive flag is specified. Note that this requires that all subdatasets that should be published have sibling names identical to the sibling specified in the top-level push command, or that appropriate default publication targets are configured throughout the dataset hierarchy.
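The options described above can be lined up in a small dry-run sketch. It only prints the commands instead of executing them, so it runs without a dataset. The helper run, the sibling name my-sibling, and the path code/ are hypothetical placeholders. The mode names are the ones listed in this section, so check datalad push --help for the modes your installed version offers.

```shell
# Dry-run helper (hypothetical): record and print each command instead of
# executing it. Replace the function body with "$@" to really run the commands
# inside a dataset that has a sibling of this name.
planned=""
run() {
    planned="${planned}$*
"
    printf '%s\n' "$*"
}

sibling="my-sibling"   # placeholder sibling name

# Default: publish the last saved state; push decides what content to transfer.
run datalad push --to "$sibling"

# Only consider data or changes below a given path (placeholder path).
run datalad push --to "$sibling" code/

# Publish the Git history only, without annexed file contents.
run datalad push --to "$sibling" -f no-datatransfer

# Copy annexed contents without the --fast shortcut (slower, more thorough).
run datalad push --to "$sibling" -f datatransfer

# Force the underlying git push; this can overwrite changes in the sibling.
run datalad push --to "$sibling" -f gitpush

# Combine gitpush and datatransfer.
run datalad push --to "$sibling" -f all

# Also publish available subdatasets that have a sibling of the same name.
run datalad push --to "$sibling" -r
```

Printing the commands first is a cheap way to double-check a --force mode before it touches a sibling, which matters most for gitpush and all.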

Pushing more than the current branch

If you have more than one branch in your dataset, a datalad push --to <sibling> will by default only push the current branch, unless you provide configurations that alter this default. Here are two ways in which this can be achieved:

Option 1: Setting the push.default configuration variable from simple (the default) to matching will configure the dataset such that push pushes all branches that exist under the same name in both the dataset and the sibling. On a dataset level, this can be done with:

$ git config --local push.default matching
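To see what matching means in practice, here is a minimal sketch that uses plain git push as a stand-in for datalad push. It builds a throwaway repository and a bare sibling in a temporary directory. All names (sibling, analysis, local-only, the Example identity) are made up for illustration.

```shell
# Placeholder identity so that commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q --bare "$work/sibling.git"
git init -q "$work/ds"
cd "$work/ds"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git commit -q --allow-empty -m "Initial commit"
git remote add sibling "$work/sibling.git"

# The sibling initially knows two branches: master and analysis.
git branch analysis
git push -q sibling master analysis

# New commits on both shared branches, plus a branch only the dataset has.
git commit -q --allow-empty -m "Update master"
git checkout -q analysis
git commit -q --allow-empty -m "Update analysis"
git branch local-only
git checkout -q master

# With "matching", a plain push updates every branch that exists on both sides.
git config --local push.default matching
git push -q sibling

# List the branches the sibling has after the push.
sibling_branches=$(git --git-dir="$work/sibling.git" for-each-ref --format='%(refname:short)' refs/heads)
echo "$sibling_branches"
```

The branch local-only is not published because the sibling has no branch of that name, while analysis is updated even though master is checked out. The temporary directory in $work can be deleted afterwards.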

Option 2: Tweaking the default push refspec for the dataset allows you to select a range of branches that should be pushed. The link above gives a thorough introduction to the refspec. For a hands-on example, consider how it is done for the published DataLad-101 dataset:

The published version of the handbook is known to the local handbook dataset as a remote called public, and each section of the book is identified with a custom branch name that corresponds to the section name. Whenever an update to the public dataset is pushed, in addition to the master branch, all branches starting with the section identifier sct are pushed automatically as well. This configuration was achieved by specifying these branches (using globbing with *) in the push specification of this remote:

$ git config --local remote.public.push 'refs/heads/sct*'
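The effect of such a push refspec can be tried out with plain Git in a throwaway repository. This is a minimal sketch, and the branch names and the Example identity are invented for illustration. Note that a plain git push only pushes the branches matching the refspec, whereas datalad push, as described above, also pushes the current branch.

```shell
# Placeholder identity so that commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q --bare "$work/public.git"
git init -q "$work/ds"
cd "$work/ds"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git commit -q --allow-empty -m "Initial commit"
git remote add public "$work/public.git"

# Two section branches and one unrelated branch (all names are made up).
git branch sct_create_a_dataset
git branch sct_populate_a_dataset
git branch draft

# Push specification for the remote "public": every branch starting with sct.
git config --local remote.public.push 'refs/heads/sct*'
git push -q public

# List the branches the remote has after the push.
public_branches=$(git --git-dir="$work/public.git" for-each-ref --format='%(refname:short)' refs/heads)
echo "$public_branches"
```

Only the two sct branches arrive in the remote; draft stays local. The temporary directory in $work can be deleted afterwards.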

8.3.2. Setting access control via publishing

There are a number of ways to restrict access to your dataset or individual files of your dataset. One is via the choice of (third party) hosting service for annexed file contents. If you choose a service that only selected people have access to, and publish annexed contents exclusively there, then only those selected people can perform a successful datalad get. On shared file systems you may achieve this via permissions for certain groups or users, and for third party infrastructure you may achieve this via the invitation and permission options of the respective service.

If it is individual files that you do not want to share, you can selectively publish the contents of all files you want others to have, and withhold the data of the files you do not want to share. This can be done by publishing only selected files by providing paths, or by overriding the default push behavior with the -f/--force option. In the latter case, specifying -f no-datatransfer would, for example, not push any annexed contents.

Let’s say you have a dataset with three files:

  • experiment.txt

  • subject_1.dat

  • subject_2.dat

Consider that all of these files are annexed. While the information in experiment.txt is fine for everyone to see, subject_1.dat and subject_2.dat contain personal and potentially identifying data that cannot be shared. Nevertheless, you want collaborators to know that these files exist. The use case “external researcher without data access” (still to be written) details such a scenario and demonstrates how external collaborators (with whom data cannot be shared) can develop scripts against the directory structure and file names of a dataset, submit those scripts to the data owners, and thus still perform an analysis despite not having access to the data.

By publishing only the file contents of experiment.txt with

$ datalad push --to github experiment.txt

only metadata about the file availability of subject_1.dat and subject_2.dat exists in the sibling, but as these files’ annexed data is not published, a datalad get on them will fail. Note, though, that push will publish the complete dataset history (unless you specify a commit range with the --since option; see the manual for more information).

On the datalad publish command

datalad push was introduced in DataLad version 0.13.0 as an alternative to datalad publish, which will be removed in a future DataLad release.

By default, datalad publish publishes the last saved state of the dataset (i.e., its Git history) to a specified sibling:

$ datalad publish --to <sibling>

Like push, it supports recursive publishing across dataset hierarchies (if all datasets have appropriately configured default publication targets or identical sibling names) with the -r/--recursive flag, and it supports the --since option.

The main difference from push lies in publish’s --transfer-data option, which can be set to all, auto, or none and determines whether and how annexed contents should be published if the sibling carries an annex: none will transfer only Git history and no annexed data, auto relies on configurations of the sibling, and all will publish all annexed contents.

By default, when using a plain datalad publish --to <sibling> with no path specification or --transfer-data option, publish will be used in auto mode. In practice, this default will most likely lead to the same outcome as when specifying none: only your dataset’s history, but no annexed contents, will be published.

Note for Git users

On a technical level, the auto option leads to adding --auto to the underlying git annex copy command, which in turn publishes annexed contents based on the git-annex preferred content configuration of the sibling.

In order to publish all annexed contents, one needs to specify --transfer-data all. Alternatively, adding paths to the publish call will publish the specified annexed content (unless --transfer-data none is explicitly added). As yet another alternative, one can add appropriate configuration for git-annex that publish can rely on in auto mode. These configurations allow fine-grained specifications down to the level of file types or individual files. More information on these configurations can be found in git-annex’s documentation.

Footnotes

[1]

RIA siblings are filesystem-based, scalable storage solutions for DataLad datasets. You can find out more about them in the section Remote Indexed Archives for dataset storage and backup.

[2]

If you are unfamiliar with Git, please be aware that cloning a dataset to a different place and subsequently pushing to it can lead to Git error messages if changes are pushed to a currently checked-out branch of the sibling (in technical Git terms: when pushing to a checked-out branch of a non-bare remote repository). As an example, consider what happens if we attempt a datalad push to the sibling roommate that we created in the chapter Collaboration:

$ datalad push --to roommate
[INFO] Determine push target
[INFO] Push refspecs
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start writing objects 
[ERROR] refs/heads/master->roommate:refs/heads/master [remote rejected] (branch is currently checked out) [publish(/home/me/dl-101/DataLad-101)] 
[INFO] Finished push of Dataset(/home/me/dl-101/DataLad-101)
publish(error): . (dataset) [refs/heads/master->roommate:refs/heads/master [remote rejected] (branch is currently checked out)]

Publishing fails with the error message [remote rejected] (branch is currently checked out). This can be prevented with configuration settings in Git versions 2.3 or higher, or by pushing to a branch of the sibling that is not currently checked out.
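One such setting is Git's receive.denyCurrentBranch with the value updateInstead, which is configured in the sibling that receives the push. The following minimal sketch reproduces the rejection and the fix with plain Git in a temporary directory. The repository names mirror the example above, but the file notes.txt and the Example identity are placeholders.

```shell
# Placeholder identity so that commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q "$work/DataLad-101"
cd "$work/DataLad-101"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
echo "first note" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# A regular (non-bare) clone that has master checked out.
git clone -q "$work/DataLad-101" "$work/roommate"
git remote add roommate "$work/roommate"

echo "second note" >> notes.txt
git commit -q -am "Extend notes"

# First attempt: Git refuses to update the checked-out branch of the clone.
if git push -q roommate master 2>/dev/null; then
    first_push=accepted
else
    first_push=rejected
fi

# Allow pushes to the checked-out branch; Git then updates the clone's
# working tree as well, provided that it is clean.
git -C "$work/roommate" config receive.denyCurrentBranch updateInstead

if git push -q roommate master 2>/dev/null; then
    second_push=accepted
else
    second_push=rejected
fi

echo "first push: $first_push, second push: $second_push"
```

Because the setting lives in the receiving repository, it has to be made by whoever owns the sibling. The temporary directory in $work can be deleted afterwards.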