1.7. Pull requests with annexed file contents

Git has revolutionized collaborative processes on text files. The concept is simple: in order to integrate new changes into a main development line, the changes are merged in from a different branch. Individual contributors can clone repositories, make changes in new branches, and send these changes to maintainers, either as emailed patches or as requests to pull from their branches. Hosting platforms like GitHub or GitLab made this even simpler by introducing pull requests (GitHub) or merge requests (GitLab) that contributors can initiate when they push new branches to their forks.

../_images/collab_4_annex_1.svg

Fig. 1.3 A typical collaborative setup. A Git repository (origin) is the central development place, living on a hosting site such as GitHub. Individual contributors fork this repository. Rather than adding commits to the blue main development line, they make changes on a new branch - here denoted with a different color. By requesting that the central repository “pulls in” or “merges” from their branch, e.g., the fix-paths branch, the main line in the central repository grows collaboratively.

This process is not as straightforward when both forks and annexed file contents are involved, for example when a collaborator adds a commit with a new annexed file in their fork. In this scenario, a regular Git branch learns about the file identity (its hash and name), but not about its availability information, for example from which special remote the file content can be obtained. This information is kept on the git-annex branch[2]. The content of the annexed file, meanwhile, likely exists on infrastructure accessible to the contributor, but not necessarily to the owner of the origin dataset.

The problem is thus multi-faceted. For one, as crucial availability information is tracked on the git-annex branch, a “single branch PR” is not able to propagate all relevant information. After merging only the Git branch, without the availability information on the git-annex branch, collaborators know that the file exists, but have no way of getting its content or even querying where it is: the git-annex branch of the main development repository does not know about the file. Propagating information in the git-annex branch requires, for example, a git annex sync or datalad push (which syncs the git-annex branch internally) to the origin repository.
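The symptom and one way to address it can be sketched with two small shell helpers. This is a minimal sketch, not a definitive recipe: the function names and the example path results/output.dat are hypothetical, and share_availability assumes it is run in the contributor's clone with write access to the receiving sibling.

```shell
# Ask git-annex where the content of an annexed file can be obtained.
# In a clone that only received the Git branch of a contribution, this
# is expected to report that no copies of the new file are known.
# $1: path of an annexed file (hypothetical example: results/output.dat)
check_availability() {
    git annex whereis "$1"
}

# Propagate Git history and the git-annex branch (availability
# information) to a sibling, without transferring any file content.
# $1: name of the sibling to push to (e.g., origin)
share_availability() {
    datalad push --to "$1" --data nothing
}
```

For example, check_availability results/output.dat could be run in a clone of origin after the merge, and share_availability origin in the contributor's clone. Availability information alone does not make the content retrievable: the location it points to still needs to be accessible to the receiver.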

../_images/collab_4_annex.svg

Fig. 1.4 A scenario similar to the one before, but with annexed file contents. If a collaborator adds annexed files to a branch in their fork (e.g., in the commit “first results computed!”), merging only the testrun branch is insufficient. While the existence of the file will be known in the origin dataset’s Git history, its annex will have no information on it, making any git-annex operation on this file fail.

Don’t PR the annex branch

Importantly, the issue would not be solved by opening a PR from the git-annex branch of a fork. It is not a branch that one can create a pull request with; in fact, it shouldn’t even be checked out manually. It is automatically managed by git-annex, and the synchronization needed to update availability information across dataset siblings usually requires a git annex sync (which a datalad push runs internally).

The second issue concerns the annexed contents. A contributor needs to ensure that file contents are accessible to whomever they contribute to. This is typically a manual process. Even when annexed contents can be pushed to their fork on annex-aware hosting like Gin or forgejo-aneksajo, the annexed content is not copied between a fork and the main repository when a pull request is merged. This means that the repository from which the PR is made is effectively the only source of the content until it is explicitly transferred. This is different from Git-only PR workflows, where the fork is essentially throwaway. If the annexed contents are hosted on third-party services or multi-user compute infrastructure, contributors need to ensure that their intended receiver has appropriate access permissions.

1.7.1. What are possible solutions?

Especially with the availability of annex-aware hosting like Gin or forgejo-aneksajo, it becomes more and more relevant to know about the issues with annex PRs. But what are the solutions? At the moment, there are mostly workarounds.

The easiest solution is to avoid forks whenever permission management allows it, and to work within the main development dataset instead: PRs from a branch within the dataset, rather than from a fork, are not an issue. With write access, another possibility is to keep using forks, but merge and transfer contents “manually”, i.e., git switch main; git merge feature; [datalad | git-annex] push. Whether having a fork adds anything of value in this situation is questionable, though.
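The “manual” route can be written out as a small shell function. This is a sketch under assumptions: the function name merge_and_publish is hypothetical, the feature branch is assumed to be available locally already, and the annexed content to be transferred is assumed to be present in the local clone.

```shell
# Merge a feature branch into main and publish the result.
# $1: branch to merge (e.g., feature)
# $2: sibling to push to (e.g., origin); requires write access
merge_and_publish() {
    git switch main &&
    git merge "$1" &&
    # pushes main and the git-annex branch, and transfers annexed
    # content (subject to the sibling's preferred-content settings, if any)
    datalad push --to "$2"
}
```

A call such as merge_and_publish feature origin mirrors the command sequence named above. The steps are chained with && so that nothing is pushed if the merge fails.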

Often, however, write access for the contributor is not possible, for example because an organization doesn’t want to make every possible outside contributor a member with the necessary elevated permissions. In these cases, third-party services can act as a transfer solution for annexed contents. The templateflow project came up with an Open Science Framework based approach for external contributors: www.templateflow.org/contributing/submission. With git-annex-aware hosting like forgejo-aneksajo, permission management could also work “the other way around”: the maintainer with write access to the origin dataset that receives a PR gets access to the repository from which the PR is made (read access should suffice). The maintainer can then merge locally, get the files, and push them to their own repository.
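The maintainer-side variant (merge locally, get the files, push) might look roughly like the following. It is a sketch, not a tested procedure for any particular hosting service: the function name, the sibling name contributor, and the example URL and path in the usage note are hypothetical, and the maintainer is assumed to work in a clone of the origin dataset with read access to the contributor's repository.

```shell
# Merge a pull request that adds annexed files, from the maintainer's clone.
# $1: URL of the contributor's repository (read access should suffice)
# $2: branch the pull request was opened from (e.g., fix-paths)
# $3: path of the newly contributed annexed file or directory
merge_pr_with_annex() {
    # register the contributor's repository; "contributor" is an arbitrary name
    git remote add contributor "$1" &&
    # fetch its branches; this includes its git-annex branch, which
    # git-annex typically merges automatically the next time it runs
    git fetch contributor &&
    git switch main &&
    git merge "contributor/$2" &&
    # retrieve the annexed content from wherever it is available
    datalad get "$3" &&
    # publish main, availability information, and content to origin
    datalad push --to origin
}
```

A hypothetical call would be merge_pr_with_annex https://example.com/contributor/dataset.git fix-paths results/output.dat. After the push, origin holds its own copy of the content, so the contributor's repository is no longer the only source.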

Footnotes