To get the hang of the basics of sharing a dataset, you shared your DataLad-101 dataset with your room mate on a common, local file system. Your lucky room mate now has your notes and can try to catch up in time to still pass the course. Moreover, he can also integrate any further notes or changes you make to your dataset, and stay up to date. This is because a DataLad dataset makes updating shared data a matter of a single datalad update --merge command.
But why does this need to be a one-way street? “I want to provide helpful information for you as well!”, says your room mate. “How could you get any insightful notes that I make in my dataset, or maybe the results of our upcoming mid-term project? It’s a bit unfair that I can get your work, but you cannot get mine.”
Consider, for example, that your room mate might have googled DataLad a bit. On the DataLad homepage he might have found very useful additional information, such as the ascii-cast on dataset nesting. Because he found it very helpful for understanding dataset nesting concepts, he decided to download the shell script that was used to generate this example from GitHub, and saved it in the code/ directory. He does this with the datalad download-url command that you already encountered in the section Create a dataset: this command downloads a file just as wget does, but it can also take a commit message and will save the download right to the history of the dataset that you specify, while recording its origin as provenance information.
Navigate into your dataset copy in mock_user/DataLad-101, and run the following command:
# navigate into the installed copy
$ cd ../mock_user/DataLad-101
# download the shell script and save it in your code/ directory
$ datalad download-url \
  -d . \
  -m "Include nesting demo from datalad website" \
  -O code/nested_repos.sh \
  https://raw.githubusercontent.com/datalad/datalad.org/7e8e39b1f08d0a54ab521586f27ee918b4441d69/content/asciicast/seamless_nested_repos.sh
[INFO] Downloading 'https://raw.githubusercontent.com/datalad/datalad.org/7e8e39b1f08d0a54ab521586f27ee918b4441d69/content/asciicast/seamless_nested_repos.sh' into '/home/me/dl-101/mock_user/DataLad-101/code/nested_repos.sh'
download_url(ok): /home/me/dl-101/mock_user/DataLad-101/code/nested_repos.sh (file)
add(ok): code/nested_repos.sh (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)
Run a quick datalad status:
$ datalad status
Nice, the datalad download-url command saved this download right into the history, and datalad status does not report unsaved modifications! We’ll show an excerpt of the last commit here:
$ git log -n 1 -p
commit d4cce9c59acef0bd9d0a26fa6a11261a272015ca
Author: Elena Piscopia <email@example.com>
Date:   Thu Jan 9 07:52:24 2020 +0100

    Include nesting demo from datalad website

diff --git a/code/nested_repos.sh b/code/nested_repos.sh
new file mode 100644
index 0000000..f84c817
--- /dev/null
+++ b/code/nested_repos.sh
@@ -0,0 +1,59 @@
+#!/bin/bash
+# This script was converted using cast2script from:
+# docs/casts/seamless_nested_repos.sh
+set -e -u
+export GIT_PAGER=cat
+
+# DataLad provides seamless management of nested Git repositories...
+
+# Let's create a dataset
+datalad create demo
+cd demo
+
+# A DataLad dataset is just a Git repo with some initial configuration
+git log --oneline
+
+# We can generate nested datasets, by telling DataLad to register a
+# new dataset in a parent dataset
Suddenly, your room mate has a file change that you do not have. His dataset evolved.
So how do we link back from the copy of the dataset to its origin, such that your room mate’s changes can be included in your dataset? How do we let the original dataset “know” about this copy your room mate has? Do we need to install the installed dataset of our room mate as a copy again?
No, luckily, it’s simpler and less convoluted. What we have to do is to register a datalad sibling: A reference to our room mate’s dataset in our own, original dataset.
Note for Git users
Git repositories can configure clones of a dataset as remotes in order to fetch, pull, or push from and to them. A datalad sibling is the equivalent of a git clone that is configured as a remote.
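For Git users, the idea can be sketched with plain Git commands. The following is a minimal illustration, not what DataLad runs internally: it builds two throwaway repositories in a temporary directory (the names original and roommate_copy are made up for this example) and registers the clone as a remote of the repository it was cloned from.

```shell
#!/bin/sh
# Git-level sketch of "registering a sibling".
# All directory and remote names below are illustrative assumptions.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# the "original" repository with a single commit
git init -q original
git -C original symbolic-ref HEAD refs/heads/master
echo "first note" > original/notes.txt
git -C original add notes.txt
git -C original commit -q -m "add a note"

# the room mate's copy is a clone; it knows the original as "origin"
git clone -q original roommate_copy

# the original knows nothing about the clone until it is registered as a remote
git -C original remote add roommate ../roommate_copy

remotes=$(git -C original remote)
url=$(git -C original remote get-url roommate)
echo "remotes of original: $remotes"
echo "location of roommate: $url"
```

The clone automatically points back at its source, but the reverse link has to be added by hand. This is the step that datalad siblings add takes care of for a dataset.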
Let’s see how this is done.
First of all, navigate back into the original dataset. In the original dataset, “add” a “sibling” by using the datalad siblings command (datalad-siblings manual). The command consists of the base command, datalad siblings; an action, in this case add; a path to the root of the dataset, -d .; a name for the sibling, --name roommate; and a URL or path to the sibling, --url ../mock_user/DataLad-101. This registers your room mate’s DataLad-101 as a “sibling” (we will call it “roommate”) to your own DataLad-101 dataset.
$ cd ../../DataLad-101
# add a sibling
$ datalad siblings add -d . --name roommate --url ../mock_user/DataLad-101
.: roommate(+) [../mock_user/DataLad-101 (git)]
There are a few confusing parts about this command. For one, do not be surprised by the --url argument: it’s called “URL”, but it can be a path as well. Also, do not forget to give a name to your dataset’s sibling. Without the --name argument the command will fail. The reason is that the default name of a sibling, if no name is given, is the host name of the specified URL; as you provide a path and not a URL, there is no host name to take as a default.
As you can see in the command output, the addition of a sibling succeeded: roommate(+) [../mock_user/DataLad-101 (git)] means that your room mate’s dataset is now known to your own dataset as “roommate”. Running datalad siblings without an action confirms this:
$ datalad siblings
.: here(+) [git]
.: roommate(+) [../mock_user/DataLad-101 (git)]
This command lists all known siblings of the dataset. Your room mate’s dataset appears in the resulting list under the name “roommate” that you gave it.
Find out more: What if I mistyped the name or want to remove the sibling?
You can remove a sibling using datalad siblings remove -s roommate. If you mistyped the name, remove the sibling and add it again under the correct name.
The fact that the DataLad-101 dataset now has a sibling means that we can also datalad update from this sibling. Awesome!
Your room mate previously ran a datalad update --merge in the section Stay up to date. This got him changes he knew you made into a dataset that he so far did not change. This meant that nothing unexpected would happen with the datalad update --merge.
But consider the current case: Your room mate made changes to his dataset, but you do not necessarily know which. You also made changes to your dataset in the meantime, and added a note on datalad update. How would you know that his changes and your changes are not in conflict with each other?
This scenario is where a plain datalad update becomes useful. If you run a plain datalad update, DataLad will query the sibling for changes, and store those changes in a safe place in your own dataset, but it will not yet integrate them into your dataset. This gives you a chance to see whether you actually want to have the changes your room mate made.
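The “safe place” can be illustrated at the Git level. The sketch below assumes that a plain datalad update behaves roughly like git fetch (the command output in this section does report “Fetching updates”); it uses throwaway repositories with made-up names and shows that a fetched file is recorded in the repository without appearing in the working directory.

```shell
#!/bin/sh
# Git-level sketch: fetching stores a sibling's changes without integrating them.
# Repository names and file contents are illustrative assumptions.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# original repository with one script
git init -q original
git -C original symbolic-ref HEAD refs/heads/master
mkdir original/code
echo 'echo "titles"' > original/code/list_titles.sh
git -C original add code
git -C original commit -q -m "add title listing script"

# the room mate clones it and adds a second script
git clone -q original roommate_copy
echo 'echo "nesting demo"' > roommate_copy/code/nested_repos.sh
git -C roommate_copy add code
git -C roommate_copy commit -q -m "add nesting demo"

# register the clone and fetch from it, without merging
git -C original remote add roommate ../roommate_copy
git -C original fetch -q roommate

# the working directory is unchanged ...
listing=$(ls original/code)
# ... but the fetched state is available under roommate/master
fetched=$(git -C original ls-tree -r --name-only roommate/master)

echo "working directory: $listing"
echo "fetched state:"
echo "$fetched"
```

Nothing in your own files changes until you decide to merge, which is what makes this intermediate step safe.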
Let’s see how it’s done. First, run a plain datalad update without the --merge option:
$ datalad update -s roommate
[INFO] Fetching updates for <Dataset path=/home/me/dl-101/DataLad-101>
update(ok): . (dataset)
Note that we supplied the sibling’s name with the -s option. This is good practice, and allows you to be precise about where you want to get updates from. It would also have worked without this specification (just as a bare datalad update --merge worked for your room mate), because there is only one other known location.
This plain datalad update informs you that it “fetched” updates from the dataset. The changes, however, are not yet visible: the script that he added is not yet in your code/ directory:
$ ls code/
list_titles.sh
So where is the file? It is in a different branch of your dataset.
If you do not use Git, the concept of a branch can be a big source of confusion. Later sections of this book will elaborate a bit more on what branches are and how to work with them, but for now envision branches as a bunch of drawers on your desk. The paperwork that you have in front of you right on your desk is your dataset as you currently see it. The drawers instead hold documents that you are in principle working on, just not now: maybe different versions of the paperwork you currently have in front of you, or maybe other files than the ones currently on your desk.
Imagine that a datalad update created a small drawer, placed all of the changed or added files from the sibling inside, and put it on your desk. You can now take a look into that drawer to see whether you want to have the changes right in front of you.
The drawer is a branch, and in this case it is called remotes/roommate/master. To look inside of it you can git checkout BRANCHNAME, or you can diff between the branch (your drawer) and the dataset as it is currently in front of you (your desk). We will do the latter, and leave the former for a different lecture:
$ datalad diff --to remotes/roommate/master
   added: code/nested_repos.sh (file)
modified: notes.txt (file)
This shows us that there is an additional file, and it also shows us that there is a difference in notes.txt! Let’s ask git diff to show us what the differences are in detail:
$ git diff remotes/roommate/master
diff --git a/code/nested_repos.sh b/code/nested_repos.sh
deleted file mode 100644
index f84c817..0000000
--- a/code/nested_repos.sh
+++ /dev/null
@@ -1,59 +0,0 @@
-#!/bin/bash
-# This script was converted using cast2script from:
-# docs/casts/seamless_nested_repos.sh
-set -e -u
-export GIT_PAGER=cat
-
-# DataLad provides seamless management of nested Git repositories...
-
-# Let's create a dataset
-datalad create demo
-cd demo
-
-# A DataLad dataset is just a Git repo with some initial configuration
-git log --oneline
-
-# We can generate nested datasets, by telling DataLad to register a
-# new dataset in a parent dataset
-datalad create -d . sub1
-
-# A subdataset is nothing more than regular Git submodule
-git submodule
-
-# Of course subdatasets can be nested
-datalad create -d . sub1/justadir/sub2
-
-# Unlike Git, DataLad automatically takes care of committing all
-# changes associated with the added subdataset up to the given
-# parent dataset
-git status
-
-# Let's create some content in the deepest subdataset
-mkdir sub1/justadir/sub2/anotherdir
-touch sub1/justadir/sub2/anotherdir/afile
-
-# Git can only tell us that something underneath the top-most
-# subdataset was modified
-git status
-
-# DataLad saves us from further investigation
-datalad diff -r
-
-# Like Git, it can report individual untracked files, but also across
-# repository boundaries
-datalad diff -r --report-untracked all
-
-# Adding this new content with Git or git-annex would be an exercise
-git add sub1/justadir/sub2/anotherdir/afile || true
-
-# DataLad does not require users to determine the correct repository
-# in the tree
-datalad add -d . sub1/justadir/sub2/anotherdir/afile
-
-# Again, all associated changes in the entire dataset tree, up to
-# the given parent dataset, were committed
-git status
-
-# DataLad's 'diff' is able to report the changes from these related
-# commits throughout the repository tree
-datalad diff --revision @~1 -r
diff --git a/notes.txt b/notes.txt
index 7d3dc4c..0483229 100644
--- a/notes.txt
+++ b/notes.txt
@@ -60,3 +60,7 @@ The command "git annex whereis PATH" lists the repositories that have
 the file content of an annexed file. When using "datalad get" to retrieve
 file content, those repositories will be queried.
 
+To update a shared dataset, run the command "datalad update --merge".
+This command will query its origin for changes, and integrate the
+changes into the dataset.
+
Let’s digress into what is shown here.
We are comparing the current state of your dataset against
the current state of your room mate’s dataset. Everything marked with
- is a change that your room mate has, but not you: This is the
script that he downloaded!
Everything that is marked with a
+ is a change that you have,
but not your room mate: It is the additional note on datalad update
you made in your own dataset in the previous section.
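The direction of such a comparison is easy to mix up, so here is a small self-contained sketch with plain Git. It uses throwaway repositories with made-up names and contents, and asks only for a per-file summary: a file that exists in the fetched state but not in your working directory is reported as deleted (D), and a file that differs is reported as modified (M).

```shell
#!/bin/sh
# Git-level sketch: reading the direction of "git diff <fetched branch>".
# Repository names and file contents are illustrative assumptions.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# shared starting point: one notes file
git init -q original
git -C original symbolic-ref HEAD refs/heads/master
echo "note on git annex whereis" > original/notes.txt
git -C original add notes.txt
git -C original commit -q -m "add note on git annex whereis"
git clone -q original roommate_copy

# the room mate adds a script ...
mkdir roommate_copy/code
echo 'echo "nesting demo"' > roommate_copy/code/nested_repos.sh
git -C roommate_copy add code
git -C roommate_copy commit -q -m "add nesting demo"

# ... while you extend your notes
echo "note on datalad update" >> original/notes.txt
git -C original commit -q -a -m "add note about datalad update"

# fetch the room mate's state and compare it against your working directory
git -C original remote add roommate ../roommate_copy
git -C original fetch -q roommate
status=$(git -C original diff --name-status roommate/master)

# D: only the room mate has this file; M: the file differs between the two
echo "$status"
```

The comparison runs from the fetched state to your current state, which is why the room mate’s new script shows up as a deletion rather than an addition.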
Cool! So now that you know what the changes are that your room mate made, you can safely datalad update --merge them to integrate them into your dataset. In technical terms you will “merge the branch remotes/roommate/master into master”. But the details of this will be stated in a standalone section later.
Note that the fact that your room mate does not have the note on datalad update does not influence your note. It will not get deleted by the merge. You do not set your dataset to the state of your room mate’s dataset, but you incorporate all changes he made – which is only the addition of the script.
$ datalad update --merge -s roommate
[INFO] Fetching updates for <Dataset path=/home/me/dl-101/DataLad-101>
[INFO] Applying updates to <Dataset path=/home/me/dl-101/DataLad-101>
update(ok): . (dataset)
The exciting question is whether your room mate’s change is now also part of your own dataset. Let’s list the contents of the code/ directory and also peek into the history:
$ ls code/
list_titles.sh
nested_repos.sh

$ git log --oneline
9163dea Merge remote-tracking branch 'refs/remotes/roommate/master'
533ea56 add note about datalad update
d4cce9c Include nesting demo from datalad website
a1d07c5 add note on git annex whereis
a27eb75 add note about cloning from paths and recursive datalad get
Woohoo! Here it is: the script now also exists in your own dataset. You can see the commit that your room mate made when he saved the script, and you can also see a commit that records how you merged your room mate’s dataset changes into your own dataset. If you do not use Git, the commit message of this latter commit might for now contain many unfamiliar words, but a later section will get into the details of what “merge”, “branch”, “refs”, and “master” mean.
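For the curious, the effect of the merge can be reproduced with plain Git in a throwaway setup. This is a sketch under the assumption that datalad update --merge amounts to a fetch followed by a merge of the fetched branch; all repository names and file contents are made up. It shows that both your note and the room mate’s script survive, and that the newest commit has the room mate’s commit as its second parent.

```shell
#!/bin/sh
# Git-level sketch: merging a fetched branch keeps the changes of both sides.
# Repository names and file contents are illustrative assumptions.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# shared starting point
git init -q original
git -C original symbolic-ref HEAD refs/heads/master
echo "note on git annex whereis" > original/notes.txt
git -C original add notes.txt
git -C original commit -q -m "add note on git annex whereis"
git clone -q original roommate_copy

# diverging work: a new script on one side, a new note on the other
mkdir roommate_copy/code
echo 'echo "nesting demo"' > roommate_copy/code/nested_repos.sh
git -C roommate_copy add code
git -C roommate_copy commit -q -m "add nesting demo"
echo "note on datalad update" >> original/notes.txt
git -C original commit -q -a -m "add note about datalad update"

# fetch, then merge the fetched branch into master
git -C original remote add roommate ../roommate_copy
git -C original fetch -q roommate
git -C original merge -q --no-edit roommate/master

# the merge commit's second parent is the room mate's latest commit
second_parent=$(git -C original rev-parse HEAD^2)
roommate_tip=$(git -C original rev-parse roommate/master)

git -C original log --oneline
ls original/code
```

Because the two sides changed different files, Git can combine them without asking any questions. Changes to the same lines of the same file would instead need to be resolved by hand.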
For now, you’re happy to have the changes your room mate made available. This is how it should be! You helped him, and he helps you. Awesome! There actually is a wonderful word for it: Collaboration. Thus, without noticing, you have successfully collaborated for the first time using DataLad datasets.
Create a note about this, and save it.
$ cat << EOT >> notes.txt
To update from a dataset with a shared history, you need to add this
dataset as a sibling to your dataset. "Adding a sibling" means
providing DataLad with info about the location of a dataset, and a
name for it.
Afterwards, a "datalad update --merge -s name" will integrate the
changes made to the sibling into the dataset. A safe step in between
is to do a "datalad update -s name" and checkout the changes with
"git/datalad diff" to remotes/origin/master

EOT
$ datalad save -m "Add note on adding siblings"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)