4.4. Stay up to date

All of what you have seen about sharing dataset was really cool, and for the most part also surprisingly intuitive. datalad run commands or file retrieval worked exactly as you imagined it to work, and you begin to think that slowly but steadily you’re getting a feel about how DataLad really works.

But to be honest, so far, sharing the dataset with DataLad was also remarkably unexciting given that you already knew most of the dataset magic that your room mate currently is still mesmerized about. To be honest, you’re not yet certain whether sharing data with DataLad really improves your life up until this point. After all, you could have just copied your directory into your mock_user directory and this would have resulted in about the same output, right?

What we will be looking into now is how shared DataLad datasets can be updated.

Remember that you added some notes on datalad clone, datalad get, and git annex whereis into the original DataLad-101?

This is a change that is not reflected in your “shared” installation in ../mock_user/DataLad-101:

# we are inside the installed copy
$ cat notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

The command 'datalad clone URL/PATH [PATH]'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '-d' option.

There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

The datalad run command can record the impact a script or command has on a Dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

You should specify all files that a command takes as input with an -i/--input flag. These
files will be retrieved prior to the command execution. Any content that is modified or
produced by the command should be specified with an -o/--output flag. Upon a run or rerun
of the command, the contents of these files will get unlocked so that they can be modified.

Important! If the dataset is not "clean" (a datalad status output is empty),
datalad run will not work - you will have to save modifications present in your
dataset.
A suboptimal alternative is the --explicit flag,
used to record only those changes done
to the files listed with --output flags.

But the original intention of sharing the dataset with your room mate was to give him access to your notes. How does he get the notes that you have added in the last two sections, for example?

This installed copy of DataLad-101 knows its origin, i.e., the place it was installed from. Using this information, it can query the original dataset whether any changes happened since the last time it checked, and if so, retrieve and integrate them.

This is done with the datalad update --merge command (datalad-update manual).

$ datalad update --merge
[INFO] Fetching updates for Dataset(/home/me/dl-101/mock_user/DataLad-101)
[INFO] Start enumerating objects
[INFO] Start counting objects
[INFO] Start compressing objects
merge(ok): . (dataset) [Merged origin/master]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)

Importantly, run this command either within the specific (sub)dataset you are interested in, or provide a path to the root of the dataset you are interested in with the -d/--dataset flag. If you would run the command within the longnow subdataset, you would query this subdatasets’ origin for updates, not the original DataLad-101 dataset.

Let’s check the contents in notes.txt to see whether the previously missing changes are now present:

$ cat notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

The command 'datalad clone URL/PATH [PATH]'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '-d' option.

There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

The datalad run command can record the impact a script or command has on a Dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

You should specify all files that a command takes as input with an -i/--input flag. These
files will be retrieved prior to the command execution. Any content that is modified or
produced by the command should be specified with an -o/--output flag. Upon a run or rerun
of the command, the contents of these files will get unlocked so that they can be modified.

Important! If the dataset is not "clean" (a datalad status output is empty),
datalad run will not work - you will have to save modifications present in your
dataset.
A suboptimal alternative is the --explicit flag,
used to record only those changes done
to the files listed with --output flags.

A source to install a dataset from can also be a path,
for example as in "datalad clone ../DataLad-101".

Just as in creating datasets, you can add a
description on the location of the new dataset clone
with the -D/--description option.

Note that subdatasets will not be installed by default,
but are only registered in the superdataset -- you will
have to do a "datalad get -n PATH/TO/SUBDATASET"
to install the subdataset for file availability meta data.
The -n/--no-data options prevents that file contents are
also downloaded.

Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer
way to proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"

The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to retrieve
file content, those repositories will be queried.

Wohoo, the contents are here!

Therefore, sharing DataLad datasets by installing them enables you to update the datasets content should the original datasets’ content change – in only a single command. How cool is that?!

Conclude this section by adding a note about updating a dataset to your own DataLad-101 dataset:

# navigate back:
$ cd ../../DataLad-101

# write the note
$ cat << EOT >> notes.txt
To update a shared dataset, run the command "datalad update --merge".
This command will query its origin for changes, and integrate the
changes into the dataset.

EOT
# save the changes

$ datalad save -m "add note about datalad update"
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

PS: You might wonder whether there is also a sole datalad update command. Yes, there is – if you are a Git-user and know about branches and merging you can read the Note for Git-users below. However, a thorough explanation and demonstration will be in the next section.

Note for Git users

datalad update is the DataLad equivalent of a git fetch, datalad update --merge is the DataLad equivalent of a git pull. Upon a simple datalad update, the remote information is available on a branch separate from the master branch – in most cases this will be remotes/origin/master. You can git checkout this branch or run git diff to explore the changes and identify potential merge conflicts.