Back and forth in time

Almost everyone has inadvertently deleted or overwritten files at some point with a hasty operation that caused data loss, or at least trouble re-obtaining or restoring data. With DataLad, no mistakes are forever: One powerful feature of datasets is the ability to revert data to a previous state and thus view earlier content or correct mistakes. As long as the content was version controlled (i.e., tracked), it is possible to look at previous states of the data, or revert changes – even years after they happened – thanks to the underlying version control system Git.

To get a glimpse into how to work with the history of a dataset, today’s lecture has an external Git-expert as a guest lecturer. “I do not have enough time to go through all the details in only one lecture. But I’ll give you the basics, and an idea of what is possible. Always remember: Just google what you need. You will find thousands of helpful tutorials or questions on Stack Overflow right away. Even experts will constantly seek help to find out which Git command to use, and how to use it.”, he reassures with a wink.

The basis of working with the history is to look at it with tools such as tig, gitk, or simply the git log command. The most important information in an entry (commit) in the history is the shasum (or hash) associated with it. This hash is how dataset modifications in the history are identified, and with this hash you can communicate with DataLad or Git about these modifications or version states [1]. Here is an excerpt from the DataLad-101 history to show a few abbreviated hashes of the 15 most recent commits [2]:

$ git log -15 --oneline
cc27fbf add container and execute analysis within container
12e9c4a finished my midterm project!
8356a71 [DATALAD] Recorded changes
1948a37 add note on DataLads procedures
a19bd24 add note on configurations and git config
cda7d02 Add note on adding siblings
ba1b576 Merge remote-tracking branch 'refs/remotes/roommate/master'
462a468 Include nesting demo from datalad website
a746756 add note about datalad update
f530382 add note on git annex whereis
53e05c4 add note about installing from paths and recursive installations
78ea5d5 add note on clean datasets
231d15d [DATALAD RUNCMD] Resize logo for slides
8542779 [DATALAD RUNCMD] Resize logo for slides
a9ead0b add additional notes on run options
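To make the role of the hash a bit more tangible, here is a minimal sketch that uses plain Git in a throwaway repository (not the DataLad-101 dataset). The file name, commit message, and user identity are made up for illustration. It shows that the abbreviated hash from a one-line log and the full hash identify the same commit.

```shell
#!/bin/sh
# Sketch: an abbreviated hash and the full hash point to the same commit.
# Everything happens in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "a first note" > notes.txt
git add notes.txt
git commit -q -m "add a first note"

full=$(git rev-parse HEAD)                 # full hash of the latest commit
short=$(git rev-parse --short HEAD)        # abbreviated form, as in --oneline
resolved=$(git rev-parse "$short")         # Git expands the short form again

echo "short:    $short"
echo "resolved: $resolved"
```

Either form can be handed to commands such as git log, git reset, or git checkout.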

“I’ll let you people direct this lecture”, the guest lecturer proposes. “You tell me what you would be interested in doing, and I’ll show you how it’s done. For the rest of the lecture, call me Google!”

Fixing (empty) commit messages

From the back of the lecture hall comes a question you’re really glad someone asked: “It has happened to me that I accidentally did a datalad save and forgot to specify the commit message, how can I fix this?”. The room nods in agreement – apparently, others have run into this premature slip of the Enter key as well.

Let’s demonstrate a simple example. First, let’s create some random files. Do this right in your dataset.

$ cat << EOT > Gitjoke1.txt
Git knows what you did last summer!
EOT

$ cat << EOT > Gitjoke2.txt
Knock knock. Who's there? Git.
Git-who?
Sorry, 'who' is not a git command - did you mean 'show'?
EOT

$ cat << EOT > Gitjoke3.txt
In Soviet Russia, git commits YOU!
EOT

This will generate three new files in your dataset. Run a datalad status to verify this:

$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke2.txt (file)
untracked: Gitjoke3.txt (file)

And now:

$ datalad save
add(ok): Gitjoke1.txt (file)
add(ok): Gitjoke2.txt (file)
add(ok): Gitjoke3.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)

Whooops! A datalad save without a commit message that saved all of the files.

$ git log -p -1
commit 2c3fcfd52ff5e5f970baf8d8360447399c53666e
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:10:14 2019 +0100

    [DATALAD] Recorded changes

diff --git a/Gitjoke1.txt b/Gitjoke1.txt
new file mode 100644
index 0000000..d7e1359
--- /dev/null
+++ b/Gitjoke1.txt
@@ -0,0 +1 @@
+Git knows what you did last summer!
diff --git a/Gitjoke2.txt b/Gitjoke2.txt
new file mode 100644
index 0000000..51beecb
--- /dev/null
+++ b/Gitjoke2.txt
@@ -0,0 +1,3 @@
+Knock knock. Who's there? Git.
+Git-who?
+Sorry, 'who' is not a git command - did you mean 'show'?
diff --git a/Gitjoke3.txt b/Gitjoke3.txt
new file mode 100644
index 0000000..7b83d95
--- /dev/null
+++ b/Gitjoke3.txt
@@ -0,0 +1 @@
+In Soviet Russia, git commits YOU!

As expected, all of the modifications present prior to the command are saved into the most recent commit, and the commit message DataLad provides by default, [DATALAD] Recorded changes, is not very helpful.

Changing the commit message of the most recent commit can be done with the command git commit --amend. Running this command will open an editor (the default, as configured in Git), and allow you to change the commit message.

Try running the git commit --amend command right now and give the commit a new commit message (you can just delete the one created by DataLad in the editor)!
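If you would rather not open an editor, the new message can also be passed directly with the -m option. The following is a minimal sketch in a throwaway repository using plain Git; the file name, messages, and user identity are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: replace the most recent commit message without opening an editor.
# Runs in a temporary repository; file name and messages are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "Git knows what you did last summer!" > joke.txt
git add joke.txt
git commit -q -m "[DATALAD] Recorded changes"

# Rewrite the message of the most recent commit
git commit -q --amend -m "add a Git joke"

subject=$(git log -1 --format=%s)          # subject line of the latest commit
commits=$(git rev-list --count HEAD)       # amending replaces the commit, it does not add one
echo "$subject"
```

Note that amending replaces the commit with a new one, so its hash changes.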

Find out more: Changing the commit messages of not-the-most-recent commits

The git commit --amend command lets you rewrite the commit message of the most recent commit. If you need to rewrite commit messages of older commits, however, you can do so during a so-called “interactive rebase” [4]. The command for this is

$ git rebase -i HEAD~N

where N specifies how far back you want to rewrite commits. git rebase -i HEAD~3, for example, lets you apply changes to any number of commit messages within the last three commits.

Note

Be aware that an interactive rebase lets you rewrite history. This can lead to confusion or worse if the history you are rewriting is shared with others, e.g., in a collaborative project. Be also aware that rewriting history that is pushed/published (e.g., to GitHub) will require a force-push!

Running this command gives you a list of the N most recent commits in your text editor (which may be vim!), sorted with the most recent commit on the bottom. This is what it may look like:

pick 8503f26 Add note on adding siblings
pick 23f0a52 add note on configurations and git config
pick c42cba4 add note on DataLads procedures

# Rebase b259ce8..c42cba4 onto b259ce8 (3 commands)
#
# Commands:
# p, pick <commit> = use commit
# r, reword <commit> = use commit, but edit the commit message
# e, edit <commit> = use commit, but stop for amending
# s, squash <commit> = use commit, but meld into previous commit
# f, fixup <commit> = like "squash", but discard this commit's log message
# x, exec <command> = run command (the rest of the line) using shell
# b, break = stop here (continue rebase later with 'git rebase --continue')
# d, drop <commit> = remove commit
# l, label <label> = label current HEAD with a name

An interactive rebase allows you to apply various modifying actions to any number of commits in the list. Below the list are descriptions of these different actions. Among them is “reword”, which lets you “edit the commit message”. To apply this action and reword the top-most commit message in this list (8503f26 Add note on adding siblings, three commits back in the history), exchange the word pick at the beginning of the line with the word reword or simply r like this:

r 8503f26 Add note on adding siblings

If you want to reword more than one commit message, exchange several picks. Any commit with the word pick at the beginning of the line will be kept as is. Once you are done, save and close the editor. This will sequentially open up a new editor for each commit you want to reword. In it, you will be able to change the commit message. Save to proceed to the next commit message until the rebase is complete. But be careful not to delete any lines in the above editor view: An interactive rebase can be dangerous, and if you remove a line, this commit will be lost! [5]

Untracking accidentally saved contents (tracked in Git)

The next question comes from the front: “It happened that I forgot to give a path to the datalad save command when I wanted to only start tracking a very specific file. Other times I just didn’t remember that additional, untracked files existed in the dataset and saved unaware of those. I know that it is good practice to only save those changes together that belong together, so is there a way to disentangle an accidental datalad save again?”

Let’s say instead of saving all three previously untracked Git jokes you intended to save only one of those files. What we want to achieve is to keep all of the files and their contents in the dataset, but get them out of the history into an untracked state again, and save them individually afterwards.

Important

Note that this is a case with text files (stored in Git)! For accidental annexing of files, please make sure to check out the next paragraph!

This is a task for the git reset command. It essentially allows you to undo commits by resetting the history of a dataset to an earlier version. git reset comes with several modes that determine its exact behavior, but the relevant one for this aim is --mixed [3]. Specifying the command:

$ git reset --mixed COMMIT

will preserve all changes made to files since the specified commit in the dataset, but remove them from the dataset’s history. This means the commits made after COMMIT (not including COMMIT itself) will not be in your history anymore; their changes will instead show up as “untracked files” or “unsaved changes”. In other words, the modifications you made in these commits that are “undone” will still be present in your dataset, just not written to the history anymore. Let’s try this to get a feel for it.

The COMMIT in the command can either be a hash or a reference with the HEAD pointer.

Find out more: Git terminology: branches and HEADs?

A Git repository (and thus any DataLad dataset) is built up as a tree of commits. A branch is a named pointer (reference) to a commit, and allows you to isolate developments. The default branch is called master. HEAD is a pointer to the branch you are currently on, and thus to the last commit in the given branch.

[Figure: ../_images/git_branch_HEAD.png]

Using HEAD, you can identify the most recent commit, or count backwards starting from the most recent commit. HEAD~1 is the parent of the most recent commit, i.e., one commit back (f30ab in the figure above). Apart from the notation HEAD~N, there is also HEAD^N. It is used less frequently, because it selects the N-th parent of a commit rather than counting N commits back, and is thus of importance primarily in the case of merge commits. This post explains the details well.
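To get a feel for references relative to HEAD, here is a minimal sketch in a throwaway repository with three commits, using plain Git. The file name, commit messages, and user identity are hypothetical. It resolves HEAD~1 and HEAD~2 to their commits and compares HEAD~1 with HEAD^.

```shell
#!/bin/sh
# Sketch: count backwards from the most recent commit with HEAD~N.
# Runs in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Build a small linear history of three commits
for n in 1 2 3; do
    echo "note $n" > notes.txt
    git add notes.txt
    git commit -q -m "commit $n"
done

parent_subject=$(git log -1 --format=%s HEAD~1)        # one commit back
grandparent_subject=$(git log -1 --format=%s HEAD~2)   # two commits back

# In a history without merges, HEAD~1 and HEAD^ name the same commit
tilde=$(git rev-parse HEAD~1)
caret=$(git rev-parse HEAD^)

echo "HEAD~1: $parent_subject"
echo "HEAD~2: $grandparent_subject"
```

The two notations only start to differ once merge commits with more than one parent are involved.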

Let’s stay with the hash, and reset to the commit prior to saving the Gitjokes.

First, find out the shasum, and afterwards, reset to it.

$ git log -n 3 --oneline
2c3fcfd [DATALAD] Recorded changes
cc27fbf add container and execute analysis within container
12e9c4a finished my midterm project!
$ git reset --mixed cc27fbfbf8a100609d87d16b8522e96d75e747e0

Let’s see what has happened. First, let’s check the history:

$ git log -n 2 --oneline
cc27fbf add container and execute analysis within container
12e9c4a finished my midterm project!

As you can see, the commit in which the jokes were tracked is not in the history anymore! Go on to see what datalad status reports:

$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke2.txt (file)
untracked: Gitjoke3.txt (file)

Nice, the files are present, and untracked again. Do they still contain their content? We will read all of them with cat:

$ cat Gitjoke*
Git knows what you did last summer!
Knock knock. Who's there? Git.
Git-who?
Sorry, 'who' is not a git command - did you mean 'show'?
In Soviet Russia, git commits YOU!

Great. Now we can go ahead and save only the file we intended to track:

$ datalad save -m "save my favorite Git joke" Gitjoke2.txt
add(ok): Gitjoke2.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Finally, let’s check how the history looks afterwards:

$ git log -2
commit dd375f24f51f8a1b64bc5eaf1365c615550c5d94
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:10:15 2019 +0100

    save my favorite Git joke

commit cc27fbfbf8a100609d87d16b8522e96d75e747e0
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:09:58 2019 +0100

    add container and execute analysis within container

Wow! You have rewritten history [4]!
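The same kind of reset also works with a reference relative to HEAD instead of a hash. The following is a minimal sketch with plain Git in a throwaway repository, not in the DataLad-101 dataset; the file names, commit messages, and user identity are made up for illustration.

```shell
#!/bin/sh
# Sketch: undo the most recent commit with "git reset --mixed HEAD~1"
# while keeping its files in the working directory.
# Runs in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "something I meant to save" > keep.txt
git add keep.txt
git commit -q -m "intended commit"

echo "something saved by accident" > accident.txt
git add accident.txt
git commit -q -m "accidental commit"

# Go one commit back in the history, but leave all files as they are
git reset -q --mixed HEAD~1

subject=$(git log -1 --format=%s)          # latest commit after the reset
status=$(git status --porcelain)           # "??" marks untracked files
echo "$subject"
echo "$status"
```

The file from the undone commit is still on disk and shows up as untracked, so it can be saved again on its own.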

Untracking accidentally saved contents (stored in git-annex)

The previous git reset undid the tracking of text files. However, those files are stored in Git, and thus their content is also stored in Git. Files that are annexed, however, have their content stored in git-annex: What is stored in the history is not the file itself, but a symlink pointing to the location of the file content in the dataset’s annex. This has consequences for a git reset command: Reverting a save of a file that is annexed would revert the save of the symlink into Git, but it will not revert the annexing of the file. Thus, what will be left in the dataset is an untracked symlink.

To undo an accidental save that annexed a file, the annexed file has to be “unannexed” first with a datalad unlock command.

We will simulate such a situation by creating a PDF file that gets annexed with an accidental datalad save:

# create an empty pdf file
$ convert xc:none -page Letter apdffile.pdf
# accidentally save it
$ datalad save
add(ok): Gitjoke1.txt (file)
add(ok): Gitjoke3.txt (file)
add(ok): apdffile.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)

This accidental save has thus added not only the text files stored in Git, but also a PDF file to the history of the dataset. As an ls -l reveals, the PDF file has been annexed and is thus a symlink:

$ ls -l apdffile.pdf
lrwxrwxrwx 1 adina adina 122 Dec 10 09:10 apdffile.pdf -> .git/annex/objects/1Z/2Z/MD5E-s1842--1ee8dcdbc60241abb86d48ba2951091d.pdf/MD5E-s1842--1ee8dcdbc60241abb86d48ba2951091d.pdf

Prior to resetting, the PDF file has to be unannexed. To unannex files, i.e., get the contents out of the object tree, the datalad unlock command is relevant:

$ datalad unlock apdffile.pdf
unlock(ok): apdffile.pdf (file)

The file is now no longer symlinked:

$ ls -l apdffile.pdf
-rw-r--r-- 1 adina adina 1842 Dec 10 09:10 apdffile.pdf

Finally, git reset --mixed can be used to revert the accidental save. Again, find out the shasum first, and afterwards, reset to it.

$ git log -n 3 --oneline
0adc3de [DATALAD] Recorded changes
dd375f2 save my favorite Git joke
cc27fbf add container and execute analysis within container
$ git reset --mixed dd375f24f51f8a1b64bc5eaf1365c615550c5d94

To see what has happened, let’s check the history:

$ git log -n 2 --oneline
dd375f2 save my favorite Git joke
cc27fbf add container and execute analysis within container

… and also the status of the dataset:

$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke3.txt (file)
untracked: apdffile.pdf (file)

The accidental save has been undone, and the file is present as untracked content again. As before, this action has not been recorded in your history.

Viewing previous versions of files and datasets

The next question is truly magical: How does one see data as it was at a previous state in history?

This magic trick can be performed with the git checkout command. It is a very heavily used command for various tasks, and among other things it can send you back in time to view the state of a dataset at the time of a specific commit.

Let’s say you want to find out which notes you took in the first few chapters of the handbook. Find a commit shasum in your history to specify the point in time you want to go back to:

$ git log -n 20 --oneline
dd375f2 save my favorite Git joke
cc27fbf add container and execute analysis within container
12e9c4a finished my midterm project!
8356a71 [DATALAD] Recorded changes
1948a37 add note on DataLads procedures
a19bd24 add note on configurations and git config
cda7d02 Add note on adding siblings
ba1b576 Merge remote-tracking branch 'refs/remotes/roommate/master'
462a468 Include nesting demo from datalad website
a746756 add note about datalad update
f530382 add note on git annex whereis
53e05c4 add note about installing from paths and recursive installations
78ea5d5 add note on clean datasets
231d15d [DATALAD RUNCMD] Resize logo for slides
8542779 [DATALAD RUNCMD] Resize logo for slides
a9ead0b add additional notes on run options
63b2b51 [DATALAD RUNCMD] convert -resize 450x450 recordings/longn...
d01aaf1 resized picture by hand
0e38ca9 [DATALAD RUNCMD] convert -resize 400x400 recordings/longn...
d52fad8 add note on basic datalad run and datalad rerun

Let’s go back in time to commit 63b2b51, the 17th entry in the log above:

$ git checkout 63b2b5159ab0549ee40ae240d567124f8a6b42c5
warning: unable to rmdir 'midterm_project': Directory not empty
Note: switching to '63b2b5159ab0549ee40ae240d567124f8a6b42c5'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 63b2b51 [DATALAD RUNCMD] convert -resize 450x450 recordings/longn...

How did your notes.txt file look at this point?

$ cat notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

The command 'datalad install [--source] PATH'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '-d' option.

There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

The datalad run command can record the impact a script or command has on a Dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

Neat, isn’t it? By checking out a commit shasum you can explore a previous state of a dataset’s history. And this does not only apply to simple text files, but to every type of file in your dataset, regardless of size. The checkout command, however, led to something that Git calls a “detached HEAD state”. While this sounds scary, a git checkout master will bring you back into the most recent version of your dataset and get you out of the “detached HEAD state”:

$ git checkout master
Previous HEAD position was 63b2b51 [DATALAD RUNCMD] convert -resize 450x450 recordings/longn...
Switched to branch 'master'

Note one very important thing: The previously untracked files are still there.

$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke3.txt (file)
untracked: apdffile.pdf (file)
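The round trip into a detached HEAD and back can be sketched with plain Git in a throwaway repository. File name, commit messages, and user identity are hypothetical, and because the default branch may not be called master everywhere, the sketch remembers the branch name instead of hard-coding it.

```shell
#!/bin/sh
# Sketch: visit an older commit (detached HEAD) and return to the branch.
# Runs in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "old note" > notes.txt
git add notes.txt
git commit -q -m "first note"
echo "new note" >> notes.txt
git commit -q -a -m "second note"

branch=$(git symbolic-ref --short HEAD)    # remember the current branch name

git checkout -q HEAD~1                     # travel one commit back in time
old_content=$(cat notes.txt)

# "git symbolic-ref -q HEAD" fails if HEAD does not point to a branch
if git symbolic-ref -q HEAD > /dev/null; then
    state="on a branch"
else
    state="detached"
fi

git checkout -q "$branch"                  # return to the most recent state
last_line=$(tail -n 1 notes.txt)

echo "state while looking around: $state"
echo "old content: $old_content"
echo "last line after returning: $last_line"
```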

The contents of notes.txt will now be the most recent version again:

$ cat notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

The command 'datalad install [--source] PATH'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '-d' option.

There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

The datalad run command can record the impact a script or command has on a Dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

You should specify all files that a command takes as input with an -i/--input flag. These
files will be retrieved prior to the command execution. Any content that is modified or
produced by the command should be specified with an -o/--output flag. Upon a run or rerun
of the command, the contents of these files will get unlocked so that they can be modified.

Important! If the dataset is not "clean" (a datalad status output is empty),
datalad run will not work - you will have to save modifications present in your
dataset.
A suboptimal alternative is the --explicit flag,
used to record only those changes done
to the files listed with --output flags.

A source to install a dataset from can also be a path,
for example as in "datalad install -s ../DataLad-101".
As when installing datasets before, make sure to add a
description on the location of the dataset to be
installed, and, if you want, a path to where the dataset
should be installed under which name.

Note that subdatasets will not be installed by default --
you will have to do a plain
"datalad install PATH/TO/SUBDATASET", or specify the
-r/--recursive option in the install command:
"datalad install -s ../DataLad-101 -r".

A recursive installation would however install all
installed subdatasets, so a safer way to proceed is to
set a decent --recursion-limit:
"datalad install -s ../DataLad-101 -r --recursion-limit 2"

The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to retrieve
file content, those repositories will be queried.

To update a shared dataset, run the command "datalad update --merge".
This command will query its origin for changes, and integrate the
changes into the dataset.

To update from a dataset with a shared history, you
need to add this dataset as a sibling to your dataset.
"Adding a sibling" means providing DataLad with info about
the location of a dataset, and a name for it. Afterwards,
a "datalad update --merge -s name" will integrate the changes
made to the sibling into the dataset.
A safe step in between is to do a "datalad update -s name"
and checkout the changes with "git/datalad diff"
to remotes/origin/master

Configurations for datasets exist on different levels
(systemwide, global, and local), and in different types
of files (not version controlled (git)config files, or
version controlled .datalad/config, .gitattributes, or
gitmodules files), or environment variables.
With the exception of .gitattributes, all configuration
files share a common structure, and can be modified with
the git config command, but also with an editor by hand.

Depending on whether a configuration file is version
controlled or not, the configurations will be shared together
with the dataset. More specific configurations and not-shared
configurations will always take precedence over more global or
shared configurations, and environment variables take precedence
over configurations in files.

The git config --list --show-origin command is a useful tool
to give an overview over existing configurations. Particularly
important may be the .gitattributes file, in which one can set
rules for git-annex about which files should be version-controlled
with Git instead of being annexed.

It can be useful to use pre-configured procedures that can apply
configurations, create files or file hierarchies, or perform
arbitrary tasks in datasets. They can be shipped with DataLad,
its extensions, or datasets, and you can even write your own
procedures and distribute them. The "datalad run-procedure"
command is used to apply such a procedure to a dataset. Procedures
shipped with DataLad or its extensions starting with a "cfg" prefix
can also be applied at the creation of a dataset with
"datalad create -c <PROC-NAME> <PATH>" (omitting the "cfg" prefix).

… Wow! You traveled back and forth in time! But an even more magical way to see the contents of files in previous versions is Git’s cat-file command: Among many other things, it lets you read a file’s contents as of any point in time in the history, without a prior git checkout:

$ git cat-file --textconv 63b2b5159ab0549ee40ae240d567124f8a6b42c5:notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.

The command 'datalad install [--source] PATH'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '-d' option.

There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

The datalad run command can record the impact a script or command has on a Dataset.
In its simplest form, datalad run only takes a commit message and the command that
should be executed.

Any datalad run command can be re-executed by using its commit shasum as an argument
in datalad rerun CHECKSUM. DataLad will take information from the run record of the original
commit, and re-execute it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

The cat-file command is very versatile, and its documentation will list all of its functionality. To use it to see the contents of a file at a previous state as done above, this is what the general structure looks like:

$ git cat-file --textconv SHASUM:<path/to/file>
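The commit does not have to be given as a hash; a reference relative to HEAD works as well. Here is a minimal sketch with plain Git in a throwaway repository, in which the file name, its contents, and the user identity are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: read an older version of a file without checking it out.
# Runs in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "first version" > notes.txt
git add notes.txt
git commit -q -m "add notes"
echo "second version" > notes.txt
git commit -q -a -m "update notes"

# Content of notes.txt as of one commit back
old=$(git cat-file --textconv HEAD~1:notes.txt)
# The file in the working directory is left untouched
now=$(cat notes.txt)

echo "then: $old"
echo "now:  $now"
```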

Undoing latest modifications of files

Previously, we saw how to remove files from a dataset’s history that were accidentally saved and thus tracked for the first time. How does one undo a modification to a tracked file?

Let’s modify the saved Gitjoke2.txt:

$ echo "this is by far my favorite joke!" >> Gitjoke2.txt
$ cat Gitjoke2.txt
Knock knock. Who's there? Git.
Git-who?
Sorry, 'who' is not a git command - did you mean 'show'?
this is by far my favorite joke!
$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke3.txt (file)
untracked: apdffile.pdf (file)
 modified: Gitjoke2.txt (file)
$ datalad save -m "add joke evaluation to joke" Gitjoke2.txt
add(ok): Gitjoke2.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

How could this modification to Gitjoke2.txt be undone? With the git reset command again. If you want to “unsave” the modification but keep it in the file, use git reset --mixed as before. However, if you want to get rid of the modifications entirely, use the option --hard instead of --mixed:

$ git log -n 2 --oneline
9052c9a add joke evaluation to joke
dd375f2 save my favorite Git joke
$ git reset --hard dd375f24f51f8a1b64bc5eaf1365c615550c5d94
HEAD is now at dd375f2 save my favorite Git joke
$ cat Gitjoke2.txt
Knock knock. Who's there? Git.
Git-who?
Sorry, 'who' is not a git command - did you mean 'show'?

The change has been undone completely. This method will work with files stored in Git and annexed files.

Note that this operation only restores this one file, because the commit that was undone only contained modifications to this one file. This is a demonstration of one of the reasons why one should strive for commits to represent meaningful logical units of change – if necessary, they can be undone easily.

Undoing past modifications of files

What git reset did was to undo commits from the most recent version of your dataset. How would one undo a change that happened a while ago, though, with important changes being added afterwards that you want to keep?

Let’s save a bad modification to Gitjoke2.txt, but also a modification to notes.txt:

$ echo "bad modification" >> Gitjoke2.txt
$ datalad save -m "did a bad modification" Gitjoke2.txt
add(ok): Gitjoke2.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
$ cat << EOT >> notes.txt

Git has many handy tools to go back and forth in
time and work with the history of datasets.
Among many other things you can rewrite commit
messages, undo changes, or look at previous versions
of datasets. A superb resource to find out more about
this and practice such Git operations is this
chapter in the Pro-git book:
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History

EOT
$ datalad save -m "add note on helpful git resource" notes.txt
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

The objective is to remove the first, “bad” modification, but keep the more recent modification of notes.txt. A git reset command is not convenient, because resetting to a state before the bad modification would undo the most recent, “good” modification as well.

One way to accomplish this is with an interactive rebase, using the git rebase -i command [5]. Experienced Git users will know in which situations and how to perform such an interactive rebase.

However, outlining an interactive rebase here in the handbook could lead to problems for readers without (much) Git experience: An interactive rebase, even if performed successfully, can lead to many problems if it is applied with too little experience, for example in any collaborative real-world project.

Instead, we demonstrate a different, less intrusive way to revert one or more changes at any point in the history of a dataset: the git revert command. Instead of rewriting the history, it will add an additional commit in which the changes of an unwanted commit are reverted.

The command looks like this:

$ git revert SHASUM

where SHASUM specifies the commit hash of the modification that should be reverted.

Find out more: Reverting more than a single commit

Alternatively, you can also specify a range of commits, for example like this:

$ git revert OLDER_SHASUM..NEWER_SHASUM

This command will revert all commits starting with the one after OLDER_SHASUM (i.e., not including this commit) until and including the one specified with NEWER_SHASUM. For each reverted commit, one new commit will be added to the history that reverts it. Thus, if you revert a range of three commits, there will be three reversal commits. If you want the reversal of a range of commits saved in a single commit, however, supply the --no-commit option as in

$ git revert --no-commit OLDER_SHASUM..NEWER_SHASUM

After running this command, run a single git commit to conclude the reversal and save it in a single commit.
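As a minimal sketch of reverting a range in a single commit, the following uses plain Git in a throwaway repository. The file name, commit messages, and user identity are hypothetical, and the range is given relative to HEAD instead of with hashes.

```shell
#!/bin/sh
# Sketch: revert the two most recent commits in one single reversal commit.
# Runs in a temporary repository; all names are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m "base"
echo "bad 1" >> file.txt
git commit -q -a -m "bad change 1"
echo "bad 2" >> file.txt
git commit -q -a -m "bad change 2"

# Apply the reversal of both commits without committing yet ...
git revert --no-commit HEAD~2..HEAD
# ... and conclude with one commit that holds the complete reversal
git commit -q -m "Revert the two bad changes"

content=$(cat file.txt)
count=$(git rev-list --count HEAD)         # three original commits plus one reversal
echo "$content"
echo "$count"
```

The history now contains the two bad commits followed by a single commit that undoes both.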

Let’s see what it looks like:

$ git revert 99bc1fe877d3c08e2a8e2cfe1e079925257cec88
[master 6662a48] Revert "did a bad modification"
 Date: Tue Dec 10 09:10:17 2019 +0100
 1 file changed, 1 deletion(-)

This is the state of the file in which we reverted a modification:

$ cat Gitjoke2.txt
Knock knock. Who's there? Git.
Git-who?
Sorry, 'who' is not a git command - did you mean 'show'?

It does not contain the bad modification anymore. And this is what happened in the history of the dataset:

$ git log -n 3
commit 6662a48f90538948e36c4d7e5a9a36c22a90c772
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:10:17 2019 +0100

    Revert "did a bad modification"
    
    This reverts commit 99bc1fe877d3c08e2a8e2cfe1e079925257cec88.

commit 74fe25a960cb16ae633264146b524b3b98443bce
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:10:17 2019 +0100

    add note on helpful git resource

commit 99bc1fe877d3c08e2a8e2cfe1e079925257cec88
Author: Elena Piscopia <elena@example.net>
Date:   Tue Dec 10 09:10:17 2019 +0100

    did a bad modification

The commit that introduced the bad modification is still present, but it transparently gets undone with the most recent commit. At the same time, the good modification of notes.txt was not influenced in any way. The git revert command is thus a transparent and safe way of undoing past changes. Note though that this command can only be used efficiently if the commits in your dataset’s history are meaningful, independent units – having several unrelated modifications in a single commit may make an easy solution with git revert impossible and instead require a complex checkout, revert, or rebase operation.

Finally, let’s take a look at the state of the dataset after this operation:

$ datalad status
untracked: Gitjoke1.txt (file)
untracked: Gitjoke3.txt (file)
untracked: apdffile.pdf (file)

As you can see, the git revert command unsurprisingly had no effect on anything but the specified commit, and previously untracked files are still present.

Oh no! I’m in a merge conflict!

When working with the history of a dataset, especially when rewriting the history with an interactive rebase or when reverting commits, it is possible to run into so-called merge conflicts. Merge conflicts happen when Git needs assistance in deciding which changes to keep and which to apply. Resolving one requires you to edit the affected file with a text editor, but merge conflicts are not nearly as scary as they may seem the first few times you solve them.

This section is not a guide on how to solve merge conflicts, but a broad overview of the necessary steps, and a pointer to a more comprehensive guide.

  • The first thing to do if you end up in a merge conflict is to read the instructions Git is giving you – they are a useful guide.

  • Also, it is reassuring to remember that you can always get out of a merge conflict by aborting the operation that led to it (e.g., git rebase --abort).

  • To actually solve a merge conflict, you will have to edit files: In the documents the merge conflict applies to, Git marks the sections it needs help with using markers that consist of >, <, and = signs and commit shasums or branch names. There will be two marked parts, and you have to delete the one you do not want to keep, as well as all markers.

  • Afterwards, run git add <path/to/file> and finally a git commit.
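The steps above can be sketched in a disposable repository. This is only an illustration under assumptions: the conflict is provoked by a plain git merge of two branches that change the same line (rather than by a rebase or revert), and the branch name feature, the file joke.txt, and all messages are hypothetical.

```shell
set -eu
# Hypothetical scratch repository for illustration only
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'original line\n' > joke.txt
git add joke.txt
git commit -q -m "add joke"
main_branch=$(git symbolic-ref --short HEAD)

# Two branches change the same line in different ways
git checkout -q -b feature
printf 'feature version\n' > joke.txt
git commit -q -am "change joke on feature"

git checkout -q "$main_branch"
printf 'main version\n' > joke.txt
git commit -q -am "change joke on main branch"

# The merge stops with a conflict, so a non-zero exit status is expected here
git merge feature >/dev/null 2>&1 || echo "merge stopped with a conflict"

# Inspect the markers (<, =, >) Git wrote around the two competing parts
cat joke.txt

# Resolve: keep the wanted part and delete the other part and all markers.
# Here the file is simply rewritten with the feature version.
printf 'feature version\n' > joke.txt
git add joke.txt
git commit -q -m "Merge feature, keep its version of the joke"
```

Instead of resolving, running git merge --abort right after the failed merge would have returned the repository to its state before the merge.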

An excellent resource on how to deal with merge conflicts is this post.

Summary

This guest lecture has given you a glimpse into how to work with the history of your DataLad datasets. To conclude this section, let’s remove all untracked contents from the dataset. This can be done with git clean: The command git clean -f wipes your dataset clean and removes any untracked file. Careful! This is not revertible, and content lost with this command cannot be recovered! If you want to be extra sure, run git clean -fn beforehand – this will give you a list of the files that would be deleted.

$ git clean -f
Removing Gitjoke1.txt
Removing Gitjoke3.txt
Removing apdffile.pdf

Afterwards, datalad status returns nothing, indicating a clean dataset state with no untracked files or modifications.

$ datalad status
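The dry-run advice from above can be tried safely in a scratch repository first. The following is a small sketch with hypothetical file names; it previews the removal with git clean -fn before actually deleting anything.

```shell
set -eu
# Hypothetical scratch repository for illustration only
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'tracked\n' > tracked.txt
git add tracked.txt
git commit -q -m "add tracked file"

# An untracked file that git clean would remove
printf 'scratch\n' > untracked.txt

# Dry run: only lists what would be removed, deletes nothing
preview=$(git clean -fn)
echo "$preview"

# Actual removal of untracked files; this cannot be undone
git clean -f
ls
```

Only the untracked file is listed and later removed; tracked content is never touched by git clean.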

Finally, if you want, apply your new knowledge about reverting commits to remove the Gitjoke2.txt file.

Footnotes

1

For example, the datalad rerun command introduced in section DataLad, Re-Run! takes such a hash as an argument, and re-executes the datalad run or datalad rerun run record associated with this hash. Likewise, git diff can work with commit hashes.

2

There are other ways to reference commits in the history of a dataset, for example relative to the most recent commit using notation such as HEAD~2 (the second-generation ancestor, following first parents), HEAD^2 (the second parent of a merge commit) or HEAD@{2} (where HEAD pointed two moves ago, according to the reflog). However, using hashes to reference commits is a very fail-safe method and saves you from accidentally miscounting.

3

The option --mixed is the default mode for a git reset command; omitting it (i.e., running just git reset) leads to the same behavior. It is stated explicitly in this book to make the mode clear, though.

4

Note though that rewriting history can be dangerous, and you should be aware of what you are doing. For example, rewriting parts of the dataset’s history that have been published (e.g., to a GitHub repository) already or that other people have copies of, is not advised.

5

If you need to rebase interactively, please consult further documentation and tutorials. It is out of the scope of this handbook to be a complete guide on rebasing, and not all interactive rebasing operations are complication-free. However, you can always undo mistakes that occur during rebasing with the help of the reflog.