3.1. Data safety

Later in the day, after seeing and solving so many DataLad error messages, you fall tired into your bed. Just as you are about to fall asleep, a thought crosses your mind:

“I now know that tracked content in a dataset is protected by git-annex. Whenever tracked contents are saved, they get locked and should not be modifiable. But… what about the notes that I have been taking since the first day? Should I not need to unlock them before I can modify them? And also the script! I was able to modify this despite giving it to DataLad to track, with no permission denied errors whatsoever! How does that work?”

This night, though, your question stays unanswered and you fall into a restless sleep filled with bad dreams about “permission denied” errors. The next day you’re the first student in your lecturer’s office hours.

“Oh, you’re really attentive. This is a great question!” our lecturer starts to explain.

Do you remember that we created the DataLad-101 dataset with a specific configuration template? It was the -c text2git option we provided in the beginning of Create a dataset. It is because of this configuration that we can modify notes.txt without unlocking its content first.

The second commit message in our dataset's history summarizes this:

$ git log --reverse --oneline
a82a9ce [DATALAD] new dataset
dcc11c1 Instruct annex to add text files to Git
4e85396 add books on Python and Unix to read later
84674e3 add reference book about git
d841737 add beginners guide on bash
708e6f8 Add notes on datalad create
b10d56a add note on datalad save
8dca46c [DATALAD] Recorded changes
4b7923f [DATALAD] modified subdataset properties
a6bb5da Add note on datalad clone
15398ed Add short script to write a list of podcast speakers and titles
ee3c90f [DATALAD RUNCMD] create a list of podcast titles
fa3e98a BF: list both directories content
b5919c5 [DATALAD RUNCMD] create a list of podcast titles
b6566aa add note datalad and git diff
27a3df9 add note on basic datalad run and datalad rerun
cc0854a [DATALAD RUNCMD] convert -resize 400x400 recordings/longn...
5db13d3 resized picture by hand
d5b8506 [DATALAD RUNCMD] convert -resize 450x450 recordings/longn...
5adbb4b add additional notes on run options
3211533 [DATALAD RUNCMD] Resize logo for slides
9518f3b [DATALAD RUNCMD] Resize logo for slides
eb8ce32 add note on clean datasets

Instead of giving text files such as your notes or your script to git-annex, the dataset stores them in Git. But what does it mean if files are in Git instead of git-annex?

Well, procedurally it means that everything that is stored in git-annex is content-locked, and everything that is stored in Git is not. You can modify content stored in Git straight away, without unlocking it first.
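The difference can be imitated with plain files. The following is a toy sketch, not git-annex itself: it assumes only that a locked annexed file shows up as a symbolic link to a write-protected object, while a file in Git is an ordinary writable file. The names notes.txt, book.pdf, and the objects directory are made up for illustration.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# A file "in Git": an ordinary, writable file.
echo "first note" > notes.txt
echo "second note" >> notes.txt      # modifying it just works

# A file "in the annex" (imitation only): the content lives in a read-only
# object, and the name in the working tree is a symlink pointing to it.
mkdir objects
echo "book content" > objects/book-object
chmod 444 objects/book-object
ln -s objects/book-object book.pdf

# Inspect the result: the link target carries no write permission bits.
perms=$(ls -l objects/book-object | cut -c1-10)
echo "$perms"
ls -l book.pdf
```

In a real dataset, datalad unlock is the command that makes such content writable again.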

Fig. 3.2 A simplified overview of the tools that manage data in your dataset.

That’s easy enough.

“So, first of all: If we hadn’t provided the -c text2git argument, text files would get content-locked, too?” “Yes, indeed. However, there are also ways to later change how file content is handled based on its type or size. This can be specified in the .gitattributes file, using annex.largefiles options. But there will be a lecture on that [1].”
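As a rough sketch of what such a rule looks like, the snippet below writes one annex.largefiles line into the .gitattributes file of a throwaway repository and asks plain Git which rule applies to notes.txt. The rule is written in the style of what the text2git configuration sets up (only binary, non-empty files go to the annex); treat the exact expression as an assumption and compare it with the .gitattributes file in your own dataset. Git only reports the attribute here; git-annex is the tool that evaluates the expression.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet .

# Rule in the style of text2git: annex only binary files that are not empty.
printf '%s\n' '* annex.largefiles=((mimeencoding=binary)and(largerthan=0))' > .gitattributes

# Ask Git which annex.largefiles value applies to a path.
# The file itself does not need to exist for this query.
rule=$(git check-attr annex.largefiles -- notes.txt)
echo "$rule"
```

A size-based rule follows the same pattern, with an expression such as (largerthan=100kb) in place of the MIME-based one.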

“Okay, well, second: Isn’t it much easier to just not bother with locking and unlocking, and have everything ‘stored in Git’? Even if datalad run takes care of unlocking content, I do not see the point of git-annex”, you continue.

Here it gets tricky. To begin with the most important and most straightforward fact: it is not practical to store large files in Git, because Git would very quickly run into severe performance issues. For this reason GitHub, a well-known hosting site for projects using Git, does not allow files larger than 100 MB.
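To get a feeling for which files in a directory would exceed such a limit, one could list them by size. This is a minimal sketch, assuming a find implementation that understands the M size suffix (GNU and BSD find do); the file names are stand-ins, and big.bin is created as a sparse file so that the example does not write 101 MiB of data.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Two stand-in files: a small text file and a sparse 101 MiB file.
echo "small" > notes.txt
dd if=/dev/zero of=big.bin bs=1048576 count=0 seek=101 2>/dev/null

# List regular files larger than 100 MiB (find's M unit is 1048576 bytes).
large=$(find . -type f -size +100M)
echo "$large"
```

Files reported this way are candidates for git-annex rather than Git.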

For now, we have solved the mystery of why text files can be modified without unlocking, and this makes a small dent in the vast number of questions that have piled up in our curious minds. Essentially, git-annex protects your data from accidental modifications and thus keeps it safe. datalad run commands take care of the technical complexity if -o/--output is specified properly, and datalad unlock commands can be used to unlock content “by hand” if modifications are performed outside of a datalad run.

But here comes the second, tricky part: There are ways to get rid of locking and unlocking within git-annex, using so-called adjusted branches. Whether this is available and advisable depends on the git-annex version one has installed, the git-annex version of the repository, and a use-case-specific weighing of the pros and cons. On Windows systems, this adjusted mode is even the only mode of operation. In later sections we will see how to use this feature. The next lecture, in any case, will guide us deeper into git-annex and improve our understanding a little further.

Footnotes

[1] If you cannot wait to read about .gitattributes and other configuration files, jump ahead to chapter Tuning datasets to your needs, starting with section DIY configurations.