Populate a dataset

The first lecture in DataLad-101 referenced some useful literature. Even if we end up not reading those books at all, let’s just download them and put them into our dataset. You never know, right? Let’s first create a directory to save books for additional reading in.

$ mkdir books

Let’s take a look at the current directory structure with the tree command [1]:

$ tree
.
└── books

1 directory, 0 files

Arguably, not the most exciting thing to see. So let’s put some PDFs inside. Below is a short list of optional readings. We decide to download them (they are all free, in total about 15 MB), and save them in DataLad-101/books.

You can either visit the links and save them in books/, or run the following commands [2] to download the books right from the terminal:

$ cd books
$ wget https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf
$ wget https://www.gitbook.com/download/pdf/book/swaroopch/byte-of-python -O byte-of-python.pdf
# get back into the root of the dataset
$ cd ../
2019-11-12 15:04:59 URL:https://netix.dl.sourceforge.net/project/linuxcommand/TLCL/19.01/TLCL-19.01.pdf [2120211/2120211] -> "TLCL.pdf" [1]
2019-11-12 15:05:01 URL:https://legacy.gitbook.com/download/pdf/book/swaroopch/byte-of-python [4407669] -> "byte-of-python.pdf" [1]

Let’s see what happened. First of all, in the root of DataLad-101, show the directory structure with tree:

$ tree
.
└── books
    ├── byte-of-python.pdf
    └── TLCL.pdf

1 directory, 2 files

Now what does DataLad do with this new content? One command you will use very often is datalad status (datalad-status manual). It reports on the state of dataset content, and checking the status regularly should become a habit throughout DataLad-101.

$ datalad status
untracked: books (directory)

Interesting, the books/ directory is “untracked”. Remember how content can be tracked if a user wants it to be? Untracked means that DataLad does not know about this directory or its content, because we have not instructed DataLad to actually track it. This means that DataLad does not keep the downloaded books in its history yet. Let’s change this by saving the files to the dataset’s history with the datalad save command (datalad-save manual).

This time, it’s your turn to specify a helpful commit message with the -m option:

$ datalad save -m "add books on Python and Unix to read later"
add(ok): books/TLCL.pdf (file)
add(ok): books/byte-of-python.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)

Find out more: “Oh no! I forgot the -m option!”

If you forget to specify a commit message with -m, DataLad will write [DATALAD] Recorded changes as a commit message into your history. This is not particularly informative. You can change the last commit message with the Git command git commit --amend. This will open up your default editor and you can edit the commit message. Careful – the default editor might be vim!
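If you would rather not end up in an editor at all, git commit --amend also accepts the new message directly via -m. The following is a minimal sketch that runs in a throwaway Git repository rather than in DataLad-101; the file name, the identity settings, and the replacement message are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: replace the most recent commit message without opening an editor.
# Everything happens in a temporary plain Git repository (an assumption made
# so that the example is safe to run anywhere).
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
# Repository-local identity (placeholder values) so that committing works
# even on a machine where Git is not configured yet
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

echo "some notes" > notes.txt
git add notes.txt
# Stand-in for a save without -m, using the default message quoted above
git commit -q -m "[DATALAD] Recorded changes"

# --amend rewrites the most recent commit; -m supplies the new message inline
git commit -q --amend -m "Add notes from lecture 1"

# Show the subject line of the latest commit
last_message=$(git log -1 --format=%s)
echo "$last_message"
```

Amending replaces the last commit instead of adding a new one, so the history still contains a single commit afterwards.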

As already noted, any files you save in this dataset, and all modifications to these files that you save, are tracked in this history. Importantly, this file tracking works regardless of the size of the files: a DataLad dataset could be your private music or movie collection with single files being many GB in size. This is one aspect that distinguishes DataLad from many other version control tools, among them Git. Large content is tracked in an annex that is automatically created and handled by DataLad. Whether text files or larger files change, all of these changes can be written to your DataLad dataset’s history.

Let’s see how the saved content shows up in the history of the dataset with git log. -n 1 specifies that we want to take a look at the most recent commit. To get a bit more detail, we add the -p flag (if the output opens in a pager, navigate with the up and down arrow keys and leave the git log by typing q):

$ git log -p -n 1
commit 9a6fb63c12113788948a72bc0b44c1aac4e69466
Author: Elena Piscopia <elena@example.net>
Date:   Tue Nov 12 15:05:02 2019 +0100

    add books on Python and Unix to read later

diff --git a/books/TLCL.pdf b/books/TLCL.pdf
new file mode 120000
index 0000000..4c84b61
--- /dev/null
+++ b/books/TLCL.pdf
@@ -0,0 +1 @@
+../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
diff --git a/books/byte-of-python.pdf b/books/byte-of-python.pdf
new file mode 120000
index 0000000..9a812a0
--- /dev/null
+++ b/books/byte-of-python.pdf

Now this might look a bit cryptic (and honestly, tig [3] makes it look prettier). But this tells us the date and time at which a particular author added two PDFs to the directory books/, and thanks to that commit message we have a nice human-readable summary of that action.
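One detail of the diff is worth decoding: “new file mode 120000” is how Git records a symbolic link, and the line starting with + is the path the link points to, which here lies inside the annex. The sketch below reproduces just that mode in a throwaway plain Git repository on a Unix-like system. The link target and file names are hypothetical, and no annex is involved.

```shell
#!/bin/sh
# Sketch: show that Git stores a symbolic link with file mode 120000.
# Assumes a Unix-like system where Git can create symlinks.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir books
# The link target does not need to exist; Git only records the target path
ln -s ../hypothetical/target.pdf books/example.pdf
git add books/example.pdf

# "git ls-files -s" prints: <mode> <object id> <stage><TAB><path>
mode=$(git ls-files -s books/example.pdf | cut -d' ' -f1)
# readlink prints the path that the symlink points to
target=$(readlink books/example.pdf)
echo "mode: $mode"
echo "target: $target"
```

In your own dataset, running readlink on one of the saved books shows its link target on disk.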

Find out more: DOs and DON’Ts for commit messages

DOs

  • Write a title line with 72 characters or less (as we did so far)

  • Use the imperative voice, e.g., “Add notes from lecture 2”

  • Often, a title line is not enough to express your changes and the reasoning behind them. In this case, add a body to your commit message by hitting enter twice (before closing the quotation marks), and continue writing a brief summary of the changes after a blank line. This summary should explain “what” has been done and “why”, but not “how”. Close the quotation marks, and hit enter to save the change with your message.

  • You can find more guidelines here: https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

DON’Ts

  • Do not use passive voice; it is hard to read afterwards

  • Do not use extensive formatting (hashes, asterisks, quotes, …); it will most likely make your shell complain

  • It should be obvious: do not say nasty things about other people
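The title-plus-body layout described in the DOs can be sketched as follows. The example uses git commit in a throwaway plain Git repository so that it runs without a dataset; the quoting works the same way for a message passed with -m to datalad save. The file and the message text are made up for illustration.

```shell
#!/bin/sh
# Sketch: a commit message with a short title line and an explanatory body.
# Runs in a temporary plain Git repository with placeholder content.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

echo "notes from lecture 2" > notes.txt
git add notes.txt

# The quoted string spans several lines: title, blank line, then the body.
# The body says what was done and why, not how.
git commit -q -m "Add notes from lecture 2

Summarize the commands shown in the lecture so that they can be
looked up later without going through the slides again."

title=$(git log -1 --format=%s)   # subject (title) line only
body=$(git log -1 --format=%b)    # everything after the blank line
echo "title has ${#title} characters"
echo "$body"
```

Git treats everything up to the first blank line as the title, which is why git log --oneline shows only that line.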

Note for Git users

Just as in Git, new files are not tracked from their creation on, but only when explicitly added to Git (in Git terms with an initial git add). Unlike the common Git workflow, however, DataLad skips the staging area. A datalad save combines a git add and a git commit, and therefore, the commit message is specified with datalad save.
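For readers less familiar with Git, the two steps that datalad save combines look roughly like this. This is only an illustration of the plain Git workflow in a throwaway repository with a placeholder file; it does not involve an annex and is not meant to show what DataLad runs internally.

```shell
#!/bin/sh
# Sketch: the two-step Git workflow (stage, then commit) in a temporary
# plain Git repository. File names and the message are placeholders.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

mkdir books
echo "placeholder content" > books/example.txt

before=$(git status --porcelain)    # new content is reported as untracked
git add books/example.txt           # step 1: put the file into the staging area
staged=$(git status --porcelain)    # the file is now staged as an addition
git commit -q -m "Add example file" # step 2: record the staged content
after=$(git status --porcelain)     # nothing left to report

echo "before: $before"
echo "staged: $staged"
echo "after:  $after"
```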

Cool, so now you have added some files to your dataset history. But what is a bit inconvenient is that both books were saved together. You begin to wonder: “A Python book and a Unix book do not have that much in common. I probably should not save them in the same commit. And … what happens if I have files I do not want to track? datalad save -m "some commit message" would save all of what is currently untracked or modified in the dataset into the history!”

Regarding your first remark, you’re absolutely right! It is good practice to save only those changes together that belong together. We do not want to squish completely unrelated changes into the same spot of our history, because it would get very nasty should we want to revert some of the changes without affecting others in this commit.

Luckily, we can point datalad save to exactly the changes we want it to record. Let’s try this by adding yet another book, a good reference work about Git, Pro Git:

$ cd books
$ wget https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
$ cd ../
2019-11-12 15:05:04 URL:https://github-production-release-asset-2e65be.s3.amazonaws.com/15400220/57552a00-9a49-11e9-9144-d9607ed4c2db?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20191112%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20191112T140502Z&X-Amz-Expires=300&X-Amz-Signature=cf7bb2994cf463e68b19817d3317497f58f1b2e62d9c5b3abd1a788b4d482358&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dprogit.pdf&response-content-type=application%2Foctet-stream [12465653/12465653] -> "progit.pdf" [1]

datalad status shows that there is a new untracked file:

$ datalad status
untracked: books/progit.pdf (file)

Let’s datalad save precisely this file by specifying its path after the commit message:

$ datalad save -m "add reference book about git" books/progit.pdf
add(ok): books/progit.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Find out more: Some more on save

Regarding your second remark, you’re right that a datalad save without a path specification would write all of the currently untracked files or modifications to the history. There are some ways to mitigate this: A datalad save -m "concise message" --updated (or the shorter form of --updated, -u) will only write modifications to the history, not untracked files. Later, we will also see .gitignore files that let you hide content from version control. However, it is good practice to safely store away modifications or new content. This both improves your dataset and workflow, and will be a requirement for the execution of certain commands.
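To get a feeling for the difference between “modified” and “untracked”, here is a small sketch with plain Git, which has a comparable switch: git add -u stages changes to files that are already tracked and leaves untracked files alone. This is an analogy in a throwaway repository with made-up file names, not a description of how datalad save -u works internally.

```shell
#!/bin/sh
# Sketch: record only modifications to tracked files, leave untracked files out.
# Uses "git add -u" in a temporary plain Git repository as an analogy.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked file"

echo "v2" > tracked.txt      # a modification to an already tracked file
echo "new" > untracked.txt   # a brand-new file Git does not know yet

git add -u                   # stages tracked.txt only
git commit -q -m "Update tracked file"

# The untracked file is still reported; the modification is in the history
remaining=$(git status --porcelain)
echo "$remaining"
```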

A datalad status should now be empty, and our dataset’s history should look like this:

# let's make the output a bit more concise with the --oneline option
$ git log --oneline
dac2fa7 add reference book about git
9a6fb63 add books on Python and Unix to read later
1a45fd4 Instruct annex to add text files to Git
e85b950 [DATALAD] new dataset

Well done! Your DataLad-101 dataset and its history are slowly growing.

Footnotes

[1] tree is a Unix command to list file system content. If it is not yet installed, you can get it with your native package manager (e.g., apt or brew). For example, if you use macOS, brew install tree will get you this tool.

[2] wget is a Unix command for non-interactively downloading files from the web. If it is not yet installed, you can get it with your native package manager (e.g., apt or brew). For example, if you use macOS, brew install wget will get you this tool.

[3] See tig. Once installed, exchange any git log command you see here with the single word tig.