1.2. Populate a dataset
The first lecture in DataLad-101 referenced some useful literature. Even if we end up not reading those books at all, let’s download them nevertheless and put them into our dataset. You never know, right? Let’s first create a directory to save books for additional reading in.
$ mkdir books
Let’s take a look at the current directory structure with the tree command [1]:
$ tree
.
└── books
1 directory, 0 files
Arguably, not the most exciting thing to see. So let’s put some PDFs inside.
Below is a short list of optional readings. We decide to download them (they
are all free, in total about 15 MB), and save them in DataLad-101/books.
Additional reading about the command line: The Linux Command Line
An intro to Python: A byte of Python
You can either visit the links and save them in books/,
or run the following commands [2] to download the books right from the terminal.
Note that we break the commands across lines with \ signs. In your own work you can write
commands like this on a single line. If you copy them into your terminal as they
are presented here, make sure to check the Windows-wit on peculiarities of its terminals.
Terminals other than Git Bash can’t handle multi-line commands
In Unix shells, \ can be used to split a command into several lines, for example to aid readability.
Standard Windows terminals (including the Anaconda prompt) do not support this.
They instead use the ^ character:
$ wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download ^
-O TLCL.pdf
If you are not using Git Bash, you will either need to copy multi-line commands into a single line, or use ^ (make sure that there is no space afterwards) instead of \.
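If you want to convince yourself that a command split with \ really is a single command, here is a minimal sketch for a Unix shell or Git Bash. It only uses echo and does not touch your dataset; the variable name joined is just an illustration:

```shell
# A backslash at the very end of a line tells the shell that the
# command continues on the next line.
joined=$(echo "first part" \
    "second part")
echo "$joined"
```

Both arguments end up in the same echo call, so the result is printed on one line.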
$ cd books
$ wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download \
-O TLCL.pdf
$ wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf \
-O byte-of-python.pdf
# get back into the root of the dataset
$ cd ../
2022-04-13 10:39:28 URL:https://deac-ams.dl.sourceforge.net/project/linuxcommand/TLCL/19.01/TLCL-19.01.pdf [2120211/2120211] -> "TLCL.pdf" [1]
2022-04-13 10:39:29 URL:https://objects.githubusercontent.com/github-production-release-asset-2e65be/6501727/56225300-af61-11ea-8d7f-be2b68e479be?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220413%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220413T083928Z&X-Amz-Expires=300&X-Amz-Signature=0941f917a57bde633dbd4ce55741be3b364a5f51d0367959dcea3d3b7c9ac791&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=6501727&response-content-disposition=attachment%3B%20filename%3Dbyte-of-python.pdf&response-content-type=application%2Foctet-stream [4208954/4208954] -> "byte-of-python.pdf" [1]
Some machines will not have wget available by default, but any command that can download a file can work as an alternative. See the Windows-wit for the popular alternative curl.
You can use curl instead of wget
Many versions of Windows do not ship with the tool wget.
You can install it, but it may be easier to use the pre-installed curl command:
$ cd books
$ curl -L https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download \
-o TLCL.pdf
$ curl -L https://homepages.uc.edu/~becktl/byte_of_python.pdf \
-o byte-of-python.pdf
$ cd ../
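If the same instructions should work on machines that have either tool, one option is to wrap the choice in a small shell function. The following is only a sketch: pick_downloader and fetch are hypothetical helper names made up for this illustration (they are not part of DataLad), and they reuse the wget and curl options shown above.

```shell
# pick_downloader: print the name of the first download tool that is installed.
# (hypothetical helper for illustration, not part of DataLad)
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo wget
    elif command -v curl >/dev/null 2>&1; then
        echo curl
    else
        echo "neither wget nor curl found" >&2
        return 1
    fi
}

# fetch URL OUTFILE: download URL to OUTFILE with whichever tool exists,
# using the same options as the commands above.
fetch() {
    case "$(pick_downloader)" in
        wget) wget -q "$1" -O "$2" ;;
        curl) curl -L "$1" -o "$2" ;;
        *) return 1 ;;
    esac
}
```

With these definitions in your shell session, fetch followed by a book's URL and a target path such as books/byte-of-python.pdf would download the file with whichever tool is available.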
Let’s see what happened. First of all, in the root of DataLad-101, show the directory
structure with tree:
$ tree
.
└── books
├── byte-of-python.pdf
└── TLCL.pdf
1 directory, 2 files
Now what does DataLad do with this new content? One command you will use very
often is datalad status (datalad-status manual).
It reports on the state of dataset content, and
regular status reports should become a habit over the course of DataLad-101.
$ datalad status
untracked: books (directory)
Interesting; the books/ directory is “untracked”. Remember how content
can be tracked if a user wants to?
Untracked means that DataLad does not know about this directory or its content,
because we have not instructed DataLad to actually track it. This means that DataLad
does not store the downloaded books in its history yet. Let’s change this by
saving the files to the dataset’s history with the datalad save command
(datalad-save manual).
This time, it is your turn to specify a helpful commit message
with the -m option (although the DataLad command is datalad save, we talk
about commit messages because datalad save ultimately uses the command
git commit to do its work):
$ datalad save -m "add books on Python and Unix to read later"
add(ok): books/TLCL.pdf (file)
add(ok): books/byte-of-python.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
save (ok: 1)
If you ever forget to specify a message, or made a typo, not all is lost. A Find-out-more explains how to amend a saved state.
“Oh no! I forgot the -m option for datalad-save!”
If you forget to specify a commit message with the -m option, DataLad will write
[DATALAD] Recorded changes
as a commit message into your history.
This is not particularly informative.
You can change the last commit message with the Git command
git commit --amend. This will open up your default editor
and you can edit the commit message. Careful: the default editor might be vim!
The section Back and forth in time will show you many more ways in which you can
interact with a dataset’s history.
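If you would rather not end up in an editor, git commit --amend also accepts the new message directly via -m. The following sketch tries this out in a throwaway Git repository rather than in DataLad-101; the file name and the new message are made up for illustration:

```shell
# Sketch in a throwaway Git repository (not in DataLad-101).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

# Create a commit with the uninformative default message
echo "notes" > notes.txt
git add notes.txt
git commit -q -m "[DATALAD] Recorded changes"

# Replace the message of the most recent commit without opening an editor
git commit -q --amend -m "Add first notes"

# Show the subject line of the most recent commit
git log -n 1 --format=%s
```

Amending rewrites the most recent commit, so its shortened commit hash in git log will change as well.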
As already noted, any files you save in this dataset, and all modifications
to these files that you save, are tracked in this history.
Importantly, this file tracking works
regardless of the size of the files – a DataLad dataset could be
your private music or movie collection with single files being many GB in size.
This is one aspect that distinguishes DataLad from many other
version control tools, among them Git.
Large content is tracked in an annex that is automatically
created and handled by DataLad. Whether text files or larger files change,
all of these changes can be written to your DataLad dataset’s history.
Let’s see how the saved content shows up in the history of the dataset with git log.
The option -n 1 specifies that we want to take a look at the most recent commit.
In order to get a bit more detail, we add the -p flag. If you end up in a
pager, navigate with the up and down arrow keys and leave the log by typing q:
$ git log -p -n 1
commit 0605893cf92418e09954f1fa0eea79675350dbbc
Author: Elena Piscopia <elena@example.net>
Date: Wed Apr 13 10:39:30 2022 +0200
add books on Python and Unix to read later
diff --git a/books/TLCL.pdf b/books/TLCL.pdf
new file mode 120000
index 0000000..4c84b61
--- /dev/null
+++ b/books/TLCL.pdf
@@ -0,0 +1 @@
+../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
diff --git a/books/byte-of-python.pdf b/books/byte-of-python.pdf
new file mode 120000
index 0000000..adaec61
--- /dev/null
+++ b/books/byte-of-python.pdf
Now this might look a bit cryptic (and honestly, tig [3] makes it look prettier).
But this tells us the date and time at which a particular author added two PDFs to
the directory books/, and thanks to that commit message we have a nice
human-readable summary of that action. A Find-out-more explains what makes
a good message.
DOs and DON’Ts for commit messages
DOs
- Write a title line with 72 characters or less (as we did so far).
- The title should be in imperative voice, e.g., “Add notes from lecture 2”.
- Often, a title line is not enough to express your changes and the reasoning behind them. In this case, add a body to your commit message by hitting enter twice (before closing the quotation marks), and continue writing a brief summary of the changes after a blank line. This summary should explain “what” has been done and “why”, but not “how”. Close the quotation marks, and hit enter to save the change with your message.
DON’Ts
- Passive voice is hard to read afterwards.
- Extensive formatting (hashes, asterisks, quotes, …) will most likely make your shell complain.
- It should be obvious: do not say nasty things about other people.
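To make the title-plus-body layout concrete, here is a small sketch in plain shell. The helper check_title is hypothetical (it is not a DataLad or Git feature); it only tests whether the first line of a message stays within 72 characters. The message text itself is made up for illustration.

```shell
# check_title MESSAGE: succeed if the first line has at most 72 characters.
# (hypothetical helper for illustration)
check_title() {
    title=$(printf '%s\n' "$1" | head -n 1)
    [ "${#title}" -le 72 ]
}

# A message with a title line, a blank line, and a short body
msg="Add notes from lecture 2

Summarize what the lecture covered and why the notes matter."

if check_title "$msg"; then
    echo "title ok"
else
    echo "title too long"
fi

# The same text could then be passed on as a commit message:
#   datalad save -m "$msg"
```

Keeping the message in a variable also sidesteps typing a multi-line string directly at the prompt.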
There is no staging area in DataLad
Just as in Git, new files are not tracked from their creation on, but only when explicitly added to Git (in Git terms with an initial git add). But different from the common Git workflow, DataLad skips the staging area. A datalad save combines a git add and a git commit, and therefore, the commit message is specified with datalad save.
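For comparison, this is roughly what the two-step Git workflow looks like. The sketch runs in a throwaway Git repository (not in DataLad-101), and the file name and message are made up. Note that plain Git stores the file in Git itself, whereas datalad save may place content in the annex.

```shell
# Sketch in a throwaway Git repository (not in DataLad-101).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Elena Piscopia"
git config user.email "elena@example.net"

echo "reading list" > list.txt
before=$(git status --short)          # the new file is untracked
git add list.txt                      # step 1: stage the file
git commit -q -m "Add reading list"   # step 2: record it in the history
after=$(git status --short)           # nothing left to record

echo "before: $before"
echo "after: $after"

# In a DataLad dataset, one command covers both steps:
#   datalad save -m "Add reading list" list.txt
```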
Cool, so now you have added some files to your dataset history. But what is a bit inconvenient is that both books were saved together. You begin to wonder: “A Python book and a Unix book do not have that much in common. I probably should not save them in the same commit. And … what happens if I have files I do not want to track? datalad save -m "some commit message" would save all of what is currently untracked or modified in the dataset into the history!”
Regarding your first remark, you’re absolutely right! It is good practice to save only those changes together that belong together. We do not want to squish completely unrelated changes into the same spot of our history, because it would get very nasty should we want to revert some of the changes without affecting others in this commit.
Luckily, we can point datalad save to exactly the changes we want it to record. Let’s try this by adding yet another book, a good reference work about git, Pro Git:
$ cd books
$ wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
$ cd ../
2022-04-13 10:39:33 URL:https://objects.githubusercontent.com/github-production-release-asset-2e65be/15400220/57552a00-9a49-11e9-9144-d9607ed4c2db?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220413%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220413T083930Z&X-Amz-Expires=300&X-Amz-Signature=e32dd16458b2169ec4baa05f3ba56c848ba52fd77205c925b4e254372bfd0c68&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15400220&response-content-disposition=attachment%3B%20filename%3Dprogit.pdf&response-content-type=application%2Foctet-stream [12465653/12465653] -> "progit.pdf" [1]
datalad status shows that there is a new untracked file:
$ datalad status
untracked: books/progit.pdf (file)
Let’s give datalad save precisely this file by specifying its path after the commit message:
$ datalad save -m "add reference book about git" books/progit.pdf
add(ok): books/progit.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
Regarding your second remark, you’re right that a datalad save without a path specification would write all of the currently untracked files or modifications to the history. But check the Find-out-more on how to tell it otherwise.
How to save already tracked dataset components only?
A datalad save -m "concise message" --updated (or the shorter
form of --updated, -u) will only write modifications to the
history, not untracked files. Later, we will also see .gitignore files
that let you hide content from version control. However, it is good
practice to safely store away modifications or new content. This improves
your dataset and workflow, and will be a requirement for executing certain
commands.
A datalad status should now be empty, and our dataset’s history should look like this:
# let's make the output a bit more concise with the --oneline option
$ git log --oneline
c0949c0 add reference book about git
0605893 add books on Python and Unix to read later
9cca5af Instruct annex to add text files to Git
ecd080c [DATALAD] new dataset
“Wonderful! I’m getting the hang of this quickly”, you think. “Version controlling files is not as hard as I thought!”
But downloading and adding content to your dataset “manually” has two disadvantages: For one, it requires you to download the content and save it. Compared to a workflow with no DataLad dataset, this is one additional command you have to perform (and that additional time adds up, after a while). But a more serious disadvantage is that you have no electronic record of the source of the contents you added. The provenance recorded so far (the time, date, and author of a file) is already quite nice, but we don’t know anything about where you downloaded these files from. If you wanted to find out, you would have to remember where you got the content from, and brains are not made for such tasks.
Luckily, DataLad has a command that will solve both of these problems:
the datalad download-url command (datalad-download-url manual).
We will dive deeper into the provenance-related benefits of using it in later chapters, but for now,
we’ll start with best-practice-building. datalad download-url can retrieve content
from a URL (using any of the URL schemes https, http, ftp, or s3) and save it
into the dataset together with a human-readable commit message and a hidden,
machine-readable record of the origin of the content. This saves you time,
and captures provenance information about the data you add to your dataset.
To experience this, let’s add a final book,
a beginner’s guide to bash,
to the dataset. We provide the command with a URL, a pointer to the dataset the
file should be saved in (. denotes “current directory”), and a commit message.
$ datalad download-url \
http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
--dataset . \
-m "add beginners guide on bash" \
-O books/bash_guide.pdf
[INFO] Downloading 'http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf' into '/home/me/dl-101/DataLad-101/books/bash_guide.pdf'
download_url(ok): /home/me/dl-101/DataLad-101/books/bash_guide.pdf (file)
add(ok): books/bash_guide.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
download_url (ok: 1)
save (ok: 1)
Afterwards, a fourth book is inside your books/ directory:
$ ls books
bash_guide.pdf
byte-of-python.pdf
progit.pdf
TLCL.pdf
However, the datalad status command reports nothing to save (the dataset state is “clean”):
$ datalad status
nothing to save, working tree clean
This is because datalad download-url took care of saving for you:
$ git log -p -n 1
commit da75663ba7058b6fb91b3eea2c726ad067976531
Author: Elena Piscopia <elena@example.net>
Date: Wed Apr 13 10:39:39 2022 +0200
add beginners guide on bash
diff --git a/books/bash_guide.pdf b/books/bash_guide.pdf
new file mode 120000
index 0000000..00ca6bd
--- /dev/null
+++ b/books/bash_guide.pdf
@@ -0,0 +1 @@
+../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
\ No newline at end of file
At this point in time, the biggest advantage may seem to be the time saved. However, soon you will experience how useful it is to have DataLad keep track of where file content came from.
To conclude this section, let’s take a final look at the history of your dataset at this point:
$ git log --oneline
da75663 add beginners guide on bash
c0949c0 add reference book about git
0605893 add books on Python and Unix to read later
9cca5af Instruct annex to add text files to Git
ecd080c [DATALAD] new dataset
Well done! Your DataLad-101 dataset and its history are slowly growing.
Footnotes
[1] tree is a Unix command to list file system content. If it is not yet installed, you can get it with your native package manager (e.g., apt, brew, or conda). For example, if you use OSX, brew install tree will get you this tool. On Windows, if you have the Miniconda-based installation described in Installation and configuration, you can install the m2-base package (conda install m2-base), which contains tree along with many other Unix-like commands. Note that this tree works slightly differently than its Unix equivalent: it will only display directories, not files, and it doesn’t accept common options or flags. It will also display hidden directories, i.e., those that start with a . (dot).
[2] wget is a Unix command for non-interactively downloading files from the web. If it is not yet installed, you can get it with your native package manager (e.g., apt or brew). For example, if you use OSX, brew install wget will get you this tool.
[3] See tig. Once installed, exchange any git log command you see here with the single word tig.