Input and output
In the previous two sections, you created a simple .tsv file of all speakers and talk titles in the longnow/ podcasts subdataset, and you re-executed a datalad run command after a bug fix in your script. But these previous datalad run and datalad rerun commands were very simple. Maybe you noticed that some values in the run record were empty: inputs and outputs, for example, did not have an entry. Let’s experience a few situations in which these two arguments can become necessary.
In our DataLad-101 course we were given a group assignment. Everyone should give a small presentation about an open DataLad dataset they found. Conveniently, you decided to settle for the longnow podcasts right away. After all, you know the dataset quite well already, and after listening to almost a third of the podcasts and enjoying them a lot, you also want to recommend them to the others.
Almost all of the slides are ready, but what’s still missing is the logo of the longnow podcasts. Good thing that this is part of the subdataset, so you can simply retrieve it from there.
The logos (one for the SALT series, one for the Interval series – the two directories in the subdataset) were originally extracted from the podcasts’ metadata by DataLad. In a while, we will dive into the metadata aggregation capabilities of DataLad, but for now, let’s just use the logos instead of finding out where they come from.
As part of the metadata of the dataset, the logos are in the hidden paths .datalad/feed_metadata/logo_salt.jpg and .datalad/feed_metadata/logo_interval.jpg:
$ ls recordings/longnow/.datalad/feed_metadata/*jpg
recordings/longnow/.datalad/feed_metadata/logo_interval.jpg
recordings/longnow/.datalad/feed_metadata/logo_salt.jpg
For the slides you decide to prepare images of size 400x400 px, but the logos’ original size is much larger (both are 3000x3000 pixels). Currently, they are far too large to fit on a slide, so let’s resize them.
To resize an image from the command line we can use the Unix command convert -resize from the ImageMagick tool. The command takes a new size in pixels as an argument, a path to the file that should be resized, and a filename and path under which a new, resized image will be saved. To resize one image to 400x400 px, the command would thus be convert -resize 400x400 path/to/file.jpg path/to/newfilename.jpg.
Remembering the last lecture on datalad run, you decide to plug this into datalad run. Even though this is not a script, it is a command, and you can wrap commands like this conveniently with datalad run. Because they will be quite long, we line break the commands in the upcoming examples for better readability – in your terminal, you can always write the commands into a single line.
$ datalad run -m "Resize logo for slides" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] == Command start (output follows) =====
convert-im6.q16: unable to open image `recordings/longnow/.datalad/feed_metadata/logo_salt.jpg': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `recordings/salt_logo_small.jpg' @ error/convert.c/ConvertImageCommand/3258.
[INFO] == Command exit (modification check follows) =====
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG'
CommandError: command 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' failed with exitcode 1
Failed to run 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' under '/home/me/dl-101/DataLad-101'. Exit code=1.
Oh, crap! Why didn’t this work?
Let’s take a look at the error message DataLad provides. In general, these error messages might seem wordy, and maybe a bit intimidating as well, but usually they provide helpful information to find out what is wrong. Whenever you encounter an error message, make sure to read it, even if it feels like a mushroom cloud exploded in your terminal.
A datalad run error message has several parts. The first starts after [INFO] == Command start (output follows) =====. It displays the errors that the terminal command threw: the convert tool complains that it cannot open the file, because there is “No such file or directory”.
The second part starts after [INFO] == Command exit (modification check follows) =====. DataLad adds information about a “non-zero exit code”. A non-zero exit code indicates that something went wrong [1]. In principle, you could go ahead and google what this specific exit status indicates. However, the solution might have already occurred to you when reading the first error report: The file is not present.
How can that be?
“Right!”, you exclaim with a facepalm. Just like the .mp3 files, the .jpg file content is not present locally after a datalad clone, and we did not datalad get it yet!
This is where the -i/--input option of datalad run becomes useful. The content of everything that is specified as an input will be retrieved prior to running the command.
$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
get(ok): recordings/longnow/.datalad/feed_metadata/logo_salt.jpg (file) [from web...]
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
get (notneeded: 1, ok: 1)
save (notneeded: 1, ok: 1)
Cool! You can see in this output that prior to the command execution, DataLad did a datalad get. This is useful for several reasons. For one, it saved us the work of manually getting content. But moreover, this is useful for anyone with whom we might share the dataset: With an installed dataset one can very simply rerun datalad run commands if they have the input argument appropriately specified. It is therefore good practice to specify the inputs appropriately. Remember from section Install datasets that datalad get will only retrieve content if it is not yet present: input that is already downloaded will not be downloaded again, so specifying inputs even though they are already present will not do any harm.
Find out more: What if there are several inputs?
Often, a command needs several inputs. In principle, every input gets its own -i/--input flag. However, you can make use of globbing. For example, datalad run --input "*.jpg" "COMMAND" will retrieve all .jpg files prior to command execution.
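The quotes around the glob matter. The following is a minimal sketch in plain shell that does not call DataLad at all; the file names are made up for illustration. It shows the difference between an unquoted and a quoted pattern: unquoted, the shell expands the pattern itself before any program sees it; quoted, the pattern arrives unchanged, so that the receiving program (in the example above, datalad run) can do the matching.

```sh
# Create a throwaway directory with a few hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/logo_interval.jpg" "$workdir/logo_salt.jpg" "$workdir/notes.txt"

# Unquoted: the shell replaces the pattern with every matching file name.
set -- "$workdir"/*.jpg
unquoted_count=$#

# Quoted: the pattern is passed on literally as a single argument.
set -- "*.jpg"
quoted_count=$#
quoted_value=$1

echo "unquoted glob became $unquoted_count arguments"
echo "quoted glob stayed $quoted_count argument: $quoted_value"

# Clean up the throwaway directory.
rm -rf "$workdir"
```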
If outputs already exist…
Looking at the resulting image, you wonder whether 400x400 might be a tiny bit too small. Maybe we should try to resize it to 450x450, and see whether that looks better?
Note that we cannot use a datalad rerun for this: if we want to change the dimension option in the command, we have to define a new datalad run command.
To establish best practices, let’s specify the input even though it is already present:
$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
convert-im6.q16: unable to open image `recordings/salt_logo_small.jpg': Permission denied @ error/blob.c/OpenBlob/2874.
[INFO] == Command exit (modification check follows) =====
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG'
CommandError: command 'convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' failed with exitcode 1
Failed to run 'convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' under '/home/me/dl-101/DataLad-101'. Exit code=1.
Oh wtf… What is it now?
A quick glimpse into the error message shows a different error than before: The tool complains that it is “unable to open” the image, because the “Permission [is] denied”.
We have not seen anything like this before. Confused about what we might have done wrong, we raise our hand to ask the instructor for help. Knowingly, she smiles, and tells us how DataLad protects content given to it:
“Content in your DataLad dataset is protected by git-annex from accidental changes” our instructor begins.
“Wait!” we interrupt. “First off, that wasn’t accidental. And second, I was told this course does not have git-annex-101 as a prerequisite?”
“Yes, hear me out” she says. “I promise you two different solutions at the end of this explanation, and the concept behind this is quite relevant”.
DataLad usually gives content to git-annex to store and track. git-annex, let’s just say, takes this task really seriously. One of its features that you have just experienced is that it locks content.
If files are locked down, their content can not be modified. In principle, that’s not a bad thing: It could be your late grandma’s secret cherry-pie recipe, and you do not want to accidentally change that. Therefore, a file needs to be consciously unlocked to apply modifications.
In the attempt to resize the image to 450x450 you tried to overwrite recordings/salt_logo_small.jpg, a file that was given to DataLad and thus protected by git-annex.
There is a DataLad command that takes care of unlocking file content, and thus making locked files modifiable again: datalad unlock (datalad-unlock manual). Let us check out what it does:
$ datalad unlock recordings/salt_logo_small.jpg
unlock(ok): recordings/salt_logo_small.jpg (file)
Well, unlock(ok) does not sound too bad for a start. As always, we feel the urge to run a datalad status on this:
$ datalad status
modified: recordings/salt_logo_small.jpg (symlink)
“Ah, do not mind that for now”, our instructor says, and with a wink she continues: “We’ll talk about symlinks and object trees a while later”. You are not really sure whether that’s a good thing, but you have a task to focus on. Hastily, you run the command right from the terminal:
$ convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg
Hey, no permission denied error! You note that the instructor still stands right next to you. “Sooo… now what do I do to lock the file again?” you ask.
“Well… what you just did there was quite suboptimal. Didn’t you want to use datalad run? But, anyway, in order to lock the file again, you would need to run a datalad save.”
$ datalad save -m "resized picture by hand"
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
“So”, you wonder aloud, “whenever I want to modify a file, I need to datalad unlock it, do the modifications, and then datalad save it?”
“Well, this is certainly one way of doing it, and a completely valid workflow if you do that outside of a datalad run command. But within datalad run there is actually a much easier way of doing this. Let’s use the --output argument.”
datalad run retrieves everything that is specified as --input prior to command execution, and it unlocks everything specified as --output prior to command execution. Therefore, whenever the output of a datalad run command already exists and is tracked, it should be specified with the -o/--output option.
Find out more: But what if I have a lot of outputs?
The use case here is simplistic – a single file gets modified. But there are commands and tools that create full directories with many files as an output, for example FSL, a neuro-imaging tool. The easiest way to specify this type of output is the directory name and a globbing character, such as -o directory/*. And, just as for -i/--input, you could use multiple --output specifications.
In order to execute datalad run with both the -i/--input and -o/--output flags and see their magic, let’s resize the second logo, logo_interval.jpg:
$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
-o "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
get(ok): recordings/longnow/.datalad/feed_metadata/logo_interval.jpg (file) [from web...]
add(ok): recordings/interval_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
get (notneeded: 1, ok: 1)
save (notneeded: 1, ok: 1)
This time, with both --input and --output options specified, DataLad informs about the datalad get operations it performs prior to the command execution, and datalad run executes the command successfully. It does not inform about any datalad unlock operation, because the output recordings/interval_logo_small.jpg does not exist before the command is run. Should you rerun this command, however, the summary will include a statement about content unlocking. You will see an example of this in the next section.
Note now how many individual commands a datalad run saves us: datalad get, datalad unlock, and datalad save! But even better: Beyond saving time now, running commands reproducibly and recorded with datalad run saves us plenty of time in the future as soon as we want to rerun a command, or find out how a file came into existence.
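To make the saving concrete, here is a hedged sketch of the manual chain of commands that a single datalad run with --input and --output stands in for, using the salt logo whose output file already exists. The helper step and the DRY_RUN variable are hypothetical names introduced only for this illustration: by default the sketch merely prints each step (without shell quoting) so that it can be tried without touching a dataset. Note that the manual chain would not write a run record, so nothing could be rerun from it later.

```sh
# Hypothetical helper: print each step unless DRY_RUN=0 is set,
# in which case the step is actually executed.
step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

input="recordings/longnow/.datalad/feed_metadata/logo_salt.jpg"
output="recordings/salt_logo_small.jpg"

manual_workflow() {
    # 1. Retrieve the input content if it is not yet present.
    step datalad get "$input"
    # 2. Unlock the existing, tracked output so it can be overwritten.
    step datalad unlock "$output"
    # 3. Run the actual command.
    step convert -resize 450x450 "$input" "$output"
    # 4. Save the result, which also locks the file again.
    step datalad save -m "Resize logo for slides" "$output"
}

plan=$(manual_workflow)
printf '%s\n' "$plan"
```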
With this last code snippet, you have experienced a full datalad run command: commit message, input and output definitions (the order in which you give those two options is irrelevant), and the command to be executed. Whenever a command takes input or produces output you should specify this with the appropriate option.
Make a note of this behavior in your notes.txt file.
$ cat << EOT >> notes.txt
You should specify all files that a command takes as input with an -i/--input flag. These
files will be retrieved prior to the command execution. Any content that is modified or
produced by the command should be specified with an -o/--output flag. Upon a run or rerun
of the command, the contents of these files will get unlocked so that they can be modified.
EOT
Placeholders
Just after writing this note, you have to relax your fingers a bit. “Man, this was so much typing. Not only did I need to specify the inputs and outputs, I also had to repeat all of these lengthy paths in the command line call…” you think.
There is a neat little trick to spare you half of this typing effort, though: Placeholders for inputs and outputs. This is how it works:
Instead of running
$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
you could shorten this to
$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 {inputs} {outputs}"
The placeholder {inputs} will expand to the path given as --input, and the placeholder {outputs} will expand to the path given as --output. This means instead of writing the full paths in the command, you can simply reuse the --input and --output specification done before.
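The substitution can be mimicked in plain shell. The sketch below is not how DataLad implements placeholders; it only uses sed to imitate the textual replacement for the single input and output of this example, so you can see which command line ends up being executed.

```sh
input="recordings/longnow/.datalad/feed_metadata/logo_interval.jpg"
output="recordings/interval_logo_small.jpg"

# The command as you would write it in the datalad run call.
template='convert -resize 450x450 {inputs} {outputs}'

# Imitate the placeholder expansion with a plain text substitution.
# "|" is used as the sed delimiter because the paths contain "/".
expanded=$(printf '%s\n' "$template" |
    sed -e "s|{inputs}|$input|" -e "s|{outputs}|$output|")

printf '%s\n' "$expanded"
```

Recent DataLad versions also offer a --dry-run option for datalad run that reports the expanded command without executing it; check datalad run --help to see whether your version provides it.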
Find out more: What if I have multiple inputs or outputs?
If multiple values are specified, e.g., as in
$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv {inputs} {outputs}"
the values will be joined by a space like this:
$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"
The order of the values will match the order given on the command line.
If you use globs for input specification, as in
$ datalad run -m "move a few files around" \
--input "file*" \
--output "directory_a/" \
"mv {inputs} {outputs}"
the globs will be expanded in alphabetical order (like bash):
$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"
If the command only needs a subset of the inputs or outputs, individual values can be accessed with an integer index, e.g., {inputs[0]} for the very first input.
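The joining, ordering, and indexing described above have close analogues in plain shell. The following sketch does not call DataLad; it uses three hypothetical files in a throwaway directory to show a glob expanding in sorted order, the values being joined by a space, and the first value being picked out.

```sh
# Create three hypothetical files, deliberately not in alphabetical order.
workdir=$(mktemp -d)
touch "$workdir/file3" "$workdir/file1" "$workdir/file2"

# Expand the glob inside the directory; the shell sorts the matches.
# "$*" joins all values with a space, similar to what {inputs} does.
all_inputs=$(cd "$workdir" && set -- file* && printf '%s' "$*")

# $1 is the first value. The shell counts from 1,
# whereas the placeholder {inputs[0]} counts from 0.
first_input=$(cd "$workdir" && set -- file* && printf '%s' "$1")

echo "mv $all_inputs directory_a/"
echo "first input: $first_input"

# Clean up the throwaway directory.
rm -rf "$workdir"
```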
Find out more: … wait, what if I need a { or } character in my datalad run call?
If your command call involves a { or } character, you will need to escape this brace character by doubling it, i.e., {{ or }}.
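As an illustration, consider an awk call, which needs literal braces around its program. The sketch below only imitates the effect with sed and is not DataLad's own mechanism; the input file name notes.txt is used purely as an example. It shows what you would write in the datalad run call and what the command would look like once the doubled braces are reduced to single ones and the placeholder is filled in.

```sh
# What you would write inside the datalad run call:
# doubled braces for awk, a single-brace placeholder for the input.
written="awk '{{print \$1}}' {inputs}"

# Imitate the expansion: fill in the placeholder first,
# then turn each doubled brace into a single one.
executed=$(printf '%s\n' "$written" |
    sed -e 's|{inputs}|notes.txt|' -e 's|{{|{|g' -e 's|}}|}|g')

printf 'written:  %s\n' "$written"
printf 'executed: %s\n' "$executed"
```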
[1] In shell programming, commands exit with a specific code that indicates whether they failed, and if so, how. Successful commands have the exit code zero. All failures have exit codes greater than zero. A few lines lower, DataLad even tells us the specific error code: The command failed with exit code 1.
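If you want to see exit codes for yourself, the shell stores the code of the most recent command in the special variable $?. The following minimal sketch uses only standard shell commands; the missing path is made up for illustration.

```sh
# A successful command exits with code 0.
true
ok_code=$?

# "false" is a tiny command whose only job is to fail.
fail_code=0
false || fail_code=$?

# Listing a path that does not exist fails as well.
# The path below is made up for this illustration.
missing_code=0
ls /hypothetical/path/that/does/not/exist 2>/dev/null || missing_code=$?

echo "true exited with $ok_code"
echo "false exited with $fail_code"
echo "ls on a missing path exited with $missing_code"
```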