Input and output

In the previous two sections, you created a simple .tsv file of all speakers and talk titles in the longnow/ podcasts subdataset, and you have re-executed a datalad run command after a bug-fix in your script.

But these previous datalad run and datalad rerun command were very simple. Maybe you noticed some values in the run record were empty: inputs and outputs for example did not have an entry. Let’s experience a few situations in which these two arguments can become necessary.

In our DataLad-101 course we were given a group assignment. Everyone should give a small presentation about an open DataLad dataset they found. Conveniently, you decided to settle for the longnow podcasts right away. After all, you know the dataset quite well already, and after listening to almost a third of the podcasts and enjoying them a lot, you also want to recommend them to the others.

Almost all of the slides are ready, but what’s still missing is the logo of the longnow podcasts. Good thing that this is part of the subdataset, so you can simply retrieve it from there.

The logos (one for the SALT series, one for the Interval series – the two directories in the subdataset) were originally extracted from the podcasts metadata information by DataLad. In a while, we will dive into the metadata aggregation capabilities of DataLad, but for now, let’s just use the logos instead of finding out where they come from – this will come later. As part of the metadata of the dataset, the logos are in the hidden paths .datalad/feed_metadata/logo_salt.jpg and .datalad/feed_metadata/logo_interval.jpg:

$ ls recordings/longnow/.datalad/feed_metadata/*jpg
recordings/longnow/.datalad/feed_metadata/logo_interval.jpg
recordings/longnow/.datalad/feed_metadata/logo_salt.jpg

For the slides you decide to prepare images of size 400x400 px, but the logos’ original size is much larger (both are 3000x3000 pixel). Therefore let’s try to resize the images – currently, they’re far too large to fit on a slide.

To resize an image from the command line we can use the Unix command convert -resize from the ImageMagick tool. The command takes a new size in pixels as an argument, a path to the file that should be resized, and a filename and path under which a new, resized image will be saved. To resize one image to 400x400 px, the command would thus be convert -resize 400x400 path/to/file.jpg path/to/newfilename.jpg.

Remembering the last lecture on datalad run, you decide to plug this into datalad run. Even though this is not a script, it is a command, and you can wrap commands like this conveniently with datalad run. Because they will be quite long, we line break the commands in the upcoming examples for better readability – in your terminal, you can always write the commands into a single line.

$ datalad run -m "Resize logo for slides" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] == Command start (output follows) ===== 
convert-im6.q16: unable to open image `recordings/longnow/.datalad/feed_metadata/logo_salt.jpg': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `recordings/salt_logo_small.jpg' @ error/convert.c/ConvertImageCommand/3258.
[INFO] == Command exit (modification check follows) ===== 
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG' 
CommandError: command 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' failed with exitcode 1
Failed to run 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' under '/home/me/dl-101/DataLad-101'. Exit code=1.

Oh, crap! Why didn’t this work?

Let’s take a look at the error message DataLad provides. In general, these error messages might seem wordy, and maybe a bit intimidating as well, but usually they provide helpful information to find out what is wrong. Whenever you encounter an error message, make sure to read it, even if it feels like a mushroom cloud exploded in your terminal.

A datalad run error message has several parts. The first starts after

[INFO   ] == Command start (output follows) =====.

This is displaying errors that the terminal command threw: The convert tool complains that it can not open the file, because there is “No such file or directory”.

The second part starts after

[INFO   ] == Command exit (modification check follows) =====.

DataLad adds information about a “non-zero exit code”. A non-zero exit code indicates that something went wrong1. In principle, you could go ahead and google what this specific exit status indicates. However, the solution might have already occurred to you when reading the first error report: The file is not present.

How can that be?

“Right!”, you exclaim with a facepalm. Just as the .mp3 files, the .jpg file content is not present locally after a datalad clone, and we did not datalad get it yet!

This is where the -i/--input option for a datalad run becomes useful. The content of everything that is specified as an input will be retrieved prior to running the command.

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
get(ok): recordings/longnow/.datalad/feed_metadata/logo_salt.jpg (file) [from web...]
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1, ok: 1)
  save (notneeded: 1, ok: 1)

Cool! You can see in this output that prior to the data command execution, DataLad did a datalad get. This is useful for several reasons. For one, it saved us the work of manually getting content. But moreover, this is useful for anyone with whom we might share the dataset: With an installed dataset one can very simply rerun datalad run commands if they have the input argument appropriately specified. It is therefore good practice to specify the inputs appropriately. Remember from section Install datasets that datalad get will only retrieve content if it is not yet present, all input already downloaded will not be downloaded again – so specifying inputs even though they are already present will not do any harm.

Find out more: What if there are several inputs?

Often, a command needs several inputs. In principle, every input gets its own -i/--input flag. However, you can make use of globbing. For example,

datalad run --input "*.jpg" "COMMAND"

will retrieve all .jpg files prior to command execution.

If outputs already exist…

Looking at the resulting image, you wonder whether 400x400 might be a tiny bit to small. Maybe we should try to resize it to 450x450, and see whether that looks better?

Note that we can not use a datalad rerun for this: if we want to change the dimension option in the command, we have to define a new datalad run command.

To establish best-practices, let’s specify the input even though it is already present:

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
convert-im6.q16: unable to open image `recordings/salt_logo_small.jpg': Permission denied @ error/blob.c/OpenBlob/2874.
[INFO] == Command exit (modification check follows) ===== 
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG' 
CommandError: command 'convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' failed with exitcode 1
Failed to run 'convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' under '/home/me/dl-101/DataLad-101'. Exit code=1.

Oh wtfWhat is it now?

A quick glimpse into the error message shows a different error than before: The tool complains that it is “unable to open” the image, because the “Permission [is] denied”.

We have not seen anything like this before, and we need to turn to our lecturer for help. Confused about what we might have done wrong, we raise our hand to ask the instructor for help. Knowingly, she smiles, and tells you about how DataLad protects content given to it:

“Content in your DataLad dataset is protected by git-annex from accidental changes” our instructor begins.

“Wait!” we interrupt. “First off, that wasn’t accidental. And second, I was told this course does not have git-annex-101 as a prerequisite?”

“Yes, hear me out” she says. “I promise you two different solutions at the end of this explanation, and the concept behind this is quite relevant”.

DataLad usually gives content to git-annex to store and track. git-annex, let’s just say, takes this task really seriously. One of its features that you have just experienced is that it locks content.

If files are locked down, their content can not be modified. In principle, that’s not a bad thing: It could be your late grandma’s secret cherry-pie recipe, and you do not want to accidentally change that. Therefore, a file needs to be consciously unlocked to apply modifications.

In the attempt to resize the image to 450x450 you tried to overwrite recordings/salt_logo_small.jpg, a file that was given to DataLad and thus protected by git-annex.

There is a DataLad command that takes care of unlocking file content, and thus making locked files modifiable again: datalad unlock (datalad-unlock manual). Let us check out what it does:

$ datalad unlock recordings/salt_logo_small.jpg
unlock(ok): recordings/salt_logo_small.jpg (file)

Well, unlock(ok) does not sound too bad for a start. As always, we feel the urge to run a datalad status on this:

$ datalad status
 modified: recordings/salt_logo_small.jpg (symlink)

“Ah, do not mind that for now”, our instructor says, and with a wink she continues: “We’ll talk about symlinks and object trees a while later”. You are not really sure whether that’s a good thing, but you have a task to focus on. Hastily, you run the command right from the terminal:

$ convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg

Hey, no permission denied error! You note that the instructor still stands right next to you. “Sooo… now what do I do to lock the file again?” you ask.

“Well… what you just did there was quite suboptimal. Didn’t you want to use datalad run? But, anyway, in order to lock the file again, you would need to run a datalad save.”

$ datalad save -m "resized picture by hand"
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

“So”, you wonder aloud, “whenever I want to modify I need to datalad unlock it, do the modifications, and then datalad save it?”

“Well, this is certainly one way of doing it, and a completely valid workflow if you would do that outside of a datalad run command. But within datalad run there is actually a much easier way of doing this. Let’s use the --output argument.”

datalad run retrieves everything that is specified as --input prior to command execution, and it unlocks everything specified as --output prior to command execution. Therefore, whenever the output of a datalad run command already exists and is tracked, it should be specified as an argument in the -o/--output option.

Find out more: But what if I have a lot of outputs?

The use case here is simplistic – a single file gets modified. But there are commands and tools that create full directories with many files as an output, for example FSL, a neuro-imaging tool. The easiest way to specify this type of output is the directory name and a globbing character, such as -o directory/*. And, just as for -i/--input, you could use multiple --output specifications.

In order to execute datalad run with both the -i/--input and -o/--output flag and see their magic, let’s crop the second logo, logo_interval.jpg:

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

# or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
-o "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
[INFO] Making sure inputs are available (this may take some time) 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
get(ok): recordings/longnow/.datalad/feed_metadata/logo_interval.jpg (file) [from web...]
add(ok): recordings/interval_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1, ok: 1)
  save (notneeded: 1, ok: 1)

This time, with both --input and --output options specified, DataLad informs about the datalad get operations it performs prior to the command execution, and datalad run executes the command successfully. It does not inform about any datalad unlock operation, because the output recordings/interval_logo_small.jpg does not exist before the command is run. Should you rerun this command however, the summary will include a statement about content unlocking. You will see an example of this in the next section.

Note now how many individual commands a datalad run saves us: datalad get, datalad unlock, and datalad save! But even better: Beyond saving time now, running commands reproducibly and recorded with datalad run saves us plenty of time in the future as soon as we want to rerun a command, or find out how a file came into existence.

With this last code snippet, you have experienced a full datalad run command: commit message, input and output definitions (the order in which you give those two options is irrelevant), and the command to be executed. Whenever a command takes input or produces output you should specify this with the appropriate option.

Make a note of this behavior in your notes.txt file.

$ cat << EOT >> notes.txt
You should specify all files that a command takes as input with an -i/--input flag. These
files will be retrieved prior to the command execution. Any content that is modified or
produced by the command should be specified with an -o/--output flag. Upon a run or rerun
of the command, the contents of these files will get unlocked so that they can be modified.

EOT

Placeholders

Just after writing this note, you have to relax your fingers a bit. “Man, this was so much typing. Not only did I need to specify the inputs and outputs, I also had to repeat all of these lengthy paths in the command line call…” you think.

There is a neat little trick to spare you half of this typing effort, though: Placeholders for inputs and outputs. This is how it works:

Instead of running

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

you could shorten this to

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 {inputs} {outputs}"

The placeholder {inputs} will expand to the path given as --input, and the placeholder {outputs} will expand to the path given as --output. This means instead of writing the full paths in the command, you can simply reuse the --input and --output specification done before.

Find out more: What if I have multiple inputs or outputs?

If multiple values are specified, e.g., as in

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv {inputs} {outputs}"

the values will be joined by a space like this:

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"

The order of the values will match that order from the command line.

If you use globs for input specification, as in

$ datalad run -m "move a few files around" \
--input "file*" \
--output "directory_a/" \
"mv {inputs} {outputs}"

the globs will expanded in alphabetical order (like bash):

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"

If the command only needs a subset of the inputs or outputs, individual values can be accessed with an integer index, e.g., {inputs[0]} for the very first input.

Find out more: … wait, what if I need a { or } character in my datalad run call?

If your command call involves a { or } character, you will need to escape this brace character by doubling it, i.e., {{ or }}.

1

In shell programming, commands exit with a specific code that indicates whether they failed, and if so, how. Successful commands have the exit code zero. All failures have exit codes greater than zero. A few lines lower, DataLad even tells us the specific error code: The command failed with exit code 1.