2.3. Input and output

In the previous two sections, you created a simple .tsv file of all speakers and talk titles in the longnow/ podcasts subdataset, and you have re-executed a datalad run command after a bug-fix in your script.

But these previous datalad run and datalad rerun commands were very simple. Maybe you noticed that some values in the run record were empty: inputs and outputs, for example, did not have an entry. Let’s experience a few situations in which these two arguments become necessary.

In our DataLad-101 course we were given a group assignment. Everyone should give a small presentation about an open DataLad dataset they found. Conveniently, you decided to settle for the longnow podcasts right away. After all, you know the dataset quite well already, and after listening to almost a third of the podcasts and enjoying them a lot, you also want to recommend them to the others.

Almost all of the slides are ready, but what’s still missing is the logo of the longnow podcasts. Good thing that this is part of the subdataset, so you can simply retrieve it from there.

The logos (one for the SALT series, one for the Interval series – the two directories in the subdataset) were originally extracted from the podcasts metadata information by DataLad. In a while, we will dive into the metadata aggregation capabilities of DataLad, but for now, let’s just use the logos instead of finding out where they come from – this will come later. As part of the metadata of the dataset, the logos are in the hidden paths .datalad/feed_metadata/logo_salt.jpg and .datalad/feed_metadata/logo_interval.jpg:

$ ls recordings/longnow/.datalad/feed_metadata/*jpg
recordings/longnow/.datalad/feed_metadata/logo_interval.jpg
recordings/longnow/.datalad/feed_metadata/logo_salt.jpg

For the slides you decide to prepare images of size 400x400 px, but the logos’ original size is much larger (both are 3000x3000 pixels), far too large to fit on a slide. Therefore, let’s resize the images.

To resize an image from the command line we can use the Unix command convert -resize from the ImageMagick tool. The command takes a new size in pixels as an argument, a path to the file that should be resized, and a filename and path under which a new, resized image will be saved. To resize one image to 400x400 px, the command would thus be convert -resize 400x400 path/to/file.jpg path/to/newfilename.jpg.

ImageMagick is not installed on Windows systems by default. To use it, you need to install it, using the provided Windows Binary Release on the Download page. During installation, it is important to install the tool into a place where it is easily accessible to your terminal, for example the Program Files folder. Do also make sure to tick the box “install legacy commands” in the installation wizard.

Remembering the last lecture on datalad run, you decide to plug this into datalad run. Even though this is not a script, it is a command, and you can wrap commands like this conveniently with datalad run. Because they will be quite long, we line break the commands in the upcoming examples for better readability – in your terminal, you can always write the commands into a single line.

$ datalad run -m "Resize logo for slides" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] == Command start (output follows) =====
convert convert: Unable to open file (recordings/longnow/.datalad/feed_metadata/logo_salt.jpg) [No such file or directory].
[INFO] == Command exit (modification check follows) =====
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG'
run(error): /home/me/dl-101/DataLad-101 (dataset) [convert -resize 400x400 recordings/longn...]

Oh, crap! Why didn’t this work?

Let’s take a look at the error message DataLad provides. In general, these error messages might seem wordy, and maybe a bit intimidating as well, but usually they provide helpful information to find out what is wrong. Whenever you encounter an error message, make sure to read it, even if it feels like a mushroom cloud exploded in your terminal.

A datalad run error message has several parts. The first starts after

[INFO ] == Command start (output follows) =====.

This part displays the errors that the wrapped command itself threw: The convert tool complains that it cannot open the file, because there is “No such file or directory”.

The second part starts after

[INFO ] == Command exit (modification check follows) =====.

DataLad adds information about a “non-zero exit code”. A non-zero exit code indicates that something went wrong[1]. In principle, you could go ahead and google what this specific exit status indicates. However, the solution might have already occurred to you when reading the first error report: The file is not present.
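To make the idea of an exit code more tangible, here is a minimal shell sketch that is independent of DataLad. The path is a made-up example that is assumed not to exist, and the variable names are chosen for illustration only.

```shell
# Ask ls for a path that (we assume) does not exist and capture its exit code.
status=0
ls /this/path/does/not/exist 2>/dev/null || status=$?
echo "ls exit code: $status"      # non-zero: something went wrong

# A command that succeeds exits with 0.
ok=0
true || ok=$?
echo "true exit code: $ok"
```

datalad run inspects the exit code of the wrapped command in the same way, which is why the failed convert call was reported as run(error).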

How can that be?

“Right!”, you exclaim with a facepalm. Just like the .mp3 files, the .jpg file content is not present locally after a datalad clone, and we did not datalad get it yet!

This is where the -i/--input option for a datalad run becomes useful. The content of everything that is specified as an input will be retrieved prior to running the command.

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
$ # or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
get(ok): recordings/longnow/.datalad/feed_metadata/logo_salt.jpg (file) [from web...]
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [convert -resize 400x400 recordings/longn...]
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)

Cool! You can see in this output that prior to the command execution, DataLad did a datalad get. This is useful for several reasons. For one, it saved us the work of manually getting content. But moreover, this is useful for anyone with whom we might share the dataset: With an installed dataset, one can very simply rerun datalad run commands if they have the input argument appropriately specified. It is therefore good practice to specify the inputs appropriately. Remember from section Install datasets that datalad get will only retrieve content if it is not yet present; input that is already downloaded will not be downloaded again. Specifying inputs even though they are already present will thus not do any harm.

Often, a command needs several inputs. In principle, every input (which could be files, directories, or subdatasets) gets its own -i/--input flag. However, you can make use of globbing. For example,

$ datalad run --input "*.jpg" "COMMAND"

will retrieve all .jpg files prior to command execution.
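Note that the glob above is written in quotes. A quoted pattern reaches datalad run unchanged, so that DataLad can expand it itself, whereas an unquoted pattern would already be replaced by the shell. The following sketch illustrates the difference with plain shell tools in a throwaway directory; the file names are made up for illustration.

```shell
# Work in a temporary directory with three made-up files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch logo_interval.jpg logo_salt.jpg notes.txt

# Quoted: the program receives the literal pattern.
quoted=$(printf '%s ' "*.jpg")
# Unquoted: the shell replaces the pattern with the matching file names.
unquoted=$(printf '%s ' *.jpg)

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

With the quoted form, the pattern itself rather than a fixed list of file names is what the command receives.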

2.3.1. If outputs already exist…

The section below describes something that is very confusing for people that have just started with DataLad: Some files in a dataset can’t be modified, and if one tries, it results in a “permission denied” error. Why is that? The remainder of this section and the upcoming chapter Under the hood: git-annex contain a procedural explanation. However: This doesn’t happen on Windows. The “unlocking” that is necessary on almost all other systems to modify a file is already done on Windows. Thus, all files in your dataset will be readily modifiable, sparing you the need to adjust to the unexpected behavior that is described below. While it is easier, it isn’t a “more useful” behavior, though. A different Windows Wit in the next chapter will highlight how it rather is a suboptimal workaround.

Please don’t skip the next section – it is useful to know how datasets behave on other systems. Just be mindful that you will not encounter the errors that the handbook displays next. And while this all sounds quite cryptic and vague, an upcoming Windows Wit will provide more information.

Looking at the resulting image, you wonder whether 400x400 might be a tiny bit too small. Maybe we should try to resize it to 450x450, and see whether that looks better?

Note that we cannot use a datalad rerun for this: if we want to change the dimension option in the command, we have to define a new datalad run command.

To establish best-practices, let’s specify the input even though it is already present:

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
$ # or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO] == Command start (output follows) =====
convert convert: Unable to open file (recordings/salt_logo_small.jpg) [Permission denied].
[INFO] == Command exit (modification check follows) =====
[INFO] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG'
run(error): /home/me/dl-101/DataLad-101 (dataset) [convert -resize 450x450 recordings/longn...]

Oh wtf… What is it now?

A quick glimpse into the error message shows a different error than before: The tool complains that it is “unable to open” the image, because the “Permission [is] denied”.

We have not seen anything like this before, and we need to turn to our lecturer for help. Confused about what we might have done wrong, we raise our hand to ask the instructor. Knowingly, she smiles, and tells us how DataLad protects content given to it:

“Content in your DataLad dataset is protected by git-annex from accidental changes”, our instructor begins.

“Wait!” we interrupt. “First off, that wasn’t accidental. And second, I was told this course does not have git-annex-101 as a prerequisite?”

“Yes, hear me out” she says. “I promise you two different solutions at the end of this explanation, and the concept behind this is quite relevant”.

DataLad usually gives content to git-annex to store and track. git-annex, let’s just say, takes this task really seriously. One of its features that you have just experienced is that it locks content.

If files are locked down, their content cannot be modified. In principle, that’s not a bad thing: It could be your late grandma’s secret cherry-pie recipe, and you do not want to accidentally change that. Therefore, a file needs to be consciously unlocked to apply modifications.

In the attempt to resize the image to 450x450 you tried to overwrite recordings/salt_logo_small.jpg, a file that was given to DataLad and thus protected by git-annex.

There is a DataLad command that takes care of unlocking file content, and thus making locked files modifiable again: datalad unlock. Let us check out what it does:

On Windows: nothing. All of the files in your dataset are always unlocked, and actually cannot be locked at all. Consequently, there will be nothing to show for datalad status afterwards (as shown a few paragraphs below). This is due to a file system limitation, and will be explained in more detail in chapter Under the hood: git-annex.

$ datalad unlock recordings/salt_logo_small.jpg
unlock(ok): recordings/salt_logo_small.jpg (file)

Well, unlock(ok) does not sound too bad for a start. As always, we feel the urge to run a datalad status (manual) on this:

$ datalad status
 modified: recordings/salt_logo_small.jpg (file)

“Ah, do not mind that for now”, our instructor says, and with a wink she continues: “We’ll talk about symlinks and object trees a while later”. You are not really sure whether that’s a good thing, but you have a task to focus on. Hastily, you run the command right from the terminal:

$ convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg

Hey, no permission denied error! You note that the instructor still stands right next to you. “Sooo… now what do I do to lock the file again?” you ask.

“Well… what you just did there was quite suboptimal. Didn’t you want to use datalad run? But, anyway, in order to lock the file again, you would need to run a datalad save (manual).”

$ datalad save -m "resized picture by hand"
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)

“So”, you wonder aloud, “whenever I want to modify a file, I need to datalad unlock it, do the modifications, and then datalad save it?”

“Well, this is certainly one way of doing it, and a completely valid workflow if you do that outside of a datalad run command. But within datalad run there is actually a much easier way of doing this. Let’s use the --output argument.”

datalad run retrieves everything that is specified as --input prior to command execution, and it unlocks everything specified as --output prior to command execution. Therefore, whenever the output of a datalad run command already exists and is tracked, it should be specified as an argument in the -o/--output option.

The use case here is simplistic – a single file gets modified. But there are commands and tools that create full directories with many files as an output. The easiest way to specify this type of output is by supplying the directory name, or the directory name and a globbing character, such as -o directory/*.dat. This would unlock all files with a .dat extension inside of directory. To glob for files in multiple levels of directories, use ** (a so-called globstar) for a recursive glob through any number of directories. And, just as for -i/--input, you could use multiple --output specifications.

In order to execute datalad run with both the -i/--input and -o/--output flags and see their magic, let’s resize the second logo, logo_interval.jpg:

Given that nothing in your dataset is locked, is there a need for you to bother with creating --output flags? Not for you personally, if you only stay on your Windows machine. However, you will be doing others that you share your dataset with a favor if they are not using Windows – should you or others want to rerun a run record, --output flags will make it work on all operating systems.

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

$ # or shorter:
$ datalad run -m "Resize logo for slides" \
-i "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
-o "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
get(ok): recordings/longnow/.datalad/feed_metadata/logo_interval.jpg (file) [from web...]
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/dl-101/DataLad-101 (dataset) [convert -resize 450x450 recordings/longn...]
add(ok): recordings/interval_logo_small.jpg (file)
save(ok): . (dataset)

This time, with both --input and --output options specified, DataLad informs about the datalad get operations it performs prior to the command execution, and datalad run executes the command successfully. It does not inform about any datalad unlock operation, because the output recordings/interval_logo_small.jpg does not exist before the command is run. Should you rerun this command however, the summary will include a statement about content unlocking. You will see an example of this in the next section.

Note now how many individual commands a datalad run saves us: datalad get, datalad unlock, and datalad save! But even better: Beyond saving time now, running commands reproducibly and recorded with datalad run saves us plenty of time in the future as soon as we want to rerun a command, or find out how a file came into existence.
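To see what datalad run bundles, the manual steps can be sketched as a small shell function. This is an illustrative helper, not part of DataLad: the function name resize_by_hand and its three arguments are assumptions made for this example, and it presumes that the output file already exists and is tracked.

```shell
# Hypothetical helper: resize an image into an already tracked output
# file by hand, without datalad run.
resize_by_hand() {
    src=$1    # input image, possibly not yet retrieved
    dest=$2   # existing, tracked (locked) output file
    size=$3   # target size, e.g. 450x450

    # 1. retrieve the input content
    datalad get "$src" || return 1
    # 2. make the output file modifiable
    datalad unlock "$dest" || return 1
    # 3. run the actual command
    convert -resize "$size" "$src" "$dest" || return 1
    # 4. save the result, which also locks the file again
    datalad save -m "Resize logo by hand" "$dest"
}
```

Unlike a datalad run call, these manual steps leave no run record behind, so the resulting commit cannot be re-executed with datalad rerun.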

With this last code snippet, you have experienced a full datalad run command: commit message, input and output definitions (the order in which you give those two options is irrelevant), and the command to be executed. Whenever a command takes input or produces output you should specify this with the appropriate option.

Make a note of this behavior in your notes.txt file.

$ cat << EOT >> notes.txt
You should specify all files that a command takes as input with an
-i/--input flag. These files will be retrieved prior to the command
execution. Any content that is modified or produced by the command
should be specified with an -o/--output flag. Upon a run or rerun of
the command, the contents of these files will get unlocked so that
they can be modified.

EOT

2.3.2. Save yourself the preparation time

It’s generally good practice to specify --input and --output even if your input files are already retrieved and your output files unlocked – it makes sure that a recomputation can succeed, even if inputs are not yet retrieved, or if output needs to be unlocked. However, the internal preparation steps of checking that inputs exist or that outputs are unlocked can take a bit of time, especially if it involves checking a large number of files.

If you want to avoid the expense of unnecessary preparation steps you can make use of the --assume-ready argument of datalad run. Depending on whether your inputs are already retrieved, your outputs already unlocked (or not needed to be unlocked), or both, specify --assume-ready with the argument inputs, outputs or both and save yourself a few seconds, without sacrificing the ability to rerun your command under conditions in which the preparation would be necessary.
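As a minimal sketch, such a call could be wrapped like this. The function name resize_assume_ready and its two positional arguments are made up for illustration, and the example assumes that the input is already retrieved and that the output does not need unlocking.

```shell
# Hypothetical helper: record inputs and outputs in the run record, but
# skip the preparation checks because everything is assumed to be ready.
resize_assume_ready() {
    # $1: input image, $2: output image
    datalad run -m "Resize logo for slides" \
        --input "$1" \
        --output "$2" \
        --assume-ready both \
        "convert -resize 450x450 $1 $2"
}
```

It could be called as resize_assume_ready recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg. If only one side is known to be ready, inputs or outputs would replace both.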

2.3.3. Placeholders

Just after writing the note, you had to relax your fingers a bit. “Man, this was so much typing. Not only did I need to specify the inputs and outputs, I also had to repeat all of these lengthy paths in the command line call…” you think.

There is a neat little trick to spare you half of this typing effort, though: Placeholders for inputs and outputs. This is how it works:

Instead of running

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

you could shorten this to

$ datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 {inputs} {outputs}"

The placeholder {inputs} will expand to the path given as --input, and the placeholder {outputs} will expand to the path given as --output. This means instead of writing the full paths in the command, you can simply reuse the --input and --output specification done before.

If multiple values are specified, e.g., as in

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv {inputs} {outputs}"

the values will be joined by a space like this:

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"

The order of the values will match the order in which they were given on the command line.

If you use globs for input specification, as in

$ datalad run -m "move a few files around" \
--input "file*" \
--output "directory_a/" \
"mv {inputs} {outputs}"

the globs will be expanded in alphabetical order (like bash):

$ datalad run -m "move a few files around" \
--input "file1" --input "file2" --input "file3" \
--output "directory_a/" \
"mv file1 file2 file3 directory_a/"

If the command only needs a subset of the inputs or outputs, individual values can be accessed with an integer index, e.g., {inputs[0]} for the very first input.

If your command call involves a { or } character, you will need to escape this brace character by doubling it, i.e., {{ or }}.
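The following sketch combines indexed placeholders with escaped braces. The helper name first_column and the awk call are assumptions chosen for illustration; they do not appear elsewhere in this course.

```shell
# Hypothetical helper: write the first column of a table to a new file.
first_column() {
    # $1: input table, $2: output file
    # - {inputs[0]} and {outputs[0]} pick the first input and output
    # - {{ and }} stand for the literal braces that awk needs
    # - \$1 keeps the shell from replacing awk's $1 inside double quotes
    datalad run -m "Extract first column" \
        --input "$1" \
        --output "$2" \
        "awk '{{print \$1}}' {inputs[0]} > {outputs[0]}"
}
```

After DataLad has filled in the placeholders, the executed command should read awk '{print $1}' followed by the input path, a redirection, and the output path.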

2.3.4. Dry-running your run call

datalad run commands can become confusing and long, especially when you make heavy use of placeholders or wrap complex bash commands. To better anticipate what you will be running, or to help debug a failed command, you can make use of the --dry-run flag of datalad run. This option needs a mode specification (--dry-run=basic or --dry-run=command), followed by the run command you want to execute, and it will decipher the command’s elements: The mode command will display the command that is about to be run. The mode basic will report a few important details about the execution: Apart from displaying the command that will be run, you will learn where the command runs, what its inputs are (helpful if your --input specification includes a globbing term), and what its outputs are.
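A dry run of the logo resizing could be sketched as follows. The helper name preview_resize and its argument order are assumptions made for this example.

```shell
# Hypothetical helper: show what a resize call would do without running it.
preview_resize() {
    mode=$1   # "basic" or "command"
    # $2: input image, $3: output image
    datalad run --dry-run="$mode" \
        --input "$2" \
        --output "$3" \
        "convert -resize 450x450 {inputs} {outputs}"
}
```

Calling preview_resize command with the input and output paths would only display the expanded command, while preview_resize basic would additionally report the location, inputs, and outputs.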
