2.5. Summary
In the last four sections, we demonstrated how to create a proper datalad run (manual) command, and discovered the concept of locked content.
datalad run records and saves the changes a command makes in a dataset. That means that modifications to existing content or new content are associated with a specific command and saved to the dataset’s history. Essentially, datalad run helps you to keep track of what you do in your dataset by capturing all provenance.
A datalad run command generates a run record in the commit. This run record can be used by DataLad to re-execute a command with datalad rerun SHASUM (manual), where SHASUM is the commit hash of the datalad run command that should be re-executed.
If a datalad run or datalad rerun does not modify any content, it will not write a record to history.
With any datalad run, specify a commit message, and whenever appropriate, specify its inputs to the executed command (using the -i/--input flag) and/or its output (using the -o/--output flag). The full command structure is:
$ datalad run -m "commit message here" --input "path/to/input/" --output "path/to/output" "command"
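As a minimal sketch of this command structure, the following shell snippet assembles one such call from its parts and prints it, one argument per line, so that you can check the quoting before anything is executed. The commit message, the paths, and the script name code/convert.py are hypothetical placeholders, not files from the dataset used in this chapter.

```shell
# Hypothetical values, for illustration only: adjust them to your dataset.
msg="Convert raw recordings to CSV"
input="data/raw/"
output="data/converted/"
cmd="python code/convert.py"

# Assemble the call as positional parameters so that every argument,
# including the multi-word message and command, stays a single unit.
set -- datalad run -m "$msg" --input "$input" --output "$output" "$cmd"

# Print the assembled call for review, one argument per line.
printf '%s\n' "$@"
```

Inside a dataset, replacing the printf line with "$@" would execute the assembled call instead of printing it.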
Anything specified as input will be retrieved if necessary with a datalad get (manual) prior to command execution. Anything specified as output will be unlocked prior to modifications.
It is good practice to specify input and output to ensure that a datalad rerun works, and to capture the relevant elements of a computation in a machine-readable record. If you want to spare yourself preparation time in case everything is already retrieved and unlocked, you can use --assume-ready {inputs|outputs|both} to skip a check on whether inputs are already present or outputs already unlocked.
Getting and unlocking content is not only convenient for yourself, but also enormously helpful for anyone you share your dataset with. An upcoming section will demonstrate this in detail.
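To illustrate the --assume-ready option mentioned above, here is a small sketch that builds a datalad run call which skips the readiness checks for both inputs and outputs, and prints it for review. The message, paths, and script name are hypothetical. Skipping the checks is only safe when the inputs really are present and the outputs really are unlocked; otherwise the executed command may fail.

```shell
# Hypothetical values, for illustration only.
# "--assume-ready both" tells datalad run to skip retrieving inputs
# and unlocking outputs, on the assumption that both are already prepared.
set -- datalad run -m "Recompute summary statistics" \
    --input "data/converted/" \
    --output "results/summary.csv" \
    --assume-ready both \
    "python code/summarize.py"

# Print the assembled call for review, one argument per line.
printf '%s\n' "$@"
```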
To execute a datalad run or datalad rerun, a datalad status (manual) either needs to report that the dataset has no uncommitted changes (the dataset state should be “clean”), or the command needs to be extended with the --explicit option.
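The following sketch shows where the --explicit option would go in a datalad run call for a dataset that still contains unrelated uncommitted changes. The message, paths, and script name are hypothetical. As far as the option is documented, only modifications to the listed outputs are saved when --explicit is given, so it pays to list every output the command writes.

```shell
# Hypothetical values, for illustration only.
# With --explicit, datalad run does not insist on a clean dataset;
# the listed outputs are what gets saved after the command finishes.
set -- datalad run -m "Regenerate figure 1" --explicit \
    --input "results/summary.csv" \
    --output "figures/fig1.png" \
    "python code/plot.py"

# Print the assembled call for review, one argument per line.
printf '%s\n' "$@"
```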
2.5.1. Now what can I do with that?
You have learned hands-on how to use datalad run and datalad rerun. Both
of these commands make it easier for you and others to associate changes in a dataset with
a script or command. They are also helpful because DataLad stores the exact command for a
given task, so it does not need to be remembered.
Furthermore, by experiencing many common error messages in the context of datalad run
commands, you have gotten some clues on where to look for problems, should you encounter
those errors in your own work.
Lastly, we’ve started to unveil some principles of git-annex that are relevant to
understanding how certain commands work and why certain commands may fail. We have seen that
git-annex locks large files’ content to prevent accidental modifications, and how the --output
flag in datalad run can save us an intermediate datalad unlock (manual) to unlock this content.
The next section will elaborate on this a bit more.
2.5.2. Further reading
The chapter on datalad run provided an almost complete feature overview of the command.
If you want, you can extend this knowledge with computational environments and datalad containers-run (manual) in chapter Computational reproducibility with software containers.
In addition, you can read up on other computing use cases, for example how to use datalad run in interactive computing environments such as Jupyter Notebooks.