- absolute path¶
The complete path from the root of the file system. Absolute paths always start with /, for example /home/user/Pictures/xkcd-webcomics/530.png. See also relative path.
- adjusted branch¶
git-annex concept: a special branch in a dataset. Adjusted branches refer to a different, existing branch that is not adjusted. The adjusted branch is called “adjusted/<branchname>(unlocked)”, and on the adjusted branch, all files handled by git-annex are not locked: they stay “unlocked” and thus modifiable. Instead of referencing data in the annex with a symlink, unlocked files need to be copies of the data in the annex. Adjusted branches primarily exist as the default branch on so-called crippled filesystems, such as those found on Windows systems.
- annex¶
git-annex concept: a different word for object-tree.
- annex UUID¶
A UUID assigned to the annex of each individual clone of a dataset repository. git-annex uses this UUID to track file content availability information. The UUID is available under the configuration key annex.uuid and is stored in the configuration file of a local clone (<dataset root>/.git/config). A single dataset instance (i.e. a local clone) has exactly one annex UUID, but other clones of the same dataset each have their own unique annex UUIDs.
- bare Git repositories¶
A bare Git repository is a repository that contains the contents of the .git directory of regular DataLad datasets or Git repositories, but no worktree or checkout. This has advantages: The repository is leaner, it is easier for administrators to perform garbage collections, and it is required if you want to push to it at all times. You can find out more on what bare repositories are and how to use them here.
- bash¶
A Unix shell and command language.
- Bitbucket¶
Bitbucket is an online platform where one can store and share version controlled projects using Git (and thus also DataLad projects), similar to GitHub or GitLab. See bitbucket.org.
- branch¶
Git concept: A lightweight, independent history streak of your dataset. Branches can contain fewer, more, or changed files compared to other branches, and one can merge the changes a branch contains into another branch.
- checksum¶
An alternative term to shasum.
- clone¶
Git concept: A copy of a Git repository. In Git terminology, all “installed” datasets are clones.
- commit¶
Git concept: Adding selected changes of a file or dataset to the repository, and thus making these changes part of the revision history of the repository. A commit should always have an informative commit message.
- commit message¶
Git concept: A concise summary of changes you should attach to a datalad save command. This summary will show up in your DataLad dataset history.
- compute node¶
A compute node is an individual computer, part of a high-performance computing (HPC) or high-throughput computing (HTC) cluster.
- conda¶
A package, dependency, and environment management system for a number of programming languages. Find out more at docs.conda.io. It overlaps with pip in functionality, but it is advised not to use both tools simultaneously for package management.
- container recipe¶
A text file template that lists all required components of the computational environment that a software container should contain. It is made by a human user.
- container image¶
Container images are built from container recipe files. They are a static filesystem inside a file, populated with the software specified in the recipe, and some initial configuration.
- crippled filesystem¶
git-annex concept: A file system that does not allow making symlinks or removing write permissions from files. Examples of this are FAT (likely used by your USB sticks) or NTFS (used on Windows systems of the last three decades).
- DataLad dataset¶
A DataLad dataset is a Git repository that may or may not have a data annex that is used to manage data referenced in a dataset. In practice, most DataLad datasets will come with an annex.
- DataLad extension¶
Python packages that equip DataLad with specialized commands. The section DataLad extensions gives an overview of available extensions and links to Handbook chapters that contain demonstrations.
- DataLad Gooey¶
A DataLad extension that provides DataLad with a graphical user interface. Find out more in its documentation: docs.datalad.org/projects/gooey
- DataLad subdataset¶
A DataLad dataset contained within a different DataLad dataset (the parent or DataLad superdataset).
- DataLad superdataset¶
A DataLad dataset that contains one or more levels of other DataLad datasets (DataLad subdataset).
- dataset ID¶
A UUID that identifies a dataset as a unit, across its entire history and flavors. This ID is stored in a dataset’s own configuration file (<dataset root>/.datalad/config) under the configuration key datalad.dataset.id. As this configuration is stored in a file that is part of the Git history of a dataset, this ID is identical for all clones of a dataset and across all its versions.
- Debian¶
A common Linux distribution. More information here.
- debugging¶
Finding and resolving problems within a computer program. To learn about debugging a failed execution of a DataLad command, take a look at the section Debugging.
- Docker¶
Docker is a containerization software that can package software into software containers, similar to Singularity. Find out more on Wikipedia.
- Docker-Hub¶
Docker Hub is a library for Docker container images. Among other things, it hosts and builds Docker container images. You can pull container images built from a publicly shared container recipe from it.
- DOI¶
A digital object identifier (DOI) is a character string used to permanently identify a resource and link to it on the web. A DOI will always refer to the one resource it was assigned to, and only that one.
- extractor¶
DataLad concept: A metadata extractor of the DataLad extension datalad-metalad enables DataLad to extract and aggregate special types of metadata.
- environment variable¶
A variable made up of a name/value pair. Programs using a given environment variable will use its associated value for their execution. You can find out a bit more on environment variables in this Findoutmore.
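To make the name/value idea concrete, here is a minimal Python sketch. It assumes a Python 3 interpreter; the variable name DEMO_GREETING and the helper run_with_variable are made up for this illustration and are not part of DataLad or any other tool.

```python
import os
import subprocess
import sys


def run_with_variable(name, value):
    """Start a child Python process with one extra environment variable
    and return the value the child reads from its environment."""
    # Copy the current environment and add the new name/value pair.
    child_env = dict(os.environ)
    child_env[name] = value
    # The child looks the variable up by name and prints its value.
    child_code = "import os; print(os.environ[{!r}])".format(name)
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# DEMO_GREETING is a hypothetical variable name used only for this sketch.
print(run_with_variable("DEMO_GREETING", "hello"))
```

Only the child process sees the extra variable; the environment of the calling process is left untouched.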
- ephemeral clone¶
Dataset clones that share the annex with the dataset they were cloned from, without git-annex being aware of it. On a technical level, this is achieved via symlinks. They can be created with the --reckless ephemeral option of datalad clone.
- force-push¶
Git concept: Enforcing a git push command with the --force option. Find out more in the documentation of git push.
- fork¶
Git concept on repository hosting sites (GitHub, GitLab, Gin, …): A fork is a copy of a repository on a web-based Git repository hosting site. Find out more here.
- GIN¶
A web-based repository store for data management that you can use to host and share datasets. Find out more about GIN here.
- Git¶
A version control system to track changes made to small-sized files over time. You can find out more about Git in this (free) book or these interactive Git tutorials on GitHub.
- git-annex¶
A distributed file synchronization system, enabling sharing and synchronizing collections of large files. It allows managing files with Git, without checking the file content into Git.
- git-annex branch¶
This branch exists in your dataset if the dataset contains an annex. The git-annex branch is completely unconnected to any other branch in your dataset, and contains different types of log files. Its contents are used for git-annex’s internal tracking of the dataset and its annexed contents. The branch is managed by git-annex, and you should not tamper with it unless you absolutely know what you are doing.
- Git config file¶
A file in which Git stores configuration options. Such a file usually exists on the system, user, and repository (dataset) level.
- GitHub¶
GitHub is an online platform where one can store and share version controlled projects using Git (and thus also DataLad projects). See GitHub.com (https://github.com/).
- gitk¶
A repository browser that displays changes in a repository or a selected set of commits. It visualizes a commit graph, information related to each commit, and the files in the trees of each revision.
- GitLab¶
An online platform to host and share software projects version controlled with Git, similar to GitHub. See Gitlab.com.
- globbing¶
A powerful pattern matching function of a shell. It allows you to match the names of multiple files or directories. The most basic pattern is *, which matches any number of characters, such that ls *.txt will list all .txt files in the current directory. You can read more about Pattern Matching in Bash’s Docs.
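The matching rule behind a pattern such as *.txt can be tried out without a shell. The sketch below uses Python's standard fnmatch module, which applies shell-style wildcards to a list of names; the file names are invented for the example.

```python
import fnmatch

# Hypothetical directory listing, used only for illustration.
names = ["notes.txt", "data.csv", "readme.txt", "plot.png"]

# Keep only the names that match the shell-style pattern *.txt.
txt_files = fnmatch.filter(names, "*.txt")
print(txt_files)
```

To match against the files of a real directory instead of a fixed list, the standard glob module offers glob.glob("*.txt").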
- high-performance computing (HPC)¶
Aggregating computing power from a cluster of computers in a way that delivers higher performance than a typical desktop computer in order to solve computing tasks that require high computing power or demand a lot of disk space or memory.
- high-throughput computing (HTC)¶
A computing environment built from a cluster of computers and tuned to deliver large amounts of computational power to allow parallel processing of independent computational jobs. For more information, see this Wikipedia entry.
- http¶
Hypertext Transfer Protocol: A protocol for file transfer over a network.
- https¶
Hypertext Transfer Protocol Secure: A protocol for encrypted file transfer over a network.
- logging¶
Automatically recording what a software process does, for example in order to gain insights into errors. To learn about logging to troubleshoot problems, or to reduce or increase the amount of information printed to your terminal during the execution of a DataLad command, take a look at the section Logging.
- log level¶
Adjusts the amount of verbosity during logging.
- Makefile¶
Makefiles are recipes on how to create a digital object for the build automation tool Make. They are used to build programs, but also to manage projects where some files must be automatically updated from others whenever the others change. An example of a Makefile is shown in the usecase Writing a reproducible paper.
- manpage¶
Abbreviation of “manual page”. For most Unix programs, the command man <program-name> will open a pager with this command’s documentation. If you have installed DataLad as a Debian package, man will allow you to open DataLad manpages in your terminal.
- master¶
Git concept: For the longest time, master was the name of the default branch in a dataset. More recently, the name main is used. If you are not sure, you can find out whether your default branch is master or main by listing your branches with git branch.
- merge¶
Git concept: To integrate the changes of one branch/sibling/… into a different branch.
- merge request¶
See pull request.
- metadata¶
“Data about data”: Information about one or more aspects of data used to summarize basic information, for example the means of creation of the data, creator or author, size, or purpose of the data. For example, a digital image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, the shutter speed, and other data.
- nano¶
A common text editor.
- object-tree¶
git-annex concept: The place where git-annex stores available file contents. Files that are annexed get a symlink added to Git that points to the file content. A different word for annex.
- Open Science Framework (OSF)¶
An open source software project that facilitates open collaboration in scientific research.
- pager¶
A terminal pager is a program to view file contents in the terminal. Popular examples are the programs less and more. Some terminal output can be opened automatically in a pager, for example the output of a git log command. You can use the arrow keys to navigate and scroll in the pager, and the letter q to exit it.
- permissions¶
Access rights assigned by most file systems that determine whether a user can view (read permission), change (write permission), or execute (execute permission) a specific content.
read permissions grant the ability to read a file, or the contents (file names) in a directory.
write permissions grant the ability to modify a file. When content is stored in the object-tree by git-annex, your previously granted write permission for this content is revoked to prevent accidental modifications.
execute permissions grant the ability to execute a file. Any script that should be an executable needs to get such permission.
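The following Python sketch shows what revoking write permission looks like on a POSIX file system. It only mirrors the idea described above on a temporary file; it is not git-annex's actual implementation, and the file name and the helper describe are invented for the example.

```python
import os
import stat
import tempfile


def describe(path):
    """Return the permission bits of a file in ls-style notation."""
    return stat.filemode(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file
    with open(path, "w") as handle:
        handle.write("some content\n")

    # Owner may read and write, everyone else may only read.
    os.chmod(path, 0o644)
    before = describe(path)

    # Revoke write permission for everyone, leaving the file read-only.
    os.chmod(path, 0o444)
    after = describe(path)

    # Restore write permission so the temporary file can be cleaned up.
    os.chmod(path, 0o644)

print(before, "->", after)
```

On a POSIX system, the leading characters of the printed strings show the r (read), w (write), and x (execute) bits for the owner, the group, and all other users.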
- pip¶
A Python package manager. Short for “Pip installs Python”. pip install <package name> searches the Python package index PyPi for a package and installs it while resolving any potential dependencies.
- pipe¶
Unix concept: A mechanism for providing the output of one command (stdout) as the input of a next command (stdin) in a Unix terminal. The standard syntax is multiple commands, separated by vertical bars (the “pipes”, “|”). Read more on Wikipedia.
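What the shell does for "producer | consumer" can be sketched in Python with the standard subprocess module. Both commands here are small Python one-liners chosen for portability; they are stand-ins for arbitrary programs, not commands taken from the handbook.

```python
import subprocess
import sys

# First command: prints three lines to its stdout.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('banana'); print('apple'); print('cherry')"],
    stdout=subprocess.PIPE,
)

# Second command: reads lines from its stdin and prints them sorted.
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"],
    stdin=producer.stdout,  # the pipe: stdout of the first is stdin of the second
    stdout=subprocess.PIPE,
    text=True,
)

# Close our copy of the pipe so only the consumer holds the read end.
producer.stdout.close()
output, _ = consumer.communicate()
producer.wait()

sorted_lines = output.splitlines()
print(sorted_lines)
```

Neither program knows about the other; each just reads stdin and writes stdout, which is what makes pipes composable.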
- provenance¶
A record that describes entities and processes that were involved in producing or influencing a digital resource. It provides a critical foundation for assessing authenticity, enables trust, and allows reproducibility.
- publication dependency¶
DataLad concept: An existing sibling is linked to a new sibling so that the existing sibling is always published prior to the new sibling. The existing sibling could be a special remote to publish file contents stored in the dataset annex automatically with every datalad push to the new sibling. Publication dependencies can be set with the option publish-depends in the commands datalad siblings, datalad create-sibling, and datalad create-sibling-github/gitlab.
- pull request¶
Also known as merge request. Contributions to Git repositories/DataLad datasets can be proposed to be merged into the dataset by “requesting a pull/update” from the dataset maintainer to obtain a proposed change from a dataset clone or sibling. It is implemented as a feature in repository hosting sites such as GitHub, Gin, or GitLab.
- ref¶
Git concept: A “Git Reference”, typically shortened to “ref”, is a text file containing a commit shasum as a human-readable reference to a specific version of your dataset or Git repository. Thanks to refs, Git users do not need to memorize or type shasums when switching between dataset states, and can use simple names instead: For example, a branch such as main is a ref, and a tag is one, too. In both cases, those refs are text files that contain the shasum of the commit at the tip of a branch, or the shasum of the commit you added the tag to. Refs are organized in the directory .git/refs, and Git commands and configurations can use refs to perform updating operations or determine their behavior. More details can be found at git-scm.com.
- relative path¶
A path relative to the present working directory. Relative paths never start with /, for example ../Pictures/xkcd-webcomics/530.png. See also absolute path.
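The difference between the two kinds of path can be checked programmatically. This Python sketch uses the example paths from this glossary; the working directory /home/user/Documents is an assumption made up for the illustration.

```python
import posixpath
from pathlib import PurePosixPath

absolute = PurePosixPath("/home/user/Pictures/xkcd-webcomics/530.png")
relative = PurePosixPath("../Pictures/xkcd-webcomics/530.png")

print(absolute.is_absolute())
print(relative.is_absolute())

# A relative path only identifies a file together with a working directory.
# /home/user/Documents is a hypothetical working directory.
cwd = PurePosixPath("/home/user/Documents")

# Join both and collapse the ".." component purely lexically
# (symlinks on a real file system are not taken into account here).
resolved = PurePosixPath(posixpath.normpath(cwd / relative))
print(resolved)
```

With a different working directory, the same relative path would point to a different location, whereas the absolute path always identifies the same file.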
- remote¶
Git terminology: A repository (and thus also DataLad dataset) that a given repository tracks. A sibling is DataLad’s equivalent to a remote.
- Remote Indexed Archive (RIA) store¶
A Remote Indexed Archive (RIA) Store is a flexible and scalable dataset storage solution, useful for collaborative, back-up, or storage workflows. Read more about RIA stores in the section Remote Indexed Archives for dataset storage and backup.
- run procedure¶
DataLad concept: An executable (such as a script) that can be called with the datalad run-procedure command and performs modifications or routine tasks in datasets. Procedures can be written by users, or come with DataLad and its extensions. Find out more in the section Configurations to go.
- run record¶
A command summary of a datalad run command, generated by DataLad and included in the commit message.
- sed¶
A Unix stream editor to parse and transform text. Find out more here and in its documentation.
- shasum¶
A hexadecimal number, 40 digits long, that is produced by a secure hash algorithm, and is used by Git to identify commits. A shasum is a type of checksum.
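As a rough illustration of where such a 40-digit number comes from, the Python sketch below computes the identifier Git assigns to a file's content (a “blob” object) in a repository that uses SHA-1. Commits are hashed in the same manner over their own content with a different header. The function name git_blob_shasum is made up for this example.

```python
import hashlib


def git_blob_shasum(content: bytes) -> str:
    """Return the SHA-1 based identifier Git would assign to this content
    when it is stored as a blob object."""
    # Git prepends a small header: object type, content size, and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty = git_blob_shasum(b"")
print(empty, len(empty))
```

Because the shasum is derived from the content itself, any change to the content results in a different shasum.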
- shebang¶
The characters #! at the very top of a script. One can specify the interpreter (i.e., the software that executes a script of yours, such as Python) after it, such as in #! /usr/bin/python. If the script has executable permissions, it is henceforth able to call the interpreter itself. Instead of python code/myscript.py one can just run code/myscript.py if myscript has executable permissions and a correctly specified shebang.
- shell¶
A command line language and programming language. See also terminal.
- special remote¶
git-annex concept: A protocol that defines the underlying transport of annexed files to and from places that are not Git repositories (e.g., a cloud service or external machines such as HPC systems).
- squash¶
Git concept: Squashing is a Git operation that rewrites history by taking a range of commits and squashing them into a single commit. For more information on rewriting Git history, check out the section Back and forth in time and the documentation.
- SSH¶
Secure shell (SSH) is a network protocol to link one machine (computer), the client, to a different local or remote machine, the server. See also: SSH server.
- SSH key¶
An SSH key is an access credential in the SSH protocol that can be used to login from one system to remote servers and services, such as from your private computer to an SSH server, without supplying your username or password at each visit. To use an SSH key for authentication, you need to generate a key pair on the system you would like to use to access a remote system or service (most likely, your computer). The pair consists of a private and a public key. The public key is shared with the remote server, and the private key is used to authenticate your machine whenever you want to access the remote server or service. Services such as GitHub, GitLab, and GIN use SSH keys and the SSH protocol to ease access to repositories. This tutorial by GitHub is a detailed step-by-step instruction to generate and use SSH keys for authentication.
- SSH server¶
A remote or local computer that users can log into using the SSH protocol.
- stdin¶
Unix concept: One of the three standard input/output streams in programming. Standard input (stdin) is a stream from which a program reads its input data.
- stderr¶
Unix concept: One of the three standard input/output streams in programming. Standard error (stderr) is a stream to which a program outputs error messages, independent from standard output.
- stdout¶
Unix concept: One of the three standard input/output streams in programming. Standard output (stdout) is a stream to which a program writes its output data.
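The three streams can be observed separately from Python. In this sketch, a small child program (a Python one-liner invented for the example) reads text from stdin, writes a result to stdout, and writes a status message to stderr; the calling code captures each stream on its own.

```python
import subprocess
import sys

# Hypothetical child program: upper-cases its input and reports
# the number of characters it read on the error stream.
child_code = (
    "import sys; data = sys.stdin.read(); "
    "print(data.upper(), end=''); "
    "print('read %d characters' % len(data), file=sys.stderr)"
)

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    input="hello\n",      # sent to the child's stdin
    capture_output=True,  # collect stdout and stderr independently
    text=True,
    check=True,
)

print("stdout:", completed.stdout.strip())
print("stderr:", completed.stderr.strip())
```

Keeping messages on stderr means that the data on stdout stays clean and can be passed on to another program.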
- symlink¶
A symbolic link (also symlink or soft link) is a reference to another file or path in the form of a relative or absolute path. Windows users are familiar with a similar concept: shortcuts.
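A short Python sketch of how a symlink behaves, assuming a file system that supports symlinks (for example on Linux or macOS). The file names are invented, and the layout is far simpler than a real annex.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A regular file holding the actual content (hypothetical name).
    target = os.path.join(tmp, "content.txt")
    with open(target, "w") as handle:
        handle.write("file content\n")

    # A symlink that refers to the file via a relative path.
    link = os.path.join(tmp, "link.txt")
    os.symlink("content.txt", link)

    is_link = os.path.islink(link)     # the link itself is not a regular file
    stored_target = os.readlink(link)  # the path stored inside the link
    with open(link) as handle:         # opening the link opens the target
        via_link = handle.read()

print(is_link, stored_target, via_link.strip())
```

The link stores only a path; if the target is moved or deleted, the link remains but no longer resolves to any content.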
- sibling¶
DataLad concept: A dataset clone that a given DataLad dataset knows about. Changes can be retrieved and pushed between a dataset and its sibling. It is the equivalent of a remote in Git.
- Singularity¶
Singularity is a containerization software that can package software into software containers. It is a useful alternative to Docker as it can run on shared computational infrastructure. Find out more on Wikipedia.
- Singularity-Hub¶
singularity-hub.org is a Singularity container portal. Among other things, it hosts and builds Singularity container images. You can pull container images built from a publicly shared container recipe from it.
- software container¶
Computational containers are cut-down virtual machines that allow you to package software libraries and their dependencies in precise versions into a bundle that can be shared with others. They are running instances of a container image. On your own and others’ machines, the container constitutes a secluded software environment that contains the exact software environment that you specified but does not affect any software outside of the container. Unlike virtual machines, software containers do not have their own operating system and instead use basic services of the underlying operating system of the computer they run on (in a read-only fashion). This makes them lightweight and portable. By sharing software environments with containers, such as Docker or Singularity containers, others (and also yourself) have easy access to software without the need to modify the software environment of the machine the container runs on.
- submodule¶
Git concept: a submodule is a Git repository embedded inside another Git repository. A DataLad subdataset is known as a submodule in the Git config file.
- tab completion¶
Also known as command-line completion. A common shell feature in which the program automatically fills in partially typed commands upon pressing the Tab key.
- tag¶
Git concept: A mark on a commit that can help to identify commits. You can attach a tag with a name of your choice to any commit by supplying the --version-tag <TAG-NAME> option to datalad save.
- the DataLad superdataset ///¶
DataLad provides unified access to a large amount of data at an open data collection found at datasets.datalad.org. This collection is known as “The DataLad superdataset” and under its shortcut, ///. You can install the superdataset, and subsequently query its content via metadata search, by running datalad clone ///.
- tig¶
A text-mode interface for Git that allows you to easily browse through your commit history. It is not part of Git and needs to be installed. Find out more here.
- terminal¶
The terminal (sometimes also called a shell, console, or CLI) is an interactive, text-based interface that allows you to access your computer’s functionality. The most common command-line shells use bash or c-shell. You can get a short intro to the terminal and useful commands in the section General prerequisites.
- Ubuntu¶
A common Linux distribution. More information here.
- UUID¶
Universally Unique Identifier. It is a character string used for unambiguous identification, formatted according to a specific standard. This identification is not only unambiguous and unique on a system, but indeed universally unique: no UUID exists twice anywhere on the planet. Every DataLad dataset has a UUID, called the dataset ID, that identifies the dataset uniquely as a whole across its entire history and flavors. It looks similar to this: 0828ac72-f7c8-11e9-917f-a81e84238a11. This dataset ID will only exist once, identifying only one particular dataset on the planet. Note that this does not require all UUIDs to be known in some central database. The fact that no UUID exists twice is achieved by mere probability: The chance of a UUID being duplicated is so close to zero that it is negligible.
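Python's standard uuid module can parse and generate such identifiers, which makes the format easy to explore. The sketch below parses the example ID from this entry and creates two fresh random UUIDs; it illustrates the UUID standard in general and makes no claim about how DataLad itself generates dataset IDs.

```python
import uuid

# The example dataset ID from the glossary entry above.
example_id = uuid.UUID("0828ac72-f7c8-11e9-917f-a81e84238a11")

# The standard encodes a version number inside the identifier.
print(example_id.version)

# Generating random (version 4) UUIDs needs no central registry;
# uniqueness rests on the size of the random space.
first = uuid.uuid4()
second = uuid.uuid4()
print(first, second, first != second)
```

The textual form is always 36 characters long: 32 hexadecimal digits in five groups separated by hyphens.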
- version control¶
Processes and tools to keep track of changes to documents or other collections of information.
- vim¶
A text editor, often the default in UNIX operating systems. If you are not used to using it, but ended up in it accidentally: press ESC, type :q! and press Enter to exit without saving. Here is help: A vim tutorial and how to configure the default editor for git.
- virtual environment¶
A specific Python installation with packages of your choice, kept in a self-contained directory tree, and not interfering with the system-wide installations. Virtual environments are an easy solution to create several different Python environments and come in handy if you want to have a cleanly structured software setup and several applications with software requirements that would conflict with each other in a single system: You can have one virtual environment with package A in version X, and a second one with package A in version Y. There are several tools that create virtual environments, such as the built-in venv module, the virtualenv package, or conda. Virtual environments are lightweight and you can switch between them fast.
- WSL¶
The Windows Subsystem for Linux, a compatibility layer for running Linux distributions on recent versions of Windows. Find out more here.
- zsh¶
A Unix shell.