1.5. Configure custom data access

DataLad can download files via the http, https, ftp, and s3 protocol from various data storage solutions via its downloading commands (datalad download-url, datalad addurls, datalad get). If data retrieval from a storage solution requires authentication, for example via a username and password combination, DataLad provides an interface to query, request, and store the most common type of credentials that are necessary to authenticate, for a range of authentication types. There are a number of natively supported types of authentication and out-of-the box access to a broad range of access providers, from common solutions such as S3 to special purpose solutions, such as LORIS. However, beyond natively supported services, custom data access can be configured as long as the required authentication and credential type are supported. In addition, starting with DataLad version 0.16, authentication can be “outsourced” to Git’s credential helper and vice versa. This makes DataLad even more flexible for retrieving data, and can allow tools like Git to use DataLad’s credentials as well.

1.5.1. Basic process

For any supported access type that requires authentication, the procedure is always the same: Upon first access via any downloading command, users will be prompted for their credentials from the command line. Subsequent downloads handle authentication in the background as long as the credentials stay valid. An example of this credential management is shown in the usecase Scaling up: Managing 80TB and 15 million files from the HCP release: Data is stored in S3 buckets that require authentication with AWS credentials. The first datalad get to retrieve any of the data will prompt for the credentials from the terminal. If the given credentials are valid, the requested data will be downloaded, and all subsequent retrievals via get will authenticate automatically, without user input, as long as the entered credentials stay valid.

How does the authentication work?

Passwords, user names, tokens, or any other login information is stored in your system’s (encrypted) keyring. It is a built-in credential store, used in all major operating systems, and can store credentials securely. DataLad uses the Python keyring package to access the keyring. In addition to a standard interface to the keyring, this library also has useful special purpose backends that come in handy in corner cases such as HPC/cluster computing, where no interactive sessions are available.

If a particular storage solution requires authentication but it is not known to DataLad yet, the download will fail. Here is how this looks like if data is retrieved from a server that requires HTTP authentication, but DataLad – or the dataset – lacks a configuration for data access about this server:

$ datalad download-url  \
  https://example.com/myuser/protected/path/to/file
  [INFO   ] Downloading 'https://example.com/myuser/protected/path/to/file' into 'local/path/'
  Authenticated access to https://example.com/myuser/protected/path/to/file has failed.
  Would you like to setup a new provider configuration to access url? (choices: [yes], no): yes

However, data access can be configured by the user if the required authentication and credential type are supported by DataLad (a list is given in the hidden section below). With a data access configuration in place, commands such as datalad download-url or datalad addurls can work with urls the point to the location of the data to be retrieved, and datalad get is enabled to retrieve file contents from these sources.

The configuration can either be done in the terminal upon a prompt from the command line when a download fails due to a missing provider configuration as shown above, or by placing a configuration file for the required data access into .datalad/providers/<provider-name>.cfg. The following information is needed:

  • An arbitrary name that the data access is identified with,

  • a regular expression that can match a url one would want to download from,

  • an authentication type, and

  • a credential type.

The example below sheds some light one this.

Which authentication and credential types are possible?

When configuring custom data access, credential and authentication type are required information. Below, we list the most common choices for these fields.

Among the most common credential types, 'user_password', 'aws-s3', and 'token' authentication is supported. For a full list, including some less common authentication types, please see the technical documentation of DataLad.

For authentication, the most common supported solutions are 'html_form', 'http_auth' ( http and html form-based authentication), 'http_basic_auth' (http basic access), 'http_digest_auth' ( digest access authentication), 'bearer_token' (http bearer token authentication) and 'aws-s3'. A full list can be found in the technical docs.

1.5.2. Example: Data access to a server that requires basic HTTP authentication

Consider a private Apache web server with an .htaccess file that configures a range of allowed users to access a certain protected directory on this server via basic HTTP authentication. If opened in a browser, such a setup would prompt visitors of this directory on the web server for their username and password, and only grant access if valid credentials are entered. Unauthenticated requests cause 401 Unauthorized Status responses.

By default, when DataLad attempts to retrieve files from this protected directory, the authentication and credential type that are required are unknown to DataLad and authentication fails. An attempt to download or get a file from this directory with DataLad can only succeed if a “provider configuration”, i.e., a configuration how to access the data, for this specific web server with information on how to authenticate exists.

“Provider configurations” are small text files that either exist on a per-dataset level in .datalad/providers/<name>.cfg, or on a user-level in ~/.config/datalad/providers/<name>.cfg. They can be created and saved by hand, or configured “on the fly” from the command line upon unsuccessful download attempts. A configuration file follows a similar structure as the example below:

[provider:my-webserver]
url_re = https://example.com/~myuser/protected/.*
credential = my-webserver
authentication_type = http_basic_auth

[credential:my-webserver]
type = user_password

For a local1, i.e., dataset-specific, configuration, place the file into .datalad/providers/my-webserver.cfg. Subsequently, in the dataset that this file was placed into, downloading commands that point to https://example.com/~myuser/protected/<path> will ask (once) for the user’s user name and password, and subsequently store these credentials. In order to make it a global configuration, i.e., enable downloads from the web server from within all datasets of the user, place the file into the users home directory under ~/.config/datalad/providers/my-webserver.cfg.

If the file is generated “on the fly” from the terminal, it will prompt for exactly the same information as specified in the example above and write the required .cfg based on the given information. Note that this will configure data access globally, i.e., it will place the file under ~/.config/datalad/providers/<name>.cfg. Here is how that would look like:

$ datalad download-url  https://example.com/~myuser/protected/my_protected_file
 [INFO   ] Downloading 'https://example.com/~myuser/protected/my_protected_file' into '/tmp/ds/'
 Authenticated access to https://example.com/~myuser/protected/my_protected_file has failed.
 Would you like to setup a new provider configuration to access url? (choices: [yes], no): yes

 New provider name
 Unique name to identify 'provider' for https://example.com/~myuser/protected/my_protected_file [https://example.com]:
 my-webserver

 New provider regular expression
 A (Python) regular expression to specify for which URLs this provider
 should be used [https://example\.com/\~myuser/protected/my_protected_file]:
 https://example.com/~myuser/protected/.*

 Authentication type
 What authentication type to use (choices: aws-s3, bearer_token, html_form,
 http_auth, http_basic_auth, http_digest_auth, loris-token, nda-s3, none, xnat):
 http_basic_auth

 Credential
 What type of credential should be used? (choices: aws-s3, loris-token, nda-s3,
 token, [user_password]):
 user_password

 Save provider configuration file
 Following configuration will be written to /home/me/.config/datalad/providers/my-webserver.cfg:
 # Provider configuration file created to initially access
 # https://example.com/~myuser/protected/my_protected_file

 [provider:my-webserver]
 url_re = https://example.com/~myuser/protected/.*
 authentication_type = http_basic_auth
 # Note that you might need to specify additional fields specific to the
 # authenticator.  Fow now "look into the docs/source" of <class 'datalad.downloaders.http.HTTPBasicAuthAuthenticator'>
 # http_basic_auth_
 credential = my-webserver

 [credential:my-webserver]
 # If known, specify URL or email to how/where to request credentials
 # url = ???
 type = user_password
  (choices: [yes], no):
 yes

 You need to authenticate with 'my-webserver' credentials.
 user: <user name>

 password: <password>
 password (repeat): <password>
 [INFO   ] http session: Authenticating into session for https://example.com/~myuser/protected/my_protected_file
 https://example.com/~myuser/protected/my_protected_file:   0%| | 0.00/611k
 download_url(ok): /https://example.com/~myuser/protected/my_protected_file (file)
 add(ok): my_protected_file (file)
 save(ok): . (dataset)
 action summary:
   add (ok: 1)
   download_url (ok: 1)
   save (ok: 1)

Subsequently, all downloads from https://example.com/~myuser/protected/* by the user will succeed. If something went wrong during this interactive configuration, delete or edit the file at ~/.config/datalad/providers/<name>.cfg.

1.5.3. Example: Data access via Git’s credential system

Consider a private repository on GitHub. When cloning such datasets via the https protocol, every connection needs a user name and a password in the form of a Personal Access Token.

$ git clone https://github.com/adswa/my-super-secret-work.git
  Cloning into 'my-super-secret-work'...
  Username for 'https://github.com': <user-name>
  Password for 'https://github.com': <GitHub Access Token>

Because this can be tedious, Git has a credential system that can help to store and provide the necessary configurations automatically. One of its pieces are so called credential helper, executables that ultimately store credentials for specific hosts, and will provide them automatically in place of an interactive query to the user.

This system is particularly flexible because Git allows users to create custom helpers that fit specific usecases. Here is one example: A server contains a number of DataLad datasets, but a different and changing number of users of the shared computational infrastructure has access to each one. In order to centralize and automate authentication, a system-wide Git configuration1 is employed:

$ git config --list
  credential.https://cool-dataset.ds.research-center.de.helper=/usr/local/bin/research-center_datastore_pw

This credential helper for host https://cool-dataset.ds.research-center.de points to an executable, /usr/local/bin/research-center_datastore_pw, which determines, for example by querying a password database, whether the given user has access or not. If they have, it returns the user name and password required for authentication to the Git process that tried to access the server.

Beginning with DataLad version 0.16, DataLad’s own credential management can interface with Git’s by its aforementioned mechanism of provider configurations. A basic mock example can illustrate the necessary steps to set this up.

Here is a short list of preparations if you want to try this out for yourself:

  1. Create a private repository on GitHub. This can be done via GitHub’s webinterface or the --private flag of create-sibling-github (requires DataLad version 0.16 or higher).

  2. The repository should contain a file, like a simple README.md, and can be a pure Git repository.

  3. Ensure that all tokens in Git configurations files are commented out, because those would provide authentication as well. Running git config --list can give you an overview, but you can also check that git clone <repo> with a https URL prompts for user name and password.

The challenge is to datalad download-url the file successfully. This is difficult because the repository is private and requires authentication that DataLad is yet unaware about. For fun, you can check that a download via wget from the command line also fails:

# try to download a file from a private repo with this url scheme
$ wget https://raw.githubusercontent.com/<username>/<reponame>/<branch-name>/<filename.txt>
# should return a 404

To achieve a successful download, we will create a small, custom credential helper for Git, and tell DataLad about it with a provider configuration. First, we will store the password on your system. Create a personal access token on GitHub, and, for simplicity, write it into a text file github in your home directory. Please do note that it is highly discouraged to store passwords in plain files, and only done for demonstration here.

Next, we will write a credential helper that will retrieve this password. Open your .gitconfig file in your home directory, and add the following contents to it, replacing the user name placeholder with your GitHub handle:

[credential "https://raw.githubusercontent.com"]
        username = <your-user-name-here>
        helper = "!f() { echo \"password=$(cat ~/github)\"; }; f"

This configuration will be queried by Git when a URL matches https://raw.githubusercontent.com and runs the helper, which here is a shell function that prints the string password= and the content of the file containing the token. This function is rudimentary, but does the job for this illustration.

Finally, we will teach DataLad to use on this configuration to authenticate. For this, create a new dataset, and, with your favourite editor, create a new provider configuration .datalad/providers/github.cfg in it. Depending on your editor, you will need to create the directory providers under .datalad first. This provider configuration should contain the following:

[provider:github]
  url_re = https://.*github.*\.com
  authentication_type = http_basic_auth
  credential = data_example_cred
[credential:data_example_cred]
  type = git

Importantly, the type key should specify git, the provider:<name> name should match the name of the provider configuration filename, the url_re should be a regular expression that can match the credential URL in your .gitconfig file, and the credential value should be the same string as the [credential:<credential>] name. With this setup, a datalad download-url succeeds, authenticating via the Git credential helper.

Git authenticating via DataLad’s credential system

Not only can DataLad use Git’s credential system, Git can also query credentials from DataLad. This requires DataLad version 0.16 or higher, and a Git configuration pointing to the credential helper git-credential-datalad for a given URL scheme:

[credential "https://*data.example.com"]
   helper = "datalad"

To find out more about DataLad’s integration with Git’s credential system, take a look into the more technical documentation at docs.datalad.org/credentials.html and docs.datalad.org/design/credentials.html.

Footnotes

1(1,2)

To re-read on configurations and their scope, check out chapter Tuning datasets to your needs again.