1.4. Configure custom data access

DataLad can download files via the http, https, ftp, and s3 protocols from various data storage solutions via its downloading commands (datalad download-url, datalad addurls, datalad get). If data retrieval from a storage solution requires authentication, for example via a username and password combination, DataLad provides an interface to query, request, and store the most common types of credentials that are necessary to authenticate, for a range of authentication types. There are a number of natively supported types of authentication and out-of-the-box access to a broad range of access providers, from common solutions such as S3 to special purpose solutions such as LORIS. Beyond natively supported services, custom data access can be configured as long as the required authentication and credential type are supported. This makes DataLad even more flexible for retrieving data.

1.4.1. Basic process

For any supported access type that requires authentication, the procedure is always the same: Upon first access via any downloading command, users will be prompted for their credentials from the command line. Subsequent downloads handle authentication in the background as long as the credentials stay valid. An example of this credential management is shown in the usecase Scaling up: Managing 80TB and 15 million files from the HCP release: Data is stored in S3 buckets that require authentication with AWS credentials. The first datalad get to retrieve any of the data will prompt for the credentials from the terminal. If the given credentials are valid, the requested data will be downloaded, and all subsequent retrievals via get will authenticate automatically, without user input, as long as the entered credentials stay valid.

How does the authentication work?

Login information such as passwords, user names, or tokens is stored in your system’s (encrypted) keyring. The keyring is a built-in credential store that is available in all major operating systems and can store credentials securely. DataLad uses the Python keyring package to access the keyring. In addition to a standard interface to the keyring, this library also has special purpose backends that come in handy in corner cases such as HPC/cluster computing, where no interactive sessions are available.

If a particular storage solution requires authentication but is not yet known to DataLad, the download will fail. Here is what this looks like if data is retrieved from a server that requires HTTP authentication, but DataLad (or the dataset) lacks a data access configuration for this server:

$ datalad download-url  \
  [INFO   ] Downloading 'https://example.com/myuser/protected/path/to/file' into 'local/path/'
  Authenticated access to https://example.com/myuser/protected/path/to/file has failed.
  Would you like to setup a new provider configuration to access url? (choices: [yes], no): yes

However, data access can be configured by the user if the required authentication and credential type are supported by DataLad (a list is given below). With a data access configuration in place, commands such as datalad download-url or datalad addurls can work with URLs that point to the location of the data to be retrieved, and datalad get is enabled to retrieve file contents from these sources.

The configuration can either be done in the terminal upon a prompt from the command line when a download fails due to a missing provider configuration as shown above, or by placing a configuration file for the required data access into .datalad/providers/<provider-name>.cfg. The following information is needed:

  • An arbitrary name that the data access is identified with,

  • a regular expression that can match a url one would want to download from,

  • an authentication type, and

  • a credential type.

The example below sheds some light on this.

Which authentication and credential types are possible?

When configuring custom data access, credential and authentication type are required information. Below, we list the most common choices for these fields.

The most common supported credential types are 'user_password', 'aws-s3', and 'token'. For a full list, including some less common credential types, please see the technical documentation of DataLad.

For authentication, the most common supported solutions are 'html_form', 'http_auth' (http and html form-based authentication), 'http_basic_auth' (http basic access), 'http_digest_auth' (digest access authentication), 'bearer_token' (http bearer token authentication), and 'aws-s3'. A full list can be found in the technical docs.

1.4.2. Example: Data access to a server that requires basic HTTP authentication

Consider a private Apache web server with an .htaccess file that configures a range of allowed users to access a certain protected directory on this server via basic HTTP authentication. If opened in a browser, such a setup would prompt visitors of this directory on the web server for their username and password, and only grant access if valid credentials are entered. Unauthenticated requests cause 401 Unauthorized Status responses.
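To make the mechanism concrete, here is a minimal sketch of what basic HTTP authentication sends with each request: an Authorization header containing the base64-encoded string "user:password" (RFC 7617). The function name basic_auth_header is hypothetical and used only for illustration; this is not DataLad's implementation, and the credentials are the well-known example values from the RFC.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for Basic auth (RFC 7617)."""
    # The scheme joins user name and password with a colon and base64-encodes the result.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Example credentials taken from the RFC, for illustration only
print(basic_auth_header("Aladdin", "open sesame"))
# prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because base64 is an encoding and not encryption, basic authentication should only be used over https.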

By default, when DataLad attempts to retrieve files from this protected directory, the required authentication and credential type are unknown to DataLad, and authentication fails. An attempt to download or get a file from this directory with DataLad can only succeed if a “provider configuration”, i.e., a configuration of how to access the data, exists for this specific web server and contains information on how to authenticate.

“Provider configurations” are small text files that either exist on a per-dataset level in .datalad/providers/<name>.cfg, or on a user level in ~/.config/datalad/providers/<name>.cfg. They can be created and saved by hand, or configured “on the fly” from the command line upon unsuccessful download attempts. A configuration file has a structure similar to the example below:

[provider:my-webserver]
url_re = https://example.com/~myuser/protected/.*
credential = my-webserver
authentication_type = http_basic_auth

[credential:my-webserver]
type = user_password

For a local, i.e., dataset-specific, configuration, place the file into .datalad/providers/my-webserver.cfg. Subsequently, in the dataset that this file was placed into, downloading commands that point to https://example.com/~myuser/protected/<path> will ask (once) for the user’s user name and password, and then store these credentials. To make it a global configuration, i.e., to enable downloads from the web server from within all datasets of the user, place the file into the user’s home directory under ~/.config/datalad/providers/my-webserver.cfg.
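The two locations can also be illustrated in code. The following is a minimal sketch, not part of DataLad: the helper names provider_cfg_path and write_provider_cfg are hypothetical, and the sketch assumes the file consists of a [provider:<name>] and a [credential:<name>] section holding the four pieces of information listed earlier. It writes the file with Python's standard configparser module, either into a dataset or into the user-level directory.

```python
import configparser
import tempfile
from pathlib import Path


def provider_cfg_path(name, dataset=None, home=None):
    """Return the dataset-level path if 'dataset' is given, else the user-level path."""
    if dataset is not None:
        base = Path(dataset) / ".datalad" / "providers"
    else:
        # 'home' exists only so the sketch can be tried without touching the real home
        root = Path(home) if home is not None else Path.home()
        base = root / ".config" / "datalad" / "providers"
    return base / f"{name}.cfg"


def write_provider_cfg(name, url_re, authentication_type, credential_type,
                       dataset=None, home=None):
    """Write a provider configuration (assumed two-section layout) and return its path."""
    parser = configparser.ConfigParser(interpolation=None)
    parser[f"provider:{name}"] = {
        "url_re": url_re,
        "credential": name,
        "authentication_type": authentication_type,
    }
    parser[f"credential:{name}"] = {"type": credential_type}
    path = provider_cfg_path(name, dataset=dataset, home=home)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as fh:
        parser.write(fh)
    return path


# Demonstration in a throwaway directory that stands in for a dataset
with tempfile.TemporaryDirectory() as ds:
    cfg = write_provider_cfg(
        "my-webserver",
        "https://example.com/~myuser/protected/.*",
        "http_basic_auth",
        "user_password",
        dataset=ds,
    )
    print(cfg.relative_to(ds))
    print(cfg.read_text())
```

Calling write_provider_cfg without the dataset argument would target the user-level location instead, which corresponds to the global configuration described above.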

If the file is generated “on the fly” from the terminal, DataLad will prompt for exactly the same information as specified in the example above and write the required .cfg file based on the given information. Note that this will configure data access globally, i.e., it will place the file under ~/.config/datalad/providers/<name>.cfg. Here is what that would look like:

$ datalad download-url  https://example.com/~myuser/protected/my_protected_file
 [INFO   ] Downloading 'https://example.com/~myuser/protected/my_protected_file' into '/tmp/ds/'
 Authenticated access to https://example.com/~myuser/protected/my_protected_file has failed.
 Would you like to setup a new provider configuration to access url? (choices: [yes], no): yes

 New provider name
 Unique name to identify 'provider' for https://example.com/~myuser/protected/my_protected_file [https://example.com]:

 New provider regular expression
 A (Python) regular expression to specify for which URLs this provider
 should be used [https://example\.com/\~myuser/protected/my_protected_file]:

 Authentication type
 What authentication type to use (choices: aws-s3, bearer_token, html_form,
 http_auth, http_basic_auth, http_digest_auth, loris-token, nda-s3, none, xnat):

 What type of credential should be used? (choices: aws-s3, loris-token, nda-s3,
 token, [user_password]):

 Save provider configuration file
 Following configuration will be written to /home/me/.config/datalad/providers/my-webserver.cfg:
 # Provider configuration file created to initially access
 # https://example.com/~myuser/protected/my_protected_file

 [provider:my-webserver]
 url_re = https://example.com/~myuser/protected/.*
 authentication_type = http_basic_auth
 # Note that you might need to specify additional fields specific to the
 # authenticator.  Fow now "look into the docs/source" of <class 'datalad.downloaders.http.HTTPBasicAuthAuthenticator'>
 # http_basic_auth_
 credential = my-webserver

 [credential:my-webserver]
 # If known, specify URL or email to how/where to request credentials
 # url = ???
 type = user_password
  (choices: [yes], no):

 You need to authenticate with 'my-webserver' credentials.
 user: <user name>

 password: <password>
 password (repeat): <password>
 [INFO   ] http session: Authenticating into session for https://example.com/~myuser/protected/my_protected_file
 https://example.com/~myuser/protected/my_protected_file:   0%| | 0.00/611k
 download_url(ok): /https://example.com/~myuser/protected/my_protected_file (file)
 add(ok): my_protected_file (file)
 save(ok): . (dataset)
 action summary:
   add (ok: 1)
   download_url (ok: 1)
   save (ok: 1)

Subsequently, all downloads from https://example.com/~myuser/protected/* by the user will succeed. If something went wrong during this interactive configuration, delete or edit the file at ~/.config/datalad/providers/<name>.cfg.
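Which URLs are covered depends on the url_re expression in the configuration file. The sketch below illustrates this with Python's standard configparser and re modules. It is not DataLad's implementation: the function name matching_provider is hypothetical, the file content assumes a [provider:<name>] and a [credential:<name>] section, and the sketch assumes that the expression is applied from the start of the URL, as re.match does.

```python
import configparser
import re

# Assumed file content, mirroring the example configuration for 'my-webserver'
PROVIDER_CFG = """\
[provider:my-webserver]
url_re = https://example.com/~myuser/protected/.*
credential = my-webserver
authentication_type = http_basic_auth

[credential:my-webserver]
type = user_password
"""


def matching_provider(cfg_text, url):
    """Return the name of the first provider whose url_re matches 'url', else None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    for section in parser.sections():
        if not section.startswith("provider:"):
            continue  # skip credential sections
        if re.match(parser[section]["url_re"], url):
            return section.split(":", 1)[1]
    return None


print(matching_provider(
    PROVIDER_CFG, "https://example.com/~myuser/protected/my_protected_file"))
# prints: my-webserver
print(matching_provider(PROVIDER_CFG, "https://example.com/public/file"))
# prints: None
```

Note that an unescaped dot in a regular expression matches any character, so a stricter expression would escape the dots in the host name (example\.com).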



To read up on configurations and their scope again, see the chapter Tuning datasets to your needs.