9.2. Miscellaneous file system operations

With all of the information about symlinks and object trees, you might be reluctant to perform usual file system managing operations, such as copying, moving, renaming or deleting files or directories with annexed content.

If I renamed one of those books, would the symlink that points to the file content still be correct? What happens if I copy an annexed file? If I moved the whole books/ directory? What if I moved all of DataLad-101 into a different place on my computer? What if I renamed the whole superdataset? And how do I remove a file, or directory, or subdataset?

Therefore, there is an extra tutorial offered by the course’s TA today, and you attend. There is no better way of learning than doing. Here, in the safe space of the DataLad-101 course, you can try out all of the things you would be unsure about or reluctant to try on the dataset that contains your own, valuable data.

Below you will find common questions about file system management operations, and each question outlines caveats and solutions with code examples you can paste into your own terminal. Because these code snippets will add many commits to your dataset, we clean up within each segment with common Git operations that manipulate the dataset’s history – be sure to execute these commands as well (and be sure to be in the correct dataset).

9.2.1. Renaming files

Let’s try it. In Unix, renaming a file is exactly the same as moving a file, and uses the mv command.

$ cd books/
$ mv TLCL.pdf The_Linux_Command_Line.pdf
$ ls -lah
total 24K
drwxr-xr-x 2 adina adina 4.0K Dec 14 17:01 .
drwxr-xr-x 8 adina adina 4.0K Dec 14 17:01 ..
lrwxrwxrwx 1 adina adina  131 Jan 19  2009 bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
lrwxrwxrwx 1 adina adina  131 Dec  8  2021 byte-of-python.pdf -> ../.git/annex/objects/z1/Q8/MD5E-s4208954--ab3a8c2f6b76b18b43c5949e0661e266.pdf/MD5E-s4208954--ab3a8c2f6b76b18b43c5949e0661e266.pdf
lrwxrwxrwx 1 adina adina  133 Dec  7  2021 progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
lrwxrwxrwx 1 adina adina  131 Jan 28  2019 The_Linux_Command_Line.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf

Try to open the renamed file, e.g., with evince The_Linux_Command_Line.pdf. This works!

But let’s see what changed in the dataset with this operation:

$ datalad status
untracked: /home/me/dl-101/DataLad-101/books/The_Linux_Command_Line.pdf (symlink)
  deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (symlink)

We can see that the old file is marked as deleted, and simultaneously, an untracked file appears: the renamed PDF.

While this might appear messy, a datalad save will clean all of this up. Therefore, do not panic if you rename a file, and see a dirty dataset status with deleted and untracked files – datalad save handles these and other cases really well under the hood.

$ datalad save -m "rename the book"
delete(ok): books/TLCL.pdf (file)
add(ok): books/The_Linux_Command_Line.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  delete (ok: 1)
  save (ok: 1)

The datalad save command will identify that a file was renamed, and will summarize this nicely in the resulting commit:

$ git log -n 1 -p
commit 1cd81ab630a26762f5c9dba8ad3b7de330061eab
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:26 2022 +0100

    rename the book

diff --git a/books/TLCL.pdf b/books/The_Linux_Command_Line.pdf
similarity index 100%
rename from books/TLCL.pdf
rename to books/The_Linux_Command_Line.pdf

Note that datalad save commits all modifications when it’s called without a path specification, so any other changes will be saved in the same commit as the rename. If there are unsaved modifications you do not want to commit together with the file name change, you could give both the new and the deleted file as a path specification to datalad save, even if it feels unintuitive to save a change that is marked as a deletion in a datalad status:

datalad save -m "rename file" oldname newname

Alternatively, there is also a way to save only the name change using Git tools, outlined in the following hidden section. If you are a Git user, you will be very familiar with it.

Renaming with Git tools

Git has built-in commands that provide a solution in two steps.

If you have followed along with the previous datalad save, let’s revert the renaming of the file:

$ git reset --hard HEAD~1
$ datalad status
HEAD is now at b66e68e add container and execute analysis within container
nothing to save, working tree clean

Now let’s see how to rename a file and commit this operation using only Git. The Git-specific way to rename files is the git mv command:

$ git mv TLCL.pdf The_Linux_Command_Line.pdf
$ datalad status
    added: /home/me/dl-101/DataLad-101/books/The_Linux_Command_Line.pdf (symlink)
  deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (symlink)

We can see that the old file is still seen as “deleted”, but the “new”, renamed file is “added”. A git status displays the change in the dataset a bit more accurately:

$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	renamed:    TLCL.pdf -> The_Linux_Command_Line.pdf

Because git mv places the change directly into the staging area (the index) of Git [1], a subsequent git commit -m "rename book" will write the renaming – and only the renaming – to the dataset’s history, even if other (unstaged) modifications are present.

$ git commit -m "rename book"
[master 475c300] rename book
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename books/{TLCL.pdf => The_Linux_Command_Line.pdf} (100%)
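
The behavior described here can be tried outside of DataLad-101 with plain Git. The following is a minimal sketch in a throwaway repository; the file contents, the temporary directory, the helper function g, and the user.name/user.email values are made up for illustration and are not part of the course dataset.

```shell
# Throwaway repository; all names and contents below are illustrative only.
repo=$(mktemp -d)
git init -q "$repo"
# Small helper: run git inside the demo repository with a dummy identity
g() { git -C "$repo" -c user.name="Demo" -c user.email="demo@example.com" "$@"; }

printf 'some book content\n' > "$repo/TLCL.pdf"
printf 'first note\n' > "$repo/notes.txt"
g add TLCL.pdf notes.txt
g commit -q -m "initial state"

# An unrelated modification that is left unstaged ...
printf 'second note\n' >> "$repo/notes.txt"
# ... and a rename, which git mv stages right away
g mv TLCL.pdf The_Linux_Command_Line.pdf
g commit -q -m "rename book"

# Status letter and similarity score of what the last commit recorded
committed=$(g diff -M --name-status HEAD~1 HEAD | cut -f1)
# What is still uncommitted in the working tree
leftover=$(g status --porcelain)
echo "committed: $committed"
echo "leftover:  $leftover"
rm -rf "$repo"
```

The commit contains only the rename, while the edit to notes.txt is still listed as an unstaged modification afterwards.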

To summarize, renaming files is easy and worry-free. Do not be intimidated by a file marked as deleted – a datalad save will rectify this. Be mindful of other modifications in your dataset, though, and either supply appropriate paths to datalad save, or use Git tools to exclusively save the name change and nothing else.

Let’s revert this now, to have a clean history.

$ git reset --hard HEAD~1
$ datalad status
HEAD is now at b66e68e add container and execute analysis within container
nothing to save, working tree clean

9.2.2. Moving files from or into subdirectories

Let’s move an annexed file from within books/ into the root of the superdataset:

$ mv TLCL.pdf ../TLCL.pdf
$ datalad status
untracked: /home/me/dl-101/DataLad-101/TLCL.pdf (symlink)
  deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (symlink)

In general, this looks exactly like renaming or moving a file in the same directory. There is a subtle difference though: Currently, the symlink of the annexed file is broken. There are two ways to demonstrate this. One is trying to open the file – this will currently fail. The second way is to look at the symlink:

$ cd ../
$ ls -l TLCL.pdf
lrwxrwxrwx 1 adina adina 131 Dec 14 17:01 TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf

The first part of the symlink should point into the .git/ directory, but currently it does not: the symlink still looks as if TLCL.pdf were within books/. Instead of pointing into .git, it points to ../.git, which does not exist and would even lie outside of the superdataset. This is why the file cannot be opened: when a program tries to follow the symlink, it will not resolve, and an error such as “No such file or directory” will be returned. But do not panic! A datalad save will rectify this as well:

$ datalad save -m "moved book into root"
$ ls -l TLCL.pdf
delete(ok): books/TLCL.pdf (file)
add(ok): TLCL.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  delete (ok: 1)
  save (ok: 1)
lrwxrwxrwx 1 adina adina 128 Dec 14 17:01 TLCL.pdf -> .git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf

After a datalad save, the symlink is fixed again. In general, whenever you move or rename a file, especially between directories, a datalad save is the best option to turn to. So while it might be startling if you have moved a file and cannot open it directly afterwards, datalad save will rectify this.
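
Why the moved link stops resolving can be reproduced with nothing but directories and a relative symlink. This is a minimal sketch under assumptions: the directory layout only mimics a dataset, KEY is a made-up object name, and the repair at the end is done by hand rather than by datalad save.

```shell
# Stand-in for a dataset, built from plain directories; KEY is a made-up object name.
top=$(mktemp -d)
root="$top/DataLad-101"
mkdir -p "$root/.git/annex/objects" "$root/books"
printf 'book content\n' > "$root/.git/annex/objects/KEY"

# From inside books/, the link has to climb one level to reach .git/
ln -s ../.git/annex/objects/KEY "$root/books/TLCL.pdf"

# mv moves the link itself; the stored target string stays the same
mv "$root/books/TLCL.pdf" "$root/TLCL.pdf"
target_after_move=$(readlink "$root/TLCL.pdf")

# [ -e ] follows the link, so it reports whether the target can be reached
if [ -e "$root/TLCL.pdf" ]; then after_move=resolves; else after_move=broken; fi

# Re-create the link relative to its new location (done by hand here)
ln -sf .git/annex/objects/KEY "$root/TLCL.pdf"
if [ -e "$root/TLCL.pdf" ]; then after_fix=resolves; else after_fix=broken; fi

echo "target after move: $target_after_move"
echo "after move: $after_move, after fix: $after_fix"
rm -rf "$top"
```

The target string is stored in the link and interpreted relative to the link's own directory, which is why moving the link one level up makes it point one level too high.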

Why a move between directories is actually a content change

Let’s see how this shows up in the dataset history:

$ git log -n 1 -p
commit 0d615ccd59c51768b34c0a7a44aa320f18efe24f
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:27 2022 +0100

    moved book into root

diff --git a/TLCL.pdf b/TLCL.pdf
new file mode 120000
index 0000000..34328e2
--- /dev/null
+++ b/TLCL.pdf
@@ -0,0 +1 @@
+.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
diff --git a/books/TLCL.pdf b/books/TLCL.pdf
deleted file mode 120000
index 4c84b61..0000000
--- a/books/TLCL.pdf
+++ /dev/null
@@ -1 +0,0 @@
-../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file

As you can see, this action does not show up as a move, but instead a deletion and addition of a new file. Why? Because the content that is tracked is the actual symlink, and due to the change in relative location, the symlink needed to change. Hence, what looks and feels like a move on the file system for you is actually a move plus a content change for Git.
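
Git stores a symlink as a small blob whose content is the link's target string, so a different target string means a different blob. A minimal sketch of this with git hash-object, using the two target strings from the diff above; it only computes hashes and writes nothing, and the variable names are chosen for illustration.

```shell
# The annex key from the example above, used twice in each target path
key='MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf'
old_target="../.git/annex/objects/jf/3M/$key/$key"   # link as stored in books/
new_target=".git/annex/objects/jf/3M/$key/$key"      # link as stored in the root

# printf '%s' avoids a trailing newline, matching "No newline at end of file"
old_blob=$(printf '%s' "$old_target" | git hash-object --stdin)
new_blob=$(printf '%s' "$new_target" | git hash-object --stdin)

echo "old: $old_blob"
echo "new: $new_blob"
```

Because the two hashes differ, Git has no identical content to match up and records a deletion plus an addition instead of a rename.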

git annex fix

A datalad save command internally uses a git commit to save changes to a dataset. git commit in turn triggers a git annex fix command. This git-annex command fixes up links that have become broken to again point to annexed content, and is responsible for cleaning up what needs to be cleaned up. Thanks, git-annex!

Finally, let’s clean up:

$ git reset --hard HEAD~1
HEAD is now at b66e68e add container and execute analysis within container

9.2.3. Moving files across dataset boundaries

Generally speaking, moving files across dataset hierarchies is not advised. While DataLad blurs the dataset boundaries to ease working in nested datasets, the dataset boundaries do still exist. If you move a file from one subdataset into another, or up or down a dataset hierarchy, you will move it out of the version control it was in (i.e., from one .git directory into a different one). From the perspective of the first subdataset, the file will be deleted, and from the perspective of the receiving dataset, the file will be added to the dataset, but straight out of nowhere, with none of its potential history from its original dataset attached to it. Before moving a file, consider whether copying it (outlined in the section Copying files across dataset boundaries) might be a more suitable alternative.

If you are willing to sacrifice [2] the file’s history and move it to a different dataset, the procedure differs between annexed files and files stored in Git.

For files that Git manages, moving and saving is simple: Move the file, and save the resulting changes in both affected datasets (this can be done with a recursive save from a top-level dataset, though).

$ mv notes.txt midterm_project/notes.txt
$ datalad status -r
 modified: midterm_project (dataset)
untracked: midterm_project/notes.txt (file)
  deleted: notes.txt (file)
$ datalad save -r -m "moved notes.txt from root of top-ds to midterm subds"
add(ok): notes.txt (file)
save(ok): midterm_project (dataset)
delete(ok): notes.txt (file)
add(ok): midterm_project (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  delete (ok: 1)
  save (notneeded: 2, ok: 2)

Note how the history of notes.txt does not exist in the subdataset – it appears as if the file was generated at once, instead of successively over the course:

$ cd midterm_project
$ git log notes.txt
commit 27280681cae72f7820db61d06ac76c63c147edfc
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:28 2022 +0100

    moved notes.txt from root of top-ds to midterm subds

Undoing this requires a git reset in both datasets:

# in midterm_project
$ git reset --hard HEAD~

# in DataLad-101
$ cd ../
$ git reset --hard HEAD~
HEAD is now at 263e285 [DATALAD RUNCMD] rerun analysis in container
HEAD is now at b66e68e add container and execute analysis within container

The process is a bit more complex for annexed files. Let’s do it wrong, first: What happens if we move an annexed file in the same way as notes.txt?

$ mv books/TLCL.pdf midterm_project
$ datalad status -r
  deleted: books/TLCL.pdf (symlink)
 modified: midterm_project (dataset)
untracked: midterm_project/TLCL.pdf (symlink)
$ datalad save -r -m "move annexed file around"
add(ok): TLCL.pdf (file) [  TLCL.pdf is a git-annex symlink. Its content is not available in this repository. (Maybe TLCL.pdf was copied from another repository?)]
save(ok): midterm_project (dataset)
delete(ok): books/TLCL.pdf (file)
add(ok): midterm_project (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  delete (ok: 1)
  save (notneeded: 2, ok: 2)

At this point, this does not look that different from the result of moving notes.txt. Note, though, that the deleted and untracked PDFs are symlinks – and therein lies the problem: What was moved was not the file content (which is still in the annex of the top-level dataset, DataLad-101), but its symlink that was stored in Git. After moving the file, the symlink is broken, and git-annex has no way of finding out where the file content could be:

$ cd midterm_project
$ git annex whereis TLCL.pdf
whereis TLCL.pdf (0 copies) failed
whereis: 1 failed

Let’s rewind, and find out how to do it correctly:

$ git reset --hard HEAD~
$ cd ../
$ git reset --hard HEAD~
HEAD is now at 263e285 [DATALAD RUNCMD] rerun analysis in container
HEAD is now at b66e68e add container and execute analysis within container

The crucial step to remember is to get the annexed file out of the annex prior to moving it. For this, we need to fall back to git-annex commands:

$ git annex unlock books/TLCL.pdf
$ mv books/TLCL.pdf midterm_project
$ datalad status -r
unlock books/TLCL.pdf ok
(recording state in git...)
  deleted: books/TLCL.pdf (file)
 modified: midterm_project (dataset)
untracked: midterm_project/TLCL.pdf (file)

Afterwards, a (recursive) save commits the removal of the book from DataLad-101, and adds the file content into the annex of midterm_project:

$ datalad save -r -m "move book into midterm_project"
add(ok): TLCL.pdf (file)
save(ok): midterm_project (dataset)
delete(ok): books/TLCL.pdf (file)
add(ok): midterm_project (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  delete (ok: 1)
  save (notneeded: 2, ok: 2)

Even though you did split the file’s history, at least its content is in the correct dataset now:

$ cd midterm_project
$ git annex whereis TLCL.pdf
whereis TLCL.pdf (1 copy)
	f99ce3a1-b0d4-4d88-b0eb-068589f19a5b -- me@muninn:~/dl-101/DataLad-101/midterm_project [here]
ok

But more than showing you how it can be done, if necessary, this paragraph hopefully convinced you that moving files across dataset boundaries is not convenient. Not only can it be a confusing and potentially “file-content-losing”-dangerous process, it also dissociates a file from its provenance that is captured in its previous dataset, with no machine-readable way to learn about the move easily. A better alternative may be copying files with the datalad copy-file command, introduced in detail in Subsample datasets using datalad copy-file and demonstrated in the section Copying files across dataset boundaries. Let’s quickly clean up by moving the file back:

# in midterm_project
$ git annex unannex TLCL.pdf
unannex TLCL.pdf ok
(recording state in git...)
$ mv TLCL.pdf ../books
$ cd ../
$ datalad save -r -m "move book back from midterm_project"
save(ok): midterm_project (dataset)
add(ok): midterm_project (file)
add(ok): books/TLCL.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (notneeded: 2, ok: 2)

9.2.4. Copying files

Let’s create a copy of an annexed file, using the Unix command cp to copy.

$ cp books/TLCL.pdf copyofTLCL.pdf
$ datalad status
untracked: copyofTLCL.pdf (file)

That’s expected. The copy shows up as a new, untracked file. Let’s save it:

$ datalad save -m "add copy of TLCL.pdf"
add(ok): copyofTLCL.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
$ git log -n 1 -p
commit f485657900d8412a5d162a1c36e1fe91dee5f5ca
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:33 2022 +0100

    add copy of TLCL.pdf

diff --git a/copyofTLCL.pdf b/copyofTLCL.pdf
new file mode 120000
index 0000000..34328e2
--- /dev/null
+++ b/copyofTLCL.pdf
@@ -0,0 +1 @@
+.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file

That’s it.

Symlinks!

If you have read the additional content in the section Data integrity, you know that the same file content is only stored once, and copies of the same file point to the same location in the object tree.

Let’s check that out:

$ ls -l copyofTLCL.pdf
$ ls -l books/TLCL.pdf
lrwxrwxrwx 1 adina adina 128 Dec 14 17:01 copyofTLCL.pdf -> .git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
lrwxrwxrwx 1 adina adina 131 Jan 28  2019 books/TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf

Indeed! Apart from their relative location (.git versus ../.git) their symlinks are identical. Thus, even though two copies of the book exist in your dataset, your disk needs to store it only once.

In most cases, this is just an interesting fun-fact, but beware when dropping content with datalad drop (Removing annexed content entirely): If you drop the content of one copy of a file, all other copies will lose this content as well.
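
The consequence for dropping can be illustrated without git-annex. In this minimal sketch, plain directories mimic the dataset, KEY is a made-up object name, and deleting the single object file stands in for dropping its content; the -ef test is supported by common shells such as bash and dash.

```shell
# Stand-in for a dataset with one annexed object and two links to it
top=$(mktemp -d)
root="$top/DataLad-101"
mkdir -p "$root/.git/annex/objects" "$root/books"
printf 'book content\n' > "$root/.git/annex/objects/KEY"
ln -s ../.git/annex/objects/KEY "$root/books/TLCL.pdf"
ln -s .git/annex/objects/KEY "$root/copyofTLCL.pdf"

# -ef is true when both paths resolve to the same file on disk
if [ "$root/books/TLCL.pdf" -ef "$root/copyofTLCL.pdf" ]; then same_file=yes; else same_file=no; fi

# Removing the one object stands in for dropping the content
rm "$root/.git/annex/objects/KEY"
if [ -e "$root/books/TLCL.pdf" ]; then original=present; else original=missing; fi
if [ -e "$root/copyofTLCL.pdf" ]; then copy=present; else copy=missing; fi

echo "same file: $same_file, original: $original, copy: $copy"
rm -rf "$top"
```

Both links lose their content at once, because there was only ever one object behind them.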

Finally, let’s clean up:

$ git reset --hard HEAD~1
HEAD is now at d48d7a8 move book back from midterm_project

9.2.5. Copying files across dataset boundaries

copy-file availability

datalad copy-file requires DataLad version 0.13.0 or higher.

Instead of moving files across dataset boundaries, copying them is an easier and – beginning with DataLad version 0.13.0 – actually supported method. The DataLad command that can be used for this is datalad copy-file (datalad-copy-file manual). This command allows you to copy files (from any dataset or non-dataset location, annexed or not annexed) into a dataset. If the file is copied from a dataset and is annexed, its availability metadata is added to the new dataset as well, and there is no need for unannexing the file or even retrieving its contents. Let’s see this in action for a file stored in Git, and a file stored in the annex:

$ datalad copy-file notes.txt midterm_project -d midterm_project
[INFO] Copying non-annexed file or copy into non-annex dataset: /home/me/dl-101/DataLad-101/notes.txt -> <datalad.local.copy_file._CachedRepo object at 0x7fe0bc110550>
copy_file(ok): /home/me/dl-101/DataLad-101/notes.txt
add(ok): notes.txt (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  copy_file (ok: 1)
  save (ok: 1)
$ datalad copy-file books/bash_guide.pdf midterm_project -d midterm_project
copy_file(ok): /home/me/dl-101/DataLad-101/books/bash_guide.pdf [/home/me/dl-101/DataLad-101/midterm_project/bash_guide.pdf]
save(ok): . (dataset)
action summary:
  copy_file (ok: 1)
  save (ok: 1)

Both files have been successfully transferred and saved to the subdataset, and no unannexing was necessary. Note, though, that notes.txt was annexed in the subdataset, as this subdataset was not configured with the text2git run procedure.

$ tree midterm_project
midterm_project
├── bash_guide.pdf -> .git/annex/objects/31/wQ/SHA256E-s1198170--d08f2c7b8492c574239ca3be131fb8cffe39e36262d6b24a20cb5abae4d4402c.pdf/SHA256E-s1198170--d08f2c7b8492c574239ca3be131fb8cffe39e36262d6b24a20cb5abae4d4402c.pdf
├── CHANGELOG.md
├── code
│   ├── README.md
│   └── script.py
├── input
│   └── iris.csv -> .git/annex/objects/qz/Jg/MD5E-s3975--341a3b5244f213282b7b0920b729c592.csv/MD5E-s3975--341a3b5244f213282b7b0920b729c592.csv
├── notes.txt -> .git/annex/objects/mf/wJ/MD5E-s5074--99d027490a2f9a9c49cffc2c34b55d5c.txt/MD5E-s5074--99d027490a2f9a9c49cffc2c34b55d5c.txt
├── pairwise_relationships.png -> .git/annex/objects/q1/gp/MD5E-s261062--025dc493ec2da6f9f79eb1ce8512cbec.png/MD5E-s261062--025dc493ec2da6f9f79eb1ce8512cbec.png
├── prediction_report.csv -> .git/annex/objects/8q/6M/MD5E-s345--a88cab39b1a5ec59ace322225cc88bc9.csv/MD5E-s345--a88cab39b1a5ec59ace322225cc88bc9.csv
└── README.md

2 directories, 9 files

The subdataset has two new commits, as datalad copy-file takes care of saving changes in the copied-to dataset. The new subdataset state would thus need to be saved in the superdataset.

$ datalad status -r
 modified: midterm_project (dataset)

Still, just as when we moved files across dataset boundaries, the files’ provenance record is lost:

$ cd midterm_project
$ git log notes.txt
commit d2970df1078a52bbff0a7966e017c40e398b8b20
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:33 2022 +0100

    [DATALAD] Recorded changes

Nevertheless, copying files with datalad copy-file is easier and safer than moving them with standard Unix commands, especially so for annexed files. A more detailed introduction to datalad copy-file and a concrete use case can be found in Subsample datasets using datalad copy-file.

Let’s clean up:

$ git reset --hard HEAD~2
HEAD is now at 0cd8e99 move book back from midterm_project

9.2.6. Moving/renaming a subdirectory or subdataset

Moving or renaming subdirectories, especially if they are subdatasets, can be a minefield. But in principle, a safe way to proceed is using the Unix mv command to move or rename, and datalad save to clean up afterwards, just as in the examples above. Make sure not to use git mv, especially for subdatasets.

Let’s for example rename the books directory:

$ mv books/ readings
$ datalad status
untracked: readings (directory)
  deleted: books/TLCL.pdf (symlink)
  deleted: books/bash_guide.pdf (symlink)
  deleted: books/byte-of-python.pdf (symlink)
  deleted: books/progit.pdf (symlink)
$ datalad save -m "renamed directory"
delete(ok): books/TLCL.pdf (file)
delete(ok): books/bash_guide.pdf (file)
delete(ok): books/byte-of-python.pdf (file)
delete(ok): books/progit.pdf (file)
add(ok): readings/TLCL.pdf (file)
add(ok): readings/bash_guide.pdf (file)
add(ok): readings/byte-of-python.pdf (file)
add(ok): readings/progit.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 4)
  delete (ok: 4)
  save (ok: 1)

This is easy, and complication free. Moving (as in: changing the location, instead of the name) the directory would work in the same fashion, and a datalad save would fix broken symlinks afterwards. Let’s quickly clean this up:

$ git reset --hard HEAD~1
HEAD is now at d48d7a8 move book back from midterm_project

But let’s now try to move the longnow subdataset into the root of the superdataset:

$ mv recordings/longnow .
$ datalad status
untracked: longnow (directory)
  deleted: recordings/longnow (dataset)
$ datalad save -m "moved subdataset"
delete(ok): recordings/longnow (file)
add(ok): longnow (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  delete (ok: 1)
  save (ok: 1)
$ datalad status
nothing to save, working tree clean

This seems fine, and it has indeed worked. However, reverting a commit like this is currently tricky. This could lead to trouble if you later try to revert or rebase chunks of your history that include this move. Therefore, if you can, try not to move subdatasets around. For now we’ll clean up in a somewhat “hacky” way: reverting, and moving the remaining subdataset contents back to their original place by hand to take care of the unwanted changes the commit reversal introduced.

$ git reset --hard HEAD~1
warning: unable to rmdir 'longnow': Directory not empty
HEAD is now at d48d7a8 move book back from midterm_project
$ mv -f longnow recordings

The take-home message therefore is that it is best not to move subdatasets, but very possible to move subdirectories if necessary. In both cases, do not attempt moving with git mv, but stick with mv and a subsequent datalad save.

9.2.7. Moving/renaming a superdataset

Once created, a DataLad superdataset may not be in an optimal place on your file system, or have the best name.

After a while, you might think that the dataset would fit much better into /home/user/research_projects/ than in /home/user/Documents/MyFiles/tmp/datalad-test/. Or maybe at some point, a long name such as My-very-first-DataLad-project-wohoo-I-am-so-excited does not look pretty in your terminal prompt anymore, and going for finance-2019 seems more professional.

These will be situations in which you want to rename or move a superdataset. Will that break anything?

In all standard situations, no, it will be completely fine. You can use standard Unix commands such as mv to do it, and also whichever graphical user interface or explorer you may use.

Beware of one thing though: If your dataset either is a sibling or has a sibling with the source being a path, moving or renaming the dataset will break the linkage between the datasets. This can be fixed easily though. We can try this in the following hidden section.

If a renamed/moved dataset is a sibling…

As section DIY configurations explains, each sibling is registered in .git/config in a “remote” section. Let’s look at how our sibling “roommate” is registered there:

$ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	editor = nano
[annex]
	uuid = 49a69e90-9581-4205-b2d5-a5c3fe832f0d
	version = 10
[filter "annex"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f
	process = git-annex filter-process
[submodule "recordings/longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
	url = ../mock_user/DataLad-101
	fetch = +refs/heads/*:refs/remotes/roommate/*
	annex-uuid = bd057cda-8db2-4bcc-aff2-975cb5b000cf
	annex-ignore = false
[submodule "midterm_project"]
	active = true
	url = ./midterm_project
[submodule "longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git

As you can see, its “url” is specified as a relative path. Say your roommate’s directory is a dataset you would want to move. Let’s see what happens if we move the dataset such that the path does not point to the dataset anymore:

# add an intermediate directory
$ cd ../mock_user
$ mkdir onemoredir
# move your roommate's dataset into this new directory
$ mv DataLad-101 onemoredir

This means that relative to your DataLad-101, your roommate’s dataset is not at ../mock_user/DataLad-101 anymore, but in ../mock_user/onemoredir/DataLad-101. The path specified in the configuration file is thus wrong now.

# navigate back into your dataset
$ cd ../DataLad-101
# attempt a datalad update
$ datalad update
[INFO] Fetching updates for Dataset(/home/me/dl-101/DataLad-101) 
update(error): . (dataset) [Fetch failed: CommandError(CommandError: 'git -c diff.ignoreSubmodules=none fetch --verbose --progress --no-recurse-submodules --prune roommate' failed with exitcode 128 under /home/me/dl-101/DataLad-101 [err: 'fatal: '../mock_user/DataLad-101' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.'])] [CommandError: 'git -c diff.ignoreSubmodules=none fetch --verbose --progress --no-recurse-submodules --prune roommate' failed with exitcode 128 under /home/me/dl-101/DataLad-101 [err: 'fatal: '../mock_user/DataLad-101' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.']]

Here we go:

'fatal: '../mock_user/DataLad-101' does not appear to be a git repository
 fatal: Could not read from remote repository.

Git seems pretty insistent (given the number of error messages) that it cannot find a Git repository at the location the .git/config file specifies. Luckily, we can provide this information. Edit the file with an editor of your choice and fix the path from url = ../mock_user/DataLad-101 to url = ../mock_user/onemoredir/DataLad-101.

Below, we are using the stream editor sed for this operation.

$ sed -i 's|\.\./mock_user/DataLad-101|../mock_user/onemoredir/DataLad-101|' .git/config

This is how the file looks now:

$ cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
	editor = nano
[annex]
	uuid = 49a69e90-9581-4205-b2d5-a5c3fe832f0d
	version = 10
[filter "annex"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f
	process = git-annex filter-process
[submodule "recordings/longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git
[remote "roommate"]
	url = ../mock_user/onemoredir/DataLad-101
	fetch = +refs/heads/*:refs/remotes/roommate/*
	annex-uuid = bd057cda-8db2-4bcc-aff2-975cb5b000cf
	annex-ignore = false
[submodule "midterm_project"]
	active = true
	url = ./midterm_project
[submodule "longnow"]
	active = true
	url = https://github.com/datalad-datasets/longnow-podcasts.git

Let’s try to update now:

$ datalad update
[INFO] Fetching updates for Dataset(/home/me/dl-101/DataLad-101) 
update(ok): . (dataset)

Nice! We fixed it! Therefore, if a dataset you move or rename is known to other datasets from its path, or identifies siblings with paths, make sure to adjust them in the .git/config file.
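
Instead of editing .git/config by hand, the same change can also be made with Git's own git remote set-url command. The following is a minimal sketch in a throwaway repository, so that it does not touch DataLad-101; the temporary directory is made up for illustration, and the remote name and paths mirror the example above.

```shell
# Throwaway repository standing in for DataLad-101
repo=$(mktemp -d)
git init -q "$repo"

# Register the sibling with the original relative path
git -C "$repo" remote add roommate ../mock_user/DataLad-101
old_url=$(git -C "$repo" remote get-url roommate)

# Point the sibling to the new location; this rewrites the url line in .git/config
git -C "$repo" remote set-url roommate ../mock_user/onemoredir/DataLad-101
new_url=$(git -C "$repo" remote get-url roommate)

echo "$old_url -> $new_url"
rm -rf "$repo"
```

In the real dataset, only the git remote set-url line would be needed, run from within DataLad-101.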

To clean up, we’ll undo the move of the dataset and the modification in .git/config.

$ cd ../mock_user && mv onemoredir/DataLad-101 .
$ rm -r onemoredir
$ cd ../DataLad-101 && sed -i 's|\.\./mock_user/onemoredir/DataLad-101|../mock_user/DataLad-101|' .git/config

9.2.8. Getting contents out of git-annex

Files in your dataset can either be handled by Git or git-annex. Self-made or predefined configurations to .gitattributes, defaults, or the --to-git option to datalad save allow you to control which tool does what, down to a single-file basis. Accidentally, though, you may give a file of yours to git-annex when it was intended to be stored in Git, or you may want to get a previously annexed file into Git.

Consider you intend to share the cropped .jpg images you created from the longnow logos. If you published your DataLad-101 dataset to GitHub or GitLab, these files would not be available to others, because annexed dataset contents cannot be published to these services. Even though you could find a third party service of your choice and publish your dataset and the annexed data (see section Beyond shared infrastructure), you’re feeling lazy today. And since it is only two files, and they are quite small, you decide to store them in Git – this way, the files would be available without configuring an external data store.

To get contents out of the dataset’s annex you need to unannex them. This is done with the git-annex command git annex unannex. Let’s see how it works:

$ git annex unannex recordings/*logo_small.jpg
unannex recordings/interval_logo_small.jpg ok
unannex recordings/salt_logo_small.jpg ok
(recording state in git...)

In this example, unannexing did not add a commit of its own; the most recent entry in the dataset’s history is still the previous save:

$ git log -p -n 1
commit d48d7a81f513d0b9c8696ea550cff819a7d96d24
Author: Elena Piscopia <elena@example.net>
Date:   Wed Dec 14 17:01:32 2022 +0100

    move book back from midterm_project

diff --git a/books/TLCL.pdf b/books/TLCL.pdf
new file mode 120000
index 0000000..4c84b61
--- /dev/null
+++ b/books/TLCL.pdf
@@ -0,0 +1 @@
+../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
diff --git a/midterm_project b/midterm_project
index 65aacf3..0cd8e99 160000
--- a/midterm_project
+++ b/midterm_project
@@ -1 +1 @@
-Subproject commit 65aacf3b830b721c97f3d66ddf3febc6da8d5e8b
+Subproject commit 0cd8e9937c31122823ea8056a8e0f90535912b9a

Once files have been unannexed, they are “untracked” again, and you can save them into Git, either by adding a rule to .gitattributes, or with datalad save --to-git:

$ datalad save --to-git -m "save cropped logos to Git" recordings/*jpg
add(ok): recordings/interval_logo_small.jpg (file)
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
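
The alternative mentioned above, a rule in .gitattributes, could look roughly like the following sketch. The glob pattern is an assumption chosen to match the two logo files, and the rule is written into a temporary directory here so that the snippet does not touch a real dataset. In practice the line would be appended to the .gitattributes file in the dataset root.

```shell
# Sketch: write a rule that keeps the small logo files out of the annex.
workdir=$(mktemp -d)
# 'annex.largefiles=nothing' tells git-annex that no file matching the
# pattern counts as "large", so matching files are committed to Git directly.
echo 'recordings/*_small.jpg annex.largefiles=nothing' >> "$workdir/.gitattributes"
cat "$workdir/.gitattributes"
```

With such a rule saved in the dataset, a plain datalad save without --to-git should place matching files into Git as well, which is convenient when more files of this kind are expected later.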

9.2.9. Deleting (annexed) files/directories

Removing annexed file content from a dataset is possible in two different ways: either by removing the file from the current state of the repository (which Git calls the worktree) while keeping the content in the history of the dataset, or by removing the content entirely from the dataset and its history.

9.2.9.1. Removing a file, but keeping content in history

An rm <file> or rm -rf <directory> with a subsequent datalad save will remove a file or directory, and save its removal. The file content however will still be in the history of the dataset, and the file can be brought back to existence by going back into the history of the dataset or reverting the removal commit:

# download a file
$ datalad download-url -m "Added flower mosaic from wikimedia" \
  https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
  --path flowers.jpg
[INFO] Downloading 'https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg' into '/home/me/dl-101/DataLad-101/flowers.jpg'
download_url(ok): /home/me/dl-101/DataLad-101/flowers.jpg (file)
add(ok): flowers.jpg (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)
$ ls -l flowers.jpg
lrwxrwxrwx 1 adina adina 128 Oct  6  2013 flowers.jpg -> .git/annex/objects/7q/9Z/MD5E-s4487679--3898ef0e3497a89fa1ea74698992bf51.jpg/MD5E-s4487679--3898ef0e3497a89fa1ea74698992bf51.jpg
# removal is easy:
$ rm flowers.jpg

This will lead to a dirty dataset status:

$ datalad status
  deleted: flowers.jpg (symlink)

If a removal happened by accident, a git checkout -- flowers.jpg would undo the removal at this stage. To stick with the removal and clean up the dataset state, datalad save will suffice:

$ datalad save -m "removed file again"
delete(ok): flowers.jpg (file)
save(ok): . (dataset)
action summary:
  delete (ok: 1)
  save (ok: 1)

This commits the deletion of the file in the dataset’s history. If this commit is reverted, the file comes back to existence:

$ git reset --hard HEAD~1
HEAD is now at 5403d26 Added flower mosaic from wikimedia
$ ls
books
code
flowers.jpg
midterm_project
notes.txt
recordings

In other words, with an rm and subsequent datalad save, the symlink is removed, but the content is retained in the history.
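
Resetting the whole dataset is not the only way to get such a file back. The sketch below uses plain Git in a throwaway repository to restore a single file from the commit before its removal. It assumes that git is installed, uses a hypothetical text file instead of an annexed image, and lets git commit stand in for datalad save.

```shell
# Sketch with plain Git in a temporary repository (assumption: git is installed).
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Identity is passed per command so the sketch does not rely on global config.
g() { git -c user.name="Example" -c user.email="example@example.net" "$@"; }
echo "petals" > flowers.txt
g add flowers.txt
g commit -q -m "add file"
# Remove the file and record the removal, as rm plus a save would.
rm flowers.txt
g commit -q -a -m "removed file again"
# Bring back only this file from the commit before the removal.
g checkout HEAD~1 -- flowers.txt
cat flowers.txt
```

In contrast to git reset --hard, this keeps the removal commit in the history and only re-adds the file to the worktree and index, where it can be saved again.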

9.2.9.2. Removing annexed content entirely

The command to remove file content entirely and irreversibly from a repository is the datalad drop command (datalad-drop manual). This command will delete the content stored in the annex of the dataset, and can be very helpful to make a dataset leaner if the file content is either irrelevant or can be retrieved from other sources easily. Think about a situation in which a very large result file is computed by default in some analysis, but is not relevant for any project, and can thus be removed. Or a situation in which only the results of an analysis need to be kept, while the file contents from its input datasets can be dropped, as these input datasets are backed up elsewhere. Because the command works on annexed contents, it will drop file content from a dataset, but it will retain the symlink for this file (as this symlink is stored in Git).

drop can take any number of files. If an entire dataset is specified, all file content in sub-directories is dropped automatically, but for content in sub-datasets to be dropped, the -r/--recursive flag has to be included. By default, DataLad will not drop any content that does not have at least one verified remote copy from which the content could be retrieved again. It is possible to drop the downloaded image because, thanks to datalad download-url, its original location on the web is known:

$ datalad drop flowers.jpg
drop(ok): flowers.jpg (file)

Now the file content is gone, but the symlink still exists. Opening the remaining symlink will fail, but the content can be obtained easily again with datalad get:

$ datalad get flowers.jpg
get(ok): flowers.jpg (file) [from web...]
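
The state between the drop and the get, a symlink whose target is missing, can be imitated with plain shell tools. This is only a sketch: the directory layout and the name KEY are hypothetical stand-ins for the annex object tree, and no DataLad command is involved.

```shell
# Sketch: imitate a dropped annexed file with a dangling symlink.
demo=$(mktemp -d)
mkdir -p "$demo/objects"
echo "image bytes" > "$demo/objects/KEY"   # stand-in for the annexed content
ln -s objects/KEY "$demo/flowers.jpg"      # stand-in for the symlink tracked in Git
rm "$demo/objects/KEY"                     # roughly what a drop does to the content
# -L is true for the link itself, -e follows the link and fails if the target is gone.
if [ -L "$demo/flowers.jpg" ] && [ ! -e "$demo/flowers.jpg" ]; then
  state="symlink present, content missing"
fi
echo "$state"
```

A check of this kind can be handy in scripts that need to know whether file content is locally available before trying to open it.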

If a file has no verified remote copies, DataLad will only drop its content if the user enforces it. DataLad versions prior to 0.16 enforce dropping with the --nocheck option, while DataLad versions 0.16 and up use the --reckless [MODE] option, where [MODE] is one of modification (drop despite unsaved modifications), availability (drop even though no other copy is known), undead (only for datasets; drops a dataset without announcing its death to linked dataset clones), or kill (no safety checks are run at all). While the --reckless parameter sounds more complex, it ensures a safer operation than the previous --nocheck implementation. We will demonstrate this by generating an empty PDF file:

$ convert xc:none -page Letter a.pdf
$ datalad save -m "add empty pdf"
add(ok): a.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

DataLad will refuse to drop content that it cannot retrieve again:

$ datalad drop a.pdf
drop(error): a.pdf (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copy; (Use --reckless availability to override this check, or adjust numcopies.)]

But with --nocheck (for <0.16) or --reckless availability (for 0.16 and higher) it will work:

$ datalad drop --reckless availability a.pdf
drop(ok): a.pdf (file)

Note though that this file content is irreversibly gone now, and even going back in time in the history of the dataset will not bring it back into existence.
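
Because the override flag differs between DataLad versions, a script that has to run against both could pick the flag from a version string. The helper below is a hypothetical sketch, not part of DataLad. It assumes a sort implementation that supports version ordering with -V (as GNU coreutils does), and it only encodes the version boundary described above.

```shell
# Hypothetical helper: print the drop override flag matching a DataLad version.
drop_override_flag() {
  version=$1
  # sort -V orders version strings; if 0.16 sorts first, the version is >= 0.16.
  oldest=$(printf '%s\n%s\n' "0.16" "$version" | sort -V | head -n 1)
  if [ "$oldest" = "0.16" ]; then
    echo "--reckless availability"
  else
    echo "--nocheck"
  fi
}
drop_override_flag 0.15.4
drop_override_flag 0.17.9
```

The version string would have to be supplied by the caller, and either flag should only be used once it is certain that the content is no longer needed.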

Finally, let’s clean up:

$ git reset --hard HEAD~2
HEAD is now at 5fb84f0 save cropped logos to Git

9.2.10. Deleting content stored in Git

It is much harder to delete dataset content that is stored in Git compared to content stored in git-annex. Operations such as rm or git rm remove the file from the worktree, but not from its history, and it can be brought back to life just like annexed content that was merely rm'ed. There is also no straightforward Git equivalent of drop. To accomplish a complete removal of a file from a dataset, we recommend the external tool git-filter-repo. It is a powerful and potentially very dangerous tool to rewrite Git history.

Removing files stored in Git completely is not a common or recommended operation, as it involves quite aggressive rewriting of the dataset history. Sometimes, however, sensitive files (for example private SSH keys or passwords), or too many or too large files, are accidentally saved into Git and need to get out of the dataset history. The command git-filter-repo <path-specification> --force will “filter-out”, i.e., remove all files but the ones specified in <path-specification> from the dataset’s history. The section Fixing up too-large datasets shows an example invocation. If you want to use it, however, make sure to attempt it in a dataset clone or with its --dry-run flag first. It is easy to lose dataset history and files with this tool.

9.2.11. Uninstalling or deleting subdatasets

Depending on the exact aim, two commands are relevant for deleting a DataLad subdataset. The softer (and not so much “deleting”) version is to uninstall a dataset with the datalad uninstall command (datalad-uninstall manual). This command can be used to uninstall any number of subdatasets. Note though that only subdatasets can be uninstalled; the command will error if given a sub-directory, a file, or a top-level dataset.

# clone a subdataset - the content is irrelevant, so why not a cloud :)
$ datalad clone -d . \
 https://github.com/datalad-datasets/disneyanimation-cloud.git \
 cloud
[INFO] Cloning dataset to Dataset(/home/me/dl-101/DataLad-101/cloud)
[INFO] Attempting to clone from https://github.com/datalad-datasets/disneyanimation-cloud.git to /home/me/dl-101/DataLad-101/cloud
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/DataLad-101/cloud)
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] https://github.com/datalad-datasets/disneyanimation-cloud.git/config download failed: Not Found
install(ok): cloud (dataset)
add(ok): cloud (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)

To uninstall the dataset, use

$ datalad uninstall cloud
uninstall(ok): cloud (dataset)

Note that the subdataset is still known to the superdataset, and not completely removed. A datalad get [-n/--no-data] cloud would install the subdataset again.

In case one wants to fully delete a subdataset from a dataset, the datalad remove command (datalad-remove manual) is relevant [3]. It needs a pointer to the root of the superdataset with the -d/--dataset flag, a path to the subdataset to be removed, and optionally a commit message (-m/--message) or recursive specification (-r/--recursive). To remove a subdataset, we will install the uninstalled subdataset again, and subsequently remove it with the datalad remove command:

$ datalad get -n cloud
[INFO] Cloning dataset to Dataset(/home/me/dl-101/DataLad-101/cloud)
[INFO] Attempting to clone from https://github.com/datalad-datasets/disneyanimation-cloud.git to /home/me/dl-101/DataLad-101/cloud
[INFO] Start enumerating objects
[INFO] Start receiving objects
[INFO] Start resolving deltas
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/DataLad-101/cloud)
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] https://github.com/datalad-datasets/disneyanimation-cloud.git/config download failed: Not Found
install(ok): /home/me/dl-101/DataLad-101/cloud (dataset) [Installed subdataset in order to get /home/me/dl-101/DataLad-101/cloud]
# delete the subdataset
$ datalad remove -m "remove obsolete subds" -d . cloud
uninstall(ok): cloud (dataset)
remove(ok): cloud (file)
save(ok): . (dataset)
action summary:
  remove (ok: 1)
  save (ok: 1)
  uninstall (ok: 1)

Note that for both commands a pointer to the current directory will not work. datalad remove . or datalad uninstall . will fail, even if the command is executed in a subdataset instead of the top-level superdataset – you need to execute the command from a higher-level directory.

9.2.12. Deleting a superdataset

If you have ever tried to remove a DataLad dataset, whether with a GUI or the command line call rm -rf <directory>, you have likely seen permission denied errors such as

rm: cannot remove '<directory>/.git/annex/objects/Mz/M1/MD5E-s422982--2977b5c6ea32de1f98689bc42613aac7.jpg/MD5E-s422982--2977b5c6ea32de1f98689bc42613aac7.jpg': Permission denied
rm: cannot remove '<directory>/.git/annex/objects/FP/wv/MD5E-s543180--6209797211280fc0a95196b0f781311e.jpg/MD5E-s543180--6209797211280fc0a95196b0f781311e.jpg': Permission denied
[...]

This error indicates that there is write-protected content within .git that cannot be deleted. What is this write-protected content? It’s the file content stored in the object tree of git-annex. If you want, you can re-read the section on Data integrity to find out how git-annex revokes write permission for the user to protect the file content given to it. To remove a dataset with annexed content one has to regain write permissions to everything in the dataset. This is done with the chmod command:

chmod -R u+w <dataset>

This recursively (-R, i.e., throughout all files and (sub)directories) gives the owning user (u) write permission (+w) for the dataset.

Afterwards, rm -rf <dataset> will succeed.
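
The whole sequence can be rehearsed on a mock directory before applying it to a real dataset. The sketch below builds a hypothetical stand-in for an annex object tree (the names mock-dataset, aa/bb, and KEY are made up), removes write permission the way git-annex protects its content, and then regains it.

```shell
# Sketch: imitate write-protected annex content in a throwaway directory.
ds=$(mktemp -d)/mock-dataset
mkdir -p "$ds/.git/annex/objects/aa/bb"
echo "content" > "$ds/.git/annex/objects/aa/bb/KEY"
# Remove write permission from the mock object tree for everyone.
chmod -R a-w "$ds/.git/annex/objects"
# As a regular user this fails with "Permission denied" (root bypasses the check).
if ! rm -rf "$ds" 2>/dev/null; then
  echo "rm -rf failed, restoring write permission first"
fi
# Give the owning user write permission on everything, then delete.
if [ -e "$ds" ]; then
  chmod -R u+w "$ds"
  rm -rf "$ds"
fi
```

The deletion fails at first because removing a file requires write permission on the directory that contains it, which is why chmod has to be applied recursively and not only to the files.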

However, instead of rm -rf, a faster way to remove a dataset is using datalad remove: Run datalad remove <dataset> outside of the superdataset to remove a top-level dataset with all its contents. Likely, both --recursive and --nocheck (for DataLad versions <0.16) or --reckless [availability|undead|kill] (for DataLad versions 0.16 and higher) flags are necessary to traverse into subdatasets and to remove content that does not have verified remotes.

Be aware though that both ways to delete a dataset will irretrievably delete the dataset, its contents, and its history.

9.2.13. Summary

To sum up, file system management operations are safe and easy. Even if you are currently confused about one or two operations, worry not – the take-home-message is simple: Use datalad save whenever you move or rename files. Be mindful that the output of datalad status can appear unintuitive or that symlinks can break if annexed files are moved, but all of these problems are solved after a datalad save command. Apart from this command, having a clean dataset status prior to doing anything is your friend as well. It will make sure that you have a neat and organized commit history, and no accidental commits of changes unrelated to your file system management operations. The only operation you should beware of is moving subdatasets around – this can be a minefield. With all of these experiences and tips, you feel confident that you know how to handle your dataset’s files and directories well and worry-free.

Footnotes

1

If you want to learn more about the Git-specific concepts of worktree, staging area/index or HEAD, the upcoming section Back and forth in time will talk briefly about them and demonstrate helpful commands.

2

Or rather: split – basically, the file is getting a fresh new start. Think of it as some sort of witness-protection program with complete disrespect for provenance…

3

This is indeed the only case in which datalad remove is relevant. For all other cases of content deletion a normal rm with a subsequent datalad save works best.