Miscellaneous file system operations¶
With all of the information about symlinks and object trees, you might be reluctant to perform usual file system managing operations, such as copying, moving, renaming or deleting files or directories with annexed content.
If I renamed one of those books, would the symlink that points to the file content still be correct? What happens if I copy an annexed file? What if I moved the whole books/ directory? What if I moved all of DataLad-101 into a different place on my computer? What if I renamed the whole superdataset? And how do I remove a file, or directory, or subdataset?
To answer these questions, the course’s TA offers an extra tutorial today, and you attend.
There is no better way of learning than doing. Here, in the safe space of the DataLad-101 course, you can try out all of the things you would be unsure about or reluctant to try on the dataset that contains your own, valuable data.
Below you will find common questions about file system management operations, and each question outlines caveats and solutions with code examples you can paste into your own terminal. Because these code snippets will add many commits to your dataset, we clean up within each segment with common Git operations that manipulate the dataset’s history. Be sure to execute these commands as well (and be sure to be in the correct dataset).
Renaming files¶
Let’s try it. In Unix, renaming a file is exactly the same as moving a file, and uses the mv command.
$ cd books/
$ mv TLCL.pdf The_Linux_Command_Line.pdf
$ ls -lah
total 24K
drwxr-xr-x 2 adina adina 4.0K Jan 9 07:53 .
drwxr-xr-x 8 adina adina 4.0K Jan 9 07:53 ..
lrwxrwxrwx 1 adina adina 131 Jan 19 2009 bash_guide.pdf -> ../.git/annex/objects/WF/Gq/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf/MD5E-s1198170--0ab2c121bcf68d7278af266f6a399c5f.pdf
lrwxrwxrwx 1 adina adina 131 Apr 19 2017 byte-of-python.pdf -> ../.git/annex/objects/F1/Wz/MD5E-s4242644--f4e1c8ebfb5c89a69ff6d268eb2e63e3.pdf/MD5E-s4242644--f4e1c8ebfb5c89a69ff6d268eb2e63e3.pdf
lrwxrwxrwx 1 adina adina 133 Jun 29 2019 progit.pdf -> ../.git/annex/objects/G6/Gj/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf/MD5E-s12465653--05cd7ed561d108c9bcf96022bc78a92c.pdf
lrwxrwxrwx 1 adina adina 131 Jan 28 2019 The_Linux_Command_Line.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
Try to open the renamed file, e.g., with evince The_Linux_Command_Line.pdf. This works!
But let’s see what changed in the dataset with this operation:
$ datalad status
untracked: /home/me/dl-101/DataLad-101/books/The_Linux_Command_Line.pdf (symlink)
deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (symlink)
We can see that the old file is marked as deleted, and simultaneously, an untracked file appears: the renamed PDF.
While this might appear messy, a datalad save will clean all of this up. Therefore, do not panic if you rename a file and see a dirty dataset status with deleted and untracked files: datalad save handles these and other cases really well under the hood.
$ datalad save -m "rename the book"
delete(ok): books/TLCL.pdf (file)
add(ok): books/The_Linux_Command_Line.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
delete (ok: 1)
save (ok: 1)
The datalad save command will identify that a file was renamed, and will summarize this nicely in the resulting commit:
$ git log -n 1 -p
commit 2c93e94f6581d655f17e3b223c3b573265396070
Author: Elena Piscopia <elena@example.net>
Date: Thu Jan 9 07:53:33 2020 +0100
rename the book
diff --git a/books/TLCL.pdf b/books/The_Linux_Command_Line.pdf
similarity index 100%
rename from books/TLCL.pdf
rename to books/The_Linux_Command_Line.pdf
Note that datalad save commits all modifications when it’s called without a path specification, so any other changes will be saved in the same commit as the rename. If there are unsaved modifications you do not want to commit together with the file name change, you could give both the new and the deleted file as a path specification to datalad save, even if it feels unintuitive to save a change that is marked as a deletion in a datalad status:
datalad save -m "rename file" oldname newname
Alternatively, there is also a way to save only the name change using Git tools, outlined in the following hidden section. If you are a Git user, you will be very familiar with it.
Find out more: Renaming with Git tools
Git has built-in commands that provide a solution in two steps.
If you have followed along with the previous datalad save (which you should have), let’s revert the renaming of the file:
$ git reset --hard HEAD~1
$ datalad status
HEAD is now at c2c4282 add container and execute analysis within container
Now let’s see how to rename files and commit this operation using only Git. A Git-specific way to rename files is the git mv command:
$ git mv TLCL.pdf The_Linux_Command_Line.pdf
$ datalad status
added: /home/me/dl-101/DataLad-101/books/The_Linux_Command_Line.pdf (file)
deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (file)
We can see that the old file is still seen as “deleted”, but the “new”, renamed file is “added”. A git status displays the change in the dataset a bit more accurately:
$ git status
On branch master
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
renamed: TLCL.pdf -> The_Linux_Command_Line.pdf
Because git mv places the change directly into the staging area (the index) of Git [1], a subsequent git commit -m "rename book" will write the renaming, and only the renaming, to the dataset’s history, even if other (unstaged) modifications are present.
$ git commit -m "rename book"
[master 0d83cd6] rename book
1 file changed, 0 insertions(+), 0 deletions(-)
rename books/{TLCL.pdf => The_Linux_Command_Line.pdf} (100%)
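To see in isolation why git mv keeps the rename separate from other changes, here is a minimal sketch in a throwaway Git repository. It uses plain Git only (no DataLad, no git-annex), and the file names, the scratch directory, and the commit identity are made up for illustration. Under these assumptions, the commit should contain nothing but the rename, while the unrelated modification stays unstaged.

```shell
# Throwaway repository in a temporary directory (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo User"          # identity for this scratch repo only
git config user.email "demo@example.com"

# Two tracked files: the "book" and an unrelated notes file.
echo "book" > TLCL.txt
echo "notes" > notes.txt
git add TLCL.txt notes.txt
git commit -q -m "initial state"

# An unrelated modification that is NOT staged ...
echo "more notes" >> notes.txt

# ... and a rename that git mv places directly into the index.
git mv TLCL.txt The_Linux_Command_Line.txt
git commit -q -m "rename book"

# What went into the last commit, and what is still pending?
committed=$(git show -M --name-status --format= HEAD)
pending=$(git status --porcelain)
echo "committed: $committed"
echo "pending:   $pending"
```

The commit lists a single rename entry, and git status still reports notes.txt as modified, which mirrors the behavior described above for the dataset.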
To summarize, renaming files is easy and worry-free. Do not be intimidated by a file marked as deleted: a datalad save will rectify this. Be mindful of other modifications in your dataset, though, and either supply appropriate paths to datalad save, or use Git tools to exclusively save the name change and nothing else.
Let’s revert this now, to have a clean history.
$ git reset --hard HEAD~1
$ datalad status
HEAD is now at c2c4282 add container and execute analysis within container
Moving files from or into subdirectories¶
Let’s move an annexed file from within books/ into the root of the superdataset:
$ mv TLCL.pdf ../TLCL.pdf
$ datalad status
untracked: /home/me/dl-101/DataLad-101/TLCL.pdf (symlink)
deleted: /home/me/dl-101/DataLad-101/books/TLCL.pdf (symlink)
In general, this looks exactly like renaming or moving a file in the same directory. There is a subtle difference though: Currently, the symlink of the annexed file is broken. There are two ways to demonstrate this. One is trying to open the file – this will currently fail. The second way is to look at the symlink:
$ cd ../
$ ls -l TLCL.pdf
lrwxrwxrwx 1 adina adina 131 Jan 9 07:53 TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
The first part of the symlink should point into the .git/ directory, but currently it does not: the symlink still looks as if TLCL.pdf were within books/. Instead of pointing into .git, it currently points to ../.git, which does not exist and would even lie outside of the superdataset. This is why the file cannot be opened: when any program tries to follow the symlink, it will not resolve, and an error such as “no file or directory” will be returned. But do not panic! A datalad save will rectify this as well:
$ datalad save -m "moved book into root"
$ ls -l TLCL.pdf
delete(ok): books/TLCL.pdf (file)
add(ok): TLCL.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
delete (ok: 1)
save (ok: 1)
lrwxrwxrwx 1 adina adina 128 Jan 9 07:53 TLCL.pdf -> .git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
After a datalad save, the symlink is fixed again. Therefore, in general, whenever moving or renaming a file, especially between directories, a datalad save is the best option to turn to.
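To look at the mechanism in isolation, here is a minimal sketch that mimics the layout with plain directories and a hand-made relative symlink. It assumes nothing about DataLad or git-annex: the scratch directory, the stand-in object name KEY, and the manual repair at the end are all hypothetical and only imitate the situation described above.

```shell
# Scratch layout that imitates a dataset (hypothetical; no DataLad involved).
demo=$(mktemp -d)
ds="$demo/DataLad-101"
mkdir -p "$ds/.git/annex/objects" "$ds/books"
echo "book content" > "$ds/.git/annex/objects/KEY"

# A relative symlink as it would look for a file inside books/.
ln -s ../.git/annex/objects/KEY "$ds/books/TLCL.pdf"

# Moving the link one level up leaves its target text unchanged ...
mv "$ds/books/TLCL.pdf" "$ds/TLCL.pdf"
link_target=$(readlink "$ds/TLCL.pdf")

# ... so it now points above the dataset root and no longer resolves.
if [ -e "$ds/TLCL.pdf" ]; then state_before=resolves; else state_before=broken; fi

# Re-creating the link relative to its new location repairs it.
rm "$ds/TLCL.pdf"
ln -s .git/annex/objects/KEY "$ds/TLCL.pdf"
if [ -e "$ds/TLCL.pdf" ]; then state_after=resolves; else state_after=broken; fi

echo "target after move: $link_target"
echo "before fix: $state_before, after fix: $state_after"
```

In a real dataset you would not repair the link by hand; the sketch only illustrates why the link breaks and what a corrected link looks like.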
Find out more: Why a move between directories is actually a content change
Let’s see how this shows up in the dataset history:
$ git log -n 1 -p
commit 5a617cd7d1e8c901658130c0f57655b66247cc9b
Author: Elena Piscopia <elena@example.net>
Date: Thu Jan 9 07:53:36 2020 +0100
moved book into root
diff --git a/TLCL.pdf b/TLCL.pdf
new file mode 120000
index 0000000..34328e2
--- /dev/null
+++ b/TLCL.pdf
@@ -0,0 +1 @@
+.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
diff --git a/books/TLCL.pdf b/books/TLCL.pdf
deleted file mode 120000
index 4c84b61..0000000
--- a/books/TLCL.pdf
+++ /dev/null
@@ -1 +0,0 @@
-../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
As you can see, this action does not show up as a move, but instead a deletion and addition of a new file. Why? Because the content that is tracked is the actual symlink, and due to the change in relative location, the symlink needed to change. Hence, what looks and feels like a move on the file system for you is actually a move plus a content change for Git.
An additional piece of background information: A datalad save command internally uses a git commit to save changes to a dataset. git commit in turn triggers a git annex fix command. This git-annex command fixes up links that have become broken to again point to annexed content, and is responsible for cleaning up what needs to be cleaned up. Thanks, git-annex!
Therefore, while it might be startling if you’ve moved a file and cannot open it directly afterwards, everything will be rectified by datalad save.
Finally, let’s clean up:
$ git reset --hard HEAD~1
HEAD is now at c2c4282 add container and execute analysis within container
Copying files¶
Let’s create a copy of an annexed file, using the Unix command cp to copy.
$ cp books/TLCL.pdf copyofTLCL.pdf
$ datalad status
untracked: copyofTLCL.pdf (file)
That’s expected. The copy shows up as a new, untracked file. Let’s save it:
$ datalad save -m "add copy of TLCL.pdf"
add(ok): copyofTLCL.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
$ git log -n 1 -p
commit 2cf5ad0c35fdc3c98933c5c660ac5f3bbf3f2dfa
Author: Elena Piscopia <elena@example.net>
Date: Thu Jan 9 07:53:37 2020 +0100
add copy of TLCL.pdf
diff --git a/copyofTLCL.pdf b/copyofTLCL.pdf
new file mode 120000
index 0000000..34328e2
--- /dev/null
+++ b/copyofTLCL.pdf
@@ -0,0 +1 @@
+.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
\ No newline at end of file
That’s it.
Find out more: Symlinks!
If you have read the additional content in the section Data integrity, you know that the same file content is only stored once, and copies of the same file point to the same location in the object tree.
Let’s check that out:
$ ls -l copyofTLCL.pdf
$ ls -l books/TLCL.pdf
lrwxrwxrwx 1 adina adina 128 Jan 9 07:53 copyofTLCL.pdf -> .git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
lrwxrwxrwx 1 adina adina 131 Jan 9 07:53 books/TLCL.pdf -> ../.git/annex/objects/jf/3M/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf/MD5E-s2120211--06d1efcb05bb2c55cd039dab3fb28455.pdf
Indeed! Apart from their relative location (.git versus ../.git), the targets of both symlinks are identical. Thus, even though two copies of the book exist in your dataset, your disk needs to store it only once.
In most cases, this is just an interesting fun fact, but beware when dropping content with datalad drop (Removing annexed content entirely): If you drop the content of one copy of a file, all other copies will lose this content as well.
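The consequence for dropping can be imitated with plain symlinks. The following is a minimal sketch under stated assumptions: the scratch directory and the stand-in object KEY are hypothetical, and deleting the object with rm is only a rough analogy for what dropping content amounts to, not what datalad drop does internally.

```shell
# Scratch layout with one stored object and two links to it (hypothetical names).
demo=$(mktemp -d)
ds="$demo/DataLad-101"
mkdir -p "$ds/.git/annex/objects" "$ds/books"
echo "book content" > "$ds/.git/annex/objects/KEY"
ln -s ../.git/annex/objects/KEY "$ds/books/TLCL.pdf"
ln -s .git/annex/objects/KEY "$ds/copyofTLCL.pdf"

# Both names resolve to the same stored content.
same_content=no
cmp -s "$ds/books/TLCL.pdf" "$ds/copyofTLCL.pdf" && same_content=yes

# Removing the single stored object leaves both links dangling.
rm "$ds/.git/annex/objects/KEY"
dangling=0
for f in "$ds/books/TLCL.pdf" "$ds/copyofTLCL.pdf"; do
    [ -e "$f" ] || dangling=$((dangling + 1))
done

echo "same content: $same_content, dangling links after removal: $dangling"
```

Both links dangle afterwards because there was only ever one copy of the content on disk.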
Finally, let’s clean up:
$ git reset --hard HEAD~1
HEAD is now at c2c4282 add container and execute analysis within container
Moving/renaming a subdirectory or subdataset¶
Moving or renaming subdirectories, especially if they are subdatasets, can be a minefield. But in principle, a safe way to proceed is using the Unix mv command to move or rename, and datalad save to clean up afterwards, just as in the examples above. Make sure not to use git mv, especially for subdatasets.
Let’s for example rename the books directory:
$ mv books/ readings
$ datalad status
untracked: readings (directory)
deleted: books/TLCL.pdf (symlink)
deleted: books/bash_guide.pdf (symlink)
deleted: books/byte-of-python.pdf (symlink)
deleted: books/progit.pdf (symlink)
$ datalad save -m "renamed directory"
delete(ok): books/TLCL.pdf (file)
delete(ok): books/bash_guide.pdf (file)
delete(ok): books/byte-of-python.pdf (file)
delete(ok): books/progit.pdf (file)
add(ok): readings/TLCL.pdf (file)
add(ok): readings/bash_guide.pdf (file)
add(ok): readings/byte-of-python.pdf (file)
add(ok): readings/progit.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 4)
delete (ok: 4)
save (ok: 1)
This is easy and complication-free. Moving the directory (as in: changing its location instead of its name) would work in the same fashion, and a datalad save would fix broken symlinks afterwards. Let’s quickly clean this up:
$ git reset --hard HEAD~1
HEAD is now at c2c4282 add container and execute analysis within container
But let’s now try to move the longnow subdataset into the root of the superdataset:
$ mv recordings/longnow .
$ datalad status
untracked: longnow (directory)
deleted: recordings/longnow (dataset)
$ datalad save -m "moved subdataset"
delete(ok): recordings/longnow (file)
add(ok): longnow (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
delete (ok: 1)
save (ok: 1)
$ datalad status
This seems fine, and it has indeed worked. However, reverting a commit like this is currently tricky. This could lead to trouble if you later try to revert or rebase chunks of your history that include this move. Therefore, if you can, try not to move subdatasets around. For now we’ll clean up in a somewhat “hacky” way: resetting the commit, and moving the remaining subdataset contents back to their original place by hand to take care of the unwanted changes the reset introduced.
$ git reset --hard HEAD~1
warning: unable to rmdir 'longnow': Directory not empty
HEAD is now at c2c4282 add container and execute analysis within container
$ mv -f longnow recordings
The take-home message therefore is that it is best not to move subdatasets, but very possible to move subdirectories if necessary. In both cases, do not attempt moving with git mv, but stick with mv and a subsequent datalad save.
Moving/renaming a superdataset¶
Once created, a DataLad superdataset may not be in an optimal place on your file system, or have the best name.
After a while, you might think that the dataset would fit much better into /home/user/research_projects/ than in /home/user/Documents/MyFiles/tmp/datalad-test/. Or maybe at some point, a long name such as My-very-first-DataLad-project-wohoo-I-am-so-excited does not look pretty in your terminal prompt anymore, and going for finance-2019 seems more professional.
These will be situations in which you want to rename or move a superdataset. Will that break anything?
In all standard situations, no, it will be completely fine. You can use standard Unix commands such as mv to do it, and also whichever graphical user interface or explorer you may use.
Beware of one thing though: If your dataset either is a sibling or has a sibling with the source being a path, moving or renaming the dataset will break the linkage between the datasets. This can be fixed easily though. We can try this in the following hidden section.
Find out more: If a renamed/moved dataset is a sibling…
As section DIY configurations explains, each sibling is registered in .git/config in a “remote” section. Let’s look at how our sibling “roommate” is registered there:
$ cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
editor = nano
[annex]
uuid = 0b6bef5c-68c4-465f-8d32-c00ffa64dcfb
version = 5
backends = MD5E
[submodule "recordings/longnow"]
url = https://github.com/datalad-datasets/longnow-podcasts.git
active = true
[remote "roommate"]
url = ../mock_user/DataLad-101
fetch = +refs/heads/*:refs/remotes/roommate/*
annex-uuid = 2907ce9f-0de0-4f84-aafd-7c07a8cc3c8a
annex-ignore = false
[submodule "midterm_project"]
url = /home/me/dl-101/DataLad-101/midterm_project
active = true
[submodule "longnow"]
url = https://github.com/datalad-datasets/longnow-podcasts.git
active = true
As you can see, its “url” is specified as a relative path. Say your roommate’s directory is a dataset you would want to move. Let’s see what happens if we move the dataset such that the path does not point to the dataset anymore:
# add an intermediate directory
$ cd ../mock_user
$ mkdir onemoredir
# move your roommate's dataset into this new directory
$ mv DataLad-101 onemoredir
This means that relative to your DataLad-101, your roommate’s dataset is not at ../mock_user/DataLad-101 anymore, but in ../mock_user/onemoredir/DataLad-101. The path specified in the configuration file is thus wrong now.
# navigate back into your dataset
$ cd ../DataLad-101
# attempt a datalad update
$ datalad update
[INFO] Fetching updates for <Dataset path=/home/me/dl-101/DataLad-101>
[ERROR] Cmd('/usr/lib/git-annex.linux/git') failed due to: exit code(128)
| cmdline: /usr/lib/git-annex.linux/git fetch --progress --prune --recurse-submodules=no -v roommate
| stderr: 'fatal: '../mock_user/DataLad-101' does not appear to be a git repository
| fatal: Could not read from remote repository.
|
| Please make sure you have the correct access rights
| and the repository exists.' [cmd.py:wait:415] (GitCommandError)
Here we go:
'fatal: '../mock_user/DataLad-101' does not appear to be a git repository
fatal: Could not read from remote repository.
Git seems pretty insistent (given the number of error messages) that it cannot find a Git repository at the location the .git/config file specifies. Luckily, we can provide this information. Edit the file with an editor of your choice and fix the path from url = ../mock_user/DataLad-101 to url = ../mock_user/onemoredir/DataLad-101.
Below, we are using the stream editor sed for this operation.
$ sed -i 's/..\/mock_user\/DataLad-101/..\/mock_user\/onemoredir\/DataLad-101/' .git/config
This is how the file looks now:
$ cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
editor = nano
[annex]
uuid = 0b6bef5c-68c4-465f-8d32-c00ffa64dcfb
version = 5
backends = MD5E
[submodule "recordings/longnow"]
url = https://github.com/datalad-datasets/longnow-podcasts.git
active = true
[remote "roommate"]
url = ../mock_user/onemoredir/DataLad-101
fetch = +refs/heads/*:refs/remotes/roommate/*
annex-uuid = 2907ce9f-0de0-4f84-aafd-7c07a8cc3c8a
annex-ignore = false
[submodule "midterm_project"]
url = /home/me/dl-101/DataLad-101/midterm_project
active = true
[submodule "longnow"]
url = https://github.com/datalad-datasets/longnow-podcasts.git
active = true
Let’s try to update now:
$ datalad update
[INFO] Fetching updates for <Dataset path=/home/me/dl-101/DataLad-101>
update(ok): . (dataset)
Nice! We fixed it!
Therefore, if a dataset you move or rename is known to other datasets from its path, or identifies siblings with paths, make sure to adjust them in the .git/config file.
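If you would like to rehearse such an edit before touching a real configuration, the following minimal sketch applies the same substitution to a scratch copy. The temporary file and its contents are hypothetical stand-ins for .git/config. Compared to the call above, it uses | as the sed delimiter so that slashes need no escaping, escapes the dots so that they match literally, and writes to a new file instead of relying on in-place editing.

```shell
# Scratch file that imitates the relevant part of .git/config (hypothetical).
demo=$(mktemp -d)
cfg="$demo/config"
cat > "$cfg" <<'EOF'
[remote "roommate"]
    url = ../mock_user/DataLad-101
    fetch = +refs/heads/*:refs/remotes/roommate/*
EOF

# Replace the old relative path with the new one.
sed 's|\.\./mock_user/DataLad-101|../mock_user/onemoredir/DataLad-101|' "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

# Extract the url value to check the result.
new_url=$(sed -n 's/^[[:space:]]*url = //p' "$cfg")
echo "url is now: $new_url"
```

In an actual repository, git remote set-url roommate ../mock_user/onemoredir/DataLad-101 should achieve the same change without editing the file by hand.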
To clean up, we’ll undo the move of the dataset and the modification in .git/config. Because .git/config is not under version control, a git reset would not restore it, so we set the URL back explicitly.
$ cd ../mock_user && mv onemoredir/DataLad-101 .
$ rm -r onemoredir
$ cd ../DataLad-101 && git remote set-url roommate ../mock_user/DataLad-101
Getting contents out of git-annex¶
Files in your dataset can either be handled by Git or git-annex. Self-made or predefined configurations to .gitattributes, defaults, or the --to-git option to datalad save allow you to control which tool does what, down to a single-file basis. You may, however, accidentally give a file to git-annex when it was intended to be stored in Git, or you may want to get a previously annexed file into Git.
Consider that you intend to share the cropped .jpg images you created from the longnow logos. If you published your DataLad-101 dataset to GitHub or GitLab, these files would not be available to others, because annexed dataset contents cannot be published to these services.
Even though you could find a third party service of your choice and publish your dataset and the annexed data (section Beyond shared infrastructure will demonstrate how this can be done), you’re feeling lazy today. And since it is only two files, and they are quite small, you decide to store them in Git. This way, the files would be available without configuring an external data store.
To get contents out of the dataset’s annex you need to unannex them. This is done with the git-annex command git annex unannex. Let’s see how it works:
$ git annex unannex recordings/*logo_small.jpg
unannex recordings/interval_logo_small.jpg ok
unannex recordings/salt_logo_small.jpg ok
Your dataset’s history records the unannexing of the files.
$ git log -p -n 1
commit efb894252c434dca89c158c257d40c3403a73da1
Author: Elena Piscopia <elena@example.net>
Date: Thu Jan 9 10:54:33 2020 +0100
content removed from git annex
diff --git a/recordings/interval_logo_small.jpg b/recordings/interval_logo_small.jpg
deleted file mode 120000
index f4d6fd6..0000000
--- a/recordings/interval_logo_small.jpg
+++ /dev/null
@@ -1 +0,0 @@
-../.git/annex/objects/36/jF/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg/MD5E-s100877--0fea9537f9fe255d827e4401a7d539e7.jpg
\ No newline at end of file
diff --git a/recordings/salt_logo_small.jpg b/recordings/salt_logo_small.jpg
deleted file mode 120000
index 55ada0f..0000000
--- a/recordings/salt_logo_small.jpg
+++ /dev/null
@@ -1 +0,0 @@
-../.git/annex/objects/xJ/4G/MD5E-s260607--4e695af0f3e8e836fcfc55f815940059.jpg/MD5E-s260607--4e695af0f3e8e836fcfc55f815940059.jpg
\ No newline at end of file
Once files have been unannexed, they are “untracked” again, and you can save them into Git, either by adding a rule to .gitattributes, or with datalad save --to-git:
$ datalad save --to-git -m "save cropped logos to Git" recordings/*jpg
add(ok): recordings/interval_logo_small.jpg (file)
add(ok): recordings/salt_logo_small.jpg (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
save (ok: 1)
Deleting (annexed) files/directories¶
Removing files from a dataset is possible in two different ways: Either by removing the file from the current state of the repository (which Git calls the worktree) but keeping the content in the history of the dataset, or by removing content entirely from a dataset and its history.
Removing a file, but keeping content in history¶
An rm <file> or rm -rf <directory> with a subsequent datalad save will remove a file or directory, and save its removal. The file content, however, will still be in the history of the dataset, and the file can be brought back to existence by going back into the history of the dataset or reverting the removal commit:
# download a file
$ datalad download-url -m "Added flower mosaic from wikimedia" \
https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg \
--path flowers.jpg
$ ls -l flowers.jpg
[INFO] Downloading 'https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg' into '/home/me/dl-101/DataLad-101/flowers.jpg'
download_url(ok): /home/me/dl-101/DataLad-101/flowers.jpg (file)
add(ok): flowers.jpg (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
download_url (ok: 1)
save (ok: 1)
lrwxrwxrwx 1 adina adina 128 Oct 6 2013 flowers.jpg -> .git/annex/objects/7q/9Z/MD5E-s4487679--3898ef0e3497a89fa1ea74698992bf51.jpg/MD5E-s4487679--3898ef0e3497a89fa1ea74698992bf51.jpg
# removal is easy:
$ rm flowers.jpg
This will lead to a dirty dataset status:
$ datalad status
deleted: flowers.jpg (symlink)
If a removal happened by accident, a git checkout -- flowers.jpg would undo the removal at this stage. To stick with the removal and clean up the dataset state, datalad save will suffice:
$ datalad save -m "removed file again"
delete(ok): flowers.jpg (file)
save(ok): . (dataset)
action summary:
delete (ok: 1)
save (ok: 1)
This commits the deletion of the file in the dataset’s history. If this commit is reverted, the file comes back to existence:
$ git reset --hard HEAD~1
$ ls
HEAD is now at 70c6b87 Added flower mosaic from wikimedia
books
code
flowers.jpg
midterm_project
notes.txt
recordings
In other words, with an rm and subsequent datalad save, the symlink is removed, but the content is retained in the history.
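The same round trip can be rehearsed in a throwaway Git repository. This is a minimal sketch with plain Git and an ordinary text file instead of an annexed symlink; the scratch directory, the file name, and the commit identity are assumptions made for illustration only.

```shell
# Throwaway repository in a temporary directory (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo User"          # identity for this scratch repo only
git config user.email "demo@example.com"

# Track a file, then delete it and commit the deletion.
echo "flowers" > flowers.txt
git add flowers.txt
git commit -q -m "add file"
rm flowers.txt
git commit -q -a -m "remove file again"
if [ -e flowers.txt ]; then after_rm=present; else after_rm=absent; fi

# Resetting to the commit before the removal brings the file back.
git reset -q --hard HEAD~1
if [ -e flowers.txt ]; then after_reset=present; else after_reset=absent; fi

echo "after removal commit: $after_rm, after reset: $after_reset"
```

The file is absent after the removal commit and present again after the reset, because its content never left the repository’s history.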
Removing annexed content entirely¶
A different command, which removes file content entirely and irreversibly from a repository, is the datalad drop command (datalad-drop manual). One use case for this is to make a repository leaner. Think about a situation in which a very large result file is computed by default in some analysis, but is not relevant for any project, and one may want to remove it.
If an entire dataset is specified, all file content in sub-directories is dropped automatically, but for content in sub-datasets to be dropped, the -r/--recursive flag has to be included.
The command will drop file content or directory content from a dataset, but will retain a symlink for this file. By default, DataLad will not drop any content that does not have at least one verified remote copy that the content could be retrieved from again. It is possible to drop the downloaded image, because thanks to datalad download-url its original location on the web is known:
$ datalad drop flowers.jpg
drop(ok): /home/me/dl-101/DataLad-101/flowers.jpg (file) [checking https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg...]
Currently, the file content is gone, but the symlink still exists. Opening the remaining symlink will fail, but the content can be obtained easily again with datalad get:
$ datalad get flowers.jpg
get(ok): flowers.jpg (file) [from web...]
If a file has no verified remote copies, DataLad will only drop its content if the --nocheck option is specified. We will demonstrate this by generating an empty PDF file:
$ convert xc:none -page Letter a.pdf
$ datalad save -m "add empty pdf"
add(ok): a.pdf (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
DataLad will safeguard against dropping content that it cannot retrieve again:
$ datalad drop a.pdf
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
[ERROR] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/home/me/dl-101/DataLad-101/a.pdf)]
drop(error): /home/me/dl-101/DataLad-101/a.pdf (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
But with the --nocheck flag it will work:
$ datalad drop --nocheck a.pdf
drop(ok): /home/me/dl-101/DataLad-101/a.pdf (file)
Note though that this file content is irreversibly gone now, and even going back in time in the history of the dataset will not bring it back into existence.
Finally, let’s clean up:
$ git reset --hard HEAD~2
HEAD is now at c2c4282 add container and execute analysis within container
Uninstalling or deleting subdatasets¶
Depending on the exact aim, two commands are of relevance for deleting a DataLad subdataset. The softer (and not so much “deleting”) version is to uninstall a dataset with the datalad uninstall command (datalad-uninstall manual). This command can be used to uninstall any number of subdatasets. Note though that only subdatasets can be uninstalled; the command will error if given a sub-directory, a file, or a top-level dataset.
# clone a subdataset - the content is irrelevant, so why not a cloud :)
$ datalad clone -d . \
https://github.com/datalad-datasets/disneyanimation-cloud.git \
cloud
[INFO] Cloning https://github.com/datalad-datasets/disneyanimation-cloud.git [1 other candidates] into '/home/me/dl-101/DataLad-101/cloud'
[INFO] Remote origin not usable by git-annex; setting annex-ignore
add(ok): cloud (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
install(ok): cloud (dataset)
action summary:
add (ok: 2)
install (ok: 1)
save (ok: 1)
To uninstall the dataset, use
$ datalad uninstall cloud
uninstall(ok): cloud (dataset)
action summary:
drop (notneeded: 1)
uninstall (ok: 1)
Note that the subdataset is still known to the superdataset, and not completely removed. A datalad get [-n/--no-data] cloud would install the dataset again.
In case one wants to fully delete a subdataset from a dataset, the datalad remove command (datalad-remove manual) is relevant [2]. It needs a pointer to the root of the superdataset with the -d/--dataset flag, a path to the subdataset to be removed, and optionally a commit message (-m/--message) or recursive specification (-r/--recursive).
To remove a subdataset, we will install the uninstalled subdataset again, and
subsequently remove it with the datalad remove command:
$ datalad get -n cloud
# delete the subdataset
$ datalad remove -m "remove obsolete subds" -d . cloud
[INFO] Cloning https://github.com/datalad-datasets/disneyanimation-cloud.git [1 other candidates] into '/home/me/dl-101/DataLad-101/cloud'
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): /home/me/dl-101/DataLad-101/cloud (dataset) [Installed subdataset in order to get /home/me/dl-101/DataLad-101/cloud]
uninstall(ok): cloud (dataset)
remove(ok): cloud (dataset)
save(ok): . (dataset)
action summary:
drop (notneeded: 1)
remove (ok: 1)
save (ok: 1)
uninstall (ok: 1)
Note that for both commands a pointer to the current directory will not work. datalad remove . or datalad uninstall . will fail, even if the command is executed in a subdataset instead of the top-level superdataset. You need to execute the command from a higher-level directory.
Finally, after this last piece of information, let’s clean up:
$ git reset --hard HEAD~2
HEAD is now at c2c4282 add container and execute analysis within container
Deleting a superdataset¶
If for whatever reason you at one point tried to remove a DataLad dataset, whether with a GUI or the command line call rm -rf <directory>, you likely have seen permission denied errors such as
rm: cannot remove '<directory>/.git/annex/objects/Mz/M1/MD5E-s422982--2977b5c6ea32de1f98689bc42613aac7.jpg/MD5E-s422982--2977b5c6ea32de1f98689bc42613aac7.jpg': Permission denied
rm: cannot remove '<directory>/.git/annex/objects/FP/wv/MD5E-s543180--6209797211280fc0a95196b0f781311e.jpg/MD5E-s543180--6209797211280fc0a95196b0f781311e.jpg': Permission denied
[...]
This error indicates that there is write-protected content within .git that cannot be deleted. What is this write-protected content? It’s the file content stored in the object tree of git-annex. If you want, you can re-read the section on Data integrity to find out how git-annex revokes write permission for the user to protect the file content given to it. To remove a dataset with annexed content one has to regain write permissions to everything in the dataset. This is done with the chmod command:
chmod -R u+w <dataset>
This recursively (-R, i.e., throughout all files and (sub)directories) gives users (u) write permissions (+w) for the dataset.
Afterwards, rm -rf <dataset> will succeed.
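The effect of the missing write permission can be imitated without any dataset at all. The following is a minimal sketch under stated assumptions: the scratch directory and the stand-in object KEY are hypothetical, and write permission is revoked by hand to mimic the protection described above. When run as root, the first removal attempt succeeds regardless of permissions, so the sketch only relies on the final state.

```shell
# Scratch tree that imitates a protected object directory (hypothetical names).
demo=$(mktemp -d)
ds="$demo/dataset"
mkdir -p "$ds/.git/annex/objects/KEY"
echo "content" > "$ds/.git/annex/objects/KEY/KEY"

# Mimic the protection: no write permission on the file and its directory.
chmod a-w "$ds/.git/annex/objects/KEY/KEY" "$ds/.git/annex/objects/KEY"

# For regular users, this attempt fails with "Permission denied".
first_attempt=succeeded
rm -rf "$ds" 2>/dev/null || first_attempt=failed

# Regain write permission everywhere, then remove.
if [ -e "$ds" ]; then
    chmod -R u+w "$ds"
    rm -rf "$ds"
fi
if [ -e "$ds" ]; then removed=no; else removed=yes; fi

echo "first attempt: $first_attempt, removed in the end: $removed"
```

The entry cannot be unlinked while its containing directory is read-only, which is why regaining write permission on the whole tree comes first.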
However, instead of rm -rf, a faster way to remove a dataset is using datalad remove: Run datalad remove <dataset> outside of the superdataset to remove a top-level dataset with all its contents. Likely, both the --nocheck and --recursive flags are necessary to remove content that does not have verified remotes, and to traverse into subdatasets.
Be aware though that both ways to delete a dataset will irretrievably delete the dataset, its contents, and its history.
Summary¶
To sum up, file system management operations are safe and easy. Even if you are currently confused about one or two operations, worry not. The take-home message is simple: Use datalad save whenever you move or rename files. Be mindful that a datalad status can appear unintuitive or that symlinks can break if annexed files are moved, but all of these problems are solved after a datalad save command. Apart from this command, having a clean dataset status prior to doing anything is your friend as well. It will make sure that you have a neat and organized commit history, and no accidental commits of changes unrelated to your file system management operations. The only operation you should beware of is moving subdatasets around, as this can be a minefield. With all of these experiences and tips, you feel confident that you know how to handle your dataset’s files and directories well and worry-free.
Footnotes
- 1
If you want to learn more about the Git-specific concepts of worktree, staging area/index or HEAD, check out section …
- 2
This is indeed the only case in which datalad remove is relevant. For all other cases of content deletion a normal rm with a subsequent datalad save works best.