I think this would be really cool. I just looked at the possibility of using inodes, but this seems a little bit fragile. Git has no notion of files, but this is how Darcs does it. I don’t trust their hypothesis that “it is impossible to get the same inode twice”; I believe this strongly depends on the filesystem.
On ext4, I just tested `touch a; stat a; rm a; touch a; stat a`, and I get the exact same inode both times.
A cool way to do this for files is to hard-link them somewhere in order to “reserve” the inodes if they’re deleted, but this doesn’t work for directories, since you can’t hard-link directories.
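A minimal sketch of the hard-link reservation idea, assuming a hypothetical `reserved` directory on the same filesystem as the working copy (the name `reserve_inode` and the layout are made up for illustration, not anything Pijul does):

```python
import os
import tempfile


def reserve_inode(path, reserve_dir):
    """Hard-link `path` into `reserve_dir`, named after its inode number.

    While the extra link exists, the inode stays allocated even if `path`
    is deleted, so the filesystem cannot give that number to a new file.
    """
    inode = os.stat(path).st_ino
    keeper = os.path.join(reserve_dir, str(inode))
    if not os.path.exists(keeper):
        # os.link fails for directories and across filesystem boundaries.
        os.link(path, keeper)
    return inode


def demo():
    """Repeat the create/delete/create test, but with the inode reserved."""
    with tempfile.TemporaryDirectory() as root:
        reserve_dir = os.path.join(root, "reserved")
        os.mkdir(reserve_dir)
        a = os.path.join(root, "a")

        open(a, "w").close()
        first = reserve_inode(a, reserve_dir)
        os.remove(a)

        open(a, "w").close()
        second = os.stat(a).st_ino
        return first, second


if __name__ == "__main__":
    print(demo())
```

With the link in place, the two inode numbers returned by `demo()` should differ, since the first inode is still in use by the reserved link.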
Definitely FS-dependent, because on ZFS the same test results in two different inodes.
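For anyone who wants to repeat the `touch`/`stat`/`rm` test on their own filesystem, here is a small sketch of the same experiment. The function name `inode_reused` is just for illustration, and the result depends entirely on the filesystem backing the directory you pass in:

```python
import os
import tempfile


def inode_reused(directory):
    """Create, delete, and recreate a file; report whether the inode repeats."""
    path = os.path.join(directory, "a")

    open(path, "w").close()
    first = os.stat(path).st_ino
    os.remove(path)

    open(path, "w").close()
    second = os.stat(path).st_ino
    os.remove(path)

    return first == second


if __name__ == "__main__":
    # Pass dir=... to TemporaryDirectory to probe a specific mount point.
    with tempfile.TemporaryDirectory() as scratch:
        print("inode reused:", inode_reused(scratch))
```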
Just spitballing:
Assuming we do the hardlink-to-reserve trick, maybe for directory detection, just compare the hardlinked file inodes of the directories? And then use some sort of similarity metric? This seems expensive (both computationally and implementation-wise), though.
Another idea would be to:
Seems like it would be much safer than using inodes. This will only match single moves though, not moves with subsequent edits. It might be possible to use a similarity-preserving hash function, but I think that would cause more trouble than it’s worth.
@cole-h: sounds interesting, and not necessarily expensive: if you sort the list of inodes, comparing the contents of two directories doesn’t sound super expensive.
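A rough sketch of the sorted-inode comparison, assuming we already have the inode list recorded for the old directory and can scan the candidate new one. The Jaccard ratio is my own choice of similarity metric here; nothing in the thread fixes one:

```python
import os


def file_inodes(directory):
    """Sorted, de-duplicated inode numbers of the regular files in `directory`."""
    with os.scandir(directory) as entries:
        return sorted({e.inode() for e in entries
                       if e.is_file(follow_symlinks=False)})


def overlap(a, b):
    """Jaccard similarity of two sorted lists of distinct inodes.

    One merge pass over both lists, so the cost is linear once sorted.
    """
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    union = len(a) + len(b) - shared
    return shared / union if union else 1.0
```

A directory would then be treated as moved when `overlap(old_inodes, file_inodes(new_dir))` is above some threshold, which would need tuning.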
@dnaq: maybe hashes of small enough windows (starting and stopping at lines boundaries)? We would have multiple hashes per file, but that’s probably ok.
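To make the windowed-hash idea concrete, here is a sketch that hashes every run of `window` consecutive lines and compares the resulting sets. Sliding (overlapping) windows, the window size of 4, SHA-256, and the Jaccard ratio are all assumptions on my part:

```python
import hashlib


def window_hashes(text, window=4):
    """Set of hashes, one per run of `window` consecutive lines.

    Windows start and stop at line boundaries. A file shorter than
    `window` lines gets a single hash of its whole content.
    """
    lines = text.splitlines()
    if len(lines) <= window:
        chunks = ["\n".join(lines)]
    else:
        chunks = ["\n".join(lines[i:i + window])
                  for i in range(len(lines) - window + 1)]
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks}


def similarity(old, new, window=4):
    """Fraction of window hashes shared between two file contents."""
    a = window_hashes(old, window)
    b = window_hashes(new, window)
    return len(a & b) / len(a | b)
```

An unedited move scores 1.0, and a small edit only invalidates the windows that touch the changed lines, so a moved-then-edited file still scores well above an unrelated one.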
Another idea would be that if you want that feature, you need to enable a “Pijul daemon” in the repository (which may even be implemented in bash), which watches the filesystem changes and performs the corresponding moves. I don’t know if that works on non-Linux systems, but it would work 100% of the time, at a tiny implementation cost.
The daemon might be a good workaround.
What I’m mostly afraid of with using similarity hashing functions is that they might pick up on changes that should be semantically unrelated. The risk of that might be low, since we should only look at deleted and added files, but it might still cause some surprises for the user.
On the other hand just using hashes of the whole file would never lead to surprising behaviour, and would still be strictly better than not tracking moves automatically at all.
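A sketch of the whole-file variant, assuming we can get the last recorded content of each deleted file and the current content of each added file as bytes (how Pijul would actually supply these is not something I know). To stay unsurprising, it only pairs files when the match is unambiguous:

```python
import hashlib
from collections import defaultdict


def digest(data):
    """Hex SHA-256 of a file's full content."""
    return hashlib.sha256(data).hexdigest()


def detect_moves(deleted, added):
    """Pair deleted and added paths whose contents are byte-identical.

    `deleted` and `added` map path -> bytes. A hash shared by more than
    one deleted or more than one added file is skipped as ambiguous.
    """
    deleted_by_hash = defaultdict(list)
    for path, data in deleted.items():
        deleted_by_hash[digest(data)].append(path)

    added_by_hash = defaultdict(list)
    for path, data in added.items():
        added_by_hash[digest(data)].append(path)

    moves = []
    for h, sources in deleted_by_hash.items():
        targets = added_by_hash.get(h, [])
        if len(sources) == 1 and len(targets) == 1:
            moves.append((sources[0], targets[0]))
    return sorted(moves)
```

As noted above, this only catches pure moves; any edit after the move changes the hash and the file falls back to "deleted" plus "added".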
The way Git determines whether a file has been moved or renamed is by checking whether treating a newly added file as a modified version of a file deleted in the same commit results in a smaller change set. This is because the change set is what matters most to Git, not the accuracy of how those changes are presented.
I would think that for pijul what matters most is whether or not a ‘new’ file should depend on the change history of a ‘deleted’ file. This will in most cases mirror the actual file system changes so automatic detection would make things easier in most cases. But if automatic detection of moves is implemented, then ways should be added to correct the file history when it does not work as the user considers correct. This means both a way to inform pijul of any file moves it missed and a means of breaking the connection if the user does not want a file to depend on the history of a previous one for some reason.
FYI Mercurial has extensive copy-tracing support (some of the work for a new, better, algorithm is actually still underway), so there might be some overlap in design.
Breezy/Bazaar have `rm/mv --after`, which is very useful. I like being able to manage my files in the most intuitive way, and then sort out the VCS afterwards, as this is less stressful.
Does the idea of suggestions make any sense for adding? E.g. pijul says “Suggestion: file `src` (missing) and `dst` (untracked) have the same checksum - has `src` been renamed to `dst`?” at the end of a status output. This way, no surprises by making assumptions.
As it stands now, for a file to show as “moved”, it must be `pijul mv`’d. Otherwise, by simply running `mv src dst`, it will show up as “deleted” (`src`) and “added” (`dst`). I don’t know if this is antithetical to pijul (using some kind of metric to determine if a file was moved), but it would be nice to have (would shorten diffs where these `mv`s happen).