pijul/pijul - Discussion #40 - Detect moved files

#40 Detect moved files

Opened by cole-h on November 12, 2020 (tag: Feature request)

cole-h on November 12, 2020

As it stands now, a file only shows up as “moved” if it was moved with pijul mv. After a plain mv src dst, it shows up as “deleted” (src) and “added” (dst) instead. I don’t know whether using some kind of metric to determine if a file was moved is antithetical to pijul, but it would be nice to have, since it would shorten diffs where these moves happen.

pmeunier on November 30, 2020

I think this would be really cool. I just looked at the possibility of using inodes, but this seems a little bit fragile. Git has no notion of files, but inodes are how Darcs does it. I don’t trust their hypothesis that “it is impossible to get the same inode twice”; I believe this strongly depends on the filesystem.

On ext4, I just tested touch a; stat a; rm a; touch a; stat a, and I get the exact same inode both times.
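
For anyone who wants to repeat that test on their own filesystem, here is a minimal Python sketch of the same sequence. The helper name inode_after_recreate is made up for illustration, and the result depends on the filesystem, so no particular output is implied.

```python
import os
import tempfile


def inode_after_recreate(directory: str, name: str = "a") -> tuple[int, int]:
    """Create a file, record its inode, delete it, recreate it, record the inode again."""
    path = os.path.join(directory, name)

    # touch a; stat a
    with open(path, "w"):
        pass
    first = os.stat(path).st_ino

    # rm a; touch a; stat a
    os.remove(path)
    with open(path, "w"):
        pass
    second = os.stat(path).st_ino

    return first, second


if __name__ == "__main__":
    # Note: the temporary directory may live on a different filesystem
    # than your repository; pass another directory to test that one.
    with tempfile.TemporaryDirectory() as tmp:
        first, second = inode_after_recreate(tmp)
        print(first, second, "reused" if first == second else "not reused")
```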

A cool way to make this work for files is to hard-link them somewhere, which “reserves” their inodes even if the files are deleted. This doesn’t work for directories, though, since you can’t hard-link directories.
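
A minimal sketch of the reservation idea, assuming a reserve directory on the same filesystem as the tracked files. The function name reserve_inode and the choice to name each link after its inode number are illustrative assumptions, not anything Pijul does.

```python
import os
import tempfile


def reserve_inode(path: str, reserve_dir: str) -> str:
    """Hard-link `path` into `reserve_dir` so its inode stays allocated
    even after `path` itself is deleted. Returns the link's path."""
    inode = os.stat(path).st_ino
    # Hypothetical naming scheme: one link per inode, named by its number.
    link_path = os.path.join(reserve_dir, str(inode))
    if not os.path.exists(link_path):
        os.link(path, link_path)  # fails across filesystems
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        reserve = os.path.join(tmp, "reserve")
        os.mkdir(reserve)
        path = os.path.join(tmp, "a")

        open(path, "w").close()
        old = os.stat(path).st_ino
        reserve_inode(path, reserve)

        # rm a; touch a: the old inode is still held by the link,
        # so the new file has to get a different one.
        os.remove(path)
        open(path, "w").close()
        new = os.stat(path).st_ino
        print("inode reused:", old == new)
```

The cost of this design is that deleted file contents stay on disk until the reserved links are cleaned up.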

cole-h on November 30, 2020

Definitely FS-dependent, because on ZFS the same test results in two different inodes.

cole-h on November 30, 2020

Just spitballing:

Assuming we do the hardlink-to-reserve trick, maybe directory moves could be detected by comparing the inodes of the hard-linked files inside the directories, and then applying some sort of similarity metric? This seems expensive (both computationally and implementation-wise), though.

dnaq on December 3, 2020

Another idea would be to:

For each deleted file in the repo: hash the file contents before deletion.
For each added file in the repo: hash the file contents.
Compare the hashes and see if any of them match.

Seems like it would be much safer than using inodes. This will only match pure moves though, not moves followed by edits. It might be possible to use a similarity-preserving hash function, but I think that would cause more trouble than it’s worth.
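
The three steps above can be sketched as follows. This is a minimal Python illustration, assuming the caller can supply the last-known contents of each deleted file and the contents of each added file as bytes. The names content_hash and match_moves, and the choice of SHA-256, are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Hash of the whole file contents (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(data).hexdigest()


def match_moves(
    deleted: dict[str, bytes], added: dict[str, bytes]
) -> list[tuple[str, str]]:
    """Pair each deleted path with an added path whose contents are identical."""
    # Step 1: hash every deleted file.
    by_hash = defaultdict(list)
    for path in sorted(deleted):
        by_hash[content_hash(deleted[path])].append(path)

    # Steps 2 and 3: hash every added file and look for a matching hash.
    moves = []
    for path in sorted(added):
        candidates = by_hash.get(content_hash(added[path]))
        if candidates:
            # Consume the candidate so a deleted file is matched at most once.
            moves.append((candidates.pop(0), path))
    return moves


if __name__ == "__main__":
    deleted = {"src/a.rs": b"fn main() {}\n", "README": b"hello\n"}
    added = {"src/b.rs": b"fn main() {}\n", "NOTES": b"something else\n"}
    print(match_moves(deleted, added))
```

When several deleted files share the same contents, the pairing is arbitrary (here: alphabetical), which is one place where a user-facing confirmation could help.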

pmeunier on December 3, 2020

@cole-h: sounds interesting, and not necessarily expensive: if you sort the list of inodes, comparing the contents of two directories doesn’t sound super expensive.
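
A rough sketch of the sorted-inode comparison, assuming the inode list of the old directory was recorded while it still existed. The function names and the "fraction of shared inodes" metric are assumptions for illustration; the thread does not fix a particular metric.

```python
import os


def sorted_inodes(directory: str) -> list[int]:
    """Sorted inode numbers of the regular files directly inside `directory`."""
    with os.scandir(directory) as entries:
        return sorted(
            entry.inode()
            for entry in entries
            if entry.is_file(follow_symlinks=False)
        )


def shared_fraction(old: list[int], new: list[int]) -> float:
    """Fraction of the inodes in `old` that also appear in `new`.

    Both lists must be sorted; the comparison is a single merge pass,
    so it is linear in the number of files once the lists are sorted.
    """
    if not old:
        return 0.0
    i = j = shared = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            shared += 1
            i += 1
            j += 1
        elif old[i] < new[j]:
            i += 1
        else:
            j += 1
    return shared / len(old)
```

A directory could then be treated as moved when the fraction is above some threshold, which would have to be chosen by experiment.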

@dnaq: maybe hashes of small enough windows (starting and stopping at line boundaries)? We would have multiple hashes per file, but that’s probably ok.
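
One way to read the window idea, as a minimal Python sketch: hash every run of a few consecutive lines and compare the resulting sets. The sliding (overlapping) windows, the window size of 4 lines, and the Jaccard-style score are all assumptions made for this example, as are the function names.

```python
import hashlib


def window_hashes(text: str, window: int = 4) -> set[str]:
    """Hashes of every run of `window` consecutive lines in `text`.

    Windows start and stop at line boundaries. Files shorter than one
    window are hashed as a single chunk.
    """
    lines = text.splitlines()
    if len(lines) <= window:
        chunks = ["\n".join(lines)]
    else:
        chunks = [
            "\n".join(lines[i : i + window])
            for i in range(len(lines) - window + 1)
        ]
    return {hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks}


def similarity(a: str, b: str, window: int = 4) -> float:
    """Shared window hashes divided by all window hashes (0.0 to 1.0)."""
    hashes_a = window_hashes(a, window)
    hashes_b = window_hashes(b, window)
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)
```

A moved file with a small edit keeps most of its windows, so it scores close to 1.0, while two files that share no run of four lines score 0.0.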

Another idea: if you want that feature, you enable a “Pijul daemon” in the repository (which may even be implemented in bash), which watches filesystem changes and performs the corresponding moves. I don’t know if that works on non-Linux systems, but it would work 100% of the time, at a tiny implementation cost.
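
As a portable stand-in for such a daemon, the sketch below polls the working tree instead of subscribing to filesystem events: it takes two snapshots and reports files whose inode is unchanged but whose path differs. This is a swapped-in technique, not the event-watching daemon described above, and the names snapshot and detect_moves are hypothetical.

```python
import os


def snapshot(root: str) -> dict[int, str]:
    """Map inode number -> path (relative to `root`) for every file under `root`.

    Hard-linked files share an inode, so only one of their paths is kept.
    A real tool would also skip the repository's own metadata directory.
    """
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            seen[os.lstat(full).st_ino] = os.path.relpath(full, root)
    return seen


def detect_moves(
    before: dict[int, str], after: dict[int, str]
) -> list[tuple[str, str]]:
    """Pairs (old_path, new_path) for inodes present in both snapshots
    under different paths.

    Caveat: if a file is deleted and its inode is reused by a new file
    between two polls (as in the ext4 test earlier in the thread), this
    reports a move that never happened.
    """
    return sorted(
        (old_path, after[inode])
        for inode, old_path in before.items()
        if inode in after and after[inode] != old_path
    )
```

A polling loop would call snapshot periodically and pass each detected pair to pijul mv. An event-based watcher avoids the inode-reuse caveat because it sees the rename itself.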

dnaq on December 3, 2020

The daemon might be a good workaround.

What I’m mostly afraid of with similarity hashing functions is that they might match changes that are semantically unrelated. The risk of that might be low, since we would only look at deleted and added files, but it might still cause some surprises for the user.

On the other hand, just using hashes of the whole file would never lead to surprising behaviour, and would still be strictly better than not tracking moves automatically at all.

Skia on December 6, 2020

@pmeunier

Git decides that a file has been moved or renamed when treating a newly added file as a modified version of a file deleted in the same commit results in a smaller change set. That is because the change set is what matters most to Git, not the accuracy of how those changes are presented.
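
To make that criterion concrete, here is a small Python sketch that compares the size of "delete one file, add another" with the size of "modify one file into the other". It illustrates the idea as described in this comment only; it does not reproduce Git's actual rename detection. The use of difflib, the function names, and the max_ratio threshold are assumptions.

```python
import difflib


def changed_lines(old: list[str], new: list[str]) -> int:
    """Lines removed plus lines added when turning `old` into `new`."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += (i2 - i1) + (j2 - j1)
    return total


def looks_like_move(old_text: str, new_text: str, max_ratio: float = 0.5) -> bool:
    """True if recording the pair as a modification is much smaller than
    recording it as a full deletion plus a full addition.

    `max_ratio` is an arbitrary cut-off: without one, a single shared
    line would already make the modification "smaller".
    """
    old = old_text.splitlines()
    new = new_text.splitlines()
    as_delete_and_add = len(old) + len(new)
    as_modification = changed_lines(old, new)
    return as_modification < max_ratio * as_delete_and_add
```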

I would think that for pijul what matters most is whether a ‘new’ file should depend on the change history of a ‘deleted’ file. This will usually mirror the actual filesystem changes, so automatic detection would make things easier in most cases. But if automatic detection of moves is implemented, there should also be ways to correct the file history when the result is not what the user considers correct: both a way to inform pijul of any file moves it missed, and a means of breaking the connection if the user does not want a file to depend on the history of a previous one for some reason.

Alphare on May 19, 2021

FYI, Mercurial has extensive copy-tracing support (some of the work on a new, better algorithm is actually still underway), so there might be some overlap in design.

pmeunier added tag Feature request on December 7, 2021

stellarpower on August 15, 2023

Breezy/Bazaar have rm/mv --after, which is very useful. I like being able to manage my files in the most intuitive way, and then sort out the VCS afterwards, as this is less stressful.

stellarpower on September 29, 2023

Would it make sense to add suggestions? E.g. pijul says “Suggestion: file src (missing) and dst (untracked) have the same checksum - has src been renamed to dst?” at the end of a status output. This way, there are no surprises caused by pijul making assumptions.