Using this change, this bug can be reproduced: https://nest.pijul.com/pijul/pijul/discussions/535
```sh
curl -o new https://raw.githubusercontent.com/vim/vim/6bb683663ad7ae9c303284c335a731a13233c6c2/runtime/spell/en.spl
curl -o old https://raw.githubusercontent.com/vim/vim/fc73515f7ba66b47705265bb8d01c6bec5df09c4/runtime/spell/en.spl

# OK
diff old new

# hangs
cargo run --release -- myers old new
```
It seems to be just a manifestation of the quadratic runtime of Myers diff when the difference between the two files is very large. I’m not sure how diff/git solve this issue.
I am trying out a fix inspired by diffutils, where runs of unique lines are discarded. Because of the high genericity of the `diffs` crate, this seems quite easy to do.
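For concreteness, here is a minimal sketch of that preprocessing idea, assuming "unique" means "absent from the other file". The `Token` type and `compress` function are illustrative names, not the actual patch:

```rust
use std::collections::HashSet;

/// Element fed to the diff algorithm after preprocessing.
#[derive(PartialEq, Eq, Hash, Clone, Debug)]
enum Token<'a> {
    /// A line that also occurs in the other file; diffed normally.
    Shared(&'a str),
    /// A run of consecutive lines that occur only in this file,
    /// collapsed into one element (they can never match anyway).
    UniqueRun { start: usize, len: usize },
}

/// Collapse runs of lines unique to `this` (w.r.t. `other`) into single tokens.
fn compress<'a>(this: &[&'a str], other: &[&'a str]) -> Vec<Token<'a>> {
    let other_set: HashSet<&str> = other.iter().copied().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < this.len() {
        if other_set.contains(this[i]) {
            out.push(Token::Shared(this[i]));
            i += 1;
        } else {
            let start = i;
            while i < this.len() && !other_set.contains(this[i]) {
                i += 1;
            }
            out.push(Token::UniqueRun { start, len: i - start });
        }
    }
    out
}

fn main() {
    let old: Vec<&str> = vec!["a", "x1", "x2", "b"];
    let new: Vec<&str> = vec!["a", "y1", "b"];
    // [Shared("a"), UniqueRun { start: 1, len: 2 }, Shared("b")]
    println!("{:?}", compress(&old, &new));
    println!("{:?}", compress(&new, &old));
}
```

The compressed sequences would then be handed to the usual Myers routine, and each `UniqueRun` token expanded back into per-line deletions and insertions afterwards.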
I implemented the optimized diffing routine. It is fast on the test case above (0.2 seconds). I added property tests in order to be confident about correctness.
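For illustration, here is the general shape such a property test could take with the `proptest` crate: generate arbitrary pairs of files and check that applying the computed edit script to the old file reproduces the new one. The `Script`, `trivial_diff` and `apply` names below are stand-ins, not the actual routines under test:

```rust
// dev-dependency: proptest = "1"; run with `cargo test`.
use proptest::prelude::*;

/// Stand-in edit script: delete the first `del` lines of `old`, prepend `ins`.
struct Script {
    del: usize,
    ins: Vec<String>,
}

/// Trivial diff: delete everything, insert all of `new`.
/// In the real tests this would be the plain and the optimized Myers routines.
fn trivial_diff(old: &[String], new: &[String]) -> Script {
    Script { del: old.len(), ins: new.to_vec() }
}

fn apply(old: &[String], script: &Script) -> Vec<String> {
    let mut out = script.ins.clone();
    out.extend_from_slice(&old[script.del..]);
    out
}

proptest! {
    // Property: whatever the inputs, applying the computed script to `old`
    // must reproduce `new` exactly.
    #[test]
    fn diff_roundtrips(
        old in proptest::collection::vec(".*", 0..50),
        new in proptest::collection::vec(".*", 0..50),
    ) {
        let script = trivial_diff(&old, &new);
        prop_assert_eq!(apply(&old, &script), new);
    }
}
```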
One issue is that the optimized routine can return worse diffs. Right now, runs of more than one unique row are treated as a single row, which makes the algorithm unable to properly penalize modifications inside such a run. There are ways to improve this: for example, only treat long runs of unique rows as a single row.
What do you think about this, @pmeunier?
EDIT: The code also needs to be made more generic. I need an extra `Hash` or `Ord` constraint on elements though, since I need to put elements into a set.
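To make the constraint concrete, here is a small sketch of the kind of bound this implies (the function name is made up; a `BTreeSet` with `T: Ord` would work equally well):

```rust
use std::collections::HashSet;
use std::hash::Hash;

// The preprocessing step has to collect the elements of one sequence into a
// set, so `T: Eq + Hash` (or `T: Ord`) is needed on top of whatever the plain
// Myers routine already requires.
fn elements_only_in_left<'a, T: Eq + Hash>(left: &'a [T], right: &[T]) -> Vec<&'a T> {
    let right_set: HashSet<&T> = right.iter().collect();
    left.iter().filter(|x| !right_set.contains(*x)).collect()
}

fn main() {
    let left = ["a", "b", "c"];
    let right = ["b"];
    assert_eq!(elements_only_in_left(&left, &right), vec![&"a", &"c"]);
}
```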
Note that this solution doesn’t solve the problem fully. It solves the case of files where many lines in a row are unique. This is super common in practice - for example when we’re replacing one large random-ish file with another.
It fails when both files alternate unique and non-unique lines frequently. It is possible to detect these cases somewhat reliably by counting the number of unique lines: if this number is too large, the algorithm is guaranteed to be slow, and we could switch to a trivial diff algorithm when that happens.
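One possible shape for that fallback check, with a made-up threshold and without deciding whether the real code should count raw unique lines or the runs left after compression:

```rust
use std::collections::HashSet;

/// Rough heuristic sketch: if too large a fraction of the lines cannot
/// possibly match, fall back to a trivial "replace everything" diff instead
/// of running Myers. The 0.3 threshold is purely illustrative.
fn should_fall_back(old: &[&str], new: &[&str]) -> bool {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let unique = old.iter().filter(|l| !new_set.contains(*l)).count()
        + new.iter().filter(|l| !old_set.contains(*l)).count();
    let total = old.len() + new.len();
    total > 0 && unique as f64 / total as f64 > 0.3
}

fn main() {
    let old = ["a", "b", "c", "d"];
    let new = ["x", "b", "y", "d"];
    println!("fall back to trivial diff: {}", should_fall_back(&old, &new));
}
```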
Nevertheless, we always end up with some pathological cases. I’m not sure how to solve the general case.
Excellent! (and sorry for not being able to comment sooner, I only got an actual internet connection two days ago).
As far as Pijul is concerned, a “random” file is likely to take less disk space with a trivial diff algorithm (i.e. replace the entire file). The downside is that while concurrent changes will still commute, such changes will always conflict, and always conflict in a trivial way, where the entire file conflicts with the other version of the entire file.
I wanted to see what the algorithm does in an intuitive way, so I added an executable for diffing two files.
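Such an executable could look roughly like this; the placeholder comparison below just pairs lines up, whereas the real binary presumably prints the edit script produced by the (optimized) Myers routine:

```rust
use std::{env, fs, process};

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <old> <new>", args[0]);
        process::exit(1);
    }
    let old = fs::read_to_string(&args[1]).expect("cannot read old file");
    let new = fs::read_to_string(&args[2]).expect("cannot read new file");
    // Placeholder: report line pairs that differ; trailing lines of the
    // longer file are ignored by this naive comparison.
    for (i, (o, n)) in old.lines().zip(new.lines()).enumerate() {
        if o != n {
            println!("line {} differs:\n- {}\n+ {}", i + 1, o, n);
        }
    }
}
```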