Using this change, this bug can be reproduced: https://nest.pijul.com/pijul/pijul/discussions/535
```sh
curl -o new https://raw.githubusercontent.com/vim/vim/6bb683663ad7ae9c303284c335a731a13233c6c2/runtime/spell/en.spl
curl -o old https://raw.githubusercontent.com/vim/vim/fc73515f7ba66b47705265bb8d01c6bec5df09c4/runtime/spell/en.spl

# OK
diff old new

# hangs
cargo run --release -- myers old new
```
It seems to be just a manifestation of the quadratic runtime of Myers diff when the difference between the two files is very large. I’m not sure how diff/git solve this issue.
I am trying out a fix inspired by diffutils, where runs of unique lines are discarded. Because of the high genericity of the `diffs` crate, this seems quite easy to do.
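For concreteness, here is a minimal sketch of that preprocessing idea, assuming "unique" means "absent from the other file". The `Token` type and `compress` function are illustrative names, not the actual patch:

```rust
use std::collections::HashSet;

/// Element fed to the diff algorithm after preprocessing.
#[derive(PartialEq, Eq, Hash, Clone, Debug)]
enum Token<'a> {
    /// A line that also occurs in the other file; diffed normally.
    Shared(&'a str),
    /// A run of consecutive lines that occur only in this file,
    /// collapsed into one element (they can never match anyway).
    UniqueRun { start: usize, len: usize },
}

/// Collapse runs of lines unique to `this` (w.r.t. `other`) into single tokens.
fn compress<'a>(this: &[&'a str], other: &[&'a str]) -> Vec<Token<'a>> {
    let other_set: HashSet<&str> = other.iter().copied().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < this.len() {
        if other_set.contains(this[i]) {
            out.push(Token::Shared(this[i]));
            i += 1;
        } else {
            let start = i;
            while i < this.len() && !other_set.contains(this[i]) {
                i += 1;
            }
            out.push(Token::UniqueRun { start, len: i - start });
        }
    }
    out
}

fn main() {
    let old: Vec<&str> = vec!["a", "x1", "x2", "b"];
    let new: Vec<&str> = vec!["a", "y1", "b"];
    // [Shared("a"), UniqueRun { start: 1, len: 2 }, Shared("b")]
    println!("{:?}", compress(&old, &new));
    println!("{:?}", compress(&new, &old));
}
```

The compressed sequences would then be handed to the usual Myers routine, and each `UniqueRun` token expanded back into per-line deletions and insertions afterwards.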
I implemented the optimized diffing routine. It is fast on the test case above (0.2 seconds). I added property tests in order to be confident about correctness.
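For illustration, here is the general shape such a property test could take with the `proptest` crate: generate arbitrary pairs of files and check that applying the computed edit script to the old file reproduces the new one. The `Script`, `trivial_diff` and `apply` names below are stand-ins, not the actual routines under test:

```rust
// dev-dependency: proptest = "1"; run with `cargo test`.
use proptest::prelude::*;

/// Stand-in edit script: delete the first `del` lines of `old`, prepend `ins`.
struct Script {
    del: usize,
    ins: Vec<String>,
}

/// Trivial diff: delete everything, insert all of `new`.
/// In the real tests this would be the plain and the optimized Myers routines.
fn trivial_diff(old: &[String], new: &[String]) -> Script {
    Script { del: old.len(), ins: new.to_vec() }
}

fn apply(old: &[String], script: &Script) -> Vec<String> {
    let mut out = script.ins.clone();
    out.extend_from_slice(&old[script.del..]);
    out
}

proptest! {
    // Property: whatever the inputs, applying the computed script to `old`
    // must reproduce `new` exactly.
    #[test]
    fn diff_roundtrips(
        old in proptest::collection::vec(".*", 0..50),
        new in proptest::collection::vec(".*", 0..50),
    ) {
        let script = trivial_diff(&old, &new);
        prop_assert_eq!(apply(&old, &script), new);
    }
}
```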
One issue is that the optimized routine can return worse diffs. Right now, runs of more than one unique row are treated as a single row, which makes the algorithm unable to properly penalize modifications inside such a run. There are ways to improve this: for example, only treat long runs of unique rows as a single row.
What do you think about this, @pmeunier?
EDIT: The code also needs to be made more generic. I need an extra `Hash` or `Ord` constraint on elements though, since I need to put elements into a set.
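To make the constraint concrete, here is a small sketch of the kind of bound this implies (the function name is made up; a `BTreeSet` with `T: Ord` would work equally well):

```rust
use std::collections::HashSet;
use std::hash::Hash;

// The preprocessing step has to collect the elements of one sequence into a
// set, so `T: Eq + Hash` (or `T: Ord`) is needed on top of whatever the plain
// Myers routine already requires.
fn elements_only_in_left<'a, T: Eq + Hash>(left: &'a [T], right: &[T]) -> Vec<&'a T> {
    let right_set: HashSet<&T> = right.iter().collect();
    left.iter().filter(|x| !right_set.contains(*x)).collect()
}

fn main() {
    let left = ["a", "b", "c"];
    let right = ["b"];
    assert_eq!(elements_only_in_left(&left, &right), vec![&"a", &"c"]);
}
```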
Note that this solution doesn’t solve the problem fully. It solves the case of files where many lines in a row are unique. This is super common in practice - for example when we’re replacing one large random-ish file with another.
It fails when both files alternate unique and non-unique lines frequently. It is possible to detect these cases somewhat reliably by counting the number of unique lines: if this number is too large, the algorithm is guaranteed to be slow, and we could switch to a trivial diff algorithm when that happens.
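One possible shape for that fallback check, with a made-up threshold and without deciding whether the real code should count raw unique lines or the runs left after compression:

```rust
use std::collections::HashSet;

/// Rough heuristic sketch: if too large a fraction of the lines cannot
/// possibly match, fall back to a trivial "replace everything" diff instead
/// of running Myers. The 0.3 threshold is purely illustrative.
fn should_fall_back(old: &[&str], new: &[&str]) -> bool {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let unique = old.iter().filter(|l| !new_set.contains(*l)).count()
        + new.iter().filter(|l| !old_set.contains(*l)).count();
    let total = old.len() + new.len();
    total > 0 && unique as f64 / total as f64 > 0.3
}

fn main() {
    let old = ["a", "b", "c", "d"];
    let new = ["x", "b", "y", "d"];
    println!("fall back to trivial diff: {}", should_fall_back(&old, &new));
}
```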
Nevertheless, we always end up with some pathological cases. I’m not sure how to solve the general case.
Excellent! (and sorry for not being able to comment sooner, I only got an actual internet connection two days ago).
As far as Pijul is concerned, a “random” file is likely to take less disk space with a trivial diff algorithm (i.e. replace the entire file). The downside is that while concurrent changes will still commute, such changes will always conflict, and always conflict in a trivial way, where the entire file conflicts with the other version of the entire file.
I wanted to see what the algorithm does in an intuitive way, so I added an executable for diffing two files.
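Such an executable could look roughly like this; the placeholder comparison below just pairs lines up, whereas the real binary presumably prints the edit script produced by the (optimized) Myers routine:

```rust
use std::{env, fs, process};

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <old> <new>", args[0]);
        process::exit(1);
    }
    let old = fs::read_to_string(&args[1]).expect("cannot read old file");
    let new = fs::read_to_string(&args[2]).expect("cannot read new file");
    // Placeholder: report line pairs that differ; trailing lines of the
    // longer file are ignored by this naive comparison.
    for (i, (o, n)) in old.lines().zip(new.lines()).enumerate() {
        if o != n {
            println!("line {} differs:\n- {}\n+ {}", i + 1, o, n);
        }
    }
}
```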