pijul/pijul - Discussion #530 - Refactoriong text

#530 Refactoriong text_change.rs

Closed on November 27, 2021

potocpav on September 9, 2021

While investigating :528, I explored how text_change.rs works. The whole setup seems extremely fragile to me.

The module basically writes hunks to a text format, and reads them from the same format. These two operations together must not alter the meaning of the change, otherwise stuff breaks. But this is very hard to achieve without bugs in corner cases.

As evidence, I found out that file names with Unicode characters break (:528). @pmeunier fixed this, but this introduced a bug (double quotes in Edit hunks), and didn’t address several other bugs (" and \n in file names). Moreover, I’m quite sure I encountered similar bugs in the past. It is extremely hard to get everything right here, and impossible to create an exhaustive test-suite.

Much of the trouble seems to be caused by the fact that text_change.rs module does several things at the same time. It not only reads/writes the text format, but also constructs quite complex hunk data structures. I think we can gain a lot by separating these two steps.

I propose that we create a new data-structure that tightly represents everything we need to show in the text format. Let’s call it PrintableHunk for now. This would separate our write-read roundtrip into four stages f, g, g’, f’:

      f                 g             
Hunk --> PrintableHunk --> String



        g'                f'
String --> PrintableHunk --> Hunk

Now, if we could somehow ensure that the composition (g' ∘ g) is identity, we only have to write f and f' correctly. I think this is actually not that difficult, since both these functions deal only with Rust data-types, not serialization.

Now, how do I propose we write the correct g and g' functions? First, we need to use a parser library instead of reading lines + using regular expressions - nom is probably a good choice. Second, we can check that (g' ∘ g = identity) using the quickcheck crate by checking that ∀h. g'(g(h)) = h. Note that this can hold only if no two PrintableHunks serialize to the same String - that’s why we need the new type.

I think these changes can make this part of pijul much more dependable and easier to understand and modify. What do you think?

I started writing a proof of concept, but it turned out to be quite a bit of work, so I want to discuss the approach first.

pmeunier on September 9, 2021

I think this is very cool, and am looking forward to reviewing these changes!

pmeunier on September 9, 2021

Feel free to push PoC patches if you need to discuss the implementation.

potocpav on September 9, 2021

That’s great to hear! I can see one potential problem: performance. We may end up with some extra copies, etc. I don’t know how much of a problem that would be. But I think I might as well try and we will see.

pmeunier on September 9, 2021

I fixed the bug in the serialisation code btw.

Also, after reading that code again, I’d like to see places where you think using a parser library would be much better. I don’t even know if the escaping issue could be fixed by using a parser, since this is typically a tokenizer duty, which is usually performed via regular expressions.

Outside of the escaping issue (which is definitely a bug, but no parser library would have avoided that one), I believe the parsing done by Hunk::read is very close to what we would write with a parser combinator library, although this one has a slightly more explicit backtracking.

potocpav on September 10, 2021

I find it easier to create a bug-free parser with a parser combinator library than by manually reading lines and matching regexps. I think the code in Hunk::read is hard to follow, with control flow jumping all over the place. But that may be just me, and you can disregard my opinion.

The more important change I proposed is separating out parsing/printing, and checking its round-trip validity on random inputs.

I implemented this approach for the FileAddition hunk, and I was already able to find some new bugs using quickcheck: some files (depending on contents) can’t be properly recorded. This is probably caused by the fact that the guessed encoding doesn’t guarantee that the byte sequence is actually valid with given encoding.

echo /v8= | base64 -d > win # bytes [254, 255]
echo j6vW | base64 -d > jap # bytes [143, 171, 214]
pj init
pj add win jap
pj rec -m a
touch jap win
pj diff # error: is not empty

And there is another bug with binary files: BASE64 representation seems to never get decoded. This does not manifest in practice though, since no file I’ve been able to produce ends up with “binary” encoding.

Instead of fixing these bugs as they are discovered, we can get rid of whole class of bugs with the proposed approach.

I’m not sure if we should even fix these bugs now, since the fixes would conflict with my changes.

potocpav on September 12, 2021

To answer your points more directly:

I don’t even know if the escaping issue could be fixed by using a parser, since this is typically a tokenizer duty, which is usually performed via regular expressions.

I’m using parser all the way down, there is no tokenizer. This is AFAIK normal with parser combinators.

I believe the parsing done by Hunk::read is very close to what we would write with a parser combinator library, although this one has a slightly more explicit backtracking.

One of my main motivations was improving the code structure of Hunk::read. Since now Hunk::read does line-by-line parsing, line context must be passed to the function using mutable variables. After each line, control flow goes out of Hunk::read, and then back inside again. This makes it very hard to follow what happens there. My approach is to reify the Hunk data structures, and do the (multi-line) parsing outside Hunk::read. I can then pass each hunk as a whole into Hunk::read and process it without jumping around. Control flow is made much easier to follow, and the state-management complexity disappears.

pmeunier on September 12, 2021

Good points, I was already convinced before, but now we also have some benchmarks to run afterwards ;-)

Since you’re working in Libpijul, note that there’s a CLA allowing me to release Libpijul under a different license in the future, in addition to the current GPL2. My current plan is to release it under an MIT-style license when it is ready. For this reason, your patches have to depend on the patch that introduced the CLA (even though this isn’t strictly required, it allows this kind of agreement to work independent from the Nest). In order to do that, first read the CLA, and then add the following line in the dependencies section of your patches:

[] IUH7IMWES3KQTHVWA5UNHAO7QWVCC5PQJ6VLK3RC3T4F2MS74P3AC

For the base64 stuff, this is meant for “binary” files, for example image files. Try using ImageMagick on an image that is in the repository, and then record, for example.

Since these files are usually large, I’m not sure we really want the base64 stuff. The format could be changed to just +b <could not print binary data> or -b <could not print binary data>. The extra info before these lines (B->BD, etc) is enough to recover the necessary information.

potocpav on September 12, 2021

I have an initial draft of the changes ready. I just need to upload somehow (see my comment in #531), and write some notes about it.

Regarding binary files: I agree, base64 output seems mostly useless for the user. It makes sense to not print it.

What I meant by BASE64 not being used: binary files are detected for me as having windows-125* encoding:

1. Replacement in image.png:3 2.41 "windows-1252"
B:BD 2.50 -> 2.50:59705/2
  up 2.50, new 0:57006, down 2.59705
- ^@^@^@^MIHDR^@^@^A,^@^@^@^F^@^@^@Z+Ó^O^@^@ ^@IDATx^A„¼^ExUÇÚö¿b^D×ê©Ûé9mÏ©œS/m
i^K…$Á^BDH îîF^LKÐ"-RZ\Jñ^@qA‚^E)R4$Ä·K’ßÿšÙ„·ß{Î÷ýs]“µ×¬µ÷ž5óÌ=÷s?Ïlené
t2ËfU>“œJwæU{²à„^W^KOz³èÔl–œõ“eé9^?–^O`ù…@YVÔ^E±¢.eçýYvÎ
‡åµ~,?åË²“>,99›^E§=È>3¤S.Ì»áMâÙ©¤œó úèt"öÎ ³(œyÇÙîÅÜª(RküH:ëNÄi^G\^KÞ# z
^D^A^U_3kïp"Ž9‘RéEB±ûÇ‘TãJL•#±'^H+ž€Ï1ÄWÏ ªl
(...)

potocpav added a change on September 13, 2021

Re-implement change printing and parsing by FHRXP5Jnb2MWLDrPrnLnkN2ryWcGCo6CRr1dXR9FW2YA, created on September 12, 2021

5FI6SBEZ6RERERUAIWQJVAY66BEZ7YQOYOUNK2DPOLRGS2X326RAC

main

potocpav on September 13, 2021

I implemented the proposed changes. I wanted to create a small-ish example, but found out that it requires changes to overall code structure, so I did the whole thing. Some notes about the changes:

Hunks are printed by Hunk::write. This is very similar to before, just producing PrintableHunk as an intermediate step instead of writing directly.
Hunks are read back by Hunk::from_printable. It is analogous to the old Hunk::read, but now it creates hunks in one go.
Printing is done in printable.hs, parsing in parse.rs. These are checked against one another, so they can be modified quite confidently.

Now for the bad parts:

I am completely unfamiliar with libpijul data structures. I tried to transcribe the old logic 1-to-1, but I probably wasn’t 100% successful. ~~There are some failing tests, even though~~ the commands I tried seemed to work.
- For example, what are offsets good for? I have no idea.
- ~~I ended up with two functions, from_printable_pos_vec_offsets, and from_printable_pos_vec, where only the former one uses offsets. Should this be the case?~~
Quickcheck tests kept failing because of bad encoding detection. I fixed that by validating the chosen encoding via a round-trip in EncodingDetector::get_valid.
Parsing errors are bad, and there could be better robustness to user edits. These can be improved quite easily.
Parsing code is copy-paste heavy. Much can be probably pulled into functions, such as all the space handling.
I cloned everything to avoid fighting the borrow checker. This could be improved.
Line count increased considerably. This is in part due to the new data structures, but also because of testing boilerplate, and because cargo fmt loves creating one-word lines.

potocpav on September 13, 2021

I tested that the new code produces the exact same change as the old code by adapting the test change::text. I needed to fix one inaccuracy in the new code. This gave me some confidence that the code is actually correct.

I just copied the old text_changes.rs file, and renamed functions to avoid name collisions. Do you want this testing code, or should I just push the fixes?

pmeunier on September 14, 2021

Excellent! I’m looking forward to seeing that change.

potocpav added a change on September 14, 2021

Test new changes against the old code. Fix several small bugs. by FHRXP5Jnb2MWLDrPrnLnkN2ryWcGCo6CRr1dXR9FW2YA,

UN2M77YUIQZMPH5CQARPUUK2Q66RHN5J6BIIQQFEJ35DSFSKFFPQC

main

potocpav on September 14, 2021

I used the patch in #532 to make the tests compile

pmeunier on October 19, 2021

Hi! I improved the binary file detection (#I3HDN5CSJMZKLRDGNFCT64UK3ATHL45M3STDVH4LYN7VI6UVJORQC), if you care to test again. I’ve been super busy with other stuff, but I hope to be able to get back to Pijul very soon (meaning “now”).

pmeunier on November 23, 2021

Alright, I just merged this locally, along with a bunch of other changes (I did modify the hunks a little bit). This is excellent work, thanks!

Your assumptions “Should we panic?” are true, we should indeed panic in these cases, they never happen.

There are two other TODOs I’m a bit worried about:

                // TODO: pass in the encoding

We could fix these, but I’m a bit scared that this could break backwards-compatibility. One way to fix that is to add them onto a shopping list for changes, and only break the format in the future when there are enough of those.

In text_changes.rs, I don’t understand the following comment of yours:

        // TODO: get rid of this whole default change if possible

I believe it’s one way to initialise the structures we are editing, do you mean we should have a Default impl instead?

pmeunier on November 23, 2021

(and sorry for the long delay, I hope to become more efficient in my Pijul maintenance duty soon)

pmeunier on November 27, 2021

Merged! Thanks a million for these patches.

pmeunier closed this discussion on November 27, 2021

potocpav on December 1, 2021

Wow, thanks for acutally merging this!! I hope it will be more maintainable in the future than the previous version.