I think this is very cool, and am looking forward to reviewing these changes!
Feel free to push PoC patches if you need to discuss the implementation.
That’s great to hear! I can see one potential problem: performance. We may end up with some extra copies, and I don’t know how much of a problem that would be. But I might as well try, and we will see.
I fixed the bug in the serialisation code btw.
Also, after reading that code again, I’d like to see places where you think using a parser library would be much better. I don’t even know if the escaping issue could be fixed by using a parser, since this is typically a tokenizer duty, which is usually performed via regular expressions.
Outside of the escaping issue (which is definitely a bug, but no parser library would have avoided that one), I believe the parsing done by Hunk::read is very close to what we would write with a parser combinator library, although this one has slightly more explicit backtracking.
I find it easier to create a bug-free parser with a parser combinator library than by manually reading lines and matching regexps. I think the code in Hunk::read is hard to follow, with control flow jumping all over the place. But that may be just me, and you can disregard my opinion.
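For illustration, the combinator style in question can be sketched without any library (a real implementation would use a crate like nom); the parsers here are simplified stand-ins, parsing fragments shaped like the `up 2.50` / `down 2.59705` lines that appear later in this thread:

```rust
// Minimal parser-combinator sketch: a parser is a function from input
// to Option<(value, rest)>; failure returns None, which keeps
// backtracking explicit and local. Names and format are illustrative.

type ParseResult<'a, T> = Option<(T, &'a str)>;

fn tag<'a>(t: &'a str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input| input.strip_prefix(t).map(|rest| (t, rest))
}

fn number(input: &str) -> ParseResult<'_, u64> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((input[..end].parse().ok()?, &input[end..]))
}

// Alternation "backtracks" simply by retrying the same input
// when the first branch fails.
fn direction(input: &str) -> ParseResult<'_, &str> {
    tag("up ")(input).or_else(|| tag("down ")(input))
}

// A dotted position like "2.59705".
fn position(input: &str) -> ParseResult<'_, (u64, u64)> {
    let (major, rest) = number(input)?;
    let (_, rest) = tag(".")(rest)?;
    let (minor, rest) = number(rest)?;
    Some(((major, minor), rest))
}

fn main() {
    let (dir, rest) = direction("down 2.59705").unwrap();
    let ((major, minor), _) = position(rest).unwrap();
    println!("{} {}.{}", dir.trim_end(), major, minor); // prints "down 2.59705"
}
```

Each small parser is independently testable, and the sequencing with `?` replaces the mutable line context threaded through a hand-written loop.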
The more important change I proposed is separating parsing from printing, and checking round-trip validity on random inputs.
I implemented this approach for the FileAddition hunk, and I was already able to find some new bugs using quickcheck: some files (depending on their contents) can’t be properly recorded. This is probably because the guessed encoding doesn’t guarantee that the byte sequence is actually valid in that encoding.
echo /v8= | base64 -d > win # bytes [254, 255]
echo j6vW | base64 -d > jap # bytes [143, 171, 214]
pj init
pj add win jap
pj rec -m a
touch jap win
pj diff # error: is not empty
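The failure above is consistent with an unvalidated encoding guess: a guess is only usable if the bytes actually decode under it. A std-only illustration for the UTF-8 case, using the byte sequences from the reproduction (the real detector would have to run an equivalent check for each candidate encoding; this is illustration only, not pijul’s code):

```rust
// An encoding guess is only trustworthy if the bytes decode under it.
// std::str::from_utf8 is strict: it succeeds only when every byte
// sequence is well-formed UTF-8, so success alone guarantees a
// lossless decode/re-encode roundtrip for this encoding.
fn valid_as_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // The two reproduction files from above:
    let win = [254u8, 255]; // contents of "win"
    let jap = [143u8, 171, 214]; // contents of "jap"
    println!("win valid utf-8: {}", valid_as_utf8(&win)); // false
    println!("jap valid utf-8: {}", valid_as_utf8(&jap)); // false
    println!("ascii valid utf-8: {}", valid_as_utf8(b"hello")); // true
}
```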
And there is another bug with binary files: the BASE64 representation seems to never get decoded. This does not manifest in practice, though, since no file I’ve been able to produce ends up with the “binary” encoding.
Instead of fixing these bugs as they are discovered, we can get rid of a whole class of bugs with the proposed approach. I’m not sure we should even fix them now, since the fixes would conflict with my changes.
To answer your points more directly:

> I don’t even know if the escaping issue could be fixed by using a parser, since this is typically a tokenizer duty, which is usually performed via regular expressions.

I’m using a parser all the way down; there is no tokenizer. This is, AFAIK, normal with parser combinators.
> I believe the parsing done by Hunk::read is very close to what we would write with a parser combinator library, although this one has slightly more explicit backtracking.
One of my main motivations was improving the code structure of Hunk::read. Since Hunk::read currently does line-by-line parsing, line context must be passed to the function through mutable variables. After each line, control flow leaves Hunk::read and then comes back in again, which makes it very hard to follow what happens there. My approach is to reify the Hunk data structures and do the (multi-line) parsing outside Hunk::read. I can then pass each hunk as a whole into Hunk::read and process it without jumping around. Control flow becomes much easier to follow, and the state-management complexity disappears.
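The restructuring described above could be sketched like this; all types and fields here are invented stand-ins for the real libpijul ones:

```rust
// Reified text-format hunk: everything the processing step needs is
// already grouped in one value. (Invented stand-in types.)
enum PrintableHunk {
    FileAddition { path: String, contents: Vec<String> },
}

// Toy internal representation built from a printable hunk.
struct Hunk {
    path: String,
    line_count: usize,
}

// Whole-hunk processing: no per-line re-entry, no mutable line context
// threaded between calls; plain top-to-bottom control flow.
fn from_printable(h: PrintableHunk) -> Hunk {
    match h {
        PrintableHunk::FileAddition { path, contents } => Hunk {
            path,
            line_count: contents.len(),
        },
    }
}

fn main() {
    let printable = PrintableHunk::FileAddition {
        path: "win".into(),
        contents: vec!["line 1".into(), "line 2".into()],
    };
    let hunk = from_printable(printable);
    println!("{}: {} lines", hunk.path, hunk.line_count); // prints "win: 2 lines"
}
```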
Good points, I was already convinced before, but now we also have some benchmarks to run afterwards ;-)
Since you’re working on Libpijul, note that there’s a CLA allowing me to release Libpijul under a different license in the future, in addition to the current GPL2. My current plan is to release it under an MIT-style license when it is ready. For this reason, your patches have to depend on the patch that introduced the CLA (even though this isn’t strictly required, it allows this kind of agreement to work independently of the Nest). To do that, first read the CLA, and then add the following line to the dependencies section of your patches:
[] IUH7IMWES3KQTHVWA5UNHAO7QWVCC5PQJ6VLK3RC3T4F2MS74P3AC
For the base64 stuff, this is meant for “binary” files, for example image files. Try using ImageMagick on an image that is in the repository, and then record, for example.
Since these files are usually large, I’m not sure we really want the base64 stuff. The format could be changed to just +b <could not print binary data> or -b <could not print binary data>. The extra info before these lines (B->BD, etc.) is enough to recover the necessary information.
I have an initial draft of the changes ready. I just need to upload it somehow (see my comment in #531), and write some notes about it.
Regarding binary files: I agree, the base64 output seems mostly useless for the user. It makes sense not to print it.
What I meant by BASE64 not being used: binary files are detected for me as having a windows-125* encoding:
1. Replacement in image.png:3 2.41 "windows-1252"
B:BD 2.50 -> 2.50:59705/2
up 2.50, new 0:57006, down 2.59705
- ^@^@^@^MIHDR^@^@^A,^@^@^@^F^@^@^@Z+Ó^O^@^@ ^@IDATx^A„¼^ExUÇÚö¿b^D×ê©Ûé9mÏ©œS/m
i^K…$Á^BDH îîF^LKÐ"-RZ\Jñ^@qA‚^E)R4$Ä·K’ßÿšÙ„·ß{Î÷ýs]“µ×¬µ÷ž5óÌ=÷s?Ïlené
t2ËfU>“œJwæU{²à„^W^KOz³èÔl–œõ“eé9^?–^O`ù…@YVÔ^E±¢.eçýYvÎ
‡åµ~,?å˲“>,99›^E§=È>3¤S.Ì»áMâÙ©¤œó úèt"öÎ ³(œyÇÙîÅܪ(RküH:ëNÄi^G\^KÞ# z
^D^A^U_3kïp"Ž9‘RéEB±ûÇ‘TãJL•#±'^H+ž€Ï1ÄWÏ ªl
(...)
5FI6SBEZ6RERERUAIWQJVAY66BEZ7YQOYOUNK2DPOLRGS2X326RAC
I implemented the proposed changes. I wanted to create a small-ish example, but found out that it requires changes to the overall code structure, so I did the whole thing. Some notes about the changes:
- Hunk::write. This is very similar to before, just producing PrintableHunk as an intermediate step instead of writing directly.
- Hunk::from_printable. It is analogous to the old Hunk::read, but now it creates hunks in one go.
- printable.hs, parsing in parse.rs. These are checked against one another, so they can be modified quite confidently.
Now for the bad parts:
- What are offsets good for? I have no idea.
- There are from_printable_pos_vec_offsets and from_printable_pos_vec, where only the former uses offsets. Should this be the case?
- EncodingDetector::get_valid.
- cargo fmt loves creating one-word lines.
I tested that the new code produces the exact same change as the old code by adapting the test change::text. I needed to fix one inaccuracy in the new code. This gave me some confidence that the code is actually correct.
I just copied the old text_changes.rs file and renamed functions to avoid name collisions. Do you want this testing code, or should I just push the fixes?
Excellent! I’m looking forward to seeing that change.
UN2M77YUIQZMPH5CQARPUUK2Q66RHN5J6BIIQQFEJ35DSFSKFFPQC
Hi! I improved the binary file detection (#I3HDN5CSJMZKLRDGNFCT64UK3ATHL45M3STDVH4LYN7VI6UVJORQC), if you care to test again. I’ve been super busy with other stuff, but I hope to be able to get back to Pijul very soon (meaning “now”).
Alright, I just merged this locally, along with a bunch of other changes (I did modify the hunks a little bit). This is excellent work, thanks!
Your “Should we panic?” assumptions are correct: we should indeed panic in these cases; they never happen.
There are two other TODOs I’m a bit worried about:
// TODO: pass in the encoding
We could fix these, but I’m a bit scared that this could break backwards compatibility. One way to handle that is to add them to a shopping list of format changes, and only break the format in the future once there are enough of them.
In text_changes.rs, I don’t understand the following comment of yours:
// TODO: get rid of this whole default change if possible
I believe it’s one way to initialise the structures we are editing. Do you mean we should have a Default impl instead?
(and sorry for the long delay, I hope to become more efficient in my Pijul maintenance duty soon)
Merged! Thanks a million for these patches.
Wow, thanks for actually merging this!! I hope it will be more maintainable in the future than the previous version.
While investigating :528, I explored how text_change.rs works. The whole setup seems extremely fragile to me.
The module basically writes hunks to a text format, and reads them back from the same format. These two operations together must not alter the meaning of the change, otherwise stuff breaks. But this is very hard to achieve without bugs in corner cases.
As evidence, I found out that file names with Unicode characters break (:528). @pmeunier fixed this, but the fix introduced a bug (double quotes in Edit hunks) and didn’t address several other bugs (" and \n in file names). Moreover, I’m quite sure I have encountered similar bugs in the past. It is extremely hard to get everything right here, and impossible to create an exhaustive test suite.
Much of the trouble seems to be caused by the fact that the text_change.rs module does several things at the same time. It not only reads/writes the text format, but also constructs quite complex hunk data structures. I think we can gain a lot by separating these two steps.
I propose that we create a new data structure that tightly represents everything we need to show in the text format. Let’s call it PrintableHunk for now. This would separate our write-read roundtrip into four stages f, g, g', f': f turns internal hunks into PrintableHunks, g prints a PrintableHunk to text, g' parses the text back into a PrintableHunk, and f' turns that back into internal hunks.
Now, if we could somehow ensure that the composition (g' ∘ g) is the identity, we only have to write f and f' correctly. I think this is actually not that difficult, since both these functions deal only with Rust data types, not serialization.
Now, how do I propose we write correct g and g' functions? First, we should use a parser library instead of reading lines and matching regular expressions; nom is probably a good choice. Second, we can check that (g' ∘ g = identity) using the quickcheck crate, by checking that ∀h. g'(g(h)) = h. Note that this can hold only if no two PrintableHunks serialize to the same String; that’s why we need the new type.
I think these changes can make this part of pijul much more dependable and easier to understand and modify. What do you think?
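The roundtrip property could be sketched like this, with a toy PrintableHunk and a hand-rolled generator standing in for the quickcheck crate; the type, the line format, and all function names are invented for illustration:

```rust
// Sketch of the g / g' roundtrip property on a toy PrintableHunk.
// Nothing here is the real pijul code; it only demonstrates the shape
// of the property  ∀h. g'(g(h)) = h.

#[derive(Debug, PartialEq)]
struct PrintableHunk {
    path: String,
    line: u64,
}

// g: PrintableHunk -> String. Escape the path so printing is injective.
fn print_hunk(h: &PrintableHunk) -> String {
    let escaped = h.path.replace('\\', "\\\\").replace('"', "\\\"");
    format!("Edit in \"{}\":{}\n", escaped, h.line)
}

// g': String -> Option<PrintableHunk>. Must invert print_hunk exactly.
fn parse_hunk(s: &str) -> Option<PrintableHunk> {
    let s = s.strip_prefix("Edit in \"")?;
    let mut path = String::new();
    let mut chars = s.chars();
    // Unescape the quoted path up to the closing quote.
    let rest: String = loop {
        match chars.next()? {
            '\\' => path.push(chars.next()?),
            '"' => break chars.collect(),
            c => path.push(c),
        }
    };
    let rest = rest.strip_prefix(':')?;
    let line: u64 = rest.strip_suffix('\n')?.parse().ok()?;
    Some(PrintableHunk { path, line })
}

// Tiny deterministic pseudo-random generator standing in for quickcheck.
fn lcg(state: &mut u64) -> u64 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
    *state >> 33
}

fn arbitrary_hunk(state: &mut u64) -> PrintableHunk {
    // Alphabet chosen to stress escaping: quotes, backslashes, Unicode.
    let alphabet: Vec<char> = "ab\"\\é:".chars().collect();
    let len = (lcg(state) % 8) as usize;
    let path: String = (0..len)
        .map(|_| alphabet[(lcg(state) as usize) % alphabet.len()])
        .collect();
    PrintableHunk { path, line: lcg(state) % 1000 }
}

fn main() {
    let mut state = 42;
    for _ in 0..1000 {
        let h = arbitrary_hunk(&mut state);
        // The property: parse(print(h)) == h for all h.
        assert_eq!(parse_hunk(&print_hunk(&h)), Some(h));
    }
    println!("roundtrip ok");
}
```

With the real types, quickcheck would derive the generator from an Arbitrary impl on PrintableHunk and shrink any counterexample it finds; the property function stays exactly this one-liner.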
I started writing a proof of concept, but it turned out to be quite a bit of work, so I want to discuss the approach first.