The sound distributed version control system

#528 [BUG] File names with weird characters break stuff

Opened by potocpav on September 7, 2021
potocpav on September 7, 2021

If I record a file with name källström, things break. I’m running Manjaro, and I’m using the most recent pijul from main. To reproduce:

pijul init
echo test > källström
pijul add källström
pijul rec -m 'test'
pijul diff

The last line, pijul diff, returns a file move, even though it should not return anything:

message = ''
timestamp = '2021-09-07T10:03:56.020865041Z'
authors = []

# Dependencies
[2] JJB4JA7FU3557BNXS4C3GYGHZOVAXCCUIJRDAWNTR3LNCNXVS5VQC

# Hunks

1. Moved: "ka\\u{308}llstro\\u{308}m" "ka\u{308}llstro\u{308}m" 1.0
BF:BFD 1.0 -> 2.0:47/2
up 1.0, down 2.48
pmeunier on September 7, 2021

Hi! Thanks for the report! After testing on my machine (Ext4 on Linux, Pijul 1.0.0-alpha.54, Rustc 1.54), everything seems to works for me:

message = ''
timestamp = '2021-09-07T10:12:28.673944156Z'

[[authors]]
key = 'FZQ2g7VfnzLYM4mtTVDk9HAZjA8Jk9ndkwN1icgbtWUr'

# Hunks

1. File addition: "kälström" in "/" "UTF-8"
  up 1.0, new 5:39
+ a

If I record again, it works. Would you mind sharing some bits of your configuration? Which platform? Filesystem? Is this the latest stable Rust? Latest Pijul?

potocpav on September 7, 2021

Sure. Thanks for a quick reply! I was investigating further, and I think I managed to isolate the issue. The file addition hunk is “filtered through” the textual format you posted above, right? In this line:

1. File addition: "kälström" in "/" "UTF-8"

the "kälström" string was produced using {:?}, but it will be parsed out without un-escaping the string inside the quotes. This creates an extra layer of escaping. For you, for whatever reason, the string was not escaped so everything is OK. But I have "ka\u{308}llstro\u{308}m", which ends up incorrectly being recorded as "ka\\u{308}llstro\\u{308}m".

Hope it makes sense.

potocpav on September 7, 2021
$ pijul --version
pijul 1.0.0-alpha.54
$ rustc --version
rustc 1.54.0 (a178d0322 2021-07-26)
pmeunier on September 7, 2021

This could be the reason indeed. I have no idea why Rust prints my file name correctly the first time. I’ll fix that now.

potocpav on September 7, 2021

encoding is done at text_changes.rs:434, decoding at text_changes.rs:572, but these are not inverses.

pmeunier added a change on September 7, 2021
SFJ3XRTFUNG6KNYDLIYKHCENZ6Y3PG33GNGDW6444LAVBMCSH2FAC
main
pmeunier on September 7, 2021

I just pushed a patch replacing the {:?} with something independent from Rust. If you care to test again, it would be helpful.

potocpav on September 7, 2021

That was super quick :-) Tested it, and it solved my test-case.

Would it be a good idea to quickcheck the write-read roundtrip? It seems that otherwise text_changes.rs would be an endless source of bugs.

pmeunier on September 7, 2021

It is actually tested by cargo test in Libpijul, but some of these tests are broken at the moment, too much stuff going on in this project.

potocpav on September 7, 2021

I think your change breaks on some special characters in file-names, such as a newline character. I created a test file in bash like this:

echo txt > $'abc\ndef'

which can’t be recorded.

potocpav on September 7, 2021

I am now trying to fix tests, but I don’t know if I’ll succeed.

pmeunier on September 7, 2021

I have a patch in preparation for that, we might get conflicts if you do that.

pmeunier on September 7, 2021

I think your change breaks on some special characters in file-names, such as a newline character. I created a test file in bash like this:

Indeed… Nice catch.

potocpav on September 7, 2021

this may be a good library for string un/escaping, though I’ve never used it. https://crates.io/crates/snailquote

pmeunier on September 7, 2021

Unfortunately, that crate’s license doesn’t allow us to use it in Libpijul. I wouldn’t even look at their code in order to avoid deriving things.

But using a more general escaping function is a good idea, the patch I pushed can easily be extended to include more escape sequences.

potocpav on September 7, 2021

ah OK.

potocpav on September 7, 2021

Do you plan on keeping the changes editable in the future? If so, file name escaping would need to be 100% correct eventually.

I don’t think it can be made correct using the current approach based on regular expressions. Would it be beneficial if I rewrote it using a proper parser library?

And what about non-utf8 filenames, which may not be representable by String?

pmeunier added a change on September 9, 2021
LWBBN2IBFFW2UIQIP7XL3J2QOK76EJI3P4MO5D543UAATAYW66XAC
main
potocpav on September 13, 2021

Unless I am mistaken, the quotes are still not properly fixed :-) Recognition regexps should match double quotes around the names, and the Escaped function should be used for Edit and Replace hunks.