
#1 Using `String` instead of bytes

Closed on November 9, 2020
Alphare on November 5, 2020

Hi there, I’m a Mercurial developer and I’m eager to take a closer look at what Pijul/Anu has to offer, but that’s for another discussion (if I can find a contact email somewhere).

We’re currently rewriting parts of the core of Mercurial in Rust for performance reasons, and part of that rewrite entailed discussing the internal representation of paths.

Rust’s `String` was ruled out immediately because of its strict UTF-8 requirement: a great many filenames/paths would otherwise be impossible to represent. We ended up with a wrapper around `[u8]`/`Vec<u8>` called `HgPath`/`HgPathBuf` that offers the necessary encapsulation (in theory; the cross-platform work is incomplete) to store any path inside Mercurial except ones containing null bytes. That, along with some extensive normalization for… difficult filesystems/OSes, ensures portability and backwards compatibility with older repositories as well as alternative encodings.
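As a rough illustration of the idea described above, here is a minimal sketch of a byte-based path wrapper. It is not Mercurial's actual `HgPathBuf`: the names `BytePathBuf`, `BytePathError`, `from_bytes`, and `as_bytes` are hypothetical, and the only rule enforced is the "no null bytes" one mentioned in the post. The normalization for difficult filesystems is left out entirely.

```rust
use std::borrow::Cow;
use std::fmt;

/// Hypothetical owned path stored as raw bytes.
/// This is an illustration only, not Mercurial's `HgPathBuf`.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct BytePathBuf {
    inner: Vec<u8>,
}

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BytePathError {
    /// The input contains a null byte at the given offset.
    ContainsNullByte(usize),
}

impl fmt::Display for BytePathError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BytePathError::ContainsNullByte(offset) => {
                write!(f, "path contains a null byte at offset {}", offset)
            }
        }
    }
}

impl BytePathBuf {
    /// Accepts any byte sequence, valid UTF-8 or not, as long as it
    /// contains no null byte.
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, BytePathError> {
        match bytes.iter().position(|&b| b == 0) {
            Some(offset) => Err(BytePathError::ContainsNullByte(offset)),
            None => Ok(Self { inner: bytes.to_vec() }),
        }
    }

    /// Returns the stored bytes exactly as they were given.
    pub fn as_bytes(&self) -> &[u8] {
        &self.inner
    }

    /// Lossy rendering for display purposes only: invalid UTF-8
    /// sequences are replaced with U+FFFD.
    pub fn to_string_lossy(&self) -> Cow<'_, str> {
        String::from_utf8_lossy(&self.inner)
    }
}
```

With this sketch, a Latin-1 encoded name such as `b"caf\xe9.txt"` is rejected by `String::from_utf8` but round-trips unchanged through `BytePathBuf::from_bytes` and `as_bytes`.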

What is your stance on the matter?

Raphaël

pmeunier on November 6, 2020

Hi! Thanks for your question. Feel free to contact me directly at pe@pijul.org.

To answer your question, the internal format is pretty generic and could be changed to pretty much any encoding without breaking compatibility. However, because Anu allows people to clone the same repository on multiple systems, I think it makes sense to decide on a common encoding, and UTF-8 seems like a reasonable choice. In the future, we may even re-encode files before outputting them. By the way, Rust seems to make the assumption that UTF-8 strings are always valid paths:

https://doc.rust-lang.org/std/path/struct.Path.html#method.new
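To make the point about `Path::new` concrete, the sketch below shows that it accepts any `&str` without returning a `Result`, while the reverse conversion (`Path::to_str`) can fail. The helper names are hypothetical, and the last function is Unix-only because it relies on `std::os::unix::ffi::OsStrExt`.

```rust
use std::path::Path;

/// Any `&str` is accepted: `Path::new` is infallible and performs
/// no validation of its input.
pub fn path_from_utf8(s: &str) -> &Path {
    Path::new(s)
}

/// The reverse direction can fail, because a `Path` is not
/// guaranteed to hold valid UTF-8.
pub fn path_as_utf8(p: &Path) -> Option<&str> {
    p.to_str()
}

/// Unix-only: builds a `Path` from arbitrary bytes, including
/// bytes that are not valid UTF-8.
#[cfg(unix)]
pub fn path_from_raw_bytes(bytes: &[u8]) -> &Path {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    Path::new(OsStr::from_bytes(bytes))
}
```

So on Unix, every UTF-8 string can be turned into a `Path`, but not every `Path` can be turned back into a UTF-8 string.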

Another issue, one that Mercurial doesn’t have, is that Anu is broken on file systems that don’t support mmap, which limits our options for backwards compatibility.

pmeunier closed this discussion on November 9, 2020