The sound distributed version control system

#179 please support more text file encodings

Opened by rohan on December 6, 2020 Beta
rohan on December 6, 2020

It seems that ISO-8859, UTF-16 (little endian) and Windows code page 1252 encoded files are being treated as binary.

pmeunier on December 6, 2020

That’s right. While I don’t think this is desirable for file names (I believe Rust converts them automatically from UTF-8), it would indeed be good to treat these files as text.

I wonder how to detect them though, do you have any idea?

rohan on December 6, 2020

No, but some googling resulted in https://crates.io/crates/chardet

Skia on December 6, 2020

@pmeunier

Git does it by having the user manually specify which files should be treated as text and what their encoding is in the .gitattributes file link. Relative to how long Git has been around and popular, this is a relatively recent feature having been added in March 2018.

Git converts files attributed as another encoding to UTF-8 before performing any other processing on them when reading from the working directory. When writing content to the file it converts from UTF-8 back to the specified encoding. Which has a side effect of if the encoding of the file is changed later there is no change to the file as Git sees it, only a change to the .gitattributes file.

rohan on December 6, 2020

From that link:

Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …​) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary

So the .gitattributes is only needed for UTF-16 in my case.

rohan on December 7, 2020

I’ll try to code something up for this

rohan on December 7, 2020

Using tree_magic_mini and chardetng allows the detection of everything except UTF-16

pmeunier on December 7, 2020

The Git way seems very reasonable to me, but I wondered if we could do better:

  • For the diff algorithm, we don’t have to convert anything for now, as long as we can detect newlines correctly. This is easy for ASCII, UTF-8, ISO-8859-1 (and probably Windows codepages too), since 0xa is the universal newline character for these encodings, and doesn’t appear in any other character sequence. This doesn’t work for UTF-16, unfortunately, so we might have to do something special for that (adjusting the character sequence for \n should be enough).

  • Changes involving multiple files with different encodings must use a single character encoding to represent the text version of the diff, and UTF-8 is fine for that.

  • When changes are stored back as bytes before being applied, we still have the information about their encoding, so we can use it to re-encode the diff.

  • However, if a change is transmitted in the text format (using pijul change + pijul apply), it is ambiguous, as information about the encoding has been lost. So, either we would need to store that in the change itself.

  • Also, people working on Windows and Linux on the same repository might want to use their own favourite encodings, and not get artificial conflicts because of that. This can only be achieved in Pijul by choosing a universal encoding across all repositories.

  • Finally, I wouldn’t mind to have the ability to add a little pre- and post-processing layer to Pijul, as this would allow us to encode source code differently, and model refactorings, file splitting, semantic patches…

Skia on December 8, 2020

As I see it, if the encoding of text files is auto detected, then the encoding is part of the file’s data that gets recorded. Thus a change of encoding would be a change to the file even if the recorded contents stay the same. It would also mean text patches would indeed need to contain the encoding of files as part of the information they convey.

If the encoding of files is specified in a Pijul version of ‘.gitattributes’, the encoding is not considered part of the file’s data. Thus the change history of files would be independent of encoding. So when changes are applied the working copy would always be the encoding specified working copy of the ‘.gitattributes’ equivalent file.

In both cases text files would be converted to a uniform “storage encoding” (UTF-8) when being recorded to the repository or when represented in text patch files.

Realistically, the encoding of text files is not going to be changed often if at all. Most cases of the encoding being changed will likely be due to the file having been given the wrong encoding by accident and then being corrected.

Since on of the reasons for knowing file encoding is to handle line endings, it would make sense to handle both in the same manner. The line endings for files come in three types: crlf (Windows style), lf (non-Windows), and ‘native’ (for files that the type of line ending does not matter). The first two can be auto detected for what is being used; however, it is not reasonable to automatically know if the type of line ending does not matter to a file. That would require being able to identify the type of each file.

If a file for configuring how Pijul handles files in a repository is implemented, my personal preference would be to use a general config file format (like TOML, which is already used by Pijul for its config files) rather than a special purpose one like ‘.gitattributes’ has.

pmeunier on December 8, 2020

Thanks for the input. I don’t have much experience with the UX of file encodings, to be honest. Between the two options (storing the encoding in the patches and pristine vs. storing it locally), what do you think is more ergonomic, and why?

rohan added a change on December 9, 2020
EZHWLWLCXGW7GSE3KF5LAW6W6PXSHGVWY65IAQD2XFIKIEI2PBGAC
rohan on December 9, 2020

That change detects at least 8859-1 and 1252 and displays them correctly in the change.

Writing and diff haven’t been changed - diff sees that the files have changed due to the encoding difference and reset turns the files into UTF-8.

rohan added a change on December 9, 2020
6HNRL5RT76NH5YNSUN7B4FHNRZXKNLX4DROFGMO4R5P2U7JWOL2QC
Skia on December 11, 2020

@pmeunier

If I was doing the project, I would always go the config file route if I would have said config file regardless in order to reduce the work load.

I do not think there is much of a functional difference between having encoding as part of the file metadata or being part of a separate config file. Some people might perceive one or the other as giving a more ‘correct’ view of the history of the repository.

For the user, not having to configure it manually is technically less work. But not by much, since as I said the encoding of text file is rarely ever going to change unless there is an accident that causes it to be in the wrong encoding. One advantage that can come from using a config file is that if the file is not valid in terms of the encoding Pijul would then be able to produce an error saying so instead of silently accepting the file if it is not in the correct encoding. Kind of like typed versus untyped programming languages.

rohan added a change on December 11, 2020
VMOYG7MKEWTUEEY2EOL256RWCVPGRD63IFOSKXHBGJ6VSRITLMOAC
rohan on December 11, 2020

An accident where the file changes encoding should be detectable as it will result in a large diff of duplicate lines. Unfortunately the reason for that diff may not be obvious, much like line terminator changes. Adding a new file which is not in the expected encoding could be an error either in the file or in the config.

A config file would be useful for where pijul guesses wrong, eg., decoding a file which should be binary, but I think it should be the exception.

I’m ignoring user-specific config for now as there’s probably a good number of corner cases.

rohan added a change on December 11, 2020
246V5TYIUL7CFN7G5Y7A35EEM6IJPN532ROEYVSM7Q4HCQSWPDBQC
rohan on December 12, 2020

Spoke too soon. Now that I’ve read more of the Change IO parts it seems that we should indeed record the file’s encoding in the change. Therefor an accidental change to the encoding would not show a large diff as the lines being compared have all been converted to UTF-8. Instead there should be extra change metadata displayed when recording showing the before and after file encoding. That way the user can remove that part and the change will be recorded using the original file encoding.

This does require people to read what they’re recording which given the way Pijul works seems more likely then Git, if less in your face than Darcs.

So… where to store the file encoding in the change. This is where I wish that the change contents had used Cap’n Proto to do the contents data structure backed by a plain [u8] with zero copy (for many things) as it also supports various kinds of structural evolution with default values. In lieu of such a major change there seem three choices - in selected change::Record, encoded into the [u8] contents or in the unhashed Toml. The first two would require a new change format version while the later looks like it wouldn’t. However using the unhashed adds complexity as it would be a somewhat parallel structure to the recorded changes and just seems odd - why would a file encoding change not affect the hash?

Any thoughts @pmeunier

rohan added a change on December 15, 2020
NYOF5766GLBTWQV2KTVRAJMGVJNJ37Z5BLJMFPZA3HG7X2Q2RXPAC
rohan on December 18, 2020

The answer to the above seems to be the [u8] encoded contents.

Putting it into the change::Record, as done in the previous change, works for display of the change and storing in the repo but not for writing the changes back to files. The information is available in the recorded change but the output functions do not currently expose that connection. This seems to be acknowledged by the fact that the [u8] encoded contents stores the file name and path, duplicating what’s in the corresponding change::Record entries.

I’m not going to be able to work on this for a few weeks at least.

rohan added a change on December 18, 2020
W4NSLQNGQVQBB4BEXLFW5OTCBD5XQ4E2BGH3WIFGKSJBCODSUIGQC
pmeunier on January 12, 2021

Hi @rohan! These patches look great, thanks a lot for your work.

This is the first major contribution to Libpijul, and I have received a lot of advice about licenses since the recent alpha release, in particular from people wanting to use Libpijul in proprietary projects.

I am firmly convinced that Pijul must at all costs remain open source. Moreover, it has also been suggested that sublicensing Libpijul could provide funding for the project in the future. After some research, it turns out that there is a generic solution to this problem, called contributor license agreements, and generic templates usable in open source projects are provided by the Contributor Agreements project, sponsored by the FSFE: https://contributoragreements.org/.

I’ve added that template to the repository, and I have written instructions there: https://pijul.org/cla

In your case, since I don’t want to ask you to record your changes again, it would be enough to send an email to contact@pijul.org stating that you have permission to contribute to Libpijul, and that you agree with the CLA introduced by patch IUH7IMWES3KQTHVWA5UNHAO7QWVCC5PQJ6VLK3RC3T4F2MS74P3AC (after reading that CLA, obviously).

Since this is my first open source project that is starting to look like something serious, I am obviously completely open to any comment that you may have.

Thanks!

pmeunier added tag Beta on January 12, 2021
rohan on January 15, 2021

Hi pmeunier, I’m more than happy to contribute under that agreement and as there’s more work to do I can just add it to the next change. However, I need to get an official waiver from my employer to do so. Not sure how long that will take.

pmeunier on January 15, 2021

Thanks! Sorry about that extra bit of admin, using this “standard” CLA should hopefully make it as fast as possible.