pijul/pijul - Discussion #179 - supporting non-UTF8 text file encodings

#179 supporting non-UTF8 text file encodings

Closed on June 29, 2021

rohan on December 6, 2020

It seems that ISO-8859, UTF-16 (little endian) and Windows code page 1252 encoded files are being treated as binary.

pmeunier on December 6, 2020

That’s right. While I don’t think this is desirable for file names (I believe Rust converts them automatically from UTF-8), it would indeed be good to treat these files as text.

I wonder how to detect them though, do you have any idea?

rohan on December 6, 2020

No, but some googling resulted in https://crates.io/crates/chardet

Skia on December 6, 2020

@pmeunier

Git does it by having the user manually specify which files should be treated as text and what their encoding is in the .gitattributes file link. Relative to how long Git has been around and popular, this is a relatively recent feature having been added in March 2018.

Git converts files attributed as another encoding to UTF-8 before performing any other processing on them when reading from the working directory. When writing content to the file it converts from UTF-8 back to the specified encoding. Which has a side effect of if the encoding of the file is changed later there is no change to the file as Git sees it, only a change to the .gitattributes file.

rohan on December 6, 2020

From that link:

Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary

So the .gitattributes is only needed for UTF-16 in my case.

rohan on December 7, 2020

I’ll try to code something up for this

rohan on December 7, 2020

Using tree_magic_mini and chardetng allows the detection of everything except UTF-16

pmeunier on December 7, 2020

The Git way seems very reasonable to me, but I wondered if we could do better:

For the diff algorithm, we don’t have to convert anything for now, as long as we can detect newlines correctly. This is easy for ASCII, UTF-8, ISO-8859-1 (and probably Windows codepages too), since 0xa is the universal newline character for these encodings, and doesn’t appear in any other character sequence. This doesn’t work for UTF-16, unfortunately, so we might have to do something special for that (adjusting the character sequence for \n should be enough).
Changes involving multiple files with different encodings must use a single character encoding to represent the text version of the diff, and UTF-8 is fine for that.
When changes are stored back as bytes before being applied, we still have the information about their encoding, so we can use it to re-encode the diff.
However, if a change is transmitted in the text format (using pijul change + pijul apply), it is ambiguous, as information about the encoding has been lost. So, either we would need to store that in the change itself.
Also, people working on Windows and Linux on the same repository might want to use their own favourite encodings, and not get artificial conflicts because of that. This can only be achieved in Pijul by choosing a universal encoding across all repositories.
Finally, I wouldn’t mind to have the ability to add a little pre- and post-processing layer to Pijul, as this would allow us to encode source code differently, and model refactorings, file splitting, semantic patches…

Skia on December 8, 2020

As I see it, if the encoding of text files is auto detected, then the encoding is part of the file’s data that gets recorded. Thus a change of encoding would be a change to the file even if the recorded contents stay the same. It would also mean text patches would indeed need to contain the encoding of files as part of the information they convey.

If the encoding of files is specified in a Pijul version of ‘.gitattributes’, the encoding is not considered part of the file’s data. Thus the change history of files would be independent of encoding. So when changes are applied the working copy would always be the encoding specified working copy of the ‘.gitattributes’ equivalent file.

In both cases text files would be converted to a uniform “storage encoding” (UTF-8) when being recorded to the repository or when represented in text patch files.

Realistically, the encoding of text files is not going to be changed often if at all. Most cases of the encoding being changed will likely be due to the file having been given the wrong encoding by accident and then being corrected.

Since on of the reasons for knowing file encoding is to handle line endings, it would make sense to handle both in the same manner. The line endings for files come in three types: crlf (Windows style), lf (non-Windows), and ‘native’ (for files that the type of line ending does not matter). The first two can be auto detected for what is being used; however, it is not reasonable to automatically know if the type of line ending does not matter to a file. That would require being able to identify the type of each file.

If a file for configuring how Pijul handles files in a repository is implemented, my personal preference would be to use a general config file format (like TOML, which is already used by Pijul for its config files) rather than a special purpose one like ‘.gitattributes’ has.

pmeunier on December 8, 2020

Thanks for the input. I don’t have much experience with the UX of file encodings, to be honest. Between the two options (storing the encoding in the patches and pristine vs. storing it locally), what do you think is more ergonomic, and why?

rohan added a change on December 9, 2020

detect non-utf8 text files

EZHWLWLCXGW7GSE3KF5LAW6W6PXSHGVWY65IAQD2XFIKIEI2PBGAC

rohan on December 9, 2020

That change detects at least 8859-1 and 1252 and displays them correctly in the change.

Writing and diff haven’t been changed - diff sees that the files have changed due to the encoding difference and reset turns the files into UTF-8.

rohan added a change on December 9, 2020

detect non-utf8 text files

6HNRL5RT76NH5YNSUN7B4FHNRZXKNLX4DROFGMO4R5P2U7JWOL2QC

main

Skia on December 10, 2020

@pmeunier

If I was doing the project, I would always go the config file route if I would have said config file regardless in order to reduce the work load.

I do not think there is much of a functional difference between having encoding as part of the file metadata or being part of a separate config file. Some people might perceive one or the other as giving a more ‘correct’ view of the history of the repository.

For the user, not having to configure it manually is technically less work. But not by much, since as I said the encoding of text file is rarely ever going to change unless there is an accident that causes it to be in the wrong encoding. One advantage that can come from using a config file is that if the file is not valid in terms of the encoding Pijul would then be able to produce an error saying so instead of silently accepting the file if it is not in the correct encoding. Kind of like typed versus untyped programming languages.

rohan added a change on December 11, 2020

text file decoding for new files

VMOYG7MKEWTUEEY2EOL256RWCVPGRD63IFOSKXHBGJ6VSRITLMOAC

main

rohan on December 11, 2020

An accident where the file changes encoding should be detectable as it will result in a large diff of duplicate lines. Unfortunately the reason for that diff may not be obvious, much like line terminator changes. Adding a new file which is not in the expected encoding could be an error either in the file or in the config.

A config file would be useful for where pijul guesses wrong, eg., decoding a file which should be binary, but I think it should be the exception.

I’m ignoring user-specific config for now as there’s probably a good number of corner cases.

rohan added a change on December 11, 2020

decode existing files

246V5TYIUL7CFN7G5Y7A35EEM6IJPN532ROEYVSM7Q4HCQSWPDBQC

main

rohan on December 12, 2020

Spoke too soon. Now that I’ve read more of the Change IO parts it seems that we should indeed record the file’s encoding in the change. Therefor an accidental change to the encoding would not show a large diff as the lines being compared have all been converted to UTF-8. Instead there should be extra change metadata displayed when recording showing the before and after file encoding. That way the user can remove that part and the change will be recorded using the original file encoding.

This does require people to read what they’re recording which given the way Pijul works seems more likely then Git, if less in your face than Darcs.

So… where to store the file encoding in the change. This is where I wish that the change contents had used Cap’n Proto to do the contents data structure backed by a plain [u8] with zero copy (for many things) as it also supports various kinds of structural evolution with default values. In lieu of such a major change there seem three choices - in selected change::Record, encoded into the [u8] contents or in the unhashed Toml. The first two would require a new change format version while the later looks like it wouldn’t. However using the unhashed adds complexity as it would be a somewhat parallel structure to the recorded changes and just seems odd - why would a file encoding change not affect the hash?

Any thoughts @pmeunier

rohan added a change on December 15, 2020

track file encoding in the record, including change text for file adds

NYOF5766GLBTWQV2KTVRAJMGVJNJ37Z5BLJMFPZA3HG7X2Q2RXPAC

main

rohan on December 18, 2020

The answer to the above seems to be the [u8] encoded contents.

Putting it into the change::Record, as done in the previous change, works for display of the change and storing in the repo but not for writing the changes back to files. The information is available in the recorded change but the output functions do not currently expose that connection. This seems to be acknowledged by the fact that the [u8] encoded contents stores the file name and path, duplicating what’s in the corresponding change::Record entries.

I’m not going to be able to work on this for a few weeks at least.

rohan added a change on December 18, 2020

make text_encoding available to all of libpijul

W4NSLQNGQVQBB4BEXLFW5OTCBD5XQ4E2BGH3WIFGKSJBCODSUIGQC

main

pmeunier on January 12, 2021

Hi @rohan! These patches look great, thanks a lot for your work.

This is the first major contribution to Libpijul, and I have received a lot of advice about licenses since the recent alpha release, in particular from people wanting to use Libpijul in proprietary projects.

I am firmly convinced that Pijul must at all costs remain open source. Moreover, it has also been suggested that sublicensing Libpijul could provide funding for the project in the future. After some research, it turns out that there is a generic solution to this problem, called contributor license agreements, and generic templates usable in open source projects are provided by the Contributor Agreements project, sponsored by the FSFE: https://contributoragreements.org/.

I’ve added that template to the repository, and I have written instructions there: https://pijul.org/cla

In your case, since I don’t want to ask you to record your changes again, it would be enough to send an email to contact@pijul.org stating that you have permission to contribute to Libpijul, and that you agree with the CLA introduced by patch IUH7IMWES3KQTHVWA5UNHAO7QWVCC5PQJ6VLK3RC3T4F2MS74P3AC (after reading that CLA, obviously).

Since this is my first open source project that is starting to look like something serious, I am obviously completely open to any comment that you may have.

Thanks!

pmeunier added tag Beta on January 12, 2021

rohan on January 14, 2021

Hi pmeunier, I’m more than happy to contribute under that agreement and as there’s more work to do I can just add it to the next change. However, I need to get an official waiver from my employer to do so. Not sure how long that will take.

pmeunier on January 15, 2021

Thanks! Sorry about that extra bit of admin, using this “standard” CLA should hopefully make it as fast as possible.

Alphare on February 12, 2021

What I believe to be a must-read if you’re looking into VCS encoding: https://www.mercurial-scm.org/wiki/EncodingStrategy . Encoding is already a big dragon, but encoding with fundamentally immutable legacy is a cursed dragon. You might make different choices than hg on the subjet, but you should do so with all of those trade-offs in mind. :)

rohan on February 12, 2021

You’ve added a period to the end of that link. I’ll read https://www.mercurial-scm.org/wiki/EncodingStrategy while the wheels of bureaucracy turn.

Alphare on February 12, 2021

It appears that Nest’s link detection is not fool-proof. :) (Also it keeps redirecting me to a new entry page when I instinctively do Ctrl + Enter to submit instead of… submitting)

rohan on February 12, 2021

https://www.mercurial-scm.org/wiki/EncodingStrategy#The_encoding_tracking_problem suggests that the contents of the file should not be converted to UTF-8 and back again but a quick google finds no examples of round-tripping failures.

Since that back-again conversion seems to be mostly to support the applying the text of the change directly one option would be to append the unconverted bytes base64 encoded. That would mean that the existing diff logic would only need to be updated to recognise the appropriate line terminator, as mentioned by pmeunier above, and there’d be no need to extend the change’s [u8] content format. That’s good because I can’t currently see a way to make that a non-breaking change.

rohan added a change on February 25, 2021

store new non-UTF-8 files raw and decode to deplay the contents

PEUL7HHDK65WQ2JHVUC5EMBHMQL6ITX6KWBW3KQIEJTHBRFJGH7QC

rohan added a change on February 25, 2021

store new non-UTF-8 files raw and decode to deplay the contents

UM5DLRPBCQZQBEDJDLDPKODOKLACUHZD6YL6S4JRNKW6JLPNUVSAC

main

rohan on February 25, 2021

That change includes the CLA dependency. Having to read over my code to remember what was going on :-)

pmeunier on February 25, 2021

Fantastic! Thanks a million! I’m in the middle of a massive change (replacing the old Sanakirja with the new one), but I’ll get back to this discussion as soon as that is over.

rohan added a change on February 26, 2021

merge with changes from sanakirja v1.1.2

Q3GU26WDEYE2HXMM4WWNEGVMSXLQAKRJBSB2EJZA6JTHPKQHXCXQC

main

rohan added a change on February 26, 2021

fix left over conflicts

PDTUHOMVQ2NUN2MMFZL7F4NI3EXOK5SOBBS34DF4ACFOJCB5NQTAC

main

pmeunier on February 26, 2021

Thanks a lot! I was planning on doing that myself, I made these conflicts after all!

rohan on February 26, 2021

Those merge changes aren’t displaying correctly in the Nest with lots of duplicated lines. Looks fine locally using pijul 1.0.0-alpha.41

rohan on February 26, 2021

Also trying to amend that last lot of conflicts into the merge change caused Pijul to think I’d deleted and replaced the entirety of libpijul/src/diff/mod.rs. That issue can be recreated by unrecording PDTUHOMVQ2NUN2MMFZL7F4NI3EXOK5SOBBS34DF4ACFOJCB5NQTAC

rohan added a change on February 26, 2021

fix encoded file add

GDDYJH7AJYJ2DWTYVP6WS3FPPRTDRQAXZ2BH2CKIFGMIASTSLD4QC

main

rohan added a change on February 26, 2021

separate line endings test data for dos and unix

LCXEUK7KIZTFHTMBBDCURKU3CU62YG4NWKEU65UM5HGXMWIBHAHAC

main

pmeunier on March 2, 2021

Hi! Thanks a lot! I only got 2 conflicts, with stuff I added today or yesterday, and they’re fixed.

It seems you forgot to add the text_encoding module.

pmeunier added a change on March 2, 2021

Fixing conflicts

XSEODPNEN2Y2THBRO7L5QFPAEQVSQTLAFZFWCRMBGZ3YSRZB2UJAC

main

rohan added a change on March 5, 2021

text_encoding module

IACED7RWM2ZQIPN3YZATA6SXTRO2C6OUGT3HSOU3LIDZ7YRMLRXAC

main

rohan on March 5, 2021

Opps. I’ll claim that “diff didn’t mention the new file” and not my failing memory :-/

rohan added a change on March 6, 2021

conflicts

4NNR32V6RIG7RAJLHKYNA3YHQDL55RLEKTK72KI4RVBU72F2IXSQC

main

rohan on March 6, 2021

XSEODPNEN2Y2THBRO7L5QFPAEQVSQTLAFZFWCRMBGZ3YSRZB2UJAC removed all permissions from the new files and most from the new directory

rohan added a change on March 7, 2021

file encoding in updates created on March 6, 2021

XR7MNOMU5PMOOEY2EPPUABZ7NOP432RDCWUET23ONPXTT3JQIFIAC

main

rohan on March 7, 2021

I was trying to understand why the text encoding tests were passing but when running real records on the same files the result was different. The issue is that the recorded change is based on the text shown in the editor parsed back via the text_change’s read function. That makes it more difficult to avoid any UTF-8 roundtrip if there’s a desire to edit which lines in the hunks are applied in future.

For now my plan is to change Change::read_and_deps to take the original change written out to the editor and use the hunk index to obtain the original text bytes. Can you see any issues with that @pmeunier?

pmeunier on March 7, 2021

XSEODPNEN2Y2THBRO7L5QFPAEQVSQTLAFZFWCRMBGZ3YSRZB2UJAC removed all permissions from the new files and most from the new directory

Yes, sorry about that, this is intentional. I released a new Pijul on crates.io just before pushing these changes. They’re related to discussion #358, if you want to understand the exact reason (it’s about security).

This is actually the intended way, but I’m also aware that some users have tried editing not only which lines are in the hunks, but also editing the lines themselves. This is supposed to work for line additions, but the results for deletions are counter-intuitive (editing or removing the lines starting with - doesn’t do anything for now).

First, I don’t think UTF-8 roundtrips can be avoided at all when multiple encodings are used in the same patch. I’m also aware that some users expect to be able to edit the patches themselves: not only choose which individual lines are in each hunk, but also edit the contents of these lines. At the moment, this seems to work (somewhat unintentionally) for line insertions, and doesn’t do anything for deletions (but this is fixable).

I took some time away from Pijul in the last few days, after the giant marathon to complete the new backend and fix a few issues with that (including a segfault!). I’m getting back to it now, and I hope to apply your changes today or tomorrow.

rohan on March 7, 2021

I’d expect to exclude some lines within a hunk. Arbitrary editing of the hunk contents feels like it would be rarely safe/useful (for source code at least). Maybe to correct comments but otherwise that’s changes which have never been past any IDE or static verification tools I might be using.

It’s not ready to merge to main but feel free to improve what’s there - I’ll most likely have only a few hours per week to look at it for the next while.

rohan on March 25, 2021

I’ve gone with the roundtriping version as that’s much the easiest to implement given the current recording logic. Adding new files and recording basic changes work, except that there seems to be extra line endings added to the change. Other hunk types don’t support file encoding yet but I think they should just follow what has already been done.

UTF-16 and any other encoding where the lines don’t end in \n isn’t started.

rohan added a change on March 28, 2021

roundtrip text encoding when recording created on March 20, 2021

NG3Z3DOKDZQDQ7UNQEOMXWX2NJKKLM7DKRWW3KIUD7QEECTNQIZQC

main

rohan added a change on March 28, 2021

print coloured text correctly given line terminator is not included created on March 23, 2021

LYZOL6PMVGZBFHJPLJHG5E6CURSDZDQMNTSOU6LSNQQKFFGZM44AC

main

rohan on April 2, 2021

For correct display of deleted files the file encoding will need to be recorded in the pristine and it looks like the pristine inode, specifically InodeMetadata, is the appropriate place. InodeMetadata is currently very compact (a single u16) so do you see any issues adding the encoding label which may be up to 20 characters long?

rohan on April 4, 2021

I’d forgotten that the InodeMetadata is stored encoded into the first two bytes of the change contents and the same encoding applies to the pristine’s graph of folders and files. That encoding/decoding is spread around so to ensure the changes to InodeMetadata apply everywhere I’ll centralise that. What to call the new centralised thing which has the filename and inode metadata… nothing springs to mind so maybe something boring like FileMetadata.

rohan on April 12, 2021

One inconvenience with storing the encoding with the content’s metadata is that, for efficient serialisation, the metadata may need to be written after the file contents as the true encoding isn’t known until then. The alternative would be to read the file contents into a temporary buffer, write the metadata first as is currently done, then append the file buffer. That would reduce the scope of the change but I’m not sure if the O(log n) performance of Vec::append is sufficient (assuming that the ‘n’ refers to the size of the appended buffer).

rohan added a change on May 4, 2021

abstract out FileMetadata (de)serialistion created on April 11, 2021

3S6LU2U5TIU43WRE5RQS3IOLVJLVNCDL4W44FVK2HR3NAXZ7IDUAC

main

rohan added a change on May 4, 2021

encoded file deletion created on April 11, 2021

ZSF3YFZTDOXMCOC3HOT5C6MQLYLWOR7QJAOUDS2M2Z4SC2YW3GRAC

main

rohan on May 4, 2021

These last two changes are where I’m at with trying to get encoding for deleted files working using encoding stored in the change metadata. No encoding is found and the deleted files are treated as binary.

Not sure how to progress this - one possibility is to try and extend the TreeTxnT’s Inode tree as that seems to reflect the pristine’s directory/file structure.

The encoding tests are broken on metadata deserialisation but that doesn’t affect real recording, unrecording or reset.

pmeunier on May 8, 2021

Alright, merging this discussion is now my current target.

pmeunier on May 8, 2021

Going through your changes now, I wanted to say thank you! This is great work!

Once I’m done solving the conflicts, I’ll push them to main, that’ll make it easier to test.

To answer your last comment, the “tree” DB is a tree of file “inodes” (random numbers generated locally), “revtree” is the backwards map (i.e. going up the tree). These are used to assign identifiers to files currently tracked in the working copy. Then, there’s also the inodes/revinodes tables to map these to graph vertices.

rohan on May 9, 2021

My pleasure!

Hopefully that issue with the deletion encoding is just that I’m not grabbing the right bit of the context. Also you may notice a bunch of inefficiencies with cloning, String and Vec wrapping etc. as I’m still at that stage in Rust of throwing things at the borrow checker until it stops complaining.

As mentioned somewhere above, this doesn’t necessarily support UTF-16 and crlf (windows/dos) line terminators don’t have special handling so the cr is treated as part of the text.

rohan on May 9, 2021

With reference to your Zulip comment about tags breaking the repo format, these changes also break the repo format. Changing that to recognise the prior format and default to UTF-8 for all existing non-binary files seems like it should be possible but needs custom Hunk deserialisers. Might be easier to combine these breakages into one upgrade hit.

pmeunier on May 9, 2021

Also you may notice a bunch of inefficiencies with cloning, String and Vec wrapping etc. as I’m still at that stage in Rust of throwing things at the borrow checker until it stops complaining.

No worries, I know the feeling, and I can help with that!

With reference to your Zulip comment about tags breaking the repo format, these changes also break the repo format.

That’s ok, we’re still in alpha. I’d rather break the format one or two more times than not fixing the file encoding issue. This is not a lossy operation anyway, so we can always convert (unlike the previous patch format change).

I believe I agree with your design choices, correct me if I’m wrong:

The contents is stored in the changes as raw binary, and only mapped to characters later, based on extra metadata stored in the changes to indicate the encoding.
When using the text format for patches, we show the patch as UTF-8, and write the encoding name inside the text format, for each hunk that has contents to parse. I note that your current implementation also does it for things like FileDel, where the contents isn’t parsed at the moment, but could be in the future, so it’s probably safe to include that label.

I’m thinking of amending your design in the following way, which also happens to be backwards-compatible with the current patch format:

InodeMetadata is a u16, and used to have 10 bits used, but now only 2 bits (at positions 0 and 9) are used. One way could be to use one new bit (for example bit 10) of that field to mean “non-utf8”, in which case the encoding label would come immediately after in the file name, encoded by an initial u8 indicating the label’s length, followed by the actual label. There’s no real need to be very compact here, since this field is (1) optional, so users with special/embedded/WASM applications can always use UTF-8 and (2) compressed anyway, so we don’t gain much by trying to save bytes.
After reading Mercurial’s page about encodings, I believe a major difference with Mercurial is that file contents in Pijul may be stored in lots of small patches, and these patches may not all use the same file encoding. I don’t think any reasonable text editor would change the encoding of just one part of the file though, so one possible solution would be to add a new kind of hunk, called for example ChangeMetadata, to make it explicit, in the text format for changes, that a user is changing the encoding metadata and/or the permissions of a file. This can be backwards-compatible with the current patch format (by adding a variant of Hunk as the last variant).

On a related note, I’ve move a little bit on the binary diff question, see https://nest.pijul.com/pmeunier/bindiff

rohan on May 9, 2021

The contents is stored in the changes as raw binary, and only mapped to characters later, based on extra metadata stored in the changes to indicate the encoding.

Yes, only layers which display text to the user or read from the text-change format need to know about encoding.

I note that your current implementation also does it for things like FileDel, where the contents isn’t parsed at the moment, but could be in the future, so it’s probably safe to include that label.

FileDel has so that the inverse of FileAdd isn’t lossy. It’s also then available to FileUndel. Otherwise the Hunks that track encoding is so that it’s available for print_change_contents

One way could be to use one new bit (for example bit 10) of that field to mean “non-utf8”, in which case the encoding label would come immediately after in the file name, encoded by an initial u8 indicating the label’s length, followed by the actual label.

I think for efficiency this label should come after the main change content. Removing the FileMetadata structure as part of that split isn’t a bad thing as it turned out most places didn’t need all three parts anyway.

After reading Mercurial’s page about encodings, I believe a major difference with Mercurial is that file contents in Pijul may be stored in lots of small patches, and these patches may not all use the same file encoding. I don’t think any reasonable text editor would change the encoding of just one part of the file though

Embedding different encoding does sound like madness… though it turns out that Emacs has recode-region so I guess someone has seen it before. That’s another reason why I was looking to store the encoding against the pristine file.

I was going to say that it’s a non-issue anyway as an encoding change would change the existing file and so the “small patch” would be the entire file. However, that doesn’t allow for unchanged lines which happen to contain text which is valid in both encodings. With those excluded from the encoding-change change I guess it’d be possible for changes to those lines from some other source to remain in the old encoding even when that’s no longer a proper encoding for the file as a whole.

pmeunier on May 10, 2021

Thanks for the clarification!

Removing the FileMetadata structure as part of that split isn’t a bad thing as it turned out most places didn’t need all three parts anyway.

Indeed. We still have to deserialize it every time, but that could be made efficient by using the details of the bincode format, which is stable.

Alright, I think I understand your work now, I just have a few conflicts to fix and I’ll merge it. I’ll write the converter for the change format and add a temporary pijul upgrade command.

One remaining issue I see is to diff files that have changed encoding, or have exotic encodings like UTF-16. In particular, line splitting is currently a bit naïve as it assumes that all lines end in \n.

rohan on May 10, 2021

Diffing encoding changes using only the new encoding does work somewhat in that the previous encoded text comes through corrupted so it’s obvious that something has changed. I would expect the mechanism for deleted files could be used to obtain the prior encoding and display that as a change by itself.

Looking forward to the merge!

pmeunier added a change on May 11, 2021

Solving conflicts related to file encodings

HZD7CFQU4HU2HIXADBO2E6J334MIL3MBRIVC4ZWNMAR77TZ5UJDQC

pmeunier on May 11, 2021

Cool conflicts, they triggered a bug in the conflict resolution code.

pmeunier added a change on May 24, 2021

Conflict resolution created on May 11, 2021

TVVW53HZGYPODAXEQ4BFZNSPBOFG6JEDVOKIYIDZMWFAMOBKOR2QC

main

pmeunier on May 24, 2021

(I had forgotten to push this)

I’m mostly happy with the state of this discussion for the moment, but since this requires a format change, I also want to solve #359, which might also require a minor format change.

rohan on May 24, 2021

Sounds like a plan. Also, I note that you tagged me in that other discussion but I don’t recall getting a notification.

pmeunier on June 1, 2021

Hi Rohan! Two questions:

Any reason you used the fork of magic_tree instead of the original? magic_tree_mini crashes on my machine, and magic_tree works fine, with the same API.
Is the Nest faster in New Zealand now?

rohan on June 1, 2021

I originally used tree_magic but the other claimed to be smaller/faster/have newer dependencies.

Nest is much much much faster at ~11ms. Cloudflare’s site in Auckland?

One question in response… is there anything I can help with on this?

pmeunier on June 2, 2021

I originally used tree_magic but the other claimed to be smaller/faster/have newer dependencies.

I saw these claims, but I’m not fully sure. Since it panics on my machine, I investigated a little, and found that:

tree_magic_mini is already at version 3.0, which is surprising. I can understand that different people have different expectations for version numbers, but in the Rust ecosystem, 3.0 means you’ve survived a number of battles.
as far as dependencies go, the only new dependency for Pijul is petgraph, for both.
the original tree_magic has more than 10 times more downloads, which doesn’t necessarily mean that it’s better, but at least it receives more testing on more platforms, which is quite crucial for this kind of system-dependent crate (the fact that the author of the fork doesn’t care about that is a bit worrying).

Nest is much much much faster at ~11ms. Cloudflare’s site in Auckland?

Yes, and New Zealand is serverd from the Singapore server, so even uncached pages probably take much less time. I’m glad to hear that this has improved. In the process, I’ve also learn to distrust the ping as a measure of latency: a low ping certainly helps for connection stability, but since Cloudflare is a proxy, what really matters is the time to first byte of a page, or in other words, how much of the contents they have in cache.

One question in response… is there anything I can help with on this?

Not really, unfortunately. I’m working on the other “format change” issue: I should be done today or tomorrow, and we should be able to start testing the new format very soon.

rohan on June 2, 2021

TTFB for the CSS and JS is good at 20-60ms but the small personal icons take over 1s, consistently over multiple reloads.

pmeunier on June 2, 2021

Thanks! I just fixed this.

pmeunier on June 3, 2021

Alright, I’ve made progress again: now that the Nest has been operating without any interruption (not even a second) for about a week, I would find it hard to change the patch format and once again break users’ hopes about this project.

So, I just wrote a patch to allow converting “online”, meaning that when Pijul sees the old format, it knows how to convert. I just tested, it works!

I’m ready to merge! I’ll write a blog post about all the recent changes in the infrastructure. After that, if you want to write a guest blog post about this really cool achievement, just reach out to me on zulip, or by email (pe@pijul.org).

I just have one last element of doubt about the encoding to choose: since I would expect most (if not all) non-experimental repositories are using UTF-8 as the default encoding now (probably because it was the only supported encoding in the diff), I’m doing the conversion by adding that encoding automatically. Do you foresee any problem with that?

rohan on June 3, 2021

Yes, UTF-8 seems the correct default - aside from those known to be binary obviously. I’ll think about the blog post

pmeunier on June 29, 2021

This is now on main! Thanks a lot.

After testing a bit, I had to change one thing: tree_magic is 100-1000 slower than reading the file, which isn’t quite acceptable. Also, it almost always returns “application/…” on the C files from the Linux kernel, so I’m using mime_guess instead, which returns the correct result in no time. Decoding takes about the same time as reading the file (I have an SSD but still, reading is a system call).

pmeunier closed this discussion on June 29, 2021

pmeunier on June 30, 2021

As I’m trying to get binary diff to work, I submitted a PR to chardetng to simplify the detection: https://github.com/hsivonen/chardetng/pull/5

rohan on June 30, 2021

OK, looking through the rules embedded into tree_magic there are definitely some issues where some things which we should treat as text are marked as application (like application/x-shellscript!) and there’s a lot of obscure file types checked. So going straight to the decoder seems like the correct action.