That’s right. While I don’t think this is desirable for file names (I believe Rust converts them automatically from UTF-8), it would indeed be good to treat these files as text.
I wonder how to detect them though, do you have any idea?
No, but some googling turned up https://crates.io/crates/chardet
Git does it by having the user manually specify which files should be treated as text, and what their encoding is, in the .gitattributes file. Relative to how long Git has been around and popular, this is a fairly recent feature, added in March 2018.
Git converts files attributed as another encoding to UTF-8 before performing any other processing on them when reading from the working directory. When writing content back out, it converts from UTF-8 to the specified encoding. This has the side effect that if the encoding of the file is changed later, there is no change to the file as Git sees it, only a change to the .gitattributes file.
From that link:
Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary
So the .gitattributes is only needed for UTF-16 in my case.
I’ll try to code something up for this
Using tree_magic_mini and chardetng allows the detection of everything except UTF-16
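For what it’s worth, here’s a minimal stdlib-only sketch of one possible heuristic for the UTF-16 gap. This is purely illustrative (not what chardetng or tree_magic_mini do), and the 80% threshold is an arbitrary assumption: look for a byte-order mark, or for NUL bytes confined to even or odd byte positions, which is what ASCII-range text encoded as UTF-16 looks like.

```rust
/// Hypothetical heuristic: detect UTF-16 input, which the crates above miss.
/// Signals: a byte-order mark, or NUL bytes confined to the even or odd byte
/// positions (typical of ASCII-range text encoded as UTF-16).
fn looks_like_utf16(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some("UTF-16LE");
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some("UTF-16BE");
    }
    if bytes.len() < 4 {
        return None;
    }
    let even_nuls = bytes.iter().step_by(2).filter(|&&b| b == 0).count();
    let odd_nuls = bytes.iter().skip(1).step_by(2).filter(|&&b| b == 0).count();
    let half = bytes.len() / 2;
    // ASCII text as UTF-16LE has NULs at odd positions, UTF-16BE at even ones.
    // The "80% of positions" threshold is an arbitrary choice for this sketch.
    if odd_nuls * 10 >= half * 8 && even_nuls == 0 {
        Some("UTF-16LE")
    } else if even_nuls * 10 >= half * 8 && odd_nuls == 0 {
        Some("UTF-16BE")
    } else {
        None
    }
}
```

A real implementation would also need a policy for odd-length buffers and for text that mixes ASCII with many non-ASCII characters; this only covers the common case of ASCII-heavy source files.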
The Git way seems very reasonable to me, but I wondered if we could do better:
For the diff algorithm, we don’t have to convert anything for now, as long as we can detect newlines correctly. This is easy for ASCII, UTF-8, ISO-8859-1 (and probably Windows codepages too), since 0xa is the universal newline character for these encodings, and doesn’t appear in any other character sequence. This doesn’t work for UTF-16, unfortunately, so we might have to do something special for that (adjusting the byte sequence we search for \n should be enough).
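To make the “adjust the byte sequence for \n” idea concrete, here is a hypothetical sketch (none of these names exist in libpijul): pick the terminator bytes per encoding, and step through the buffer in code-unit-sized chunks so that UTF-16 alignment is preserved.

```rust
/// Hypothetical: the byte sequence that encodes '\n' in a given encoding.
/// For ASCII supersets this is the single byte 0x0A; UTF-16 uses a
/// two-byte code unit, whose order depends on endianness.
fn newline_seq(encoding: &str) -> &'static [u8] {
    match encoding {
        "UTF-16LE" => &[0x0A, 0x00],
        "UTF-16BE" => &[0x00, 0x0A],
        // ASCII, UTF-8, ISO-8859-1, Windows codepages: 0x0A stands alone.
        _ => &[0x0A],
    }
}

/// Split a buffer into lines, stepping by whole code units so that a 0x0A
/// byte inside a UTF-16 code unit is never mistaken for a newline.
fn split_lines<'a>(bytes: &'a [u8], encoding: &str) -> Vec<&'a [u8]> {
    let nl = newline_seq(encoding);
    let step = nl.len();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i + step <= bytes.len() {
        if &bytes[i..i + step] == nl {
            lines.push(&bytes[start..i + step]); // line keeps its terminator
            start = i + step;
        }
        i += step;
    }
    if start < bytes.len() {
        lines.push(&bytes[start..]); // trailing line without terminator
    }
    lines
}
```

This works because the code unit 0x000A is unambiguous in UTF-16 (it cannot occur inside a surrogate pair), mirroring the “doesn’t appear in any other character sequence” property of 0x0A in the ASCII supersets.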
Changes involving multiple files with different encodings must use a single character encoding to represent the text version of the diff, and UTF-8 is fine for that.
When changes are stored back as bytes before being applied, we still have the information about their encoding, so we can use it to re-encode the diff.
However, if a change is transmitted in the text format (using pijul change + pijul apply), it is ambiguous, as information about the encoding has been lost. So we would need to store that in the change itself.
Also, people working on Windows and Linux on the same repository might want to use their own favourite encodings, and not get artificial conflicts because of that. This can only be achieved in Pijul by choosing a universal encoding across all repositories.
Finally, I wouldn’t mind having the ability to add a little pre- and post-processing layer to Pijul, as this would allow us to encode source code differently, and model refactorings, file splitting, semantic patches…
As I see it, if the encoding of text files is auto detected, then the encoding is part of the file’s data that gets recorded. Thus a change of encoding would be a change to the file even if the recorded contents stay the same. It would also mean text patches would indeed need to contain the encoding of files as part of the information they convey.
If the encoding of files is specified in a Pijul version of ‘.gitattributes’, the encoding is not considered part of the file’s data. Thus the change history of files would be independent of encoding. So when changes are applied, the working copy would always be in the encoding specified by the ‘.gitattributes’ equivalent file.
In both cases text files would be converted to a uniform “storage encoding” (UTF-8) when being recorded to the repository or when represented in text patch files.
Realistically, the encoding of text files is not going to be changed often if at all. Most cases of the encoding being changed will likely be due to the file having been given the wrong encoding by accident and then being corrected.
Since one of the reasons for knowing the file encoding is to handle line endings, it would make sense to handle both in the same manner. Line endings come in three types: crlf (Windows style), lf (non-Windows), and ‘native’ (for files where the type of line ending does not matter). The first two can be auto-detected; however, it is not reasonable to automatically determine that the line-ending type does not matter for a file, as that would require being able to identify the type of each file.
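The auto-detectable part could be classified along these lines; the enum and function names are illustrative only, and a real implementation would need a policy for mixed files.

```rust
#[derive(Debug, PartialEq)]
enum LineEnding {
    CrLf,  // Windows style
    Lf,    // non-Windows
    Mixed, // both present; probably worth warning about
    None,  // no line terminators at all
}

/// Hypothetical sketch: classify the line endings of a byte-oriented,
/// ASCII-superset text buffer (this scan is wrong for UTF-16, where '\n'
/// is a two-byte code unit).
fn detect_line_ending(bytes: &[u8]) -> LineEnding {
    let mut crlf = 0;
    let mut lf = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'\n' {
            if i > 0 && bytes[i - 1] == b'\r' { crlf += 1 } else { lf += 1 }
        }
    }
    match (crlf, lf) {
        (0, 0) => LineEnding::None,
        (_, 0) => LineEnding::CrLf,
        (0, _) => LineEnding::Lf,
        _ => LineEnding::Mixed,
    }
}
```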
If a file for configuring how Pijul handles files in a repository is implemented, my personal preference would be to use a general config file format (like TOML, which is already used by Pijul for its config files) rather than a special purpose one like ‘.gitattributes’ has.
Thanks for the input. I don’t have much experience with the UX of file encodings, to be honest. Between the two options (storing the encoding in the patches and pristine vs. storing it locally), what do you think is more ergonomic, and why?
EZHWLWLCXGW7GSE3KF5LAW6W6PXSHGVWY65IAQD2XFIKIEI2PBGAC
That change detects at least 8859-1 and 1252 and displays them correctly in the change.
Writing and diff haven’t been changed: diff sees that the files have changed due to the encoding difference, and reset turns the files into UTF-8.
If I were doing the project, I would always go the config-file route if I would have said config file regardless, in order to reduce the workload.
I do not think there is much of a functional difference between having encoding as part of the file metadata or being part of a separate config file. Some people might perceive one or the other as giving a more ‘correct’ view of the history of the repository.
For the user, not having to configure it manually is technically less work. But not by much, since, as I said, the encoding of a text file is rarely going to change unless there is an accident that puts it in the wrong encoding. One advantage of using a config file is that if the file is not valid in the specified encoding, Pijul could produce an error saying so, instead of silently accepting the file. Kind of like typed versus untyped programming languages.
VMOYG7MKEWTUEEY2EOL256RWCVPGRD63IFOSKXHBGJ6VSRITLMOAC
An accident where the file changes encoding should be detectable as it will result in a large diff of duplicate lines. Unfortunately the reason for that diff may not be obvious, much like line terminator changes. Adding a new file which is not in the expected encoding could be an error either in the file or in the config.
A config file would be useful for where pijul guesses wrong, eg., decoding a file which should be binary, but I think it should be the exception.
I’m ignoring user-specific config for now as there’s probably a good number of corner cases.
246V5TYIUL7CFN7G5Y7A35EEM6IJPN532ROEYVSM7Q4HCQSWPDBQC
Spoke too soon. Now that I’ve read more of the change IO parts, it seems that we should indeed record the file’s encoding in the change. Therefore an accidental change to the encoding would not show a large diff, as the lines being compared have all been converted to UTF-8. Instead, there should be extra change metadata displayed when recording, showing the before and after file encodings. That way the user can remove that part, and the change will be recorded using the original file encoding.
This does require people to read what they’re recording, which, given the way Pijul works, seems more likely than in Git, if less in-your-face than in Darcs.
So… where to store the file encoding in the change. This is where I wish that the change contents had used Cap’n Proto for the contents data structure, backed by a plain [u8] with zero copy (for many things), as it also supports various kinds of structural evolution with default values. In lieu of such a major change, there seem to be three choices: in the selected change::Record, encoded into the [u8] contents, or in the unhashed TOML. The first two would require a new change format version, while the latter looks like it wouldn’t. However, using the unhashed part adds complexity, as it would be a somewhat parallel structure to the recorded changes, and just seems odd: why would a file encoding change not affect the hash?
Any thoughts, @pmeunier?
NYOF5766GLBTWQV2KTVRAJMGVJNJ37Z5BLJMFPZA3HG7X2Q2RXPAC
The answer to the above seems to be the [u8] encoded contents.
Putting it into the change::Record, as done in the previous change, works for display of the change and storing in the repo but not for writing the changes back to files. The information is available in the recorded change but the output functions do not currently expose that connection. This seems to be acknowledged by the fact that the [u8] encoded contents stores the file name and path, duplicating what’s in the corresponding change::Record entries.
I’m not going to be able to work on this for a few weeks at least.
W4NSLQNGQVQBB4BEXLFW5OTCBD5XQ4E2BGH3WIFGKSJBCODSUIGQC
Hi @rohan! These patches look great, thanks a lot for your work.
This is the first major contribution to Libpijul, and I have received a lot of advice about licenses since the recent alpha release, in particular from people wanting to use Libpijul in proprietary projects.
I am firmly convinced that Pijul must at all costs remain open source. Moreover, it has also been suggested that sublicensing Libpijul could provide funding for the project in the future. After some research, it turns out that there is a generic solution to this problem, called contributor license agreements, and generic templates usable in open source projects are provided by the Contributor Agreements project, sponsored by the FSFE: https://contributoragreements.org/.
I’ve added that template to the repository, and I have written instructions there: https://pijul.org/cla
In your case, since I don’t want to ask you to record your changes again, it would be enough to send an email to contact@pijul.org stating that you have permission to contribute to Libpijul, and that you agree with the CLA introduced by patch IUH7IMWES3KQTHVWA5UNHAO7QWVCC5PQJ6VLK3RC3T4F2MS74P3AC
(after reading that CLA, obviously).
Since this is my first open source project that is starting to look like something serious, I am obviously completely open to any comment that you may have.
Thanks!
Hi pmeunier, I’m more than happy to contribute under that agreement and as there’s more work to do I can just add it to the next change. However, I need to get an official waiver from my employer to do so. Not sure how long that will take.
Thanks! Sorry about that extra bit of admin, using this “standard” CLA should hopefully make it as fast as possible.
What I believe to be a must-read if you’re looking into VCS encoding: https://www.mercurial-scm.org/wiki/EncodingStrategy .
Encoding is already a big dragon, but encoding with fundamentally immutable legacy is a cursed dragon. You might make different choices than hg on the subject, but you should do so with all of those trade-offs in mind. :)
You’ve added a period to the end of that link. I’ll read https://www.mercurial-scm.org/wiki/EncodingStrategy while the wheels of bureaucracy turn.
It appears that Nest’s link detection is not fool-proof. :) (Also it keeps redirecting me to a new entry page when I instinctively do Ctrl + Enter to submit instead of… submitting)
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_encoding_tracking_problem suggests that the contents of the file should not be converted to UTF-8 and back again but a quick google finds no examples of round-tripping failures.
Since that back-again conversion seems to be mostly to support applying the text of the change directly, one option would be to append the unconverted bytes, base64 encoded. That would mean that the existing diff logic would only need to be updated to recognise the appropriate line terminator, as mentioned by pmeunier above, and there’d be no need to extend the change’s [u8] content format. That’s good, because I can’t currently see a way to make that a non-breaking change.
PEUL7HHDK65WQ2JHVUC5EMBHMQL6ITX6KWBW3KQIEJTHBRFJGH7QC
UM5DLRPBCQZQBEDJDLDPKODOKLACUHZD6YL6S4JRNKW6JLPNUVSAC
That change includes the CLA dependency. Having to read over my code to remember what was going on :-)
Fantastic! Thanks a million! I’m in the middle of a massive change (replacing the old Sanakirja with the new one), but I’ll get back to this discussion as soon as that is over.
Q3GU26WDEYE2HXMM4WWNEGVMSXLQAKRJBSB2EJZA6JTHPKQHXCXQC
PDTUHOMVQ2NUN2MMFZL7F4NI3EXOK5SOBBS34DF4ACFOJCB5NQTAC
Thanks a lot! I was planning on doing that myself, I made these conflicts after all!
Those merge changes aren’t displaying correctly in the Nest with lots of duplicated lines. Looks fine locally using pijul 1.0.0-alpha.41
Also trying to amend that last lot of conflicts into the merge change caused Pijul to think I’d deleted and replaced the entirety of libpijul/src/diff/mod.rs. That issue can be recreated by unrecording PDTUHOMVQ2NUN2MMFZL7F4NI3EXOK5SOBBS34DF4ACFOJCB5NQTAC
GDDYJH7AJYJ2DWTYVP6WS3FPPRTDRQAXZ2BH2CKIFGMIASTSLD4QC
LCXEUK7KIZTFHTMBBDCURKU3CU62YG4NWKEU65UM5HGXMWIBHAHAC
Hi! Thanks a lot! I only got 2 conflicts, with stuff I added today or yesterday, and they’re fixed.
It seems you forgot to add the text_encoding module.
Oops. I’ll claim that “diff didn’t mention the new file” and not my failing memory :-/
XSEODPNEN2Y2THBRO7L5QFPAEQVSQTLAFZFWCRMBGZ3YSRZB2UJAC removed all permissions from the new files and most from the new directory
XR7MNOMU5PMOOEY2EPPUABZ7NOP432RDCWUET23ONPXTT3JQIFIAC
I was trying to understand why the text encoding tests were passing but when running real records on the same files the result was different. The issue is that the recorded change is based on the text shown in the editor parsed back via the text_change’s read function. That makes it more difficult to avoid any UTF-8 roundtrip if there’s a desire to edit which lines in the hunks are applied in future.
For now my plan is to change Change::read_and_deps to take the original change written out to the editor and use the hunk index to obtain the original text bytes. Can you see any issues with that @pmeunier?
XSEODPNEN2Y2THBRO7L5QFPAEQVSQTLAFZFWCRMBGZ3YSRZB2UJAC removed all permissions from the new files and most from the new directory
Yes, sorry about that, this is intentional. I released a new Pijul on crates.io just before pushing these changes. They’re related to discussion #358, if you want to understand the exact reason (it’s about security).
This is actually the intended way, but I’m also aware that some users have tried editing not only which lines are in the hunks, but also editing the lines themselves. This is supposed to work for line additions, but the results for deletions are counter-intuitive (editing or removing the lines starting with - doesn’t do anything for now).
First, I don’t think UTF-8 roundtrips can be avoided at all when multiple encodings are used in the same patch. I’m also aware that some users expect to be able to edit the patches themselves: not only choose which individual lines are in each hunk, but also edit the contents of these lines. At the moment, this seems to work (somewhat unintentionally) for line insertions, and doesn’t do anything for deletions (but this is fixable).
I took some time away from Pijul in the last few days, after the giant marathon to complete the new backend and fix a few issues with that (including a segfault!). I’m getting back to it now, and I hope to apply your changes today or tomorrow.
I’d expect to exclude some lines within a hunk. Arbitrary editing of the hunk contents feels like it would be rarely safe/useful (for source code at least). Maybe to correct comments but otherwise that’s changes which have never been past any IDE or static verification tools I might be using.
It’s not ready to merge to main but feel free to improve what’s there - I’ll most likely have only a few hours per week to look at it for the next while.
I’ve gone with the round-tripping version, as that’s by far the easiest to implement given the current recording logic. Adding new files and recording basic changes work, except that there seem to be extra line endings added to the change. Other hunk types don’t support file encoding yet, but I think they should just follow what has already been done.
UTF-16, and any other encoding where lines don’t end in \n, hasn’t been started.
NG3Z3DOKDZQDQ7UNQEOMXWX2NJKKLM7DKRWW3KIUD7QEECTNQIZQC
LYZOL6PMVGZBFHJPLJHG5E6CURSDZDQMNTSOU6LSNQQKFFGZM44AC
For correct display of deleted files the file encoding will need to be recorded in the pristine and it looks like the pristine inode, specifically InodeMetadata, is the appropriate place. InodeMetadata is currently very compact (a single u16) so do you see any issues adding the encoding label which may be up to 20 characters long?
I’d forgotten that the InodeMetadata is stored encoded into the first two bytes of the change contents and the same encoding applies to the pristine’s graph of folders and files. That encoding/decoding is spread around so to ensure the changes to InodeMetadata apply everywhere I’ll centralise that. What to call the new centralised thing which has the filename and inode metadata… nothing springs to mind so maybe something boring like FileMetadata.
One inconvenience with storing the encoding with the content’s metadata is that, for efficient serialisation, the metadata may need to be written after the file contents, as the true encoding isn’t known until then. The alternative would be to read the file contents into a temporary buffer, write the metadata first as is currently done, then append the file buffer. That would reduce the scope of the change, but I’m not sure whether the cost of Vec::append (a memcpy, so linear in the size of the appended buffer) is acceptable.
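The temporary-buffer alternative might look roughly like this; write_entry and the detect callback are hypothetical names for this sketch, not libpijul’s API.

```rust
use std::io::{self, Read, Write};

/// Sketch of the "temporary buffer" alternative: read the file fully so the
/// encoding can be detected up front, then write a length-prefixed encoding
/// label followed by the raw contents, keeping the metadata first on disk.
fn write_entry<W: Write, R: Read>(
    mut out: W,
    mut file: R,
    detect: impl Fn(&[u8]) -> &'static str, // hypothetical encoding detector
) -> io::Result<()> {
    let mut contents = Vec::new();
    file.read_to_end(&mut contents)?; // the extra copy this approach pays for
    let label = detect(&contents);
    out.write_all(&[label.len() as u8])?; // encoding labels are short
    out.write_all(label.as_bytes())?;
    out.write_all(&contents)?;
    Ok(())
}
```

The cost of the append itself is a single memcpy of the file contents, so the real trade-off is holding the whole file in memory once more, not the append’s asymptotics.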
3S6LU2U5TIU43WRE5RQS3IOLVJLVNCDL4W44FVK2HR3NAXZ7IDUAC
ZSF3YFZTDOXMCOC3HOT5C6MQLYLWOR7QJAOUDS2M2Z4SC2YW3GRAC
These last two changes are where I’m at with trying to get encoding for deleted files working using encoding stored in the change metadata. No encoding is found and the deleted files are treated as binary.
Not sure how to progress this - one possibility is to try and extend the TreeTxnT’s Inode tree as that seems to reflect the pristine’s directory/file structure.
The encoding tests are broken on metadata deserialisation but that doesn’t affect real recording, unrecording or reset.
Alright, merging this discussion is now my current target.
Going through your changes now, I wanted to say thank you! This is great work!
Once I’m done solving the conflicts, I’ll push them to main, that’ll make it easier to test.
To answer your last comment, the “tree” DB is a tree of file “inodes” (random numbers generated locally), “revtree” is the backwards map (i.e. going up the tree). These are used to assign identifiers to files currently tracked in the working copy. Then, there’s also the inodes/revinodes tables to map these to graph vertices.
My pleasure!
Hopefully that issue with the deletion encoding is just that I’m not grabbing the right bit of the context. Also you may notice a bunch of inefficiencies with cloning, String and Vec wrapping etc. as I’m still at that stage in Rust of throwing things at the borrow checker until it stops complaining.
As mentioned somewhere above, this doesn’t necessarily support UTF-16 and crlf (windows/dos) line terminators don’t have special handling so the cr is treated as part of the text.
With reference to your Zulip comment about tags breaking the repo format, these changes also break the repo format. Changing that to recognise the prior format and default to UTF-8 for all existing non-binary files seems like it should be possible but needs custom Hunk deserialisers. Might be easier to combine these breakages into one upgrade hit.
Also you may notice a bunch of inefficiencies with cloning, String and Vec wrapping etc. as I’m still at that stage in Rust of throwing things at the borrow checker until it stops complaining.
No worries, I know the feeling, and I can help with that!
With reference to your Zulip comment about tags breaking the repo format, these changes also break the repo format.
That’s ok, we’re still in alpha. I’d rather break the format one or two more times than not fix the file encoding issue. This is not a lossy operation anyway, so we can always convert (unlike the previous patch format change).
I believe I agree with your design choices, correct me if I’m wrong:
The contents is stored in the changes as raw binary, and only mapped to characters later, based on extra metadata stored in the changes to indicate the encoding.
When using the text format for patches, we show the patch as UTF-8, and write the encoding name inside the text format, for each hunk that has contents to parse. I note that your current implementation also does it for things like FileDel, where the contents isn’t parsed at the moment, but could be in the future, so it’s probably safe to include that label.
I’m thinking of amending your design in the following way, which also happens to be backwards-compatible with the current patch format:
InodeMetadata is a u16, and used to have 10 bits used, but now only 2 bits (at positions 0 and 9) are used. One way could be to use one new bit (for example bit 10) of that field to mean “non-UTF-8”, in which case the encoding label would come immediately after, in the file name, encoded by an initial u8 indicating the label’s length, followed by the actual label. There’s no real need to be very compact here, since this field is (1) optional, so users with special/embedded/WASM applications can always use UTF-8, and (2) compressed anyway, so we don’t gain much by trying to save bytes.
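To make the proposal concrete, here is a hypothetical sketch of that layout; the flag value, byte order, and function names are illustrative, not libpijul’s actual format.

```rust
/// Illustrative flag: bit 10 of the u16 metadata means "non-UTF-8 encoding",
/// in which case a length-prefixed encoding label follows the metadata,
/// before the file name.
const NON_UTF8: u16 = 1 << 10;

fn encode_meta(meta: u16, encoding: Option<&str>, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    let meta = if encoding.is_some() { meta | NON_UTF8 } else { meta };
    out.extend_from_slice(&meta.to_be_bytes());
    if let Some(label) = encoding {
        out.push(label.len() as u8); // labels are short ("ISO-8859-1", ...)
        out.extend_from_slice(label.as_bytes());
    }
    out.extend_from_slice(name.as_bytes());
    out
}

fn decode_meta(bytes: &[u8]) -> (u16, Option<String>, String) {
    let meta = u16::from_be_bytes([bytes[0], bytes[1]]);
    let mut pos = 2;
    let encoding = if meta & NON_UTF8 != 0 {
        let len = bytes[pos] as usize;
        pos += 1;
        let label = String::from_utf8(bytes[pos..pos + len].to_vec()).unwrap();
        pos += len;
        Some(label)
    } else {
        None // absent flag means UTF-8, matching the old format
    };
    let name = String::from_utf8(bytes[pos..].to_vec()).unwrap();
    (meta & !NON_UTF8, encoding, name)
}
```

Because old changes never set the new bit, a reader with this logic decodes them exactly as before, which is what makes the extension backwards-compatible.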
After reading Mercurial’s page about encodings, I believe a major difference with Mercurial is that file contents in Pijul may be stored in lots of small patches, and these patches may not all use the same file encoding. I don’t think any reasonable text editor would change the encoding of just one part of the file, though, so one possible solution would be to add a new kind of hunk, called for example ChangeMetadata, to make it explicit, in the text format for changes, that a user is changing the encoding metadata and/or the permissions of a file. This can be backwards-compatible with the current patch format (by adding a variant of Hunk as the last variant).
On a related note, I’ve moved a little bit on the binary diff question, see https://nest.pijul.com/pmeunier/bindiff
The contents is stored in the changes as raw binary, and only mapped to characters later, based on extra metadata stored in the changes to indicate the encoding.
Yes, only layers which display text to the user or read from the text-change format need to know about encoding.
I note that your current implementation also does it for things like FileDel, where the contents isn’t parsed at the moment, but could be in the future, so it’s probably safe to include that label.
FileDel has it so that the inverse of FileAdd isn’t lossy. It’s also then available to FileUndel. Otherwise, the Hunks track encoding so that it’s available for print_change_contents.
One way could be to use one new bit (for example bit 10) of that field to mean “non-utf8”, in which case the encoding label would come immediately after in the file name, encoded by an initial u8 indicating the label’s length, followed by the actual label.
I think for efficiency this label should come after the main change content. Removing the FileMetadata structure as part of that split isn’t a bad thing, as it turned out most places didn’t need all three parts anyway.
After reading Mercurial’s page about encodings, I believe a major difference with Mercurial is that file contents in Pijul may be stored in lots of small patches, and these patches may not all use the same file encoding. I don’t think any reasonable text editor would change the encoding of just one part of the file though
Embedding different encodings does sound like madness… though it turns out that Emacs has recode-region, so I guess someone has seen it before. That’s another reason why I was looking to store the encoding against the pristine file.
I was going to say that it’s a non-issue anyway, as an encoding change would change the existing file, and so the “small patch” would be the entire file. However, that doesn’t allow for unchanged lines which happen to contain text that is valid in both encodings. With those excluded from the encoding-change patch, I guess it’d be possible for later changes to those lines from some other source to remain in the old encoding, even when that’s no longer a proper encoding for the file as a whole.
Thanks for the clarification!
Removing the FileMetadata structure as part of that split isn’t a bad thing as it turned out most places didn’t need all three parts anyway.
Indeed. We still have to deserialize it every time, but that could be made efficient by using the details of the bincode format, which is stable.
Alright, I think I understand your work now; I just have a few conflicts to fix and I’ll merge it. I’ll write the converter for the change format and add a temporary pijul upgrade command.
One remaining issue I see is diffing files that have changed encoding, or have exotic encodings like UTF-16. In particular, line splitting is currently a bit naïve, as it assumes that all lines end in \n.
Diffing encoding changes using only the new encoding does work somewhat, in that the previously encoded text comes through corrupted, so it’s obvious that something has changed. I would expect the mechanism for deleted files could be used to obtain the prior encoding and display that as a change by itself.
Looking forward to the merge!
HZD7CFQU4HU2HIXADBO2E6J334MIL3MBRIVC4ZWNMAR77TZ5UJDQC
Cool conflicts, they triggered a bug in the conflict resolution code.
TVVW53HZGYPODAXEQ4BFZNSPBOFG6JEDVOKIYIDZMWFAMOBKOR2QC
Sounds like a plan. Also, I note that you tagged me in that other discussion but I don’t recall getting a notification.
Hi Rohan! Two questions:
Any reason you used the fork of tree_magic instead of the original? tree_magic_mini crashes on my machine, and tree_magic works fine, with the same API.
Is the Nest faster in New Zealand now?
I originally used tree_magic but the other claimed to be smaller/faster/have newer dependencies.
Nest is much much much faster at ~11ms. Cloudflare’s site in Auckland?
One question in response… is there anything I can help with on this?
I originally used tree_magic but the other claimed to be smaller/faster/have newer dependencies.
I saw these claims, but I’m not fully sure. Since it panics on my machine, I investigated a little, and found that:
- tree_magic_mini is already at version 3.0, which is surprising. I can understand that different people have different expectations for version numbers, but in the Rust ecosystem, 3.0 means you’ve survived a number of battles.
- petgraph, for both.
- tree_magic has more than 10 times more downloads, which doesn’t necessarily mean that it’s better, but at least it receives more testing on more platforms, which is quite crucial for this kind of system-dependent crate (the fact that the author of the fork doesn’t care about that is a bit worrying).

Nest is much much much faster at ~11ms. Cloudflare’s site in Auckland?
Yes, and New Zealand is served from the Singapore server, so even uncached pages probably take much less time. I’m glad to hear that this has improved. In the process, I’ve also learned to distrust ping as a measure of latency: a low ping certainly helps with connection stability, but since Cloudflare is a proxy, what really matters is the time to first byte of a page, or in other words, how much of the content they have in cache.
One question in response… is there anything I can help with on this?
Not really, unfortunately. I’m working on the other “format change” issue: I should be done today or tomorrow, and we should be able to start testing the new format very soon.
TTFB for the CSS and JS is good at 20-60ms but the small personal icons take over 1s, consistently over multiple reloads.
Thanks! I just fixed this.
Alright, I’ve made progress again: now that the Nest has been operating without any interruption (not even a second) for about a week, I would find it hard to change the patch format and once again break users’ hopes about this project.
So, I just wrote a patch to allow converting “online”, meaning that when Pijul sees the old format, it knows how to convert. I just tested, it works!
I’m ready to merge! I’ll write a blog post about all the recent changes in the infrastructure. After that, if you want to write a guest blog post about this really cool achievement, just reach out to me on zulip, or by email (pe@pijul.org).
I just have one last element of doubt about the encoding to choose: since I would expect most (if not all) non-experimental repositories are using UTF-8 as the default encoding now (probably because it was the only supported encoding in the diff), I’m doing the conversion by adding that encoding automatically. Do you foresee any problem with that?
Yes, UTF-8 seems the correct default, aside from files known to be binary, obviously. I’ll think about the blog post.
This is now on main! Thanks a lot.
After testing a bit, I had to change one thing: tree_magic is 100 to 1000 times slower than reading the file, which isn’t quite acceptable. Also, it almost always returns “application/…” on the C files from the Linux kernel, so I’m using mime_guess instead, which returns the correct result in no time. Decoding takes about the same time as reading the file (I have an SSD, but still, reading is a system call).
As I’m trying to get binary diff to work, I submitted a PR to chardetng to simplify the detection: https://github.com/hsivonen/chardetng/pull/5
OK, looking through the rules embedded in tree_magic, there are definitely some issues where things we should treat as text are marked as application (like application/x-shellscript!), and a lot of obscure file types are checked. So going straight to the decoder seems like the correct action.
It seems that ISO-8859, UTF-16 (little endian) and Windows code page 1252 encoded files are being treated as binary.