Pijul detects a zip file built in store-only mode (-0) as windows-1252 encoded text.
Steps to reproduce:
❯ echo "Hello" > hello
❯ zip -0 hello.zip hello
❯ pijul init
❯ pijul add hello.zip
❯ pijul diff
I found this problem while trying to add a jar file to a Pijul repository.
Here is an alternative file to try; it is the one that prompted the investigation. It is a plain .jar (that is, just a zip with another extension) that stores a mix of Java classes and text files.
After some research, I found that a zip can be detected by checking whether the first 4 bytes of the file are 0x50 0x4b 0x03 0x04 (the local file header signature). For more accurate detection, one can also check that the file has an EOCD (end of central directory) record at the end; reading the entire file can be avoided by looping backward until 0x50 0x4b 0x05 0x06 is found. Combining the two strategies (looking at the start first and only then doing the reverse walk, to avoid walking files that are not zip files) would give accurate detection. However, an empty zip file consists of the EOCD alone, so it starts with 0x50 0x4b 0x05 0x06 instead, which is sad.
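A minimal sketch of the combined check in Rust, assuming the whole file is already in memory. The function name `looks_like_zip` is hypothetical and this is not the code from the submitted change. It accepts either the local file header signature (0x50 0x4b 0x03 0x04) or the EOCD signature at the start, so that an empty archive is still recognised:

```rust
/// Signature of a ZIP local file header (start of a non-empty archive).
const LOCAL_FILE_HEADER: [u8; 4] = [0x50, 0x4b, 0x03, 0x04];
/// Signature of the end-of-central-directory (EOCD) record.
const EOCD: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

/// Hypothetical helper: returns true if `bytes` looks like a ZIP archive.
fn looks_like_zip(bytes: &[u8]) -> bool {
    // Cheap check first: reject anything that does not start with a ZIP
    // signature. An empty archive is the EOCD alone, so accept that too.
    if !(bytes.starts_with(&LOCAL_FILE_HEADER) || bytes.starts_with(&EOCD)) {
        return false;
    }
    // Reverse walk: test 4-byte windows from the end toward the start
    // and stop at the first EOCD signature found.
    bytes.windows(4).rev().any(|w| w == &EOCD[..])
}
```

Because the start check runs first, ordinary text files are rejected without the reverse walk ever happening.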
I have written some experimental detection code of my own and I’ll be sending the change soon, but I’m not sure whether it is better to take an extensible approach (to allow for future detection logic) or a simple approach for this corner case.
I struggled a bit with sending a change, but here it is. Please be critical of these additions; this is my first attempt at contributing.
Edit: I accidentally sent two changes instead of one, and one of them contains the algorithm alone.
We could also skip the last 18 bytes (instead of 4), because they are guaranteed to be other EOCD fields, but I think this is enough.
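For illustration, a sketch of that tighter bound in Rust. It relies on the EOCD record being at least 22 bytes long (a 4-byte signature followed by 18 bytes of fixed fields), so the signature can never start in the last 18 bytes of the file. The name `find_eocd` is hypothetical and not taken from the submitted change:

```rust
/// Minimum EOCD size: 4-byte signature plus 18 bytes of fixed fields.
const EOCD_MIN_LEN: usize = 22;
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

/// Hypothetical helper: offset of the last EOCD signature, if any.
fn find_eocd(bytes: &[u8]) -> Option<usize> {
    // The signature cannot start later than `len - 22`; files shorter
    // than 22 bytes cannot contain an EOCD record at all.
    let last_start = bytes.len().checked_sub(EOCD_MIN_LEN)?;
    // Walk backward from the latest possible position.
    (0..=last_start)
        .rev()
        .find(|&i| bytes[i..i + 4] == EOCD_SIGNATURE)
}
```

For an archive without a trailing comment the very first position tested is the match, so the walk ends immediately.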
Hi! Thanks for these patches (and sorry for the late answer). I improved the binary detection a bit today, before I saw this discussion.
What do you think about #I3HDN5CSJMZKLRDGNFCT64UK3ATHL45M3STDVH4LYN7VI6UVJORQC?