pijul/pijul - Discussion #547 - [Encoding] pijul detects zip files as windows-1252 encoded text instead of binary

#547 [Encoding] pijul detects zip files as windows-1252 encoded text instead of binary

Closed on November 27, 2021

Jonathan on September 22, 2021

Pijul detects zip compressed (built with store-only mode/-0) file as windows-1252 encoding text.

Steps to reproduce:

❯ echo "Hello" > hello
❯ zip -0 hello.zip hello
❯ pijul init
❯ pijul add hello.zip
❯ pijul diff

I’ve found this problem when trying to add a jar file to a pijul repository.

Here is an alternative file to try, it was the spike for the investigation. This file is a mere .jar (that is just a zip with another extension) which stores a mix of Java classes and text files.

Jonathan on September 22, 2021

By doing some researches, I found that detecting a zip can be done by checking if the file 4 initial bytes are 0x50 0x4b 0x05 0x06, and for more accurate detection, if it does have an EOCD at the end, reading the entire file can be avoided by doing a backward loop until 0x50 0x4b 0x05 0x06 is found. Combining two strategies (looking first at the start and then doing the reverse walk, to avoid doing it in files that are not zip files) would result in accurate detection, however, empty zip files do not have the EOCD, which is sad.

I have done an experimental detection code on my own and I’ll be sending the change soon, but I’m not sure if it is better to do an extendable approach (to allow future detection logics), or a simple approach for this corner case.

Jonathan added a change on September 22, 2021

Add zip file detection

XCVIBS6UMRDFRMLIBFSQNKV3Q7VXL3NFH4BTV4BLQZEG5OHJQQMQC

Jonathan added a change on September 22, 2021

Add zip file detection

XCVIBS6UMRDFRMLIBFSQNKV3Q7VXL3NFH4BTV4BLQZEG5OHJQQMQC

Jonathan added a change on September 22, 2021

Add missing binary detect

XW7FQBBPNFV5LAABODXFYRKWOI5OWDUH4YF3VLCESUPOH4NUPZIAC

Jonathan on September 22, 2021

I’ve struggled a bit with sending a change, but here it is, be critical about those additions, this is my first attempt to contribute.

Edit: I’ve accidentally sent two changes instead of one, and accidentally one of them was the algorithm alone.

Jonathan on September 22, 2021

We could also skip 18 bytes at the end (instead of 4) because they are guaranteed to be other descriptors, but I think that this is enough.

Jonathan on September 29, 2021

I’ve found that the same happens with *.png files, because it starts with ‰PNG␍␊␚␊, maybe it is better to use a library like infer ?

Jonathan added a change on October 1, 2021

Use infer to detect binary files

ZZWJK65NGWG2OMYDAXBQ56GDEGMJ3GYJCZSFC62JCTGVVSUV2YUAC

Jonathan added a change on October 1, 2021

Improve the is_binary code

EILG3YI4EFK6L2NL5DUZC27XG5BFCVF3PMNVMO2OG6YNZKSTHZXQC

rohan on October 5, 2021

We were using https://crates.io/crates/tree_magic but pmeunier discovered it had very poor performance. Partly it’s that it tries to determine exact file types whereas we only care if it’s plain text or not.

pmeunier on October 19, 2021

Hi! Thanks for these patches (and sorry for the late answer), I just improved the binary detection a bit today, before I saw this discussion.

What do yous think about #I3HDN5CSJMZKLRDGNFCT64UK3ATHL45M3STDVH4LYN7VI6UVJORQC?

Skia on November 22, 2021

@Jonathan @rohan @pmeunier

The best way to avoid edge cases is to have explicit configuration of how files should be treated (.gitattributes equivalent).

Example .pijulattr.toml (assuming pijul still uses toml files for its configurations):

[[attributes]]
dirs = [''] # empty directory string refers to root of repo and applies the attribute set as the defaults
exts = [''] # empty string means files without a '.' in the file name
encoding = true # true means auto detect

[[attributes]]
exts = ['zip', 'png'] # applies to files with a given extension; case insensitive
encoding = false # false means not a text file

[[attributes]
exts = ['txt']
encoding = 'UTF-8'  # explicit encoding

[[attributes]]
dirs = ['bar'] # set default for directory 'bar'. Directory settings have lowest priority, so 'bar/xyz.zip' would be binary.
files = ['foo.txt'] # explicit file names have the highest priority, so 'foo.txt' is ISO-8859 rather than UTF-8.
encoding = 'ISO-8859'

# if two attribute sets would apply with same priority, `pijul record` would fail with an error message.

Getting this in place also lays the ground work for how any other configurable file handling would be configured in the future.

If this seems good I will open a discussion for it and get the code for reading such an attribute file set up.

pmeunier on November 27, 2021

Hi @Skia, I think this is highly relevant indeed, feel free to open another discussion.

pmeunier closed this discussion on November 27, 2021