After some research, I found that a zip file can be detected by checking whether its first 4 bytes are 0x50 0x4b 0x03 0x04 (the local file header signature). For more accurate detection, since a zip file ends with an end of central directory record (EOCD), reading the entire file can be avoided by scanning backward from the end until the EOCD signature 0x50 0x4b 0x05 0x06 is found. Combining the two strategies (checking the start first, then doing the reverse walk only if it matches, so that non-zip files are rejected cheaply) would give accurate detection. However, empty zip files contain only the EOCD and no local file header, so the check at the start misses them, which is sad.
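To make the idea concrete, here is a minimal sketch of the two-step check in Rust. It is not the code from the change: `looks_like_zip` is a hypothetical name, and accepting a file that starts with the EOCD signature (which is how an archive with no entries begins) is my own reading of the zip format, added so that empty archives are not rejected by the first step.

```rust
/// Signature that starts every local file header ("PK\x03\x04").
const LOCAL_FILE_HEADER: [u8; 4] = [0x50, 0x4b, 0x03, 0x04];
/// Signature that starts the end of central directory record ("PK\x05\x06").
const EOCD: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

/// Hypothetical helper: returns true if `data` looks like a zip archive.
pub fn looks_like_zip(data: &[u8]) -> bool {
    // Step 1: cheap check on the first four bytes, so that files that are
    // clearly not zip archives never trigger the backward scan.
    // Assumption: an empty archive starts directly with the EOCD signature.
    if !data.starts_with(&LOCAL_FILE_HEADER) && !data.starts_with(&EOCD) {
        return false;
    }
    // Step 2: walk backward over every 4-byte window until the EOCD
    // signature is found. The EOCD sits near the end of the file, so the
    // scan normally stops after a handful of windows.
    data.windows(4).rev().any(|window| window == &EOCD[..])
}
```

This works on an in-memory buffer; a version that reads only the tail of a file from disk would follow the same two steps.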
I have written some experimental detection code and will send the change soon, but I’m not sure whether an extensible approach (to allow future detection logic) or a simple approach for this corner case is better.
XCVIBS6UMRDFRMLIBFSQNKV3Q7VXL3NFH4BTV4BLQZEG5OHJQQMQC
XW7FQBBPNFV5LAABODXFYRKWOI5OWDUH4YF3VLCESUPOH4NUPZIAC
I struggled a bit with sending a change, but here it is. Please be critical of these additions; this is my first attempt at contributing.
Edit: I accidentally sent two changes instead of one, and one of them contains the algorithm alone.
We could also skip 18 bytes at the end (instead of 4), because they are guaranteed to hold the other EOCD fields rather than the signature, but I think that this is enough.
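A rough sketch of that refinement, for illustration only. The EOCD is at least 22 bytes long (4 signature bytes plus 18 bytes of other fields), so its signature cannot start later than 22 bytes before the end of the file. The upper bound on the scan is my own addition: the trailing comment length is stored in a 16-bit field, so the signature cannot start more than 22 + 65535 bytes before the end either. The names `eocd_candidate_offsets` and `find_eocd` are hypothetical.

```rust
use std::ops::Range;

/// Signature that starts the end of central directory record.
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
/// Fixed part of the EOCD: 4 signature bytes + 18 bytes of other fields.
const EOCD_MIN_LEN: usize = 22;
/// The archive comment length is a 16-bit field (assumption used to bound the scan).
const MAX_COMMENT_LEN: usize = u16::MAX as usize;

/// Hypothetical helper: offsets at which an EOCD signature may start
/// in a file of `file_len` bytes, or `None` if the file is too short.
pub fn eocd_candidate_offsets(file_len: usize) -> Option<Range<usize>> {
    if file_len < EOCD_MIN_LEN {
        return None;
    }
    // Latest possible start: an EOCD with an empty comment.
    let last = file_len - EOCD_MIN_LEN;
    // Earliest possible start: an EOCD followed by the longest comment.
    let first = last.saturating_sub(MAX_COMMENT_LEN);
    Some(first..last + 1)
}

/// Hypothetical helper: scans backward over the candidate offsets only
/// and returns the offset of the last EOCD signature, if any.
pub fn find_eocd(data: &[u8]) -> Option<usize> {
    eocd_candidate_offsets(data.len())?
        .rev()
        .find(|&i| data[i..i + 4] == EOCD_SIGNATURE[..])
}
```

The sketch only locates the signature; it does not check that the comment length stored in the record agrees with the number of bytes that follow it.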
ZZWJK65NGWG2OMYDAXBQ56GDEGMJ3GYJCZSFC62JCTGVVSUV2YUAC
EILG3YI4EFK6L2NL5DUZC27XG5BFCVF3PMNVMO2OG6YNZKSTHZXQC
We were using https://crates.io/crates/tree_magic but pmeunier discovered it had very poor performance. Partly it’s that it tries to determine exact file types whereas we only care if it’s plain text or not.
Hi! Thanks for these patches (and sorry for the late answer), I just improved the binary detection a bit today, before I saw this discussion.
What do you think about #I3HDN5CSJMZKLRDGNFCT64UK3ATHL45M3STDVH4LYN7VI6UVJORQC?
The best way to avoid edge cases is to have explicit configuration of how files should be treated (a .gitattributes equivalent).
Example .pijulattr.toml (assuming pijul still uses toml files for its configuration):
[[attributes]]
dirs = [''] # empty directory string refers to root of repo and applies the attribute set as the defaults
exts = [''] # empty string means files without a '.' in the file name
encoding = true # true means auto detect
[[attributes]]
exts = ['zip', 'png'] # applies to files with a given extension; case insensitive
encoding = false # false means not a text file
[[attributes]]
exts = ['txt']
encoding = 'UTF-8' # explicit encoding
[[attributes]]
dirs = ['bar'] # set default for directory 'bar'. Directory settings have lowest priority, so 'bar/xyz.zip' would be binary.
files = ['foo.txt'] # explicit file names have the highest priority, so 'foo.txt' is ISO-8859 rather than UTF-8.
encoding = 'ISO-8859'
# if two attribute sets would apply with same priority, `pijul record` would fail with an error message.
Getting this in place also lays the groundwork for how any other configurable file handling would be configured in the future.
If this seems good I will open a discussion for it and get the code for reading such an attribute file set up.
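To illustrate the priority rules from the example above (file names beat extensions, extensions beat directories, ties are an error), here is a rough sketch in Rust. It is only one possible reading of the proposal and does no TOML parsing. All names (`Encoding`, `AttributeSet`, `resolve`) are hypothetical, and three points are my own assumptions: entries in `files` are matched against the whole path relative to the repository root, a deeper directory beats a shallower one (otherwise the root default would always tie with 'bar'), and two sets with the same priority only conflict when their encodings differ.

```rust
use std::cmp::Ordering;
use std::path::Path;

#[derive(Debug, Clone, PartialEq)]
pub enum Encoding {
    Auto,          // encoding = true
    Binary,        // encoding = false
    Named(String), // encoding = 'UTF-8'
}

#[derive(Debug, Clone)]
pub struct AttributeSet {
    pub dirs: Vec<String>,
    pub exts: Vec<String>,
    pub files: Vec<String>,
    pub encoding: Encoding,
}

impl AttributeSet {
    pub fn new(dirs: &[&str], exts: &[&str], files: &[&str], encoding: Encoding) -> Self {
        let own = |items: &[&str]| items.iter().map(|s| s.to_string()).collect();
        AttributeSet { dirs: own(dirs), exts: own(exts), files: own(files), encoding }
    }
}

/// Rank of the strongest key in `set` that matches `path`, if any.
/// Ranks compare as (class, directory depth): 2 = file, 1 = extension, 0 = directory.
fn match_rank(set: &AttributeSet, path: &Path) -> Option<(u8, usize)> {
    if set.files.iter().any(|f| path == Path::new(f)) {
        return Some((2, 0));
    }
    // `None` means the file name contains no '.', which the empty string selects.
    let name = path.file_name()?.to_str()?;
    let ext = name.rsplit_once('.').map(|(_, e)| e);
    let ext_hit = set.exts.iter().any(|want| match ext {
        None => want.is_empty(),
        Some(e) => !want.is_empty() && want.eq_ignore_ascii_case(e),
    });
    if ext_hit {
        return Some((1, 0));
    }
    // Assumption: among matching directories, the deepest one wins.
    let parent = path.parent().unwrap_or(Path::new(""));
    set.dirs
        .iter()
        .filter(|d| d.is_empty() || parent.starts_with(d.as_str()))
        .map(|d| (0, Path::new(d).components().count()))
        .max()
}

/// Returns the encoding for `path`, `Ok(None)` if no set applies, or an
/// error if two sets with different encodings apply with the same rank.
pub fn resolve(sets: &[AttributeSet], path: &Path) -> Result<Option<Encoding>, String> {
    let mut best: Option<((u8, usize), &Encoding)> = None;
    let mut conflict = false;
    for set in sets {
        let Some(rank) = match_rank(set, path) else { continue };
        match best.map(|(top, enc)| (rank.cmp(&top), enc)) {
            None | Some((Ordering::Greater, _)) => {
                best = Some((rank, &set.encoding));
                conflict = false;
            }
            Some((Ordering::Equal, enc)) => conflict |= *enc != set.encoding,
            Some((Ordering::Less, _)) => {}
        }
    }
    if conflict {
        return Err(format!("conflicting attribute sets for {}", path.display()));
    }
    Ok(best.map(|(_, enc)| enc.clone()))
}
```

With the four sets from the example, 'bar/xyz.zip' resolves to binary through its extension and 'foo.txt' resolves to ISO-8859 through its file name, which matches the comments in the example.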
Pijul detects a zip file built in store-only mode (-0) as windows-1252 encoded text.
Steps to reproduce:
I found this problem when trying to add a jar file to a pijul repository.
Here is an alternative file to try; it is what sparked the investigation. This file is a plain .jar (which is just a zip with another extension) that stores a mix of Java classes and text files.