I had a great experience using Kaitai in a previous job. We were decoding proprietary binary messages from Teltonika OBD GPS trackers. The online editor, https://ide.kaitai.io/, is really nice for developing and testing your definition. You can store multiple binary files in local storage, and you get a nice detailed look at the data and how your definition is parsing it.
Interesting. I didn't know anyone had come up with a declarative language for binary files.
There's actually more than one, though Kaitai probably has the most maturity of any of them.
Various hex editors have their own formats. 010 Editor has C-style binary templates, and ImHex has a binary pattern language as well. Okteta has Okteta Structure Definitions, which can be declared using XML or JS.
Kaitai Struct is the most complete system that has code generation for multiple programming languages and isn't tied to a hex editor or anything else for that matter. That said, I think there's still a ton of room for improvement and innovation. Kaitai has a lot of useful tooling, but I think as it is today it falls a bit short: the code gen is not at the same support level for all languages (most languages are fairly limited), and I think serialization is still mostly experimental. Beyond that, there's probably a lot you could still do to make it more expressive and powerful.
An adjacent or complementary field is description of data in transit. Wireshark dissectors come to mind. I think it'd be quite useful to unify these fields.
I had been trying to make a Kaitai-to-Wireshark dissector compiler in my third-party Kaitai implementation[1]. However, the Wireshark emitter is still basically useless for now. It only supports basic structs with basic attrs.
I mainly started a third-party Kaitai implementation to experiment a bit with supporting new features in Go, and also just to have a native Go implementation for convenience, since I'm still not very good at Scala. However, once an approach is developed for how exactly to handle emitting to Wireshark it should be purely mechanical to graft on a Wireshark emitter to the upstream Kaitai Struct compiler, too.
[1] https://github.com/jchv/zanbato
Have you used Google's Wuffs?
No, though I am familiar with it. I wouldn't have classified Kaitai and Wuffs as being the same category of software, though I can see why you would.
In addition to languages, there's a Python library called "construct" that's been around for a long time. It uses a declarative style to make it surprisingly easy to make binary parsers and emitters.
https://construct.readthedocs.io/en/latest/intro.html#exampl...
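To give a feel for what "declarative" buys you, here is a minimal sketch that uses only the standard-library `struct` module. It is not construct's API, and the `HEADER` layout, field names, and the `parse`/`build` helpers are all invented for illustration. The point is just that one description of the layout drives both the parser and the emitter.

```python
import struct

# Hypothetical message header, invented for this example:
# (field name, struct format code), in wire order.
HEADER = [
    ("magic", "4s"),    # 4 raw bytes
    ("version", "B"),   # unsigned 8-bit
    ("flags", "B"),     # unsigned 8-bit
    ("length", "H"),    # unsigned 16-bit
]


def _packer(fields, byte_order=">"):
    """Compile the field list into a single struct.Struct (big-endian by default)."""
    return struct.Struct(byte_order + "".join(code for _, code in fields))


def parse(fields, data):
    """Decode bytes into a dict keyed by field name."""
    values = _packer(fields).unpack_from(data)
    return dict(zip((name for name, _ in fields), values))


def build(fields, record):
    """Encode a dict back into bytes using the same description."""
    return _packer(fields).pack(*(record[name] for name, _ in fields))


raw = b"DEMO\x01\x00\x00\x10"
record = parse(HEADER, raw)
roundtrip = build(HEADER, record)
```

Real libraries in this space go much further (nested structures, conditionals, computed lengths), but the parse/build symmetry is the core idea.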
There's an old XML one called Data Format Description Language (DFDL).
There's a metric ton of them by now. Here are some incomplete notes from a couple of years ago:
### kaitai - https://github.com/kaitai-io/kaitai_struct - https://github.com/kaitai-io/awesome-kaitai - http://formats.kaitai.io/dos_datetime/index.html
### Hexinator / Synalyze It! - Universal Parsing Engine - Hexinator is freemium version of Synalyze It! - https://github.com/synalysis/Grammars/blob/master/bitmap.gra...
### quickbms - http://aluigi.altervista.org/quickbms.htm
### multiex - http://multiex.xentax.com/
### Game Extractor by WATTO - http://www.watto.org/game_extractor.html
### 010 editor templates - https://www.sweetscape.com/010editor/repository/templates/
### hex fiend templates - https://github.com/HexFiend/HexFiend/tree/master/templates
### malcat - has some form of binary templates - https://malcat.fr/
### Andys Binary Folding Editor - http://www.nyangau.org/be/be.htm
### winhex templates - https://www.x-ways.net/winhex/templates/index.html
### TrID - file identifier - TrID is a utility designed to identify file types from their binary signatures. - https://mark0.net/soft-trid-e.html
### GNU file - https://github.com/file/file
### Noesis - Noesis is a tool for previewing and converting between hundreds of model, image, and animation formats. - http://richwhitehouse.com/index.php?content=inc_projects.php... - https://github.com/RoadTrain/noesis-plugins - https://github.com/RoadTrain/noesis-plugins-official
### Ninja ripper - extract individual models from DirectX 3D games, while they are running - https://ninjaripper.com/
### Unpakke - http://www.nullsecurity.org/unpakke
### Camoto online-only universal game modding tool - https://moddingwiki.shikadi.net/wiki/Camoto - https://camoto.shikadi.net/
Great list! Will incorporate some of those into my list of tools for binary parsing: https://github.com/dloss/binary-parsing
ImHex https://imhex.werwolv.net/ has another one. Not fully declarative, but that makes some things easier to deal with.
You should check out Erlang's binary literals.
How does this compare to how protobuf defines structures?
Completely different problem, completely different solution.
Protobuf and its ilk (ASN.1, Cap’n Proto, etc.) have you describe a tree structure, then map that to bytes according to their own sensibilities. Kaitai and its ilk (Wireshark might be a more familiar member of the group) have you describe a bunch of data structures as well as somebody else’s pretty much arbitrary ideas as to how they are to map to bytes, then deal with the results.
You can’t use a Protobuf implementation to get EXIF data out of JPEGs, but then you can’t get format evolution guarantees out of Kaitai either.
(I hear ASN.1 can somewhat cross the gap using ECN, but as far as I can tell literally nobody uses that in public.)
Kaitai was awesome for reverse-engineering the Soloshot session format https://github.com/foobarbecue/soloshot-session-to-gpx-conve...
Kaitai is cool but it seems like kind of a waste since you can't roundtrip the data back into binary.
Is this able to represent any binary format? How do things like relative offsets work and such? (basically any non-rigid parts of the format)
https://doc.kaitai.io/user_guide.html#_relative_positioning
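For a sense of what "relative offsets" mean in practice, here is a hand-written Python sketch of the kind of logic a format description has to express. The container layout (a `u16` count followed by `(u32 offset, u32 size)` pairs, with offsets relative to the start of the container) and the name `read_records` are assumptions made up for this example, not taken from Kaitai or any real format.

```python
import struct


def read_records(blob: bytes, base: int = 0) -> list:
    """Read a hypothetical container that starts at `base` inside `blob`.

    Layout (little-endian): u16 count, then `count` pairs of
    (u32 offset, u32 size). Each offset is relative to `base`,
    so the body has to be fetched by seeking, not by reading on.
    """
    (count,) = struct.unpack_from("<H", blob, base)
    records = []
    for i in range(count):
        offset, size = struct.unpack_from("<II", blob, base + 2 + 8 * i)
        start = base + offset  # resolve the relative offset
        records.append(blob[start:start + size])
    return records


# Two records; the bodies sit after two padding bytes, not right after the table.
container = struct.pack("<HIIII", 2, 20, 5, 25, 3) + b"\x00\x00" + b"hello" + b"abc"
embedded = b"junk" + container  # same container, placed 4 bytes into a larger file
```

Calling `read_records(embedded, base=4)` gives the same records as `read_records(container)`, which is the property that makes offsets "relative".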
It can represent a UTF-8 string, so it can probably represent anything.
As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:
* Things may be non-byte-aligned bitstreams.
* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* Fields that may be optional if some parent of the current record has some weird value.
* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
and so on.
File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncrasies.
I found their ELF format specification to have decent coverage, even if it's not completely exhaustive (e.g. some debug info isn't broken down past a certain point, but that might just be an incomplete spec rather than a limitation of the language).
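Two of the items in the list above (the terminator record that has no body, and byte order chosen by an early field) can be sketched in a few lines of plain Python. The stream layout here is invented purely for illustration: one flag byte (0 means little-endian, anything else big-endian), then records of `u8 id` plus `u16 value`, ending at a record whose id is 5 and which carries no value.

```python
import struct

TERMINATOR_ID = 5  # taken from the "read until id is 5" example above


def parse_stream(data: bytes):
    """Parse the hypothetical stream; return (records, bytes consumed)."""
    # Metadata read early decides how every later field is decoded.
    order = "<" if data[0] == 0 else ">"
    pos = 1
    records = []
    while True:
        rec_id = data[pos]
        pos += 1
        if rec_id == TERMINATOR_ID:
            # The terminator is only an id: no value follows it.
            break
        (value,) = struct.unpack_from(order + "H", data, pos)
        pos += 2
        records.append((rec_id, value))
    return records, pos


big = bytes([1, 2, 0x01, 0x02, 7, 0x00, 0x10, 5, 0xAA])
little = bytes([0, 2, 0x01, 0x02, 5])
```

Imperatively this is trivial; the question for a declarative language is whether "the shape of a record depends on its own first field" and "byte order depends on the header" can be stated without escaping into host-language code.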
> Things may be non-byte-aligned bitstreams.
* https://doc.kaitai.io/user_guide.html#_bit_sized_integers
> Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* https://doc.kaitai.io/user_guide.html#_repetitions
> Fields that may be optional if some parent of the current record has some weird value.
* https://doc.kaitai.io/user_guide.html#do-nothing
> Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* https://doc.kaitai.io/user_guide.html#_relative_positioning
> The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
* https://doc.kaitai.io/user_guide.html#param-types
* https://doc.kaitai.io/user_guide.html#switch-advanced
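For comparison with the first item, this is roughly what a hand-rolled reader for non-byte-aligned fields looks like in Python. It is a sketch, not Kaitai's generated code: the `BitReader` name is made up, and reading most-significant bit first is an assumption (some formats number bits the other way).

```python
class BitReader:
    """Read unsigned integers of arbitrary bit width, MSB first."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0  # current position, counted in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self._data[self._pos >> 3]          # which byte we are in
            bit = (byte >> (7 - (self._pos & 7))) & 1  # which bit inside it
            value = (value << 1) | bit
            self._pos += 1
        return value


reader = BitReader(bytes([0b10110010, 0b01111111]))
first = reader.read(3)   # top three bits of byte 0
second = reader.read(7)  # straddles the byte boundary
third = reader.read(6)   # remaining bits of byte 1
```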
There are ID3 tags used for MP3 and other files. Old players may not know about them and may misread their data as MPEG frames. To prevent this, a tag may break up byte sequences that look like an MPEG frame sync (an FF byte followed by a byte with its top three bits set) by inserting a 00 byte after the FF. Or it may not, because most players are now aware of tags. So there is a preference at two levels: a default, and one for a single tag. Not too hard to program, but rather unfriendly to a grammar-based approach.
(Checksums are another example.)
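A minimal Python sketch of that ID3 transformation, based on my reading of the ID3v2 "unsynchronisation" scheme: after every FF byte that is followed by a byte of E0 or above, by a 00 byte, or by nothing, a 00 byte is inserted; decoding drops the 00 after each FF. The function names are made up, and the check for the flag that says whether a tag uses the scheme at all is left out.

```python
def unsynchronise(data: bytes) -> bytes:
    """Insert 0x00 after 0xFF wherever a false MPEG sync could appear."""
    out = bytearray()
    for i, byte in enumerate(data):
        out.append(byte)
        if byte == 0xFF:
            nxt = data[i + 1] if i + 1 < len(data) else None
            # FF followed by 111xxxxx looks like a frame sync; FF 00 must be
            # escaped too, so that decoding stays unambiguous.
            if nxt is None or nxt == 0x00 or nxt >= 0xE0:
                out.append(0x00)
    return bytes(out)


def resynchronise(data: bytes) -> bytes:
    """Reverse the transformation: every FF 00 pair becomes a single FF."""
    return data.replace(b"\xff\x00", b"\xff")


payload = bytes([0x01, 0xFF, 0xE3, 0xFF, 0x00, 0x7F, 0xFF])
encoded = unsynchronise(payload)
```

The code is short, but note what it is: a transformation over the whole byte stream, switched on by a flag read earlier. A field-by-field grammar has no obvious place to say that.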
UTF-8 is a regular language (as a subset of all octet strings), so that doesn’t feel like much of a benchmark. Something like TIFF or PE/COFF would seem to be a more reasonable standard. (PDF is probably too much to ask, seeing as understanding the structure requires a full Deflate decoder among other things.)
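To make the "regular language" point concrete: well-formed UTF-8 can be recognised by a plain regular expression over bytes, with no counters or lookahead. The sketch below uses the commonly cited byte-range table; the names `UTF8_CHAR` and `is_well_formed_utf8` are made up for this example.

```python
import re

# One alternative per class of encoded code point.
UTF8_CHAR = rb"""
    [\x00-\x7F]                          # ASCII
  | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, overlong forms excluded
  | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, overlong forms excluded
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general case
  | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, surrogates excluded
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, overlong forms excluded
  | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, capped at U+10FFFF
"""
WELL_FORMED_UTF8 = re.compile(rb"(?:" + UTF8_CHAR + rb")*", re.VERBOSE)


def is_well_formed_utf8(data: bytes) -> bool:
    """True if the whole byte string is a sequence of valid UTF-8 characters."""
    return WELL_FORMED_UTF8.fullmatch(data) is not None
```

Nothing comparable exists for TIFF or PE/COFF, where offsets and counts read from the data decide what to read next, which is why they make a harder test.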