I wonder, given the docs, how well AI could translate ImHex and Kaitai Struct descriptions into SDDL. We could get a few good schemas quickly that way.
It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data [0]. It's a great example of the really simple transformations you can perform on data that can unlock significant compression improvements. OpenZL can perform that transformation internally (quite easily with SDDL!).
[0] https://news.ycombinator.com/item?id=45223827
That post immediately came to my mind too! Do you have a comparison to share against the specialized compressor mentioned in the OP there?
> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.
I'd love to see some benchmarks for this on some common genomic formats (fa, fq, sam, vcf). Will be doubly interesting to see its applicability to nanopore data - lots of useful data is lost because storing FAST5/POD5 is a pain.
OpenZL compressed SAM/BAM vs. CRAM is the interesting comparison. It would really test the flexibility of the framework. Can OpenZL reach the same level of compression, and how much effort does it take?
I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.
Do you happen to have a pointer to a good open source dataset to look at?
Naively, and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.
We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.
For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html
Happy to discuss further
Amazing, thank you!
I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.
And a comparison between CRAM and OpenZL on a SAM/BAM file. Is OpenZL indexable, where you can extract and decompress just the data you need from a file if you know where it is?
> Is OpenZL indexable
Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.
Author of [0] here. Congratulations and well done for resisting. Eager to try it!
Edit: Have you any specific advice for training a FASTA compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?
Well, well. Kind of surprised that this really good tool wasn't made available a long time ago, since the approach is quite sound.
When the data container is understood, deduplication is far more effective because it can be targeted.
Licensed as BSD-3-Clause, solid C++ implementation, well documented.
Will be looking forward to seeing new developments as more file formats are contributed.
Specialization for file formats is not novel (e.g. 7-Zip uses BCJ2 prefiltering to convert the relative targets of x86 CALL/JMP instructions into absolute addresses, so repeated jumps to the same target compress better), nor is embedding specialized decoder bytecode in the archive (e.g. ZPAQ did this and won a lot of Matt Mahoney's benchmarks), but I think OpenZL's execution here, along with the data description and training system, is really fantastic.
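For anyone unfamiliar with that trick, here's a minimal sketch of a BCJ-style filter in Python (an illustration of the idea, not 7-Zip's actual implementation): it rewrites the rel32 operand of each x86 CALL (opcode 0xE8) into an absolute target, so every call site of a given function becomes the same byte pattern.

```python
import struct

def bcj_filter(code: bytes, base: int = 0) -> bytes:
    # Rewrite the rel32 operand of each x86 CALL (0xE8) as an absolute
    # address. Repeated calls to the same function then produce identical
    # bytes, which the entropy stage compresses much better. The inverse
    # filter subtracts the instruction address to restore the original.
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:
            rel = struct.unpack_from("<i", out, i + 1)[0]
            target = (base + i + 5 + rel) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, target)
            i += 5
        else:
            i += 1
    return bytes(out)
```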
Thanks, I've enjoyed reading more about ZPAQ. Its main focus seems to be versioning (which is quite a useful feature too; I'll try it later), but it doesn't include specialized compression per context.
Like you mention, the extensibility is quite something. In a few years we might see a very capable compressor.
So, as I understand it, you describe the structure of your data in SDDL and then the compressor can plan a strategy for how to best compress the various parts of the data?
Honestly looks incredible. Could be amazing to provide a general framework for compressing custom formats.
Exactly! SDDL [0] provides a toolkit to do all of this with no code, but today it is pretty limited. We will be expanding its feature set, but in the meantime you can also write code in C++ or Python to parse your format. And this code is compression-side only, so the decompressor is agnostic to your format.
[0] https://openzl.org/api/c/graphs/sddl/
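For a flavor of what such a parser does, here's a toy compression-side tokenizer for a made-up fixed-record format (the format and function are hypothetical; it only illustrates splitting input into homogeneous streams):

```python
def tokenize_records(buf: bytes):
    # Toy parser for a hypothetical format: a stream of fixed 12-byte
    # records, each a u32 id followed by an f64 value. Splitting the input
    # into one homogeneous stream per field is exactly the kind of
    # compression-side hint OpenZL wants; the decompressor never needs
    # this code, because the recipe travels inside the frame.
    ids, values = bytearray(), bytearray()
    for off in range(0, len(buf), 12):
        ids += buf[off:off + 4]
        values += buf[off + 4:off + 12]
    return bytes(ids), bytes(values)
```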
Now I cannot stop thinking about how I can fit this somewhere in my work, hehe. Zstandard already blew me away when it was released, and this is another crazy piece of work. And being able to access this kind of state-of-the-art algorithm for free and open source is the oh-so-sweet cherry on top.
Meta's Nimble is natively integrated with OpenZL (the pre-OSS version), and benefits from it immensely.
Yeah, backend compression in columnar data formats is a natural fit for OpenZL. Knowing that the data it is compressing is numeric, e.g. a column of i64 or float, allows for immediate wins over Zstandard.
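To make that concrete, here's a minimal sketch of the kind of transform this enables (an illustration of the idea, not OpenZL's actual pipeline): delta-encode an i64 column, then transpose the bytes so the mostly-zero high bytes group into long runs for a generic backend.

```python
import struct, zlib

def delta_transpose(values: list[int]) -> bytes:
    # Delta-encode, then group byte 0 of every delta, byte 1 of every
    # delta, etc. Near-constant columns become long runs of zeros.
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    packed = b"".join(struct.pack("<q", d) for d in deltas)
    return b"".join(packed[i::8] for i in range(8))

ts = list(range(1_700_000_000, 1_700_001_000))          # e.g. timestamps
plain = b"".join(struct.pack("<q", v) for v in ts)
print(len(zlib.compress(plain)), len(zlib.compress(delta_transpose(ts))))
# The transformed column compresses far smaller than the raw bytes.
```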
Couldn't the input be automatically described/guessed using a few rows of data and an LLM?
You could have an LLM generate the SDDL description [0] for you, or even have it write a C++ or Python tokenizer. If compression succeeds, then it is guaranteed to round trip, as the LLM-generated logic lives only on the compression side, and the decompressor is agnostic to it.
It could be a problem that is well-suited to machine learning, as there is a clear objective function: did compression succeed, and if so, what is the compressed size?
[0] https://openzl.org/api/c/graphs/sddl/
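A minimal sketch of that loop, where `compress` and `decompress` are placeholders for whatever bindings you drive OpenZL with (not its actual API):

```python
def score(description, sample: bytes, compress, decompress) -> float:
    # Fitness for a search/LLM loop over candidate format descriptions:
    # a candidate that fails to parse is rejected outright; among the
    # survivors, smaller compressed output is better. Round-trip safety
    # comes for free because the description lives on the compression side.
    try:
        blob = compress(sample, description)
    except Exception:
        return float("inf")   # candidate could not parse the sample
    assert decompress(blob) == sample
    return len(blob)
```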
Wow this sounds nuts. I want to try this on some large csvs later today.
Let us know how it goes!
We developed OpenZL initially for our own consumption at Meta. More recently we've been putting a lot of effort into making this a usable tool for people who, you know, didn't develop OpenZL. Your feedback is welcome!
Couldn't find in the paper a description of how the DAG itself is encoded. Any ideas?
We left it out of the paper because it is an implementation detail that is absolutely going to change as we evolve the format. This is the function that actually does it [0], but there really isn't anything special here. There are some bit-packing tricks to save some bits, but nothing crazy.
Down the line, we expect to improve this representation to shrink it further, which is important for small data, and to allow moving this representation, or parts of it, into a dictionary for tiny data.
[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...
Thanks! (Super cool idea btw.)
Wonder how it compares to zstd-9 since they only mention zstd-3
The charts in the "Results With OpenZL" section compare against all levels of zstd, xz, and zlib.
On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.
How do you use it to compress a directory (or .tar file)? Not seeing any example usages in the repo. `zli compress -o dir.tar.zl dir.tar`?
Same thing for the `train` command. Edit: @terrelln Got it, thank you!
There's a Quick Start guide here:
https://openzl.org/getting-started/quick-start/
However, OpenZL is different in that you need to tell the compressor how to compress your data. The CLI tool has a few built-in "profiles" which you can specify with the `--profile` argument, e.g. csv, parquet, or le-u64. They can be listed with `./zli list-profiles`.
You can always use the `serial` profile, but because you haven't told OpenZL anything about your data, it will just use Zstandard under the hood. Training can learn a compressor, but it won't be able to learn a format like `.tar` today.
If you have raw numeric data you want to throw at it, or Parquet or large CSV files, that's where I would expect OpenZL to perform really well.
Are you thinking about adding stream support? I.e. something along the lines of (i) build up an efficient vocabulary up front for the whole data, then (ii) compress by chunks, so it can be decompressed by chunks as well. This is important for seeking in data and stream processing.
Yes, definitely! Chunking support is currently in development. Streaming and seeking and so on are features we will certainly pursue as we mature towards an eventual v1.0.0.
Great! I find Apache Arrow IPC to be the most sensible format I've come across for organizing stream data: headers first, so you learn what data you're working with; columnar for good SIMD and compression; deeply nested data structures supported. Might serve as an inspiration.
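For reference, a small example of that layout using the pyarrow bindings (assuming pyarrow is installed; the schema and column names here are made up):

```python
import pyarrow as pa

# Arrow IPC stream: the schema header goes first, then self-contained
# record batches, so a reader can consume or skip chunks independently.
schema = pa.schema([("ts", pa.int64()), ("price", pa.float64())])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    writer.write_batch(pa.record_batch(
        [pa.array([1, 2, 3], pa.int64()),
         pa.array([9.5, 9.6, 9.4], pa.float64())], schema=schema))

for batch in pa.ipc.open_stream(sink.getvalue()):
    print(batch.num_rows)   # each batch is decodable on its own
```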
I've recently been wondering: could you re-compress gzip to a better compression format, while keeping all instructions that would let you recover a byte-exact copy of the original file? I often work with huge gzip files and they're a pain to work with, because decompression is slow even with zlib-ng.
precomp/antix/... are tools that can bruteforce the original gzip parameters and let you recreate the byte-identical gzip archive.
The output is something like {precomp header}{gzip parameters}{original uncompressed data} which you can then feed to a stronger compressor.
A major use case is if you have a lot of individually gzipped archives with similar internal content, you can precomp them and then use long-range solid compression over all your archives together for massive space savings.
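A toy version of that idea over a raw DEFLATE stream (real precomp also records gzip header fields and handles streams produced by encoders other than zlib):

```python
import zlib

def precomp(deflate_data: bytes):
    # Try to find zlib parameters that regenerate the original stream
    # byte for byte. On success, store (raw data, level) and feed the raw
    # data to a stronger, long-range compressor; on demand, re-deflate
    # with the recorded level to restore the identical archive.
    raw = zlib.decompress(deflate_data, wbits=-15)
    for level in range(1, 10):
        c = zlib.compressobj(level, zlib.DEFLATED, -15)
        if c.compress(raw) + c.flush() == deflate_data:
            return raw, level
    return None   # unknown encoder: keep the original bytes verbatim
```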
I may be misunderstanding the question, but that should just be decompressing gzip and compressing with something better like zstd (and saving the gzip options so it can be compressed back); however, it won't avoid compressing and decompressing gzip.
I understand it cannot work well on random text files, but would it support structured text, like .c, .java, or even JSON?
Is there a way to use this with blosc?
Is this similar to Basis? https://github.com/BinomialLLC/basis_universal
No, not really. They are both cool but solve different problems. The problem Basis solves is that GPUs don't agree on which compressed texture formats to support in hardware. Basis is a single compressed format that can be transcoded to almost any of the formats GPUs support, which is faster and higher quality than e.g. decoding a JPEG and then re-encoding to a GPU format.
Thanks. I thought Basis also had specific encoders depending on the nature of the input data, like this OpenZL project.
It probably does have different modes that it selects based on the input data. I don't know that much about the implementation of image compression, but I know that PNG for example has several preprocessing modes that can be selected based on the image contents, which transform the data before entropy encoding for better results.
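For the curious, PNG's filter type 1 ("Sub") is nearly a one-liner; a sketch, where bpp is bytes per pixel:

```python
def png_sub_filter(row: bytes, bpp: int = 3) -> bytes:
    # PNG "Sub" filter: replace each byte with its difference from the
    # byte one pixel to the left. Smooth gradients turn into runs of
    # small values, which the entropy stage then compresses much better.
    return bytes((row[i] - (row[i - bpp] if i >= bpp else 0)) & 0xFF
                 for i in range(len(row)))
```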
The difference with OpenZL IIUC seems to be that it has some language that can flexibly describe a family of transformations, which can be serialized and included with the compressed data for the decoder to use. So instead of choosing between a fixed set of transformations built into the decoder ahead of time, as in PNG, you can apply arbitrary transformations (as long as they can be represented in their format).
Is this useful for highly repetitive JSON data? Something like stock prices for example, one JSON per line.
Unclear if this has enough "structure" for OpenZL.
You'd have to tell OpenZL what your format looks like by writing a tokenizer for it and annotating which parts are which. We aim to make this easier with SDDL [0], but today it is not powerful enough to parse JSON. However, you can do that in C++ or Python.
Additionally, OpenZL works well on numeric data in native binary format, but JSON stores numbers as ASCII text. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.
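A minimal sketch of why integers are the easy case (assuming canonically formatted decimals, i.e. no leading zeros or '+' signs):

```python
import struct

def pack_ascii_ints(tokens: list[str]) -> bytes:
    # Canonical decimal text <-> int64 is a bijection, so this is lossless.
    return b"".join(struct.pack("<q", int(t)) for t in tokens)

def unpack_ascii_ints(buf: bytes) -> list[str]:
    return [str(v) for (v,) in struct.iter_unpack("<q", buf)]

assert unpack_ascii_ints(pack_ascii_ints(["42", "-7", "1000000"])) == ["42", "-7", "1000000"]
# Floats are the hard case: "0.10", "1e-1", and "0.1" all parse to the
# same double, so text -> double -> text cannot restore the original bytes.
```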
However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.
[0] https://openzl.org/api/c/graphs/sddl/
Maybe convert to BSON first, then compress that.
This is such a leap forward it's hard to believe it's anything but magic.
I used to see it as magic that the old original compression algorithms worked so well with generic text, without worrying about format, file type, structure, or other things that could give hints of additional redundancy.
Compared to columnar databases this is more of an incremental improvement.
In addition to the blog post, here are the other things we've published today:
Code: https://github.com/facebook/openzl
Documentation: https://openzl.org/
White Paper: https://arxiv.org/abs/2510.03203
We'll put those links in the toptext above.
Cool, but what's the Weissman Score?