Blogs · Aras' website

EXR: libdeflate is great

Posted on Aug 9, 2021

The previous blog post was about adding Zstandard compression to OpenEXR. I planned to look into something else now, but a GitHub comment from Miloš Komarčević and a blog post from Matt Pharr reminded me to look into libdeflate, which I was not consciously aware of before.

TL;DR: libdeflate is most excellent. If you need to use zlib/deflate compression, look into it!

Here’s what happens when zlib is replaced with libdeflate v1.8 for Zip compression in OpenEXR:

zlib is dark green (both the currently default compression level 6, and my proposed level 4 are indicated). libdeflate is light green, star shape.

Compression ratio is almost the same. Level 4: 2.421x for zlib, 2.427x for libdeflate; level 6: 2.452x for zlib, 2.447x for libdeflate.
Writing: level 4 goes 456 -> 640 MB/s (1.4x faster), and level 6 goes 213 -> 549 MB/s (2.6x faster). Both are faster than writing uncompressed.
Reading: with libdeflate, reading reaches 2GB/s and becomes as fast as Zstandard. I suspect this might be disk bandwidth bound at that point, since the numbers all look curiously similar.

So, changing zlib to libdeflate should be a no-brainer. Way faster, and a huge advantage is that the file format stays exactly the same; everything that could read or write EXR files in the past can still read/write them if libdeflate is used.

In compression performance, Zip+libdeflate does not quite reach Zstandard speeds though.

Another possible thing to watch out for is security/bugs. zlib, being an extremely popular library, has been quite thoroughly battle-tested against bugs, crashes, handling of malformed or malicious data, etc. I don’t know if libdeflate got a similar treatment.

In terms of code, my quick hack is not even very optimal – I create a whole new libdeflate compressor/decompressor object for each compression request. This could be optimized somehow if one were to switch to libdeflate for real, and maybe the numbers would be a tiny bit better. All my change did was this in src/lib/OpenEXR/ImfZip.cpp:

// in Zip::compress:
//
// if (Z_OK != ::compress2 ((Bytef *)compressed, &outSize,
//                  (const Bytef *) _tmpBuffer, rawSize, level))
// {
//     throw IEX_NAMESPACE::BaseExc ("Data compression (zlib) failed.");
// }
libdeflate_compressor* cmp = libdeflate_alloc_compressor(level);
size_t cmpBytes = libdeflate_zlib_compress(cmp, _tmpBuffer, rawSize, compressed, outSize);
libdeflate_free_compressor(cmp);
if (cmpBytes == 0)
{
    throw IEX_NAMESPACE::BaseExc ("Data compression (libdeflate) failed.");
}
outSize = cmpBytes;

// in Zip::uncompress:
// if (Z_OK != ::uncompress ((Bytef *)_tmpBuffer, &outSize,
//                  (const Bytef *) compressed, compressedSize))
// {
//     throw IEX_NAMESPACE::InputExc ("Data decompression (zlib) failed.");
// } 
libdeflate_decompressor* cmp = libdeflate_alloc_decompressor();
size_t cmpBytes = 0;
libdeflate_result cmpRes = libdeflate_zlib_decompress(cmp, compressed, compressedSize, _tmpBuffer, _maxRawSize, &cmpBytes);
libdeflate_free_decompressor(cmp);
if (cmpRes != LIBDEFLATE_SUCCESS)
{
    throw IEX_NAMESPACE::InputExc ("Data decompression (libdeflate) failed.");
}
outSize = cmpBytes;
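
The per-request allocation mentioned above could be avoided by keeping one compressor around and recreating it only when the compression level changes. Here is a minimal sketch of that idea; it is not what the quick hack does. `LevelCache` and its members are hypothetical names, and the two callbacks are stand-ins for `libdeflate_alloc_compressor` / `libdeflate_free_compressor`. It assumes one cache per thread (for example via `thread_local`), since a single libdeflate compressor should not be used from several threads at once.

```cpp
#include <cassert>

// Hypothetical helper: caches one handle and recreates it only when the
// requested compression level changes. T stands in for libdeflate_compressor.
template <typename T>
class LevelCache
{
public:
    using AllocFn = T* (*)(int level);
    using FreeFn  = void (*)(T*);

    LevelCache(AllocFn allocFn, FreeFn freeFn)
        : _alloc(allocFn), _free(freeFn)
    {
        assert(allocFn != nullptr && freeFn != nullptr);
    }
    ~LevelCache() { release(); }

    // The cache owns the handle, so copying is disabled.
    LevelCache(const LevelCache&) = delete;
    LevelCache& operator=(const LevelCache&) = delete;

    // Returns a handle for the given level. Allocates only on first use or
    // when the level differs from the cached one. May return nullptr if the
    // allocation callback fails.
    T* get(int level)
    {
        if (_obj == nullptr || level != _level)
        {
            release();
            _obj = _alloc(level);
            _level = level;
        }
        return _obj;
    }

private:
    void release()
    {
        if (_obj != nullptr)
        {
            _free(_obj);
            _obj = nullptr;
        }
    }

    AllocFn _alloc;
    FreeFn  _free;
    T*      _obj = nullptr;
    int     _level = 0;
};
```

Inside Zip::compress this might be used as a `thread_local LevelCache<libdeflate_compressor>` constructed with the two libdeflate functions, followed by `cache.get(level)` in place of the alloc/free pair. Whether that buys anything measurable is untested here.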

Next up?

I want to look into more specialized compression schemes, besides just “let’s throw a general purpose compressor at it”. For example, ZFP.

EXR: Zstandard compression

Posted on Aug 6, 2021

In the previous blog post I looked at OpenEXR Zip compression level settings.

Now, the Zip compression algorithm (DEFLATE) has one good thing going for it: it’s everywhere. However, it is also from the year 1993, and both the compression algorithm world and the hardware have moved on quite a bit since then :) These days, if one were to look for a good, general purpose, freely available lossless compression algorithm, the answer seems to be either Zstandard or LZ4, both by Yann Collet.

Let’s look into Zstandard then!

Initial (bad) attempt

Some quick hacky plumbing of Zstd (version 1.5.0) into OpenEXR, here’s what we get:

Zip/Zips has been bumped from previous compression level 6 to level 4 (see previous post), the new Zstandard is the large blue data point. Ok that’s not terrible, but also quite curious:

Both compression and decompression performance is better than Zip, which is expected.
However, that compression ratio? Not great at all. Zip and PIZ are both at ~2.4x compression, whereas Zstd only reaches 1.8x. Hmpft!

Turns out, OpenEXR does not simply “zip the pixel data”. Quite similar to how e.g. PNG does it, it first filters the data, and then compresses it. When decompressing, it first decompresses and then does the reverse filtering process.

In OpenEXR, here’s what looks to be happening:

First the incoming data is split into two parts; first all the odd-indexed bytes, then all the even-indexed bytes. My guess is that this is based on the assumption that 16-bit float is going to be the dominant input data type, and splitting it into “first all the lower bytes, then all the higher bytes” does improve compression when a general purpose compressor is used.
- That got me thinking: EXR also supports 32-bit float and 32-bit integer pixel data types. However here for compression, they are still split into two parts, as if data is 16-bit sized. This does not cause any correctness issues, but I’m wondering whether it might be slightly suboptimal for compression ratio.
Then the resulting byte stream is delta encoded; e.g. this turns a byte sequence like { 1, 2, 3, 4, 5, 6, 4, 2, 0 } (not very compressible) into { 1, 129, 129, 129, 129, 129, 126, 126, 126 } which is much tastier for a compressor.
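
The two filtering steps above can be sketched in a few lines. This is a standalone illustration based on the description here, not the actual OpenEXR source; all function names are made up for the example. Positions are counted from zero, so "first half" means bytes 0, 2, 4, ... of the input, and the delta is stored with a +128 bias, wrapped to 8 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: bytes at 0-based positions 0, 2, 4, ... go to the first half,
// bytes at positions 1, 3, 5, ... go to the second half.
std::vector<uint8_t> splitBytes(const std::vector<uint8_t>& raw)
{
    std::vector<uint8_t> out(raw.size());
    const size_t half = (raw.size() + 1) / 2;
    for (size_t i = 0; i < raw.size(); ++i)
    {
        const size_t dst = (i % 2 == 0) ? i / 2 : half + i / 2;
        assert(dst < out.size());
        out[dst] = raw[i];
    }
    return out;
}

// Inverse of splitBytes: interleave the two halves again.
std::vector<uint8_t> mergeBytes(const std::vector<uint8_t>& split)
{
    std::vector<uint8_t> out(split.size());
    const size_t half = (split.size() + 1) / 2;
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = split[(i % 2 == 0) ? i / 2 : half + i / 2];
    return out;
}

// Step 2: replace every byte except the first with the difference from its
// predecessor, biased by 128 and wrapped to 8 bits.
void deltaEncode(std::vector<uint8_t>& data)
{
    if (data.empty())
        return;
    uint8_t prev = data[0];
    for (size_t i = 1; i < data.size(); ++i)
    {
        const uint8_t cur = data[i];
        data[i] = static_cast<uint8_t>(cur - prev + 128);
        prev = cur;
    }
}

// Inverse of deltaEncode: data[i - 1] is already decoded when data[i] is read.
void deltaDecode(std::vector<uint8_t>& data)
{
    for (size_t i = 1; i < data.size(); ++i)
        data[i] = static_cast<uint8_t>(data[i - 1] + data[i] - 128);
}
```

Running `deltaEncode` on { 1, 2, 3, 4, 5, 6, 4, 2, 0 } produces the { 1, 129, 129, 129, 129, 129, 126, 126, 126 } sequence from the example above; decoding is `deltaDecode` followed by `mergeBytes`.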

Let’s try doing exactly the same data filtering for Zstandard too:

Zstd with filtering

Look at that! Zstd sweeps all others away!

Ratio: 2.446x for Zstd, 2.442x for PIZ, 2.421x for Zip. These are actually very close to each other.
Writing: At 735MB/s, Zstd is fastest of all, by far. 1.7x faster than uncompressed or Zip, and handily winning against previous “fast to write, good ratio” PIZ. And it would be 3.6x faster than previous Zip at compression level 6.
Reading: At 2005MB/s, Zstd almost reaches RLE reading performance, is a bit faster to read than uncompressed (1744MB/s) or Zip (1697MB/s), and quite a bit faster than PIZ (1264MB/s).

Zstd also has various compression levels; the above chart is using the default (3) level. Let’s look at those.

Zstd compression levels

We have many more compression levels to choose from compared to Zip – there are “regular levels” between 1 and 22, but also negative levels that drop quite a bit of compression ratio in hopes of increasing performance (this makes Zstd almost reach into LZ4 territory). Here’s a chart where I tried most of them:

Negative levels (-1 and -3 in the chart) don’t seem to be worth it: compression ratio drops significantly (from 2.4-2.5x down to 2.1x) and they don’t buy any additional performance. I guess the compression itself might be faster, but the increased file size makes it slower to write, so they cancel each other out.
There isn’t much change in compression ratio between the levels – it varies from 2.446x (level 3) up to 2.544x (level 16). Slightly more variation than Zip, but not much. Levels beyond 10 get into “really slow” territory without buying much more ratio.
Level 1 looks better than default Level 3 in all aspects: quite a bit faster to write (745 -> 837 MB/s), and curiously enough slightly better compression ratio too (2.446x -> 2.463x)! Zstd with level 1 looks quite excellent (marked with a star shape point in the graph):
- Writing: 2.0x faster than uncompressed, 1.9x faster than Zip, 1.4x faster than PIZ.
- Reading: 1.16x faster than uncompressed, 1.06x faster than Zip, 1.7x faster than PIZ.
- Ratio: a tiny bit better than either Zip or PIZ, but all of them about 2.4x really.

Next up?

I’ll report these findings to the “Investigate additional compression” OpenEXR GitHub issue, and see if someone says that Zstd makes sense to add (maybe? TIFF added it in v4.0.10 back in year 2017…). If it does, then most of the work will be “ok how to properly do that with their CMake/Bazel/whatever build system”; C++ projects are always “fun” in that regard, aren’t they.

Maybe it would be worth looking at some different filter than the one used by Zip (particularly for 32-bit float/integer images) too?

I also want to look into more specialized compression schemes, besides just “let’s throw something better than zlib at the thing” :)

Update: next blog post turned out to be about libdeflate.

EXR: Zip compression levels

Posted on Aug 5, 2021

Update 2021 October: default zip compression level was switched from 6 to 4, for OpenEXR 3.2 (see PR). Yay faster zipped exr writing, soon!

In the previous blog post I looked at lossless compression options that are available in OpenEXR.

The Zip compression in OpenEXR is just the standard DEFLATE algorithm as used by Zip, gzip, PNG and others. That got me thinking - the compression has different “compression levels” that control ratio vs. performance. Which one is OpenEXR using, and would changing them affect anything?

OpenEXR seems to be mostly using the default zlib compression level (6). It uses level 9 in several places (within the lossy DWAA/DWAB compression), we’ll ignore those for now.

Let’s try all the zlib compression levels, 1 to 9:

The Zip compression level used in current OpenEXR is level 6, marked with the triangle shape point on the graph.
Compression ratio is not affected much by the level settings - fastest level (1) compresses data 2.344x; slowest (9) compresses at 2.473x.
Levels don’t affect decompression performance much.
Maybe level 4 should be the default (marked with a star shape point on the graph)? It’s a tiny compression ratio drop (2.452x -> 2.421x), but compression is over 2x faster (206 -> 437 MB/s)! At level 4, writing a Zip-compressed EXR file becomes faster than writing an uncompressed one.
- Just a tiny 4 line change in OpenEXR library source code would be enough for this.
- A huge advantage is that this does not change the compression format at all. All the existing EXR decoding software can still decode the files just fine; it’s still exactly the same compression algorithm.

With a few more changes, it should be possible to make the Zip compression level configurable, like so:

Header header(width, height);
header.compression() = ZIP_COMPRESSION;
addZipCompressionLevel(header, level); // <-- new!
RgbaOutputFile output(filePath, header);

So that’s it. I think switching OpenEXR from Zip compression level 6 to level 4 by default should be a no-brainer. Let’s make a PR and see what happens!

Next up

In the next post I’ll try adding a new lossless compression algorithm to OpenEXR and see what happens.

EXR: Lossless Compression

Posted on Aug 4, 2021

One thing led to another, and I happened to be looking at various lossless compression options available in OpenEXR image file format.

EXR has several lossless compression options, and most of the available material (e.g. “Technical Introduction to OpenEXR” and others) basically end up saying: Zip compression is slow to write, but fast to read; whereas PIZ compression is faster to write, but slower to read than Zip. PIZ is the default one used by the library/API.

How “slow” is Zip to write, and how much “faster” is PIZ? I decided to figure that out :)

Test setup

Hardware: MacBookPro 16" (2019, Core i9 9980HK, 8 cores / 16 threads). I used latest OpenEXR version (3.1.1), compiled with Apple Clang 12.0 in RelWithDebInfo configuration.

Everything was tested on a bunch of EXR images of various types: rendered frames, HDRI skyboxes, lightmaps, reflection probes, etc. All of them tend to be “not too small” – 18 files totaling 1057 MB of raw uncompressed (RGBA, 16-bit float) data.

What are we looking for?

As with any lossless compression, there are at least three factors to consider:

Compression ratio. The larger, the better (e.g. “4.0” ratio means it produces 4x smaller data).
Compression performance. How fast does it compress the data?
Decompression performance. How fast can the data be decompressed?

Which ones are more important than others depends, as always, on a lot of factors. For example:

If you’re going to write an EXR image once, and use it a lot of times (typical case: HDRI textures), then compression performance does not matter that much. On the other hand, if each written EXR image will get read just once or a few times (typical case: capturing rendered frames for later encoding into a movie file), then you would appreciate faster compression.
The slower your storage or transmission medium is, the more you care about compression ratio. Or to phrase it differently: the slower I/O is, the more CPU time you are willing to spend to reduce I/O data size.
Compression ratio can also matter when data size is costly. For example, modern SSDs might be fast, but their capacity can still be a limiting factor. Or a network transmission of files might be fast, but you’re paying for bandwidth used.
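
To make the "slower I/O, more CPU time worth spending" point concrete, here is a deliberately simplified model. It is an assumption for illustration, not how any of the measurements in this post were made: compression and disk writes are assumed to happen one after another with no overlap, and the function name is made up.

```cpp
#include <cassert>

// Simplified model (an assumption, not a measurement): the compressor and
// the disk work one after another with no overlap.
//   compressMBps: compressor speed, in MB/s of raw (uncompressed) input
//   ratio:        raw size divided by compressed size
//   diskMBps:     disk write speed, in MB/s of bytes actually written
// Returns the resulting write throughput in MB/s of raw data.
double effectiveWriteMBps(double compressMBps, double ratio, double diskMBps)
{
    assert(compressMBps > 0.0 && ratio > 0.0 && diskMBps > 0.0);
    const double cpuSecondsPerRawMB  = 1.0 / compressMBps;
    const double diskSecondsPerRawMB = 1.0 / (ratio * diskMBps);
    return 1.0 / (cpuSecondsPerRawMB + diskSecondsPerRawMB);
}
```

With made-up numbers: a 1000 MB/s compressor at 2.0x ratio on a 500 MB/s disk comes out at 500 MB/s, no better than writing uncompressed. Put the same compressor on a 100 MB/s disk and the model gives roughly 167 MB/s versus 100 MB/s uncompressed, so the slower the storage, the more the compression pays for itself.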

There are other things to keep in mind about compression: memory usage, technical complexity of compressor/decompressor, ability to randomly access parts of image without decompressing everything else, etc. etc., but let’s not concern ourselves with those right now :)

Initial (bad) result

What do we have here?

These are two plots: compression ratio vs. compression performance, and compression ratio vs. decompression performance. In both cases, the best place on the chart is top right – the largest compression ratio, and the best performance.

For performance, I’m measuring it in MB/s, in terms of uncompressed data size. That is, if we have 1GB worth of raw image pixel data and processing it took half a second, that’s 2GB/s throughput (even if compressed data size might be different).
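
As a sketch, the throughput number described above boils down to the following. The helper names are hypothetical, and 1 MB is assumed to mean 1,000,000 bytes, which the text does not actually specify.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Throughput in MB/s, always relative to the raw (uncompressed) data size,
// no matter how many bytes end up on disk.
// Assumption: 1 MB = 1,000,000 bytes.
double throughputMBps(size_t rawBytes, double seconds)
{
    assert(seconds > 0.0);
    return (static_cast<double>(rawBytes) / 1.0e6) / seconds;
}

// Runs the given callable once and returns the elapsed wall-clock time
// in seconds.
template <typename Fn>
double timeSeconds(Fn&& work)
{
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

For the example in the text, `throughputMBps(1000000000, 0.5)` gives 2000 MB/s, i.e. 2GB/s.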

The time taken to write or read the file itself is included in the measurement. This does mean that results are not only CPU dependent, but also storage (disk speed, filesystem speed) dependent. My test is on a 2019 MacBookPro, which has a “quite fast” SSD for today, and an average (not too fast, not too slow) filesystem. I’m flushing the OS file cache between writing and reading the file (via system("purge")) so that EXR file reading is closer to a “read a new file” scenario.

What we can see from the above is that:

Writing an uncompressed EXR goes at about 400 MB/s, reading at 1400 MB/s,
Zip and PIZ compression ratio is roughly the same (2.4x),
Compression and decompression performance is quite terrible. Why?

Turns out, OpenEXR library is single-threaded by default. The file format itself is much better than the image formats of yore (e.g. PNG, which is completely single threaded, fully, always) – EXR format in most cases splits up the whole image into smaller chunks that can be compressed and decompressed independently. For example, Zip compression does it on 16 pixel row chunks – this loses some of the compression ratio, but each 16-row image slice could be compressed & decompressed in parallel.

If you tell the library to use multiple threads, that is. By default it does not. So, one call to Imf::setGlobalThreadCount() later…
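
To illustrate why chunking makes threading possible, here is a small sketch that splits an image into independent row chunks (16 rows for Zip, 1 row for Zips) and hands them out to workers. It only shows the partitioning idea; the names are hypothetical, and the round-robin assignment is a simplification rather than how the library schedules its work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One independently compressible slice of the image.
struct RowChunk
{
    int firstRow;
    int rowCount;
};

// Splits an image of the given height into chunks of rowsPerChunk rows.
// The last chunk may be shorter.
std::vector<RowChunk> splitIntoChunks(int imageHeight, int rowsPerChunk)
{
    assert(imageHeight >= 0 && rowsPerChunk > 0);
    std::vector<RowChunk> chunks;
    for (int y = 0; y < imageHeight; y += rowsPerChunk)
        chunks.push_back({y, std::min(rowsPerChunk, imageHeight - y)});
    return chunks;
}

// Simplified static scheduling: worker w takes chunks w, w + workerCount, ...
// No chunk depends on another, so each worker could run on its own thread.
std::vector<RowChunk> chunksForWorker(const std::vector<RowChunk>& chunks,
                                      int worker, int workerCount)
{
    assert(workerCount > 0 && worker >= 0 && worker < workerCount);
    std::vector<RowChunk> mine;
    for (size_t i = static_cast<size_t>(worker); i < chunks.size();
         i += static_cast<size_t>(workerCount))
        mine.push_back(chunks[i]);
    return mine;
}
```

A 40-row image with 16-row chunks yields three chunks of 16, 16 and 8 rows; with two workers, the first gets chunks 0 and 2, the second gets chunk 1.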

Threaded result

There, much better! (16 threads on this machine)

Compression ratio: Zip and PIZ EXR compression types both have very similar compression ratio, making the data 2.4x smaller.
Writing: If you want to write EXR files fast, you want PIZ. It’s faster than writing them uncompressed (400 -> 600 MB/s), and about 3x faster to write than Zip (200 -> 600 MB/s). Zip is about 2x slower to write than uncompressed.
Reading: However, if you mostly care about reading files, you want Zip instead – it’s about the same performance as uncompressed (~1600 MB/s), whereas PIZ reads at a lower 1200 MB/s.
RLE compression is fast both at writing and reading, but compression ratio is much lower at 1.7x.
Zips compression is very similar to Zip; it’s slightly faster but lower compression ratio. Internally, instead of compressing 16-pixel-row image chunks, it compresses each pixel row independently.

Next up?

So that was with OpenEXR library and format as-is. In the next post I’ll look at what could be done if, hypothetically, one were free to extend or modify the format just a tiny bit. Until then!

Texture Compression on Apple M1

Posted on Jan 18, 2021

In the previous post I did a survey of various GPU format compression libraries. I just got an Apple M1 MacMini to help port some of these compression libraries to it, and of course decided to see some performance numbers. As everyone already noticed, M1 CPU is quite impressive. I’m comparing three setups here:

MacBookPro (2019 16", 8 cores / 16 threads). This is basically the “top” MacBook Pro you can get in 2020, with 9th generation Coffee Lake Core i9 9980HK CPU. It starts at $3000 for this CPU.
MacMini (M1, 4 perf + 4 efficiency cores). It starts at $700 for this CPU (but realistically you’d want maybe a $1300 model for more decent RAM/SSD sizes).
The same MacMini, but testing Intel/x64 builds of the compressors under Rosetta 2 translator.

Multi-threaded compression

Here we’re compressing a bunch of textures into various GPU formats, using various compression libraries, and various quality settings of those. See previous post for details. The tests are done by using all the CPU cores, and results are in millions of pixels per second (higher = better).

Desktop BC7 format, using ISPCTextureCompressor and bc7e libraries:

Desktop BC1/BC3 (aka DXT1/DXT5) format, using ISPCTextureCompressor and stb_dxt libraries:

Mobile ASTC 4x4 format, using ISPCTextureCompressor and astcenc (2.3-ish) libraries:

Mobile ETC2 format, using Etc2Comp and etcpak libraries:

Overall the 2019 MacBookPro is from “a bit faster” to “about twice as fast” as the M1, when compression is fully multi-threaded. This makes sense due to two things:

2019 MBP uses 16 threads, whereas M1 uses 8 threads. In both cases these are not “100% the same” threads, since the former only has 8 “real” cores, with two SMT threads per core; and the latter has 4 “high performance” cores and 4 “low power” cores. But with some squinting we should probably expect MBP to be almost 2x faster overall, just due to higher CPU thread count.
Some of the texture compressors (ISPCTexComp, bc7e) use AVX2 code paths for “almost ideal” speedup, meaning the full compression algorithm is fully SIMD, using AVX2 8-wide execution when available. These compressors are written in ISPC language. M1 on the other hand, only has 4-wide SIMD execution (via NEON). If a program can take really good advantage of wider SIMD, then Intel CPU has an advantage there.

Summary: on all-cores texture compression, 2019 MBP is about 2x faster than M1, for compressors written with ISPC (ISPCTexComp, bc7e) that take really good advantage of AVX2. In other compressors, 2019 MBP is “a bit” faster. With the ETC2 etcpak compressor, M1 is faster than 2019 MBP.

Rosetta 2 translator for x64/SSE works impressively well, reaching ~70-90% performance of natively compiled Arm+NEON code.

Single-threaded compression

Ok, what if we limited compression to a single CPU thread? For texture compression itself that does not make a whole lot of sense, but it’s interesting to see how 2019 MBP and M1 compare without the “MBP has more threads” advantage. You could maybe extrapolate how M1 CPU would behave if it had more cores.

Same formats and compressors as above, just single threaded everywhere:

Here it’s basically: if a compressor is fully SIMD with AVX2 (ISPCTexComp, bc7e), then 2019 MBP is 1.5x faster than M1. Otherwise M1 is a bit faster.

Multi-thread speedup

Once we have multi-threaded and single-thread numbers, we can see what’s the effective speedup from using all the CPU cores. Ideally 2019 MBP would be 16x faster, and M1 would be 8x faster, since that’s the number of threads we’re distributing the work to. In practice, as mentioned above, not all of these threads are fully independent or equal. And the computation could hit some other limits, e.g. RAM bandwidth and so on. Anyway, what’s the effective speedup for texture compression, when using all the CPU cores?

2019 MacBook Pro is ~6x faster from using all cores. This one’s curious, since it’s even below the “full 8 cores” scaling. Maybe loading all the SMT threads ends up doing more harm than good here, or we’re hitting some other bottleneck that prevents further scaling.
M1 is ~4.5x faster from using all cores. This either means there’s a fairly large performance difference between “performance” and “efficiency” cores, or we’re hitting some other bottleneck.
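
The speedup numbers above can also be expressed as parallel efficiency, that is, the fraction of the ideal "N threads means N times faster" scaling actually achieved. A tiny sketch, with made-up function names:

```cpp
#include <cassert>

// Speedup from using all cores: all-cores throughput divided by
// single-thread throughput (both in the same unit, e.g. Mpix/s).
double speedup(double allCoresThroughput, double singleThreadThroughput)
{
    assert(singleThreadThroughput > 0.0);
    return allCoresThroughput / singleThreadThroughput;
}

// Fraction of ideal linear scaling achieved: 1.0 would mean that
// threadCount threads made things exactly threadCount times faster.
double parallelEfficiency(double measuredSpeedup, int threadCount)
{
    assert(threadCount > 0);
    return measuredSpeedup / threadCount;
}
```

Plugging in the rounded figures: ~6x on 16 threads is about 0.375 (37.5% of ideal) for the 2019 MBP, and ~4.5x on 8 threads is about 0.5625 (56% of ideal) for M1.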

Anyway, that’s it! Now I’m curious to see what the next iteration of Apple CPUs will look like. M1 is impressive!