Float Compression 3: Filters

Introduction and index of this series is here.

In the previous parts we saw that using generic data compression libraries, we can get our 94.5MB of data down to 33.8MB (zstd level 7) or 29.6MB (oodle kraken level 2), if we’re not willing to spend more than one second compressing it.

That’s not bad, but is there something else we can do? Turns out, there is, and in fact it’s quite simple. Enter data filtering.

Prediction / filtering

We saw filtering in the past (EXR and SPIR-V), and the idea is simple: losslessly transform the data so that it is more compressible. Filtering alone does nothing to reduce the data size, but (hopefully!) it decreases data randomness. So the process is: filter the data, then compress that. Decompression is reverse: decompress, then un-filter it.

Here are some simple filters that I’ve tried (there are many, many other possible filters; I did not try them all!).

Reorder floats array-of-structures style

Recall that in our data, we know that water simulation has four floats per “element” (height, velocity x, velocity y, pollution); snow simulation similarly has four floats per element; and other data is either four or three floats per element. Instead of having the data like that (“array of structures” style), we can try to reorder it into “structure of arrays” style. For water simulation, that would be all heights first, then all x velocities, then all y velocities, etc.

So this:

height velx vely pollution  height velx vely pollution  height velx vely pollution  …

becomes this:

height height height …  velx velx velx …  vely vely vely …  pollution pollution pollution …

Completely unoptimized code to do that could look like this (and our data is floats, i.e. 4 bytes, so you’d call these templates with a 4-byte type e.g. uint32_t):

// channels: how many items per data element
// dataElems: how many data elements
template<typename T>
static void Split(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		const T* ptr = src + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*dst = *ptr;
			ptr += channels;
			dst += 1;
		}
	}
}
template<typename T>
static void UnSplit(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		T* ptr = dst + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*ptr = *src;
			src += 1;
			ptr += channels;
		}
	}
}
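
For example, filtering the water simulation data this way before compression could look roughly like this (a sketch; waterData is a made-up name for a pointer to the raw floats, reinterpreted as uint32_t):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: reorder the water simulation data (4 floats per element, 2048x2048 elements)
// into "structure of arrays" layout before handing it to the compressor.
static std::vector<uint32_t> FilterWaterData(const uint32_t* waterData)
{
	const size_t kChannels = 4;        // height, velocity x, velocity y, pollution
	const size_t kElems = 2048 * 2048;
	std::vector<uint32_t> filtered(kChannels * kElems);
	Split<uint32_t>(waterData, filtered.data(), kChannels, kElems);
	// ...then compress 'filtered' with zstd / lz4 / kraken.
	// On load: decompress, then UnSplit back into the original interleaved layout.
	return filtered;
}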

Does that help? The results are interesting (click for an interactive chart):

  • It does help LZ4 to achieve a bit higher compression ratios.
  • Makes zstd compress faster, and helps the ratio at lower levels, but hurts the ratio at higher levels.
  • Hurts oodle kraken compression.
  • Hurts the decompression performance quite a bit (for lz4 and kraken, slashes it in half). In all cases the data still decompresses under 0.1 seconds, so acceptable for my case, but the extra pass over memory is not free.

Ok, so this one’s a bit “meh”, but hey, now that the data is grouped together (all heights, then all velocities, …), we could try to exploit the fact that maybe neighboring elements are similar to each other?

Reorder floats + XOR

In the simulation data example, it’s probably expected that usually the height, or velocity, or snow coverage, does not vary “randomly” over the terrain surface. Or in an image, you might have a color gradient that varies smoothly.

“But can’t data compressors already compress that really well?!”

Yes and no. Usually generic data compressors can’t. Most of them are very much oriented at finding repeated sequences of bytes. So if you have a very smoothly varying surface height or image pixel color, e.g. a sequence of bytes 10, 11, 12, 13, 14, 15, well that is not compressible at all! There are no repeating byte sequences.

But, if you transform the sequence using some sort of “difference” between neighboring elements, then repeated byte sequences might start appearing. At first I tried XOR’ing the neighboring elements together (interpreting each float as an uint32_t), since at some point I saw that trick being mentioned in some “time series database” writeups (e.g. Gorilla).

Completely unoptimized code to do that:

template<typename T>
static void EncodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v ^ prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev ^ v;
		*data = v;
		prev = v;
		++data;
	}
}

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats), thick line: split floats + XOR):

  • Compression ratio is way better for zstd and lz4 (for kraken, only at lower levels).
  • zstd pretty much reaches kraken compression levels! The lines almost overlap in the graph.
  • Decompression speed takes a bit of a hit, as expected. I might need to do something about it later.

So far we got from 33.8MB (zstd) / 29.6MB (kraken) at the beginning of the post down to 28MB (zstd, kraken), while still compressing in under 1 second. Nice, we’re getting somewhere.

Reorder floats + Delta

The “xor neighboring floats” trick from the Gorilla database was used in the context of then extracting the non-zero sequences of bits from the result and storing them in less space than four bytes. I’m not doing any of that, so how about this: instead of XOR, do a difference (“delta”) between the neighboring elements? Note that the delta is done by reinterpreting the data as unsigned integers, i.e. these templates are called with the uint32_t type (you can’t easily do completely lossless floating point delta math).

template<typename T>
static void EncodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v - prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev + v;
		*data = v;
		prev = v;
		++data;
	}
}

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats + XOR), thick line: split floats + delta):

Now that’s quite an improvement! All three compressors tested get their compression ratio lifted up. Good! Let’s keep on going.

Reorder bytes

Hey, how about instead of splitting each data point into 4-byte-wide “streams”, we split into 1-byte-wide ones? After all, general compression libraries are oriented at finding byte sequences that would be repeating. This is also known as a “shuffle” filter elsewhere (e.g. HDF5). Exactly the same Split and UnSplit functions as above, just with uint8_t type.
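
To illustrate with a made-up sketch (A, B, C are data elements, and the digits are byte indices within an element), the shuffle turns

A0 A1 A2 A3 …  B0 B1 B2 B3 …  C0 C1 C2 C3 …

into

A0 B0 C0 …  A1 B1 C1 …  A2 B2 C2 …  A3 B3 C3 …

so bytes that occupy the same position within each element (and that therefore tend to have similar values) end up next to each other.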

Faint dashed line: raw compression, thin line: previous attempt (split floats + Delta), thick line: split bytes:

  • kraken results are almost the same as with “split floats and delta”. Curious!
  • zstd ratio (and compression speed) is improved a bit.
  • lz4 ratio is improved a lot (it’s beating original unfiltered kraken at this point!).

I’ll declare this a small win, and let’s continue.

Reorder bytes + Delta

Split by bytes as previous, and delta-encode that. Faint dashed line: raw compression, thin line: previous attempt (split bytes), thick line: split bytes + delta:

Holy macaroni grated potato dumplings!

  • Another compression ratio increase. Both zstd and kraken get our data to 23MB in about one second (whereas it was 33.8MB and 29.6MB at the start of the post).
  • zstd actually slightly surpasses kraken at compression ratios in the area (“under 1 sec”) that I care about. 😮
  • lz4 is not too shabby either, being well ahead of unfiltered kraken.
  • Downside: decompression is slightly longer than 0.1 seconds now. Not “terrible”, but I’d want to look into whether all this reordering and delta could be sped up.
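
For reference, a rough sketch of the full “split bytes + delta” filter as a pair of functions (how exactly the two steps are chained is an implementation detail; this version runs the delta over the whole byte-shuffled buffer, and reuses the Split/UnSplit and delta functions from above):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: split the data into 1-byte-wide streams, then delta-encode the result.
// 'bytesPerElem' would be e.g. 16 for the water data (four 4-byte floats per element).
static std::vector<uint8_t> FilterSplitBytesDelta(const void* src, size_t bytesPerElem, size_t dataElems)
{
	std::vector<uint8_t> tmp(bytesPerElem * dataElems);
	Split<uint8_t>((const uint8_t*)src, tmp.data(), bytesPerElem, dataElems);
	EncodeDeltaDif<uint8_t>(tmp.data(), tmp.size());
	return tmp; // this is what gets handed to zstd / lz4 / kraken
}

static void UnFilterSplitBytesDelta(std::vector<uint8_t>& tmp, void* dst, size_t bytesPerElem, size_t dataElems)
{
	DecodeDeltaDif<uint8_t>(tmp.data(), tmp.size());
	UnSplit<uint8_t>(tmp.data(), (uint8_t*)dst, bytesPerElem, dataElems);
}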

Conclusion and what’s next

There’s lots of other data filtering approaches and ideas I could have tried, but for now I’m gonna call “reorder bytes and delta” a pretty good win; it’s extremely simple to implement and gives a massive compression ratio improvement on my data set.

I did actually try a couple other filtering approaches. Splitting the data by bits (using the bitshuffle library) produced worse ratios than splitting by bytes. Rotating each float left by one bit, to get the mantissa & exponent aligned on byte boundaries, also did not produce impressive results. Oh well!

Maybe at some point I should also test filters specifically designed for 2D data (like the water and snow simulation data files in my test), e.g. something like PNG Paeth filter or JPEG-LS LOCO-I (aka “ClampedGrad”).

Next up, I’ll look at an open source library that does not advertise itself as a general data compressor, but I’m gonna try it anyway :) Until then!


Float Compression 2: Oodleflate

Introduction and index of this series is here.

In the previous part I looked at generic lossless data compressors (zlib, lz4, zstd, brotli), and was thinking of writing about data filtering next. But then, libdeflate and Oodle entered the chat!

libdeflate

libdeflate is an independent implementation of zlib/deflate compression, that is heavily optimized. I’ve seen it before (EXR blog post), and indeed it is very impressive, considering it produces completely zlib-compatible data, but much faster and at a bit higher compression ratio too.

Here are the results (click for an interactive chart). The compressors from the previous post are faded out; libdeflate addition is dark green:

It is not better than zstd, but definitely way better than zlib (~3x faster compression, ~2x faster decompression, slightly higher ratio). If you need to produce or consume deflate/zlib/gzip formatted data, then absolutely look into libdeflate.
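
For a rough idea of what using it looks like, here’s a minimal sketch of compressing a buffer into zlib format (the wrapper function name is made up, and error handling is mostly omitted):

#include <libdeflate.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: zlib-format compression with libdeflate at a given compression level.
static std::vector<uint8_t> CompressZlib(const void* src, size_t srcSize, int level)
{
	libdeflate_compressor* cmp = libdeflate_alloc_compressor(level);
	std::vector<uint8_t> dst(libdeflate_zlib_compress_bound(cmp, srcSize));
	size_t written = libdeflate_zlib_compress(cmp, src, srcSize, dst.data(), dst.size());
	libdeflate_free_compressor(cmp);
	dst.resize(written); // a zero size here would mean compression failed
	return dst;
}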

Oodle

Oodle is a set of compressors by Epic Games Tools (née RAD Game Tools). It is not open source or freely available, but the good people there graciously gave me the libraries to play around with.

There are several compressors available, each aiming at different ratio vs. performance tradeoff. It’s easiest to think of them as: “Kraken” is “better zstd” (great ratio, good decompression performance), “Selkie” is “better lz4” (ok ratio, very fast decompression), and “Mermaid” is in the middle of the two, with no clear open source competitor.

Here they are, additions in various shades of purple:

And… yeah. On the compression front, Mermaid leaves everyone else (zstd, brotli) in the dust on the Pareto front (remember: upper right is better). And then Kraken improves on top of that 🤯 Decompression performance is a similar story: everything (except lz4) is behind.

Wizardry! It’s a shame they are not freely available, eh :/

What’s next?

Next up, we’ll really look at some simple data filtering (i.e. fairly similar to the EXR post from a while ago). Until then!


Float Compression 1: Generic

Introduction and index of this series is here.

We have 94.5MB worth of data, and we want to make it smaller. About the easiest thing to do: use one of existing lossless data compression libraries, or as us old people call it, “zip it up”.

There are tons of compression libraries out there, I certainly will not test all of them (there’s lzbench and others for that). I tried some popular ones: zlib, lz4, zstd and brotli.

Here are the results (click for an interactive chart):

This is running on Windows, AMD Ryzen 5950X, compiled with Visual Studio 2022 (17.4), x64 Release build. Each of the four data files is compressed independently one after another, serially on a single thread.

I also have results on the same Windows machine, compiled with Clang 15, and on an Apple M1 Max, compiled with Clang 14.

What can we see from this?

The compression charts have compression ratio (higher = better) on the vertical axis, and throughput on the horizontal axis (higher = better). Note that the horizontal axis is logarithmic scale; performance of the rightmost point is literally hundreds of times faster than the leftmost point. Each line on the graph is one compression library, with different settings that it provides (usually called “compression levels”).

Compression graph is the usual “Pareto front” situation - the more towards upper right a point is, the better (higher compression ratio, and better performance). The larger points on each compression curve are the “default” levels of each library (6 for zlib, 3 for zstd, “simple” for lz4).

Decompression graph, as is typical for many data compression algorithms, is much less interesting. Decompression performance is often more or less the same for a given compression library, no matter what the compression level is. You can ballpark and say “lz4 decompresses at around 5GB/s” for this data set.

Anyway, some takeaway points:

  • Generic data compression libraries can get this data set from 94.5MB down to 32-48MB, i.e. make it 2x-3x smaller.
  • They would take between 0.1s and 1.0s to compress the data, and between 0.02s and 0.2s to decompress it.
    • With Brotli maximum level (11), you can go as low as 27MB data (3.5x ratio), but it takes over two minutes to compress it :/
  • zstd is better than zlib in all aspects. You can get the same compression ratio, while compressing 10x faster; or spend same time compressing, and get better ratio (e.g. 2.6x -> 2.9x). Decompression is at least 2x faster too.
    • Current version of zstd (1.5.2, 2022 Jan) has some dedicated decompression code paths written in assembly, that are only used when compiling with clang/gcc compilers. On a Microsoft platform you might want to look into using Clang to compile it; decompression goes from around 1GB/s up to 2GB/s! Or… someone could contribute MSVC compatible assembly routines to zstd project :)
  • If you need decompression faster than 2GB/s, use lz4, which goes at around 5GB/s. While zstd at level -5 is roughly the same compression ratio and compression performance as lz4, the decompression is still quite a bit slower. lz4 does not reach more than 2.2x compression ratio on this data set though.
  • If you need best compression ratio out of these four, then brotli can do that, but compression will not be fast. Decompression will not be fast either; curiously brotli is slower to decompress than even zlib.
  • There’s basically no reason to use zlib, except for the reason that it likely exists everywhere due to its age.

Which place exactly on the Pareto front is important to you, depends on your use case:

  • If compression happens once on some server, and the result is used thousands/millions of times, then you care much more about compression ratio, and not so much about compression performance. A library like brotli seems to be targeted at exactly this use case.
  • On the completely opposite side, you might want to “stream” some data exactly once, for example some realtime application sending profiling data for realtime viewing. You are compressing it once, and decompressing it once, and all you care about is whether the overhead of compression/decompression saves enough space to transmit it faster. lz4 is primarily targeted at use cases like this.
  • If your scenario is similar to a “game save”, then for each compression (“save”), there’s probably a handful of decompressions (“load”) expected. You care both about compression performance, and about decompression performance. A smaller file, like with brotli level 11, would be nice, but if it means almost three minutes of user waiting (as opposed to “one second”), then that’s no good.

My use case here is similar to a “game save”: if compressing this data set takes longer than one or two seconds, then it’s not acceptable. Later posts will keep this in mind, and I will not even explore the “slow” compression approaches much.

In general, know your use case! And don’t just blindly use “maximum” compression level; in many libraries it does not buy you much, but compression becomes much, much slower (brotli here seems to be an exception - while compression is definitely much slower, going from level 10 to the maximum level 11 does increase compression ratio quite a bit).
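
As a small illustration of “pick an explicit level instead of blindly maxing it out”, a minimal zstd call could look like this (a sketch; the wrapper name is made up and error handling is minimal):

#include <zstd.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: compress a buffer with zstd at an explicitly chosen compression level.
static std::vector<uint8_t> CompressZstd(const void* src, size_t srcSize, int level)
{
	std::vector<uint8_t> dst(ZSTD_compressBound(srcSize));
	size_t res = ZSTD_compress(dst.data(), dst.size(), src, srcSize, level);
	if (ZSTD_isError(res))
		dst.clear(); // compression error
	else
		dst.resize(res);
	return dst;
}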

What’s next?

Next up, we’ll look at some simple data filtering (i.e. fairly similar to the EXR post from a while ago) and a couple more generic data compressors. Until then!


Float Compression 0: Intro

I was playing around with compression of some floating point data, and decided to write up my findings. None of this will be any news for anyone who actually knows anything about compression, but eh, did that ever stop me from blogging? :)

Situation

Suppose we have some sort of “simulation” thing going on, that produces various data. And we need to take snapshots of it, i.e. save simulation state and later restore it. The amount of data is on the order of several hundred megabytes, and for one or another reason we’d want to compress it a bit.

Let’s look at a more concrete example. Simulation state that we’re saving consists of:

Water simulation. This is a 2048x2048 2D array, with four floats per array element (water height, flow velocity X, velocity Y, pollution). If you’re a graphics programmer, think of it as a 2k square texture with a float4 in each texel.

Here’s the data visualized, and the actual 64MB size raw data file (link):

Snow simulation. This is a 1024x1024 2D array, again with four floats per array element (snow amount, snow-in-water amount, ground height, and the last float is actually always zero).

Here’s the data visualized, and the actual 16MB size raw data file (link):

A bunch of “other data” that is various float4s (mostly colors and rotation quaternions) and various float3s (mostly positions). Let’s assume these are not further structured in any way; we just have a big array of float4s (raw file, 3.55MB) and another array of float3s (raw file, 10.9MB).
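
(Quick sanity check on the sizes: 2048×2048 elements × 4 floats × 4 bytes is 64MB for the water data, 1024×1024 × 4 × 4 is 16MB for the snow data, and together with the 3.55MB and 10.9MB of “other data” that adds up to roughly 94.5MB.)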

Don’t ask me why the “water height” seems to be going the other direction as “snow ground height”, or why there’s a full float in each snow simulation data point that is not used at all. Also, in this particular test case, “snow-in-water” state is always zero too. 🤷

Anyway! So we do have 94.5MB worth of data that is all floating point numbers. Let’s have a go at making it a bit smaller.

Post Index

Similar to the toy path tracer series I did back in 2018, I generally have no idea what I’m doing. I’ve heard a thing or two about compression, but I’m not an expert by any means. I never wrote my own LZ compressor or a Huffman tree, for example. So the whole series is part exploration, part discovery; who knows where it will lead.


Swallowing the elephant into Blender

Some years ago Matt Pharr wrote an excellent blog post series, “Swallowing the elephant”, in which he describes various optimizations done in pbrt using Disney’s Moana Island Scene.

Recently I was optimizing Blender’s OBJ importer, and the state of it in the end was “the bottlenecks are not OBJ specific anymore”. I was looking for some larger test scene to import, noticed that the Moana scene got a USD version done in 2022, and started to wonder how well that would import into Blender.

So yeah. It takes almost four hours until the import is finished, and then Blender crashes while trying to display it in the viewport. Not quite usable! Here’s a story of random optimizations that I did, and some learnings along the way.

Setting the expectations

The initial state (Blender ~3.3 codebase in May 2022) was: import of Moana USD scene takes 12300 seconds (3.5 hours; on Windows, Ryzen 5950X).

After a round of optimizations described below, the same import takes 82 seconds (1.5 minutes). So I made it import 150 times faster; read below for why “X times faster” is sometimes not a good description.

In both cases, after the import is done, Blender tries to display the scene, and crashes in rendering related code. That is for someone to investigate some other day :)

I should clarify that USD itself is not directly relevant; what I was looking for, and what I tried to optimize, was the general “how does Blender behave when lots of objects are being created” (part of overall “Blender editing performance with many datablocks” open task). USD just happened to be the format of the largest scene I had available – it contains about 260 thousand objects, and that’s what stresses various places of Blender – not “parsing the files” or similar, but just the “I’m creating 260 thousand objects, send help” bit.

The optimizations

As previously, I used the Superluminal profiler while on Windows, and Xcode Instruments while on Mac, looked at what they said were the major time sinks, and thought about how to address them. The five independent optimizations I did were:

Collection Sync

Stop doing “sync view layer collection” work after each imported object is created; instead create all the objects and then sync the view layer once (D15215 - for both USD and Alembic importers; OBJ importer already got that previously).

A more proper fix would be to get rid of this “each object creation needs to sync view layer” thing completely (T73411). That would speed up a lot more places than just importers, but is also a lot more involved. Honestly, I don’t even know what the view layer collection thing is :)

Material Maps

Stop building “which USD material name maps to which Blender material” mapping for each object being imported. Instead, build that map once, and update it with newly imported materials (D15222).

Material Assignment

When assigning materials to just-imported meshes, stop scanning the whole world (twice!) for possible other users of said meshes. Due to how Blender data structures are done right now, a “mesh data block” can have materials set up on it, and then the “object data block” that uses said mesh also has to have a materials array of the same size. Now, several “objects” can use the same “mesh”, so what the code was doing was: whenever setting a material on a mesh, it scanned the whole world for other objects that use this mesh, and updated their materials arrays to match in size.

But! Specifically in importer related code, all that scanning is pointless – the importers know which objects and meshes they create, there’s no need to scan the whole universe! So that was D15145 and applies to USD, Alembic and OBJ importers.

Unique Name Checks

Rework the way Blender ensures all objects have unique names (D14162).

All references in Blender are name/path based (contrast to for example Unity, where all references are GUID based). But for that to work, all the objects have to have unique names – you can’t have say ten Cube objects since you’ll have no way of referring to the specific one.

Whenever an object is created or renamed, Blender does a “is name unique? if not, adjust until it is” work, so while your first Cube can be just that, the second one would become Cube.001 and so on.

The way it was doing that was basically “walk through all the objects, checking if a name is unique”. This does not scale to huge object counts, and there was an open task for a few years to address it (T73412), and so I did it.
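
For a sense of why the old approach blows up, here’s a toy sketch (not Blender’s actual code; the function and the suffix format are just for illustration) of a linear-scan unique-name check; creating N objects with it is O(N^2), because every creation rescans all the existing names:

#include <cstdio>
#include <string>
#include <vector>

// Toy illustration: make a name unique by scanning every existing name.
// Each call is O(existing.size()), so creating N objects this way is O(N^2).
static std::string MakeUniqueNameNaive(const std::vector<std::string>& existing, const std::string& base)
{
	auto isTaken = [&](const std::string& name) {
		for (const std::string& e : existing)
			if (e == name)
				return true;
		return false;
	};
	std::string candidate = base;
	for (int suffix = 1; isTaken(candidate); ++suffix)
	{
		char buf[16];
		std::snprintf(buf, sizeof(buf), ".%03d", suffix);
		candidate = base + buf; // "Cube" -> "Cube.001", "Cube.002", ...
	}
	return candidate;
}

The actual rework is more involved than a toy like this, of course, but the gist is to avoid doing that full scan on every single object creation.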

This one is the most involved and “most risky” among all the optimizations. It is not related to any import related code, but rather touches core Blender data structures that possibly affect everything. So potential for accidental bugs is huge! But also, the benefits ripple out to everything – any operation that creates possibly many objects gets faster. For example, Shift+D duplicating 10 thousand cubes goes from 9 seconds down to 2 seconds. Yay!

Pre-sort Objects

Unique name checks used to be the bottleneck, but with that gone, another smaller bottleneck appears – sorting objects by name. The whole of Blender assumes that all the objects are always sorted in the “Main database” object lists. So whenever a new object is created, it needs to get inserted into the proper place in the list.

The name sorting code is written under assumption that generally, objects are created in already sorted-by-name order – it tries to find the proper places for objects going backwards from the end of the list. So within importer code, we can first pre-sort the objects by name, and only then create them. And that was D15506 (applies to USD and OBJ importers).

Dawson’s First Law of Computing

What is common among all these optimizations is that they all address Dawson’s First Law of Computing:

O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

See Dawson’s tweet, “quadratic” tag on his blog, Accidentally Quadratic tumblr.

All of the things mentioned above are quadratic complexity in terms of object count in the scene.

And these can happen fairly accidentally: say you have a function “set a material on a mesh” (as Blender does), and then whoops, someone finds that it needs to make sure all the objects that use said mesh need to have a little bit of massaging when the material is set. What’s the lowest-effort way of doing that? Of course, right inside the function, go through the whole world, and for all the objects that use this mesh, do this little bit of massaging.

And that’s fine if you are setting one material on one mesh. And, because computers are really fast these days, you won’t even notice if you are setting up materials on a thousand meshes. So what if this does a million tiny checks now? That’s a fraction of a second.

But, try to do that on a hundred thousand meshes, and suddenly the (ten billion) x (tiny things) product is not so tiny anymore.
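
In code form, the pattern looks something like this (a deliberately generic sketch, not actual Blender code; all the names here are made up):

#include <vector>

struct Mesh { /* ... */ };
struct Object { Mesh* mesh = nullptr; int materialSlots = 0; };

// Innocent-looking helper: setting a material on a mesh quietly rescans the whole world.
static void SetMaterialOnMesh(Mesh* mesh, const std::vector<Object*>& allObjects)
{
	for (Object* obj : allObjects) // O(N_objects) hiding inside
		if (obj->mesh == mesh)
			obj->materialSlots += 1; // the "little bit of massaging"
}

// An importer calling that once per created mesh: O(N^2) in the object count.
static void AssignMaterialsNaive(const std::vector<Object*>& imported, const std::vector<Object*>& world)
{
	for (Object* obj : imported)
		SetMaterialOnMesh(obj->mesh, world);
}

// The batched alternative: the importer already knows which objects it created,
// so it only touches those; O(N) overall, no world scan at all.
static void AssignMaterialsBatched(const std::vector<Object*>& imported)
{
	for (Object* obj : imported)
		obj->materialSlots += 1;
}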

I wrote about this back in 2017, in the context of shader variants combinatorial explosion (post):

Moral of the story is: code that used to do something with five things years ago might turn out to be problematic when it has to deal with a hundred. And then a thousand. And a million. And a hundred billion. Kinda obvious, isn’t it?

The moral this time might be, try to not have functions that work on one object, and are O(N_objects) complexity. They absolutely might be fine when called on one object. But sooner or later, someone will need to do that operation on N objects at once, and whoops suddenly you are in quadratic complexity land.

This is the same “when there is one, there’s more than one” guideline that DOD (data oriented design) ideas are saying: express operations in terms of doing them efficiently on a whole batch of things at once. Doing them on just one thing can be just a special case of that.

Another thing that is Blender specific is linked lists. The Blender codebase and some of the core data structures go back all the way to 1994. Back then, linked lists were a data structure held in high esteem – simple, elegant, can insert/remove items in any place! It was not until around 2000 that, with the growing computation-to-memory gap (and thus CPU caches) and the rise of multi-core programming, people realized that arrays tend to run circles around linked lists. Way fewer cache misses, and in many cases you can process an array of items in parallel. My understanding is that within Blender, there is “a wish” to gradually move away from linked lists in core data structures (like the name sorting issue above). Someday!

“N times faster!” can be misleading

“Ten times faster” is clear and unambiguous in isolation, but does not compose when multiple speedups are talked about cumulatively. Using the optimizations I did above, applied in that order, each of them would have this much speedup:

Optimization Step     Speedup (X times faster)
Collection Sync       1.12
Material Maps         1.03
Material Assignment   3.16
Unique Name Checks    5.77
Pre-sort Objects      7.10

So you would think that “well clearly that last one is the best optimization”. But it’s not! The speedup factor is order-dependent. Because the preceding optimization steps shaved off so much time from the whole process, the last one gets to enjoy “7x faster!” while it only changed the time from 10 minutes down to 1.5 minutes.

Instead, when talking about multiple isolated things, where each of them adds or saves something, it might be better to talk about their absolute times. That’s why performance people in games tend to talk about “how many milliseconds does system X take in the frame”, because you can add milliseconds up, in any order you want. Similarly, when optimizing something, you might want to talk about “time saved” in absolute terms.

For this same “import Moana scene” case, here are the savings of each optimization step:

Optimization Step     Time saved (minutes)
Collection Sync       22
Material Maps         6
Material Assignment   123
Unique Name Checks    46
Pre-sort Objects      8

Or more visually:

The tiny light blue sliver is the final time for import, left after all the optimizations applied. Clearly optimizing material assignment brought the most absolute time savings.

And that’s it! All the optimizations above have already landed to Blender 3.3 (currently in alpha). None of them were really complex or clever, just noticing some quadratic complexities and trying to stop them being quadratic.