Float Compression 5: Science!

Introduction and index of this series is here.

The previous post was about mis-using meshoptimizer compression on totally non-mesh data, with pretty good results. This time, let’s look at several libraries specifically targeted at compressing floating point data sets. Most of them come from the scientific community – after all, they have lots of simulation data, which is very often floating point, and some of it is massive and needs compression.

Let’s go!

Reminder: so far I’m only looking at lossless compression. Lossy compression investigation might be a future post.

zfp

zfp (website, github) is a library for either lossy or lossless compression of 1D-4D data (floats and integers, 32 and 64 bit sizes of either). I’ve seen it before in the EXR post.

It is similar to GPU texture compression schemes – 2D data is divided into 4x4 blocks, and each block is encoded completely independently from the others. Inside the block, various magic stuff happens and then, ehh, some bits get out in the end :) The actual algorithm is well explained here.

Sounds cool! Let’s try it (again, I’m using the lossless mode of it right now). zfp is the red 4-sided star point:

Ouch. I hope I “used it wrong” in some way? But this is not looking great. The ratio is under 1, i.e. it makes the data larger, and it is slow to decompress. (Edit: initially I made a mistake and misunderstood what zfp_compress returns – it returns the cumulative number of compressed bytes, not the number of new compressed bytes produced during that call. The second mistake I made was when compressing 1D data – you need to tell it that the Y dimension is zero, not one!)
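
For reference, here is roughly how the lossless path gets set up – a minimal sketch based on zfp’s public C API, with my own wrapper name and no error handling; the actual test harness differs:

// Minimal sketch of lossless ("reversible") zfp compression of a float array.
// Wrapper name and buffer handling are mine.
#include <zfp.h>

size_t CompressZfpLossless(float* data, size_t sizeX, size_t sizeY, void* outBuffer, size_t outCapacity)
{
	// For 1D data use a 1D field (i.e. a zero Y dimension); passing Y=1 hurts the ratio.
	zfp_field* field = (sizeY != 0)
		? zfp_field_2d(data, zfp_type_float, sizeX, sizeY)
		: zfp_field_1d(data, zfp_type_float, sizeX);
	zfp_stream* zfp = zfp_stream_open(NULL);
	zfp_stream_set_reversible(zfp); // lossless mode
	bitstream* stream = stream_open(outBuffer, outCapacity);
	zfp_stream_set_bit_stream(zfp, stream);
	zfp_stream_rewind(zfp);
	// Note: zfp_compress returns the cumulative number of compressed bytes so far.
	size_t compressedSize = zfp_compress(zfp, field);
	zfp_field_free(field);
	zfp_stream_close(zfp);
	stream_close(stream);
	return compressedSize;
}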

In lossless mode, zfp is not looking super great – ratio is not stellar, and decompression speed in particular is quite slow. The compression ratio is similar to zstd without any data filtering. To be fair, zfp compresses the two 2D files from my data set much better than the other two (1D) files. My takeaway is that zfp might be mostly targeted at lossy compression, which is a topic for some other day. Let’s move on.

fpzip

fpzip (website, github) is from the same research group as zfp, and is their previous floating point compressor. It can also be both lossless and lossy, but from the description it seems to be more targeted at the lossless case. Let’s try it out (fpzip is the 5-sided star):

Ok, this is better compression ratio compared to zfp. Here I’m first splitting the data by floats (see part 3 post) so that all water heights are together, all velocities are together, etc. Without doing that, fpzip does not get any good compression ratio. Code is here.

  • Compression ratio and performance are really good! Our data set gets down to 24.8MB (not far from “split bytes + delta + zstd” at 22.9MB or “mesh optimizer + zstd” at 24.3MB). It compresses in 0.6 seconds.
  • However, decompression performance is disappointing; it takes the same 0.6 seconds as compression – so between 5x and 10x slower than the other best approaches so far.
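
For reference, driving fpzip’s in-memory API in lossless mode looks roughly like this – a sketch with my own wrapper name and buffer sizing (prec = 0 requests full precision); the actual test code linked above differs:

// Minimal sketch of lossless fpzip compression of a float array, treated as 1D data.
#include <fpzip.h>

size_t CompressFpzip(const float* data, size_t count, void* outBuffer, size_t outCapacity)
{
	FPZ* fpz = fpzip_write_to_buffer(outBuffer, outCapacity);
	fpz->type = FPZIP_TYPE_FLOAT;
	fpz->prec = 0;          // 0 = full precision, i.e. lossless
	fpz->nx = (int)count;   // treat the whole stream as one 1D array
	fpz->ny = 1;
	fpz->nz = 1;
	fpz->nf = 1;
	size_t written = fpzip_write(fpz, data); // compressed size in bytes (0 on error)
	fpzip_write_close(fpz);
	return written;
}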

SPDP

SPDP (website, 2018 paper) is interesting in that it is a “generated” algorithm. They developed a number of building blocks (reorder something, delta, LZ-like compress, etc.), then tested millions of their possible combinations on some datasets, and picked the winner. The source code only comes as either an HDF5 filter or a standalone command line utility. I had to modify it slightly (code) to be usable as a library, and to not use 1MB of stack space, which Windows does not appreciate :)

A series of six-sided stars here:

  • Similar to fpzip case, I had to split the data by floats (code) to get better compression ratio.
  • Compression ratio is between regular zstd and lz4, and is far behind the best options.
  • It is about twice as fast as fpzip at both compression and decompression, which is still far behind the better options.

The idea of having a bunch of building blocks and automatically picking their best sequence / combination on a bunch of data sets is interesting though. Maybe they should have used stronger building blocks (their LZ-like codec might not be super good, I guess).

ndzip

ndzip (github, 2021 paper) promises an “efficient implementation on modern SIMD-capable multicore processors” that “compresses and decompresses data at speeds close to main memory bandwidth, significantly outperforming existing schemes” – let’s see about that!

Note that I have not used or tested multi-threaded modes of any of the compressors present. Some can do it; all could do it if incoming data was split into some sort of chunks (or after splitting “by floats”, each float block compressed in parallel). But that’s not for today.

ndzip does not build on Windows out of the box (I opened a PR with some fixes), and for the multi-threaded code path it uses OpenMP features that MSVC 2022 seems to not have. It is also specifically designed for AVX2, and that needs a bit of juggling to get compiled on Windows too. On Mac, it does not link due to some symbol issues related to STL (and the AVX2 code path would not really work on arm64 anyway). On Linux, the multi-threaded OpenMP path does not produce correct results, but the single-threaded path does. Huge props for releasing the source code, but all of this does sound more like a research project that is not quite ready for production use :)

Anyway, 7-sided yellow star here:

  • Similar to the others, I split the data by floats first; otherwise it achieves pretty much no compression.
  • Now this one does achieve a place on the Pareto frontier for compression. The ratio is well behind the best possible (file size gets to 38.1MB; best others go down to 23MB), but it does compress at over 1GB/s. So if you need that kind of compression performance, this one’s interesting.
  • They also have CUDA and SYCL code for GPUs too. I haven’t tried that.

streamvbyte

streamvbyte (github, blog post, 2017 paper) is not meant for compressing floating point data; it is targeted at compressing 4-byte integers. But hey, there’s no law saying we can’t pretend our floats are integers, right?
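
Treating the floats as integers is literally just a reinterpret before handing them over; a minimal sketch, with my own wrapper and worst-case output sizing:

// Minimal sketch: compress float data with streamvbyte by reinterpreting it as uint32.
// Worst case output is 1 control byte per 4 values plus 4 data bytes per value.
#include <streamvbyte.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> CompressFloatsStreamVByte(const float* data, uint32_t count)
{
	std::vector<uint8_t> out((count + 3) / 4 + (size_t)count * 4);
	const uint32_t* ints = reinterpret_cast<const uint32_t*>(data); // floats? they're integers now
	size_t size = streamvbyte_encode(ints, count, out.data());
	out.resize(size);
	return out;
}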

  • Three-sided star is regular streamvbyte. Only 1.2x compression ratio, but it is the fastest of the bunch; compressing at 5.7GB/s, decompressing at ~10GB/s.
  • There’s also streamvbyte_delta, which on unfiltered data is not really good (not shown here).
  • However (similar to others), if the data is first split by floats, then streamvbyte_delta followed by a general purpose compressor is quite good. Especially if you need compression faster than 0.5GB/s, then “split by floats, streamvbyte_delta, zstd” is on the Pareto frontier, reaching 3.5x ratio.

Conclusion and what’s next

On this data set, in lossless mode, most of the compression libraries I tried in this post are not very impressive. My guess is that it’s a combination of several factors: 1) maybe they are primarily targeted at double precision, and single precision floats are somewhat of an afterthought, 2) maybe they are much better at 3D or 4D data, and are weaker at a mix of 2D and 1D data like in my case, 3) maybe they are much better when used in lossy compression mode.

However, it is curious that some of the papers describing these compressors compare them only to other scientific compressors, or only to regular lossless compression algorithms. So they reach conclusions like “we surpass zstd a bit!” and declare that a win, without comparing to something like “filter the data, and then zstd it”.

Another interesting aspect is that most of these libraries have symmetric compression and decompression performance, which is very different from most regular data compression libraries, where the compression part is often much slower.

Next up: either look into lossy compression, or into speeding up the data filtering part. Until then!


Float Compression 4: Mesh Optimizer

Introduction and index of this series is here.

The previous post investigated some lossless filtering of data, before passing it to a regular compression library. Our result so far is: 94.5MB of data can get filtered+compressed down to 23.0MB in one second (split/shuffle bytes, delta encode, zstd or kraken compression). It decompresses back in about 0.15 seconds (which is quite a bit slower than without data filtering, but that’s something for a later day).

In this post, I’ll look at one open source library that does not feel like it would immediately be suitable for compressing my data set.

zeux/meshoptimizer

meshoptimizer by Arseny Kapoulkine is a very nice library for processing, optimizing, and compressing 3D model (“mesh”) data. It can do vertex and index reordering for efficiency of various GPU caches, mesh simplification, quantization and conversion of the vertex data, has index and vertex data compression utilities, and various tools to do all that on glTF2 geometry files.

Notice the “vertex data compression utilities” bit? We’re going to try that one. Since the meshoptimizer compression is an official glTF extension, there’s a specification for it even (EXT_meshopt_compression).

It’s not immediately obvious that it can also be used for anything other than actual 3D model vertex data compression. But look at it:

  • It is completely lossless,
  • It takes “size of each vertex in bytes” and “number of vertices” as an input. But who says these need to be vertices? It’s just some data.
  • It assumes that there is some correlation / smoothness between neighboring vertices; that’s how it gets compression after all. We don’t have “vertices”, but our “data items” from water or snow simulation are nicely laid out in memory one after another, and their values do vary fairly smoothly.

In our case, water, snow simulation and float4 data files are all going to be “yeah, we are totally 16-byte vertices”, and the float3 data file is going to be “I’m full of 12-byte vertices, really”. And then we just use meshopt_encodeVertexBufferBound, meshopt_encodeVertexBuffer and meshopt_decodeVertexBuffer functions from the library.
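
Roughly, that boils down to calls like these – the function names are the actual meshoptimizer vertex codec API, while the wrapper and buffer handling are my own sketch:

// Sketch of (de)compressing arbitrary data with the meshoptimizer vertex codec;
// an "item" here is one 16-byte (or 12-byte) simulation data element, not a real vertex.
#include <meshoptimizer.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> CompressMeshOpt(const void* data, size_t itemCount, size_t itemSize)
{
	std::vector<uint8_t> buffer(meshopt_encodeVertexBufferBound(itemCount, itemSize));
	size_t size = meshopt_encodeVertexBuffer(buffer.data(), buffer.size(), data, itemCount, itemSize);
	buffer.resize(size);
	return buffer;
}

bool DecompressMeshOpt(void* dst, size_t itemCount, size_t itemSize, const uint8_t* src, size_t srcSize)
{
	return meshopt_decodeVertexBuffer(dst, itemCount, itemSize, src, srcSize) == 0; // 0 means success
}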

So does it work?

The chart here shows our three regular compressors (zstd, lz4, kraken; dashed lines), as well as the same compressors with best filtering from previous post (solid lines). meshoptimizer is the large blue point, since it has no “compression levels” to speak of.

This is actually pretty impressive!

  • Really fast to compress (0.2 seconds), and really fast to decompress (3.5GB/s, almost at lz4 speeds).
  • For a “compress in under a second” task, it beats zstd on ratio, and achieves the same ratio as Oodle Kraken. 😮 We have a 29.5MB file.
  • It does not quite achieve the compression ratio of regular compressors with filtering applied. But hey, we are mis-using “mesh” optimizer library for data that is almost certainly not meshes :)

Oh but hey, meshoptimizer readme says:

The result of the encoding is generally significantly smaller than initial data, and remains compressible with general purpose compressors

So let’s try just that: compress our data with meshoptimizer, and then try adding our old friends zstd/lz4/kraken on top of that.
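
In code, that is simply feeding the meshoptimizer-encoded buffer into the regular compressor afterwards; a sketch using the hypothetical CompressMeshOpt helper from above and zstd’s one-shot API:

// Sketch: meshoptimizer vertex codec first, then zstd on top of its output.
// (error checking with ZSTD_isError omitted)
#include <zstd.h>

std::vector<uint8_t> CompressMeshOptPlusZstd(const void* data, size_t itemCount, size_t itemSize, int level)
{
	std::vector<uint8_t> encoded = CompressMeshOpt(data, itemCount, itemSize);
	std::vector<uint8_t> out(ZSTD_compressBound(encoded.size()));
	size_t size = ZSTD_compress(out.data(), out.size(), encoded.data(), encoded.size(), level);
	out.resize(size);
	return out;
}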

  • For compression under one second, this now achieves 24.4MB file size. Nice!
  • Kraken and zstd are almost the same performance and ratio here.
  • Still not as small as filtering + regular compression (which gets down to 23.0MB), but pretty close.
  • Decompression is still very fast; 3x faster than with filtering + regular decompression. Nice!

I have also tried various filtering approaches before doing the mesh optimizer compression (split floats, split bytes, delta, xor, rotate floats left by one bit, etc.). These do not actually help; they often make the compression ratio worse. This makes sense; the mesh optimizer vertex compression codec has a lot of similarities to a data filter itself, so additional filtering just gets in the way and “confuses” it. Or that’s my impression.

Conclusion and what’s next

My takeaway is that if you have structured data that is mostly floating point and inherently has some similarities / smoothness across it, then you should take a look at using meshoptimizer vertex compression codec. Even if your data is not meshes at all!

It makes the data smaller by itself, but you can also pass that down into any other regular data compression library for further compression.

And it’s really fast at both compression and decompression. There’s a pure-JavaScript version too in there, if you’re targeting the web platform.

Next up, I’ll look into several libraries specifically targeted at floating point data compression, that are mostly coming from the scientific community. And then after that, maybe at lossy compression.


Float Compression 3: Filters

Introduction and index of this series is here.

In the previous parts we saw that using generic data compression libraries, we can get our 94.5MB data down to 33.8MB (zstd level 7) or 29.6MB (oodle kraken level 2) size, if we’re not willing to spend more than one second compressing it.

That’s not bad, but is there something else we can do? Turns out, there is, and in fact it’s quite simple. Enter data filtering.

Prediction / filtering

We saw filtering in the past (EXR and SPIR-V), and the idea is simple: losslessly transform the data so that it is more compressible. Filtering alone does nothing to reduce the data size, but (hopefully!) it decreases data randomness. So the process is: filter the data, then compress that. Decompression is reverse: decompress, then un-filter it.

Here are some simple filters that I’ve tried (there are many, many other filters possible, I did not try them all!).

Reorder floats array-of-structures style

Recall that in our data, we know that water simulation has four floats per “element” (height, velocity x, velocity y, pollution); snow simulation similarly has four floats per element; and other data is either four or three floats per element. Instead of having the data like that (“array of structures” style), we can try to reorder it into “structure of arrays” style. For water simulation, that would be all heights first, then all x velocities, then all y velocities, etc.

So this:

becomes this:

Completely unoptimized code to do that could look like this (and our data is floats, i.e. 4 bytes, so you’d call these templates with a 4-byte type e.g. uint32_t):

// channels: how many items per data element
// dataElems: how many data elements
template<typename T>
static void Split(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		const T* ptr = src + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*dst = *ptr;
			ptr += channels;
			dst += 1;
		}
	}
}
template<typename T>
static void UnSplit(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		T* ptr = dst + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*ptr = *src;
			src += 1;
			ptr += channels;
		}
	}
}

Does that help? The results are interesting (click for an interactive chart):

  • It does help LZ4 to achieve a bit higher compression ratios.
  • Makes zstd compress faster, and helps the ratio at lower levels, but hurts the ratio at higher levels.
  • Hurts oodle kraken compression.
  • Hurts the decompression performance quite a bit (for lz4 and kraken, slashes it in half). In all cases the data still decompresses under 0.1 seconds, so acceptable for my case, but the extra pass over memory is not free.

Ok, so this one’s a bit “meh”, but hey, now that the data is grouped together (all heights, then all velocities, …), we could try to exploit the fact that maybe neighboring elements are similar to each other?

Reorder floats + XOR

In the simulation data example, it’s probably expected that usually the height, or velocity, or snow coverage, does not vary “randomly” over the terrain surface. Or in an image, you might have a color gradient that varies smoothly.

“But can’t data compressors already compress that really well?!”

Yes and no. Usually generic data compressors can’t. Most of them are very much oriented at finding repeated sequences of bytes. So if you have a very smoothly varying surface height or image pixel color, e.g. a sequence of bytes 10, 11, 12, 13, 14, 15, well that is not compressible at all! There are no repeating byte sequences.

But, if you transform the sequence using some sort of “difference” between neighboring elements, then repeated byte sequences might start appearing – the sequence above, delta-encoded, becomes 10, 1, 1, 1, 1, 1, which is suddenly very repetitive. At first I tried XOR’ing the neighboring elements together (interpreting each float as a uint32_t), since at some point I saw that trick being mentioned in some “time series database” writeups (e.g. Gorilla).

A completely unoptimized code to do that:

template<typename T>
static void EncodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v ^ prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev ^ v;
		*data = v;
		prev = v;
		++data;
	}
}

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats), thick line: split floats + XOR):

  • Compression ratio is way better for zstd and lz4 (for kraken, only at lower levels).
  • zstd pretty much reaches kraken compression levels! The lines almost overlap in the graph.
  • Decompression speed takes a bit of a hit, as expected. I might need to do something about it later.

So far we got from 33.8MB (zstd) / 29.6MB (kraken) at the beginning of the post down to 28MB (zstd, kraken), while still compressing in under 1 second. Nice, we’re getting somewhere.

Reorder floats + Delta

The “xor neighboring floats” trick from the Gorilla database was in the context of then extracting the non-zero sequences of bits from the result and storing them in less space than four bytes. I’m not doing any of that, so how about this: instead of XOR, do a difference (“delta”) between the neighboring elements? Note that the delta is done by reinterpreting the data as if it were unsigned integers, i.e. these templates are called with the uint32_t type (you can’t easily do completely lossless floating point delta math).

template<typename T>
static void EncodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v - prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev + v;
		*data = v;
		prev = v;
		++data;
	}
}

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats + XOR), thick line: split floats + delta):

Now that’s quite an improvement! All three compressors tested get their compression ratio lifted up. Good! Let’s keep on going.

Reorder bytes

Hey, how about instead of splitting each data point into 4-byte-wide “streams”, we split into 1-byte-wide ones? After all, general compression libraries are oriented at finding byte sequences that would be repeating. This is also known as a “shuffle” filter elsewhere (e.g. HDF5). Exactly the same Split and UnSplit functions as above, just with uint8_t type.

Faint dashed line: raw compression, thin line: previous attempt (split floats + Delta), thick line: split bytes:

  • kraken results are almost the same as with “split floats and delta”. Curious!
  • zstd ratio (and compression speed) is improved a bit.
  • lz4 ratio is improved a lot (it’s beating original unfiltered kraken at this point!).

I’ll declare this a small win, and let’s continue.

Reorder bytes + Delta

Split by bytes as previous, and delta-encode that. Faint dashed line: raw compression, thin line: previous attempt (split bytes), thick line: split bytes + delta:

Holy macaroni grated potato dumplings!

  • Another compression ratio increase. Both zstd and kraken get our data to 23MB in about one second (whereas it was 33.8MB and 29.6MB at the start of the post).
  • zstd actually slightly surpasses kraken at compression ratios in the area (“under 1 sec”) that I care about. 😮
  • lz4 is not too shabby either, being well ahead of unfiltered kraken.
  • Downside: decompression is slightly longer than 0.1 seconds now. Not “terrible”, but I’d want to look into whether all this reordering and delta could be sped up.
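
For completeness, the whole winning filter is just a composition of the helpers from earlier in the post; a sketch (buffer handling is mine; “channels” here is the data element size in bytes, e.g. 16 for the four-float simulation items):

// Sketch: the "split bytes + delta" filter and its inverse, composed from the
// Split/UnSplit and EncodeDeltaDif/DecodeDeltaDif helpers above.
#include <cstdint>
#include <vector>

static void FilterSplitBytesDelta(const uint8_t* src, uint8_t* dst, size_t channels, size_t dataElems)
{
	Split<uint8_t>(src, dst, channels, dataElems);
	EncodeDeltaDif<uint8_t>(dst, channels * dataElems);
}

static void UnFilterSplitBytesDelta(const uint8_t* src, uint8_t* dst, size_t channels, size_t dataElems)
{
	// undo the delta first (in a scratch copy, since DecodeDeltaDif works in place), then un-split
	std::vector<uint8_t> tmp(src, src + channels * dataElems);
	DecodeDeltaDif<uint8_t>(tmp.data(), tmp.size());
	UnSplit<uint8_t>(tmp.data(), dst, channels, dataElems);
}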

Conclusion and what’s next

There’s lots of other data filtering approaches and ideas I could have tried, but for now I’m gonna call “reorder bytes and delta” a pretty good win; it’s extremely simple to implement and gives a massive compression ratio improvement on my data set.

I did actually try a couple of other filtering approaches. Splitting the data by bits (using the bitshuffle library) produced worse ratios than splitting by bytes. Rotating each float left by one bit, to make the mantissa & exponent aligned on byte boundaries, was also not an impressive result. Oh well!

Maybe at some point I should also test filters specifically designed for 2D data (like the water and snow simulation data files in my test), e.g. something like PNG Paeth filter or JPEG-LS LOCO-I (aka “ClampedGrad”).

Next up, I’ll look at an open source library that does not advertise itself as a general data compressor, but I’m gonna try it anyway :) Until then!


Float Compression 2: Oodleflate

Introduction and index of this series is here.

In the previous part I looked at generic lossless data compressors (zlib, lz4, zstd, brotli), and was thinking of writing about data filtering next. But then, libdeflate and Oodle entered the chat!

libdeflate

libdeflate is an independent implementation of zlib/deflate compression that is heavily optimized. I’ve seen it before (EXR blog post), and indeed it is very impressive, considering it produces completely zlib-compatible data, but much faster and at a bit higher compression ratio too.

Here are the results (click for an interactive chart). The compressors from the previous post are faded out; libdeflate addition is dark green:

It is not better than zstd, but definitely way better than zlib (~3x faster compression, ~2x faster decompression, slightly higher ratio). If you need to produce or consume deflate/zlib/gzip formatted data, then absolutely look into libdeflate.
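
For reference, the one-shot zlib-format path in libdeflate looks roughly like this (a sketch; the wrapper name and buffer handling are my own):

// Sketch of one-shot zlib-format compression with libdeflate.
#include <libdeflate.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> CompressLibdeflate(const void* data, size_t size, int level)
{
	libdeflate_compressor* c = libdeflate_alloc_compressor(level);
	std::vector<uint8_t> out(libdeflate_zlib_compress_bound(c, size));
	size_t written = libdeflate_zlib_compress(c, data, size, out.data(), out.size()); // 0 on failure
	out.resize(written);
	libdeflate_free_compressor(c);
	return out;
}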

Oodle

Oodle is a set of compressors by Epic Games Tools (née RAD Game Tools). It is not open source or freely available, but the good people there graciously gave me the libraries to play around with.

There are several compressors available, each aiming at different ratio vs. performance tradeoff. It’s easiest to think of them as: “Kraken” is “better zstd” (great ratio, good decompression performance), “Selkie” is “better lz4” (ok ratio, very fast decompression), and “Mermaid” is in the middle of the two, with no clear open source competitor.

Here they are, additions in various shades of purple:

And… yeah. On the compression front, Mermaid leaves everyone else (zstd, brotli) in the dust on the Pareto front (remember: upper right is better). And then Kraken improves on top of that 🤯 Decompression performance is a similar story: everything (except lz4) is behind.

Wizardry! It’s a shame they are not freely available, eh :/

What’s next?

Next up, we’ll really look at some simple data filtering (i.e. fairly similar to the EXR post from a while ago). Until then!


Float Compression 1: Generic

Introduction and index of this series is here.

We have 94.5MB worth of data, and we want to make it smaller. About the easiest thing to do: use one of the existing lossless data compression libraries, or as us old people call it, “zip it up”.

There are tons of compression libraries out there; I certainly will not test all of them (there’s lzbench and others for that). I tried some popular ones: zlib, lz4, zstd and brotli.
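
For context, each library is exercised through its simple one-shot API; for lz4, that is roughly the following (a sketch, not the actual benchmark code):

// Sketch of one-shot lz4 compression, the kind of call each data file goes through.
#include <lz4.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> CompressLz4(const void* data, size_t size)
{
	std::vector<uint8_t> out(LZ4_compressBound((int)size));
	int written = LZ4_compress_default((const char*)data, (char*)out.data(), (int)size, (int)out.size());
	out.resize(written > 0 ? (size_t)written : 0); // 0 means compression failed
	return out;
}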

Here are the results (click for an interactive chart):

This is running on Windows, AMD Ryzen 5950X, compiled with Visual Studio 2022 (17.4), x64 Release build. Each of the four data files is compressed independently one after another, serially on a single thread.

I also have results on the same Windows machine, compiled with Clang 15, and on an Apple M1 Max, compiled with Clang 14.

What can we see from this?

The compression charts have compression ratio (higher = better) on the vertical axis, and throughput on the horizontal axis (higher = better). Note that the horizontal axis is logarithmic scale; performance of the rightmost point is literally hundreds of times faster than the leftmost point. Each line on the graph is one compression library, with different settings that it provides (usually called “compression levels”).

Compression graph is the usual “Pareto front” situation - the more towards upper right a point is, the better (higher compression ratio, and better performance). The larger points on each compression curve are the “default” levels of each library (6 for zlib, 3 for zstd, “simple” for lz4).

Decompression graph, as is typical for many data compression algorithms, is much less interesting. Often decompression performance is more or less the same when using the same compression library; no matter what the compression level is. You can ballpark and say “lz4 decompresses at around 5GB/s” for this data set.

Anyway, some takeaway points:

  • Generic data compression libraries can get this data set from 94.5MB down to 32-48MB, i.e. make it 2x-3x smaller.
  • They would take between 0.1s and 1.0s to compress the data, and between 0.02s and 0.2s to decompress it.
    • With Brotli maximum level (11), you can go as low as 27MB data (3.5x ratio), but it takes over two minutes to compress it :/
  • zstd is better than zlib in all aspects. You can get the same compression ratio while compressing 10x faster; or spend the same time compressing and get a better ratio (e.g. 2.6x -> 2.9x). Decompression is at least 2x faster too.
    • The current version of zstd (1.5.2, 2022 Jan) has some dedicated decompression code paths written in assembly, which are only used when compiling with clang/gcc compilers. On a Microsoft platform you might want to look into using Clang to compile it; decompression goes from around 1GB/s up to 2GB/s! Or… someone could contribute MSVC-compatible assembly routines to the zstd project :)
  • If you need decompression faster than 2GB/s, use lz4, which goes at around 5GB/s. While zstd at level -5 is roughly the same compression ratio and compression performance as lz4, the decompression is still quite a bit slower. lz4 does not reach more than 2.2x compression ratio on this data set though.
  • If you need best compression ratio out of these four, then brotli can do that, but compression will not be fast. Decompression will not be fast either; curiously brotli is slower to decompress than even zlib.
  • There’s basically no reason to use zlib, except that it likely exists everywhere due to its age.

Which place exactly on the Pareto front is important to you, depends on your use case:

  • If compression happens once on some server, and the result is used thousands/millions of times, then you care much more about compression ratio, and not so much about compression performance. A library like brotli seems to be targeted at exactly this use case.
  • On the completely opposite side, you might want to “stream” some data exactly once, for example some realtime application sending profiling data for realtime viewing. You are compressing it once, and decompressing it once, and all you care about is whether the overhead of compression/decompression saves enough space to transmit it faster. lz4 is primarily targeted at use cases like this.
  • If your scenario is similar to a “game save”, then for each compression (“save”), there’s probably a handful of decompressions (“load”) expected. You care both about compression performance, and about decompression performance. Smaller file like with brotli level 11 would be nice, but if it means almost three minutes of user waiting (as opposed to “one second”), then that’s no good.

My use case here is similar to a “game save”; as in if compressing this data set takes longer than one or two seconds then it’s not acceptable. Later posts will keep this in mind, and I will not even explore the “slow” compression approaches much.

In general, know your use case! And don’t just blindly use “maximum” compression level; in many libraries it does not buy you much, but compression becomes much, much slower (brotli here seems to be an exception - while compression is definitely much slower, going from level 10 to the maximum level 11 does increase compression ratio quite a bit).

What’s next?

Next up, we’ll look at some simple data filtering (i.e. fairly similar to the EXR post from a while ago), or maybe at a couple more generic data compressors. Until then!