Gaussian Splatting is pretty cool!

SIGGRAPH 2023 just had a paper “3D Gaussian Splatting for Real-Time Radiance Field Rendering” by Kerbl, Kopanas, Leimkühler, Drettakis, and it looks pretty cool! Check out their website, source code repository, data sets and so on (I should note that it is really, really good to see full source and full data sets being released. Way to go!).

I’ve decided to try to implement the real-time visualization part (i.e. the one that takes an already-produced gaussian splat “model” file) in Unity, and maybe also look at whether the data sizes could be made smaller (perhaps using some of the learnings from the float compression series too?).

What’s a few million badly rendered boxes among friends, anyway?

For the impatient: I got something working over at aras-p/UnityGaussianSplatting, and will tinker with things there some more. And since this post, I wrote several others:

Meanwhile, some quick random thoughts on this Gaussian Splatting thing.

What are these Gaussian Splats?

Watch the paper video and read the paper, they are pretty good!

I have seen quite a few 3rd party explanations of the concept at this point, and some of them, uhh, get a thing or two wrong about it :)

  • This is not a NeRF (Neural Radiance Field)! There is absolutely nothing “neural” about it.
  • It is not somehow “fast, because it uses GPU rasterization hardware”. The official implementation does not use the rasterization pipeline at all; it is 100% done with CUDA. In fact, it is fast because it does not use the fixed function rasterization, as we’ll see below.

Anyway,

Gaussian Splats are, basically, “a bunch of blobs in space”. Instead of representing a 3D scene as polygonal meshes, or voxels, or distance fields, it represents it as (millions of) particles:

  • Each particle (“a 3D Gaussian”) has position, rotation and a non-uniform scale in 3D space.
  • Each particle also has an opacity, as well as color (actually, not a single color, but rather 3rd order Spherical Harmonics coefficients - meaning “the color” can change depending on the view direction).
  • For rendering, the particles are rendered (“splatted”) as 2D Gaussians in screen space, i.e. they are not rendered as scaled elongated spheres, actually! More on this below.

And that’s it. The “Gaussian Splatting” scene representation is just that, a whole bunch of scaled and colored blobs in space. The genius part of the paper is several things:

  • They found a way to create these millions of blobs that would represent a scene depicted by a bunch of photos. This is using gradient descent and “differentiable rendering” and all the other things that are way over my head. This feels like the major contribution; maybe previously people assumed that for a gradient descent optimizer to work nicely, you need to use a continuous or connected scene representation (vs “just a bunch of blobs”), and this paper proved that wrong? Anyway, I don’t understand this area, so I won’t talk about it more :)
  • They have developed a fast way to render all these millions of scaled particles. This by itself is not particularly groundbreaking IMHO; various people have noticed that using something like a tile-based “software, but on GPU” rasterizer is a good way to do this.
  • They have combined existing established approaches (like gaussian splatting, and spherical harmonics) in a nice way.
  • And finally, they have resisted the temptation to do “neural” anything ;)

Previous Building Blocks

Gaussian splatting itself seems to have been invented around 2001-2002, see for example the “EWA Splatting” paper by Zwicker, Pfister, Van Baar, Gross. There they have scaled and oriented “blobs” in space, calculate how they would project onto the screen, and then do the actual “blob shape” (a “Gaussian”) in 2D, in screen space. A bunch of signal processing, sampling, aliasing etc. math presumably supports doing it that way.

Speaking of ellipsoids, Ecstatica game from 1994 had a fairly unique ellipsoid-based renderer.

Spherical Harmonics (a way to represent a function over a surface of a sphere) have been around for several hundred years in physics, but really were popularized in computer graphics around 2000 by Ravi Ramamoorthi and Peter-Pike Sloan. But actually, a 1984 “Ray tracing volume densities” paper by Kajiya & Von Herzen might be the first use of them in graphics. A nice summary of various things related to SH is at Patapom’s page.

Point-Based Rendering in various forms has been around for a long time, e.g. particle systems have been used since “forever” (but typically for vfx / non-solid phenomena). “The Use of Points as a Display Primitive” is from 1985. “Surfels” paper is from 2000.

Something closer to my heart, a whole bunch of demoscene demos are using non-traditional rendering approaches. Fairlight & CNCD have several notable examples, e.g. Agenda Circling Forth (2010), Ceasefire (all falls down..) (2010), Number One / Another One (2018):

Real-time VFX tools like Notch have pretty extensive features for creating, simulating and displaying point/blob based “things”.

Ideas of representing images or scenes with a bunch of “primitive shapes”, as well as tools to generate those, have been around too. E.g. fogleman/primitive (2016) is nice.

Media Molecule “Dreams” has a splat-based renderer (I think the shipped version is not purely splat-based but a combination of several techniques). Check out the most excellent “Learning from Failure” talk by Alex Evans: at SIGGRAPH 2015 (splats start at slide 109) or video from Umbra Ignite 2015 (splats start at 22:34).

Tiled Rasterization for particles has been around at least since 2014 (“Holy smoke! Faster Particle Rendering using Direct Compute” by Gareth Thomas). And the idea of dividing the screen into tiles and doing a bunch of work “inside the tile”, thus cutting down on memory traffic, is how the entire mobile GPU space operates, and has been operating since “forever”, tracing back to the first PowerVR designs (1996) and even Pixel Planes 5 from 1989.

This is all great! Taking existing, developed, solid building blocks and combining them in a novel way is excellent work.

My Toy Playground

My current implementation (of just the visualizer of Gaussian Splat models) for Unity is over at github: aras-p/UnityGaussianSplatting. Current state is “it kinda works, but it is not fast”:

  • The rendering does not look horrible, but does not exactly match the official implementation. Here is the official rendering vs my rendering of the same scene. The official one has more small detail, and the lighting is slightly different. Fixed!
  • Performance is not great. The scene above renders on NVIDIA RTX 3080 Ti at 1200x800 in 7.40ms (135FPS) in the official viewer, whereas my attempt is 23.8ms (42FPS) currently, i.e. roughly 3x slower. For sorting I’m using a fairly simple GPU bitonic sort (the official impl uses a CUDA radix sort based on the OneSweep algorithm). Rasterization in their case is tile-based and written in CUDA, whereas I’m “just” using the regular GPU rasterization pipeline and rendering each splat as a screenspace quad.
    • On the plus side, my code is all regular HLSL within Unity, which means it also happens to work on e.g. Mac just fine. The scene above on Apple M1 Max renders in 108ms (9FPS) though :/
    • My implementation seems to use 2x less GPU memory right now too (official viewer: 4.8GB, mine: 2.2GB and that’s including whatever Unity editor takes).

So all of that could be improved and optimized quite a bit!

One thing I haven’t seen talked much about, by everyone super excited about Gaussian Splats, is data size and memory usage. Yeah, rendering is nice, but this bicycle scene above is 1.5GB of data on-disk, and then at runtime it needs some more (for sorting, tile based rendering etc.). That scene is six million blobs in space, with each of them taking about 250 bytes. There has to be some way to make that smaller! Actually the Dreams talk above has some neat ideas.
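
For a rough sense of where those ~250 bytes go, here is a sketch of the per-splat data based on the attribute list above (field names and exact layout here are illustrative, not necessarily what the official files use):

struct GaussianSplat
{
    float position[3];   // 12 bytes
    float rotation[4];   // quaternion, 16 bytes
    float scale[3];      // 12 bytes
    float opacity;       //  4 bytes
    float sh[3 * 16];    // 3rd order SH: 16 coefficients per color channel, 192 bytes
};                       // ~236 bytes, plus whatever else the training output stores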

Maybe I should play around with that. Someday!


Float Compression 9: LZSSE and Lizard

Introduction and index of this series is here.

Some people asked whether I have tested LZSSE or Lizard. I have not! But I have been aware of them for years. So here’s a short post, testing them on “my” data set. Note that, at least currently, neither of these compressors seems to be actively developed or updated.

LZSSE and Lizard, without data filtering

Here they are on Windows (VS2022, Ryzen 5950X). Also included Zstd and LZ4 for comparison, as faint dashed lines:

For LZSSE I have tested the LZSSE8 variant, since that’s what the readme says to generally use. “Zero” compression level here is the “fast” compressor; other levels are the “optimal” compressor. Compression levels beyond 5 seem to not buy much ratio, but get much slower to compress. On this machine, on this data set, it does not look competitive - compression ratio is very similar to LZ4; decompression is a bit slower, compression a lot slower.

For Lizard (née LZ5), it really is like four different compression algorithms in there (fastLZ4, LIZv1, fastLZ4 + Huffman, LIZv1 + Huffman). I have not tested the Huffman variants since they cannot easily coexist with Zstd in the same build (symbol redefinitions). The fastLZ4 is shown as lizard1x here, and LIZv1 is shown as lizard2x.

lizard1x (i.e. Lizard compression levels 10..19) seems to be pretty much the same as LZ4. Maybe it was faster than LZ4 back in 2019, but since then LZ4 gained some performance improvements?

lizard2x is interesting - better compression ratio than LZ4, a bit slower decompression speed. It sits in between Zstd and LZ4 in the decompression speed / ratio space.

What about Mac?

The above charts are on the x64 architecture, with the Visual Studio compiler. How about a Mac (with a Clang compiler)? But first, we need to get LZSSE working there, since it is very much written with raw SSE4.1 intrinsics and no fallback or other platform paths. Luckily, just dropping a sse2neon.h into the project and doing a tiny change in the LZSSE source makes it just work on an Apple M1 platform.
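
Conceptually the change amounts to something like this (the exact preprocessor check and include in the actual patch may differ):

#if defined(__aarch64__) || defined(_M_ARM64)
#   include "sse2neon.h"   // implements the SSE intrinsics on top of NEON
#else
#   include <smmintrin.h>  // SSE4.1
#endif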

With that out of the way, here’s the chart on Apple M1 Max with Clang 14:

Here lzsse8 and lizard1x do get ahead of LZ4 in terms of decompression performance. lizard1x is about 40% faster than LZ4 at decompression at the same compression ratio. LZSSE is “a bit” faster (but compression performance is still a lot slower than LZ4).

LZSSE and Lizard, with data filtering and chunking

If there’s anything we’ve learned so far in this whole series, it is that “filtering” the data before compression can increase the compression ratio a lot (which in turn can speed up both compression and decompression due to data being easier or smaller). So let’s do that!

Windows case, all compressors with “split bytes, delta” filter from part 7, and each 1MB block is compressed independently (see part 8):

Well, neither LZSSE nor Lizard are very good here – LZ4 with filtering is faster than either of them, with a slightly better compression ratio too. If you’d want higher compression ratio, you’d reach for filtered Zstd.

On a Mac things are a bit more interesting for lzsse8 case; it can get ahead of filtered LZ4 decompression performance at expense of some compression ratio loss:

I have also tested on Windows (same Ryzen 5950X) but using Clang 15 compiler. Neither LZSSE nor Lizard are on the Pareto frontier here:

Conclusions

On my data set, neither LZSSE nor Lizard is particularly competitive against (filtered or unfiltered) LZ4 or Zstd. They might have been several years ago when they were developed, but since then both LZ4 and Zstd got several speedup optimizations.

Lizard levels 10-19, without any data filtering, do get ahead of LZ4 in decompression performance, but only on Apple M1.

LZSSE is “basically LZ4” in terms of decompression performance, but the compressor is much slower (fair, the project says as much in the readme). Curiously enough, where LZSSE gets ahead of LZ4 is on an Apple M1, a platform it is not even supposed to work on outside the box :)

Maybe next time I’ll finally look at lossy floating point compression. Who knows!


Float Compression 8: Blosc

Introduction and index of this series is here.

Several people have asked whether I have tried Blosc library for compressing my test data set. I was aware of it, but have not tried it! So this post is fixing that.

In the graphics/vfx world, OpenVDB uses Blosc as one option for volumetric data compression. I had no idea until about two weeks ago!

What is Blosc?

Blosc is many things, but if we ignore all the parts that are not relevant for us (Python APIs etc.), the summary is fairly simple:

  • It is a data compression library geared towards structured (i.e. arrays of same-sized items) binary data. There’s a fairly simple C API to access it (a minimal usage sketch follows right after this list).
  • It splits the data into separate chunks, and compresses/decompresses them separately. This optionally allows multi-threading by putting each chunk onto a separate job/thread.
  • It has a built-in compression codec BloscLZ, based on FastLZ. Out of the box it also builds with support for Zlib, LZ4, and Zstd.
  • It has several built-in lossless data filtering options: “shuffle” (exactly the same as “reorder bytes” from part 3), “bitshuffle” and “delta”.
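
To give a feel for that C API, here is a minimal usage sketch using the classic blosc.h interface (buffer sizing and error handling mostly omitted; real code should check the return values):

#include <stddef.h>
#include <blosc.h>

// compress a float array with Blosc defaults (BloscLZ codec + shuffle filter),
// then decompress it back; dst must have room for srcBytes + BLOSC_MAX_OVERHEAD
void BloscRoundTrip(const float* src, size_t count, void* dst, float* out)
{
    size_t srcBytes = count * sizeof(float);
    blosc_init();
    blosc_set_nthreads(1);            // single-threaded, as in the tests below
    blosc_set_compressor("blosclz");  // could also be "lz4", "zstd", ...
    int csize = blosc_compress(5, BLOSC_SHUFFLE, sizeof(float), srcBytes, src, dst, srcBytes + BLOSC_MAX_OVERHEAD);
    int dsize = blosc_decompress(dst, out, srcBytes);
    (void)csize; (void)dsize;
    blosc_destroy();
}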

All this sounds pretty much exactly like what I’ve been playing with in this series! So, how does Blosc with all default settings (their own BloscLZ compression, shuffle filter) compare? The below is on Windows, VS2022 (thin solid lines are zstd and lz4 with byte reorder + delta filter optimizations from previous post; dashed lines are zstd and lz4 without any filtering; thick line is blosc defaults):

So, Blosc defaults are:

  • Better than just throwing zstd at the data set: slightly higher compression ratio, faster compression, way faster decompression.
  • Mostly better than just using lz4: much better compression ratio, faster compression, decompression slightly slower but still fast.
  • However, when compared to my filters from previous post, Blosc default settings do not really “win” – compression ratio is quite a bit lower (but, compression and decompression speed is very good).

Note that here I’m testing Blosc without using multi-threading, i.e. nthreads is set to 1 which is the default.

I’m not quite sure how they arrive at the “Blosc is faster than a memcpy()” claim on their website though. Maybe if all the data is literally zeroes, you could get there? Otherwise I don’t quite see how any LZ-like codec could be faster than just a memory copy, on any realistic data.

Blosc but without the Shuffle filter

Ok, how about Blosc but using LZ4 or Zstd compression, instead of the default BloscLZ? And for now, let’s also turn off the default “shuffle” (“reorder bytes”) filter:

  • Without “shuffle” filter, Blosc-Zstd basically is just Zstd, with a very small overhead and a tiny loss in compression ratio. Same for Blosc-LZ4; it is “just” LZ4, so to speak.
  • BloscLZ compressor without the shuffle filter is behind vanilla LZ4 both in terms of ratio and performance.

Blosc with Shuffle, and Zstd/LZ4

What if we turn the Shuffle filter back on, but also try LZ4 and Zstd?

For both zstd and lz4 cases, the compression ratio is below what “my” filter achieves. But the decompression performance is interesting: Blosc-Zstd is slightly ahead of “zstd + my filter”, and Blosc-LZ4 is quite a bit ahead of “lz4 + my filter”. So that’s interesting! So far, Blosc-LZ4 with Shuffle is on the Pareto frontier if you need that particular balance between ratio and decompression performance.

Blosc with Shuffle and Delta filters

Looks like Blosc default Shuffle filter is exactly the same as my “reorder bytes” filter. But in terms of best compression ratio, “reorder bytes + delta” was the best option so far. Oh hey, Blosc (since version 2.0) also has a filter named “Delta”! Let’s try that one out:

Huh. The Shuffle + Delta combination is, like, not great at all? The compression ratio is below 2.0 everywhere; i.e. worse than just zstd or lz4 without any data filtering? 🤔

Oh wait, looks like Blosc’s “Delta” filter is not a “delta” at all. Sometime in 2017 they changed it to be an XOR filter instead (commit). The commit message says “for better numerical stability”; no idea what that means, since this is operating all on integers.

Ah well, looks like at least on this data set, the Delta filter does not do anything good, so we’ll forget about it. Update: look at “bytedelta” filter below, new in Blosc 2.8!

Input data chunking

The question remains, why and how is Blosc-LZ4 with Shuffle faster at decompression than “my” filter with LZ4? One reason could be that my filter is not entirely optimized (very possible). Another might be that Blosc is doing something differently…

And that difference is: by default, Blosc splits input data into separate chunks. The chunk sizes seem to be 256 kilobytes by default. Then each chunk is filtered and compressed completely independently from the other chunks. Of course, the smaller the chunk size, the lower is the compression ratio that you get, since the LZ compression codec can’t “see” data repetitions outside the chunk boundaries.

What if I added very similar “data chunking” to “my” tests, i.e. just zstd/lz4, my filters, and all that split into independent chunks? Here’s without chunks, plus graphs for 64KB, 256KB, 1MB, 4MB chunk sizes:

It is a bit crowded, but you can see how splitting data into 4MB chunks practically does not lose any compression ratio, yet manages to make LZ4 decoding quite a bit faster. With much smaller chunk sizes of 64KB, the compression ratio loss seems to be “maybe too large” (still way better than without any data filtering though). It feels like chunk size of 1MB is quite good: very small compression ratio loss, good decompression speedup:

So this is another trick that is not directly related to Blosc: splitting up your data into separate 256KB-1MB chunks might be worth doing. Not only this would enable operating on chunks in parallel (if you wish to do that), but also it speeds things up, especially decompression. The reason being that the working set memory needed to do decompression now neatly fits into CPU caches.
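
For illustration, the chunked compression path is roughly this (FilterData is a placeholder name for the “split bytes + delta” filter; the destination buffer needs ZSTD_compressBound(kChunkSize) of spare room per chunk, and error handling is omitted):

#include <zstd.h>

// split input into independent chunks; filter + compress each one on its own,
// which is also what would allow handing chunks out to separate threads
void CompressChunked(const uint8_t* src, size_t dataSize, uint8_t* tmp, uint8_t* dst, size_t& dstPos, int level)
{
    const size_t kChunkSize = 1024 * 1024; // 1MB chunks
    size_t srcPos = 0;
    dstPos = 0;
    while (srcPos < dataSize)
    {
        size_t chunk = dataSize - srcPos < kChunkSize ? dataSize - srcPos : kChunkSize;
        FilterData(src + srcPos, tmp, chunk); // placeholder for "split bytes + delta"
        dstPos += ZSTD_compress(dst + dstPos, ZSTD_compressBound(chunk), tmp, chunk, level);
        srcPos += chunk;
    }
}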

Update: “bytedelta” filter in Blosc 2.8

Something quite amazing happened: seemingly after reading this same blog post and the previous one, Blosc people added a new filter called “bytedelta”, that, well, does exactly what you’d think it would – it is delta encoding. Within Blosc, you would put a “shuffle” (“split bytes” in my posts) filter, followed by a “bytedelta” filter.

This just shipped in Blosc 2.8.0, and they have an in-depth blog post testing it on ERA5 data sets, a short video, and a presentation at LEAPS Innov WP7 (slides). That was fast!

So how does it compare?

  • Thick solid lines are Blosc shuffle+bytedelta, for the three cases: Blosc built-in BLZ compression, as well as Blosc using LZ4 and Zstd compression.
  • For comparison, Blosc with just shuffle filter are dashed lines of the same color.
  • There’s also “my own” filter from previous post using LZ4 and Zstd and splitting into 1MB chunks on the graph for comparison.

So, the Blosc bytedelta filter helps compression ratio a bit in the BLZ and LZ4 cases, but helps compression ratio a lot when using Zstd. It is a slight loss of compression ratio compared to the best result we have without Blosc (Blosc splits data into ~256KB chunks by default), and a bit slower too, probably because the “shuffle” and “bytedelta” are separate filters there instead of a combined filter that does both in one go.

But it’s looking really good! This is a great outcome. If you are using Blosc, check whether “shuffle” + “bytedelta” combination works well on your data. It might! Their own blog post has way more extensive evaluation.

Aside: “reinventing the wheel”

Several comments I saw about this whole blog post series were along the lines of “what’s the point; all of these things were already invented”. And that is true! I am going down this rabbit hole mostly for my own learning purposes, and just writing them down because… “we don’t ask why, we ask why not”.

I have already learned a bit more about compression, data filtering and SIMD, so yay, success. But also:

  • The new “bytedelta” filter in Blosc 2.8 is very directly inspired by this blog post series. Again, this is not a new invention; delta encoding has been around for many decades. But a random post on the interwebs can make someone else go “wait, turns out we don’t have this trick, let’s add it”. Nice!
  • After writing part 7 of this series, I looked at the OpenEXR code again and saw that while they do have Intel SSE optimizations for reading zip-compressed .exr files, they do not have ARM NEON paths. So I added those, and that makes loading .exr files that use zip compression almost 30% faster on a Mac M1 laptop. That shipped in OpenEXR 3.1.6, yay!

So I don’t quite agree with some random internet commenters saying “these posts are garbage, all of this has been invented before”. The posts might be garbage, true, but 1) I’ve learned something and 2) improvements based on these posts have landed into two open source software libraries by now.

Don’t pay too much attention to internet commenters.

Conclusions

Blosc is pretty good!

I do wonder why they only have a “shuffle” filter built-in though (there’s also “delta” but it’s really some sort of “xor”). At least on my data, “shuffle + actual delta” would result in a much better compression ratio. Without that filter, Blosc loses to the filter I have at the end of the previous post in terms of compression ratio, while being roughly the same in performance (after I apply 1MB data chunking in my code). Update: since 2.8 Blosc has a “bytedelta” filter; if you put that right after the “shuffle” filter then it gets results really close to what I’ve got in the previous post.


Float Compression 7: More Filtering Optimization

Introduction and index of this series is here.

In the previous post I explored how to make data filters a bit faster, using some trivial merging of filters, and a largely misguided attempt at using SIMD.

People smarter than me pointed out that getting good SIMD performance requires a different approach. Which is kinda obvious, and another thing that is obvious is that I have very little SIMD programming experience, and thus very little intuition of what’s a good approach.

But to get there, first I’d need to fix a little poopoo I made.

A tiny accidental thing can prevent future things

Recall how in part 3 the most promising data filter seemed to be “reorder bytes + delta”? What it does, is first reorder data items so that all 1st bytes are together, then all 2nd bytes, etc. If we had three items of four bytes each, it would do this:
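
(using A, B, C for the three items, and 0..3 for their byte indices)

A0 A1 A2 A3  B0 B1 B2 B3  C0 C1 C2 C3   →   A0 B0 C0  A1 B1 C1  A2 B2 C2  A3 B3 C3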

And then delta-encode the whole result:
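
A0, B0-A0, C0-B0, A1-C0, B1-A1, C1-B1, A2-C1, B2-A2, C2-B2, A3-C2, B3-A3, C3-B3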

i.e. first byte stays the same, and each following byte is difference from the previous one.

Turns out, this choice prevents some future optimizations. How? Because while the data reordering conceptually produces N separate byte streams, delta-encoding the whole result conceptually merges these streams together; the values in them become dependent on all the previous values.

What if instead, we delta-encoded each stream separately?

In terms of code, this is a tiny change:

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    // uint8_t prev = 0; <-- WAS HERE
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t prev = 0; // <-- NOW HERE
        const uint8_t* srcPtr = src + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;
            srcPtr += channels;
            dst += 1;
        }
    }
}

We will see how that is useful later. Meanwhile, this choice of “reorder, then delta the whole result” is what OpenEXR image format also does in the ZIP compression code :)

What the above change allows, is in the filter decoder to fetch from any number of byte streams at once and process their data (apply reverse delta, etc.). Something we could not do before, since values within each stream depended on the values of previous streams.

So, more (questionable) optimizations

Overall I did a dozen experiments, and they are all too boring to write about here, so here are the main ones.

Note: in the previous post I made a mistake in timing calculations, where the time numbers were more like “average time it takes to filter one file”, not “total time it takes to filter whole data set”. Now the numbers are more proper, but don’t directly compare them with the previous post!

In the previous post we went from “A” to “D” variants, resulting in some speedups depending on the platform (un-filter for decompression: WinVS 106→75ms, WinClang 116→75ms, Mac 94→32ms):

Now that we can decode/unfilter all the byte streams independently, let’s try doing just that (no SIMD at all):

const size_t kMaxChannels = 64;
// Scalar, fetch a byte from N streams, write sequential
void UnFilter_G(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev[kMaxChannels] = {};
    uint8_t* dstPtr = dst;
    for (size_t ip = 0; ip < dataElems; ++ip)
    {
        const uint8_t* srcPtr = src + ip;
        for (int ich = 0; ich < channels; ++ich)
        {
            uint8_t v = *srcPtr + prev[ich];
            prev[ich] = v;
            *dstPtr = v;
            srcPtr += dataElems;
            dstPtr += 1;
        }
    }
}

I did hardcode “maximum bytes per data element” to just 64 here. In our data set it’s always either 16 or 12, but let’s make the limit somewhat more realistic. It should be possible to not have a limit with some additional code, but “hey 64 bytes per struct should be enough for anyone”, or so the ancient saying goes.

So this is “G” variant: decompression WinVS 139ms, WinClang 125ms, Mac 104ms (“D” was: 75, 75, 32). This is not great at all! This is way slower! ☹️

But! See how this fetches a byte from all the “streams” of a data item, and has all the bytes of the previous data item? Doing the “un-delta” step could be done way more efficiently using SIMD now, by processing like 16 bytes at once (128 bit SSE/NEON registers are exactly 16 bytes in size).

A tiny SIMD wrapper, and Transpose

All the SSE and NEON code I scribbled in the previous post felt like I’m just repeating the same things for NEON after doing SSE part, just with different intrinsic function names. So, perhaps prematurely, I made a little helper to avoid having to do that: a data type Bytes16 that, well, holds 16 bytes, and then functions like SimdAdd, SimdStore, SimdGetLane and whatnot. It’s under 100 lines of code, and does just the tiny amount of operations I need: simd.h.
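
Conceptually the wrapper is just a pile of tiny one-liners, along these lines (only a few of the entries shown; the actual header may differ in details):

#if CPU_ARCH_X64
typedef __m128i Bytes16;
inline Bytes16 SimdAdd(Bytes16 a, Bytes16 b) { return _mm_add_epi8(a, b); }
inline Bytes16 SimdLoad(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
inline void SimdStore(void* ptr, Bytes16 x) { _mm_storeu_si128((__m128i*)ptr, x); }
#elif CPU_ARCH_ARM64
typedef uint8x16_t Bytes16;
inline Bytes16 SimdAdd(Bytes16 a, Bytes16 b) { return vaddq_u8(a, b); }
inline Bytes16 SimdLoad(const void* ptr) { return vld1q_u8((const uint8_t*)ptr); }
inline void SimdStore(void* ptr, Bytes16 x) { vst1q_u8((uint8_t*)ptr, x); }
#endif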

I will also very soon need a function that transposes a matrix, i.e. flips rows & columns of a rectangular array. As usual, turns out Fabian has written about this a decade ago (Part 1, Part 2). You can cook up a sweet nice 16x16 byte matrix transpose like this:

static void EvenOddInterleave16(const Bytes16* a, Bytes16* b)
{
    int bidx = 0;
    for (int i = 0; i < 8; ++i)
    {
        b[bidx] = SimdInterleaveL(a[i], a[i+8]); bidx++; // _mm_unpacklo_epi8 / vzip1q_u8
        b[bidx] = SimdInterleaveR(a[i], a[i+8]); bidx++; // _mm_unpackhi_epi8 / vzip2q_u8
    }
}
static void Transpose16x16(const Bytes16* a, Bytes16* b)
{
    Bytes16 tmp1[16], tmp2[16];
    EvenOddInterleave16(a, tmp1);
    EvenOddInterleave16(tmp1, tmp2);
    EvenOddInterleave16(tmp2, tmp1);
    EvenOddInterleave16(tmp1, b);
}

and then have a more generic Transpose function for any NxM sized matrix, with the faster SIMD code path for cases like “16 rows, multiple-of-16 columns”. Why? We’ll need it soon :)
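
The generic fallback of that is just the obvious double loop, something like this (only a sketch; the real function would dispatch to the 16x16 SIMD transpose above when the sizes allow, and the argument order here is just chosen to match how it gets called in the code below):

static void Transpose(const uint8_t* src, uint8_t* dst, int cols, int rows)
{
    // src is rows x cols bytes, dst becomes cols x rows
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
}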

Continuing with optimizations

The “G” variant fetched one byte from each stream/channel, did <something else>, and then fetched the following byte from each stream, and so on. Now, fetching bytes one by one is probably wasteful.

What we could do instead, for the un-filter: from each stream, fetch 16 (SIMD register size) bytes, and decode the deltas using SIMD prefix sum (very much like in “D” variant). Now we have 16 data items on stack, but they are still in the “split bytes” memory layout. But, doing a matrix transpose gets them into exactly the layout we need, and we can just blast that into destination buffer with a single memcpy.

// Fetch 16b from N streams, prefix sum SIMD undelta, transpose, sequential write 16xN chunk.
void UnFilter_H(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t* dstPtr = dst;
    int64_t ip = 0;

    // simd loop: fetch 16 bytes from each stream
    Bytes16 curr[kMaxChannels] = {};
    const Bytes16 hibyte = SimdSet1(15);
    for (; ip < int64_t(dataElems) - 15; ip += 16)
    {
        // fetch 16 bytes from each channel, prefix-sum un-delta
        const uint8_t* srcPtr = src + ip;
        for (int ich = 0; ich < channels; ++ich)
        {
            Bytes16 v = SimdLoad(srcPtr);
            // un-delta via prefix sum
            curr[ich] = SimdAdd(SimdPrefixSum(v), SimdShuffle(curr[ich], hibyte));
            srcPtr += dataElems;
        }

        // now transpose 16xChannels matrix
        uint8_t currT[kMaxChannels * 16];
        Transpose((const uint8_t*)curr, currT, 16, channels);

        // and store into destination
        memcpy(dstPtr, currT, 16 * channels);
        dstPtr += 16 * channels;
    }

    // any remaining leftover
    if (ip < int64_t(dataElems))
    {
        uint8_t curr1[kMaxChannels];
        for (int ich = 0; ich < channels; ++ich)
            curr1[ich] = SimdGetLane<15>(curr[ich]);
        for (; ip < int64_t(dataElems); ip++)
        {
            const uint8_t* srcPtr = src + ip;
            for (int ich = 0; ich < channels; ++ich)
            {
                uint8_t v = *srcPtr + curr1[ich];
                curr1[ich] = v;
                *dstPtr = v;
                srcPtr += dataElems;
                dstPtr += 1;
            }
        }
    }
}

The code is getting more complex! But conceptually it’s not – half of the function is the SIMD loop that reads 16 bytes from each channel; and the remaining half of the function is non-SIMD code to handle any leftover in case data size was not multiple of 16.

For the compression filter it is similar idea, just the other way around: read 16 N-sized items from the source data, transpose which gets them into N channels with 16 bytes each. Now do the delta encoding with SIMD on that. Store each of these 16 bytes into N separate locations. Again half of the code is just for handling non-multiple-of-16 data size leftovers.

// Fetch 16 N-sized items, transpose, SIMD delta, write N separate 16-sized items
void Filter_H(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t* dstPtr = dst;
    int64_t ip = 0;
    
    const uint8_t* srcPtr = src;
    // simd loop
    Bytes16 prev[kMaxChannels] = {};
    for (; ip < int64_t(dataElems) - 15; ip += 16)
    {
        // fetch 16 data items
        uint8_t curr[kMaxChannels * 16];
        memcpy(curr, srcPtr, channels * 16);
        srcPtr += channels * 16;
        // transpose so we have 16 bytes for each channel
        Bytes16 currT[kMaxChannels];
        Transpose(curr, (uint8_t*)currT, channels, 16);
        // delta within each channel, store
        for (int ich = 0; ich < channels; ++ich)
        {
            Bytes16 v = currT[ich];
            Bytes16 delta = SimdSub(v, SimdConcat<15>(v, prev[ich]));
            SimdStore(dstPtr + dataElems * ich, delta);
            prev[ich] = v;
        }
        dstPtr += 16;
    }
    // any remaining leftover
    if (ip < int64_t(dataElems))
    {
        uint8_t prev1[kMaxChannels];
        for (int ich = 0; ich < channels; ++ich)
            prev1[ich] = SimdGetLane<15>(prev[ich]);
        for (; ip < int64_t(dataElems); ip++)
        {
            for (int ich = 0; ich < channels; ++ich)
            {
                uint8_t v = *srcPtr;
                srcPtr++;
                dstPtr[dataElems * ich] = v - prev1[ich];
                prev1[ich] = v;
            }
            dstPtr++;
        }
    }
}

Now that is a lot of code indeed, relatively speaking. Was it worth it? This is variant “H”: decompression unfilter WinVS 21ms, WinClang 20ms, Mac 28ms (previous best “D” was 75, 75, 32). Compression filter WinVS 24ms, WinClang 23ms, Mac 31ms (“D” was 63, 54, 32).

Hey look, not bad at all!

Is it cheating if you optimize for your data?

Next up is a step I did only for the decoding unfilter. It’s not all that interesting, but raises a good question: is it “cheating”, if you optimize/specialize your code for your data?

The answer is, of course, “it depends”. In my particular case, I’m testing on four data files, and three of them use data items that are 16 bytes in size (the 4th one uses 12 byte items). The UnFilter_H function above is written for generic “any, as long as <64 bytes item size” data. What if I used that exact code for all non-16-sized data, but did “something better” for 16-sized data?

In particular, the transpose step becomes exact 16x16 matrix transpose, for which we have a sweet nice function already. And the delta decoding could be done after the transpose, using way more efficient “just add SIMD registers” operation instead of trying to cram that into SIMD prefix sum. The interesting part of the SIMD inner loop becomes this then:

// fetch 16 bytes from each channel
Bytes16 curr[16];
const uint8_t* srcPtr = src + ip;
for (int ich = 0; ich < 16; ++ich)
{
    Bytes16 v = SimdLoad(srcPtr);
    curr[ich] = v;
    srcPtr += dataElems;
}

// transpose 16xChannels matrix
Bytes16 currT[16];
Transpose((const uint8_t*)curr, (uint8_t*)currT, 16, channels);

// un-delta and store
for (int ib = 0; ib < 16; ++ib)
{
    prev = SimdAdd(prev, currT[ib]);
    SimdStore(dstPtr, prev);
    dstPtr += 16;
}

Does that help? “I” case: decompression unfilter WinVS 18ms, WinClang 14ms, Mac 24ms (“H” was 21, 20, 28). Yeah, it does help.

Groups of four

Fabian pointed out that fetching from “lots” (e.g. 16) separate streams at once can run into CPU cache thrashing. If the streams are spaced apart at particular powers of two, and you are fetching from more than 4 or 8 (typical CPU cache associativity) streams at once, it’s quite likely that many of your memory fetches will be landing into the same physical CPU cache lines.

One possible way to avoid this is also “kinda cheating” (or well, “using knowledge of our data”) – we know we are operating on floating point (4-byte) things, i.e. our data structure sizes are always a multiple of four. We could be fetching not all N (N = data item size) streams at once, but rather do that in groups of 4 streams. Why four? Because we know the number of streams is multiple of four, and most CPU caches are at least 4-way associative.

So conceptually, decompression unfilter would be like (straight pseudo-code paste from Fabian’s toot):

for (chunks of N elements) {
  for groups of 4 streams {
    read and interleave N values from 4 streams each
    store to stack
  }
  for elements {
    read and interleave groups of 4 streams from stack
    sum into running total
    store to dest
  }
}

This hopefully avoids some CPU cache thrashing, and also fetches more than 16 bytes from each stream in one go. Ideally we’d want to fetch as much as possible, while making sure that everything we’re working with stays in the CPU L1 cache.

I’ve tried various sizes, and in my testing a size of 384 bytes per channel worked best. The code is long though, and kind of a mess; the “we effectively achieve a matrix transpose, but in two separate steps” part is not immediately clear at all (or at least not clear at my level of experience :)). In the code I have a separate path for 16-sized data items, where the 2nd interleave part is much simpler.

Anyhoo, here it is:

void UnFilter_K(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    assert((channels % 4) == 0); // our data is floats so channels will always be multiple of 4

    const int kChunkBytes = 384;
    const int kChunkSimdSize = kChunkBytes / 16;
    static_assert((kChunkBytes % 16) == 0, "chunk bytes needs to be multiple of simd width");
    uint8_t* dstPtr = dst;
    int64_t ip = 0;
    alignas(16) uint8_t prev[kMaxChannels] = {};
    Bytes16 prev16 = SimdZero();
    for (; ip < int64_t(dataElems) - (kChunkBytes - 1); ip += kChunkBytes)
    {
        // read chunk of bytes from each channel
        Bytes16 chdata[kMaxChannels][kChunkSimdSize];
        const uint8_t* srcPtr = src + ip;
        // fetch data for groups of 4 channels, interleave
        // so that first in chdata is (a0b0c0d0 a1b1c1d1 a2b2c2d2 a3b3c3d3) etc.
        for (int ich = 0; ich < channels; ich += 4)
        {
            for (int item = 0; item < kChunkSimdSize; ++item)
            {
                Bytes16 d0 = SimdLoad(((const Bytes16*)(srcPtr)) + item);
                Bytes16 d1 = SimdLoad(((const Bytes16*)(srcPtr + dataElems)) + item);
                Bytes16 d2 = SimdLoad(((const Bytes16*)(srcPtr + dataElems * 2)) + item);
                Bytes16 d3 = SimdLoad(((const Bytes16*)(srcPtr + dataElems * 3)) + item);
                // interleaves like from https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/
                Bytes16 e0 = SimdInterleaveL(d0, d2); Bytes16 e1 = SimdInterleaveR(d0, d2);
                Bytes16 e2 = SimdInterleaveL(d1, d3); Bytes16 e3 = SimdInterleaveR(d1, d3);
                Bytes16 f0 = SimdInterleaveL(e0, e2); Bytes16 f1 = SimdInterleaveR(e0, e2);
                Bytes16 f2 = SimdInterleaveL(e1, e3); Bytes16 f3 = SimdInterleaveR(e1, e3);
                chdata[ich + 0][item] = f0;
                chdata[ich + 1][item] = f1;
                chdata[ich + 2][item] = f2;
                chdata[ich + 3][item] = f3;
            }
            srcPtr += 4 * dataElems;
        }

        if (channels == 16)
        {
            // channels == 16 case is much simpler
            // read groups of data from stack, interleave, accumulate sum, store
            for (int item = 0; item < kChunkSimdSize; ++item)
            {
                for (int chgrp = 0; chgrp < 4; ++chgrp)
                {
                    Bytes16 a0 = chdata[chgrp][item];
                    Bytes16 a1 = chdata[chgrp + 4][item];
                    Bytes16 a2 = chdata[chgrp + 8][item];
                    Bytes16 a3 = chdata[chgrp + 12][item];
                    // now we want a 4x4 as-uint matrix transpose
                    Bytes16 b0 = SimdInterleave4L(a0, a2); Bytes16 b1 = SimdInterleave4R(a0, a2);
                    Bytes16 b2 = SimdInterleave4L(a1, a3); Bytes16 b3 = SimdInterleave4R(a1, a3);
                    Bytes16 c0 = SimdInterleave4L(b0, b2); Bytes16 c1 = SimdInterleave4R(b0, b2);
                    Bytes16 c2 = SimdInterleave4L(b1, b3); Bytes16 c3 = SimdInterleave4R(b1, b3);
                    // c0..c3 is what we should do accumulate sum on, and store
                    prev16 = SimdAdd(prev16, c0); SimdStore(dstPtr, prev16); dstPtr += 16;
                    prev16 = SimdAdd(prev16, c1); SimdStore(dstPtr, prev16); dstPtr += 16;
                    prev16 = SimdAdd(prev16, c2); SimdStore(dstPtr, prev16); dstPtr += 16;
                    prev16 = SimdAdd(prev16, c3); SimdStore(dstPtr, prev16); dstPtr += 16;
                }
            }
        }
        else
        {
            // general case: interleave data
            uint8_t cur[kMaxChannels * kChunkBytes];
            for (int ib = 0; ib < kChunkBytes; ++ib)
            {
                uint8_t* curPtr = cur + ib * kMaxChannels;
                for (int ich = 0; ich < channels; ich += 4)
                {
                    *(uint32_t*)curPtr = *(const uint32_t*)(((const uint8_t*)chdata) + ich * kChunkBytes + ib * 4);
                    curPtr += 4;
                }
            }
            // accumulate sum and store
            // the row address we want from "cur" is interleaved in a funky way due to 4-channels data fetch above.
            for (int item = 0; item < kChunkSimdSize; ++item)
            {
                for (int chgrp = 0; chgrp < 4; ++chgrp)
                {
                    uint8_t* curPtrStart = cur + (chgrp * kChunkSimdSize + item) * 4 * kMaxChannels;
                    for (int ib = 0; ib < 4; ++ib)
                    {
                        uint8_t* curPtr = curPtrStart;
                        // accumulate sum w/ SIMD
                        for (int ich = 0; ich < channels; ich += 16)
                        {
                            Bytes16 v = SimdAdd(SimdLoadA(&prev[ich]), SimdLoad(curPtr));
                            SimdStoreA(&prev[ich], v);
                            SimdStore(curPtr, v);
                            curPtr += 16;
                        }
                        // store
                        memcpy(dstPtr, curPtrStart, channels);
                        dstPtr += channels;
                        curPtrStart += kMaxChannels;
                    }
                }
            }
        }
    }

    // any remainder
    if (channels == 16)
    {
        for (; ip < int64_t(dataElems); ip++)
        {
            // read from each channel
            alignas(16) uint8_t chdata[16];
            const uint8_t* srcPtr = src + ip;
            for (int ich = 0; ich < 16; ++ich)
            {
                chdata[ich] = *srcPtr;
                srcPtr += dataElems;
            }
            // accumulate sum and write into destination
            prev16 = SimdAdd(prev16, SimdLoadA(chdata));
            SimdStore(dstPtr, prev16);
            dstPtr += 16;
        }
    }
    else
    {
        for (; ip < int64_t(dataElems); ip++)
        {
            const uint8_t* srcPtr = src + ip;
            for (int ich = 0; ich < channels; ++ich)
            {
                uint8_t v = *srcPtr + prev[ich];
                prev[ich] = v;
                *dstPtr = v;
                srcPtr += dataElems;
                dstPtr += 1;
            }
        }
    }
}

I told you it is getting long! Did it help? “K” case: decompression unfilter WinVS 15ms, WinClang 15ms, Mac 16ms (“I” was 18, 14, 24).

Ok, it does significantly help Mac (Apple M1/NEON) case; helps a bit on Windows PC too.

Conclusions

All in all, for the decompression unfilter we went from super simple code in part 3 (“A”) to a naïve attempt at SIMD (“D”) to this fairly involved “K” variant, and their respective timings are:

  • Ryzen 5950X, Windows, VS2022: 106→75→15ms. Clang 15: 116→75→15ms.
  • Apple M1, Clang 14: 94→32→16ms.

The performance is 5-8 times faster, which is nice. Note: it’s entirely possible that I have misunderstood Fabian’s advice, and/or did it wrong, or just left some other speedup opportunities lying on the floor.

The filtering part is faster now, great. How does this affect the overall process, when we put it next to the actual data compression & decompression? After all, this is what we are really interested in.

Here they are (click for interactive chart; solid thick line: this post, solid thin line: “D” from previous post). Windows MSVC, Windows Clang, Mac Clang:

Hey look! If you are using zstd, both compression and decompression is faster and better ratio with the data filtering applied. For LZ4 the decompression does not quite reach the 5GB/s that it can do without data filtering, but it does go up to 2.5GB/s which is way faster than 0.7GB/s that it was going on in the “A” approach. And the compression ratio is way better than just LZ4 can achieve.

The code is quite a bit more complex though. Is all that code complexity worth it? Depends.

  • The code is much harder to follow and understand.
  • However, the functionality is trivial to test, and to ensure it keeps on working via tests.
  • This is not one of those “oh but we need to keep it simple if it needs to get changed later” cases. You either do “combine streams, decode delta”, or you do something else. Once or if you settled onto that data format, the code to achieve that un-filtering step needs to do exactly just that. If you need to do something else, throw away this code and write code to do that other thing!
  • If “data un-filtering” (transpose, decode delta) is critical part of your library or product, or just a performance critical area, then it might be very well worth it.

Nothing in the above is “new” or noteworthy, really. The task of “how to filter structured data” or “how to transpose data fast” has been studied and solved quite extensively. But, it was a nice learning experience for me!

What’s next

I keep on saying that I’d look into lossy compression options, but now many (read: more than one!) people have asked “what about Blosc?” and while I was aware of it for a number of years, I have never actually tested it. So I’m gonna do that next!


Float Compression 6: Filtering Optimization

Introduction and index of this series is here.

Several posts ago we learned that filtering the data can make it more compressible. Out of several simple filters that I tried, “reorder data items byte-wise and delta encode that” was the most effective at improving compression ratio. That’s all nice and good, but the filtering has a cost to it. One is the extra memory needed to hold the filtered data, and another is the time it takes to do the filtering. While for compression the relative speed hit is fairly small (i.e. compression itself takes much longer), for decompression the cost is not trivial. A fast decompressor like LZ4 normally goes at ~5GB/s, but with “reorder bytes + delta” filter it only achieves 0.8GB/s:

Can we do something simple (i.e. something even me could do) to speed that up a bit? Let’s find out.

Decompression filter optimization

The code we are starting with is this: first decode the delta-encoded data, then un-split the data items, i.e. assemble the items from the “first byte of each item, then 2nd byte of each item, then 3rd byte of each item” layout. Like this:

// channels: how many items per data element
// dataElems: how many data elements
template<typename T>
static void UnSplit(const T* src, T* dst, int channels, size_t dataElems)
{
    for (int ich = 0; ich < channels; ++ich)
    {
        T* dstPtr = dst + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            *dstPtr = *src;
            src += 1;
            dstPtr += channels;
        }
    }
}
template<typename T>
static void DecodeDeltaDif(T* data, size_t dataElems)
{
    T prev = 0;
    for (size_t i = 0; i < dataElems; ++i)
    {
        T v = *data;
        v = prev + v;
        *data = v;
        prev = v;
        ++data;
    }
}
void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    DecodeDeltaDif<uint8_t>(src, channels * dataElems);
    UnSplit<uint8_t>(src, dst, channels, dataElems);
}

Now, this process does not depend on the compression library or the settings used for it; it’s always the same for all of them. So instead of complicated per-compressor scatter plots I’ll just give three time numbers: milliseconds that it takes to do the filtering process on our 94.5MB data set. Two numbers on Windows PC (Ryzen 5950X, Visual Studio 2022 16.4 and Clang 15.0), and one on a Mac laptop (Apple M1 Max, Apple Clang 14.0).

The code above, which I’m gonna label “A”: WinVS 27.9ms, WinClang 29.7ms, MacClang 23.9ms.

Attempt B: combine unsplit+delta into one

The code above does two passes over the data: first the delta-decode, then the un-split. This is good for “code reuse”, as in arbitrarily complex filters can be produced by combining several simple filters in sequence. But if we know we have already locked onto “split bytes + delta”, then we can try combining all that into just one function:

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t* dstPtr = dst + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            uint8_t v = *src + prev;
            prev = v;
            *dstPtr = v;
            src += 1;
            dstPtr += channels;
        }
    }
}

“B”: WinVS 19.5ms, WinClang 18.9ms, MacClang 9.5ms (“A” was: 27.9, 29.7, 23.9). Not bad at all, and the code is actually smaller now.

Attempt C: I heard something about SIMD

CPUs these days have this “SIMD” thing, and for usual use cases we can pretty much assume that something like SSE4 is available on Intel architectures, and NEON on ARM architectures. Our code loops over data “quite a lot”, doing very simple operations with it. Would trying to sprinkle some manually written SIMD in there help?

One spanner in the works is that the loop in there is not a set of independent operations: each iteration updates the prev byte value. Delta-encoded data decoding is essentially a prefix sum operation, and some five minute googling (or rather, having been aware of all of Fabian’s tweets and toots, ever) finds this gist with a prefix_sum_u8 function for SSE. So with that, let’s try to rewrite the above code so that the inner loop can do 16 bytes at a time, for both SSE and NEON cases.

The code’s quite a bit longer:

#if defined(__x86_64__) || defined(_M_X64)
#   define CPU_ARCH_X64 1
#   include <emmintrin.h> // sse2
#   include <tmmintrin.h> // sse3
#   include <smmintrin.h> // sse4.1
#endif
#if defined(__aarch64__) || defined(_M_ARM64)
#   define CPU_ARCH_ARM64 1
#   include <arm_neon.h>
#endif

#if CPU_ARCH_X64
// https://gist.github.com/rygorous/4212be0cd009584e4184e641ca210528
static inline __m128i prefix_sum_u8(__m128i x)
{
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 8));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 16));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 32));
    x = _mm_add_epi8(x, _mm_shuffle_epi8(x, _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1,7,7,7,7,7,7,7,7)));
    return x;
}
#endif // #if CPU_ARCH_X64
#if CPU_ARCH_ARM64
// straight-up port to NEON of the above; no idea if this is efficient at all, yolo!
static inline uint8x16_t prefix_sum_u8(uint8x16_t x)
{
    // note: the shifts work on 64-bit lanes, so reinterpret casts are needed around them
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 8)));
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 16)));
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 32)));
    alignas(16) uint8_t tbl[16] = {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,7,7,7,7,7,7,7,7};
    x = vaddq_u8(x, vqtbl1q_u8(x, vld1q_u8(tbl)));
    return x;
}
#endif // #if CPU_ARCH_ARM64

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t* dstPtr = dst + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        __m128i hibyte = _mm_set1_epi8(15);
        alignas(16) uint8_t scatter[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // load 16 bytes of filtered data
            __m128i v = _mm_loadu_si128((const __m128i*)src);
            // un-delta via prefix sum
            prev16 = _mm_add_epi8(prefix_sum_u8(v), _mm_shuffle_epi8(prev16, hibyte));
            // scattered write into destination
            _mm_store_si128((__m128i*)scatter, prev16);
            for (int lane = 0; lane < 16; ++lane)
            {
                *dstPtr = scatter[lane];
                dstPtr += channels;
            }
            src += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        uint8x16_t hibyte = vdupq_n_u8(15);
        alignas(16) uint8_t scatter[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // load 16 bytes of filtered data
            uint8x16_t v = vld1q_u8(src);
            // un-delta via prefix sum
            prev16 = vaddq_u8(prefix_sum_u8(v), vqtbl1q_u8(prev16, hibyte));
            // scattered write into destination
            vst1q_u8(scatter, prev16);
            for (int lane = 0; lane < 16; ++lane)
            {
                *dstPtr = scatter[lane];
                dstPtr += channels;
            }
            src += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *src + prev;
            prev = v;
            *dstPtr = v;
            src += 1;
            dstPtr += channels;
        }
    }
}

Phew. “C”: WinVS 19.7ms, WinClang 18.7ms, MacClang 8.3ms (“B” was: 19.5, 18.9, 9.5). Meh! 😕 This makes the Apple/NEON case a bit faster, but in the PC/SSE case the timings are pretty much the same.

Lesson: just because you use some SIMD, does not necessarily make things faster. In this case, I suspect it’s the lack of independent work that’s available (each loop iteration depends on results of previous iteration; and “work” within loop is tiny), and the scattered memory writes are the problem. If I was less lazy I’d try to draw up a data flow graph or something.

Attempt D: undeterred, even more SIMD

The SIMD attempt was a lot of code for very little (if any) gain, but hey what if we try adding even more SIMD? Looking at the assembly of the compiled code in Compiler Explorer, I noticed that while Clang can keep code like this:

// scattered write into destination
_mm_store_si128((__m128i*)scatter, prev16);
for (int lane = 0; lane < 16; ++lane)
{
    *dstPtr = scatter[lane];
    dstPtr += channels;
}

completely within SIMD registers, MSVC can not. Clang compiles the above into a sequence of pextrb (SSE) and st1 (NEON) instructions, like:

; SSE
pextrb  byte ptr [rbx], xmm3, 0
pextrb  byte ptr [rbx + rax], xmm3, 1
add     rbx, rax
pextrb  byte ptr [rax + rbx], xmm3, 2
add     rbx, rax
pextrb  byte ptr [rax + rbx], xmm3, 3
add     rbx, rax
; ...
; NEON
st1     { v2.b }[0], [x13]
add     x13, x15, x8
st1     { v2.b }[1], [x12]
add     x12, x13, x8
st1     { v2.b }[2], [x15]
add     x15, x12, x8
; ...

But MSVC is emitting assembly very similar to how the code is written: first writes the SSE register into memory, and then stores each byte of that into final location:

movdqa  XMMWORD PTR scatter$1[rsp], xmm3
movdqa  xmm0, xmm3
psrldq  xmm0, 8
movd    ecx, xmm3
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+1]
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+2]
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+3]
mov     BYTE PTR [rax], cl
add     rax, rbx
; ...

So how about if we replace that loop with a sequence of _mm_extract_epi8 (SSE intrinsic function that maps to pextrb)?

// scattered write into destination
//_mm_store_si128((__m128i*)scatter, prev16);
//for (int lane = 0; lane < 16; ++lane)
//{
//    *dstPtr = scatter[lane];
//    dstPtr += channels;
//}
*dstPtr = _mm_extract_epi8(prev16, 0); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 1); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 2); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 3); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 4); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 5); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 6); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 7); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 8); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 9); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 10); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 11); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 12); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 13); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 14); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 15); dstPtr += channels;

And a very similar thing could be done for NEON, where each line would look like *dstPtr = vgetq_lane_u8(prev16, 0); dstPtr += channels; and so on. And now MSVC does keep everything within SIMD registers. There’s no change on Clang, in either the SSE or the NEON case.
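
Spelled out, the NEON flavor of that sequence is just (vgetq_lane_u8 needs a compile-time lane index, hence the copy-paste):

*dstPtr = vgetq_lane_u8(prev16, 0); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 1); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 2); dstPtr += channels;
// ... and so on, all the way to ...
*dstPtr = vgetq_lane_u8(prev16, 15); dstPtr += channels;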

“D”: WinVS 18.8ms, WinClang 18.8ms, MacClang 8.3ms (“C” was: 19.7, 18.7, 8.3). Ok, a small improvement for MSVC, unchanged (as expected) for the two Clang cases.

Attempt E: try to flip it around

Ok so that whole SIMD thing was a lot of code for very little gain. How about something different? Our UnSplit8Delta reads the source data sequentially, but does “scattered” writes into destination array. How about if we change the order, so that the destination writes are done sequentially, but the source data is “gathered” from multiple locations?

This is not easily done due to the delta decoding that needs to happen, so we’ll first do the delta decoding in place (modifying the source array! YOLO!), and then do the “unsplit” part. The upper part of the code (prefix_sum_u8 etc.) is as before; the function itself now is:

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    // first pass: decode delta
    const size_t dataSize = dataElems * channels;
    uint8_t* ptr = src;
    size_t ip = 0;
    uint8_t prev = 0;
#   if CPU_ARCH_X64
    // SSE simd loop, 16 bytes at a time
    __m128i prev16 = _mm_set1_epi8(0);
    __m128i hibyte = _mm_set1_epi8(15);
    for (; ip < dataSize / 16; ++ip)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)ptr);
        // un-delta via prefix sum
        prev16 = _mm_add_epi8(prefix_sum_u8(v), _mm_shuffle_epi8(prev16, hibyte));
        _mm_storeu_si128((__m128i*)ptr, prev16);
        ptr += 16;
    }
    prev = _mm_extract_epi8(prev16, 15); // sse4.1
#   endif // if CPU_ARCH_X64
    
#   if CPU_ARCH_ARM64
    // NEON simd loop, 16 bytes at a time
    uint8x16_t prev16 = vdupq_n_u8(prev);
    uint8x16_t hibyte = vdupq_n_u8(15);
    for (; ip < dataSize / 16; ++ip)
    {
        uint8x16_t v = vld1q_u8(ptr);
        // un-delta via prefix sum
        prev16 = vaddq_u8(prefix_sum_u8(v), vqtbl1q_u8(prev16, hibyte));
        vst1q_u8(ptr, prev16);
        ptr += 16;
    }
    prev = vgetq_lane_u8(prev16, 15);
#   endif // if CPU_ARCH_ARM64

    // any trailing leftover
    for (ip = ip * 16; ip < dataSize; ++ip)
    {
        uint8_t v = *ptr + prev;
        prev = v;
        *ptr = v;
        ptr += 1;
    }

    // second pass: un-split; sequential write into destination
    uint8_t* dstPtr = dst;
    for (int ip = 0; ip < dataElems; ++ip)
    {
        const uint8_t* srcPtr = src + ip;
        for (int ich = 0; ich < channels; ++ich)
        {
            uint8_t v = *srcPtr;
            *dstPtr = v;
            srcPtr += dataElems;
            dstPtr += 1;
        }
    }
}

Times for “E”: WinVS 20.9ms, WinClang 14.3ms, MacClang 16.3ms (“D” was: 18.8, 18.8, 8.3). Now that is curious! The two “primary” configurations (Windows MSVC and Mac Clang) get slower or much slower. You could have expected that, since this function gets us back into “two passes over memory” land. But Windows Clang gets quite a bit faster 🤔

Looking at the code in Compiler Explorer, Clang for x64 decides to unroll the un-split loop so that it handles 8 bytes at a time, whereas MSVC x64 and Clang arm64 do a simple loop of one byte at a time, as written in the C code.

However, I know my data, and I know that the majority of it is 16-byte data items. But explicitly adding a code path for the channels==16 case, and manually unrolling the loop to do 16 bytes at a time, turned out slower (a sketch of what I mean is below). Maybe something to look into some other day.
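
For reference, the special-cased path was along these lines (a sketch from memory rather than the exact code; channels==16 means each data element is 16 bytes, and in my measurements this ended up slower than the generic loop above):

// hypothetical channels==16 special case for the second ("un-split") pass
if (channels == 16)
{
    uint8_t* dstPtr = dst;
    for (size_t ip = 0; ip < dataElems; ++ip)
    {
        const uint8_t* srcPtr = src + ip;
        // fixed trip count, so the compiler can fully unroll this
        for (int ich = 0; ich < 16; ++ich)
            dstPtr[ich] = srcPtr[ich * dataElems];
        dstPtr += 16;
    }
}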

For now I’ll say that’s enough of decoding speedup attempts. We are here:

And the lesson so far is that the most trivial of code changes (just folding delta + unsplit into one function, with one pass over memory) got the largest speedup: from 24..30ms depending on the platform, the time went down to 10..20ms.

Additional SIMD things can get it a bit faster, but I suspect that if we’re looking for larger gains, we’d want to change the filter itself, so that it is no longer just one “stream” of delta bytes, but rather perhaps several interleaved streams (see the sketch after this paragraph). That way, all the fancy machinery inside CPUs that can do a gazillion operations in parallel can actually do it.
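
To make that concrete, here’s a sketch of one possible direction (my illustration; not something implemented in this post): keep an independent running “previous” value per channel, so that each channel’s delta stream can be decoded as its own prefix sum, without a serial dependency on every byte that came before it in the whole buffer:

#include <stdint.h>
#include <stddef.h>

// Sketch: delta encode each channel's block independently, instead of one running
// "prev" across the entire split buffer. Assumes the data has already been split,
// i.e. channel ich occupies data[ich*dataElems .. (ich+1)*dataElems).
static void EncodeDeltaPerChannel(uint8_t* data, int channels, size_t dataElems)
{
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t prev = 0;
        uint8_t* ptr = data + (size_t)ich * dataElems;
        for (size_t i = 0; i < dataElems; ++i)
        {
            uint8_t v = ptr[i];
            ptr[i] = v - prev;
            prev = v;
        }
    }
}

Decoding would then be a bunch of independent per-channel prefix sums, which is much friendlier to all that parallel machinery; whether it compresses as well is a separate question.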

Compression filter optimization

While the compression filter is relatively cheap compared to the lossless compression itself, I made very similar attempts at speeding it up. A short run through these is below. We start with “split data, then delta encode” as two passes over memory:

template<typename T>
static void EncodeDeltaDif(T* data, size_t dataElems)
{
    T prev = 0;
    for (size_t i = 0; i < dataElems; ++i)
    {
        T v = *data;
        *data = v - prev;
        prev = v;
        ++data;
    }
}
template<typename T>
static void Split(const T* src, T* dst, int channels, size_t dataElems)
{
    for (int ich = 0; ich < channels; ++ich)
    {
        const T* ptr = src + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            *dst = *ptr;
            ptr += channels;
            dst += 1;
        }
    }
}
void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    Split<uint8_t>(src, dst, channels, dataElems);
    EncodeDeltaDif<uint8_t>(dst, channels * dataElems);
}

Which is “A”: WinVS 27.6ms, WinClang 17.9ms, MacClang 9.6ms.
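
For context on how these filters get used: the raw data buffer is filtered right before being handed to the lossless compressor, and the matching un-filter runs right after decompression. A hypothetical usage sketch, with the helper name, buffer names and item size made up for illustration:

#include <stdint.h>
#include <stddef.h>
#include <vector>

// Hypothetical: filter a buffer of 16-byte items (e.g. four floats each) before
// handing it to zstd/LZ4/etc. Names and sizes here are illustrative only; assumes
// rawSize is a multiple of kItemSize.
std::vector<uint8_t> FilterForCompression(const uint8_t* rawData, size_t rawSize)
{
    const int kItemSize = 16; // bytes per data element, i.e. "channels"
    std::vector<uint8_t> filtered(rawSize);
    Split8Delta(rawData, filtered.data(), kItemSize, rawSize / kItemSize);
    return filtered; // ...then compress this with the lossless compressor
}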

Next up, fold split and delta into one function:

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;
            srcPtr += channels;
            dst += 1;
        }
    }
}

Which is “B”: WinVS 21.0ms, WinClang 18.0ms, MacClang 11.0ms (“A” was: 27.6, 17.9, 9.6). Curious! WinVS a good chunk faster, WinClang unchanged, MacClang a bit slower. This is very different from the decoding filter case!

Ok, throw in some SIMD, very much like above. Process data 16 bytes at a time, with a scalar loop at the end for any leftovers:

#if defined(__x86_64__) || defined(_M_X64)
#   define CPU_ARCH_X64 1
#   include <emmintrin.h> // sse2
#   include <tmmintrin.h> // ssse3
#   include <smmintrin.h> // sse4.1
#endif
#if defined(__aarch64__) || defined(_M_ARM64)
#   define CPU_ARCH_ARM64 1
#   include <arm_neon.h>
#endif

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        alignas(16) uint8_t gathered[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            for (int lane = 0; lane < 16; ++lane)
            {
                gathered[lane] = *srcPtr;
                srcPtr += channels;
            }
            __m128i v = _mm_load_si128((const __m128i*)gathered);
            // delta from previous
            __m128i delta = _mm_sub_epi8(v, _mm_alignr_epi8(v, prev16, 15)); // ssse3
            _mm_storeu_si128((__m128i*)dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        alignas(16) uint8_t gathered[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            for (int lane = 0; lane < 16; ++lane)
            {
                gathered[lane] = *srcPtr;
                srcPtr += channels;
            }
            uint8x16_t v = vld1q_u8(gathered);
            // delta from previous
            uint8x16_t delta = vsubq_u8(v, vextq_u8(prev16, v, 15));
            vst1q_u8(dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;

            srcPtr += channels;
            dst += 1;
        }
    }
}

Which is “C”: WinVS 18.0ms, WinClang 18.3ms, MacClang 17.5ms (“B” was: 21.0, 18.0, 11.0). Hmm. MSVC keeps on improving, Windows Clang stays the same, Mac Clang gets a lot slower. Eek!
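
As an aside, the “delta from previous” line might look cryptic: _mm_alignr_epi8(v, prev16, 15) (and vextq_u8(prev16, v, 15) on NEON) produces the current 16 bytes shifted down by one, with the last byte of the previous block shifted in at the front. A scalar sketch of what that one SIMD subtraction computes (names here are just for illustration):

#include <stdint.h>

// Scalar equivalent of the SIMD "delta from previous" step above (illustration only).
// prevLast is the last byte of the previous 16-byte block.
static void DeltaFromPrevious16(const uint8_t v[16], uint8_t prevLast, uint8_t delta[16])
{
    // alignr/vext builds [prevLast, v[0], v[1], ..., v[14]]; subtracting that from v gives:
    delta[0] = (uint8_t)(v[0] - prevLast);
    for (int i = 1; i < 16; ++i)
        delta[i] = (uint8_t)(v[i] - v[i - 1]);
}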

Undeterred, I’m going to make another attempt, just like for the decompression above: replace the data gathering loop written in C:

for (int lane = 0; lane < 16; ++lane)
{
    gathered[lane] = *srcPtr;
    srcPtr += channels;
}

with one that uses SIMD intrinsics instead:

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            __m128i v = _mm_set1_epi8(0);
            v = _mm_insert_epi8(v, *srcPtr, 0); srcPtr += channels; // sse4.1
            v = _mm_insert_epi8(v, *srcPtr, 1); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 2); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 3); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 4); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 5); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 6); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 7); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 8); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 9); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 10); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 11); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 12); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 13); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 14); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 15); srcPtr += channels;
            // delta from previous
            __m128i delta = _mm_sub_epi8(v, _mm_alignr_epi8(v, prev16, 15)); // ssse3
            _mm_storeu_si128((__m128i*)dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            uint8x16_t v = vdupq_n_u8(0);
            v = vsetq_lane_u8(*srcPtr, v, 0); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 1); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 2); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 3); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 4); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 5); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 6); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 7); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 8); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 9); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 10); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 11); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 12); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 13); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 14); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 15); srcPtr += channels;

            // delta from previous
            uint8x16_t delta = vsubq_u8(v, vextq_u8(prev16, v, 15));
            vst1q_u8(dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;

            srcPtr += channels;
            dst += 1;
        }
    }
}

Which now is “D”: WinVS 17.5ms, WinClang 15.2ms, MacClang 9.1ms (“C” was: 18.0, 18.3, 17.5). Whoa. This is the fastest version now, and Mac especially got out of that strange slowness from the “C” case. But how, and why?

The NEON assembly of the gather loop in the “C” case was like:

.LBB0_10:
add     x12, x0, x9
mov     x14, x10
dup     v1.16b, w13
.LBB0_11:
ld1     { v0.b }[0], [x12], x8
subs    x14, x14, #1
ld1     { v0.b }[1], [x12], x8
ld1     { v0.b }[2], [x12], x8
ld1     { v0.b }[3], [x12], x8
ld1     { v0.b }[4], [x12], x8
ld1     { v0.b }[5], [x12], x8
ld1     { v0.b }[6], [x12], x8
ld1     { v0.b }[7], [x12], x8
ld1     { v0.b }[8], [x12], x8
ld1     { v0.b }[9], [x12], x8
ld1     { v0.b }[10], [x12], x8
ld1     { v0.b }[11], [x12], x8
ld1     { v0.b }[12], [x12], x8
ld1     { v0.b }[13], [x12], x8
ld1     { v0.b }[14], [x12], x8
ext     v1.16b, v1.16b, v0.16b, #15
; ...

i.e. it’s a series of one-byte loads into each byte-wide lane of a NEON register. Cool. The assembly of the “D” case, which is twice as fast, is:

.LBB0_10:
add     x12, x0, x9
mov     x14, x10
dup     v0.16b, w13
.LBB0_11:
movi    v1.2d, #0000000000000000
subs    x14, x14, #1
ld1     { v1.b }[0], [x12], x8
ld1     { v1.b }[1], [x12], x8
ld1     { v1.b }[2], [x12], x8
ld1     { v1.b }[3], [x12], x8
ld1     { v1.b }[4], [x12], x8
ld1     { v1.b }[5], [x12], x8
ld1     { v1.b }[6], [x12], x8
ld1     { v1.b }[7], [x12], x8
ld1     { v1.b }[8], [x12], x8
ld1     { v1.b }[9], [x12], x8
ld1     { v1.b }[10], [x12], x8
ld1     { v1.b }[11], [x12], x8
ld1     { v1.b }[12], [x12], x8
ld1     { v1.b }[13], [x12], x8
ld1     { v1.b }[14], [x12], x8
ext     v0.16b, v0.16b, v1.16b, #15
; ...

it’s the same series of one-byte loads into a NEON register! What gives?!

The movi v1.2d, #0000000000000000 is the key. In my hand-written NEON intrinsics version, I for some reason wrote it in a way that first sets the whole register to zero (uint8x16_t v = vdupq_n_u8(0);), and then proceeds to load each byte of it.

Whereas in the C version, there’s an alignas(16) uint8_t gathered[16]; variable outside the loop, and nothing tells the compiler or the CPU that it is completely overwritten on each loop iteration. This, I guess, creates a dependency between loop iterations, where register renaming or similar machinery cannot kick in.

Knowing that, we could get back to a version written in C, and on Mac Clang it is the same 9.1ms:

alignas(16) uint8_t gathered[16] = {};
for (int lane = 0; lane < 16; ++lane)
{
    gathered[lane] = *srcPtr;
    srcPtr += channels;
}
uint8x16_t v = vld1q_u8(gathered);

Note that the = {}; in the variable declaration is important; it’s not enough to just move the variable inside the loop. Logically, upon each iteration the variable is a “fresh new variable”, but the compiler decides not to explicitly set that register to zero, thus keeping this kinda-false dependency between loop iterations.
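
In other words, per the note above, this variant (declared inside the loop but without the = {}) was not enough on its own:

// still leaves the kinda-false dependency in place (per the note above)
alignas(16) uint8_t gathered[16]; // no "= {}"
for (int lane = 0; lane < 16; ++lane)
{
    gathered[lane] = *srcPtr;
    srcPtr += channels;
}
uint8x16_t v = vld1q_u8(gathered);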

Having the same “written in C” version for the SSE code path still does not result in MSVC emitting SSE instructions though.

Anyway, for the compression filter I’m going to call it a day. We are here:

  • Good improvement on MSVC,
  • A tiny bit of improvement on Clang x64 and arm64, with several regressions along the way. But hey, I learned to initialize my registers.

Summary

Putting all of that together, here’s how it affects the overall picture. Click for the interactive chart; the thick line is the filter optimized as above, the thin solid line is the filter from part 3, and the dashed line is just the lossless compressors, without any data filtering. Windows, Ryzen 5950X, VS2022:

  • Compression gets a bit faster; generally saves about 40ms.
  • Decompression also gets faster; it saves about 30ms, but the effect of that is much larger since the decompressors themselves are faster. Something like LZ4 goes from 0.8GB/s to 1.0GB/s. Which is still way below the 5GB/s it does without any data filtering, of course, but eh.

And Mac, Apple M1 Max, Clang 14:

  • Compression gets a tiny bit faster, but the effect is so small that it’s really nothing.
  • Decompression gets way faster. It saves about 60ms, which gets LZ4 from 0.8GB/s to 1.6GB/s. And, for example, zstd decompression is now within the same ballpark as without any data filtering!

What’s next

Next up: either look into lossy compression, or into other ways of speeding up the data filtering.