Float Compression 6: Filtering Optimization

Introduction and index of this series is here.

Several posts ago we learned that filtering the data can make it more compressible. Out of several simple filters that I tried, “reorder data items byte-wise and delta encode that” was the most effective at improving compression ratio. That’s all nice and good, but the filtering has a cost to it. One is the extra memory needed to hold the filtered data, and another is the time it takes to do the filtering. While for compression the relative speed hit is fairly small (i.e. compression itself takes much longer), for decompression the cost is not trivial. A fast decompressor like LZ4 normally goes at ~5GB/s, but with “reorder bytes + delta” filter it only achieves 0.8GB/s:

Can we do something simple (i.e. something even I could do) to speed that up a bit? Let’s find out.

Decompression filter optimization

The code we are starting with is this: first decode the delta-encoded data, then un-split the data items, i.e. assemble the items from the “first byte of each item, then 2nd byte of each item, then 3rd byte of each item” layout. Like this:

// channels: how many items per data element
// dataElems: how many data elements
template<typename T>
static void UnSplit(const T* src, T* dst, int channels, size_t dataElems)
{
    for (int ich = 0; ich < channels; ++ich)
    {
        T* dstPtr = dst + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            *dstPtr = *src;
            src += 1;
            dstPtr += channels;
        }
    }
}
template<typename T>
static void DecodeDeltaDif(T* data, size_t dataElems)
{
    T prev = 0;
    for (size_t i = 0; i < dataElems; ++i)
    {
        T v = *data;
        v = prev + v;
        *data = v;
        prev = v;
        ++data;
    }
}
void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    DecodeDeltaDif<uint8_t>(src, channels * dataElems);
    UnSplit<uint8_t>(src, dst, channels, dataElems);
}

Now, this process does not depend on the compression library or the settings used for it; it’s always the same for all of them. So instead of complicated per-compressor scatter plots I’ll just give three time numbers: milliseconds that it takes to do the filtering process on our 94.5MB data set. Two numbers on a Windows PC (Ryzen 5950X, Visual Studio 2022 17.4 and Clang 15.0), and one on a Mac laptop (Apple M1 Max, Apple Clang 14.0).

The code above, which I’m gonna label “A”: WinVS 27.9ms, WinClang 29.7ms, MacClang 23.9ms.

Attempt B: combine unsplit+delta into one

The code above does two passes over the data: first the delta-decode, then the un-split. This is good for “code reuse”, in the sense that arbitrarily complex filters can be produced by combining several simple filters in sequence. But if we know we have already locked onto “split bytes + delta”, then we can try combining all that into just one function:

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t* dstPtr = dst + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            uint8_t v = *src + prev;
            prev = v;
            *dstPtr = v;
            src += 1;
            dstPtr += channels;
        }
    }
}

“B”: WinVS 19.5ms, WinClang 18.9ms, MacClang 9.5ms (“A” was: 27.9, 29.7, 23.9). Not bad at all, and the code is actually smaller now.

Attempt C: I heard something about SIMD

CPUs these days have this “SIMD” thing, and for usual use cases we can pretty much assume that something like SSE4 is available on Intel architectures, and NEON on ARM architectures. Our code loops over data “quite a lot”, doing very simple operations with it. Would trying to sprinkle some manually written SIMD in there help?

One spanner in the works is that the loop is not made of independent operations: each iteration updates the prev byte value. Delta-encoded data decoding is essentially a prefix sum operation, and some five-minute googling (or rather, having been aware of all of Fabian’s tweets and toots) finds this gist with a prefix_sum_u8 function for SSE. So with that, let’s try to rewrite the above code so that the inner loop can do 16 bytes at a time, for both the SSE and NEON cases.

The code’s quite a bit longer:

#if defined(__x86_64__) || defined(_M_X64)
#   define CPU_ARCH_X64 1
#   include <emmintrin.h> // sse2
#   include <tmmintrin.h> // ssse3
#   include <smmintrin.h> // sse4.1
#endif
#if defined(__aarch64__) || defined(_M_ARM64)
#   define CPU_ARCH_ARM64 1
#   include <arm_neon.h>
#endif

#if CPU_ARCH_X64
// https://gist.github.com/rygorous/4212be0cd009584e4184e641ca210528
static inline __m128i prefix_sum_u8(__m128i x)
{
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 8));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 16));
    x = _mm_add_epi8(x, _mm_slli_epi64(x, 32));
    x = _mm_add_epi8(x, _mm_shuffle_epi8(x, _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1,7,7,7,7,7,7,7,7)));
    return x;
}
#endif // #if CPU_ARCH_X64
#if CPU_ARCH_ARM64
// straight-up port to NEON of the above; no idea if this is efficient at all, yolo!
static inline uint8x16_t prefix_sum_u8(uint8x16_t x)
{
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 8)));
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 16)));
    x = vaddq_u8(x, vreinterpretq_u8_u64(vshlq_n_u64(vreinterpretq_u64_u8(x), 32)));
    alignas(16) uint8_t tbl[16] = {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,7,7,7,7,7,7,7,7};
    x = vaddq_u8(x, vqtbl1q_u8(x, vld1q_u8(tbl)));
    return x;
}
#endif // #if CPU_ARCH_ARM64

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        uint8_t* dstPtr = dst + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        __m128i hibyte = _mm_set1_epi8(15);
        alignas(16) uint8_t scatter[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // load 16 bytes of filtered data
            __m128i v = _mm_loadu_si128((const __m128i*)src);
            // un-delta via prefix sum
            prev16 = _mm_add_epi8(prefix_sum_u8(v), _mm_shuffle_epi8(prev16, hibyte));
            // scattered write into destination
            _mm_store_si128((__m128i*)scatter, prev16);
            for (int lane = 0; lane < 16; ++lane)
            {
                *dstPtr = scatter[lane];
                dstPtr += channels;
            }
            src += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        uint8x16_t hibyte = vdupq_n_u8(15);
        alignas(16) uint8_t scatter[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // load 16 bytes of filtered data
            uint8x16_t v = vld1q_u8(src);
            // un-delta via prefix sum
            prev16 = vaddq_u8(prefix_sum_u8(v), vqtbl1q_u8(prev16, hibyte));
            // scattered write into destination
            vst1q_u8(scatter, prev16);
            for (int lane = 0; lane < 16; ++lane)
            {
                *dstPtr = scatter[lane];
                dstPtr += channels;
            }
            src += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *src + prev;
            prev = v;
            *dstPtr = v;
            src += 1;
            dstPtr += channels;
        }
    }
}

Phew. “C”: WinVS 19.7ms, WinClang 18.7ms, MacClang 8.3ms (“B” was: 19.5, 18.9, 9.5). Meh! 😕 This makes the Apple/NEON case a bit faster, but on PC/SSE case it’s pretty much the same timings.

Lesson: just using some SIMD does not necessarily make things faster. In this case, I suspect the problems are the lack of independent work (each loop iteration depends on the results of the previous one, and the “work” within the loop is tiny) and the scattered memory writes. If I were less lazy I’d try to draw up a data flow graph or something.

Attempt D: undeterred, even more SIMD

The SIMD attempt was a lot of code for very little (if any) gain, but hey, what if we try adding even more SIMD? Looking at the assembly of the compiled code in Compiler Explorer, I noticed that while Clang can keep code like this:

// scattered write into destination
_mm_store_si128((__m128i*)scatter, prev16);
for (int lane = 0; lane < 16; ++lane)
{
    *dstPtr = scatter[lane];
    dstPtr += channels;
}

completely within SIMD registers, MSVC can not. Clang compiles the above into a sequence of pextrb (SSE) and st1 (NEON) instructions, like:

; SSE
pextrb  byte ptr [rbx], xmm3, 0
pextrb  byte ptr [rbx + rax], xmm3, 1
add     rbx, rax
pextrb  byte ptr [rax + rbx], xmm3, 2
add     rbx, rax
pextrb  byte ptr [rax + rbx], xmm3, 3
add     rbx, rax
; ...
; NEON
st1     { v2.b }[0], [x13]
add     x13, x15, x8
st1     { v2.b }[1], [x12]
add     x12, x13, x8
st1     { v2.b }[2], [x15]
add     x15, x12, x8
; ...

But MSVC emits assembly very similar to how the code is written: first it writes the SSE register into memory, and then stores each byte of that into its final location:

movdqa  XMMWORD PTR scatter$1[rsp], xmm3
movdqa  xmm0, xmm3
psrldq  xmm0, 8
movd    ecx, xmm3
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+1]
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+2]
mov     BYTE PTR [rax], cl
add     rax, rbx
movzx   ecx, BYTE PTR scatter$1[rsp+3]
mov     BYTE PTR [rax], cl
add     rax, rbx
; ...

So how about if we replace that loop with a sequence of _mm_extract_epi8 (SSE intrinsic function that maps to pextrb)?

// scattered write into destination
//_mm_store_si128((__m128i*)scatter, prev16);
//for (int lane = 0; lane < 16; ++lane)
//{
//    *dstPtr = scatter[lane];
//    dstPtr += channels;
//}
*dstPtr = _mm_extract_epi8(prev16, 0); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 1); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 2); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 3); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 4); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 5); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 6); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 7); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 8); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 9); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 10); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 11); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 12); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 13); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 14); dstPtr += channels;
*dstPtr = _mm_extract_epi8(prev16, 15); dstPtr += channels;

And a very similar thing can be done for NEON, where each line would look like *dstPtr = vgetq_lane_u8(prev16, 0); dstPtr += channels; and so on (spelled out below). And now MSVC does keep everything within SIMD registers. There’s no change on Clang, in either the SSE or the NEON case.
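
Spelled out, the NEON version of that scattered store is just this (the same pattern as the SSE sequence above, using the same prev16 register and channels stride):

// NEON equivalent of the unrolled scattered store (sketch)
*dstPtr = vgetq_lane_u8(prev16, 0); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 1); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 2); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 3); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 4); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 5); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 6); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 7); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 8); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 9); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 10); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 11); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 12); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 13); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 14); dstPtr += channels;
*dstPtr = vgetq_lane_u8(prev16, 15); dstPtr += channels;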

“D”: WinVS 18.8ms, WinClang 18.8ms, MacClang 8.3ms (“C” was: 19.7, 18.7, 8.3). Ok, a small improvement for MSVC, unchanged (as expected) for the two Clang cases.

Attempt E: try to flip it around

Ok so that whole SIMD thing was a lot of code for very little gain. How about something different? Our UnSplit8Delta reads the source data sequentially, but does “scattered” writes into destination array. How about if we change the order, so that the destination writes are done sequentially, but the source data is “gathered” from multiple locations?

This is not easily done due to the delta decoding that needs to happen, so we’ll first do delta-decoding in place (modifying the source array! YOLO!), and then do the “unsplit” part. The upper part of the code (prefix_sum_u8 etc.) is as before; the function itself is now:

void UnSplit8Delta(uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    // first pass: decode delta
    const size_t dataSize = dataElems * channels;
    uint8_t* ptr = src;
    size_t ip = 0;
    uint8_t prev = 0;
#   if CPU_ARCH_X64
    // SSE simd loop, 16 bytes at a time
    __m128i prev16 = _mm_set1_epi8(0);
    __m128i hibyte = _mm_set1_epi8(15);
    for (; ip < dataSize / 16; ++ip)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)ptr);
        // un-delta via prefix sum
        prev16 = _mm_add_epi8(prefix_sum_u8(v), _mm_shuffle_epi8(prev16, hibyte));
        _mm_storeu_si128((__m128i*)ptr, prev16);
        ptr += 16;
    }
    prev = _mm_extract_epi8(prev16, 15); // sse4.1
#   endif // if CPU_ARCH_X64
    
#   if CPU_ARCH_ARM64
    // NEON simd loop, 16 bytes at a time
    uint8x16_t prev16 = vdupq_n_u8(prev);
    uint8x16_t hibyte = vdupq_n_u8(15);
    for (; ip < dataSize / 16; ++ip)
    {
        uint8x16_t v = vld1q_u8(ptr);
        // un-delta via prefix sum
        prev16 = vaddq_u8(prefix_sum_u8(v), vqtbl1q_u8(prev16, hibyte));
        vst1q_u8(ptr, prev16);
        ptr += 16;
    }
    prev = vgetq_lane_u8(prev16, 15);
#   endif // if CPU_ARCH_ARM64

    // any trailing leftover
    for (ip = ip * 16; ip < dataSize; ++ip)
    {
        uint8_t v = *ptr + prev;
        prev = v;
        *ptr = v;
        ptr += 1;
    }

    // second pass: un-split; sequential write into destination
    uint8_t* dstPtr = dst;
    for (size_t ip = 0; ip < dataElems; ++ip)
    {
        const uint8_t* srcPtr = src + ip;
        for (int ich = 0; ich < channels; ++ich)
        {
            uint8_t v = *srcPtr;
            *dstPtr = v;
            srcPtr += dataElems;
            dstPtr += 1;
        }
    }
}

Times for “E”: WinVS 20.9ms, WinClang 14.3ms, MacClang 16.3ms (“D” was: 18.8, 18.8, 8.3). Now that is curious! The two “primary” configurations (Windows MSVC and Mac Clang) get slower or much slower. You could have expected that; this function gets us back into “two passes over memory” land. But Windows Clang gets quite a bit faster 🤔

Looking at the code in Compiler Explorer, Clang for x64 decides to unroll the un-split loop into a loop that does 8 bytes at a time; whereas MSVC x64 and Clang arm64 do a simple loop of one byte at a time, as written in the C code.

However, I know my data, and I know that the majority of it is 16-byte data items. But explicitly adding a code path for the channels==16 case and manually unrolling the loop to do 16 bytes at a time (sketched below) only gets slower. Maybe something to look into some other day.
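
For the record, the kind of channels==16 special case I mean would look roughly like this in the second, un-split pass; this is a hypothetical sketch of the idea, not the code I kept, since it measured slower:

// hypothetical SSE path: when each element is exactly 16 bytes, gather one
// byte from each of the 16 streams and write them with a single 16-byte store
if (channels == 16)
{
    uint8_t* dstPtr = dst;
    for (size_t ip = 0; ip < dataElems; ++ip)
    {
        const uint8_t* srcPtr = src + ip;
        __m128i v = _mm_setzero_si128();
        v = _mm_insert_epi8(v, *srcPtr, 0); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 1); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 2); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 3); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 4); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 5); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 6); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 7); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 8); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 9); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 10); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 11); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 12); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 13); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 14); srcPtr += dataElems;
        v = _mm_insert_epi8(v, *srcPtr, 15);
        _mm_storeu_si128((__m128i*)dstPtr, v);
        dstPtr += 16;
    }
}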

For now I’ll say it’s enough of decoding speedup attempts. We are here:

And the lesson so far is that the most trivial of code changes (just fold delta + unsplit into one function, with one pass over memory) got the largest speedup – from 24..30ms depending on the platform, the time went down to 10..20ms.

Additional SIMD things can get it a bit faster, but I suspect if we’re looking for larger gains, we’d want to change the filter itself, so that it is no longer just one “stream” of delta bytes, but rather perhaps several interleaved streams. That way, all the fancy machinery inside CPUs that can do kphjillion operations in parallel can actually do it.

Compression filter optimization

While the compression filter cost is relatively small compared to the lossless compression part itself, I made very similar attempts at speeding it up. A short run through them is below. We start with “split data, delta encode” done as two passes over memory:

template<typename T>
static void EncodeDeltaDif(T* data, size_t dataElems)
{
    T prev = 0;
    for (size_t i = 0; i < dataElems; ++i)
    {
        T v = *data;
        *data = v - prev;
        prev = v;
        ++data;
    }
}
template<typename T>
static void Split(const T* src, T* dst, int channels, size_t dataElems)
{
    for (int ich = 0; ich < channels; ++ich)
    {
        const T* ptr = src + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            *dst = *ptr;
            ptr += channels;
            dst += 1;
        }
    }
}
void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    Split<uint8_t>(src, dst, channels, dataElems);
    EncodeDeltaDif<uint8_t>(dst, channels * dataElems);
}

Which is “A”: WinVS 27.6ms, WinClang 17.9ms, MacClang 9.6ms.

Next up, fold split and delta into one function:

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        for (size_t ip = 0; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;
            srcPtr += channels;
            dst += 1;
        }
    }
}

Which is “B”: WinVS 21.0ms, WinClang 18.0ms, MacClang 11.0ms (“A” was: 27.6, 17.9, 9.6). Curious! WinVS is a good chunk faster, WinClang is unchanged, and MacClang is a bit slower. This is very different from the decoding filter case!

Ok, throw in some SIMD, very much like above. Process data 16 bytes at a time, with a scalar loop at the end for any leftovers:

#if defined(__x86_64__) || defined(_M_X64)
#   define CPU_ARCH_X64 1
#   include <emmintrin.h> // sse2
#   include <tmmintrin.h> // ssse3
#   include <smmintrin.h> // sse4.1
#endif
#if defined(__aarch64__) || defined(_M_ARM64)
#   define CPU_ARCH_ARM64 1
#   include <arm_neon.h>
#endif

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        alignas(16) uint8_t gathered[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            for (int lane = 0; lane < 16; ++lane)
            {
                gathered[lane] = *srcPtr;
                srcPtr += channels;
            }
            __m128i v = _mm_load_si128((const __m128i*)gathered);
            // delta from previous
            __m128i delta = _mm_sub_epi8(v, _mm_alignr_epi8(v, prev16, 15)); // ssse3
            _mm_storeu_si128((__m128i*)dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        alignas(16) uint8_t gathered[16];
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            for (int lane = 0; lane < 16; ++lane)
            {
                gathered[lane] = *srcPtr;
                srcPtr += channels;
            }
            uint8x16_t v = vld1q_u8(gathered);
            // delta from previous
            uint8x16_t delta = vsubq_u8(v, vextq_u8(prev16, v, 15));
            vst1q_u8(dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;

            srcPtr += channels;
            dst += 1;
        }
    }
}

Which is “C”: WinVS 18.0ms, WinClang 18.3ms, MacClang 17.5ms (“B” was: 21.0, 18.0, 11.0). Hmm. MSVC keeps on improving, Windows Clang stays the same, Mac Clang gets a lot slower. Eek!

Undeterred, I’m going to make another attempt, just like for decompression above: replace the data gather loop written in C:

for (int lane = 0; lane < 16; ++lane)
{
    gathered[lane] = *srcPtr;
    srcPtr += channels;
}

with one that uses SIMD intrinsics instead:

void Split8Delta(const uint8_t* src, uint8_t* dst, int channels, size_t dataElems)
{
    uint8_t prev = 0;
    for (int ich = 0; ich < channels; ++ich)
    {
        const uint8_t* srcPtr = src + ich;
        size_t ip = 0;

#       if CPU_ARCH_X64
        // SSE simd loop, 16 bytes at a time
        __m128i prev16 = _mm_set1_epi8(prev);
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            __m128i v = _mm_set1_epi8(0);
            v = _mm_insert_epi8(v, *srcPtr, 0); srcPtr += channels; // sse4.1
            v = _mm_insert_epi8(v, *srcPtr, 1); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 2); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 3); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 4); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 5); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 6); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 7); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 8); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 9); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 10); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 11); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 12); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 13); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 14); srcPtr += channels;
            v = _mm_insert_epi8(v, *srcPtr, 15); srcPtr += channels;
            // delta from previous
            __m128i delta = _mm_sub_epi8(v, _mm_alignr_epi8(v, prev16, 15)); // ssse3
            _mm_storeu_si128((__m128i*)dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = _mm_extract_epi8(prev16, 15); // sse4.1
#       endif // if CPU_ARCH_X64

#       if CPU_ARCH_ARM64
        // NEON simd loop, 16 bytes at a time
        uint8x16_t prev16 = vdupq_n_u8(prev);
        for (; ip < dataElems / 16; ++ip)
        {
            // gather 16 bytes from source data
            uint8x16_t v = vdupq_n_u8(0);
            v = vsetq_lane_u8(*srcPtr, v, 0); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 1); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 2); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 3); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 4); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 5); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 6); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 7); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 8); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 9); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 10); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 11); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 12); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 13); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 14); srcPtr += channels;
            v = vsetq_lane_u8(*srcPtr, v, 15); srcPtr += channels;

            // delta from previous
            uint8x16_t delta = vsubq_u8(v, vextq_u8(prev16, v, 15));
            vst1q_u8(dst, delta);
            prev16 = v;
            dst += 16;
        }
        prev = vgetq_lane_u8(prev16, 15);
#       endif // if CPU_ARCH_ARM64

        // any trailing leftover
        for (ip = ip * 16; ip < dataElems; ++ip)
        {
            uint8_t v = *srcPtr;
            *dst = v - prev;
            prev = v;

            srcPtr += channels;
            dst += 1;
        }
    }
}

Which now is “D”: WinVS 17.5ms, WinClang 15.2ms, MacClang 9.1ms (“C” was: 18.0, 18.3, 17.5). Whoa. This is the fastest version now, and Mac especially got out of that strange slowness from “C” case. But how and why?

The NEON assembly of the gather loop in the “C” case looked like:

.LBB0_10:
add     x12, x0, x9
mov     x14, x10
dup     v1.16b, w13
.LBB0_11:
ld1     { v0.b }[0], [x12], x8
subs    x14, x14, #1
ld1     { v0.b }[1], [x12], x8
ld1     { v0.b }[2], [x12], x8
ld1     { v0.b }[3], [x12], x8
ld1     { v0.b }[4], [x12], x8
ld1     { v0.b }[5], [x12], x8
ld1     { v0.b }[6], [x12], x8
ld1     { v0.b }[7], [x12], x8
ld1     { v0.b }[8], [x12], x8
ld1     { v0.b }[9], [x12], x8
ld1     { v0.b }[10], [x12], x8
ld1     { v0.b }[11], [x12], x8
ld1     { v0.b }[12], [x12], x8
ld1     { v0.b }[13], [x12], x8
ld1     { v0.b }[14], [x12], x8
ext     v1.16b, v1.16b, v0.16b, #15
; ...

i.e. it’s a series of one-byte loads into each byte-wide lane of a NEON register. Cool. The assembly of “D” case, which is twice as fast, is:

.LBB0_10:
add     x12, x0, x9
mov     x14, x10
dup     v0.16b, w13
.LBB0_11:
movi    v1.2d, #0000000000000000
subs    x14, x14, #1
ld1     { v1.b }[0], [x12], x8
ld1     { v1.b }[1], [x12], x8
ld1     { v1.b }[2], [x12], x8
ld1     { v1.b }[3], [x12], x8
ld1     { v1.b }[4], [x12], x8
ld1     { v1.b }[5], [x12], x8
ld1     { v1.b }[6], [x12], x8
ld1     { v1.b }[7], [x12], x8
ld1     { v1.b }[8], [x12], x8
ld1     { v1.b }[9], [x12], x8
ld1     { v1.b }[10], [x12], x8
ld1     { v1.b }[11], [x12], x8
ld1     { v1.b }[12], [x12], x8
ld1     { v1.b }[13], [x12], x8
ld1     { v1.b }[14], [x12], x8
ext     v0.16b, v0.16b, v1.16b, #15
; ...

it’s the same series of one-byte loads into a NEON register! What gives?!

The movi v1.2d, #0000000000000000 is the key. In my hand-written NEON intrinsics version, I had for some reason written it in a way that first sets the whole register to zero (uint8x16_t v = vdupq_n_u8(0);) and then loads each byte of it.

Whereas in the C version, there’s an alignas(16) uint8_t gathered[16]; variable outside the loop, and nothing tells the compiler or the CPU that it is completely overwritten on each loop iteration. This, I guess, creates a dependency between loop iterations, where some sort of register renaming cannot kick in.

Knowing that, we could get back to a version written in C, and on Mac Clang it is the same 9.1ms:

alignas(16) uint8_t gathered[16] = {};
for (int lane = 0; lane < 16; ++lane)
{
    gathered[lane] = *srcPtr;
    srcPtr += channels;
}
uint8x16_t v = vld1q_u8(gathered);

Note that the = {}; in the variable declaration is important; it’s not enough to just move the variable inside the loop. Logically, upon each iteration the variable is a “fresh new variable”, but without the initializer the compiler decides not to explicitly set that register to zero, thus creating this kinda-false dependency between loop iterations.

Doing the same thing for the “written in C” version of the SSE code path still does not get MSVC to keep the gather within SIMD registers, though.

Anyway, for compression filter I’m going to call it a day. We are here:

  • Good improvement on MSVC,
  • A tiny bit of improvement on Clang x64 and ARM. With several regressions along the way, but hey I learned to initialize my registers.

Summary

Putting all of that together, here’s how it affects the overall picture. Click for interactive chart; thick line is filter optimized as above; thin solid line is filter from part 3. Dashed line is just the lossless compressors, without any data filtering. Windows, Ryzen 5950X, VS2022:

  • Compression gets a bit faster; generally saves about 40ms.
  • Decompression also gets faster; saves about 30ms but the effect of that is much larger since the decompressors are faster. Something like LZ4 goes from 0.8GB/s to 1.0GB/s. Which is still way below 5GB/s that it does without any data filtering of course, but eh.

And Mac, Apple M1 Max, Clang 14:

  • Compression gets a tiny bit faster, but the effect is so small that it’s really nothing.
  • Decompression gets way faster. It saves about 60ms, which gets LZ4 from 0.8GB/s to 1.6GB/s. And for example zstd decompression now is within same ballpark as without any data filtering!

What’s next

Next up: either look into lossy compression, or into other ways of speeding up the data filtering.


Float Compression 5: Science!

Introduction and index of this series is here.

Previous post was about mis-using meshoptimizer compression for compressing totally non-mesh data, for pretty good results. This time, let’s look at several libraries specifically targeted at compressing floating point data sets. Most of them are coming from the scientific community – after all, they do have lots of simulation data, which is very often floating point numbers, and some of that data is massive and needs some compression.

Let’s go!

Reminder: so far I’m only looking at lossless compression. Lossy compression investigation might be a future post.

zfp

zfp (website, github) is a library for either lossy or lossless compression of 1D-4D data (floats and integers, 32 and 64 bit sizes of either). I’ve seen it before in EXR post.

It is similar to GPU texture compression schemes – 2D data is divided into 4x4 blocks, and each block is encoded completely independently from the others. Inside the block, various magic stuff happens and then, ehh, some bits get out in the end :) The actual algorithm is well explained here.

Sounds cool! Let’s try it (again, I’m using the lossless mode of it right now). zfp is the red 4-sided star point:

Ouch. I hope I “used it wrong” in some way? But this is not looking great. The ratio is under 1, i.e. it makes the data larger, and it is slow to decompress. (Edit: initially I made a mistake and misunderstood what zfp_compress returns – it returns the cumulative number of compressed bytes so far, not the number of new compressed bytes produced during that compress call. The second mistake I made was when compressing 1D data – you need to tell it that the Y dimension is zero, not one!)
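
For reference, the lossless (reversible) setup looks roughly like this; this is a sketch based on the zfp documentation (error handling and buffer sizing via zfp_stream_maximum_size omitted), not the exact test harness code:

#include "zfp.h"

// Sketch: losslessly compress a 2D float array with zfp (reversible mode).
// Note that zfp_compress returns the cumulative number of bytes written to
// the bit stream so far, not just the bytes from this call.
size_t CompressZfpLossless2D(float* data, size_t nx, size_t ny, void* outBuf, size_t outCapacity)
{
    zfp_field* field = zfp_field_2d(data, zfp_type_float, nx, ny);
    zfp_stream* zfp = zfp_stream_open(nullptr);
    zfp_stream_set_reversible(zfp); // lossless mode
    bitstream* stream = stream_open(outBuf, outCapacity); // outCapacity should be >= zfp_stream_maximum_size(zfp, field)
    zfp_stream_set_bit_stream(zfp, stream);
    zfp_stream_rewind(zfp);
    size_t compressedSize = zfp_compress(zfp, field);
    zfp_field_free(field);
    zfp_stream_close(zfp);
    stream_close(stream);
    return compressedSize;
}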

In lossless mode, zfp is not looking super great – ratio is not stellar, and decompression speed in particular is quite slow. The compression ratio is similar to zstd without any data filtering. To be fair, zfp compresses the two 2D files from my data set much better than the two other (1D) files. My takeaway is that zfp might be mostly targeted at lossy compression, which is a topic for some other day. Let’s move on.

fpzip

fpzip (website, github) is from the same research group as zfp, and is their previous floating point compressor. It can also be both lossless and lossy, but from the description seems to be more targeted at lossless case. Let’s try it out (fpzip is the 5-sided star):

Ok, this is a better compression ratio compared to zfp. Here I’m first splitting the data by floats (see the part 3 post) so that all water heights are together, all velocities are together, etc. Without doing that, fpzip does not get any good compression ratio. Code is here.

  • Compression ratio and performance are really good! Our data set gets down to 24.8MB (not far from “split bytes + delta + zstd” 22.9MB or “mesh optimizer + zstd” 24.3MB). Compresses in 0.6 seconds.
  • However, decompression performance is disappointing; it takes the same time as compression, 0.6 seconds – so between 5x and 10x slower than the other best approaches so far.

SPDP

SPDP (website, 2018 paper) is interesting, in that it is a “generated” algorithm. They developed a number of building blocks (reorder something, delta, LZ-like compress, etc.) and then tested millions of their possible combinations on some datasets, and picked the winner. The source code only comes either as an HDF5 filter or as a standalone command line utility. I had to modify it slightly (code) to be usable as a library, and to not use 1MB of stack space, which Windows does not appreciate :)

A series of six-sided stars here:

  • Similar to fpzip case, I had to split the data by floats (code) to get better compression ratio.
  • Compression ratio is between regular zstd and lz4, and is far behind the best options.
  • It is about twice as fast as fpzip at both compression and decompression, which is still far behind the better options.

The idea of having a bunch of building blocks and automatically picking their best sequence / combination on a bunch of data sets is interesting though. Maybe they should have had stronger building blocks (their LZ-like codec might not be super good, I guess).

ndzip

ndzip (github, 2021 paper) promises an “efficient implementation on modern SIMD-capable multicore processors, it compresses and decompresses data at speeds close to main memory bandwidth, significantly outperforming existing schemes”; let’s see about it!

Note that I have not used or tested multi-threaded modes of any of the compressors present. Some can do it; all could do it if incoming data was split into some sort of chunks (or after splitting “by floats”, each float block compressed in parallel). But that’s not for today.

ndzip does not build on Windows out of the box (I opened a PR with some fixes), and for the multi-threaded code path it uses OpenMP features that MSVC 2022 seems to not have. It is also specifically designed for AVX2, and that needs a bit of juggling to get compiled on Windows too. On Mac, it does not link due to some symbol issues related to the STL (and the AVX2 code path would not really work on arm64 anyway). On Linux, the multi-threaded OpenMP path does not produce correct results, but the single-threaded path does. Huge props for releasing the source code, but all of this does sound more like a research project that is not quite ready for production use yet :)

Anyway, 7-sided yellow star here:

  • Similar to the others, I split the data by floats first; otherwise it achieves pretty much no compression.
  • Now this one does achieve a place on the Pareto frontier for compression. The ratio is well behind the best possible (file size gets to 38.1MB; best others go down to 23MB), but it does compress at over 1GB/s. So if you need that kind of compression performance, this one’s interesting.
  • They also have CUDA and SYCL code for GPUs too. I haven’t tried that.

streamvbyte

streamvbyte (github, blog post, 2017 paper) is not meant for compressing floating point data; it is targeted at compressing 4-byte integers. But hey, there’s no law saying we can’t pretend our floats are integers, right?
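
A minimal sketch of that idea, with names of my own choosing, and assuming the element count fits into the 32-bit count that streamvbyte takes:

#include <cstdint>
#include <vector>
#include "streamvbyte.h"

// Sketch: compress float data with streamvbyte by reinterpreting the 32-bit
// float patterns as uint32 values.
static std::vector<uint8_t> CompressFloatsVByte(const float* data, uint32_t count)
{
    std::vector<uint8_t> out(streamvbyte_max_compressedbytes(count));
    const uint32_t* bits = reinterpret_cast<const uint32_t*>(data); // same size, just raw bits
    size_t written = streamvbyte_encode(bits, count, out.data());
    out.resize(written);
    return out;
}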

  • Three-sided star is regular streamvbyte. Only 1.2x compression ratio, but it is the fastest of the bunch; compressing at 5.7GB/s, decompressing at ~10GB/s.
  • There’s also streamvbyte_delta, which on unfiltered data is not really good (not shown here).
  • However (similar to others), if the data is first split by floats, then streamvbyte_delta followed by a general purpose compressor is quite good. Especially if you need compression faster than 0.5GB/s, then “split by floats, streamvbyte_delta, zstd” is on the Pareto frontier, reaching 3.5x ratio.

Conclusion and what’s next

On this data set, in lossless mode, most of the compression libraries I tried in this post are not very impressive. My guess is that’s a combination of several factors: 1) maybe they are primarily targeted at double precision, and single precision floats are somewhat of an afterthought; 2) maybe they are much better at 3D or 4D data, and are weaker at a mix of 2D and 1D data like in my case; 3) maybe they are much better when used in lossy compression mode.

However, it is curious that some of the papers describing these compressors compare them only to other scientific compressors, or only to regular lossless compression algorithms. So they go with conclusions like “we surpass zstd a bit!” and declare that a win, without comparing against something like “filter the data, and then zstd it”.

Another interesting aspect is that most of these libraries have symmetric compression and decompression performance, which is very different from most regular data compression libraries, where the compression part is often much slower.

Next up: either look into lossy compression, or into speeding up the data filtering part. Until then!


Float Compression 4: Mesh Optimizer

Introduction and index of this series is here.

The previous post investigated some lossless filtering of data, before passing it to a regular compression library. Our result so far is: 94.5MB of data can get filtered+compressed down to 23.0MB in one second (split/shuffle bytes, delta encode, zstd or kraken compression). It decompresses back in about 0.15 seconds (which is quite a bit slower than without data filtering, but that’s something for a later day).

In this post, I’ll look at one open source library that does not feel like it would immediately be suitable for compressing my data set.

zeux/meshoptimizer

meshoptimizer by Arseny Kapoulkine is a very nice library for processing, optimizing, and compressing 3D model (“mesh”) data. It can do vertex and index reordering for efficiency of various GPU caches, mesh simplification, quantization and conversion of the vertex data, has index and vertex data compression utilities, and various tools to do all that on glTF2 geometry files.

Notice the “vertex data compression utilities” bit? We’re going to try that one. Since the meshoptimizer compression is an official glTF extension, there’s a specification for it even (EXT_meshopt_compression).

It’s not immediately obvious that it can also be used for anything else than actual 3D model vertex data compression. But look at it:

  • It is completely lossless,
  • It takes “size of each vertex in bytes” and “number of vertices” as an input. But who says these need to be vertices? It’s just some data.
  • It assumes that there is some correlation / smoothness between neighboring vertices; that’s how it gets compression after all. We don’t have “vertices”, but our “data items” from water or snow simulation are nicely laid out in memory one after another, and their values do vary fairly smoothly.

In our case, water, snow simulation and float4 data files are all going to be “yeah, we are totally 16-byte vertices”, and the float3 data file is going to be “I’m full of 12-byte vertices, really”. And then we just use meshopt_encodeVertexBufferBound, meshopt_encodeVertexBuffer and meshopt_decodeVertexBuffer functions from the library.
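
In code that ends up being roughly this (a sketch with my own helper names; itemSize is 16 or 12 bytes depending on the file, as described above):

#include <vector>
#include "meshoptimizer.h"

// Sketch: compress arbitrary fixed-size items with the meshoptimizer vertex
// codec, pretending each item is a "vertex".
static std::vector<unsigned char> CompressWithMeshOpt(const void* data, size_t itemCount, size_t itemSize)
{
    std::vector<unsigned char> buf(meshopt_encodeVertexBufferBound(itemCount, itemSize));
    size_t size = meshopt_encodeVertexBuffer(buf.data(), buf.size(), data, itemCount, itemSize);
    buf.resize(size);
    return buf;
}

// Decompression goes the other way; meshopt_decodeVertexBuffer returns 0 on success.
static int DecompressWithMeshOpt(const unsigned char* comp, size_t compSize, void* dst, size_t itemCount, size_t itemSize)
{
    return meshopt_decodeVertexBuffer(dst, itemCount, itemSize, comp, compSize);
}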

So does it work?

The chart here shows our three regular compressors (zstd, lz4, kraken; dashed lines), as well as the same compressors with best filtering from previous post (solid lines). meshoptimizer is the large blue point, since it has no “compression levels” to speak of.

This is actually pretty impressive!

  • Really fast to compress (0.2 seconds), and really fast to decompress (3.5GB/s, almost at lz4 speeds).
  • For a “compress in under a second” task, it beats zstd on ratio, and achieves the same ratio as Oodle Kraken. 😮 We have a 29.5MB file.
  • It does not quite achieve the compression ratio of regular compressors with filtering applied. But hey, we are mis-using “mesh” optimizer library for data that is almost certainly not meshes :)

Oh but hey, meshoptimizer readme says:

The result of the encoding is generally significantly smaller than initial data, and remains compressible with general purpose compressors

So let’s try just that: compress our data with meshoptimizer, and then try adding our old friends zstd/lz4/kraken on top of that.

  • For compression under one second, this now achieves 24.4MB file size. Nice!
  • Kraken and zstd are almost the same performance and ratio here.
  • Still not as small as filtering + regular compression (which gets down to 23.0MB), but pretty close.
  • Decompression is still very fast; 3x faster than with filtering + regular decompression. Nice!

I have also tried various filtering approaches before doing mesh optimizer compression (split floats, split bytes, delta, xor, rotate floats left by one bit, etc.). And these do not actually help; often making compression ratio worse. This makes sense; mesh optimizer vertex compression codec has a lot of similarities to a data filter itself, so additional filtering just gets in the way and “confuses” it. Or that’s my impression.

Conclusion and what’s next

My takeaway is that if you have structured data that is mostly floating point and inherently has some similarities / smoothness across it, then you should take a look at using meshoptimizer vertex compression codec. Even if your data is not meshes at all!

It makes the data smaller by itself, but you can also pass that down into any other regular data compression library for further compression.

And it’s really fast at both compression and decompression. There’s a pure-JavaScript version too in there, if you’re targeting the web platform.

Next up, I’ll look into several libraries specifically targeted at floating point data compression, that are mostly coming from the scientific community. And then after that, maybe at lossy compression.


Float Compression 3: Filters

Introduction and index of this series is here.

In the previous parts we saw that using generic data compression libraries, we can get our 94.5MB data down to 33.8MB (zstd level 7) or 29.6MB (oodle kraken level 2) size, if we’re not willing to spend more than one second compressing it.

That’s not bad, but is there something else we can do? Turns out, there is, and in fact it’s quite simple. Enter data filtering.

Prediction / filtering

We saw filtering in the past (EXR and SPIR-V), and the idea is simple: losslessly transform the data so that it is more compressible. Filtering alone does nothing to reduce the data size, but (hopefully!) it decreases data randomness. So the process is: filter the data, then compress that. Decompression is reverse: decompress, then un-filter it.

Here are some simple filters that I’ve tried (there are many, many other possible filters; I did not try them all!).

Reorder floats array-of-structures style

Recall that in our data, we know that water simulation has four floats per “element” (height, velocity x, velocity y, pollution); snow simulation similarly has four floats per element; and other data is either four or three floats per element. Instead of having the data like that (“array of structures” style), we can try to reorder it into “structure of arrays” style. For water simulation, that would be all heights first, then all x velocities, then all y velocities, etc.

So this:

becomes this:

Completely unoptimized code to do that could look like this (and our data is floats, i.e. 4 bytes, so you’d call these templates with a 4-byte type e.g. uint32_t):

// channels: how many items per data element
// dataElems: how many data elements
template<typename T>
static void Split(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		const T* ptr = src + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*dst = *ptr;
			ptr += channels;
			dst += 1;
		}
	}
}
template<typename T>
static void UnSplit(const T* src, T* dst, size_t channels, size_t dataElems)
{
	for (size_t ich = 0; ich < channels; ++ich)
	{
		T* ptr = dst + ich;
		for (size_t ip = 0; ip < dataElems; ++ip)
		{
			*ptr = *src;
			src += 1;
			ptr += channels;
		}
	}
}

Does that help? The results are interesting (click for an interactive chart):

  • It does help LZ4 to achieve a bit higher compression ratios.
  • Makes zstd compress faster, and helps the ratio at lower levels, but hurts the ratio at higher levels.
  • Hurts oodle kraken compression.
  • Hurts the decompression performance quite a bit (for lz4 and kraken, slashes it in half). In all cases the data still decompresses under 0.1 seconds, so acceptable for my case, but the extra pass over memory is not free.

Ok, so this one’s a bit “meh”, but hey, now that the data is grouped together (all heights, then all velocities, …), we could try to exploit the fact that maybe neighboring elements are similar to each other?

Reorder floats + XOR

In the simulation data example, it’s probably expected that usually the height, or velocity, or snow coverage, does not vary “randomly” over the terrain surface. Or in an image, you might have a color gradient that varies smoothly.

“But can’t data compressors already compress that really well?!”

Yes and no. Usually generic data compressors can’t. Most of them are very much oriented at finding repeated sequences of bytes. So if you have a very smoothly varying surface height or image pixel color, e.g. a sequence of bytes 10, 11, 12, 13, 14, 15, well that is not compressible at all! There are no repeating byte sequences.

But, if you transform the sequence using some sort of “difference” between neighboring elements, then repeated byte sequences might start appearing. At first I tried XOR’ing the neighboring elements together (interpreting each float as an uint32_t), since at some point I saw that trick being mentioned in some “time series database” writeups (e.g. Gorilla).

A completely unoptimized code to do that:

template<typename T>
static void EncodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v ^ prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaXor(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev ^ v;
		*data = v;
		prev = v;
		++data;
	}
}

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats), thick line: split floats + XOR):

  • Compression ratio is way better for zstd and lz4 (for kraken, only at lower levels).
  • zstd pretty much reaches kraken compression levels! The lines almost overlap in the graph.
  • Decompression speed takes a bit of a hit, as expected. I might need to do something about it later.

So far we got from 33.8MB (zstd) / 29.6MB (kraken) at beginning of the post down to 28MB (zstd, kraken), while still compressing in under 1 second. Nice, we’re getting somewhere.

Reorder floats + Delta

The “xor neighboring floats” trick from Gorilla database was in the context of then extracting the non-zero sequences of bits from the result and storing that in less space than four bytes. I’m not doing any of that, so how about this: instead of XOR, do a difference (“delta”) between the neighboring elements? Note that delta is done by reinterpreting data as if it were unsigned integers, i.e. these templates are called with uint32_t type (you can’t easily do completely lossless floating point delta math).

template<typename T>
static void EncodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		*data = v - prev;
		prev = v;
		++data;
	}
}
template<typename T>
static void DecodeDeltaDif(T* data, size_t dataElems)
{
	T prev = 0;
	for (size_t i = 0; i < dataElems; ++i)
	{
		T v = *data;
		v = prev + v;
		*data = v;
		prev = v;
		++data;
	}
}
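
For clarity, on the actual float data these get called by reinterpreting the floats as their raw 32-bit patterns; a small usage sketch:

#include <cstddef>
#include <cstdint>

// Usage sketch: losslessly delta-encode float data by operating on the raw bits.
static void DeltaEncodeFloatBits(float* data, size_t dataElems)
{
	static_assert(sizeof(float) == sizeof(uint32_t), "assumes 32-bit floats");
	EncodeDeltaDif<uint32_t>(reinterpret_cast<uint32_t*>(data), dataElems);
}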

And that gives (faint dashed line: raw compression, thin line: previous attempt (split floats + XOR), thick line: split floats + delta):

Now that’s quite an improvement! All three compressors tested get their compression ratio lifted up. Good! Let’s keep on going.

Reorder bytes

Hey, how about instead of splitting each data point into 4-byte-wide “streams”, we split into 1-byte-wide ones? After all, general compression libraries are oriented at finding byte sequences that would be repeating. This is also known as a “shuffle” filter elsewhere (e.g. HDF5). Exactly the same Split and UnSplit functions as above, just with uint8_t type.
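
Concretely, for the 16-bytes-per-element files that just means calling Split with uint8_t and 16 “channels”; a small usage sketch:

#include <cstddef>
#include <cstdint>

// Usage sketch: split 16-byte elements (e.g. 4 floats) byte-wise, i.e. treat
// each of the 16 bytes within an element as its own "channel".
static void SplitByBytes(const float* src, uint8_t* dst, size_t elemCount)
{
	const size_t bytesPerElem = 4 * sizeof(float); // 16
	Split<uint8_t>(reinterpret_cast<const uint8_t*>(src), dst, bytesPerElem, elemCount);
}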

Faint dashed line: raw compression, thin line: previous attempt (split floats + Delta), thick line: split bytes:

  • kraken results are almost the same as with “split floats and delta”. Curious!
  • zstd ratio (and compression speed) is improved a bit.
  • lz4 ratio is improved a lot (it’s beating original unfiltered kraken at this point!).

I’ll declare this a small win, and let’s continue.

Reorder bytes + Delta

Split by bytes as previous, and delta-encode that. Faint dashed line: raw compression, thin line: previous attempt (split bytes), thick line: split bytes + delta:

Holy macaroni grated potato dumplings!

  • Another compression ratio increase. Both zstd and kraken get our data to 23MB in about one second (whereas it was 33.8MB and 29.6MB at the start of the post).
  • zstd actually slightly surpasses kraken at compression ratios in the area (“under 1 sec”) that I care about. 😮
  • lz4 is not too shabby either, being well ahead of unfiltered kraken.
  • Downside: decompression is slightly longer than 0.1 seconds now. Not “terrible”, but I’d want to look into whether all this reordering and delta could be sped up.

Conclusion and what’s next

There’s lots of other data filtering approaches and ideas I could have tried, but for now I’m gonna call “reorder bytes and delta” a pretty good win; it’s extremely simple to implement and gives a massive compression ratio improvement on my data set.

I did actually try a couple of other filtering approaches. Splitting data by bits (using the bitshuffle library) produced worse ratios than splitting by bytes. Rotating each float left by one bit, to make the mantissa & exponent aligned on byte boundaries, was also not an impressive result. Oh well!

Maybe at some point I should also test filters specifically designed for 2D data (like the water and snow simulation data files in my test), e.g. something like PNG Paeth filter or JPEG-LS LOCO-I (aka “ClampedGrad”).

Next up, I’ll look at an open source library that does not advertise itself as a general data compressor, but I’m gonna try it anyway :) Until then!


Float Compression 2: Oodleflate

Introduction and index of this series is here.

In the previous part I looked at generic lossless data compressors (zlib, lz4, zstd, brotli), and was thinking of writing about data filtering next. But then, libdeflate and Oodle entered the chat!

libdeflate

libdeflate is an independent implementation of zlib/deflate compression, that is heavily optimized. I’ve seen it before (EXR blog post), and indeed it is very impressive, considering it produces completely zlib-compatible data, but much faster and at a bit higher compression ratio too.

Here are the results (click for an interactive chart). The compressors from the previous post are faded out; libdeflate addition is dark green:

It is not better than zstd, but definitely way better than zlib (~3x faster compression, ~2x faster decompression, slightly higher ratio). If you need to produce or consume deflate/zlib/gzip formatted data, then absolutely look into libdeflate.

Oodle

Oodle is a set of compressors by Epic Games Tools (née RAD Game Tools). It is not open source or freely available, but the good people there graciously gave me the libraries to play around with.

There are several compressors available, each aiming at different ratio vs. performance tradeoff. It’s easiest to think of them as: “Kraken” is “better zstd” (great ratio, good decompression performance), “Selkie” is “better lz4” (ok ratio, very fast decompression), and “Mermaid” is in the middle of the two, with no clear open source competitor.

Here they are, additions in various shades of purple:

And… yeah. On the compression front, Mermaid leaves everyone else (zstd, brotli) in the dust on the Pareto front (remember: upper right is better). And then Kraken improves on top of that 🤯 Decompression performance is a similar story: everything (except lz4) is behind.

Wizardry! It’s a shame they are not freely available, eh :/

What’s next?

Next up, we’ll really look at some simple data filtering (i.e. fairly similar to EXR post from a while ago). Until then!