Over the past month it seems like Gaussian Splatting (see my first post)
is experiencing a Cambrian Gaussian explosion of new research. The seminal paper
came out in July 2023, and starting about mid-November, it feels like every day there’s a new paper or two coming out,
related to Gaussian Splatting in some way. @MrNeRF and @henrypearce4D maintain an excellent list of all things related to 3DGS,
check out their Awesome 3D Gaussian Splatting Resources.
By no means an exhaustive list, just a random selection of interesting bits:
Ecosystem and tooling
PlayCanvas has released SuperSplat, an online splat editor (source),
have added 3DGS support to the engine v1.67.0 and made their
own splat data compression approach (blog post) that is somewhat based on my blog posts, yay.
LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction reduces gaussian splat data sizes
by pruning and merging splats to reduce their count, makes spherical harmonics data smaller by reducing their order more cleverly than just
“drop some coefficients”, and applies several quantization schemes to final data. This is clever stuff!
Relightable Gaussian Codec Avatars is targeted at face avatars. They place gaussians on a coarse 3D
mesh, and use a learnable radiance transfer with diffuse SH and specular SG. Clever way of plugging gaussian splats into an existing mesh-based
avatar pipeline.
The Unity Gaussian Splatting project that I created with intent of
“eh, lemme try to make a quick toy 3DGS renderer in Unity, and maybe play around with data size reductions”, has somewhat surprisingly
reached 1300+ GitHub stars. Since the previous blog post it got a bunch
of random things:
Support for HDRP and URP rendering pipelines in addition to the built-in one.
Fine grained splat editing tools in form of selection and deletion (short video).
High level splat editing tools in form of ellipsoid and box shaped “cutouts”. @hybridherbst did the
initial implementation, and then shortly afterwards all other
3DGS editing tools got pretty much the same workflow. Nice!
Ability to export modified/edited splats back into a .PLY file.
Faster rendering via more tight oriented screenspace quads, instead of axis-aligned quads.
I made the gaussian splat rendering+editing piece an actual package (OpenUPM page),
and clarified license to be MIT.
(not part of github release, but in latest main branch) More fine grained editing tools (move individual splats), ability to bake
splat transform when exporting .PLY, and multiple splats can be merged together.
The project contains some bits that are not gaussian splat related, but might be useful elsewhere:
A Unity port of AMD FidelityFX GPU radix sort (shader, C#).
C# implementation of K-means clustering
(source), with Burst and multi-threading.
Aaaand with that, I’m thinking that my toying around will end here. I’ve made a toy renderer and integration into Unity,
learned a bunch of random things in the process, it’s time to call it a day and move onto something else. I suspect there will be another
kphjillion gaussian splatting related papers coming out over the next year. Will be interesting to see where all of this ends up at!
Previous post was about making Gaussian Splatting data sizes smaller
(both in-memory and on-disk). This one is still about the same topic! Now we look into clustering / VQ.
Teaser: this scene (garden tools from my own shed) is just 7.5 megabytes of data now. And it represents the metal shading
(anisotropy / brushed metal parts) quite well!
Spherical Harmonics take up a lot of space!
In raw uncompressed Gaussian Splat data, majority of the data is Spherical Harmonics coefficients. If we ignore the
very first SH coefficient (which we treat as “a color”), the rest is 45 floating point numbers for each splat (15 numbers
for R,G,B channels each). For something like the “bike” scene with 6.1 million splats, this is 1.1GB data just for the
SH coefficients alone. And while they can be converted into half-precision (FP16) floats with pretty much
no quality loss at all, or into smaller quantized formats (Norm11 and Norm565 from previous post), that still leaves them
at 350MB and 187MB worth of data. Even the idea that should not actually work – lay them out in a Morton order inside a texture and
compress as GPU BC1 format – does not look entirely terrible, but is still about 46MB of data.
Are Spherical Harmonics even worth having? That’s a good question. Without them, the scenes still look quite good,
but the surfaces lose quite a lot of “shininess”, especially different reflectance when moving the viewpoint. Below are “bike”
and “garden” scenes, rendered with full SH data (left side) vs just color (right side):
How does “reflection” of the vase on the metal part of the table work, you might ask? The gaussians have “learned”
the ages-old trick of duplicating and mirroring the geometry for reflection! Cute!
Anyway, for now let’s assume that we do want this “surface reflectivity looks nicer” effect that is provided by SH data.
Remember palettized images?
Remember how ages ago image files used to have a “color palette” of say 256 or 16 distinct colors, and each pixel would just
say “yeah, that one”, pointing at the index of the color inside the palette. Heck, even whole computer displays were using
palettes because “true color” was too costly at the time.
We can try doing the same thing for our SH data – given several million SH items inside a gaussian splat scene, can we actually
pick “just some amount” of distinct SH values, and have each splat just point to the needed SH item?
Why, yes, we can. I’ve spent a bit of time learning about “vector quantization”,
“clustering” and “k-means” and related jazz, and have played
around with clustering SHs into various amounts (from 1024 up to 65536).
Note that SH data, at 45 numbers per splat, is quite “high dimensional”, and that has various challenges (see
curse of dimensionality). One of them is that clustering millions
of splats into thousands of items, in 45 dimensions, is not exactly fast. Another is that clustering might not produce
good results. ⚠️ I don’t know anything about any of that; it could very well be that I should have done clustering entirely
differently! But hey, whatever :)
Also, I’m very impatient, like if anything takes longer than 10 minutes I go “this is not supposed to be that long”.
I first tried scikit-learn but that was taking ages to cluster SHs into even one thousand
items. Faiss was way faster, taking about 5 minutes to cluster “bike”
scene SHs into 16k items. However, I did not want to add that as a dependency, so I whipped up my own variant
of mini-batch k-means using Burst’ed C# directly inside Unity.
I probably did it all wrong and incorrectly, but it is about 3x faster than even Faiss and seems to provide better
quality, at least for this task, so 🤷
So the process is:
Take all the SH data from the gaussian splat scene,
Cluster that into 4k - 16k distinct SH item “palette”. Store that. I’m storing as FP16 numbers, so that’s 360KB - 1.44MB
data for the palette itself.
For each original SH data point, find which item of the palette it is closest to. Store that index per splat. I’m storing
as 16 bits (even if some of the bits are not used), so for “bike” scene (6.1M splats) this is about 12MB indices.
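To make the "palette plus per-splat index" idea concrete, here is a minimal sketch in plain C#. It is not the mini-batch, Burst-compiled implementation mentioned above: it is ordinary Lloyd-style k-means with a brute-force nearest search, and the names (`ShPalette`, `Cluster`, `Nearest`) as well as the naive initialization are my own assumptions for illustration. It would be far too slow for millions of 45-dimensional items.

```csharp
using System;

static class ShPalette
{
    static float DistanceSq(float[] a, float[] b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Brute-force search for the palette entry closest to one item.
    public static int Nearest(float[][] palette, float[] item)
    {
        int best = 0;
        float bestDist = float.MaxValue;
        for (int k = 0; k < palette.Length; k++)
        {
            float d = DistanceSq(palette[k], item);
            if (d < bestDist) { bestDist = d; best = k; }
        }
        return best;
    }

    // Plain Lloyd k-means. Returns the palette; `indices` receives one 16-bit
    // palette index per item. Assumes 1 <= k <= 65536 and k <= items.Length.
    public static float[][] Cluster(float[][] items, int k, int iterations, out ushort[] indices)
    {
        int dim = items[0].Length;
        var palette = new float[k][];
        // Naive init (an assumption of this sketch): pick k items spread evenly over the input.
        for (int c = 0; c < k; c++)
            palette[c] = (float[])items[(long)c * items.Length / k].Clone();

        indices = new ushort[items.Length];
        for (int iter = 0; iter < iterations; iter++)
        {
            var sums = new double[k][];
            var counts = new int[k];
            for (int c = 0; c < k; c++) sums[c] = new double[dim];

            // Assignment step: accumulate every item into its closest palette entry.
            for (int i = 0; i < items.Length; i++)
            {
                int c = Nearest(palette, items[i]);
                counts[c]++;
                for (int d = 0; d < dim; d++) sums[c][d] += items[i][d];
            }
            // Update step: move each entry to the mean of its items (empty entries stay put).
            for (int c = 0; c < k; c++)
            {
                if (counts[c] == 0) continue;
                for (int d = 0; d < dim; d++)
                    palette[c][d] = (float)(sums[c][d] / counts[c]);
            }
        }
        // Final per-item index against the finished palette.
        for (int i = 0; i < items.Length; i++)
            indices[i] = (ushort)Nearest(palette, items[i]);
        return palette;
    }
}
```

With 45 FP16 numbers per palette entry, a 16k-entry palette is about 16000 × 45 × 2 bytes = 1.44MB, and 6.1M 16-bit indices are about 12MB, which is where the sizes quoted above come from.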
Here’s full SH (left side) vs. SHs clustered into 16k items (right side):
This does retain the “shininess” effect, at expense of ~13MB data for either scene above. And while it does have
some lighting artifacts, they are not terribly bad. So… probably okay?
Aside: the excellent gsplat.tech by Jakub Červený (@jakub_c5y)
seems to also be using some sort of VQ/Clustering for the data. Seriously, check it out, it’s probably the nicest gaussian
splatting thing right now w.r.t. usability – very intuitive camera controls, nicely presented file sizes, and
works on WebGL2. Craftsmanship!
New quality levels
In my toy “gaussian splatting for Unity” implementation, currently
I only do SH clustering at “Low” and “Very Low” quality levels.
Previously, “Low” preset had data sizes of 119MB, 49MB, 113MB; PSNR respectively
34.72, 31.81, 33.05:
Now, the “Low” preset clusters SH into 16k items. Data sizes 98MB, 41MB, 93MB; PSNR respectively
35.17, 35.32, 35.00:
The “Very Low” preset previously was pretty much unusable (data sizes of 74MB, 32MB, 74MB; PSNR
24.02, 22.28, 23.10):
However now the Very Low preset is in “somewhat usable” territory! File sizes are similar; the savings from clustered SH I’ve spent
on other components that were suffering before. SH clustered into 4k items. Data sizes 79MB, 33MB, 75MB; PSNR
32.27, 30.19, 31.10:
| Quality | Pos | Rot | Scl | Col | SH | Compr | PSNR |
|---|---|---|---|---|---|---|---|
| Very High | Norm16x3 | Norm10_2 | Norm16x3 | F16x4 | F16x3 | 2.1x | |
| High | Norm16x3 | Norm10_2 | Norm16x3 | F16x4 | Norm11 | 2.9x | 57.77 |
| Medium | Norm11 | Norm10_2 | Norm11 | Norm8x4 | Norm565 | 5.1x | 47.46 |
| Low | Norm11 | Norm10_2 | Norm565 | Norm8x4 | Cluster16k | 14.9x | 35.17 |
| Very Low | Norm11 | Norm10_2 | Norm565 | BC7 | Cluster4k | 18.4x | 32.27 |
Conclusions and future work
At this point, we can have “bike” and “garden” scenes in under 100MB of data (instead of original 1.4GB PLY file) at fairly acceptable quality.
Not bad!
Of course gaussian splatting at this point is useful for “rotate around a scanned object” use case; it is not useful for “in games”
or many other cases. We don’t know how to re-light them, or how to animate them well, etc. etc. Yet.
I haven’t done any of the “small things I could try” from the end of the previous post
yet. So maybe that’s next? Or maybe look into how to further reduce the splat data on-disk, as opposed to just reducing the memory
representation.
In the previous post I started to look at Gaussian
Splatting. One of the issues with it, is that the data sets are not exactly small. The renders look nice:
But each of the “bike”, “truck”, “garden” data sets is respectively a 1.42GB, 0.59GB, 1.35GB PLY file.
And they are loaded pretty much as-is into GPU memory as giant structured buffers, so at least that much VRAM
is needed too (plus more for sorting, plus in the official viewer implementation the tiled splat rasterizer uses
some-hundreds-of-MB).
I could tell you that I can make the data 19x smaller (78, 32, 74 MB respectively), but then it looks not that great.
Still recognizable, but really not good (however, the artifacts are not your typical “polygonal mesh rendering at low
LOD”, they are more like “JPG artifacts in space”):
However, in between these two extremes there are other configurations, that make the data 5x-10x smaller while looking
quite okay.
So we are starting at 248 bytes for each splat, and we want to get that down. Note: everywhere here I
will be exploring both storage and runtime memory usage, i.e. not “file compression”! Rather, I want to
cut down on GPU memory consumption too. Getting runtime data smaller also makes the data on disk smaller as a
side effect, but “storage size” is a whole other and partially independent topic. Maybe for some other day!
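For reference, here is where the 248 bytes per splat come from, as a small arithmetic sketch. The per-attribute float counts are my reading of the PLY layout (position, unused normal, SH0 color, the remaining 45 SH coefficients, opacity, scale, rotation quaternion); the class and member names are made up for illustration.

```csharp
using System;

static class SplatSize
{
    // Float32 values stored per splat in the original PLY layout (assumed breakdown).
    public const int Position = 3, Normal = 3, ColorSH0 = 3, RestSH = 45,
                     Opacity = 1, Scale = 3, Rotation = 4;

    public static int FloatsPerSplat =>
        Position + Normal + ColorSH0 + RestSH + Opacity + Scale + Rotation;

    // 62 floats * 4 bytes = 248 bytes.
    public static int BytesPerSplat => FloatsPerSplat * sizeof(float);

    // Total size of the raw splat data for a scene with the given splat count.
    public static long SceneBytes(long splatCount) => splatCount * BytesPerSplat;
}
```

For a scene with 6.1 million splats, `SplatSize.SceneBytes(6100000)` gives roughly 1.5 billion bytes, which is in the same ballpark as the PLY file sizes above.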
One obvious and easy thing to do with the splat data, is to notice that the “normal” (12 bytes) is completely unused.
That does not save much though. Then you can of course try making all the numbers be Float16 instead of Float32,
this is acceptably good but only makes the data 2x smaller.
You could also throw away all the spherical harmonics data and leave only the “base color” (i.e. SH0), and that would
cut down 75% of the data size! This does change the lighting and removes some “reflections”, and is more
visible in motion, but progressively dropping SH bands with lower quality levels (or progressively loading
them in) is easy and sensible.
So of course, let’s look at what else we can do :)
Reorder and cut into chunks
The ordering of splats inside the data file does not matter; we are going to sort
them by distance at rendering time anyway. In the PLY data file they are effectively random
(each point here is one splat, and color is gradient based on the point index):
But we could reorder them based on “locality” (or any other criteria). For example, ordering them in a
3D Morton order, generally, makes nearby points in space be near
each other inside the data array:
And then, I can group splats into chunks of N (N=256 was my choice), and hope that since they would generally
be close together, maybe they have lower variance of their data, or at least their data can be somehow represented
in fewer bits. If I visualize the chunk bounding boxes, they are generally small and scattered all over the scene:
Future work: try Hilbert curve ordering instead of Morton. Also try “partially filled chunks” to break up
large chunk bounds, that happen whenever the Morton curve flips to the other side.
By the way, Morton reordering can also make the rendering faster, since even after sorting by distance
the nearby points are more likely to be nearby in the original data array. And of course, nice code to
do Morton calculations without relying on BMI
or similar CPU instructions can be found on
Fabian’s blog, adapted here for 64 bit result case:
    // Based on https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/
    // Insert two 0 bits after each of the 21 low bits of x
    static ulong MortonPart1By2(ulong x)
    {
        x &= 0x1fffff;
        x = (x ^ (x << 32)) & 0x1f00000000ffffUL;
        x = (x ^ (x << 16)) & 0x1f0000ff0000ffUL;
        x = (x ^ (x << 8)) & 0x100f00f00f00f00fUL;
        x = (x ^ (x << 4)) & 0x10c30c30c30c30c3UL;
        x = (x ^ (x << 2)) & 0x1249249249249249UL;
        return x;
    }
    // Encode three 21-bit integers into 3D Morton order
    public static ulong MortonEncode3(uint3 v)
    {
        return (MortonPart1By2(v.z) << 2) | (MortonPart1By2(v.y) << 1) | MortonPart1By2(v.x);
    }
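The Morton encoder above takes 21-bit integers, so float positions need to be quantized first. Here is a minimal sketch of how the whole reordering could look in plain C# without Unity types: quantize each position relative to the scene bounding box, compute the keys, and sort splat indices by key. The names (`MortonOrder`, `SortedOrder`) and the quantization details are assumptions of this sketch, not necessarily what the project does.

```csharp
using System;

static class MortonOrder
{
    // Same bit-spreading helper as in the listing above, repeated so this compiles on its own.
    static ulong Part1By2(ulong x)
    {
        x &= 0x1fffff;
        x = (x ^ (x << 32)) & 0x1f00000000ffffUL;
        x = (x ^ (x << 16)) & 0x1f0000ff0000ffUL;
        x = (x ^ (x << 8)) & 0x100f00f00f00f00fUL;
        x = (x ^ (x << 4)) & 0x10c30c30c30c30c3UL;
        x = (x ^ (x << 2)) & 0x1249249249249249UL;
        return x;
    }

    public static ulong Encode(uint x, uint y, uint z)
    {
        return (Part1By2(z) << 2) | (Part1By2(y) << 1) | Part1By2(x);
    }

    // Map a coordinate inside [min, max] to a 21-bit integer.
    static uint Quantize(float v, float min, float max)
    {
        const float maxCode = (1 << 21) - 1;
        float t = max > min ? (v - min) / (max - min) : 0f;
        t = Math.Min(Math.Max(t, 0f), 1f);
        return (uint)(t * maxCode);
    }

    // positions[i] = { x, y, z }. Returns splat indices ordered along the Morton curve.
    public static int[] SortedOrder(float[][] positions)
    {
        var min = new[] { float.MaxValue, float.MaxValue, float.MaxValue };
        var max = new[] { float.MinValue, float.MinValue, float.MinValue };
        foreach (var p in positions)
            for (int a = 0; a < 3; a++)
            {
                min[a] = Math.Min(min[a], p[a]);
                max[a] = Math.Max(max[a], p[a]);
            }

        var keys = new ulong[positions.Length];
        var order = new int[positions.Length];
        for (int i = 0; i < positions.Length; i++)
        {
            var p = positions[i];
            keys[i] = Encode(Quantize(p[0], min[0], max[0]),
                             Quantize(p[1], min[1], max[1]),
                             Quantize(p[2], min[2], max[2]));
            order[i] = i;
        }
        Array.Sort(keys, order); // sorts keys and applies the same permutation to order
        return order;
    }
}
```

Consecutive runs of 256 entries in the returned order are then the candidate chunks.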
Make all data 0..1 relative to the chunk
Now that all the splats are cut into 256-splat size chunks, we can compute minimum and maximum
data values of everything (positions, scales, colors, SHs etc.) for each chunk, and store that away.
We don’t care about data size of that (yet?); just store them in full floats.
And now, adjust the splat data so that all the numbers are in 0..1 range between chunk minimum & maximum
values. If that is kept in Float32 as it was before, then this does not really change precision in any noticeable
way, just adds a bit of indirection inside the rendering shader (to figure out final splat data, you need to fetch
chunk min & max, and interpolate between those based on splat values).
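A minimal sketch of that per-chunk normalization for a single data channel, plus the matching decode that the shader side would do. The names (`ChunkNormalize`, `Normalize`, `Decode`) are hypothetical, and treating one float channel at a time is a simplification for illustration.

```csharp
using System;

static class ChunkNormalize
{
    public const int ChunkSize = 256;

    // Rewrites `values` in place so each one is 0..1 relative to its chunk,
    // and returns the (min, max) pair of every chunk.
    public static (float min, float max)[] Normalize(float[] values)
    {
        int chunkCount = (values.Length + ChunkSize - 1) / ChunkSize;
        var bounds = new (float min, float max)[chunkCount];
        for (int c = 0; c < chunkCount; c++)
        {
            int start = c * ChunkSize;
            int end = Math.Min(start + ChunkSize, values.Length);
            float min = float.MaxValue, max = float.MinValue;
            for (int i = start; i < end; i++)
            {
                min = Math.Min(min, values[i]);
                max = Math.Max(max, values[i]);
            }
            float range = max - min;
            for (int i = start; i < end; i++)
                values[i] = range > 0f ? (values[i] - min) / range : 0f;
            bounds[c] = (min, max);
        }
        return bounds;
    }

    // The extra indirection at render time: interpolate between chunk min and max.
    public static float Decode(float normalized, (float min, float max) chunk)
    {
        return chunk.min + (chunk.max - chunk.min) * normalized;
    }
}
```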
Oh, and for rotations, I’m encoding the quaternions in
“smallest three”
format (store smallest 3 components, plus index of which component was the largest).
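In case "smallest three" is unfamiliar, here is a rough sketch of the encoding in plain C#. It relies on two facts: a unit quaternion's largest component can be recomputed from the other three, and `q` and `-q` represent the same rotation, so the dropped component can always be made positive. The remaining three components then lie within ±1/√2 and can be remapped to 0..1. The names and the float-based interface are assumptions for illustration; actual bit packing is left out.

```csharp
using System;

static class SmallestThree
{
    const float InvSqrt2 = 0.70710678f;

    // q = { x, y, z, w }, assumed normalized. Returns the three kept components
    // remapped to 0..1, plus the index of the dropped (largest) component.
    public static (float a, float b, float c, int index) Encode(float[] q)
    {
        int index = 0;
        for (int i = 1; i < 4; i++)
            if (Math.Abs(q[i]) > Math.Abs(q[index])) index = i;

        // Flip the whole quaternion if needed so the dropped component is positive.
        float sign = q[index] < 0f ? -1f : 1f;
        var kept = new float[3];
        for (int i = 0, k = 0; i < 4; i++)
            if (i != index)
                kept[k++] = (q[i] * sign / InvSqrt2) * 0.5f + 0.5f; // [-1/sqrt2, 1/sqrt2] -> 0..1
        return (kept[0], kept[1], kept[2], index);
    }

    public static float[] Decode(float a, float b, float c, int index)
    {
        float[] kept = { a, b, c };
        var q = new float[4];
        float sumSq = 0f;
        for (int i = 0, k = 0; i < 4; i++)
        {
            if (i == index) continue;
            q[i] = (kept[k++] * 2f - 1f) * InvSqrt2;
            sumSq += q[i] * q[i];
        }
        // The dropped component follows from unit length.
        q[index] = (float)Math.Sqrt(Math.Max(0f, 1f - sumSq));
        return q;
    }
}
```

The index fits in 2 bits, which pairs naturally with a 10.10.10.2 style layout for the three values.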
And now that the data is all in 0..1 range, we can try representing it with smaller data types than full Float32!
But first, what does all that 0..1 data look like? The following is various data displayed as RGB colors, one
pixel per splat, in row major order. With positions, you can clearly see that it changes within the 256 sized
chunk (it’s two chunks per horizontal line):
Rotations do have some horizontal streaks but are way more random:
Scale has some horizontal patterns too, but we can also see that most of scales are towards smaller values:
Color (SH0) is this:
And opacity is often either almost transparent, or almost opaque:
There’s a lot of spherical harmonics bands and they tend to look like a similar mess, so here’s one of them:
Hey this data looks a lot like textures!
We’ve got 3 or 4 values per each “thing” (position, color, rotation, …) that are all in 0..1 range now.
I know! Let’s put them into textures, one texel per splat. And then we can easily experiment
with using various texture formats on them, and have the GPU texture sampling hardware do all the heavy lifting
of turning the data into numbers.
We could even, I dunno, use something crazy like use compressed texture formats (e.g. BC1 or BC7) on these
textures. Would that work well? Turns out, not immediately. Here’s turning all the data
(position, rotation, scale, color/opacity, SH) into BC7 compressed texture. Data is just 122MB (12x smaller),
but PSNR is a low 21.71 compared to full Float32 data:
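Since PSNR numbers show up a lot from here on, a quick sketch of how such a number can be computed between a reference render and a test render. This assumes both images are given as flat arrays of channel values in 0..1 (so the peak value is 1.0); the exact image setup behind the numbers in this post may differ, and the names are hypothetical.

```csharp
using System;

static class ImageQuality
{
    // Peak signal-to-noise ratio in dB; higher means closer to the reference.
    public static double Psnr(float[] reference, float[] test)
    {
        if (reference.Length != test.Length || reference.Length == 0)
            throw new ArgumentException("images must be non-empty and the same size");

        double sumSq = 0.0;
        for (int i = 0; i < reference.Length; i++)
        {
            double d = reference[i] - test[i];
            sumSq += d * d;
        }
        double mse = sumSq / reference.Length;
        if (mse == 0.0) return double.PositiveInfinity; // identical images
        return 10.0 * Math.Log10(1.0 / mse);            // peak value is 1.0
    }
}
```

Because of the logarithm, every 10 dB corresponds to a 10x lower mean squared error.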
However, we know that GPU texture compression formats are block based, e.g. on typical PC the BCn compression
formats are all based on 4x4 texel blocks. But our texture data is laid out in 256x1 stripes of splat
chunks, one after another. Let’s reorder them some more, i.e. lay out each chunk in a 16x16 texel square, again
arranged in Morton order within it.
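A sketch of that layout: the index of a splat within its chunk (0..255) is de-interleaved into a 4-bit x and a 4-bit y, which gives its position inside the 16x16 tile. Placing the tiles left to right, top to bottom across the texture is an assumption of this sketch, as are all the names.

```csharp
using System;

static class ChunkTile
{
    // Gather the even bits of an 8-bit value into the low 4 bits.
    static uint Compact1By1(uint v)
    {
        v &= 0x55;
        v = (v ^ (v >> 1)) & 0x33;
        v = (v ^ (v >> 2)) & 0x0f;
        return v;
    }

    // 2D Morton decode: position of a splat inside its 16x16 tile.
    public static (uint x, uint y) IndexToTexel(uint indexInChunk)
    {
        return (Compact1By1(indexInChunk), Compact1By1(indexInChunk >> 1));
    }

    // Full texel coordinate, assuming tiles fill the texture row by row
    // and textureWidth is a multiple of 16.
    public static (uint x, uint y) SplatToTexel(uint splatIndex, uint textureWidth)
    {
        uint chunk = splatIndex / 256;
        uint chunksPerRow = textureWidth / 16;
        var (lx, ly) = IndexToTexel(splatIndex % 256);
        return ((chunk % chunksPerRow) * 16 + lx, (chunk / chunksPerRow) * 16 + ly);
    }
}
```

A nice property of this order is that splats 0..15 of a chunk land in one 4x4 texel block, splats 16..31 in the next, and so on, so each BCn block only ever sees 16 splats that are consecutive in the Morton-sorted array.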
And if we rearrange all the texture data that way, then it looks like this now (position, rotation, scale, color,
opacity, SH1):
And encoding all that into BC7 improves the quality quite a bit (PSNR 21.71→24.18):
So what texture formats should be used?
After playing around with a whole bunch of possible settings, here’s the quality setting levels I came up with.
Formats indicated in the table below:
F32x4: 4x Float32 (128 bits). Since GPUs typically do not have a three-channel Float32 texture format,
I expand the data quite uselessly in this case, when only three components are needed.
F16x4: 4x Float16 (64 bits). Similar expansion to 4 components as above.
Norm10_2: unsigned normalized 10.10.10.2 (32 bits). GPUs do support this, and Unity almost supports
it – it exposes the format enum member, but actually does not allow you to create texture with said format (lol!).
So I emulate it by pretending the texture is in a single component Float32 format, and manually “unpack”
in the shader.
Norm11: unsigned normalized 11.10.11 (32 bits). GPUs do not have it, but since I’m emulating a similar format
anyway (see above), then why not use more bits when we only need three components.
Norm8x4: 4x unsigned normalized byte (32 bits).
Norm565: unsigned normalized 5.6.5 (16 bits).
BC7 and BC1: obvious, 8 and 4 bits per texel respectively.
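For the two emulated formats (Norm10_2 and Norm11), the packing could look roughly like this on the CPU side; the shader does the reverse when unpacking. The bit order (first component in the lowest bits) and the names are assumptions of this sketch, not necessarily what the project uses.

```csharp
using System;

static class NormPacking
{
    // Quantize a 0..1 value to an unsigned integer with the given bit count.
    static uint ToUnorm(float v, int bits)
    {
        uint maxValue = (1u << bits) - 1u;
        float clamped = Math.Min(Math.Max(v, 0f), 1f);
        return (uint)(clamped * maxValue + 0.5f);
    }

    static float FromUnorm(uint v, int bits)
    {
        return v / (float)((1u << bits) - 1u);
    }

    // 10.10.10.2 layout: three 10-bit values plus a 2-bit value.
    public static uint PackNorm10_2(float x, float y, float z, float w)
    {
        return ToUnorm(x, 10) | (ToUnorm(y, 10) << 10) | (ToUnorm(z, 10) << 20) | (ToUnorm(w, 2) << 30);
    }

    public static (float x, float y, float z, float w) UnpackNorm10_2(uint p)
    {
        return (FromUnorm(p & 1023u, 10), FromUnorm((p >> 10) & 1023u, 10),
                FromUnorm((p >> 20) & 1023u, 10), FromUnorm(p >> 30, 2));
    }

    // 11.10.11 layout: all 32 bits go to three components.
    public static uint PackNorm11(float x, float y, float z)
    {
        return ToUnorm(x, 11) | (ToUnorm(y, 10) << 11) | (ToUnorm(z, 11) << 21);
    }

    public static (float x, float y, float z) UnpackNorm11(uint p)
    {
        return (FromUnorm(p & 2047u, 11), FromUnorm((p >> 11) & 1023u, 10), FromUnorm(p >> 21, 11));
    }
}
```

The 2-bit slot of Norm10_2 is a natural place for the "which component was dropped" index of a smallest-three quaternion.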
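| Quality | Pos | Rot | Scl | Col | SH | Compr | PSNR |
|---|---|---|---|---|---|---|---|
| Very High | F32x4 | F32x4 | F32x4 | F32x4 | F32x4 | 0.8x | |
| High | F16x4 | Norm10_2 | Norm11 | F16x4 | Norm11 | 2.9x | 54.82 |
| Medium | Norm11 | Norm10_2 | Norm11 | Norm8x4 | Norm565 | 5.2x | 47.82 |
| Low | Norm11 | Norm10_2 | Norm565 | BC7 | BC1 | 12.2x | 34.79 |
| Very Low | BC7 | BC7 | BC7 | BC7 | BC1 | 18.7x | 24.02 |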
Here are the “reference” (“Very High”) images again (1.42GB, 0.59GB, 1.35GB data size):
At “Low” preset the color artifacts are more visible but not terribad (119MB, 49MB, 113MB – 12.2x smaller; PSNR respectively
34.72, 31.81, 33.05):
And the “Very Low” one mostly for reference; it kinda becomes useless at such low quality (74MB, 32MB, 74MB – 18.7x smaller;
PSNR 24.02, 22.28, 23.1):
Oh, and I also recorded an awkwardly-moving-camera video, since people like moving pictures:
Conclusions and future work
The gaussian splatting data size (both on-disk and in-memory) can be fairly easily cut down 5x-12x, at fairly acceptable rendering
quality level. Say, for that “garden” scene 1.35GB data file is “eek, sounds a bit excessive”, but at 110-260MB it’s becoming more
interesting. Definitely not small yet, but way more within being usable.
I think the idea of arranging the splat data “somehow”, and then compressing them not by just individually encoding each splat into
smaller amount of bits, but also “within neighbors” (like using BC7 or BC1), is interesting. Spherical Harmonics data in particular
looks quite ok even with BC1 compression (it helps that unlike “obviously wrong” rotation or scale, it’s much harder to tell when
your spherical harmonics coefficient is wrong :)).
There’s a bunch of small things I could try:
Splat reordering: reorder splats not only based on position, but also based on “something else”. Try Hilbert curve instead of Morton. Try using not-fully-256 size chunks whenever the curve flips to the other side.
Color/Opacity encoding: maybe it’s worth putting that into two separate textures, instead of trying to get BC7 to compress them both.
I do wonder how reducing the texture resolution would work, maybe for some components (spherical harmonics? color if opacity is separate?)
you could use lower resolution texture, i.e. below 1 texel per splat.
And then of course there are larger questions, in a sense of whether this way of looking at reducing data size is sensible at all. Maybe
something along the lines of
“Random-Access Neural Compression of Material Textures”
(Vaidyanathan, Salvi, Wronski 2023) would work? If only I knew anything about this “neural/ML” thing :)
All my code for the above is in this PR on github (merged 2023 Sep).
In the followup post I look at making them even smaller!
SIGGRAPH 2023 just had a paper “3D Gaussian Splatting for Real-Time Radiance Field Rendering”
by Kerbl, Kopanas, Leimkühler, Drettakis, and it looks
pretty cool! Check out their website, source code repository, data sets and so on (I should note that
it is really, really good to see full source and full data sets being released. Way to go!).
I’ve decided to try to implement the realtime visualization part (i.e. the one that takes
already-produced gaussian splat “model” file) in Unity. As well as maybe play around with looking at
whether the data sizes could be made smaller (maybe use some of the learnings from
float compression series too?).
What’s a few million badly rendered boxes among friends, anyway?
For the impatient: I got something working over at aras-p/UnityGaussianSplatting, and will tinker with things there some more. Since this post, I have written several follow-ups.
I have seen quite many 3rd party explanations of the concept at this point, and some of them, uhh, get a thing or
two wrong about it :)
This is not a NeRF (Neural Radiance Field)! There is absolutely nothing “neural” about it.
It is not somehow “fast, because it uses GPU rasterization hardware”. The official implementation does not
use the rasterization pipeline at all; it is 100% done with CUDA. In fact, it is fast because it does not use
the fixed function rasterization, as we’ll see below.
Anyway,
Gaussian Splats are, basically, “a bunch of blobs in space”. Instead of representing a 3D scene as
polygonal meshes, or voxels, or distance fields, it represents it as (millions of) particles:
Each particle (“a 3D Gaussian”) has position, rotation and a non-uniform scale in 3D space.
Each particle also has an opacity, as well as color (actually, not a single color, but rather
3rd order Spherical Harmonics coefficients - meaning “the color” can change depending on the view direction).
For rendering, the particles are rendered (“splatted”) as 2D Gaussians in screen space, i.e. they
are not rendered as scaled elongated spheres, actually! More on this below.
And that’s it. The “Gaussian Splatting” scene representation is just that, a whole bunch of scaled and colored
blobs in space. The genius part of the paper is several things:
They found a way how to create these millions of blobs that would represent a scene depicted by a bunch of
photos. This is using gradient descent and “differentiable rendering” and all the other things that are way
over my head. This feels like the major contribution, like maybe previously people assumed that in order
for gradient descent optimizer to work nicely, you need to use a continuous or connected scene representation (vs
“just a bunch of blobs”), and this paper proved that wrong? Anyway, I don’t understand this area, so I won’t talk
about it more :)
They have developed a fast way to render all these millions of scaled particles. This by itself is not particularly
ground breaking IMHO, various people have noticed that using something like a tile-based “software, but on GPU”
rasterizer is a good way to do this.
They have combined existing established approaches (like gaussian splatting, and spherical harmonics)
in a nice way.
And finally, they have resisted the temptation to do “neural” anything ;)
Previous Building Blocks
Gaussian Splatting seems to have been invented around 2001-2002, see for example “EWA Splatting” paper by Zwicker, Pfister, Van Baar, Gross.
There they have scaled and oriented “blobs” in space, calculate how they would project onto screen, and then
do the actual “blob shape” (a “Gaussian”) in 2D, in screen-space. A bunch of signal processing, sampling, aliasing etc.
math presumably supports doing it that way.
Speaking of ellipsoids, Ecstatica game from 1994 had a fairly
unique ellipsoid-based renderer.
Spherical Harmonics (a way to represent a function over a surface of a sphere) have been around
for several hundred years in physics, but really were popularized in computer graphics around 2000
by Ravi Ramamoorthi and Peter-Pike Sloan. But actually, a 1984 “Ray tracing volume densities” paper by Kajiya & Von Herzen might be the first use of them in graphics.
A nice summary of various things related to SH is at Patapom’s page.
Point-Based Rendering in various forms has been around for a long time, e.g. particle systems were used since
“forever” (but typically used for vfx / non-solid phenomena).
“The Use of Points as a Display Primitive” is from 1985.
“Surfels” paper is from 2000.
Real-time VFX tools like Notch have pretty extensive features for creating, simulating
and displaying point/blob based “things”.
Ideas of representing images or scenes with a bunch of “primitive shapes”, as well as tools to generate those, have
been around too. E.g. fogleman/primitive (2016) is nice.
Media Molecule “Dreams” has a splat-based renderer (I think the shipped version is not purely splat-based
but a combination of several techniques). Check out the most excellent “Learning from Failure” talk by Alex Evans:
at SIGGRAPH 2015 (splats start at slide 109)
or video from Umbra Ignite 2015 (splats start at 22:34).
Tiled Rasterization for particles has been around at least since 2014 (“Holy smoke! Faster Particle Rendering
using Direct Compute” by Gareth Thomas). And the idea
that dividing screen into tiles, doing a bunch of things “inside the tile” thus cutting on memory traffic,
is how entire mobile GPU space operates, and has been operating since “forever”, tracing back to first PowerVR
designs (1996) and even Pixel Planes 5 from 1989.
This is all great! Taking existing, developed, solid building blocks and combining them in a novel way is excellent
work.
My Toy Playground
My current implementation (of just the visualizer of Gaussian Splat models) for Unity is over at github:
aras-p/UnityGaussianSplatting. Current state is “it kinda
works, but it is not fast”:
The rendering does not look horrible, but does not exactly match official implementation. Here is official vs
my rendering of the same scene. Official one has more small detail, and lighting is slightly different. Fixed!
Performance is not great. The scene above renders on NVIDIA RTX 3080 Ti at 1200x800 in 7.40ms (135FPS) in
the official viewer, whereas my attempt is 23.8ms (42FPS) currently, i.e. over 3x slower. For sorting
I’m using some fairly simple GPU bitonic sort (official impl uses CUDA radix sort which is
based on OneSweep algorithm). Rasterization in their case is tile-based and written in CUDA,
whereas I’m “just” using regular GPU rasterization pipeline and rendering each splat as a screenspace quad.
On the plus side, my code is all regular HLSL within Unity, which means it also happens to work on e.g.
Mac just fine. The scene above on Apple M1 Max renders in 108ms (9FPS) though :/
My implementation seems to use 2x less GPU memory right now too (official viewer: 4.8GB, mine: 2.2GB and that’s
including whatever Unity editor takes).
So all of that could be improved and optimized quite a bit!
One thing I haven’t seen talked much about, by everyone super excited
about Gaussian Splats, is data size and memory usage. Yeah, rendering is nice, but this bicycle scene above is
1.5GB of data on-disk, and then at runtime it needs some more (for sorting, tile based rendering etc.).
That scene is six million blobs in space, with each of them taking about 250 bytes. There has to be some
way to make that smaller! Actually the Dreams talk above has some neat ideas.
Some people asked whether I have tested LZSSE or Lizard.
I have not! But I have been aware of them for years. So here’s a short post, testing them on “my” data set. Note that at least currently
both of these compressors do not seem to be actively developed or updated.
LZSSE and Lizard, without data filtering
Here they are on Windows (VS2022, Ryzen 5950X). Also included Zstd and LZ4 for comparison, as faint dashed lines:
For LZSSE I have tested the LZSSE8 variant, since that’s what the readme recommends for general use.
“Zero” compression level here is the “fast” compressor; other levels are the “optimal” compressor. Compression levels beyond 5 seem
to not buy much ratio, but get much slower to compress. On this machine, on this data set, it does not look competitive -
compression ratio is very similar to LZ4; decompression a bit slower, compression a lot slower.
For Lizard (née LZ5), it really is like four different compression algorithms in there
(fastLZ4, LIZv1, fastLZ4 + Huffman, LIZv1 + Huffman). I have not tested the Huffman variants since they can not co-exist with Zstd
in the same build easily (symbol redefinitions). The fastLZ4 is shown as lizard1x here, and LIZv1 is shown as lizard2x.
lizard1x (i.e. Lizard compression levels 10..19) seems to be pretty much the same as LZ4. Maybe it was faster than LZ4 back in
2019, but since then LZ4 gained some performance improvements?
lizard2x is interesting - better compression ratio than LZ4, a bit slower decompression speed. In the middle between Zstd and LZ4
when it comes to decompression parameter space.
What about Mac?
The above charts are on x64 architecture, and Visual Studio compiler. How about a Mac (with a Clang compiler)? But first, we need
to get LZSSE working there, since it is very much written with raw SSE4.1 intrinsics and no fallback or other platform paths.
Luckily, just dropping a sse2neon.h into the project and doing a
tiny change in LZSSE source make it just work on an Apple M1 platform.
With that out of the way, here’s the chart on Apple M1 Max with Clang 14:
Here lzsse8 and lizard1x do get ahead of LZ4 in terms of decompression performance. lizard1x is about 40% faster than LZ4 at
decompression at the same compression ratio. LZSSE is “a bit” faster (but compression performance is still a lot slower than LZ4).
LZSSE and Lizard, with data filtering and chunking
If there’s anything we’ve learned so far in this whole series, it is that “filtering” the data before compression can increase the
compression ratio a lot (which in turn can speed up both compression and decompression due to data being easier or smaller). So let’s do
that!
Windows case, all compressors with “split bytes, delta” filter from part 7,
and each 1MB block is compressed independently (see part 8):
Well, neither LZSSE nor Lizard are very good here – LZ4 with filtering is faster than either of them, with a slightly better compression ratio
too. If you’d want higher compression ratio, you’d reach for filtered Zstd.
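As a reminder of what the “split bytes, delta” filter mentioned above does, here is a minimal sketch: the data is treated as items of `stride` bytes each, the first byte of every item goes into one stream, the second byte into the next, and so on, and each stream stores differences between consecutive bytes. Restarting the delta at zero for every stream is an assumption of this sketch, and the names are hypothetical; the filter used for the charts may differ in such details.

```csharp
using System;

static class ByteFilter
{
    // Split `data` (items of `stride` bytes) into per-byte streams, delta-encoding each stream.
    public static byte[] Filter(byte[] data, int stride)
    {
        if (stride <= 0 || data.Length % stride != 0)
            throw new ArgumentException("data length must be a multiple of stride");
        int count = data.Length / stride;
        var result = new byte[data.Length];
        int dst = 0;
        for (int s = 0; s < stride; s++)
        {
            byte prev = 0;
            for (int i = 0; i < count; i++)
            {
                byte v = data[i * stride + s];
                result[dst++] = (byte)(v - prev); // difference wraps around modulo 256
                prev = v;
            }
        }
        return result;
    }

    // Exact inverse of Filter.
    public static byte[] Unfilter(byte[] filtered, int stride)
    {
        if (stride <= 0 || filtered.Length % stride != 0)
            throw new ArgumentException("data length must be a multiple of stride");
        int count = filtered.Length / stride;
        var result = new byte[filtered.Length];
        int src = 0;
        for (int s = 0; s < stride; s++)
        {
            byte prev = 0;
            for (int i = 0; i < count; i++)
            {
                byte v = (byte)(prev + filtered[src++]);
                result[i * stride + s] = v;
                prev = v;
            }
        }
        return result;
    }
}
```

The filtered bytes are what gets handed to the compressor, one 1MB block at a time; after decompression, `Unfilter` restores the original layout.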
On a Mac things are a bit more interesting for lzsse8 case; it can get ahead of filtered LZ4 decompression performance at expense of some
compression ratio loss:
I have also tested on Windows (same Ryzen 5950X) but using Clang 15 compiler. Neither LZSSE nor Lizard are on the Pareto frontier here:
Conclusions
On my data set, neither LZSSE nor Lizard are very competitive against (filtered or unfiltered) LZ4 or Zstd. They might have been several
years ago when they were developed, but since then both LZ4 and Zstd got several speedup optimizations.
Lizard levels 10-19, without any data filtering, do get ahead of LZ4 in decompression performance, but only on Apple M1.
LZSSE is “basically LZ4” in terms of decompression performance, but the compressor is much slower (fair, the project says as much in the readme).
Curiously enough, where LZSSE gets ahead of LZ4 is on an Apple M1, a platform it is not even supposed to work on out of the box :)
Maybe next time I’ll finally look at lossy floating point compression. Who knows!