Two years of Blender VSE

So, Blender 5.0 has shipped while I was away at the excellent Graphics Programming Conference, and while all that was happening I realized it has been two years since I started working mostly on the Blender Video Sequence Editor (VSE). Perhaps not surprisingly, a year ago it was one year of that :)

Just like two years ago when I started, I am still mostly flailing my arms around, without realizing what I’m actually doing.

The good

It feels like the VSE has recently gotten quite a few improvements across workflow, user experience and performance. The first release I contributed anything to was Blender 4.1, and look what has happened since then (pasting screenshots of the release overview pages):

4.1 (full notes):

4.2 (full notes):

4.3 (full notes):

4.4 (full notes):

4.5 (full notes):

5.0 (full notes):

In addition to user-facing features and optimizations, there have also been quite a lot of code cleanups; too many to list individually, but for a taste you could look at last year's "winter of quality" task list (#130975) or the WIP list for the upcoming "winter of quality" (#149160).

All of this was done by 3-4 people, all of them working on VSE part time. That’s not too bad! I seem to have landed about 200 pull requests in these two years. Also not terrible!

For the upcoming year, we want to tackle three large items: 1) more compositor node-based things (modifiers, effects, transitions), including better performance for them, 2) hardware acceleration for video decoding/encoding, 3) workflows like media bins, media preview, three point editing. That, and more "wishlist" type items, is detailed in this devtalk thread.

If you tried the Blender video editor a long time ago and were not impressed, I suggest you try it again! You might still not be impressed, but then you would have learned not to trust anything I say :P

The bad

It can't all be good; some terrible things have happened in Blender VSE land too. For one, I have become the "module owner" (i.e. "a lead") of the VSE related work. Uh-oh!

The wishlist

From the current "things we'd want to work on", the obvious missing part is everything related to audio – the VSE has some audio functionality, but nowhere near enough for a proper video editing toolbox. And out of the "just, like, three" part-time people currently working on the VSE, no one is doing audio besides maintenance.

More community contributions in that area would be good. If you want to contribute, check out new developer documentation and #module-sequencer on the developer chat.


OpenEXR vs tinyexr

tinyexr is an excellent, simple library for loading and saving OpenEXR files. It has one big advantage: it is very simple to start using – just one source file to compile and include! However, it also has some downsides: not all features of OpenEXR are supported (for example, it can't do the PXR24, B44/B44A, DWAA/DWAB or HTJ2K compression modes), and performance might be behind the official library. It probably can't do some of the more exotic EXR features either (e.g. "deep" images), but I'll ignore those for now.
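
To illustrate the "simple to start using" part, loading an image is basically one call (a minimal sketch, assuming TINYEXR_IMPLEMENTATION is defined in exactly one of your source files, stb-style; error handling kept to the bare minimum):

// Minimal tinyexr usage sketch: load an EXR into an interleaved RGBA float buffer.
#include "tinyexr.h"
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float* rgba = NULL;
    int width = 0, height = 0;
    const char* err = NULL;
    int ret = LoadEXR(&rgba, &width, &height, "input.exr", &err);
    if (ret != TINYEXR_SUCCESS) {
        fprintf(stderr, "EXR load failed: %s\n", err ? err : "unknown error");
        if (err) FreeEXRErrorMessage(err);
        return 1;
    }
    printf("Loaded %dx%d image\n", width, height);
    free(rgba); // tinyexr allocates the pixel buffer with malloc
    return 0;
}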

But how large and how complex to use is the “official” OpenEXR library, anyways?

I do remember that a decade ago it was quite painful to build it, especially on anything that is not Linux. However these days (2025), that seems to be much simpler: it uses a CMake build system, and either directly vendors or automatically fetches whatever dependencies it needs, unless you really ask it to “please don’t do this”.

It is not exactly a "one source file" library though. However, I noticed that OpenUSD vendors the OpenEXR "Core" library, builds it as a single C source file, and uses their own "nanoexr" wrapper around the API; see pxr/imaging/plugin/hioOpenEXR/OpenEXR. So I took that and adapted it to more recent OpenEXR versions (theirs uses 3.2.x, I updated to 3.4.4).

So I wrote a tiny app (github repo) that reads an EXR file and writes it back as a downsampled EXR (so it exercises both the reading and the writing parts of an EXR library). Then I compared binary sizes between tinyexr and OpenEXR, as well as their respective source code sizes and performance.

The actual process was:

  • Take OpenEXR source repository (v3.4.4, 2025 Nov),
    • Take only the src/lib/OpenEXRCore and external/deflate folders from it.
    • openexr_config.h, compression.c, internal_ht.cpp have local changes! Look for LOCAL CHANGE comments.
  • Take OpenJPH source code, used 0.25.3 (2025 Nov), put under external/OpenJPH.
  • Take openexr-c.c, openexr-c.h, OpenEXRCoreUnity.h from the OpenUSD repository. They were for OpenEXR v3.2, and needed some adaptations for later versions. OpenJPH part can’t be compiled as C, nor compiled as “single file”, so just include these source files into the build separately.
  • Take tinyexr source repository (v1.0.12, 2025 Mar).

Results

Library          | Binary size, KB | Source size, KB | Read+write time, s | Notes
tinyexr 1.0.12   | 251             | 726             | 6.55               |
OpenEXR 3.2.4    | 2221            | 8556            | 2.19               |
OpenEXR 3.3.5    | 826             | 3831            | 1.68               | Removed giant DWAA/DWAB lookup tables.
OpenEXR 3.4.3    | 1149            | 5373            | 1.68               | Added HTJ2K compression (via OpenJPH).
OpenEXR 3.4.4    | 649             | 3216            | 1.65               | Removed more B44/DWA lookup tables.
  + no HTJ2K     | 370             | 1716            |                    | Above, with HTJ2K/OpenJPH compiled out.
  + no DWA       | 318             |                 |                    | Above, and with DWAA/DWAB compiled out.
  + no B44       | 305             |                 |                    | Above, and with B44/B44A compiled out.
  + no PXR24     | 303             |                 |                    | Above, and with PXR24 compiled out.

Notes:

  • Machine is Ryzen 5950X, Windows 10, compiler Visual Studio 2022 (17.14), Release build.
  • This compares both tinyexr and OpenEXR in fully single-threaded mode. Tinyexr has threading capabilities, but it spins up and shuts down a whole thread pool for each processed image, which is a bit “meh”; and while OpenEXRCore can be threaded (and using full high level OpenEXR library does use it that way), the “nanoexr” wrapper I took from USD codebase does not do any threading.
  • Timing is total time taken to read, downsample (by 2x) and write back 6 EXR files, input resolution 3840x2160, input files are ZIP FP16, ZIP FP32, ZIP w/ mips, ZIP tiled, PIZ and RLE compressed; output is ZIP compressed.

That’s it!


This many points is surely out of scope!

This is about an update to Blender video editing Scopes (waveform, vectorscope, etc.), and a detour into rendering many points on a GPU.

Making scopes more ready for HDR

Current Blender Studio production, Singularity, needed improvements to video editing visualizations, particularly in the HDR area. Visualizations that Blender can do are: histogram, waveform, RGB parade, vectorscope, and “show overexposed” (“zebra stripes”) overlay. Some of them were not handling HDR content in a useful way, e.g. histogram and waveform were clamping colors above “white” (1.0) and not displaying their actual value distribution.

So I started to look into that, and one of the issues, particularly with waveform, was that it gets calculated on the CPU, by putting the waveform into a width x 256 size bitmap.

This is what a waveform visualization does: each column displays the pixel luminance distribution of that column of the input image. For low dynamic range (8 bit/channel) content, you trivially know that 256 possible vertical values are all that is needed. But how tall should the waveform image be for HDR content? You could guesstimate things like "the waveform displays +4 extra stops of exposure" and make a 4x taller bitmap.
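
Roughly, the CPU path boils down to something like this (a simplified sketch, not the actual Blender code; the Rec.709 luma weights and the bitmap layout are assumptions for illustration):

// Simplified CPU waveform sketch: one output column per input column,
// 256 vertical luminance bins. Not the actual Blender implementation.
void waveform_cpu(const float* rgba, int width, int height,
                  unsigned char* bitmap /* width * 256 entries */)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            const float* p = rgba + (y * width + x) * 4;
            float lum = 0.2126f * p[0] + 0.7152f * p[1] + 0.0722f * p[2];
            int bin = (int)(lum * 255.0f);
            if (bin < 0) bin = 0;
            if (bin > 255) bin = 255; // HDR values all clamp into the top bin -- the problem above
            unsigned char& cell = bitmap[bin * width + x];
            if (cell < 255) cell++; // count how many pixels of this column hit this level
        }
    }
}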

Or you could…

…move Scopes to the GPU

I thought that doing the calculations needed for the waveform & vectorscope visualizations on the CPU, and then sending that bitmap to the GPU for display, sounds a bit silly. And at something like 4K resolution it is not very fast either! So why not just do it on the GPU?

The process would be:

  • GPU already gets the image it needs to display anyway,
  • Drawing a scope would be rendering a point sprite for each input pixel: sample the image based on the sprite ID in the vertex shader, and position it on screen accordingly. The waveform puts it at the original coordinate horizontally and at the color luminance vertically; the vectorscope places it based on the color's YUV U,V values (see the sketch after this list).
  • The points need to use blending in “some way”, so that you can see how many points hit the same luminance level, etc.
  • The points might need to be larger than a pixel, if you zoom in.
  • The points might need to be “smaller than a pixel” if you zoom out, possibly by fading away their blending contribution.
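
Here is roughly how such a vertex stage could place the points (an illustrative sketch, not the actual Blender shader; the luma weights and the chroma scaling are assumptions):

// Illustrative sketch of point placement for the two scopes, in plain C++.
// In the real thing this runs in a vertex shader, indexed by the point sprite ID.
struct Point2 { float x, y; };

// Waveform: x = original column (normalized), y = pixel luminance.
Point2 waveform_point(int pixel_index, int width, float r, float g, float b)
{
    int col = pixel_index % width;
    float lum = 0.2126f * r + 0.7152f * g + 0.0722f * b; // Rec.709 luma
    return { (col + 0.5f) / width, lum };
}

// Vectorscope: position by signed chroma, so unsaturated pixels land near the center.
Point2 vectorscope_point(float r, float g, float b)
{
    float lum = 0.2126f * r + 0.7152f * g + 0.0722f * b;
    float u = (b - lum) * 0.5f; // chroma axes; exact scaling is an assumption here
    float v = (r - lum) * 0.5f;
    return { 0.5f + u, 0.5f + v };
}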

So I did all that, and it was easy enough. Performance on my RTX 3080 Ti was also much better than with the CPU based scopes. Since rendering alpha blended points makes it easy to have them colored, I also made each point retain a bit of the original image pixel's hue:

Yay, done! …and then I tested them on my Mac, just to double check that everything works. It does! But the new scopes now play back at like 2 frames per second 🤯 Uhh, what is going on? Why?!

I mean, sure, at 4K resolution a full scope now renders 8 million points. But come on, that is on an M4 Max GPU; it should easily be able to do hundreds of millions of primitives in realtime!

Rendering points on a GPU

Turns out, the problematic performance was mostly in the vectorscope visualization. Recall that a vectorscope places points based on their signed U,V values (from the YUV color model), which means it places a lot of points very near the center, since usually most pixels are not very saturated. A vectorscope of a grayscale image would be all the points right in the middle!

And it turns out, GPUs are not entirely happy when many (tens of thousands or more) points are rendered at the same location with alpha blending on. And Apple GPUs are extremely unhappy about this. "Way too many" things in the same tile are likely to overflow some sort of tile capacity buffers (on tile-based GPUs), and blending "way too many" fragments at the same location probably runs into a bottleneck due to the fixed capacity of the blending / ROP backend queues (see "A trip through the Graphics Pipeline 2011, part 9").

Rendering single-pixel points is not terribly efficient on any GPU, of course. GPUs rasterize everything in 2x2 pixel “quads”, so each single pixel point is at least 4 pixel shader executions, with three of them thrown out (see “Counting Quads” or “A trip through the Graphics Pipeline 2011, part 8”).

Could I rasterize the points in a compute shader instead? Would that be faster?

Previous research ("Rendering Point Clouds with Compute Shaders", related code) as well as “compute based rendering” approaches like Media Molecule Dreams or Unreal Nanite suggest that it might be worth a shot.

It was time to do some 🔬📊SCIENCE📊🔬: make a tiny WebGPU test that covers various point rendering scenarios, and try it out on a bunch of GPUs. And I did exactly that: webgpu-point-raster.html renders millions of single pixel points into anything from a "regular" (500x500-ish) area down to a "very small" (5x5 pixel) area, with alpha blending, using either the built-in GPU point rendering or a compute shader.

A bunch of people on the interwebs tested it out and I got results from 30+ GPU models, spanning all sorts of GPU architectures and performance levels. Here is how much time each GPU takes to render 4 million single-pixel points into a roughly 460x460 pixel area (so about 20 points hitting each pixel). The second chart shows how many times point rasterization becomes slower if the same amount of points gets blended into a 5x5 pixel area (160 thousand points per pixel).

From the second chart we can see that even though conceptually the GPU does the same amount of work – the same number of points doing the same type of animation and blending, with the 2x2 quad overshading affecting both scenarios equally – all the GPUs render slower when the points hit a much smaller screen area. Everyone is slower by 2-5 times, and then there are Apple Mac GPUs that are 12-19 times slower. Also, curiously enough, even within the same GPU vendor, it looks like the "high-end" GPUs experience a relatively larger slowdown.

My guess is that this shows the effect of blending units having a limited-size "queue", plus the fact that blending needs to happen serially and in order (again, see part 9 mentioned above). Why Apple GPUs are affected way more than anyone else… I don't know exactly. Maybe because they do not have fixed function blending hardware at all (instead the shader reads the current pixel value and does blending by modifying it), so in order to maintain the correct blending order, the whole pixel execution needs to sit in some sort of "queue"? Curiously, Apple's own performance tools (Metal frame capture in Xcode) do not say anything useful for this case, except "your fragment shader takes forever!". That is not entirely incorrect, but it would be useful if it said "it is not the part of your code that is slow, it is the blending".

Let’s do some compute shader point rendering!

The compute shader is a trivially naïve approach: have per-pixel R,G,B uint buffers; each point does an atomic add of its fixed-point color; finally a regular fragment shader resolves these buffers into visible colors. It is a "baby's first compute" type of approach really, without any tricks like using wave/subgroup operations to detect a whole wavefront hitting the same pixel, distributing points into tiles + prefix sum + rasterizing points inside tiles, or trying to pack the color buffers into something more compact. None of that, so I was not expecting the compute shader approach to be much better.
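
Written out as CPU-side C++ for clarity, the accumulate-and-resolve idea is roughly this (a sketch of the approach, not the actual WebGPU test code; the fixed-point scale is an arbitrary choice):

// Sketch of the naive approach. The real version is a compute shader doing
// atomicAdd into R/G/B storage buffers, plus a fragment shader resolve pass.
#include <atomic>
#include <cstddef>
#include <cstdint>

static const float kFixedPointScale = 256.0f; // arbitrary fixed-point scale

// "Compute shader" part: each point atomically accumulates its fixed-point color.
void splat_point(std::atomic<uint32_t>* r, std::atomic<uint32_t>* g, std::atomic<uint32_t>* b,
                 int px, int py, int width, float cr, float cg, float cb, float alpha)
{
    size_t idx = size_t(py) * size_t(width) + size_t(px);
    r[idx].fetch_add(uint32_t(cr * alpha * kFixedPointScale), std::memory_order_relaxed);
    g[idx].fetch_add(uint32_t(cg * alpha * kFixedPointScale), std::memory_order_relaxed);
    b[idx].fetch_add(uint32_t(cb * alpha * kFixedPointScale), std::memory_order_relaxed);
}

// "Resolve pass" part: turn the accumulated sums back into a displayable color.
void resolve_pixel(const std::atomic<uint32_t>* r, const std::atomic<uint32_t>* g,
                   const std::atomic<uint32_t>* b, size_t idx, float outRgb[3])
{
    outRgb[0] = r[idx].load(std::memory_order_relaxed) / kFixedPointScale;
    outRgb[1] = g[idx].load(std::memory_order_relaxed) / kFixedPointScale;
    outRgb[2] = b[idx].load(std::memory_order_relaxed) / kFixedPointScale;
}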

Here are two charts showing how much faster this simple compute shader approach is, compared to built-in GPU point rendering. First for the "4M points in a 460x460 pixel area" case, then for the 5x5 pixel area case:

Several surprising things:

  • Even this trivial compute shader, for the not-too-crazy-overdraw case, is faster than built-in point rasterization on all GPUs. Mostly it is 1.5-2 times faster, with some outliers (AMD GPUs love it – it is like 10x faster than rasterization!).
  • For the "4M points in just a 5x5 pixel area" case, the compute shader approach is even better. I was not expecting that – the atomic additions it does would get crazily contended – but it is around 5x faster than rasterization across the board. My only guess is that while contended atomics are not great, they perhaps are still better than contended blending units?

Finally, a chart to match the rasterization one: how many times the compute shader rendering gets slower when the 460x460 area is reduced to a 5x5 one:

I think this shows “how good the GPU is at dealing with contended atomics”, and it seems to suggest that relatively speaking, AMD GPUs and recent Apple GPUs are not that great there. But again, even with this relative slowdown, the compute shader approach is way faster than the rasterization one, so…

Compute shaders are useful! What a finding!

But let’s get back to Blender.

Blender Scopes on the GPU, with a compute shader

So that's what I did then – I made the Blender video sequencer waveform/parade/vectorscope be calculated and rendered on the GPU, using a compute shader to do the point rasterization. That actually also allowed for "better" blending than what would be possible with fixed function blending – since I am accumulating the points hitting the same pixel, I can apply a non-linear alpha mapping in the final resolve pass.
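
For example, the resolve pass can run the accumulated per-pixel count through a soft saturating curve instead of letting it clip (a hypothetical mapping for illustration only; the actual curve used in Blender may differ):

// Hypothetical resolve-time alpha mapping: many overlapping points approach full
// opacity smoothly instead of hard-clipping, which fixed function blending cannot do.
#include <cmath>

float resolve_alpha(float accumulated_count, float strength /* e.g. 0.1f */)
{
    return 1.0f - std::exp(-accumulated_count * strength);
}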

The pull request #144867 has just landed, so scopes in Blender 5.0 will get faster and look better. All the scopes, everywhere, all at once, now look like this:

Whereas in current Blender 4.5 they look like this:

And for historical perspective, two years ago in Blender 4.0, before I started to dabble in this area, they looked like this:

Also, playback of this screen setup (4K EXR images, all these views/scopes) on my PC was at 1.1FPS in Blender 4.0; at 7.9FPS in Blender 4.5; and at 14.1FPS with these GPU scopes. Still work to do, but hey, progress.

That’s it, bye!


Lossless Float Image Compression

Back in 2021 I looked at OpenEXR lossless compression options (and I think my findings led to a change of the default zip compression level, as well as a change of the compression library from zlib to libdeflate. Yay, blogging about things!). Then in 2023 I looked at losslessly compressing a bunch of floating point data, some of which might be image-shaped.

Well, now a discussion somewhere else has nerd-sniped me to look into lossless compression of floating point images, and especially the ones that might have more than just RGB(A) color channels. Read on!

Four bullet point summary, if you’re in a hurry:

  • Keep on using OpenEXR with ZIP compression.
  • Soon OpenEXR might add HTJ2K compression; that compresses slightly better but has worse compression and decompression performance, so YMMV.
  • JPEG-XL is not competitive with OpenEXR in this area today.
  • You can cook up a “custom image compression” that seems to be better than all of EXR, EXR HTJ2K and JPEG-XL, while also being way faster.

My use case and the data set

What I primarily wanted to look at are "multi-layer" images that would be used in film compositing workflows. In such an image, a single pixel does not have just the typical RGB (and possibly alpha) channels, but might have more: ambient occlusion, direct lighting, indirect lighting, depth, normal, velocity, object ID, material ID, and so on. And the data itself is almost always floating point (either FP16 or FP32), sometimes with different precision for different channels within the same image.
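
As a concrete illustration, declaring such a mixed-precision multi-layer file via the OpenEXR C++ API looks roughly like this (a sketch; the channel names are just examples of the "color plus extra passes" idea, not any particular renderer's convention):

// Sketch: an EXR header with mixed half/float channels across several layers.
#include <ImfHeader.h>
#include <ImfChannelList.h>
#include <ImfCompression.h>

Imf::Header make_multilayer_header(int width, int height)
{
    Imf::Header header(width, height);
    header.compression() = Imf::ZIP_COMPRESSION;
    // Final color, in half precision
    header.channels().insert("R", Imf::Channel(Imf::HALF));
    header.channels().insert("G", Imf::Channel(Imf::HALF));
    header.channels().insert("B", Imf::Channel(Imf::HALF));
    header.channels().insert("A", Imf::Channel(Imf::HALF));
    // Example lighting pass, also half precision
    header.channels().insert("diffuse_direct.R", Imf::Channel(Imf::HALF));
    header.channels().insert("diffuse_direct.G", Imf::Channel(Imf::HALF));
    header.channels().insert("diffuse_direct.B", Imf::Channel(Imf::HALF));
    // Example data passes, in full float precision
    header.channels().insert("depth.Z", Imf::Channel(Imf::FLOAT));
    header.channels().insert("normal.X", Imf::Channel(Imf::FLOAT));
    header.channels().insert("normal.Y", Imf::Channel(Imf::FLOAT));
    header.channels().insert("normal.Z", Imf::Channel(Imf::FLOAT));
    return header;
}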

There does not seem to be a readily available "standard image set" like that to test things on, so I grabbed some images that I could find, and rendered some myself out of various Blender splash screen files. Here are the 10 data files I'm testing on (total uncompressed pixel size: 3122MB):

File                           | Resolution | Uncompressed size | Channels
Blender281rgb16.exr            | 3840x2160  | 47.5MB            | RGB half
Blender281rgb32.exr            | 3840x2160  | 94.9MB            | RGB float
Blender281layered16.exr        | 3840x2160  | 332.2MB           | 21 channels, half
Blender281layered32.exr        | 3840x2160  | 664.5MB           | 21 channels, float
Blender35.exr                  | 3840x2160  | 332.2MB           | 18 channels, mixed half/float
Blender40.exr                  | 3840x2160  | 348.0MB           | 15 channels, mixed half/float
Blender41.exr                  | 3840x2160  | 743.6MB           | 37 channels, mixed half/float
Blender43.exr                  | 3840x2160  | 47.5MB            | RGB half
ph_brown_photostudio_02_8k.exr | 8192x4096  | 384.0MB           | RGB float, from polyhaven
ph_golden_gate_hills_4k.exr    | 4096x2048  | 128.0MB           | RGBA float, from polyhaven

OpenEXR

OpenEXR is an image file format that has existed since 1999, and is primarily used within film, vfx and game industries. It has several lossless compression modes (see my previous blog post series).

It looks like OpenEXR 3.4 (should be out 2025 Q3) is adding a new HTJ2K compression mode, which is based on “High-Throughput JPEG 2000” format/algorithms, using open source OpenJPH library. The new mode is already in OpenEXR main branch (PR #2041).

So here’s how EXR does on my data set (click for a larger interactive chart):

This is two plots: compression ratio vs. compression performance, and compression ratio vs. decompression performance. In both cases, the best place on the chart is top right – the largest compression ratio, and the best performance.

For performance, I’m measuring it in GB/s, in terms of uncompressed data size. That is, if we have 1GB worth of raw image pixel data and processing it took half a second, that’s 2GB/s throughput (even if compressed data size might be different). Note that the vertical scale of both graphs is different. I am measuring compression/decompression time without actual disk I/O, for simplicity – that is, I am “writing” and “reading” “files” from memory. The graph is from a run on Apple MacBookPro M4 Max, with things being compiled in “Release” build configuration using Xcode/clang 16.1.

Green dot is EXR ZIP at default compression level (which is 4, but changing the level does not affect things much). Blue dot is the new EXR HTJ2K compression – a bit better compression ratio, but also lower performance. Hmm dunno, not very impressive? However:

  • From what I understand, HTJ2K achieves better ratio on RGB images by applying a de-correlation transform. In case of multi-layer EXR files (which is most of my data set), it only does that for one layer (usually the “final color” one), but does not try to do that on, for example, “direct diffuse” layer which is also “actually RGB colors”. Maybe future work within OpenEXR HTJ2K will improve this?
  • Initial HTJ2K evaluation done in 2024 found that a commercial HTJ2K implementation (from Kakadu) is quite a bit faster than the OpenJPH that is used in OpenEXR. Maybe future work within OpenJPH will speed it up?
  • It very well might be that once/if OpenEXR will get lossy HTJ2K, things would be much more interesting. But that is a whole another topic.

I was testing OpenEXR main branch code from 2025 June (3.4.0-dev, rev 45ee12752), and things are multi-threaded via Imf::setGlobalThreadCount(). Addition of HTJ2K compression codec adds 308 kilobytes to executable size by the way (on Windows x64 build).

Moving on.

JPEG-XL lossless

JPEG-XL is a modern image file format that aims to be a good improvement over many already existing image formats; both lossless and lossy, supporting standard and high dynamic range images, and so on. There's a recent "The JPEG XL Image Coding System" paper on arXiv with many details and impressive results, and the reference open source implementation is libjxl.

However, the arXiv paper above does not have any comparisons in how well JPEG-XL does on floating point data (it does have HDR image comparisons, but at 10/12/16 bit integers with a HDR transfer function, which is not the same). So here is me, trying out JPEG-XL lossless mode on images that are either FP16 or FP32 data, often with many layers (JPEG-XL supports this via “additional channels” concept), and sometimes with different floating point types based on channel.

Here’s results with existing EXR data, and JPEG-XL additions in larger red dots (click for an interactive chart):

Immediate thoughts are "okay, this can achieve better compression", coupled with "geez, that is slow". Expanding a bit:

  • At compression efforts 1-3, JPEG-XL does not win against OpenEXR (ZIP / HTJ2K) on compression ratio, while being 3x slower to compress and 3x-7x slower to decompress. So that is clearly not a useful place to be.
  • At compression effort levels 4+ it starts winning in compression ratio. Level 4 wins against HTJ2K a bit (1.947x -> 2.09x); the default level 7 wins more (2.186x), and there's quite a large increase in ratio at level 8 (2.435x). I briefly tried levels 9 and 10, but they do not seem to give much ratio gain, while being extraordinarily slow to compress. Even level 8 is already 100 times slower to compress than EXR, and 5-13x slower to decompress. So yeah, if final file size is really important to you, then maybe; on the other hand, 100x slower compression is, well, slow.

Looking at the feature set and documentation of the format, it feels that JPEG-XL is primarily targeted at "actually displayed images, perhaps for the web". Whereas with EXR, you can immediately see that it is not meant for "images that are displayed" – it does not even have a concept of low dynamic range imagery; everything is geared towards images used in the middle of a pipeline. From that fall out built-in features like an arbitrary number of channels, multi-part images, mipmaps, etc. Within JPEG-XL, everything is centered around "color", and while it can do more than just color, those parts feel like bolted-on things. It can do multiple frames, but they have to be the same size/format and are meant in the "animation frames" sense; it can do multiple layers, but those are meant in the "photoshop layers" sense; it talks about storing floating point data, but in the "HDR color or values a bit outside the color gamut" sense. And that is fine; the JPEG-XL coding system paper itself has a chart of what JPEG-XL wants to be (I circled that in red) and where EXR is (circled in green):

More subjective notes and impressions:

  • Perhaps the floating point paths within libjxl did not (yet?) get the same attention as “regular images” did; it is very possible that they will improve the performance and/or ratio in the future (I was testing end-of-June 2025 code).
  • A cumbersome part of libjxl is that the color channels need to be interleaved, and all the "other channels" need to be separate (planar). All my data is fully interleaved, so it costs some performance to arrange it the way libjxl wants, both for compression and after decompression. As a user, it would be much more convenient if their API was similar to the OpenEXR Slice, which takes a pointer and two strides (stride between pixels, and stride between rows; see the sketch after this list). Then any combination of interleaved/planar or mixed formats for different channels could be passed through the same API. In my own test code, reading and writing EXR images using OpenEXR is 80 lines of code, whereas JPEG-XL via libjxl is 550 lines.
  • On half-precision floats (FP16), libjxl currently is not fully lossless – subnormal values do not roundtrip correctly (issue #3881). The documentation also says that non-finite values (infinities / NaNs) in both FP32 and FP16 are not expected to roundtrip in an otherwise lossless mode. This is in contrast with EXR, where even for NaNs, their exact bit patterns are fully preserved. Again, this does not matter if the intended use case is “images on screen”, but matters if your use case is “this looks like an image, but is just some data”.
  • From what I can tell, some people have done performance evaluations of EXR ZIP vs JPEG-XL by using ffmpeg's EXR support; do not do that. At least right now, ffmpeg's EXR code is their own custom implementation that is completely single threaded and lacks other optimizations that the official OpenEXR library has.
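
For reference, here is roughly what that pointer-plus-two-strides pattern looks like on the OpenEXR side (a sketch that reads interleaved RGBA half pixels into one buffer; for brevity it assumes the data window starts at (0,0)):

// Sketch: reading interleaved RGBA half pixels into a single buffer via
// OpenEXR Slice strides. Assumes the data window origin is (0,0).
#include <ImfInputFile.h>
#include <ImfFrameBuffer.h>
#include <ImathBox.h>
#include <cstdint>
#include <vector>

void read_rgba_interleaved(const char* path,
                           std::vector<uint16_t>& pixels, // raw 16-bit storage for half values
                           int& width, int& height)
{
    Imf::InputFile file(path);
    Imath::Box2i dw = file.header().dataWindow();
    width = dw.max.x - dw.min.x + 1;
    height = dw.max.y - dw.min.y + 1;
    pixels.resize(size_t(width) * size_t(height) * 4);

    const size_t xStride = sizeof(uint16_t) * 4; // stride between pixels
    const size_t yStride = xStride * width;      // stride between rows
    char* base = reinterpret_cast<char*>(pixels.data());

    Imf::FrameBuffer fb;
    fb.insert("R", Imf::Slice(Imf::HALF, base + 0 * sizeof(uint16_t), xStride, yStride));
    fb.insert("G", Imf::Slice(Imf::HALF, base + 1 * sizeof(uint16_t), xStride, yStride));
    fb.insert("B", Imf::Slice(Imf::HALF, base + 2 * sizeof(uint16_t), xStride, yStride));
    fb.insert("A", Imf::Slice(Imf::HALF, base + 3 * sizeof(uint16_t), xStride, yStride));
    file.setFrameBuffer(fb);
    file.readPixels(dw.min.y, dw.max.y);
}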

I was testing libjxl main branch code from 2025 June (0.12-dev, rev a75b322e), and things are multi-threaded via JxlThreadParallelRunner. Library adds 6017 kilobytes to executable size (on Windows x64 build).

And now for something completely different:

Mesh Optimizer to compress images, why not?

Back when I was playing around with floating point data compression, one of the things I tried was using meshoptimizer by Arseny Kapoulkine to losslessly compress the data. It worked quite well, so why not try it again – especially since it has gained both compression ratio and performance improvements since then.

So let's try "MOP", which is not an actual image format, just something I quickly cooked up (there is a rough sketch of the per-chunk compression after the list below):

  • A small header with image size and channel information,
  • Then the image is split into chunks, each 16K pixels in size. Each chunk is compressed independently and in parallel.
  • A small table with compressed sizes for each chunk is written after the header, followed by the compressed data itself for each chunk.
  • Mesh optimizer needs the "vertex size" (pixel size in this case) to be a multiple of four; if that is not the case, the chunk data is padded with zeroes inside the compression/decompression code.
  • And just like the previous time: the mesh optimizer vertex codec is not an LZ-based compressor (it seems to be more like a delta/prediction scheme that is packed nicely), and you can further compress the result by just piping it into a regular lossless compressor. In my case, I used zstd.
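
Here is roughly what compressing one such chunk could look like with the two libraries (an illustrative sketch only; the real tool also handles the padding, the header, and the per-chunk size table described above, and the zstd level here is just a placeholder choice):

// Sketch: compress one chunk of pixels with meshoptimizer's vertex codec,
// optionally followed by a zstd pass. Error handling omitted for brevity.
#include <meshoptimizer.h>
#include <zstd.h>
#include <cstddef>
#include <vector>

std::vector<unsigned char> compress_chunk(const void* pixels, size_t pixelCount,
                                          size_t pixelSize /* bytes, multiple of 4 */,
                                          bool alsoZstd)
{
    // 1) meshoptimizer vertex codec: treat each pixel as a "vertex"
    std::vector<unsigned char> moped(meshopt_encodeVertexBufferBound(pixelCount, pixelSize));
    moped.resize(meshopt_encodeVertexBuffer(moped.data(), moped.size(),
                                            pixels, pixelCount, pixelSize));
    if (!alsoZstd)
        return moped;

    // 2) optional zstd pass on top of the vertex codec output
    std::vector<unsigned char> zst(ZSTD_compressBound(moped.size()));
    zst.resize(ZSTD_compress(zst.data(), zst.size(), moped.data(), moped.size(), /*level*/ 1));
    return zst;
}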

So here’s how “MOP” does on my data set (click for a larger interactive chart):

The purple dots are the new “MOP” additions. You can see there are two groups of them: 1) around 2.0x ratio and very high decompression speed is just mesh optimizer vertex codec, 2) around 2.3x ratio and slightly lower decompression speed is mesh optimizer codec followed by zstd.

And that is… very impressive, I think:

  • Just the mesh optimizer vertex codec by itself achieves about the same or a slightly higher compression ratio than EXR HTJ2K, while being almost 2x faster to compress and 5x faster to decompress.
  • Coupled with zstd, it achieves a compression ratio between JPEG-XL levels 7-8 (2.3x), while being 30-100 times faster to compress and 20 times faster to decompress. This combination also very handily wins against EXR (both ZIP and HTJ2K), in both ratio and performance.
  • Arseny is a witch!?

I was testing mesh optimizer v0.24 (2025 June) and zstd v1.5.7 (2025 Feb). Mesh optimizer itself adds just 26 kilobytes (!) of executable code; however zstd adds 405 kilobytes.

And here are the results of all the above running on a different CPU, OS and compiler (Ryzen 5950X, Windows 10, Visual Studio 2022 v17.14). Everything is several times slower (some of that is due to the Apple M4 having crazy high memory bandwidth, some of it CPU differences, some compiler differences, some OS behavior with large allocations, etc.). But the overall "shape" of the charts is more or less the same:

That’s it for now!

So there. Source code for everything above is over at github.com/aras-p/test_exr_htj2k_jxl. Again, my own takeaways are:

  • EXR ZIP is fine,
  • EXR HTJ2K is slightly better compression, worse performance. There is hope that performance can be improved.
  • JPEG-XL does not feel like a natural fit for these (multi-layered, floating point) images right now. However, it could become one in the future, perhaps.
  • JPEG-XL (libjxl) compression performance is very slow; however, it can achieve better ratios than EXR. Decompression performance is also several times slower. It is possible that both performance and ratio could be improved, especially if they have not focused on the floating point cases yet.
  • Mesh Optimizer (optionally coupled with zstd) is very impressive, both in terms of compression ratio and performance. It is not an actual image format that exists today, but if you need to losslessly compress some floating point images for internal needs only, it is worth looking at.

And again, all of that was for fully lossless compression. Lossy compression is a whole another topic that I may or may not look into someday. Or, someone else could look! Feel free to use the image set I have used.


Voronoi, Hashing and OSL

Sergey from Blender asked me to look into why trying to manually sprinkle some SIMD into the Cycles renderer Voronoi node code actually made things slower. So I started to look, and what I did in the end had nothing to do with SIMD whatsoever!

TL;DR: Blender 5.0 changed Voronoi node hash function to a faster one.

Voronoi in Blender

Blender has a Voronoi node that can be used in any node-based scenario (materials, compositor, geometry nodes). More precisely, it is actually Worley noise, a procedural noise function. It can be used to produce various interesting patterns:

A typical implementation of Voronoi uses a hash function to randomly offset each grid cell. For something like the 3D noise case, it has to calculate said hash for 27 neighboring cells (3x3x3), for each item being evaluated. That is a lot of hashing!
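
For context, the core of a 3D Voronoi/Worley evaluation looks roughly like this (a simplified "distance to the closest feature point" sketch in the same C-like style as the snippets below, using the hash_float3_to_float3 helper shown further down; not the exact Blender code):

// Simplified 3D Worley/Voronoi sketch: each of the 27 neighboring cells gets a
// pseudorandom feature point via a hash, and we keep the closest distance.
float voronoi_f1_distance(float3 p)
{
    float3 cell = floor(p);
    float min_dist = 1e30f;
    for (int k = -1; k <= 1; k++)
        for (int j = -1; j <= 1; j++)
            for (int i = -1; i <= 1; i++)
            {
                float3 neighbor = cell + float3(i, j, k);
                // hash_float3_to_float3 (below) gives a pseudorandom 0..1 offset per cell
                float3 feature = neighbor + hash_float3_to_float3(neighbor);
                min_dist = min(min_dist, length(p - feature));
            }
    return min_dist;
}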

The current implementation of, for example, "calculate a random 0..1 3D offset for a 3D cell coordinate" looked like this in Blender:

// Jenkins Lookup3 Hash Function
// https://burtleburtle.net/bob/c/lookup3.c
#define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))
#define mix(a, b, c) { \
    a -= c; a ^= rot(c, 4); c += b; \
    b -= a; b ^= rot(a, 6); a += c; \
    c -= b; c ^= rot(b, 8); b += a; \
    a -= c; a ^= rot(c, 16); c += b; \
    b -= a; b ^= rot(a, 19); a += c; \
    c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a, b, c) { \
    c ^= b; c -= rot(b, 14); \
    a ^= c; a -= rot(c, 11); \
    b ^= a; b -= rot(a, 25); \
    c ^= b; c -= rot(b, 16); \
    a ^= c; a -= rot(c, 4); \
    b ^= a; b -= rot(a, 14); \
    c ^= b; c -= rot(b, 24); \
}
uint hash_uint3(uint kx, uint ky, uint kz)
{
    uint a;
    uint b;
    uint c;
    a = b = c = 0xdeadbeef + (3 << 2) + 13;
    c += kz;
    b += ky;
    a += kx;
    final(a, b, c);
    return c;
}
uint hash_uint4(uint kx, uint ky, uint kz, uint kw)
{
    uint a;
    uint b;
    uint c;
    a = b = c = 0xdeadbeef + (4 << 2) + 13;
    a += kx;
    b += ky;
    c += kz;
    mix(a, b, c);
    a += kw;
    final(a, b, c);
    return c;
}

float uint_to_float_incl(uint n)
{
    return (float)n * (1.0f / (float)0xFFFFFFFFu);
}
float hash_uint3_to_float(uint kx, uint ky, uint kz)
{
    return uint_to_float_incl(hash_uint3(kx, ky, kz));
}
float hash_uint4_to_float(uint kx, uint ky, uint kz, uint kw)
{
    return uint_to_float_incl(hash_uint4(kx, ky, kz, kw));
}
float hash_float3_to_float(float3 k)
{
    return hash_uint3_to_float(as_uint(k.x), as_uint(k.y), as_uint(k.z));
}
float hash_float4_to_float(float4 k)
{
    return hash_uint4_to_float(as_uint(k.x), as_uint(k.y), as_uint(k.z), as_uint(k.w));
}

float3 hash_float3_to_float3(float3 k)
{
    return float3(hash_float3_to_float(k),
        hash_float4_to_float(float4(k.x, k.y, k.z, 1.0)),
        hash_float4_to_float(float4(k.x, k.y, k.z, 2.0)));
}

i.e. it is based on Bob Jenkins' "lookup3" hash function, and does that "kind of three times", hashing float3(x,y,z), float4(x,y,z,1) and float4(x,y,z,2). That is to calculate one offset of one grid cell. Repeat that for 27 grid cells in the 3D Voronoi case.

I know! Let’s switch to PCG3D hash!

If you are aware of the "Hash Functions for GPU Rendering" (Jarzynski, Olano, 2020) paper, you might say "hey, maybe instead of using a hash function from 1997, let's use a dedicated 3D->3D hash function from several decades later". And you would be absolutely right:

uint3 hash_pcg3d(uint3 v)
{
  v = v * 1664525u + 1013904223u;
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  v = v ^ (v >> 16);
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  return v;
}
float3 hash_float3_to_float3(float3 k)
{
  uint3 uk = as_uint3(k);
  uint3 h = hash_pcg3d(uk);
  float3 f = float3(h);
  return f * (1.0f / (float)0xFFFFFFFFu);
}

Which is way cheaper (the hash function itself is like 4x faster on modern CPUs). Good! We are done!

If you are using hash functions from the 1990s, try some of the more modern ones! They might be both simpler and of the same or better quality. Hash functions from several decades ago were built on the assumption that multiplication is very expensive, which is very much not the case anymore.

So you do this for the various Voronoi cases of 2D->2D, 3D->3D, 4D->4D. First in the Cycles C++ code (which compiles to both CPU execution and GPU via CUDA/Metal/HIP/oneAPI), then in the EEVEE GPU shader code (GLSL), then in regular Blender C++ code (which is used in geometry nodes and the CPU compositor).

And you think you are done until you realize…

Cycles with Open Shading Language (OSL)

The test suite reminds you that Blender Cycles can use OSL as the shading backend. Open Shading Language, similar to GLSL, HLSL or RSL, is a C-like language for writing shaders. Unlike some other languages, a "shader" does not output a color; instead it outputs a "radiance closure", so that the result can be importance-sampled by the renderer, etc.

So I thought, okay, instead of updating the Voronoi code in three places (Cycles CPU, EEVEE GPU, Blender CPU), it will have to be four places. Let's find out where and how Cycles implements the shader nodes for OSL, update that place, and we're good.

Except… it turns out OSL does not have unsigned integers (see data types). It also does not have a bitcast from float to int.

I certainly did not expect an "Advanced shading language for production GI renderers" to not have a concept of unsigned integers in the year 2025 LOL :) I knew nothing about OSL just a day before, and now there I was, wondering about the language's data type system.

Luckily enough, specifically for Voronoi case, all of that can be worked around by:

  • Noticing that everywhere within Voronoi code, we need to calculate a pseudorandom “cell offset” out of integer cell coordinates only. That is, we do not need hash_float3_to_float3, we need hash_int3_to_float3. This works around the lack of bit casts in OSL.
  • We can work around lack of unsigned integers with a slight modification to PCG hash, that just operates on signed integers instead. OSL can do multiplications, XORs and bit shifts, just only on signed integers. Fine with us!
int3 hash_pcg3d_i(int3 v)
{
  v = v * 1664525 + 1013904223;
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  v = v ^ (v >> 16);
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  return v & 0x7FFFFFFF;
}
float3 hash_int3_to_float3(int3 k)
{
  int3 h = hash_pcg3d_i(k);
  float3 f = float3((float)h.x, (float)h.y, (float)h.z);
  return f * (1.0f / (float)0x7FFFFFFFu);
}

So that works. But instead of only having to change hash_float3_to_float3 and friends, this now required updating all the Voronoi code itself as well, to make it hash integer cell coordinates as inputs.

“Wait, but how did Voronoi OSL code work in Blender previously?!”

Good question! It was using the OSL built-in hashnoise() functions that take float as input, and produce a float output. And… yup, they just happened to use exactly the same Jenkins Lookup3 hash function underneath. Happy coincidence? One implementation copying what the other was doing? I don’t know.

It would be nice if OSL got unsigned integers and bitcasts though. As it is today, if you need to hash float->float, you can only use the built-in OSL hash function, which is not particularly fast. For the Voronoi case that can be worked around, but I bet there are other cases where working around it is much harder.

So that’s it!

The pull request that makes Blender Voronoi node 2x-3x faster has been merged for Blender 5.0. It does change the actual resulting Voronoi pattern, e.g. before and after:

So while it “behaves” the same, the literal pattern has changed. And that is why a 5.0 release sounds like good timing to do it.

What did I learn?

  • Actually learned about how Voronoi/Worley noise code works, instead of only casually hearing about it.
  • Learned that various nodes within Blender have four separate implementations, that all have to match in behavior.
  • Learned that there is a shading language, in 2025, that does not have unsigned integers :)
  • There can be (and is) code out there that is using hash functions from the previous millennium, which might not be optimal today.
  • I should still look at the SIMD aspect of this whole thing.