Unity vs floating point

A tweet by @VehiclePhysics sparked my interest. It basically says:

For most math functions (Sqrt, Sin, Cos, Log, Pow…), prefer System.MathF over UnityEngine.Mathf. Unity’s Mathf casts to double, calls the double version, then converts back to float. System.MathF calls the float-native implementations directly. Less work, same result.

This advice is basically correct! But turns out, things are slightly more complicated.
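The “same result” part of the tweet can be checked for square root specifically. Below is a minimal C++ sketch (C++ rather than C#, so it runs outside of Unity; all function names are made up for illustration) that mirrors the two routes: widen to double and narrow back, versus the float-native call. Square root is a friendly case here: both versions are correctly rounded, and double carries more than twice the mantissa bits of float, so rounding twice should not change the answer. That guarantee does not automatically carry over to Sin, Pow and friends.

```cpp
#include <cmath>
#include <cstring>

// Route that mirrors UnityEngine.Mathf.Sqrt: widen, double sqrt, narrow.
inline float sqrt_via_double(float x) {
    return static_cast<float>(std::sqrt(static_cast<double>(x)));
}

// Route that mirrors System.MathF.Sqrt: float-native square root.
inline float sqrt_native(float x) {
    return std::sqrt(x); // float overload of std::sqrt
}

// Counts inputs where the two routes disagree bit-for-bit.
// Inputs are start, start+step, start+2*step, ... (accumulated in float).
inline int count_mismatches(float start, float step, int count) {
    int mismatches = 0;
    float x = start;
    for (int i = 0; i < count; ++i, x += step) {
        const float a = sqrt_via_double(x);
        const float b = sqrt_native(x);
        if (std::memcmp(&a, &b, sizeof(float)) != 0)
            ++mismatches;
    }
    return mismatches;
}
```

Calling count_mismatches(0.0f, 0.37f, 100000) is one way to sweep a range of inputs; this assumes strict IEEE 754 evaluation (no fast-math style compiler flags).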

Hidden double precision in Unity

The advice above applies to all UnityEngine.Mathf methods that deal with trigonometry (Sin, Cos, Tan, Asin, Acos, Atan, Atan2), exponentials (Sqrt, Pow, Exp, Log, Log10), rounding (Ceil, Floor, Round, CeilToInt, FloorToInt, RoundToInt), comparisons (Min, Max, Clamp, Clamp01) and others (Sign, SmoothStep, Gamma, Approximately, InverseLerp). About the only function it does not apply to is Mathf.Abs.

But… why? Well, because C#/.NET originally did not have single-precision methods for these sorts of math functions. The single precision System.MathF was introduced in .NET Core 2.0 (year 2017).

Now, you might have expected that almost ten years later, maybe Unity would have noticed this, and made them single precision? Alas, no. There could be potential backwards compatibility issues preventing that (or maybe not! see below).

You might also have guessed that the Unity.Mathematics package, which was introduced in 2019 as part of the whole DOTS push and is modeled to be very similar to HLSL, would actually do single precision floating point for functions that look like single precision floating point… and that would be wrong too. For all the trigonometric and exponential functions, like math.sqrt(float x), it routes the call into the double precision C# implementation. Why? I don’t know.

But wait! There is way more double precision. The Mono C# runtime used in Unity does all math in double precision, everywhere. Yes, this means there is a ton of float⭤double conversions from in-memory representation to in-register representation, all over the place. I first noticed this back in 2018 when doing a toy path tracer, and then Miguel de Icaza wrote an explanatory blog post, outlining plans to switch Mono to use actual floats for floats (yeah!).

“In Mono, decades ago, we made the mistake of performing all 32-bit float computations as 64-bit floats while still storing the data in 32-bit locations.”

Official Mono releases have switched to do that since then, but (I think) for backwards compatibility reasons Unity never enabled that functionality and kept everything at double precision so far.

Note however that the above only applies to Mono. The other two C# language/runtime implementations used across Unity today, IL2CPP and Burst, do not have the “everything is actually double precision” behavior. It is weird that Unity would not switch their Mono version to match; after all some of their main deployment platforms never use Mono (iOS, consoles, web)!

Let’s look at a square root

The above is fairly abstract, so let’s look at what actually happens with a very simple loop that sums up a bunch of square roots:

const int N = 10000000;
public static float UnityMathf(float v)
{
    for (int i = 0; i < N; ++i)
    {
        v += UnityEngine.Mathf.Sqrt(v); // classic Unity
        //v += System.MathF.Sqrt(v); // as advised by the tweet above
    }
    return v;
}

In the Unity editor (6000.0.76, but rough timings are the same on 2022.3, 6000.3 and 6000.6 versions), on Windows / Ryzen 5950X machine: UnityEngine.Mathf 282ms, System.MathF 186ms. Whoa indeed, this is way faster!

But hey! Back in 2018 we already found that Unity’s C# performance also very much depends on whether script debugging is enabled or not. Back then it was called “Editor Attaching” under preferences; these days it is this bad-contrast-in-light-theme Debug vs Release widget at lower right editor corner. In Release mode, in-editor timings are: UnityEngine.Mathf 242ms, System.MathF 149ms.

More square roots in more C# variants

To get a more complete picture, let’s also add a variant that uses the “new way of doing math” in Unity, i.e. the Unity.Mathematics package. And have timings for a player build that uses Mono, plus timings for an IL2CPP scripting backend. And while at it, also test performance of the same code under Burst compiler.

Variant (time, ms)    Editor Debug   Editor Release   Player Mono   Player IL2CPP
Mathf                          282              242           212              35
System.MathF                   186              149           142              35
Mathematics                    260              211           209              59
Burst Mathf                     66               66            67              60
Burst Mathematics               35               34            34              34

And for a complete picture, the same loop, using System.MathF.Sqrt (C#) or sqrtf() (C++) in non-Unity implementations / runtimes:

Variant (time, ms)    C# Mono 6.12   C# .NET 10   C++ /O2
System.MathF                   130           37         -
sqrtf()                          -            -        35
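For reference, the C++ side of such a comparison might look roughly like the sketch below. This is an assumption about the shape of the test, not the actual benchmark source: the function names, the timing helper and the starting value of v are all made up for illustration.

```cpp
#include <chrono>
#include <cmath>

// Same loop as the C# version: v += sqrt(v), n times, all in float.
inline float sum_sqrt(float v, int n) {
    for (int i = 0; i < n; ++i)
        v += std::sqrt(v); // float overload, equivalent to sqrtf()
    return v;
}

// Runs the loop and returns elapsed wall-clock milliseconds.
// The loop result is written to *out so the compiler cannot discard the work.
inline double time_sum_sqrt_ms(float start, int n, float* out) {
    const auto t0 = std::chrono::steady_clock::now();
    *out = sum_sqrt(start, n);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Each iteration depends on the previous value of v, so the loop mostly measures square root and add latency rather than throughput.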

Summary of the above:

  • 35 milliseconds to do this loop is “as good as it can get” on this machine, and that is achieved by C++ & .NET, and within Unity by using Burst + Unity.Mathematics, or when using IL2CPP, with either of Mathf.Sqrt or System.MathF.Sqrt. Under IL2CPP, there does seem to be some special code path that goes “oh this should actually be single precision square root” and generates underlying C++ code accordingly.
  • System.MathF functions are not supported by Burst for some reason; if you try to use them you will get Burst compile errors. If you do not need Burst, then System.MathF is often faster. It does make it harder to move code to Burst though.
  • Unity.Mathematics is often slightly better than the classic Mathf, except under IL2CPP, at least for the square root. IL2CPP does not seem to have special recognition of “oh this should be single precision square root” for it, and has other overheads too, see below.
  • Burst behaves the opposite way to IL2CPP: it does not seem to do “oh this should be single precision” for Mathf.Sqrt, but it does compile Mathematics.math.sqrt to a single precision square root.

Also fun fact? All the Unity implementations above print the result of the above loop as 24212990000000.0, which is curiously not a number that exists as a single precision float (closest floats that exist are 24212989280256.0 and 24212991377408.0). That’s one of the signs of “yeah some stuff is always doubles underneath, somewhere”. The non-Unity (C# .NET, C++) implementations print the result 24212987183104.0.
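The “does not exist as a float” claim is easy to verify. A small C++ sketch (helper names are hypothetical) that rounds a double to the nearest float and steps to the neighboring representable values:

```cpp
#include <cmath>
#include <limits>

// Nearest single-precision value to a double (default round-to-nearest cast).
inline float nearest_float(double x) {
    return static_cast<float>(x);
}

// Next representable float above f.
inline float next_up(float f) {
    return std::nextafter(f, std::numeric_limits<float>::infinity());
}

// Next representable float below f.
inline float next_down(float f) {
    return std::nextafter(f, -std::numeric_limits<float>::infinity());
}
```

At this magnitude (between 2^44 and 2^45) adjacent floats are 2^21 = 2097152 apart. nearest_float(24212990000000.0) gives 24212989280256, next_up of that is 24212991377408, and next_down is 24212987183104, which is the value printed by the non-Unity implementations.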

Welcome to the world! Things are never simple!

Code generation of the square root loops in detail

Mono, UnityEngine.Mathf.Sqrt

As the original tweet says, Unity’s Mathf.Sqrt is implemented like this: public static float Sqrt(float f) => (float)Math.Sqrt((double)f); – it just calls into double precision System.Math.Sqrt. But if you look at the actual JIT’ed machine code generated by Mono, you can see that there are way more float⭤double conversions going on.

I have used Sebastian Schöner’s Asm Explorer tool to see the generated code. Given this C# code:

const int N = 10000000;
public static float UnityMathf(float v)
{
    for (int i = 0; i < N; ++i)
    {
        v += UnityEngine.Mathf.Sqrt(v);
    }
    return v;
}

the loop body ends up being this:

loop:
movss xmm0, dword [rsp+0x10]     ; xmm0 = v, as float
cvtss2sd xmm0, xmm0              ; xmm0 = (double)v, left side of v + sqrt(v)

movss xmm1, dword [rsp+0x10]     ; xmm1 = v, as float again, argument for sqrt
cvtss2sd xmm1, xmm1              ; xmm1 = (double)v

cvtsd2ss xmm5, xmm1              ; xmm5 = (float)(double)v, rounded back to float
movss [rsp+0x8], xmm5            ; store temporary float argument

movss xmm1, dword [rsp+0x8]      ; xmm1 = temporary float argument
cvtss2sd xmm1, xmm1              ; xmm1 = (double)temporary float

movsd [rsp-0x8], xmm1            ; store double for x87 sqrt input
fld qword [rsp-0x8]              ; push double onto x87 stack
fsqrt                            ; ST(0) = sqrt(ST(0))
fstp qword [rsp-0x8]             ; store sqrt result as double and pop x87 stack

movsd xmm1, qword [rsp-0x8]      ; xmm1 = sqrt result, as double
cvtsd2ss xmm1, xmm1              ; xmm1 = (float)sqrt result
cvtss2sd xmm1, xmm1              ; xmm1 = (double)(float)sqrt result

cvtsd2ss xmm5, xmm1              ; xmm5 = sqrt result rounded to float
movss [rsp+0x8], xmm5            ; store temporary sqrt float

movss xmm1, dword [rsp+0x8]      ; xmm1 = temporary sqrt float
cvtss2sd xmm1, xmm1              ; xmm1 = (double)temporary sqrt float

cvtsd2ss xmm5, xmm1              ; xmm5 = sqrt result rounded to float again
movss [rsp+0x8], xmm5            ; store temporary sqrt float again

movss xmm1, dword [rsp+0x8]      ; xmm1 = temporary sqrt float again
cvtss2sd xmm1, xmm1              ; xmm1 = (double)sqrt result

addsd xmm0, xmm1                 ; xmm0 = (double)v + (double)Mathf.Sqrt(v)

cvtsd2ss xmm5, xmm0              ; xmm5 = final iteration result, round to float
movss [rsp+0x10], xmm5           ; v = final iteration result

inc esi                          ; ++i
cmp esi, 0x989680                ; compare i against 10000000
jl loop                          ; if i < N, continue loop

Written back as C#, it is as if v += UnityEngine.Mathf.Sqrt(v) actually expanded to:

double lhs = (double)v;

double t0 = (double)v;
float t1 = (float)t0;
float stackFloat0 = t1;
float t2 = stackFloat0;
double sqrtInput = (double)t2;

double stackDouble0 = sqrtInput;
double sqrtDouble = X87_Fsqrt(stackDouble0); // represents x87 FPU fsqrt instruction

float t3 = (float)sqrtDouble;
double t4 = (double)t3;
float t5 = (float)t4;
float stackFloat1 = t5;
float t6 = stackFloat1;
double t7 = (double)t6;
float t8 = (float)t7;
float stackFloat2 = t8;
float t9 = stackFloat2;
double rhs = (double)t9;

double sum = lhs + rhs;
float result = (float)sum;
v = result;

That’s… not exactly great, to put it mildly. Unity is planning to switch to “actual .NET” (CoreCLR) really soon now (see Path to CoreCLR GDC 2026 talk) and codegen should get much better then. Meanwhile, I am rediscovering the same things as what Sebastian Schöner did, but he is also trying to do something about it – see Better codegen for Unity games on Mono blog post.

Using Unity.Mathematics.math.sqrt is a tiny bit better codegen than above, but not by much.

Mono, System.MathF.Sqrt

Given this C# code:

const int N = 10000000;
public static float SystemMathF(float v)
{
    for (int i = 0; i < N; ++i)
    {
        v += System.MathF.Sqrt(v);
    }
    return v;
}

the loop body ends up being this:

loop:
movss xmm0, dword [rbp-0x10]     ; xmm0 = v, as float
cvtss2sd xmm0, xmm0              ; xmm0 = (double)v
movsd [rbp-0x18], xmm0           ; save old v as double for later addition

movss xmm0, dword [rbp-0x10]     ; xmm0 = v, as float again, argument for MathF.Sqrt
cvtss2sd xmm0, xmm0              ; xmm0 = (double)v

cvtsd2ss xmm0, xmm0              ; xmm0 = (float)(double)v, argument to MathF.Sqrt
nop                              ; padding / alignment / patchpoint artifact

mov r11, 0x22494ee3918           ; r11 = JIT trampoline address for System.MathF.Sqrt(float)
call r11                         ; call MathF.Sqrt(float), argument in xmm0, return float in xmm0

cvtss2sd xmm1, xmm0              ; xmm1 = (double)MathF.Sqrt(v)

movsd xmm0, qword [rbp-0x18]     ; xmm0 = saved old v, as double
addsd xmm0, xmm1                 ; xmm0 = (double)old_v + (double)sqrt_v

cvtsd2ss xmm5, xmm0              ; xmm5 = final iteration result rounded to float
movss [rbp-0x10], xmm5           ; v = final iteration result

inc esi                          ; ++i
cmp esi, 0x989680                ; compare i against 10,000,000
jl loop                          ; if i < N, continue loop

and the assembly of the actual System.MathF.Sqrt function is:

xorps xmm1, xmm1                 ; xmm1 = 0.0f
ucomiss xmm1, xmm0               ; compare 0.0f with input
ja handlefail                    ; if 0.0f > input, input is negative: go handle failure/NaN path
sqrtss xmm0, xmm0                ; xmm0 = sqrtss(xmm0), scalar single-precision sqrt
ret                              ; return sqrt result in xmm0

handlefail:
; some code that handles failures/NaNs

it is effectively this:

static float MathF_Sqrt_Call(float x)
{
    if (0.0f > x)
        return MathF_Sqrt_SlowPath(x);
    return Sse_SqrtScalarSingle(x); // sqrtss instruction
}

// ...
double lhs = (double)v;

double t0 = (double)v;
float sqrtArg = (float)t0;
float sqrtResult = MathF_Sqrt_Call(sqrtArg);
double rhs = (double)sqrtResult;

double sum = lhs + rhs;
float result = (float)sum;
v = result;

There are still a bunch of float⭤double conversions! But way fewer, and instead of using the ancient x87 FPU, this now uses the scalar SSE square root instruction.

Burst, UnityEngine.Mathf.Sqrt

Under Burst, the v += UnityEngine.Mathf.Sqrt(v) inner loop part faithfully translates to:

vcvtss2sd   xmm1, xmm0, xmm0 ; convert float→double
vsqrtsd     xmm1, xmm1, xmm1 ; scalar double precision square root
vcvtsd2ss   xmm1, xmm1, xmm1 ; convert double→float
vaddss      xmm0, xmm0, xmm1 ; float +=

i.e. it does pretty much what you would expect, given Mathf.Sqrt implementation.

Burst, Unity.Mathematics.math.sqrt

The v += Unity.Mathematics.math.sqrt(v) under Burst translates to just:

vsqrtss     xmm1, xmm0, xmm0 ; scalar single precision square root
vaddss      xmm0, xmm0, xmm1 ; float +=

This is basically what you would want to happen.

This is somewhat curious though, since underlying math.sqrt code is actually public static float sqrt(float x) { return (float)System.Math.Sqrt((float)x); } – i.e. without Burst, it does end up calling into double precision function. But Burst gives this some sort of special treatment, that it does not do for the previous case, I guess.

And again, no System.MathF.Sqrt test with Burst, since it just fails if you try to use that.

IL2CPP, UnityEngine.Mathf.Sqrt

Unity’s IL2CPP scripting backend translates .NET bytecode into C++, and then relies on a regular C++ compiler to carry out optimizations.

For the Mathf.Sqrt code path, it does seem to actually give it special treatment – it does not call the double precision square root, even if on C# level it does do double precision. This is the opposite of what Burst does, and I guess this is another example of “you ship your org chart” in action.

The inner loop in generated C++ code is:

float L_0 = ___0_v;
float L_1 = ___0_v;
float L_2;
L_2 = sqrtf(L_1);
___0_v = (float)il2cpp_codegen_add(L_0, L_2); // template function, just + for simple types

which then the C++ compiler (MSVC 2022 v17.14, Release build config) actually unrolls to do ten square roots per iteration, with each square root snippet being this:

xorps       xmm1, xmm1     ; xmm1 = 0.0f
ucomiss     xmm1, xmm6     ; compare 0.0f with v
ja          edgecase       ; if 0.0f > v: use sqrtf fallback function
xorps       xmm0, xmm0     ; xmm0 = 0.0f
sqrtss      xmm0, xmm6     ; xmm0 = sqrt(v), scalar single-precision sqrt
jmp         end
edgecase:
movaps      xmm0, xmm6     ; xmm0 = v, argument for sqrtf
call        sqrtf          ; call C runtime sqrtf
end:
addss       xmm6, xmm0     ; v += sqrtResult

This is not a simple “just use sqrtss”: it only uses the instruction for valid inputs, and calls into the “full” function for others (to set errno or deal with exceptions, I guess). You could argue that this is less optimal codegen than what Burst does; in practice on this benchmark it does not matter though.
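What the “full” function has to take care of can be poked at from C++ directly. The sketch below is only an illustration with made-up names: it calls the library square root, then reports whether the result is NaN and whether the C library flagged a domain error through errno. Whether errno is used at all is platform dependent, which is what math_errhandling reports.

```cpp
#include <cerrno>
#include <cmath>

struct SqrtReport {
    float value;      // result of the square root
    bool  is_nan;     // true if the result is NaN (e.g. negative input)
    bool  errno_edom; // true if the C library set errno to EDOM
};

// Calls the library sqrt and records how an invalid input was reported.
inline SqrtReport checked_sqrt(float x) {
    errno = 0;
    const float r = std::sqrt(x);
    return SqrtReport{ r, std::isnan(r), errno == EDOM };
}

// True if this C library reports math errors through errno at all.
inline bool library_uses_errno() {
    return (math_errhandling & MATH_ERRNO) != 0;
}
```

For valid inputs none of that bookkeeping is needed, which is presumably why the compiler inlines a bare sqrtss for them and leaves the call for the rest.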

IL2CPP, System.MathF.Sqrt

Now, for System.MathF the IL2CPP codegen is slightly different:

il2cpp_codegen_runtime_class_init_inline(MathF_longGUID_il2cpp_TypeInfo_var);
float L_0 = ___0_v;
float L_1 = ___0_v;
float L_2;
L_2 = sqrtf(L_1);
___0_v = (float)il2cpp_codegen_add(L_0, L_2); // template function, just + for simple types

– why yes, that is the il2cpp_codegen_runtime_class_init_inline call inside the hot inner loop. What that does, is it checks some flag and if it is not set, calls some other function. Some sort of “lazy C# class initialization”, that for some reason is not needed in the previous case, but is needed here.

In assembly, this looks very much like above, except now the loop body is not “tiny enough” so MSVC compiler does not do ten square roots per each actual loop iteration; it does only one. And before each square root, it does this:

mov         rcx,qword ptr [MathF_longGUID_il2cpp_TypeInfo_var]  
cmp         dword ptr [rcx+0E4h],0  
jne         inited
call        il2cpp_codegen_runtime_class_init
inited:

Now again, for this particular benchmark it does not matter (the memory address it checks is very much in the cache, and the branch is perfectly predictable). But if you are calling System.MathF.Sqrt outside of tiny inner loops, then each.and.every.call will have this extra memory fetch and a branch.
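The shape of that per-call check can be sketched in a few lines of C++. This is a simplified stand-in, not IL2CPP’s actual runtime code: the TypeInfo struct, its field names and the init counter are hypothetical, and only the “test a flag, call the slow path if it is not set” pattern is taken from the assembly above.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a type-info record; only the flag matters here.
struct TypeInfo {
    std::int32_t cctor_finished = 0; // assumed meaning of the dword tested in the assembly
    int          init_calls = 0;     // instrumentation for this sketch only
};

// Out-of-line slow path: performs the one-time class initialization.
inline void runtime_class_init(TypeInfo& type) {
    ++type.init_calls;
    type.cctor_finished = 1;
}

// Inline guard that runs before every use: one load, one compare, one branch.
inline void runtime_class_init_inline(TypeInfo& type) {
    if (type.cctor_finished == 0)
        runtime_class_init(type);
}

// The benchmark loop with the guard inside the hot path, as in the generated C++.
inline float sum_sqrt_guarded(TypeInfo& type, float v, int n) {
    for (int i = 0; i < n; ++i) {
        runtime_class_init_inline(type);
        v += std::sqrt(v);
    }
    return v;
}
```

The slow path runs once, but the load, compare and branch are paid on every iteration.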

IL2CPP, Unity.Mathematics.math.sqrt

For the Mathematics.math.sqrt case, things get slightly weirder under IL2CPP: 1) instead of one “some sort of lazy initialization” branch like in case above, now it has two branches for each and every call, and 2) the actual square root is done in double precision.

Generated C++ code:

IL2CPP_MANAGED_FORCE_INLINE IL2CPP_METHOD_ATTR float math_sqrt_longGUID_inline (float x, const RuntimeMethod* method) 
{
  static bool s_Il2CppMethodInitialized;
  if (!s_Il2CppMethodInitialized)
  {
    il2cpp_codegen_initialize_runtime_metadata((uintptr_t*)&Math_longGUID_il2cpp_TypeInfo_var);
    s_Il2CppMethodInitialized = true;
  }
  {
    il2cpp_codegen_runtime_class_init_inline(Math_longGUID_il2cpp_TypeInfo_var);
    double l1 = sqrt((double)x);
    return (float)l1;
  }
}

which then translates into this assembly for the inner loop:

cmp         byte ptr [s_Il2CppMethodInitialized],0  
jne         inited1
lea         rcx,[Math_longGUID_il2cpp_TypeInfo_var]  
call        il2cpp_codegen_initialize_runtime_metadata
mov         byte ptr [s_Il2CppMethodInitialized],1  
inited1:
mov         rcx,qword ptr [Math_longGUID_il2cpp_TypeInfo_var]  
cmp         dword ptr [rcx+0E4h],0  
jne         inited2
call        il2cpp_codegen_runtime_class_init
inited2:
xorps       xmm1,xmm1  
xorps       xmm0,xmm0  
cvtss2sd    xmm1,xmm6  
ucomisd     xmm0,xmm1  
ja          edge_case
sqrtpd      xmm0,xmm1  
jmp         iter_end
edge_case:
movaps      xmm0,xmm1  
call        sqrt
iter_end:
cvtsd2ss    xmm0,xmm0  
addss       xmm6,xmm0 

Again, for this benchmark the two extra branches do not matter, but they might if you are calling math.sqrt not from inside of a tiny loop body. What does matter, and why under IL2CPP this is slower, is that the square root is done at double precision.

So there! Unity math is complex!

Well, that was something. Is the original advice of prefer System.MathF over UnityEngine.Mathf valid? Yes, unless you want Burst; there it simply does not work.

My takeaways:

  • I hope the upcoming switch to .NET / CoreCLR will clear up a lot of that mess, especially the “even if you don’t spell out doubles anywhere in your code, Mono does everything in doubles in Unity” part. And even without double precision, the Mono codegen is… not great.
  • Unity is quite inconsistent in how it treats precision of various math functions. Some of them are implemented as-if they were double precision, but IL2CPP and Burst magically treat them as single precision. Sometimes IL2CPP and Burst disagree on which ones get the special treatment.
    • Given that CoreCLR switch will have some potential backwards compat breakages anyway, I hope Unity will sanitize the math functions precision treatments in the same go.
  • It would be nice if you could use “functions that look & feel the same” (like UnityEngine.Mathf.Sqrt, System.MathF.Sqrt and Unity.Mathematics.math.sqrt) as being exactly equivalent, with no preferential treatment of one vs. the other. That is very much not the case today however, and what’s worse, there is no single answer for “which one is best”. It all depends whether you use IL2CPP or Burst, or both, or neither!
  • If you want best performance now, use Burst and Mathematics maths.
  • Also, you might want to look into Sebastian’s cpp2better, that is aimed at improving IL2CPP codegen. I have not evaluated it in this post however.

Syntonic Dentiforms redux

Some 22 years ago nesnausk! made a demo Syntonic Dentiforms. That was 2004! So of course the demo was written for Windows, 32 bit, Direct3D 9, used D3DX Effects Framework, and was compiled with Visual Studio 6. It used fairly-new at the time pixel shader model 2.0 (heck yeah!), but also had fallback rendering paths for shader models 1.4 and 1.1. Good times.

Now I took the source code of it, looked at it in horror, and rebuilt it for current platforms.

  • Replaced D3D9 / D3DX with sokol_gfx,
  • Replaced FMOD for audio playback with sokol_audio and stb_vorbis,
  • Instead of Windows / DX9 32 bit, now it compiles for Windows / DX11, Linux / OpenGL, macOS / Metal (all 64 bit), as well as Web (Emscripten / WebGL2),
  • Replaced Object-ID based shadowing with regular shadow maps, using Castaño’s 5x5 PCF filter,
  • All lighting is now per-pixel (previously reflections were lit per-vertex), lighting vectors are normalized more properly and the reflections are anti-aliased.

Here are the builds and the source code:

Musings on source code

This made me realize that the code I was writing 22 years ago was really bad, judging by my standards today. So. much. pointless. abstractions. and. design. patterns. and. inheritance. Out of curiosity, I tried rewriting the parts of the code that I understand (there are some that I don’t; I left them as they are), just to see how much simpler and smaller the code can get.

For example everything related to “animations” initially was this: 16 files, with interfaces, and listeners, and traits, and whatever. IAnimChannel, CAnimChannel<T>, CAnimContext, CAnimCurve<T>, CAnimImmediateMixer<T>, IAnimListener<T>, IAnimStream<T>, CAbstractTimedAnimStream<T>, CAnimStreamMixer<T>, traits::anim_type<T>, IAnimation<T>, CAnimationBunch, CSampledAnimation<T>, CTimedAnimStream<T> – just, whyyyy. All of that can be simplified into two files with way fewer parts (AnimCurve, SampledAnimation, AnimationBunch). Same story with “graphics” or “resource loading” related parts.

So, what was 24 thousand lines of code across 216 source files, became 6 thousand lines of code across 49 files. Does anyone care? No, of course not. But I did it anyway :)

The executable became a megabyte smaller, by the way. Mostly because it was using D3DX (effects framework, texture loading, math), and I replaced them with other, smaller, libraries that do less stuff. I ❤️ sokol libraries by Andre Weissflog; they are simple, straight to the point, and let me get this working across all of windows/linux/mac/web. It is funny that back then, Andre’s Nebula Device game engine design was pretty influential for us, with all the abstractions and object-orientation. Sokol is almost the complete opposite, and I love that.

The demo has a special place in my heart since this is the first “not complete shit” demo that I worked on :) We also managed to get a scene.org Breakthrough Performance award for it! Ren here is completely unfazed by the award though. She shows that awards are just a social construct.


Rapidhash Unity port

Ten years ago I was writing about various non-cryptographic hash functions. Back then xxHash was new (introduced in 2014)! However, quite some things have changed since then. xxHash itself got a new “XXH3” version (2020); “wyhash” appeared (2020+), and eventually evolved into “rapidhash” (2024+). Many others too, but this is about rapidhash.

It is small and beautiful. Yes, current (V3) version is over 500 lines of C code, but that is three hash function variants and several tweaking options.

I ported it to C# (Unity/Burst) and the full core rapidhash implementation is barely over 100 lines of code.

  • Full repository: UnitySmolRapidhash on github (MIT license).
  • The actual source file: SmolRapidhash3.cs
  • It uses Unity’s Burst to get access to a 128 bit multiply function, and the code itself has [BurstCompile] on it.
  • API is similar to the Unity.Collections.xxHash3 class, except it returns a 64 bit value directly instead of an int2, and has helper entry points for hashing a single struct or various arrays:
    static ulong Hash64<T>(ref T key) where T : unmanaged;
    static ulong Hash64<T>(T[] key) where T : unmanaged;
    static ulong Hash64<T>(Span<T> key) where T : unmanaged;
    static ulong Hash64<T>(NativeArray<T> key) where T : unmanaged;
    static ulong Hash64(void* key, long length);
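The reason a 128 bit multiply matters: as far as I understand the reference C implementation, the core mixing step of rapidhash (and wyhash before it) multiplies two 64 bit values into a 128 bit product and XORs the two halves together. Below is a portable C++ sketch of that primitive, written without compiler intrinsics. The names U128, mul64x64 and mix are made up for illustration; this is not code from the Unity port.

```cpp
#include <cstdint>

struct U128 {
    std::uint64_t lo; // low 64 bits of the product
    std::uint64_t hi; // high 64 bits of the product
};

// Full 64x64 -> 128 bit unsigned multiply built from four 32x32 -> 64 products.
inline U128 mul64x64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    const std::uint64_t p00 = a_lo * b_lo;
    const std::uint64_t p01 = a_lo * b_hi;
    const std::uint64_t p10 = a_hi * b_lo;
    const std::uint64_t p11 = a_hi * b_hi;

    // Sum of everything that lands in bits 32..63, keeping the carry in bits 32+.
    const std::uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);

    U128 r;
    r.lo = (p00 & 0xFFFFFFFFu) | (mid << 32);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

// Multiply-and-fold mixing step: XOR of the low and high product halves.
inline std::uint64_t mix(std::uint64_t a, std::uint64_t b) {
    const U128 p = mul64x64(a, b);
    return p.lo ^ p.hi;
}
```

On real hardware this is a single multiply instruction, which is why getting at it through an intrinsic is worth the trouble.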
    

Performance

Burst approaches native (C) performance of rapidhash at larger input sizes, nice!

  • The calling benchmark program is just a C# (not Burst) script tested in the editor; that might be the cause of some overhead for small input sizes.
  • Curiously, C#/Burst port of XXH3 (as provided by Unity.Collections package) is 30-40% slower than native (C) implementation. This slowdown is not there for rapidhash.

Rapidhash is always faster than XXH3; the difference is more pronounced on arm64.

Ryzen 5950X / Windows / Visual Studio 2022 (17.14.23): rapidhash reaches 38GB/s. Native XXH3 version is similar for large input sizes (slower for small sizes). However the C#/Burst version of XXH3 only reaches 24GB/s.

Apple M4 Max / macOS / Xcode 16.1: rapidhash reaches 67GB/s. Native XXH3 version reaches 50GB/s, and C#/Burst version of XXH3 reaches only 30GB/s.

That’s it!


Two years of Blender VSE

So, Blender 5.0 has shipped while I was away at the excellent Graphics Programming Conference, but while all that was happening, I realized it has been two years since I started working mostly on the Blender Video Sequence Editor (VSE). Perhaps not surprisingly, a year ago it was one year of that :)

Just like two years ago when I started, I am still mostly flailing my arms around, without realizing what I’m actually doing.

The good

It feels like VSE has recently gotten quite a few improvements across workflow, user experience and performance. The first release I contributed anything to was Blender 4.1, and look what has happened since then (pasting screenshots of the release overview pages):

4.1 (full notes):

4.2 (full notes):

4.3 (full notes):

4.4 (full notes):

4.5 (full notes):

5.0 (full notes):

In addition to user-facing features or optimizations, there have also been quite a lot of code cleanups; too many to list individually, but for a taste you could look at the “winter of quality” task list of last year (#130975) or the WIP list of the upcoming “winter of quality” (#149160).

All of this was done by 3-4 people, all of them working on VSE part time. That’s not too bad! I seem to have landed about 200 pull requests in these two years. Also not terrible!

For the upcoming year, we want to tackle three large items: 1) more compositor node-based things (modifiers, effects, transitions), including better performance for them, 2) hardware acceleration for video decoding/encoding, 3) workflows like media bins, media preview, three point editing. That and more “wishlist” type items are detailed in this devtalk thread.

If you have tried Blender video editor a long time ago, and were not impressed, I suggest you try it again! You might still not be impressed, but then you would have learned to not trust anything I say :P

The bad

It can’t all be good; some terrible things have also happened in Blender VSE land. For one, I have become the “module owner” (i.e. “a lead”) of the VSE related work. Uh-oh!

The wishlist

From the current “things we’d want to work on”, an obviously lacking part is everything related to audio – VSE has some audio functionality, but nowhere near enough for a proper video editing toolbox. Currently, out of “just, like, three” part-time people working on VSE, no one is doing audio besides maintenance.

More community contributions in that area would be good. If you want to contribute, check out new developer documentation and #module-sequencer on the developer chat.


OpenEXR vs tinyexr

tinyexr is an excellent simple library for loading and saving OpenEXR files. It has one big advantage, in that it is very simple to start using: just one source file to compile and include! However, it also has some downsides, namely that not all features of OpenEXR are supported (for example, it can’t do PXR24, B44/B44A, DWAA/DWAB, HTJ2K compression modes), and performance might be behind the official library. It probably can’t do some of the more exotic EXR features either (e.g. “deep” images), but I’ll ignore those for now.

But how large and how complex to use is the “official” OpenEXR library, anyways?

I do remember that a decade ago it was quite painful to build it, especially on anything that is not Linux. However these days (2025), that seems to be much simpler: it uses a CMake build system, and either directly vendors or automatically fetches whatever dependencies it needs, unless you really ask it to “please don’t do this”.

It is not exactly a “one source file” library though. However, I noticed that OpenUSD vendors OpenEXR “Core” library, builds it as a single C source file, and uses their own “nanoexr” wrapper around the API; see pxr/imaging/plugin/hioOpenEXR/OpenEXR. So I took that, adapted it to more recent OpenEXR versions (theirs uses 3.2.x, I updated to 3.4.4).

So I wrote a tiny app (github repo) that reads an EXR file, and writes it back as downsampled EXR (so this includes both reading & writing parts of an EXR library). Then I compared the binary sizes of tinyexr and OpenEXR, as well as their respective source code sizes and performance.

Actual process was:

  • Take OpenEXR source repository (v3.4.4, 2025 Nov),
    • Take only the src/lib/OpenEXRCore and external/deflate folders from it.
    • openexr_config.h, compression.c, internal_ht.cpp have local changes! Look for LOCAL CHANGE comments.
  • Take OpenJPH source code (v0.25.3, 2025 Nov) and put it under external/OpenJPH.
  • Take openexr-c.c, openexr-c.h, OpenEXRCoreUnity.h from the OpenUSD repository. They were for OpenEXR v3.2, and needed some adaptations for later versions. OpenJPH part can’t be compiled as C, nor compiled as “single file”, so just include these source files into the build separately.
  • Take tinyexr source repository (v1.0.12, 2025 Mar).

Results

Library          Binary size, KB   Source size, KB   read+write time, s   Notes
tinyexr 1.0.12               251               726                 6.55
OpenEXR 3.2.4               2221              8556                 2.19
OpenEXR 3.3.5                826              3831                 1.68   Removed giant DWAA/DWAB lookup tables.
OpenEXR 3.4.3               1149              5373                 1.68   Added HTJ2K compression (via OpenJPH).
OpenEXR 3.4.4                649              3216                 1.65   Removed more B44/DWA lookup tables.
+ no HTJ2K                   370              1716                    -   Above, with HTJ2K/OpenJPH compiled out.
+ no DWA                     318                 -                    -   Above, and with DWAA/DWAB compiled out.
+ no B44                     305                 -                    -   Above, and with B44/B44A compiled out.
+ no PXR24                   303                 -                    -   Above, and with PXR24 compiled out.

Notes:

  • Machine is Ryzen 5950X, Windows 10, compiler Visual Studio 2022 (17.14), Release build.
  • This compares both tinyexr and OpenEXR in fully single-threaded mode. Tinyexr has threading capabilities, but it spins up and shuts down a whole thread pool for each processed image, which is a bit “meh”; and while OpenEXRCore can be threaded (and using full high level OpenEXR library does use it that way), the “nanoexr” wrapper I took from USD codebase does not do any threading.
  • Timing is total time taken to read, downsample (by 2x) and write back 6 EXR files, input resolution 3840x2160, input files are ZIP FP16, ZIP FP32, ZIP w/ mips, ZIP tiled, PIZ and RLE compressed; output is ZIP compressed.
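The “downsample (by 2x)” step between reading and writing could be as simple as a 2x2 box filter. The notes above do not say which filter the test app actually uses, so the C++ sketch below is an assumption for illustration only: a single-channel, row-major float image and a made-up function name.

```cpp
#include <cstddef>
#include <vector>

// 2x box downsample of a single-channel, row-major float image.
// Each output pixel is the average of a 2x2 block of input pixels.
// For odd sizes the last row/column of the input is ignored.
inline std::vector<float> downsample2x(const std::vector<float>& src, int width, int height) {
    const int out_w = width / 2;
    const int out_h = height / 2;
    std::vector<float> dst(static_cast<std::size_t>(out_w) * out_h);
    for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; ++x) {
            const std::size_t top = static_cast<std::size_t>(2 * y) * width + 2 * x;
            const std::size_t bottom = top + width;
            dst[static_cast<std::size_t>(y) * out_w + x] =
                0.25f * (src[top] + src[top + 1] + src[bottom] + src[bottom + 1]);
        }
    }
    return dst;
}
```

A real multi-channel EXR would run this per channel, but in a test like this the filter itself is cheap next to decompression and compression.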

That’s it!