Optimizing Oklab gradients

An example of how one might optimize Oklab color space gradients by… not doing anything related to Oklab itself!

The case at hand

I wrote about Oklab previously in the “gradients in linear space aren’t better” post. Now, let’s assume that the use case we have is this:

  • We have some gradients,
  • We need to evaluate them on a lot of things (particles, pixels, etc.),
  • Gradient colors are specified in sRGB (sometimes called “gamma space”), as 8-bit/channel values,
  • The evaluated gradient colors also have to be in sRGB, 8-bit/channel values. Why this, and not for example “linear” colors? Could be many reasons, ranging from “backwards compatibility” to “saving memory/bandwidth”.

What’s a gradient?

One simple way to represent a color gradient is to have color “keys” specified at increasing time values, for example:

struct Gradient
{
    static constexpr int kMaxKeys = 8;
    pix3 m_Keys[kMaxKeys]; // pix3 is just three bytes for R,G,B
    float m_Times[kMaxKeys];
    int m_KeyCount;
};

A gradient like the one above would have 5 keys (red, blue, green, white, black) at key times 0.0, 0.3, 0.6, 0.8, 1.0.

Ah! But how exactly the resulting gradient looks depends on how we interpolate between the color keys, which neatly ties into why we’d want Oklab to begin with. The gradient above interpolates the colors directly in sRGB space, i.e. “how everyone used to do it for many decades until recently”. Photoshop just added “Perceptual” (Oklab) and “Linear” interpolation modes, and the same gradient would then look like this – Classic (sRGB) at top, Perceptual (Oklab) in the middle, Linear at the bottom. See more examples in my previous blog post.

Assuming our gradient keys are sorted in increasing time order, code to evaluate the gradient might look like this:

pix3 Gradient::Evaluate(float t) const
{
  // find the keys to interpolate between
  int idx = 0;
  while (idx < m_KeyCount-1 && t >= m_Times[idx+1])
    ++idx;
  // we are past the last key; just return that
  if (idx >= m_KeyCount-1)
    return m_Keys[m_KeyCount-1];
  // interpolate between the keys
  float a = (t - m_Times[idx]) / (m_Times[idx+1] - m_Times[idx]);
  return lerp(m_Keys[idx], m_Keys[idx+1], a); // interpolate in sRGB directly
}

Evaluating the gradient in the three interpolation modes is exactly the same, all the way up to the last line:

  • sRGB: just a lerp, as above,
  • Linear: convert keys from sRGB to float Linear, lerp between them, convert back into fixed point sRGB,
  • Oklab: convert keys from sRGB to float Linear, then into Oklab, lerp between them, convert back into float Linear, then back into fixed point sRGB.

We’re gonna try to be smart upfront and save the division by m_Times[idx+1] - m_Times[idx], by precalculating the inverses just once, i.e. m_InvTimeDeltas[i] = 1.0f / (m_Times[i + 1] - m_Times[i]). All the related source code is in gradient.cpp, mathlib.h, oklab.cpp in my toy repository.
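
Just to make that concrete, here’s a rough sketch of the precalculation; the setup function name is made up, but m_InvTimeDeltas matches the member mentioned above:

void Gradient::UpdateInvTimeDeltas() // assumed helper, run whenever keys/times change
{
    for (int i = 0; i < m_KeyCount - 1; ++i)
        m_InvTimeDeltas[i] = 1.0f / (m_Times[i + 1] - m_Times[i]);
}

// ...and then inside Evaluate, the division becomes a multiply:
// float a = (t - m_Times[idx]) * m_InvTimeDeltas[idx];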

Initial performance

How much time does it take to evaluate a gradient with 7 color keys? We’re gonna do it 10 million times, on one thread, and measure the time it takes in milliseconds.

Platform            sRGB     Linear   Oklab
Windows, vs2022     125.2    619.2    2424.5
Windows, clang 13   115.7    601.3    2405.5
Linux, gcc 9.3      123.1    433.6    1567.2
Linux, clang 10     106.6    411.4    1099.4
Mac, clang 13       146.2    408.5    966.9

The Windows & Linux rows are on a PC (AMD Ryzen 5950X), the Mac row is on a MacBook Pro (M1 Max). Windows is Win10 21H2, Linux is Ubuntu 20 via WSL2, macOS is 12.1. Compiler options are -O2 for gcc & clang, Release for Visual Studio, everything else left at defaults.

Takeaways so far: Linear gradient interpolation is 3-6x slower than sRGB, and Oklab is 10-20x slower than sRGB. There are some variations between platforms & compilers, but overall patterns are similar.

Profiling the Windows build says that the majority of the time in the Linear & Oklab cases is spent raising numbers to a power:

  • Linear spends 481ms inside powf(),
  • Oklab spends 1649ms inside cbrtf(), and 515ms inside powf().

Stop doing the same work repeatedly

That’s often good performance optimization advice. Note the tail of the Oklab gradient evaluation function code:

// to-Linear -> to-Oklab -> lerp -> to-Linear -> to-sRGB
float3 ca = pix_to_float(m_Keys[idx]);
float3 cb = pix_to_float(m_Keys[idx+1]);
ca = sRGB_to_Linear(ca);
cb = sRGB_to_Linear(cb);
ca = Linear_sRGB_to_OkLab_Ref(ca);
cb = Linear_sRGB_to_OkLab_Ref(cb);
float3 c = lerp(ca, cb, a);
c = OkLab_to_Linear_sRGB_Ref(c);
c = Linear_to_sRGB(c);
return float_to_pix(c);

…all the calculations up until the lerp line do not depend on the gradient evaluation time at all! We could, instead of just storing gradient color keys in sRGB, also precalculate their Linear and Oklab values. This does add some extra storage space to the Gradient object, but perhaps saves a bit of computation.

So let’s do this (commit), and then the code above turns into:

float3 c = lerp(m_KeysOkLab[idx], m_KeysOkLab[idx+1], a);
c = OkLab_to_Linear_sRGB_Ref(c);
c = Linear_to_sRGB(c);
return float_to_pix(c);
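
For completeness, the precalculation itself is tiny; a sketch (the function and the m_KeysLinear member name are my own, everything else follows the names used above):

void Gradient::PrecomputeKeys() // assumed helper, run whenever the key colors change
{
    for (int i = 0; i < m_KeyCount; ++i)
    {
        float3 lin = sRGB_to_Linear(pix_to_float(m_Keys[i]));
        m_KeysLinear[i] = lin;                          // assumed member, used by the Linear mode
        m_KeysOkLab[i] = Linear_sRGB_to_OkLab_Ref(lin); // used by the Oklab evaluation above
    }
}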

And this gives the following performance numbers:

Platform            sRGB     Linear   Oklab
Windows, vs2022     124.9    271.1    321.8
Linux, clang 10     107.0    196.0    277.7
Mac, clang 13       141.8    224.4    286.8

Linear is now only 1.5-2.1x slower than sRGB, and Oklab is 2.0-2.6x slower than sRGB. Still slower, but not “orders of magnitude” slower anymore. Nice!

Profiling the Windows build says that Linear and Oklab still spend most of their remaining time inside powf() though, 152ms and 180ms respectively. This is all inside the Linear_to_sRGB function. Ok, now what?

Table based Linear to sRGB conversion

Notice that we effectively need to convert from a Linear float into fixed point (8-bit) sRGB. Right now we do that with a generic “linear float -> sRGB float” function, followed by a “normalized float -> byte” function. But it turns out people smarter than me figured out, a decade ago, that this can be done in a more optimal way. Of course that was Fabian ‘ryg’ Giesen, in this gist file. It has extensive comments there, go take a read.
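
To illustrate just the general “precompute the pow() away” idea – and this is not ryg’s code; his version is exact, faster and does not need a big table – a naive lookup-table version might look like this (it can be off by one on some inputs):

#include <math.h>
#include <stdint.h>

static uint8_t s_LinearToSRGBTable[4096];

static void InitLinearToSRGBTable() // call once at startup
{
    for (int i = 0; i < 4096; ++i)
    {
        // standard sRGB transfer function, evaluated at the bucket center
        float f = (i + 0.5f) / 4096.0f;
        float s = f <= 0.0031308f ? f * 12.92f : 1.055f * powf(f, 1.0f / 2.4f) - 0.055f;
        s_LinearToSRGBTable[i] = (uint8_t)(s * 255.0f + 0.5f);
    }
}

static uint8_t LinearToSRGB8(float x)
{
    if (!(x > 0.0f)) return 0;   // also catches NaN
    if (x >= 1.0f) return 255;
    return s_LinearToSRGBTable[(int)(x * 4096.0f)];
}

The gist linked above replaces all of this with a small table plus some float bit tricks, so go read that for the real thing.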

Let’s try ryg’s approach (commit):

Platform            sRGB     Linear   Oklab
Windows, vs2022     126.9    148.9    173.4
Linux, clang 10     107.7    132.7    164.6
Mac, clang 13       140.1    157.7    180.9

Linear is now 1.2x slower than sRGB, and Oklab is 1.3-1.5x slower than sRGB. Yay!

Removing one matrix multiply

All the way up until now, we have not actually modified anything about the Oklab calculations. The code & math we’re using come directly from the Oklab post.

But! If all we need is to linearly blend between Oklab colors, we can simplify this a bit. For our particular use case (evaluating gradients), we don’t need some bits of Oklab: we’re not interested in whether the Oklab numbers predict lightness, or whether “distances” between said numbers match perceived color differences. We just need to “nicely” interpolate between the gradient color keys.

Note that Linear -> Oklab conversion is effectively “multiply by matrix M1, apply cube root, multiply by matrix M2”. The opposite conversion is “multiply by inverse of M2, raise to 3rd power, multiply by inverse of M1”. We’re only going to be linearly interpolating between Oklab colors, so we can drop the multiplies related to matrix M2 and the result will be the same (minus a tiny amount of floating point rounding). That is, leave only the “multiply by matrix M1, apply cube root” and “raise to 3rd power, multiply by inverse of M1” parts.
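
To make that concrete, here’s a sketch of the reduced conversion pair, using the M1 coefficients published in the Oklab post (the function names and the float3 constructor usage here are my own assumptions, not code from the repository):

float3 Linear_sRGB_to_LMS_Cbrt(float3 c)
{
    // multiply by M1...
    float l = 0.4122214708f * c.x + 0.5363325363f * c.y + 0.0514459929f * c.z;
    float m = 0.2119034982f * c.x + 0.6806995451f * c.y + 0.1073969566f * c.z;
    float s = 0.0883024619f * c.x + 0.2817188376f * c.y + 0.6299787005f * c.z;
    // ...then cube root; no multiply by M2
    return float3(cbrtf(l), cbrtf(m), cbrtf(s));
}

float3 LMS_Cbrt_to_Linear_sRGB(float3 c)
{
    // raise to 3rd power (undoing the cube root)...
    float l = c.x * c.x * c.x;
    float m = c.y * c.y * c.y;
    float s = c.z * c.z * c.z;
    // ...then multiply by the inverse of M1
    return float3(
        +4.0767416621f * l - 3.3077115913f * m + 0.2309699292f * s,
        -1.2684380046f * l + 2.6097574011f * m - 0.3413193965f * s,
        -0.0041960863f * l - 0.7034186147f * m + 1.7076147010f * s);
}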

Technically our gradient color keys are no longer in Oklab but rather in LMS; the gradient evaluation result is the same anyway.

And here are the results with this (commit):

Platform            sRGB     Linear   Oklab*
Windows, vs2022     125.4    144.6    167.3
Linux, clang 10     108.4    133.0    151.3
Mac, clang 13       143.7    162.0    182.5

Linear is now only 1.1-1.2x slower than sRGB, and Oklab is 1.3-1.4x slower than sRGB. So dropping a matrix multiply made things a tiny bit faster.

And that’s it for now! Maybe some other time I’ll write about evaluating gradients using SIMD, and see what happens.


Curious lack of sprintf scaling

Some days ago I noticed that on a Mac, doing snprintf calls from multiple threads shows a curious lack of scaling (see tweet). Replacing snprintf with the {fmt} library can speed up the OBJ exporter in Blender 3.2 by 3-4 times. This could have been the end of the story, filed under the “eh, sprintf is bad!” drawer, but I started to wonder why it shows this lack of scaling.

Test case

A simple test: convert two million integers into strings. And then try to do the same on multiple threads at once, i.e. each thread converts two million integers. If the number of threads is below the number of CPU cores, this should take about the same time – each thread would just happily be converting their own numbers, and not interfere with the other threads. That is, a “this scales nicely” result would be where the graph is completely horizontal - no matter how many threads are doing the work at once, it takes the same amount of wall time.

Yes the reality is more complicated, with CPU thermals, shared caches and whatnot coming into play, but we’re interested in broad patterns, not exact science here!
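
For reference, the shape of the test is roughly this (a minimal sketch, not the exact benchmark code behind the numbers below):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static void ConvertNumbers(int count)
{
    char buf[32];
    for (int i = 0; i < count; ++i)
        snprintf(buf, sizeof(buf), "%d", i);
}

int main()
{
    const int kCount = 2000000;
    for (int threads = 1; threads <= 8; ++threads)
    {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < threads; ++i)
            pool.emplace_back(ConvertNumbers, kCount);
        for (auto& t : pool)
            t.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - t0).count();
        printf("%i threads: %lld ms\n", threads, (long long)ms);
    }
    return 0;
}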

And here’s what happens on an Apple M1 Max laptop. Horizontal axis is thread count; vertical axis is milliseconds (log scale) taken. Again, a good result would be a horizontal line:

Converting two million numbers into strings takes 100 milliseconds when one CPU core is doing it. When all eight “performance” cores are doing it (i.e. in total 16 million integers), it takes 1.8 seconds, or 18 times as long. That’s, like, not great!

Yo dude, you should not use sprintf

“Well duh” you say, “obviously you should not use sprintf, you should use C++ iostreams”. Okay. Here’s converting integers into strings via a std::stringstream <<.

Same scaling issue, except iostreams are two times slower. “Zero cost abstractions”, you know :)

What’s going on?

Instruments shows that with 8 threads, each thread spends over 90% of the time in something called localeconv_l, where it is mostly mutex locks.

At this point you might be thinking, “ah-ha! well this is related to a locale, and a locale is global, so of course some time spent on some mutex lock is expected”, to which my reaction is “mmmaybe? but this amount of time feels excessive?”. Given that this is an Apple operating system, we might know it has a snprintf_l function which takes an explicit locale, and hope that this would make it scale. Just pass NULL, which means “use C locale”:
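
The call shape is roughly this (a sketch; I believe the _l variants come via <xlocale.h> on macOS):

#include <stdio.h>
#include <xlocale.h>

void convert(int number, char buf[32])
{
    // NULL locale means "use the C locale" here
    snprintf_l(buf, 32, NULL, "%d", number);
}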

…aaand, nope. It is a tiny bit faster, but does not really address the issue.

But! Large parts of the macOS Darwin kernel and system libraries have source code available, so let’s look at what’s going on. Here’s the latest localeconv_l at the time of writing: github link. It’s basically:

lconv* localeconv_l(locale_t loc)
{
    lock_on(loc);
    if (loc->something_changed)
    {
        // do some stuff
    }
    unlock_on(loc);
    // ...
}

and the lock used internally is just an os_unfair_lock macOS primitive. What is curious is that this code has very recently changed; before 2022 February it was like this:

lconv* localeconv_l(locale_t loc)
{
    if (loc->something_changed)
    {
        lock_on(loc);
        if (loc->something_changed)
        {
            // do some stuff
        }
        unlock_on(loc);        
    }
    // ...
}

Which to me feels like the previous code was trying to do a “double-checked locking” pattern, but without using actual atomic memory reads. That probably happens to work just fine on Intel CPUs, but might be more problematic elsewhere, like maybe on Apple’s own CPUs? And then someone decided to just always take that mutex lock, instead of investigating possible use of atomic operations.

Now, Apple’s OS is BSD-based, so we can check what other BSD based systems do.

  • FreeBSD does not have any mutexes there, and before 2021 September was just checking a flag. Since then, the flag check was changed to use atomic operations.
  • OpenBSD does not use any atomics or mutexes at all, and the “has something changed?” flag is not even per-locale, it’s just a global variable. YOLO!

So given all this knowledge, presumably, if each thread used a physically different locale object and snprintf_l, then it would scale fine. And it does:
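
A sketch of that “each thread gets its own locale object” variant (macOS/BSD APIs):

#include <locale.h>
#include <stdio.h>
#include <xlocale.h>

static void convert_many(int count)
{
    locale_t loc = newlocale(LC_ALL_MASK, "C", NULL); // this thread's very own locale
    char buf[32];
    for (int i = 0; i < count; ++i)
        snprintf_l(buf, sizeof(buf), loc, "%d", i);
    freelocale(loc);
}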

What else can we do?

Now, besides the old snprintf and std::stringstream, there are other things we can do. For example:

  • stb_sprintf, a trivial-to-integrate, public domain C library that is a full sprintf replacement, but without any locale-specific stuff. It’s also presumably faster, smaller and works the same across different compilers/platforms.
  • {fmt}, an MIT-licensed C++ library “providing a fast and safe alternative to C stdio and C++ iostreams”. {fmt} was the base for the C++20 formatting additions.
  • Not a general replacement, but if we only need to turn numbers into strings, C++17 has to_chars.

All of those scale just fine with increased thread usage, and all of them are way faster in the single-threaded case too. {fmt} looks very impressive. Yay!
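
Here’s roughly how the last two look in use (a sketch):

#include <charconv>
#include <cstddef>
#include <cstring>
#include <fmt/format.h>

size_t ConvertWithToChars(int number, char buf[32])
{
    // C++17: writes digits into a caller-provided buffer, no locale involved
    auto res = std::to_chars(buf, buf + 32, number);
    return res.ptr - buf; // number of characters written
}

size_t ConvertWithFmt(int number, char buf[32])
{
    // {fmt}: format_int is its specialized integer-to-string helper
    fmt::format_int s(number);
    memcpy(buf, s.data(), s.size());
    return s.size();
}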

Is this all Apple/Mac specific?

Let’s try all the above things on Windows with Visual Studio 2022. This one supports more things compared to clang 13 that I have on a Mac:

  • There is the C++20 formatting library with format_to_n. This uses the same type-safe syntax as the {fmt} library, and we can hope it would be of similar performance and scaling.
  • Similar to BSD-specific snprintf_l, Visual Studio has its own _snprintf_l.
  • Speaking of not-so-general solutions, Visual Studio also has itoa to convert integers into strings.

  • Unlike the Mac case, just the regular snprintf does not have the multi-threaded scaling issue! It takes around 100 milliseconds for two million integers, no matter how many threads are doing it at the same time.
  • C++ stringstream performance and scaling is really bad. It starts out 4x slower than snprintf at one thread, and goes up to a hundred times slower at 8 threads.
  • The new, hot, C++20 based formatting functionality using format_to_n is really bad too! It starts out 10x slower than snprintf (!), and goes up to 40x slower at 8 threads.

Ok, what is going on here?! Superluminal profiler to the rescue, and here’s what it says:

The stringstream, in one thread case, ends up spending most of the time in the infamous “zero-cost abstractions” of C++ :) A bunch of function calls, a tiny bit of work here and there, and then somewhere deep inside it ends up calling snprintf anyway. Just all around that, tiny bits and pieces of cost all add up. In the 8 threads case, it ends up spending all the time inside mutex locks, quite similar to how Mac/Apple case was doing. Just here it’s C++, so it ends up being worse - there’s not a single mutex lock, but rather what looks like three mutex locks on various parts of the locale object (via std::use_facet of different bits), and then there’s also reference counting, with atomic increase/decrease operations smashing the same locale object.

The format_to_n, in the one thread case, ends up spending all the time in… 🥁… loading resource files (!). Each and every “plz turn this integer into a string” call ends up doing:

  • Create something called a _Fmt_codec object, which
  • Calls __std_get_cvt, which
  • Figures out “information about installed or available code page” via GetCPInfoExW, which
  • Ends up calling FindResourceExW and LoadResource on something. Which then call LdrpLoadResourceFromAlternativeModule and LdrpAccessResourceDataNoMultipleLanguage and so on and so on.

In the 8 threads case, that is all the same, except all that resource loading is presumably on the same “thing”, so it ends up spending a ton of time deep inside the OS kernel doing MiLockVadShared, and MiUnlockAndDereferenceVadShared, and LOCK_ADDRESS_SPACE_SHARED and so on.

So that is something I would not have expected to see, to be honest. Curiously enough, there is a similar-sounding issue on Microsoft’s STL GitHub, which has been marked resolved since 2021 April.

And no, the usual Internet advice of “MSVC sucks, use Clang” does not help in this particular case. With Clang 13 the C++20 formatting library is not available yet, but all the other options look pretty much the same, including the disappointing performance of stringstream:

What about Linux?

I only have an Ubuntu 20 install via WSL2 here to test, and using the default compilers there (clang 10 and gcc 9.3), things look pretty nice:

C++20 format library is not available in either of these compilers to test, but everything else scales really well with increased thread count. {fmt} continues to be impressive there as well.

Conclusion

Would you have expected a “turn an integer into a string” routine to be loading resource file information blocks from some library, for each and every call? Yeah, me neither.

Technically, there are no bugs anywhere above - all the functions work correctly, as far as the standard is concerned. But some of them have interesting (lack of) multi-core scaling behavior, some others just have regular performance overheads compared to the others, etc.

If you need to target multiple different compilers & platforms, and want consistent performance characteristics, then avoiding some parts of C or C++ standard libraries might be one way. Or at least, do not assume anything about performance (and especially about multi-thread scaling) characteristics of the standard libraries.

If you need to do string formatting in C++, I can highly recommend using {fmt}.


Speeding up Blender .obj export

This tweet by @zeuxcg sparked my interest:

If you think of Ryu as the gold standard of shortest correctly rounded floating point output, note that there’s still active research happening in this area, with papers from 2020-2021 (Schubfach, Dragonbox), with both being noticeably faster than Ryu.

and then I was thinking “interesting, if I find some code that prints a lot of floats, I should test out these new algorithms”. And then somehow I was casually profiling Blender’s .obj exporter, noticed that it spends most of the time inside fprintf, and went 💡.

Note: I was profiling a Blender 3.1 beta build, which has a new obj exporter, written in C++ (the previous one was written in Python). This new exporter is already 8x-12x faster than the old one, nice!

Typical reactions to the observation

Now, the internet being the internet, there are a bunch of “typical reactions” you might get when you notice something and raise a question about it. Especially if you’re measuring performance of a new, hot, fast! thing and wondering whether it might be somewhat suboptimal. Here’s a sampling of actual responses I got to the “obj exporter spends most of its time inside fprintf” observation:

  • “If 95% of the time is in fprintf then the export is super fast”
  • “The obj exporter generates files, right? So we need some kinda of fprint”
  • “Text based exporter, spends most its time in printf, news at 11”
  • “I think fprintf does by block flushing by default on files” (in response that a buffer above fprint might be useful)
  • “Is perf actually an issue? I mean if it spends 145 of 178 ms exporting a super large file”
  • “I don’t think that mutex locks add a significant amount of overhead here, because everything is on a single thread”
  • “That’s 20 lines full of potential off by one errors” (response to adding buffering above fprintf, ~20 lines of code)
  • “If you are I/O bound, memory mapping your files makes a big difference”

In many situations like this, people are raising valid questions, or expressing sensible doubts, or repeating “common wisdom”. That’s fine! This is all well meaning, and beyond my very selective “hot takes” listed above, the discussions were healthy and productive. Sometimes answering the initial questions, discussing the doubts and ignoring the usual common wisdom might lead to interesting places.

Test setup

I was mostly measuring .obj file export times on two different scenes:

  1. monkey: a heavily subdivided object (monkey head at subdivision level 6). Produces a 330MB obj file with one object.
  2. splash: the Blender 3.0 splash screen (“Sprite Fright”). Produces a 2.5GB obj file with 24303 objects inside of it.

All the test numbers are from my Windows PC, Blender built with Visual Studio 2022 in Release mode, AMD Ryzen 5950X (32 threads), PCIe 4.0 SSD. It would be useful to have numbers from other compilers/setups, but I only have this one PC at the moment…

Now, again - the new obj exporter in Blender 3.1 is way faster than the old one. On monkey old→new is 49.4s→6.3s, on splash it’s 392.3s→48.9s. Very, very nice.

Initial observations

First off, the “is perf actually an issue” question. No, we are not at “milliseconds” – exporting splash takes 50 seconds, and that is not even a large scene by today’s standards.

Next up, we need to figure out whether we’re I/O bound. We could do a back-of-the-napkin calculation like: this SSD has a theoretical write speed of up to 4GB/s, so writing out a 2.5GB obj file should take under a second. Of course we’re not gonna reach the maximum write speed, but we’re off by 50 times.

We could also use some actual profiling, for example with the most excellent Superluminal. It says that WriteFile takes ~1.5 seconds. However, fprintf takes a whopping 41.5 seconds. So yes, the exporter does spend the absolute majority of its time calling a standard library function to format a string and write it out to a file, but the actual “write to a file” portion is tiny.

The screenshot above is the thread timeline from Superluminal, while exporting the splash scene. Time is horizontal axis (all 50 seconds of it), and each row is a thread. I cropped out most other threads; they show very similar patterns anyway. We can see the main thread being busy all the time (mostly inside fprintf), with occasional tiny activities on the job threads; these are multi-threaded mesh evaluations that Blender has (e.g. “get me all the geometry edges” and so on).

A buffer above fprintf

The Blender 3.1 obj exporter is written in a way where there are quite a lot of calls to fprintf(). For example, each vertex does the equivalent of fprintf(f, "v %f %f %f\n", x, y, z), and for mesh face definitions there are multiple calls for each face.

Each and every call to fprintf ends up doing several “overhead” things: taking a mutex lock around the file object, and looking up the current system locale via thread local storage. Yes, then eventually all the C standard FILE output ends up using “some” buffering mechanism, but the mutex/locale overhead is something that you still pay for every function call.

The I/O buffering mechanism used internally by C runtime functions also varies from system to system. For example, on Windows / MSVC, the default I/O buffer size (BUFSIZ) is 512 bytes. That seems fairly small, eh? Probably the value was chosen back in 1989, and now it can’t ever be changed, since that would break backwards compatibility.

Anyway, a manually implemented buffer (64 kilobytes) that text gets appended into via snprintf, and that gets written into the file once it’s full, was like 20 lines of code (and yes, 20 lines of possible off-by-one errors, as someone pointed out). 48.9s→42.5s. Not stellar, but not bad either.
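
The shape of that buffer is roughly this (a sketch of the idea, not the code that landed in Blender):

#include <cstddef>
#include <cstdio>
#include <cstring>

struct BufferedWriter
{
    explicit BufferedWriter(FILE* f) : m_File(f), m_Used(0) {}
    ~BufferedWriter() { Flush(); }

    void Write(const char* text, size_t len)
    {
        if (m_Used + len > kSize)
            Flush();
        if (len >= kSize) { fwrite(text, 1, len, m_File); return; } // oversized writes go straight through
        memcpy(m_Buffer + m_Used, text, len);
        m_Used += len;
    }
    void Flush()
    {
        if (m_Used != 0)
            fwrite(m_Buffer, 1, m_Used, m_File);
        m_Used = 0;
    }

    static const size_t kSize = 64 * 1024;
    FILE* m_File;
    size_t m_Used;
    char m_Buffer[kSize];
};

// usage for one vertex would then be roughly:
//   char tmp[64];
//   int len = snprintf(tmp, sizeof(tmp), "v %f %f %f\n", x, y, z);
//   writer.Write(tmp, len);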

Multi-threading all that printing

Now that the exporter output does not go directly into the file, but rather into “some memory buffer”, we could split the work up into multiple threads! Recall how the thread timeline showed that one thread is busy doing all the work, while all the others are twiddling thumbs.

There are several possible ways of splitting up the work. Initially I started like (pseudocode):

for each object:
    parallel for: write vertices
    write resulting text buffers into the file
    parallel for: write normals
    write resulting text buffers into the file
    parallel for: write texture coordinates
    write resulting text buffers into the file
    ...

but this approach does not scale all that well for small meshes. There were also some complexities involved in writing mesh face data, where there’s some amount of sequential logic that needs to be done for smoothing groups & material groups.

So I did this instead:

parallel for each object:
    write vertices
    write normals
    write texture coordinates
    ...
write resulting text buffers into the file
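
As a generic illustration of that “produce text per object in parallel, write sequentially” pattern – the real code uses Blender’s own task system; this sketch just uses C++17 parallel algorithms and made-up types:

#include <algorithm>
#include <cstdio>
#include <execution>
#include <string>
#include <vector>

struct ObjectData { /* mesh data for one object */ };

static std::string ProduceObjText(const ObjectData& obj)
{
    std::string text;
    // ...append "v ...", "vn ...", "vt ...", "f ..." lines for this object
    return text;
}

static void ExportAll(FILE* f, const std::vector<ObjectData>& objects)
{
    std::vector<std::string> buffers(objects.size());
    // produce each object's text independently, possibly on many threads
    std::transform(std::execution::par, objects.begin(), objects.end(),
                   buffers.begin(), ProduceObjText);
    // then write them out in order, on one thread
    for (const std::string& b : buffers)
        fwrite(b.data(), 1, b.size(), f);
}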

Here’s the resulting thread timeline, with time axis using the same scale as previous. 42.5s→12.1s:

Not bad! Of course this speedup is only there when exporting multiple objects; when exporting just a single mesh there’s not much threading going on. It could be improved by parallelising on both objects and within each object, i.e. combining the two pseudocode approaches above, but that’s an exercise for the reader (see Update below).

Caveat: now the exporter uses more memory. Previously it was just building whatever data structures it needed to hold exported object data, and then wrote output directly into a file. Now, it produces the file output into memory buffers (one for each object), before writing them out sequentially after all the thread jobs are finished. Additional memory usage while exporting the splash test case:

  • New Blender 3.1 exporter: +0.6GB.
  • My multi-threaded exporter: +3.1GB. That’s quite an increase, however…
  • Old Blender 3.0 exporter: +14.8GB!

Writing text files is not free

Digging more into where time is spent, Superluminal was pointing out that fwrite took 4.7s, but the actual WriteFile underneath was only about 1.5s. What’s the overhead? Writing a “text” file.

Turns out, the new exporter code was opening the FILE with "w" write mode, which on Windows means: find all the LF newlines in the written bytes, and change them into CRLF newlines. So it can’t just route my 64 kilobyte text chunks to be written; it needs to scan them, chop them into smaller lines or into some other buffer, etc. etc.

Really, that was just a bug/oversight in the new exporter code, since Blender’s documentation explicitly says: “OBJ’s export using Unix line endings \n even on Windows”. Changing file write mode to binary "wb" made that overhead disappear, 12.1s→8.7s:

Nice! That thread timeline is getting thinner.

Did you know? When Foo Fighters sing “lately I’ve been measuring / seems my time is growing thin”, that’s about a successful optimization story. The song is about someone working on a character deformation system: “skin and bones, skin and bones, skin and bones don’t you know?”

Multi-threading object data preparation

Before the exporter could start producing the final .obj file output, there is some preparation work needed. Basically it has to gather data from Blender’s data structures/format into something suitable for .obj format. Some of that work was already internally multi-threaded by Blender itself, but the remaining part was still mostly single threaded, and was taking about half of all export time now.

So the next logical step is to make the data extraction part parallel too, where possible. The final flow looks roughly like this:

for each object:
    gather material indices
    ensure normals/edges
parallel for each object:
    calculate normal & texture coordinates
for each object:
    calculate index offsets
parallel for each object:
    produce .obj text
write resulting text buffers into the file

And now the export time goes 8.7s→5.8s:

…aaaand that’s what landed into Blender 3.2 alpha, after Howard Trickey graciously reviewed it all. Timings on the two test cases:

  • Splash (2.5GB file, 24k objects): 48.9s→5.8s.
  • Monkey (330MB file, 1 object): 6.3s→4.9s.

🎉

(Update) Multi-threading within large meshes

A couple of days later I decided to also implement multi-threading within a mesh, for “large enough” meshes. Fairly simple: if a mesh has more than 32 thousand of something (vertices, normals, UVs, polygons), then chop that up into chunks of 32k each, produce their .obj texts in parallel, and join them into the final output buffer after that is done.

Without this, exporting just a single mesh was not going parallel much, e.g. here’s exporting the monkey (4.9s):

And here’s the same with doing parts of the export in parallel, within that one mesh (1.2s):

There’s still a part of the export that does not “go wide”; that one is doing some normal deduplication work that might be possible to parallelize, but is not “20 trivial lines of code”, so again, an exercise for the future generations.

…aaaand that’s what landed into Blender 3.2 alpha too. Timings on the two test cases, compared to Blender 3.1:

  • Splash (2.5GB file, 24k objects): 48.9s→5.2s.
  • Monkey (330MB file, 1 object): 6.3s→1.2s.

🎉🎉

What about faster float formatting?

Recall how everything here started because I wanted to look into the modern fast float formatting algorithms? We did not get to that part yet, eh?

Dragonbox (Jeon 2020) seems to be the fastest known algorithm right now. Turns out, it has been integrated into the {fmt} C++ library since late 2020, and one of the 3rd party libraries that Blender already uses (OpenImageIO) pulls {fmt} in…

Which makes it fairly easy to test it out. Hey look, another speedup! 5.8→4.9s on splash, 4.9s→3.5s on monkey:
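
For the curious, the change at a call site is roughly of this shape (a sketch, not the exact exporter code; note that "{}" produces shortest round-trip float output, which is not byte-identical to "%f"):

#include <fmt/format.h>
#include <iterator>
#include <string>

void AppendVertex(std::string& buf, float x, float y, float z)
{
    // before: snprintf(tmp, sizeof(tmp), "v %f %f %f\n", x, y, z) into a temp buffer
    fmt::format_to(std::back_inserter(buf), "v {} {} {}\n", x, y, z);
}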

So that’s nice. But pulling in the {fmt} library in such a hacky way has some complications with the Blender build process, so that still needs to be figured out. Stay tuned, maybe this will land (or maybe not!).

Learnings

  • Profile, profile, profile. Did I mention that Superluminal is excellent?
  • Your compiler’s standard library float formatting may or may not be fast. There’s quite exciting recent research in this area!
  • It’s hard to be I/O limited with modern SSDs, unless you’re literally doing zero additional processing.
  • Even small overheads add up to quite a lot over many function calls.
  • Getting a change into Blender was quite a bit easier than I expected. Yay! (or: “they really let anyone land code these days, eh”)
  • Just because something was made 10x faster, does not mean it can’t be made another 10x faster :)
  • “Common wisdom” may or may not be common, or wisdom.
  • Sometimes it’s helpful to explore something for no other reason than simple curiosity.

Gradients in linear space aren't better

People smarter than me have already said it (Bart Wronski on twitter), but here’s my take in blog post form too. (blog posts? is this 2005, grandpa?!)

When you want “a gradient”, interpolating colors directly in sRGB space does have a lot of situations where “it looks wrong”. However, interpolating them in “linear sRGB” is not necessarily better!

Background

In late 2020 Björn Ottosson designed the “Oklab” color space for gradients and other perceptual image operations. I read about it, mentally filed it under an “interesting, I should play around with it later” section, and kinda forgot about it.

Come October 2021, and Photoshop version 2022 was announced, including an “Improved Gradient tool”. One of the new modes, called “Perceptual”, is actually using Oklab math underneath.

Looks like CSS (“Color 4”) will be getting Oklab color space soon.

I was like, hmm, maybe I should look at this again.

sRGB vs Linear

Now - color spaces, encoding, display and transformations are a huge subject. Most people who are not into all that jazz have a very casual understanding of it. Including myself. My understanding boils down to two points:

  • The majority of images are in the sRGB color space, and stored using sRGB encoding. Storage is primarily for precision / compression purposes – it’s “quite enough” to have 8 bits/channel for regular colors, and precision across the visible colors is okay-ish.
  • Lighting math should be done with “linear” color values, since we’re basically counting photons, and they add up linearly.

Around year 2010 or so, there was a big push in real-time rendering industry to move all lighting calculations into a “proper” linear space. This kind-of coincided with overall push to “physically based rendering”, which tried to undo various hacks done in many decades prior, and to have a “more correct” approach to rendering. All good.

However, I think that, in many bystanders’ minds, this has led to an “sRGB bad, Linear good” mental picture.

That is the correct model when you’re thinking about calculating illumination, or other areas where physical quantities of countable things are added up. “I want to go from color A to color B in a way that looks aesthetically pleasing” is not one of them though!

Gradients in Unity

While playing around with Oklab, I found things about gradients in Unity that I had no idea about!

Turns out, today in Unity you can have gradients either in sRGB or in Linear space, and this is independent of the “color space” project setting. The math behind them is “just a lerp” in both cases of course, but it’s up to the system that uses the gradients to decide how they are interpreted.

Long story short, the particle systems (a.k.a. “shuriken”) assume gradient colors are specified in sRGB, and blended as sRGB; whereas the visual effect graph specifies colors as linear values, and blends them as such.

As I’ll show below, neither choice is strictly “better” than the other one!

Random examples of sRGB, Linear and Oklab gradients

All the images below have four rows of colors:

  1. Blend in sRGB, as used by a particle system in Unity.
  2. Blend in Oklab, used on the same particle system.
  3. Blend in Linear, as used by a visual effect graph in Unity.
  4. Blend in Oklab, used on the same visual effect graph.

Each color row is made up by a lot of opaque quads (i.e. separate particles), that’s why they are not all neatly regular:

Black-to-white is “too bright” in Linear.

Blue-to-white adds a magenta-ish tint in the middle, and also “too bright” in Linear.

Red-to-green is “too dark & muddy” in sRGB. Looks much better in Linear, but if you compare it with Oklab, you can see that in Linear, it feels like the “red” part is much smaller than the “green” part.

Blue-to-yellow is too dark in sRGB, too bright in Linear, and in both cases adds a magenta-ish tint. The blue part feels too narrow in Linear too.

Rainbow gradient using standard “VIBGYOR” color values is missing the cyan section in sRGB.

Black-red-yellow-blue-white adds magenta tint around blue in Linear, and the black part goes too bright too soon.

Random set of “muddy” colors - in Linear, yellow section is too wide & bright, and brown section is too narrow.

Red-blue-green goes through too dark magenta/cyan in sRGB, and too bright magenta/cyan in Linear.

Further reading

I don’t actually know anything about color science. If the examples above piqued your interest, reading material from people in the know might be useful. For example:

That’s it!


EXR: Filtering and ZFP

In the previous blog post I looked at using libdeflate for OpenEXR Zip compression. Let’s look at a few other things now!

Prediction / filtering

As noticed in the zstd post, OpenEXR does some filtering of the input pixel data before passing it to a zip compressor. The filtering scheme it does is fairly simple: assume input data is in 16-bit units, split that up into two streams (all lower bytes, all higher bytes), and delta-encode the result. Then do regular zip/deflate compression.
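
The shape of that filter is roughly this (a sketch of the idea, not the actual OpenEXR code):

#include <stddef.h>
#include <stdint.h>

void FilterForZip(const uint16_t* src, size_t count, uint8_t* dst)
{
    // split: first half of dst gets the low bytes, second half the high bytes
    for (size_t i = 0; i < count; ++i)
    {
        dst[i] = (uint8_t)(src[i] & 0xFF);
        dst[count + i] = (uint8_t)(src[i] >> 8);
    }
    // then delta-encode the whole byte stream
    uint8_t prev = 0;
    for (size_t i = 0; i < count * 2; ++i)
    {
        uint8_t cur = dst[i];
        dst[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

The decompression side does the inverse: un-delta, then interleave the two halves back into 16-bit values.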

Another way to look at filtering is in terms of prediction: instead of storing the actual pixel values of an image, we try to predict what the next pixel value will be, and store the difference between the actual and predicted value. The idea is that if our predictor is any good, the differences will often be very small, and those compress really well. If we had a 100% perfect predictor, all we’d need to store is “first pixel value… and a million zeroes here!”, which takes up next to nothing after compression.

When viewed this way, delta encoding is then simply a “next pixel will be the same as the previous one” predictor.

But we could build fancier predictors for sure! PNG filters have several types (delta encoding is the “Sub” type there). In audio land, DPCM encoding uses predictors too, and was invented 70 years ago.

I tried using what is called the “ClampedGrad” predictor (from a Charles Bloom blog post), which turns out to be the same as the LOCO-I predictor in JPEG-LS. It looks like this in pseudocode:

// +--+--+
// |NW|N |
// +--+--+
// |W |* |
// +--+--+
//
// W - pixel value to the left
// N - pixel value up (previous row)
// NW - pixel value up and to the left
// * - pixel we are predicting
int grad = N + W - NW;
int lo = min(N,W);
int hi = max(N,W);
return clamp(grad,lo,hi);

(whereas the current predictor used by OpenEXR would simply be return W)

Does it improve the compression ratio? Hmm, at least on my test image set, only barely. Zstd compression at level 1:

  • Current predictor: 2.463x compression ratio,
  • ClampedGrad predictor: 2.472x ratio.

So either I did something wrong :), or my test image set is not great, or this fancier predictor is just not worth it – the compression ratio gains are tiny.

Lossless ZFP compression

A topic jump! Let’s try ZFP (github) compression. ZFP seems to be primarily targeted at lossy compression, but it also has a lossless (“reversible”) mode which is what we’re going to use here.

It’s more similar to GPU texture compression schemes – 2D data is divided into 4x4 blocks, and each block is encoded completely independently from the others. Inside the block, various magic stuff happens and then, ehh, some bits get out in the end :) The actual algorithm is well explained here.

I used a ZFP development version (d83d343 from 2021 Aug 18). At the time of writing, it only supported float and double floating point data types, but in OpenEXR the majority of data is half-precision floats. I tested ZFP as-is, by converting half float data into floats back and forth as needed, but also tried hacking in native FP16 support (commit).

Here’s what I got (click for an interactive chart):

  • ▴ - ZFP as-is. Convert EXR FP16 data into regular floats, compress that.
  • ■ - as above, but also compress the result with Zstd level 1.
  • ● - ZFP, with added support for half-precision (FP16) data type.
  • ◆ - as above, but also compress the result with Zstd level 1.

Ok, so basically ZFP in lossless mode for OpenEXR data is “meh”. The compression ratio is not great (1.8x - 2.0x), and compression and decompression performance is pretty bad too. Oh well! If I look at lossy EXR compression at some point, maybe it would be worth revisiting ZFP then.

Next up?

The two attempts above were both underwhelming. Maybe I should look into lossy compression next, but of course lossy compression is always hard. In addition to “how fast?” and “how small?”, there’s a whole additional “how good does it look?” axis to compare with, and it’s a much more complex comparison too. Maybe someday!