Comparing BCn texture decoders

PC GPUs use “BCn” texture compression formats (see “Understanding BCn Texture Compression Formats” by Nathan Reed or “Texture Block Compression in Direct3D 11” by Microsoft). While most of the interest is in developing BCn compressors (see “Texture Compression in 2020” post), I decided to look into various available BCn decompressors.

Why would you want that? After all, isn’t that done by the GPU, magically and efficiently? Normally, yes. Except if for “some reason” you need to access pixels on the CPU, or perhaps to use BCn data on a GPU that does not support it, or perhaps just to decode a BCn image in order to evaluate the compression quality/error.

Given that the oldest BCn formats (BC1..BC3, aka DXT1..DXT5, aka S3TC) are over twenty years old, you would think this is a totally solved problem, with plenty of great libraries that all do it correctly and fast, and are easy to use.

So did I think :)

The task at hand is this: given an input image or BCn block bits, we want to know resulting pixel values. Our ideal library would: 1) do this correctly, 2) support all or most of BCn formats, 3) do it as fast as possible, 4) be easy to use, 5) be easy to build.

BCn decoding libraries

There are many available out there. Generally more libraries support the older BCn formats like BC1 or BC3, with only a few supporting the more modern ones like BC6H or BC7 (“modern” is relative; they came out in 2009). Here are the ones I could find that are usable from C or C++, along with the versions I tested, licenses, and format support:

Library      Version              License           1   2   3   4U  5U  6U  6S  7
amd_cmp      2022 Jun 20 (b2b4e)  MIT               ✓   ✓   ✓   ✓   ✓   ✓*      ✓
bc7enc_rdo   2021 Dec 27 (e6990)  MIT/PublicDomain  ✓       ✓   ✓   ✓           ✓
bcdec        2022 Jun 23 (e3ca0)  Unlicense         ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
convection   2022 Jun 23 (35041)  MIT                                   ✓   ✓   ✓
dxtex        2022 Jun 6 (67953)   MIT               ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
etcpak       2022 Jun 4 (a77d5)   BSD 3-clause      ✓       ✓
icbc         2022 Jun 3 (502ec)   MIT               -       -
mesa         2022 Jun 22 (ad3d6)  MIT               ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
squish       2019 Apr 25 (v1.15)  MIT               ✓   ✓   ✓
swiftshader  2022 Jun 16 (2b79b)  Apache-2.0        ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓

Notes:

  • The majority are “texture compression” libraries that happen to also have decompression functionality in there.
  • Some, like mesa or swiftshader, are much larger projects where texture decoding is only a tiny part of everything they do.
  • There are not that many libraries out there that support decoding all the BCn formats: only bcdec, dxtex, mesa and swiftshader. Compressonator (amd_cmp) gets close, except it does not have BC6H signed support.
  • I have not looked at BC4 and BC5 signed formats, so they are missing from the table.
  • icbc has functions to decode BC1 and BC3, but produces wrong results. So I have not tested it further.
  • bc7enc_rdo also includes rgbcx. Multiple versions of both exist in several repositories; I picked this one since it has a BC7 decoder with additional performance optimizations.
  • Compressonator (amd_cmp) produces visually “ok” results while decoding BC6H format, but it does not match the other decoders bit-exactly.

Decoding performance

I made a small program that loads a bunch of DDS files and decodes them with all the libraries: bcn_decoder_tester.

Here’s decoding performance, in Mpix/s, single-threaded, on Apple M1 Max (arm64 arch, clang13). Higher values are better:

What I did not expect: there’s up to a 20x speed difference between the various decoding libraries!

Here’s decoding performance on a Windows PC (Ryzen 5950X, x64 arch), using Visual Studio 2022 and clang 14 respectively:

Ease of use / build notes

For “build complexity”, I’m indicating source file count & source file size. This under-counts in many cases where the overall library is larger and pulls in a bunch of other headers it has. Only read these numbers as very rough ballpark estimates!

Library      Source, files  Source, KB  Notes
bcdec                    1          69  💙 Everything is great about this one! 💛
dxtex                    4         220  Does not easily build on non-Windows: requires bits of DirectX-Headers and DirectXMath github projects, as well as an empty <malloc.h> and some stub <sal.h> header.
swiftshader              6          77  Part of a giant project, but “just the needed” source files can be taken out surprisingly easily. Nice!
amd_cmp                 21         960  The github repo is 770MB payload (checkout 60MB). Does not decode BC6H bit-exactly like the other libraries.
mesa                    17         137  Part of a giant project, with quite a lot of header dependencies. Most of the decoding functionality is in “give me one pixel” fashion (so not expected to be fast); only BC6H/BC7 operate on whole blocks/images.
bc7enc_rdo               4         181  Two libraries: bc7decomp for BC7, and rgbcx for BC1/3/4/5. By far the fastest BC7 decoder (incl. SSE code for x64).
squish                  24         160  The library claims to support BC4/BC5, but really only for compression; decoding produces wrong results. Official repository is Subversion only.
etcpak                   3           7  Hard to use only the decoder functionality without building the whole library. I took only the decoder-related files and modified them to remove everything not relevant.
convection               9         487  Simple to use, only BC6H/BC7 though. Fastest BC6H decoder (by a small margin).

So which BCn decoder should I use?

For me, using bcdec was the easiest; it’s also the most “direct fit” – it’s a library that does one thing, and one thing only (decode BCn). Trivial to build, easy to use, supports all BCn formats. Initially I was a bit reserved about decoding performance, but then I did some low hanging fruit speedups and they got promptly merged, nice :) So right now: bcdec is the easiest to use, and almost always the fastest one too. 🎉
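
In case it’s useful, here is roughly what decoding a whole BC1 image with bcdec looks like. This is a sketch based on my reading of the single header; treat the exact function and macro names (bcdec_bc1, BCDEC_BC1_BLOCK_SIZE, BCDEC_IMPLEMENTATION) as assumptions and check bcdec.h itself:

#define BCDEC_IMPLEMENTATION
#include "bcdec.h"

// Decode a BC1 image into 8-bit RGBA; width/height assumed to be multiples of 4.
void decode_bc1_image(const unsigned char* src, unsigned char* rgba, int width, int height)
{
    for (int y = 0; y < height; y += 4)
    {
        for (int x = 0; x < width; x += 4)
        {
            // destination points at the top-left pixel of this 4x4 block;
            // destination pitch is in bytes (4 bytes per RGBA pixel)
            unsigned char* dst = rgba + (y * width + x) * 4;
            bcdec_bc1(src, dst, width * 4);
            src += BCDEC_BC1_BLOCK_SIZE; // 8 bytes per BC1 block
        }
    }
}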

I can’t quite recommend DirectXTex – while it sounds like the most natural place to search for BCn decoders, it’s both quite a hassle to build (at least outside of Windows), and is quite slow. Maybe the code is supposed to be “reference”, and not fast.

If you need fastest BC7 decoding, use bc7enc_rdo decoder. For fastest BC1/BC3 decoding, use etcpak (but it’s a bit of a hassle to build only the decoder).

That’s it!


Comparing .obj parse libraries

Wavefront .obj file format is a funny one. Similar to GIF, it’s a format from the 1990s, that absolutely should not be widely used anymore, yet it refuses to go away. Part of the appeal of OBJ is relative simplicity, I guess.

In the previous blog post I asked myself a question, “is this new Blender OBJ parsing code even good?” Which means, time to compare it with some other existing libraries for parsing Wavefront OBJ files.

OBJ parsing libraries

There are probably a thousand of them out there, in various states of quality, maintenance, feature set, performance, etc. I’m going to focus on the ones written in C/C++ that I’m aware of. Here they are, along with the versions that I’ve used:

Library         Version                          License
tinyobjloader   2021 Dec 27 (8322e00a), v1.0.6+  MIT
fast_obj        2022 Jan 29 (85778da5), v1.2+    MIT
rapidobj        2022 Jun 18 (0e545f1), v0.9      MIT
blender         2022 Jun 19 (9757b4ef), v3.3a    GPL v3
assimp          2022 May 10 (ff43768d), v5.2.3+  BSD 3-clause
openscenegraph  2022 Apr 7 (68340324), v3.6.5+   LGPL-based

Some notes:

  • blender is not a “library” that you can really use without the rest of Blender codebase.
  • In the performance comparisons below there will also be a tinyobjloader_opt, which is the multi-threaded loader from tinyobjloader “experimental” folder.
  • openscenegraph will be shortened to osg below.

Let’s have a quick overview of the feature sets of all these libraries:

Feature tinyobjloader fast_obj rapidobj blender assimp osg
Base meshes
Base materials
PBR materials
Vertex colors (xyzrgb)
Vertex colors (MRGB)
Lines (l)
Points (p)
Curves (curv) ✓*
2D Curves (curv2)
Surfaces (surf)
Skin weights (vw)
Subdiv crease tags (t)
Line continuations (\)
Produces GPU-ready data
Language / Compiler C++03 C89 C++17 C++17 C++??* C++??*
  • Blender OBJ parser has only limited support for curves: only bspline curve type is supported.
  • It’s not clear to me which version of C++ assimp requires. It’s also different from all the other libraries, in that it does not return you the “raw data”, but rather creates “ready to be used for the GPU” mesh representation. Which might be what you want, or not what you want.
  • The osg OBJ parser does not compile under C++17 out of the box (it uses features removed in C++17); I had to slightly modify it.

As you can see, even if the “base” functionality of OBJ/MTL is fairly simple and supported by all the parsing libraries, some more exotic features or extensions are not supported by all of them, or even by any of them.

Some libraries also differ in how they handle/interpret the more under-specified parts of the format. OBJ is a file format without any “official” specification; all it ever had was more of a “readme” style document. It’s funny that more modern alternatives, like Alembic or USD, also don’t really have specifications - they are both “here’s a giant code library and some docs, have fun”.

Test setup

I’m going to test the libraries on several files:

  1. rungholt: “Rungholt” Minecraft map from McGuire Computer Graphics Archive. 270MB file, 2.5M vertices, 1 object with 84 materials.
  2. splash: Blender 3.0 splash screen ("sprite fright"). 2.5GB file, 14.4M vertices, 24 thousand objects.

Test numbers are from a Windows PC (AMD Ryzen 5950X, Windows 10, Visual Studio 2022) and a Mac laptop (M1 Max, macOS 12.3, clang 13). “Release” build configuration is used. Times are in seconds, memory usages in megabytes.

All the code I used is in a git repository. Nothing fancy really, just loads some files using all the libraries above and measures time. I have not included the large obj files into the git repo itself though.

Performance

Look at that – even if all the parsing libraries are written in “the same” programming language, there is up to a 70x performance difference between them. For something as simple as an OBJ file format!

Note that I clipped the horizontal part of the graph, or otherwise the openscenegraph line length would make it hard to see all the others :)

Random observations and conclusions:

  • rapidobj is the performance winner here. On the technical level, it’s the most “advanced” out of all of them – it uses asynchronous file reading and multi-threaded parsing. However, it does not support macOS (update: rapidobj 0.9 added macOS support!). It also requires a fairly recent compiler with C++17 support.
  • fast_obj is the fastest single-threaded parser. It also compiles pretty much anywhere (any C compiler would do), but also has the least amount of features. However I could make it crash on syntax it does not support (\ line continuation); it might be the least robust of the parsers.
  • tinyobjloader_opt, which is the multi-threaded experimental version of tinyobjloader, is quite fast. However, it very much feels “experimental” – it has a different API than the regular tinyobjloader, is missing some parameters/arguments for it to be able to find .mtl files if they are not in the current folder, and also see below for the memory usage.
  • blender is not the fastest, but “not too bad, eh”, especially among the single threaded ones. The difference between blender-initial and blender-now is what I’ve described in the previous post.
  • assimp is not fast. Which is by design – their website explicitly says “The library is not designed for speed”. It also does more work than strictly necessary – while all the others just return raw data, this one creates a “renderable mesh” data structure instead.
  • tinyobjloader, which feels like it’s the default go-to choice for people who want to use an OBJ parser from C++, is actually not that fast! It is one of the more fully featured ones, though, and indeed very simple to use.
  • osg is just… Well, it did not fit into the graphs on the horizontal axis, nothing more to say :)

Memory usage

Memory usage on Windows, both peak usage, and what is used after all the parsing is done.

Notes:

  • tinyobjloader, fast_obj, rapidobj, blender are all “fine” and not too dissimilar from each other.
  • tinyobjloader_opt peak usage is just bad. For one, it needs to read all the input file into memory, and then adds a whole bunch of “some data” on top of that during parsing. The final memory usage is not great either.
  • assimp and osg memory consumption is quite bad. So was the blender-initial :)

Wait, how do you multi-thread OBJ parsing?

A whole bunch of things inside an OBJ file are “stateful” - negative face indices are relative to the vertex counts, meaning you need to know all the vertices that came in before; commands like smooth group or material name set the “state” for the following lines, etc. This can feel like it’s not really possible to do the parsing in parallel.

But! Most of the cost of OBJ parsing, besides reading the file, is parsing numbers from a text representation.

What rapidobj and tinyobjloader_opt both do is split the file contents into decently large “chunks”, parse them in parallel into “some representation”, and then produce the final data out of the parsed representation. They differ slightly in what their intermediate representation is, and in whether the whole file needs to be read into memory first (tinyobjloader_opt needs that, rapidobj does not).

In rapidobj’s case, it only starts doing multi-threaded parsing for files larger than 1MB, which makes sense – for small files the overhead of spawning threads would likely not pay off. But for giant files it does – the cost of converting text to numbers is much larger than the cost of the final data fixup/merge.
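
To make the idea concrete, here’s a minimal sketch (my own illustration, not rapidobj’s or tinyobjloader’s actual code), assuming the whole file is already in memory: split the buffer into chunks that end on newline boundaries, parse each chunk on its own thread, then do a cheap merge at the end. Real parsers handle all statement types, error reporting, index fix-ups and so on; this only handles “v x y z” lines.

#include <algorithm>
#include <charconv>
#include <string_view>
#include <thread>
#include <vector>

std::vector<float> parse_vertices_parallel(std::string_view text, int thread_count)
{
    std::vector<std::vector<float>> per_thread(thread_count);
    std::vector<std::thread> threads;
    size_t begin = 0;
    for (int i = 0; i < thread_count && begin < text.size(); ++i)
    {
        // nominal chunk end, moved forward to the next newline so no line is split
        size_t end = (i == thread_count - 1)
            ? text.size()
            : std::max(begin, text.size() * size_t(i + 1) / size_t(thread_count));
        if (end < text.size())
        {
            end = text.find('\n', end);
            if (end == std::string_view::npos) end = text.size();
        }
        std::string_view part = text.substr(begin, end - begin);
        threads.emplace_back([part, &out = per_thread[i]]()
        {
            size_t pos = 0;
            while (pos < part.size())
            {
                size_t eol = part.find('\n', pos);
                if (eol == std::string_view::npos) eol = part.size();
                std::string_view line = part.substr(pos, eol - pos);
                if (line.size() > 2 && line[0] == 'v' && line[1] == ' ')
                {
                    const char* p = line.data() + 2;
                    const char* pe = line.data() + line.size();
                    for (int c = 0; c < 3; ++c)
                    {
                        while (p < pe && *p == ' ') ++p;
                        float v = 0;
                        p = std::from_chars(p, pe, v).ptr; // the expensive text->number part
                        out.push_back(v);
                    }
                }
                pos = eol + 1;
            }
        });
        begin = end + 1;
    }
    for (auto& t : threads) t.join();
    std::vector<float> result; // cheap final merge
    for (auto& v : per_thread) result.insert(result.end(), v.begin(), v.end());
    return result;
}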

Does Blender need a yet faster OBJ parser?

Maybe? But also, at this point OBJ parsing itself is not what’s taking up most of the time. Notice how splash file time is 45 seconds in the previous post, and 7 seconds in this one?

That’s the thing – in order to fully “import” this 2.5GB OBJ file into Blender, right now 7 seconds are spent loading & parsing the OBJ file, and then 38 seconds are doing something with the parsed data, until it’s ready to be used by a user inside Blender. For reference, this “other work” time breakdown is roughly:

  • 20 seconds - ensuring that object names are unique. Blender requires objects to have unique names, and the way it’s implemented right now is basically quadratic complexity with the scene object count. Maybe I should finish up some WIP code
  • 10 seconds - assigning materials to the objects. Note, not creating materials, but just assigning them to object material slots. Likely some optimization opportunity there; from a quick look it seems that assigning a material to an object also needs to traverse the whole scene for some reason (wut?).
  • 10 seconds - some processing/calculation of normals, after they are assigned from the imported data. I don’t quite understand what it does, but it’s something heavy :)

…anyway, that’s it! Personally I’m quite impressed by rapidobj.


Speeding up Blender .obj import

A while ago I wrote about speeding up Blender’s Wavefront OBJ exporter. All that landed into Blender 3.2. And since that was quite a nice experience, I started to look into the OBJ importer.

Existing Python importer

Blender 3.1 and earlier has an OBJ importer written in Python. From a quick look, it was written “ages ago” and has remained largely unchanged since then. I’m going to test it on several files:

  1. rungholt: “Rungholt” Minecraft map from McGuire Computer Graphics Archive. 270MB file, 2.5M vertices, 1 object with 84 materials.
  2. splash: Blender 3.0 splash screen ("sprite fright"). 2.5GB file, 14.4M vertices, 24 thousand objects.

Test numbers are from my Windows PC, Blender built with Visual Studio 2022 in Release mode, AMD Ryzen 5950X (32 threads), PCIe 4.0 SSD. Times are seconds that it takes to do a full import.

            rungholt  splash
3.1 python  54.2      ~14000

Now, I know next to nothing about Python. Are these numbers good or not? 🤷‍♀️ My guesses are:

A minute to read through a couple hundred megabytes of text (rungholt) does not feel fast.

Four hours for the splash scene does sound excessive! I suspect it runs into some “other” bottleneck than pure calculations – the memory usage quickly grows to 20GB during import, and then stays there, with CPU usage not even fully utilizing one core. The machine does have 64GB of memory though; it’s definitely not swapping. Quick profiling just shows all the time inside some Python DLL without debug symbols. Maybe Python decides to, for example, run the garbage collector after each and every allocation, once it reaches “some” amount of allocated memory? I’ve no idea.

Finishing up existing GSoC project

During Google Summer of Code 2020, there was a project to rewrite OBJ exporter & importer in C++. The exporter part had just been finished and landed into Blender 3.1 (which I’ve further optimized for Blender 3.2, see previous blog post).

The importer was not quite finished yet; it had a final report, initial diff done in late 2020, and a bunch of mainline code merges & some fixes on a branch done in 2021.

So the first task was to try to finish that one up. This meant a bunch of merges, adding automated tests, fixing about 20 bugs/issues uncovered by manual or automated testing, going through code review and so on. And then all of that landed during Blender 3.2 alpha! 🎉 Here’s the performance of it:

                 rungholt  splash
3.1 python       54.2      ~14000
3.2 C++ initial  14.2      109

And then a bunch more fixes followed, and by now almost all known issues are fixed.

The new importer also uses quite a bit less memory, e.g. during import of rungholt it uses up 1.9GB of memory, compared to 7.0GB for the Python importer.

rungholt is about 4 times faster, and splash is over a hundred times faster. 🥳 Yay, we’re done here! Or… are we?

Optimizing it some more!

Similar to the previous blog post, I used Superluminal profiler while on Windows, and Xcode Instruments while on Mac. And then looked at what they pointed out, and thought about whether that could be sped up. Here’s a list:

  • Replace number parsing std::stof / std::stoi with C++17 from_chars.
    • But! You’d think that C++17 is fully supported since Clang 5 (docs) and GCC 5 (docs), but there are caveats… For example: from_chars on floating point numbers did not get implemented until Clang 14 and GCC 11 🤦‍♀️.
    • So I added an external library fast_float by Daniel Lemire to do that. That library/algorithm is what’s used in clang/llvm 14, so should be pretty good.
  • Stop reading the input file char-by-char using std::getline. Instead, read in 64kb chunks and parse from there, taking care of lines possibly being split mid-way due to chunk boundaries.
  • Remove abstractions for splitting a line by some char. For example, face parsing was done by splitting a line like f 1/2/3 4/5/6:
    • First into keyword f and “the rest” 1/2/3 4/5/6,
    • Then into corners: array with [1/2/3, 4/5/6],
    • Then splitting each corner by slash into indices: array [1,2,3]; array [4,5,6].
    • All of these individually are not much, but quickly add up. Now the parser is mostly a single forward scan through the input line: find f, decide it’s a face, scan for integers separated by slashes or spaces. No extra vector allocations or multiple scans over the string (there’s a small sketch of this parsing style right after this list).
  • Avoid tiny memory allocations: instead of storing a vector of polygon corners in each face, store all the corners in one big array, and per-face only store indices “where do corners start, and how many”. Likewise, don’t store full string names of material/group names for each face; only store indices into overall material/group names arrays.
  • Blender specific:
    • Stop always doing mesh validation, which is slow. Do it just like the Alembic importer does: only do validation if found some invalid faces during import, or if requested by the user via an import setting (which defaults to off).
    • Stop doing “collection sync” for each object being added; instead do the collection sync right after creating all the objects.
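
To illustrate the “single forward scan, no allocations” style with from_chars, parsing a vertex position might look roughly like this. This is a hedged sketch, not the actual Blender code; fast_float::from_chars has the same interface as std::from_chars, so either can be dropped in:

#include <charconv> // std::from_chars (or fast_float/fast_float.h)

struct float3 { float x, y, z; };

// Parse "1.0 2.0 3.0" given [p, end) pointing right after the "v " keyword.
// Returns the position after the parsed numbers; no temporary strings or vectors.
const char* parse_float3(const char* p, const char* end, float3& out)
{
    float vals[3] = {0, 0, 0};
    for (float& v : vals)
    {
        while (p < end && (*p == ' ' || *p == '\t'))
            ++p; // skip whitespace in place
        std::from_chars_result res = std::from_chars(p, end, v);
        p = res.ptr; // on failure ptr stays put; real code would report an error
    }
    out = { vals[0], vals[1], vals[2] };
    return p;
}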

All really simple stuff, without any fancy things like “multi threading” or “SIMD” or “clever algorithmic optimizations”. Is that faster? Why, yes, a bit:

                 rungholt  splash
3.1 python       54.2      ~14000
3.2 C++ initial  14.2      109.2
3.2 C++ opts     7.0       49.1

Note that the times here are what it takes to do the full import. Parsing the OBJ file is only a part of that; for example, in the splash scene it “only” takes about 12 seconds, and the rest of the time is spent creating Blender objects out of the parsed data. Some places there could be optimized too, and that would likely benefit other importers besides OBJ. Some other day, maybe…

These optimizations also have landed to Blender 3.2 alpha. Yay!

What about other OBJ parsers?

“Wait, but why did Blender have to write a new OBJ parser in C++? Aren’t there dozens of them written already?!”, you ask. That’s a very good question that I don’t know the answer to. My guess would be a combination of:

  • The new OBJ importer started as a Google Summer of Code project, and they tend towards “write new code!”, and much less, if at all, “integrate existing code into another codebase”.
  • Blender might have needed some obscure functionality that is not supported by existing OBJ parsing libraries. For example, line continuations (\), importing NURBS curves, etc. Anything that was supported by the previous Python importer, but not supported by the new importer, would be a regression in functionality.
  • Most of the actual complexity of the importer is not in the OBJ “parser”, but rather in the code that does something with the parsed result - creating Blender mesh structures and material node graphs.
  • Maybe people who started all of this were not aware of existing OBJ parsing libraries? :)

All that said… yes, at least a performance comparison with some existing OBJ parsing libraries would be in order, so that we’d know whether the parser in Blender is anywhere near being “okay”. And that will be a topic of an upcoming blog post. Here’s a teaser without spoilers:

…while gathering data for this upcoming blog post, I noticed one more strange thing, so here’s about it:

Sometimes string_view / StringRef is not free

While comparing Blender’s OBJ parser performance with other libraries on both Windows and Mac, I noticed a “hey wait, this feels like something is too slow on Windows” type of thing in the Blender parser. For example, it would have performance similar to another parsing library on a Mac, but be slower than that same library on Windows.

Now, these are different processors (AMD Ryzen 5950X vs Apple M1 Max), with different architectures (Intel vs ARM), different compilers (Visual Studio vs clang), and different operating systems. Any of these factors could affect performance. But still, something felt off.

Some more profiling and a bit of looking at disassembly later… it was due to convenience usage of Blender’s StringRef class (which is pretty much like C++17 string_view, but was done before they had C++17).

StringRef / string_view is just a “window” into some larger existing string. It’s very simple: just a pointer to characters, and the count of characters. A struct with two simple members, that does not allocate memory, does not do anything complicated, is trivially copy-able etc. etc. Surely it’s an abstraction that any Decent Compiler should be able to completely optimize out, right?!

The OBJ parser was using StringRef for convenience, with functions like “skip whitespace” or “parse a number” taking an input StringRef (representing the input line), and returning a new StringRef (representing the remainder of the line). For example:

StringRef drop_whitespace(StringRef str)
{
    while (!str.is_empty() && is_whitespace(str[0]))
        str = str.drop_prefix(1);
    return str;
}

This is convenient, but does more work than strictly needed – while parsing, only the “beginning” of the line ever changes by moving forward; the end of the line always stays the same. We can change the code to take a pair of pointers (begin of line, end of line) as input, and make the functions return the new begin of line pointer:

const char *drop_whitespace(const char *p, const char *end)
{
    while (p < end && is_whitespace(*p))
        ++p;
    return p;
}

“lol are you serious? doing things like that should not matter; the compiler optimizers these days are amazing!” - someone probably from the internet.

Calling Conventions enters the chat.

What’s the difference in the two functions above? They both take what is “two things” as input arguments (StringRef case: pointer to start, length; raw pointer case: two pointers). The StringRef function returns two things as well, i.e. the new StringRef object. The raw pointer version just returns a pointer.

Turns out, the Microsoft calling convention for 64-bit Intel architecture says that values up to 64 bits in size are returned in a CPU register, whereas larger values are returned via memory (SIMD values use slightly different rules). And that’s the rule of the platform, and the compilers must abide. This is not specific to the Visual Studio compiler; Clang on Windows must do the same as well (and it does).

Whereas Mac and Linux use a different calling convention; on 64-bit Intel architecture values up to 128 bits in size can be returned in a pair of CPU registers. This is probably the case for 64-bit ARM on Windows too (link).

And that’s the curious performance difference, that turned out to be very much Windows + Intel specific, in an otherwise completely platform agnostic code. The fix has just landed for Blender 3.3 alpha.

                       rungholt  splash
3.1 python             54.2      ~14000
3.2 C++ initial        14.2      109.2
3.2 C++ opts           7.0       49.1
3.3 C++ StringRef opt  5.8       45.5

And that’s a potential lesson: in functions heavily used in performance critical code paths, you might want to watch out for calling convention details! E.g. on Windows w/ Intel architecture, check whether your extremely often called functions return values larger than 64 bits. And check if not doing that speeds things up. It might!

So there. Right now OBJ importing is between 10x and 300x faster compared to the previous Python importer, and about 2.5x faster compared to the initial state of the C++ importer when I found it. Is ok.

…and that’s it for now. Until the next post about various OBJ parsing libraries!


Optimizing Oklab gradients

An example of how one might optimize Oklab color space gradients by… not doing anything related to Oklab itself!

The case at hand

I wrote about Oklab previously in the “gradients in linear space aren’t better” post. Now, let’s assume that the use case we have is this:

  • We have some gradients,
  • We need to evaluate them on a lot of things (particles, pixels, etc.),
  • Gradient colors are specified in sRGB (sometimes called “gamma space”), as 8-bit/channel values,
  • The evaluated gradient colors also have to be in sRGB, 8-bit/channel values. Why this, and not for example “linear” colors? Could be many reasons, ranging from “backwards compatibility” to “saving memory/bandwidth”.

What’s a gradient?

One simple way to represent a color gradient is to have color “keys” specified at increasing time values, for example:

struct Gradient
{
    static constexpr int kMaxKeys = 8;
    pix3 m_Keys[kMaxKeys]; // pix3 is just three bytes for R,G,B
    float m_Times[kMaxKeys];
    int m_KeyCount;
};

And a gradient like the one above would have 5 keys (red, blue, green, white, black) and key times 0.0, 0.3, 0.6, 0.8, 1.0.

Ah! But how exactly the resulting gradient looks depends on how we interpolate between the color keys, which neatly ties into why we’d want Oklab to begin with. The gradient above is directly interpolating the colors in sRGB space, i.e. “how everyone used to do it for many decades until recently”. Photoshop just added “Perceptual” (Oklab) and “Linear” interpolation modes, and the same gradient would then look like this – Classic (sRGB) at top, Perceptual (Oklab) in the middle, Linear at the bottom. See more examples in my previous blog post.

Assuming our gradient keys are sorted in increasing time order, code to evaluate the gradient might look like this:

pix3 Gradient::Evaluate(float t) const
{
  // find the keys to interpolate between
  int idx = 0;
  while (idx < m_KeyCount-1 && t >= m_Times[idx+1])
    ++idx;
  // we are past the last key; just return that
  if (idx >= m_KeyCount-1)
    return m_Keys[m_KeyCount-1];
  // interpolate between the keys
  float a = (t - m_Times[idx]) / (m_Times[idx+1] - m_Times[idx]);
  return lerp(m_Keys[idx], m_Keys[idx+1], a); // interpolate in sRGB directly
}

Evaluating the gradient in the three interpolation modes is exactly the same, all the way up to the last line:

  • sRGB: just a lerp, as above,
  • Linear: convert keys from sRGB to float Linear, lerp between them, convert back into fixed point sRGB,
  • Oklab: convert keys from sRGB to float Linear, then into Oklab, lerp between them, convert back into float Linear, then back into fixed point sRGB.
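
For reference, the sRGB <-> Linear conversions use the standard piecewise sRGB curves. Here is a scalar sketch; the actual code in mathlib.h presumably operates on float3, so treat this as an illustration rather than the repo’s exact code:

#include <cmath>

// standard sRGB transfer function, scalar version
float sRGB_to_Linear(float v)
{
    return (v <= 0.04045f) ? v / 12.92f : powf((v + 0.055f) / 1.055f, 2.4f);
}

float Linear_to_sRGB(float v)
{
    if (v <= 0.0f) return 0.0f;
    if (v >= 1.0f) return 1.0f;
    return (v <= 0.0031308f) ? v * 12.92f : 1.055f * powf(v, 1.0f / 2.4f) - 0.055f;
}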

We’re gonna try to be smart upfront and save a division by m_Times[idx+1] - m_Times[idx], by precalculating the inverses just once, i.e. m_InvTimeDeltas[i] = 1.0f / (m_Times[i + 1] - m_Times[i]). All the related source code is in gradient.cpp, mathlib.h, oklab.cpp in my toy repository.

Initial performance

How much time does it take to evaluate a gradient with 7 color keys? We’re gonna do it 10 million times, on one thread, and measure the time it takes in milliseconds.

Platform           sRGB   Linear  Oklab
Windows, vs2022    125.2  619.2   2424.5
Windows, clang 13  115.7  601.3   2405.5
Linux, gcc 9.3     123.1  433.6   1567.2
Linux, clang 10    106.6  411.4   1099.4
Mac, clang 13      146.2  408.5   966.9

The Windows & Linux rows are on a PC (AMD Ryzen 5950X), the Mac row is on a MacBook Pro (M1 Max). Windows is Win10 21H2, Linux is Ubuntu 20 via WSL2, macOS is 12.1. Compiler options are -O2 for gcc & clang, Release for Visual Studio, everything else left at defaults.

Takeaways so far: Linear gradient interpolation is 3-6x slower than sRGB, and Oklab is 10-20x slower than sRGB. There are some variations between platforms & compilers, but overall patterns are similar.

Profiling the Windows build says that the majority of the time in the Linear & Oklab cases is spent raising numbers to a power:

  • Linear spends 481ms inside powf(),
  • Oklab spends 1649ms inside cbrtf(), and 515ms inside powf().

Stop doing the same work repeatedly

That’s often good performance optimization advice. Note the tail of the gradient Oklab evaluation function code:

// to-Linear -> to-Oklab -> lerp -> to-Linear -> to-sRGB
float3 ca = pix_to_float(m_Keys[idx]);
float3 cb = pix_to_float(m_Keys[idx+1]);
ca = sRGB_to_Linear(ca);
cb = sRGB_to_Linear(cb);
ca = Linear_sRGB_to_OkLab_Ref(ca);
cb = Linear_sRGB_to_OkLab_Ref(cb);
float3 c = lerp(ca, cb, a);
c = OkLab_to_Linear_sRGB_Ref(c);
c = Linear_to_sRGB(c);
return float_to_pix(c);

…all the calculations up until the lerp line do not depend on the gradient evaluation time at all! We could, instead of just storing gradient color keys in sRGB, also precalculate their Linear and Oklab values. This does add some extra storage space to the Gradient object, but perhaps saves a bit of computation.

So let’s do this (commit), and then the code above turns into:

float3 c = lerp(m_KeysOkLab[idx], m_KeysOkLab[idx+1], a);
c = OkLab_to_Linear_sRGB_Ref(c);
c = Linear_to_sRGB(c);
return float_to_pix(c);

And this gives the following performance numbers:

Platform         sRGB   Linear  Oklab
Windows, vs2022  124.9  271.1   321.8
Linux, clang 10  107.0  196.0   277.7
Mac, clang 13    141.8  224.4   286.8

Linear is now only 1.5-2.1x slower than sRGB, and Oklab is 2.0-2.6x slower than sRGB. Still slower, but not “orders of magnitude” slower anymore. Nice!

Profiling the Windows build says that Linear and Oklab still spend most of their remaining time inside powf() though, 152ms and 180ms respectively. This is all inside the Linear_to_sRGB function. Ok, now what?

Table based Linear to sRGB conversion

Notice that we effectively need to convert from a Linear float into fixed point (8-bit) sRGB. Right now we do that with a generic “linear float -> sRGB float” function, followed by a “normalized float -> byte” function. But it turns out people smarter than me figured out a more optimal way to do this, a decade ago. Of course that was Fabian ‘ryg’ Giesen, in this gist file. It has extensive comments there, go take a read.
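
Just to illustrate the general “table based” idea (this is a much cruder stand-in, not ryg’s bit-trick version from the gist – read that one for the real thing):

#include <math.h>
#include <stdint.h>

// Precompute a table that maps quantized linear values straight to 8-bit sRGB.
// 4096 entries is enough that adjacent entries differ by at most about one output step.
static uint8_t s_LinearToSRGBTable[4096];

void InitLinearToSRGBTable()
{
    for (int i = 0; i < 4096; ++i)
    {
        float lin = i / 4095.0f;
        float srgb = (lin <= 0.0031308f) ? lin * 12.92f : 1.055f * powf(lin, 1.0f / 2.4f) - 0.055f;
        s_LinearToSRGBTable[i] = (uint8_t)(srgb * 255.0f + 0.5f);
    }
}

uint8_t LinearToSRGB8(float lin)
{
    if (lin <= 0.0f) return 0;
    if (lin >= 1.0f) return 255;
    return s_LinearToSRGBTable[(int)(lin * 4095.0f + 0.5f)];
}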

Let’s try this (commit):

Platform         sRGB   Linear  Oklab
Windows, vs2022  126.9  148.9   173.4
Linux, clang 10  107.7  132.7   164.6
Mac, clang 13    140.1  157.7   180.9

Linear is now 1.2x slower than sRGB, and Oklab is 1.3-1.5x slower than sRGB. Yay!

Removing one matrix multiply

All the way up until now, we have not actually modified anything about the Oklab calculations. The code & math we’re using come directly from the Oklab post.

But! If all we need is to linearly blend between Oklab colors, we can simplify this a bit. For our particular use case (evaluating gradients), we don’t need some bits of Oklab: we’re not interested in whether the Oklab numbers predict lightness, or whether “distances” between said numbers match perceived color differences. We just need to “nicely” interpolate between the gradient color keys.

Note that the Linear -> Oklab conversion is effectively “multiply by matrix M1, apply cube root, multiply by matrix M2”. The opposite conversion is “multiply by inverse of M2, raise to 3rd power, multiply by inverse of M1”. We’re only going to be linearly interpolating between Oklab colors, so we can drop the multiplies related to matrix M2 and the result will be the same (a lerp commutes with a matrix multiply: lerp(M2·x, M2·y, a) = M2·lerp(x, y, a), so M2 and its inverse cancel out), minus a tiny amount of floating point rounding. That is, leave only the “multiply by matrix M1, apply cube root” and “raise to 3rd power, multiply by inverse of M1” parts.

Technically our gradient color keys are no longer in Oklab, but rather in (cube root of) LMS; the gradient evaluation result is the same anyway.

And here are the results with this (commit):

Platform         sRGB   Linear  Oklab*
Windows, vs2022  125.4  144.6   167.3
Linux, clang 10  108.4  133.0   151.3
Mac, clang 13    143.7  162.0   182.5

Linear is now only 1.1-1.2x slower than sRGB, and Oklab is 1.3-1.4x slower than sRGB. So dropping a matrix multiply made things a tiny bit faster.

And that’s it for now! Maybe some other time I’ll write about evaluating gradients using SIMD, and see what happens.


Curious lack of sprintf scaling

Some days ago I noticed that on a Mac, doing snprintf calls from multiple threads shows a curious lack of scaling (see tweet). Replacing snprintf with the {fmt} library can speed up the OBJ exporter in Blender 3.2 by 3-4 times. This could have been the end of the story, filed under the “eh, sprintf is bad!” drawer, but I started to wonder why it shows this lack of scaling.

Test case

A simple test: convert two million integers into strings. And then try to do the same on multiple threads at once, i.e. each thread converts two million integers. If the number of threads is below the number of CPU cores, this should take about the same time – each thread would just happily be converting their own numbers, and not interfere with the other threads. That is, a “this scales nicely” result would be where the graph is completely horizontal - no matter how many threads are doing the work at once, it takes the same amount of wall time.

Yes the reality is more complicated, with CPU thermals, shared caches and whatnot coming into play, but we’re interested in broad patterns, not exact science here!
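
Roughly, the test harness is something like this (a simplified sketch of what I measured, not the exact code):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Each thread converts 2 million integers to strings via snprintf;
// print how long the whole thing took for a given thread count.
void run_test(int thread_count)
{
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t)
    {
        threads.emplace_back([] {
            char buf[32];
            for (int i = 0; i < 2000000; ++i)
                snprintf(buf, sizeof(buf), "%d", i);
        });
    }
    for (auto& th : threads) th.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
    printf("%i threads: %lld ms\n", thread_count, (long long)ms);
}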

And here’s what happens on an Apple M1 Max laptop. Horizontal axis is thread count; vertical axis is milliseconds (log scale) taken. Again, a good result would be a horizontal line:

Converting two million numbers into strings takes 100 milliseconds when one CPU core is doing it. When all eight “performance” cores are doing it (i.e. in total 16 million integers), it takes 1.8 seconds, or 18 times as long. That’s, like, not great!

Yo dude, you should not use sprintf

“Well duh” you say, “obviously you should not use sprintf, you should use C++ iostreams”. Okay. Here’s converting integers into strings via a std::stringstream <<.

Same scaling issue, except iostreams are two times slower. “Zero cost abstractions”, you know :)

What’s going on?

Instruments shows that with 8 threads, each thread spends over 90% of the time in something called localeconv_l, where it is mostly mutex locks.

At this point you might be thinking, “ah-ha! well this is related to a locale, and a locale is global, so of course some time spent on some mutex lock is expected”, which is “mmmaybe? but this amount of time feels excessive?”. Given that this is an Apple operating system, we might recall that it has a snprintf_l function which takes an explicit locale, and hope that using it would make things scale. Just pass NULL, which means “use the C locale”:

…aaand, nope. It is a tiny bit faster, but does not really address the issue.

But! Large parts of macOS Darwin kernel and system libraries have source code available, so let’s look at what’s going on. Here’s the latest localeconv_l at the time of writing: github link. It’s basically a:

lconv* localeconv_l(locale_t loc)
{
    lock_on(loc);
    if (loc->something_changed)
    {
        // do some stuff
    }
    unlock_on(loc);
    // ...
}

and the lock used internally is just a os_unfair_lock macOS primitive. What is curious, is that this code has very recently changed; before 2022 February it was like:

lconv* localeconv_l(locale_t loc)
{
    if (loc->something_changed)
    {
        lock_on(loc);
        if (loc->something_changed)
        {
            // do some stuff
        }
        unlock_on(loc);        
    }
    // ...
}

Which to me feels like the previous code was trying to do a “double checked locking” pattern, but without using actual atomic memory reads. Which probably happens to work just fine on Intel CPUs, but might be more problematic elsewhere, like maybe on Apple’s own CPUs? And then someone decided to just always take that mutex lock, instead of investigating possible use of atomic operations.
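
For what it’s worth, a correct double-checked pattern would use an atomic for the outer check; something along these lines (my sketch, not code from any of the actual libc implementations):

#include <atomic>
#include <mutex>

struct Locale
{
    std::atomic<bool> changed{true};
    std::mutex lock;
    // ... cached lconv-style data would live here ...
};

void refresh_if_needed(Locale& loc)
{
    // fast path: one atomic load, no lock taken at all when nothing has changed
    if (loc.changed.load(std::memory_order_acquire))
    {
        std::lock_guard<std::mutex> guard(loc.lock);
        if (loc.changed.load(std::memory_order_relaxed))
        {
            // ... rebuild the cached data ...
            loc.changed.store(false, std::memory_order_release);
        }
    }
}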

Now, Apple’s OS is BSD-based, so we can check what other BSD based systems do.

  • FreeBSD does not have any mutexes there, and before 2021 September was just checking a flag. Since then, the flag check was changed to use atomic operations.
  • OpenBSD does not use any atomics or mutexes at all, and the “has something changed?” flag is not even per-locale, it’s just a global variable. YOLO!

So given all this knowledge, presumably, if each thread used a physically different locale object and snprintf_l, then it would scale fine. And it does:
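
That per-thread locale test looks roughly like this (a sketch; newlocale and snprintf_l are the POSIX/BSD-style APIs available on macOS – check the headers for the exact declarations):

#include <locale.h>
#include <stdio.h>
#include <xlocale.h> // snprintf_l on macOS / BSD

// Each thread makes its own locale object, so the per-locale
// lock inside localeconv_l is never contended between threads.
void convert_with_private_locale()
{
    locale_t loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    char buf[32];
    for (int i = 0; i < 2000000; ++i)
        snprintf_l(buf, sizeof(buf), loc, "%d", i);
    freelocale(loc);
}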

What else can we do?

Now, besides the old snprintf and std::stringstream, there are other things we can do. For example:

  • stb_sprintf, a trivial to integrate, public domain C library that is a full sprintf replacement, but without any locale specific stuff. It’s also presumably faster, smaller and works the same across different compilers/platforms.
  • {fmt}, a MIT-licensed C++ library “providing a fast and safe alternative to C stdio and C++ iostreams”. {fmt} was a base for C++20 formatting additions.
  • Not a general replacement, but if we only need to turn numbers into strings, C++17 has to_chars.

All of those scale with increased thread usage just fine, and all of them are way faster in the single threaded case too. {fmt} looks very impressive. Yay!
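
For completeness, the “just turn a number into characters” alternatives look like this (std::to_chars is C++17; fmt::format_to is from the {fmt} library; neither goes anywhere near a locale):

#include <charconv>     // std::to_chars
#include <fmt/format.h> // fmt::format_to

void examples()
{
    // C++17 to_chars: no locale, no allocation, writes into [first, last)
    char buf[32];
    std::to_chars_result res = std::to_chars(buf, buf + sizeof(buf), 12345);
    *res.ptr = '\0'; // to_chars does not null-terminate

    // {fmt}: type safe formatting into an output iterator
    // (raw char* here is not bounds-checked; format_to_n is the bounded variant)
    char fbuf[32];
    auto end = fmt::format_to(fbuf, "{}", 12345);
    *end = '\0';
}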

Is this all Apple/Mac specific?

Let’s try all the above things on Windows with Visual Studio 2022. This one supports more things compared to clang 13 that I have on a Mac:

  • There is C++20 formatting library with format_to_n. This uses the same type safe syntax as {fmt} library, and we can hope it would be of a similar performance and scaling.
  • Similar to BSD-specific snprintf_l, Visual Studio has its own _snprintf_l.
  • Speaking of not-so-general solutions, Visual Studio also has itoa to convert integers into strings.

  • Unlike the Mac case, just the regular snprintf does not have the multi-threaded scaling issue! It takes around 100 milliseconds for two million integers, no matter how many threads are doing it at the same time.
  • C++ stringstream performance and scaling is really bad. It starts out 4x slower than snprintf at one thread, and goes up to being a hundred times slower at 8 threads.
  • The new, hot, C++20 based formatting functionality using format_to_n is really bad too! It starts out 10x slower than snprintf (!), and goes up to being 40x slower at 8 threads.

Ok, what is going on here?! Superluminal profiler to the rescue, and here’s what it says:

The stringstream, in one thread case, ends up spending most of the time in the infamous “zero-cost abstractions” of C++ :) A bunch of function calls, a tiny bit of work here and there, and then somewhere deep inside it ends up calling snprintf anyway. Just all around that, tiny bits and pieces of cost all add up. In the 8 threads case, it ends up spending all the time inside mutex locks, quite similar to how Mac/Apple case was doing. Just here it’s C++, so it ends up being worse - there’s not a single mutex lock, but rather what looks like three mutex locks on various parts of the locale object (via std::use_facet of different bits), and then there’s also reference counting, with atomic increase/decrease operations smashing the same locale object.

The format_to_n, in one thread case, ends up spending all the time in… 🥁… Loading resource files. :WAT: Each and every call “plz turn this integer into a string” ends up doing:

  • Create something called a _Fmt_codec object, which
  • Calls __std_get_cvt, which
  • Figures out “information about installed or available code page” via GetCPInfoExW, which
  • Ends up calling FindResourceExW and LoadResource on something. Which then call LdrpLoadResourceFromAlternativeModule and LdrpAccessResourceDataNoMultipleLanguage and so on and so on.

In the 8 threads case, that is all the same, except all that resource loading is presumably on the same “thing”, so it ends up spending a ton of time deep inside the OS kernel doing MiLockVadShared, and MiUnlockAndDereferenceVadShared, and LOCK_ADDRESS_SPACE_SHARED and so on.

So that is something I would not have expected to see, to be honest. Curiously enough, there is a similar-sounding issue on Microsoft’s STL GitHub, which has been marked resolved since April 2021.

And no, the usual Internet advice of “MSVC sucks, use Clang” does not help in this particular case. Using Clang 13, the C++20 formatting library is not available yet, but otherwise all the other options look pretty much the same, including the disappointing performance of stringstream:

What about Linux?

I only have an Ubuntu 20 install via WSL2 here to test, and using the default compilers there (clang 10 and gcc 9.3), things look pretty nice:

C++20 format library is not available in either of these compilers to test, but everything else scales really well with increased thread count. {fmt} continues to be impressive there as well.

Conclusion

Would you have expected a “turn an integer into a string” routine to be loading resource file information blocks from some library, for each and every call? Yeah, me neither.

Technically, there are no bugs anywhere above - all the functions work correctly, as far as the standard is concerned. But some of them have interesting (lack of) multi-core scaling behavior, some others just have regular performance overheads compared to others, etc.

If you need to target multiple different compilers & platforms, and want consistent performance characteristics, then avoiding some parts of C or C++ standard libraries might be one way. Or at least, do not assume anything about performance (and especially about multi-thread scaling) characteristics of the standard libraries.

If you need to do string formatting in C++, I can highly recommend using {fmt}.