Doom in Blender VSE

You know how in Blender Video Sequence Editor (VSE) you can create Color strips, and then their color is displayed in the timeline?

You can create many of them, and when sufficiently zoomed out, the strip headings disappear since there’s not enough space for the label:

So if you created say 80 columns and 60 rows of color strips…

…and kept on changing their colors constantly… you could run Doom inside the Blender VSE timeline.

And so that’s what I did. Idea sparked after seeing someone make Doom run in Houdini COPs.

Result

Here’s the result:

And the file/code on github: github.com/aras-p/blender-vse-doom

It is a modal Blender operator that loads a Doom file, creates a VSE timeline full of color strips (80 columns, 60 rows), listens to keyboard input for player control, renders a Doom frame, and updates the VSE color strip colors to match the rendered result. The Escape key finishes the operator.

All the Doom-specific heavy lifting is in render.py, written by Mark Dufour and is completely unrelated to Blender. It is just a tiny pure Python Doom loader/renderer. I took it from “Minimal DOOM WAD renderer” and made two small edits to avoid division by zero exceptions that I was getting.

Performance

This runs pretty slow (~3fps) in current Blender (4.1 .. 4.4) 😢

I noticed that it was slow when I was “running it”, but when stopped, navigating the VSE timeline with all the strips still there was buttery smooth. And so, being the idiot that I am, I was “rah rah, Doom rendering is done in pure Python, of course it is slow!”

Yes, Python is slow, and yes, the minimal Doom renderer (in exactly 666 lines of code – nice!) is not written in “performant Python”. But turns out… performance problems are not there. Another case for “never guess, always look at what is going on”.

The pure-Python Doom renderer part takes 7 milliseconds to render an 80x60 “frame”. Could it be faster? Probably. But… it takes 300 milliseconds to update the colors of all the VSE strips.
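The arithmetic matches the observed frame rate, as a quick sketch shows (`frames_per_second` is a made-up helper, not something from the repository):

```cpp
// Frames per second, given the cost of each per-frame stage in milliseconds.
inline double frames_per_second(double render_ms, double update_ms)
{
    return 1000.0 / (render_ms + update_ms);
}
```

With the numbers above, `frames_per_second(7, 300)` is about 3.3, while the renderer alone (`frames_per_second(7, 0)`) would allow over 140.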

Note that in Blender 4.0 or earlier it runs even slower, because redrawing the VSE timeline with 4800 strips takes about 100 milliseconds; that is no longer slow (1-2ms) in later versions due to what I did a year ago.

Why does it take 300 milliseconds to update the strip colors? For that of course I brought up Superluminal and it tells me the problem is cache invalidation:

Luckily, cache invalidation is one of the easiest things in computer science, right? 🧌

Anyway, this looks like another case of accidental quadratic complexity: for each strip that gets a new color set on it, there’s code that 1) invalidates any cached results for that strip (ok), and 2) tries to find whether this strip belongs to any meta-strips to invalidate those (which scans all the strips), and 3) tries to find which strips intersect the strip horizontal range (i.e. are “composited above it”), and invalidate partial results of those – this again scans all the strips.
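The shape of the problem can be sketched in a few lines of C++. This is not Blender's actual code: `Strip`, `invalidate_naive` and the counting of “strip visits” are all made up for illustration, and only the “scan everything for each changed strip” part of step 3 is modeled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a VSE strip: a time range plus a cache flag.
struct Strip {
    int start;         // first frame of the strip
    int end;           // one past the last frame
    bool cache_valid;  // whether cached results for this strip are usable
};

// True when the time ranges of two strips overlap.
inline bool overlaps(const Strip &a, const Strip &b)
{
    return a.start < b.end && b.start < a.end;
}

// Naive invalidation after changing ONE strip: walks all strips to find
// the ones sharing its time range. Returns how many strips were visited.
inline std::size_t invalidate_naive(std::vector<Strip> &strips, std::size_t changed)
{
    std::size_t visited = 0;
    for (Strip &other : strips) {
        visited++;
        if (overlaps(strips[changed], other)) {
            other.cache_valid = false;
        }
    }
    return visited;
}

// Changing every strip once per frame therefore costs N * N strip visits.
inline std::size_t recolor_all_naive(std::vector<Strip> &strips)
{
    std::size_t visited = 0;
    for (std::size_t i = 0; i < strips.size(); i++) {
        visited += invalidate_naive(strips, i);
    }
    return visited;
}

// Builds a grid like the one in this post: `columns` time slots,
// with `rows` strips stacked on top of each other in each slot.
inline std::vector<Strip> make_grid(int columns, int rows)
{
    std::vector<Strip> strips;
    for (int row = 0; row < rows; row++) {
        for (int col = 0; col < columns; col++) {
            strips.push_back(Strip{col, col + 1, true});
        }
    }
    return strips;
}
```

For the 80x60 grid that is 4800 × 4800, or about 23 million strip visits per Doom frame, even though each strip only ever overlaps the 60 strips in its own column.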

Step 2 above can be easily addressed, I think, as the codebase already maintains data structures for finding which strips are part of which meta-strips, without resorting to “look at everything”.

Step 3 is slightly harder in the current code. However, half a year ago during VSE workshop we talked about how the whole caching system within VSE is maybe too complexicated for no good reason.

Now that I think about it, I think most or all of that extra cost could be removed, if Someone™️ would rewrite VSE cache to be along the lines of how we discussed at the workshop.

Hmm. Maybe I have some work to do. And then the VSE timeline could be properly doomed.


Verbosity of coding styles

Everyone knows that different code styles have different verbosity. You can have very dense code that implements a path tracer in 99 lines of C, or on the back of a business card (one, two). On the other side of the spectrum, you can have very elaborate code where it can take you weeks to figure out where the actual work happens, digging through all the abstraction layers and indirections.

Of course, to be usable in a real world project, code style would preferably not sit on either extreme. How compact vs how verbose should it be? That, as always, depends on a lot of factors. How many people, and of what skill level, will work on the code? How much churn will the code have? Will it need to keep on adapting to wildly changing requirements very fast? Should 3rd parties be able to extend the code? Does it have a public API that can never change? And a million other things that all influence how to structure it all, how much abstraction (and of what kind) there should be, etc.

A concrete example: Compositor in Blender

The other day I was happily deleting 40 thousand lines of code (just another regular Thursday, eh), and I thought I’d check how much code is in the “new” Compositor in Blender, vs in the old one that I was removing.

What is the “old” and “new” compositor? Well, there have been more than just these two. You see, some months ago I removed the “old-old” (“tiled”) compositor already. There’s a good talk by Habib Gahbiche “Redesigning the compositor” from BCON'24 with all the history of the compositor backends over the years.

So, how large is the compositor backend code in Blender?

I am using scc to count the number of lines. It is pretty good! And counts the 4.3 million lines inside Blender codebase in about one second, which is way faster than some other line counting tools (tokei is reportedly also fast and good). I am using scc --count-as glsl:GLSL since right now scc does not recognize .glsl files as being GLSL, d’oh.

The “Tiled” compositor I removed a while ago (PR) was 20 thousand lines of code. Note however that this was just one “execution mode” of the compositor, and not the full backend.

The “Full-frame” compositor I deleted just now (PR) is 40 thousand lines of C++ code.

What remains is the “new” (used to be called “realtime”) compositor. How large is it? Turns out it is… 27 thousand lines of code. So it is way smaller!

And here’s the kicker: while the previous backends were CPU only, this one works on both CPU and GPU. With no magic, just literally “write the processing code twice: in C++ and GLSL”. “Oh no, code duplication!”… and yet… it is way more compact. Nice!

I know nothing about compositing, or about relative merits of “old” vs “new” compositor code. It is entirely possible that the verbosity of the old compositor backend was due to a design that, in retrospect, did not stand the test of time or production usage; after all, the compositor within Blender is an 18 year old feature by now. Also, while I deleted the old code because I like deleting code, the actual hard work of writing the new code was done mostly by Omar Emara, Habib Gahbiche and others.

I found it interesting that the new code that does more things is much smaller than the old code, and that’s all!


A year in Blender VSE land

Turns out, now is exactly one year of me working on the video sequence editor (VSE).

Going pretty well so far! What I managed to put into Blender 4.1 and 4.2 is in the previous blog posts. Blender 4.3 has just shipped, and everything related to Video Sequence Editor is listed on this page. Items related to performance or thumbnails are my doing.

Some of the work I happened to do for VSE over this past year ended up improving other areas of Blender. E.g. video rendering improvements are useful for anyone who renders videos; or image scaling/filtering improvements are beneficial in other places as well. So that’s pretty cool!

Google Summer of Code

The main user-visible workflow changes in 4.3 VSE (“connected” strips, and preview area snapping) were done by John Kiril Swenson as part of Google Summer of Code, see his report blog post. I was “mentoring” the project, but that was surprisingly easy and things went very smoothly. Not much more to say, except that the project was successful, and the result is actually shipping now as part of Blender. Nice!

Sequencer workshop at Blender HQ

In August 2024 some of us had a “VSE Workshop” at the Blender office in Amsterdam. Besides geeking out on some technical details, most of the discussion was about high level workflows, which is not exactly my area (I can implement an existing design, or fix some issues, but doing actual UI or UX work I’m the least suitable person for).

But! It was very nice to hear all the discussions, and to see people face to face, at last. Almost five years of working from home is mostly nice, but once in a while getting out of the house is also nice.

There’s a short blog post and a more detailed report thread about the workshop on Blender website/forum.

Surprising no one, what became clear is that the amount of possible work on the video editing tools is way more than the amount of people and the amount of time they can spend implementing them. Like, right now there’s maybe… 1.5 people actually working on it? (my math: three people, part-time). So while Blender 4.1, 4.2 and 4.3 all have VSE improvements, no “hey magically it is now better than Resolve / Premiere / Final Cut Pro” moments anytime soon :)

A side effect of the workshop: I got to cuddle Ton’s dog Bowie, and saw Sergey’s frog collection, including this most excellent güiro:

Blender Conference 2024

I gave a short talk at BCON'24, “How to accidentally start working on VSE”. It was not so much about VSE per se, but more about “how to start working in a new area”. Vibing off the whole conference theme which was “building Blender”.

Here’s slides for it (pdf) and the recording:

The whole conference was lovely. All the talks are in this playlist, and overall feeling is well captured in the BCON'24 recap video.

What’s Next

Blender 4.4 development is happening as we speak, and VSE already got some stuffs done for it. For this release, so far:

  • Video improvements: H.265/HEVC support, 10- and 12-bit videos. Some colorspace and general color precision shenanigans.
  • Proxy improvements: proxies for EXR images work properly now, and are faster to build. There’s a ton of possible improvements for video proxies, but not sure how much of that I’ll manage to squeeze into 4.4 release.

Generally, just like this whole past year, I’m doing things without much planning. Stochastic development! Yay!


Vector math library codegen in Debug

This is about what happens when your C++ code has a “vector math library”, and how the choices of code style in there affect non-optimized build performance.

Backstory

A month ago I got into the rabbit hole of trying to “sanitize” the various ways that images can be resized within Blender codebase. There were at least 4 different functions to do that, with different filtering modes (expected), but also different corner case behaviors and other funkiness, that was not well documented and not well understood.

I combed through all of that, fixed some arguably wrong behaviors of some of the functions, unified their behavior, etc. etc. Things got faster and better documented. Yay! (PR)

However. While doing that, I also made the code smaller, primarily following the guideline of “code should use our C++ math library, not the legacy C one”. That is, use Blender codebase classes like float4 with related functions and operators (e.g. float4 c = a + b), instead of float c[4]; add_v4_v4v4(c, a, b); and so on. Sounds good? Yes!

But. There’s always a “but”.

Other developers later on noticed that some parts of Blender got slower, in non-optimized (“Debug”) build. Normally people say “oh it’s a Debug build, no one should care about performance of it”, and while in some cases it might be true, when anything becomes slower it is annoying.

In this particular case, it was “saving a file within Blender”. You see, as part of the saving process, it takes a screenshot of your application, resizes it to be smaller and embeds that as a “thumbnail” inside the file itself. And yes, this “resize it” part is exactly what my change affected. Many developers run their build in Debug mode for easier debugging and/or faster builds; some run it in Debug mode with Address Sanitizer on as well. If “save a file”, an operation that you normally do many times, became slower by say 2 seconds, that is annoying.

What can be done?

How Blender’s C++ math library is written today

It is pretty compact and neat! And perhaps too flexible :)

The base of the math vector types is this struct, which is just a fixed size array of N entries. For the (most common) case of 2D, 3D and 4D vectors, the struct instead has entries explicitly named x, y, z, w:

template<typename T, int Size>
struct vec_struct_base { std::array<T, Size> values; };
template<typename T> struct vec_struct_base<T, 2> { T x, y; };
template<typename T> struct vec_struct_base<T, 3> { T x, y, z; };
template<typename T> struct vec_struct_base<T, 4> { T x, y, z, w; };

And then it has functions and operators, where most of their implementations use an “unroll with a lambda” style. Here’s the operator that adds two vectors together:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}

with unroll itself being this:

template<class Fn, size_t... I> void unroll_impl(Fn fn, std::index_sequence<I...>)
{
    (fn(I), ...);
}
template<int N, class Fn> void unroll(Fn fn)
{
    unroll_impl(fn, std::make_index_sequence<N>());
}

– it takes “how many times to do the lambda” (typically vector dimension), the lambda itself, and then “calls” the lambda with the index N times.
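To see the mechanics in isolation, here is a small self-contained sketch that repeats the two templates from above (lightly adapted to spell out `std::size_t`) and uses them to sum a fixed-size array. The `sum4` helper is made up for illustration and is not part of Blender.

```cpp
#include <array>
#include <cstddef>
#include <utility>

template<class Fn, std::size_t... I> void unroll_impl(Fn fn, std::index_sequence<I...>)
{
    // Fold expression (C++17): expands to fn(0), fn(1), ..., fn(N-1).
    (fn(I), ...);
}
template<int N, class Fn> void unroll(Fn fn)
{
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical usage: sum the four entries of an array.
inline float sum4(const std::array<float, 4> &v)
{
    float total = 0.0f;
    unroll<4>([&](auto i) { total += v[i]; });
    return total;
}
```

In an optimized build the compiler is expected to collapse all of this into four additions. Without optimizations, each `fn(I)` is a real call into the lambda, on top of the calls to `unroll` and `unroll_impl` themselves.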

And then most of the functions use indexing operator to access the element of a vector:

T &operator[](int index)
{
    BLI_assert(index >= 0);
    BLI_assert(index < Size);
    return reinterpret_cast<T *>(this)[index];
}

Pretty compact and hassle free. And given that C++ famously has “zero-cost abstractions”, these are all zero-cost, right? Let’s find out!

Test case

Let’s do some “simple” image processing code that does not serve a practical purpose, but is simple enough to test things out. Given an input image (RGBA, byte per channel), blur it by averaging 11x11 square of pixels around each pixel, and overlay a slight gradient over the whole image. This input, for example, gets turned into that output:

The filter code itself is this:

inline float4 load_pixel(const uint8_t* src, int size, int x, int y)
{
    x &= size - 1;
    y &= size - 1;
    uchar4 bpix(src + (y * size + x) * 4);
    float4 pix = float4(bpix) * (1.0f / 255.0f);
    return pix;
}
inline void store_pixel(uint8_t* dst, int size, int x, int y, float4 pix)
{
    pix = math_max(pix, float4(0.0f));
    pix = math_min(pix, float4(1.0f));
    pix = math_round(pix * 255.0f);
    ((uchar4*)dst)[y * size + x] = uchar4(pix);
}
void filter_image(int size, const uint8_t* src, uint8_t* dst)
{
    const int kFilter = 5;
    int idx = 0;
    float blend = 0.2f;
    float inv_size = 1.0f / size;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            float4 pix(0.0f);
            float4 tint(x * inv_size, y * inv_size, 1.0f - x * inv_size, 1.0f);
            for (int by = y - kFilter; by <= y + kFilter; by++)
            {
                for (int bx = x - kFilter; bx <= x + kFilter; bx++)
                {
                    float4 sample = load_pixel(src, size, bx, by);
                    sample = sample * (1.0f - blend) + tint * blend;
                    pix += sample;
                }
            }
            pix *= 1.0f / ((kFilter * 2 + 1) * (kFilter * 2 + 1));
            store_pixel(dst, size, x, y, pix);
        }
    }
}

This code uses very few vector math library operations: create a float4, add them together, multiply them by a scalar, some functions to do min/max/round. There are no branches (besides the loops themselves), no cross-lane swizzles, fancy packing or anything like that.
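One detail worth spelling out: the `x &= size - 1` in `load_pixel` wraps coordinates around the image edges, but only when `size` is a power of two (which 512 is). A minimal sketch of that trick, with a made-up `wrap_pow2` helper:

```cpp
// Wraps a (possibly negative or too large) coordinate into [0, size).
// Only valid when size is a power of two, so that size - 1 is an all-ones
// bit mask; relies on two's complement integers for negative inputs.
inline int wrap_pow2(int coord, int size)
{
    return coord & (size - 1);
}
```

So near the borders the 11x11 blur window reads from the opposite edge of the image: `wrap_pow2(-1, 512)` is 511 and `wrap_pow2(512, 512)` is 0.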

Let’s run this code in the usual “Release” setting, i.e. optimized build (-O2 on gcc/clang, /O2 on MSVC). Processing a 512x512 input image with the above filter, on Ryzen 5950X, Windows 10, single threaded, times in milliseconds (lower is better):

| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release (O2) | 67 | 41 | 45 | 70 | 70 |

Alright, Clang beats the others by a healthy margin here.
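For reference, a timing harness for numbers like these can be as simple as the sketch below. This is an assumption about how one might measure, not necessarily what the test code does; `time_ms` is a made-up helper that times whatever callable it is handed.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template<class Fn> double time_ms(Fn work)
{
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Usage would be along the lines of `double ms = time_ms([&] { filter_image(512, src, dst); });`, ideally repeated a few times and keeping the fastest run to reduce noise.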

Enter Debug builds

At least within Blender (but also elsewhere), besides a build configuration that ships to the users, during development you often work with two or more other build configurations:

  • “Debug”, which often means “the compiler does no optimizations at all” (-O0 on gcc/clang, /Od on MSVC). This is the least confusing debugging experience, since nothing is “optimized out” or “folded together”.
    • On MSVC, people also sometimes put /JMC (“just my code” debugging), and that is default in recent MSVC project templates. Blender uses that too in the “Debug” cmake configuration.
  • “Developer”, which often is the same as “Release” but with some extra checks enabled. In Blender’s case, besides things like “enable unit tests” and “use a guarded memory allocator”, it also enables assertion checks, and on Linux/Mac also enables Address Sanitizer (-fsanitize=address).

While some people argue that “Debug” build configuration should pay no attention to performance at all, I’m not sold on that argument. I’ve seen projects where a non-optimized code build, while it works, produces such bad performance that using the resulting application is an exercise in frustration. Some places explicitly enable some compiler optimizations on an otherwise “Debug” build, since otherwise the result is just unusable (e.g. in C++-heavy codebase, you’d enable function inlining).

However, the “Developer” configuration is an interesting one. It is supposed to be “optimized”, just with “some” extra safety features. I would normally expect that to be “maybe 5% or 10% slower” than the final “Release” build, but not more than that.

Let’s find out!

| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release (O2) | 67 | 41 | 45 | 70 | 70 |
| Developer (O2 + asserts) | 591 | 42 | 45 | 71 | 71 |
| Debug (-O0 / /Od /JMC) | 17560 | 4965 | 6476 | 5610 | 5942 |

Or, phrased in terms of “how many times a build configuration is slower compared to Release”:

| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release (O2) | 1x | 1x | 1x | 1x | 1x |
| Developer (O2 + asserts) | 9x | 1x | 1x | 1x | 1x |
| Debug (-O0 / /Od /JMC) | 262x | 121x | 144x | 80x | 85x |

On Developer config, gcc and clang are good: assertions being enabled does not cause a slowdown. On MSVC, however, this makes the code run 9 times slower. All of that only because vector operator[](int index) has asserts in there. And it is only ever called with indices that are statically known to pass the asserts! So much for an “optimizing compiler”, eh.

The Debug build configuration is just bad everywhere. Yes it is the worst on MSVC, but come on, anything that is more than 10 times slower than optimized code is going into “too slow to be practical” territory. And here we are talking about things being from 80 times to over 260 times slower!
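The “times slower” factors are nothing more than each timing divided by the Release timing of the same compiler, rounded to a whole number. As a sketch (`slowdown` is a made-up helper):

```cpp
#include <cmath>

// How many times slower a configuration is compared to Release,
// rounded to the nearest whole number.
inline int slowdown(double config_ms, double release_ms)
{
    return static_cast<int>(std::lround(config_ms / release_ms));
}
```

For MSVC that gives `slowdown(17560, 67)` = 262 for Debug and `slowdown(591, 67)` = 9 for Developer; for Gcc 12, `slowdown(5610, 70)` = 80.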

What can be done without changing the code?

Perhaps realizing that “no optimizations at all produces an unusably slow result” is true, some compiler developers have added an “a bit optimized, yet still debuggable” optimization level: -Og. GCC added that in 2013 (gcc 4.8):

-Og should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.

Clang followed suit in 2017 (clang 4.0), however their -Og does exactly the same thing as -O1.

MSVC has no setting like that, but we can at least try to turn off “just my code debugging” (/JMC) flag and see what happens. The slowdown table:

| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release (O2) | 1x | 1x | 1x | 1x | 1x |
| Faster Debug (-Og / /Od) | 114x | 2x | 2x | 50x | 14x |
| Debug (-O0 / /Od /JMC) | 262x | 121x | 144x | 80x | 85x |

Alright, so:

  • On clang -Og makes the performance good. Expected since this is the same as -O1.
  • On gcc -Og is better than -O0. Curiously gcc 12 is slower than gcc 11 here for some reason.
  • MSVC without /JMC is better, but still very very slow.

Can we change the code to be more Debug friendly?

The current way the math library is written is short and concise. If you are used to C++ lambda syntax, things are very clear:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}

however, without compiler optimizations, for a float4 that produces (on clang):

  • 18 function calls,
  • 8 branches,
  • assembly listing of 150 instructions.

What it actually does, is do four float additions.

Loop instead of unroll lambda

What about, if instead of this unroll+lambda machinery, we used just a simple loop?

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    for (int i = 0; i < Size; i++) result[i] = a[i] + b[i];
    return result;
}
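
| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release unroll | 1x | 1x | 1x | 1x | 1x |
| Release loop | 1x | 1x | 1x | 3x | 10x |
| Developer unroll | 9x | | | | |
| Developer loop | 1x | | | | |
| Faster Debug unroll | 114x | 2x | 2x | 50x | 14x |
| Faster Debug loop | 65x | 2x | 2x | 31x | 14x |
| Debug unroll | 262x | 121x | 144x | 80x | 85x |
| Debug loop | 126x | 102x | 108x | 58x | 58x |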

This does help Debug configurations somewhat (12 function calls, 9 branches, 80 assembly instructions). However! It hurts Gcc code generation even in full Release mode 😱, so that’s probably a no-go. If it were not for the Gcc slowdown, it would be a win: better performance in Debug configuration, and a simple loop is easier to understand than a variadic template + lambda.

Explicit code paths for 2D/3D/4D vector cases

Out of all the possible vector math cases, 2D, 3D and 4D vectors are by far the most common. I’m not sure if other cases even happen within Blender codebase, TBH. Maybe we could specialize those to help the compiler a bit? For example:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    if constexpr (Size == 4) {
        return VecBase(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }
    else if constexpr (Size == 3) {
        return VecBase(a.x + b.x, a.y + b.y, a.z + b.z);
    }
    else if constexpr (Size == 2) { 
        return VecBase(a.x + b.x, a.y + b.y);
    }
    else {
        VecBase result;
        unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
        return result;
    }
}

This is very verbose and a bit typo-prone however :( With some C preprocessor help it can be reduced to hide most of the ugliness inside a macro, and then the actual operator implementation is not terribad:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    BLI_IMPL_OP_VEC_VEC(+);
}
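
| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release unroll | 1x | 1x | 1x | 1x | 1x |
| Release explicit | 1x | 1x | 1x | 1x | 1x |
| Developer unroll | 9x | | | | |
| Developer explicit | 1x | | | | |
| Faster Debug unroll | 114x | 2x | 2x | 50x | 14x |
| Faster Debug explicit | 19x | 2x | 2x | 5x | 3x |
| Debug unroll | 262x | 121x | 144x | 80x | 85x |
| Debug explicit | 55x | 18x | 30x | 22x | 21x |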

This actually helps Debug configurations quite a lot! One downside is that you have to have a handful of C preprocessor macros to hide away all the complexity that specializes implementations for 2D/3D/4D vectors.
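To give an idea of what such a macro could look like, here is a self-contained sketch. It is not the actual Blender implementation: the `Vec` and `VecStorage` types and the `SKETCH_IMPL_OP_VEC_VEC` name are made up for illustration, and the generic fallback is a plain loop rather than `unroll` to keep the example short.

```cpp
#include <array>

// Storage: named members for the common sizes, a plain array otherwise
// (same idea as vec_struct_base above).
template<typename T, int Size> struct VecStorage { std::array<T, Size> values; };
template<typename T> struct VecStorage<T, 2> { T x, y; };
template<typename T> struct VecStorage<T, 3> { T x, y, z; };
template<typename T> struct VecStorage<T, 4> { T x, y, z, w; };

// Hypothetical macro: spells out the 2D/3D/4D cases for a binary operator
// and falls back to a generic loop for every other size.
#define SKETCH_IMPL_OP_VEC_VEC(op) \
    if constexpr (Size == 4) { \
        Vec r; \
        r.x = a.x op b.x; r.y = a.y op b.y; r.z = a.z op b.z; r.w = a.w op b.w; \
        return r; \
    } \
    else if constexpr (Size == 3) { \
        Vec r; \
        r.x = a.x op b.x; r.y = a.y op b.y; r.z = a.z op b.z; \
        return r; \
    } \
    else if constexpr (Size == 2) { \
        Vec r; \
        r.x = a.x op b.x; r.y = a.y op b.y; \
        return r; \
    } \
    else { \
        Vec r; \
        for (int i = 0; i < Size; i++) { \
            r.values[i] = a.values[i] op b.values[i]; \
        } \
        return r; \
    }

template<typename T, int Size> struct Vec : VecStorage<T, Size> {
    // Each operator body is a single macro invocation.
    friend Vec operator+(const Vec &a, const Vec &b) { SKETCH_IMPL_OP_VEC_VEC(+); }
    friend Vec operator*(const Vec &a, const Vec &b) { SKETCH_IMPL_OP_VEC_VEC(*); }
};
```

The point of the `if constexpr` chain is that for a 4D vector the non-optimized build only ever sees four plain member additions, with no lambda, no `operator[]` and no asserts in between.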

Not using C++ vector math, use C instead

As a thought exercise – what if instead of using the C++ vector math library, we went back to the C-style of writing code?

Within Blender, right now there’s guidance of “use C++ math library for new code, occasionally rewrite old code to use C++ math library” too. That makes the code more compact and easier to read for sure, but does it have any possible downsides?

Our test image filter code becomes this then (there’s no “math library” used then, just operations on numbers):

inline void load_pixel(const uint8_t* src, int size, int x, int y, float pix[4])
{
    x &= size - 1;
    y &= size - 1;
    const uint8_t* ptr = src + (y * size + x) * 4;
    pix[0] = ptr[0] * (1.0f / 255.0f);
    pix[1] = ptr[1] * (1.0f / 255.0f);
    pix[2] = ptr[2] * (1.0f / 255.0f);
    pix[3] = ptr[3] * (1.0f / 255.0f);
}
inline void store_pixel(uint8_t* dst, int size, int x, int y, const float pix[4])
{
    float r = std::max(pix[0], 0.0f);
    float g = std::max(pix[1], 0.0f);
    float b = std::max(pix[2], 0.0f);
    float a = std::max(pix[3], 0.0f);
    r = std::min(r, 1.0f);
    g = std::min(g, 1.0f);
    b = std::min(b, 1.0f);
    a = std::min(a, 1.0f);
    r = std::round(r * 255.0f);
    g = std::round(g * 255.0f);
    b = std::round(b * 255.0f);
    a = std::round(a * 255.0f);
    uint8_t* ptr = dst + (y * size + x) * 4;
    ptr[0] = uint8_t(r);
    ptr[1] = uint8_t(g);
    ptr[2] = uint8_t(b);
    ptr[3] = uint8_t(a);
}
void filter_image(int size, const uint8_t* src, uint8_t* dst)
{
    const int kFilter = 5;
    int idx = 0;
    float blend = 0.2f;
    float inv_size = 1.0f / size;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
          float pix[4] = { 0,0,0,0 };
          float tint[4] = { x * inv_size, y * inv_size, 1.0f - x * inv_size, 1.0f };
          for (int by = y - kFilter; by <= y + kFilter; by++)
          {
              for (int bx = x - kFilter; bx <= x + kFilter; bx++)
              {
                  float sample[4];
                  load_pixel(src, size, bx, by, sample);
                  sample[0] = sample[0] * (1.0f - blend) + tint[0] * blend;
                  sample[1] = sample[1] * (1.0f - blend) + tint[1] * blend;
                  sample[2] = sample[2] * (1.0f - blend) + tint[2] * blend;
                  sample[3] = sample[3] * (1.0f - blend) + tint[3] * blend;
                  pix[0] += sample[0];
                  pix[1] += sample[1];
                  pix[2] += sample[2];
                  pix[3] += sample[3];
              }
          }
          float scale = 1.0f / ((kFilter * 2 + 1) * (kFilter * 2 + 1));
          pix[0] *= scale;
          pix[1] *= scale;
          pix[2] *= scale;
          pix[3] *= scale;
          store_pixel(dst, size, x, y, pix);
        }
    }
}
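
| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release unroll | 1x | 1x | 1x | 1x | 1x |
| Release C | 1x | 0.5x | 0.5x | 1x | 1x |
| Developer unroll | 9x | | | | |
| Developer C | 1x | | | | |
| Faster Debug unroll | 114x | 2x | 2x | 50x | 14x |
| Faster Debug C | 4x | 0.5x | 1.5x | 1x | 1x |
| Debug unroll | 262x | 121x | 144x | 80x | 85x |
| Debug C | 5x | 6x | 6x | 4x | 4x |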

Writing code in “pure C” style makes Debug build configuration performance really good! But the more interesting thing is… in Release build on clang, this is faster than C++ code. Even for this very simple vector math library, used on a very simple algorithm, C++ abstraction is not zero-cost!

What about SIMD?

In an ideal world, the compiler would take care of SIMD for us, especially in simple algorithms like the one being tested here. It is just number math, with very clear “four lanes” being operated on (maps perfectly to SSE or NEON registers), no complex cross-lane shuffles, packing or any of that stuff. Just some loops and some math.

Of course, as Matt Pharr writes in the excellent ISPC blog series, “Auto-vectorization is not a programming model” (post) (original quote by Theresa Foley).

What if we manually added template specializations to our math library, where for the “type is float, dimension is 4” case it would just use SIMD intrinsics directly? Note that this is not the right model if you want absolute best performance that also scales past 4-wide SIMD; the correct way would be to map one SIMD lane to one “scalar item” in your algorithm. But that is a whole other topic; let’s limit ourselves to 4-wide SIMD and map a float4 to one SSE register:

template<> struct vec_struct_base<float, 4> { __m128 simd; };
template<>
inline VecBase<float, 4> operator+(const VecBase<float, 4>& a, const VecBase<float, 4>& b)
{
    VecBase<float, 4> r;
    r.simd = _mm_add_ps(a.simd, b.simd);
    return r;
}
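
| | MSVC 2022 | Clang 17 | Clang 14 | Gcc 12 | Gcc 11 |
|---|---|---|---|---|---|
| Release unroll | 1x | 1x | 1x | 1x | 1x |
| Release C | 1x | 0.5x | 0.5x | 1x | 1x |
| Release SIMD | 0.8x | 0.5x | 0.5x | 0.7x | 0.7x |
| Developer unroll | 9x | | | | |
| Developer C | 1x | | | | |
| Developer SIMD | 0.8x | | | | |
| Faster Debug unroll | 114x | 2x | 2x | 50x | 14x |
| Faster Debug C | 4x | 0.5x | 1.5x | 1x | 1x |
| Faster Debug SIMD | 24x | 1.3x | 1.3x | 2x | 2x |
| Debug unroll | 262x | 121x | 144x | 80x | 85x |
| Debug C | 5x | 6x | 6x | 4x | 4x |
| Debug SIMD | 70x | 44x | 51x | 20x | 20x |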

Two surprising things here:

  • Even for this very simple case that sounds like “of course the compiler would perfectly vectorize this code, it is trivial!”, manually writing SIMD still wins everywhere except on Clang.
  • However, in Debug build configuration, SIMD intrinsics incur a heavy performance cost, i.e. the code is way slower than when written in pure C scalar style. SIMD intrinsics still perform better than our initial code that uses the unroll+lambda style.

What about O3 optimization level?

You say “hey you are using -O2 on gcc/clang, you should use -O3!” Yes I’ve tried that, and:

  • On gcc it does not change anything, except that it fixes the curious “changing unroll lambda to a simple loop” problem, i.e. under -O3 there is no downside to using a loop in your vector math class compared to unroll+lambda.
  • On clang it makes the various C++ approaches from above almost reach the performance of either “raw C” or “SIMD” styles, but not quite.

What about “force inline”?

(Update 2024 Sep 17) Vittorio Romeo asked “what about using attributes that force inlining”? I don’t have a full table, but a quick summary is:

  • In MSVC, that is not possible at all. Under /Od nothing is inlined, even if you mark it as “force inline please”. There’s an open feature request to make that possible, but is not there (yet?).
  • In Clang force inline attributes help a bit, but not by much.
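For reference, “force inline” is spelled differently per compiler. A common way to wrap it is sketched below; the `SKETCH_FORCE_INLINE` macro name and the `add_scaled` function are made up, while `__forceinline` is the MSVC keyword and `always_inline` is the gcc/clang attribute.

```cpp
#if defined(_MSC_VER)
#  define SKETCH_FORCE_INLINE __forceinline
#else
#  define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#endif

// Small helper that we would like inlined even in non-optimized builds.
SKETCH_FORCE_INLINE float add_scaled(float a, float b, float scale)
{
    return a + b * scale;
}
```

As noted above, under MSVC /Od the keyword is accepted but nothing actually gets inlined.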

Learnings

All of the learnings are based on this particular code, which is “simple loops that do some simple pixel operations”. The learnings may or may not transfer to other domains.

  • Clang feels like the best compiler of the three. Consistently fastest code, compared to other compilers, across various coding styles.
  • “C++ has zero cost abstractions” (compared to raw C code) is not true, unless you’re on Clang.
  • Debug build (no optimizations at all) performance of any C++ code style is really bad. The only way I could make it acceptable, while still being C++, is by specializing code for common cases, which I achieved by… using C preprocessor macros 🤦.
  • It is not true that only MSVC has horrible Debug build performance. Yes, it is the worst of all of them, but the other compilers also produce really badly performing code in Debug build config.
  • SIMD intrinsics in a non-optimized build have quite bad performance :(
  • Using “enable some optimizations” build setting, e.g. -Og, might be worth looking into, if your codebase is C++ and heavy on inlined functions, lambdas and that stuff.
  • Using “just my code debugging” (/JMC) on Visual Studio has a high performance cost, on already really bad Debug build performance. I’m not sure if it is worth using at all, ever, anywhere.

All my test code of the above is in this tiny github repo, and a PR for Blender codebase that does “explicitly specialize for common cases via C macros” is at #127577. Whether it would get accepted was up in the air at first, since it does arguably make the code “more ugly”. Update: the PR got merged into Blender mainline.


Random thoughts about Unity

Unity has a problem

From the outside, Unity lately seems to have a problem or two. By “lately”, I mean during the last decade, and by “a problem or two”, I mean probably over nine thousand problems. Fun! But what are they, how serious are they, and what can be done about it?

Unity is a “little engine that could”, that started out in the year 2004. Almost everything about games and related industries was different compared to today (Steam did not exist for 3rd party games! The iPhone was not invented yet! Neural networks were an interesting but mostly failed tinkering area for some nerds! A “serious” game engine could easily be like “a million dollars per project” in licensing costs! …and so on). I joined in early 2006 and left in early 2022, and saw quite an amazing journey within – against all odds, somehow, Unity turned from the game engine no one had heard about into arguably the most popular game engine.

But it is rare for something to become popular and stay popular. Some of that is a natural “cycle of change” that happens everywhere, some of that is external factors that are affecting the course of a product, some is self-inflicted missteps. For some other types of products or technologies, once they become an “industry standard”, they kinda just stay there, even without seemingly large innovations or a particular love from the user base – they have become so entrenched and captured so much of the industry that it’s hard to imagine anything else. Photoshops and Offices of the world come to mind, but even those are not guaranteed to forever stay the leaders.

Anyway! Here’s a bunch of thoughts on Unity as I see them (this is only my opinion that is probably wrong, yadda yadda).

Caveat: personally, I have benefitted immensely from Unity going public. It did break my heart and make my soul hollow, but financially? Hoo boy, I can’t complain at all. So everything written around here should be taken with a grain of salt, this is a rich, white, bald, middle aged man talking nonsense.

You don’t get rocket fuel to go grocery shopping

For better or worse, Unity did take venture capital investment back in 2009. The company and the product were steadily but slowly growing before that. But it also felt tiny and perhaps “not quite safe” – I don’t remember the specifics, but it might have very well been that it was always just “one unlucky month” away from running out of money. Or it could have been wiped out by any of the big giants at the time, with not much more than an accidental fart in our direction from their side. Microsoft coming up with XNA, Adobe starting to add 3D features to Flash, Google making O3D browser plugin technology – all of those felt like possible extinction level events. But miraculously, they were not!

I don’t even remember why and who decided that Unity should pursue venture capital. Might have happened in one of those “bosses calls” that were about overall strategy and direction that I was a part of, until I wasn’t but everyone forgot to tell me. I just kept on wondering why we stopped them… turns out we did not! But that’s a story for another day :)

The first Series A that Unity raised in 2009 ($5.5M), at least to me, felt like it removed the constant worry of possibly not making it to the next month.

However. VC money is like rocket fuel, and you don’t get rocket fuel just to continue grocery shopping every day. You have to get to space.

Many clever people have written a lot about whether the venture capital is a good or a bad model, and I won’t repeat any of that here. It does allow you to go to space, figuratively; but it also only allows you to go to space. Even if you’d rather keep on going to the grocery store forever.

A bunch of old-time Unity users (and possibly employees) who reminisce about “oh, Unity used to be different Back In The Day” have these fond memories of the left side of the graph below. Basically, before the primary goal of the company became “growth, growth and oh did I mention growth?”.

Here are Unity funding rounds (in millions of $ per year), and Unity acquisitions (in number per year) over time. It might be off or incomplete (I just gathered data from what’s on the internet, press releases and public financial reports), but overall I think it paints an approximately correct picture. Unity had an IPO in 2020 ($1.3B raised as part of that), and in 2021 raised an additional $1.5B via convertible notes. It also went on a large acquisition spree in 2019-2021.

The “good old Unity” times that some of you fondly remember, i.e. Unity 4.x-5.x era? That’s 2012-2015. Several years after the initial Series A, but well before the really large funding rounds and acquisitions of 2019+. The “raising money is essentially free” situation that lasted the whole decade before 2020 probably fueled a lot of that spending in pursuit of “growth”.

Vision and Growth

In some ways, being a scrappy underdog is easy – you do have an idea, and you try to make that a reality. There’s a lot of work, a lot of challenges, a lot of unexpected things coming at you, but you do have that one idea that you are working towards.

In June 2005 the Unity website had this text describing what all of this is about:

“We create quality technology that allows ourselves and others to be creative in the field of game development. And we create games to build the insight necessary to create truly useful technology.

We want our technology to be used by creative individuals and groups everywhere to experiment, learn, and create novel interactive content.

We’re dedicated to providing a coherent and clear user experience. What makes us great is our constant focus on the clear interplay of features and functionality from your perspective.”

Whereas in 2008, the “about” page was this:

For comparison, right now in 2024 the tagline on the website is this: “We are the world’s leading platform for creating and operating interactive, real-time 3D (RT3D) content. We empower creators. Across industries and around the world.” Not bad, but also… it does not mean anything.

And while you do have this vision, and are trying to make it a reality, besides the business and lots-of-work struggles, things are quite easy. Just follow the vision!

But then, what do you do when said vision becomes reality? To me, it felt like around year 2012 the vision of “Unity is a flexible engine targeted at small or maybe medium sized teams, to make games on many platforms” was already true. Mobile gaming market was still somewhat friendly to independent developers, and almost everyone there was using Unity.

And then Unity faced a mid-life crisis. “Now what? Is this it?”

From a business standpoint, and the fact that there are VCs who would eventually want a return on their investment, it is not enough to be merely an engine that powers many games done by small studios. So multiple new directions emerged, some from a business perspective, some from “engine technology” perspective. In no particular order:

There are way more consumers than there are game developers. Can that somehow be used? Unity Ads (and several other internal initiatives, most of which failed) is a go at that. I have no idea whether Unity Ads is a good or bad network, or how it compares with others. But it is a large business branch that potentially scales with the number of game players.

There was a thought that gathering data in the form of analytics would somehow be usable or monetizable. “We know how a billion people behave across games!” etc. Most of that thought was before people, platforms and laws became more serious about privacy, data gathering and user tracking.

Other markets besides gaming. There are obvious ones that might need interactive 3D in some way: architecture, construction, product visualization, automotive, medical, movies, and yes, military. To be fair, even without doing anything special, many of those were already using Unity on their own. But from a business standpoint, there’s a thought “can we get more money from them?” which is entirely logical. Some of these industries are used to licensing really shoddy software for millions of dollars, after all.

Within gaming, chasing “high end” / AAA is very alluring, and something that Unity has been trying to do since 2015 or so. Unity has been this “little engine”, kinda looked down on by others. It was hard to hire “famous developers” to work on it. A lot of that changed with JR becoming the CEO. Spending on R&D increased by a lot, many known and experienced games industry professionals were convinced to join, and I guess the compensation and/or prospect of rising stock value was good enough too. Suddenly it felt like everyone was joining Unity (well, the ones who were not joining Epic or Oculus/Facebook at the time).

Things were very exciting!

Except, growth is always hard. And growing too fast is dangerous.

What is our vision, again?

Unity today is a way more capable engine, technology-wise, compared to Unity a decade ago. The push for “high end” did deliver way improved graphics capabilities (HDRP), artist tooling (VFX graph, shader graph, Timeline etc.), performance (DOTS, Burst, job system, internal engine optimizations), and so on.

But also, somehow, the product became much more fractured, more complex and in some ways less pleasant to use.

Somewhat due to business reasons, Unity tried to do everything. Mobile 2D games? Yes! High end AAA console games? Yes (pinky promise)! Web games? Sure! People with no experience whatsoever using the product? Of course! Seasoned industry veterans? Welcome! Small teams? Yes! Large teams? Yes!

At some point (IIRC around 2017-2018) some of the internal thinking became “nothing matters unless it is DOTS (high-end attempt) or AR (for some reason)”. That was coupled with, again, for some reason, “all new code should be written in C#” and “all new things should be in packages”. These two led to drastic slowdowns in iteration time – suddenly there’s way more C# code that has to be re-loaded every time you do any C# script change, and suddenly there’s a way more complex compatibility matrix of which packages work with what.

The growth of R&D led to vastly different styles and thinking about the product, architecture and approaches of problem solving. Random examples:

  • Many AAA games veterans are great at building AAA games, but not necessarily great at building a platform. To them, technology is used by one or, at most, a handful of productions. Building something that is used by millions of people and tens of thousands of projects at once is a whole new world.
  • There was a large faction coming from the web development world, and they wanted to put a ton of “web-like” technologies into the engine. Maybe make various tools work in the browser as well. Someone was suggesting rewriting everything in JavaScript, as a way to fix development velocity, and my fear is that they were not joking.
  • A group of brilliant, top-talent engineers seemed to want to build technology that is the opposite of what Unity is or has been. In their ideal world, everyone would be writing all the code in SIMD assembly and lockless algorithms.
  • There was a faction of Unity old-timers going “What are all these new ideas? Why are we doing them?”. Sometimes raising good questions, sometimes resisting change just because. Yes, I’ve been both :)

All in all, to me it felt like after Unity had arguably achieved “we are the engine of choice for almost every small game developer, due to ease of use, flexibility and platform reach”, the question of what to do next, coupled with business aspects, made the engine go into all directions at once. Unity stopped having, err, “unity” with itself.

Yes, the original DOTS idea had a very strong vision and direction. I don’t know what the current DOTS vision is. But to me the original DOTS vision felt a lot like it was trying to be something other than Unity – it viewed runtime performance as the most important thing, and assumed that everyone’s main goal is getting best possible performance, thinking about data layout, efficient use of CPU cores and so on. All of these are lovely things, and it would be great if everyone thought of that, sure! But the number of people who actually do that is like… all seventy of them? (insert “dozens of us!” meme)

What should Unity engine vision be?

That’s a great question. It is easier to point out things that are wrong, than to state what would be the right things. Even harder is to come up with an actionable plan on how to get from the current non-ideal state to where the “right things” are.

So! Because it is not my job to make hard decisions like that, I’m not going to do it :) What I’ll ponder about, is “what Unity should / could be, if there were no restrictions”. A way easier problem!

In my mind, what “made Unity be Unity” originally, was a combination of several things:

  • Ease of prototyping: the engine and tooling are flexible and general enough, not tied into any specific game type or genre. Trying out “anything” is easy, and almost anything can be changed to work however you want. There are very few restrictions; things and features are “malleable”.
  • Platforms: you can create and deploy to pretty much any relevant platform that exists.
  • Extensible: the editor itself is extremely extensible - you can create menus, whole new windows, scene tooling, or whatever workflow additions are needed for your project.
  • Iteration time and language: C# is a “real” programming language with an enormous ecosystem (IDEs, debuggers, profilers, libraries, knowledge). Editor has reloading of script code, assets, shaders, etc.

I think of those items above as the “key” to what Unity is. Notice that for example “suitable for giant projects” or “best performance in the world” are not on the list. Would it be great to have them? Of course, absolutely! But for example it felt like the whole DOTS push was with the goal of achieving best runtime performance at the expense of the items above, which creates a conflict.

In the early days of Unity, it did not even have many features or tooling built-in. But because it is very extensible, there grew a whole ecosystem with other people providing various tools and extensions. Originally we thought that Asset Store would be mostly for, well, “assets” - models and sounds and texture packs. Sure it has that, but by far the most important things on the asset store turned out to be various editor extensions.

This is a double-edged sword. Yes it did create an impression, especially compared to say Unreal, that “Unity has so few tools, sure you can get many on the asset store but they should be built-in”. In the early days, Unity was simply not large enough to do everything. But with the whole push towards high-end and AAA and “more artist tooling”, it did gain more and more tools built-in (timeline, shader graph, etc.). However, with varying degrees of success.

Many of the new features and workflows added by Unity are (or at least feel like they are) way less “extensible”. Sure, here’s a feature, and that’s it. Can you modify it somehow or bend it to your own needs in an easy way? Haha lol, nope. You can maybe fork the whole package, modify the source code and maintain your fork forever.

What took me a long time to realize, is that there is a difference between “extensible” and “modifiable”. The former tries to add various ways to customize and alter some behavior. The latter is more like “here’s the source code, you can fork it”. Both are useful, but in very different scenarios. And the number of people who would want to fork and maintain any piece of code is very small.
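To make the distinction a bit more concrete, here is a tiny sketch in Python. Everything in it (the `Exporter` class, `add_post_step`, the `.asset` suffix) is a made-up illustration and has nothing to do with any actual Unity API. The "extensible" route means the feature itself offers a hook that outside code can plug into:

```python
class Exporter:
    """A hypothetical built-in feature that exposes an extension point."""

    def __init__(self):
        self._post_steps = []

    def add_post_step(self, step):
        # Extension point: callers register extra behavior, no fork needed.
        self._post_steps.append(step)

    def export(self, name):
        result = f"{name}.asset"
        # Run every registered step, in registration order.
        for step in self._post_steps:
            result = step(result)
        return result


exporter = Exporter()
exporter.add_post_step(str.upper)  # customize behavior from the outside
exported = exporter.export("level")
```

The "modifiable" route would be the same class without `add_post_step`: to change what `export` does, you copy the source, edit it, and keep your copy in sync with upstream from then on.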

So what would my vision for Unity be?

Note that none of these are original ideas; discussions along this direction (and all the other possible directions!) have been circulating inside Unity forever. Which direction(s) will actually get done is anyone’s guess though.

I’d try to stick to the “key things” from the previous section: malleability, extensibility, platforms, iteration time. Somehow nail those, and never lose sight of them. Whatever is done, has to never sacrifice the key things, and ideally improve on them as well.

Make the tooling pleasant to use. Automate everything that is possible, reduce visible complexity (haha, easy, right?), in general put almost all effort into “tooling”. Runtime performance should not be stupidly bad, somehow, but is not the focus.

Achieving the above points would mean that you have to nail down:

  • Asset import and game build pipeline has to be fast, efficient and stable.
  • Iteration on code, shaders, assets has to be quick and robust.
  • Editor has to have plenty of ways to extend itself, and lots of helper ways to build tools (gizmos, debug drawing, tool UI widgets/layouts/interaction). For example, almost everything that comes with Odin Inspector should be part of Unity.
  • In general everything has to be flexible, with as few limitations as possible.

Unity today could be drastically improved in all the points above. Editor extensibility is still very good, even if it is made confusing by the presence of multiple UI frameworks (IMGUI, which almost everything is built on, and UIToolkit, which is new).

To this day I frankly don’t understand why Unity made UIToolkit, and also why it took so many resources (in terms of headcount and development time). I’d have much rather liked Unity to invest in IMGUI along the lines of PanGui.

Additionally, I’d try to provide layered APIs to build things onto. Like a “low level, for experts, possibly Unity people themselves too” layer, and then a higher level, easier to use, “for the rest of us” layer that is built on top of the low level one. Graphics used to be my area of expertise, so for the low level layer you would imagine things like data buffers, texture buffers, ability to modify those, ability to launch things on the GPU (be it draw commands or compute shader dispatches, etc.), synchronization, etc. High level layer would be APIs for familiar concepts like “a mesh” or “a texture” or “a material”.
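As a rough illustration of that layering, here is a minimal Python sketch. All the names (`GpuBuffer`, `CommandList`, `Mesh`) are hypothetical, there is no actual GPU involved, and the "command list" merely records what would be submitted. The only point is that the high level `Mesh` is written purely in terms of the low level pieces, which stay available to anyone who wants to go around it:

```python
import struct
from dataclasses import dataclass, field


# Low level layer (hypothetical): raw buffers and explicit commands.
@dataclass
class GpuBuffer:
    """A plain block of bytes standing in for a GPU data buffer."""
    data: bytearray

    def write(self, offset, payload):
        # Overwrite part of the buffer in place.
        self.data[offset:offset + len(payload)] = payload


@dataclass
class CommandList:
    """Records draw commands instead of executing them."""
    commands: list = field(default_factory=list)

    def draw(self, vertex_buffer, vertex_count):
        self.commands.append(("draw", vertex_buffer, vertex_count))


# High level layer (hypothetical): a familiar concept built on the layer above.
class Mesh:
    def __init__(self, positions):
        self.vertex_count = len(positions)
        flat = [c for p in positions for c in p]
        # Pack xyz positions as 32-bit floats into a low level buffer.
        self.buffer = GpuBuffer(bytearray(struct.pack(f"{len(flat)}f", *flat)))

    def render(self, cmd):
        cmd.draw(self.buffer, self.vertex_count)


triangle = Mesh([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
cmd = CommandList()
triangle.render(cmd)
```

Someone who needs more control can skip `Mesh` entirely and fill a `GpuBuffer` themselves, and whatever they build still ends up in the same `CommandList`.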

The current situation with Unity’s SRPs (“scriptable render pipelines” - URP and HDRP being the two built-in ones) is, shall we say, “less than ideal”. From what I remember, the original idea behind making the rendering engine be “scriptable” was something different than what it turned out to be. The whole SRP concept started out at a somewhat unfortunate time, when Burst and C# Job System did not exist yet; the whole API perhaps should have been a bit different if these two were taken to heart. So today SRP APIs are in a weird spot of being neither low level enough to be very flexible and performant, nor high level enough to be expressive and easy to use.

In my mind, any sort of rendering pipeline (be it one of default ones, or user-made / custom) would work on the same source data, only extending the data with additional concepts or settings when absolutely needed. For example, in the old Unity’s built-in render pipeline, you had a choice between say “deferred lighting” and “per-vertex lighting”, and while these two target extremely different hardware capabilities, result in different rendering and support different graphics features, they work on the same data. Which means the choice between them is “just a setting” somewhere, and not an up-front decision that you have to make before even starting your project. Blender’s “render engines” are similar here - the “somewhat realtime” EEVEE and “offline path tracer” Cycles have different tradeoffs and differ in some features, but they both interpret the same Blender scene.
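A toy Python sketch of that idea, with the big caveat that the "lighting" here is a made-up placeholder and not how Unity or Blender compute anything, and all the names are hypothetical. The scene data carries no renderer-specific fields, and switching renderers is just a string setting:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Light:
    intensity: float


@dataclass(frozen=True)
class Scene:
    """Shared source data: nothing in here is tied to one renderer."""
    albedo: float
    lights: tuple


def render_per_vertex(scene):
    # Cheap placeholder path: only the strongest light contributes.
    strongest = max((light.intensity for light in scene.lights), default=0.0)
    return scene.albedo * strongest


def render_deferred(scene):
    # Costlier placeholder path: every light contributes.
    return scene.albedo * sum(light.intensity for light in scene.lights)


RENDERERS = {"per_vertex": render_per_vertex, "deferred": render_deferred}


def render(scene, mode):
    # The renderer choice is "just a setting", looked up at render time.
    return RENDERERS[mode](scene)


scene = Scene(albedo=0.5, lights=(Light(1.0), Light(2.0)))
cheap = render(scene, "per_vertex")
fancy = render(scene, "deferred")
```

The two results differ, as they should, but nothing about `scene` had to be authored differently to get either of them.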

Within Unity’s SRP land, what started out initially as experiments and prototypes to validate the API itself – “is this API usable to build a high end PBR renderer?” and “is this API usable to build a minimalistic and lean low-end renderer?”, ended up shipping as a very in-your-face user setting. They should have been prototypes, and then the people making the two should have gathered together, decided on the learnings and findings about the API, and thought about what to do for reals. But reality happened, and now there are two largely incompatible render pipelines. Oh well!

Oh, one more additional thing, just make source code available ffs. There’s nothing you are gaining by making people jump through licensing, legal and cost hoops to get to it, and you’re losing a lot. Being able to read, reason and debug source code, and maybe make a hotfix or two are very important to finish any complex project.

Ok, but who would that engine be for? That’s a complex question, but hey it is not my job to figure out the answers. “A nice easy to use game engine for prototypes and small teams”, I think, would definitely not be an IPO material, and probably not VC material either. Maybe it could be a healthy and sustainable business for a 50 employee sized company. Definitely not something that grew big, then stalled, then <who knows what will happen next> but it made a few dozen people filthy rich :)

Wot about AI?

I know next to nothing about all the modern AI (GenAI, LLMs etc.) thingies. It is a good question, whether the “current way” of building engines and tools is a good model for the future.

Maybe all the complex setups and lighting math that they do within computer graphics is kinda pointless, and you should just let a giant series of matrix multiplications hallucinate the rendered result? It used to be a joke that “the ideal game tool is a text field and a Make Game button”, but that joke is no longer funny now.

Anyhoo, given that I’m not an expert, I don’t have an opinion on all of this. “I don’t know!”

But what I do occasionally think about, is whether Unity is in a weird place of not being low-level enough, and not high-level enough at the same time.

A practical example would be, that within Unity there does not exist a concept like “this surface is made of pine tree” – to make a “wooden” thing in Unity, you have to get some wood textures, create a Material, pick a Shader, and set up parameters on that. The surface has to be a Mesh, and the object has to have a Mesh Renderer and a (why?) Mesh Filter. Then you need to have a Collider, and set up some sort of logic of “play this sound when something hits it”, and the sounds have to be made by someone. The pine surface needs to have a Physics Material on it, with, uhh, some sort of friction, restitution and bounciness coefficients? Oh, if it moves it should have a Rigidbody with a bunch of settings. Should the surface break when something hits it hard enough? Where to even start on that?

Is it great that Unity allows you to specify all of these settings in minute detail? For some cases, yes maybe. I would imagine that many folks would happily take a choice of “make this look, feel and behave as if it is made of pine wood” however. So maybe the layer of Unity that people mostly interact with should be higher level than that of Box Colliders and Rigidbodies and Mesh Renderers. I don’t have an answer on what that level should look like exactly, but it is something to ponder about.
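One possible shape of such a higher level layer, sketched in Python. This is purely hypothetical: the `made_of` function, the `SurfaceSetup` fields, the file names and every number are invented placeholders, not anything Unity has. The idea is that "pine" expands into all the low level settings, while each of them stays overridable:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SurfaceSetup:
    """The low level settings you would otherwise fill in one by one."""
    texture: str
    friction: float
    bounciness: float
    hit_sound: str
    breaks_at_impulse: float


# Hypothetical material library; all values are made-up placeholders.
MATERIALS = {
    "pine": SurfaceSetup("pine_albedo.png", 0.5, 0.1, "wood_knock.wav", 200.0),
    "steel": SurfaceSetup("steel_albedo.png", 0.3, 0.05, "metal_clang.wav", 5000.0),
}


def made_of(material, **overrides):
    # High level entry point: start from the preset, then apply any
    # per-object tweaks without touching the shared preset itself.
    return replace(MATERIALS[material], **overrides)


plank = made_of("pine")
icy_plank = made_of("pine", friction=0.1)
```

The minute-detail control is still there for the cases that need it; it just stops being the first thing you have to deal with.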

At the same time, the low-levels of Unity are not low-level enough. Looking at graphics related APIs specifically, a good low-level API would expose things like mesh shaders, and freely threaded buffer creation, and bindless resources by now.

Where I lose my train of thought and finish this post

Anyway. I was not sure where I was going with all of the above, so let’s say it is enough for now. I really hope that Unity decides where it actually wants to go, and then goes there with a clear plan. It has been sad to watch many good people leave or be laid off, and many companies that made great Unity games switch away from Unity. The technology and the good people within the company deserve so much better than a slow moving trainwreck.