"Modern" C++ Lamentations

This will be a long wall of text, and kinda random! My main points are:

  1. C++ compile times are important,
  2. Non-optimized build performance is important,
  3. Cognitive load is important. I don’t expand much on this here, but if a programming language or a library makes me feel stupid, then I’m less likely to use it or like it. C++ does that a lot :)

The “Standard Ranges” blog post by Eric Niebler – about the C++20 ranges feature – was making the rounds in the game development twitterverse lately, with many expressing something like a “dislike” (mildly said) about the state of modern C++.

I have expressed a similar thought too (link):

That example for Pythagorean Triples using C++20 ranges and other features sounds terrible to me. And yes I get that ranges can be useful, projections can be useful etc. Still, a terrible example! Why would anyone want to code like that?!

Which got slightly out of hand (now 5 days later, that tree of threads is still receiving a lot of replies!).

Now, apologies to Eric for pointing out his post; my lamentation is mostly about “general state of C++” lately. The “bunch of angry gamedevs” on twitter has been picking on Boost.Geometry rationale a year or two ago in a very similar way, and a dozen other times for other aspects of C++ ecosystem.

But you know, twitter not being exactly a nuanced place, etc. etc. So let me expand here!

Pythagorean Triples, C++20 Ranges Style

Here’s the full example from Eric’s post:

// A sample standard C++20 program that prints
// the first N Pythagorean triples.
#include <iostream>
#include <optional>
#include <ranges>   // New header!
 
using namespace std;
 
// maybe_view defines a view over zero or one
// objects.
template<Semiregular T>
struct maybe_view : view_interface<maybe_view<T>> {
  maybe_view() = default;
  maybe_view(T t) : data_(std::move(t)) {
  }
  T const *begin() const noexcept {
    return data_ ? &*data_ : nullptr;
  }
  T const *end() const noexcept {
    return data_ ? &*data_ + 1 : nullptr;
  }
private:
  optional<T> data_{};
};
 
// "for_each" creates a new view by applying a
// transformation to each element in an input
// range, and flattening the resulting range of
// ranges.
// (This uses one syntax for constrained lambdas
// in C++20.)
inline constexpr auto for_each =
  []<Range R,
     Iterator I = iterator_t<R>,
     IndirectUnaryInvocable<I> Fun>(R&& r, Fun fun)
        requires Range<indirect_result_t<Fun, I>> {
      return std::forward<R>(r)
        | view::transform(std::move(fun))
        | view::join;
  };
 
// "yield_if" takes a bool and a value and
// returns a view of zero or one elements.
inline constexpr auto yield_if =
  []<Semiregular T>(bool b, T x) {
    return b ? maybe_view{std::move(x)}
             : maybe_view<T>{};
  };
 
int main() {
  // Define an infinite range of all the
  // Pythagorean triples:
  using view::iota;
  auto triples =
    for_each(iota(1), [](int z) {
      return for_each(iota(1, z+1), [=](int x) {
        return for_each(iota(x, z+1), [=](int y) {
          return yield_if(x*x + y*y == z*z,
            make_tuple(x, y, z));
        });
      });
    });
 
    // Display the first 10 triples
    for(auto triple : triples | view::take(10)) {
      cout << '('
           << get<0>(triple) << ','
           << get<1>(triple) << ','
           << get<2>(triple) << ')' << '\n';
  }
}

Eric’s post builds on his earlier post from a few years back, which was a response to Bartosz Milewski’s post “Getting Lazy with C++”, where a simple C function to print the first N Pythagorean Triples was presented:

void printNTriples(int n)
{
    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("%d, %d, %d\n", x, y, z);
                    if (++i == n)
                        return;
                }
}

Some issues with it were pointed out as well:

This is fine, as long as you don’t have to modify or reuse this code. But what if, for instance, instead of printing, you wanted to draw the triples as triangles? Or if you wanted to stop as soon as one of the numbers reached 100?

And then lazy evaluation with list comprehensions is presented as the way to solve these issues. It is a way to solve these issues indeed, just that C++ the language does not quite have built-in functionality to do that, like Haskell or other languages have. C++20 will have more built-in things in that regard, similar to how Eric’s post shows. But I’ll get to that later.

Pythagorean Triples, Simple C++ Style

So, let’s get back to the simple (“fine as long as you don’t have to modify or reuse”, as Bartosz says) C/C++ style of solving the problem. Here’s a complete program that prints the first 100 triples:

// simplest.cpp
#include <time.h>
#include <stdio.h>

int main()
{
    clock_t t0 = clock();

    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("(%i,%i,%i)\n", x, y, z);
                    if (++i == 100)
                        goto done;
                }
    done:

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

We can compile it: clang simplest.cpp -o outsimplest. Compilation takes 0.064 seconds, produces an 8480 byte executable, which runs in 2 milliseconds and prints the numbers (the machine is a 2018 MacBookPro; Core i9 2.9GHz; Xcode 10 clang):

(3,4,5)
(6,8,10)
(5,12,13)
(9,12,15)
(8,15,17)
(12,16,20)
(7,24,25)
(15,20,25)
(10,24,26)
...
(65,156,169)
(119,120,169)
(26,168,170)

But wait! That was a default, non-optimized (“Debug”) build; let’s build an optimized (“Release”) build: clang simplest.cpp -o outsimplest -O2. That takes 0.071s to compile, produces same size (8480b) executable, and runs in 0ms (under the timer precision of clock()).

As Bartosz correctly points out, the algorithm is not “reusable” here, since it’s intermixed with “what to do with the results”. Whether that is a problem or not is outside the scope of this post (personally I think “reusability” or “avoid duplication at all costs” are overrated). Let’s assume it is a problem, and indeed we want “something” that would just return first N triples, without doing anything with them.

What I would probably do, is do the simplest possible thing – make something that can be called, that returns the next triple. It might look something like this:

// simple-reusable.cpp
#include <time.h>
#include <stdio.h>

struct pytriples
{
    pytriples() : x(1), y(1), z(1) {}
    void next()
    {
        do
        {
            if (y <= z)
                ++y;
            else
            {
                if (x <= z)
                    ++x;
                else
                {
                    x = 1;
                    ++z;
                }
                y = x;
            }
        } while (x*x + y*y != z*z);
    }
    int x, y, z;
};

int main()
{
    clock_t t0 = clock();

    pytriples py;
    for (int c = 0; c < 100; ++c)
    {
        py.next();
        printf("(%i,%i,%i)\n", py.x, py.y, py.z);
    }

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

This compiles and runs in pretty much the same times; Debug executable becomes 168 bytes larger; Release executable same size.

I did make a pytriples struct, where each call to next() advances to the next valid triple; the caller can do whatever it pleases with the result. Here, I just call it a hundred times, printing the triple each time.

However, while the implementation is functionally equivalent to what the triple-nested for loop was doing in the original example, indeed it becomes a lot less clear, at least to me. It’s very clear how it does it (some branches and simple operations on integers), but not immediately clear what it does on a high level.

If C++ had something like a “coroutine” concept, it would be possible to implement the triples generator that would be as clear as the original nested for loops, yet not have any of the “problems” (Jason Meisel points out exactly that in “Ranges, Code Quality, and the Future of C++” post); something like (tentative syntax, as coroutines aren’t part of any C++ standard yet):

generator<std::tuple<int,int,int>> pytriples()
{
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z)
                    co_yield std::make_tuple(x, y, z);
}
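
Consuming that generator would then look just like consuming any other range. Here’s a sketch, assuming a generator<T> type along the lines of cppcoro::generator (not something the standard library has today), plus the pytriples() coroutine above:

#include <cstdio>
#include <tuple>

int main()
{
    int count = 0;
    for (auto&& triple : pytriples())   // lazily pulls values out of the coroutine
    {
        printf("(%i,%i,%i)\n",
            std::get<0>(triple), std::get<1>(triple), std::get<2>(triple));
        if (++count == 100)
            break;                      // stop consuming; the generator never runs further
    }
    return 0;
}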

Back to C++ Ranges

Is the C++20 Ranges style more clear at what it does? From Eric’s post, this is the main part:

auto triples =
    for_each(iota(1), [](int z) {
        return for_each(iota(1, z+1), [=](int x) {
            return for_each(iota(x, z+1), [=](int y) {
                return yield_if(x*x + y*y == z*z,
                    make_tuple(x, y, z));
                });
            });
        });

You could argue either way. I think the “coroutines” approach above is way more clear. The C++ way of creating lambdas, and the choice of the C++ standard to make things look clever (“what’s an iota? it’s a Greek letter, look how smart I am!”) are both a bit cumbersome. Multiple returns feel unusual if the reader is used to an imperative programming style, but possibly one could get used to it.

Maybe you could squint your eyes and say that this is an acceptable and nice syntax.

However, I refuse to believe that “us mere mortals” without a PhD in C++ would be able to write the utilities that are needed for the code above to work:

template<Semiregular T>
struct maybe_view : view_interface<maybe_view<T>> {
  maybe_view() = default;
  maybe_view(T t) : data_(std::move(t)) {
  }
  T const *begin() const noexcept {
    return data_ ? &*data_ : nullptr;
  }
  T const *end() const noexcept {
    return data_ ? &*data_ + 1 : nullptr;
  }
private:
  optional<T> data_{};
};
inline constexpr auto for_each =
  []<Range R,
     Iterator I = iterator_t<R>,
     IndirectUnaryInvocable<I> Fun>(R&& r, Fun fun)
        requires Range<indirect_result_t<Fun, I>> {
      return std::forward<R>(r)
        | view::transform(std::move(fun))
        | view::join;
  };
inline constexpr auto yield_if =
  []<Semiregular T>(bool b, T x) {
    return b ? maybe_view{std::move(x)}
             : maybe_view<T>{};
  };

Maybe that is mother tongue to someone, but to me this feels like someone decided that “Perl is clearly too readable, but Brainfuck is too unreadable, let’s aim for somewhere in the middle”. And I’ve been programming mostly in C++ for the past 20 years. Maybe I’m too stupid to get this, okay.

And yes, sure, the maybe_view, for_each, yield_if are all “reusable components” that could be moved into a library; a point I’ll get to… right now.

Issues with “Everything is a library” C++

There are at least two: 1) compilation time, and 2) non-optimized runtime performance.

Allow me to show them using this same Pythagorean Triples example, though the issue is true for many other features of C++ that are implemented as part of “libraries”, and not the language itself.

Actual C++20 isn’t out yet, so for a quick test I took the current best “ranges” approximation, which is range-v3 (made by Eric Niebler himself), and compiled the canonical “Pythagorean Triples with C++ Ranges” example with it.

// ranges.cpp
#include <time.h>
#include <stdio.h>
#include <range/v3/all.hpp>

using namespace ranges;

int main()
{
    clock_t t0 = clock();

    auto triples = view::for_each(view::ints(1), [](int z) {
        return view::for_each(view::ints(1, z + 1), [=](int x) {
            return view::for_each(view::ints(x, z + 1), [=](int y) {
                return yield_if(x * x + y * y == z * z,
                    std::make_tuple(x, y, z));
            });
        });
    });

    RANGES_FOR(auto triple, triples | view::take(100))
    {
        printf("(%i,%i,%i)\n", std::get<0>(triple), std::get<1>(triple), std::get<2>(triple));
    }

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

I used a post-0.4.0 version (9232b449e44 on 2018 Dec 22), and compiled the example with clang ranges.cpp -I. -std=c++17 -lc++ -o outranges. It compiled in 2.92 seconds, the executable was 219 kilobytes, and it runs in 300 milliseconds.

And yes, that’s a non-optimized build. An optimized build (clang ranges.cpp -I. -std=c++17 -lc++ -o outranges -O2) compiles in 3.02 seconds, executable is 13976 bytes, and runs in 1ms. So the runtime performance is fine, executable is slightly larger, and compile time issue of course remains.

More on the points above:

Compilation Times Are a Big Issue in C++

Compile time of this really simple example takes 2.85 seconds longer than the “simple C++” version.

Lest you think that “under 3 seconds” is a short time – it’s absolutely not. In 3 seconds, a modern CPU can do a gajillion operations. For example, the time it takes for clang to compile a full actual database engine (SQLite) in Debug build, with all 220 thousand lines of code, is 0.9 seconds on my machine. In which world is it okay to compile a trivial 5-line example three times slower than a full database engine?!

C++ compilation times have been a source of pain in every non-trivial-size codebase I’ve worked on. Don’t believe me? Try building one of the widely available big codebases (any of: Chromium, Clang/LLVM, UE4 etc will do). Among the things I really want out of C++, “solve compile times” is probably #1 on the list, and has been since forever. Yet it feels like the C++ community at large pretends that is not an issue, with each revision of the language putting even more stuff into header files, and even more stuff into templated code that has to live in header files.

To a large extent that is caused by the ancient “just literally paste the file contents” #include model that C++ inherited from C. But whereas C tends to only have struct declarations and function prototypes in headers, in C++ you often need to put whole templated classes/functions in there too.
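As a toy illustration of why that is (a made-up header, not from any real codebase): the template’s full definition has to be visible wherever it is instantiated, so it sits in the header and gets re-parsed by every translation unit that includes it.

// my_math.h (hypothetical header)
#pragma once

// Because callers instantiate clamp01<float>, clamp01<double> etc. at their
// include site, the whole definition must live in the header; every .cpp that
// includes this file re-parses and re-instantiates it.
template<typename T>
T clamp01(T v)
{
    return v < T(0) ? T(0) : (v > T(1) ? T(1) : v);
}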

range-v3 is 1.8 megabytes of source code, all in header files! So while the example of “use ranges to output 100 triples” is 30 lines long, after processing header includes the compiler ends up with 102 thousand lines of code to compile. The “simple C++” example, after all preprocessing, is 720 lines of code.

“But precompiled headers and/or modules solve this!”, I hear you say. Fair enough. Let’s put the ranges header into a precompiled header (make a pch.h containing #include <range/v3/all.hpp>, include pch.h instead, create the PCH with clang -x c++-header pch.h -I. -std=c++17 -o pch.h.pch, and compile using the PCH with clang ranges.cpp -I. -std=c++17 -lc++ -o outranges -include-pch pch.h.pch). Compilation time becomes 2.24s, so PCHs do indeed save about 0.7 seconds of compile time here. They do not save the other 2.1s by which this is longer than the simple C++ approach though :(

Non-Optimized Build Performance is Important

Runtime performance of the “ranges” example is 150 times slower. Two or three times maybe would be acceptable. Anything “over ten times slower”, and it likely means “unusable”. Over hundred times slower? Forget it.

In a real codebase that solves real problems, two orders of magnitude slower likely means that it just would not work on any real data set. I’m working in the video games industry; in practical terms this would mean that Debug builds of the engine or tooling would not work on any real game levels (performance would be nowhere near the needed interactivity level). Maybe there are industries where you run a program on a set of data, and wait for the result, and if it takes 10 or 100 times longer in Debug then it is merely “annoying”. But where something has to be interactive, that turns “annoying” into “unusable” – you literally can not “play” through a game level if it renders at two frames per second.

Yes, an optimized build (-O2 in clang) runs at the same performance as the simple C++ version… so “zero cost abstractions” indeed, as long as you don’t care about compilation times, and have an optimizing compiler.

But debugging optimized code is hard! Sure it’s possible, and actually a very useful skill to learn… Similar to how riding a unicycle is both possible, and teaches you an important skill of balance. Some people enjoy it and are really good at it even! But most people would not pick unicycle as their primary means of transportation, similar to how most people don’t debug optimized code if they can avoid it.

Arseny Kapoulkine has a great livestream “Optimizing OBJ loader” where he also ran into “Debug build is too slow” issue, and made it 10x faster by avoiding some of STL bits (commit). As a side effect, it also made compile times faster (source) and debugging easier, since Microsoft’s STL implementation in particular is extremely fond of deeply nested function calls.

That is not to say that “STL is necessarily bad”; it is possible to write an STL implementation that does not become 10x slower in a non-optimized build (as EASTL or libc++ do), but for whatever reason Microsoft’s STL is extremely slow, due to over-reliance on “inlining will fix it up”.

As a user of the language, I don’t care whose fault it is though! All I know is that “STL is too slow in Debug”, and I’d rather have that fixed, or I will look into alternatives (e.g. not using STL, re-implementing the bits I need myself, or stop using C++ altogether).

How do other languages compare?

Here’s a brief look at a very similar implementation of “lazily evaluated Pythagorean Triples” in C#. Full C# source code:

using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    public static void Main()
    {
        var timer = Stopwatch.StartNew();
        var triples =
            from z in Enumerable.Range(1, int.MaxValue)
            from x in Enumerable.Range(1, z)
            from y in Enumerable.Range(x, z)
            where x*x+y*y==z*z
            select (x:x, y:y, z:z);
        foreach (var t in triples.Take(100))
        {
            Console.WriteLine($"({t.x},{t.y},{t.z})");
        }
        timer.Stop();
        Console.WriteLine($"{timer.ElapsedMilliseconds}ms");
    }
}

Personally I find the actual bit pretty readable. Compare C#:

var triples =
    from z in Enumerable.Range(1, int.MaxValue)
    from x in Enumerable.Range(1, z)
    from y in Enumerable.Range(x, z)
    where x*x+y*y==z*z
    select (x:x, y:y, z:z);

with C++:

auto triples = view::for_each(view::ints(1), [](int z) {
    return view::for_each(view::ints(1, z + 1), [=](int x) {
        return view::for_each(view::ints(x, z + 1), [=](int y) {
            return yield_if(x * x + y * y == z * z,
                std::make_tuple(x, y, z));
        });
    });
});

I know which one I find cleaner. Do you? Though to be fair, an alternative, “less databasey” form of C# LINQ is pretty busy as well:

var triples = Enumerable.Range(1, int.MaxValue)
    .SelectMany(z => Enumerable.Range(1, z), (z, x) => new {z, x})
    .SelectMany(t => Enumerable.Range(t.x, t.z), (t, y) => new {t, y})
    .Where(t => t.t.x * t.t.x + t.y * t.y == t.t.z * t.t.z)
    .Select(t => (x: t.t.x, y: t.y, z: t.t.z));

How much time does it take to compile this C# code? I’m on a Mac, so I’ll use the Mono compiler (which itself is written in C#), version 5.16. mcs Linq.cs takes 0.20 seconds. In comparison, compiling an equivalent “simple C#” version takes 0.17 seconds.

So this lazy evaluation LINQ style creates additional 0.03 seconds work for the compiler to do. In comparison, the C++ case was creating an additional 3 seconds of work, or 100x more!

This is what you get when “features” are part of the language, as opposed to “it comes as hundred thousand lines of code for the compiler to plow through”.

But can’t you just ignore the parts you don’t like?

Yes, to some extent.

For example here (Unity), we had a joke that “adding Boost into the codebase is a fireable offense”. I guess that is not true, since sometime last year Boost.Asio got added, and then I grumbled quite a bit about how it’s super slow to compile, and that merely including <asio.h> includes the whole of <windows.h>, with all the macro name hijack horrors that it has.

In general we’re trying to avoid most of STL too. For the containers, we have our own ones, somewhat along the same reasons as the EASTL motivation – more uniform behavior across platforms/compilers, better performance in non-optimized builds, better integration with our own memory allocators & allocation tracking. Some other containers we replace purely for performance reasons (STL unordered_map can’t be fast by design, since the standard requires it to be separately chained; our own hash table uses open addressing instead). Large parts of the standard library functionality we don’t actually even need.
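To illustrate the open addressing point, here’s a toy sketch (not our actual container; a real one would grow, rehash, and handle the sentinel key too). All keys live in one flat array, and a lookup walks forward from the hashed slot until the key or an empty slot is found – no per-node allocations and no pointer chasing, unlike a separately chained hash map.

#include <climits>
#include <cstddef>
#include <vector>

struct IntSet
{
    explicit IntSet(size_t capacity) : slots(capacity, kEmpty) {}

    bool insert(int key) // returns false if already present or table is full
    {
        for (size_t i = 0; i < slots.size(); ++i)
        {
            size_t idx = (hash(key) + i) % slots.size(); // linear probing
            if (slots[idx] == kEmpty) { slots[idx] = key; return true; }
            if (slots[idx] == key) return false;
        }
        return false; // full; a real implementation would grow and rehash
    }

    bool contains(int key) const
    {
        for (size_t i = 0; i < slots.size(); ++i)
        {
            size_t idx = (hash(key) + i) % slots.size();
            if (slots[idx] == kEmpty) return false;
            if (slots[idx] == key) return true;
        }
        return false;
    }

private:
    static size_t hash(int key) { return (size_t)(unsigned)key * 2654435761u; }
    static constexpr int kEmpty = INT_MIN; // sentinel marking an empty slot
    std::vector<int> slots;
};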

However.

It takes time to convince each and every new hire (particularly the more junior ones straight from universities) that no, just because it’s called “modern” C++, does not automatically mean it’s better (it might be! or it might be not). Or that no, “C code” does not automatically mean it’s hard to understand, follow or is riddled with bugs (it might be! or it might be not).

Just a couple weeks ago at work I was rambling how I’m trying to understand some piece of (our own) code, and I can’t, since the code is “too complex” for me to get it. Another (junior) programmer drops by, asks me why I look like I’m about to ‎(ノಥ益ಥ)ノ ┻━┻, I say “so I’m trying to understand this code, but it’s way too complex for me”. His immediate response was “ah, old C style code?”, and I was “no, in fact the complete opposite!” (the code I was looking at was some sort of template metaprogramming thingamabob). He hasn’t worked with large codebases, neither C nor C++ yet, but something has already convinced him that “hard to understand” must be C code. I suspect the university; often classes flat out immediately say that “C is bad”, without ever explaining how exactly; but it does leave that impression on the young minds of future programmers.

So yes, I can certainly ignore parts I don’t like about C++. But it’s tiring to educate many others I’m working with, since many are under impression that “modern, must be better” or that “standard library, must be better than anything we could write ourselves”.

Why is C++ this way?

I don’t quite know. Admittedly they do have a very complex problem to solve, which is “how to evolve a language while keeping it close to 100% backwards compatible with many decades of past decisions”. Couple that with the fact that C++ tries to serve many masters, use cases and experience levels, and you do get a complex problem.

But to some extent, it feels like a big part of C++ committee and the ecosystem is focused on “complexity” in terms of showing off or proving their worth.

There was this joke on the internet a while ago about typical progression of a C/C++ programmer. I remember myself being in the middle stage, some 16 years ago. Very impressed with Boost, in a sense of “wow you can do that, that’s so cool!”. Without questioning at why you’d want to do that.

Similar to, I don’t know, Formula 1 cars or triple-neck guitars. Impressive? Sure. Marvel of engineering? Of course. Requires massive amount of skill to handle them properly? Yes! Not the right tool for 99% of situations you’d even find yourself in? Yup.

Christer Ericson has put it nicely here:

Goal of programmers is to ship, on time, on budget. It’s not “to produce code.” IMO most modern C++ proponents 1) overassign importance to source code over 2) compile times, debugability, cognitive load for new concepts and extra complexity, project needs, etc. 2 is what matters.

And yes, people who are concerned with state of C++ and the standard libraries can of course join the effort in trying to improve it. Some do. Some are (or think they are) too busy to spend time in committees. Some ignore parts of the standards and make their own parallel libraries (e.g. EASTL). Some think C++ is beyond salvation and try to make their own languages (Jai) or jump ship elsewhere (Rust, subsets of C#).

Taking feedback, and giving feedback

I know how hard it can be, when “a bunch of angry people on the internet” are saying that your work is proverbial dung. I’m working on probably the most popular game engine, with millions of users, and some part of them love to point out, directly or indirectly, how “it sucks”. It’s hard; me and others have put so much thought and work into it, and someone else comes along and says we’re all idiots and our work is crap. Sad!

It probably feels the same for anyone working on C++, STL, or any other widely used technology really. They have worked for years on something, and then a bunch of Angry Internet People come and trash their lovely work.

It’s extremely easy to get defensive about this, and is the most natural reaction. Oftentimes also not the most constructive one though.

Ignoring literal trolls who complain on the internet “just for the lulz”, the majority of complaints do have an actual issue or problem behind them. It might be worded poorly, or exaggerated, or whoever is complaining did not think about other possible viewpoints, but there is a valid issue behind the complaint anyway.

What I do whenever someone complains about thing I’ve worked on, is try to forget about “me” and “work I did”, and get their point of view. What are they trying to solve, and what problems do they run into? The purpose of any software/library/language is to help their users solve the problems they have. It might be a perfect tool at solving their problem, an “ok I guess that will work” one, or a terribly bad one at that.

  • “I’ve worked very hard on this, but yeah, sounds like my work is not good at solving your problem” is a perfectly valid outcome!
  • “I’ve worked very hard on this, but did not know/consider your needs, let me see how they can be addressed” is a great outcome too!
  • “Sorry I don’t understand your problem” is fine as well!
  • “I’ve worked very hard on this, but turns out no one has the problem that my solution is supposed to solve” is a sad outcome, but it also can and does happen.

Some of the “all feedback will be ignored unless it comes in form of a paper presented at a C++ committee meeting” replies that I’ve seen in the past few days sound to me like a not very productive approach. Likewise, defending design of a library with an argument that “it was a popular library in Boost!” misses out on the part of the C++ world that does not think Boost libraries are a good design/solution.

The “games industry” at large, however, is also at fault to some extent. Game technologies are traditionally built with C or C++, just because up until very recently, other viable systems programming languages simply did not exist (now you at least have Rust as a possible contender). For the amount of reliance on C/C++, the industry certainly has not done a good enough job at making itself visible and working on improving the language, the libraries and the ecosystem.

Yes it’s hard work, and yes complaining on the internet is way easier. And whoever ends up working on the future of C++ is not solving “immediate problems” (like shipping a game or whatever); they are working on something much more long-term. But there are companies who could afford to do this; any of the big engine companies or large publishers with central technology groups could totally do it. Whether that would be worth doing, I don’t know, but indeed it’s a bit hypocritical to say “C++ is nuts and is not what we want”, while never telling the folks who make the language what it is that “we want”.

My impression is that many in the games technology field are “mostly fine” with recent (C++11/14/17) additions to the language itself - lambdas are useful, constexpr if is great, etc. etc. They tend to largely ignore whatever is getting added into the standard libraries, either because the design/implementations of STL have the issues pointed out above (long compile times, bad Debug build performance), or because it is simply not that useful, or because the companies have already written their own containers/strings/algorithms/… years ago, so why change something that already works.

Here I say that is enough for a post, I need to get dinner.


C++11 way of initializing integers

So this tweet by Arseny Kapoulkine got me interested:

One of the most annoying changes to C++ that happened in the last decade for me personally is the introduction of initialization-using-curly-braces and people starting to use it all over the place. There are rare cases when that might make sense, but seriously, int count{ 0 };?

Now, I love the curly-braces initializers in C#, but I haven’t used them in C++ much. There might or might not be good reasons for using them, I don’t have an opinion on that aspect. What got me interested is this: how much work is there for a compiler? I.e. if you have int a = 7; vs int a { 7 };, which one is faster to compile?

“Oh come on Aras, who would care about a trivial thing like that?!” you say… well, I would care, for one. I was also recently optimizing compile times of our C++ testing framework, and it turns out, once you have things that repeat thousands of times (e.g. equality checks in unit tests), even tiny savings do add up.

Simple test: half a million integer initializations on Clang

At first I did a simple file that would initialize and use half a million integers, and measured the time it takes for Clang to compile that:

// test.cpp
#ifdef MODERN
#define STUFF { int a {7}; c += a; }
#else
#define STUFF { int a = 7; c += a; }
#endif
#define STUFF1 STUFF STUFF STUFF STUFF
#define STUFF2 STUFF1 STUFF1 STUFF1 STUFF1
#define STUFF3 STUFF2 STUFF2 STUFF2 STUFF2
#define STUFF4 STUFF3 STUFF3 STUFF3 STUFF3
#define STUFF5 STUFF4 STUFF4 STUFF4 STUFF4
#define STUFF6 STUFF5 STUFF5 STUFF5 STUFF5
#define STUFF7 STUFF6 STUFF6 STUFF6 STUFF6
#define STUFF8 STUFF7 STUFF7 STUFF7 STUFF7

#include <stdio.h>
int main()
{
	int c = 0;
	STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8;
	printf("got %i\n", c);
	return 0;
}

Of course I did not want to type half a million integer initializers, so preprocessor to the rescue. Each STUFFn expands the previous one four times, and reaching half a million repeats is easy that way. It just ends up being either { int a = 7; c += a; } or { int a {7}; c += a; } over and over again, and the final value of c is printed.

So, let’s measure! This is on Intel Core i9 8950HK 2.9GHz (2018 MacBookPro), clang “Apple LLVM version 10.0.0” version (the one in Xcode 10).

  • Traditional time clang -std=c++11 test.cpp: 8.2 seconds.
  • C++11 style curly brace time clang -std=c++11 -DMODERN test.cpp: 9.0 seconds. That’s 9.7% longer!

So from this, it would seem that on Clang, using C++11 style initializer-list syntax for simple types does create more work for the compiler. In doing a debug build, Clang spends about 10% more time on the whole compilation (which also includes things like machine code generation, so the overhead on the frontend itself is larger).

What about the other compilers? That’s where things got a bit more complicated… :)

Initializing lots of integers on other compilers

Then I ran the same test on MSVC (VS2017, version 15.9.4), and… it crashed with an out-of-memory error. Ok, I was using the 32 bit cl.exe. Trying the 64 bit one, it proceeds to use about 25 gigabytes (!) of memory, and after about 3 minutes of compilation time I just gave up.

Reducing the number of initializers four times (to 128 thousand), and testing on an AMD ThreadRipper 1950X 3.4GHz:

#ifdef MODERN
#define STUFF { int a {7}; c += a; }
#else
#define STUFF { int a = 7; c += a; }
#endif
#define STUFF1 STUFF STUFF STUFF STUFF
#define STUFF2 STUFF1 STUFF1 STUFF1 STUFF1
#define STUFF3 STUFF2 STUFF2 STUFF2 STUFF2
#define STUFF4 STUFF3 STUFF3 STUFF3 STUFF3
#define STUFF5 STUFF4 STUFF4 STUFF4 STUFF4
#define STUFF6 STUFF5 STUFF5 STUFF5 STUFF5
#define STUFF7 STUFF6 STUFF6 STUFF6 STUFF6

#include <stdio.h>
int main()
{
	int c = 0;
	STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7;
	printf("got %i\n", c);
	return 0;
}

  • Traditional ptime cl test.cpp: 42-44 seconds.
  • C++11 style ptime cl /DMODERN test.cpp: 42-44 seconds. So, way, way longer than Clang, but there’s no compile time difference in how the integer is initialized.
    • Clang on Mac with the 4x reduced initializer count takes 2.0s (traditional) and 2.2s (curly brace).

What about gcc then? With half a million initializers, I also could not wait for it to finish. With 128k initializers, on the same ThreadRipper PC; “traditional” style using time gcc test.cpp -std=c++11, and “curly brace” style using time gcc test.cpp -std=c++11 -DMODERN:

  • gcc 5.4 in WSL: crashes after several minutes in both styles,
  • gcc 7.3 in VMWare: traditional 181 sec, curly brace 188 sec. So maybe 4% slower, but I did not do many measurements.

I’ve also tested Clang 3.8 on WSL and Clang 6.0 in VMWare on the same ThreadRipper PC; timings are consistent with Mac results, i.e. initializing a lot of integers using C++11 curly brace syntax makes compile time about 9% slower.

Note: I tested the manually-expanded giant C++ file with 128 thousand initializers in it, to figure out if it’s the macro expansion that is slowing down the compilers. On MSVC, it still takes the same ~42 seconds to compile, so the bottleneck is definitely not the macro expansion.

Takeaways

  • Clang seems to be about 9% slower at compiling a lot of int a { 7 }; initializers, compared to traditional int a = 7; ones.
  • Gcc might be about 3-4% slower at compiling the curly brace integer initializers.
  • MSVC compiles both forms of integer initialization in the same time.
  • Both MSVC and Gcc are way slower than Clang at compiling a function that has hundreds of thousands of integer initializers. Of course this is not a typical case, but I was still surprised at the compiler either taking 25 gigabytes of memory, or outright crashing.
  • Most of this does not matter in typical use cases. Unless you’re optimizing compile times for something used extremely often in a large codebase.

Pathtracer 17: WebAssembly

Introduction and index of this series is here.

Someone at work posted a “Web Development With Assembly” meme as a joke, and I pulled off a “well, actually” card pointing to WebAssembly. At that point I just had to make my toy path tracer work there.

So here it is: aras-p.info/files/toypathtracer

Porting to WebAssembly

The “porting” process was super easy, I was quite impressed how painless it was. Basically it was:

  1. Download & install the official Emscripten SDK, and follow the instructions there.
  2. Compile my source files, very similar to invoking gcc or clang on the command line, just Emscripten compiler is emcc. This was the full command line I used: emcc -O3 -std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s EXTRA_EXPORTED_RUNTIME_METHODS='["cwrap"]' -o toypathtracer.js main.cpp ../Source/Maths.cpp ../Source/Test.cpp
  3. Modify the existing code to make both threads & SIMD (two things that Emscripten/WebAssembly lacks at the moment) optional. Was just a couple dozen lines of code starting here in this commit.
  4. Write the “main” C++ entry point file that is specific for WebAssembly, and the HTML page to host it.

How to structure the main thing in C++ vs HTML? I basically followed the “Emscripting a C library to Wasm” doc by Google, and “Update a canvas from wasm” Rust example (my case is not Rust, but things were fairly similar). My C++ entry file is here (main.cpp), and the HTML page is here (toypathtracer.html). All pretty simple.
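For a flavor of what the wasm-facing C++ side looks like, here is a minimal sketch (hypothetical function name, not the actual main.cpp). Functions marked EMSCRIPTEN_KEEPALIVE stay in the wasm exports, and the JavaScript side binds them via the cwrap runtime method that the emcc command line above exported.

#include <emscripten.h>

extern "C" {

// JS side: var add = Module.cwrap("add_numbers", "number", ["number", "number"]);
EMSCRIPTEN_KEEPALIVE
int add_numbers(int a, int b)
{
    return a + b;
}

} // extern "C"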

And that’s basically it!

Ok how fast does it run?

At the moment WebAssembly does not have SIMD, and does not have “typical” (shared memory) multi-threading support.

The Web almost got multi-threading at the start of 2018, but then Spectre and Meltdown happened, and threading got promptly turned off. As soon as you have the ability to run fast atomic instructions on a thread, you can build a really high precision timer, and as soon as you have a high precision timer, you can start measuring things that reveal what got into the CPU caches. Having “just” that is enough to start building basic forms of these attacks.

By now the whole industry (CPU, OS, browser makers) has scrambled to fix these vulnerabilities, and threading might be coming back to the Web soon. However, at this time it’s not enabled by default in any browsers yet.

All this means that the performance numbers of WebAssembly will be substantially lower than other CPU implementations – after all, it will be running on just one CPU core, and without any of the SIMD speedups we have done earlier.

Anyway, the results I have are below (higher numbers are better). You can try yourself at aras-p.info/files/toypathtracer

Device                                    OS           Browser     Mray/s
Intel Core i9 8950HK 2.9GHz (MBP 2018)    macOS 10.13  Safari 11   5.8
                                                       Chrome 70   5.3
                                                       Firefox 63  5.1
Intel Xeon W-2145 3.7GHz                  Windows 10   Chrome 70   5.3
AMD ThreadRipper 1950X 3.4GHz             Windows 10   Firefox 64  4.7
                                                       Chrome 70   4.6
                                                       Edge 17     4.5
iPhone XS / XR (A12)                      iOS 12       Safari      4.4
iPhone 8+ (A11)                           iOS 12       Safari      4.0
iPhone SE (A9)                            iOS 12       Safari      2.5
Galaxy Note 9 (Snapdragon 845)            Android 8.1  Chrome      2.0
iPhone 6 (A8)                             iOS 12       Safari      1.7

For reference, if I turn off threading & SIMD in the regular C++ version, I get 7.0Mray/s on the Core i9 8950HK MacBookPro. So WebAssembly at 5.1-5.8 Mray/s is slightly slower, but not “a lot”. Is nice!

All code is on github at 17-wasm tag.


SPIR-V Compression: SMOL vs MARK

Two years ago I did a small utility to help with Vulkan (SPIR-V) shader compression: SMOL-V (see blog post or github repo).

It is used by Unity, and it looks like it is also used by some non-Unity projects (if you use it, let me know! always interesting to see where it ends up at).

Then I remembered the github issue where SPIR-V compression was discussed. It mentioned that SPIRV-Tools was getting some sort of “compression codec” (see comments) and got closed as “done”, so I decided to check it out.

SPIRV-Tools compression: MARK-V

The SPIRV-Tools repository, which is a collection of libraries and tools for processing SPIR-V shaders (validation, stripping, optimization, etc.), has a compressor/decompressor in there too, but it’s not advertised much. It’s not built by default, and requires passing a SPIRV_BUILD_COMPRESSION=ON option to the CMake build.

The sources related to it are under source/comp and tools/comp folders; and compression is not part of the main interfaces under include/spirv-tools headers; you’d have to manually include source/comp/markv.h. The build also produces a command line executable spirv-markv that can do encoding or decoding.

The code is well commented in terms of “here’s what this small function does”, but I didn’t find any high level description of “the algorithm” or properties of the compression. I see that it does something with shader instructions; there’s some Huffman related things in there, and large tables that are seemingly auto-generated somehow.

Let’s give it a go!

Getting MARK-V to compile

In the SMOL-V repository I have a little test application (see testmain.cpp) that takes a bunch of shaders, runs either SMOL-V or Spirv-Remapper on them, additionally compresses the result with zlib/lz4/zstd and so on. “Let’s add MARK-V in there too” sounded like a natural thing to do. And since I refuse to deal with CMake in my hobby projects :), I thought I’d just add the relevant MARK-V source files…

First “uh oh” sign: while the number of files under compression related folders (source/comp, tools/comp) is not high, that is 500 kilobytes of source code. Half a meg of source, Carl!

And then of course it needs a whole bunch of surrounding code from SPIRV-Tools to compile. So I copied everything that it needed to work. In total, 1.8MB of source code across 146 files.

After finding all the source files and setting up include paths for them, it compiled easily on both Windows (VS2017) and Mac (Xcode 9.4).

Pet peeve: I never understood why people don’t use file-relative include paths (like #include "../foo/bar/baz.h"), instead requiring the users of your library to setup additional include path compiler flags. As far as I can tell, relative include paths have no downsides, and require way less fiddling to both compile your library and use it.

Side issue: STL vector for input data

The main entry point for MARK-V decoding (this is what would happen on the device when loading shaders – so this is the performance critical part) is:

spv_result_t MarkvToSpirv(
    spv_const_context context, const std::vector<uint8_t>& markv,
    const MarkvCodecOptions& options, const MarkvModel& markv_model,
    MessageConsumer message_consumer, MarkvLogConsumer log_consumer,
    MarkvDebugConsumer debug_consumer, std::vector<uint32_t>* spirv);

Ok, I kind of get the need (or at least convenience) of using std::vector for output data; after all you are decompressing and writing out an expanding array. Not ideal, but at least there is some explanation.

But for input data – why?! One of const uint8_t* markv, size_t markv_size or a const uint8_t* markv_begin, const uint8_t* markv_end is just as convenient, and allows way more flexibility for the user at where the data is coming from. I might have loaded my data as memory-mapped files, which then literally is just a pointer to memory. Why would I have to copy that data into an additional STL vector just to use your library?

Side issue: found bugs in “Max” compression

MARK-V has three compression models - “Lite”, “Mid” and “Max”. On some test shaders I had, the “Max” one could not decompress successfully after compression, so I guess “some bugs are there somewhere”. Filed a bug report and excluded the “Max” model from further comparison :(

MARK-V vs SMOL-V

Size evaluation

Compression     No filter          SMOL-V             MARK-V Lite        MARK-V Mid
                Size KB   Ratio    Size KB   Ratio    Size KB   Ratio    Size KB   Ratio
Uncompressed    4870      100.0%   1630      33.5%    1369      28.1%    1085      22.3%
zlib default    1213      24.9%    602       12.4%    411       8.5%     336       6.9%
LZ4HC default   1343      27.6%    606       12.5%    410       8.4%     334       6.9%
Zstd default    899       18.5%    446       9.1%     394       8.1%     329       6.8%
Zstd level 20   590       12.1%    348       7.1%     293       6.0%     257       5.3%

Two learnings from this:

  • MARK-V without additional compression on top (“Uncompressed” row) is not really competitive (~25%); just compressing the shader data with Zstandard produces a smaller result, as does running it through SMOL-V coupled with any other compression.
  • This suggests that MARK-V acts more like a “filter” (similar to SMOL-V or spirv-remap), that makes the data smaller, but also makes it more compressible. Coupled with additional compression, MARK-V produces pretty good results, e.g. the “Mid” model ends up compressing data to ~7% of the original size. Nice! A rough sketch of such a filter-plus-compressor pipeline is shown right below.
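Here is roughly what that filter-plus-compressor pipeline looks like, sketched with SMOL-V as the filter and Zstandard on top (illustrative glue, not code from the test app; double-check smolv.h and zstd.h for the exact signatures; the same idea applies with MARK-V as the filter):

#include <cstddef>
#include <cstdint>
#include <vector>
#include "smolv.h"   // SMOL-V encoder/decoder
#include <zstd.h>    // Zstandard

std::vector<uint8_t> FilterAndCompress(const void* spirv, size_t spirvSize)
{
    // 1) Filter: turn SPIR-V into a smaller, more compressible byte stream.
    smolv::ByteArray filtered;
    if (!smolv::Encode(spirv, spirvSize, filtered))
        return {};

    // 2) General-purpose compression on top (level 20, matching the table above).
    std::vector<uint8_t> out(ZSTD_compressBound(filtered.size()));
    size_t written = ZSTD_compress(out.data(), out.size(),
                                   filtered.data(), filtered.size(), 20);
    if (ZSTD_isError(written))
        return {};
    out.resize(written);
    return out;
}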

Decompression performance

I checked how much time it takes to decode/decompress shaders (4870KB uncompressed size):

             Windows, AMD TR 1950X 3.4GHz    Mac, i9-8950HK 2.9GHz
MARK-V Lite  536.7ms      9.1MB/s            492.7ms     9.9MB/s
MARK-V Mid   759.1ms      6.4MB/s            691.1ms     7.0MB/s
SMOL-V         8.8ms    553.4MB/s             11.1ms   438.7MB/s

Now, I haven’t seriously looked at my SMOL-V decompression performance (e.g. Zstandard general decompression algorithm does ~1GB/s), but at ~500MB/s it’s perhaps “not terrible”.

I can’t quite say the same about MARK-V though; it gets under 10MB/s of decompression performance. That, I think, is “pretty bad”. I don’t know what it does there, but this low decompression speed is within a “maybe I wouldn’t want to use this” territory.

Decompressor size

There is only one case where the decompressor code size does not matter: it’s if it comes pre-installed on the end hardware (as part of OS, runtimes, drivers, etc.). In all other cases, you have to ship decompressor inside your own application, i.e. statically or dynamically link to that code – so that, well, you can decompress the data you have compressed.

I evaluated decompressor code size by making a dynamic/shared library on a Mac (.dylib) with a single exported function that does a “decode these bytes please” work. I used -O2 -fvisibility=hidden -std=c++11 -fno-exceptions -fno-rtti compiler flags, and -shared -fPIC -lstdc++ -dead_strip -fvisibility=hidden linker flags.

  • SMOL-V decompressor .dylib size: 8.2 kilobytes.
  • MARK-V decompressor .dylib size (only with “Mid” model): 1853.2 kilobytes.

That’s right. 1.8 megabytes! At first I thought I did something wrong!

I looked at the size report via Bloaty, and yeah, in MARK-V decompressor it’s like: 570KB GetIdDescriptorHuffmanCodecs, 137KB GetOpcodeAndNumOperandsMarkovHuffmanCodec, 64KB GetNonIdWordHuffmanCodecs, 44KB kOpcodeTableEntries and then piles and piles of template instantiations that are smaller, but there’s lots of them.

In SMOL-V by comparison, it’s 2KB smolv::Decode, 1.3KB kSpirvOpData and the rest is misc stuff and/or dylib overhead.

Library compilation time

While this is not that important aspect, it’s relevant to my current work role as a build engineer :)

Compiling MARK-V libraries with optimizations on (-O2) takes 102 seconds on my Mac (single threaded; obviously multi-threaded would be faster). It is close to two megabytes of source code after all; and there is one file (tools/comp/markv_model_shader.cpp) that takes 16 seconds to compile alone. I think that got CI agents into timeouts in SPIRV-Tools project, and that was the reason why MARK-V is not enabled by default in the builds :)

Compiling SMOL-V library takes 0.4 seconds in comparison.

Conclusion

While MARK-V coupled with additional lossless compression looks good when looking at the compression ratio in isolation, I don’t think I would recommend it due to the other issues.

The decompressor executable size alone (almost 2MB!) means that in order for MARK-V to start to “make sense” compared to say SMOL-V, your total shader data size needs to be over 100 megabytes; only then additional compression from MARK-V offsets the massive decompressor size.

Sure, there are games with shaders that large, but then MARK-V is also quite slow at decompression – it would take over 10 seconds to decompress 100MB worth of shader data :(

All my evaluation code is on mark-v branch in SMOL-V repository. At this point I’m not sure I’ll merge it to the main branch.

This is all.


Pathtracer 16: Burst SIMD Optimization

Introduction and index of this series is here.

When I originally played with the Unity Burst compiler in “Part 3: C#, Unity, Burst”, I just did the simplest possible “get C# working, get it working on Burst” thing and left it there. Later on in “Part 10: Update C#” I updated it to use Structure-of-Arrays data layout for scene objects, and that was about it. Let’s do something about this.

Meanwhile, I have switched from late-2013 MacBookPro to mid-2018 one, so the performance numbers on a “Mac” will be different from the ones in previous posts.

Update to latest Unity + Burst + Mathematics versions

First of all, let’s update the Unity version we use from some random 2018.1 beta to the latest stable 2018.2.13, and update Burst (to 0.2.4-preview.34) & Mathematics (to 0.0.12-preview.19) packages along the way. Mathematics renamed lengthSquared to lengthsq, and introduced a PI constant that clashed with our own one :) These trivial updates in this commit.

Just that got performance on PC from 81.4 to 84.3 Mray/s, and on Mac from 31.5 to 36.5 Mray/s. I guess either Burst or Mathematics (or both) got some optimizations during this half a year, nice!

Add some “manual SIMD” to sphere intersection

Very similar to how in Part 8: SSE HitSpheres I made the C++ HitSpheres function do intersection testing of one ray against 4 spheres at once, we’ll do the same in our Unity C# Burst code.

The thought process and work done is extremely similar to the C++ side done in Part 8 and Part 9; basically:

  • Since data for our spheres is laid out nicely in SoA style arrays, we can easily load data for 4 of them at once.
  • Do all ray intersection math on these 4 spheres,
  • If any are hit, pick the closest one and calculate final hit position & normal.

HitSpheres function code gets to be extremely similar between C++ version and C# version. In fact the C# one is cleaner since float4, int4 and bool4 types in Mathematics package are way more complete SIMD wrappers than my toy manual implementations in the C++ version.
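To give a feel for the shape of it, here is a rough scalar C++ sketch of the one-ray-vs-4-spheres idea (not the actual code from either version; in the real thing the inner 4-lane loop collapses into a handful of float4/SSE operations, and only the near intersection is considered here):

#include <cmath>

struct SpheresSoA
{
    const float* centerX;
    const float* centerY;
    const float* centerZ;
    const float* sqRadius; // radius*radius, precomputed
    int count;             // assumed to be a multiple of 4 for this sketch
};

// Returns the index of the closest hit sphere, or -1 if nothing was hit.
int HitSpheres4(const float rayOrig[3], const float rayDir[3],
                const SpheresSoA& s, float tMin, float tMax)
{
    int hitId = -1;
    float hitT = tMax;
    for (int i = 0; i < s.count; i += 4)
    {
        // Conceptually this inner loop is one set of 4-wide SIMD operations:
        // the SoA layout means each load below reads 4 contiguous floats.
        for (int lane = 0; lane < 4; ++lane)
        {
            int j = i + lane;
            float ocX = rayOrig[0] - s.centerX[j];
            float ocY = rayOrig[1] - s.centerY[j];
            float ocZ = rayOrig[2] - s.centerZ[j];
            float b = ocX * rayDir[0] + ocY * rayDir[1] + ocZ * rayDir[2];
            float c = ocX * ocX + ocY * ocY + ocZ * ocZ - s.sqRadius[j];
            float disc = b * b - c;
            if (disc > 0.0f)
            {
                float t = -b - std::sqrt(disc); // near intersection only
                if (t > tMin && t < hitT)
                {
                    hitT = t;
                    hitId = j;
                }
            }
        }
    }
    return hitId;
}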

The full change commit is here.

Performance: PC from 84.3 to 133 Mray/s, and Mac from 35.5 to 60.0 Mray/s. Not bad!

Updated numbers for new Mac hardware

Implementation                                       PC     Mac
GPU                                                  1854   246
C++, SSE+SoA HitSpheres                              187    74
C#, Unity Burst, 4-wide HitSpheres                   133    60
C++, SoA HitSpheres                                  100    36
C#, Unity Burst                                      82     36
C#, .NET Core                                        53.0   23.6
C#, mono -O=float32 --llvm w/ MONO_INLINELIMIT=100          22.0
C#, mono -O=float32 --llvm                                  18.9
C#, mono -O=float32                                         11.0
C#, mono                                                    6.1

  • PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t - SMT disabled) with GeForce GTX 1080 Ti.
  • Mac is mid-2018 MacBookPro (Core i9-8950HK 2.9GHz, 6c/12t) with AMD Radeon Pro 560X.
  • Unity version 2018.2.13 with Burst 0.2.4-preview.34 and Mathematics 0.0.12-preview.19.
  • Mono version 5.12.
  • .NET Core version 2.1.302.

All code is on github at 16-burst-simd tag.