time-trace: timeline / flame chart profiler for Clang

TL;DR: I wanted Clang to emit timeline (“flame chart”) style profiling info on where it spends time. So I made it do that.

What kind of compiler profiler do I want?

In the previous blog post I talked about how hard it is to get “useful” information on where C++ compilers spend their time. Various compilers have various ways of reporting something, but most of their reports seem to be geared towards the compiler developers themselves. For them it’s important to know, for example, whether the register allocator is the bottleneck; for me as a compiler user that matters much less – I want to know whether I’m bottlenecked by the preprocessor / includes (if so, which ones?), parsing (if so, which classes/functions/templates?), template instantiation (if so, which ones?), code generation, backend optimizations, or something else.

Having added Chrome Tracing profiling outputs to other parts of our build system in the past (here or there), I quite like it. It’s not perfect, and the UI is not very good, but it gets the job done and the file format is trivial to write.
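
To illustrate the “trivial to write” part: a trace file is just JSON with a list of events, each carrying a microsecond timestamp and duration; "ph": "X" marks a “complete” event spanning a time range. A minimal hand-written file looks roughly like this (event names here are made up for illustration, not Clang’s actual output):

{ "traceEvents": [
  { "pid": 1, "tid": 0, "ph": "X", "ts": 0, "dur": 8500000, "name": "Source", "args": { "detail": "some_header.h" } },
  { "pid": 1, "tid": 0, "ph": "X", "ts": 9000000, "dur": 4200, "name": "ParseClass" }
] }

The nesting in the flame chart view simply falls out of events’ time ranges containing one another.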

So I thought I’d try doing it – add some sort of flag to Clang that would emit the profiling information that I would find interesting. Turns out, getting to the “it seems to work” state was easier than I expected!

-ftime-trace: Chrome Tracing output for Clang

The very first thing I tried it on pointed out a really useful thing: in one place we have a header that is super slow to include. Just look:

This is a timeline / flame chart profiling view in Chrome’s chrome://tracing page. The horizontal axis is time, the vertical one is nested “callstacks”, so to speak. And it very clearly shows that there is one header file that takes over 8 seconds to include.

I suspect it’s some sort of performance issue with Clang itself; Gcc and Msvc are quite a bit faster at including this file. Will try to gather data and report a bug.

The file itself does not feel that complex; without this view I would have never guessed that it’s the cause of these slow compile times. Yay visibility into what is going on!

So that was already fairly useful. With this visualization, I can look into other files and see what they end up doing. And based on that, decide what to do to make compiles faster. When including/parsing files takes up time, I can tell exactly which ones are the culprits, and how long they take to include:

I already found some headers that take longer to process than you’d guess, and I’ve split them up into “rarely used, expensive” parts and “often used, cheap” parts.

Or when looking at template instantiations, I can see which ones end up taking the most time to process, and focus on improving them:

But can’t you figure that out with existing means?

Template instantiations I could probably investigate via Templight and Templight-Tools. I built them, tried them on some files, and wrestled with how to visualize the “callgrind” output (WinCacheGrind refused to open them; QCacheGrind worked but I found it a bit intimidating to use). The process of “compile with a special flag, get a file, run the file through a converter, open in some app” is a bit involved, plus the converters aren’t very fast.

Using a “regular profiler” on Clang itself shows which parts of the compiler take up time, but does not show which parts of my code it is spending that time on. For example, on that “one file was very slow to parse” example from the top of the post, running the Xcode Instruments profiler on Clang just shows this:

So I can see that “all the time is spent parsing”, but where exactly?

Ok where’s the code, and what now?

All the code is at this github PR for now, which is based on Clang 8.0.0-ish “trunk” as it was on January 10, 2019.

It adds a new command line flag to clang, -ftime-trace, that produces a Chrome Tracing .json file next to the regular compiled object files.
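
Usage is about as simple as it gets; something like this (file name illustrative):

clang -c -ftime-trace main.cpp -o main.o
# produces main.json next to main.o; open chrome://tracing in Chrome
# and hit “Load” to view it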

I based it on the older LLVM/Clang “git monorepo”; they have switched to another github repository while I was doing it :) However the build instructions are the same as for regular LLVM/Clang. Some notes:

  • Do use the Ninja backend for building. The default Makefiles backend spends ages figuring out what to build, every time you do a build.
  • The default build type is a non-optimized Debug build. Build an optimized Release build with the usual -DCMAKE_BUILD_TYPE=RelWithDebInfo CMake flag.
  • On Windows, make sure to set CC=cl and CXX=cl environment variables before invoking CMake. I had it pick up some random Gcc-based toolchain that was inside my Strawberry Perl install otherwise.
  • The LLVM/Clang build is set up in such a way that any change to the base “Support” library (where the timing/profiling utilities are, among other things) causes a rebuild of almost everything. Add a comment into the Timer.cpp file? There’s 2500 files to recompile! And I wanted to add my profiling stuff in there…
    • This seems to be because some utility that generates some headers (which the rest of LLVM/Clang depends on) itself uses the same “Support” library. The utility is llvm-tblgen; I don’t know what exactly it is, but I’ll assume it generates the famous Amiga demoscene groups, and you cannot change my mind.
    • Turns out the LLVM build has an option for “I have my own tblgen” (which I did have after building the whole thing once) that you can point the build to. Then it will stop rebuilding it, and thus stop regenerating most of the headers. Use the -DLLVM_TABLEGEN=path/to/tblgen option to CMake for that (see the combined invocation sketch after this list). This cuts down “change anything in the Support library” from “2500 files to recompile” to about “250 files to recompile”. Still not cheap, but much better.
  • Overall this was surprisingly easy! I mean I’ve never built LLVM/Clang before, and here, just 350 lines of code later, I have my own modified Clang with the profiling output I always wanted. Pretty good!
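
Putting those build notes together, the whole configure & build might look roughly like this (paths illustrative):

cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_TABLEGEN=/path/to/prebuilt/llvm-tblgen path/to/llvm
ninja clang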

I’ve started a thread on the Clang mailing list about my changes, and will see where that goes. There are some comments already; at some point I’d have to do a proper pull request via whatever the LLVM/Clang code review tooling is.

No idea whether any of this will get accepted into upstream Clang, but if not, then at least I will have my own Clang with the profiler I always wanted :)


Investigating compile times, and Clang -ftime-report

TL;DR: if you want to use the -ftime-report Clang flag to help you figure out where or why your code is slow to compile… it’s not very helpful for that. But! In the next post we’ll try to do something about it :)

Need for build time investigation tools

Depending on how large your codebase is, and how it is structured, C++ compilation times may or may not be an issue for you. They certainly are for us at Unity, and from what I understand of other large projects (like Chrome, LLVM or UE4), their build times aren’t exactly “fast” either. There are various underlying reasons for why C++ codebases might be slow to compile (the preprocessor, complexity of the language, template instantiations, optimizations done by the compiler, linking times etc.).

It would make sense to have some tools that a programmer could use to help them understand where or why their code is slow to compile.

You might think that, given how big of an issue compile times are in C++, there would be readily-available tools to do that. E.g. some flags in the popular compilers, or something?

The reality, at least right now, is “ehhh, not really”. There really are no good tools in C++ compilers to help you with investigating build times.

Build timing tools in existing compilers

I’ll use a super small C++ snippet that just includes some STL headers and does something with them. See the snippet and the output from three major compilers (Visual Studio, Gcc, Clang) in Compiler Explorer here. The actual code does not do anything useful; I just needed something to throw at the compiler.

#include <vector>
#include <string>
#include <unordered_map>
#include <regex>
int main()
{
    std::vector<int> v(10);
    v.push_back(7);
    std::unordered_map<std::string, double> m;
    m.insert(std::make_pair("foo", 1.0));
    std::regex re("^asd.*$");
    return 0;
}

I’m testing with Visual Studio 2017 v15.9, Gcc 8.2 and Clang 7.0.

Visual Studio

Visual Studio has /Bt (see VC++ Team Blog post) and /d2cgsummary (see my blog post) flags, and while they don’t contain much information and the UX of them is “a bit lacking”, they are at least something.

/Bt prints how much time the “frontend” (preprocessing, parsing – everything that is about the “C++ language”) and the “backend” (optimizations and final machine code generation – much less specific to the C++ language) take.

time(C:\msvc\v19_16\bin\Hostx64\x64\c1xx.dll)=0.775s
time(C:\msvc\v19_16\bin\Hostx64\x64\c2.dll)=1.303s

Of course it would be way friendlier if instead of printing two paths to DLLs, it would actually print “Frontend” and “Backend”. That’s what whatever\c1xx.dll and whatever\c2.dll actually are. Empathy towards the user is important!

Output of /d2cgsummary is even more cryptic, something like this (cut down for brevity):

Code Generation Summary
  Total Function Count: 821
  Elapsed Time: 1.210 sec
  Total Compilation Time: 4.209 sec
  Average time per function: 0.005 sec
  Anomalistic Compile Times: 15
    ?_Add_equiv@?$_Builder@PEBDDV?$regex_traits@D@std@@@std@@QEAAXPEBD0_J@Z: 0.299 sec, 869 instrs
    ?_Add_rep@?$_Builder@PEBDDV?$regex_traits@D@std@@@std@@QEAAXHH_N@Z: 0.239 sec, 831 instrs
    <...>
  Serialized Initializer Count: 1
  Serialized Initializer Time: 0.001 sec
RdrReadProc Caching Stats
  Functions Cached: 5
  Retrieved Count: 164
  Abandoned Retrieval Count: 0
  Abandoned Caching Count: 0
  Wasted Caching Attempts: 0
  Functions Retrieved at Least Once: 5
  Functions Cached and Never Retrieved: 0
  Most Hits:
    ?_Get_data@?$_Vector_alloc@U?$_Vec_base_types@IV?$allocator@I@std@@@std@@@std@@QEAAAEAV?$_Vector_val@U?$_Simple_types@I@std@@@2@XZ: 35
    ?_Get_second@?$_Compressed_pair@V?$allocator@I@std@@V?$_Vector_val@U?$_Simple_types@I@std@@@2@$00@std@@QEAAAEAV?$_Vector_val@U?$_Simple_types@I@std@@@2@XZ: 35
    <...>
  Least Hits:
    ?_Adopt@_Iterator_base0@std@@QEAAXPEBX@Z: 28
    <...>

Most of that probably makes sense for someone who actually works on the Visual Studio compiler, but again – I’m just a user; I have no idea what “RdrReadProc”, “Serialized Initializer” or “Retrieved Count” are, and most importantly, whether any of this output can help me answer the question “how can I change my code to make it compile faster?”. Also, why are the function names not demangled?

The “Anomalistic Compile Times” section can help to track down some patterns in code that end up being very slow for compiler to process (like forced inlining on large functions), but that’s about it.

So while MSC does have something, I could think of many possible improvements in what kind of timing/profiling information it could output.

Gcc

I’m not using Gcc much, and at least in my industry (game development), it’s not very popular. All the platforms that I happen to use at work are on either Visual Studio or some variant of Clang. Gcc is only used on Linux, and even there, there’s an option to use Clang.

Anyway, gcc has an -ftime-report argument that prints information about where time was spent during compilation. Brace yourself, this will be over 100 lines of text:

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)    1509 kB (  1%)
 phase parsing                      :   0.59 ( 16%)   0.60 ( 43%)   2.35 ( 38%)   91628 kB ( 33%)
 phase lang. deferred               :   0.32 (  9%)   0.16 ( 12%)   0.47 (  8%)   46797 kB ( 17%)
 phase opt and generate             :   2.60 ( 72%)   0.62 ( 45%)   3.22 ( 53%)  132237 kB ( 48%)
 phase last asm                     :   0.08 (  2%)   0.00 (  0%)   0.09 (  1%)    2921 kB (  1%)
 |name lookup                       :   0.18 (  5%)   0.10 (  7%)   0.27 (  4%)    4820 kB (  2%)
 |overload resolution               :   0.13 (  4%)   0.12 (  9%)   0.31 (  5%)   26374 kB ( 10%)
 dump files                         :   0.08 (  2%)   0.01 (  1%)   0.04 (  1%)       0 kB (  0%)
 callgraph construction             :   0.08 (  2%)   0.02 (  1%)   0.11 (  2%)   12160 kB (  4%)
 callgraph optimization             :   0.02 (  1%)   0.02 (  1%)   0.04 (  1%)      87 kB (  0%)
 ipa function summary               :   0.00 (  0%)   0.01 (  1%)   0.02 (  0%)     808 kB (  0%)
 ipa cp                             :   0.01 (  0%)   0.01 (  1%)   0.01 (  0%)     564 kB (  0%)
 ipa inlining heuristics            :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     807 kB (  0%)
 ipa function splitting             :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     233 kB (  0%)
 ipa pure const                     :   0.02 (  1%)   0.00 (  0%)   0.01 (  0%)      22 kB (  0%)
 ipa icf                            :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa SRA                            :   0.07 (  2%)   0.02 (  1%)   0.10 (  2%)    6231 kB (  2%)
 ipa free inline summary            :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 cfg construction                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)     179 kB (  0%)
 cfg cleanup                        :   0.04 (  1%)   0.00 (  0%)   0.03 (  0%)     200 kB (  0%)
 trivially dead code                :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       1 kB (  0%)
 df scan insns                      :   0.02 (  1%)   0.01 (  1%)   0.01 (  0%)       7 kB (  0%)
 df multiple defs                   :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 df reaching defs                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 df live regs                       :   0.05 (  1%)   0.01 (  1%)   0.02 (  0%)      26 kB (  0%)
 df live&initialized regs           :   0.00 (  0%)   0.01 (  1%)   0.01 (  0%)       0 kB (  0%)
 df use-def / def-use chains        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)     510 kB (  0%)
 register information               :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 alias analysis                     :   0.02 (  1%)   0.00 (  0%)   0.04 (  1%)    1258 kB (  0%)
 alias stmt walking                 :   0.05 (  1%)   0.02 (  1%)   0.06 (  1%)      82 kB (  0%)
 preprocessing                      :   0.06 (  2%)   0.15 ( 11%)   1.38 ( 23%)    2618 kB (  1%)
 parser (global)                    :   0.11 (  3%)   0.19 ( 14%)   0.31 (  5%)   27095 kB ( 10%)
 parser struct body                 :   0.07 (  2%)   0.03 (  2%)   0.11 (  2%)   14169 kB (  5%)
 parser enumerator list             :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      94 kB (  0%)
 parser function body               :   0.04 (  1%)   0.06 (  4%)   0.12 (  2%)    4044 kB (  1%)
 parser inl. func. body             :   0.03 (  1%)   0.03 (  2%)   0.02 (  0%)    1952 kB (  1%)
 parser inl. meth. body             :   0.09 (  3%)   0.03 (  2%)   0.11 (  2%)    9238 kB (  3%)
 template instantiation             :   0.47 ( 13%)   0.23 ( 17%)   0.71 ( 12%)   64835 kB ( 24%)
 constant expression evaluation     :   0.00 (  0%)   0.02 (  1%)   0.01 (  0%)     225 kB (  0%)
 early inlining heuristics          :   0.02 (  1%)   0.01 (  1%)   0.03 (  0%)    1507 kB (  1%)
 inline parameters                  :   0.02 (  1%)   0.00 (  0%)   0.05 (  1%)    4043 kB (  1%)
 integration                        :   0.25 (  7%)   0.10 (  7%)   0.23 (  4%)   22167 kB (  8%)
 tree gimplify                      :   0.02 (  1%)   0.02 (  1%)   0.04 (  1%)    6664 kB (  2%)
 tree eh                            :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)    1039 kB (  0%)
 tree CFG construction              :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)    2800 kB (  1%)
 tree CFG cleanup                   :   0.04 (  1%)   0.02 (  1%)   0.09 (  1%)      62 kB (  0%)
 tree VRP                           :   0.08 (  2%)   0.01 (  1%)   0.03 (  0%)    1672 kB (  1%)
 tree Early VRP                     :   0.02 (  1%)   0.02 (  1%)   0.02 (  0%)    2084 kB (  1%)
 tree PTA                           :   0.07 (  2%)   0.03 (  2%)   0.08 (  1%)     648 kB (  0%)
 tree SSA rewrite                   :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)    2156 kB (  1%)
 tree SSA other                     :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)     282 kB (  0%)
 tree SSA incremental               :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)     393 kB (  0%)
 tree operand scan                  :   0.07 (  2%)   0.05 (  4%)   0.12 (  2%)    7305 kB (  3%)
 dominator optimization             :   0.11 (  3%)   0.00 (  0%)   0.13 (  2%)    1428 kB (  1%)
 backwards jump threading           :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     248 kB (  0%)
 tree SRA                           :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     385 kB (  0%)
 tree CCP                           :   0.04 (  1%)   0.01 (  1%)   0.04 (  1%)     413 kB (  0%)
 tree PRE                           :   0.00 (  0%)   0.01 (  1%)   0.06 (  1%)    1057 kB (  0%)
 tree FRE                           :   0.06 (  2%)   0.05 (  4%)   0.05 (  1%)     670 kB (  0%)
 tree forward propagate             :   0.02 (  1%)   0.00 (  0%)   0.01 (  0%)     454 kB (  0%)
 tree phiprop                       :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       7 kB (  0%)
 tree conservative DCE              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      45 kB (  0%)
 tree aggressive DCE                :   0.03 (  1%)   0.01 (  1%)   0.04 (  1%)    2337 kB (  1%)
 tree DSE                           :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)      32 kB (  0%)
 tree loop invariant motion         :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       1 kB (  0%)
 complete unrolling                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     417 kB (  0%)
 tree iv optimization               :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)     705 kB (  0%)
 tree switch conversion             :   0.00 (  0%)   0.01 (  1%)   0.00 (  0%)       0 kB (  0%)
 tree strlen optimization           :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       2 kB (  0%)
 dominance computation              :   0.02 (  1%)   0.00 (  0%)   0.06 (  1%)       0 kB (  0%)
 out of ssa                         :   0.00 (  0%)   0.01 (  1%)   0.01 (  0%)      15 kB (  0%)
 expand vars                        :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     351 kB (  0%)
 expand                             :   0.04 (  1%)   0.01 (  1%)   0.02 (  0%)    8842 kB (  3%)
 post expand cleanups               :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     285 kB (  0%)
 forward prop                       :   0.04 (  1%)   0.00 (  0%)   0.02 (  0%)     375 kB (  0%)
 CSE                                :   0.02 (  1%)   0.01 (  1%)   0.11 (  2%)     141 kB (  0%)
 dead code elimination              :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 dead store elim1                   :   0.03 (  1%)   0.00 (  0%)   0.02 (  0%)     415 kB (  0%)
 dead store elim2                   :   0.02 (  1%)   0.00 (  0%)   0.04 (  1%)     630 kB (  0%)
 loop init                          :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)    2511 kB (  1%)
 loop fini                          :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 CPROP                              :   0.04 (  1%)   0.00 (  0%)   0.06 (  1%)     619 kB (  0%)
 PRE                                :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     103 kB (  0%)
 CSE 2                              :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)      83 kB (  0%)
 branch prediction                  :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     413 kB (  0%)
 combiner                           :   0.03 (  1%)   0.00 (  0%)   0.03 (  0%)    1112 kB (  0%)
 if-conversion                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      64 kB (  0%)
 integrated RA                      :   0.09 (  3%)   0.01 (  1%)   0.05 (  1%)    5912 kB (  2%)
 LRA non-specific                   :   0.06 (  2%)   0.00 (  0%)   0.03 (  0%)     305 kB (  0%)
 LRA create live ranges             :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)      41 kB (  0%)
 LRA hard reg assignment            :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload                             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload CSE regs                    :   0.02 (  1%)   0.00 (  0%)   0.03 (  0%)     571 kB (  0%)
 ree                                :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       8 kB (  0%)
 thread pro- & epilogue             :   0.03 (  1%)   0.00 (  0%)   0.03 (  0%)     460 kB (  0%)
 peephole 2                         :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)      41 kB (  0%)
 hard reg cprop                     :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)      17 kB (  0%)
 scheduling 2                       :   0.15 (  4%)   0.01 (  1%)   0.16 (  3%)     335 kB (  0%)
 reorder blocks                     :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     294 kB (  0%)
 shorten branches                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 final                              :   0.04 (  1%)   0.01 (  1%)   0.04 (  1%)    4707 kB (  2%)
 symout                             :   0.15 (  4%)   0.02 (  1%)   0.17 (  3%)   26953 kB ( 10%)
 variable tracking                  :   0.03 (  1%)   0.00 (  0%)   0.02 (  0%)    2416 kB (  1%)
 var-tracking dataflow              :   0.03 (  1%)   0.00 (  0%)   0.09 (  1%)      74 kB (  0%)
 var-tracking emit                  :   0.05 (  1%)   0.00 (  0%)   0.08 (  1%)    4059 kB (  1%)
 tree if-combine                    :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)      10 kB (  0%)
 straight-line strength reduction   :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      11 kB (  0%)
 store merging                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      61 kB (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      12 kB (  0%)
 early local passes                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 rest of compilation                :   0.13 (  4%)   0.02 (  1%)   0.12 (  2%)     731 kB (  0%)
 remove unused locals               :   0.01 (  0%)   0.00 (  0%)   0.05 (  1%)       3 kB (  0%)
 address taken                      :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 repair loop structures             :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)       3 kB (  0%)
 TOTAL                              :   3.59          1.38          6.13         275104 kB

Ok, I guess that totally makes sense, if you are a compiler developer working on gcc itself. But I’m not! Most of the above either does not mean anything to me, or I don’t really care about it.

  • I don’t care that much about “user” or “system” time; I care about how much time I had to wait for compilation to finish (“wall time”). Why is it in the middle of each row?
  • Is this output sorted by something? It’s not sorted from “most expensive” to “least expensive”. Not sorted alphabetically either. I guess it’s sorted by some internal structure that is not obvious to me.
  • I think the phase parsing is “generally frontend”, and phase opt and generate is “generally backend”; these are good to know. Then later on there’s a more detailed preprocessing to template instantiation which is frontend in more detail; that’s useful too. The rest? Besides some names that mean some optimization passes, many are cryptic to me, and I’m not sure why I should care about them. tree eh? Eh indeed.
  • Memory usage numbers are interesting to compiler developers, and maybe are useful to users that are under memory-constrained situations (like 32 bit systems etc.). Me? I don’t really care.

So, while the gcc -ftime-report prints something, and some of that is very useful (template instantiation time; preprocessing time; time spent performing inlining etc.), it seems to be geared towards compiler developers, and less towards users.

Again - if I’m investigating compile times of my code, the question I have is what can I change in my code to make it compile faster? Ideally some tool should be able to tell me both what takes time, and where in the code it does it.

Clang

Clang, being the most “fresh” of all these compilers, should be the most excellent, right? Right? :)

Just like gcc, it also has an -ftime-report option. That produces 900 lines of output! Instead of pasting it in full here, I’ll just link to it.

Several things to note:

  • Almost half of it is just the same information, duplicated? “Register Allocation”, “Instruction Selection and Scheduling”, “DWARF Emission” and “… Pass execution timing report …” sections are just emitted twice, but curiously enough, the other sections are printed just once. This sounds like a bug to me, and testing out various Clang versions in Compiler Explorer suggests that it started with Clang 7.0.0 and still exists on today’s 8.0.0-trunk. I’ve reported it.
    • Within the “Pass Execution Timing Report” section, there’s a bunch of items that are repeated multiple times too, e.g. “Combine redundant instructions” is listed 8 times; and “Function Alias Analysis Results” is listed 21 times. I’ve no idea if that’s a bug or not; I think I don’t care about them anyway.
  • It suffers from the same “oriented to clang compiler developers, not clang users” problem that the gcc time report has. I really don’t care about all the LLVM optimization passes you ended up doing; and similarly I don’t care about user vs system times or memory usage.

If it stopped printing all the info I don’t care about, did not duplicate half of it, and stopped printing LLVM passes after the top 5 most expensive ones, it would already be way more legible, e.g. like this:

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===
   ---Wall Time---  --- Name ---
   4.1943 ( 93.7%)  Code Generation Time
   0.2808 (  6.3%)  LLVM IR Generation Time
   4.4751 (100.0%)  Total
===-------------------------------------------------------------------------===
                              Register Allocation
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0180 seconds (0.0181 wall clock)
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.2894 seconds (0.2894 wall clock)
===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.3599 seconds (0.3450 wall clock)
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 3.5271 seconds (3.5085 wall clock)
   ---Wall Time---  --- Name ---
   0.4859 ( 13.8%)  X86 Assembly Printer
   0.4388 ( 12.5%)  X86 DAG->DAG Instruction Selection
   0.2633 (  7.5%)  Function Integration/Inlining
   0.1228 (  3.5%)  Global Value Numbering
   0.0695 (  2.0%)  Combine redundant instructions
   3.5085 (100.0%)  Total
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 5.2175 seconds (6.3410 wall clock)
   ---Wall Time---  --- Name ---
   6.3410 (100.0%)  Clang front-end timer
   6.3410 (100.0%)  Total

(also, please, drop the “…” around the “Pass execution timing report” name; why is it there?)

And then the various “Total” lines under the sections are quite a bit confusing. Let’s see:

  • “Clang front-end time report” seems to be time for “everything”, not just the frontend (Clang frontend, LLVM backend, and whatever else it did).
  • I think the “backend” (LLVM) total part is under “Misc Ungrouped Timers, Code Generation Time”. This is probably what is “Pass execution timing” + “Register Allocation” + “Instruction Selection and Scheduling” etc.
  • So probably the actual Clang “frontend” (preprocessor, parsing – i.e. doing the C++ bits of the compilation) I could get by subtracting various LLVM related timer bits from the total time. Ugh. Could you just tell me that time please?

Another thing to note: among all the optimization passes, “X86 Assembly Printer” seems to be the heaviest one? That doesn’t sound right. So I dug in a bit… turns out, once you pass the -ftime-report flag, the whole compilation time is heavily affected. It grows from 1.06s to 1.42s on that super simple STL snippet above, and from 16s to 26s on a much heavier source file I had. Normally for any sort of profiling tool I’d expect at most a couple percent of overhead, but Clang’s time report seems to make compilation take 1.5x longer! This sounds like a bug to me, so… reported!
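
An easy way to see the overhead on your own machine is to just time the same compile with and without the flag (file name hypothetical):

time clang -c -O2 bigfile.cpp
time clang -c -O2 bigfile.cpp -ftime-report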

From a quick look, it feels like this could primarily be caused by code in AsmPrinter.cpp that adds time sampling around each and every assembly instruction that was generated (AsmPrinter.cpp, line 1087); this seems like it’s inside a loop that processes all the instructions. Taking that many timing samples probably does not end up being very fast, and thus heavily slows down the “X86 Assembly Printer” pass, when -ftime-report flag is used.

And each time sample capture itself seems to end up querying not just elapsed time, but user time, system time and memory usage too, on every single sample.

But dear clang! I only care about a single “elapsed time”; you could replace all that complexity with a single *_clock::now call and be done with it.
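
Something like this minimal sketch is all I’d need (plain standard C++, not LLVM’s actual Timer code):

#include <chrono>
#include <cstdio>

int main()
{
    // One cheap wall-clock query at each end of the measured section;
    // no user/system time lookups, no memory usage queries per sample.
    auto t0 = std::chrono::steady_clock::now();
    // ... the work being timed ...
    auto t1 = std::chrono::steady_clock::now();
    printf("%.3fms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}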

Even if the presence of -ftime-report did not distort the compile times so much, Clang’s current implementation leaves much to be desired. It does not even tell me “how much time I spent parsing C++ code”, for example (without me manually doing the math), and does not tell me other important things like “how much time I spent instantiating templates”. And just like the gcc time report, it never tells me where in my code the expensive parts actually are.

But surely someone would have solved this already?

I thought so too! But that doesn’t seem to be the case.

For Clang, various people have proposed changes or submitted patches to improve time report in some ways, for example:

  • This patch by Eddie Elizondo adds template instantiation timers. Done in August 2017, and everyone in the discussion agreed it’s a good idea. Now it’s January 2019…
  • This patch by Brian Gesiak added a preprocessor timer, also in August 2017. After some back and forth discussion, it was eventually abandoned since someone else (Andrew Tischenko) said he’d implement it in a better way. That was in February 2018.

Other people have built tools that are similar to Clang, but are separate programs. For example, Templight by Sven Mikael Persson – it works just like Clang, but when invoked with additional arguments it dumps detailed information about all template instantiations into a separate file, which can then be converted into various other formats for visualization.

Templight does look promising, but a bit unwieldy to use, and only covers the template instantiation aspect. What if my compile time issues are somewhere else?

So I have spent a bit of time trying to make Clang do a “more useful for me” time report. It has already pointed out one seemingly simple header file we had that was causing Clang to take “ages” (5-8 seconds) just to include it, due to some recursive macro usage that Clang isn’t fast to plow through. Yay tools!

What I did and how it works will be in the next blog post! But maybe you can guess on what it might be from my past blog posts (here or here), and from this teaser image:

Stay tuned! Update: next blog post is out.


Mentoring: You Won't Believe What Happened Next!

So! For three months now, I’ve been mentoring three junior engineers at work.

But first, let me talk about something else entirely.

The Seniority Trap

In games, and especially in engine/middleware technology, it’s fairly common to see job requirements like “Senior engineer needed, 10+ years of AAA/console development experience required” etc.

Sometimes that makes sense. Some other times, it’s there just because of inertia or because that’s what the hiring policies tend to naturally gravitate towards, given the surrounding environment. A long sequence of individual “yeah that makes sense” steps can lead to an outcome that makes you go “wait what?”.

Here’s a simple example: it makes sense to allocate “headcount” for departments or teams when planning a company’s financial year. However, if teams have a fixed headcount available for new hires, they will tend to gravitate towards “well, if we can only hire one more person, we’d better hire someone really really good”. So that tends to lean into “hire senior people only” territory. Repeat that for a number of years, and you get a situation where the default action for any team anywhere in the company is to hire senior people. There was no “plan” to become like that; it just kinda evolved from a set of surrounding rules.

Senior people are convenient to hire – they already have lots of experience, you don’t have to explain the basics to them, they can bring valuable experience & knowledge from other places and products, and you yourself can often learn quite a lot from them. Super cool! As a fairly senior (in my office at least) dude myself, I say “senior people FTW!”.

They are also rare, expensive, sometimes stubborn, perhaps unwilling to learn new things, and sometimes their experience is “same old, same old”. That describes me as well.

Hiring just senior people can lead to teams that have optimized for a local maximum – each new hire fits well and is an incremental improvement to the team/product; what’s not to like? And so the team climbs a local hill, completely missing a mountain peak that is beyond the valley.

Isn’t all that just hypothetical philosophizing?

Maybe. I don’t know. I do know that for many years, most of our job ads had requirements that the Unity founders, and many of the early employees, would not satisfy :)

It’s hard to prove without forking reality sometime in the past and picking an alternate history – but you could imagine that, for example, if the Unity founders had had 10+ years of game development experience, they would have made different decisions. Some might have been better, or way better. But some might have been way worse, since they might have built an engine, tools and workflows that mimicked what they were familiar with!

My point is, I guess: you need both experienced people and “new” people. There can be many different reasons for getting “new” people.

Giving Juniors A Chance

The following is mostly about the environment we had in Unity’s Lithuania (Vilnius & Kaunas) offices. We have a ton of offices all over the place, and their setup, composition and structure can differ a lot!

About a year ago I noticed that here (Unity Lithuania offices) we have a lot of junior / entry-level people in the frontline QA team. These are the folks that read all the incoming bug reports and try to make sense of them (reproduce the issue, find versions where it started happening, summarize it in a usable form for developers, and assign it further down to devs if the issue turns out to be an actual bug). Very important job! But perhaps not the most interesting / motivating in the world; a large part of these QA people wanted to move into “product QA” or become programmers.

To some extent, we did have the “Seniority Trap” problem from above – most of our job requirements for engineers had a “yadda yadda, 5+ years programming experience”. And so on one hand, we have something like 50 junior QA people here, but on the other hand, there’s a sort of uncrossable chasm to get from there to any other position.

I thought that’s both sad, and a bit stupid on our part, and started to lobby along the lines of “we should come up with ways of having a better career path for these QA people”. Some of the more concrete steps we started taking:

  • Start planting seeds of the thought that there’s no rule of the universe that says “good people are all in San Francisco or Seattle” (I checked, there is no such rule indeed).
  • Make a training programme – various topics about coding etc. – that we’re doing for QA students here. I did a bunch of lectures there (check out my talks page around Fall 2018), and more people did other topics. This by itself does not give an immediate benefit, but hopefully is good learning material for the attendees. Most of the topics are not even Unity specific; just general “good to know” info, especially if one wants to become a developer in a game-industry related company.
  • Make this whole QA group be much more visible inside the company, by talking with other teams more, and better “advertise” it as a possible source of talent.
  • All in all, it ended up with about 25 QA students “graduating” to other teams or positions inside the company over this year (2018); in comparison, last year the number was like “five”. I consider that to be a pretty good change!

Some of these “graduations” were the result of me talking with our “Sustained Engineering” (SE) team. SE owns Long-Term-Support and Patch Releases, which is mostly “fixing stuff and preventing future breakages”, and not much fancy feature development. I started to talk with them along the lines of “hey, so hypothetically, if we’d find some junior developers that I would mentor…”. Much to my surprise, they were “sounds great! can we start tomorrow?”, and I was gulp now I have to actually do this.

And that’s how I started doing a selection process sometime this summer; we picked three folks to start with, they started working as programmers and I started mentoring them three months ago. How did that go?

Mentoring: Actually Better Than Expected

When I realized this whole endeavor was in a “shit, I have to actually do it” stage, I started to mildly panic! It’s been a few years since I was last “responsible” for others at work (in terms of being a tech lead or whatever), and I’ve never done mentoring/onboarding/teaching of someone at work who had no previous professional programming experience. And now there were going to be not one, but three people all at once!

Geez Aras, why couldn’t you just continue coding in your little corner? Things would have been so much simpler.

So far it seems that most of my fears have been unfounded; three months in, all things are going well! I’m starting to think about increasing the number of people from three to a couple more.

Actual things I end up doing in this whole “mentoring” adventure:

  • Being available for any & all questions related to workflows, codebase, advice, etc.
  • Reviewing all of their code,
  • Picking some tasks (so far mostly bug fixes or backports of existing fixes),
  • Weekly 1:1 chats, and sometimes a talk with the whole team. We sometimes skip this if there isn’t much to talk about,
  • Impromptu talks about some concept (related to workflow, code, tips, tooling etc.) that we think is needed or useful,
  • Some pair programming sessions. Note to self: do more of this.

All in all, I’m spending maybe 10-20% of my time on this. I expected it to be way more, but it turns out the current folks generally need a lot less hand-holding than I imagined.

My observations about junior programmers are all very obvious “well, duh” things, but it’s easy to forget about them when not dealing with them day to day. When you have years of experience, you tend to build up a lot of indirect knowledge and approaches to solving problems. Things like:

  • A good set of hunches about why something would behave the way it does. Is this because there’s an integer overflow? Does this fail because the thing it’s trying to do is not even supported on the platform it’s running on? And so on.
  • A good set of “but what about case…” questions to ask whenever you’re about to start something. If I do this here, what else can possibly be affected? What are the edge cases of functions I’m calling, or data I’m receiving? How will this behave on a different hardware? In a multi-threaded environment? In an environment where memory is scarce? etc.
  • Ability to efficiently debug. Coming up with good hypotheses about why something happens, quickly checking them, deciding what to check next if the hypothesis was correct (or what to do if it wasn’t!). Basically a divide & conquer approach to debugging. This is an extremely useful & important skill, which many junior folks tend to not have at all, or have very little of. Local schools don’t teach debugging at all :(
  • Using source control effectively. Starting from basics like “don’t put completely unrelated things into one commit”, to “writing good commit messages”, to investigating why something happened in the past (log / annotate / blame), and into various situations like merges, conflicts, rebases, ways of undoing mistakes etc.
  • Ensuring that code not only works today, but will keep on working tomorrow. Testing, asserts, compile time checks, runtime checks, logging - basically all of that is “new territory”, and again the schools don’t teach any of that. The reason for having tests or asserts often only comes with experience, and oftentimes by experiencing bugs getting introduced, functionality broken or other disasters. So yeah I get it that whenever I say “trust me on this, you will need it”, it’s hard to really internalize it :)
  • Likewise, ability to explain your code. This could be design/feature docs, code comments that explain “why”, commit messages etc. Like, “ok sure if you divide X by two and multiply by two again here it fixes the issue, but WHY does it fix the issue?”. In a codebase that serves tens of thousands of projects, and any code you write will live on for 5-10 years, it’s not good enough to go “eh I don’t know, but it fixes it, so I’ll just commit and move on”.
  • A good set of heuristics and rules of thumb for managing time. How long should you be “stuck” on a problem before reaching out for someone else’s opinion? What can you do for that half an hour that a new fresh build is going to take? How to efficiently context-switch when you have to. How to interleave high-focus-needed tasks with low-effort ones. etc.

The list above is not a set of complaints though; it’s just a subset of what we “seniors” have acquired over the years that allows us to be productive, and what many juniors (no matter how smart/motivated/… they are) often don’t have yet. Mostly because some of these things are only really acquired through experience, or they might be very different for each person.

I’m pretty sure the folks I’m working with right now will get there with time, and I hope I can be of help along the way. So far they are doing somewhat better than I expected, so d(・∀・○)

Looking forward to seeing what happens in a year or so!


"Modern" C++ Lamentations

This will be a long wall of text, and kinda random! My main points are:

  1. C++ compile times are important,
  2. Non-optimized build performance is important,
  3. Cognitive load is important. I don’t expand much on this here, but if a programming language or a library makes me feel stupid, then I’m less likely to use it or like it. C++ does that a lot :)

The “Standard Ranges” blog post by Eric Niebler – about the C++20 ranges feature – was making the rounds in the game development twitterverse lately, with many expressing something like a “dislike” (mildly said) about the state of modern C++.

I have expressed a similar thought too (link):

That example for Pythagorean Triples using C++20 ranges and other features sounds terrible to me. And yes I get that ranges can be useful, projections can be useful etc. Still, a terrible example! Why would anyone want to code like that?!

Which got slightly out of hand (now 5 days later, that tree of threads is still receiving a lot of replies!).

Now, apologies to Eric for picking on his post; my lamentation is mostly about the “general state of C++” lately. The “bunch of angry gamedevs” on twitter picked on the Boost.Geometry rationale in a very similar way a year or two ago, and a dozen other times on other aspects of the C++ ecosystem.

But you know, twitter not being exactly a nuanced place, etc. etc. So let me expand here!

Pythagorean Triples, C++20 Ranges Style

Here’s the full example from Eric’s post:

// A sample standard C++20 program that prints
// the first N Pythagorean triples.
#include <iostream>
#include <optional>
#include <ranges>   // New header!
 
using namespace std;
 
// maybe_view defines a view over zero or one
// objects.
template<Semiregular T>
struct maybe_view : view_interface<maybe_view<T>> {
  maybe_view() = default;
  maybe_view(T t) : data_(std::move(t)) {
  }
  T const *begin() const noexcept {
    return data_ ? &*data_ : nullptr;
  }
  T const *end() const noexcept {
    return data_ ? &*data_ + 1 : nullptr;
  }
private:
  optional<T> data_{};
};
 
// "for_each" creates a new view by applying a
// transformation to each element in an input
// range, and flattening the resulting range of
// ranges.
// (This uses one syntax for constrained lambdas
// in C++20.)
inline constexpr auto for_each =
  []<Range R,
     Iterator I = iterator_t<R>,
     IndirectUnaryInvocable<I> Fun>(R&& r, Fun fun)
        requires Range<indirect_result_t<Fun, I>> {
      return std::forward<R>(r)
        | view::transform(std::move(fun))
        | view::join;
  };
 
// "yield_if" takes a bool and a value and
// returns a view of zero or one elements.
inline constexpr auto yield_if =
  []<Semiregular T>(bool b, T x) {
    return b ? maybe_view{std::move(x)}
             : maybe_view<T>{};
  };
 
int main() {
  // Define an infinite range of all the
  // Pythagorean triples:
  using view::iota;
  auto triples =
    for_each(iota(1), [](int z) {
      return for_each(iota(1, z+1), [=](int x) {
        return for_each(iota(x, z+1), [=](int y) {
          return yield_if(x*x + y*y == z*z,
            make_tuple(x, y, z));
        });
      });
    });
 
    // Display the first 10 triples
    for(auto triple : triples | view::take(10)) {
      cout << '('
           << get<0>(triple) << ','
           << get<1>(triple) << ','
           << get<2>(triple) << ')' << '\n';
  }
}

Eric’s post builds on his earlier post from a few years back, which was a response to Bartosz Milewski’s post “Getting Lazy with C++”, where a simple C function to print the first N Pythagorean triples was presented:

void printNTriples(int n)
{
    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("%d, %d, %d\n", x, y, z);
                    if (++i == n)
                        return;
                }
}

Some issues with it were pointed out as well:

This is fine, as long as you don’t have to modify or reuse this code. But what if, for instance, instead of printing, you wanted to draw the triples as triangles? Or if you wanted to stop as soon as one of the numbers reached 100?

And then lazy evaluation with list comprehensions is presented as the way to solve these issues. It is indeed a way to solve them; it’s just that the C++ language does not quite have built-in functionality to do that, like Haskell or other languages have. C++20 will have more built-in things in that regard, similar to what Eric’s post shows. But I’ll get to that later.

Pythagorean Triples, Simple C++ Style

So, let’s get back to the simple (“fine as long as you don’t have to modify or reuse”, as Bartosz says) C/C++ style of solving the problem. Here’s a complete program that prints the first 100 triples:

// simplest.cpp
#include <time.h>
#include <stdio.h>

int main()
{
    clock_t t0 = clock();

    int i = 0;
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z) {
                    printf("(%i,%i,%i)\n", x, y, z);
                    if (++i == 100)
                        goto done;
                }
    done:

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

We can compile it: clang simplest.cpp -o outsimplest. Compilation takes 0.064 seconds, produces an 8480-byte executable, which runs in 2 milliseconds and prints the numbers (the machine is a 2018 MacBookPro; Core i9 2.9GHz; Xcode 10 clang):

(3,4,5)
(6,8,10)
(5,12,13)
(9,12,15)
(8,15,17)
(12,16,20)
(7,24,25)
(15,20,25)
(10,24,26)
...
(65,156,169)
(119,120,169)
(26,168,170)

But wait! That was a default, non-optimized (“Debug”) build; let’s build an optimized (“Release”) build: clang simplest.cpp -o outsimplest -O2. That takes 0.071s to compile, produces a same-size (8480 byte) executable, and runs in 0ms (under the timer precision of clock()).

As Bartosz correctly points out, the algorithm is not “reusable” here, since it’s intermixed with “what to do with the results”. Whether that is a problem or not is outside the scope of this post (personally I think “reusability” or “avoid duplication at all costs” are overrated). Let’s assume it is a problem, and indeed we want “something” that would just return first N triples, without doing anything with them.

What I would probably do is the simplest possible thing – make something that can be called, that returns the next triple. It might look something like this:

// simple-reusable.cpp
#include <time.h>
#include <stdio.h>

struct pytriples
{
    pytriples() : x(1), y(1), z(1) {}
    void next()
    {
        do
        {
            if (y <= z)
                ++y;
            else
            {
                if (x <= z)
                    ++x;
                else
                {
                    x = 1;
                    ++z;
                }
                y = x;
            }
        } while (x*x + y*y != z*z);
    }
    int x, y, z;
};

int main()
{
    clock_t t0 = clock();

    pytriples py;
    for (int c = 0; c < 100; ++c)
    {
        py.next();
        printf("(%i,%i,%i)\n", py.x, py.y, py.z);
    }

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

This compiles and runs in pretty much the same times; the Debug executable becomes 168 bytes larger; the Release executable stays the same size.

I did make a pytriples struct, where each call to next() advances to the next valid triple; the caller can do whatever it pleases with the result. Here, I just call it a hundred times, printing the triple each time.

However, while the implementation is functionally equivalent to what the triple-nested for loop was doing in the original example, it does become a lot less clear, at least to me. It’s very clear how it does it (some branches and simple operations on integers), but not immediately clear what it does on a high level.

If C++ had something like a “coroutine” concept, it would be possible to implement a triples generator that is as clear as the original nested for loops, yet has none of the “problems” (Jason Meisel points out exactly that in the “Ranges, Code Quality, and the Future of C++” post); something like this (tentative syntax, as coroutines aren’t part of any C++ standard yet):

generator<std::tuple<int,int,int>> pytriples()
{
    for (int z = 1; ; ++z)
        for (int x = 1; x <= z; ++x)
            for (int y = x; y <= z; ++y)
                if (x*x + y*y == z*z)
                    co_yield std::make_tuple(x, y, z);
}
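
Consuming that would read like a plain loop again; this sketch is hypothetical as well, since it assumes the generator type above exists and is iterable:

int main()
{
    int n = 0;
    // structured bindings unpack the yielded tuple
    for (auto [x, y, z] : pytriples())
    {
        printf("(%i,%i,%i)\n", x, y, z);
        if (++n == 100)
            break;
    }
    return 0;
}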

Back to C++ Ranges

Is the C++20 Ranges style clearer about what it does? From Eric’s post, this is the main part:

auto triples =
    for_each(iota(1), [](int z) {
        return for_each(iota(1, z+1), [=](int x) {
            return for_each(iota(x, z+1), [=](int y) {
                return yield_if(x*x + y*y == z*z,
                    make_tuple(x, y, z));
                });
            });
        });

You could argue either way. I think the “coroutines” approach above is way more clear. The C++ way of creating lambdas, and the C++ standard’s choice to make things look clever (“what’s an iota? it’s a Greek letter, look how smart I am!”) are both a bit cumbersome. The multiple returns feel unusual if the reader is used to an imperative programming style, but possibly one could get used to it.

Maybe you could squint your eyes and say that this is an acceptable and nice syntax.

However, I refuse to believe that “us mere mortals” without a PhD in C++ would be able to write the utilities that are needed for the code above to work:

template<Semiregular T>
struct maybe_view : view_interface<maybe_view<T>> {
  maybe_view() = default;
  maybe_view(T t) : data_(std::move(t)) {
  }
  T const *begin() const noexcept {
    return data_ ? &*data_ : nullptr;
  }
  T const *end() const noexcept {
    return data_ ? &*data_ + 1 : nullptr;
  }
private:
  optional<T> data_{};
};
inline constexpr auto for_each =
  []<Range R,
     Iterator I = iterator_t<R>,
     IndirectUnaryInvocable<I> Fun>(R&& r, Fun fun)
        requires Range<indirect_result_t<Fun, I>> {
      return std::forward<R>(r)
        | view::transform(std::move(fun))
        | view::join;
  };
inline constexpr auto yield_if =
  []<Semiregular T>(bool b, T x) {
    return b ? maybe_view{std::move(x)}
             : maybe_view<T>{};
  };

Maybe that is mother tongue to someone, but to me this feels like someone decided that “Perl is clearly too readable, but Brainfuck is too unreadable, let’s aim for somewhere in the middle”. And I’ve been programming mostly in C++ for the past 20 years. Maybe I’m too stupid to get this, okay.

And yes, sure, the maybe_view, for_each, yield_if are all “reusable components” that could be moved into a library; a point I’ll get to… right now.

Issues with “Everything is a library” C++

There are at least two: 1) compilation time, and 2) non-optimized runtime performance.

Allow me to show them using this same Pythagorean Triples example, though the issue holds for many other features of C++ that are implemented as “libraries”, rather than as part of the language itself.

Actual C++20 isn’t out yet, so for a quick test I took the current best “ranges” approximation, which is range-v3 (made by Eric Niebler himself), and compiled the canonical “Pythagorean Triples with C++ Ranges” example with it.

// ranges.cpp
#include <time.h>
#include <stdio.h>
#include <range/v3/all.hpp>

using namespace ranges;

int main()
{
    clock_t t0 = clock();

    auto triples = view::for_each(view::ints(1), [](int z) {
        return view::for_each(view::ints(1, z + 1), [=](int x) {
            return view::for_each(view::ints(x, z + 1), [=](int y) {
                return yield_if(x * x + y * y == z * z,
                    std::make_tuple(x, y, z));
            });
        });
    });

    RANGES_FOR(auto triple, triples | view::take(100))
    {
        printf("(%i,%i,%i)\n", std::get<0>(triple), std::get<1>(triple), std::get<2>(triple));
    }

    clock_t t1 = clock();
    printf("%ims\n", (int)(t1-t0)*1000/CLOCKS_PER_SEC);
    return 0;
}

I used a post-0.4.0 version (9232b449e44 on 2018 Dec 22), and compiled the example with clang ranges.cpp -I. -std=c++17 -lc++ -o outranges. It compiled in 2.92 seconds, the executable was 219 kilobytes, and it runs in 300 milliseconds.

And yes, that’s a non-optimized build. An optimized build (clang ranges.cpp -I. -std=c++17 -lc++ -o outranges -O2) compiles in 3.02 seconds, the executable is 13976 bytes, and it runs in 1ms. So the runtime performance is fine, the executable is slightly larger, and the compile time issue of course remains.

More on the points above:

Compilation Times Are a Big Issue in C++

Compiling this really simple example takes 2.85 seconds longer than the “simple C++” version.

Lest you think that “under 3 seconds” is a short time – it’s absolutely not. In 3 seconds, a modern CPU can do a gajillion operations. For example, the time it takes for clang to compile a full actual database engine (SQLite) in Debug build, with all 220 thousand lines of code, is 0.9 seconds on my machine. In which world is it okay to compile a trivial 5-line example three times slower than a full database engine?!

C++ compilation times have been a source of pain in every non-trivial-size codebase I’ve worked on. Don’t believe me? Try building one of the widely available big codebases (any of: Chromium, Clang/LLVM, UE4 etc will do). Among the things I really want out of C++, “solve compile times” is probably #1 on the list, and has been since forever. Yet it feels like the C++ community at large pretends that is not an issue, with each revision of the language putting even more stuff into header files, and even more stuff into templated code that has to live in header files.

To a large extent that is caused by the ancient “just literally paste the file contents” #include model that C++ inherited from C. But whereas C tends to only have struct declarations and function prototypes in headers, in C++ you often need to put whole templated classes/functions in there too.
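
A tiny illustration of that difference, with hypothetical headers:

// c_style.h - C style: just a prototype; the body lives in exactly one
// .c file and gets compiled exactly once.
int find_max(const int* data, int n);

// cpp_style.h - C++ style: a template's whole body must be visible to
// every translation unit that uses it, so it gets re-parsed (and possibly
// re-instantiated) in every single file that includes this header.
template <typename T>
T find_max(const T* data, int n)
{
    T best = data[0];
    for (int i = 1; i < n; ++i)
        if (data[i] > best)
            best = data[i];
    return best;
}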

range-v3 is 1.8 megabytes of source code, all in header files! So while the example of “use ranges to output 100 triples” is 30 lines long, after processing header includes the compiler ends up with 102 thousand lines of code to compile (easy to check by counting lines of clang -E output). The “simple C++” example, after all preprocessing, is 720 lines of code.

But precompiled headers and/or modules solve this!, I hear you say. Fair enough. Let’s put the ranges header into a precompiled header: make a pch.h that just does #include <range/v3/all.hpp>, include pch.h instead, create the PCH with clang -x c++-header pch.h -I. -std=c++17 -o pch.h.pch, and compile using it with clang ranges.cpp -I. -std=c++17 -lc++ -o outranges -include-pch pch.h.pch. Compilation time becomes 2.24s, so the PCH does indeed save about 0.7 seconds of compile time here. It does not save the other ~2.1s of overhead over the simple C++ approach though :(

Non-Optimized Build Performance is Important

Runtime performance of the “ranges” example is 150 times slower. Two or three times slower might be acceptable. Anything “over ten times slower” likely means “unusable”. Over a hundred times slower? Forget it.

In a real codebase that solves real problems, two orders of magnitude slower likely means that it just would not work on any real data set. I work in the video games industry; in practical terms this would mean that Debug builds of the engine or tooling would not work on any real game levels (performance would be nowhere near the needed interactivity level). Maybe there are industries where you run a program on a set of data and wait for the result, and if it takes 10 or 100 times longer in Debug then it is merely “annoying”. But where something has to be interactive, “annoying” turns into “unusable” – you literally cannot “play” through a game level if it renders at two frames per second.

Yes, an optimized build (-O2 in clang) runs at the same performance as the simple C++ version… so “zero cost abstractions” indeed, as long as you don’t care about compilation times and have an optimizing compiler.

But debugging optimized code is hard! Sure it’s possible, and actually a very useful skill to learn… Similar to how riding a unicycle is both possible, and teaches you an important skill of balance. Some people enjoy it and are really good at it even! But most people would not pick unicycle as their primary means of transportation, similar to how most people don’t debug optimized code if they can avoid it.

Arseny Kapoulkine has a great livestream “Optimizing OBJ loader” where he also ran into “Debug build is too slow” issue, and made it 10x faster by avoiding some of STL bits (commit). As a side effect, it also made compile times faster (source) and debugging easier, since Microsoft’s STL implementation in particular is extremely fond of deeply nested function calls.

That is not to say that “STL is necessarily bad”; it is possible to write an STL implementation that does not become 10x slower in a non-optimized build (EASTL and libc++ manage it), but for whatever reason Microsoft’s STL is extremely slow, due to over-reliance on “inlining will fix it up”.
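
To make the “slow in Debug” mechanics concrete: without inlining, every small abstraction stays a real function call. A contrived sketch (not Arseny’s actual code):

#include <stddef.h>
#include <vector>

// In a non-optimized build, every v[i] below is an actual call to
// std::vector::operator[], which in some STL implementations calls yet
// more helpers (debug checks, iterator unwrapping, ...). None of that is
// inlined away, so the abstraction tax is paid on every iteration.
int sum_stl(const std::vector<int>& v)
{
    int sum = 0;
    for (size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// Plain pointer arithmetic compiles to roughly the same machine code
// whether or not the optimizer runs; there is nothing to inline away.
int sum_raw(const int* data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

In an optimized build both functions compile to essentially the same loop; in a Debug build the STL version can easily be several times slower, purely from the call overhead.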

As a user of the language, I don’t care whose fault it is though! All I know is that “STL is too slow in Debug”, and I’d rather have that fixed, or I will look into alternatives (e.g. not using STL, re-implementing the bits I need myself, or moving away from C++ altogether).

How do other languages compare?

Here’s a brief look at a very similar implementation of “lazily evaluated Pythagorean Triples” in C#. Full C# source code:

using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    public static void Main()
    {
        var timer = Stopwatch.StartNew();
        var triples =
            from z in Enumerable.Range(1, int.MaxValue)
            from x in Enumerable.Range(1, z)
            from y in Enumerable.Range(x, z)
            where x*x+y*y==z*z
            select (x:x, y:y, z:z);
        foreach (var t in triples.Take(100))
        {
            Console.WriteLine($"({t.x},{t.y},{t.z})");
        }
        timer.Stop();
        Console.WriteLine($"{timer.ElapsedMilliseconds}ms");
    }
}

Personally I find the actual triples-building bit pretty readable. Compare C#:

var triples =
    from z in Enumerable.Range(1, int.MaxValue)
    from x in Enumerable.Range(1, z)
    from y in Enumerable.Range(x, z)
    where x*x+y*y==z*z
    select (x:x, y:y, z:z);

with C++:

auto triples = view::for_each(view::ints(1), [](int z) {
    return view::for_each(view::ints(1, z + 1), [=](int x) {
        return view::for_each(view::ints(x, z + 1), [=](int y) {
            return yield_if(x * x + y * y == z * z,
                std::make_tuple(x, y, z));
        });
    });
});

I know which one I find cleaner. Do you? Though to be fair, an alternative, “less databasey” form of C# LINQ is pretty busy as well:

var triples = Enumerable.Range(1, int.MaxValue)
    .SelectMany(z => Enumerable.Range(1, z), (z, x) => new {z, x})
    .SelectMany(t => Enumerable.Range(t.x, t.z), (t, y) => new {t, y})
    .Where(t => t.t.x * t.t.x + t.y * t.y == t.t.z * t.t.z)
    .Select(t => (x: t.t.x, y: t.y, z: t.t.z));

How much time does it take to compile this C# code? I’m on a Mac, so I’ll use the Mono compiler (which itself is written in C#), version 5.16. mcs Linq.cs takes 0.20 seconds. In comparison, compiling an equivalent “simple C#” version takes 0.17 seconds.

So this lazy-evaluation LINQ style creates an additional 0.03 seconds of work for the compiler to do. In comparison, the C++ case was creating an additional 3 seconds of work, or 100x more!

This is what you get when “features” are part of the language, as opposed to arriving as a hundred thousand lines of code for the compiler to plow through.

But can’t you just ignore the parts you don’t like?

Yes, to some extent.

For example here (Unity), we had a joke that “adding Boost into the codebase is a fireable offense”. I guess that is not true, since sometime last year Boost.Asio got added, and then I grumbled quite a bit about how it’s super slow to compile, and how merely including <asio.h> pulls in the whole of <windows.h>, with all the macro name hijacking horrors that it has.

In general we’re trying to avoid most of STL too. For containers we have our own, for much the same reasons as the EASTL motivation: more uniform behavior across platforms/compilers, better performance in non-optimized builds, better integration with our own memory allocators & allocation tracking. Some containers we replace purely for performance reasons (STL unordered_map can’t be fast by design, since the standard effectively requires separate chaining; our own hash table uses open addressing instead). Large parts of the standard library functionality we don’t actually even need.
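
To illustrate the hash table point with hypothetical types (not our actual containers):

#include <stddef.h>

// Separate chaining - effectively forced on unordered_map by the
// standard's pointer/iterator stability guarantees: each element is a
// separately allocated node, so a lookup chases at least one pointer
// into cache-cold memory.
struct ChainedNode
{
    size_t hash;
    int key;
    int value;
    ChainedNode* next; // another heap allocation, another cache miss
};

// Open addressing: elements live inline in one flat array, and a lookup
// probes consecutive slots starting at hash % capacity - contiguous and
// cache-friendly, at the cost of no stable element addresses.
struct OpenSlot
{
    size_t hash; // a reserved value (e.g. 0) can mean "empty slot"
    int key;
    int value;
};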

However.

It takes time to convince each and every new hire (particularly the more junior ones straight from universities) that no, just because it’s called “modern” C++ does not automatically mean it’s better (it might be! or it might not be). Or that no, “C code” does not automatically mean it’s hard to understand or follow, or is riddled with bugs (it might be! or it might not be).

Just a couple weeks ago at work I was rambling about how I was trying to understand some piece of (our own) code and couldn’t, since the code was “too complex” for me to get. Another (junior) programmer drops by, asks me why I look like I’m about to ‎(ノಥ益ಥ)ノ ┻━┻, and I say “so I’m trying to understand this code, but it’s way too complex for me”. His immediate response was “ah, old C style code?”, and I went “no, in fact the complete opposite!” (the code I was looking at was some sort of template metaprogramming thingamabob). He hasn’t worked with large codebases yet, neither C nor C++, but something has already convinced him that “hard to understand” must mean C code. I suspect the university: classes often flat out say that “C is bad” without ever explaining how exactly, and that does leave an impression on the young minds of future programmers.

So yes, I can certainly ignore the parts I don’t like about C++. But it’s tiring to educate many others I’m working with, since many are under the impression that “modern, must be better” or that “standard library, must be better than anything we could write ourselves”.

Why is C++ this way?

I don’t quite know. Admittedly they do have a very complex problem to solve: how to evolve a language while keeping it close to 100% backwards compatible with many decades of past decisions. Couple that with the fact that C++ tries to serve many masters, use cases, and experience levels, and you do get a complex problem.

But to some extent, it feels like a big part of the C++ committee and the ecosystem is focused on “complexity”, in terms of showing off or proving their worth.

There was this joke on the internet a while ago about the typical progression of a C/C++ programmer. I remember being in the middle stage myself, some 16 years ago: very impressed with Boost, in the sense of “wow, you can do that, that’s so cool!”, without questioning why you’d want to do that.

Similar to, I don’t know, Formula 1 cars or triple-neck guitars. Impressive? Sure. Marvel of engineering? Of course. Requires massive amount of skill to handle them properly? Yes! Not the right tool for 99% of situations you’d even find yourself in? Yup.

Christer Ericson has put it nicely here:

Goal of programmers is to ship, on time, on budget. It’s not “to produce code.” IMO most modern C++ proponents 1) overassign importance to source code over 2) compile times, debugability, cognitive load for new concepts and extra complexity, project needs, etc. 2 is what matters.

And yes, people who are concerned with state of C++ and the standard libraries can of course join the effort in trying to improve it. Some do. Some are (or think they are) too busy to spend time in committees. Some ignore parts of the standards and make their own parallel libraries (e.g. EASTL). Some think C++ is beyond salvation and try to make their own languages (Jai) or jump ship elsewhere (Rust, subsets of C#).

Taking feedback, and giving feedback

I know how hard it can be when “a bunch of angry people on the internet” are saying that your work is proverbial dung. I’m working on probably the most popular game engine, with millions of users, and some part of them love to point out, directly or indirectly, how “it sucks”. It’s hard; others and I have put so much thought and work into it, and someone else comes along and says we’re all idiots and our work is crap. Sad!

It probably feels the same for anyone working on C++, STL, or any other widely used technology really. They have worked for years on something, and then a bunch of Angry Internet People come and trash their lovely work.

It’s extremely easy to get defensive about this; that is the most natural reaction. Oftentimes it’s also not the most constructive one though.

Ignoring literal trolls who complain on the internet “just for the lulz”, the majority of complaints do have an actual issue or problem behind them. It might be worded poorly, or exaggerated, or whoever is complaining might not have thought about other possible viewpoints, but there is a valid issue behind the complaint anyway.

What I do whenever someone complains about a thing I’ve worked on is try to forget about “me” and “the work I did”, and get their point of view. What are they trying to solve, and what problems do they run into? The purpose of any software/library/language is to help its users solve the problems they have. It might be a perfect tool for solving their problem, an “ok I guess that will work” one, or a terribly bad one at that.

  • “I’ve worked very hard on this, but yeah, sounds like my work is not good at solving your problem” is a perfectly valid outcome!
  • “I’ve worked very hard on this, but did not know/consider your needs, let me see how they can be addressed” is a great outcome too!
  • “Sorry I don’t understand your problem” is fine as well!
  • “I’ve worked very hard on this, but turns out no one has the problem that my solution is supposed to solve” is a sad outcome, but it also can and does happen.

Some of the “all feedback will be ignored unless it comes in form of a paper presented at a C++ committee meeting” replies that I’ve seen in the past few days sound to me like a not very productive approach. Likewise, defending design of a library with an argument that “it was a popular library in Boost!” misses out on the part of the C++ world that does not think Boost libraries are a good design/solution.

The “games industry” at large, however, is also at fault to some extent. Game technologies are traditionally built with C or C++, just because up until very recently, other viable systems programming languages simply did not exist (now you at least have Rust as a possible contender). Given how much it relies on C/C++, the industry certainly did not do a good enough job of making itself visible and working on improving the language, the libraries and the ecosystem.

Yes it’s hard work, and yes, complaining on the internet is way easier. And whoever ends up working on the future of C++ is not solving “immediate problems” (like shipping a game or whatever); they are working on something much longer-term. But there are companies who could afford to do this; any of the big engine companies or large publishers with central technology groups could totally do it. Whether that would be worth doing, I don’t know, but indeed it’s a bit hypocritical to say “C++ is nuts and is not what we want”, while never telling the folks who make the language what it is that “we want”.

My impression is that many in games technology are “mostly fine” with the recent (C++11/14/17) additions to the language itself - lambdas are useful, constexpr if is great, etc. etc. They tend to largely ignore whatever is getting added into the standard libraries, either because the design/implementations of STL have the issues pointed out above (long compile times, bad Debug build performance), or because the additions are simply not that useful, or because the companies have already written their own containers/strings/algorithms/… years ago, so why change something that already works.

Here I say that is enough for a post, I need to get dinner.


C++11 way of initializing integers

So this tweet by Arseny Kapoulkine got me interested:

One of the most annoying changes to C++ that happened in the last decade for me personally is the introduction of initialization-using-curly-braces and people starting to use it all over the place. There are rare cases when that might make sense, but seriously, int count{ 0 };?

Now, I love the curly-brace initializers in C#, but I haven’t used them in C++ much. There might or might not be good reasons for using them; I don’t have an opinion on that aspect. What got me interested is this: how much work is it for the compiler? I.e. if you have int a = 7; vs int a { 7 };, which one is faster to compile?

Oh come on Aras, who would care about a trivial thing like that?! you say… well, I would care, for one. I was also recently optimizing compile times of our C++ testing framework, and it turns out, once you have things that are repeated thousands of times (e.g. equality checks in unit tests), even tiny savings do add up.

Simple test: half a million integer initializations on Clang

At first I did a simple file that would initialize and use half a million integers, and measured the time it takes for Clang to compile that:

// test.cpp
#ifdef MODERN
#define STUFF { int a {7}; c += a; }
#else
#define STUFF { int a = 7; c += a; }
#endif
#define STUFF1 STUFF STUFF STUFF STUFF
#define STUFF2 STUFF1 STUFF1 STUFF1 STUFF1
#define STUFF3 STUFF2 STUFF2 STUFF2 STUFF2
#define STUFF4 STUFF3 STUFF3 STUFF3 STUFF3
#define STUFF5 STUFF4 STUFF4 STUFF4 STUFF4
#define STUFF6 STUFF5 STUFF5 STUFF5 STUFF5
#define STUFF7 STUFF6 STUFF6 STUFF6 STUFF6
#define STUFF8 STUFF7 STUFF7 STUFF7 STUFF7

#include <stdio.h>
int main()
{
	int c = 0;
	STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8; STUFF8;
	printf("got %i\n", c);
	return 0;
}

Of course I did not want to type half a million integer initializers, so preprocessor to the rescue. Each STUFFn expands the previous one four times, so STUFF8 is 4^8 = 65536 repeats, and the eight STUFF8s in main add up to 524,288 of them. It just ends up being either { int a = 7; c += a; } or { int a {7}; c += a; } over and over again, and the final value of c is printed.

So, let’s measure! This is on an Intel Core i9 8950HK 2.9GHz (2018 MacBook Pro), with the clang from Xcode 10 (“Apple LLVM version 10.0.0”).

  • Traditional time clang -std=c++11 test.cpp: 8.2 seconds.
  • C++11 style curly brace time clang -std=c++11 -DMODERN test.cpp: 9.0 seconds. That’s 9.7% longer!

So from this, it would seem that on Clang, using the C++11 initializer-list syntax for simple types does create more work for the compiler. For a debug build, Clang spends about 10% more time on the whole compilation (which also includes things like machine code generation, so the overhead on the frontend itself is larger than that).

What about the other compilers? That’s where things got a bit more complicated… :)

Initializing lots of integers on other compilers

Then I ran the same test on MSVC (VS2017, version 15.9.4), and… it crashed with out of memory. OK, I was using the 32-bit cl.exe. Trying the 64-bit one, it proceeded to use about 25 gigabytes (!) of memory, and after about 3 minutes of compilation time I just gave up.

Reducing the number of initializers four times (to 128 thousand), and testing on an AMD ThreadRipper 1950X 3.4GHz:

#ifdef MODERN
#define STUFF { int a {7}; c += a; }
#else
#define STUFF { int a = 7; c += a; }
#endif
#define STUFF1 STUFF STUFF STUFF STUFF
#define STUFF2 STUFF1 STUFF1 STUFF1 STUFF1
#define STUFF3 STUFF2 STUFF2 STUFF2 STUFF2
#define STUFF4 STUFF3 STUFF3 STUFF3 STUFF3
#define STUFF5 STUFF4 STUFF4 STUFF4 STUFF4
#define STUFF6 STUFF5 STUFF5 STUFF5 STUFF5
#define STUFF7 STUFF6 STUFF6 STUFF6 STUFF6

#include <stdio.h>
int main()
{
	int c = 0;
	STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7; STUFF7;
	printf("got %i\n", c);
	return 0;
}

  • Traditional ptime cl test.cpp: 42-44 seconds.
  • C++11 style ptime cl /DMODERN test.cpp: 42-44 seconds. So, way, way longer than Clang, but there’s no compile time difference in how the integer is initialized.
    • Clang on Mac with the 4x reduced initializer count takes 2.0s (traditional) and 2.2s (curly brace).

What about gcc then? With half a million initializers, I also could not wait for it to finish. With 128k initializers, on the same ThreadRipper PC; “traditional” style using time gcc test.cpp -std=c++11, and “curly brace” style using time gcc test.cpp -std=c++11 -DMODERN:

  • gcc 5.4 in WSL: crashes after several minutes in both styles,
  • gcc 7.3 in VMWare: traditional 181 sec, curly brace 188 sec. So maybe 4% slower, but I did not do many measurements.

I’ve also tested Clang 3.8 on WSL and Clang 6.0 in VMWare on the same ThreadRipper PC; timings are consistent with Mac results, i.e. initializing a lot of integers using C++11 curly brace syntax makes compile time about 9% slower.

Note: I tested the manually-expanded giant C++ file with 128 thousand initializers directly in it (easy to produce by running just the preprocessor, e.g. clang -E test.cpp), to figure out whether it’s the macro expansion that is slowing the compilers down. On MSVC, it still takes the same ~42 seconds to compile, so the bottleneck is definitely not the macro expansion.

Takeaways

  • Clang seems to be about 9% slower at compiling a lot of int a { 7 }; initializers, compared to traditional int a = 7; ones.
  • Gcc might be about 3-4% slower at compiling the curly brace integer initializers.
  • MSVC compiles both forms of integer initialization in the same time.
  • Both MSVC and Gcc are way slower than Clang at compiling a function that has hundreds of thousands of integer initializers. Of course this is not a typical case, but I was still surprised at the compiler either taking 25 gigabytes of memory, or outright crashing.
  • Most of this does not matter in typical use cases. Unless you’re optimizing compile times for something used extremely often in a large codebase.