Two years in a build team!

Whoa! Has it been two years already since I stopped doing graphics programming?!

How does one end up on a build team?

I switched to the “build system” team by having two extremely short chats:

…and then a lot more chats with graphics people and some others of course; but the above “hey could I join? yes!” was kinda the whole of my “job interview” to the team.

And then I sent out a “goodbye graphics” email:

I actually had no idea for how long I’m leaving, so I wrote “a couple months”. Well look, it’s been two years already!

What does a build team do?

Most of my impressions I already wrote about (one week in, six months in).

A bunch of people have asked me either “ok so what is it that you actually do?”, or alternatively “surely the build work must be done by now?“. To which I don’t have an excellent answer. My own work is a combination of:

  • Switching from “old build” (JamPlus) to “new build” (Bee/Tundra) while not disrupting the work of everyone around us.
  • Speeding up builds by cleaning up dependencies, includes, removing code.
  • Upgrading various platforms to more recent compiler versions; usually this is not hard but e.g. VS2010 -> VS2015 was pretty painful due to needed rebuilds of all 3rd party static libraries.
  • Improving UX of various aspects of build process: cleaner logs, better diagnostic messages, more intuitive build command line arguments.
  • Support for people who have any build issues.
  • Fixing out various build issues that are accidentally introduced due to one reason or another. In a live codebase, it’s not like you can fix all issues and be done with it :)

Typical work weeks might look like this – this is from my own “week logs” doc, took a screenshot of two recent ones:

And “what I did during last year” summary looks like this. I’ve highlighted buildsystem-related work in there; the rest is “everything else”:

It doesn’t feel like “I got a lot done”, but doesn’t feel terrible either.

Anyway! Most of build stuff is fairly typical, but during last year our team has build some pretty neat tools that I wanted to write about. So here they are!

Neat Build Tools

In our main code repository the build entry point script is called jam (it does not use Jam anymore, but backwards compat with what people are used to type…). In some of our new code repositories (for DOTS/ECS and some packages) the build entry point would be called bee; both have the same tools, the examples below will be using jam entry point.

How exactly X is built?

jam how substring-of-something finds the most relevant build step (e.g. object file compile, executable link, file copy, whatever) and tells exactly how it is built. This is mostly useful to figure out exact compiler command line flags, and dependencies.

Why X got rebuilt?

If one is wondering why something gets rebuilt (recompiled, re-copied, relinked, etc.), jam why substring-of-something tells that:

Every build produces a “log of what got done and why” file, and the why tool looks at that to do the report.

Where time was spent during the build?

I added Chrome Tracing profiler output support to both JamPlus and Tundra (see previous blog post), and while that is all good and nice, sometimes what you want is just a “very quick summary”. Enter jam time-report. It shows top slowest action “types”, and top 10 items within each type:

Of course if you want more detail, you can drag the profiler output file into chrome://tracing or speedscope.app and browse it all:

What are the worst C/C++ header includes?

Since during the build Tundra scans source files for #include dependencies, we can use that to do some analysis / summary! jam include-report shows various summaries of what might be worth untangling:

It is very similar to Header Hero that I used before for include optimization. But I wanted something that would see actual includes instead of the approximation that Header Hero does, and something that works on a Mac, and something that would be built-in in all our builds. So there!

This is all! I’ll get back to reviewing some pull requests now.


'Infinite' sky shader for Unity

I saw a discussion today that lamented lack of “infinite projection” in Unity, so here’s a quick post about that.

Unity today (at least in 2017/2018 versions) uses “reversed Z” projection, but does not use “infinite projection”. This means that a Camera has an actual far clipping plane, and beyond that distance nothing is rendered.

This is fine for almost all cases, except when you want to render custom “skybox-like” geometry (a sky sphere or whatever); then you’d like to have it be “infinitely large” and thus guaranteed to always be beyond any actual scene geometry.

You could wait for Unity to implement “infinite projection”, and/or write a custom render pipeline to use your own infinite projection, or do a super small local change just inside your “sky object” shader to make it effectively be infinite. Let’s check out how to do the last bit!

“Infinite size” shader

To achieve an effectively “infinite size” (i.e. appears “behind any objects”) shader, all we have to do is to move the vertices to be “on the far plane” in the vertex shader. If o.vertex is a float4 with clip space position (e.g. computed by UnityObjectToClipPos), then just do this:

#if defined(UNITY_REVERSED_Z)
// when using reversed-Z, make the Z be just a tiny
// bit above 0.0
o.vertex.z = 1.0e-9f;
#else
// when not using reversed-Z, make Z/W be just a tiny
// bit below 1.0
o.vertex.z = o.vertex.w - 1.0e-6f;
#endif

And here it is. Far plane of only 20, and a “sky sphere” object that is 100000 in size. No clipping, renders behind scene geometry:

Here’s Unity 2018.2.15 project with the whole shader and a simple scene - InfiniteSkyUnityProject.zip (350kb)

What’s all this Reversed Z and Infinite Projection anyway?

Nathan Reed explains it much better than I ever could, read that post!

Summary is that reversing depth so that far plane is at zero, and near plane is at one, and using a floating point format depth buffer, results in much better depth precision.

In Unity we have implemented support for this reversed Z a while ago (in Unity 5.5). Today we don’t use infinite projection (yet?), so to achieve the infinite-like sky objects you’d have to do a shader trick like above.

This is all.


Another cool MSVC flag: /d1reportTime

A while ago I found about a fairly useful Visual C++ flag that helps to investigate where the compiler backend/optimizer spends time – /d2cgsummary, see blog post.

Turns out, there is a flag that reports where the compiler frontend spends time – /d1reportTime.

I’ve been writing about lack of compiler time profiling/investigation tools, starting from that same old post, up to the more recent ones that focused on clang - one and two. All this recent discussion about C++ compilation times (among other issues…) led me to finding this fresh project on github, UECompileTimesVisualizer by @phyronnaz. Ignore everything about UE4 there (unless you work with it of course :)), the new thing for me was that it says:

Add the following arguments to the C/C++ Command Line option in your project settings: /Bt+ /d2cgsummary /d1reportTime

/Bt+ and /d2cgsummary are known, but /d1reportTime I haven’t heard about before. What is it?! The only other place on the internet that talks about it – and it’s fairly impressive for anything to have just two mentions on the internet – is a small comment in Microsoft forums thread a couple months ago, and the only thing it says is: “Can you also share the output from /d1reportTime?”

So what is /d1reportTime?

Passing /d1reportTime to the MSVC compiler (cl.exe) will make it print:

  • Which header files are included (hierarchically), with time taken for each,
  • Which classes are being parsed, with time taken for each,
  • Which functions are being parsed, with time taken for each.

Additionally, at end of each list, “top N” are listed too. List of classes/functions seems to contain info about which templates (or nested classes? not sure yet) are being instantiated as well.

I have no idea which version of MSVC the flag has appeared in. VS2015 seems to not have it yet; and VS2017 15.7 (compiler version 19.14.26428) already has it; it might have appeared in earlier VS2017 builds but I haven’t checked.

Running it on anything non-trivial produces tons of output, e.g. I randomly ran it on a (fairly big) Shader.cpp file we have in our codebase, and it produces 30 thousand lines of output. Shortened a lot:

Shader.cpp
Include Headers:
  Count: 542
    c:\trunk\runtime\shaders\shader.h: 0.311655s
      c:\trunk\runtime\shaders\shadertags.h: 0.004842s
      c:\trunk\runtime\shaders\shaderpropertysheet.h: 0.127829s
        c:\trunk\runtime\math\color.h: 0.021522s
          c:\trunk\runtime\serialize\serializeutility.h: 0.012031s
            c:\trunk\runtime\serialize\serializationmetaflags.h: 0.005427s
            c:\trunk\runtime\utilities\enumtraits.h: 0.004738s
          c:\trunk\runtime\testing\testingforwarddecls.h: 0.000331s
        c:\trunk\runtime\math\vector2.h: 0.003681s
        <... a lot more skipped ...>

  Top 24 (top-level only):
    c:\trunk\runtime\graphics\scriptablerenderloop\scriptablerendercontext.h: 0.619189s
    c:\trunk\runtime\graphics\scriptablerenderloop\scriptablebatchrenderer.h: 0.407027s
    c:\trunk\runtime\shaders\shader.h: 0.311655s
    c:\trunk\runtime\shaders\shadernameregistry.h: 0.263254s
    <... a lot more skipped ...>

  Total: 2.125453s
Class Definitions:
  Count: 10381
    ShaderTagID: 0.001146s
    EnumTraits::internal::IsReflectableEnum<`template-type-parameter-1'>: 0.000057s
    EnumTraits::internal::DefaultTraitsHelper<`template-type-parameter-1',1>: 0.000085s
    EnumTraits::internal::DefaultTraitsHelper<`template-type-parameter-1',0>: 0.000064s
    <... a lot more skipped ...>

  Top 100 (top-level only):
    std::_Vector_const_iterator<class std::_Vector_val<struct std::_Simple_types<struct ShaderLab::SerializedProperty> > >: 0.031536s
    PersistentManager: 0.029135s
    ModuleMetadata: 0.029010s
    ShaderLab::SerializedSubProgram: 0.020047s
    GlobalCallbacks: 0.019955s
    <... a lot more skipped ...>

  Total: 1.412664s
Function Definitions:
  Count: 20544
    ShaderTagID::ShaderTagID: 0.000368s
    ShaderTagID::ShaderTagID: 0.000024s
    operator ==: 0.000089s
    operator !=: 0.000017s
    operator <: 0.000021s
    <... a lot more skipped ...>

  Top 100 (top-level only):
    std::operator ==: 0.031783s
    ShaderLab::SerializedTagMap::Transfer: 0.014827s
    math::matrixToQuat: 0.014715s
    math::mul: 0.012467s
    Texture::TextureIDMapInsert: 0.012006s
    <... a lot more skipped ...>

  Total: 2.658951s

At 30k lines it is not the most intuitive for manual use, but a lot of interesting information is there. Particularly the “Top N” lists sound like good things to start with, if looking at the output manually.

Even for something simple like (see in Compiler Explorer):

#include <vector>
int main()
{
    std::vector<int> v;
    v.push_back(13);
    return (int)v.size();
}

The /d1reportTime produces almost 2000 lines of output. Again, shortened here:

example.cpp
Include Headers:
    Count: 74
        c:\msvc\v19_16\include\vector: 0.215198s
            c:\msvc\v19_16\include\xmemory: 0.186372s
                c:\msvc\v19_16\include\xmemory0: 0.184278s
                    c:\msvc\v19_16\include\cstdint: 0.015689s
                        c:\msvc\v19_16\include\yvals.h: 0.015103s
                            c:\msvc\v19_16\include\yvals_core.h: 0.012148s
                                c:\msvc\v19_16\include\xkeycheck.h: 0.000246s
                                c:\msvc\v19_16\include\crtdefs.h: 0.011183s
                                <... skipped remaining 66 headers ...>

    Top 1 (top-level only):
        c:\msvc\v19_16\include\vector: 0.215198s
    Total: 0.215198s
Class Definitions:
    Count: 829
        _PMD: 0.000260s
        _TypeDescriptor: 0.000088s
        _s__CatchableType: 0.000113s
        _s__CatchableTypeArray: 0.000025s
        _s__ThrowInfo: 0.000044s
        _s__RTTIBaseClassDescriptor: 0.000056s
        _s__RTTIBaseClassArray: 0.000027s
        _s__RTTIClassHierarchyDescriptor: 0.000026s
        _s__RTTICompleteObjectLocator2: 0.000048s
        __s_GUID: 0.000049s
        <add_rvalue_reference><`template-type-parameter-1'>: 0.000012s
        <remove_reference><`template-type-parameter-1'>: 0.000006s
        <remove_reference><class $RBAAB &>: 0.000005s
        <... skipped remaining 816 classes ...>

    Top 100 (top-level only):
        std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >: 0.006266s
        std::vector<int,class std::allocator<int> >: 0.003368s
        std::pair<`template-type-parameter-1',`template-type-parameter-2'>: 0.002695s
        std::basic_string<`template-type-parameter-1',`template-type-parameter-2',`template-type-parameter-3'>: 0.001880s
        std::vector<`template-type-parameter-1',`template-type-parameter-2'>: 0.001117s
        <... skipped remaining 95 top classes ...>

    Total: 0.059558s
Function Definitions:
    Count: 773
        operator new: 0.000162s
        operator delete: 0.000031s
        operator new[]: 0.000025s
        operator delete[]: 0.000009s
        abs: 0.000048s
        abs: 0.000021s
        div: 0.000059s
        div: 0.000043s
        fpclassify: 0.000055s
        <... skipped remaining 764 functions ...>

    Top 100 (top-level only):
        std::logic_error::logic_error: 0.006603s
        main: 0.003693s
        std::_Uninitialized_move: 0.003665s
        std::vector<int,class std::allocator<int> >::push_back: 0.001054s
        std::numeric_limits<bool>::round_error: 0.000730s
        std::vector<int,class std::allocator<int> >::_Emplace_reallocate: 0.000612s
        _scwprintf_p: 0.000414s
        <... skipped remaining 93 top functions ...>
    Total: 0.056901s

Having reports like that is also useful to realize just how much work we’re asking the compiler to do. The C++ snippet above is “as simple as it gets” when using “standard” C++ features. Just make a vector of integers, and add a number to it. The compiler ends up including 74 headers, parsing over 800 classes and almost 800 functions. Just to make a simple “push an integer into a vector” task.

Yes, I know about precompiled headers. And I know about upcoming C++ modules. You can save on sending that email :)

Anyway, having that tooling is really good! Thanks Microsoft Visual C++ team! Also, u/gratilup mentions in this reddit thread that MSVC team is working on ETW/WPA based tooling to visualize compilation times. Great!


time-trace: timeline / flame chart profiler for Clang

TL;DR: I wanted Clang to emit timeline (“flame chart”) style profiling info on where it spends time. So I made it do that.

What kind of compiler profiler I want?

In the previous blog post I talked about how it is hard to get “useful” information on where C++ compilers spend their time. Various compilers have various ways of reporting something, but most of their reports seem to be geared towards the compiler developers themselves. For them it’s important to know, for example, whether register allocator is the bottleneck; for me as a compiler user that is much less of an importance – I want to know whether I’m bottlenecked by preprocessor / includes (if so, which ones?), parsing (if so, which classes/functions/templates?), template instantiation (if so, which ones?), code generation, backend optimizations, or something else.

Having added Chrome Tracing profiling outputs to other parts of our build system in the past (here or there), I quite like it. It’s not perfect, and the UI is not very good, but it gets the job done and the file format is trivial to write.

So I thought I’d try doing it – add some sort of flag to Clang that would emit the profiling information that I would find interesting. Turns out, getting to the “it seems to work” state was easier than I expected!

-ftime-trace: Chrome Tracing output for Clang

The very first thing I tried it on pointed out a really useful thing: we have a super slow to include header in one place. Just look:

This is a timeline / flame chart profiling view in Chrome chrome://tracing page. Horizontal axis is time, vertical is nested “callstacks”, so to speak. And it very clearly shows that there is one header file that takes over 8 seconds to include.

I suspect it’s some sort of performance issue with Clang itself; Gcc and Msvc are quite a bit faster at including this file. Will try to gather data and report a bug.

The file itself does not feel that complex; without this view I would have never guessed that it’s the cause of these slow compile times. Yay visibility into what is going on!

So that was already fairly useful. With this visualization, I can look into other files and see what they end up doing. And based on that, decide what to do to make compiles faster. When including/parsing files takes up time, I can tell which ones are the culprit exactly, and how long it takes to include them:

Already found some headers that took longer to process than you’d guess, and I’ve split them up into “rarely used, expensive” parts and “often used, cheap” parts.

Or when looking at template instantiations, I can see which ones end up taking most time to process, and focus on improving them:

But can’t you figure that out with existing means?

Template instantiation I probably could via Templight and Templight-Tools. I built them, tried it on some files, wrestled with how to visualize the “callgrind” output (WinCacheGrind refused to open them, QCacheGrind worked but I found it a bit intimidating to use). The process of “compile with a special flag, get a file, run the file through a converter, open in some app” is a bit involved, plus the converters aren’t very fast.

Using a “regular profiler” on Clang itself shows which parts of the compiler take up time, but does not show which parts of my code it is spending that time on. For example on that “one file was very slow to parse” example from top of the post, running Xcode Instruments profiler on Clang just shows this:

So I can see that “all the time is spent parsing”, but where exactly?

Ok where’s the code, and what now?

All the code is at this github PR for now, which is based on Clang 8.0.0-ish “trunk” as it was on 2019 January 10.

It adds a new command line flag to clang, -ftime-trace, that produces Chrome Tracing .json file next to the regular compiled object files.

I based it on the older LLVM/Clang “git monorepo”; they have switched to another github repository while I was doing it :) However the build instructions are the same as for regular LLVM/Clang. Some notes:

  • Do use Ninja backend for building. Default Makefiles backend spends ages figuring out what to build, every time you do the build.
  • Default build type is non-optimized Debug build. Build optimized Release build with the usual -DCMAKE_BUILD_TYPE=RelWithDebInfo CMake flag.
  • On Windows, make sure to set CC=cl and CXX=cl environment variables before invoking CMake. I had it pick up some random Gcc-based toolchain that was inside my Strawberry Perl install otherwise.
  • LLVM/Clang build is setup in such a way where any changes to the base “Support” library (where timing/profiling utilities are, among other things) cause rebuild of almost everything. Add a comment into Timer.cpp file? There’s 2500 files to recompile! And I wanted to add my profiling stuff in there…
    • This seems to be because some utility that generates some headers (which the rest of LLVM/Clang depends on) itself uses the same “Support” library. The utility is llvm-tblgen; I don’t know what exactly it is, but I’ll assume it generates the famous Amiga demoscene groups, and you cannot change my mind.
    • Turns out LLVM build has an option for “I have my own tblgen” (which I did have after building the whole stuff once) that you can point the build to. Then it will stop rebuilding it, and thus stop regenerating most of the headers. Use -DLLVM_TABLEGEN=path/to/tblgen option to CMake for that. This cuts down “change anything in Support library” from “2500 files to recompile” down to about “250 files to recompile”. Still not cheap, but much better.
  • Overall this was surprisingly easy! I mean I’ve never built LLVM/Clang before, and here, just 350 lines of code later, I have my own modified Clang with the profiling output I always wanted. Pretty good!

I’ve started a thread on Clang mailing list about my changes, and will see where that goes. There are some comments already; at some point I’d have to do a proper pull request via whatever is the LLVM/Clang code review tooling.

No idea whether any of this will get accepted into upstream Clang, but if not, then at least I will have my own Clang with the profiler I always wanted :)


Investigating compile times, and Clang -ftime-report

TL;DR: if you want to use -ftime-report Clang flag to help you figure out where or why your code is slow to compile… it’s not very helpful for that. But! In the next post we’ll try to do something about it :)

Need for build time investigation tools

Depending on how your large your codebase is, and how it is structured, C++ compilation times may or might not be an issue for you. It certainly is for us at Unity, and my understanding of build times of other large projects (like Chrome, LLVM or UE4), they aren’t exactly “fast” either. There are various underlying reasons for why C++ codebases might be slow to compile (preprocessor, complexity of the language, template instantiations, optimizations done by the compiler, linking times etc.).

It would make sense to have some tools that a programmer could use to help them understand where or why their code is slow to compile.

You might think that, given how big of an issue compile times are in C++, there would be readily-available tools to do that. E.g. some flags in the popular compilers, or something?

The reality, at least right now, is “ehhh, not really”. There really are no good tools in C++ compilers to help you with investigating build times.

Build timing tools in existing compilers

I’ll do example with a super small C++ snippet that just includes some STL headers and does something with them. See the snippet and the output from three major compilers (Visual Studio, Gcc, Clang) in Compiler Explorer here. The actual code does not do anything useful, I just needed something to throw at the compiler.

#include <vector>
#include <string>
#include <unordered_map>
#include <regex>
int main()
{
    std::vector<int> v(10);
    v.push_back(7);
    std::unordered_map<std::string, double> m;
    m.insert(std::make_pair("foo", 1.0));
    std::regex re("^asd.*$");
    return 0;
}

I’m testing with Visual Studio 2017 v15.9, Gcc 8.2 and Clang 7.0.

Visual Studio

Visual Studio has /Bt (see VC++ Team Blog post) and /d2cgsummary (see my blog post) flags, and while they don’t contain much information and the UX of them is “a bit lacking”, they are at least something.

/Bt prints how much time “frontend” (preprocessing, parsing - everything that is about the “C++ language”) and the “backend” (optimizations and final machine code generation – much less specific to C++ language) takes.

time(C:\msvc\v19_16\bin\Hostx64\x64\c1xx.dll)=0.775s
time(C:\msvc\v19_16\bin\Hostx64\x64\c2.dll)=1.303s

Of course it would be way friendlier if instead of printing two paths to DLLs, it would actually print “Frontend” and “Backend”. That’s what whatever\c1xx.dll and whatever\c2.dll actually are. Empathy towards the user is important!

Output of /d2cgsummary is even more cryptic, something like this (cut down for brevity):

Code Generation Summary
  Total Function Count: 821
  Elapsed Time: 1.210 sec
  Total Compilation Time: 4.209 sec
  Average time per function: 0.005 sec
  Anomalistic Compile Times: 15
    ?_Add_equiv@?$_Builder@PEBDDV?$regex_traits@D@std@@@std@@QEAAXPEBD0_J@Z: 0.299 sec, 869 instrs
    ?_Add_rep@?$_Builder@PEBDDV?$regex_traits@D@std@@@std@@QEAAXHH_N@Z: 0.239 sec, 831 instrs
    <...>
  Serialized Initializer Count: 1
  Serialized Initializer Time: 0.001 sec
RdrReadProc Caching Stats
  Functions Cached: 5
  Retrieved Count: 164
  Abandoned Retrieval Count: 0
  Abandoned Caching Count: 0
  Wasted Caching Attempts: 0
  Functions Retrieved at Least Once: 5
  Functions Cached and Never Retrieved: 0
  Most Hits:
    ?_Get_data@?$_Vector_alloc@U?$_Vec_base_types@IV?$allocator@I@std@@@std@@@std@@QEAAAEAV?$_Vector_val@U?$_Simple_types@I@std@@@2@XZ: 35
    ?_Get_second@?$_Compressed_pair@V?$allocator@I@std@@V?$_Vector_val@U?$_Simple_types@I@std@@@2@$00@std@@QEAAAEAV?$_Vector_val@U?$_Simple_types@I@std@@@2@XZ: 35
    <...>
  Least Hits:
    ?_Adopt@_Iterator_base0@std@@QEAAXPEBX@Z: 28
    <...>

Most of that probably makes sense for someone who actually works on the Visual Studio compiler, but again - I’m just a user, I have no idea what is “RdrReadProc”, what is “Serialized Initializer” or “Retrieved Count”, and most importantly, whether any of this output can help me to answer the question “how can I change my code to make it compile faster”? Also, why are the function names not demangled?

The “Anomalistic Compile Times” section can help to track down some patterns in code that end up being very slow for compiler to process (like forced inlining on large functions), but that’s about it.

So while MSC does have something, I could think of many possible improvements in what kind of timing/profiling information it could output.

Gcc

I’m not using Gcc much, and at least in my industry (game development), it’s not very popular. Out of all platforms that I happen to use at work, they are all on either Visual Studio or some variant of Clang. Gcc is only used on Linux, and even there, there’s an option to use Clang.

Anyway, gcc has -ftime-report argument that prints information about where time was spent during compilation. Brace yourself, this will be over 100 lines of text:

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)    1509 kB (  1%)
 phase parsing                      :   0.59 ( 16%)   0.60 ( 43%)   2.35 ( 38%)   91628 kB ( 33%)
 phase lang. deferred               :   0.32 (  9%)   0.16 ( 12%)   0.47 (  8%)   46797 kB ( 17%)
 phase opt and generate             :   2.60 ( 72%)   0.62 ( 45%)   3.22 ( 53%)  132237 kB ( 48%)
 phase last asm                     :   0.08 (  2%)   0.00 (  0%)   0.09 (  1%)    2921 kB (  1%)
 |name lookup                       :   0.18 (  5%)   0.10 (  7%)   0.27 (  4%)    4820 kB (  2%)
 |overload resolution               :   0.13 (  4%)   0.12 (  9%)   0.31 (  5%)   26374 kB ( 10%)
 dump files                         :   0.08 (  2%)   0.01 (  1%)   0.04 (  1%)       0 kB (  0%)
 callgraph construction             :   0.08 (  2%)   0.02 (  1%)   0.11 (  2%)   12160 kB (  4%)
 callgraph optimization             :   0.02 (  1%)   0.02 (  1%)   0.04 (  1%)      87 kB (  0%)
 ipa function summary               :   0.00 (  0%)   0.01 (  1%)   0.02 (  0%)     808 kB (  0%)
 ipa cp                             :   0.01 (  0%)   0.01 (  1%)   0.01 (  0%)     564 kB (  0%)
 ipa inlining heuristics            :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     807 kB (  0%)
 ipa function splitting             :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     233 kB (  0%)
 ipa pure const                     :   0.02 (  1%)   0.00 (  0%)   0.01 (  0%)      22 kB (  0%)
 ipa icf                            :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa SRA                            :   0.07 (  2%)   0.02 (  1%)   0.10 (  2%)    6231 kB (  2%)
 ipa free inline summary            :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 cfg construction                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)     179 kB (  0%)
 cfg cleanup                        :   0.04 (  1%)   0.00 (  0%)   0.03 (  0%)     200 kB (  0%)
 trivially dead code                :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       1 kB (  0%)
 df scan insns                      :   0.02 (  1%)   0.01 (  1%)   0.01 (  0%)       7 kB (  0%)
 df multiple defs                   :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 df reaching defs                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 df live regs                       :   0.05 (  1%)   0.01 (  1%)   0.02 (  0%)      26 kB (  0%)
 df live&initialized regs           :   0.00 (  0%)   0.01 (  1%)   0.01 (  0%)       0 kB (  0%)
 df use-def / def-use chains        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)     510 kB (  0%)
 register information               :   0.00 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 alias analysis                     :   0.02 (  1%)   0.00 (  0%)   0.04 (  1%)    1258 kB (  0%)
 alias stmt walking                 :   0.05 (  1%)   0.02 (  1%)   0.06 (  1%)      82 kB (  0%)
 preprocessing                      :   0.06 (  2%)   0.15 ( 11%)   1.38 ( 23%)    2618 kB (  1%)
 parser (global)                    :   0.11 (  3%)   0.19 ( 14%)   0.31 (  5%)   27095 kB ( 10%)
 parser struct body                 :   0.07 (  2%)   0.03 (  2%)   0.11 (  2%)   14169 kB (  5%)
 parser enumerator list             :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      94 kB (  0%)
 parser function body               :   0.04 (  1%)   0.06 (  4%)   0.12 (  2%)    4044 kB (  1%)
 parser inl. func. body             :   0.03 (  1%)   0.03 (  2%)   0.02 (  0%)    1952 kB (  1%)
 parser inl. meth. body             :   0.09 (  3%)   0.03 (  2%)   0.11 (  2%)    9238 kB (  3%)
 template instantiation             :   0.47 ( 13%)   0.23 ( 17%)   0.71 ( 12%)   64835 kB ( 24%)
 constant expression evaluation     :   0.00 (  0%)   0.02 (  1%)   0.01 (  0%)     225 kB (  0%)
 early inlining heuristics          :   0.02 (  1%)   0.01 (  1%)   0.03 (  0%)    1507 kB (  1%)
 inline parameters                  :   0.02 (  1%)   0.00 (  0%)   0.05 (  1%)    4043 kB (  1%)
 integration                        :   0.25 (  7%)   0.10 (  7%)   0.23 (  4%)   22167 kB (  8%)
 tree gimplify                      :   0.02 (  1%)   0.02 (  1%)   0.04 (  1%)    6664 kB (  2%)
 tree eh                            :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)    1039 kB (  0%)
 tree CFG construction              :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)    2800 kB (  1%)
 tree CFG cleanup                   :   0.04 (  1%)   0.02 (  1%)   0.09 (  1%)      62 kB (  0%)
 tree VRP                           :   0.08 (  2%)   0.01 (  1%)   0.03 (  0%)    1672 kB (  1%)
 tree Early VRP                     :   0.02 (  1%)   0.02 (  1%)   0.02 (  0%)    2084 kB (  1%)
 tree PTA                           :   0.07 (  2%)   0.03 (  2%)   0.08 (  1%)     648 kB (  0%)
 tree SSA rewrite                   :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)    2156 kB (  1%)
 tree SSA other                     :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)     282 kB (  0%)
 tree SSA incremental               :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)     393 kB (  0%)
 tree operand scan                  :   0.07 (  2%)   0.05 (  4%)   0.12 (  2%)    7305 kB (  3%)
 dominator optimization             :   0.11 (  3%)   0.00 (  0%)   0.13 (  2%)    1428 kB (  1%)
 backwards jump threading           :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     248 kB (  0%)
 tree SRA                           :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     385 kB (  0%)
 tree CCP                           :   0.04 (  1%)   0.01 (  1%)   0.04 (  1%)     413 kB (  0%)
 tree PRE                           :   0.00 (  0%)   0.01 (  1%)   0.06 (  1%)    1057 kB (  0%)
 tree FRE                           :   0.06 (  2%)   0.05 (  4%)   0.05 (  1%)     670 kB (  0%)
 tree forward propagate             :   0.02 (  1%)   0.00 (  0%)   0.01 (  0%)     454 kB (  0%)
 tree phiprop                       :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       7 kB (  0%)
 tree conservative DCE              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      45 kB (  0%)
 tree aggressive DCE                :   0.03 (  1%)   0.01 (  1%)   0.04 (  1%)    2337 kB (  1%)
 tree DSE                           :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)      32 kB (  0%)
 tree loop invariant motion         :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       1 kB (  0%)
 complete unrolling                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     417 kB (  0%)
 tree iv optimization               :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)     705 kB (  0%)
 tree switch conversion             :   0.00 (  0%)   0.01 (  1%)   0.00 (  0%)       0 kB (  0%)
 tree strlen optimization           :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       2 kB (  0%)
 dominance computation              :   0.02 (  1%)   0.00 (  0%)   0.06 (  1%)       0 kB (  0%)
 out of ssa                         :   0.00 (  0%)   0.01 (  1%)   0.01 (  0%)      15 kB (  0%)
 expand vars                        :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     351 kB (  0%)
 expand                             :   0.04 (  1%)   0.01 (  1%)   0.02 (  0%)    8842 kB (  3%)
 post expand cleanups               :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     285 kB (  0%)
 forward prop                       :   0.04 (  1%)   0.00 (  0%)   0.02 (  0%)     375 kB (  0%)
 CSE                                :   0.02 (  1%)   0.01 (  1%)   0.11 (  2%)     141 kB (  0%)
 dead code elimination              :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 dead store elim1                   :   0.03 (  1%)   0.00 (  0%)   0.02 (  0%)     415 kB (  0%)
 dead store elim2                   :   0.02 (  1%)   0.00 (  0%)   0.04 (  1%)     630 kB (  0%)
 loop init                          :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)    2511 kB (  1%)
 loop fini                          :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 CPROP                              :   0.04 (  1%)   0.00 (  0%)   0.06 (  1%)     619 kB (  0%)
 PRE                                :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     103 kB (  0%)
 CSE 2                              :   0.01 (  0%)   0.01 (  1%)   0.03 (  0%)      83 kB (  0%)
 branch prediction                  :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     413 kB (  0%)
 combiner                           :   0.03 (  1%)   0.00 (  0%)   0.03 (  0%)    1112 kB (  0%)
 if-conversion                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      64 kB (  0%)
 integrated RA                      :   0.09 (  3%)   0.01 (  1%)   0.05 (  1%)    5912 kB (  2%)
 LRA non-specific                   :   0.06 (  2%)   0.00 (  0%)   0.03 (  0%)     305 kB (  0%)
 LRA create live ranges             :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)      41 kB (  0%)
 LRA hard reg assignment            :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload                             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload CSE regs                    :   0.02 (  1%)   0.00 (  0%)   0.03 (  0%)     571 kB (  0%)
 ree                                :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       8 kB (  0%)
 thread pro- & epilogue             :   0.03 (  1%)   0.00 (  0%)   0.03 (  0%)     460 kB (  0%)
 peephole 2                         :   0.02 (  1%)   0.00 (  0%)   0.02 (  0%)      41 kB (  0%)
 hard reg cprop                     :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)      17 kB (  0%)
 scheduling 2                       :   0.15 (  4%)   0.01 (  1%)   0.16 (  3%)     335 kB (  0%)
 reorder blocks                     :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)     294 kB (  0%)
 shorten branches                   :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 final                              :   0.04 (  1%)   0.01 (  1%)   0.04 (  1%)    4707 kB (  2%)
 symout                             :   0.15 (  4%)   0.02 (  1%)   0.17 (  3%)   26953 kB ( 10%)
 variable tracking                  :   0.03 (  1%)   0.00 (  0%)   0.02 (  0%)    2416 kB (  1%)
 var-tracking dataflow              :   0.03 (  1%)   0.00 (  0%)   0.09 (  1%)      74 kB (  0%)
 var-tracking emit                  :   0.05 (  1%)   0.00 (  0%)   0.08 (  1%)    4059 kB (  1%)
 tree if-combine                    :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)      10 kB (  0%)
 straight-line strength reduction   :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      11 kB (  0%)
 store merging                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      61 kB (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      12 kB (  0%)
 early local passes                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 rest of compilation                :   0.13 (  4%)   0.02 (  1%)   0.12 (  2%)     731 kB (  0%)
 remove unused locals               :   0.01 (  0%)   0.00 (  0%)   0.05 (  1%)       3 kB (  0%)
 address taken                      :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 repair loop structures             :   0.02 (  1%)   0.00 (  0%)   0.00 (  0%)       3 kB (  0%)
 TOTAL                              :   3.59          1.38          6.13         275104 kB

Ok, I guess that totally makes sense, if you are a compiler developer working on gcc itself. But I’m not! Most of the above either does not mean anything to me, or I don’t really care about it.

  • I don’t care that much about “user” or “system” time; I care about how much time I had to wait for compilation to finish (“wall time”). Why it is in the middle of each row?
  • Is this output sorted by something? It’s not sorted from “most expensive” to “least expensive”. Not sorted alphabetically either. I guess it’s sorted by some internal structure that is not obvious to me.
  • I think the phase parsing is “generally frontend”, and phase opt and generate is “generally backend”; these are good to know. Then later on there’s a more detailed preprocessing to template instantiation which is frontend in more detail; that’s useful too. The rest? Besides some names that mean some optimization passes, many are cryptic to me, and I’m not sure why I should care about them. tree eh? Eh indeed.
  • Memory usage numbers are interesting to compiler developers, and maybe are useful to users that are under memory-constrained situations (like 32 bit systems etc.). Me? I don’t really care.

So, while the gcc -ftime-report prints something, and some of that is very useful (template instantiation time; preprocessing time; time spent performing inlining etc.), it seems to be geared towards compiler developers, and less towards users.

Again - if I’m investigating compile times of my code, the question I have is what can I change in my code to make it compile faster? Ideally some tool should be able to tell me both what takes time, and where in the code it does it.

Clang

Clang, being the most “fresh” of all these compilers, should be the most excellent, right? Right? :)

Just like gcc, it also has -ftime-report option. That produces 900 lines of output! Instead of pasting it in full here, I’ll just link to it.

Several things to note:

  • Almost half of it is just the same information, duplicated? “Register Allocation”, “Instruction Selection and Scheduling”, “DWARF Emission” and “… Pass execution timing report …” sections are just emitted twice, but curiously enough, the other sections are printed just once. This sounds like a bug to me, and testing out various Clang versions in Compiler Explorer suggests that it started with Clang 7.0.0 and still exists on today’s 8.0.0-trunk. I’ve reported it.
    • Within the “Pass Execution Timing Report” section, there’s a bunch of items that are repeated multiple times too, e.g. “Combine redundant instructions” is listed 8 times; and “Function Alias Analysis Results” is listed 21 times. I’ve no idea if that’s a bug or not; I think I don’t care about them anyway.
  • It suffers from the same “oriented to clang compiler developers, and not clang users” that gcc time report does. I really don’t care about all the LLVM optimization passes you ended up doing; and similarly I don’t care about user vs system times or memory usage.

If it stopped printing all the info I don’t care about, did not duplicate half of it, and stopped printing LLVM passes after top 5 most expensive ones, it would already be way more legible, e.g. like this:

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===
   ---Wall Time---  --- Name ---
   4.1943 ( 93.7%)  Code Generation Time
   0.2808 (  6.3%)  LLVM IR Generation Time
   4.4751 (100.0%)  Total
===-------------------------------------------------------------------------===
                              Register Allocation
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0180 seconds (0.0181 wall clock)
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.2894 seconds (0.2894 wall clock)
===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.3599 seconds (0.3450 wall clock)
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 3.5271 seconds (3.5085 wall clock)
   ---Wall Time---  --- Name ---
   0.4859 ( 13.8%)  X86 Assembly Printer
   0.4388 ( 12.5%)  X86 DAG->DAG Instruction Selection
   0.2633 (  7.5%)  Function Integration/Inlining
   0.1228 (  3.5%)  Global Value Numbering
   0.0695 (  2.0%)  Combine redundant instructions
   3.5085 (100.0%)  Total
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 5.2175 seconds (6.3410 wall clock)
   ---Wall Time---  --- Name ---
   6.3410 (100.0%)  Clang front-end timer
   6.3410 (100.0%)  Total

(also, please, drop the “…” around “Pass execution timing report” name; why it is there?)

And then the various “Total” under the sections are quite a bit confusing. Let’s see:

  • “Clang front-end time report” seems to be time for “everything”, not just the frontend (Clang frontend, LLVM backend, and whatever else it did).
  • I think the “backend” (LLVM) total part is under “Misc Ungrouped Timers, Code Generation Time”. This is probably what is “Pass execution timing” + “Register Allocation” + “Instruction Selection and Scheduling” etc.
  • So probably the actual Clang “frontend” (preprocessor, parsing – i.e. doing the C++ bits of the compilation) I could get by subtracting various LLVM related timer bits from the total time. Ugh. Could you just tell me that time please?

Another thing to note: in all the optimization passes, “X86 Assembly Printer” seems to be the heaviest one? That doesn’t sound right. So I dug in a bit… turns out, once you pass -ftime-report flag, then the whole compilation time is heavily affected. It grows from 1.06s to 1.42s in that super simple STL snippet above, and from 16s to 26s in a much heavier source file I had. Normally for any sort of profiling tool I’d expect at max a couple percent overhead, but Clang’s time report seems to make compilation take 1.5x longer! This sounds like a bug to me, so… reported!

From a quick look, it feels like this could primarily be caused by code in AsmPrinter.cpp that adds time sampling around each and every assembly instruction that was generated (AsmPrinter.cpp, line 1087); this seems like it’s inside a loop that processes all the instructions. Taking that many timing samples probably does not end up being very fast, and thus heavily slows down the “X86 Assembly Printer” pass, when -ftime-report flag is used.

And each time sample capture itself seems to end up doing:

But dear clang! I only care about single “elapsed time”; you could replace all that complexity with a single *_clock::now call and be done with it.

Even if presence of -ftime-report did not distort the compile times so much, Clang’s current implementation leaves much to be desired. It does not even tell me “how much time I spent parsing C++ code”, for example (without me manually doing the math), and does not tell other important aspects like “how much time I spent instantiating templates”. And just like gcc time report, it never tells where in my code the expensive parts actually are.

But surely someone would have solved this already?

I thought so too! But that doesn’t seem to be the case.

For Clang, various people have proposed changes or submitted patches to improve time report in some ways, for example:

  • This patch by Eddie Elizondo adds template instantiation timers. Done in August 2017, everyone in discussion agreed it’s a good idea. Now it’s January 2019…
  • This patch by Brian Gesiak added preprocessor timer in August 2017 too. After some back and forth discussion, eventually abandoned since someone else (Andrew Tischenko) said he’ll implement it in a better way. That was 2018 February.

Other people have built tools that are similar to clang, but are separate programs. For example, Templight by Sven Mikael Persson – it works just like Clang, but invoked with additional arguments dumps detailed information about all template instantiations into a separate file, that can then be converted into various other formats for visualization.

Templight does look promising, but a bit unwieldy to use, and only covers the template instantiation aspect. What if my compile time issues are somewhere else?

So I have spent a bit of time trying to make Clang do a “more useful for me” time report. It already pointed out one seemingly simple header file that we had, that was causing Clang to take “ages” (5-8 seconds) just including it, due to some recursive macro usage that Clang isn’t fast to plow through. Yay tools!

What I did and how it works will be in the next blog post! But maybe you can guess on what it might be from my past blog posts (here or here), and from this teaser image:

Stay tuned! Update: next blog post is out.