Replacing a live system is really hard

So! Turns out my “two years in a build team” post was almost the end of my time there :) I’ve started a new thing & new work area, and am wrapping up some of my leftover build team work as we speak. But! I wanted to write about one particular aspect of this build system work, which took almost three years in total.

Three. Years.

That’s a really long time, and that’s how long it took for us to switch from “the build system we had previously” to “the build system we have now”. Turns out, replacing a system in an ever-moving product is really, really, REALLY hard.

Sometimes I see that whenever people dream up some New Fancier Better system, they think that making this new system is the where most of the work will go into. In my experience (in build system, but also in a handful of other occasions), in addition to developing the new things, you also have to cover these:

  • While you will be busy doing new stuff, how will you keep up with changes to the old stuff? In a build system, people will keep on adding new files, libraries to be built, will tweak compiler flags, change preprocessor defines, update SDK/compiler versions and so on. Same with any other area – the old system is “live”, being used and being changed over time; maybe data for the old system is still being produced by someone out there. How will you transition all that?
  • How will the new system be rolled out, in a way that everything keeps on working, all the time? We have hundreds of developers on this codebase, a lot of automated processes running (builds, tests, packaging etc.), and if everyone loses even a day of work due to some mess-up, that’s a massive cost. Really risky changes have to be rolled out incrementally somehow, and only rolled out to “everyone” when all the large issues are found and fixed.

So here’s a story in how we did it! I don’t know if the chosen approach is good or bad; it seems to have worked out fine.

2016 May, “Jam with C#” project at Hackweek.

At Unity Hackweek 2016, one of the projects was “what if instead of Jam syntax to describe the build, we had C#?”. There’s a short video of it here:

We used a Jam-based build system called JamPlus to build everything inside Unity since about 2010. Overall the whole setup looks like this:

  • There is an actual “build engine”, the jam.exe itself. This piece knows how to parse *.jam files that describe the build, find which things need to be updated in order to build something, and execute these builds commands in parallel where possible.
  • “JamPlus” is a bunch of rules written on top of that, in a combination of Jam and Lua languages. These are helper utilities, like “finding a C++ compiler” and describing basic structure of a C/C++ program, etc. JamPlus can also generate IDE project files for Visual Studio, Xcode and so on.
  • And then we have a bunch of our own *.jam files, that describe pieces and platforms of Unity itself. From simple things like “this is a list of C++ files to compile, and here are the compiler flags”, to more involved things that are mostly about generating code at build time.

Jam language syntax is very compact, but also “interesting” – for example, it needs whitespace between all tokens; and yes that means space before every semicolon, or otherwise a lot of confusing messages you will get. Here’s a random example I found:

rule ConvertFile CMD : DEST : SOURCE_INPUT : GENERATED_INPUT
{
	INPUT = $(SOURCE_INPUT:G=hlslang) ;
	GENERATED_INPUT = $(GENERATED_INPUT:G=hlslang) ;
	DEST = $(DEST:G=hlslang) ;
	MakeLocate $(SOURCE_INPUT) : $(SEARCH_SOURCE) ;
	MakeLocate $(GENERATED_INPUT:D=) : $(LOCATE_TARGET)/$(GENERATED_INPUT:D) ;
	MakeLocate $(DEST:D=) : $(LOCATE_TARGET)/$(DEST:D) ;
	Clean clean : $(DEST) ;
	UseFileCache $(DEST) ;	
	Depends $(DEST) : $(SOURCE_INPUT) $(GENERATED_INPUT) ;		
	$(CMD) $(DEST) : $(SOURCE_INPUT) $(GENERATED_INPUT) ;	
	ScanContents $(DEST) ;
	return $(DEST) ;
}

So at this hackweek, what they did was embed C# (via Mono) directly into jam.exe, and make it be able to run C# code to describe everything there is to build, instead of parsing a Jam language file. They also wrote a converted from Jam language into C# language. If that sounds a bit crazy, that’s because it is, but eh, who here has not embedded C# into a piece of software written in 1993?

And so all of *.jam files (our own build code, but also most of JamPlus rules) get turned into C# files, but functionally nothing else changes. The auto-generated C# of course does not look much better; in fact at this point it’s more verbose than original Jam code:

static JamList ConvertFile(JamList CMD, JamList DEST, JamList SOURCE_INPUT, JamList GENERATED_INPUT)
{
    DEST = DEST.Clone();
    GENERATED_INPUT = GENERATED_INPUT.Clone();
    Vars.INPUT.Assign(SOURCE_INPUT.SetGrist("hlslang"));
    GENERATED_INPUT.Assign(GENERATED_INPUT.SetGrist("hlslang"));
    DEST.Assign(DEST.SetGrist("hlslang"));
    Jambase.MakeLocate(SOURCE_INPUT, Vars.SEARCH_SOURCE);
    Jambase.MakeLocate(GENERATED_INPUT.SetDirectory(""), Combine(Vars.LOCATE_TARGET, "/", GENERATED_INPUT.GetDirectory()));
    Jambase.MakeLocate(DEST.SetDirectory(""), Combine(Vars.LOCATE_TARGET, "/", DEST.GetDirectory()));
    Jambase.Action_Clean("clean", DEST);
    InvokeJamRule("UseFileCache", DEST);
    Depends(DEST, JamList(SOURCE_INPUT, GENERATED_INPUT));
    InvokeJamRule(CMD, DEST, JamList(SOURCE_INPUT, GENERATED_INPUT));
    InvokeJamRule("ScanContents", DEST);
    return DEST.Clone();
}

However with some cleanups and good IDEs (♥Rider) you can get to more legible C# fairly quickly eventually:

static void ConvertFile(string cmd, NPath dest, NPath sourceInput, NPath generatedInput)
{
    InvokeJamAction(cmd, dest, JamList(sourceInput, generatedInput));

    // Tell Jam that the generated bison/flex file "includes" the original tempate grammar files,
    // meaning it will include whatever regular C headers these include too, to detect needed rebuilds.
    Includes(dest, sourceInput);

    Depends(dest, JamList(sourceInput, generatedInput));
    Needs(dest, BuildZips.Instance.FlexAndBison.ArtifactVersionFile);
}

2016, actual work starts

Hackweeks are a lot of fun, and one can get very impressive results by doing the most interesting parts of the project. However, for actual production, “we’ll embed Mono into Jam, and write a language converter that kinda works” is not nearly enough. It has to actually work, etc. etc.

Anyway, a couple months after hackweek experiment, our previous effort to move from Jam/JamPlus to Gradle was canceled, and this new “Jam with C#” plan was greenlit.

It took until February of next year when this “Jam build engine, but build code is written in C#” was landed to everyone developing Unity. How did we test it?

  • Had a separate branch that tracks mainline, where on the build farm it was doing two builds at once:
    1. First, regular Jam build with the *.jam build code, and dumped whole Jam build graph structure,
    2. Second one, with all the *.jam code automatically converted to C#, and dumped whole Jam build graph structure,
    3. Checking that the two builds graphs were identical for each and every build target/platform that we have.
  • Had some developers at Unity opt-in to the new “Jam#” build code for a few months, to catch any possible issues. Especially the ones that are not tested/covered by the build farm, e.g. “are Visual Studio project files still generated just like before?“.
  • Before the final roll-out of “all .jam files are gone, .jam.cs files are in”, we also had a tool that would help anyone who had a long-lived branch that they want to land to mainline. They might have changed build code in Jam language, but after the C# roll-out their changes would have nowhere to merge! So there was a “give an old .jam file, we’ll get you the converted C# file back” tooling for that case.

And so in 2017 February we rolled out removal of all the old *.jam files, and the (horrible looking) auto-converted C# build code landed:

2017, starting to take advantage of this C# thing

Auto-converted-Jam-to-C# is arguably not much better. More verbose, actually kinda harder to read, but there are some upsides. Statically typed programming language! Great IDEs and debuggers! You have more data types besides “list of strings”! A lot of people inside Unity know C#, whereas “I know Jam” is not exactly common! And so on.

And so we started writing new C#-based APIs to express “how to build a program” rules, which we call “Bee(you’d have to ask @lucasmeijer about the name).

We were also rewriting IDE project files generation from the Lua-based one in JamPlus to, well, C#. My blog posts from 2017 relating to Visual Studio project files (this one or that one) might have been because I was doing it at the time :)

Of course at this point all of our own build code still used the old JamPlus-but-now-C# APIs to express how things need to be built. And we began taking all these pieces and converting them to use the new Bee build APIs:

This took much longer than initially expected, primarily because OMG you would not believe what a platform build code might be doing. Why is this arcane compiler flag used here? No one remembers! But who knows what might break if you change it. Why these files are being copied over there, ran through this strange tool, signed in triplicate, sent in, sent back, queried, lost, found, subjected to public inquiry, lost again, and finally buried in soft peat for three months and recycled as firelighters? Who knows! So there was a lot of that going on, inside each and every non-trivial build platform and build target that we had.

We also did a lot of work in other build areas, be it improving UX (error messages, colors, …), developing a system for downloading binary artifacts as part of the build, upgrading and packaging up compiler toolchains, experimenting with Ninja build backend instead of Jam (more on that later), optimizing codebase build times in general, improving project files and so on.

2018, Jam switched to Tundra backend

After some experiments with Ninja backend, we settled upon Tundra (our own fork) and replaced the Jam build backend with it.

The change was fairly hard to verify that “it works exactly the same as before”, because Tundra does not work exactly the same as Jam. One might think that all build systems are “have some rules, and a build dependency graph, and they execute the build actions”, but it’s a bit more complexicated than that. There’s a nice paper from Microsoft Research, “Build Systems à la Carte”, that categorizes build systems by type of dependencies they support (static vs dynamic), scheduling approach, early cutoff support, etc. Specifically, Tundra’s scheduler is different (I think it’s “restarting” one as per that paper, whereas Jam’s is “topological”).

In practice, at least in our codebase this means that Tundra very often rebuilds less things compared to Jam, especially when things involve files generated during build time. Since the order of build steps and even the amount of them is different between Tundra and Jam, we could not just build simple validation suite like “build everything with both, compare that they did exact same steps”.

So what we did was rely on the automated builds/tests that we already had for the product itself, and also on volunteer developers inside Unity to try out Tundra locally. Since 2018 January people could opt-in to Tundra by adding a tiny local environment change, and report any & all findings. We started with a handful of people, and over coming months it grew to several dozen. In late May it got turned on by default (still with ability to opt-out) for everyone on Mac, and next month everyone got Tundra switched on. Some time later remains of jam.exe got actually removed.

2019, current state

Today in our main code repository, jam.exe is long gone, and almost all of remains of JamPlus-converted C# code are gone.

Compared to the build state three years ago, a lot of nice build related tools were built (some I wrote about in the previous blog post), and in general I think various aspects of build performance, reliability, UX, workflow have been improved.

As a side effect, we also have this fairly nice build system (“Bee”) that we can use to build things outside of our main code repository! So that’s also used to build various external libraries that we use, in various plugins/packages, and I think even things like Project Tiny use it for building actual final game code.

So all that’s nice! But oh geez, that also took a lot of time. Hence the blog post title.


Two years in a build team!

Whoa! Has it been two years already since I stopped doing graphics programming?!

How does one end up on a build team?

I switched to the “build system” team by having two extremely short chats:

…and then a lot more chats with graphics people and some others of course; but the above “hey could I join? yes!” was kinda the whole of my “job interview” to the team.

And then I sent out a “goodbye graphics” email:

I actually had no idea for how long I’m leaving, so I wrote “a couple months”. Well look, it’s been two years already!

What does a build team do?

Most of my impressions I already wrote about (one week in, six months in).

A bunch of people have asked me either “ok so what is it that you actually do?”, or alternatively “surely the build work must be done by now?“. To which I don’t have an excellent answer. My own work is a combination of:

  • Switching from “old build” (JamPlus) to “new build” (Bee/Tundra) while not disrupting the work of everyone around us.
  • Speeding up builds by cleaning up dependencies, includes, removing code.
  • Upgrading various platforms to more recent compiler versions; usually this is not hard but e.g. VS2010 -> VS2015 was pretty painful due to needed rebuilds of all 3rd party static libraries.
  • Improving UX of various aspects of build process: cleaner logs, better diagnostic messages, more intuitive build command line arguments.
  • Support for people who have any build issues.
  • Fixing out various build issues that are accidentally introduced due to one reason or another. In a live codebase, it’s not like you can fix all issues and be done with it :)

Typical work weeks might look like this – this is from my own “week logs” doc, took a screenshot of two recent ones:

And “what I did during last year” summary looks like this. I’ve highlighted buildsystem-related work in there; the rest is “everything else”:

It doesn’t feel like “I got a lot done”, but doesn’t feel terrible either.

Anyway! Most of build stuff is fairly typical, but during last year our team has build some pretty neat tools that I wanted to write about. So here they are!

Neat Build Tools

In our main code repository the build entry point script is called jam (it does not use Jam anymore, but backwards compat with what people are used to type…). In some of our new code repositories (for DOTS/ECS and some packages) the build entry point would be called bee; both have the same tools, the examples below will be using jam entry point.

How exactly X is built?

jam how substring-of-something finds the most relevant build step (e.g. object file compile, executable link, file copy, whatever) and tells exactly how it is built. This is mostly useful to figure out exact compiler command line flags, and dependencies.

Why X got rebuilt?

If one is wondering why something gets rebuilt (recompiled, re-copied, relinked, etc.), jam why substring-of-something tells that:

Every build produces a “log of what got done and why” file, and the why tool looks at that to do the report.

Where time was spent during the build?

I added Chrome Tracing profiler output support to both JamPlus and Tundra (see previous blog post), and while that is all good and nice, sometimes what you want is just a “very quick summary”. Enter jam time-report. It shows top slowest action “types”, and top 10 items within each type:

Of course if you want more detail, you can drag the profiler output file into chrome://tracing or speedscope.app and browse it all:

What are the worst C/C++ header includes?

Since during the build Tundra scans source files for #include dependencies, we can use that to do some analysis / summary! jam include-report shows various summaries of what might be worth untangling:

It is very similar to Header Hero that I used before for include optimization. But I wanted something that would see actual includes instead of the approximation that Header Hero does, and something that works on a Mac, and something that would be built-in in all our builds. So there!

This is all! I’ll get back to reviewing some pull requests now.


'Infinite' sky shader for Unity

I saw a discussion today that lamented lack of “infinite projection” in Unity, so here’s a quick post about that.

Unity today (at least in 2017/2018 versions) uses “reversed Z” projection, but does not use “infinite projection”. This means that a Camera has an actual far clipping plane, and beyond that distance nothing is rendered.

This is fine for almost all cases, except when you want to render custom “skybox-like” geometry (a sky sphere or whatever); then you’d like to have it be “infinitely large” and thus guaranteed to always be beyond any actual scene geometry.

You could wait for Unity to implement “infinite projection”, and/or write a custom render pipeline to use your own infinite projection, or do a super small local change just inside your “sky object” shader to make it effectively be infinite. Let’s check out how to do the last bit!

“Infinite size” shader

To achieve an effectively “infinite size” (i.e. appears “behind any objects”) shader, all we have to do is to move the vertices to be “on the far plane” in the vertex shader. If o.vertex is a float4 with clip space position (e.g. computed by UnityObjectToClipPos), then just do this:

#if defined(UNITY_REVERSED_Z)
// when using reversed-Z, make the Z be just a tiny
// bit above 0.0
o.vertex.z = 1.0e-9f;
#else
// when not using reversed-Z, make Z/W be just a tiny
// bit below 1.0
o.vertex.z = o.vertex.w - 1.0e-6f;
#endif

And here it is. Far plane of only 20, and a “sky sphere” object that is 100000 in size. No clipping, renders behind scene geometry:

Here’s Unity 2018.2.15 project with the whole shader and a simple scene - InfiniteSkyUnityProject.zip (350kb)

What’s all this Reversed Z and Infinite Projection anyway?

Nathan Reed explains it much better than I ever could, read that post!

Summary is that reversing depth so that far plane is at zero, and near plane is at one, and using a floating point format depth buffer, results in much better depth precision.

In Unity we have implemented support for this reversed Z a while ago (in Unity 5.5). Today we don’t use infinite projection (yet?), so to achieve the infinite-like sky objects you’d have to do a shader trick like above.

This is all.


Another cool MSVC flag: /d1reportTime

A while ago I found about a fairly useful Visual C++ flag that helps to investigate where the compiler backend/optimizer spends time – /d2cgsummary, see blog post.

Turns out, there is a flag that reports where the compiler frontend spends time – /d1reportTime.

I’ve been writing about lack of compiler time profiling/investigation tools, starting from that same old post, up to the more recent ones that focused on clang - one and two. All this recent discussion about C++ compilation times (among other issues…) led me to finding this fresh project on github, UECompileTimesVisualizer by @phyronnaz. Ignore everything about UE4 there (unless you work with it of course :)), the new thing for me was that it says:

Add the following arguments to the C/C++ Command Line option in your project settings: /Bt+ /d2cgsummary /d1reportTime

/Bt+ and /d2cgsummary are known, but /d1reportTime I haven’t heard about before. What is it?! The only other place on the internet that talks about it – and it’s fairly impressive for anything to have just two mentions on the internet – is a small comment in Microsoft forums thread a couple months ago, and the only thing it says is: “Can you also share the output from /d1reportTime?”

So what is /d1reportTime?

Passing /d1reportTime to the MSVC compiler (cl.exe) will make it print:

  • Which header files are included (hierarchically), with time taken for each,
  • Which classes are being parsed, with time taken for each,
  • Which functions are being parsed, with time taken for each.

Additionally, at end of each list, “top N” are listed too. List of classes/functions seems to contain info about which templates (or nested classes? not sure yet) are being instantiated as well.

I have no idea which version of MSVC the flag has appeared in. VS2015 seems to not have it yet; and VS2017 15.7 (compiler version 19.14.26428) already has it; it might have appeared in earlier VS2017 builds but I haven’t checked.

Running it on anything non-trivial produces tons of output, e.g. I randomly ran it on a (fairly big) Shader.cpp file we have in our codebase, and it produces 30 thousand lines of output. Shortened a lot:

Shader.cpp
Include Headers:
  Count: 542
    c:\trunk\runtime\shaders\shader.h: 0.311655s
      c:\trunk\runtime\shaders\shadertags.h: 0.004842s
      c:\trunk\runtime\shaders\shaderpropertysheet.h: 0.127829s
        c:\trunk\runtime\math\color.h: 0.021522s
          c:\trunk\runtime\serialize\serializeutility.h: 0.012031s
            c:\trunk\runtime\serialize\serializationmetaflags.h: 0.005427s
            c:\trunk\runtime\utilities\enumtraits.h: 0.004738s
          c:\trunk\runtime\testing\testingforwarddecls.h: 0.000331s
        c:\trunk\runtime\math\vector2.h: 0.003681s
        <... a lot more skipped ...>

  Top 24 (top-level only):
    c:\trunk\runtime\graphics\scriptablerenderloop\scriptablerendercontext.h: 0.619189s
    c:\trunk\runtime\graphics\scriptablerenderloop\scriptablebatchrenderer.h: 0.407027s
    c:\trunk\runtime\shaders\shader.h: 0.311655s
    c:\trunk\runtime\shaders\shadernameregistry.h: 0.263254s
    <... a lot more skipped ...>

  Total: 2.125453s
Class Definitions:
  Count: 10381
    ShaderTagID: 0.001146s
    EnumTraits::internal::IsReflectableEnum<`template-type-parameter-1'>: 0.000057s
    EnumTraits::internal::DefaultTraitsHelper<`template-type-parameter-1',1>: 0.000085s
    EnumTraits::internal::DefaultTraitsHelper<`template-type-parameter-1',0>: 0.000064s
    <... a lot more skipped ...>

  Top 100 (top-level only):
    std::_Vector_const_iterator<class std::_Vector_val<struct std::_Simple_types<struct ShaderLab::SerializedProperty> > >: 0.031536s
    PersistentManager: 0.029135s
    ModuleMetadata: 0.029010s
    ShaderLab::SerializedSubProgram: 0.020047s
    GlobalCallbacks: 0.019955s
    <... a lot more skipped ...>

  Total: 1.412664s
Function Definitions:
  Count: 20544
    ShaderTagID::ShaderTagID: 0.000368s
    ShaderTagID::ShaderTagID: 0.000024s
    operator ==: 0.000089s
    operator !=: 0.000017s
    operator <: 0.000021s
    <... a lot more skipped ...>

  Top 100 (top-level only):
    std::operator ==: 0.031783s
    ShaderLab::SerializedTagMap::Transfer: 0.014827s
    math::matrixToQuat: 0.014715s
    math::mul: 0.012467s
    Texture::TextureIDMapInsert: 0.012006s
    <... a lot more skipped ...>

  Total: 2.658951s

At 30k lines it is not the most intuitive for manual use, but a lot of interesting information is there. Particularly the “Top N” lists sound like good things to start with, if looking at the output manually.

Even for something simple like (see in Compiler Explorer):

#include <vector>
int main()
{
    std::vector<int> v;
    v.push_back(13);
    return (int)v.size();
}

The /d1reportTime produces almost 2000 lines of output. Again, shortened here:

example.cpp
Include Headers:
    Count: 74
        c:\msvc\v19_16\include\vector: 0.215198s
            c:\msvc\v19_16\include\xmemory: 0.186372s
                c:\msvc\v19_16\include\xmemory0: 0.184278s
                    c:\msvc\v19_16\include\cstdint: 0.015689s
                        c:\msvc\v19_16\include\yvals.h: 0.015103s
                            c:\msvc\v19_16\include\yvals_core.h: 0.012148s
                                c:\msvc\v19_16\include\xkeycheck.h: 0.000246s
                                c:\msvc\v19_16\include\crtdefs.h: 0.011183s
                                <... skipped remaining 66 headers ...>

    Top 1 (top-level only):
        c:\msvc\v19_16\include\vector: 0.215198s
    Total: 0.215198s
Class Definitions:
    Count: 829
        _PMD: 0.000260s
        _TypeDescriptor: 0.000088s
        _s__CatchableType: 0.000113s
        _s__CatchableTypeArray: 0.000025s
        _s__ThrowInfo: 0.000044s
        _s__RTTIBaseClassDescriptor: 0.000056s
        _s__RTTIBaseClassArray: 0.000027s
        _s__RTTIClassHierarchyDescriptor: 0.000026s
        _s__RTTICompleteObjectLocator2: 0.000048s
        __s_GUID: 0.000049s
        <add_rvalue_reference><`template-type-parameter-1'>: 0.000012s
        <remove_reference><`template-type-parameter-1'>: 0.000006s
        <remove_reference><class $RBAAB &>: 0.000005s
        <... skipped remaining 816 classes ...>

    Top 100 (top-level only):
        std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >: 0.006266s
        std::vector<int,class std::allocator<int> >: 0.003368s
        std::pair<`template-type-parameter-1',`template-type-parameter-2'>: 0.002695s
        std::basic_string<`template-type-parameter-1',`template-type-parameter-2',`template-type-parameter-3'>: 0.001880s
        std::vector<`template-type-parameter-1',`template-type-parameter-2'>: 0.001117s
        <... skipped remaining 95 top classes ...>

    Total: 0.059558s
Function Definitions:
    Count: 773
        operator new: 0.000162s
        operator delete: 0.000031s
        operator new[]: 0.000025s
        operator delete[]: 0.000009s
        abs: 0.000048s
        abs: 0.000021s
        div: 0.000059s
        div: 0.000043s
        fpclassify: 0.000055s
        <... skipped remaining 764 functions ...>

    Top 100 (top-level only):
        std::logic_error::logic_error: 0.006603s
        main: 0.003693s
        std::_Uninitialized_move: 0.003665s
        std::vector<int,class std::allocator<int> >::push_back: 0.001054s
        std::numeric_limits<bool>::round_error: 0.000730s
        std::vector<int,class std::allocator<int> >::_Emplace_reallocate: 0.000612s
        _scwprintf_p: 0.000414s
        <... skipped remaining 93 top functions ...>
    Total: 0.056901s

Having reports like that is also useful to realize just how much work we’re asking the compiler to do. The C++ snippet above is “as simple as it gets” when using “standard” C++ features. Just make a vector of integers, and add a number to it. The compiler ends up including 74 headers, parsing over 800 classes and almost 800 functions. Just to make a simple “push an integer into a vector” task.

Yes, I know about precompiled headers. And I know about upcoming C++ modules. You can save on sending that email :)

Anyway, having that tooling is really good! Thanks Microsoft Visual C++ team! Also, u/gratilup mentions in this reddit thread that MSVC team is working on ETW/WPA based tooling to visualize compilation times. Great!


time-trace: timeline / flame chart profiler for Clang

Update: this has landed to LLVM/Clang mainline! So if all goes well, Clang 9.0 should contain this functionality. The upstreaming commit landed on 2019 March 30; thanks Anton Afanasyev for doing the work of landing it!

I wanted Clang to emit timeline (“flame chart”) style profiling info on where it spends time. So I made it do that.

What kind of compiler profiler I want?

In the previous blog post I talked about how it is hard to get “useful” information on where C++ compilers spend their time. Various compilers have various ways of reporting something, but most of their reports seem to be geared towards the compiler developers themselves. For them it’s important to know, for example, whether register allocator is the bottleneck; for me as a compiler user that is much less of an importance – I want to know whether I’m bottlenecked by preprocessor / includes (if so, which ones?), parsing (if so, which classes/functions/templates?), template instantiation (if so, which ones?), code generation, backend optimizations, or something else.

Having added Chrome Tracing profiling outputs to other parts of our build system in the past (here or there), I quite like it. It’s not perfect, and the UI is not very good, but it gets the job done and the file format is trivial to write.

So I thought I’d try doing it – add some sort of flag to Clang that would emit the profiling information that I would find interesting. Turns out, getting to the “it seems to work” state was easier than I expected!

-ftime-trace: Chrome Tracing output for Clang

The very first thing I tried it on pointed out a really useful thing: we have a super slow to include header in one place. Just look:

This is a timeline / flame chart profiling view in Chrome chrome://tracing page. Horizontal axis is time, vertical is nested “callstacks”, so to speak. And it very clearly shows that there is one header file that takes over 8 seconds to include.

I suspect it’s some sort of performance issue with Clang itself; Gcc and Msvc are quite a bit faster at including this file. Will try to gather data and report a bug.

The file itself does not feel that complex; without this view I would have never guessed that it’s the cause of these slow compile times. Yay visibility into what is going on!

So that was already fairly useful. With this visualization, I can look into other files and see what they end up doing. And based on that, decide what to do to make compiles faster. When including/parsing files takes up time, I can tell which ones are the culprit exactly, and how long it takes to include them:

Already found some headers that took longer to process than you’d guess, and I’ve split them up into “rarely used, expensive” parts and “often used, cheap” parts.

Or when looking at template instantiations, I can see which ones end up taking most time to process, and focus on improving them:

But can’t you figure that out with existing means?

Template instantiation I probably could via Templight and Templight-Tools. I built them, tried it on some files, wrestled with how to visualize the “callgrind” output (WinCacheGrind refused to open them, QCacheGrind worked but I found it a bit intimidating to use). The process of “compile with a special flag, get a file, run the file through a converter, open in some app” is a bit involved, plus the converters aren’t very fast.

Using a “regular profiler” on Clang itself shows which parts of the compiler take up time, but does not show which parts of my code it is spending that time on. For example on that “one file was very slow to parse” example from top of the post, running Xcode Instruments profiler on Clang just shows this:

So I can see that “all the time is spent parsing”, but where exactly?

Ok where’s the code, and what now?

All the code is at this github PR for now, which is based on Clang 8.0.0-ish “trunk” as it was on 2019 January 10.

It adds a new command line flag to clang, -ftime-trace, that produces Chrome Tracing .json file next to the regular compiled object files.

I based it on the older LLVM/Clang “git monorepo”; they have switched to another github repository while I was doing it :) However the build instructions are the same as for regular LLVM/Clang. Some notes:

  • Do use Ninja backend for building. Default Makefiles backend spends ages figuring out what to build, every time you do the build.
  • Default build type is non-optimized Debug build. Build optimized Release build with the usual -DCMAKE_BUILD_TYPE=RelWithDebInfo CMake flag.
  • On Windows, make sure to set CC=cl and CXX=cl environment variables before invoking CMake. I had it pick up some random Gcc-based toolchain that was inside my Strawberry Perl install otherwise.
  • LLVM/Clang build is setup in such a way where any changes to the base “Support” library (where timing/profiling utilities are, among other things) cause rebuild of almost everything. Add a comment into Timer.cpp file? There’s 2500 files to recompile! And I wanted to add my profiling stuff in there…
    • This seems to be because some utility that generates some headers (which the rest of LLVM/Clang depends on) itself uses the same “Support” library. The utility is llvm-tblgen; I don’t know what exactly it is, but I’ll assume it generates the famous Amiga demoscene groups, and you cannot change my mind.
    • Turns out LLVM build has an option for “I have my own tblgen” (which I did have after building the whole stuff once) that you can point the build to. Then it will stop rebuilding it, and thus stop regenerating most of the headers. Use -DLLVM_TABLEGEN=path/to/tblgen option to CMake for that. This cuts down “change anything in Support library” from “2500 files to recompile” down to about “250 files to recompile”. Still not cheap, but much better.
  • Overall this was surprisingly easy! I mean I’ve never built LLVM/Clang before, and here, just 350 lines of code later, I have my own modified Clang with the profiling output I always wanted. Pretty good!

I’ve started a thread on Clang mailing list about my changes, and will see where that goes. There are some comments already; at some point I’d have to do a proper pull request via whatever is the LLVM/Clang code review tooling.

No idea whether any of this will get accepted into upstream Clang, but if not, then at least I will have my own Clang with the profiler I always wanted :)