Slow to Compile Table Initializers

Continuing the hunt for low-hanging fruit in our codebase’s (Unity game engine, a lot of C++ code) build times with Visual Studio, here’s what I found with the help of /d2cgsummary (see previous blog post).

Noticed one file that was taking 92 seconds to compile on my machine, in Release config. A quick look at it did not reveal any particularly crazy code structure; it was fairly simple & “obvious” code. However, /d2cgsummary said:

Anomalistic Compile Times: 44
  ?IsFloatFormat@@YA_NW4GraphicsFormat@@@Z: 5.493 sec
  ?IsHalfFormat@@YA_NW4GraphicsFormat@@@Z: 5.483 sec
  ?GetGraphicsFormatString@@YA?AV?$basic_string@DV?$StringStorageDefault@D@core@@@core@@W4GraphicsFormat@@@Z: 4.680 sec
  ?ComputeMipmapSize@@YA_KHHW4GraphicsFormat@@@Z: 4.137 sec
  ?ComputeTextureSizeForTypicalGPU@@YA_KHHHW4GraphicsFormat@@HH_N@Z: 4.087 sec
  ?GetComponentCount@@YAIW4GraphicsFormat@@@Z: 2.722 sec
  ?IsUNormFormat@@YA_NW4GraphicsFormat@@@Z: 2.719 sec
  ?Is16BitPackedFormat@@YA_NW4GraphicsFormat@@@Z: 2.700 sec
  ?IsAlphaOnlyFormat@@YA_NW4GraphicsFormat@@@Z: 2.699 sec
  <...>

A bunch of functions taking 2-5 seconds to compile each, with each one of them looking very simple:

bool IsFloatFormat(GraphicsFormat format)
{
    return ((GetDesc(format).flags & kFormatPropertyIEEE754Bit) != 0) && (GetDesc(format).blockSize / GetComponentCount(format) == 4);
}

Why on earth would a function like that take 5 seconds to compile?!

Then I noticed that all of them call GetDesc(format), which looks like this:

const FormatDesc& GetDesc(GraphicsFormat format)
{
    static const FormatDesc table[] = //kFormatCount
    {
        // bSize,bX,bY,bZ, swizzleR,  swizzleG,        swizzleB,        swizzleA,        fallbackFormat,    alphaFormat,       textureFormat,    rtFormat,        comps, name,   flags
        {0, 0, 0, 0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatNone,       kFormatNone,       kTexFormatNone,   kRTFormatCount,  0, 0,  "None", 0},                                                                            // None,
        {1, 1, 1, 1, kFormatSwizzleR, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  1, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // R8_SRGB
        {2, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  2, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RG8_SRGB
        {3, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatRGB24,  kRTFormatCount,  3, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGB8_SRGB
        {4, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzleA, kFormatNone,       kFormatRGBA8_SRGB, kTexFormatRGBA32, kRTFormatARGB32, 3, 1,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGBA8_SRGB
        // <...a lot of other entries for all formats we have...>
    };
    CompileTimeAssertArraySize(table, kGraphicsFormatCount);
    return table[format];
}

Just a function that returns a “description” struct for a “graphics format”, with various info we might be interested in, using a pattern similar to “Translations in C++ using tables with zero-based enums”.
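The CompileTimeAssertArraySize bit just guards against the table getting out of sync with the enum; a minimal sketch of how such a macro can work (a C++98-style negative-array-size trick; our actual implementation may differ):

#define CompileTimeAssertArraySize(tbl, count) \
    typedef char AssertArraySizeMismatch[(sizeof(tbl)/sizeof((tbl)[0]) == (count)) ? 1 : -1]

If the table and the enum count diverge, the typedef’d array gets a negative size and compilation fails.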

The table is huge though. Could it be what’s causing compile times to be super slow?

Let’s try moving the table to be a static global variable, outside of the function:

static const FormatDesc table[] = //kFormatCount
{
    // <...the table...>
};
CompileTimeAssertArraySize(table, kGraphicsFormatCount);
const FormatDesc& GetDesc(GraphicsFormat format)
{
    return table[format];
}

Boom. Compile time of that file went down to 10.3 seconds, a whole 80 seconds faster.

:what:

That’s crazy. Why does this happen?

This “initialize table inside a function” pattern differs from “just a global variable” in a few aspects:

  • The compiler must emit code to do the initialization when the function is executed for the first time. For non-trivial data, the optimizer might not “see” that it’s all just bytes in the table. And so it will actually emit the equivalent of if (!initYet) { InitTable(); initYet=true; } into the function.
  • Since C++11, the compiler is required to make that thread-safe too! So it emits the equivalent of a mutex lock, or some atomic checks, around the initialization.

Whereas for a table initialized as a global variable, none of that needs to happen; if it’s not “just bytes” then the compiler still generates the initializer code, but it’s called exactly once before the “actual program” starts, and it does not add any branches to the GetDesc function.
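Conceptually, the compiler turns the function-local static version into something like this (just a sketch; the guard flag and the helper are invented names for compiler-internal machinery):

const FormatDesc& GetDesc(GraphicsFormat format)
{
    static FormatDesc table[kGraphicsFormatCount]; // hidden zero-filled storage
    static bool s_tableInitialized;                // compiler-generated guard
    if (!s_tableInitialized)                       // this branch runs on every call
    {
        // C++11 additionally mandates thread safety, so the real generated
        // code also does an atomic check and takes a lock around this.
        ConstructTableEntries(table);              // fills each entry, field by field
        s_tableInitialized = true;
    }
    return table[format];
}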

Ok, so one takeaway is: for “static constant data” tables like that, it might be better to declare them as global variables, instead of as static local variables inside a function.

But wait, there’s more! I tried to make a “simple” repro case for Microsoft C++ folks to show the unusually long compile times, and I could not. Something else was also going on.

What is “Simple Data”?

The same /d2cgsummary output also had this:

RdrReadProc Caching Stats
  Most Hits:
    ??U@YA?AW4FormatPropertyFlags@@W40@0@Z: 18328

Which probably says that the enum FormatPropertyFlags __cdecl operator|(enum FormatPropertyFlags,enum FormatPropertyFlags) function was called/used 18 thousand times in this file. One member of our FormatDesc struct was a “bitmask enum” that had operators defined on it for type safety, similar to DEFINE_ENUM_FLAG_OPERATORS (see here or there).

At this point, let’s move on to actual code where things can be investigated outside of the whole Unity codebase. Full source code is here in the gist. Key parts:

#ifndef USE_ENUM_FLAGS
#define USE_ENUM_FLAGS 1
#endif
#ifndef USE_INLINE_TABLE
#define USE_INLINE_TABLE 1
#endif

enum FormatPropertyFlags { /* ...a bunch of stuff */ };
#if USE_ENUM_FLAGS
#define ENUM_FLAGS(T) inline T operator |(const T left, const T right) { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }
ENUM_FLAGS(FormatPropertyFlags);
#endif

struct FormatDesc { /* ...the struct */ };

#if USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    static const FormatDesc table[] = { /* ... the huge table itself */ };
#if !USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    return table[format];
}

UInt32 GetColorComponentCount(GraphicsFormat format)
{
    return GetDesc(format).colorComponents;
}
/* ... a bunch more functions similar to this one */

We have preprocessor defines to switch between the big table being defined as a static local variable inside the GetDesc function or as a static global variable, and a define to switch whether the “type safe enums” machinery is used or not. I’m compiling for x64 with cl.exe /O2 /Zi /GS- /GR- /EHsc- /MT main.cpp /link /OPT:REF /OPT:ICF, which is a fairly typical “Release build” flag soup.

Compiler             Time    Time with /cgthreads1   Exe size, KB   Large symbols, KB
MSVC 2010 SP1        1.8s    -                       58             12 GetDesc
MSVC 2015 Update 3   15.1s   42.6s                   189            45 IsFloatFormat, 45 IsHalfFormat
MSVC 2017 15.3       18.2s   37.5s                   96             -

Ok, so Visual C++ 2015/2017 is particularly slow at compiling this code pattern (big table, type-safe-enum operators used in it) with optimizations turned on. It’s also a big increase in compile time compared to VS2010, hence I filed a bug for MS.

But what’s even stranger is that the code size is quite big too, particularly in the VS2015 case. Each of the IsFloatFormat and IsHalfFormat functions, which are both simple one-liners that just call GetDesc, compiles into a separate 45 kilobyte chunk of code (I found that via Sizer).

VS2015 compiles IsFloatFormat into this:

mov         qword ptr [rsp+8],rbx
mov         qword ptr [rsp+10h],rbp
mov         qword ptr [rsp+18h],rsi
push        rdi
; some prep stuff that checks for above mentioned "is table initialized yet",
; and then if it's not; does the table initialization:
movdqa      xmm0,xmmword ptr [__xmm@00000000000000040000000400000005 (07FF7302AA430h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E68h (07FF7302AC080h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000050000000200000001 (07FF7302AA690h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E98h (07FF7302AC0B0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000040000000000000003 (07FF7302AA680h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0ED8h (07FF7302AC0F0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000400000004 (07FF7302AA710h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F08h (07FF7302AC120h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000003e000000080000000800000005 (07FF7302AAAF0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F48h (07FF7302AC160h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000200000001 (07FF7302AA700h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F78h (07FF7302AC190h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000080000000000000003 (07FF7302AA6A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FB8h (07FF7302AC1D0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000000c000000050000000400000004 (07FF7302AA7A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FE8h (07FF7302AC200h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@000000000000000c0000000c00000005 (07FF7302AA480h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+1028h (07FF7302AC240h)],xmm0
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E50h (07FF7302AC068h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E58h (07FF7302AC070h)],1010102h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E60h (07FF7302AC078h)],1
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E64h (07FF7302AC07Ch)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E78h (07FF7302AC090h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0E7Ch (07FF7302AC094h)],2
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E80h (07FF7302AC098h)],rbp
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E88h (07FF7302AC0A0h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E90h (07FF7302AC0A8h)],1010103h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EA8h (07FF7302AC0C0h)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EACh (07FF7302AC0C4h)],3
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EB0h (07FF7302AC0C8h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0EB4h (07FF7302AC0CCh)],3
; repeat very similar thing for 45 more kilobytes...

Which is basically GetDesc, including the big table initializer, fully inlined into it. The initialization is not done via some “simple data segment copy”; instead, it looks like each entry is carefully constructed, field by field. And then a very similar thing is repeated for the IsHalfFormat function.

VS2017 does not do any of that; the optimizer “realizes” that table is purely constant data, puts it into a data segment (yay!), and the IsFloatFormat function becomes fairly simple:

movsxd      rax,ecx
lea         r8,[__NULL_IMPORT_DESCRIPTOR+5C4h (07FF725393000h)]
imul        rdx,rax,38h
test        byte ptr [rdx+r8+30h],80h
je          IsFloatFormat+35h (07FF725391035h)
movzx       eax,byte ptr [rdx+r8+25h]
movzx       ecx,byte ptr [rdx+r8+24h]
add         ecx,eax
movzx       eax,byte ptr [rdx+r8]
xor         edx,edx
div         eax,ecx
cmp         eax,4
jne         IsFloatFormat+35h (07FF725391035h)
mov         al,1
ret
xor         al,al
ret

What if the table is moved out to be a global variable, as I suggested above? Passing /DUSE_INLINE_TABLE=0 to the compiler, we get:

Compiler             Time   Exe size, KB   Large symbols, KB
MSVC 2010 SP1        1.4s   50             2 dynamic initializer for 'table'
MSVC 2015 Update 3   2.1s   96             -
MSVC 2017 15.3       2.7s   96             -

VS2017 generates code completely identical to before, just does it 6 times faster. VS2015 now also compiles the table into a data segment, like VS2017 does; it’s 7 times faster, and the executable is 90 kilobytes smaller.

VS2010 still emits the global table initializer function, and is a bit faster to compile. But it wasn’t as slow to compile to begin with.

What if we leave the table as a local variable, but just remove the single usage of the type-safety enum flags? Passing /DUSE_ENUM_FLAGS=0 to the compiler, we get:

Compiler             Time   Exe size, KB
MSVC 2010 SP1        0.2s   46
MSVC 2015 Update 3   0.4s   96
MSVC 2017 15.3       0.4s   96

Whoa. All three compilers now “realize” that the table is pure simple data, put it into a data segment, and take a lot less time to compile the whole thing.

And all that just because this function got removed from the source:

inline FormatPropertyFlags operator|(const FormatPropertyFlags left, const FormatPropertyFlags right)
{
	return static_cast<FormatPropertyFlags>(static_cast<unsigned>(left) | static_cast<unsigned>(right));
}

In this particular piece of code, that “type safe enum” machinery does not actually do anything useful, but in other, general code, more type safety on enums & bit flags is a very useful thing to have! It’s quite sad that “seemingly trivial” type safety abstractions incur so much compile time overhead in year 2017, but oh well… reality is sad, especially in year 2017 :(

Would using the C++11 “enum class” feature allow us to have type safe enums, have bitmask operations on them (similar to this approach), and have good compile times? I don’t know that yet. An exercise for the reader right now! But also, see the constexpr section below.
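If someone tries it, a starting point could look like this (a sketch of the enum class approach with constexpr operators; I have not measured its compile times, and the names are illustrative):

enum class FormatProperty : unsigned
{
    Norm     = 1 << 0,
    SRGB     = 1 << 1,
    Unsigned = 1 << 2,
    // ...
};
constexpr FormatProperty operator|(FormatProperty a, FormatProperty b)
{
    return static_cast<FormatProperty>(static_cast<unsigned>(a) | static_cast<unsigned>(b));
}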

For reference, I also checked gcc (g++ 5.4 on the Windows 10 Linux subsystem, with the -O2 flag) compile times and executable sizes for each case above. It compiled everything in 0.3 seconds on my machine in all cases, and the executable size was the same every time.

What about constexpr?

As pointed out over at reddit, what about declaring the table as C++11 constexpr? The type safe enum utility has to be declared like that too then, so let’s try it. The ENUM_FLAGS macro needs to get a constexpr in front of the operator it declares, and the table needs to change from static const FormatDesc table[] to static constexpr FormatDesc table[].
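In code, the change amounts to this (against the gist from above):

#define ENUM_FLAGS(T) constexpr T operator |(const T left, const T right) { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }

// ...and inside GetDesc:
static constexpr FormatDesc table[] = { /* ... the huge table itself */ };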

Good news: compile times are back to being fast, and all is good! Thanks reddit! VS2010 does not grok the new construct though, so I removed it from the table.

Compiler             Time   Exe size, KB
MSVC 2015 Update 3   0.4s   96
MSVC 2017 15.3       0.4s   96

Our codebase at the moment is mostly still C++98 (some platforms and their compilers…), but maybe we should sprinkle some optional C++11/14/… dust in a few places where it helps the compiler.

Summary

  • Big table initializers as static local variables might be better as static global variables.
  • Type-safe enum bitmask machinery is not “free” in terms of compilation time, when using optimized builds in MSVC.
  • “Complex” table initializers (what is “complex” depends on the table size & the data in it) might get a lot of code emitted to set them up in MSVC (2015 at least); what looks like “just simple data” to you might not look that way to the compiler.
  • I keep on running into code structure patterns that are much slower to compile with MSVC compared to clang/gcc. Pretty sure the opposite cases exist too; it’s just that right now our MSVC builds are slower than our clang builds, and hence that’s where I’m looking.
  • Even if you’re not using C++11/14 features otherwise, in some places sprinkling things like constexpr might help a C++11 capable compiler to do the right thing.
  • /d2cgsummary flag is awesome to figure out where MSVC optimizer is spending time. I probably would not have guessed any of the above, without it pointing me towards the large table!

Best unknown MSVC flag: d2cgsummary

I’ve been looking at C++ compilation times lately (e.g. here or there), and through correspondence with Microsoft C++ compiler folks learned about a fairly awesome, but undocumented, cl.exe flag.

It’s /d2cgsummary.

How to investigate slow compile times?

Given that compile/build times are a massive problem in large C++ codebases, it’s amazing how little information we can get about why they are slow. For example, when using the Visual Studio C++ compiler, about the only options seem to be:

1) Use the /Bt flag to get the time spent in the frontend (c1xx.dll) vs the backend (c2.dll) parts of the compiler. This is good for narrowing down whether slow compiles are due to too many things being parsed (e.g. a problem of including too many header files), which is typically a frontend problem, or due to some sort of language constructs that make the backend (typically the optimizer part of the compiler) take “a lot of time”.

The “common knowledge” is that long C++ compile times are caused by too many header files being included, and in general by the C/C++ #include model. That is often the case, but sometimes it’s the backend part of the compiler that takes “forever” to compile something (e.g. the __forceinline compile time issue I found in my previous post). The backend typically only takes “long” when compile optimizations are on.

2) Take a profiler capture of cl.exe, using any of existing native code profiling tools (e.g. xperf), load symbols, and try to guess what is causing the long compile time based on profiling results.

3) Just guess, kinda randomly changing code structure, and see whether that helped.

Well, turns out there’s another option, the /d2cgsummary flag.

MSVC /d2cgsummary flag

I can’t find anything about this flag on the internet, so I’d assume it’s something “for internal use only”, and it may or may not exist or keep working in any given MSVC compiler version. It does seem to exist & work in VS2015 Update 3 and in VS2017. Passing it to cl.exe will give an output of “Anomalistic Compile Times” and some sort of “Caching Stats”.
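For example, add it to an otherwise normal optimized compile (the file name here is made up):

cl /c /O2 /d2cgsummary MySlowFile.cpp

and the extra output looks like this: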

Code Generation Summary
        Total Function Count: 742
        Elapsed Time: 13.756 sec
        Total Compilation Time: 13.751 sec
        Efficiency: 100.0%
        Average time per function: 0.019 sec
        Anomalistic Compile Times: 1
                ?EvaluateRootMotion@animation@mecanim@@YAXAEBUClipMuscleConstant@12@AEBUClipMuscleInput@12@PEBM2AEAUMotionOutput@12@_N@Z: 11.667 sec, 0 instrs
        Serialized Initializer Count: 4
        Serialized Initializer Time: 0.006 sec
RdrReadProc Caching Stats
        Functions Cached: 61
        Retrieved Count: 20162
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 14
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 61
        Functions Cached and Never Retrieved: 0
        Most Hits:
                ?f@?$load@$00@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 2775
                ?f@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 2774
                ?f@?$load@$00@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 1924
                ?f@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 1920
                ?pack@?$v@Uv4f@meta@math@@U?$sp@Uv4f@meta@math@@$0DCB@$02@23@$02@meta@math@@SA?AU123@AEBT__m128@@@Z: 1296
                ??B_float4@math@@QEBA?AT__m128@@XZ: 1165

Anomalistic Compile Times seems to output the “longest to compile” functions, presumably ones whose times are much longer than the average. Here, it immediately says that among all 700+ functions that were compiled, there is one that takes almost all the time! The names are mangled, but using either the undname.exe MSVC utility or an online demangler, it turns out that the slow-to-compile function in this particular file was mecanim::animation::EvaluateRootMotion.
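For reference, undname takes the decorated name as its argument; in a Visual Studio command prompt, something like this prints the full undecorated signature:

undname ?EvaluateRootMotion@animation@mecanim@@YAXAEBUClipMuscleConstant@12@AEBUClipMuscleInput@12@PEBM2AEAUMotionOutput@12@_N@Z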

This means that if I want to speed up compilation of this file, I need to only look at that one function, and somehow figure out why it’s slow to compile. The “why” is still an exercise for the reader, but at least I know I don’t have to investigate 700 other functions. Great!

The RdrReadProc Caching Stats section I don’t quite understand, but it seems to list the functions that were used most often in the whole compilation. Again after demangling, it seems that math::meta::v4f::SWIZ<801>::load got used almost 3000 times, etc.

On some other files I was looking into, the /d2cgsummary flag did not point to one slow function, but at least pointed me to a subsystem that might be contributing to slow compile times. For example here:

Code Generation Summary
        Total Function Count: 4260
        Elapsed Time: 20.710 sec
        Total Compilation Time: 20.685 sec
        Efficiency: 99.9%
        Average time per function: 0.005 sec
        Anomalistic Compile Times: 32
                ?VirtualRedirectTransfer@Shader@@UEAAXAEAVRemapPPtrTransfer@@@Z: 0.741 sec, 0 instrs
                ??$TransferBase@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@W4TransferMetaFlags@@@Z: 0.734 sec, 0 instrs
                ??$Transfer@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@PEBDW4TransferMetaFlags@@@Z: 0.726 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@Shader@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.711 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@VShader@@@@SAXAEAVShader@@AEAVRemapPPtrTransfer@@@Z: 0.708 sec, 0 instrs
                ??$Transfer@USerializedShader@ShaderLab@@@RemapPPtrTransfer@@QEAAXAEAUSerializedShader@ShaderLab@@PEBDW4TransferMetaFlags@@@Z: 0.632 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@USerializedShader@ShaderLab@@@@SAXAEAUSerializedShader@ShaderLab@@AEAVRemapPPtrTransfer@@@Z: 0.625 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@SerializedShader@ShaderLab@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.621 sec, 0 instrs
                <...>
RdrReadProc Caching Stats
        Functions Cached: 227
        Retrieved Count: 74070
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 0
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 221
        Functions Cached and Never Retrieved: 6
        Most Hits:
                ?GetNewInstanceIDforOldInstanceID@RemapPPtrTransfer@@QEAAHH@Z: 3065
                ?MightContainPPtr@?$SerializeTraits@D@@SA_NXZ: 2537
                ?Align@TransferBase@@QEAAXXZ: 1965
                <...>

Out of the total 20 second compile time, there’s a bunch of functions that each take almost a second to compile (after demangling: Shader::VirtualRedirectTransfer(class RemapPPtrTransfer), RemapPPtrTransfer::TransferBase<class Shader>(class Shader&, enum TransferMetaFlags), etc.). So hey, out of everything that Shader.cpp does, I know that in our codebase it’s the “data serialization” part that is slow to compile.

What would be super useful to have from compiler vendors: way more diagnostics on what & why something is slow. Detailed reports like this flag’s output, and/or custom ETW providers for xperf, or anything like that. These would not be things used by most people, but when someone has to use them, they would be invaluable. Saving a minute or two of build time for several hundred programmers is an enormous productivity gain!

Anyway, now you know. If you use Microsoft’s C++ compiler, try out the /d2cgsummary flag to cl.exe. You might find something interesting! Next time, how this flag pointed me towards a particular programming pattern that is slow to compile on MSVC, and allowed saving 80 seconds of compile time on one file.

Update: An item on Visual Studio feedback site to document the flag and perhaps some others that might be useful.

Update 2: a “Visual Studio 2017 Throughput Improvements and Advice” post went up on the Visual C++ team blog with /d2cgsummary and other useful compiler throughput tips.


Forced Inlining Might Be Slow

We were upgrading from the Visual Studio 2010 compiler to 2015 at work the other day, and I noticed that Release configuration build times went up. Quite noticeably up: a full rebuild of “the engine” went from ~6 minutes to ~12 minutes. Something needed to be done!

Using the same build profiler I had added before, I quickly found one C++ source file that was taking 10 minutes to compile with VS2015, and a few others that were a bit less extreme.

Just like before, the slow files tended to be heavy users of our SIMD math library, and I started to think along the lines of “oh no, will probably be a lot of work to investigate how to reduce all the complexity of it”, and so on. After all, it’s a fractal of templates that I don’t understand.

However, feedback from the bug report to Microsoft (“2015 got a lot slower than 2010 at compiling the same code”; worth notifying them about it in any case) suggested that it’s our usage of __forceinline that might be the culprit.

Make two functions non-inlined, speed up the build twice

From the answers to my bug report:

The basic root cause here is there are a few functions (one is inverse_WorksFor_SingularAffineX, but there are similar ones that appear to be stamped out by the same set of macros) which have deep, deep inline trees. Around 8000 individual inline instances; the majority of which come from __forceinline functions.

Looking at our code, I could see that pretty much everything in our math library used forced inlining. The reason for that, AFAIK, was to work around the “SIMD parameters are passed through the stack” x64 ABI problem that existed before Microsoft added __vectorcall in VS2013.

Most of the inlines make sense; however, maybe we indeed had some functions that were too long or deep? Turns out, yes we did. We had a “robust matrix inverse” function implemented via SVD (singular value decomposition), which was force-inlined, along with everything else it called. The SVD inverse is: some matrix transposes, 5 Jacobi iterations that involve a bunch of quaternion-matrix conversions and transposes, sorting of singular values, some factorization, some quat-matrix conversions again, etc. All of that compiles down to 6.5 kilobytes of machine code. That’s way outside “sensible things to force inline”, IMHO!

So I moved the svdInverse and a similar svdOrthogonalize function from the header into a .cpp file.
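The mechanical change, sketched (the type name and the body here are illustrative, not our actual code):

struct Matrix4x4f { float m[16]; };

// Header: previously a force-inlined definition; now just a declaration.
Matrix4x4f svdInverse(const Matrix4x4f& m);

// .cpp: the heavy body lives here, compiled exactly once instead of being
// inlined into every caller.
Matrix4x4f svdInverse(const Matrix4x4f& m)
{
    Matrix4x4f result = m;
    // ...in the real code: transposes, 5 Jacobi iterations, sorting of
    // singular values, factorization... ~6.5KB of machine code.
    return result;
}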

The full build got twice as fast: from 11m54s down to 5m46s.

The slowest file (one with a lot of unit tests), which took 660 seconds to compile before, now compiled in 6.5 seconds. That’s a… 100x compile time speedup.

Here’s what would be useful to have in compiler output: some sort of diagnostic switch that would tell me “hey, you marked this as force inlined, but the function compiles down to several kilobytes, and is used in more than a handful of places”.

As things stand today, investigating something like that is basically “read the source code, try to guess which functions end up being large”. That, or do a special build without inlining, and use Sizer / SymbolSort / Bloaty McBloatface to figure out the large functions. All of that is fairly involved in a large codebase.

Then I made two other functions not be inlined either, and got some additional build speedup. The full build was back to about 5 minutes, just from slightly moving 4 functions in a multi-million line codebase. Not bad!

This time, the primary reason for build being slow was not template metaprogramming! :)

Synthetic test for inlining compile times

To check that it’s not just something in “our code”, I did a small synthetic test. It has nonsensical functions that don’t do anything useful, but it kinda “approximates” what we had in our codebase.

#if defined(_MSC_VER)
#define INL __forceinline
#else
#define INL inline __attribute__((always_inline))
#endif

// --- Basic stuff of what you might see in a "SSE2 SIMD math library" code
#include <emmintrin.h>

struct float4
{
    __m128 val;
    float4() { val = _mm_setzero_ps(); }
    float4(float x) { val = _mm_set1_ps(x); }
    float4(float x, float y) { val = _mm_set_ps(y, x, y, x); }
    float4(float x, float y, float z) { val = _mm_set_ps(0.f, z, y, x); }
    float4(float x, float y, float z, float w) { val = _mm_set_ps(w, z, y, x); }
    float4(__m128 v) { val = v; }
};

static INL float4 operator+(const float4& a, const float4& b) { return float4(_mm_add_ps(a.val, b.val)); }
static INL float4 operator-(const float4& a, const float4& b) { return float4(_mm_sub_ps(a.val, b.val)); }
static INL float4 operator*(const float4& a, const float4& b) { return float4(_mm_mul_ps(a.val, b.val)); }
static INL float4 operator/(const float4& a, const float4& b) { return float4(_mm_div_ps(a.val, b.val)); }
static INL float4 csum(const float4 &p)
{
    __m128 r = _mm_add_ps(p.val, _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(p.val), _MM_SHUFFLE(0, 3, 2, 1))));
    return _mm_add_ps(r, _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(r), _MM_SHUFFLE(1, 0, 3, 2))));
}
static INL float4 dot(const float4 &p0, const float4 &p1) { return csum(p0*p1); }
static INL float4 dot(const float4 &p) { return dot(p, p); }
static INL float4 rsqrt(const float4 &x)
{
    #define C0  9.999998e-01f
    #define C1  3.0000002e+00f
    #define C2  .5f
    #define C3  340282346638528859811704183484516925440.f
    __m128 e = _mm_mul_ps(_mm_rsqrt_ps((__m128) x.val), _mm_set_ps(C0, C0, C0, C0));
    e = _mm_min_ps(e, _mm_set_ps(C3, C3, C3, C3));
    return _mm_mul_ps(_mm_mul_ps(e, _mm_set_ps(C2, C2, C2, C2)), _mm_sub_ps(_mm_set_ps(C1, C1, C1, C1), _mm_mul_ps(_mm_mul_ps(x.val, e), e)));
}
static INL float4 normalize(const float4 &v)
{
    return v*rsqrt(dot(v));
}

// --- Functions that don't really make sense, I just kinda randomly
// approximated what we had, without pulling all of it into a repro.
// They don't do anything useful whatsoever; just for testing compile
// time performance.

static INL float4 ident() { return float4(0.f, 0.f, 0.f, 1.f); }
static INL float4 whatever1(const float4 &x, const float4 &y)
{
    return csum(x) / x + y;
}
static INL float4 whatever2(const float4& q1, const float4& q2)
{
    return whatever1(q1 * q2, q2 - q1) * (q1 + q2);
}
static INL float4 whatever3(const float4 &pq, const float4 &mask)
{
    const float c8 = 0.923879532511287f;
    const float s8 = 0.38268343236509f;
    const float g = 5.82842712474619f;

    float4 ch = float4(2) * (normalize(pq) - normalize(mask));
    float4 sh = pq * normalize(ch);
    float4 r = ((g*sh*sh - ch*ch) + sh / float4(s8, s8, s8, c8)) * mask;
    return normalize(r);
}
struct matrix
{
    float4 m0, m1, m2;
};
static INL float4 whateverIteration(matrix &s, int count = 5)
{
    matrix qm;
    float4 q, v = ident();
    for (int iter = 0; iter < count; iter++)
    {
        q = whatever3(s.m0, float4(0, 0, 1, 1));
        v = whatever2(v, q);
        v = normalize(v);
        q = whatever3(s.m1, float4(1, 0, 0, 1));
        v = whatever2(v, q);
        v = normalize(v);
        q = whatever3(s.m2, float4(0, 1, 0, 1));
        v = whatever2(v, q);
        v = normalize(v);
    }
    return v;
}

// Now, kinda "obviously" this function is a bit too large to be force-inlined.
// This is what I'm testing with; having & not having an "INL" on this function.
static INL float4 whateverDecomposition(const matrix &a, const float4 &u, const float4 &v)
{
    float4 r;
    matrix s = a;
    s.m0 = normalize(s.m0) + u;
    s.m1 = normalize(s.m1) * v;
    s.m2 = normalize(s.m2);
    r = whateverIteration(s);
    r = normalize(v) * u + (normalize(u) / v);
    s.m0 = s.m0 + r;
    s.m1 = s.m1 + r;
    s.m2 = s.m2 + r;
    r = whateverIteration(s);
    s.m0 = s.m0 / normalize(r);
    s.m1 = s.m1 / normalize(v+r);
    s.m2 = s.m2 / normalize(v*r);
    r = whateverIteration(s);
    s.m0 = s.m0 * s.m1;
    s.m1 = s.m1 * s.m2 - r;
    s.m2 = s.m2 * s.m0 + r;
    r = whateverIteration(s);
    return r;
}

int main(int argc, const char** argv)
{
    matrix a;
    a.m0 = (float)argv[0][0];
    a.m1 = (float)argc;
    a.m2 = (float)(argv[0][0] - argc);

    float4 u = a.m0;
    float4 v = a.m1;
    float4 e = whateverDecomposition(a, a.m0, a.m1);
    e = e + whateverDecomposition(a, a.m2, normalize(a.m1));
    e = e + whateverDecomposition(a, normalize(a.m2), e);
    e = e + whateverDecomposition(a, e, e);
    e = e + whateverDecomposition(a, normalize(e), normalize(e));
    e = e * whateverDecomposition(a, e, e);
    e = e - whateverDecomposition(a, e, e);
    e = e * whateverDecomposition(a, e, e);
    e = e + whateverDecomposition(a, e, e);
    float4 r = normalize(e);

    return (int)_mm_cvtss_f32(r.val);
}

I checked compile times under different compilers. Visual Studio compiled with /O2, and clang/gcc with -O2, to get an “optimized build”. Everything was on the same machine (Core i7-5820K); clang/gcc via the Windows 10 Linux subsystem thing. In all cases I compiled for the x64 architecture. Between the two timing cases, the only difference in source was removing forced inlining from the single whateverDecomposition function.

Compiler             Time    Time without force inline   Speedup
MSVC 2010 SP1        8.1s    0.4s                        20x
MSVC 2015 Update 3   19.8s   0.6s                        33x
MSVC 2017 15.3       9.8s    0.7s                        14x
g++ 5.4              1.5s    0.5s                        3x
clang 3.8            0.5s    0.3s                        1.7x

The conclusion seems to be that the Visual Studio compiler is much slower at deep inlining than gcc/clang. And VS2015 in particular had a compile time regression in there somewhere, one that is mostly gone in VS2017.

(I would have added all this to my bug report on Microsoft Connect… except I don’t seem to be able to add more comments to it. Sorry MS!)


Six Months in a Service Team

So it’s been six months since I switched to doing build engineering / developer tooling. Here are some random things I learned.

It’s a service team

We are working on a live codebase, with I don’t even know how many (~500?) developers on it, and it has to keep on building and not crumble under additional code / people / stuff. This means the team I’m on is, to a large extent, a “service / live ops” team, doing a bunch of support and short-term problem solving, in addition to steering the whole codebase build towards a hopefully better future.

That is not a surprise; I fully expected that.

What was a surprise: I thought it would be much worse to be on a “service” team.

A build system is one of these areas where “if it works well, no one notices it”, and yes, in many cases when people talk about it they talk about problems they are having. And so before starting all this I had mentally prepared myself to jump into a super thankless job, where everyone else is complaining about state of things and shouting at you for all the issues there, etc.

So far it’s less thankless, and there are way fewer people shouting at me, than I expected!

It is true that when people hop onto our Slack channel, they are coming with some problem they are having. They might provide excellent information and articulate what is happening in good detail, or they might be very vague and confused; they might be just stating the problem, or expressing frustration, disappointment, anger or whatever else. However, all of that is basically complaining about computers, or software, or the codebase; I don’t remember a single time when anyone made a “stupid build people” type of complaint. For some reason I expected some amount of issues to reach an ad-hominem level… maybe I’ve read too many bad stories on the internet, and none of them turned out to be applicable here. Great!

That said, an occasional “yo! just wanted to say that you are all doing a good job, keep it up!” is appreciated, when working in a service-like team. Go and tell that right now to people making sure your wireless works, your servers are building stuff, your office fruits are delivered on time, your patches are predictably shipped to the cloud, …

When someone has a problem, they want it solved

When someone has a problem and could not solve it by themselves in X amount of time, they hopefully turn for help. Be it coworkers, Slack, email, twitter or whatever. The composition of my team right now is such that the people working on the “build system” are fairly remote, so people mostly ask us via Slack messages.

Here’s the main thing: for their problem, what they are mostly interested in is “make it disappear”. The most important thing is to unblock them, so that they can continue doing whatever they were trying to do.

First figure out what is going on; do the minimal thing to make the problem disappear for them (“graft this changeset”, “change file X line Y to Z”, “pass this argument on command line”, “delete this folder from build artifacts and continue build”, “click that setting in your IDE”, …) - whatever quick fix/hack it takes. Add a TODO for yourself to fix it properly, and actually follow through on that.

If it’s an issue that happened more than once or twice, add to a list of “known issues” (e.g. via pinned messages in Slack channel), with good, searchable wording and clear instructions how to workaround/solve it. Keep the known issues list more or less up to date!

What people are not interested in, when they come to your support channel:

  1. A lecture on how Well Actually It’s Not An Issue You See, You Are Doing Things Wrong. Sure, they might be doing something wrong; they don’t need a patronizing lecture though, just short, to-the-point information on what should be done instead. Also take a note: maybe they do have a valid point? Maybe the commands / options / file locations / whatever are indeed confusing and error prone? Maybe it could or should be improved, either in code changes or at least in better error messages / documentation / training?

  2. Whose fault the issue is (though more on that separately below).

  3. Why the issue happens, and an explanation of some innards of build code. Some people might be interested in details on why the issue happens; but first solve the issue for them, and provide details as an optional bonus information, if needed.

  4. A long rant about how your team is understaffed and can’t deal with that many issues. Every team wants to do more things than it has resources for; and yes, maybe your team is in a more dire situation – that’s not what the inquiry is about, however. Solve their problem first. If you feel like it, maybe mention that “yeah, we’ll solve this eventually, but so far didn’t have time yet”. But better to talk with someone who can fix the staffing situation (management, etc.).

Own your f***ups

There’s one exception I have for the “no one is interested in whose fault the issue is” guideline – if it’s your fault, then saying “Sorry about this! My fault, fixing ASAP” is acceptable :)

Over half a year of working on build stuff, I think I have done a dozen or so embarrassing things that caused people frustration & wasted their time. Not ideal; in retrospect, for at least half of them “wot were you thinking aras” is a good summary. I did try my best to provide workarounds, apologize and fix things as soon as possible. And then hope that I also did more than a dozen things that improved the build experience.

And again, if it’s a problem caused by anyone else, there’s no need to point fingers. You can still say “sorry! we’ll be fixing ASAP” of course.

Build performance / stability is an uphill battle

Codebases tend to grow over time – more stuff being added, more developers doing it; and usually “old things” can not be retired and removed as fast. This means that even just to keep “build times” constant over time, someone has to actively spend time chipping away at various inefficiencies.

Over last month I’ve spent some time optimizing full/clean builds (one example), and for some build targets it’s effectively “twice as fast” now. However, those slow parts did not exist half a year ago! So over that time period, the full build time has not improved… it’s “just” no longer twice as bad.

Similar with build stability - the various bugs in the build process & dependency graph are not some old problems; they are caused by new systems landing, new ways that code is split up, new platforms, and so on. Just maintaining the same level of build sanity is actual work that needs to be done, and it’s never finished. On the plus side… hey, job security! :)

Job satisfaction?

Kinda related to “this was supposed to be a thankless job” (turns out, not so much), I’ve been thinking about what in it can be considered “job satisfaction”. For me, it’s perhaps the “I feel like I’m being useful” part. That’s a very broad statement; however, it also means that I can pick many areas to work in, compared to something like “the only way to make me happy is if I get to write GCN assembly all day long” :)

The work I’m doing is not some sort of cutting edge research or “interesting” problems. There are no best paper awards for finding out how to split up .vcxproj and .csproj files for the most optimal Visual Studio IntelliSense experience, for example. But at this point in my life, I’m perfectly fine with that!

Do I still get to delete code?

Happy to report – yes! I won’t reach my extrapolated goal of removing 200kloc this year though. So far this year: 82kloc added & 143kloc removed (so the codebase got 61kloc smaller). However, the first two months I was still doing graphics stuff; since I started doing build engineering the stats are 50kloc added, 87kloc removed (net effect: 37kloc smaller). Not bad!

Well, that’s it. I’ll get back to working on some C# project generation stuffs.


Unreasonable Effectiveness of Profilers

A couple months ago I added profiling output to our build system (a fork of JamPlus). It’s a simple Chrome Tracing view, and adding support for it to Jam was fairly easy. Jam being written in C, I found a simple C library for the format (minitrace), and with a couple of compilation fixes it was working, with profiler blocks around whatever tasks the build system was executing.
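Hooking it up amounts to very little code; roughly this (a sketch going from my memory of minitrace’s API, with made-up task names):

#include "minitrace.h"

int main(void)
{
    mtr_init("build_trace.json");       // trace file; open it in chrome://tracing

    MTR_BEGIN("build", "compile foo");  // start a duration block around a task
    // ... run the compiler for foo.cpp ...
    MTR_END("build", "compile foo");    // end it; shows up as a bar in the view

    mtr_flush();
    mtr_shutdown();
    return 0;
}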

That by itself is not remarkable. However… once you have profiling output, you can start to…

Notice Things

The other day I was doing something unrelated, and for some reason looked at the profiler output of a Unity editor build I’d just done. “Common sense” about C++ codebases tells us that usually it’s the linking time that’s The Worst; however, here it was not that.

There was a whole gap right before linking started, when everything else was done and the build was just waiting for one C++ file to compile. I was busy with something else right then, so I added a TODO card to our team task board, to be looked at later.

Sometime later I was doing a different build (of just the “engine”, without the editor bits), and looked at the profiler too.

Now that looked pretty bad. The total build time (this was a “full rebuild”) was ten minutes, and almost five of them were spent waiting for that one C++ file to finish compiling. The file in question had a compile time of seven minutes.

This sounded like a build inefficiency that is too large to ignore, so I dug in for a bit.

Aside: at the moment our small “build team” has not even started to look into “full build” performance, for various reasons. We are mostly repaying tech debt (with interest!), optimizing incremental build performance, and improving things like IDE projects. We will get to optimizing full build times, just haven’t yet.

7 minutes compile time for one file? Craaazyy!

Yes, it is crazy. The average compile time for C++ files in this project and this config (win64 Release) is about 2 seconds. 400+ seconds is definitely an outlier (though there’s a handful of other files that take 30+ seconds to compile). What is going on?!

I did some experiments and figured out that:

  • Our Lump/Batch/Unity builds are not the culprit; there’s one .cpp file in that particular lump that takes all the time to compile.
  • MSVC compiler exhibits this slow compilation behavior; clang compile times on that file are way better (10+ times faster).
  • Only Release builds are this slow (or more specifically, builds where inlining is on).
  • The fairly old VS2010 compiler that we use is not the culprit; compiling with VS2015 is actually a tiny bit slower :(
  • In general the files that have 30+ second compile time in Release build with MSVC compiler tend to be heavy users of our “SIMD math library”, which provides a very nice HLSL-like syntax for math code (including swizzles & stuff). However the implementation of that is… shall we say… quite template & macro heavy.
  • That 7-minute-compile file had a big SIMD-math-heavy function that was templated so it would expand into a number of instantiations, to get rid of runtime per-item branches. This is CPU performance sensitive code, so that approach does make sense.

Whether the design of our math library is a good trade-off or not, is a story for another day. Looks like it has some good things (convenient HLSL-like usage, good SIMD codegen), and some not so good things (crazy complex implementation, crazy compile times on MSVC at least). Worth discussing and doing something about it in the future!

Meanwhile… could some lo-tech changes be made, to speed up the builds?

Speeding it up

One easy change, and one that is entirely in the “build systems” area even, is to stop including these “slow to compile” files in lump/batch/unity builds. Lump builds primarily save time on invoking the compiler executable, and on preprocessing the typically similar/common included code. However, all of that is sub-second stuff; if a single source file takes 10+ seconds to compile then there’s little point in lumping it. By itself this does not gain much; however, for incremental builds people won’t have to wait as long when working on a fast-to-build file that would have ended up in the same lump as a slow file.

We could also somehow make the build system schedule slow files for compilation as soon as possible. The earlier they start, the sooner they will finish. Ideally the build system would have historical data of past compile times, and use that to guide build task scheduling. We don’t have that in our build system (yet?)… However, in our current build code, moving slow files out of lumps makes them be built earlier as a by-product. Good!

The above things didn’t actually speed up compilation of that 7-minute file, but were trivial to do and gained a minute or so out of full build time (which was 10 minutes originally).

And then I tried something that I didn’t have high hopes of working: in that slow file, factoring out parts of the big templated function into smaller, less-templated functions.

Fairly trivial stuff; with fancy refactoring IDEs this is done by “select some code lines, press the Extract Method/Function button”. Except this is C++, and the IDEs don’t quite get it right (at least in this code), so you do it manually. In spirit, the change was like the sketch below (completely made-up names; the real code is our SIMD math library).
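// Extracted, non-templated helper: the optimizer only has to chew through
// its body once, instead of once per template instantiation.
void NormalizeRows(float* data, int n);

template<int kChannels>
void ProcessImage(float* data, int n)
{
    // ...code that genuinely depends on kChannels stays templated...
    NormalizeRows(data, n); // now a function call, instead of an inlined copy
    // ...
}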

I split off five or so functions like that and… compile time of that file went from 420 seconds to 70 seconds. That is six times faster!

Of course, factoring out functions can mean they are no longer inlined, and the original code has to call them. So there’s time spent passing the arguments, doing the jump, etc. However, it could also make the parent function use registers better (or worse…), could result in a lower amount of total code (fewer I$ misses), etc. We have profiled the code change on several platforms, and the difference in performance seems to be negligible. So in this particular case, it’s a go!

Now of course, a minute to compile a C++ file is still crazy, but additional speedups probably involve changing the design of our math library or somesuch – needs more work than someone clueless randomly trying to speed it up :)

The build profile looks much better after the change. No more gaps of the “all the CPU cores wait for one job to finish” type of stuff! Well ok, linking is serial, but that is old news. Full build time went from 10 minutes down to 5min 10sec, almost 2x faster!

Takeaways

  • Add some sort of profiling data capture to everything you care about. Had I not added the profiler output, who knows when (or if?) someone would have noticed that we have one super-slow to compile C++ source file. In my case, adding this simple profiling capture to our build system took about a day… most of that spent learning about Jam’s codebase (my first change in it).
    • Seriously, just do it. It’s easy, fun and worth it!
  • Every template instantiation creates additional work for the compiler’s optimizer; it’s pretty much as if the instantiated functions were literally copy-pasted. If a templated function is slow to compile and you instantiate the template 20 times… well, now it’s very slow to compile.
    • Factoring out parts into less-templated functions might help things.
  • Complicated template-heavy libraries can be slow to compile (“well duh”, right). But not necessarily due to parsing (“it’s all header files”); the optimizer can spend ages in them too. In MSVC case in this code, from what I can tell it ends up spending all the time deciding what & how to inline.
  • Speeding up builds is good work, or at least I like to think so myself :) Less time spent by developers waiting for them; better throughput and lower latency in build farms, etc. etc.