Slow to Compile Table Initializers
Continuing on finding low hanging fruit in our codebase (Unity game engine, a lot of C++ code) build
times with Visual Studio, here’s what I found with help of /d2cgsummary
(see
previous blog post).
Noticed one file that was taking 92 seconds to compile on my machine, in Release config.
A quick look at it did not reveal any particularly crazy code structure;
was fairly simple & “obvious” code. However, /d2cgsummary
said:
Anomalistic Compile Times: 44
?IsFloatFormat@@YA_NW4GraphicsFormat@@@Z: 5.493 sec
?IsHalfFormat@@YA_NW4GraphicsFormat@@@Z: 5.483 sec
?GetGraphicsFormatString@@YA?AV?$basic_string@DV?$StringStorageDefault@D@core@@@core@@W4GraphicsFormat@@@Z: 4.680 sec
?ComputeMipmapSize@@YA_KHHW4GraphicsFormat@@@Z: 4.137 sec
?ComputeTextureSizeForTypicalGPU@@YA_KHHHW4GraphicsFormat@@HH_N@Z: 4.087 sec
?GetComponentCount@@YAIW4GraphicsFormat@@@Z: 2.722 sec
?IsUNormFormat@@YA_NW4GraphicsFormat@@@Z: 2.719 sec
?Is16BitPackedFormat@@YA_NW4GraphicsFormat@@@Z: 2.700 sec
?IsAlphaOnlyFormat@@YA_NW4GraphicsFormat@@@Z: 2.699 sec
<...>
A bunch of functions taking 2-5 seconds to compile each; and each one of those looking very simple:
bool IsFloatFormat(GraphicsFormat format)
{
return ((GetDesc(format).flags & kFormatPropertyIEEE754Bit) != 0) && (GetDesc(format).blockSize / GetComponentCount(format) == 4);
}
Why on earth a function like that would take 5 seconds to compile?!
Then I noticed that all of them call GetDesc(format)
, which looks like this:
const FormatDesc& GetDesc(GraphicsFormat format)
{
static const FormatDesc table[] = //kFormatCount
{
// bSize,bX,bY,bZ, swizzleR, swizzleG, swizzleB, swizzleA, fallbackFormat, alphaFormat, textureFormat, rtFormat, comps, name, flags
{0, 0, 0, 0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatNone, kFormatNone, kTexFormatNone, kRTFormatCount, 0, 0, "None", 0}, // None,
{1, 1, 1, 1, kFormatSwizzleR, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone, kRTFormatCount, 1, 0, "", kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // R8_SRGB
{2, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone, kRTFormatCount, 2, 0, "", kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RG8_SRGB
{3, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatRGB24, kRTFormatCount, 3, 0, "", kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGB8_SRGB
{4, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzleA, kFormatNone, kFormatRGBA8_SRGB, kTexFormatRGBA32, kRTFormatARGB32, 3, 1, "", kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGBA8_SRGB
// <...a lot of other entries for all formats we have...>
};
CompileTimeAssertArraySize(table, kGraphicsFormatCount);
return table[format];
}
Just a function that returns “description” struct for a “graphics format” with various info we might be interested in, using a pattern similar to Translations in C++ using tables with zero-based enums.
The table is huge though. Could it be what’s causing compile times to be super slow?
Let’s try moving the table to be a static global variable, outside of the function:
static const FormatDesc table[] = //kFormatCount
{
// <...the table...>
};
CompileTimeAssertArraySize(table, kGraphicsFormatCount);
const FormatDesc& GetDesc(GraphicsFormat format)
{
return table[format];
}
Boom. Compile time of that file went down to 10.3 seconds, or whole 80 seconds faster.
:what:
That’s crazy. Why does this happen?
This “initialize table inside a function” pattern differs from “just a global variable” in a few aspects:
- Compiler must emit code to do initialization when the function is executed for the first time.
For non-trivial data, the optimizer might not “see” that it’s all just bytes in the table.
And so it will actually emit equivalent of
if (!initYet) { InitTable(); initYet=true; }
into the function. - Since C++11, the compiler is required to make that thread-safe too! So it does equivalent of mutex lock or some atomic checks for the initialization.
Whereas for a table initialized as a global variable, none of that needs to happen; if it’s not “just bytes”
then compiler still generates the initializer code, but it’s called exactly once before “actual program” starts,
and does not add any branches to GetDesc
function.
Ok, so one takeaway is: for “static constant data” tables like that, it might be better to declare them as global variables, instead of static local variables inside a function.
But wait, there’s more! I tried to make a “simple” repro case for Microsoft C++ folks to show the unusually long compile times, and I could not. Something else was also going on.
What is “Simple Data”?
The same /d2cgsummary
output also had this:
RdrReadProc Caching Stats
Most Hits:
??U@YA?AW4FormatPropertyFlags@@W40@0@Z: 18328
Which probably says that enum FormatPropertyFlags __cdecl operator|(enum FormatPropertyFlags,enum FormatPropertyFlags)
function was called/used 18 thousand times in this file. One member of our FormatDesc
struct was a “bitmask enum”
that had operators defined on it for type safety, similar to DEFINE_ENUM_FLAG_OPERATORS
(see here or there).
At this point, let’s move onto actual code where things can be investigated outside of the whole Unity codebase. Full source code is here in the gist. Key parts:
#ifndef USE_ENUM_FLAGS
#define USE_ENUM_FLAGS 1
#endif
#ifndef USE_INLINE_TABLE
#define USE_INLINE_TABLE 1
#endif
enum FormatPropertyFlags { /* ...a bunch of stuff */ };
#if USE_ENUM_FLAGS
#define ENUM_FLAGS(T) inline T operator |(const T left, const T right) { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }
ENUM_FLAGS(FormatPropertyFlags);
#endif
struct FormatDesc { /* ...the struct */ };
#if USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
static const FormatDesc table[] = { /* ... the huge table itself */ };
#if !USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
return table[format];
}
UInt32 GetColorComponentCount(GraphicsFormat format)
{
return GetDesc(format).colorComponents;
}
/* ... a bunch more functions similar to this one */
We have preprocessor defines to switch between the big table being defined as a static local variable inside GetDesc
function or a static global variable, and a define to switch whether “type safe enums” machinery is used or not.
I’m compiling for x64 with cl.exe /O2 /Zi /GS- /GR- /EHsc- /MT main.cpp /link /OPT:REF /OPT:ICF
which is fairly typical
“Release build” flag soup.
Compiler | Time | Time with /cgthreads1 | Exe size, KB | Large symbols, KB |
---|---|---|---|---|
MSVC 2010 SP1 | 1.8s | - | 58 | 12 GetDesc |
MSVC 2015 Update 3 | 15.1s | 42.6s | 189 | 45 IsFloatFormat, 45 IsHalfFormat |
MSVC 2017 15.3 | 18.2s | 37.5s | 96 |
Ok, so Visual C++ 2015/2017 is particularly slow at compiling this code pattern (big table, type-safe-enum operators used in it) with optimizations turned on. And a big increase in compile time compared to VS2010, hence I filed a bug for MS.
But what’s even more strange, is that code size is quite big too, particularly in VS2015 case. Each of IsFloatFormat
and
IsHalfFormat
functions, which both are simple one-liners that just call GetDesc
, compile into separate 45 kilobyte
chunks of code (I found that via Sizer).
VS2015 compiles IsFloatFormat
into this:
mov qword ptr [rsp+8],rbx
mov qword ptr [rsp+10h],rbp
mov qword ptr [rsp+18h],rsi
push rdi
; some prep stuff that checks for above mentioned "is table initialized yet",
; and then if it's not; does the table initialization:
movdqa xmm0,xmmword ptr [__xmm@00000000000000040000000400000005 (07FF7302AA430h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E68h (07FF7302AC080h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@00000004000000050000000200000001 (07FF7302AA690h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E98h (07FF7302AC0B0h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@00000004000000040000000000000003 (07FF7302AA680h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0ED8h (07FF7302AC0F0h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@00000008000000050000000400000004 (07FF7302AA710h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F08h (07FF7302AC120h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@0000003e000000080000000800000005 (07FF7302AAAF0h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F48h (07FF7302AC160h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@00000008000000050000000200000001 (07FF7302AA700h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F78h (07FF7302AC190h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@00000004000000080000000000000003 (07FF7302AA6A0h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FB8h (07FF7302AC1D0h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@0000000c000000050000000400000004 (07FF7302AA7A0h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FE8h (07FF7302AC200h)],xmm0
movdqa xmm0,xmmword ptr [__xmm@000000000000000c0000000c00000005 (07FF7302AA480h)]
movdqa xmmword ptr [__NULL_IMPORT_DESCRIPTOR+1028h (07FF7302AC240h)],xmm0
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0E50h (07FF7302AC068h)],1Ch
mov qword ptr [__NULL_IMPORT_DESCRIPTOR+0E58h (07FF7302AC070h)],1010102h
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0E60h (07FF7302AC078h)],1
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0E64h (07FF7302AC07Ch)],4
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0E78h (07FF7302AC090h)],1Ch
mov word ptr [__NULL_IMPORT_DESCRIPTOR+0E7Ch (07FF7302AC094h)],2
mov qword ptr [__NULL_IMPORT_DESCRIPTOR+0E80h (07FF7302AC098h)],rbp
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0E88h (07FF7302AC0A0h)],1Ch
mov qword ptr [__NULL_IMPORT_DESCRIPTOR+0E90h (07FF7302AC0A8h)],1010103h
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0EA8h (07FF7302AC0C0h)],4
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0EACh (07FF7302AC0C4h)],3
mov dword ptr [__NULL_IMPORT_DESCRIPTOR+0EB0h (07FF7302AC0C8h)],1Ch
mov word ptr [__NULL_IMPORT_DESCRIPTOR+0EB4h (07FF7302AC0CCh)],3
; repeat very similar thing for 45 more kilobytes...
Which is basically GetDesc
, including the big table initializer, fully inlined into it. The initialization
is not done via some “simple data segment copy”, but looks like carefully constructed entry by entry, field by field.
And then a very similar thing is repeated for IsHalfFormat
function.
VS2017 does not do any of that; the optimizer “realizes” that table is purely constant data, puts it into a data
segment (yay!), and the IsFloatFormat
function becomes fairly simple:
movsxd rax,ecx
lea r8,[__NULL_IMPORT_DESCRIPTOR+5C4h (07FF725393000h)]
imul rdx,rax,38h
test byte ptr [rdx+r8+30h],80h
je IsFloatFormat+35h (07FF725391035h)
movzx eax,byte ptr [rdx+r8+25h]
movzx ecx,byte ptr [rdx+r8+24h]
add ecx,eax
movzx eax,byte ptr [rdx+r8]
xor edx,edx
div eax,ecx
cmp eax,4
jne IsFloatFormat+35h (07FF725391035h)
mov al,1
ret
xor al,al
ret
What if the table is moved to be a global variable, like my suggestion above? Passing /DUSE_INLINE_TABLE=0
to the compiler
we get:
Compiler | Time | Exe size, KB | Large symbols, KB |
---|---|---|---|
MSVC 2010 SP1 | 1.4s | 50 | 2k dynamicinitializerfor'table' |
MSVC 2015 Update 3 | 2.1s | 96 | |
MSVC 2017 15.3 | 2.7s | 96 |
VS2017 generates completely identical code as before, just does it 6 times faster. VS2015 also compiles it into a data segment table like VS2017; does it 7 times faster, and the executable is 90 kilobytes smaller.
VS2010 still emits the global table initializer function, and is a bit faster to compile. But it wasn’t as slow to compile to begin with.
What if we left table as a local variable, but just remove the single usage of type-safety enum flags? Passing /DUSE_ENUM_FLAGS=0
to the compiler we get:
Compiler | Time | Exe size, KB |
---|---|---|
MSVC 2010 SP1 | 0.2s | 46 |
MSVC 2015 Update 3 | 0.4s | 96 |
MSVC 2017 15.3 | 0.4s | 96 |
Whoa. All three compilers now “realize” that the table is pure simple data, put it into a data segment, and take a lot less time to compile the whole thing.
And all that just because this function got removed from the source:
inline FormatPropertyFlags operator|(const FormatPropertyFlags left, const FormatPropertyFlags right)
{
return static_cast<FormatPropertyFlags>(static_cast<unsigned>(left) | static_cast<unsigned>(right));
}
In this particular piece of code, that “type safe enum” machinery does not actually do anything useful, but in other general code more type safety on enums & bit flags is a very useful thing to have! Quite a bit sad that “seemingly trivial” type safety abstractions incur so much compile time overhead in year 2017, but oh well… reality is sad, especially in year 2017 :(
Would using C++11 “enum classes” feature allow us having type safe enums, have bitmask operations on them (similar to this approach), and have good compile times? I don’t know that yet. An excercise for the reader right now! But also, see
constexpr
section below.
For reference, I also checked gcc (g++ 5.4 on Windows 10 Linux subsystem, with -O2
flag) compile times and executable size
for each case above. In all cases compiled everything in 0.3 seconds on my machine, and executable size was the same.
What about constexpr?
As pointed out over at reddit,
what about declaring the table as C++11 constexpr
? The type safe enum utility has to be declared like that too then,
so let’s try it. ENUM_FLAGS
macro needs to get a constexpr
in front of the operator it declares, and the table
needs to change from static const FormatDesc table[]
to static constexpr FormatDesc table[]
.
Good news: compile times are back at being fast, and all is good! Thanks reddit! VS2010 does not grok the new construct though, so removed it from the table.
Compiler | Time | Exe size, KB |
---|---|---|
MSVC 2015 Update 3 | 0.4s | 96 |
MSVC 2017 15.3 | 0.4s | 96 |
Our codebase at the moment is mostly still C++98 (some platforms and their compilers…), but maybe we should sprinkle some optional C++11/14/… dust in a few places where it helps the compiler.
Summary
- Big table initializers as static local variables might be better as static global variables.
- Type-safe enum bitmask machinery is not “free” in terms of compilation time, when using optimized builds in MSVC.
- “Complex” table initializers (what is “complex” depends on table size & data in it) might get a lot of code emitted to set them up in MSVC (2015 at least); what looks “just a simple data” to you might not look so for the compiler.
- I keep on running into code structure patterns that are much slower to compile with MSVC compared to clang/gcc. Pretty sure the opposite cases exist too, just right now our MSVC builds are slower than clang builds, and hence that’s where I’m looking at.
- Even if you’re not using C++11/14 features otherwise, in some places sprinkling things like
constexpr
might help a C++11 capable compiler to do the right thing. /d2cgsummary
flag is awesome to figure out where MSVC optimizer is spending time. I probably would not have guessed any of the above, without it pointing me towards the large table!