Minimizing <windows.h>

I’ve been looking at our (Unity game engine) C++ codebase’s header file hygiene a bit, and this is a short rant-post about <windows.h>.

On all Microsoft platforms, our precompiled header has included <windows.h> since about forever, so effectively the whole codebase was getting all the symbols and macros from it. Of course, the part of the codebase that actually needs to use the Windows API is very small, so I thought “hey, let’s see what happens if I stop including the Windows headers everywhere”.

But why would you want that?

Good question! I don’t know if I do, yet. Conceptually, since only a very small part of the codebase needs stuff from the Windows headers, it “feels wrong” to include them everywhere. It might still make sense to keep them in the precompiled header, if that helps with build performance.

“Precompiled headers, grandpa?! What about C++ modules?”

It’s true that C++ modules may one day make precompiled headers obsolete. Good to see that the major compilers (MSVC, clang, gcc) have some level of implementation of them now.

Not all our platforms are on these compilers, and some others are not on recent enough versions yet. So… I dunno, some years in the future? But maybe we’ll just be writing Rust by then, who knows :)

Anyway. Just how large is <windows.h>, actually? In VS2017 (version 15.5, Windows SDK 10.0.16299), after preprocessing and ignoring empty lines, it ends up being 69100 lines of actual code. Defining WIN32_LEAN_AND_MEAN before inclusion, to cut out some rarely used parts, brings that down to 33375 lines. That’s not massive, but not trivially small either.

For comparison, here are line counts (after preprocessing, removing empty lines) of various STL headers in the same compiler:

  • <vector> 20354
  • <string> 26355
  • <unordered_map> 24291
  • <map> 21144
  • <algorithm> 18578
  • vector+string+unordered_map+algorithm 35957
  • <cstdlib> 1800; just <stdlib.h> itself: 1184. C was small and neat!

Actual problem with windows.h: macros

Back to the task at hand! So, remove <windows.h> from the precompiled header, hit build and off you go, perhaps sprinkling a now-needed Windows include here and there, right? Of course, wrong :)

Others have said it in harsher words, but the gist is: Windows headers define a lot of symbols as preprocessor macros. While min and max are optional (define NOMINMAX before including), some others like near and far are not.

And then of course there’s a bunch of function names that Windows maps to either ANSI or Unicode variants with the preprocessor: SendMessage, GetObject, RegisterClass, DrawText, CreateFile and so on. Good thing these are obscure names that another codebase would never want to use… oh wait. These are exactly the function names that caused problems in my attempt.

If you have a central place where you include the windows headers, you could #undef them after inclusion, and make your own code explicitly do SendMessageW instead of using a macro.

But then there are some 3rd party libraries that include windows.h from their own header files. I’m looking at you, Enlighten, Oculus PC SDK, CEF and others. Ugh!

Modular/smaller windows.h

There are other ways to approach this problem, of course.

One is to not have any Windows-specific things in many/any of your header files, wrap everything into a nice little set of Windows-specific source files, only include windows headers there, and go on doing fun things. That would be ideal, but is not a five minute job in a large existing codebase.

If you don’t need many things from Windows headers, you could declare the minimal set that you need yourself, manually. Contrary to what some Stack Overflow threads say (“why WAIT_OBJECT_0 is defined to be STATUS_WAIT_0 + 0? because the values might not be the same in the future!” – err nope, they can’t realistically ever change their values), it’s not a terribly dangerous thing to do. Windows API is one of the most stable APIs in the history of computing, after all.

Here’s an example of a minimal set of manually defined Windows things that Direct3D 9 needs, from @zeuxcg: Minimal set of headers for D3D9.

Another example is hand-crafted, quite extensive windows headers replacement from @ArvidGerstmann: Modular Windows.h Header File.

However, even the Windows headers themselves can be used in a much more modular way directly! What I’ve found so far is this:

  • Preprocessor macro _AMD64_ (for x64), _X86_ (for x86) or _ARM_ (for ARM) has to be defined by you, and then,
  • You can include one of the many “smaller” headers:
    • <windef.h> to get most of the base types (DWORD, HWND, …) as well as atomic functions and stuff. 7811 lines of actual code.
    • <fileapi.h> for file functions (CreateFile, FindFirstFile, …), 7773 lines.
    • <synchapi.h> for synchronization primitives (WaitForMultipleObjects, InitializeCriticalSection, …), 8450 lines.
    • <debugapi.h> for IsDebuggerPresent, OutputDebugString etc. No need to include whole thing for it :)
  • The Windows SDK headers, at least in the Win10 versions, seem to be structured in a fairly stable way, with the same partitioning into these smaller header files across the Win32, XboxOne and UWP platforms.

This is all.


2017 Holidays Giveaway

I told the internet I’d buy books and stuff for people, and you won’t believe what happened next!

45 people got books and stuff.

FIN.

Backstory

I wrote this on both mastodon and twitter:

Let’s try some “we’ll probably make it through 2017” gifts. Do you need a book, a course, an app, a workshop, github, hosting, or something else that would help you? Something in $50-$100 range; ping me and I’ll try to do it!

Why? Dunno, sounded like that would be a nice thing to do. And then I got two requests, which I fulfilled, and then… nothing. However, I repeated the same thing the next day, during a more US-friendly hour, and then the requests started flowing in.

Expectations

This being the internet, I expected totally random and weird things. A million requests, a bunch of trolls, a bunch of nonsense, etc.

I’m happy to report that the internet managed to pleasantly surprise me with how well it behaved on this particular occasion!

Requests

In the end I gave gifts to 45 strangers on the internet before calling it a day, and here’s what they turned out to be.

Unsurprisingly enough, the majority were something about computers.

  • 13 books: 11 on programming, 2 on art/design.
  • 5 courses or tutorials: 3 on programming, 2 on art.
  • 9 items of physical hardware: a couple of Raspberry Pi 3s, leapmotion, 360 camera, daydream VR viewer etc.
  • 4 software licenses: Git Tower (twice), Gamemaker Studio 2, ArtStation Pro.

Of course, the most interesting requests are those that are not computer related :)

I was really surprised by the number of people saying “I don’t need anything myself, but if you could donate to charity/nonprofit X that would be great”. 13% of all requests being of that kind is pretty sweet! For some of the requests I decided to go for a larger than $50-$100 contribution. Here’s what people asked to donate to:

And of course, “everything else” is where it was fairly random. I bought tree saplings, socks, a room heater, covered an electricity bill, vet expenses and so on. Very interesting!

Tips and Findings

  • Buying stuff for strangers on the internet is easiest via Amazon wishlists. I pretty much got this ready-to-paste for inquiries: “Can you create a wishlist on Amazon, add the item to it, setup a shipping address and give me the link?” One big advantage – it does not disclose actual address, phone or other personal details to me; and overall it’s way easier than filling in whole shipping forms.
  • Keep in mind though that for each and every new shipping address, Amazon will ask you to re-enter your credit card details. This makes sense for security purposes, but in this particular scenario it means re-entering them each.and.every.time. Your own Amazon UI for shipping address choices will become a mess in the end as well, and will need some cleanup.
  • Shipping costs are very real! Both you & I probably have an “oh yeah, a book is like $50” mental image, but depending on where it needs to be shipped, the same book can end up costing anywhere from $50 to $100.
  • Time investment wise, I probably spent 15-20 hours on all this, during evenings over a few days. With some people it took quite a few roundtrips of back & forth messaging until I got a concrete actionable thing (a URL of a thing: actionable; “I want a book I saw last week”: not so much :)). Maybe next time it would be easier, given that now I know I should just fast-track people to “yo, gimmeh your amazon wishlist”.

It was quite a lot of fun and felt good! If you have money to spare, I can definitely recommend.


Gamedev Mastodon: mastodon.gamedev.place

TL;DR: I have set up a game development themed Mastodon instance at mastodon.gamedev.place. In three weeks it got to almost 2000 users and seems to be holding up nicely. It’s similar to Twitter, but might be nicer in some aspects!

Why?

Social media in 2017 can be pretty dreadful. Here are some sentiments:

I wake up every morning to Twitter: Trump is an idiot, everyone is a Nazi, AI will kill us, big games suck, buy my indie game for $0.99, buy this ICO, EU commission thinks about you, look at my shadertoy, look at the craft beer I just had. Why do I keep doing this to myself? (via)

or

Here are two reasons to yell about an important thing:

  1. Because nobody cares about it yet,
  2. Everyone cares now, but you need all your friends to know you care the loudest. Everyone who stops at 1 quit social media years ago. (This was unfair. <…> 3., anguished screaming into the void). (via)

“Please Listen To Me” webcomic has put it up nicely here.

One part of my issues with social media lately is the amount of general “everything is on fire, we are doomed” things I see on it every day. Yes, the world might be on fire, and yes, we might be doomed, but repeating that to me over and over again is getting tiring by now. Yes, I’ve got the memo, and I’ve already decided in what ways I plan to improve the situation (as much as I can), kthxbye.

The other part of my annoyances is twitter specific (ads, bots, spam, trolls, a randomly non-chrono timeline), and the apparent inability of twitter-the-company to solve them in a meaningful way. Even a tiny thing like reporting a sex-spam-bot takes a whole lot of mouse movement and a dozen clicks. Why?! And the issues I have here are all extremely minor, compared to someone who might experience harassment or worse.

Enter Mastodon

So Mastodon is this open source thing where someone can run “their own small twitter-like instance” on their own servers. With some clever technology, it “federates” with other instances; for example you can follow people from other servers, and it kinda works out. Each instance can setup its own rules for moderation, acceptable behavior etc., and it’s up to them to deal with whatever bad things might happen.

Maybe this will take off, or maybe it will be one of the neat ideas that never grew into anything meaningful. I’ve no way to tell.

mastodon.gamedev.place

…but I went and created my own instance anyway, at mastodon.gamedev.place

  • It’s like a mini-twitter about everything game development.
  • There are short Usage Guidelines.
  • And a Code of Conduct too.
  • Since I know nothing about servers or the tech stack Mastodon runs on, I used masto.host for the “actual hosting” part; seems to be working out fine.

You are welcome to join!

How is it working out?

The instance has been running for three weeks now, and it has reached almost 2000 users. That’s quite nice!

Now of course, a “smaller twitter” has downsides too; by definition it is smaller and has fewer people to interact with, and being game-development specific, a narrower set of topics being talked about. It’s a good idea for peace of mind and cultivating your existing interests; it’s probably not a good place for getting new interests, expanding your horizons or experiencing wildly different points of view.

Right now, it seems to be fairly heavy on the programming/engineering side, and out of that, somehow it got a lot of Godot engine users/lovers. I’ve no idea how that happened; I don’t know any of them :) There are some artists, designers, audio people as well, but I’d love it if there were more (see the point above about a limited set of topics). Would be nice to get more diversity in other aspects too (gender, race, age, experience, …).

So far (knocks on wood…) there were no bad incidents; the moderation reports were only a dozen or so in total, and most of them due to people looking into “Federated” (== “friends of friends”) timeline in the UI and thinking these people are from this instance. To the community there: you are doing great, keep it up :)

With the current number of users it runs comfortably on the €60/month server tier, and that should hold up quite well into the future. The cost is totally fine with me; I can keep running it “basically forever” in Internet-timeline terms.

I’ve heard from a handful of people that they have effectively quit twitter and are now only on Mastodon. Yay, I guess!

Anyhoo.

Join mastodon.gamedev.place!


Slow to Compile Table Initializers

Continuing on finding low hanging fruit in our codebase (Unity game engine, a lot of C++ code) build times with Visual Studio, here’s what I found with the help of /d2cgsummary (see the previous blog post).

I noticed one file that was taking 92 seconds to compile on my machine, in Release config. A quick look at it did not reveal any particularly crazy code structure; it was fairly simple & “obvious” code. However, /d2cgsummary said:

Anomalistic Compile Times: 44
  ?IsFloatFormat@@YA_NW4GraphicsFormat@@@Z: 5.493 sec
  ?IsHalfFormat@@YA_NW4GraphicsFormat@@@Z: 5.483 sec
  ?GetGraphicsFormatString@@YA?AV?$basic_string@DV?$StringStorageDefault@D@core@@@core@@W4GraphicsFormat@@@Z: 4.680 sec
  ?ComputeMipmapSize@@YA_KHHW4GraphicsFormat@@@Z: 4.137 sec
  ?ComputeTextureSizeForTypicalGPU@@YA_KHHHW4GraphicsFormat@@HH_N@Z: 4.087 sec
  ?GetComponentCount@@YAIW4GraphicsFormat@@@Z: 2.722 sec
  ?IsUNormFormat@@YA_NW4GraphicsFormat@@@Z: 2.719 sec
  ?Is16BitPackedFormat@@YA_NW4GraphicsFormat@@@Z: 2.700 sec
  ?IsAlphaOnlyFormat@@YA_NW4GraphicsFormat@@@Z: 2.699 sec
  <...>

A bunch of functions taking 2-5 seconds to compile each; and each one of those looking very simple:

bool IsFloatFormat(GraphicsFormat format)
{
    return ((GetDesc(format).flags & kFormatPropertyIEEE754Bit) != 0) && (GetDesc(format).blockSize / GetComponentCount(format) == 4);
}

Why on earth would a function like that take 5 seconds to compile?!

Then I noticed that all of them call GetDesc(format), which looks like this:

const FormatDesc& GetDesc(GraphicsFormat format)
{
    static const FormatDesc table[] = //kFormatCount
    {
        // bSize,bX,bY,bZ, swizzleR,  swizzleG,        swizzleB,        swizzleA,        fallbackFormat,    alphaFormat,       textureFormat,    rtFormat,        comps, name,   flags
        {0, 0, 0, 0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatNone,       kFormatNone,       kTexFormatNone,   kRTFormatCount,  0, 0,  "None", 0},                                                                            // None,
        {1, 1, 1, 1, kFormatSwizzleR, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  1, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // R8_SRGB
        {2, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  2, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RG8_SRGB
        {3, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatRGB24,  kRTFormatCount,  3, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGB8_SRGB
        {4, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzleA, kFormatNone,       kFormatRGBA8_SRGB, kTexFormatRGBA32, kRTFormatARGB32, 3, 1,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGBA8_SRGB
        // <...a lot of other entries for all formats we have...>
    };
    CompileTimeAssertArraySize(table, kGraphicsFormatCount);
    return table[format];
}

Just a function that returns a “description” struct for a “graphics format”, with various info we might be interested in, using a pattern similar to Translations in C++ using tables with zero-based enums.

The table is huge though. Could it be what’s causing compile times to be super slow?

Let’s try moving the table to be a static global variable, outside of the function:

static const FormatDesc table[] = //kFormatCount
{
    // <...the table...>
};
CompileTimeAssertArraySize(table, kGraphicsFormatCount);
const FormatDesc& GetDesc(GraphicsFormat format)
{
    return table[format];
}

Boom. Compile time of that file went down to 10.3 seconds, a whole 80 seconds faster.

:what:

That’s crazy. Why does this happen?

This “initialize table inside a function” pattern differs from “just a global variable” in a few aspects:

  • The compiler must emit code to do the initialization when the function is executed for the first time. For non-trivial data, the optimizer might not “see” that it’s all just bytes in the table. And so it will actually emit the equivalent of if (!initYet) { InitTable(); initYet=true; } into the function.
  • Since C++11, the compiler is required to make that thread-safe too! So it does equivalent of mutex lock or some atomic checks for the initialization.

Whereas for a table initialized as a global variable, none of that needs to happen; if it’s not “just bytes” then the compiler still generates the initializer code, but it’s called exactly once before the “actual program” starts, and does not add any branches to the GetDesc function.

Ok, so one takeaway is: for “static constant data” tables like that, it might be better to declare them as global variables, instead of static local variables inside a function.

But wait, there’s more! I tried to make a “simple” repro case for Microsoft C++ folks to show the unusually long compile times, and I could not. Something else was also going on.

What is “Simple Data”?

The same /d2cgsummary output also had this:

RdrReadProc Caching Stats
  Most Hits:
    ??U@YA?AW4FormatPropertyFlags@@W40@0@Z: 18328

Which probably says that enum FormatPropertyFlags __cdecl operator|(enum FormatPropertyFlags,enum FormatPropertyFlags) function was called/used 18 thousand times in this file. One member of our FormatDesc struct was a “bitmask enum” that had operators defined on it for type safety, similar to DEFINE_ENUM_FLAG_OPERATORS (see here or there).

At this point, let’s move onto actual code where things can be investigated outside of the whole Unity codebase. Full source code is here in the gist. Key parts:

#ifndef USE_ENUM_FLAGS
#define USE_ENUM_FLAGS 1
#endif
#ifndef USE_INLINE_TABLE
#define USE_INLINE_TABLE 1
#endif

enum FormatPropertyFlags { /* ...a bunch of stuff */ };
#if USE_ENUM_FLAGS
#define ENUM_FLAGS(T) inline T operator |(const T left, const T right) { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }
ENUM_FLAGS(FormatPropertyFlags);
#endif

struct FormatDesc { /* ...the struct */ };

#if USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    static const FormatDesc table[] = { /* ... the huge table itself */ };
#if !USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    return table[format];
}

UInt32 GetColorComponentCount(GraphicsFormat format)
{
    return GetDesc(format).colorComponents;
}
/* ... a bunch more functions similar to this one */

We have preprocessor defines to switch between the big table being a static local variable inside the GetDesc function or a static global variable, and a define to switch whether the “type safe enums” machinery is used or not. I’m compiling for x64 with cl.exe /O2 /Zi /GS- /GR- /EHsc- /MT main.cpp /link /OPT:REF /OPT:ICF, which is a fairly typical “Release build” flag soup.

Compiler            Time    Time with /cgthreads1   Exe size, KB   Large symbols, KB
MSVC 2010 SP1       1.8s    -                       58             12 GetDesc
MSVC 2015 Update 3  15.1s   42.6s                   189            45 IsFloatFormat, 45 IsHalfFormat
MSVC 2017 15.3      18.2s   37.5s                   96             -

Ok, so Visual C++ 2015/2017 is particularly slow at compiling this code pattern (big table, type-safe-enum operators used in it) with optimizations turned on. That’s also a big increase in compile time compared to VS2010, so I filed a bug with Microsoft.

But what’s even stranger is that the code size is quite big too, particularly in the VS2015 case. Each of IsFloatFormat and IsHalfFormat, both simple one-liners that just call GetDesc, compiles into a separate 45 kilobyte chunk of code (I found that via Sizer).

VS2015 compiles IsFloatFormat into this:

mov         qword ptr [rsp+8],rbx
mov         qword ptr [rsp+10h],rbp
mov         qword ptr [rsp+18h],rsi
push        rdi
; some prep stuff that checks for above mentioned "is table initialized yet",
; and then if it's not; does the table initialization:
movdqa      xmm0,xmmword ptr [__xmm@00000000000000040000000400000005 (07FF7302AA430h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E68h (07FF7302AC080h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000050000000200000001 (07FF7302AA690h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E98h (07FF7302AC0B0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000040000000000000003 (07FF7302AA680h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0ED8h (07FF7302AC0F0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000400000004 (07FF7302AA710h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F08h (07FF7302AC120h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000003e000000080000000800000005 (07FF7302AAAF0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F48h (07FF7302AC160h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000200000001 (07FF7302AA700h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F78h (07FF7302AC190h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000080000000000000003 (07FF7302AA6A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FB8h (07FF7302AC1D0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000000c000000050000000400000004 (07FF7302AA7A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FE8h (07FF7302AC200h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@000000000000000c0000000c00000005 (07FF7302AA480h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+1028h (07FF7302AC240h)],xmm0
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E50h (07FF7302AC068h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E58h (07FF7302AC070h)],1010102h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E60h (07FF7302AC078h)],1
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E64h (07FF7302AC07Ch)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E78h (07FF7302AC090h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0E7Ch (07FF7302AC094h)],2
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E80h (07FF7302AC098h)],rbp
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E88h (07FF7302AC0A0h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E90h (07FF7302AC0A8h)],1010103h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EA8h (07FF7302AC0C0h)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EACh (07FF7302AC0C4h)],3
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EB0h (07FF7302AC0C8h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0EB4h (07FF7302AC0CCh)],3
; repeat very similar thing for 45 more kilobytes...

Which is basically GetDesc, including the big table initializer, fully inlined into it. The initialization is not done via some “simple data segment copy”; it looks like it’s carefully constructed entry by entry, field by field. And then a very similar thing is repeated for the IsHalfFormat function.

VS2017 does not do any of that; the optimizer “realizes” that the table is purely constant data, puts it into a data segment (yay!), and the IsFloatFormat function becomes fairly simple:

movsxd      rax,ecx
lea         r8,[__NULL_IMPORT_DESCRIPTOR+5C4h (07FF725393000h)]
imul        rdx,rax,38h
test        byte ptr [rdx+r8+30h],80h
je          IsFloatFormat+35h (07FF725391035h)
movzx       eax,byte ptr [rdx+r8+25h]
movzx       ecx,byte ptr [rdx+r8+24h]
add         ecx,eax
movzx       eax,byte ptr [rdx+r8]
xor         edx,edx
div         eax,ecx
cmp         eax,4
jne         IsFloatFormat+35h (07FF725391035h)
mov         al,1
ret
xor         al,al
ret

What if the table is moved to be a global variable, as I suggested above? Passing /DUSE_INLINE_TABLE=0 to the compiler, we get:

Compiler            Time   Exe size, KB   Large symbols, KB
MSVC 2010 SP1       1.4s   50             2 dynamic initializer for 'table'
MSVC 2015 Update 3  2.1s   96             -
MSVC 2017 15.3      2.7s   96             -

VS2017 generates code completely identical to before, just six times faster. VS2015 now also compiles the table into a data segment like VS2017 does; it’s seven times faster, and the executable is 90 kilobytes smaller.

VS2010 still emits the global table initializer function, and is a bit faster to compile. But it wasn’t as slow to compile to begin with.

What if we leave the table as a local variable, but just remove the single usage of the type-safe enum flags? Passing /DUSE_ENUM_FLAGS=0 to the compiler, we get:

Compiler            Time   Exe size, KB
MSVC 2010 SP1       0.2s   46
MSVC 2015 Update 3  0.4s   96
MSVC 2017 15.3      0.4s   96

Whoa. All three compilers now “realize” that the table is pure simple data, put it into a data segment, and take a lot less time to compile the whole thing.

And all that just because this function got removed from the source:

inline FormatPropertyFlags operator|(const FormatPropertyFlags left, const FormatPropertyFlags right)
{
	return static_cast<FormatPropertyFlags>(static_cast<unsigned>(left) | static_cast<unsigned>(right));
}

In this particular piece of code, that “type safe enum” machinery does not actually do anything useful, but in other, general code more type safety on enums & bit flags is a very useful thing to have! It’s quite sad that “seemingly trivial” type safety abstractions incur so much compile time overhead in year 2017, but oh well… reality is sad, especially in year 2017 :(

Would using the C++11 “enum class” feature allow us to have type safe enums, have bitmask operations on them (similar to this approach), and get good compile times? I don’t know that yet. An exercise for the reader right now! But also, see the constexpr section below.

For reference, I also checked gcc (g++ 5.4 on the Windows 10 Linux subsystem, with the -O2 flag) compile times and executable size for each case above. In all cases it compiled everything in 0.3 seconds on my machine, and the executable size was the same.

What about constexpr?

As pointed out over at reddit, what about declaring the table as C++11 constexpr? The type safe enum utility has to be declared like that too then, so let’s try it. ENUM_FLAGS macro needs to get a constexpr in front of the operator it declares, and the table needs to change from static const FormatDesc table[] to static constexpr FormatDesc table[].

Good news: compile times are back to being fast, and all is good! Thanks reddit! VS2010 does not grok the new construct though, so I removed it from the table.

Compiler            Time   Exe size, KB
MSVC 2015 Update 3  0.4s   96
MSVC 2017 15.3      0.4s   96

Our codebase at the moment is mostly still C++98 (some platforms and their compilers…), but maybe we should sprinkle some optional C++11/14/… dust in a few places where it helps the compiler.

Summary

  • Big table initializers as static local variables might be better as static global variables.
  • Type-safe enum bitmask machinery is not “free” in terms of compilation time, when using optimized builds in MSVC.
  • “Complex” table initializers (what counts as “complex” depends on the table size & the data in it) might get a lot of code emitted to set them up in MSVC (2015 at least); what looks like “just simple data” to you might not look that way to the compiler.
  • I keep running into code structure patterns that are much slower to compile with MSVC compared to clang/gcc. Pretty sure the opposite cases exist too; it’s just that right now our MSVC builds are slower than our clang builds, and hence that’s where I’m looking.
  • Even if you’re not using C++11/14 features otherwise, in some places sprinkling things like constexpr might help a C++11 capable compiler to do the right thing.
  • /d2cgsummary flag is awesome to figure out where MSVC optimizer is spending time. I probably would not have guessed any of the above, without it pointing me towards the large table!

Best unknown MSVC flag: d2cgsummary

I’ve been looking at C++ compilation times lately (e.g. here or there), and through correspondence with Microsoft C++ compiler folks learned about a fairly awesome, but undocumented, cl.exe flag.

It’s /d2cgsummary.

How to investigate slow compile times?

Given that compile/build times are a massive problem in large C++ codebases, it’s amazing how little information we can get about why they are slow. For example, when using the Visual Studio C++ compiler, about the only options seem to be:

1) Use the /Bt flag, to get time spent in the frontend (c1xx.dll) vs the backend (c2.dll) parts of the compiler. This is good for narrowing down whether slow compiles are due to too many things being parsed (e.g. a problem of including too many header files), which is typically a frontend problem, or whether it’s some sort of language construct that causes the backend (typically the optimizer part of the compiler) to take “a lot of time”.

The “common knowledge” is that long C++ compile times are caused by too many header files being included, and in general by the C/C++ #include model. That is often the case, but sometimes it’s the backend part of the compiler that takes “forever” to compile something (e.g. the __forceinline compile time issue I found in my previous post). The backend part typically only takes “long” when compile optimizations are on.

2) Take a profiler capture of cl.exe using any existing native code profiling tool (e.g. xperf), load symbols, and try to guess what is causing the long compile time from the profiling results.

3) Just guess, kinda randomly changing code structure, and see whether it helps.

Well, turns out there’s another option, the /d2cgsummary flag.

MSVC /d2cgsummary flag

I can’t find anything about this flag on the internet, so I’d assume it’s something “for internal use only”, and it may or may not exist or behave the same in any given MSVC compiler version. It does seem to exist & work in VS2015 Update 3 and in VS2017. Passing it to cl.exe will give an output of “Anomalistic Compile Times” and some sort of “Caching Stats”, like this:

Code Generation Summary
        Total Function Count: 742
        Elapsed Time: 13.756 sec
        Total Compilation Time: 13.751 sec
        Efficiency: 100.0%
        Average time per function: 0.019 sec
        Anomalistic Compile Times: 1
                ?EvaluateRootMotion@animation@mecanim@@YAXAEBUClipMuscleConstant@12@AEBUClipMuscleInput@12@PEBM2AEAUMotionOutput@12@_N@Z: 11.667 sec, 0 instrs
        Serialized Initializer Count: 4
        Serialized Initializer Time: 0.006 sec
RdrReadProc Caching Stats
        Functions Cached: 61
        Retrieved Count: 20162
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 14
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 61
        Functions Cached and Never Retrieved: 0
        Most Hits:
                ?f@?$load@$00@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 2775
                ?f@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 2774
                ?f@?$load@$00@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 1924
                ?f@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 1920
                ?pack@?$v@Uv4f@meta@math@@U?$sp@Uv4f@meta@math@@$0DCB@$02@23@$02@meta@math@@SA?AU123@AEBT__m128@@@Z: 1296
                ??B_float4@math@@QEBA?AT__m128@@XZ: 1165

Anomalistic Compile Times seems to list the “longest to compile” functions; I guess the ones whose times are much longer than the average. Here, it immediately says that among all 700+ functions that were compiled, there is one that takes almost all the time! The names are mangled, but running them through either the undname.exe utility that ships with MSVC, or an online demangler, reveals that the slow-to-compile function in this particular file was mecanim::animation::EvaluateRootMotion.

This means that if I want to speed up compilation of this file, I only need to look at that one function and somehow figure out why it’s slow to compile. The “why” is still an exercise for the reader, but at least I know I don’t have to investigate the 700 other functions. Great!
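If you are scanning many files for slow functions, the report is easy to process mechanically. Below is a small sketch of a parser for the “Anomalistic Compile Times” entries; note the format is inferred from the output shown above and could well differ between MSVC versions:

```python
import re

def anomalistic_functions(summary_text):
    """Parse the 'Anomalistic Compile Times' entries out of /d2cgsummary
    output. Returns [(mangled_name, seconds), ...] sorted slowest-first.
    The line format is guessed from observed VS2017 output."""
    entries = []
    in_section = False
    for line in summary_text.splitlines():
        if "Anomalistic Compile Times:" in line:
            in_section = True
            continue
        if in_section:
            # Entry lines look like: "    <mangled name>: 11.667 sec, 0 instrs"
            m = re.match(r"\s+(\S+): ([\d.]+) sec", line)
            if not m:
                break  # end of the indented entry list
            entries.append((m.group(1), float(m.group(2))))
    return sorted(entries, key=lambda e: -e[1])
```

Feeding it the first report above would return a single entry, the 11.667-second EvaluateRootMotion function.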

The RdrReadProc Caching Stats section I don’t quite understand, but it seems to list the functions that were used most often in the whole compilation. Again after demangling, it seems that math::meta::v4f::SWIZ<801>::load got used almost 3000 times, and so on.
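For quick scripted triage you don’t always need a full demangler: for plain (non-template, non-operator) MSVC-decorated names, the qualified name can be recovered from the segment structure of the mangling. A rough sketch under that assumption; for anything fancier, use undname.exe or an online demangler:

```python
def rough_qualified_name(mangled):
    """Recover just the qualified name (not the full signature) from a
    simple MSVC-decorated name. Such names start with '?' and list the
    function name plus its enclosing scopes innermost-first, separated
    by '@' and terminated by '@@'. Template names ('??$...') and
    operators are NOT handled here."""
    if not mangled.startswith("?") or mangled.startswith("??"):
        raise ValueError("not a simple MSVC-decorated name")
    segments = mangled[1:].split("@@", 1)[0].split("@")
    return "::".join(reversed(segments))
```

For example, the slow function from the first report comes out as mecanim::animation::EvaluateRootMotion.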

On some other files I was looking into, the /d2cgsummary flag did not point to one single slow function, but it at least pointed me to a subsystem that might be contributing to slow compile times. For example, here:

Code Generation Summary
        Total Function Count: 4260
        Elapsed Time: 20.710 sec
        Total Compilation Time: 20.685 sec
        Efficiency: 99.9%
        Average time per function: 0.005 sec
        Anomalistic Compile Times: 32
                ?VirtualRedirectTransfer@Shader@@UEAAXAEAVRemapPPtrTransfer@@@Z: 0.741 sec, 0 instrs
                ??$TransferBase@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@W4TransferMetaFlags@@@Z: 0.734 sec, 0 instrs
                ??$Transfer@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@PEBDW4TransferMetaFlags@@@Z: 0.726 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@Shader@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.711 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@VShader@@@@SAXAEAVShader@@AEAVRemapPPtrTransfer@@@Z: 0.708 sec, 0 instrs
                ??$Transfer@USerializedShader@ShaderLab@@@RemapPPtrTransfer@@QEAAXAEAUSerializedShader@ShaderLab@@PEBDW4TransferMetaFlags@@@Z: 0.632 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@USerializedShader@ShaderLab@@@@SAXAEAUSerializedShader@ShaderLab@@AEAVRemapPPtrTransfer@@@Z: 0.625 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@SerializedShader@ShaderLab@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.621 sec, 0 instrs
                <...>
RdrReadProc Caching Stats
        Functions Cached: 227
        Retrieved Count: 74070
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 0
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 221
        Functions Cached and Never Retrieved: 6
        Most Hits:
                ?GetNewInstanceIDforOldInstanceID@RemapPPtrTransfer@@QEAAHH@Z: 3065
                ?MightContainPPtr@?$SerializeTraits@D@@SA_NXZ: 2537
                ?Align@TransferBase@@QEAAXXZ: 1965
                <...>

Out of the total 20-second compile time, there’s a bunch of functions that each take almost a second to compile (after demangling: Shader::VirtualRedirectTransfer(class RemapPPtrTransfer), RemapPPtrTransfer::TransferBase<class Shader>(class Shader&, enum TransferMetaFlags)) and so on. So hey, out of everything that Shader.cpp does, I now know that in our codebase it’s the “data serialization” part that is slow to compile.
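To put a number on that hunch, just the eight serialization-related entries visible in the truncated list already account for over a quarter of the whole compile (the remaining 24 anomalistic entries were elided in the output, so the real share is higher):

```python
# Times (seconds) of the eight entries shown in the report above;
# the other 24 anomalistic entries were truncated in the output.
listed = [0.741, 0.734, 0.726, 0.711, 0.708, 0.632, 0.625, 0.621]
total = 20.685  # "Total Compilation Time" from the same report
share = sum(listed) / total
print(f"{sum(listed):.3f} sec, {share:.0%} of total")
```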

What would be super useful to have from compiler vendors: way more diagnostics on what & why something is slow. Detailed reports like this flag, and/or custom ETW providers for xperf, or anything like that. These would not be things used by most people, but when someone has to use them, they would be invaluable. Saving a minute or two of build time for several hundred programmers is an enormous amount of productivity gain!

Anyway, now you know. If you use Microsoft’s C++ compiler, try passing the /d2cgsummary flag to cl.exe. You might find something interesting! Next time: how this flag pointed me towards a particular programming pattern that is slow to compile on MSVC, and let me shave 80 seconds off the compile time of a single file.
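If you want the flag on for a whole project rather than a one-off cl.exe invocation, it can go into the compiler’s Additional Options. A sketch of the relevant vcxproj fragment; the ItemDefinitionGroup/ClCompile layout is standard MSBuild, but double-check against your project setup:

```xml
<!-- In a .vcxproj: pass /d2cgsummary to every ClCompile invocation.
     /d2 flags are undocumented, so they go through as plain
     additional options rather than a named property. -->
<ItemDefinitionGroup>
  <ClCompile>
    <AdditionalOptions>/d2cgsummary %(AdditionalOptions)</AdditionalOptions>
  </ClCompile>
</ItemDefinitionGroup>
```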

Update: there is now an item on the Visual Studio feedback site asking to document the flag, and perhaps some others that might be useful.

Update 2: the “Visual Studio 2017 Throughput Improvements and Advice” post went up on the Visual C++ team blog, with /d2cgsummary and other useful compiler throughput tips.