2017 Holidays Giveaway

I told the internet I’d buy books and stuff for people, and you won’t believe what happened next!

45 people got books and stuff.

FIN.

Backstory

I wrote this on both mastodon and twitter:

Let’s try some “we’ll probably make it through 2017” gifts. Do you need a book, a course, an app, a workshop, github, hosting, or something else that would help you? Something in $50-$100 range; ping me and I’ll try to do it!

Why? Dunno, sounded like that would be a nice thing to do. And then I got two requests, which I fulfilled, and then… nothing. However, I repeated the same thing the next day, during a more US-friendly hour, and then the requests started flowing in.

Expectations

This being the internet, I expected totally random and weird things. A million requests, a bunch of trolls, a bunch of nonsense, etc.

I’m happy to report that the internet managed to pleasantly surprise me with how well it behaved on this particular occasion!

Requests

In the end I gave gifts to 45 strangers on the internet before calling it a day, and here’s what they turned out to be.

Unsurprisingly, the majority were computer related.

  • 13 books: 11 on programming, 2 on art/design.
  • 5 courses or tutorials: 3 on programming, 2 on art.
  • 9 items of physical hardware: a couple of Raspberry Pi 3s, a Leap Motion, a 360° camera, a Daydream VR viewer, etc.
  • 4 software licenses: Git Tower (twice), Gamemaker Studio 2, ArtStation Pro.

Of course, the most interesting requests are those that are not computer related :)

I was really surprised by the number of people saying “I don’t need anything myself, but if you could donate to charity/nonprofit X, that would be great”. 13% of all requests being of that kind is pretty sweet! For some of these I decided to go for a contribution larger than the $50-$100 range. Here’s what people asked to donate to:

And of course, “everything else” is where it got fairly random. I bought tree saplings, socks, a room heater, covered an electricity bill, vet expenses and so on. Very interesting!

Tips and Findings

  • Buying stuff for strangers on the internet is easiest via Amazon wishlists. I pretty much had this ready to paste for inquiries: “Can you create a wishlist on Amazon, add the item to it, set up a shipping address and give me the link?” One big advantage: it does not disclose the actual address, phone number or other personal details to me; and overall it’s way easier than filling in whole shipping forms.
  • Keep in mind though that for each and every new shipping address, Amazon will ask you to re-enter your credit card details. This makes sense for security purposes, but in this particular scenario it means re-entering them each.and.every.time. Your own Amazon UI for shipping address choices will become a mess in the end as well, and will need some cleanup.
  • Shipping costs are very real! Both you and I probably have an “oh yeah, a book is like $50” mental image, but depending on where it needs to be shipped, the same book can end up costing anywhere from $50 to $100.
  • Time-investment-wise, I probably spent 15-20 hours on all this, during evenings over a few days. With some people it took quite a few round trips of back & forth messaging until I got a concrete, actionable thing (a URL of a thing: actionable; “I want a book I saw last week”: not so much :)). Maybe next time it would be easier, given that now I know I should just fast-track people straight to “yo, gimmeh your amazon wishlist”.

It was quite a lot of fun and felt good! If you have money to spare, I can definitely recommend it.


Gamedev Mastodon: mastodon.gamedev.place

TL;DR: I have set up a game development themed Mastodon instance at mastodon.gamedev.place. In three weeks it got to almost 2000 users and seems to be holding up nicely. It’s similar to Twitter, but might be nicer in some aspects!

Why?

Social media in 2017 can be pretty dreadful. Here are some sentiments:

I wake up every morning to Twitter: Trump is an idiot, everyone is a Nazi, AI will kill us, big games suck, buy my indie game for $0.99, buy this ICO, EU commission thinks about you, look at my shadertoy, look at the craft beer I just had. Why do I keep doing this to myself? (via)

or

Here are two reasons to yell about an important thing:

  1. Because nobody cares about it yet,
  2. Everyone cares now, but you need all your friends to know you care the loudest. Everyone who stops at 1 quit social media years ago. (This was unfair. <…> 3., anguished screaming into the void). (via)

The “Please Listen To Me” webcomic has put it nicely here.

One part of my issues with social media lately is the amount of general “everything is on fire, we are doomed” things I see on it every day. Yes, the world might be on fire, and yes, we might be doomed, but repeating that to me over and over again is getting tiring by now. Yes, I’ve got the memo, and I’ve already decided in what ways I plan to improve the situation (as much as I can), kthxbye.

The other part of my annoyance is twitter-specific (ads, bots, spam, trolls, the randomly non-chrono timeline), and the apparent inability of twitter-the-company to solve any of it in a meaningful way. Even a tiny thing like reporting a sex-spam-bot takes a whole lot of mouse movement and a dozen clicks. Why?! And the issues I have here are all extremely minor, compared to someone who might experience harassment or worse.

Enter Mastodon

So Mastodon is this open source thing where someone can run “their own small twitter-like instance” on their own servers. With some clever technology, it “federates” with other instances; for example, you can follow people from other servers, and it kinda works out. Each instance can set up its own rules for moderation, acceptable behavior etc., and it’s up to them to deal with whatever bad things might happen.

Maybe this will take off, or maybe it will be one of the neat ideas that never grew into anything meaningful. I’ve no way to tell.

mastodon.gamedev.place

…but I went and created my own instance anyway, at mastodon.gamedev.place

  • It’s like a mini-twitter about everything game development.
  • There are short Usage Guidelines.
  • And a Code of Conduct too.
  • Since I know nothing about servers or the tech stack Mastodon runs on, I used masto.host for the “actual hosting” part; seems to be working out fine.

You are welcome to join!

How is it working out?

The instance has been running for three weeks now, and it has reached almost 2000 users. That’s quite nice!

Now of course, a “smaller twitter” has downsides too; by definition it is smaller and has fewer people to interact with, and being game-development specific, it has a somewhat narrower range of topics being talked about. It’s a good idea for peace of mind and cultivating your existing interests; it’s probably not a good place for picking up new interests, expanding your horizons or experiencing wildly different points of view.

Right now, it seems to be fairly heavy on the programming/engineering side, and out of that, somehow it got a lot of Godot engine users/lovers. I’ve no idea how that happened; I don’t know any of them :) There are some artists, designers and audio people as well, but I’d love it if there were more (see the point above about a limited set of topics). It would be nice to get more diversity in other aspects too (gender, race, age, experience, …).

So far (knocks on wood…) there have been no bad incidents; there were only a dozen or so moderation reports in total, and most of them were due to people looking at the “Federated” (== “friends of friends”) timeline in the UI and thinking those people are from this instance. To the community there: you are doing great, keep it up :)

With the current number of users it runs comfortably on the €60/month server tier, and that should hold up quite well into the future. The cost is totally fine with me; I can keep running it “basically forever” in Internet-timeline terms.

I’ve heard from a handful of people that they have effectively quit twitter and are now only on Mastodon. Yay, I guess!

Anyhoo.

Join mastodon.gamedev.place!


Slow to Compile Table Initializers

Continuing the hunt for low-hanging fruit in our codebase’s (Unity game engine, a lot of C++ code) build times with Visual Studio, here’s what I found with the help of /d2cgsummary (see the previous blog post).

I noticed one file that was taking 92 seconds to compile on my machine, in the Release config. A quick look at it did not reveal any particularly crazy code structure; it was fairly simple and “obvious” code. However, /d2cgsummary said:

Anomalistic Compile Times: 44
  ?IsFloatFormat@@YA_NW4GraphicsFormat@@@Z: 5.493 sec
  ?IsHalfFormat@@YA_NW4GraphicsFormat@@@Z: 5.483 sec
  ?GetGraphicsFormatString@@YA?AV?$basic_string@DV?$StringStorageDefault@D@core@@@core@@W4GraphicsFormat@@@Z: 4.680 sec
  ?ComputeMipmapSize@@YA_KHHW4GraphicsFormat@@@Z: 4.137 sec
  ?ComputeTextureSizeForTypicalGPU@@YA_KHHHW4GraphicsFormat@@HH_N@Z: 4.087 sec
  ?GetComponentCount@@YAIW4GraphicsFormat@@@Z: 2.722 sec
  ?IsUNormFormat@@YA_NW4GraphicsFormat@@@Z: 2.719 sec
  ?Is16BitPackedFormat@@YA_NW4GraphicsFormat@@@Z: 2.700 sec
  ?IsAlphaOnlyFormat@@YA_NW4GraphicsFormat@@@Z: 2.699 sec
  <...>

A bunch of functions taking 2-5 seconds to compile each, and each of them looking very simple:

bool IsFloatFormat(GraphicsFormat format)
{
    return ((GetDesc(format).flags & kFormatPropertyIEEE754Bit) != 0) && (GetDesc(format).blockSize / GetComponentCount(format) == 4);
}

Why on earth a function like that would take 5 seconds to compile?!

Then I noticed that all of them call GetDesc(format), which looks like this:

const FormatDesc& GetDesc(GraphicsFormat format)
{
    static const FormatDesc table[] = //kFormatCount
    {
        // bSize,bX,bY,bZ, swizzleR,  swizzleG,        swizzleB,        swizzleA,        fallbackFormat,    alphaFormat,       textureFormat,    rtFormat,        comps, name,   flags
        {0, 0, 0, 0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatNone,       kFormatNone,       kTexFormatNone,   kRTFormatCount,  0, 0,  "None", 0},                                                                            // None,
        {1, 1, 1, 1, kFormatSwizzleR, kFormatSwizzle0, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  1, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // R8_SRGB
        {2, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzle0, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatNone,   kRTFormatCount,  2, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RG8_SRGB
        {3, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzle1, kFormatRGBA8_SRGB, kFormatRGBA8_SRGB, kTexFormatRGB24,  kRTFormatCount,  3, 0,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGB8_SRGB
        {4, 1, 1, 1, kFormatSwizzleR, kFormatSwizzleG, kFormatSwizzleB, kFormatSwizzleA, kFormatNone,       kFormatRGBA8_SRGB, kTexFormatRGBA32, kRTFormatARGB32, 3, 1,  "",     kFormatPropertyNormBit | kFormatPropertySRGBBit | kFormatPropertyUnsignedBit}, // RGBA8_SRGB
        // <...a lot of other entries for all formats we have...>
    };
    CompileTimeAssertArraySize(table, kGraphicsFormatCount);
    return table[format];
}

Just a function that returns a “description” struct for a “graphics format”, with various info we might be interested in, using a pattern similar to “Translations in C++ using tables with zero-based enums”.
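
The CompileTimeAssertArraySize bit is not shown above; as a rough sketch (an assumption on my part, not necessarily the exact macro in the codebase), such a “table size matches the enum count” check can be done in C++11 terms like this:

// Hypothetical sketch of a compile-time array-size check; the actual
// macro in the codebase may be implemented differently.
#define CompileTimeAssertArraySize(arr, count) \
    static_assert(sizeof(arr) / sizeof((arr)[0]) == (count), \
        "table entry count must match the enum count")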

The table is huge though. Could it be what’s causing compile times to be super slow?

Let’s try moving the table to be a static global variable, outside of the function:

static const FormatDesc table[] = //kFormatCount
{
    // <...the table...>
};
CompileTimeAssertArraySize(table, kGraphicsFormatCount);
const FormatDesc& GetDesc(GraphicsFormat format)
{
    return table[format];
}

Boom. Compile time of that file went down to 10.3 seconds, a whole 80 seconds faster.

:what:

That’s crazy. Why does this happen?

This “initialize table inside a function” pattern differs from “just a global variable” in a few aspects:

  • The compiler must emit code to do the initialization when the function is executed for the first time. For non-trivial data, the optimizer might not “see” that it’s all just bytes in the table, and so it will actually emit the equivalent of if (!initYet) { InitTable(); initYet=true; } into the function.
  • Since C++11, the compiler is required to make that initialization thread-safe too! So it does the equivalent of a mutex lock or some atomic checks for the initialization.

Whereas for a table initialized as a global variable, none of that needs to happen; if it’s not “just bytes” then the compiler still generates the initializer code, but it is called exactly once before the “actual program” starts, and it does not add any branches to the GetDesc function.
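
Here’s a minimal sketch of the difference, with made-up stand-in types (not the actual FormatDesc/GraphicsFormat definitions):

// Stand-in types, purely for illustration.
struct Desc { int blockSize; unsigned flags; };
enum Fmt { kFmtA, kFmtB, kFmtCount };

// Function-local static: if the compiler can't see the initializer as plain
// constant data, it emits an "initialized yet?" guard (thread-safe since C++11)
// that is checked on every call, plus the initialization code itself.
const Desc& GetDescLocal(Fmt f)
{
    static const Desc table[kFmtCount] = { {1, 0x1}, {4, 0x2} };
    return table[f];
}

// Global: any non-trivial initialization runs once before main(), and the
// accessor itself is just an array index with no extra branches.
static const Desc g_Table[kFmtCount] = { {1, 0x1}, {4, 0x2} };
const Desc& GetDescGlobal(Fmt f)
{
    return g_Table[f];
}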

Ok, so one takeaway is: for “static constant data” tables like that, it might be better to declare them as global variables, instead of static local variables inside a function.

But wait, there’s more! I tried to make a “simple” repro case for Microsoft C++ folks to show the unusually long compile times, and I could not. Something else was also going on.

What is “Simple Data”?

The same /d2cgsummary output also had this:

RdrReadProc Caching Stats
  Most Hits:
    ??U@YA?AW4FormatPropertyFlags@@W40@0@Z: 18328

Which probably says that the enum FormatPropertyFlags __cdecl operator|(enum FormatPropertyFlags,enum FormatPropertyFlags) function was called/used 18 thousand times in this file. One member of our FormatDesc struct is a “bitmask enum” that has operators defined on it for type safety, similar to DEFINE_ENUM_FLAG_OPERATORS (see here or there).

At this point, let’s move on to actual code that can be investigated outside of the whole Unity codebase. Full source code is here in the gist. Key parts:

#ifndef USE_ENUM_FLAGS
#define USE_ENUM_FLAGS 1
#endif
#ifndef USE_INLINE_TABLE
#define USE_INLINE_TABLE 1
#endif

enum FormatPropertyFlags { /* ...a bunch of stuff */ };
#if USE_ENUM_FLAGS
#define ENUM_FLAGS(T) inline T operator |(const T left, const T right) { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }
ENUM_FLAGS(FormatPropertyFlags);
#endif

struct FormatDesc { /* ...the struct */ };

#if USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    static const FormatDesc table[] = { /* ... the huge table itself */ };
#if !USE_INLINE_TABLE
const FormatDesc& GetDesc(GraphicsFormat format)
{
#endif
    return table[format];
}

UInt32 GetColorComponentCount(GraphicsFormat format)
{
    return GetDesc(format).colorComponents;
}
/* ... a bunch more functions similar to this one */

We have preprocessor defines to switch between the big table being defined as a static local variable inside the GetDesc function or as a static global variable, and a define to switch whether the “type safe enums” machinery is used or not. I’m compiling for x64 with cl.exe /O2 /Zi /GS- /GR- /EHsc- /MT main.cpp /link /OPT:REF /OPT:ICF, which is a fairly typical “Release build” flag soup.

| Compiler | Time | Time with /cgthreads1 | Exe size, KB | Large symbols, KB |
|---|---|---|---|---|
| MSVC 2010 SP1 | 1.8s | - | 58 | 12 (GetDesc) |
| MSVC 2015 Update 3 | 15.1s | 42.6s | 189 | 45 (IsFloatFormat), 45 (IsHalfFormat) |
| MSVC 2017 15.3 | 18.2s | 37.5s | 96 | - |

Ok, so Visual C++ 2015/2017 is particularly slow at compiling this code pattern (a big table, with type-safe-enum operators used in it) with optimizations turned on. It’s also a big increase in compile time compared to VS2010, hence I filed a bug with Microsoft.

But what’s even stranger is that the code size is quite big too, particularly in the VS2015 case. The IsFloatFormat and IsHalfFormat functions, which are both simple one-liners that just call GetDesc, each compile into a separate 45 kilobyte chunk of code (I found that via Sizer).

VS2015 compiles IsFloatFormat into this:

mov         qword ptr [rsp+8],rbx
mov         qword ptr [rsp+10h],rbp
mov         qword ptr [rsp+18h],rsi
push        rdi
; some prep stuff that checks for above mentioned "is table initialized yet",
; and then if it's not; does the table initialization:
movdqa      xmm0,xmmword ptr [__xmm@00000000000000040000000400000005 (07FF7302AA430h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E68h (07FF7302AC080h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000050000000200000001 (07FF7302AA690h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0E98h (07FF7302AC0B0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000040000000000000003 (07FF7302AA680h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0ED8h (07FF7302AC0F0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000400000004 (07FF7302AA710h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F08h (07FF7302AC120h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000003e000000080000000800000005 (07FF7302AAAF0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F48h (07FF7302AC160h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000008000000050000000200000001 (07FF7302AA700h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0F78h (07FF7302AC190h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@00000004000000080000000000000003 (07FF7302AA6A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FB8h (07FF7302AC1D0h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@0000000c000000050000000400000004 (07FF7302AA7A0h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+0FE8h (07FF7302AC200h)],xmm0
movdqa      xmm0,xmmword ptr [__xmm@000000000000000c0000000c00000005 (07FF7302AA480h)]
movdqa      xmmword ptr [__NULL_IMPORT_DESCRIPTOR+1028h (07FF7302AC240h)],xmm0
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E50h (07FF7302AC068h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E58h (07FF7302AC070h)],1010102h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E60h (07FF7302AC078h)],1
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E64h (07FF7302AC07Ch)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E78h (07FF7302AC090h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0E7Ch (07FF7302AC094h)],2
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E80h (07FF7302AC098h)],rbp
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0E88h (07FF7302AC0A0h)],1Ch
mov         qword ptr [__NULL_IMPORT_DESCRIPTOR+0E90h (07FF7302AC0A8h)],1010103h
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EA8h (07FF7302AC0C0h)],4
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EACh (07FF7302AC0C4h)],3
mov         dword ptr [__NULL_IMPORT_DESCRIPTOR+0EB0h (07FF7302AC0C8h)],1Ch
mov         word ptr [__NULL_IMPORT_DESCRIPTOR+0EB4h (07FF7302AC0CCh)],3
; repeat very similar thing for 45 more kilobytes...

Which is basically GetDesc, including the big table initializer, fully inlined into it. The initialization is not done via some “simple data segment copy”; it looks like it is carefully constructed entry by entry, field by field. And then a very similar thing is repeated for the IsHalfFormat function.

VS2017 does not do any of that; the optimizer “realizes” that the table is purely constant data, puts it into a data segment (yay!), and the IsFloatFormat function becomes fairly simple:

movsxd      rax,ecx
lea         r8,[__NULL_IMPORT_DESCRIPTOR+5C4h (07FF725393000h)]
imul        rdx,rax,38h
test        byte ptr [rdx+r8+30h],80h
je          IsFloatFormat+35h (07FF725391035h)
movzx       eax,byte ptr [rdx+r8+25h]
movzx       ecx,byte ptr [rdx+r8+24h]
add         ecx,eax
movzx       eax,byte ptr [rdx+r8]
xor         edx,edx
div         eax,ecx
cmp         eax,4
jne         IsFloatFormat+35h (07FF725391035h)
mov         al,1
ret
xor         al,al
ret

What if the table is moved to be a global variable, as I suggested above? Passing /DUSE_INLINE_TABLE=0 to the compiler we get:

| Compiler | Time | Exe size, KB | Large symbols, KB |
|---|---|---|---|
| MSVC 2010 SP1 | 1.4s | 50 | 2 (dynamic initializer for 'table') |
| MSVC 2015 Update 3 | 2.1s | 96 | - |
| MSVC 2017 15.3 | 2.7s | 96 | - |

VS2017 generates completely identical code as before, it just does so 6 times faster. VS2015 now also compiles the table into a data segment like VS2017 does; it is 7 times faster to compile, and the executable is 90 kilobytes smaller.

VS2010 still emits the global table initializer function, and is a bit faster to compile. But it wasn’t as slow to compile to begin with.

What if we leave the table as a local variable, but just remove the single usage of the type-safety enum flags? Passing /DUSE_ENUM_FLAGS=0 to the compiler we get:

| Compiler | Time | Exe size, KB |
|---|---|---|
| MSVC 2010 SP1 | 0.2s | 46 |
| MSVC 2015 Update 3 | 0.4s | 96 |
| MSVC 2017 15.3 | 0.4s | 96 |

Whoa. All three compilers now “realize” that the table is pure simple data, put it into a data segment, and take a lot less time to compile the whole thing.

And all that just because this function got removed from the source:

inline FormatPropertyFlags operator|(const FormatPropertyFlags left, const FormatPropertyFlags right)
{
	return static_cast<FormatPropertyFlags>(static_cast<unsigned>(left) | static_cast<unsigned>(right));
}

In this particular piece of code, that “type safe enum” machinery does not actually do anything useful, but in more general code, extra type safety on enums & bit flags is a very useful thing to have! It’s a bit sad that “seemingly trivial” type safety abstractions incur so much compile time overhead in the year 2017, but oh well… reality is sad, especially in the year 2017 :(

Would using the C++11 “enum classes” feature allow us to have type-safe enums, have bitmask operations on them (similar to this approach), and keep good compile times? I don’t know yet; an exercise for the reader right now! But also, see the constexpr section below.
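
Here’s roughly what that enum-class variant could look like, as a hypothetical sketch (I have not measured what it does to MSVC compile times):

// Hypothetical sketch: type-safe bit flags via C++11 enum class with constexpr
// operators. Whether this avoids the MSVC slowdown is the open question above.
enum class FormatFlags : unsigned
{
    None    = 0,
    Norm    = 1 << 0,
    SRGB    = 1 << 1,
    IEEE754 = 1 << 2,
};
constexpr FormatFlags operator|(FormatFlags a, FormatFlags b)
{
    return static_cast<FormatFlags>(static_cast<unsigned>(a) | static_cast<unsigned>(b));
}
constexpr FormatFlags operator&(FormatFlags a, FormatFlags b)
{
    return static_cast<FormatFlags>(static_cast<unsigned>(a) & static_cast<unsigned>(b));
}
constexpr bool HasAnyFlag(FormatFlags value, FormatFlags flags)
{
    return (value & flags) != FormatFlags::None;
}
// Usage:
//   constexpr FormatFlags f = FormatFlags::Norm | FormatFlags::SRGB;
//   static_assert(HasAnyFlag(f, FormatFlags::SRGB), "should have SRGB set");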

For reference, I also checked gcc (g++ 5.4 on the Windows 10 Linux subsystem, with the -O2 flag) compile times and executable sizes for each case above. In all cases gcc compiled everything in about 0.3 seconds on my machine, and the executable size was the same.

What about constexpr?

As pointed out over at reddit, what about declaring the table as C++11 constexpr? The type-safe enum utility has to be declared like that too then, so let’s try it. The ENUM_FLAGS macro needs a constexpr in front of the operator it declares, and the table needs to change from static const FormatDesc table[] to static constexpr FormatDesc table[].
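
Sketched out with abbreviated fields and made-up values (not the real table), the change looks roughly like this:

// Sketch of the constexpr variant; not the actual FormatDesc contents.
#define ENUM_FLAGS(T) \
    constexpr T operator |(const T left, const T right) \
    { return static_cast<T>(static_cast<unsigned>(left) | static_cast<unsigned>(right)); }

enum FormatPropertyFlags { kFormatPropertyNormBit = 1 << 0, kFormatPropertySRGBBit = 1 << 1 };
ENUM_FLAGS(FormatPropertyFlags);

struct FormatDesc { int blockSize; FormatPropertyFlags flags; };

const FormatDesc& GetDesc(int format)
{
    // constexpr instead of const: the table becomes a compile-time constant.
    static constexpr FormatDesc table[] =
    {
        { 1, kFormatPropertyNormBit | kFormatPropertySRGBBit },
        { 4, kFormatPropertyNormBit },
    };
    return table[format];
}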

Good news: compile times are fast again, and all is good! Thanks reddit! VS2010 does not grok the new construct though, so I removed it from the table below.

| Compiler | Time | Exe size, KB |
|---|---|---|
| MSVC 2015 Update 3 | 0.4s | 96 |
| MSVC 2017 15.3 | 0.4s | 96 |

Our codebase at the moment is mostly still C++98 (some platforms and their compilers…), but maybe we should sprinkle some optional C++11/14/… dust in a few places where it helps the compiler.

Summary

  • Big table initializers as static local variables might be better off as static global variables.
  • Type-safe enum bitmask machinery is not “free” in terms of compilation time, when using optimized builds in MSVC.
  • “Complex” table initializers (what counts as “complex” depends on the table size and the data in it) might get a lot of code emitted to set them up in MSVC (2015 at least); what looks like “just simple data” to you might not look that way to the compiler.
  • I keep running into code structure patterns that are much slower to compile with MSVC than with clang/gcc. Pretty sure the opposite cases exist too; it’s just that right now our MSVC builds are slower than our clang builds, and hence that’s where I’m looking.
  • Even if you’re not using C++11/14 features otherwise, in some places sprinkling things like constexpr might help a C++11 capable compiler to do the right thing.
  • The /d2cgsummary flag is awesome for figuring out where the MSVC optimizer is spending time. I probably would not have guessed any of the above without it pointing me towards the large table!

Best unknown MSVC flag: d2cgsummary

I’ve been looking at C++ compilation times lately (e.g. here or there), and through correspondence with Microsoft C++ compiler folks learned about a fairly awesome, but undocumented, cl.exe flag.

It’s /d2cgsummary.

How to investigate slow compile times?

Given that compile/build times are a massive problem in large C++ codebases, it’s amazing how little information we can get about why they are slow. For example when using Visual Studio C++ compiler, about the only options seem to be:

1) Use the /Bt flag to get the time spent in the frontend (c1xx.dll) vs the backend (c2.dll) parts of the compiler. This is good for narrowing down whether slow compiles are due to too many things being parsed (e.g. a problem of including too many header files), which is typically a frontend problem, or due to some sort of language constructs that cause the backend (typically the optimizer part of the compiler) to take “a lot of time”.

The “common knowledge” is that long C++ compile times are caused by too many header files being included, and in general by C/C++ #include model. That is often the case, but sometimes it’s the backend part of the compiler that takes “forever” to compile something (e.g. the __forceinline compile time issue I found in my previous post). The backend part typically only takes “long” when compile optimizations are on.

2) Take a profiler capture of cl.exe, using any of existing native code profiling tools (e.g. xperf), load symbols, and try to guess what is causing the long compile time based on profiling results.

3) Just guess, kinda randomly change the code structure, and see whether that helped.

Well, turns out there’s another option, the /d2cgsummary flag.

MSVC /d2cgsummary flag

I can’t find anything about this flag on the internet, so I’d assume it’s something “for internal use only”, and it may or may not exist or keep working in any given MSVC compiler version. It does seem to exist & work in VS2015 Update 3 and in VS2017. Passing it to cl.exe will give an output of “Anomalistic Compile Times” and some sort of “Caching Stats”, like this:

Code Generation Summary
        Total Function Count: 742
        Elapsed Time: 13.756 sec
        Total Compilation Time: 13.751 sec
        Efficiency: 100.0%
        Average time per function: 0.019 sec
        Anomalistic Compile Times: 1
                ?EvaluateRootMotion@animation@mecanim@@YAXAEBUClipMuscleConstant@12@AEBUClipMuscleInput@12@PEBM2AEAUMotionOutput@12@_N@Z: 11.667 sec, 0 instrs
        Serialized Initializer Count: 4
        Serialized Initializer Time: 0.006 sec
RdrReadProc Caching Stats
        Functions Cached: 61
        Retrieved Count: 20162
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 14
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 61
        Functions Cached and Never Retrieved: 0
        Most Hits:
                ?f@?$load@$00@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 2775
                ?f@?$SWIZ@$0DCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 2774
                ?f@?$load@$00@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T6@@Z: 1924
                ?f@?$SWIZ@$0EDCB@@v4f@meta@math@@SA?AT__m128@@T5@@Z: 1920
                ?pack@?$v@Uv4f@meta@math@@U?$sp@Uv4f@meta@math@@$0DCB@$02@23@$02@meta@math@@SA?AU123@AEBT__m128@@@Z: 1296
                ??B_float4@math@@QEBA?AT__m128@@XZ: 1165

“Anomalistic Compile Times” seems to list the “longest to compile” functions, where I guess their times are much longer than the average. Here it immediately says that among all 700+ functions that were compiled, there is one that takes almost all the time! The names are mangled, but using either the undname.exe MSVC utility or an online demangler, it turns out the slow-to-compile function in this particular file was mecanim::animation::EvaluateRootMotion.

This means that if I want to speed up compilation of this file, I only need to look at that one function, and somehow figure out why it’s slow to compile. The “why” is still an exercise for the reader, but at least I know I don’t have to investigate 700 other functions. Great!

The RdrReadProc Caching Stats section I don’t quite understand, but it seems to list the functions that were used most often in the whole compilation. Again after demangling, it seems that math::meta::v4f::SWIZ<801>::load got used almost 3000 times, etc.

On some other files I was looking into, the /d2cgsummary flag did not point to one slow function, but it at least pointed me to a subsystem that might be contributing to the slow compile times. For example here:

Code Generation Summary
        Total Function Count: 4260
        Elapsed Time: 20.710 sec
        Total Compilation Time: 20.685 sec
        Efficiency: 99.9%
        Average time per function: 0.005 sec
        Anomalistic Compile Times: 32
                ?VirtualRedirectTransfer@Shader@@UEAAXAEAVRemapPPtrTransfer@@@Z: 0.741 sec, 0 instrs
                ??$TransferBase@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@W4TransferMetaFlags@@@Z: 0.734 sec, 0 instrs
                ??$Transfer@VShader@@@RemapPPtrTransfer@@QEAAXAEAVShader@@PEBDW4TransferMetaFlags@@@Z: 0.726 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@Shader@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.711 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@VShader@@@@SAXAEAVShader@@AEAVRemapPPtrTransfer@@@Z: 0.708 sec, 0 instrs
                ??$Transfer@USerializedShader@ShaderLab@@@RemapPPtrTransfer@@QEAAXAEAUSerializedShader@ShaderLab@@PEBDW4TransferMetaFlags@@@Z: 0.632 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@?$SerializeTraits@USerializedShader@ShaderLab@@@@SAXAEAUSerializedShader@ShaderLab@@AEAVRemapPPtrTransfer@@@Z: 0.625 sec, 0 instrs
                ??$Transfer@VRemapPPtrTransfer@@@SerializedShader@ShaderLab@@QEAAXAEAVRemapPPtrTransfer@@@Z: 0.621 sec, 0 instrs
                <...>
RdrReadProc Caching Stats
        Functions Cached: 227
        Retrieved Count: 74070
        Abandoned Retrieval Count: 0
        Abandoned Caching Count: 0
        Wasted Caching Attempts: 0
        Functions Retrieved at Least Once: 221
        Functions Cached and Never Retrieved: 6
        Most Hits:
                ?GetNewInstanceIDforOldInstanceID@RemapPPtrTransfer@@QEAAHH@Z: 3065
                ?MightContainPPtr@?$SerializeTraits@D@@SA_NXZ: 2537
                ?Align@TransferBase@@QEAAXXZ: 1965
                <...>

Out of the total 20 second compile time, there’s a bunch of functions that take almost 1 second to compile each (after demangling: Shader::VirtualRedirectTransfer(class RemapPPtrTransfer), RemapPPtrTransfer::TransferBase<class Shader>(class Shader&, enum TransferMetaFlags), etc.). So hey, out of everything that Shader.cpp does, I know that in our codebase it’s the “data serialization” part that is slow to compile.

What would be super useful to have from compiler vendors: way more diagnostics on what & why something is slow. Detailed reports like this flag’s output, and/or custom ETW providers for xperf, or anything like that. These would not be things used by most people, but when someone has to use them, they would be invaluable. Saving a minute or two of build time for several hundred programmers is an enormous productivity gain!

Anyway, now you know. If you use Microsoft’s C++ compiler, try out /d2cgsummary flag to cl.exe. You might find something interesting! Next time, how this flag pointed me towards a particular programming pattern that is slow to compile on MSVC, and allowed saving 80 seconds of compile time of one file.

Update: there is now an item on the Visual Studio feedback site to document the flag, and perhaps some others that might be useful.

Update 2: the “Visual Studio 2017 Throughput Improvements and Advice” post went up on the Visual C++ team blog, with /d2cgsummary and other useful compiler throughput tips.


Forced Inlining Might Be Slow

We were upgrading from the Visual Studio 2010 compiler to the 2015 one at work the other day, and I noticed that Release configuration build times went up. Quite noticeably up: a full rebuild of “the engine” went from ~6 minutes to ~12 minutes. Something needed to be done!

Using the same build profiler I had added before, I quickly found one C++ source file that was taking 10 minutes to compile with VS2015, and a few others that were a bit less extreme.

Just like before, the slow files tended to be heavy users of our SIMD math library, and I started to think along the lines of “oh no, it will probably be a lot of work to investigate how to reduce all the complexity of it”, and so on. After all, it’s a fractal of templates that I don’t understand.

However, feedback on the bug report I filed with Microsoft (“2015 got a lot slower than 2010 at compiling the same code”; worth notifying them about it in any case) suggested that our usage of __forceinline might be the culprit.

Make two functions non-inlined, speed up the build twice

From the answers to my bug report:

The basic root cause here is there are a few functions (one is inverse_WorksFor_SingularAffineX, but there are similar ones that appear to be stamped out by the same set of macros) which have deep, deep inline trees. Around 8000 individual inline instances; the majority of which come from __forceinline functions.

Looking at our code, I could see that pretty much everything in our math library used forced inlining. The reason for that, AFAIK, was to work around the “SIMD parameters are passed through the stack” x64 ABI problem that existed before Microsoft added __vectorcall in VS2013.
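
As a tiny illustration of what __vectorcall is about (hypothetical functions, not from our codebase): without it, the standard x64 Microsoft ABI passes SIMD-containing structs through memory, which is exactly what all the force-inlining was working around.

#include <emmintrin.h>

struct float4 { __m128 val; };

// Default x64 calling convention: a struct holding an __m128 is passed via
// memory, so small math helpers pay for loads & stores at every call site.
float4 AddDefault(float4 a, float4 b)
{
    float4 r; r.val = _mm_add_ps(a.val, b.val); return r;
}

// __vectorcall (VS2013+): such arguments can travel in XMM registers instead,
// which reduces the need to force-inline everything for performance.
float4 __vectorcall AddVec(float4 a, float4 b)
{
    float4 r; r.val = _mm_add_ps(a.val, b.val); return r;
}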

Most of the inlines make sense; however, maybe we indeed had some functions that were too long or too deep? Turns out, yes we did. We had a “robust matrix inverse” function implemented via SVD (singular value decomposition), which was force-inlined, along with everything else it called. The SVD inverse is: some matrix transposes, 5 Jacobi iterations that involve a bunch of quaternion-matrix conversions and transposes, sorting of singular values, some factorization, some quat-matrix conversions again, etc. All of that compiles down to 6.5 kilobytes of machine code. That’s way outside “sensible thing to force inline” territory, IMHO!

So I moved the svdInverse function and a similar svdOrthogonalize function from the header into a .cpp file.
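
Roughly, the change was of this shape (a simplified sketch with made-up types and signatures, not the actual code):

// "Matrix" here is just a stand-in type for the sketch.
struct Matrix { float m[16]; };

// mathlib.h, before: the force-inlined definition lived in the header, so the
// whole SVD inline tree got expanded into every caller:
//   static FORCE_INLINE Matrix svdInverse(const Matrix& m) { /* ~6.5 KB of code */ }

// mathlib.h, after: just a declaration.
Matrix svdInverse(const Matrix& m);

// mathlib.cpp, after: the definition is compiled exactly once, and is still free
// to use force-inlined helpers internally. svdOrthogonalize got the same treatment.
Matrix svdInverse(const Matrix& m)
{
    Matrix result = m;
    // ...transposes, Jacobi iterations, sorting of singular values, etc...
    return result;
}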

The full build got twice as fast: from 11m54s down to 5m46s.

The slowest file, one with a lot of unit tests, which took 660 seconds to compile before, now compiled in 6.5 seconds. That’s a… 100x compile time speedup.

Here’s what would be useful to have in compiler output: some sort of diagnostic switch that would tell me “hey, you marked this as force inlined, but the function compiles down to several kilobytes, and is used in more than a handful of places”.

Because today, investigating a thing like that is basically “read the source code, try to guess which functions end up being large”. That, or do a special build without inlining, and use Sizer / SymbolSort / Bloaty McBloatface to figure out the large functions. All of that is fairly involved in a large codebase.

Then I made two other functions not be inlined either, and got some additional build speedup. The full build was back to about 5 minutes, just from slightly moving 4 functions in a multi-million line codebase. Not bad!

This time, the primary reason for the build being slow was not template metaprogramming! :)

Synthetic test for inlining compile times

To check that it’s not just something in “our code”, I did a small synthetic test. It has nonsensical functions that don’t do anything useful, but it kinda “approximates” what we had in our codebase.

#if defined(_MSC_VER)
#define INL __forceinline
#else
#define INL inline __attribute__((always_inline))
#endif

// --- Basic stuff of what you might see in a "SSE2 SIMD math library" code
#include <emmintrin.h>

struct float4
{
    __m128 val;
    float4() { val = _mm_setzero_ps(); }
    float4(float x) { val = _mm_set1_ps(x); }
    float4(float x, float y) { val = _mm_set_ps(y, x, y, x); }
    float4(float x, float y, float z) { val = _mm_set_ps(0.f, z, y, x); }
    float4(float x, float y, float z, float w) { val = _mm_set_ps(w, z, y, x); }
    float4(__m128 v) { val = v; }
};

static INL float4 operator+(const float4& a, const float4& b) { return float4(_mm_add_ps(a.val, b.val)); }
static INL float4 operator-(const float4& a, const float4& b) { return float4(_mm_sub_ps(a.val, b.val)); }
static INL float4 operator*(const float4& a, const float4& b) { return float4(_mm_mul_ps(a.val, b.val)); }
static INL float4 operator/(const float4& a, const float4& b) { return float4(_mm_div_ps(a.val, b.val)); }
static INL float4 csum(const float4 &p)
{
    __m128 r = _mm_add_ps(p.val, _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(p.val), _MM_SHUFFLE(0, 3, 2, 1))));
    return _mm_add_ps(r, _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(r), _MM_SHUFFLE(1, 0, 3, 2))));
}
static INL float4 dot(const float4 &p0, const float4 &p1) { return csum(p0*p1); }
static INL float4 dot(const float4 &p) { return dot(p, p); }
static INL float4 rsqrt(const float4 &x)
{
    #define C0  9.999998e-01f
    #define C1  3.0000002e+00f
    #define C2  .5f
    #define C3  340282346638528859811704183484516925440.f
    __m128 e = _mm_mul_ps(_mm_rsqrt_ps((__m128) x.val), _mm_set_ps(C0, C0, C0, C0));
    e = _mm_min_ps(e, _mm_set_ps(C3, C3, C3, C3));
    return _mm_mul_ps(_mm_mul_ps(e, _mm_set_ps(C2, C2, C2, C2)), _mm_sub_ps(_mm_set_ps(C1, C1, C1, C1), _mm_mul_ps(_mm_mul_ps(x.val, e), e)));
}
static INL float4 normalize(const float4 &v)
{
    return v*rsqrt(dot(v));
}

// --- Functions that don't really make sense, I just kinda randomly
// approximated what we had, without pulling all of it into a repro.
// They don't do anything useful whatsoever; just for testing compile
// time performance.

static INL float4 ident() { return float4(0.f, 0.f, 0.f, 1.f); }
static INL float4 whatever1(const float4 &x, const float4 &y)
{
    return csum(x) / x + y;
}
static INL float4 whatever2(const float4& q1, const float4& q2)
{
    return whatever1(q1 * q2, q2 - q1) * (q1 + q2);
}
static INL float4 whatever3(const float4 &pq, const float4 &mask)
{
    const float c8 = 0.923879532511287f;
    const float s8 = 0.38268343236509f;
    const float g = 5.82842712474619f;

    float4 ch = float4(2) * (normalize(pq) - normalize(mask));
    float4 sh = pq * normalize(ch);
    float4 r = ((g*sh*sh - ch*ch) + sh / float4(s8, s8, s8, c8)) * mask;
    return normalize(r);
}
struct matrix
{
    float4 m0, m1, m2;
};
static INL float4 whateverIteration(matrix &s, int count = 5)
{
    matrix qm;
    float4 q, v = ident();
    for (int iter = 0; iter < count; iter++)
    {
        q = whatever3(s.m0, float4(0, 0, 1, 1));
        v = whatever2(v, q);
        v = normalize(v);
        q = whatever3(s.m1, float4(1, 0, 0, 1));
        v = whatever2(v, q);
        v = normalize(v);
        q = whatever3(s.m2, float4(0, 1, 0, 1));
        v = whatever2(v, q);
        v = normalize(v);
    }
    return v;
}

// Now, kinda "obviously" this function is a bit too large to be force-inlined.
// This is what I'm testing with; having & not having an "INL" on this function.
static INL float4 whateverDecomposition(const matrix &a, const float4 &u, const float4 &v)
{
    float4 r;
    matrix s = a;
    s.m0 = normalize(s.m0) + u;
    s.m1 = normalize(s.m1) * v;
    s.m2 = normalize(s.m2);
    r = whateverIteration(s);
    r = normalize(v) * u + (normalize(u) / v);
    s.m0 = s.m0 + r;
    s.m1 = s.m1 + r;
    s.m2 = s.m2 + r;
    r = whateverIteration(s);
    s.m0 = s.m0 / normalize(r);
    s.m1 = s.m1 / normalize(v+r);
    s.m2 = s.m2 / normalize(v*r);
    r = whateverIteration(s);
    s.m0 = s.m0 * s.m1;
    s.m1 = s.m1 * s.m2 - r;
    s.m2 = s.m2 * s.m0 + r;
    r = whateverIteration(s);
    return r;
}

int main(int argc, const char** argv)
{
    matrix a;
    a.m0 = (float)argv[0][0];
    a.m1 = (float)argc;
    a.m2 = (float)(argv[0][0] - argc);

    float4 u = a.m0;
    float4 v = a.m1;
    float4 e = whateverDecomposition(a, a.m0, a.m1);
    e = e + whateverDecomposition(a, a.m2, normalize(a.m1));
    e = e + whateverDecomposition(a, normalize(a.m2), e);
    e = e + whateverDecomposition(a, e, e);
    e = e + whateverDecomposition(a, normalize(e), normalize(e));
    e = e * whateverDecomposition(a, e, e);
    e = e - whateverDecomposition(a, e, e);
    e = e * whateverDecomposition(a, e, e);
    e = e + whateverDecomposition(a, e, e);
    float4 r = normalize(e);

    return (int)_mm_cvtss_f32(r.val);
}

I checked compile times under different compilers. Visual Studio builds were done with /O2, and clang/gcc builds with -O2, to get an “optimized build”. Everything was done on the same machine (Core i7-5820K); clang/gcc were run via the Windows 10 Linux subsystem thing. In all cases I compiled for the x64 architecture. Between the two timing columns, the only difference in source was removing forced inlining from the single whateverDecomposition function.

| Compiler | Time | Time without force inline | Speedup |
|---|---|---|---|
| MSVC 2010 SP1 | 8.1s | 0.4s | 20x |
| MSVC 2015 Update 3 | 19.8s | 0.6s | 33x |
| MSVC 2017 15.3 | 9.8s | 0.7s | 14x |
| g++ 5.4 | 1.5s | 0.5s | 3x |
| clang 3.8 | 0.5s | 0.3s | 1.7x |

The conclusion seems to be that the Visual Studio compiler is much slower at deep inlining than gcc/clang, and that VS2015 in particular had a compile time regression there somewhere, which is mostly gone in VS2017.

(I would have added all this to my bug report on Microsoft Connect… except I don’t seem to be able to add more comments to it. Sorry MS!)