I was profiling something, and noticed that HighestBit() was taking suspiciously large amount of time. So I looked at the code.
It had some platform specific implementations, but the cross-platform fallback was this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | |
Not the best implementation of the functionality, but probably not the worst either. Takes three branches, and then a small look-up table.
Notice anything suspicious?
Let’s take a look at the assembly (MSVC 2010, x86).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | |
Ouch. It is creating that look-up table. Each. And. Every. Time.
Well, the code asked for that: const int lut[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3}, so the compiler does exactly what it was told.
Could the compiler be smarter, notice that the table is actually always constant, and put that into the data segment?
“I would if I was a compiler, and I’m not even smart!” The compiler could do this, I guess, but it does not have to. More often
than not, if you’re expecting the compiler to “be smart”, it will do the opposite.
So the above code, it fills the table. This makes the function long enough that the compiler decides to not inline it. And since it’s filling up some table on the stack, MSVC’s “stack protection” code bits come into play (which are on by default), making the code even longer.
I’ve done a quick test and timed how long does this take: for (int i = 0; i < 100000000; ++i) sum += HighestBitRef(i); on a
Core i7-2600K @ 3.4GHz… 565 milliseconds.
The fix? Do not initialize the lookup table each time!
1 2 3 4 5 6 7 | |
Note: I could have just put in a static const int lut[] in the original function code. But that sounds like this might not be thread-safe
(at least similar initialization of more complex objects isn’t; not sure about array initializers). A quick test with MSVC2010 reveals that it is thread-safe, but I wouldn’t want to rely on that.
How much faster now? 298 milliseconds if explicitly non-inlined, 110 ms when inlined. Five times faster by moving one line up!
For completeness sake, using MSVC _BitScanReverse intrinsic (__builtin_clz in gcc), which compiles down to x86 BSR instruction, takes 94 ms in the same test.
So… yeah. Careful with those initializers.