Archive for 'code'

Fixed function lighting in vertex shader – how?

Sometime soon I’ll have to implement fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results, thus requiring depth bias or projection matrix tweaks, which leads to various artifacts that annoy people to hell.

I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. DX9 runtime on Vista’s WDDM also seems to be only handling shaders to the driver internally. Still, for some reason somewhere the precision does not match…

How such a task should be approached?

My requirements are:

  • Should handle any possible state combination in D3D fixed function T&L.
  • D3D 9.0c, using vertex shader 2.0 is ok. For now I don’t care about OpenGL.
  • No HLSL at runtime. I don’t want to add a megabyte or more to Unity web player just for HLSL. DX9 shader assembly is ok, because we already have the assembler code.
  • Should work as fast (or close to) as the regular fixed function pipeline.

I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach; one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize, I could get it down to 190 or so instructions, kill some rcp‘s and reduce the amount of constant storage by 2x.

Still, it did not handle some things in the D3D T&L or had some issues:

  • It assumes one input UV, one output UV and no texture matrices. This place in T&L gets quite convoluted – any input UVs or a texgen mode can be transformed by matrices of various sizes, and routed into any output UVs.
  • It was not using full T&L lighting model. No biggie here.
  • I haven’t checked with NVShaderPerf or AMD ShaderAnalyzer yet, but last time I checked the static branch instruction was taking two clocks on some NV architecture. So ubershader approach does not come for free.

Another thing I’m considering, is to combine final shader(s) from assembly fragments, with some simple register allocation.

In T&L shader code, there’s only limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their result.

What is left then is some register allocation. A shader assembly fragment could want some temporary registers for internal use (this is simple, just give it a bunch of unused registers), also want some registers as input (from previous fragments), and save some output in registers.

Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into hardware format. This would mean that I can be quite sloppy with it, i.e. don’t have to implement some super smart allocation scheme.

I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).

Does that make sense? Or did everyone solve those problems eons ago already?

Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html

Achievement of the week: MakeVistaDWMHappyDance

This was the function that I added:

void GUIView::MakeVistaDWMHappyDance()
{
    // Looks like Vista has some bug in DWM. Whenever we maximize or dock
    // a view, we must do something magic, otherwise
    // white stuff appears in place of the view.
    // See http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=4208117&SiteID=1

    bool earlierThanVista = systeminfo::GetOperatingSystemNumeric() < 600;
    if( earlierThanVista )
        return;

    // What seems to work is drawing one pixel via GDI.
    // We draw it at (1,1) with usual background color.
    int grayColor = 0.61f * 255.0f;
    PAINTSTRUCT ps;
    BeginPaint(m_View, &ps);
    SetPixel(ps.hdc, 1, 1, RGB(grayColor,grayColor,grayColor));
    EndPaint(m_View, &ps);
}

I know. Reading from screen when Aero is on is slow, bad and wrong. But then, what do you do? It’s better than users staring an all-white window just because Vista decided to draw it white, no matter what you think you’re drawing into it.

…still, MakeVistaDWMHappyDance is not nearly as cool as

internal interface ICanHazCustomMenu { … }

that Nicholas added a while ago.

Don’t try to outsmart the compiler

The other day at work there was a need to flip an image vertically, in a way that did not bring large portions of other code that deals with images. Flipping vertically is easy:

for( int y = 0; y < height/2; ++y ) {
    memswap( img+y*width, img+(height-y-1)*width, width*img(arr[0]) );
}

memswap function was done this way:

// why isnt this in the std lib?
// using XOR to avoid tmp var
void memswap( void* m1, void* m2, size_t n )
{
    char *p = (char*)m1; char *q = (char*)m2;
    while ( n-- ) {
        *p ^= *q; *q ^= *p; *p ^= *q;
        p++; q++;
    }
}

The comment above the function was what triggered my interest. I just added:

// because it can be slower (local variable is likely in register;
// whereas using XOR involves reads/writes to memory)

But then I got interested in this, I just had to check what happens in one or another case.

Using Apple's gcc 4.0.1 on Core 2 Duo, the above memory swapping code takes about 12.5 clock cycles per swapped image pixel (pixel = 4 bytes). The inner loop is this:

movzx  eax,BYTE PTR [edx-0x1]
xor    al,BYTE PTR [ecx-0x1]
mov    BYTE PTR [edx-0x1],al
xor    al,BYTE PTR [ecx-0x1]
mov    BYTE PTR [ecx-0x1],al
xor    BYTE PTR [edx-0x1],al
dec    ebx
inc    edx
inc    ecx
cmp    ebx,0xffffffff
jne    loopstart

So the loop is three memory reads, three writes and some increments of the pointers / loop counter. Visual C++ 2008 compiles it very similarly, just uses more complex addressing mode to save one loop counter:

movzx       edx,byte ptr [ecx+eax]
xor         byte ptr [eax],dl
mov         dl,byte ptr [eax]
xor         byte ptr [ecx+eax],dl
mov         dl,byte ptr [ecx+eax]
xor         byte ptr [eax],dl
dec         esi
inc         eax
test        esi,esi
jne         loopstart

What if we don't do this "XOR trick", and just swap the contents using a temporary variable?

// ...
char t = *p; *p = *q; *q = t;
// ...

Lo and behold, now it runs at 7 cycles / pixel (almost twice as fast), and the inner loop is two memory reads and two writes:

movzx  edx,BYTE PTR [ebx-0x1]
movzx  eax,BYTE PTR [ecx-0x1]
mov    BYTE PTR [ebx-0x1],al
mov    BYTE PTR [ecx-0x1],dl
// ... incrementing pointers / counter here, like in previous case

So yeah. The XOR trick is pretty much useless here - it's twice as slow. Hey, it can even be slower as images get larger - if tested on a 2048x2048 image, regular swap still takes 7 cycles/pixel, but XOR trick takes 55 cycles/pixel!

I guess XOR trick is useful only in quite rare situations, for example when you're inside of some inner loop and want to swap register values without spilling them to memory or using an additional register. Heh, Wikipedia has info on this, so I'm not saying anything new :)

Now of course, if we happen to know that our pixels are 32 bits in size, there's no good reason to keep the loop in bytes. We can operate on integers instead:

void memswapI( void* m1, void* m2, size_t n )
{
    size_t nn = n/sizeof(int);
    int *p = (int*)m1; int *q = (int*)m2;
    while ( nn-- ) {
        int t = *p; *p = *q; *q = t;
        p++; q++;
    }
}

This runs at 1.5 cycles/pixel (XOR variant at 2.5 cycles/pixel). The assembly is pretty much the same, just with 32 bit registers.

Another option? If you use STL, just use:

std::swap_ranges(p, p+n, q);

on the pixel datatype. On 32 bit pixels, this also runs at 1.5 cycles/pixel.

So yeah. Don't try to outsmart the compiler without measuring it.

Implicit to-pointer operators must die!

For the sake of the nation,
this operator must die!

Seriously. Suppose there is some class, let’s say ColorRGBAf. That has four floats inside. Now, someone at some point decided to add this operator to it:

operator float* () { /**/ }
operator const float* () const { /**/ }

Probably because it’s easier to pass color to OpenGL this way, or something like that.

This is evil. Like, really evil. Especially if that class did not have comparison operators defined, and some totally unrelated code four years later does:

if (color != oldColor) { /* … */ }

Ouch! Sounds like someone will spend four hours debugging something that looks like an event routing issue that only happens on Windows and only with optimizations on (yes, I just did that…).

What happens here? The compiler takes pointers to two colors and compares the pointers. If for some reason both colors are temporary objects, then it can even happen that both get folded into the same variable/register/whatnot. The pointers are the same. Ouch!

Implicit “nice” operators are just disguised evil. Remove that operator, add something like GetPointer() to class if someone really wants to use that, and better even make the comparison operators private and without implementations. Yes. Much better.

How watchdog threads should NOT be done…

Here, a thread function that checks whether some tool got stuck:

static void WatchdogFunc()
{
    while( true )
    {
        time_t now = time(NULL);
        Mutex::AutoLock lock(g_WatchdogMutex);
        if( now - g_StartTime > kWatchdogTimeout )
            ComplainLoudlyAndDoSomething();
        Thread::Sleep( 0.1f );
    }
}

Mutex is taken because g_StartTime can be occasionally updated by the same tool. Yes, possibly a mutex is an overkill here, and aligned variable + some memory fences should be enough (or just nothing), but hey, this is some random offline tool code.

What is horribly wrong with it?

Mutex is held locked for the whole duration of Sleep! That is, almost all the time; and other thread(s) barely have a chance to ever update g_StartTime.

And this is the code I’ve written. Oh stupid me.

It must be a bug in OS/compiler/…

Ever looked at the code which is absolutely correct, yet runs incorrectly? Sometimes it looks like a genuine compiler bug. “I swear, mister! The compiler corrupts my code!”

Look again. And again. Eventually you’ll find where your code is broken.

(Of course, in some cases quite often the compiler is broken… GLSL, anyone?)

Pimp my code, part 15: The Greatest Bug of All says the above in a much nicer way:

Maybe the problem was there was some huge bug in Apple’s Mach, where if you open too many files in a short period of time, the filesystem tried to, like, cache the results, and the cache blew up, and as a result the filesystem incorrectly just would fail to open any more files, instead of flushing the cache.

I’ve also been around long enough to know that whenever I know the operating system must be bugged, since my code is correct, I should take a damn close look at my code. The old adage (not mine) is that 99% of the time operating system bugs are actually bugs in your program, and the other 1% of the time they are still bugs in your program, so look harder, dammit.

A post well worth reading… about the process of investigating tricky bugs. And sincere as well. It’s so good that I’ll just quote it again:

It’s a bug we should have caught. We should have spent the time to get the images in the 10,000 item file. I messed up.

Software is written by humans. Humans get tired. Humans become discouraged. They aren’t perfect beings. As developers, we want to pretend this isn’t so, that our software springs from our head whole and immaculate like the goddess Athena. Customers don’t want to hear us admit that we fail.

The measure of a man cannot be whether he ever makes mistakes, because he will make mistakes. It’s what he does in response to his mistakes. The same is true of companies.

We have to apologize, we have to fix the problem, and we have to learn from our mistakes.

So very true.

Depth bias and the power of deceiving yourself

In Unity we very often mix fixed function and programmable vertex pipelines. In our lighting model, some amount of brightest lights per object are drawn in pixel lit mode, and the rest are drawn using fixed function vertex lighting. Naturally the pixel lights most often use vertex shaders, as they want to calculate some texcoords for light cookies, or do something with tangent space, or calculate some texcoords for shadow mapping, and so on. The vertex lighting pass uses fixed function, because it’s the easiest way. It is possible to implement fixed function lighting equivalent in vertex shaders, but we haven’t done that yet because of complexities of Direct3D and OpenGL, the need to support shader model 1.1 and various other issues. Call me lazy.

And herein lies the problem: most often precision of vertex transformations is not the same in fixed function versus programmable vertex pipelines. If you’d just draw some objects in multiple passes, mixing fixed function and programmable paths, this is roughly what you will get (excuse my programmer’s art):
Mixing fixed function and vertex shaders

Not pretty at all! This should have looked like this:
All good here

So what do we do to make it look like this? We “pull” (bias) some rendering passes slighly towards the camera, so there is no depth fighting.

Now, at the moment Unity editor runs only on the Macs, which use OpenGL. In there, most of hardware configurations do not need this depth bias at all – they are able to generate same results in fixed function and programmable pipelines. Only Intel cards do need the depth bias on Mac OS X (on Windows, AMD and Intel cards need depth bias). So people author their games using OpenGL, where it does not need depth bias in most cases.

How do you apply depth bias in OpenGL? Enable GL_POLYGON_OFFSET_FILL and set glPolygonOffset to something like -1, -1. This works.

How do you apply depth bias in Direct3D 9? Conceptually, you do the same. There are DEPTHBIAS and SLOPESCALEDEPTHBIAS render states that do just that. And so we did use them.

And people complained about funky results on Windows.

And I’d look at their projects, see that they are using something like 0.01 for camera’s near plane and 1000.0 for the far plane, and tell them something along the lines of “increase your near plane, stupid!” (well ok, without the “stupid” part). And I’d explain all the above about mixing fixed function and vertex shaders, and how we do depth bias in that case, and how on OpenGL it’s often not needed but on Direct3D it’s pretty much always needed. And yes, how sometimes that can produce “double lighting” artifacts on close or intersecting geometry, and how the only solution is to increase the near plane and/or avoid close or intersecting geometry.

Sometimes this helped! I was so convinced that their too-low-near-plane was always the culprit.

And then one day I decided to check. This is what I’ve got on Direct3D:
Depth bias artefacts

Ok, this scene is intentionally using a low near plane, but let me stress this again. This is what I’ve got:
Epic fail!

Not good at all.

What happened? It happened in roughly this way:

  1. First, depth bias documentation on Direct3D is wrong. Depth bias is not in 0..16 range, it is in 0..1 range which corresponds to entire range of depth buffer.
  2. Back then, our code was always using 16 bit depth buffers, so the equivalent of -1,-1 depth bias in OpenGL was multiplied with something like 1.0/65535.0, and that was fed into Direct3D. Hey, it seemed to work!
  3. Later on, the device setup code was modified to do proper format selection, so most often it ended up using 24 bit depth buffer. Of course no one I never modified the depth bias code to account for this change…
  4. And it stayed there. And I kept deceiving myself that the content of the users is to blame, and not some stupid code of mine.

It’s good to check your assumptions once in a while.

So yeah, the proper multiplier for depth bias on Direct3D with 24 bit depth buffer should be not 1.0/65535.0, but something like 1.0/(2^24-1). Except that this value is really small, so something like 4.8e-7 should be used instead (see Lengyel’s GDC2007 talk). Oh, but for some reason it’s not really enough in practice, so something like 2.0*4.8e-7 should be used instead (tested so far on GeForce 8600, Radeon HD 3850, Radeon 9600, Intel 945, reference rasterizer). Oh, and the same value should be used even when a 16 bit depth buffer is used; using 1.0/65535.0 multiplier with 16 bit depth buffer produces way too large bias.

With proper bias values the image is good on Direct3D again. Yay for that (fix is coming in Unity 2.1 soon).

…and yes, I know that real men fudge projection matrix instead of using depth bias… someday maybe.

Argh MFC!

When introductory documentation for something has this, you know it won’t be pretty:

CAsyncMonikerFile is derived from CMonikerFile, which in turn is derived from COleStreamFile. A COleStreamFile object represents a stream of data; a CMonikerFile object uses an IMoniker to obtain the data, and a CAsyncMonikerFile object does so asynchronously.

So yeah, I am dealing with downloading something from the internet inside an ActiveX control that is written in MFC. A seemingly simple task – I give you an URL, you give me back the bytes. But no! That would not be a proper architecture, so instead it has asynchronous monikers which are based on monikers which are based on stream files which use some interfaces and whatnot. And for ActiveX controls the docs suggest using CDataPathProperty or CCachedDataPathProperty, which are abstractions build on top of the above crap. And I don’t even know what “a moniker” is!

Of course all this complexity fails spectacularly in some quite common situations. For example, try downloading something when the web server serves gzip compressed html output. Good luck trying to figure out why everything seemingly works, you are notified of downloading progress, but never get the actual downloaded bytes.

Turns out the solution is to change downloading behaviour of the above pile of abstractions to use “pull data” model, instead of default “push data” model. The default behaviour just seems to be broken (though it is not broken in that pile of abstractions, instead it is broken somewhere deeper in Windows code). Is this mentioned anywhere in the docs? Of course not!

This is pretty much how a code comment looks like for this:

We don’t use CCachedDataPathProperty because it’s awfully slow, doing data reallocations for each 1KB received. For 8MB file it’s 8000 reallocations and 32 GB (!) of data copied for no good reason!

While we’re at it, we don’t use CDataPathProperty either, because it’s a useless wrapper over CAsyncMonikerFile.

Oh, and we don’t use CAsyncMonikerFile either, because it has bugs in VS2003′ MFC where it never notifies the container that it is done with download, making IE still display “X items remaining” indefinitely. Some smart coder was converting information message and returning “out of memory” error if result was NULL, even if input message was NULL (which it often was). So we use our own “fixed” version of CAsyncMonikerFile instead.

Oh MFC, how we love thee.

On job titles and design patterns

I just changed my job title to say “Code Chef“. I like it, and it represents my current understanding of programming pretty well. I cook code. That’s my job.

Some N years ago I would have liked a title with “Architect” or “Analyst” or something like that. I would have called myself “developer” instead of “programmer” because hey, a developer thinks up things, whereas a programmer is a mere “code monkey”. More on code monkeys below.

But wait! Back then I also believed that knowing and using Design Patterns is essential for a programmer! In one place when I was interviewing new hires, design pattern knowledge was something I would look for… how stupid! Nowadays my view of patterns is more along the lines of “yeah, whatever”. I don’t exactly think of them as things from hell, but they could have caused more harm than good already.

Back to job titles. Code monkey is actually the key employee. A software product is largely defined by the code, heck, it is code. Sure, it also has the user interface, the fancy icons, the documentation, the website, the support, the roadmap and whatnot, but the code is the product, whereas everything else is more or less addons (possibly excluding UI… UI also defines the product).

Code design? Design patterns? Who cares about that.

It’s the final result that matters. Futurist programming for the win.

On the other hand, Memento Observer is probably very cool.

Tricky bugs: peculiarities of dynamic linking, and magic divisions

After wasting nearly two days on some really funky animation import crash, I checked in a code change with this log message:

Fix FBX animation import crash once more. When exported symbols are not listed for a dylib, it seems to link back to calling executable (?!), making them share function impls with the same name. And because Keyframe is actually different in editor vs ImportFBX, this is wrong. Apparently this is OS X Leopard only, or something. Argh.

The code change in question was just telling the compiler “here’s the list of the functions that are exported from this dynamic library”. The list was already there, just the compiler was never told about existence of it.

The bug manifested itself as a crash when importing animations. But it would not happen when importer was run from a small unit test application. There were no memory corruptions happening, it was not running out of memory, yet the code was crashing with access violation, usually because STL’s vector was returning it’s wrong size (but the actual data of the vector was correct; it was just returning bogus size). And it was doing that only on OS X Leopard, and not on OS X Tiger. Huh?

Turns out what did happen – and I’m not sure if that’s a bug in OS X or a feature – is that the calling application did contain a class called Keyframe. And the shared library (where the crash was happening) also contained a class called Keyframe. But those classes were slightly different; first was 20 bytes in size, and second one was 16 bytes.

Now, somehow when the shared library was calling vector<Keyframe>::size(), the function from the calling application was used. I have no idea at all how or why this was happening, but it sure was! I could see from tracing the assembly code, that it was doing difference of two pointers, and then doing something that for sure was not division by 16.

What was the code doing? Turns out it was calculating division by 20 in a cunning way:

mov  edx,esi   # edx = end()
sub  edx,eax   # edx -= begin()
mov  eax,edx   # eax = edx
sar  eax,0x2   # eax >>= 2
imul eax,eax,0xcccccccd # eax *= 0xcccccccd

In other words, the compiler was replacing division by constant (as used in vector’s size()) by a shift and multiplication with a magic number. You can read more about the technique here or here.

But of course the code above only works if the number was actually divisible by 20; otherwise it returns totally wrong result. This is perfectly fine for computing the difference in two pointers to structures of known size… Except that inside the shared library the Keyframe structures are 16 bytes, and not 20!

So yeah. Watch out for peculiarities of dynamic linking on your platform.