Off to game jam
Off to local Global Game Jam!
Twitter! Twitter!
Ok, I’m somewhat late to jump onto the latest fad bandwagon, but here it goes: I added a Twitter widget here on the sidebar. I blame Steve Streeting for pushing me over the edge!
Fixed function lighting in vertex shader – how?
Sometime soon I’ll have to implement the fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results, thus requiring depth bias or projection matrix tweaks, which leads to various artifacts that annoy people to hell. I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. The DX9 runtime on Vista’s WDDM also seems to be handing only shaders to the driver internally. Still, for some reason, somewhere the precision does not match…
How should such a task be approached? My requirements are:
I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach: one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize; I could get it down to 190 or so instructions, kill some rcp’s and reduce the amount of constant storage by 2x. Still, it did not handle some things in D3D T&L and had some issues:
Another thing I’m considering is to combine the final shader(s) from assembly fragments, with some simple register allocation. In T&L shader code, there’s only a limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used by lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their results. What is left then is some register allocation. A shader assembly fragment could want some temporary registers for internal use (this is simple, just give it a bunch of unused registers), some registers as input (from previous fragments), and some registers to save its output in. Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into the hardware format. This would mean that I can be quite sloppy with it, i.e. I don’t have to implement some super smart allocation scheme. I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).
Does that make sense? Or did everyone solve those problems eons ago already?
Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html
Quote of the day
Somewhat amusing quote from gamedeff.com:
Preemptive note: Google Translate does not quite cope with it.
ARB_draw_buffers
No, I don’t have any particular point to make. But I did not even get the t-shirt…
Achievement of the week: MakeVistaDWMHappyDance
This was the function that I added:
I know. Reading from the screen when Aero is on is slow, bad and wrong. But then, what do you do? It’s better than users staring at an all-white window just because Vista decided to draw it white, no matter what you think you’re drawing into it. …still,
that Nicholas added a while ago.
Don’t try to outsmart the compiler
The other day at work there was a need to flip an image vertically, in a way that did not bring in large portions of other code that deals with images. Flipping vertically is easy:
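A minimal sketch of such a flip, assuming a tightly packed image with no row padding: swap the first row with the last, the second with the second-to-last, and so on. The `FlipImageVertically` name and the `memswap` signature are assumptions made for illustration (here `memswap` is only a stand-in built on `std::swap_ranges`), not the original listing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in helper: swaps n bytes between two non-overlapping buffers.
// The signature is an assumption made for this sketch.
static void memswap(void* a, void* b, std::size_t n)
{
    unsigned char* p = static_cast<unsigned char*>(a);
    unsigned char* q = static_cast<unsigned char*>(b);
    std::swap_ranges(p, p + n, q);
}

// Flips an image upside down in place by swapping row y with row
// (height - 1 - y). Assumes rows are tightly packed, rowBytes bytes each.
void FlipImageVertically(std::vector<std::uint8_t>& pixels,
                         std::size_t rowBytes, std::size_t height)
{
    assert(pixels.size() == rowBytes * height);
    for (std::size_t y = 0; y < height / 2; ++y)
    {
        std::uint8_t* top = pixels.data() + y * rowBytes;
        std::uint8_t* bottom = pixels.data() + (height - 1 - y) * rowBytes;
        // Top and bottom rows are always distinct, so they never overlap.
        memswap(top, bottom, rowBytes);
    }
}
```

With an odd height the middle row is simply left alone, since the loop only runs over the top half of the image.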
The memswap function was done this way:
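Judging by the disassembly and the "XOR trick" discussed further on, it was a byte-by-byte XOR swap. A rough reconstruction follows; the `MemSwapXor` name and the signature are hypothetical, and this is not the original listing.

```cpp
#include <cassert>
#include <cstddef>

// Swaps n bytes between two buffers using the XOR trick, one byte at a time.
// Hypothetical name and signature, reconstructed for illustration only.
void MemSwapXor(void* a, void* b, std::size_t n)
{
    unsigned char* p = static_cast<unsigned char*>(a);
    unsigned char* q = static_cast<unsigned char*>(b);
    // XOR-ing a byte with itself zeroes it, so the buffers must be distinct.
    assert(p != q);
    while (n--)
    {
        *p ^= *q; // *p now holds (x ^ y)
        *q ^= *p; // *q now holds x
        *p ^= *q; // *p now holds y
        ++p;
        ++q;
    }
}
```

Note that the XOR trick only works on two different locations: if both pointers refer to the same byte, the first XOR already destroys the value.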
The comment above the function was what triggered my interest. I just added:
But then I got interested in this, I just had to check what happens in one or another case. Using Apple's gcc 4.0.1 on Core 2 Duo, the above memory swapping code takes about 12.5 clock cycles per swapped image pixel (pixel = 4 bytes). The inner loop is this:
    movzx eax,BYTE PTR [edx-0x1]
    xor al,BYTE PTR [ecx-0x1]
    mov BYTE PTR [edx-0x1],al
    xor al,BYTE PTR [ecx-0x1]
    mov BYTE PTR [ecx-0x1],al
    xor BYTE PTR [edx-0x1],al
    dec ebx
    inc edx
    inc ecx
    cmp ebx,0xffffffff
    jne loopstart
So the loop is three memory reads, three writes and some increments of the pointers / loop counter. Visual C++ 2008 compiles it very similarly, just uses more complex addressing mode to save one loop counter:
    movzx edx,byte ptr [ecx+eax]
    xor byte ptr [eax],dl
    mov dl,byte ptr [eax]
    xor byte ptr [ecx+eax],dl
    mov dl,byte ptr [ecx+eax]
    xor byte ptr [eax],dl
    dec esi
    inc eax
    test esi,esi
    jne loopstart
What if we don't do this "XOR trick", and just swap the contents using a temporary variable?
    // ...
    char t = *p; *p = *q; *q = t;
    // ...
Lo and behold, now it runs at 7 cycles / pixel (almost twice as fast), and the inner loop is two memory reads and two writes:
    movzx edx,BYTE PTR [ebx-0x1]
    movzx eax,BYTE PTR [ecx-0x1]
    mov BYTE PTR [ebx-0x1],al
    mov BYTE PTR [ecx-0x1],dl
    // ... incrementing pointers / counter here, like in previous case
So yeah. The XOR trick is pretty much useless here - it's twice as slow. Hey, it can even be slower as images get larger - if tested on a 2048x2048 image, regular swap still takes 7 cycles/pixel, but XOR trick takes 55 cycles/pixel!
I guess XOR trick is useful only in quite rare situations, for example when you're inside of some inner loop and want to swap register values without spilling them to memory or using an additional register. Heh, Wikipedia has info on this, so I'm not saying anything new :)
Now of course, if we happen to know that our pixels are 32 bits in size, there's no good reason to keep the loop in bytes. We can operate on integers instead:
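A minimal sketch of the integer version, assuming tightly packed 32-bit pixels and two non-overlapping buffers. `SwapPixels32` is a made-up name, and the timings quoted in this post were not measured on this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Swaps n 32-bit pixels between two buffers, one whole pixel per iteration,
// using a plain temporary instead of the XOR trick.
// Hypothetical name and signature, for illustration only.
void SwapPixels32(std::uint32_t* p, std::uint32_t* q, std::size_t n)
{
    assert(n == 0 || (p != nullptr && q != nullptr));
    while (n--)
    {
        std::uint32_t t = *p;
        *p++ = *q;
        *q++ = t;
    }
}
```

The loop body is the same temporary-variable swap as in the byte version; only the element type changes, so each iteration moves four bytes instead of one.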
This runs at 1.5 cycles/pixel (XOR variant at 2.5 cycles/pixel). The assembly is pretty much the same, just with 32 bit registers.
Another option? If you use STL, just use:
    std::swap_ranges(p, p+n, q);
on the pixel datatype. On 32 bit pixels, this also runs at 1.5 cycles/pixel.
So yeah. Don't try to outsmart the compiler without measuring it.
Cool tech vs. boring details
Some of the stuff I’ve been working on last week:
Boring tiny little details. This probably best summarizes where the lion’s share of time goes when developing anything. I’m not working on some cool spherical harmonics lightmap compression. Or on cunning ways to encode shadow map information for better filtering. Or on using CUDA to compute something interesting. In other words, I’m not working on cool technology. Instead I’m adding missing menu items. Fixing obscure corner cases. Fighting inconsistencies in operating system APIs. Spotting misplaced pixels. Adding missing keyboard shortcuts. Nothing interesting to blog about! But still, methinks the difference between software that is merely “good” and software that is “great” is in the details. And only in the details. I’ll just take care of tons more details. Maybe it will result in something good.
Crunchtime!
A few weeks ago it was all calm in the source control. Now it’s crunchtime! I’m the master of svn deception. I do tons of useless commits just so that the stats look good. Yeah! …ok, back to work.
Windows 7
After the steaming pile of poo that is Windows Vista, it looks like Windows 7 will be something that is done right. Ok, to be fair, Vista has lots of new features and improvements under the hood. Now, I haven’t used them, but a transactional file system, exposed low level APIs to get detailed memory/IO stats, etc. etc. sound like cool & useful stuff. The problem with Vista is that all those core improvements are outweighed by an inconsistent & slow UI and some stupid blunders. Now, Windows 7 seems to be taking on two things: 1) performance and 2) consistency. Building on all the low level improvements done in Vista, and getting the part that is visible to the user right. Yay if Microsoft can pull this off. We’ll see.