Fixed function lighting in vertex shader – how?

Sometime soon I’ll have to implement the fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results. That in turn requires depth bias or projection matrix tweaks, which lead to various artifacts that annoy people to hell.

I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. The DX9 runtime on Vista’s WDDM also seems to be handing only shaders to the driver internally. Still, for some reason somewhere the precision does not match…

How should such a task be approached?

My requirements are:

  • Should handle any possible state combination in D3D fixed function T&L.
  • D3D 9.0c, using vertex shader 2.0 is ok. For now I don’t care about OpenGL.
  • No HLSL at runtime. I don’t want to add a megabyte or more to the Unity web player just for HLSL. DX9 shader assembly is ok, because we already have the assembler code.
  • Should be as fast as (or close to) the regular fixed function pipeline.

I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach: one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize; I could get it down to 190 or so instructions, kill some rcp’s and reduce the amount of constant storage by 2x.

Still, it did not handle some things in the D3D T&L or had some issues:

  • It assumes one input UV, one output UV and no texture matrices. This part of T&L gets quite convoluted: any input UVs or a texgen mode can be transformed by matrices of various sizes, and routed into any output UVs.
  • It was not using the full T&L lighting model. No biggie here.
  • I haven’t checked with NVShaderPerf or AMD ShaderAnalyzer yet, but last time I checked the static branch instruction was taking two clocks on some NV architecture. So the ubershader approach does not come for free.

Another thing I’m considering is to combine the final shader(s) from assembly fragments, with some simple register allocation.

In T&L shader code, there’s only a limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used by lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their results.

What is left then is some register allocation. A shader assembly fragment could want some temporary registers for internal use (this is simple, just give it a bunch of unused registers), some registers as input (from previous fragments), and some registers to save its output in.

Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into hardware format. This would mean that I can be quite sloppy with it, i.e. don’t have to implement some super smart allocation scheme.
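To make the idea concrete, here is a minimal sketch of such a combiner in Python. It is not the actual experimental code: the `Fragment` class, the fragment names, the `{placeholder}` template syntax, the constant register layout (`c0`, `c4`) and the simplified assembly are all assumptions made up for illustration. Each fragment lists the values it consumes and produces; the combiner emits every needed fragment once, dependencies first, and hands out registers in the sloppiest possible way (outputs are never freed, scratch registers are reused between fragments).

```python
from dataclasses import dataclass

MAX_TEMPS = 12  # vs_2_0 guarantees at least 12 temporary registers


@dataclass
class Fragment:
    name: str
    code: str             # assembly template; {value} and {t0}, {t1}... get registers
    inputs: tuple = ()    # values produced by earlier fragments
    outputs: tuple = ()   # values this fragment leaves in registers
    temps: int = 0        # scratch registers for internal use


# Hypothetical fragment library; the assembly is illustrative only.
LIBRARY = {f.name: f for f in [
    Fragment("eye_pos", "m4x3 {eye_pos}.xyz, v0, c0", outputs=("eye_pos",)),
    Fragment("eye_normal", "m3x3 {eye_normal}.xyz, v1, c4", outputs=("eye_normal",)),
    Fragment("view_vec", "nrm {t0}.xyz, {eye_pos}\nmov {view_vec}.xyz, -{t0}",
             inputs=("eye_pos",), outputs=("view_vec",), temps=1),
    Fragment("fog", "mov oFog, {eye_pos}.z", inputs=("eye_pos",)),
    Fragment("texgen_reflection",
             "dp3 {t0}.w, {eye_normal}, {view_vec}\n"
             "add {t0}.w, {t0}.w, {t0}.w\n"
             "mad oT0.xyz, {t0}.w, {eye_normal}, -{view_vec}",
             inputs=("eye_normal", "view_vec"), temps=1),
]}


def combine(wanted, library):
    """Return shader body text for the wanted fragments plus their dependencies."""
    producers = {v: f for f in library.values() for v in f.outputs}
    ordered, seen = [], set()

    def visit(frag):
        if frag.name in seen:
            return  # shared computation is emitted only once
        seen.add(frag.name)
        for value in frag.inputs:
            visit(producers[value])
        ordered.append(frag)

    for name in wanted:
        visit(library[name])

    regs, next_reg, lines = {}, 0, []
    for frag in ordered:
        names = {v: regs[v] for v in frag.inputs}
        for v in frag.outputs:          # outputs stay allocated forever
            regs[v] = names[v] = f"r{next_reg}"
            next_reg += 1
        for i in range(frag.temps):     # scratch registers are not reserved
            names[f"t{i}"] = f"r{next_reg + i}"
        if next_reg + frag.temps > MAX_TEMPS:
            raise RuntimeError("out of temporary registers")
        lines.append(frag.code.format(**names))
    return "\n".join(lines)


print(combine(["fog", "texgen_reflection"], LIBRARY))
```

Asking for fog and reflection texgen pulls in the position transform only once, even though both fog and the view vector depend on it. A real combiner would also need input declarations, output position and the version header, which are left out here.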

I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).

Does that make sense? Or did everyone solve those problems eons ago already?

Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html

9 Responses to 'Fixed function lighting in vertex shader – how?'

  1. Michael Daum

    Not sure if you’re interested, but the ftransform() function in GLSL solves exactly this problem, providing the correct transform which will allow one to mix GLSL passes with GL fixed pipeline passes. Plus, you’ll have portable code as a bonus after you switch from D3D!!

  2. Simon Kozlov

    I’m not sure if it helps, but FYI – Vista doesn’t use the fixed function pipeline at all (for WDDM drivers, which are everywhere); it’s all implemented via vertex/pixel shaders, so you shouldn’t see the difference.

  3. Jack Palevich

    I’ve done something like this for Xbox 360, and it worked out pretty well. I think most Windows graphics drivers try pretty hard to optimize the shaders that are handed to them, so you probably don’t need to do much optimization on your own. In fact your optimizations might even be counterproductive, because they might prevent the driver from doing some of its own optimizations.

    The Xbox 360, of course, does no run-time shader optimization. At least it didn’t back when I was programming it.

    One problem you may run into is that for DX9 99% of games use HLSL rather than hand-written assembly, which means that the graphics drivers are only being tested and optimized for the kind of assembly that HLSL puts out. If you start feeding in your own assembly, you may find that you get sub-optimal results or even expose bugs in the drivers. (For example, there are some opcodes which HLSL never generates. If you start using those opcodes you may find the driver doesn’t handle them correctly.)

    I highly recommend writing your shader generator to produce HLSL as well as assembly, so that you can compare the relative performance (and bugs). Also, you might find that the HLSL generation time isn’t too slow.

    You will probably learn a lot about the fixed function pipeline. I know I did. :-)

  4. Jack Palevich

    Oh, and as for “why” it happens, the answer is that floating point math operations are not associative or commutative, but the HLSL compiler (and sometimes the graphics drivers) pretend that they are.

    So it’s not really a fixed function vs HLSL issue, it’s a “two shaders that do the same operations in a different order give slightly different results” issue. You can sometimes run into it with two similar HLSL shaders that end up being optimized differently.

    Skipping HLSL and using assembly fragments can help, but an aggressive driver can still decide to reorder the instructions in your shaders, leading to differences in results.

  5. Aras Pranckevičius

    @Michael: thanks, but using GLSL does not really help under DX9 (why DX9? Because D3D is way more stable than OpenGL on Windows). Also, ftransform() does not work as expected in some OpenGL implementations. E.g. on Apple platforms, ftransform() on some ATI cards does not work like it should – it produces a different result than the fixed function pipe.

    @Simon: I know. But I still have quite a large chunk of XP installs to support that won’t go away soon.

    @Jack: thanks for the info. I guess I’ll just work on this further and see what happens.

  6. Sean Barrett

    Using HLSL instead of assembly requires an extra megabyte? How so? Does that stuff live in a library that you can’t link in separately? Or does it not get trimmed out automatically if you don’t call it? I’d have assumed that was all DLLs at least.

  7. Aras Pranckevičius

    @Sean: HLSL compiler is in D3DX (not in core D3D runtime). Nowadays it’s in a DLL, so that’s 3MB or so (we can’t rely on the right DLL being present on user’s machine, hence we’d need to include D3DX redist in our own installer). Back in the old days (SDKs from 2004), D3DX was a static library, so if you weren’t using HLSL, the linker would strip it out.

  8. Sean Barrett

    Aha, makes sense, thanks.

  9. Pat Wilson

    Instead of implementing an ubershader, I implemented 3 very simple and small shaders which cover most simple cases: VertexColor, Texture, VertexColorModTexture, VertexColorAddTexture. This covers debug rendering, UI elements, simple particle systems, and such. Everything else must use a proper Material to render, in which case the Material system will take care of generating a shader, or using fixed-function to render the object.

    I don’t favor the ubershader-FF approach because, in my opinion, it continues to reinforce the entire fixed-function concept. Providing developers with a way to interact with a device in a deprecated way (like fixed function) is convenient, but it’s the reason OpenGL’s API is the way it is now.
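On the instruction ordering point raised in comment 4, a small Python sketch shows how the same dot product can give different results when its terms are accumulated in a different order. The values are deliberately extreme and made up for illustration, and Python uses 64-bit floats while shaders work in 32-bit, so this only shows the kind of effect, not its real-world magnitude.

```python
def dot(row, vec, order):
    """Dot product that accumulates its terms in the given order."""
    acc = 0.0
    for i in order:
        acc += row[i] * vec[i]
    return acc


# Hypothetical matrix row and vertex, chosen so that rounding is visible.
row = [1e17, 1.0, -1e17, 0.0]
vec = [1.0, 1.0, 1.0, 1.0]

straight = dot(row, vec, (0, 1, 2, 3))   # the 1.0 is absorbed by 1e17 and lost
reordered = dot(row, vec, (0, 2, 1, 3))  # the large terms cancel first

print(straight, reordered)  # 0.0 1.0
```

Two shaders (or a shader and a fixed function path) that evaluate the same transform with differently ordered instructions can disagree in the same way, just by much smaller amounts.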
