Archive for 'd3d'

Implementing fixed function T&L in vertex shaders

Almost half a year ago I was wondering how to implement T&L in vertex shaders.

Well, finally I implemented it for the upcoming Unity 2.6. I wrote a technical report of sorts here: http://aras-p.info/texts/VertexShaderTnL.html

In short, I’m combining assembly fragments and doing simple temporary register allocation, which seems to work quite well. Performance is very similar to using fixed function (I know it’s implemented as vertex shaders internally by the runtime/driver) on several different cards I tried (Radeon HD 3xxx, GeForce 8xxx, Intel GMA 950).

What was unexpected: the most complex piece is not the vertex lighting! Most of the complexity is in how to route/generate texture coordinates and transform them. Huge combinatorial explosion there.

Otherwise – I like! Here’s the link to the article again: http://aras-p.info/texts/VertexShaderTnL.html

Fixed function lighting in vertex shader – how?

Sometime soon I’ll have to implement the fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results, thus requiring depth bias or projection matrix tweaks, which leads to various artifacts that annoy people to hell.

I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. The DX9 runtime on Vista’s WDDM also seems to hand only shaders to the driver internally. Still, for some reason somewhere the precision does not match…

How should such a task be approached?

My requirements are:

  • Should handle any possible state combination in D3D fixed function T&L.
  • D3D 9.0c, using vertex shader 2.0 is ok. For now I don’t care about OpenGL.
  • No HLSL at runtime. I don’t want to add a megabyte or more to Unity web player just for HLSL. DX9 shader assembly is ok, because we already have the assembler code.
  • Should run as fast as the regular fixed function pipeline, or close to it.

I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach: one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize; I could get it down to 190 or so instructions, kill some rcp’s and reduce the amount of constant storage by 2x.

Still, it did not handle some things in the D3D T&L or had some issues:

  • It assumes one input UV, one output UV and no texture matrices. This place in T&L gets quite convoluted – any input UVs or a texgen mode can be transformed by matrices of various sizes, and routed into any output UVs.
  • It was not using full T&L lighting model. No biggie here.
  • I haven’t checked with NVShaderPerf or AMD ShaderAnalyzer yet, but last time I checked the static branch instruction was taking two clocks on some NV architecture. So ubershader approach does not come for free.
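To get a feel for why the texture coordinate part gets convoluted, here is a back-of-the-envelope count in Python. The numbers are my reading of the D3D9 texture stage states (texcoord index or texgen mode, texture transform flags), not something taken from the ATI sample, and the count is a rough upper bound that ignores combinations that make no sense in practice.

```python
# Rough upper bound on fixed function texcoord setups in D3D9.
# All counts below are assumptions based on the D3D9 state enumerations.
UV_INPUTS = 8      # texcoord sets a vertex can carry
TEXGEN_MODES = 4   # camera space normal, position, reflection vector, sphere map
MATRIX_MODES = 5   # texture transform disabled, or 1 to 4 output components
PROJECTED = 2      # projective divide flag on or off
STAGES = 8         # output texcoords, one per texture stage


def per_stage_combinations():
    """Ways to feed a single output UV: source x matrix size x projection."""
    sources = UV_INPUTS + TEXGEN_MODES
    return sources * MATRIX_MODES * PROJECTED


def total_combinations():
    """Every stage picks its setup independently of the others."""
    return per_stage_combinations() ** STAGES


if __name__ == "__main__":
    print(per_stage_combinations())
    print(total_combinations())
```

Even if most of those combinations never show up in real content, precompiling one shader per combination is clearly out of the question.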

Another thing I’m considering is to combine the final shader(s) from assembly fragments, with some simple register allocation.

In T&L shader code, there’s only a limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used by lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their results.

What is left then is some register allocation. A shader assembly fragment could want some temporary registers for internal use (this is simple, just give it a bunch of unused registers), also want some registers as input (from previous fragments), and save some output in registers.
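A minimal sketch of such a combiner, in Python for brevity. Everything in it is hypothetical: the fragment names, the `$name` placeholder syntax, the constant register numbers and the assembly snippets are made up for illustration and are not the actual code. It pulls in the fragments that produce the needed inputs, emits each one once, and hands out `r#` registers in the sloppiest way that works.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    inputs: tuple    # named values produced by earlier fragments
    outputs: tuple   # named values this fragment leaves in registers
    temps: int       # scratch registers needed internally
    code: str        # assembly with $value / $tmpN placeholders


# Hypothetical fragments; the instruction text is illustrative only.
FRAGMENTS = {
    "world_pos": Fragment("world_pos", (), ("wpos",), 0,
                          "m4x3 $wpos.xyz, v0, c4"),
    "view_vec": Fragment("view_vec", ("wpos",), ("view",), 1,
                         "add $tmp0.xyz, c12, -$wpos\nnrm $view.xyz, $tmp0"),
    "fog": Fragment("fog", ("wpos",), (), 0,
                    "dp3 oFog, $wpos, c13"),
}


def combine(wanted, num_registers=12):
    """Emit one shader body: dependencies first, then assign r# registers."""
    producers = {out: f for f in FRAGMENTS.values() for out in f.outputs}
    ordered, seen = [], set()

    def visit(frag):
        if frag.name in seen:
            return
        seen.add(frag.name)
        for value in frag.inputs:      # fragments that produce our inputs go first
            visit(producers[value])
        ordered.append(frag)

    for name in wanted:
        visit(FRAGMENTS[name])

    free = [f"r{i}" for i in reversed(range(num_registers))]
    assigned, lines = {}, []
    for frag in ordered:
        for value in frag.outputs:     # outputs keep their register until the end
            assigned[value] = free.pop()
        scratch = [free.pop() for _ in range(frag.temps)]
        regs = dict(assigned)
        regs.update({f"tmp{i}": r for i, r in enumerate(scratch)})
        lines.append(re.sub(r"\$(\w+)", lambda m: regs[m.group(1)], frag.code))
        free.extend(reversed(scratch)) # scratch registers are reusable right away
    return "\n".join(lines)


if __name__ == "__main__":
    print(combine(["view_vec", "fog"]))
```

Asking for `view_vec` and `fog` emits the world position fragment only once, and both later fragments read its result. Output registers are never released here, which is exactly the kind of sloppiness that is only acceptable if the driver does its own allocation later.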

Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into hardware format. This would mean that I can be quite sloppy with it, i.e. don’t have to implement some super smart allocation scheme.

I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).

Does that make sense? Or did everyone solve those problems eons ago already?

Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html

Depth bias and the power of deceiving yourself

In Unity we very often mix fixed function and programmable vertex pipelines. In our lighting model, some amount of brightest lights per object are drawn in pixel lit mode, and the rest are drawn using fixed function vertex lighting. Naturally the pixel lights most often use vertex shaders, as they want to calculate some texcoords for light cookies, or do something with tangent space, or calculate some texcoords for shadow mapping, and so on. The vertex lighting pass uses fixed function, because it’s the easiest way. It is possible to implement fixed function lighting equivalent in vertex shaders, but we haven’t done that yet because of complexities of Direct3D and OpenGL, the need to support shader model 1.1 and various other issues. Call me lazy.

And herein lies the problem: most often precision of vertex transformations is not the same in fixed function versus programmable vertex pipelines. If you’d just draw some objects in multiple passes, mixing fixed function and programmable paths, this is roughly what you will get (excuse my programmer’s art):
[Image: Mixing fixed function and vertex shaders]

Not pretty at all! This should have looked like this:
[Image: All good here]

So what do we do to make it look like this? We “pull” (bias) some rendering passes slightly towards the camera, so there is no depth fighting.

Now, at the moment the Unity editor runs only on Macs, which use OpenGL. There, most hardware configurations do not need this depth bias at all – they are able to generate the same results in fixed function and programmable pipelines. Only Intel cards need the depth bias on Mac OS X (on Windows, AMD and Intel cards need depth bias). So people author their games using OpenGL, where depth bias is not needed in most cases.

How do you apply depth bias in OpenGL? Enable GL_POLYGON_OFFSET_FILL and set glPolygonOffset to something like -1, -1. This works.

How do you apply depth bias in Direct3D 9? Conceptually, you do the same. There are DEPTHBIAS and SLOPESCALEDEPTHBIAS render states that do just that. And so we did use them.

And people complained about funky results on Windows.

And I’d look at their projects, see that they are using something like 0.01 for camera’s near plane and 1000.0 for the far plane, and tell them something along the lines of “increase your near plane, stupid!” (well ok, without the “stupid” part). And I’d explain all the above about mixing fixed function and vertex shaders, and how we do depth bias in that case, and how on OpenGL it’s often not needed but on Direct3D it’s pretty much always needed. And yes, how sometimes that can produce “double lighting” artifacts on close or intersecting geometry, and how the only solution is to increase the near plane and/or avoid close or intersecting geometry.

Sometimes this helped! I was so convinced that their too-low-near-plane was always the culprit.

And then one day I decided to check. This is what I’ve got on Direct3D:
[Image: Depth bias artefacts]

Ok, this scene is intentionally using a low near plane, but let me stress this again. This is what I’ve got:
[Image: Epic fail!]

Not good at all.

What happened? It happened in roughly this way:

  1. First, depth bias documentation on Direct3D is wrong. Depth bias is not in 0..16 range, it is in 0..1 range which corresponds to entire range of depth buffer.
  2. Back then, our code was always using 16 bit depth buffers, so the equivalent of -1,-1 depth bias in OpenGL was multiplied with something like 1.0/65535.0, and that was fed into Direct3D. Hey, it seemed to work!
  3. Later on, the device setup code was modified to do proper format selection, so most often it ended up using a 24 bit depth buffer. Of course I never modified the depth bias code to account for this change…
  4. And it stayed there. And I kept deceiving myself that the content of the users is to blame, and not some stupid code of mine.

It’s good to check your assumptions once in a while.

So yeah, the proper multiplier for depth bias on Direct3D with 24 bit depth buffer should be not 1.0/65535.0, but something like 1.0/(2^24-1). Except that this value is really small, so something like 4.8e-7 should be used instead (see Lengyel’s GDC2007 talk). Oh, but for some reason it’s not really enough in practice, so something like 2.0*4.8e-7 should be used instead (tested so far on GeForce 8600, Radeon HD 3850, Radeon 9600, Intel 945, reference rasterizer). Oh, and the same value should be used even when a 16 bit depth buffer is used; using 1.0/65535.0 multiplier with 16 bit depth buffer produces way too large bias.
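To put the numbers side by side, here is a small sketch in Python. The function and constant names are made up for illustration; the multipliers are the ones mentioned above. The offset formula in the comment is the one the D3D9 documentation gives, and `float_as_dword` mimics how float render states are handed to `SetRenderState`, which only takes DWORD values.

```python
import struct

OLD_MULTIPLIER = 1.0 / 65535.0      # what the old code fed to D3D
NAIVE_24BIT = 1.0 / (2 ** 24 - 1)   # one step of a 24 bit depth buffer
TUNED_MULTIPLIER = 2.0 * 4.8e-7     # the value that worked in practice


def d3d9_depth_offset(max_depth_slope, slope_scale_bias, depth_bias):
    """D3D9: offset = m * SLOPESCALEDEPTHBIAS + DEPTHBIAS.

    DEPTHBIAS is in 0..1 units that span the whole depth range.
    """
    return max_depth_slope * slope_scale_bias + depth_bias


def gl_units_to_d3d_bias(units, multiplier=TUNED_MULTIPLIER):
    """Convert the 'units' argument of glPolygonOffset into a D3D9 DEPTHBIAS."""
    return units * multiplier


def float_as_dword(value):
    """Reinterpret a 32 bit float as the DWORD that SetRenderState expects."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


if __name__ == "__main__":
    bias = gl_units_to_d3d_bias(-1.0)
    print(f"DEPTHBIAS = {bias:.3e}, passed as 0x{float_as_dword(bias):08X}")
    print(f"old / naive 24 bit: {OLD_MULTIPLIER / NAIVE_24BIT:.1f}")
    print(f"old / tuned:        {OLD_MULTIPLIER / TUNED_MULTIPLIER:.1f}")
```

Seen this way, the old bias was worth about 256 steps of a 24 bit depth buffer, and about 16 times the tuned value.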

With proper bias values the image is good on Direct3D again. Yay for that (fix is coming in Unity 2.1 soon).

…and yes, I know that real men fudge projection matrix instead of using depth bias… someday maybe.

Holy FPU precision, Batman!

(cross-posted from blogs.unity3d.com)

One of our customers found an interesting bug the other day: embedding Unity Web Player into a web page makes some javascript animation libraries not work correctly. For example, script.aculo.us or Dojo Toolkit would stop doing some of their tasks. But only on Windows, and only on some browsers (Firefox and Safari).

Wait a moment… Unity plugin makes nice wobbling web page elements not wobble anymore!? Sounds like an interesting issue…

So I prepared for a debug session and tried the usual “divide by two until you locate the problem” approach.

  • Unity Web Player is composed of two parts: a small browser plugin, and the actual “engine” (let’s call it “runtime”). First I change the plugin so that it only loads the data, but never loads or starts the runtime. Everything works. So the problem is not in the plugin. Good.
  • Load the runtime and do basic initialization (create child window, load Mono, …), but never actually start playing the content – everything works.
  • Load the runtime and fully initialize everything, but never actually start playing the content – the bug appears! By now I know that the problem is somewhere in the initialization.

Initialization reads some settings from the data file, creates some “manager objects” for the runtime, initializes graphics device, loads first game “level” and then the game can play.

Which of the above could cause something inside the browser’s JavaScript engine to stop working? And do that only on Windows, and only on some browsers? My first guess was the most platform-specific part: initialization of the graphics device, which on Windows usually happens to be Direct3D.

So I continued:

  • Try using OpenGL instead of Direct3D – everything works. By now it’s confirmed that initializing Direct3D causes something else in the browser to stop working.
  • “A-ha!” moment: tell Direct3D to not change floating point precision (via a create flag). Voilà, everything works!

I don’t know how I actually came up with the idea of testing the floating point precision flag. Maybe I remembered some related problems we had a while ago, where Direct3D would cause timing calculations to be “off” if the user’s machine had not been rebooted for a couple of weeks or more. That time around we properly changed our timing code to use 64 bit integers, but left the Direct3D precision setting intact.

Side note: Intel x86 floating point unit (FPU) can operate in various precision modes, usually 32, 64 or 80 bit. By default Direct3D 9 sets FPU precision to 32 bit (i.e. single precision). Telling D3D to not change FPU settings could lower performance somewhat, but in my tests it did not have any noticeable impact.
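A rough illustration of why reduced precision hurts timing code (and anything else that expects doubles to behave like doubles), in Python. This is only an approximation under my own assumptions: it emulates the single precision FPU mode by rounding values to 32 bit floats, and the millisecond uptime counter is a hypothetical stand-in for whatever the timing code actually read. The create flag in question is presumably `D3DCREATE_FPU_PRESERVE`.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


MS_PER_DAY = 24 * 60 * 60 * 1000
UPTIME_MS = float(14 * MS_PER_DAY)   # two weeks without a reboot


def frame_delta(now_ms, frame_ms, rounder):
    """Time difference between two frames, with every value passed through rounder."""
    return rounder(now_ms + frame_ms) - rounder(now_ms)


if __name__ == "__main__":
    print("double precision:", frame_delta(UPTIME_MS, 16.0, float))
    print("single precision:", frame_delta(UPTIME_MS, 16.0, to_float32))
```

After two weeks of uptime the counter is above 2^30, where neighbouring single precision values are 128 apart, so a 16 ms frame simply disappears. JavaScript numbers are doubles too, which would explain why animation code in the same process gets upset.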

So there it was. A debugging session, one line of change in the code, and fancy javascript webpage animations work on Windows in Firefox and Safari. This is coming out in Unity 2.0.2 update soon.

The moral? Something in one place can affect seemingly completely unrelated things in another place!

Is OpenGL really faster than D3D9?

The common knowledge is that drawing stuff in OpenGL is much faster than in D3D9. I wonder – is this actually true, or just an urban legend? I could very well imagine that setting everything up to draw a single model and then issuing 1000 draw calls for it is faster in OpenGL… but come on, that’s not a very life-like scenario!

At work we now have both D3D9 and OpenGL renderers on Windows. The original codebase was very much designed for OpenGL, so I had to jump through a lot of hoops to get it fully working on D3D… small differences that add up, like: there’s no object space texgen on D3D, shaders don’t track built-in state (world, modelview matrices, light positions, …), textures in GL vs. textures + sampler state in D3D, and so on. Anyway, the codebase was definitely not designed to exploit D3D strengths and OpenGL weaknesses, more likely the other way around.

But wait! I look at our benchmark tests, and D3D9 is consistently faster than OpenGL. Some examples:

  • Real world scene with lots of shadow casting lights (different objects, different shaders, different lights, different shadow types in one scene):
    • Core Duo with Radeon X1600: 23 FPS D3D9, 13 FPS GL.
    • P4 with GeForce 6800GT: 16 FPS D3D9, 9 FPS GL.
    • Core2 Duo with Radeon HD 2600: 41 FPS D3D9, 35 FPS GL.
  • High object count test (1000 objects, multiple lights, 5 passes per object total):
    • Core Duo with Radeon X1600: 18.3 FPS D3D9, 12.5 FPS GL.
    • P4 with GeForce 6800GT: 13.2 FPS D3D9, 9.4 FPS GL.
    • Core2 Duo with Radeon HD 2600: 34.8 FPS D3D9, 29.3 FPS GL.
  • Dynamic geometry (lots of particle systems) test (this is limited by vertex buffer writing speed and CPU calculating the particles, not by draw calls):
    • Core Duo with Radeon X1600: 170 FPS D3D9, 102 FPS GL.
    • P4 with GeForce 6800GT: 108 FPS D3D9, 74 FPS GL.
    • Core2 Duo with Radeon HD 2600: 325 FPS D3D9, 242 FPS GL.
  • …and so on.
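For a quick look at the ratios, here is a small Python snippet that only restates the numbers listed above; the dictionary layout and names are mine.

```python
# (D3D9 FPS, OpenGL FPS), copied from the list above.
RESULTS = {
    "shadow casting lights": {
        "Core Duo / Radeon X1600": (23, 13),
        "P4 / GeForce 6800GT": (16, 9),
        "Core2 Duo / Radeon HD 2600": (41, 35),
    },
    "high object count": {
        "Core Duo / Radeon X1600": (18.3, 12.5),
        "P4 / GeForce 6800GT": (13.2, 9.4),
        "Core2 Duo / Radeon HD 2600": (34.8, 29.3),
    },
    "dynamic geometry": {
        "Core Duo / Radeon X1600": (170, 102),
        "P4 / GeForce 6800GT": (108, 74),
        "Core2 Duo / Radeon HD 2600": (325, 242),
    },
}


def speedups(results):
    """Return {(test, machine): D3D9 FPS divided by OpenGL FPS}."""
    return {
        (test, machine): d3d9 / gl
        for test, machines in results.items()
        for machine, (d3d9, gl) in machines.items()
    }


if __name__ == "__main__":
    for (test, machine), ratio in speedups(RESULTS).items():
        print(f"{test:24s} {machine:28s} {ratio:.2f}x")
```

On these numbers D3D9 comes out between roughly 1.2x and 1.8x faster, with the smallest gap on the newest machine.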

To be fair, there are a couple of tests where on some hardware OpenGL has a slight edge. But in 95% of the cases, D3D9 is faster. Not to mention that we have about 10x fewer broken hardware/driver workarounds for D3D9 than we have for OpenGL…

What gives? Either our OpenGL code is horribly suboptimal, or “OpenGL is faster!!!!11oneoneeleven” is a myth. I have trouble figuring out in which places our code would be horribly suboptimal, I think we follow all advice given by hardware vendors on how to make OpenGL efficient (not that there is much advice out there though…).

There isn’t much software that can run the same content on both D3D and OpenGL and is suitable for benchmarking. I tried Ogre 3D demos on one machine (GeForce 6800GT card) and guess what? D3D9 is faster in tests that specifically stress draw count (like the instancing demo… D3D9 is faster both in instanced and non-instanced modes).

Am I crazy?

Back from Seattle

Just got back from MVP Global Summit 2007 in Seattle. Among usual things, like watching Bill’s keynote, meeting other MVPs, DirectX/XNA guys, getting a grip of some NDA information and such, here are some of the other highlights:

Amsterdam airport:

Officer: You speak English sir?
Me: Yeah.
O (takes a look at my passport): Ah, you speak Russian of course!
M: No, not really.
O: But your language is very similar to Russian, right?
M: Hm…

Well, here we know who gets the Linguist of the Year award.

Seattle-Tacoma airport, lady at check-in: “what kind of passport is that?”. It also takes 5 tries to enter my last name properly, from the printed letters in the passport. Each time trying to persuade me that I did change the ticket date of course!

Seattle-Tacoma airport, security: “sir, you have been selected for additional screening”. Do they randomly select people for that quite involved process? Why does this “selection” happen immediately after they take a look at my passport?

Random quotes:

Ten minutes walk is a long distance! Ten minutes of walking distance in the States is a very good reason to buy a car. At least SUV; preferably a Hummer.

DirectX SDK is the source of all sorts of high frequency goodness.

Sony is always good at announcements.

No? Rumours on the internet? Shock! Horror!

It’s always unexpected

I have a MacBook Pro now and slowly am getting used to it. It’s quite hard, considering that I’ve never had a laptop before; and actually used any Mac for the first time just a couple of months ago. My daughter thinks the best part about it are the weird image effects in PhotoBooth. I just can’t disagree.

On an unrelated note, now I am a Microsoft DirectX MVP. Just about the time when I almost stopped using it! I’d love to, but we’re making a product that primarily runs on Macs… quite hard to use D3D there. But almost every day I wish I could, and every second day I’m annoying my coworkers by saying that D3D is lightyears ahead of TheOtherAPI!

The MVP award just came out of nowhere. It’s one of the things that you never expect – but hey, it feels good anyway. And now I have an MVP laptop case for my MacBook :)

Reading DX10 docs…

Reading DirectX10 preview documentation right now (you know, it’s released with Dec2005 SDK). It is pretty impressive, I must say! Seems like a huge leap forward. Back to reading!

An article on efficient D3DX Effects state management

I wrote an article on the subject I was talking about recently – an auto-magical system that manages device states in the effects. The article and links to implementation are on my homepage here: aras-p.info/texts/d3dx_fx_states.html

State management in D3DX Effects #2

I’ve written down the basic idea here. Done some tests and it really seems to work!

That required tiny 700 lines of hacky C++ code in the engine; but in exchange there’s no longer a need to write state restoring passes by hand. Maybe such effect usage scheme would even be useable in RealWorld!

Too bad I didn’t think it up a couple of months ago. My ShaderX4 article about this subject would have been much better…

Ok, still got to test this stuff on real world data (i.e. trying it on our demos)