Archive for 'gpu'

Screenspace vs. mip-mapping

Just spent half a day debugging this, so here it is for the future reference of the internets.

In a deferred rendering setup (see Game Angst for a good discussion of deferred shading & lighting), lights are applied using data from screen-space buffers. Position, normal and other things are reconstructed from buffers and lighting is computed “in screen space”.

Because each light is applied to a portion of the screen, the pixels it computes can belong to different objects. If you use mipmapped textures anywhere in the lighting computation, be careful. The most common use of mipmapped light textures is light “cookies” (aka gobos).
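To make the warning concrete, here is a rough CPU-side sketch in Python (not shader code) of how a mip level gets picked from the cookie UV difference between neighbouring pixels. The UV values, the 256 texel cookie size and the `mip_level` helper are all made up for illustration, and real GPUs compute derivatives per 2x2 pixel quad, so treat this as a simplification.

```python
import math

def mip_level(u_here, u_next, texture_size):
    """Approximate mip level from the UV step between two adjacent pixels."""
    texels_per_pixel = abs(u_next - u_here) * texture_size
    if texels_per_pixel <= 0.0:
        return 0.0
    # Less than one texel per pixel means magnification: stay at mip 0.
    return max(0.0, math.log2(texels_per_pixel))

# Hypothetical cookie U coordinates along one row of screen pixels.
# The first four pixels lie on one object, the last three on another one
# further away, so the cookie coordinate jumps between pixel 3 and pixel 4.
row_u = [0.100, 0.102, 0.104, 0.106, 0.600, 0.602, 0.604]

levels = [mip_level(a, b, 256) for a, b in zip(row_u, row_u[1:])]
for i, lod in enumerate(levels):
    print(f"pixels {i}-{i + 1}: mip {lod:.2f}")
```

Within each object the step is tiny and mip 0 is used, but at the boundary between the two objects the jump looks like a huge derivative and a very small mip level gets selected.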

Let’s say we have a very simple scene with a spot light: (more…)

Direct3D GPU Hacks

I’m catching up on various GPU hacks that exist for Direct3D 9 (things like native shadow mapping, render to vertex buffer, etc.). Turns out there’s a lot of them, but all the information is scattered around the intertubes.

So here are the D3D9 hacks known to me in one place.

Let me know if I missed something or got something wrong. I also want to figure out if Intel GPUs/drivers implement any of them.

Strided blur and other tips for SSAO

If you’re new to SSAO, here are good overview blog posts: meshula.net and levelofdetail. Some tips and an idea on strided blur below.

(more…)

Compact Normal Storage for small g-buffers

I’ve been experimenting with compact storage of view space normals for small g-buffers. Think about storing depth and normal in a single 8 bit/channel RGBA texture.
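As a rough idea of what fits into four 8 bit channels, here is a Python sketch of one simple scheme: normal X and Y in two channels (Z is reconstructed, assuming it is non-negative) and 16 bit depth split across the other two. This is only an illustration, not necessarily one of the methods measured in the findings; the function names and the Z sign convention are assumptions.

```python
import math

def to_byte(x):
    """Quantize a value in [0..1] to an integer in 0..255."""
    return max(0, min(255, round(x * 255.0)))

def pack_normal_depth(normal, depth):
    """Pack a unit view space normal and a depth in [0..1] into 4 bytes."""
    nx, ny, _ = normal                       # Z is dropped and rebuilt later
    d16 = min(65535, int(depth * 65535.0 + 0.5))
    return (to_byte(nx * 0.5 + 0.5),         # R: normal X remapped to [0..1]
            to_byte(ny * 0.5 + 0.5),         # G: normal Y remapped to [0..1]
            d16 >> 8,                        # B: high byte of 16 bit depth
            d16 & 255)                       # A: low byte of 16 bit depth

def unpack_normal_depth(rgba):
    """Inverse of pack_normal_depth; assumes the normal's Z was >= 0."""
    r, g, b, a = rgba
    nx = r / 255.0 * 2.0 - 1.0
    ny = g / 255.0 * 2.0 - 1.0
    nz = math.sqrt(max(0.0, 1.0 - nx * nx - ny * ny))
    depth = (b * 256 + a) / 65535.0
    return (nx, ny, nz), depth

packed = pack_normal_depth((0.6, 0.0, 0.8), 0.25)
normal, depth = unpack_normal_depth(packed)
print(packed, normal, depth)
```

The obvious weakness is that 8 bits per normal component is coarse and the sign of Z is lost, which is the kind of error that is worth visualizing.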

Here are my findings – with error visualization and shader performance numbers for some GPUs.

If you know any other method to encode/store normals in a compact way, please let me know!

Encoding floats to RGBA – the final?

The saga continues! In short, I need to pack a floating point number in the [0..1) range into several channels of an 8 bit/channel render texture. My previous approach is not ideal.

Turns out some folks have figured out an approach that finally seems to work.

Here’s the proper way, for my own reference:

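// Pack a float in [0..1) into four 8 bit channels.
inline float4 EncodeFloatRGBA( float v ) {
  // Scales are powers of 255: 1, 255, 255^2, 255^3
  float4 enc = float4(1.0, 255.0, 65025.0, 16581375.0) * v;
  enc = frac(enc);
  // Remove from each channel the part that the next channel stores
  enc -= enc.yzww * float4(1.0/255.0,1.0/255.0,1.0/255.0,0.0);
  return enc;
}
// Rebuild the float from the four channels.
inline float DecodeFloatRGBA( float4 rgba ) {
  return dot( rgba, float4(1.0, 1.0/255.0, 1.0/65025.0, 1.0/16581375.0) );
}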

That is, the difference from the previous approach is that the “magic” (read: hardware dependent) bias is replaced with subtracting the next component’s encoded value, scaled by 1/255, from each component’s encoded value.
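To see why the subtraction helps, here is a CPU-side Python emulation of the same math (not the shader itself). It uses 255^3 = 16581375 as the last scale. The `quantize8` helper is my assumption of how an 8 bit/channel render target rounds on write; actual hardware rounding and float32 shader precision may behave differently.

```python
import math

SCALES = (1.0, 255.0, 65025.0, 16581375.0)   # 255^0 .. 255^3

def frac(x):
    return x - math.floor(x)

def encode_float_rgba(v):
    """Python version of EncodeFloatRGBA for v in [0..1)."""
    enc = [frac(s * v) for s in SCALES]
    nxt = (enc[1], enc[2], enc[3], enc[3])            # enc.yzww
    bias = (1.0 / 255.0, 1.0 / 255.0, 1.0 / 255.0, 0.0)
    return [e - n * b for e, n, b in zip(enc, nxt, bias)]

def quantize8(c):
    """Assumed behaviour of an 8 bit channel: round to the nearest 1/255."""
    return round(c * 255.0) / 255.0

def decode_float_rgba(rgba):
    """Python version of DecodeFloatRGBA."""
    return sum(c / s for c, s in zip(rgba, SCALES))

worst = 0.0
for i in range(1000):
    v = i / 1000.0
    stored = [quantize8(c) for c in encode_float_rgba(v)]
    worst = max(worst, abs(decode_float_rgba(stored) - v))
print("worst round trip error:", worst)
```

After the subtraction the first three channels are already exact multiples of 1/255, so storing them in 8 bits loses nothing; only the last channel gets rounded, and its weight in the decode is tiny.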

Implementing fixed function T&L in vertex shaders

Almost half a year ago I was wondering how to implement T&L in vertex shaders.

Well, I finally implemented it for the upcoming Unity 2.6. I wrote some sort of a technical report here.

In short, I’m combining assembly fragments and doing simple temporary register allocation, which seems to work quite well. Performance is very similar to using fixed function (I know it’s implemented as vertex shaders internally by the runtime/driver) on several different cards I tried (Radeon HD 3xxx, GeForce 8xxx, Intel GMA 950).

What was unexpected: the most complex piece is not the vertex lighting! Most complexity is in how to route/generate texture coordinates and transform them. Huge combinatorial explosion there.

Otherwise – I like! Here’s a link to the article again.

Shaders must die, part 3

Continuing the series (see Part 1, Part 2)…

Got different lighting models (BRDFs) working. Without further ado, code snippets that produce real actual working shaders that work with lights & shadows and whatnot:

(more…)

Shaders must die, part 2

I started playing around with the idea of “shaders must die”. I’m experimenting with extracting “surface shaders” for now.

Right now my experimental pipeline is:

  1. Write a surface shader file
  2. A Perl script transforms it into a Unity 2.x shader file
  3. Which in turn is compiled by Unity into all lighting/shadows permutations, for D3D9 and OpenGL backends. Cg is used for actual shader compilation.

I have very simple cases working. For example: (more…)

Shaders must die

It came in as a simple thought, and now I can’t shake it off. So I say:
Shaders Must Die

Ok, now that the controversial bits are done, let’s continue.

(more…)

Fixed function lighting in vertex shader – how?

Sometime soon I’ll have to implement the fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results, thus requiring depth bias or projection matrix tweaks, which leads to various artifacts that annoy people to hell.

I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. The DX9 runtime on Vista’s WDDM also seems to be handing only shaders to the driver internally. Still, for some reason somewhere the precision does not match…

How should such a task be approached?

My requirements are:

  • Should handle any possible state combination in D3D fixed function T&L.
  • D3D 9.0c, using vertex shader 2.0 is ok. For now I don’t care about OpenGL.
  • No HLSL at runtime. I don’t want to add a megabyte or more to Unity web player just for HLSL. DX9 shader assembly is ok, because we already have the assembler code.
  • Should run as fast as (or close to) the regular fixed function pipeline.

I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach: one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize; I could get it down to 190 or so instructions, kill some rcp’s and reduce the amount of constant storage by 2x.

Still, it did not handle some things in the D3D T&L or had some issues:

  • It assumes one input UV, one output UV and no texture matrices. This place in T&L gets quite convoluted – any input UVs or a texgen mode can be transformed by matrices of various sizes, and routed into any output UVs.
  • It was not using the full T&L lighting model. No biggie here.
  • I haven’t checked with NVShaderPerf or AMD ShaderAnalyzer yet, but last time I checked the static branch instruction was taking two clocks on some NV architecture. So the ubershader approach does not come for free.

Another thing I’m considering is to combine the final shader(s) from assembly fragments, with some simple register allocation.

In T&L shader code, there’s only a limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used by lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their result.

What is left then is some register allocation. A shader assembly fragment may need some temporary registers for internal use (this is simple: just give it a bunch of unused registers), some registers as input (holding results from previous fragments), and some registers to save its output in.
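A minimal sketch of that idea, written in Python for illustration only (the real thing would live in engine code). The `Fragment` structure, the placeholder syntax, the register count of 12 and the asm snippets with their constant register numbers are all assumptions of mine, not an actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    name: str
    inputs: list     # named values produced by earlier fragments
    outputs: list    # named values this fragment leaves in registers
    temps: int       # number of scratch registers it needs
    code: list       # asm lines with {placeholders} for registers

def combine(fragments, num_regs=12):
    """Concatenate fragments, replacing placeholders with r# registers."""
    free = list(range(num_regs))     # running out raises IndexError
    value_reg = {}                   # named value -> register index
    lines = []
    for frag in fragments:
        names = {}
        for value in frag.inputs:
            if value not in value_reg:
                raise ValueError(f"{frag.name} needs '{value}' before it is computed")
            names[value] = f"r{value_reg[value]}"
        for value in frag.outputs:
            # Sloppy on purpose: output registers are never released.
            value_reg[value] = free.pop(0)
            names[value] = f"r{value_reg[value]}"
        scratch = [free.pop(0) for _ in range(frag.temps)]
        for i, reg in enumerate(scratch):
            names[f"tmp{i}"] = f"r{reg}"
        lines.append(f"; {frag.name}")
        lines += [line.format(**names) for line in frag.code]
        free = sorted(free + scratch)    # scratch is dead after the fragment
    return lines

# Hypothetical fragments; input and constant registers are made up.
FRAGMENTS = [
    Fragment("camera space normal", [], ["cnormal"], 1,
             ["m3x3 {tmp0}.xyz, v1, c4",
              "nrm {cnormal}.xyz, {tmp0}"]),
    Fragment("diffuse light", ["cnormal"], [], 1,
             ["dp3 {tmp0}.x, {cnormal}, c8",
              "max {tmp0}.x, {tmp0}.x, c9.x",
              "mul oD0, {tmp0}.x, c10"]),
]

program = combine(FRAGMENTS)
print("\n".join(program))
```

Here the normal ends up in r0 and both fragments reuse r1 as scratch. Nothing smarter than "first free register" is attempted.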

Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into hardware format. This would mean that I can be quite sloppy with it, i.e. don’t have to implement some super smart allocation scheme.

I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).

Does that make sense? Or did everyone solve those problems eons ago already?

Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html