Archive for 'gpu'

Kindernoiser!

I said so – 4 kilobyte intros are really getting interesting.

Meet kindernoiser – 4 kilobytes, quaternion Julia fractal on the GPU, screen space ambient occlusion and so on. iq has a nice article on the tech behind SSAO.

Keep ‘em coming!

Encoding floats to RGBA, redux

Gleserg has interesting comments on my earlier post. So I thought I’d share what I am using right now, and try to throw some more complexities in :)

Here is what I am doing right now:

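// Pack a float in [0,1) into four 8-bit channels; the scales are powers of 255.
inline float4 EncodeFloatRGBA( float v ) {
  return frac( float4(1.0, 255.0, 65025.0, 16581375.0) * v ) + 0.5/255.0;
}
// Recombine the four channels into a single float.
inline float DecodeFloatRGBA( float4 rgba ) {
  return dot( rgba, float4(1.0, 1.0/255.0, 1.0/65025.0, 1.0/16581375.0) );
}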

And this seems to work fine almost everywhere (see below). Why am I doing this – good question, I don’t have a hard theory on which bits go where and so on. I think I saw someone on gamedev.net forums saying that in hardware 0 == 0.0 and 255 == 1.0, and that truncation is actually done on the values (not rounding). So that would mean you multiply by 255 and add a half of a bit.

Now, the trick: the above does not quite work on Radeons (at least the X1600 that I’m mostly developing on while I’m on a Mac). Instead of adding 0.5/255.0, you have to subtract 0.55/255.0 – and that value is still not perfect, but that’s the best I could come up with by plowing through various combinations. I have no idea why this must be performed (24 bit internal precision? or does it round up? something else?). On GeForces and even Intel’s shader-capable hardware, the expected +0.5/255.0 value works.

…anyone up to figuring out the mathematical proof of why encoding/decoding this way actually works? :) And yes, the last component (the one scaled by 255^3) is pretty much meaningless.
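Not a proof, but a way to poke at the question: below is a small CPU-side Python model of the scheme. It is not shader code, and the names `encode`, `quantize`, `decode` and `max_error` are made up for this sketch. It uses exact fractions to keep floating point noise out of the picture, and it assumes that an 8-bit channel stores byte/255 and that the write either truncates or rounds to nearest. Real hardware may do neither exactly.

```python
from fractions import Fraction
from math import floor

SCALES = (1, 255, 255**2, 255**3)

def frac(x):
    return x - floor(x)

def encode(v, bias=Fraction(0)):
    """frac(v * scales) + bias, before any 8-bit quantization."""
    return [frac(v * s) + bias for s in SCALES]

def quantize(channels, mode):
    """Model an 8-bit target: 'trunc' floors, anything else rounds to nearest."""
    out = []
    for c in channels:
        x = c * 255
        b = floor(x) if mode == "trunc" else floor(x + Fraction(1, 2))
        out.append(Fraction(min(255, max(0, b)), 255))
    return out

def decode(rgba):
    """dot(rgba, 1/scales)."""
    return sum(c / s for c, s in zip(rgba, SCALES))

def max_error(mode, bias, samples=1000):
    """Worst round-trip error over v = 0/samples .. (samples-1)/samples."""
    worst = Fraction(0)
    for i in range(samples):
        v = Fraction(i, samples)
        err = abs(decode(quantize(encode(v, bias), mode)) - v)
        worst = max(worst, err)
    return worst

for mode, bias in (("trunc", Fraction(0)),
                   ("round", Fraction(0)),
                   ("round", Fraction(-1, 510))):   # -0.5/255
    print(mode, bias, float(max_error(mode, bias)))
```

In this model, truncation with no bias reconstructs the value to within 255^-4, round-to-nearest is off by up to about 1/255, and a bias of -0.5/255 turns rounding back into truncation. That is at least consistent with needing a negative bias on hardware that rounds, though it does not explain the 0.55.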

A day well spent (encoding floats to RGBA)

[Image: RGBA encoding 01]

Breaking news: sometimes seemingly trivial tasks take insane amounts of time! I am sure no one knew this before! So it was yesterday – almost a whole day spent fighting rounding/precision errors when encoding floating point numbers into regular 8 bit RGBA textures. You know, the trivial stuff where you start with

inline float4 EncodeFloatRGBA( float v ) {
  return frac( float4(1.0, 256.0, 65536.0, 16777216.0) * v );
}
inline float DecodeFloatRGBA( float4 rgba ) {
  return dot( rgba, float4(1.0, 1.0/256.0, 1.0/65536.0, 1.0/16777216.0) );
}

and everything is fine until sometimes, somewhere there’s “something wrong”. Must be rounding or quantization errors; or maybe I should use 255 instead of 256; plus optionally add or subtract 0.5/256.0 (or would that be 0.5/255.0?). Or maybe the error is entirely somewhere else, and I’m just chasing ghosts here!
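One way to explore the 255-versus-256 question without a GPU is to model the round trip on the CPU. The Python sketch below is only that, a model: `roundtrip` and `max_error` are hypothetical names, the arithmetic is done in exact fractions, and the 8-bit write is assumed to truncate and to be read back as byte/255.

```python
from fractions import Fraction
from math import floor

def roundtrip(v, base):
    """Encode v with scales base**0..base**3, store truncated 8-bit channels, decode."""
    decoded = Fraction(0)
    for i in range(4):
        s = base**i
        f = v * s - floor(v * s)           # frac(v * s)
        byte = min(255, floor(f * 255))    # assumed: truncating 8-bit write
        decoded += Fraction(byte, 255) / s # channel reads back as byte/255
    return decoded

def max_error(base, samples=1000):
    """Worst round-trip error over v = 0/samples .. (samples-1)/samples."""
    return max(abs(roundtrip(Fraction(i, samples), base) - Fraction(i, samples))
               for i in range(samples))

print("base 255:", float(max_error(255)))
print("base 256:", float(max_error(256)))
```

Under these assumptions the base-255 scales line up with the byte/255 steps of the texture and the error stays below 255^-4, while the base-256 scales leave an error of roughly half an 8-bit step or more. For example, v = 0.5 stores 127/255 in the first channel and zero in the others.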

[Image: RGBA encoding 02]

What would you do then? Why, of course, build an Encoding Floats Into Textures Studio 2007! (don’t tell me it’s not a great idea for a commercial software package! game studios would pay insane amounts of money for a tool like this!) The images here are exactly that – render into a texture, encoding the UV coordinate as RGBA, then read from that texture, displaying the RGBA values and the error from the expected value in some weird way. Turns out image postprocessing filters in Unity are a pretty good tool to do all this. Yay!

[Image: RGBA encoding 03]

Sometimes in situations like this I figure out that graphics hardware still leaves a lot to be desired. This last image shows some calculations that depend only on the horizontal UV coordinate, so they should produce a purely vertical pattern (sans the part at the bottom, which is expected to be different). Heh, you wish!

Speculation: pipelining geometry shaders

A followup to the older “discussion” about how/why geometry shaders would be okay/slow:

The graphics hardware has been quite successful so far at hiding memory latencies (i.e. when sampling textures). It does so (according to my understanding) by having a looong pixel pipeline, where hundreds (or thousands) of pixels might be at one processing stage or another. ATI talks about this in big letters (R520 dispatch processor) and speculations suggest that GeForceFX had something like that (article). I have no idea about the older cards, but presumably they did something similar as well.

I am not sure how the vertex texture fetches are pipelined – pretty slow performance on GeForce6/7 suggests that they aren’t :) Probably vertex shaders in current cards operate in a simpler way – just fetch the vertices and run whole shaders on them (in contrast to pixel shaders, which seem to run just several instructions, then move on to other pixels, come back, etc.).

With DX10, we have arbitrary memory fetches in any stage of the pipeline. Even the boundary between different fetch types is somewhat blurry (constant buffers vs. arbitrary buffers vs. textures) – perhaps they will differ only in bandwidth/latency (e.g. constant buffers live near the GPU while textures live in video memory).

So, with arbitrary memory fetches anywhere (and some of them being high latency), everything needs to have long pipelines (again, just my guess). This is all great, but the longer the pipeline, the worse it performs in non-friendly scenarios: pipeline flush is more expensive, drawing just a couple of “things” (primitives, vertices, pixels) is inefficient, etc.

I guess we’ll just learn a new set of performance rules for tomorrow’s hardware!

Back to GS pipelining: I imagine that the “slow” scenarios would be like this: vertices have shaders with dynamic branches or memory fetches differing vastly in execution lengths – so GS has to wait for all vertex shaders of the current primitive (optional: plus topology) to finish; and then each GS has dynamic branches or memory fetches, and outputs a different number of primitives to the rasterizer. If I were hardware, I’d be scared :)

More HDR woes

I’m still spending an occasional minute on my HDR demo. Now, everything is fine so far, except one thing: I can’t get MSAA working on some Radeons (and I don’t have a Radeon right now, which makes debugging a lot harder). The main point of my demo is to have MSAA on ordinary hw, so this is bad.

The reason seems to be that on older Radeons MSAA does not resolve the alpha channel, which obviously messes things up in my case. I’m using RGBE8 encoding for the main rendertarget, and if RGB gets MSAA’d and the exponent does not – then oh well, no good anti-aliasing most of the time.
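To make the failure concrete, here is a small Python sketch of what goes wrong on an edge pixel. It assumes a Ward-style shared-exponent RGBE layout (the demo's actual shader encoding may differ), and it assumes the broken resolve averages RGB while keeping the first sample's exponent. The function names are invented for this example.

```python
import math

def encode_rgbe(rgb):
    """Shared-exponent encoding: 8-bit mantissas in RGB, biased exponent in A."""
    m = max(rgb)
    if m < 1e-32:
        return (0, 0, 0, 0)
    mant, exp = math.frexp(m)        # m == mant * 2**exp, 0.5 <= mant < 1
    scale = mant * 256.0 / m         # equals 256 / 2**exp
    r, g, b = (int(c * scale) for c in rgb)
    return (r, g, b, exp + 128)

def decode_rgbe(rgbe):
    r, g, b, e = rgbe
    if e == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, e - 128 - 8) # 2**(e - 136)
    return (r * f, g * f, b * f)

def resolve_rgb_only(a, b):
    """Assumed broken resolve: average RGB, keep sample a's exponent untouched."""
    return ((a[0] + b[0]) // 2, (a[1] + b[1]) // 2, (a[2] + b[2]) // 2, a[3])

bright = (8.0, 4.0, 2.0)     # two samples of one edge pixel
dark = (0.25, 0.25, 0.25)

wanted = tuple((x + y) / 2 for x, y in zip(bright, dark))
got = decode_rgbe(resolve_rgb_only(encode_rgbe(bright), encode_rgbe(dark)))
print("wanted:", wanted)
print("got:   ", got)
```

With these two samples the correct average is (4.125, 2.125, 1.125), while the RGB-only resolve decodes to (8.0, 6.0, 5.0): the dark sample's mantissas get scaled by the bright sample's exponent.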

Of course I could always manually supersample everything, but this would defeat the whole point of the demo. Or I could render everything in two passes, one for RGB and one for exponent – but this also is not very nice…

Probably I’ll just release the demo as it is now and wait for possible feedback. Or dig up an old Radeon somewhere and debug more – but replacing the video card in my Shuttle XPC is not an easy task :)

Jumped onto HDR bandwagon

I’m doing a small HDR demo for fun. Nothing fancy – linear gamma, Reinhard’s tone mapping and whatnot – everyone does that. But the thing I made so far does not even look good! :)

I’m trying to support both HDR and FSAA at the same time on ordinary DX9 hardware (no Radeons 1k) by using RGBE8 rendertarget for the main scene. It’s all okay so far.

The most difficult task right now is making it look good. Once I have that I’ll post the results.

The video cards are damn fast

I was working on our next demo the other day. Boy, the video cards are damn fast nowadays!

We have a high-poly model for the main character (~200k tris); for the demo we use a low-poly version (~6500 tris) and a normalmap. Now, I’ve put 128 lights scattered on the hemisphere above him, each using a shadow buffer. I have 4 shadow buffers, render to these from four lights, then render the character, fetching shadows from four shadowmaps at once. The result is that it’s almost realtime ambient occlusion for the animating character, and it runs at ~40FPS on my GeForce 6800 GT!
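For a rough sense of the workload, the batching above can be counted out in a few lines of Python. This is back-of-the-envelope only: it assumes the shadow passes also draw the low-poly mesh, and `count_passes` is a name made up for this sketch.

```python
LIGHTS = 128
SHADOW_BUFFERS = 4
LOW_POLY_TRIS = 6500
FPS = 40

def count_passes(lights, buffers):
    """One shadow pass per light; one character pass per batch of `buffers` lights."""
    shadow_passes = 0
    character_passes = 0
    for first in range(0, lights, buffers):
        batch = min(buffers, lights - first)
        shadow_passes += batch       # fill the shadow buffers for this batch
        character_passes += 1        # draw the character, sampling all of them
    return shadow_passes, character_passes

shadow_passes, character_passes = count_passes(LIGHTS, SHADOW_BUFFERS)
tris_per_frame = (shadow_passes + character_passes) * LOW_POLY_TRIS
tris_per_second = tris_per_frame * FPS
print(shadow_passes, character_passes, tris_per_frame, tris_per_second)
```

Under that assumption it comes to 128 shadow passes plus 32 character passes per frame, about a million triangles per frame and roughly 40 million per second.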

This is of course pretty useless, we don’t need realtime AO in the demo. But it has been nice :)