Archive for 'gpu'
Sometime soon I’ll have to implement the fixed function lighting pipeline in vertex shaders. Why? Because mixing fixed function and vertex shaders in multiple passes does not guarantee identical transformation results, so it requires depth bias or projection matrix tweaks, which lead to various artifacts that annoy people to hell.
I don’t really know why that happens, because it seems that most modern cards don’t have fixed function units, so internally they are running shaders anyway. DX9 runtime on Vista’s WDDM also seems to be only handing shaders to the driver internally. Still, for some reason somewhere the precision does not match…
How should such a task be approached?
My requirements are:
- Should handle any possible state combination in D3D fixed function T&L.
- D3D 9.0c, using vertex shader 2.0 is ok. For now I don’t care about OpenGL.
- No HLSL at runtime. I don’t want to add a megabyte or more to Unity web player just for HLSL. DX9 shader assembly is ok, because we already have the assembler code.
- Should work as fast as (or close to) the regular fixed function pipeline.
I looked at ATI’s FixedFuncShader sample. It’s an ubershader approach: one large (230 instructions or so) shader with static VS2.0 branching. It had some obvious places to optimize; I could get it down to 190 or so instructions, kill some rcp’s and reduce the amount of constant storage by 2x.
Still, it did not handle some things in the D3D T&L or had some issues:
- It assumes one input UV, one output UV and no texture matrices. This place in T&L gets quite convoluted – any input UVs or a texgen mode can be transformed by matrices of various sizes, and routed into any output UVs.
- It was not using the full T&L lighting model. No biggie here.
- I haven’t checked with NVShaderPerf or AMD ShaderAnalyzer yet, but last time I checked the static branch instruction was taking two clocks on some NV architecture. So ubershader approach does not come for free.
Another thing I’m considering is to combine the final shader(s) from assembly fragments, with some simple register allocation.
In T&L shader code, there’s only a limited set of could-be-redundant computations, mostly computing world space position, camera space normal, view vector and so on (those could be used by lighting, texgen or fog). Those computations can be explicitly put into separate fragments, and later fragments could just use their results.
What is left then is some register allocation. A shader assembly fragment could want some temporary registers for internal use (this is simple, just give it a bunch of unused registers), also want some registers as input (from previous fragments), and save some output in registers.
Again, I haven’t checked with shader performance tools, but I think, guess and hope that the drivers do additional register allocation, liveness analysis etc. when converting D3D shader bytecode into hardware format. This would mean that I can be quite sloppy with it, i.e. don’t have to implement some super smart allocation scheme.
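To make the fragment idea a bit more concrete, here is a minimal sketch of such a combiner in Python. It is only an illustration, not the actual combiner, and everything in it is an assumption: the `Fragment` layout, the `{placeholder}` syntax, the constant register numbers in the sample assembly (which was not run through an assembler), and the limit of 12 temporaries that vs_2_0 guarantees. The allocation is deliberately sloppy: named outputs keep their register until the end of the shader, and scratch registers are simply reused by every fragment.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One hand-written chunk of shader assembly (hypothetical format)."""
    name: str
    code: str            # uses {placeholders} instead of real register names
    inputs: tuple = ()   # named values written by earlier fragments
    outputs: tuple = ()  # named values this fragment leaves in registers
    temps: int = 0       # scratch registers needed only inside the fragment


def combine(fragments, max_temps=12):
    """Concatenate fragments, mapping named values and scratch to r# registers."""
    assigned = {}   # value name -> register; kept until the end of the shader
    next_free = 0
    lines = []
    for frag in fragments:
        regs = {}
        for name in frag.inputs:
            if name not in assigned:
                raise ValueError(f"{frag.name}: nothing has written '{name}' yet")
            regs[name] = assigned[name]
        for name in frag.outputs:
            if name not in assigned:
                assigned[name] = f"r{next_free}"
                next_free += 1
            regs[name] = assigned[name]
        if next_free + frag.temps > max_temps:
            raise ValueError("out of temporary registers")
        # Scratch is dead once the fragment ends, so every fragment may
        # reuse the registers just above the long-lived outputs.
        for i in range(frag.temps):
            regs[f"tmp{i}"] = f"r{next_free + i}"
        lines.append(f"; {frag.name}")
        lines.append(frag.code.format(**regs))
    return "\n".join(lines)


# Illustrative fragments; c0, c4 and c8 are made-up constant locations.
world_pos = Fragment("world position", "m4x3 {worldPos}.xyz, v0, c0",
                     outputs=("worldPos",))
cam_normal = Fragment("camera space normal", "m3x3 {camNormal}.xyz, v1, c4",
                      outputs=("camNormal",))
view_dir = Fragment("view vector",
                    "add {tmp0}.xyz, c8, -{worldPos}\nnrm {viewDir}.xyz, {tmp0}",
                    inputs=("worldPos",), outputs=("viewDir",), temps=1)

print(combine([world_pos, cam_normal, view_dir]))
```

A lighting or texgen fragment would then list `worldPos` or `viewDir` as inputs. Asking for a value that no earlier fragment produced raises an error instead of silently reading garbage.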
I wrote some experimental code for the shader assembly combiner and so far it looks like a reasonable approach (and not too hard either).
Does that make sense? Or did everyone solve those problems eons ago already?
Edit: half a year later, I wrote a technical report on how I implemented all this: http://aras-p.info/texts/VertexShaderTnL.html
Posted on 2009-01-22 22:32 in code, d3d, gpu, rendering, work | 9 Comments »
(if this sounds like a rehash of a blog post on blogs.unity3d.com, well, it is…)
Everyone knows Valve’s hardware survey. But what if your target game players are not the traditional “big budget AAA game” type? For example, at the moment most Unity Web Player games are oriented to a much more casual market, so hardware there might be very different. And indeed, it turns out it is quite different.
Without further ado, here’s the data we have: Unity Web Player hardware statistics.
It’s about two million data points since we started gathering it earlier this year.
Some subjective points of interest (I’ll be using current data for 2008 Q3 here):
- Operating systems: Mac OS X is 2.5%, the rest is Windows. 64 bit Windows hasn’t really picked up yet (0.7%). Windows 2000 is dying fast (0.7%). OS X Leopard already took over OS X Tiger.
- CPUs: poor Transmeta :) Dual core CPUs are becoming the norm (46%).
- Graphics cards: quite sad, in fact… top 15 cards are slow or horribly slow. Capability wise, they are quite good, with about 70% having shader model 2.0 or higher. Shader model 1.x cards are dead. “Can has DX10” is 2.7%.
- Casual machines don’t have lots of RAM. Nor lots of VRAM.
- Most popular nvidia driver? 56.73. Looks like this is the driver that comes integrated in XP SP2… Now, who says regular people ever update their drivers? Likewise, vga.dll (i.e. standard VGA) is 1.6% of machines; an additional 1.5% don’t report any driver (not sure how that happens…).
So yeah. Casual machines: capabilities quite okay, performance low, low, low. That’s life.
Posted on 2008-08-28 20:32 in games, gpu, unity | 3 Comments »
You know something became a cultural phenomenon when hardware review sites start putting up images like this…
From AnandTech’s Radeon HD 4850 & 4870 review: I can has vertex data?
Edit: gee, nowadays the reviews have funny performance measures. Like, FPS per square centimeter (of GPU die size)! It does actually make (some) sense, but it’s still funny. Frames per second per square centimeter… mmm… delicious.
Posted on 2008-06-26 7:54 in gpu, random | No Comments »
Hey, it looks like the quest for encoding floats to RGBA textures (part 1, part 2) did not end yet.
Here’s the “best available” code that I have now:
inline float4 EncodeFloatRGBA( float v ) {
    // bias is a per-GPU constant, see the list below
    return frac( float4(1.0, 255.0, 65025.0, 16581375.0) * v ) + bias;
}
inline float DecodeFloatRGBA( float4 rgba ) {
    return dot( rgba, float4(1.0, 1/255.0, 1/65025.0, 1/16581375.0) );
}
Previously I thought that the bias should normally be +0.5/255.0, except that it had to be around -0.55/255.0 on Radeon cards (older than the Radeon HD series). Well, turns out I was wrong: the bias mostly has to be around -0.5/255.0.
Here’s the list (same bias on Windows/D3D9 and OS X/OpenGL, so it seems to be hardware dependent, and not something in API/drivers):
- Radeon 9500 to X850: -0.61/255
- Radeon X1300 to X1900: -0.66/255
- Radeon HD 2xxx/3xxx: -0.49/255
- GeForce FX, 6, 7, 8: -0.48/255
- Intel 915, 945, 965: -0.5/255
Those are the best bias values I could find. Still, every once in a while (rarely) encoding the value to an RGBA texture and reading it back would produce something where one channel is half a bit off. Not a problem if the numbers you were encoding were originally in the 0..1 range, but if you were encoding something that spans the whole range of the camera, then the 0..1 range gets expanded into 0..FarPlane…
And all of a sudden there are huge precision errors, up to the point of being unusable. I just tried doing a quick’n’dirty depth of field and soft particles implementation using depth encoded this way… not good.
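A rough way to see why the bias ends up near -0.5/255, and why a single wrong step hurts so much once depth is scaled up, is to simulate the round trip on the CPU. The sketch below is a Python model, not shader code, and it rests on an assumption that may not hold for any particular GPU: writing to an 8 bit channel clamps to 0..1 and rounds to the nearest of the 256 levels, and reading returns byte/255. It uses exact fractions so the simulation's own floating point errors stay out of the picture. `store_8bit`, `worst_error` and the far plane of 1000 are made-up names and numbers.

```python
from fractions import Fraction
from math import floor

# Powers of 255, one per channel.
SCALE = [Fraction(255) ** i for i in range(4)]


def frac(x):
    return x - floor(x)


def store_8bit(x):
    """ASSUMPTION: the render target clamps to 0..1 and rounds to the nearest byte."""
    x = min(max(x, Fraction(0)), Fraction(1))
    return Fraction(floor(x * 255 + Fraction(1, 2)), 255)


def encode(v, bias):
    return [store_8bit(frac(s * v) + bias) for s in SCALE]


def decode(rgba):
    return sum(c / s for c, s in zip(rgba, SCALE))


def worst_error(bias, samples=2000):
    """Largest |decode(encode(v)) - v| over evenly spaced v in [0, 1)."""
    return max(abs(decode(encode(Fraction(i, samples), bias)) - Fraction(i, samples))
               for i in range(samples))


if __name__ == "__main__":
    far_plane = 1000  # hypothetical camera range in world units
    biases = [("-0.5/255", Fraction(-1, 510)),
              ("0", Fraction(0)),
              ("+0.5/255", Fraction(1, 510))]
    for label, bias in biases:
        err = worst_error(bias)
        print(f"bias {label:>8}: worst error {float(err):.2e}, "
              f"{float(err * far_plane):.2e} units at far plane {far_plane}")
```

Under that assumption, subtracting 0.5/255 turns the rounding into truncation, so each channel holds one base-255 digit of the value and the worst error stays below 255^-4. With no bias or with +0.5/255 the first channel can come out one step too high, an error of about 1/255, which is almost 4 world units once the value is scaled by a far plane of 1000.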
Oh well. Has anyone successfully used encoding of high precision numbers into RGBA channels before?
Posted on 2008-06-20 17:55 in gpu | 6 Comments »
Okay, so Apple just announced OpenCL (Open Computing Language) technology in upcoming OS X 10.6. This is starting to get interesting.
My prediction? OpenCL should be something along the lines of CUDA or BrookGPU. It will work on various DX10-level graphics cards, and on the CPU. I think trying to target older graphics cards does not make sense – using real actual integer types is useful in general purpose computing (DX10 tech), and Apple will probably only be shipping DX10 level graphics cards in a year (at the moment only Intel cards in Macs are DX9 level; the rest is GeForce 8s and Radeon HDs). With a multithreaded CPU fallback any older machines will be taken care of anyway (and that leaves the future open for Larrabees). So yeah, quite similar to BrookGPU actually.
It has “open” in the title, so maybe they will make it for other platforms as well. I doubt that they will ship an implementation though; perhaps just make it royalty/patent/whatever free and publish the spec. Which is about the same level of “openness” as other technologies with “open” in their name (OpenGL, OpenAL, OpenMP, OpenCV, …) – not exactly open, but not the worst kind either.
Oh, and suddenly there are new uses for other technologies recently developed at Apple, like LLVM or clang.
We’ll see how it goes.
Posted on 2008-06-10 21:27 in gpu | 2 Comments »
SwiftShader 2.0, a pure software renderer with a Direct3D 9 interface, just got released. I tried it on the rendering unit tests and some benchmark tests we have for Unity.
In short, I’m impressed.
It runs rendering tests almost correctly; the only minor bugs seem to be somewhere in attenuation of fixed function vertex lights. Everything else, including shaders, shadows, render to texture works without any problems.
Performance wise, of course it’s dozens to hundreds of times slower than a real graphics card, but hey. I also tested with Intel 965 (aka GMA X3000) integrated graphics for comparison. All this on Intel Core2 Quad (Q6600), 3 GB RAM, Windows XP SP2.
- Avert Fate demo: Radeon HD 3850 about 300 FPS, SwiftShader about 5 FPS (about 15 FPS if per-pixel lighting is turned off), Intel 965 about 22 FPS (about 50 FPS if per-pixel lighting is turned off).
- Scene with lots of objects and lots of shadow-casting lights: Radeon HD 3850 about 76 FPS, SwiftShader 2.5 FPS, Intel – shadows not supported, duh.
- High detail terrain with lots of vegetation and four cameras rendering it simultaneously: Radeon HD 3850 about 68 FPS, SwiftShader about 3 FPS, Intel 965 about 12 FPS.
Ok, so SwiftShader loses on performance to Intel 965, but the difference is only “a couple of times”, and not an order of magnitude or so. Pretty good I’d say.
Posted on 2008-04-07 14:05 in gpu, rendering | 3 Comments »
Seriously, what are they up to? Intel acquires Offset Software, a game development studio that is doing a game and an engine. Wait, I was thinking the game and tech are for PC and Xbox360? What would Intel do with that?
Not so long ago, some well known graphics guys went to work for Intel. A while ago Intel acquired Neoptica…
Signs of Larrabee coming? Intel starting to take GPUs seriously? Something else?
Posted on 2008-02-21 9:59 in gpu | 4 Comments »
I said so – 4 kilobyte intros are really getting interesting.
Meet kindernoiser – 4 kilobytes, quaternion Julia fractal on the GPU, screen space ambient occlusion and so on. iq has a nice article on the tech behind SSAO.
Keep ‘em coming!
Posted on 2007-11-21 14:39 in demos, gpu | No Comments »
Gleserg has interesting comments in my earlier post. So I thought I’d share what I am using right now, and try to throw some more complexities in :)
Here is what I am doing right now:
inline float4 EncodeFloatRGBA( float v ) {
    // factors are 255^0 .. 255^3
    return frac( float4(1.0, 255.0, 65025.0, 16581375.0) * v ) + 0.5/255.0;
}
inline float DecodeFloatRGBA( float4 rgba ) {
    return dot( rgba, float4(1.0, 1/255.0, 1/65025.0, 1/16581375.0) );
}
And this seems to work fine almost everywhere (see below). Why am I doing this – good question, I don’t have a hard theory on which bits go where and so on. I think I saw someone on gamedev.net forums saying that in hardware 0 == 0.0 and 255 == 1.0, and that truncation is actually done on the values (not rounding). So that would mean you multiply by 255 and add a half of a bit.
Now, the trick: the above does not quite work on Radeons (at least the X1600 that I’m mostly developing on while I’m on a Mac). Instead of adding 0.5/255.0, you have to subtract 0.55/255.0 – and that value is still not perfect, but that’s the best I could come up with by plowing through various combinations. I have no idea why this must be performed (24 bit internal precision? or does it round up? something else?). On GeForces and even Intel’s shader-capable hardware, the expected +0.5/255.0 value works.
…anyone up to figuring out the mathematical proof of why encoding/decoding this way actually works? :) And yes, the last component (the one that uses 16581375) is pretty much meaningless.
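A sketch of such a proof is possible under one idealized assumption, which real hardware may or may not satisfy: a channel value x in [0, 1) is stored as the byte floor(255 x), read back as byte/255, and no bias is added. The symbols x_i and b_i below are introduced only for this derivation. The key step is that frac(255 · frac(255^i v)) = frac(255^(i+1) v), because the part removed by the inner frac is an integer multiple of 255.

```latex
\begin{align*}
x_i &= \operatorname{frac}(255^i v), \qquad b_i = \lfloor 255\, x_i \rfloor, \qquad i = 0,\dots,3 \\
255\, x_i &= b_i + \operatorname{frac}(255\, x_i) = b_i + x_{i+1}
  \quad\Longrightarrow\quad x_i = \frac{b_i}{255} + \frac{x_{i+1}}{255} \\
v = x_0 &= \sum_{i=0}^{3} \frac{b_i}{255^{\,i+1}} + \frac{x_4}{255^4} \\
\text{decode} &= \sum_{i=0}^{3} \frac{b_i}{255}\cdot\frac{1}{255^{\,i}}
  = v - \frac{x_4}{255^4}, \qquad 0 \le \frac{x_4}{255^4} < 255^{-4} \approx 2.4\times 10^{-10}
\end{align*}
```

In this model the four bytes are simply the first four base-255 digits of v, and the dot product adds them back up. If the hardware rounds to nearest instead, b_i becomes floor(255 x_i + 0.5), and subtracting 0.5/255 before the write brings back the floor, which might be why a negative bias is needed on some cards. The fourth digit mostly helps on paper: 255^3 is close to 2^24, so a 32 bit float has few or no fractional bits left in 255^3 · v.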
Posted on 2007-06-29 9:58 in gpu | 5 Comments »
Breaking news: sometimes seemingly trivial tasks take insane amounts of time! I am sure no one knew this before! So it was yesterday – almost a whole day spent fighting rounding/precision errors when encoding floating point numbers into regular 8 bit RGBA textures. You know, the trivial stuff where you start with
inline float4 EncodeFloatRGBA( float v ) {
    return frac( float4(1.0, 256.0, 65536.0, 16777216.0) * v );
}
inline float DecodeFloatRGBA( float4 rgba ) {
    return dot( rgba, float4(1.0, 1.0/256.0, 1.0/65536.0, 1.0/16777216.0) );
}
and everything is fine until sometimes, somewhere there’s “something wrong”. Must be rounding or quantization errors; or maybe I should use 255 instead of 256; plus optionally add or subtract 0.5/256.0 (or would that be 0.5/255.0?). Or maybe the error is entirely somewhere else, and I’m just chasing ghosts here!
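The 255 versus 256 question can be poked at with a small CPU model. The Python sketch below is an assumption-laden illustration, not a statement about any specific GPU: it supposes that a channel value x is stored as the byte floor(255 * x) and read back as byte / 255, and it uses exact fractions to keep the model itself free of rounding. `roundtrip` and `worst_error` are hypothetical helper names.

```python
from fractions import Fraction
from math import floor


def roundtrip(v, base):
    """Encode v into four bytes with factors base**i, then decode again.

    ASSUMPTION: a channel value x is stored as the byte floor(255 * x)
    and read back as byte / 255 (so 0 is 0.0 and 255 is 1.0).
    """
    total = Fraction(0)
    for i in range(4):
        scaled = Fraction(base) ** i * v
        x = scaled - floor(scaled)          # frac()
        byte = floor(255 * x)               # what ends up in the texture
        total += Fraction(byte, 255) / Fraction(base) ** i
    return total


def worst_error(base, samples=2000):
    """Largest |roundtrip(v) - v| over evenly spaced v in [0, 1)."""
    return max(abs(roundtrip(Fraction(i, samples), base) - Fraction(i, samples))
               for i in range(samples))


if __name__ == "__main__":
    for base in (255, 256):
        print(f"base {base}: worst error {float(worst_error(base)):.2e}")
```

In this model the 256 factors do not line up with bytes that mean byte/255. For v = 0.5 the first channel stores 127, the other three store 0, and the decoded value is 127/255, which is off by 1/510. With factors of 255 every byte is one base-255 digit of v and the error stays below 255^-4.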
What would you do then? Why, of course, build an Encoding Floats Into Textures Studio 2007! (don’t tell me it’s not a great idea for a commercial software package! game studios would pay insane amounts of money for a tool like this!) The images here are exactly that – render into a texture, encoding UV coordinate as RGBA, then read from that texture, displaying RGBA and error from the expected value in some weird way. Turns out image postprocessing filters in Unity are a pretty good tool to do all this. Yay!
Sometimes in situations like this I figure out that graphics hardware still leaves a lot to be desired. This last image shows some calculations that depend only on the horizontal UV coordinate, so they should produce some purely vertical pattern (sans the part at the bottom, that is expected to be different). Heh, you wish!
Posted on 2007-03-03 18:33 in gpu | 13 Comments »