Direct3D GPU Hacks

I’m catching up on various GPU hacks that exist for Direct3D 9 (things like native shadow mapping, render to vertex buffer, etc.). Turns out there are a lot of them, but all the information is scattered around the intertubes.

So here are the D3D9 hacks known to me in one place.

Let me know if I missed something or got something wrong. I also want to figure out if Intel GPUs/drivers implement any of them.


Improving C#/Mono for Games

A tweet by Michael Hutchinson on C#/Mono usage in games prompted a couple of short replies from me (one, two). But then I started thinking a bit more, and here’s a longer post on what is needed for C# (and more specifically Mono) to be used in games more.

In Unity we use Mono to run game code (well, Unity users write that code, not us). Overall it’s great; it has tons of advantages, loads of awesome and a flying ninja here and there. But no technology is perfect, right?

Edit: Miguel rightly points out in the comments that the Mono team is solving, or has already solved, some of these issues. In some areas they are moving so fast that we at Unity can’t keep up!

#1: Garbage Collector

Most game developers do not like Garbage Collection (GC) very much. Typically, the more limited/hardcore their target platform is, the more they dislike GC. The reason? Most GC implementations cause rather unpredictable spikes.

Here’s a run of something recorded in the (awesome) Unity 2.6 profiler. Horizontal axis is time, vertical is CPU time spent in that frame:

Garbage collection spikes

At the bottom you see dark red thingies appearing once in a while. This is the garbage collector kicking in, because some script code is allocating memory at runtime.

Now of course, it is possible to write your script code so that it does no allocations (or almost none). Preallocate your objects into pools, manually invoke GC at a point in the game where a small hiccup won’t affect gameplay, etc. In fact, a lot of iPhone games made with Unity do that.
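For illustration, a minimal sketch of that pooling approach in C# (the Bullet type and the pool API are made up; System.GC.Collect is the real manual-collection call):

 using System;
 using System.Collections.Generic;

 // Preallocate everything up front, so gameplay code never calls "new"
 // (and thus never creates garbage) in the middle of a level.
 class Bullet { public float x, y, z; }

 class BulletPool
 {
     readonly Stack<Bullet> m_Free = new Stack<Bullet>();

     public BulletPool (int capacity)
     {
         for (int i = 0; i < capacity; ++i)
             m_Free.Push (new Bullet ());
     }

     public Bullet Spawn () { return m_Free.Pop (); }    // no allocation
     public void Despawn (Bullet b) { m_Free.Push (b); } // no garbage
 }

 // ...and at a loading screen or some other safe moment, flush whatever
 // garbage did accumulate, so the spike happens where nobody notices:
 // GC.Collect ();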

But that kind of sidesteps the whole advantage of “garbage collector almost frees you from doing memory management”. If you’re not allocating anything anyway, the GC might as well not be there!

A little side story. ReJ (Unity’s iPhone tech lead) and I tried to explain what GC is to a non-programmer. Here’s what we came up with:

Garbage Collection is this cleaning service for lazy people. They can just leave any garbage on the floor in their house, and once in a while a garbage guy comes, collects all the garbage and takes it outside. Now, there are some intricacies to the service. First, you never know when the garbage guy will come. You might be taking a shower, meditating or having some “sexy time” - and it’s in the service agreement that when the garbage guy comes, you have to let him in to do his work.

Second thing is, the garbage guy is usually some homeless drunkard. He smells so bad that when he comes, you have to stop whatever you were doing, go outside and wait until he’s done with the garbage collection. Even your neighbors, who might be doing something else entirely in parallel, actually have to stop and idle while garbage is being collected in your house!

There are variations of this GC service. One variation is called “moving GC”, where the garbage guy also rearranges your furniture while collecting the garbage - he moves it all to one side of your house. This is so that you can buy a bigger piece of furniture, or throw out a huge piece of garbage - and there will be enough unused space for you to do that! Of course this way the GC process takes somewhat longer, but hey, you get all your stuff nicely packed into one corner.

Can’t you see that this service is the greatest idea of all time?

This is quite a harsh attitude towards GC, and of course it’s exaggerated. But there is some truth to it. So how could GC be fixed?

GC fix #1: more control

More explicit control over when & for how long GC runs. I want to tell the garbage guy, “come every day at 4PM and do your work for 20 minutes”. In the game, I’d want to call GC with an upper time limit, say 1 millisecond per call, and I would be calling that 30 times per second.
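No such API exists in .NET or Mono today; this is a purely hypothetical sketch of what that kind of control could look like:

 // Hypothetical API - nothing like this exists in .NET/Mono today.
 // The idea: give the collector a fixed time budget each frame and
 // let it resume where it left off on the next call.
 void Update ()
 {
     GC.CollectIncremental (1); // hypothetical: collect for at most 1 ms
 }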

GC fix #2: sometimes I want to clean garbage myself

Inefficiencies and unpredictability of GC cause people to do even more work than normal, oldskool manual memory management. Why not provide an option to deal with deallocations manually? E.g. a keyword reallynew could allocate an object that is not part of the garbage collected world. It would function as a regular .NET object, except it would be the user’s responsibility to reallydelete it.

Mono is already extending .NET (see SIMD and continuations). Maybe it makes sense to add some way to bypass the garbage collector?
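In the meantime, for blittable structs you can approximate “not part of the GC world” with unmanaged memory; a sketch using the real Marshal API (the Particle type is made up):

 using System;
 using System.Runtime.InteropServices;

 struct Particle { public float x, y, z; }

 class ManualMemory
 {
     static void Main ()
     {
         // Allocated outside the GC heap - the collector never scans this.
         int size = 1024 * Marshal.SizeOf (typeof(Particle));
         IntPtr mem = Marshal.AllocHGlobal (size);

         // ... use it, e.g. via unsafe pointers ...

         // And it really is *your* responsibility to free it:
         Marshal.FreeHGlobal (mem);
     }
 }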

#2: Distribution Size

Using C#/.NET in a game requires having a .NET runtime. None of the interesting platforms are guaranteed to have it, and even on Windows you can’t count on it being present. Mono is great here in the sense that it can be used on many more platforms than Microsoft’s own .NET. It’s also great on distribution size, but only if you compare it to Microsoft’s .NET.

In the Unity Web Player, we package the Mono DLL + mscorlib assembly into something like 1.5 megabytes (after LZMA compression). Which is great compared to the 20+ megabytes of the .NET runtime, but not that great if you compare it to, say, the Lua runtime (which is less than 100 kilobytes).

On some platforms (iPhone, Xbox 360, PS3, …) it’s not possible to generate code at runtime, so Mono’s JIT does not work. All code that’s written in C# has to be precompiled to machine code ahead of time (AOT compilation). This is not a problem per se, but because the .NET framework was never designed with small size and few dependencies in mind, doing anything will ultimately pull in a lot of code.

We joke that doing anything in C# will result in an XML parser being included somewhere. This is not that far from the truth; e.g. calling float.ToString() will pull in the whole internationalization system, which probably needs to read some global XML configuration file somewhere to figure out whether daylight saving time is active when the Eastern European Brazilian Chinese calendar is used.

Size fix #1: custom core .NET libraries?

For game uses, most of the “fat” stuff in the .NET runtime is not really needed. float.ToString() could just always use a period as the decimal separator. Core libraries could consist of just the essential collections (list, array, hash table) and maybe a String class with just the essential methods. Maybe it’s worth sacrificing some of the generality of .NET if that could shave a couple of megabytes off your iPhone game size?
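For comparison, the way to dodge at least the locale machinery today is to pass an explicit culture - the behavior a slimmed-down core library could simply hardcode:

 using System.Globalization;

 float f = 3.14f;
 string a = f.ToString ();                             // locale dependent: "3.14" or "3,14"
 string b = f.ToString (CultureInfo.InvariantCulture); // always "3.14"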

Of course this is very much doable; “all that is needed” (tm) is writing a custom mscorlib+friends, and telling the C# compiler to never reference any of the “real” libraries.

Size fix #2: make Mono runtime smaller

The uncompressed Mono DLL in our Windows build is 1.5 megabytes. We have turned off all the easy stuff (profiler, debugger, logging, COM, AOT etc.). But probably some more could be stripped away. Do our games really need multiple AppDomains? Some fancy marshalling? I don’t know; it just feels that 1.5MB is a lot.

#3: Porting to New Platforms

You know this classic: “There’s no portable code. There’s only code that’s been ported.”

Most existing gaming platforms are quite weird. Most upcoming smartphone platforms are also quite weird, each in its own interesting way. Porting a large project like Mono is not easy, especially since parts of it (the JIT or AOT engine) depend highly on the platform.

For Unity iPhone, the unexpected discovery that it’s not possible to JIT on the iPhone delayed the initial release by something like 4 months. It did not help that in early iPhone SDK builds JIT was actually possible, and Apple only disabled runtime-generated code later. Making Mono actually work there required significant effort both from the Mono team and from Unity. We still have one guy working almost exclusively on Mono+iPhone issues!

Of course, maybe all the Mono iPhone work made porting to new platforms easier as a byproduct. But so far we don’t have Mono ported to any other platform at production quality. So judging from experience, we now always assume a Mono port will be a pain, just because “some nasty surprises will come up” (and they always do).

#4: Small Stuff

There are a ton of small bits where extending .NET would benefit gaming scenarios. For example:

Suppose there is some array on the native engine side; for example, vertex positions in a mesh (3 floats per vertex). Is it possible to make that piece of memory be represented as a native struct array on the .NET side? So that it would not involve any extra memory copies, but N vertices somewhere in memory would look just like Vector3[N] to C#?
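With unsafe code you can get fairly close today. A sketch, assuming the struct layout matches the native side (compile with /unsafe; the Vector3 here is a hypothetical stand-in for the engine’s type):

 using System;
 using System.Runtime.InteropServices;

 [StructLayout (LayoutKind.Sequential)]
 struct Vector3 { public float x, y, z; }

 class MeshView
 {
     // nativePtr points at n tightly packed vertex positions owned by the engine
     public static unsafe float SumY (IntPtr nativePtr, int n)
     {
         Vector3* verts = (Vector3*)nativePtr; // reinterpret in place, no copy
         float sum = 0;
         for (int i = 0; i < n; ++i)
             sum += verts[i].y; // reads straight from native memory
         return sum;
     }
 }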

On a similar note, “strided arrays” would be useful. For example, mesh data is often interleaved, so for each vertex there is a position, normal, UVs and so on. It would be cool if in C# the position array still looked like Vector3[N], but internally the distance between elements were larger than the 12 bytes required for a Vector3.
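A sketch of what such a strided view could look like, built on the same unsafe trick (and the Vector3 struct) from the previous snippet:

 // Element i lives at base + i*stride; stride can be larger than
 // sizeof(Vector3) when positions are interleaved with normals, UVs etc.
 unsafe struct StridedVector3Array
 {
     byte* m_Base;
     int m_StrideBytes; // e.g. 32 for an interleaved pos+normal+uv vertex

     public StridedVector3Array (byte* basePtr, int strideBytes)
     {
         m_Base = basePtr;
         m_StrideBytes = strideBytes;
     }

     public Vector3 this[int i]
     {
         get { return *(Vector3*)(m_Base + i * m_StrideBytes); }
         set { *(Vector3*)(m_Base + i * m_StrideBytes) = value; }
     }
 }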

Where do we go from here?

The above are just random ideas, and I’m not complaining about Mono. It is great! It’s just not perfect. Mono being open source is a very good thing: it means pretty much any interested party can improve it as needed. So rock on.


Deferred Cascaded Shadow Maps

Reading “Rendering Technology at Black Rock Studios” made me realize that the cascaded shadow maps I did 2+ years ago in Unity 2.0 are probably called “deferred shadowing”. Since I never wrote up how they are done… here:

The process is roughly this (all of this is DX9 level tech on PCs; later tech or consoles could and should use more optimizations):

  1. Render shadow map cascades, all of them packed into one shadow map via viewports.

  2. Collect shadows into a screen-sized render target. This is the shadow term.

  3. Blur the shadow term.

  4. In regular forward rendering, use the shadow term in screen space.

More detail:

Render Shadow Cascades

Nothing fancy here. All cascades are packed into a single shadow map. For example, two 512x512 cascades would be packed into a 1024x512 shadow map side by side.
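The packing itself is trivial; a sketch of the viewport layout for that side-by-side case (the types and function here are made up for illustration):

 // Cascade i of size 512 renders into viewport (i*512, 0, 512, 512) of the
 // 1024x512 atlas; that cascade's shadow matrix then gets a scale/bias so
 // lookups land in the correct half.
 struct Viewport { public int x, y, width, height; }

 static class CascadeAtlas
 {
     public static Viewport GetViewport (int cascade, int cascadeSize)
     {
         Viewport vp;
         vp.x = cascade * cascadeSize;
         vp.y = 0;
         vp.width = cascadeSize;
         vp.height = cascadeSize;
         return vp;
     }
 }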

Screen-space Shadow Term

Render all shadow receivers with a shader that “collects” the shadow map term. In effect, shadows from all cascades are collected into a screen-sized texture. After this step, the original cascaded shadow maps are not needed anymore.

Unity supports up to 4 shadow map cascades, which neatly fit into a float4 register in the pixel shader. The correct cascade is sampled just once, without static or dynamic branching. Pixel shader pseudocode:

 // z = eye-space depth of this pixel; the splits hold the cascade ranges.
 // weights ends up 1.0 in the component of the cascade z falls into.
 float4 near = float4 (z >= _LightSplitsNear);
 float4 far = float4 (z < _LightSplitsFar);
 float4 weights = near * far;
 // pick the shadow map coordinates of the matching cascade
 float2 coord =
     i._ShadowCoord[0].xy * weights.x +
     i._ShadowCoord[1].xy * weights.y +
     i._ShadowCoord[2].xy * weights.z +
     i._ShadowCoord[3].xy * weights.w;
 float sm = tex2D (_ShadowMapTexture, coord).r;

Additionally, shadow fadeout is applied here (shadows in Unity can be cast up to a specified distance from the camera, and they fade out when approaching that distance).
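The fadeout itself is just a ramp towards the maximum shadow distance; something along these lines (where the fade starts is a made-up knob, not necessarily what Unity uses):

 using System;

 static class ShadowFade
 {
     // Shadow term fades to 1 (no shadow) as the pixel nears maxDistance.
     public static float Apply (float shadow, float dist, float maxDistance)
     {
         float fadeStart = maxDistance * 0.8f; // hypothetical: fade over the last 20%
         float t = (dist - fadeStart) / (maxDistance - fadeStart);
         t = Math.Max (0f, Math.Min (1f, t)); // saturate
         return shadow + (1f - shadow) * t;   // lerp(shadow, 1, t)
     }
 }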

After this I end up with the shadow term in screen space. Note that I do not do any shadow map filtering here; that is done in screen space later.

On PCs in DX9 there is (or there was?) no easy/sane way to read the depth buffer in the pixel shader, so while collecting shadows the shader also outputs depth, packed into two channels of the render target.
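One common way to pack a [0,1) depth value into two 8-bit channels (not necessarily the exact scheme used here, but the same idea), written as C# for clarity:

 using System;

 static class DepthPack
 {
     // x gets the coarse 8 bits, y gets the fine remainder.
     public static void Encode (float d, out float x, out float y)
     {
         float scaled = d * 255.0f;
         float coarse = (float)Math.Floor (scaled);
         x = coarse / 255.0f;
         y = scaled - coarse; // fractional part, still in [0,1)
     }

     public static float Decode (float x, float y)
     {
         return x + y / 255.0f; // exact inverse of Encode
     }
 }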

Screen-space Shadow Blur

The previous step results in a screen-space shadow term and depth. The shadow term is blurred into another render target, using a spatially varying Poisson disc-like filter.

Filter size depends on depth (shadow boundaries closer to the camera are blurred more). The filter also discards samples if the difference in depth is larger than a threshold, to avoid blurring over object boundaries. It’s not totally robust, but seems to work quite well.

Using shadow term in forward rendering

In forward rendering, this blurred shadow term texture is used. Here the shadow term already has filtering & fadeout applied, and the shaders do not need to know anything about shadow cascades. Just read the pixel from the texture and use it in the lighting computation. Done!

Fin

Back then I didn’t know this would be called “deferred” (that would probably have scared me away!). I don’t know if this approach is any good, but so far it works quite well for Unity’s needs. It also reduces the shader permutation count a lot, which I like.


Fixing bugs, in Tom Waits' words

Mixing a sprint of bug fixing before the release with Tom Waits’ music results in an interesting combination. For example, Crossroads describes the bug fixing process perfectly:

And that’s where ol’ George found himself out there at the FogBugz
Fixin’ the devil’s bugs
Now, a man figures it’s his bugs and he’ll assign whom he wants
But it don’t always work out that way
You see, some bugs are special for a certain target
A certain platform, or a certain person
And no matter whom you’re assignin’, that’s where the bug ’ll end up
And in the moment of assigning your mouse turns into a dowser’s wand
And clicks where the bug wants to go.

Uhm. Yeah.


Strided blur and other tips for SSAO

If you’re new to SSAO, here are good overview blog posts: meshula.net and levelofdetail. Some tips and an idea on strided blur below.

Bits and pieces I found useful

  • SSAO can be generated at a smaller resolution than the screen, with a depth+normals aware upsample/blur step.

  • If a random offset vector points away from the surface normal, flip it. This makes the random vectors land in the upper hemisphere, which reduces false occlusion on flat surfaces. Of course, this requires having surface normals.

  • When generating random vectors for your AO kernel:

    • Generate vectors inside the unit sphere (not on the unit sphere).
    • Use energy minimization to distribute your samples better, especially at low sample counts. See the malmer.ru blog post. (A sketch of these kernel tips follows after this list.)
  • In your AO blurring/upsampling step: there’s no need to sample each pixel for the blur. Just skip some of them, i.e. make the kernel offsets larger. See below.
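Here’s the promised sketch of the kernel tips: rejection sampling for vectors inside the unit sphere, plus the hemisphere flip from the earlier bullet (the energy minimization pass is left out):

 using System;

 static class AOKernel
 {
     // Generate sample offsets *inside* the unit sphere (not on its
     // surface) via simple rejection sampling.
     public static float[,] Generate (int count, Random rng)
     {
         var samples = new float[count, 3];
         for (int i = 0; i < count; ++i)
         {
             float x, y, z;
             do
             {
                 x = (float)(rng.NextDouble () * 2 - 1);
                 y = (float)(rng.NextDouble () * 2 - 1);
                 z = (float)(rng.NextDouble () * 2 - 1);
             } while (x * x + y * y + z * z > 1.0f);
             samples[i, 0] = x; samples[i, 1] = y; samples[i, 2] = z;
         }
         return samples;
     }

     // If an offset points away from the surface normal, flip it into
     // the upper hemisphere to reduce false occlusion on flat surfaces.
     public static void FlipToHemisphere (float[] v, float[] n)
     {
         if (v[0] * n[0] + v[1] * n[1] + v[2] * n[2] < 0)
         {
             v[0] = -v[0]; v[1] = -v[1]; v[2] = -v[2];
         }
     }
 }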

Strided blur for AO

Normally you’d blur the AO term using some sort of standard blur, for example a separable Gaussian: a horizontal blur, followed by a vertical blur. Here’s how one can imagine the horizontal blur kernel:

Horizontal Blur Kernel

Here’s how Rune taught me to blur better:

Rune: The other thing is the blur. I tried to make the blur 4 times stronger, and it looks much better IMO without any artifacts I could see. I could even use 4x downsampling with that blur amount and still get acceptable results.

Aras: how did you make it 4x stronger? (I was going to say that blur step is already quite expensive, and I don’t want to add more samples to make it even more expensive, yadda yadda)

Rune:

 m_SSAOMaterial.SetVector ("_TexelOffsetScale", m_IsOpenGL ?
     new Vector4 (4,0,1.0f/m_Downsampling,0) :
     new Vector4 (4.0f/source.width,0,0,0));

And similar for vertical.

Aras: hmm. that’s strange :)

Rune: I have no idea what I’m doing of course but it looks good.

Aras: so this way it does not do a Gaussian on 9x9 pixels, but instead only takes every 4th pixel. Wider area, but… it should not work! :)

Rune: It creates a very fine pattern at pixel level but it’s way more subtle than the noise you get otherwise.

Aras: ok (hides in the corner and weeps)

So yeah. The blur kernel can be “spread” to skip some pixels, effectively resulting in a larger blur radius for the same sample count:

Blur with 2 pixel stride

Or even this:

Blur with 3 pixel stride

Yes, it’s not a correct blur. But that’s okay; we’re not building nuclear reactors that depend on SSAO blur being accurate. If you are, SSAO is probably the wrong approach anyway - I’ve heard it’s not that useful for nuclear stuff.

I’m not sure what this blur should be called. Strided blur? Interleaved blur? Interlaced blur? Or maybe everyone is doing this already and it has a well established name? Let me know.
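To make the “strided” idea concrete, here’s one direction of the blur as a C# sketch (the tap count, depth threshold and sampling delegates are made up; the real thing lives in a pixel shader):

 using System;

 static class StridedBlur
 {
     // One direction of a separable 9-tap blur with a pixel stride.
     // stride = 1 is the regular blur; stride = 2 or 3 widens the kernel
     // without adding samples. Taps across a depth discontinuity are
     // skipped so the blur doesn't leak over object boundaries.
     public static float BlurAO (Func<int, float> ao, Func<int, float> depth,
                                 int x, int stride, float depthThreshold)
     {
         float centerDepth = depth (x);
         float sum = 0, count = 0;
         for (int tap = -4; tap <= 4; ++tap)
         {
             int sx = x + tap * stride; // skip pixels: wider radius, same cost
             if (Math.Abs (depth (sx) - centerDepth) > depthThreshold)
                 continue;
             sum += ao (sx);
             count += 1;
         }
         return sum / count; // the center tap always passes, so count >= 1
     }
 }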

Some images of the blur in action. Raw AO term (very low sample count - 8 - and increased contrast, on purpose):

Raw AO at low sample count

Regular 9x9 blur (does not blur over depth+normals discontinuities):

Blurred AO

Blur that goes in 2 pixel stride (effectively 17x17):

Blurred AO with stride 2

It does create a fine interleaved pattern because it skips pixels. But you get a wider blur!

Blurred AO with stride 2, magnified

Blur that goes in 3 pixel stride (effectively 25x25):

Blurred AO with stride 3

At a 3 pixel stride the artifacts become apparent. But hey, this is a very low AO sample count, with increased contrast and no textures in the scene.

Blurred AO with stride 3, magnified

For the sake of completeness, the same raw AO term, but computed at 2x2 smaller resolution (still using the low sample count etc.):

AO computed at lower resolution

Now, the 2x2 smaller AO, blurred with a 3 pixel stride:

AO at lower resolution, blurred with 3 pixel stride

Happy blurring!