Tiled Forward Shading links

The main idea of my previous post was roughly this: in forward rendering, there’s no reason why we still have to use per-object light lists. We can apply roughly the same ideas as in tiled deferred shading.

It’s really nice to see that other people have thought about this before, or at about the same time; here are some links:

As Andrew Lauritzen points out in the comments of my previous post, claiming “but deferred will need super-fat G-buffers!” is an oversimplification. You could just as well store material indices plus the data needed to sample textures (UVs + derivatives); and going “deferred” gives you more choices in how you schedule your computations.

There’s no fundamental difference between “forward” and “deferred” these days. As soon as you have a Z-prepass you are already caching/deferring something, and from there it’s a whole spectrum of options for what to cache or “defer” for later computation, and how.

Ultimately, of course, the best approach depends on a million factors. The only lesson to take from this post is that “forward rendering does not have to use per-object light lists”.


2012 Theory for Forward Rendering

Good question in a tweet by @ivanassen:

So what is the 2012 theory on lights in a forward renderer?

Hard to answer that in 140 characters, so here goes a raw brain dump (warning: not checked in practice!).

Short answer

A modern forward renderer for DX11-class hardware would probably be something like AMD’s Leo demo.

They seem to be doing light culling in a compute shader, and the result is per-pixel / per-tile linked lists of lights. Then the scene is rendered normally in forward rendering, fetching the light lists and computing shading. The advantages are many: arbitrary shading models with many parameters that would be hard to store in a G-buffer; semitransparent objects; hardware MSAA support; much smaller memory requirements compared to a fat G-buffer layout.

The disadvantage, I guess, would be storing the linked lists: memory usage is potentially unbounded, though various schemes similar to Adaptive Transparency could probably be used to cap the maximum number of lights per pixel/tile.
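
To make that concrete, here’s a minimal C# sketch of the “capped list” idea: instead of per-pixel linked lists, each tile gets a fixed-size array of light indices plus a count, and the forward shader just loops over it. Everything here (the struct layout, the cap of 64, the Lambert-only shading, all the names) is my own illustration, not anything taken from the Leo demo; in a real renderer this data would live in structured buffers and the loop would run in a pixel shader.

```csharp
using System;

// Hypothetical CPU-side mirror of the data a culling pass would produce;
// on the GPU these would live in structured buffers / UAVs.
public struct PointLight
{
    public float X, Y, Z;   // view-space position
    public float Range;     // influence radius
    public float R, G, B;   // color
}

public class TileLightList
{
    public const int MaxLightsPerTile = 64;           // hard cap instead of an unbounded linked list
    public int Count;                                  // how many lights actually affect this tile
    public int[] Indices = new int[MaxLightsPerTile];  // indices into the global light buffer

    public void Add(int lightIndex)
    {
        // Past the cap we simply drop lights; a smarter scheme (a la Adaptive
        // Transparency) would merge or keep only the most important ones.
        if (Count < MaxLightsPerTile)
            Indices[Count++] = lightIndex;
    }
}

public static class ForwardShading
{
    // What the forward pixel shader would do: fetch the tile's list and
    // accumulate lighting. Simple Lambert diffuse, purely for illustration.
    public static float[] ShadePixel(TileLightList tile, PointLight[] allLights,
                                     float px, float py, float pz,   // surface position
                                     float nx, float ny, float nz)   // surface normal
    {
        float r = 0, g = 0, b = 0;
        for (int i = 0; i < tile.Count; ++i)
        {
            PointLight l = allLights[tile.Indices[i]];
            float lx = l.X - px, ly = l.Y - py, lz = l.Z - pz;
            float dist = (float)Math.Sqrt(lx * lx + ly * ly + lz * lz);
            if (dist > l.Range)
                continue;
            float ndotl = Math.Max(0.0f, (nx * lx + ny * ly + nz * lz) / Math.Max(dist, 1e-5f));
            float atten = 1.0f - dist / l.Range;  // crude falloff, for illustration only
            r += l.R * ndotl * atten;
            g += l.G * ndotl * atten;
            b += l.B * ndotl * atten;
        }
        return new[] { r, g, b };
    }
}
```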

Deferred == Caching

All the deferred lighting/shading approaches are essentially caching schemes. We cache some amount of surface information in screen space, in order to avoid fetching or computing the same information over and over again while applying lights one by one, as traditional forward rendering does.

Now, caching in screen space leads to disadvantages like “it’s really hard to do transparencies”, since with transparencies you no longer have one point in space mapping to one pixel on screen. There’s no reason why caching has to be done in screen space, however; lighting could just as well be computed in texture space (like some skin rendering techniques do, though for a different reason), in world space (voxels?), etc.

Does “modern” forward rendering still need caching?

Caching information was important because in DX9 / Shader Model 3 times it was hard to do forward rendering that could apply an almost arbitrary, variable number of lights, with good efficiency, in one pass. That led to shader combination explosion, inefficient multipass rendering, or both. But now we have DX11, compute shaders, structured buffers and unordered access views, so maybe we can actually do better?

Because at some point we will want BRDFs with more parameters than is viable to store in a G-buffer (side image: this is half of the parameters for one material). We will want many semitransparent objects. And then we’re back to square one: we cannot do this efficiently in a traditional “deferred” way where we cache N numbers per pixel.

AMD’s Leo demo goes in that direction. It seems to take the tiled deferred approach to light culling and apply it to forward rendering.

I imagine it doing something like:

  1. Z-prepass:

    1. Render a Z-prepass of opaque objects to fill the depth buffer.
    2. Store that away (copy into another depth buffer).
    3. Continue the Z-prepass with transparent objects, writing to depth.
    4. Now we have two Z buffers, and for any pixel we know the Z-extents of anything interesting in it (from the closest transparent object up to the closest opaque surface).
  2. Shadowmaps, as usual. Would need to keep all shadowmaps for all lights in memory, which can be a problem!

  3. Light culling, very similar to what you’d do in the tiled deferred case!

    1. Have all lights stored in a buffer: light types, positions/directions/ranges/angles, colors, etc.
    2. From the two depth buffers above, we can compute Z ranges per pixel/tile in order to do better light culling.
    3. Run a compute shader that does the light culling. This could be done per pixel or per small tile (e.g. 8x8). The result is a buffer (or lists) per pixel or tile, containing the lights that affect said pixel or tile; see the sketch after this list.
  4. Render objects in forward rendering:

    1. The Z-buffer is already pre-filled from step 1.1.
    2. Each shader would have to do an “apply all lights that affect this pixel/tile” computation, which involves fetching that arbitrary light information, looping over the lights, etc.
    3. Otherwise, each object is free to use as many shader parameters as it wants, or any BRDF it wants.
    4. Rendering order is like usual forward rendering: batch-friendly order for opaque objects (since Z is prefilled already), per-object or per-triangle back-to-front order for semitransparent objects.
  5. Profit!
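
To make step 3 a bit more concrete, here is a rough CPU-side C# reference of what the per-tile culling compute shader would compute, the way I imagine it: take the tile’s min/max view-space depth, build a conservative view-space box for the tile, and test each light’s sphere of influence against it. The tile size, the sphere-vs-box test and all names are my own assumptions for illustration (reusing the PointLight / TileLightList types from the earlier sketch); a real implementation would run one thread group per tile and might test against the tile’s frustum planes instead.

```csharp
using System;

// CPU-side reference of per-tile light culling (what the compute shader would do).
// Assumes a symmetric perspective projection and view-space light positions.
public static class TileCuller
{
    public const int TileSize = 8; // e.g. 8x8 pixels per tile

    public static TileLightList CullTile(
        int tileX, int tileY, int screenW, int screenH,
        float tanHalfFovY, float aspect,
        float tileMinZ, float tileMaxZ,        // per-tile Z range from the two depth buffers
        PointLight[] viewLights)
    {
        // Tile extents in normalized [-1, 1] screen coordinates.
        float x0 = 2.0f * (tileX * TileSize) / screenW - 1.0f;
        float x1 = 2.0f * Math.Min((tileX + 1) * TileSize, screenW) / screenW - 1.0f;
        float y0 = 2.0f * (tileY * TileSize) / screenH - 1.0f;
        float y1 = 2.0f * Math.Min((tileY + 1) * TileSize, screenH) / screenH - 1.0f;

        // Conservative view-space AABB of the tile over [tileMinZ, tileMaxZ].
        float sx = tanHalfFovY * aspect, sy = tanHalfFovY;
        float minX = Math.Min(x0 * sx * tileMinZ, x0 * sx * tileMaxZ);
        float maxX = Math.Max(x1 * sx * tileMinZ, x1 * sx * tileMaxZ);
        float minY = Math.Min(y0 * sy * tileMinZ, y0 * sy * tileMaxZ);
        float maxY = Math.Max(y1 * sy * tileMinZ, y1 * sy * tileMaxZ);

        var result = new TileLightList();
        for (int i = 0; i < viewLights.Length; ++i)
        {
            PointLight l = viewLights[i];
            // Sphere vs AABB: distance from the light to the closest point of the box.
            float dx = Math.Clamp(l.X, minX, maxX) - l.X;
            float dy = Math.Clamp(l.Y, minY, maxY) - l.Y;
            float dz = Math.Clamp(l.Z, tileMinZ, tileMaxZ) - l.Z;
            if (dx * dx + dy * dy + dz * dz <= l.Range * l.Range)
                result.Add(i); // light i affects this tile
        }
        return result;
    }
}
```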

Now, I have hand-waved over some potentially problematic details.

For example, “two depth buffers” is not robust in areas where there are no opaque objects; we’d need to track minimum and maximum depths of the semitransparent stuff, or accept worse light culling for those tiles. Likewise, copying the depth buffer might lose some hardware Hi-Z information, so in practice it could be better to track semitransparent depths using another approach (min/max blending of a float texture, etc.).
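
For what it’s worth, the “Z-extents per tile” input to the culling sketch above could be computed with a simple per-tile min/max reduction over the two depth buffers, something like the sketch below. Depths are assumed to be already linearized to view space, and all names are made up; note how a tile with no opaque surfaces ends up with a useless far bound, which is exactly the robustness issue mentioned.

```csharp
using System;

public static class TileDepthRange
{
    // Per-tile depth range: min over the "opaque + transparent" depth buffer,
    // max over the "opaque only" one. Assumes the tile lies fully on screen.
    public static void Compute(float[] opaqueDepth, float[] allDepth,
                               int width, int tileX, int tileY, int tileSize,
                               out float minZ, out float maxZ)
    {
        minZ = float.MaxValue;
        maxZ = 0.0f;
        for (int y = tileY * tileSize; y < (tileY + 1) * tileSize; ++y)
            for (int x = tileX * tileSize; x < (tileX + 1) * tileSize; ++x)
            {
                int i = y * width + x;
                minZ = Math.Min(minZ, allDepth[i]);    // closest surface of any kind
                maxZ = Math.Max(maxZ, opaqueDepth[i]); // farthest opaque surface that needs lighting
            }
        // If there are no opaque surfaces in the tile, maxZ is just the far plane
        // (or garbage), so culling degrades; see the caveat above.
    }
}
```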

The step 4.2 bit about “apply all lights” assumes there is some way to do that efficiently while supporting complicated things like each light having a different cookie/gobo texture, a different shadowmap, etc. Texture arrays could almost certainly be used here, but since this is just a brain dump without verification in practice, it’s hard to say how this would work.

Update: other papers came out describing almost the same idea, with actual implementations & measurements. Check them out here!


Prophets and duct-tapers or: useful programmer traits

I liked Pierre’s The Prophet Programmer post. Go read it now.

Now of course that post is a rant. It exaggerates. It paints everything in one-bit grayscale. There’s never one person completely like this “prophet programmer” and another like the idolized “best programmer… not afraid of anything!!1”.

But it does highlight at least this: some aspects of a programmer’s behavior are useful, and others are not.

Obsessing over the latest hype, “the proper ways”, or following books to the letter is not, by itself, useful. Sure, sometimes a dash of “proper ways” or recommendations is good, but the benefits are really, really tiny. Hence it’s not worth thinking or arguing much about.

Here are some actually useful programmer traits instead. I’m thinking of real, actual people I work with here, even if I’m not naming names.

He feels what needs to be done to get the solution, in the big picture. Sometimes these are unusual ideas that probably no one else is doing - because everyone has always seen the problem in the standard way. The solutions seem obvious once you see them, but require some sort of step function in thinking to get there. The zero-iteration way of hooking up touchscreen device input to test the game is to play the game on the PC, stream images to the device and stream inputs back. The most hassle-free asset pipeline is the one with no “export/import asset” step. Or, for a more famous outside example, tablets before and after the iPad. You can rarely, if ever, get to things like that by doing user surveys or improving on existing solutions; you need someone who can see through it all and find the actual problem you want to solve. This guy is worth gold.

She can cut things. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to cut away”, quoth Saint-Exupéry. To be good at doing anything you (both you and your team) need to focus, which means cutting things. Let go of bad ideas and blind alleys. If your justification for doing it is “but we already spent so much time on it”, just don’t - it will only get worse. Cut features that aren’t quite ready by the deadlines. Remove old things that aren’t useful anymore. Doing that can and will make some people upset; it’s really, really hard to postpone or even completely abandon a thing that someone put a lot of effort into. But it needs to be done; and you need her on the team to make these hard decisions.

That other guy is freaking fast. And not in the sense of “types tons of code real fast and then sometimes it works, and two weeks later someone else has to clean it up”. No - he’s cranking out good, solid, tested, working code at incredible speed. Got ten bugs? They are fixed by the next day. Got a new feature to do? Commits with everything implemented (and working!) are pushed in a few days. When he goes on vacation your burndown chart changes slope. How does he do it? I don’t know. But by all means, hold onto him!

The other girl can figure out any complex problem real fast. Be it a tricky bug, unexpected behavior, really weird interaction with other systems - others could be spending hours, if not days, trying to figure out what’s going on. She, on the other hand, checks just a handful of things and goes “ha! the problem’s right there”. As if applying binary search to the whole problem space, except to everyone else the space seems unsorted and they don’t even know what they’re looking for!

This dude can keep a ton of context in his head while doing anything. How will this feature interact with dozens or even hundreds of other features? He’s able to think about all of them, and the majority of corner cases, and get everything right in one go - something that would take someone else dozens of roundtrips between coding & QA. When estimating effort for new things, he can immediately list all the tricky work that will need to be done, whereas others would go “sounds easy” only to find out it’s a month of work.

She’s not satisfied with the status quo. No, this isn’t good enough, she says; let me show you where and how spectacularly it breaks. And it does not matter if everyone else is doing it this way; here’s why putting that stuff into a uniform grid isn’t good. A lot of the time you need this extra bump to snap out of your own “this is good enough, no one will care” thoughts.

He’s doing a lot of boring work to get others more productive. There’s a ton of boring work on even the most exciting projects, and someone has to do it. He’s often the unsung hero, quietly working on infrastructure, build times, fixing annoyances in the tools, processes and workflows; all just so that others can be better at doing exciting things. You could call him a janitor or a plumber if you wish, but any place gets rotten and broken real fast without those people.

…and the list could go on. Unlike obsessing over irrelevant details, these things make a difference. They make your team run circles around others; they help you solve hard problems, invent things, and move forward at enormous velocity.

You need people with those traits and attitudes.


Fast Mobile Shaders or, I did a talk at SIGGRAPH!

Finally, after many years of dreaming, I made it to SIGGRAPH! And not only that, I also did a 1.5-hour talk/course with ReJ. This was the first time Unity had a real presence at SIGGRAPH, and I hope we’ll be more active & visible next time around.

Here it is, 100+ slides with notes: Fast Mobile Shaders (17MB pdf). This isn’t strictly about shaders; there’s info about mobile GPU architectures, general performance, hidden surface removal and so on. Also, graphs with logarithmic scales; can’t go wrong with that!


Testing Graphics Code, 4 years later

Almost four years ago I wrote about how we test rendering code at Unity. Did it stand the test of time, and more importantly, the growth of the company from fewer than 10 people to more than 100?

I’m happy to say it did! That’s it, move on to read the rest of the internets.

The earlier post focused more on the hardware compatibility area (differences between platforms, GPUs, driver versions, driver bugs and their workarounds, etc.). In addition to that, we run regression tests on a bunch of actual Unity-made games. All that is good and works; instead, let’s talk about the tests the rendering team at Unity uses in its daily life.

Graphics Feature & Regression Testing

In the daily life of a graphics programmer, you care about two things related to testing:

1. Whether a new feature you are adding, more or less, works.
2. Whether something new you added or something you refactored broke or changed any existing features.

Now, “works” is a vague term. Definitions can range from equally vague

Works For Me!

to something like

It has been battle tested on thousands of use cases, hundreds of shipped games, dozens of platforms, thousands of platform configurations and within each and every one of them there’s not a single wrong pixel, not a single wasted memory byte and not a single wasted nanosecond! No kittehs were harmed either!

In an ideal world we’d only consider the latter as “works”; however, that’s quite hard to achieve.

So instead we settle for small “functional tests”, where each feature has a small scene set up to exercise said feature (very much like what was talked about in the previous post). It’s the graphics programmer’s responsibility to add tests like that for their stuff.

For example, Fog handling might be tested by a couple scenes like this:

Another example, tests for various corner cases of Deferred Lighting:

So that’s basic testing for “it works” that the graphics programmers themselves do. Beyond that, features are tested by QA and a large beta testing group, tried, profiled and optimized on real actual game projects and so on.

The good thing is, doing these basic tests also provides you with point 2 (did I break or change something?) automatically. If after your changes, all the graphics tests still pass, there’s a pretty good chance you did not break anything. Of course this testing is not exhaustive, but any time a regression is spotted by QA, beta testers or reported by users, you can add a new graphics test to check for that situation.

How do we actually do it?

We use TeamCity for the build/test farm. It has several build machines set up as graphics test agents (unlike most other build machines, they need an actual GPU, or an iOS device connected to them, or a console devkit, etc.) that run graphics test configurations for all branches automatically. Each branch has its graphics tests run daily, and branches with “high graphics code activity” (i.e. branches that the rendering team is actually working on) have them run more often. You can always initiate the tests manually by clicking a button, of course. What you want to see at any time is this:

The basic approach is the same as 4 years ago: a “game level” (“scene” in Unity speak) for each test runs for a defined number of frames, everything runs at a fixed timestep, and a screenshot is taken at the end of each frame. Each screenshot is compared with the “known good” image for that platform; any difference equals “FAIL”. On many platforms you have to allow a couple of wrong pixels, because many consumer GPUs are not fully deterministic, it seems.
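
The comparison itself is about as simple as it sounds; a minimal sketch of it (the names and the exact tolerance are made up, not our actual code) could be:

```csharp
using System.Drawing; // Bitmap / GetPixel: slow, but plenty fast for test-sized screenshots

public static class ScreenshotCompare
{
    // Returns true if the screenshot matches the known-good image, allowing a few
    // differing pixels for GPUs that are not fully deterministic.
    public static bool Matches(Bitmap expected, Bitmap actual, int maxWrongPixels = 4)
    {
        if (expected.Width != actual.Width || expected.Height != actual.Height)
            return false; // wrong size is always a failure

        int wrong = 0;
        for (int y = 0; y < expected.Height; ++y)
            for (int x = 0; x < expected.Width; ++x)
                if (expected.GetPixel(x, y) != actual.GetPixel(x, y) && ++wrong > maxWrongPixels)
                    return false;
        return true;
    }
}
```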

So you have this bunch of “this is the golden truth” images for all the tests:

And each platform automatically tested on TeamCity has its own set:

Since the “test controller” can run on a different device than the actual tests (the case for iOS, Xbox 360, etc.), the test executable opens a socket connection to transfer the screenshots. The test controller is a relatively simple C# application that listens on a socket, fetches the screenshots and compares them with the template ones. Its result is output that TeamCity can understand, along with “build artifacts” consisting of the failed tests (for each failed test: the expected image, the failed image, and a difference image with increased contrast).
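
A skeleton of that controller loop might look like the sketch below. The wire format (test count, then a name + PNG bytes per test) and every name in it are invented for illustration, and it reuses the ScreenshotCompare sketch from above; the ##teamcity service messages printed to stdout are the standard way for a process to report test results to TeamCity.

```csharp
using System;
using System.Drawing;
using System.IO;
using System.Net;
using System.Net.Sockets;

// Skeleton of the controller: accept a connection from the device running the tests,
// read (test name, screenshot) pairs, compare each against its template image, and
// report results to TeamCity via service messages on stdout.
public static class TestController
{
    public static void Run(int port, string templateDir)
    {
        var listener = new TcpListener(IPAddress.Any, port);
        listener.Start();
        using var client = listener.AcceptTcpClient();
        using var reader = new BinaryReader(client.GetStream());

        int testCount = reader.ReadInt32();
        for (int i = 0; i < testCount; ++i)
        {
            string testName = reader.ReadString();
            byte[] png = reader.ReadBytes(reader.ReadInt32());

            Console.WriteLine($"##teamcity[testStarted name='{testName}']");
            string templatePath = Path.Combine(templateDir, testName + ".png");
            bool ok = File.Exists(templatePath) &&
                      ScreenshotCompare.Matches(new Bitmap(templatePath),
                                                new Bitmap(new MemoryStream(png)));
            if (!ok)
            {
                // Keep the failed screenshot around as a build artifact.
                File.WriteAllBytes(testName + ".failed.png", png);
                Console.WriteLine($"##teamcity[testFailed name='{testName}' message='image mismatch']");
            }
            Console.WriteLine($"##teamcity[testFinished name='{testName}']");
        }
        listener.Stop();
    }
}
```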

That’s pretty much it! And of course, automated tests are nice and all, but that should not get too much in the way of actual programming manifesto.