Optimizing Unity Renderer Part 3: Fixed Function Removal

Last time I wrote about some cleanups and optimizations. Since then, I got sidetracked into doing some Unity 5.1 work, removing Fixed Function Shaders and other unrelated things. So not much blogging about optimization per se.

Fixed Function What?

Once upon a time, GPUs did not have these fancy things called “programmable shaders”; instead they could be configured in more or less (mostly less) flexible ways, by enabling and disabling certain features. For example, you could tell them to calculate some lighting per-vertex; or to add two textures together per-pixel.

Unity started out a long time ago, back when fixed function GPUs were still a thing; so naturally it supports writing shaders in this fixed function style (“shaderlab” in Unity lingo). The syntax for them is quite easy, and actually they are much faster to write than vertex/fragment shader pairs if all you need is some simple shader.

For example, a Unity shader pass that turns on regular alpha blending, and outputs texture multiplied by material color, is like this:

Pass
{
  Blend SrcAlpha OneMinusSrcAlpha
  SetTexture [_MainTex] { constantColor[_Color] combine texture * constant }
}

compare that with a vertex+fragment shader that does exactly the same:

Pass
{
  Blend SrcAlpha OneMinusSrcAlpha
  CGPROGRAM
  #pragma vertex vert
  #pragma fragment frag
  #include "UnityCG.cginc"
  struct v2f
  {
      float2 uv : TEXCOORD0;
      float4 pos : SV_POSITION;
  };
  float4 _MainTex_ST;
  v2f vert (float4 pos : POSITION, float2 uv : TEXCOORD0)
  {
      v2f o;
      o.pos = mul(UNITY_MATRIX_MVP, pos);
      o.uv = TRANSFORM_TEX(uv, _MainTex);
      return o;
  }
  sampler2D _MainTex;
  fixed4 _Color;
  fixed4 frag (v2f i) : SV_Target
  {
      return tex2D(_MainTex, i.uv) * _Color;
  }
  ENDCG
}

Exactly the same result, a lot more boilerplate typing.

Now, we have removed support for actual fixed function GPUs and platforms (in practice: OpenGL ES 1.1 on mobile and Direct3D 7 GPUs on Windows) in Unity 4.3 (that was in late 2013). So there’s no big technical reason to keep on writing shaders in this “fixed function style”… except that 1) a lot of existing projects and packages already have them and 2) for simple things it’s less typing.

That said, fixed function shaders in Unity have downsides too:

  • They do not work on consoles (like PS4, XboxOne, Vita etc.), primarily because generating shaders at runtime is very hard on these platforms.
  • They do not work with MaterialPropertyBlocks, and as a byproduct, do not work with Unity’s Sprite Renderer, nor with materials animated via the animation window.
  • By their nature they are suitable for only very simple things. Often you could start with a simple fixed function shader, only to find that you need to add more functionality that can’t be expressed in fixed function vertex lighting / texture combiners.

How are fixed function shaders implemented in Unity? And why?

The majority of platforms we support do not have the concept of a “fixed function rendering pipeline” anymore, so these shaders are internally converted into “actual shaders”, and those are used for rendering. The only exceptions where fixed function still exists are legacy desktop OpenGL (GL 1.x-2.x) and Direct3D 9.

Truth be told, even on Direct3D 9 we’ve been creating actual shaders to emulate fixed function since Unity 2.6; see this old article. So D3D9 was the first platform where we implemented this “lazily create actual shaders for each fixed function shader variant” thing.

Then more platforms came along; for OpenGL ES 2.0 we implemented a very similar thing to the D3D9 one, just instead of concatenating bits of D3D9 shader assembly we’d concatenate GLSL snippets. And then even more platforms came (D3D11, Flash, Metal); each of them implemented this “fixed function generation” code. The code is not terribly complicated; the problem was pretty well understood and we had enough graphics tests to verify it works.
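
To give a rough feel for what that generation code looks like, here is a minimal sketch of the idea (hypothetical names and structure, nothing like the actual Unity code; the real thing handled lighting, combiner modes, fog and much more):

#include <cstdio>
#include <string>

// Hypothetical fixed function state; the real thing tracks lights,
// combiner modes, fog, alpha test etc.
struct FixedFunctionState
{
    int textureStages; // number of active texture combiner stages
};

// Build a GLSL fragment shader by concatenating snippets matching
// the currently enabled fixed function state.
std::string GenerateFragmentShaderGLSL(const FixedFunctionState& state)
{
    std::string src = "varying lowp vec4 vColor;\n";
    char buf[128];
    for (int i = 0; i < state.textureStages; ++i)
    {
        snprintf(buf, sizeof(buf), "uniform sampler2D uTex%d;\nvarying mediump vec2 vUV%d;\n", i, i);
        src += buf;
    }
    src += "void main()\n{\n  lowp vec4 col = vColor;\n";
    for (int i = 0; i < state.textureStages; ++i)
    {
        // each stage emulates one texture combiner, e.g. "previous * texture"
        snprintf(buf, sizeof(buf), "  col *= texture2D(uTex%d, vUV%d);\n", i, i);
        src += buf;
    }
    src += "  gl_FragColor = col;\n}\n";
    return src;
}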

Each step along the way, somehow no one really questioned why we keep doing all this. Why do all that at runtime, instead of converting fixed-function-style shaders into “actual shaders” offline, at shader import time? (well ok, plenty of people asked that question; the answer has been “yeah that would make sense; just requires someone to do it” for a while…)

A long time ago, generating “actual shaders” for fixed function style ones offline was not very practical, due to the sheer number of possible variants that need to be supported. The trickiest ones to support were texture coordinates (routing of UVs into texture stages; optional texture transformation matrices; optional projected texturing; and optional texture coordinate generation). But hey, we’ve removed quite some of that in Unity 5.0 anyway. Maybe now it’s easier? Turns out, it is.

Converting fixed function shaders into regular shaders, at import time

So I set out to do just that. Remove all the runtime code related to “fixed function shaders”; instead just turn them into “regular shaders” when importing the shader file in the Unity editor. Created an outline of the idea & planned work on our wiki, and started coding. I thought the end result would be “I’ll add 1000 lines and remove 4000 lines of existing code”. I was wrong!

Once I got the basics of the shader import side working (turned out, about 1000 lines of code indeed), I started removing all the fixed function bits. That was a day of pure joy:

Almost twelve thousand lines of code, gone. This is amazing!

I never realized all the fixed function code was that large. You write it for one platform, and then it basically works; then some new platform comes and the code is written for that, and then it basically works. By the time you get N platforms, all that code is massive, but it never came in one sudden lump so no one realized it.

Takeaway: once in a while, look at a whole subsystem. You might be surprised at how much it grew over the years. Maybe some of the conditions for why it is like that do not apply anymore?

Sidenote: per-vertex lighting in a vertex shader

If there was one thing that was easy with the fixed function pipeline, it was that many features were easily composable. You could enable any number of lights (well, up to 8); each of them could be a directional, point or spot light; toggling specular on or off was just a flag away; same with fog etc.

It feels like “easy composition of features” is a big thing we lost when we all moved to shaders. Shaders as we know them (i.e. vertex/fragment/… stages) aren’t composable at all! Want to add some optional feature – that pretty much means either “double the amount of shaders”, or branching in shaders, or generating shaders at runtime. Each of these has their own tradeoffs.

For example, how do you write a vertex shader that can do up to 8 arbitrary lights? There are many ways of doing it; what I have done right now is:

Separate vertex shader variants for “any spot lights present?”, “any point lights present?” and “directional lights only” cases. My guess is that spot lights are very rarely used with per-vertex fixed function lighting; they just look really bad. So in many cases, the cost of “compute spot lights” won’t be paid.

The number of lights is passed into the shader as an integer, and the shader loops over them. Complication: OpenGL ES 2.0 / WebGL, where loops can only have a constant number of iterations :( In practice many OpenGL ES 2.0 implementations do not enforce that limitation; however WebGL implementations do. At this very moment I don’t have a good answer; on ES2/WebGL I just always loop over all 8 possible lights (the unused lights have black colors set). For a real solution, instead of a regular loop like this:

uniform int lightCount;
// ...
for (int i = 0; i < lightCount; ++i)
{
  // compute light #i
}

I’d like to emit a shader like this when compiling for ES2.0/WebGL:

uniform int lightCount;
// ...
for (int i = 0; i < 8; ++i)
{
  if (i == lightCount)
      break;
  // compute light #i
}

Which would be valid according to the spec; it’s just annoying to deal with seemingly arbitrary limitations like this (I heard that WebGL 2 does not have this limitation, so that’s good).

What do we have now

So the current situation is that, by removing a lot of code, I achieved the following upsides:

  • “Fixed function style” shaders work on all platforms now (consoles! dx12!).
  • They work more consistently across platforms (e.g. specular highlights and attenuation were subtly different between PC & mobile before).
  • MaterialPropertyBlocks work with them, which means sprites etc. all work.
  • Fixed function shaders aren’t rasterized at a weird half-pixel offset on Windows Phone anymore.
  • It’s easier to go from a fixed function shader to an actual shader now; I’ve added a button in the shader inspector that just shows all the generated code; you can paste that back and start extending it.
  • Code is smaller; that translates to executable size too. For example Windows 64 bit player got smaller by 300 kilobytes.
  • Rendering is slightly faster (even when fixed function shaders aren’t used)!

That last point was not the primary objective, but is a nice bonus. No particular big place was affected, but quite a few branches and data members were removed from platform graphics abstraction (that only were there to support fixed function runtime). Depending on the projects I’ve tried, I’ve seen up to 5% CPU time saved on the rendering thread (e.g. 10.2ms -> 9.6ms), which is pretty good.

Are there any downsides? Well, yes, a couple:

  • You can not create fixed function shaders at runtime anymore. Before, you could do something like var mat = new Material("<fixed function shader string>") and it would all work. Well, except for consoles, where these shaders were never working. For this reason I’ve made the Material(string) constructor obsolete with a warning for Unity 5.1; it will actually stop working later on.
  • It’s a web player backwards compatibility breaking change, i.e. if/once this code lands in production (for example, Unity 5.2) that would mean the 5.2 runtime can not play back Unity 5.0/5.1 data files anymore. Not a big deal, we just have to decide whether with (for example) 5.2 we’ll switch to a different web player release channel.
  • Several corner cases might not work. For example, a fixed function shader that uses a globally-set texture that is not a 2D texture. Nothing about that texture is specified in the shader source itself; so while generating the actual shader at import time I don’t know whether it’s a 2D or a Cubemap texture. So for global textures I just assume they are going to be 2D ones.
  • Probably that’s it!

Removing all that fixed function runtime support also revealed more potential optimizations. Internally we were passing things like “texture type” (2D, Cubemap etc.) for all texture changes – but it seems that it was only the fixed function pipeline that was using it. Likewise, we are passing a vertex-declaration-like structure for each and every draw call; but now I think that’s not really needed anymore. Gotta look further into it.

Until next time!

Optimizing Unity Renderer Part 2: Cleanups

With the story introduction in the last post, let’s get to actual work now!

As already alluded to in the previous post, first I try to remember / figure out what the existing code does, do some profiling and write down things that stand out.

Profiling on several projects mostly reveals two things:

1) Rendering code could really use wider multithreading than “main thread and render thread” that we have now. Here’s one capture from Unity 5 timeline profiler:

In this particular case, the CPU bottleneck is the rendering thread, where the majority of the time is just spent in glDrawElements (this was on a MacBookPro; GPU-simplified scene from Butterfly Effect demo doing about 6000 draw calls). The main thread just ends up waiting for the render thread to catch up. Depending on the hardware, platform, graphics API etc. the bottleneck can be elsewhere; for example the same project on a much faster PC under DX11 spends about the same time on the main vs. rendering thread.

The culling sliver on the left side looks pretty good, eventually we want all our rendering code to look like that too. Here’s the culling part zoomed in:

2) There are no “optimize this one function, make everything twice as fast” places :( It’s going to be a long journey of rearranging data, removing redundant decisions, removing small things here and there until we can reach something like “2x faster per thread”. If ever.

The rendering thread profiling data is not terribly interesting here. The majority of the time (everything highlighted below) is from the OpenGL runtime/driver. Adding a note that perhaps we do something stupid that is causing the driver to do too much work (I dunno, switching between different vertex layouts for no good reason etc.), but otherwise not much to see on our side. Most of the remaining time is spent in dynamic batching.

Looking into the functions heavy on the main thread, we get this:

Now there are certainly questions raised (why so many hashtable lookups? why sorting takes so long? etc., see list above), but the point is, there’s no single place where optimizing something would give magic performance gains and a pony.

Observation 1: material “display lists” are being re-created often

In our code, a Material can pre-record what we call a “display list” for the rendering thread. Think of it as a small command buffer, where a bunch of commands (“set this raster state object, set this shader, set these textures”) are stored. Most important thing about them: they are stored with all the parameters (final texture values, shader uniform values etc.) resolved. When “applying” a display list, we just hand it off to the rendering thread, no need to chase down material property values or other things.

That’s all well and good, except when something changes in the material that makes the recorded display list invalid. In Unity, each shader is often many shader variants internally, and when switching to a different shader variant, we need to apply a different display list. If the scene was set up in such a way that caused the same Materials to alternate between different lists, then we have a problem.

Which was exactly what was happening in several of the benchmark projects I had; the short story being “multiple per-pixel lights in forward rendering cause this”. Turns out we had code to fix this on some branch, it just needed to be finished up – so I found it, made it compile in the current codebase and it pretty much worked. Now the materials can pre-record more than one “display list” in them, and that problem goes away.
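
In sketch form, the idea is to key the cached display lists on the shader variant; something like this (hypothetical names, heavily simplified compared to the real code):

#include <map>

struct DisplayList {}; // pre-recorded commands, all parameter values resolved
typedef int ShaderVariantKey;

class MaterialDisplayLists
{
public:
    DisplayList* Get(ShaderVariantKey variant)
    {
        // Before the fix (conceptually): a single cached list, thrown
        // away whenever a different variant was needed - so a material
        // alternating between variants kept re-recording it.
        std::map<ShaderVariantKey, DisplayList*>::iterator it = m_Lists.find(variant);
        if (it != m_Lists.end())
            return it->second; // reuse previously recorded commands
        DisplayList* dl = Record(variant); // the expensive part
        m_Lists[variant] = dl;
        return dl;
    }
private:
    DisplayList* Record(ShaderVariantKey) { return new DisplayList(); } // placeholder for the recording
    std::map<ShaderVariantKey, DisplayList*> m_Lists;
};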

On a PC (Core i7 5820K), one scene went from 9.52ms to 7.25ms on the main thread which is fairly awesome.

Spoilers ahead: this change is the one that brought the largest benefit on the affected scenes, from everything I did over almost two weeks. And it wasn’t even the code that “I wrote”; I just fished it out from a somewhat neglected branch. So, yay! An easy change for a 30% performance gain! And bummer, I will not get this kind of change from anything written below.

Observation 2: too many hashtable lookups

Next from the observations list above, looked into “why so many hashtable lookups” question (if there isn’t a song about it yet, there should be!).

In the rendering code, many years ago I added something like Material::SetPassWithShader(Shader* shader, ...) since the calling code already knows which shader the material state should be set up with. The Material also knows its shader, but it stores something we call a PPtr (“persistent pointer”) which is essentially a handle. Passing the pointer directly avoids doing a handle->pointer lookup (which currently is a hashtable lookup, since for various complicated reasons it’s hard to do an array-based handle system, I’m told).

Turns out, over many changes, somehow Material::SetPassWithShader ended up doing two handle->pointer lookups, even if it already got the actual shader pointer as a parameter! Fixed:
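
In spirit, the fix was along these lines (hypothetical names, heavily simplified):

struct Shader;

// A PPtr is a handle; resolving it into an actual pointer costs a
// hashtable lookup (hypothetical sketch of the interface).
template<typename T>
struct PPtr
{
    T* GetPtr() const; // handle -> pointer: the hashtable lookup
};

struct MaterialSketch
{
    PPtr<Shader> m_Shader;

    void SetPassWithShader(Shader* shader /*, ...*/)
    {
        // Before: the code ended up resolving m_Shader.GetPtr() twice,
        // even though the caller already passed the resolved pointer.
        // After: just use the parameter; zero handle lookups.
        Shader* s = shader;
        (void)s; // ...set up render state using s
    }
};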

Ok so this one turned out to be good, measurable and very easy performance optimization. Also made the codebase smaller, which is a very good thing.

Small tweaks here and there

From the render thread profiling on a Mac above, our own code in BindDefaultVertexArray was taking 2.3% which sounded excessive. Turns out, it was looping over all possible vertex component types and checking some stuff; made the code loop only over the vertex components used by the shader. Got slightly faster.

One project was calling GetTextureDecodeValues a lot, which computes some color space, HDR and lightmap decompression constants for a texture. It was doing a bunch of complicated sRGB math on an optional “intensity multiplier” parameter which was set to exactly 1.0 in all calling places except one. Recognizing that in the code made a bunch of pow() calls go away. Added to a “look later” list: why do we call that function so often in the first place?
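
The general shape of such a fix, as a sketch (not the actual function):

#include <cmath>

// Sketch: skip the expensive sRGB math when the intensity multiplier
// is exactly 1.0, which it is at nearly every call site.
float ApplyIntensity(float value, float intensity)
{
    if (intensity == 1.0f)
        return value; // common case: no pow() calls at all
    // rare case: apply the multiplier in (approximate) sRGB space
    const float kGamma = 2.2f;
    return std::pow(std::pow(value, 1.0f / kGamma) * intensity, kGamma);
}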

Some code in rendering loops that was figuring out where draw call batch boundaries need to be put (i.e. where to switch to a new shader etc.), was comparing some state represented as separate bools. Packed a bunch of them into bitfields and did comparisons on one integer. No observable performance gains, but the code actually ended up being smaller, so a win :)
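
Conceptually something like this (illustrative flag names):

#include <cstdint>

// Sketch: state that used to be separate bools, packed into one
// integer so a batch boundary check is a single comparison.
enum BatchStateFlags
{
    kStateLightmapped = 1 << 0,
    kStateShadowsOn   = 1 << 1,
    kStateTransparent = 1 << 2
    // ...more flags in the same word
};

inline bool NeedNewBatch(uint32_t prevState, uint32_t curState)
{
    return prevState != curState; // was: comparing several bools one by one
}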

Noticed that figuring out which vertex buffers and vertex layouts are used by objects queries Mesh data that’s quite far apart in memory. Reordered data based on usage type (rendering data, collision data, animation data etc.).

Also reduced data packing holes using @msinilo’s excellent CruncherSharp (and did some tweaks to it along the way :)) I hear there’s a similar tool for Linux (pahole). On a Mac there’s struct_layout but it was taking forever on Unity’s executable and the Python script often would fail with some recursion overflow exception.
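
For illustration, here’s the kind of padding hole these tools find (a made-up struct, not actual Unity data):

#include <cstdint>

struct BadLayout        // sizeof == 24 on typical 64-bit ABIs
{
    uint8_t flags;      // 1 byte, then 7 bytes padding (pointer alignment)
    void*   data;       // 8 bytes
    uint8_t moreFlags;  // 1 byte, then 7 bytes tail padding
};

struct GoodLayout       // sizeof == 16: same members, reordered
{
    void*   data;       // 8 bytes
    uint8_t flags;      // 1 byte
    uint8_t moreFlags;  // 1 byte, then 6 bytes tail padding
};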

While browsing through the code, noticed that the way we track per-texture mipmap bias is very convoluted, to put it mildly. It is set per-texture; then the texture tracks all the material property sheets where it’s being used; notifies them upon any mip bias change; and the bias is fetched from property sheets and applied together with the texture each and every time a texture is set on a graphics device. Geez. Fixed. Since this changed the interface of our graphics API abstraction, it meant changing all 11 rendering backends; just a few fairly trivial changes in each, but it can feel intimidating (I can’t even build half of them locally). No fear, we have the build farm to check for compile errors, and the test suites to check for regressions!

No significant performance difference, but it feels good to get rid of all that complexity. Adding to a “look later” list: there’s one more piece of similar data that we track per-texture; something about UV scaling for non-power-of-two textures. I suspect it’s there for no good reason these days, gotta look and remove it if possible.

And some other similar localized tweaks; each of them is easy and makes some particular place a tiny bit better, but does not result in any observable performance improvement. Maybe doing a hundred of them would result in some noticeable effect, but it’s much more likely that we’ll need some more serious re-working of things to get good results.

Data layout of Material property sheets

One thing that has been bothering me is how we store material properties. Each time I’d show the codebase to a new engineer, in that place I’d roll my eyes and say “and yeah, here we store the textures, matrices, colors etc. of the material, in separate STL maps. The horror. The horror.”.

There’s this popular thought that C++ STL containers have no place in high performance code, and no good game has ever shipped with them (not true), and if you use them you must be stupid and deserve to be ridiculed (I don’t know… maybe?). So hey, how about I go and replace these maps with a better data layout? Must make everything a million times better, right?

In Unity, parameters to shaders can come from two places: either from per-material data, or from “global” shader parameters. The former is typically for things like “diffuse texture”, while the latter is for things like “fog color” or “camera projection” (there’s a slight complication with per-instance parameters in the form of MaterialPropertyBlock etc., but let’s ignore that for now).

The data layout we had before was roughly this (PropertyName is basically an integer):

map<PropertyName, float> m_Floats;
map<PropertyName, Vector4f> m_Vectors;
map<PropertyName, Matrix4x4f> m_Matrices;
map<PropertyName, TextureProperty> m_Textures;
map<PropertyName, ComputeBufferID> m_ComputeBuffers;
set<PropertyName> m_IsGammaSpaceTag; // which properties come as sRGB values

What I replaced it with (simplified, only showing data members; dynamic_array is very much like std::vector, but more EASTL style):

struct NameAndType { PropertyName name; PropertyType type; };

// Data layout:
// - Array of name+type information for lookups (m_Names). Do
//   not put anything else; only have info needed for lookups!
// - Location of property data in the value buffer (m_Offsets).
//   Uses 4 byte entries for smaller data; don't use size_t!
// - Byte buffer with actual property values (m_ValueBuffer).
// - Additional data per-property in m_GammaProps and
//   m_TextureAuxProps bit sets.
//
// All the arrays need to be kept in sync (sizes the same; all
// indexed by the same property index).
dynamic_array<NameAndType> m_Names;
dynamic_array<int> m_Offsets;
dynamic_array<UInt8> m_ValueBuffer;

// A bit set for each property that should do gamma->linear
// conversion when in linear color space
dynamic_bitset m_GammaProps;
// A bit set for each property that is aux for a texture
// (e.g. *_ST for texture scale/tiling)
dynamic_bitset m_TextureAuxProps;

When a new property is added to the property sheet, it is just appended to all the arrays. Property name/type information and property location in the data buffer are kept separate so that when searching for properties, we don’t even fetch the data that’s not needed for the search itself.
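
A self-contained sketch of what a lookup on this layout looks like (std::vector standing in for dynamic_array; SheetSketch and kPropertyFloat are illustrative stand-ins):

#include <cstddef>
#include <vector>

typedef int PropertyName;
enum PropertyType { kPropertyFloat, kPropertyVector };
struct NameAndType { PropertyName name; PropertyType type; };

struct SheetSketch
{
    std::vector<NameAndType>   m_Names;
    std::vector<int>           m_Offsets;
    std::vector<unsigned char> m_ValueBuffer;
};

const float* FindFloat(const SheetSketch& s, PropertyName name)
{
    // Linear scan over a small, densely packed array; for the typical
    // 5-30 properties this beats chasing map nodes around the heap.
    for (size_t i = 0; i < s.m_Names.size(); ++i)
        if (s.m_Names[i].name == name && s.m_Names[i].type == kPropertyFloat)
            return reinterpret_cast<const float*>(&s.m_ValueBuffer[s.m_Offsets[i]]);
    return 0; // not found
}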

Biggest external change is that before, one could find a property value and store a direct pointer to it (this was used in the pre-recorded material display lists, to be able to “patch in” values of global shader properties before replaying them). Now the pointers are invalidated whenever the arrays are resized; so all the code that was possibly storing pointers had to be changed to store offsets into the property sheet data instead. So in the end this meant quite a few code changes.

Finding properties has changed from being an O(logN) operation (map lookup) into an O(N) operation (linear scan through the names array). This sounds bad if you’re learning computer science as it is typically taught. However, I looked at various projects and found that typically the property sheets contain between 5 and 30 properties in total (most often around 10); and a linear scan with all the search data right next to each other in memory is not that bad compared to an STL map lookup where the map nodes can be placed arbitrarily far away from one another (if that happens, each node visit can be a CPU cache miss). From profiling on several different projects, the part that does “search for properties” was consistently slightly faster on a PC, a laptop and an iPhone.

Did this change bring magic performance improvements though? Nope. It brought a small improvement in average frame time and slightly smaller memory consumption, especially when there are lots of different materials. But does “just replace STL maps with packed arrays” result in magic? Alas, no. Well, at least I don’t have to roll my eyes anymore when showing this code to people, so there’s that.

Upon my code review, one comment that popped up is that I should try splitting up property data so that properties of the same type are grouped together. A property sheet could know which start and end indices are for a particular type, and then searching for a particular property would only need to scan the names array of that type (and the array would only contain an integer name per property, instead of name+type). Adding a new property into the sheet would become more expensive, but looking them up cheaper.

A side note from all this: modern CPUs are impressively fast at what you could call “bad code”, and have mightily large caches too. I wasn’t paying much attention to mobile CPU hardware, and just realized that iPhone 6 CPU has a 4 megabyte L3 cache. Four. Megabytes. On a phone. That’s about how much RAM my first PC had!

Results so far

So that was about 2 weeks of work (I’d estimate at 75% time - the rest spent on unrelated bugfixes, code reviews, etc.); with a state where all the platforms are building and tests are passing; and a pull request ready. 40 commits, 135 files, about 2000 lines of code changed.

Performance wise, one benchmark project improved a lot (the one most affected by “display lists being re-created” issue), with total frame time 11.8ms -> 8.50ms on PC; and 29.2ms -> 26.9ms on a laptop. Other projects improved, but nowhere near as much (numbers like 7.8ms -> 7.3ms on PC; another project 15.2ms -> 14.1ms on iPhone etc.).

Most of the performance improvements came from two places really (display list re-creation; and avoiding useless hashtable lookups). Not sure how to feel about the rest of the changes - it feels like they are good changes overall, if only because I now have a better understanding of the codebase, and have added quite a few comments explaining what & why. I also now have an even longer list of “here are the places that are weird or should be improved”.

Is spending almost two weeks worth the results I got so far? Hard to say. Sometimes I do have a week where it feels like I did nothing at all, so it’s better than that :)

Overall I’m still not sure if “optimizing stuff” is my strong area. I think I’m pretty good at only a few things: 1) debugging hard problems – I can come up with plausible hypotheses and ways to divide-and-conquer the problem fast; 2) understanding implications of some change or a system – what other systems will be affected and what could/would be problematic interactions; and 3) having good ambient awareness of things done by others in the codebase – I can often figure out when several people are working on somewhat overlapping stuff and tell them “yo, you two should coordinate”.

Is any of that a useful skill for optimization? I don’t know. I certainly can’t juggle instruction latencies and execution ports and TLB misses in my head. But maybe I’ll get better at it if I practice? Who knows.

Not quite sure which path to go next at the moment; I see at least several possible ways:

  1. Continue doing incremental improvements, and hope that net effect of a lot of them will be good. Individually each of them is a bit disappointing since the improvement is hard to measure.
  2. Start looking at the bigger picture and figure out how we can avoid a lot of currently done work completely, i.e. more serious “reshaping” of how things are structured.
  3. Once some more cleanup is done, switch to helping others with “multithread more stuff” approaches.
  4. Optimization is hard! Let’s play more Rocksmith until situation improves.

I guess I’ll discuss with others and do one or more of the above. Until next time!

Update: Part 3: Fixed Function Removal is up.

Optimizing Unity Renderer Part 1: Intro

At work we formed a small “strike team” for optimizing CPU side of Unity’s rendering. I’ll blog about my part as I go (idea of doing that seems to be generally accepted). I don’t know where that will lead to, but hey that’s part of the fun!

Backstory / Parental Warning

I’m going to be harsh and say “this code sucks!” in a lot of cases. When trying to improve the code, you obviously want to improve what is bad, and so that is often the focus. Does not mean the codebase in general is that bad, or that it can’t be used for good things! Just this March, we’ve got Pillars of Eternity, Ori and the Blind Forest and Cities: Skylines among top rated PC games; all made with Unity. Not too bad for “that mobile engine that’s only good for prototyping”, eh.

The truth is, any codebase that grows over long period of time and is worked on by more than a handful of people is very likely “bad” in some sense. There are areas where code is weird; there are places that no one quite remembers how or why they work; there are decisions done many years ago that don’t quite make sense anymore, and no one had time to fix yet. In a big enough codebase, no single person can know all the details about how it’s supposed to work, so some decisions clash with some others in subtle ways. Paraphrasing someone, “there are codebases that suck, and there are codebases that aren’t used” :)

It is important to try to improve the codebase though! Always keep on improving it. We’ve done lots of improvements in all the areas, but frankly, the rendering code improvements over last several years have been very incremental, without anyone taking a hard look at the whole of it and focusing on just improving it as a fulltime effort. It’s about time we did that!

A number of times I’d point at some code and say “ha ha! well that is stupid!”. And it was me who wrote it in the first place. That’s okay. Maybe now I know better; or maybe the code made sense at the time; or maybe the code made sense considering the various factors (lack of time etc.). Or maybe I just was stupid back then. Maybe in five years I’ll look at my current code and say just as much.

Anyway…

Wishlist

In pictures, here’s what we want to end up with. A high throughput rendering system, working without bottlenecks.

Animated GIFs aside, here’s the “What is this?” section pretty much straight from our work wiki:

Current (Unity 5.0) rendering, shader runtime & graphics API CPU code is NotTerriblyEfficient(tm). It has several issues, and we want to address as much of that as possible:

  • GfxDevice (our rendering API abstraction):
    • Abstraction is mostly designed around DX9 / partially-DX10 concepts. E.g. constant/uniform buffers are a bad fit now.
    • A mess that grew organically over many years; there are places that need a cleanup.
    • Allow parallel command buffer creation on modern APIs (like consoles, DX12, Metal or Vulkan).
  • Rendering loops:
    • Lots of small inefficiencies and redundant decisions; want to optimize.
    • Run parts of loops in parallel and/or jobify parts of them. Using native command buffer creation API where possible.
    • Make code simpler & more uniform; factor out common functionality. Make more testable.
  • Shader / Material runtime:
    • Data layout in memory is, uhm, “not very good”.
    • Convoluted code; want to clean up. Make more testable.
    • “Fixed function shaders” concept should not exist at runtime. See [Generating Fixed Function Shaders at Import Time].
    • Text based shader format is stupid. See [Binary Shader Serialization].

Constraints

Whatever we do to optimize / cleanup the code, we should keep the functionality working as much as possible. Some rarely used functionality or corner cases might be changed or broken, but only as a last resort.

Also one needs to consider that if some code looks complicated, it can be due to several reasons. One of them being “someone wrote too complicated code” (great! simplify it). Another might be “there was some reason for the code to be complex back then, but not anymore” (great! simplify it).

But it also might be that the code is doing complex things, e.g. it needs to handle some tricky cases. Maybe it can be simplified, but maybe it can’t. It’s very tempting to start “rewriting something from scratch”, but in some cases your new and nice code might grow complicated as soon as you start making it do all the things the old code was doing.

The Plan

Given a piece of CPU code, there are several axes along which to improve its performance: 1) “just making it faster” and 2) making it more parallel.

I thought I’d focus on “just make it faster” initially. Partially because I also want to simplify the code and remember the various tricky things it has to do. Simplifying the data, making the data flows more clear and making the code simpler often also allows doing step two (“make it more parallel”) easier.

I’ll be looking at higher level rendering logic (“rendering loops”) and shader/material runtime first, while others on the team will be looking at simplifying and sanitizing rendering API abstraction, and experimenting with “make it more parallel” approaches.

For testing rendering performance, we’re gonna need some actual content to test on. I’ve looked at several game and demo projects we have, and made them CPU limited (by reducing GPU load - rendering at lower resolution; reducing poly count where it was very large; reducing shadowmap resolutions; reducing or removing postprocessing steps; reducing texture resolutions). To put higher load on the CPU, I duplicated parts of the scenes so they have way more objects rendered than originally.

It’s very easy to test rendering performance on something like “hey I have these 100000 cubes”, but that’s not a very realistic use case. “Tons of objects using exact same material and nothing else” is a very different rendering scenario from thousands of materials with different parameters, hundreds of different shaders, dozens of render target changes, shadowmap & regular rendering, alpha blended objects, dynamically generated geometry etc.

On the other hand, testing on a “full game” can be cumbersome too, especially if it requires interaction to get anywhere, is slow to load the full level, or is not CPU limited to begin with.

When testing for CPU performance, it helps to test on more than one device. I typically test on a development PC in Windows (currently Core i7 5820K), on a laptop in Mac (2013 rMBP) and on whatever iOS device I have around (right now, iPhone 6). Testing on consoles would be excellent for this task; I keep on hearing they have awesome profiling tools, more or less fixed clocks and relatively weak CPUs – but I’ve no devkits around. Maybe that means I should get one.

The Notes

Next up, I ran the benchmark projects and looked at profiler data (both Unity profiler and 3rd party profilers like Sleepy/Instruments), and also looked at the code to see what it is doing. At this point whenever I see something strange I just write it down for later investigation:

Some of the weirdnesses above might have valid reasons, in which case I go and add code comments explaining them. Some might have had reasons once, but not anymore. In both cases source control log / annotate functionality is helpful, and asking people who wrote the code originally on why something is that way. Half of the list above is probably because I wrote it that way many years ago, which means I have to remember the reason(s), even if they are “it seemed like a good idea at the time”.

So that’s the introduction. Next time, taking some of the items from the above “WAT?” list and trying to do something about them!

Update: next blog post is up. Part 2: Cleanups.

Random Thoughts on New Explicit Graphics APIs

Last time I wrote about graphics APIs was almost a year ago. Since then, Apple Metal was unveiled and shipped in iOS 8, and Khronos Vulkan was announced (which is very much AMD Mantle, improved to make it cross-vendor). DX12 continues to be developed for Windows 10.

@promit_roy has a very good post on gamedev.net about why these new APIs are needed and what problems they solve. Go read it now, it’s good.

Just a couple more thoughts I’d add.

Metal experience

When I wrote the previous OpenGL rant, we already were working on Metal and had it “basically work in Unity”. I’ve only ever worked on PC/mobile graphics APIs before (D3D9, D3D11, OpenGL/ES, as well as D3D7-8 back in the day), so Metal was the first of these “more explicit APIs” I’ve experienced (I never actually did anything on consoles before, besides just seeing some code).

ZOMG what a breath of fresh air.

Metal is super simple and very, very clear. I was looking at the header files and was amazed at how small they are – “these few short files, and that’s it?! wow.” A world of difference compared to how much accumulated stuff is in OpenGL core & extension headers (to a lesser degree in OpenGL ES too).

Conceptually Metal is closer to D3D11 than OpenGL ES (separate shader stages; constant buffers; same coordinate conventions), so “porting Unity to it” was mostly taking D3D11 backend (which I did back in the day too, so familiarity with the code helped), changing the API calls and generally removing stuff.

Create a new buffer (vertex, index, constant - does not matter) – one line of code (MTLDevice.newBuffer*). Then just get a pointer to data and do whatever you want. No mapping, no locking, no staging buffers, no trying to nudge the API into doing the exact type of buffer update you want (on the other hand, data synchronization is your own responsibility now).

And even with the very early builds we had, everything more or less “just worked”, with a super useful debug & validation layer. Sure there were issues and there were occasional missing things in the early betas, but nothing major and the issues got fixed fast.

To me Metal showed that a new API that gets rid of the baggage and exposes platform strengths is a very pleasant thing to work with. Metal is essentially just several key ideas (many of which are shared by other “new APIs” like Vulkan or DX12):

  • Command buffer creation is separate from submission; and creation is mostly stateless (do that from any thread).
  • Whole rendering pipeline state is specified, thus avoiding “whoops need to revalidate & recompile shaders” issues.
  • Unified memory; just get a pointer to buffer data. Synchronization is your own concern.

Metal very much keeps the existing resource binding model from D3D11/OpenGL - you bind textures/buffers to shader stage “resource slots”.

I think of all public graphics APIs (old ones like OpenGL/D3D11 and new ones like Vulkan/DX12), Metal is probably the easiest to learn. Yes it’s very much platform specific, but again, OMG so easy.

Partially because it keeps the traditional binding model – while that means Metal can’t do fancy things like bindless resources, it also means the binding model is simple. I would not want to be a student learning graphics programming, and having to understand Vulkan/DX12 resource binding.

Explicit APIs and Vulkan

This bit from Promit’s post,

But there’s a very real message that if these APIs are too challenging to work with directly, well the guys who designed the API also happen to run very full featured engines requiring no financial commitments.

I love conspiracy theories as much as the next guy, but I don’t think that’s quite true. If I’d put my cynical hat on, then sure: making graphics APIs hard to use is an interest of middleware providers. You could also say that making sure there are lots of different APIs is an interest of middleware providers too! The harder it is to make things, the better for them, right.

In practice I think we’re all too much hippies to actually think that way. I can’t speak for Epic, Valve or Frostbite of course, but on Unity side it was mostly Christophe being involved in Vulkan, and if you think his motivation is commercial interest then you don’t know him :) I and others from graphics team were only very casually following the development – I would have loved to be more involved, but was 100% busy on Unity 5.0 development all the time.

So there.

That said, to some extent the explicit APIs (both Vulkan and DX12) are harder to use. I think it’s mostly due to the more flexible (but much more complicated) resource binding model. See Metal above - my opinion is that stateless command buffer creation and fully specified pipeline state actually make it easier to use the API. But the new way of resource binding, and to some extent the ability to reuse & replay command buffer chunks (which Metal does not have either), does complicate things.

However, I think this is worth it. The actual lowest level of API should be very efficient, even if somewhat “hard to use” (or require an expert to use it). Additional “ease of use” layers can be put on top of that! The problem with OpenGL was that it was trying to do both at once, with a very sad side effect that everyone and their dog had to implement all the layers (in often subtly incompatible ways).

I think there’s plenty of space for “somewhat abstracted” libraries or APIs to be layered on top of Vulkan/DX12. Think XNA back when it was a thing; or three.js in the WebGL world. Or bgfx in the C++ world. These are all good and awesome.

Let’s see what happens!

Optimizing Shader Info Loading, or Look at Yer Data!

A story about a million shader variants, optimizing using Instruments and looking at the data to optimize some more.

The Bug Report

The bug report I was looking into was along the lines of “when we put these shaders into our project, then building a game becomes much slower – even if shaders aren’t being used”.

Indeed it was. Quick look revealed that for ComplicatedReasons(tm) we load information about all shaders during the game build – that explains why the slowdown was happening even if shaders were not actually used.

This issue must be fixed! There’s probably no really good reason we must know about all the shaders for a game build. But to fix it, I’ll need to pair up with someone who knows anything about game data build pipeline, our data serialization and so on. So that will be someday in the future.

Meanwhile… another problem was that loading the “information for a shader” was slow in this project. Did I say slow? It was very slow.

That’s a good thing to look at. Shader data is not only loaded while building the game; it’s also loaded when the shader is needed for the first time (e.g. clicking on it in Unity’s project view); or when we actually have a material that uses it etc. All these operations were quite slow in this project.

Turns out this particular shader had massive internal variant count. In Unity, what looks like “a single shader” to the user often has many variants inside (to handle different lights, lightmaps, shadows, HDR and whatnot - typical ubershader setup). Usually shaders have from a few dozen to a few thousand variants. This shader had 1.9 million. And there were about ten shaders like that in the project.

The Setup

Let’s create several shaders with different variant counts for testing: 27 thousand, 111 thousand, 333 thousand and 1 million variants. I’ll call them 27k, 111k, 333k and 1M respectively. For reference, the new “Standard” shader in Unity 5.0 has about 33 thousand internal variants. I’ll do tests on MacBook Pro (2.3 GHz Core i7) using 64 bit Release build.

Things I’ll be measuring:

  • Import time. How much time it takes to reimport the shader in Unity editor. Since Unity 4.5 this doesn’t do much of actual shader compilation; it just extracts information about shader snippets that need compiling, and the variants that are there, etc.
  • Load time. How much time it takes to load shader asset in the Unity editor.
  • Imported data size. How large is the imported shader data (serialized representation of actual shader asset; i.e. files that live in Library/metadata folder of a Unity project).

So the data is:

Shader   Import    Load    Size
   27k    420ms   120ms    6.4MB
  111k   2013ms   492ms   27.9MB
  333k   7779ms  1719ms   89.2MB
    1M  16192ms  4231ms  272.4MB

Enter Instruments

Last time we used xperf to do some profiling. We’re on a Mac this time, so let’s use Apple Instruments. Just like xperf, Instruments can show a lot of interesting data. We’re looking at the most simple one, “Time Profiler” (though profiling Zombies is very tempting!). You pick that instrument, attach to the executable, start recording, and get some results out.

You then select the time range you’re interested in, and expand the stack trace. Protip: Alt-Click (ok ok, Option-Click you Mac peoples) expands full tree.

So far the whole stack is just going deep into Cocoa stuff. “Hide System Libraries” is very helpful with that:

Another very useful feature is inverting the call tree, where the results are presented from the heaviest “self time” functions (we won’t be using that here though).

When hovering over an item, an arrow is shown on the right (see image above). Clicking on that does “focus on subtree”, i.e. ignores everything outside of that item, and time percentages are shown relative to the item. Here we’ve focused on ShaderCompilerPreprocess (which does the majority of the shader “importing” work).

Looks like we’re spending a lot of time appending to strings. That usually means strings did not have enough storage buffer reserved and are causing a lot of memory allocations. Code change:
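
The change was essentially just reserving string capacity up front; in sketch form (not the actual code):

#include <string>
#include <vector>

// Sketch of the fix: reserve the destination buffer once, so repeated
// appends don't keep reallocating and copying the string.
std::string JoinLines(const std::vector<std::string>& lines)
{
    size_t total = 0;
    for (size_t i = 0; i < lines.size(); ++i)
        total += lines[i].size() + 1;

    std::string result;
    result.reserve(total); // one allocation instead of many
    for (size_t i = 0; i < lines.size(); ++i)
    {
        result += lines[i];
        result += '\n';
    }
    return result;
}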

This small change has cut down shader importing time by 20-40%! Very nice!

I did a couple of other small tweaks from looking at this profiling data - none of them resulted in any significant benefit though.

Profiling shader load time also says that most of the time ends up being spent on loading editor related data that is arrays of arrays of strings and so on:

I could have picked functions from the profiler results, gone through each of them and optimized, and perhaps would have achieved a solid 2-3x improvement over the initial results. Very often that’s enough to be proud of!

However…

Taking a step back

Or like Mike Acton would say, “look at your data!” (check his CppCon2014 slides or video). Another saying is also applicable: “think!”

Why do we have this problem to begin with?

For example, in 333k variant shader case, we end up sending 610560 lines of shader variant information between shader compiler process & editor, with macro strings in each of them. In total we’re sending 91 megabytes of data over RPC pipe during shader import.

One possible area for improvement: the data we send over and store in imported shader data is a small set of macro strings repeated over and over and over again. Instead of sending or storing the strings, we could just send the set of strings used by a shader once, assign numbers to them, and then send & store the full set as lists of numbers (or fixed size bitmasks). This should cut down on the amount of string operations we do (massively cut down on number of small allocations), size of data we send, and size of data we store.
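
In sketch form that’s classic string interning (hypothetical names):

#include <map>
#include <string>
#include <vector>

// Sketch: each distinct macro string is sent/stored once; variants then
// refer to strings by small integer indices (or a fixed-size bitmask).
struct StringTable
{
    std::vector<std::string>   strings;
    std::map<std::string, int> indices;

    int Intern(const std::string& s)
    {
        std::map<std::string, int>::const_iterator it = indices.find(s);
        if (it != indices.end())
            return it->second; // already known, just reuse the index
        int idx = (int)strings.size();
        strings.push_back(s);
        indices[s] = idx;
        return idx;
    }
};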

Another possible approach: right now we have source data in shader that indicate which variants to generate. This data is very small: just a list of on/off features, and some built-in variant lists (“all variants to handle lighting in forward rendering”). We do the full combinatorial explosion of that in the shader compiler process, send the full set over to the editor, and the editor stores that in imported shader data.

But the way we do the “explosion of source data into full set” is always the same. We could just send the source data from shader compiler to the editor (a very small amount!), and furthermore, just store that in imported shader data. We can rebuild the full set when needed at any time.
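
The expansion itself is just combinatorial enumeration; a sketch of the idea, assuming each variant is a bitmask of enabled on/off features (ignoring the built-in variant lists):

#include <vector>

// Sketch: rebuild the full variant set from the tiny source data.
// With N independent on/off features, the full set is 2^N bitmasks,
// where bit i set means "feature i enabled".
std::vector<unsigned> ExpandVariants(int featureCount) // featureCount < 32
{
    std::vector<unsigned> variants;
    variants.reserve(1u << featureCount);
    for (unsigned mask = 0; mask < (1u << featureCount); ++mask)
        variants.push_back(mask);
    return variants;
}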

Changing the data

So let’s try to do that. First let’s deal with RPC only, without changing serialized shader data. A few commits later…

This made shader importing over twice as fast!

Shader   Import
   27k    419ms ->  200ms
  111k   1702ms ->  791ms
  333k   5362ms -> 2530ms
    1M  16784ms -> 8280ms

Let’s do the other part too, where we change the serialized shader variant data representation. Instead of storing the full set of possible variants, we only store the data needed to generate the full set:

Shader   Import              Load                 Size
   27k    200ms ->   285ms    103ms ->    396ms     6.4MB -> 55kB
  111k    791ms ->  1229ms    426ms ->   1832ms    27.9MB -> 55kB
  333k   2530ms ->  3893ms   1410ms ->   5892ms    89.2MB -> 56kB
    1M   8280ms -> 12416ms   4498ms ->  18949ms   272.4MB -> 57kB

Everything seems to work, and the serialized file size got massively decreased. But, both importing and loading got slower?! Clearly I did something stupid. Profile!

Right. So after importing or loading the shader (from now on a small file on disk), we generate the full set of shader variant data. Which right now results in a lot of string allocations, since it is generating arrays of arrays of strings or some such.

But we don’t really need the strings at this point; for example after loading the shader we only need the internal representation of “shader variant key” which is a fairly small bitmask. A couple of tweaks to fix that, and we’re at:

Shader  Import    Load
   27k    42ms     7ms
  111k    47ms    27ms
  333k    94ms    76ms
    1M   231ms   225ms

Look at that! Importing a 333k variant shader got 82 times faster; loading its metadata got 22 times faster, and the imported file size got over a thousand times smaller!

One final look at the profiler, just because:

Weird, time is spent in memory allocation but there shouldn’t be any at this point in that function; we aren’t creating any new strings there. Ahh, implicit std::string to UnityStr (our own string class with better memory reporting) conversion operators (long story…). Fix that, and we’ve got another 2x improvement:

Shader  Import    Load
   27k    42ms     5ms
  111k    44ms    18ms
  333k    53ms    46ms
    1M   130ms   128ms

The code could still be optimized further, but there ain’t no easy fixes left I think. And at this point I’ll have more important tasks to do…

What we’ve got

So in total, here’s what we have so far:

Shader   Import                Load                 Size
   27k    420ms-> 42ms (10x)    120ms->  5ms (24x)    6.4MB->55kB (119x)
  111k   2013ms-> 44ms (46x)    492ms-> 18ms (27x)   27.9MB->55kB (519x)
  333k   7779ms-> 53ms (147x)  1719ms-> 46ms (37x)   89.2MB->56kB (this is getting)
    1M  16192ms->130ms (125x)  4231ms->128ms (33x)  272.4MB->57kB (ridiculous!)

And a fairly small pull request to achieve all this (~400 lines of code changed, ~400 new added):

Overall I’ve probably spent something like 8 hours on this – hard to say exactly since I took some breaks and did other things. Also I was writing down notes & making screenshots for the blog too :) The fix/optimization is already in Unity 5.0 beta 20 by the way.

Conclusion

Apple’s Instruments is a nice profiling tool (and unlike xperf, the UI is not intimidating…).

However, Profiler Is Not A Replacement For Thinking! I could have just looked at the profiling results and tried to optimize “what’s at top of the profiler” one by one, and maybe achieved 2-3x better performance. But by thinking about the actual problem and why it happens, I got a way, way better result.

Happy thinking!