Solving DX9 Half-Pixel Offset

Summary: the Direct3D 9 “half pixel offset” problem that manages to annoy everyone can be solved in a single isolated place, robustly, and in a way where you don’t have to think about it ever again. Just add two instructions to all your vertex shaders, automatically.

…here I am, wondering if the target audience for a D3D9-related blog post in 2016 is more than 7 people in the world. Eh, whatever!

Background

Direct3D before version 10 had this peculiarity called “half pixel offset”, where viewport coordinates are shifted by half a pixel compared to everyone else (OpenGL, D3D10+, Metal etc.). This causes various problems, particularly with image post-processing or UI rendering, but elsewhere too.

The official documentation (“Directly Mapping Texels to Pixels”), while technically correct, is not exactly summarized into three easy bullet points.

The typical advice varies: “shift your quad vertex positions by half a pixel”, “shift texture coordinates by half a texel”, etc. Most of it talks almost exclusively about screen-space rendering for image processing or UI.

The problem with all that is that it requires you to remember to do things in various little places. Your postprocessing code needs to be aware. Your UI needs to be aware. Your baking code needs to be aware. Some of your shaders need to be aware. When 20 places in your code need to remember to deal with this, you know you have a problem.

3D has the half-pixel problem too!

While most of the material on the D3D9 half-pixel offset talks about screen-space operations, the problem exists in 3D too! 3D objects are rendered slightly shifted compared to what happens on OpenGL, D3D10+ or Metal.

Here’s a crop of a scene, rendered in D3D9 vs D3D11:

And a crop of a crop, scaled up even more, D3D9 vs D3D11:

Root Cause and Solution

The root cause is that the viewport is shifted by half a pixel from where we want it to be. Unfortunately we can’t fix it by shifting all the coordinates passed into SetViewport by half a pixel (the D3DVIEWPORT9 coordinate members are integers).

However, we have vertex shaders, and vertex shaders output a clip space position. We can adjust that clip space position to shift everything by half a viewport pixel. Half a pixel is 1/width (or 1/height) in normalized device coordinates, and since we apply the shift in clip space, before the perspective divide, it has to be scaled by clipPos.w. Essentially we need to do this:

// clipPos is float4 that contains position output from vertex shader
// (POSITION/SV_Position semantic):
clipPos.xy += renderTargetInvSize.xy * clipPos.w;

That’s it. Nothing more to do. Do this in all your vertex shaders, set up a shader constant that contains the inverse viewport size, and you are done.

I must stress that this is done across the board. Not only postprocessing or UI shaders. Everything. This fixes the 3D rasterizing mismatch, fixes postprocessing, fixes UI, etc.

Wait, why does no one do this then?

Ha. Turns out, they do!

Maybe it’s common knowledge, and only I managed to be confused? Sorry about that then! Should have realized this years ago…

Solving This Automatically

The “add this line of HLSL code to all your shaders” approach is nice if you are writing or generating all the shader source yourself. But what if you aren’t? (e.g. Unity falls into this camp; zillions of shaders are already written out there)

Turns out, it’s not that hard to do this at D3D9 bytecode level. No HLSL shader code modifications needed. Right after you compile the HLSL code into D3D9 bytecode (via D3DCompile or fxc), just slightly modify it.

D3D9 bytecode is documented in MSDN, “Direct3D Shader Codes”.

I debated whether I should do something flexible/universal (parse “instructions” from the bytecode, work on them, encode back into bytecode), or just write the minimal amount of code needed for this patching. I decided on the latter; with any luck D3D9 is nearing its end-of-life, and it’s very unlikely that I will ever need more D3D9 bytecode manipulation. If 5 years from now we still need this code, I will be very sad!

The basic idea is as follows (a sketch of the bytecode traversal this requires comes right after the list):

  1. Find which register is “output position” (clearly defined in shader model 2.0; can be arbitrary register in shader model 3.0), let’s call this oPos.
  2. Find unused temporary register, let’s call this tmpPos.
  3. Replace all usages of oPos with tmpPos.
  4. Add mad oPos.xy, tmpPos.w, constFixup, tmpPos and mov oPos.zw, tmpPos at the end.
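
All of these steps boil down to walking the D3D9 token stream and rewriting a few register tokens. Here’s a rough C++ sketch of just the traversal skeleton (not the actual patcher), using the token macros from d3d9types.h; the register renaming and the appended mad/mov would go where the comment says, and the function name is made up for illustration:

#include <d3d9.h>

// Sketch: walk a compiled D3D9 vertex shader token stream (e.g. the output
// of D3DCompile), visiting each instruction. Only the traversal skeleton is
// shown; finding the position output, picking an unused temp, renaming the
// destination registers and appending the fixup mad/mov would go below.
static void WalkShaderTokens(const DWORD* tokens)
{
    ++tokens; // skip the version token (0xFFFE0300 for vs_3_0)
    while (*tokens != D3DSIO_END) // 0x0000FFFF terminates the stream
    {
        const DWORD tok = *tokens;
        const DWORD opcode = tok & D3DSI_OPCODE_MASK;
        if (opcode == D3DSIO_COMMENT)
        {
            // comment blocks carry their length in a separate bit field
            const DWORD len = (tok & D3DSI_COMMENTSIZE_MASK) >> D3DSI_COMMENTSIZE_SHIFT;
            tokens += 1 + len;
            continue;
        }
        // SM2.0+ instruction tokens encode how many operand tokens follow
        const DWORD operands = (tok & D3DSI_INSTLENGTH_MASK) >> D3DSI_INSTLENGTH_SHIFT;
        // ...inspect opcode and operand tokens here: find the dcl_position
        // output, an unused temp register, rewrite position writes, etc.
        tokens += 1 + operands;
    }
}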

Here’s what it does to a simple vertex shader:

vs_3_0           // unused temp register: r1
dcl_position v0
dcl_texcoord v1
dcl_texcoord o0.xy
dcl_texcoord1 o1.xyz
dcl_color o2
dcl_position o3  // o3 is position
pow r0.x, v1.x, v1.y
mul r0.xy, r0.x, v1
add o0.xy, r0.y, r0.x
add o1.xyz, c4, -v0
mul o2, c4, v0
dp4 o3.x, v0, c0 // -> dp4 r1.x, v0, c0
dp4 o3.y, v0, c1 // -> dp4 r1.y, v0, c1
dp4 o3.z, v0, c2 // -> dp4 r1.z, v0, c2
dp4 o3.w, v0, c3 // -> dp4 r1.w, v0, c3
                 // new: mad o3.xy, r1.w, c255, r1
                 // new: mov o3.zw, r1

Here’s the code in a gist.

At runtime, each time the viewport is changed, set a vertex shader constant (I picked c255) to contain (-1.0f/width, 1.0f/height, 0, 0).
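
For example, something along these lines would do it (a minimal sketch, assuming you have the IDirect3DDevice9 pointer and the viewport size at hand; the helper name is made up):

// Set the half-pixel fixup constant whenever the viewport changes.
// c255 matches the constant register picked by the bytecode patch above.
void SetHalfPixelFixup(IDirect3DDevice9* dev, int width, int height)
{
    const float fixup[4] = { -1.0f / width, 1.0f / height, 0.0f, 0.0f };
    dev->SetVertexShaderConstantF(255, fixup, 1); // one float4 register
}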

That’s it!

Any downsides?

Not much :) The whole fixup needs shaders that:

  • Have an unused temporary register. The majority of our shaders are shader model 3.0, and I haven’t seen vertex shaders that use all 32 temporary registers. If that ever becomes a problem, the “find an unused register” analysis could be made smarter, by looking for a register that is unused just between the earliest and latest position writes. I haven’t done that.
  • Have an unused constant register at some (easier if fixed) index. The base spec for both shader model 2.0 and 3.0 gives vertex shaders 256 constant registers, so I just picked the last one (c255) to hold the fixup data.
  • Have instruction slot space for two more instructions. Again, shader model 3.0 has a 512 instruction slot limit, and it’s very unlikely a shader is using more than 510.

Upsides!

Major ones:

  • No one ever needs to think about D3D9 half-pixel offset, ever, again.
  • 3D rasterization positions match exactly between D3D9 and everything else (D3D11, GL, Metal etc.).

Fixed up D3D9 vs D3D11. Matches now:

I ran all the graphics tests we have, inspected all the resulting differences, and compared the results with D3D11. Turns out, this revealed a few minor places where we got the half-pixel offset wrong in our shaders/code before. So additional advantages (all Unity specific):

  • Some cases of GrabPass were sampling in the middle of pixels, i.e. slightly blurred results. Matches DX11 now.
  • Some shadow acne artifacts slightly reduced; matches DX11 now.
  • Some cases of image postprocessing effects having a one pixel gap on objects that should have been touching edge of screen exactly, have been fixed. Matches DX11 now.

All this will probably go into Unity 5.5. Still haven’t decided whether it’s too invasive/risky change to put into 5.4 at this stage.

Backporting Fixes and Shuffling Branches

“Everyday I’m Shufflin’” – LMFAO

For the past few months at work I’ve been operating this “graphics bugfixes” service. It’s a very simple, free (*) service that I’m running to reduce the overhead of doing “I have this tiny little fix there” changes. Aras-as-a-service, if you will. It works as explained in the image on the right.

(*) Where’s the catch? It’s free, so I’ll get a million people to use it, and then it’s a raging success! My plan is perfect.

Backporting vs Forwardporting Fixes

We pretty much always work on three releases at once: the “next” one (“trunk”, terminology from back in the Subversion source control days), the “current” one and the “previous” one. Right now these are:

  • trunk: will become Unity 5.5 sometime later this year (see roadmap).
  • 5.4: at the moment in fairly late beta, stabilization/polish.
  • 5.3.x: initially released end of 2015, currently “long term support” release that will get fixes for many months.

Often fixes need to go into all three releases, sometimes with small adjustments. About the only workflow that has worked reliably here is: “make the fix in the latest release, backport to earlier as needed”. In this case, make the fix on trunk-based code, backport to 5.4 if needed, and to 5.3 if needed.

The alternative would be making the fix on 5.3, and forward-porting to 5.4 & trunk. We do this sometimes, particularly for “zomg everything’s on fire, gotta fix asap!” type fixes. The risk is that it’s easy to “lose” a fix in the future releases. Even with the best intentions, some fixes will never get forward-ported, and then it’s embarrassing for everyone.

Shufflin’ 3 branches in your head can get confusing, doubly so if you’re also trying to get something else done. Here’s what I do.

1: Write Down Everything

When making a rollup pull request of fixes to trunk, write down everything in the PR description.

  • A list of all fixes, with a human-speak sentence for each (i.e. what would go into release notes). Sometimes the commit message summary already has that, sometimes it does not. In the latter case, look at the fix and describe in simple words what it does (preferably from the user’s POV).
  • Separate the fixes that don’t need backporting, from the ones that need to go into 5.4 too, and the ones that need to go into both 5.4 and 5.3.
  • Write down who made the fix, which bug case numbers it solves, and which code commits contain the fix (sometimes more than one!). The commits are useful later when doing actual backports; easier than trying to fish them out of the whole branch.

Here’s a fairly small bugfix pull request iteration with all the above:

Nice and clean! However, some bugfix batches do end up quite messy; here’s one that was quite involved. Too many fixes, fixes that are too large, too many code changes, hard to review, etc.:

2: Do Actual Backporting

We’re using Mercurial, so this is mostly grafting commits between branches. This is where having commit hashes written down right next to fixes is useful.

Several situations can result when grafting things back:

  • All fine! No conflicts no nothing. Good, move on.
  • Turns out, someone already backported this (communication, it’s hard!). Good, move on.
  • Easy conflicts. Look at the code, fix if trivial enough and you understand it.
  • Complex conflicts. Talk with author of original fix; it’s their job to either backport manually now, or advise you how to fix the conflict.

As for actual repository organization, I have three clones of the codebase on my machine: trunk, 5.4, 5.3. I was using a single repository and switching branches before, but it takes too long to do all the preparation and building, particularly when switching between branches that are far away from each other. Our repository is fairly big, so the hg share extension is useful: make trunk the “master” repository, and the other ones just working copies that share the innards of the same .hg folder. SSDs appreciate the 20GB savings!

3: Review After Backporting

After all the grafting, create pull requests for 5.4 and 5.3, scan through the changes to make sure everything got through fine. Paste relevant bits into PR description for other reviewers:

And then you’re left with 3 or so pull requests, each against a corresponding release. In our case, this means potentially adding more reviewers for sanity checking, running builds and tests on the build farm, and finally merging them into actual release once everything is done. Dance time!

This is all.

10 Years at Unity

Turns out, I started working on this “Unity” thing exactly 10 years ago. I wrote the backstory in the “2 years later” and “4 years later” posts, so it’s not worth repeating here.

A lot of things have happened over these 10 years, some of which are quite an experience.

Seeing the company go through various stages, from just 4 of us back then to, I dunno, 750 amazing people by now, is super interesting. You get to experience the joys & pains of growth, the challenges and opportunities that come with it, and so on.

Seeing all the amazing games made with Unity is extremely motivating. Being a part of this super-popular engine that everyone loves to hate is somewhat less motivating, but hey, let’s not focus on that today :)

Having my tiny contributions in all releases from Unity 1.2.2 to (at the time of writing) 5.3.1 and 5.4 beta feels good too!

What now? or “hey, what happened to Optimizing Unity Renderer posts?”

Last year I did several “Optimizing Unity Renderer” posts (part 1, part 2, part 3) and then, when things were about to get interesting, I stopped. Wat happend?

Well, I stopped working on those optimizations; the multi-threaded rendering and other optimizations are being done by people who are way better at it than I am (@maverikou, @n3rvus, @joeldevahl and some others who aren’t on the twitterverse).

So what am I doing then?

Since mid-2015 I’ve moved into a kinda-maybe-a-lead-but-not-quite position. Perhaps that’s better characterized as “all seeing evil eye”, or maybe “that guy who asks why A is done but B is not”. I was the “graphics lead” a number of years ago, until I decided that I should just be coding instead. Well, now I’m back to the “you don’t just code” state.

In practice, here’s what I’ve been doing for the past 6 months:

  • Reviewing a lot of code, or the “all seeing eye” bit. I’ve already been doing quite a lot of that, but with the large number of new graphics hires in 2015, the amount of graphics-related code changes has gone up massively.
  • Being part of a small team that does the overall “graphics vision”, prioritization, planning, roadmapping and stuff. The “why is A done when B should be done instead” bit. This also means doing job interviews, looking into which areas are understaffed, onboarding new hires etc.
  • Bugfixing, bugfixing and more bugfixing. It’s not a secret that the stability of Unity could be improved. Or that “Unity does not do what I think it should do” (which very often is not technically a “bug”, but feels like one from the user’s POV) happens a lot.
  • Improving workflow for other graphics developers internally. For example trying to reduce the overhead of our graphics tests and so on.
  • Working on some stuff or writing documentation when there’s time left from the above. Not much actual coding so far; the largest items I remember are some frame debugger improvements (in 5.3), texture arrays / CopyTexture (in 5.4 beta) and a bunch of smaller items.

For the foreseeable future I think I’ll continue doing the above.

By now we do have quite a lot of people working on graphics improvements; my own personal goal is that by mid-2016 there will be way fewer internet complaints along the lines of “unity is shit”. So: fewer regressions, fewer bugs, more stability, more in-depth documentation etc.

Wish me luck!

Careful With That STL Map Insert, Eugene

So we had this pattern in some of our code. Some sort of “device/API specific objects” need to be created out of simple “descriptor/key” structures. Think D3D11 rasterizer state or Metal pipeline state, or something similar to them.

Most of that code looked something like this (names changed and simplified):

// m_States is std::map<StateDesc, DeviceState>

const DeviceState* GfxDevice::CreateState(const StateDesc& key)
{
  // insert default state (will do nothing if key already there)
  std::pair<CachedStates::iterator, bool> res = m_States.insert(std::make_pair(key, DeviceState()));
  if (res.second)
  {
      // state was not there yet, so actually create it
      DeviceState& state = res.first->second;
      // fill/create state out of key
  }
  // return already existing or just created state
  return &res.first->second;
}

Now, past the initial initialization/loading, the absolute majority of CreateState calls will just return already created states.

StateDesc and DeviceState are simple structs with just plain old data in them; they can be created on the stack and copied around fairly well.

What’s the performance of the code above?

It is O(logN) in the total number of states created, that’s a given (std::map is a tree, usually implemented as a red-black tree; lookups are logarithmic). Let’s say that’s not a problem; we can live with logN complexity there.

Yes, STL maps are not quite friendly to the CPU cache, since all the nodes of a tree are separately allocated objects, which could be all over the place in memory. The typical C++ answer is “use a custom allocator”. Let’s say we have that too; all these maps use a nice “STL map” allocator designed for the fixed allocation size of a node, and the nodes are mostly packed nicely in memory. Yes, the nodes have pointers which take up space etc., but let’s say that’s ok in our case too.

In the common case of “state is already created, we just return it from the map”, besides the find cost, are there any other concerns?

Turns out… this code allocates memory. Always (*). And in the major case of state already being in the map, frees the just-allocated memory too, right there.

“bbbut… why?! how?”

(*) not necessarily always, but at least in some popular STL implementations it does.

Turns out, quite a few STL implementations have map.insert written this way:

node = allocateAndInitializeNode(key, value);
insertNodeIfNoKeyInMap(node);
if (didNotInsert)
  destroyAndFreeNode(node);

So in terms of memory allocations, calling map.insert with a key that already exists is more costly (incurs allocation+free). Why?! I have no idea.

I’ve tested with several things I had around.

STLs that always allocate:

Visual C++ 2015 Update 1:

_Nodeptr _Newnode = this->_Buynode(_STD forward<_Valty>(_Val));
return (_Insert_nohint(false, this->_Myval(_Newnode), _Newnode));

(_Buynode allocates, _Insert_nohint at end frees if not inserted).

Same behaviour in Visual C++ 2010.

Xcode 7.0.1 default libc++:

__node_holder __h = __construct_node(_VSTD::forward<_Vp>(__v));
pair<iterator, bool> __r = __node_insert_unique(__h.get());
if (__r.second)
    __h.release();
return __r;

STLs that only allocate when they need to insert:

These implementations first do a key lookup and return if found, and only if not found yet then allocate the tree node and insert it.

Xcode 7.0.1 with (legacy?) libstdc++.

EA’s EASTL. See red_black_tree.h.

@msinilo’s RDESTL. See rb_tree.h.

Conclusion?

STL is hard. Hidden differences between platforms like that can bite you. Or as @maverikou said, “LOL. this calls for a new emoji”.

In this particular case, a helper function that manually does a search and only inserts if needed would help things; see the sketch below. Using the lower_bound + insert-with-iterator-hint “trick” to avoid a second O(logN) search on insert might be useful too. See this answer on Stack Overflow.
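
For example, here’s a sketch of the CreateState function from above rewritten that way: one lower_bound search, with the resulting iterator reused as the insertion hint, so nothing is allocated when the state already exists:

const DeviceState* GfxDevice::CreateState(const StateDesc& key)
{
  // one O(logN) search; 'it' points at the first element not less than key
  CachedStates::iterator it = m_States.lower_bound(key);
  if (it == m_States.end() || m_States.key_comp()(key, it->first))
  {
      // not found: insert right at the position we already found
      it = m_States.insert(it, std::make_pair(key, DeviceState()));
      // fill/create it->second out of key here
  }
  return &it->second;
}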

Curiously enough, on that (and other similar) SO threads other answers are along the lines of “for simple key/value types, just calling insert will be as efficient”. Ha. Haha.

Optimizing Unity Renderer Part 3: Fixed Function Removal

Last time I wrote about some cleanups and optimizations. Since then, I got sidetracked into doing some Unity 5.1 work, removing Fixed Function Shaders and other unrelated things. So not much blogging about optimization per se.

Fixed Function What?

Once upon a time, GPUs did not have these fancy things called “programmable shaders”; instead they could be configured in more or less (mostly less) flexible ways, by enabling and disabling certain features. For example, you could tell them to calculate some lighting per-vertex; or to add two textures together per-pixel.

Unity started out a long time ago, back when fixed function GPUs were still a thing; so naturally it supports writing shaders in this fixed function style (“shaderlab” in Unity lingo). The syntax for them is quite easy, and actually they are much faster to write than vertex/fragment shader pairs if all you need is some simple shader.

For example, a Unity shader pass that turns on regular alpha blending, and outputs texture multiplied by material color, is like this:

Pass
{
  Blend SrcAlpha OneMinusSrcAlpha
  SetTexture [_MainTex] { constantColor[_Color] combine texture * constant }
}

Compare that with a vertex+fragment shader that does exactly the same:

Pass
{
  Blend SrcAlpha OneMinusSrcAlpha
  CGPROGRAM
  #pragma vertex vert
  #pragma fragment frag
  #include "UnityCG.cginc"
  struct v2f
  {
      float2 uv : TEXCOORD0;
      float4 pos : SV_POSITION;
  };
  float4 _MainTex_ST;
  v2f vert (float4 pos : POSITION, float2 uv : TEXCOORD0)
  {
      v2f o;
      o.pos = mul(UNITY_MATRIX_MVP, pos);
      o.uv = TRANSFORM_TEX(uv, _MainTex);
      return o;
  }
  sampler2D _MainTex;
  fixed4 _Color;
  fixed4 frag (v2f i) : SV_Target
  {
      return tex2D(_MainTex, i.uv) * _Color;
  }
  ENDCG
}

Exactly the same result, a lot more boilerplate typing.

Now, we removed support for actual fixed function GPUs and platforms (in practice: OpenGL ES 1.1 on mobile and Direct3D 7 GPUs on Windows) in Unity 4.3 (that was in late 2013). So there’s no big technical reason to keep writing shaders in this “fixed function style”… except that 1) a lot of existing projects and packages already have them and 2) for simple things it’s less typing.

That said, fixed function shaders in Unity have downsides too:

  • They do not work on consoles (like PS4, XboxOne, Vita etc.), primarily because generating shaders at runtime is very hard on these platforms.
  • They do not work with MaterialPropertyBlocks, and as a byproduct, do not work with Unity’s Sprite Renderer nor materials animated via animation window.
  • By their nature they are suitable for only very simple things. Often you could start with a simple fixed function shader, only to find that you need to add more functionality that can’t be expressed in fixed function vertex lighting / texture combiners.

How are fixed function shaders implemented in Unity? And why?

The majority of platforms we support do not have the concept of a “fixed function rendering pipeline” anymore, so these shaders are internally converted into “actual shaders”, which are then used for rendering. The only exceptions where fixed function still exists are legacy desktop OpenGL (GL 1.x-2.x) and Direct3D 9.

Truth be told, even on Direct3D 9 we’ve been creating actual shaders to emulate fixed function since Unity 2.6; see this old article. So D3D9 was the first platform where we implemented this “lazily create actual shaders for each fixed function shader variant” thing.

Then more platforms came along; for OpenGL ES 2.0 we implemented something very similar to the D3D9 approach, just concatenating GLSL snippets instead of bits of D3D9 shader assembly. And then even more platforms came (D3D11, Flash, Metal); each of them implemented this “fixed function generation” code. The code is not terribly complicated; the problem was pretty well understood and we had enough graphics tests to verify it works.

Each step along the way, somehow no one really questioned why we keep doing all this. Why do all that at runtime, instead of converting fixed-function-style shaders into “actual shaders” offline, at shader import time? (well ok, plenty of people asked that question; the answer has been “yeah that would make sense; just requires someone to do it” for a while…)

A long time ago, generating “actual shaders” for fixed-function-style ones offline was not very practical, due to the sheer number of possible variants that needed to be supported. The trickiest ones to support were texture coordinates (routing of UVs into texture stages; optional texture transformation matrices; optional projected texturing; and optional texture coordinate generation). But hey, we removed quite a lot of that in Unity 5.0 anyway. Maybe now it’s easier? Turns out, it is.

Converting fixed function shaders into regular shaders, at import time

So I set out to do just that. Remove all the runtime code related to “fixed function shaders”; instead, just turn them into “regular shaders” when importing the shader file in the Unity editor. I created an outline of the idea & planned work on our wiki, and started coding. I thought the end result would be “I’ll add 1000 lines and remove 4000 lines of existing code”. I was wrong!

Once I got the basics of the shader import side working (about 1000 lines of code indeed, as it turned out), I started removing all the fixed function bits. That was a day of pure joy:

Almost twelve thousand lines of code, gone. This is amazing!

I never realized all the fixed function code was that large. You write it for one platform, and then it basically works; then some new platform comes and the code is written for that, and then it basically works. By the time you get N platforms, all that code is massive, but it never came in one sudden lump so no one realized it.

Takeaway: once in a while, look at a whole subsystem. You might be surprised at how much it grew over the years. Maybe some of the reasons it ended up that way no longer apply?

Sidenote: per-vertex lighting in a vertex shader

If there was one thing that was easy with the fixed function pipeline, it’s that many features were easily composable. You could enable any number of lights (well, up to 8); each of them could be a directional, point or spot light; toggling specular on or off was just a flag away; same with fog, etc.

It feels like “easy composition of features” is a big thing we lost when we all moved to shaders. Shaders as we know them (i.e. vertex/fragment/… stages) aren’t composable at all! Want to add some optional feature? That pretty much means either “double the amount of shaders”, or branching in shaders, or generating shaders at runtime. Each of these has its own tradeoffs.

For example, how do you write a vertex shader that can do up to 8 arbitrary lights? There are many ways of doing it; what I have done right now is:

Separate vertex shader variants for “any spot lights present?”, “any point lights present?” and “directional lights only” cases. My guess is that spot lights are very rarely used with per-vertex fixed function lighting; they just look really bad. So in many cases, the cost of “compute spot lights” won’t be paid.

The number of lights is passed into the shader as an integer, and the shader loops over them. Complication: OpenGL ES 2.0 / WebGL, where loops can only have a constant number of iterations :( In practice many OpenGL ES 2.0 implementations do not enforce that limitation; however, WebGL implementations do. At this very moment I don’t have a good answer; on ES2/WebGL I just always loop over all 8 possible lights (the unused lights have black colors set). For a real solution, instead of a regular loop like this:

uniform int lightCount;
// ...
for (int i = 0; i < lightCount; ++i)
{
  // compute light #i
}

I’d like to emit a shader like this when compiling for ES2.0/WebGL:

uniform int lightCount;
// ...
for (int i = 0; i < 8; ++i)
{
  if (i == lightCount)
      break;
  // compute light #i
}

Which would be valid according to the spec; it’s just annoying to deal with seemingly arbitrary limitations like this (I heard that WebGL 2 does not have this limitation, so that’s good).

What do we have now

So the current situation is that, by removing a lot of code, I achieved the following upsides:

  • “Fixed function style” shaders work on all platforms now (consoles! dx12!).
  • They work more consistently across platforms (e.g. specular highlights and attenuation were subtly different between PC & mobile before).
  • MaterialPropertyBlocks work with them, which means sprites etc. all work.
  • Fixed function shaders aren’t rasterized at a weird half-pixel offset on Windows Phone anymore.
  • It’s easier to go from a fixed function shader to an actual shader now; I’ve added a button in the shader inspector that shows all the generated code; you can paste that back and start extending it.
  • The code is smaller; that translates to executable size too. For example, the Windows 64 bit player got smaller by 300 kilobytes.
  • Rendering is slightly faster (even when fixed function shaders aren’t used)!

That last point was not the primary objective, but it is a nice bonus. No particular big place was affected, but quite a few branches and data members were removed from the platform graphics abstraction (they were only there to support the fixed function runtime). Depending on the project, I’ve seen up to 5% CPU time saved on the rendering thread (e.g. 10.2ms -> 9.6ms), which is pretty good.

Are there any downsides? Well, yes, a couple:

  • You cannot create fixed function shaders at runtime anymore. Before, you could do something like var mat = new Material("<fixed function shader string>") and it would all work. Well, except on consoles, where these shaders never worked. For this reason I’ve made the Material(string) constructor obsolete with a warning for Unity 5.1; it will actually stop working later on.
  • It’s a web player backwards compatibility breaking change, i.e. if/once this code lands in production (for example, Unity 5.2), the 5.2 runtime can no longer play back Unity 5.0/5.1 data files. Not a big deal; we just have to decide whether with (for example) 5.2 we’ll switch to a different web player release channel.
  • Several corner cases might not work. For example, a fixed function shader that uses a globally-set texture that is not a 2D texture. Nothing about that texture is specified in the shader source itself, so when generating the actual shader at import time I don’t know if it’s a 2D or a Cubemap texture. For global textures I just assume they are going to be 2D ones.
  • Probably that’s it!

Removing all that fixed function runtime support also revealed more potential optimizations. Internally we were passing things like “texture type” (2D, Cubemap etc.) for all texture changes – but it seems that it was only the fixed function pipeline that was using it. Likewise, we are passing a vertex-declaration-like structure for each and every draw call; but now I think that’s not really needed anymore. Gotta look further into it.

Until next time!