Blogs · Aras' website

The Virtual and No-Virtual

Posted on Feb 1, 2011

You are writing some system where different implementations have to be used for different platforms. To keep things real, let’s say it’s a rendering system which we’ll call “GfxDevice” (based on a true story!). For example, on Windows there could be Direct3D 9, Direct3D 11 or OpenGL implementations; on iOS/Android there could be OpenGL ES 1.1 & 2.0 ones and so on.

For the sake of simplicity, let’s say our GfxDevice interface needs to do this (in the real world it would need to do much more):

void SetShader (ShaderType type, ShaderID shader);
void SetTexture (int unit, TextureID texture);
void SetGeometry (VertexBufferID vb, IndexBufferID ib);
void Draw (PrimitiveType prim, int primCount);

How can this be done?

Approach #1: virtual interface!

Many a programmer would think like this: why of course, GfxDevice is an interface with virtual functions, and then we have multiple implementations of it. Sounds good, and that’s what you would have been taught at the university in various software design courses. Here we go:

class GfxDevice {
public:
    virtual ~GfxDevice();
    virtual void SetShader (ShaderType type, ShaderID shader) = 0;
    virtual void SetTexture (int unit, TextureID texture) = 0;
    virtual void SetGeometry (VertexBufferID vb, IndexBufferID ib) = 0;
    virtual void Draw (PrimitiveType prim, int primCount) = 0;
};
// and then we have:
class GfxDeviceD3D9 : public GfxDevice {
    // ...
};
class GfxDeviceGLES20 : public GfxDevice {
    // ...
};
class GfxDeviceGCM : public GfxDevice {
    // ...
};
// and so on

And then based on platform (or something else) you create the right GfxDevice implementation, and the rest of the code uses that. This is all good and it works.
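A minimal sketch of that creation step, under some assumptions: the `GfxApi` enum, the `CreateGfxDevice` factory and the `Name()` method are made up for illustration (they stand in for the real interface), and only two backends are shown.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical API selector; not part of the original interface.
enum class GfxApi { D3D9, GLES20 };

class GfxDevice {
public:
    virtual ~GfxDevice() = default;
    // Stand-in for SetShader/SetTexture/SetGeometry/Draw.
    virtual std::string Name() const = 0;
};

class GfxDeviceD3D9 : public GfxDevice {
public:
    std::string Name() const override { return "D3D9"; }
};

class GfxDeviceGLES20 : public GfxDevice {
public:
    std::string Name() const override { return "GLES20"; }
};

// Pick the implementation once, at startup; the rest of the code
// only ever sees the GfxDevice interface.
std::unique_ptr<GfxDevice> CreateGfxDevice(GfxApi api) {
    switch (api) {
        case GfxApi::D3D9:   return std::make_unique<GfxDeviceD3D9>();
        case GfxApi::GLES20: return std::make_unique<GfxDeviceGLES20>();
    }
    assert(false && "unknown GfxApi");
    return nullptr;
}
```

Calling code would do something like `auto device = CreateGfxDevice(GfxApi::GLES20);` and from then on every call through `device` is a virtual call.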

But then… hey! Some platforms can only ever have one GfxDevice implementation. On PS3 you will always end up using GfxDeviceGCM. Does it really make sense to have virtual functions on that platform?

Side note: of course the cost of a virtual function call is not something that stands out immediately. It’s much less than, for example, doing a network request to get the leaderboards or parsing that XML file that ended up in your game for reasons no one can remember. Virtual function calls will not show up in the profiler as “a heavy bottleneck”. However, they are not free, and their cost will be scattered around in a million places that are very hard to eradicate. You can end up having death by a thousand paper cuts.

If we want to get rid of virtual functions on platforms where they are useless, what can we do?

Approach #2: preprocessor to the rescue

We just have to take out the “virtual” bit from the interface, and the “= 0” abstract function bit. With a bit of preprocessor we can:

#define GFX_DEVICE_VIRTUAL (PLATFORM_WINDOWS || PLATFORM_MOBILE_UNIVERSAL || SOMETHING_ELSE)
#if GFX_DEVICE_VIRTUAL
    #define GFX_API virtual
    #define GFX_PURE = 0
#else
    #define GFX_API
    #define GFX_PURE
#endif
class GfxDevice {
public:
    GFX_API ~GfxDevice();
    GFX_API void SetShader (ShaderType type, ShaderID shader) GFX_PURE;
    GFX_API void SetTexture (int unit, TextureID texture) GFX_PURE;
    GFX_API void SetGeometry (VertexBufferID vb, IndexBufferID ib) GFX_PURE;
    GFX_API void Draw (PrimitiveType prim, int primCount) GFX_PURE;
};

And then there’s no separate class called GfxDeviceGCM for PS3; it’s just GfxDevice class implementing non-virtual methods. You have to make sure you don’t try to compile multiple GfxDevice class implementations on PS3 of course.

Ta-da! Virtual functions are gone on some platforms and life is good.
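To check that the macros really remove the virtual dispatch, here is a cut-down sketch. It assumes a single-implementation platform (so `GFX_DEVICE_VIRTUAL` is hard-coded to 0), shrinks the interface to one simplified `Draw`, and adds a `drawnPrims` counter purely so there is something to observe.

```cpp
#include <cassert>
#include <type_traits>

// Assumption for this sketch: a platform with exactly one implementation.
#define GFX_DEVICE_VIRTUAL 0

#if GFX_DEVICE_VIRTUAL
    #define GFX_API virtual
    #define GFX_PURE = 0
#else
    #define GFX_API
    #define GFX_PURE
#endif

class GfxDevice {
public:
    GFX_API ~GfxDevice() {}
    GFX_API void Draw(int primCount) GFX_PURE;
    int drawnPrims = 0;   // illustrative state, not in the original interface
};

#if !GFX_DEVICE_VIRTUAL
// The one and only implementation lives directly in GfxDevice.
void GfxDevice::Draw(int primCount) { drawnPrims += primCount; }

// With the macros expanding to nothing, the class has no virtual functions.
static_assert(!std::is_polymorphic<GfxDevice>::value,
              "GfxDevice should be non-virtual on this platform");
#endif
```

Since calls are now ordinary member function calls, the compiler is also free to inline them where it sees fit.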

But we still have the other platforms, where there can be more than one GfxDevice implementation, and the decision for which one to use is made at runtime. Like our good old friend the PC: you could use Direct3D 9 or Direct3D 11 or OpenGL, based on the OS, GPU capabilities or user’s preference. Or a mobile platform where you don’t know whether OpenGL ES 2.0 will be available and you’d have to fall back to OpenGL ES 1.1.

Let’s think about what virtual functions actually are

How do virtual functions work? Usually they work like this: each object gets a “pointer to a virtual function table” as its first hidden member. The virtual function table (vtable) is then just pointers to where the functions are in the code. Something like this:

The key points are: 1) each object’s data starts with a vtable pointer, and 2) vtable layout for classes implementing the same interface is the same.

When the compiler generates code for something like this:

device->Draw (kPrimTriangles, 1337);

it will generate something like the following pseudo-assembly:

vtable = load pointer from [device] address
drawlocation = vtable + 3*PointerSize ; since Draw is at index [3] in vtable
drawfunction = load pointer from [drawlocation] address
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address

This code will work no matter if device is of GfxDeviceGLES20 or GfxDeviceGLES11 kind. For both cases, the first pointer in the object will point to the appropriate vtable, and the fourth pointer in the vtable will point to the appropriate Draw function.
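The same dispatch can be written out by hand, which makes the two dependent loads visible. This is only an emulation: real vtable layouts are compiler and ABI specific, and the `VTable`, `Device` and `CallDraw` names, the simplified `int` parameters and the return values are all invented for the sketch.

```cpp
#include <cassert>

struct Device;

// Hand-written "vtable": one function pointer per interface method,
// in the same order for every implementation.
struct VTable {
    void (*SetShader)(Device*, int);
    void (*SetTexture)(Device*, int);
    void (*SetGeometry)(Device*, int);
    int  (*Draw)(Device*, int primCount);   // slot [3], as in the pseudo-assembly
};

struct Device {
    const VTable* vtable;   // first member, like the hidden vtable pointer
    int primsDrawn;
};

static void Nop(Device*, int) {}
static int DrawGLES20(Device* self, int primCount) { self->primsDrawn += primCount; return 20; }
static int DrawGLES11(Device* self, int primCount) { self->primsDrawn += primCount; return 11; }

static const VTable kVTableGLES20 = { Nop, Nop, Nop, DrawGLES20 };
static const VTable kVTableGLES11 = { Nop, Nop, Nop, DrawGLES11 };

// Roughly what device->Draw(...) boils down to.
int CallDraw(Device* device, int primCount) {
    const VTable* vtable = device->vtable;              // load #1: vtable pointer
    int (*drawFunction)(Device*, int) = vtable->Draw;   // load #2: slot [3]
    return drawFunction(device, primCount);             // indirect call, passing "this"
}
```

`CallDraw` neither knows nor cares which implementation it is talking to; only the table contents differ.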

By the way, the above illustrates the overhead of a virtual function call. If we assume a platform where we have an in-order CPU and reading from memory takes 500 CPU cycles (which is not far from the truth for current consoles), then if nothing we need is in the CPU cache yet, this is what actually happens:

vtable = load pointer from [device] address
; *wait 500 cycles* until the pointer arrives
drawlocation = vtable + 3*PointerSize
drawfunction = load pointer from [drawlocation] address
; *wait 500 cycles* until the pointer arrives
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
; *wait 500 cycles* until code at that address is loaded

Can we do better?

Look at the picture in the previous paragraph and remember the “wait 500 cycles” for each pointer we are chasing. Can we reduce the number of pointer chases? Of course we can: why not ditch the vtable altogether, and just put function pointers directly into the GfxDevice object?

Virtual tables are implemented in this way mostly to save space. If we had 10000 objects of some class that has 20 virtual methods, we only pay one pointer overhead per object (40000 bytes on 32 bit architecture) and we store the vtable (20*4=80 bytes on 32 bit arch) just once, in total 39.14 kilobytes.

If we’d move all function pointers into objects themselves, we’d need to store 20 function pointers in each object. Which would be 781.25 kilobytes! Clearly this approach does not scale with increasing object instance counts.
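The numbers above can be double-checked with a few lines of compile-time arithmetic. The constant and function names are made up; the figures (10000 objects, 20 methods, 4-byte pointers) are the ones from the text.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kObjects     = 10000;
constexpr std::size_t kMethods     = 20;
constexpr std::size_t kPointerSize = 4;     // 32-bit architecture

// vtable layout: one pointer per object, plus a single shared table.
constexpr std::size_t BytesWithVTable() {
    return kObjects * kPointerSize + kMethods * kPointerSize;
}

// Function pointers stored in every object.
constexpr std::size_t BytesWithInlinePointers() {
    return kObjects * kMethods * kPointerSize;
}

constexpr double ToKilobytes(std::size_t bytes) { return bytes / 1024.0; }

static_assert(BytesWithVTable() == 40080, "40000 + 80 bytes, about 39.14 KB");
static_assert(BytesWithInlinePointers() == 800000, "exactly 781.25 KB");
```

That is roughly a 20x difference, which is why per-object function pointers only make sense when there are very few objects.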

However, how many GfxDevice object instances do we really have? Most often… exactly one.

Approach #3: function pointers

If we move function pointers to the object itself, we’d have something like this:

There’s no built-in language support for implementing this in C++ however, so that would have to be done manually. Something like:

struct GfxDeviceFunctions {
    SetShaderFunc SetShader;
    SetTextureFunc SetTexture;
    SetGeometryFunc SetGeometry;
    DrawFunc Draw;
};
class GfxDeviceGLES20 : public GfxDeviceFunctions {
    // ...
};

And then when creating a particular GfxDevice, you have to fill in the function pointers yourself. And since the functions are member functions which magically take a “this” parameter, it’s hard to just use them as function pointers without going to clumsy C++ member function pointer syntax and related issues.

We can be more explicit, C style, and instead just have the functions be static, taking “this” parameter directly:

class GfxDeviceGLES20 : public GfxDeviceFunctions {
    // ...
    static void DrawImpl (GfxDevice* self, PrimitiveType prim, int primCount);
    // ...
};

Code that uses it would look like this then:

device->Draw (device, kPrimTriangles, 1337);

and it would generate the following pseudo-assembly:

drawlocation = device + 3*PointerSize
drawfunction = load pointer from [drawlocation] address
; *wait 500 cycles* until the pointer arrives
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
; *wait 500 cycles* until code at that address is loaded

Look at that, one of the “wait 500 cycles” stalls is gone!
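Putting the pieces of this approach together, a compilable sketch could look like the following. It assumes that `GfxDevice` itself derives from `GfxDeviceFunctions` (so that `device->Draw` is a plain data member), keeps only `Draw`, and adds a `primsDrawn` counter just to have something observable.

```cpp
#include <cassert>

enum PrimitiveType { kPrimTriangles };

struct GfxDevice;   // forward declaration so the pointer type can mention it

typedef void (*DrawFunc)(GfxDevice* self, PrimitiveType prim, int primCount);

// The function pointers live directly in the object, no vtable in between.
struct GfxDeviceFunctions {
    DrawFunc Draw = nullptr;
};

struct GfxDevice : GfxDeviceFunctions {
    int primsDrawn = 0;   // illustrative state
};

struct GfxDeviceGLES20 : GfxDevice {
    // Filling in the pointers is now our job, not the compiler's.
    GfxDeviceGLES20() { Draw = &DrawImpl; }

    static void DrawImpl(GfxDevice* self, PrimitiveType /*prim*/, int primCount) {
        self->primsDrawn += primCount;
    }
};
```

The cost is visible at the call site: `device` has to be passed explicitly, as in `device->Draw (device, kPrimTriangles, 1337);`.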

More C style

We could move function pointers outside of GfxDevice if we want to, and just make them global:

In GLES1.1 case, that global GfxDevice funcs block would point to different pieces of code. And the pseudocode for this:

// global variables!
SetShaderFunc GfxSetShader;
SetTextureFunc GfxSetTexture;
SetGeometryFunc GfxSetGeometry;
DrawFunc GfxDraw;
// GLES2.0 implementation:
void GfxDrawGLES20 (GfxDevice* self, PrimitiveType prim, int primCount) { /* ... */ }

Code that uses it:

GfxDraw (device, kPrimTriangles, 1337);

and the pseudo-assembly:

drawfunction = load pointer from [GfxDraw variable] address
; wait 500 cycles until the pointer arrives
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
; wait 500 cycles until code at that address is loaded
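A runnable sketch of the global variant, with assumptions spelled out: `InitGfxFunctions` is a hypothetical startup helper, only `GfxDraw` is shown, and the `primsDrawn`/`lastApi` fields exist only so the effect of each call can be observed.

```cpp
#include <cassert>

enum PrimitiveType { kPrimTriangles };

struct GfxDevice {
    int primsDrawn = 0;
    int lastApi = 0;      // illustrative: records which backend ran last
};

typedef void (*DrawFunc)(GfxDevice* self, PrimitiveType prim, int primCount);

// Global variable! Set once at startup, then called from everywhere.
DrawFunc GfxDraw = nullptr;

void GfxDrawGLES20(GfxDevice* self, PrimitiveType, int primCount) {
    self->primsDrawn += primCount;
    self->lastApi = 20;
}

void GfxDrawGLES11(GfxDevice* self, PrimitiveType, int primCount) {
    self->primsDrawn += primCount;
    self->lastApi = 11;
}

// Hypothetical startup helper: point the globals at the chosen backend.
void InitGfxFunctions(bool hasGLES20) {
    GfxDraw = hasGLES20 ? GfxDrawGLES20 : GfxDrawGLES11;
}
```

After `InitGfxFunctions` has run, call sites are simply `GfxDraw (device, kPrimTriangles, 1337);`.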

Is it worth it?

I can hear some saying, “what? throwing away C++ OOP and implementing the same in almost raw C?! you’re crazy!”

Whether going the above route is better or worse is mostly a matter of programming style and preferences. It does get rid of one “wait 500 cycles” in the worst case for sure. And yes, to get that you do lose some of the automagic syntactic sugar in C++.

Is it worth it? Like always, depends on a lot of things. But if you do find yourself pondering the virtual function overhead for singleton-like objects, or especially if you do see that your profiler reports cache misses when calling into them, at least you’ll know one of the many possible alternatives, right?

And yeah, another alternative that’s easy to do on some platforms? Just put different GfxDevice implementations into dynamically loaded libraries, exposing the same set of functions. Which would end up being very similar to the last approach of “store function pointer table globally”, except you’d get some compiler syntax sugar to make it easier; and you wouldn’t even need to load the code that is not going to be used.

iOS shader tricks, or it's 2001 all over again

Posted on Feb 1, 2011

I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here’s a small example of starting with a normalmapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below was actually done by ReJ, props to him!

Here’s a small test I’ll be working on: just a single plane with albedo and normal map textures:

I’ll be testing on iPhone 3Gs with iOS 4.2.1. Timer is started before glClear() and stopped after glFinish() that I added just after drawing the mesh.

Let’s start with an initial naive shader version:

Should be pretty self-explanatory to anyone who’s familiar with tangent space normal mapping and Blinn-Phong BRDF. Running time: 24.5 milliseconds. On iPhone 4’s Retina resolution, this would be about 4x slower!

What can we do next? On mobile platforms using appropriate precision of variables is often very important, especially in a fragment shader. So let’s go and add highp/mediump/lowp qualifiers to the fragment shader: shader source

Still the same running time! Alas, iOS does not have low level shader analysis tools, so we can’t really tell why that is happening. We could be limited by something else (e.g. normalizing vectors and computing pow() being the bottlenecks that run in parallel with all low precision stuff), or the driver might be promoting most of our computations to higher precision because it feels like it. It’s a magic box!

Let’s start approximating instead. How about computing normalized view direction per vertex, and interpolating that for the fragment shader? It won’t be entirely “correct”, but hey, it’s a phone we’re talking about. shader source

15 milliseconds! But… the rendering is wrong; everything turned white near the bottom of the screen. Turns out PowerVR SGX (the GPU in all current iOS devices) really means “low precision” when we want to add two lowp vectors and normalize the result. Let’s try promoting one of them to medium precision with a “varying mediump vec3 v_viewdir”: shader source

That fixed rendering, but we’re back to 24.5 milliseconds. Sad shader writers are sad… oh shader performance analysis tools, where art thou?

Let’s try approximating some more: compute half-vector in the vertex shader, and interpolate normalized value. This would get rid of all normalizations in the fragment shader. shader source

16.3 milliseconds, not too bad! We still have pow() computed in the fragment shader, and that one is probably not the fastest operation there…

Almost a decade ago, a very common trick was to use a lookup texture to do the lighting. For example, a 2D texture indexed by (N.L, N.H). Since all lighting data would be “baked” into the texture, it does not necessarily have to be Blinn-Phong even; we can prepare faux-anisotropic, metallic, toon-shading or other fancy BRDFs there, as long as they can be expressed in terms of N.L and N.H. So let’s try creating 128x128 RGBA lookup texture and use that: shader source

Quick & not super efficient code to create the lighting lookup texture for Blinn-Phong:
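A rough sketch of what such a generator could look like on the CPU side. This is not the code from the original experiment; the channel layout (RGB = diffuse term, A = specular term), the axis assignment (x = N.L, y = N.H) and the `BuildLightingLUT` name are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp to [0,1] and quantize to 8 bits with rounding.
static std::uint8_t ToByte(float v) {
    float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f);
}

// Builds a width x height RGBA8 texture indexed by (N.L, N.H).
// Assumed layout: RGB holds the diffuse term, A holds the specular term.
std::vector<std::uint8_t> BuildLightingLUT(int width, int height, float specPower) {
    assert(width > 1 && height > 1);
    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
    for (int y = 0; y < height; ++y) {
        float nh = static_cast<float>(y) / static_cast<float>(height - 1);
        std::uint8_t spec = ToByte(std::pow(nh, specPower));   // Blinn-Phong specular
        for (int x = 0; x < width; ++x) {
            float nl = static_cast<float>(x) / static_cast<float>(width - 1);
            std::uint8_t diff = ToByte(nl);                    // Lambert diffuse
            std::size_t i = (static_cast<std::size_t>(y) * width + x) * 4;
            pixels[i + 0] = diff;
            pixels[i + 1] = diff;
            pixels[i + 2] = diff;
            pixels[i + 3] = spec;
        }
    }
    return pixels;
}
```

The resulting buffer would then be uploaded as a regular RGBA texture. Swapping the two formulas inside the loops is all it takes to bake a different BRDF.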

9.1 milliseconds! We lost some precision in the specular though (it’s dimmer):

What else can be done? Notice that we clamp N.L and N.H values in the fragment shader, but this could be done just as well by the texture sampler, if we set texture’s addressing mode to CLAMP_TO_EDGE. Let’s get rid of the clamps: shader source

This is 8.3 milliseconds, or 7.6 milliseconds if we reduce our lighting texture resolution to 32x128.

Should we stop there? Not necessarily. For example, the shader is still multiplying albedo with a per-material color. Maybe that’s not very useful and can be let go. Maybe we can also make specular be always white?

How fast is this? 5.9 milliseconds, or over 4 times faster than our original shader.

Could it be made faster? Maybe; that’s an exercise for the reader :) I tried computing just the RGB color channels and setting alpha to zero, but that got slightly slower. Without real shader analysis tools it’s hard to see where or if additional cycles could be squeezed out.

I’m adding Xcode project with sources, textures and shaders of this experiment. Notes about it: only tested on iPhone 3Gs (probably will crash on iPhone 3G, and iPad will have wrong aspect ratio). Might not work at all! Shader is read from Resources/Shaders/shader.txt, next to it are shader versions of the steps of this experiment. Enjoy!

GLSL Optimizer

Posted on Sep 29, 2010

During development of Unity 3.0, I was not-so-pleasantly surprised to see that our cross-compiled shaders run slow on iPhone 3Gs. And by “slow”, I mean SLOW; at the speeds of “stop the presses, we can not ship brand new OpenGL ES 2.0 support with THAT performance”.

Back story

Take this HLSL pixel shader for particles, which does nothing but multiply a texture by the per-vertex color:

half4 frag (v2f i) : COLOR { return i.color * tex2D (_MainTex, i.texcoord); }

This is about as simple as it can get; should be one texture fetch and one multiply for the GPU.

Now of course, when HLSL gets cross-compiled into GLSL, it is augmented by some dummy functions/moves to match GLSL’s semantics of “a function called main that takes no arguments and returns no value”. So you get something like this in GLSL:

vec4 frag (in v2f i) { return i.color * texture2D (_MainTex, i.texcoord); }
void main() {
    vec4 xl_retval;
    v2f xlt_i;
    xlt_i.color = gl_Color;
    xlt_i.texcoord = gl_TexCoord[0];
    xl_retval = frag (xlt_i);
    gl_FragData[0] = xl_retval;
}

Makes sense. The original function was translated, and main() got added that fills in the input structure, calls the function and writes result to gl_FragData[0] (aka gl_FragColor).

Lo and behold, the above (with some OpenGL ES 2.0 specific stuff added, like precision qualifiers, definitions of varyings etc.) runs like sh*t on a mobile platform.

Which probably means mobile platform drivers are quite bad at optimizing GLSL. I mostly tested iOS, but some tests on Android indicate that situation is the same (maybe even worse, depending on exact kind of Android you have). Which is sad since said platforms also do not have any way to precompile shaders offline, where they could afford good but slow compilers.

Now of course, if you’re writing GLSL shaders by hand, you’re probably writing close to optimal code, with no redundant data moves or wrapper functions. But if you’re cross-compiling them from Cg/HLSL, or generating from some shader fragments, or from visual shader editors, you probably depend on shader compiler being decent at optimizing redundant bits.

GLSL Optimizer

Around the same time I accidentally discovered that the Mesa 3D guys are working on a new GLSL compiler, dubbed GLSL2. I looked at the code and I liked it a lot; very hackable and “no bullshit” approach. So I took Mesa’s GLSL compiler and made it output GLSL back after it has done all the optimizations.

Here it is: http://github.com/aras-p/glsl-optimizer

It reads GLSL, does some architecture independent optimizations (dead code removal, algebraic simplifications, constant propagation, constant folding, inlining, …) and spits out “optimized” GLSL back.
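To give a feel for what “constant folding” and “algebraic simplifications” mean here, a toy version on a tiny expression tree is sketched below. This is not how Mesa’s compiler or GLSL Optimizer is implemented; the `Expr` type and the `Fold` function are invented purely for illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Tiny expression tree: constants, an opaque variable, and binary + and *.
struct Expr {
    enum Kind { Const, Var, Add, Mul } kind;
    float value;                      // meaningful only for Const
    std::unique_ptr<Expr> lhs, rhs;   // meaningful only for Add / Mul
};
typedef std::unique_ptr<Expr> ExprPtr;

ExprPtr MakeConst(float v) { return ExprPtr(new Expr{Expr::Const, v, nullptr, nullptr}); }
ExprPtr MakeVar()          { return ExprPtr(new Expr{Expr::Var, 0.0f, nullptr, nullptr}); }
ExprPtr MakeBinary(Expr::Kind k, ExprPtr l, ExprPtr r) {
    return ExprPtr(new Expr{k, 0.0f, std::move(l), std::move(r)});
}

// Bottom-up pass: fold constant subtrees, then drop "* 1" and "+ 0".
ExprPtr Fold(ExprPtr e) {
    if (e->kind == Expr::Const || e->kind == Expr::Var) return e;
    e->lhs = Fold(std::move(e->lhs));
    e->rhs = Fold(std::move(e->rhs));
    bool lc = e->lhs->kind == Expr::Const;
    bool rc = e->rhs->kind == Expr::Const;
    if (lc && rc) {   // constant folding
        float l = e->lhs->value, r = e->rhs->value;
        return MakeConst(e->kind == Expr::Add ? l + r : l * r);
    }
    float identity = (e->kind == Expr::Mul) ? 1.0f : 0.0f;   // x*1 == x, x+0 == x
    if (rc && e->rhs->value == identity) return std::move(e->lhs);
    if (lc && e->lhs->value == identity) return std::move(e->rhs);
    return e;
}
```

For example, `x * (0.5 + 0.5)` first folds to `x * 1.0` and then simplifies to just `x`. A real compiler runs many such passes repeatedly until nothing changes any more.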

Results

The above simple particle shader example. GLSL optimizer optimizes it into:

void main() {
    gl_FragData[0] =
        (gl_Color.xyzw * texture2D (_MainTex, gl_TexCoord[0].xy)).xyzw;
}

Save for redundant swizzle outputs (on my todo list), this is pretty much what you’d be writing by hand. No redundant moves, function call inlined, no extra temporaries, sweet!

How much difference does this make?

Lots of particles, non-optimized GLSL on the left; optimized GLSL on the right. Yep, it’s 236 vs. 36 milliseconds/frame (4 vs. 27 FPS).

This result is for iPhone 3Gs running iOS 4.1. Some Android results: Motorola Droid (some PowerVR GPU): 537 vs. 223 ms; Nexus One (Snapdragon 8250 w/ Adreno GPU): 155 vs. 155 ms (yay! good drivers!); Samsung Galaxy S (some PowerVR GPU): 200 vs. 60 ms. All tests were run at native device resolutions, so do not take this as performance comparisons between devices.

What about a more complex shader example? Let’s try per-pixel lit Diffuse shader (which is quite simple, but will do ok as “complex shader” example for a mobile platform). You can see that the GLSL code below is mostly auto-generated; writing it by hand wouldn’t produce that many data moves, unused struct members etc. Cg compiles original shader code into 10 ALU and 1 TEX instructions for D3D9 pixel shader 2.0, and is able to optimize away all the redundant stuff.

struct SurfaceOutput {
    vec3 Albedo;
    vec3 Normal;
    vec3 Emission;
    float Specular;
    float Gloss;
    float Alpha;
};
struct Input {
    vec2 uv_MainTex;
};
struct v2f_surf {
    vec4 pos;
    vec2 hip_pack0;
    vec3 normal;
    vec3 vlight;
};
uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void surf (in Input IN, inout SurfaceOutput o) {
    vec4 c;
    c = texture2D (_MainTex, IN.uv_MainTex) * _Color;
    o.Albedo = c.xyz;
    o.Alpha = c.w;
}
vec4 LightingLambert (in SurfaceOutput s, in vec3 lightDir, in float atten) {
    float diff;
    vec4 c;
    diff = max (0.0, dot (s.Normal, lightDir));
    c.xyz  = (s.Albedo * _LightColor0.xyz) * (diff * atten * 2.0);
    c.w  = s.Alpha;
    return c;
}
vec4 frag_surf (in v2f_surf IN) {
    Input surfIN;
    SurfaceOutput o;
    float atten = 1.0;
    vec4 c;
    surfIN.uv_MainTex = IN.hip_pack0.xy;
    o.Albedo = vec3 (0.0);
    o.Emission = vec3 (0.0);
    o.Specular = 0.0;
    o.Alpha = 0.0;
    o.Gloss = 0.0;
    o.Normal = IN.normal;
    surf (surfIN, o);
    c = LightingLambert (o, _WorldSpaceLightPos0.xyz, atten);
    c.xyz += (o.Albedo * IN.vlight);
    c.w = o.Alpha;
    return c;
}
void main() {
    vec4 xl_retval;
    v2f_surf xlt_IN;
    xlt_IN.hip_pack0 = vec2 (gl_TexCoord[0]);
    xlt_IN.normal = vec3 (gl_TexCoord[1]);
    xlt_IN.vlight = vec3 (gl_TexCoord[2]);
    xl_retval = frag_surf (xlt_IN);
    gl_FragData[0] = xl_retval;
}

Running the above through GLSL optimizer produces this:

uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void main ()
{
    vec4 c;
    vec4 tmpvar_32;
    tmpvar_32 = texture2D (_MainTex, gl_TexCoord[0].xy) * _Color;
    vec3 tmpvar_33;
    tmpvar_33 = tmpvar_32.xyz;
    float tmpvar_34;
    tmpvar_34 = tmpvar_32.w;
    vec4 c_i0_i1;
    c_i0_i1.xyz = ((tmpvar_33 * _LightColor0.xyz) *
    	(max (0.0, dot (gl_TexCoord[1].xyz, _WorldSpaceLightPos0.xyz)) * 2.0)).xyz;
    c_i0_i1.w = (vec4(tmpvar_34)).w;
    c = c_i0_i1;
    c.xyz = (c_i0_i1.xyz + (tmpvar_33 * gl_TexCoord[2].xyz)).xyz;
    c.w = (vec4(tmpvar_34)).w;
    gl_FragData[0] = c.xyzw;
}

All functions got inlined, all unused variable assignments got eliminated, and most of the redundant moves are gone. There are some redundant moves left though (again, on my todo list), and the variables are assigned cryptic names after inlining. But otherwise, writing the equivalent shader by hand would be pretty close.

Difference between non-optimized and optimized GLSL in this case:

Non-optimized vs. optimized: 350 vs. 267 ms/frame (2.9 vs. 3.7 FPS). Not bad either!

Closing thoughts

Pulling off this GLSL optimizer quite late in Unity 3.0 release cycle was a risky move, but it did work.

Hats off to Mesa folks (Eric Anholt, Ian Romanick, Kenneth Graunke et al) for making an awesome codebase of the GLSL compiler! I haven’t merged up latest GLSL compiler developments on Mesa tree; they’ve implemented quite a few new compiler optimizations but I was too busy shipping Unity 3 already. Will try to merge them in soon-ish.

I’ve tested non-optimized vs. optimized GLSL a bit on a desktop platform (MacBook Pro, GeForce 8600M, OS X 10.6.4) and there is no observable speed difference. Which makes sense, and I would have expected mobile drivers to be good at optimization as well, but apparently that’s not the case.

Now of course, mobile drivers will improve over time, and I hope offline “GLSL optimization” step will become obsolete in the future. I still think it makes perfect sense to fully compile shaders offline, so at runtime there’s no trace of GLSL at all (just load binary blob of GPU microcode into the driver), but that’s a story for another day.

In the meantime, you’re welcome to try GLSL Optimizer out!

Surface Shaders, one year later

Posted on Jul 16, 2010

Over a year ago I had a thought that “Shaders must die” (part 1, part 2, part 3).

And what do you know - turns out we’re trying to pull this off in upcoming Unity 3. We call this Surface Shaders because I’ve a suspicion “shaders must die” as a feature name wouldn’t have flown very far.

Idea

The main idea is that 90% of the time I just want to declare surface properties. This is what I want to say:

Hey, albedo comes from this texture mixed with this texture, and normal comes from this normal map. Use Blinn-Phong lighting model please, and don’t bother me again!

With the above, I don’t have to care whether this will be used in a forward or deferred rendering, or how various light types will be handled, or how many lights per pass will be done in a forward renderer, or how some indirect illumination SH probes will come in, etc. I’m not interested in all that! These dirty bits are job of rendering programmers, just make it work dammit!

This is not a new idea. Most graphical shader editors that make sense do not have “pixel color” as the final output node; instead they have some node that basically describes surface parameters (diffuse, specularity, normal, …), and all the lighting code is usually not expressed in the shader graph itself. OpenShadingLanguage is a similar idea as well (but because it’s targeted at offline rendering for movies, it’s much richer & more complex).

Example

Here’s a simple - but full & complete - Unity 3.0 shader that does diffuse lighting with a texture & a normal map.

Shader "Example/Diffuse Bump" {
  Properties {
    _MainTex ("Texture", 2D) = "white" {}
    _BumpMap ("Bumpmap", 2D) = "bump" {}
  }
  SubShader {
    Tags { "RenderType" = "Opaque" }
    CGPROGRAM
    #pragma surface surf Lambert
    struct Input {
      float2 uv_MainTex;
      float2 uv_BumpMap;
    };
    sampler2D _MainTex;
    sampler2D _BumpMap;
    void surf (Input IN, inout SurfaceOutput o) {
      o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
      o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_BumpMap));
    }
    ENDCG
  } 
  Fallback "Diffuse"
}

Given pretty model & textures, it can produce pretty pictures! How cool is that?

I grayed out bits that are not really interesting (declaration of serialized shader properties & their UI names, shader fallback for older machines etc.). What’s left is Cg/HLSL code, which is then augmented by tons of auto-generated code that deals with lighting & whatnot.

This surface shader dissected into pieces:

#pragma surface surf Lambert: this is a surface shader with main function “surf”, and a Lambert lighting model. Lambert is one of predefined lighting models, but you can write your own.
struct Input: input data for the surface shader. This can have various predefined inputs that will be computed per-vertex & passed into your surface function per-pixel. In this case, it’s two texture coordinates.
surf function: actual surface shader code. It takes Input, and writes into SurfaceOutput (a predefined structure). It is possible to write into custom structures, provided you use lighting models that operate on those structures. The actual code just writes Albedo and Normal to the output.

What is generated

Unity’s “surface shader code generator” would take this, generate actual vertex & pixel shaders, and compile them to various target platforms. With default settings in Unity 3.0, it would make this shader support:

Forward renderer and Deferred Lighting (Light Pre-Pass) renderer.
Objects with precomputed lightmaps and without.
Directional, Point and Spot lights; with projected light cookies or without; with shadowmaps or without. Well ok, this is only for forward renderer because in Light Pre-Pass lighting happens elsewhere.
For Forward renderer, it would compile in support for lights computed per-vertex and spherical harmonics lights computed per-object. It would also generate extra additive blended pass if needed for the case when additional per-pixel lights have to be rendered in separate passes.
For Light Pre-Pass renderer, it would generate base pass that outputs normals & specular power; and a final pass that combines albedo with lighting, adds in any lightmaps or emissive lighting etc.
It can optionally generate a shadow caster rendering pass (needed if custom vertex position modifiers are used for vertex shader based animation; or some complex alpha-test effects are done).

For example, here’s code that would be compiled for a forward-rendered base pass with one directional light, 4 per-vertex point lights, 3rd order SH lights; optional lightmaps (I suggest just scrolling down):

#pragma vertex vert_surf
#pragma fragment frag_surf
#pragma fragmentoption ARB_fog_exp2
#pragma fragmentoption ARB_precision_hint_fastest
#pragma multi_compile_fwdbase
#include "HLSLSupport.cginc"
#include "UnityCG.cginc"
#include "Lighting.cginc"
#include "AutoLight.cginc"
struct Input {
	float2 uv_MainTex : TEXCOORD0;
};
sampler2D _MainTex;
sampler2D _BumpMap;
void surf (Input IN, inout SurfaceOutput o)
{
	o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
	o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_MainTex));
}
struct v2f_surf {
  V2F_POS_FOG;
  float2 hip_pack0 : TEXCOORD0;
  #ifndef LIGHTMAP_OFF
  float2 hip_lmap : TEXCOORD1;
  #else
  float3 lightDir : TEXCOORD1;
  float3 vlight : TEXCOORD2;
  #endif
  LIGHTING_COORDS(3,4)
};
#ifndef LIGHTMAP_OFF
float4 unity_LightmapST;
#endif
float4 _MainTex_ST;
v2f_surf vert_surf (appdata_full v) {
  v2f_surf o;
  PositionFog( v.vertex, o.pos, o.fog );
  o.hip_pack0.xy = TRANSFORM_TEX(v.texcoord, _MainTex);
  #ifndef LIGHTMAP_OFF
  o.hip_lmap.xy = v.texcoord1.xy * unity_LightmapST.xy + unity_LightmapST.zw;
  #endif
  float3 worldN = mul((float3x3)_Object2World, SCALED_NORMAL);
  TANGENT_SPACE_ROTATION;
  #ifdef LIGHTMAP_OFF
  o.lightDir = mul (rotation, ObjSpaceLightDir(v.vertex));
  #endif
  #ifdef LIGHTMAP_OFF
  float3 shlight = ShadeSH9 (float4(worldN,1.0));
  o.vlight = shlight;
  #ifdef VERTEXLIGHT_ON
  float3 worldPos = mul(_Object2World, v.vertex).xyz;
  o.vlight += Shade4PointLights (
    unity_4LightPosX0, unity_4LightPosY0, unity_4LightPosZ0,
    unity_LightColor0, unity_LightColor1, unity_LightColor2, unity_LightColor3,
    unity_4LightAtten0, worldPos, worldN );
  #endif // VERTEXLIGHT_ON
  #endif // LIGHTMAP_OFF
  TRANSFER_VERTEX_TO_FRAGMENT(o);
  return o;
}
#ifndef LIGHTMAP_OFF
sampler2D unity_Lightmap;
#endif
half4 frag_surf (v2f_surf IN) : COLOR {
  Input surfIN;
  surfIN.uv_MainTex = IN.hip_pack0.xy;
  SurfaceOutput o;
  o.Albedo = 0.0;
  o.Emission = 0.0;
  o.Specular = 0.0;
  o.Alpha = 0.0;
  o.Gloss = 0.0;
  surf (surfIN, o);
  half atten = LIGHT_ATTENUATION(IN);
  half4 c;
  #ifdef LIGHTMAP_OFF
  c = LightingLambert (o, IN.lightDir, atten);
  c.rgb += o.Albedo * IN.vlight;
  #else // LIGHTMAP_OFF
  half3 lmFull = DecodeLightmap (tex2D(unity_Lightmap, IN.hip_lmap.xy));
  #ifdef SHADOWS_SCREEN
  c.rgb = o.Albedo * min(lmFull, atten*2);
  #else
  c.rgb = o.Albedo * lmFull;
  #endif
  c.a = o.Alpha;
  #endif // LIGHTMAP_OFF
  return c;
}

Of those 90 lines of code, 10 are your original surface shader code; the remaining 80 would have to be pretty much written by hand in Unity 2.x days (well ok, less code would have to be written because 2.x had less rendering features). But wait, that was only base pass of the forward renderer! It also generates code for additive pass, for deferred base pass, deferred final pass, optionally for shadow caster pass and so on.

So this should be an easier way to write lit shaders (it is for me at least). I hope this will also increase the number of Unity users who can write shaders at least 3 times (i.e. to 30 up from 10!). It should also be more future proof, able to accommodate the changes to the lighting pipeline we’ll do in Unity next.

Predefined Input values

The Input structure can contain texture coordinates and some predefined values, for example view direction, world space position, world space reflection vector and so on. Code to compute them is only generated if they are actually used. For example, if you use world space reflection to do some cubemap reflections (as emissive term) in your surface shader, then in Light Pre-Pass base pass the reflection vector will not be computed (since it does not output emission, so by extension does not need reflection vector).

As a small example, the shader above extended to do simple rim lighting:

#pragma surface surf Lambert
struct Input {
    float2 uv_MainTex;
    float2 uv_BumpMap;
    float3 viewDir;
};
sampler2D _MainTex;
sampler2D _BumpMap;
float4 _RimColor;
float _RimPower;
void surf (Input IN, inout SurfaceOutput o) {
    o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
    o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_BumpMap));
    half rim =
        1.0 - saturate(dot (normalize(IN.viewDir), o.Normal));
    o.Emission = _RimColor.rgb * pow (rim, _RimPower);
}
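To see what the rim term evaluates to, here is the same math ported to plain C++. The `Vec3` type and the helper functions are stand-ins for the Cg built-ins; the formula is the one from the shader above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Vec3 Normalize(Vec3 v) {
    float len = std::sqrt(Dot(v, v));
    return Vec3{v.x / len, v.y / len, v.z / len};
}

float Saturate(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// rim = 1 - saturate(dot(normalize(viewDir), normal)); result = pow(rim, rimPower).
// Multiply the result by the rim color to get the emission.
float RimFactor(Vec3 viewDir, Vec3 normal, float rimPower) {
    float rim = 1.0f - Saturate(Dot(Normalize(viewDir), normal));
    return std::pow(rim, rimPower);
}
```

Looking straight down the normal gives 0 (no rim), a grazing view gives 1 (full rim), and a higher `_RimPower` makes the bright edge thinner.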

Vertex shader modifiers

It is possible to specify custom “vertex modifier” function that will be called at start of the generated vertex shader, to modify (or generate) per-vertex data. You know, vertex shader based tree wind animation, grass billboard extrusion and so on. It can also fill in any non-predefined values in the Input structure.

My favorite vertex modifier? Moving vertices along their normals.
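
What that modifier does per vertex is tiny. Here is a CPU-side C++ sketch of the idea; this is not surface shader syntax, and the `Vertex` layout, the `extrudeAlongNormals` name and the `amount` parameter are made up for the example. Normals are assumed to be unit length.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical per-vertex data: just a position and a normal.
struct Vertex {
    Vec3 position;
    Vec3 normal;
};

// Pushes every vertex outwards along its normal by `amount`.
// A positive amount "inflates" the mesh, a negative one shrinks it.
void extrudeAlongNormals(std::vector<Vertex>& vertices, float amount) {
    for (Vertex& v : vertices) {
        v.position.x += v.normal.x * amount;
        v.position.y += v.normal.y * amount;
        v.position.z += v.normal.z * amount;
    }
}
```

In a real vertex modifier the loop disappears, since the function body runs once per vertex on the GPU.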

Custom Lighting Models

There are a couple of simple lighting models built in, but it’s possible to specify your own. A lighting model is nothing more than a function that will be called with the filled SurfaceOutput structure and per-light parameters (direction, attenuation and so on). Different functions would have to be called in forward and light pre-pass rendering cases; and naturally the light pre-pass one has much less flexibility. So for any fancy effects, it is possible to say “do not compile this shader for light pre-pass”, in which case it will be rendered via forward rendering.

Example of wrapped-Lambert lighting model:

#pragma surface surf WrapLambert
half4 LightingWrapLambert (SurfaceOutput s, half3 dir, half atten) {
    dir = normalize(dir);
    half NdotL = dot (s.Normal, dir);
    half diff = NdotL * 0.5 + 0.5;
    half4 c;
    c.rgb = s.Albedo * _LightColor0.rgb * (diff * atten * 2);
    c.a = s.Alpha;
    return c;
}
struct Input {
    float2 uv_MainTex;
};
sampler2D _MainTex;
void surf (Input IN, inout SurfaceOutput o) {
    o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
}
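
To see what the “wrap” buys you, here is a small C++ sketch comparing the plain Lambert term with the wrapped one from `LightingWrapLambert` above. Only the scalar diffuse factor is modelled; light color, albedo and attenuation are left out, and the function names are made up for the example.

```cpp
#include <algorithm>

// Plain Lambert: anything facing away from the light gets no diffuse light.
float lambertDiffuse(float nDotL) {
    return std::max(0.0f, nDotL);
}

// Wrapped Lambert: remaps N.L from [-1, 1] to [0, 1], so lighting
// "wraps around" the object and only the point facing directly away
// from the light ends up fully dark.
float wrapLambertDiffuse(float nDotL) {
    return nDotL * 0.5f + 0.5f;
}
```

At the terminator (N.L = 0) plain Lambert is already at 0 while the wrapped version is still at 0.5; both agree at N.L = 1.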

Behind the scenes

I’m using the HLSL parser from Ryan Gordon’s mojoshader to parse the original surface shader code and infer some things from the AST mojoshader produces. This way I can figure out what members are in what structures, go over function prototypes and so on. At this stage some error checking is done to tell the user that their surface function has the wrong prototype, or that their structures are missing required members, which is much better than failing with dozens of compile errors in the generated code later.

To figure out which surface shader inputs are actually used in the various lighting passes, I generate small dummy pixel shaders, compile them with Cg and use Cg’s API to query used inputs & outputs. This way I can figure out, for example, that neither a normal map nor its texture coordinate is actually used in Light Pre-Pass’ final pass, and save some vertex shader instructions & a texcoord interpolator.

The code that is ultimately generated is compiled with various shader compilers depending on the target platform (Cg for PC/Mac, XDK HLSL for Xbox 360, PS3 Cg for PS3, and my own fork of HLSL2GLSL for iPhone, Android and the upcoming NativeClient port of Unity).

So yeah, that’s it. We’ll see where this goes next, or what happens when Unity 3 is released.

Compiling HLSL into GLSL in 2010

Posted on May 21, 2010

Realtime shader languages these days have settled down into two camps: HLSL (or Cg, which for all practical purposes is the same) and GLSL (or GLSL ES, which is sufficiently similar). HLSL/Cg is used by Direct3D and the big consoles (Xbox 360, PS3). GLSL/ES is used by OpenGL and pretty much all modern mobile platforms (iPhone, Android, …).

Since shaders are more or less “assets”, having two different languages to deal with is not very nice. What, I’m supposed to write my shader twice just to support both (for example) D3D and iPad? You would think that in 2010, almost a decade after high-level realtime shader languages appeared, this problem would be solved… but it isn’t!

In the upcoming Unity 3.0, we’re going to have OpenGL ES 2.0 for mobile platforms, where GLSL ES is the only option to write shaders in. However, almost all other platforms (Windows, 360, PS3) need HLSL/Cg.

I tried for a bit to make Cg spit out GLSL code. In theory it can, and I read somewhere that id uses it for the OpenGL backend of Rage… But I just couldn’t make it work. What’s possible for John apparently is not possible for mere mortals.

Then I looked at ATI’s HLSL2GLSL. That did produce GLSL shaders that were not absolutely horrible. So I started using it, and (surprise!) quickly ran into small issues here and there. Too bad development of the library stopped around 2006… on the plus side, it’s open source!

So I just forked it. Here it is: http://code.google.com/p/hlsl2glslfork/ (commit log here). There are no prebuilt binaries or source drops right now, just a Mercurial repository. BSD license. Patches welcome.

Note on the codebase: I don’t particularly like the codebase. It seems to be somewhat over-engineered code that was probably taken from the reference GLSL parser that 3DLabs once did, and adapted to parse HLSL and spit out GLSL. There are pieces of code that are unused, unfinished or duplicated. Judging from comments, some pieces of code have been in the hands of 3DLabs, ATI and NVIDIA (what good can come out of that?!). However, it works, and that’s the most important trait any code can have.

Note on the preprocessor: I bumped into some preprocessor issues that couldn’t be easily fixed without first understanding someone else’s ancient code and then changing it significantly. Fortunately, Ryan Gordon’s project, MojoShader, happens to have a preprocessor that very closely emulates HLSL’s one (including various quirks). So I’m using that to preprocess any source before passing it down to HLSL2GLSL. Kudos to Ryan!

Side note on MojoShader: Ryan is also working on an HLSL->GLSL cross compiler in MojoShader. I like that codebase much more; will certainly try it out once it’s somewhat ready.

You can never have enough notes: Google’s ANGLE project (running OpenGL ES 2.0 on top of Direct3D runtime+drivers) seems to be working on the opposite tool. For obvious reasons, they need to take GLSL ES shaders and produce D3D compatible shaders (HLSL or shader assembly/bytecode). The project seems to be moving fast; and if one day we decide to default to GLSL as the shader language in Unity, I’ll know where to look for a translator into HLSL :)