Compact Normal Storage for small G-Buffers (Old and Wrong)

Stop! It's an error!

This version of my article has some stupidity: the encoding shaders do not normalize the incoming per-vertex normal, which makes the quality evaluation results somewhat wrong. Also, if the normal is assumed to be normalized, then three methods in this article (Sphere Map, Cry Engine 3 and Lambert Azimuthal) are in fact completely equivalent. You'd better just read the new & improved version of this article, trust me!
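The equivalence claim can be checked numerically. The following is a minimal CPU-side sketch in Python (not the article's HLSL) that transcribes the three encoders as they appear later in this article; the helper names `normalize` and `max_spread` are made up for illustration.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def encode_spheremap(n):
    # Method #4: p = sqrt(dot(n,n) + 2*z + 1)
    x, y, z = n
    p = math.sqrt((x * x + y * y + z * z) + (2.0 * z + 1.0))
    return (x / p * 0.5 + 0.5, y / p * 0.5 + 0.5)

def encode_cry3(n):
    # Method #5: normalize(n.xy) * sqrt(-z*0.5+0.5), then remap to 0..1
    x, y, z = n
    length_xy = math.hypot(x, y)
    s = math.sqrt(-z * 0.5 + 0.5)
    return (x / length_xy * s * 0.5 + 0.5, y / length_xy * s * 0.5 + 0.5)

def encode_lambert(n):
    # Method #6: n.xy / sqrt(8*z+8) + 0.5
    x, y, z = n
    f = math.sqrt(8.0 * z + 8.0)
    return (x / f + 0.5, y / f + 0.5)

def max_spread(n):
    """Largest per-component difference between the three encodings of n."""
    encs = [encode_spheremap(n), encode_cry3(n), encode_lambert(n)]
    return max(max(e[i] for e in encs) - min(e[i] for e in encs)
               for i in range(2))

unit_normal = normalize((1.0, 2.0, 3.0))
raw_normal = (0.3, 0.2, 0.5)   # deliberately not unit length

print("spread, unit length:", max_spread(unit_normal))
print("spread, unnormalized:", max_spread(raw_normal))
```

For a unit-length input the three results agree up to floating point rounding; for an unnormalized input they drift apart, which is the situation the measurements below were taken in.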

The old and wrong version follows, in case you really want to see it...

Intro
Baseline: store X&Y&Z
Method 1: X&Y
Method 2: X&Y&sign of Z
Method 3: Spherical Coordinates
Method 3a: Spherical Coordinates w/ texture LUT
Method 4: Spheremap Transform
Method 5: Cry Engine 3
Method 6: Lambert Azimuthal Equal-Area projection
Method 7: Stereographic projection
Performance Comparison
Quality Comparison
Changelog
TODO

Intro

Various deferred rendering or deferred lighting approaches need to store normals as part of their g-buffer. Let's figure out a compact storage method for view space normals. In my case, the main target is a minimalist g-buffer, where depth and normals are packed into a single 32 bit (8 bits/channel) render texture. I try to minimize both the error and the shader cycles needed to encode/decode.

Now of course, 8 bits/channel storage for normals is usually not enough for deferred rendering/shading if you want specular (low precision & quantization lead to specular "wobble" when the camera or objects move). However, everything below should Just Work (tm) for 10 or 16 bits/channel integer formats. For 16 bits/channel half-float formats, some of the computations are not necessary (e.g. bringing normal values into 0..1 range).

If you know other ways to store/encode normals, please let me know in the comments!

Here's a small test scene. Note that the outer walls have view space normals that point away from the camera. The same scenario happens on the edges of the spheres.

Various normal encoding methods and their comparison below. Notes:

Error images are: 1-pow(dot(n1,n2),32), abs(n1-n2) and abs(n1-n2)*10, where n1 is the actual normal, and n2 is the normal encoded into a texture, read back & decoded. MSE and PSNR are computed on the difference (abs(n1-n2)) image.
Shader code is HLSL. Compiled into ps_2_0 and ps_3_0 by d3dx9_40.dll (November 2008 SDK).
Radeon GPU performance numbers from AMD's GPU ShaderAnalyzer 1.51, using Catalyst 9.4 driver.
GeForce GPU performance numbers from NVIDIA's NVShaderPerf 2.0, using 174.74 driver.
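As a rough illustration of the error measures listed above, here is a small Python sketch. It assumes the power error uses an exponent of 32, that MSE is the mean of the squared per-component differences, and that PSNR uses a peak value of 1.0. The peak value is an assumption, but it matches the MSE/PSNR pairs quoted in this article (for example MSE 0.0312331 gives about 15.054 dB). Function names are made up for this sketch.

```python
import math

def power_error(n1, n2):
    """1 - pow(dot(n1, n2), 32): close to 0 when the normals agree."""
    d = sum(a * b for a, b in zip(n1, n2))
    return 1.0 - d ** 32

def mse(reference, decoded):
    """Mean squared error over all components of two lists of normals."""
    total = 0.0
    count = 0
    for n1, n2 in zip(reference, decoded):
        for a, b in zip(n1, n2):
            total += (a - b) ** 2
            count += 1
    return total / count

def psnr(mse_value, peak=1.0):
    """Peak signal-to-noise ratio in dB (peak of 1.0 is an assumption)."""
    return 10.0 * math.log10(peak * peak / mse_value)

print(round(psnr(0.0312331), 3))
```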

Baseline: store X&Y&Z

Just to set the basis, store all three components of the normal. It's not suitable for our quest, but I include it here to evaluate "base" encoding error (which here happens only because of quantization to 8 bits per component).

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000054; PSNR: 52.661 dB.
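To make the source of the baseline error concrete, the sketch below simulates storing each component in an 8 bit channel on the CPU. The rounding mode (round to nearest) is an assumption; actual hardware conversion may differ slightly.

```python
import math

def quantize8(v):
    """Simulate writing a 0..1 value to an 8 bit channel and reading it back."""
    v = min(1.0, max(0.0, v))
    return math.floor(v * 255.0 + 0.5) / 255.0

def encode_xyz(n):
    # Bring -1..1 into 0..1, then quantize each component.
    return tuple(quantize8(c * 0.5 + 0.5) for c in n)

def decode_xyz(enc):
    return tuple(c * 2.0 - 1.0 for c in enc)

n = (0.6, 0.0, 0.8)
decoded = decode_xyz(encode_xyz(n))
print(max(abs(a - b) for a, b in zip(n, decoded)))
```

With round-to-nearest, each decoded component is off by at most 1/255, since the half-step error of 0.5/255 in the 0..1 range is doubled when mapping back to -1..1.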

Method #1: store X&Y, reconstruct Z

Used by Killzone 2 among others (PDF link).

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0312331; PSNR: 15.054 dB.

Pros:

Very simple to encode/decode

Cons:

Normal can point away from the camera. My test scene setup actually has that. See Resistance 2 Prelighting paper (PDF link) for explanation.

Encoding	Decoding
enc = n.xy * 0.5 + 0.5;	n.xy = enc*2-1; n.z = sqrt(1-dot(n.xy,n.xy));
ps_2_0 def c0, 0.5, 0, 0, 1 dcl t0.xy mad_pp r0.xy, t0, c0.x, c0.x mov_pp r0.zw, c0 mov_pp oC0, r0	ps_2_0 def c0, 2, -1, 1, 0 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mad_pp r0.xy, r0, c0.x, c0.y dp2add r1.w, r0, -r0, c0.z rsq r1.x, r1.w rcp_pp r0.z, r1.x mov r0.w, c0.z mov_pp oC0, r0
3 ALU Radeon 9700: 1 GPR, 2 clk, 4.00 pix/clk Radeon X1600 and up: 1 GPR, 1 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 1 clk, 6400 mpix/s GeForce 7800GT: 1 GPR, 1 clk, 9600 mpix/s GeForce 8800GTX: 5 GPR, 7 clk, 14394 mpix/s	8 ALU, 1 TEX Radeon 9700: 1 GPR, 4.00 clk, 2.00 pix/clk Radeon X1900: 1 GPR, 2.67 clk, 6.00 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 4 clk, 1600 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 5 GPR, 20 clk, 8384 mpix/s
ps_3_0 def c0, 0.5, 0, 1, 0 dcl_texcoord v0.xy mad_pp oC0, v0.xyxx, c0.xxyy, c0.xxyz	ps_3_0 def c0, 2, -1, 1, 0 dcl_texcoord v0.xyz dcl_2d s0 texldp r0, v0.xyzz, s0 mad_pp r0.xy, r0, c0.x, c0.y dp2add r0.z, r0, -r0, c0.z mov_pp oC0.xy, r0 rsq r0.x, r0.z rcp_pp oC0.z, r0.x mov_pp oC0.w, c0.z
1 ALU Radeon 9700: -- Radeon X1600 and up: 1 GPR, 1 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 1 clk, 6400 mpix/s GeForce 7800GT: 1 GPR, 1 clk, 9600 mpix/s GeForce 8800GTX: 5 GPR, 7 clk, 14394 mpix/s	6 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 1 GPR, 2.67 clk, 6.00 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 4 clk, 1600 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 5 GPR, 15 clk, 11075 mpix/s
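The limitation of this method is easy to reproduce on the CPU. The Python sketch below mirrors the HLSL above (without quantization); the `max(0.0, ...)` guard against tiny negative values is an addition of this sketch, not part of the shader.

```python
import math

def encode_xy(n):
    return (n[0] * 0.5 + 0.5, n[1] * 0.5 + 0.5)

def decode_xy(enc):
    x = enc[0] * 2.0 - 1.0
    y = enc[1] * 2.0 - 1.0
    # Z is always reconstructed as non-negative, so its sign is lost.
    z = math.sqrt(max(0.0, 1.0 - (x * x + y * y)))
    return (x, y, z)

facing = (0.6, 0.0, 0.8)
away = (0.6, 0.0, -0.8)
print(decode_xy(encode_xy(facing)))
print(decode_xy(encode_xy(away)))
```

Both inputs decode to (approximately) the same normal, which is why normals that point away from the camera show up as large errors in the test scene.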

Method #2: store X&Y&sign of Z, reconstruct Z

Basically, method #1 with proper support for negative view space Z component. See gamedev.net forum thread for details.

Pros:

Simple to encode/decode

Cons:

Takes away one bit of storage from something else.

TODO!
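Until this section is filled in, here is one possible way the idea could look, as a Python sketch. This is not taken from the article or the linked thread: where the sign bit lives (here simply a separate boolean) is entirely an assumption.

```python
import math

def encode_xy_sign(n):
    x, y, z = n
    # The third element stands in for the one bit borrowed from another channel.
    return (x * 0.5 + 0.5, y * 0.5 + 0.5, z < 0.0)

def decode_xy_sign(enc):
    ex, ey, z_is_negative = enc
    x = ex * 2.0 - 1.0
    y = ey * 2.0 - 1.0
    z = math.sqrt(max(0.0, 1.0 - (x * x + y * y)))
    return (x, y, -z if z_is_negative else z)

away = (0.6, 0.0, -0.8)
print(decode_xy_sign(encode_xy_sign(away)))
```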

Method #3: Spherical Coordinates

It is possible to use spherical coordinates to encode the normal. Since we know it's unit length, we can just store the two angles.

Suggested by Pat Wilson of Garage Games: GG blog post. Other mentions: MJP's blog, GarageGames thread, Wolf Engel's blog, gamedev.net forum thread.

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000945; PSNR: 40.244 dB.

Pros:

Suitable for normals in general (not necessarily view space)

Cons:

Uses trig instructions (quite heavy on ALU). Possible to replace some of that with texture lookups though, see Method #3a.

Encoding	Decoding
/* kPI = 3.1415926536f */ enc = (float2(atan2(n.y,n.x)/kPI, n.z)+1.0)*0.5;	/* kPI = 3.1415926536f */ float2 ang = enc*2-1; float2 scth; sincos(ang.x * kPI, scth.x, scth.y); float2 scphi = float2(sqrt(1.0 - ang.y*ang.y), ang.y); n = float3(scth.y*scphi.x, scth.x*scphi.x, scphi.y);
ps_2_0 def c0, 0.0208350997, -0.0851330012, 0.180141002, -0.330299497 def c1, 0.999866009, 0, 1, 3.14159274 def c2, -2, 1.57079637, 0.318309873, 0.5 def c3, 0, 0, 0, 1 dcl t0.xyz abs r0.w, t0.y abs r0.x, t0.x max r1.w, r0.w, r0.x rcp r0.y, r1.w min r1.x, r0.x, r0.w add r0.x, -r0.w, r0.x cmp r0.x, r0.x, c1.y, c1.z mul r0.y, r0.y, r1.x mul r0.z, r0.y, r0.y mad r0.w, r0.z, c0.x, c0.y mad r0.w, r0.z, r0.w, c0.z mad r0.w, r0.z, r0.w, c0.w mad r0.z, r0.z, r0.w, c1.x mul r0.y, r0.y, r0.z mad r0.z, r0.y, c2.x, c2.y mad r0.x, r0.z, r0.x, r0.y cmp r0.y, t0.x, -c1.y, -c1.w add r0.x, r0.x, r0.y add r0.y, r0.x, r0.x min r0.z, t0.x, t0.y cmp r0.z, r0.z, c1.y, c1.z max r0.w, t0.y, t0.x cmp r0.w, r0.w, c1.z, c1.y mul r0.z, r0.z, r0.w mad r0.x, r0.z, -r0.y, r0.x mul r0.x, r0.x, c2.z mov r0.y, t0.z add r0.xy, r0, c1.z mul_pp r0.xy, r0, c2.w mov_pp r0.zw, c3 mov_pp oC0, r0	ps_2_0 def c0, 2, -1, 0.5, 1 def c1, 6.28318548, -3.14159274, 0, 0 def c2, -1.55009923e-006, -2.17013894e-005, 0.00260416674, 0.00026041668 def c3, -0.020833334, -0.125, 1, 0.5 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mad r0.xy, r0, c0.x, c0.y mad r0.x, r0.x, c0.z, c0.z frc r0.x, r0.x mad r0.x, r0.x, c1.x, c1.y sincos r1.xy, r0.x, c2, c3 mad r0.x, r0.y, -r0.y, c0.w mov_pp r2.z, r0.y rsq r0.x, r0.x rcp r0.x, r0.x mul_pp r2.xy, r1, r0.x mov_pp r2.w, c0.w mov_pp oC0, r2
31 ALU Radeon 9700: 3 GPR, 15 clk, 0.53 pix/clk Radeon X1900: 3 GPR, 5.67 clk, 2.82 pix/clk Radeon HD 2900,3870: 1 GPR, 4.00 clk, 4.00 pix/clk Radeon HD 4870: 1 GPR, 1.60 clk, 10.00 pix/clk GeForce 6800U: 2 GPR, 12 clk, 533 mpix/s GeForce 7800GT: 3 GPR, 9 clk, 1066 mpix/s GeForce 8800GTX: 9 GPR, 36 clk, 5760 mpix/s	14 ALU, 1 TEX Radeon 9700: 2 GPR, 12.00 clk, 0.67 pix/clk Radeon X1900: 2 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 2 GPR, 2.25 clk, 7.11 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 7 clk, 914 mpix/s GeForce 7800GT: 2 GPR, 5 clk, 1920 mpix/s GeForce 8800GTX: 6 GPR, 28 clk, 5903 mpix/s
ps_3_0 def c0, 0.0208350997, -0.0851330012, 0.180141002, -0.330299497 def c1, 0.999866009, 0, 1, 3.14159274 def c2, -2, 1.57079637, 0.318309873, 0.5 dcl_texcoord v0.xyz add r0.xy, -v0_abs, v0_abs.yxzw cmp r0.xz, r0.x, v0_abs.xyyw, v0_abs.yyxw cmp r0.y, r0.y, c1.y, c1.z rcp r0.z, r0.z mul r0.x, r0.x, r0.z mul r0.z, r0.x, r0.x mad r0.w, r0.z, c0.x, c0.y mad r0.w, r0.z, r0.w, c0.z mad r0.w, r0.z, r0.w, c0.w mad r0.z, r0.z, r0.w, c1.x mul r0.x, r0.x, r0.z mad r0.z, r0.x, c2.x, c2.y mad r0.x, r0.z, r0.y, r0.x cmp r0.y, v0.x, -c1.y, -c1.w add r0.x, r0.x, r0.y add r0.y, r0.x, r0.x add r0.z, -v0.x, v0.y cmp r0.zw, r0.z, v0.xyxy, v0.xyyx cmp r0.zw, r0, c1.xyyz, c1.xyzy mul r0.z, r0.w, r0.z mad r0.x, r0.z, -r0.y, r0.x mul r0.x, r0.x, c2.z mov r0.y, v0.z add r0.xy, r0, c1.z mul_pp oC0.xy, r0, c2.w mov_pp oC0.zw, c1.xyyz	ps_3_0 def c0, 2, -1, 0.5, 1 def c1, 6.28318548, -3.14159274, 1, 0 dcl_texcoord v0.xyz dcl_2d s0 texldp r0, v0.xyzz, s0 mad r0.xy, r0, c0.x, c0.y mad r0.x, r0.x, c0.z, c0.z frc r0.x, r0.x mad r0.x, r0.x, c1.x, c1.y sincos r1.xy, r0.x mad r0.x, r0.y, -r0.y, c0.w mad_pp oC0.zw, r0.y, c1, c1.xywz rsq r0.x, r0.x rcp r0.x, r0.x mul_pp oC0.xy, r1, r0.x
26 ALU Radeon 9700: -- Radeon X1900: 4 GPR, 6.00 clk, 2.67 pix/clk Radeon HD 2900,3870: 1 GPR, 4.25 clk, 3.76 pix/clk Radeon HD 4870: 1 GPR, 1.70 clk, 9.41 pix/clk GeForce 6800U: 3 GPR, 12 clk, 533 mpix/s GeForce 7800GT: 3 GPR, 10 clk, 960 mpix/s GeForce 8800GTX: 9 GPR, 43 clk, 5146 mpix/s	10 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 2 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 2 GPR, 2.25 clk, 7.11 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 7 clk, 914 mpix/s GeForce 7800GT: 2 GPR, 5 clk, 1920 mpix/s GeForce 8800GTX: 6 GPR, 23 clk, 7119 mpix/s
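For readers who want to experiment outside a shader, the spherical coordinates round trip can be sketched in Python as follows. It follows the HLSL above, where the first stored value is the angle around Z and the second is n.z itself; the `max(0.0, ...)` guard is an addition of this sketch.

```python
import math

def encode_spherical(n):
    x, y, z = n
    return ((math.atan2(y, x) / math.pi + 1.0) * 0.5, (z + 1.0) * 0.5)

def decode_spherical(enc):
    ang_x = enc[0] * 2.0 - 1.0      # angle / pi, in -1..1
    ang_y = enc[1] * 2.0 - 1.0      # n.z
    s = math.sin(ang_x * math.pi)
    c = math.cos(ang_x * math.pi)
    r = math.sqrt(max(0.0, 1.0 - ang_y * ang_y))   # length of n.xy
    return (c * r, s * r, ang_y)

n = (0.6, 0.0, -0.8)
print(decode_spherical(encode_spherical(n)))
```

Unlike Method #1, a normal pointing away from the camera survives the round trip, because n.z is stored directly.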

Method #3a: Spherical Coordinates w/ texture LUT

Method #3, ALU operations replaced with texture lookups.

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0002237; PSNR: 36.503 dB.

Pros:

Like Method #3, suitable for normals in any space.
Very cheap on ALU.

Cons:

One extra texture lookup for encoding & decoding.
Quality slightly worse than pure ALU method (#3).

Encoding	Decoding
float3 in01 = n*0.5+0.5; enc.x = tex2D(_Atan2Lookup, in01.xy).a; enc.y = in01.z;	half3 sclook = tex2D(_SinCosLookup, enc).rgb; n = sclook*2-1;
ps_2_0 def c0, 0.5, 0, 0, 1 dcl t0.xyz dcl_2d s0 mad r0.xy, t0, c0.x, c0.x texld_pp r0, r0, s0 mov_pp r0.x, r0.w mad_pp r0.y, t0.z, c0.x, c0.x mov_pp r0.zw, c0 mov_pp oC0, r0	ps_2_0 def c0, 2, -1, 1, 0 dcl t0.xyz dcl_2d s0 dcl_2d s1 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 texld_pp r0, r0, s1 mad_pp r0.xyz, r0, c0.x, c0.y mov_pp r0.w, c0.z mov_pp oC0, r0
5 ALU, 1 TEX Radeon 9700: 2 GPR, 4.00 clk, 2.00 pix/clk Radeon X1900: 1 GPR, 3.33 clk, 4.80 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 3.00 clk, 2133 mpix/s GeForce 7800GT: 1 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 7 GPR, 10.00 clk, 14241 mpix/s	5 ALU, 2 TEX Radeon 9700: 1 GPR, 2.00 clk, 4.00 pix/clk Radeon X1900: 1 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 2900,3870: 2 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 2.00 clk, 3200 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8689 mpix/s
ps_3_0 def c0, 0.5, 1, 0, 0 dcl_texcoord v0.xyz dcl_2d s0 mad r0.xy, v0, c0.x, c0.x texld_pp r0, r0, s0 mad_pp oC0.xzw, r0.w, c0.yyzz, c0.zyzy mad_pp oC0.y, v0.z, c0.x, c0.x	ps_3_0 def c0, 2, -1, 1, 0 dcl_texcoord v0.xyz dcl_2d s0 dcl_2d s1 texldp r0, v0.xyzz, s0 texld_pp r0, r0, s1 mad_pp oC0.xyz, r0, c0.x, c0.y mov_pp oC0.w, c0.z
3 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 1 GPR, 2.33 clk, 6.86 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 3.00 clk, 2133 mpix/s GeForce 7800GT: 2 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 7 GPR, 10.00 clk, 14241 mpix/s	2 ALU, 2 TEX Radeon 9700: -- Radeon X1900: 1 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 2900,3870: 2 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 2.00 clk, 3200 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8689 mpix/s
/* UnityScript code to create _Atan2Lookup */ var size = 256; tex = new Texture2D(size,size, TextureFormat.Alpha8,false); var pix = new Color[size*size]; var idx = 0; var sizeF : float = size; for (var y = 0; y < size; ++y) { for (var x = 0; x < size; ++x) { var xval = x/sizeF * 2.0 - 1.0; var yval = y/sizeF * 2.0 - 1.0; var atanRes = (Mathf.Atan2(yval, xval) + Mathf.PI) / (Mathf.PI*2); pix[idx] = new Color(0,0,0,atanRes+0.5/255.0); ++idx; } } tex.SetPixels (pix,0); tex.Apply(); tex.wrapMode = TextureWrapMode.Clamp; tex.filterMode = FilterMode.Point; Shader.SetGlobalTexture ("_Atan2Lookup", tex);	/* UnityScript code to create _SinCosLookup */ var tex = new Texture2D(size,size, TextureFormat.ARGB32,false); var pix = new Color[size*size]; var idx = 0; var sizeF : float = size; for (y = 0; y < size; ++y) { var angY : float = y/sizeF * 2.0 - 1.0; var scphi = Mathf.Sqrt(1.0-angY*angY); for (x = 0; x < size; ++x) { var ang : float = x/sizeF * 2.0 - 1.0; ang *= Mathf.PI; var vs = Mathf.Sin(ang); var vc = Mathf.Cos(ang); pix[idx] = new Color( vc*scphi*0.5+0.5+0.5/255.0, vs*scphi*0.5+0.5+0.5/255.0, angY*0.5+0.5+0.5/255.0, 1 ); ++idx; } } tex.SetPixels (pix,0); tex.Apply(); tex.wrapMode = TextureWrapMode.Clamp; tex.filterMode = FilterMode.Point; Shader.SetGlobalTexture ("_SinCosLookup", tex);
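The encoding half of the lookup approach can be imitated on the CPU to see where the extra error comes from. This Python sketch builds an atan2 table the same way as the UnityScript above and samples it with point filtering. The 8 bit rounding and the texel addressing (`int(u * size)`, clamped) are assumptions about how the texture behaves, and the function names are made up.

```python
import math

SIZE = 256  # LUT resolution, matching the script above

def build_atan2_lut(size=SIZE):
    """size x size table of atan2(y, x), remapped to 0..1 and stored as 0..255."""
    lut = []
    for y in range(size):
        yval = y / size * 2.0 - 1.0
        row = []
        for x in range(size):
            xval = x / size * 2.0 - 1.0
            a = (math.atan2(yval, xval) + math.pi) / (2.0 * math.pi)
            row.append(min(255, int(a * 255.0 + 0.5)))
        lut.append(row)
    return lut

def sample_point(lut, u, v):
    """Point-sampled, clamped lookup at texture coordinates (u, v) in 0..1."""
    size = len(lut)
    x = min(size - 1, max(0, int(u * size)))
    y = min(size - 1, max(0, int(v * size)))
    return lut[y][x] / 255.0

def encode_spherical_lut(lut, n):
    u, v, w = (c * 0.5 + 0.5 for c in n)
    return (sample_point(lut, u, v), w)

lut = build_atan2_lut()
n = (0.6, 0.64, 0.48)
direct = (math.atan2(n[1], n[0]) / math.pi + 1.0) * 0.5
print(encode_spherical_lut(lut, n)[0], direct)
```

The table value differs slightly from the directly computed angle because both the texel position and the stored value are quantized, which fits the slightly worse quality reported for this variant.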

Method #4: Spheremap Transform

Spherical environment mapping (indirectly) maps reflection vector to a texture coordinate in [0..1] range. The reflection vector can point away from the camera, just like our view space normals. Bingo! See Siggraph 99 notes for sphere map math. Normal we want to encode is R, resulting values are (s,t).

I'm not aware of any uses of this method. If someone has used it to store normals, please let me know (mail or comment on my blog post).

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000333; PSNR: 44.781 dB.

Pros:

Quality pretty good!

Cons:

Encoding	Decoding
float f = n.z*2+1; float g = dot(n,n); float p = sqrt(g+f); enc = n/p * 0.5 + 0.5;	float2 tmp = -enc*enc+enc; float f = tmp.x+tmp.y; float m = sqrt(4*f-1); n.xy = (enc*4-2) * m; n.z = 8*f-3; /* optimized */ n.xy = -enc*enc+enc; n.z = -1; float f = dot(n, float3(1,1,0.25)); float m = sqrt(f); n.xy = (enc*8-4) * m; n.z += 8*f;
ps_2_0 def c0, 2, 1, 0.5, 0 def c1, 0, 1, 0, 0 dcl t0.xyz mad r0.w, t0.z, c0.x, c0.y dp3 r0.x, t0, t0 add r0.x, r0.w, r0.x rsq r0.x, r0.x mul r0.xy, r0.x, t0 mad_pp r0.xy, r0, c0.z, c0.z mov_pp r0.z, c1.x mov_pp r0.w, c1.y mov_pp oC0, r0	ps_2_0 def c0, -1, 0.25, 1, 1 def c1, 8, -4, -1, 0 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mad r1.xy, -r0, r0, r0 mad r0.xy, r0, c1.x, c1.y mov r1.z, c0.x dp3 r0.z, r1, c0.wzyx rsq r0.w, r0.z mad_pp r1.z, r0.z, c1.x, c1.z rcp r0.z, r0.w mul_pp r1.xy, r0, r0.z mov r1.w, c0.w mov_pp oC0, r1
9 ALU Radeon 9700: 2 GPR, 4.00 clk, 2.00 pix/clk Radeon X1900: 3 GPR, 2.33 clk, 6.86 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 6 clk, 1066 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 13 clk, 11712 mpix/s	12 ALU, 1 TEX Radeon 9700: 2 GPR, 6.00 clk, 1.33 pix/clk Radeon X1900: 2 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 3 GPR, 1.75 clk, 9.14 pix/clk Radeon HD 4870: 3 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 6 clk, 1066 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 7 GPR, 20 clk, 8320 mpix/s
ps_3_0 def c0, 2, 1, 0.5, 0 dcl_texcoord v0.xyz mad r0.x, v0.z, c0.x, c0.y dp3 r0.y, v0, v0 add r0.x, r0.x, r0.y rsq r0.x, r0.x mul r0.xy, r0.x, v0 mad_pp oC0.xy, r0, c0.z, c0.z mov_pp oC0.zw, c0.xywy	ps_3_0 def c0, -1, 1, 0.25, 8 def c1, 8, -4, 0, 0 dcl_texcoord v0.xyz dcl_2d s0 mov r0.z, c0.x texldp r1, v0.xyzz, s0 mad r0.xy, -r1, r1, r1 mad r1.xy, r1, c1.x, c1.y dp3 r0.x, r0, c0.yyzw rsq r0.y, r0.x mad_pp oC0.z, r0.x, c0.w, c0.x rcp r0.x, r0.y mul_pp oC0.xy, r1, r0.x mov_pp oC0.w, c0.y
7 ALU Radeon 9700: -- Radeon X1900: 3 GPR, 2.33 clk, 6.86 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 6 clk, 1066 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 13 clk, 11712 mpix/s	8 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 2 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 3 GPR, 1.75 clk, 9.14 pix/clk Radeon HD 4870: 3 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 6 clk, 1066 mpix/s GeForce 7800GT: 1 GPR, 3 clk, 3200 mpix/s GeForce 8800GTX: 7 GPR, 15 clk, 10777 mpix/s
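A CPU-side Python transcription of the sphere map encode and the non-optimized decode is sketched below, without quantization. It assumes a unit-length input; the encoding breaks down at n.z = -1, where p becomes zero.

```python
import math

def encode_spheremap(n):
    x, y, z = n
    f = z * 2.0 + 1.0
    g = x * x + y * y + z * z
    p = math.sqrt(g + f)
    return (x / p * 0.5 + 0.5, y / p * 0.5 + 0.5)

def decode_spheremap(enc):
    ex, ey = enc
    f = (-ex * ex + ex) + (-ey * ey + ey)
    m = math.sqrt(4.0 * f - 1.0)
    return ((ex * 4.0 - 2.0) * m, (ey * 4.0 - 2.0) * m, 8.0 * f - 3.0)

n = (0.6, 0.0, -0.8)   # points away from the camera
print(decode_spheremap(encode_spheremap(n)))
```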

Method #5: Cry Engine 3

Somewhat similar to Method #4 (sphere map). Used in Cry Engine 3, presented by Martin Mittring in "A bit more Deferred" presentation (PPT link, slide 13). For Unity, I had to negate the Z component of the view space normal to produce good results; I guess Unity's and Cry Engine's coordinate systems are different.

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000921; PSNR: 40.355 dB.

Pros:

Used by Cry Engine 3, so it must be good! :)

Cons:

Encoding	Decoding
enc = normalize(n.xy) * (sqrt(-n.z*0.5+0.5)); enc = enc*0.5+0.5;	float2 fenc = enc*2-1; n.z = -(dot(fenc,fenc)*2-1); n.xy = normalize(fenc) * sqrt(1-n.z*n.z); /* optimized; enc4 is float4, with .rg containing encoded normal */ float4 nn = enc4*float4(2,2,0,0) + float4(-1,-1,1,-1); float l = dot(nn.xyz,-nn.xyw); nn.z = l; nn.xy *= sqrt(l); n = nn.xyz * 2 + float3(0,0,-1);
ps_2_0 def c0, 0, -0.5, 0.5, 0 def c1, 0, 1, 0, 0 dcl t0.xyz dp2add r0.w, t0, t0, c0.x rsq r0.x, r0.w mul r0.xy, r0.x, t0 mad r0.z, t0.z, c0.y, c0.z rsq r0.z, r0.z rcp r0.z, r0.z mul r0.xy, r0, r0.z mad_pp r0.xy, r0, c0.z, c0.z mov_pp r0.z, c1.x mov_pp r0.w, c1.y mov_pp oC0, r0	ps_2_0 def c0, 2, 2, 0, 0 def c1, -1, -1, 1, -1 def c2, 0, 0, -1, 2 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mov r1, c1 mad r0, r0, c0, r1 mul r1.x, -r0.x, r0.x mad r1.x, r0.y, -r0.y, r1.x mad r1.z, r0.z, -r0.w, r1.x rsq r1.w, r1.z rcp r1.w, r1.w mul r1.xy, r0, r1.w mad_pp r0.xyz, r1, c2.w, c2 mov r0.w, c1.z mov_pp oC0, r0
11 ALU Radeon 9700: 2 GPR, 5.00 clk, 1.60 pix/clk Radeon X1900: 2 GPR, 2.33 clk, 6.86 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 6.00 clk, 1067 mpix/s GeForce 7800GT: 2 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8246 mpix/s	13 ALU, 1 TEX Radeon 9700: 2 GPR, 7.00 clk, 1.14 pix/clk Radeon X1900: 2 GPR, 2.67 clk, 6.00 pix/clk Radeon HD 2900,3870: 2 GPR, 2.25 clk, 7.11 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 6.00 clk, 1066 mpix/s GeForce 7800GT: 2 GPR, 5.00 clk, 1920 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8128 mpix/s
ps_3_0 def c0, 0, -0.5, 0.5, 1 dcl_texcoord v0.xyz dp2add r0.x, v0, v0, c0.x rsq r0.x, r0.x mul r0.xy, r0.x, v0 mad r0.z, v0.z, c0.y, c0.z rsq r0.z, r0.z rcp r0.z, r0.z mul r0.xy, r0, r0.z mad_pp oC0.xy, r0, c0.z, c0.z mov_pp oC0.zw, c0.xyxw	ps_3_0 def c0, 2, 0, -1, 1 dcl_texcoord v0.xyz dcl_2d s0 texldp r0, v0.xyzz, s0 mad r0, r0, c0.xxyy, c0.zzwz dp3 r1.z, r0, -r0.xyww rsq r0.z, r1.z rcp r0.z, r0.z mul r1.xy, r0, r0.z mad_pp oC0.xyz, r1, c0.x, c0.yyzw mov_pp oC0.w, c0.w
9 ALU Radeon 9700: -- Radeon X1900: 2 GPR, 2.33 clk, 6.86 pix/clk Radeon HD 2900,3870: 2 GPR, 1.50 clk, 10.67 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 6.00 clk, 1067 mpix/s GeForce 7800GT: 2 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8246 mpix/s	7 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 2 GPR, 2.67 clk, 6.00 pix/clk Radeon HD 2900,3870: 3 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 4870: 3 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 5.00 clk, 1280 mpix/s GeForce 7800GT: 2 GPR, 4.00 clk, 2400 mpix/s GeForce 8800GTX: 6 GPR, 15.00 clk, 10501 mpix/s
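The two decode variants can be compared on the CPU. The Python sketch below transcribes the encoder, the straightforward decoder and the optimized decoder (reading the optimized one from the compiled assembly, where n.xy is multiplied by sqrt(l)). It assumes a unit-length normal with a non-zero n.xy, since normalize(n.xy) is undefined otherwise.

```python
import math

def encode_cry3(n):
    x, y, z = n
    length_xy = math.hypot(x, y)
    s = math.sqrt(-z * 0.5 + 0.5)
    return (x / length_xy * s * 0.5 + 0.5, y / length_xy * s * 0.5 + 0.5)

def decode_cry3(enc):
    fx = enc[0] * 2.0 - 1.0
    fy = enc[1] * 2.0 - 1.0
    z = -((fx * fx + fy * fy) * 2.0 - 1.0)
    length = math.hypot(fx, fy)
    r = math.sqrt(1.0 - z * z)
    return (fx / length * r, fy / length * r, z)

def decode_cry3_optimized(enc):
    # nn = enc4 * (2,2,0,0) + (-1,-1,1,-1)
    nx = enc[0] * 2.0 - 1.0
    ny = enc[1] * 2.0 - 1.0
    nz, nw = 1.0, -1.0
    l = nx * -nx + ny * -ny + nz * -nw     # dot(nn.xyz, -nn.xyw)
    root = math.sqrt(l)
    return (nx * root * 2.0, ny * root * 2.0, l * 2.0 - 1.0)

n = (0.6, 0.0, -0.8)
enc = encode_cry3(n)
print(decode_cry3(enc))
print(decode_cry3_optimized(enc))
```

Both decoders return the input normal; the optimized one avoids the normalize() by folding the rescale into a single sqrt.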

Method #6: Lambert Azimuthal Equal-Area Projection

What the title says: use Lambert Azimuthal Equal-Area projection (Wikipedia link). Suggested by Sean Barrett in comments for this article.

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000495; PSNR: 43.054 dB.

Pros:

Quality pretty good!
Quite cheap to encode/decode.

Cons:

Encoding	Decoding
float f = sqrt(8*n.z+8); enc = n.xy / f + 0.5;	float2 fenc = enc*4-2; float f = dot(fenc,fenc); float g = sqrt(1-f/4); n.xy = fenc*g; n.z = 1-f/2;
ps_2_0 def c0, 8, 0.5, 1, 0 dcl t0.xyz mad r0.w, t0.z, c0.x, c0.x rsq r0.x, r0.w mad_pp r0.xy, t0, r0.x, c0.y mov_pp r0.z, c0.w mov_pp r0.w, c0.z mov_pp oC0, r0	ps_2_0 def c0, 4, -2, 0, 1 def c1, 0.25, 1, 0.5, 0 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mad r0.xy, r0, c0.x, c0.y dp2add r0.z, r0, r0, c0.z mad r0.w, r0.z, -c1.x, c1.y mad_pp r1.z, r0.z, -c1.z, c1.y rsq r0.z, r0.w rcp r0.z, r0.z mul_pp r1.xy, r0, r0.z mov_pp r1.w, c0.w mov_pp oC0, r1
6 ALU Radeon 9700: 1 GPR, 3.00 clk, 2.67 pix/clk Radeon X1900: 1 GPR, 1.67 clk, 9.60 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 4.00 clk, 1600 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 5 GPR, 12.00 clk, 13724 mpix/s	11 ALU, 1 TEX Radeon 9700: 1 GPR, 7.00 clk, 1.14 pix/clk Radeon X1900: 1 GPR, 3.33 clk, 4.80 pix/clk Radeon HD 2900,3870: 2 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 6.00 clk, 1067 mpix/s GeForce 7800GT: 1 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 20.00 clk, 8282 mpix/s
ps_3_0 def c0, 8, 0.5, 0, 1 dcl_texcoord v0.xyz mad r0.x, v0.z, c0.x, c0.x rsq r0.x, r0.x mad_pp oC0.xy, v0, r0.x, c0.y mov_pp oC0.zw, c0	ps_3_0 def c0, 4, -2, 0, 1 def c1, 0.25, 0.5, 1, 0 dcl_texcoord v0.xyz dcl_2d s0 texldp r0, v0.xyzz, s0 mad r0.xy, r0, c0.x, c0.y dp2add r0.z, r0, r0, c0.z mad r0.zw, r0.z, -c1.xyxy, c1.z rsq r0.z, r0.z mad_pp oC0.zw, r0.w, c0.xywz, c0 rcp r0.z, r0.z mul_pp oC0.xy, r0, r0.z
4 ALU Radeon 9700: -- Radeon X1900: 1 GPR, 1.67 clk, 9.60 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 4.00 clk, 1600 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 5 GPR, 12.00 clk, 13724 mpix/s	7 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 1 GPR, 3.33 clk, 4.80 pix/clk Radeon HD 2900,3870: 2 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 6.00 clk, 1067 mpix/s GeForce 7800GT: 1 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 15.00 clk, 10952 mpix/s
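The Lambert azimuthal round trip is short enough to check by hand or with a few lines of Python. This sketch mirrors the HLSL above without quantization and assumes a unit-length normal with n.z greater than -1.

```python
import math

def encode_lambert(n):
    x, y, z = n
    f = math.sqrt(8.0 * z + 8.0)
    return (x / f + 0.5, y / f + 0.5)

def decode_lambert(enc):
    fx = enc[0] * 4.0 - 2.0
    fy = enc[1] * 4.0 - 2.0
    f = fx * fx + fy * fy
    g = math.sqrt(1.0 - f / 4.0)
    return (fx * g, fy * g, 1.0 - f / 2.0)

n = (0.6, 0.0, -0.8)
print(decode_lambert(encode_lambert(n)))
```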

Method #7: Stereographic Projection

What the title says: use Stereographic Projection (Wikipedia link), plus rescaling so that the "practically visible" range of normals maps into the unit circle (regular stereographic projection maps the sphere to a circle of infinite size). In my tests, a scaling factor of 1.7777 produced the best results; in practice it depends on the FOV used and how much you care about normals that point away from the camera.

Suggested by Sean Barrett and Ignacio Castano in comments for this article.

Encoding, Error to power, Error*1, Error*10 images below. MSE: 0.0000380; PSNR: 44.207 dB.

Pros:

Quality pretty good!
Quite cheap to encode/decode.

Cons:

Encoding	Decoding
float scale = 1.7777; enc = n.xy / (n.z+1); enc /= scale; enc = enc*0.5+0.5;	/* enc4 is float4, with .rg containing encoded normal */ float scale = 1.7777; float3 nn = enc4.xyz*float3(2*scale,2*scale,0) + float3(-scale,-scale,1); float g = 2.0 / dot(nn.xyz,nn.xyz); n.xy = g*nn.xy; n.z = g-1;
ps_2_0 def c0, 1, 0.281262308, 0.5, 0 def c1, 0, 0, 0, 1 dcl t0.xyz add r0.w, t0.z, c0.x rcp r0.x, r0.w mul r0.xy, r0.x, t0 mad_pp r0.xy, r0, c0.y, c0.z mov_pp r0.zw, c1 mov_pp oC0, r0	ps_2_0 def c0, 3.55539989, 3.55539989, 0, 1 def c1, -1.77769995, -1.77769995, 1, -2 dcl t0.xyz dcl_2d s0 mov r0.xyz, t0 mov r0.w, t0.z texldp r0, r0, s0 mov r1.xyz, c1 mad r0.xyz, r0, c0, r1 dp3 r0.z, r0, r0 rcp r0.z, r0.z add r0.w, r0.z, r0.z mad_pp r1.z, r0.z, -c1.w, -c1.z mul_pp r1.xy, r0, r0.w mov_pp r1.w, c0.w mov_pp oC0, r1
6 ALU Radeon 9700: 1 GPR, 4.00 clk, 2.00 pix/clk Radeon X1900: 1 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 2.00 clk, 3200 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 5 GPR, 12.00 clk, 13734 mpix/s	11 ALU, 1 TEX Radeon 9700: 1 GPR, 6.00 clk, 1.33 pix/clk Radeon X1900: 1 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 3 GPR, 1.75 clk, 9.14 pix/clk Radeon HD 4870: 3 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 3.00 clk, 2133 mpix/s GeForce 7800GT: 2 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 16.00 clk, 9897 mpix/s
ps_3_0 def c0, 1, 0.281262308, 0.5, 0 dcl_texcoord v0.xyz add r0.x, c0.x, v0.z rcp r0.x, r0.x mul r0.xy, r0.x, v0 mad_pp oC0.xy, r0, c0.y, c0.z mov_pp oC0.zw, c0.xywx	ps_3_0 def c0, 3.55539989, 0, -1.77769995, 1 def c1, 2, -1, 0, 0 dcl_texcoord v0.xyz dcl_2d s0 texldp r0, v0.xyzz, s0 mad r0.xyz, r0, c0.xxyw, c0.zzww dp3 r0.z, r0, r0 rcp r0.z, r0.z add r0.w, r0.z, r0.z mad_pp oC0.z, r0.z, c1.x, c1.y mul_pp oC0.xy, r0, r0.w mov_pp oC0.w, c0.w
5 ALU Radeon 9700: -- Radeon X1900: 1 GPR, 2.00 clk, 8.00 pix/clk Radeon HD 2900,3870: 2 GPR, 1.00 clk, 16.00 pix/clk Radeon HD 4870: 2 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 1 GPR, 2.00 clk, 3200 mpix/s GeForce 7800GT: 1 GPR, 2.00 clk, 4800 mpix/s GeForce 8800GTX: 5 GPR, 12.00 clk, 13734 mpix/s	7 ALU, 1 TEX Radeon 9700: -- Radeon X1900: 1 GPR, 3.00 clk, 5.33 pix/clk Radeon HD 2900,3870: 3 GPR, 1.75 clk, 9.14 pix/clk Radeon HD 4870: 3 GPR, 1.00 clk, 16.00 pix/clk GeForce 6800U: 2 GPR, 3.00 clk, 2133 mpix/s GeForce 7800GT: 2 GPR, 3.00 clk, 3200 mpix/s GeForce 8800GTX: 6 GPR, 12.00 clk, 12493 mpix/s
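The effect of the scale factor can be explored with a small Python sketch of the round trip. It mirrors the HLSL above without quantization; `fits_unit_range` is a helper made up here to show which normals would be clipped once the result is written to a 0..1 render target.

```python
SCALE = 1.7777

def encode_stereo(n):
    x, y, z = n
    return (x / (z + 1.0) / SCALE * 0.5 + 0.5,
            y / (z + 1.0) / SCALE * 0.5 + 0.5)

def decode_stereo(enc):
    nx = enc[0] * 2.0 * SCALE - SCALE
    ny = enc[1] * 2.0 * SCALE - SCALE
    g = 2.0 / (nx * nx + ny * ny + 1.0)
    return (g * nx, g * ny, g - 1.0)

def fits_unit_range(enc):
    """True if the encoded value survives storage in a 0..1 channel."""
    return all(0.0 <= c <= 1.0 for c in enc)

facing = (0.6, 0.0, 0.8)
away = (0.6, 0.0, -0.8)
print(decode_stereo(encode_stereo(facing)), fits_unit_range(encode_stereo(facing)))
print(decode_stereo(encode_stereo(away)), fits_unit_range(encode_stereo(away)))
```

In exact arithmetic the round trip works for any n.z above -1, but with this scale a normal whose n.z is below roughly -0.52 encodes outside 0..1 and would be clamped by an integer render target. This is the trade-off the scaling factor controls.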

Performance Comparison

GPU performance comparison in a single table:

Encoding, GPU cycles, SM2.0
	#1: X & Y	#3: Spherical	#3a: w/ LUT	#4: Spheremap	#5: Cry3	#6: Lambert	#7: Stereo
Radeon X1900	1.00	5.67	3.33	2.33	2.33	1.67	2.00
Radeon HD3870	1.00	4.00	1.00	1.50	1.50	1.00	1.00
GeForce 6800U	1.00	12.00	3.00	6.00	6.00	4.00	2.00
GeForce 8800GTX	7.00	36.00	10.00	13.00	20.00	12.00	12.00
Decoding, GPU cycles, SM2.0
Radeon X1900	2.67	3.00	2.00	3.00	2.67	3.33	3.00
Radeon HD3870	1.50	2.25	2.00	1.75	2.25	2.00	1.75
GeForce 6800U	4.00	7.00	2.00	6.00	6.00	6.00	3.00
GeForce 8800GTX	20.00	28.00	20.00	20.00	20.00	20.00	16.00
Encoding, D3D ALU+TEX instruction slots
SM2.0	3	31	6	9	11	6	6
SM3.0	1	26	4	7	9	4	5
Decoding, D3D ALU+TEX instruction slots (not counting the one texture fetch that reads the encoded normal)
SM2.0	8	14	6	12	13	11	11
SM3.0	6	10	3	8	7	7	7

Quality Comparison

Quality comparison in a single table. PSNR based, higher numbers are better.

Method	PSNR, dB
#1: X & Y	15.054
#3: Spherical	40.244
#3a: w/ LUT	36.503
#4: Spheremap	44.781
#5: Cry Engine 3	40.355
#6: Lambert	43.054
#7: Stereographic	44.207

Changelog

2010 03 25: Stop! Read the new & improved version of this article!
2009 08 12: Added Method #7: Stereographic projection. Suggested by Sean Barrett and Ignacio Castano.
2009 08 12: Optimized Method #5, suggested by Steve Hill.
2009 08 08: Added power difference images.
2009 08 07: Optimized Method #4: Sphere map. Suggested by Irenee Caroulle.
2009 08 07: Added Method #6: Lambert Azimuthal Equal Area. Suggested by Sean Barrett.
2009 08 05: Added Method #5: Cry Engine 3. Suggested by Steve Hill.
2009 08 05: Improved quality of Method #3a: round values in texture LUT.
2009 08 05: Added MSE and PSNR values for all methods.
2009 08 04: Added Method #3a: Spherical Coordinates w/ texture LUT.
2009 08 04: Method #1: 1-dot(n.xy,n.xy) is slightly better than 1-n.x*n.x-n.y*n.y (better pipelining on NV and ATI). Suggested by Arseny "zeux" Kapoulkine.