Compact Normal Storage for small G-Buffers

Intro
Baseline: store X&Y&Z
Method 1: X&Y
Method 3: Spherical Coordinates
Method 4: Spheremap Transform
Method 7: Stereographic projection
Method 8: Per-pixel View Space
Performance Comparison
Quality Comparison
Changelog

Heads up!

I wrote this article in year 2009. Since then a lot of things have happened. A decade later, it looks like everyone settled onto something like semi-octahedral normal encoding. Which was not described here since, well, it was not quite invented yet :) See:

“Octahedron normal vector encoding” (Narkowicz), 2014.
“Survey of Efficient Representations for Independent Unit Vectors” (Cigolle, Donow, Evangelakos, Mara, McGuire, Meyer), 2014
“Signed Octahedron Normal Encoding” (White), 2017

So my advice is just to use that and ignore everything below :)

Intro

Various deferred shading/lighting approaches or image postprocessing effects need to store normals as part of their G-buffer. Let's figure out a compact storage method for view space normals. In my case, main target is minimalist G-buffer, where depth and normals are packed into a single 32 bit (8 bits/channel) render texture. I try to minimize error and shader cycles to encode/decode.

Now of course, 8 bits/channel storage for normals can be not enough for shading, especially if you want specular (low precision & quantization leads to specular "wobble" when camera or objects move). However, everything below should Just Work (tm) for 10 or 16 bits/channel integer formats. For 16 bits/channel half-float formats, some of the computations are not necessary (e.g. bringing normal values into 0..1 range).

If you know other ways to store/encode normals, please let me know in the comments!

Various normal encoding methods and their comparison below. Notes:

Error images are: 1-pow(dot(n1,n2),1024) and abs(n1-n2)*30, where n1 is actual normal, and n2 is normal encoded into a texture, read back & decoded. MSE and PSNR is computed on the difference (abs(n1-n2)) image.
Shader code is HLSL. Compiled into ps_3_0 by d3dx9_42.dll (February 2010 SDK).
Radeon GPU performance numbers from AMD's GPU ShaderAnalyzer 1.53, using Catalyst 9.12 driver.
GeForce GPU performance numbers from NVIDIA's NVShaderPerf 2.0, using 174.74 driver.

Note: there was an error!

Original version of my article had some stupidity: encoding shaders did not normalize the incoming per-vertex normal. This resulted in quality evaluation results being somewhat wrong. Also, if normal is assumed to be normalized, then three methods in original article (Sphere Map, Cry Engine 3 and Lambert Azimuthal) are in fact completely equivalent. The old version is still available for the sake of integrity of the internets.

Test Playground Application

Here is a small Windows application I used to test everything below: NormalEncodingPlayground.zip (4.8MB, source included).

It requires GPU with Shader Model 3.0 support. When it writes fancy shader reports, it expects AMD's GPUShaderAnalyzer and NVIDIA's NVShaderPerf to be installed. Source code should build with Visual C++ 2008.

Baseline: store X&Y&Z

Just to set the basis, store all three components of the normal. It's not suitable for our quest, but I include it here to evaluate "base" encoding error (which happens here only because of quantization to 8 bits per component).

Encoding, Error to Power, Error * 30 images below. MSE: 0.000008; PSNR: 51.081 dB.

Encoding	Decoding
half4 encode (half3 n, float3 view) { return half4(n.xyz*0.5+0.5,0); }	half3 decode (half4 enc, float3 view) { return enc.xyz*2-1; }
ps_3_0 def c0, 0.5, 0, 0, 0 dcl_texcoord_pp v0.xyz mad_pp oC0, v0.xyzx, c0.xxxy, c0.xxxy	ps_3_0 def c0, 2, -1, 0, 0 dcl_texcoord2 v0.xy dcl_2d s0 texld_pp r0, v0, s0 mad_pp oC0.xyz, r0, c0.x, c0.y mov_pp oC0.w, c0.z
1 ALU Radeon HD 2400: 1 GPR, 1.00 clk Radeon HD 3870: 1 GPR, 1.00 clk Radeon HD 5870: 1 GPR, 0.50 clk GeForce 6200: 1 GPR, 1.00 clk GeForce 7800GT: 1 GPR, 1.00 clk GeForce 8800GTX: 6 GPR, 8.00 clk	2 ALU, 1 TEX Radeon HD 2400: 1 GPR, 1.00 clk Radeon HD 3870: 1 GPR, 1.00 clk Radeon HD 5870: 1 GPR, 0.50 clk GeForce 6200: 1 GPR, 1.00 clk GeForce 7800GT: 1 GPR, 1.00 clk GeForce 8800GTX: 6 GPR, 10.00 clk

Encoding

Decoding

half4 encode (half3 n, float3 view)
{
    return half4(n.xyz*0.5+0.5,0);
}

half3 decode (half4 enc, float3 view)
{
    return enc.xyz*2-1;
}

ps_3_0
def c0, 0.5, 0, 0, 0
dcl_texcoord_pp v0.xyz
mad_pp oC0, v0.xyzx, c0.xxxy, c0.xxxy

ps_3_0
def c0, 2, -1, 0, 0
dcl_texcoord2 v0.xy
dcl_2d s0
texld_pp r0, v0, s0
mad_pp oC0.xyz, r0, c0.x, c0.y
mov_pp oC0.w, c0.z

1 ALU
Radeon HD 2400: 1 GPR, 1.00 clk
Radeon HD 3870: 1 GPR, 1.00 clk
Radeon HD 5870: 1 GPR, 0.50 clk
GeForce 6200: 1 GPR, 1.00 clk
GeForce 7800GT: 1 GPR, 1.00 clk
GeForce 8800GTX: 6 GPR, 8.00 clk

2 ALU, 1 TEX
Radeon HD 2400: 1 GPR, 1.00 clk
Radeon HD 3870: 1 GPR, 1.00 clk
Radeon HD 5870: 1 GPR, 0.50 clk
GeForce 6200: 1 GPR, 1.00 clk
GeForce 7800GT: 1 GPR, 1.00 clk
GeForce 8800GTX: 6 GPR, 10.00 clk

Method #1: store X&Y, reconstruct Z

Used by Killzone 2 among others (PDF link).

Encoding, Error to Power, Error * 30 images below. MSE: 0.013514; PSNR: 18.692 dB.

Pros:

Very simple to encode/decode

Cons:

Normal can point away from the camera. My test scene setup actually has that. See Resistance 2 Prelighting paper (PDF link) for explanation.

Encoding	Decoding
half4 encode (half3 n, float3 view) { return half4(n.xy*0.5+0.5,0,0); }	half3 decode (half2 enc, float3 view) { half3 n; n.xy = enc*2-1; n.z = sqrt(1-dot(n.xy, n.xy)); return n; }
ps_3_0 def c0, 0.5, 0, 0, 0 dcl_texcoord_pp v0.xy mad_pp oC0, v0.xyxx, c0.xxyy, c0.xxyy	ps_3_0 def c0, 2, -1, 1, 0 dcl_texcoord2 v0.xy dcl_2d s0 texld_pp r0, v0, s0 mad_pp r0.xy, r0, c0.x, c0.y dp2add_pp r0.z, r0, -r0, c0.z mov_pp oC0.xy, r0 rsq_pp r0.x, r0.z rcp_pp oC0.z, r0.x mov_pp oC0.w, c0.w
1 ALU Radeon HD 2400: 1 GPR, 1.00 clk Radeon HD 3870: 1 GPR, 1.00 clk Radeon HD 5870: 1 GPR, 0.50 clk GeForce 6200: 1 GPR, 1.00 clk GeForce 7800GT: 1 GPR, 1.00 clk GeForce 8800GTX: 5 GPR, 7.00 clk	7 ALU, 1 TEX Radeon HD 2400: 1 GPR, 1.00 clk Radeon HD 3870: 1 GPR, 1.00 clk Radeon HD 5870: 1 GPR, 0.50 clk GeForce 6200: 1 GPR, 4.00 clk GeForce 7800GT: 1 GPR, 3.00 clk GeForce 8800GTX: 5 GPR, 15.00 clk

Encoding

Decoding

half4 encode (half3 n, float3 view)
{
    return half4(n.xy*0.5+0.5,0,0);
}

half3 decode (half2 enc, float3 view)
{
    half3 n;
    n.xy = enc*2-1;
    n.z = sqrt(1-dot(n.xy, n.xy));
    return n;
}

ps_3_0
def c0, 0.5, 0, 0, 0
dcl_texcoord_pp v0.xy
mad_pp oC0, v0.xyxx, c0.xxyy, c0.xxyy

ps_3_0
def c0, 2, -1, 1, 0
dcl_texcoord2 v0.xy
dcl_2d s0
texld_pp r0, v0, s0
mad_pp r0.xy, r0, c0.x, c0.y
dp2add_pp r0.z, r0, -r0, c0.z
mov_pp oC0.xy, r0
rsq_pp r0.x, r0.z
rcp_pp oC0.z, r0.x
mov_pp oC0.w, c0.w

1 ALU
Radeon HD 2400: 1 GPR, 1.00 clk
Radeon HD 3870: 1 GPR, 1.00 clk
Radeon HD 5870: 1 GPR, 0.50 clk
GeForce 6200: 1 GPR, 1.00 clk
GeForce 7800GT: 1 GPR, 1.00 clk
GeForce 8800GTX: 5 GPR, 7.00 clk

7 ALU, 1 TEX
Radeon HD 2400: 1 GPR, 1.00 clk
Radeon HD 3870: 1 GPR, 1.00 clk
Radeon HD 5870: 1 GPR, 0.50 clk
GeForce 6200: 1 GPR, 4.00 clk
GeForce 7800GT: 1 GPR, 3.00 clk
GeForce 8800GTX: 5 GPR, 15.00 clk

Method #3: Spherical Coordinates

It is possible to use spherical coordinates to encode the normal. Since we know it's unit length, we can just store the two angles.

Suggested by Pat Wilson of Garage Games: GG blog post. Other mentions: MJP's blog, GarageGames thread, Wolf Engel's blog, gamedev.net forum thread.

Encoding, Error to Power, Error * 30 images below. MSE: 0.000062; PSNR: 42.042 dB.

Pros:

Suitable for normals in general (not necessarily view space)

Cons:

Uses trig instructions (quite heavy on ALU). Possible to replace some of that with texture lookups though.

Encoding	Decoding
#define kPI 3.1415926536f half4 encode (half3 n, float3 view) { return half4( (half2(atan2(n.y,n.x)/kPI, n.z)+1.0)*0.5, 0,0); }	half3 decode (half2 enc, float3 view) { half2 ang = enc2-1; half2 scth; sincos(ang.x kPI, scth.x, scth.y); half2 scphi = half2(sqrt(1.0 - ang.yang.y), ang.y); return half3(scth.yscphi.x, scth.x*scphi.x, scphi.y); }
ps_3_0 def c0, 0.999866009, 0, 1, 3.14159274 def c1, 0.0208350997, -0.0851330012, 0.180141002, -0.330299497 def c2, -2, 1.57079637, 0.318309873, 0.5 dcl_texcoord_pp v0.xyz add_pp r0.xy, -v0_abs, v0_abs.yxzw cmp_pp r0.xz, r0.x, v0_abs.xyyw, v0_abs.yyxw cmp_pp r0.y, r0.y, c0.y, c0.z rcp_pp r0.z, r0.z mul_pp r0.x, r0.x, r0.z mul_pp r0.z, r0.x, r0.x mad_pp r0.w, r0.z, c1.x, c1.y mad_pp r0.w, r0.z, r0.w, c1.z mad_pp r0.w, r0.z, r0.w, c1.w mad_pp r0.z, r0.z, r0.w, c0.x mul_pp r0.x, r0.x, r0.z mad_pp r0.z, r0.x, c2.x, c2.y mad_pp r0.x, r0.z, r0.y, r0.x cmp_pp r0.y, v0.x, -c0.y, -c0.w add_pp r0.x, r0.x, r0.y add_pp r0.y, r0.x, r0.x add_pp r0.z, -v0.x, v0.y cmp_pp r0.zw, r0.z, v0.xyxy, v0.xyyx cmp_pp r0.zw, r0, c0.xyyz, c0.xyzy mul_pp r0.z, r0.w, r0.z mad_pp r0.x, r0.z, -r0.y, r0.x mul_pp r0.x, r0.x, c2.z mov_pp r0.y, v0.z add_pp r0.xy, r0, c0.z mul_pp oC0.xy, r0, c2.w mov_pp oC0.zw, c0.y	ps_3_0 def c0, 2, -1, 0.5, 1 def c1, 6.28318548, -3.14159274, 1, 0 dcl_texcoord2 v0.xy dcl_2d s0 texld_pp r0, v0, s0 mad_pp r0.xy, r0, c0.x, c0.y mad r0.x, r0.x, c0.z, c0.z frc r0.x, r0.x mad r0.x, r0.x, c1.x, c1.y sincos_pp r1.xy, r0.x mad_pp r0.x, r0.y, -r0.y, c0.w mul_pp oC0.zw, r0.y, c1 rsq_pp r0.x, r0.x rcp_pp r0.x, r0.x mul_pp oC0.xy, r1, r0.x
26 ALU Radeon HD 2400: 1 GPR, 17.00 clk Radeon HD 3870: 1 GPR, 4.25 clk Radeon HD 5870: 2 GPR, 0.95 clk GeForce 6200: 2 GPR, 12.00 clk GeForce 7800GT: 2 GPR, 9.00 clk GeForce 8800GTX: 9 GPR, 43.00 clk	17 ALU, 1 TEX Radeon HD 2400: 1 GPR, 17.00 clk Radeon HD 3870: 1 GPR, 4.25 clk Radeon HD 5870: 2 GPR, 0.95 clk GeForce 6200: 2 GPR, 7.00 clk GeForce 7800GT: 1 GPR, 5.00 clk GeForce 8800GTX: 6 GPR, 23.00 clk

Encoding

Decoding

#define kPI 3.1415926536f
half4 encode (half3 n, float3 view)
{
    return half4(
      (half2(atan2(n.y,n.x)/kPI, n.z)+1.0)*0.5,
      0,0);
}

half3 decode (half2 enc, float3 view)
{
    half2 ang = enc*2-1;
    half2 scth;
    sincos(ang.x * kPI, scth.x, scth.y);
    half2 scphi = half2(sqrt(1.0 - ang.y*ang.y), ang.y);
    return half3(scth.y*scphi.x, scth.x*scphi.x, scphi.y);
}

ps_3_0
def c0, 0.999866009, 0, 1, 3.14159274
def c1, 0.0208350997, -0.0851330012,
    0.180141002, -0.330299497
def c2, -2, 1.57079637, 0.318309873, 0.5
dcl_texcoord_pp v0.xyz
add_pp r0.xy, -v0_abs, v0_abs.yxzw
cmp_pp r0.xz, r0.x, v0_abs.xyyw, v0_abs.yyxw
cmp_pp r0.y, r0.y, c0.y, c0.z
rcp_pp r0.z, r0.z
mul_pp r0.x, r0.x, r0.z
mul_pp r0.z, r0.x, r0.x
mad_pp r0.w, r0.z, c1.x, c1.y
mad_pp r0.w, r0.z, r0.w, c1.z
mad_pp r0.w, r0.z, r0.w, c1.w
mad_pp r0.z, r0.z, r0.w, c0.x
mul_pp r0.x, r0.x, r0.z
mad_pp r0.z, r0.x, c2.x, c2.y
mad_pp r0.x, r0.z, r0.y, r0.x
cmp_pp r0.y, v0.x, -c0.y, -c0.w
add_pp r0.x, r0.x, r0.y
add_pp r0.y, r0.x, r0.x
add_pp r0.z, -v0.x, v0.y
cmp_pp r0.zw, r0.z, v0.xyxy, v0.xyyx
cmp_pp r0.zw, r0, c0.xyyz, c0.xyzy
mul_pp r0.z, r0.w, r0.z
mad_pp r0.x, r0.z, -r0.y, r0.x
mul_pp r0.x, r0.x, c2.z
mov_pp r0.y, v0.z
add_pp r0.xy, r0, c0.z
mul_pp oC0.xy, r0, c2.w
mov_pp oC0.zw, c0.y

ps_3_0
def c0, 2, -1, 0.5, 1
def c1, 6.28318548, -3.14159274, 1, 0
dcl_texcoord2 v0.xy
dcl_2d s0
texld_pp r0, v0, s0
mad_pp r0.xy, r0, c0.x, c0.y
mad r0.x, r0.x, c0.z, c0.z
frc r0.x, r0.x
mad r0.x, r0.x, c1.x, c1.y
sincos_pp r1.xy, r0.x
mad_pp r0.x, r0.y, -r0.y, c0.w
mul_pp oC0.zw, r0.y, c1
rsq_pp r0.x, r0.x
rcp_pp r0.x, r0.x
mul_pp oC0.xy, r1, r0.x

26 ALU
Radeon HD 2400: 1 GPR, 17.00 clk
Radeon HD 3870: 1 GPR, 4.25 clk
Radeon HD 5870: 2 GPR, 0.95 clk
GeForce 6200: 2 GPR, 12.00 clk
GeForce 7800GT: 2 GPR, 9.00 clk
GeForce 8800GTX: 9 GPR, 43.00 clk

17 ALU, 1 TEX
Radeon HD 2400: 1 GPR, 17.00 clk
Radeon HD 3870: 1 GPR, 4.25 clk
Radeon HD 5870: 2 GPR, 0.95 clk
GeForce 6200: 2 GPR, 7.00 clk
GeForce 7800GT: 1 GPR, 5.00 clk
GeForce 8800GTX: 6 GPR, 23.00 clk

Method #4: Spheremap Transform

Spherical environment mapping (indirectly) maps reflection vector to a texture coordinate in [0..1] range. The reflection vector can point away from the camera, just like our view space normals. Bingo! See Siggraph 99 notes for sphere map math. Normal we want to encode is R, resulting values are (s,t).

If we assume that incoming normal is normalized, then there are methods derived from elsewhere that end up being exactly equivalent:

Used in Cry Engine 3, presented by Martin Mittring in "A bit more Deferred" presentation (PPT link, slide 13). For Unity, I had to negate Z component of view space normal to produce good results, I guess Unity's and Cry Engine's coordinate systems are different. The code would be:

half2 encode (half3 n, float3 view)
{
    half2 enc = normalize(n.xy) * (sqrt(-n.z*0.5+0.5));
    enc = enc*0.5+0.5;
    return enc;
}
half3 decode (half4 enc, float3 view)
{
    half4 nn = enc*half4(2,2,0,0) + half4(-1,-1,1,-1);
    half l = dot(nn.xyz,-nn.xyw);
    nn.z = l;
    nn.xy *= sqrt(l);
    return nn.xyz * 2 + half3(0,0,-1);
}

Lambert Azimuthal Equal-Area projection (Wikipedia link). Suggested by Sean Barrett in comments for this article. The code would be:

half2 encode (half3 n, float3 view)
{
    half f = sqrt(8*n.z+8);
    return n.xy / f + 0.5;
}
half3 decode (half4 enc, float3 view)
{
    half2 fenc = enc*4-2;
    half f = dot(fenc,fenc);
    half g = sqrt(1-f/4);
    half3 n;
    n.xy = fenc*g;
    n.z = 1-f/2;
    return n;
}

Encoding, Error to Power, Error * 30 images below. MSE: 0.000016; PSNR: 48.071 dB.

Pros:

Quality pretty good!
Quite cheap to encode/decode.
Similar derivation used by Cry Engine 3, so it must be good :)

Cons:

Encoding	Decoding
half4 encode (half3 n, float3 view) { half p = sqrt(n.z*8+8); return half4(n.xy/p + 0.5,0,0); }	half3 decode (half2 enc, float3 view) { half2 fenc = enc4-2; half f = dot(fenc,fenc); half g = sqrt(1-f/4); half3 n; n.xy = fencg; n.z = 1-f/2; return n; }
ps_3_0 def c0, 8, 0.5, 0, 0 dcl_texcoord_pp v0.xyz mad_pp r0.x, v0.z, c0.x, c0.x rsq_pp r0.x, r0.x mad_pp oC0.xy, v0, r0.x, c0.y mov_pp oC0.zw, c0.z	ps_3_0 def c0, 4, -2, 0, 1 def c1, 0.25, 0.5, 1, 0 dcl_texcoord2 v0.xy dcl_2d s0 texld_pp r0, v0, s0 mad_pp r0.xy, r0, c0.x, c0.y dp2add_pp r0.z, r0, r0, c0.z mad_pp r0.zw, r0.z, -c1.xyxy, c1.z rsq_pp r0.z, r0.z mul_pp oC0.zw, r0.w, c0.xywz rcp_pp r0.z, r0.z mul_pp oC0.xy, r0, r0.z
4 ALU Radeon HD 2400: 2 GPR, 3.00 clk Radeon HD 3870: 2 GPR, 1.00 clk Radeon HD 5870: 2 GPR, 0.50 clk GeForce 6200: 1 GPR, 4.00 clk GeForce 7800GT: 1 GPR, 2.00 clk GeForce 8800GTX: 5 GPR, 12.00 clk	8 ALU, 1 TEX Radeon HD 2400: 2 GPR, 3.00 clk Radeon HD 3870: 2 GPR, 1.00 clk Radeon HD 5870: 2 GPR, 0.50 clk GeForce 6200: 1 GPR, 6.00 clk GeForce 7800GT: 1 GPR, 3.00 clk GeForce 8800GTX: 6 GPR, 15.00 clk

Encoding

Decoding

half4 encode (half3 n, float3 view)
{
    half p = sqrt(n.z*8+8);
    return half4(n.xy/p + 0.5,0,0);
}

half3 decode (half2 enc, float3 view)
{
    half2 fenc = enc*4-2;
    half f = dot(fenc,fenc);
    half g = sqrt(1-f/4);
    half3 n;
    n.xy = fenc*g;
    n.z = 1-f/2;
    return n;
}

ps_3_0
def c0, 8, 0.5, 0, 0
dcl_texcoord_pp v0.xyz
mad_pp r0.x, v0.z, c0.x, c0.x
rsq_pp r0.x, r0.x
mad_pp oC0.xy, v0, r0.x, c0.y
mov_pp oC0.zw, c0.z

ps_3_0
def c0, 4, -2, 0, 1
def c1, 0.25, 0.5, 1, 0
dcl_texcoord2 v0.xy
dcl_2d s0
texld_pp r0, v0, s0
mad_pp r0.xy, r0, c0.x, c0.y
dp2add_pp r0.z, r0, r0, c0.z
mad_pp r0.zw, r0.z, -c1.xyxy, c1.z
rsq_pp r0.z, r0.z
mul_pp oC0.zw, r0.w, c0.xywz
rcp_pp r0.z, r0.z
mul_pp oC0.xy, r0, r0.z

4 ALU
Radeon HD 2400: 2 GPR, 3.00 clk
Radeon HD 3870: 2 GPR, 1.00 clk
Radeon HD 5870: 2 GPR, 0.50 clk
GeForce 6200: 1 GPR, 4.00 clk
GeForce 7800GT: 1 GPR, 2.00 clk
GeForce 8800GTX: 5 GPR, 12.00 clk

8 ALU, 1 TEX
Radeon HD 2400: 2 GPR, 3.00 clk
Radeon HD 3870: 2 GPR, 1.00 clk
Radeon HD 5870: 2 GPR, 0.50 clk
GeForce 6200: 1 GPR, 6.00 clk
GeForce 7800GT: 1 GPR, 3.00 clk
GeForce 8800GTX: 6 GPR, 15.00 clk

Method #7: Stereographic Projection

What the title says: use Stereographic Projection (Wikipedia link), plus rescaling so that "practically visible" range of normals maps into unit circle (regular stereographic projection maps sphere to circle of infinite size). In my tests, scaling factor of 1.7777 produced best results; in practice it depends on FOV used and how much do you care about normals that point away from the camera.

Suggested by Sean Barrett and Ignacio Castano in comments for this article.

Encoding, Error to Power, Error * 30 images below. MSE: 0.000038; PSNR: 44.147 dB.

Pros:

Quality pretty good!
Quite cheap to encode/decode.

Cons:

Encoding	Decoding
half4 encode (half3 n, float3 view) { half scale = 1.7777; half2 enc = n.xy / (n.z+1); enc /= scale; enc = enc*0.5+0.5; return half4(enc,0,0); }	half3 decode (half4 enc, float3 view) { half scale = 1.7777; half3 nn = enc.xyzhalf3(2scale,2scale,0) + half3(-scale,-scale,1); half g = 2.0 / dot(nn.xyz,nn.xyz); half3 n; n.xy = gnn.xy; n.z = g-1; return n; }
ps_3_0 def c0, 1, 0.281262308, 0.5, 0 dcl_texcoord_pp v0.xyz add_pp r0.x, c0.x, v0.z rcp r0.x, r0.x mul_pp r0.xy, r0.x, v0 mad_pp oC0.xy, r0, c0.y, c0.z mov_pp oC0.zw, c0.w	ps_3_0 def c0, 3.55539989, 0, -1.77769995, 1 def c1, 2, -1, 0, 0 dcl_texcoord2 v0.xy dcl_2d s0 texld_pp r0, v0, s0 mad_pp r0.xyz, r0, c0.xxyw, c0.zzww dp3_pp r0.z, r0, r0 rcp r0.z, r0.z add_pp r0.w, r0.z, r0.z mad_pp oC0.z, r0.z, c1.x, c1.y mul_pp oC0.xy, r0, r0.w mov_pp oC0.w, c0.y
5 ALU Radeon HD 2400: 2 GPR, 4.00 clk Radeon HD 3870: 2 GPR, 1.00 clk Radeon HD 5870: 2 GPR, 0.50 clk GeForce 6200: 1 GPR, 2.00 clk GeForce 7800GT: 1 GPR, 2.00 clk GeForce 8800GTX: 5 GPR, 12.00 clk	7 ALU, 1 TEX Radeon HD 2400: 2 GPR, 4.00 clk Radeon HD 3870: 2 GPR, 1.00 clk Radeon HD 5870: 2 GPR, 0.50 clk GeForce 6200: 1 GPR, 4.00 clk GeForce 7800GT: 1 GPR, 4.00 clk GeForce 8800GTX: 6 GPR, 12.00 clk

Encoding

Decoding

half4 encode (half3 n, float3 view)
{
    half scale = 1.7777;
    half2 enc = n.xy / (n.z+1);
    enc /= scale;
    enc = enc*0.5+0.5;
    return half4(enc,0,0);
}

half3 decode (half4 enc, float3 view)
{
    half scale = 1.7777;
    half3 nn =
        enc.xyz*half3(2*scale,2*scale,0) +
        half3(-scale,-scale,1);
    half g = 2.0 / dot(nn.xyz,nn.xyz);
    half3 n;
    n.xy = g*nn.xy;
    n.z = g-1;
    return n;
}

ps_3_0
def c0, 1, 0.281262308, 0.5, 0
dcl_texcoord_pp v0.xyz
add_pp r0.x, c0.x, v0.z
rcp r0.x, r0.x
mul_pp r0.xy, r0.x, v0
mad_pp oC0.xy, r0, c0.y, c0.z
mov_pp oC0.zw, c0.w

ps_3_0
def c0, 3.55539989, 0, -1.77769995, 1
def c1, 2, -1, 0, 0
dcl_texcoord2 v0.xy
dcl_2d s0
texld_pp r0, v0, s0
mad_pp r0.xyz, r0, c0.xxyw, c0.zzww
dp3_pp r0.z, r0, r0
rcp r0.z, r0.z
add_pp r0.w, r0.z, r0.z
mad_pp oC0.z, r0.z, c1.x, c1.y
mul_pp oC0.xy, r0, r0.w
mov_pp oC0.w, c0.y

5 ALU
Radeon HD 2400: 2 GPR, 4.00 clk
Radeon HD 3870: 2 GPR, 1.00 clk
Radeon HD 5870: 2 GPR, 0.50 clk
GeForce 6200: 1 GPR, 2.00 clk
GeForce 7800GT: 1 GPR, 2.00 clk
GeForce 8800GTX: 5 GPR, 12.00 clk

7 ALU, 1 TEX
Radeon HD 2400: 2 GPR, 4.00 clk
Radeon HD 3870: 2 GPR, 1.00 clk
Radeon HD 5870: 2 GPR, 0.50 clk
GeForce 6200: 1 GPR, 4.00 clk
GeForce 7800GT: 1 GPR, 4.00 clk
GeForce 8800GTX: 6 GPR, 12.00 clk

Method #8: Per-pixel View Space

If we compute view space per-pixel, then Z component of a normal can never be negative. Then just store X&Y, and compute Z.

Suggested by Yuriy O'Donnell on Twitter.

Encoding, Error to Power, Error * 30 images below. MSE: 0.000134; PSNR: 38.730 dB.

Pros:

Cons:

Quite heavy on ALU

Encoding	Decoding
float3x3 make_view_mat (float3 view) { view = normalize(view); float3 x,y,z; z = -view; x = normalize (float3(z.z, 0, -z.x)); y = cross (z,x); return float3x3 (x,y,z); } half4 encode (half3 n, float3 view) { return half4(mul (make_view_mat(view), n).xy0.5+0.5,0,0); } half3 decode (half4 enc, float3 view) { half3 n; n.xy = enc2-1; n.z = sqrt(1+dot(n.xy,-n.xy)); n = mul(n, make_view_mat(view)); return n; }
ps_3_0 def c0, 1, -1, 0, 0.5 dcl_texcoord_pp v0.xyz dcl_texcoord1 v1.xyz mov r0.x, c0.z nrm r1.xyz, v1 mov r1.w, -r1.z mul r0.yz, r1.xxzw, c0.xxyw dp2add r0.w, r1.wxzw, r0.zyzw, c0.z rsq r0.w, r0.w mul r0.xyz, r0, r0.w mul r2.xyz, -r1.zxyw, r0 mad r1.xyz, -r1.yzxw, r0.yzxw, -r2 dp2add r0.x, r0.zyzw, v0.xzzw, c0.z dp3 r0.y, r1, v0 mad_pp oC0.xy, r0, c0.w, c0.w mov_pp oC0.zw, c0.z	ps_3_0 def c0, 2, -1, 1, 0 dcl_texcoord1 v0.xyz dcl_texcoord2 v1.xy dcl_2d s0 mov r0.y, c0.w nrm r1.xyz, v0 mov r1.w, -r1.z mul r0.xz, r1.zyxw, c0.yyzw dp2add r0.w, r1.wxzw, r0.xzzw, c0.w rsq r0.w, r0.w mul r0.xyz, r0, r0.w mul r2.xyz, -r1.zxyw, r0.yzxw mad r2.xyz, -r1.yzxw, r0.zxyw, -r2 texld_pp r3, v1, s0 mad_pp r3.xy, r3, c0.x, c0.y mul r2.xyz, r2, r3.y mad r0.xyz, r3.x, r0, r2 dp2add_pp r0.w, r3, -r3, c0.z rsq_pp r0.w, r0.w rcp_pp r0.w, r0.w mad_pp oC0.xyz, r0.w, -r1, r0 mov_pp oC0.w, c0.w
17 ALU Radeon HD 2400: 3 GPR, 11.00 clk Radeon HD 3870: 3 GPR, 2.75 clk Radeon HD 5870: 2 GPR, 0.80 clk GeForce 6200: 4 GPR, 12.00 clk GeForce 7800GT: 4 GPR, 8.00 clk GeForce 8800GTX: 8 GPR, 24.00 clk	21 ALU, 1 TEX Radeon HD 2400: 3 GPR, 11.00 clk Radeon HD 3870: 3 GPR, 2.75 clk Radeon HD 5870: 2 GPR, 0.80 clk GeForce 6200: 3 GPR, 12.00 clk GeForce 7800GT: 3 GPR, 9.00 clk GeForce 8800GTX: 12 GPR, 29.00 clk

Encoding

Decoding

float3x3 make_view_mat (float3 view)
{
    view = normalize(view);
    float3 x,y,z;
    z = -view;
    x = normalize (float3(z.z, 0, -z.x));
    y = cross (z,x);
    return float3x3 (x,y,z);
}
half4 encode (half3 n, float3 view)
{
    return half4(mul (make_view_mat(view), n).xy*0.5+0.5,0,0);
}
half3 decode (half4 enc, float3 view)
{
    half3 n;
    n.xy = enc*2-1;
    n.z = sqrt(1+dot(n.xy,-n.xy));
    n = mul(n, make_view_mat(view));
    return n;
}

ps_3_0
def c0, 1, -1, 0, 0.5
dcl_texcoord_pp v0.xyz
dcl_texcoord1 v1.xyz
mov r0.x, c0.z
nrm r1.xyz, v1
mov r1.w, -r1.z
mul r0.yz, r1.xxzw, c0.xxyw
dp2add r0.w, r1.wxzw, r0.zyzw, c0.z
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w
mul r2.xyz, -r1.zxyw, r0
mad r1.xyz, -r1.yzxw, r0.yzxw, -r2
dp2add r0.x, r0.zyzw, v0.xzzw, c0.z
dp3 r0.y, r1, v0
mad_pp oC0.xy, r0, c0.w, c0.w
mov_pp oC0.zw, c0.z

ps_3_0
def c0, 2, -1, 1, 0
dcl_texcoord1 v0.xyz
dcl_texcoord2 v1.xy
dcl_2d s0
mov r0.y, c0.w
nrm r1.xyz, v0
mov r1.w, -r1.z
mul r0.xz, r1.zyxw, c0.yyzw
dp2add r0.w, r1.wxzw, r0.xzzw, c0.w
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w
mul r2.xyz, -r1.zxyw, r0.yzxw
mad r2.xyz, -r1.yzxw, r0.zxyw, -r2
texld_pp r3, v1, s0
mad_pp r3.xy, r3, c0.x, c0.y
mul r2.xyz, r2, r3.y
mad r0.xyz, r3.x, r0, r2
dp2add_pp r0.w, r3, -r3, c0.z
rsq_pp r0.w, r0.w
rcp_pp r0.w, r0.w
mad_pp oC0.xyz, r0.w, -r1, r0
mov_pp oC0.w, c0.w

17 ALU
Radeon HD 2400: 3 GPR, 11.00 clk
Radeon HD 3870: 3 GPR, 2.75 clk
Radeon HD 5870: 2 GPR, 0.80 clk
GeForce 6200: 4 GPR, 12.00 clk
GeForce 7800GT: 4 GPR, 8.00 clk
GeForce 8800GTX: 8 GPR, 24.00 clk

21 ALU, 1 TEX
Radeon HD 2400: 3 GPR, 11.00 clk
Radeon HD 3870: 3 GPR, 2.75 clk
Radeon HD 5870: 2 GPR, 0.80 clk
GeForce 6200: 3 GPR, 12.00 clk
GeForce 7800GT: 3 GPR, 9.00 clk
GeForce 8800GTX: 12 GPR, 29.00 clk

Performance Comparison

GPU performance comparison in a single table:

Encoding, GPU cycles
	#1: X & Y	#3: Spherical	#4: Spheremap	#7: Stereo	#8: PPView
Radeon HD2400	1.00	17.00	3.00	4.00	11.00
Radeon HD5870	0.50	0.95	0.50	0.50	0.80
GeForce 6200	1.00	12.00	4.00	2.00	12.00
GeForce 8800	7.00	43.00	12.00	12.00	24.00
Decoding, GPU cycles
Radeon HD2400	1.00	17.00	3.00	4.00	11.00
Radeon HD5870	0.50	0.95	0.50	1.00	0.80
GeForce 6200	4.00	7.00	6.00	4.00	12.00
GeForce 8800	15.00	23.00	15.00	12.00	29.00
Encoding, D3D ALU+TEX instruction slots
SM3.0	1	26	4	5	17
Decoding, D3D ALU+TEX instruction slots
SM3.0	8	18	9	8	22

Quality Comparison

Quality comparison in a single table. PSNR based, higher numbers are better.

Method	PSNR, dB
#1: X & Y	18.629
#3: Spherical	42.042
#4: Spheremap	48.071
#7: Stereographic	44.147
#8: Per pixel view	38.730

Changelog

2021 08 17: Added a note about octahedral encodings at the top of the post.
2010 03 25: Added Method #8: Per-pixel View Space. Suggested by Yuriy O'Donnell.
2010 03 24: Stop! Everything before was wrong! Old article moved here.
2009 08 12: Added Method #7: Stereographic projection. Suggested by Sean Barrett and Ignacio Castano.
2009 08 12: Optimized Method #5, suggested by Steve Hill.
2009 08 08: Added power difference images.
2009 08 07: Optimized Method #4: Sphere map. Suggested by Irenee Caroulle.
2009 08 07: Added Method #6: Lambert Azimuthal Equal Area. Suggested by Sean Barrett.
2009 08 05: Added Method #5: Cry Engine 3. Suggested by Steve Hill.
2009 08 05: Improved quality of Method #3a: round values in texture LUT.
2009 08 05: Added MSE and PSNR values for all methods.
2009 08 04: Added Method #3a: Spherical Coordinates w/ texture LUT.
2009 08 04: Method #1: 1-dot(n.xy,n.xy) is slightly better than 1-n.x*n.x-n.y*n.y (better pipelining on NV and ATI). Suggested by Arseny "zeux" Kapoulkine.