Non Power of Two Textures

Support for non power of two (“NPOT”, i.e. arbitrary sized) textures has been in GPUs for quite a while, but the state of support can be confusing. Recent question from @rygorous:

Lazyweb, <…> GL question: <…> ARB NPOT textures is fairly demanding, and texture rectangle are awkward. Is there an equivalent to ES2-style semantics in regular GL? Bonus Q: Is there something like the Unity HW survey, but listing supported GL extensions? :)

There are generally three big types of texture size support:

  • Full support for arbitrary texture sizes. This includes mipmaps, all texture wrap & filtering modes, and most often compressed texture formats as well.
  • “Limited” support for non-power-of-two sizes. No mipmaps, texture wrap mode has to be Clamp, but does allow texture filtering. This makes such textures not generally useful in 3D space, but just good enough for anything in screen space (UI, 2D, postprocessing).
  • No support, texture sizes have to be powers of two (16,32,64,128,…). If you’re running on really, really old GPU then textures might also need to be square (width = height).

Direct3D 9

Things are quite simple here. D3DCAPS9.TextureCaps has the relevant capability bits: D3DPTEXTURECAPS_POW2 and D3DPTEXTURECAPS_NONPOW2CONDITIONAL both being off indicates full support for NPOT texture sizes. When both the POW2 and NONPOW2CONDITIONAL bits are on, you have limited NPOT support.
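
In code, the check could look roughly like this (a minimal sketch; it assumes an already created IDirect3DDevice9*, and the three-way enum is just my naming for the categories in the list above):

#include <d3d9.h>

enum NPOTSupport { kNPOTNone, kNPOTLimited, kNPOTFull };

NPOTSupport GetNPOTSupportD3D9 (IDirect3DDevice9* device)
{
	D3DCAPS9 caps;
	if (FAILED(device->GetDeviceCaps(&caps)))
		return kNPOTNone; // be conservative if the query fails

	const bool pow2 = (caps.TextureCaps & D3DPTEXTURECAPS_POW2) != 0;
	const bool cond = (caps.TextureCaps & D3DPTEXTURECAPS_NONPOW2CONDITIONAL) != 0;
	if (!pow2 && !cond)
		return kNPOTFull;    // both bits off: arbitrary sizes, mipmaps, wrapping
	if (pow2 && cond)
		return kNPOTLimited; // "conditional" NPOT: no mips, clamp-only wrap
	return kNPOTNone;        // POW2 alone: power-of-two sizes only
}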

I’ve no idea what it would mean if the NONPOW2CONDITIONAL bit is set but the POW2 bit is not.

Hardware-wise, limited NPOT has been generally available since 2002-2004, and full NPOT since 2006 or so.

Direct3D 11

Very simple: feature level 10_0 and up has full support for NPOT sizes, while feature level 9_x has limited NPOT support. MSDN link.
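
The runtime check is correspondingly short (a sketch, assuming an already created ID3D11Device*):

#include <d3d11.h>

bool HasFullNPOTSupportD3D11 (ID3D11Device* device)
{
	// Feature level 10_0 and up: full NPOT. Feature level 9_x: limited NPOT only.
	return device->GetFeatureLevel() >= D3D_FEATURE_LEVEL_10_0;
}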

OpenGL

Support for NPOT textures has been a core OpenGL feature since 2.0; promoted from earlier ARB_texture_non_power_of_two extension. The extension specifies “full” support for NPOT, including mipmaps & wrapping modes. There’s no practical way to detect hardware that can only do “limited” NPOT textures.

However, in the traditional OpenGL spirit, the presence of something in the API does not mean it can run on the GPU… E.g. Mac OS X with a Radeon X1600 GPU is an OpenGL 2.1+ system, and as such pretends there’s full support for NPOT textures. In practice, as soon as you use an NPOT size with mipmaps or a non-clamp wrap mode, you drop into software rendering. Ouch.

A rule of thumb that seems to work: try to detect “DX10+” level GPU, and in that case assume full NPOT support is actually there. Otherwise, in GL2.0+ or when ARB_texture_non_power_of_two is present, only assume limited NPOT support.

Then the question of course is, how to detect a DX10+ level GPU in OpenGL? If you’re using OpenGL 3+, then you are on a DX10+ GPU. On earlier GL versions you’d have to use some heuristics. For example, having ARB_fragment_program with GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB less than 4096 is a pretty good indicator of pre-DX10 hardware, on Mac OS X at least. Likewise, you could query GL_MAX_TEXTURE_SIZE; anything lower than 8192 is a good indicator of pre-DX10.
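
Putting the rule of thumb and the heuristics together, a sketch could look like this (assumes a current GL context and that extension loading is handled elsewhere, e.g. via GLEW; the thresholds are the heuristics from above, not hard guarantees):

#include <GL/glew.h>
#include <cstring>

bool LooksLikeDX10LevelGPU ()
{
	// OpenGL 3.0+ implies a DX10+ GPU.
	const char* version = (const char*)glGetString(GL_VERSION);
	if (version && version[0] >= '3')
		return true;

	// Pre-DX10 GPUs top out at a max texture size of 4096 or less.
	GLint maxTexSize = 0;
	glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxTexSize);
	if (maxTexSize < 8192)
		return false;

	// A small native instruction limit is another pre-DX10 indicator.
	const char* ext = (const char*)glGetString(GL_EXTENSIONS);
	if (ext && strstr(ext, "GL_ARB_fragment_program"))
	{
		GLint maxInstr = 0;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB, &maxInstr);
		if (maxInstr < 4096)
			return false;
	}
	return true;
}

// Rule of thumb: assume full NPOT only when this returns true; otherwise, on GL 2.0+
// or with ARB_texture_non_power_of_two present, assume limited NPOT support only.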

OpenGL ES

OpenGL ES 3.0 has full NPOT support in core; ES 2.0 has limited NPOT support (no mipmaps, no Repeat wrap mode) in core; and ES 1.1 has no NPOT support.
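
Staying within the ES 2.0 “limited” rules looks like this (a minimal sketch: clamp-only wrapping, filtering allowed, no mipmaps; the function name and parameters are mine):

#include <GLES2/gl2.h>

GLuint CreateLimitedNPOTTexture (int width, int height, const void* pixels)
{
	GLuint tex = 0;
	glGenTextures(1, &tex);
	glBindTexture(GL_TEXTURE_2D, tex);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE); // clamp only, no Repeat
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);    // no mipmapped min filter
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
	return tex;
}

// e.g. CreateLimitedNPOTTexture(640, 360, pixels) for a screen-space UI texture.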

For ES 1.1 and 2.0, full NPOT support comes with the GL_ARB_texture_non_power_of_two or GL_OES_texture_npot extensions. In practice, iOS devices don’t support these; on the Android side there’s support on Qualcomm Adreno and ARM Mali, and possibly some others.

For ES 1.1, limited NPOT support comes with GL_APPLE_texture_2D_limited_npot (all iOS devices) or GL_IMG_texture_npot (some ImgTec Android devices I guess).

WebGL and Native Client currently are pretty much OpenGL ES 2.0, and thus support limited NPOT textures.

Flash Stage3D

Sad situation here; the current version of Stage3D (as of Flash Player 11.4) has no NPOT support whatsoever. Not even for render targets.

Consoles

I wouldn’t be allowed to say anything about them, now would I? :) Check documentation that comes with your developer access on each console.

Is there something like the Unity HW survey, but listing supported GL extensions?

Not a single good resource, but there are various bits and pieces:

  • Unity Web Player hardware stats - no GL extensions, but we do map GPUs to “DX-like” levels and from there you can sort of infer things.
  • Steam Hardware Survey - likewise.
  • Apple OpenGL Capabilities - tables of detailed GL info for recent OS X versions and all GPUs supported by Apple. Excellent resource for Mac programmers! Apple does remove pages for old OS X versions from there, so you’d have to use archive.org to get them back.
  • GLBenchmark Results - results database of GLBenchmark, has list of OpenGL ES extensions and some caps bits for each result (“GL config.” tab).
  • Realtech VR GLView - OpenGL & OpenGL ES extensions and capabilities viewer, for Windows, Mac, iOS & Android.
  • Omni Software Update Statistics - Mac specific, but does list some OpenGL stats from “Graphics” dropdown.
  • KluDX - Windows app that shows information about GPU; also has reports of submitted data.
  • 0 A.D. OpenGL capabilities database - OpenGL capabilities submitted by players of 0 A.D. game.

Cross Platform Shaders in 2012

Update: Cross Platform Shaders in 2014.

From about 2002 to 2009 the de facto shader language for games was HLSL. Everyone on PCs was targeting Windows through Direct3D, Xbox 360 used HLSL as well, and PlayStation 3 used Cg, which for all practical purposes is the same as HLSL. There were very few people targeting Mac OS X or Linux, or using OpenGL on Windows. One shader language ruled the world, and everything was rosy. You could close your eyes and pretend OpenGL with its GLSL language just did not exist.

Then a series of events happened, and all of a sudden OpenGL is needed again! iOS and Android are too big to be ignored (which means OpenGL ES + GLSL), and making games for Mac OS X or Linux isn’t a crazy idea either. This little WebGL thing that excites many hackers uses a variant of GLSL as well.

Now we have a problem; we have two similar but subtly different shading languages to deal with. I wrote about how we deal with this at Unity, and not much has changed since 2010. The “recommended way” is still writing HLSL/Cg, and we cross-compile into GLSL for platforms that need it.

But what about the future?

It could happen that the importance of HLSL (and Direct3D) will decrease over time; this largely depends on what Microsoft is going to do. But just like OpenGL became important again just as it seemed to become irrelevant, so could Direct3D. Or something completely new. I’ll assume that for several years into the future, we’ll need to deal with at least two shading languages.

There are several approaches at handling the problem, and several solutions in that space, at varying levels of completeness.

#1. Do it all by hand!

“Just write all shaders twice”. Ugh. That’s not “web scale” so we’ll just discard this approach.

A slightly related approach is to have a library of preprocessor macros & function definitions, and use them in the places where HLSL & GLSL differ. This is certainly doable; take a look at FXAA for a good example. The downside is that you really need to know all the tiny differences between the languages. HLSL’s fmod() and GLSL’s mod() sound like they do the same thing, but are subtly different - and there are many more places like this.
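
To make that one difference concrete, here’s the same pair expressed in plain C++ (a sketch; the function names are made up for illustration):

#include <cmath>

// HLSL fmod(x, y) truncates toward zero, so the result takes the sign of x.
float hlsl_fmod (float x, float y) { return x - y * std::trunc(x / y); }

// GLSL mod(x, y) uses floor, so the result takes the sign of y.
float glsl_mod (float x, float y) { return x - y * std::floor(x / y); }

// hlsl_fmod(-1.5f, 2.0f) == -1.5f, while glsl_mod(-1.5f, 2.0f) == 0.5f.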

#2. Don’t use HLSL nor GLSL: treat them as shader backends

You could go for fully graphical shader authoring. Drag some nodes around, connect them, and have shader “baking” code that can spit out HLSL, GLSL, or anything else that is needed. This is a big enough topic by itself; graphical shader editing has a lot more uses at the “describing material properties” level than it has at lower levels (who’d want to write a deferred rendering light pass shader using nodes & lines?).

You could also use a completely different language that compiles down to HLSL or GLSL. I’m not aware of any big uses in realtime graphics, but recent examples could be Open Shading Language (in film) or AnySL (in research).

#3. Cross-compile HLSL source into GLSL or vice versa

Parse shader source in one language, produce some intermediate representation, massage / verify that representation as needed, “print” it into another language. Some solutions exist here, for example:

  • hlsl2glslfork does DX9 HLSL -> GLSL 1.10 / ES 1.00 translation. Used by Unity, and judging from pull requests and pokes I get, in several dozen other game development shops.
  • ANGLE does GLSL ES 1.00 -> DX9 HLSL. Used by WebGL implementation in Chrome and Firefox.
  • Cg compiles Cg (“almost the same as HLSL”) into various backends, including D3D9 shader assembly and various versions of GLSL, with mixed success. No support for compiling into D3D10+ shader bytecode as far as I can tell.

A big limitation of the first two libraries above is that they only do “DX9 level” shaders, so to speak. No support for DX10/11 style HLSL syntax (which Microsoft has changed a lot), and no support for the correspondingly higher GLSL versions (GLSL 3.30+, GLSL ES 3.00). At least right now.

Call to action! There seems to be a need for source-level translation libraries for DX10/GL3+ style language syntax & feature sets. I’m not sure if it makes sense to extend the above libraries, or to start from scratch… But we need a good quality, liberally licensed, open source, well maintained & tested package to do this. It shouldn’t be hard, and it probably doesn’t make sense for everyone to try to roll their own. github & bitbucket make collaboration a snap, let’s do it.

If anyone at Microsoft is reading this: it would really help to have a formal grammar of HLSL available. “Reference for HLSL” on MSDN has tiny bits and pieces scattered around, but that seems both incomplete and hard to assemble into a single grammar.

A building block could be Mesa or its smaller fork, GLSL Optimizer (see related blog post). It has a decent intermediate representation (IR) for shaders, a bunch of cleanup/optimization/lowering passes, a GLSL parser and a GLSL printer (in GLSL Optimizer). It could be extended to parse HLSL and/or print HLSL. It’s currently lacking most DX11/GL4 features, and some DX10/GL3 features, in the IR. But it’s under active development, so I hope it will get those soon.

MojoShader also has an in-progress HLSL parser and translator to GLSL.

#4. Translate compiled shader bytecode into GLSL

Take HLSL, compile it down to bytecode, parse that bytecode and generate corresponding “low level” GLSL. Right now this would only go one way, as GLSL does not have a cross platform “compiled shader” representation. Though with recent OpenCL getting SPIR, maybe there’s hope in OpenGL getting something similar in the future?

This is a lot simpler to do than parsing the full high level language, and a ton of platform differences go away (the ones that are handled purely at syntax level, e.g. function overloading, type promotion etc.). A possible downside is that HLSL bytecode might be “too optimized” - all the hard work on register packing & allocation, loop unrolling etc. is not that much needed here. Any convention, like whether your matrices are column-major or row-major, is also “baked into” the resulting shader, so your D3D and GL rendering code had better match there.

Several existing libraries in this space:

What now?

Go and make solutions to the approaches above, especially #3 and #4! Cross-platform shader developers all around the world will thank you. All twenty of them, or something ;)

If you’re a student looking for an entry into the industry as a programmer: this is a perfect example of a freetime / university project! It’s self-contained, it has clear goals, and above all, it’s actually useful for the real world. A non-crappy implementation of a library like this would almost certainly land you a job at Unity and I guess many other places.


Careful with That Initializer, Eugene

I was profiling something, and noticed that HighestBit() was taking a suspiciously large amount of time. So I looked at the code. It had some platform specific implementations, but the cross-platform fallback was this:

// index of the highest bit set, or -1 if input is zero
inline int HighestBitRef (UInt32 mask)
{
	int base = 0;
	if ( mask & 0xffff0000 )
	{
		base = 16;
		mask >>= 16;
	}
	if ( mask & 0x0000ff00 )
	{
		base += 8;
		mask >>= 8;
	}
	if ( mask & 0x000000f0 )
	{
		base += 4;
		mask >>= 4;
	}
	const int lut[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3};
	return base + lut[ mask ];
}

Not the best implementation of the functionality, but probably not the worst either. Takes three branches, and then a small look-up table.

Notice anything suspicious?

Let’s take a look at the assembly (MSVC 2010, x86).

; int HighestBitRef (UInt32 mask)
push        ebp  
mov         ebp,esp  
sub         esp,44h  
mov         eax,dword ptr [___security_cookie] ; MSVC stack-smashing protection
xor         eax,ebp  
mov         dword ptr [ebp-4],eax  
; int base = 0;
mov         ecx,dword ptr [ebp+8]  
xor         edx,edx  
; if ( mask & 0xffff0000 )
test        ecx,0FFFF0000h  
je          _lbl1
mov         edx,10h  ; base = 16;
shr         ecx,10h  ; mask >>= 16;
_lbl1: ; if ( mask & 0x0000ff00 )
test        ecx,0FF00h  
je          _lbl2
add         edx,8  ; base += 8;
shr         ecx,8  ; mask >>= 8;
_lbl2: ; if ( mask & 0x000000f0 )
test        cl,0F0h  
je          _lbl3
add         edx,4  ; base += 4;
shr         ecx,4  ; mask >>= 4;
_lbl3:
; const int lut[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3};
mov         eax,1  
mov         dword ptr [ebp-3Ch],eax  
mov         dword ptr [ebp-38h],eax  
mov         eax,2  
mov         dword ptr [ebp-34h],eax  
mov         dword ptr [ebp-30h],eax  
mov         dword ptr [ebp-2Ch],eax  
mov         dword ptr [ebp-28h],eax  
mov         eax,3  
mov         dword ptr [ebp-24h],eax  
mov         dword ptr [ebp-20h],eax  
mov         dword ptr [ebp-1Ch],eax  
mov         dword ptr [ebp-18h],eax  
mov         dword ptr [ebp-14h],eax  
mov         dword ptr [ebp-10h],eax  
mov         dword ptr [ebp-0Ch],eax  
mov         dword ptr [ebp-8],eax  
mov         dword ptr [ebp-44h],0FFFFFFFFh  
mov         dword ptr [ebp-40h],0  
; return base + lut[ mask ];
mov         eax,dword ptr [ebp+ecx*4-44h]  
mov         ecx,dword ptr [ebp-4]  
xor         ecx,ebp  
add         eax,edx  
call        functionSearch+1 ; MSVC stack-smashing protection
mov         esp,ebp  
pop         ebp  
ret  

Ouch. It is creating that look-up table. Each. And. Every. Time.

Well, the code asked for that: const int lut[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3}, so the compiler does exactly what it was told. Could the compiler be smarter, notice that the table is actually always constant, and put that into the data segment? “I would if I was a compiler, and I’m not even smart!” The compiler could do this, I guess, but it does not have to. More often than not, if you’re expecting the compiler to “be smart”, it will do the opposite.

So the above code, it fills the table. This makes the function long enough that the compiler decides to not inline it. And since it’s filling up some table on the stack, MSVC’s “stack protection” code bits come into play (which are on by default), making the code even longer.

I’ve done a quick test and timed how long this takes: for (int i = 0; i < 100000000; ++i) sum += HighestBitRef(i); on a Core i7-2600K @ 3.4GHz… 565 milliseconds.
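
A minimal sketch of such a timing loop, for anyone who wants to reproduce it (assumes the HighestBitRef above; Windows-only timing, and GetTickCount is coarse but good enough at this granularity):

#include <windows.h>
#include <cstdio>

int main ()
{
	unsigned sum = 0; // keep a running sum so the loop isn't optimized away
	DWORD t0 = GetTickCount();
	for (int i = 0; i < 100000000; ++i)
		sum += HighestBitRef(i);
	DWORD t1 = GetTickCount();
	printf("%u ms (sum=%u)\n", (unsigned)(t1 - t0), sum);
	return 0;
}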

The fix? Do not initialize the lookup table each time!

const int kHighestBitLUT[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3};

inline int HighestBitRef (UInt32 mask)
{
	// ...
	return base + kHighestBitLUT[ mask ];
}

Note: I could have just put a static const int lut[] inside the original function. But it sounded like that might not be thread-safe (at least similar initialization of more complex objects isn’t; I wasn’t sure about array initializers). A quick test with MSVC 2010 reveals that it is thread-safe, but I wouldn’t want to rely on that.
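
For reference, that alternative written out in full (a sketch; with a constant initializer like this the table typically ends up in the data segment anyway, and C++11 additionally guarantees thread-safe initialization of function-local statics):

inline int HighestBitRefStatic (UInt32 mask)
{
	// same table as before, but only ever initialized once
	static const int lut[] = {-1,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3};
	int base = 0;
	if ( mask & 0xffff0000 ) { base = 16; mask >>= 16; }
	if ( mask & 0x0000ff00 ) { base += 8; mask >>= 8; }
	if ( mask & 0x000000f0 ) { base += 4; mask >>= 4; }
	return base + lut[ mask ];
}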

How much faster now? 298 milliseconds if explicitly non-inlined, 110 ms when inlined. Five times faster by moving one line up! For completeness’ sake, using the MSVC _BitScanReverse intrinsic (__builtin_clz in gcc), which compiles down to the x86 BSR instruction, takes 94 ms in the same test.
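
The intrinsic version, for reference (a sketch covering both compilers mentioned above):

#ifdef _MSC_VER
#include <intrin.h>
inline int HighestBitIntrinsic (UInt32 mask)
{
	unsigned long index;
	return _BitScanReverse(&index, mask) ? (int)index : -1; // compiles down to BSR
}
#else
inline int HighestBitIntrinsic (UInt32 mask)
{
	return mask ? 31 - __builtin_clz(mask) : -1; // __builtin_clz is undefined for 0
}
#endif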

So… yeah. Careful with those initializers.


Tiled Forward Shading links

The main idea of my previous post was roughly this: in forward rendering, there’s no reason why we still have to use per-object light lists. We can apply roughly the same ideas as those of tiled deferred shading.

Really nice to see that other people have thought about this before, or at about the same time; here are some links:

As Andrew Lauritzen points out in the comments of my previous post, claiming “but deferred will need super-fat G-buffers!” is an over-simplification. You could just as well store material indices plus data for sampling textures (UVs + derivatives); and going “deferred” you have more choices in how you schedule your computations.

There’s no fundamental difference between “forward” and “deferred” these days. As soon as you have a Z-prepass you are already caching/deferring something, and then it’s a whole spectrum of options of what and how to cache or “defer” for later computation.

Ultimately of course, the best approach depends on a million factors. The only lesson to take from this post is that “forward rendering does not have to use per-object light lists”.


2012 Theory for Forward Rendering

Good question in a tweet by @ivanassen:

So what is the 2012 theory on lights in a forward renderer?

Hard to answer that in 140 characters, so here goes a raw brain dump (warning: not checked in practice!).

Short answer

A modern forward renderer for DX11-class hardware would probably be something like AMD’s Leo demo.

They seem to be doing light culling in a compute shader, and the result is per-pixel / per-tile linked lists of lights. Then the scene is rendered normally with forward rendering, fetching the light lists and computing shading. The advantages are many: arbitrary shading models with many parameters that would be hard to store in a G-buffer; semitransparent objects; hardware MSAA support; much smaller memory requirements compared to some fat G-buffer layout.

Disadvantages would be storing linked lists, I guess. Potentially unbounded memory usage here, though I guess various schemes similar to Adaptive Transparency could be used to cap the maximum number of lights per pixel/tile.

Deferred == Caching

All the deferred lighting/shading approaches are essentially caching schemes. We cache some amount of surface information, in screen space, in order to avoid fetching or computing the same information over and over again, while applying lights one by one in traditional forward rendering.

Now, the “cache in screenspace” leads to disadvantages like “it’s really hard to do transparencies” - since with transparencies you do not have one point in space mapping to one pixel on screen anymore. There’s no reason why caching should be done in screen space however; lighting could also just as well be computed in texture space (like some skin rendering techniques, but they do it for a different reason), world space (voxels?), etc.

Does “modern” forward rendering still need caching?

Caching information was important because in DX9 / Shader Model 3 times it was hard to do forward rendering that could apply an almost arbitrary, variable number of lights - with good efficiency - in one pass. That led to either shader combination explosion, or inefficient multipass rendering, or both. But now we have DX11, compute, structured buffers and unordered access views, so maybe we can actually do better?

Because at some point we will want BRDFs with more parameters than it is viable to store in a G-buffer (side image: that’s half of the parameters for one material). We will want many semitransparent objects. And then we’re back to square one; we cannot efficiently do this in a traditional “deferred” way where we cache N numbers per pixel.

AMD’s Leo goes in that direction. It seems to be a blend of tiled deferred approaches to light culling, applied to forward rendering.

I imagine it doing something like:

  1. Z-prepass:

    1. Render Z prepass of opaque objects to fill in depth buffer.
    2. Store that away (copy into another depth buffer).
    3. Continue Z prepass of transparent objects; writing to depth.
    4. Now we have two Z buffers, and for any pixel we know the Z-extents of anything interesting in it (from the closest transparent object up to the closest opaque surface).
  2. Shadowmaps, as usual. Would need to keep all shadowmaps for all lights in memory, which can be a problem!

  3. Light culling, very similar to what you’d do in the tiled deferred case!

    1. Have all lights stored in a buffer. Light types, positions/directions/ranges/angles, colors etc.
    2. From the two depth buffers above, we can compute Z ranges per pixel/tile in order to do better light culling.
    3. Run a compute shader that does light culling. Could do this per pixel or per small tile (e.g. 8x8). The result is a buffer / list per pixel or tile, with the lights that affect said pixel or tile (see the CPU-side sketch after this list).
  4. Render objects in forward rendering:

    1. Z-buffer is already pre-filled in 1.1.
    2. Each shader would have to do an “apply all lights that affect this pixel/tile” computation. So that would involve fetching that arbitrary light information, looping over the lights etc.
    3. Otherwise, each object is free to use as many shader parameters as it wants, or use any BRDF it wants.
    4. Rendering order is like usual forward rendering; batch-friendly order (since Z is prefilled already) for opaque, per-object or per-triangle back-to-front order for semitransparent objects.
  5. Profit!
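
A rough CPU-style sketch of the per-tile culling test in step 3 above (the real thing would be a compute shader; the structures here are made up for illustration, and only the view-space Z-range test is shown - a real implementation would also test the tile’s frustum side planes):

#include <vector>

// Hypothetical types, for illustration only.
struct PointLight { float x, y, z, range; };     // view-space position + range
struct Tile       { float minDepth, maxDepth; }; // Z extents from the two depth buffers

std::vector<int> CullLightsForTile (const Tile& tile, const std::vector<PointLight>& lights)
{
	std::vector<int> visible;
	for (size_t i = 0; i < lights.size(); ++i)
	{
		const PointLight& l = lights[i];
		// Keep the light if its sphere of influence overlaps the tile's Z range.
		if (l.z + l.range >= tile.minDepth && l.z - l.range <= tile.maxDepth)
			visible.push_back((int)i);
	}
	return visible; // per-tile light list, consumed by the forward pass in step 4
}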

Now, I have hand-waved over some potentially problematic details.

For example, “two depth buffers” is not robust for cases where there are no opaque objects in some area; we’d need to track minimum and maximum depths of semitransparent stuff, or accept worse light culling for those tiles. Likewise, copying the depth buffer might lose some hardware Hi-Z information, so in practice it could be better to track semitransparent depths using another approach (min/max blending of a float texture etc.).

The step 4.2 bit about “let’s apply all lights” assumes there is some way to do that efficiently, while supporting complicated things like each light having a different cookie/gobo texture, or a different shadowmap etc. Texture arrays could almost certainly be used here, but since this is just a brain dump without verification in practice, it’s hard to say how this would work.

Update: other papers came out describing almost the same idea, with actual implementations & measurements. Check them out here!