Mobile Hardware Stats (and more)

Short summary: Unity’s hardware stats page now has a “mobile” section. Which is exactly what it says, hardware statistics of people playing Unity games on iOS & Android. Go to stats.unity3d.com and enjoy.

Some interesting bits:

Operating systems

iOS uptake is crazy high: 98% of the market has iOS version that’s not much older than one year (iOS 5.1 was released in 2012 March). You’d be quite okay targetting just 5.1 and up!

Android uptake is… slightly different. 25% of the market is still on Android 2.3, which is almost two and a half years old (2010 December). Note that for all practical reasons Android 3.x does not exist ;)

Windows XP in the Web Player is making a comeback at 48% of the market. Most likely explained by “Asia”, see geography below.

  • Windows Vista could be soon dropped, almost no one is using it anymore. XP… not dropping that just yet :(
  • 64 bit Windows is still not the norm.

Geography

Android is big in United States (18%), China (13%), Korea (12%), Japan (6%), Russia (4%), Taiwan (4%) – mostly Asia.

iOS is big in United States (30%), United Kingdom (10%), China (7%), Russia (4%), Canada (4%), Germany (4%) – mostly “western world”.

Looking at Web Player, China is 28% while US is only 12%!

GPU

GPU makers on Android: Qualcomm 37%, ARM 32%, Imagination 22%, NVIDIA 6%.

  • You wouldn’t guess NVIDIA is in the distant 4th place, would you?
  • ARM share is almost entirely Mali 400. Strangely enough, almost no latest generation (Mali T6xx) devices.
  • OpenGL ES 3.0 capable devices are 4% right now, almost exclusively pulled forward by Qualcomm Adreno 320.
  • On iOS, Imagination is 100% of course…

No big changes on the PC:

  • Intel slowly rising, NVIDIA & AMD flat, others that used to exist (S3 & SIS) don’t exist anymore.
  • GPU capabilities increasing, though shader model 5.0 uptake seems slower than SM4.0 was.
  • Due to rise of Windows XP, “can actually use DX10+” is decreasing :(

Devices

On Android, Samsung is king with 55% of the market. No wonder it takes majority of the Android profits I guess. The rest is split by umpteen vendors (Sony, LG, HTC, Amazon etc.).

Most popular devices are various Galaxy models. Out of non-Samsung ones, Kindle Fire (4.3%), Nexus 7 (1.5%) and then it goes into “WAT? I guess Asia” territory with Xiaomi MI-One (1.2%) and so on.

On iOS, Apple has 100% share (shocking right?). There’s no clear leader in device model; iPhone 4S (18%), iPhone 5 (16%), iPad 2 (16%), iPhone 4 (14%), iPod Touch 4 (10%).

Interesting that first iPad can be pretty much ignored now (1.5%), whereas iPad 2 is still more popular than any of the later iPad models.

CPU

Single core CPUs are about 27% on both Android & iOS. The rest on iOS is all dual-core CPUs, whereas almost a quarter of Androids have four cores!

ARMv6 can be quite safely ignored. Good.

On PC, the “lots and lots of cores!” future did not happen - majority are dual core, and 4 core CPU growth seemingly stopped at 23% (though again, maybe explained by rise of Asia?).

FAQ

How big is this data set exactly?

Millions and millions. We track the data at quarterly granularity, and in the last quarter mobile has been about 200 million devices (yes really!); whereas web player has been 36 million machines.

Why no “All” section in mobile pages, with both Android & iOS?

We’ve added hardware stats tracking on Android earlier, so there are more Unity games made with it out there. Would be totally unfair “market share” - right now, 250 million Android devices and “only” 4 million iOS devices are represented in the stats. As more developers move to more recent Unity versions, the market share will level out and then we’ll add “All” section.

Nice charts, what did you use?

Flot. It is nice! I added “track by area” option to it.

How often is stats.unity3d.com page updated?

Roughly once a month.


"Parallel for" in Apple's GCD

I was checking out OpenSubdiv and noticed that on a Mac it’s not exactly “massively parallel”. Neither of OpenGL backends work (transform feedback one requires GL 4.2, and compute shader one requires GL 4.3 - but Macs right now can only do GL 3.2), OpenCL backend is much slower than the CPU one (OS X 10.7, GeForce GT 330M) for some reason, I don’t have CUDA installed so didn’t check that one, and OpenMP isn’t exactly supported by Apple’s compilers (yet?). Which leaves OpenSubdiv doing simple single threaded CPU subdivision.

This isn’t webscale multicorescale! Something must be done!

Apple platforms might not support OpenMP, but they do have something called Grand Central Dispatch (GCD). Which is supposedly a fancy technology to make multicore programming very easy – here’s the original GCD unveiling. Seeing how easy it is, I decided to try it out.

As a baseline, single threaded “CPU” subdivision kernel takes 33 milliseconds to compute 4th subdivision level of a “Car” model:

OpenMP dispatcher in OpenSubdiv

Subdivision in OpenSubdiv is computed by running several loops over data: loop to compute new edge positions, new face positions, new vertex positions etc. Fairly standard stuff. Each loop iteration is completely independent from others, for example:

void OsdCpuComputeEdge(/*...*/ int start, int end) {
    for (int i = start; i < end; i++) {
    	// compute i-th edge, completely independent of all other edges
    }
}

So of course OpenMP version just trivially says “hey, this loop is parallel!”:

void OsdOmpComputeEdge(/*...*/ int start, int end) {
	#pragma omp parallel for //<-- only this line is different!
    for (int i = start; i < end; i++) {
    	// compute i-th edge
    }
}

And then OpenMP-aware compiler and runtime will decide how to run this loop best over multiple CPU cores available. For example, it might split the loop into as many subsets as there are CPU cores, run these subsets (“jobs”) on its worker threads for these cores, and wait until all of them are done. Or it might split it up into more jobs, so that if the job lenghts will end up being different, it will still have some jobs to process on the other cores. This is all up to the OpenMP runtime to decide, but generally for large completely parallel loops it does a pretty good job.

Except, well, OpenMP doesn’t work on current Xcode 4.5 compiler (clang).

Initial parallel loop using GCD

GCD documentation suggests using dispatch_apply to submit a number of jobs at once; see Performing Loop Operations Concurrently section. This is easy to do:

void OsdGcdComputeEdge(/*...*/ int start, int end, dispatch_queue_t gcdq) {
	// replace for loop with:
	dispatch_apply(end-start, gcdq, ^(size_t blockIdx){
		int i = start+blockIdx;
    	// compute i-th edge
    });
}

See full commit here. That was easy. And slower than single threaded: 47ms with GCD, compared to 33ms single threaded. Not good.

OpenMP looks at the whole loop and hopefully partitions it into sensible count of subsets for parallel execution. Whereas GCD’s dispatch_apply submits each iteration of the loop to be executed in parallel. This “submit stuff to be executed on my worker threads” is naturally not a free operation and incurs some overhead. In our case, each iteration of the loop is fairly simple, it pretty much does weighted average of some vertices. Dispatch overhead here is probably higher than the actual work that we’re trying to do!

Better parallel loop using GCD

Of course the solution here is to batch up work items. Imagine that this loop processes, for example, 16 items (vertices, edges, …), then goes to next 16, and so on. These “packets of 16 items” would be what we dispatch to GCD. At the end of the loop, we might need to handle the remaining ones, if the number of iterations was not a multiple of 16. In fact, this is exactly what GCD documentation suggests in Improving on Loop Code.

All OpenSubdiv CPU kernels take “start” and “end” parameters that are essentially indices into an array of where to do the processing. So from our GCD blocks we can just call the regular CPU functions (see full commit):

const int GCD_WORK_STRIDE = 16;

void OsdGcdComputeEdge(/*...*/ int start, int end, dispatch_queue_t gcdq) {
    // submit work to GCD in parallel
    const int workSize = end-start;
    dispatch_apply(workSize/GCD_WORK_STRIDE, gcdq, ^(size_t blockIdx){
        const int start_i = start + blockIdx*GCD_WORK_STRIDE;
        const int end_i = start_i + GCD_WORK_STRIDE;
        OsdCpuComputeFace(/*...*/, start_i, end_i);
    });
    // do trailing block that's less than our batch size
    const int start_e = end - workSize%GCD_WORK_STRIDE;
    const int end_e = end;
    if (start_e < end_e)
        OsdCpuComputeFace(/*...*/, start_e, end_e);
}

This makes 4th subdivision level of the car model be computed in 15ms:

So that’s twice as fast as single threaded implementation. Is that good enough or not? My machine is a dual core (4 thread) one, so it is within my ballpark of expectations. Maybe it could go higher, but for that I’d need to do some profiling.

But you know what? Take a look at the other numbers - 62 milliseconds are spent on “CPU Draw”, so clearly that takes way more time than actual subdivision now. Fixing that one will have to be for another time, but suffice to say that reading data from GPU vertex buffers back into system memory each frame might not be a recipe for efficiency.

There’s at least one place in the above “GCD loop pattern” (hi Christer!) that might be improved: dispatch_apply waits until all submitted jobs are done. But to compute the trailing block we don’t need to wait for the other ones. The trailing block could be incorporated into the dispatch_apply loop, with better computation of end_i variable. Some other day!


Adventures in 3D Printing

I shamelessly stole whole idea from Robert Cupisz and did some 3D printed earrings. TL;DR: raymarching, marching cubes, MeshLab.

Now for the longer version…

Step 1: pick a Quaternion Julia fractal

As always, Iñigo Quilez’ work is a definitive resource. There’s a ready-made GLSL shader for raymarching this fractal in ShaderToy (named “Quaternion”), however current state of WebGL doesn’t allow loops with dynamic number of iterations, so it does not quite work in the browser. The shader is good otherwise!

Step 2: realtime tweaking with raymarching

With some massaging I’ve brought the shader into Unity.

Here, some experimentation with parameters for the fractal (picked 7.45 for “time value”), as well as extending the distance function to have a little torus for the earring hook, etc.

Keep in mind that while a fractal might look nice, it might not be printable fine because of too thin walls. All materials have a minimum “wall thickness”, and for example silver printed at Shapeways has a minimum thickness of 0.6-0.8mm. So I had to make the hape somewhat less interesting.

Now this leaves us with a signed distance field function (in a form of a GPU shader). This needs to be turned into an actual 3D model.

Step 3: marching cubes

Welcome old friend, Marching Cubes! I couldn’t find anything out of the box that would do “here’s my distance field function, do marching cubes on it”, so I wrote some quick-n-dirty code myself. Started with classic Paul Bourke’s code and made it print everything into an .OBJ file.

Here’s a non-final version of the distance field, gone through marching cubes, and brought back into Unity:

At this point I realized that the output will be quite noisy and some sort of “smoothing” will have to be done. Did a quick try at doing something with 3dsmax, but it is really no good at dealing with more than a million vertices at a time. Just doing a vertex weld on a million vertex model was taking two hours (?!).

Step 4: filtering in MeshLab

Some googling leads to MeshLab which is all kinds of awesome. And open source (which means the UI is not the most polished one, but hey it works).

Here’s my final model, as produced by marching cubes, loaded in MeshLab:

It’s still quite noisy, has several thin features and possibly sharp edges. Here’s what I did in MeshLab:

  • Remove duplicate vertices
  • Filters -> Remeshing, Simplification and Reconstruction -> Surface Reconstruction: Poisson. Entered 8 as octree depth, left others at default (solver divide: 6, sample per node: 1, surface offsetting: 1).
  • Scale the model to be about 26mm in length. Scale tool, measure geometric properties filter, freeze matrix.

Did I say MeshLab is awesome? It is.

Step 5: print it!

Export smoothed model from MeshLab, upload the file to a 3D printing service and… done! I used Shapeways, but there’s also i.materialize and others.

Here is the real actual printed thing!

I’ve been doing computer graphics since, well, last millenium. And this is probably the first time when this “graphics work” directly ends up in a real actual thing. Feels nice ;)


Non Power of Two Textures

Support for non power of two (“NPOT”, i.e. arbitrary sized) textures has been in GPUs for quite a while, but the state of support can be confusing. Recent question from @rygorous:

Lazyweb, <…> GL question: <…> ARB NPOT textures is fairly demanding, and texture rectangle are awkward. Is there an equivalent to ES2-style semantics in regular GL? Bonus Q: Is there something like the Unity HW survey, but listing supported GL extensions? :)

There are generally three big types of texture size support:

  • Full support for arbitrary texture sizes. This includes mipmaps, all texture wrap & filtering modes, and most often compressed texture formats as well.
  • “Limited” support for non-power-of-two sizes. No mipmaps, texture wrap mode has to be Clamp, but does allow texture filtering. This makes such textures not generally useful in 3D space, but just good enough for anything in screen space (UI, 2D, postprocessing).
  • No support, texture sizes have to be powers of two (16,32,64,128,…). If you’re running on really, really old GPU then textures might also need to be square (width = height).

Direct3D 9

Things are quite simple here. D3DCAPS9.TextureCaps has capability bits, D3DPTEXTURECAPS_POW2 and D3DPTEXTURECAPS_NONPOW2CONDITIONAL both being off indicates full support for NPOT texture sizes. When both D3DPTEXTURECAPS_POW2 and D3DPTEXTURECAPS_NONPOW2CONDITIONAL bits are on, then you have limited NPOT support.

I’ve no idea what would it mean if NONPOW2CONDITIONAL bit is set, but POW2 bit is not.

Hardware wise, limited NPOT has been generally available since 2002-2004, and full NPOT since 2006 or so.

Direct3D 11

Very simple; feature level 10_0 and up has full support for NPOT sizes; while feature level 9_x has limited support for NPOT. MSDN link.

OpenGL

Support for NPOT textures has been a core OpenGL feature since 2.0; promoted from earlier ARB_texture_non_power_of_two extension. The extension specifies “full” support for NPOT, including mipmaps & wrapping modes. There’s no practical way to detect hardware that can only do “limited” NPOT textures.

However, in traditional OpenGL spirit, presence of something in the API does not mean it can run on the GPU… E.g. Mac OS X with Radeon X1600 GPU is an OpenGL 2.1+ system, and as such pretends there’s full support for NPOT textures. In practice, as soon as you have NPOT size with mipmaps or a non-clamp wrap mode, you drop into software rendering. Ouch.

A rule of thumb that seems to work: try to detect “DX10+” level GPU, and in that case assume full NPOT support is actually there. Otherwise, in GL2.0+ or when ARB_texture_non_power_of_two is present, only assume limited NPOT support.

Then the question of course is, how to detect DX10+ level GPU in OpenGL? If you’re using OpenGL 3+, then you are on DX10+ GPU. In earlier GL versions, you’d have to use some heuristics. For example, if you have ARB_fragment_program and GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB is less than 4096 is a pretty good indicator of pre-DX10 hardware, on Mac OS X at least. Likewise, you could query MAX_TEXTURE_SIZE, lower than 8192 is a good indicator for pre-DX10.

OpenGL ES

OpenGL ES 3.0 has full NPOT support in core; ES 2.0 has limited NPOT support (no mipmaps, no Repeat wrap mode) in core; and ES 1.1 has no NPOT support.

For ES 1.1 and 2.0, full NPOT support comes with GL_ARB_texture_non_power_of_two or GL_OES_texture_npot extension. In practice, iOS devices don’t support this; and on Android side there’s support on Qualcomm Adreno and ARM Mali. Possibly some others.

For ES 1.1, limited NPOT support comes with GL_APPLE_texture_2D_limited_npot (all iOS devices) or GL_IMG_texture_npot (some ImgTec Android devices I guess).

WebGL and Native Client currently are pretty much OpenGL ES 2.0, and thus support limited NPOT textures.

Flash Stage3D

Sad situation here; current version of Stage3D (as of Flash Player 11.4) has no NPOT support whatsoever. Not even for render targets.

Consoles

I wouldn’t be allowed to say anything about them, now would I? :) Check documentation that comes with your developer access on each console.

Is there something like the Unity HW survey, but listing supported GL extensions?

Not a single good resource, but there are various bits and pieces:

  • Unity Web Player hardware stats - no GL extensions, but we do map GPUs to “DX-like” levels and from there you can sort of infer things.
  • Steam Hardware Survey - likewise.
  • Apple OpenGL Capabilities - tables of detailed GL info for recent OS X versions and all GPUs supported by Apple. Excellent resource for Mac programmers! Apple does remove pages of old OS X versions from there, you’d have to use archive.org to get them back.
  • GLBenchmark Results - results database of GLBenchmark, has list of OpenGL ES extensions and some caps bits for each result (“GL config.” tab).
  • Realtech VR GLView - OpenGL & OpenGL ES extensions and capabilities viewer, for Windows, Mac, iOS & Android.
  • Omni Software Update Statistics - Mac specific, but does list some OpenGL stats from “Graphics” dropdown.
  • KluDX - Windows app that shows information about GPU; also has reports of submitted data.
  • 0 A.D. OpenGL capabilities database - OpenGL capabilities submitted by players of 0 A.D. game.

Cross Platform Shaders in 2012

Update: Cross Platform Shaders in 2014.

Since about 2002 to 2009 the de facto shader language for games was HLSL. Everyone on PCs was targeting Windows through Direct3D, Xbox 360 uses HLSL as well, and Playstation 3 uses Cg, which for all practical purposes is the same as HLSL. There were very few people targeting Mac OS X or Linux, or using OpenGL on Windows. One shader language ruled the world, and everything was rosy. You could close your eyes and pretend OpenGL with it’s GLSL language just did not exist.

Then a series of events happened, and all of a sudden OpenGL is needed again! iOS and Android are too big to be ignored (which means OpenGL ES + GLSL), and making games for Mac OS X or Linux isn’t a crazy idea either. This little WebGL thing that excites many hackers uses a variant of GLSL as well.

Now we have a problem; we have two similar but subtly different shading languages to deal with. I wrote about how we deal with this at Unity, and not much has changed since 2010. The “recommended way” is still writing HLSL/Cg, and we cross-compile into GLSL for platforms that need it.

But what about the future?

It could happen that importance of HLSL (and Direct3D) will decrease over time; this largely depends on what Microsoft is going to do. But just like OpenGL became important again just as it seemed to become irrelevant, so could Direct3D. Or something completely new. I’ll assume that for several years into the future, we’ll need to deal with at least two shading languages.

There are several approaches at handling the problem, and several solutions in that space, at varying levels of completeness.

#1. Do it all by hand!

“Just write all shaders twice”. Ugh. That’s not “web scale” so we’ll just discard this approach.

Slightly related approach is to have a library of preprocessor macros & function definitions, and use them in places where HLSL & GLSL are different. This is certainly doable, take a look at FXAA for a good example. Downsides are, you really need to know all the tiny differences between languages. HLSL’s fmod() and GLSL’s mod() sound like they do the same thing, but are subtly different - and there are many more places like this.

#2. Don’t use HLSL nor GLSL: treat them as shader backends

You could go for fully graphical based shader authoring. Drag some nodes around, connect them, and have shader “baking” code that can spit out HLSL, GLSL, or anything else that is needed. This is a big enough topic by iself; graphical shader editing has a lot more uses at “describing material properties” level than it has at lower level (who’d want to write a deferred rendering light pass shader using nodes & lines?).

You could also use a completely different language that compiles down to HLSL or GLSL. I’m not aware of any big uses in realtime graphics, but recent examples could be Open Shading Language (in film) or AnySL (in research).

#3. Cross-compile HLSL source into GLSL or vice versa

Parse shader source in one language, produce some intermediate representation, massage / verify that representation as needed, “print” it into another language. Some solutions exist here, for example:

  • hlsl2glslfork does DX9 HLSL -> GLSL 1.10 / ES 1.00 translation. Used by Unity, and judging from pull requests and pokes I get, in several dozen other game development shops.
  • ANGLE does GLSL ES 1.00 -> DX9 HLSL. Used by WebGL implementation in Chrome and Firefox.
  • Cg compiles Cg (“almost the same as HLSL”) into various backends, including D3D9 shader assembly and various versions of GLSL, with mixed success. No support for compiling into D3D10+ shader bytecode as far as I can tell.

Big limitation of two libraries above, is that they only do “DX9 level” shaders, so to speak. No support for DX10/11 style HLSL syntax (which Microsoft has changed a lot), and no support for correspondingly higher GLSL versions (GLSL 3.30+, GLSL ES 3.00). At least right now.

Call to action! There seems to be a need for source level translation libraries for DX10/GL3+ style language syntax & feature sets. I’m not sure if it makes sense to extend the above libraries, or to start from scratch… But we need a good quality, open source with liberal license, well maintained & tested package to do this. It shouldn’t be hard, and it probably doesn’t make sense for everyone to try to roll their own. github & bitbucket makes collaboration a snap, let’s do it.

If anyone at Microsoft is reading this: it would really help to have formal grammar of HLSL available. “Reference for HLSL” on MSDN has tiny bits and pieces scattered around, but that seems both incomplete and hard to assemble into a single grammar.

A building block could be Mesa or its smaller fork, GLSL Optimizer (see related blog post). It has a decent intermediate representation (IR) for shaders, a bunch of cleanup/optimization/lowering passes, a GLSL parser and GLSL printer (in GLSL Optimizer). Could be extended to parse HLSL and/or print HLSL. Currently lacking most of DX11/GL4 features, and some DX10/GL3 features in the IR. But under active development, so will get those soon I hope.

MojoShader also has an in-progress HLSL parser and translator to GLSL.

#4. Translate compiled shader bytecode into GLSL

Take HLSL, compile it down to bytecode, parse that bytecode and generate corresponding “low level” GLSL. Right now this would only go one way, as GLSL does not have a cross platform “compiled shader” representation. Though with recent OpenCL getting SPIR, maybe there’s hope in OpenGL getting something similar in the future?

This is a lot simpler to do than parsing full high level language, and a ton of platform differences go away (the ones that are handled purely at syntax level, e.g. function overloading, type promotion etc.). A possible downside is that HLSL bytecode might be “too optimized” - all the hard work about register packing & allocation, loop unrolling etc. is not that much needed here. Any conventions like whether your matrices are column-major or row-major is also “baked into” the resulting shader, so your D3D and GL rendering code better match there.

Several existing libraries in this space:

What now?

Go and make solutions to the approaches above, especially #3 and #4! Cross-platform shader developers all around the world will thank you. All twenty of them, or something ;)

If you’re a student looking for an entry into the industry as a programmer: this is a perfect example of a freetime / university project! It’s self-contained, it has clear goals, and above all, it’s actually useful for the real world. A non-crappy implementation of a library like this would almost certainly land you a job at Unity and I guess many other places.