Random Thoughts on New Explicit Graphics APIs

Last time I wrote about graphics APIs was almost a year ago. Since then, Apple Metal was unveiled and shipped in iOS 8, Khronos Vulkan was announced (which is very much AMD Mantle, improved to make it cross-vendor), and DX12 continues to be developed for Windows 10.

@promit_roy has a very good post on gamedev.net about why these new APIs are needed and what problems they solve. Go read it now, it’s good.

Just a couple more thoughts I’d add.

Metal experience

When I wrote the previous OpenGL rant, we were already working on Metal and had it “basically work in Unity”. I had only ever worked on PC/mobile graphics APIs before (D3D9, D3D11, OpenGL/ES, as well as D3D7-8 back in the day), so Metal was the first of these “more explicit APIs” I experienced (I never actually did anything on consoles, besides just seeing some code).

ZOMG what a breath of fresh air.

Metal is super simple and very, very clear. I was looking at the header files and was amazed at how small they are – “these few short files, and that’s it?! wow.” A world of difference compared to how much accumulated stuff is in OpenGL core & extension headers (to a lesser degree in OpenGL ES too).

Conceptually Metal is closer to D3D11 than OpenGL ES (separate shader stages; constant buffers; same coordinate conventions), so “porting Unity to it” was mostly taking the D3D11 backend (which I also wrote back in the day, so familiarity with the code helped), changing the API calls and generally removing stuff.

Create a new buffer (vertex, index, constant - does not matter) – one line of code (MTLDevice.newBuffer*). Then just get a pointer to data and do whatever you want. No mapping, no locking, no staging buffers, no trying to nudge the API into doing the exact type of buffer update you want (on the other hand, data synchronization is your own responsibility now).
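
Here’s a rough sketch of what that looks like, written against Apple’s metal-cpp C++ wrapper purely to keep the example in C++ like the rest of the code in this post (the real API is Objective-C; the function name and usage below are made up):

// Minimal sketch, assuming the metal-cpp wrapper (NS_PRIVATE_IMPLEMENTATION /
// MTL_PRIVATE_IMPLEMENTATION need to be defined in one translation unit).
#include <Metal/Metal.hpp>
#include <cstring>

void UploadVertices(MTL::Device* device, const void* src, size_t size)
{
    // One call creates the buffer; CPU and GPU share the same memory.
    MTL::Buffer* buffer = device->newBuffer(size, MTL::ResourceStorageModeShared);
    // Just grab a pointer and write - no Map/Unmap, no staging copy.
    std::memcpy(buffer->contents(), src, size);
    // ...bind 'buffer' to a command encoder later; making sure we don't overwrite
    // data the GPU is still reading from is now our own responsibility.
}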

And even with the very early builds we had, everything more or less “just worked”, with a super useful debug & validation layer. Sure there were issues and there were occasional missing things in the early betas, but nothing major and the issues got fixed fast.

To me Metal showed that a new API that gets rid of the baggage and exposes platform strengths is a very pleasant thing to work with. Metal is essentially just several key ideas (many of which are shared by other “new APIs” like Vulkan or DX12):

  • Command buffer creation is separate from submission; and creation is mostly stateless (do that from any thread).
  • Whole rendering pipeline state is specified, thus avoiding “whoops need to revalidate & recompile shaders” issues.
  • Unified memory; just get a pointer to buffer data. Synchronization is your own concern.
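
And a minimal sketch of the “whole pipeline state” idea from the list above, again using the metal-cpp wrapper for illustration (the shader function names here are hypothetical):

MTL::RenderPipelineState* CreatePipeline(MTL::Device* device, MTL::Library* lib)
{
    // Shaders, blend state and render target formats all go into one descriptor...
    MTL::RenderPipelineDescriptor* desc = MTL::RenderPipelineDescriptor::alloc()->init();
    desc->setVertexFunction(lib->newFunction(NS::String::string("vertexMain", NS::UTF8StringEncoding)));
    desc->setFragmentFunction(lib->newFunction(NS::String::string("fragmentMain", NS::UTF8StringEncoding)));
    desc->colorAttachments()->object(0)->setPixelFormat(MTL::PixelFormatBGRA8Unorm);
    desc->colorAttachments()->object(0)->setBlendingEnabled(true);

    // ...and get compiled into one immutable object up front; nothing is left
    // to re-validate or re-compile at draw time.
    NS::Error* error = nullptr;
    MTL::RenderPipelineState* pso = device->newRenderPipelineState(desc, &error);
    desc->release();
    return pso;
}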

Metal very much keeps the existing resource binding model from D3D11/OpenGL - you bind textures/buffers to shader stage “resource slots”.

I think of all public graphics APIs (old ones like OpenGL/D3D11 and new ones like Vulkan/DX12), Metal is probably the easiest to learn. Yes it’s very much platform specific, but again, OMG so easy.

Partially because it keeps the traditional binding model – while that means Metal can’t do fancy things like bindless resources, it also means the binding model is simple. I would not want to be a student learning graphics programming, having to understand Vulkan/DX12 resource binding.

Explicit APIs and Vulkan

This bit from Promit’s post,

But there’s a very real message that if these APIs are too challenging to work with directly, well the guys who designed the API also happen to run very full featured engines requiring no financial commitments.

I love conspiracy theories as much as the next guy, but I don’t think that’s quite true. If I put my cynical hat on, then sure: making graphics APIs hard to use is in the interest of middleware providers. You could also say that making sure there are lots of different APIs is in the interest of middleware providers too! The harder it is to make things, the better for them, right?

In practice I think we’re all too much hippies to actually think that way. I can’t speak for Epic, Valve or Frostbite of course, but on the Unity side it was mostly Christophe being involved in Vulkan, and if you think his motivation is commercial interest then you don’t know him :) I and others from the graphics team were only very casually following the development – I would have loved to be more involved, but was 100% busy on Unity 5.0 development the whole time.

So there.

That said, to some extent the explicit APIs (both Vulkan and DX12) are harder to use. I think that’s mostly due to the more flexible (but much more complicated) resource binding model. See Metal above – my opinion is that stateless command buffer creation and fully specified pipeline state actually make an API easier to use. But the new way of resource binding, and to some extent the ability to reuse & replay command buffer chunks (which Metal does not have either), do complicate things.

However, I think this is worth it. The actual lowest level of API should be very efficient, even if somewhat “hard to use” (or require an expert to use it). Additional “ease of use” layers can be put on top of that! The problem with OpenGL was that it was trying to do both at once, with a very sad side effect that everyone and their dog had to implement all the layers (in often subtly incompatible ways).

I think there’s plenty of space for “somewhat abstracted” libraries or APIs to be layered on top of Vulkan/DX12. Think XNA back when it was a thing; or three.js in the WebGL world. Or bgfx in the C++ world. These are all good and awesome.

Let’s see what happens!

Optimizing Shader Info Loading, or Look at Yer Data!

A story about a million shader variants, optimizing using Instruments and looking at the data to optimize some more.

The Bug Report

The bug report I was looking into was along the lines of “when we put these shaders into our project, then building a game becomes much slower – even if shaders aren’t being used”.

Indeed it was. Quick look revealed that for ComplicatedReasons(tm) we load information about all shaders during the game build – that explains why the slowdown was happening even if shaders were not actually used.

This issue must be fixed! There’s probably no really good reason we must know about all the shaders for a game build. But to fix it, I’ll need to pair up with someone who knows something about the game data build pipeline, our data serialization and so on. So that will have to wait for some day in the future.

Meanwhile… another problem was that loading the “information for a shader” was slow in this project. Did I say slow? It was very slow.

That’s a good thing to look at. Shader data is not only loaded while building the game; it’s also loaded when the shader is needed for the first time (e.g. clicking on it in Unity’s project view); or when we actually have a material that uses it etc. All these operations were quite slow in this project.

Turns out this particular shader had massive internal variant count. In Unity, what looks like “a single shader” to the user often has many variants inside (to handle different lights, lightmaps, shadows, HDR and whatnot - typical ubershader setup). Usually shaders have from a few dozen to a few thousand variants. This shader had 1.9 million. And there were about ten shaders like that in the project.

The Setup

Let’s create several shaders with different variant counts for testing: 27 thousand, 111 thousand, 333 thousand and 1 million variants. I’ll call them 27k, 111k, 333k and 1M respectively. For reference, the new “Standard” shader in Unity 5.0 has about 33 thousand internal variants. I’ll do the tests on a MacBook Pro (2.3 GHz Core i7) using a 64 bit Release build.

Things I’ll be measuring:

  • Import time. How much time it takes to reimport the shader in Unity editor. Since Unity 4.5 this doesn’t do much of actual shader compilation; it just extracts information about shader snippets that need compiling, and the variants that are there, etc.
  • Load time. How much time it takes to load shader asset in the Unity editor.
  • Imported data size. How large is the imported shader data (serialized representation of actual shader asset; i.e. files that live in Library/metadata folder of a Unity project).

So the data is:

Shader   Import    Load    Size
   27k    420ms   120ms    6.4MB
  111k   2013ms   492ms   27.9MB
  333k   7779ms  1719ms   89.2MB
    1M  16192ms  4231ms  272.4MB

Enter Instruments

Last time we used xperf to do some profiling. We’re on a Mac this time, so let’s use Apple Instruments. Just like xperf, Instruments can show a lot of interesting data. We’re looking at the most simple one, “Time Profiler” (though profiling Zombies is very tempting!). You pick that instrument, attach to the executable, start recording, and get some results out.

You then select the time range you’re interested in, and expand the stack trace. Protip: Alt-Click (ok ok, Option-Click you Mac peoples) expands full tree.

So far the whole stack is just going deep into Cocoa stuff. “Hide System Libraries” is very helpful with that:

Another very useful feature is inverting the call tree, where the results are presented from the heaviest “self time” functions (we won’t be using that here though).

When hovering over an item, an arrow is shown on the right (see image above). Clicking on that does “focus on subtree”, i.e. ignores everything outside of that item, and time percentages are shown relative to the item. Here we’ve focused on ShaderCompilerPreprocess (which does majority of shader “importing” work).

Looks like we’re spending a lot of time appending to strings. That usually means strings did not have enough storage buffer reserved and are causing a lot of memory allocations. Code change:
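
The actual change is in Unity’s internals, so here is just a hypothetical sketch of the same idea – reserve the string’s storage up front so that repeated appends don’t keep reallocating:

#include <string>
#include <vector>

// Hypothetical example: building one big string out of many small pieces.
std::string BuildVariantText(const std::vector<std::string>& macros)
{
    std::string result;
    size_t total = 0;
    for (size_t i = 0; i < macros.size(); ++i)
        total += macros[i].size() + 1;
    result.reserve(total);      // the important line: one allocation instead of many
    for (size_t i = 0; i < macros.size(); ++i)
    {
        result += macros[i];
        result += ' ';
    }
    return result;
}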

This small change has cut down shader importing time by 20-40%! Very nice!

I did a couple of other small tweaks from looking at this profiling data – none of them resulted in any significant benefit though.

Profiling shader load time also says that most of the time ends up being spent on loading editor related data that is arrays of arrays of strings and so on:

I could have picked functions from the profiler results, gone through each of them and optimized, and perhaps would have achieved a solid 2-3x improvement over the initial results. Very often that’s enough to be proud of!

However…

Taking a step back

Or as Mike Acton would say, “look at your data!” (check his CppCon 2014 slides or video). Another saying is also applicable: “think!”

Why do we have this problem to begin with?

For example, in 333k variant shader case, we end up sending 610560 lines of shader variant information between shader compiler process & editor, with macro strings in each of them. In total we’re sending 91 megabytes of data over RPC pipe during shader import.

One possible area for improvement: the data we send over and store in imported shader data is a small set of macro strings repeated over and over and over again. Instead of sending or storing the strings, we could just send the set of strings used by a shader once, assign numbers to them, and then send & store the full set as lists of numbers (or fixed size bitmasks). This should cut down on the amount of string operations we do (massively cut down on number of small allocations), size of data we send, and size of data we store.
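
A hypothetical sketch of that idea – intern each macro string once and refer to it by a small index afterwards:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string table: each distinct macro string is stored (and sent) once.
struct MacroTable
{
    std::vector<std::string> strings;
    std::unordered_map<std::string, uint32_t> lookup;

    uint32_t GetIndex(const std::string& s)
    {
        std::unordered_map<std::string, uint32_t>::const_iterator it = lookup.find(s);
        if (it != lookup.end())
            return it->second;
        uint32_t index = (uint32_t)strings.size();
        strings.push_back(s);
        lookup.insert(std::make_pair(s, index));
        return index;
    }
};

// A variant then becomes a short list of indices (or a fixed-size bitmask)
// instead of a list of full strings.
typedef std::vector<uint32_t> VariantKey;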

Another possible approach: right now we have source data in shader that indicate which variants to generate. This data is very small: just a list of on/off features, and some built-in variant lists (“all variants to handle lighting in forward rendering”). We do the full combinatorial explosion of that in the shader compiler process, send the full set over to the editor, and the editor stores that in imported shader data.

But the way we do the “explosion of source data into full set” is always the same. We could just send the source data from shader compiler to the editor (a very small amount!), and furthermore, just store that in imported shader data. We can rebuild the full set when needed at any time.
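
A hypothetical sketch of that “explosion”: the stored source data is just a few lists of keywords, and the full variant set is their cartesian product, rebuilt whenever it’s needed:

#include <string>
#include <vector>

// Hypothetical source data: each entry is a list of mutually exclusive keywords,
// and one keyword from each entry is picked per variant.
typedef std::vector<std::vector<std::string> > FeatureSets;

static void Expand(const FeatureSets& features, size_t i,
                   std::vector<std::string>& current,
                   std::vector<std::vector<std::string> >& out)
{
    if (i == features.size())
    {
        out.push_back(current);
        return;
    }
    for (size_t j = 0; j < features[i].size(); ++j)
    {
        current.push_back(features[i][j]);
        Expand(features, i + 1, current, out);
        current.pop_back();
    }
}

// A dozen features with a few options each explodes into many thousands of variants,
// but the source data itself stays tiny.
std::vector<std::vector<std::string> > BuildAllVariants(const FeatureSets& features)
{
    std::vector<std::vector<std::string> > result;
    std::vector<std::string> current;
    Expand(features, 0, current, result);
    return result;
}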

Changing the data

So let’s try to do that. First let’s deal with RPC only, without changing serialized shader data. A few commits later…

This made shader importing over twice as fast!

Shader   Import
   27k    419ms ->  200ms
  111k   1702ms ->  791ms
  333k   5362ms -> 2530ms
    1M  16784ms -> 8280ms

Let’s do the other part too; where we change serialized shader variant data representation. Instead of storing full set of possible variants, we only store data needed to generate the full set:

Shader   Import              Load                 Size
   27k    200ms ->   285ms    103ms ->    396ms     6.4MB -> 55kB
  111k    791ms ->  1229ms    426ms ->   1832ms    27.9MB -> 55kB
  333k   2530ms ->  3893ms   1410ms ->   5892ms    89.2MB -> 56kB
    1M   8280ms -> 12416ms   4498ms ->  18949ms   272.4MB -> 57kB

Everything seems to work, and the serialized file size got massively decreased. But, both importing and loading got slower?! Clearly I did something stupid. Profile!

Right. So after importing or loading the shader (from now a small file on disk), we generate the full set of shader variant data. Which right now is resulting in a lot of string allocations, since it is generating arrays of arrays of strings or somesuch.

But we don’t really need the strings at this point; for example after loading the shader we only need the internal representation of “shader variant key” which is a fairly small bitmask. A couple of tweaks to fix that, and we’re at:

Shader  Import    Load
   27k    42ms     7ms
  111k    47ms    27ms
  333k    94ms    76ms
    1M   231ms   225ms

Look at that! Importing a 333k variant shader got 82 times faster; loading its metadata got 22 times faster, and the imported file size got over a thousand times smaller!

One final look at the profiler, just because:

Weird, time is spent in memory allocation but there shouldn’t be any at this point in that function; we aren’t creating any new strings there. Ahh, implicit std::string to UnityStr (our own string class with better memory reporting) conversion operators (long story…). Fix that, and we’ve got another 2x improvement:

Shader  Import    Load
   27k    42ms     5ms
  111k    44ms    18ms
  333k    53ms    46ms
    1M   130ms   128ms

The code could still be optimized further, but there ain’t no easy fixes left I think. And at this point I’ll have more important tasks to do…

What we’ve got

So in total, here’s what we have so far:

Shader   Import                Load                 Size
   27k    420ms-> 42ms (10x)    120ms->  5ms (24x)    6.4MB->55kB (119x)
  111k   2013ms-> 44ms (46x)    492ms-> 18ms (27x)   27.9MB->55kB (519x)
  333k   7779ms-> 53ms (147x)  1719ms-> 46ms (37x)   89.2MB->56kB (this is getting)
    1M  16192ms->130ms (125x)  4231ms->128ms (33x)  272.4MB->57kB (ridiculous!)

And a fairly small pull request to achieve all this (~400 lines of code changed, ~400 new added):

Overall I’ve probably spent something like 8 hours on this – hard to say exactly since I took some breaks and did other things. Also I was writing down notes & taking screenshots for the blog too :) The fix/optimization is already in Unity 5.0 beta 20, by the way.

Conclusion

Apple’s Instruments is a nice profiling tool (and unlike xperf, the UI is not intimidating…).

However, Profiler Is Not A Replacement For Thinking! I could have just looked at the profiling results and tried to optimize “what’s at top of the profiler” one by one, and maybe achieved 2-3x better performance. But by thinking about the actual problem and why it happens, I got a way, way better result.

Happy thinking!

Curious Case of Slow Texture Importing, and Xperf

I was looking at a curious bug report: “Texture importing got much slower in current beta”. At first look, I dismissed it under “eh, someone’s being confused” (quickly tried on several textures, did not notice any regression). But then I got a proper bug report with several textures. One of them was importing about 10 times slower than it used to be.

Why would anyone make texture importing that much slower? No one would, of course. Turns out, this was an unintended consequence of generally improving things.

But, the bug report made me use xperf (a.k.a. Windows Performance Toolkit) for the first time. Believe it or not, I’ve never used it before!

So here’s the story

We’ve got a TGA texture (2048x2048, uncompressed – a 12MB file) that takes about 10 seconds to import in the current beta build, but took ~1 second in Unity 4.6.

First wild guess: did someone accidentally disable multithreaded texture compression? Nope, doesn’t look like it (making final texture be uncompressed still shows massive regression).

Second guess: we are using the FreeImage library to import textures. Maybe someone, I dunno, updated it and committed a debug build? Nope, the last change to our build of it was done many moons ago.

Time to profile. My quick “I need to get some answer in 5 seconds” profiler on Windows is Very Sleepy, so let’s look at that:

Wait what? All the time is spent in WinAPI ReadFile function?!

Is there something special about the TGA file I’m testing on? Let’s make the same sized, uncompressed PNG image (so file size comes out the same).

The PNG imports in 108ms, while the TGA takes 9800ms (I’ve turned off DXT compression, to focus on raw import time). In Unity 4.6 the same work is done in 116ms (PNG) and 310ms (TGA). File sizes are roughly the same. WAT!

Enter xperf

Asked a coworker who knows something about Windows: “why would reading one file spend all time in ReadFile, but another file of same size read much faster?”, and he said “look with xperf”.

I’ve read about xperf on Bruce Dawson’s excellent blog, but never tried it myself. Before today, that is.

So, launch Windows Performance Recorder (I don’t even know if it comes with some VS or Windows SDK version or needs to be installed separately… it was on my machine somehow), tick CPU and disk/file I/O and click Start:

Do texture importing in Unity, click save, and on this fairly confusing screen click “Open in WPA”:

The overview in the sidebar gives usage graphs of our stuff. A curious thing: neither CPU (Computation) nor Storage graphs show intense activity? The plot thickens!

CPU usage investigation

Double clicking the Computation graph shows timeline of CPU usage, with graphs for each process. We can see Unity.exe taking up some CPU during a time period, which the UI nicely highlights for us.

Next thing is, we want to know what is using the CPU. Now, the UI groups things by the columns on the left side of the yellow divider, and displays details for them on the right side of it. We’re interested in a callstack now, so context-click on the left side of the divider, and pick “Stack”:

Oh right, to get any useful stacks we’ll need to tell xperf to load the symbols. So you go Trace -> Configure Symbol Paths, add Unity folder there, and then Trace -> Load Symbols.

And then you wait. And wait some more…

And then you get the callstacks! Not quite sure what the “n/a” entry is; my best guess is that it just represents unused CPU cores or sleeping threads or something like that.

Digging into the other call stack, we see that indeed, all the time is spent in ReadFile.

Ok, so that was not terribly useful; we already knew that from the Very Sleepy profiling session.

Let’s look at I/O usage

Remember the “Storage” graph on sidebar that wasn’t showing much activity? Turns out, you can expand it into more graphs.

Now we’re getting somewhere! The “File I/O” overview graph shows massive amounts of activity, when we were importing our TGA file. Just need to figure out what’s going on there. Double clicking on that graph in the sidebar gives I/O details:

You can probably see where this is going now. We have a lot of file reads, in fact almost 400 thousand of them. That sounds a bit excessive.

Just like in the CPU part, the UI sorts on columns to the left of the yellow divider. Let’s drag the “Process” column to the left; this shows that all these reads are coming from Unity indeed.

Expanding the actual events reveals the culprit:

We are reading the file alright. 3 bytes at a time.

But why and how?

But why are we reading a 12 megabyte TGA file in three-byte chunks? No one updated our image reading library in a long time, so how come things have regressed?

Found the place in code where we’re calling into FreeImage. Looks like we’re setting up our own I/O routines and telling FreeImage to use them:
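
The actual code is Unity-internal, but it’s essentially the standard FreeImageIO callback setup. A rough sketch (the MyFile* routines are hypothetical stand-ins for our own file functions; the FreeImageIO / FreeImage_LoadFromHandle part is the real FreeImage API):

#include <FreeImage.h>

// Hypothetical stand-ins for Unity's own (unbuffered) file routines:
struct MyFile;
unsigned MyFileRead(MyFile* f, void* buffer, unsigned size, unsigned count);
int      MyFileSeek(MyFile* f, long offset, int origin);
long     MyFileTell(MyFile* f);

static unsigned DLL_CALLCONV ReadProc(void* buffer, unsigned size, unsigned count, fi_handle handle)
{
    return MyFileRead((MyFile*)handle, buffer, size, count);   // note: no buffering here!
}
static int DLL_CALLCONV SeekProc(fi_handle handle, long offset, int origin)
{
    return MyFileSeek((MyFile*)handle, offset, origin);
}
static long DLL_CALLCONV TellProc(fi_handle handle)
{
    return MyFileTell((MyFile*)handle);
}

FIBITMAP* LoadViaCallbacks(MyFile* file)
{
    FreeImageIO io = {};
    io.read_proc = ReadProc;
    io.seek_proc = SeekProc;
    io.tell_proc = TellProc;    // write_proc isn't needed for loading
    FREE_IMAGE_FORMAT fif = FreeImage_GetFileTypeFromHandle(&io, (fi_handle)file, 0);
    return FreeImage_LoadFromHandle(fif, &io, (fi_handle)file, 0);
}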

Version control history check: indeed, a few weeks ago a change was made in that code that switched from basically “hey, load an image from this file path” to “hey, load an image using these I/O callbacks”.

This generally makes sense. If we have our own file system functions, it makes sense to use them. That way we can support reading from some non-plain-files (e.g. archives, or compressed files), etc. In this particular case, the change was done to support LZ4 compression in lightmap cache (FreeImage would need to import texture files without knowing that they have LZ4 compression done on top of them).

So all that is good. Except when that changes things to have wildly different performance characteristics, that is.

When you don’t pass file I/O routines to FreeImage, then it uses a “default set”, which is just C stdio ones:

Now, C stdio routines do I/O buffering by default… our I/O routines do not. And FreeImage’s TGA loader does a very large number of one-pixel reads.

To be fair, the “read TGA one pixel at a time” seems to be fixed in upstream FreeImage these days; we’re just using a quite old version. So looking at this bug made me realize how old our version of FreeImage is, and make a note to upgrade it at some point. But not today.

The Fix

So, a proper fix here would be to set up buffered I/O routines for FreeImage to use. Turns out we don’t have any of them at the moment. They aren’t terribly hard to do; I poked the relevant folks to do them.

In the meantime, to check if that really was the culprit, and to not have “well, TGAs import much slower”, I just made a hotfix that reads the whole image into memory and then loads from that.
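
Conceptually the hotfix looks something like this (a hedged sketch using plain stdio and no error handling; the actual Unity code uses our own file API, but FreeImage_OpenMemory / FreeImage_LoadFromMemory are the real FreeImage calls):

#include <FreeImage.h>
#include <cstdio>
#include <vector>

// Sketch: slurp the whole file into memory, then let FreeImage decode from that
// memory block - thousands of tiny ReadFile calls become one big read.
FIBITMAP* LoadImageWholeFile(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f)
        return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<BYTE> data((size_t)size);
    fread(&data[0], 1, (size_t)size, f);
    fclose(f);

    FIMEMORY* mem = FreeImage_OpenMemory(&data[0], (DWORD)size);
    FREE_IMAGE_FORMAT fif = FreeImage_GetFileTypeFromMemory(mem, 0);
    FIBITMAP* bitmap = FreeImage_LoadFromMemory(fif, mem, 0);
    FreeImage_CloseMemory(mem);
    return bitmap;
}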

Is it okay to read the whole image into a memory buffer? Depends. I’d guess in 95% of cases it is okay, especially now that the Unity editor is 64 bit. Uncompressed data for the majority of images will end up being much larger than the file size anyway. Probably the only exception could be .PSD files, which could have a lot of layers, but we’re only interested in reading the “composited image” file section. So yeah, that’s why I said “hotfix”; a proper solution would be having buffered I/O routines, and/or upgrading FreeImage.

This actually made TGA and PNG importing faster than before: 75ms for TGA, 87ms for PNG (Unity 4.6: 310ms TGA, 116ms PNG; current beta before the fix: 9800ms TGA, 108ms PNG).

Yay.

Conclusion

Be careful when replacing built-in functionality of something with your own implementation (e.g. standard I/O or memory allocation or logging or … routines of some library). They might have different performance characteristics!

xperf on Windows is very useful! Go and read Bruce Dawson’s blog for way more details.

On Mac, Apple’s Instruments is a similar tool. I think I’ll use that in some next blog entry.

I probably should have guessed that “too many, too small file read calls” is the actual cause after two minutes of looking into the issue. I don’t have a good excuse on why I did not. Oh well, next time I’ll know :)

Divide and Conquer Debugging

It should not be news to anyone that the ability to narrow down a problem while debugging is an incredibly useful skill. Yet from time to time I see people just stumbling around helplessly and randomly when they are trying to debug something. So with this in mind (and also “less tweeting, more blogging!” in mind for 2015), here’s a practical story.

This happened at work yesterday, and is just an ordinary bug investigation. It’s not some complex bug, and investigation was very short - all of it took less time than writing this blog post.

Bug report

We’re adding iOS Metal support to Unity 4.6.x, and one of the beta testers reported this: “iOS Metal renders submeshes incorrectly”. There was a nice project attached that shows the issue very clearly. He has some meshes with multiple materials on them, and the 2nd material parts are displayed in the wrong position.

The scene looks like this in the Unity editor:

But when run on an iOS device, it looks like this:

Not good! Well, at least the bug report is very nice :)

Initial guesses

Since the problematic parts are the second material on each object, and it only happens on the device, then the user’s “iOS Metal renders submeshes incorrectly” guess makes perfect sense (spoiler alert: the problem was elsewhere).

Ok, so what is different between editor (where everything works) and device (where it’s broken)?

  • Metal: device is running Metal, whereas editor is running OpenGL.
  • CPU: device is running on ARM, editor running on Intel.
  • Need to check which shaders are used on these objects; maybe they are something crazy that results in differences.

Some other exotic things might be different, but first let’s take the above.

Initial Cuts

Run the scene on the device using OpenGL ES 2.0 instead. Ha! The issue is still there. Which means Metal is not the culprit at all!

Run it using a slightly older stable Unity version (4.6.1). The issue is not there. Which means it’s some regression somewhere since Unity 4.6.1 and the code we’re based on now. Thankfully that’s only a couple weeks of commits.

We just need to find what regressed, when and why.

Digging Deeper

Let’s look at the frame on the device, using Xcode frame capture.

Hmm. We see that the scene is rendered in two draw calls (whereas it’s really six sub-objects), via Unity’s dynamic batching.

Dynamic batching is a CPU-side optimization we have where small objects using identical rendering state are transformed into world space on the CPU, into a dynamic geometry buffer, and rendered in a single draw call. So this spends some CPU time to transform the vertices, but saves some time on the API/driver side. For very small objects (sprites etc.) this tends to be a win.
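
In other words, the idea is roughly this (a hypothetical, scalar sketch; the real Unity code does the transform with hand-written SIMD):

#include <vector>

struct Vec3      { float x, y, z; };
struct Matrix4x4 { float m[16]; };      // column-major, like a typical graphics math library

static Vec3 TransformPoint(const Matrix4x4& m, const Vec3& p)
{
    Vec3 r;
    r.x = m.m[0]*p.x + m.m[4]*p.y + m.m[8]*p.z  + m.m[12];
    r.y = m.m[1]*p.x + m.m[5]*p.y + m.m[9]*p.z  + m.m[13];
    r.z = m.m[2]*p.x + m.m[6]*p.y + m.m[10]*p.z + m.m[14];
    return r;
}

// Transform one small object's vertices into world space and append them to the
// shared batch buffer; after doing this for all batched objects, issue one draw call.
void AppendToBatch(const Matrix4x4& objectToWorld,
                   const std::vector<Vec3>& objectVerts,
                   std::vector<Vec3>& batchedVerts)
{
    for (size_t i = 0; i < objectVerts.size(); ++i)
        batchedVerts.push_back(TransformPoint(objectToWorld, objectVerts[i]));
    // indices are appended similarly, offset by the vertex count so far
}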

Actually, I could have seen that it’s two draw calls in the editor directly, but it did not occur to me to look for that.

Let’s check what happens if we explicitly disable dynamic batching. Ha! The issue is gone.

So by now, what we know is: it’s some recent regression in dynamic batching, that happens on iOS device but not in the editor; and is not Metal related at all.

But it’s not that “all dynamic batching got broken”, because:

  • Half of the bug scene (the pink objects) are dynamic-batched, and they render correctly.
  • We do have automated graphics tests that cover dynamic batching; they run on iOS; and they did not notice any regressions.

Finding It

Since the regression is recent (4.6.1 worked, and was something like three weeks old), I chose to look at everything that changed since that release, and try to guess which changes are dynamic batching related, and could affect iOS but not the editor.

This is like a heuristic step before/instead of doing actual “bisecting of the bug”. The Unity codebase is large and testing builds isn’t an extremely fast process (mercurial update, build editor, build iOS support, build iOS application, run). If the bug were a regression from a really old Unity version, then I probably would have tried several in-between versions to narrow it down.

I used perhaps the most useful SourceTree feature - you select two changesets, and it shows the full diff between them. So looking at the whole diff was just several clicks away:

A bunch of changes there are immediately “not relevant” - definitely everything documentation related; almost definitely anything editor related; etc.

This one looked like a candidate for investigation (a change in matrix related ARM NEON code):

This one looked interesting too (a change in dynamic batching criteria):

And this one (a change in dynamic batching ARM NEON code):

I started looking at the last one…

Lo and behold, it had a typo indeed; the {d1[1]} thing was storing the w component of transformed vertex position, instead of z like it’s supposed to!

The code was in the part where dynamic batching is done on vertex positions only, i.e. it was only used on objects with shaders that only need positions (and not normals, texture coordinates etc.). This explains why half of the scene was correct (pink objects use shader that needs normals as well), and why our graphics tests did not catch this (so turns out, they don’t test dynamic batching with position-only shaders).

Fixing It

The fix is literally a one character change:

…and the batching code is getting some more tests.

Importing Cubemaps From Single Images

So this tweet on EXR format in texture pipeline and replies on cubemaps made me write this…

Typically skies or environment maps are authored as regular 2D textures, and then turned into cubemaps at “import time”. There are various cubemap layouts commonly used: lat-long, spheremap, cross-layout etc.

In Unity 4 we had a pipeline where the user had to pick which projection the source image uses. But for Unity 5, ReJ realized that this is just boring, useless work! You can tell these projections apart quite easily by looking at the image aspect ratio.

So now we default to “automatic” cubemap projection, which goes like this:

  • If aspect is 4:3 or 3:4, it’s a horizontal or vertical cross layout.
  • If aspect is square, it’s a sphere map.
  • If aspect is 6:1 or 1:6, it’s six cubemap faces in a row or column.
  • If aspect is about 2:1 (the code below actually checks against 1.85:1), it’s a lat-long map.

Now, some images don’t quite match these exact ratios, so the code is doing some heuristics. Actual code looks like this right now:

float longAxis = image.GetWidth();
float shortAxis = image.GetHeight();
bool definitelyNotLatLong = false;
if (longAxis < shortAxis)
{
    Swap (longAxis, shortAxis);
    // images that have height > width are never latlong maps
    definitelyNotLatLong = true;
}

const float aspect = shortAxis / longAxis;
const float probSphere = 1-Abs(aspect - 1.0f);
const float probLatLong = (definitelyNotLatLong) ?
    0 :
    1-Abs(aspect - 1.0f / 1.85f);
const float probCross = 1-Abs(aspect - 3.0f / 4.0f);
const float probLine = 1-Abs(aspect - 1.0f / 6.0f);
if (probSphere > probCross &&
    probSphere > probLine &&
    probSphere > probLatLong)
{
    // sphere map
    return kGenerateSpheremap;
}
if (probLatLong > probCross &&
    probLatLong > probLine)
{
    // lat-long map
    return kGenerateLatLong;
}
if (probCross > probLine)
{
    // cross layout
    return kGenerateCross;
}
// six images in a row
return kGenerateLine;

So that’s it. There’s no point in forcing your artists to paint lat-long maps, and use some external software to convert to cross layout, or something.

Now of course, you can’t just look at image aspect and determine all possible projections out of it. For example, both spheremap and “angular map” are square. But in my experience heuristics like the above are good enough to cover most common use cases (which seem to be: lat-long, cross layout or a sphere map).