More Hash Function Tests

In the previous post, I wrote about non-crypto hash functions, and did some performance tests. Turns out, it’s great to write about stuff! People at the comments/twitter/internets pointed out more things I should test. So here’s a follow-up post.

What is not the focus

This is about algorithms for “hashing some amount of bytes”, for use either in hashtables or for checksum / uniqueness detection. Depending on your situation, there’s a whole family of algorithms for that, and I am focusing on only one: non-cryptographic fast hash functions.

  • This post is not about cryptographic hashes. Do not read below if you need to hash passwords, sensitive data going through untrusted medium and so on. Use SHA-1, SHA-2, BLAKE2 and friends.
  • Also, I’m not focusing on algorithms that are designed to prevent possible hashtable Denial-of-Service attacks. If something comes from the other side of the internet and ends up inserted into your hashtable, then to prevent possible worst-case O(N) hashtable behavior you’re probably off by using a hash function that does not have known “hash flooding” attacks. SipHash seems to be popular now.
  • If you are hashing very small amounts of data of known size (e.g. single integers or two floats or whatever), you should probably use specialized hashing algorithms for those. Here are some integer hash functions, or 2D hashing with Weyl, or perhaps you could take some other algorithm and just specialize it’s code for your known input size (e.g. xxHash for a single integer).
  • I am testing 32 and 64 bit hash functions here. If you need larger hashes, quite likely some of these functions might be suitable (e.g. SpookyV2 always produces 128 bit hash).

When testing hash functions, I have not gone to great lengths to get them compiling properly or setting up all the magic flags on all my platforms. If some hash function works wonderfully when compiled on Linux Itanium box with an Intel compiler, that’s great for you, but if it performs poorly on the compilers I happen to use, I will not sing praises for it. Being in the games industry, I care about things like “what happens in Visual Studio”, and “what happens on iOS”, and “what happens on PS4”.

More hash function tests!

I’ve updated my hash testbed on github (use revision 9b59c91cf) to include more algorithms, changed tests a little, etc etc.

I checked both “hashing data that is aligned” (16-byte aligned address of data to hash), and unaligned data. Everywhere I tested, there wasn’t a notable performance difference that I could find (but then, I have not tested old ARM CPUs or PowerPC based ones). The only visible effect is that MurmurHash and SpookyHash don’t properly work in asm.js / Emscripten compilation targets, due to their usage of unaligned reads. I’d assume they probably don’t work on some ARM/PowerPC platforms too.

Hash functions tested below:

These are the main functions that are interesting. Because people kept on asking, and because “why not”, I’ve included a bunch of others too:

  • SipRef - SipHash-2-4 reference implementation. Like mentioned before, this one is supposedly good for avoiding hash flooding attacks.
  • MD5-32, SHA1-32, CRC32 - simple implementations of well-known hash functions (from SMHasher test suite). Again, these mostly not in the category I’m interested in, but included to illustrate the performance differences.
  • FNV-1a, FNV-1amod - FNV hash and modified version.
  • djb2, SDBM - both from this site.

Performance Results

Windows / Mac

macOS results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build. Late 2013 MacBookPro:

Windows results, compiled with Visual Studio 2015 in Release 64 bit build. Core i7-5820K:

Notes:

  • Performance profiles of these are very similar.
  • xxHash64 wins at larger (0.5KB+) data sizes, followed closely by 64 bit FarmHash and CityHash.
  • At smaller data sizes (10-500 bytes), FarmHash and CityHash look very good!
  • mum-hash is much slower on Visual Studio. At first I thought it’s _MUM_UNALIGNED_ACCESS macro that was not handling VS-specific defines (_M_AMD64 and _M_IX86) properly (see PR). Turns out it’s not; the speed difference comes from _MUM_USE_INT128 which only kicks in on GCC-based compilers. mum-hash would be pretty competetive otherwise.

Windows 32 bit results, compiled with Visual Studio 2015 in Release 32 bit build. Core i7-5820K:

On a 32 bit platform / compilation target, things change quite a bit!

  • All 64 bit hash functions are slow. For example, xxHash64 frops from 13GB/s to 1GB/s. Use a 32 bit hash function when on a 32 bit target!
  • xxHash32 wins by a good amount.

Note on FarmHash – whole idea behind it is that it uses CPU-specific optimizations (that also change the computed hash value). The graphs above are using default compilation settings on both macOS and Windows. However, on macOS enabling SSE4.2 instructions in Xcode settings makes it much faster at larger data sizes:

With SSE4.2 enabled, FarmHash64 handily wins against xxHash64 (17.9GB/s vs 12.8GB/s) for data that is larger than 1KB. However, that requires compiling with SSE4.2, at my work I can’t afford that. Enabling the same options on XboxOne makes it slower :( Enabling just FARMHASH_ASSUME_AESNI makes the 32 bit FarmHash faster on XboxOne, but does not affect performance of the 64 bit hash. FarmHash also does not have any specific optimizations for ARM CPUs, so my verdict with it all is “not worth bothering” – afterall the impressive SSE4.2 speedup is only for large data sizes.

Mobile

iPhone SE (A9 CPU) results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build:

  • xxHash never wins here; CityHash and FarmHash are fastest across the whole range.

iPhone SE 32 bit results:

This is similar to Windows 32 bit: 64 bit hash functions are slow, xxHash32 wins by a good amount.

Console

Xbox One (AMD Jaguar 1.75GHz CPU) results, compiled Visual Studio 2015 in Release mode:

  • Similar to iPhone results, xxHash is quite a bit slower than CityHash and FarmHash. xxHash uses 64 bit multiplications heavily, whereas others mostly do shifts and logic ops.
  • SpookyHash wins at larger data sizes.

JavaScript

JavaScript (asm.js via Emscripten) results, running on late 2013 MacBookPro.

  • asm.js is in all practical sense a 32 bit compilation target; 64 bit integer operations are slow.
  • xxHash32 wins, followed by FarmHash32.
  • At smaller (<25 bytes) data sizes, simple hashes like FNV-1a, SDBM or djb2 are useful.

Throughput tables

Performance at large data sizes (~4KB), in GB/s:

Hash64 bit32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefoxasm.js Chrome
xxHash64 13.2 12.8 5.7 1.5 1.1 1.5 0.3 0.1
City64 12.2 12.2 7.2 3.6 1.6 1.9 0.4 0.1
Mum 4.0 9.5 4.5 0.5 0.7 1.3 0.1 0.1
Farm64 12.3 11.9 8.2 3.3 2.4 2.2 0.4 0.1
SpookyV2-64 12.8 12.5 7.1 4.3 2.6 1.9 -- --
xxHash32 6.8 6.6 4.0 1.5 6.7 3.7 2.5 1.4
Murmur3-X64-64 7.1 5.8 2.3 1.2 0.9 0.8 -- --
Murmur2A 3.4 3.3 1.7 0.9 3.4 1.8 -- --
Murmur3-32 3.1 2.7 1.3 0.5 3.1 1.3 -- --
City32 5.1 4.9 2.6 0.9 4.0 2.6 1.1 0.8
Farm32 5.2 4.3 2.6 1.8 4.3 1.9 2.1 1.4
SipRef 1.4 1.6 1.0 0.4 0.3 0.4 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.5 0.3 0.2 0.2 0.4 0.3 0.4 0.4
SHA1-32 0.5 0.5 0.3 0.1 0.4 0.2 0.3 0.2
FNV-1a 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.8
FNV-1amod 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.7
djb2 0.9 0.8 0.6 0.4 1.1 0.6 0.8 0.8
SDBM 0.9 0.8 0.4 0.3 0.8 0.4 0.8 0.8

Performance at medium size (128 byte) data, in GB/s:

Hash64 bit32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefoxasm.js Chrome
xxHash64 6.6 6.2 2.8 0.9 0.7 0.7 0.2 0.1
City64 7.6 7.6 4.4 1.7 1.1 1.5 0.3 0.1
Mum 3.2 7.6 4.1 0.5 0.6 1.1 0.1 0.0
Farm64 6.6 5.7 3.4 1.4 0.9 1.1 0.3 0.1
SpookyV2-64 3.4 3.2 1.7 1.4 0.7 0.5 -- --
xxHash32 5.1 5.3 3.4 1.3 5.1 2.8 2.2 1.4
Murmur3-X64-64 5.3 4.3 2.1 1.0 0.8 0.7 -- --
Murmur2A 3.6 3.0 2.1 0.8 3.3 1.7 -- --
Murmur3-32 3.1 2.3 1.3 0.4 2.8 1.3 -- --
City32 4.0 3.6 2.0 0.7 3.3 1.9 1.0 0.8
Farm32 3.9 3.2 1.9 1.0 3.4 1.6 1.6 1.1
SipRef 1.2 1.3 0.8 0.4 0.3 0.3 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.3 0.2 0.2 0.1 0.3 0.2 0.2 0.2
SHA1-32 0.2 0.2 0.1 0.1 0.1 0.1 0.1 0.1
FNV-1a 0.9 1.0 0.5 0.3 0.9 0.5 0.9 0.9
FNV-1amod 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8
djb2 1.0 0.9 0.7 0.4 1.1 0.7 0.9 0.8
SDBM 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8

Performance at small size (17 byte) data, in GB/s:

Hash64 bit32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefoxasm.js Chrome
xxHash64 2.1 1.8 0.5 0.5 0.4 0.4 0.1 0.0
City64 3.4 3.0 1.5 0.7 0.5 0.7 0.2 0.0
Mum 1.2 2.6 0.9 0.2 0.3 0.5 0.1 0.0
Farm64 3.6 2.6 1.2 0.7 0.6 0.6 0.1 0.0
SpookyV2-64 1.2 1.0 0.6 0.4 0.2 0.1 -- --
xxHash32 2.2 2.0 0.7 0.5 1.9 0.8 1.2 0.8
Murmur3-X64-64 1.7 1.3 0.5 0.4 0.3 0.3 -- --
Murmur2A 2.4 1.8 1.1 0.5 2.1 1.0 -- --
Murmur3-32 2.1 1.5 0.8 0.4 1.8 0.8 -- --
City32 2.1 1.9 0.9 0.5 1.7 0.7 0.8 0.7
Farm32 2.5 2.0 0.8 0.5 1.8 0.9 0.9 0.5
SipRef 0.6 0.6 0.3 0.2 0.2 0.1 0.0 0.0
CRC32 0.8 0.7 0.4 0.2 0.6 0.3 0.5 0.4
MD5-32 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0
SHA1-32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FNV-1a 1.3 1.5 1.0 0.3 1.2 0.7 1.4 1.0
FNV-1amod 1.1 1.4 0.9 0.3 1.0 0.7 1.3 0.9
djb2 1.6 1.6 1.1 0.4 1.1 0.7 1.3 0.9
SDBM 1.4 1.3 1.1 0.4 1.2 0.7 1.5 1.0

A note on hash quality

As far as I’m concerned, all the 64 bit hashes are excellent quality.

Most of the 32 bit hashes are pretty good too on the data sets I tested.

SDBM produces more collisions than others on binary-like data (various struct memory dumps, 5742 entries, average length 55 bytes – SDBM had 64 collisions, all the other hashes had zero). You could have a way worse hash function than SDBM of course, but then functions like FNV-1a are about as simple to implement, and seem to behave better on binary data.

A note on hash consistency

Some of the hash functions do not produce identical output on various platforms. Among the ones I tested, mum-hash and FarmHash produced different output depending on compiler and platform used.

It’s very likely that most of the above hash functions produce different output if ran on a big-endian CPU. I did not have any platform like that to test on.

Some of the hash functions depend on unaligned memory reads being possible – most notably Murmur and Spooky. I had to change Spooky to work on ARM 32 bit (define ALLOW_UNALIGNED_READS to zero in the source code). Murmur and Spooky did produce wrong results on asm.js (no crash, just different hash results and way more collisions than expected).

Conclusions

General cross-platform use: CityHash64 on a 64 bit system; xxHash32 on a 32 bit system.

  • Good performance across various data sizes.
  • Identical hash results across various little-endian platforms and compilers.
  • No weird compilation options to tweak or peformance drops on compilers that it is not tuned for.

Best throughput on large data: depends on platform!

  • Intel CPU: xxHash64 in general, FarmHash64 if you can use SSE4.2, xxHash32 if compiling for 32 bit.
  • Apple mobile CPU (A9): FarmHash64 for 64 bit, xxHash32 for 32 bit.
  • Console (XboxOne, AMD Jaguar): SpookyV2.
  • asm.js: xxHash32.

Best for short strings: FNV-1a.

  • Super simple to implement, inline-able.
  • Where exactly it becomes “generally fastest” depends on a platform; around 8 bytes or less on PC, mobile and console; around 20 bytes or less on asm.js.
  • If your data is fixed size (e.g. one or two integers), look into specialized hashes instead (see above).

Hash Functions all the way down

Update! I tested more hash functions in a follow-up post. See More Hash Function Tests.

A while ago I needed fast hash function for ~32 byte keys. We already had MurmurHash used in a bunch of places, so I started with that. But then I tried xxHash and that was a bit faster! So I dropped xxHash into the codebase, landed the thing to mainline and promptly left for vacation, with a mental note of “someday should look into other hash functions, or at least move them all under a single folder”.

So that’s what I did: “hey I’ll move source code of MurmurHash, SpookyHash and xxHash under a single place”. But that quickly spiraled out of control:

The things I found! Here’s what you find in a decent-sized codebase, with many people working on it:

  • Most places use a decent hash function – in our case Murmur2A for 32/64 bit hashes, and SpookyV2 for 128 bit hashes. That’s not a bad place to be in!
  • Murmur hash takes seed as input, and naturally almost all places in code copy-pasted the same random hex value as the seed :)
  • There are at least several copies of either FNV or djb2 hash function implementations scattered around, used in random places.
  • Some code uses really, REALLY bad hash functions. There’s even a comment above it, added a number of years ago, when someone found out it’s bad – however they only stopped using it in their own place, and did not replace other usages. Life always takes the path of least resistance :) Thankfully, the places where said hash function was used were nothing “critical”.
  • While 95% of code that uses non-cryptographic hash functions uses them strictly for runtime reasons (i.e. they don’t actually care about exact value of the hash), there are some pieces that hash something, and serialize the hashed value. Each of these need to be looked at in detail, whether you can easily switch to a new hash function.
  • Some other hash related code (specifically, a struct we have to hold 128 bit hashed value, Hash128), were written in a funky way, ages ago. And of course some of our code got that wrong (thankfully, all either debug code, or test mocking code, or something not-critical). Long story short, do not have struct constructors like this: Hash128(const char* str, int len=16)!
    • Someone will think this accepts a string to hash, not “bytes of the hash value”.
    • Someone will pass "foo" into it, and not provide length argument, leading to out-of-bounds reads.
    • Some code will accidentally pass something like 0 to a function that accepts a Hash128, and because C++ is C++, this will get turned into a Hash128(NULL, 16) constructor, and hilarity will ensue.
    • Lesson: be careful with implicit constructors (use explicit). Be careful with default arguments. Don’t set types to const char* unless it’s really a string.

So what started out as “move some files” branch, ended up being a “move files, switch most of hash functions, remove some bad hash functions, change some code, fix some wrong usages, add tests and comments” type of thing. It’s a rabbit hole of hash functions, all the way down!

Anyway.

Hash Functions. Which one to use?

MurmurHash got quite popular, at least in game developer circles, as a “general hash function”. My quick twitter poll seems to reflect that:

It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!

Sometimes you don’t know much about the data, for example if you’re hashing arbitrary files, or some user input that could be “pretty much anything”.

So let’s take a look at general purpose hash functions. There’s plenty of good tests on the internets (e.g. Hash functions: An empirical comparison), but I wanted to make my own little tests. Because why not! Here’s my randomly mashed together little testbed (use revision 4d535b).

Throughput

Here’s results of various hash functions, hashing data of different lengths, with performance in MB/s:

This was tested on late 2013 MacBookPro (Core i7-4850HQ 2.3GHz), Xcode 7.3.1 release 64 bit build.

  • xxHash in 32 and 64 bit variants, as well as “use 64 bit, take lowest 32 bits of result” one.
  • SpookyHash V2, the 128 bit variant, only taking 64 lowest bits.
  • Murmur, a couple variants of it.
  • CRC32, FNV and djb2, as I found them in our own codebase. I did not actually check whether they are proper implementations or somehow tweaked! Their source is at the testbed, revision 4d535b.

In terms of throughput at not-too-small data sizes (larger than 10-20 bytes), xxHash is the king. If you need 128 bit hashes, SpookyHash is very good too.

What about small keys?

Good question! The throughput of XXH64 is achieved by carefully exploiting instruction level parallelism of modern CPUs. It has a main loop that does 32 bytes at a time, with four independent hashing rounds. It looks something like this:

// rough outline of XXH64:
// ... setup code
do {
    v1 = XXH64_round(v1, XXH_get64bits(p)); p+=8;
    v2 = XXH64_round(v2, XXH_get64bits(p)); p+=8;
    v3 = XXH64_round(v3, XXH_get64bits(p)); p+=8;
    v4 = XXH64_round(v4, XXH_get64bits(p)); p+=8;
} while (p<=limit);
// ... merge v1..v4 values
// ... do leftover bit that is not multiple of 32 bytes

That way even if it looks like it’s “doing more work”, it ends up being faster than super simple algorithms like FNV, that work on one byte at a time, with each and every operation depending on the result of the previous one:

// rough outline of FNV:
while (*c)
{
	hash = (hash ^ *c++) * p;
}

However, xxHash has all this “prologue” and “epilogue” code around the main loop, that handles either non-aligned data, or leftovers from data that aren’t multiple of 32 bytes in size. That adds some branches, and for short keys it does not even go into that smart loop!

That can be seen from the graph above, e.g. xxHash 32 (which has core loop of 16-bytes-at-once) is faster at key sizes < 100 bytes. Let’s zoom in at even smaller data sizes:

Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 win in performance. I’m picking out FNV as the winner here, because in my tests djb2 had somewhat more collisions (see below).

What about other platforms?

PC/Windows: tests I did on Windows (Core i7-5820K 3.3GHz, Win10, Visual Studio 2015 release 64 bit) follow roughly the same pattern as the results on a Mac. Nothing too surprising here.

Consoles: did a quick test on XboxOne. Results are surprising in two ways: 1) oh geez the console CPUs are slow (well ok that’s not too surprising), and 2) xxHash is not that awesome here. It’s still decent, but xxHash64 has consistenly worse performance than xxHash32, and for larger keys SpookyHash beats them all. Maybe I need to tweak some settings or poke Yann to look at it? Adding a mental note to do that later…

Mobile: tested on iPhone 6 in 64 bit build. Results not too surprising, except again, unlike PC/Mac with an Intel CPU, xxHash64 is not massively better than everything else – SpookyHash is really good on ARM64 too.

JavaScript! :) Because it was easy, I also compiled that code into asm.js via Emscripten. Overall the patterns are similar, except the >32 bit hashes (xxHash64, SpookyV2) – these are much slower. This is expected, both xxHash64 and Spooky are specifically designed either for 64 bit CPUs, or when you absolutely need a >32 bit hash. If you’re on 32 bit, use xxHash32 or Murmur!

What about hash quality?

SMHasher seems to be a de-facto hash function testing suite (see also: a fork that includes more modern hash functions).

For a layman test, I tested several things on data sets I cared about:

  • Words” - just a dump of English words (/usr/share/dict/words). 235886 entries, 2.2MB total size, average length 9.6 bytes.
  • Filenames” - dump of file paths (from a Unity source tree tests folder). 80297 entries, 4.3MB total size, average length 56.4 bytes.
  • “Source” - partial dump of source files from Unity source tree. 6069 entries, 43.7MB total size, average length 7547 bytes.

First let’s see how many collisions we’d get on these data sets, if we used the hash function for uniqueness/checksum type of checking. Lower numbers are better (0 is ideal):

Hash Words collis Filenames collis Source collis
xxHash32 6 0 0
xxHash64-32 7 0 0
xxHash64 0 0 0
SpookyV2-64 0 0 0
Murmur2A 11 0 0
Murmur3-32 3 1 0
CRC32 5 2 0
FNV 5 1 0
djb2 10 1 0
ZOMGBadHash 998 842 0

ZOMGBadHash is that fairly bad hash function I found, as mentioned above. It’s not fast either, and look at that number of collisions! Here’s how it looked like:

unsigned CStringHash(const char* key)
{
	unsigned h = 0;
	const unsigned sr = 8 * sizeof (unsigned) - 8;
	const unsigned mask = 0xFu << (sr + 4);
	while (*key != '\0')
	{
		h = (h << 4) + *key;
		std::size_t g = h & mask;
		h ^= g | (g >> sr);
		key++;
	}
	return h;
}

I guess someone thought a random jumbo of shifts and xors is gonna make a good hash function, or something. And thrown in mixed 32 vs 64 bit calculations too, for good measure. Do not do this! Hash functions are not just random bit operations.

As another measure of “hash quality”, let’s imagine we use the hash functions in a hashtable. A typical hashtable of a load factor 0.8, that always has power-of-two number of buckets (i.e. something like bucketCount = NextPowerOfTwo(dataSetSize / 0.8)). If we’d put the above data sets into this hashtable, then how many entries we’d have per bucket on average? Lower numbers are better (1.0 is ideal):

Hash Words avg bucket Filenames avg bucket Source avg bucket
xxHash32 1.241 1.338 1.422
xxHash64-32 1.240 1.335 1.430
xxHash64 1.240 1.335 1.430
SpookyV2-64 1.241 1.336 1.434
Murmur2A 1.239 1.340 1.430
Murmur3-32 1.242 1.340 1.427
CRC32 1.240 1.338 1.421
FNV 1.243 1.339 1.415
djb2 1.242 1.339 1.414
ZOMGBadHash 1.633 2.363 7.260

Here all the hash functions are very similar, except the ZOMGBadHash which is, as expected, doing not that well.

TODO

I did not test some of the new-ish hash functions (CityHash, MetroHash, FarmHash). Did not test hash functions that use CPU specific instructions either (for example variants of FarmHash can use CRC32 instruction that’s added in SSE4.2, etc.). That would be for some future time.

Conclusion

  • xxHash64 is really good, especially if you’re on an 64 bit Intel CPU.
  • If you need 128 bit keys, use SpookyHash. It’s also really good if you’re on a non-Intel 64 bit CPU (as shown by XboxOne - AMD and iPhone6 - ARM throughput tests).
  • If you need a 32 bit hash, and are on a 32 bit CPU/system, do not use xxHash64 or SpookyHash! Their 64 bit math is costly when on 32 bit; use xxHash32 or Murmur.
  • For short data/strings, simplicity of FNV or djb2 are hard to beat, they are very performant on short data as well.
  • Do not throw in random bit operations and call that a hash function. Hash function quality is important, and there’s plenty of good (and fast!) hash functions around.

Note: I tested more hash functions in a follow-up post. See More Hash Function Tests.


Adam Demo production talk at CGEvent

A while ago I delivered a talk at CG EVENT 2016 in Vilnius, about some production details of Unity’s Adam demo.

Clarification point! I did nothing at all for the demo production, just did the talk pretending to be @__ReJ__. Also, there are more blog posts about the demo production on Unity blog, e.g. Adam - Production Design for Realtime Short Film by @_calader. All credit goes to the demo team, I only delivered the message :)

Here are the talk slides (34MB pdf).

Some of them were meant to be videos, so here they are, individually:

Slide 41, rough animatic:

Slide 46, previz:

Slide 47, previz with lighting & camera:

Slide 48, early Neuron mocap:

Slide 49, previz with shading & mocap:

Slide 50, previz with postprocessing:

Slide 52, Vicon mocap:

Slide 59, WIP animations & fracture:

Slide 60, eye animation:

Slide 68, area lights:

Slide 71, translucency shader:

Slide 73, crowd simulation:


Maldives Vacation Report 2016

Just spent 9 days in Maldives doing nothing! So here’s a writeup and a bunch of photos.

“Magical”, “paradise”, “heaven” and other things, they say. So we decided to check it out for ourselves. Another factor was that after driving-heavy vacations in some previous years (e.g. USA or Iceland), the kids wanted to just do nothing for a bit. Sit in the water & snorkel, basically.

So off we went, all four of us. Notes in random order:

Picking where to go

This one’s tough. We don’t like resorts and hotels, so at first thought about going airbnb-style. Turns out, there isn’t much of that going on in Maldives; and some of the places that have airbnb listings up are actually just ye olde hotels or guesthouses anyway.

Lazy vacation being lazy, then it was “pick an island & resort” time. This primarily depends on your budget; from guesthouses at the low end to… well I guess there’s no limit how expensive it can go. “Very” would be an accurate description.

There are other factors, like whether you want to dive or snorkel (then look for diving spots & reef information), how much entertainment options you want, whether you’re bringing kids etc. etc.

What kinda sucks about many of the blogs on Maldives, is that they are written as if an honest impression from some traveller, only to find in a small print that hey, the resort or some travel agency covered their expenses. Apparently this “travel blogger” is an actual profession that people make money on; a subtle form of advertisement. Good for them, but makes you wonder how biased their blog posts are.

We wanted a smallish resort, that’s kinda “midrange” in terms of prices, has good reviews and has a decent house reef. Upon some searching, picked Gangehi (tripadvisor link) kinda randomly. Here are basic impressions:

  • Good: nice, small, clean, good reef, nice beach area for small kids.
  • Neutral: food was a mixed bag (some very good, some meh).

Going there

It feels like “hey it’s somewhat beyond Turkey, not that far away”, but it’s more like “Sri Lanka distance away”. For us that was a drive to Vilnius, 2.5hr flight to Istanbul, and 8hr flight to Male, and from there a ~30 minute seaplane flight to the resort.

Seeing the atols during the plane landing is very impressive, especially if you haven’t seen such a thing before (none of us had). They do not look like a real thing :)

Maldives is a whole bunch of separate tiny islands, so the only choice of travel is either by boat or seaplane. None of us flew that before either, so that was interesting too! Here it is, and here’s the resort’s “airport”. After this, even the airports we have here in Lithuania are ginormous by comparison:

Maldives with kids?!

This is supposedly a honeymoon destination, or something. Instead, we went about 14 years too late for that, and with two kids. Punks not dead! It’s fine; at least in our resort there weren’t that many honeymoon couples, actually. There was no special enternainment for kids, but hey, water that you can spend full day in, snorkeling and sand. And an iPad for when that gets boring (whole-family-fun this time was Smallworld 2).

There are some resorts that have special “kid activities” (not sure if we would have cared for that though), and I’m told there are others that explicitly do not allow kids. But overall, if your kids like water you should be good to go.

Maldives in July?!

July is very much a non-season in Maldives – it’s the rain season, and the temperature is colder by a whopping 2 degrees! This leaves it at 31C in the day, and 28C in the night. The horror! :)

The upside of going there now: fewer people around, and apparently prices somewhat lower. We lucked out in that almost all the rain was during the nights; over whole week got maybe 15 minutes of rain in the daytime.

Snorkeling

Now, none of us are divers. We have never snorkeled before either. Most of us (except Marta) can barely swim as well – e.g. my swimming skills are “I’m able to not drown” :) So this time we decided “ok let’s try snorkeling”, and diving (including learning how to do that) will be left for another time.

It’s pretty cool.

Apparently the current El NiƱo has contributed quite a lot to coral bleaching in Maldives; the same reef was quite a lot more colorful half a year ago.

We’re the ones that don’t have any sort of GoPro, and considered getting one before the trip for taking pictures. However, we don’t care about video at all, so went for Olympus TG-4 instead. RAW support and all that. The underwater and water splashing photos are from that camera.

The other photos on this post are either Canon EOS 70D with Canon 24-70mm f/2.8 L II lens or an iPhone 6.

What else is there to do?

Not much, actually :) Splash in the water and walk around:

Watch an occasional bird, stingray (here, cowtail stingray, pastinachus sephen) or shark (no photos of sharks, but there’s a ton of little sharks all around; they are not dangerous at all):

Build castles made of sand, that melt into the sea eventually. Watch hermit crabs drag their shells around.

Take pictures in the traditional “photo to post on facebook” style. Apparently this makes you look awesome, or something.

That thing with houses on top of water is a cute hack by the resorts. The amount of land available is extremely limited, so hey, let’s build houses on water and tell people they are awesome! :)

Take nighttime photos. This is almost equator, so the sun sets very early; by 7:30PM it’s already dark.

Just walk around, if you can call it that. This island was maybe 150 meters in diameter, so “the walks” are minutes in length.

Go on a boat trip to (try to) see dolphins. We actually saw them:

Another boat trip to see a local non-resort island (Mathiveri):

Oh, and the sunsets. Gotta take pictures of the sunsets:

And that’s about it. The rest of the time: reading, sleeping, doing nothing :)

Conclusion?

What would I do differently next time?

Spending 8-9 days in a single place is stretching it. The amount of photos above make it sound like there’s a lot of things to do, but you can do all that in two days. If you’re really prepared for a “do nothing” style vacation, then it’s fine. I’m not used to that (though a friend told me: “Aras, you just need to practice more! No one gets it from the 1st time!”). So I’d probably either do the whole thing shorter, or split it up in two different islands, plus perhaps a two-day guesthouse stay at a local (non-resort) island for a change. Apparently that even has a term, “island hopping”.

Would there be next time?

Not sure yet. It was very nice to see, once. Maybe some other time too – but very likely for next vacation or two we’ll go back to “travel & drive around & see things” style. But if we want another lazy vacation, then it’s a really good place… if your budget allows. This trip was slightly more expensive than our US roadtrip, for example.

So that’s it, until next time!


Solving DX9 Half-Pixel Offset

Summary: the Direct3D 9 “half pixel offset” problem that manages to annoy everyone can be solved in a single isolated place, robustly, and in a way where you don’t have to think about it ever again. Just add two instructions to all your vertex shaders, automatically.

…here I am wondering if the target audience for D3D9 related blog post in 2016 is more than 7 people in the world. Eh, whatever!

Background

Direct3D before version 10 had this pecularity called “half pixel offset”, where viewport coordinates are shifted by half a pixel compared to everyone else (OpenGL, D3D10+, Metal etc.). This causes various problems, particularly with image post-processing or UI rendering, but elsewhere too.

The official documentation (”Directly Mapping Texels to Pixels”), while being technically correct, is not exactly summarized into three easy bullet points.

The typical advice is various: “shift your quad vertex positions by half a pixel” or “shift texture coordinates by half a texel”, etc. Most of them talk almost exclusively about screenspace rendering for image processing or UI.

The problem with all that, is that this requires you to remember to do things in various little places. Your postprocessing code needs to be aware. Your UI needs to be aware. Your baking code needs to be aware. Some of your shaders need to be aware. When 20 places in your code need to remember to deal with this, you know you have a problem.

3D has half-pixel problem too!

While most of material on D3D9 half pixel offset talks about screen-space operations, the problem exists in 3D too! 3D objects are rendered slightly shifted compared to what happens on OpenGL, D3D10+ or Metal.

Here’s a crop of a scene, rendered in D3D9 vs D3D11:

And a crop of a crop, scaled up even more, D3D9 vs D3D11:

Root Cause and Solution

The root cause is that viewport is shifted by half a pixel compared to where we want it to be. Unfortunately we can’t fix it by changing all coordinates passed into SetViewport, shifting them by half a pixel (D3DVIEWPORT9 coordinate members are integers).

However, we have vertex shaders. And the vertex shaders output clip space position. We can adjust the clip space position, to shift everything by half a viewport pixel. Essentially we need to do this:

// clipPos is float4 that contains position output from vertex shader
// (POSITION/SV_Position semantic):
clipPos.xy += renderTargetInvSize.xy * clipPos.w;

That’s it. Nothing more to do. Do this in all your vertex shaders, setup shader constant that contains viewport size, and you are done.

I must stress that this is done across the board. Not only postprocessing or UI shaders. Everything. This fixes the 3D rasterizing mismatch, fixes postprocessing, fixes UI, etc.

Wait, why no one does this then?

Ha. Turns out, they do!

Maybe it’s common knowledge, and only I managed to be confused? Sorry about that then! Should have realized this years ago…

Solving This Automatically

The “add this line of HLSL code to all your shaders” is nice if you are writing or generating all the shader source yourself. But what if you don’t? (e.g. Unity falls into this camp; zillions of shaders already written out there)

Turns out, it’s not that hard to do this at D3D9 bytecode level. No HLSL shader code modifications needed. Right after you compile the HLSL code into D3D9 bytecode (via D3DCompile or fxc), just slightly modify it.

D3D9 bytecode is documented in MSDN, “Direct3D Shader Codes”.

I thought whether I should be doing something flexible/universal (parse “instructions” from bytecode, work on them, encode back into bytecode), or just write up minimal amount of code needed for this patching. Decided on the latter; with any luck D3D9 is nearing it’s end-of-life. It’s very unlikely that I will ever need more D3D9 bytecode manipulation. If in 5 years from now we’ll still need this code, I will be very sad!

The basic idea is:

  1. Find which register is “output position” (clearly defined in shader model 2.0; can be arbitrary register in shader model 3.0), let’s call this oPos.
  2. Find unused temporary register, let’s call this tmpPos.
  3. Replace all usages of oPos with tmpPos.
  4. Add mad oPos.xy, tmpPos.w, constFixup, tmpPos and mov oPos.zw, tmpPos at the end.

Here’s what it does to simple vertex shader:

vs_3_0           // unused temp register: r1
dcl_position v0
dcl_texcoord v1
dcl_texcoord o0.xy
dcl_texcoord1 o1.xyz
dcl_color o2
dcl_position o3  // o3 is position
pow r0.x, v1.x, v1.y
mul r0.xy, r0.x, v1
add o0.xy, r0.y, r0.x
add o1.xyz, c4, -v0
mul o2, c4, v0
dp4 o3.x, v0, c0 // -> dp4 r1.x, v0, c0
dp4 o3.y, v0, c1 // -> dp4 r1.y, v0, c1
dp4 o3.z, v0, c2 // -> dp4 r1.z, v0, c2
dp4 o3.w, v0, c3 // -> dp4 r1.w, v0, c3
                 // new: mad o3.xy, r1.w, c255, r1
                 // new: mov o3.zw, r1

Here’s the code in a gist.

At runtime, each time viewport is changed, set vertex shader constant (I picked c255) to contain (-1.0f/width, 1.0f/height, 0, 0).

That’s it!

Any downsides?

Not much :) The whole fixup needs shaders that:

  • Have an unused constant register. Majority of our shaders are shader model 3.0, and I haven’t seen vertex shaders that use all 32 temporary registers. If that is a problem, “find unused register” analysis could be made smarter, by looking for an unused register just in the place between earliest and latest position writes. I haven’t done that.
  • Have an unused constant register at some (easier if fixed) index. Base spec for both shader model 2.0 and 3.0 is that vertex shaders have 256 constant registers, so I just picked the last one (c255) to contain fixup data.
  • Have instruction slot space to put two more instructions. Again, shader model 3.0 has 512 instruction slot limit and it’s very unlikely it’s using more than 510.

Upsides!

Major ones:

  • No one ever needs to think about D3D9 half-pixel offset, ever, again.
  • 3D rasterization positions match exactly between D3D9 and everything else (D3D11, GL, Metal etc.).

Fixed up D3D9 vs D3D11. Matches now:

I ran all the graphics tests we have, inspected all the resulting differences, and compared the results with D3D11. Turns out, this revealed a few minor places where we got the half-pixel offset wrong in our shaders/code before. So additional advantages (all Unity specific):

  • Some cases of GrabPass were sampling in the middle of pixels, i.e. slightly blurred results. Matches DX11 now.
  • Some shadow acne artifacts slightly reduced; matches DX11 now.
  • Some cases of image postprocessing effects having a one pixel gap on objects that should have been touching edge of screen exactly, have been fixed. Matches DX11 now.

All this will probably go into Unity 5.5. Still haven’t decided whether it’s too invasive/risky change to put into 5.4 at this stage.