Shader Compression: Some Data

One common question I got about SPIR-V compression is “why compress shaders at all?”, coupled with questions about how SPIR-V shader sizes compare with shader sizes on other platforms.

Here’s some data (insert usual caveats: might be not representative, etc. etc.).

Unity Standard shader, synthetic test

Took Unity’s Standard shader, made some content to make it expand into 482 actual shader variants (some variants to handle different options in the UI, some to handle different lighting setups, lightmaps, shadows and whatnot etc.). This is purely a synthetic test, and not an indicator of any “likely to match real game data size” scenario.

These 482 shaders, when compiled into various graphics API shader representations with our current (Unity 5.5) toolchain, result in sizes like this:

API               Uncompressed MB   LZ4HC Compressed MB
D3D9                   1.04                0.14
D3D11                  1.39                0.12
Metal                  2.55                0.20
OpenGL                 2.04                0.15
Vulkan                 6.84                1.63
Vulkan + SMOL-V        1.94                0.43

Basically, SPIR-V shaders are 5x larger than D3D11 shaders, and 3x larger than GL/Metal shaders. When compressed (LZ4HC used since that’s what we use to compress shader code), they are 12x larger than D3D11 shaders, and 8x larger than GL/Metal shaders.

Adding SMOL-V encoding gets SPIR-V shaders to “only” be 3x larger than shaders of other APIs when compressed.

Game, full set of shaders

I also got numbers from one game developer trying out SMOL-V on their game. The game uses an in-house engine; here are the sizes for a full game shader build:

API                       Uncompressed MB   Zipped MB
D3D11                           44              14
Vulkan + remap                 178              62
Vulkan + remap + SMOL-V         83              54

Here again, SPIR-V shaders are several times larger than, for example, DX11 ones, even after remapping + compression. Adding SMOL-V makes them a bit smaller (I suspect that without remapping they might be even smoller).

In context

However, in the bigger picture shader compression might indeed not be a thing worth looking into. As always, “it depends”…

Adding SMOL-V encoding on this game’s data saved 15 megabytes. On one hand, that’s ten floppy disks, which is almost as much as the entire Windows 95 took! On the other hand, it’s about the size of one 4kx4k texture, when DXT5/BC7 compressed.

So yeah. YMMV.


SPIR-V Compression

TL;DR: Vulkan SPIR-V shaders are fairly large. SMOL-V can make them smaller.

Other folks are implementing Vulkan support at work, and the other day they noticed that Vulkan shaders (which are represented in the SPIR-V binary format) take up quite a lot of space. I thought it would be a fun exercise to try to make them smoller, and maybe I’d learn something about compression along the way too.

Caveat emptor! I know nothing about compression. Or rather, I’m probably at the stage where I can make the impression that I know something about it, but all that knowledge is very superficial. Exactly the stage that is dangerous, if I start to talk about it as if I have a clue! So below, I’m doing exactly that. You’ve been warned.

SPIR-V format

SPIR-V is an extremely simple and regular format: everything is 4-byte words. Many things that only need a few bits of information are still represented as a full word.

This makes it simple, but not exactly space efficient. I don’t have the data at hand right now, but a year or so ago I looked into shaders that do the same thing, compiled for DX9, DX11, OpenGL (GLSL) and Vulkan (SPIR-V), and the SPIR-V ones were the “fattest” by a large amount (DX9 and minified GLSL being the smallest).

Compressibility

“Why not just compress them?”, you ask. That should take care of those “three bits of information written as 4 bytes” style enums. That it does; standard lossless compression techniques are pretty good at encoding often-occurring patterns into a small number of bits (further reading: Huffman, Arithmetic, FSE coding).

And indeed, SPIR-V compresses quite well. For example, 1315 kilobytes worth of shader data (from various Unity shaders) compresses to 279 kilobytes with Zstandard, and to 306 kilobytes with zlib (I used the miniz implementation) at default settings. So the standard go-to compressor (zlib) gets you to 23.4% of the original SPIR-V size.

However, SPIR-V is full of not-really-compressible things, mostly various identifiers (anything called <id> in the spec). Due to the SSA form that SPIR-V uses, all the identifiers ever used are unique numbers, with nothing reusing a previous ID. A regular data compressor does not get to see many repeating patterns there.

Data compression algorithms usually only look for literally repeating patterns. If you had a file full of 0x00000001 integers, it would compress extremely well. However, if your file was a really simple sequence of integers: 1, 2, 3, 4, …, it would not compress particularly well!

I actually just tested this: 16384 four-byte words that are just the sequence of integers 0, 1, …, 16383, compressed with zlib at default settings: 64KB -> 22716 bytes.

Enter Data Filtering

Recall the “a simple sequence of numbers compresses quite poorly” example above? It turns out a typical trick in data compression is to filter the data before compressing it. Filtering can be any sort of reversible transformation of the data that makes it more compressible, i.e. gives it more actually repeating patterns.

For example, using delta encoding on that integer sequence would transform it into a file that is pretty much all just 0x00000001 integers. This compresses with zlib into just 88 bytes!
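
Here’s a minimal reproduction of both experiments above (assuming zlib is available; exact byte counts can vary a bit with zlib version and settings):

#include <stdint.h>
#include <stdio.h>
#include <vector>
#include <zlib.h>

static uLong CompressedSize(const std::vector<uint32_t>& words)
{
	uLongf destLen = compressBound((uLong)(words.size() * 4));
	std::vector<Bytef> dest(destLen);
	compress(dest.data(), &destLen, (const Bytef*)words.data(), (uLong)(words.size() * 4)); // default level
	return destLen;
}

int main()
{
	std::vector<uint32_t> seq(16384), delta(16384);
	for (uint32_t i = 0; i < 16384; ++i)
	{
		seq[i] = i;                                  // 0, 1, 2, ..., 16383
		delta[i] = (i == 0) ? 0 : seq[i] - seq[i-1]; // after filtering: all 1s past the first word
	}
	printf("raw:   %lu bytes\n", CompressedSize(seq));   // ~22716 bytes
	printf("delta: %lu bytes\n", CompressedSize(delta)); // ~88 bytes
	return 0;
}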

Data filtering is fairly widely used, for example:

  • PNG image format has several filters, as described here.
  • Executable file compression usually transforms machine code instructions into a more compressible form, see techniques used in .kkrunchy for example.
  • HDF5 scientific data format has filters like bitshuffle that reorder data before actual compression.
  • Some compressors like RAR seemingly apply various filters automatically, to data blocks they somehow identify as “filterable” (i.e. “looks like an executable” or “looks like sound wave samples”).

Perhaps we could do some filtering on SPIR-V to make it more compressible?

Filtering: spirv-remap

In SPIR-V land, there is a tool called spirv-remap that aims to help with compression. What it does is change all the IDs used in the shader to values that will hopefully be similar if you have a lot of other similar shaders and compress them all as a whole. For each new ID, it “looks” at several surrounding instructions, and picks the ID based on their hash. The assumption is that you’re very likely to have other shaders that contain similar fragments of instructions – these would compress well if only the IDs were the same.

And indeed, on that same set of shaders I had above: uncompressed size 1315KB, zstd-compressed 279KB (21.2%), remapped + zstd compressed: 189KB (14.4% compression).

However, spirv-remap tries to filter the SPIR-V program in a way that still results in a valid SPIR-V program. Maybe we could do better, if we did not have such a restriction?

SMOL-V: making SPIR-V smoller

So that’s what I did. My goal was to conceptually have two functions:

ByteArray Encode(const ByteArray& spirvInput); // SPIR-V -> SMOL-V
ByteArray Decode(const ByteArray& encodedBytes); // SMOL-V -> SPIR-V

with the goal that:

  • Encoded result would be smaller than input,
  • When compressed (with Zstd, zlib etc.), it would be smaller than if I just compressed the input,
  • When compressed, it would be smaller than what a compressed spirv-remap can achieve.
  • Do that in a fairly simple way. Since hey, I’m a compression n00b, anything that is compression rocket surgery is likely way out of my capabilities. Also, I wanted to roughly spend a (long) day on this.

So below is a write-up of what I did (it can also be seen in the commit history). First of all, I just looked at the SPIR-V binaries with a hex viewer. And at almost every step below, I either looked at the binaries, or printed bytes of instructions and looked for patterns.

Variable-length integer encoding: varint

Recall that SPIR-V uses four-byte words to store every single piece of information it needs. Often these are enum-style values that only use a few dozen possible values. I did not want to hardcode every possible operation & enum range (that would be a lot of work, and not very future-proof with later SPIR-V versions), so instead I looked at various variable-length integer storage schemes. The most famous is probably UTF-8 in text land. In binary data land there are VLQ, LEB128 and varint, which are all variations of “store 7 bits of data, and one bit to signal whether more bytes follow”. I picked the “varint” flavor as used by Google Protocol Buffers, if only because I found it before I found the others :)

With varint encoding for unsigned integers, numbers below 128 take only one byte, numbers below 16384 take two bytes and so on.
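
Here’s a minimal varint sketch in the Protocol Buffers style (not the exact code in the smol-v repository):

static void smolv_WriteVarint(std::vector<uint8_t>& out, uint32_t v)
{
	while (v >= 128)
	{
		out.push_back((uint8_t)((v & 127) | 128)); // low 7 bits, plus a "more bytes follow" flag
		v >>= 7;
	}
	out.push_back((uint8_t)v); // final byte has the top bit clear
}

static uint32_t smolv_ReadVarint(const uint8_t*& p)
{
	uint32_t v = 0;
	uint32_t shift = 0;
	uint8_t b;
	do
	{
		b = *p++;
		v |= (uint32_t)(b & 127) << shift; // accumulate 7 bits at a time
		shift += 7;
	} while (b & 128);
	return v;
}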

So the very first try was to use varint encoding on each instruction’s length+opcode word, and on the Type ID that many instructions have. Then I noticed that the Result ID of almost every instruction is just one or two IDs larger than the result of the previous instruction. So I wrote those out as deltas from the previous one, again varint-encoded.

This got just SMOL-V data to 71% size of original SPIR-V, and 18.2% when Zstd-compressed on top.

Relative-to-result and varint on often-occurring instructions

I dumped frequencies of how much space various opcode types take, and it became fairly clear that OpDecorate takes a lot, as well as OpVectorShuffle.

Now, decorations are guaranteed to be grouped, and are often specified on the same or very similar target IDs. The decoration values themselves are small integers. So: encode the target IDs relative to previously seen declarations, and use varint encoding on everything else (commit).

Vector shuffles also specify several IDs (often close to just-seen ones), and a few small component indices, so they got a similar treatment (commit).

Combined, these took SMOL-V data to 56%, and 14.6% when Zstd-compressed.

I then noticed that the same pattern occurs in a lot of other instructions: the opcode, type and result IDs are often followed by several other IDs (how many depends on the opcode), and some other “usually small integer” values (how many, again, depends on the opcode). So instead of just hardcoding the handling of those several opcodes above, I generalized the code to look this information up in a table indexed by opcode.

After quite a lot more opcodes got this treatment, I was at 42% SMOL-V size, and 10.7% when Zstd-compressed. Not bad!

Negative deltas?

Most of the ID arguments so far I had encoded as deltas from the previous Result ID value. The deltas were always positive, which is nice for varint encoding. However, when I came to adding the same treatment to branch and control flow instructions, I realized that the IDs they reference are often “in the future”, which means the deltas would be negative. Under varint encoding, these look the same as very large positive numbers, and often encode into 4 or 5 bytes.

Luckily, Protocol Buffers have a solution for that: signed integers get their bits shuffled so that small absolute values are turned into small positive values – the ZigZag encoding. So I used that to encode the IDs of control flow instructions.
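
The mapping itself is tiny; here’s a sketch using the same scheme as Protocol Buffers:

// ZigZag: 0,-1,1,-2,2,... map to 0,1,2,3,4,... so that deltas with a small
// absolute value stay small as unsigned numbers, and varint-encode compactly.
static uint32_t smolv_ZigZagEncode(int32_t v)  { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
static int32_t  smolv_ZigZagDecode(uint32_t v) { return (int32_t)(v >> 1) ^ -(int32_t)(v & 1); }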

Opcode value reordering

At this point tweaking just delta+varint encoding was starting to give diminishing returns. So I started looking at bytes again.

That “encode opcode + length as varint” step was often producing 2 or 3 bytes worth of data, due to the way SPIR-V packs that word. I tried reordering things so that the most common opcodes & lengths produce just one byte.

  1. Swap opcode values so that most common ones fit into 4 bits. Most common ones in my shader test data were: Decorate (24%), Load (17%), Store (9%), AccessChain (7%), VectorShuffle (5%), MemberDecorate (4%) etc.
static SpvOp smolv_RemapOp(SpvOp op)
{
#	define _SMOLV_SWAP_OP(op1,op2) if (op==op1) return op2; if (op==op2) return op1
	_SMOLV_SWAP_OP(SpvOpDecorate,SpvOpNop); // 0
	_SMOLV_SWAP_OP(SpvOpLoad,SpvOpUndef); // 1
	_SMOLV_SWAP_OP(SpvOpStore,SpvOpSourceContinued); // 2
	_SMOLV_SWAP_OP(SpvOpAccessChain,SpvOpSource); // 3
	_SMOLV_SWAP_OP(SpvOpVectorShuffle,SpvOpSourceExtension); // 4
	// Name - already small value - 5
	// MemberName - already small value - 6
	_SMOLV_SWAP_OP(SpvOpMemberDecorate,SpvOpString); // 7
	_SMOLV_SWAP_OP(SpvOpLabel,SpvOpLine); // 8
	_SMOLV_SWAP_OP(SpvOpVariable,(SpvOp)9); // 9
	_SMOLV_SWAP_OP(SpvOpFMul,SpvOpExtension); // 10
	_SMOLV_SWAP_OP(SpvOpFAdd,SpvOpExtInstImport); // 11
	// ExtInst - already small enum value - 12
	// VectorShuffleCompact - already small value - used for compact shuffle encoding
	_SMOLV_SWAP_OP(SpvOpTypePointer,SpvOpMemoryModel); // 14
	_SMOLV_SWAP_OP(SpvOpFNegate,SpvOpEntryPoint); // 15
#	undef _SMOLV_SWAP_OP
	return op;
}
  2. Adjust opcode lengths so that the most common ones fit into 3 bits.
// For most compact varint encoding of common instructions, the instruction length
// should come out into 3 bits. SPIR-V instruction lengths are always at least 1,
// and for some other instructions they are guaranteed to be some other minimum
// length. Adjust the length before encoding, and after decoding accordingly.
static uint32_t smolv_EncodeLen(SpvOp op, uint32_t len)
{
	len--;
	if (op == SpvOpVectorShuffle)			len -= 4;
	if (op == SpvOpVectorShuffleCompact)	len -= 4;
	if (op == SpvOpDecorate)				len -= 2;
	if (op == SpvOpLoad)					len -= 3;
	if (op == SpvOpAccessChain)				len -= 3;
	return len;
}
static uint32_t smolv_DecodeLen(SpvOp op, uint32_t len)
{
	len++;
	if (op == SpvOpVectorShuffle)			len += 4;
	if (op == SpvOpVectorShuffleCompact)	len += 4;
	if (op == SpvOpDecorate)				len += 2;
	if (op == SpvOpLoad)					len += 3;
	if (op == SpvOpAccessChain)				len += 3;
	return len;
}
  3. Interleave bits of the original word so that these common ones (opcode + length) take up the lowest seven bits of the result, and encode into just one byte in the varint scheme. 0xLLLLOOOO is how SPIR-V encodes it (L=length, O=op); shuffle it into 0xLLLOOOLO so that the common case (op<16, len<8) is encoded into one byte. A sketch of this shuffle follows below.
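
Here’s roughly what that shuffle could look like, building on smolv_RemapOp and smolv_EncodeLen above (the actual smol-v source may differ in details):

static uint32_t smolv_ShuffleLengthOp(SpvOp op, uint32_t len)
{
	len = smolv_EncodeLen(op, len); // bias length so common instructions get small values
	op = smolv_RemapOp(op);         // swap common opcodes into small values
	// 0xLLLLOOOO -> 0xLLLOOOLO: when op<16 and len<8, everything above the
	// lowest 7 bits is zero, so the word varint-encodes into a single byte.
	return ((len >> 4) << 20) | (((uint32_t)op >> 4) << 8) |
	       ((len & 0xF) << 4) | ((uint32_t)op & 0xF);
}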

That got things down to 35% SMOL-V size, and 9.7% when Zstd-compressed.

Vector Shuffle encoding

SPIR-V has a single opcode, OpVectorShuffle, that is used both for selecting components from two vectors, and for a typical “swizzle”. Swizzles are by far the most common in the shaders I’ve seen, so in raw SPIR-V something like a “v.xxyy” swizzle often ends up as “v, v, 0, 0, 1, 1” – each of these being a full 32 bit word (both arguments point to the same vector, and then the component indices are spelled out).

I made the code recognize this common pattern of “shuffle with <= 4 components, where each is between 0 and 3”, and encode that as a fake “VectorShuffleCompact” opcode using one of the unused opcode values, 13. The swizzle pattern fits into one byte (two bits per channel) instead of taking up 16 bytes (commit).
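
The packing could look something like this (a hypothetical helper, not the exact smol-v code; the component count is recoverable from the encoded instruction length, so only the indices need storing):

static uint8_t smolv_PackSwizzle(const uint32_t* comps, int count) // count <= 4, each comps[i] <= 3
{
	uint8_t res = 0;
	for (int i = 0; i < count; ++i)
		res |= (uint8_t)((comps[i] & 3) << (i * 2)); // two bits per channel
	return res;
}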

Adding non-Unity shaders, and zigzag

At this point I added more shaders to test on, to see how everything above behaves on non-Unity compilation pipeline produced shaders (thanks @baldurk, @AlenL and @basisspace for providing and letting me use shaders from The Talos Principle and DOTA2).

Turns out, both of these games ship with shaders that are already processed with spirv-remap. One thing it does (well, the primary thing it does!) is change all the IDs to not be linearly increasing, but have values all over the place. My previous work on delta encoding and varint output was often going against that: the next ID would frequently be smaller than the previous one, resulting in a negative delta, which encodes into 4 or 5 bytes under varint. Not good!

Well, it wasn’t bad; here is SMOL-V that not only compresses, but also strips debug info, to match what spirv-remap did in the Talos/DOTA2 case:

  • Unity: remap+zstd 13.0%, SMOL-V+zstd 7.2%.
  • Talos: remap+zstd 11.1%, SMOL-V+zstd 9.0%.
  • DOTA2: remap+zstd 9.9%, SMOL-V+zstd 8.4%.

It already compresses better than spirv-remap, but does even better on shaders that aren’t already remapped.

I switched all the deltas to zigzag encoding (see “Negative deltas?” above), so that on already-remapped shaders it does not run into the “whoops, encoded into 5 bytes” case:

  • Unity: remap+zstd 13.0%, SMOL-V+zstd 7.3% (a tiny bit worse than 7.2% before).
  • Talos: remap+zstd 11.1%, SMOL-V+zstd 8.5% (yay, was 9.0% before).
  • DOTA2: remap+zstd 9.9%, SMOL-V+zstd 8.2% (small yay, was 8.4% before).

MemberDecorate encoding

Structure/buffer decorations (OpMemberDecorate) were taking up quite a bit of space, so I looked for some patterns in them.

Most often they are very simple sequences, e.g.

Op             Type Member Decoration Extra
MemberDecorate 168  0      35           0 
MemberDecorate 168  1      35          64 
MemberDecorate 168  2      35          80 
MemberDecorate 168  3      35          96 
MemberDecorate 168  4      35         112 
MemberDecorate 168  5      35         128 
MemberDecorate 168  6       0 
MemberDecorate 168  6      35         384 
MemberDecorate 168  7      35         400 

When encoding, I scan ahead to see whether there’s a sequence of MemberDecorate instructions that all decorate the same type, and “fold” them into one – so I can skip writing out the opcode+length and type ID for each. Additionally, I delta-encode the member index, and give special handling to decoration 35 (“Offset”, which is extremely common) so the actual offset is stored as a delta from the previous one. This got some gains (commit).
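
The scan-ahead could look roughly like this (a hypothetical sketch, not the actual smol-v code):

// Count how many upcoming instructions are OpMemberDecorate on the same type
// <id>; these can be folded into one record that writes the opcode+length
// and the type ID just once.
static int smolv_CountMemberDecorates(const uint32_t* words, const uint32_t* wordsEnd, uint32_t typeId)
{
	int count = 0;
	while (words < wordsEnd)
	{
		uint32_t instrLen = words[0] >> 16;     // high 16 bits: word count
		SpvOp op = (SpvOp)(words[0] & 0xFFFF);  // low 16 bits: opcode
		if (op != SpvOpMemberDecorate || instrLen < 4 || words[1] != typeId)
			break;
		++count;
		words += instrLen;
	}
	return count;
}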

Quite likely OpDecorate sequences could get a similar treatment, but I did not do that yet.

Current status

So that’s about it! Current compression numbers, on a set of Unity+Talos+DOTA2 shaders, with debug info stripping:

Compression        No filter (*)       spirv-remap         SMOL-V
                   Size KB    Ratio    Size KB    Ratio    Size KB    Ratio
Uncompressed       3725.4    100.0%    3560.0     95.6%    1297.6     34.8%
zlib default        860.6     23.1%     761.9     20.5%     464.9     12.5%
LZ4HC default       884.4     23.7%     743.3     20.0%     441.0     11.8%
Zstd default        555.4     14.9%     425.6     11.4%     295.5      7.9%
Zstd level 20       339.4      9.1%     260.5      7.0%     226.7      6.1%

(*) Note: about two thirds of the shader set (Talos & DOTA2) was already processed by spirv-remap; I don’t have unprocessed shaders from these games. This does make spirv-remap look a bit worse than it actually is.

I think it’s not too bad for a couple days of work. And I have learned a thing or two about compression. Again, the github repository is here: github.com/aras-p/smol-v.

  • Encoding does a simple one-pass scan over the input (with occasional look aheads for MemberDecorate sequences), and writes encoded result to the output.
  • Decoding simply goes over encoded bytes and transforms into SPIR-V. One pass over data, no memory allocations.
  • No “altering” of SPIR-V programs is done; what you encode is exactly what you get after decoding (this is different from spirv-remap, which actually changes the IDs). The exception is the kEncodeFlagStripDebugInfo flag, which removes debug information from the input program.

Future Work?

Not sure I will work on this much (as opposed to “eh, good enough for now”), but possible future work might be:

  • Someone who actually knows about compression will look at it, and point out low hanging fruits :)
  • Do special encoding of some more opcodes (OpDecorate comes to mind).
  • Split up the encoded data into several “streams” for better compression (e.g. lengths, opcodes, types, results, etc.). Very similar to the “Split-stream encoding” from the .kkrunchy blog post.
  • As John points out, there are other possible axes to explore compression.

This was super fun. I highly recommend “short, useful, and you get to learn something” projects :)


More Hash Function Tests

In the previous post, I wrote about non-crypto hash functions, and did some performance tests. Turns out, it’s great to write about stuff! People in the comments/twitter/internets pointed out more things I should test. So here’s a follow-up post.

What is not the focus

This is about algorithms for “hashing some amount of bytes”, for use either in hashtables or for checksum / uniqueness detection. Depending on your situation, there’s a whole family of algorithms for that, and I am focusing on only one: non-cryptographic fast hash functions.

  • This post is not about cryptographic hashes. Do not read below if you need to hash passwords, sensitive data going through untrusted medium and so on. Use SHA-1, SHA-2, BLAKE2 and friends.
  • Also, I’m not focusing on algorithms that are designed to prevent possible hashtable Denial-of-Service attacks. If something comes from the other side of the internet and ends up inserted into your hashtable, then to prevent possible worst-case O(N) hashtable behavior you’re probably better off using a hash function that does not have known “hash flooding” attacks. SipHash seems to be popular for that now.
  • If you are hashing very small amounts of data of known size (e.g. single integers or two floats or whatever), you should probably use specialized hashing algorithms for those. Here are some integer hash functions, or 2D hashing with Weyl, or perhaps you could take some other algorithm and just specialize its code for your known input size (e.g. xxHash for a single integer). One well-known example is sketched right after this list.
  • I am testing 32 and 64 bit hash functions here. If you need larger hashes, quite likely some of these functions might be suitable (e.g. SpookyV2 always produces 128 bit hash).
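
For illustration, here’s one well-known integer mixing function of that sort (Thomas Wang’s 32 bit hash; a specialized example, not something from my testbed):

static uint32_t WangHash(uint32_t seed)
{
	// a handful of multiply/shift/xor rounds, tuned specifically for a single 32 bit input
	seed = (seed ^ 61) ^ (seed >> 16);
	seed *= 9;
	seed = seed ^ (seed >> 4);
	seed *= 0x27d4eb2d;
	seed = seed ^ (seed >> 15);
	return seed;
}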

When testing hash functions, I have not gone to great lengths to get them compiling properly or setting up all the magic flags on all my platforms. If some hash function works wonderfully when compiled on Linux Itanium box with an Intel compiler, that’s great for you, but if it performs poorly on the compilers I happen to use, I will not sing praises for it. Being in the games industry, I care about things like “what happens in Visual Studio”, and “what happens on iOS”, and “what happens on PS4”.

More hash function tests!

I’ve updated my hash testbed on github (use revision 9b59c91cf) to include more algorithms, changed tests a little, etc etc.

I checked both “hashing data that is aligned” (16-byte aligned address of data to hash), and unaligned data. Everywhere I tested, there wasn’t a notable performance difference that I could find (but then, I have not tested old ARM CPUs or PowerPC based ones). The only visible effect is that MurmurHash and SpookyHash don’t properly work in asm.js / Emscripten compilation targets, due to their usage of unaligned reads. I’d assume they probably don’t work on some ARM/PowerPC platforms too.

Hash functions tested below:

These are the main functions that are interesting. Because people kept on asking, and because “why not”, I’ve included a bunch of others too:

  • SipRef - SipHash-2-4 reference implementation. Like mentioned before, this one is supposedly good for avoiding hash flooding attacks.
  • MD5-32, SHA1-32, CRC32 - simple implementations of well-known hash functions (from the SMHasher test suite). Again, these are mostly not in the category I’m interested in, but included to illustrate the performance differences.
  • FNV-1a, FNV-1amod - FNV hash and a modified version of it.
  • djb2, SDBM - both from this site.

Performance Results

Windows / Mac

macOS results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build. Late 2013 MacBookPro:

Windows results, compiled with Visual Studio 2015 in Release 64 bit build. Core i7-5820K:

Notes:

  • Performance profiles of these are very similar.
  • xxHash64 wins at larger (0.5KB+) data sizes, followed closely by 64 bit FarmHash and CityHash.
  • At smaller data sizes (10-500 bytes), FarmHash and CityHash look very good!
  • mum-hash is much slower on Visual Studio. At first I thought it was the _MUM_UNALIGNED_ACCESS macro not handling the VS-specific defines (_M_AMD64 and _M_IX86) properly (see PR). Turns out it’s not; the speed difference comes from _MUM_USE_INT128, which only kicks in on GCC-based compilers. mum-hash would be pretty competitive otherwise.

Windows 32 bit results, compiled with Visual Studio 2015 in Release 32 bit build. Core i7-5820K:

On a 32 bit platform / compilation target, things change quite a bit!

  • All 64 bit hash functions are slow. For example, xxHash64 drops from 13GB/s to 1GB/s. Use a 32 bit hash function when on a 32 bit target!
  • xxHash32 wins by a good amount.

Note on FarmHash – the whole idea behind it is that it uses CPU-specific optimizations (that also change the computed hash value). The graphs above are using default compilation settings on both macOS and Windows. However, on macOS enabling SSE4.2 instructions in Xcode settings makes it much faster at larger data sizes:

With SSE4.2 enabled, FarmHash64 handily wins against xxHash64 (17.9GB/s vs 12.8GB/s) for data larger than 1KB. However, that requires compiling with SSE4.2, which at my work I can’t afford. Enabling the same options on XboxOne actually makes it slower :( Enabling just FARMHASH_ASSUME_AESNI makes the 32 bit FarmHash faster on XboxOne, but does not affect the performance of the 64 bit hash. FarmHash also does not have any specific optimizations for ARM CPUs, so my verdict on it all is “not worth bothering” – after all, the impressive SSE4.2 speedup only shows up at large data sizes.

Mobile

iPhone SE (A9 CPU) results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build:

  • xxHash never wins here; CityHash and FarmHash are fastest across the whole range.

iPhone SE 32 bit results:

This is similar to Windows 32 bit: 64 bit hash functions are slow, xxHash32 wins by a good amount.

Console

Xbox One (AMD Jaguar 1.75GHz CPU) results, compiled with Visual Studio 2015 in Release mode:

  • Similar to iPhone results, xxHash is quite a bit slower than CityHash and FarmHash. xxHash uses 64 bit multiplications heavily, whereas others mostly do shifts and logic ops.
  • SpookyHash wins at larger data sizes.

JavaScript

JavaScript (asm.js via Emscripten) results, running on late 2013 MacBookPro.

  • asm.js is in all practical sense a 32 bit compilation target; 64 bit integer operations are slow.
  • xxHash32 wins, followed by FarmHash32.
  • At smaller (<25 bytes) data sizes, simple hashes like FNV-1a, SDBM or djb2 are useful.

Throughput tables

Performance at large data sizes (~4KB), in GB/s:

                 ------------ 64 bit ------------   ----------------- 32 bit -----------------
Hash             Win    Mac    iPhoneSE  XboxOne    Win    iPhoneSE  asm.js FF  asm.js Chrome
xxHash64         13.2   12.8     5.7       1.5       1.1     1.5       0.3         0.1
City64           12.2   12.2     7.2       3.6       1.6     1.9       0.4         0.1
Mum               4.0    9.5     4.5       0.5       0.7     1.3       0.1         0.1
Farm64           12.3   11.9     8.2       3.3       2.4     2.2       0.4         0.1
SpookyV2-64      12.8   12.5     7.1       4.3       2.6     1.9       --          --
xxHash32          6.8    6.6     4.0       1.5       6.7     3.7       2.5         1.4
Murmur3-X64-64    7.1    5.8     2.3       1.2       0.9     0.8       --          --
Murmur2A          3.4    3.3     1.7       0.9       3.4     1.8       --          --
Murmur3-32        3.1    2.7     1.3       0.5       3.1     1.3       --          --
City32            5.1    4.9     2.6       0.9       4.0     2.6       1.1         0.8
Farm32            5.2    4.3     2.6       1.8       4.3     1.9       2.1         1.4
SipRef            1.4    1.6     1.0       0.4       0.3     0.4       0.1         0.0
CRC32             0.5    0.5     0.2       0.2       0.4     0.2       0.4         0.4
MD5-32            0.5    0.3     0.2       0.2       0.4     0.3       0.4         0.4
SHA1-32           0.5    0.5     0.3       0.1       0.4     0.2       0.3         0.2
FNV-1a            0.9    0.8     0.4       0.3       0.9     0.4       0.8         0.8
FNV-1amod         0.9    0.8     0.4       0.3       0.9     0.4       0.8         0.7
djb2              0.9    0.8     0.6       0.4       1.1     0.6       0.8         0.8
SDBM              0.9    0.8     0.4       0.3       0.8     0.4       0.8         0.8

Performance at medium size (128 byte) data, in GB/s:

                 ------------ 64 bit ------------   ----------------- 32 bit -----------------
Hash             Win    Mac    iPhoneSE  XboxOne    Win    iPhoneSE  asm.js FF  asm.js Chrome
xxHash64          6.6    6.2     2.8       0.9       0.7     0.7       0.2         0.1
City64            7.6    7.6     4.4       1.7       1.1     1.5       0.3         0.1
Mum               3.2    7.6     4.1       0.5       0.6     1.1       0.1         0.0
Farm64            6.6    5.7     3.4       1.4       0.9     1.1       0.3         0.1
SpookyV2-64       3.4    3.2     1.7       1.4       0.7     0.5       --          --
xxHash32          5.1    5.3     3.4       1.3       5.1     2.8       2.2         1.4
Murmur3-X64-64    5.3    4.3     2.1       1.0       0.8     0.7       --          --
Murmur2A          3.6    3.0     2.1       0.8       3.3     1.7       --          --
Murmur3-32        3.1    2.3     1.3       0.4       2.8     1.3       --          --
City32            4.0    3.6     2.0       0.7       3.3     1.9       1.0         0.8
Farm32            3.9    3.2     1.9       1.0       3.4     1.6       1.6         1.1
SipRef            1.2    1.3     0.8       0.4       0.3     0.3       0.1         0.0
CRC32             0.5    0.5     0.2       0.2       0.4     0.2       0.4         0.4
MD5-32            0.3    0.2     0.2       0.1       0.3     0.2       0.2         0.2
SHA1-32           0.2    0.2     0.1       0.1       0.1     0.1       0.1         0.1
FNV-1a            0.9    1.0     0.5       0.3       0.9     0.5       0.9         0.9
FNV-1amod         0.9    0.9     0.5       0.3       0.9     0.5       0.9         0.8
djb2              1.0    0.9     0.7       0.4       1.1     0.7       0.9         0.8
SDBM              0.9    0.9     0.5       0.3       0.9     0.5       0.9         0.8

Performance at small size (17 byte) data, in GB/s:

                 ------------ 64 bit ------------   ----------------- 32 bit -----------------
Hash             Win    Mac    iPhoneSE  XboxOne    Win    iPhoneSE  asm.js FF  asm.js Chrome
xxHash64          2.1    1.8     0.5       0.5       0.4     0.4       0.1         0.0
City64            3.4    3.0     1.5       0.7       0.5     0.7       0.2         0.0
Mum               1.2    2.6     0.9       0.2       0.3     0.5       0.1         0.0
Farm64            3.6    2.6     1.2       0.7       0.6     0.6       0.1         0.0
SpookyV2-64       1.2    1.0     0.6       0.4       0.2     0.1       --          --
xxHash32          2.2    2.0     0.7       0.5       1.9     0.8       1.2         0.8
Murmur3-X64-64    1.7    1.3     0.5       0.4       0.3     0.3       --          --
Murmur2A          2.4    1.8     1.1       0.5       2.1     1.0       --          --
Murmur3-32        2.1    1.5     0.8       0.4       1.8     0.8       --          --
City32            2.1    1.9     0.9       0.5       1.7     0.7       0.8         0.7
Farm32            2.5    2.0     0.8       0.5       1.8     0.9       0.9         0.5
SipRef            0.6    0.6     0.3       0.2       0.2     0.1       0.0         0.0
CRC32             0.8    0.7     0.4       0.2       0.6     0.3       0.5         0.4
MD5-32            0.1    0.1     0.0       0.0       0.1     0.0       0.1         0.0
SHA1-32           0.0    0.0     0.0       0.0       0.0     0.0       0.0         0.0
FNV-1a            1.3    1.5     1.0       0.3       1.2     0.7       1.4         1.0
FNV-1amod         1.1    1.4     0.9       0.3       1.0     0.7       1.3         0.9
djb2              1.6    1.6     1.1       0.4       1.1     0.7       1.3         0.9
SDBM              1.4    1.3     1.1       0.4       1.2     0.7       1.5         1.0

A note on hash quality

As far as I’m concerned, all the 64 bit hashes are excellent quality.

Most of the 32 bit hashes are pretty good too on the data sets I tested.

SDBM produces more collisions than others on binary-like data (various struct memory dumps, 5742 entries, average length 55 bytes – SDBM had 64 collisions, all the other hashes had zero). You could have a way worse hash function than SDBM of course, but then functions like FNV-1a are about as simple to implement, and seem to behave better on binary data.

A note on hash consistency

Some of the hash functions do not produce identical output on various platforms. Among the ones I tested, mum-hash and FarmHash produced different output depending on compiler and platform used.

It’s very likely that most of the above hash functions produce different output if ran on a big-endian CPU. I did not have any platform like that to test on.

Some of the hash functions depend on unaligned memory reads being possible – most notably Murmur and Spooky. I had to change Spooky to work on ARM 32 bit (define ALLOW_UNALIGNED_READS to zero in the source code). Murmur and Spooky did produce wrong results on asm.js (no crash, just different hash results and way more collisions than expected).
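
For reference, the usual portable workaround for the unaligned-read problem (a general sketch, not what these particular libraries do): perform the load through memcpy, which compilers turn into a single load instruction on platforms where that is safe:

static uint32_t ReadUnaligned32(const void* p)
{
	uint32_t v;
	memcpy(&v, p, sizeof(v)); // no alignment assumptions; optimizes to a plain load on x86/ARM64
	return v;
}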

Conclusions

General cross-platform use: CityHash64 on a 64 bit system; xxHash32 on a 32 bit system.

  • Good performance across various data sizes.
  • Identical hash results across various little-endian platforms and compilers.
  • No weird compilation options to tweak, and no performance drops on compilers it is not tuned for.

Best throughput on large data: depends on platform!

  • Intel CPU: xxHash64 in general, FarmHash64 if you can use SSE4.2, xxHash32 if compiling for 32 bit.
  • Apple mobile CPU (A9): FarmHash64 for 64 bit, xxHash32 for 32 bit.
  • Console (XboxOne, AMD Jaguar): SpookyV2.
  • asm.js: xxHash32.

Best for short strings: FNV-1a.

  • Super simple to implement, inline-able.
  • Where exactly it becomes “generally fastest” depends on the platform: around 8 bytes or less on PC, mobile and console; around 20 bytes or less on asm.js.
  • If your data is fixed size (e.g. one or two integers), look into specialized hashes instead (see above).

Hash Functions all the way down

Update! I tested more hash functions in a follow-up post. See More Hash Function Tests.

A while ago I needed a fast hash function for ~32 byte keys. We already had MurmurHash used in a bunch of places, so I started with that. But then I tried xxHash and it was a bit faster! So I dropped xxHash into the codebase, landed the thing to mainline, and promptly left for vacation, with a mental note of “someday I should look into other hash functions, or at least move them all under a single folder”.

So that’s what I did: “hey I’ll move source code of MurmurHash, SpookyHash and xxHash under a single place”. But that quickly spiraled out of control:

The things I found! Here’s what you find in a decent-sized codebase, with many people working on it:

  • Most places use a decent hash function – in our case Murmur2A for 32/64 bit hashes, and SpookyV2 for 128 bit hashes. That’s not a bad place to be in!
  • Murmur hash takes seed as input, and naturally almost all places in code copy-pasted the same random hex value as the seed :)
  • There are at least several copies of either FNV or djb2 hash function implementations scattered around, used in random places.
  • Some code uses really, REALLY bad hash functions. There’s even a comment above it, added a number of years ago, when someone found out it’s bad – however they only stopped using it in their own place, and did not replace other usages. Life always takes the path of least resistance :) Thankfully, the places where said hash function was used were nothing “critical”.
  • While 95% of the code that uses non-cryptographic hash functions uses them strictly for runtime reasons (i.e. it doesn’t actually care about the exact value of the hash), there are some pieces that hash something and serialize the hashed value. Each of these needs to be looked at in detail to see whether you can easily switch to a new hash function.
  • Some other hash related code (specifically, a struct we have for holding a 128 bit hash value, Hash128) was written in a funky way, ages ago. And of course some of our code got it wrong (thankfully, all either debug code, or test mocking code, or something non-critical). Long story short, do not have struct constructors like this: Hash128(const char* str, int len=16)!
    • Someone will think this accepts a string to hash, not “bytes of the hash value”.
    • Someone will pass "foo" into it, and not provide length argument, leading to out-of-bounds reads.
    • Some code will accidentally pass something like 0 to a function that accepts a Hash128, and because C++ is C++, this will get turned into a Hash128(NULL, 16) constructor, and hilarity will ensue.
    • Lesson: be careful with implicit constructors (use explicit). Be careful with default arguments. Don’t set types to const char* unless it’s really a string. A sketch of the failure modes follows below.
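
Here Hash128 is a simplified stand-in, not our actual struct:

struct Hash128
{
	// Bad: implicit converting constructor, default length, const char* parameter.
	Hash128(const char* bytes, int len = 16) { /* copies 'len' raw bytes of a hash value */ }
};
void TakesHash(const Hash128& h);

void Examples()
{
	TakesHash(Hash128("foo")); // looks like "hash this string"; actually reads 16 bytes from a 4-byte literal
	TakesHash(0);              // literal 0 becomes (const char*)NULL -> Hash128(NULL, 16); hilarity ensues
	// The fix: make the constructor 'explicit' and drop the default argument.
}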

So what started out as “move some files” branch, ended up being a “move files, switch most of hash functions, remove some bad hash functions, change some code, fix some wrong usages, add tests and comments” type of thing. It’s a rabbit hole of hash functions, all the way down!

Anyway.

Hash Functions. Which one to use?

MurmurHash got quite popular, at least in game developer circles, as a “general hash function”. My quick twitter poll seems to reflect that:

It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!

Sometimes you don’t know much about the data, for example if you’re hashing arbitrary files, or some user input that could be “pretty much anything”.

So let’s take a look at general purpose hash functions. There’s plenty of good tests on the internets (e.g. Hash functions: An empirical comparison), but I wanted to make my own little tests. Because why not! Here’s my randomly mashed together little testbed (use revision 4d535b).

Throughput

Here are the results of various hash functions, hashing data of different lengths, with performance in MB/s:

This was tested on late 2013 MacBookPro (Core i7-4850HQ 2.3GHz), Xcode 7.3.1 release 64 bit build.

  • xxHash in 32 and 64 bit variants, as well as “use 64 bit, take lowest 32 bits of result” one.
  • SpookyHash V2, the 128 bit variant, only taking 64 lowest bits.
  • Murmur, a couple variants of it.
  • CRC32, FNV and djb2, as I found them in our own codebase. I did not actually check whether they are proper implementations or somehow tweaked! Their source is at the testbed, revision 4d535b.

In terms of throughput at not-too-small data sizes (larger than 10-20 bytes), xxHash is the king. If you need 128 bit hashes, SpookyHash is very good too.

What about small keys?

Good question! The throughput of XXH64 is achieved by carefully exploiting the instruction-level parallelism of modern CPUs. It has a main loop that processes 32 bytes at a time, with four independent hashing rounds. It looks something like this:

// rough outline of XXH64:
// ... setup code
do {
    v1 = XXH64_round(v1, XXH_get64bits(p)); p+=8;
    v2 = XXH64_round(v2, XXH_get64bits(p)); p+=8;
    v3 = XXH64_round(v3, XXH_get64bits(p)); p+=8;
    v4 = XXH64_round(v4, XXH_get64bits(p)); p+=8;
} while (p<=limit);
// ... merge v1..v4 values
// ... do leftover bit that is not multiple of 32 bytes

That way, even if it looks like it’s “doing more work”, it ends up being faster than super simple algorithms like FNV, which work on one byte at a time, with each and every operation depending on the result of the previous one:

// rough outline of FNV:
while (*c)
{
	hash = (hash ^ *c++) * p;
}

However, xxHash has all this “prologue” and “epilogue” code around the main loop that handles either non-aligned data, or leftovers from data that isn’t a multiple of 32 bytes in size. That adds some branches, and for short keys it does not even get to that smart loop!

That can be seen in the graph above: e.g. xxHash32 (which has a core loop of 16 bytes at once) is faster at key sizes < 100 bytes. Let’s zoom in at even smaller data sizes:

Here (data sizes < 10 bytes) we can see that the ILP smartness of the other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 wins in performance. I’m picking FNV as the winner here, because in my tests djb2 had somewhat more collisions (see below).

What about other platforms?

PC/Windows: tests I did on Windows (Core i7-5820K 3.3GHz, Win10, Visual Studio 2015 release 64 bit) follow roughly the same pattern as the results on a Mac. Nothing too surprising here.

Consoles: did a quick test on XboxOne. The results are surprising in two ways: 1) oh geez, the console CPUs are slow (well ok, that’s not too surprising), and 2) xxHash is not that awesome here. It’s still decent, but xxHash64 has consistently worse performance than xxHash32, and for larger keys SpookyHash beats them all. Maybe I need to tweak some settings, or poke Yann to look at it? Adding a mental note to do that later…

Mobile: tested on iPhone 6 in 64 bit build. Results not too surprising, except again, unlike PC/Mac with an Intel CPU, xxHash64 is not massively better than everything else – SpookyHash is really good on ARM64 too.

JavaScript! :) Because it was easy, I also compiled that code into asm.js via Emscripten. Overall the patterns are similar, except the >32 bit hashes (xxHash64, SpookyV2) – these are much slower. This is expected, both xxHash64 and Spooky are specifically designed either for 64 bit CPUs, or when you absolutely need a >32 bit hash. If you’re on 32 bit, use xxHash32 or Murmur!

What about hash quality?

SMHasher seems to be a de-facto hash function testing suite (see also: a fork that includes more modern hash functions).

For a layman test, I tested several things on data sets I cared about:

  • “Words” - just a dump of English words (/usr/share/dict/words). 235886 entries, 2.2MB total size, average length 9.6 bytes.
  • “Filenames” - a dump of file paths (from a Unity source tree tests folder). 80297 entries, 4.3MB total size, average length 56.4 bytes.
  • “Source” - a partial dump of source files from the Unity source tree. 6069 entries, 43.7MB total size, average length 7547 bytes.

First let’s see how many collisions we’d get on these data sets, if we used the hash function for uniqueness/checksum type of checking. Lower numbers are better (0 is ideal):

Hash           Words collis.   Filenames collis.   Source collis.
xxHash32             6                 0                  0
xxHash64-32          7                 0                  0
xxHash64             0                 0                  0
SpookyV2-64          0                 0                  0
Murmur2A            11                 0                  0
Murmur3-32           3                 1                  0
CRC32                5                 2                  0
FNV                  5                 1                  0
djb2                10                 1                  0
ZOMGBadHash        998               842                  0

ZOMGBadHash is that fairly bad hash function I found, as mentioned above. It’s not fast either, and look at that number of collisions! Here’s what it looked like:

unsigned CStringHash(const char* key)
{
	unsigned h = 0;
	const unsigned sr = 8 * sizeof (unsigned) - 8;
	const unsigned mask = 0xFu << (sr + 4);
	while (*key != '\0')
	{
		h = (h << 4) + *key;
		std::size_t g = h & mask;
		h ^= g | (g >> sr);
		key++;
	}
	return h;
}

I guess someone thought a random jumble of shifts and xors is gonna make a good hash function, or something. And threw in mixed 32 vs 64 bit calculations too, for good measure. Do not do this! Hash functions are not just random bit operations.

As another measure of “hash quality”, let’s imagine we use the hash functions in a hashtable: a typical hashtable with a load factor of 0.8 that always has a power-of-two number of buckets (i.e. something like bucketCount = NextPowerOfTwo(dataSetSize / 0.8)). If we put the above data sets into this hashtable, how many entries would we have per bucket, on average? Lower numbers are better (1.0 is ideal):

Hash           Words avg bucket   Filenames avg bucket   Source avg bucket
xxHash32            1.241               1.338                 1.422
xxHash64-32         1.240               1.335                 1.430
xxHash64            1.240               1.335                 1.430
SpookyV2-64         1.241               1.336                 1.434
Murmur2A            1.239               1.340                 1.430
Murmur3-32          1.242               1.340                 1.427
CRC32               1.240               1.338                 1.421
FNV                 1.243               1.339                 1.415
djb2                1.242               1.339                 1.414
ZOMGBadHash         1.633               2.363                 7.260

Here all the hash functions are very similar, except the ZOMGBadHash which is, as expected, doing not that well.
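
For reference, here’s a sketch of how such a measurement can be done (assuming some hash function under test and a list of keys; this averages over occupied buckets, so 1.0 means every used bucket holds exactly one entry):

#include <stddef.h>
#include <stdint.h>
#include <string>
#include <vector>

static uint32_t NextPowerOfTwo(uint32_t v)
{
	v--;
	v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
	return v + 1;
}

static double AverageBucketLoad(const std::vector<std::string>& keys,
	uint32_t (*hashFn)(const void* data, size_t size))
{
	const uint32_t bucketCount = NextPowerOfTwo((uint32_t)(keys.size() / 0.8));
	std::vector<uint32_t> buckets(bucketCount, 0);
	for (const std::string& k : keys)
		buckets[hashFn(k.data(), k.size()) & (bucketCount - 1)]++; // power of two: mask instead of modulo
	uint32_t occupied = 0;
	for (uint32_t c : buckets)
		if (c != 0)
			occupied++;
	return (double)keys.size() / occupied; // average entries per occupied bucket
}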

TODO

I did not test some of the new-ish hash functions (CityHash, MetroHash, FarmHash). Did not test hash functions that use CPU specific instructions either (for example variants of FarmHash can use CRC32 instruction that’s added in SSE4.2, etc.). That would be for some future time.

Conclusion

  • xxHash64 is really good, especially if you’re on a 64 bit Intel CPU.
  • If you need 128 bit keys, use SpookyHash. It’s also really good if you’re on a non-Intel 64 bit CPU (as shown by XboxOne - AMD and iPhone6 - ARM throughput tests).
  • If you need a 32 bit hash, and are on a 32 bit CPU/system, do not use xxHash64 or SpookyHash! Their 64 bit math is costly when on 32 bit; use xxHash32 or Murmur.
  • For short data/strings, the simplicity of FNV or djb2 is hard to beat; they are very performant on short data as well.
  • Do not throw in random bit operations and call that a hash function. Hash function quality is important, and there’s plenty of good (and fast!) hash functions around.

Note: I tested more hash functions in a follow-up post. See More Hash Function Tests.


Adam Demo production talk at CGEvent

A while ago I delivered a talk at CG EVENT 2016 in Vilnius, about some production details of Unity’s Adam demo.

Clarification point! I did nothing at all for the demo production, just did the talk pretending to be @__ReJ__. Also, there are more blog posts about the demo production on Unity blog, e.g. Adam - Production Design for Realtime Short Film by @_calader. All credit goes to the demo team, I only delivered the message :)

Here are the talk slides (34MB pdf).

Some of them were meant to be videos, so here they are, individually:

Slide 41, rough animatic:

Slide 46, previz:

Slide 47, previz with lighting & camera:

Slide 48, early Neuron mocap:

Slide 49, previz with shading & mocap:

Slide 50, previz with postprocessing:

Slide 52, Vicon mocap:

Slide 59, WIP animations & fracture:

Slide 60, eye animation:

Slide 68, area lights:

Slide 71, translucency shader:

Slide 73, crowd simulation: