A look at 2016, and onto 2017

I don’t have any technical insights to share right now, so you’ll have to bear with me blabbering about random stuff!

Unity Work

By now it’s been 11 years at Unity (see 10 Years at Unity a year ago), and the last year has been interesting. Started out as written in that blog post – that I will be mostly doing plumbing & fixing and stuff.

But then at some point we had a mini-hackweek of “what should the future rendering pipeline of Unity should be?”, and it went quite well. So a team with tasks of “build building blocks for future rendering pipelines” and “modernize low level graphics code” was built, see more on Scriptable Render Loops google doc and corresponding github project. I’ve been doing mostly, well, plumbing there, so nothing new on that front :) – but overall this is very exciting, and if done well could be a big improvement in how Unity’s graphics engine works & what people can do with it.

I’ve also been a “lead” of this low level graphics team (“graphics foundation team” as we call it), and (again) realized that I’m the worst lead the world has ever seen. To my fellow colleagues: sorry about that (シ. .)シ So I stopped being a lead, and that is now in the capable hands of @stramit.

Each year I write a “summary of stuff I’ve done”, mostly for myself to see whether my expectations/plans matched reality. This year a big part in the mismatch has been because at start of year I did not know we’d go big into “let’s do scriptable render loops!”, otherwise it went as expected.

Day to day it feels a bit like “yeah, another day of plumbing”, but the summary does seem to indicate that I managed to get a decent amount of feature/improvement work done too. Nothing that would set the world on fire, but not bad either! Though thinking about it, after 2016, maybe the world could do with less things that set it on fire…

My current plan for next year is to continue working on scriptable render loops and whatever low level things they end up needing. But also perhaps try some new area for a bit, something where I’m a newbie and would have to learn some new things. We’ll see!

Oh, and extrapolating from the trend, I should effectively remove close to half a million lines of code this year. My fellow colleagues: brace yourselves :)

Open Source

My github contributions are not terribly impressive, but I managed to do better than in 2015.

  • Primarily SMOL-V, a SPIR-V shader compressor for Vulkan. It’s used in Unity now, and as far as I know, also used in two non-Unity game productions. If you use it (or tried it, but it did not work or whatever), I’d be happy to know!
  • Small amount of things in GLSL Optimizer and HLSL2GLSL, but with Unity itself mostly moving to a different shader compiler toolchain (HLSLcc at the moment), work there has been winding down. I did not have time or will to go through the PRs & issues there, sorry about that.
  • A bunch of random tiny fixes or tweaks submitted to other github projects, nothing of note really.


I keep on hearing that I should write more, but most of the time don’t feel like I have anything interesting to say. So I keep on posting random lame jokes and stuff on twitter instead.

Nevertheless, the amount of blog posts has been very slowly rising. The drop starting around year 2009 very much coincides with me getting onto twitter. Will try to write more in 2017, but we’ll see how that goes. If you have ideas on what I should write about, let me know!

The amount of readers on the website is basically constant through all the time that I had analytics on it (starting 2013 or so). Which could mean it’s only web crawling bots that read it, for all I know :) There’s an occasional spike, pretty much only because someone posted link on either reddit or hackernews, and the hive mind there decided to pay attention to it for five minutes or whatever.

Giving Back

Financially, I live a very comfortable life now, and that is by no means just an “I made it all by myself” thing. Starting conditions, circumstances, family support, a ton of luck all helped massively.

I try to pay some of that back by sharing knowledge (writing the blog, speaking, answering questions on twitter and ask.fm). This is something, but I could do more!

So this year started doing more of a “throw money towards good causes” thing too, compared to what I did before. Helped some local schools, and several local charities / help funds / patreons / “this is a good project” type of efforts. I’m super lucky that I can afford to do that, looking forward to doing more of it in 2017.

Started work towards installing solar panels on the roof, which should generate about as much electricity as we use up. That’s not exactly under “giving back” section, but feels good to have finally decided to do it. Will see how that goes in our (not that sunny) land.

Other Things

I’m not a twenty-something anymore, and turns out I have to do some sort of extra physical activity besides working with a computer all day long. But OMG it’s so boring, so it takes a massive effort to force myself to do it. I actually stopped going to the gym for half a year, but then my back pains said “ohai! long time no see!”. Ugggh. So towards end of 2016 started doing the gym thing again. I can only force myself to go there twice a week, but that’s enough to keep me from collapsing physically :) So far so good, but still boring.

Rocksmith keeps on being the only “hobby” that I have, and still having loads of fun with it. I probably have played about 200 hours this year.

It feels like I am still improving my guitar playing skills, but I’m not doing enough effort to actually improve. Most of the time I just play through songs in a messy way, without taking time to go through difficult riffs or solos and really nail the things down. So my playing is like: eh, ¯\_(ツ)_/¯. Maybe in 2017 I should try playing with some other people (something I have never done, besides a few very lame attempts in high school), or try taking “real” guitar lessons.

INSIDE and Firewatch are my games of the year. They are both made with Unity too, which does not feel bad at all.

Did family vacations, lazy one in Maldives and a road trip in Spain. Both were nice in their own way, but we all (again) realized that lazy-type vacations are not our style, even if they are nice to have once in a while.

In 2017 I want to spend less time on social media, which lately is mostly a source of really depressing news, and just do more useful or helpful things instead. Wish me luck.

Amazing Optimizers, or Compile Time Tests

I wrote some tests to verify sorting/batching behavior in rendering code, and they were producing different results on Windows (MSVC) vs Mac (clang). The tests were creating a “random fake scene” with a random number generator, and at first it sounded like our “get random normalized float” function was returning slightly different results between platforms (which would be super weird, as in how come no one noticed this before?!).

So I started digging into random number generator, and the unit tests it has. This is what amazed me.

Here’s one of the unit tests (we use a custom native test framework that started years ago on an old version of UnitTest++):

TEST (Random01_WithSeed_RestoredStateGenerateSameNumbers)
	Rand r(1234);
	RandState oldState = r.GetState();
	float prev = Random01(r);
	float curr = Random01(r);
	CHECK_EQUAL (curr, prev);

Makes sense, right?

Here’s what MSVC 2010 compiles this down into:

push        rbx  
sub         rsp,50h  
mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFEh  
	Rand r(1234);
	RandState oldState = r.GetState();
	float prev = Random01(r);
movss       xmm0,dword ptr [__real@3f47ac02 (01436078A8h)]  
movss       dword ptr [prev],xmm0  
	float curr = Random01(r);
mov         eax,0BC5448DBh  
shl         eax,0Bh  
xor         eax,0BC5448DBh  
mov         ecx,0CE6F4D86h  
shr         ecx,0Bh  
xor         ecx,eax  
shr         ecx,8  
xor         eax,ecx  
xor         eax,0CE6F4D86h  
and         eax,7FFFFFh  
pxor        xmm0,xmm0  
cvtsi2ss    xmm0,rax  
mulss       xmm0,dword ptr [__real@34000001 (01434CA89Ch)]  
movss       dword ptr [curr],xmm0  
	CHECK_EQUAL (curr, prev);
call        UnitTest::CurrentTest::Details (01420722A0h)
; ...

There’s some bit operations going on (the RNG is Xorshift 128), looks fine on the first glance.

But wait a minute; this seems like it only has code to generate a random number once, whereas the test is supposed to call Random01 three times?!

Turns out the compiler is smart enough to see through some of the calls, folds down all these computations and goes, “yep, so the 2nd call to Random01 will produce 0.779968381 (0x3f47ac02)“. And then it kinda partially does actual computation of the 3rd Random01 call, and eventually checks that the result is the same.


Now, what does clang (whatever version Xcode 8.1 on Mac has) do on this same test?

pushq  %rbp
movq   %rsp, %rbp
pushq  %rbx
subq   $0x58, %rsp
movl   $0x3f47ac02, -0xc(%rbp)   ; imm = 0x3F47AC02 
movl   $0x3f47ac02, -0x10(%rbp)  ; imm = 0x3F47AC02 
callq  0x1002b2950               ; UnitTest::CurrentTest::Results at CurrentTest.cpp:7
; ...

Whoa. There’s no code left at all! Everything just became an “is 0x3F47AC02 == 0x3F47AC02” test. It became a compile-time test.


By the way, the original problem I was looking into? Turns out RNG is fine (phew!). What got me was code in my own test that I should have known better about; it was roughly like this:

transform.SetPosition(Vector3f(Random01(), Random01(), Random01()));

See what’s wrong?




The function argument evaluation order in C/C++ is unspecified.

(╯°□°)╯︵ ┻━┻

Newer languages like C# or Java have guarantees that arguments are evaluated from left to right. Yay sanity.

Interview questions

Recently saw quite some twitter discussions about good & bad interview questions. Here’s a few I found useful.

In general, the most useful questions seem to be fairly open-ended ones, that can either lead to a much larger branching discussions, or don’t even have the “right” answer. Afterall, you mostly don’t want to know the answer (it’s gonna be 42 anyway), but to see the problem solving process and/or evaluate general knowledge and grasp of the applicant.

The examples below are what I have used some times, and are targeted at graphics programmers with some amount of experience.

When a GPU samples a texture, how does it pick which mipmap level to read from?

Ok this one does have the correct answer, but it still can lead to a lot of discussion. The correct answer today is along the lines of:

GPU rasterizes everything in 2x2 fragment blocks, computes horizontal and vertical texture coordinate differences between fragments in the block, and uses magnitudes of the UV differences to pick the mipmap level that most closely matches 1:1 ratio of fragments to texels.

If the applicant does not know the answer, you could try to derive it. “Well ok, if you were building a GPU or writing a software rasterizer, what would you do?“.

Most people initially go for “based on distance to the camera”, or “based on how big the triangle is on screen” and so on. This is a great first step (and was roughly how rasterizers in the really old days worked). You can up the challenge then by throwing in a “but what if UVs are not just computed per-vertex?”. Which of course makes these suggested approaches not suitable anymore, and something else has to be thought up.

Once you get to the 2x2 fragment block solution, more things can be discussed (or just jump here if the applicant already knew the answer). What implications does it have for the programming models, efficiency etc.? Possible things to discuss:

  • The 2x2 quad has to execute shader instructions in lockstep, so that UV differences can be computed. Which leads to branching implications, which leads to regular texture samples being disallowed (HLSL) or undefined (GLSL) when used inside dynamic branching code paths.
  • Lockstep execution can lead into discussion how the GPUs work in general; what are the wavefronts / warps / SIMDs (or whatever your flavor of API/GPU calls them). How branching works, how temporary variable storage works, how to optimize occuppancy, how latency hiding works etc. Could spend a lot of time discussing this.
  • 2x2 quad rasterization means inefficiencies at small triangle sizes (a triangle that covers 1 fragment will still get four fragment shader executions). What implications this has for high geometry density, tessellation, geometric LOD schemes. What implications this has for forward vs deferred shading. What research is done to solve this problem, is the applicant aware of it? What would they do to solve or help with this?

You are designing a lighting system for a game/engine. What would it be?

This one does not even have the “correct” answer. A lighting system could be anything, there’s at least a few dozen commonly used ways to do it, and probably millions of more specialized ways! Lighting encompasses a lot of things – punctual, area, environment light sources, emissive surfaces; realtime and baked illumination components; direct and global illumination; shadows, reflections, volumetrics; tradeoffs between runtime peformance and authoring performance, platform implications, etc. etc. It can be a really long discussion.

Here, you’re interested in several things:

  • General thought process and how do they approach open-ended problems. Do they clarify requirements and try to narrow things down? Do they tell what they do know, what they do not know, and what needs further investigation? Do they just present a single favorite technique of theirs and can’t tell any downsides of it?
  • Awareness of already existing solutions. Do they know what is out there, and aware of pros & cons of common techniques? Have they tried any of it themselves? How up-to-date is their knowledge?
  • Exploring the problem space and making decisions. Is the lighting system for a single very specific game, or does it have to be general and “suitable for anything”? How does that impact the possible choices, and what are consequences of these choices? Likewise, how does choice of hardware platforms, minimum specs, expected complexity of content and all other factors affect the choices?
  • Dealing with tradeoffs. Almost every decisions engineers do involve tradeoffs of some kind - by picking one way of doing something versus some other way, you are making a tradeoff. It could be performance, usability, flexibility, platform reach, implementation complexity, workflow impact, amount of learning/teaching that has to be done, and so on. Do they understand the tradeoffs? Are they aware of pros & cons of various techniques, or can they figure them out?

You are implementing a graphics API abstraction for an engine. How would it look like?

Similar to the lighting question above, there’s no single correct answer.

This one tests awareness of current problem space (console graphics APIs, “modern” APIs like DX12/Vulkan/Metal, older APIs like DX11/OpenGL/DX9). What do they like and dislike in the existing graphics APIs (red flag if they “like everything” – ideal APIs do not exist). What would they change, if they could?

And again, tradeoffs. Would they go for power/performance or ease of use? Can you have both (if “yes” - why and how? if “no” - why?). Do they narrow down the requirements of who the abstraction users would be? Would their abstraction work efficiently on underlying graphics APIs that do not closely map to it?

You need to store and render a large city. How would you do it?

This one I haven’t used, but saw someone mention on twitter. Sounds like an excellent question to me, again because it’s very open ended and touches a lot of things.

Authoring, procedural authoring, baking, runtime modification, storage, streaming, spatial data structures, levels of detail, occlusion, rendering, lighting, and so on. Lots and lots of discussion to be had.

Well this is all.

That said, it’s been ages since I took a job interview myself… So I don’t know if these questions are as useful for the applicants as they seem for me :)

Shader Compression: Some Data

One common question I had about SPIR-V Compression is “why compress shaders at all?“, coupled with question on how SPIR-V shaders compare with shader sizes on other platforms.

Here’s some data (insert usual caveats: might be not representative, etc. etc.).

Unity Standard shader, synthetic test

Took Unity’s Standard shader, made some content to make it expand into 482 actual shader variants (some variants to handle different options in the UI, some to handle different lighting setups, lightmaps, shadows and whatnot etc.). This is purely a synthetic test, and not an indicator of any “likely to match real game data size” scenario.

These 482 shaders, when compiled into various graphics API shader representations with our current (Unity 5.5) toolchain, result in sizes like this:

APIUncompressed MBLZ4HC Compressed MB
D3D9 1.04 0.14
D3D11 1.39 0.12
Metal 2.55 0.20
OpenGL 2.04 0.15
Vulkan 6.84 1.63
Vulkan + SMOL-V 1.94 0.43

Basically, SPIR-V shaders are 5x larger than D3D11 shaders, and 3x larger than GL/Metal shaders. When compressed (LZ4HC used since that’s what we use to compress shader code), they are 12x larger than D3D11 shaders, and 8x larger than GL/Metal shaders.

Adding SMOL-V encoding gets SPIR-V shaders to “only” be 3x larger than shaders of other APIs when compressed.

Game, full set of shaders

I also got numbers from one game developer trying out SMOL-V on their game. The game is inhouse engine, size of full game shader build:

APIUncompressed MBZipped MB
D3D11 44 14
Vulkan + remap 178 62
Vulkan + remap + SMOL-V 83 54

Here again, SPIR-V shaders are several times larger (than for example DX11), even after remapping + compression. Adding SMOL-V makes them a bit smaller (I suspect without remapping it might be even smoller).

In context

However, in the bigger picture shader compression might indeed not be a thing worth looking into. As always, “it depends”…

Adding smol-v encoding on this game’s data saved 15 megabytes. On one hand, it’s ten floppy disks, which is almost as much as entire Windows 95 took! On the other hand, it’s about the size of one 4kx4k texture, when DXT5/BC7 compressed.

So yeah. YMMV.

SPIR-V Compression

TL;DR: Vulkan SPIR-V shaders are fairly large. SMOL-V can make them smaller.

Other folks are implementing Vulkan support at work, and the other day they noticed that Vulkan shaders (which are represented as SPIR-V binary format) take up quite a lot of space. I thought it would be a fun excercise to try to make them smoller, and maybe I’d learn something about compression along the way too.

Caveat emptor! I know nothing about compression. Or rather, I’m probably at the stage where I can make the impression that I know something about it, but all that knowledge is very superficial. Exactly the stage that is dangerous, if I start to talk about it as if I have a clue! So below, I’m doing exactly that. You’ve been warned.

SPIR-V format

SPIR-V is extremely simple and regular format. Everything is 4-byte words. Many things that only need a few bits of information are still represented as a word.

This makes it simple, but not exactly space efficient. I don’t have data nearby right now, but a year or so ago I looked into shaders that do the same thing, compiled for DX9, DX11, OpenGL (GLSL) and Vulkan (SPIR-V), and the SPIR-V were “fattest” by a large amount (DX9 and minified GLSL being the smallest).


“Why not just compress them?”, you ask. That should take care of these “three bits of information written as 4 bytes” style enums. That it does; standard lossless compression techniques are pretty good at encoding often occuring patterns into a small number of bits (further reading: Huffman, Arithmetic, FSE coding).

And indeed, SPIR-V compresses quite well. For example, 1315 kilobytes worth of shader data (from various Unity shaders) compresses to 279 kilobytes with Zstandard and to 306 kilobytes with zlib (I used miniz implementation) at default settings. So a standard go-to compression (zlib) gets you a 23.4% compression of SPIR-V.

However, SPIR-V is full of not-really-compressible things, mostly various identifiers (anything called <id> in the spec). Due to the SSA form that SPIR-V uses, all the identifiers ever used are unique numbers, with nothing reusing a previous ID. A regular data compressor does not get to see many repeating patterns there.

Data compression algorithms usually only look for literally repeating patterns. If you’d have a file full of 0x00000001 integers, this will compress extremely well. However, if your file will be a really simple sequence of integers: 1, 2, 3, 4, …, this will not compress particularly well!

I actually just tested this. 16384 four-byte words, which are just a sequence of 0,1,…,16383 integers, compressed with zlib at default settings: 64KB -> 22716 bytes.

Enter Data Filtering

Recall that “a simple sequence of numbers compresses quite poorly” example above? Turns out, a typical trick in data compression is to filter the data before compressing it. Filtering can be any sort of reversible transformation of the data, that makes it be more compressible, i.e. have more actually repeating patterns.

For example, using delta encoding on that integer sequence would transform it into a file that is pretty much all just 0x00000001 integers. This compresses with zlib into just 88 bytes!

Data filtering is fairly widely used, for example:

  • PNG image format has several filters, as described here.
  • Executable file compression usually transforms machine code instructions into a more compressible form, see techniques used in .kkrunchy for example.
  • HDF5 scientific data format has filters like bitshuffle that reorder data before actual compression.
  • Some compressors like RAR seemingly automatically apply various filters to data blocks they identify as “filterable” (i.e. “looks like an executable” or “looks like sound wave samples” somehow).

Perhaps we could do some filtering on SPIR-V to make it more compressible?

Filtering: spirv-remap

In SPIR-V land, there is a tool called spirv-remap that aims to help with compression. What it does, is it changes all the IDs used in the shader to values that would hopefully be similar if you have a lot of other similar shaders, and compress them all as a whole. For each new ID, it “looks” at several surrounding instructions, and picks the ID based on their hash. The assumption is that you’re very likely to have other shaders that have similar fragments of instructions – they would be compressible if only the IDs would be the same.

And indeed, on that same set of shaders I had above: uncompressed size 1315KB, zstd-compressed 279KB (21.2%), remapped + zstd compressed: 189KB (14.4% compression).

However, spirv-remap tries to filter the SPIR-V program in a way that still results in a valid SPIR-V program. Maybe we could do better, if we did not have such a restriction?

SMOL-V: making SPIR-V smoller

So that’s what I did. My goal was to conceptually have two functions:

ByteArray Encode(const ByteArray& spirvInput); // SPIR-V -> SMOL-V
ByteArray Decode(const ByteArray& encodedBytes); // SMOL-V -> SPIR-V

with the goal that:

  • Encoded result would be smaller than input,
  • When compressed (with Zstd, zlib etc.), it would be smaller than if I just compressed the input,
  • When compressed, it would be smaller than what a compressed spirv-remap can achieve.
  • Do that in a fairly simple way. Since hey, I’m a compression n00b, anything that is compression rocket surgery is likely way out of my capabilities. Also, I wanted to roughly spend a (long) day on this.

So below is a write up of what I did (can also be seen in the commit history). First of all, I just looked at the SPIR-V binaries with a hex viewer. And in almost every step below, either looked at binaries, or printed bytes of instructions and looked for patterns.

Variable-length integer encoding: varint

Recall that SPIR-V uses four-byte words to store every single piece of information it needs. Often these are enum-style information that uses a few dozen possible values. I did not want to hardcode every possible operation & enum ranges (that would be a lot of work, and not very future-proof with later SPIR-V versions), so instead I looked at various variable-length integer storage schemes. Most famous is probably UTF-8 in text land. In binary data land there are VLQ, LEB128 and varint, which all are variations of “store 7 bits of data, and one bit to signal if there are more bytes following”. I picked the “varint” as used by Google Protocol Buffers, if only because I found it before I found the others :)

With varint encoding for unsigned integers, numbers below 128 take only one byte, numbers below 16384 take two bytes and so on.

So the very first try was to use varint encoding on each instruction’s length+opcode word, and the Type ID that many instructions have. Then I noticed that the Result IDs of almost every instructions are just one or two IDs larger then the result of a previous instruction. So I wrote them out as deltas from previous, and again encoded as varint.

This got just SMOL-V data to 71% size of original SPIR-V, and 18.2% when Zstd-compressed on top.

Relative-to-result and varint on often-occurring instructions

I dumped frequencies of how much space various opcode types take, and it became fairly clear that OpDecorate takes a lot, as well as OpVectorShuffle.

Now, decorations are guaranteed to be grouped, and often are specified on the same or very similar target IDs. The decoration values themselves are small integers. So, encode result IDs relative to a previously seen declarations, and use varint encoding on everything else (commit).

Vector shuffles also specify several IDs (often close to just-seen ones), and a few small component indices, so do a similar treatment for that (commit).

Combined, these took SMOL-V data to 56%, and 14.6% when Zstd-compressed.

I then noticed that the same pattern occurs in a lot of other instructions: the opcode, type and result IDs are often followed by several other IDs (how many depends on the opcode), and some other “usually small integer” values (how many, again depends on the opcode). So instead of just hardcoding handling of these several opcodes above, I generalized the code to look up this information into a table indexed by opcode.

After quite a lot more opcodes got this treatment, I was at 42% SMOL-V size, and 10.7% when Zstd-compressed. Not bad!

Negative deltas?

Most of the ID arguments I have encoded as a delta from previous Result ID value. The deltas were always positive so far, which is nice for varint encoding. However when I came to adding the same treatment to branch and control flow instructions, I realized that the IDs they reference are often “in the future”, which would mean the deltas are negative. Under the varint encoding, these would be the same as very large positive numbers, and often encode into 4 or 5 bytes.

Luckily, the same Protocol Buffers have a solution for that; signed integers get their bits shuffled so that small absolute values are turned into small positive values – the ZigZag encoding. So I used that to encode IDs of control flow instructions.

Opcode value reordering

At this point tweaking just delta+varint encoding was starting to give diminishing returns. So I started looking at bytes again.

That “encode opcode + length as varint” was often producing 2 or 3 bytes worth of data, due to the way SPIR-V encodes that word. I tried reordering it so that most common opcodes&lenghts produce just one byte.

1) Swap opcode values so that most common ones fit into 4 bits. Most common ones in my shader test data were: Decorate (24%), Load (17%), Store (9%), AccessChain (7%), VectorShuffle (5%), MemberDecorate (4%) etc.

static SpvOp smolv_RemapOp(SpvOp op)
#	define _SMOLV_SWAP_OP(op1,op2) if (op==op1) return op2; if (op==op2) return op1
	_SMOLV_SWAP_OP(SpvOpDecorate,SpvOpNop); // 0
	_SMOLV_SWAP_OP(SpvOpLoad,SpvOpUndef); // 1
	_SMOLV_SWAP_OP(SpvOpStore,SpvOpSourceContinued); // 2
	_SMOLV_SWAP_OP(SpvOpAccessChain,SpvOpSource); // 3
	_SMOLV_SWAP_OP(SpvOpVectorShuffle,SpvOpSourceExtension); // 4
	// Name - already small value - 5
	// MemberName - already small value - 6
	_SMOLV_SWAP_OP(SpvOpMemberDecorate,SpvOpString); // 7
	_SMOLV_SWAP_OP(SpvOpLabel,SpvOpLine); // 8
	_SMOLV_SWAP_OP(SpvOpVariable,(SpvOp)9); // 9
	_SMOLV_SWAP_OP(SpvOpFMul,SpvOpExtension); // 10
	_SMOLV_SWAP_OP(SpvOpFAdd,SpvOpExtInstImport); // 11
	// ExtInst - already small enum value - 12
	// VectorShuffleCompact - already small value - used for compact shuffle encoding
	_SMOLV_SWAP_OP(SpvOpTypePointer,SpvOpMemoryModel); // 14
	_SMOLV_SWAP_OP(SpvOpFNegate,SpvOpEntryPoint); // 15
#	undef _SMOLV_SWAP_OP
	return op;

2) Adjust opcode lengths so that most common ones fit into 3 bits.

// For most compact varint encoding of common instructions, the instruction length
// should come out into 3 bits. SPIR-V instruction lengths are always at least 1,
// and for some other instructions they are guaranteed to be some other minimum
// length. Adjust the length before encoding, and after decoding accordingly.
static uint32_t smolv_EncodeLen(SpvOp op, uint32_t len)
	if (op == SpvOpVectorShuffle)			len -= 4;
	if (op == SpvOpVectorShuffleCompact)	len -= 4;
	if (op == SpvOpDecorate)				len -= 2;
	if (op == SpvOpLoad)					len -= 3;
	if (op == SpvOpAccessChain)				len -= 3;
	return len;
static uint32_t smolv_DecodeLen(SpvOp op, uint32_t len)
	if (op == SpvOpVectorShuffle)			len += 4;
	if (op == SpvOpVectorShuffleCompact)	len += 4;
	if (op == SpvOpDecorate)				len += 2;
	if (op == SpvOpLoad)					len += 3;
	if (op == SpvOpAccessChain)				len += 3;
	return len;

3) Interleave bits of the original word so that these common ones (opcode + lenght) take up lowest seven bits of the result, and encode to just one byte in varint scheme. 0xLLLLOOOO is how SPIR-V encodes it (L=length, O=op), shuffle it into 0xLLLOOOLO so that common case (op<16, len<8) is encoded into one byte.

That got things down to 35% SMOL-V size, and 9.7% when Zstd-compressed.

Vector Shuffle encoding

SPIR-V has a single opcode OpVectorShuffle that is used for both selecting components from two vectors, and for a typical “swizzle”. Swizzles are by far the most common in the shaders I’ve seen, so often in raw SPIR-V something like “v.xxyy” swizzle ends up being “v, v, 0, 0, 1, 1” - each of these being a full 32 bit word (both arguments point to the same vector, and then component indices spelled out).

I made the code recognize this common pattern of “shuffle with <= 4 components, where each is between 0 and 3”, and encode that as a fake “VectorShuffleCompact” opcode using one of the unused opcode values, 13. The swizzle pattern fits into one byte (two bits per channel) instead taking up 16 bytes (commit).

Adding non-Unity shaders, and zigzag

At this point I added more shaders to test on, to see how everything above behaves on non-Unity compilation pipeline produced shaders (thanks @baldurk, @AlenL and @basisspace for providing and letting me use shaders from The Talos Principle and DOTA2).

Turns out, both of these games ship with shaders that are alreayd processed with spirv-remap. One thing it does (well, the primary thing it does!) is changing all the IDs to not be linearly increasing, but have values all over the place. My previous work on using delta encoding and varint output was often going against that, since often it would be that next ID would be smaller than previous one, resulting in negative delta, which encodes into 4 or 5 bytes under varint. Not good!

Well it wasn’t bad; this is SMOL-V that not only compresses, but also strips debug info, to match what spirv-remap did for Talos/DOTA2 case:

  • Unity: remap+zstd 13.0%, SMOL-V+zstd 7.2%.
  • Talos: remap+zstd 11.1%, SMOL-V+zstd 9.0%.
  • DOTA2: remap+zstd 9.9%, SMOL-V+zstd 8.4%.

It already compresses better than spirv-remap, but is more better on shaders that aren’t already remapped.

I switched all the deltas to use zigzag encoding (see Negative Deltas above), so that on already remapped shaders it does not go into “whoops encoded into 5 bytes”:

  • Unity: remap+zstd 13.0%, SMOL-V+zstd 7.3% (a tiny bit worse than 7.2% before).
  • Talos: remap+zstd 11.1%, SMOL-V+zstd 8.5% (yay, was 9.0% before).
  • DOTA2: remap+zstd 9.9%, SMOL-V+zstd 8.2% (small yay, was 8.4% before).

MemberDecorate encoding

Structure/buffer decorations (OpMemberDecorate) were taking up quite a bit of space, so I looked for some patterns in them.

Most often they are very simple sequences, e.g.

Op             Type Member Decoration Extra
MemberDecorate 168  0      35           0 
MemberDecorate 168  1      35          64 
MemberDecorate 168  2      35          80 
MemberDecorate 168  3      35          96 
MemberDecorate 168  4      35         112 
MemberDecorate 168  5      35         128 
MemberDecorate 168  6       0 
MemberDecorate 168  6      35         384 
MemberDecorate 168  7      35         400 

When encoding, I scan ahead to see whether there’s a sequence of MemberDecorate instructions that are all about the same type, and “fold” them into one – so I can skip writing out opcode+lenght and type ID data. Additionally, delta encode member index, and have special handling of decoration 35 (“Offset”, which is extremely common) to store actual offset as delta from previous one. This got some gains (commit).

Quite likely OpDecorate sequences could get a similar treatment, but I did not do that yet.

Current status

So that’s about it! Current compression numbers, on a set of Unity+Talos+DOTA2 shaders, with debug info stripping:

CompressionNo filter (*)spirv-remapSMOL-V
Size KBRatioSize KBRatioSize KBRatio
Uncompressed 3725.4100.0% 3560.095.6% 1297.634.8%
zlib default 860.623.1% 761.920.5% 464.912.5%
LZ4HC default 884.423.7% 743.320.0% 441.011.8%
Zstd default 555.414.9% 425.611.4% 295.57.9%
Zstd level 20 339.49.1% 260.57.0% 226.76.1%

(*) Note: about 2/3rds of the shader (Talos & DOTA2) set were already processed by spirv-remap; I don’t have unprocessed shaders from these games. This makes spirv-remap look a bit worse than it actually is though.

I think it’s not too bad for a couple days of work. And I have learned a thing or two about compression. Again, the github repository is here: github.com/aras-p/smol-v.

  • Encoding does a simple one-pass scan over the input (with occasional look aheads for MemberDecorate sequences), and writes encoded result to the output.
  • Decoding simply goes over encoded bytes and transforms into SPIR-V. One pass over data, no memory allocations.
  • No “altering” of SPIR-V programs is done; what you encode is exactly what you get after decoding (this is different from spirv-remap, that actually changes the IDs). Exception is the kEncodeFlagStripDebugInfo that removes debug information from the input program.

Future Work?

Not sure I will work on this much (as opposed to “eh, good enough for now”), but possible future work might be:

  • Someone who actually knows about compression will look at it, and point out low hanging fruits :)
  • Do special encoding of some more opcodes (OpDecorate comes to mind).
  • Split up encoded data into several “streams” for better compression (e.g. lenghts, opcodes, types, results, etc.). Very similar to the “Split-stream encoding” from .kkrunchy blog post.
  • As John points out, there are other possible axes to explore compression.

This was super fun. I highly recommend “short, useful, and you get to learn something” projects :)