Pathtracer 17: WebAssembly

Introduction and index of this series is here.

Someone at work posted a “Web Development With Assembly” meme as a joke, and I pulled off a “well, actually” card pointing to WebAssembly. At that point I just had to make my toy path tracer work there.

So here it is: aras-p.info/files/toypathtracer

Porting to WebAssembly

The “porting” process was super easy; I was quite impressed by how painless it was. Basically it was:

  1. Download & install the official Emscripten SDK, and follow the instructions there.
  2. Compile my source files. This is very similar to invoking gcc or clang on the command line, except that the Emscripten compiler is called emcc. This was the full command line I used:
     emcc -O3 -std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s EXTRA_EXPORTED_RUNTIME_METHODS='["cwrap"]' -o toypathtracer.js main.cpp ../Source/Maths.cpp ../Source/Test.cpp
  3. Modify the existing code to make both threads & SIMD (two things that Emscripten/WebAssembly lack at the moment) optional. This was just a couple dozen lines of code, starting here in this commit.
  4. Write the “main” C++ entry point file that is specific for WebAssembly, and the HTML page to host it.
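As a rough illustration of step 3, here is a minimal sketch of making a SIMD code path optional at compile time. The TOY_USE_SSE macro and the SumFloats function are hypothetical names made up for this example, not the ones used in the actual commit; threads can get the same kind of #if treatment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch: use SSE only when the compiler targets it
// and we are not building for WebAssembly.
#if defined(__SSE__) && !defined(__EMSCRIPTEN__)
#define TOY_USE_SSE 1
#include <xmmintrin.h>
#else
#define TOY_USE_SSE 0
#endif

// Sums `count` floats. To keep the sketch short, count must be a multiple of 4.
float SumFloats(const float* data, std::size_t count)
{
    assert(count % 4 == 0);
#if TOY_USE_SSE
    // SIMD path: accumulate 4 floats per iteration.
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    // Scalar fallback: this is what a WebAssembly build would compile.
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
#endif
}
```

Callers do not need to know which path was compiled in; only the build target decides.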

How to structure the main thing in C++ vs HTML? I basically followed the “Emscripting a C library to Wasm” doc by Google, and “Update a canvas from wasm” Rust example (my case is not Rust, but things were fairly similar). My C++ entry file is here (main.cpp), and the HTML page is here (toypathtracer.html). All pretty simple.
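To give an idea of the shape of such an entry file, here is a minimal sketch, assuming the C++ side owns an RGBA pixel buffer and exposes two C functions to JavaScript. The names toy_init and toy_render and the TOY_EXPORT macro are hypothetical, and a simple gradient stands in for the path tracer; this is not the contents of the real main.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__EMSCRIPTEN__)
#include <emscripten.h>
// EMSCRIPTEN_KEEPALIVE stops the function from being stripped as unused.
#define TOY_EXPORT extern "C" EMSCRIPTEN_KEEPALIVE
#else
#define TOY_EXPORT extern "C"
#endif

static std::vector<uint8_t> g_pixels; // RGBA8 framebuffer owned by the C++ side
static int g_width = 0, g_height = 0;

// Allocates the framebuffer and returns a pointer to it. On the web, the
// page can view these bytes through the WebAssembly heap.
TOY_EXPORT uint8_t* toy_init(int width, int height)
{
    assert(width > 0 && height > 0);
    g_width = width;
    g_height = height;
    g_pixels.assign(static_cast<std::size_t>(width) * height * 4, 0);
    return g_pixels.data();
}

// Renders one frame into the buffer (placeholder gradient, not a path tracer).
TOY_EXPORT void toy_render(int frame)
{
    for (int y = 0; y < g_height; ++y)
        for (int x = 0; x < g_width; ++x) {
            uint8_t* p = &g_pixels[(static_cast<std::size_t>(y) * g_width + x) * 4];
            p[0] = static_cast<uint8_t>(x * 255 / g_width);
            p[1] = static_cast<uint8_t>(y * 255 / g_height);
            p[2] = static_cast<uint8_t>(frame & 255);
            p[3] = 255;
        }
}
```

On the HTML side, functions like these would be wrapped with cwrap (which the command line above exports), called once per frame, and the pixel bytes copied into a canvas.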

And that’s basically it!

Ok how fast does it run?

At the moment WebAssembly does not have SIMD, and does not have “typical” (shared memory) multi-threading support.

The Web almost got multi-threading at start of 2018, but then Spectre and Meltdown happened, and threading got promptly turned off. As soon as you have ability to run fast atomic instructions on a thread, you can build a really high precision timer, and as soon as you have a high precision timer, you can start measuring things that reveal what sort of thing got into the CPU caches. Having “just” that is enough to start building basic forms of these attacks.
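The “threads give you a timer” part can be sketched in a few lines. This is only an illustration of the idea in native C++ with a hypothetical CounterClock class: one thread does nothing but increment a shared atomic counter, and anyone else reads that counter as a clock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// A "clock" built from nothing but a thread incrementing a shared counter.
class CounterClock {
public:
    CounterClock() : ticks_(0), running_(true), worker_([this] { Run(); }) {}
    ~CounterClock() { running_.store(false); worker_.join(); }

    // Current tick count; the resolution is one loop iteration of Run().
    uint64_t Now() const { return ticks_.load(std::memory_order_relaxed); }

private:
    void Run()
    {
        while (running_.load(std::memory_order_relaxed))
            ticks_.fetch_add(1, std::memory_order_relaxed);
    }

    // Declaration order matters: the counters are initialized before the thread starts.
    std::atomic<uint64_t> ticks_;
    std::atomic<bool> running_;
    std::thread worker_;
};

// Returns how many ticks elapsed while `work` ran.
template <typename F>
uint64_t MeasureTicks(const CounterClock& clock, F&& work)
{
    const uint64_t start = clock.Now();
    work();
    return clock.Now() - start;
}
```

No timer API is involved anywhere, which is why removing or coarsening timer APIs alone does not help once shared memory threading is available.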

By now the whole industry (CPU, OS, browser makers) scrambled to fix these vulnerabilities, and threading might be coming back to Web soon. However at this time it’s not enabled by default in any browsers yet.

All this means that the performance numbers of WebAssembly will be substantially lower than other CPU implementations – after all, it will be running on just one CPU core, and without any of the SIMD speedups we have done earlier.

Anyway, the results I have are below (higher numbers are better). You can try yourself at aras-p.info/files/toypathtracer

Device                                    OS            Browser       Mray/s
Intel Core i9 8950HK 2.9GHz (MBP 2018)    macOS 10.13   Safari 11     5.8
                                                        Chrome 70     5.3
                                                        Firefox 63    5.1
Intel Xeon W-2145 3.7GHz                  Windows 10    Chrome 70     5.3
AMD ThreadRipper 1950X 3.4GHz             Windows 10    Firefox 64    4.7
                                                        Chrome 70     4.6
                                                        Edge 17       4.5
iPhone XS / XR (A12)                      iOS 12        Safari        4.4
iPhone 8+ (A11)                           iOS 12        Safari        4.0
iPhone SE (A9)                            iOS 12        Safari        2.5
Galaxy Note 9 (Snapdragon 845)            Android 8.1   Chrome        2.0
iPhone 6 (A8)                             iOS 12        Safari        1.7

For reference, if I turn off threading & SIMD in the regular C++ version, I get 7.0 Mray/s on the Core i9 8950HK MacBookPro. So WebAssembly at 5.1-5.8 Mray/s on the same machine is slightly slower, but not “a lot”. Is nice!

All code is on github at 17-wasm tag.


SPIR-V Compression: SMOL vs MARK

Two years ago I did a small utility to help with Vulkan (SPIR-V) shader compression: SMOL-V (see blog post or github repo).

It is used by Unity, and it looks like it is also used by some non-Unity projects (if you use it, let me know! always interesting to see where it ends up).

Then I remembered the github issue where SPIR-V compression was discussed. It mentioned that SPIRV-Tools was getting some sort of “compression codec” (see comments) and got closed as “done”, so I decided to check it out.

SPIRV-Tools compression: MARK-V

The SPIRV-Tools repository, which is a collection of libraries and tools for processing SPIR-V shaders (validation, stripping, optimization, etc.), has a compressor/decompressor in there too, but it’s not advertised much. It’s not built by default, and requires passing a SPIRV_BUILD_COMPRESSION=ON option to the CMake build.

The sources related to it are under source/comp and tools/comp folders; and compression is not part of the main interfaces under include/spirv-tools headers; you’d have to manually include source/comp/markv.h. The build also produces a command line executable spirv-markv that can do encoding or decoding.

The code is well commented in terms of “here’s what this small function does”, but I didn’t find any high level description of “the algorithm” or properties of the compression. I see that it does something with shader instructions; there’s some Huffman related things in there, and large tables that are seemingly auto-generated somehow.

Let’s give it a go!

Getting MARK-V to compile

In the SMOL-V repository I have a little test application (see testmain.cpp) that has a bunch of shaders, runs either SMOL-V or Spirv-Remapper on them, additionally compresses the result with zlib/lz4/zstd, and so on. “Let’s add MARK-V in there too” sounded like a natural thing to do. And since I refuse to deal with CMake in my hobby projects :), I thought I’d just add relevant MARK-V source files…

First “uh oh” sign: while the number of files under compression related folders (source/comp, tools/comp) is not high, that is 500 kilobytes of source code. Half a meg of source, Carl!

And then of course it needs a whole bunch of surrounding code from SPIRV-Tools to compile. So I copied everything that it needed to work. In total, 1.8MB of source code across 146 files.

After finding all the source files and setting up include paths for them, it compiled easily on both Windows (VS2017) and Mac (Xcode 9.4).

Pet peeve: I never understood why people don’t use file-relative include paths (like #include "../foo/bar/baz.h"), instead requiring the users of their library to set up additional include path compiler flags. As far as I can tell, relative include paths have no downsides, and require way less fiddling to both compile the library and use it.

Side issue: STL vector for input data

The main entry point for MARK-V decoding (this is what would happen on the device when loading shaders – so this is the performance critical part) is:

spv_result_t MarkvToSpirv(
    spv_const_context context, const std::vector<uint8_t>& markv,
    const MarkvCodecOptions& options, const MarkvModel& markv_model,
    MessageConsumer message_consumer, MarkvLogConsumer log_consumer,
    MarkvDebugConsumer debug_consumer, std::vector<uint32_t>* spirv);

Ok, I kind of get the need (or at least convenience) of using std::vector for output data; after all you are decompressing and writing out an expanding array. Not ideal, but at least there is some explanation.

But for input data – why?! One of const uint8_t* markv, size_t markv_size or a const uint8_t* markv_begin, const uint8_t* markv_end is just as convenient, and allows way more flexibility for the user at where the data is coming from. I might have loaded my data as memory-mapped files, which then literally is just a pointer to memory. Why would I have to copy that data into an additional STL vector just to use your library?
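Here is a minimal sketch of what I mean. DecodeWords is a hypothetical function invented for this example (it is not part of SPIRV-Tools), and the “decoding” just widens bytes into 32-bit words as a stand-in. The point is only the shape of the interface: the core entry point takes a pointer and a size, and a vector overload is a one-line convenience on top of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoder entry point: the input is just "bytes somewhere in memory".
// Stand-in decoding: every input byte becomes one 32-bit output word.
bool DecodeWords(const uint8_t* data, std::size_t size, std::vector<uint32_t>* out)
{
    if ((data == nullptr && size != 0) || out == nullptr)
        return false;
    out->clear();
    out->reserve(size);
    for (std::size_t i = 0; i < size; ++i)
        out->push_back(data[i]);
    return true;
}

// Convenience overload for callers that already hold a vector.
// It forwards to the pointer version, so the input is never copied.
inline bool DecodeWords(const std::vector<uint8_t>& data, std::vector<uint32_t>* out)
{
    return DecodeWords(data.data(), data.size(), out);
}
```

With this shape, data from a memory-mapped file, a static array or a vector all go through the same code path, and nobody has to copy anything just to satisfy the signature.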

Side issue: found bugs in “Max” compression

MARK-V has three compression models: “Lite”, “Mid” and “Max”. On some of my test shaders, data compressed with the “Max” model could not be decompressed successfully, so I guess “some bugs are there somewhere”. Filed a bug report and excluded the “Max” model from further comparison :(

MARK-V vs SMOL-V

Size evaluation

                 No filter          SMOL-V             MARK-V Lite        MARK-V Mid
Compression      Size KB  Ratio     Size KB  Ratio     Size KB  Ratio     Size KB  Ratio
Uncompressed     4870     100.0%    1630     33.5%     1369     28.1%     1085     22.3%
zlib default     1213     24.9%     602      12.4%     411      8.5%      336      6.9%
LZ4HC default    1343     27.6%     606      12.5%     410      8.4%      334      6.9%
Zstd default     899      18.5%     446      9.1%      394      8.1%      329      6.8%
Zstd level 20    590      12.1%     348      7.1%      293      6.0%      257      5.3%

Two learnings from this:

  • MARK-V without additional compression on top (“Uncompressed” row) is not really competitive (~25% of original size); just compressing shader data with Zstandard produces a smaller result, as does running it through SMOL-V coupled with any other compression.
  • This suggests that MARK-V acts more like a “filter” (similar to SMOL-V or spirv-remap), that makes the data smaller, but also makes it more compressible. Coupled with additional compression, MARK-V produces pretty good results, e.g. the “Mid” model ends up compressing data to ~7% of original size. Nice!

Decompression performance

I checked how much time it takes to decode/decompress shaders (4870KB uncompressed size):

               Windows, AMD TR 1950X 3.4GHz    Mac, i9-8950HK 2.9GHz
MARK-V Lite    536.7ms    9.1MB/s              492.7ms    9.9MB/s
MARK-V Mid     759.1ms    6.4MB/s              691.1ms    7.0MB/s
SMOL-V         8.8ms      553.4MB/s            11.1ms     438.7MB/s

Now, I haven’t seriously looked at my SMOL-V decompression performance (e.g. Zstandard general decompression algorithm does ~1GB/s), but at ~500MB/s it’s perhaps “not terrible”.

I can’t quite say the same about MARK-V though; it gets under 10MB/s of decompression performance. That, I think, is “pretty bad”. I don’t know what it does there, but this low decompression speed is within a “maybe I wouldn’t want to use this” territory.
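For reference, the MB/s figures are just size divided by time, with 1 MB taken as 1000 KB. A minimal sketch of that kind of measurement could look like this; TimeMs and ThroughputMBps are hypothetical helper names for this example, not the ones in my test application.

```cpp
#include <cassert>
#include <chrono>

// Converts "sizeKB decoded in ms milliseconds" into MB/s, with 1 MB = 1000 KB.
double ThroughputMBps(double sizeKB, double ms)
{
    assert(ms > 0.0);
    return sizeKB / ms; // KB per millisecond is the same number as MB per second
}

// Runs `decode` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double TimeMs(F&& decode)
{
    const auto t0 = std::chrono::steady_clock::now();
    decode();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For example, ThroughputMBps(4870.0, 8.8) gives the SMOL-V Windows figure of about 553.4 MB/s.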

Decompressor size

There is only one case where the decompressor code size does not matter: it’s if it comes pre-installed on the end hardware (as part of OS, runtimes, drivers, etc.). In all other cases, you have to ship decompressor inside your own application, i.e. statically or dynamically link to that code – so that, well, you can decompress the data you have compressed.

I evaluated decompressor code size by making a dynamic/shared library on a Mac (.dylib) with a single exported function that does a “decode these bytes please” work. I used -O2 -fvisibility=hidden -std=c++11 -fno-exceptions -fno-rtti compiler flags, and -shared -fPIC -lstdc++ -dead_strip -fvisibility=hidden linker flags.

  • SMOL-V decompressor .dylib size: 8.2 kilobytes.
  • MARK-V decompressor .dylib size (only with “Mid” model): 1853.2 kilobytes.

That’s right. 1.8 megabytes! At first I thought I did something wrong!

I looked at the size report via Bloaty, and yeah, in MARK-V decompressor it’s like: 570KB GetIdDescriptorHuffmanCodecs, 137KB GetOpcodeAndNumOperandsMarkovHuffmanCodec, 64KB GetNonIdWordHuffmanCodecs, 44KB kOpcodeTableEntries and then piles and piles of template instantiations that are smaller, but there’s lots of them.

In SMOL-V by comparison, it’s 2KB smolv::Decode, 1.3KB kSpirvOpData and the rest is misc stuff and/or dylib overhead.

Library compilation time

While this is not that important aspect, it’s relevant to my current work role as a build engineer :)

Compiling MARK-V libraries with optimizations on (-O2) takes 102 seconds on my Mac (single threaded; obviously multi-threaded would be faster). It is close to two megabytes of source code after all; and there is one file (tools/comp/markv_model_shader.cpp) that takes 16 seconds to compile alone. I think that got CI agents into timeouts in SPIRV-Tools project, and that was the reason why MARK-V is not enabled by default in the builds :)

Compiling SMOL-V library takes 0.4 seconds in comparison.

Conclusion

Looking at compression ratio in isolation, MARK-V coupled with additional lossless compression looks good, but I don’t think I would recommend it due to the other issues.

The decompressor executable size alone (almost 2MB!) means that in order for MARK-V to start to “make sense” compared to say SMOL-V, your total shader data size needs to be over 100 megabytes; only then additional compression from MARK-V offsets the massive decompressor size.

Sure, there are games with shaders that large, but then MARK-V is also quite slow at decompression – it would take over 10 seconds to decompress 100MB worth of shader data :(
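The rough arithmetic behind these two claims can be written down as a small sketch. It assumes I compare the “Zstd level 20” row of the size table (SMOL-V at 7.1%, MARK-V Mid at 5.3%) and the .dylib sizes from above (8.2KB vs 1853.2KB); the function names are made up for this example.

```cpp
#include <cassert>

// Shader data size (in KB) at which codec B's better compression ratio
// pays for its larger decompressor. Ratios are fractions of original size.
double BreakEvenKB(double decoderA_KB, double ratioA, double decoderB_KB, double ratioB)
{
    assert(ratioA > ratioB); // B must actually compress better
    return (decoderB_KB - decoderA_KB) / (ratioA - ratioB);
}

// Seconds needed to decode sizeMB of data at a given throughput.
double DecodeSeconds(double sizeMB, double mbPerSecond)
{
    assert(mbPerSecond > 0.0);
    return sizeMB / mbPerSecond;
}
```

BreakEvenKB(8.2, 0.071, 1853.2, 0.053) comes out at about 102500KB, i.e. a bit over 100MB. DecodeSeconds(100.0, 7.0) is about 14 seconds for the “Mid” model on the Mac, and even the faster “Lite” model at 9.9MB/s needs just over 10 seconds.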

All my evaluation code is on mark-v branch in SMOL-V repository. At this point I’m not sure I’ll merge it to the main branch.

This is all.


Pathtracer 16: Burst SIMD Optimization

Introduction and index of this series is here.

When I originally played with the Unity Burst compiler in “Part 3: C#, Unity, Burst”, I just did the simplest possible “get C# working, get it working on Burst” thing and left it there. Later on in “Part 10: Update C#” I updated it to use Structure-of-Arrays data layout for scene objects, and that was about it. Let’s do something about this.

Meanwhile, I have switched from late-2013 MacBookPro to mid-2018 one, so the performance numbers on a “Mac” will be different from the ones in previous posts.

Update to latest Unity + Burst + Mathematics versions

First of all, let’s update the Unity version we use from some random 2018.1 beta to the latest stable 2018.2.13, and update Burst (to 0.2.4-preview.34) & Mathematics (to 0.0.12-preview.19) packages along the way. Mathematics renamed lengthSquared to lengthsq, and introduced a PI constant that clashed with our own one :) These trivial updates in this commit.

Just that got performance on PC from 81.4 to 84.3 Mray/s, and on Mac from 31.5 to 36.5 Mray/s. I guess either Burst or Mathematics (or both) got some optimizations during this half a year, nice!

Add some “manual SIMD” to sphere intersection

Very similar to how in Part 8: SSE HitSpheres I made the C++ HitSpheres function do intersection testing of one ray against 4 spheres at once, we’ll do the same in our Unity C# Burst code.

The thought process and work done is extremely similar to the C++ side done in Part 8 and Part 9; basically:

  • Since data for our spheres is laid out nicely in SoA style arrays, we can easily load data for 4 of them at once.
  • Do all ray intersection math on these 4 spheres,
  • If any are hit, pick the closest one and calculate final hit position & normal.
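The steps above can be sketched in plain portable C++, with an inner 4-lane loop standing in for actual SIMD instructions. This is an illustration under my own simplifications (the sphere count is a multiple of 4, ray directions are normalized, and the struct and function names are invented for the example), not the code from the commit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct SpheresSoA {
    std::vector<float> cx, cy, cz, radius; // one array per component (SoA)
};

struct Ray { float ox, oy, oz, dx, dy, dz; }; // direction assumed normalized

// Returns the index of the closest sphere hit within (tMin, tMax), or -1.
// On a hit, the distance along the ray is written to *outT.
int HitSpheres4(const Ray& r, const SpheresSoA& s, float tMin, float tMax, float* outT)
{
    assert(outT != nullptr);
    assert(s.cx.size() % 4 == 0); // sketch: sphere count padded to a multiple of 4
    float bestT[4] = { tMax, tMax, tMax, tMax };
    int bestId[4] = { -1, -1, -1, -1 };
    for (std::size_t i = 0; i < s.cx.size(); i += 4) {
        // Each lane handles one of 4 spheres; with SIMD types, every line
        // of this inner loop body operates on all 4 lanes at once.
        for (int lane = 0; lane < 4; ++lane) {
            const std::size_t k = i + lane;
            const float ocx = r.ox - s.cx[k];
            const float ocy = r.oy - s.cy[k];
            const float ocz = r.oz - s.cz[k];
            const float b = ocx * r.dx + ocy * r.dy + ocz * r.dz;
            const float c = ocx * ocx + ocy * ocy + ocz * ocz - s.radius[k] * s.radius[k];
            const float discr = b * b - c;
            if (discr <= 0.0f)
                continue; // the ray misses this sphere
            const float sq = std::sqrt(discr);
            float t = -b - sq;          // near intersection first
            if (t <= tMin) t = -b + sq; // otherwise try the far one
            if (t > tMin && t < bestT[lane]) {
                bestT[lane] = t;
                bestId[lane] = static_cast<int>(k);
            }
        }
    }
    // Horizontal step: pick the closest hit among the 4 lanes.
    int id = -1;
    float t = tMax;
    for (int lane = 0; lane < 4; ++lane)
        if (bestId[lane] >= 0 && bestT[lane] < t) { t = bestT[lane]; id = bestId[lane]; }
    if (id >= 0) *outT = t;
    return id;
}
```

Once the closest sphere is known, the hit position is origin + t * direction, and the normal is (position - center) / radius; that part only needs to be done once per ray, not per sphere.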

HitSpheres function code gets to be extremely similar between C++ version and C# version. In fact the C# one is cleaner since float4, int4 and bool4 types in Mathematics package are way more complete SIMD wrappers than my toy manual implementations in the C++ version.

The full change commit is here.

Performance: PC from 84.3 to 133 Mray/s, and Mac from 35.5 to 60.0 Mray/s. Not bad!

Updated numbers for new Mac hardware

Implementation PC Mac
GPU 1854 246
C++, SSE+SoA HitSpheres 187 74
C#, Unity Burst, 4-wide HitSpheres 133 60
C++, SoA HitSpheres 100 36
C#, Unity Burst 82 36
C#, .NET Core 53.0 23.6
C#, mono -O=float32 --llvm w/ MONO_INLINELIMIT=100 22.0
C#, mono -O=float32 --llvm 18.9
C#, mono -O=float32 11.0
C#, mono 6.1
  • PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t - SMT disabled) with GeForce GTX 1080 Ti.
  • Mac is mid-2018 MacBookPro (Core i9-8950HK 2.9GHz, 6c/12t) with AMD Radeon Pro 560X.
  • Unity version 2018.2.13 with Burst 0.2.4-preview.34 and Mathematics 0.0.12-preview.19.
  • Mono version 5.12.
  • .NET Core version 2.1.302.

All code is on github at 16-burst-simd tag.


Random list of Demoscene Demos

I just did a “hey kids, let me tell you about demoscene” event at work, where I talked about and showed some demos I think were influential over the years, roughly sorted chronologically.

Here’s that list, in case you also want to see some demoscene things. There’s a whole bunch of excellent demo productions I did not show (due to time constraints); and I mostly focused on Windows/PC demos. A decent way of finding others is searching through “all time top” list at pouët.net.

I’m giving links to youtube, because let’s be realistic, no one’s gonna actually download and run the executables. Or if you would, then you most likely have already seen them anyway :)

Future Crew “Second Reality”, 1993, demo

Tim Clarke “Mars”, 1993, 6 kilobytes

Exceed “Heaven Seven”, 2000, 64 kilobytes

farbrausch “fr-08: .the .product”, 2000, 64 kilobytes

Alex Evans “Tom Thumb”, 2002, wild demo

TBC & Mainloop “Micropolis”, 2004, 4 kilobytes

mfx “Aether”, 2005, demo

Kewlers & mfx “1995”, 2006, demo

mfx “Deities”, 2006, demo

farbrausch “fr-041: debris”, 2007, 144 kilobytes

Fairlight & CNCD “Agenda Circling Forth”, 2010, demo

Fairlight & CNCD “Ziphead”, 2015, demo

Eos “Oscar’s Chair”, 2018, 4 kilobytes

Conspiracy “When Silence Dims The Stars Above”, 2018, 64 kilobytes


Pathtracer 15: Pause & Links

Sailing out to sea | A tale of woe is me
I forgot the name | Of where we’re heading
– Versus Them “Don’t Eat the Captain”

So! This whole series on pathtracing adventures started out without a clear goal or purpose. “I’ll just play around and see what happens” was pretty much it. Looks like I ran out of steam and will pause doing further work on it. Maybe sometime later I’ll pick it up again, who knows!

One nice thing about 2018 is that there’s a lot of interest in ray/path tracing again, and other people have been writing about various aspects of it. So here’s a collection of links I saved on the topic over past few months:

Thanks for the adventure so far, everyone!

Put the fork away | It’s not a sailor’s way
We are gentlemen | Don’t eat the captain