Daily Pathtracer Part 7: Initial SIMD

Introduction and index of this series is here.

Let’s get back to the CPU C++ implementation. I want to try SIMD and similar stuff now!

Warning: I don’t have much (any?) actual experience with SIMD programming. I know conceptually what it is, have written a tiny bit of SIMD assembly/intrinsics code in my life, but nothing that I could say I “know” or even “have a clue” about it. So whatever I do next, might be completely stupid! This is a learning exercise for me too! You’ve been warned.

SIMD, what’s that?

SIMD stands for “Single instruction, multiple data”, and the first sentence about it on Wikipedia says “a class of parallel computers in Flynn’s taxonomy”, which is, errr, not that useful as an introduction :) Basically SIMD can be viewed as CPU instructions that do some “operation” on a bunch of “items” at once. For example, “take these 4 numbers, add these other 4 numbers to them, and get a 4-number result”.

Different CPUs have had a whole bunch of different SIMD instruction sets over the years, and the most common today are:

  • SSE for x86 (Intel/AMD) CPUs.
    • SSE2 can be pretty much assumed to be “everywhere”; it’s been in Intel CPUs since 2001 and AMD CPUs since 2003.
    • There are later SSE versions (SSE3, SSE4 etc.), and then later on there’s AVX too.
  • NEON for ARM (“almost everything mobile”) CPUs.

It’s often said that “graphics” or “multimedia” is an area where SIMD is extremely beneficial, so let’s see how or if that applies to our toy path tracer.

Baby’s first SIMD: SSE for the vector class

The first thing that almost everyone immediately notices is “hey, I have this 3D vector class, let’s make that use SIMD”. This seems to also be taught as “that’s what SIMD is for” at many universities. In our case, we have a float3 struct with three numbers in it, and a bunch of operations (addition, multiplication etc.) do the same thing on all of them.

Spoiler alert: this isn’t a very good approach. I know that, but I also meet quite a few people who don’t, for some reason. See for example this old post “The Ubiquitous SSE vector class: Debunking a common myth” by Fabian Giesen.

Let’s make that use SSE instructions. A standard way to use them in C++ is via “intrinsic instructions”, that basically have a data type of __m128 (4 single precision floats, total 128 bits) and instructions like _mm_add_ps and so on. It’s “a bit” unreadable, if you ask me… But the good news is, these data types and functions work on pretty much all compilers (e.g. MSVC, clang and gcc) so that covers your cross-platform needs, as long as it’s Intel/AMD CPUs you’re targeting.
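
To make that slightly more concrete, here’s a tiny standalone example of what the intrinsics look like (just an illustration; not code from the path tracer):

#include <xmmintrin.h> // SSE intrinsics
#include <cstdio>

int main()
{
    // add two groups of 4 floats with a single instruction
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes (lowest first): 1,2,3,4
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f); // lanes: 5,6,7,8
    __m128 sum = _mm_add_ps(a, b);                 // lanes: 6,8,10,12

    float r[4];
    _mm_storeu_ps(r, sum);
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); // prints: 6 8 10 12
    return 0;
}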

I turned my float3 to use SSE, very similarly to how it’s described in How To Write A Maths Library In 2016 by Richard Mitton. Here’s the commit.
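
The gist of the approach is roughly along these lines (my own sketch; the actual commit follows Mitton’s article and has many more operations plus some conversion details):

#include <xmmintrin.h>

struct float3
{
    __m128 m; // x, y, z in the low three lanes; the 4th lane is unused

    float3() {}
    explicit float3(__m128 v) : m(v) {}
    float3(float x, float y, float z) : m(_mm_set_ps(0.0f, z, y, x)) {}

    float getX() const { return _mm_cvtss_f32(m); }
    float getY() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1))); }
    float getZ() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2))); }
};

inline float3 operator+(float3 a, float3 b) { return float3(_mm_add_ps(a.m, b.m)); }
inline float3 operator-(float3 a, float3 b) { return float3(_mm_sub_ps(a.m, b.m)); }
inline float3 operator*(float3 a, float3 b) { return float3(_mm_mul_ps(a.m, b.m)); }

Note that only three of the four lanes carry useful data, which is one of the downsides discussed a bit further below.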

  • PC (AMD ThreadRipper, 16 threads): 135 -> 134 Mray/s.
  • Mac (MacBookPro, 8 threads): 38.7 -> 44.2 Mray/s.

Huh, that’s a bit underwhelming, isn’t it? Performance on PC (MSVC compiler) pretty much the same. Performance on Mac quite a bit better, but nowhere near “yeah 4x faster!” levels :)

This does make sense though. The float3 struct is explicitly using SIMD now, however a whole lot of remaining code still stays “scalar” (i.e. using regular float variables). For example, one of the “heavy” functions, where a lot of time is spent, is HitSphere, and it has a lot of floats and branches in it:

bool HitSphere(const Ray& r, const Sphere& s, float tMin, float tMax, Hit& outHit)
{
    float3 oc = r.orig - s.center; // SIMD
    float b = dot(oc, r.dir); // scalar
    float c = dot(oc, oc) - s.radius*s.radius; // scalar
    float discr = b*b - c; // scalar
    if (discr > 0) // branch
    {
        float discrSq = sqrtf(discr); // scalar
        float t = (-b - discrSq); // scalar
        if (t < tMax && t > tMin) // branch
        {
            outHit.pos = r.pointAt(t); // SIMD
            outHit.normal = (outHit.pos - s.center) * s.invRadius; // SIMD
            outHit.t = t;
            return true;
        }
        t = (-b + discrSq); // scalar
        if (t < tMax && t > tMin) // branch
        {
            outHit.pos = r.pointAt(t); // SIMD
            outHit.normal = (outHit.pos - s.center) * s.invRadius; // SIMD
            outHit.t = t;
            return true;
        }
    }
    return false;
}

I’ve also enabled __vectorcall on MSVC and changed some functions to take float3 by value instead of by const-reference (see commit), but it did not change things noticeably in this case.

I’ve heard of “fast math” compiler setting

As a topic jump, let’s try telling the compiler “you know what, you can pretend that floating point numbers obey simple algebra rules”.

What? Yeah that’s right, floating point numbers as typically represented in computers (e.g. float, double) have a lot of interesting properties. For example, a + (b + c) isn’t necessarily the same as (a + b) + c with floats. You can read a whole lot about them in Bruce Dawson’s blog posts.
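
Here’s a quick standalone illustration of that non-associativity (my own example; compile it without any fast-math flags):

#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    // (a + b) + c: 0 + 1 = 1
    // a + (b + c): b + c rounds back to -1e8 (a float can't represent -99999999 exactly), so the result is 0
    printf("%f vs %f\n", (a + b) + c, a + (b + c)); // prints: 1.000000 vs 0.000000
    return 0;
}

Relaxed float rules let the compiler reorder and simplify such expressions (among other things), which is presumably where speedups like the one below come from.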

C++ compilers have options to say “you know what, relax with floats a bit; I’m fine with potentially not-exact optimizations to calculations”. In MSVC that’s the /fp:fast flag; on clang/gcc it’s the -ffast-math flag. Let’s switch them on:

  • PC: 134 -> 171 Mray/s.
  • Mac: 44.2 -> 42.6 Mray/s.

It didn’t help on Mac (clang compiler); in fact it made things a tiny bit slower… but whoa, look at that Windows (MSVC compiler) speedup! ⊙.☉

What’s a proper way to do SIMD?

Let’s get back to SIMD. The way I did float3 with SSE has some downsides, with major ones being:

  • It does not lead to all operations using SIMD; for example, dot products (of which there are plenty in graphics) end up doing a bunch of scalar code (see the sketch after this list). Yes, that quite likely could be improved somehow, but still, outside of regular “add/multiply individual vector components”, other operations do not easily map to SIMD.
  • SSE works on 4 floats at a time, but our float3 only uses three. This leaves one “SIMD lane” not doing any useful work. If I were to use AVX SIMD instructions – these work on 8 floats at a time – that would get even less efficient.
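
For illustration, here’s one possible way to do a 3-component dot product with SSE shuffles (my sketch, assuming the __m128-backed float3 from the sketch above; SSE4.1 also has a _mm_dp_ps instruction for this). Note how much of it is “horizontal” shuffle work rather than nice lane-parallel math:

#include <xmmintrin.h>

inline float dot(float3 a, float3 b)
{
    __m128 m   = _mm_mul_ps(a.m, b.m);                           // x*x, y*y, z*z, (unused)
    __m128 yzx = _mm_shuffle_ps(m, m, _MM_SHUFFLE(3, 0, 2, 1));  // y*y, z*z, x*x, ...
    __m128 zxy = _mm_shuffle_ps(m, m, _MM_SHUFFLE(3, 1, 0, 2));  // z*z, x*x, y*y, ...
    __m128 sum = _mm_add_ps(_mm_add_ps(m, yzx), zxy);            // lowest lane: x*x + y*y + z*z
    return _mm_cvtss_f32(sum);
}

Even done this way, the result ends up in a single lane and the shuffles are pure overhead compared to “do 4 independent dot products at once”, which is the mindset change discussed next.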

I think it’s general knowledge that “a more proper” approach to SIMD (or optimization in general) is changing the mindset, by essentially going from “one thing at a time” to “many things at a time”.

Aside: in shader programming, you might think that basic HLSL types like float3 or float4 map to “SIMD” type of processing, but that’s not the case on modern GPUs. It was true 10-15 years ago, but since then the GPUs moved to be so-called “scalar” architectures. Every float3 in shader code just turns into three floats. But the key thing is: the GPU is not executing one shader at a time; it runs a whole bunch of them (on separate pixels/vertices/…)! So each and every float is “in fact” something like a float64, with every “lane” being part of a different pixel. “Running Code at a Teraflop: How a GPU Shader Core Works” by Kayvon Fatahalian is a great introduction to this.

Mike Acton has a lot of material on “changing mindset for optimization”, e.g. this slides-as-post-it-notes gallery, or the CppCon 2014 talk. In our case, we have a lot of “one thing at a time”: one ray vs one sphere intersection check; generating one random value; and so on.

There are several ways to make the current code more SIMD-friendly:

  • Work on more rays at once. I think this is called “packet ray tracing”. 4 rays at once would map nicely to SSE, 8 rays at once to AVX, and so on.
  • Still work on one ray at a time, but at least change HitWorld/HitSphere functions to check more than one sphere at a time. This one’s easier to do right now, so let’s try that :)

Right now code to do “ray vs world” check looks like this:

HitWorld(...)
{
  for (all spheres)
  {
    if (ray hits sphere closer)
    {
      remember it;
    }
  }
  return closest;
}

conceptually, it could be changed to this, with N probably being 4 for SSE:

HitWorld(...)
{
  for (chunk-of-N-spheres from all spheres)
  {
    if (ray hits chunk-of-N-spheres closer)
    {
      remember it;
    }
  }
  return closest;
}

I’ve heard of “Structure of Arrays”, what’s that?

Before diving into doing that, let’s rearrange our data a bit. Very much like the aside on GPUs above, we probably want to split our data into “separate components”, so that instead of all spheres being an “array of structures” (AoS) style:

struct Sphere { float3 center; float radius; };
Sphere spheres[];

it would instead be a “structure of arrays” (SoA) style:

struct Spheres
{
    float centerX[];
    float centerY[];
    float centerZ[];
    float radius[];
};

this way, whenever “test ray against N spheres” code needs to fetch, say, radius of N spheres, it can just load N consecutive numbers from memory.

Let’s do just that, without doing actual SIMD for the ray-vs-spheres checking yet. Instead of a bool HitSphere(Ray, Sphere, ...) function, have an int HitSpheres(Ray, SpheresSoA, ...) one, and then the bool HitWorld() function just calls into that (see commit).
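
Conceptually the new function is shaped something like this (a rough sketch with my own guesses at names and details, not the exact code from the commit; it assumes the Ray/Hit structs from before and the accessor-style float3 from the SSE sketch above):

#include <cmath>
#include <vector>

struct SpheresSoA
{
    std::vector<float> centerX, centerY, centerZ, radius;
};

// returns the index of the closest hit sphere, or -1 if nothing was hit
int HitSpheres(const Ray& r, const SpheresSoA& s, float tMin, float tMax, Hit& outHit)
{
    int id = -1;
    float closest = tMax;
    for (size_t i = 0; i < s.centerX.size(); ++i)
    {
        // same math as HitSphere above, just reading from the SoA arrays
        float ocX = r.orig.getX() - s.centerX[i];
        float ocY = r.orig.getY() - s.centerY[i];
        float ocZ = r.orig.getZ() - s.centerZ[i];
        float b = ocX * r.dir.getX() + ocY * r.dir.getY() + ocZ * r.dir.getZ();
        float c = ocX * ocX + ocY * ocY + ocZ * ocZ - s.radius[i] * s.radius[i];
        float discr = b * b - c;
        if (discr > 0)
        {
            float discrSq = sqrtf(discr);
            float t = -b - discrSq;
            if (t <= tMin || t >= closest)
                t = -b + discrSq;          // try the second root
            if (t > tMin && t < closest)
            {
                id = (int)i;
                closest = t;
            }
        }
    }
    if (id != -1)
        outHit.t = closest; // position & normal can then be computed from t and sphere id
    return id;
}

The interesting part is that the inner loop now only touches flat float arrays, which both the compiler and (later) explicit SIMD can chew through much more easily.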

  • PC: 171 -> 184 Mray/s.
  • Mac: 42.6 -> 48.1 Mray/s.

Oh wow. This isn’t even doing any explicit SIMD; just shuffling data around, but the speed increase is quite nice!

And then I noticed that the HitSpheres function never needs to know the sphere radius (it needs only the squared radius), so we might just as well put that into SoA data during preparation step (commit). PC: 184 -> 186, Mac: 48.1 -> 49.8 Mray/s. Not much, but nice for such an easy change.

…aaaand that’s it for today :) The above changes are in this PR, or at 07-simd tag.

Learnings and what’s next

Learnings:

  • You likely won’t get big speedups from “just” changing your Vector3 class to use SIMD.
  • Just rearranging your data (e.g. AoS -> SoA layout), without any explicit SIMD usage, can actually speed things up! We’ll see later whether it also helps with explicit SIMD.
  • Play around with compiler settings! E.g. /fp:fast on MSVC here brought a massive speedup.

I didn’t get to the potentially interesting SIMD bits. Maybe next time I’ll try to make the HitSpheres function use explicit SIMD intrinsics, and we’ll reflect on that. Until next time!


Daily Pathtracer Part 6: D3D11 GPU

Introduction and index of this series is here.

In the previous post, I did a naïve Metal GPU “port” of the path tracer. Let’s make a Direct3D 11 / HLSL version now.

  • This will allow testing performance of this “totally not suitable for GPU” port on a desktop GPU.
  • HLSL is familiar to more people than Metal.
  • Maybe someday I’d put this into a Unity version, and having HLSL is useful, since Unity uses HLSL as the shading language.
  • Why not D3D12 or Vulkan? Because those things are too hard for me ;) Maybe someday, but not just yet.

Ok let’s do the HLSL port

The final change is here, below are just some notes:

  • Almost everything from Metal post actually applies.
  • Compare Metal shader with HLSL one:
    • Metal is “more C++”-like: there are references and pointers (as opposed to inout and out HLSL alternatives), structs with member functions, enums etc.
    • Overall most of the code is very similar; largest difference is that I used global variables for shader inputs in HLSL, whereas Metal requires function arguments.
  • I used StructuredBuffers to pass data from the application side, so that it’s easy to match data layout on C++ side.
    • On AMD or Intel GPUs, my understanding is that there’s no big difference between structured buffers and other types of buffers.
    • However NVIDIA seems to quite like constant buffers for some usage patterns (see their blog posts: Structured Buffer Performance, Latency in Structured Buffers, Constant Buffers). If I were optimizing for GPU performance (which I am not, yet), that’s one possible area to look into.
  • For reading GPU times, I just do the simplest possible timer query approach, without any double buffering or anything (see code; a rough sketch of the pattern is after this list). Yes, this does kill any CPU/GPU parallelism, but here I don’t care about that. Likewise, for reading back the traced ray counter I read it immediately, without any frame delays or async readbacks.
    • I did run into an issue where even when I get the results from the “whole frame” disjoint timer query, the individual timestamp queries still don’t have their data yet (this was on AMD GPU/driver). So initially I had “everything works” on NVIDIA, but “returns nonsensical GPU times” on AMD. Testing on different GPUs is still useful, yo!
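
For reference, that “simplest possible” timestamp query pattern looks roughly like this (my sketch, not the exact code; it assumes the three queries were created up front with D3D11_QUERY_TIMESTAMP_DISJOINT and D3D11_QUERY_TIMESTAMP, and that the compute shader and its resources are already bound):

#include <d3d11.h>

// Hypothetical helper: naive GPU timing by bracketing the dispatch with two
// timestamps inside a disjoint query, then busy-waiting for the results.
double MeasureGpuMilliseconds(ID3D11DeviceContext* ctx, ID3D11Query* disjoint,
                              ID3D11Query* tsStart, ID3D11Query* tsEnd,
                              UINT groupsX, UINT groupsY)
{
    ctx->Begin(disjoint);
    ctx->End(tsStart);                   // timestamp before the work
    ctx->Dispatch(groupsX, groupsY, 1);  // the path tracing compute dispatch
    ctx->End(tsEnd);                     // timestamp after the work
    ctx->End(disjoint);

    // Spin until the data is ready; this stalls the CPU and kills any CPU/GPU
    // overlap, which is fine for this experiment.
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
    while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
    UINT64 t0 = 0, t1 = 0;
    while (ctx->GetData(tsStart, &t0, sizeof(t0), 0) != S_OK) {}
    while (ctx->GetData(tsEnd, &t1, sizeof(t1), 0) != S_OK) {}

    if (dj.Disjoint)
        return -1.0; // timings unreliable this frame (e.g. GPU clock changed)
    return double(t1 - t0) / double(dj.Frequency) * 1000.0;
}

A double-buffered version would read the previous frame’s results instead of spinning, but as noted above that’s not a concern here.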

What’s the performance?

Again… this is definitely not an efficient implementation for the GPU. But here are the numbers!

  • GeForce GTX 1080 Ti: 2780 Mray/s,
  • Radeon Pro WX 9100: 3700 Mray/s,
  • An old Radeon HD 7700: 417 Mray/s,
  • C++ CPU implementation, on this AMD Threadripper with SMT off: 135 Mray/s.

For reference, Mac Metal numbers:

  • Radeon Pro 580: 1650 Mray/s,
  • Intel Iris Pro: 191 Mray/s,
  • GeForce GT 750M: 146 Mray/s.

What can we learn from that?

  • Similar to Mac C++ vs GPU Metal speedups, here the speedup is also between 4 and 27 times faster.
    • And again, not a fair comparison to a “real” path tracer; this one doesn’t have any BVH to traverse etc.
  • The Radeon here handily beats the GeForce. On paper it has slightly more TFLOPS, and I suspect some other differences might be at play (structured buffers? GCN architecture being better at “bad for GPU, port from C++” type of code? I haven’t investigated yet).

So there! The code is at 06-gpud3d11 tag on github repo.

What’s next

I don’t know. Have several possible things, will do one of them. Also, geez, doing these posts every day is hard. Maybe I’ll take a couple days off :)


Daily Pathtracer Part 5: Metal GPU!

Introduction and index of this series is here.

Let’s make a super-naïve implementation for a GPU! Did I mention that it’s going to be super simple and not optimized for GPUs at all? I did, good. This will be the “minimal amount of work” type of port; maybe someday I’ll restructure it to be more efficient.

Why Metal?

I already have 1) a C++ implementation handy, 2) a Mac nearby, and 3) Metal is easy to use, and especially easy to port a C++ implementation to.

The Metal Shading Language (see spec) is basically a C++11/C++14 variant, with some additions (keywords to indicate address spaces and shader entry points; and attributes to indicate bindings & other metadata), and some removals (no virtuals, no exceptions, no recursion, …).

I wrote about this before, but IMHO Metal occupies a sweet spot between “low-level, access to, ahem, metal” (Vulkan, DX12) and “primarily single threaded, magic drivers” (OpenGL, DX11) APIs. It gives more control in some parts, keeps other parts mostly unchanged, while still being conceptually simple, and simple to use. Though with Metal 2 “argument buffers” it’s not so simple anymore; you can just ignore them if you don’t use them.

Let’s port C++ path tracer to Metal!

The majority of the code translates to the Metal shader pretty much as-is, and is extremely similar to the walkthrough in Part 1. See Shaders.metal. And then there’s a bunch of plumbing on the app side, to create buffers, feed them with data, estimate GPU running times, read back the number of rays created on the GPU, etc. etc. – nothing fancy, just plumbing – see Renderer.mm changes.

The biggest change I had to do was dealing with lack of recursion in Metal (this is true for most/all GPU shading languages today). C++ code is written in a traditional recursive manner:

Color Trace(Ray r,...)
{
	if (HitWorld(r, ...))
	{
		(Ray scattered, Color attenuation) = Scatter(r, ...);
		return emission + attenuation * Trace(scattered, ...);
	}
	else
	{
		return skyColor;
	}
}

we can reformulate the above into a loop-based approach instead:

Color Trace(Ray r,...)
{
	Color result = (0,0,0)
	Color curAttenuation = (1,1,1)
	for (iter = 0 to maxBounceDepth)
	{
		if (HitWorld(r, ...))
		{
			(Ray scattered, Color attenuation) = Scatter(r, ...);
			result += curAttenuation * emission;
			// modulate attenuation and continue with scattered ray
			curAttenuation *= attenuation;
			r = scattered;
		}
		else
		{
			result += curAttenuation * skyColor;
			break; // stop looping
		}
	}
	return result;
}

While this approach might be useful for CPU path tracing optimizations (I’ll find out later!), it also neatly solves lack of recursion on the GPU side. So that’s exactly what I put into Metal shader code.

The actual path tracer is a compute shader, but in the current state could have been a pixel shader just as well – it does not use any of “compute specific” functionality yet.

So that’s about it; the final code changes looked like this. As you can see, it’s mostly either plumbing or copy-paste from existing C++ code with small modifications.

I just copied & pasted most of the code, without any attempt at “sharing” some of it between the C++ and Metal shader versions. If this were a “production” path tracer, and/or I had some idea what I want to achieve in the end, then sharing code might be useful. Right now, it’s easier & faster just to have the code separately.

Does it work?

Yeah, I guess it does work! As I mentioned before, this is definitely not an efficient implementation for the GPU. On the other hand… quite likely not an efficient implementation for the CPU either… But whereas CPU one is “not optimal/optimized”, the GPU one is more on the “well that’s stupidly slow” front. But let’s check performance anyway :)

  • MacBook Pro (2013, Core i7 2.3 GHz, GeForce GT 750M + Intel Iris Pro):
    • GeForce GT 750M: 146 Mray/s
    • Intel Iris Pro: 191 Mray/s
    • CPU: 38 Mray/s
  • iMac 5K (2017, Core i7 4.2 GHz, Radeon Pro 580):
    • Radeon Pro 580: 1650 Mray/s
    • CPU: 59 Mray/s

What can we learn from that?

  • Even this stupidly slow direct port, that should run like molasses on the GPU, is between 4 and 27 times faster than a simple C++ implementation!
  • The integrated Intel GPU in my laptop is in some cases faster than the discrete Nvidia one. I had noticed this before in other workloads; at least on those Mac models the discrete GPU is only faster if you use significant memory bandwidth, so that it gets an advantage from its VRAM. In pure computation, the Intel Iris Pro is surprisingly effective.
  • This is a toy path tracer, and neither C++ nor the GPU implementations are optimized, but overall GPUs being about 10x faster than CPUs at it seems to be expected. E.g. our progressive lightmapper team is seeing roughly similar speedups.

Notes / gotchas

Random notes on things I ran into while doing this:

  • If you want to use features above Metal 1.0 language version, and use built-in Xcode *.metal file handling rules, it’s not exactly intuitive how to tell it that “yo, I need Metal version X”. Turns out it’s under Xcode project “Build Phases” settings.
  • If you set Metal language version to something like “Mac Metal 1.2” (-std=osx-metal1.2) – I forgot what I even wanted that for, perhaps to get read-write textures – you’ll need to sprinkle thread or device etc. address space qualifiers to most/all references and pointers.
  • That read_write access attribute from Metal 1.2 that I wanted to use… I could not get it to actually work. So I went with a double-buffered approach of having two textures; one for previous results, and another for new results.
  • If you do wrong things, it’s quite easy to either make the macOS window manager go crazy, or have the machine reboot. In my case, I accidentally made an infinite loop with the initial rejection sampling based “random point inside disk” function. On an Intel GPU this resulted in screen areas outside my app showing garbage state from previously run apps; and on the Nvidia GPU it just rebooted. This is one of the under-appreciated areas where Microsoft (yes, in Windows Vista!) made the situation much better… ten years ago! It’s much harder to make a machine reboot by doing bad things in a shader on Windows.
  • Watch out for NaNs. I had everything working on my Intel/Nvidia GPU machine, but on the AMD GPU it was all black initially. Turns out, I had uninitialized data in a floating point texture that happened to be NaNs on AMD; and another place where the Schlick approximation function could generate a very small negative argument for pow(). Testing on different GPUs is useful, yo.
  • There’s no good way to do GPU timing on Metal (as far as I can see), so I do an approximate thing of timing CPU side between command buffer submission and until the GPU is done with it, via a completion handler.

What’s next

Maybe let’s try a DX11 GPU implementation, just to see how this super slow GPU approach works out on a desktop GPU?


Daily Pathtracer Part 4: Fixes & Mitsuba

Introduction and index of this series is here.

The path tracer right now is small, neat and wrong. Some folks pointed out on the twitterverse that there’s double lighting due to light sampling; there’s an issue on github about diffuse scattering, and I have noticed some wrong things too. But first of all, how does one even know that the rendering is wrong? After all, it doesn’t look terribad to me?

In cases like this, it’s good to have a “reference rendering”, also often called “ground truth”. For that, let’s turn to Mitsuba Renderer.

Rendering our scene in Mitsuba

Why Mitsuba? I’ve seen it mentioned in a bunch of graphics papers, at MJP’s blog, and I know that people working on Unity’s PBR system use it too, so much so that they even built a Mitsuba Exporter/Plugin. So I’ll assume that Mitsuba can render “110% correct” images :)

Getting our scene into Mitsuba is pretty easy; the documentation is clear and the file format is simple.

I have simplified some things in our scene for easier comparison: turned off depth of field, made sky have a constant color, and all the metal materials be perfectly smooth. Here’s a Mitsuba file that matches our scene, and here’s the resulting rendering, with 1024 samples per pixel (this took 5.3 minutes on a Mac by the way):

Here’s my rendering, for comparison:

Uff, that is indeed quite off! Let’s fix that.

Fixing frame accumulation

I first turned off explicit light sampling, and that left me with the most obvious wrong thing I had already briefly noticed before. Specifically, the rendering works by accumulating multiple frames over time, to “converge” to the final result. However, depending on how many samples per pixel I was doing per frame, it was producing very different results. Here’s the rendering with 4 and 16 samples per pixel, respectively (light sampling off):

Turns out, the problem was in the (cheap) gamma correction (linear -> sRGB color conversion) I had in there. This, well, was wrong, and a leftover from the very first code I had written for this. By now my accumulation buffer is full floating point, so I should just accumulate linear colors there, and only convert to sRGB for final display. With that fixed, different sample counts per frame converge to the same result, which is better. A more proper linear->sRGB conversion (from here) fixed the overall brightness, especially on the background/sky.

Fixing diffuse scattering

This is still quite different from Mitsuba though. As pointed out on github, the way Scatter function picked new ray for diffuse materials was wrong; it should have picked a new direction on the unit sphere, not inside of it. With that fixed, it gets much closer to reference result:

I guess this means that Ray Tracing in One Weekend book has the same error as well (that is fixed by Ray Tracing: The Rest of Your Life, where whole scattering is reworked for importance sampling).

Fixing light sampling

I still have a double-lighting problem with explicit light sampling. The problem is basically that once you explicitly add the direct lighting contribution from lights (emissive surfaces), then if the scattered/bounced ray from the same point also directly hits the light, you should ignore the emission from it. This makes sense; that direct ray hit was already accounted for during explicit light sampling!

With that fixed and light sampling back on, things are looking quite good:

There are still differences from Mitsuba rendering on the metal objects (well, “my” metal BRDF there is not a “proper” one like Mitsuba’s), and a small difference on the glass object. I’ll park these for now, and will improve metal surfaces at some later point perhaps.

Even with just 4 rays per pixel, and no progressive image accumulation, look at how (relatively) little noise there is!

And if I turn the previous things back on (DOF, rough metals, gradient sky), this is what’s rendered now:

What’s next

Now that the path tracer is more correct, let’s get back to exploring different topics :) Next week I’ll write about a super-naïve implementation for a GPU. Stay tuned!


Daily Pathtracer Part 3: C# & Unity & Burst

Introduction and index of this series is here.

As promised in the last post, let’s port our path tracer to C#, both outside & inside of Unity. This will also contain a brief example of Unity 2018.1 “Job System”, as well as Burst compiler.

There will be nothing specific to path tracing in this post!

Basic C# port

Let’s do a standalone (outside of Unity) C# port first.

It’s 2018, so I’ll try to pick some “modern .NET” that would work on both Windows & Mac. I think that is called “.NET Core 2.0” (seriously, the .NET ecosystem is a bit confusing: .NET Framework, .NET Core, .NET Standard, Portable Class Libraries, Mono, Xamarin etc. – good luck figuring out what is what).

Since I have no idea how to do UI in C# that would work on both Windows & Mac, I’m just not going to do it at all :) The path tracer will be a command line app that renders a bunch of frames, and saves out the final image as a .TGA file (why TGA? because it’s super simple to write).

Everything from our C++ code pretty much ports over directly:

Basic maths plumbing: float3, randomness utils, Ray, Hit, Sphere, Camera.

All of these are struct, not class. Often I see students starting out with class, since that’s the only thing they are taught at university. In .NET, “reference types” (like class) mean that their instances are allocated on the heap, participate in garbage collection, and so on. Whereas “value types” (primitive types like int or float, as well as most struct types) are not allocated on the heap; they are passed “by value”. Read more in official Microsoft docs. Math-heavy code tends to create a lot of small types (like float3 in our case, which is used to represent points, vectors and colors), and if they were allocated on the heap that would be “generally a disaster” for performance.

The path tracing part itself is also very much a direct port: the Material struct, and the functions HitWorld, Scatter, Trace, TraceRowJob.

For multi-threading I’m using Parallel.For loop from .NET 4.0 task parallel library. Here’s the code.

It runs, and produces image as expected, yay!

Ok enough of that, how fast does it run?

PC: 67.1 Mray/s, Mac: 17.5 Mray/s.

For reference, the C++ version runs at 136 Mray/s on PC, and 37.8 Mray/s on Mac. The numbers are slightly higher than in the last post; I made the random seed use an explicitly passed variable instead of thread local storage.

Note that this is running on “.NET Standard” implementation (JIT, runtime, class libraries). Let’s also try running on Xamarin/Mono on Mac. That gets 5.3 Mray/s, ouch :(

Recall the “use structs, not classes” advice? If I change all these simple types to use class, I get just 2.3 Mray/s on PC, and 5.8 Mray/s on Mac. I suspect the drop is much larger on PC due to more threads running at once, possibly creating some serial bottleneck in either allocation or GC.

So the summary of C# performance so far:

  • Basic .NET Core performance is roughly 2x slower than C++ performance, on both Windows & Mac.
  • For simple types like “vector” or “color”, you really want to use struct to avoid allocation & garbage collection overhead. Otherwise your code will run 30 (!) times slower on a 16-thread PC, and 3 times slower on an 8-thread Mac. I don’t know why the slowdown differs that much between the two.
  • Mono .NET implementation is about 3x slower than .NET Core implementation, at least on Mac, which makes it ~6x slower than C++. I suspect the Mono JIT is much “weaker” than RyuJIT and does way less inlining etc.; you might want to “manually inline” your heavily used functions.

Notes on Mono: 1) it currently does not have the .NET Core System.MathF class, so some things have to be done at double precision via System.Math. 2) Mono still defaults to using double precision math for everything; since Mono 4.0 there’s an -O=float32 option to get single precision, and you might want to try that for FP32-heavy workloads. They are also planning to switch to actual FP32 for floats. 3) Mono also has an LLVM backend for the JIT, which might give better performance than the default one.

I have updated Mono performance numbers with various options in a later blog post too.

Let’s do Unity now

Basic port of C# code to Unity is trivial, and then I’m putting resulting data into a texture that I display over the whole screen (code).

One possible gotcha: turn off Editor Attaching in preferences, if you’re profiling heavy C# code in the editor. This option makes it possible to attach a C# debugger at any time, but it causes Mono JIT to insert a whole lot of “should I handle debugger right now?” checks in compiled code all over the place.

You’ll also want to set scripting runtime version to .NET 4.x instead of (currently still default) .NET 3.5, to make the Parallel.For call work.

Performance is comparable to “C# with Mono” option above:

  • Editor, “attaching” option off: PC 11.3, Mac 4.6 Mray/s.
    • With “attaching” option on: PC 5.3, Mac 1.6 Mray/s. So yeah that option does make “heavy” C# code run 2-3x slower in the editor, watch out!
  • Standalone non-development build: PC 13.3, Mac 5.2 Mray/s.
    • With IL2CPP instead of Mono: PC 28.1, Mac 17.1 Mray/s.

This is roughly what is expected: performance similar to Mono (which is behind .NET Core), editor has a bit of overhead, IL2CPP 2-3x faster than Mono, which brings it into the same ballpark as .NET Core. All that is quite a bit behind C++ :(

Let’s do some more fancy stuff in Unity!

We’ll want to use Burst compiler, but first we need to do some preparation steps.

Note: right now Burst requires using a very specific build of Unity, which is 2018.1 beta 12, version ed1bf90b40e6 (download here). Also, Burst is beta, experimental, work in progress, may or might not work, today only works in the editor, etc. etc. You’ve been warned!

We’ll use NativeArray for storing pixel data (commit), including a trick to use Texture2D.LoadRawTextureData with it (I’m adding more proper NativeArray support for texture pixel data as we speak…).

And let’s replace Parallel.For with Unity’s Job System (commit). This has the added benefit that our computation shows up in Unity’s timeline profiler. Look at these threads being all busy:

And now, let’s add Burst! Since Burst is beta right now, it does not show up in Package Manager UI yet. You have to manually edit Packages/manifest.json to contain this:

{
    "dependencies": {
        "com.unity.burst": "0.2.3"
    },
    "registry": "https://staging-packages.unity.com"
}

This should make “Jobs” menu appear. Now, we can add [ComputeJobOptimization] attribute to our struct TraceRowJob : IJobParallelFor, and iff the job C# code satisfies all restrictions imposed by Burst, it should get Magically Faster(tm). Burst restrictions today basically are:

  • No reference types, only primitive types and structs.
    • Note that NativeArray is a struct, so that one is ok.
    • C# “pointers” are ok too, I think (yes C# does have pointers!).
  • I think that’s about it, but do note that this makes a whole lot of “typical C#” non-Burstable. You can’t have virtual methods, delegates, references, garbage collection, etc. etc.
  • Most accesses to static class fields are off-limits too; you should put that data into your Job struct instead.

Jobs -> Burst Inspector menu can be used to either see what errors prevented each job from Burst-ing, or inspect generated assembly for the Burst-ed ones.

In our pathtracer, making code Burst-able meant this (see commit):

  • Replace arrays with NativeArray. Part of that was done previously; I also put sphere & material data into NativeArrays.
  • No static data fields means that the random number generator seed needs to be passed in, instead of stored in a thread-local variable. We noticed earlier that this is a generally good idea anyway.

What performance do we have now, with Burst? PC: 11.3 -> 140 Mray/s (12x faster), Mac: 4.6 -> 42.6 Mray/s (9x faster).

This is pretty good, if you ask me. Recall that C++ implementation is 136 Mray/s on PC, and 37.8 Mray/s on Mac, so we’re actually faster than C++ already. Why and how? I suggest watching Andreas’ talk from GDC 2018.

But wait, we can do a bit more. We have a new (also experimental, WIP, etc) C# library called Unity.Mathematics, that is quite similar to HLSL types and functions. And Burst treats a whole lot of those as “intrinsics” that often map directly to some LLVM instruction. Let’s try that.

First off, add "com.unity.mathematics": "0.0.7" under dependencies in Packages/manifest.json. And then we can get rid of our own float3 struct and some helpers, and use very similar ones from Unity.Mathematics (commit). This gets us to 164 Mray/s on PC, and 48.1 Mray/s on Mac.

And these jobs take about 15x shorter than without Burst now:

Status, findings and what’s next

So we did not learn anything about path tracing this time, just spent some time in C# or Unity land. I hope that was useful for someone at least! The findings about C# are:

  • .NET Core is about 2x slower than vanilla C++.
  • Mono (with default settings) is about 3x slower than .NET Core.
  • IL2CPP is 2x-3x faster than Mono, which is roughly .NET Core performance level.
  • Unity’s Burst compiler can get our C# code faster than vanilla C++. Note that right now Burst is very early tech, I expect it will get even better performance later on.

And now, let’s get back to path tracing! Specifically, our rendering right now is wrong, due to the way I did the light sampling noise reduction optimization (thanks to a bunch of folks on twitter for pointing that out!). Turns out, with path tracing it’s often hard to know when something is “broken”, since many things look quite plausible! I’ll look at one of possible ways of how to approach that in the next post.