Daily Pathtracer 12: GPU Buffer-Oriented D3D11

Introduction and index of this series is here.

In the previous post, I changed the CPU path tracer from recursion (depth first) based approach to “buffer based” (breadth first) one. It got slightly slower on PC, and stayed around the same performance on a Mac.

I was curious how a similar approach would work on the GPU. Would it be slower or faster than a “super naïve GPU path tracer” I had before? No idea! Let’s find that out. Maybe we’ll learn something along the way.

Time for another confession: while I “conceptually” know how a GPU works, and have read & heard a lot of material on the topic, I don’t have much “actual” experience in optimizing compute shaders. Last time I was doing “serious” shader optimization was regular vertex/pixel shader workloads, and that was some years ago too. So I surely lack intuition in optimization approaches & experience with available tools! Everything below might be a complete blunder, and/or I might be making wrong conclusions. You’ve been warned!

Current depth-first GPU implementation

Recall that in my current GPU attempt (see Metal and D3D11 posts), each compute shader invocation maps to one pixel on screen. It traces several “full” ray paths; with rays being scattered off surface hits, extra rays being sent towards light sources, and so on.

Intuitively, while ray execution patterns past the primary eye rays “must be bad” for the GPU (they would be going all over the place, hitting different materials etc.)… It also has a great thing: there’s very little memory traffic. It only needs to read ~50 sphere and material structs, and only needs to write a single color per pixel.

This initial direct GPU version runs at 778 Mray/s on a PC with GeForce GTX 1080 Ti.

Initial buffer-based GPU implementation

Let’s for a moment pretend that GPU compute shader programming model does not have any unique properties or gotchas, and do the “most simple” buffer oriented implementation. It is structured very much like the buffer-oriented CPU implementation:

  1. One compute shader evaluates primary camera rays, and writes out their contribution into the image.

    • Primary ray hits can only contribute emissive color in case they hit a light directly, or a sky color in case they don’t hit anything.
    • However, whenever they hit a surface they can produce more rays for the next ray bounce: scattered ray, or a light sampling (“shadow”) ray. These new rays are appended into a StructuredBuffer with all the ray data (ray, attenuation so far, pixel location, etc.). Like this:
  2. Next up, I do a number of “bounce” iterations. This does an “indirect” compute shader dispatch (one thread for each bounce/shadow ray produced in the earlier pass). The compute shader traces these new rays (coming from a StructuredBuffer produced earlier), evaluates their own contribution, adds it to the image at ray locations, and each ray surface hit can produce more rays for the next bounce. These new rays are written into another StructuredBuffer. Then, repeat this same step again up to N bounce iterations, swapping input & output ray buffers.

This initial commit is here.

Performance: 103 Mray/s (recall that our baseline is 778 Mray/s for the simple depth-first tracer).

༼ ༎ຶ ෴ ༎ຶ༽

That’s not good at all! Also, it had a subtle lighting difference compared to the CPU implementation, mostly visible on the glass sphere. Here are images: CPU, GPU and increased contrast difference. The difference image revealed some block-like patterns too. Something is not good!

By the way, the “output rays in one CS invocation, then run another CS invocation for that amount of rays” bit is surprisingly non-intuitive, in terms of “ok how to actually do this”. Running a CS on D3D11 requires the user to pass number of thread groups, not number of threads! This basically means that I need to sneak in another tiny compute shader, that only runs on a single element, and all it does is divide a number that’s in one buffer, and write the result into another buffer. Why must simple things be cumbersome?!

Why so slow? Let’s try to find out

I think the recommended way of figuring out why a compute shader is slow, as of first half of 2018, is roughly this:

  • Have an implementation for Playstation 4, and use profiling tools there! or,
  • Have an implementation for D3D12 or Vulkan, run on an AMD GPU, and use Radeon GPU Profiler there!

That’s just great (not!)… I have a D3D11 implementation, and my GPU is NVIDIA. Let’s see what we have there.

Visual Studio GPU Usage tool

First off, let’s check whether Visual Studio has anything useful. There’s a GPU Usage tool in there. It can tell me that in my “fast” GPU implementation all the time is taken by a compute shader (well duh), and that in my “slow” implementation all the time is taken by these many compute shader dispatches. Ok so that wasn’t very useful in this case.

NVIDIA Nsight Graphics

I have used Nsight in the past, but I frankly forgot what for (might be debugging, might be profiling). Anyhoo, I forgot everything about it, and turns out their current incarnation, Nsight Graphics 1.0, is all different anyway.

Analyzing a frame in Nsight, it tells me this:

My guess for what all that means is basically this:

According to NVIDIA blogs, “SOL” in there means “speed of light”, so I think it’s telling me that my compute shader is running at about 7% of what it could run at. That’s obviously super bad! But what to do about it; why is my shader slow? I feel about 90% SOL.

Trying random things to speed it up

Without any of the above tools clearly telling me “hey, this thing in your shader is stupid, go fix it”, I resorted to applying random bits of knowledge I might have accumulated in the past. Which is basically all tweets from Sebastian Aaltonen and random docs from conferences, e.g. DirectCompute Optimizations and Best Practices from GTC 2010, and countless others that are similar.

First up, “avoid atomic operations” sounds like a sensible thing to do. My CS, for each thread, was counting the number of rays traced (which is only used to display Mray/s figure!), by incrementing a global counter with an InterlockedAdd function. Let’s track the amount of rays inside the whole thread group via a groupshared variable, and only do the global atomic at once per group (commit). 104 -> 125 Mray/s, not bad for such a simple change.

My “process ray bounce” compute shader was operating on 64 rays at once, let’s try tweaking that number. 256 rays in one go turned out to be fastest. Trivial change, 125 -> 147 Mray/s.

Let’s put new rays into group shared memory!

What I had so far mostly does not even need to be a compute shader, since I’m not using about the only feature that makes them worth having in the 1st place – which is “group shared” (aka thread-group local, aka LDS) memory.

Right now whenever any thread in my compute shader needs to emit a new ray for next bounce pass, it does an atomic increment of a global ray counter, and writes the new ray into a StructuredBuffer. Let’s instead do this:

  1. Have a ray buffer for the whole thread group in groupshared memory.
  2. New rays are appended into that buffer (this still uses atomics, but they are on a thread group local variable),
  3. Once whole thread group is done, write it into the structured buffer with one global atomic operation and a bunch of memory copies.

I did the above, basically going like this, and this was the result…

…it’s running quite fast at 937 Mray/s though, shipit :)

Let’s fix rendering

Recall how my “initial attempt” was also subtly different from the CPU rendering, sometimes in block-like artifacts?

Turns out, I was doing a “wrong thing”, in this bit of compute shader that processes a bounce iteration:

The compute shader traces these new rays, evaluates their own contribution, adds it to the image at ray locations

The actual code is dstImage[pixelCoord] += ... bits around here. In this compute shader, each execution thread no longer maps to a completely separate pixel on screen! They just grab a bunch of rays to process, each with their own pixel location. It can (and often does) end up, that several threads at once process rays that hit the same pixel (think shadow & regular bounce ray for the same pixel; and also I run at 4 rays per pixel to get anti-aliasing…).

The dstImage[pixelCoord] += bit is not atomic at all, and presumably by optimizing the compute shader to be faster, the execution pattern of it started to be very different from before, and what was “subtle errors” turned into “whoa random garbage” now. Or that was my theory, which I haven’t 100% double checked :)

It seems that there’s no easy way to do atomic additions to floats on the GPU. You could implement that manually by doing a loop with an atomic compare/exchange, and maybe there are some GPU-specific shader extensions that for example would allow doing that for half-precision floats or somesuch. All that is “uhh sounds hard” in my book, so I decided to solve this problem by (mis)using the GPU rasterizer.

GPU rasterizer has a blending unit that can blend a lot of things, even if they hit the same locations on screen, and the results come out correctly! So in the bounce-processing compute shader, I don’t write anything into the output image; the shader only produces rays for the next bounce, and “splats” (pixel location + color) for the rasterizer to render later. The splats are also added into a buffer, which is then later on rendered as points.

Here’s a diagram that probably makes it even more confusing :)

That fixed rendering to be correct though!

Ok what’s the performance now?

Doing the above (put rays & splats into groupshared memory, write to global buffers at end of group; blend splats using the rasterizer – see commit) got performance up from 147 to 611 Mray/s. I guess Yoda was not joking in that “LDS we must use” quote.

A couple more commits later I changed how I append items from group-local buffers into the global ones. I had this before:

groupshared s_Data[kSize];
groupshared uint s_DataCount;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread write out to global buffer:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
{
	uint dataStart;
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, dataStart);
	for (uint i = 0; i < s_DataCount; ++i)
	{
		g_Data[dataStart + i] = s_Data[i];
	}	
}

this works, but only one thread in the whole group ends up doing “copy into global buffer” work. Doing this instead was quite a bit faster:

groupshared s_Data[kSize];
groupshared uint s_DataCount;
groupshared uint s_DataStart;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread reserve space in global buffer and
// find where it starts:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
{
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, s_DataStart);
}

// threads in the whole group copy their portion
uint myCount = (s_DataCount + kCSGroupSize - 1) / kCSGroupSize;
uint myStart = threadID * myCount;
for (uint i = myStart; i < myStart + myCount; ++i)
	if (i < s_DataCount)
		g_Data[s_DataStart + i] = s_Data[i];

Doing the above change for how rays are copied, and how splats are copied, increased performance from 619 to 644 Mray/s.

What else could be done?

So… 644 Mray/s is still behind the “super simple direct port” that I had running at 778 Mray/s…

Some completely random guesses on what else could be done to speed up the current “put rays/splats for whole bounce into a buffer” approach:

  • The compute shaders use a lot of space in groupshared memory right now: they have to have enough space to store the maximum amount of rays & splats that might get produced by the whole group! Large amount of groupshared space means the GPU can only run a very limited amount of groups at once, which is quite bad. Read more at “Optimizing GPU occupancy and resource usage with large thread groups”.
    • I could compress my ray & splat data more, to take up less space. My ray data right now is 28 bytes (float3 position, half3 direction, half3 attenuation, uint for pixel location, light index and other flags); and splat data is 16 bytes (float3 color, uint pixel location). Ray direction could use less space (e.g. 16 bit integers for X&Y components, one bit for sign of Z); attenuations & colors could be packed into smaller space than FP16 (R11G11B10 float, or RGB9E5, or RGBM, etc.). Ray position might be ok with less data than full FP32 float too.
    • Maybe there’s no need to store “maximum possible space” for the whole thread group, and instead have a buffer of fixed size, and write it out whenever it’s filled up.
  • The “some threads possibly append into a local buffer” pattern seems to generally be called “stream compaction”, and is a candidate for using “wave-level operations”. Sadly there’s no easy or cross-platform way of doing these in D3D11.
    • D3D12 shader model 6.0 has wave intrinsics, but that requires using D3D12, and also using the new DXC shader compiler.
    • AMD has extensions to get to them in D3D11, see this or that post.
    • NVIDIA also has extensions for D3D11, see this or that post.
    • …I don’t want to be writing separate compute shaders for different GPUs just yet though.
  • Turns out that Nsight does have a lot of possibly useful counters, besides these “SOL” numbers (thanks Nathan Hoobler for the tip). Have to select them under “User Metrics” section, and of course good luck figuring out which ones of them are actually interesting :)

    The “GPU Trace” feature mentioned on Nsight website looks potentially useful too, but is not available yet at the time of writing.
  • It’s also entirely possible that this whole approach is nonsense and can never be fast anyway!

Current status and what’s next

So, I tried a buffer-oriented approach on the GPU (current code at 12-gpu-buffer-d3d11 tag), and learned a few things:

  • Compute shader optimization feels like extremely beginner-unfriendly area. I’m somewhat versed in that whole space and could even pass a Turing test in a graphics related conversation, yet still a lot of the information sounds either complicated, or is hard to find in a nicely summarized form.
    • Tools that present you with a sea of numbers don’t help the impression either.
    • Looking at responses I got on twitter, seems that I’m not alone in this, so phew, it’s not just me.
    • Apparently, using a PS4 or AMD on D3D12/Vulkan for compute shader optimization is the way to go :)
  • Global atomics are slow.
  • Using large amounts of group shared memory is slow (but can be faster than not using it at all).
  • There’s a reason why UnorderedAccessView in D3D terms has “unordered” in the name. Writes into them can and will come out in unpredictable order! I had to resort to rasterizer’s blend unit to write out my “ray splats”. Doing “wrong” things can produce some “accidental noise art” though!
  • What I got out of everything above so far is 644 Mray/s on GeForce 1080 Ti, which is a lot more complexity than the “stupidly simple” approach, and slower too :(

What’s next? I don’t know, we’ll see. Until next time!


Daily Pathtracer 11: Buffer-Oriented

Introduction and index of this series is here.

I’ll try to restructure the path tracer a bit, from a “recursion based” approach into a “buffer based” approach.

“But why?” I had a thought of playing around with the new Unity 2018.1 async/batched raycasts for a path tracer, but that API is built on a “whole bunch of rays at once” model. My current approach that does one ray at a time, recursively, until it finishes, does not map well to it.

So let’s do it differently! I have no idea if that’s a good idea or not, but eh, let’s try anyway :)

Recursive (current) approach

Current approach is basically like the diagram above. We start with casting some ray (“1”), it hits something, is scattered, we continue with the scattered ray (“2”), until maximum ray depth is reached or ray hits “sky”. Next, we start another camera ray (“3”), that is scattered (“4”), and so on. It basically goes one ray at a time, in a depth-first traversal order (using recursion in my current CPU implementations; and iterative loop in GPU implementations).

Buffer-based approach

I don’t know if “buffer based” is a correct term… I’ve also seen “stream ray tracing” and “wavefront ray tracing” which sound similar, but I’m not sure they mean exact same thing or just somewhat similar idea. Anyway…

One possible other approach would be to do breadth-first traversal of rays. First do all primary (camera) rays, store their hit information into some buffer (hence “buffer based”). Then go look at all these hit results, scatter or process them somehow, and get a new batch of rays to process. Continue until maximum depth is reached or we’re left with no rays to process for some other reason.

Morgan McGuire’s G3D path tracer seems to be structured similarly, as well as Laine, Karras, Aila “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs”, from my quick look, suggests something along those lines.

So the approach would basically be:

// generate initial eye rays
buffer1 = GenerateCameraRays();

// while we still have rays to do
while (!buffer1.empty())
{
	buffer2.MakeEmpty();
	// for each ray in current bounce, raycast and evaluate it
	foreach (Ray r in buffer1)
	{
		hit = HitWorld(r);
		if (hit)
		{
			image[ray.pixel] += EvaluateMaterial();
			// add rays for next bounce
			AddScatteredRayTo(buffer2);
			AddShadowRayTo(buffer2);
		}
		else
		{
			image[ray.pixel] += EvaluateSkyColor();
		}
	}

	// swap buffers; proceed to next bounce
	swap(buffer1, buffer2);
}

What information we need to track per-ray in these buffers? From what I can see, the current path tracer needs to track these:

struct Ray
{
	Ray ray; // the ray itself, duh (origin + direction)
	Color atten; // current attenuation along the ray
	int pixelIndex; // which image pixel this ray is for
	int depth; // ray bounce depth, to know when we hit maximum bounces
	int lightID; // for light sampling ("shadow") rays only: which light this ray is cast towards
	bool shadow; // is this a light sampling ("shadow") ray?
	bool skipEmission; // should material emission, if hit by this ray, be ignored
};

How large these ray buffers should be? In the simplest form, let’s just preallocate “maximum possible space” we think we’re going to need. One buffer for the whole image would be Width * Height * SamplesPerPixel * (1 + MaxShadowRays) in size (one ray can scatter; plus several shadow rays). And we need two of these buffers, since we’re writing into a new one while processing current one.

Implementation of the above for C++ is in this commit. It works correctly, now, what’s the performance, compared our previous state? PC: 187→66 Mray/s, Mac: 41.5→39.5 Mray/s. Huh what? This is almost three times slower on PC, but almost no performance change on Mac?!

What’s going on?

Well, for one, this approach now has a whopping 1800 megabytes (yeah, 1.8GB) of buffers to hold that ray data; and each bounce iteration reads from these giant buffers, and writes into them. The previous approach had… none of such thing; the only memory traffic it had was blending results into the final pixel buffer, and some (very small) arrays of spheres and materials.

I haven’t actually dug into this deeper, but my guess on “why Mac did not become slower” is that 1) if this is limited by RAM bandwidth, then the speed of RAM between my PC & Mac is probably not that big; PC got a much larger slowdown in comparison, and 2) Mac has a Haswell CPU with that 128MB of L4 cache which probably helps things a bit.

A side lesson from this might also be, even if your memory access patterns are completely linear & nice, they are still memory accesses. This does not happen often, but a couple times I’ve seen people approach for example multi-threading by going really heavy on “let’s pass buffers of data around, everywhere”. One might end up with a lot of buffers creating tons of additional memory traffic, even if the access pattern of each buffer is “super nice, linear, and full cache lines are being used”.

Anyway, right now this “buffer oriented” approach is actually quite a lot slower…

Let’s try to reduce ray data size

One possible approach to reduce memory traffic for the buffers would be to stop working on giant “full-screen, worst case capacity” buffers. We could work on buffers that are much smaller in size, and for example would fit into L1 cache; that probably would be a couple hundred rays per buffer.

So of course… let’s not do that for now :) and try to “just” reduce the amount of storage we need for one ray! “Why? We don’t ask why, we ask why not!”

Let’s go!

  • There’s no need to track depth per-ray; we can just do the “process bounces” loop to max iterations instead (commit). Performance unchanged.
  • Our float3 right now is an SSE-register size, which takes up space of four floats, not just the three we need. Stop doing that. Ray buffers: 1800→1350MB; PC performance: 66.1→89.9 Mray/s.
  • Instead of storing a couple ints and bools per ray, put all that into a 32 bit bitfield (commit). Ray buffers: 1350→1125MB; PC performance: 89.9→107 Mray/s.
  • Change first ray bounce (camera rays); there’s little need to write all of them into buffer and immediately process them. They also don’t need to handle “current attenuation” bit (commit). PC performance: 107→133 Mray/s.
  • You know what, ray directions and attenuation colors sound like they could use something more compact than a full 32 bit float per component. Let’s try to use 16 bit floats (“half precision”) for them. And let’s use F16C CPU instructions to convert between float and half; these are generally available in Intel & AMD CPUs made since 2011. That’s these two commits (one and two). Ray buffers: 1125→787MB; PC performance: 107→156 Mray/s.

By the way, Mac performance has stayed at ~40 Mray/s across all these commits. Which makes me think that the bottleneck there is not the memory bandwidth, but calculations. But again, I haven’t investigated this further, just slapping that onto “eh, probably that giant L4 cache helps”.

Status and what’s next

Code is at 11-buffer-oriented tag at github.

PC performance of the “buffer oriented” approach right now is at 156 Mray/s, which, while being behind the 187 Mray/s of the “recursion based” approach, is not “several times behind” at least. So maybe this buffer-oriented approach is not terribly bad, and I “just” need to make it work on smaller buffers that could nicely fit into the caches?

It would probably make sense to also split up “work to do per bounce” further, e.g. separate buffers for regular vs shadow rays; or even split up rays by material type, etc. Someday later!

I’m also interested to see what happens if I implement the above thing for the GPU compute shader variant. GPUs do tend to have massive memory bandwidth, after all. And the “process a big buffer in a fairly uniform way” might lead to way better GPU wave utilization. Maybe I’ll do that next.


Daily Pathtracer 10: Update C#&GPU

Introduction and index of this series is here.

Short post; nothing new. Just wanted to update C#, Unity (C#+Burst) and GPU implementations with the larger scene and optimizations from previous blog posts. So that there’s some, ahem, unity between them again. Here they are, as github commits/PRs:

A note on C# Mono performance

As Miguel de Icaza noted on github and wrote on his blog in-depth, defaults in current Mono version (5.8/5.10) are not tuned for the best floating point performance. Read his blog for details; much better defaults should be shipping in later Mono versions! If nothing else, maybe this toy project will have been useful to gently nudge Mono into improving the defaults :)

Current performance numbers, in Mray/s

Implementation PC Mac
GPU 778 53.0
C++, SSE+SoA HitSpheres 187 41.8
C++, SoA HitSpheres 100 19.6
C#, Unity Burst 82.3 18.7
C#, .NET Core 53.0 13.1
C#, mono -O=float32 --llvm w/ MONO_INLINELIMIT=100 12.7
C#, mono -O=float32 --llvm 10.5
C#, mono -O=float32 6.0
C#, mono 5.5
  • PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t) with GeForce GTX 1080 Ti.
  • Mac is late-2013 MacBookPro (Core i7-4850HQ 2.3GHz, 4c/8t) with Intel Iris Pro.
  • Unity version 2018.1 beta 12 with Burst 0.2.3.
  • Mono version 5.8.1.
  • .NET Core version 2.1.4.

All code is on github at 10-impl-updates tag.

What’s next

I want to switch from a recursion/iteration oriented path tracer setup, into a stream/buffers oriented one, and see what happens. Just because! My blog, my rules :)


Daily Pathtracer 9: A wild ryg appears

Introduction and index of this series is here.

In the previous post, I did a basic SIMD/SSE implementation of the “hit a ray against all spheres” function. And then of course, me being a n00b at SIMD, I did some stupid things (and had other inefficiencies I knew about outside the SIMD function, that I planned to fix later). And then this happened:

You used the ultimate optimization technique: nerd sniping rygorous into doing it for you =) (via)

i.e. Fabian Giesen himself did a bunch of optimizations and submitted a pull request. Nice work, ryg!

His changes got performance of 107 -> 187 Mray/s on PC, and 30.1 -> 41.8 Mray/s on a Mac. That’s not bad at all for relatively simple changes!

ryg’s optimizations

Full list of changes can be seen in the pull request, here are the major ones:

Use _mm_loadu_ps to load memory into a SIMD variable (commit). On Windows/MSVC that got a massive speedup; no change on Mac/clang since clang was already generating movups instruction there.

Evaluate ray hit data only once (commit). My original code was evaluating hit position & normal for each closer sphere it had hit so far; this change only remembers t value and sphere index instead, and calculates position & normal for the final closest sphere. It also has one possible approach in how to do “find minimum value and index of it in an SSE register”, that I was wondering about.

“Know” which objects emit light instead of searching every time (commit). This one’s not SIMD at all, and super obvious. In the explicit light sampling loop, for each ray bounce off a diffuse surface, I was going through all spheres, checking “hey, do you emit light?”. But only a couple of all of them do! So instead, have an explicit array of light-emitting sphere indices, and only go through that. This was another massive speedup.

Several small simplifications (commit, commit, commit, commit, commit). Each one self-explanatory.

What’s next

I want to apply some of the earlier & above optimizations to the C#, C#+Burst and GPU implementations too, just so that all versions are on the same ground again. Maybe I’ll do that next!


Daily Pathtracer 8: SSE HitSpheres

Introduction and index of this series is here.

In the previous post, I talked about concept of SIMD, structure-of-arrays layout, and one (not good) approach of “let’s use SSE for float3 struct”. Just rearranging our sphere data into SoA layout for HitSpheres function gave a nice speed boost. Now that the data is all nice let’s try to use actual SSE SIMD code for that.

Note: I don’t have much experience with SIMD; this is a learning exercise for me too. Things I do might be totally stupid. You’ve been warned!

SIMD helpers

At least right now while I don’t have much experience, I do find SSE intrinsic functions “a bit” (mildly said) unreadable and scary looking. Too many underscores, too cryptic names, and overall half of instructions I’m sure (I hope?) makes total sense from hardware perspective, but not so much from a “typical programmer” perspective. Or maybe that’s just me.

Anyway, I added a tiny float4 struct & helpers to help my eyes a bit. So that I can write float4 and get the __m128 underneath, or likewise, just “add” two float4s and have that turn into an _mm_add_ps, and so on.

HitSpheres with SSE

Without further ado, my HitSpheres function with SSE implementation is here. It’s very likely that it has some, errr, “not very efficient” things in there, please do let me know!

SSE version comes out at ~100 lines, compared with ~40 lines for C++ one. So it’s quite a bit more verbose, and somewhat less readable (for me right now…), but not terribly cryptic. Here’s what it does, step by step, with some non-essential parts skipped.

We’ll be doing intersection checking of one ray against all spheres, 4 spheres at a time. In the main loop of the function, we’ll load data for 4 spheres into SSE float4 type variables, and do intersection checks against the ray; with ray data duplicated (“splatted”) identically into all 4 lanes of float4 variables. At the end, we’ll have up to 4 intersection results and will pick closest one.

Prepare data for the ray and min/max distances. Duplicate into 4-wide variables:

float4 rOrigX = SHUFFLE4(r.orig, 0, 0, 0, 0);
float4 rOrigY = SHUFFLE4(r.orig, 1, 1, 1, 1);
float4 rOrigZ = SHUFFLE4(r.orig, 2, 2, 2, 2);
float4 rDirX = SHUFFLE4(r.dir, 0, 0, 0, 0);
float4 rDirY = SHUFFLE4(r.dir, 1, 1, 1, 1);
float4 rDirZ = SHUFFLE4(r.dir, 2, 2, 2, 2);
float4 tMin4 = float4(tMin);
float4 tMax4 = float4(tMax);

We’ll be storing current closest hit (position, normal, t value, sphere index) for each “lane” here. Sphere index is directly in __m128i type since I don’t have something like an int4 helper struct.

float4 hitPosX, hitPosY, hitPosZ;
float4 hitNorX, hitNorY, hitNorZ;
float4 hitT = float4(tMax);
__m128i id = _mm_set1_epi32(-1);

The loop goes over all spheres, 4 at a time. The calling code already makes sure that if total amount of spheres is not a multiple by 4, then extra “fake” entries are present, with “impossible data” (zero radius etc.). We can just go over them and the ray will never hit them. At start of loop, I just load center & squared radius for 4 spheres. Right now my “load” implementation uses unaligned load; I should perhaps switch to aligned instead.

for (int i = 0; i < spheres.simdCount; i += kSimdWidth)
{
    // load data for 4 spheres
    float4 sCenterX = float4(spheres.centerX + i);
    float4 sCenterY = float4(spheres.centerY + i);
    float4 sCenterZ = float4(spheres.centerZ + i);
    float4 sSqRadius = float4(spheres.sqRadius + i);

Next up is basic math, does exactly what it says; just for 4 spheres at once:

float4 ocX = rOrigX - sCenterX;
float4 ocY = rOrigY - sCenterY;
float4 ocZ = rOrigZ - sCenterZ;
float4 b = ocX * rDirX + ocY * rDirY + ocZ * rDirZ;
float4 c = ocX * ocX + ocY * ocY + ocZ * ocZ - sSqRadius;
float4 discr = b * b - c;

Now comes a branch that says “is discriminant for any of 4 spheres positive?”:

bool4 discrPos = discr > float4(0.0f);
// if ray hits any of the 4 spheres
if (any(discrPos))
{

In SIMD programming, quite similar to GPU shader programming, it’s common to use branch-less code, i.e. compute both sides of some check, and “select” one or another based on a condition. Branches are possible of course, and can be beneficial when all or most “lanes” tend to take the same side of the branch, or if it allows saving a large amount of calculations.

Here, bool4 is actually exactly the same as float4; it holds 128 bits worth of data. Comparison operator (_mm_cmpgt_ps instruction) sets all bits of a 32-bit “lane” to 1 or 0 accordingly. any is implemented via _mm_movemask_ps instruction, which returns a regular integer with a bit for every highest bit of an SSE register lane. If that returns non-zero, it means that some of the four spheres have a positive discriminant here. Otherwise, none of the spheres are hit by this ray, and we can move onto the next batch of 4 spheres.

Next up is computing t values for two possible ray-sphere intersection points (remember: for 4 spheres at once), and checking which ones of those satisfy conditions. The conditions are:

  • Must have had a positive discriminant earlier (we are executing this code if any sphere in a batch intersects a ray, but some of them might not),
  • t must be larger than minimum distance passed as this function argument,
  • t must be smaller than maximum distance (passed as argument, and kept on decreasing inside this function as intersections are found)
// ray could hit spheres at t0 & t1
float4 t0 = -b - discrSq;
float4 t1 = -b + discrSq;
bool4 msk0 = discrPos & (t0 > tMin4) & (t0 < tMax4);
bool4 msk1 = discrPos & (t1 > tMin4) & (t1 < tMax4);

Now we have a “mask” (all bits set) for which intersections are “good” at t0 (mask msk0), and similar for t1. We need the closer “good” one, so whenever t0 is “good” we should take that, otherwise take t1. Recall that “SIMD prefers branch-less select-style programming”? This is our first occurrence of that, essentially doing a t = msk0 ? t0 : t1, for each of the four lanes. And at the end the final “good” intersections were the ones where either t0 or t1 was suitable; a union of the masks achieves that.

// where sphere is hit at t0, take that; elsewhere take t1 hit
float4 t = select(t1, t0, msk0);
bool4 msk = msk0 | msk1;

Given that the result = mask ? this : that type of operation seems to be very common in SIMD programming, you might think that SSE would have a built-in instruction for that. But noooo, it took all until SSE4.1 to add the _mm_blend_ps instruction. If you need to target earlier CPUs, you have to do funky bit logic dance to achieve the same result. “SSE: mind the gap!“ by Fabian Giesen has this and a lot more tricks to deal with SSE oddities.

Next up, if any sphere got actually hit, we compute the intersection point, normal and so on, and update our “best so far” (four of them) variables with that, using select with a mask:

if (any(msk))
{
    // compute intersection at t (id, position, normal), and overwrite current best
    // results based on mask.
    // also move tMax to current intersection
    id = select(id, curId, msk);
    tMax4 = select(tMax4, t, msk);
    float4 posX = rOrigX + rDirX * t;
    float4 posY = rOrigY + rDirY * t;
    float4 posZ = rOrigZ + rDirZ * t;
    hitPosX = select(hitPosX, posX, msk);
    // ...similar for other hitFoo variables

That’s the “check all spheres loop” done. Now after the loop, we’re left with up to 4 sphere intersections, and have to pick closest one. This is the part where I’m sure my “solution” is all kinds of suboptimal, suggestions welcome :) I suspect it’s suboptimal because: 1) it’s a lot of repetitive code, and 2) it’s a lot of scalar-looking and branchy-looking code too.

Out of four intersection t values, find the smallest one via a “horizontal minimum” helper:

float minT = hmin(hitT);

How is hmin implemented? To find minimum in an N-wide SSE variable, we need to do logN steps of shuffling and “minimum” operation. With SSE, width is 4, so minimum can be found in two shuffles and two min.

float hmin(float4 v)
{
    v = min(v, SHUFFLE4(v, 2, 3, 0, 0));
    v = min(v, SHUFFLE4(v, 1, 0, 0, 0));
    return v.getX();
}

And then once I know the minimum t distance out of all four results, I just check each of them one by one, “was this the minimum? extract and return data if so”, e.g. for the first lane:

if (hitT.getX() == minT)
{
    outHit.pos = float3(hitPosX.getX(), hitPosY.getX(), hitPosZ.getX());
    outHit.normal = float3(hitNorX.getX(), hitNorY.getX(), hitNorZ.getX());
    outHit.t = hitT.getX();
    return (int16_t)_mm_extract_epi16(id, 0);
}

The second lane is exactly the same, just getY component is used, etc. Repeat for all four SSE lanes.

…and that’s the SSE implementation of HitSpheres!

Note that I could have used no branches inside the loop at all; just do everything with masked selects. In my test having the branches there is actually a bit faster, so I left them in.

Ok, what’s the performance?

  • PC (AMD ThreadRipper, 16 threads): 186 -> 194 Mray/s.
  • Mac (MacBookPro, 8 threads): 49.8 -> 55.2 Mray/s.

Hmpft. It is faster, but not awesomely faster. What’s going on? Two things:

  1. HitSpheres is not all the work that the path tracer is doing; just a part of it,
  2. We only have 9 spheres in the whole scene. All this effort to process four spheres at once, when in total there’s only nine of them… yeah.

Time for a larger scene then!

46 spheres now, with two of them being light-emitting surfaces. I know, that’s not a “properly complex” scene, but on the other hand, I don’t have any ray intersection acceleration structure either; each ray is checking all the spheres.

With this larger scene, here are some updated numbers:

  • This post, HitSpheres with SSE: PC 107, Mac 30.1 Mray/s.
  • Previous post:
    • HitSheres with SoA layout: PC 78.8, Mac 17.4 Mray/s.
    • Before any SIMD/SoA stuff: PC 48.0, Mac 12.3 Mray/s.

So this & previous post combined, by optimizing only “ray against N spheres” part, so far got 2-2.5x total speedup. Not too bad!

For reference, (still unoptimized at all) GPU compute shader implementation on this larger scene:

  • PC GeForce 1080 Ti, DX11: 648 Mray/s,
  • Mac Intel Iris Pro, Metal: 41.8 Mray/s.

…and that’s it for today :) The above changes are in this PR, or at 08-simd tag.

What’s next

I have done a “one ray vs N spheres” SIMD with SSE part. Quite likely that’s not the best approach (fairly isolated though: just one function). Doing “N rays” type of SIMD might make more sense for performance.

So maybe that’s next, or maybe I’ll look into a “stream/buffer” oriented setup instead. Or neither of the above :) Stay tuned!