Blogs · Aras' website

Pathtracer 13: GPU threadgroup memory is useful!

Posted on May 28, 2018

Introduction and index of this series is here.

Oh, last post was exactly a month ago… I guess I’ll remove “daily” from the titles then :)

So the previous approach “let’s do one bounce iteration per pass” (a.k.a. “buffer oriented”) turned out to add a whole lot of complexity, and was not really faster. So you know what, let’s park that one for now; maybe we’ll return to something like that once (if ever) we either actually need it, or perhaps when we’ll work on smaller ray packets that don’t need hundreds-of-megabytes of ray buffers.

Scott Bean (@gfxbean) sent a little hint that in my “regular, super simple” GPU implementation I might get much better performance by moving scene/material data into groupshared memory. As we’ve seen in the previous post, using group shared memory can speed things up quite a lot, and in this case all threads will be going through exactly the same spheres to check rays against.

All that work is completely isolated inside the compute shader (nice!), and conceptually goes like this:

groupshared Foo s_GroupFoo[kMaxFoos];

// at start of shader:
CopyFoosFromStructuredBuffersInto(s_GroupFoo);

ThreadGroupMemoryBarrier(); // sync threads in the group

// proceed as usual, just use s_GroupFoo instead
// of StructuredBuffer<Foo> variable

D3D11

The actual commit for D3D11 is here, and is pretty self-explanatory. At start of shader I make each thread do a little bit of “copy” work like this:

void main(uint3 tid : SV_GroupThreadID)
{
    uint threadID = tid.y * kCSGroupSizeX + tid.x;
    uint groupSize = kCSGroupSizeX * kCSGroupSizeY;
    uint objCount = g_Params[0].sphereCount;
    uint myObjCount = (objCount + groupSize - 1) / groupSize;
    uint myObjStart = threadID * myObjCount;
    for (uint io = myObjStart; io < myObjStart + myObjCount; ++io)
    {
        if (io < objCount)
        {
            s_GroupSpheres[io] = g_Spheres[io];
            s_GroupMaterials[io] = g_Materials[io];
        }
        if (io < g_Params[0].emissiveCount)
        {
            s_GroupEmissives[io] = g_Emissives[io];
        }
    }
    GroupMemoryBarrierWithGroupSync();

I also reduced thread group size from 16x16 to 8x8 since that was a bit faster on my GPU (may or might not be faster on any other GPU…). What’s the result? NVIDIA GeForce 1080 Ti: 778 -> 1854 Mray/s.

So that’s 2.4x faster for a fairly simple (and admittedly not trivially scalable to large scenes) change! However… a quick test on Radeon Pro WX 9100: says: 1200 -> 1100 Mray/s, so a bit slower. I haven’t investigated why, but I guess the takeaways are:

Pre-caching compute shader data into thread group shared memory can make it a lot faster!
Or it might make it slower on a different GPU.
Good luck!

Metal

I did the same change in the Metal implementation; here’s the commit - pretty much the same as what is there on D3D11. The result? MacBook Pro (2013) with Intel Iris Pro 60.8 -> 42.9 Mray/s. (oꆤ︵ꆤo)

Why? No idea; Mac has no tooling to answer this question, as far as I can tell.

And then I did a change that I thought of totally at random, just because I modified these lines of code and started to think “I wonder what would happen if I…”. In the shader, several places had code like const Sphere& s = spheres[index] – initially came from the code being a direct copy from C++. I changed these places to copy into local variables by value, instead of having a const reference, i.e. Sphere s = spheres[index].

Here’s the commit, and that tiny change got the performance up to 98.7 Mray/s on Intel Iris Pro.

Why? Who knows! I would have expected any “sufficiently smart compiler” to have compiled both versions of code into exact same result. Turns out, nope, one of them is 2x faster, good luck!

Metal shaders are a bit of a black box, with not even intermediate representation being publicly documented. Good thing is… turns out the IR is just LLVM bitcode (via @icculus). So I grabbed a random llvm-dis I had on my machine (from Emscripten SDK, of all places), checked which output file Xcode produces for the *.metal inputs, and ran it on both versions.

The resulting LLVM IR disassembly is not very easy on the eyes, looking generally like this:

; <label>:13:                                     ; preds = %54, %10
  %14 = phi float [ %5, %10 ], [ %56, %54 ]
  %15 = phi i32 [ -1, %10 ], [ %55, %54 ]
  %16 = phi i32 [ 0, %10 ], [ %57, %54 ]
  %17 = sext i32 %16 to i64
  %18 = getelementptr inbounds %struct.Sphere, %struct.Sphere addrspace(3)* %2, i64 %17
  %19 = bitcast %struct.Sphere addrspace(3)* %18 to i8 addrspace(3)*
  call void @llvm.memcpy.p0i8.p3i8.i64(i8* %11, i8 addrspace(3)* %19, i64 20, i32 4, i1 false), !tbaa.struct !47
  br label %20
; <label>:20:                                     ; preds = %20, %13
  %21 = phi i32 [ 0, %13 ], [ %30, %20 ]
  %22 = phi <4 x float> [ undef, %13 ], [ %29, %20 ]
  %23 = sext i32 %21 to i64
  %24 = getelementptr inbounds %struct.Sphere, %struct.Sphere* %8, i64 0, i32 0, i32 0, i64 %23
  %25 = load float, float* %24, align 4, !tbaa !46

I’m not fluent in reading it, but by diffing the two versions, it’s not immediately obvious why one would be slower than the other. The slow one has some more load instructions with addrspace(3) on them, whereas the fast one has more calls into alloca (?) and llvm.memcpy.p0i8.p3i8.i64. Ok I guess? The alloca calls are probably not “real” calls; they just end up marking up how much of thread local space will get needed after all inlining. Memcpy probably ends up being a bunch of moves in exactly once place, so if GPU has any sort of load coalescing, then that gets used there. Or that’s my theory for “why faster”.

So Metal takeaways might be:

By-value instead of by-const-reference things might be much more efficient.
Metal bytecode is “just” LLVM IR, so peeking into that with llvm-dis can be useful. Note that this is still a machine-independent, very high level IR; you have no visibility into what the GPU driver will make of it in the end.

Current status and what’s next

So this simple change to pre-cache sphere/material/emissive data into thread group shared memory got GPU performance up to:

PC (GeForce 1080 Ti): 778 -> 1854 Mray/s,
Mac (Intel Iris Pro): 61 -> 99 Mray/s.

Which is not bad for such a simple change. Current code is over at 13-gpu-threadgroup-opt tag on github.

What’s next? I’m not sure. Maybe I should look at moving this out of “toy” stage and add bounding volume hierarchy & triangle meshes support? Narrator: he did not.

Daily Pathtracer 12: GPU Buffer-Oriented D3D11

Posted on Apr 25, 2018

Introduction and index of this series is here.

In the previous post, I changed the CPU path tracer from recursion (depth first) based approach to “buffer based” (breadth first) one. It got slightly slower on PC, and stayed around the same performance on a Mac.

I was curious how a similar approach would work on the GPU. Would it be slower or faster than a “super naïve GPU path tracer” I had before? No idea! Let’s find that out. Maybe we’ll learn something along the way.

Time for another confession: while I “conceptually” know how a GPU works, and have read & heard a lot of material on the topic, I don’t have much “actual” experience in optimizing compute shaders. Last time I was doing “serious” shader optimization was regular vertex/pixel shader workloads, and that was some years ago too. So I surely lack intuition in optimization approaches & experience with available tools! Everything below might be a complete blunder, and/or I might be making wrong conclusions. You’ve been warned!

Current depth-first GPU implementation

Recall that in my current GPU attempt (see Metal and D3D11 posts), each compute shader invocation maps to one pixel on screen. It traces several “full” ray paths; with rays being scattered off surface hits, extra rays being sent towards light sources, and so on.

Intuitively, while ray execution patterns past the primary eye rays “must be bad” for the GPU (they would be going all over the place, hitting different materials etc.)… It also has a great thing: there’s very little memory traffic. It only needs to read ~50 sphere and material structs, and only needs to write a single color per pixel.

This initial direct GPU version runs at 778 Mray/s on a PC with GeForce GTX 1080 Ti.

Initial buffer-based GPU implementation

Let’s for a moment pretend that GPU compute shader programming model does not have any unique properties or gotchas, and do the “most simple” buffer oriented implementation. It is structured very much like the buffer-oriented CPU implementation:

One compute shader evaluates primary camera rays, and writes out their contribution into the image.

Primary ray hits can only contribute emissive color in case they hit a light directly, or a sky color in case they don’t hit anything.
However, whenever they hit a surface they can produce more rays for the next ray bounce: scattered ray, or a light sampling (“shadow”) ray. These new rays are appended into a StructuredBuffer with all the ray data (ray, attenuation so far, pixel location, etc.). Like this:

Next up, I do a number of “bounce” iterations. This does an “indirect” compute shader dispatch (one thread for each bounce/shadow ray produced in the earlier pass). The compute shader traces these new rays (coming from a StructuredBuffer produced earlier), evaluates their own contribution, adds it to the image at ray locations, and each ray surface hit can produce more rays for the next bounce. These new rays are written into another StructuredBuffer. Then, repeat this same step again up to N bounce iterations, swapping input & output ray buffers.

This initial commit is here.

Performance: 103 Mray/s (recall that our baseline is 778 Mray/s for the simple depth-first tracer).

༼ ༎ຶ ෴ ༎ຶ༽

That’s not good at all! Also, it had a subtle lighting difference compared to the CPU implementation, mostly visible on the glass sphere. Here are images: CPU, GPU and increased contrast difference. The difference image revealed some block-like patterns too. Something is not good!

By the way, the “output rays in one CS invocation, then run another CS invocation for that amount of rays” bit is surprisingly non-intuitive, in terms of “ok how to actually do this”. Running a CS on D3D11 requires the user to pass number of thread groups, not number of threads! This basically means that I need to sneak in another tiny compute shader, that only runs on a single element, and all it does is divide a number that’s in one buffer, and write the result into another buffer. Why must simple things be cumbersome?!

Why so slow? Let’s try to find out

I think the recommended way of figuring out why a compute shader is slow, as of first half of 2018, is roughly this:

Have an implementation for Playstation 4, and use profiling tools there! or,
Have an implementation for D3D12 or Vulkan, run on an AMD GPU, and use Radeon GPU Profiler there!

That’s just great (not!)… I have a D3D11 implementation, and my GPU is NVIDIA. Let’s see what we have there.

Visual Studio GPU Usage tool

First off, let’s check whether Visual Studio has anything useful. There’s a GPU Usage tool in there. It can tell me that in my “fast” GPU implementation all the time is taken by a compute shader (well duh), and that in my “slow” implementation all the time is taken by these many compute shader dispatches. Ok so that wasn’t very useful in this case.

NVIDIA Nsight Graphics

I have used Nsight in the past, but I frankly forgot what for (might be debugging, might be profiling). Anyhoo, I forgot everything about it, and turns out their current incarnation, Nsight Graphics 1.0, is all different anyway.

Analyzing a frame in Nsight, it tells me this:

My guess for what all that means is basically this:

According to NVIDIA blogs, “SOL” in there means “speed of light”, so I think it’s telling me that my compute shader is running at about 7% of what it could run at. That’s obviously super bad! But what to do about it; why is my shader slow? I feel about 90% SOL.

Trying random things to speed it up

Without any of the above tools clearly telling me “hey, this thing in your shader is stupid, go fix it”, I resorted to applying random bits of knowledge I might have accumulated in the past. Which is basically all tweets from Sebastian Aaltonen and random docs from conferences, e.g. DirectCompute Optimizations and Best Practices from GTC 2010, and countless others that are similar.

First up, “avoid atomic operations” sounds like a sensible thing to do. My CS, for each thread, was counting the number of rays traced (which is only used to display Mray/s figure!), by incrementing a global counter with an InterlockedAdd function. Let’s track the amount of rays inside the whole thread group via a groupshared variable, and only do the global atomic at once per group (commit). 104 -> 125 Mray/s, not bad for such a simple change.

My “process ray bounce” compute shader was operating on 64 rays at once, let’s try tweaking that number. 256 rays in one go turned out to be fastest. Trivial change, 125 -> 147 Mray/s.

Let’s put new rays into group shared memory!

What I had so far mostly does not even need to be a compute shader, since I’m not using about the only feature that makes them worth having in the 1st place – which is “group shared” (aka thread-group local, aka LDS) memory.

Right now whenever any thread in my compute shader needs to emit a new ray for next bounce pass, it does an atomic increment of a global ray counter, and writes the new ray into a StructuredBuffer. Let’s instead do this:

Have a ray buffer for the whole thread group in groupshared memory.
New rays are appended into that buffer (this still uses atomics, but they are on a thread group local variable),
Once whole thread group is done, write it into the structured buffer with one global atomic operation and a bunch of memory copies.

I did the above, basically going like this, and this was the result…

…it’s running quite fast at 937 Mray/s though, shipit :)

Let’s fix rendering

Recall how my “initial attempt” was also subtly different from the CPU rendering, sometimes in block-like artifacts?

Turns out, I was doing a “wrong thing”, in this bit of compute shader that processes a bounce iteration:

The compute shader traces these new rays, evaluates their own contribution, adds it to the image at ray locations

The actual code is dstImage[pixelCoord] += ... bits around here. In this compute shader, each execution thread no longer maps to a completely separate pixel on screen! They just grab a bunch of rays to process, each with their own pixel location. It can (and often does) end up, that several threads at once process rays that hit the same pixel (think shadow & regular bounce ray for the same pixel; and also I run at 4 rays per pixel to get anti-aliasing…).

The dstImage[pixelCoord] += bit is not atomic at all, and presumably by optimizing the compute shader to be faster, the execution pattern of it started to be very different from before, and what was “subtle errors” turned into “whoa random garbage” now. Or that was my theory, which I haven’t 100% double checked :)

It seems that there’s no easy way to do atomic additions to floats on the GPU. You could implement that manually by doing a loop with an atomic compare/exchange, and maybe there are some GPU-specific shader extensions that for example would allow doing that for half-precision floats or somesuch. All that is “uhh sounds hard” in my book, so I decided to solve this problem by (mis)using the GPU rasterizer.

GPU rasterizer has a blending unit that can blend a lot of things, even if they hit the same locations on screen, and the results come out correctly! So in the bounce-processing compute shader, I don’t write anything into the output image; the shader only produces rays for the next bounce, and “splats” (pixel location + color) for the rasterizer to render later. The splats are also added into a buffer, which is then later on rendered as points.

Here’s a diagram that probably makes it even more confusing :)

That fixed rendering to be correct though!

Ok what’s the performance now?

Doing the above (put rays & splats into groupshared memory, write to global buffers at end of group; blend splats using the rasterizer – see commit) got performance up from 147 to 611 Mray/s. I guess Yoda was not joking in that “LDS we must use” quote.

A couple more commits later I changed how I append items from group-local buffers into the global ones. I had this before:

groupshared s_Data[kSize];
groupshared uint s_DataCount;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread write out to global buffer:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
{
	uint dataStart;
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, dataStart);
	for (uint i = 0; i < s_DataCount; ++i)
	{
		g_Data[dataStart + i] = s_Data[i];
	}	
}

this works, but only one thread in the whole group ends up doing “copy into global buffer” work. Doing this instead was quite a bit faster:

groupshared s_Data[kSize];
groupshared uint s_DataCount;
groupshared uint s_DataStart;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread reserve space in global buffer and
// find where it starts:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
{
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, s_DataStart);
}

// threads in the whole group copy their portion
uint myCount = (s_DataCount + kCSGroupSize - 1) / kCSGroupSize;
uint myStart = threadID * myCount;
for (uint i = myStart; i < myStart + myCount; ++i)
	if (i < s_DataCount)
		g_Data[s_DataStart + i] = s_Data[i];

Doing the above change for how rays are copied, and how splats are copied, increased performance from 619 to 644 Mray/s.

What else could be done?

So… 644 Mray/s is still behind the “super simple direct port” that I had running at 778 Mray/s…

Some completely random guesses on what else could be done to speed up the current “put rays/splats for whole bounce into a buffer” approach:

The compute shaders use a lot of space in groupshared memory right now: they have to have enough space to store the maximum amount of rays & splats that might get produced by the whole group! Large amount of groupshared space means the GPU can only run a very limited amount of groups at once, which is quite bad. Read more at “Optimizing GPU occupancy and resource usage with large thread groups”.
- I could compress my ray & splat data more, to take up less space. My ray data right now is 28 bytes (float3 position, half3 direction, half3 attenuation, uint for pixel location, light index and other flags); and splat data is 16 bytes (float3 color, uint pixel location). Ray direction could use less space (e.g. 16 bit integers for X&Y components, one bit for sign of Z); attenuations & colors could be packed into smaller space than FP16 (R11G11B10 float, or RGB9E5, or RGBM, etc.). Ray position might be ok with less data than full FP32 float too.
- Maybe there’s no need to store “maximum possible space” for the whole thread group, and instead have a buffer of fixed size, and write it out whenever it’s filled up.
The “some threads possibly append into a local buffer” pattern seems to generally be called “stream compaction”, and is a candidate for using “wave-level operations”. Sadly there’s no easy or cross-platform way of doing these in D3D11.
- D3D12 shader model 6.0 has wave intrinsics, but that requires using D3D12, and also using the new DXC shader compiler.
- AMD has extensions to get to them in D3D11, see this or that post.
- NVIDIA also has extensions for D3D11, see this or that post.
- …I don’t want to be writing separate compute shaders for different GPUs just yet though.
Turns out that Nsight does have a lot of possibly useful counters, besides these “SOL” numbers (thanks Nathan Hoobler for the tip). Have to select them under “User Metrics” section, and of course good luck figuring out which ones of them are actually interesting :)

The “GPU Trace” feature mentioned on Nsight website looks potentially useful too, but is not available yet at the time of writing.
It’s also entirely possible that this whole approach is nonsense and can never be fast anyway!

Current status and what’s next

So, I tried a buffer-oriented approach on the GPU (current code at 12-gpu-buffer-d3d11 tag), and learned a few things:

Compute shader optimization feels like extremely beginner-unfriendly area. I’m somewhat versed in that whole space and could even pass a Turing test in a graphics related conversation, yet still a lot of the information sounds either complicated, or is hard to find in a nicely summarized form.
- Tools that present you with a sea of numbers don’t help the impression either.
- Looking at responses I got on twitter, seems that I’m not alone in this, so phew, it’s not just me.
- Apparently, using a PS4 or AMD on D3D12/Vulkan for compute shader optimization is the way to go :)
Global atomics are slow.
Using large amounts of group shared memory is slow (but can be faster than not using it at all).
There’s a reason why UnorderedAccessView in D3D terms has “unordered” in the name. Writes into them can and will come out in unpredictable order! I had to resort to rasterizer’s blend unit to write out my “ray splats”. Doing “wrong” things can produce some “accidental noise art” though!
What I got out of everything above so far is 644 Mray/s on GeForce 1080 Ti, which is a lot more complexity than the “stupidly simple” approach, and slower too :(

What’s next? I don’t know, we’ll see. Until next time!

Daily Pathtracer 11: Buffer-Oriented

Posted on Apr 19, 2018

Introduction and index of this series is here.

I’ll try to restructure the path tracer a bit, from a “recursion based” approach into a “buffer based” approach.

“But why?” I had a thought of playing around with the new Unity 2018.1 async/batched raycasts for a path tracer, but that API is built on a “whole bunch of rays at once” model. My current approach that does one ray at a time, recursively, until it finishes, does not map well to it.

So let’s do it differently! I have no idea if that’s a good idea or not, but eh, let’s try anyway :)

Recursive (current) approach

Current approach is basically like the diagram above. We start with casting some ray (“1”), it hits something, is scattered, we continue with the scattered ray (“2”), until maximum ray depth is reached or ray hits “sky”. Next, we start another camera ray (“3”), that is scattered (“4”), and so on. It basically goes one ray at a time, in a depth-first traversal order (using recursion in my current CPU implementations; and iterative loop in GPU implementations).

Buffer-based approach

I don’t know if “buffer based” is a correct term… I’ve also seen “stream ray tracing” and “wavefront ray tracing” which sound similar, but I’m not sure they mean exact same thing or just somewhat similar idea. Anyway…

One possible other approach would be to do breadth-first traversal of rays. First do all primary (camera) rays, store their hit information into some buffer (hence “buffer based”). Then go look at all these hit results, scatter or process them somehow, and get a new batch of rays to process. Continue until maximum depth is reached or we’re left with no rays to process for some other reason.

Morgan McGuire’s G3D path tracer seems to be structured similarly, as well as Laine, Karras, Aila “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs”, from my quick look, suggests something along those lines.

So the approach would basically be:

// generate initial eye rays
buffer1 = GenerateCameraRays();

// while we still have rays to do
while (!buffer1.empty())
{
	buffer2.MakeEmpty();
	// for each ray in current bounce, raycast and evaluate it
	foreach (Ray r in buffer1)
	{
		hit = HitWorld(r);
		if (hit)
		{
			image[ray.pixel] += EvaluateMaterial();
			// add rays for next bounce
			AddScatteredRayTo(buffer2);
			AddShadowRayTo(buffer2);
		}
		else
		{
			image[ray.pixel] += EvaluateSkyColor();
		}
	}

	// swap buffers; proceed to next bounce
	swap(buffer1, buffer2);
}

What information we need to track per-ray in these buffers? From what I can see, the current path tracer needs to track these:

struct Ray
{
	Ray ray; // the ray itself, duh (origin + direction)
	Color atten; // current attenuation along the ray
	int pixelIndex; // which image pixel this ray is for
	int depth; // ray bounce depth, to know when we hit maximum bounces
	int lightID; // for light sampling ("shadow") rays only: which light this ray is cast towards
	bool shadow; // is this a light sampling ("shadow") ray?
	bool skipEmission; // should material emission, if hit by this ray, be ignored
};

How large these ray buffers should be? In the simplest form, let’s just preallocate “maximum possible space” we think we’re going to need. One buffer for the whole image would be Width * Height * SamplesPerPixel * (1 + MaxShadowRays) in size (one ray can scatter; plus several shadow rays). And we need two of these buffers, since we’re writing into a new one while processing current one.

Implementation of the above for C++ is in this commit. It works correctly, now, what’s the performance, compared our previous state? PC: 187→66 Mray/s, Mac: 41.5→39.5 Mray/s. Huh what? This is almost three times slower on PC, but almost no performance change on Mac?!

What’s going on?

Well, for one, this approach now has a whopping 1800 megabytes (yeah, 1.8GB) of buffers to hold that ray data; and each bounce iteration reads from these giant buffers, and writes into them. The previous approach had… none of such thing; the only memory traffic it had was blending results into the final pixel buffer, and some (very small) arrays of spheres and materials.

I haven’t actually dug into this deeper, but my guess on “why Mac did not become slower” is that 1) if this is limited by RAM bandwidth, then the speed of RAM between my PC & Mac is probably not that big; PC got a much larger slowdown in comparison, and 2) Mac has a Haswell CPU with that 128MB of L4 cache which probably helps things a bit.

A side lesson from this might also be, even if your memory access patterns are completely linear & nice, they are still memory accesses. This does not happen often, but a couple times I’ve seen people approach for example multi-threading by going really heavy on “let’s pass buffers of data around, everywhere”. One might end up with a lot of buffers creating tons of additional memory traffic, even if the access pattern of each buffer is “super nice, linear, and full cache lines are being used”.

Anyway, right now this “buffer oriented” approach is actually quite a lot slower…

Let’s try to reduce ray data size

One possible approach to reduce memory traffic for the buffers would be to stop working on giant “full-screen, worst case capacity” buffers. We could work on buffers that are much smaller in size, and for example would fit into L1 cache; that probably would be a couple hundred rays per buffer.

So of course… let’s not do that for now :) and try to “just” reduce the amount of storage we need for one ray! “Why? We don’t ask why, we ask why not!”

Let’s go!

There’s no need to track depth per-ray; we can just do the “process bounces” loop to max iterations instead (commit). Performance unchanged.
Our float3 right now is an SSE-register size, which takes up space of four floats, not just the three we need. Stop doing that. Ray buffers: 1800→1350MB; PC performance: 66.1→89.9 Mray/s.
Instead of storing a couple ints and bools per ray, put all that into a 32 bit bitfield (commit). Ray buffers: 1350→1125MB; PC performance: 89.9→107 Mray/s.
Change first ray bounce (camera rays); there’s little need to write all of them into buffer and immediately process them. They also don’t need to handle “current attenuation” bit (commit). PC performance: 107→133 Mray/s.
You know what, ray directions and attenuation colors sound like they could use something more compact than a full 32 bit float per component. Let’s try to use 16 bit floats (“half precision”) for them. And let’s use F16C CPU instructions to convert between float and half; these are generally available in Intel & AMD CPUs made since 2011. That’s these two commits (one and two). Ray buffers: 1125→787MB; PC performance: 107→156 Mray/s.

By the way, Mac performance has stayed at ~40 Mray/s across all these commits. Which makes me think that the bottleneck there is not the memory bandwidth, but calculations. But again, I haven’t investigated this further, just slapping that onto “eh, probably that giant L4 cache helps”.

Status and what’s next

Code is at 11-buffer-oriented tag at github.

PC performance of the “buffer oriented” approach right now is at 156 Mray/s, which, while being behind the 187 Mray/s of the “recursion based” approach, is not “several times behind” at least. So maybe this buffer-oriented approach is not terribly bad, and I “just” need to make it work on smaller buffers that could nicely fit into the caches?

It would probably make sense to also split up “work to do per bounce” further, e.g. separate buffers for regular vs shadow rays; or even split up rays by material type, etc. Someday later!

I’m also interested to see what happens if I implement the above thing for the GPU compute shader variant. GPUs do tend to have massive memory bandwidth, after all. And the “process a big buffer in a fairly uniform way” might lead to way better GPU wave utilization. Maybe I’ll do that next.

Daily Pathtracer 10: Update C#&GPU

Posted on Apr 16, 2018

Introduction and index of this series is here.

Short post; nothing new. Just wanted to update C#, Unity (C#+Burst) and GPU implementations with the larger scene and optimizations from previous blog posts. So that there’s some, ahem, unity between them again. Here they are, as github commits/PRs:

A note on C# Mono performance

As Miguel de Icaza noted on github and wrote on his blog in-depth, defaults in current Mono version (5.8/5.10) are not tuned for the best floating point performance. Read his blog for details; much better defaults should be shipping in later Mono versions! If nothing else, maybe this toy project will have been useful to gently nudge Mono into improving the defaults :)

Current performance numbers, in Mray/s

Implementation	PC	Mac
GPU	778	53.0
C++, SSE+SoA HitSpheres	187	41.8
C++, SoA HitSpheres	100	19.6
C#, Unity Burst (note: see later optimization)	82.3	18.7
C#, .NET Core	53.0	13.1
C#, mono `-O=float32 --llvm` w/ `MONO_INLINELIMIT=100`		12.7
C#, mono `-O=float32 --llvm`		10.5
C#, mono `-O=float32`		6.0
C#, mono		5.5

PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t) with GeForce GTX 1080 Ti.
Mac is late-2013 MacBookPro (Core i7-4850HQ 2.3GHz, 4c/8t) with Intel Iris Pro.
Unity version 2018.1 beta 12 with Burst 0.2.3.
Mono version 5.8.1.
.NET Core version 2.1.4.

All code is on github at 10-impl-updates tag.

What’s next

I want to switch from a recursion/iteration oriented path tracer setup, into a stream/buffers oriented one, and see what happens. Just because! My blog, my rules :)

Daily Pathtracer 9: A wild ryg appears

Posted on Apr 13, 2018

Introduction and index of this series is here.

In the previous post, I did a basic SIMD/SSE implementation of the “hit a ray against all spheres” function. And then of course, me being a n00b at SIMD, I did some stupid things (and had other inefficiencies I knew about outside the SIMD function, that I planned to fix later). And then this happened:

You used the ultimate optimization technique: nerd sniping rygorous into doing it for you =) (via)

i.e. Fabian Giesen himself did a bunch of optimizations and submitted a pull request. Nice work, ryg!

His changes got performance of 107 -> 187 Mray/s on PC, and 30.1 -> 41.8 Mray/s on a Mac. That’s not bad at all for relatively simple changes!

ryg’s optimizations

Full list of changes can be seen in the pull request, here are the major ones:

Use _mm_loadu_ps to load memory into a SIMD variable (commit). On Windows/MSVC that got a massive speedup; no change on Mac/clang since clang was already generating movups instruction there.

Evaluate ray hit data only once (commit). My original code was evaluating hit position & normal for each closer sphere it had hit so far; this change only remembers t value and sphere index instead, and calculates position & normal for the final closest sphere. It also has one possible approach in how to do “find minimum value and index of it in an SSE register”, that I was wondering about.

“Know” which objects emit light instead of searching every time (commit). This one’s not SIMD at all, and super obvious. In the explicit light sampling loop, for each ray bounce off a diffuse surface, I was going through all spheres, checking “hey, do you emit light?”. But only a couple of all of them do! So instead, have an explicit array of light-emitting sphere indices, and only go through that. This was another massive speedup.

Several small simplifications (commit, commit, commit, commit, commit). Each one self-explanatory.

What’s next

I want to apply some of the earlier & above optimizations to the C#, C#+Burst and GPU implementations too, just so that all versions are on the same ground again. Maybe I’ll do that next!