Pathtracer 14: iOS

Introduction and index of this series is here.

I wanted to check out how’s the performance on a mobile device. So, let’s take what we ended up with in the previous post, and make it run on iOS.

Initial port

Code for the Mac app is a super simple Cocoa application that either updates a Metal texture from the CPU and draws it to screen, or produces the texture with a Metal compute shader. I know almost nothing about Mac or Cocoa programming, so I just created a new project in Xcode, picked a “Metal game” template, removed things I don’t need and added the things I do need.

“Porting” that to iOS basically involved these steps (again, I don’t know how it’s supposed to be done; I’m just doing a random walk):

  1. Created two projects in Xcode, using the “Metal game” template; one for Mac (which matches my current code setup), and another one for “Cross Platform” case.
  2. Looked at the differences in file layout & project settings between them,
  3. Applied the differences to my app. The changes in detail were:
    1. Some folder renaming and moving files around in Xcode project structure.
    2. Added iOS specific files produced by Xcode project template.
    3. Some tweaks to existing app code to make it compile on iOS – mostly temporarily disabling all the SSE SIMD code paths (iOS uses ARM CPUs, SSE does not exist there). Other changes were mostly differences in Metal functionality between macOS and iOS (MTLResourceStorageModeManaged buffer mode and didModifyRange buffer method only exist on macOS).
    4. Added iOS build target to Xcode project.

And then it Just Worked; both the CPU & GPU code paths! Which was a bit surprising, actually :)

Performance of this “just make it run” port on iPhone SE: CPU 5.7 Mray/s, GPU 19.8 Mray/s.

Xcode tools for iOS GPU performance

I wanted to look at what sort of tooling Xcode has for investigating iOS GPU performance these days. Last time I did it was a couple years ago, and was also not related to compute shader workloads. So here’s a quick look into what I found!

Update: this post was about Xcode 9 on A9 hardware. At WWDC 2018 Apple announced big improvements to Metal profiling tools in Xcode 10, especially when running on A11 or later hardware. I haven’t tried them myself, but you might want to check out the WWDC session and “Optimizing Performance” doc.

TL;DR: it’s not bad. Too bad it’s not as good as PS4 tooling, but then again, who is?

Most of Xcode GPU analysis is under the “Debug Navigator” thingy, where with an app running you can select the “FPS” section and it displays basic gauges of CPU & GPU performance. When using Metal, there is a “Capture GPU Frame” button near the bottom which leads to actual frame debugging & performance tools.

The default view is more useful for debugging rendering issues; you want to switch to “View Frame By Performance” instead:

The left sidebar then lists various things grouped by pipeline (compute or graphics), and by shader. It does not list them by objects rendered, which is different from how GPU profiling on desktop usually works. In my case obviously the single compute shader dispatch takes up almost all the time.

The information presented seems to be a bunch of GPU counters (number of shader invocations, instructions executed, and so on). Some of those are more useful than others, and what kind of information is being shown probably also depends on the device & GPU model. Here are screenshots of what I saw displayed about my compute shader on an iPhone SE:

Whole frame overview has various counters per encoder. From here: occupancy is not too bad, and hey look my shader is not using any half-precision instructions:

“Performance” section has more stats in number form:

“Pipeline Statistics” section has some useful performance hints and overview graphs of, uhm, something. This is probably telling me I’m ALU bound, but what are the units of each bar, and whether they are all even the same scale? I don’t know :)

If the shader was compiled with debugging information on, then it can also show which places of the shader actually took time. As far as I can tell, it just lies – for my shader, it basically says “yeah, all these lines took zero time, and there’s one line that took 6%”. Where are the other 94%?!

Xcode tools for Mac GPU performance

In the previous post I ranted about how Mac has no GPU performance tools at all, and while that is somewhat true (i.e. there’s no tool that would have told me “hey Aras, use by-value local variables instead of by-reference! twice as fast!”)… some of that “Capture GPU Frame” functionality exists for Mac Metal applications as well.

Here’s what information is displayed by “Performance” section on my MBP (Intel Iris Pro):

The “compute kernel” part has way fewer counters, and I don’t quite believe that ALU active time was exactly zero.

“Pipeline Statistics” section on the other hand… it has no performance hints, but it does have more overview graphs! “Register pressure”, “SIMD group occupancy” and “threadgroup memory” parts sound useful!

Let’s do SIMD NEON code paths for CPU

Recall when in part 8 I played around with SSE intrinsics for CPU HitSpheres function? Well now that code is disabled since iOS uses ARM CPUs, so Intel specific instructions don’t even compile there.

However, ARM CPUs do have their own SIMD instruction set: NEON. I know! Let’s use NEON intrinsic functions to implement our own float3 and float4 helpers, and then the SIMD HitSpheres should more or less work.

Caveat: as usual, I basically have no idea what I’m talking about. I have read some NEON code in the past, and perhaps have written a small NEON function or two at some point, but I’m nowhere near being “proficient” at it.

NEON float3

First off, let’s do the float3 helper class implementation with NEON. On x64 CPUs that did improve performance a bit (not much though). NEON intrinsics overall seem to be way more orthogonal and “intuitive” than SSE ones, however SSE has way, way more information, tutorials & reference about it out there. Anyway, the NEON float3 part is this commit, and my summary of NEON is:

  • #include <arm_neon.h> to get intrinsics & data types,
  • float32x4_t data type is for 4-wide floats,
  • NEON intrinsic functions start with v (for “vector”?), have q in there for things that operate on four things, and a suffix indicating the data type. For example, a 4-wide float add is vaddq_f32. Simple and sweet!
  • Getting to individual SIMD lanes is much easier than on SSE (just vgetq_lane_f32), however doing arbitrary swizzles/shuffles is harder – you have to dance around with extracting low/high parts, or “zipping” various operands, etc.

Doing the above work did not noticeably change performance though. Oh well, actually quite expected. I did learn/remember some NEON stuff though, so a net positive :)

NEON HitSpheres & float4

Last time an actual performance gain with SIMD was doing SSE HitSpheres, with data laid out in struct-of-arrays fashion. To get the same working on NEON, I basically have to implement a float4 helper class, and touch several places in HitSpheres function itself that use SSE directly. It’s all in this commit.

That got CPU performance from 5.8 Mray/s up to 8.5 Mray/s. Nice!

Note that my NEON approach is very likely suboptimal; I was basically doing a direct port from SSE. Which means:

  • “mask” calculation for comparisons. On SSE that is just _mm_movemask_ps, but becomes this in NEON:
VM_INLINE unsigned mask(float4 v)
{
    static const uint32x4_t movemask = { 1, 2, 4, 8 };
    static const uint32x4_t highbit = { 0x80000000, 0x80000000, 0x80000000, 0x80000000 };
    uint32x4_t t0 = vreinterpretq_u32_f32(v.m);
    uint32x4_t t1 = vtstq_u32(t0, highbit);
    uint32x4_t t2 = vandq_u32(t1, movemask);
    uint32x2_t t3 = vorr_u32(vget_low_u32(t2), vget_high_u32(t2));
    return vget_lane_u32(t3, 0) | vget_lane_u32(t3, 1);
}
  • picking the closest hit among 4 results may or may not be done more optimally in NEON:
int id_scalar[4];
float hitT_scalar[4];
vst1q_s32(id_scalar, id);
vst1q_f32(hitT_scalar, hitT.m);
// In general, you would do this with a bit scan (first set/trailing zero count).
// But who cares, it's only 16 options.
static const int laneId[16] =
{
    0, 0, 1, 0, // 00xx
    2, 0, 1, 0, // 01xx
    3, 0, 1, 0, // 10xx
    2, 0, 1, 0, // 11xx
};
int lane = laneId[minMask];
int hitId = id_scalar[lane];
float finalHitT = hitT_scalar[lane];

Current status

So the above is basic port to iOS, with some simple NEON code path, and no mobile specific GPU tweaks/optimizations at all. Code is over at 14-ios tag on github.


  • iPhone SE (A9 chip): 8.5 Mray/s CPU, 19.8 Mray/s GPU.
  • iPhone X (A11 chip): 12.9 Mray/s CPU, 46.6 Mray/s GPU.
    • I haven’t looked into how many CPU threads the enkiTS task scheduler ends up using on iPhone X. I suspect it still might be just two “high performance” cores, which would be within my expectations of “roughly 50% more per-core CPU perf in two Apple CPU generations”. Which is fairly impressive!
  • For comparison, a MacBook Pro (2013) with Core i7 2.3 GHz & Intel Iris Pro gets: 42 Mray/s CPU, 99 Mray/s GPU.
    • Which means that single-thread CPU performance on iPhone X is actually very similar, or even a bit higher, than on an (admittedly old) MacBook Pro!

Pathtracer 13: GPU threadgroup memory is useful!

Introduction and index of this series is here.

Oh, last post was exactly a month ago… I guess I’ll remove “daily” from the titles then :)

So the previous approach “let’s do one bounce iteration per pass” (a.k.a. “buffer oriented”) turned out to add a whole lot of complexity, and was not really faster. So you know what, let’s park that one for now; maybe we’ll return to something like that once (if ever) we either actually need it, or perhaps when we’ll work on smaller ray packets that don’t need hundreds-of-megabytes of ray buffers.

Scott Bean (@gfxbean) sent a little hint that in my “regular, super simple” GPU implementation I might get much better performance by moving scene/material data into groupshared memory. As we’ve seen in the previous post, using group shared memory can speed things up quite a lot, and in this case all threads will be going through exactly the same spheres to check rays against.

All that work is completely isolated inside the compute shader (nice!), and conceptually goes like this:

groupshared Foo s_GroupFoo[kMaxFoos];

// at start of shader: each thread copies a small
// portion of the data from the regular buffer,
// so that the whole group copies all of it

ThreadGroupMemoryBarrier(); // sync threads in the group

// proceed as usual, just use s_GroupFoo instead
// of StructuredBuffer<Foo> variable


The actual commit for D3D11 is here, and is pretty self-explanatory. At start of shader I make each thread do a little bit of “copy” work like this:

void main(uint3 tid : SV_GroupThreadID)
{
    uint threadID = tid.y * kCSGroupSizeX + tid.x;
    uint groupSize = kCSGroupSizeX * kCSGroupSizeY;
    uint objCount = g_Params[0].sphereCount;
    uint myObjCount = (objCount + groupSize - 1) / groupSize;
    uint myObjStart = threadID * myObjCount;
    for (uint io = myObjStart; io < myObjStart + myObjCount; ++io)
    {
        if (io < objCount)
        {
            s_GroupSpheres[io] = g_Spheres[io];
            s_GroupMaterials[io] = g_Materials[io];
        }
        if (io < g_Params[0].emissiveCount)
            s_GroupEmissives[io] = g_Emissives[io];
    }
    GroupMemoryBarrierWithGroupSync();
    // ...proceed with the usual path tracing code
}

I also reduced thread group size from 16x16 to 8x8 since that was a bit faster on my GPU (may or may not be faster on any other GPU…). What’s the result? NVIDIA GeForce 1080 Ti: 778 -> 1854 Mray/s.

So that’s 2.4x faster for a fairly simple (and admittedly not trivially scalable to large scenes) change! However… a quick test on a Radeon Pro WX 9100 says: 1200 -> 1100 Mray/s, so a bit slower. I haven’t investigated why, but I guess the takeaways are:

  1. Pre-caching compute shader data into thread group shared memory can make it a lot faster!
  2. Or it might make it slower on a different GPU.
  3. Good luck!


I did the same change in the Metal implementation; here’s the commit - pretty much the same as what is there on D3D11. The result? MacBook Pro (2013) with Intel Iris Pro 60.8 -> 42.9 Mray/s. (oꆤ︵ꆤo)

Why? No idea; Mac has no tooling to answer this question, as far as I can tell.

And then I did a change that I thought of totally at random, just because I modified these lines of code and started to think “I wonder what would happen if I…”. In the shader, several places had code like const Sphere& s = spheres[index] – initially came from the code being a direct copy from C++. I changed these places to copy into local variables by value, instead of having a const reference, i.e. Sphere s = spheres[index].

Here’s the commit, and that tiny change got the performance up to 98.7 Mray/s on Intel Iris Pro.

Why? Who knows! I would have expected any “sufficiently smart compiler” to have compiled both versions of code into the exact same result. Turns out, nope, one of them is 2x faster, good luck!

Metal shaders are a bit of a black box, with not even the intermediate representation being publicly documented. Good thing is… turns out the IR is just LLVM bitcode (via @icculus). So I grabbed a random llvm-dis I had on my machine (from Emscripten SDK, of all places), checked which output file Xcode produces for the *.metal inputs, and ran it on both versions.

The resulting LLVM IR disassembly is not very easy on the eyes, looking generally like this:

; <label>:13:                                     ; preds = %54, %10
  %14 = phi float [ %5, %10 ], [ %56, %54 ]
  %15 = phi i32 [ -1, %10 ], [ %55, %54 ]
  %16 = phi i32 [ 0, %10 ], [ %57, %54 ]
  %17 = sext i32 %16 to i64
  %18 = getelementptr inbounds %struct.Sphere, %struct.Sphere addrspace(3)* %2, i64 %17
  %19 = bitcast %struct.Sphere addrspace(3)* %18 to i8 addrspace(3)*
  call void @llvm.memcpy.p0i8.p3i8.i64(i8* %11, i8 addrspace(3)* %19, i64 20, i32 4, i1 false), !tbaa.struct !47
  br label %20
; <label>:20:                                     ; preds = %20, %13
  %21 = phi i32 [ 0, %13 ], [ %30, %20 ]
  %22 = phi <4 x float> [ undef, %13 ], [ %29, %20 ]
  %23 = sext i32 %21 to i64
  %24 = getelementptr inbounds %struct.Sphere, %struct.Sphere* %8, i64 0, i32 0, i32 0, i64 %23
  %25 = load float, float* %24, align 4, !tbaa !46

I’m not fluent in reading it, but by diffing the two versions, it’s not immediately obvious why one would be slower than the other. The slow one has some more load instructions with addrspace(3) on them, whereas the fast one has more calls into alloca (?) and llvm.memcpy.p0i8.p3i8.i64. Ok I guess? The alloca calls are probably not “real” calls; they just end up marking up how much thread local space will get needed after all inlining. Memcpy probably ends up being a bunch of moves in exactly one place, so if the GPU has any sort of load coalescing, then that gets used there. Or that’s my theory for “why faster”.

So Metal takeaways might be:

  1. By-value instead of by-const-reference things might be much more efficient.
  2. Metal bytecode is “just” LLVM IR, so peeking into that with llvm-dis can be useful. Note that this is still a machine-independent, very high level IR; you have no visibility into what the GPU driver will make of it in the end.

Current status and what’s next

So this simple change to pre-cache sphere/material/emissive data into thread group shared memory got GPU performance up to:

  • PC (GeForce 1080 Ti): 778 -> 1854 Mray/s,
  • Mac (Intel Iris Pro): 61 -> 99 Mray/s.

Which is not bad for such a simple change. Current code is over at 13-gpu-threadgroup-opt tag on github.

What’s next? I’m not sure. Maybe I should look at moving this out of “toy” stage and add bounding volume hierarchy & triangle meshes support? Narrator: he did not.

Daily Pathtracer 12: GPU Buffer-Oriented D3D11

Introduction and index of this series is here.

In the previous post, I changed the CPU path tracer from a recursion-based (depth first) approach to a “buffer based” (breadth first) one. It got slightly slower on PC, and stayed at around the same performance on a Mac.

I was curious how a similar approach would work on the GPU. Would it be slower or faster than a “super naïve GPU path tracer” I had before? No idea! Let’s find that out. Maybe we’ll learn something along the way.

Time for another confession: while I “conceptually” know how a GPU works, and have read & heard a lot of material on the topic, I don’t have much “actual” experience in optimizing compute shaders. Last time I was doing “serious” shader optimization was regular vertex/pixel shader workloads, and that was some years ago too. So I surely lack intuition in optimization approaches & experience with available tools! Everything below might be a complete blunder, and/or I might be making wrong conclusions. You’ve been warned!

Current depth-first GPU implementation

Recall that in my current GPU attempt (see Metal and D3D11 posts), each compute shader invocation maps to one pixel on screen. It traces several “full” ray paths; with rays being scattered off surface hits, extra rays being sent towards light sources, and so on.

Intuitively, while ray execution patterns past the primary eye rays “must be bad” for the GPU (they would be going all over the place, hitting different materials etc.)… It also has a great thing: there’s very little memory traffic. It only needs to read ~50 sphere and material structs, and only needs to write a single color per pixel.

This initial direct GPU version runs at 778 Mray/s on a PC with GeForce GTX 1080 Ti.

Initial buffer-based GPU implementation

Let’s for a moment pretend that GPU compute shader programming model does not have any unique properties or gotchas, and do the “most simple” buffer oriented implementation. It is structured very much like the buffer-oriented CPU implementation:

  1. One compute shader evaluates primary camera rays, and writes out their contribution into the image.

    • Primary ray hits can only contribute emissive color in case they hit a light directly, or a sky color in case they don’t hit anything.
    • However, whenever they hit a surface they can produce more rays for the next ray bounce: scattered ray, or a light sampling (“shadow”) ray. These new rays are appended into a StructuredBuffer with all the ray data (ray, attenuation so far, pixel location, etc.). Like this:
  2. Next up, I do a number of “bounce” iterations. This does an “indirect” compute shader dispatch (one thread for each bounce/shadow ray produced in the earlier pass). The compute shader traces these new rays (coming from a StructuredBuffer produced earlier), evaluates their own contribution, adds it to the image at ray locations, and each ray surface hit can produce more rays for the next bounce. These new rays are written into another StructuredBuffer. Then, repeat this same step again up to N bounce iterations, swapping input & output ray buffers.

This initial commit is here.

Performance: 103 Mray/s (recall that our baseline is 778 Mray/s for the simple depth-first tracer).

༼ ༎ຶ ෴ ༎ຶ༽

That’s not good at all! Also, it had a subtle lighting difference compared to the CPU implementation, mostly visible on the glass sphere. Here are images: CPU, GPU and increased contrast difference. The difference image revealed some block-like patterns too. Something is not good!

By the way, the “output rays in one CS invocation, then run another CS invocation for that amount of rays” bit is surprisingly non-intuitive, in terms of “ok how to actually do this”. Dispatching a CS on D3D11 requires passing the number of thread groups, not the number of threads! This basically means that I need to sneak in another tiny compute shader that only runs on a single element, and all it does is divide a number that’s in one buffer, and write the result into another buffer. Why must simple things be cumbersome?!

Why so slow? Let’s try to find out

I think the recommended way of figuring out why a compute shader is slow, as of first half of 2018, is roughly this:

  • Have an implementation for Playstation 4, and use profiling tools there! or,
  • Have an implementation for D3D12 or Vulkan, run on an AMD GPU, and use Radeon GPU Profiler there!

That’s just great (not!)… I have a D3D11 implementation, and my GPU is NVIDIA. Let’s see what we have there.

Visual Studio GPU Usage tool

First off, let’s check whether Visual Studio has anything useful. There’s a GPU Usage tool in there. It can tell me that in my “fast” GPU implementation all the time is taken by a compute shader (well duh), and that in my “slow” implementation all the time is taken by these many compute shader dispatches. Ok so that wasn’t very useful in this case.

NVIDIA Nsight Graphics

I have used Nsight in the past, but I frankly forgot what for (might be debugging, might be profiling). Anyhoo, I forgot everything about it, and turns out their current incarnation, Nsight Graphics 1.0, is all different anyway.

Analyzing a frame in Nsight, it tells me this:

My guess for what all that means is basically this:

According to NVIDIA blogs, “SOL” in there means “speed of light”, so I think it’s telling me that my compute shader is running at about 7% of what it could run at. That’s obviously super bad! But what to do about it; why is my shader slow? I feel about 90% SOL.

Trying random things to speed it up

Without any of the above tools clearly telling me “hey, this thing in your shader is stupid, go fix it”, I resorted to applying random bits of knowledge I might have accumulated in the past. Which is basically all tweets from Sebastian Aaltonen and random docs from conferences, e.g. DirectCompute Optimizations and Best Practices from GTC 2010, and countless others that are similar.

First up, “avoid atomic operations” sounds like a sensible thing to do. My CS, for each thread, was counting the number of rays traced (which is only used to display Mray/s figure!), by incrementing a global counter with an InterlockedAdd function. Let’s track the amount of rays inside the whole thread group via a groupshared variable, and only do the global atomic at once per group (commit). 104 -> 125 Mray/s, not bad for such a simple change.

My “process ray bounce” compute shader was operating on 64 rays at once, let’s try tweaking that number. 256 rays in one go turned out to be fastest. Trivial change, 125 -> 147 Mray/s.

Let’s put new rays into group shared memory!

What I had so far mostly does not even need to be a compute shader, since I’m not using about the only feature that makes them worth having in the 1st place – which is “group shared” (aka thread-group local, aka LDS) memory.

Right now whenever any thread in my compute shader needs to emit a new ray for next bounce pass, it does an atomic increment of a global ray counter, and writes the new ray into a StructuredBuffer. Let’s instead do this:

  1. Have a ray buffer for the whole thread group in groupshared memory.
  2. New rays are appended into that buffer (this still uses atomics, but they are on a thread group local variable),
  3. Once whole thread group is done, write it into the structured buffer with one global atomic operation and a bunch of memory copies.

I did the above, basically going like this, and this was the result…

…it’s running quite fast at 937 Mray/s though, shipit :)

Let’s fix rendering

Recall how my “initial attempt” was also subtly different from the CPU rendering, sometimes in block-like artifacts?

Turns out, I was doing a “wrong thing”, in this bit of compute shader that processes a bounce iteration:

The compute shader traces these new rays, evaluates their own contribution, adds it to the image at ray locations

The actual code is the dstImage[pixelCoord] += ... bits around here. In this compute shader, each execution thread no longer maps to a completely separate pixel on screen! They just grab a bunch of rays to process, each with their own pixel location. It can (and often does) happen that several threads at once process rays that hit the same pixel (think shadow & regular bounce ray for the same pixel; and also I run at 4 rays per pixel to get anti-aliasing…).

The dstImage[pixelCoord] += bit is not atomic at all, and presumably by optimizing the compute shader to be faster, the execution pattern of it started to be very different from before, and what was “subtle errors” turned into “whoa random garbage” now. Or that was my theory, which I haven’t 100% double checked :)

It seems that there’s no easy way to do atomic additions to floats on the GPU. You could implement that manually by doing a loop with an atomic compare/exchange, and maybe there are some GPU-specific shader extensions that for example would allow doing that for half-precision floats or somesuch. All that is “uhh sounds hard” in my book, so I decided to solve this problem by (mis)using the GPU rasterizer.

GPU rasterizer has a blending unit that can blend a lot of things, even if they hit the same locations on screen, and the results come out correctly! So in the bounce-processing compute shader, I don’t write anything into the output image; the shader only produces rays for the next bounce, and “splats” (pixel location + color) for the rasterizer to render later. The splats are also added into a buffer, which is then later on rendered as points.

Here’s a diagram that probably makes it even more confusing :)

That fixed rendering to be correct though!

Ok what’s the performance now?

Doing the above (put rays & splats into groupshared memory, write to global buffers at end of group; blend splats using the rasterizer – see commit) got performance up from 147 to 611 Mray/s. I guess Yoda was not joking in that “LDS we must use” quote.

A couple more commits later I changed how I append items from group-local buffers into the global ones. I had this before:

groupshared Data s_Data[kSize]; // "Data" is whatever payload type
groupshared uint s_DataCount;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread write out to global buffer:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
{
	uint dataStart;
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, dataStart);
	for (uint i = 0; i < s_DataCount; ++i)
		g_Data[dataStart + i] = s_Data[i];
}

This works, but only one thread in the whole group ends up doing the “copy into global buffer” work. Doing this instead was quite a bit faster:

groupshared Data s_Data[kSize];
groupshared uint s_DataCount;
groupshared uint s_DataStart;

// set count to zero at start
if (threadID == 0)
	s_DataCount = 0;
GroupMemoryBarrierWithGroupSync();

// each thread computes some data and adds it:
uint index;
InterlockedAdd(s_DataCount, 1, index);
s_Data[index] = ThisNewData;

// at the end, make first thread reserve space in global buffer and
// find where it starts:
GroupMemoryBarrierWithGroupSync();
if (threadID == 0)
	g_DataCounts.InterlockedAdd(kCounterOffset, s_DataCount, s_DataStart);
GroupMemoryBarrierWithGroupSync();

// threads in the whole group copy their portion
uint myCount = (s_DataCount + kCSGroupSize - 1) / kCSGroupSize;
uint myStart = threadID * myCount;
for (uint i = myStart; i < myStart + myCount; ++i)
	if (i < s_DataCount)
		g_Data[s_DataStart + i] = s_Data[i];

Doing the above change for how rays are copied, and how splats are copied, increased performance from 619 to 644 Mray/s.

What else could be done?

So… 644 Mray/s is still behind the “super simple direct port” that I had running at 778 Mray/s…

Some completely random guesses on what else could be done to speed up the current “put rays/splats for whole bounce into a buffer” approach:

  • The compute shaders use a lot of space in groupshared memory right now: they have to have enough space to store the maximum amount of rays & splats that might get produced by the whole group! Large amount of groupshared space means the GPU can only run a very limited amount of groups at once, which is quite bad. Read more at “Optimizing GPU occupancy and resource usage with large thread groups”.
    • I could compress my ray & splat data more, to take up less space. My ray data right now is 28 bytes (float3 position, half3 direction, half3 attenuation, uint for pixel location, light index and other flags); and splat data is 16 bytes (float3 color, uint pixel location). Ray direction could use less space (e.g. 16 bit integers for X&Y components, one bit for sign of Z); attenuations & colors could be packed into smaller space than FP16 (R11G11B10 float, or RGB9E5, or RGBM, etc.). Ray position might be ok with less data than full FP32 float too.
    • Maybe there’s no need to store “maximum possible space” for the whole thread group, and instead have a buffer of fixed size, and write it out whenever it’s filled up.
  • The “some threads possibly append into a local buffer” pattern seems to generally be called “stream compaction”, and is a candidate for using “wave-level operations”. Sadly there’s no easy or cross-platform way of doing these in D3D11.
    • D3D12 shader model 6.0 has wave intrinsics, but that requires using D3D12, and also using the new DXC shader compiler.
    • AMD has extensions to get to them in D3D11, see this or that post.
    • NVIDIA also has extensions for D3D11, see this or that post.
    • …I don’t want to be writing separate compute shaders for different GPUs just yet though.
  • Turns out that Nsight does have a lot of possibly useful counters, besides these “SOL” numbers (thanks Nathan Hoobler for the tip). Have to select them under “User Metrics” section, and of course good luck figuring out which ones of them are actually interesting :)

    The “GPU Trace” feature mentioned on Nsight website looks potentially useful too, but is not available yet at the time of writing.
  • It’s also entirely possible that this whole approach is nonsense and can never be fast anyway!

Current status and what’s next

So, I tried a buffer-oriented approach on the GPU (current code at 12-gpu-buffer-d3d11 tag), and learned a few things:

  • Compute shader optimization feels like extremely beginner-unfriendly area. I’m somewhat versed in that whole space and could even pass a Turing test in a graphics related conversation, yet still a lot of the information sounds either complicated, or is hard to find in a nicely summarized form.
    • Tools that present you with a sea of numbers don’t help the impression either.
    • Looking at responses I got on twitter, seems that I’m not alone in this, so phew, it’s not just me.
    • Apparently, using a PS4 or AMD on D3D12/Vulkan for compute shader optimization is the way to go :)
  • Global atomics are slow.
  • Using large amounts of group shared memory is slow (but can be faster than not using it at all).
  • There’s a reason why UnorderedAccessView in D3D terms has “unordered” in the name. Writes into them can and will come out in unpredictable order! I had to resort to rasterizer’s blend unit to write out my “ray splats”. Doing “wrong” things can produce some “accidental noise art” though!
  • What I got out of everything above so far is 644 Mray/s on GeForce 1080 Ti, which is a lot more complexity than the “stupidly simple” approach, and slower too :(

What’s next? I don’t know, we’ll see. Until next time!

Daily Pathtracer 11: Buffer-Oriented

Introduction and index of this series is here.

I’ll try to restructure the path tracer a bit, from a “recursion based” approach into a “buffer based” approach.

“But why?” I had a thought of playing around with the new Unity 2018.1 async/batched raycasts for a path tracer, but that API is built on a “whole bunch of rays at once” model. My current approach that does one ray at a time, recursively, until it finishes, does not map well to it.

So let’s do it differently! I have no idea if that’s a good idea or not, but eh, let’s try anyway :)

Recursive (current) approach

Current approach is basically like the diagram above. We start with casting some ray (“1”), it hits something, is scattered, we continue with the scattered ray (“2”), until maximum ray depth is reached or ray hits “sky”. Next, we start another camera ray (“3”), that is scattered (“4”), and so on. It basically goes one ray at a time, in a depth-first traversal order (using recursion in my current CPU implementations; and iterative loop in GPU implementations).

Buffer-based approach

I don’t know if “buffer based” is the correct term… I’ve also seen “stream ray tracing” and “wavefront ray tracing”, which sound similar, but I’m not sure whether they mean the exact same thing or just a somewhat similar idea. Anyway…

One possible other approach would be to do breadth-first traversal of rays. First do all primary (camera) rays, store their hit information into some buffer (hence “buffer based”). Then go look at all these hit results, scatter or process them somehow, and get a new batch of rays to process. Continue until maximum depth is reached or we’re left with no rays to process for some other reason.

Morgan McGuire’s G3D path tracer seems to be structured similarly, and from my quick look, Laine, Karras, Aila “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” suggests something along those lines too.

So the approach would basically be:

// generate initial eye rays
buffer1 = GenerateCameraRays();

// while we still have rays to do
while (!buffer1.empty())
	// for each ray in current bounce, raycast and evaluate it
	foreach (Ray r in buffer1)
		hit = HitWorld(r);
		if (hit)
			// scatter the ray; rays for the next bounce go into buffer2
			buffer2.AddRays(ScatterRay(r, hit));
			image[r.pixel] += EvaluateMaterial();
		else
			image[r.pixel] += EvaluateSkyColor();

	// swap buffers & clear the new "next bounce" one; proceed to next bounce
	swap(buffer1, buffer2);
	buffer2.clear();

What information do we need to track per-ray in these buffers? From what I can see, the current path tracer needs the following:

struct RayData
	Ray ray; // the ray itself, duh (origin + direction)
	Color atten; // current attenuation along the ray
	int pixelIndex; // which image pixel this ray is for
	int depth; // ray bounce depth, to know when we hit maximum bounces
	int lightID; // for light sampling ("shadow") rays only: which light this ray is cast towards
	bool shadow; // is this a light sampling ("shadow") ray?
	bool skipEmission; // should material emission, if hit by this ray, be ignored

How large should these ray buffers be? In the simplest form, let’s just preallocate the “maximum possible space” we think we’re going to need. One buffer for the whole image would be Width * Height * SamplesPerPixel * (1 + MaxShadowRays) rays in size (one ray can scatter into one new ray, plus several shadow rays). And we need two of these buffers, since we’re writing into a new one while processing the current one.
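As a sanity check on that capacity formula, here is the arithmetic in code form — the concrete resolution, sample count, and ray record size used below are my own assumptions for illustration, not confirmed settings:

```cpp
#include <cassert>
#include <cstddef>

// Worst-case capacity of one ray buffer, per the formula above: each
// sample can scatter into one new ray, plus up to maxShadowRays shadow rays.
size_t MaxRayCount(size_t width, size_t height, size_t samplesPerPixel,
                   size_t maxShadowRays)
{
    return width * height * samplesPerPixel * (1 + maxShadowRays);
}

// Total bytes for both buffers: we read the current bounce from one
// while writing the next bounce into the other.
size_t TotalBufferBytes(size_t rayCount, size_t bytesPerRay)
{
    return 2 * rayCount * bytesPerRay;
}
```

For example, 1280×720 at 4 samples per pixel with 3 shadow rays is ~14.7M rays per buffer; at a hypothetical 64 bytes per ray record (three SSE-padded float3s plus the ints/bools), both buffers come out to exactly 1800 MB, which lines up with the figure mentioned below — though again, I haven’t verified these are the actual parameters.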

Implementation of the above for C++ is in this commit. It works correctly; now, what’s the performance compared to our previous state? PC: 187→66 Mray/s, Mac: 41.5→39.5 Mray/s. Huh, what? This is almost three times slower on PC, but almost no performance change on Mac?!

What’s going on?

Well, for one, this approach now has a whopping 1800 megabytes (yeah, 1.8GB) of buffers to hold that ray data; and each bounce iteration reads from and writes into these giant buffers. The previous approach had none of that; the only memory traffic it had was blending results into the final pixel buffer, and some (very small) arrays of spheres and materials.

I haven’t actually dug into this deeper, but my guess on why the Mac did not get slower: 1) if this is limited by RAM bandwidth, the RAM speeds of my PC & Mac are probably not that different, while the PC’s compute speed is much higher, so the PC has much more to lose once it becomes memory-bound; and 2) the Mac has a Haswell CPU with that 128MB of L4 cache, which probably helps things a bit.

A side lesson from this: even if your memory access patterns are completely linear & nice, they are still memory accesses. It does not happen often, but a couple of times I’ve seen people approach, for example, multi-threading by going really heavy on “let’s pass buffers of data around, everywhere”. You can end up with a lot of buffers creating tons of additional memory traffic, even if the access pattern of each buffer is “super nice, linear, and full cache lines are being used”.

Anyway, right now this “buffer oriented” approach is actually quite a lot slower…

Let’s try to reduce ray data size

One possible approach to reduce memory traffic for the buffers would be to stop working on giant “full-screen, worst case capacity” buffers. We could work on buffers that are much smaller in size, and for example would fit into L1 cache; that probably would be a couple hundred rays per buffer.

So of course… let’s not do that for now :) and try to “just” reduce the amount of storage we need for one ray! “Why? We don’t ask why, we ask why not!”

Let’s go!

  • There’s no need to track depth per-ray; we can just do the “process bounces” loop to max iterations instead (commit). Performance unchanged.
  • Our float3 right now is SSE-register sized, i.e. it takes up the space of four floats, not just the three we need. Stop doing that. Ray buffers: 1800→1350MB; PC performance: 66.1→89.9 Mray/s.
  • Instead of storing a couple ints and bools per ray, put all that into a 32 bit bitfield (commit). Ray buffers: 1350→1125MB; PC performance: 89.9→107 Mray/s.
  • Change first ray bounce (camera rays); there’s little need to write all of them into buffer and immediately process them. They also don’t need to handle “current attenuation” bit (commit). PC performance: 107→133 Mray/s.
  • You know what, ray directions and attenuation colors sound like they could use something more compact than a full 32 bit float per component. Let’s try to use 16 bit floats (“half precision”) for them. And let’s use F16C CPU instructions to convert between float and half; these are generally available in Intel & AMD CPUs made since 2011. That’s these two commits (one and two). Ray buffers: 1125→787MB; PC performance: 133→156 Mray/s.
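The “pack the ints and bools into one 32-bit word” step from the list above could look something like this — the field widths here are my own guess at a plausible split, not necessarily what the actual commit does:

```cpp
#include <cassert>
#include <cstdint>

// Pack the per-ray bookkeeping (pixel index, light index, flags) into a
// single 32-bit word instead of separate ints and bools. Field widths are
// illustrative: 24 bits of pixel index, 6 bits of light ID, 2 flag bits.
struct RayFlags
{
    uint32_t bits;

    static const uint32_t kShadowBit = 1u << 31;
    static const uint32_t kSkipEmissionBit = 1u << 30;
    static const uint32_t kLightShift = 24;            // 6 bits: up to 64 lights
    static const uint32_t kLightMask = 0x3Fu;
    static const uint32_t kPixelMask = (1u << 24) - 1; // 24 bits: up to 16M pixels

    RayFlags(uint32_t pixelIndex, uint32_t lightID, bool shadow, bool skipEmission)
        : bits((pixelIndex & kPixelMask) |
               ((lightID & kLightMask) << kLightShift) |
               (shadow ? kShadowBit : 0) |
               (skipEmission ? kSkipEmissionBit : 0)) {}

    uint32_t PixelIndex() const { return bits & kPixelMask; }
    uint32_t LightID() const { return (bits >> kLightShift) & kLightMask; }
    bool IsShadow() const { return (bits & kShadowBit) != 0; }
    bool SkipEmission() const { return (bits & kSkipEmissionBit) != 0; }
};
```

The point is just that four separate fields (two ints, two bools, likely 12+ bytes with padding) collapse into 4 bytes per ray, which is where the 1350→1125MB buffer shrink would come from.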

By the way, Mac performance has stayed at ~40 Mray/s across all these commits, which makes me think the bottleneck there is not memory bandwidth but computation. But again, I haven’t investigated this further; I’m just slapping that onto “eh, probably that giant L4 cache helps”.

Status and what’s next

Code is at 11-buffer-oriented tag at github.

PC performance of the “buffer oriented” approach right now is at 156 Mray/s, which, while being behind the 187 Mray/s of the “recursion based” approach, is not “several times behind” at least. So maybe this buffer-oriented approach is not terribly bad, and I “just” need to make it work on smaller buffers that could nicely fit into the caches?

It would probably make sense to also split up “work to do per bounce” further, e.g. separate buffers for regular vs shadow rays; or even split up rays by material type, etc. Someday later!

I’m also interested to see what happens if I implement the above thing for the GPU compute shader variant. GPUs do tend to have massive memory bandwidth, after all. And the “process a big buffer in a fairly uniform way” might lead to way better GPU wave utilization. Maybe I’ll do that next.

Daily Pathtracer 10: Update C#&GPU

Introduction and index of this series is here.

Short post; nothing new. Just wanted to update C#, Unity (C#+Burst) and GPU implementations with the larger scene and optimizations from previous blog posts. So that there’s some, ahem, unity between them again. Here they are, as github commits/PRs:

A note on C# Mono performance

As Miguel de Icaza noted on github and wrote on his blog in-depth, defaults in current Mono version (5.8/5.10) are not tuned for the best floating point performance. Read his blog for details; much better defaults should be shipping in later Mono versions! If nothing else, maybe this toy project will have been useful to gently nudge Mono into improving the defaults :)

Current performance numbers, in Mray/s

| Implementation | PC | Mac |
|:---|---:|---:|
| GPU | 778 | 53.0 |
| C++, SSE+SoA HitSpheres | 187 | 41.8 |
| C++, SoA HitSpheres | 100 | 19.6 |
| C#, Unity Burst | 82.3 | 18.7 |
| C#, .NET Core | 53.0 | 13.1 |
| C#, mono -O=float32 --llvm w/ MONO_INLINELIMIT=100 | | 12.7 |
| C#, mono -O=float32 --llvm | | 10.5 |
| C#, mono -O=float32 | | 6.0 |
| C#, mono | | 5.5 |
  • PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t) with GeForce GTX 1080 Ti.
  • Mac is late-2013 MacBookPro (Core i7-4850HQ 2.3GHz, 4c/8t) with Intel Iris Pro.
  • Unity version 2018.1 beta 12 with Burst 0.2.3.
  • Mono version 5.8.1.
  • .NET Core version 2.1.4.

All code is on github at 10-impl-updates tag.

What’s next

I want to switch from a recursion/iteration oriented path tracer setup, into a stream/buffers oriented one, and see what happens. Just because! My blog, my rules :)