Daily Pathtracer 9: A wild ryg appears

Introduction and index of this series is here.

In the previous post, I did a basic SIMD/SSE implementation of the “hit a ray against all spheres” function. And then of course, me being a n00b at SIMD, I did some stupid things (and had other inefficiencies I knew about outside the SIMD function, that I planned to fix later). And then this happened:

You used the ultimate optimization technique: nerd sniping rygorous into doing it for you =) (via)

i.e. Fabian Giesen himself did a bunch of optimizations and submitted a pull request. Nice work, ryg!

His changes got performance of 107 -> 187 Mray/s on PC, and 30.1 -> 41.8 Mray/s on a Mac. That’s not bad at all for relatively simple changes!

ryg’s optimizations

Full list of changes can be seen in the pull request, here are the major ones:

Use _mm_loadu_ps to load memory into a SIMD variable (commit). On Windows/MSVC that got a massive speedup; no change on Mac/clang since clang was already generating movups instruction there.

Evaluate ray hit data only once (commit). My original code was evaluating hit position & normal for each closer sphere it had hit so far; this change only remembers the t value and sphere index instead, and calculates position & normal once, for the final closest sphere. It also shows one possible approach to “find the minimum value and its index in an SSE register”, something I was wondering about.

“Know” which objects emit light instead of searching every time (commit). This one’s not SIMD at all, and super obvious. In the explicit light sampling loop, for each ray bounce off a diffuse surface, I was going through all spheres, checking “hey, do you emit light?”. But only a couple of them do! So instead, keep an explicit array of light-emitting sphere indices, and only go through that. This was another massive speedup.
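A minimal sketch of that idea (the names here are illustrative, not necessarily what the repo uses): build the list of emissive sphere indices once, up front, and let the light sampling loop iterate over only that.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical material representation; only the emission part matters here.
struct Material { float emR, emG, emB; };

static bool IsEmissive(const Material& m)
{
    return m.emR > 0 || m.emG > 0 || m.emB > 0;
}

// Build the "which spheres emit light" index list once when the scene is set
// up, instead of scanning every sphere on every diffuse bounce.
static std::vector<int> BuildEmissiveIndices(const std::vector<Material>& mats)
{
    std::vector<int> indices;
    for (size_t i = 0; i < mats.size(); ++i)
        if (IsEmissive(mats[i]))
            indices.push_back((int)i);
    return indices;
}
```

The explicit light sampling loop then walks only this (tiny) index list.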

Several small simplifications (commit, commit, commit, commit, commit). Each one self-explanatory.

What’s next

I want to apply some of the earlier & above optimizations to the C#, C#+Burst and GPU implementations too, just so that all versions are on the same ground again. Maybe I’ll do that next!

Daily Pathtracer 8: SSE HitSpheres

Introduction and index of this series is here.

In the previous post, I talked about the concept of SIMD, structure-of-arrays layout, and one (not good) approach of “let’s use SSE for the float3 struct”. Just rearranging our sphere data into SoA layout for the HitSpheres function gave a nice speed boost. Now that the data is all nice, let’s try to use actual SSE SIMD code for it.

Note: I don’t have much experience with SIMD; this is a learning exercise for me too. Things I do might be totally stupid. You’ve been warned!

SIMD helpers

At least right now, while I don’t have much experience, I do find SSE intrinsic functions “a bit” (mildly said) unreadable and scary looking. Too many underscores, too cryptic names; and I’m sure (I hope?) half of the instructions make total sense from a hardware perspective, but not so much from a “typical programmer” perspective. Or maybe that’s just me.

Anyway, I added a tiny float4 struct & helpers to help my eyes a bit, so that I can write float4 and get the __m128 underneath, or likewise, just “add” two float4s and have that turn into an _mm_add_ps, and so on.
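For illustration, a minimal version of what such a wrapper can look like (the actual struct in the repo has many more operations; these names are just a sketch):

```cpp
#include <xmmintrin.h> // SSE

struct float4
{
    __m128 m;
    float4() {}
    explicit float4(float v) : m(_mm_set1_ps(v)) {} // splat one value into all 4 lanes
    float4(float x, float y, float z, float w) : m(_mm_set_ps(w, z, y, x)) {}
    explicit float4(__m128 v) : m(v) {}
    float getX() const { return _mm_cvtss_f32(m); } // lowest lane as a scalar
};

// Operators hide the intrinsic names: a + b instead of _mm_add_ps(a, b).
inline float4 operator+(float4 a, float4 b) { return float4(_mm_add_ps(a.m, b.m)); }
inline float4 operator-(float4 a, float4 b) { return float4(_mm_sub_ps(a.m, b.m)); }
inline float4 operator*(float4 a, float4 b) { return float4(_mm_mul_ps(a.m, b.m)); }
```

With that, the SSE code below reads almost like regular float math.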

HitSpheres with SSE

Without further ado, my HitSpheres function with SSE implementation is here. It’s very likely that it has some, errr, “not very efficient” things in there, please do let me know!

The SSE version comes out at ~100 lines, compared with ~40 lines for the C++ one. So it’s quite a bit more verbose, and somewhat less readable (for me, right now…), but not terribly cryptic. Here’s what it does, step by step, with some non-essential parts skipped.

We’ll be doing intersection checking of one ray against all spheres, 4 spheres at a time. In the main loop of the function, we’ll load data for 4 spheres into SSE float4 type variables, and do intersection checks against the ray; with ray data duplicated (“splatted”) identically into all 4 lanes of float4 variables. At the end, we’ll have up to 4 intersection results and will pick closest one.

Prepare data for the ray and min/max distances. Duplicate into 4-wide variables:

float4 rOrigX = SHUFFLE4(r.orig, 0, 0, 0, 0);
float4 rOrigY = SHUFFLE4(r.orig, 1, 1, 1, 1);
float4 rOrigZ = SHUFFLE4(r.orig, 2, 2, 2, 2);
float4 rDirX = SHUFFLE4(r.dir, 0, 0, 0, 0);
float4 rDirY = SHUFFLE4(r.dir, 1, 1, 1, 1);
float4 rDirZ = SHUFFLE4(r.dir, 2, 2, 2, 2);
float4 tMin4 = float4(tMin);
float4 tMax4 = float4(tMax);

We’ll be storing current closest hit (position, normal, t value, sphere index) for each “lane” here. Sphere index is directly in __m128i type since I don’t have something like an int4 helper struct.

float4 hitPosX, hitPosY, hitPosZ;
float4 hitNorX, hitNorY, hitNorZ;
float4 hitT = float4(tMax);
__m128i id = _mm_set1_epi32(-1);

The loop goes over all spheres, 4 at a time. The calling code already makes sure that if the total number of spheres is not a multiple of 4, then extra “fake” entries are present, with “impossible data” (zero radius etc.). We can just go over them and the ray will never hit them. At the start of the loop, I just load the center & squared radius for 4 spheres. Right now my “load” implementation uses an unaligned load; I should perhaps switch to an aligned one instead.

for (int i = 0; i < spheres.simdCount; i += kSimdWidth)
    // load data for 4 spheres
    float4 sCenterX = float4(spheres.centerX + i);
    float4 sCenterY = float4(spheres.centerY + i);
    float4 sCenterZ = float4(spheres.centerZ + i);
    float4 sSqRadius = float4(spheres.sqRadius + i);

Next up is basic math, does exactly what it says; just for 4 spheres at once:

float4 ocX = rOrigX - sCenterX;
float4 ocY = rOrigY - sCenterY;
float4 ocZ = rOrigZ - sCenterZ;
float4 b = ocX * rDirX + ocY * rDirY + ocZ * rDirZ;
float4 c = ocX * ocX + ocY * ocY + ocZ * ocZ - sSqRadius;
float4 discr = b * b - c;

Now comes a branch that says “is discriminant for any of 4 spheres positive?”:

bool4 discrPos = discr > float4(0.0f);
// if ray hits any of the 4 spheres
if (any(discrPos))

In SIMD programming, quite similar to GPU shader programming, it’s common to use branch-less code, i.e. compute both sides of some check, and “select” one or another based on a condition. Branches are possible of course, and can be beneficial when all or most “lanes” tend to take the same side of the branch, or if it allows saving a large amount of calculations.

Here, bool4 is actually exactly the same as float4; it holds 128 bits worth of data. The comparison operator (the _mm_cmpgt_ps instruction) sets all bits of a 32-bit “lane” to 1 or 0 accordingly. any is implemented via the _mm_movemask_ps instruction, which returns a regular integer with one bit for the highest bit of each SSE register lane. If that returns non-zero, it means that some of the four spheres have a positive discriminant here. Otherwise, none of the spheres are hit by this ray, and we can move on to the next batch of 4 spheres.
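As a sketch of how small that any helper can be (here on a raw __m128, which is what bool4 wraps):

```cpp
#include <xmmintrin.h>

typedef __m128 bool4; // each lane holds all-ones (true) or all-zeros (false)

static inline bool any(bool4 v)
{
    // _mm_movemask_ps packs the top (sign) bit of each of the 4 lanes into
    // the low 4 bits of a regular int; non-zero means "some lane is true".
    return _mm_movemask_ps(v) != 0;
}
```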

Next up is computing t values for two possible ray-sphere intersection points (remember: for 4 spheres at once), and checking which ones of those satisfy conditions. The conditions are:

  • Must have had a positive discriminant earlier (we are executing this code if any sphere in a batch intersects a ray, but some of them might not),
  • t must be larger than minimum distance passed as this function argument,
  • t must be smaller than maximum distance (passed as argument, and kept on decreasing inside this function as intersections are found)
float4 discrSq = sqrtf(discr);
// ray could hit spheres at t0 & t1
float4 t0 = -b - discrSq;
float4 t1 = -b + discrSq;
bool4 msk0 = discrPos & (t0 > tMin4) & (t0 < tMax4);
bool4 msk1 = discrPos & (t1 > tMin4) & (t1 < tMax4);

Now we have a “mask” (all bits set) for which intersections are “good” at t0 (mask msk0), and similarly for t1. We need the closer “good” one, so whenever t0 is “good” we should take that, otherwise take t1. Recall that “SIMD prefers branch-less select-style programming”? This is our first occurrence of that, essentially doing a t = msk0 ? t0 : t1, for each of the four lanes. And at the end, the final “good” intersections are the ones where either t0 or t1 was suitable; a union of the masks achieves that.

// where sphere is hit at t0, take that; elsewhere take t1 hit
float4 t = select(t1, t0, msk0);
bool4 msk = msk0 | msk1;

Given that the result = mask ? this : that type of operation seems to be very common in SIMD programming, you might think that SSE would have a built-in instruction for that. But noooo, it took until SSE4.1 to add the _mm_blendv_ps instruction. If you need to target earlier CPUs, you have to do a funky bit logic dance to achieve the same result. “SSE: mind the gap!“ by Fabian Giesen has this and a lot more tricks to deal with SSE oddities.
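For illustration, here’s a sketch of the two flavors of select, on raw __m128 values; the bit logic version works everywhere, the single-instruction version needs an SSE4.1 target:

```cpp
#include <xmmintrin.h>

// result = mask ? b : a, per lane.
#if defined(__SSE4_1__)
#include <smmintrin.h>
static inline __m128 select(__m128 a, __m128 b, __m128 mask)
{
    return _mm_blendv_ps(a, b, mask); // one instruction on SSE4.1+
}
#else
static inline __m128 select(__m128 a, __m128 b, __m128 mask)
{
    // the pre-SSE4.1 bit logic dance: (mask & b) | (~mask & a),
    // relying on comparison masks being all-ones or all-zeros per lane
    return _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
}
#endif
```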

Next up, if any sphere got actually hit, we compute the intersection point, normal and so on, and update our “best so far” (four of them) variables with that, using select with a mask:

if (any(msk))
    // compute intersection at t (id, position, normal), and overwrite current best
    // results based on mask.
    // also move tMax to current intersection
    id = select(id, curId, msk);
    tMax4 = select(tMax4, t, msk);
    float4 posX = rOrigX + rDirX * t;
    float4 posY = rOrigY + rDirY * t;
    float4 posZ = rOrigZ + rDirZ * t;
    hitPosX = select(hitPosX, posX, msk);
    // ...similar for other hitFoo variables

That’s the “check all spheres” loop done. Now, after the loop, we’re left with up to 4 sphere intersections, and have to pick the closest one. This is the part where I’m sure my “solution” is all kinds of suboptimal; suggestions welcome :) I suspect it’s suboptimal because: 1) it’s a lot of repetitive code, and 2) it’s a lot of scalar-looking and branchy-looking code too.

Out of four intersection t values, find the smallest one via a “horizontal minimum” helper:

float minT = hmin(hitT);

How is hmin implemented? To find minimum in an N-wide SSE variable, we need to do logN steps of shuffling and “minimum” operation. With SSE, width is 4, so minimum can be found in two shuffles and two min.

float hmin(float4 v)
    v = min(v, SHUFFLE4(v, 2, 3, 0, 0));
    v = min(v, SHUFFLE4(v, 1, 0, 0, 0));
    return v.getX();

And then once I know the minimum t distance out of all four results, I just check each of them one by one, “was this the minimum? extract and return data if so”, e.g. for the first lane:

if (hitT.getX() == minT)
    outHit.pos = float3(hitPosX.getX(), hitPosY.getX(), hitPosZ.getX());
    outHit.normal = float3(hitNorX.getX(), hitNorY.getX(), hitNorZ.getX());
    outHit.t = hitT.getX();
    return (int16_t)_mm_extract_epi16(id, 0);

The second lane is exactly the same, just getY component is used, etc. Repeat for all four SSE lanes.

…and that’s the SSE implementation of HitSpheres!

Note that I could have used no branches inside the loop at all; just do everything with masked selects. In my test having the branches there is actually a bit faster, so I left them in.

Ok, what’s the performance?

  • PC (AMD ThreadRipper, 16 threads): 186 -> 194 Mray/s.
  • Mac (MacBookPro, 8 threads): 49.8 -> 55.2 Mray/s.

Hmpft. It is faster, but not awesomely faster. What’s going on? Two things:

  1. HitSpheres is not all the work that the path tracer is doing; just a part of it,
  2. We only have 9 spheres in the whole scene. All this effort to process four spheres at once, when in total there’s only nine of them… yeah.

Time for a larger scene then!

46 spheres now, with two of them being light-emitting surfaces. I know, that’s not a “properly complex” scene, but on the other hand, I don’t have any ray intersection acceleration structure either; each ray is checking all the spheres.

With this larger scene, here are some updated numbers:

  • This post, HitSpheres with SSE: PC 107, Mac 30.1 Mray/s.
  • Previous post:
    • HitSpheres with SoA layout: PC 78.8, Mac 17.4 Mray/s.
    • Before any SIMD/SoA stuff: PC 48.0, Mac 12.3 Mray/s.

So this & the previous post combined, by optimizing only the “ray against N spheres” part, got a 2-2.5x total speedup so far. Not too bad!

For reference, (still unoptimized at all) GPU compute shader implementation on this larger scene:

  • PC GeForce 1080 Ti, DX11: 648 Mray/s,
  • Mac Intel Iris Pro, Metal: 41.8 Mray/s.

…and that’s it for today :) The above changes are in this PR, or at 08-simd tag.

What’s next

I have done the “one ray vs N spheres” SIMD-with-SSE part. Quite likely that’s not the best approach (though it’s fairly isolated: just one function). Doing an “N rays at once” type of SIMD might make more sense for performance.

So maybe that’s next, or maybe I’ll look into a “stream/buffer” oriented setup instead. Or neither of the above :) Stay tuned!

Daily Pathtracer Part 7: Initial SIMD

Introduction and index of this series is here.

Let’s get back to the CPU C++ implementation. I want to try SIMD and similar stuffs now!

Warning: I don’t have much (any?) actual experience with SIMD programming. I know conceptually what it is, have written a tiny bit of SIMD assembly/intrinsics code in my life, but nothing that I could say I “know” or even “have a clue” about it. So whatever I do next, might be completely stupid! This is a learning exercise for me too! You’ve been warned.

SIMD, what’s that?

SIMD stands for “Single instruction, multiple data”, and the first sentence about it on Wikipedia says “a class of parallel computers in Flynn’s taxonomy”, which is, errr, not that useful as an introduction :) Basically, SIMD can be viewed as CPU instructions that do some “operation” on a bunch of “items” at once. For example, “take these 4 numbers, add these other 4 numbers to them, and get a 4-number result”.
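That “4 numbers at once” example, written with SSE intrinsics directly:

```cpp
#include <xmmintrin.h>

// One _mm_add_ps instruction adds four pairs of floats at once.
static void add4(const float a[4], const float b[4], float result[4])
{
    __m128 va = _mm_loadu_ps(a); // load 4 floats into a 128-bit register
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(result, _mm_add_ps(va, vb)); // 4 additions, one instruction
}
```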

Different CPUs have had a whole bunch of different SIMD instruction sets over the years, and the most common today are:

  • SSE for x86 (Intel/AMD) CPUs.
    • SSE2 can be pretty much assumed to be “everywhere”; it’s been in Intel CPUs since 2001 and AMD CPUs since 2003.
    • There are later SSE versions (SSE3, SSE4 etc.), and then later on there’s AVX too.
  • NEON for ARM (“almost everything mobile”) CPUs.

It’s often said that “graphics” or “multimedia” is an area where SIMD is extremely beneficial, so let’s see how or if that applies to our toy path tracer.

Baby’s first SIMD: SSE for the vector class

The first thing that almost everyone immediately notices is “hey, I have this 3D vector class, let’s make that use SIMD”. This seems to also be taught as “that’s what SIMD is for” at many universities. In our case, we have a float3 struct with three numbers in it, and a bunch of operations (addition, multiplication etc.) do the same thing on all of them.

Spoiler alert: this isn’t a very good approach. I know that, but I also meet quite many people who don’t, for some reason. See for example this old post “The Ubiquitous SSE vector class: Debunking a common myth“ by Fabian Giesen.

Let’s make that use SSE instructions. A standard way to use them in C++ is via “intrinsic instructions”, that basically have a data type of __m128 (4 single precision floats, total 128 bits) and instructions like _mm_add_ps and so on. It’s “a bit” unreadable, if you ask me… But the good news is, these data types and functions work on pretty much all compilers (e.g. MSVC, clang and gcc) so that covers your cross-platform needs, as long as it’s Intel/AMD CPUs you’re targeting.

I turned my float3 to use SSE very similar to how it’s described in How To Write A Maths Library In 2016 by Richard Mitton. Here’s the commit.

  • PC (AMD ThreadRipper, 16 threads): 135 -> 134 Mray/s.
  • Mac (MacBookPro, 8 threads): 38.7 -> 44.2 Mray/s.

Huh, that’s a bit underwhelming, isn’t it? Performance on PC (MSVC compiler) is pretty much the same. Performance on Mac is quite a bit better, but nowhere near “yeah 4x faster!” levels :)

This does make sense though. The float3 struct is explicitly using SIMD now, however a whole lot of remaining code still stays “scalar” (i.e. using regular float variables). For example, one of the “heavy” functions, where a lot of time is spent, is HitSphere, and it has a lot of floats and branches in it:

bool HitSphere(const Ray& r, const Sphere& s, float tMin, float tMax, Hit& outHit)
    float3 oc = r.orig - s.center; // SIMD
    float b = dot(oc, r.dir); // scalar
    float c = dot(oc, oc) - s.radius*s.radius; // scalar
    float discr = b*b - c; // scalar
    if (discr > 0) // branch
        float discrSq = sqrtf(discr); // scalar
        float t = (-b - discrSq); // scalar
        if (t < tMax && t > tMin) // branch
            outHit.pos = r.pointAt(t); // SIMD
            outHit.normal = (outHit.pos - s.center) * s.invRadius; // SIMD
            outHit.t = t;
            return true;
        t = (-b + discrSq); // scalar
        if (t < tMax && t > tMin) // branch
            outHit.pos = r.pointAt(t); // SIMD
            outHit.normal = (outHit.pos - s.center) * s.invRadius; // SIMD
            outHit.t = t;
            return true;
    return false;

I’ve also enabled __vectorcall on MSVC and changed some functions to take float3 by value instead of by const-reference (see commit), but it did not change things noticeably in this case.

I’ve heard of “fast math” compiler setting

As a topic jump, let’s try telling the compiler “you know what, you can pretend that floating point numbers obey simple algebra rules”.

What? Yeah, that’s right; floating point numbers as typically represented in computers (e.g. float, double) have a lot of interesting properties. For example, a + (b + c) isn’t necessarily the same as (a + b) + c with floats. You can read a whole lot about them in Bruce Dawson’s blog posts.
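A concrete example of that non-associativity, with plain IEEE doubles (compiled without any fast-math flags, of course):

```cpp
// Each addition rounds to the nearest representable double, and the
// roundings differ depending on grouping.
static bool AdditionIsAssociative()
{
    double x = (0.1 + 0.2) + 0.3; // rounds to 0.6000000000000001
    double y = 0.1 + (0.2 + 0.3); // rounds to 0.6
    return x == y;                // false, even though algebra says true
}
```

Fast-math flags give the compiler permission to ignore exactly these differences.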

C++ compilers have options to say “you know what, relax with floats a bit; I’m fine with potentially not-exact optimizations to calculations”. In MSVC that’s /fp:fast flag; whereas on clang/gcc it’s -ffast-math flag. Let’s switch them on:

  • PC: 134 -> 171 Mray/s.
  • Mac: 44.2 -> 42.6 Mray/s.

It didn’t do anything on Mac (clang compiler), in fact made it a tiny bit slower… but whoa, look at that Windows (MSVC compiler) speedup! ⊙.☉

What’s a proper way to do SIMD?

Let’s get back to SIMD. The way I did float3 with SSE has some downsides, with major ones being:

  • It does not lead to all operations using SIMD, for example doing dot products (of which there are plenty in graphics) ends up doing a bunch of scalar code. Yes, that quite likely could be improved somehow, but still, outside of regular “add/multiply individual vector components”, other operations do not easily map to SIMD.
  • SSE works on 4 floats at a time, but our float3 only uses three. This leaves one “SIMD lane” not doing any useful work. If I were to use AVX SIMD instructions – these work on 8 floats at a time – that would get even less efficient.

I think it’s general knowledge that “a more proper” approach to SIMD (or optimization in general) is changing the mindset, by essentially going from “one thing at a time” to “many things at a time”.

Aside: in shader programming, you might think that basic HLSL types like float3 or float4 map to “SIMD” type of processing, but that’s not the case on modern GPUs. It was true 10-15 years ago, but since then GPUs moved to so-called “scalar” architectures. Every float3 in shader code just turns into three floats. But the key thing is: the GPU is not executing one shader at a time; it runs a whole bunch of them (on separate pixels/vertices/…)! So each and every float is “in fact” something like a 64-wide SIMD float, with every “lane” belonging to a different pixel. “Running Code at a Teraflop: How a GPU Shader Core Works“ by Kayvon Fatahalian is a great introduction to this.

Mike Acton has a lot of material on “changing mindset for optimization”, e.g. this slides-as-post-it-notes gallery, or the CppCon 2014 talk. In our case, we have a lot of “one thing at a time”: one ray vs one sphere intersection check; generating one random value; and so on.

There are several ways to make the current code more SIMD-friendly:

  • Work on more rays at once. I think this is called “packet ray tracing”. 4 rays at once would map nicely to SSE, 8 rays at once to AVX, and so on.
  • Still work on one ray at a time, but at least change HitWorld/HitSphere functions to check more than one sphere at a time. This one’s easier to do right now, so let’s try that :)

Right now code to do “ray vs world” check looks like this:

  for (all spheres)
    if (ray hits sphere closer)
      remember it;
  return closest;

conceptually, it could be changed to this, with N probably being 4 for SSE:

  for (chunk-of-N-spheres from all spheres)
    if (ray hits chunk-of-N-spheres closer)
      remember it;
  return closest;

I’ve heard of “Structure of Arrays”, what’s that?

Before diving into doing that, let’s rearrange our data a bit. Very much like aside on the GPUs above, we probably want to split our data into “separate components”, so that instead of all spheres being an “array of structures” (AoS) style:

struct Sphere { float3 center; float radius; };
Sphere spheres[];

it would instead be a “structure of arrays” (SoA) style:

struct Spheres {
    float centerX[];
    float centerY[];
    float centerZ[];
    float radius[];
};

this way, whenever “test ray against N spheres” code needs to fetch, say, radius of N spheres, it can just load N consecutive numbers from memory.
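A sketch of what that fetch eventually looks like in SSE terms (a tiny standalone version; the real struct also has centerX/Y/Z arrays and so on):

```cpp
#include <xmmintrin.h>

// Minimal SoA layout for illustration.
struct SpheresSoA
{
    float radius[8];
};

static __m128 LoadRadii(const SpheresSoA& s, int i)
{
    // radii of spheres i..i+3: four consecutive floats, one unaligned load
    return _mm_loadu_ps(s.radius + i);
}
```

With an AoS layout, the same fetch would need four scattered loads plus shuffling to get the radii side by side.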

Let’s do just that, without doing actual SIMD for the ray-vs-spheres checking yet. Instead of a bool HitSphere(Ray, Sphere, ...) function, have a int HitSpheres(Ray, SpheresSoA, ...) one, and then bool HitWorld() function just calls into that (see commit).

  • PC: 171 -> 184 Mray/s.
  • Mac: 42.6 -> 48.1 Mray/s.

Oh wow. This isn’t even doing any explicit SIMD; just shuffling data around, but the speed increase is quite nice!

And then I noticed that the HitSpheres function never needs to know the sphere radius (it needs only the squared radius), so we might just as well put that into SoA data during preparation step (commit). PC: 184 -> 186, Mac: 48.1 -> 49.8 Mray/s. Not much, but nice for such an easy change.

…aaaand that’s it for today :) The above changes are in this PR, or at 07-simd tag.

Learnings and what’s next


  • You likely won’t get big speedups from “just” changing your Vector3 class to use SIMD.
  • Just rearranging your data (e.g. AoS -> SoA layout), without any explicit SIMD usage, can actually speed things up! We’ll see later whether it also helps with explicit SIMD.
  • Play around with compiler settings! E.g. /fp:fast on MSVC here brought a massive speedup.

I didn’t get to the potentially interesting SIMD bits. Maybe next time I’ll try to make HitSpheres function use explicit SIMD intrinsics, and we’ll reflect on that. Until next time!

Daily Pathtracer Part 6: D3D11 GPU

Introduction and index of this series is here.

In the previous post, I did a naïve Metal GPU “port” of the path tracer. Let’s make a Direct3D 11 / HLSL version now.

  • This will allow testing performance of this “totally not suitable for GPU” port on a desktop GPU.
  • HLSL is familiar to more people than Metal.
  • Maybe someday I’d put this into a Unity version, and having HLSL is useful, since Unity uses HLSL as the shading language.
  • Why not D3D12 or Vulkan? Because those things are too hard for me ;) Maybe someday, but not just yet.

Ok let’s do the HLSL port

The final change is here, below are just some notes:

  • Almost everything from Metal post actually applies.
  • Compare Metal shader with HLSL one:
    • Metal is “more C++”-like: there are references and pointers (as opposed to inout and out HLSL alternatives), structs with member functions, enums etc.
    • Overall most of the code is very similar; largest difference is that I used global variables for shader inputs in HLSL, whereas Metal requires function arguments.
  • I used StructuredBuffers to pass data from the application side, so that it’s easy to match data layout on C++ side.
    • On AMD or Intel GPUs, my understanding is that there’s no big difference between structured buffers and other types of buffers.
    • However NVIDIA seems to quite like constant buffers for some usage patterns (see their blog posts: Structured Buffer Performance, Latency in Structured Buffers, Constant Buffers). If I were optimizing for GPU performance (which I am not, yet), that’s one possible area to look into.
  • For reading GPU times, I just do the simplest possible timer query approach, without any double buffering or anything (see code). Yes, this does kill any CPU/GPU parallelism, but here I don’t care about that. Likewise, for reading back traced ray counter I read it immediately without any frame delays or async readbacks.
    • I did run into an issue where even when I get the results from the “whole frame” disjoint timer query, the individual timestamp queries still don’t have their data yet (this was on AMD GPU/driver). So initially I had “everything works” on NVIDIA, but “returns nonsensical GPU times” on AMD. Testing on different GPUs is still useful, yo!

What’s the performance?

Again… this is definitely not an efficient implementation for the GPU. But here are the numbers!

  • GeForce GTX 1080 Ti: 2780 Mray/s,
  • Radeon Pro WX 9100: 3700 Mray/s,
  • An old Radeon HD 7700: 417 Mray/s,
  • C++ CPU implementation, on this AMD Threadripper with SMT off: 135 Mray/s.

For reference, Mac Metal numbers:

  • Radeon Pro 580: 1650 Mray/s,
  • Intel Iris Pro: 191 Mray/s,
  • GeForce GT 750M: 146 Mray/s.

What can we learn from that?

  • Similar to Mac C++ vs GPU Metal speedups, here the speedup is also between 4 and 27 times faster.
    • And again, not a fair comparison to a “real” path tracer; this one doesn’t have any BVH to traverse etc.
  • The Radeon here handily beats the GeForce. On paper it has slightly more TFLOPS, and I suspect some other differences might be at play (structured buffers? GCN architecture being better at “bad for GPU, port from C++” type of code? I haven’t investigated yet).

So there! The code is at 06-gpud3d11 tag on github repo.

What’s next

I don’t know. Have several possible things, will do one of them. Also, geez, doing these posts every day is hard. Maybe I’ll take a couple days off :)

Daily Pathtracer Part 5: Metal GPU!

Introduction and index of this series is here.

Let’s make a super-naïve implementation for a GPU! Did I mention that it’s going to be super simple and not optimized for GPUs at all? I did, good. This will be the “minimal amount of work” type of port, maybe someday restructured to be more efficient.

Why Metal?

I already have 1) a C++ implementation handy, 2) a Mac nearby, and 3) Metal is easy to use, and especially easy to move to from a C++ implementation.

The Metal Shading Language (see spec) is basically a C++11/C++14 variant, with some additions (keywords to indicate address spaces and shader entry points; attributes to indicate bindings & other metadata), and some removals (no virtuals, no exceptions, no recursion, …).

I wrote about this before, but IMHO Metal occupies a sweet spot between “low-level, access to, ahem, metal” (Vulkan, DX12) and “primarily single threaded, magic drivers” (OpenGL, DX11) APIs. It gives more control in some parts, keeps other parts mostly unchanged, while still being conceptually simple, and simple to use. Though with Metal 2 “argument buffers” it’s not so simple anymore, but you can just ignore them if you don’t use them.

Let’s port C++ path tracer to Metal!

The majority of the code translates to the Metal shader pretty much as-is, and is extremely similar to the walkthrough in Part 1. See Shaders.metal. And then there’s a bunch of plumbing on the app side, to create buffers, feed them with data, estimate GPU running times, read back the number of rays created on the GPU, etc. etc. – nothing fancy, just plumbing – see Renderer.mm changes.

The biggest change I had to do was dealing with lack of recursion in Metal (this is true for most/all GPU shading languages today). C++ code is written in a traditional recursive manner:

Color Trace(Ray r,...)
	if (HitWorld(r, ...))
		(Ray scattered, Color attenuation) = Scatter(r, ...);
		return emission + attenuation * Trace(scattered, ...);
	return skyColor;

we can reformulate the above into a loop-based approach instead:

Color Trace(Ray r,...)
	Color result = (0,0,0)
	Color curAttenuation = (1,1,1)
	for (iter = 0 to maxBounceDepth)
		if (HitWorld(r, ...))
			(Ray scattered, Color attenuation) = Scatter(r, ...);
			result += curAttenuation * emission;
			// modulate attenuation and continue with scattered ray
			curAttenuation *= attenuation;
			r = scattered;
		else
			// ray hit the sky: add its contribution and stop bouncing
			result += curAttenuation * skyColor;
			break; // stop looping
	return result;

While this approach might be useful for CPU path tracing optimizations (I’ll find out later!), it also neatly solves lack of recursion on the GPU side. So that’s exactly what I put into Metal shader code.

The actual path tracer is a compute shader, but in the current state could have been a pixel shader just as well – it does not use any of “compute specific” functionality yet.

So that’s about it, final code changes looked like this. As you can see, mostly either plumbing or copy-paste from existing C++ code with small modifications.

I just copied & pasted most of the code, without any attempt at “sharing” any of it between the C++ and Metal shader versions. If this was a “production” path tracer, and/or I had some idea of what I want to achieve in the end, then sharing code might be useful. Right now, it’s easier & faster to just have the code separately.

Does it work?

Yeah, I guess it does work! As I mentioned before, this is definitely not an efficient implementation for the GPU. On the other hand… quite likely not an efficient implementation for the CPU either… But whereas CPU one is “not optimal/optimized”, the GPU one is more on the “well that’s stupidly slow” front. But let’s check performance anyway :)

  • MacBook Pro (2013, Core i7 2.3 GHz, GeForce GT 750M + Intel Iris Pro):
    • GeForce GT 750M: 146 Mray/s
    • Intel Iris Pro: 191 Mray/s
    • CPU: 38 Mray/s
  • iMac 5K (2017, Core i7 4.2 GHz, Radeon Pro 580):
    • Radeon Pro 580: 1650 Mray/s
    • CPU: 59 Mray/s

What can we learn from that?

  • Even this stupidly slow direct port, that should run like molasses on the GPU, is between 4 and 27 times faster than a simple C++ implementation!
  • The integrated Intel GPU in my laptop is in some cases faster than the discrete Nvidia one. I had noticed this before in other workloads; at least on those Mac models the discrete is only faster if you use significant memory bandwidth so that it gets advantage of the VRAM. In pure computation, Intel Iris Pro is surprisingly effective.
  • This is a toy path tracer, and neither C++ nor the GPU implementations are optimized, but overall GPUs being about 10x faster than CPUs at it seems to be expected. E.g. our progressive lightmapper team is seeing roughly similar speedups.

Notes / gotchas

Random notes on things I ran into while doing this:

  • If you want to use features above Metal 1.0 language version, and use built-in Xcode *.metal file handling rules, it’s not exactly intuitive how to tell it that “yo, I need Metal version X”. Turns out it’s under Xcode project “Build Phases” settings.
  • If you set Metal language version to something like “Mac Metal 1.2” (-std=osx-metal1.2) – I forgot what I even wanted that for, perhaps to get read-write textures – you’ll need to sprinkle thread or device etc. address space qualifiers to most/all references and pointers.
  • That read_write access attribute from Metal 1.2 that I wanted to use… I could not get it to actually work. So I went with a double-buffered approach of having two textures; one for previous results, and another for new results.
  • If you do wrong things, it’s quite easy to either make the macOS window manager go crazy, or have the machine reboot. In my case, I accidentally made an infinite loop with the initial rejection sampling based “random point inside disk” function. On an Intel GPU this resulted in screen areas outside my app showing garbage state from previously run apps; and on an Nvidia GPU it just rebooted. This is one of the under-appreciated areas where Microsoft (yes, in Windows Vista!) made the situation much better… ten years ago! It’s much harder to make a machine reboot by doing bad things in a shader on Windows.
  • Watch out for NaNs. I had everything working on my Intel/Nvidia GPU machine, but on the AMD GPU it was all black initially. Turns out, I had uninitialized data in a floating point texture that happens to be NaNs on AMD; and another place where a Schlick’s approximation function could generate a very small negative argument for pow(). Testing on different GPUs is useful, yo.
  • There’s no good way to do GPU timing on Metal (as far as I can see), so I do an approximate thing of timing CPU side between command buffer submission and until the GPU is done with it, via a completion handler.

What’s next

Maybe let’s try a DX11 GPU implementation, just to see how this super slow GPU approach works out on a desktop GPU?