Daily Pathtracer Part 6: D3D11 GPU

Posted on Apr 4, 2018

#rendering #code #gpu #d3d

Introduction and index of this series is here.

In the previous post, I did a naïve Metal GPU “port” of the path tracer. Let’s make a Direct3D 11 / HLSL version now.

This will allow testing performance of this “totally not suitable for GPU” port on a desktop GPU.
HLSL is familiar to more people than Metal.
Maybe someday I’ll put this into a Unity version, and having HLSL is useful for that, since Unity uses HLSL as the shading language.
Why not D3D12 or Vulkan? Because those things are too hard for me ;) Maybe someday, but not just yet.

Ok let’s do the HLSL port

The final change is here, below are just some notes:

Almost everything from the Metal post applies here too.
Compare Metal shader with HLSL one:
- Metal is “more C++”-like: there are references and pointers (as opposed to the inout and out HLSL alternatives), structs with member functions, enums etc.
- Overall most of the code is very similar; largest difference is that I used global variables for shader inputs in HLSL, whereas Metal requires function arguments.
I used StructuredBuffers to pass data from the application side, so that it’s easy to match data layout on C++ side.
- On AMD or Intel GPUs, my understanding is that there’s no big difference between structured buffers and other types of buffers.
- However NVIDIA seems to quite like constant buffers for some usage patterns (see their blog posts: Structured Buffer Performance, Latency in Structured Buffers, Constant Buffers). If I were optimizing for GPU performance (which I am not, yet), that’s one possible area to look into.
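To illustrate the "easy to match data layout" point, here is a minimal C++ sketch. `SphereGpu` and `kSphereStride` are hypothetical names (not from the actual repo), and it assumes the HLSL side declares the same members in the same order, e.g. `struct Sphere { float3 center; float radius; };`. As far as I know, structured buffer elements are tightly packed rather than following the 16-byte constant buffer packing rules, so a plain C++ struct of floats can mirror them directly:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical element type for illustration; the HLSL struct would list
// the same members in the same order.
struct SphereGpu
{
    float center[3]; // matches a float3 in HLSL
    float radius;    // matches a float in HLSL
};

// Compile-time checks that the C++ layout is what the shader side expects.
static_assert(sizeof(SphereGpu) == 16, "SphereGpu should be 4 floats");
static_assert(offsetof(SphereGpu, radius) == 12, "radius should follow center");

// Element size in bytes; on the D3D11 side this would be the value used for
// the buffer's structure byte stride.
constexpr uint32_t kSphereStride = static_cast<uint32_t>(sizeof(SphereGpu));
```

The `static_assert` lines are the useful part: if someone adds a member on one side only, the build breaks instead of the shader silently reading garbage.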
For reading GPU times, I just do the simplest possible timer query approach, without any double buffering or anything (see code). Yes, this does kill any CPU/GPU parallelism, but here I don’t care about that. Likewise, for reading back the traced ray counter I read it immediately, without any frame delays or async readbacks.
- I did run into an issue where even when I get the results from the “whole frame” disjoint timer query, the individual timestamp queries still don’t have their data yet (this was on an AMD GPU/driver). So initially I had “everything works” on NVIDIA, but “returns nonsensical GPU times” on AMD. Testing on different GPUs is still useful, yo!
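The timestamp gotcha above boils down to "check every query, not just the disjoint one". Here is a small sketch of that logic in plain C++, without the actual D3D11 calls. `DisjointData`, `TimestampResult` and `GpuTimeMs` are hypothetical names; `DisjointData` mirrors the frequency and disjoint fields that the D3D11 disjoint timestamp query returns, and `ready` stands in for "the query's data was actually available when polled":

```cpp
#include <cstdint>

// Hypothetical mirror of the disjoint query result: ticks per second, and
// whether the timestamps in this interval are unreliable.
struct DisjointData
{
    uint64_t frequency;
    bool disjoint;
};

// Hypothetical result of polling one timestamp query.
struct TimestampResult
{
    bool ready;     // true only if this query's data was available
    uint64_t ticks; // GPU timestamp, valid only when ready is true
};

// Returns true and writes milliseconds to outMs only when all three queries
// have data and the interval is usable; otherwise returns false.
bool GpuTimeMs(bool disjointReady, const DisjointData& dj,
               const TimestampResult& begin, const TimestampResult& end,
               double& outMs)
{
    // The disjoint query being ready does not imply the timestamps are.
    if (!disjointReady || !begin.ready || !end.ready)
        return false;
    if (dj.disjoint || dj.frequency == 0 || end.ticks < begin.ticks)
        return false;
    outMs = double(end.ticks - begin.ticks) * 1000.0 / double(dj.frequency);
    return true;
}
```

In the "simplest possible" approach described above, the caller would keep polling until this returns true (or the interval turns out to be disjoint), which is exactly what stalls the CPU.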

What’s the performance?

Again… this is definitely not an efficient implementation for the GPU. But here are the numbers!

GeForce GTX 1080 Ti: 2780 Mray/s,
Radeon Pro WX 9100: 3700 Mray/s,
An old Radeon HD 7700: 417 Mray/s,
C++ CPU implementation, on this AMD Threadripper with SMT off: 135 Mray/s.

For reference, Mac Metal numbers:

Radeon Pro 580: 1650 Mray/s,
Intel Iris Pro: 191 Mray/s,
GeForce GT 750M: 146 Mray/s.

What can we learn from that?

Similar to the Mac C++ vs GPU Metal speedups, here the GPUs are roughly 3 to 27 times faster than the C++ CPU implementation (417 vs 135 Mray/s at the low end, 3700 vs 135 Mray/s at the high end).
- And again, not a fair comparison to a “real” path tracer; this one doesn’t have any BVH to traverse etc.
The Radeon here handily beats the GeForce. On paper it has slightly more TFLOPS, and I suspect some other differences might be at play (structured buffers? GCN architecture being better at “bad for GPU, port from C++” type of code? I haven’t investigated yet).
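For the record, the speedup figures are just the GPU numbers divided by the 135 Mray/s CPU number from the list above. A tiny sketch of that arithmetic (`Speedup` and the constants are hypothetical names; the values are copied from the results above):

```cpp
// Throughput numbers from the results above, in Mray/s.
const double kCpuThreadripper = 135.0;
const double kGeForce1080Ti   = 2780.0;
const double kRadeonWX9100    = 3700.0;
const double kRadeonHD7700    = 417.0;

// How many times faster the GPU is than the CPU baseline.
double Speedup(double gpuMrays, double cpuMrays)
{
    return gpuMrays / cpuMrays;
}

// Speedup(kRadeonHD7700, kCpuThreadripper)  is a bit over 3
// Speedup(kGeForce1080Ti, kCpuThreadripper) is a bit over 20
// Speedup(kRadeonWX9100, kCpuThreadripper)  is a bit over 27
```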

So there! The code is at the 06-gpud3d11 tag on the GitHub repo.

What’s next

I don’t know. Have several possible things, will do one of them. Also, geez, doing these posts every day is hard. Maybe I’ll take a couple days off :)