Daily Pathtracer 9: A wild ryg appears

Posted on Apr 13, 2018

#rendering #code

Introduction and index of this series is here.

In the previous post, I did a basic SIMD/SSE implementation of the “hit a ray against all spheres” function. And then of course, me being a n00b at SIMD, I did some stupid things (and had other inefficiencies I knew about outside the SIMD function, that I planned to fix later). And then this happened:

You used the ultimate optimization technique: nerd sniping rygorous into doing it for you =) (via)

i.e. Fabian Giesen himself did a bunch of optimizations and submitted a pull request. Nice work, ryg!

His changes got performance of 107 -> 187 Mray/s on PC, and 30.1 -> 41.8 Mray/s on a Mac. That’s not bad at all for relatively simple changes!

ryg’s optimizations

Full list of changes can be seen in the pull request, here are the major ones:

Use _mm_loadu_ps to load memory into a SIMD variable (commit). On Windows/MSVC that got a massive speedup; no change on Mac/clang since clang was already generating movups instruction there.

Evaluate ray hit data only once (commit). My original code was evaluating hit position & normal for each closer sphere it had hit so far; this change only remembers t value and sphere index instead, and calculates position & normal for the final closest sphere. It also has one possible approach in how to do “find minimum value and index of it in an SSE register”, that I was wondering about.

“Know” which objects emit light instead of searching every time (commit). This one’s not SIMD at all, and super obvious. In the explicit light sampling loop, for each ray bounce off a diffuse surface, I was going through all spheres, checking “hey, do you emit light?”. But only a couple of all of them do! So instead, have an explicit array of light-emitting sphere indices, and only go through that. This was another massive speedup.

Several small simplifications (commit, commit, commit, commit, commit). Each one self-explanatory.

What’s next

I want to apply some of the earlier & above optimizations to the C#, C#+Burst and GPU implementations too, just so that all versions are on the same ground again. Maybe I’ll do that next!