Blogs · Aras' website

Pathtracer 16: Burst SIMD Optimization

Posted on Oct 29, 2018

Introduction and index of this series is here.

When I originally played with the Unity Burst compiler in “Part 3: C#, Unity, Burst”, I just did the simplest possible “get C# working, get it working on Burst” thing and left it there. Later on in “Part 10: Update C#” I updated it to use Structure-of-Arrays data layout for scene objects, and that was about it. Let’s do something about this.

Meanwhile, I have switched from late-2013 MacBookPro to mid-2018 one, so the performance numbers on a “Mac” will be different from the ones in previous posts.

Update to latest Unity + Burst + Mathematics versions

First of all, let’s update the Unity version we use from some random 2018.1 beta to the latest stable 2018.2.13, and update Burst (to 0.2.4-preview.34) & Mathematics (to 0.0.12-preview.19) packages along the way. Mathematics renamed lengthSquared to lengthsq, and introduced a PI constant that clashed with our own one :) These trivial updates in this commit.

Just that got performance on PC from 81.4 to 84.3 Mray/s, and on Mac from 31.5 to 36.5 Mray/s. I guess either Burst or Mathematics (or both) got some optimizations during this half a year, nice!

Add some “manual SIMD” to sphere intersection

Very similar to how in Part 8: SSE HitSpheres I made the C++ HitSpheres function do intersection testing of one ray against 4 spheres at once, we’ll do the same in our Unity C# Burst code.

The thought process and work done is extremely similar to the C++ side done in Part 8 and Part 9; basically:

Since data for our spheres is laid out nicely in SoA style arrays, we can easily load data for 4 of them at once.
Do all ray intersection math on these 4 spheres,
If any are hit, pick the closest one and calculate final hit position & normal.

HitSpheres function code gets to be extremely similar between C++ version and C# version. In fact the C# one is cleaner since float4, int4 and bool4 types in Mathematics package are way more complete SIMD wrappers than my toy manual implementations in the C++ version.

The full change commit is here.

Performance: PC from 84.3 to 133 Mray/s, and Mac from 35.5 to 60.0 Mray/s. Not bad!

Updated numbers for new Mac hardware

Implementation	PC	Mac
GPU	1854	246
C++, SSE+SoA HitSpheres	187	74
C#, Unity Burst, 4-wide HitSpheres	133	60
C++, SoA HitSpheres	100	36
C#, Unity Burst	82	36
C#, .NET Core	53.0	23.6
C#, mono `-O=float32 --llvm` w/ `MONO_INLINELIMIT=100`		22.0
C#, mono `-O=float32 --llvm`		18.9
C#, mono `-O=float32`		11.0
C#, mono		6.1

PC is AMD ThreadRipper 1950X (3.4GHz, 16c/16t - SMT disabled) with GeForce GTX 1080 Ti.
Mac is mid-2018 MacBookPro (Core i9-8950HK 2.9GHz, 6c/12t) with AMD Radeon Pro 560X.
Unity version 2018.2.13 with Burst 0.2.4-preview.34 and Mathematics 0.0.12-preview.19.
Mono version 5.12.
.NET Core version 2.1.302.

All code is on github at 16-burst-simd tag.

Random list of Demoscene Demos

Posted on Aug 3, 2018

I just did a “hey kids, let me tell you about demoscene” event at work, where I talked about and and showed some demos I think were influential over the years, roughly sorted chronologically.

Here’s that list, in case you also want to see some demoscene things. There’s a whole bunch of excellent demo productions I did not show (due to time constraints); and I mostly focused on Windows/PC demos. A decent way of finding others is searching through “all time top” list at pouët.net.

I’m giving links to youtube, because let’s be realistic, no one’s gonna actually download and run the executables. Or if you would, then you most likely have already seen them anyway :)

Future Crew “Second Reality”, 1993, demo

Tim Clarke “Mars”, 1993, 6 kilobytes

Psychic Link “Paper”, 1996, 64 kilobytes

Exceed “Heaven Seven”, 2000, 64 kilobytes

farbrausch “fr-08: .the .product”, 2000, 64 kilobytes

Alex Evans “Tom Thumb”, 2002, wild demo

TBC & Mainloop “Micropolis”, 2004, 4 kilobytes

mfx “Aether”, 2005, demo

Kewlers & mfx “1995”, 2006, demo

mfx “Deities”, 2006, demo

farbrausch “fr-041: debris”, 2007, 144 kilobytes

Fairlight & CNCD “Agenda Circling Forth”, 2010, demo

Fairlight & CNCD “Ziphead”, 2015, demo

Eos “Oscar’s Chair”, 2018, 4 kilobytes

Conspiracy “When Silence Dims The Stars Above”, 2018, 64 kilobytes

Pathtracer 15: Pause & Links

Posted on Aug 1, 2018

Sailing out to sea | A tale of woe is me
I forgot the name | Of where we’re heading
– Versus Them “Don’t Eat the Captain”

So! This whole series on pathtracing adventures started out without a clear goal or purpose. “I’ll just play around and see what happens” was pretty much it. Looks like I ran out of steam and will pause doing further work on it. Maybe sometime later I’ll pick it up again, who knows!

One nice thing about 2018 is that there’s a lot of interest in ray/path tracing again, and other people have been writing about various aspects of it. So here’s a collection of links I saved on the topic over past few months:

“Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs” by Henri Ylitie, Tero Karras, Samuli Laine, 2017 July.
“Caffeine path tracing demo & tutorial (WebGL)” by Rye Terrell, 2018 February.
“GPU Ray Tracing in One Weekend” by Jeremy Cowles, 2018 March.
“GDC Retrospective and Additional Thoughts on Real-Time Raytracing” by Colin Barré-Brisebois, 2018 April.
A bunch of interesting blog posts on importance sampling etc. by Joe Schutte, 2018 March-May.
“minpt: A path tracer in 300 lines of C++” by Hisanari Otsu, 2018 May.
“D3D12 Raytracing Samples” by Microsoft, 2018 May.
“Writing a Portable CPU/GPU Ray Tracer in C#” by Eric Mellino, 2018 May.
“GPU Ray Tracing in Unity – Part 1” and “GPU Path Tracing in Unity – Part 2” by David Kuri, 2018 May.
“Stochastic All the Things: Raytracing in Hybrid Real-time Rendering” by Tomasz Stachowiak, 2018 May.
“Adding texturing to a GLSL path tracer” by Michael Cameron, 2018 May.
“Denoising with Kernel Prediction and Asymmetric Loss Functions” by Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, Jan Novák, 2018 June.
Series on path tracer in Rust (initial, threading, SIMD, more SIMD) by Cameron Hart, 2018 May-June.
“Gradients, Poisson’s Equation and Light Transport (two minute papers)” by Károly Zsolnai-Fehér, 2015 October.
“Compressed Leaf Bounding Volume Hierarchies” by Carsten Benthin, Ingo Wald, Sven Woop, Attila Áfra, 2018 June.
“The other pathtracer” series (job system, triangles, complex scenes, optimizing AABB-Ray) by Carmelo Fernández-Agüera Tortosa, 2018 June.
“Metal for Ray Tracing Acceleration” by Sean James, Wayne Lister, 2018 June.
Another raytracing blogpost series (rendering equation, image output, ray-sphere intersection, camera and anti-aliasing, diffuse materials, reflecting materials) by Victor Li, 2018 July.
Ray tracer with Metal Performance Shaders by Serhii Rieznik, 2018 June-July.
Quick Path Tracer project in C++ by Rory Driscoll, 2018 June.
“Pathtraced Depth of Field & Bokeh” by Alan Wolfe, 2018 July.
“GPU based clay simulation and ray tracing tech in Claybook” by Sebastian Aaltonen, 2018 July.
“Sampling Anisotropic Microfacet BRDF” by Cao Jiayin, 2018 July.
CUDA Pathtracer blog series (preparation, first project, lightweight kernel, optimizing memory transfers, compacting non-active rays) by @voxel_tracer, 2018 June-July.
“Swallowing the elephant (optimizing pbrt for Moana scene)” by Matt Pharr, 2018 July.

Thanks for the adventure so far, everyone!

Put the fork away | It’s not a sailor’s way
We are gentlemen | Don’t eat the captain

Iceland Vacation 2018

Posted on Jul 25, 2018

Hello! End of June & start of July we were traveling in Iceland, so here’s some photos and stuff.

I’ve heard that some folks somehow don’t know that Iceland is absolutely beautiful. How?! Here’s my attempt at helping the situation by dumping a whole bunch of photos into the series of tubes.

Planning

We’ve been to Iceland before; what we did differently this time was:

Almost 2x longer trip (11 days),
Our kids are 5 years older (15 and 9yo), which makes it easier! We are five years older too though :/
Six people in total, since now we also took my parents. This meant renting two cars.

Similar to last time, I used internets and google maps to scout for locations and do rough planning. It was basically “go around the whole country” (on the main Route 1), cutting in one place via the highland Route F35, and then a detour into Snæfellsnes peninsula.

Total driving distance ended up ~2600km (200-300km per day). That does not sound a lot, but we did not end up having “lazy days”; there is a lot to see in Iceland, and every stop along the way is basically an hour or two. For example you might want to hike up the waterfall, or get down to some cliffs in the water, etc. The map on the right shows all the places we did end up stopping at. I had a dozen more marked up, but we skipped some.

I booked everything well in advance (4 months), either via Booking.com or Airbnb. Since we were a party of six, in some more remote places there was not that many choices actually. Having a camper or tents might be much cheaper and allow more freedom, at expense of comfort.

Cost wise, some things (like housing) has visibly increased since 2013 when we were last there. Makes sense, since the amount of tourists has increased as well; capitalism gonna capital. Total cost breakdown for us was: 33% housing, 23% flights, 20% car rent, 24% everything else (food, eating out, gas, guided trips, …).

Late June is basically “early summer” in Iceland. Most/all of the highland roads are already open. There can be quite a lot of rain; I was looking at the forecasts and it did not look very good. Luckily enough, we only got serious rain for like 3 days; most other days there was relatively little rain. Temperature was mostly in +8..+15°C range, often with a really cold wind. There were moments when I wished I’d taken gloves :)

Photo Impressions

Most of the photos are taken by my wife. Equipment: Canon EOS 70D with Canon 24-70mm f/2.8 L II and Sigma 8-16mm f/4.5-5.6. Some taken with iPhone SE.

Day 1, South (Selfoss to Kirkjubæjarklaustur)

The southern part is quite crowded with tourists; going up to Dyrhólaey/Vik is plenty of sights and a good trip for a day. We also started the first day with “ok there’s a million things to see today!”.

First up, mostly waterfalls. Urriðafoss, Seljalandsfoss, a view into the infamous Eyjafjallajökull, and Skógafoss.

Fun fact! Unity codebase has a text = "Eyjafjallajökull-Pranckevičius"; line in one of the tests, that checks whether some thing deals with non-English characters. I think @lucasmeijer added that.

End of June is blooming time of Nootka Lupin; there are vast fields full of them. People go to take wedding photos and whatnot in there.

Next up, we can go to the tongue of Sólheimajökull glacier (this is a bit redundant; “jökull” already means “glacier”). I’ve never seen a glacier before, and the photos of course don’t do it justice. This is a tiny piece at the end of the glacier. Very impressive.

Dyrhólaey peninsula:

Dverghamrar basalt column formations, with Foss á Síðu waterfall in the distance (redundancy again, “foss” already means “waterfall”):

Day 2, South/East (Kirkjubæjarklaustur to Höfn)

Driving up to another glacier, Svínafellsjökull. Again, the scale is hard to comprehend; many glaciers in Iceland are 500 meters high, some going up to a kilometer. A kilometer of ice!

A short (but very bumpy) road to the side, and we are close to it:

Next up, Jökulsárlón glacial lake. Was a setting for a bunch of movies! The lake is just over a hundred years old, and is growing very fast, largely due to melting glaciers.

Right next to it there is so called “Diamond Beach”, where icebergs, after being flushed out into the sea and eroded by salt, come ashore as tiny pieces of ice. The sand is black of course, since it was originally pumice and volcanic ash.

Day 3, East (Höfn to Egilsstaðir)

Eastern side of Iceland is where there’s no tourist crowds, and no big-name attractions either. Even the main highway road becomes gravel for a dozen kilometers in one place :) Most of the Route 1 goes along the coastline that is full of fjords, which makes for a fairly long drive. There is a shortcut (route 939 aka Öxi) that lets you cut some 80km, but it’s gravel and very steep (here’s random youtube video showing it). I thought “let’s do the coastline instead, we’ll watch plenty of sea and cliffs”. Not so fast! Turns out, coastline can mean that there’s a literal cloud right on the road, and you basically don’t see anything. Oh well :)

There were some lighthouses (barely visible due to mist/fog/clouds), a nice waterfall (Sveinsstekksfoss), and also here’s a photo of our typical lunch:

We stayed in a lovely horse ranch, and also found an old car dump nearby.

Day 5, North/East (Egilsstaðir to Mývatn)

Most of the day was driving on Route 1 through Norður-Múlasýsla region. First you see towns and villages disappear, then farms disappear, and then even sheep disappear (whereas normally sheep are everywhere in Iceland). What’s left is a volcanic desert with basically a single road cutting through it.

There was a waterfall (Rjúkandi) near start of that trip, and lava fields towards the end, close to Dettifoss.

Here’s Dettifoss, which is 100m wide, 44m deep and other measurements as well (ref).

Nearby, the Krafla area with the Víti crater, Krafla power station and Hverir geothermal area with fumaroles and mudpots.

Lake Mývatn nearby has a flying mountain (not really, just low fog) and a lot of birds.

Day 5, North (Mývatn to Akureyri)

Mývatn to Akureyri is a very short drive, so we did a detour through Husavik towards Ásbyrgi canyon. Last time we were in Iceland, Husavik was lovely and Ásbyrgi was quite impressive. However this time, pretty much the whole day was heavy rain. Not much visibility, and not too pleasant to hike around and enjoy the sights. Oh well! Here’s Ásbyrgi and Goðafoss:

Akureyri has an excellent botanical garden; more photos from it at my wife’s blog.

Day 6, Highlands (Akureyri to Kerlingarfjöll)

This was where we took off the main highway and into the F35/Kjalvegur gravel road. I heard from a bunch of people the suggestion along the lines of “OMG you have to go along one of the highland roads”, and so that’s why we did it. F35 is the easiest of those; legally it requires a 4x4/AWD car but I think technically any car should be able to do it. Most other highland roads actually have river crossings; whereas F35 only has one or two small streams to cross. Most of the road is actually in very good condition (at least at start of July), with only a couple dozen kilometers that have enough stones and pits to make you go at 20-30km/h.

There is Hveravellir geothermal area near Langjökull:

We stayed at a place near Kerlingarfjöll:

And decided to hike towards a nearby rhyolite mountain area (Hveradalir). Apparently I must have misread something somewhere, since what I thought was 3km turned out to be 5km one way (mixup of miles vs kilometers in my head?), the path was steep, with blobs of snow along the way, really strong wind and a descending cloud. At some point we decided to declare ourselves losers and just turn back. Oh well :/

Turns out, you can just drive up to the same area via some mountain road. It’s steep and bumpy, and there was still tons of snow on the side, but the views up there were amazing. The wind almost blew us away though; maybe it’s good that we did not hike all the way.

Day 7, Part of Golden Circle (Kerlingarfjöll to Reykjavik)

“Golden Circle” is a marketing term for probably the most touristy route in Iceland. But parts of it did happen to be on our way, so we went straight from the highlands where there’s no one around, into “all the tourists in one spot” types of places like Gullfoss.

Next up, Strokkur geyser, again with a ton of tourists:

And we spent the evening just strolling around Reykjavik.

Day 8, Part of Golden Circle (around Reykjavik)

Þingvellir national park, most famous for being a place where you can actually see the rift between Eurasian and North American tectonic plates, and also for being a place of Alþingi, one of the oldest parliaments in the world.

Next up, Kerið crater. Similar to Krafla’s Víti, except with more tourists and you can get down to the lake itself.

Then we went to the Raufarhólshellir lava cave. Things I learned: “skylight” is not just a computer graphics term (also means places where underground caves have openings towards the ground); lava flow produces really intricate “bathtub ring” patterns; and complete darkness feels eery.

Day 9, West (Reykjavik to Snæfellsnes)

Driving up to Snæfellsnes takes a good chunk of time, with generally nothing to see along the way (in relative terms of course; in many other countries these valleys and horizons would be amazing… but Iceland has too many more impressive sights). There are Gerðuberg basalt columns midway:

…but apart from that, not much. I was starting to think “ohh maybe this will be a low point of the trip”, and then! Rauðfeldsgjá gorge was very fun; you try to find your way across a water stream in a very narrow gorge, with huge chunks of snow right above you.

Just a couple minutes from there, Arnarstapi village has really nice cliffs at the water.

Five minutes from that, Hellnar village has even more impressive cliffs. I mean look at them! That layout and flow of the rocks should not exist! :)

And then! Djúpalónssandur beach with black sand and rock formations.

Near our sleeping place there’s Kirkjufell, which is featured in a ton of photos showing off wide-angle lenses :)

Day 10, West/South (Snæfellsnes to Keflavík)

Stykkishólmur town and random sights on the way back. Was an easy day without sensory overload :)

Day 11, Reykjanes Peninsula (around Keflavík)

Our flight back was in the evening, so we visited some places in Reykjanes near the airport. Gunnuhver mud pool:

Krísuvíkurberg cliffs and Dollan lava caves:

Krýsuvík geothermal area:

Kleifarvatn lake:

And the famous Bláa Lónið (Blue Lagoon), but we decided not to go inside (too many people, and didn’t feel the need either). There’s a power station right next to it, and some tractors doing cleaning. Much romance, wow :)

Next time?

I have no doubt that we’ll go to Iceland again (seriously, it’s amazing). One obvious thing would be going in the winter. So maybe that!

Pathtracer 14: iOS

Posted on May 30, 2018

Introduction and index of this series is here.

I wanted to check out how’s the performance on a mobile device. So, let’s take what we ended up with in the previous post, and make it run on iOS.

Initial port

Code for the Mac app is a super simple Cocoa application that either updates a Metal texture from the CPU and draws it to screen, or produces the texture with a Metal compute shader. I know almost nothing about Mac or Cocoa programming, so I just created a new project in Xcode, picked a “Metal game” template, removed things I don’t need and added the things I do need.

“Porting” that to iOS basically involved these steps (again, I don’t know how it’s supposed to be done; I’m just doing a random walk):

Created two projects in Xcode, using the “Metal game” template; one for Mac (which matches my current code setup), and another one for “Cross Platform” case.
Looked at the differences in file layout & project settings between them,
Applied the differences to my app. The changes in detail were:
Some folder renaming and moving files around in Xcode project structure.
Added iOS specific files produced by Xcode project template.
Some tweaks to existing app code to make it compile on iOS – mostly temporarily disabling all the SSE SIMD code paths (iOS uses ARM CPUs, SSE does not exist there). Other changes were mostly differences in Metal functionality between macOS and iOS (MTLResourceStorageModeManaged buffer mode and didModifyRange buffer method only exist on macOS).
Added iOS build target to Xcode project.

And then it Just Worked; both the CPU & GPU code paths! Which was a bit surprising, actually :)

Performance of this “just make it run” port on iPhone SE: CPU 5.7 Mray/s, GPU 19.8 Mray/s.

Xcode tools for iOS GPU performance

I wanted to look at what sort of tooling Xcode has for investigating iOS GPU performance these days. Last time I did it was a couple years ago, and was also not related to compute shader workloads. So here’s a quick look into what I found!

Update: this post was about Xcode 9 on an A9 hardware. At WWDC 2018 Apple has announced big improvements to Metal profiling tools in Xcode 10, especially when running on A11 or later hardware. I haven’t tried them myself, but you might want to check out the WWDC session and “Optimizing Performance” doc.

TL;DR: it’s not bad. Too bad it’s not as good as PS4 tooling, but then again, who is?

Most of Xcode GPU analysis is under the “Debug Navigator” thingy, where with an app running you can select the “FPS” section and it displays basic gauges of CPU & GPU performance. When using Metal, there is a “Capture GPU Frame” button near the bottom which leads to actual frame debugging & performance tools.

The default view is more useful for debugging rendering issues; you want to switch to “View Frame By Performance” instead:

The left sidebar then lists various things grouped by pipeline (compute or graphics), and by shader. It does not list them by objects rendered, which is different from how GPU profiling on desktop usually works. In my case obviously the single compute shader dispatch takes up almost all the time.

The information presented seems to be a bunch of GPU counters (number of shader invocations, instructions executed, and so on). Some of those are more useful than others, and what kind of information is being shown probably also depends on the device & GPU model. Here are screenshots of what I saw displayed about my compute shader on an iPhone SE:

Whole frame overview has various counters per encoder. From here: occupancy is not too bad, and hey look my shader is not using any half-precision instructions:

“Performance” section has more stats in number form:

“Pipeline Statistics” section has some useful performance hints and overview graphs of, uhm, something. This is probably telling me I’m ALU bound, but what are the units of each bar, and whether they are all even the same scale? I don’t know :)

If the shader was compiled with debugging information on, then it can also show which places of the shader actually took time. As far as I can tell, it just lies – for my shader, it basically says “yeah, all these lines took zero time, and there’s one line that took 6%”. Where are the other 94%?!

Xcode tools for Mac GPU performance

In the previous post I ranted how Mac has no GPU performance tools at all, and while that is somewhat true (i.e. there’s no tool that would have told me “hey Aras, use by-value local variables insteaad of by-reference! twice as fast!”)… some of that “Capture GPU Frame” functionality exists for Mac Metal applications as well.

Here’s what information is displayed by “Performance” section on my MBP (Intel Iris Pro):

The “compute kernel” part has way fewer counters, and I don’t quite believe that ALU active time was exactly zero.

“Pipeline Statistics” section on the other hand… it has no performance hints, but it does have more overview graphs! “Register pressure”, “SIMD group occupancy” and “threadgroup memory” parts sound useful!

Let’s do SIMD NEON code paths for CPU

Recall when in part 8 I played around with SSE intrinsics for CPU HitSpheres function? Well now that code is disabled since iOS uses ARM CPUs, so Intel specific instructions don’t even compile there.

However, ARM CPUs do have their own SIMD instruction set: NEON. I know! Let’s use NEON intrinsic functions to implement our own float3 and float4 helpers, and then the SIMD HitSpheres should more or less work.

Caveat: as usual, I basically have no idea what I’m talking about. I have read some NEON code in the past, and perhaps have written a small NEON function or two at some point, but I’m nowhere near being “proficient” at it.

NEON float3

First off, let’s do the float3 helper class implementation with NEON. On x64 CPUs that did improve performance a bit (not much though). NEON intrinsics overall seem to be way more orthogonal and “intuitive” than SSE ones, however SSE has way, way more information, tutorials & reference about it out there. Anyway, the NEON float3 part is this commit, and my summary of NEON is:

#include <arm_neon.h> to get intrinsics & data types,
float32x4_t data type is for 4-wide floats,
NEON intrinsic functions start with v (for “vector”?), have q in there for things that operate on four things, and a suffix indicating the data type. For example, a 4-wide float add is vaddq_f32. Simple and sweet!
Getting to individual SIMD lanes is much easier than on SSE (just vgetq_lane_f32), however doing arbitrary swizzles/shuffles is harder – you have to dance around with extracting low/high parts, or “zipping” various operands, etc.

Doing the above work did not noticeably change performance though. Oh well, actually quite expected. I did learn/remember some NEON stuff though, so a net positive :)

NEON HitSpheres & float4

Last time an actual performance gain with SIMD was doing SSE HitSpheres, with data laid out in struct-of-arrays fashion. To get the same working on NEON, I basically have to implement a float4 helper class, and touch several places in HitSpheres function itself that use SSE directly. It’s all in this commit.

That got CPU performance from 5.8 Mray/s up to 8.5 Mray/s. Nice!

Note that my NEON approach is very likely suboptimal; I was basically doing a direct port from SSE. Which means:

“mask” calculation for comparisons. On SSE that is just _mm_movemask_ps, but becomes this in NEON:

VM_INLINE unsigned mask(float4 v)
{
    static const uint32x4_t movemask = { 1, 2, 4, 8 };
    static const uint32x4_t highbit = { 0x80000000, 0x80000000, 0x80000000, 0x80000000 };
    uint32x4_t t0 = vreinterpretq_u32_f32(v.m);
    uint32x4_t t1 = vtstq_u32(t0, highbit);
    uint32x4_t t2 = vandq_u32(t1, movemask);
    uint32x2_t t3 = vorr_u32(vget_low_u32(t2), vget_high_u32(t2));
    return vget_lane_u32(t3, 0) | vget_lane_u32(t3, 1);
}

picking closest hit among 4 results may or might not be done more optimally in NEON:

int id_scalar[4];
float hitT_scalar[4];
#if USE_NEON
vst1q_s32(id_scalar, id);
vst1q_f32(hitT_scalar, hitT.m);
#else
_mm_storeu_si128((__m128i *)id_scalar, id);
_mm_storeu_ps(hitT_scalar, hitT.m);
#endif
// In general, you would do this with a bit scan (first set/trailing zero count).
// But who cares, it's only 16 options.
static const int laneId[16] =
{
    0, 0, 1, 0, // 00xx
    2, 0, 1, 0, // 01xx
    3, 0, 1, 0, // 10xx
    2, 0, 1, 0, // 11xx
};
int lane = laneId[minMask];
int hitId = id_scalar[lane];
float finalHitT = hitT_scalar[lane];

Current status

So the above is basic port to iOS, with some simple NEON code path, and no mobile specific GPU tweaks/optimizations at all. Code is over at 14-ios tag on github.

Performance:

iPhone SE (A9 chip): 8.5 Mray/s CPU, 19.8 Mray/s GPU.
iPhone X (A11 chip): 12.9 Mray/s CPU, 46.6 Mray/s GPU.
- I haven’t looked into how many CPU threads the enkiTS task scheduler ends up using on iPhone X. I suspect it still might be just two “high performance” cores, which would be within my expectations of “roughly 50% more per-core CPU perf in two Apple CPU generations”. Which is fairly impressive!
For comparison, a MacBook Pro (2013) with Core i7 2.3 GHz & Intel Iris Pro gets: 42 Mray/s CPU, 99 Mray/s GPU.
- Which means that single-thread CPU performance on iPhone X is actually very similar, or even a bit higher, than on an (admittedly old) MacBook Pro!