Random list of Demoscene Demos

I just did a “hey kids, let me tell you about demoscene” event at work, where I talked about and showed some demos I think were influential over the years, roughly sorted chronologically.

Here’s that list, in case you also want to see some demoscene things. There’s a whole bunch of excellent demo productions I did not show (due to time constraints), and I mostly focused on Windows/PC demos. A decent way of finding others is searching through the “all time top” list at pouët.net.

I’m giving links to youtube, because let’s be realistic, no one’s gonna actually download and run the executables. Or if you would, then you most likely have already seen them anyway :)

Future Crew “Second Reality”, 1993, demo

Tim Clarke “Mars”, 1993, 6 kilobytes

Exceed “Heaven Seven”, 2000, 64 kilobytes

farbrausch “fr-08: .the .product”, 2000, 64 kilobytes

Alex Evans “Tom Thumb”, 2002, wild demo

TBC & Mainloop “Micropolis”, 2004, 4 kilobytes

mfx “Aether”, 2005, demo

Kewlers & mfx “1995”, 2006, demo

mfx “Deities”, 2006, demo

farbrausch “fr-041: debris”, 2007, 144 kilobytes

Fairlight & CNCD “Agenda Circling Forth”, 2010, demo

Fairlight & CNCD “Ziphead”, 2015, demo

Eos “Oscar’s Chair”, 2018, 4 kilobytes

Conspiracy “When Silence Dims The Stars Above”, 2018, 64 kilobytes


Pathtracer 15: Pause & Links

Sailing out to sea | A tale of woe is me
I forgot the name | Of where we’re heading
– Versus Them “Don’t Eat the Captain”

So! This whole series on pathtracing adventures started out without a clear goal or purpose. “I’ll just play around and see what happens” was pretty much it. Looks like I’ve run out of steam and will pause further work on it. Maybe sometime later I’ll pick it up again, who knows!

One nice thing about 2018 is that there’s a lot of interest in ray/path tracing again, and other people have been writing about various aspects of it. So here’s a collection of links I saved on the topic over the past few months:

Thanks for the adventure so far, everyone!

Put the fork away | It’s not a sailor’s way
We are gentlemen | Don’t eat the captain


Iceland Vacation 2018

Hello! End of June & start of July we were traveling in Iceland, so here’s some photos and stuff.

I’ve heard that some folks somehow don’t know that Iceland is absolutely beautiful. How?! Here’s my attempt at helping the situation by dumping a whole bunch of photos into the series of tubes.

Planning

We’ve been to Iceland before; what we did differently this time was:

  • Almost 2x longer trip (11 days),
  • Our kids are 5 years older (15 and 9yo), which makes it easier! We are five years older too though :/
  • Six people in total, since now we also took my parents. This meant renting two cars.

Similar to last time, I used internets and google maps to scout for locations and do rough planning. It was basically “go around the whole country” (on the main Route 1), cutting in one place via the highland Route F35, and then a detour onto the Snæfellsnes peninsula.

Total driving distance ended up being ~2600km (200-300km per day). That does not sound like a lot, but we did not end up having “lazy days”; there is a lot to see in Iceland, and every stop along the way is basically an hour or two. For example, you might want to hike up to a waterfall, or get down to some cliffs by the water, etc. The map on the right shows all the places we did end up stopping at. I had a dozen more marked up, but we skipped some.

I booked everything well in advance (4 months), either via Booking.com or Airbnb. Since we were a party of six, in some more remote places there were not that many choices, actually. Having a camper or tents might be much cheaper and allow more freedom, at the expense of comfort.

Cost-wise, some things (like housing) have visibly increased since 2013, when we were last there. Makes sense, since the amount of tourists has increased as well; capitalism gonna capital. Total cost breakdown for us was: 33% housing, 23% flights, 20% car rental, 24% everything else (food, eating out, gas, guided trips, …).

Late June is basically “early summer” in Iceland. Most/all of the highland roads are already open. There can be quite a lot of rain; I was looking at the forecasts and it did not look very good. Luckily enough, we only got serious rain for like 3 days; most other days there was relatively little rain. Temperature was mostly in +8..+15°C range, often with a really cold wind. There were moments when I wished I’d taken gloves :)

Photo Impressions

Most of the photos are taken by my wife. Equipment: Canon EOS 70D with Canon 24-70mm f/2.8 L II and Sigma 8-16mm f/4.5-5.6. Some taken with iPhone SE.

Day 1, South (Selfoss to Kirkjubæjarklaustur)

The southern part is quite crowded with tourists; the stretch up to Dyrhólaey/Vik has plenty of sights and makes a good trip for a day. We also started the first day with “ok there’s a million things to see today!”.

First up, mostly waterfalls. Urriðafoss, Seljalandsfoss, a view into the infamous Eyjafjallajökull, and Skógafoss.

Fun fact! The Unity codebase has a text = "Eyjafjallajökull-Pranckevičius"; line in one of the tests that checks whether something deals with non-English characters. I think @lucasmeijer added that.

End of June is the blooming time of Nootka Lupin; there are vast fields full of them. People go there to take wedding photos and whatnot.

Next up, we can go to the tongue of Sólheimajökull glacier (this is a bit redundant; “jökull” already means “glacier”). I’ve never seen a glacier before, and the photos of course don’t do it justice. This is a tiny piece at the end of the glacier. Very impressive.

Dyrhólaey peninsula:

Dverghamrar basalt column formations, with Foss á Síðu waterfall in the distance (redundancy again, “foss” already means “waterfall”):

Day 2, South/East (Kirkjubæjarklaustur to Höfn)

Driving up to another glacier, Svínafellsjökull. Again, the scale is hard to comprehend; many glaciers in Iceland are 500 meters high, some going up to a kilometer. A kilometer of ice!

A short (but very bumpy) road to the side, and we are close to it:

Next up, Jökulsárlón glacial lake. Was a setting for a bunch of movies! The lake is just over a hundred years old, and is growing very fast, largely due to melting glaciers.

Right next to it there is so called “Diamond Beach”, where icebergs, after being flushed out into the sea and eroded by salt, come ashore as tiny pieces of ice. The sand is black of course, since it was originally pumice and volcanic ash.

Day 3, East (Höfn to Egilsstaðir)

The eastern side of Iceland is where there are no tourist crowds, and no big-name attractions either. Even the main highway becomes gravel for a dozen kilometers in one place :) Most of Route 1 goes along a coastline full of fjords, which makes for a fairly long drive. There is a shortcut (route 939 aka Öxi) that lets you cut some 80km, but it’s gravel and very steep (here’s a random youtube video showing it). I thought “let’s do the coastline instead, we’ll watch plenty of sea and cliffs”. Not so fast! Turns out, “coastline” can mean that there’s a literal cloud right on the road, and you basically don’t see anything. Oh well :)

There were some lighthouses (barely visible due to mist/fog/clouds), a nice waterfall (Sveinsstekksfoss), and also here’s a photo of our typical lunch:

We stayed in a lovely horse ranch, and also found an old car dump nearby.

Day 4, North/East (Egilsstaðir to Mývatn)

Most of the day was driving on Route 1 through Norður-Múlasýsla region. First you see towns and villages disappear, then farms disappear, and then even sheep disappear (whereas normally sheep are everywhere in Iceland). What’s left is a volcanic desert with basically a single road cutting through it.

There was a waterfall (Rjúkandi) near start of that trip, and lava fields towards the end, close to Dettifoss.

Here’s Dettifoss, which is 100m wide, 44m deep and other measurements as well (ref).

Nearby, the Krafla area with the Víti crater, Krafla power station and Hverir geothermal area with fumaroles and mudpots.

Lake Mývatn nearby has a flying mountain (not really, just low fog) and a lot of birds.

Day 5, North (Mývatn to Akureyri)

Mývatn to Akureyri is a very short drive, so we did a detour through Husavik towards Ásbyrgi canyon. Last time we were in Iceland, Husavik was lovely and Ásbyrgi was quite impressive. However this time, pretty much the whole day was heavy rain. Not much visibility, and not too pleasant to hike around and enjoy the sights. Oh well! Here’s Ásbyrgi and Goðafoss:

Akureyri has an excellent botanical garden; more photos from it at my wife’s blog.

Day 6, Highlands (Akureyri to Kerlingarfjöll)

This is where we turned off the main highway and onto the F35/Kjalvegur gravel road. I had heard from a bunch of people suggestions along the lines of “OMG you have to go along one of the highland roads”, and so that’s why we did it. F35 is the easiest of those; legally it requires a 4x4/AWD car, but I think technically any car should be able to do it. Most other highland roads actually have river crossings, whereas F35 only has one or two small streams to cross. Most of the road is actually in very good condition (at least at the start of July), with only a couple dozen kilometers that have enough stones and pits to make you go at 20-30km/h.

There is Hveravellir geothermal area near Langjökull:

We stayed at a place near Kerlingarfjöll:

And we decided to hike towards a nearby rhyolite mountain area (Hveradalir). Apparently I misread something somewhere, since what I thought was 3km turned out to be 5km one way (a mixup of miles vs kilometers in my head?); the path was steep, with blobs of snow along the way, a really strong wind and a descending cloud. At some point we decided to declare ourselves losers and just turn back. Oh well :/

Turns out, you can just drive up to the same area via some mountain road. It’s steep and bumpy, and there was still tons of snow on the side, but the views up there were amazing. The wind almost blew us away though; maybe it’s good that we did not hike all the way.

Day 7, Part of Golden Circle (Kerlingarfjöll to Reykjavik)

“Golden Circle” is a marketing term for probably the most touristy route in Iceland. But parts of it did happen to be on our way, so we went straight from the highlands where there’s no one around, into “all the tourists in one spot” types of places like Gullfoss.

Next up, Strokkur geyser, again with a ton of tourists:

And we spent the evening just strolling around Reykjavik.

Day 8, Part of Golden Circle (around Reykjavik)

Þingvellir national park, most famous for being a place where you can actually see the rift between the Eurasian and North American tectonic plates, and also for being the site of Alþingi, one of the oldest parliaments in the world.

Next up, Kerið crater. Similar to Krafla’s Víti, except with more tourists and you can get down to the lake itself.

Then we went to the Raufarhólshellir lava cave. Things I learned: “skylight” is not just a computer graphics term (it also means places where underground caves have openings up to the surface); lava flow produces really intricate “bathtub ring” patterns; and complete darkness feels eerie.

Day 9, West (Reykjavik to Snæfellsnes)

Driving up to Snæfellsnes takes a good chunk of time, with generally nothing to see along the way (in relative terms of course; in many other countries these valleys and horizons would be amazing… but Iceland has too many more impressive sights). There are Gerðuberg basalt columns midway:

…but apart from that, not much. I was starting to think “ohh maybe this will be a low point of the trip”, and then! Rauðfeldsgjá gorge was very fun; you try to find your way across a water stream in a very narrow gorge, with huge chunks of snow right above you.

Just a couple minutes from there, Arnarstapi village has really nice cliffs at the water.

Five minutes from that, Hellnar village has even more impressive cliffs. I mean look at them! That layout and flow of the rocks should not exist! :)

And then! Djúpalónssandur beach with black sand and rock formations.

Near our sleeping place there’s Kirkjufell, which is featured in a ton of photos showing off wide-angle lenses :)

Day 10, West/South (Snæfellsnes to Keflavík)

Stykkishólmur town and random sights on the way back. Was an easy day without sensory overload :)

Day 11, Reykjanes Peninsula (around Keflavík)

Our flight back was in the evening, so we visited some places in Reykjanes near the airport. Gunnuhver mud pool:

Krísuvíkurberg cliffs and Dollan lava caves:

Krýsuvík geothermal area:

Kleifarvatn lake:

And the famous Bláa Lónið (Blue Lagoon), but we decided not to go inside (too many people, and didn’t feel the need either). There’s a power station right next to it, and some tractors doing cleaning. Much romance, wow :)

Next time?

I have no doubt that we’ll go to Iceland again (seriously, it’s amazing). One obvious thing would be going in the winter. So maybe that!


Pathtracer 14: iOS

Introduction and index of this series is here.

I wanted to check out how the performance is on a mobile device. So, let’s take what we ended up with in the previous post, and make it run on iOS.

Initial port

Code for the Mac app is a super simple Cocoa application that either updates a Metal texture from the CPU and draws it to screen, or produces the texture with a Metal compute shader. I know almost nothing about Mac or Cocoa programming, so I just created a new project in Xcode, picked a “Metal game” template, removed things I don’t need and added the things I do need.

“Porting” that to iOS basically involved these steps (again, I don’t know how it’s supposed to be done; I’m just doing a random walk):

  1. Created two projects in Xcode, using the “Metal game” template; one for Mac (which matches my current code setup), and another one for the “Cross Platform” case.
  2. Looked at the differences in file layout & project settings between them.
  3. Applied the differences to my app. The changes in detail were:
    • Some folder renaming and moving files around in the Xcode project structure.
    • Added iOS specific files produced by the Xcode project template.
    • Some tweaks to existing app code to make it compile on iOS – mostly temporarily disabling all the SSE SIMD code paths (iOS uses ARM CPUs; SSE does not exist there). Other changes were mostly differences in Metal functionality between macOS and iOS (the MTLResourceStorageModeManaged buffer mode and the didModifyRange buffer method only exist on macOS).
    • Added an iOS build target to the Xcode project.

And then it Just Worked; both the CPU & GPU code paths! Which was a bit surprising, actually :)

Performance of this “just make it run” port on iPhone SE: CPU 5.7 Mray/s, GPU 19.8 Mray/s.

Xcode tools for iOS GPU performance

I wanted to look at what sort of tooling Xcode has for investigating iOS GPU performance these days. The last time I did that was a couple of years ago, and it was also not related to compute shader workloads. So here’s a quick look into what I found!

Update: this post was about Xcode 9 on A9 hardware. At WWDC 2018 Apple announced big improvements to Metal profiling tools in Xcode 10, especially when running on A11 or later hardware. I haven’t tried them myself, but you might want to check out the WWDC session and “Optimizing Performance” doc.

TL;DR: it’s not bad. Too bad it’s not as good as PS4 tooling, but then again, who is?

Most of Xcode’s GPU analysis is under the “Debug Navigator” thingy: with an app running, you can select the “FPS” section and it displays basic gauges of CPU & GPU performance. When using Metal, there is a “Capture GPU Frame” button near the bottom which leads to the actual frame debugging & performance tools.

The default view is more useful for debugging rendering issues; you want to switch to “View Frame By Performance” instead:

The left sidebar then lists various things grouped by pipeline (compute or graphics), and by shader. It does not list them by objects rendered, which is different from how GPU profiling on desktop usually works. In my case obviously the single compute shader dispatch takes up almost all the time.

The information presented seems to be a bunch of GPU counters (number of shader invocations, instructions executed, and so on). Some of those are more useful than others, and what kind of information is being shown probably also depends on the device & GPU model. Here are screenshots of what I saw displayed about my compute shader on an iPhone SE:

Whole frame overview has various counters per encoder. From here: occupancy is not too bad, and hey look my shader is not using any half-precision instructions:

“Performance” section has more stats in number form:

“Pipeline Statistics” section has some useful performance hints and overview graphs of, uhm, something. This is probably telling me I’m ALU bound, but what are the units of each bar, and are they even all on the same scale? I don’t know :)

If the shader was compiled with debugging information on, then it can also show which places of the shader actually took time. As far as I can tell, it just lies – for my shader, it basically says “yeah, all these lines took zero time, and there’s one line that took 6%”. Where are the other 94%?!

Xcode tools for Mac GPU performance

In the previous post I ranted about how Mac has no GPU performance tools at all, and while that is somewhat true (i.e. there’s no tool that would have told me “hey Aras, use by-value local variables instead of by-reference! twice as fast!”)… some of that “Capture GPU Frame” functionality exists for Mac Metal applications as well.

Here’s what information is displayed by “Performance” section on my MBP (Intel Iris Pro):

The “compute kernel” part has way fewer counters, and I don’t quite believe that ALU active time was exactly zero.

“Pipeline Statistics” section on the other hand… it has no performance hints, but it does have more overview graphs! “Register pressure”, “SIMD group occupancy” and “threadgroup memory” parts sound useful!

Let’s do SIMD NEON code paths for CPU

Recall how in part 8 I played around with SSE intrinsics for the CPU HitSpheres function? Well, now that code is disabled, since iOS uses ARM CPUs and Intel-specific instructions don’t even compile there.

However, ARM CPUs do have their own SIMD instruction set: NEON. I know! Let’s use NEON intrinsic functions to implement our own float3 and float4 helpers, and then the SIMD HitSpheres should more or less work.

Caveat: as usual, I basically have no idea what I’m talking about. I have read some NEON code in the past, and perhaps have written a small NEON function or two at some point, but I’m nowhere near being “proficient” at it.

NEON float3

First off, let’s do the float3 helper class implementation with NEON. On x64 CPUs that did improve performance a bit (not much though). NEON intrinsics overall seem to be way more orthogonal and “intuitive” than SSE ones; however, SSE has way, way more information, tutorials & reference material about it out there. Anyway, the NEON float3 part is this commit, and my summary of NEON is:

  • #include <arm_neon.h> to get intrinsics & data types,
  • float32x4_t data type is for 4-wide floats,
  • NEON intrinsic functions start with v (for “vector”?), have q in there for things that operate on four things, and a suffix indicating the data type. For example, a 4-wide float add is vaddq_f32. Simple and sweet!
  • Getting to individual SIMD lanes is much easier than on SSE (just vgetq_lane_f32), however doing arbitrary swizzles/shuffles is harder – you have to dance around with extracting low/high parts, or “zipping” various operands, etc.

Doing the above work did not noticeably change performance though. Oh well, actually quite expected. I did learn/remember some NEON stuff though, so a net positive :)
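
For illustration, here’s roughly what such a NEON float3 helper can look like. This is a minimal sketch (assuming the helper stores a float32x4_t with the 4th lane unused), not the exact code from the commit above:

#include <arm_neon.h>

struct float3
{
    float3() {}
    float3(float x, float y, float z) { float v[4] = { x, y, z, 0 }; m = vld1q_f32(v); }
    explicit float3(float32x4_t v) : m(v) {}
    float getX() const { return vgetq_lane_f32(m, 0); }    // per-lane access is easy
    float getY() const { return vgetq_lane_f32(m, 1); }
    float getZ() const { return vgetq_lane_f32(m, 2); }
    float32x4_t m;                                          // x, y, z, (unused)
};

// 4-wide operations; the unused lane just goes along for the ride
VM_INLINE float3 operator+(float3 a, float3 b) { return float3(vaddq_f32(a.m, b.m)); }
VM_INLINE float3 operator*(float3 a, float3 b) { return float3(vmulq_f32(a.m, b.m)); }
VM_INLINE float dot(float3 a, float3 b)
{
    float32x4_t m = vmulq_f32(a.m, b.m);
    return vgetq_lane_f32(m, 0) + vgetq_lane_f32(m, 1) + vgetq_lane_f32(m, 2);
}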

NEON HitSpheres & float4

Last time, an actual performance gain with SIMD was doing SSE HitSpheres, with data laid out in struct-of-arrays fashion. To get the same working on NEON, I basically have to implement a float4 helper class, and touch several places in the HitSpheres function itself that use SSE directly. It’s all in this commit.

That got CPU performance from 5.8 Mray/s up to 8.5 Mray/s. Nice!

Note that my NEON approach is very likely suboptimal; I was basically doing a direct port from SSE. Which means:

  • “mask” calculation for comparisons. On SSE that is just _mm_movemask_ps, but becomes this in NEON:
VM_INLINE unsigned mask(float4 v)
{
    // NEON has no direct _mm_movemask_ps equivalent, so build the 4-bit mask by hand:
    static const uint32x4_t movemask = { 1, 2, 4, 8 };
    static const uint32x4_t highbit = { 0x80000000, 0x80000000, 0x80000000, 0x80000000 };
    uint32x4_t t0 = vreinterpretq_u32_f32(v.m);                     // reinterpret float bits as uint
    uint32x4_t t1 = vtstq_u32(t0, highbit);                         // all-ones in lanes whose sign bit is set
    uint32x4_t t2 = vandq_u32(t1, movemask);                        // turn that into 1/2/4/8 per lane
    uint32x2_t t3 = vorr_u32(vget_low_u32(t2), vget_high_u32(t2));  // OR lane 0 with 2, lane 1 with 3
    return vget_lane_u32(t3, 0) | vget_lane_u32(t3, 1);             // OR the remaining two lanes together
}
  • picking closest hit among 4 results may or might not be done more optimally in NEON:
int id_scalar[4];
float hitT_scalar[4];
#if USE_NEON
vst1q_s32(id_scalar, id);
vst1q_f32(hitT_scalar, hitT.m);
#else
_mm_storeu_si128((__m128i *)id_scalar, id);
_mm_storeu_ps(hitT_scalar, hitT.m);
#endif
// In general, you would do this with a bit scan (first set/trailing zero count).
// But who cares, it's only 16 options.
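// (e.g. __builtin_ctz on GCC/Clang, a.k.a. count trailing zeros)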
static const int laneId[16] =
{
    0, 0, 1, 0, // 00xx
    2, 0, 1, 0, // 01xx
    3, 0, 1, 0, // 10xx
    2, 0, 1, 0, // 11xx
};
int lane = laneId[minMask];
int hitId = id_scalar[lane];
float finalHitT = hitT_scalar[lane];

Current status

So the above is a basic port to iOS, with a simple NEON code path, and no mobile-specific GPU tweaks/optimizations at all. Code is over at the 14-ios tag on github.

Performance:

  • iPhone SE (A9 chip): 8.5 Mray/s CPU, 19.8 Mray/s GPU.
  • iPhone X (A11 chip): 12.9 Mray/s CPU, 46.6 Mray/s GPU.
    • I haven’t looked into how many CPU threads the enkiTS task scheduler ends up using on iPhone X. I suspect it still might be just two “high performance” cores, which would be within my expectations of “roughly 50% more per-core CPU perf in two Apple CPU generations”. Which is fairly impressive!
  • For comparison, a MacBook Pro (2013) with Core i7 2.3 GHz & Intel Iris Pro gets: 42 Mray/s CPU, 99 Mray/s GPU.
    • Which means that single-thread CPU performance on iPhone X is actually very similar, or even a bit higher, than on an (admittedly old) MacBook Pro!

Pathtracer 13: GPU threadgroup memory is useful!

Introduction and index of this series is here.

Oh, last post was exactly a month ago… I guess I’ll remove “daily” from the titles then :)

So the previous approach of “let’s do one bounce iteration per pass” (a.k.a. “buffer oriented”) turned out to add a whole lot of complexity, and was not really faster. So you know what, let’s park that one for now; maybe we’ll return to something like it if we ever actually need it, or perhaps when we work on smaller ray packets that don’t need hundreds of megabytes of ray buffers.

Scott Bean (@gfxbean) sent a little hint that in my “regular, super simple” GPU implementation I might get much better performance by moving scene/material data into groupshared memory. As we’ve seen in the previous post, using group shared memory can speed things up quite a lot, and in this case all threads will be going through exactly the same spheres to check rays against.

All that work is completely isolated inside the compute shader (nice!), and conceptually goes like this:

groupshared Foo s_GroupFoo[kMaxFoos];

// at start of shader:
CopyFoosFromStructuredBuffersInto(s_GroupFoo);

ThreadGroupMemoryBarrier(); // sync threads in the group

// proceed as usual, just use s_GroupFoo instead
// of StructuredBuffer<Foo> variable

D3D11

The actual commit for D3D11 is here, and is pretty self-explanatory. At the start of the shader I make each thread do a little bit of “copy” work like this:

void main(uint3 tid : SV_GroupThreadID)
{
    uint threadID = tid.y * kCSGroupSizeX + tid.x;              // flattened thread index within the group
    uint groupSize = kCSGroupSizeX * kCSGroupSizeY;
    uint objCount = g_Params[0].sphereCount;
    uint myObjCount = (objCount + groupSize - 1) / groupSize;   // ceil(objCount / groupSize) objects per thread
    uint myObjStart = threadID * myObjCount;                    // each thread copies its own contiguous slice
    for (uint io = myObjStart; io < myObjStart + myObjCount; ++io)
    {
        if (io < objCount)
        {
            s_GroupSpheres[io] = g_Spheres[io];
            s_GroupMaterials[io] = g_Materials[io];
        }
        if (io < g_Params[0].emissiveCount)
        {
            s_GroupEmissives[io] = g_Emissives[io];
        }
    }
    GroupMemoryBarrierWithGroupSync();
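    // ... after the barrier, the shader proceeds exactly as before, just reading
    // s_GroupSpheres / s_GroupMaterials / s_GroupEmissives instead of the structured buffers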

I also reduced thread group size from 16x16 to 8x8 since that was a bit faster on my GPU (may or might not be faster on any other GPU…). What’s the result? NVIDIA GeForce 1080 Ti: 778 -> 1854 Mray/s.

So that’s 2.4x faster for a fairly simple (and admittedly not trivially scalable to large scenes) change! However… a quick test on a Radeon Pro WX 9100 says: 1200 -> 1100 Mray/s, so a bit slower. I haven’t investigated why, but I guess the takeaways are:

  1. Pre-caching compute shader data into thread group shared memory can make it a lot faster!
  2. Or it might make it slower on a different GPU.
  3. Good luck!

Metal

I did the same change in the Metal implementation; here’s the commit – pretty much the same as on D3D11. The result? MacBook Pro (2013) with Intel Iris Pro: 60.8 -> 42.9 Mray/s. (oꆤ︵ꆤo)

Why? No idea; Mac has no tooling to answer this question, as far as I can tell.

And then I did a change that I thought of totally at random, just because I was modifying these lines of code and started to think “I wonder what would happen if I…”. In the shader, several places had code like const Sphere& s = spheres[index] – this initially came from the code being a direct copy of the C++ version. I changed these places to copy into local variables by value, instead of holding a const reference, i.e. Sphere s = spheres[index].
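
In code terms, the change was essentially this shape (an illustrative sketch – the surrounding names here are made up, the real diff is in the commit linked below):

// before: bind a const reference to the sphere data (living in threadgroup or device memory)
const Sphere& s = spheres[index];
// after: copy the sphere into a thread-local value once, then work on the copy
Sphere s = spheres[index];
float3 co = s.center - rayOrig;   // later field reads now come from the local copy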

Here’s the commit, and that tiny change got the performance up to 98.7 Mray/s on Intel Iris Pro.

Why? Who knows! I would have expected any “sufficiently smart compiler” to compile both versions of the code into the exact same result. Turns out: nope, one of them is 2x faster. Good luck!

Metal shaders are a bit of a black box, with not even the intermediate representation being publicly documented. The good thing is… it turns out the IR is just LLVM bitcode (via @icculus). So I grabbed a random llvm-dis I had on my machine (from the Emscripten SDK, of all places), checked which output file Xcode produces for the *.metal inputs, and ran it on both versions.

The resulting LLVM IR disassembly is not very easy on the eyes, looking generally like this:

; <label>:13:                                     ; preds = %54, %10
  %14 = phi float [ %5, %10 ], [ %56, %54 ]
  %15 = phi i32 [ -1, %10 ], [ %55, %54 ]
  %16 = phi i32 [ 0, %10 ], [ %57, %54 ]
  %17 = sext i32 %16 to i64
  %18 = getelementptr inbounds %struct.Sphere, %struct.Sphere addrspace(3)* %2, i64 %17
  %19 = bitcast %struct.Sphere addrspace(3)* %18 to i8 addrspace(3)*
  call void @llvm.memcpy.p0i8.p3i8.i64(i8* %11, i8 addrspace(3)* %19, i64 20, i32 4, i1 false), !tbaa.struct !47
  br label %20
; <label>:20:                                     ; preds = %20, %13
  %21 = phi i32 [ 0, %13 ], [ %30, %20 ]
  %22 = phi <4 x float> [ undef, %13 ], [ %29, %20 ]
  %23 = sext i32 %21 to i64
  %24 = getelementptr inbounds %struct.Sphere, %struct.Sphere* %8, i64 0, i32 0, i32 0, i64 %23
  %25 = load float, float* %24, align 4, !tbaa !46

I’m not fluent in reading it, but by diffing the two versions, it’s not immediately obvious why one would be slower than the other. The slow one has some more load instructions with addrspace(3) on them, whereas the fast one has more calls into alloca (?) and llvm.memcpy.p0i8.p3i8.i64. Ok I guess? The alloca calls are probably not “real” calls; they just end up marking how much thread-local space will be needed after all the inlining. The memcpy probably ends up being a bunch of moves in exactly one place, so if the GPU has any sort of load coalescing, then that gets used there. Or that’s my theory for “why faster”.

So Metal takeaways might be:

  1. By-value instead of by-const-reference things might be much more efficient.
  2. Metal bytecode is “just” LLVM IR, so peeking into that with llvm-dis can be useful. Note that this is still a machine-independent, very high level IR; you have no visibility into what the GPU driver will make of it in the end.

Current status and what’s next

So this simple change to pre-cache sphere/material/emissive data into thread group shared memory got GPU performance up to:

  • PC (GeForce 1080 Ti): 778 -> 1854 Mray/s,
  • Mac (Intel Iris Pro): 61 -> 99 Mray/s.

Which is not bad for such a simple change. Current code is over at 13-gpu-threadgroup-opt tag on github.

What’s next? I’m not sure. Maybe I should look at moving this out of the “toy” stage and adding bounding volume hierarchy & triangle mesh support? Narrator: he did not.