Optimizing Shader Info Loading, or Look at Yer Data!

A story about a million shader variants, optimizing with Instruments, and looking at the data to optimize some more.

The Bug Report

The bug report I was looking into was along the lines of “when we put these shaders into our project, then building a game becomes much slower – even if shaders aren’t being used”.

Indeed it was. A quick look revealed that, for ComplicatedReasons(tm), we load information about all shaders during the game build – that explains why the slowdown was happening even if the shaders were not actually used.

This issue must be fixed! There’s probably no really good reason we must know about all the shaders for a game build. But to fix it, I’ll need to pair up with someone who knows the game data build pipeline, our data serialization and so on. So that will be someday in the future.

Meanwhile… another problem was that loading the “information for a shader” was slow in this project. Did I say slow? It was very slow.

That’s a good thing to look at. Shader data is not only loaded while building the game; it’s also loaded when the shader is needed for the first time (e.g. clicking on it in Unity’s project view); or when we actually have a material that uses it etc. All these operations were quite slow in this project.

Turns out this particular shader had a massive internal variant count. In Unity, what looks like “a single shader” to the user often has many variants inside (to handle different lights, lightmaps, shadows, HDR and whatnot - a typical ubershader setup). Usually shaders have from a few dozen to a few thousand variants. This shader had 1.9 million. And there were about ten shaders like that in the project.

The Setup

Let’s create several shaders with different variant counts for testing: 27 thousand, 111 thousand, 333 thousand and 1 million variants. I’ll call them 27k, 111k, 333k and 1M respectively. For reference, the new “Standard” shader in Unity 5.0 has about 33 thousand internal variants. I’ll do the tests on a MacBook Pro (2.3 GHz Core i7) using a 64-bit Release build.

Things I’ll be measuring:

  • Import time. How much time it takes to reimport the shader in the Unity editor. Since Unity 4.5 this doesn’t do much actual shader compilation; it just extracts information about the shader snippets that need compiling, the variants that are there, etc.
  • Load time. How much time it takes to load the shader asset in the Unity editor.
  • Imported data size. How large the imported shader data is (the serialized representation of the actual shader asset, i.e. the files that live in the Library/metadata folder of a Unity project).

So the data is:

Shader   Import    Load    Size
   27k    420ms   120ms    6.4MB
  111k   2013ms   492ms   27.9MB
  333k   7779ms  1719ms   89.2MB
    1M  16192ms  4231ms  272.4MB

Enter Instruments

Last time we used xperf to do some profiling. We’re on a Mac this time, so let’s use Apple Instruments. Just like xperf, Instruments can show a lot of interesting data. We’re looking at the simplest one, “Time Profiler” (though profiling Zombies is very tempting!). You pick that instrument, attach to the executable, start recording, and get some results out.

You then select the time range you’re interested in, and expand the stack trace. Protip: Alt-Click (ok ok, Option-Click you Mac peoples) expands full tree.

So far the whole stack is just going deep into Cocoa stuff. “Hide System Libraries” is very helpful with that:

Another very useful feature is inverting the call tree, where the results are presented from the heaviest “self time” functions (we won’t be using that here though).

When hovering over an item, an arrow is shown on the right (see image above). Clicking on that does “focus on subtree”, i.e. ignores everything outside of that item, and time percentages are shown relative to the item. Here we’ve focused on ShaderCompilerPreprocess (which does the majority of the shader “importing” work).

Looks like we’re spending a lot of time appending to strings. That usually means strings did not have enough storage buffer reserved and are causing a lot of memory allocations. Code change:
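Roughly, the change boils down to reserving enough capacity in the destination string up front, so that repeated appends don’t keep reallocating. A minimal sketch (names and sizes are made up, not the actual Unity code):

#include <string>
#include <vector>

// Build a block of "#define FOO\n" lines out of macro names.
std::string BuildDefineBlock(const std::vector<std::string>& macros)
{
    std::string result;
    size_t total = 0;
    for (const std::string& m : macros)
        total += m.size() + 9;            // room for "#define " plus newline
    result.reserve(total);                // the key change: one allocation up front
    for (const std::string& m : macros)
    {
        result += "#define ";
        result += m;
        result += '\n';
    }
    return result;
}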

This small change has cut down shader importing time by 20-40%! Very nice!

I did a couple of other small tweaks from looking at this profiling data - none of them resulted in any significant benefit though.

Profiling shader load time also shows that most of the time ends up being spent loading editor-related data, which is arrays of arrays of strings and so on:

I could have picked functions from the profiler results, gone through each of them and optimized, and perhaps achieved a solid 2-3x improvement over the initial results. Very often that’s enough to be proud of!

However…

Taking a step back

Or, as Mike Acton would say, “look at your data!” (check his CppCon 2014 slides or video). Another saying is also applicable: “think!”

Why do we have this problem to begin with?

For example, in the 333k variant shader case, we end up sending 610560 lines of shader variant information between the shader compiler process & the editor, with macro strings in each of them. In total we’re sending 91 megabytes of data over the RPC pipe during shader import.

One possible area for improvement: the data we send over and store in imported shader data is a small set of macro strings repeated over and over and over again. Instead of sending or storing the strings, we could just send the set of strings used by a shader once, assign numbers to them, and then send & store the full set as lists of numbers (or fixed size bitmasks). This should cut down on the amount of string operations we do (massively cut down on number of small allocations), size of data we send, and size of data we store.
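A sketch of that first idea (hypothetical types and names, not the actual Unity code): build a table of unique macro strings once per shader, then refer to macros by index everywhere else.

#include <string>
#include <unordered_map>
#include <vector>

// One table of unique macro names per shader...
struct MacroTable
{
    std::vector<std::string> names;
    std::unordered_map<std::string, int> indices;

    int GetIndex(const std::string& name)
    {
        auto it = indices.find(name);
        if (it != indices.end())
            return it->second;
        int idx = (int)names.size();
        names.push_back(name);
        indices.emplace(name, idx);
        return idx;
    }
};

// ...and each variant becomes a list of indices (or a fixed-size bitmask
// if the macro count is small), instead of a list of strings.
typedef std::vector<int> VariantMacroIndices;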

Another possible approach: right now we have source data in the shader that indicates which variants to generate. This data is very small: just a list of on/off features, and some built-in variant lists (“all variants to handle lighting in forward rendering”). We do the full combinatorial explosion of that in the shader compiler process, send the full set over to the editor, and the editor stores that in the imported shader data.

But the way we do the “explosion of source data into full set” is always the same. We could just send the source data from shader compiler to the editor (a very small amount!), and furthermore, just store that in imported shader data. We can rebuild the full set when needed at any time.
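A sketch of what “store the source data, expand on demand” means (hypothetical structures; the real data also carries the built-in variant lists and more):

#include <string>
#include <vector>

// What gets serialized: just the inputs to the combinatorial expansion.
struct VariantSourceData
{
    // each entry is one on/off feature or keyword group, e.g.
    // {"SHADOWS_OFF","SHADOWS_ON"}, {"FOG_OFF","FOG_EXP2"}, ...
    std::vector<std::vector<std::string>> keywordSets;
};

// The full variant set is recomputed from that whenever it is actually needed.
static void ExpandVariants(const VariantSourceData& src, size_t setIndex,
                           std::vector<int>& current,
                           std::vector<std::vector<int>>& out)
{
    if (setIndex == src.keywordSets.size())
    {
        out.push_back(current);           // one full variant, as indices into keywordSets
        return;
    }
    for (int i = 0; i < (int)src.keywordSets[setIndex].size(); ++i)
    {
        current.push_back(i);
        ExpandVariants(src, setIndex + 1, current, out);
        current.pop_back();
    }
}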

Changing the data

So let’s try to do that. First let’s deal with RPC only, without changing serialized shader data. A few commits later…

This made shader importing over twice as fast!

Shader   Import
   27k    419ms ->  200ms
  111k   1702ms ->  791ms
  333k   5362ms -> 2530ms
    1M  16784ms -> 8280ms

Let’s do the other part too, where we change the serialized shader variant data representation. Instead of storing the full set of possible variants, we only store the data needed to generate the full set:

Shader   Import              Load                 Size
   27k    200ms ->   285ms    103ms ->    396ms     6.4MB -> 55kB
  111k    791ms ->  1229ms    426ms ->   1832ms    27.9MB -> 55kB
  333k   2530ms ->  3893ms   1410ms ->   5892ms    89.2MB -> 56kB
    1M   8280ms -> 12416ms   4498ms ->  18949ms   272.4MB -> 57kB

Everything seems to work, and the serialized file size got massively decreased. But both importing and loading got slower?! Clearly I did something stupid. Profile!

Right. So after importing or loading the shader (from now a small file on disk), we generate the full set of shader variant data. Which right now is resulting in a lot of string allocations, since it is generating arrays of arrays of strings or somesuch.

But we don’t really need the strings at this point; for example after loading the shader we only need the internal representation of “shader variant key” which is a fairly small bitmask. A couple of tweaks to fix that, and we’re at:

Shader  Import    Load
   27k    42ms     7ms
  111k    47ms    27ms
  333k    94ms    76ms
    1M   231ms   225ms

Look at that! Importing a 333k variant shader got 82 times faster; loading its metadata got 22 times faster, and the imported file size got over a thousand times smaller!
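By the way, the “shader variant key” mentioned above can be thought of as roughly this (a simplification, not the actual class): per-shader keyword indices packed into a bitmask, with no strings involved after load.

#include <stdint.h>

struct ShaderVariantKey
{
    uint64_t bits[4] = {};   // room for 256 keywords per shader

    void SetKeyword(int index)        { bits[index / 64] |= 1ull << (index & 63); }
    bool HasKeyword(int index) const  { return (bits[index / 64] >> (index & 63)) & 1; }
};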

One final look at the profiler, just because:

Weird, time is spent in memory allocation but there shouldn’t be any at this point in that function; we aren’t creating any new strings there. Ahh, implicit std::string to UnityStr (our own string class with better memory reporting) conversion operators (long story…). Fix that, and we’ve got another 2x improvement:

Shader  Import    Load
   27k    42ms     5ms
  111k    44ms    18ms
  333k    53ms    46ms
    1M   130ms   128ms
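For illustration, the kind of gotcha involved above (a grossly simplified stand-in, not the real UnityStr): an implicit converting constructor means that any call passing a std::string where the other string type is expected silently allocates a copy.

#include <string>

struct UnityStrLike                     // hypothetical stand-in for UnityStr
{
    UnityStrLike(const std::string& s) : data(s) {}   // implicit conversion, allocates
    std::string data;
};

static void UseName(const UnityStrLike& name) { (void)name; }

void Caller(const std::string& keyword)
{
    UseName(keyword);   // looks free at the call site; actually builds a temporary UnityStrLike
}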

The code could still be optimized further, but there ain’t no easy fixes left I think. And at this point I’ll have more important tasks to do…

What we’ve got

So in total, here’s what we have so far:

Shader   Import                Load                 Size
   27k    420ms-> 42ms (10x)    120ms->  5ms (24x)    6.4MB->55kB (119x)
  111k   2013ms-> 44ms (46x)    492ms-> 18ms (27x)   27.9MB->55kB (519x)
  333k   7779ms-> 53ms (147x)  1719ms-> 46ms (37x)   89.2MB->56kB (this is getting)
    1M  16192ms->130ms (125x)  4231ms->128ms (33x)  272.4MB->57kB (ridiculous!)

And a fairly small pull request to achieve all this (~400 lines of code changed, ~400 new added):

Overall I’ve probably spent something like 8 hours on this – hard to say exactly since I took some breaks and did other things in between. Also I was writing down notes & making screenshots for the blog too :) The fix/optimization is already in Unity 5.0 beta 20, by the way.

Conclusion

Apple’s Instruments is a nice profiling tool (and unlike xperf, the UI is not intimidating…).

However, Profiler Is Not A Replacement For Thinking! I could have just looked at the profiling results and tried to optimize “what’s at top of the profiler” one by one, and maybe achieved 2-3x better performance. But by thinking about the actual problem and why it happens, I got a way, way better result.

Happy thinking!

Curious Case of Slow Texture Importing, and Xperf

I was looking at a curious bug report: “Texture importing got much slower in current beta”. At first look, I dismissed it under “eh, someone’s being confused” (quickly tried on several textures, did not notice any regression). But then I got a proper bug report with several textures. One of them was importing about 10 times slower than it used to be.

Why would anyone make texture importing that much slower? No one would, of course. Turns out, this was an unintended consequence of generally improving things.

But, the bug report made me use xperf (a.k.a. Windows Performance Toolkit) for the first time. Believe it or not, I’ve never used it before!

So here’s the story

We’ve got a TGA texture (2048x2048, uncompressed - a 12MB file) that takes about 10 seconds to import in the current beta build, but it took ~1 second on Unity 4.6.

First wild guess: did someone accidentally disable multithreaded texture compression? Nope, doesn’t look like it (making the final texture uncompressed still shows the massive regression).

Second guess: we are using the FreeImage library to import textures. Maybe someone, I dunno, updated it and committed a debug build? Nope, the last change to our build of it was done many moons ago.

Time to profile. My quick ”I need to get some answer in 5 seconds” profiler on Windows is Very Sleepy, so let’s look at that:

Wait what? All the time is spent in WinAPI ReadFile function?!

Is there something special about the TGA file I’m testing on? Let’s make a same-sized, uncompressed PNG image (so the file size comes out the same).

The PNG imports in 108ms, while the TGA takes 9800ms (I’ve turned off DXT compression, to focus on raw import time). In Unity 4.6 the same work took 116ms (PNG) and 310ms (TGA). File sizes are roughly the same. WAT!

Enter xperf

Asked a coworker who knows something about Windows: “why would reading one file spend all its time in ReadFile, while another file of the same size reads much faster?”, and he said “look with xperf”.

I’ve read about xperf on Bruce Dawson’s excellent blog, but never tried it myself. Before today, that is.

So, launch Windows Performance Recorder (I don’t even know if it comes with some VS or Windows SDK version or needs to be installed separately… it was on my machine somehow), tick CPU and disk/file I/O and click Start:

Do texture importing in Unity, click save, and on this fairly confusing screen click “Open in WPA”:

The overview in the sidebar gives usage graphs of our stuff. A curious thing: neither the CPU (Computation) nor the Storage graph shows intense activity? The plot thickens!

CPU usage investigation

Double clicking the Computation graph shows timeline of CPU usage, with graphs for each process. We can see Unity.exe taking up some CPU during a time period, which the UI nicely highlights for us.

Next thing is, we want to know what is using the CPU. Now, the UI groups things by the columns on the left side of the yellow divider, and displays details for them on the right side of it. We’re interested in a callstack now, so context-click on the left side of the divider, and pick “Stack”:

Oh right, to get any useful stacks we’ll need to tell xperf to load the symbols. So you go Trace -> Configure Symbol Paths, add Unity folder there, and then Trace -> Load Symbols.

And then you wait. And wait some more…

And then you get the callstacks! Not quite sure what the “n/a” entry is; my best guess is that it just represents unused CPU cores or sleeping threads or something like that.

Digging into the other call stack, we see that indeed, all the time is spent in ReadFile.

Ok, so that was not terribly useful; we already knew that from the Very Sleepy profiling session.

Let’s look at I/O usage

Remember the “Storage” graph in the sidebar that wasn’t showing much activity? Turns out, you can expand it into more graphs.

Now we’re getting somewhere! The “File I/O” overview graph shows massive amounts of activity, when we were importing our TGA file. Just need to figure out what’s going on there. Double clicking on that graph in the sidebar gives I/O details:

You can probably see where this is going now. We have a lot of file reads, in fact almost 400 thousand of them. That sounds a bit excessive.

Just like in the CPU part, the UI sorts on the columns to the left of the yellow divider. Let’s drag the “Process” column to the left; this shows that all these reads are indeed coming from Unity.

Expanding the actual events reveals the culprit:

We are reading the file alright. 3 bytes at a time.

But why and how?

But why are we reading a 12 megabyte TGA file in three-byte chunks? No one has updated our image reading library in a long time, so how come things have regressed?

Found the place in code where we’re calling into FreeImage. Looks like we’re setting up our own I/O routines and telling FreeImage to use them:
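Conceptually it looks something like this, using FreeImage’s callback mechanism (a sketch; the file functions here are hypothetical stand-ins for our own file system routines):

#include "FreeImage.h"

struct MyFile;                                          // our file handle type (hypothetical)
size_t MyFileRead(MyFile* f, void* buffer, size_t bytes);
int    MyFileSeek(MyFile* f, long offset, int origin);
long   MyFileTell(MyFile* f);

static unsigned DLL_CALLCONV ReadProc(void* buffer, unsigned size, unsigned count, fi_handle handle)
{
    // every call goes straight to the file system - there is no buffering in between
    size_t bytesRead = MyFileRead((MyFile*)handle, buffer, (size_t)size * count);
    return size ? (unsigned)(bytesRead / size) : 0;
}
static int DLL_CALLCONV SeekProc(fi_handle handle, long offset, int origin)
{
    return MyFileSeek((MyFile*)handle, offset, origin);
}
static long DLL_CALLCONV TellProc(fi_handle handle)
{
    return MyFileTell((MyFile*)handle);
}

FIBITMAP* LoadViaCallbacks(MyFile* file, FREE_IMAGE_FORMAT format)
{
    FreeImageIO io = {};
    io.read_proc = ReadProc;
    io.seek_proc = SeekProc;
    io.tell_proc = TellProc;                            // write_proc left null, we only read
    return FreeImage_LoadFromHandle(format, &io, (fi_handle)file, 0);
}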

A version control history check: indeed, a few weeks ago a change was made in that code that switched from basically “hey, load an image from this file path” to “hey, load an image using these I/O callbacks”.

This generally makes sense. If we have our own file system functions, it makes sense to use them. That way we can support reading from some non-plain-files (e.g. archives, or compressed files), etc. In this particular case, the change was done to support LZ4 compression in lightmap cache (FreeImage would need to import texture files without knowing that they have LZ4 compression done on top of them).

So all that is good. Except when that changes things to have wildly different performance characteristics, that is.

When you don’t pass file I/O routines to FreeImage, then it uses a “default set”, which is just C stdio ones:
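Roughly, the defaults boil down to plain C stdio calls (paraphrased, not quoted from FreeImage’s source), and stdio buffers the underlying file reads for you:

#include <cstdio>
#include "FreeImage.h"

static unsigned DLL_CALLCONV StdioReadProc(void* buffer, unsigned size, unsigned count, fi_handle handle)
{
    return (unsigned)fread(buffer, size, count, (FILE*)handle);   // buffered by the C runtime
}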

Now, C stdio routines do I/O buffering by default… our I/O routines do not. And FreeImage’s TGA loader does a very large number of one-pixel reads.

To be fair, the “read TGA one pixel at a time” behavior seems to be fixed in upstream FreeImage these days; we’re just using a quite old version. So looking at this bug made me realize how old our version of FreeImage is, and I made a note to upgrade it at some point. But not today.

The Fix

So, a proper fix here would be to set up buffered I/O routines for FreeImage to use. Turns out we don’t have any of those at the moment. They aren’t terribly hard to do; I poked the relevant folks to do them.

In the meantime, to check that this was really the culprit, and to avoid shipping a “well, TGAs import much slower” beta, I just made a hotfix that reads the whole image into memory, and then loads from that.
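Conceptually the hotfix is something like this (a sketch; the real code goes through Unity’s own file routines rather than std::vector):

#include <vector>
#include "FreeImage.h"

// fileData is the whole file, already read into memory in one go.
FIBITMAP* DecodeFromMemoryBuffer(const std::vector<unsigned char>& fileData)
{
    FIMEMORY* mem = FreeImage_OpenMemory((BYTE*)fileData.data(), (DWORD)fileData.size());
    FREE_IMAGE_FORMAT fif = FreeImage_GetFileTypeFromMemory(mem, 0);
    FIBITMAP* bitmap = (fif != FIF_UNKNOWN) ? FreeImage_LoadFromMemory(fif, mem, 0) : nullptr;
    FreeImage_CloseMemory(mem);
    return bitmap;
}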

Is it okay to read the whole image into a memory buffer? Depends. I’d guess in 95% of cases it is okay, especially now that the Unity editor is 64 bit. Uncompressed data for the majority of images will end up being much larger than the file size anyway. Probably the only exception could be .PSD files, which can have a lot of layers while we only need the “composited image” section of the file. So yeah, that’s why I said “hotfix”; a proper solution would be having buffered I/O routines, and/or upgrading FreeImage.

This actually made TGA and PNG importing faster than before: 75ms for TGA, 87ms for PNG (Unity 4.6: 310ms TGA, 116ms PNG; current beta before the fix: 9800ms TGA, 108ms PNG).

Yay.

Conclusion

Be careful when replacing built-in functionality of something with your own implementation (e.g. standard I/O or memory allocation or logging or … routines of some library). They might have different performance characteristics!

xperf on Windows is very useful! Go and read Bruce Dawson’s blog for way more details.

On Mac, Apple’s Instruments is a similar tool. I think I’ll use that in some next blog entry.

I probably should have guessed that “too many, too small file read calls” is the actual cause after two minutes of looking into the issue. I don’t have a good excuse on why I did not. Oh well, next time I’ll know :)

Divide and Conquer Debugging

It should not be news to anyone that the ability to narrow down a problem while debugging is an incredibly useful skill. Yet from time to time I see people just helplessly stumbling around at random when they are trying to debug something. So with this in mind (and also “less tweeting, more blogging!” in mind for 2015), here’s a practical story.

This happened at work yesterday, and it’s just an ordinary bug investigation. It’s not some complex bug, and the investigation was very short - all of it took less time than writing this blog post.

Bug report

We’re adding iOS Metal support to Unity 4.6.x, and one of the beta testers reported this: “iOS Metal renders submeshes incorrectly”. There was a nice project attached that shows the issue very clearly. He has some meshes with multiple materials on them, and the parts using the 2nd material are displayed in the wrong position.

The scene looks like this in the Unity editor:

But when run on an iOS device, it looks like this:

Not good! Well, at least the bug report is very nice :)

Initial guesses

Since the problematic parts are the second material on each object, and it only happens on the device, then the user’s “iOS Metal renders submeshes incorrectly” guess makes perfect sense (spoiler alert: the problem was elsewhere).

Ok, so what is different between editor (where everything works) and device (where it’s broken)?

  • Metal: device is running Metal, whereas editor is running OpenGL.
  • CPU: device is running on ARM, editor running on Intel.
  • Need to check which shaders are used on these objects; maybe they are something crazy that results in differences.

Some other exotic things might be different, but first let’s take the above.

Initial Cuts

Run the scene on the device using OpenGL ES 2.0 instead. Ha! The issue is still there. Which means Metal is not the culprit at all!

Run it using a slightly older, stable Unity version (4.6.1). The issue is not there. Which means it’s some regression somewhere between Unity 4.6.1 and the code we’re based on now. Thankfully that’s only a couple weeks of commits.

We just need to find what regressed, when and why.

Digging Deeper

Let’s look at the frame on the device, using Xcode frame capture.

Hmm. We see that the scene is rendered in two draw calls (whereas it’s really six sub-objects), via Unity’s dynamic batching.

Dynamic batching is a CPU-side optimization we have where small objects using identical rendering state are transformed into world space on the CPU, into a dynamic geometry buffer, and rendered in a single draw call. So this spends some CPU time to transform the vertices, but saves some time on the API/driver side. For very small objects (sprites etc.) this tends to be a win.

Actually, I could have seen that it’s two draw calls in the editor directly, but it did not occur to me to look for that.

Let’s check what happens if we explicitly disable dynamic batching. Ha! The issue is gone.

So by now, what we know is: it’s some recent regression in dynamic batching, that happens on iOS device but not in the editor; and is not Metal related at all.

But it’s not that “all dynamic batching got broken”, because:

  • Half of the bug scene (the pink objects) are dynamic-batched, and they render correctly.
  • We do have automated graphics tests that cover dynamic batching; they run on iOS; and they did not notice any regressions.

Finding It

Since the regression is recent (4.6.1 worked, and was something like three weeks old), I chose to look at everything that changed since that release, and try to guess which changes are dynamic batching related, and could affect iOS but not the editor.

This is like a heuristic step before/instead of doing an actual “bisect the bug” pass. The Unity codebase is large and testing builds isn’t an extremely fast process (mercurial update, build editor, build iOS support, build iOS application, run). If the bug were a regression from a really old Unity version, then I probably would have tried several in-between versions to narrow it down.

I used perhaps the most useful SourceTree feature - you select two changesets, and it shows the full diff between them. So looking at the whole diff was just several clicks away:

A bunch of changes there are immediately “not relevant” - definitely everything documentation related; almost definitely anything editor related; etc.

This one looked like a candidate for investigation (a change in matrix related ARM NEON code):

This one interesting too (a change in dynamic batching criteria):

And this one (a change in dynamic batching ARM NEON code):

I started looking at the last one…

Lo and behold, it had a typo indeed; the {d1[1]} thing was storing the w component of transformed vertex position, instead of z like it’s supposed to!

The code was in the part where dynamic batching is done on vertex positions only, i.e. it was only used on objects with shaders that only need positions (and not normals, texture coordinates etc.). This explains why half of the scene was correct (the pink objects use a shader that needs normals as well), and why our graphics tests did not catch this (so it turns out they don’t test dynamic batching with position-only shaders).

Fixing It

The fix is literally a one character change:
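To illustrate what the wrong lane means (a sketch in NEON intrinsics; the actual code is hand-written assembly, where the fix was changing {d1[1]} to {d1[0]}):

#include <arm_neon.h>

// After the matrix transform the register holds (x, y, z, w);
// in ARMv7 terms that is d0 = (x, y) and d1 = (z, w).
void StorePositionXYZ(float32x4_t xyzw, float* dst)
{
    vst1_f32(dst, vget_low_f32(xyzw));   // store x, y
    dst[2] = vgetq_lane_f32(xyzw, 2);    // store z; the bug stored lane 3 (w) instead
}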

…and the batching code is getting some more tests.

Importing Cubemaps From Single Images

So this tweet on the EXR format in the texture pipeline, and the replies about cubemaps, made me write this…

Typically skies or environment maps are authored as regular 2D textures, and then turned into cubemaps at “import time”. There are various cubemap layouts commonly used: lat-long, spheremap, cross-layout etc.

In Unity 4 we had a pipeline where the user had to pick which projection the source image uses. But for Unity 5, ReJ realized that this is just boring, useless work! You can tell these projections apart quite easily by looking at the image aspect ratio.

So now we default to “automatic” cubemap projection, which goes like this:

  • If aspect is 4:3 or 3:4, it’s a horizontal or vertical cross layout.
  • If aspect is square, it’s a sphere map.
  • If aspect is 6:1 or 1:6, it’s six cubemap faces in a row or column.
  • If aspect is about 1.85:1 (and wider than tall), it’s a lat-long map.

Now, some images don’t quite match these exact ratios, so the code is doing some heuristics. Actual code looks like this right now:

float longAxis = image.GetWidth();
float shortAxis = image.GetHeight();
bool definitelyNotLatLong = false;
if (longAxis < shortAxis)
{
    Swap (longAxis, shortAxis);
    // images that have height > width are never latlong maps
    definitelyNotLatLong = true;
}

const float aspect = shortAxis / longAxis;
const float probSphere = 1-Abs(aspect - 1.0f);
const float probLatLong = (definitelyNotLatLong) ?
    0 :
    1-Abs(aspect - 1.0f / 1.85f);
const float probCross = 1-Abs(aspect - 3.0f / 4.0f);
const float probLine = 1-Abs(aspect - 1.0f / 6.0f);
if (probSphere > probCross &&
    probSphere > probLine &&
    probSphere > probLatLong)
{
    // sphere map
    return kGenerateSpheremap;
}
if (probLatLong > probCross &&
    probLatLong > probLine)
{
    // lat-long map
    return kGenerateLatLong;
}
if (probCross > probLine)
{
    // cross layout
    return kGenerateCross;
}
// six images in a row
return kGenerateLine;

So that’s it. There’s no point in forcing your artists to paint lat-long maps and then use some external software to convert them to a cross layout, or anything like that.

Now of course, you can’t just look at image aspect and determine all possible projections out of it. For example, both spheremap and “angular map” are square. But in my experience heuristics like the above are good enough to cover most common use cases (which seem to be: lat-long, cross layout or a sphere map).

US Vacation Report 2014

This April I had a vacation in the USA, so here’s a write up and a bunch of photos. Our trip: 12 days, a group of five (myself, my wife, our two daughters and my sister), a rented car, driving around. We made the itinerary ourselves; tried to stay out of big cities and hotel chains – used airbnb where possible. For everyone except me, this was the first trip to the USA; and even I had never ventured outside of conference cities before.

TL;DR: Grand Canyon and Death Valley are awesome.

Summary of what we wanted:

  • See nature. Grand Canyon, Death Valley, Yosemite, Highway 1 etc. were on the potential list. Decided to skip Yosemite this time.
  • No serious hiking; that’s hard to do with kids and we’re lazy :)

Asking friends and reading the internets (wikitravel, wikipedia, lonely planet, random blogs), I came up with a list of “places I’d like to go”. Thanks to everyone on Facebook for telling me that my initial plan was way too ambitious. This is what we ended up with:

Flew to Vegas, did a couple of round trips from there (to Valley of Fire and Grand Canyon), and then headed off towards the ocean. Through Death Valley, and then up north to San Francisco. Flew back from there. In total we ended up driving close to 3000 kilometers; a pretty good balance between “want to see lots of stuff” and “gotta always be driving”.

Almost all photos below are my wife’s. Equipment: Canon EOS 70D with Canon 24-70mm f/2.8 L II and Sigma 8-16mm f/4.5-5.6. A couple photos by me, taken with iPhone 4S :)

Day 1: Valley of Fire

Close to Vegas, there’s this amazing formation of red sandstone. It’s fairly big, so that took the whole day. A good short trip for a jetlagged day :)

This is Elephant Rock, if you squint enough:

Day 2: Vegas to Grand Canyon via Route 66

Stela (my 4yo daughter) is a huge fan of Pixar’s Cars, so a short detour through the actual Route 66, and the Hackberry General Store, was a joy for her.

Yellow Arizona landscapes:

Arrived at the Grand Canyon towards the evening, at the South Rim. It is touristy, so we picked a less-central trail (Kaibab Trail). The views are absolutely breathtaking; something that is very hard to convey via photos. The scale is hard to comprehend: the height from the rim to the bottom is 1.6km!

We walked a bit below the rim on Kaibab Trail. Would be cool to get to the bottom, but that is a full-day hike one way. Some next time.

Day 3: Grand Canyon, and back to Vegas through Hoover Dam

It’s next to impossible to find lodging in Grand Canyon Village itself (unless you book half a year in advance?), so we slept in Tusayan, 10km south. Next morning, went to the canyon rim again.

Visited Hoover Dam on the way back to Vegas:

Day 4: Death Valley

Death Valley: the lowest and driest place in North America, and the hottest place on Earth. This was April, so the temperature was a mild 40°C in the shade :) Death Valley is amazing because it has lots and lots of very different geological things close to each other.

Here, Zabriskie Point, a fantastic erosional landscape. If you’re into obscure movies, you might know a film of the same name.

Next up, Devil’s Golf Course, a salt pan. Apparently almost a century ago one guide book had “Only the devil could play golf here” description, and it stuck. Salt crystals are really sharp on these rocks; there are signs advising “do not trip over or you’ll cut yourself”.

Natural Bridge in the mountains right next to it. It was also getting very, very hot.

Badwater Basin, lowest point in North America. Aistė (my wife) walking on pure salt.

Artist’s Palette, with rocks of any color you like.

Remains of a borax mine, Harmony Borax Works:

Mesquite Flat Sand Dunes. They say some scenes from Star Wars were shot here!

Day 5: Death Valley to Pacific

We stayed at Panamint Springs Resort in the Panamint Valley. Tried to catch a sunrise next morning, but it was all cloudy :( Ground here is covered with salt as well:

Leaving Panamint Valley, and through much more green California:

Day 6: San Luis Obispo to Big Sur

Spent a night in San Luis Obispo (lovely little town!), and took the Route 1 up north towards Big Sur.

Turns out, the coast redwoods are quite tall!

Day 7: Big Sur to Santa Cruz

Impressive scenery of ocean and clouds and rocks; and Bixby Creek Bridge.

Jellyfish at Monterey Aquarium:

Big waves and Santa Cruz beach:

Days 8, 9, 10: To San Francisco and there

Rented an apartment in the western side (Sunset District), so that it would be possible to find some parking space :) Moraga Steps and view towards sunset.

Obligatory Golden Gate Bridge.

Muir Woods just north of SF. This is again a redwood park, but much more crowded than Big Sur.

Random places in SF: wall art at Broadway/Columbus, and a block of Lombard Street, “the crookedest street in the world”.

Day 11: back home

A long flight back home. 5AM in the airport :)