<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lost in the Triangles</title>
	<atom:link href="http://aras-p.info/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://aras-p.info/blog</link>
	<description>Random thoughts of a triangle pusher</description>
	<lastBuildDate>Sun, 01 Apr 2012 07:27:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Tiled Forward Shading links</title>
		<link>http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/</link>
		<comments>http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/#comments</comments>
		<pubDate>Tue, 27 Mar 2012 15:35:46 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=821</guid>
		<description><![CDATA[The main idea of my previous post was roughly this: in forward rendering, there&#8217;s no reason why we still have to use per-object light lists. We can apply roughly the same ideas as those of tiled deferred shading. It&#8217;s really nice to see that other people have thought about this, either earlier or at about the same time; here [...]]]></description>
			<content:encoded><![CDATA[<p>The main idea of my <a href="http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/">previous post</a> was roughly this: in forward rendering, there&#8217;s no reason why we still have to use per-object light lists. We can apply roughly the same ideas as those of tiled deferred shading.</p>
<p>It&#8217;s really nice to see that other people have thought about this, either earlier or at about the same time; here are some links:</p>
<ul>
<li><a href="http://www.cse.chalmers.se/~olaolss/main_frame.php?contents=publication&#038;id=tiled_shading">Tiled Shading</a> by Ola Olsson and Ulf Assarsson; Journal of Graphics Tools. PDF, source code and comparisons between tiled forward &#038; tiled deferred.</li>
<li><a href="http://developer.amd.com/gpu_assets/AMD_Demos_LeoDemoGDC2012.ppsx">Forward+: Bringing Deferred Lighting to the Next Level</a> by Takahiro Harada, Jay McKee, Jason C. Yang; GDC 2012. This describes AMD&#8217;s Leo demo. There&#8217;s an incomplete Eurographics 2012 <a href="https://sites.google.com/site/takahiroharada/">paper here</a>.</li>
<li><a href="http://www.pjblewis.com/articles/tile-based-forward-rendering/">Tile-Based Forward Rendering</a> by Peter J. B. Lewis. An implementation that does not use a compute shader (but does use other DX11 features, like UAVs).</li>
<li><a href="http://mynameismjp.wordpress.com/2012/03/31/light-indexed-deferred-rendering/">Light Indexed Deferred Rendering</a>; new implementation by Matt &#8220;MJP&#8221; Pettineo. Includes performance comparisons with tiled deferred rendering.</li>
<li>Very similar in approach is of course <a href="http://code.google.com/p/lightindexed-deferredrender/">Light Indexed Deferred Rendering</a> by Damian Trebilco.</li>
</ul>
<p>As <a href="http://portfolio.punkuser.net/">Andrew Lauritzen</a> points out in the <a href="http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/#comment-179964">comments of my previous post</a>, claiming &#8220;but deferred will need super-fat G-buffers!&#8221; is an over-simplification. You could just as well store material indices plus the data needed for sampling textures (UVs + derivatives); and by going &#8220;deferred&#8221; you get more choices in how to schedule your computations.</p>
<p>There&#8217;s no fundamental difference between &#8220;forward&#8221; and &#8220;deferred&#8221; these days. As soon as you have a Z-prepass you are already caching/deferring <em>something</em>, and from there it&#8217;s a whole spectrum of options for what to cache or &#8220;defer&#8221; for later computation, and how.</p>
<p>Ultimately, of course, the best approach depends on a million factors. The only lesson to take from this post is that &#8220;forward rendering does not have to use per-object light lists&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>2012 Theory for Forward Rendering</title>
		<link>http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/</link>
		<comments>http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 08:16:30 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=805</guid>
		<description><![CDATA[Good question in a tweet by @ivanassen: So what is the 2012 theory on lights in a forward renderer? Hard to answer that in 140 characters, so here goes a raw brain dump (warning: not checked in practice!). Short answer A modern forward renderer for DX11-class hardware would probably be something like AMD&#8217;s Leo demo. They [...]]]></description>
			<content:encoded><![CDATA[<p>Good question in a <a href="https://twitter.com/ivanassen/statuses/175350571044311042">tweet</a> by <a href="https://twitter.com/#!/ivanassen">@ivanassen</a>:</p>
<blockquote><p>So what is the 2012 theory on lights in a forward renderer?</p></blockquote>
<p>Hard to answer that in 140 characters, so here goes a raw brain dump <em>(warning: not checked in practice!)</em>.</p>
<p><strong>Short answer</strong></p>
<p>A modern forward renderer for DX11-class hardware would probably be something like <a href="http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx">AMD&#8217;s Leo demo</a>.</p>
<p>They seem to be doing light culling in a compute shader, and the result is per-pixel / per-tile linked lists of lights. Then the scene is rendered normally with forward rendering, fetching the light lists and computing shading. The advantages are many: arbitrary shading models with many parameters that would be hard to store in a G-buffer; semitransparent objects; hardware MSAA support; much smaller memory requirements compared to a fat G-buffer layout.</p>
<p>The disadvantage would be storing the linked lists, I guess. Memory usage is potentially unbounded here, though various schemes similar to <a href="http://software.intel.com/en-us/articles/adaptive-transparency/">Adaptive Transparency</a> could probably be used to cap the maximum number of lights per pixel/tile.</p>
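<p>To make &#8220;capping&#8221; concrete, here is a minimal sketch. It is in Python for readability rather than HLSL, and it is not what Leo does. The flat per-tile array, the hypothetical <tt>TileLightLists</tt> name and the &#8220;keep the most important lights&#8221; policy are all my assumptions. Every tile gets a fixed number of slots, so memory is <tt>num_tiles * max_lights_per_tile</tt> no matter how many lights overlap a tile.</p>

```python
from dataclasses import dataclass


@dataclass
class TileLightLists:
    """Per-tile light index lists with a hard cap per tile (hypothetical layout).

    All indices live in one flat array of size num_tiles * max_lights_per_tile,
    so memory use is fixed up front instead of growing like a linked list.
    """
    num_tiles: int
    max_lights_per_tile: int

    def __post_init__(self):
        self.indices = [-1] * (self.num_tiles * self.max_lights_per_tile)
        self.counts = [0] * self.num_tiles

    def build(self, tile, candidates):
        """Store lights for one tile; candidates are (light_index, importance) pairs.

        When more lights overlap the tile than there are slots, keep the most
        important ones (e.g. highest estimated contribution) and drop the rest.
        """
        kept = sorted(candidates, key=lambda c: c[1], reverse=True)
        kept = kept[: self.max_lights_per_tile]
        base = tile * self.max_lights_per_tile
        for slot, (light_index, _importance) in enumerate(kept):
            self.indices[base + slot] = light_index
        self.counts[tile] = len(kept)

    def lights(self, tile):
        """Light indices stored for a tile, most important first."""
        base = tile * self.max_lights_per_tile
        return self.indices[base : base + self.counts[tile]]
```

<p>The price is that a tile covered by more lights than it has slots silently drops the least important ones, which may show up as popping when lights move.</p>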
<p><strong>Deferred == Caching</strong></p>
<p>All the deferred lighting/shading approaches are essentially caching schemes. We cache some amount of surface information, in screen space, in order to avoid fetching or computing the same information over and over again, as traditional forward rendering does when it applies lights one by one.</p>
<p>Now, caching in screen space leads to disadvantages like &#8220;it&#8217;s really hard to do transparencies&#8221;, since with transparencies one pixel on screen no longer maps to a single point in space. However, there&#8217;s no reason why caching has to be done in screen space; lighting could just as well be computed in texture space (as some skin rendering techniques do, though for a different reason), world space (voxels?), etc.</p>
<p><strong>Does &#8220;modern&#8221; forward rendering still need caching?</strong></p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2012/03/ShaderParams.png"><img src="http://aras-p.info/blog/wp-content/uploads/2012/03/ShaderParams-238x500.png" alt="" title="Shader parameters on all the things!" width="238" height="500" class="alignright size-medium wp-image-810" /></a><br />
Caching information was important because in DX9 / Shader Model 3 times it was hard to do forward rendering that could apply an almost arbitrary, variable number of lights in one pass, with good efficiency. That led to either a shader combination explosion, or inefficient multipass rendering, or both. But now we have DX11, compute shaders, structured buffers and unordered access views, so maybe we can actually do better?</p>
<p>And we will want to, because at some point we will need BRDFs with more parameters than is viable to store in a G-buffer (side image: this is <em>half</em> of the parameters for one material). We will want many semitransparent objects. And then we&#8217;re back to square one: we cannot efficiently do this in a traditional &#8220;deferred&#8221; way where we cache N numbers per pixel.</p>
<p>AMD&#8217;s Leo goes in that direction. It seems to take the light culling from <a href="http://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines/">tiled deferred approaches</a> and apply it to forward rendering.</p>
<p><strong>I imagine it doing something like:</strong></p>
<ol>
<li>Z-prepass:
<ol>
<li>Render Z prepass of opaque objects to fill in depth buffer.</li>
<li>Store that away (copy into another depth buffer).</li>
<li>Continue the Z prepass with transparent objects, still writing to depth.</li>
<li>Now we have two Z buffers, and for any pixel we know the Z-extents of anything interesting in it (from the closest transparent object up to the closest opaque surface).</li>
</ol>
</li>
<li>Shadowmaps, as usual. We would need to keep the shadowmaps for all lights in memory, which can be a problem!</li>
<li>Light culling, very similar to what you&#8217;d do in tiled deferred case!
<ol>
<li>Have all lights stored in a buffer. Light types, positions/directions/ranges/angles, colors etc.</li>
<li>From the two depth buffers above, we can compute Z ranges per pixel/tile in order to do better light culling.</li>
<li>Run a compute shader that does light culling. This could be done per pixel or per small tile (e.g. 8&#215;8 pixels). The result is buffer(s) / lists per pixel or tile, holding the lights that affect that pixel or tile.</li>
</ol>
</li>
<li>Render objects in forward rendering:
<ol>
<li>The Z-buffer was already pre-filled in step 1.1.</li>
<li>Each shader would have to do the &#8220;apply all lights that affect this pixel/tile&#8221; computation. That involves fetching the arbitrary light information, looping over the lights, etc.</li>
<li>Otherwise, each object is free to use as many shader parameters as it wants, or use any BRDF it wants.</li>
<li>Rendering order is as in usual forward rendering: a batch-friendly order for opaque objects (since Z is already prefilled), and per-object or per-triangle back-to-front order for semitransparent objects.</li>
</ol>
</li>
<li><em>Profit!</em></li>
</ol>
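<p>Step 3 could be sketched on the CPU roughly like this: Python, brute force. A real version would be a compute shader with one thread group per tile. The <tt>Light</tt> fields and function names are hypothetical. I assume each light&#8217;s screen-space bounding rectangle and view-space depth extent have already been computed. A light is then kept for a tile only if both its rectangle and its depth extent overlap the tile&#8217;s.</p>

```python
from dataclasses import dataclass

TILE_SIZE = 8  # pixels per tile side, e.g. 8x8 tiles as in step 3.3


@dataclass
class Light:
    # Hypothetical precomputed bounds: screen-space bounding rectangle in
    # pixels (x0/y0 inclusive, x1/y1 exclusive) and view-space depth extent.
    x0: int
    y0: int
    x1: int
    y1: int
    z_near: float
    z_far: float


def cull_lights(lights, tile_z_ranges, width, height):
    """Return, for each tile (row-major order), indices of lights affecting it.

    tile_z_ranges[t] is a (z_min, z_max) pair for tile t, e.g. from the
    closest transparent depth up to the farthest opaque depth in that tile.
    """
    tiles_x = (width + TILE_SIZE - 1) // TILE_SIZE
    tiles_y = (height + TILE_SIZE - 1) // TILE_SIZE
    result = []
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            tx0, ty0 = tx * TILE_SIZE, ty * TILE_SIZE
            tx1, ty1 = tx0 + TILE_SIZE, ty0 + TILE_SIZE
            z_min, z_max = tile_z_ranges[ty * tiles_x + tx]
            visible = []
            for i, light in enumerate(lights):
                overlaps_xy = (light.x0 < tx1 and tx0 < light.x1 and
                               light.y0 < ty1 and ty0 < light.y1)
                overlaps_z = light.z_near <= z_max and z_min <= light.z_far
                if overlaps_xy and overlaps_z:
                    visible.append(i)
            result.append(visible)
    return result


def lights_for_pixel(tile_lists, x, y, width):
    """The light list a forward-rendered pixel would fetch in step 4.2."""
    tiles_x = (width + TILE_SIZE - 1) // TILE_SIZE
    return tile_lists[(y // TILE_SIZE) * tiles_x + (x // TILE_SIZE)]
```

<p>In step 4.2, the pixel at (x, y) would then loop over <tt>lights_for_pixel(tile_lists, x, y, width)</tt> and accumulate each light&#8217;s contribution with whatever BRDF the object uses.</p>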
<p>Now, I have hand-waved over some potentially problematic details.</p>
<p>For example, &#8220;two depth buffers&#8221; is not robust for cases where there are <em>no</em> opaque objects in some area; we&#8217;d need to track minimum <em>and</em> maximum depths of semitransparent stuff, or accept worse light culling for those tiles. Likewise, copying the depth buffer might lose some hardware Hi-Z information, so in practice it could be better to track semitransparent depths using another approach (min/max blending of a float texture etc.).</p>
<p>The step 4.2 bit about &#8220;let&#8217;s apply all lights&#8221; assumes there is <em>some</em> way to do that efficiently, while supporting complicated things like each light having a different cookie/gobo texture, or a different shadowmap etc. Texture arrays could almost certainly be used here, but since this is just a brain dump without verification in practice, it&#8217;s hard to say how well that would work.</p>
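<p>The &#8220;track minimum and maximum depths of semitransparent stuff&#8221; variant could compute each tile&#8217;s depth range roughly like this. This is a sketch under my own conventions:</p>
<ul>
<li>Depths are view-space distances, where larger means farther.</li>
<li>The opaque buffer holds infinity wherever nothing opaque was drawn.</li>
<li>Two extra min/max-blended buffers hold the nearest and farthest semitransparent depth per pixel.</li>
</ul>
<p>The function name is hypothetical.</p>

```python
import math

EMPTY = math.inf  # cleared value: "nothing rendered at this pixel"


def tile_depth_range(opaque, transparent_min, transparent_max):
    """Compute (z_min, z_max) for one tile from its per-pixel depths.

    opaque holds the opaque Z prepass result (EMPTY where there is no opaque
    surface); transparent_min / transparent_max hold min/max-blended depths
    of semitransparent surfaces (EMPTY and -EMPTY where there are none).
    """
    z_min, z_max = math.inf, -math.inf
    for d_o, t_lo, t_hi in zip(opaque, transparent_min, transparent_max):
        near = min(d_o, t_lo)
        # With an opaque surface, nothing behind it matters. Without one,
        # use the farthest semitransparent surface instead of the far plane,
        # which keeps light culling tight for such tiles.
        far = d_o if d_o != EMPTY else t_hi
        z_min = min(z_min, near)
        z_max = max(z_max, far)
    return z_min, z_max
```

<p>A tile with no opaque pixels now gets a depth range that ends at its farthest semitransparent surface instead of at the far plane. A completely empty tile gets an empty range (<tt>z_min &gt; z_max</tt>), so a depth-overlap test against any finite light extent rejects every light.</p>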
<p><b>Update</b>: other papers came out describing almost the same idea, with actual implementations &#038; measurements. <a href="http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/">Check them out here!</a></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Prophets and duct-tapers or: useful programmer traits</title>
		<link>http://aras-p.info/blog/2011/09/09/prophets-and-duct-tapers-or-useful-programmer-traits/</link>
		<comments>http://aras-p.info/blog/2011/09/09/prophets-and-duct-tapers-or-useful-programmer-traits/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 16:52:38 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=795</guid>
		<description><![CDATA[I liked Pierre&#8217;s The Prophet Programmer post. Go read it now. Now of course that post is a rant. It exaggerates. It paints everything in one-bit grayscale. There&#8217;s never one person completely like this &#8220;prophet programmer&#8221; and another like the idolized &#8220;best programmer&#8230; not afraid of anything!!1&#8221;. But it does highlight at least [...]]]></description>
			<content:encoded><![CDATA[<p>I liked Pierre&#8217;s <a href="http://www.codercorner.com/blog/?p=502">The Prophet Programmer</a> post. Go read it now.</p>
<p>Now <em>of course</em> that post is a rant. It exaggerates. It paints everything in one-bit grayscale. There&#8217;s never one person completely like this &#8220;prophet programmer&#8221; and another like the idolized &#8220;best programmer&#8230; not afraid of anything!!1&#8221;.</p>
<p>But it does highlight at least this: some aspects of a programmer&#8217;s behavior are useful, and some are not.</p>
<p>Obsessing over the latest hypes and &#8220;the proper ways&#8221;, or following books to the letter, <em>is not useful</em> just by itself. Sure, sometimes a dash of &#8220;proper ways&#8221; or recommendations is good, but the benefits of doing that are really, really tiny. Hence it&#8217;s not worth thinking/arguing much about.</p>
<p><strong>Here are some actually useful programmer traits instead.</strong> I&#8217;m thinking about real people I&#8217;m working with here, even if I&#8217;m not naming names.</p>
<p>He <em>feels what needs to be done</em> to get the solution, in the big picture. Sometimes these are unusual ideas that probably no one else is pursuing, because everyone has always looked at the problem in the standard way. The solutions seem obvious once you see them, but getting there requires some sort of step function in thinking. The zero-iteration way of hooking up touchscreen device input to test a game is to play the game on the PC, stream the images to the device and stream the inputs back. The most hassle-free asset pipeline is one with no &#8220;export/import asset&#8221; step at all. Or, for a more famous outside example, tablets <a href="http://aras-p.info/blog/wp-content/uploads/2011/09/tablets-before-and-after-ipad.jpeg">before and after</a> the iPad. You can rarely, if ever, do things like that by running user surveys or improving on existing solutions; you need someone who can see through it all and find the <em>actual</em> problem you want to solve. This guy is worth his weight in gold.</p>
<p>She can <em>cut things</em>. &#8220;Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away&#8221;, quoth Saint-Exupéry. To be good at anything, you (both you and your team) need to focus, which means cutting things. Let go of bad ideas and blind alleys. If your justification for continuing is &#8220;but we already spent so much time on it&#8221;, just don&#8217;t; it will only get worse. Cut features that aren&#8217;t quite ready by the deadline. Remove old things that aren&#8217;t useful anymore. Doing that can and will make some people upset; it&#8217;s really, <em>really</em> hard to postpone or completely abandon something that someone put a lot of effort into. But it needs to be done, and you need her on the team to make these hard decisions.</p>
<p>That other guy is <em>freaking fast</em>. And not in the sense of &#8220;types tons of code real fast, then sometimes it works, and two weeks later someone else has to clean it up&#8221;. No, he&#8217;s cranking out good, solid, tested, working code at incredible speed. Got ten bugs? They are fixed by the next day. Got a new feature to do? Commits with everything implemented (and working!) are pushed in a few days. When he goes on vacation, your burndown chart changes slope. How does he do it? I don&#8217;t know. But by all means, hold on to him!</p>
<p>The other girl can figure out any <em>complex problem real fast</em>. Be it a tricky bug, unexpected behavior, really weird interaction with other systems &#8211; others could be spending hours, if not days, trying to figure out what&#8217;s going on. She, on the other hand, checks just a handful of things and goes &#8220;ha! the problem&#8217;s right there&#8221;. As if applying binary search to the whole problem space, except to everyone else the space seems unsorted and they don&#8217;t even know what they&#8217;re looking for!</p>
<p>This dude can keep <em>a ton of context in his head</em> while doing anything. How will this feature interact with dozens or even hundreds of other features? He&#8217;s able to think about all of them, and the majority of corner cases, and get everything right in one go, where someone else would need dozens of roundtrips between coding &#038; QA. When estimating effort for new things, he can immediately list all the tricky work that will need to be done, whereas others would go &#8220;sounds easy&#8221; only to find out it&#8217;s a month of work.</p>
<p>She&#8217;s <em>not satisfied with the status quo</em>. No, this isn&#8217;t good enough, she says; let me show you where &#038; how spectacularly it breaks. And it does not matter if everyone else is doing it this way; here&#8217;s why putting that stuff into a uniform grid isn&#8217;t good. A lot of the time you need this extra bump to snap out of your own &#8220;this is good enough, no one will care&#8221; thinking.</p>
<p>He&#8217;s doing a lot of <em>boring work to get others more productive</em>. There&#8217;s <em>a ton</em> of boring work on even the most exciting projects, and someone has to do it. He&#8217;s often the unsung hero, quietly working on infrastructure, build times, fixing annoyances in the tools, processes and workflows; all just so that others can be better at doing <em>exciting</em> things. You could call him a janitor or a plumber if you wish, but any place gets rotten and broken real fast without those people.</p>
<p>&#8230;and the list could go on. Unlike obsessing over irrelevant details, <strong>these make a difference</strong>. They make your team run circles around others, help you solve <em>hard</em> problems and invent things, and move you forward at enormous velocity.</p>
<p>You need people with those traits and attitudes.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/09/09/prophets-and-duct-tapers-or-useful-programmer-traits/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Fast Mobile Shaders or, I did a talk at SIGGRAPH!</title>
		<link>http://aras-p.info/blog/2011/08/17/fast-mobile-shaders-or-i-did-a-talk-at-siggraph/</link>
		<comments>http://aras-p.info/blog/2011/08/17/fast-mobile-shaders-or-i-did-a-talk-at-siggraph/#comments</comments>
		<pubDate>Wed, 17 Aug 2011 19:22:05 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=791</guid>
		<description><![CDATA[Finally after many years of dreaming I made it to SIGGRAPH! And not only that, I also did a talk/course with ReJ for 1.5 hours. This was the first time Unity had real presence at SIGGRAPH and I hope we&#8217;ll be more active &#038; visible next time around. Here it is, 100+ slides with notes: [...]]]></description>
			<content:encoded><![CDATA[<p>Finally, after many years of dreaming, I made it to <a href="http://www.siggraph.org/s2011/">SIGGRAPH</a>! And not only that, I also did a 1.5-hour talk/course with <a href="http://twitter.com/#!/__ReJ__">ReJ</a>. This was the first time Unity had a real presence at SIGGRAPH, and I hope we&#8217;ll be more active &#038; visible next time around.</p>
<p>Here it is, 100+ slides with notes: <a href="http://aras-p.info/texts/files/FastMobileShaders_siggraph2011.pdf"><strong>Fast Mobile Shaders</strong></a> (17MB pdf). This isn&#8217;t strictly about shaders; there&#8217;s info about mobile GPU architectures, general performance, hidden surface removal and so on. Also, graphs with logarithmic scales; can&#8217;t go wrong with that!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/08/17/fast-mobile-shaders-or-i-did-a-talk-at-siggraph/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Testing Graphics Code, 4 years later</title>
		<link>http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/</link>
		<comments>http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/#comments</comments>
		<pubDate>Fri, 17 Jun 2011 04:44:46 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=762</guid>
		<description><![CDATA[Almost four years ago I wrote how we test rendering code at Unity. Did it stand the test of time and, more importantly, the growth of the company from fewer than 10 people to more than 100? I&#8217;m happy to say it did! That&#8217;s it, move on to read the rest of the internets. The earlier [...]]]></description>
			<content:encoded><![CDATA[<p>Almost four years ago <a href="http://aras-p.info/blog/2007/07/31/testing-graphics-code/">I wrote how we test rendering code</a> at Unity. Did it stand the test of time and, more importantly, the growth of the company from fewer than 10 people to more than 100?</p>
<p><em>I&#8217;m happy to say it did! That&#8217;s it, move on to read the rest of the internets.</em></p>
<p>The earlier post was more focused on the hardware compatibility area (differences between platforms, GPUs, driver versions, driver bugs and their workarounds etc.). In addition to that, we do regression tests on a bunch of <a href="http://blogs.unity3d.com/2010/01/12/on-web-player-regression-testing/">actual Unity-made games</a>. All of that is good and works; instead, let&#8217;s talk about the tests the rendering team at Unity uses in daily life.</p>
<p><strong>Graphics Feature &#038; Regression Testing</strong></p>
<p>In the daily life of a graphics programmer, you care about two things related to testing:</p>
<p><span id="more-762"></span><strong>1.</strong> Whether a new feature you are adding, more or less, works.<br />
<strong>2.</strong> Whether something new you added or something you refactored broke or changed any existing features.</p>
<p>Now, &#8220;works&#8221; is a vague term. Definitions can range from the equally vague</p>
<blockquote><p>Works For Me!</p></blockquote>
<p>to something like </p>
<blockquote><p>It has been battle tested on thousands of use cases, hundreds of shipped games, dozens of platforms, thousands of platform configurations and within each and every one of them there&#8217;s not a single wrong pixel, not a single wasted memory byte and not a single wasted nanosecond! <em>No kittehs were harmed either!</em></p></blockquote>
<p>In an ideal world we&#8217;d only consider the latter as &#8220;works&#8221;; however, that&#8217;s quite hard to achieve.</p>
<p>So instead we settle for small &#8220;functional tests&#8221;, where each feature has a small scene setup that exercises said feature (very much like what I talked about in the <a href="http://aras-p.info/blog/2007/07/31/testing-graphics-code/">previous post</a>). It&#8217;s the graphics programmer&#8217;s responsibility to add tests like that for his own features.</p>
<p>For example, fog handling might be tested by a couple of scenes like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/092-FogModes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/092-FogModes.png" alt="" title="Fog Modes" width="400" height="300" class="alignnone size-full wp-image-770" /></a><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/017-Fog.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/017-Fog.png" alt="" title="Fog vs. different shaders; Forward rendering above, Deferred Lighting below" width="400" height="300" class="alignnone size-full wp-image-771" /></a></p>
<p>Another example: tests for various corner cases of Deferred Lighting:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/118-DeferredLMCases.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/118-DeferredLMCases.png" alt="" title="Lightmapped/NonLightmapped objects vs. Baked/NonBaked lights" width="400" height="300" class="alignnone size-full wp-image-774" /></a><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/134-DefLightShapes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/134-DefLightShapes.png" alt="" title="Light volumes crossing near/far planes" width="400" height="300" class="alignnone size-full wp-image-775" /></a><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/143-DefLargeCoords.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/143-DefLargeCoords.png" alt="" title="Ability to handle small near plane &amp; large world coordinates" width="400" height="300" class="alignnone size-full wp-image-776" /></a></p>
<p>So that&#8217;s the basic &#8220;it works&#8221; testing that the graphics programmers do themselves. Beyond that, features are tested by QA and a large beta testing group, and tried, profiled and optimized on real game projects, and so on.</p>
<p>The good thing is, doing these basic tests also provides you with point 2 (did I break or change something?) automatically. If after your changes, all the graphics tests still pass, there&#8217;s a pretty good chance you did not break anything. Of course this testing is not exhaustive, but any time a regression is spotted by QA, beta testers or reported by users, you can add a new graphics test to check for that situation.</p>
<p><strong>How do we actually do it?</strong></p>
<p>We use <a href="http://www.jetbrains.com/teamcity/">TeamCity</a> for the build/test farm. It has several build machines set up as graphics test agents (unlike most other build machines, they need an actual GPU, or an iOS device connected to them, or a console devkit etc.) that run graphics test configurations for all branches automatically. Each branch has its graphics tests run daily, and branches with &#8220;high graphics code activity&#8221; (i.e. branches that the rendering team is actually working on) have them run more often. You can always start the tests manually by clicking a button, of course. What you want to see at any time is this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/teamcity-gfx-tests.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/teamcity-gfx-tests.png" alt="" title="The graphics tests are passing one by one!" width="445" height="362" class="alignnone size-full wp-image-778" /></a></p>
<p>The basic approach is the same as <a href="http://aras-p.info/blog/2007/07/31/testing-graphics-code/">4 years ago</a>: a &#8220;game level&#8221; (&#8220;scene&#8221; in Unity speak) for each test, which runs for a defined number of frames with everything at a fixed timestep, taking a screenshot at the end of each frame. Each screenshot is compared with the &#8220;known good&#8221; image for that platform; any difference means &#8220;FAIL&#8221;. On many platforms you have to allow a couple of wrong pixels, because many consumer GPUs are not <i>fully</i> deterministic, it seems.</p>
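<p>The comparison itself can be as simple as counting differing pixels. Here is a sketch in Python. The real test controller is C#, and I made up the function name, parameter names and defaults. Images are rows of RGB tuples, and <tt>max_wrong_pixels</tt> is the per-platform allowance, where 0 means &#8220;any difference is a FAIL&#8221;.</p>

```python
def compare_screenshots(expected, actual, max_wrong_pixels=0, channel_tolerance=0):
    """Compare two images given as lists of rows of (r, g, b) tuples.

    A pixel is "wrong" if any channel differs by more than channel_tolerance.
    The test passes if at most max_wrong_pixels pixels are wrong.
    Returns (passed, wrong_pixel_count); the count is None on a size mismatch.
    """
    if len(expected) != len(actual) or any(
        len(er) != len(ar) for er, ar in zip(expected, actual)
    ):
        return False, None
    wrong = 0
    for expected_row, actual_row in zip(expected, actual):
        for e, a in zip(expected_row, actual_row):
            if any(abs(ec - ac) > channel_tolerance for ec, ac in zip(e, a)):
                wrong += 1
    return wrong <= max_wrong_pixels, wrong
```

<p>A platform with slightly nondeterministic GPUs would then pass something like <tt>max_wrong_pixels=2</tt>, while the default stays strict.</p>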
<p>So you have this bunch of &#8220;this is the golden truth&#8221; images for all the tests:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/some-gfx-tests.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/some-gfx-tests-500x247.png" alt="" title="Images for some of the graphics tests" width="500" height="247" class="alignnone size-medium wp-image-781" /></a></p>
<p>And each platform automatically tested on TeamCity has its own set:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/06/gfx-test-platforms.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/06/gfx-test-platforms.png" alt="" title="Platforms of graphics tests" width="187" height="181" class="alignnone size-full wp-image-782" /></a></p>
<p>Since the &#8220;test controller&#8221; can run on a different device than the actual tests (the case for iOS, Xbox 360 etc.), the test executable opens a socket connection to transfer the screenshots. The test controller is a relatively simple C# application that listens on a socket, fetches the screenshots and compares them with the template ones. Its result is output that TeamCity can understand, along with &#8220;build artifacts&#8221; that consist of the failed tests (for each failed test: the expected image, the failed image, and a difference image with increased contrast).</p>
<p>That&#8217;s pretty much it! And of course, automated tests are nice and all, but they should not get in the way of actual programming too much (see the <a href="http://programming-motherfucker.com/">programming manifesto</a>).</p>
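<p>The two outputs mentioned above could be produced along these lines. This is another Python sketch, not the actual C# controller, and the function names are hypothetical. The contrast factor is arbitrary. For the TeamCity side, I assume its standard <tt>##teamcity[...]</tt> service messages (<tt>testStarted</tt>, <tt>testFailed</tt>, <tt>testFinished</tt>) printed to stdout, which is one way to produce output TeamCity understands.</p>

```python
def difference_image(expected, actual, contrast=8):
    """Per-channel absolute difference, scaled up so small errors are visible."""
    return [
        [tuple(min(255, abs(ec - ac) * contrast) for ec, ac in zip(e, a))
         for e, a in zip(expected_row, actual_row)]
        for expected_row, actual_row in zip(expected, actual)
    ]


def teamcity_escape(text):
    """Escape a value for a TeamCity service message ('|' must go first)."""
    for raw, escaped in (("|", "||"), ("'", "|'"), ("\n", "|n"),
                         ("\r", "|r"), ("[", "|["), ("]", "|]")):
        text = text.replace(raw, escaped)
    return text


def report_test(name, passed, message=""):
    """Return the service-message lines for one graphics test."""
    n = teamcity_escape(name)
    m = teamcity_escape(message)
    lines = [f"##teamcity[testStarted name='{n}']"]
    if not passed:
        lines.append(f"##teamcity[testFailed name='{n}' message='{m}']")
    lines.append(f"##teamcity[testFinished name='{n}']")
    return lines
```

<p>Printing those lines from the controller would make each graphics test show up as an individual pass/fail entry. The difference image would be saved next to the expected and failed images as artifacts.</p>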
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Notes on Native Client &amp; Pepper Plugin API</title>
		<link>http://aras-p.info/blog/2011/06/02/notes-on-native-client-pepper-plugin-api/</link>
		<comments>http://aras-p.info/blog/2011/06/02/notes-on-native-client-pepper-plugin-api/#comments</comments>
		<pubDate>Thu, 02 Jun 2011 08:24:48 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[rant]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=744</guid>
		<description><![CDATA[Google&#8217;s Native Client (NaCl) is a brilliant idea. TL;DR: it allows native code to be run securely in the browser. But is it secure? &#8220;Bububut, waitaminnit! Native code is not secure by definition&#8221; you say. Turns out, that isn&#8217;t necessarily true. With a specially massaged compiler, some runtime support and careful native code validation it [...]]]></description>
			<content:encoded><![CDATA[<p>Google&#8217;s <a href="http://code.google.com/p/nativeclient/">Native Client</a> (NaCl) is a brilliant idea. <a href="http://en.wikipedia.org/wiki/Wikipedia:Too_long;_didn%27t_read">TL;DR</a>: it allows <em>native</em> code to be run <em>securely</em> in the browser.</p>
<p><strong>But is it secure?</strong></p>
<p><em>&#8220;Bububut, waitaminnit! Native code is not secure by definition&#8221;</em> you say. Turns out, that isn&#8217;t necessarily true. With a specially massaged compiler, some runtime support and careful native code validation, it is possible to ensure that native code, when run in the browser, can&#8217;t cause harm to the user&#8217;s machine. I suggest taking a look at the original <a href="http://src.chromium.org/viewvc/native_client/data/docs_tarball/nacl/googleclient/native_client/documentation/nacl_paper.pdf">NaCl for x86 paper</a> and, more recently, how similar techniques would apply to <a href="http://www.chromium.org/nativeclient/reference/arm-overview">ARM CPUs</a>.</p>
<p><strong>But what can you do with it?</strong></p>
<p><span id="more-744"></span>So that&#8217;s great. It means it is possible to take C/C++ code, compile it with NaCl SDK (a gcc derived toolchain) and have it run in the browser. We can make a loop in C that multiplies a ton of floating point numbers, and it will run at native speed. That&#8217;s wonderful, except you can&#8217;t really do much interesting stuff with your own C code in isolation&#8230;</p>
<p>You need access to the hardware and/or OS. As game developers, we need pixels to appear on the screen. Preferably lots of them, with the help of something like a <a href="http://en.wikipedia.org/wiki/Graphics_processing_unit">GPU</a>. Audio waves to come out of the speakers. Mouse moves and keyboard presses to translate to some fancy actions. Post a high score to the internets. And so on.</p>
<p>NaCl surely can&#8217;t just allow my C code to call <a href="http://msdn.microsoft.com/en-us/library/bb219685(v=vs.85).aspx"><tt>Direct3DCreate9</tt></a> and run with it, while keeping the promise of &#8220;it&#8217;s secure&#8221;? Or a more extreme case, <tt>FILE* f = fopen("/etc/passwd", "rt");</tt>?!</p>
<p>And that&#8217;s true; NaCl does not allow you to use completely arbitrary APIs. It has its own set of APIs to interface with &#8220;the system&#8221;.</p>
<p><strong>Ok, how do I interface with the system?</strong></p>
<p>&#8230;and that&#8217;s where the current state of NaCl gets a bit confusing.</p>
<p>Initially Google developed an improved &#8220;browser plugin model&#8221; and called it Pepper. This Pepper thing would then take care of actually putting your code <em>into the browser</em>. Starting it up, tearing it down, controlling repaints, processing events and so on. But then apparently they realized that building on top of a decade-old Netscape plugin API (<a href="http://en.wikipedia.org/wiki/NPAPI">NPAPI</a>) isn&#8217;t going to really work, so they developed Pepper2 or PPAPI (Pepper Plugin API) which ditches NPAPI completely. To write a native client plugin, you only interface with PPAPI.</p>
<p>So some of the pages on the internets reference the &#8220;old API&#8221; (which is gone as far as I can see), and some others reference the new one. It does not help that Native Client&#8217;s own documentation is scattered across the <a href="http://www.chromium.org/nativeclient">Chromium</a>, <a href="http://code.google.com/p/nativeclient/">NaCl</a>, <a href="http://code.google.com/p/nativeclient-sdk/">NaCl SDK</a> and <a href="http://code.google.com/p/ppapi/">PPAPI</a> sites. Seriously, <em>it&#8217;s a mess</em>, with seemingly no high-level, up-to-date &#8220;introduction&#8221; page that tells what exactly PPAPI can and can&#8217;t do. <em>Edit</em>: I&#8217;m told that the definitive entry point to NaCl right now is this page: <a href="http://code.google.com/chrome/nativeclient/"><strong>http://code.google.com/chrome/nativeclient/</strong></a>, which clears up some of the mess.</p>
<p><strong>Here&#8217;s what I think it can do</strong></p>
<p><em>Note: At <a href="http://unity3d.com/">work</a> we have an in-progress Unity NaCl port using this PPAPI. However, I am not working on it, so my knowledge may or may not be true. Take everything with a grain of NaCl ;)</em></p>
<p>Most of the things below were found by poking around the <a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/">PPAPI source tree</a> and by looking into Unity&#8217;s NaCl platform-dependent bits.</p>
<p><em><strong>Graphics</strong></em></p>
<p>PPAPI provides an OpenGL ES 2.0 implementation for your 3D needs. You need to set up the context and initial surfaces via PPAPI (<tt><a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/dev/context_3d_dev.h?view=markup">ppapi/cpp/dev/context_3d_dev.h</a>, <a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/dev/surface_3d_dev.h?view=markup">ppapi/cpp/dev/surface_3d_dev.h</a></tt>) &#8211; the role EGL plays on other platforms &#8211; and beyond that you just include <tt>GLES2/gl2.h, GLES2/gl2ext.h</tt> and call ye olde GLES2.0 functions.</p>
<p>Behind the scenes, all your GLES2.0 calls are put into a <a href="http://src.chromium.org/viewvc/chrome/trunk/src/gpu/command_buffer/">command buffer</a> and transferred to the actual &#8220;3D server&#8221; process, which consumes them. Chrome splits itself up into various processes like that for security reasons: each process has the minimum set of privileges, so a crash or a security exploit in one of them can&#8217;t easily spread to other parts of the browser.</p>
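<p>To make the command buffer idea a bit more concrete, here is a toy sketch in C. It is an illustration under our own assumptions, not Chrome&#8217;s actual command buffer format: the client side appends opcodes and arguments to a flat array of 32-bit words instead of executing anything, and a consumer (standing in for the separate &#8220;3D server&#8221; process) walks the array later and &#8220;executes&#8221; each command. The opcodes, <tt>CmdBuffer</tt> and the <tt>rec_*</tt> wrappers are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; a real command buffer defines one per GL entry point. */
enum { CMD_CLEAR = 1, CMD_DRAW_ARRAYS = 2 };

#define CMD_CAPACITY 256

typedef struct {
    uint32_t words[CMD_CAPACITY];
    size_t size; /* number of 32-bit words written so far */
} CmdBuffer;

/* Client side: append a header (opcode, argument count) followed by the
   arguments. Nothing is executed here; the call is only serialized. */
static int cmd_push(CmdBuffer *cb, uint32_t op, const uint32_t *args, uint32_t argc) {
    assert(cb != NULL);
    if (cb->size + 2 + argc > CMD_CAPACITY)
        return 0; /* full: a real client would flush and wait for the consumer */
    cb->words[cb->size++] = op;
    cb->words[cb->size++] = argc;
    for (uint32_t i = 0; i < argc; ++i)
        cb->words[cb->size++] = args[i];
    return 1;
}

/* Wrappers that look like GL calls but only record commands. */
static int rec_clear(CmdBuffer *cb, uint32_t mask) {
    return cmd_push(cb, CMD_CLEAR, &mask, 1);
}

static int rec_draw_arrays(CmdBuffer *cb, uint32_t mode, uint32_t first, uint32_t count) {
    uint32_t args[3] = { mode, first, count };
    return cmd_push(cb, CMD_DRAW_ARRAYS, args, 3);
}

/* Consumer side (the separate "3D server" process): decode and execute.
   Here "executing" just tallies what would have been drawn. */
typedef struct { unsigned clears, draws, vertices; } ReplayStats;

static ReplayStats cmd_replay(const CmdBuffer *cb) {
    ReplayStats st = { 0, 0, 0 };
    size_t pos = 0;
    while (pos + 2 <= cb->size) {
        uint32_t op = cb->words[pos], argc = cb->words[pos + 1];
        const uint32_t *args = &cb->words[pos + 2];
        /* Never trust the client: stop at commands that run past the end. */
        if (argc > cb->size - pos - 2)
            break;
        if (op == CMD_CLEAR && argc == 1) {
            st.clears++;
        } else if (op == CMD_DRAW_ARRAYS && argc == 3) {
            st.draws++;
            st.vertices += args[2]; /* vertex count */
        }
        pos += 2 + argc; /* unknown or malformed commands are skipped */
    }
    return st;
}
```

<p>For example, recording one <tt>rec_clear</tt> and two <tt>rec_draw_arrays</tt> calls with 36 and 6 vertices, then handing the buffer to <tt>cmd_replay</tt>, yields 1 clear, 2 draws and 42 vertices. The bounds check on the consumer side is the interesting part: the whole point of a separate process is that it must not trust what the client wrote.</p>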
<p><em><strong>Audio</strong></em></p>
<p>For audio needs, PPAPI provides a simple buffer-based API in <tt><a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/audio_config.h?view=markup">ppapi/cpp/audio_config.h</a></tt> and <tt><a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/audio.h?view=markup">ppapi/cpp/audio.h</a></tt>. Your own callback is called whenever the audio buffer needs to be filled with new samples. That means you do all sound mixing yourself and just fill in the final buffer.</p>
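<p>To make &#8220;you do all sound mixing yourself&#8221; concrete, here is a minimal C sketch of such a fill callback. It assumes interleaved 16-bit stereo output and mono 16-bit voices. The callback parameters (sample buffer, its size in bytes, a user data pointer) are modeled on PPAPI&#8217;s audio callback, but treat the exact signature as an assumption; the <tt>Voice</tt> and <tt>Mixer</tt> types are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical voice: a mono 16-bit sound being played back. */
typedef struct {
    const int16_t *samples;
    size_t length; /* number of samples */
    size_t cursor; /* next sample to play */
    float volume;  /* 0..1 */
} Voice;

typedef struct {
    Voice *voices;
    size_t voice_count;
} Mixer;

static int16_t clamp16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Fill callback: mixes all active voices into an interleaved stereo int16
   buffer. Signature assumed to be (buffer, size in bytes, user data). */
static void fill_audio(void *sample_buffer, uint32_t buffer_size_in_bytes, void *user_data) {
    Mixer *mixer = (Mixer *)user_data;
    int16_t *out = (int16_t *)sample_buffer;
    /* Assumed: the buffer always holds whole stereo frames. */
    assert(buffer_size_in_bytes % (2 * sizeof(int16_t)) == 0);
    size_t frames = buffer_size_in_bytes / (2 * sizeof(int16_t));

    for (size_t f = 0; f < frames; ++f) {
        int32_t acc = 0; /* wider accumulator: the sum may exceed 16 bits before clamping */
        for (size_t v = 0; v < mixer->voice_count; ++v) {
            Voice *voice = &mixer->voices[v];
            if (voice->cursor < voice->length)
                acc += (int32_t)(voice->samples[voice->cursor++] * voice->volume);
        }
        int16_t s = clamp16(acc);
        out[2 * f] = s;     /* left */
        out[2 * f + 1] = s; /* right: mono voices go to both channels */
    }
}
```

<p>Summing into a 32-bit accumulator and clamping once per sample keeps two loud voices from wrapping around into noise. A real mixer would also deal with resampling, panning and thread safety, since such callbacks typically run on a separate audio thread.</p>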
<p><em><strong>Input</strong></em></p>
<p>Your plugin instance (a subclass of <tt>pp::Instance</tt>) gets input events by overriding the <tt>HandleInputEvent</tt> virtual function. Each event is a simple <a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/c/pp_input_event.h?view=markup"><tt>PPInputEvent</tt> struct</a> and can represent keyboard &#038; mouse input. No support for gamepads or touch input so far, it seems.</p>
<p><em><strong>Other stuff</strong></em></p>
<p>Making web requests is possible via <tt><a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/url_loader.h?view=markup">ppapi/cpp/url_loader.h</a></tt> and friends.</p>
<p>Timer &#038; time queries via <tt><a href="http://src.chromium.org/viewvc/chrome/trunk/src/ppapi/cpp/core.h?view=markup">ppapi/cpp/core.h</a></tt> (e.g. <tt>pp::Module::Get()->core()->CallOnMainThread(...)</tt>).</p>
<p>And, well, a bunch of other stuff is there, like the ability to rasterize blocks of text into bitmaps, pop up file selection dialogs, use the browser to decode video streams and so on. Everything &#8211; or almost everything &#8211; needed to make games with it is there.</p>
<p><strong>Summary</strong></p>
<p>Like <a href="http://chadaustin.me/2011/01/in-defense-of-language-democracy/">Chad says</a>, it would be good to end <em>&#8220;thou shalt only use Javascript&#8221;</em> on the web. Javascript is a very nice language &#8211; especially considering how it came into existence &#8211; but <em>forcing</em> it on everyone is quite silly. And no matter how hard V8/JägerMonkey/Nitro folks are trying, it is very, very hard to <a href="http://chadaustin.me/2011/01/digging-into-javascript-performance/">beat performance</a> of a simple, static, compiled language (like C) that has direct access to memory and the programmer is in almost full control of both the code flow and the memory layout. Steve rightly <a href="http://twitter.com/#!/stevestreeting/status/76216985888882688">points out</a> that even if for some tasks a super-optimized Javascript engine will approach the speed of C, it will burn much more energy to do so &#8212; a very important aspect in the increasingly mobile world.</p>
<p>Native Client does give some hope that there will be a way to run native code, at native speeds, in the browser, without compromising on security. Let it happen.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/06/02/notes-on-native-client-pepper-plugin-api/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A way to visualize mip levels</title>
		<link>http://aras-p.info/blog/2011/05/03/a-way-to-visualize-mip-levels/</link>
		<comments>http://aras-p.info/blog/2011/05/03/a-way-to-visualize-mip-levels/#comments</comments>
		<pubDate>Tue, 03 May 2011 16:41:59 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=710</guid>
		<description><![CDATA[Recently a discussion on Twitter about folks using 2048 textures on a pair of dice spawned this post. How do artists know if the textures are too high or too low resolution? Here&#8217;s what we do in Unity, which may or may not work elsewhere. When you have a game scene that, for example, looks [...]]]></description>
			<content:encoded><![CDATA[<p>Recently a <a href="http://twitter.com/#!/aras_p/status/63538509952200705">discussion</a> on Twitter about folks using 2048 textures on a pair of dice spawned this post. How do artists know if the textures are too high or too low resolution? Here&#8217;s what we do in Unity, which may or may not work elsewhere.</p>
<p>When you have a game scene that, for example, looks like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/05/BootcampNormal.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/05/BootcampNormal-500x283.jpg" alt="" title="Normal game scene view" width="500" height="283" class="alignnone size-medium wp-image-714" /></a><br />
We provide a &#8220;mipmaps&#8221; visualization mode that renders it like this:</p>
<p><span id="more-710"></span><a href="http://aras-p.info/blog/wp-content/uploads/2011/05/BootcampMips.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/05/BootcampMips-500x283.jpg" alt="" title="Mipmap view of the game scene" width="500" height="283" class="alignnone size-medium wp-image-713" /></a></p>
<p>Original texture colors mean it&#8217;s a perfect match (1:1 texels to pixels ratio); more red = too much texture detail; more blue = too little texture detail.</p>
<p><em>That&#8217;s it, end of story, move along!</em></p>
<p>Now of course it&#8217;s not that simple. You can&#8217;t just go and resize all textures that were used on the red stuff. The player might walk over to those red objects, and <em>then</em> they would need more detail!</p>
<p>Also, the amount of texture detail needed very much depends on the screen resolution the game will be running at:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/05/PlatformerSizes.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/05/PlatformerSizes-500x190.jpg" alt="" title="Different resolutions need different detail" width="500" height="190" class="alignnone size-medium wp-image-722" /></a></p>
<p>Still, even with varying screen resolutions and the fact that the same objects in 3D can be both near &#038; far from the viewer, this view can answer the question &#8220;does something have too high/too low texture detail?&#8221;, mostly by looking at colorization mismatches between nearby objects.</p>
<p>In the picture above, the railings have too little texture detail (blue), while the lamp posts have too much (red). The little extruded things on the floating pads have too much detail as well.</p>
<p>The image below reveals that the floor and ceiling have mismatched texture densities: the floor has too little, while the ceiling has too much. It should probably be the other way around; in a platformer you&#8217;d be looking at the floor more often.<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/05/FloorCeiling1.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/05/FloorCeiling1-500x318.jpg" alt="" title="Floor vs Ceiling" width="500" height="318" class="alignnone size-medium wp-image-726" /></a></p>
<p><strong>How to do this?</strong></p>
<p>In the mipmap view shader, we display the original texture mixed with a special &#8220;colored mip levels&#8221; texture. The regular texture is sampled with the original UVs, while the color-coded texture is sampled with denser ones, so that &#8220;too little texture detail&#8221; can be visualized as well. In shader code <em>(HLSL, shader model 2.0 compatible)</em>:</p>
<blockquote><pre>struct v2f {
    float4 pos : SV_POSITION;
    float2 uv : TEXCOORD0;
    float2 mipuv : TEXCOORD1;
};
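<b>float2 mainTextureSize</b>; // pixel size of mainTexture, e.g. (256,256)
float4x4 matrix_mvp;           // model * view * projection matrix
sampler2D mainTexture;         // the regular texture
sampler2D mipColorsTexture;    // 32x32, a different color in each mip level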
v2f vert (float4 vertex : POSITION, float2 uv : TEXCOORD0)
{
    v2f o;
    o.pos = mul (matrix_mvp, vertex);
    o.uv = uv;
    o.mipuv = <b>uv * mainTextureSize / 8.0</b>;
    return o;
}
half4 frag (v2f i) : COLOR0
{
    half4 col = tex2D (mainTexture, i.uv);
    half4 mip = tex2D (mipColorsTexture, i.mipuv);
    half4 res;
    res.rgb = lerp (col.rgb, mip.rgb, mip.a);
    res.a = col.a;
    return res;
}
</pre>
</blockquote>
<p>The <tt>mainTextureSize</tt> above is the pixel size of the main texture, for example (256,256). Division by eight might seem a bit weird, but it really isn&#8217;t!</p>
<p>To show the colored mip levels, we need to create <tt>mipColorsTexture</tt> that has different colors in each mip level.</p>
<p>Let&#8217;s say we create a 32&#215;32 texture for this, and the largest mip level is used to display the &#8220;ideal texel to pixel density&#8221;. If the original texture is 256 pixels in size and we want to sample a 32-pixel texture at exactly the same texel density as the original one, we have to use denser UVs: <tt>newUV = uv * 256 / 32</tt>, or more generically, <tt>newUV = uv * textureSize / mipTextureSize</tt>.</p>
<p>Why is there <tt>8.0</tt> in the shader, then, if we create the mip texture at 32&#215;32 size? Because we don&#8217;t want the largest mip level to indicate the &#8220;ideal texel to pixel&#8221; density; we also want a way to visualize &#8220;not enough texel density&#8221;. So we push the ideal mip level two levels down, which is a factor of four in UV scale (each mip level halves the size, and mip level 2 of a 32&#215;32 texture is 8&#215;8). That&#8217;s how 32 becomes 8 in the shader.</p>
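<p>The arithmetic above can be sanity-checked with a few lines of C. This is a simplified model under our own assumptions: the GPU is taken to pick the nearest mip level as log2 of texels per pixel, whereas real hardware derives this from UV derivatives and usually blends two levels with trilinear filtering. The function names are hypothetical. Since <tt>mipuv</tt> is scaled by <tt>mainTextureSize / 8</tt> and the colors texture is 32 texels wide, a main texture at exactly 1:1 density lands on mip level 2:</p>

```c
#include <assert.h>

#define MIP_TEX_SIZE 32  /* size of mipColorsTexture: levels 32,16,8,4,2,1 */
#define MIP_UV_DIVISOR 8 /* the 8.0 in the vertex shader */

/* Number of mip levels in a square power-of-two texture: log2(size) + 1. */
static int mip_level_count(int size) {
    int levels = 1;
    while (size > 1) {
        size >>= 1;
        ++levels;
    }
    return levels;
}

/* Given how many main-texture texels land on one screen pixel (1.0 = ideal),
   return the mipColorsTexture level that gets sampled. Simplified model:
   nearest level, i.e. round(log2(colors-texture texels per pixel)). */
static int sampled_mip_level(double main_texels_per_pixel) {
    assert(main_texels_per_pixel > 0.0);
    /* mipuv = uv * mainTextureSize / 8, and the colors texture is 32 texels wide,
       so one pixel covers 32 / 8 = 4 times as many colors-texture texels. */
    double ratio = main_texels_per_pixel * MIP_TEX_SIZE / MIP_UV_DIVISOR;
    int level = 0;
    /* Round log2(ratio) to nearest without libm: bring ratio into [1/sqrt2, sqrt2). */
    while (ratio >= 1.4142135623730951) { ratio *= 0.5; ++level; }
    while (ratio < 0.7071067811865476) { ratio *= 2.0; --level; }
    int max_level = mip_level_count(MIP_TEX_SIZE) - 1;
    if (level < 0) level = 0;                 /* magnified: largest level (blue) */
    if (level > max_level) level = max_level; /* heavily minified: 1x1 level (red) */
    return level;
}
```

<p>So halving the main texture&#8217;s on-screen density moves one level toward blue (0.5 texels per pixel gives level 1), and doubling it moves one level toward red (2 texels per pixel gives level 3).</p>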
<p>The actual colors we use for the six mip levels of this 32&#215;32 visualization texture (from 32&#215;32 down to 1&#215;1) are, in RGBA: (0.0,0.0,1.0,0.8); (0.0,0.5,1.0,0.4); (1.0,1.0,1.0,0.0); (1.0,0.7,0.0,0.2); (1.0,0.3,0.0,0.6); (1.0,0.0,0.0,0.8). The alpha channel controls how much to interpolate between the original color and the tint color. The third mip level (8&#215;8) has zero alpha, so it displays the unmodified color.</p>
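<p>Filling such a texture is straightforward. A minimal C sketch, assuming RGBA8 storage with each mip level filled with one solid color (the floats above converted to 0..255); uploading the levels to the GPU is omitted, and <tt>build_mip_colors</tt> is a hypothetical helper name:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MIP_TEX_SIZE 32
#define MIP_LEVELS 6 /* 32, 16, 8, 4, 2, 1 */

/* Tint color per mip level, RGBA; alpha = how strongly to tint. */
static const float kMipColors[MIP_LEVELS][4] = {
    {0.0f, 0.0f, 1.0f, 0.8f}, /* level 0, 32x32: too little detail (blue) */
    {0.0f, 0.5f, 1.0f, 0.4f}, /* level 1, 16x16 */
    {1.0f, 1.0f, 1.0f, 0.0f}, /* level 2, 8x8: ideal density, no tint */
    {1.0f, 0.7f, 0.0f, 0.2f}, /* level 3, 4x4 */
    {1.0f, 0.3f, 0.0f, 0.6f}, /* level 4, 2x2 */
    {1.0f, 0.0f, 0.0f, 0.8f}, /* level 5, 1x1: too much detail (red) */
};

static uint8_t to_byte(float v) {
    assert(v >= 0.0f && v <= 1.0f);
    return (uint8_t)(v * 255.0f + 0.5f); /* round to nearest */
}

/* Allocates one RGBA8 buffer per mip level, each filled with its solid color.
   levels[i] holds (MIP_TEX_SIZE >> i)^2 pixels. Returns 0 on allocation failure. */
static int build_mip_colors(uint8_t *levels[MIP_LEVELS]) {
    for (int i = 0; i < MIP_LEVELS; ++i) {
        int size = MIP_TEX_SIZE >> i;
        size_t pixels = (size_t)size * size;
        levels[i] = malloc(pixels * 4);
        if (!levels[i]) {
            while (i-- > 0) free(levels[i]); /* release levels built so far */
            return 0;
        }
        for (size_t p = 0; p < pixels; ++p)
            for (int c = 0; c < 4; ++c)
                levels[i][p * 4 + c] = to_byte(kMipColors[i][c]);
    }
    return 1;
}
```

<p>Each level is filled explicitly rather than produced by automatic mipmap generation: box-filtering a solid blue 32&#215;32 level down would just give blue at every level and defeat the purpose.</p>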
<p><em>Now, step 2 is somehow forcing artists to actually use this ;)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/05/03/a-way-to-visualize-mip-levels/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Mercurial/Kiln experience so far</title>
		<link>http://aras-p.info/blog/2011/04/18/mercurialkiln-experience-so-far/</link>
		<comments>http://aras-p.info/blog/2011/04/18/mercurialkiln-experience-so-far/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 07:14:33 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=668</guid>
		<description><![CDATA[At work we switched to Mercurial almost two months ago. Like Richard says, it was time to stop using Subversion. Here are my impressions so far. Preemptive warning: I&#8217;ve only ever used CVS, SourceSafe, Subversion, git and Mercurial as source control systems (never used Perforce). I never really used a code review tool before Kiln. [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://unity3d.com/">work</a> we switched to <a href="http://mercurial.selenic.com/">Mercurial</a> almost two months ago. Like <a href="http://altdevblogaday.org/2011/03/09/its-time-to-stop-using-subversion/">Richard says</a>, it was time to stop using Subversion. Here are my impressions so far.</p>
<p><span id="more-668"></span><em>Preemptive warning: I&#8217;ve only ever used CVS, SourceSafe, Subversion, git and Mercurial as source contro systems (never used Perforce). I never really used a code review tool before Kiln. Everything below might be non-issues in other tools/systems, or not suitable for different setups/workflows!<br />
</em></p>
<p><strong>The Story</strong></p>
<p>At Unity we have used <a href="http://subversion.apache.org/">Subversion</a> for source code versioning for as long as I can remember. svn revision 1 (an import from CVS) happened in 2005. We don&#8217;t talk about CVS. Nor about SourceSafe. Subversion was fine while the number of developers was small; we had a saying that CVS scales up to 5 people, and experimentally found out that svn scales up to about 50.</p>
<p>Since merging branches in Subversion does not <em>really</em> work well, everyone was mostly working on one trunk, <em>carefully</em>. We would do an occasional branch for &#8220;this will surely break everything&#8221; features, and would branch off trunk sometime before each Unity release, but that&#8217;s about it. Having something like 50 people and 10 platforms on a single branch in version control does get a bit uncomfortable.</p>
<p>So we looked at various options, like <a href="http://git-scm.com/">git</a>, <a href="http://mercurial.selenic.com/">Mercurial</a>, <a href="http://www.perforce.com/">Perforce</a> and so on. I don&#8217;t know why exactly we ended up with Mercurial (someone made a decision I guess&#8230;). It <em>felt</em> like distributed versioning systems are <em>teh future</em> and unlike most game developers we don&#8217;t need to version hundreds of gigabytes of binary assets (hence no big need for Perforce).</p>
<p>So while some people were at GDC, we did a big switch to several things at once: 1) replace Subversion with Mercurial, 2) replace &#8220;everyone works on the same trunk&#8221; workflow with &#8220;teams work on their own topic branches&#8221;, 3) introduce a bit more formal code reviews via <a href="http://www.fogcreek.com/kiln/">Kiln</a>.</p>
<p>In hindsight, maybe switching three things at once wasn&#8217;t the brightest idea; there&#8217;s only so much change a person can absorb per unit of time. On the other hand, everyone experienced one large initial shock, and now that the debris is settling they just continue working, with no big shocks predicted in the near future.</p>
<p><strong>Our Setup</strong></p>
<p>We use Fogcreek&#8217;s Kiln and host it on <a href="http://www.fogcreek.com/kiln/for-your-server.html">our own servers</a>. This is mostly for legal reasons, I think (our source code contains 3rd party bits that are under strict NDAs). The advantage of hosting it ourselves is that we&#8217;re in complete control. The disadvantage is that we have to do some work, and we only get Kiln updates every couple of months (so, for example, everyone who lets Fogcreek host Kiln is on Kiln 2.4.x right now, while we&#8217;re still on 2.3.x).</p>
<p>Our source tree is about 12000 files amounting to about 600MB. Mercurial&#8217;s history (60000 revisions imported from svn) adds another 200MB. Additionally, we pull almost 1GB of binary files (see below for binary file versioning) into the source tree.</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/04/hg-branches.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/hg-branches-150x150.png" alt="" title="Team+feature branches in Mercurial" width="150" height="150" class="alignright size-thumbnail wp-image-685" /></a>Each &#8220;team&#8221; (core, editor, graphics, ios, android, &#8230;) has it&#8217;s own &#8220;branch&#8221; (actually, a separate repository clone) of the codebase, and merge back and forth between &#8220;trunk&#8221; repository. The trunk is supposed to be stable and shippable at almost any time (in theory&#8230; :)); unfinished, unreviewed code or code that has any failing tests can&#8217;t be pushed into trunk. Additionally, long-lasting features get their own &#8220;feature branches&#8221; (again, actually full clones of the repository). So right now we have more than 40 of those team+feature branches.</p>
<p>We have almost 50 developers committing to the source tree. Additionally, there is a build farm of 30 machines building most of those branches and running automated test suites. All this <em>does</em> put some pressure on the Kiln server ;) Everything below describes usage of Kiln 2.3.x with Mercurial 1.7.x; with more recent versions anything might have changed.</p>
<p><strong>Mercurial, or: I Have Two Heads!</strong></p>
<p>Probably the hardest thing to grok is the whole centralized-to-distributed versioning transition. Not everyone has github as their start page yet, and DVCS is actually more complex than the simple centralized model Subversion has.</p>
<p>Things like this:</p>
<blockquote><p>OMG it says I have two heads now, what do I do?!</p></blockquote>
<p>just do not happen in centralized systems. <em>It&#8217;s not easy for a developer to accept he has two heads now, either. Or where this extra head came from&#8230;</em></p>
<p>And the benefits of a distributed source control system are not immediately obvious to someone who&#8217;s never used one. The initial reaction is that suddenly everything got more complex for no good reason. Compare the operations that you would use daily:</p>
<ul>
<li>Subversion: update, commit.
<ul>
<li>Since merges don&#8217;t really work: branch, switch &#038; merge are rarely used by mere mortals.</li>
</ul>
</li>
<li>Mercurial: pull, update or merge, commit, push.
<ul>
<li>And you might find you have two heads now!</li>
</ul>
<ul>
<li>You should also see their faces when you go &#8220;well, let me tell you about rebase&#8230;&#8221;. You might just as well explain everything with <a href="http://tartley.com/?p=1267">easy to understand spatial analogies</a> ;)</li>
</ul>
</li>
</ul>
<p>Thankfully, there&#8217;s this thing called the intertubes, which often has <a href="http://hginit.com/">helpful tutorials</a>.</p>
<p>Myself, I think <em>maybe</em> switching to git would have been a smaller overall shock. Mercurial is easier to get into, but it kind of pretends to work like ye olde versioning system, while underneath it is very different. Git, on the other hand, does not even try to look similar; it says &#8220;I&#8217;ll fuck with your brain&#8221; immediately after initial &#8220;hi how are you&#8221;. So it&#8217;s a larger initial shock, but maybe that <em>forces</em> people to get into this different mindset faster.</p>
<p><strong>Versioning large binary files</strong></p>
<p>Even if we <em>mostly</em> version only the code, there are occasional binaries. In our case it&#8217;s mostly 3rd party SDKs that are linked into Unity, for example PhysX, Mono, FMOD, D3DX, Cg etc. We do have the source code for most of them, but we don&#8217;t need each developer to have 30000 files of Mono&#8217;s source code, for example. So we build them separately and version the prebuilt headers/libraries/DLLs in the regular source tree. Some of those prebuilt things can get quite large though (think a couple hundred megabytes).</p>
<p>Most distributed version control systems (including git and Mercurial) have trouble with this. <em>Every</em> version of <em>every</em> file is stored in your own local <del datetime="whoops, wrong terminology!">checkout</del>clone. Try having 50 versions of a whole Mono build in there and you&#8217;ll wonder where the precious SSD space on your laptop went!</p>
<p>Luckily, Kiln has a solution for this: the <a href="http://kiln.stackexchange.com/questions/1873">kbfiles</a> extension. For each file marked as a &#8220;large binary file&#8221;, only its SHA1 hash &#8220;standin&#8221; is versioned, and the file itself is fetched from a central server onto your local machine on demand. Think of it as a centralized versioning model for those special binary files. kbfiles itself is based on the <a href="http://mercurial.selenic.com/wiki/BfilesExtension">bfiles extension</a>, with tighter integration into Mercurial.</p>
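<p>The standin idea itself is simple enough to sketch in C. This is a toy illustration, not how kbfiles is implemented: it uses a 64-bit FNV-1a hash instead of SHA1 to stay short and dependency-free (FNV-1a is not collision-resistant, so a real system should not do this), and a small in-memory table stands in for the central big-file server. The repository would only version the short hex key; the content is fetched by that key when needed. All names are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a, 64-bit: used here in place of SHA1; NOT collision-resistant. */
static uint64_t fnv1a64(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* The "standin": what gets committed to the repository instead of the file. */
static void make_standin(const void *data, size_t len, char out[17]) {
    snprintf(out, 17, "%016llx", (unsigned long long)fnv1a64(data, len));
}

/* Toy central store mapping standin keys to file contents.
   It keeps pointers to the caller's data rather than copying it. */
#define STORE_CAPACITY 16
typedef struct { char key[17]; const void *data; size_t len; } StoreEntry;
typedef struct { StoreEntry entries[STORE_CAPACITY]; size_t count; } BigFileStore;

/* Upload: store content under its hash; identical content is stored once. */
static int store_put(BigFileStore *s, const void *data, size_t len, char key_out[17]) {
    assert(data != NULL || len == 0);
    make_standin(data, len, key_out);
    for (size_t i = 0; i < s->count; ++i)
        if (strcmp(s->entries[i].key, key_out) == 0)
            return 1; /* already present */
    if (s->count == STORE_CAPACITY)
        return 0;
    StoreEntry *e = &s->entries[s->count++];
    memcpy(e->key, key_out, 17);
    e->data = data;
    e->len = len;
    return 1;
}

/* Fetch on demand: look the content up by the standin recorded in the repo. */
static const StoreEntry *store_get(const BigFileStore *s, const char *key) {
    for (size_t i = 0; i < s->count; ++i)
        if (strcmp(s->entries[i].key, key) == 0)
            return &s->entries[i];
    return NULL;
}
```

<p>Because the key depends only on the content, committing the same prebuilt library twice stores it once, and a clone only needs to download the big files for the revisions it actually checks out.</p>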
<p>So the good news: with Kiln, large binary files are handled easily and painlessly. You can globally set a &#8220;large size&#8221; threshold, filename patterns etc. that are turned into &#8220;big files&#8221; automatically, or manually mark a file as a &#8220;big file&#8221; when adding it. And then continue using Mercurial as usual.</p>
<p>The bad news, however, is that kbfiles still has occasional bugs. Of course they will be fixed eventually, but right now, for example, <a href="http://blog.bitquabit.com/2008/11/25/rebasing-mercurial/">rebasing</a> with an incoming bigfiles commit results in the wrong bigfile version in the end. Also, the presence of the kbfiles extension makes various Mercurial operations (like <tt>hg status</tt>) <em>much</em> <a href="http://kiln.stackexchange.com/questions/3319">slower than usual</a>.</p>
<p><strong>Kiln as Web Interface</strong></p>
<p>Kiln itself is the server hosting Mercurial repositories, a web interface to view/admin them, and a code review tool. It&#8217;s fairly nice and does all the standard stuff, like show overview of all activity happening in a group of repositories:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-overview.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-overview-500x288.png" alt="" title="Overview of all activity in Kiln" width="500" height="288" class="alignnone size-medium wp-image-688" /></a></p>
<p>And shows the overview of any particular repository:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-repo.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-repo-500x279.png" alt="" title="One repository in Kiln" width="500" height="279" class="alignnone size-medium wp-image-689" /></a></p>
<p>And of course diff view of any particular commit:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-diff.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-diff-500x173.png" alt="" title="Diff view in Kiln" width="500" height="173" class="alignnone size-medium wp-image-686" /></a></p>
<p>My largest complaints about Kiln&#8217;s web interface are: 1) speed and 2) merge spiderwebs.</p>
<p><b><em>Speed</em></b>: like oh so many modern fancy-web systems, Kiln sometimes feels sluggish. Sometimes, in the time it takes Kiln to display a diff, Crysis 2 <em>would have rendered New York fifty times</em>. We did various things to boost our server&#8217;s <em>oomph</em>, but it still does not feel fast enough. Maybe we don&#8217;t know how to set up our servers right; or maybe Kiln is actually quite slow; or maybe our repository size + branch count + number of people hitting it exceed whatever limits Kiln was designed for. That said, this is not unique to Kiln; <em>lots</em> of web systems are slow, sometimes for no good reason. If you are a web developer, however, keep this in mind: the latency of any user operation is super important.</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-merge-spiderweb.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-merge-spiderweb-150x150.png" alt="" title="It&#039;s a merge forest!" width="150" height="150" class="alignright size-thumbnail wp-image-687" /></a><b><em>Merge spiderwebs</em></b>: distributed version control makes merges reliable and easy. However, merges happen all the time and can make it hard to see what was <em>actually</em> going on in the code. You can&#8217;t see the actual changes through the merge spiderwebs.</p>
<p>The change history is littered with &#8220;merge&#8221;, &#8220;merge remote repo&#8221;, &#8220;merge again&#8221; commits. The branch graph goes crazy and starts taking half of the page width. Not good! Now of course, this is where <a href="http://blog.bitquabit.com/2008/11/25/rebasing-mercurial/">rebasing</a> would help, however right now we&#8217;re not very keen on using it because of Kiln&#8217;s bigfiles bug mentioned above.</p>
<p><strong>Kiln as Code Review Tool</strong></p>
<p>Reviewing code is fairly easy: there&#8217;s a Review button that shows up when hovering over any commit. Each commit also shows how many reviews it has pending or accepted. So you just click on something, and voilà, you can request a code review:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-reviewrequest.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-reviewrequest-500x230.png" alt="" title="Requesting a review in Kiln" width="500" height="230" class="alignnone size-medium wp-image-691" /></a></p>
<p>Within each review you see the diffs, send comments back and forth between people, and highlight code snippets to be attached with each comment:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-review.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/kiln-review-500x332.png" alt="" title="Code review in Kiln" width="500" height="332" class="alignnone size-medium wp-image-690" /></a></p>
<p>In Kiln 2.3.x (which is what we use at the moment) the reviews still have a sort of &#8220;unfinished&#8221; feeling. For example, if you want multiple people to review a change, Kiln actually creates multiple reviews that are only very loosely coupled. The good news is that in Kiln 2.4 they have <a href="http://blog.fogcreek.com/rethinking-reviews/">improved this</a>, and I&#8217;m quite sure more improvements will come in the future.</p>
<p>Another option that I&#8217;m missing right now: in the repository views, filter out all approved commits. As an occasional &#8220;merge master&#8221;, I need to see if my big merge had any unreviewed or pending-review commits &#8212; something that&#8217;s quite hard to see with a merge-heavy history.</p>
<p><strong>Summary</strong></p>
<p>I&#8217;m quite happy with how the switch to Mercurial + Kiln has turned out so far. With each team working in its own repository, it does feel like we&#8217;re stepping on each other&#8217;s toes much less. That said, we haven&#8217;t shipped any Unity release from Mercurial yet; doing that will be a future exercise.</p>
<p><a href="http://www.fogcreek.com/kiln/">Kiln</a> is promising. It has some very good ideas (integrated code reviews &#038; versioning of big files in Mercurial), but it still has quite a lot of rough edges. I&#8217;m not totally happy with the web side performance of it either. That said, Fogcreek&#8217;s support for us has been fantastic; we got some bugfixes in the matter of days and they&#8217;ve been really helpful with setup/workflow/optimization issues. So it seems like it has a good future. Fogcreek guys, if you&#8217;re reading this: <a href="http://farm1.static.flickr.com/225/524768428_e20c722cc0.jpg">keep up wrk</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/04/18/mercurialkiln-experience-so-far/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Stories of Universities</title>
		<link>http://aras-p.info/blog/2011/04/01/stories-of-universities/</link>
		<comments>http://aras-p.info/blog/2011/04/01/stories-of-universities/#comments</comments>
		<pubDate>Fri, 01 Apr 2011 18:55:26 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[random]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=658</guid>
		<description><![CDATA[I was doing a talk and a Q&#038;A session at a local university. Unaware of the consequences, one guy asked about the usefulness of the programming courses they have in real work&#8230; Oh boy. Do you really want to go there? Now before I go ranting full steam, let me tell that there were really [...]]]></description>
			<content:encoded><![CDATA[<p>I was doing a talk and a Q&#038;A session at a local university. Unaware of the consequences, one guy asked about the usefulness of the programming courses they have in real work&#8230;</p>
<p>Oh boy. Do you really want to go there?<br />
<span id="more-658"></span></p>
<blockquote><p>Now before I go ranting full steam, let me say that there were really good courses and really bright teachers at my (otherwise unspectacular) university. Most of the math, physics and related fundamental science courses were good &#038; taught by people who knew their stuff. Even some of the computer science / programming courses were good!
</p></blockquote>
<p>With that aside, let&#8217;s get back to ranting.</p>
<p><strong>What is OOP?</strong></p>
<p>Somehow the conversation drifted to the topics of code design, architecture and whatnot. I asked the audience, for example, what they think the benefits of object oriented programming (OOP) are. The answers were the following:</p>
<ul>
<li>Mumble mumble&#8230; weeelll&#8230; something something mumble. This was the majority&#8217;s opinion.</li>
<li>OOP makes it very easy for a new guy to start at work, because everything is nicely separated and he can just work on this one file without knowing anything else.</li>
<li>Without OOP there&#8217;s no way to separate things out; everything becomes a mess.</li>
<li>OOP uses classes, and they are nicer than not using classes. Because a class lets you&#8230; uhm&#8230; well I don&#8217;t know, but classes are nicer than no classes. I think it had something to do with something being in separate files. Or maybe in one file. I don&#8217;t actually know&#8230;</li>
<li><em>I forget if there was anything else really.</em></li>
</ul>
<p>Let me tell you how easy it is for a guy to start at work. You come to a new place all inspired and excited. You&#8217;re put into some unholy codebase that grew in a chaotic way over the last N years and assigned to implement some random feature or fix some bugs. When you encounter anything smelly in the codebase (this happens fairly often), the answer to &#8220;WTF is this?&#8221; is most often &#8220;it came from the past, yeah, we don&#8217;t like it either&#8221; or &#8220;I dunno, this guy who left last year wrote it&#8221; or &#8220;yeah, I wrote it but it was ages ago, I don&#8217;t remember anything about it&#8230; wow! this is idiotic code indeed! just be careful, touching it might break everything&#8221;. All this is totally independent of whether the codebase uses OOP or not.</p>
<p>I am exaggerating of course; the codebase doesn&#8217;t have to be that bad. But still; whether it&#8217;s good or not, or whether it&#8217;s easy for a new guy to start there is really not related to it being OOP.</p>
<p>Interesting!</p>
<p>Clearly they have no frigging clue what OOP is, beyond whatever they&#8217;ve been told by the teacher. And the teacher in turn knows about OOP from what he read in one or two books. And the authors of those books&#8230; well, we don&#8217;t know; depends on the book I guess. But this is at least a second-order disconnect from reality, if not more!</p>
<p>Why is that?</p>
<p>I guess part of the problem is teachers having no real work experience; all they know comes from reading books. This can work for math. For a lot of programming courses&#8230; not so much. Another part is students learning in a vacuum, trying to <em>kind of</em> get what the lectures are about and pass the tests.</p>
<p>In both cases it&#8217;s totally separated from doing some real actual work and trying to apply what you&#8217;re trying to learn. Which leads to some funny things like&#8230;</p>
<p><strong>How are floating point numbers stored?</strong></p>
<p>I saw this about 11 years ago in one lecture of a C++ course. The teacher was quickly explaining how various types are stored in memory. He got through the integer types without trouble and started explaining floats.</p>
<blockquote><p>So there&#8217;s one bit for the sign. Then come the digits before the decimal point. Since there are 10 possible choices for each digit, you need four bits of memory for each digit. Then comes one bit for the decimal point. After the decimal point, again you have four bits per digit. Done!
</p></blockquote>
<p>ORLY? This was awesome, especially trying to imagine how to store the decimal point.</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/04/pifloat.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/04/pifloat.png" alt="" title="π in floating point representation, beware!" width="342" height="51" class="alignnone size-full wp-image-661" /></a></p>
<p>See that decimal point bit, haha! <em>You see, it&#8217;s one bit and you can&#8217;t&#8230; what do you mean you don&#8217;t get it? And not only that; this needs variable length and&#8230; really? You&#8217;re going to a party instead?</em> I wasn&#8217;t very popular.</p>
<p>Funny or not, this is not exactly telling a correct story on how floats are stored in memory on 101% of the architectures you&#8217;d ever care about.</p>
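<p>For contrast, here is a small C++ sketch of what actually sits in the 32 bits of a <code>float</code> on practically every platform you&#8217;d care about: the IEEE 754 single-precision layout, with 1 sign bit, 8 exponent bits (stored with a bias of 127) and 23 fraction bits, where normal numbers have an implicit leading 1. There is no per-digit storage and no decimal point bit. The helper names <code>DecodeFloat</code> and <code>ValueFromFields</code> are made up for illustration, and the sketch assumes a 32-bit IEEE 754 <code>float</code>.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Raw fields of an IEEE 754 single-precision float.
struct FloatFields {
    uint32_t sign;     // 1 bit: 0 = positive, 1 = negative
    uint32_t exponent; // 8 bits, biased by 127
    uint32_t fraction; // 23 bits: the mantissa without its implicit leading 1
};

FloatFields DecodeFloat(float f) {
    static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // well-defined way to read the bit pattern
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Rebuilds a normal number: (-1)^sign * 1.fraction * 2^(exponent - 127).
// Zero, denormals, infinities and NaNs (exponent field 0 or 255) are not handled.
double ValueFromFields(const FloatFields& ff) {
    assert(ff.exponent != 0 && ff.exponent != 255);
    const double mantissa = 1.0 + std::ldexp(static_cast<double>(ff.fraction), -23);
    const double magnitude = std::ldexp(mantissa, static_cast<int>(ff.exponent) - 127);
    return ff.sign ? -magnitude : magnitude;
}
```

<p>For example, <code>3.14159265f</code> decodes to sign 0, exponent field 128 (that is, 2<sup>1</sup>) and fraction <code>0x490FDB</code>; feeding those fields back through <code>ValueFromFields</code> gives the same value again.</p>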
<p>I could give a ton of other examples of little disconnects from reality, which I think are caused by never having to put your knowledge into practice.</p>
<p><strong>Where do we go from here?</strong></p>
<p>Now of course, the university I went to is not something that would be considered &#8220;good&#8221; by world standards. I went to several lectures by <a href="http://graphics.ucsd.edu/~henrik/">Henrik Wann Jensen</a> at DTU and that was like night and day! But how many of these not-too-good, only-passable universities are there around the world? I&#8217;d imagine certainly more than one, and certainly more than the MITs, Stanfords et al combined.</p>
<p>As a student, I <em>somehow</em> figured I should take a lot of things with a grain of <del>salt</del> doubt. And in a lot of cases, trying to do something for real trumps lab work / tests / exams in how much you&#8217;ll be able to learn. Go make a techdemo or a small game, play around with some techniques, try to implement that clever-sounding paper from SIGGRAPH and watch it burst into flames, team up with friends while doing any of the above. <a href="http://www.youtube.com/watch?v=u6ALySsPXt0">Do it</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/04/01/stories-of-universities/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Mobile graphics API wishlist: some features</title>
		<link>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/</link>
		<comments>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 13:50:15 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=653</guid>
		<description><![CDATA[In my previous post I talked about things I&#8217;d want from OpenGL ES 2.0 in the performance area. Now it&#8217;s time to look at what extra features it might expose with an extension here or there. Note that I’m focusing on, in my limited understanding, low-hanging fruits. The features I want already exist in the [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/">previous post</a> I talked about things I&#8217;d want from OpenGL ES 2.0 in the performance area. Now it&#8217;s time to look at what extra features it might expose with an extension here or there.</p>
<p><span id="more-653"></span><em>Note that I’m focusing on, in my limited understanding, low-hanging fruits. The features I want already exist in the current GPUs or platforms; or could be easily made available. Of course more radical new architectures would bring more &#038; fancier features, but that&#8217;s a topic for another story.</em></p>
<p><strong>Programmable blending</strong></p>
<p>At least two out of the three big current mobile GPU families (PVR SGX, Adreno, Tegra 2) support programmable blending in hardware. Maybe all of them do and I just don&#8217;t have enough data. By &#8220;support it in hardware&#8221; I mean either: 1) the GPU has no blending hardware, and the driver adds &#8220;read current pixel &#038; blend&#8221; instructions to the shaders, or 2) the GPU has blending hardware for commonly used modes, but fancier modes are handled by shader patching with no severe performance penalty.</p>
<p>Programmable blending is useful for various things; from deferred-style decals (blending normals is hard in fixed function!) to fancier Photoshop-like blend modes to potentially faster single-pixel image postprocessing effects (like color correction).</p>
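<p>As an illustration of why some of this needs programmable blending, here is a CPU-side C++ sketch of a Photoshop-style &#8220;overlay&#8221; mode. Fixed-function blending computes <code>src * srcFactor + dst * dstFactor</code> from a fixed menu of factors; the <code>2 * src * dst</code> half of overlay is expressible that way (for instance with <code>glBlendFunc(GL_DST_COLOR, GL_SRC_COLOR)</code>), but switching between the two formulas based on the current framebuffer value is not. A shader with framebuffer read access could run exactly this per pixel. The function names, and the choice to keep destination alpha, are assumptions for this sketch.</p>

```cpp
#include <cassert>

struct Color { float r, g, b, a; };

// Overlay for one channel, inputs in [0, 1]. Which formula is used depends
// on the destination (framebuffer) value, which is the part that fixed
// function blending cannot express.
float OverlayChannel(float src, float dst) {
    assert(src >= 0.0f && src <= 1.0f && dst >= 0.0f && dst <= 1.0f);
    if (dst < 0.5f)
        return 2.0f * src * dst;                      // "multiply" half
    return 1.0f - 2.0f * (1.0f - src) * (1.0f - dst); // "screen" half
}

// What a programmable blending shader would compute: read the current
// framebuffer color (dst), combine it with the shader output (src).
Color BlendOverlay(const Color& src, const Color& dst) {
    return { OverlayChannel(src.r, dst.r),
             OverlayChannel(src.g, dst.g),
             OverlayChannel(src.b, dst.b),
             dst.a }; // keep destination alpha (an arbitrary choice here)
}
```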
<p>Currently only NVIDIA exposes this capability, via the <a href="http://developer.download.nvidia.com/tegra/docs/tegra_gles2_development.pdf">NV_shader_framebuffer_fetch</a> extension.</p>
<p><em>Suggestion</em>: expose it on other hardware that can do this! It&#8217;s fine to not handle hard edge cases (for example, what happens when multisampling is used?), we can live with the limitations.</p>
<p><strong>Direct, fast access to frame buffer on the CPU</strong></p>
<p>Most (all?) mobile platforms use a unified memory approach, where there&#8217;s no physical distinction between &#8220;system memory&#8221; and &#8220;video memory&#8221;. Some of those platforms are slightly unbalanced, e.g. a strong GPU coupled with a weak CPU or vice versa. More and more of those systems will have multicore CPUs. It might make sense to take an approach similar to what PS3 developers are doing these days &#8211; offload some of the GPU work to the CPU(s).</p>
<p>Image processing, deferred lighting and similar things could be done more efficiently on a general purpose CPU, where you aren&#8217;t limited to &#8220;one pixel at a time&#8221; model of current mobile GPUs.</p>
<p><em>Suggestion</em>: can haz get a pointer to framebuffer memory perhaps? Of course this is grossly oversimplifying all the synchronization &#038; security issues, but <em>something</em> should be possible to do in order to exploit the unified memory model. Right now it just sits there largely unused, with GLES2.0 still pretending CPU is talking to a GPU over a ten meter high concrete wall.</p>
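<p>To make the idea a bit more concrete, here is a C++ sketch of the kind of work that could move to the CPU cores if such a pointer existed: a per-channel lookup-table color correction over an RGBA8 framebuffer, split into bands of rows across threads. The raw <code>pixels</code> pointer is exactly the thing GLES2.0 does not give you, so treat it (and the function name) as hypothetical; synchronization with the GPU is ignored entirely.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical: assumes a CPU-visible pointer to a tightly packed RGBA8
// framebuffer (width * 4 bytes per row). Remaps R, G and B through a
// 256-entry lookup table (a simple color correction), leaving alpha alone.
void ColorCorrectRGBA8(uint8_t* pixels, int width, int height,
                       const uint8_t* lut, int threadCount) {
    assert(pixels && lut && width > 0 && height > 0 && threadCount > 0);
    auto processRows = [=](int rowBegin, int rowEnd) {
        for (int y = rowBegin; y < rowEnd; ++y) {
            uint8_t* p = pixels + static_cast<std::size_t>(y) * width * 4;
            for (int x = 0; x < width; ++x, p += 4) {
                p[0] = lut[p[0]];
                p[1] = lut[p[1]];
                p[2] = lut[p[2]];
            }
        }
    };
    // Each thread gets a contiguous band of rows; bands don't overlap.
    const int rowsPerThread = (height + threadCount - 1) / threadCount;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        const int begin = t * rowsPerThread;
        const int end = std::min(height, begin + rowsPerThread);
        if (begin >= end) break;
        threads.emplace_back(processRows, begin, end);
    }
    for (std::thread& th : threads) th.join();
}
```

<p>Since each thread writes a disjoint range of rows, the threads need no locking between them; the hard part, as said above, is getting at the memory in the first place.</p>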
<p><strong>Expose Tile Based GPU capabilities</strong></p>
<p>PowerVR GPUs found in all iOS and some Android devices are so-called &#8220;tile based&#8221; architectures. So, to some extent, is the Qualcomm Adreno family.</p>
<p>Currently this capability is mostly hidden inside a black box. On PowerVR GPUs the programmer does know that &#8220;overdraw of opaque objects does not matter&#8221;, or that &#8220;alpha testing is really slow&#8221;, but that&#8217;s about it. There&#8217;s no control over the rendering process, even though some things could benefit from more control over tiling.</p>
<p>Take, for example, deferred lighting/shading. The cool folks are doing it tile-based already on <a href="http://www.slideshare.net/DICEStudio/directx-11-rendering-in-battlefield-3?from=ss_embed">DirectX 11</a> or <a href="http://www.slideshare.net/DICEStudio/spubased-deferred-shading-in-battlefield-3-for-playstation-3?from=ss_embed">PS3</a>.</p>
<p>On a tile-based GPU, all rendering is <em>already</em> happening in tiles, so what if we could say &#8220;now, you work on this tile, render this, render that; now we go to that tile&#8221;? Maybe that way we could achieve two things at once: 1) better light culling, because it happens at tile level, and 2) most of the data could stay in the super-fast on-chip memory, without having to be written into system memory &#038; later read back. Memory bandwidth is very often a limiting factor in mobile graphics performance, and the ability to keep deferred lighting buffers on-chip through the whole process could cut bandwidth requirements a lot.</p>
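<p>The light culling half of that idea, as done in the tiled deferred approaches linked above, boils down to: split the screen into tiles and build, for each tile, the list of lights that touch it. Here is a minimal C++ sketch, assuming each light&#8217;s volume has already been projected to a conservative screen-space rectangle; real implementations often also use per-tile depth bounds, which this leaves out. All names and types are made up for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Screen-space rectangle in pixels, covering [x0, x1) x [y0, y1).
struct ScreenRect { int x0, y0, x1, y1; };

// Returns, for each tile (row-major order), the indices of the lights whose
// rectangles overlap it.
std::vector<std::vector<int>> CullLightsPerTile(const std::vector<ScreenRect>& lights,
                                                int screenW, int screenH, int tileSize) {
    assert(screenW > 0 && screenH > 0 && tileSize > 0);
    const int tilesX = (screenW + tileSize - 1) / tileSize;
    const int tilesY = (screenH + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> tiles(static_cast<std::size_t>(tilesX) * tilesY);
    for (int i = 0; i < static_cast<int>(lights.size()); ++i) {
        const ScreenRect& r = lights[i];
        // Skip empty rectangles and lights entirely off screen.
        if (r.x0 >= r.x1 || r.y0 >= r.y1 ||
            r.x1 <= 0 || r.y1 <= 0 || r.x0 >= screenW || r.y0 >= screenH)
            continue;
        // Range of tiles covered, clamped to the screen.
        const int tx0 = std::max(r.x0, 0) / tileSize;
        const int ty0 = std::max(r.y0, 0) / tileSize;
        const int tx1 = (std::min(r.x1, screenW) - 1) / tileSize;
        const int ty1 = (std::min(r.y1, screenH) - 1) / tileSize;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                tiles[ty * tilesX + tx].push_back(i);
    }
    return tiles;
}
```

<p>Each tile then only shades with its own short list of lights. On a tile-based GPU the appeal is that this per-tile loop could line up with the tiles the hardware is already working on.</p>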
<p><em>Suggestion</em>: somehow <em>(I&#8217;m feeling very hand-wavy today)</em> expose more control over tiled rendering. For example, explicitly say that rendering will only happen to the given tiles, and that these textures are very likely to be read right after they are rendered into &#8211; so don&#8217;t resolve them to memory if they fit into on-chip memory.</p>
<p>There&#8217;s already a Qualcomm extension that goes somewhat in that direction &#8211; <a href="http://www.khronos.org/registry/gles/extensions/QCOM/QCOM_tiled_rendering.txt">QCOM_tiled_rendering</a> &#8211; though it seems to be more concerned with where rendering happens. More control is needed over marking FBO textures as &#8220;keep in on-chip memory for sampling as a texture plz&#8221;.</p>
<p><strong>OpenCL</strong></p>
<p>Current mobile GPUs already are, or very soon will be, OpenCL capable. Also OpenCL can be implemented on the CPU, nicely SIMDified via NEON, and use multicore. <em>DO WANT!</em> (and while you&#8217;re at it, everything that&#8217;s doable to make interop between CL &#038; GL faster)</p>
<p>This can be used for a ton of things; skinning, culling, particles, procedural animations, image postprocessing and so on. And with a much less restrictive programming model, it&#8217;s easier to reuse computation results across draw calls or frames.</p>
<p>Couple this with &#8220;direct access to memory on the CPU&#8221; and OpenCL could be used for more things than graphics (again I&#8217;m grossly oversimplifying here and ignoring the whole synchronization/latency/security elephant&#8230;).</p>
<p><strong>MOAR?</strong></p>
<p>Now of course there are more things I&#8217;d want to see, but for today I&#8217;ll take just those above, thank you. Have a nice day!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Mobile graphics API wishlist: performance</title>
		<link>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/</link>
		<comments>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/#comments</comments>
		<pubDate>Fri, 04 Mar 2011 06:24:49 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=645</guid>
		<description><![CDATA[Most mobile platforms currently are based on OpenGL ES 2.0. While it is much better than traditional OpenGL, there are ways where it limits performance or does not expose some interesting hardware features. So here&#8217;s an unorganized wishlist for GLES2.0 performance part! Note that I&#8217;m focusing on, in my limited understanding, short term low-hanging fruits [...]]]></description>
			<content:encoded><![CDATA[<p>Most mobile platforms currently are based on OpenGL ES 2.0. While it is <em>much</em> better than traditional OpenGL, there are ways where it limits performance or does not expose some interesting hardware features. So here&#8217;s an unorganized wishlist for GLES2.0 performance part!</p>
<p><span id="more-645"></span><em>Note that I&#8217;m focusing on, in my limited understanding, short term low-hanging fruits how to extend/patch existing GLES2.0 API. A pipe dream would be starting from scratch, getting rid of all OpenGL baggage and hopefully come up with a much cleaner, leaner &#038; better API, especially if it&#8217;s designed to only support some particular platform. But I digress, back to GLES2.0 for now.</em></p>
<p><strong>No guarantees when something expensive might happen.</strong></p>
<p>Due to some flexibility in GLES2.0, there might be expensive things happening at almost any point in your frame. For example, binding a texture with a different format might cause a driver to recompile a shader at draw call time. I&#8217;ve seen <a href="http://twitter.com/#!/aras_p/status/34628257294852096">60 milliseconds</a> on an iPhone 3GS on the first draw call with a relatively simple shader, all spent inside the shader compiler backend. <em>60 milliseconds!</em> There are various things that can cause performance hiccups like this: texture formats, blending modes, vertex layout, non-power-of-two textures and so on.</p>
<p><em>Suggestion</em>: work with GPU vendors and agree on an API that could make guarantees on when the expensive resource creation / patching work can happen, and when it can&#8217;t. For example, <em>somehow</em> guarantee that a draw call or a state set will not cause any object recreation / shader patching in the driver. I don&#8217;t have much experience with D3D10/11, but my impression is that this was one of the things it got right, no?</p>
<p><strong>Offline shader compilation.</strong></p>
<p>GLES2.0 has the functionality to load binary shaders, but it&#8217;s not mandatory. Some of the big platforms (iOS, I&#8217;m looking at you) just don&#8217;t support it.</p>
<p>Now of course, a single platform (like iOS or Android) can have multiple different GPUs, so you can&#8217;t fully compile a shader offline into final optimized GPU microcode. But <em>some</em> of the full compilation cost could very well be done offline, without being specific to any particular GPU.</p>
<p><em>Suggestion</em>: come up with a platform independent binary shader format. Something like D3D9 shader assembly is probably too low level (it assumes a vector4-based GPU, limited number of registers and so on), but something higher level should be possible. All of the shader lexing, parsing and common optimizations (constant folding, arithmetic simplifications, dead code removal etc.) can be done offline. It won&#8217;t speed up shader loading by an order of magnitude, but even if it&#8217;s possible to cut it by 20%, it&#8217;s worth it. And it would remove a very big bug surface area too!</p>
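<p>To give a feel for the kind of work that could be done offline, here is a toy C++ sketch of constant folding over an expression tree, standing in for a parsed shader expression. It folds operations whose operands are both constants and removes multiplications by 1. A real shader compiler does this on a much richer intermediate representation; the <code>Expr</code> type and helper names are hypothetical.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression tree standing in for a parsed shader expression.
struct Expr {
    enum Kind { Const, Var, Add, Mul };
    Kind kind = Const;
    float value = 0.0f;         // used when kind == Const
    std::string name;           // used when kind == Var
    std::unique_ptr<Expr> a, b; // operands when kind == Add or Mul
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr MakeConst(float v) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Const;
    e->value = v;
    return e;
}
ExprPtr MakeVar(const std::string& n) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Var;
    e->name = n;
    return e;
}
ExprPtr MakeBin(Expr::Kind k, ExprPtr a, ExprPtr b) {
    assert(k == Expr::Add || k == Expr::Mul);
    auto e = std::make_unique<Expr>();
    e->kind = k;
    e->a = std::move(a);
    e->b = std::move(b);
    return e;
}

// Folds bottom-up: constant (op) constant becomes one constant, and
// x * 1 / 1 * x become x. x * 0 is deliberately left alone, since it is
// not 0 when x is NaN or infinity.
ExprPtr Fold(ExprPtr e) {
    if (e->kind == Expr::Const || e->kind == Expr::Var) return e;
    e->a = Fold(std::move(e->a));
    e->b = Fold(std::move(e->b));
    const bool constA = e->a->kind == Expr::Const;
    const bool constB = e->b->kind == Expr::Const;
    if (constA && constB) {
        const float x = e->a->value, y = e->b->value;
        return MakeConst(e->kind == Expr::Add ? x + y : x * y);
    }
    if (e->kind == Expr::Mul && constB && e->b->value == 1.0f) return std::move(e->a);
    if (e->kind == Expr::Mul && constA && e->a->value == 1.0f) return std::move(e->b);
    return e;
}
```

<p>For example, folding <code>(2 * 3) + (v * 1)</code> leaves just <code>6 + v</code>. None of this depends on the target GPU, which is why it could be done once, offline.</p>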
<p><strong>Texture loading.</strong></p>
<p>A lot (all?) of mobile platforms have unified CPU &#038; GPU memory; however, to actually load a texture we have to read or memory map it from disk and then copy it into OpenGL via glTexImage2D and similar functions. Then, depending on the format, the driver may internally swizzle and align the texture data.</p>
<p><em>Suggestion</em>: can&#8217;t most of this cost be removed? If for some formats it&#8217;s perfectly, statically known what layout and swizzling the GPU expects&#8230; can&#8217;t we just point the API to the data we already loaded or memory mapped? We would still need the glTexImage2D path for when (if ever) a totally new, strange GPU comes along that needs the data in a different order, but why not provide a faster path for current GPUs?</p>
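<p>To illustrate what &#8220;swizzling&#8221; means here, a common textbook layout is Morton (Z-order), where the bits of the x and y coordinates are interleaved so that nearby texels end up close together in memory. Actual GPU layouts are hardware specific and usually undocumented, so this C++ sketch only shows the kind of reshuffling a driver may do inside <code>glTexImage2D</code>, which is the work that would disappear if the data on disk were already in the final layout. Function names are hypothetical, and the copy assumes a square, power-of-two texture of 32-bit texels.</p>

```cpp
#include <cassert>
#include <cstdint>

// Spreads the low 16 bits of v so that bit i moves to bit 2i.
uint32_t SpreadBits(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton (Z-order) index: x bits land in even positions, y bits in odd ones.
uint32_t MortonIndex(uint32_t x, uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return SpreadBits(x) | (SpreadBits(y) << 1);
}

// Copies a row-major square texture of 32-bit texels into Morton order.
// The power-of-two size keeps the Morton indices within [0, size * size).
void SwizzleToMorton(const uint32_t* linear, uint32_t* swizzled, uint32_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);
    for (uint32_t y = 0; y < size; ++y)
        for (uint32_t x = 0; x < size; ++x)
            swizzled[MortonIndex(x, y)] = linear[y * size + x];
}
```

<p>In this layout every aligned 2x2 block of texels is contiguous in memory, as is every aligned 4x4 block, and so on.</p>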
<p><strong>Vertex declarations.</strong></p>
<p>In unextended GLES2.0 you have to make <em>a ton</em> of calls just to set up vertex data. <a href="http://www.khronos.org/registry/gles/extensions/OES/OES_vertex_array_object.txt">OES_vertex_array_object</a> is a step in the right direction, providing the ability to create sets of vertex data bindings (&#8220;vertex declarations&#8221; in D3D speak). However, it builds upon the existing API, resulting in something that feels quite messy. Starting from scratch could result in something much cleaner. Like&#8230; vertex declarations that have existed in D3D since forever, maybe?</p>
<p><em>Suggestion</em>: clean up that shit! It would probably need to be tied to a vertex shader input signature (just like in D3D10/11) to guarantee there would be no shader patching, but we&#8217;d be fine with that.</p>
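<p>For illustration, here is a C++ sketch of what a cleaned-up, D3D-style vertex declaration could look like on the application side: an immutable list of elements whose offsets and stride are computed once, at creation time, so that binding it later could be a single call. All types and names here are hypothetical; this is neither D3D&#8217;s actual vertex declaration API nor an existing GL extension.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class VertexFormat { Float2, Float3, Float4, UByte4Norm };
enum class VertexSemantic { Position, Normal, TexCoord0, Color };

struct VertexElement {
    VertexSemantic semantic;
    VertexFormat format;
};

std::size_t FormatSize(VertexFormat f) {
    switch (f) {
        case VertexFormat::Float2: return 8;
        case VertexFormat::Float3: return 12;
        case VertexFormat::Float4: return 16;
        case VertexFormat::UByte4Norm: return 4;
    }
    return 0;
}

// Immutable description of one interleaved vertex stream.
struct VertexDeclaration {
    std::vector<VertexElement> elements;
    std::vector<std::size_t> offsets; // byte offset of each element
    std::size_t stride = 0;           // total bytes per vertex
};

// All layout work happens here, once; binding the declaration later would
// not need any per-attribute setup.
VertexDeclaration CreateVertexDeclaration(const std::vector<VertexElement>& elems) {
    assert(!elems.empty());
    VertexDeclaration decl;
    decl.elements = elems;
    for (const VertexElement& e : elems) {
        decl.offsets.push_back(decl.stride);
        decl.stride += FormatSize(e.format);
    }
    return decl;
}
```

<p>For a position (Float3), normal (Float3), UV (Float2) and color (UByte4Norm) layout this gives offsets 0, 12, 24 and 32 and a 36-byte stride. In unextended GLES2.0, each of those elements instead means its own <code>glVertexAttribPointer</code> and <code>glEnableVertexAttribArray</code> call whenever the vertex data is set up.</p>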
<p><strong>Shader uniforms are per shader program.</strong></p>
<p>What it says &#8211; shader uniforms (&#8220;constants&#8221; in D3D speak) are not global; they are tied to a specific shader program. I don&#8217;t quite understand why, and I don&#8217;t think any GPU works that way. This is causing complexities and/or performance loss in the driver (it either has to save &#038; restore all uniform values on each shader change, or have dirty tracking on which uniforms have changed etc.). It also causes unneeded uniform sets on the client side &#8211; instead of having, for example, view*projection matrix set just once per frame it has to be set for each shader program that we use.</p>
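<p>One way to see the client-side cost: with per-program uniforms, a renderer that wants a &#8220;global&#8221; view*projection matrix ends up with bookkeeping like the following C++ sketch, remembering for each program which version of the value it last received and re-uploading when a stale program gets bound. This is a hypothetical scheme (names and structure are made up), with a counter standing in for the actual <code>glUniformMatrix4fv</code> call.</p>

```cpp
#include <cassert>
#include <cstdint>

struct Matrix4 { float m[16]; };

// A value the application thinks of as global, plus a version counter that
// is bumped on every change. Version 0 means "never set".
struct GlobalUniform {
    Matrix4 value{};
    uint32_t version = 0;
    void Set(const Matrix4& v) { value = v; ++version; }
};

// Per-program record of which version of the global value it has.
struct Program {
    uint32_t uploadedVersion = 0;
    int uploadCount = 0; // stands in for real glUniformMatrix4fv calls
};

// Called whenever a program is bound.
void BindProgram(Program& p, const GlobalUniform& viewProj) {
    assert(p.uploadedVersion <= viewProj.version); // can't be ahead of the source
    if (p.uploadedVersion != viewProj.version) {
        // A real renderer would upload viewProj.value to this program here.
        ++p.uploadCount;
        p.uploadedVersion = viewProj.version;
    }
}
```

<p>The upshot: one change of the matrix per frame still costs one upload per program that gets used, where truly global uniforms would need a single set.</p>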
<p><em>Suggestion</em>: just get rid of that? If you need to not break the existing spec, how about adding an extension to make all uniforms global? I propose <code>glCanHaz(GL_OES_GLOBAL_UNIFORMS_PLZ)</code></p>
<p><strong>Next up:</strong></p>
<p>Next time, I&#8217;ll take a look at my unorganized wishlist for mobile graphics features!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Non-Uniform Work Distribution</title>
		<link>http://aras-p.info/blog/2011/02/16/a-non-uniform-work-distribution/</link>
		<comments>http://aras-p.info/blog/2011/02/16/a-non-uniform-work-distribution/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 15:47:57 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[rant]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=630</guid>
		<description><![CDATA[Warning: a post with stupid questions and no answers whatsoever! You need to do ten thousand things for the gold master / release / ShipIt(tm) moment. And you have 40 people who do the actual work&#8230; this means each of them only has to do 10000/40=250 things, which is not that bad. Right? Meanwhile in [...]]]></description>
			<content:encoded><![CDATA[<p><em>Warning: a post with stupid questions and no answers whatsoever!</em></p>
<p>You need to do ten thousand things for the gold master / release / ShipIt(tm) moment. And you have 40 people who do the actual work&#8230; this means each of them <em>only</em> has to do 10000/40=250 things, which is not that bad. Right?</p>
<p><span id="more-630"></span>Meanwhile in the real world&#8230; it does not actually work like that. And that&#8217;s something that has been on my mind for a long time. I don&#8217;t know how much of this is truth vs. perception, or what to do about it. But here&#8217;s my feeling, simplified:</p>
<p><strong>20 percent of the people are responsible for getting 80 percent of the work done</strong></p>
<p>I am somewhat exaggerating just to keep it consistent with the <a href="http://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>. But my feeling is that the &#8220;work done&#8221; distribution is highly non-uniform everywhere I&#8217;ve worked where the team was more than a handful of people.</p>
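<p>If you want to put a number on that feeling from whatever counts you have (bugs fixed, commits and so on), a simple measure is the share of the total done by the top N people. Here is a small C++ sketch; for a perfectly uniform team of 40, the top 8 people (20%) would account for 20% of the work, while a Pareto split would put it at 80%. The function name is made up, and the result is of course only as meaningful as the counts you feed into it.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Fraction of the total "work" (bugs fixed, commits, ...) done by the
// topCount people with the highest counts.
double ShareOfTop(std::vector<int> counts, std::size_t topCount) {
    assert(topCount <= counts.size());
    std::sort(counts.begin(), counts.end(), std::greater<int>());
    const long long total = std::accumulate(counts.begin(), counts.end(), 0LL);
    if (total == 0) return 0.0;
    const long long top = std::accumulate(
        counts.begin(), counts.begin() + static_cast<std::ptrdiff_t>(topCount), 0LL);
    return static_cast<double>(top) / static_cast<double>(total);
}
```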
<p>Here are some stupid statistics to illustrate my point (with graphs, and everyone loves graphs!):</p>
<p>Graph of bugs fixed per developer, over one week during the bug fixing phase. Red/yellow/green corresponds to priority 1,2,3 issues:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/graphbugs.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/graphbugs.png" alt="" title="Bugs fixed per developer" width="375" height="309" class="alignnone size-full wp-image-631" /></a></p>
<p>The distribution of bug fixes is, shall we say, <em>somewhat</em> non-uniform.</p>
<p>Is it a valid measure of &#8220;productivity&#8221;? Absolutely not. Some people probably haven&#8217;t been fixing bugs at all that week. Some bugs are <em>way</em> harder to fix than others. Some people could have done the major part of a fix, while the finishing touches &#038; the act of actually resolving the bug were done by someone else. So yes, these statistics are absolutely flawed, but do we have anything else?</p>
<p>We could be checking version control commits.</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/02/svntimeline.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/svntimeline-500x243.png" alt="" title="Source code commits over time period" width="500" height="243" class="alignnone size-medium wp-image-637" /></a></p>
<p>Or putting the same into &#8220;commits by developer&#8221;:</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/02/svnauthor.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/svnauthor-500x269.png" alt="" title="Source code commits by author" width="500" height="269" class="alignnone size-medium wp-image-638" /></a></p>
<p>Of course this is even easier to game than resolving bugs. <em>&#8220;Moving buttons to the left&#8221;, &#8220;Whoops, that was wrong, moving them to the right again&#8221;</em> anyone? And people will be trolling statistics just because they can.<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/svntroll.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/svntroll.png" alt="" title="svn trolling!" width="330" height="108" class="alignnone size-full wp-image-633" /></a></p>
<p>However, there is still this highly subjective &#8220;feeling&#8221; that some folks are way, <em>way</em> faster than others. And not in just &#8220;can do some mess real fast&#8221; way, but in the &#8220;gets actual work done, and done well&#8221; way.</p>
<p>Or is it just my experience? How is it in your company? What can be done about it? Should something be done about it? I don&#8217;t know the answers&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/02/16/a-non-uniform-work-distribution/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The Virtual and No-Virtual</title>
		<link>http://aras-p.info/blog/2011/02/01/the-virtual-and-no-virtual/</link>
		<comments>http://aras-p.info/blog/2011/02/01/the-virtual-and-no-virtual/#comments</comments>
		<pubDate>Tue, 01 Feb 2011 10:28:03 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=606</guid>
		<description><![CDATA[You are writing some system where different implementations have to be used for different platforms. To keep things real, let&#8217;s say it&#8217;s a rendering system which we&#8217;ll call &#8220;GfxDevice&#8221; (based on a true story!). For example, on Windows there could be a Direct3D 9, Direct3D 11 or OpenGL implementations; on iOS/Android there could be OpenGL [...]]]></description>
			<content:encoded><![CDATA[<p>You are writing some system where different implementations have to be used for different platforms. To keep things real, let&#8217;s say it&#8217;s a rendering system which we&#8217;ll call &#8220;GfxDevice&#8221; <em>(based on a true story!)</em>. For example, on Windows there could be Direct3D 9, Direct3D 11 or OpenGL implementations; on iOS/Android there could be OpenGL ES 1.1 &#038; 2.0 ones and so on.</p>
<p>For sake of simplicity, let&#8217;s say our GfxDevice interface needs to do this <em>(in real world it would need to do much more)</em>:</p>
<blockquote><pre>
void SetShader (ShaderType type, ShaderID shader);
void SetTexture (int unit, TextureID texture);
void SetGeometry (VertexBufferID vb, IndexBufferID ib);
void Draw (PrimitiveType prim, int primCount);
</pre>
</blockquote>
<p>How can this be done?</p>
<p><span id="more-606"></span><br />
<strong>Approach #1: virtual interface!</strong></p>
<p>Many a programmer would think like this: why of course, GfxDevice is an interface with virtual functions, and then we have multiple implementations of it. Sounds good, and that&#8217;s what you would have been taught at the university in various software design courses. Here we go:</p>
<blockquote><pre>
class GfxDevice {
public:
    virtual ~GfxDevice();
    virtual void SetShader (ShaderType type, ShaderID shader) = 0;
    virtual void SetTexture (int unit, TextureID texture) = 0;
    virtual void SetGeometry (VertexBufferID vb, IndexBufferID ib) = 0;
    virtual void Draw (PrimitiveType prim, int primCount) = 0;
};
// and then we have:
class GfxDeviceD3D9 : public GfxDevice {
    // ...
};
class GfxDeviceGLES20 : public GfxDevice {
    // ...
};
class GfxDeviceGCM : public GfxDevice {
    // ...
};
// and so on
</pre>
</blockquote>
<p>And then based on platform (or something else) you create the right GfxDevice implementation, and the rest of the code uses that. This is all good and it works.</p>
<p>But then&#8230; hey! Some platforms <em>can only ever have one</em> GfxDevice implementation. On PS3 you will <em>always</em> end up using GfxDeviceGCM. Does it really make sense to have virtual functions on that platform?</p>
<blockquote><p>
Side note: <em>of course</em> the cost of a virtual function call is not something that stands out immediately. It&#8217;s much less than, for example, doing a network request to get the leaderboards or parsing that XML file that ended up in your game for reasons no one can remember. Virtual function calls will not show up in the profiler as &#8220;a heavy bottleneck&#8221;. However, they are not free, and their cost is scattered across a million places, which makes it very hard to eradicate. You can end up with death by a thousand paper cuts.
</p></blockquote>
<p>If we want to get rid of virtual functions on platforms where they are useless, what can we do?</p>
<p><strong>Approach #2: preprocessor to the rescue</strong></p>
<p>We just have to take out the &#8220;virtual&#8221; bit from the interface, and the &#8220;= 0&#8221; abstract function bit. With a bit of preprocessor magic we can do exactly that:</p>
<blockquote><pre>
#define GFX_DEVICE_VIRTUAL (PLATFORM_WINDOWS || PLATFORM_MOBILE_UNIVERSAL || SOMETHING_ELSE)
#if GFX_DEVICE_VIRTUAL
    #define GFX_API virtual
    #define GFX_PURE = 0
#else
    #define GFX_API
    #define GFX_PURE
#endif
class GfxDevice {
public:
    GFX_API ~GfxDevice();
    GFX_API void SetShader (ShaderType type, ShaderID shader) GFX_PURE;
    GFX_API void SetTexture (int unit, TextureID texture) GFX_PURE;
    GFX_API void SetGeometry (VertexBufferID vb, IndexBufferID ib) GFX_PURE;
    GFX_API void Draw (PrimitiveType prim, int primCount) GFX_PURE;
};
</pre>
</blockquote>
<p>And then there&#8217;s no separate class called GfxDeviceGCM for PS3; it&#8217;s just GfxDevice class implementing non-virtual methods. You have to make sure you don&#8217;t try to compile multiple GfxDevice class implementations on PS3 of course.</p>
<p>Ta-da! Virtual functions are gone on some platforms and life is good.</p>
<p>But we still have the other platforms, where there can be more than one GfxDevice implementation, and the decision for which one to use is made at runtime. Like our good old friend the PC: you could use Direct3D 9 or Direct3D 11 or OpenGL, based on the OS, GPU capabilities or user&#8217;s preference. Or a mobile platform where you don&#8217;t know whether OpenGL ES 2.0 will be available and you&#8217;d have to fall back to OpenGL ES 1.1.</p>
<p><strong>Let&#8217;s think about what virtual functions actually are</strong></p>
<p>How do virtual functions work? Usually like this: each object gets a &#8220;pointer to a virtual function table&#8221; as its first hidden member. The virtual function table (vtable) is then just pointers to where the functions are in the code. Something like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/vtable1.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/vtable1.png" alt="" title="How virtual functions work" width="535" height="371" class="alignnone size-full wp-image-615" /></a><br />
The key points are: 1) each object&#8217;s data starts with a vtable pointer, and 2) vtable layout for classes implementing the same interface is the same.</p>
<p>When the compiler generates code for something like this:</p>
<blockquote><pre>
device->Draw (kPrimTriangles, 1337);
</pre>
</blockquote>
<p>it will generate something like the following pseudo-assembly:</p>
<blockquote><pre>
vtable = load pointer from [device] address
drawlocation = vtable + 3*PointerSize<em> ; since Draw is at index [3] in vtable</em>
drawfunction = load pointer from [drawlocation] address
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
</pre>
</blockquote>
<p>This code will work no matter whether device is of the GfxDeviceGLES20 or GfxDeviceGLES11 kind. In both cases the first pointer in the object will point to the appropriate vtable, and the fourth pointer in the vtable will point to the appropriate Draw function (for simplicity, this ignores the slot that the virtual destructor also takes up in the vtable).</p>
<p>By the way, the above illustrates the overhead of a virtual function call. Assume a platform with an in-order CPU where reading from memory takes 500 CPU cycles (which is not far from the truth for current consoles). If nothing we need is in the CPU cache yet, this is what actually happens:</p>
<blockquote><pre>
vtable = load pointer from [device] address
<em>; <strong>wait 500 cycles</strong> until the pointer arrives</em>
drawlocation = vtable + 3*PointerSize
drawfunction = load pointer from [drawlocation] address
<em>; <strong>wait 500 cycles</strong> until the pointer arrives</em>
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
<em>; <strong>wait 500 cycles</strong> until code at that address is loaded</em>
</pre>
</blockquote>
<p><strong>Can we do better?</strong></p>
<p>Look at the picture in the previous paragraph and remember the &#8220;wait 500 cycles&#8221; for each pointer we are chasing. Can we reduce the number of pointer chases? Of course we can: why not ditch the vtable altogether, and just put function pointers directly into the GfxDevice object?</p>
<blockquote><p>Virtual tables are implemented this way mostly to save space. If we had 10000 objects of a class that has 20 virtual methods, we would pay only one pointer of overhead per object (40000 bytes on a 32-bit architecture) and store the vtable (20*4=80 bytes on 32-bit) just once: 39.14 kilobytes in total.<br />
If we moved all function pointers into the objects themselves, we&#8217;d need to store 20 function pointers in each object. That would be 781.25 kilobytes! Clearly this approach does not scale with increasing object instance counts.
</p></blockquote>
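<p>The arithmetic above can be checked at compile time. This sketch assumes 4-byte pointers and 1 kilobyte = 1024 bytes (which is what makes 40080 bytes come out as 39.14 KB); the function names are made up for illustration. The last two checks cover the single-instance case, where the per-object layout is actually the smaller one.</p>

```cpp
#include <cstddef>

// Assumes 32-bit pointers (4 bytes) and 1 kilobyte = 1024 bytes, as in the text.
constexpr std::size_t kPointerSize = 4;

// Shared vtable: one hidden pointer per object, plus a single table per class.
constexpr std::size_t SharedVTableBytes(std::size_t objects, std::size_t methods) {
    return objects * kPointerSize + methods * kPointerSize;
}

// Function pointers copied into every object; no per-object vtable pointer.
constexpr std::size_t PerObjectPointerBytes(std::size_t objects, std::size_t methods) {
    return objects * methods * kPointerSize;
}

constexpr double KiB(std::size_t bytes) { return bytes / 1024.0; }

// 10000 objects, 20 virtual methods.
static_assert(SharedVTableBytes(10000, 20) == 40080, "40000 + 80 bytes, ~39.14 KB");
static_assert(PerObjectPointerBytes(10000, 20) == 800000, "781.25 KB");
// A single instance: 84 bytes with a vtable, 80 bytes without one.
static_assert(SharedVTableBytes(1, 20) == 84, "4 + 80 bytes");
static_assert(PerObjectPointerBytes(1, 20) == 80, "20 * 4 bytes");
```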
<p>However, how many GfxDevice object instances do we <em>really</em> have? Most often&#8230; <em>exactly one</em>.</p>
<p><strong>Approach #3: function pointers</strong></p>
<p>If we move function pointers to the object itself, we&#8217;d have something like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/novtable2.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/novtable2.png" alt="" title="No vtable!" width="337" height="356" class="alignnone size-full wp-image-621" /></a></p>
<p>There&#8217;s no built-in language support for implementing this in C++ however, so that would have to be done manually. Something like:</p>
<blockquote><pre>
struct GfxDeviceFunctions {
    SetShaderFunc SetShader;
    SetTextureFunc SetTexture;
    SetGeometryFunc SetGeometry;
    DrawFunc Draw;
};
class GfxDeviceGLES20 : public GfxDeviceFunctions {
    // ...
};
</pre>
</blockquote>
<p>And then when creating a particular GfxDevice, you have to fill in the function pointers yourself. Also, member functions implicitly take a &#8220;this&#8221; parameter, so it&#8217;s hard to use them as plain function pointers without resorting to clumsy C++ member function pointer syntax and its related issues.</p>
<p>We can be more explicit, C style, and instead make the functions static, taking the &#8220;this&#8221; parameter explicitly:</p>
<blockquote><pre>
class GfxDeviceGLES20 : public GfxDeviceFunctions {
    // ...
    static void DrawImpl (GfxDevice* self, PrimitiveType prim, int primCount);
    // ...
};
</pre>
</blockquote>
<p>Code that uses it would look like this then:</p>
<blockquote><pre>
device->Draw (device, kPrimTriangles, 1337);
</pre>
</blockquote>
<p>and it would generate the following pseudo-assembly:</p>
<blockquote><pre>
drawlocation = device + 3*PointerSize
drawfunction = load pointer from [drawlocation] address
<em>; <strong>wait 500 cycles</strong> until the pointer arrives</em>
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
<em>; <strong>wait 500 cycles</strong> until code at that address is loaded</em>
</pre>
</blockquote>
<p>Look at that, one of &#8220;wait 500 cycles&#8221; is gone!</p>
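<p>Putting the pieces of approach #3 together, here is a compilable sketch trimmed to two functions. It arranges things slightly differently from the snippets above (the static implementations live in separate helper structs rather than in the device class), and every name, including <code>InitDevice</code> and the <code>haveGLES20</code> flag, is a hypothetical stand-in for whatever runtime capability check an engine would do.</p>

```cpp
#include <cassert>

// Hypothetical sketch: function pointers stored directly in the device object.
struct GfxDevice;
typedef void (*SetShaderFunc)(GfxDevice* self, int shader);
typedef int (*DrawFunc)(GfxDevice* self, int prim, int primCount);

struct GfxDeviceFunctions {
    SetShaderFunc SetShader;
    DrawFunc Draw;
};

struct GfxDevice : GfxDeviceFunctions {
    int currentShader = -1;
    int drawnPrimitives = 0;
};

// C-style "member functions": static, with an explicit self parameter.
struct GfxDeviceGLES20 {
    static void SetShaderImpl(GfxDevice* self, int shader) { self->currentShader = shader; }
    static int DrawImpl(GfxDevice* self, int /*prim*/, int primCount) {
        self->drawnPrimitives += primCount;
        return 20; // identifies the backend, for illustration only
    }
};

struct GfxDeviceGLES11 {
    static void SetShaderImpl(GfxDevice* self, int shader) { self->currentShader = shader; }
    static int DrawImpl(GfxDevice* self, int /*prim*/, int primCount) {
        self->drawnPrimitives += primCount;
        return 11;
    }
};

// The pointers have to be filled in by hand when the device is created.
void InitDevice(GfxDevice* device, bool haveGLES20) {
    assert(device != nullptr);
    if (haveGLES20) {
        device->SetShader = &GfxDeviceGLES20::SetShaderImpl;
        device->Draw = &GfxDeviceGLES20::DrawImpl;
    } else {
        device->SetShader = &GfxDeviceGLES11::SetShaderImpl;
        device->Draw = &GfxDeviceGLES11::DrawImpl;
    }
}
```

<p>A call such as <code>device.Draw(&amp;device, 4, 1337)</code> reads the function pointer straight out of the object, so the vtable load disappears. The price is that forgetting to fill in a pointer can become a runtime crash rather than a compile error.</p>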
<p><strong>More C style</strong></p>
<p>We could move function pointers outside of GfxDevice if we want to, and just make them global:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/globalfuncs.png"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/globalfuncs.png" alt="" title="Global function pointers" width="529" height="358" class="alignnone size-full wp-image-624" /></a></p>
<p>In the GLES 1.1 case, those global GfxDevice function pointers would point to different pieces of code. And the pseudocode for this:</p>
<blockquote><pre>
// global variables!
SetShaderFunc GfxSetShader;
SetTextureFunc GfxSetTexture;
SetGeometryFunc GfxSetGeometry;
DrawFunc GfxDraw;
// GLES2.0 implementation:
void GfxDrawGLES20 (GfxDevice* self, PrimitiveType prim, int primCount) { /* ... */ }
</pre>
</blockquote>
<p>Code that uses it:</p>
<blockquote><pre>
GfxDraw (device, kPrimTriangles, 1337);
</pre>
</blockquote>
<p>and the pseudo-assembly:</p>
<blockquote><pre>
drawfunction = load pointer from [GfxDraw variable] address
<em>; <strong>wait 500 cycles</strong> until the pointer arrives</em>
pass device pointer, kPrimTriangles and 1337 as arguments
call into code at [drawfunction] address
<em>; <strong>wait 500 cycles</strong> until code at that address is loaded</em>
</pre>
</blockquote>
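<p>A compilable version of the global-pointer variant, again trimmed to <code>Draw</code> only. <code>InitGfxFunctions</code>, <code>DrawTriangles</code>, the <code>haveGLES20</code> flag and the value of <code>kPrimTriangles</code> are hypothetical. The point is that the backend is picked once at startup, and every call site then does a single load of <code>GfxDraw</code> followed by an indirect call.</p>

```cpp
#include <cassert>

// Hypothetical sketch: one set of global function pointers, set up at startup.
struct GfxDevice { int drawnPrimitives; };

typedef int (*DrawFunc)(GfxDevice* self, int prim, int primCount);

// Global function pointer (only Draw shown).
DrawFunc GfxDraw = nullptr;

int GfxDrawGLES20(GfxDevice* self, int /*prim*/, int primCount) {
    self->drawnPrimitives += primCount;
    return 20; // identifies the backend, for illustration only
}

int GfxDrawGLES11(GfxDevice* self, int /*prim*/, int primCount) {
    self->drawnPrimitives += primCount;
    return 11;
}

// Chosen once at runtime, e.g. after checking whether OpenGL ES 2.0 is available.
void InitGfxFunctions(bool haveGLES20) {
    GfxDraw = haveGLES20 ? &GfxDrawGLES20 : &GfxDrawGLES11;
}

const int kPrimTriangles = 0; // hypothetical enum value

// Call site: one load of the global pointer, then an indirect call.
int DrawTriangles(GfxDevice* device, int count) {
    assert(GfxDraw != nullptr && "InitGfxFunctions must run first");
    return GfxDraw(device, kPrimTriangles, count);
}
```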
<p><strong>Is it worth it?</strong></p>
<p>I can hear some saying, &#8220;what? throwing away C++ OOP and implementing the same in almost raw C?! you&#8217;re crazy!&#8221;</p>
<p>Whether going the above route is better or worse is mostly a matter of programming style and preferences. It does, for sure, get rid of one &#8220;wait 500 cycles&#8221; in the worst case. And yes, to get that you lose some of the automagic syntax sugar of C++.</p>
<p>Is it worth it? Like always, depends on a lot of things. But if you do find yourself pondering the virtual function overhead for singleton-like objects, or especially if you do see that your profiler reports cache misses when calling into them, at least you&#8217;ll know one of the many possible alternatives, right?</p>
<p>And yeah, another alternative that&#8217;s easy to do on some platforms? Just put different GfxDevice implementations into dynamically loaded libraries, exposing the same set of functions. Which would end up being <em>very</em> similar to the last approach of &#8220;store function pointer table globally&#8221;, except you&#8217;d get some compiler syntax sugar to make it easier; and you wouldn&#8217;t even need to load the code that is not going to be used.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/02/01/the-virtual-and-no-virtual/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>iOS shader tricks, or it&#8217;s 2001 all over again</title>
		<link>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/</link>
		<comments>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/#comments</comments>
		<pubDate>Tue, 01 Feb 2011 07:43:57 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=592</guid>
		<description><![CDATA[I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here&#8217;s a small example of starting with a normalmapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here&#8217;s a small example of starting with a normalmapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below was actually done by <a href="http://twitter.com/#!/__ReJ__">ReJ</a>, props to him!</p>
<p>Here&#8217;s a small test I&#8217;ll be working on: just a single plane with albedo and normal map textures:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump1.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump1-150x150.jpg" alt="" title="iOS Bumped Specular" width="150" height="150" class="alignnone size-thumbnail wp-image-593" /></a></p>
<p><span id="more-592"></span>I&#8217;ll be testing on an iPhone 3Gs with iOS 4.2.1. The timer is started before glClear() and stopped after a glFinish() that I added right after drawing the mesh.</p>
<p>Let&#8217;s start with an initial na&iuml;ve shader version:<br />
<script src="https://gist.github.com/783784.js"> </script></p>
<p>Should be pretty self-explanatory to anyone who&#8217;s familiar with tangent space normal mapping and the Blinn-Phong BRDF. Running time: <strong>24.5 milliseconds</strong>. At the iPhone 4&#8217;s Retina resolution (960&#215;640, four times the pixels of the 3Gs&#8217;s 480&#215;320), this would be about 4x slower!</p>
<p>What can we do next? On mobile platforms, using the appropriate precision for variables is often very important, especially in a fragment shader. So let&#8217;s go and add highp/mediump/lowp qualifiers to the fragment shader: <a href="https://gist.github.com/783703/05e78340b12739e853ce031bd0388430ea95f2a6">shader source</a></p>
<p>Still the same running time! Alas, iOS does not have low level shader analysis tools, so we can&#8217;t really tell why that is happening. We could be limited by something else (e.g. normalizing vectors and computing pow() being the bottlenecks that run in parallel with all low precision stuff), or the driver might be promoting most of our computations to higher precision because it feels like it. It&#8217;s a magic box!</p>
<p>Let&#8217;s start approximating instead. How about computing normalized view direction per vertex, and interpolating that for the fragment shader? It won&#8217;t be entirely &#8220;correct&#8221;, but hey, it&#8217;s a phone we&#8217;re talking about. <a href="https://gist.github.com/783703/1e4fd0daa384d308d125a748985e8e203e49625a">shader source</a></p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump3.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump3-150x150.jpg" alt="" title="iOS Bumped Specular, wrong precision!" width="150" height="150" class="alignright size-thumbnail wp-image-594" /></a><br />
15 milliseconds! But&#8230; the rendering is wrong; everything turned white near the bottom of the screen. It turns out PowerVR SGX (the GPU in all current iOS devices) really does mean &#8220;low precision&#8221; when we add two lowp vectors and normalize the result. Let&#8217;s try promoting one of them to medium precision with a &#8220;varying mediump vec3 v_viewdir&#8221;: <a href="https://gist.github.com/783703/591eb83dacaae3840cc4e4d3d8b95a4fc3abdd65">shader source</a></p>
<p>That fixed rendering, but we&#8217;re back to 24.5 milliseconds. <em>Sad shader writers are sad&#8230; oh shader performance analysis tools, where art thou?</em></p>
<p>Let&#8217;s try approximating some more: compute the half-vector in the vertex shader, and interpolate the normalized value. This would get rid of all normalizations in the fragment shader. <a href="https://gist.github.com/783703/6360c2912b860aa30415e5120ef147169274cd71">shader source</a></p>
<p><strong>16.3</strong> milliseconds, not too bad! We still have pow() computed in the fragment shader, and that one is probably not the fastest operation there&#8230;</p>
<p>Almost a decade ago, a very common trick was to use a lookup texture to do the lighting. For example, a 2D texture indexed by (N.L, N.H). Since all lighting data would be &#8220;baked&#8221; into the texture, it does not even have to be Blinn-Phong; we can prepare faux-anisotropic, metallic, toon-shading or other fancy BRDFs there, as long as they can be expressed in terms of N.L and N.H. So let&#8217;s try creating a 128&#215;128 RGBA lookup texture and using that: <a href="https://gist.github.com/783703/87f1cf5529d644cab16123550e809e9f7598f4f3">shader source</a></p>
<p>Quick &amp; not very efficient code to create the lighting lookup texture for Blinn-Phong:<br />
<script src="https://gist.github.com/783759.js"> </script></p>
<p><strong>9.1</strong> milliseconds! We lost some precision in the specular though (it&#8217;s dimmer):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump6.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump6-150x150.jpg" alt="" title="iOS Bumped Specular via texture LUT" width="150" height="150" class="alignnone size-thumbnail wp-image-595" /></a></p>
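<p>In case the gist with the original lookup texture code is unavailable, here is a rough C++ sketch of the same idea. It assumes N.L along the texture&#8217;s x axis and N.H along y, the diffuse N.L term stored in RGB and the specular term pow(N.H, shininess) in alpha, at 8 bits per channel. That layout, the <code>MakeBlinnPhongLUT</code> name and the <code>shininess</code> parameter are assumptions, not necessarily what the original code does. Any other BRDF expressible in N.L and N.H could be baked by changing the two lines that compute <code>diffuse</code> and <code>specular</code>.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical Blinn-Phong lighting lookup table, RGBA8, row-major.
// x axis = N.L in [0,1], y axis = N.H in [0,1] (assumed layout).
std::vector<std::uint8_t> MakeBlinnPhongLUT(int width, int height, float shininess) {
    assert(width > 0 && height > 0);
    std::vector<std::uint8_t> texels(static_cast<std::size_t>(width) * height * 4);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Evaluate at texel centers so filtered lookups follow the curve.
            float nDotL = (x + 0.5f) / width;
            float nDotH = (y + 0.5f) / height;
            float diffuse = nDotL;
            float specular = std::pow(nDotH, shininess);
            auto d = static_cast<std::uint8_t>(std::lround(diffuse * 255.0f));
            auto s = static_cast<std::uint8_t>(std::lround(specular * 255.0f));
            std::size_t i = (static_cast<std::size_t>(y) * width + x) * 4;
            texels[i + 0] = d; // R: diffuse
            texels[i + 1] = d; // G: diffuse
            texels[i + 2] = d; // B: diffuse
            texels[i + 3] = s; // A: specular
        }
    }
    return texels;
}
```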
<p>What else can be done? Notice that we clamp N.L and N.H values in the fragment shader, but this could be done just as well by the texture sampler, if we set texture&#8217;s addressing mode to CLAMP_TO_EDGE. Let&#8217;s get rid of the clamps: <a href="https://gist.github.com/783703/e24a2475fded83d2196372c8092a0d8de80a98eb">shader source</a></p>
<p>This is 8.3 milliseconds, or <strong>7.6</strong> milliseconds if we reduce our lighting texture resolution to 32&#215;128.</p>
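<p>Why removing the clamps is safe can be seen with a small 1D model of a texture fetch with CLAMP_TO_EDGE addressing and linear filtering: any coordinate outside [0,1] resolves to the edge texels, which is the same result as clamping the coordinate first. This is a simplified CPU-side model written for illustration (<code>SampleClampToEdge</code> is a hypothetical name), not how any particular GPU implements sampling.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Simplified 1D texture fetch: linear filtering + CLAMP_TO_EDGE addressing.
float SampleClampToEdge(const std::vector<float>& texels, float u) {
    assert(!texels.empty());
    const int n = static_cast<int>(texels.size());
    const float x = u * n - 0.5f;          // texel space; texel i is centered at i + 0.5
    const float x0 = std::floor(x);
    const float f = x - x0;                // blend weight between the two neighbors
    const int base = static_cast<int>(x0);
    // CLAMP_TO_EDGE: out-of-range texel indices snap to the first/last texel.
    const int i0 = std::min(std::max(base, 0), n - 1);
    const int i1 = std::min(std::max(base + 1, 0), n - 1);
    return texels[i0] + (texels[i1] - texels[i0]) * f;
}
```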
<p>Should we stop there? Not necessarily. For example, the shader is still multiplying albedo with a per-material color. Maybe that&#8217;s not very useful and can be let go. Maybe we can also make specular be always white?<br />
<script src="https://gist.github.com/783703.js"> </script></p>
<p>How fast is this? <strong>5.9 milliseconds</strong>,&nbsp;or over <strong>4 times</strong> faster than our original shader.</p>
<p>Could it be made faster? Maybe; that&#8217;s an exercise for the reader :) I tried computing just the RGB color channels and setting alpha to zero, but that got slightly slower. Without real shader analysis tools it&#8217;s hard to see where or if additional cycles could be squeezed out.</p>
<p>I&#8217;m attaching an <a href='http://aras-p.info/blog/wp-content/uploads/2011/02/iOSShaderPerf.zip'>Xcode project with sources, textures and shaders of this experiment</a>. Notes: it was only tested on an iPhone 3Gs (it will probably crash on an iPhone 3G, and the iPad will have the wrong aspect ratio). It might not work at all! The shader is read from Resources/Shaders/shader.txt; next to it are the shader versions for each step of this experiment. Enjoy!</p>
<p><em>This is a cross post from altdevblogaday: <a href="http://altdevblogaday.com/ios-shader-tricks-or-its-2001-all-over-again">http://altdevblogaday.com/ios-shader-tricks-or-its-2001-all-over-again</a></em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>GLSL Optimizer</title>
		<link>http://aras-p.info/blog/2010/09/29/glsl-optimizer/</link>
		<comments>http://aras-p.info/blog/2010/09/29/glsl-optimizer/#comments</comments>
		<pubDate>Wed, 29 Sep 2010 10:39:21 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=561</guid>
		<description><![CDATA[During development of Unity 3.0, I was not-so-pleasantly surprised to see that our cross-compiled shaders run slow on iPhone 3Gs. And by &#8220;slow&#8221;, I mean SLOW; at the speeds of &#8220;stop the presses, we can not ship brand new OpenGL ES 2.0 support with THAT performance&#8221;. Back story Take this HLSL pixel shader for particles, [...]]]></description>
			<content:encoded><![CDATA[<p>During development of <a href="http://unity3d.com/unity/whats-new/unity-3">Unity 3.0</a>, I was not-so-pleasantly surprised to see that our <a href="http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/">cross-compiled</a> shaders run <i>slow</i> on iPhone 3Gs. And by &#8220;slow&#8221;, I mean <strong>SLOW</strong>; at the speeds of &#8220;stop the presses, we can not ship brand new OpenGL ES 2.0 support with THAT performance&#8221;.</p>
<p><span id="more-561"></span><br />
<b>Back story</b></p>
<p>Take this HLSL pixel shader for particles, which does nothing but multiply a texture by the per-vertex color:</p>
<blockquote><p><code>
<pre>
half4 frag (v2f i) : COLOR { return i.color * tex2D (_MainTex, i.texcoord); }
</pre>
<p></code></p></blockquote>
<p>This is about as simple as it can get; should be one texture fetch and one multiply for the GPU.</p>
<p>Now <i>of course</i>, when HLSL gets cross-compiled into GLSL, it is augmented by some dummy functions/moves to match GLSL&#8217;s semantics of &#8220;a function called main that takes no arguments and returns no value&#8221;. So you get something like this in GLSL:</p>
<blockquote><p><code>
<pre>
vec4 frag (in v2f i) { return i.color * texture2D (_MainTex, i.texcoord); }
void main() {
    vec4 xl_retval;
    v2f xlt_i;
    xlt_i.color = gl_Color;
    xlt_i.texcoord = gl_TexCoord[0];
    xl_retval = frag (xlt_i);
    gl_FragData[0] = xl_retval;
}
</pre>
<p></code></p></blockquote>
<p>Makes sense. The original function was translated, and main() got added that fills in the input structure, calls the function and writes result to gl_FragData[0] (aka gl_FragColor).</p>
<p>Lo and behold, the above (with some OpenGL ES 2.0 specific stuff added, like precision qualifiers, definitions of varyings etc.) runs like sh*t on a mobile platform.</p>
<p>Which probably means <b>mobile platform drivers are quite bad at optimizing GLSL</b>. I mostly tested iOS, but some tests on Android indicate that the situation is the same (maybe even worse, depending on the exact kind of Android device you have). This is sad, since these platforms also do not have any way to precompile shaders offline, where good but slow compilers would be affordable.</p>
<p>Now of course, if you&#8217;re writing GLSL shaders by hand, you&#8217;re probably writing close to optimal code, with no redundant data moves or wrapper functions. But if you&#8217;re cross-compiling them from Cg/HLSL, or generating them from shader fragments or visual shader editors, you probably depend on the shader compiler being decent at optimizing away the redundant bits.</p>
<p><b>GLSL Optimizer</b></p>
<p>Around the same time I accidentally discovered that the <a href="http://mesa3d.org/">Mesa 3D</a> guys were working on a new GLSL compiler, dubbed <a href="http://cgit.freedesktop.org/mesa/mesa/log/?h=glsl2">GLSL2</a>. I looked at the code and liked it a lot: very hackable, with a &#8220;no bullshit&#8221; approach. So I took Mesa&#8217;s GLSL compiler and made it output GLSL again after it has done all the optimizations.</p>
<p>Here it is: <a href="http://github.com/aras-p/glsl-optimizer"><b>http://github.com/aras-p/glsl-optimizer</b></a></p>
<p>It reads GLSL, does some architecture independent optimizations (dead code removal, algebraic simplifications, constant propagation, constant folding, inlining, &#8230;) and spits out &#8220;optimized&#8221; GLSL back.</p>
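<p>To give a feel for two of these passes, here is a toy C++ sketch of constant folding plus a couple of algebraic simplifications (x+0 and x*1) on a tiny expression tree. It is only an illustration: GLSL Optimizer itself works on Mesa&#8217;s much richer GLSL IR, and the types and function names below are made up. The same kind of rewrite lets a multiplication by a constant <code>atten = 1.0</code> disappear, as it does in the optimized Diffuse shader later in this post.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy expression IR, for illustration only.
struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

struct Expr {
    enum Kind { Const, Var, Add, Mul } kind = Const;
    float value = 0.0f;  // used by Const
    std::string name;    // used by Var
    ExprPtr lhs, rhs;    // used by Add / Mul
};

ExprPtr MakeConst(float v) { auto e = std::make_shared<Expr>(); e->kind = Expr::Const; e->value = v; return e; }
ExprPtr MakeVar(const std::string& n) { auto e = std::make_shared<Expr>(); e->kind = Expr::Var; e->name = n; return e; }
ExprPtr MakeBin(Expr::Kind k, ExprPtr a, ExprPtr b) {
    auto e = std::make_shared<Expr>(); e->kind = k; e->lhs = a; e->rhs = b; return e;
}

static bool IsConst(const ExprPtr& e, float v) { return e->kind == Expr::Const && e->value == v; }

// Bottom-up rewrite: simplify the operands first, then the node itself.
ExprPtr Simplify(const ExprPtr& e) {
    assert(e != nullptr);
    if (e->kind == Expr::Const || e->kind == Expr::Var) return e;
    ExprPtr a = Simplify(e->lhs), b = Simplify(e->rhs);
    // Constant folding: both operands are known at compile time.
    if (a->kind == Expr::Const && b->kind == Expr::Const)
        return MakeConst(e->kind == Expr::Add ? a->value + b->value : a->value * b->value);
    if (e->kind == Expr::Add) {  // x + 0 -> x (ignores the IEEE -0.0 corner case)
        if (IsConst(a, 0.0f)) return b;
        if (IsConst(b, 0.0f)) return a;
    } else {                     // x * 1 -> x
        if (IsConst(a, 1.0f)) return b;
        if (IsConst(b, 1.0f)) return a;
    }
    return MakeBin(e->kind, a, b);
}
```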
<p><b>Results</b></p>
<p>The above simple particle shader example. GLSL optimizer optimizes it into:</p>
<blockquote><p><code>
<pre>
void main() {
    gl_FragData[0] =
        (gl_Color.xyzw * texture2D (_MainTex, gl_TexCoord[0].xy)).xyzw;
}
</pre>
<p></code></p></blockquote>
<p>Save for redundant swizzle outputs (on my todo list), this is pretty much what you&#8217;d be writing by hand. No redundant moves, function call inlined, no extra temporaries, sweet!</p>
<p>How much difference does this make?<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesNo.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesNo.jpg" alt="" title="Particles, GLSL not optimized" width="160" height="240" /></a><a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesYes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesYes.jpg" alt="" title="Particles, optimized GLSL" width="160" height="240" /></a><br />
Lots of particles, non-optimized GLSL on the left; optimized GLSL on the right (click for larger image). <b>Yep, it&#8217;s 236 vs. 36 milliseconds/frame</b> (4 vs. 27 FPS).</p>
<p>This result is for an iPhone 3Gs running iOS 4.1. Some Android results: Motorola Droid (some PowerVR GPU): 537 vs. 223 ms; Nexus One (Snapdragon 8250 w/ Adreno GPU): 155 vs. 155 ms (yay! good drivers!); Samsung Galaxy S (some PowerVR GPU): 200 vs. 60 ms. All tests were run at native device resolutions, so do not take these numbers as a performance comparison between devices.</p>
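<p>The results mix frame times and frame rates, so here is a tiny helper that converts between them and computes the per-device speedup of the optimized GLSL: roughly 6.6x on the iPhone 3Gs (236 vs. 36 ms), 2.4x on the Droid, 3.3x on the Galaxy S, and none on the Nexus One. The function names are illustrative only.</p>

```cpp
#include <cassert>

// Frame time in milliseconds -> frames per second.
double FramesPerSecond(double msPerFrame) {
    assert(msPerFrame > 0.0);
    return 1000.0 / msPerFrame;
}

// How many times faster the optimized version ran on the same device.
double Speedup(double msBefore, double msAfter) {
    assert(msAfter > 0.0);
    return msBefore / msAfter;
}
```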
<p>What about a more complex shader example? Let&#8217;s try per-pixel lit Diffuse shader (which is quite simple, but will do ok as &#8220;complex shader&#8221; example for a mobile platform). You can see that the GLSL code below is <a href="http://aras-p.info/blog/2010/07/16/surface-shaders-one-year-later/">mostly auto-generated</a>; writing it by hand wouldn&#8217;t produce that many data moves, unused struct members etc. Cg compiles original shader code into 10 ALU and 1 TEX instructions for D3D9 pixel shader 2.0, and is able to optimize away all the redundant stuff.</p>
<blockquote><p><code>
<pre>
struct SurfaceOutput {
    vec3 Albedo;
    vec3 Normal;
    vec3 Emission;
    float Specular;
    float Gloss;
    float Alpha;
};
struct Input {
    vec2 uv_MainTex;
};
struct v2f_surf {
    vec4 pos;
    vec2 hip_pack0;
    vec3 normal;
    vec3 vlight;
};
uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void surf (in Input IN, inout SurfaceOutput o) {
    vec4 c;
    c = texture2D (_MainTex, IN.uv_MainTex) * _Color;
    o.Albedo = c.xyz;
    o.Alpha = c.w;
}
vec4 LightingLambert (in SurfaceOutput s, in vec3 lightDir, in float atten) {
    float diff;
    vec4 c;
    diff = max (0.0, dot (s.Normal, lightDir));
    c.xyz  = (s.Albedo * _LightColor0.xyz) * (diff * atten * 2.0);
    c.w  = s.Alpha;
    return c;
}
vec4 frag_surf (in v2f_surf IN) {
    Input surfIN;
    SurfaceOutput o;
    float atten = 1.0;
    vec4 c;
    surfIN.uv_MainTex = IN.hip_pack0.xy;
    o.Albedo = vec3 (0.0);
    o.Emission = vec3 (0.0);
    o.Specular = 0.0;
    o.Alpha = 0.0;
    o.Gloss = 0.0;
    o.Normal = IN.normal;
    surf (surfIN, o);
    c = LightingLambert (o, _WorldSpaceLightPos0.xyz, atten);
    c.xyz += (o.Albedo * IN.vlight);
    c.w = o.Alpha;
    return c;
}
void main() {
    vec4 xl_retval;
    v2f_surf xlt_IN;
    xlt_IN.hip_pack0 = vec2 (gl_TexCoord[0]);
    xlt_IN.normal = vec3 (gl_TexCoord[1]);
    xlt_IN.vlight = vec3 (gl_TexCoord[2]);
    xl_retval = frag_surf (xlt_IN);
    gl_FragData[0] = xl_retval;
}
</pre>
<p></code></p></blockquote>
<p>Running the above through GLSL optimizer produces this:</p>
<blockquote><p><code>
<pre>
uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void main ()
{
    vec4 c;
    vec4 tmpvar_32;
    tmpvar_32 = texture2D (_MainTex, gl_TexCoord[0].xy) * _Color;
    vec3 tmpvar_33;
    tmpvar_33 = tmpvar_32.xyz;
    float tmpvar_34;
    tmpvar_34 = tmpvar_32.w;
    vec4 c_i0_i1;
    c_i0_i1.xyz = ((tmpvar_33 * _LightColor0.xyz) *
    	(max (0.0, dot (gl_TexCoord[1].xyz, _WorldSpaceLightPos0.xyz)) * 2.0)).xyz;
    c_i0_i1.w = (vec4(tmpvar_34)).w;
    c = c_i0_i1;
    c.xyz = (c_i0_i1.xyz + (tmpvar_33 * gl_TexCoord[2].xyz)).xyz;
    c.w = (vec4(tmpvar_34)).w;
    gl_FragData[0] = c.xyzw;
}
</pre>
<p></code></p></blockquote>
<p>All functions got inlined, all unused variable assignments got eliminated, and most of the redundant moves are gone. There are some redundant moves left though (again, on my todo list), and the variables are assigned cryptic names after inlining. But otherwise, a hand-written equivalent would look pretty close to this.</p>
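<p>The &#8220;unused variable assignments got eliminated&#8221; part boils down to liveness analysis. Here is a toy C++ sketch for straight-line code: it walks the statements backwards and keeps an assignment only if its value is read later or is a program output. It tracks whole variables only (real compilers also handle swizzled partial writes, struct members and control flow), and all names are hypothetical.</p>

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One statement of a toy straight-line program: writes `dest`, reads `uses`.
struct Assignment {
    std::string dest;
    std::vector<std::string> uses;
};

// Backward liveness pass: keep a statement only if its destination is live,
// i.e. read by a later kept statement or listed as a program output.
std::vector<Assignment> RemoveDeadAssignments(const std::vector<Assignment>& program,
                                              const std::set<std::string>& outputs) {
    std::set<std::string> live = outputs;
    std::vector<Assignment> kept;
    for (auto it = program.rbegin(); it != program.rend(); ++it) {
        assert(!it->dest.empty());
        if (live.count(it->dest) == 0) continue;   // value is never read: dead
        live.erase(it->dest);                      // this write defines it
        live.insert(it->uses.begin(), it->uses.end());
        kept.insert(kept.begin(), *it);            // preserve original order
    }
    return kept;
}
```

<p>For example, modeling <code>frag_surf</code> as a list of such assignments, with <code>gl_FragData[0]</code> as the only output, drops the writes to <code>o.Emission</code>, <code>o.Specular</code> and <code>o.Gloss</code>, as well as the initial <code>o.Albedo = vec3 (0.0)</code> that <code>surf()</code> overwrites.</p>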
<p>Difference between non-optimized and optimized GLSL in this case:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseNo.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseNo.jpg" alt="" title="Per-pixel Diffuse, GLSL not optimized" width="160" height="240" /></a><a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseYes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseYes.jpg" alt="" title="Per-pixel Diffuse, optimized GLSL" width="160" height="240" /></a><br />
Non-optimized vs. optimized: <b>350 vs. 267 ms/frame</b> (2.9 vs. 3.7 FPS). Not bad either!</p>
<p><b>Closing thoughts</b></p>
<p>Pulling off this GLSL optimizer quite late in <a href="http://unity3d.com/unity/whats-new/unity-3">Unity 3.0</a> release cycle was a risky move, but it did work.</p>
<p>Hats off to the Mesa folks (Eric Anholt, Ian Romanick, Kenneth Graunke et al) for making an awesome GLSL compiler codebase! I haven&#8217;t yet merged the latest GLSL compiler developments from the Mesa tree; they&#8217;ve implemented quite a few new compiler optimizations, but I was too busy shipping Unity 3. Will try to merge them in soon-ish.</p>
<p>I&#8217;ve tested non-optimized vs. optimized GLSL a bit on a desktop platform (MacBook Pro, GeForce 8600M, OS X 10.6.4) and there is no observable speed difference. Which makes sense, and I <i>would have expected</i> mobile drivers to be good at optimization as well, but apparently that&#8217;s not the case.</p>
<p>Now of course, mobile drivers will improve over time, and I hope offline &#8220;GLSL optimization&#8221; step will become obsolete in the future. I still think it makes perfect sense to fully compile shaders offline, so at runtime there&#8217;s no trace of GLSL at all (just load binary blob of GPU microcode into the driver), but that&#8217;s a story for another day.</p>
<p>In the meantime, you&#8217;re welcome to try <a href="http://github.com/aras-p/glsl-optimizer">GLSL Optimizer</a> out!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/09/29/glsl-optimizer/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Surface Shaders, one year later</title>
		<link>http://aras-p.info/blog/2010/07/16/surface-shaders-one-year-later/</link>
		<comments>http://aras-p.info/blog/2010/07/16/surface-shaders-one-year-later/#comments</comments>
		<pubDate>Fri, 16 Jul 2010 06:38:43 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=530</guid>
		<description><![CDATA[Over a year ago I had a thought that &#8220;Shaders must die&#8221; (part 1, part 2, part 3). And what do you know &#8211; turns out we&#8217;re trying to pull this off in upcoming Unity 3. We call this Surface Shaders cause I&#8217;ve a suspicion &#8220;shaders must die&#8221; as a feature name wouldn&#8217;t have flown [...]]]></description>
			<content:encoded><![CDATA[<p>Over a year ago I had a thought that &#8220;Shaders must die&#8221; (<a href="http://aras-p.info/blog/2009/05/05/shaders-must-die/">part 1</a>, <a href="http://aras-p.info/blog/2009/05/07/shaders-must-die-part-2/">part 2</a>, <a href="http://aras-p.info/blog/2009/05/10/shaders-must-die-part-3/">part 3</a>).</p>
<p>And what do you know &#8211; it turns out we&#8217;re trying to pull this off in the upcoming <a href="http://unity3d.com/unity/coming-soon/unity-3">Unity 3</a>. We call this <strong>Surface Shaders</strong>, because I have a suspicion that &#8220;shaders must die&#8221; as a feature name wouldn&#8217;t have flown very far.</p>
<p><span id="more-530"></span></p>
<p><strong>Idea</strong></p>
<p>The main idea is that 90% of the time I just want to declare surface properties. This is what I want to say:</p>
<blockquote><p>Hey, albedo comes from this texture mixed with this texture, and normal comes from this normal map. Use Blinn-Phong lighting model please, and don&#8217;t bother me again!</p></blockquote>
<p>With the above, I don&#8217;t have to care whether this will be used in forward or deferred rendering, or how various light types will be handled, or how many lights per pass will be done in a forward renderer, or how indirect illumination from SH probes will come in, etc. I&#8217;m not interested in all that! These dirty bits are the job of rendering programmers, <em>just make it work dammit</em>!</p>
<p>This is not a new idea. Most graphical shader editors <em>that make sense</em> do not have &#8220;pixel color&#8221; as the final output node; instead they have some node that basically describes surface parameters (diffuse, specularity, normal, &#8230;), and all the lighting code is usually not expressed in the shader graph itself. <a href="http://code.google.com/p/openshadinglanguage/">OpenShadingLanguage</a> is a similar idea as well (but because it&#8217;s targeted at offline rendering for movies, it&#8217;s much richer &#038; more complex).</p>
<p><strong>Example</strong></p>
<p>Here&#8217;s a simple &#8211; but full &#038; complete &#8211; Unity 3.0 shader that does diffuse lighting with a texture &#038; a normal map.<br />
<code>
<pre>
  <span style="color:gray">Shader "Example/Diffuse Bump" {
    Properties {
      _MainTex ("Texture", 2D) = "white" {}
      _BumpMap ("Bumpmap", 2D) = "bump" {}
    }
    SubShader {
      Tags { "RenderType" = "Opaque" }
      CGPROGRAM</span>
      #pragma surface surf Lambert
      struct Input {
        float2 uv_MainTex;
        float2 uv_BumpMap;
      };
      sampler2D _MainTex;
      sampler2D _BumpMap;
      void surf (Input IN, inout SurfaceOutput o) {
        o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
        o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_BumpMap));
      }
      <span style="color:gray">ENDCG
    }
    Fallback "Diffuse"
  }</span></pre>
<p></code><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderDiffuseBump.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderDiffuseBump-150x150.png" alt="" title="SurfaceShaderDiffuseBump" width="150" height="150" class="alignright size-thumbnail wp-image-543" /></a>Given pretty model &#038; textures, it can produce pretty pictures! How cool is that?</p>
<p>I grayed out bits that are not really interesting (declaration of serialized shader properties &#038; their UI names, shader fallback for older machines etc.). What&#8217;s left is Cg/HLSL code, which is then augmented by tons of auto-generated code that deals with lighting &#038; whatnot.</p>
<p>This surface shader dissected into pieces:</p>
<ul>
<li><code>#pragma surface surf Lambert</code>: this is a surface shader with the main function &#8220;surf&#8221; and the Lambert lighting model. Lambert is one of the predefined lighting models, but you can write your own.</li>
<li><code>struct Input</code>: input data for the surface shader. This can have various predefined inputs that will be computed per-vertex &#038; passed into your surface function per-pixel. In this case, it&#8217;s two texture coordinates.</li>
<li><code>surf</code> function: actual surface shader code. It takes Input, and writes into <code>SurfaceOutput</code> (a predefined structure). It is possible to write into custom structures, provided you use lighting models that operate on those structures. The actual code just writes Albedo and Normal to the output.</li>
</ul>
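<p>The division of labor is easier to see outside of a shader. This C++ sketch mirrors the structure of the generated code shown in the GLSL Optimizer post above: the author writes only <code>surf()</code>, which fills in surface properties; a separate lighting model function turns them into a color; and generated glue calls both. It is a CPU-side illustration with simplified, hypothetical types (<code>Vec3</code>, <code>ShadePixel</code>, an <code>Input</code> that carries already-sampled values), not Unity&#8217;s actual generator output. The factor of 2 in <code>LightingLambert</code> follows the generated GLSL shown earlier.</p>

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float x, y, z; };
Vec3 operator*(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Surface properties only: no lights, passes or renderer details.
struct SurfaceOutput {
    Vec3 Albedo{0.0f, 0.0f, 0.0f};
    Vec3 Normal{0.0f, 0.0f, 1.0f};
    float Alpha = 0.0f;
};

// Stands in for interpolated inputs plus texture fetches.
struct Input { Vec3 textureColor; Vec3 normal; };

// Written by the shader author: "albedo comes from here, normal from there".
void surf(const Input& IN, SurfaceOutput& o) {
    o.Albedo = IN.textureColor;
    o.Normal = IN.normal;
    o.Alpha = 1.0f;
}

// A lighting model: turns surface properties into a lit color.
Vec3 LightingLambert(const SurfaceOutput& s, Vec3 lightDir, Vec3 lightColor, float atten) {
    assert(atten >= 0.0f);
    float diff = std::max(0.0f, Dot(s.Normal, lightDir));
    return (s.Albedo * lightColor) * (diff * atten * 2.0f);
}

// Generated glue: the part that would differ per renderer and light type.
Vec3 ShadePixel(const Input& IN, Vec3 lightDir, Vec3 lightColor, float atten) {
    SurfaceOutput o;
    surf(IN, o);
    return LightingLambert(o, lightDir, lightColor, atten);
}
```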
<p><strong>What is generated</strong></p>
<p>Unity&#8217;s &#8220;surface shader code generator&#8221; would take this, generate <em>actual</em> vertex &#038; pixel shaders, and compile them to various target platforms. With default settings in Unity 3.0, it would make this shader support:</p>
<ul>
<li>Forward renderer and Deferred Lighting (Light Pre-Pass) renderer.</li>
<li>Objects with precomputed lightmaps and without.</li>
<li>Directional, Point and Spot lights; with projected light cookies or without; with shadowmaps or without. Well, OK, this applies only to the forward renderer, because in Light Pre-Pass the lighting happens elsewhere.</li>
<li>For Forward renderer, it would compile in support for lights computed per-vertex and spherical harmonics lights computed per-object. It would also generate extra additive blended pass if needed for the case when additional per-pixel lights have to be rendered in separate passes.</li>
<li>For Light Pre-Pass renderer, it would generate base pass that outputs normals &#038; specular power; and a final pass that combines albedo with lighting, adds in any lightmaps or emissive lighting etc.</li>
<li>It can optionally generate a shadow caster rendering pass (needed if custom vertex position modifiers are used for vertex shader based animation; or some complex alpha-test effects are done).</li>
</ul>
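<p>The light-related bullets above multiply quickly. Here is a back-of-the-envelope count in Python showing why writing every combination by hand got tedious. The feature axes are my own paraphrase of the list, not Unity's actual shader keywords.</p>

```python
from itertools import product

# Illustrative feature axes paraphrased from the list above (not Unity keyword names).
LIGHT_TYPES = ("directional", "point", "spot")
COOKIE = (False, True)
SHADOWS = (False, True)

def forward_light_variants():
    """All light type / cookie / shadow combinations a forward lighting pass must handle."""
    return [
        {"light": light, "cookie": cookie, "shadows": shadows}
        for light, cookie, shadows in product(LIGHT_TYPES, COOKIE, SHADOWS)
    ]

variants = forward_light_variants()
print(len(variants))  # 12, i.e. 3 * 2 * 2
```

<p>That count comes before multiplying by lightmapped vs. non-lightmapped objects and by the two renderers.</p>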
<p>For example, here&#8217;s code that would be compiled for a forward-rendered base pass with one directional light, 4 per-vertex point lights, 3rd order SH lights; optional lightmaps <em>(I suggest just scrolling down)</em>: </p>
<pre style="font-size: 75%;">
#pragma vertex vert_surf
#pragma fragment frag_surf
#pragma fragmentoption ARB_fog_exp2
#pragma fragmentoption ARB_precision_hint_fastest
#pragma multi_compile_fwdbase
#include "HLSLSupport.cginc"
#include "UnityCG.cginc"
#include "Lighting.cginc"
#include "AutoLight.cginc"
struct Input {
	float2 uv_MainTex : TEXCOORD0;
};
sampler2D _MainTex;
sampler2D _BumpMap;
void surf (Input IN, inout SurfaceOutput o)
{
	o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
	o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_MainTex));
}
struct v2f_surf {
  V2F_POS_FOG;
  float2 hip_pack0 : TEXCOORD0;
  #ifndef LIGHTMAP_OFF
  float2 hip_lmap : TEXCOORD1;
  #else
  float3 lightDir : TEXCOORD1;
  float3 vlight : TEXCOORD2;
  #endif
  LIGHTING_COORDS(3,4)
};
#ifndef LIGHTMAP_OFF
float4 unity_LightmapST;
#endif
float4 _MainTex_ST;
v2f_surf vert_surf (appdata_full v) {
  v2f_surf o;
  PositionFog( v.vertex, o.pos, o.fog );
  o.hip_pack0.xy = TRANSFORM_TEX(v.texcoord, _MainTex);
  #ifndef LIGHTMAP_OFF
  o.hip_lmap.xy = v.texcoord1.xy * unity_LightmapST.xy + unity_LightmapST.zw;
  #endif
  float3 worldN = mul((float3x3)_Object2World, SCALED_NORMAL);
  TANGENT_SPACE_ROTATION;
  #ifdef LIGHTMAP_OFF
  o.lightDir = mul (rotation, ObjSpaceLightDir(v.vertex));
  #endif
  #ifdef LIGHTMAP_OFF
  float3 shlight = ShadeSH9 (float4(worldN,1.0));
  o.vlight = shlight;
  #ifdef VERTEXLIGHT_ON
  float3 worldPos = mul(_Object2World, v.vertex).xyz;
  o.vlight += Shade4PointLights (
    unity_4LightPosX0, unity_4LightPosY0, unity_4LightPosZ0,
    unity_LightColor0, unity_LightColor1, unity_LightColor2, unity_LightColor3,
    unity_4LightAtten0, worldPos, worldN );
  #endif // VERTEXLIGHT_ON
  #endif // LIGHTMAP_OFF
  TRANSFER_VERTEX_TO_FRAGMENT(o);
  return o;
}
#ifndef LIGHTMAP_OFF
sampler2D unity_Lightmap;
#endif
half4 frag_surf (v2f_surf IN) : COLOR {
  Input surfIN;
  surfIN.uv_MainTex = IN.hip_pack0.xy;
  SurfaceOutput o;
  o.Albedo = 0.0;
  o.Emission = 0.0;
  o.Specular = 0.0;
  o.Alpha = 0.0;
  o.Gloss = 0.0;
  surf (surfIN, o);
  half atten = LIGHT_ATTENUATION(IN);
  half4 c;
  #ifdef LIGHTMAP_OFF
  c = LightingLambert (o, IN.lightDir, atten);
  c.rgb += o.Albedo * IN.vlight;
  #else // LIGHTMAP_OFF
  half3 lmFull = DecodeLightmap (tex2D(unity_Lightmap, IN.hip_lmap.xy));
  #ifdef SHADOWS_SCREEN
  c.rgb = o.Albedo * min(lmFull, atten*2);
  #else
  c.rgb = o.Albedo * lmFull;
  #endif
  c.a = o.Alpha;
  #endif // LIGHTMAP_OFF
  return c;
}
</pre>
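<p>The interesting part of <code>frag_surf</code> is the final combine. Here it is transcribed into Python per color channel. This is a sketch for reading the shader, not anything Unity runs. <code>lambert_rgb</code> stands for the result of <code>LightingLambert</code>, and passing <code>lightmap=None</code> stands for the <code>LIGHTMAP_OFF</code> variant (both are my naming).</p>

```python
def frag_combine(albedo, alpha, lambert_rgb, vlight, atten,
                 lightmap=None, shadows_screen=False):
    """Mirror of the final combine in frag_surf, one RGB tuple at a time."""
    if lightmap is None:
        # LIGHTMAP_OFF: main light from the lighting model, plus SH and
        # per-vertex point lights that the vertex shader summed into vlight.
        rgb = tuple(lit + a * v for lit, a, v in zip(lambert_rgb, albedo, vlight))
    elif shadows_screen:
        # Lightmapped with realtime shadows: min() means the shadow term can
        # darken the baked lighting but never brighten it.
        rgb = tuple(a * min(lm, atten * 2.0) for a, lm in zip(albedo, lightmap))
    else:
        # Lightmapped without realtime shadows: baked lighting only.
        rgb = tuple(a * lm for a, lm in zip(albedo, lightmap))
    return rgb + (alpha,)

print(frag_combine((0.5, 0.5, 0.5), 1.0, (0.25, 0.25, 0.25), (0.5, 0.5, 0.5), 1.0))
# (0.5, 0.5, 0.5, 1.0)
print(frag_combine((0.5, 0.5, 0.5), 1.0, None, None, 0.25,
                   lightmap=(1.0, 1.0, 1.0), shadows_screen=True))
# (0.25, 0.25, 0.25, 1.0)
```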
<p>Of those 90 lines of code, 10 are your original surface shader code; the remaining 80 would have had to be written pretty much by hand in the Unity 2.x days (well ok, less of it, because 2.x had fewer rendering features). <em>But wait</em>, that was only the base pass of the forward renderer! It also generates code for the additive pass, the deferred base pass, the deferred final pass, optionally the shadow caster pass and so on.</p>
<p>So this should make lit shaders easier to write (it does for me, at least). I hope this will also increase the number of Unity users who can write shaders at least threefold <em>(i.e. up from 10 to 30!)</em>. It <em>should</em> also be more future-proof, accommodating changes to the lighting pipeline we&#8217;ll make in Unity next.</p>
<p><strong>Predefined Input values</strong></p>
<p>The Input structure can contain texture coordinates and some predefined values, for example the view direction, world space position, world space reflection vector and so on. Code to compute them is only generated if they are <em>actually</em> used. For example, if you use the world space reflection to do some cubemap reflections (as an emissive term) in your surface shader, then in the Light Pre-Pass base pass the reflection vector will <em>not be computed</em> (that pass does not output emission, so by extension it does not need the reflection vector).</p>
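<p>The &#8220;only compute what is used&#8221; rule boils down to dependency tracking. Below is a toy Python model. The table of which <code>SurfaceOutput</code> fields each pass reads, and which inputs feed which fields, is made up for illustration. The real generator gets this information from the Cg compiler, as described further below, not from hand-written tables.</p>

```python
# Hypothetical tables for a shader that does cubemap reflections as emission.
PASS_READS = {
    "forward_base": {"Albedo", "Normal", "Emission", "Alpha"},
    "prepass_base": {"Normal", "Specular"},  # outputs normals and specular power
    "prepass_final": {"Albedo", "Emission", "Alpha"},
}
INPUT_FEEDS = {
    "uv_MainTex": {"Albedo"},
    "uv_BumpMap": {"Normal"},
    "worldRefl": {"Emission"},
}

def inputs_needed(pass_name):
    """Inputs that feed at least one field the pass actually reads."""
    reads = PASS_READS[pass_name]
    return sorted(name for name, feeds in INPUT_FEEDS.items()
                  if not feeds.isdisjoint(reads))

print(inputs_needed("prepass_base"))  # ['uv_BumpMap']
```

<p>In this model the reflection vector drops out of the Light Pre-Pass base pass, and the normal map's texture coordinate drops out of the final pass.</p>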
<p><a href="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderRim.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderRim-150x150.png" alt="" title="SurfaceShaderRim" width="150" height="150" class="alignright size-thumbnail wp-image-545" /></a>As a small example, here is the shader above extended to do simple rim lighting:<br />
<code>
<pre>
  <span style="color:gray">#pragma surface surf Lambert
  struct Input {
      float2 uv_MainTex;
      float2 uv_BumpMap;</span>
      float3 viewDir;
  <span style="color:gray">};
  sampler2D _MainTex;
  sampler2D _BumpMap;</span>
  float4 _RimColor;
  float _RimPower;
  <span style="color:gray">void surf (Input IN, inout SurfaceOutput o) {
      o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
      o.Normal = UnpackNormal (tex2D (_BumpMap, IN.uv_BumpMap));</span>
      half rim =
          1.0 - saturate(dot (normalize(IN.viewDir), o.Normal));
      o.Emission = _RimColor.rgb * pow (rim, _RimPower);
  <span style="color:gray">}</span>
</pre>
</code></p>
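<p>The rim term itself is simple enough to check numerically. Here is a Python sketch of the same math, assuming the view direction and the normal are expressed in the same space.</p>

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def saturate(x):
    return min(1.0, max(0.0, x))

def rim_emission(view_dir, normal, rim_color, rim_power):
    """Same math as the surf function: 0 facing the viewer, rim_color at silhouettes."""
    rim = 1.0 - saturate(sum(a * b for a, b in zip(normalize(view_dir), normal)))
    return tuple(c * rim ** rim_power for c in rim_color)

print(rim_emission((0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.2, 0.4, 0.6), 3.0))  # (0.0, 0.0, 0.0)
print(rim_emission((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.2, 0.4, 0.6), 3.0))  # (0.2, 0.4, 0.6)
```

<p>Raising <code>_RimPower</code> pushes the glow closer to the silhouette, since <code>rim</code> is below 1 everywhere else.</p>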
<p><strong>Vertex shader modifiers</strong></p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderNormalExtrusion.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfaceShaderNormalExtrusion-150x150.png" alt="" title="SurfaceShaderNormalExtrusion" width="150" height="150" class="alignright size-thumbnail wp-image-551" /></a>It is possible to specify a custom &#8220;vertex modifier&#8221; function that will be called at the start of the generated vertex shader, to modify (or generate) per-vertex data. You know, vertex shader based tree wind animation, grass billboard extrusion and so on. It can also fill in any non-predefined values in the Input structure.</p>
<p>My favorite vertex modifier? Moving vertices along their normals.</p>
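<p>Moving vertices along their normals takes one line per vertex. Here is a Python sketch of what such a vertex modifier does to the data before any lighting code runs. The function names and the <code>amount</code> parameter are illustrative, not part of Unity's API.</p>

```python
def extrude(position, normal, amount):
    """Offset one vertex along its (unit) normal; negative amounts shrink the mesh."""
    return tuple(p + n * amount for p, n in zip(position, normal))

def extrude_mesh(positions, normals, amount):
    return [extrude(p, n, amount) for p, n in zip(positions, normals)]

print(extrude_mesh([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                   [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], 0.5))
# [(1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
```

<p>This also explains why the list above mentions a separate shadow caster pass when vertex modifiers move vertices: shadows have to be cast from the modified positions too.</p>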
<p><strong>Custom Lighting Models</strong></p>
<p>There are a couple of simple lighting models built in, but it&#8217;s possible to specify your own. A lighting model is nothing more than a function that will be called with the filled SurfaceOutput structure and per-light parameters (direction, attenuation and so on). Different functions are called in the forward &#038; light pre-pass rendering cases, and naturally the light pre-pass one has much less flexibility. So for any fancy effects, it is possible to say &#8220;do not compile this shader for light pre-pass&#8221;, in which case it will be rendered via forward rendering.</p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfWrapLambert.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/07/SurfWrapLambert-150x150.png" alt="" title="SurfWrapLambert" width="150" height="150" class="alignright size-thumbnail wp-image-549" /></a>An example of a wrapped-Lambert lighting model:<br />
<code>
<pre>
  #pragma surface surf WrapLambert
  half4 LightingWrapLambert (SurfaceOutput s, half3 dir, half atten) {
      dir = normalize(dir);
      half NdotL = dot (s.Normal, dir);
      half diff = NdotL * 0.5 + 0.5;
      half4 c;
      c.rgb = s.Albedo * _LightColor0.rgb * (diff * atten * 2);
      c.a = s.Alpha;
      return c;
  }
  <span style="color:gray">struct Input {
      float2 uv_MainTex;
  };
  sampler2D _MainTex;
  void surf (Input IN, inout SurfaceOutput o) {
      o.Albedo = tex2D (_MainTex, IN.uv_MainTex).rgb;
  }</span></pre>
</code></p>
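<p>Wrapped Lambert remaps <code>N.L</code> from [-1, 1] to [0, 1] instead of clamping negative values to zero, so light &#8220;wraps&#8221; past the terminator. Below is a small Python comparison. The general <code>wrap</code> parameter is a common variant of the technique and my own addition, not something the example above uses. With <code>wrap = 1</code> it reduces to the <code>NdotL * 0.5 + 0.5</code> in <code>LightingWrapLambert</code>.</p>

```python
def lambert(n_dot_l):
    return max(0.0, n_dot_l)

def wrap_lambert(n_dot_l, wrap=1.0):
    """wrap=0 gives plain clamped Lambert; wrap=1 gives NdotL * 0.5 + 0.5."""
    return max(0.0, (n_dot_l + wrap) / (1.0 + wrap))

for n_dot_l in (1.0, 0.0, -0.5, -1.0):
    print(n_dot_l, lambert(n_dot_l), wrap_lambert(n_dot_l))
# 1.0 1.0 1.0
# 0.0 0.0 0.5
# -0.5 0.0 0.25
# -1.0 0.0 0.0
```

<p>The full lighting function then scales this by the albedo, <code>_LightColor0</code> and <code>atten * 2</code>, exactly as the shader code does.</p>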
<p><strong>Behind the scenes</strong></p>
<p>I&#8217;m using the HLSL parser from Ryan Gordon&#8217;s <a href="http://hg.icculus.org/icculus/mojoshader/">mojoshader</a> to parse the original surface shader code and infer some things from the AST mojoshader produces. This way I can figure out what members are in what structures, go over function prototypes and so on. At this stage some error checking is done to tell the user that his surface function has the wrong prototype, or that his structures are missing required members &#8211; which is much better than failing with dozens of compile errors in the generated code later.</p>
<p>To figure out which surface shader inputs are <em>actually</em> used in the various lighting passes, I&#8217;m generating small dummy pixel shaders, compiling them with Cg and using Cg&#8217;s API to query used inputs &#038; outputs. This way I can figure out, for example, that neither a normal map nor its texture coordinate is actually used in Light Pre-Pass&#8217; final pass, and save some vertex shader instructions &#038; a texcoord interpolator.</p>
<p>The code that is ultimately generated is compiled with various shader compilers depending on the target platform (Cg for PC/Mac, XDK HLSL for Xbox 360, PS3 Cg for PS3, and my own <a href="https://github.com/aras-p/hlsl2glslfork">fork of HLSL2GLSL</a> for iPhone, Android and the upcoming <a href="http://blogs.unity3d.com/2010/05/19/google-android-and-the-future-of-games-on-the-web/">NativeClient port of Unity</a>).</p>
<p>So yeah, that&#8217;s it. We&#8217;ll see where this goes next, or what happens once Unity 3 is released.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/07/16/surface-shaders-one-year-later/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Compiling HLSL into GLSL in 2010</title>
		<link>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/</link>
		<comments>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/#comments</comments>
		<pubDate>Fri, 21 May 2010 19:59:38 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[d3d]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=523</guid>
		<description><![CDATA[Realtime shader languages these days have settled down into two camps: HLSL (or Cg, which for all practical reasons is the same) and GLSL (or GLSL ES, which is sufficiently similar). HLSL/Cg is used by Direct3D and the big consoles (Xbox 360, PS3). GLSL/ES is used by OpenGL and pretty much all modern mobile platforms [...]]]></description>
			<content:encoded><![CDATA[<p>Realtime shader languages these days have settled down into two camps: HLSL (or Cg, which for all practical reasons is the same) and GLSL (or GLSL ES, which is sufficiently similar). HLSL/Cg is used by Direct3D and the big consoles (Xbox 360, PS3). GLSL/ES is used by OpenGL and pretty much all modern mobile platforms (iPhone, Android, &#8230;).</p>
<p>Since shaders are more or less &#8220;assets&#8221;, having two different languages to deal with is not very nice. What, I&#8217;m supposed to write my shader twice just to support both (for example) D3D and iPad? You would think that in 2010, almost a decade after high level realtime shader languages appeared, this problem would be solved&#8230; but it isn&#8217;t!</p>
<p><span id="more-523"></span>In <a href="http://unity3d.com/unity/coming-soon/unity-3">upcoming Unity 3.0</a>, we&#8217;re going to have OpenGL ES 2.0 for mobile platforms, where GLSL ES is the only option to write shaders in. However, almost all other platforms (Windows, 360, PS3) need HLSL/Cg.</p>
<p>I tried for a bit to make <a href="http://developer.nvidia.com/object/cg_toolkit.html">Cg</a> spit out GLSL code. In theory it can, and I read somewhere that <a href="http://en.wikipedia.org/wiki/Id_Software">id</a> uses it for the OpenGL backend of <a href="http://en.wikipedia.org/wiki/Rage_(video_game)">Rage</a>&#8230; But I just couldn&#8217;t make it work. What&#8217;s possible for <a href="http://en.wikipedia.org/wiki/John_Carmack">John</a> apparently is not possible for mere mortals.</p>
<p>Then I looked at ATI&#8217;s <a href="https://github.com/aras-p/hlsl2glslfork">HLSL2GLSL</a>. That did produce GLSL shaders that were not absolutely horrible. So I started using it, and <em>(surprise!)</em> quickly ran into small issues here and there. Too bad development of the library stopped around 2006&#8230; on the plus side, it&#8217;s open source!</p>
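<p>Many HLSL/GLSL differences are just spelling, which is part of why a translator can produce readable output at all. Here is a deliberately naive Python sketch of that surface layer (whole-word renaming of a few well-known names). It is <em>not</em> how HLSL2GLSL works; that tool actually parses the source.</p>

```python
import re

# A few well-known spelling differences between HLSL/Cg and GLSL (ES).
HLSL_TO_GLSL = {
    "float2": "vec2",
    "float3": "vec3",
    "float4": "vec4",
    "float4x4": "mat4",
    "lerp": "mix",
    "frac": "fract",
    "tex2D": "texture2D",
}

_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, HLSL_TO_GLSL)) + r")\b")

def naive_translate(src):
    """Whole-word renaming only: no parsing, no semantics, no matrix-order fixes."""
    return _PATTERN.sub(lambda m: HLSL_TO_GLSL[m.group(1)], src)

print(naive_translate("float4 c = lerp(a, b, frac(t));"))  # vec4 c = mix(a, b, fract(t));
```

<p>Anything beyond renaming needs a real parser. Examples include <code>saturate(x)</code> becoming <code>clamp(x, 0.0, 1.0)</code>, <code>mul(M, v)</code> versus GLSL's <code>*</code> operator, and semantics like <code>TEXCOORD0</code> turning into varyings.</p>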
<p>So I just forked it. Here it is: <a href="http://code.google.com/p/hlsl2glslfork/"><strong>http://code.google.com/p/hlsl2glslfork/</strong></a> (<a href="https://github.com/aras-p/hlsl2glslfork/commits/master">commit log here</a>). There are no prebuilt binaries or source drops right now, just a Mercurial repository. BSD license. Patches welcome.</p>
<p><em>Note on the codebase</em>: I don&#8217;t particularly like the codebase. It seems to be somewhat over-engineered code that was probably taken from the reference GLSL parser 3DLabs once wrote, and then adapted to parse HLSL and spit out GLSL. There are pieces of code that are unused, unfinished or duplicated. Judging from comments, some pieces of code have been in the hands of 3DLabs, ATI and NVIDIA (what good can come out of <em>that</em>?!). However, it <em>works</em>, and that&#8217;s the most important trait any code can have.</p>
<p><em>Note on the preprocessor</em>: I bumped into some preprocessor issues that couldn&#8217;t be easily fixed without first understanding someone else&#8217;s ancient code and then changing it significantly. Fortunately, Ryan Gordon&#8217;s project, <a href="http://icculus.org/mojoshader/">MojoShader</a>, happens to have a preprocessor that very closely emulates HLSL&#8217;s (including various quirks). So I&#8217;m using that to preprocess any source before passing it down to HLSL2GLSL. Kudos to Ryan!</p>
<p><em>Side note on MojoShader</em>: Ryan is also working on an HLSL->GLSL cross compiler in MojoShader. I like that codebase much more, and will certainly try it out once it&#8217;s somewhat ready.</p>
<p><em>You can never have enough notes</em>: Google&#8217;s <a href="http://code.google.com/p/angleproject/">ANGLE project</a> (running OpenGL ES 2.0 on top of Direct3D runtime+drivers) seems to be working on the opposite tool. For obvious reasons, they need to take GLSL ES shaders and produce D3D compatible shaders (HLSL or shader assembly/bytecode). The project seems to be moving fast, and if one day we decide to default to GLSL as the shader language in Unity, I&#8217;ll know where to look for a translator into HLSL :)</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>GDC 2010 report</title>
		<link>http://aras-p.info/blog/2010/03/17/gdc-2010-report/</link>
		<comments>http://aras-p.info/blog/2010/03/17/gdc-2010-report/#comments</comments>
		<pubDate>Wed, 17 Mar 2010 12:30:26 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=507</guid>
		<description><![CDATA[Just returned from exciting (and exhausting) trip to Game Developers Conference 2010. Random notes: Unity It seems that everyone is talking about Unity this year. At GDC 2009 some people have heard about us, some others were &#8220;where the f*** this came from?!&#8221;, and some had no idea what Unity is. This year it&#8217;s hard [...]]]></description>
			<content:encoded><![CDATA[<p>Just returned from exciting (and exhausting) trip to <a href="http://www.gdconf.com/">Game Developers Conference 2010</a>. Random notes:</p>
<p><strong>Unity</strong></p>
<p>It seems that everyone is talking about Unity this year. At GDC 2009 some people had heard about us, some others were &#8220;where the f*** this came from?!&#8221;, and some had no idea what Unity is. This year it&#8217;s hard to find anyone who hasn&#8217;t heard about Unity. I was surprised by the number of AAA developers who are playing around with Unity internally (for prototyping, mobile &#038; whatnot) and/or are big fans of Unity. <em>I like!</em></p>
<p>We had a cool booth that was very busy at all times. As a bonus, the Unity chairs could be used as weapons!<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth1.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth1-150x150.jpg" alt="" title="Unity Booth at GDC2010" width="150" height="150" class="alignnone size-thumbnail wp-image-508" /></a> <a href="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth2.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth2-150x150.jpg" alt="" title="Busy!" width="150" height="150" class="alignnone size-thumbnail wp-image-509" /></a> <a href="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth3.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2010/03/UnityBooth3-150x150.jpg" alt="" title="Unity Transformers! (Uniformers?)" width="150" height="150" class="alignnone size-thumbnail wp-image-510" /></a></p>
<p>Awesome quote: CEO of <del datetime="2010-03-17T11:16:46+00:00">censored</del> (competing middleware company) said: &#8220;yeah, Unity is going up, we are going down&#8221;. This is taken <em>completely</em> out of context of course.</p>
<p>We were busy demoing <a href="http://unity3d.com/unity/coming-soon/unity-3">upcoming Unity 3</a> which I think will be quite awesome. Three days before the conference were spent crunching on the demos for GDC :)</p>
<p><strong>Cool Stuff</strong></p>
<p><em>Only managed to go to two sessions :(</em></p>
<p><a href="http://twitter.com/self_shadow">Stephen Hill</a>&#8217;s &#8220;<a href="https://www.cmpevents.com/GD10/a.asp?option=C&#038;V=11&#038;SessID=10333">Rendering Tools and Techniques of Splinter Cell: Conviction</a>&#8221; had interesting bits &#038; pieces of stuff. Nice work on hierarchical Z occlusion and ambient occlusion fields! (probably the first time I&#8217;ve seen AO fields used in actual game production)</p>
<p><a href="http://twitter.com/mike_acton">Mike Acton</a>&#8217;s &#8220;<a href="https://www.cmpevents.com/GD10/a.asp?option=C&#038;V=11&#038;SessID=10892">Three Big Lies: Typical Design Failures in Game Programming</a>&#8221; was entertaining. Content-wise I pretty much knew what to expect. If you aren&#8217;t following Mike &#8211; <a href="http://www.youtube.com/watch#!v=u6ALySsPXt0">do it now</a>! Talk slides are <a href="http://www.insomniacgames.com/research_dev/articles/2010/1522262">at Insomniac&#8217;s site</a>.</p>
<p><a href="http://www.radgametools.com/">RAD</a>&#8217;s Telemetry profiler looks totally sweet. <em>I think</em> they acquired <a href="http://www.youtube.com/watch?v=oKRJNUvIJlg">this one</a> and improved it. Some <em>very</em> good UI ideas in there. On a related note, Scaleform&#8217;s new profiler looks&#8230; <em>kinda inspired</em> by Unity&#8217;s (<a href="http://aras-p.info/blog/wp-content/uploads/2010/03/ScaleformVsUnityProfiler.png">comparison: Scaleform on the left, Unity on the right</a>).</p>
<p><strong>Fun Stuff</strong></p>
<p><em>Managed to sneak in some fun (dare I say &#8220;social&#8221;?) stuff.</em></p>
<p><a href="http://twitter.com/repi/status/10461765908">Rendering folks dinner</a> <em>(thanks <a href="http://repi.se/">Johan</a>!)</em> was awesome, even if it made me feel kinda small &#038; stupid among those super smart guys &#038; gals. <a href="http://img709.yfrog.com/i/ozz.jpg/">Shadow algorithms</a> on receipts FTW! <a href="http://whatmakesyouthinkimnot.wordpress.com/2010/02/18/middleware-meetup/">Middleware Meetup</a> <em>(thanks <a href="http://whatmakesyouthinkimnot.wordpress.com/about/">Dan</a>!)</em> was full of friendly competitors :) <a href="http://macton.posterous.com/initial-details-for-our-gdc-tweetups-gdcdrink">#gdcdrink</a> tweetup <em>(thanks <a href="http://twitter.com/mike_acton">Mike</a>!)</em> had lots of war stories, PS3 talk, and discussion of how to do fluid simulation on the 360&#8217;s pixel shaders.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/03/17/gdc-2010-report/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Screenspace vs. mip-mapping</title>
		<link>http://aras-p.info/blog/2010/01/07/screenspace-vs-mip-mapping/</link>
		<comments>http://aras-p.info/blog/2010/01/07/screenspace-vs-mip-mapping/#comments</comments>
		<pubDate>Thu, 07 Jan 2010 14:27:55 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=485</guid>
		<description><![CDATA[Just spent half a day debugging this, so here it is for the future reference of the internets. In a deferred rendering setup (see Game Angst for a good discussion of deferred shading &#038; lighting), lights are applied using data from screen-space buffers. Position, normal and other things are reconstructed from buffers and lighting is [...]]]></description>
			<content:encoded><![CDATA[<p><em>Just spent half a day debugging this, so here it is for the future reference of the internets.</em></p>
<p>In a deferred rendering setup (see <a href="http://gameangst.com/?p=141">Game Angst</a> for a good discussion of deferred shading &#038; lighting), lights are applied using data from screen-space buffers. Position, normal and other things are reconstructed from buffers and lighting is computed &#8220;in screen space&#8221;.</p>
<p>Because each light is applied to a portion of the screen, the pixels it computes can belong to different objects. If you use textures with <a href="http://en.wikipedia.org/wiki/Mipmap">mipmaps</a> anywhere in the lighting computation, <em>be careful</em>. The most common use for mipmapped light textures is light &#8220;cookies&#8221; (aka <a href="http://en.wikipedia.org/wiki/Gobo_(lighting)">Gobo</a>).</p>
<p>Let&#8217;s say we have a very simple scene with a spot light: <span id="more-485"></span><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/01/DeferredCookieGood.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/DeferredCookieGood.png" alt="" title="Deferred Cookie (Good)" width="610" height="458" class="alignnone size-full wp-image-486" /></a></p>
<p>Light&#8217;s angular attenuation comes from a texture like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie128.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie128.png" alt="" title="cookie128" width="128" height="128" class="alignnone size-full wp-image-489" /></a></p>
<p>If the texture has mipmaps and you sample it using the &#8220;obvious&#8221; way (e.g. tex2Dproj), you can get something like this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/01/DeferredCookieBad.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/DeferredCookieBad.png" alt="" title="Deferred Cookie (Bad!)" width="610" height="458" class="alignnone size-full wp-image-491" /></a></p>
<p><em>Black stuff around the sphere is no good!</em> It&#8217;s not the infamous half-texel offset in D3D9, not a driver bug, not a shader compiler bug and not nature trying to prevent you from writing a deferred renderer.</p>
<p>It&#8217;s the mipmapping.</p>
<p>Mipmaps of your cookie texture look like this (128&#215;128, 16&#215;16, 8&#215;8, 4&#215;4 shown):<br />
<img src="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie128.png" alt="" title="128x128" width="128" height="128" /><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie16.png" alt="" title="16x16" width="128" height="128" /><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie8.png" alt="" title="8x8" width="128" height="128" /><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/cookie4.png" alt="" title="4x4" width="128" height="128" /></p>
<p>Now, take two adjacent pixels, where one belongs to the edge of the sphere, and the other belongs to the background object (technically you take a 2&#215;2 block of pixels, but just two are enough to illustrate the point). When the light is applied, cookie texture coordinates for those pixels are computed. It can happen that the coordinates are <em>very</em> different, especially when pixels &#8220;belong&#8221; to entirely different surfaces that are quite far away from each other.</p>
<p>What does the GPU do when the texture coordinates of adjacent pixels are very different? It chooses a smaller (lower-resolution) mipmap level, so that the texel to pixel density roughly matches 1:1. On the edges in this &#8220;wrong&#8221; screenshot, a very small mipmap level happens to be sampled, which is either black or white (see the 4&#215;4 mip level).</p>
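<p>The arithmetic behind this is roughly the standard isotropic LOD formula: take the screen-space differences of the texture coordinates between neighboring pixels, scale them to texels, and use log2 of the larger one. Here is a Python sketch of it. Real hardware approximates this per 2&#215;2 quad and may differ in details, and picking the nearest level is a simplification.</p>

```python
import math

def mip_level(duv_dx, duv_dy, tex_size):
    """Approximate LOD from UV differences between horizontally/vertically adjacent pixels."""
    fx = math.hypot(duv_dx[0] * tex_size, duv_dx[1] * tex_size)
    fy = math.hypot(duv_dy[0] * tex_size, duv_dy[1] * tex_size)
    rho = max(fx, fy)  # texels covered by a one-pixel step
    if rho == 0.0:
        return 0.0
    return max(0.0, math.log2(rho))

def mip_size(tex_size, level):
    """Size of the nearest mip level for a square power-of-two texture."""
    return max(1, tex_size // 2 ** round(level))

# Inside the sphere: neighbors differ by one texel of a 128x128 cookie.
print(mip_level((1 / 128, 0.0), (0.0, 1 / 128), 128))  # 0.0, i.e. the full 128x128 level
# On the sphere's edge: the background pixel's cookie UV is half the texture away.
edge = mip_level((0.5, 0.0), (0.0, 0.0), 128)
print(round(edge, 6), mip_size(128, edge))  # 6.0 2, i.e. a 2x2 mip
```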
<p>What to do here? You could disable mip-mapping (which is not good for performance and not good for image quality). You could drop the smallest mip levels, which might be enough and is not that bad for performance. Another option is to manually supply the LOD level or derivatives to the sampling instructions, computed from <em>something other</em> than the cookie texture coordinates. For example, derivatives of the view space position, or something like that. This might not be possible on lower shader models, though.</p>
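<p>The second option, dropping the smallest mips, amounts to capping the LOD. Here is a self-contained Python sketch under a simplified isotropic LOD model (log2 of the UV step between adjacent pixels, measured in texels; real hardware differs in details). <code>smallest_size</code> is an illustrative parameter for the smallest mip that is kept.</p>

```python
import math

def lod(duv, tex_size):
    """Simplified isotropic LOD from the UV step between adjacent pixels."""
    rho = math.hypot(duv[0], duv[1]) * tex_size
    return max(0.0, math.log2(rho)) if rho != 0.0 else 0.0

def clamped_lod(duv, tex_size, smallest_size):
    """LOD when mips smaller than smallest_size x smallest_size are dropped."""
    max_level = math.log2(tex_size / smallest_size)
    return min(lod(duv, tex_size), max_level)

# Edge pixel from the screenshot scenario: the UV jumps by half the texture.
print(round(lod((0.5, 0.0), 128), 6))              # 6.0 (2x2 mip)
print(round(clamped_lod((0.5, 0.0), 128, 16), 6))  # 3.0 (16x16 mip)
```

<p>Whether a 16&#215;16 level is still close enough to the full-resolution cookie depends on the texture, so the cap is a tuning knob rather than a guaranteed fix.</p>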
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/01/07/screenspace-vs-mip-mapping/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Four years ago today&#8230;</title>
		<link>http://aras-p.info/blog/2010/01/04/four-years-ago-today/</link>
		<comments>http://aras-p.info/blog/2010/01/04/four-years-ago-today/#comments</comments>
		<pubDate>Mon, 04 Jan 2010 17:54:41 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=466</guid>
		<description><![CDATA[&#8230;I took a plane to Copenhagen. Oh, this sounds familiar&#8230; Well ok, it all started a bit before: I exchanged some emails with David and Joachim and they invited me for a gamejam in their office. Then one thing led to another, I was young and needed money (oops! wrong topic) and on January 2006 [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;I took a plane to Copenhagen. <a href="http://aras-p.info/blog/2008/01/15/about-two-years-ago/"><em>Oh, this sounds familiar&#8230;</em></a></p>
<p>Well ok, it all started a bit before: <span id="more-466"></span><br />
<img src="http://aras-p.info/blog/wp-content/uploads/2010/01/futureofmiddleware.png" alt="" title="Future of Middleware" width="548" height="19" class="alignnone size-full wp-image-472" /></p>
<p>I exchanged some emails with <a href="http://blogs.unity3d.com/author/david/">David</a> and <a href="http://blogs.unity3d.com/author/joe/">Joachim</a> and they invited me to a <a href="http://unity3d.com/pakimono/">gamejam</a> in their office. Then one thing led to another, I was young and needed money <em>(oops! wrong topic)</em> and in January 2006 I started working on this thing called &#8220;Unity&#8221;.</p>
<p>Unity was at version <a href="http://unity3d.com/unity/whats-new/unity-1.2">1.2.1</a> then. Since then we&#8217;ve released about a dozen new versions, added hundreds (or thousands?) of new features, a handful of new platforms and <a href="http://blogs.unity3d.com/2009/11/13/blast-from-the-past-pt-3-a-growing-company/">have grown a lot</a>.</p>
<p><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/insanesales.png" alt="" title="Sales are INSANE!!!111" width="333" height="72" class="alignright size-full wp-image-474" />Also, we stopped saying &#8220;Sales are INSANE!!!!11&#8221; whenever they exceeded a whopping ten thousand euros per week. <em><span style="color: #808080;">Seriously, that much money in 2006 was a big thing. Our Windows build machine was a single-core Celeron with 512MB RAM because that&#8217;s what we could afford!</span></em> Well ok, we&#8217;re still saying &#8220;sales are insane!&#8221; from time to time; only the threshold has gone way up.</p>
<p style="clear:both"><img src="http://aras-p.info/blog/wp-content/uploads/2010/01/greatsuccess.png" alt="" title="Great Success" width="220" height="121" class="alignright size-full wp-image-473" />Occasionally we&#8217;d get excited about the strangest things. I think this email is about some car model from ATI that was on the front page of <a href="http://aras-p.info/blog/wp-content/uploads/2010/01/website2006.png">our website in 2006</a>. It&#8217;s beyond me why we&#8217;d put a car on the Unity website, but somehow it seemed to make sense at the time.</p>
<p style="clear:both">It would take too much space to list all the awesome things that happened in those four years. I got to work on some things too, like <a href="http://aras-p.info/blog/wp-content/uploads/2010/01/200603-firefox.jpg">Windows Web Player</a>, <a href="http://aras-p.info/blog/wp-content/uploads/2010/01/200702-fastd3d.png">Direct3D renderer</a>, <a href="http://aras-p.info/blog/2007/08/28/lolshadows/">shadows</a>, <a href="http://blogs.unity3d.com/2009/05/16/blast-from-the-recent-past-unity-25/">editor for Windows</a> and whatnot. But I mostly concentrate on creating trouble, which does not seem to hinder Unity that much. I need to get more efficient!</p>
<p>Seriously though, it has been an amazing ride so far, and I hope it will only become better. Thanks to everyone at Unity Technologies and the community!</p>
<p>Rock on!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/01/04/four-years-ago-today/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Direct3D GPU Hacks</title>
		<link>http://aras-p.info/blog/2009/11/20/direct3d-gpu-hacks/</link>
		<comments>http://aras-p.info/blog/2009/11/20/direct3d-gpu-hacks/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 12:26:48 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[d3d]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=462</guid>
		<description><![CDATA[I&#8217;m catching up on various GPU hacks that exist for Direct3D 9 (things like native shadow mapping, render to vertex buffer, etc.). Turns out there&#8217;s a lot of them, but all the information is scattered around the intertubes. So here are the D3D9 hacks known to me in one place. Let me know if I [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m catching up on various GPU hacks that exist for Direct3D 9 (things like native shadow mapping, render to vertex buffer, etc.). Turns out there&#8217;s a lot of them, but all the information is scattered around the intertubes.</p>
<p>So here are the <a href="http://aras-p.info/texts/D3D9GPUHacks.html"><strong>D3D9 hacks known to me in one place</strong></a>.</p>
<p>Let me know if I missed something or got something wrong. I also want to figure out if Intel GPUs/drivers implement any of them.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/11/20/direct3d-gpu-hacks/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Improving C#/Mono for Games</title>
		<link>http://aras-p.info/blog/2009/11/14/improving-cmono-for-games/</link>
		<comments>http://aras-p.info/blog/2009/11/14/improving-cmono-for-games/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 19:07:24 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[games]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=442</guid>
		<description><![CDATA[A tweet by Michael Hutchinson on C#/Mono usage in games caused me to do a couple of short replies (one, two). But then I started thinking a bit more, and here&#8217;s a longer post on what is needed for C# (and more specifically Mono) to be used in games more. In Unity we use Mono [...]]]></description>
			<content:encoded><![CDATA[<p>A <a href="http://twitter.com/mjhutchinson/status/5643232459">tweet by Michael Hutchinson</a> on C#/Mono usage in games caused me to do a couple of short replies (<a href="http://twitter.com/aras_p/status/5643338294">one</a>, <a href="http://twitter.com/aras_p/status/5643361286">two</a>). But then I started thinking a bit more, and here&#8217;s a longer post on what is needed for C# (and more specifically Mono) to be used in games more.</p>
<p>In <a href="http://unity3d.com/">Unity</a> we use Mono to do game code (well, Unity users are doing that, not us). Overall it&#8217;s great; it has tons of advantages, loads of awesome and a flying ninja here and there. But no technology is perfect, right?</p>
<p><strong>Edit</strong>: Miguel rightly points out in the comments that Mono team is solving or has already solved some of these issues already. In some areas they are moving so fast that we at Unity can&#8217;t keep up!</p>
<p><span id="more-442"></span><br />
<strong>#1: Garbage Collector</strong></p>
<p>Most game developers do not like Garbage Collection (GC) very much. Typically, the more limited/hardcore their target platform is, the more they dislike GC. The reason? Most GC implementations cause rather unpredictable spikes.</p>
<p>Here&#8217;s a run of something recorded in the <em>(awesome)</em> Unity 2.6 profiler. The horizontal axis is time; the vertical axis is CPU time spent in each frame:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/11/gcspikes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/11/gcspikes.png" alt="Garbage collection spikes" title="Garbage collection spikes" width="563" height="187" class="alignnone size-full wp-image-441" /></a></p>
<p>At the bottom you see dark red thingies appearing once in a while. This is the garbage collector kicking in, because some script code allocates memory at runtime.</p>
<p>Now of course, it <em>is possible</em> to write your script code so that it does no allocations (or almost none). Preallocate your objects into pools, manually invoke GC at moments when a small hiccup won&#8217;t affect gameplay, etc. In fact, a lot of iPhone games made with Unity do that.</p>
<p>But that kind of sidesteps the whole advantage of &#8220;garbage collector almost frees you from doing memory management&#8221;. If you&#8217;re not allocating anything anyway, GC could just as well not be there!</p>
<p>A little side story. Unity&#8217;s iPhone tech lead ReJ and I tried to explain what GC is to a non-programmer. Here&#8217;s what we came up with:</p>
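<p>For readers who haven&#8217;t seen pooling before, here is a minimal sketch of the idea in Python (standing in for Unity C#; the <code>Bullet</code> and <code>Pool</code> classes are hypothetical). Every object is created up front, and &#8220;spawning&#8221; or &#8220;destroying&#8221; one just moves it on or off a free list, so steady-state gameplay allocates nothing new.</p>

```python
class Bullet:
    """Hypothetical game object that is reused instead of reallocated."""

    def __init__(self):
        self.active = False
        self.x = 0.0
        self.y = 0.0


class Pool:
    """Fixed-size pool: every object is created once, up front."""

    def __init__(self, factory, size):
        self._free = [factory() for _ in range(size)]
        self.created = size  # total objects this pool has ever allocated

    def acquire(self):
        """Hand out a free object, or None if the pool is exhausted."""
        if not self._free:
            return None  # a real game might grow the pool or skip the spawn
        obj = self._free.pop()
        obj.active = True
        return obj

    def release(self, obj):
        """Return an object to the pool instead of letting it become garbage."""
        obj.active = False
        self._free.append(obj)


pool = Pool(Bullet, 2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()  # reuses the object released above; nothing new is allocated
```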
<blockquote><p>
Garbage Collection is this cleaning service for lazy people. They can just leave any garbage on the floor in their house, and once in a while a garbage guy comes, collects all the garbage and takes it outside. Now, there are some intricacies in the service.</p>
<p>First, you never know when the garbage guy will come. You might be taking a shower, doing a meditation or having some &#8220;sexy time&#8221; &#8211; and it&#8217;s in the service agreement that when a garbage guy comes, you have to let him in to do his work.</p>
<p>Second thing is, the garbage guy is usually some homeless drunkard. He smells so bad that when he comes, you have to stop whatever you were doing, go outside and wait until he&#8217;s done with the garbage collection. Even your neighbors, who might be doing something else entirely in parallel, actually have to stop and idle while garbage is being collected in your house!</p>
<p>There are variations of this GC service. One variation is called &#8220;moving GC&#8221;, where the garbage guy also rearranges your furniture while collecting the garbage &#8211; he moves it all into one side of your house. This is so that you can buy a bigger piece of furniture, or throw a huge piece of garbage &#8211; and there will be enough unused space for you to do that! Of course, this way the GC process takes somewhat longer, but hey, you get all your stuff nicely packed into one corner.</p>
<p>Can&#8217;t you see that this service is the greatest idea of all time?
</p></blockquote>
<p>This is quite a harsh attitude towards GC, and of course it&#8217;s exaggerated. But there is some truth to it. So how could GC be fixed?</p>
<p><em>GC fix #1: more control</em></p>
<p>More explicit control over when and for how long GC runs. I want to say to the garbage guy, &#8220;come every day at 4PM and do your work for 20 minutes&#8221;. In the game, I&#8217;d want to call GC with an upper time limit, say 1 millisecond per call, and call it 30 times per second.</p>
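<p>To put numbers on that wish: 1 millisecond per call at 30 calls per second is 30 ms out of every second, i.e. a fixed 3% of CPU time instead of unpredictable spikes. The sketch below is an entirely hypothetical Python illustration of what such an API could look like, not Mono&#8217;s actual collector: an incremental mark-and-sweep whose <code>step()</code> does at most a fixed amount of work per call. It counts work in &#8220;objects visited&#8221; instead of milliseconds so it stays deterministic, sweeps in one go, and assumes the object graph does not change in the middle of a cycle (a real incremental collector needs write barriers for that).</p>

```python
# The budget arithmetic from the paragraph above.
GC_BUDGET_MS = 1.0
GC_CALLS_PER_SECOND = 30
GC_CPU_FRACTION = GC_BUDGET_MS * GC_CALLS_PER_SECOND / 1000.0  # 0.03, i.e. 3%


class Obj:
    """A heap object with references to other objects."""

    def __init__(self, name):
        self.name = name
        self.refs = []


class IncrementalCollector:
    """Hypothetical mark-and-sweep collector that runs in bounded slices."""

    def __init__(self, heap, roots):
        self.heap = heap    # every allocated object
        self.roots = roots  # objects that are always reachable
        self.marked = set()
        self.gray = []      # reached but not yet scanned
        self.phase = "idle"

    def step(self, budget):
        """Do at most `budget` units of marking work.

        Returns True when a full collection cycle has just finished.
        """
        if self.phase == "idle":
            self.marked = set()
            self.gray = list(self.roots)
            self.phase = "mark"
        work = 0
        while self.phase == "mark" and work < budget:
            if not self.gray:
                self.phase = "sweep"
                break
            obj = self.gray.pop()
            work += 1
            if obj in self.marked:
                continue
            self.marked.add(obj)
            self.gray.extend(obj.refs)
        if self.phase == "sweep" and work < budget:
            # Sweep in one go: keep only objects that were reached.
            self.heap[:] = [o for o in self.heap if o in self.marked]
            self.phase = "idle"
            return True
        return False


# Usage: root -> a -> b is reachable; c -> d is garbage.
root, a, b, c, d = Obj("root"), Obj("a"), Obj("b"), Obj("c"), Obj("d")
root.refs = [a]
a.refs = [b]
c.refs = [d]
heap = [root, a, b, c, d]
collector = IncrementalCollector(heap, [root])
steps = 1
while not collector.step(budget=1):  # one small slice per "frame"
    steps += 1
```

<p>The collection is spread over four tiny slices instead of one pause; a larger budget per call would finish the cycle in fewer frames.</p>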
<p><em>GC fix #2: sometimes I want to clean garbage myself</em></p>
<p>The inefficiencies and unpredictability of GC make people do even more work than they would with normal, oldskool manual memory management. Why not provide an option to handle deallocations manually? E.g. a keyword <code>reallynew</code> could allocate an object that is not part of the garbage collected world. It would function as a regular .NET object, except that it would be the user&#8217;s responsibility to <code>reallydelete</code> it.</p>
<p>Mono is already extending .NET (see <a href="http://tirania.org/blog/archive/2008/Nov-03.html">SIMD</a> and <a href="http://tirania.org/blog/archive/2009/Apr-09.html">continuations</a>). Maybe it makes sense to add some way to bypass garbage collector?</p>
<p><strong>#2: Distribution Size</strong></p>
<p>Using C#/.NET in a game requires having a .NET runtime. None of the interesting platforms are guaranteed to have it, and even on Windows you can&#8217;t count on it being present. Mono is great here in the sense that it can be used on many more platforms than Microsoft&#8217;s own .NET. It&#8217;s also great on distribution size, but only if you compare it to Microsoft&#8217;s .NET.</p>
<p>In Unity Web Player, we package the Mono DLL + mscorlib assembly into something like 1.5 megabytes (after LZMA compression). That is great compared to the 20+ megabytes of the .NET runtime, but not that great if you compare it to, say, the <a href="http://www.lua.org/">Lua</a> runtime (which is less than 100 kilobytes).</p>
<p>On some platforms (iPhone, Xbox 360, PS3, &#8230;) it&#8217;s not possible to generate code at runtime, so Mono&#8217;s JIT does not work. All code that&#8217;s written in C# has to be precompiled to machine code ahead of time (AOT compilation). This is not a problem per se, but because the .NET framework was never designed with small size and few dependencies in mind, <em>doing anything</em> will ultimately pull in a lot of code.</p>
<p>We joke that doing anything in C# will result in an XML parser being included <em>somewhere</em>. This is not that far from the truth; e.g. calling <code>float.ToString()</code> will pull in the whole internationalization system, which <em>probably</em> needs to read some global XML configuration file somewhere to figure out whether daylight saving time is active when the Eastern European Brazilian Chinese calendar is used.</p>
<p><em>Size fix #1: custom core .NET libraries?</em></p>
<p>For game uses, most of the &#8220;fat&#8221; stuff in the .NET runtime is not really needed. <code>float.ToString()</code> could just always use a period as the decimal separator. Core libraries could consist of just the essential collections (list, array, hash table) and maybe a String class, with only the essential methods. Maybe it&#8217;s worth sacrificing some of the generality of .NET if that could shave a couple of megabytes off your iPhone game size?</p>
<p>Of course this is very much doable; &#8220;all that is needed&#8221; &#8482; is writing custom mscorlib+friends, and telling C# compiler to not ever reference <em>any</em> of the &#8220;real&#8221; libraries.</p>
<p><em>Size fix #2: make Mono runtime smaller</em></p>
<p>The uncompressed Mono DLL in our Windows build is 1.5 megabytes. We have turned off all the easy stuff (profiler, debugger, logging, COM, AOT etc.). But <em>probably</em> some more could be stripped away. Do our games really need multiple AppDomains? Some fancy marshalling? I don&#8217;t know, it just <em>feels</em> like 1.5MB is a lot.</p>
<p><strong>#3: Porting to New Platforms</strong></p>
<p>You know this classic: &#8220;There&#8217;s no portable code. There&#8217;s only code that&#8217;s been ported.&#8221;</p>
<p>Most existing gaming platforms are quite weird. Most upcoming smartphone platforms are also quite weird, each in its own interesting way. Porting a large project like Mono is not easy, especially since parts of it (the JIT or AOT engine) depend heavily on the platform.</p>
<p>For Unity iPhone, the unexpected discovery that JIT is not possible on the iPhone delayed the initial release by something like 4 months. It did not help that JIT actually worked in early iPhone SDK builds, and Apple decided to disable runtime generated code later. Making Mono actually work there required significant work both from the Mono team and from Unity. We still have one guy working almost exclusively on Mono+iPhone issues!</p>
<p>Of course, <em>maybe</em> all the Mono iPhone work made porting to new platforms easier as a byproduct. But so far we haven&#8217;t ported Mono to any other platform at production quality. So judging from experience, we now always assume a Mono port will be a pain, just because &#8220;some nasty surprises will come up&#8221; (and they always do).</p>
<p><strong>#4: Small Stuff</strong></p>
<p>There are tons of small bits where extending .NET would benefit gaming scenarios. For example:</p>
<p>Suppose there is some array on the native engine side, for example vertex positions in a mesh (3 floats per vertex). Is it possible to expose that piece of memory to the .NET side as a native struct array, so that no extra memory copies are involved and N vertices somewhere in memory look just like Vector3[N] to C#?</p>
<p>On a similar note, having &#8220;strided arrays&#8221; would be useful. For example, mesh data is often interleaved, so for each vertex there is a position, normal, UVs and so on. It would be cool if in C# the position array still looked like Vector3[N], but internally the distance between elements was larger than the 12 bytes required for a Vector3.</p>
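<p>Here is roughly what such a view could look like, sketched in Python with <code>memoryview</code> and <code>struct</code> standing in for native engine memory (this is an illustration, not a .NET or Mono API). The class name and the 32-byte vertex layout (position, normal, UV) are assumptions; the point is that positions and normals are read and written in place, with a stride larger than the 12 bytes of a Vector3, and nothing is copied.</p>

```python
import struct


class StridedVector3View:
    """Hypothetical Vector3[N]-like view into an interleaved vertex buffer.

    Reads and writes go straight to the underlying bytes; nothing is copied.
    """

    FORMAT = "<3f"                       # three little-endian 32-bit floats
    ITEM_SIZE = struct.calcsize(FORMAT)  # 12 bytes

    def __init__(self, buffer, count, offset, stride):
        if stride < self.ITEM_SIZE:
            raise ValueError("stride must be at least 12 bytes")
        self._mem = memoryview(buffer)
        self._count = count
        self._offset = offset  # byte offset of element 0
        self._stride = stride  # bytes from one element to the next

    def __len__(self):
        return self._count

    def _byte_pos(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        return self._offset + i * self._stride

    def __getitem__(self, i):
        return struct.unpack_from(self.FORMAT, self._mem, self._byte_pos(i))

    def __setitem__(self, i, value):
        struct.pack_into(self.FORMAT, self._mem, self._byte_pos(i), *value)


# Assumed vertex layout: position (12 bytes), normal (12), UV (8) = 32 bytes.
VERTEX_STRIDE = 32
vertex_count = 3
vertex_data = bytearray(VERTEX_STRIDE * vertex_count)  # stands in for engine memory
positions = StridedVector3View(vertex_data, vertex_count, 0, VERTEX_STRIDE)
normals = StridedVector3View(vertex_data, vertex_count, 12, VERTEX_STRIDE)

positions[1] = (1.0, 2.0, 3.0)
normals[1] = (0.0, 1.0, 0.0)
```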
<p><strong>Where do we go from here?</strong></p>
<p>The above are just random ideas, and I&#8217;m not complaining about Mono. It is great! It&#8217;s just not perfect. Mono being open source is a very good thing, which means pretty much any interested party can improve it as needed. So rock on.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/11/14/improving-cmono-for-games/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Deferred Cascaded Shadow Maps</title>
		<link>http://aras-p.info/blog/2009/11/04/deferred-cascaded-shadow-maps/</link>
		<comments>http://aras-p.info/blog/2009/11/04/deferred-cascaded-shadow-maps/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 14:42:08 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[rendering]]></category>
		<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=434</guid>
		<description><![CDATA[Reading &#8220;Rendering Technology at Black Rock Studios&#8221; made me realize that cascaded shadow maps I did 2+ years ago in Unity 2.0 are probably called &#8220;deferred shadowing&#8221;. Since I never wrote how they are done&#8230; here: The process is roughly this (all of this is DX9 level tech on PCs; later tech or consoles could [...]]]></description>
			<content:encoded><![CDATA[<p>Reading &#8220;<a href="http://www.bungie.net/News/content.aspx?type=topnews&#038;link=Siggraph_09">Rendering Technology at Black Rock Studios</a>&#8221; made me realize that cascaded shadow maps I did 2+ years ago in Unity 2.0 are <em>probably</em> called &#8220;deferred shadowing&#8221;. Since I never wrote how they are done&#8230; here:</p>
<p>The process is roughly this (all of this is DX9 level tech on PCs; later tech or consoles could and should use more optimizations):</p>
<ol>
<li>Render shadow map cascades. All of them are packed into one shadow map via viewports.</li>
<li>Collect shadows into a screen-sized render target. This is the shadow term.</li>
<li>Blur the shadow term.</li>
<li>In regular forward rendering, use the shadow term in screen space.</li>
</ol>
<p>More detail:</p>
<p><strong>Render Shadow Cascades</strong></p>
<p>Nothing fancy here. All cascades are packed into a single shadow map. For example, two 512&#215;512 cascades would be packed side by side into a 1024&#215;512 shadow map.</p>
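<p>The bookkeeping for this packing is small. A minimal Python sketch, assuming square cascades laid out left to right, then top to bottom (the post only describes the two-cascade case): each cascade gets a pixel viewport to render into, plus the matching scale and offset that turn that cascade&#8217;s own [0,1] texture coordinates into coordinates within the shared map. The function names are made up for illustration.</p>

```python
def cascade_viewport(index, cascade_size, cascades_per_row):
    """Pixel rectangle (x, y, width, height) that cascade `index` renders into."""
    col = index % cascades_per_row
    row = index // cascades_per_row
    return (col * cascade_size, row * cascade_size, cascade_size, cascade_size)


def cascade_uv_transform(index, cascade_size, atlas_width, atlas_height, cascades_per_row):
    """Scale and offset mapping a cascade's own [0,1] UVs into atlas UVs.

    atlas_uv = cascade_uv * scale + offset (per axis). Ignores any
    API-specific flip of the vertical texture axis.
    """
    x, y, w, h = cascade_viewport(index, cascade_size, cascades_per_row)
    scale = (w / atlas_width, h / atlas_height)
    offset = (x / atlas_width, y / atlas_height)
    return scale, offset
```

<p>For the two-cascade example, cascade 1 renders into the right half of the 1024&#215;512 map (x = 512), and its UVs are scaled by 0.5 horizontally and offset by 0.5.</p>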
<p><strong>Screen-space Shadow Term</strong></p>
<p>Render all shadow receivers with a shader that &#8220;collects&#8221; the shadow map term. In effect, shadows from all cascades are collected into a screen-sized texture. After this step, the original cascaded shadow maps are no longer needed.</p>
<p>Unity supports up to 4 shadow map cascades, whose split distances neatly fit into a float4 register in the pixel shader. The correct cascade is sampled just once, <em>without</em> using static or dynamic branching. Pixel shader pseudocode:</p>
<blockquote><pre>
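// z: this pixel's depth from the camera.
// _LightSplitsNear/_LightSplitsFar: start/end depth of each of the 4 cascades, one per component.
float4 near = float4 (z >= _LightSplitsNear);
float4 far = float4 (z < _LightSplitsFar);
// 1 for the cascade whose [near, far) range contains z, 0 for all others.
float4 weights = near * far;
// Blend all four cascade coordinates; only the selected one survives.
float2 coord =
    i._ShadowCoord[0] * weights.x +
    i._ShadowCoord[1] * weights.y +
    i._ShadowCoord[2] * weights.z +
    i._ShadowCoord[3] * weights.w;
// One fetch from the shared shadow map, no branching.
float sm = tex2D (_ShadowMapTexture, coord.xy).r;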
</pre>
</blockquote>
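<p>To see why this picks exactly one cascade, here is the same arithmetic emulated on the CPU in Python, with made-up split distances and coordinates. As long as the far split of each cascade equals the near split of the next one, exactly one weight is 1 for any depth inside the shadowed range, so the weighted sum simply returns that cascade&#8217;s coordinates. Past the last split, all weights are 0.</p>

```python
def cascade_weights(z, splits_near, splits_far):
    """Per-cascade 0/1 weights, like near * far in the shader."""
    near = [1.0 if z >= n else 0.0 for n in splits_near]
    far = [1.0 if z < f else 0.0 for f in splits_far]
    return [a * b for a, b in zip(near, far)]


def select_shadow_coord(z, splits_near, splits_far, coords):
    """Weighted sum of per-cascade (u, v) coordinates, with no branch on the cascade."""
    w = cascade_weights(z, splits_near, splits_far)
    return tuple(sum(wi * c[k] for wi, c in zip(w, coords)) for k in range(2))


# Made-up split distances for 4 cascades: each far split is the next near split.
SPLITS_NEAR = [0.0, 10.0, 30.0, 70.0]
SPLITS_FAR = [10.0, 30.0, 70.0, 150.0]
# Made-up per-cascade shadow map coordinates for one pixel.
COORDS = [(0.1, 0.2), (0.6, 0.3), (0.25, 0.75), (0.8, 0.9)]
```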
<p>Additionally, shadow fadeout is applied here (shadows in Unity can be cast up to a specified distance from the camera, and they fade out when approaching that distance).</p>
<p>After this I have the shadow term in screen space. Note that I do not do any shadow map filtering here; that is done in screen space later.</p>
<p>On PCs in DX9 there is (or there was?) no easy/sane way to read depth buffer in the pixel shader, so while collecting shadows the shader also outputs depth packed into two channels of the render target.</p>
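<p>The post doesn&#8217;t say which encoding is used for this. One common scheme, sketched here in Python with an emulated 8-bit render target (function names are illustrative), stores the coarse part of a [0,1) depth value in one channel and the next 8 bits of precision in the other, so two 8-bit channels give roughly 16 bits of depth.</p>

```python
import math


def frac(x):
    """Fractional part, like HLSL frac()."""
    return x - math.floor(x)


def encode_depth_rg(depth):
    """Split a depth in [0, 1) into a coarse and a fine channel."""
    coarse = frac(depth)
    fine = frac(depth * 255.0)
    coarse -= fine / 255.0  # leave only the part the fine channel doesn't store
    return coarse, fine


def store_8bit(channel):
    """Emulate writing a [0, 1] value into an 8-bit render target channel."""
    return round(channel * 255.0) / 255.0


def decode_depth_rg(coarse, fine):
    return coarse + fine / 255.0


def round_trip(depth):
    coarse, fine = encode_depth_rg(depth)
    return decode_depth_rg(store_8bit(coarse), store_8bit(fine))
```

<p>The round-trip error stays within about half of 1/65025, while a single 8-bit channel can be off by up to half of 1/255.</p>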
<p><strong>Screen-space Shadow Blur</strong></p>
<p>The previous step results in a screen-space shadow term and depth. The shadow term is blurred into another render target, using a spatially varying Poisson disc-like filter.</p>
<p>The filter size depends on depth (shadow boundaries closer to the camera are blurred more). The filter also discards samples if the difference in depth is larger <em>than something</em>, to avoid blurring over object boundaries. It's not totally robust, but seems to work quite well.</p>
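<p>A minimal one-dimensional sketch of such a filter in Python. The tap offsets, the &#8220;radius shrinks with depth&#8221; rule and the rejection threshold are all assumptions (the post gives no numbers), and the real thing is a 2D Poisson disc in a pixel shader. The sketch only shows the two ideas: closer pixels get a wider footprint, and taps across a large depth difference are discarded so shadows don&#8217;t bleed over object edges.</p>

```python
# Hypothetical 1D stand-in for a Poisson disc: tap offsets in units of "radius".
TAP_OFFSETS = [-1.0, -0.6, -0.25, 0.3, 0.7, 1.0]


def blur_shadow_term(shadow, depth, base_radius, depth_threshold):
    """Depth-aware blur of a screen-space shadow term (one row of pixels)."""
    n = len(shadow)
    out = []
    for i in range(n):
        # Assumption: footprint shrinks with distance, so near shadows blur more.
        radius = base_radius / max(depth[i], 1e-3)
        total = shadow[i]
        count = 1.0
        for offset in TAP_OFFSETS:
            j = int(round(i + offset * radius))
            # Skip taps outside the row or across a large depth difference.
            if 0 <= j < n and abs(depth[j] - depth[i]) <= depth_threshold:
                total += shadow[j]
                count += 1.0
        out.append(total / count)
    return out


edge_shadow = [0, 0, 0, 0, 1, 1, 1, 1]
# Same depth everywhere: the shadow edge gets softened.
flat = blur_shadow_term(edge_shadow, [1.0] * 8, 2.0, 0.5)
# Depth jump at the same place (an object boundary): no bleeding across it.
split = blur_shadow_term(edge_shadow, [1.0] * 4 + [5.0] * 4, 2.0, 0.5)
```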
<p><strong>Using shadow term in forward rendering</strong></p>
<p>In forward rendering, this blurred shadow term texture is used. Here the shadow term already has filtering &#038; fadeout applied, and the shaders do not need to know anything about shadow cascades. Just read a pixel from the texture and use it in the lighting computation. Done!</p>
<p><strong>Fin</strong></p>
<p>Back then I didn't know this would be called "deferred" <em>(that would probably have scared me away!)</em>. I don't know if this approach is any good, but so far it works quite well for Unity's needs. It also reduces the shader permutation count a lot, which I like.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/11/04/deferred-cascaded-shadow-maps/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Fixing bugs, in Tom Waits&#8217; words</title>
		<link>http://aras-p.info/blog/2009/09/20/fixing-bugs-in-tom-waits-words/</link>
		<comments>http://aras-p.info/blog/2009/09/20/fixing-bugs-in-tom-waits-words/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 07:19:27 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[random]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=431</guid>
		<description><![CDATA[Mixing a sprint of bug fixing before the release and Tom Waits&#8217; music results in interesting combination. For example, Crossroads describes bug fixing process perfectly: And that&#8217;s where ol&#8217; George found himself out there at the FogBugz Fixin&#8217; the devil&#8217;s bugs Now, a man figures it&#8217;s his bugs and he&#8217;ll assign whom he wants But [...]]]></description>
			<content:encoded><![CDATA[<p>Mixing a sprint of bug fixing before the release and Tom Waits&#8217; music results in interesting combination. For example, <a href="http://en.wikipedia.org/wiki/The_Black_Rider_(album)">Crossroads</a> describes bug fixing process perfectly:</p>
<blockquote><p>And that&#8217;s where ol&#8217; George found himself out there at the FogBugz<br />
Fixin&#8217; the devil&#8217;s bugs<br />
Now, a man figures it&#8217;s his bugs and he&#8217;ll assign whom he wants<br />
But it don&#8217;t always work out that way<br />
You see, some bugs are special for a certain target<br />
A certain platform, or a certain person<br />
And no matter whom you&#8217;re assignin&#8217;, that&#8217;s where the bug &#8216;ll end up<br />
And in the moment of assigning your mouse turns into a dowser&#8217;s wand<br />
And clicks where the bug wants to go.</p></blockquote>
<p>Uhm. Yeah.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/09/20/fixing-bugs-in-tom-waits-words/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Strided blur and other tips for SSAO</title>
		<link>http://aras-p.info/blog/2009/09/17/strided-blur-and-other-tips-for-ssao/</link>
		<comments>http://aras-p.info/blog/2009/09/17/strided-blur-and-other-tips-for-ssao/#comments</comments>
		<pubDate>Thu, 17 Sep 2009 07:59:01 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=409</guid>
		<description><![CDATA[If you&#8217;re new to SSAO, here are good overview blog posts: meshula.net and levelofdetail. Some tips and an idea on strided blur below. Bits and pieces I found useful SSAO can be generated at a smaller resolution than screen, with depth+normals aware upsample/blur step. If random offset vector points away from surface normal, flip it. [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re new to SSAO, here are good overview blog posts: <a href="http://meshula.net/wordpress/?p=145">meshula.net</a> and <a href="http://levelofdetail.wordpress.com/2008/02/10/2007-the-year-ssao-broke/">levelofdetail</a>. Some tips and an idea on strided blur below.</p>
<p><span id="more-409"></span><strong>Bits and pieces I found useful</strong></p>
<ul>
<li>SSAO can be generated at a smaller resolution than screen, with depth+normals aware upsample/blur step.</li>
<li>If a random offset vector points away from the surface normal, flip it. This keeps the random vectors in the upper hemisphere, which reduces false occlusion on flat surfaces. Of course this requires having surface normals.</li>
<li>When generating random vectors for your AO kernel:
<ul>
<li>Generate vectors <i>inside</i> unit sphere (not <i>on</i> unit sphere).</li>
<li>Use energy minimization to distribute your samples better, especially at low sample counts. See the <a href="http://www.malmer.nu/index.php/2008-04-11_energy-minimization-is-your-friend">malmer.nu</a> blog post.</li>
</ul>
</li>
<li>In your AO blurring/upsampling step, there is no need to sample every pixel for the blur. Just skip some of them, i.e. make the kernel offsets larger. See below.</li>
</ul>
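<p>The kernel tips and the hemisphere flip are short enough to sketch. In this Python version (function names are made up for illustration), rejection sampling produces offsets <em>inside</em> the unit sphere, and each offset is flipped if it points away from the surface normal. In a real renderer the flip happens per pixel in the AO shader, since it needs that pixel&#8217;s normal; the energy minimization step from the linked post is not shown.</p>

```python
import random


def random_in_unit_sphere(rng):
    """Rejection sampling: uniform point *inside* (not on) the unit sphere."""
    while True:
        v = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if v[0] * v[0] + v[1] * v[1] + v[2] * v[2] <= 1.0:
            return v


def make_ao_kernel(count, seed=0):
    """Random AO sample offsets; energy minimization is not applied here."""
    rng = random.Random(seed)
    return [random_in_unit_sphere(rng) for _ in range(count)]


def flip_to_hemisphere(v, normal):
    """Flip an offset that points away from the surface normal."""
    d = v[0] * normal[0] + v[1] * normal[1] + v[2] * normal[2]
    return (-v[0], -v[1], -v[2]) if d < 0.0 else v


kernel = make_ao_kernel(8, seed=1)
oriented = [flip_to_hemisphere(v, (0.0, 0.0, 1.0)) for v in kernel]
```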
<p><strong>Strided blur for AO</strong></p>
<p>Normally you&#8217;d blur the AO term using some sort of standard blur, for example a separable Gaussian: horizontal blur, followed by vertical blur. Here&#8217;s how one can picture a horizontal blur kernel:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/blur1.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/blur1.png" alt="Horizontal Blur Kernel" title="Horizontal Blur Kernel" width="291" height="51" class="alignnone size-full wp-image-420" /></a></p>
<p>Here&#8217;s how <a href="http://runevision.com/">Rune</a> taught me how to blur better:</p>
<blockquote>
<dl>
<dt>Rune:</dt>
<dd>The other thing is the blur. I tried to make the blur 4 times stronger, and it looks much better IMO without any artifacts I could see. I could even use 4x downsampling with that blur amount and still get acceptable results.</dd>
<dt>Aras:</dt>
<dd>how did you make it 4x stronger? <i>(I was going to say that blur step is already quite expensive, and I don&#8217;t want to add more samples to make it even more expensive, yadda yadda)</i></dd>
<dt>Rune:</dt>
<dd>m_SSAOMaterial.SetVector (&#8220;_TexelOffsetScale&#8221;, m_IsOpenGL ?<br />
	&nbsp;&nbsp;new Vector4 (<b>4</b>,0,1.0f/m_Downsampling,0) :<br />
	&nbsp;&nbsp;new Vector4 (<b>4.0f</b>/source.width,0,0,0));<br />
	And similar for vertical.</dd>
<dt>Aras:</dt>
<dd>hmm. that&#8217;s strange :)</dd>
<dt>Rune:</dt>
<dd>I have no idea what I&#8217;m doing of course but it looks good.</dd>
<dt>Aras:</dt>
<dd>so this way it does not do Gaussian on 9&#215;9 pixels, but instead only takes each 4th pixel. Wider area, but&#8230; it should not work! :)</dd>
<dt>Rune:</dt>
<dd>It creates a very fine pattern at pixel level but it&#8217;s way more subtle than the noise you get otherwise.</dd>
<dt>Aras:</dt>
<dd>ok <i>(hides in the corner and weeps)</i></dd>
</dl>
</blockquote>
<p>So yeah. The blur kernel can be &#8220;spread&#8221; to skip some pixels, effectively resulting in a larger blur radius for the same sample count:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/blur2.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/blur2.png" alt="Blur with 2 pixel stride" title="Blur with 2 pixel stride" width="291" height="51" class="alignnone size-full wp-image-421" /></a></p>
<p>Or even this:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/blur3.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/blur3.png" alt="Blur with 3 pixel stride" title="Blur with 3 pixel stride" width="291" height="51" class="alignnone size-full wp-image-422" /></a></p>
<p>Yes, it&#8217;s not a correct blur. <strong>But that&#8217;s okay</strong>, we&#8217;re not building nuclear reactors that depend on SSAO blur being accurate. <em>If you are, SSAO is probably the wrong approach anyway, I&#8217;ve heard it&#8217;s not that useful for nuclear stuff</em>.</p>
<p>I&#8217;m not sure what this blur should be called. Strided blur? Interleaved blur? Interlaced blur? Or maybe everyone is doing that already and it has a well-established name? Let me know.</p>
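<p>In code, the whole trick is one multiplication on the tap offset. Here is a 1D Python sketch with made-up 9-tap weights (the depth+normals awareness of the real blur is left out); run it once over rows and once over columns for the separable version. With 9 taps, a stride of 2 covers 17 pixels and a stride of 3 covers 25, which is where the &#8220;effectively 17&#215;17&#8221; and &#8220;25&#215;25&#8221; figures below come from.</p>

```python
# Made-up symmetric 9-tap weights (they sum to 1).
WEIGHTS_9 = [0.05, 0.09, 0.12, 0.15, 0.18, 0.15, 0.12, 0.09, 0.05]


def strided_blur_1d(row, weights, stride):
    """Blur one row; neighbouring taps are `stride` pixels apart instead of 1."""
    half = len(weights) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        wsum = 0.0
        for k, w in enumerate(weights):
            j = i + (k - half) * stride  # stride 1 gives the regular kernel
            if 0 <= j < len(row):
                acc += w * row[j]
                wsum += w
        out.append(acc / wsum)  # renormalize when taps fall off the edge
    return out


def footprint(taps, stride):
    """Width in pixels covered by a kernel with `taps` samples at `stride`."""
    return (taps - 1) * stride + 1


impulse = [0.0] * 41
impulse[20] = 1.0
blurred = strided_blur_1d(impulse, WEIGHTS_9, stride=2)
```

<p>On a single bright pixel, every other output pixel stays at zero: that is the fine interleaved pattern visible in the magnified screenshots below.</p>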
<p>Some images of blur in action. Raw AO term (very low &#8211; 8 &#8211; sample count and increased contrast on purpose):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO1raw.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO1raw-500x270.png" alt="Raw AO at low sample count" title="Raw AO at low sample count" width="500" height="270" class="alignnone size-medium wp-image-412" /></a></p>
<p>Regular 9&#215;9 blur (does not blur over depth+normals discontinuities):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO2blur.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO2blur-500x270.png" alt="Blurred AO" title="Blurred AO" width="500" height="270" class="alignnone size-medium wp-image-413" /></a></p>
<p>Blur with a 2 pixel stride (effectively 17&#215;17):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO3blur2.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO3blur2-500x271.png" alt="Blurred AO with stride 2" title="Blurred AO with stride 2" width="500" height="271" class="alignnone size-medium wp-image-414" /></a><br />
It does create a fine interleaved pattern because it skips pixels. But you get wider blur!<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO3blur2mag.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO3blur2mag.png" alt="Blurred AO with stride 2, magnified" title="Blurred AO with stride 2, magnified" width="256" height="244" class="alignnone size-full wp-image-415" /></a></p>
<p>Blur with a 3 pixel stride (effectively 25&#215;25):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO4blur3.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO4blur3-500x269.png" alt="Blurred AO with stride 3" title="Blurred AO with stride 3" width="500" height="269" class="alignnone size-medium wp-image-416" /></a><br />
At a 3 pixel stride the artifacts become apparent. But hey, this is a very low AO sample count, with increased contrast and no textures in the scene.<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO4blur3mag.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO4blur3mag.png" alt="Blurred AO with stride 3, magnified" title="Blurred AO with stride 3, magnified" width="256" height="244" class="alignnone size-full wp-image-417" /></a></p>
<p>For the sake of completeness, the same raw AO term, but computed at 2&#215;2 smaller resolution (still using a low sample count etc.):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO5down2.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO5down2-500x270.png" alt="AO computed at lower resolution" title="AO computed at lower resolution" width="500" height="270" class="alignnone size-medium wp-image-418" /></a></p>
<p>Now, 2&#215;2 smaller AO, blurred with a 3 pixel stride:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/AO6down2blur3.png"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/AO6down2blur3-499x272.png" alt="AO at lower resolution, blurred with 3 pixel stride" title="AO at lower resolution, blurred with 3 pixel stride" width="499" height="272" class="alignnone size-medium wp-image-419" /></a></p>
<p>Happy blurring!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/09/17/strided-blur-and-other-tips-for-ssao/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Usability depends on context!</title>
		<link>http://aras-p.info/blog/2009/09/14/usability-depends-on-context/</link>
		<comments>http://aras-p.info/blog/2009/09/14/usability-depends-on-context/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 13:16:02 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=387</guid>
		<description><![CDATA[Here&#8217;s a little story on how usability decisions need to depend on context. In Unity editor pretty much any window can be &#8220;detached&#8221; from the main window. An obvious use case is putting it onto a separate monitor. But of course you can just end up having a ton of detached windows overlapping each other. [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a little story on how usability decisions need to depend on context.</p>
<p>In Unity editor pretty much any window can be &#8220;detached&#8221; from the main window. An obvious use case is putting it onto a separate monitor. But of course you can just end up having a ton of detached windows overlapping each other.</p>
<p>Here I have four windows in total on OS X:<span id="more-387"></span><br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/OSXOverlapped.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/OSXOverlapped-500x324.jpg" alt="Overlapped Windows on OS X" title="Overlapped Windows on OS X" width="500" height="324" class="size-medium wp-image-389" /></a></p>
<p>Here I have four windows on Windows:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/WinOverlapped.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/WinOverlapped-500x312.jpg" alt="Overlapped Windows on Windows" title="Overlapped Windows on Windows" width="500" height="312" class="size-medium wp-image-393" /></a></p>
<p>However, users of OS X and Windows are used to applications behaving differently.</p>
<p>On OS X, it is <em>very</em> common that a single application has many overlapping windows. Usually users don&#8217;t have problems finding their windows either, thanks to Exposé. Press a key, voilà, here they are:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/OSXExpose.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/OSXExpose-500x316.jpg" alt="Exposé on OS X" title="Exposé on OS X" width="500" height="316" class="size-medium wp-image-388" /></a></p>
<p>On Windows, there is no Exposé. So there&#8217;s a problem: when a detached window is obscured by another window, how do you get to it? One would ask &#8220;well, what&#8217;s wrong in having windows partially overlapped, like in above screenshot?&#8221;, to which I&#8217;d say &#8220;you&#8217;re a Mac user&#8221;.</p>
<p>Windows users do not have a ton of windows on screen. They tend to maximize the application they are currently working with. <em>I was doing this myself</em> all the time, and it took 3 years of Mac laptop usage before I stopped maximizing everything on my Windows box!</p>
<p>So what a typical Windows user might see when using Unity is this. Now, where are the other three detached windows?<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/WinMaximized.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/WinMaximized-500x312.jpg" alt="Maximized" title="Maximized" width="500" height="312" class="size-medium wp-image-392" /></a></p>
<p>On Windows, it is <em>very uncommon</em> for a single application to have many overlapped windows. When an application does that, the &#8220;detached&#8221; windows are always positioned on top of the main window. There are some applications that do not do this (yes I&#8217;m looking at you GIMP), and almost nobody is happy with their usability.</p>
<p>So we decided to take this context into account. Windows users do not have Exposé, <em>and</em> they expect &#8220;detached&#8221; windows to be always on top of the main window. Unity 2.6 will do this soon.<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/WinInFront.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/WinInFront-500x312.jpg" alt="In Front on Windows" title="In Front on Windows" width="500" height="312" class="size-medium wp-image-391" /></a></p>
<p>Of course, you can still dock all the windows together, and this whole &#8220;windows are obscured by other windows&#8221; issue goes away:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2009/09/WinDocked.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2009/09/WinDocked-500x312.jpg" alt="Docked on Windows" title="Docked on Windows" width="500" height="312" class="size-medium wp-image-390" /></a></p>
<p><em>Hmm&#8230; I think the screenshots above show two new big features in upcoming Unity 2.6. Preemptive note: UI of the stuff above is not final. Anything might change, don&#8217;t become attached to any particular pixel!</em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/09/14/usability-depends-on-context/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Talks &amp; Demos from Assembly 2009</title>
		<link>http://aras-p.info/blog/2009/08/21/talks-demos-from-assembly-2009/</link>
		<comments>http://aras-p.info/blog/2009/08/21/talks-demos-from-assembly-2009/#comments</comments>
		<pubDate>Fri, 21 Aug 2009 07:10:34 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[demos]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=381</guid>
		<description><![CDATA[I went to Assembly 2009 demoparty this year. No demo submissions, but I did a seminar presentation about developing graphics technology for small games (PDF slides). Mostly on hardware statistics, GPU features, testing and stability. However, the awesome talk was given by ReJ: low level iPhone (pre-3GS) rendering details (PDF slides). [...]]]></description>
			<content:encoded><![CDATA[<p>I went to <a href="http://www.assembly.org/summer09/?set_language=en">Assembly 2009</a> demoparty this year.</p>
<p>No demo submissions, but I did a seminar presentation about developing graphics technology for small games (<a href='http://aras-p.info/texts/files/Assembly09-Aras-GfxTech.pdf'>PDF slides</a>). Mostly on hardware statistics, GPU features, testing and stability:</p>
<p><span id="more-381"></span>[vimeo clip_id="6128236" width="504" height="284"]</p>
<p>However, the <strong>awesome talk was given by ReJ</strong>: low level iPhone (pre-3GS) rendering details (<a href='http://blogs.unity3d.com/wp-content/uploads/2009/08/Assembly09-iPhone-Learning-GPU-from-Driver-Code.pdf'>PDF slides</a>). Inner workings of iPhone&#8217;s GPU, OpenGL ES drivers, command buffers, VFP assembly and so on. Bringing assembly back to the Assembly, yeah!<br />
[vimeo clip_id="6064955" width="504" height="284"]</p>
<p>If you&#8217;re going to watch some demos from Assembly 2009, make sure to see:</p>
<ul>
<li><a href="http://capped.tv/cncd_orange_fairlight-frameranger">Frameranger</a> (1st place demo). Rocked the big screen! Seems somewhat unfinished though.</li>
<li><a href="http://capped.tv/united_force_digital_dynamite-the_golden_path">The Golden Path</a> (3rd place demo) &#8211; for something fresh. Also, a good way to disprove the saying that &#8220;the winners don&#8217;t take drugs&#8221; :)</li>
<li><a href="http://capped.tv/youth_uprising_mlat_design_out-muon_baryon">Muon Baryon</a> (1st place 4 kilobyte intro) &#8211; that&#8217;s what kids do with sphere marching on the GPU these days.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/08/21/talks-demos-from-assembly-2009/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Compact Normal Storage for small g-buffers</title>
		<link>http://aras-p.info/blog/2009/08/04/compact-normal-storage-for-small-g-buffers/</link>
		<comments>http://aras-p.info/blog/2009/08/04/compact-normal-storage-for-small-g-buffers/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 09:39:51 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[d3d]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=377</guid>
		<description><![CDATA[I&#8217;ve been experimenting with compact storage of view space normals for small g-buffers. Think about storing depth and normal in a single 8 bit/channel RGBA texture. Here are my findings &#8211; with error visualization and shader performance numbers for some GPUs. If you know any other method to encode/store normals in a compact way, please [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been experimenting with compact storage of view space normals for small g-buffers. Think about storing depth and normal in a single 8 bit/channel RGBA texture.</p>
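<p>One well-known way to fit a view space normal into two 8 bit channels, leaving the other two for depth, is a spheremap-style transform. Below is a minimal C++ sketch of that idea. It is not necessarily the method the findings below end up recommending. It assumes unit-length normals in a view space convention where visible normals have n.z >= 0 (flip z first if yours is the opposite), since the mapping is singular at n.z = -1. <code>EncodeNormal</code>, <code>DecodeNormal</code> and <code>StoreUnorm8</code> are names made up for this example. <code>StoreUnorm8</code> also assumes the render target clamps and rounds to the nearest k/255.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

// Spheremap-style encoding: unit normal -> two values in [0,1].
// Sketch only; the mapping divides by zero at n.z == -1.
Float2 EncodeNormal(Float3 n) {
    assert(n.z > -1.0f && "spheremap encoding is singular at n.z == -1");
    float f = std::sqrt(8.0f * n.z + 8.0f);
    return { n.x / f + 0.5f, n.y / f + 0.5f };
}

// Inverse mapping; yields a unit-length vector whenever f <= 4.
Float3 DecodeNormal(Float2 enc) {
    float fx = enc.x * 4.0f - 2.0f;
    float fy = enc.y * 4.0f - 2.0f;
    float f = fx * fx + fy * fy;
    // Clamp guards against 8 bit rounding pushing f slightly above 4.
    float g = std::sqrt(std::max(0.0f, 1.0f - f / 4.0f));
    return { fx * g, fy * g, 1.0f - f / 2.0f };
}

// Mimic writing one value to an 8 bit UNORM channel and reading it back
// (assumed clamp + round-to-nearest behavior).
float StoreUnorm8(float v) {
    v = std::min(std::max(v, 0.0f), 1.0f);
    return std::round(v * 255.0f) / 255.0f;
}
```

<p>With encoded = 2 sqrt((1+z)/2)... rather, writing the decoded components out shows x&#178; + y&#178; + z&#178; = f(1 &#8722; f/4) + (1 &#8722; f/2)&#178; = 1. So the decoded normal needs no extra normalize. All of the precision loss comes from the two 8 bit channels.</p>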
<p><a href="http://aras-p.info/texts/CompactNormalStorage.html"><strong>Here are my findings</strong></a> &#8211; with error visualization and shader performance numbers for some GPUs.</p>
<p>If you know any other method to encode/store normals in a compact way, please let me know!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/08/04/compact-normal-storage-for-small-g-buffers/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Encoding floats to RGBA &#8211; the final?</title>
		<link>http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/</link>
		<comments>http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/#comments</comments>
		<pubDate>Thu, 30 Jul 2009 12:58:08 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[gpu]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=369</guid>
		<description><![CDATA[The saga continues! In short, I need to pack a floating point number in [0..1) range into several channels of 8 bit/channel render texture. My previous approach is not ideal. Turns out some folks have figured out an approach that finally seems to work. Here it is for my own reference: gamedev.net forum post by [...]]]></description>
			<content:encoded><![CDATA[<p>The saga continues! In short, I need to pack a floating point number in [0..1) range into several channels of an 8 bit/channel render texture. My <a href="http://aras-p.info/blog/2008/06/20/encoding-floats-to-rgba-again/">previous approach</a> is not ideal.</p>
<p>Turns out some folks have figured out an approach that finally <em>seems</em> to work.</p>
<p>Here it is for my own reference:</p>
<ul>
<li><a href="http://www.gamedev.net/community/forums/topic.asp?topic_id=442138&#038;whichpage=1&#2936108">gamedev.net forum post by gjaegy</a></li>
<li>A suggestion <a href="http://aras-p.info/blog/2008/06/20/encoding-floats-to-rgba-again/#comment-16380">right there</a> in the comments on my previous blog post</li>
<li>A repost on the <a href="http://www.gamerendering.com/2008/09/25/packing-a-float-value-in-rgba/">gamerendering blog</a></li>
<li>Repost on <a href="http://www.gamedev.net/community/forums/topic.asp?topic_id=463075&#038;whichpage=1&#038;#3054958">gamedev.net forums</a> again.</li>
</ul>
<p>So here&#8217;s the proper way:</p>
<blockquote>
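<pre>// Pack v in [0..1) into four 8 bit components; scale factors are 255^0..255^3.
inline float4 EncodeFloatRGBA( float v ) {
  float4 enc = float4(1.0, 255.0, 65025.0, 16581375.0) * v;
  enc = frac(enc);
  // remove from each component the part that the next component stores
  enc -= enc.yzww * float4(1.0/255.0,1.0/255.0,1.0/255.0,0.0);
  return enc;
}
// Reconstruct v as a weighted sum of the four components.
inline float DecodeFloatRGBA( float4 rgba ) {
  return dot( rgba, float4(1.0, 1/255.0, 1/65025.0, 1/16581375.0) );
}</pre>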
</blockquote>
<p>That is, the difference from the <a href="http://aras-p.info/blog/2008/06/20/encoding-floats-to-rgba-again/">previous approach</a> is that the &#8220;magic&#8221; (read: hardware dependent) bias is replaced by subtracting, from each component, the next component&#8217;s encoded value scaled by 1/255.</p>
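<p>To check the round trip offline, here is a straightforward C++ port of the shader functions, using 255&#179; = 16581375 as the last scale factor. <code>Frac</code> and <code>StoreRGBA8</code> are helpers made up for this sketch. <code>StoreRGBA8</code> assumes the 8 bit/channel render target clamps each channel to [0,1] and rounds to the nearest k/255, and real hardware may round differently.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Float4 = std::array<float, 4>;

// HLSL frac(): fractional part, x - floor(x).
float Frac(float x) { return x - std::floor(x); }

// Port of EncodeFloatRGBA; v must be in [0..1).
Float4 EncodeFloatRGBA(float v) {
    assert(v >= 0.0f && v < 1.0f);
    const Float4 scale = {1.0f, 255.0f, 65025.0f, 16581375.0f};
    Float4 enc;
    for (int i = 0; i < 4; ++i) enc[i] = Frac(scale[i] * v);
    // enc -= enc.yzww * (1/255, 1/255, 1/255, 0). Component i+1 is still
    // unmodified when component i reads it, matching the HLSL vector op.
    for (int i = 0; i < 3; ++i) enc[i] -= enc[i + 1] / 255.0f;
    return enc;
}

// Port of DecodeFloatRGBA: weighted sum of the four components.
float DecodeFloatRGBA(const Float4& rgba) {
    return rgba[0] + rgba[1] / 255.0f + rgba[2] / 65025.0f + rgba[3] / 16581375.0f;
}

// Mimic an 8 bit/channel UNORM render target (assumed clamp + round-to-nearest).
Float4 StoreRGBA8(const Float4& c) {
    Float4 out;
    for (int i = 0; i < 4; ++i) {
        float v = std::fmin(std::fmax(c[i], 0.0f), 1.0f);
        out[i] = std::round(v * 255.0f) / 255.0f;
    }
    return out;
}
```

<p>Why the subtraction works: with e<sub>i</sub> = frac(255<sup>i</sup> v), the decode sum telescopes, because (e<sub>0</sub> &#8722; e<sub>1</sub>/255) + (e<sub>1</sub> &#8722; e<sub>2</sub>/255)/255 + (e<sub>2</sub> &#8722; e<sub>3</sub>/255)/255&#178; + e<sub>3</sub>/255&#179; = e<sub>0</sub> = v. The first three components are also (close to) exact multiples of 1/255, so 8 bit storage keeps them intact. With 32 bits of storage, the float32 math in the encoder, rather than the 8 bit channels, is likely the limiting factor.</p>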
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Implementing fixed function T&amp;L in vertex shaders</title>
		<link>http://aras-p.info/blog/2009/06/09/implementing-fixed-function-tl-in-vertex-shaders/</link>
		<comments>http://aras-p.info/blog/2009/06/09/implementing-fixed-function-tl-in-vertex-shaders/#comments</comments>
		<pubDate>Tue, 09 Jun 2009 06:08:50 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[d3d]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=364</guid>
		<description><![CDATA[Almost half a year ago I was wondering how to implement T&#038;L in vertex shaders. Well, finally I implemented it for upcoming Unity 2.6. I wrote some sort of a technical report here. In short, I&#8217;m combining assembly fragments and doing simple temporary register allocation, which seems to work quite well. Performance is very similar [...]]]></description>
			<content:encoded><![CDATA[<p>Almost half a year ago I was wondering <a href="http://aras-p.info/blog/2009/01/22/fixed-function-lighting-in-vertex-shader-how/">how to implement T&#038;L in vertex shaders</a>.</p>
<p>Well, finally I implemented it for upcoming Unity 2.6. I wrote some sort of a <a href="http://aras-p.info/texts/VertexShaderTnL.html"><strong>technical report here</strong></a>.</p>
<p>In short, I&#8217;m combining assembly fragments and doing simple temporary register allocation, which seems to work quite well. On the several different cards I tried (Radeon HD 3xxx, GeForce 8xxx, Intel GMA 950), performance is very similar to using fixed function (which, I know, the runtime/driver implements internally as vertex shaders anyway).</p>
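<p>As a rough illustration of the general idea, here is a hypothetical C++ sketch. It is not Unity&#8217;s actual code, and <code>TempRegisterPool</code>, <code>Fragment</code>, <code>CombineFragments</code> and the <code>$tN</code> placeholder syntax are all made up here. Each fragment names its scratch registers symbolically. A pool hands out free <code>rN</code> registers (vs_2_0 has 12 temporaries) and takes them back when the fragment ends. Values that must stay live across fragments, such as a normal shared by all lights, are simply allocated by the caller and not freed. Input declarations are left out.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pool of temporary registers; vs_2_0 has 12 (r0..r11).
class TempRegisterPool {
public:
    explicit TempRegisterPool(int count = 12) : inUse_(count, false) {}

    // Hand out the lowest-numbered free register.
    int Alloc() {
        for (int i = 0; i < static_cast<int>(inUse_.size()); ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                return i;
            }
        }
        assert(false && "out of temporary registers");
        return -1;
    }

    void Free(int reg) { inUse_[reg] = false; }

private:
    std::vector<bool> inUse_;
};

// Hypothetical fragment: assembly text that names its scratch registers
// $t0, $t1, ... instead of concrete rN registers.
struct Fragment {
    std::string code;
    int numTemps;
};

// Replace every occurrence of `from` in `s` with `to`.
void ReplaceAll(std::string& s, const std::string& from, const std::string& to) {
    std::string::size_type pos = s.find(from);
    while (pos != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos = s.find(from, pos + to.size());
    }
}

// Concatenate fragments, mapping each fragment's $tN placeholders to free
// registers and releasing them as soon as that fragment ends.
std::string CombineFragments(const std::vector<Fragment>& fragments,
                             TempRegisterPool& pool) {
    std::string out = "vs_2_0\n";  // dcl_* input declarations omitted
    for (const Fragment& frag : fragments) {
        std::vector<int> regs;
        for (int i = 0; i < frag.numTemps; ++i) regs.push_back(pool.Alloc());
        std::string code = frag.code;
        // Highest index first, so "$t1" never matches the start of "$t10".
        for (int i = frag.numTemps - 1; i >= 0; --i)
            ReplaceAll(code, "$t" + std::to_string(i), "r" + std::to_string(regs[i]));
        out += code;
        for (int reg : regs) pool.Free(reg);
    }
    return out;
}
```

<p>Freeing scratch registers at fragment boundaries keeps the allocator trivial. The trade-off is that any value a later fragment needs has to be reserved explicitly up front.</p>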
<p>What was unexpected: the most complex piece is not the vertex lighting! Most of the complexity is in how to route/generate texture coordinates and transform them. There is a huge combinatorial explosion there.</p>
<p>Otherwise &#8211; I like! Here&#8217;s a link to the <a href="http://aras-p.info/texts/VertexShaderTnL.html">article again</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2009/06/09/implementing-fixed-function-tl-in-vertex-shaders/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>


