<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lost in the Triangles &#187; opengl</title>
	<atom:link href="http://aras-p.info/blog/tags/opengl/feed/" rel="self" type="application/rss+xml" />
	<link>http://aras-p.info/blog</link>
	<description>Random thoughts of a triangle pusher</description>
	<lastBuildDate>Fri, 09 Sep 2011 17:03:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Mobile graphics API wishlist: some features</title>
		<link>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/</link>
		<comments>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 13:50:15 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=653</guid>
		<description><![CDATA[In my previous post I talked about things I&#8217;d want from OpenGL ES 2.0 in the performance area. Now it&#8217;s time to look at what extra features it might expose with an extension here or there. Note that I’m focusing on, in my limited understanding, low-hanging fruits. The features I want already exist in the [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/">previous post</a> I talked about things I&#8217;d want from OpenGL ES 2.0 in the performance area. Now it&#8217;s time to look at what extra features it might expose with an extension here or there.</p>
<p><span id="more-653"></span><em>Note that I’m focusing on, in my limited understanding, low-hanging fruits. The features I want already exist in the current GPUs or platforms; or could be easily made available. Of course more radical new architectures would bring more &#038; fancier features, but that&#8217;s a topic for another story.</em></p>
<p><strong>Programmable blending</strong></p>
<p>At least two out of three big current mobile GPU families (PVR SGX, Adreno, Tegra 2) support programmable blending in the hardware. Maybe all of them do this and I just don&#8217;t have enough data. By &#8220;support it in the hardware&#8221; I mean either: 1) the GPU has no blending hardware, the drivers add &#8220;read current pixel &#038; blend&#8221; instructions to the shaders or 2) has blending hardware for commonly used modes, but fancier modes use shader patching with no severe performance penalties.</p>
<p>Programmable blending is useful for various things; from deferred-style decals (blending normals is hard in fixed function!) to fancier Photoshop-like blend modes to potentially faster single-pixel image postprocessing effects (like color correction).</p>
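<p>A minimal sketch of why blending normals is awkward in fixed function, written in Python purely as an illustration (all function names here are made up for the example). A fixed-function lerp of two unit normals gives a vector shorter than unit length, while a programmable blend can renormalize the result before it is written back.</p>

```python
import math

def lerp(a, b, t):
    # What fixed-function alpha blending computes: a*(1-t) + b*t per channel.
    return tuple(x * (1.0 - t) + y * t for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def blend_normals_programmable(dst, src, alpha):
    # A programmable blend can post-process the blended value,
    # here by renormalizing it before it is written back.
    return normalize(lerp(dst, src, alpha))

dst_normal = (0.0, 0.0, 1.0)   # normal already in the framebuffer
src_normal = (1.0, 0.0, 0.0)   # decal normal
fixed = lerp(dst_normal, src_normal, 0.5)
programmable = blend_normals_programmable(dst_normal, src_normal, 0.5)
```

<p>In this example the fixed-function result has a length of about 0.71, and the renormalized one has a length of 1.</p>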
<p>Currently only NVIDIA exposes this capability, via the <a href="http://developer.download.nvidia.com/tegra/docs/tegra_gles2_development.pdf">NV_shader_framebuffer_fetch</a> extension.</p>
<p><em>Suggestion</em>: expose it on other hardware that can do this! It&#8217;s fine to not handle hard edge cases (for example, what happens when multisampling is used?), we can live with the limitations.</p>
<p><strong>Direct, fast access to frame buffer on the CPU</strong></p>
<p>Most (all?) mobile platforms use a unified memory approach, where there&#8217;s no physical distinction between &#8220;system memory&#8221; and &#8220;video memory&#8221;. Some of those platforms are slightly unbalanced, e.g. a strong GPU coupled with a weak CPU or vice versa. More and more of those systems will have multicore CPUs. It might make sense to take an approach similar to what the PS3 guys are doing these days &#8211; offload some of the GPU work to the CPU(s).</p>
<p>Image processing, deferred lighting and similar things could be done more efficiently on a general purpose CPU, where you aren&#8217;t limited to &#8220;one pixel at a time&#8221; model of current mobile GPUs.</p>
<p><em>Suggestion</em>: can haz get a pointer to framebuffer memory perhaps? Of course this is grossly oversimplifying all the synchronization &#038; security issues, but <em>something</em> should be possible to do in order to exploit the unified memory model. Right now it just sits there largely unused, with GLES2.0 still pretending CPU is talking to a GPU over a ten meter high concrete wall.</p>
<p><strong>Expose Tile Based GPU capabilities</strong></p>
<p>PowerVR GPUs found in all iOS and some Android devices are so-called &#8220;tile based&#8221; architectures. So is, to some extent, the Qualcomm Adreno family.</p>
<p>Currently this capability is mostly sitting behind a black box. On PowerVR GPUs the programmer does know that &#8220;overdraw of opaque objects does not matter&#8221;, or that &#8220;alpha testing is really slow&#8221; but that&#8217;s about it. There&#8217;s no control over the whole rendering process, even if some of the things could benefit from having more control over the whole tiling thing.</p>
<p>Take, for example, deferred lighting/shading. The cool folks are doing it tile-based already on <a href="http://www.slideshare.net/DICEStudio/directx-11-rendering-in-battlefield-3?from=ss_embed">DirectX 11</a> or <a href="http://www.slideshare.net/DICEStudio/spubased-deferred-shading-in-battlefield-3-for-playstation-3?from=ss_embed">PS3</a>.</p>
<p>On a tile-based GPU, all rendering is <em>already</em> happening in tiles, so what if we could say &#8220;now, you work on this tile, render this, render that; now we go to this tile&#8221;? Maybe that way we could achieve two things at once: 1) better light culling because it&#8217;s at tile level, and 2) most of the data could stay in this super-fast on-chip memory, without having to be written into system memory &#038; later read again. Memory bandwidth is very often a limiting factor in mobile graphics performance, and the ability to keep deferred lighting buffers on-chip through the whole process could cut down bandwidth requirements a lot.</p>
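<p>To make &#8220;light culling at tile level&#8221; concrete, here is a rough Python sketch of binning lights into screen tiles. Everything in it is an assumption made for illustration: lights are treated as screen-space circles, the tile size is arbitrary, and the function name is hypothetical. It says nothing about how any particular GPU lays out its tiles.</p>

```python
def bin_lights_into_tiles(lights, screen_w, screen_h, tile_size):
    """Return {(tile_x, tile_y): [light indices]} for lights given as
    (center_x, center_y, radius) circles in screen-space pixels."""
    tiles_x = (screen_w + tile_size - 1) // tile_size
    tiles_y = (screen_h + tile_size - 1) // tile_size
    bins = {}
    for index, (cx, cy, radius) in enumerate(lights):
        # Conservative test: every tile touched by the light's bounding box.
        x0 = max(int((cx - radius) // tile_size), 0)
        x1 = min(int((cx + radius) // tile_size), tiles_x - 1)
        y0 = max(int((cy - radius) // tile_size), 0)
        y1 = min(int((cy + radius) // tile_size), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins

# Two lights on a 64x32 screen split into 16x16 tiles.
bins = bin_lights_into_tiles([(8, 8, 4), (40, 20, 10)], 64, 32, 16)
```

<p>Each tile then only needs to evaluate the lights in its own list; tiles that no light touches do not appear in the result at all.</p>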
<p><em>Suggestion</em>: somehow <em>(I&#8217;m feeling very hand-wavy today)</em> expose more control over tiled rendering. For example, explicitly say that rendering will only happen to the given tiles; and these textures are very likely to be read just after they are rendered into &#8211; so don&#8217;t resolve them to memory if they fit into on-chip one.</p>
<p>There&#8217;s already a Qualcomm extension that goes in this direction &#8211; <a href="http://www.khronos.org/registry/gles/extensions/QCOM/QCOM_tiled_rendering.txt">QCOM_tiled_rendering</a> &#8211; though it seems to be more concerned with where rendering happens. More control is needed on how to mark FBO textures as &#8220;keep in on-chip memory for sampling as a texture plz&#8221;.</p>
<p><strong>OpenCL</strong></p>
<p>Current mobile GPUs already are, or very soon will be, OpenCL capable. Also OpenCL can be implemented on the CPU, nicely SIMDified via NEON, and use multicore. <em>DO WANT!</em> (and while you&#8217;re at it, everything that&#8217;s doable to make interop between CL &#038; GL faster)</p>
<p>This can be used for a ton of things; skinning, culling, particles, procedural animations, image postprocessing and so on. And with a much less restrictive programming model, it&#8217;s easier to reuse computation results across draw calls or frames.</p>
<p>Couple this with &#8220;direct access to memory on the CPU&#8221; and OpenCL could be used for more things than graphics (again I&#8217;m grossly oversimplifying here and ignoring the whole synchronization/latency/security elephant&#8230;).</p>
<p><strong>MOAR?</strong></p>
<p>Now of course there are more things I&#8217;d want to see, but for today I&#8217;ll take just those above, thank you. Have a nice day!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/03/19/mobile-graphics-api-wishlist-some-features/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Mobile graphics API wishlist: performance</title>
		<link>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/</link>
		<comments>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/#comments</comments>
		<pubDate>Fri, 04 Mar 2011 06:24:49 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=645</guid>
		<description><![CDATA[Most mobile platforms currently are based on OpenGL ES 2.0. While it is much better than traditional OpenGL, there are ways where it limits performance or does not expose some interesting hardware features. So here&#8217;s an unorganized wishlist for GLES2.0 performance part! Note that I&#8217;m focusing on, in my limited understanding, short term low-hanging fruits [...]]]></description>
			<content:encoded><![CDATA[<p>Most mobile platforms currently are based on OpenGL ES 2.0. While it is <em>much</em> better than traditional OpenGL, there are ways where it limits performance or does not expose some interesting hardware features. So here&#8217;s an unorganized wishlist for GLES2.0 performance part!</p>
<p><span id="more-645"></span><em>Note that I&#8217;m focusing on, in my limited understanding, short term low-hanging fruit: how to extend/patch the existing GLES2.0 API. A pipe dream would be starting from scratch, getting rid of all OpenGL baggage and hopefully coming up with a much cleaner, leaner &#038; better API, especially if it&#8217;s designed to only support some particular platform. But I digress, back to GLES2.0 for now.</em></p>
<p><strong>No guarantees when something expensive might happen.</strong></p>
<p>Due to some flexibility in GLES2.0, there might be expensive things happening at almost any point in your frame. For example, binding a texture with a different format might cause a driver to recompile a shader at the draw call time. I&#8217;ve seen <a href="http://twitter.com/#!/aras_p/status/34628257294852096">60 milliseconds</a> on iPhone 3Gs at first draw call with a relatively simple shader, all spent inside shader compiler backend. <em>60 milliseconds!</em> There are various things that can cause performance hiccups like this: texture formats, blending modes, vertex layout, non power of two textures and so on.</p>
<p><em>Suggestion</em>: work with GPU vendors and agree on an API that could make guarantees on when the expensive resource creation / patching work can happen, and when it can&#8217;t. For example, <em>somehow</em> guarantee that a draw call or a state set will not cause any object recreation / shader patching in the driver. I don&#8217;t have much experience with D3D10/11, but my impression is that this was one of the things it got right, no?</p>
<p><strong>Offline shader compilation.</strong></p>
<p>GLES2.0 has the functionality to load binary shaders, but it&#8217;s not mandatory. Some of the big platforms (iOS, I&#8217;m looking at you) just don&#8217;t support it.</p>
<p>Now of course, a single platform (like iOS or Android) can have multiple different GPUs, so you can&#8217;t fully compile a shader offline into final optimized GPU microcode. But <em>some</em> of the full compilation cost could very well be done offline, without being specific to any particular GPU.</p>
<p><em>Suggestion</em>: come up with a platform independent binary shader format. Something like D3D9 shader assembly is probably too low level (it assumes a vector4-based GPU, limited number of registers and so on), but something higher level should be possible. All of the shader lexing, parsing and common optimizations (constant folding, arithmetic simplifications, dead code removal etc.) can be done offline. It won&#8217;t speed up shader loading by an order of magnitude, but even if it&#8217;s possible to cut it by 20%, it&#8217;s worth it. And it would remove a very big bug surface area too!</p>
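<p>As an illustration of the kind of GPU-independent work that could happen offline, here is a toy constant folder in Python. The expression format (nested tuples) and the function name are invented for this sketch; a real shader compiler works on a much richer intermediate representation.</p>

```python
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def fold(expr):
    """Fold constant subexpressions in a tiny expression tree.
    An expression is a number, a variable name (str), or (op, left, right)."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    both_const = isinstance(left, (int, float)) and isinstance(right, (int, float))
    if both_const:
        return OPS[op](left, right)
    # Simple arithmetic simplifications: x*1 -> x, x+0 -> x.
    if op == "*" and right == 1:
        return left
    if op == "+" and right == 0:
        return left
    return (op, left, right)

# diff * (atten * 2.0), with atten known to be 1.0 at compile time
folded = fold(("*", "diff", ("*", 1.0, 2.0)))
```

<p>None of this depends on the target GPU, which is the point: the result could be stored in a platform independent format and handed to the driver for the final, GPU-specific steps.</p>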
<p><strong>Texture loading.</strong></p>
<p>A lot (all?) of mobile platforms have unified CPU &#038; GPU memories, however to actually load the texture we have to read or memory map it from disk and then copy into OpenGL via glTexImage2D and similar functions. Then, depending on the format, the driver would internally do swizzling and alignment of texture data.</p>
<p><em>Suggestion</em>: can&#8217;t most of this cost be removed? If for some formats it&#8217;s perfectly, statically known what layout and swizzling the GPU expects&#8230; can&#8217;t we just point the API to the data we already loaded or memory mapped? We would still need to implement the glTexImage2D case for when (if ever) a totally new strange GPU comes that needs the data in a different order, but why not provide a faster path for the current GPUs?</p>
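<p>For readers who have not met texture swizzling: below is a Python sketch of one commonly cited layout, Morton (Z-order), where the bits of x and y are interleaved. This is only an example of the idea. Actual GPU layouts are vendor-specific, and nothing here is claimed to match what any mobile driver does.</p>

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even bit positions)."""
    index = 0
    for bit in range(bits):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index

def swizzle(linear_texels, width, height):
    """Reorder a row-major, square, power-of-two texture into Morton order."""
    swizzled = [None] * (width * height)
    for y in range(height):
        for x in range(width):
            swizzled[morton_index(x, y)] = linear_texels[y * width + x]
    return swizzled

texels = list(range(16))            # 4x4 texture, row-major
swizzled = swizzle(texels, 4, 4)
```

<p>If the file on disk already stored texels in the order the GPU wants, this reordering pass (and the copy that goes with it) would not have to run at load time.</p>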
<p><strong>Vertex declarations.</strong></p>
<p>In unextended GLES2.0 you have to do <em>a ton</em> of calls just to set up vertex data. <a href="http://www.khronos.org/registry/gles/extensions/OES/OES_vertex_array_object.txt">OES_vertex_array_object</a> is a step in the right direction, providing the ability to create sets of vertex data bindings (&#8220;vertex declarations&#8221; in D3D speak). However, it builds upon an existing API, resulting in something that feels quite messy. Somehow it feels that by starting from scratch it could result in something much cleaner. Like&#8230; vertex declarations that existed in D3D since forever maybe?</p>
<p><em>Suggestion</em>: clean up that shit! It would probably need to be tied to a vertex shader input signature (just like in D3D10/11) to guarantee there would be no shader patching, but we&#8217;d be fine with that.</p>
<p><strong>Shader uniforms are per shader program.</strong></p>
<p>What it says &#8211; shader uniforms (&#8220;constants&#8221; in D3D speak) are not global; they are tied to a specific shader program. I don&#8217;t quite understand why, and I don&#8217;t think any GPU works that way. This is causing complexities and/or performance loss in the driver (it either has to save &#038; restore all uniform values on each shader change, or have dirty tracking on which uniforms have changed etc.). It also causes unneeded uniform sets on the client side &#8211; instead of having, for example, view*projection matrix set just once per frame it has to be set for each shader program that we use.</p>
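<p>A small Python model of the client-side cost, with made-up names throughout: <code>Program</code> stands in for a GLES2.0 program object, and the &#8220;global&#8221; table is the hypothetical alternative, not an existing API.</p>

```python
class Program:
    """Stand-in for a GLES2.0 program object: uniforms live inside it."""
    def __init__(self):
        self.uniforms = {}
        self.upload_count = 0

    def set_uniform(self, name, value):
        self.uniforms[name] = value
        self.upload_count += 1

programs = [Program() for _ in range(3)]
view_projection = "view*projection for this frame"   # placeholder value

# Per-program model: the same value is pushed once into every program.
for program in programs:
    program.set_uniform("u_viewProj", view_projection)
per_program_uploads = sum(p.upload_count for p in programs)

# Hypothetical global model: one shared table that every program reads.
global_uniforms = {"u_viewProj": view_projection}
global_uploads = len(global_uniforms)
```

<p>With three programs that is three uploads versus one; the gap grows with every extra program used in a frame.</p>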
<p><em>Suggestion</em>: just get rid of that? If you need to not break the existing spec, how about adding an extension to make all uniforms global? I propose <code>glCanHaz(GL_OES_GLOBAL_UNIFORMS_PLZ)</code></p>
<p><strong>Next up:</strong></p>
<p>Next time, I&#8217;ll take a look at my unorganized wishlist for mobile graphics features!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/03/04/mobile-graphics-api-wishlist-performance/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>iOS shader tricks, or it&#8217;s 2001 all over again</title>
		<link>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/</link>
		<comments>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/#comments</comments>
		<pubDate>Tue, 01 Feb 2011 07:43:57 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=592</guid>
		<description><![CDATA[I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here&#8217;s a small example of starting with a normalmapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here&#8217;s a small example of starting with a normalmapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below was actually done by <a href="http://twitter.com/#!/__ReJ__">ReJ</a>, props to him!</p>
<p>Here&#8217;s a small test I&#8217;ll be working on: just a single plane with albedo and normal map textures:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump1.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump1-150x150.jpg" alt="" title="iOS Bumped Specular" width="150" height="150" class="alignnone size-thumbnail wp-image-593" /></a></p>
<p><span id="more-592"></span>I&#8217;ll be testing on iPhone 3Gs with iOS 4.2.1. Timer is started before glClear() and stopped after glFinish() that I added just after drawing the mesh.</p>
<p>Let&#8217;s start with an initial na&iuml;ve shader version:<br />
<script src="https://gist.github.com/783784.js"> </script></p>
<p>Should be pretty self-explanatory to anyone who&#8217;s familiar with tangent space normal mapping and Blinn-Phong BRDF. Running time: <strong>24.5 milliseconds</strong>. On iPhone 4&#8217;s Retina resolution, this would be about 4x slower!</p>
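<p>The &#8220;about 4x&#8221; figure follows from pixel counts. A quick Python check, assuming the frame is purely fragment-bound so that time scales linearly with the number of pixels (an assumption, not a measurement):</p>

```python
def pixel_count(width, height):
    return width * height

iphone_3gs = pixel_count(480, 320)
iphone_4 = pixel_count(960, 640)
pixel_ratio = iphone_4 / iphone_3gs

# Assumes the frame is purely fragment-bound, so time scales with pixels.
estimated_ms_on_retina = 24.5 * pixel_ratio
```

<p>Under that assumption the same shader would take roughly 98 milliseconds per frame at Retina resolution.</p>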
<p>What can we do next? On mobile platforms using appropriate precision of variables is often very important, especially in a fragment shader. So let&#8217;s go and add highp/mediump/lowp qualifiers to the fragment shader: <a href="https://gist.github.com/783703/05e78340b12739e853ce031bd0388430ea95f2a6">shader source</a></p>
<p>Still the same running time! Alas, iOS does not have low level shader analysis tools, so we can&#8217;t really tell why that is happening. We could be limited by something else (e.g. normalizing vectors and computing pow() being the bottlenecks that run in parallel with all low precision stuff), or the driver might be promoting most of our computations to higher precision because it feels like it. It&#8217;s a magic box!</p>
<p>Let&#8217;s start approximating instead. How about computing normalized view direction per vertex, and interpolating that for the fragment shader? It won&#8217;t be entirely &#8220;correct&#8221;, but hey, it&#8217;s a phone we&#8217;re talking about. <a href="https://gist.github.com/783703/1e4fd0daa384d308d125a748985e8e203e49625a">shader source</a></p>
<p><a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump3.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump3-150x150.jpg" alt="" title="iOS Bumped Specular, wrong precision!" width="150" height="150" class="alignright size-thumbnail wp-image-594" /></a><br />
15 milliseconds! But&#8230; the rendering is wrong; everything turned white near the bottom of the screen. Turns out PowerVR SGX (the GPU in all current iOS devices) really does mean &#8220;low precision&#8221; when we want to add two lowp vectors and normalize the result. Let&#8217;s try promoting one of them to medium precision with a &#8220;varying mediump vec3 v_viewdir&#8221;: <a href="https://gist.github.com/783703/591eb83dacaae3840cc4e4d3d8b95a4fc3abdd65">shader source</a></p>
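<p>One plausible way this can go wrong, sketched in Python. The model below assumes lowp behaves like the minimum the GLSL ES specification guarantees (range of about -2 to 2, steps of 1/256) and that out-of-range values clamp. Both are assumptions for illustration; what the hardware actually does is not known from this experiment.</p>

```python
def to_lowp(x):
    """Quantize to the minimum lowp guarantee: range [-2, 2], steps of 1/256.
    Out-of-range values are clamped here; real hardware may behave differently."""
    clamped = max(-2.0, min(2.0, x))
    return round(clamped * 256.0) / 256.0

def lowp_normalize_sum(a, b):
    # Every intermediate is forced through lowp, including the squared length.
    s = [to_lowp(x + y) for x, y in zip(a, b)]
    length_sq = to_lowp(sum(to_lowp(c * c) for c in s))
    inv_len = to_lowp(length_sq ** -0.5)
    return [to_lowp(c * inv_len) for c in s], length_sq

light_dir = (0.0, 0.6, 0.8)
view_dir = (0.0, 0.6, 0.8)     # parallel vectors: the worst case for range
result, length_sq = lowp_normalize_sum(light_dir, view_dir)
```

<p>The true squared length of the sum is 4, which does not fit in the modeled range, so the &#8220;normalized&#8221; vector comes out longer than unit length and the lighting ends up too bright.</p>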
<p>That fixed rendering, but we&#8217;re back to 24.5 milliseconds. <em>Sad shader writers are sad&#8230; oh shader performance analysis tools, where art thou?</em></p>
<p>Let&#8217;s try approximating some more: compute half-vector in the vertex shader, and interpolate normalized value. This would get rid of all normalizations in the fragment shader. <a href="https://gist.github.com/783703/6360c2912b860aa30415e5120ef147169274cd71">shader source</a></p>
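<p>A Python sketch of what this approximation does, using made-up vectors and plain linear interpolation (perspective correction is ignored). The half vector is computed and normalized at each vertex; the value the fragment shader receives in between is no longer exactly unit length.</p>

```python
import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def half_vector(light_dir, view_dir):
    # Blinn-Phong half vector: H = normalize(L + V)
    return normalize(tuple(l + v for l, v in zip(light_dir, view_dir)))

def interpolate(a, b, t):
    # What the rasterizer does with a varying between two vertices.
    return tuple(x * (1.0 - t) + y * t for x, y in zip(a, b))

h0 = half_vector((0.0, 0.0, 1.0), normalize((1.0, 0.0, 1.0)))
h1 = half_vector((0.0, 0.0, 1.0), normalize((-1.0, 0.0, 1.0)))
h_mid = interpolate(h0, h1, 0.5)
h_mid_length = math.sqrt(sum(x * x for x in h_mid))   # a bit below 1.0
```

<p>The error grows with the angle between the per-vertex vectors, so finely tessellated meshes suffer less from it than large flat triangles.</p>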
<p><strong>16.3</strong> milliseconds, not too bad! We still have pow() computed in the fragment shader, and that one is probably not the fastest operation there&#8230;</p>
<p>Almost a decade ago, a very common trick was to use a lookup texture to do the lighting. For example, a 2D texture indexed by (N.L, N.H). Since all lighting data would be &#8220;baked&#8221; into the texture, it does not necessarily have to be Blinn-Phong even; we can prepare faux-anisotropic, metallic, toon-shading or other fancy BRDFs there, as long as they can be expressed in terms of N.L and N.H. So let&#8217;s try creating 128&#215;128 RGBA lookup texture and use that: <a href="https://gist.github.com/783703/87f1cf5529d644cab16123550e809e9f7598f4f3">shader source</a></p>
<p>Quick &amp; not super efficient code to create the lighting lookup texture for Blinn-Phong:<br />
<script src="https://gist.github.com/783759.js"> </script></p>
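<p>For readers who cannot load the embedded gist, here is a rough Python equivalent of the idea. It is not the gist&#8217;s code: the channel layout (diffuse in RGB, specular in A), the axis assignment and the shininess value are all assumptions made for this sketch.</p>

```python
def build_lighting_lut(width=128, height=128, shininess=32.0):
    """Return RGBA8 bytes, row-major. u = N.L along x, v = N.H along y.
    RGB holds the diffuse term, A holds the specular term (assumed layout)."""
    data = bytearray(width * height * 4)
    for y in range(height):
        n_dot_h = y / (height - 1)
        specular = n_dot_h ** shininess
        for x in range(width):
            n_dot_l = x / (width - 1)
            offset = (y * width + x) * 4
            diffuse_byte = int(round(n_dot_l * 255.0))
            data[offset:offset + 3] = bytes([diffuse_byte] * 3)
            data[offset + 3] = int(round(specular * 255.0))
    return bytes(data)

lut = build_lighting_lut()
```

<p>The resulting bytes would be uploaded as an ordinary RGBA texture. Any other BRDF that depends only on N.L and N.H could be baked by changing the two expressions inside the loops.</p>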
<p><strong>9.1</strong> milliseconds! We lost some precision in the specular though (it&#8217;s dimmer):<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump6.jpg"><img src="http://aras-p.info/blog/wp-content/uploads/2011/02/iosbump6-150x150.jpg" alt="" title="iOS Bumped Specular via texture LUT" width="150" height="150" class="alignnone size-thumbnail wp-image-595" /></a></p>
<p>What else can be done? Notice that we clamp N.L and N.H values in the fragment shader, but this could be done just as well by the texture sampler, if we set texture&#8217;s addressing mode to CLAMP_TO_EDGE. Let&#8217;s get rid of the clamps: <a href="https://gist.github.com/783703/e24a2475fded83d2196372c8092a0d8de80a98eb">shader source</a></p>
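<p>Why the clamps are redundant, as a Python sketch of nearest-texel sampling on a single row (a simplification: real samplers also filter, and the function name is made up). With CLAMP_TO_EDGE-style addressing, coordinates outside 0..1 read the edge texels, which is what clamping the coordinate first would give as well.</p>

```python
import math

def sample_clamp_to_edge(row, u):
    """Nearest-texel lookup in a 1D row of texels with CLAMP_TO_EDGE-style
    addressing: coordinates outside [0, 1] read the edge texels."""
    size = len(row)
    texel = math.floor(u * size)
    texel = max(0, min(size - 1, texel))
    return row[texel]

def clamp01(u):
    # The clamp the fragment shader was doing by hand.
    return max(0.0, min(1.0, u))

row = [10, 20, 30, 40]
coords = [-0.5, 0.0, 0.3, 0.5, 0.99, 1.0, 1.5]
same = all(
    sample_clamp_to_edge(row, u) == sample_clamp_to_edge(row, clamp01(u))
    for u in coords
)
```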
<p>This is 8.3 milliseconds, or <strong>7.6</strong> milliseconds if we reduce our lighting texture resolution to 32&#215;128.</p>
<p>Should we stop there? Not necessarily. For example, the shader is still multiplying albedo with a per-material color. Maybe that&#8217;s not very useful and can be let go. Maybe we can also make specular be always white?<br />
<script src="https://gist.github.com/783703.js"> </script></p>
<p>How fast is this? <strong>5.9 milliseconds</strong>,&nbsp;or over <strong>4 times</strong> faster than our original shader.</p>
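<p>The timings from the steps above, collected in one place with the speedup relative to the first shader. The numbers are the ones reported in this post; the step labels are paraphrased.</p>

```python
timings_ms = [
    ("naive shader", 24.5),
    ("per-vertex view direction (broken rendering)", 15.0),
    ("per-vertex half vector", 16.3),
    ("128x128 lighting lookup texture", 9.1),
    ("clamps moved to the sampler", 8.3),
    ("32x128 lookup texture", 7.6),
    ("no material color, white specular", 5.9),
]
baseline = timings_ms[0][1]
speedups = [(name, round(baseline / ms, 2)) for name, ms in timings_ms]

for name, factor in speedups:
    print(f"{factor:5.2f}x  {name}")
```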
<p>Could it be made faster? Maybe; that&#8217;s an exercise for the reader :) I tried computing just the RGB color channels and setting alpha to zero, but that got slightly slower. Without real shader analysis tools it&#8217;s hard to see where or if additional cycles could be squeezed out.</p>
<p>I&#8217;m adding <a href='http://aras-p.info/blog/wp-content/uploads/2011/02/iOSShaderPerf.zip'>Xcode project with sources, textures and shaders of this experiment</a>. Notes about it: only tested on iPhone 3Gs (probably will crash on iPhone 3G, and iPad will have wrong aspect ratio). Might not work at all! Shader is read from Resources/Shaders/shader.txt, next to it are shader versions of the steps of this experiment. Enjoy!</p>
<p><em>This is a cross post from altdevblogaday: <a href="http://altdevblogaday.com/ios-shader-tricks-or-its-2001-all-over-again">http://altdevblogaday.com/ios-shader-tricks-or-its-2001-all-over-again</a></em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>GLSL Optimizer</title>
		<link>http://aras-p.info/blog/2010/09/29/glsl-optimizer/</link>
		<comments>http://aras-p.info/blog/2010/09/29/glsl-optimizer/#comments</comments>
		<pubDate>Wed, 29 Sep 2010 10:39:21 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=561</guid>
		<description><![CDATA[During development of Unity 3.0, I was not-so-pleasantly surprised to see that our cross-compiled shaders run slow on iPhone 3Gs. And by &#8220;slow&#8221;, I mean SLOW; at the speeds of &#8220;stop the presses, we can not ship brand new OpenGL ES 2.0 support with THAT performance&#8221;. Back story Take this HLSL pixel shader for particles, [...]]]></description>
			<content:encoded><![CDATA[<p>During development of <a href="http://unity3d.com/unity/whats-new/unity-3">Unity 3.0</a>, I was not-so-pleasantly surprised to see that our <a href="http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/">cross-compiled</a> shaders run <i>slow</i> on iPhone 3Gs. And by &#8220;slow&#8221;, I mean <strong>SLOW</strong>; at the speeds of &#8220;stop the presses, we can not ship brand new OpenGL ES 2.0 support with THAT performance&#8221;.</p>
<p><span id="more-561"></span><br />
<b>Back story</b></p>
<p>Take this HLSL pixel shader for particles, which does nothing but multiply a texture by the per-vertex color:</p>
<blockquote><p><code>
<pre>
half4 frag (v2f i) : COLOR { return i.color * tex2D (_MainTex, i.texcoord); }
</pre>
<p></code></p></blockquote>
<p>This is about as simple as it can get; should be one texture fetch and one multiply for the GPU.</p>
<p>Now <i>of course</i>, when HLSL gets cross-compiled into GLSL, it is augmented by some dummy functions/moves to match GLSL&#8217;s semantics of &#8220;a function called main that takes no arguments and returns no value&#8221;. So you get something like this in GLSL:</p>
<blockquote><p><code>
<pre>
vec4 frag (in v2f i) { return i.color * texture2D (_MainTex, i.texcoord); }
void main() {
    vec4 xl_retval;
    v2f xlt_i;
    xlt_i.color = gl_Color;
    xlt_i.texcoord = gl_TexCoord[0];
    xl_retval = frag (xlt_i);
    gl_FragData[0] = xl_retval;
}
</pre>
<p></code></p></blockquote>
<p>Makes sense. The original function was translated, and main() got added that fills in the input structure, calls the function and writes result to gl_FragData[0] (aka gl_FragColor).</p>
<p>Lo and behold, the above (with some OpenGL ES 2.0 specific stuff added, like precision qualifiers, definitions of varyings etc.) runs like sh*t on a mobile platform.</p>
<p>Which probably means <b>mobile platform drivers are quite bad at optimizing GLSL</b>. I mostly tested iOS, but some tests on Android indicate that the situation is the same (maybe even worse, depending on the exact kind of Android you have). Which is sad since said platforms also do not have any way to precompile shaders offline, where they could afford good but slow compilers.</p>
<p>Now of course, if you&#8217;re writing GLSL shaders by hand, you&#8217;re probably writing close to optimal code, with no redundant data moves or wrapper functions. But if you&#8217;re cross-compiling them from Cg/HLSL, or generating from some shader fragments, or from visual shader editors, you probably depend on shader compiler being decent at optimizing redundant bits.</p>
<p><b>GLSL Optimizer</b></p>
<p>Around the same time I accidentally discovered that the <a href="http://mesa3d.org/">Mesa 3D</a> guys are working on a new GLSL compiler, dubbed <a href="http://cgit.freedesktop.org/mesa/mesa/log/?h=glsl2">GLSL2</a>. I looked at the code and I liked it a lot; very hackable and &#8220;no bullshit&#8221; approach. So I took Mesa&#8217;s GLSL compiler and made it output GLSL back after it has done all the optimizations.</p>
<p>Here it is: <a href="http://github.com/aras-p/glsl-optimizer"><b>http://github.com/aras-p/glsl-optimizer</b></a></p>
<p>It reads GLSL, does some architecture independent optimizations (dead code removal, algebraic simplifications, constant propagation, constant folding, inlining, &#8230;) and spits out &#8220;optimized&#8221; GLSL back.</p>
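<p>To give a feel for one of these passes, here is a toy dead code elimination in Python. It is not how Mesa&#8217;s compiler implements it: the statement format is invented for this sketch, and it only handles straight-line code with whole-variable assignments. The variable names are borrowed from the Diffuse shader shown further down.</p>

```python
def eliminate_dead_code(statements, live_outputs):
    """statements: list of (target, [source names]) in program order.
    Keep only assignments that contribute to a name in live_outputs."""
    live = set(live_outputs)
    kept = []
    for target, sources in reversed(statements):
        if target in live:
            kept.append((target, sources))
            live.discard(target)
            live.update(sources)
    kept.reverse()
    return kept

program = [
    ("o.Emission", []),                    # written, never read
    ("o.Specular", []),                    # written, never read
    ("c", ["_MainTex", "_Color"]),
    ("o.Albedo", ["c"]),
    ("result", ["o.Albedo", "_LightColor0"]),
]
optimized = eliminate_dead_code(program, ["result"])
```

<p>Walking the statements backwards from the shader output keeps only the three assignments that feed it; the writes to unused struct members disappear.</p>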
<p><b>Results</b></p>
<p>Take the simple particle shader example above. GLSL optimizer turns it into:</p>
<blockquote><p><code>
<pre>
void main() {
    gl_FragData[0] =
        (gl_Color.xyzw * texture2D (_MainTex, gl_TexCoord[0].xy)).xyzw;
}
</pre>
<p></code></p></blockquote>
<p>Save for redundant swizzle outputs (on my todo list), this is pretty much what you&#8217;d be writing by hand. No redundant moves, function call inlined, no extra temporaries, sweet!</p>
<p>How much difference does this make?<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesNo.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesNo.jpg" alt="" title="Particles, GLSL not optimized" width="160" height="240" /></a><a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesYes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptParticlesYes.jpg" alt="" title="Particles, optimized GLSL" width="160" height="240" /></a><br />
Lots of particles, non-optimized GLSL on the left; optimized GLSL on the right (click for larger image). <b>Yep, it&#8217;s 236 vs. 36 milliseconds/frame</b> (4 vs. 27 FPS).</p>
<p>This result is for iPhone 3Gs running iOS 4.1. Some Android results: Motorola Droid (some PowerVR GPU): 537 vs. 223 ms; Nexus One (Snapdragon 8250 w/ Adreno GPU): 155 vs. 155 ms (yay! good drivers!); Samsung Galaxy S (some PowerVR GPU): 200 vs. 60 ms. All tests were run at native device resolutions, so do not take this as performance comparisons between devices.</p>
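<p>The same particle results converted to frames per second and to a speedup factor, as a small Python helper. The millisecond figures are the ones quoted above; nothing else is measured here.</p>

```python
def fps(ms_per_frame):
    return 1000.0 / ms_per_frame

# (non-optimized ms, optimized ms) per device, from the particle test
results_ms = {
    "iPhone 3Gs": (236, 36),
    "Motorola Droid": (537, 223),
    "Nexus One": (155, 155),
    "Samsung Galaxy S": (200, 60),
}
summary = {
    device: (round(fps(before), 1), round(fps(after), 1), round(before / after, 2))
    for device, (before, after) in results_ms.items()
}

for device, (fps_before, fps_after, speedup) in summary.items():
    print(f"{device}: {fps_before} -> {fps_after} FPS ({speedup}x)")
```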
<p>What about a more complex shader example? Let&#8217;s try per-pixel lit Diffuse shader (which is quite simple, but will do ok as &#8220;complex shader&#8221; example for a mobile platform). You can see that the GLSL code below is <a href="http://aras-p.info/blog/2010/07/16/surface-shaders-one-year-later/">mostly auto-generated</a>; writing it by hand wouldn&#8217;t produce that many data moves, unused struct members etc. Cg compiles original shader code into 10 ALU and 1 TEX instructions for D3D9 pixel shader 2.0, and is able to optimize away all the redundant stuff.</p>
<blockquote><p><code>
<pre>
struct SurfaceOutput {
    vec3 Albedo;
    vec3 Normal;
    vec3 Emission;
    float Specular;
    float Gloss;
    float Alpha;
};
struct Input {
    vec2 uv_MainTex;
};
struct v2f_surf {
    vec4 pos;
    vec2 hip_pack0;
    vec3 normal;
    vec3 vlight;
};
uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void surf (in Input IN, inout SurfaceOutput o) {
    vec4 c;
    c = texture2D (_MainTex, IN.uv_MainTex) * _Color;
    o.Albedo = c.xyz;
    o.Alpha = c.w;
}
vec4 LightingLambert (in SurfaceOutput s, in vec3 lightDir, in float atten) {
    float diff;
    vec4 c;
    diff = max (0.0, dot (s.Normal, lightDir));
    c.xyz  = (s.Albedo * _LightColor0.xyz) * (diff * atten * 2.0);
    c.w  = s.Alpha;
    return c;
}
vec4 frag_surf (in v2f_surf IN) {
    Input surfIN;
    SurfaceOutput o;
    float atten = 1.0;
    vec4 c;
    surfIN.uv_MainTex = IN.hip_pack0.xy;
    o.Albedo = vec3 (0.0);
    o.Emission = vec3 (0.0);
    o.Specular = 0.0;
    o.Alpha = 0.0;
    o.Gloss = 0.0;
    o.Normal = IN.normal;
    surf (surfIN, o);
    c = LightingLambert (o, _WorldSpaceLightPos0.xyz, atten);
    c.xyz += (o.Albedo * IN.vlight);
    c.w = o.Alpha;
    return c;
}
void main() {
    vec4 xl_retval;
    v2f_surf xlt_IN;
    xlt_IN.hip_pack0 = vec2 (gl_TexCoord[0]);
    xlt_IN.normal = vec3 (gl_TexCoord[1]);
    xlt_IN.vlight = vec3 (gl_TexCoord[2]);
    xl_retval = frag_surf (xlt_IN);
    gl_FragData[0] = xl_retval;
}
</pre>
<p></code></p></blockquote>
<p>Running the above through GLSL optimizer produces this:</p>
<blockquote><p><code>
<pre>
uniform vec4 _Color;
uniform vec4 _LightColor0;
uniform sampler2D _MainTex;
uniform vec4 _WorldSpaceLightPos0;
void main ()
{
    vec4 c;
    vec4 tmpvar_32;
    tmpvar_32 = texture2D (_MainTex, gl_TexCoord[0].xy) * _Color;
    vec3 tmpvar_33;
    tmpvar_33 = tmpvar_32.xyz;
    float tmpvar_34;
    tmpvar_34 = tmpvar_32.w;
    vec4 c_i0_i1;
    c_i0_i1.xyz = ((tmpvar_33 * _LightColor0.xyz) *
    	(max (0.0, dot (gl_TexCoord[1].xyz, _WorldSpaceLightPos0.xyz)) * 2.0)).xyz;
    c_i0_i1.w = (vec4(tmpvar_34)).w;
    c = c_i0_i1;
    c.xyz = (c_i0_i1.xyz + (tmpvar_33 * gl_TexCoord[2].xyz)).xyz;
    c.w = (vec4(tmpvar_34)).w;
    gl_FragData[0] = c.xyzw;
}
</pre>
<p></code></p></blockquote>
<p>All functions got inlined, all unused variable assignments got eliminated, and most of the redundant moves are gone. There are some redundant moves left though (again, on my todo list), and the variables are assigned cryptic names after inlining. But otherwise, a hand-written equivalent shader would be pretty close.</p>
<p>Difference between non-optimized and optimized GLSL in this case:<br />
<a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseNo.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseNo.jpg" alt="" title="Per-pixel Diffuse, GLSL not optimized" width="160" height="240" /></a><a href="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseYes.png"><img src="http://aras-p.info/blog/wp-content/uploads/2010/09/glslOptDiffuseYes.jpg" alt="" title="Per-pixel Diffuse, optimized GLSL" width="160" height="240" /></a><br />
Non-optimized vs. optimized: <b>350 vs. 267 ms/frame</b> (2.9 vs. 3.7 FPS). Not bad either!</p>
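<p>For reference, the conversion between the two units quoted above is simple arithmetic. Here is a minimal Python sketch of it; the variable and function names are made up for illustration, and only the 350 and 267 ms figures come from the measurement above.</p>

```python
# Frame times quoted above, in milliseconds per frame.
unoptimized_ms = 350.0
optimized_ms = 267.0


def fps(ms_per_frame):
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


# Ratio of frame times: how many times faster the optimized shader runs.
speedup = unoptimized_ms / optimized_ms
# Fraction of the original frame time that was saved.
time_saved = 1.0 - optimized_ms / unoptimized_ms

print(f"{fps(unoptimized_ms):.1f} vs. {fps(optimized_ms):.1f} FPS")
print(f"speedup {speedup:.2f}x, frame time reduced by {time_saved:.0%}")
```

<p>Comparing frame times rather than FPS is usually the easier way to reason about a change like this, since milliseconds add up linearly while FPS does not.</p>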
<p><b>Closing thoughts</b></p>
<p>Pulling off this GLSL optimizer quite late in the <a href="http://unity3d.com/unity/whats-new/unity-3">Unity 3.0</a> release cycle was a risky move, but it did work.</p>
<p>Hats off to the Mesa folks (Eric Anholt, Ian Romanick, Kenneth Graunke et al) for making an awesome codebase of the GLSL compiler! I haven&#8217;t merged in the latest GLSL compiler developments from the Mesa tree; they&#8217;ve implemented quite a few new compiler optimizations but I was too busy shipping Unity 3 already. Will try to merge them in soon-ish.</p>
<p>I&#8217;ve tested non-optimized vs. optimized GLSL a bit on a desktop platform (MacBook Pro, GeForce 8600M, OS X 10.6.4) and there is no observable speed difference. Which makes sense, and I <i>would have expected</i> mobile drivers to be good at optimization as well, but apparently that&#8217;s not the case.</p>
<p>Now of course, mobile drivers will improve over time, and I hope the offline &#8220;GLSL optimization&#8221; step will become obsolete in the future. I still think it makes perfect sense to fully compile shaders offline, so at runtime there&#8217;s no trace of GLSL at all (just load a binary blob of GPU microcode into the driver), but that&#8217;s a story for another day.</p>
<p>In the meantime, you&#8217;re welcome to try <a href="http://github.com/aras-p/glsl-optimizer">GLSL Optimizer</a> out!</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/09/29/glsl-optimizer/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Compiling HLSL into GLSL in 2010</title>
		<link>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/</link>
		<comments>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/#comments</comments>
		<pubDate>Fri, 21 May 2010 19:59:38 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[d3d]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=523</guid>
		<description><![CDATA[Realtime shader languages these days have settled down into two camps: HLSL (or Cg, which for all practical reasons is the same) and GLSL (or GLSL ES, which is sufficiently similar). HLSL/Cg is used by Direct3D and the big consoles (Xbox 360, PS3). GLSL/ES is used by OpenGL and pretty much all modern mobile platforms [...]]]></description>
			<content:encoded><![CDATA[<p>Realtime shader languages these days have settled down into two camps: HLSL (or Cg, which for all practical purposes is the same) and GLSL (or GLSL ES, which is sufficiently similar). HLSL/Cg is used by Direct3D and the big consoles (Xbox 360, PS3). GLSL/ES is used by OpenGL and pretty much all modern mobile platforms (iPhone, Android, &#8230;).</p>
<p>Since shaders are more or less &#8220;assets&#8221;, having two different languages to deal with is not very nice. What, I&#8217;m supposed to write my shader twice just to support both (for example) D3D and iPad? You would think in 2010, almost a decade after high level realtime shader languages appeared, this problem would be solved&#8230; but it isn&#8217;t!</p>
<p><span id="more-523"></span>In the <a href="http://unity3d.com/unity/coming-soon/unity-3">upcoming Unity 3.0</a>, we&#8217;re going to have OpenGL ES 2.0 for mobile platforms, where GLSL ES is the only option to write shaders in. However, almost all other platforms (Windows, 360, PS3) need HLSL/Cg.</p>
<p>I tried for a bit to make <a href="http://developer.nvidia.com/object/cg_toolkit.html">Cg</a> spit out GLSL code. In theory it can, and I read somewhere that <a href="http://en.wikipedia.org/wiki/Id_Software">id</a> uses it for the OpenGL backend of <a href="http://en.wikipedia.org/wiki/Rage_(video_game)">Rage</a>&#8230; But I just couldn&#8217;t make it work. What&#8217;s possible for <a href="http://en.wikipedia.org/wiki/John_Carmack">John</a> apparently is not possible for mere mortals.</p>
<p>Then I looked at ATI&#8217;s <a href="https://github.com/aras-p/hlsl2glslfork">HLSL2GLSL</a>. That did produce GLSL shaders that were not absolutely horrible. So I started using it, and <em>(surprise!)</em> quickly ran into small issues here and there. Too bad development of the library stopped around 2006&#8230; on the plus side, it&#8217;s open source!</p>
<p>So I just forked it. Here it is: <a href="http://code.google.com/p/hlsl2glslfork/"><strong>http://code.google.com/p/hlsl2glslfork/</strong></a> (<a href="https://github.com/aras-p/hlsl2glslfork/commits/master">commit log here</a>). There are no prebuilt binaries or source drops right now, just a Mercurial repository. BSD license. Patches welcome.</p>
<p><em>Note on the codebase</em>: I don&#8217;t particularly like the codebase. It seems to be somewhat over-engineered code that was probably taken from the reference GLSL parser that 3DLabs once did, and adapted to parse HLSL and spit out GLSL. There are pieces of code that are unused, unfinished or duplicated. Judging from comments, some pieces of code have been in the hands of 3DLabs, ATI and NVIDIA (what good can come out of <em>that</em>?!). However, it <em>works</em>, and that&#8217;s the most important trait any code can have.</p>
<p><em>Note on the preprocessor</em>: I bumped into some preprocessor issues that couldn&#8217;t be easily fixed without first understanding someone else&#8217;s ancient code and then changing it significantly. Fortunately, Ryan Gordon&#8217;s project, <a href="http://icculus.org/mojoshader/">MojoShader</a>, happens to have a preprocessor that very closely emulates HLSL&#8217;s (including various quirks). So I&#8217;m using that to preprocess any source before passing it down to HLSL2GLSL. Kudos to Ryan!</p>
<p><em>Side note on MojoShader</em>: Ryan is also working on an HLSL->GLSL cross compiler in MojoShader. I like that codebase much more; will certainly try it out once it&#8217;s somewhat ready.</p>
<p><em>You can never have enough notes</em>: Google&#8217;s <a href="http://code.google.com/p/angleproject/">ANGLE project</a> (running OpenGL ES 2.0 on top of the Direct3D runtime+drivers) seems to be working on the opposite tool. For obvious reasons, they need to take GLSL ES shaders and produce D3D compatible shaders (HLSL or shader assembly/bytecode). The project seems to be moving fast; and if one day we decide to default to GLSL as the shader language in Unity, I&#8217;ll know where to look for a translator into HLSL :)</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2010/05/21/compiling-hlsl-into-glsl-in-2010/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>ARB_draw_buffers</title>
		<link>http://aras-p.info/blog/2008/12/30/arb-draw-buffers/</link>
		<comments>http://aras-p.info/blog/2008/12/30/arb-draw-buffers/#comments</comments>
		<pubDate>Tue, 30 Dec 2008 07:48:09 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>
		<category><![CDATA[random]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=250</guid>
		<description><![CDATA[No, I don&#8217;t have any particular point to make. But I did not even get the t-shirt&#8230;]]></description>
			<content:encoded><![CDATA[<p><img src="http://aras-p.info/blog/wp-content/uploads/2008/12/arb_draw_buffers.jpg" alt="ARB_draw_buffers" title="ARB_draw_buffers" width="600" height="455" class="alignnone size-full wp-image-249" /></p>
<p>No, I don&#8217;t have any particular point to make. But I did not even get the t-shirt&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2008/12/30/arb-draw-buffers/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>OpenGL 3: a big step in no direction at all?</title>
		<link>http://aras-p.info/blog/2008/08/20/opengl-3-a-big-step-in-no-direction-at-all/</link>
		<comments>http://aras-p.info/blog/2008/08/20/opengl-3-a-big-step-in-no-direction-at-all/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 09:28:03 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=195</guid>
		<description><![CDATA[Well, the post title pretty much summarizes my take on it, doesn&#8217;t it? I guess I could just stop typing now&#8230; but I won&#8217;t! So after some promises, delays and a period of deadly silence, OpenGL 3.0 was released. Response to it was &#8220;interesting&#8220;, to say at least. Some part of that response is related [...]]]></description>
			<content:encoded><![CDATA[<p>Well, the post title pretty much summarizes my take on it, doesn&#8217;t it? I guess I could just stop typing now&#8230; but I won&#8217;t!</p>
<p>So after some promises, delays and a period of deadly silence, OpenGL 3.0 was <a href="http://www.khronos.org/news/press/releases/khronos_releases_opengl_30_specifications_to_support_latest_generations_of/">released</a>.</p>
<p>Response to it was &#8220;<a href="http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&#038;Number=243193">interesting</a>&#8221;, to say the least. Some part of that response is related to seriously mishandled communication on Khronos&#8217; part. Some part is because GL 3.0 is not what it was promised to be. Let&#8217;s just ignore the communication issue, it does not affect OpenGL <em>itself</em> in a direct way (it affects the developer community though).</p>
<p><em>By the way, I borrowed part of the post title from a <a href="http://fireuser.com/blog/opengl_30_a_big_step_in_the_right_direction/">blog post</a> linked from opengl.org. In general, I do not agree with that blog post, but it&#8217;s a valid point of view. Unlike some other <a href="http://zerias.blogspot.com/2008/08/why-fud-against-opengl-30.html">blog posts</a> linked from opengl.org that are just pure garbage&#8230;<br />
</em></p>
<p>I am not sure what the goals of OpenGL are at this point. OpenGL&#8217;s current position, as far as games are concerned, seems to be roughly this:</p>
<blockquote><p>Be the graphics API on various platforms where no alternatives are available.</p></blockquote>
<p>Why? Because Windows has got D3D, which is far more stable, comes with useful tools, is updated more often and actually works for a variety of users (I&#8217;ll get to this point in a second). Mobile platforms have OpenGL ES, which is decent. All consoles have their own APIs (some of them similar to D3D, <em>none</em> of them similar to GL). So that leaves OpenGL as the choice on OS X, Linux and such. Not because it&#8217;s better. Because it&#8217;s the only choice.</p>
<p><em>&#8220;Oh, but look, <a href="http://www.idsoftware.com/">id</a> uses OpenGL! Two other games use OpenGL as well!&#8221;</em> Well, good for them. But they are in a different league than &#8220;the rest of us&#8221;. For <em>some games</em>, driver writers will do whatever it takes to get those games running correctly &#038; fast. Surprise surprise, id games fall into this category. For the rest of us &#8211; no such luxury. Hey, try talking to your friendly IHV, the most likely answer is <em>&#8220;yeah, but we are really busy with some high profile games right now, ping us back in two months&#8221;</em>. After two months, repeat.</p>
<p>So the rest comes from someone who is <em>not</em> working on the high-profile games that IHVs specially tune drivers for.</p>
<p>If OpenGL&#8217;s goals are to stay in this current position, then GL 3.0 is okay. It adds some new features, brings some extensions into core, hey, it even says &#8220;it&#8217;s quite likely that maybe perhaps someday some of the old cruft in the API will be removed, if we feel like it&#8221;. No problem with that.</p>
<p>However, OpenGL is advertised as something different, as if it wants to:</p>
<blockquote><p>Be <strong>the</strong> graphics API on <strong>various platforms</strong>.</p></blockquote>
<p>Which is quite different from its current position. I&#8217;m not sure if that&#8217;s the goal of OpenGL. Myself, I don&#8217;t care about the mythical cross-platform API that would <em>actually work</em> on those different platforms. API is a tool to do stuff; if different platforms have different APIs &#8211; no problem with that.</p>
<p>However, if OpenGL <em>wants</em> to achieve this advertised goal, it has to do several things. First and foremost:</p>
<p><strong>Actually work</strong></p>
<p>Stable drivers and runtime. In its current state, GL is too complex to implement good quality drivers/runtime. Complexity can be reduced in several ways:</p>
<ul>
<li>Clean up the API. This was what GL 3.0 was supposed to be. Actual 3.0 did not do any of that, instead it just postponed the cleanup &#8220;until we feel like it&#8221;.</li>
<li>Share some of the hard work. Why does everyone and their dog have to write a GLSL preprocessor, lexer, parser and basic optimizer themselves? Define a precompiled shader format, write the frontend once, make it open. This would also be actually useful to reduce load times.</li>
</ul>
<p>GL 3.0 could have done both of the above, instead it did neither. It could have cleaned up the API, and provided one platform independent GL 1.x/2.x library that calls into the actual 3.0 runtime. All the fixed function, immediate mode, display lists, whatever would be in one nice library. Even existing apps could continue to function transparently this way (with the benefit of actually simpler = more stable drivers).</p>
<p><strong>Support platforms/hardware/features user needs</strong></p>
<p>This is of course dependent on the user in question. For someone like <a href="http://unity3d.com/">us</a>, we still have to support 10 year old hardware.</p>
<p>D3D9 does a fine job for that (provided you have drivers installed, and DX9 runtime installed &#8211; which comes included in XP SP2 and upwards). OpenGL 2.1 and earlier would do a fine job for that, provided it would &#8220;actually work&#8221; (see above).</p>
<p>If GL 3.0 were what was originally promised &#8211; almost new API, shader model 2.0+ hardware, it would be sort of fine. In our case, that would mean writing and supporting two renderers &#8211; &#8220;old GL&#8221; and &#8220;new GL&#8221;, where old one would be used on old hardware or old platforms where &#8220;new GL&#8221; is not available. If the new runtime were much leaner, much more stable and generally nicer, this would not be a big problem.</p>
<p>With actual GL 3.0, in theory one does not have to write two renderers. Minimum hardware level for GL 3.0 is shader model 4+ though. So to support both old hardware/platforms and new hardware/platforms, quite a lot of duplication has to be done. Especially if you intend to go towards proposed &#8220;future GL path&#8221;, i.e. start dropping deprecated functionality from the codebase. At which point you&#8217;ll probably write two separate renderers already. So we&#8217;re back to where original GL 3.0 would have been, just without any extra niceness/stability/leanness right now.</p>
<p>Oh, and look at <a href="http://www.khronos.org/library/detail/2008_siggraph_opengl_bof_slides/">vendor announcements</a> from 2008 OpenGL BOF. NVIDIA: we have almost full drivers now. AMD: we&#8217;re committed to having drivers. Intel: look for GL 3.0 on future platforms. In other words, looks like Intel&#8217;s current cards won&#8217;t ever have GL 3.0 drivers. And in our target market, Intel has the majority of cards.</p>
<p>That sounds very much like &#8220;just ignore whole GL 3.0 thing&#8221; plan to me.</p>
<p><strong>Be nice</strong></p>
<p>This is a point of far lesser importance than &#8220;actually work&#8221; and &#8220;support what is needed&#8221; ones. Having good tools (PIX, &#8230;), documentation, code examples etc. is nice. But not much more; being the nicest API in the world does not do much if it does not actually work or does not support what you need. Even in this area, actual GL 3.0 is <em>not</em> nice &#8211; it&#8217;s full of redundancies and crap that goes 15 years back in history.</p>
<p><strong>Summing it up</strong></p>
<p>To me, GL 3.0 looks like a blunder. Instead of fixing the core problems, they just postponed that. Well, <em>Keep up the good work!</em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2008/08/20/opengl-3-a-big-step-in-no-direction-at-all/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Depth bias and the power of deceiving yourself</title>
		<link>http://aras-p.info/blog/2008/06/12/depth-bias-and-the-power-of-deceiving-yourself/</link>
		<comments>http://aras-p.info/blog/2008/06/12/depth-bias-and-the-power-of-deceiving-yourself/#comments</comments>
		<pubDate>Thu, 12 Jun 2008 06:52:19 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[d3d]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[unity]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=176</guid>
		<description><![CDATA[In Unity we very often mix fixed function and programmable vertex pipelines. In our lighting model, some amount of brightest lights per object are drawn in pixel lit mode, and the rest are drawn using fixed function vertex lighting. Naturally the pixel lights most often use vertex shaders, as they want to calculate some texcoords [...]]]></description>
			<content:encoded><![CDATA[<p>In Unity we very often mix fixed function and programmable vertex pipelines. In our lighting model, some number of the brightest lights per object are drawn in pixel lit mode, and the rest are drawn using fixed function vertex lighting. Naturally the pixel lights most often use vertex shaders, as they want to calculate some texcoords for light cookies, or do something with tangent space, or calculate some texcoords for shadow mapping, and so on. The vertex lighting pass uses fixed function, because it&#8217;s the easiest way. It is possible to implement an equivalent of fixed function lighting in vertex shaders, but we haven&#8217;t done that yet because of complexities of Direct3D <em>and</em> OpenGL, the need to support shader model 1.1 and various other issues. Call me lazy.</p>
<p>And herein lies the problem: most often precision of vertex transformations is not the same in fixed function versus programmable vertex pipelines. If you&#8217;d just draw some objects in multiple passes, mixing fixed function and programmable paths, this is roughly what you will get (excuse my programmer&#8217;s art):<br />
<a href='http://aras-p.info/blog/wp-content/uploads/2008/06/scenenobias.png'><img src="http://aras-p.info/blog/wp-content/uploads/2008/06/scenenobias-300x225.png" alt="Mixing fixed function and vertex shaders" title="scenenobias" width="300" height="225" class="alignnone size-medium wp-image-177" /></a></p>
<p><em>Not pretty at all!</em> This should have looked like this:<br />
<a href='http://aras-p.info/blog/wp-content/uploads/2008/06/scenegoodbias.png'><img src="http://aras-p.info/blog/wp-content/uploads/2008/06/scenegoodbias-300x225.png" alt="All good here" title="scenegoodbias" width="300" height="225" class="alignnone size-medium wp-image-178" /></a></p>
<p>So what do we do to make it look like this? We &#8220;pull&#8221; (bias) some rendering passes slightly towards the camera, so there is no depth fighting.</p>
<p>Now, at the moment the Unity editor runs only on Macs, which use OpenGL. There, most hardware configurations do not need this depth bias at all &#8211; they are able to generate the same results in fixed function and programmable pipelines. Only Intel cards do need the depth bias on Mac OS X (on Windows, AMD and Intel cards need depth bias). So people author their games using OpenGL, where depth bias is not needed in most cases.</p>
<p>How do you apply depth bias in OpenGL? Enable GL_POLYGON_OFFSET_FILL and set <a href="http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/gl/polygonoffset.html">glPolygonOffset</a> to something like -1, -1. This works.</p>
<p>How do you apply depth bias in Direct3D 9? <em>Conceptually</em>, you do the same. There are <a href="http://msdn.microsoft.com/en-us/library/bb205599(VS.85).aspx">DEPTHBIAS and SLOPESCALEDEPTHBIAS</a> render states that do just that. And so we did use them.</p>
<p><a href="http://forum.unity3d.com/viewtopic.php?t=8443">And people complained</a> about funky results on Windows.</p>
<p>And I&#8217;d look at their projects, see that they are using something like 0.01 for camera&#8217;s near plane and 1000.0 for the far plane, and tell them something along the lines of <em>&#8220;increase your near plane, stupid!&#8221;</em> (well ok, without the &#8220;stupid&#8221; part). And I&#8217;d explain all the above about mixing fixed function and vertex shaders, and how we do depth bias in that case, and how on OpenGL it&#8217;s often not needed but on Direct3D it&#8217;s pretty much always needed. And yes, how sometimes that can produce &#8220;double lighting&#8221; artifacts on close or intersecting geometry, and how the only solution is to increase the near plane and/or avoid close or intersecting geometry.</p>
<p>Sometimes this helped! I was <em>so convinced</em> that their too-low-near-plane was always the culprit.</p>
<p>And then one day I decided to check. This is what I&#8217;ve got on Direct3D:<br />
<a href='http://aras-p.info/blog/wp-content/uploads/2008/06/scenebadbias.png'><img src="http://aras-p.info/blog/wp-content/uploads/2008/06/scenebadbias-300x225.png" alt="Depth bias artefacts" title="scenebadbias" width="300" height="225" class="alignnone size-medium wp-image-179" /></a></p>
<p>Ok, this scene is intentionally using a low near plane, but let me stress this again. This is what I&#8217;ve got:<br />
<a href='http://aras-p.info/blog/wp-content/uploads/2008/06/scenebadbiasfail.png'><img src="http://aras-p.info/blog/wp-content/uploads/2008/06/scenebadbiasfail-300x225.png" alt="Epic fail!" title="scenebadbiasfail" width="300" height="225" class="alignnone size-medium wp-image-180" /></a></p>
<p><em>Not good at all.</em></p>
<p>What happened? It happened in roughly this way:</p>
<ol>
<li>First, depth bias <a href="http://msdn.microsoft.com/en-us/library/bb205599(VS.85).aspx">documentation</a> on Direct3D is wrong. Depth bias is <em>not</em> in the 0..16 range; it is in the 0..1 range, which corresponds to the entire range of the depth buffer.</li>
<li>Back then, our code was always using 16 bit depth buffers, so the equivalent of -1,-1 depth bias in OpenGL was multiplied with something like 1.0/65535.0, and that was fed into Direct3D. <em>Hey, it seemed to work!</em></li>
<li>Later on, the device setup code was modified to do proper format selection, so most often it ended up using 24 bit depth buffer. <em>Of course</em> <del datetime="2008-06-12T06:33:50+00:00">no one</del><ins datetime="2008-06-12T06:50:43+00:00"> I</ins> never modified the depth bias code to account for this change&#8230;</li>
<li>And it stayed there. And I kept deceiving myself that the content of the users is to blame, and not some stupid code of mine.</li>
</ol>
<p><strong>It&#8217;s good to check your assumptions once in a while.</strong></p>
<p>So yeah, the proper multiplier for depth bias on Direct3D with 24 bit depth buffer should be not 1.0/65535.0, but something like 1.0/(2^24-1). Except that this value is <em>really small</em>, so something like 4.8e-7 should be used instead (see <a href="http://terathon.com/gdc07_lengyel.ppt">Lengyel&#8217;s GDC2007 talk</a>). Oh, but for some reason it&#8217;s not really enough in practice, so something like 2.0*4.8e-7 should be used instead (tested so far on GeForce 8600, Radeon HD 3850, Radeon 9600, Intel 945, reference rasterizer). Oh, and the same value should be used even when a 16 bit depth buffer is used; using 1.0/65535.0 multiplier with 16 bit depth buffer produces way too large bias.</p>
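<p>To put the multipliers above side by side, here is a minimal Python sketch of the arithmetic. The helper <code>depth_unit</code> and the constant names are made up for illustration, and this is not the actual Unity code; only the numeric values come from the text above.</p>

```python
def depth_unit(bits):
    """Smallest step of a fixed-point depth buffer with the given bit depth,
    expressed in the 0..1 range that Direct3D uses for depth bias."""
    return 1.0 / (2 ** bits - 1)


unit_16 = depth_unit(16)  # 1.0/65535.0, the old multiplier
unit_24 = depth_unit(24)  # 1.0/(2^24-1), the "obvious" 24 bit multiplier

# Values from the post: 4.8e-7 (from Lengyel's talk), doubled in practice.
LENGYEL_EPSILON = 4.8e-7
d3d9_bias_multiplier = 2.0 * LENGYEL_EPSILON

print(f"16 bit unit:     {unit_16:.3e}")
print(f"24 bit unit:     {unit_24:.3e}")
print(f"multiplier used: {d3d9_bias_multiplier:.3e}")
print(f"old / used:      {unit_16 / d3d9_bias_multiplier:.1f}x")
```

<p>In this sketch the old 1.0/65535.0 multiplier comes out roughly 16 times larger than the value settled on above, which is consistent with the &#8220;way too large bias&#8221; observation.</p>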
<p>With proper bias values the image is good on Direct3D again. Yay for that (fix is coming in Unity 2.1 soon).</p>
<p><em>&#8230;and yes, I know that real men fudge projection matrix instead of using depth bias&#8230; someday maybe.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2008/06/12/depth-bias-and-the-power-of-deceiving-yourself/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What OpenGL actually needs</title>
		<link>http://aras-p.info/blog/2007/11/08/what-opengl-actually-needs/</link>
		<comments>http://aras-p.info/blog/2007/11/08/what-opengl-actually-needs/#comments</comments>
		<pubDate>Thu, 08 Nov 2007 08:33:53 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/2007/11/08/what-opengl-actually-needs/</guid>
		<description><![CDATA[Ok, it looks like OpenGL 3.0 specification will be delayed a bit. Oh well, spec now, first drivers a bit later, sort-of-stable drivers a year or two later, and Joe-the-average-user will hopefully have some OpenGL 3.0 support in his Windows box after 5 years. Still, progress has to be made. The idea of abandoning the [...]]]></description>
			<content:encoded><![CDATA[<p>Ok, it looks like OpenGL 3.0 specification <a href="http://www.opengl.org/news/permalink/opengl_arb_announces_an_update_on_opengl_30/">will be delayed</a> a bit. Oh well, spec now, first drivers a bit later, sort-of-stable drivers a year or two later, and Joe-the-average-user will hopefully have some OpenGL 3.0 support in his Windows box after 5 years. Still, progress has to be made.</p>
<p>The idea of abandoning the old concept of &#8220;bind the current object and do stuff on it&#8221; and replacing it with direct functions that take object as parameter is very good. Too much state-machine-like functionality in current OpenGL is just a pain for no good reason. Also a very good idea is to make most objects immutable once they are created. Too much flexibility for no good reason just makes the lives of driver developers harder (and gives them many more opportunities to introduce bugs). All in all, OpenGL&#8217;s API is becoming more like Direct3D, which is good in my eyes.</p>
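<p>To make the difference concrete, here is a toy Python model of the two styles. It is only a sketch of the idea: the class and method names are invented for illustration and do not correspond to any real OpenGL or Direct3D API.</p>

```python
from types import MappingProxyType


class BindToEditContext:
    """Toy model of 'bind the current object and do stuff on it'."""

    def __init__(self):
        self.textures = {}
        self.bound = None  # hidden state that every later call depends on

    def bind_texture(self, tex_id):
        self.textures.setdefault(tex_id, {})
        self.bound = tex_id

    def set_parameter(self, name, value):
        # Which texture changes depends on whatever was bound last.
        self.textures[self.bound][name] = value


class DirectContext:
    """Toy model of functions that take the object as a parameter,
    with objects that are immutable once created."""

    def __init__(self):
        self.textures = {}

    def create_texture(self, tex_id, **params):
        # All state is supplied at creation; the result is a read-only view.
        self.textures[tex_id] = MappingProxyType(dict(params))

    def get_parameter(self, tex_id, name):
        return self.textures[tex_id][name]


old = BindToEditContext()
old.bind_texture("a")
old.bind_texture("b")                  # some other code rebinds...
old.set_parameter("filter", "linear")  # ...so this edits "b", not "a"

new = DirectContext()
new.create_texture("a", filter="linear")
print(old.textures)
print(new.get_parameter("a", "filter"))
```

<p>In the first style the meaning of a call depends on hidden binding state; in the second every call names its object, and the object cannot change behind the driver&#8217;s back after creation.</p>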
<p>What does OpenGL need, besides all the work that goes into OpenGL 3.0? Certainly not <a href="http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&#038;Number=229374">lengthy discussions</a> on whether alpha test should be kept or removed <em>(it does not matter! just pick one)</em> or whether shader assembly is actually assembly <em>(it&#8217;s not. but current implementations of GLSL are too unusable, so&#8230;)</em>.</p>
<p>What OpenGL needs is implementation quality.</p>
<p>Of all crashes in Unity 1.x web games, close to 100% are inside the dll of OpenGL driver, occurring in totally unpredictable situations. I&#8217;ve yet to see a crash in D3D driver of Unity 2.0 web games. Why is this?</p>
<p>My thinking is that it&#8217;s because in D3D, quite a chunk of work is done by Microsoft (the D3D runtime). And as it&#8217;s a component of the OS, they probably try hard to make it stable, and they have WHQL tests at least. It&#8217;s a somewhat similar situation on the Mac with OpenGL &#8211; Apple does the runtime, and IHVs do the drivers. Thus OpenGL on the Mac is <em>much more</em> stable than on Windows (it&#8217;s not as stable as I&#8217;d like it to be, but hey).</p>
<p>Get <em>someone</em> out of the whole Khronos conglomerate to write GLSL parsers, format conversions, whatever else that is not directly tied to the hardware. Make it open source if you wish, so that some bugs can be found by mere mortals (instead of waiting indefinitely for IHVs to reply because we&#8217;re not important enough). Write very extensive testing suites that not just test rasterization rules, but also try to do something more complex than drawing a couple of primitives. The more tests the better. And <strong>make it required</strong> for all implementations to use this common codebase and pass all the tests, otherwise they won&#8217;t have the right to call themselves &#8220;OpenGL&#8221;.</p>
<p>Oh, and get more games to actually use OpenGL, because right now all drivers have to do is make sure the current id Software engine runs okay :)</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/11/08/what-opengl-actually-needs/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Is OpenGL really faster than D3D9?</title>
		<link>http://aras-p.info/blog/2007/09/23/is-opengl-really-faster-than-d3d9/</link>
		<comments>http://aras-p.info/blog/2007/09/23/is-opengl-really-faster-than-d3d9/#comments</comments>
		<pubDate>Sat, 22 Sep 2007 23:50:08 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[d3d]]></category>
		<category><![CDATA[opengl]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/2007/09/23/is-opengl-really-faster-than-d3d9/</guid>
		<description><![CDATA[The common knowledge is that drawing stuff in OpenGL is much more faster than in D3D9. I wonder &#8211; is this actually true, or just an urban legend? I could very well imagine that setting everything up to draw a single model and then issuing 1000 draw calls for it is faster in OpenGL&#8230; but [...]]]></description>
			<content:encoded><![CDATA[<p>The common knowledge is that drawing stuff in OpenGL is much faster than in D3D9. I wonder &#8211; is this actually true, or just an urban legend? I could very well imagine that setting everything up to draw a single model and then issuing 1000 draw calls for it is faster in OpenGL&#8230; but come on, that&#8217;s not a very life-like scenario!</p>
<p>At <a href="http://unity3d.com">work</a> we now have both D3D9 and OpenGL renderers on Windows. The original codebase was very much designed for OpenGL, so I had to jump through a lot of hoops to get it fully working on D3D&#8230; small differences that add up, like: there&#8217;s no object space texgen on D3D, shaders don&#8217;t track built-in state (world, modelview matrices, light positions, &#8230;), textures in GL vs. textures + sampler state in D3D, and so on. Anyway, the codebase was definitely not designed to exploit D3D strengths and OpenGL weaknesses, more likely the other way around.</p>
<p>But wait! I look at our benchmark tests, and D3D9 is consistently faster than OpenGL. Some examples:</p>
<ul>
<li>Real world scene with lots of shadow casting lights (different objects, different shaders, different lights, different shadow types in one scene):
<ul>
<li>Core Duo with Radeon X1600: 23 FPS D3D9, 13 FPS GL.</li>
<li>P4 with GeForce 6800GT: 16 FPS D3D9, 9 FPS GL.</li>
<li>Core2 Duo with Radeon HD 2600: 41 FPS D3D9, 35 FPS GL.</li>
</ul>
</li>
<li>High object count test (1000 objects, multiple lights, 5 passes per object total):
<ul>
<li>Core Duo with Radeon X1600: 18.3 FPS D3D9, 12.5 FPS GL.</li>
<li>P4 with GeForce 6800GT: 13.2 FPS D3D9, 9.4 FPS GL.</li>
<li>Core2 Duo with Radeon HD 2600: 34.8 FPS D3D9, 29.3 FPS GL.</li>
</ul>
</li>
<li>Dynamic geometry (lots of particle systems) test (this is limited by vertex buffer writing speed and the CPU calculating the particles, not by draw calls):
<ul>
<li>Core Duo with Radeon X1600: 170 FPS D3D9, 102 FPS GL.</li>
<li>P4 with GeForce 6800GT: 108 FPS D3D9, 74 FPS GL.</li>
<li>Core2 Duo with Radeon HD 2600: 325 FPS D3D9, 242 FPS GL.</li>
</ul>
</li>
<li>&#8230;and so on.</li>
</ul>
<p>To be fair, there are a couple of tests where, on some hardware, OpenGL has a slight edge. But in 95% of the cases, D3D9 is faster. Not to mention that we have about 10x fewer workarounds for broken hardware and drivers on D3D9 than on OpenGL&#8230;</p>
<p>What gives? Either our OpenGL code is horribly suboptimal, or <em>&#8220;OpenGL is faster!!!!11oneoneeleven&#8221;</em> is a myth. I have trouble figuring out where our code would be horribly suboptimal; I think we follow all the advice given by hardware vendors on how to make OpenGL efficient (not that there is much advice out there, though&#8230;).</p>
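<p>FPS differences are easier to judge as time per frame. A minimal sketch that converts the figures quoted above into milliseconds; the short test labels and the helper names are made up for illustration, and only the FPS numbers come from the list:</p>
<pre>
```python
# FPS figures copied from the benchmark list above:
# (test label, machine, D3D9 FPS, OpenGL FPS). Labels are illustrative.
RESULTS = [
    ("shadows", "Core Duo / Radeon X1600", 23.0, 13.0),
    ("shadows", "P4 / GeForce 6800GT", 16.0, 9.0),
    ("shadows", "Core2 Duo / Radeon HD 2600", 41.0, 35.0),
    ("objects", "Core Duo / Radeon X1600", 18.3, 12.5),
    ("objects", "P4 / GeForce 6800GT", 13.2, 9.4),
    ("objects", "Core2 Duo / Radeon HD 2600", 34.8, 29.3),
    ("particles", "Core Duo / Radeon X1600", 170.0, 102.0),
    ("particles", "P4 / GeForce 6800GT", 108.0, 74.0),
    ("particles", "Core2 Duo / Radeon HD 2600", 325.0, 242.0),
]


def frame_ms(fps):
    """Milliseconds spent on one frame at the given frame rate."""
    return 1000.0 / fps


def compare(d3d_fps, gl_fps):
    """Return (extra GL milliseconds per frame, D3D9-over-GL speed ratio)."""
    return frame_ms(gl_fps) - frame_ms(d3d_fps), d3d_fps / gl_fps


for test, machine, d3d, gl in RESULTS:
    extra_ms, ratio = compare(d3d, gl)
    print(f"{test:10s} {machine:28s} GL +{extra_ms:.2f} ms/frame, D3D9 {ratio:.2f}x")
```
</pre>
<p>Seen this way, 23 vs. 13 FPS in the shadow scene is roughly 33 ms of extra time per frame on OpenGL, while 170 vs. 102 FPS in the particle test is only about 4 ms, even though both ratios are around 1.7x.</p>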
<p>There isn&#8217;t much software that can run the same content on both D3D and OpenGL and is suitable for benchmarking. I tried <a href="http://ogre3d.org">Ogre 3D</a> demos on one machine (GeForce 6800GT card) and guess what? D3D9 is faster in tests that specifically stress draw count (like the instancing demo&#8230; D3D9 is faster both in instanced and non-instanced modes).</p>
<p>Am I crazy?</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/09/23/is-opengl-really-faster-than-d3d9/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Can you set OpenGL states independently?</title>
		<link>http://aras-p.info/blog/2007/07/25/can-you-set-opengl-states-independently/</link>
		<comments>http://aras-p.info/blog/2007/07/25/can-you-set-opengl-states-independently/#comments</comments>
		<pubDate>Wed, 25 Jul 2007 20:50:11 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/2007/07/25/can-you-set-opengl-states-independently/</guid>
		<description><![CDATA[Most of the time, yes, you can just set the needed states! You can set alpha blending on and turn light #0 off, and often nothing bad will happen. Blending will be on, and light #0 will be off. Fine. Until you hit a graphics card (quite new &#8211; from 2006, it can even do [...]]]></description>
			<content:encoded><![CDATA[<p>Most of the time, yes, you can just set the needed states! You can set alpha blending on and turn light #0 off, and often nothing bad will happen. Blending will be on, and light #0 will be off. Fine.</p>
<p>Until you hit a graphics card (quite new &#8211; from 2006, it can even do pixel shader 2.0) that completely hangs the machine in one of your unit tests. In fact, in the very first unit test, which does almost nothing. Debugging that thing is <em>total awesomeness</em> &#8211; try something out, and the machine either hangs or it does not. Reboot, repeat.</p>
<p>After something like 30 hang-ups I found the cause: <em>you are damned</em> if you set GL_SEPARATE_SPECULAR_COLOR and GL_COLOR_SUM to different values (i.e. use separate specular but don&#8217;t turn on color sum). Because, you know, some code was there that did not see the point in changing the light model color control when no lighting was on. So yeah, always set those two in sync. Just to please this card&#8217;s drivers.</p>
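<p>One way to guarantee the two states never drift apart is to route both through a single helper. A minimal sketch, not the actual engine code: <code>RecordingGL</code> is a hypothetical stand-in that only logs calls so the example runs without a GL context, and <code>set_separate_specular</code> is a made-up name. The enum values are the ones from the standard OpenGL headers.</p>
<pre>
```python
# Enum values as defined in the standard OpenGL headers.
GL_LIGHT_MODEL_COLOR_CONTROL = 0x81F8
GL_SINGLE_COLOR = 0x81F9
GL_SEPARATE_SPECULAR_COLOR = 0x81FA
GL_COLOR_SUM = 0x8458


class RecordingGL:
    """Hypothetical stand-in for a GL binding: it only logs the calls made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any gl* call is recorded as (function name, arguments...).
        def record(*args):
            self.calls.append((name,) + args)
        return record


def set_separate_specular(gl, enabled):
    """Change color control and color sum together, never one without the other."""
    if enabled:
        gl.glLightModeli(GL_LIGHT_MODEL_COLOR_CONTROL, GL_SEPARATE_SPECULAR_COLOR)
        gl.glEnable(GL_COLOR_SUM)
    else:
        gl.glLightModeli(GL_LIGHT_MODEL_COLOR_CONTROL, GL_SINGLE_COLOR)
        gl.glDisable(GL_COLOR_SUM)


gl = RecordingGL()
set_separate_specular(gl, False)   # lighting off: reset both states anyway
print(gl.calls)
```
</pre>
<p>The point of the design is simply that no other code path touches either state directly, so the &#8220;lighting is off, why bother&#8221; shortcut cannot happen.</p>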
<p>It&#8217;s hard for me to have any faith in driver developers. I know that their job is hard, walking the fine line between correctness and getting decent benchmark scores&#8230; But still &#8211; hanging up the machine when two OpenGL 1.2 states are set to different values? Would you trust those people to write <a href="http://www.opengl.org/documentation/glsl">full-fledged compilers</a>?</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/07/25/can-you-set-opengl-states-independently/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debugging story: video memory leaks</title>
		<link>http://aras-p.info/blog/2007/07/14/debugging-story-video-memory-leaks/</link>
		<comments>http://aras-p.info/blog/2007/07/14/debugging-story-video-memory-leaks/#comments</comments>
		<pubDate>Sat, 14 Jul 2007 19:31:19 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/2007/07/14/debugging-story-video-memory-leaks/</guid>
		<description><![CDATA[I ranted about OpenGL p-buffers a while ago. Time for the whole story! From time to time I hit some nasty debugging situation, and it always takes ages to figure out, and the path to the solution is always different. This is an example of such a debugging story. While developing shadow mapping I implemented [...]]]></description>
			<content:encoded><![CDATA[<p>I <a href="http://aras-p.info/blog/2007/06/04/opengl-pbuffers-suck">ranted</a> about OpenGL p-buffers a while ago. Time for the whole story!</p>
<p>From time to time I hit some nasty debugging situation, and it always takes <em>ages</em> to figure out, and the path to the solution is always different. This is an example of such a debugging story.</p>
<p>While developing shadow mapping I implemented a &#8220;screen space shadows&#8221; thing (where cascaded shadow maps are gathered into a screen-space texture and shadow receiver rendering later uses only that texture). Then, after maximizing/restoring the editor window a few times, everything locks up for 3 to 5 seconds, then resumes normally.</p>
<p>So there&#8217;s a problem: a complete freeze after the editor window has been resized a couple of times (not immediately!), but otherwise everything just works. Where is the bug? What caused it?</p>
<p>Since shadows were working fine before, and I never noticed such lock-ups &#8211; it must be the screen-space shadow gathering thing that I just implemented, right? <em>(Fast-forward answer: no)</em> So I try to figure out <em>where</em> the lock-up is happening. Profiling does not give any insights &#8211; the lock-up is not even in my process, but &#8220;somewhere&#8221; else. Hm&#8230; I insert lots of manual timing code around various code blocks (that deal with shadows). They say the lock-up <em>most often</em> happens when activating a new render texture (an OpenGL p-buffer), specifically, when calling glFlush(). But not always; sometimes it&#8217;s still somewhere else.</p>
<p>After some head-scratching, a session with OpenGL Driver Profiler reveals what is actually happening &#8211; video memory is leaked! Apparently Mac OS X &#8220;virtualizes&#8221; VRAM, and when it runs out, the OS will still happily create p-buffers and so on, it will just start swapping VRAM contents to AGP/PCIe area. This swapping causes the lock-up. Ok, so now I know <em>what</em> is happening, I just need to find out <em>why</em>.</p>
<p>I look at all the code that deals with render textures &#8211; it looks ok. And it would be pretty strange if a VRAM leak had gone unnoticed for the two years that Unity has been out in the wild&#8230; So it must be the depth render textures that are causing the leak (since they are a new type added for the shadows), right? <em>(Answer: no)</em></p>
<p>I build a test case that allocates and deallocates a bunch of depth render textures each frame. No leaks&#8230; Huh.</p>
<p>I change my original code so that it gathers screen-space shadows onto the screen directly, instead of the screen-sized texture. No leaks&#8230; Hm&#8230; So it must be the depth render texture followed by screen-size render texture, that is causing the leaks, right? <em>(Answer: no)</em> Because when I have just the depth render texture, I have no leaks; and when I have no depth render texture, instead I gather shadows &#8220;from nothing&#8221; into a screen-size texture, I also have no leaks. So it must be the combination!</p>
<p>So far, the theory is that rendering into a depth texture followed by creation of screen-size texture will cause a video memory leak <em>(Answer: no)</em>. It looks like it leaks the amount that should be taken by depth texture (I say &#8220;it looks&#8221; because in OpenGL you never know&#8230; it&#8217;s all abstracted to make my life easier, hurray!). Looks like a fine bug report, time to build a small repro application that is completely separate from Unity.</p>
<p>So I grab some p-buffer sample code from Apple&#8217;s developer site, change it to also use depth textures and rectangle textures, remove all unused cruft, code the expected bug pattern (render into depth texture followed by rectangle p-buffer creation) and&#8230; it does not leak. D&#8217;oh.</p>
<p>Ok, another attempt: I take the p-buffer related code out of Unity, build a small application with just that code, code the expected bug pattern and&#8230; it does not leak! Huh?</p>
<p><em>Now what?</em></p>
<p>I compare the OpenGL call traces of Unity-in-test-case (leaks) and Unity-code-in-a-separate-app (does not leak). Of course, the Unity case does a lot more; setting up various state, shaders, textures, rendering actual objects with actual shaders, filtering out redundant state changes and whatnot. So I try to bring in bits of stuff that Unity does into my test application.</p>
<p>After a while I made my test app leak video memory (now that&#8217;s an achievement)! Turns out the leak happens when doing this:</p>
<ol>
<li>Create depth p-buffer</li>
<li>Draw to depth p-buffer</li>
<li>Copy its contents into a depth texture</li>
<li>Create a screen-sized p-buffer</li>
<li>Draw something into it <em>using</em> the depth texture</li>
<li>Release the depth texture and p-buffer</li>
<li>Release the screen-sized p-buffer</li>
</ol>
<p>My initial test app was not doing step 5&#8230; Now, <em>why</em> does the leak happen? Is it a bug or something I am doing wrong? And more importantly: how do I get rid of it?</p>
<p>My suspicion was that OpenGL context sharing was somehow to blame here <em>(finally, a correct suspicion)</em>. We share OpenGL contexts, because, well, it&#8217;s the only sane thing to do &#8211; if you have a texture, mesh or shader somewhere, you really want to have it available both to the screen and when rendering into something else. The documentation on sharing of OpenGL contexts is extremely spartan, however. Like: &#8220;yeah, when they are shared, then the resources are shared&#8221; &#8211; great. Well, the actual text is like this (Apple&#8217;s <a href="http://developer.apple.com/qa/qa2001/qa1248.html">QA1248</a>):</p>
<blockquote><p>All sharing is peer to peer and developers can assume that shared resources are reference counted and thus will be<br />
maintained until explicitly released or when the last context sharing resources is itself released. It is helpful to think of this in the simplest terms possible and not to assume excess complication.</p></blockquote>
<p>Ok, <em>I am</em> thinking of this in the simplest terms possible&#8230; and it leaks video memory! The docs do not have a single word on <em>how</em> the resources are reference counted and what happens when a context is deleted.</p>
<p>Anyway, armed with my suspicion of context sharing being The Bad Guy here, I tried random things in my small test app. Turns out that unbinding any active textures from a context before switching to a new one got rid of the leak. It looks like objects are refcounted by contexts, and they are not actually deleted while they are bound in some context (that is what I expect to happen). However, when a context itself is deleted, it seems as if it does not decrease the refcounts of these objects (that is definitely what I don&#8217;t expect to happen). I am not sure if that&#8217;s a bug, or just an undocumented &#8220;feature&#8221;&#8230;</p>
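<p>The behaviour guessed at above can be written down as a toy model. This is only a sketch of the hypothesis, not of how any real driver works: <code>Resource</code>, <code>Context</code> and <code>run</code> are made-up names, and the whole refcounting scheme is an assumption.</p>
<pre>
```python
class Resource:
    """A texture or buffer with a reference count (toy model)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1          # the reference held by the application
        self.freed = False

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # video memory is actually given back


class Context:
    """A GL context that holds one reference per bound resource (toy model)."""

    def __init__(self, drops_bindings_on_destroy):
        self.bound = []
        self.drops_bindings_on_destroy = drops_bindings_on_destroy

    def bind(self, resource):
        resource.refs += 1
        self.bound.append(resource)

    def unbind_all(self):
        for resource in self.bound:
            resource.release()
        self.bound = []

    def destroy(self):
        if self.drops_bindings_on_destroy:
            self.unbind_all()   # the behaviour I expected
        self.bound = []         # the suspected behaviour: references are forgotten


def run(unbind_first, drops_bindings_on_destroy=False):
    """Return True if the depth texture memory ends up freed."""
    depth_texture = Resource("depth texture")
    context = Context(drops_bindings_on_destroy)
    context.bind(depth_texture)
    if unbind_first:
        context.unbind_all()
    depth_texture.release()     # the application deletes the texture
    context.destroy()
    return depth_texture.freed


print("left bound:", run(unbind_first=False))
print("unbound first:", run(unbind_first=True))
```
</pre>
<p>In this model the texture is only freed when it is unbound before the context goes away, or when the context drops its bindings on destruction, which matches what the test app showed.</p>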
<p>All happy, I bring in my changes to the full codebase (&#8220;unbind any active textures before switching to a new context!&#8221;)&#8230; and the leak is still there. Huh?</p>
<p>After some head-scratching and randomly experimenting with <em>whatever</em>, it turns out that you have to unbind any active &#8220;things&#8221; before switching to a new context. Even leaving a vertex buffer object bound can cause depth texture memory to leak when another context is destroyed. Funky, eh?</p>
<p>So that was some 4 days wasted on chasing the bug that started out as &#8220;mysterious 5 second lock-ups&#8221;, went through &#8220;screen-space shadows leak video memory&#8221;, then through &#8220;depth textures followed by screen-size textures leak video memory&#8221; and through &#8220;unbind textures before switching contexts&#8221; to &#8220;unbind everything before switching contexts&#8221;. Would I have guessed it would end up like this? Not at all. I am still not sure if that&#8217;s the intended behavior or a bug; it looks more like a bug to me.</p>
<p>The take-away for OpenGL developers: <strong>when using shared contexts, unbind active textures, VBOs, shader programs etc. before switching OpenGL contexts</strong>. Otherwise, at least on Mac OS X, you will hit video memory leaks.</p>
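<p>A minimal sketch of what &#8220;unbind everything&#8221; might look like. It is not the actual Unity code: <code>RecordingGL</code> is a hypothetical stand-in that just logs calls, <code>make_current</code> stands for whatever platform call switches contexts, and only 2D texture targets, the two buffer bindings and the GL 2.0 program binding are covered. A real version would also need rectangle and other texture targets, ARB programs and so on.</p>
<pre>
```python
# Enum values as defined in the standard OpenGL headers.
GL_TEXTURE_2D = 0x0DE1
GL_TEXTURE0 = 0x84C0
GL_ARRAY_BUFFER = 0x8892
GL_ELEMENT_ARRAY_BUFFER = 0x8893


class RecordingGL:
    """Hypothetical stand-in for a GL binding: it only logs the calls made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args):
            self.calls.append((name,) + args)
        return record


def unbind_everything(gl, texture_units):
    """Bind object 0 everywhere so the context holds no resources."""
    for unit in range(texture_units):
        gl.glActiveTexture(GL_TEXTURE0 + unit)
        gl.glBindTexture(GL_TEXTURE_2D, 0)
    gl.glActiveTexture(GL_TEXTURE0)          # leave unit 0 active
    gl.glBindBuffer(GL_ARRAY_BUFFER, 0)
    gl.glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
    gl.glUseProgram(0)


def switch_context(gl, make_current, new_context, texture_units=4):
    """Unbind first, then switch; make_current is the platform-specific call."""
    unbind_everything(gl, texture_units)
    make_current(new_context)


gl = RecordingGL()
switch_context(gl, lambda ctx: gl.calls.append(("make_current", ctx)),
               "p-buffer context", texture_units=2)
for call in gl.calls:
    print(call)
```
</pre>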
<p>It&#8217;s somewhat sad that I find myself fighting issues like that most of my development time &#8211; not actually implementing some cool new stuff, but <em>making stuff actually work</em>. Oh well, I guess that is the difference between making (tech)demos and an actual software product.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/07/14/debugging-story-video-memory-leaks/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>OpenGL pbuffers suck!</title>
		<link>http://aras-p.info/blog/2007/06/04/opengl-pbuffers-suck/</link>
		<comments>http://aras-p.info/blog/2007/06/04/opengl-pbuffers-suck/#comments</comments>
		<pubDate>Mon, 04 Jun 2007 16:22:03 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/2007/06/04/opengl-pbuffers-suck/</guid>
		<description><![CDATA[Aaargh! Well, the blog title is about as much as I wanted to say on this topic. &#8230;this is just me venting out, during the process of chasing a video memory leak for 4 days already. It involves p-buffers, depth textures, shared OpenGL contexts and other delicious things. Still didn&#8217;t find the cause, but I&#8217;m [...]]]></description>
			<content:encoded><![CDATA[<p>Aaargh! Well, the blog title is about as much as I wanted to say on this topic.</p>
<p>&#8230;this is just me venting, having chased a video memory leak for 4 days already. It involves p-buffers, depth textures, shared OpenGL contexts and other delicious things. Still didn&#8217;t find the cause, but I&#8217;m getting close.</p>
<p>Pbuffer my a**.</p>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/06/04/opengl-pbuffers-suck/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ARB_vertex_buffer_object is stupid</title>
		<link>http://aras-p.info/blog/2007/03/22/arb_vertex_buffer_object-is-stupid/</link>
		<comments>http://aras-p.info/blog/2007/03/22/arb_vertex_buffer_object-is-stupid/#comments</comments>
		<pubDate>Thu, 22 Mar 2007 21:51:00 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[opengl]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=105</guid>
		<description><![CDATA[OpenGL vertex buffer functionality, I mock thee too! Why couldn&#8217;t they make the specification simple&#038;clear, and then why can&#8217;t the implementations work as expected? It started out like this: converting some existing code that generates geometry on the fly. It used to generate that into in-memory arrays and then Just Draw Them. Probably not the [...]]]></description>
			<content:encoded><![CDATA[<div style="text-align: justify;"><p>OpenGL vertex buffer functionality, I <a href="http://www.stevestreeting.com/?p=489">mock thee</a> too! Why couldn&#8217;t they make the <a href="http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_buffer_object.txt">specification</a> simple &#038; clear, and then why can&#8217;t the implementations work as expected?</p>
<p>It started out like this: converting some existing code that generates geometry on the fly. It used to generate that into in-memory arrays and then Just Draw Them. Probably not the most optimal solution, but that&#8217;s fine. Of course we can optimize that, right?</p>
<p>So, with all my knowledge of how things work in D3D, I start the &#8220;I&#8217;ll just do the same in OpenGL&#8221; adventure. Create a single big dynamic vertex buffer and a single big dynamic element buffer; update small portions of them with glBufferSubData, &#8220;discard&#8221; them (= glBufferData with a null pointer) when the end is reached, rinse &#038; repeat.</p>
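<p>The append-then-discard pattern described above, as a minimal sketch. <code>RecordingGL</code> is a hypothetical stand-in that only logs calls, <code>StreamingBuffer</code> is a made-up name, and the buffer object is assumed to be already bound to GL_ARRAY_BUFFER; only the two gl calls and the enum values are real.</p>
<pre>
```python
# Enum values as defined in the standard OpenGL headers.
GL_ARRAY_BUFFER = 0x8892
GL_DYNAMIC_DRAW = 0x88E8


class RecordingGL:
    """Hypothetical stand-in for a GL binding: it only logs the calls made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args):
            self.calls.append((name,) + args)
        return record


class StreamingBuffer:
    """One big dynamic buffer that is filled front to back, then discarded."""

    def __init__(self, gl, capacity):
        self.gl = gl
        self.capacity = capacity
        self.offset = 0
        # Allocate storage without uploading any data.
        gl.glBufferData(GL_ARRAY_BUFFER, capacity, None, GL_DYNAMIC_DRAW)

    def append(self, data):
        """Upload data into the next free chunk; return its byte offset."""
        size = len(data)
        if size > self.capacity:
            raise ValueError("chunk is larger than the whole buffer")
        if self.offset + size > self.capacity:
            # "Discard": re-specify the storage with a null pointer and
            # start writing from the beginning again.
            self.gl.glBufferData(GL_ARRAY_BUFFER, self.capacity, None,
                                 GL_DYNAMIC_DRAW)
            self.offset = 0
        start = self.offset
        self.gl.glBufferSubData(GL_ARRAY_BUFFER, start, size, data)
        self.offset = start + size
        return start


gl = RecordingGL()
buffer = StreamingBuffer(gl, capacity=16)
for chunk in (bytes(8), bytes(8), bytes(4)):
    print("chunk written at offset", buffer.append(chunk))
```
</pre>
<p>The returned offset is what the draw call would use as the start of its vertex data; the third chunk in the example does not fit, so the buffer is discarded and the chunk lands at the start again.</p>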
<p>Now, let&#8217;s for a moment ignore the fact that updating portions of an index buffer does not actually work on Mac OS X&#8230; Everything else is fine and it actually works! Except that&#8230; it&#8217;s quite a lot <span style="font-style: italic;">slower</span> than just doing the old &#8220;render from memory&#8221; thing. Ok, must be some OS X specific thing&#8230; Nope, on a Windows box with a GeForce 6800GT it is still slower.</p>
<p>Now, there are three things that could have gone wrong: 1) I did something stupid (quite likely), 2) VBOs for dynamically updated chunks of geometry suck (could be&#8230; they don&#8217;t have a way to update just one chunk without one extra memory copy at least), 3) both me and VBOs are stupid. If I was me I&#8217;d bet on the third option.</p>
<p>What I don&#8217;t get is: D3D has had a buffer model that is simple to understand and actually works for, like, 6 years now! Why couldn&#8217;t the ARB_vertex_buffer_object guys <span style="font-style: italic;">just copy</span> that? The world would be a better place! No, instead they provide a way to map only the <span style="font-style: italic;">whole </span>buffer; updating chunks costs an extra memory copy; there are confusing usage parameters (when should I use STREAM and when DYNAMIC?); performance costs are unclear (when is <a href="http://www.stevestreeting.com/?p=491">glBufferSubData faster than glMapBuffer</a>?) etc. And in the end when an OpenGL noob like me tries to actually make them work &#8211; he can&#8217;t! It&#8217;s slow!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2007/03/22/arb_vertex_buffer_object-is-stupid/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A stream of random things</title>
		<link>http://aras-p.info/blog/2006/08/14/a-steam-of-random-things/</link>
		<comments>http://aras-p.info/blog/2006/08/14/a-steam-of-random-things/#comments</comments>
		<pubDate>Sun, 13 Aug 2006 21:24:00 +0000</pubDate>
		<dc:creator>Aras Pranckevičius</dc:creator>
				<category><![CDATA[demos]]></category>
		<category><![CDATA[opengl]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[unity]]></category>

		<guid isPermaLink="false">http://aras-p.info/blog/?p=96</guid>
		<description><![CDATA[Awards: Hey, we&#8217;ve got a &#8220;runner up&#8221; award in Apple Design Awards 2006, Best Use of Graphics category! Yeah, a runner up is more like &#8220;the first of the losers&#8221;, but oh well. Got beaten by modo 201, which probably is a fair trade. It&#8217;s just that we thought we&#8217;d be in Best Developer Tool [...]]]></description>
			<content:encoded><![CDATA[<div style="text-align: justify;"><p><span style="font-weight: bold;">Awards:</span> Hey, <a href="http://unity3d.com">we</a>&#8217;ve got a &#8220;runner up&#8221; award in <a href="http://developer.apple.com/ada/">Apple Design Awards 2006</a>, Best Use of Graphics category! Yeah, a runner up is more like &#8220;the first of the losers&#8221;, but oh well. Got beaten by <span style="font-style: italic;">modo 201</span>, which probably is a fair trade. It&#8217;s just that we thought we&#8217;d be in the <span style="font-style: italic;">Best Developer Tool</span> category, but that is apparently for text editors and scripting languages :)</p>
<p><span style="font-weight: bold;">Demos: </span>In other news, fellow <a href="http://nesnausk.org/members.php">ReJ</a> with TBL just won the Assembly 2006 demo competition with an <a href="http://www.pouet.net/prod.php?which=25778">Amiga demo</a>, leaving all the PC demos in the dust. Check it out. Art direction over hardware capabilities, one more time.</p>
<p><span style="font-weight: bold;">Drivers:</span> why, oh why, must graphics drivers be so bad? The other day I was wondering why they can&#8217;t auto-update themselves (with an option to turn it off for corporate users etc.). Now you&#8217;ve got a (not so recent!) driver that parses vertex programs wrong, and a user who does not have a clue that he should update it. It&#8217;s bad enough to have a bug in the first place, but auto-update at least would fix&#8230;</p>
<p>Or you have a driver that says <span style="font-style: italic;">&#8220;I&#8217;m OpenGL 1.2!&#8221;</span> but the 3D texture functions are null. <span style="font-style: italic;">And</span> it&#8217;s the most recent driver for a particular graphics card that you can buy today! <span style="font-style: italic;">And</span> it&#8217;s not even a hard problem! What are the developers thinking &#8211; do they go over the required GL 1.2 functionality, see that some of it is actually missing and decide <span style="font-style: italic;">&#8220;ah, screw it, we&#8217;ll say it&#8217;s 1.2 anyways&#8221;</span>?!</p>
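<p>The defensive workaround is to never trust the version string alone and to check the entry point as well. A minimal sketch, assuming a hypothetical <code>get_proc_address</code> callable that stands in for the platform lookup and returns a false value for a missing function; the helper names are made up, while glTexImage3D is the real GL 1.2 entry point.</p>
<pre>
```python
def parse_gl_version(version_string):
    """Parse the leading major.minor pair of a GL_VERSION string."""
    number = version_string.split()[0]      # drop vendor-specific suffix
    parts = number.split(".")
    return int(parts[0]), int(parts[1])


def can_use_3d_textures(version_string, get_proc_address):
    """Require both a 1.2+ version claim and a non-null glTexImage3D."""
    if parse_gl_version(version_string) >= (1, 2):
        return bool(get_proc_address("glTexImage3D"))
    return False


# A driver that claims 1.2 but returns a null pointer for the function.
broken_driver = lambda name: None
# A driver that actually exports the function (any non-null address).
working_driver = lambda name: 0x1234

print(can_use_3d_textures("1.2.1 SomeVendor", broken_driver))
print(can_use_3d_textures("1.2.1 SomeVendor", working_driver))
```
</pre>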
<p>I just don&#8217;t get it. I could use some enlightenment on this.</p></div>
]]></content:encoded>
			<wfw:commentRss>http://aras-p.info/blog/2006/08/14/a-steam-of-random-things/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

