This article describes how to make D3DX Effects run fast: how to manage device states efficiently, and so on. A more extended article, which also covers renderer organization, feeding effects with parameters and other aspects, is in the ShaderX4 book. This one just outlines the basic ideas related to performance and describes some improvements (everything starting at the "improved solution" section) that didn't make it into the book (I thought about them too late).
D3DX effects do lots of work behind the scenes to make the programmer's life easier. The D3D guys are making serious improvements in reducing Effect system overhead (optimizing internals), but some things just can't be optimized away. I'm mostly talking about the fact that by default, effects save and restore all modified D3D device state. This obviously has some overhead (probably mostly in the D3D runtime and the driver, not in Effects themselves). This behavior is great for quick prototyping, but what if we could reduce the state management overhead?
One common piece of advice is "don't save/restore state in Effects" (via D3DXFX_DONOTSAVE* flags). Ok, fine, but what then? After one effect, the next one will be messed up because the previous one might have set some exotic states! Setting all states in each effect is also not a viable option: if just one of them wanted to set some rarely used state (e.g. ShadeMode=Flat), then all other effects would need to set it back to Gouraud.
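My first observation is that essentially there are three groups of device states:
1. States that must be restored to a standard value if an effect modifies them (e.g. AlphaBlendEnable, ZWriteEnable).
2. States that every effect is required to set (e.g. VertexShader and PixelShader).
3. Dependent states that must be set whenever some master state is set to a given value (e.g. SrcBlend and DestBlend when AlphaBlendEnable is True).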
Which states fall into which groups is completely up to the engine. More on that later.
The initial solution could be: in each effect, write an extra "restore pass". This pass sets 1st group states back to their standard values, if they are modified by the effect. In the renderer, turn off state saving/restoring; render all passes except the last one normally; and just begin/end the last pass (but don't actually render anything in it).
An example effect excerpt could look like this:
technique tec20 {
    pass POpaque {
        VertexShader = compile vs_1_1 vsMainOpaque();
        PixelShader = compile ps_1_1 psMainOpaque();
        AlphaTestEnable = True;
        AlphaFunc = Greater;
        AlphaRef = 250;
    }
    pass PAlpha {
        VertexShader = compile vs_1_1 vsMainAlpha();
        PixelShader = compile ps_1_1 psMainAlpha();
        AlphaTestEnable = False;
        ZWriteEnable = False;
        AlphaBlendEnable = True;
        SrcBlend = SrcAlpha;
        DestBlend = InvSrcAlpha;
    }
    // restore pass (nothing will be actually rendered here)
    pass PRestore {
        AlphaBlendEnable = False; // restore alpha blend to standard value
        ZWriteEnable = True; // restore z write to standard value
    }
}
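On the renderer side, the "begin/end the last pass without drawing" idea could look roughly like the sketch below. To keep it compilable without the DirectX SDK it uses a made-up `MockEffect` whose method names mirror the Begin/BeginPass/EndPass/End calls of an effect; the `kDoNotSaveState` constant and the logging are purely illustrative assumptions, and in a real renderer you would pass the D3DXFX_DONOTSAVE* flags instead.

```cpp
#include <string>
#include <vector>

// Stand-in for the few effect methods used here (hypothetical, not D3DX).
struct MockEffect {
    unsigned passCount;
    std::vector<std::string> log;

    void Begin(unsigned* passes, unsigned /*flags*/) { *passes = passCount; log.push_back("Begin"); }
    void BeginPass(unsigned i) { log.push_back("BeginPass " + std::to_string(i)); }
    void EndPass() { log.push_back("EndPass"); }
    void End() { log.push_back("End"); }
};

// Hypothetical flag value meaning "do not save/restore any state".
const unsigned kDoNotSaveState = 1;

// Renders every pass except the last one. The last pass is the restore pass:
// it is only begun and ended so that its state assignments get applied.
template <class Effect, class DrawFn>
void renderWithRestorePass(Effect& fx, DrawFn draw) {
    unsigned passes = 0;
    fx.Begin(&passes, kDoNotSaveState);
    for (unsigned i = 0; i < passes; ++i) {
        fx.BeginPass(i);
        if (i + 1 < passes)
            draw(i);    // geometry is drawn only in the regular passes
        fx.EndPass();
    }
    fx.End();
}
```

The function is a template only so that the same loop can be tried against the mock; nothing about the technique requires it.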
Ok, now we've got a solution that does not require saving/restoring all changed state. Only the needed (as defined by the engine) state is manually restored via an explicitly written "restore pass". This achieves the primary objective, performance (see measurements below), but has a serious problem...
It is very error prone. What if you forget to restore some needed state? The following effects will be messed up, and the cause is really hard to find. What if you don't assign some dependent state when you should? The current effect will be messed up. What's more, this system requires remembering which states belong to which groups... So while the solution can work (hey, it has worked for me for 2 years!), most likely it will not, or at least it will cause some serious headaches. Especially if there are many people authoring effect files.
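In short: automagically generate the state restore pass. In more detail:
1. Know which states fall into which groups.
2. Examine the effect to find out which states it assigns.
3. Generate the state restore pass from that information.
4. Get the generated restore pass back into the effect.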
Points 2-4 are not trivial, but doable. As a bonus, once the effect is examined (step 2), we can check whether it assigns all needed dependent states (third group), etc. In the end, we should have a fairly robust system (no manual work necessary to make it work), with all the speed benefits (no need to save/restore all touched state). Read on.
Knowing which states fall into which groups (step 1 above) is easy - have some file that describes each group (see example below):
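- First group (restored states): list each state together with its standard value. The value can also refer to something else: if CullMode needs to be restored to the iCull shared variable value, just list "(iCull)" as its standard value.
- Second group (required states): my rule is "every effect must set VertexShader and PixelShader, even if to NULL". So just list these states in the configuration file.
- Third group (dependent states): for example, if AlphaBlendEnable is set to True, SrcBlend and DestBlend must also be set. In the file, for each master state and its value, list all dependent states.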
Examining the effect is harder. Hardcore approaches, like writing a custom effect file parser, are out of the question (re-implementing the preprocessor, anyone?).
A sensible way to examine the effect is: load it; set a custom ID3DXEffectStateManager that "remembers" all state assignments; begin all effect passes. After that, the custom state manager will have all state assignments recorded (we're mostly interested in the states, not their values).
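The "remembering" part can be sketched as below. This is a simplified stand-in, not a real ID3DXEffectStateManager: the real thing is a COM interface whose callbacks receive D3D enum values, while here states are identified by name so the sketch compiles without the DirectX SDK. The class name and its `beginPass`/`pass`/`allAssigned` helpers are my own assumptions for illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

class RecordingStateManager {
public:
    typedef std::set<std::string> StateSet;

    // Call before beginning each effect pass; starts a new per-pass record.
    void beginPass() { mPasses.push_back(StateSet()); }

    // The callbacks only remember which state was assigned; values are ignored
    // in this sketch (the dependent-state check would need to keep them too).
    void SetRenderState(const std::string& state, unsigned long /*value*/) { record(state); }
    void SetVertexShader(const void* /*shader*/) { record("VertexShader"); }
    void SetPixelShader(const void* /*shader*/) { record("PixelShader"); }

    // States assigned in one particular pass.
    const StateSet& pass(std::size_t index) const { return mPasses.at(index); }

    // Union of the states assigned in any pass.
    StateSet allAssigned() const {
        StateSet all;
        for (std::size_t i = 0; i < mPasses.size(); ++i)
            all.insert(mPasses[i].begin(), mPasses[i].end());
        return all;
    }

private:
    void record(const std::string& state) {
        if (mPasses.empty())
            beginPass();    // tolerate a missing beginPass() call
        mPasses.back().insert(state);
    }

    std::vector<StateSet> mPasses;
};
```

Keeping a separate set per pass is what later makes the "required in the first pass" and "dependent states in the same pass" checks possible.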
Now, generate the state restore pass: for each 1st group state that is assigned by the effect, issue its assignment back to the "standard" value (standard values come from the configuration file). Here, we can also check whether the effect's first pass assigns all needed states (2nd group), and whether all dependent states (3rd group) are assigned in the same pass when the master state is set to its value. Note: at least in the October 2004 SDK, macro values can't span multiple lines, so generate the state restore pass as one long line.
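A minimal sketch of the generation step, assuming the recorded states and the configuration are already available as plain containers. The function name `generateRestorePass` and the container types are hypothetical, and only simple render states are handled (texture stage and sampler states would need the stage index as well).

```cpp
#include <map>
#include <set>
#include <string>

// Builds the restore pass as a single line of effect source.
// `standard` maps each first-group state to its standard value;
// `assigned` holds the states the effect was seen to set.
std::string generateRestorePass(const std::map<std::string, std::string>& standard,
                                const std::set<std::string>& assigned) {
    std::string body;
    for (const std::string& state : assigned) {
        auto def = standard.find(state);
        if (def == standard.end())
            continue;   // not a first-group state, nothing to restore
        body += " " + def->first + " = " + def->second + ";";
    }
    // The pass is emitted even when empty, so that every effect ends with a
    // restore pass and the renderer can always skip drawing in the last one.
    return "pass PRestore {" + body + " }";
}
```

For the example effect above (which assigns AlphaBlendEnable and ZWriteEnable among others), this would produce a one-line pass that resets just those two states, provided both are listed in the configuration.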
The only thing left is: how do we get the generated state restore pass back into the effect? Currently there's no way to "inject" new passes programmatically...
A possible solution is: require all effects to use a RESTORE_PASS macro, like this:
technique Foo {
    pass P1 { ... }
    pass P2 { ... }
    RESTORE_PASS
}
When first loading the effect (for examining), set the macro to an empty string (macro substitutions can be supplied programmatically). After the effect is examined and the restore pass is generated, set the macro to the contents of the restore pass, e.g. pass PRestore{AlphaBlendEnable=False;}, and load the effect again. This time, the effect should have the (generated) restore pass. After loading, check whether the restore pass exists and complain loudly if it does not ("maybe you forgot to write RESTORE_PASS at the end?").
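The two-stage load can be illustrated with plain strings. In the sketch below, `expandMacro` is a hypothetical stand-in for what the effect compiler's preprocessor does with the macro definition you supply at load time, and `hasRestorePass` is an equally hypothetical text-level check; a real loader would rather ask the loaded effect whether its technique ends with the restore pass.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of the macro name with its value. This is a naive
// textual substitution (not token-aware), good enough for illustration only.
std::string expandMacro(std::string source, const std::string& name,
                        const std::string& value) {
    std::size_t pos = 0;
    while ((pos = source.find(name, pos)) != std::string::npos) {
        source.replace(pos, name.size(), value);
        pos += value.size();
    }
    return source;
}

// Checks whether the expanded source contains the generated restore pass.
bool hasRestorePass(const std::string& expanded) {
    return expanded.find("pass PRestore") != std::string::npos;
}
```

The first load would use `expandMacro(src, "RESTORE_PASS", "")`, the second one the generated pass as the value. If `hasRestorePass` is still false after the second load, the effect author most likely forgot the macro.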
We've got a system with some good properties:
- Effect authors only have to write RESTORE_PASS near the end of the technique (and the system will let you know if you forget that).
The system also has some drawbacks:
I've made a test application that is heavy on effect switching, but the rendered objects and shaders are simple - hopefully that's a way to stress-test effect switching performance.
The application renders 2000 objects each frame (2000 DIPs) and does 1361 effect switches (the actual number of effects is 8, but I intentionally didn't sort by effects). The test system was a P4@3GHz with a GeForce6800GT (78.01 drivers), using DX SDK 2004 October (compiled with VC6) and 2005 June (compiled with VC7.1). No observable speed difference was found between the two SDK versions, suggesting that the bottleneck is all the state save/restore stuff (runtime+driver) and not the Effects internals (which presumably changed somewhat between SDK versions).
| Approach | FPS | ms/frame | ms/frame improvement |
|---|---|---|---|
| Effects default: save/restore all state | 34.29 | 29.16 | - |
| Described solution ("needed state" restore pass) | 44.43 | 22.51 | 30% |
| Described solution; plus redundant state filtering via ID3DXEffectStateManager | 46.51 | 21.50 | 36% |
| Described solution; plus redundant state filtering; plus other redundant sets filtering (vertex/index buffers, declarations etc.) | 47.45 | 21.07 | 38% |
Of course, doing 1361 effect switches each frame is pretty surreal... Extrapolating the results towards 150 effect switches/frame (which is pretty realistic), we get that the proposed system alone can save 0.73 ms/frame; coupled with redundant state filtering the gain is 0.88 ms/frame. So while it won't improve the renderer beyond the speed of light, it is at least something :)
Over the course of several of my own projects, I've settled upon a quite stable state grouping. The following is a Lua script that defines the state groups; it is directly read by the engine. The script should be pretty clear, even if you don't know Lua :)
-- If these states are modified, they are restored to the given default values
restored = {
    -- render states
    { 'AlphaBlendEnable', 'False' },
    { 'SeparateAlphaBlendEnable', 'False' },
    { 'AlphaTestEnable', 'False' },
    { 'ClipPlaneEnable', '0' },
    { 'ColorWriteEnable', 'Red | Green | Blue | Alpha' },
    { 'FogEnable', 'False' },
    { 'PointSpriteEnable', 'False' },
    { 'StencilEnable', 'False' },
    { 'ZEnable', 'True' },
    { 'ZWriteEnable', 'True' },
    { 'BlendOp', 'Add' },
    { 'BlendOpAlpha', 'Add' },
    { 'Clipping', 'True' },
    { 'CullMode', '(iCull)' }, -- standard cull mode is iCull shared variable
    { 'DepthBias', '0' },
    { 'DitherEnable', 'False' },
    { 'FillMode', 'Solid' },
    { 'LastPixel', 'True' },
    { 'MultiSampleAntiAlias', 'True' },
    { 'MultiSampleMask', '0xFFFFFFFF' },
    { 'PatchSegments', '0' },
    { 'ShadeMode', 'Gouraud' },
    { 'SlopeScaleDepthBias', '0' },
    { 'ZFunc', 'Less' },
    { 'Wrap0', '0' },
    { 'Wrap1', '0' },
    { 'Wrap2', '0' },
    { 'Wrap3', '0' },
    { 'Wrap4', '0' },
    { 'Wrap5', '0' },
    { 'Wrap6', '0' },
    { 'Wrap7', '0' },
    { 'Wrap8', '0' },
    { 'Wrap9', '0' },
    { 'Wrap10', '0' },
    { 'Wrap11', '0' },
    { 'Wrap12', '0' },
    { 'Wrap13', '0' },
    { 'Wrap14', '0' },
    { 'Wrap15', '0' },
    -- exotic texture stage states
    { 'TexCoordIndex', '@stage@' }, -- value is the stage index
    { 'TextureTransformFlags', 'Disable' },
    -- exotic sampler states
    { 'MipMapLodBias', '0' },
    { 'MaxMipLevel', '0' },
    { 'SRGBTexture', '0' },
}
-- These states are required in first pass of each effect
required = {
    'VertexShader', 'PixelShader',
}
-- If the given master state is set to the given value, all listed
-- dependent states must also be set in the same pass.
dependent = {
    { 'AlphaBlendEnable', 1,
        { 'SrcBlend', 'DestBlend', }, },
    { 'SeparateAlphaBlendEnable', 1,
        { 'SrcBlendAlpha', 'DestBlendAlpha', }, },
    { 'AlphaTestEnable', 1,
        { 'AlphaFunc', 'AlphaRef', }, },
    { 'StencilEnable', 1,
        { 'StencilFail', 'StencilFunc', 'StencilMask', 'StencilPass', 'StencilWriteMask', 'StencilZFail', }, },
}
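To show how the `dependent` table might be used once an effect has been examined, here is a small sketch of the check for a single pass. The `DependentRule` struct, the function name and the choice of a name-to-value map for the recorded pass are all assumptions for illustration, not the actual engine code.

```cpp
#include <map>
#include <string>
#include <vector>

// One entry of the "dependent" table.
struct DependentRule {
    std::string master;                     // e.g. "AlphaBlendEnable"
    unsigned long value;                    // master value that triggers the rule
    std::vector<std::string> dependents;    // states that must then be set too
};

// `pass` maps each state assigned in one pass to the value it was given.
// Returns the dependent states that the pass should have assigned but did not.
std::vector<std::string> missingDependents(
        const std::map<std::string, unsigned long>& pass,
        const std::vector<DependentRule>& rules) {
    std::vector<std::string> missing;
    for (const DependentRule& rule : rules) {
        auto it = pass.find(rule.master);
        if (it == pass.end() || it->second != rule.value)
            continue;   // master state not set to the triggering value
        for (const std::string& dep : rule.dependents)
            if (pass.find(dep) == pass.end())
                missing.push_back(dep);
    }
    return missing;
}
```

A non-empty result is something the loader can report right away, naming the effect, the pass and the forgotten states.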
As you can see, it's missing almost all fixed function T&L related states. That's because I never found a good way to use T&L with Effects; and I don't really use it anymore either :)
I have this system implemented in my free-time demo/game engine, dingus. The whole implementation is more or less isolated in the dingus/kernel/EffectLoader.[cpp|h] files. Here are web links to them in Subversion (best viewed with tab size 4, yes I do use tabs):
The implementation is not the nicest or the most robust one, but it works and hopefully illustrates the whole idea.
If I've written complete nonsense here (or described your own patented technique), feel free to drop me a mail: nearaz at gmail dot com