Designing for Radeon®

Introduction
This section of the SDK is a general outline of how to design and optimize applications for best performance using the Radeon® hardware.

We will cover aspects of the Radeon® that should be understood when developing on and designing for the Radeon® at both the T&L and pixel levels. While we deal mostly with general 3D application optimization, there are some areas that are applicable specifically to the Radeon®.
Hardware Transform and Lighting
One of the most important aspects of programming any application for 3D acceleration is maintaining concurrency between the CPU and the graphics processor. Keeping both processors busy and eliminating stalls makes the whole system more efficient and therefore faster. Choosing the right entry points, and understanding why they are the right ones, is fundamental to maintaining concurrency on hardware T&L devices.

Use static vertex buffers

This is the single biggest performance issue when using hardware transformation and lighting. It is imperative that vertex buffers whose data does not need to change during the lifetime of your program are allocated in hardware-accessible memory and are not locked by the application again after the first lock that fills them with data.

Typical Direct3D® syntax looks like:

D3DVERTEXBUFFERDESC vbdesc;

// Describe the buffer: vertex format and vertex count
vbdesc.dwSize = sizeof(vbdesc);
vbdesc.dwCaps = 0;
vbdesc.dwFVF = D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_DIFFUSE | D3DFVF_TEX2;
vbdesc.dwNumVertices = uVertices;

// Fall back to system memory without hardware T&L or on the reference device
if (!(g_D3DDeviceDesc.dwDevCaps & D3DDEVCAPS_HWTRANSFORMANDLIGHT) ||
    IsEqualGUID(g_D3DDeviceDesc.deviceGUID, IID_IDirect3DRefDevice)) {
    vbdesc.dwCaps |= D3DVBCAPS_SYSTEMMEMORY;
}

g_pD3D->CreateVertexBuffer(&vbdesc, &pvbVertices, 0);

// Lock once to fill the buffer
pvbVertices->Lock(DDLOCK_WAIT | DDLOCK_WRITEONLY, (void **) &pVertices, NULL);

// Fill in vertices ...

pvbVertices->Unlock();

// Optimize for the device; do not lock the buffer again after this call
pvbVertices->Optimize(g_pD3DDevice, 0);

Batch up your primitives

Do not try to render one triangle at a time with the hardware. Render as many primitives as possible within one function call to amortize the call overhead across many primitives. The importance of this optimization cannot be overstated: a 3D graphics engine should be designed around this central concept. That said, the Radeon is very forgiving in terms of selection of an "optimal" vertex buffer size.

Keep your references local

This is a subtle but important point that few developers have considered until recently. Graphics processors capable of transformation and lighting have caches of recently-transformed vertices. If, within a given API call, you re-reference a vertex that was recently transformed and is still in cache, the vertex is essentially re-transformed for free. Of course, vertex caches are of finite size, so you should plan to "re-wind" your data structures to increase locality of reference. This can be a significant performance boost.

Indexed primitives

One way to use vertex caches optimally is to use indexed primitives and do your best to keep references to a given vertex index close to each other. The investment it takes on your part to reshuffle your vertex data can definitely be worth it in the long run. Generally each vertex is used for more than one face of an object, so this method works well for most objects.
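The effect of index ordering can be estimated offline with a small simulation. The sketch below models the vertex cache as a simple FIFO keyed by vertex index and counts how many vertices would have to be transformed for a given index list. The FIFO policy, the cacheSize parameter, and the name countTransformedVertices are assumptions made for illustration; they do not describe the actual Radeon cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts the vertices that miss a hypothetical FIFO vertex cache.
// cacheSize is an assumed parameter, not a documented Radeon value.
std::size_t countTransformedVertices(const std::vector<std::uint16_t>& indices,
                                     std::size_t cacheSize)
{
    assert(cacheSize > 0);
    std::deque<std::uint16_t> cache;   // front = oldest entry
    std::size_t misses = 0;

    for (std::uint16_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;                  // hit: the transformed vertex is reused
        ++misses;                      // miss: the vertex must be transformed
        cache.push_back(index);
        if (cache.size() > cacheSize)
            cache.pop_front();         // evict the oldest entry
    }
    return misses;
}
```

Dividing the result by the triangle count gives an average number of transforms per triangle. Comparing that figure before and after reordering an index list is one way to check whether the reshuffle actually improved locality.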

Strips and fans

Stripping and fanning your data creates implicit locality of reference by re-using vertices from previous primitives. On the Radeon, the physical memory address is used as the tag in the vertex cache (i.e. not just the index in an indexed primitive call), so strips and fans can exploit the vertex cache as well.

Flexible vertex formats

These let you specify a smaller vertex that omits unused data components. This increases effective bus/memory bandwidth and also increases the number of vertices that fit in the vertex cache. If you just need to draw a non-textured shaded polygon that is colored based on the current material type, your vertices need not have texture coordinates or color data. Usage in Direct3D® might look like:

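// Position and normal only: no texture coordinates or color data
DWORD dwFVF = ( D3DFVF_XYZ | D3DFVF_NORMAL );
// DrawIndexedPrimitive takes the FVF directly; with DrawIndexedPrimitiveVB
// the FVF is set in the vertex buffer description instead
pd3dDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, dwFVF, . . . );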
Radeon 3D Pixel Pipeline
The Radeon is the first chip on the market to support three simultaneous textures natively in hardware. As a result, many rendering effects that previously required multiple passes can be collapsed into fewer rendering passes. A wide variety of such effects are demonstrated in the Radeon SDK Sample Code. Understanding and efficiently using the Radeon Pixel Pipeline can improve your performance in important cases.

Reduce the number of rendering passes

Currently, many games render each object into the frame buffer in multiple passes to generate advanced visual effects. This means that the same vertices are transformed by the hardware multiple times, with the only things changing being the textures applied on each pass and the state of the pixel pipeline. For many important cases, utilizing the three-texture capabilities of the Radeon can eliminate this re-transformation of geometry and improve overall performance. Common cases that render in a single pass on the Radeon and in multiple passes on dual-texture chips include:
  • (BaseMap * LightMap) * DetailMap *2
  • (BaseMap * (DiffuseInterp · NormalMap)) + EnvMap * BaseMap.alpha
  • (BaseMap * DiffuseInterp) + EnvBumpMap
where
  • · indicates use of the per-pixel DOTPRODUCT3 operator
  • EnvBumpMap takes up two textures: the bump map and the environment or reflection map
For more specifics on the multi-texturing capabilities of the Radeon, see SetTextureStageState() for Direct3D® usage and the OpenGL® Multitexture extensions like ARB_multitexture and EXT_texture_env_combine.

As with the RAGE 128™ and RAGE 128 PRO™ chips, multitexturing and trilinear filtering are not mutually exclusive on Radeon. The Radeon will do 3 bilinear texture fetches at its peak rate. This means that a trilinear-filtered base map modulated with a non-trilinear light map will run at full speed through the pixel pipeline. Modulating in a third trilinear-filtered detail texture increases the number of bilinear fetches to five, and drops the pixel pipeline performance. Developers should understand this tradeoff when developing on the Radeon. Three bilinear texture fetches (no matter how they're distributed) will run at peak rate on the Radeon. Anything above that will still draw correctly (i.e. true trilinear will be used rather than an approximation) but will impact performance.
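The fetch arithmetic above can be written down as a short check. This is a sketch under the assumptions stated in the text: a bilinear stage costs one bilinear fetch, a trilinear stage costs two, and three fetches per pixel run at peak rate. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Filter { Bilinear, Trilinear };

// A trilinear sample blends two bilinear fetches from adjacent mip levels.
int bilinearFetches(Filter f) { return f == Filter::Trilinear ? 2 : 1; }

// Figures taken from the text: three simultaneous textures, and three
// bilinear fetches per pixel at peak rate.
const std::size_t kMaxTextureStages = 3;
const int kPeakBilinearFetches = 3;

// Sums the bilinear fetches needed by all active texture stages.
int totalBilinearFetches(const std::vector<Filter>& stages)
{
    assert(stages.size() <= kMaxTextureStages);
    int total = 0;
    for (Filter f : stages)
        total += bilinearFetches(f);
    return total;
}

// True when the stage setup stays within the peak-rate fetch budget.
bool runsAtPeakRate(const std::vector<Filter>& stages)
{
    return totalBilinearFetches(stages) <= kPeakBilinearFetches;
}
```

A trilinear base map with a bilinear light map totals three fetches and stays at peak rate. Adding a trilinear detail map brings the total to five, which is the slower case described above.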
Rendering State
Any 3D graphics accelerator acts as a state machine whose states are programmed via an API such as OpenGL® or Direct3D®. Things like the current texture filtering mode(s), alpha blending states, {multi}texture modes and anything else which is part of the 3D Pixel Pipeline or Transform and Lighting Pipeline make up the current state.

Minimize per-frame changes to global render state

For years, hardware manufacturers have been preaching the mantra of reducing the number of times the global state of the renderer is changed by the app. This is because changing the state machine typically requires the pending primitives to flush through the system before the new state can be set and subsequent primitives can be processed.

It is also wise to avoid setting redundant render states. The Direct3D® runtime and any decent OpenGL® ICD will catch redundant render state setting and avoid a hardware flush, but the price of the API call and the state check is still paid. Even in applications that do not intentionally group their rendering by state, the chip is often set to only a small number of distinct global states. Using a tool like IPEAK, it is possible to observe the global state setting behavior of shipping applications.
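One way to avoid paying for redundant calls is to filter them in the application before they reach the API. The sketch below is a hypothetical wrapper, not part of Direct3D® or OpenGL®: it remembers the last value set for each state and forwards a call only when the value changes. The class name, the integer state and value types, and the callback are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical application-side filter for redundant render state calls.
class RenderStateCache {
public:
    using Setter = std::function<void(std::uint32_t state, std::uint32_t value)>;

    // setter is whatever actually issues the API call.
    explicit RenderStateCache(Setter setter) : setter_(std::move(setter)) {
        assert(setter_);
    }

    // Forwards the call only when the value differs from the last one set.
    // Returns true if the underlying setter was invoked.
    bool set(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value)
            return false;              // redundant: skip the API call
        current_[state] = value;
        setter_(state, value);
        return true;
    }

private:
    Setter setter_;
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
};
```

A cache like this is only valid if every state change goes through it. Anything that changes state behind its back, such as applying a state block, would require the cached values to be reset.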

If the global graphics states that are used by the application are set at startup time, it is possible for accelerators to optimize loading of these global states. Mechanisms for specifying global state exist in both Direct3D® and OpenGL®.

Direct3D®

In Direct3D®, calls to a number of IDirect3DDevice7 methods that affect the render state of the accelerator can be recorded into "state blocks" between BeginStateBlock() and EndStateBlock(), and later replayed with ApplyStateBlock():

// T&L Pipeline State
IDirect3DDevice7::LightEnable()
IDirect3DDevice7::SetClipPlane()
IDirect3DDevice7::SetLight()
IDirect3DDevice7::SetMaterial()
IDirect3DDevice7::SetTransform()
IDirect3DDevice7::SetViewport()

// Pixel Pipeline State
IDirect3DDevice7::SetRenderState()
IDirect3DDevice7::SetTexture()
IDirect3DDevice7::SetTextureStageState()

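After setting up texture stage states, IDirect3DDevice7::ValidateDevice() can be called to check whether the device can render the current configuration and in how many passes.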
OpenGL®
Maximize the number of primitives drawn with the same render-state
  • This is essentially the same idea as minimizing render state changes. If you have a set of objects that share the same texture, it might be useful to draw the set together, since there will be no change in the render state for the current texture. Similarly, if you can arrange to have all the triangles that share the same material/shader (and thus render states) to be rendered in one batch, it will improve performance both due to minimizing render state changes, as well as due to increased batching of primitives.
  • With the ubiquity of multitexture hardware, a further optimization is worth considering for the case of light mapped games. [light map paging]
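Grouping draws by state can be sketched as a sort over a list of draw records. The DrawItem structure and its fields below are hypothetical stand-ins for whatever an engine uses to identify a material and a texture; the sketch only shows the ordering and how to count the state changes that result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical draw record; the fields are assumptions for illustration.
struct DrawItem {
    std::uint32_t materialId;  // stands in for the full set of render states
    std::uint32_t textureId;
    std::uint32_t meshId;
};

// Orders items so that those sharing a material and texture are adjacent.
// stable_sort keeps the original order within each group.
void sortByState(std::vector<DrawItem>& items)
{
    auto byState = [](const DrawItem& a, const DrawItem& b) {
        return std::tie(a.materialId, a.textureId) <
               std::tie(b.materialId, b.textureId);
    };
    std::stable_sort(items.begin(), items.end(), byState);
    assert(std::is_sorted(items.begin(), items.end(), byState));
}

// Counts how many times material or texture changes while walking the list.
std::size_t countStateChanges(const std::vector<DrawItem>& items)
{
    std::size_t changes = 0;
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i].materialId != items[i - 1].materialId ||
            items[i].textureId != items[i - 1].textureId)
            ++changes;
    }
    return changes;
}
```

Items that end up adjacent with identical state are also candidates for merging into a single draw call, which ties this back to batching primitives.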
Texture Considerations
Texture management is one of the most problematic issues to solve in games. It's easy when the trivial policy of loading all the textures into (local or non-local) video memory works: i.e. the total size of the textures in the game level does not exceed the video memory available to load them into.

If the textures don't fit or if an application wishes to scale to fit the available resources, texture management becomes more involved. In Direct3D® as of DirectX® 6, there is a texture management facility, whereby the application specifies the DDSCAPS2_TEXTUREMANAGE flag when creating textures. This will let DirectX® manage the texture memory automatically.

When textures need to be managed, you generally have more texture data than video memory. You should keep this excess to a minimum: otherwise you incur the penalty of loading textures into video memory when they are not yet resident but have to be used immediately. Remember that we're trying to minimize texture loads during run time or we'll stall the 3D pipeline. People in the field recommend a texture budget of up to 20% more than the video memory available, but you may want to investigate the optimal ratio for your own application.
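As a worked example of the rule of thumb above, the following sketch computes the largest texture working set that stays within a given oversubscription percentage. The function names and the default of 20% are assumptions taken from the recommendation quoted in the text, not a fixed limit.

```cpp
#include <cassert>
#include <cstdint>

// Largest texture working set under the rule of thumb quoted above:
// the video memory available for textures plus a percentage on top.
std::uint64_t textureBudgetBytes(std::uint64_t videoMemoryBytes,
                                 unsigned oversubscribePercent = 20)
{
    assert(oversubscribePercent <= 100);
    return videoMemoryBytes + videoMemoryBytes * oversubscribePercent / 100;
}

// True when a level's textures stay within the default 20% budget.
bool fitsTextureBudget(std::uint64_t levelTextureBytes,
                       std::uint64_t videoMemoryBytes)
{
    return levelTextureBytes <= textureBudgetBytes(videoMemoryBytes);
}
```

With 20 MB of video memory available for textures, the budget works out to 24 MB of texture data per level.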

If you are managing your own textures, create as many textures in video memory as you can at initialization time, and then reuse those surfaces by blitting or loading new textures into them on demand, rather than destroying the surfaces and recreating them whenever it becomes necessary to load new textures.

With large 32MB frame buffers and the availability of AGP, most applications should be able to load all the textures of a game level at one time.

Like most accelerators, the Radeon will have better texel cache performance if the data in texture surfaces are optimized. This means that an application which intends to re-access a texture surface after creating it will incur a penalty as the surface is de-optimized. To keep the texture from being optimized, an application can use the DDSCAPS2_HINTDYNAMIC flag at surface creation. This indicates that the application intends to modify/update this texture map frequently (one or more times per frame) and will keep the surface from ever being optimized. Naturally, to get optimal performance this per-frame modification of textures should be kept at an absolute minimum. To indicate that you do not plan to update a given texture, use the DDSCAPS2_HINTSTATIC flag. Such surfaces will be optimized.

One reason for using a large or high-resolution texture in your game is to load a number of smaller textures into this "texture page". Using the smaller textures within this larger texture page means the render state does not have to change when switching between any of the textures within the page, thus improving concurrency. However, this can cause bad artifacts if mip-mapping is used and also precludes use of texture repeating.
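Using a texture page means remapping each small texture's coordinates into its sub-rectangle of the page. The sketch below shows that remapping; the UV and PageRect structures and the function name are hypothetical, and the layout of sub-textures within the page is left to the application.

```cpp
#include <cassert>

struct UV { float u, v; };

// Sub-rectangle of a texture page, in texels; names are illustrative.
struct PageRect { int x, y, width, height; };

// Maps a coordinate in the small texture's [0,1] range into the page.
UV remapToPage(UV local, PageRect rect, int pageWidth, int pageHeight)
{
    assert(pageWidth > 0 && pageHeight > 0);
    UV out;
    out.u = (rect.x + local.u * rect.width) / pageWidth;
    out.v = (rect.y + local.v * rect.height) / pageHeight;
    return out;
}
```

The remapping also shows why texture repeating is precluded: a local coordinate outside the [0,1] range lands in a neighboring sub-texture instead of wrapping around.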
General Issues
Do not stall the 3D pipeline

In Direct3D®, do not lock the frame buffer, vertex buffers, or other DirectDraw® surfaces unnecessarily, as this causes synchronization of the CPU and 3D accelerator. Locking any DirectDraw® surface that is being used in any portion of the 3D rendering pipeline will cause a stall: all the 3D operations that have been queued up to be done by the accelerator must complete before the Lock() can occur. This effectively serializes the two processors and eliminates concurrency. If you must lock a surface or buffer, make sure that it is not done too soon after it is used (explicitly or implicitly) in a rendering call.

Minimize updates to textures during rendering

This is just a special case of the previous point: in Direct3D®, textures are just special DirectDraw® surfaces. If you must update a texture, in addition to the caveats above, provide hints to the driver on texture creation that you will be updating it: set the DDSCAPS2_HINTDYNAMIC flag for the texture. On the other hand, if you are sure you'll never touch the contents of a texture again after you load it up, then set the DDSCAPS2_HINTSTATIC flag so that the driver can optimize the texture for best texture cache coherency.

In OpenGL® this amounts to using texture objects, which were introduced in OpenGL® 1.1 and are in common use today. Non-startup-time calls to glTexImage2D() should also be kept to a minimum, as this requires the implementation to copy the texel data from the application and format/optimize it for use by the hardware.

Triple-buffering can minimize buffer dependencies

If the 3D hardware completes rendering to the back-buffer of your double-buffered surface before the front buffer is ready to be flipped to the back, you will incur a wait for the vertical-blanking period for the flip to occur, unless you have a triple-buffered scheme. In this scheme, the third buffer will still be available to be rendered to, and there will be no stall of the pipeline to wait for a rendering surface to be made available. Naturally, this will cause any physics or "twitch" application interaction to lag an additional frame behind the displayed image. Developers should understand this trade-off.
Geometry optimizations
These are basic 3D graphics principles, but they buy you a lot of performance if used effectively.
Don't draw objects that aren't visible
  • Cull objects as early in the pipeline as possible. If the whole object is positioned outside the view frustum, then don't process that object any further.
  • Fewer vertices sent down to the accelerator mean less communication with the accelerator, and thus more performance.
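A common way to reject whole objects early is to test a bounding sphere against the six planes of the view frustum. The sketch below assumes unit-length plane normals that point into the frustum; the structures and the function name are hypothetical, and extracting the planes from the view and projection matrices is not shown.

```cpp
#include <array>
#include <cassert>

// A point is on the inside of the plane when nx*x + ny*y + nz*z + d >= 0.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

// Returns true when the sphere lies entirely outside at least one plane,
// so the object it bounds cannot be visible. Plane normals are assumed to
// be unit length and to point into the frustum.
bool isOutsideFrustum(const Sphere& s, const std::array<Plane, 6>& frustum)
{
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        float distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (distance < -s.radius)
            return true;               // completely behind this plane
    }
    return false;                      // inside or intersecting the frustum
}
```

Objects for which this returns true can be skipped before any of their vertices are sent to the accelerator. The test is conservative: a sphere near a frustum corner may be reported as potentially visible even though it is not.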
Draw fewer polygons where not many are needed
  • Use Level-Of-Detail (LOD) models for your meshes: this means that as the mesh gets further away from the viewer, you can use a model of the mesh that uses fewer polygons that approximate the shape of the mesh.
  • Make sure each polygon you draw gives you "bang for your buck". There may be polygons that are so far away from the viewer and their onscreen size so small that they don't really change the appearance of your frame buffer. Avoid sending down three vertices to the accelerator, which may result in only one pixel or none at all on the screen.
  • Use multi-texturing effects for realistic low polygon primitives. For example, you can use emboss style bump mapping to achieve the illusion of a bumpy surface that would take a lot more polygons to approximate otherwise. Similarly, other intelligent use of texture maps can reduce the polygon count of your mesh designs.
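Level-of-detail selection is often driven by the distance from the viewer. The sketch below picks an LOD index from a list of switch distances; the function name and the thresholds are hypothetical tuning values, and a real engine might use projected screen size instead of raw distance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Picks a level-of-detail index from the distance to the viewer.
// switchDistances[i] is the distance at which LOD i+1 replaces LOD i;
// the thresholds are hypothetical tuning values and must be ascending.
std::size_t selectLod(float distance, const std::vector<float>& switchDistances)
{
    std::size_t lod = 0;
    for (float threshold : switchDistances) {
        assert(lod == 0 || threshold >= switchDistances[lod - 1]);
        if (distance < threshold)
            break;
        ++lod;
    }
    return lod;   // 0 = most detailed, switchDistances.size() = coarsest
}
```

With switch distances of 10, 50 and 200 units, a mesh 60 units away would use LOD 2, and anything beyond 200 units would use the coarsest model.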
Profiling
Your application may run well on one system, but perform poorly on another. To make sure your real-time applications run well on as many systems as possible, you can profile the game every time it is run, so that you can find the features and limitations of the particular system you have.

Another property that you should introduce into your games is "togglable features": checkboxes that enable and disable certain features according to the performance and visual quality that the user requires. This will help especially at the debugging stage to pinpoint any feature that is a performance bottleneck for your game. It is also useful to check visual quality problems: if something doesn't look right, check whether disabling a feature corrects it. This is an excellent way to isolate driver/hardware/capability issues as well.
 
 


 



©2008 Advanced Micro Devices, Inc.