Designing for ATI Rage128 and Rage128 Pro
Introduction
As a developer, you are aware of the reasons for
using 3D hardware acceleration in your applications.
Those reasons include:
- Better graphics at higher resolutions
- High frame rates
- Photorealistic special effects
This section of the SDK is a general outline of how to
design and optimize applications for best performance
using the ATI Rage128 and Rage128 Pro hardware. We will
cover aspects of the Rage 128 and Rage128 Pro that should
be accounted for while designing an application or game
engine for 3D acceleration. Most of the advice consists of
general 3D acceleration hints, but some areas apply
specifically to the Rage 128 family of accelerators.
Concurrency
One of the most important aspects of programming any
application for 3D acceleration is maintaining concurrency
between the CPU and the graphics processor in the system.
When both processors are kept busy in parallel, rather than
one waiting on the other, the whole system is more efficient
and performance increases. There are several programming
techniques you can use:
Batch up your primitives
Do not send one triangle at a time to the hardware: this
is very inefficient, because each rendering call incurs
overhead in both time and data. Send as many triangles
as possible within one rendering call.
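The batching idea above can be sketched as follows. This is a minimal illustration rather than SDK code: the `Vertex` and `TriangleBatcher` types and the `calls_for` helper are hypothetical, and the rendering call is only simulated by a counter where a real engine would call DrawPrimitive() or glDrawArrays().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex type, for illustration only.
struct Vertex { float x, y, z; };

// Collects triangles and submits them in large groups instead of one by one.
class TriangleBatcher {
public:
    explicit TriangleBatcher(std::size_t max_triangles)
        : max_triangles_(max_triangles) {
        assert(max_triangles > 0);
        vertices_.reserve(max_triangles * 3);
    }

    // Queue one triangle; submit only when the batch is full.
    void add(const Vertex& a, const Vertex& b, const Vertex& c) {
        vertices_.push_back(a);
        vertices_.push_back(b);
        vertices_.push_back(c);
        if (vertices_.size() == max_triangles_ * 3) flush();
    }

    // Submit whatever is queued in a single (simulated) rendering call.
    void flush() {
        if (vertices_.empty()) return;
        // A real engine would issue one rendering call here, passing
        // vertices_.data() and vertices_.size().
        ++draw_calls_;
        vertices_.clear();
    }

    std::size_t draw_calls() const { return draw_calls_; }

private:
    std::size_t max_triangles_;
    std::size_t draw_calls_ = 0;
    std::vector<Vertex> vertices_;
};

// Number of rendering calls needed to draw `triangles` triangles
// when at most `batch_size` triangles go into each call.
inline std::size_t calls_for(std::size_t triangles, std::size_t batch_size) {
    TriangleBatcher batcher(batch_size);
    const Vertex v{0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < triangles; ++i) batcher.add(v, v, v);
    batcher.flush();  // submit the final, partially filled batch
    return batcher.draw_calls();
}
```

With a batch size of 1, a thousand triangles cost a thousand rendering calls; with a batch size of 256 the same geometry needs only four.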
Strips and fans
Reduce the amount of data transferred from CPU to accelerator
by using strips and fans. These reduce the total amount
of vertex data to be sent down to the hardware. Unfortunately,
arbitrary objects are generally hard to reduce to strips
or fans.
Usage in D3D:
Pd3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, . . . );
Pd3dDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, . . . );
Usage in OpenGL®:
glBegin(GL_TRIANGLE_FAN);
...
glEnd();
glBegin(GL_TRIANGLE_STRIP);
...
glEnd();
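To put a number on the saving from strips and fans, the arithmetic can be sketched as below. The helper names are hypothetical; the counts rely only on the standard rule that a strip or fan of n connected triangles needs n + 2 vertices, whereas a triangle list needs 3n.

```cpp
#include <cassert>
#include <cstddef>

// Vertices sent for n triangles drawn as an independent triangle list.
inline std::size_t list_vertices(std::size_t n) { return 3 * n; }

// Vertices sent for n connected triangles drawn as one strip or one fan.
inline std::size_t strip_or_fan_vertices(std::size_t n) {
    return n == 0 ? 0 : n + 2;
}

// Bytes of vertex data saved by using a strip or fan instead of a list.
inline std::size_t bytes_saved(std::size_t n, std::size_t vertex_size) {
    assert(vertex_size > 0);
    return (list_vertices(n) - strip_or_fan_vertices(n)) * vertex_size;
}
```

For 100 connected triangles with 32-byte vertices, a list sends 300 vertices and a strip sends 102, a saving of 6336 bytes.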
Indexed primitives
Indexed primitives let you reuse the same vertex data
for other triangles that make up an object. Generally
each vertex is used for more than one face of an object,
so this method works well for most objects: you don't
have to send the same vertex to the hardware multiple
times. This is even more useful in conjunction with Local
Vertex Buffers in Direct3D and Vertex Arrays or Compiled
Vertex Arrays in OpenGL® (see below).
Usage in D3D:
Pd3dDevice->DrawIndexedPrimitive( . . . );
Pd3dDevice->DrawIndexedPrimitiveVB( . . . );
Usage in OpenGL®:
glDrawElements( . . . );
glDrawArrays( . . . );
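As a rough sketch of how a mesh can be prepared for indexed drawing, the code below collapses repeated vertices in a triangle list into a unique vertex array plus 16-bit indices. The `Vertex` and `IndexedMesh` types and both functions are hypothetical and position-only; a real engine would also compare normals, colors and texture coordinates before merging two vertices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical position-only vertex, for illustration.
struct Vertex { float x, y, z; };

struct IndexedMesh {
    std::vector<Vertex> vertices;        // each unique vertex stored once
    std::vector<std::uint16_t> indices;  // three indices per triangle
};

// Convert a triangle list with repeated vertices into an indexed mesh.
inline IndexedMesh build_indexed_mesh(const std::vector<Vertex>& triangle_list) {
    assert(triangle_list.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<std::tuple<float, float, float>, std::uint16_t> seen;
    for (const Vertex& v : triangle_list) {
        const auto key = std::make_tuple(v.x, v.y, v.z);
        auto it = seen.find(key);
        if (it == seen.end()) {
            // First time this vertex appears: store it and give it an index.
            assert(mesh.vertices.size() < 65536);  // must fit a 16-bit index
            const auto index = static_cast<std::uint16_t>(mesh.vertices.size());
            it = seen.emplace(key, index).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}

// A quad made of two triangles sharing an edge: 6 vertices, 4 unique.
inline std::vector<Vertex> make_quad() {
    return { {0.0f, 0.0f, 0.0f}, {1.0f, 0.0f, 0.0f}, {1.0f, 1.0f, 0.0f},
             {0.0f, 0.0f, 0.0f}, {1.0f, 1.0f, 0.0f}, {0.0f, 1.0f, 0.0f} };
}
```

For the quad, six vertices shrink to four plus six small indices; the saving grows on closed meshes, where most vertices are shared by several faces.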
Flexible vertex formats
These let you specify a smaller vertex that eliminates
unused data components. For example, if you just need
to draw a non-textured shaded polygon, your vertices need
not have texture coordinates.
Usage in D3D:
DWORD dwFVF = ( D3DFVF_XYZ | D3DFVF_DIFFUSE );
Pd3dDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, dwFVF, . . . );
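The effect of dropping unused components on vertex size can be estimated with a sketch like the one below. The `VertexComponent` flags are hypothetical stand-ins, not the actual D3DFVF_* values, and the byte counts assume 4-byte floats and packed 32-bit colors.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical component flags for illustration (not the D3DFVF_* values).
enum VertexComponent : unsigned {
    kPosition = 1u << 0,  // x, y, z as three 4-byte floats
    kNormal   = 1u << 1,  // three 4-byte floats
    kDiffuse  = 1u << 2,  // one packed 32-bit color
    kSpecular = 1u << 3,  // one packed 32-bit color
    kTexCoord = 1u << 4   // u, v as two 4-byte floats
};

// Size in bytes of one vertex that carries the given components.
inline std::size_t vertex_size(unsigned components) {
    assert(components & kPosition);  // every vertex needs a position
    std::size_t bytes = 0;
    if (components & kPosition) bytes += 12;
    if (components & kNormal)   bytes += 12;
    if (components & kDiffuse)  bytes += 4;
    if (components & kSpecular) bytes += 4;
    if (components & kTexCoord) bytes += 8;
    return bytes;
}
```

Under these assumptions a position-plus-diffuse vertex takes 16 bytes, against 40 bytes for a vertex that carries every component.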
Reduce the number of rendering passes
Currently, many games use multiple rendering passes of
each object into the frame buffer to generate photorealistic
effects. This means that the same vertices are passed
down to the hardware multiple times, with the only thing
changing being the textures applied on each pass.
You can use the multi-texturing capabilities of the Rage
128 and Rage128 Pro (see SetTextureStageState() for
Direct3D usage and the OpenGL® Multitexture Extensions
for OpenGL® details) to reduce the number of passes
required to achieve the same effect.
This means that each vertex needs to be passed down to
the chip fewer times, and therefore reduces communication
between the CPU and the chip.
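The saving from multi-texturing can be estimated with a small calculation. The sketch below is hypothetical: the number of texture units is left as a parameter rather than stated for any particular chip, and it assumes each pass resends every vertex of the object.

```cpp
#include <cassert>
#include <cstddef>

// Rendering passes needed to apply `layers` texture layers when the
// hardware can apply `units` textures in a single pass.
inline std::size_t passes_required(std::size_t layers, std::size_t units) {
    assert(units > 0);
    return (layers + units - 1) / units;  // round up
}

// Total vertex data sent to the chip for one object, assuming every
// pass resends all of its vertices.
inline std::size_t vertex_bytes_sent(std::size_t vertices,
                                     std::size_t vertex_size,
                                     std::size_t layers,
                                     std::size_t units) {
    return vertices * vertex_size * passes_required(layers, units);
}
```

For a 1000-vertex object with 32-byte vertices and two texture layers, single-texturing sends 64000 bytes per frame while applying both textures in one pass sends 32000.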
Do not stall the 3D pipeline
In Direct3D, do not lock the frame buffer, vertex buffers,
or other DirectDraw surfaces unnecessarily, as this forces
the CPU and 3D accelerator to synchronize. Locking
any DirectDraw surface that is being used in any portion
of the 3D rendering pipeline will cause a stall: all the
3D operations that have been queued up to be done by the
accelerator must complete before the Lock() can occur.
This effectively serializes the two processors and eliminates
concurrency. If you must lock a surface or buffer, make
sure that it is not done too soon after it is used (explicitly
or implicitly) in a rendering call.
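One way to follow this advice is to record when each surface was last used for rendering and to postpone locks until some frames have passed. The sketch below is a hypothetical bookkeeping helper, not part of Direct3D; the `min_gap` threshold is an assumption that you would have to tune by measurement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-surface bookkeeping, for illustration only.
class SurfaceUsage {
public:
    // Call whenever the surface is referenced by a rendering call.
    void mark_used(std::uint32_t frame) {
        last_used_ = frame;
        used_ = true;
    }

    // True if at least `min_gap` frames have passed since the surface was
    // last used for rendering, so a lock is less likely to wait on
    // queued 3D operations.
    bool lock_is_likely_cheap(std::uint32_t current_frame,
                              std::uint32_t min_gap) const {
        if (!used_) return true;  // never rendered with: nothing to wait for
        assert(current_frame >= last_used_);
        return current_frame - last_used_ >= min_gap;
    }

private:
    std::uint32_t last_used_ = 0;
    bool used_ = false;
};
```

An engine could consult such a record before calling Lock() and, if the surface was used too recently, defer the update or write to a different surface instead.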
Minimize updates to textures during rendering
This is just a special case of the previous point: in
Direct3D, textures are just special DirectDraw surfaces.
If you must update a texture, in addition to the caveats
above, provide hints to the driver on texture creation
that you will be updating it: set the DDSCAPS2_HINTDYNAMIC
flag for the texture. On the other hand, if you are sure
you'll never touch the contents of a texture again after
you load it up, then set the DDSCAPS2_HINTSTATIC flag
so that the driver can optimize the texture for best texture
cache coherency.
In OpenGL® this amounts to using texture objects, which
were introduced in OpenGL® 1.1 and are in common use today.
Non-startup-time calls to glTexImage2D() should also be
kept to a minimum, as this requires the implementation
to copy the texel data from the application and format/optimize
it for use by the hardware.
Triple-buffering can minimize buffer dependencies
If the 3D hardware finishes rendering to the back buffer
of a double-buffered surface before the flip can take place,
rendering must wait for the vertical-blanking period
unless you use a triple-buffered scheme. With triple
buffering, the third buffer is still available to be rendered
to, so the pipeline does not stall waiting for a rendering
surface to become available. Naturally, this adds a frame
of latency between any physics or "twitch" application
interaction and the displayed image.
Developers should understand this trade-off.
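The difference can be illustrated with a simplified steady-state model. This is a sketch under stated assumptions, not a measurement: flips happen only at vertical blank, every frame takes the same time to render, and with double buffering the renderer cannot start the next frame until the flip has occurred. The function names are hypothetical and all times are in microseconds.

```cpp
#include <algorithm>
#include <cassert>

// Double buffering: each frame occupies a whole number of refresh periods,
// because the renderer waits for the flip before starting the next frame.
inline long double_buffered_frame_us(long render_us, long refresh_us) {
    assert(render_us > 0 && refresh_us > 0);
    const long periods = (render_us + refresh_us - 1) / refresh_us;  // round up
    return periods * refresh_us;
}

// Triple buffering: the renderer moves on to the spare buffer, so the
// average frame time is bounded only by the slower of rendering and refresh.
inline long triple_buffered_frame_us(long render_us, long refresh_us) {
    assert(render_us > 0 && refresh_us > 0);
    return std::max(render_us, refresh_us);
}
```

In this model, a 20000 microsecond frame on a display with a 16667 microsecond refresh period is delivered every 33334 microseconds with double buffering but every 20000 microseconds with triple buffering. When rendering is faster than the refresh period, both schemes deliver one frame per refresh.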
Rendering State
Some of the 3D rendering features that the accelerator
maintains as its current rendering state are:
- Shading mode
- Current texture(s)
- Texture filtering mode(s)
- Alpha-blending states
- (Multi)texture modes
- Anything else that is part of the 3D pixel pipeline
Under D3D, these IDirect3DDevice3 methods affect the render
state of the accelerator:
SetRenderState();
SetTexture();
SetTextureStageState();
Naturally, to use an accelerator efficiently, you need
to:
- Minimize changes to the current render state
  before any draw operation.
  Keep track of all the render states that you
  set. If a render state needs to be changed,
  check first whether the change is necessary
  (has some previous call already set that state?)
  before actually using the API to change it.
  This ensures that no extra work is done at a
  lower level (either the DirectX runtime or the
  driver) to set a redundant render state. Moving
  to a "material" or "shader" model and using
  ValidateDevice() to validate at the material or
  shader granularity can improve efficiency in
  this area.
- Maximize the number of primitives drawn with
  the same render state.
  This is essentially the same idea as minimizing
  render state changes. If you have a set of objects
  that share the same texture, draw the set
  together, since the current texture does not
  change between them. Similarly, if you can arrange
  for all the triangles that share the same
  material/shader (and thus render states) to be
  rendered in one batch, performance improves both
  because render state changes are minimized and
  because more primitives are batched together.
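Both points can be sketched together in a few lines. The code below is illustrative only: `FakeDevice` is a hypothetical stand-in that merely counts the texture changes reaching the API, `TextureStateCache` filters out redundant changes, and `draw_all` optionally sorts the draw list by texture before submitting it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device: counts texture changes that reach it.
struct FakeDevice {
    std::size_t texture_changes = 0;
    void set_texture(int /*texture_id*/) { ++texture_changes; }
};

// Remembers the last texture set and skips redundant changes.
class TextureStateCache {
public:
    explicit TextureStateCache(FakeDevice& device) : device_(device) {}

    void set_texture(int texture_id) {
        if (valid_ && current_ == texture_id) return;  // already set
        device_.set_texture(texture_id);
        current_ = texture_id;
        valid_ = true;
    }

private:
    FakeDevice& device_;
    int current_ = 0;
    bool valid_ = false;
};

struct DrawItem { int texture_id; int mesh_id; };

// Submits all items and returns how many texture changes reached the device.
inline std::size_t draw_all(std::vector<DrawItem> items, bool sort_by_texture) {
    if (sort_by_texture) {
        std::stable_sort(items.begin(), items.end(),
                         [](const DrawItem& a, const DrawItem& b) {
                             return a.texture_id < b.texture_id;
                         });
    }
    FakeDevice device;
    TextureStateCache cache(device);
    for (const DrawItem& item : items) {
        cache.set_texture(item.texture_id);
        // A real engine would issue the draw call for item.mesh_id here.
    }
    assert(device.texture_changes <= items.size());
    return device.texture_changes;
}

// Six objects that alternate between two textures.
inline std::vector<DrawItem> sample_scene() {
    return { {1, 0}, {2, 1}, {1, 2}, {2, 3}, {1, 4}, {1, 5} };
}
```

For the sample scene, the cache alone reduces six texture sets to five; sorting by texture first reduces them to two. The same pattern extends to other render states, typically by sorting on a combined material key.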
Leveraging Rage 128 and Rage128 Pro Features
Certain features of the Rage 128 family of accelerators,
if accounted for, allow developers to increase the
performance and visual quality of their applications.
AGP
Like the Rage Pro family, the Rage 128 family of chips
uses the AGP bus very efficiently, allowing applications
to use very large texture footprints. Essentially, the
effective video memory available for textures extends
beyond the limit of local video memory. The chip is designed
to handle high-resolution textures (up to 1024x1024) with
high color depth (up to 32bpp), which enhance visual
quality and realism.
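To see why large textures quickly outgrow local video memory, their footprint can be estimated as sketched below. The function is hypothetical and assumes uncompressed texels and, when mipmaps are requested, a full chain down to 1x1.

```cpp
#include <cassert>
#include <cstddef>

// Approximate memory footprint of an uncompressed texture, in bytes.
inline std::size_t texture_bytes(std::size_t width, std::size_t height,
                                 std::size_t bits_per_pixel,
                                 bool with_mipmaps) {
    assert(width > 0 && height > 0);
    assert(bits_per_pixel > 0 && bits_per_pixel % 8 == 0);
    const std::size_t bytes_per_pixel = bits_per_pixel / 8;
    std::size_t total = 0;
    for (;;) {
        total += width * height * bytes_per_pixel;
        if (!with_mipmaps || (width == 1 && height == 1)) break;
        // Each mipmap level halves both dimensions, down to 1.
        if (width > 1) width /= 2;
        if (height > 1) height /= 2;
    }
    return total;
}
```

A single 1024x1024 texture at 32bpp takes 4194304 bytes (4 MB), or 5592404 bytes with a full mipmap chain, so a handful of such textures is enough to make AGP memory worthwhile.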
The Rage 128 and Rage128 Pro work internally using 32-bit
color for all interpolation and calculations. This means
that using textures that are 32bpp will give the most
effective use of the pipeline, as well as create the best
content possible.
Creating AGP textures requires no extra effort: just let
the driver decide which memory the texture surface should
be created in, AGP or local video memory. If you are using
Direct3D and want to be specific about using AGP memory
for particular textures, you can set the
DDSCAPS_NONLOCALVIDMEM flag when creating your texture
surface.
32-bit Draw Buffers
Use 32-bit draw buffers rather than 16-bit to get realistic
colors in your scenes. As mentioned above, the chip internally
works in 32-bit color, and there is no need to dither
down to 16-bit colors if the draw buffers are 32-bit.
The visual quality can improve dramatically.
The 32-bit draw buffers need twice as much memory as buffers
of the same size in 16-bit. But since the Rage 128 and
Rage128 Pro support 32MB of frame buffer memory, there
is almost no reason to use lower color depths for drawing
surfaces. At the very least, an application should give
its user the option of setting deep frame buffers and
texture maps.
32-bit Z-Buffer
The Rage 128 and Rage128 Pro allow you to use a 32-bit
or 24-bit Z-buffer, which improves the accuracy of the
depth calculations.
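The memory cost of deep draw buffers and a deep Z-buffer can be checked with a quick calculation. The function below is a hypothetical sketch; it assumes tightly packed buffers with no driver padding or alignment, and it treats a 24-bit Z-buffer as occupying 32 bits per pixel, which is an assumption rather than a documented fact about this hardware.

```cpp
#include <cassert>
#include <cstddef>

// Approximate bytes needed for the color buffers plus one depth buffer.
inline std::size_t render_target_bytes(std::size_t width, std::size_t height,
                                       std::size_t color_bits,
                                       std::size_t color_buffers,
                                       std::size_t depth_bits) {
    assert(color_bits % 8 == 0 && depth_bits % 8 == 0);
    assert(color_buffers > 0);
    const std::size_t pixels = width * height;
    return pixels * (color_bits / 8) * color_buffers
         + pixels * (depth_bits / 8);
}
```

At 1024x768, double-buffered 16-bit color with a 16-bit Z-buffer needs 4718592 bytes (4.5 MB). The same setup at 32 bits needs 9437184 bytes (9 MB), and triple-buffered 32-bit color with a 32-bit Z-buffer needs 12582912 bytes (12 MB), which still leaves most of a 32MB frame buffer for textures.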
8-bit Stencil Buffer
You can create volumetric rendering special effects by
using the 8-bit stencil buffer that is available to you
with the Rage 128 and Rage128 Pro-based parts. This allows
you to use shadow volumes (see Shadowvol and Shadowvol2
examples in the DirectX® 7 SDK) and constructive solid
geometry with hardware support in your applications.