Tessellation in a Low Poly World
Nicolas Thibieroz, AMD Graphics Products Group
Original materials from Bill Bilodeau
10/04/23
GDC Paris 2008
What is Tessellation?
Tessellation is the process of adding new primitives into an existing model
Triangle counts can be “dialed in” by adjusting the tessellation level
[Figure: the same model at low, medium, and high tessellation levels]
AMD Hardware Tessellator
[Diagram: Input Assembler → Tessellator → Vertex Shader → Rasterizer → Pixel Shader → Output Merger; the Vertex Shader and Pixel Shader stages read Memory / Resources]
Hardware tessellation allows you to render more polygons for better silhouettes
Initial concept artwork from Bay Raitt, Valve
Surface control cages are easier to work with than individual triangles
Artists prefer to create models this way
Animations are simpler on a control cage
Control cage can be animated on the GPU, then tessellated in a second pass
[Diagram: Pass 1: Animated Control Cage → Vertex Shader → Pixel Shader → R2VB. Pass 2: R2VB → Tessellator → Vertex Shader → Pixel Shader]
Hardware tessellation is a form of compression
Smaller footprint – you only need to store the control cage and possibly a displacement map
Improved bandwidth – less data to transfer from memory to GPU
Three types of primitives, or “superprims”, are supported
Triangles
Quads
Lines
There are two tessellation modes
- Continuous
- Adaptive
Continuous Tessellation
Specify floating point tessellation level per-draw call
– Tessellation levels range from 1.0 to 14.99
Eliminates popping as vertices are added through tessellation
[Figure: one primitive tessellated at levels 1.0, 1.1, 1.3, 1.7, and 2.0]
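Because the level is a floating point value set per draw call, it can be driven smoothly by something like camera distance. The following is a minimal sketch of that idea, not code from the tessellation library: the function name, the near/far distances, and the linear falloff are all assumptions made for illustration. Only the 1.0 to 14.99 range comes from the slide.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of the AMD tessellation library):
// maps camera distance to a floating point tessellation level.
// Only the 1.0 .. 14.99 range is taken from the slide.
float ComputeTessellationLevel(float distance, float nearDist, float farDist)
{
    assert(farDist > nearDist);
    const float kMinLevel = 1.0f;
    const float kMaxLevel = 14.99f;

    // 0 at the near distance, 1 at the far distance, clamped in between.
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::max(0.0f, std::min(1.0f, t));

    // Close objects get the highest level; distant ones fall back to 1.0.
    return kMaxLevel + t * (kMinLevel - kMaxLevel);
}
```

The returned value is the kind of number an application could pass as the level argument of TSSetMaxTessellationLevel() before drawing each object.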
Adaptive allows different levels of tessellation within the same mesh
[Figure: a single mesh whose edges carry different tessellation factors (3.x, 5.x, and 7.x)]
Adaptive tessellation can be done in real-time using multiple passes
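[Diagram: multi-pass adaptive tessellation. Pass 1: Superprim Mesh → Vertex Shader → Pixel Shader → Transformed Superprim Mesh (R2VB). Pass 2: Transformed Superprim Mesh → Vertex Shader → Pixel Shader → Tessellation Factors (R2VB). Pass 3: Superprim Mesh (Stream 0) and Tessellation Factors (Stream 1) → Tessellator → Vertex Shader (with Sampler) → Pixel Shader]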
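The deck does not say how the per-edge factors are computed in the intermediate pass. One common choice is a screen-space metric, sketched below on the CPU for readability. Everything here is an assumption for illustration: the function name, the pixels-per-segment target, and the clamp range supplied by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Hypothetical metric (an assumption, not the SDK's method): one segment
// per `pixelsPerSegment` pixels of projected edge length, clamped to the
// [minLevel, maxLevel] range chosen by the application.
float EdgeTessellationFactor(Vec2 a, Vec2 b, float pixelsPerSegment,
                             float minLevel, float maxLevel)
{
    assert(pixelsPerSegment > 0.0f && minLevel <= maxLevel);

    // Length of the edge after projection to screen space, in pixels.
    float lengthInPixels = std::hypot(b.x - a.x, b.y - a.y);

    float factor = lengthInPixels / pixelsPerSegment;
    return std::max(minLevel, std::min(maxLevel, factor));
}
```

The factor depends only on the two endpoints of the edge, so two superprims that share an edge compute the same value for it. That property is what keeps neighbouring patches from cracking apart when their interiors are tessellated differently.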
Code Example: Continuous Tessellation
// Enable tessellation:
TSSetTessellationMode( pd3dDevice, TSMD_ENABLE_CONTINUOUS );
// Set tessellation level:
TSSetMaxTessellationLevel( pd3dDevice, sg_fMaxTessellationLevel );
// Select appropriate technique to render our tessellated objects:
sg_pEffect->SetTechnique( "RenderTessellatedDisplacedScene" );
// Render all passes with tessellation
V( sg_pEffect->Begin( &cPasses, 0 ) );
for ( iPass = 0; iPass < cPasses; iPass++ )
{
    V( sg_pEffect->BeginPass( iPass ) );
    V( TSDrawMeshSubset( sg_pMesh, 0 ) );
    V( sg_pEffect->EndPass() );
}
V( sg_pEffect->End() );
// Disable tessellation:
TSSetTessellationMode( pd3dDevice, TSMD_DISABLE );
The vertex shader is used as an evaluation shader
[Diagram: Super-prim Mesh → Tessellator → Tessellated Mesh → Vertex Shader (Evaluation Shader), which reads the Displacement Map through a Sampler → Tessellated and Displaced Mesh]
Example Code: Evaluation Vertex Shader
struct VsInputTessellated
{
    // Barycentric weights for this vertex
    float3 vBarycentric : BLENDWEIGHT0;
    // Data from superprim vertex 0:
    float4 vPositionVert0 : POSITION0;
    float2 vTexCoordVert0 : TEXCOORD0;
    float3 vNormalVert0 : NORMAL0;
    // Data from superprim vertex 1:
    float4 vPositionVert1 : POSITION4;
    float2 vTexCoordVert1 : TEXCOORD4;
    float3 vNormalVert1 : NORMAL4;
    // Data from superprim vertex 2:
    float4 vPositionVert2 : POSITION8;
    float2 vTexCoordVert2 : TEXCOORD8;
    float3 vNormalVert2 : NORMAL8;
};
Example Code: Evaluation Vertex Shader
VsOutputTessellated VSRenderTessellatedDisplaced( VsInputTessellated i )
{
    VsOutputTessellated o;
    // Compute new position based on the barycentric coordinates:
    float3 vPosTessOS = i.vPositionVert0.xyz * i.vBarycentric.x
                      + i.vPositionVert1.xyz * i.vBarycentric.y
                      + i.vPositionVert2.xyz * i.vBarycentric.z;
    // Output world-space position:
    o.vPositionWS = vPosTessOS;
    // Compute new normal vector for the tessellated vertex:
    o.vNormalWS = i.vNormalVert0.xyz * i.vBarycentric.x
                + i.vNormalVert1.xyz * i.vBarycentric.y
                + i.vNormalVert2.xyz * i.vBarycentric.z;
    // Compute new texture coordinates based on the barycentric coordinates:
    o.vTexCoord = i.vTexCoordVert0.xy * i.vBarycentric.x
                + i.vTexCoordVert1.xy * i.vBarycentric.y
                + i.vTexCoordVert2.xy * i.vBarycentric.z;
    // Displace the tessellated vertex (sample the displacement map)
    o.vPositionWS = DisplaceVertex( vPosTessOS, o.vTexCoord, o.vNormalWS );
    // Transform position to screen-space:
    o.vPosCS = mul( float4( o.vPositionWS, 1.0 ), g_mWorldViewProjection );
    return o;
} // End of VsOutputTessellated VSRenderTessellatedDisplaced(..)
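The arithmetic the evaluation shader performs can be checked on the CPU. The sketch below mirrors the barycentric blend in plain C++. The Float3 type and both function names are made up for this illustration, and DisplaceAlongNormal() is only a guess at what DisplaceVertex() does, since the deck does not show its body.

```cpp
#include <cassert>

struct Float3 { float x, y, z; };

// Weighted sum of three corner values, mirroring the shader's interpolation.
// `bary` holds the three barycentric weights, which should sum to 1.
Float3 BarycentricBlend(Float3 v0, Float3 v1, Float3 v2, Float3 bary)
{
    float sum = bary.x + bary.y + bary.z;
    assert(sum > 0.999f && sum < 1.001f);
    return { v0.x * bary.x + v1.x * bary.y + v2.x * bary.z,
             v0.y * bary.x + v1.y * bary.y + v2.y * bary.z,
             v0.z * bary.x + v1.z * bary.y + v2.z * bary.z };
}

// Hypothetical stand-in for DisplaceVertex(): moves the position along the
// normal by a height that the shader would read from the displacement map.
Float3 DisplaceAlongNormal(Float3 pos, Float3 normal, float height)
{
    return { pos.x + normal.x * height,
             pos.y + normal.y * height,
             pos.z + normal.z * height };
}
```

With weights (1, 0, 0) the blend returns the first superprim vertex unchanged, which is a quick sanity check that the weights and vertex streams are paired correctly.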
What if you want to do more?
DirectX 9 has a limit of 15 float4 vertex input components – High order surfaces need more inputs
TSToggleIndicesRetrieval() allows you to fetch the super-prim data from a vertex texture
[Diagram: the Tessellator passes (u,v) to the Vertex Shader, which fetches the Bezier control points P0,0, P0,1 … P3,3 through a Sampler]
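For a bicubic Bezier patch, the evaluation step amounts to weighting the 16 control points by Bernstein polynomials in u and v. The following is a minimal CPU sketch of that standard formula, not code from the deck. The type and function names are assumptions, as is the choice to pair u with the first control point index.

```cpp
#include <cassert>

struct Float3 { float x, y, z; };

// Cubic Bernstein basis functions B0..B3 evaluated at t.
void BernsteinBasis(float t, float b[4])
{
    float s = 1.0f - t;
    b[0] = s * s * s;
    b[1] = 3.0f * s * s * t;
    b[2] = 3.0f * s * t * t;
    b[3] = t * t * t;
}

// Evaluates a bicubic Bezier patch at (u, v) from its 4x4 control points.
// Assumption: u runs along the first index of P and v along the second.
Float3 EvaluateBezierPatch(const Float3 P[4][4], float u, float v)
{
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);

    float bu[4], bv[4];
    BernsteinBasis(u, bu);
    BernsteinBasis(v, bv);

    Float3 result = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 4; ++i)
    {
        for (int j = 0; j < 4; ++j)
        {
            float w = bu[i] * bv[j];   // weight of control point P[i][j]
            result.x += P[i][j].x * w;
            result.y += P[i][j].y * w;
            result.z += P[i][j].z * w;
        }
    }
    return result;
}
```

Sixteen float4 control points do not fit within the vertex input limit quoted on the slide, which is why this kind of surface needs its data fetched from a vertex texture.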
Other Tessellation Library Functions
TSDrawIndexed(…)
– Analogous to DrawIndexedPrimitive(…)
TSDrawNonIndexed(…)
– Needed for adaptive tessellation, since every edge needs its own tessellation level
TSSetMinTessellationLevel(…)
– Sets the minimum tessellation level for adaptive tessellation
TSComputeNumTessellatedPrimitives(…)
– Calculates the number of tessellated primitives that will be generated by the tessellator
Displacement mapping alters tangent space
To do normal mapping we need to rotate tangent space
Alternatively, use model space normal maps
– Doesn’t work with animation or tiling
Displacement map lighting
Use the displacement map to calculate the per-pixel normal
Central differencing with neighboring displacements can approximate the derivative
Light with the computed normal
No need to use a normal map
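The central differencing step can be sketched as follows. This is a CPU illustration under stated assumptions, not the deck's pixel shader: the HeightMap type, the border clamping, the texel spacing, and the displacement scale parameter are all made up for the example, and the normal is expressed in a space where the undisplaced surface faces +z.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Float3 { float x, y, z; };

// Displacement map held on the CPU as a row-major grid of heights.
struct HeightMap
{
    int width;
    int height;
    std::vector<float> texels;

    // Reads a texel, clamping coordinates at the borders.
    float At(int x, int y) const
    {
        x = std::max(0, std::min(width - 1, x));
        y = std::max(0, std::min(height - 1, y));
        return texels[y * width + x];
    }
};

// Approximates the surface normal at texel (x, y) from the four neighbouring
// displacements. For a height field z = h(x, y) the unnormalized normal is
// (-dh/dx, -dh/dy, 1).
Float3 NormalFromDisplacement(const HeightMap& map, int x, int y,
                              float texelSpacing, float displacementScale)
{
    assert(texelSpacing > 0.0f);
    float dhdx = displacementScale * (map.At(x + 1, y) - map.At(x - 1, y))
               / (2.0f * texelSpacing);
    float dhdy = displacementScale * (map.At(x, y + 1) - map.At(x, y - 1))
               / (2.0f * texelSpacing);
    float len = std::sqrt(dhdx * dhdx + dhdy * dhdy + 1.0f);
    return { -dhdx / len, -dhdy / len, 1.0f / len };
}
```

In a pixel shader the four At() calls become four texture fetches at offset texture coordinates, which is the cost traded against storing a separate normal map.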
Terrain Rendering: Performance Results
Both use the same displacement map (2K x 2K) and identical pixel shaders

                                                  Low Resolution         High Resolution,
                                                  with Tessellation      No Tessellation
On-disk model polygon count (pre-tessellation)    840 triangles          1,280,038 triangles
Original model rendering cost                     1210 fps (0.83 ms)
Actual rendered model polygon count               1,008,038 triangles    1,280,038 triangles
VRAM vertex buffer size                           70 KB                  31 MB
VRAM index buffer size                            23 KB                  14 MB
Rendering time                                    821.41 fps (1.22 ms)   301 fps (3.32 ms)

Subtracting the cost of shading, rendering with tessellation is > 6X faster and provides memory savings of over 44 MB!
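The "> 6X" and "over 44MB" figures can be reproduced from the numbers on this slide. The sketch below does so under one assumption that the slide does not state outright: that the 0.83 ms "original model rendering cost" is the shading cost subtracted from both rendering times.

```cpp
#include <cassert>

// Figures copied from the slide (milliseconds per frame).
const double kShadingOnlyMs = 0.83;  // assumed to be the shading baseline
const double kTessellatedMs = 1.22;  // low resolution with tessellation
const double kHighResMs     = 3.32;  // high resolution, no tessellation

// Ratio of the geometry costs once the shading baseline is removed.
double GeometrySpeedup()
{
    assert(kTessellatedMs > kShadingOnlyMs);  // guards the division
    return (kHighResMs - kShadingOnlyMs) / (kTessellatedMs - kShadingOnlyMs);
}

// Vertex plus index buffer savings in MB (1 MB taken as 1024 KB).
double MemorySavedMB()
{
    const double highResMB = 31.0 + 14.0;
    const double lowResMB  = (70.0 + 23.0) / 1024.0;
    return highResMB - lowResMB;
}
```

Under that reading the geometry cost drops from 2.49 ms to 0.39 ms, a ratio of about 6.4, and the buffers shrink by roughly 44.9 MB.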
Terrain Tessellation Sample
AMD GPU MeshMapper
New tool for generating normal, displacement, and ambient occlusion maps from hi-res and low-res mesh pairs
Advantages of the Tessellator
• Saves memory bandwidth and reduces memory footprint
• Flexible support for displacement mapping and many kinds of high order surfaces
• Easier content creation – artists and animators only need to work with low resolution geometry
• Continuous LOD avoids unnecessary triangles
• The tessellator is available now on the Xbox 360 and the latest ATI Radeon and FireGL graphics cards
• Public availability of tessellation SDK very soon
Harnessing the Power of Multiple GPUs
Nicolas Thibieroz, AMD Graphics Products Group
Original materials from Jon Story & Holger Grün
GDC Paris 2008
Why MGPU?
MGPUs can be used to dramatically increase performance and visual quality
– At higher screen resolutions
– Especially with increased use of MSAA
Many applications become GPU limited at higher screen resolutions
– High resolution monitors => mainstream affordability
Achieve next generation performance on today’s HW
– Prototype your next engine
Provides an upgrade path for mainstream parts
Multiple Boards
An increasing number of motherboards can accept 2 or more discrete video cards
Connected by high speed crossover cables
Now possible to fit 4 Radeon HD3850 boards to a single motherboard
CrossFireX technology allows you to harness that performance
Multiple GPUs per Board
The Radeon HD3870 X2 is a single-board multi-GPU architecture
– AFR is on by default
Heavy peer to peer communication
– Bi-directional 16x lane pipe connecting the 2 GPUs
CrossFireX supports 2 HD3870 X2 boards for Quad GPU performance
Hybrid Crossfire
Combination of integrated and discrete graphics
3D graphics performance boost
– Laptops
– Mainstream desktop PCs
Use less power during non-taxing graphical tasks
CrossFire Rendering Modes
Split Frame Rendering / Scissor
– Screen is divided into as many regions as there are GPUs
– Dynamic load balancing
Alternate Frame Rendering
– GPUs take alternate frames
– Vertex processing not duplicated
– Highest performing mode
How does AFR Work?
[Diagram: the CPU issues a stream of commands; the commands for frame N go to GPU0 and the commands for frame N+1 go to GPU1]
Hardware Considerations
Current MGPU setups are not shared memory architectures
– Resources placed in local video memory are duplicated for each GPU
Driver initiates peer to peer (P2P) copies to keep resources in sync
– On some chipsets this may involve the CPU
– Synchronizes all GPUs
– Very heavy impact on performance that can even result in negative scaling
Driver Modes
Compatible AFR Mode
– Default mode
– Driver checks for AFR unfriendly behaviour
– Will P2P copy stale resources
Full AFR Mode (Application Profile)
– Driver recognises EXE name
– Use a unique name and don’t change it
– Behaviour fully guided by profile
– Best performance – no checking
– Rename EXE to "AFR-FriendlyD3D.exe"
– Use "AFR-FriendlyOGL.exe" for OpenGL
– No checking : Speed & compatibility test
Detecting the Number of GPUs
Visit http://ati.amd.com/developer
– Download project called "CrossFire Detect"
Statically link to:
– "atimgpud_s_x86.lib" 32 bit version
– "atimgpud_s_x64.lib" 64 bit version
Include header file:
– "atimgpud.h"
Call this function:
– INT count = AtiMultiGPUAdapters();
Common Pitfalls & Solutions
Pitfall: Dependencies Between Frames
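[Diagram: GPU0 renders frame N and GPU1 renders frame N+1, each with its own copy of resource A. GPU0: Present (N-1), Update resource A, Present (N). GPU1: Draw using A, Update resource A, Present (N+1). Because frame N+1 draws using A before updating it, the driver must P2P copy resource A from GPU0 to GPU1]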
Solution: Resources that Change Every Frame
[Diagram: GPU0 renders frame N and GPU1 renders frame N+1, each with its own copy of resource A. Each frame runs Update resource A, then Draw using A, then Present, so no frame depends on the previous frame's copy of A]
Solution: Resources that Change Every Few Frames
[Diagram: GPU0 renders frame N and GPU1 renders frame N+1, each with its own copy of resource A. Frame N runs Update resource A, Draw using A, Present (N) on GPU0; frame N+1 repeats Update resource A, Draw using A, Present (N+1) on GPU1. Frames N+2, N+3, and N+4 then only Draw using A and Present]
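One way to read this slide is that an update to a rarely changing resource is repeated on as many consecutive frames as there are GPUs, so that every GPU ends up with its own current copy. The class below is a small sketch of that bookkeeping. Its name and interface are hypothetical, and the GPU count is assumed to come from a call such as AtiMultiGPUAdapters().

```cpp
#include <cassert>

// Hypothetical helper: tracks how many more consecutive frames must redo a
// resource update so that each GPU in an AFR setup gets its own copy.
class AfrResourceUpdater
{
public:
    explicit AfrResourceUpdater(int gpuCount)
        : m_gpuCount(gpuCount), m_pendingFrames(0)
    {
        assert(gpuCount >= 1);
    }

    // Call when the application decides the resource contents must change.
    void MarkDirty() { m_pendingFrames = m_gpuCount; }

    // Call once per frame before drawing. Returns true if this frame should
    // re-render the resource before using it.
    bool ShouldUpdateThisFrame()
    {
        if (m_pendingFrames == 0)
            return false;
        --m_pendingFrames;
        return true;
    }

private:
    int m_gpuCount;
    int m_pendingFrames;
};
```

With two GPUs, a MarkDirty() call makes the next two frames redo the update and every later frame skip it, which matches the pattern of two updates followed by draw-only frames.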
Pitfalls: In DX10 there are Other Ways to Update Resources...
Drawing to vertex/index buffers
Stream Out
CopyResource() calls
CopySubresourceRegion() calls
GenerateMips() calls
ResolveSubresource() calls
Pitfall: Waiting on Queries
[Diagram: the CPU feeds commands to GPU0 (frame N) and GPU1 (frame N+1); command submission stalls while the CPU is waiting for a query result]
Solution: Queries
Avoid using queries whenever possible
- For occlusion queries consider a CPU-based approach
Avoid waiting on query results
- Pick up the result of a query at least N-GPU frames after it was issued
For queries issued every frame
- Create additional query objects for each GPU
- Cycle through them
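The "cycle through them" advice can be sketched as a ring of query slots, one per GPU, where each slot is read back just before it is reused. The class below only simulates the bookkeeping with frame numbers. The name and interface are hypothetical, and a real implementation would hold actual query objects in each slot.

```cpp
#include <cassert>
#include <vector>

// Hypothetical ring of query slots. Each slot remembers the frame on which
// its query was issued; -1 means the slot has not been used yet.
class QueryRing
{
public:
    explicit QueryRing(int gpuCount)
        : m_issuedOnFrame(gpuCount, -1), m_frame(0)
    {
        assert(gpuCount >= 1);
    }

    // Call once per frame. Returns the frame whose query result is now
    // gpuCount frames old (or -1 while the ring is still filling), then
    // reuses that slot for the current frame's query.
    int BeginFrame()
    {
        int slot = m_frame % static_cast<int>(m_issuedOnFrame.size());
        int readyFrame = m_issuedOnFrame[slot];  // read this slot's result here
        m_issuedOnFrame[slot] = m_frame;         // issue this frame's query here
        ++m_frame;
        return readyFrame;
    }

private:
    std::vector<int> m_issuedOnFrame;
    int m_frame;
};
```

The design trades latency for throughput: results arrive gpuCount frames late, so they suit effects that tolerate slightly stale data.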
Pitfall: CPU Access to a Renderable Resource
When the CPU locks a renderable resource it must wait for all GPUs to finish using the resource before acquiring the pointer
All GPUs now have to wait until the CPU unlocks the resource pointer
After the unlock the driver has to update the resource on each GPU via P2P copies
Just don’t do this – it destroys performance even on a single GPU setup, and is catastrophic for MGPUs
Solutions: Locks / Maps
In DX10 stream to and copy from STAGING textures
In DX9 StretchRect() is always better than Lock()
At resource creation time use the appropriate flags from:
– D3D10_USAGE
– D3D10_CPU_ACCESS_FLAG
In DX9 never lock static Vertex/Index Buffers because it will cause P2P copies
Concluding Pitfalls & Solutions
Drivers take a conservative approach
– Performs checks on resource synchronization
– P2P copy if necessary
You know the application best
– Determine if a P2P copy is necessary
– Talk to us about a profile
AFR-Friendly SDK Sample
Part of the ATI developer SDK
– http://ati.amd.com/developer
Detects the number of GPUs
Correctly deals with textures used as render targets
Provides a solution for dealing with mouse cursor lag
Go and take a look!!
Call to Action
• MGPUs provide demonstrable performance gains
• MGPUs boost visual quality
• Plan from day one to make your rendering scale
• Detect the number of GPUs
• Regularly check for AFR unfriendly behavior
• Talk to us...
QUESTIONS?