
Graphics Performance

Richard Huddy
European Developer Relations
RHuddy@ati.com

Today’s agenda…

• DirectX 9
  • CPU limited games
  • The classical or “legacy” pipeline
  • The idea of balance
  • Where most people go wrong

• DirectX 10
  • What the driver model brings to the table
  • What the runtime will do for you
  • What we had to change about our hardware
  • What you have to change about your games
  • The idea of balance
  • Where most people will go wrong

DX 9 - CPU limited games

• A real rarity, yes?
  • Well, no, that’s why we encourage benchmarking at high resolutions
• The single commonest reason is poor batching...
• In DX9 small batches are death for performance…
  • And when you set state, you do so with many interactions with the API
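
A rough way to see the batching point is to count API interactions. The sketch below is only a counting model, not real Direct3D code: the `(state, triangles)` tuples, the function names, and the assumption that objects sharing a state can be merged into a single draw are all hypothetical.

```python
from itertools import groupby

def api_calls_unbatched(objects):
    """Submit objects in scene order: one draw call per object,
    plus one state-setting call every time the state changes."""
    calls = 0
    current_state = None
    for state, _triangles in objects:
        if state != current_state:
            calls += 1          # state change
            current_state = state
        calls += 1              # draw call
    return calls

def api_calls_batched(objects):
    """Sort by state, then assume (hypothetically) that every group of
    objects sharing a state can be merged into a single draw call."""
    ordered = sorted(objects, key=lambda obj: obj[0])
    calls = 0
    for _state, _group in groupby(ordered, key=lambda obj: obj[0]):
        calls += 2              # one state change + one merged draw
    return calls

# Hypothetical scene: state name and triangle count per object.
scene = [("grass", 2), ("rock", 2), ("grass", 2), ("rock", 2)]
print(api_calls_unbatched(scene), api_calls_batched(scene))
```

The triangle count is the same in both cases; only the number of trips through the API changes, which is the cost that matters when a game is CPU limited.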

The classical or “legacy” pipeline

• Vertex/Index Fetch
• Vertex processing
• Primitive Assembly
• Rasterization
• Pixel processing
• Frame Buffer operations (*the end of the street)

It’s a one way street with each stage feeding the next in line…

Vertex/Index Fetch

• Fetching data is slower than not fetching data
  • Indexing lets you skip fetches…
  • Packing your data might save you bandwidth
• Data is fetched into a cache, so make use of the cache
  • That’s why you should use indexed vertices
  • That’s why you should use strips (and fans)
• If the data is aligned with cache lines then transactions run faster
• We use 256 bit fetches, even on lower width cards where we simply make more fetches
  • That’s 32 bytes per fetch
  • 32 bytes good, 36 bytes bad
  • Red good, green bad…
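
The "32 bytes good, 36 bytes bad" rule can be checked with a little arithmetic. This is a minimal sketch that assumes the vertex buffer starts on a 32-byte boundary and that each vertex is fetched on its own (no sharing of a fetch between neighbouring vertices); the function names are illustrative only.

```python
FETCH_BYTES = 32  # one 256-bit fetch, as stated on the slide

def fetches_for_vertex(index, stride, fetch_bytes=FETCH_BYTES):
    """Number of aligned fetches touched by vertex `index` when vertices
    are packed back to back with `stride` bytes each."""
    first_byte = index * stride
    last_byte = first_byte + stride - 1
    return last_byte // fetch_bytes - first_byte // fetch_bytes + 1

def average_fetches(stride, vertex_count=1024):
    """Average fetches per vertex over a buffer of `vertex_count` vertices."""
    total = sum(fetches_for_vertex(i, stride) for i in range(vertex_count))
    return total / vertex_count

print(average_fetches(32))  # every vertex sits in exactly one fetch
print(average_fetches(36))  # every vertex straddles two fetches
```

Under these assumptions the four extra bytes double the fetch count per vertex, which is why padding or packing to 32 bytes is worth the effort.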

Vertex processing

• Indexed vertices are not re-shaded if they’re still in the cache
  • This is the big win
  • So get back to them before you lose them
  • Vertex cache is 16 entries and is a FIFO (i.e. not MRU)
• Our VS is a 5-way vector
  • So we can schedule an explicit 4-D instruction in the same clock cycle as a scalar operation
  • You should use explicit xyzw channel masking to help us see scalar operations…
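
To get a feel for "get back to them before you lose them", the 16-entry FIFO can be simulated. This is a simplified model, not a description of the actual hardware: the grid mesh, the row-by-row triangle order, and the function names are assumptions made for illustration.

```python
from collections import deque

def count_vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader executions for an index stream, using a FIFO
    post-transform cache: a hit does not refresh the entry's position."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1
            cache.append(index)
    return runs

def grid_indices(width, height):
    """Triangle-list indices for a grid of width x height quads,
    emitted row by row (a hypothetical, naive mesh order)."""
    indices = []
    for y in range(height):
        for x in range(width):
            v0 = y * (width + 1) + x
            v1 = v0 + 1
            v2 = v0 + (width + 1)
            v3 = v2 + 1
            indices += [v0, v1, v2, v1, v3, v2]
    return indices

narrow = grid_indices(4, 4)    # two vertex rows fit in the cache
wide = grid_indices(32, 4)     # a vertex row is evicted before reuse
print(count_vertex_shader_runs(narrow), len(set(narrow)))
print(count_vertex_shader_runs(wide), len(set(wide)))
```

In the narrow grid each vertex is shaded once; in the wide grid the shared row of vertices has left the FIFO by the time the next row of triangles needs it, so those vertices are shaded again.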

Pixel processing

• The usual rules apply:
  • If you don’t process a pixel it’s faster than if you do
  • We can reject 256 pixels in a single clock tick
• Our PS is a 4-way vector
  • So we can schedule explicit 3-D instructions in the same clock as a scalar operation
  • You should use explicit colour channel masking to help us see these opportunities…
• ALU and TEX operations, what are they and how are they different?
• What’s the ideal instruction type ratio?
  • And what did Nick mean by “more than 3:1”?
• [XXX - Tell them how clever our compiler is…]

Blocks of work in parallel in DX9…

• Or… - Why ATI’s DFC is “f**king amazing”
  • [XXX - Nick was too gentle this morning, be ‘assertive’…]
• Handling units of work in parallel
  • 16 pixels in an X1800
  • 48 pixels in an X1900
  • “some number” in our future hardware
• What do I mean by handling 16 pixels in parallel in an X1800?
• Consider “If (a) then b else c”
  • If ‘a’ is always true, or always false then our life is easy, so we do the job the quick way
  • If ‘a’ is sometimes true then our life is hard
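
The "easy life / hard life" distinction can be sketched with a toy cost model. It assumes (this is not stated on the slide) that a block whose pixels disagree about ‘a’ pays for both the ‘b’ and the ‘c’ path, while a block that agrees pays for only one; the costs and function names are made up for illustration.

```python
def block_cost(conditions, cost_then, cost_else):
    """Cost of one block of pixels evaluated together."""
    if all(conditions):
        return cost_then              # 'a' always true: run b only
    if not any(conditions):
        return cost_else              # 'a' always false: run c only
    return cost_then + cost_else      # mixed: assumed to pay for both paths

def total_cost(pixel_conditions, block_size, cost_then, cost_else):
    """Split the pixels into consecutive blocks and sum the block costs."""
    total = 0
    for start in range(0, len(pixel_conditions), block_size):
        block = pixel_conditions[start:start + block_size]
        total += block_cost(block, cost_then, cost_else)
    return total

coherent = [True] * 16 + [False] * 16   # each block of 16 agrees on 'a'
scattered = [True, False] * 16          # every block of 16 is mixed
print(total_cost(coherent, 16, cost_then=10, cost_else=2))
print(total_cost(scattered, 16, cost_then=10, cost_else=2))
```

Both inputs have the same number of true and false pixels; only how they are spread across blocks differs, and that spread is what decides the cost in this model.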

That’s the walk down the classical pipeline

The idea of balance

• Is the idea that if you have a classical pipeline and one part is underused then you could simply throw more work at it
  • For free…
  • How cool is that?
  • And if you only have two parts to the pipeline it can’t be that hard to use this idea cleverly, no?
• Is sillier than you’d think…
  • So your user changes resolution…
  • Forces AA through the control panel…
  • Selects max aniso…
  • Overclocks their memory, or the core, or one of the buses, or changes to a newer graphics card with a different balance etc.

Where most people go wrong in DX9

• The most common reasons why games don’t run fast…
  • Being CPU limited for any of several reasons
  • Not meshing their geometry sufficiently well to use our cache
  • Wrong draw order (“I pity the fool who draws his skybox first”)
  • Not appreciating the TEX:ALU balance
  • Failing to close off data flow in the PS
    • By which I mean masked writes that allow instruction overlap

“The Small Batch Problem”

• Common complaints
  • “Why can’t I make more draw calls?”, or
  • “Why does a draw call take so long?”
• Uncommon complaints
  • Everything else

Direct3D 10

DX 10 – the new driver model

• What the driver model means for users
  • “The new OS is more stable than the old one”
• What the driver model means for ATI
  • More QA, more code, better reputation.
• What the driver model means for games programmers
  • You don’t need to do anything to take advantage after moving to DX10

What the runtime will do for you

• The DX 9 runtime on Vista
  • Using DX9 will not allow access to most of the new optimisations…
  • So it’s worth re-casting your code from DX9 to DX10 to get those benefits
• The DX 10 runtime
  • User mode
  • Does much less validation
  • Thinner and hence faster
  • Redesigned to allow fewer transactions to achieve the same amount of work

What we had to change about our hardware

• Where state checking happened
  • Mostly this was in the driver
  • The runtime used to do buffering and filtering too…
• What state checking happened
  • Scary amounts of cross validation between state changes
• Why state checking happened
  • Invalid state combinations are a “not good” kind of a thing
  • (Some unrelated state lay in the same control words…)
• Now our hardware does most of the validation…

Blocks of work in parallel…

• Let’s look at this in DX10 (in contrast with DX9…)
• Handling units of work in parallel
  • “some number” in our future hardware
• What do I mean by handling 16 pixels in parallel in an X1800?
• Consider “If (a) then b else c”
  • If ‘a’ is always true, or always false then our life is easy, so we do the job the quick way
  • If ‘a’ is sometimes true then our life is hard
• This has implications for VS, GS and PS in USA SM4 hardware
• Let’s pretend that R600 has work blocks of 17 shaders and look at the consequences…

• [XXX - Ramble on here for a while because they really need to get this

What you have to change about your games

• How to take advantage of the new hardware
  • Plug it in to your PC, it’s the fastest DX9 accelerator you’ll ever have seen
  • Switch over to DX10
• How to take advantage of the new runtime
  • Switch to DX10 and do your shader compilation up front
• How to take advantage of the new architecture
  • Use stream out to save subsequent passes
  • Use the GS to create or kill polys where you couldn’t do that before
    • [But be careful about variability…]

The idea of balance in DX10

• Or why Homer Simpson was right to keep shouting “U.S.A.”
• Now you have to think in terms of the total load of the combined active shaders
  • ALU = sum of [VS+GS+PS] math ops / parallelizability
  • TEX = sum of [VS+GS+PS] texture cost
  • DFC = sum of [VS+GS+PS] branch ops
  • Clock ticks = max (ALU, TEX, DFC)
• …WITH CONSISTENT DFC…
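
The formula on this slide translates directly into a few lines of code. The sketch below simply restates it; the per-stage numbers, the dictionary keys, and the `parallelizability` value of 4 are invented for illustration and do not describe any real shader or chip.

```python
def estimate_clock_ticks(stages, parallelizability):
    """Apply the slide's formula: sum each kind of load over VS + GS + PS,
    then take the largest of the three totals as the limiting cost."""
    alu = sum(s["math_ops"] for s in stages.values()) / parallelizability
    tex = sum(s["texture_cost"] for s in stages.values())
    dfc = sum(s["branch_ops"] for s in stages.values())
    loads = {"ALU": alu, "TEX": tex, "DFC": dfc}
    limiter = max(loads, key=loads.get)
    return loads[limiter], limiter

# Hypothetical workload for the three active shader stages.
workload = {
    "VS": {"math_ops": 20, "texture_cost": 0, "branch_ops": 1},
    "GS": {"math_ops": 4, "texture_cost": 0, "branch_ops": 0},
    "PS": {"math_ops": 40, "texture_cost": 24, "branch_ops": 2},
}
ticks, limiter = estimate_clock_ticks(workload, parallelizability=4)
print(ticks, limiter)
```

With these made-up numbers the texture total is the limiter, so extra math in any of the three stages would be free until the ALU total catches up, and that holds whichever stage the work comes from.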

Where most people will go wrong

• Treating the pipeline usage as ‘intuitive’
  • Or “Why Richard has been talking about performance for 10 years”
• Over using the new features
  • Be careful to make sure you don’t prefer convenience over efficiency
• Leaving things turned on
  • Don’t write to channels that aren’t used
• How fast is a chip?
  • Now you should think in terms of what total work needs to be done
  • You can’t predict the execution time of an individual shader, only the workload it carries
• How fast is video memory?
  • We have colossal bandwidth to local vid mem, but it’s never enough…
  • So as far as possible we rely on cleverness, not wide pipes
  • For example, we prefer to perform the Z test before the pixel shader starts execution
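
Performing the Z test before the pixel shader is also why draw order matters (the "skybox first" remark in the DX9 section). The sketch below is a toy model under stated assumptions: a one-dimensional "screen" of 100 pixels, one depth value per draw, smaller depth meaning nearer, and a pixel shader run counted only when the Z test passes.

```python
def shaded_pixels(draws, pixel_count):
    """Count pixel shader runs when the Z test happens before shading.
    `draws` is a list of (depth, covered_pixels) in submission order."""
    depth_buffer = [float("inf")] * pixel_count
    shaded = 0
    for depth, covered in draws:
        for pixel in covered:
            if depth < depth_buffer[pixel]:   # early Z: reject hidden pixels
                depth_buffer[pixel] = depth
                shaded += 1
    return shaded

# Hypothetical frame: a far skybox over the whole screen, nearer scene over 60%.
skybox = (1.0, range(100))
scene = (0.5, range(60))
print(shaded_pixels([skybox, scene], 100))  # skybox first
print(shaded_pixels([scene, skybox], 100))  # skybox last
```

Drawing the skybox last lets the Z test throw away every skybox pixel the scene already covers, so in this model only the visible pixels are ever shaded.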

Today’s top tips…

• They say people take away just 3 things from any presentation
  • So we used to say “Batch, batch, batch”
• Your top 3 tips for DirectX 10
  • What we’re going to say from now on is…:

1. Show us your shaders
2. Don’t think that the DirectX 9 way is “the old fashioned way”, those algorithms are often very refined now
3. Beware of treating the SDK samples as predictors of performance
   Render to cube map can be fun…
4. Measuring performance on a USA
   The legacy formula - A shader takes how many clocks…?
   The new formula
   An individual shader has a minimum execution time but no maximum, do what?
   How to work out how long the work takes, phew!

A word about the demos and…
What were all those “xxx” bits?

Richard “7 of 5” Huddy
RHuddy@ati.com