august 14-15 2006 xbox 360 cpu performance update (gamefest 2006 edition) bruce dawson software...

August 14-15 2006

Xbox 360 CPU Performance Update (Gamefest 2006 edition)
Bruce Dawson
Software Design Engineer
Game Technology Group
Microsoft

Talk Purpose

Give you the tools to make CPU code faster!
  Better understanding of CPU
  Compiler limitations
  Compiler tricks
  Update on tools
  Assembly language
New material only

Important Things Not Covered

Xfest February 2006
  Effective Profiler Use on Xbox 360
  Efficient C++ on Xbox 360
  Trace Analysis and Memory Optimization
Gamefest June 2005
  CPU Performance Bottlenecks and Solutions
Xfest February 2005
  Xenon CPU Pipelines
  Intro to PPC and VMX 128
Xbox 360 XDK
  CPU Pipeline Animator

https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDAQJUP

https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDACJUP

https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDAMKUP

https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/xds_Gamefest_05Jun_Training.htm#2.1.

https://xds.xbox.com/BPProgInfo2.asp?page=scripts/training.asp&catID=5#Xenon%20CPU%20Pipelines

https://xds.xbox.com/BPProgInfo2.asp?page=scripts/training.asp&catID=5#Introduction%20to%20PowerPC%20and%20the%20Xenon%20CPU

System Block Diagram

[Block diagram. Labels shown: CPU: 3.2GHz (Core0, Core1, Core2, one L1 per core, 1MB L2); GPU: 500MHz (3D Core, Memory Controller, 10 MB EDRAM); Memory (512 MB RAM); Southbridge; DVD; HDD port; A/V output; Ethernet; MU ports; Controller ports.]

CPU Block Diagram

[Block diagram. Labels shown for each of Core 0, Core 1, and Core 2: 32-KB 4-way d-cache, 32-KB 2-way i-cache, NCU with Store Queue and Store Gathering, 8x64 byte Store-gathering, 24-entry Store Queue, 8-entry Load Queue. Shared labels: L2 Cache (1-MB 8-way, Snoop Logic), 8 RC Machines, Bus Interface unit.]

3 cores, 6 threads, 1-MB L2 cache, on one chip
In-order instruction execution
Instruction latencies from 2 to 14 cycles
Pipeline flushes due to load-hit-stores, mispredicted branches, float compares, etc.
Significant memory latency: ~610 cycles

[Pipeline diagram. Labels shown: Branch Predictor (two copies); Instruction Fetch and Buffering, Thread 1; Instruction Decode and Dependency Checking; Integer Pipeline; Branch Pipeline; Address Generation and Int Load/Store Pipeline; CTR out; CR out; Vector/Scalar Unit Instruction Queue and Dependency Checking; Dot Product Pipeline; Vector Simple Pipeline; Vector Float Pipeline; Vector Permute Pipeline; Scalar Float Pipeline; Vector/Scalar Load Pipeline; Vector/Scalar Store Pipeline.]

Aligned pairs dispatch, each pair from one thread
Two stall points: IQ end (for int, load, and branch)... and VQ end for float and VMX
Execution pipelines accept one instruction per clock
Pipeline latency is normally the pipeline length
  Exceptions for load/store to integer, integer to load/store, and float/VMX compares
Other issues include microcoded instructions and flushes
Not physically accurate: see the XDK for precise layout and timings

CD Bonus Extra Track

Least understood Trace warning:
    <4-byte write-combined write at inst 82312034, estimated cost from 532 occurrences is 10640
Writing to uncacheable write-combined memory is only efficient if rules are followed:
  All writes should be four bytes or greater, naturally aligned, and in-order, with no gaps or repeats
  Otherwise each write may be a separate front-side-bus (FSB) transaction!
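The rules above can be turned into a small checker, for example to validate a logged sequence of stores in a unit test. This is a minimal sketch, not part of the XDK: the `WcWrite` record and the `FollowsWriteCombineRules` function are hypothetical names, and it assumes store sizes are powers of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one store to write-combined memory.
struct WcWrite {
    std::uintptr_t address;  // destination address of the store
    std::size_t size;        // store size in bytes (assumed power of two)
};

// Returns true if the sequence follows the rules from the slide:
// every write is at least four bytes, naturally aligned, and starts
// exactly where the previous write ended (in order, no gaps, no repeats).
bool FollowsWriteCombineRules(const std::vector<WcWrite>& writes) {
    for (std::size_t i = 0; i < writes.size(); ++i) {
        const WcWrite& w = writes[i];
        if (w.size < 4) return false;               // too small
        if (w.address % w.size != 0) return false;  // not naturally aligned
        if (i > 0) {
            const WcWrite& prev = writes[i - 1];
            // A gap, a repeat, or an out-of-order write all fail this test.
            if (w.address != prev.address + prev.size) return false;
        }
    }
    return true;
}
```

Three consecutive 4-byte writes pass; 2-byte writes, or 4-byte writes issued in the order 0, 2, 1, fail.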

CD Bonus Code

Slow code (detected by trace analysis)

    short* pWriteCombined = ...
    pWriteCombined[0] = data0;
    pWriteCombined[1] = data1;
    pWriteCombined[2] = data2;

Combining pairs of 16-bit values into 32-bit values before writing is much faster
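One way to combine a pair of 16-bit values is sketched below. `PackPairBigEndian` is a hypothetical helper, and the shift direction assumes a big-endian CPU such as the Xbox 360's PowerPC cores, where the high half of a 32-bit store lands at the lower address.

```cpp
#include <cstdint>

// Packs two 16-bit values into one 32-bit value so that 'first' ends up
// at the lower address when the result is stored on a big-endian CPU.
inline std::uint32_t PackPairBigEndian(std::uint16_t first,
                                       std::uint16_t second) {
    return (static_cast<std::uint32_t>(first) << 16) | second;
}
```

The packed value is then written through a 32-bit pointer in a single store. An odd trailing 16-bit value still needs its own handling, for example padding it out to a full 32-bit write if the destination layout allows.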

CD Bonus Code

Slow code (not detected by trace analysis, yet)

    DWORD* pWriteCombined = ...
    pWriteCombined[0] = data0;
    pWriteCombined[2] = data2;
    pWriteCombined[1] = data1;

Writing in order is essential

CD Bonus Code

Fast code

    extern "C" void _ReadWriteBarrier();
    #pragma intrinsic(_ReadWriteBarrier)

    // _ReadWriteBarrier() keeps the compiler from reordering the stores,
    // so they are issued in ascending address order.
    DWORD* pWriteCombined = ...
    pWriteCombined[0] = data0;
    _ReadWriteBarrier();
    pWriteCombined[1] = data1;
    _ReadWriteBarrier();
    pWriteCombined[2] = data2;

Finding CPU Problems/Hotspots

PIX system monitor
PIX timing capture
XbPerfView (/callcap or /fastcap)
Sampling profiler!?
Trace analysis
Custom timing code

Trace Analysis Update

Trace Recording records every instruction executed and every address referenced
Trace Analysis has a UI!
  Multiple reports faster (one analysis pass for all reports)
  Source view with integrated top-issues and per-line disassembly
  Coming soon: links from top-issues and memory access map to source view, etc.

Timing: mftb

mftb is the fastest and most precise way to measure CPU execution time
  Used by XbPerfView and QueryPerformanceCounter
  Increments every 64 CPU cycles
  ~44 cycle cost to read
  Accessible in one instruction (__mftb intrinsic)
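Because the time base advances once per 64 CPU cycles, converting a tick count to cycles is a single multiply. The sketch below uses hypothetical names (`kCyclesPerTick`, `TicksToCycles`); only the factor of 64 comes from the slide.

```cpp
#include <cstdint>

// One time-base tick corresponds to 64 CPU cycles (from the slide).
constexpr std::uint64_t kCyclesPerTick = 64;

// Converts a time-base tick count to CPU cycles.
constexpr std::uint64_t TicksToCycles(std::uint64_t ticks) {
    return ticks * kCyclesPerTick;
}
```

Since reading the counter costs about 44 cycles, which is less than one tick, intervals of only a few ticks are dominated by measurement granularity and are better timed over many iterations.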

Timing: mftb Frequency

mftb or QueryPerformanceCounter are tempting for game timing also, but...
mftb does not run at exactly 50.0 MHz
  Actually runs at about 49.875 MHz
  Varies between machines from about 49.85 to 49.90 MHz
  Always exactly 64 CPU cycles per tick
GetTickCount is more accurate over the long term
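To see why the frequency matters, the sketch below converts ticks to seconds for a given tick rate. `TicksToSeconds` is a hypothetical helper; the 49.875 MHz and 50.0 MHz figures are the ones quoted above.

```cpp
#include <cstdint>

// Converts time-base ticks to seconds for a given tick frequency in Hz.
inline double TicksToSeconds(std::uint64_t ticks, double ticksPerSecond) {
    return static_cast<double>(ticks) / ticksPerSecond;
}
```

One real hour at 49.875 MHz is 179,550,000,000 ticks. Dividing that by an assumed 50.0 MHz gives 3591 seconds, so game time derived this way would fall about 9 seconds behind per hour.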

Timing: mftb Errors

mftb is occasionally slightly off*
  Every 256 billion CPU cycles (85 seconds) value is wrong for 4 CPU cycles
Solutions:
  Ignore the problem
  Detect and fix the problem:
        int64 time = __mftb();
        if( 0 == (DWORD)time )
            time = __mftb();
  Use QueryPerformanceCounter
  Use __mftb32 (max time 85 seconds)

* Slightly off means about 4 billion (exactly 2^32) too small

    // Timing with __mftb32()
    DWORD start = __mftb32();
    DoStuff();
    DWORD elapsed = __mftb32() - start;
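Both fixes can be exercised off-target with a stand-in for the counter. The sketch below is portable C++, not XDK code: `ReadTimeBaseChecked` and `ElapsedTicks32` are hypothetical names, and the reader is passed in as a callable so a fake counter can replace `__mftb` in tests.

```cpp
#include <cstdint>

// Re-reads the 64-bit time base when the low 32 bits are zero, which is
// the moment the slide says the value can be 2^32 too small.
template <typename ReadFn>
std::uint64_t ReadTimeBaseChecked(ReadFn readTimeBase) {
    std::uint64_t time = readTimeBase();
    if (static_cast<std::uint32_t>(time) == 0) {
        time = readTimeBase();  // the second read is past the bad window
    }
    return time;
}

// Elapsed ticks between two 32-bit samples. Unsigned subtraction wraps
// modulo 2^32, so the result stays correct across one counter wrap.
inline std::uint32_t ElapsedTicks32(std::uint32_t start, std::uint32_t end) {
    return end - start;
}
```

The 32-bit form is only valid for intervals shorter than one wrap of the counter, which is the 85-second limit noted above.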

Timing: __mftb Alignment

When Xbox 360 launched, mftb on separate cores was not aligned
  One core could be 10-20 ticks (640-1280 cycles) ahead
With the spring 2006 update, mftb should be synchronized between cores with cycle accuracy

Detecting Sub-optimal Code

Key is to find code that could run better
Trace recording top-issues analysis
  Points out many common problems
  Makes everyone an expert
Assembly code inspection/search
  Look for sign/zero extend instructions
  Look for code that expands to "too many" instructions
  Look for excessive frsp, fmr (and mr, vor)
  Look for bad scheduling
  Look for references to r1, the stack pointer

Compiler generated temporaries:
  Usually unwanted
  Often avoidable
  Often lead to load-hit-stores, or just wasted instructions
  Go on the stack, use r1

Sign/zero extension occurs when using short and byte local variables

Assembly Inspection Options

Set a Visual Studio breakpoint, go to disassembly mode (toggle with Ctrl+F11)
Record a trace, go to the Source tab, expand source lines
COD files: C/C++, Output Files, Assembler Output, set to Assembly, Machine Code and Source (/FAcs)
  Don't need to link or run game

Poor Code in COD File

    ; Begin code for function: ?IntegralFloat@@YAMM@Z
    ; 5 : // Truncate a float to an integral value,
    ; 6 : // still as a float, using casts.
    ; 7 : return (float)(int)input;

    00000 3961fff0 addi r11,r1,-16
    00004 fc00081e fctiwz fr0,fr1
    00008 7c005fae stfiwx fr0,r0,r11
    0000c e961fff2 lwa r11,-10h(r1)
    00010 f961fff0 std r11,-10h(r1)
    00014 c801fff0 lfd fr0,-10h(r1)
    00018 fc00069c fcfid fr0,fr0
    0001c fc200018 frsp fr1,fr0
    00020 4e800020 blr

Note all the references to r1: avoid if possible

Faster code in COD File

    ; Begin code for function: ?IntegralFloatFast@@YAMM@Z
    ; 12 : // Truncate a float to an integral value,
    ; 13 : // still as a float, using intrinsics.
    ; 14 : return (float)__fcfid(__fctidz(input));

    00000 fc000e5e fctidz fr0,fr1
    00004 fc00069c fcfid fr0,fr0
    00008 fc200018 frsp fr1,fr0
    0000c 4e800020 blr

See ppcintrinsics.h for explanations of __fctidz and __fcfid, or see Optimization Case Studies from the summer 2005 Gamefest

Fewer instructions, no stack traffic
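For checking results away from the console, a portable stand-in with the same truncate-toward-zero behavior can be written with a 64-bit integer cast. This is only a sketch of the semantics: `IntegralFloatPortable` is a hypothetical name, and whether any given compiler turns it into an fctidz/fcfid pair is an assumption that should be confirmed in the COD file.

```cpp
// Truncates a float to an integral value, still as a float, by converting
// through a 64-bit integer (the same width fctidz converts to).
// Valid only for inputs whose truncated value fits in a long long.
inline float IntegralFloatPortable(float input) {
    return static_cast<float>(static_cast<long long>(input));
}
```

Going through 64 bits also handles values such as 3.0e9f that do not fit in a 32-bit int, which the (float)(int) cast version cannot represent.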

Poor Code in PIX Source Tab

Expanding Code

extsh instructions are doing no useful work

Faster Code in PIX Source Tab

Using int instead of short saves time and code space
Save char/short for arrays and large structs
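A minimal sketch of that split, with a hypothetical `SumSamples` function: the array keeps its compact 16-bit element type, while the accumulator and loop counter are full-width locals so no narrow local needs re-extending after each update.

```cpp
#include <cstddef>
#include <cstdint>

// Storage stays 16-bit to keep the array small; locals are int and size_t.
int SumSamples(const std::int16_t* samples, std::size_t count) {
    int total = 0;  // int local, not short
    for (std::size_t i = 0; i < count; ++i) {
        total += samples[i];  // each element is widened once, on load
    }
    return total;
}
```

An int accumulator also avoids overflowing the 16-bit range when the sum is larger than any single element.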

Compiler Quirks to Beware Of

__restrict and inline interacting badly
Compiler may do poor scheduling when unrolling read/modify/write of one array
bool: the forgotten 8-bit type
Functions returning bool

Restrict and Inline

    void TestCalcs(__vector4* __restrict input1,
                   __vector4* __restrict input2,
                   __vector4* __restrict result,
                   const int count) {
        for(int j=0; j<count; j+=4) {
            result[j+0] = input1[j+0] + input2[j+0];
            result[j+1] = input1[j+1] + input2[j+1];
            result[j+2] = input1[j+2] + input2[j+2];
            result[j+3] = input1[j+3] + input2[j+3];
        }
    }

This Code Depends on Context

TestCalcs is inlined
Parent function pointers are not marked __restrict
  Inlining merges parameters
  Merging is conservative so __restrict is lost

The Parent Trap

    void Parent(__vector4* input1, __vector4* input2,
                __vector4* result, const int count)
    {
        TestCalcs(input1, input2, result, count);
    }

Avoiding The Parent Trap

    void Parent(__vector4* input1A, __vector4* input2A,
                __vector4* resultA, const int count)
    {
        __vector4* __restrict input1 = input1A;
        __vector4* __restrict input2 = input2A;
        __vector4* __restrict result = resultA;
        TestCalcs(input1, input2, result, count);
    }

Solutions to __restrict/inline

Be wary of __forceinline
Better results (in this case) with __declspec(noinline)
Even better results by marking the variables as __restrict in the parent
  Be wary of increased inlining making the problem return

Array Update Good Scheduling

    void IncrementFast(float* __restrict data,
                       float* __restrict input,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] = input[i] + addend;
            data[i+1] = input[i+1] + addend;
        }
    }

Pointers are marked __restrict, so compiler can do great scheduling

    lfs fr0,0(r9)
    lfsx fr13,r8,r11
    fadds fr0,fr0,fr1
    fadds fr13,fr13,fr1
    stfs fr0,-4(r11)
    stfs fr13,0(r11)

Array Update Bad Scheduling

    void IncrementData(float* __restrict data,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] += addend;
            data[i+1] += addend;
        }
    }

Updates to data[] can't overlap, so compiler should do great scheduling...

    lfs fr0,0(r11)
    fadds fr0,fr0,fr1
    stfs fr0,0(r11)
    lfs fr0,0(r9)
    fadds fr0,fr1,fr0
    stfs fr0,0(r9)
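One rewrite worth trying is to load both elements into locals before storing either, which spells out the load/load/add/add/store/store order in the source. This is a sketch that is not from the slides: `IncrementDataLocals` is a hypothetical name, and whether it actually improves the schedule on the Xbox 360 compiler is an assumption to verify in the COD file.

```cpp
// Same update as IncrementData, with both loads written before both stores.
void IncrementDataLocals(float* data, int count, float addend)
{
    // Assume count is a multiple of 2
    for (int i = 0; i < count; i += 2)
    {
        const float a = data[i];      // both loads first...
        const float b = data[i + 1];
        data[i] = a + addend;         // ...then both stores
        data[i + 1] = b + addend;
    }
}
```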

bool: the Forgotten 8-Bit Type

Our compiler has difficulty with bool
It likes to add extra instructions and do other unfortunate things
Someday this may change, but for now...
Functions that return bool can be the worst

Sorry

bool Interacting with STL

This is the canonical form for STL iteration

    typedef std::vector<int> IntVec;
    IntVec::iterator end = testVector.end();
    for (IntVec::iterator p = testVector.begin();
         p != end;
         ++p) {
        result += *p;
    }

Loop test calls bool operator!=(iter, iter), and generates inefficient code

Ideal Code for STL Loop

    $LL20@IteratorIt
        lwz r9,0(r11)
        addi r11,r11,4
        add r3,r9,r3
        cmplw cr6,r11,r10
        bne cr6,$LL20@IteratorIt

Callout on the compare and branch at the end of the loop: p != end

Actual Code for STL Loop

    $LL20@IteratorIt
        lwz r10,0(r11)
        addi r11,r11,4
        add r3,r10,r3
        subf r10,r11,r9
        cntlzw r10,r10
        rlwinm r10,r10,27,31,31
        cntlzw r10,r10
        rlwinm r10,r10,27,31,31
        cmplwi cr6,r10,0
        bne cr6,$LL20@IteratorIt

Callout on the instructions from subf through bne: p != end
Code is poor even when inlined!

bool Function Solutions

Avoid functions that return bool, even inline functions
Prefer doing comparisons in main function
  But this can be stymied when using STL iterators with overloaded operator!=
Since iterator comparisons use bool functions, consider using array style indexing for vectors
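Array style indexing for the summing loop shown earlier might look like the sketch below. `SumByIndex` is a hypothetical name; the loop test compares two integers directly, so no bool-returning operator!= is involved.

```cpp
#include <cstddef>
#include <vector>

// Sums a vector using an index instead of iterators.
int SumByIndex(const std::vector<int>& values) {
    int result = 0;
    const std::size_t count = values.size();  // read the size once
    for (std::size_t i = 0; i < count; ++i) {
        result += values[i];
    }
    return result;
}
```

As with the other examples, the generated code is worth checking in a COD file rather than assuming it matches the ideal loop.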

Improving Code Generation

Use __restrict (have we mentioned that before?)
  But only use it when it is true
Inlining is good
  Avoid having tiny leaf functions, even just error handlers, that can't be inlined
Avoid functions that return bool/BOOL/boolean
LTCG, PGO
Know what code is generated

Compiler Improvements

Updated compiler coming Real Soon Now
  Improved VMX code-generation (less spilling to stack)
  __declspec( passinreg )
  Bug fixes and miscellaneous other improvements
  See Sublime C++ for Games for more information
What about when the compiler isn't good enough?

Simple Assembly Language

    int __declspec( naked ) SimpleAddAssem( int x,
                                            int y ) {
        __asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4

            // The return value is in r3
            blr // Don't forget the explicit 'blr'
        }
    }

Expands to this code, always:

    add r3,r3,r4
    blr

The Perils of Not Being Naked

    int SimpleAddAssem( int x,
                        int y ) {
        __asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4

            // The return value is in r3
            // Don't put an explicit 'blr'
        }
    }

Expands to this code in release:

    stw r3,x$(r1)
    stw r4,y$(r1)
    add r3,r3,r4
    blr

Expands to this code in /callcap:

    mflr r12
    stw r12,-8(r1)
    stwu r1,-60h(r1)
    stw r3,x$(r1)
    stw r4,y$(r1)
    mr r13,r13
    add r3,r3,r4
    mr r14,r14
    addi r1,r1,96
    lwz r12,-8(r1)
    mtlr r12
    blr

Special 'mr' instructions change to /callcap functions when profiling, and trash your registers!

Assembly Language Guidelines

Avoid when possible
  The compiler can schedule code very well
  Intrinsics give you access to special instructions
  The compiler can call functions faster and easier!
  High-level code can be rearranged and updated much faster
Assembly may make sense for small, critical routines
  Know the pipelines
  Use __declspec( naked )
  Compare to C/C++ performance

Summary

The compiler is your friend: learn to work with it
The compiler sometimes generates bad code, so watch for it and work around it
Know your tools
Use assembly language when necessary

References

Unsigned Developers
  http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
Learning the Xbox 360 CPU
  https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_overview.doc
  https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_pipelines.doc
Xbox 360 Optimization Guides
  https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDA4JUP
  https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/xds_Gamefest_05Jun_Training.htm#2.4.

© 2006 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

DirectX Developer Center
http://msdn.microsoft.com/directx

Game Development MSDN Forums
http://forums.microsoft.com/msdn

Xbox 360 Central
http://xds.xbox.com/

XNA Web site
http://www.microsoft.com/xna
