Xbox 360 CPU Performance Update (Gamefest 2006 edition)
Bruce Dawson
Software Design Engineer
Game Technology Group
Microsoft
August 14-15 2006

Talk Purpose
Give you the tools to make CPU code faster!
  Better understanding of CPU
  Compiler limitations
  Compiler tricks
  Update on tools
  Assembly language
New material only

Important Things Not Covered
Xfest February 2006
  Effective Profiler Use on Xbox 360
  Efficient C++ on Xbox 360
  Trace Analysis and Memory Optimization
Gamefest June 2005
  CPU Performance Bottlenecks and Solutions
Xfest February 2005
  Xenon CPU Pipelines
  Intro to PPC and VMX 128
Xbox 360 XDK
  CPU Pipeline Animator

System Block Diagram
[Figure: system block diagram. Labels: CPU at 3.2 GHz (Core0, Core1, Core2, one L1 per core, 1 MB L2); GPU at 500 MHz (3D Core, Memory Controller, 10 MB EDRAM); 512 MB RAM; Southbridge; DVD; HDD port; A/V output; Ethernet; MU ports; Controller ports.]

CPU Block Diagram
[Figure: CPU block diagram. Labels: Core 0, Core 1, Core 2 (each with a 32-KB 4-way d-cache and a 32-KB 2-way i-cache); three 8-entry Load Queues; three 24-entry Store Queues; three 8x64 byte Store-gathering buffers; NCU 0, NCU 1, NCU 2 (Store Queue, Store Gathering); L2 Cache, 1-MB 8-way, Snoop Logic; 8 RC Machines; Bus Interface unit.]
3 cores, 6 threads, 1-MB L2 cache, on one chip
In-order instruction execution
Instruction latencies from 2 to 14 cycles
Pipeline flushes due to load-hit-stores, mispredicted branches, float compares, etc.
Significant memory latency: ~610 cycles

[Figure: CPU pipeline diagram. Labels: Branch Predictor; Instruction Fetch and Buffering, Thread 0; Instruction Fetch and Buffering, Thread 1; Instruction Decode and Dependency Checking; Vector/Scalar Unit Instruction Queue and Dependency Checking; Integer Pipeline; Branch Pipeline; Address Generation and Int Load/Store Pipeline; CTR out; CR out; Dot Product Pipeline; Vector Simple Pipeline; Vector Float Pipeline; Vector/Scalar Store Pipeline; Vector Permute Pipeline; Scalar Float Pipeline; Vector/Scalar Load Pipeline.]
Two stall points: IQ end (for int, load, and branch)... and VQ end for float and VMX
Other issues include microcoded instructions and flushes
Aligned pairs dispatch, each pair from one thread
Execution pipelines accept one instruction per clock
Not physically accurate; see the XDK for precise layout and timings
Pipeline latency is normally the pipeline length
  Exceptions for load/store to integer, integer to load/store, and float/VMX compares

CD Bonus Extra Track
Least understood Trace warning:
    <4-byte write-combined write at inst 82312034, estimated cost from 532 occurrences is 10640
Writing to uncacheable write-combined memory is only efficient if rules are followed:
  All writes should be four bytes or greater, naturally aligned, and in-order, with no gaps or repeats
  Otherwise each write may be a separate front-side-bus (FSB) transaction!
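
The rules above can be stated as a small checker. This is only a sketch for reasoning about a planned sequence of stores: the Write struct and FollowsWriteCombineRules are hypothetical names, not part of the XDK or of trace analysis.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one store: byte offset into the
// write-combined buffer and the size of the store in bytes.
struct Write {
    std::size_t offset;
    std::size_t size;
};

// Returns true if the sequence follows the rules from the slide:
// every write is at least 4 bytes, naturally aligned, and starts
// exactly where the previous write ended (in order, no gaps, no repeats).
inline bool FollowsWriteCombineRules(const std::vector<Write>& writes) {
    for (std::size_t i = 0; i < writes.size(); ++i) {
        const Write& w = writes[i];
        if (w.size < 4) return false;             // too small
        if (w.offset % w.size != 0) return false; // not naturally aligned
        if (i > 0 && w.offset != writes[i - 1].offset + writes[i - 1].size)
            return false;                         // out of order, gap, or repeat
    }
    return true;
}
```

Three consecutive 4-byte writes pass; 2-byte writes, or 4-byte writes issued as 0, 8, 4, do not.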

CD Bonus Code
Slow code (detected by trace analysis)
    short* pWriteCombined = ...
    pWriteCombined[0] = data0;
    pWriteCombined[1] = data1;
    pWriteCombined[2] = data2;
Combining pairs of 16-bit values into 32-bit values before writing is much faster
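
A minimal sketch of the pair-combining idea. PackPair and WritePacked are illustrative names, and putting the first value in the high half assumes a big-endian CPU such as the Xbox 360's; the dest pointer here is ordinary memory standing in for a write-combined buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Pack two 16-bit values into one 32-bit value. The first value goes in
// the high half, which matches the in-memory order of two consecutive
// shorts on a big-endian CPU (an assumption of this sketch).
inline std::uint32_t PackPair(std::uint16_t first, std::uint16_t second) {
    return (static_cast<std::uint32_t>(first) << 16) | second;
}

// Write pairCount * 2 16-bit values as pairCount 32-bit stores, in order.
// 'dest' stands in for a pointer to write-combined memory.
inline void WritePacked(volatile std::uint32_t* dest,
                        const std::uint16_t* src,
                        std::size_t pairCount) {
    for (std::size_t i = 0; i < pairCount; ++i) {
        dest[i] = PackPair(src[2 * i], src[2 * i + 1]);
    }
}
```

Each store is now four bytes wide and the stores are issued in ascending order, which is what the rules ask for.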

CD Bonus Code
Slow code (not detected by trace analysis, yet)
    DWORD* pWriteCombined = ...
    pWriteCombined[0] = data0;
    pWriteCombined[2] = data2;
    pWriteCombined[1] = data1;
Writing in order is essential

CD Bonus Code
Fast code
    extern "C" void _ReadWriteBarrier();
    #pragma intrinsic(_ReadWriteBarrier)

    DWORD* pWriteCombined = ...
    pWriteCombined[0] = data0;
    _ReadWriteBarrier();    // compiler barrier: keeps the stores in source order
    pWriteCombined[1] = data1;
    _ReadWriteBarrier();
    pWriteCombined[2] = data2;

Finding CPU Problems/Hotspots
PIX system monitor
PIX timing capture
XbPerfview (/callcap or /fastcap)
Sampling profiler!?
Trace analysis
Custom timing code

Trace Analysis Update
Trace Recording records every instruction executed and every address referenced
Trace Analysis has a UI!
  Multiple reports faster (one analysis pass for all reports)
  Source view with integrated top-issues and per-line disassembly
  Coming soon: links from top-issues and memory access map to source view, etc.

Timing: mftb
mftb is the fastest and most precise way to measure CPU execution time
  Used by XbPerfView and QueryPerformanceCounter
  Increments every 64 CPU cycles
  ~44 cycle cost to read
  Accessible in one instruction (__mftb intrinsic)

Timing: mftb Frequency
mftb or QueryPerformanceCounter are tempting for game timing also, but...
mftb does not run at exactly 50.0 MHz
  Actually runs at about 49.875 MHz
  Varies between machines from about 49.85 to 49.90 MHz
  Always exactly 64 CPU cycles per tick
GetTickCount is more long-term accurate
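
A worked example of why the frequency difference matters for game timing. The helper names are illustrative; the frequencies are the approximate figures from the slide, not measured values.

```cpp
// Convert a tick count to seconds for a given timebase frequency in Hz.
inline double TicksToSeconds(unsigned long long ticks, double ticksPerSecond) {
    return static_cast<double>(ticks) / ticksPerSecond;
}

// Fraction of each real second that goes missing when the code assumes
// nominalHz but the timebase actually runs at actualHz.
inline double DriftPerSecond(double nominalHz, double actualHz) {
    return 1.0 - actualHz / nominalHz;
}
```

With a nominal 50.0 MHz and an actual 49.875 MHz, DriftPerSecond gives 0.0025, so a clock built on the nominal figure loses about 2.5 ms every second, or roughly 9 seconds per hour.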

Timing: mftb Errors
mftb is occasionally slightly off*
  Every 256 billion CPU cycles (85 seconds) value is wrong for 4 CPU cycles
Solutions:
  Ignore the problem
  Detect and fix the problem:
    int64 time = __mftb();
    if( 0 == (DWORD)time )
        time = __mftb();
  Use QueryPerformanceCounter
  Use __mftb32 (max time 85 seconds)
* Slightly off means about 4 billion (exactly 2^32) too small
    // Timing with __mftb32()
    DWORD start = __mftb32();
    DoStuff();
    DWORD elapsed = __mftb32() - start;
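
The __mftb32 pattern relies on unsigned 32-bit subtraction wrapping modulo 2^32, so a single wrap of the counter between the two reads still gives the right interval. A portable sketch of that arithmetic, with a plain function standing in for the two counter reads (ElapsedTicks is an illustrative name):

```cpp
#include <cstdint>

// Difference of two 32-bit counter samples. Unsigned arithmetic wraps
// modulo 2^32, so the result is correct as long as fewer than 2^32 ticks
// elapsed between 'start' and 'end'.
inline std::uint32_t ElapsedTicks(std::uint32_t start, std::uint32_t end) {
    return end - start;
}
```

For example, a start of 0xFFFFFFF0 and an end of 0x00000010 yields 0x20 ticks even though the counter wrapped in between. Intervals longer than 2^32 ticks cannot be measured this way, which is the 85 second limit noted on the slide.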

Timing: __mftb Alignment
When Xbox 360 launched, mftb on separate cores was not aligned
  One core could be 10-20 ticks (640-1280 cycles) ahead
With the spring 2006 update, mftb should be synchronized between cores to cycle accuracy

Detecting Sub-optimal Code
Key is to find code that could run better
Trace recording top-issues analysis
  Points out many common problems
  Makes everyone an expert
Assembly code inspection/search
  Look for sign/zero extend instructions
  Look for code that expands to "too many" instructions
  Look for excessive frsp, fmr (and mr, vor)
  Look for bad scheduling
  Look for references to r1 (the stack pointer)
Compiler generated temporaries:
  Usually unwanted
  Often avoidable
  Often lead to load-hit-stores, or just wasted instructions
  Go on the stack, use r1
Sign/zero extension occurs when using short and byte local variables

Assembly Inspection Options
Set a Visual Studio breakpoint, go to disassembly mode (toggle with Ctrl+F11)
Record a trace, go to the Source tab, expand source lines
COD files: C/C++, Output Files, Assembler Output, set to Assembly, Machine Code and Source (/FAcs)
  Don't need to link or run game

Poor Code in COD File
    ; Begin code for function: ?IntegralFloat@@YAMM@Z
    ; 5 : // Truncate a float to an integral value,
    ; 6 : // still as a float, using casts.
    ; 7 : return (float)(int)input;

    00000 3961fff0 addi   r11,r1,-16
    00004 fc00081e fctiwz fr0,fr1
    00008 7c005fae stfiwx fr0,r0,r11
    0000c e961fff2 lwa    r11,-10h(r1)
    00010 f961fff0 std    r11,-10h(r1)
    00014 c801fff0 lfd    fr0,-10h(r1)
    00018 fc00069c fcfid  fr0,fr0
    0001c fc200018 frsp   fr1,fr0
    00020 4e800020 blr
Note all the references to r1; avoid if possible

Faster code in COD File
    ; Begin code for function: ?IntegralFloatFast@@YAMM@Z
    ; 12 : // Truncate a float to an integral value,
    ; 13 : // still as a float, using intrinsics.
    ; 14 : return (float)__fcfid(__fctidz(input));

    00000 fc000e5e fctidz fr0,fr1
    00004 fc00069c fcfid  fr0,fr0
    00008 fc200018 frsp   fr1,fr0
    0000c 4e800020 blr
See ppcintrinsics.h for explanations of __fctidz and __fcfid, or see Optimization Case Studies from the summer 2005 Gamefest
Fewer instructions, no stack traffic
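
For readers without the XDK, a portable sketch of what both functions compute. __fctidz and __fcfid are Xbox 360 intrinsics and are not used here; std::trunc is a stand-in chosen for this example, not something the talk recommends, and the function names are illustrative.

```cpp
#include <cmath>

// Truncate toward zero by round-tripping through int, as in the slow
// version. Only valid while the value fits in an int.
inline float IntegralFloatCast(float input) {
    return (float)(int)input;
}

// Truncate toward zero without leaving the floating-point domain.
// std::trunc is a portable stand-in for the intrinsic pair on the slide.
inline float IntegralFloatTrunc(float input) {
    return std::trunc(input);
}
```

Both return 3.0f for 3.7f and -3.0f for -3.7f; the difference the slides care about is the generated code, not the result.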

Poor Code in PIX Source Tab

Expanding Code
  extsh instructions are doing no useful work

Faster Code in PIX Source Tab
  Using int instead of short saves time and code space
  Save char/short for arrays and large structs
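
A sketch of the advice in source form: keep the stored data 16-bit, but use int for the loop counter and accumulator. The function names are illustrative, and the exact instructions emitted depend on the compiler.

```cpp
#include <cstdint>

// Preferred shape: 16-bit storage, full-width locals. The compiler has no
// reason to re-narrow 'i' or 'total' on each iteration.
inline int SumSamples(const std::int16_t* samples, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += samples[i];
    }
    return total;
}

// Shape to avoid: 16-bit locals. Each update has to be narrowed back to
// 16 bits, which is where sign-extension instructions such as extsh can
// appear in the generated code.
inline int SumSamplesNarrow(const std::int16_t* samples, std::int16_t count) {
    std::int16_t total = 0;
    for (std::int16_t i = 0; i < count; ++i) {
        total = static_cast<std::int16_t>(total + samples[i]);
    }
    return total;
}
```

For small inputs both return the same sum; the narrow version can also overflow its 16-bit accumulator on larger data.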

Compiler Quirks to Beware Of
__restrict and inline interacting badly
Compiler may do poor scheduling when unrolling read/modify/write of one array
bool: the forgotten 8-bit type
Functions returning bool

Restrict and Inline
    void TestCalcs(__vector4* __restrict input1,
                   __vector4* __restrict input2,
                   __vector4* __restrict result,
                   const int count) {
        for(int j=0; j<count; j+=4) {
            result[j+0] = input1[j+0] + input2[j+0];
            result[j+1] = input1[j+1] + input2[j+1];
            result[j+2] = input1[j+2] + input2[j+2];
            result[j+3] = input1[j+3] + input2[j+3];
        }
    }

This Code Depends on Context
TestCalcs is inlined
Parent function pointers are not marked __restrict
Inlining merges parameters
Merging is conservative so __restrict is lost

The Parent Trap
    void Parent(__vector4* input1, __vector4* input2,
                __vector4* result, const int count)
    {
        TestCalcs(input1, input2, result, count);
    }

Avoiding The Parent Trap
    void Parent(__vector4* input1A, __vector4* input2A,
                __vector4* resultA, const int count)
    {
        __vector4* __restrict input1 = input1A;
        __vector4* __restrict input2 = input2A;
        __vector4* __restrict result = resultA;
        TestCalcs(input1, input2, result, count);
    }

Solutions to __restrict/inline
Be wary of __forceinline
Better results (in this case) with __declspec(noinline)
Even better results by marking the variables as __restrict in the parent
  Be wary of increased inlining making the problem return
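
The same pattern in a form that builds outside the XDK, using float arrays in place of __vector4. This is a sketch: __restrict is a compiler extension (accepted by MSVC, GCC, and Clang) rather than standard C++, and AddArrays is an illustrative name.

```cpp
// Callee: parameters are marked __restrict, promising the compiler that
// the three arrays do not overlap. (__restrict is a compiler extension.)
inline void AddArrays(const float* __restrict input1,
                      const float* __restrict input2,
                      float* __restrict result,
                      int count) {
    for (int j = 0; j < count; ++j) {
        result[j] = input1[j] + input2[j];
    }
}

// Parent: re-declare the incoming pointers as __restrict locals so the
// no-overlap promise is still visible if AddArrays is inlined here.
inline void Parent(const float* input1A, const float* input2A,
                   float* resultA, int count) {
    const float* __restrict input1 = input1A;
    const float* __restrict input2 = input2A;
    float* __restrict result = resultA;
    AddArrays(input1, input2, result, count);
}
```

The promise has to be true: calling Parent with overlapping arrays is undefined behavior.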

Array Update Good Scheduling
    void IncrementFast(float* __restrict data,
                       float* __restrict input,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] = input[i] + addend;
            data[i+1] = input[i+1] + addend;
        }
    }
Pointers are marked __restrict, so compiler can do great scheduling
    lfs   fr0,0(r9)
    lfsx  fr13,r8,r11
    fadds fr0,fr0,fr1
    fadds fr13,fr13,fr1
    stfs  fr0,-4(r11)
    stfs  fr13,0(r11)

Array Update Bad Scheduling
    void IncrementData(float* __restrict data,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] += addend;
            data[i+1] += addend;
        }
    }
Updates to data[] can't overlap, so compiler should do great scheduling...
    lfs   fr0,0(r11)
    fadds fr0,fr0,fr1
    stfs  fr0,0(r11)
    lfs   fr0,0(r9)
    fadds fr0,fr1,fr0
    stfs  fr0,0(r9)
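
The slide does not give a workaround. One possibility, offered only as an assumption to check in the generated assembly, is to spell out the desired order in source: read both elements into locals, then add, then store. IncrementDataLocals is an illustrative name.

```cpp
// Read-modify-write of one array with the loads hoisted ahead of the
// stores in source. Whether the compiler keeps this order is not
// guaranteed; inspect the disassembly to confirm.
inline void IncrementDataLocals(float* data, int count, float addend) {
    // Assume count is a multiple of 2
    for (int i = 0; i < count; i += 2) {
        const float a = data[i];
        const float b = data[i + 1];
        data[i] = a + addend;
        data[i + 1] = b + addend;
    }
}
```

The result is identical to IncrementData; only the order in which the loads and stores are written differs.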

bool: the Forgotten 8-Bit Type
Our compiler has difficulty with bool
It likes to add extra instructions and do other unfortunate things
Someday this may change, but for now...
Functions that return bool can be the worst
Sorry

bool Interacting with STL
This is the canonical form for STL iteration
    typedef std::vector<int> IntVec;
    IntVec::iterator end = testVector.end();
    for (IntVec::iterator p = testVector.begin(); p != end; ++p) {
        result += *p;
    }
Loop test calls bool operator!=(iter, iter), and generates inefficient code

Ideal Code for STL Loop
    $LL20@IteratorIt
        lwz   r9,0(r11)
        addi  r11,r11,4
        add   r3,r9,r3
        cmplw cr6,r11,r10
        bne   cr6,$LL20@IteratorIt
p != end: the final cmplw and bne

Actual Code for STL Loop
    $LL20@IteratorIt
        lwz    r10,0(r11)
        addi   r11,r11,4
        add    r3,r10,r3
        subf   r10,r11,r9
        cntlzw r10,r10
        rlwinm r10,r10,27,31,31
        cntlzw r10,r10
        rlwinm r10,r10,27,31,31
        cmplwi cr6,r10,0
        bne    cr6,$LL20@IteratorIt
Code is poor even when inlined!
p != end: everything from subf through bne
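
To see what the extra instructions are doing, here is a C++ emulation of the subf/cntlzw/rlwinm sequence. The helper names are made up for this sketch; the instruction semantics follow the PowerPC definitions (cntlzw returns 32 for zero, and rlwinm with 27,31,31 extracts bit 5 of the source into the low bit).

```cpp
#include <cstdint>

// Count leading zeros of a 32-bit value; returns 32 for zero, like cntlzw.
inline std::uint32_t Cntlzw(std::uint32_t x) {
    std::uint32_t n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

// rlwinm rD,rS,27,31,31: rotate left by 27 and keep only the lowest bit,
// which is the same as extracting bit 5 (value 32) of the source.
inline std::uint32_t ExtractBit5(std::uint32_t x) {
    return (x >> 5) & 1u;
}

// The whole sequence: materialize (p != end) as a 0/1 value in a register.
inline std::uint32_t NotEqualAsBool(std::uint32_t p, std::uint32_t end) {
    std::uint32_t t = end - p;    // subf
    t = ExtractBit5(Cntlzw(t));   // 1 if p == end, else 0
    t = ExtractBit5(Cntlzw(t));   // logical not: 1 if p != end, else 0
    return t;
}
```

The compiler builds the bool result as a 0/1 value with four instructions and then compares that value against zero, where the ideal code compares the two pointers directly.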

bool Function Solutions
Avoid functions that return bool, even inline functions
Prefer doing comparisons in main function
  But this can be stymied when using STL iterators with overloaded operator!=
Since iterator comparisons use bool functions, consider using array style indexing for vectors
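
A sketch of the array-style alternative to the iterator loop. SumByIndex is an illustrative name; whether this produces the ideal loop on a given compiler should be confirmed in the disassembly.

```cpp
#include <cstddef>
#include <vector>

// Sum a vector with array-style indexing. The loop test is a plain
// integer comparison in this function, not a call to a bool-returning
// operator!=.
inline int SumByIndex(const std::vector<int>& v) {
    int result = 0;
    const std::size_t count = v.size();   // read the size once
    for (std::size_t i = 0; i < count; ++i) {
        result += v[i];
    }
    return result;
}
```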

Improving Code Generation
Use __restrict (have we mentioned that before?)
  But only use it when it is true
Inlining is good
  Avoid having tiny leaf functions (even just error handlers) that can't be inlined
Avoid functions that return bool/BOOL/boolean
LTCG, PGO
Know what code is generated

Compiler Improvements
Updated compiler coming Real Soon Now
  Improved VMX code-generation (less spilling to stack)
  __declspec( passinreg )
  Bug fixes and miscellaneous other improvements
  See Sublime C++ for Games for more information
What about when the compiler isn't good enough?

Simple Assembly Language
    int __declspec( naked ) SimpleAddAssem( int x,
                                            int y ) {
        asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4

            // The return value is in r3
            blr // Don’t forget the explicit ‘blr’
        }
    }
Expands to this code, always:
    add r3,r3,r4
    blr

The Perils of Not Being Naked
    int SimpleAddAssem( int x,
                        int y ) {
        asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4

            // The return value is in r3
            // Don’t put an explicit ‘blr’
        }
    }
Expands to this code in release:
    stw r3,x$(r1)
    stw r4,y$(r1)
    add r3,r3,r4
    blr
Expands to this code in /callcap:
    mflr r12
    stw  r12,-8(r1)
    stwu r1,-60h(r1)
    stw  r3,x$(r1)
    stw  r4,y$(r1)
    mr   r13,r13
    add  r3,r3,r4
    mr   r14,r14
    addi r1,r1,96
    lwz  r12,-8(r1)
    mtlr r12
    blr
Special ‘mr’ instructions change to /callcap functions when profiling, and trash your registers!

Assembly Language Guidelines
Avoid when possible
  The compiler can schedule code very well
  Intrinsics give you access to special instructions
  The compiler can call functions faster and easier!
  High-level code can be rearranged and updated much faster
Assembly may make sense for small, critical routines
  Know the pipelines
  Use __declspec( naked )
  Compare to C/C++ performance

Summary
The compiler is your friend: learn to work with it
The compiler sometimes generates bad code; watch for it and work around it
Know your tools
Use assembly language when necessary

References
Unsigned Developers
  http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
Learning the Xbox 360 CPU
  https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_overview.doc
  https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_pipelines.doc
Xbox 360 Optimization Guides
  https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDA4JUP
  https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/xds_Gamefest_05Jun_Training.htm#2.4.

© 2006 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

DirectX Developer Center
  http://msdn.microsoft.com/directx
Game Development MSDN Forums
  http://forums.microsoft.com/msdn
Xbox 360 Central
  http://xds.xbox.com/
XNA Web site
  http://www.microsoft.com/xna