© 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

http://www.xna.com

Out of Order

Making In-order Processors Play Nicely

Allan Murphy
XNA Developer Connection, Microsoft

Optimization Example

class BaseParticle
{
public:
    …
    virtual Vector& Position() { return mPosition; }
    virtual Vector& PrevPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    bool& Active() { return mActive; }
    float& Lifetime() { return mLifetime; }
    …

private:
    …
    float mIntensity;
    float mLifetime;
    bool mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};

Optimization Example

// Boring old vector class
class Vector
{
    …
public:
    float x,y,z,w;
};

// Boring old generic linked list class
template <class T> class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }

private:
    ListNode<T>* mNext;
    T* mContents;
};

Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PrevPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PrevPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PrevPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));

        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

Optimization Example

// Replacement for straight C vector work

// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PrevPosition().x;
prevPosition.y = node->Contents()->PrevPosition().y;
prevPosition.z = node->Contents()->PrevPosition().z;

// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position,prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity,velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

• Job done, right?

Thank you for listening

Optimization Example

• Hold on.
• If we time it…
  • It's actually slower than the straight C version
• And if we check the results…
  • It's also wrong!
  • Incorrect is a special case optimization
• Unfortunately, this does happen in practice

Important Caveat

• Today we’re talking about optimization
• But, the techniques discussed are orthogonal to…
  • …good algorithm choice
  • …good multithreading system implementation
• It’s like Mr Knuth said.
• They typically build code which is…
  • …very non-general
  • …very difficult to maintain or understand
  • …possibly completely platform specific

But My Code Is Really Quick On PC…?

• A common assumption:
  • It’s quick on PC
  • 360 & PS3 have 3.2GHz clock speed
  • Should be good on console! Right?
• Alas, 360 core and PS3 PPU have…
  • No instruction reordering hardware
  • No store forwarding hardware
  • Smaller caches and slower memory
  • No L3 cache

The 4 Horsemen of In-Order Apocalypse

• What goes wrong?
  • LHS
  • L2 miss
  • Expensive, non-pipelined instructions
  • Branch mispredict penalty

Load-Hit-Store (LHS)

• What is it?
  • Storing to a memory location…
  • …then loading from it very shortly after
• What causes LHS?
  • Casts, changing register set, aliasing
• Why is it a problem?
  • On PC, bullet usually dodged by…
    • Instruction re-ordering
    • Store forwarding hardware
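As a minimal sketch of the aliasing case above (function names are hypothetical, not from the talk): when a loop accumulates through a pointer, the compiler may be unable to prove that the pointer does not alias the input, so it can end up storing and reloading the running total each iteration. Accumulating in a local and storing once is one common way round it.

```cpp
#include <cassert>
#include <cstddef>

// LHS-prone shape: *total might alias values[], so the compiler may have to
// store to *total and load it straight back on every iteration.
void SumThroughPointer(const float* values, std::size_t count, float* total)
{
    assert(values != nullptr && total != nullptr);
    *total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        *total += values[i];
}

// Friendlier shape: accumulate in a local (a register), store once at the end.
void SumThroughLocal(const float* values, std::size_t count, float* total)
{
    assert(values != nullptr && total != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += values[i];
    *total = sum;
}
```

Both functions produce the same result; whether the first actually generates an LHS depends on the compiler and flags, so it is worth confirming in a profiler.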

L2 Miss

• What is it?
  • Loading from a location not already in cache
• Why is it a problem?
  • Costs ~610 cycles to load a cache line
  • You can do a lot of work in 610 cycles
• What can we do about it?
  • Hot cold split
  • Reduce in-memory data size
  • Use cache coherent structures
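A hot/cold split can be sketched as below. This is an illustration under assumptions: the struct names and the cold fields are hypothetical, and a real particle system would choose its own split based on profiling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hot data: read for every particle on every update, packed contiguously.
struct ParticleHot
{
    float x, y, z;
    float prevX, prevY, prevZ;
};

// Cold data: rarely touched, kept in a separate array so it does not
// occupy cache lines during the update loop. Fields are hypothetical.
struct ParticleCold
{
    char debugName[32];
    int  spawnFrame;
};

struct ParticleSystem
{
    std::vector<ParticleHot>  hot;   // index i in both arrays is particle i
    std::vector<ParticleCold> cold;
};

// Streams only through the hot array; the cold array is never loaded here.
float SumSquaredSpeeds(const ParticleSystem& ps)
{
    float total = 0.0f;
    for (const ParticleHot& p : ps.hot)
    {
        const float dx = p.x - p.prevX;
        const float dy = p.y - p.prevY;
        const float dz = p.z - p.prevZ;
        total += dx * dx + dy * dy + dz * dz;
    }
    return total;
}
```

The design choice is that each cache line fetched by the loop contains only data the loop uses, which also reduces the in-memory size of the working set.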

Expensive Instructions

• What is it?
  • Certain instructions not pipelined
  • No other instructions issued ‘til they complete
  • Stalls both hardware threads
  • High latency and low throughput
• What can we do about it?
  • Know when those instructions are generated
  • Avoid or code round those situations
  • But only in critical places

Branch Mispredicts

• What is it?
  • Mispredicting a branch causes…
    • …CPU to discard instructions it predicted it needed
    • …23-24 cycle delay as correct instructions fetched
• Why is this a problem?
  • Misprediction penalty can…
    • …dominate total time in tight loops
    • …waste time fetching unneeded instructions

Branch Mispredicts

• What can we do about it?
  • Know how compiler implements branches
    • for, do, while, if
    • Function pointers, switches, virtual calls
  • Reduce total branch counts for task
  • Use test and set style instructions
  • Refactor calculations to remove branches
  • Unroll
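To illustrate "refactor calculations to remove branches", here is a small portable sketch. `SelectGE` is a hypothetical helper that mirrors the semantics of a PowerPC `fsel`-style select (first value when the test is >= 0, otherwise the second); on the 360 one would reach for the real intrinsic, and on other compilers the ternary may or may not compile to a branch-free select.

```cpp
#include <cassert>

// Portable stand-in for an fsel-style select: returns a when test >= 0,
// otherwise b (including when test is NaN).
inline float SelectGE(float test, float a, float b)
{
    return (test >= 0.0f) ? a : b;
}

// Branchy clamp: an if that the CPU has to predict.
inline float ClampToZeroBranchy(float v)
{
    if (v < 0.0f)
        v = 0.0f;
    return v;
}

// Same calculation expressed as a select on the value itself.
inline float ClampToZeroSelect(float v)
{
    return SelectGE(v, v, 0.0f);
}
```

The two clamps agree for ordinary inputs; NaN handling differs, which is the kind of detail to cross-check when swapping one form for the other.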

Who Are Our Friends?

• Profiling, profiling, profiling
• 360 tools
  • PIX CPU instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture
• Other platforms
  • SN Tuner, VTune
• Thinking laterally

General Improvements

• inline
  • Make sure your function fits the profile
• Pass and return in register
  • __declspec(passinreg)
• __restrict
  • Compiler released from being ultra careful
• const
  • Doesn’t affect code gen
  • But does affect your brain

General Improvements

• Compiler options
  • Inline all possible
  • Prefer speed over size
  • Platform specifics
    • 360
      • /Ou - Removes integer div by zero trap
      • /Oc - Runs a second code scheduling pass
• Don’t write inline asm

General Improvements

• Reduce parameter count
  • Reduce function epilogue and prologue
  • Reduce stack access
  • Reduce LHS
• Prefer 32, 64 and 128 bit variables
• Isolate constants, or constant sets
  • Look to specialise, not generalise
• Avoid virtual if feasible
  • Unnecessary virtual means indirected branch

Know Your Cache Architecture

• Cache size
  • 360: 1 MB L2, 32 KB L1
• Cache line size
  • 360: 128 bytes; x86: typically 64 bytes
• Pre-fetch mechanism
  • 360: dcbt, dcbz128
• Cross-core sharing policy
  • 360: L2 shared, L1 per core

Know Pipeline & LHS Conditions

• LHS caused by:
  • Pointer aliasing
  • Register set swap / casting
• Be aware of non-pipelined instructions
  • fsqrt, fdiv, int mul, int div, sraw
• Be aware of pipeline flush issues
  • Especially fcmp

Knowing Your Instruction Set

• 360 specifics:
  • VMX
  • Slow instructions
  • Regularly useful instructions
    • fsel, vsel, vcmp*, vrlimi
• PS3
  • Altivec & world of SPE
• PC
  • SSE, SSE2, SSE3, SSE4, SSE4.1 and friends

What Went Wrong With The Example?

• Correctness
  • Always cross-compare during development
• Guessed at 1 performance issue
  • SIMD vs straight float
• Giving SIMD ‘some road’
  • Branch behaviour exactly the same
  • Adding SIMD adds an LHS
  • Memory access and L2 usage unchanged

Image Analysis

Image Analysis Example

• Classification via Gaussian Mixture Model
  • For each pixel in a 320x240 array…
    • Evaluate ‘cost’ via up to 20 Gaussian models
    • Returns lowest cost found for pixel
    • Submit cost to graph structure for min-cut
• Profiling shows:
  • 86% of time in pixel cost function
  • No surprises there
  • 1,536,000 Gaussian model applies

Image Analysis Example

float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r,g,b);

    x -= component.Mean();

    FloatVector fx((float)x[0],(float)x[1],(float)x[2]);
    return component.EofLog() + 0.5f * fx.Dot(component.CovInv().Multiply(fx));
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r,g,b,0);
    for (size_t k=1; k<nK; k++)
    {
        float cost = Cost(r,g,b,k);
        if (cost < bestCost)
            bestCost = cost;
    }
    return bestCost;
}

Image Analysis Example

• What things look suspect?
  • L2 miss on component load
  • Passing individual r,g,b elements
  • Building two separate vectors
  • Casting int to float
  • Vector maths
  • Branching may be an issue in BestCost()
    • Loop
    • Conditional inside loop
• Confirm with PIX on 360

Image Analysis Example

• Pass 1
  • Don’t even touch platform specifics
  • Pass a single int, not 3 unsigned chars
  • Mark up all consts
  • Build the sample value once in the caller
  • Add __forceinline
  • Check correctness
  • Doesn’t help a lot: gives about 1.1x


Image Analysis Example

• Pass 2
  • Turn Cost function innards to VMX
  • Return cost as __vector4 to avoid LHS
  • Remove if from loop in BestCost by…
    • Keeping bestCost as a __vector4
    • Using vcmpgefp to make a comparison mask
    • Using vsel to pick the lowest value
  • Speedup of 1.7x
  • Constructing the __vector4s on the fly is expensive
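The compare-mask-then-select idea can be sketched in portable C++ as follows. This only emulates the data flow on four lanes: `Vec4`, `Mask4`, `CompareGE`, `Select` and `MinLanes` are hypothetical stand-ins, not the VMX intrinsics, and the emulation itself uses ordinary scalar code that real SIMD hardware would replace with single instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

struct Vec4  { float v[4]; };          // hypothetical stand-in for a 4-float vector
struct Mask4 { std::uint32_t m[4]; };  // per lane: all ones or all zeros

// Per-lane a >= b compare producing a bit mask.
inline Mask4 CompareGE(const Vec4& a, const Vec4& b)
{
    Mask4 r;
    for (int i = 0; i < 4; ++i)
        r.m[i] = (a.v[i] >= b.v[i]) ? 0xFFFFFFFFu : 0u;
    return r;
}

// Bitwise select: bits come from b where the mask is set, otherwise from a.
inline Vec4 Select(const Vec4& a, const Vec4& b, const Mask4& mask)
{
    Vec4 r;
    for (int i = 0; i < 4; ++i)
    {
        std::uint32_t aBits, bBits;
        std::memcpy(&aBits, &a.v[i], sizeof aBits);
        std::memcpy(&bBits, &b.v[i], sizeof bBits);
        const std::uint32_t rBits = (aBits & ~mask.m[i]) | (bBits & mask.m[i]);
        std::memcpy(&r.v[i], &rBits, sizeof rBits);
    }
    return r;
}

// Lane-wise minimum with no data-dependent control flow in the caller:
// where best >= cost, take cost; otherwise keep best.
inline Vec4 MinLanes(const Vec4& best, const Vec4& cost)
{
    return Select(best, cost, CompareGE(best, cost));
}
```

Used in a loop over components, `best = MinLanes(best, cost)` replaces the `if (cost < bestCost)` test with straight-line code.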


Image Analysis Example

• Pass 3
  • Build the colour as a __vector4 in calling function
  • Build a static __vector4 containing {0.5f,0.5f,0.5f,0.5f}
  • Load once in calling function
  • Mark all __vector4 as __declspec(passinreg)
  • Build __vector4 version of Component
  • All calculations done as __vector4
  • More like it: speedup of 5.2x


Image Analysis Example

• Pass 4
  • Go all the way out to the per pixel calling code
  • Load __vector4 at a time from source array
  • Do 4 pixel costs at once
  • __vcmpgefp/__vsel works exactly the same
  • Return __vector4 with 4 costs
  • Write to results array as single __vector4
  • Gives a speedup of 19.54x

Image Analysis Example

__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colour) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);

    __vector4 bestCost = Cost(colour, half, 0);
    for (size_t k=1; k<nK; k++)
    {
        const __vector4 cost = Cost(colour, half, k);
        const __vector4 mask = __vcmpgefp(bestCost,cost);

        bestCost = __vsel(bestCost,cost,mask);
    }

    return bestCost;
}

Image Analysis Example

const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0,comp.GetVMean());
…
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);

x = __vspltw(colour0,0);
y = __vspltw(colour0,1);
z = __vspltw(colour0,2);

mulResult = __vmulfp(row0,x);
mulResult = __vmaddfp(row1,y,mulResult);
mulResult = __vmaddfp(row2,z,mulResult);

vdp2 = __vmsum3fp(mulResult,input);
vdp2 = __vmaddfp(vdp2,half,vEofLog); // half is __vector4 parameter
result = vdp2;

…

Image Analysis Example

• Hold on, this is image analysis.
• Shouldn’t it be on the GPU?
• Maybe, maybe not:
  • Per pixel we manipulate a dynamic tree structure
  • Excluding the tree structure…
    • CPU can run close to GPU speed
    • But syncing and memory throughput overhead not worth it

Movie Compression

Movie Compression Optimization

• Timing results
  • Freeware movie compressor on 360
  • 76.3% of instructions spent in InterError()
    • Calculating error between macroblocks
• Majority of time in 8x8 macro block functions
  • Picking up source and target intensity macro block
  • For each pixel, calculating abs difference
  • Summing differences along rows
  • Returning sum of diffs
  • Or early out when sum exceeds a threshold

Movie Compression Optimization

int ThresholdSum(unsigned char *ptr1, unsigned char *ptr2, int stride2, int stride1, int thres)
{
    int32 sad = 0;
    for (int i=8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);

        if (sad > thres)
            return sad;

        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}

Movie Compression Optimization

• Look at our worst enemies
• L2
  • 8x8 byte blocks, seems tight
• LHS
  • It's all integer, so we should be LHS free
• Expensive instructions?
  • No, just byte maths
• Branching
  • Should get prediction right 7 out of 8 times

Movie Compression Optimization

• Maths
  • Element by element abs and average ops on bytes
  • Done row by row, exit on over sum
  • Perfect for VMX!
• Awesome speedup of… 0%
• Huh? Why?
  • Summing a row doesn’t suit VMX
  • Branch penalty still there
  • We have to do unaligned loads to VMX registers

Movie Compression Optimization

• Let’s think again
• Look at higher level picture
  • Error calculated for 4 blocks at a time by caller
  • Rows in blocks (0,1) and (2,3) are contiguous
  • Pick up two blocks at a time in VMX registers
• Thresholding is by row
  • But there is no reason not to do it by column
  • Means we can sum columns in 7 instructions
• Use __restrict on block pointers
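The restructuring above can be sketched in portable scalar C++, which is also handy as a reference to cross-check a SIMD version against. Assumptions, none of them from the original code: each row is 16 contiguous bytes spanning two side-by-side 8x8 blocks, and `TwoBlockSad`, `TwoBlockSad8x8` and the stride parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct TwoBlockSad
{
    int block0;  // sum of absolute differences for the left 8x8 block
    int block1;  // sum of absolute differences for the right 8x8 block
};

TwoBlockSad TwoBlockSad8x8(const unsigned char* src, const unsigned char* dst,
                           std::size_t srcStride, std::size_t dstStride)
{
    assert(src != nullptr && dst != nullptr);

    // Accumulate down the columns first: 16 running sums, one per byte lane,
    // which is the shape a 16-byte vector register handles naturally.
    int column[16] = {};
    for (int row = 0; row < 8; ++row)
    {
        for (int col = 0; col < 16; ++col)
        {
            const int a = src[col];
            const int b = dst[col];
            column[col] += std::max(a, b) - std::min(a, b);  // |a - b| without abs
        }
        src += srcStride;
        dst += dstStride;
    }

    // Only at the end reduce across lanes: 8 columns per block.
    TwoBlockSad result = {0, 0};
    for (int col = 0; col < 8; ++col)
    {
        result.block0 += column[col];
        result.block1 += column[col + 8];
    }
    return result;
}
```

Note that this sketch drops the per-row early out, so the threshold comparison would move to the caller once both block sums are known.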

Movie Compression Optimization

[Figure: macro blocks 0, 1, 2 and 3, with VMX registers 0 to 7 each holding one row across a pair of blocks]

Movie Compression Optimization

• Data Layout & Alignment
  • Rows in 2 blocks are contiguous in memory
  • Source block always 16 byte aligned
  • Dest block only guaranteed to be byte aligned
• Unrolling
  • We can unroll the 8 iteration loop
  • We have plenty of VMX registers available
• Return value
  • Return a __vector4 to avoid LHS writing to int

Movie Compression Optimization

• Miscellaneous
  • Prebuild threshold word once
  • Remove stride word parameters
    • Constant values in this application only
    • Proved with empirical research (and assert)
  • Vector parameters and return in registers
  • Pushed vector error results out to caller
    • All caller's calculations in VMX: drop LHS

Movie Compression Optimization

__vector4 __declspec(passinreg) twoblock_sad8x8__xbox(const unsigned char* __restrict ptr1,
                                                      const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();

    __vector4 row1_0 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4 *)ptr1; ptr1 += cStride1;

    __vector4 row2_0 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4 *)ptr2; ptr2 += cStride2;

    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0),__vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1),__vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2),__vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3),__vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4),__vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5),__vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6),__vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7),__vminub(row1_7,row2_7));

    row2_0 = __vmrglb(zero,row1_0); row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1); row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2); row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3); row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4); row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5); row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6); row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7); row1_7 = __vmrghb(zero,row1_7);

    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);

    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);

    row2_0 = __vaddshs(row2_0,row2_1);
    row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5);
    row2_6 = __vaddshs(row2_6,row2_7);

    row2_0 = __vaddshs(row2_0,row2_2);
    row2_4 = __vaddshs(row2_4,row2_6);

    row2_0 = __vaddshs(row2_0,row2_4);

    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);

    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);

    row1_0 = __vaddshs(row1_0,row1_1); // add 4 rows to the next row
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);

    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);

    row1_0 = __vaddshs(row1_0,row1_4);

    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));

    return row1_0;
}

Unpleasant

Movie Compression Optimization

• Results
  • Un-thresholded macro block compare
    • 2.86 times quicker than existing C
    • Not bad, but our code is doing 2 blocks at once, too
    • So actually, 5.72 times quicker
  • Thresholded macro block compare
    • 4.12 times quicker
• Optimizations to just the block compares…
  • …reduced movie compression time by 22%
  • …in worst case, saved 40 seconds from compress time

Do We Get Improvements In Reverse?

• Do we see improvements on PC?
  • Image analysis
  • Movie compression

Summary Interlude

• Profiling, profiling, profiling
  • Know your enemy
• Explore data alignment and layout
• Give SIMD plenty of room to work
• Don’t ignore simple code structure changes
• Specialise, not generalise

Original Example

Improving Original Example

• PIX Summary
  • 704k instructions executed
  • 40% L2 usage
  • Top penalties
    • L2 cache miss @ 3m cycles
    • bctr mispredicts @ 1.14m cycles
    • __fsqrt @ 696k cycles
    • 2x fcmp @ 490k cycles
    • Some 20.9m cycles of penalty overall
  • Takes 7.528ms

Improving Original Example

1) Avoid branch mispredict #1
  • Ditch the zealous use of virtual
  • Call functions just once
  • Gives 1.13x speedup

2) Improve L2 use #1
  • Refactoring list to contiguous array
  • Hot/cold split
  • Using bitfield for active flag
  • Gives 3.59x speedup

Improving Original Example

4) Remove expensive instructions
  • Ditch __fsqrts and compare with squares
  • Gives 4.05x speedup

5) Avoid branch mispredict #1
  • Insert __fsel() to select tail length
  • Gives 4.44x speedup
  • Insert 2nd fsel
  • Now only loop active branches remain
  • Gives 5.0x speedup
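Steps 4 and 5 can be sketched together in portable C++. Everything named here is an assumption for illustration: the constant values are made up, `SelectGE` is a hypothetical stand-in for an `fsel`-style select, and the functions take plain floats rather than the particle class from the slides.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical tuning constants; the slides do not give their values.
const float cLimitLength  = 2.0f;
const float cMaxIntensity = 10.0f;

// Portable stand-in for an fsel-style select: a when test >= 0, else b.
inline float SelectGE(float test, float a, float b)
{
    return (test >= 0.0f) ? a : b;
}

// Original shape: a square root plus two data-dependent branches.
float IntensityBranchy(float vx, float vy, float vz, float lifetime)
{
    const float length = std::sqrt(vx * vx + vy * vy + vz * vz);
    if (length > cLimitLength)
    {
        float newIntensity = cMaxIntensity - lifetime;
        if (newIntensity < 0.0f)
            newIntensity = 0.0f;
        return newIntensity;
    }
    return 0.0f;
}

// Reworked shape: compare squared lengths (valid because the limit is
// non-negative) and express both decisions as selects.
float IntensitySelect(float vx, float vy, float vz, float lifetime)
{
    assert(cLimitLength >= 0.0f);
    const float lengthSqr = vx * vx + vy * vy + vz * vz;
    const float limitSqr  = cLimitLength * cLimitLength;
    const float candidate = cMaxIntensity - lifetime;
    const float clamped   = SelectGE(candidate, candidate, 0.0f);
    // limitSqr - lengthSqr >= 0 means the particle is at or under the limit.
    return SelectGE(limitSqr - lengthSqr, 0.0f, clamped);
}
```

Rounding can make the two versions disagree for lengths extremely close to the limit, which is one more reason to cross-compare results during development.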

Improving Original Example

7) Use VMX
  • Use __vsubfp and __vmsum3fp for vector math
  • Gives 5.28x speedup

8) Avoid branch mispredict #2
  • Unroll the loop 4x
  • Sticks at 5.28x speedup

Improving Original Example

9) Avoid branch mispredict #3
  • Build a __vector4 mask from active flags
  • __vsel tail lengths from existing and new
  • Write a single __vector4 result
  • Now only the loop branch remaining
  • Gives 6.01x speedup

10) Improve L2 access #2
  • Add __dcbt on position array
  • Gives 16.01x speedup

Improving Original Example

11) Improve L2 use #3
  • Move to short coordinates
  • Now loading ¼ the data for positions
  • Gives 21.23x speedup

12) Avoid branch mispredict #4
  • We are now writing tail lengths for every particle
  • Wait, we don’t care about inactive particles
  • Epiphany - don’t check active flag at all
  • Gives 23.21x speedup
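Moving to short coordinates, as in step 11, might look like the sketch below. The talk does not say how the coordinates were quantized, so the fixed world range, the scale factor and all names here (`cWorldHalfExtent`, `PackCoord`, `UnpackCoord`, `PackedPosition`) are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical world bounds: coordinates are assumed to lie in
// [-cWorldHalfExtent, +cWorldHalfExtent].
const float cWorldHalfExtent = 1024.0f;
const float cScale = 32767.0f / cWorldHalfExtent;

// Quantize a float coordinate to a signed 16-bit value.
inline std::int16_t PackCoord(float v)
{
    assert(v >= -cWorldHalfExtent && v <= cWorldHalfExtent);
    return static_cast<std::int16_t>(std::lround(v * cScale));
}

// Recover an approximate float coordinate from the packed value.
inline float UnpackCoord(std::int16_t s)
{
    return static_cast<float>(s) / cScale;
}

// Four shorts occupy 8 bytes, against 16 bytes for four floats.
struct PackedPosition
{
    std::int16_t x, y, z, w;
};
```

The trade-off is precision: with these assumed bounds each step is about 0.03 world units, so whether this is acceptable depends on the effect being driven.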

Improving Original Example

13) Improve L2 use #4
  • Remaining L2 misses on output array
  • __dcbt that too
  • Tweak __dcbt offsets and pre-load
  • 39.01x speedup

Improving Original Example

• PIX Summary
  • 259k instructions executed
  • 99.4% L2 usage
  • Top penalties
    • ERAT Data Miss @ 14k cycles
    • 1 LHS via 4 KB aliasing
    • No mispredict penalties
    • 71k cycles of penalty overall
  • Takes 0.193ms

Improving Original Example

• Caveat
  • Slightly trivial code example
  • Not all techniques possible in ‘real life’
  • But principles always apply
• dcbz128 mystery?
  • We write entire array
  • Should be able to save L2 loads by pre-zeroing
  • But results showed slowdown

Thanks For Listening

• Any questions?

