a. sazegari altivec technical lead

A. SazegariAltiVec Technical Lead

IntroductionAltiVec is an extension to the PowerPC Instruction Set Architecture Designed to extend Apples leadership position in multimedia processingAltiVec is a trademark of Motorola, Inc.

What Youll LearnAbout the AltiVec Architecture Its performance potential AltiVec programming

AltiVec TechnologyVector/SIMD technologyFixed-length vector operands (packed data)Single Instruction Multiple DataRISC-style instruction setOptimized for digital signal processingElevates multimedia to rst-class data typeUseful wherever data-parallelism exists

AltiVec ArchitectureNew Vector Register File:32 new 128-bit wide registersNew data-types:Packed byte, halfword, and word integersPacked IEEE single-precision oatsSaturation Arithmetic capability160 new PowerPC instructions

PowerPC ArchitectureInstruction StreamFPUIUFPRFGRFMemoryBranch Unit3264

AltiVec ArchitectureInstruction StreamFPUIUFPRFGRFMemory3264128Vector UnitVector Register FileBranch Unit

Programming ModelVector RegisterFileGeneral Reg.FileFPR0FPR31VR0Branch Registers64-bits32-bits128-bitsFloating-Point RegisterFile32-registers Separate Vector Register File More space for coefcients, variables, etc. More names for scheduling Wider for more parallelism No interference with FP or integerCondCountLinkTimeTimeVRSaveVSCRGPR0GPR31XERFPSCRVR31

Vector Data Types16 signed or unsigned integer bytes8 signed or unsigned integer halfwords4 signed or unsigned integer wordsor4 IEEE single-precision floating-point numbersOne Vector (128 bits)

Simple SIMD Example++++++++VRAVRBVRT8 halfword additions in one instructionSaturation arithmetic (clamp to max or min on overow)T = vec_adds (A, B); // vector signed short T, A, Bvaddshs T, A, B

Vector Dot ProductVRA1VRB1VRT1/A2XXXXVRC1vec_msum( )vec_sums( )VRB2VRT2XXXXXXXXXXXX

Arithmetic OperationsAdd, Subtract, AverageMultiply, Multiply-add, Multiply-sumLogicals (and, andc, or, nor, xor)Rotates and shiftsComparesConvert oat xed (scaled) and via Newton-Raphson renement of reciprocal estimate

Vector Permute0123456789ABCDEF101112131415161718191A1B1C1D1E1F1718DEF1E10121110A14141414VRAVRBVRCVRT Arbitrary bytewise data reorganization Small table-lookupT = vec_perm (A, B, C);

Compare and Select00000000000000000000000000000000C10000001A1AC11A00C100001A001AC100FFFFFF00000000FF00FFFF00FF00009A9A9A9A9A9A9A9A9A9A9A9A9A9A9A9AC19A9A9A1A1AC11A9AC19A9A1A9A1AC1C10000001A1AC11A00C100001A001AC1VRA1VRB1VRT1/C2VRA1/A2VRB2VRT2vec_cmpeq( )vec_sel( )================

Other AltiVec InstructionsLoad and Store (vector or scalar element)Pack, Unpack, and Merge elementsSplat (element or literal replication)Bitwise vector shiftsDouble-vector bytewise shifts

Data Stream PrefetchSoftware directed prefetch into cache4 simultaneous streamsIndependent and asynchronousCan be non-contiguous123NMemoryBlock Size = 0-32 VectorsStride = 32KBytes0-256 Blocks

Typical ImplementationALL instructions fully-pipelined with single-cycle throughputSimple ops: 1 cycle latencyCompound ops: 34 cycle latencyDual AltiVec instruction issueOne arithmetic, one permuteNo restriction on issue with scalar instructions

AltiVec vs. MMXBoth SIMD, but AltiVec:Does everything MMX does, plusTwice the SIMD parallelism4x the register namespace8x the register storage spaceNo mode switch or use overheadPermuteRicher set of DSP instructions

AltiVec PerformancePeak PerformanceMultimedia kernelsDSP benchmarksPerformance based on cycle-accurate simulator with real memory effects includedPerformance stated relative to optimized PowerPC scalar code

Peak PerformanceVector operations at 400MHz:Integer12.8 billion arithmetic ops/sec+ 6.4 billion byte crossbar ops/secFloating-point3.2 gigaops+ 1.6 billion FP crossbar ops/sec

Multimedia KernelsVideo and Audio11.4xDiscrete Cosine Transform (DCT)16.1x*Motion estimation (* by |A-B|)12.5xQuantization 9.6xRGB -> YCbCr (CCIR601) 3.6xInverse FFT (FP) 4.9xWindowing (FP)

Multimedia KernelsImage Processing 6.2xBilinear interpolation1.1cy/pxSeparable convolution2.2cy/pxRGB to YUV1.3cy/pxMedian Filter (3x3)

Multimedia KernelsGraphics 6.2xVector-matrix multiply (FP)17.5xBuffer accumulation 6.6xLine clipping 6.3xBezier curves

Communication KernelsModems and Telephony 2.5xCRC-3210.5x64-QAM Demodulator 7.6xLinear prediction 9.3xReal 13-tap FIR30.7xAutocorrelation12.5xGSM Module 4.2.11

Miscellaneous DSP KernelsMiscellaneous 2.5 to 20xParallel table lookup10.0xSorting 5.8xAssociative search16.0xGalois eld multiply 4.0xGamma Correction12.0cy/blockHaar Transform (wavelet)

DSP BenchmarksResults from an independent DSP benchmarking rm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:Twice as fast as the worlds fastest DSP (TMS320C6201) per clock, and four times faster including frequency2 to 5 times faster than Pentium II per clock (but P would still be 35% smaller)

AltiVec ToolsProgramming Model and ABICompilers and assemblersMotorolas MCC CodeWarrior plug-inApples MrC and PPCASM in MPW and MWMetrowerks C/C++Emulator/Trace generatorMacsBugCycle-accurate simulatorPerformance proler

Programming in C11 new fundamental packed data typesAltiVec operatorsParse like function callsSpecic operators > assembly instructionsGeneric operators type sensitivesizeof(), a=b, &a, *p, etc.Compiler does register allocation, inlining, code scheduling, etc.

C Program Examplezero = ( vector unsigned long ) ( 0 ); //zero = vec_xor ( zero, zero );shiftFactor = vec_splat_u8 ( 11 );z = vec_sro ( x, shiftFactor );z = vec_srl ( z, shiftFactor );

do{carry= vec_addc ( z, y );z= vec_add ( z, y );y= vec_sld ( carry, zero, 4 );} while ( !vec_all_eq ( y, zero ) );

Vector ShiftsThis shiftFactor vector is populated in 2 sections for vector shift right by octet vsro and vector shift right vsr

bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 || used by || || ||

vsro is based on the permute cross bar and shifts bytes, Instruction vsr is a 0 to 7 bit shift.

Used sequentially,the combination of these instructions will shift a vector register right (or left) from 0 to 127 bits as specified in bits 121:127 of shiftFactor.

bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127|| shiftFactor = ... || 0 | 0 | 0 | 1 || 0 | 1 | 1 ||

AltiVec at AppleMac OS (blockmove, etc.)QuickDrawQTML (codecs, rasterizers)Media source code [email protected]

AltiVec SummaryMajor architectural extension will make future PowerPCs great media processorsEarly programming tools available nowDevelopment systems 2H98 (Now)AltiVec based systems in 1H99

a. sazegari altivec technical lead

Documents