a. sazegari altivec technical lead
DESCRIPTION
A. Sazegari AltiVec Technical Lead. Introduction. AltiVec™ is an extension to the PowerPC Instruction Set Architecture Designed to extend Apple’s leadership position in multimedia processing. AltiVec is a trademark of Motorola, Inc. What You’ll Learn. About the AltiVec Architecture - PowerPoint PPT PresentationTRANSCRIPT
-
A. SazegariAltiVec Technical Lead
-
IntroductionAltiVec is an extension to the PowerPC Instruction Set Architecture Designed to extend Apples leadership position in multimedia processingAltiVec is a trademark of Motorola, Inc.
-
What Youll LearnAbout the AltiVec Architecture Its performance potential AltiVec programming
-
AltiVec TechnologyVector/SIMD technologyFixed-length vector operands (packed data)Single Instruction Multiple DataRISC-style instruction setOptimized for digital signal processingElevates multimedia to rst-class data typeUseful wherever data-parallelism exists
-
AltiVec ArchitectureNew Vector Register File:32 new 128-bit wide registersNew data-types:Packed byte, halfword, and word integersPacked IEEE single-precision oatsSaturation Arithmetic capability160 new PowerPC instructions
-
PowerPC ArchitectureInstruction StreamFPUIUFPRFGRFMemoryBranch Unit3264
-
AltiVec ArchitectureInstruction StreamFPUIUFPRFGRFMemory3264128Vector UnitVector Register FileBranch Unit
-
Programming ModelVector RegisterFileGeneral Reg.FileFPR0FPR31VR0Branch Registers64-bits32-bits128-bitsFloating-Point RegisterFile32-registers Separate Vector Register File More space for coefcients, variables, etc. More names for scheduling Wider for more parallelism No interference with FP or integerCondCountLinkTimeTimeVRSaveVSCRGPR0GPR31XERFPSCRVR31
-
Vector Data Types16 signed or unsigned integer bytes8 signed or unsigned integer halfwords4 signed or unsigned integer wordsor4 IEEE single-precision floating-point numbersOne Vector (128 bits)
-
Simple SIMD Example++++++++VRAVRBVRT8 halfword additions in one instructionSaturation arithmetic (clamp to max or min on overow)T = vec_adds (A, B); // vector signed short T, A, Bvaddshs T, A, B
-
Vector Dot ProductVRA1VRB1VRT1/A2XXXXVRC1vec_msum( )vec_sums( )VRB2VRT2XXXXXXXXXXXX
-
Arithmetic OperationsAdd, Subtract, AverageMultiply, Multiply-add, Multiply-sumLogicals (and, andc, or, nor, xor)Rotates and shiftsComparesConvert oat xed (scaled) and via Newton-Raphson renement of reciprocal estimate
-
Vector Permute0123456789ABCDEF101112131415161718191A1B1C1D1E1F1718DEF1E10121110A14141414VRAVRBVRCVRT Arbitrary bytewise data reorganization Small table-lookupT = vec_perm (A, B, C);
-
Compare and Select00000000000000000000000000000000C10000001A1AC11A00C100001A001AC100FFFFFF00000000FF00FFFF00FF00009A9A9A9A9A9A9A9A9A9A9A9A9A9A9A9AC19A9A9A1A1AC11A9AC19A9A1A9A1AC1C10000001A1AC11A00C100001A001AC1VRA1VRB1VRT1/C2VRA1/A2VRB2VRT2vec_cmpeq( )vec_sel( )================
-
Other AltiVec InstructionsLoad and Store (vector or scalar element)Pack, Unpack, and Merge elementsSplat (element or literal replication)Bitwise vector shiftsDouble-vector bytewise shifts
-
Data Stream PrefetchSoftware directed prefetch into cache4 simultaneous streamsIndependent and asynchronousCan be non-contiguous123NMemoryBlock Size = 0-32 VectorsStride = 32KBytes0-256 Blocks
-
Typical ImplementationALL instructions fully-pipelined with single-cycle throughputSimple ops: 1 cycle latencyCompound ops: 34 cycle latencyDual AltiVec instruction issueOne arithmetic, one permuteNo restriction on issue with scalar instructions
-
AltiVec vs. MMXBoth SIMD, but AltiVec:Does everything MMX does, plusTwice the SIMD parallelism4x the register namespace8x the register storage spaceNo mode switch or use overheadPermuteRicher set of DSP instructions
-
AltiVec PerformancePeak PerformanceMultimedia kernelsDSP benchmarksPerformance based on cycle-accurate simulator with real memory effects includedPerformance stated relative to optimized PowerPC scalar code
-
Peak PerformanceVector operations at 400MHz:Integer12.8 billion arithmetic ops/sec+ 6.4 billion byte crossbar ops/secFloating-point3.2 gigaops+ 1.6 billion FP crossbar ops/sec
-
Multimedia KernelsVideo and Audio11.4xDiscrete Cosine Transform (DCT)16.1x*Motion estimation (* by |A-B|)12.5xQuantization 9.6xRGB -> YCbCr (CCIR601) 3.6xInverse FFT (FP) 4.9xWindowing (FP)
-
Multimedia KernelsImage Processing 6.2xBilinear interpolation1.1cy/pxSeparable convolution2.2cy/pxRGB to YUV1.3cy/pxMedian Filter (3x3)
-
Multimedia KernelsGraphics 6.2xVector-matrix multiply (FP)17.5xBuffer accumulation 6.6xLine clipping 6.3xBezier curves
-
Communication KernelsModems and Telephony 2.5xCRC-3210.5x64-QAM Demodulator 7.6xLinear prediction 9.3xReal 13-tap FIR30.7xAutocorrelation12.5xGSM Module 4.2.11
-
Miscellaneous DSP KernelsMiscellaneous 2.5 to 20xParallel table lookup10.0xSorting 5.8xAssociative search16.0xGalois eld multiply 4.0xGamma Correction12.0cy/blockHaar Transform (wavelet)
-
DSP BenchmarksResults from an independent DSP benchmarking rm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:Twice as fast as the worlds fastest DSP (TMS320C6201) per clock, and four times faster including frequency2 to 5 times faster than Pentium II per clock (but P would still be 35% smaller)
-
AltiVec ToolsProgramming Model and ABICompilers and assemblersMotorolas MCC CodeWarrior plug-inApples MrC and PPCASM in MPW and MWMetrowerks C/C++Emulator/Trace generatorMacsBugCycle-accurate simulatorPerformance proler
-
Programming in C11 new fundamental packed data typesAltiVec operatorsParse like function callsSpecic operators > assembly instructionsGeneric operators type sensitivesizeof(), a=b, &a, *p, etc.Compiler does register allocation, inlining, code scheduling, etc.
-
C Program Examplezero = ( vector unsigned long ) ( 0 ); //zero = vec_xor ( zero, zero );shiftFactor = vec_splat_u8 ( 11 );z = vec_sro ( x, shiftFactor );z = vec_srl ( z, shiftFactor );
do{carry= vec_addc ( z, y );z= vec_add ( z, y );y= vec_sld ( carry, zero, 4 );} while ( !vec_all_eq ( y, zero ) );
-
Vector ShiftsThis shiftFactor vector is populated in 2 sections for vector shift right by octet vsro and vector shift right vsr
bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 || used by || || ||
vsro is based on the permute cross bar and shifts bytes, Instruction vsr is a 0 to 7 bit shift.
Used sequentially,the combination of these instructions will shift a vector register right (or left) from 0 to 127 bits as specified in bits 121:127 of shiftFactor.
bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127|| shiftFactor = ... || 0 | 0 | 0 | 1 || 0 | 1 | 1 ||
-
AltiVec at AppleMac OS (blockmove, etc.)QuickDrawQTML (codecs, rasterizers)Media source code [email protected]
-
AltiVec SummaryMajor architectural extension will make future PowerPCs great media processorsEarly programming tools available nowDevelopment systems 2H98 (Now)AltiVec based systems in 1H99