a. sazegari altivec technical lead

Download A. Sazegari AltiVec Technical Lead

If you can't read please download the document

Upload: deion

Post on 09-Jan-2016

27 views

Category:

Documents


0 download

DESCRIPTION

A. Sazegari AltiVec Technical Lead. Introduction. AltiVec™ is an extension to the PowerPC Instruction Set Architecture Designed to extend Apple’s leadership position in multimedia processing. AltiVec is a trademark of Motorola, Inc. What You’ll Learn. About the AltiVec Architecture - PowerPoint PPT Presentation

TRANSCRIPT

  • A. SazegariAltiVec Technical Lead

  • IntroductionAltiVec is an extension to the PowerPC Instruction Set Architecture Designed to extend Apples leadership position in multimedia processingAltiVec is a trademark of Motorola, Inc.

  • What Youll LearnAbout the AltiVec Architecture Its performance potential AltiVec programming

  • AltiVec TechnologyVector/SIMD technologyFixed-length vector operands (packed data)Single Instruction Multiple DataRISC-style instruction setOptimized for digital signal processingElevates multimedia to rst-class data typeUseful wherever data-parallelism exists

  • AltiVec ArchitectureNew Vector Register File:32 new 128-bit wide registersNew data-types:Packed byte, halfword, and word integersPacked IEEE single-precision oatsSaturation Arithmetic capability160 new PowerPC instructions

  • PowerPC ArchitectureInstruction StreamFPUIUFPRFGRFMemoryBranch Unit3264

  • AltiVec ArchitectureInstruction StreamFPUIUFPRFGRFMemory3264128Vector UnitVector Register FileBranch Unit

  • Programming ModelVector RegisterFileGeneral Reg.FileFPR0FPR31VR0Branch Registers64-bits32-bits128-bitsFloating-Point RegisterFile32-registers Separate Vector Register File More space for coefcients, variables, etc. More names for scheduling Wider for more parallelism No interference with FP or integerCondCountLinkTimeTimeVRSaveVSCRGPR0GPR31XERFPSCRVR31

  • Vector Data Types16 signed or unsigned integer bytes8 signed or unsigned integer halfwords4 signed or unsigned integer wordsor4 IEEE single-precision floating-point numbersOne Vector (128 bits)

  • Simple SIMD Example++++++++VRAVRBVRT8 halfword additions in one instructionSaturation arithmetic (clamp to max or min on overow)T = vec_adds (A, B); // vector signed short T, A, Bvaddshs T, A, B

  • Vector Dot ProductVRA1VRB1VRT1/A2XXXXVRC1vec_msum( )vec_sums( )VRB2VRT2XXXXXXXXXXXX

  • Arithmetic OperationsAdd, Subtract, AverageMultiply, Multiply-add, Multiply-sumLogicals (and, andc, or, nor, xor)Rotates and shiftsComparesConvert oat xed (scaled) and via Newton-Raphson renement of reciprocal estimate

  • Vector Permute0123456789ABCDEF101112131415161718191A1B1C1D1E1F1718DEF1E10121110A14141414VRAVRBVRCVRT Arbitrary bytewise data reorganization Small table-lookupT = vec_perm (A, B, C);

  • Compare and Select00000000000000000000000000000000C10000001A1AC11A00C100001A001AC100FFFFFF00000000FF00FFFF00FF00009A9A9A9A9A9A9A9A9A9A9A9A9A9A9A9AC19A9A9A1A1AC11A9AC19A9A1A9A1AC1C10000001A1AC11A00C100001A001AC1VRA1VRB1VRT1/C2VRA1/A2VRB2VRT2vec_cmpeq( )vec_sel( )================

  • Other AltiVec InstructionsLoad and Store (vector or scalar element)Pack, Unpack, and Merge elementsSplat (element or literal replication)Bitwise vector shiftsDouble-vector bytewise shifts

  • Data Stream PrefetchSoftware directed prefetch into cache4 simultaneous streamsIndependent and asynchronousCan be non-contiguous123NMemoryBlock Size = 0-32 VectorsStride = 32KBytes0-256 Blocks

  • Typical ImplementationALL instructions fully-pipelined with single-cycle throughputSimple ops: 1 cycle latencyCompound ops: 34 cycle latencyDual AltiVec instruction issueOne arithmetic, one permuteNo restriction on issue with scalar instructions

  • AltiVec vs. MMXBoth SIMD, but AltiVec:Does everything MMX does, plusTwice the SIMD parallelism4x the register namespace8x the register storage spaceNo mode switch or use overheadPermuteRicher set of DSP instructions

  • AltiVec PerformancePeak PerformanceMultimedia kernelsDSP benchmarksPerformance based on cycle-accurate simulator with real memory effects includedPerformance stated relative to optimized PowerPC scalar code

  • Peak PerformanceVector operations at 400MHz:Integer12.8 billion arithmetic ops/sec+ 6.4 billion byte crossbar ops/secFloating-point3.2 gigaops+ 1.6 billion FP crossbar ops/sec

  • Multimedia KernelsVideo and Audio11.4xDiscrete Cosine Transform (DCT)16.1x*Motion estimation (* by |A-B|)12.5xQuantization 9.6xRGB -> YCbCr (CCIR601) 3.6xInverse FFT (FP) 4.9xWindowing (FP)

  • Multimedia KernelsImage Processing 6.2xBilinear interpolation1.1cy/pxSeparable convolution2.2cy/pxRGB to YUV1.3cy/pxMedian Filter (3x3)

  • Multimedia KernelsGraphics 6.2xVector-matrix multiply (FP)17.5xBuffer accumulation 6.6xLine clipping 6.3xBezier curves

  • Communication KernelsModems and Telephony 2.5xCRC-3210.5x64-QAM Demodulator 7.6xLinear prediction 9.3xReal 13-tap FIR30.7xAutocorrelation12.5xGSM Module 4.2.11

  • Miscellaneous DSP KernelsMiscellaneous 2.5 to 20xParallel table lookup10.0xSorting 5.8xAssociative search16.0xGalois eld multiply 4.0xGamma Correction12.0cy/blockHaar Transform (wavelet)

  • DSP BenchmarksResults from an independent DSP benchmarking rm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:Twice as fast as the worlds fastest DSP (TMS320C6201) per clock, and four times faster including frequency2 to 5 times faster than Pentium II per clock (but P would still be 35% smaller)

  • AltiVec ToolsProgramming Model and ABICompilers and assemblersMotorolas MCC CodeWarrior plug-inApples MrC and PPCASM in MPW and MWMetrowerks C/C++Emulator/Trace generatorMacsBugCycle-accurate simulatorPerformance proler

  • Programming in C11 new fundamental packed data typesAltiVec operatorsParse like function callsSpecic operators > assembly instructionsGeneric operators type sensitivesizeof(), a=b, &a, *p, etc.Compiler does register allocation, inlining, code scheduling, etc.

  • C Program Examplezero = ( vector unsigned long ) ( 0 ); //zero = vec_xor ( zero, zero );shiftFactor = vec_splat_u8 ( 11 );z = vec_sro ( x, shiftFactor );z = vec_srl ( z, shiftFactor );

    do{carry= vec_addc ( z, y );z= vec_add ( z, y );y= vec_sld ( carry, zero, 4 );} while ( !vec_all_eq ( y, zero ) );

  • Vector ShiftsThis shiftFactor vector is populated in 2 sections for vector shift right by octet vsro and vector shift right vsr

    bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 || used by || || ||

    vsro is based on the permute cross bar and shifts bytes, Instruction vsr is a 0 to 7 bit shift.

    Used sequentially,the combination of these instructions will shift a vector register right (or left) from 0 to 127 bits as specified in bits 121:127 of shiftFactor.

    bit # ... || 121 | 122 | 123 | 124 || 125 | 126 | 127|| shiftFactor = ... || 0 | 0 | 0 | 1 || 0 | 1 | 1 ||

  • AltiVec at AppleMac OS (blockmove, etc.)QuickDrawQTML (codecs, rasterizers)Media source code [email protected]

  • AltiVec SummaryMajor architectural extension will make future PowerPCs great media processorsEarly programming tools available nowDevelopment systems 2H98 (Now)AltiVec based systems in 1H99