digital signal processors application architecture
DESCRIPTION
Digital Signal Processors Application ArchitectureTRANSCRIPT
-
Lecture 9: Digital Signal Processors:Applications and ArchitecturesPrepared by: Professor Kurt KeutzerComputer Science 252, Spring 2000With contributions from:Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson
Kurt Keutzer
-
Processor ApplicationsGeneral Purpose - high performancePentiums, Alphas, SPARCUsed for general purpose software Heavy weight OS - UNIX, NTWorkstations, PCsEmbedded processors and processor coresARM, 486SX, Hitachi SH7000, NEC V800Single programLightweight, often realtime OSDSP supportCellular phones, consumer electronics (e.g. CD players) Microcontrollers Extremely cost sensitiveSmall word size - 8 bit commonHighest volume processors by farAutomobiles, toasters, thermostats, ... IncreasingCostIncreasingvolume
Kurt Keutzer
-
Processor Markets$30B$9.3B/31%$5.7B/19%$10B/33%8-bitmicro16-bitmicroDSP32-bitmicro$5.2B/17%$1.2B/4%32 bit DSP
Kurt Keutzer
-
The Processor Design SpaceCostPerformanceMicroprocessorsPerformance iseverything& Software rulesEmbeddedprocessorsMicrocontrollersCost is everythingApplication specific architecturesfor performance
Kurt Keutzer
-
Market for DSP ProductsMixed/SignalAnalogDSPDSP is the fastest growing segment of the semiconductor market
Kurt Keutzer
-
DSP Applications Audio applications MPEG Audio Portable audioDigital camerasWireless Cellular telephones Base stationNetworking Cable modems ADSL VDSL
Kurt Keutzer
-
Another Look at DSP ApplicationsHigh-endWireless Base Station - TMS320C6000Cable modem gatewaysMid-endCellular phone - TMS320C540Fax/ voice serverLow endStorage products - TMS320C27Digital camera - TMS320C5000Portable phonesWireless headsetsConsumer audioAutomobiles, toasters, thermostats, ... IncreasingCostIncreasingvolume
Kurt Keutzer
-
Serving a range of applications
Kurt Keutzer
-
Worlds Cellular SubscribersMillionsYearDigitalAnalogSource: Ericsson Radio Systems, Inc.Will providea ubiquitousinfrastructurefor wirelessdata as wellas voice
Kurt Keutzer
-
CELLULAR TELEPHONE SYSTEMPHYSICALLAYERPROCESSINGRF MODEMCONTROLLER 1 2 3 4 5 67 8 90415-555-1212SPEECHDECODESPEECHENCODEA/DBASEBANDCONVERTERDAC
Kurt Keutzer
-
HW/SW/IC PARTITIONINGPHYSICALLAYERPROCESSINGRF MODEMCONTROLLER 1 2 3 4 5 67 8 90415-555-1212SPEECHDECODESPEECHENCODEA/DBASEBANDCONVERTERDACANALOG ICDSPASICMICROCONTROLLER
Kurt Keutzer
-
Mapping onto a system on a chip CRAMASICLOGICphonebookprotocolkeypadintfccontrolS/PDMAspeechqualityenhancmentde-intl &decodervoicerecognitionRPE-LTPspeech decoder demodulatorand synchronizer
Viterbi equalizer
Kurt Keutzer
-
Example Wireless Phone OrganizationC540ARM7
Kurt Keutzer
-
Multimedia I/O Architecture Low Power BusEmbedded ProcessorFifoVideoDecompVideoAudioFBFifoGraphicsPenSched ECC PactInterfaceDataFlowSRAM
Kurt Keutzer
-
Multimedia System on a ChipFuture chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/OE.g. Multimedia terminal electronics
Kurt Keutzer
-
Requirements of the Embedded ProcessorsOptimized for a single program - code often in on-chip ROM or off chip EPROMMinimum code size (one of the motivations initially for Java)Performance obtained by optimizing datapathLow costLowest possible areaTechnology behind the leading edgeHigh level of integration of peripherals (reduces system cost)Fast time to marketCompatible architectures (e.g. ARM) allows reuseable codeCustomizable coreLow power if application requires portability
Kurt Keutzer
-
Area of processor cores = CostNintendo processorCellular phones
Kurt Keutzer
-
Another figure of meritComputation per unit areaNintendo processorCellular phones???
Kurt Keutzer
-
Code sizeIf a majority of the chip is the program stored in ROM, then code size is a critical issueThe Piranha has 3 sized instructions - basic 2 byte, and 2 byte plus 16 or 32 bit immediate
Kurt Keutzer
-
BENCHMARKS - DSPstoneZIVOJNOVIC, VERLADE, SCHLAGER: UNIVERSITY OF AACHENAPPLICATION BENCHMARKSADPCM TRANSCODER - CCITT G.721REAL_UPDATECOMPLEX_UPDATESDOT_PRODUCTMATRIX_1X3CONVOLUTIONFIRFIR2DIMHR_ONE_BIQUADLMSFFT_INPUT_SCALED
Kurt Keutzer
-
Evolution of GP and DSPGeneral Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)DSP evolved from Analog Signal Processors, using analog hardware to transform phyical signals (classical electrical engineering)ASP to DSP becauseDSP insensitive to environment (e.g., same response in snow or desert if it works at all)DSP performance identical even with variations in components; 2 analog systems behavior varies even if built with same components with 1% variationDifferent history and different applications led to different terms, different metrics, some new inventionsConvergence of markets will lead to architectural showdown
Kurt Keutzer
-
Embedded Systems vs. General Purpose Computing - 1Embedded System
Runs a few applications often known at design timeNot end-user programmableOperates in fixed run-time constraints, additional performance may not be useful/valuableGeneral purpose computing
Intended to run a fully general set of applicationsEnd-user programmable Faster is always better
Kurt Keutzer
-
Embedded Systems vs. General Purpose Computing - 2Embedded System
Differentiating features:powercostspeed (must be predictable)
General purpose computing
Differentiating featuresspeed (need not be fully predictable)speeddid we mention speed? cost (largest component power)
Kurt Keutzer
-
DSP vs. General Purpose MPUDSPs tend to be written for 1 program, not many programs. Hence OSes are much simpler, there is no virtual memory or protection, ...DSPs sometimes run hard real-time appsYou must account for anything that could happen in a time slot All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval. Therefore, exceptions are BAD!DSPs have an infinite continuous data stream
Kurt Keutzer
-
DSP vs. General Purpose MPUThe MIPS/MFLOPS of DSPs is speed of Multiply-Accumulate (MAC). DSP are judged by whether they can keep the multipliers busy 100% of the time.The "SPEC" of DSPs is 4 algorithms: Inifinite Impule Response (IIR) filtersFinite Impule Response (FIR) filtersFFT, and convolversIn DSPs, algorithms are king!Binary compatability not an issueSoftware is not (yet) king in DSPs. People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
Kurt Keutzer
-
TYPES OF DSP PROCESSORSDSP Multiprocessors on a die TMS320C80 TMS320C600032-BIT FLOATING POINTTI TMS320C4XMOTOROLA 96000AT&T DSP32CANALOG DEVICES ADSP2100016-BIT FIXED POINTTI TMS320C2XMOTOROLA 56000AT&T DSP16ANALOG DEVICES ADSP2100
Kurt Keutzer
-
Note of Caution on DSP ArchitecturesSuccessful DSP architectures have two aspects:Key architectural and micro-architectural features that enabled product success in key parameters Speed Code density Low power Architectural and micro-architectural features that are artifacts of the era in which they were designed
We will focus on the former!
Kurt Keutzer
-
Architectural Features of DSPsData path configured for DSP Fixed-point arithmeticMAC- Multiply-accumulateMultiple memory banks and buses - Harvard ArchitectureMultiple data memoriesSpecialized addressing modes Bit-reversed addressing Circular buffersSpecialized instruction set and execution control Zero-overhead loopsSupport for MACSpecialized peripherals for DSPTHE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!
Kurt Keutzer
-
DSP Data Path: ArithmeticDSPs dealing with numbers representing real world => Want reals/ fractionsDSPs dealing with numbers for addresses => Want integersSupport fixed point as well as integersS.radix point-1 x < 1S.radix point2N1 x < 2N1
Kurt Keutzer
-
DSP Data Path: PrecisionWord size affects precision of fixed point numbersDSPs have 16-bit, 20-bit, or 24-bit data wordsFloating Point DSPs cost 2X - 4X vs. fixed point, slower than fixed pointDSP programmers will scale values inside codeSW LibrariesSeparate explicit exponentBlocked Floating Point single exponent for a group of fractionsFloating point support simplify development
Kurt Keutzer
-
DSP Data Path: Overflow?DSP are descended from analog : what should happen to output when peg an input? (e.g., turn up volume control knob on stereo)Modulo Arithmetic???Set to most positive (2N11) or most negative value(2N1) : saturationMany algorithms were developed in this model
Kurt Keutzer
-
DSP Data Path: MultiplierSpecialized hardware performs all key arithmetic operations in 1 cycle 50% of instructions can involve multiplier => single cycle latency multiplierNeed to perform multiply-accumulate (MAC)n-bit multiplier => 2n-bit product
Kurt Keutzer
-
DSP Data Path: AccumulatorDont want overflow or have to scale accumulatorOption 1: accumalator wider than product: guard bitsMotorola DSP: 24b x 24b => 48b product, 56b AccumulatorOption 2: shift right and round product before adder
Kurt Keutzer
-
DSP Data Path: RoundingEven with guard bits, will need to round when store accumulator into memory3 DSP standard optionsTruncation: chop results => biases results upRound to nearest: < 1/2 round down, 1/2 round up (more positive) => smaller biasConvergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias IEEE 754 calls this round to nearest even
Kurt Keutzer
-
Data PathDSP Processor
Specialized hardware performs all key arithmetic operations in 1 cycle.Hardware support for managing numeric fidelity:ShiftersGuard bitsSaturation
General-Purpose Processor
Multiplies often take>1 cycleShifts often take >1 cycleOther operations (e.g., saturation, rounding) typically take multiple cycles.
Kurt Keutzer
-
320C54x DSP Functional Block Diagram
Kurt Keutzer
-
FIR Filtering: A Motivating ProblemM most recent samples in the delay line (Xi)New sample moves data down delay lineTap is a multiply-addEach tap (M+1 taps total) nominally requires: Two data fetches Multiply Accumulate Memory write-back to update delay lineGoal: 1 FIR Tap / DSP instruction cycle
Kurt Keutzer
-
BENCHMARKS - FIR FILTERFINITE-IMPULSE RESPONSE FILTER. . . .
Kurt Keutzer
-
Micro-architectural impact - MACelement of finite-impulse response filter computationMPYXYACC REGADD/SUB
Kurt Keutzer
-
The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycleThis means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback 123D546Mapping of the filter onto a DSP execution unit
Kurt Keutzer
-
MAC Eg. - 320C54x DSP Functional Block Diagram
Kurt Keutzer
-
DSP MemoryFIR Tap implies multiple memory accessesDSPs want multiple data portsSome DSPs have ad hoc techniques to reduce memory bandwdith demandInstruction repeat buffer: do 1 instruction 256 timesOften disables interrupts, thereby increasing interrupt response timeSome recent DSPs have instruction cachesEven then may allow programmer to lock in instructions into cacheOption to turn cache into fast program memoryNo DSPs have data cachesMay have multiple data memories
Kurt Keutzer
-
Conventional ``Von Neumann memory
Kurt Keutzer
-
HARVARD ARCHITECTURE in DSPPROGRAMMEMORYX MEMORYY MEMORYGLOBALP DATAX DATAY DATA
Kurt Keutzer
-
Memory ArchitectureDSP ProcessorHarvard architecture2-4 memory accesses/cycleNo caches-on-chip SRAM
General-Purpose ProcessorVon Neumann architectureTypically 1 access/cycleMay use cachesProcessorProgramMemoryDataMemoryProcessorMemory
Kurt Keutzer
-
Eg. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
Kurt Keutzer
-
Eg. 320C62x/67x DSP
Kurt Keutzer
- DSP AddressingHave standard addressing modes: immediate, displacement, register indirectWant to keep MAC datapth busyAssumption: any extra instructions imply clock cycles of overhead in inner loop => complex addressing is good => dont use datapath to calculate fancy addressAutoincrement/Autodecrement register indirectlw r1,0(r2)+ => r1
-
DSP Addressing: FFTFFTs start or end with data in weird bufferfly order0 (000)=> 0 (000)1 (001)=> 4 (100)2 (010)=> 2 (010)3 (011)=> 6 (110)4 (100)=> 1 (001)5 (101)=> 5 (101)6 (110)=> 3 (011)7 (111)=> 7 (111)What can do to avoid overhead of address checking instructions for FFT?Have an optional bit reverse address addressing mode for use with autoincrement addressingMany DSPs have bit reverse addressing for radix-2 FFT
Kurt Keutzer
-
BIT REVERSED ADDRESSINGData flow in the radix-2 decimation-in-time FFT algorithm
Kurt Keutzer
-
DSP Addressing: BuffersDSPs dealing with continuous I/OOften interact with an I/O buffer (delay lines)To save memory, buffer often organized as circular bufferWhat can do to avoid overhead of address checking instructions for circular buffer?Option 1: Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of bufferOption 2: Keep a buffer length register, assuming buffers starts on aligned address, reset to start when reach endEvery DSP has modulo or circular addressing
Kurt Keutzer
-
CIRCULAR BUFFERSInstructions accomodate three elements: buffer address buffer size incrementAllows for cyling through: delay elements coefficients in data memory
Kurt Keutzer
-
AddressingDSP ProcessorDedicated address generation unitsSpecialized addressing modes; e.g.:AutoincrementModulo (circular)Bit-reversed (for FFT)Good immediate data supportGeneral-Purpose ProcessorOften, no separate address generation unitGeneral-purpose addressing modes
Kurt Keutzer
-
Address calculation unit for DSPSupports modulo and bit reversal arithmeticOften duplicated to calculate multiple addresses per cycle
Kurt Keutzer
-
DSP Instructions and ExecutionMay specify multiple operations in a single instructionMust support Multiply-Accumulate (MAC)Need parallel move supportUsually have special loop support to reduce branch overheadLoop an instruction or sequence0 value in register usually means loop maximum number of timesMust be sure if calculate loop count that 0 does not mean 0May have saturating shift left arithmeticMay have conditional execution to reduce branches
Kurt Keutzer
-
ADSP 2100: ZERO-OVERHEAD LOOPAddress Generation PCS = PC + 1if (PC = x && ! condition) PC = PCSelse PC = PC +1DO UNTIL condition Eliminates a few instructions in loops - Important in loops with small bodies
Kurt Keutzer
-
Instruction SetDSP Processor
Specialized, complex instructionsMultiple operations per instructionGeneral-Purpose Processor
General-purpose instructionsTypically only one operation per instructionmac x0,y0,a x: (r0) + ,x0 y: (r4) + ,y0mov *r0,x0mov *r1,y0 mpy x0, y0, aadd a, bmov y0, *r2inc r0inc rl
Kurt Keutzer
-
Specialized Peripherals for DSPsSynchronous serial portsParallel portsTimersOn-chip A/D, D/A convertersHost portsBit I/O portsOn-chip DMA controllerClock generatorsOn-chip peripherals often designed for background operation, even when core is powered down.
Kurt Keutzer
-
Specialized peripherals
Kurt Keutzer
-
TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995
Kurt Keutzer
-
Summary of Architectural Features of DSPsData path configured for DSP Fixed-point arithmeticMAC- Multiply-accumulateMultiple memory banks and buses - Harvard ArchitectureMultiple data memoriesSpecialized addressing modes Bit-reversed addressing Circular buffersSpecialized instruction set and execution control Zero-overhead loopsSupport for MACSpecialized peripherals for DSPTHE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!
Kurt Keutzer