digital signal processors application architecture

Lecture 9: Digital Signal Processors: Applications and Architectures
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen; Prof. David Patterson

Kurt Keutzer

Processor Applications
- General purpose - high performance
  - Pentiums, Alphas, SPARC
  - Used for general purpose software
  - Heavyweight OS - UNIX, NT
  - Workstations, PCs
- Embedded processors and processor cores
  - ARM, 486SX, Hitachi SH7000, NEC V800
  - Single program
  - Lightweight, often real-time OS
  - DSP support
  - Cellular phones, consumer electronics (e.g. CD players)
- Microcontrollers
  - Extremely cost sensitive
  - Small word size - 8 bit common
  - Highest volume processors by far
  - Automobiles, toasters, thermostats, ...
[Figure arrows: increasing cost; increasing volume]

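Processor Markets
[Pie chart: $30B total. Segment labels: 8-bit micro, 16-bit micro, DSP, 32-bit micro. Segment values: $9.3B/31%, $5.7B/19%, $10B/33%, $5.2B/17%. Callout: 32-bit DSP, $1.2B/4%. The pairing of labels to values is not recoverable from the text.]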

The Processor Design Space
[Figure: cost vs. performance]
- Microprocessors: performance is everything, and software rules
- Embedded processors: application-specific architectures for performance
- Microcontrollers: cost is everything

Market for DSP Products
[Chart legend: Mixed/Signal, Analog, DSP]
DSP is the fastest growing segment of the semiconductor market

DSP Applications
- Audio applications
  - MPEG Audio
  - Portable audio
- Digital cameras
- Wireless
  - Cellular telephones
  - Base station
- Networking
  - Cable modems
  - ADSL
  - VDSL

Another Look at DSP Applications
- High-end
  - Wireless base station - TMS320C6000
  - Cable modem gateways
- Mid-end
  - Cellular phone - TMS320C540
  - Fax/voice server
- Low end
  - Storage products - TMS320C27
  - Digital camera - TMS320C5000
  - Portable phones
  - Wireless headsets
  - Consumer audio
  - Automobiles, toasters, thermostats, ...
[Figure arrows: increasing cost; increasing volume]

Serving a range of applications

World's Cellular Subscribers
[Chart: subscribers in millions by year, digital vs. analog. Source: Ericsson Radio Systems, Inc.]
Will provide a ubiquitous infrastructure for wireless data as well as voice

CELLULAR TELEPHONE SYSTEM
[Block diagram: RF modem, physical layer processing, baseband converter, A/D, DAC, speech encode, speech decode, controller, keypad (1-9, 0), display (415-555-1212)]

HW/SW/IC PARTITIONING
[Block diagram: the cellular telephone system blocks (RF modem, physical layer processing, baseband converter, A/D, DAC, speech encode, speech decode, controller, keypad, display) partitioned among analog IC, DSP, ASIC, and microcontroller]

Mapping onto a system on a chip
[Block diagram labels: C, RAM, ASIC logic, phone book, protocol, keypad interface, control, S/P, DMA, speech quality enhancement, de-interleaver and decoder, voice recognition, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer]

Example Wireless Phone Organization
[Block diagram labels: C540, ARM7]

Multimedia I/O Architecture
[Block diagram labels: Low Power Bus, Embedded Processor, FIFO, Video Decomp, Video, Audio, FB, FIFO, Graphics, Pen, Sched, ECC, Pact, Interface, Data Flow, SRAM]

Multimedia System on a Chip
- Future chips will be a mix of processors, memory, and dedicated hardware for specific algorithms and I/O
- E.g. multimedia terminal electronics

Requirements of the Embedded Processors
- Optimized for a single program - code often in on-chip ROM or off-chip EPROM
- Minimum code size (one of the initial motivations for Java)
- Performance obtained by optimizing datapath
- Low cost
  - Lowest possible area
  - Technology behind the leading edge
  - High level of integration of peripherals (reduces system cost)
- Fast time to market
  - Compatible architectures (e.g. ARM) allow reusable code
  - Customizable core
- Low power if application requires portability

Area of processor cores = Cost
[Chart labels: Nintendo processor; cellular phones]

Another figure of merit
Computation per unit area
[Chart labels: Nintendo processor; cellular phones; ???]

Code size
- If a majority of the chip is the program stored in ROM, then code size is a critical issue
- The Piranha has three instruction sizes: a basic 2-byte instruction, and 2 bytes plus a 16-bit or 32-bit immediate

BENCHMARKS - DSPstone
ZIVOJNOVIC, VELARDE, SCHLAGER: UNIVERSITY OF AACHEN
- APPLICATION BENCHMARKS
  - ADPCM TRANSCODER - CCITT G.721
- REAL_UPDATE
- COMPLEX_UPDATES
- DOT_PRODUCT
- MATRIX_1X3
- CONVOLUTION
- FIR
- FIR2DIM
- IIR_ONE_BIQUAD
- LMS
- FFT_INPUT_SCALED

Evolution of GP and DSP
- General purpose microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)
- DSP evolved from Analog Signal Processors (ASP), using analog hardware to transform physical signals (classical electrical engineering)
- ASP to DSP because
  - DSP insensitive to environment (e.g., same response in snow or desert if it works at all)
  - DSP performance identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
- Different history and different applications led to different terms, different metrics, some new inventions
- Convergence of markets will lead to architectural showdown

Embedded Systems vs. General Purpose Computing - 1
Embedded System
- Runs a few applications often known at design time
- Not end-user programmable
- Operates in fixed run-time constraints; additional performance may not be useful/valuable

General purpose computing
- Intended to run a fully general set of applications
- End-user programmable
- Faster is always better

Embedded Systems vs. General Purpose Computing - 2
Embedded System
- Differentiating features:
  - power
  - cost
  - speed (must be predictable)

General purpose computing
- Differentiating features:
  - speed (need not be fully predictable)
  - speed
  - did we mention speed?
  - cost (largest component power)

DSP vs. General Purpose MPU
- DSPs tend to run 1 program, not many programs
  - Hence OSes are much simpler, there is no virtual memory or protection, ...
- DSPs sometimes run hard real-time apps
  - You must account for anything that could happen in a time slot
  - All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval
  - Therefore, exceptions are BAD!
- DSPs have an infinite continuous data stream

DSP vs. General Purpose MPU
- The "MIPS/MFLOPS" of DSPs is the speed of Multiply-Accumulate (MAC)
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time
- The "SPEC" of DSPs is 4 algorithms:
  - Infinite Impulse Response (IIR) filters
  - Finite Impulse Response (FIR) filters
  - FFT, and
  - convolvers
- In DSPs, algorithms are king!
  - Binary compatibility not an issue
- Software is not (yet) king in DSPs
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP chip

TYPES OF DSP PROCESSORS
- DSP MULTIPROCESSORS ON A DIE
  - TMS320C80
  - TMS320C6000
- 32-BIT FLOATING POINT
  - TI TMS320C4X
  - MOTOROLA 96000
  - AT&T DSP32C
  - ANALOG DEVICES ADSP21000
- 16-BIT FIXED POINT
  - TI TMS320C2X
  - MOTOROLA 56000
  - AT&T DSP16
  - ANALOG DEVICES ADSP2100

Note of Caution on DSP Architectures
Successful DSP architectures have two aspects:
- Key architectural and micro-architectural features that enabled product success in key parameters
  - Speed
  - Code density
  - Low power
- Architectural and micro-architectural features that are artifacts of the era in which they were designed

We will focus on the former!

Architectural Features of DSPs
- Data path configured for DSP
  - Fixed-point arithmetic
  - MAC - Multiply-accumulate
- Multiple memory banks and buses
  - Harvard Architecture
  - Multiple data memories
- Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
- Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
- Specialized peripherals for DSP
- THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

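DSP Data Path: Arithmetic
- DSPs dealing with numbers representing real world => want "reals"/fractions
- DSPs dealing with numbers for addresses => want integers
- Support "fixed point" as well as integers
  - Fraction: sign bit S, radix point, then fraction bits: $-1 \le x < 1$
  - Integer: sign bit S, integer bits, then radix point: $-2^{N-1} \le x < 2^{N-1}$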
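
The two interpretations of the same N-bit word can be sketched in a few lines. This is a minimal illustration, assuming N = 16 and two's-complement words; the helper names `to_fraction` and `from_fraction` are hypothetical and not taken from any DSP toolchain.

```python
N = 16  # assumed word size for this sketch

INT_MIN = -(1 << (N - 1))      # -2^(N-1)
INT_MAX = (1 << (N - 1)) - 1   #  2^(N-1) - 1


def to_fraction(word: int, n: int = N) -> float:
    """Read an n-bit two's-complement word as a fraction in [-1, 1)."""
    return word / (1 << (n - 1))


def from_fraction(x: float, n: int = N) -> int:
    """Quantize x to the nearest n-bit fractional word, clamping at the range ends."""
    scaled = round(x * (1 << (n - 1)))
    return max(-(1 << (n - 1)), min((1 << (n - 1)) - 1, scaled))


print(to_fraction(INT_MIN))  # the most negative word reads as exactly -1.0
print(to_fraction(INT_MAX))  # the most positive word reads as just under 1.0
print(from_fraction(0.5))    # 0.5 maps to the word 2^(N-2)
```

The bit pattern is identical in both views; only the assumed position of the radix point changes, which is why one datapath can serve both addresses and signal samples.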

DSP Data Path: Precision
- Word size affects precision of fixed point numbers
- DSPs have 16-bit, 20-bit, or 24-bit data words
- Floating-point DSPs cost 2X - 4X vs. fixed point, and are slower than fixed point
- DSP programmers will scale values inside code
  - SW libraries
  - Separate explicit exponent
- "Blocked floating point": single exponent for a group of fractions
- Floating-point support simplifies development
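
Blocked floating point can be sketched as follows: one shared exponent is chosen from the largest magnitude in the block, and every value is stored as a fixed-point fraction relative to it. The 16-bit word size and the function names `block_quantize` and `block_restore` are assumptions made for illustration.

```python
import math


def block_quantize(values, n=16):
    """Store a block of numbers as n-bit fractions that share a single exponent."""
    peak = max(abs(v) for v in values)
    # frexp gives peak = m * 2**exp with 0.5 <= m < 1, so every v / 2**exp lies in (-1, 1)
    exp = 0 if peak == 0 else math.frexp(peak)[1]
    scale = 1 << (n - 1)
    mantissas = [
        max(-scale, min(scale - 1, round(v / 2 ** exp * scale))) for v in values
    ]
    return mantissas, exp


def block_restore(mantissas, exp, n=16):
    """Undo block_quantize (up to quantization error)."""
    scale = 1 << (n - 1)
    return [m / scale * 2 ** exp for m in mantissas]


block = [0.75, -3.0, 1.5]
mantissas, exp = block_quantize(block)
print(mantissas, exp)
print(block_restore(mantissas, exp))
```

Small values in a block dominated by one large value lose precision, which is the price of carrying one exponent instead of one per number.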

DSP Data Path: Overflow?
- DSPs are descended from analog: what should happen to the output when you "peg" an input? (e.g., turn up the volume control knob on a stereo)
  - Modulo arithmetic???
- Set to most positive ($2^{N-1}-1$) or most negative value ($-2^{N-1}$): "saturation"
- Many algorithms were developed in this model
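
The difference between modulo and saturating overflow is easiest to see side by side. A minimal sketch, assuming 16-bit two's-complement words; `wrap_add` and `sat_add` are illustrative names.

```python
N = 16  # assumed word size
LO, HI = -(1 << (N - 1)), (1 << (N - 1)) - 1


def wrap_add(a: int, b: int, n: int = N) -> int:
    """Modulo (wraparound) two's-complement addition, as on a general-purpose ALU."""
    s = (a + b) & ((1 << n) - 1)
    return s - (1 << n) if s >= (1 << (n - 1)) else s


def sat_add(a: int, b: int) -> int:
    """Saturating addition: clamp to the most positive or most negative value."""
    return max(LO, min(HI, a + b))


print(wrap_add(30000, 10000), sat_add(30000, 10000))
print(wrap_add(-30000, -10000), sat_add(-30000, -10000))
```

With wraparound, a loud positive signal flips to a large negative one; with saturation it clips, which is much closer to what an overdriven analog stage does.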

DSP Data Path: Multiplier
- Specialized hardware performs all key arithmetic operations in 1 cycle
- 50% of instructions can involve multiplier => single-cycle latency multiplier
- Need to perform multiply-accumulate (MAC)
- n-bit multiplier => 2n-bit product
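
Why an n-bit signed multiplier needs a 2n-bit product can be checked directly: the worst case is the most negative word multiplied by itself. The helper names below are hypothetical, and the word sizes are the ones quoted for DSP data words.

```python
def min_signed_bits(v: int) -> int:
    """Smallest two's-complement width that can hold v."""
    magnitude = v if v >= 0 else -v - 1
    return magnitude.bit_length() + 1


def worst_case_product_bits(n: int) -> int:
    """Width needed for (-2^(n-1)) * (-2^(n-1)), the largest-magnitude product."""
    most_negative = -(1 << (n - 1))
    return min_signed_bits(most_negative * most_negative)


for n in (16, 20, 24):
    print(n, worst_case_product_bits(n))
```

Every other pair of operands fits in 2n - 1 bits, so the full 2n-bit width is driven by that single corner case.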

DSP Data Path: Accumulator
- Don't want overflow or have to scale accumulator
- Option 1: accumulator wider than product: "guard bits"
  - Motorola DSP: 24b x 24b => 48b product, 56b accumulator
- Option 2: shift right and round product before adder
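
The value of guard bits can be checked numerically. The sketch below uses the 24-bit, 48-bit, and 56-bit widths quoted for the Motorola part and treats operands as plain signed integers; the helper `fits` is an illustrative name.

```python
WORD, PRODUCT, ACC = 24, 48, 56
guard_bits = ACC - PRODUCT


def fits(value: int, bits: int) -> bool:
    """True if value is representable as a bits-wide two's-complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


# Largest-magnitude product: most negative word times itself
worst = (-(1 << (WORD - 1))) ** 2

# Accumulate 2**guard_bits worst-case products
acc = 0
for _ in range(1 << guard_bits):
    acc += worst

print(guard_bits)
print(fits(2 * worst, PRODUCT))  # two worst-case products already overflow 48 bits
print(fits(acc, ACC))            # 256 of them still fit in the 56-bit accumulator
```

With g guard bits, at least 2^g products can be summed before overflow is possible, so a filter with up to 256 taps needs no intermediate scaling on such a datapath.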

DSP Data Path: Rounding
- Even with guard bits, will need to round when storing the accumulator into memory
- 3 DSP standard options
  - Truncation: chop results => biases results down (chopping a two's-complement value always moves it toward more negative)
  - Round to nearest: < 1/2 round down, >= 1/2 round up (more positive) => smaller bias
  - Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
    - IEEE 754 calls this round to nearest even
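
The three rounding options can be compared on integers by dropping the low `shift` bits of a wide accumulator value. This is a sketch under the assumption of two's-complement values and `shift >= 1`; the function names are illustrative.

```python
def truncate(acc: int, shift: int) -> int:
    """Chop the low bits (arithmetic shift, so the result moves toward -infinity)."""
    return acc >> shift


def round_nearest(acc: int, shift: int) -> int:
    """Add half an lsb, then chop: ties go toward more positive."""
    return (acc + (1 << (shift - 1))) >> shift


def round_convergent(acc: int, shift: int) -> int:
    """Like round_nearest, but an exact tie goes to the even neighbor."""
    half = 1 << (shift - 1)
    q = acc >> shift
    rem = acc & ((1 << shift) - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1
    return q


def total_error(rounder, shift: int = 2, count: int = 8) -> float:
    """Sum of (rounded - exact) over the inputs 0 .. count-1."""
    return sum(rounder(a, shift) - a / (1 << shift) for a in range(count))


for f in (truncate, round_nearest, round_convergent):
    print(f.__name__, total_error(f))
```

Over a full set of inputs the truncation error accumulates in one direction, round-to-nearest leaves a smaller residue from the ties, and convergent rounding cancels it.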

Data Path
DSP Processor
- Specialized hardware performs all key arithmetic operations in 1 cycle
- Hardware support for managing numeric fidelity:
  - Shifters
  - Guard bits
  - Saturation

General-Purpose Processor
- Multiplies often take >1 cycle
- Shifts often take >1 cycle
- Other operations (e.g., saturation, rounding) typically take multiple cycles

320C54x DSP Functional Block Diagram

FIR Filtering: A Motivating Problem
- M most recent samples in the delay line (Xi)
- New sample moves data down delay line
- "Tap" is a multiply-add
- Each tap (M+1 taps total) nominally requires:
  - Two data fetches
  - Multiply
  - Accumulate
  - Memory write-back to update delay line
- Goal: 1 FIR tap / DSP instruction cycle
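
The per-tap work listed above can be made concrete with a direct-form FIR step. This is a plain software sketch, not DSP code: `fir_step` is a hypothetical name, the delay line holds the M past samples newest first, and the coefficients are arbitrary example values.

```python
def fir_step(coeffs, delay_line, x_new):
    """Consume one input sample and return one output sample.

    coeffs has M+1 entries; delay_line holds the M most recent past samples
    (newest first) and is updated in place.
    """
    acc = coeffs[0] * x_new                 # tap 0 uses the new sample
    for c, x in zip(coeffs[1:], delay_line):
        acc += c * x                        # each tap: two fetches, multiply, accumulate
    if delay_line:
        delay_line.insert(0, x_new)         # write-back: new sample enters the delay line
        delay_line.pop()                    # the oldest sample falls off the end
    return acc


coeffs = [1, 2, 3]      # M + 1 = 3 taps
delay = [0, 0]          # M = 2 past samples
outputs = [fir_step(coeffs, delay, x) for x in (1, 0, 0, 0)]
print(outputs)          # an impulse in reproduces the coefficients
```

On a general-purpose machine every line of the inner loop is a separate instruction; the DSP goal is to fold the fetches, the multiply, the accumulate, and the delay-line update into one instruction per tap.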

BENCHMARKS - FIR FILTER
FINITE-IMPULSE RESPONSE FILTER
[Figure: finite-impulse response filter diagram]

Micro-architectural impact - MAC
Element of finite-impulse response filter computation
[Diagram labels: X, Y, MPY, ADD/SUB, ACC REG]

Mapping of the filter onto a DSP execution unit
- The critical hardware unit in a DSP is the multiplier; much of the architecture is organized around allowing use of the multiplier on every cycle
- This means providing two operands on every cycle, through multiple data and address busses, multiple address units, and local accumulator feedback
[Figure labels: 1, 2, 3, D, 5, 4, 6]

MAC Eg. - 320C54x DSP Functional Block Diagram

DSP Memory
- FIR tap implies multiple memory accesses
- DSPs want multiple data ports
- Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  - Instruction repeat buffer: do 1 instruction 256 times
  - Often disables interrupts, thereby increasing interrupt response time
- Some recent DSPs have instruction caches
  - Even then may allow programmer to "lock in" instructions into cache
  - Option to turn cache into fast program memory
- No DSPs have data caches
- May have multiple data memories

Conventional "Von Neumann" memory

HARVARD ARCHITECTURE in DSP
[Block diagram labels: PROGRAM MEMORY, X MEMORY, Y MEMORY, GLOBAL, P DATA, X DATA, Y DATA]

Memory Architecture
DSP Processor
- Harvard architecture
- 2-4 memory accesses/cycle
- No caches - on-chip SRAM
- [Diagram: Processor, Program Memory, Data Memory]

General-Purpose Processor
- Von Neumann architecture
- Typically 1 access/cycle
- May use caches
- [Diagram: Processor, Memory]

Eg. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture

Eg. 320C62x/67x DSP

DSP Addressing
- Have standard addressing modes: immediate, displacement, register indirect
- Want to keep MAC datapath busy
- Assumption: any extra instructions imply clock cycles of overhead in inner loop => complex addressing is good => don't use datapath to calculate fancy address
- Autoincrement/autodecrement register indirect
  - lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2 + 1
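
Autoincrement addressing can be modeled as a load that also bumps its pointer register, assuming the usual post-increment meaning of the `lw r1,0(r2)+` notation. The register file, memory, and the helper `load_postinc` below are hypothetical stand-ins.

```python
def load_postinc(memory, regs, dst, ptr, step=1):
    """Register-indirect load with post-increment, modeled as one operation."""
    regs[dst] = memory[regs[ptr]]   # dst <- M[ptr]
    regs[ptr] += step               # ptr <- ptr + step, done by the address unit


memory = [10, 20, 30, 40]
regs = {"r1": 0, "r2": 0}
acc = 0
for _ in range(len(memory)):
    load_postinc(memory, regs, "r1", "r2")
    acc += regs["r1"]               # the datapath only ever does useful arithmetic

print(acc, regs["r2"])
```

Without the addressing mode, each iteration would spend a separate datapath instruction on the pointer update, which is exactly the inner-loop overhead the slide argues against.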

DSP Addressing: FFT
- FFTs start or end with data in weird butterfly order
    0 (000) => 0 (000)
    1 (001) => 4 (100)
    2 (010) => 2 (010)
    3 (011) => 6 (110)
    4 (100) => 1 (001)
    5 (101) => 5 (101)
    6 (110) => 3 (011)
    7 (111) => 7 (111)
- What can be done to avoid the overhead of address checking instructions for FFT?
- Have an optional "bit reverse" addressing mode for use with autoincrement addressing
- Many DSPs have "bit reverse" addressing for radix-2 FFT
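
The index mapping in the table, and the way an address unit can step through it, can be sketched as below. `bit_reverse` reverses an index directly; `next_bit_reversed` models one common hardware approach (adding at the most significant bit and propagating the carry toward the lsb), which is an assumption about the implementation rather than a description of a specific DSP.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def next_bit_reversed(addr: int, bits: int) -> int:
    """Step to the next address in bit-reversed order using a reverse-carry add."""
    mask = 1 << (bits - 1)
    while mask and (addr & mask):
        addr ^= mask        # this bit was 1: clear it and carry toward the lsb
        mask >>= 1
    return addr | mask      # set the first 0 bit found (wraps to 0 after the last index)


order = [bit_reverse(i, 3) for i in range(8)]
print(order)

addr, walked = 0, []
for _ in range(8):
    walked.append(addr)
    addr = next_bit_reversed(addr, 3)
print(walked)
```

Both lists match the 8-point table, and the second needs no per-element index arithmetic in the datapath, only a modified increment in the address generator.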

BIT REVERSED ADDRESSING
[Figure: Data flow in the radix-2 decimation-in-time FFT algorithm]

DSP Addressing: Buffers
- DSPs dealing with continuous I/O
- Often interact with an I/O buffer (delay lines)
- To save memory, buffer often organized as circular buffer
- What can be done to avoid the overhead of address checking instructions for a circular buffer?
- Option 1: Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of buffer
- Option 2: Keep a buffer length register, assuming buffer starts on an aligned address, reset to start when reach end
- Every DSP has "modulo" or "circular" addressing

CIRCULAR BUFFERS
- Instructions accommodate three elements:
  - buffer address
  - buffer size
  - increment
- Allows for cycling through:
  - delay elements
  - coefficients in data memory
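
A modulo address update in the style of the buffer-length option can be sketched as follows. It assumes the buffer base is aligned to the next power of two at or above the buffer length, so the base can be recovered by masking the address; `circ_next` is an illustrative name.

```python
def circ_next(addr: int, step: int, length: int) -> int:
    """Advance addr by step, wrapping inside a `length`-word circular buffer.

    Assumes the buffer base is aligned to the next power of two >= length.
    """
    align = 1 << (length - 1).bit_length()
    base = addr & ~(align - 1)               # recover the buffer start from the address
    return base + (addr - base + step) % length


# A 5-word buffer at addresses 16..20 (16 is aligned to 8)
addr, visited = 18, []
for _ in range(5):
    addr = circ_next(addr, 1, 5)
    visited.append(addr)
print(visited)
print(circ_next(16, -1, 5))   # stepping backward from the start wraps to the end
```

The three inputs match the elements listed above: a buffer address, a buffer size, and an increment. No compare-and-branch instruction is needed to wrap.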

Addressing
DSP Processor
- Dedicated address generation units
- Specialized addressing modes; e.g.:
  - Autoincrement
  - Modulo (circular)
  - Bit-reversed (for FFT)
- Good immediate data support

General-Purpose Processor
- Often, no separate address generation unit
- General-purpose addressing modes

Address calculation unit for DSP
- Supports modulo and bit reversal arithmetic
- Often duplicated to calculate multiple addresses per cycle

DSP Instructions and Execution
- May specify multiple operations in a single instruction
- Must support Multiply-Accumulate (MAC)
- Need parallel move support
- Usually have special loop support to reduce branch overhead
  - Loop an instruction or sequence
  - 0 value in register usually means loop maximum number of times
  - When the loop count is calculated, must be sure a computed 0 is not meant as zero iterations
- May have saturating shift left arithmetic
- May have conditional execution to reduce branches

ADSP 2100: ZERO-OVERHEAD LOOP
DO UNTIL condition
Address Generation:
    PCS = PC + 1
    if (PC = x && ! condition)
        PC = PCS
    else
        PC = PC + 1
- Eliminates a few instructions in loops
- Important in loops with small bodies
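
The address-generation rule can be simulated to see which instructions are fetched. This is a toy sequencer, not the actual ADSP-2100 hardware: it assumes PCS holds the address of the first loop instruction, x is the address of the last loop instruction, and the termination condition is a hypothetical iteration counter reaching zero.

```python
def trace_pc(do_addr, last_addr, iterations, end_addr):
    """Return the sequence of fetched addresses for a program with one DO loop."""
    pc, pcs, remaining, trace = 0, None, 0, []
    while pc < end_addr:
        trace.append(pc)
        if pc == do_addr:
            pcs = pc + 1                  # PCS = PC + 1: remember the top of the loop
            remaining = iterations
            pc += 1
        elif pcs is not None and pc == last_addr:
            remaining -= 1
            condition = remaining == 0    # assumed meaning of "UNTIL condition"
            if not condition:
                pc = pcs                  # PC = PCS: jump back with no branch instruction
            else:
                pcs = None
                pc += 1                   # PC = PC + 1: fall out of the loop
        else:
            pc += 1
    return trace


# DO at address 1, loop body at addresses 2..3, three iterations, program ends at 5
print(trace_pc(do_addr=1, last_addr=3, iterations=3, end_addr=5))
```

Every fetched address inside the loop is a body instruction; the compare, decrement, and branch that a general-purpose loop would execute each iteration never appear in the trace.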

Instruction Set
DSP Processor
- Specialized, complex instructions
- Multiple operations per instruction
    mac x0,y0,a  x:(r0)+,x0  y:(r4)+,y0

General-Purpose Processor
- General-purpose instructions
- Typically only one operation per instruction
    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1

Specialized Peripherals for DSPs
- Synchronous serial ports
- Parallel ports
- Timers
- On-chip A/D, D/A converters
- Host ports
- Bit I/O ports
- On-chip DMA controller
- Clock generators
- On-chip peripherals often designed for "background" operation, even when core is powered down

Specialized peripherals

TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995

Summary of Architectural Features of DSPs
- Data path configured for DSP
  - Fixed-point arithmetic
  - MAC - Multiply-accumulate
- Multiple memory banks and buses
  - Harvard Architecture
  - Multiple data memories
- Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
- Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
- Specialized peripherals for DSP
- THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!
