Chapter 2-1: CPUs
Soo-Ik Chae
© 2007 Elsevier 1
Topics
CPU metrics.
Categories of CPUs.
CPU mechanisms.
High Performance Embedded Computing© 2007 Elsevier 2
Performance as a design metric
Performance = speed:
  Latency.
  Throughput.
Average vs. peak performance.
Worst-case and best-case performance.
Other metrics
Cost (area).
Energy and power.
Predictability: important for embedded systems.
  Pipelining: branch penalty.
  Memory system (cache): cache miss penalty.
Security: difficult to measure, because not knowing of a successful attack does not show that a system is secure.
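To make the predictability point concrete, here is a minimal sketch of how branch and cache miss penalties separate average-case from worst-case cycles per instruction (CPI). The function names and every number below are illustrative assumptions, not values from the slides.

```python
def average_cpi(base_cpi, branch_frac, mispredict_rate, branch_penalty,
                mem_frac, miss_rate, miss_penalty):
    """Expected cycles per instruction, including branch and cache-miss stalls."""
    branch_stalls = branch_frac * mispredict_rate * branch_penalty
    cache_stalls = mem_frac * miss_rate * miss_penalty
    return base_cpi + branch_stalls + cache_stalls


def worst_case_cpi(base_cpi, branch_frac, branch_penalty, mem_frac, miss_penalty):
    """Pessimistic bound: every branch pays the penalty, every access misses."""
    return average_cpi(base_cpi, branch_frac, 1.0, branch_penalty,
                       mem_frac, 1.0, miss_penalty)


# Hypothetical workload: 20% branches, 30% memory accesses.
avg = average_cpi(1.0, 0.2, 0.1, 3, 0.3, 0.05, 20)
worst = worst_case_cpi(1.0, 0.2, 3, 0.3, 20)
print(f"average CPI ~ {avg:.2f}, worst-case CPI ~ {worst:.2f}")
```

The gap between the two figures is the reason a design with a good average can still be hard to use under a hard deadline.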
Flynn’s taxonomy of processors
Single-instruction single-data (SISD): RISC, etc.
Single-instruction multiple-data (SIMD): all processors perform the same operations.
Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor.
Multiple-instruction single-data (MISD).
Other axes of comparison
RISC.
  Emphasis on software.
  Single-cycle, simple instructions.
  Register-to-register: "LOAD" and "STORE" are independent instructions.
  Low cycles per second, large code sizes.
  Spends more transistors on memory registers.
CISC.
  Emphasis on hardware.
  Multi-cycle, complex instructions.
  Memory-to-memory: "LOAD" and "STORE" incorporated in instructions.
  High cycles per second, small code sizes.
  Transistors used for storing complex instructions.
RISC                                 CISC
1. 1-cycle simple instructions       1. Multi-cycle complex instructions
2. Only LD/ST can access memory      2. Any instruction may access memory
3. Designed around pipeline          3. Designed around instn. set
4. Instns. executed by h/w           4. Instns. interpreted by micro-program
5. Fixed format instns.              5. Variable format instns.
6. Few instns. and modes             6. Many instns. and modes
7. Complexity in the compiler        7. Complexity in the micro-program
8. Multiple register sets            8. Single register set
Other axes of comparison
Instruction issue width.
  Single issue.
  Multiple issue: higher performance, high cost, increased power consumption.
Scheduling for multiple-issue machines.
  Static scheduling: VLIW.
  Dynamic scheduling: superscalar.
Vector processing: instructions for 1D or 2D arrays.
Multithreading: a fine-grained concurrency mechanism that allows the processor to quickly switch between several threads of execution.
Embedded vs. general-purpose processors
Embedded processors may be optimized for a category of applications.
  Must be flexible.
  Customization may be narrow or broad.
  Billions of 8-bit processors sold each year.
  100s of millions of 32-bit processors for embedded systems.
We may judge embedded processors using different metrics:
  Code size.
  Memory system performance.
  Predictability.
ARM Processor Family
Processor family   # of pipeline stages   Memory organization    Clock rate   MIPS/MHz
ARM6               3                      Von Neumann            25 MHz
ARM7               3                      Von Neumann            66 MHz       0.9
ARM8               5                      Von Neumann            72 MHz       1.2
ARM9               5                      Harvard                200 MHz      1.1
ARM10              6                      Harvard                400 MHz      1.25
StrongARM          5                      Harvard                233 MHz      1.15
ARM11              8                      Von Neumann/Harvard    550 MHz      1.2
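The two rightmost columns of the table combine into a rough MIPS rating (clock rate times MIPS/MHz). The sketch below simply multiplies the table entries. The dictionary layout and function name are assumptions for illustration, and ARM6 is left out because the table gives no MIPS/MHz figure for it.

```python
# (clock rate in MHz, MIPS per MHz), copied from the table above.
ARM_FAMILY = {
    "ARM7": (66, 0.9),
    "ARM8": (72, 1.2),
    "ARM9": (200, 1.1),
    "ARM10": (400, 1.25),
    "StrongARM": (233, 1.15),
    "ARM11": (550, 1.2),
}


def peak_mips(clock_mhz, mips_per_mhz):
    """Rough MIPS rating implied by the clock rate and the MIPS/MHz figure."""
    return clock_mhz * mips_per_mhz


ratings = {name: peak_mips(*row) for name, row in ARM_FAMILY.items()}
for name, mips in ratings.items():
    print(f"{name:10s} {mips:7.1f} MIPS")
```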
ARM Architecture Version Summary
Core                                  Version   Feature
ARM1020T                              v5T       Improved ARM/Thumb interworking; CLZ instruction for improved division
ARM9E-S, ARM10TDMI, ARM1020E          v5TE      Extended multiplication and saturated maths for DSP-like functionality
ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S     v5TEJ     Jazelle Technology for Java acceleration
ARM11, ARM1136J-S                     v6        Low power needed; SIMD (Single Instruction Multiple Data) media processing extensions
J: Jazelle
E: Enhanced DSP instruction
S: Synthesizable
F: integral vector floating point unit
ARM7 3-stage pipeline organization
Organization:
  Address generating block
    Address register
    Incrementer
  Register bank
    31 GPRs, 6 PSRs
    2 read, 1 write ports
    Additional 1 read, 1 write port for PC
  Barrel shifter
  ALU
  Data register
  Control logic
    External interface
    Instruction decoder
    Datapath control
[Figure: ARM7 datapath block diagram. Labels: A[31:0], address register, incrementer, PC, register bank, multiply register, barrel shifter, ALU, A bus, B bus, ALU bus, instruction decode & control, data out register, data in register, D[31:0].]
ARM7 3-stage Pipeline
ARM7 family has 3-stage pipeline.
3-stage pipeline:
  Fetch
    Instruction fetch from memory
  Decode
    Instruction decoding
    Datapath control signals for the next cycle
  Execute
    Reading registers
    Shift and ALU operations
    Writing back to the register bank
[Figure: pipeline timing]
  PC      F D E
  PC+i      F D E
  PC+2i       F D E
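The overlap in the timing diagram can be written down as a small schedule generator. This is a sketch of an ideal pipeline with no stalls or multi-cycle instructions. The names `pipeline_schedule` and `total_cycles` are made up for illustration.

```python
STAGES = ["F", "D", "E"]  # fetch, decode, execute


def pipeline_schedule(n_instructions, stages=STAGES):
    """Map each instruction index to {cycle: stage} for an ideal pipeline."""
    return {i: {i + s: name for s, name in enumerate(stages)}
            for i in range(n_instructions)}


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions: fill the pipe once, then one per cycle."""
    return n_stages + n_instructions - 1 if n_instructions else 0


schedule = pipeline_schedule(3)
for i, slots in schedule.items():
    row = "".join(f"{slots.get(c, '.'):>2}" for c in range(total_cycles(3)))
    print(f"instr {i}:{row}")
```

Three instructions finish in 5 cycles rather than 9, which is the throughput gain the diagram illustrates.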
ARM7 multi-cycle instructions
[Figure: pipeline timing, instruction vs. time, for five instructions]
  1  ADD  fetch, decode, execute
  2  STR  fetch, decode, calc. addr., data xfer
  3  ADD  fetch, decode, execute
  4  ADD  fetch, decode, execute
  5  ADD  fetch, decode, execute
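ARM7 multi-cycle instructions
Branch, LDR
[Figure: pipeline timing for LDR and branch]
  LDR: F, D, E1 (calc), E2 (xfer), E3 (move); following instructions: F, D, E.
  B: F, D, E1 (calc), E2 (link), E3 (adjust); the fetches at PC+i and PC+2i are discarded; the instructions at T and T+i proceed as F, D, E.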
ARM9TDMI core
[Figure: 5-stage pipeline timing (F D E M W) for LDR and branch sequences; the branch row reads B F D E1 E2 E3 M W.]
Separated caches: instruction and data caches are accessible at the same time.
ARM11 8-stage pipeline
Branch prediction and return stack.
Separate processing units for the ALU, MAC, and load-store (LS) instructions, although the pipeline is single issue.
Feature Comparison
Feature                      ARM9E             ARM10E            XScale              ARM11
Architecture                 ARMv5TE(J)        ARMv5TE(J)        ARMv5TE             ARMv6
Pipeline length              5                 6                 7~8                 8
Java decode                  (ARM926EJ)        (ARM1026EJ)       No                  Yes
V6 SIMD instructions         No                No                No                  Yes
MIA instructions             No                No                Yes                 Available as coprocessor
Branch prediction            No                Static            Dynamic             Dynamic
Independent load-store unit  No                Yes               Yes                 Yes
Instruction issue            Scalar, in-order  Scalar, in-order  Scalar, in-order    Scalar, in-order
Concurrency                  None              ALU/MAC, LSU      ALU, MAC, LSU       ALU, MAC, LSU
Out-of-order completion      No                Yes               Yes                 Yes
Target implementation        Synthesizable     Synthesizable     Custom chip         Synthesizable and hard macro
Performance range            Up to 250 MHz     Up to 325 MHz     200 MHz ~ > 1 GHz   350 MHz ~ > 1 GHz
MIPS32 processor family
MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
PowerPC processor family
PowerPC: 400 series includes several embedded processors; MPC7410 is a two-issue machine; 970FX has 16-stage pipeline.
What is DSP?
DSP = Digital Signal Processing
OR
DSP = Digital Signal Processor?
DSP is used to denote both; the meaning can be deduced from the context in which the term DSP is used.
What is a Digital Signal Processor (DSP)?
Microprocessor specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, multiply & accumulate).
DSP performance
Wireless systems require higher and higher performance and bandwidth.
[Figure: DSP performance vs. bit rate by wireless generation]
Generation   Performance      Bit rate
2G           ~100 MIPS        8-13 Kbps
2.5G         ~10,000 MIPS     64-384 Kbps
3G           ~100,000 MIPS    384-2000 Kbps
Might not be enough for future applications.
Digital signal processors
First DSP was AT&T DSP16:
  Hardware multiply-accumulate unit.
  Harvard architecture.
Today, DSP is often used as a marketing term.
Modern DSPs are heavily pipelined.
TMS320C55x™ DSP Generation, 16-bit Fixed Point – Most Power Efficient DSP
Specifications and features:
• C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance
• Advanced automatic power management
• 1.6-volt core and 3.3-volt peripherals
• Configurable idle domains to extend your battery life
• Shortened debug for faster time-to-market
• 144-MHz/200-MHz clock rate
• 256-KB RAM, 64-KB ROM
• Three McBSPs, I2C, watchdog timer, general-purpose timers
• USB 2.0 full-speed (12 Mbps)
• 10-bit ADC
• Real-time clock (RTC)
Applications:
• Feature-rich, miniaturized personal and portable products
• 2G, 2.5G and 3G cell phones and basestations
• Digital audio players
• Digital still cameras
• Electronic books
• Voice recognition
• GPS receivers
• Fingerprint/pattern recognition
• Wireless modems
• Headsets
• Biometrics
TMS320C55x™ DSP + RISC, 16-bit Fixed Point – OMAP Processor
Specifications and features:
• Dual CPU processor integrating a TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz
• 1.8-volt core and 1.8-volt peripherals
150-MHz TI-enhanced ARM925
• 16 KB instruction cache and 8 KB data cache
• Data and instruction MMUs
• 32-bit and 16-bit instruction sets
150-MHz TMS320C55x™ DSP
• 12 KW (24 KB) instruction cache
• 80 KW (160 KB) SRAM
• 16 KW (32 KB) ROM
• Two 16-bit memory interfaces for SDRAM and flash
• Nine-channel system DMA controller
• LCD controller
• USB 1.1 host and client
• MMC/SD card interface
• Seven serial ports plus three UARTs, nine timers, keyboard interface
• Less than 250 mW at 1.6 V
Applications:
• Internet appliances
• Applications processing
• Enhanced gaming
• Webpad
• Point-of-sale
• Medical devices
• Industry-specific PDAs
• Telematics
• Digital media processing
• Military and government cellular
TMS320C62x™ DSP Generation, 16-bit Fixed Point – High Performance DSP
Specifications and features:
• 16-bit fixed-point DSPs
• Up to 2400 MIPS, running at 300 MHz
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units; thirty-two 32-bit registers
• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance
Applications:
• Pooled modems
• Digital Subscriber Line (xDSL)
• Wireless basestations
• Central office switches
• Private Branch Exchange (PBX)
• Digital imaging
• Call processing
• 3D graphics
• Speech recognition
• Voice over packet
TMS320C67x™ DSP Generation, 32-bit Floating Point – High Performance DSP
Specifications and features:
• 32-bit floating point DSPs
• Up to 1350 MFLOPS, running at 225 MHz
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units; thirty-two 32-bit registers
• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance
• IEEE floating-point format
• Two new multi-channel serial ports (McASP) (C6713 DSP) can support up to stereo channels of I2S (Inter IC Sound) and are compatible with the S/PDIF transmit protocol. Note: I2S is a protocol for transmitting 2 channels of digital audio over a single serial connection.
Applications:
• Pooled modems
• Digital Subscriber Line (xDSL)
• Wireless basestations
• Central office switches
• Private Branch Exchange (PBX)
• Digital imaging
• Call processing
• 3D graphics
• Speech recognition
• Voice over packet
TMS320C64x™ DSP Generation, 16-bit Fixed Point – High Performance DSP
Specifications and features:
• 16-bit fixed point processor
• TMS320C64x DSP high performance core provides scalable performance of up to 1.1 GHz
• The industry’s fastest DSPs with up to 600 MHz (4800 MIPS) performance
• C64x DSPs are software compatible with TI’s C62x™ DSPs
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units; thirty-two 32-bit registers
• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance
Applications:
• DSL and pooled modems
• Basestation transceivers
• Wireless LAN
• Enterprise PBX
• Multimedia gateway
• Broadband video transcoders
• Streaming video servers and clients
• High-speed raster image processing (RIP)
Example: TI C5x DSP
40-bit arithmetic unit (32-bit values with 8 guard bits).
Barrel shifter.
17 x 17 multiplier.
Comparison unit for Viterbi encoding/decoding.
Single-cycle exponent encoder for wide-dynamic-range arithmetic.
Two address generators.
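The purpose of the 8 guard bits shows up in a multiply-accumulate loop: a running sum of 16-bit products can exceed 32 bits long before it exceeds 40. The sketch below models the accumulator as a wrapping two's-complement integer. It is an illustrative model only, not the C5x instruction semantics, and the helper names are assumptions.

```python
ACC_BITS = 40  # 32-bit value plus 8 guard bits, as on the slide


def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(xs, ys, acc_bits=ACC_BITS):
    """Multiply-accumulate sample pairs into an acc_bits-wide accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


samples = [32767] * 4            # largest positive 16-bit value, four taps
wide = mac(samples, samples)     # 40-bit accumulator keeps the true sum
narrow = mac(samples, samples, acc_bits=32)  # 32-bit accumulator wraps
print(wide, narrow)
```

With only four maximal products the 32-bit accumulator has already wrapped negative, while the 40-bit one still holds the exact sum.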
TI C55x microarchitecture
TI C55x co-processors
Designed to support:
  Pixel interpolation.
  Motion estimation.
  DCT/IDCT computation.
Interpolates U, M, R values given A, B, C, D pixels.
[Figure: pixel grid with rows A U B; M R; C D.]
Fixed Point vs. Floating Point
Floating point applications:
• Modems
• Digital Subscriber Line (DSL)
• Wireless basestations
• Central office switches
• Private Branch Exchange (PBX)
• Digital imaging
• 3D graphics
• Speech recognition
• Voice over IP
Fixed point applications:
• Portable products
• 2G, 2.5G and 3G cell phones
• Digital audio players
• Digital still cameras
• Electronic books
• Voice recognition
• GPS receivers
• Headsets
• Biometrics
• Fingerprint recognition
Simple VLIW architecture
Powerful compiler.
A packet of instructions.
Large register file with multiple ports feeds multiple function units.
Example packet: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
[Figure: E box containing a register file that feeds ALU, ALU, Load/store, Load/store, and FU.]
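One way to picture a packet is as a group of operations that all read the register file before any of them writes it. The toy interpreter below sketches that idea for the Add and Sub slots only. The destination-first operand order and the read-then-write rule are assumptions made for illustration, since the slide does not specify either.

```python
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}


def execute_packet(regs, packet):
    """Run one packet: read every source first, then commit every result."""
    results = []
    for op, dst, src1, src2 in packet:
        if op == "nop":
            continue  # an empty slot does no work
        results.append((dst, OPS[op](regs[src1], regs[src2])))
    new_regs = dict(regs)
    for dst, value in results:
        new_regs[dst] = value
    return new_regs


regs = {"r1": 0, "r2": 5, "r3": 7, "r4": 0, "r5": 10, "r6": 4}
packet = [
    ("add", "r1", "r2", "r3"),   # Add r1,r2,r3
    ("sub", "r4", "r5", "r6"),   # Sub r4,r5,r6
    ("nop", None, None, None),   # NOP fills an unused slot
]
after = execute_packet(regs, packet)
print(after["r1"], after["r4"])
```

The NOP slot reflects a cost of static scheduling: when the compiler finds no independent operation, the slot is wasted.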
Clustered VLIW architecture
Register file, function units divided into clusters.
[Figure: two clusters, each with its own register file and execution units, connected by a cluster bus.]
Parallelism extraction in VLIW
Static:
  Use compiler to analyze program.
  Simpler CPU.
  Can make use of high-level language constructs.
  Can’t depend on data values.
Dynamic:
  Use hardware to identify opportunities.
  More complex CPU.
  Can make use of data values.
Motorola Starcore SC140
DALU includes 4 ALUs, 1 register file.
AGU includes 2 address arithmetic units (AAU), 1 address register file.
Program sequencer and control unit (PSEQ).
Performance:
  4 MACs per cycle.
  10 RISC MIPS per MHz clock.
SC140 Core
[Figure: SC140 core block diagram. Labels: program sequencer, address register file, data register file, power mgt, clock/PLL, AGU (2 AAUs, BMU), DALU (4 ALUs).]
Typical SC140 configuration
[Figure: SC140 system block diagram. Labels: level 1 memory expansion (RAM, ROM); DMA, cache, interrupts, level 2 mem, etc.; program sequencer; ALUs; AAUs; peripherals.]
Instruction format
16-bit instructions.
Up to 6 instructions per cycle.
Instructions are grouped to define allowable simultaneous operations.
Example group:
  DALU: MACR -D0,D1,D7   AND D4,D5
  AGU:  MOVE.L (R0),+N0,R6   ADDA R2,R3
AGU addressing
Allowable addressing modes:
  Linear: useful for general purpose addressing.
  Modulo: useful for FIFO queues.
  Reverse-carry: useful for FFT.
Automatic updating during register indirect.
Stack.
Array addressing: base, offset, modifier registers.
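The modulo and reverse-carry modes can be sketched as address update functions. This is a software model of the general techniques (circular buffering and bit-reversed indexing), not the SC140's actual AGU encoding, and the function names and parameters are assumptions.

```python
def modulo_next(addr, step, base, length):
    """Advance an address inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length


def reverse_carry_add(index, increment, bits):
    """Add with carries propagating from the MSB toward the LSB."""
    def rev(v):
        # Reverse the low `bits` bits of v.
        return int(format(v, f"0{bits}b")[::-1], 2)
    return rev((rev(index) + rev(increment)) % (1 << bits))


# FIFO pointer walking a 4-entry buffer that starts at address 100.
addr, fifo = 100, []
for _ in range(5):
    fifo.append(addr)
    addr = modulo_next(addr, 1, base=100, length=4)

# 8-point FFT: stepping by N/2 = 4 with reverse carry gives bit-reversed order.
idx, order = 0, []
for _ in range(8):
    order.append(idx)
    idx = reverse_carry_add(idx, 4, bits=3)

print(fifo)
print(order)
```

Doing these updates in the address generator means a FIFO or FFT inner loop spends no data-path instructions on pointer wrapping or index shuffling.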
TM-1 characteristics
27 function units
Characteristics:
  5 RISC operations/sec.
  Floating point support.
  Sub-word parallelism support.
  Guarded operation (if-conversion).
  Additional custom operations.
TM-1 VLIW CPU
[Figure: register file connected through a read/write crossbar to function units FU1 ... FU27; instruction with slots 1 to 5.]
Trimedia TM-1
[Figure: TM-1 block diagram. Labels: memory interface, video in, video out, audio in, audio out, I2C, serial, timers, image co-p, VLD co-p, PCI, VLIW CPU.]
Superscalar processors
Instructions are dynamically scheduled.
  Dependencies are checked at run time in hardware.
Used to some extent in embedded processors.
  Embedded Pentium is two-issue in-order.
SIMD and subword parallelism
Many special-purpose SIMD machines.
Subword parallelism is widely used for video.
  ALU is divided into subwords for independent operations on small operands.
Vector processing is widely used for integer values.
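Subword parallelism can be imitated in software by stopping carries at lane boundaries. The sketch below adds four unsigned 8-bit lanes packed into one 32-bit word. It is a generic masking trick shown for illustration, not any particular processor's instruction, and `packed_add_u8` is a made-up name.

```python
LOW7 = 0x7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH = 0x80808080   # top bit of every 8-bit lane


def packed_add_u8(a, b):
    """Add four 8-bit lanes of two 32-bit words, wrapping within each lane."""
    # Add the low 7 bits of each lane; the carry stays inside the lane.
    partial = (a & LOW7) + (b & LOW7)
    # Fold in the top bits with XOR so no carry crosses into the next lane.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


pixels = 0x01FF7F10
delta = 0x01010101
result = packed_add_u8(pixels, delta)
print(hex(result))
```

A subword ALU does the same thing in hardware by cutting the carry chain, so one 32-bit add handles four pixels at once.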
SIMD Extensions
Multithreading
Low-level parallelism mechanism.
Hardware multithreading alternately fetches instructions from separate threads.
Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
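The two fetch policies differ only in how many threads supply instructions in a given cycle. The sketch below lists which thread IDs are fetched in each cycle under a simple round-robin assumption. Real fetch policies are more elaborate, and both function names are hypothetical.

```python
def alternate_fetch(n_threads, n_cycles):
    """Hardware multithreading: one thread supplies the fetch in each cycle."""
    return [[cycle % n_threads] for cycle in range(n_cycles)]


def smt_fetch(n_threads, n_cycles, width):
    """SMT: up to `width` fetch slots per cycle, shared round-robin by threads."""
    return [[(cycle * width + slot) % n_threads for slot in range(width)]
            for cycle in range(n_cycles)]


coarse = alternate_fetch(n_threads=2, n_cycles=4)
smt = smt_fetch(n_threads=2, n_cycles=2, width=2)
print(coarse)
print(smt)
```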
Processor Resource Utilization
Processor choice depends on program characteristics.
Leverage our knowledge of the core algorithms
Many researchers assume that multimedia algorithms exhibit high levels of parallelism.
Experiments with SimpleScalar show that this is not the case.
Most applications exhibit fewer than 4 IPC.
Available parallelism in multimedia applications (Talla et al.)
Dynamic behavior of loops in MediaBench (Fritts)
Path ratio = (instructions executed per iteration) / (total number of loop instructions).
MediaBench shows small path ratio -> considerable conditional behavior in loops.
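The path ratio definition translates directly into a small calculation over a per-iteration instruction count. The trace values below are invented for illustration and are not MediaBench data. Averaging over iterations is also an assumption about how the per-iteration count is summarized.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Mean instructions executed per iteration over the loop's static size."""
    mean_executed = sum(executed_per_iteration) / len(executed_per_iteration)
    return mean_executed / loop_instructions


# Hypothetical loop body of 20 instructions observed over three iterations.
ratio = path_ratio([8, 12, 10], loop_instructions=20)
print(ratio)
```

A ratio well below 1 means each iteration skips much of the loop body, which is the conditional behavior the slide refers to.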
Operand characteristics in MediaBench
[Figure: operand characteristics chart; surviving labels: "More than 10", "78%".]
Operand characteristics in Video Codecs
Dynamic voltage scaling (DVS)
Power scales with V^2 while performance scales roughly as V.
Reduce operating voltage, add parallel operating units to make up for lower clock speed.
DVS doesn’t work in high-leakage processors.
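The voltage and parallelism trade-off can be checked with relative numbers. The sketch below treats the V^2 term as energy per operation and speed as proportional to V, which is a simplifying assumption. It ignores leakage and the overhead of the extra units, and all function names are illustrative.

```python
def relative_throughput(v_scale, units=1):
    """Throughput relative to one unit at nominal voltage (speed ~ V)."""
    return units * v_scale


def relative_energy_per_op(v_scale):
    """Energy per operation relative to nominal voltage (energy ~ V^2)."""
    return v_scale ** 2


def relative_power(v_scale, units=1):
    """Power = operations per unit time x energy per operation."""
    return relative_throughput(v_scale, units) * relative_energy_per_op(v_scale)


# Halve the voltage and double the number of units.
throughput = relative_throughput(0.5, units=2)
power = relative_power(0.5, units=2)
print(throughput, power)
```

Under these assumptions the two half-voltage units match the original throughput at a quarter of the power. Leakage that does not shrink with V would erode this saving, consistent with the last point above.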
Dynamic voltage and frequency scaling (DVFS)
Scale both voltage and clock frequency.
Can use control algorithms to match performance to application, reduce power.
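A very simple control algorithm picks the slowest operating point that still meets the application's demand. The sketch below is one such policy. The operating-point table is hypothetical, and a real governor would also account for transition latency and load prediction.

```python
# Hypothetical (clock in MHz, supply voltage in V) operating points.
OPERATING_POINTS = [(200, 0.9), (400, 1.0), (600, 1.1), (800, 1.2)]


def required_mhz(cycles_per_frame, frames_per_second):
    """Clock rate needed to finish every frame within its period."""
    return cycles_per_frame * frames_per_second / 1e6


def pick_operating_point(demand_mhz, points=OPERATING_POINTS):
    """Return the slowest point that meets the demand, else the fastest one."""
    for mhz, volts in sorted(points):
        if mhz >= demand_mhz:
            return mhz, volts
    return max(points)


demand = required_mhz(cycles_per_frame=10_000_000, frames_per_second=30)
chosen = pick_operating_point(demand)
print(demand, chosen)
```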
Razor architecture
Uses a specialized latch to detect errors.
Recovers only on errors, gains average-case performance.