Computer and Digital System Architecture
EE/CpE-517-A
Bruce McNair
EE/CpE517A Copyright ©2011 Stevens Institute of Technology - All rights reserved 1-1/39
Week 8
ARM processor cores
Furber Ch. 9
FPGA architecture
I/O pin
Switch block
Interconnects
Logic blocks
FPGA logic block
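[Figure: logic block. The inputs drive a lookup table (LUT) whose output feeds a flip-flop (FF); a 2:1 multiplexer (inputs labeled 1 and 0) selects the block output. Configuration bits (config) set the LUT contents and the multiplexer select.]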
FPGA LUT
2-input LUT example
Inputs   Output function
A  B     AND  OR  XOR  NAND  NOR  ...
0  0     0    0   0    1     1
0  1     0    1   1    1     0
1  0     0    1   1    1     0
1  1     1    1   0    0     0
A 2-input LUT can implement 16 logical functions
Note: Xilinx Virtex-7 FPGAs provide 6-input LUTs with up to ~2M logic cells
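The count of 16 functions follows from the LUT holding one configuration bit per input combination (4 bits for 2 inputs, so 2^4 settings). A minimal sketch of that idea is below; the bit ordering (index = 2*A + B) and the helper names are assumptions made for illustration, not a description of any vendor's LUT format.

```python
# Sketch: model a 2-input LUT as a 4-bit configuration word.
# Assumed convention: bit i of `config` is the output for inputs (A, B)
# where i = (A << 1) | B. Real devices may order the bits differently.

def lut2(config: int, a: int, b: int) -> int:
    """Return the LUT output for inputs a and b (each 0 or 1)."""
    index = (a << 1) | b
    return (config >> index) & 1


def config_from_function(func) -> int:
    """Build the 4-bit configuration word from a function of (a, b)."""
    config = 0
    for a in (0, 1):
        for b in (0, 1):
            config |= (func(a, b) & 1) << ((a << 1) | b)
    return config


# Configuration words for the functions listed in the truth table.
AND_CFG = config_from_function(lambda a, b: a & b)
OR_CFG = config_from_function(lambda a, b: a | b)
XOR_CFG = config_from_function(lambda a, b: a ^ b)
NAND_CFG = config_from_function(lambda a, b: 1 - (a & b))
NOR_CFG = config_from_function(lambda a, b: 1 - (a | b))

# 4 table entries, each independently 0 or 1, give 2**4 = 16 functions.
NUM_FUNCTIONS = 2 ** (2 ** 2)
```

Changing only the configuration word changes the logic function, which is why the same physical block can serve as any of the 16 gates.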
ASIC/FPGA development
Design entry: schematic design or HDL design
Design optimization
Mapping: FPGA macrocell mapping or ASIC standard cell mapping
Placement
Routing
Final step: programming (FPGA) or mask generation (ASIC)
Design iterations
FPGA placement/routing
ASIC/FPGA development
Soft core
Hard core
Typical ARM core designs
ARM core
Cache
Memory management
Signal processing
Interface logic
ARM cores
ARM7TDMI example core
ARM7: 32-bit integer core, 3-stage pipeline, 3.3 V logic
T: optional use of the Thumb 16-bit compressed instruction set
D: on-chip JTAG debug support
M: multiplier with 64-bit result
I: EmbeddedICE support
Applications:
D-Link ADSL router
Apple iPod
Lego Mindstorms NXT
Nokia cellular phones
Nintendo DS
Game Boy Advance
Roomba 500 series
Sirius Satellite Radio
Automotive systems
ARM7TDMI organization
[Figure: the ARM processor core surrounded by the EmbeddedICE module, a bus splitter (D[31:0] split into Din[31:0] and Dout[31:0]) and the JTAG TAP controller (TCK, TMS, TRST, TDI, TDO), linked by scan chains 0, 1 and 2. Core signals shown: A[31:0], mas[1:0], mreq, trans, opc, r/w, extern0, extern1 and other signals.]
ARM7TDMI core interface signals
[Figure: complete ARM7TDMI core interface. Signal groups: clock control, configuration, interrupts, initialization, bus control, power, debug, JTAG controls, TAP information, boundary scan extension, memory interface, MMU interface, state and coprocessor interface. The slides that follow list the signals in each group.]
ARM7TDMI core interface signals
Memory interface: A[31:0], Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock
MMU interface: trans, mode[4:0], abort
State: Tbit
trans: translation control for user/supervisor mode
mode: CPSR[4:0] bits (processor mode)
abort: disallowed access
Tbit: ARM or Thumb instruction set
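To make the mode[4:0] and Tbit outputs concrete, the sketch below decodes a CPSR value in software. The mode encodings and the position of the T flag (bit 5) are taken from the ARM architecture rather than from these slides, and the function name is a hypothetical helper.

```python
# Sketch: decode the processor mode and instruction-set state from a CPSR value.
# Assumption (from the ARM architecture, not from the slides): the mode field
# is CPSR[4:0] with the encodings below, and the T flag is CPSR bit 5.

MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr: int):
    """Return (mode name, instruction set) for a 32-bit CPSR value."""
    mode_bits = cpsr & 0x1F            # CPSR[4:0], driven on mode[4:0]
    thumb = bool((cpsr >> 5) & 1)      # T flag, driven on Tbit
    mode = MODES.get(mode_bits, "Reserved")
    return mode, "Thumb" if thumb else "ARM"
```

For example, `decode_cpsr(0xD3)` describes a core in Supervisor mode executing ARM instructions, and `decode_cpsr(0x30)` a core in User mode executing Thumb instructions.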
ARM7TDMI memory interface timing
mclk
A[31:0], r’/w, mas, lock, trans’, opc’
Din[31:0]
Dout[31:0]
abort
mreq’, seq
enout’
ARM7TDMI core interface signals
TAP information: tapsm[3:0], ir[3:0], tdoen, tck1, tck2, screg[3:0]
Boundary scan extension: drivebs, ecapclkbs, icapclkbs, highz, pclkbs, rstclkbs, sdinbs, sdoutbs, shclkbs, shclk2bs
JTAG controls: TRST, TCK, TMS, TDI, TDO
TAP: additional scan chains can be added to JTAG
Boundary scan extension: allows for additional JTAG paths
ARM7TDMI core interface signals
Clock control: mclk, wait, eclk
Configuration: bigend
Interrupts: irq, fiq, isync
Initialization: reset
Bus control: enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk
bigend: memory access mode (big-endian or little-endian)
isync: interrupt latency can be reduced if the interrupt inputs are already synchronized externally
reset: start execution at address 0x00000000
enout: ARM is performing a write cycle
ape: controls a latch that retimes addresses if external logic needs it
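The effect of the bigend setting can be illustrated in software by laying out the same 32-bit word both ways. This sketch uses Python's standard struct module purely to show the byte order; it does not model the core's bus behaviour, and the example word is arbitrary.

```python
# Sketch: byte layout of one 32-bit word under the two memory access modes.
import struct

WORD = 0x11223344  # arbitrary example value

# Little-endian: least significant byte at the lowest address.
little = struct.pack("<I", WORD)
# Big-endian: most significant byte at the lowest address.
big = struct.pack(">I", WORD)


def byte_at(word: int, offset: int, big_endian: bool) -> int:
    """Return the byte stored at `offset` (0..3) within the word's address."""
    layout = struct.pack(">I" if big_endian else "<I", word)
    return layout[offset]
```

Reading byte offset 0 returns 0x44 in little-endian mode and 0x11 in big-endian mode, so software and peripherals must agree on the setting.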
ARM7TDMI core interface signals
Debug: dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx
Coprocessor interface: opc, cpi, cpa, cpb
Power: Vdd, Vss
power: +5 V or +3 V power supply
ARM7TDMI hard core
ARM7TDMI standard core characteristics:
Process: 350 nm CMOS
Transistors: 74,209
Performance: 60 MIPS
Core area: 2.1 mm²
Power: 87 mW @ 3.3 V
Efficiency: 690 MIPS/W
Clock: 0-66 MHz
ARM7TDMI implementations: 250 nm CMOS process, 0.9 V, 12,000 MIPS/W
ARM7TDMI-S: synthesizable core
Improving performance
Baseline: external memory connected to an FPGA/ASIC containing the ARM7 core
Improved: external memory connected to an FPGA/ASIC containing the ARM7 core and a memory cache
Time to execute a program
$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$
$N_{inst}$ = number of instructions
$CPI$ = average cycles per instruction
$f_{clk}$ = clock speed of processor
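The relation T_prog = N_inst × CPI / f_clk can be evaluated directly. The sketch below is a worked instance; the instruction count, CPI and clock values in the usage note are made-up illustrative numbers, not measurements of any ARM core.

```python
# Sketch: execution time from instruction count, CPI and clock frequency.

def execution_time(n_inst: float, cpi: float, f_clk_hz: float) -> float:
    """Return program execution time in seconds: N_inst * CPI / f_clk."""
    if f_clk_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return n_inst * cpi / f_clk_hz


def speedup(t_old: float, t_new: float) -> float:
    """Return how many times faster the new configuration runs."""
    return t_old / t_new
```

With hypothetical values, 1,000,000 instructions at a CPI of 2.0 on a 50 MHz clock take 0.04 s. Halving the CPI or doubling the clock each halves the time, which is why the later cores attack both terms.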
ARM8 core organization
[Figure: a prefetch unit, an integer unit and coprocessor(s) connected to double-bandwidth memory. Labeled paths: addresses, read data, write data, PC, instructions, CPinst, CPdata.]
To get around the memory speed bottleneck, fetch more data or instruction information per access. Assume that on-chip cache memory can complete two sequential memory accesses in 1.5 cycles.
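The benefit of two sequential accesses in 1.5 cycles can be put in numbers. The sketch below compares it with a baseline of one access per cycle; that baseline is an assumption chosen for illustration, not a figure from the slides.

```python
# Sketch: memory accesses per cycle for the double-bandwidth assumption.
from fractions import Fraction


def accesses_per_cycle(accesses, cycles) -> Fraction:
    """Return the average number of memory accesses completed per cycle."""
    return Fraction(accesses) / Fraction(cycles)


# From the slide: two sequential accesses in 1.5 cycles.
double_bandwidth = accesses_per_cycle(2, Fraction(3, 2))
# Assumed baseline for comparison: one access per cycle.
baseline = accesses_per_cycle(1, 1)

improvement = double_bandwidth / baseline
```

Under these assumptions the memory delivers 4/3 accesses per cycle, a one-third improvement for sequential access patterns.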
ARM8 vs ARM7TDMI pipeline comparison
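ARM7TDMI (3 stages: Fetch, Decode, Execute): instruction fetch; Thumb decompress; ARM decode; reg read; shift/ALU; reg write
ARM8 (5 stages: Fetch, Decode, Execute, Memory, Write): instruction fetch; decode and reg read; shift/ALU; data memory access; reg write
ARM8 units: the prefetch unit performs Fetch; the integer unit performs Decode through Write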
ARM8 integer unit organization
[Figure: ARM8 integer unit datapath across the decode, execute, memory and write stages. Blocks: inst decode, register read, multiplier, ALU/shifter, +4, mux, rot/sgnx, write pipeline, register write and forwarding paths. Labeled paths: PC+8, instructions, coprocessor instructions, coproc data, address, write data, read data.]
ARM8 core
ARM8 standard core characteristics:
Process: 500 nm CMOS
Transistors: 124,554
Performance: 120-180 MIPS
Core area: ~5-6 mm²
Clock: 0-72 MHz
ARM810: ARM8 core with on-chip cache
ARM8 core vs. ARM7TDMI hard core
Parameter     ARM8           ARM7TDMI
Process       500 nm CMOS    350 nm CMOS
Transistors   124,554        74,209
MIPS          120-180        60
Core area     ~5-6 mm²       2.1 mm²
Power         (not listed)   87 mW @ 3.3 V
MIPS/W        (not listed)   690
Clock         0-72 MHz       0-66 MHz
ARM9TDMI pipeline
[Figure: ARM9TDMI datapath across the fetch, decode, execute, buffer/data and write-back stages. Blocks: I-cache, I decode, register read, shift, mul, ALU, mux, +4, D-cache, byte repl., rot/sgn ex, register write and forwarding paths. Annotations: immediate fields, next pc, reg shift, load/store address, pc + 4, pc + 8, r15; pc sources B, BL, MOV pc, SUBS pc and LDR pc; addressing cases pre-index, post-index and LDM/STM.]
ARM9TDMI vs ARM7TDMI pipeline comparison
ARM7TDMI (3 stages: Fetch, Decode, Execute): instruction fetch; Thumb decompress; ARM decode; reg read; shift/ALU; reg write
ARM9TDMI (5 stages: Fetch, Decode, Execute, Memory, Write): instruction fetch; decode and reg read; shift/ALU; data memory access; reg write
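The difference between the 3-stage and 5-stage organizations can be sketched with an idealized pipeline model in which every instruction spends one cycle per stage and nothing stalls. That no-stall assumption is a simplification for illustration; real cores lose cycles to loads, branches and multi-cycle instructions.

```python
# Sketch: ideal (stall-free) pipeline occupancy for the two stage lists.
# Assumption: one instruction enters per cycle and each stage takes one cycle.

ARM7TDMI_STAGES = ["Fetch", "Decode", "Execute"]
ARM9TDMI_STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write"]


def ideal_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles to finish n instructions: fill the pipe, then one per cycle."""
    return n_stages + n_instructions - 1


def timeline(n_instructions: int, stages) -> list:
    """Return one text row per instruction showing its stage in each cycle."""
    total = ideal_cycles(n_instructions, len(stages))
    rows = []
    for i in range(n_instructions):
        row = ["."] * total
        for s, name in enumerate(stages):
            row[i + s] = name[0]  # first letter of the stage name
        rows.append(" ".join(row))
    return rows
```

In this model the deeper pipeline needs two extra cycles to fill but still completes one instruction per cycle afterwards; its advantage comes from the shorter stages permitting a faster clock.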
ARM9TDMI characteristics
ARM9TDMI core characteristics:
Process: 250 nm CMOS
Transistors: 110,000
Performance: 220 MIPS
Core area: 2.1 mm²
Power: 150 mW @ 2.5 V
Efficiency: 1500 MIPS/W
Clock: 0-200 MHz
ARM9TDMI vs. ARM7TDMI
Parameter     ARM9TDMI        ARM7TDMI
Process       250 nm          350 nm
Transistors   110,000         74,209
MIPS          220             60
Core area     2.1 mm²         2.1 mm²
Power         150 mW @ 2.5 V  87 mW @ 3.3 V
MIPS/W        1500            690
Clock         0-200 MHz       0-66 MHz
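The MIPS/W row can be cross-checked from the MIPS and power rows. The sketch below does the arithmetic using only the figures in the table; the dictionary layout and function name are illustrative choices.

```python
# Sketch: recompute power efficiency (MIPS/W) from the table's MIPS and power.

CORES = {
    "ARM7TDMI": {"mips": 60, "power_mw": 87},
    "ARM9TDMI": {"mips": 220, "power_mw": 150},
}


def mips_per_watt(mips: float, power_mw: float) -> float:
    """Return MIPS per watt given performance in MIPS and power in milliwatts."""
    return mips / (power_mw / 1000.0)


efficiency = {
    name: mips_per_watt(core["mips"], core["power_mw"])
    for name, core in CORES.items()
}
```

The ARM7TDMI result is about 690 MIPS/W, matching the table. The ARM9TDMI result is about 1467 MIPS/W, so the table's 1500 appears to be a rounded figure.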
ARM9 Application – Qualcomm MSM6100 chip set
ARM10TDMI core
ARM7 to ARM9: 3-stage to 5-stage pipeline
ARM9 to ARM10: increased clock speed; clocks/instruction reduced
ARM10TDMI pipeline
Fetch: instruction fetch, branch prediction
Issue: decode
Decode: decode, reg read
Execute: addr. calc., shift/ALU, multiply
Memory: data memory access, multiplier partials add
Write: data write, reg write
Lengthened memory cycle time for instruction fetch and for data memory access
Multiplier critical path shortened
Additional “Issue” stage added to decode
ARM10TDMI reduction in cycles/instruction
Double memory fetch allows improved branch prediction: backward branches are predicted taken (as in loops); forward branches are predicted not taken.
Non-blocking load/store: if later instructions do not depend on a pending load/store access, execution proceeds without waiting for it.
Double-width memory access allows load/store multiple register operations to transfer registers in parallel.
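The static prediction rule (backward branches taken, forward branches not taken) is simple enough to express in a few lines. This is a sketch of the rule only, not of the ARM10TDMI prediction hardware; the function names, the addresses and the choice to treat a branch to itself as backward are assumptions.

```python
# Sketch: static branch prediction by branch direction.
# Assumption: a branch whose target equals its own address counts as backward.

def predict_taken(branch_address: int, target_address: int) -> bool:
    """Predict taken for backward branches, not taken for forward branches."""
    return target_address <= branch_address


def prediction_accuracy(branch_address: int, target_address: int, outcomes) -> float:
    """Fraction of actual outcomes (True = taken) matching the static prediction."""
    prediction = predict_taken(branch_address, target_address)
    hits = sum(1 for taken in outcomes if taken == prediction)
    return hits / len(outcomes)
```

For a loop-closing branch that is taken nine times and then falls through once, the rule is right nine times out of ten, which is why the backward-taken heuristic suits loop-heavy code.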