university of tokyo research center for advanced science
TRANSCRIPT
Asynchronous VLSI System Design FOIL 1
Asynchronous VLSI System Design
Takashi Nanya
Research Center for Advanced Science and TechnologyUniversity of Tokyo
Asynchronous VLSI System Design FOIL 2
OVERVIEW
Why Asynchronous ?
How Systems Work without Clock
Design Methodologies
TITAC-2: 32-bit Asynchronous Microprocessor
Challenges
Asynchronous VLSI System Design FOIL 3
Why Asynchronous ?
CMOS Technology in 2012 (from SIA Roadmap, 1997 Edition)Minimum Tr. size : 0:05m
Number of logic Trs. : 180M=cm2
Onchip frequency: 3 10GHz
Chip area: 750cm2
Fundamental Limitation:Electromagnetic field in vacuum travels30mm in 100ps.Electric signals on chip travels much slower, e.g.3ns across chip.
3GHz frequency requires 10 clock cycles to cross the chip.
Wire delays moving into dominance on chip
SIA suggests0:13m technology at 2003 will require asynchronousarchitecture with 60% reuse
Asynchronous VLSI System Design FOIL 4
Gate and Interconnect Delay versus Technology Generation [NTRS1997]
0
5
10
15
20
25
30
35
40
45
650 500 350 250 180 130 100Generation (nm)
Delay(ps)
Gate
Gate with Al & SiO2
κGate with Cu & Low
Al 3.0 µΩ /cm
κ = 4.0κ = 2.0
CuSiO2LowAl & CuAl & Cu Line
κ
1.7 µΩ /cm
.8 µ43 µ
ThickLong
Gate DelaySum of Delays, Al & SiO2Sum of Delays, Cu & Low Interconnect Delay, Al & SiO2Interconnect Delay, Cu & Low
κ
κ
Asynchronous VLSI System Design FOIL 5
Asynchronous systems can enjoy :
Without any global clock
1. Potentially fast operation:Achieve average-case performance
Optimize frequent operations and allow rare operations to spend more
time Future technology scales up system performance
2. Timing-fault-tolerance:Robust and resilient with possible delay variation caused in fabrication
process and operating environment
3. Low power consumption:Signal transitions occur only when and where needed
Ex. DEC Alpha consumes 30W at 200MHz with 10 W for clock
600MHz version with 15.2MTrs consumes 72W at 2.0 V
4. Ease of modular composition:No clock alignment needed at interfaces
Can overcome design complexity
Asynchronous VLSI System Design FOIL 6
Synchronous System
Synchronous (Clock) Signal
Register
Logic Circuit
(ALU, Memory, etc)
Register
Asynchronous VLSI System Design FOIL 7
Asynchronous System
Register
Logic Circuit
(ALU, Memory, etc)
Register
coded data coded data
Read Req Write Ack
Asynchronous VLSI System Design FOIL 8
Encoding N-bit data-path (1)
When delay variation is large
=) Delay-Insensitive modelneeds to be encoded in an unordered code
N + logN =< datawidth =< 2N
2-rail encoding is useful (data width = 2N)
(0; 0)! (0; 1) ‘0’ has occurred (arrived)
(0; 0)! (1; 0) ‘1’ has occurred (arrived)
(1; 0); (0; 1)! (0; 0) initialization
Asynchronous VLSI System Design FOIL 9
Encoding N-bit data-path (2)
When delay variation is small
=) Bundled-data modelData + strobe signal
(datawidth = N + 1)
Strobe signal:
0! 1 data has become valid
1! 0 data has become invalid
Asynchronous VLSI System Design FOIL 10
2-rail logic
f1
f0
f1
f0
f1f0
a1a0b1b0
a1a0b1b0
a1a0
ab
f
ab
f
a f
Guarantees Hazard-Free Operation
Asynchronous VLSI System Design FOIL 11
Asynchronous Latch (2-rail)
W1
W0
Ack
R1
R0
Req
Asynchronous VLSI System Design FOIL 12
Asynchronous System Model
Collection of functional modulecommunicating asynchronouslywith each other
AsynchrnousFunctional
Module
AsynchrnousFunctional
Module
Req
Ack
Req, Ack: Control Signal or Data Signal
Asynchronous VLSI System Design FOIL 13
Delay Models
Assumptions on gate and wire delays.
Play an essential role for dependable system design.
Must be carefully examined and validated for design process, devicetechnology and operating environment.
Synchronous:Fixed or bounded delays that are known a priori
Asynchronous:Variable delays that are unknown
Fundamental-Mode:Bounded gate and wire delaysSpeed-Independent:Unbounded gate delays with no wire delaysDelay-Insensitive:Unbounded gate delays and wire delaysQuasi-Delay-Insensitive:Delay-Insensitive + Isochronic Forks
(All branches of a fork have the same wire delay)
Asynchronous VLSI System Design FOIL 14
Observations
Synchronous or Bounded-Delay model:Too optimistic of uncontrollable wire delaysCan make design expensive to guarantee reliable operations
Delay-Insensitive or Quasi-Delay-Insensitive model:Too cautious against unlikely variations of delayCan cause excessive hardware overhead and performance penalty
What is likely in future VLSI technology ?Component delays are liable to uncontrollable variation throughdesign phase, fabrication process, operating environment, aging, etc
However,It is unlikely that some delays decrease (increase) while othersincrease (decrease)
! Scalable-Delay-Insensitive (SDI) Model
Asynchronous VLSI System Design FOIL 15
Scalable-Delay-Insensitive (SDI) Model
Unbounded delays with bounded relative variation ratio
[Definition]
Consider any two componentsC1 andC2
d1; d2: Delays forC1, C2
D = d1=d2: Relative delayD of C1 toC2
De: Estimated relative delay(at design phase)
Da: Actual relative delay (through system’s lifetime)
R = Da=De: Relative variation ratio
Then, SDI model assumes that1=K < R < K , where
K is a constant (maximum variation ratio)
Asynchronous VLSI System Design FOIL 16
How SDI model works
Specification: Transitiont1 precedes transitiont2
DI implementation : t2 must be caused byt1
t1 t2
QDI implementation: t2 may be caused by other fanoutbranch that shares its stem witht1
t1
t2
SDI implementation: t2 may be caused by common stemt
that also causest1, if K d1 < d2
t
t2
t1
d1
d2
Asynchronous VLSI System Design FOIL 17
Example
Completion signal generation with QDI and SDI models
32bit registerwrite data(2-rail 32bits)
register select(1-outof-40)
40inputsOR
QDI-ack
6unit delay
5unit delay
16 unit delay
d0 d0
d31 d31
32inputsC-element
40inputsOR
6unit delay
5unit delay SDI-ack7 unit delay
(Max. var. ratio = 2)
(a) QDI model circuit
(b) SDI model circuit
40
64
Racki
Yi
Yi
Di
Di
EN
Register (1bit)
32
3unit delay (data, en -> write) 2unit delay (write -> ack)
32inputsC-element
C
C
C
C
d
en
ack
R0
d
en
ack
R0
Asynchronous VLSI System Design FOIL 18
SDI Design Methodology
Step 1:Divide the entire system into functional blocks
Step 2:Design each block as well as interconnections with QDImodel
Step 3:DetermineK considering process technology and areasize of each block
Step 4:Apply SDI implementation to each QDI block whenever
Kd1 < d2
TITAC–2 design: K = 2 validated by limiting each functionalblock to be within a maximum of1:93mm1:93mm based on
0:5 CMOS technology used
Asynchronous VLSI System Design FOIL 19
TITAC–2 Architecture
Asynchronous (clock-free) version of MIPS–R2000
Why MIPS–R2000 ?Nice target to demonstrate asynchronous designEasy to compare with synchronous counterpartReasonably simple to designC Compiler available
Instruction set ; almost same
Some instructions modified;multiply/divide resulting in least significant 32 bits2 delay slots for branch instructionsprivileged instructions
Object code:not compatible due to different instruction encoding
Asynchronous VLSI System Design FOIL 20
TITAC–2 Instruction Set
Logical AND, ANDI, OR, ORIXOR, XORI, LU, LUI
Arithmetic ADD, ADDI, SUB, SUBIMultiply MUL, MULUDivide DIV, DIVU, MOD, MODU
Compare SLT, SLTI, SLTU, SLTIUShift SLL, SRA, SRL
SLLV, SRAV, SRLVLoad LB, LBU, LH, LHU, LWStore SB, SH, SW
Branch J, JR, JAL, JALRBEQ, BNE, BGTZ, BLEZBGEZ, BGEZAL, BLTZ, BLTZAL
Privileged MOVSR, MOVRS, RFE, SYSCALLInstructions
Asynchronous VLSI System Design FOIL 21
TITAC–2 Architecture
latc
h
next
pc
PC
IR
I-cache
SR
C
deco
de
R1
RD WR
Memory Controller
Register File
address data
DS
T
IF stage ID stage EX stage ME stage WB
ALU
DS
T
DS
T
FW
D
latc
h
latc
h
latc
h
latc
h
write buffer
R1
5-stage pipeline structure
8-KByte Instruction Cacheon chip(Data cache not implemented)
40 32-bit registers(R32 – R39 for kernel-modeuse only)
Exception Handling(for user-mode only)
External Interrupt
Memory Protection
Asynchronous VLSI System Design FOIL 22
TITAC–2 Design Features
2-level Delay AssumptionSDI model for local functional blocksDI/QDI model for global interconnection
Elastic asynchronous pipeline with FIFO buffer
2-rail 2-phase register transfer
Idle-phase acceleration in 2-rail 2-phase data-path
Adjustable delay elements for bundled-data path(cache, external bus)
Asynchronous VLSI System Design FOIL 23
Data-Path Encoding
2-rail 2-phase (return-to-zero)for register transfer
Working phase:
(0; 0)! (0; 1) : logic value “0” has arrived
(0; 0)! (1; 0) : logic value “1” has arrived
Idle phase:
(0; 1)! (0; 0) : logic value “0” has cleared away
(1; 0)! (0; 0) : logic value “1” has cleared away
Bundled-Data for on-chip cache and external interface
Asynchronous VLSI System Design FOIL 24
Asynchronous Pipeline Stage Model
Request/Acknowledge HandshakeDecentralized Control
combinationalcircuit
sourcelatch
destinationlatch
writeacknowledge
readrequest
controlcircuit
2-rail2-phase
2-rail2-phase
combinationalcircuitacknowledge
C
Asynchronous VLSI System Design FOIL 25
Pipeline Latch
Asynchronous FIFO: 4 primitive latches in cascade7% faster than 3-latch FIFO
X
X
Y
Y
wr_ack
primitivelatch
C
C C
C C
C
rd_req
C
C
(X, X) : write data input
(Y, Y) : read data output
X’
X’
rd_req’
Asynchronous VLSI System Design FOIL 26
Combinational Circuit Design
2-rail logic implementation
Direct mapping from BDD representationEasy to generate completion signalsHazard-free due to monotonic signal transitions
DCVSL implementation for frequently used modules(e.g. adders in multiplier)
Gate-level implementation for other cases
Asynchronous VLSI System Design FOIL 27
2-rail Implementation from BDD
01 10
0 1
0 1 10
F
1 1
1
00
0
A
B
A
B
C
C
A
B
A
B
C
C
Completion
A
B
C
0 1
A
B
C
B
F
FF FF
Asynchronous VLSI System Design FOIL 28
Idle Phase Acceleration
combinational
circuit(f)
x1
x1
xn
xn
y1
y1
ym
ym
x1
x1
xn
xn
y1
y1
ym
ym
(a)
(b)
combinational
circuit(g)
combinational
circuit(f)
combinational
circuit(g)
iaf(W) : Working phase for "f"
f(W) g(W) f(I) g(I)
time
time
f(W) g(W) f(I)
g(I)ia
f(I) : Idle phase for "f"g(W) : Working phase for "g"g(I) : Idle phase for "g"
Asynchronous VLSI System Design FOIL 29
Conversion between 2-railed data and Bundled-data
RA
M
1-2C
ON
V1-rail
AD
R
DA
TA
2-rail
strobe
Asy
ncR
AMA
DR
DA
TAAsy
ncR
AM
AD
R
DA
TA
2-1C
ON
V
WE
RA
M
AD
RD
AT
A
Async RAM read-operation Async RAM write operation
delay delay delay2-
1CO
NV
2-rail
Y1X1/X1X2
/X2Y2
strobestrobe
Y1
Y2
Z1
/Z1
Z2
/Z2
Asynchronous VLSI System Design FOIL 30
Chip Implementation
Fabricated in NEC’s 0.5m, 3.3 V CMOS Processwith 3 metal layers
Die size:12:15mm 12:15mm
496,367 Logic Transistors + 8.6K Byte Memory macro
Number of I/O pins: 164 (excluding power pins)
Asynchronous VLSI System Design FOIL 31
Die Photo
REGISTERFILE
INSTRUCTIONDECODER
INSTRUCTIONFETCH
CACHETAG
INSTRUCTIONCACHE
INS
T. M
EM
PR
OT
EC
TIO
N
DA
TA
ME
MP
RO
TE
CT
ION
DIVIDER
MULTIPLIER
ALU
ME
Asynchronous VLSI System Design FOIL 32
Performance
Dhrystone V2.1 benchmark54.1 VAX MIPS consuming 2.1 W at 3.3 V in room temperature
Delay Insensitivity
Power Supply Voltage:from 1.5 V to 6.0 V(variable)
Chip Surface Temperature: from196C to+100C
Asynchronous VLSI System Design FOIL 33
Speed and Power Consumption vs. Supply Voltage
(at room temperature)
10
20
30
40
50
60
70
1 1.5 2 2.5 3 3.5 4 4.50.0
1.0
2.0
3.0
4.0
5.0
6.0
Dhr
ysto
ne [M
IPS
]
Pow
er C
onsu
mpt
ion
[W]
Supply Voltage [V]
SPEED [MIPS]
POWER[W]
Asynchronous VLSI System Design FOIL 34
Average Cycle Time for Asynchronous Pipeline
R1
PC
2
Nex
t PC
I-ca
che
SR
CF
WD
.EX
FW
D.M
E
R1
ALU
ME
M
Reg
File
PC
Con
d
IR
WR
RD
0.49
2.72
4.64
2.65
0.42
0.82
0.47
0.85
2.51
1.55
2.43
1.98
0.42
0.82
0.90
3.38
0.84
1.33
1.26
0.38
4.39
3.56
IF ID/WB EX ME
Latc
h
Latc
h
Tr Tf Tw
Legend:
Fun
ctio
n
Asynchronous VLSI System Design FOIL 35
Power Consumption of Pipeline Stages
Stage No. of Cells Power(W) (%)IF 13362 0.30 (19.4)I-cache —- 0.29 (19.0)ID/WB 13993 0.46 (29.6)EX 16089 0.35 (22.8)ME 7589 0.14 (9.1)total 52646 1.54 (100.0)
Asynchronous VLSI System Design FOIL 36
Comparison of TITAC-2 and MIPS-R2000
TITAC-2 MIPS-R2000Process 0:5m CMOS 2m CMOSRouting layers 3-metal 2-metal, 1-polyDesign method Standard-cell Full-customDie size 12:15 12:15mm 8:5 10mm
On-chip memory 8.6KB SRAM noneTransistor count 496K 100KClock frequency — 16.7 MHzSupply voltage 3.3 V 5 VPower dissipation 2.11W —(cache disabled) 1.02W < 2WThroughput 54.1 VAX MIPS —(cache disabled) 26.5 VAX MIPS 12 VAX MIPS
Asynchronous VLSI System Design FOIL 37
Tools Used
Commercial: synchronous
Verilog-XL: RTL/logic-level simulation, performance estimation
Design Compiler: technology mappingCell 3 Ensemble: floor planning, layout
Saber: analog simulation of original macros
Home-made:asynchronous
Asynchronous Synthesis from STG
Delay Evaluation
Test Generation
Asynchronous VLSI System Design FOIL 38
Other projects
Univ. of Manchster: Amulet/2ecompleted in Oct. 1996, 40 MIPS, 0.15 W
Univ.Tokyo/ Tokyo Inst. Tech.: TITAC-2completed in Feb.1997, 54 MIPS, 2.1W
California Inst. Tech.: Mini MIPS:expected in Spring 1998, 280 MIPS, 7.0W
Intel: microprocessor design for high performance
Philips: portable equipments for low power
Asynchronous VLSI System Design FOIL 39
Applications
currently:low powerdelay insensitivity
beyond0:13m at 2003:where wire delays are dominanthigh-performance VLSI (MPU, ASIC)
Asynchronous VLSI System Design FOIL 40
Challenges
High-performance low-power design
Verification
Testing
Transient fault tolerance
CAD support
Asynchronous VLSI System Design FOIL 41
Perspective
Design alternative
Fusion of synchronous and asynchronous design stylesInternally Async., externally SyncGlobally Async, locally Sync
Modular design to overcome design complexity
Asynchronous VLSI System Design FOIL 42
Conclusions
Asynchronous design methodology is now available
Performance is already comparable with synchronous counterpart
SDI model ensures both performance and dependability
Easiness of modular design helps much
More suitable architecture can fully exploit concurrency
Testing is a major challenge