university of tokyo research center for advanced science

Asynchronous VLSI System Design

Takashi Nanya

Research Center for Advanced Science and Technology, University of Tokyo


OVERVIEW

Why Asynchronous?

How Systems Work without Clock

Design Methodologies

TITAC-2: 32-bit Asynchronous Microprocessor

Challenges


Why Asynchronous?

CMOS Technology in 2012 (from SIA Roadmap, 1997 Edition)
  Minimum Tr. size: 0.05 µm
  Number of logic Trs.: 180M/cm²
  On-chip frequency: 3 to 10 GHz
  Chip area: 750 mm²

Fundamental Limitation:
  Electromagnetic field in vacuum travels 30 mm in 100 ps.
  Electric signals on chip travel much slower, e.g. 3 ns across chip.
  3 GHz frequency requires 10 clock cycles to cross the chip.

Wire delays are moving into dominance on chip

SIA suggests that 0.13 µm technology in 2003 will require asynchronous architecture with 60% reuse
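The cycle count quoted on this foil follows from simple arithmetic. A minimal Python sketch, using the foil's own figures (about 3 ns to cross the chip, a 3 GHz clock); the function name is illustrative only:

```python
def cycles_to_cross(cross_chip_delay_s: float, clock_hz: float) -> float:
    """Number of clock periods that elapse while a signal crosses the chip."""
    return cross_chip_delay_s * clock_hz

# Figures from the foil: ~3 ns across the chip, 3 GHz on-chip clock.
print(cycles_to_cross(3e-9, 3e9))   # about 9, i.e. roughly 10 cycles

# Vacuum bound from the foil: light covers about 30 mm in 100 ps.
SPEED_OF_LIGHT_M_PER_S = 3.0e8      # rounded value
distance_mm = SPEED_OF_LIGHT_M_PER_S * 100e-12 * 1e3
print(distance_mm)                  # about 30
```

At the upper end of the roadmap's frequency range the same crossing would take proportionally more cycles.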


Gate and Interconnect Delay versus Technology Generation [NTRS1997]

[Figure: delay (ps, 0 to 45) versus technology generation (650, 500, 350, 250, 180, 130, 100 nm). Curves: Gate Delay; Interconnect Delay, Al & SiO2; Interconnect Delay, Cu & Low κ; Sum of Delays, Al & SiO2; Sum of Delays, Cu & Low κ. Parameters: Al 3.0 µΩ-cm; Cu 1.7 µΩ-cm; SiO2 κ = 4.0; Low κ: κ = 2.0; Al & Cu line 0.8 µ thick, 43 µ long.]


Without any global clock, asynchronous systems can enjoy:

1. Potentially fast operation:
  Achieve average-case performance
  Optimize frequent operations and allow rare operations to spend more time
  Future technology scales up system performance

2. Timing-fault-tolerance:
  Robust and resilient against delay variation caused by the fabrication process and operating environment

3. Low power consumption:
  Signal transitions occur only when and where needed
  Ex. DEC Alpha consumes 30 W at 200 MHz, with 10 W for the clock
  600 MHz version with 15.2M Trs consumes 72 W at 2.0 V

4. Ease of modular composition:
  No clock alignment needed at interfaces
  Can overcome design complexity


Synchronous System

[Figure: Register, Logic Circuit (ALU, Memory, etc), Register in sequence; the registers are driven by a Synchronous (Clock) Signal]


Asynchronous System

[Figure: Register, Logic Circuit (ALU, Memory, etc), Register in sequence, linked by coded data; handshake signals Read Req and Write Ack]


Encoding N-bit data-path (1)

When delay variation is large

⇒ Delay-Insensitive model
  Data needs to be encoded in an unordered code
  N + log N ≤ data width ≤ 2N

2-rail encoding is useful (data width = 2N)
  (0, 0) → (0, 1): ‘0’ has occurred (arrived)
  (0, 0) → (1, 0): ‘1’ has occurred (arrived)
  (1, 0), (0, 1) → (0, 0): initialization
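The 2-rail transitions listed on this foil can be sketched in Python. This is a minimal illustration, not the circuit itself. It assumes the first element of each pair is the rail that signals ‘1’, which matches the pair order on the foil. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Rail = Tuple[int, int]      # assumed order: (rail for '1', rail for '0')
SPACER: Rail = (0, 0)       # initialized state, no data present

def encode_bit(bit: int) -> Rail:
    """(0, 1) means '0' has arrived; (1, 0) means '1' has arrived."""
    return (1, 0) if bit else (0, 1)

def encode_word(value: int, n: int) -> List[Rail]:
    """Encode an n-bit value as n rail pairs (data width 2n), LSB first."""
    return [encode_bit((value >> i) & 1) for i in range(n)]

def is_complete(word: List[Rail]) -> bool:
    """True once every bit position holds a valid code word."""
    return all(pair in ((0, 1), (1, 0)) for pair in word)

def decode_word(word: List[Rail]) -> Optional[int]:
    """Return the value, or None while any bit is still in the spacer state."""
    if not is_complete(word):
        return None
    return sum(pair[0] << i for i, pair in enumerate(word))

word = encode_word(0b1011, 4)
print(decode_word(word))                 # 11
partial = [SPACER] + word[1:]            # bit 0 has not arrived yet
print(decode_word(partial))              # None
```

Because the receiver can tell from the code alone when all bits have arrived, no assumption about the relative delay of the individual wires is needed.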


Encoding N-bit data-path (2)

When delay variation is small

⇒ Bundled-data model
  Data + strobe signal (data width = N + 1)

Strobe signal:
  0 → 1: data has become valid
  1 → 0: data has become invalid
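The strobe convention on this foil can be illustrated with a small receiver model. This is a sketch under one assumption: the data wires are stable before the strobe rises, which is the timing condition the bundled-data model relies on. The trace format and function name are hypothetical.

```python
from typing import Iterable, List, Tuple

def capture_on_strobe(trace: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the data value seen at each 0 -> 1 strobe transition.

    trace: sequence of (strobe, data) samples in time order.
    """
    captured = []
    prev_strobe = 0
    for strobe, data in trace:
        if prev_strobe == 0 and strobe == 1:   # data has become valid
            captured.append(data)
        prev_strobe = strobe
    return captured

trace = [(0, 0x00), (0, 0x2A), (1, 0x2A), (0, 0x2A),
         (0, 0x07), (1, 0x07), (0, 0x07)]
print(capture_on_strobe(trace))   # [42, 7]
```

Only one extra wire is needed for an N-bit path, at the price of having to guarantee that the strobe is never faster than the data.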


2-rail logic

[Figure: single-rail gates (inputs a, b; output f) and their 2-rail counterparts with inputs (a1, a0), (b1, b0) and outputs (f1, f0)]

Guarantees Hazard-Free Operation
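A minimal sketch of 2-rail gates in Python follows. The gate equations are a common textbook construction and are an assumption here; the circuits drawn on the foil may differ. Each function uses only AND and OR on the rails, so rising inputs can only cause rising outputs, which is the monotonic behavior that rules out hazards.

```python
from itertools import product

SPACER = (0, 0)

def enc(bit):
    """Rail pair (x1, x0): (1, 0) for logic 1, (0, 1) for logic 0."""
    return (1, 0) if bit else (0, 1)

def and2(a, b):
    a1, a0 = a
    b1, b0 = b
    return (a1 & b1, a0 | b0)    # f1 needs both ones, f0 needs any zero

def or2(a, b):
    a1, a0 = a
    b1, b0 = b
    return (a1 | b1, a0 & b0)    # f1 needs any one, f0 needs both zeros

def inv(a):
    a1, a0 = a
    return (a0, a1)              # inversion just swaps the two rails

# Check every valid input combination against ordinary Boolean logic.
for x, y in product((0, 1), repeat=2):
    assert and2(enc(x), enc(y)) == enc(x & y)
    assert or2(enc(x), enc(y)) == enc(x | y)
assert inv(enc(0)) == enc(1)
assert and2(SPACER, SPACER) == SPACER    # idle inputs give an idle output
```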


Asynchronous Latch (2-rail)

[Figure: 2-rail asynchronous latch with signals W1, W0, Ack, R1, R0, Req]


Asynchronous System Model

Collection of functional modules communicating asynchronously with each other

[Figure: two Asynchronous Functional Modules connected by Req and Ack signals]

Req, Ack: Control Signal or Data Signal


Delay Models

Assumptions on gate and wire delays.

Play an essential role for dependable system design.

Must be carefully examined and validated for design process, device technology and operating environment.

Synchronous: Fixed or bounded delays that are known a priori

Asynchronous: Variable delays that are unknown
  Fundamental-Mode: Bounded gate and wire delays
  Speed-Independent: Unbounded gate delays with no wire delays
  Delay-Insensitive: Unbounded gate delays and wire delays
  Quasi-Delay-Insensitive: Delay-Insensitive + Isochronic Forks (all branches of a fork have the same wire delay)


Observations

Synchronous or Bounded-Delay model:
  Too optimistic about uncontrollable wire delays
  Can make design expensive to guarantee reliable operations

Delay-Insensitive or Quasi-Delay-Insensitive model:
  Too cautious against unlikely variations of delay
  Can cause excessive hardware overhead and performance penalty

What is likely in future VLSI technology?
  Component delays are liable to uncontrollable variation through design phase, fabrication process, operating environment, aging, etc.

However, it is unlikely that some delays decrease (increase) while others increase (decrease)

→ Scalable-Delay-Insensitive (SDI) Model


Scalable-Delay-Insensitive (SDI) Model

Unbounded delays with bounded relative variation ratio

[Definition]

Consider any two components C1 and C2
  d1, d2: Delays for C1, C2
  D = d1/d2: Relative delay of C1 to C2
  De: Estimated relative delay (at design phase)
  Da: Actual relative delay (through system’s lifetime)
  R = Da/De: Relative variation ratio

Then, SDI model assumes that 1/K < R < K, where K is a constant (maximum variation ratio)
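The definition can be checked numerically. The following Python sketch computes R from estimated and actual delays and tests the SDI bound. Function names and the sample delay values are hypothetical.

```python
def relative_variation_ratio(d1_est: float, d2_est: float,
                             d1_act: float, d2_act: float) -> float:
    """R = Da / De, with De = d1_est/d2_est and Da = d1_act/d2_act."""
    de = d1_est / d2_est
    da = d1_act / d2_act
    return da / de

def satisfies_sdi(r: float, k: float) -> bool:
    """SDI assumption: 1/K < R < K."""
    return 1.0 / k < r < k

# Both components slow down by the same factor: R stays at 1.
print(relative_variation_ratio(3.0, 6.0, 6.0, 12.0))   # 1.0

# Only C1 slows down by 50%: R = 1.5, still inside the bound for K = 2.
r = relative_variation_ratio(3.0, 6.0, 4.5, 6.0)
print(r, satisfies_sdi(r, 2.0))                        # 1.5 True
```

Uniform scaling of all delays, for example from a lower supply voltage, leaves R unchanged, which is why the absolute delays may stay unbounded.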


How SDI model works

Specification: Transition t1 precedes transition t2

DI implementation: t2 must be caused by t1
  [Figure: t1 leads to t2]

QDI implementation: t2 may be caused by another fanout branch that shares its stem with t1
  [Figure: fork with one branch producing t1 and another producing t2]

SDI implementation: t2 may be caused by a common stem t that also causes t1, if K·d1 < d2
  [Figure: stem t produces t1 with delay d1 and t2 with delay d2]


Example

Completion signal generation with QDI and SDI models

[Figure: completion (ack) signal generation for a write to a 32-bit register.
Inputs: write data (2-rail, 32 bits; d0 to d31), register select (1-out-of-40).
Register (1 bit): signals Di, Yi (rail pairs), EN, Racki; 3 unit delay (data, en -> write), 2 unit delay (write -> ack).
Both circuits contain a 32-input C-element (6 unit delay) and a 40-input OR (5 unit delay).
(a) QDI model circuit: QDI-ack after 16 unit delay.
(b) SDI model circuit (Max. var. ratio = 2): SDI-ack after 7 unit delay.]


SDI Design Methodology

Step 1: Divide the entire system into functional blocks

Step 2: Design each block as well as interconnections with QDI model

Step 3: Determine K considering process technology and area size of each block

Step 4: Apply SDI implementation to each QDI block whenever K·d1 < d2

TITAC–2 design: K = 2 validated by limiting each functional block to be within a maximum of 1.93 mm × 1.93 mm based on the 0.5 µm CMOS technology used
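The condition in Step 4 can be read as a safety margin on estimated delays: the actual relative delay is at most K times the estimated one, so K·d1 < d2 at design time keeps d1 below d2 after any variation the SDI model allows. A minimal Python sketch follows. The function names and the unit-delay values are hypothetical and are not taken from the TITAC-2 design.

```python
def sdi_shortcut_allowed(d1_est: float, d2_est: float, k: float) -> bool:
    """Step 4 condition: K * d1 < d2, evaluated on estimated delays."""
    return k * d1_est < d2_est

def worst_case_actual_ratio(d1_est: float, d2_est: float, k: float) -> float:
    """Upper bound on the actual d1/d2, since Da = R * De and R < K."""
    return k * d1_est / d2_est

K = 2.0   # maximum variation ratio, as chosen for TITAC-2

# Hypothetical estimates in unit delays.
print(sdi_shortcut_allowed(3.0, 7.0, K))       # True: 6 < 7
print(worst_case_actual_ratio(3.0, 7.0, K))    # below 1, so t1 still precedes t2
print(sdi_shortcut_allowed(3.0, 5.0, K))       # False: 6 is not below 5
```

When the check fails, the block keeps its QDI implementation.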


TITAC–2 Architecture

Asynchronous (clock-free) version of MIPS–R2000

Why MIPS–R2000?
  Nice target to demonstrate asynchronous design
  Easy to compare with synchronous counterpart
  Reasonably simple to design
  C Compiler available

Instruction set: almost the same

Some instructions modified:
  multiply/divide resulting in least significant 32 bits
  2 delay slots for branch instructions
  privileged instructions

Object code: not compatible due to different instruction encoding


TITAC–2 Instruction Set

Logical                  AND, ANDI, OR, ORI, XOR, XORI, LU, LUI
Arithmetic               ADD, ADDI, SUB, SUBI
Multiply                 MUL, MULU
Divide                   DIV, DIVU, MOD, MODU
Compare                  SLT, SLTI, SLTU, SLTIU
Shift                    SLL, SRA, SRL, SLLV, SRAV, SRLV
Load                     LB, LBU, LH, LHU, LW
Store                    SB, SH, SW
Branch                   J, JR, JAL, JALR, BEQ, BNE, BGTZ, BLEZ, BGEZ, BGEZAL, BLTZ, BLTZAL
Privileged Instructions  MOVSR, MOVRS, RFE, SYSCALL


TITAC–2 Architecture

[Figure: TITAC-2 block diagram. Pipeline stages: IF, ID, EX, ME, WB, separated by latches. Blocks: next pc, PC, I-cache, IR, SRC decode, Register File (RD, WR), ALU, DST, FWD, R1, Memory Controller (address, data), write buffer]

5-stage pipeline structure

8-KByte Instruction Cache on chip (Data cache not implemented)

40 32-bit registers (R32 – R39 for kernel-mode use only)

Exception Handling (for user-mode only)

External Interrupt

Memory Protection


TITAC–2 Design Features

2-level Delay Assumption
  SDI model for local functional blocks
  DI/QDI model for global interconnection

Elastic asynchronous pipeline with FIFO buffer

2-rail 2-phase register transfer

Idle-phase acceleration in 2-rail 2-phase data-path

Adjustable delay elements for bundled-data path (cache, external bus)


Data-Path Encoding

2-rail 2-phase (return-to-zero) for register transfer

Working phase:
  (0, 0) → (0, 1): logic value “0” has arrived
  (0, 0) → (1, 0): logic value “1” has arrived

Idle phase:
  (0, 1) → (0, 0): logic value “0” has cleared away
  (1, 0) → (0, 0): logic value “1” has cleared away

Bundled-Data for on-chip cache and external interface


Asynchronous Pipeline Stage Model

Request/Acknowledge Handshake

Decentralized Control

[Figure: source latch, combinational circuit, destination latch in sequence, linked by 2-rail 2-phase data; a control circuit with a C-element handles the read request, write acknowledge and acknowledge signals]
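The "C" symbol in the stage model is a Muller C-element. A minimal behavioral model in Python is sketched below, using the standard definition: the output follows the inputs when they all agree and otherwise holds its previous value. How the element is wired into the TITAC-2 control circuit is not modeled here. The class name is hypothetical.

```python
class CElement:
    """Behavioral model of a Muller C-element with any number of inputs."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, *inputs: int) -> int:
        if all(v == 1 for v in inputs):
            self.out = 1          # every input high: output rises
        elif all(v == 0 for v in inputs):
            self.out = 0          # every input low: output falls
        # otherwise the inputs disagree and the output keeps its state
        return self.out

c = CElement()
print(c.update(1, 0))   # 0: waits for the second input
print(c.update(1, 1))   # 1: both inputs have risen
print(c.update(0, 1))   # 1: holds until both inputs have fallen
print(c.update(0, 0))   # 0
```

This waiting behavior is what lets a handshake signal be produced only after every contributing event has occurred.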


Pipeline Latch

Asynchronous FIFO: 4 primitive latches in cascade

7% faster than 3-latch FIFO

[Figure: FIFO built from primitive latches made of C-elements, with handshake signals wr_ack and rd_req and internal signals X’, rd_req’]

(X, /X): write data input

(Y, /Y): read data output


Combinational Circuit Design

2-rail logic implementation

Direct mapping from BDD representation
  Easy to generate completion signals
  Hazard-free due to monotonic signal transitions

DCVSL implementation for frequently used modules (e.g. adders in multiplier)

Gate-level implementation for other cases


2-rail Implementation from BDD

[Figure: BDD of a function F of inputs A, B, C with terminals 0 and 1, and the 2-rail circuit mapped directly from it, with rail pairs for A, B, C and F and a Completion output]
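One plausible reading of direct mapping from a BDD is sketched below in Python. A token starts at the root and follows the high edge of a node when the variable's ‘1’ rail is up, or the low edge when its ‘0’ rail is up. Reaching terminal 1 raises F1 and reaching terminal 0 raises F0. The node format, the function names and the majority function used as an example are all assumptions, not the circuit on the foil.

```python
from itertools import product

def enc(bit):
    """Rail pair (x1, x0): (1, 0) for logic 1, (0, 1) for logic 0."""
    return (1, 0) if bit else (0, 1)

def eval_bdd_2rail(node, inputs):
    """Return (F1, F0) for a BDD; node is 0, 1 or (var, low, high)."""
    if node == 1:
        return (1, 0)
    if node == 0:
        return (0, 1)
    var, low, high = node
    x1, x0 = inputs[var]
    f1 = f0 = 0
    if x1:                                   # token follows the high edge
        h1, h0 = eval_bdd_2rail(high, inputs)
        f1, f0 = f1 | h1, f0 | h0
    if x0:                                   # token follows the low edge
        l1, l0 = eval_bdd_2rail(low, inputs)
        f1, f0 = f1 | l1, f0 | l0
    return (f1, f0)

def completion(f):
    """Completion signal: one of the two output rails has risen."""
    return f[0] | f[1]

# Hypothetical example: majority of A, B, C.
node_c = ("C", 0, 1)
root = ("A", ("B", 0, node_c), ("B", node_c, 1))

for a, b, c in product((0, 1), repeat=3):
    rails = {"A": enc(a), "B": enc(b), "C": enc(c)}
    assert eval_bdd_2rail(root, rails) == enc(int(a + b + c >= 2))

idle = {"A": (0, 0), "B": (0, 0), "C": (0, 0)}
print(eval_bdd_2rail(root, idle))            # (0, 0): no token, no output
```

Since the output rails are an OR of ANDed input rails, they rise monotonically in the working phase, and the completion signal is simply the OR of F1 and F0.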


Idle Phase Acceleration

[Figure: combinational circuits f and g in cascade, with 2-rail inputs x1 to xn and 2-rail outputs y1 to ym, (a) without and (b) with the idle-phase acceleration signal ia.
Timing (a): f(W), g(W), f(I), g(I) in sequence.
Timing (b): f(W), g(W) in sequence, then f(I) with g(I) started by ia on a parallel track.]

f(W): Working phase for "f"
f(I): Idle phase for "f"
g(W): Working phase for "g"
g(I): Idle phase for "g"
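The benefit of idle-phase acceleration can be estimated with a small timing model. The Python sketch below assumes that without acceleration the four phases run strictly in sequence, and that with acceleration the idle phases of f and g overlap completely. Both assumptions are a reading of the timing diagram, and the phase durations are hypothetical.

```python
def cycle_time_sequential(f_w: float, g_w: float,
                          f_i: float, g_i: float) -> float:
    """f(W), g(W), f(I), g(I) one after another."""
    return f_w + g_w + f_i + g_i

def cycle_time_accelerated(f_w: float, g_w: float,
                           f_i: float, g_i: float) -> float:
    """Working phases in sequence; idle phases assumed fully overlapped."""
    return f_w + g_w + max(f_i, g_i)

# Hypothetical phase durations in arbitrary time units.
f_w, g_w, f_i, g_i = 4.0, 3.0, 2.0, 2.0
print(cycle_time_sequential(f_w, g_w, f_i, g_i))    # 11.0
print(cycle_time_accelerated(f_w, g_w, f_i, g_i))   # 9.0
```

Under these assumptions the saving equals the shorter of the two idle phases.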


Conversion between 2-railed data and Bundled-data

[Figure: Async RAM read operation (RAM with ADR and DATA ports, 1-2 CONV from 1-rail to 2-rail, strobe) and Async RAM write operation (RAM with ADR, DATA and WE ports, 2-1 CONV from 2-rail to 1-rail); delay elements on the strobe signals; signals X1, /X1, X2, /X2, Y1, Y2, Z1, /Z1, Z2, /Z2]


Chip Implementation

Fabricated in NEC’s 0.5 µm, 3.3 V CMOS Process with 3 metal layers

Die size: 12.15 mm × 12.15 mm

496,367 Logic Transistors + 8.6K Byte Memory macro

Number of I/O pins: 164 (excluding power pins)


Die Photo

[Figure: die photo. Labeled blocks: REGISTER FILE, INSTRUCTION DECODER, INSTRUCTION FETCH, CACHE TAG, INSTRUCTION CACHE, INST. MEM PROTECTION, DATA MEM PROTECTION, DIVIDER, MULTIPLIER, ALU, ME]


Performance

Dhrystone V2.1 benchmark: 54.1 VAX MIPS consuming 2.1 W at 3.3 V at room temperature

Delay Insensitivity
  Power Supply Voltage: from 1.5 V to 6.0 V (variable)
  Chip Surface Temperature: from −196 °C to +100 °C


Speed and Power Consumption vs. Supply Voltage

(at room temperature)

[Figure: x-axis: Supply Voltage [V], 1 to 4.5; y-axes: Dhrystone [MIPS], 10 to 70, and Power Consumption [W], 0.0 to 6.0; curves: SPEED [MIPS], POWER [W]]


Average Cycle Time for Asynchronous Pipeline

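[Figure: average cycle time per pipeline stage (IF, ID/WB, EX, ME), split into Function and Latch parts. Blocks shown: PC, Next PC, I-cache, IR, SRC, FWD.EX, FWD.ME, R1, ALU, MEM, Reg File, Cond, RD, WR. Legend: Tr, Tf, Tw. Values printed in the figure: 0.49, 2.72, 4.64, 2.65, 0.42, 0.82, 0.47, 0.85, 2.51, 1.55, 2.43, 1.98, 0.42, 0.82, 0.90, 3.38, 0.84, 1.33, 1.26, 0.38, 4.39, 3.56]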


Power Consumption of Pipeline Stages

Stage     No. of Cells   Power (W)   (%)
IF        13362          0.30        (19.4)
I-cache   -              0.29        (19.0)
ID/WB     13993          0.46        (29.6)
EX        16089          0.35        (22.8)
ME        7589           0.14        (9.1)
total     52646          1.54        (100.0)


Comparison of TITAC-2 and MIPS-R2000

                    TITAC-2             MIPS-R2000
Process             0.5 µm CMOS         2 µm CMOS
Routing layers      3-metal             2-metal, 1-poly
Design method       Standard-cell       Full-custom
Die size            12.15 × 12.15 mm    8.5 × 10 mm
On-chip memory      8.6KB SRAM          none
Transistor count    496K                100K
Clock frequency     -                   16.7 MHz
Supply voltage      3.3 V               5 V
Power dissipation   2.11 W              -
  (cache disabled)  1.02 W              < 2 W
Throughput          54.1 VAX MIPS       -
  (cache disabled)  26.5 VAX MIPS       12 VAX MIPS


Tools Used

Commercial: synchronous

Verilog-XL: RTL/logic-level simulation, performance estimation

Design Compiler: technology mapping

Cell 3 Ensemble: floor planning, layout

Saber: analog simulation of original macros

Home-made: asynchronous

Asynchronous Synthesis from STG

Delay Evaluation

Test Generation


Other projects

Univ. of Manchester: Amulet/2e
  completed in Oct. 1996, 40 MIPS, 0.15 W

Univ. Tokyo / Tokyo Inst. Tech.: TITAC-2
  completed in Feb. 1997, 54 MIPS, 2.1 W

California Inst. Tech.: Mini MIPS
  expected in Spring 1998, 280 MIPS, 7.0 W

Intel: microprocessor design for high performance

Philips: portable equipment for low power
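The speed and power figures quoted for the three university processors can be put side by side as MIPS per watt. The Python sketch below only divides the numbers given on this foil. It assumes the MIPS figures are comparable across projects, which the foil does not state, and the Mini MIPS figures were expectations at the time.

```python
def mips_per_watt(mips: float, watts: float) -> float:
    """Throughput delivered per watt of power."""
    return mips / watts

# (MIPS, W) as quoted on the foil.
projects = {
    "Amulet/2e": (40.0, 0.15),
    "TITAC-2": (54.0, 2.1),
    "Mini MIPS (expected)": (280.0, 7.0),
}

for name, (mips, watts) in projects.items():
    print(f"{name}: {mips_per_watt(mips, watts):.1f} MIPS/W")
```

The ratio separates the low-power design point of Amulet/2e from the performance-oriented ones.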


Applications

currently:
  low power
  delay insensitivity

beyond 0.13 µm at 2003:
  where wire delays are dominant
  high-performance VLSI (MPU, ASIC)


Challenges

High-performance low-power design

Verification

Testing

Transient fault tolerance

CAD support


Perspective

Design alternative

Fusion of synchronous and asynchronous design styles
  Internally Async., externally Sync.
  Globally Async., locally Sync.

Modular design to overcome design complexity


Conclusions

Asynchronous design methodology is now available

Performance is already comparable with synchronous counterpart

SDI model ensures both performance and dependability

Ease of modular design helps considerably

More suitable architecture can fully exploit concurrency

Testing is a major challenge
