simultaneous device and interconnect optimization

ECE902 VLSI Interconnects

Fall 1999, Prof. Lei He

Simultaneous Device and Interconnect Optimization

■ Simultaneous device and wire sizing

■ Simultaneous buffer insertion and wire sizing

■ Simultaneous topology construction, buffer insertion and wire sizing
  ● WBA tree (student presentation)
  ● P-tree

Simultaneous Device and Wire Sizing

■ Dominance-property based approach to minimize weighted sum of delay
  ● Simultaneous driver/buffer and wire sizing [Cong-Koh, TVLSI’94] [Cong-Koh-Leung, ISLPED’96]
  ● Simultaneous transistor and interconnect sizing [Cong-He, PDW’96, ICCAD’96]

■ Lagrangian-relaxation based approach to minimize maximum delay
  ● Simultaneous buffer and wire sizing [Chen-Chang-Wong, DAC’96]

■ Mathematical-programming based approach to minimize area while meeting performance requirements
  ● Simultaneous gate and wire sizing [Menezes-Baldick-Pileggi, ICCAD’95]



RC Delay Model for Drivers

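■ Rmin = resistance of min-size driver
■ di = size of i-th driver
■ Cg = gate capacitance of min-size driver
■ Cd = diffusion capacitance of min-size driver

[Figure: cascaded driver chain d1, d2, ..., di, ..., dk, annotated with tD(T,D); switch-level RC model for minimum-size driver with resistances Rp, Rn and capacitances Cg, Cd]

Delay of i-th driver:

$$\text{Delay of } i\text{-th driver} = \frac{R_{min}}{d_i}\,\left(d_i C_d + d_{i+1} C_g\right)$$

Delay from 1st to 2nd last driver:

$$t_D(T,D) = \sum_{i=1}^{k-1} \text{Delay of } i\text{-th driver}$$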
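The driver-chain delay can be sketched in a few lines of Python. This is only an illustration of the formula (Rmin/di)(di·Cd + di+1·Cg) summed over drivers 1 to k-1; the function name `driver_chain_delay` and all numeric values are assumptions, not taken from the slides.

```python
def driver_chain_delay(sizes, r_min, c_g, c_d):
    """Sum of (r_min / d_i) * (d_i * c_d + d_{i+1} * c_g) for i = 1..k-1."""
    total = 0.0
    # Pair each driver with the next one; the last driver is excluded
    # because its load is the routing tree, not another driver.
    for d_i, d_next in zip(sizes, sizes[1:]):
        total += (r_min / d_i) * (d_i * c_d + d_next * c_g)
    return total


# Illustrative values only (assumed): sizes d1..d3 with a stage ratio of 3.
sizes = [1.0, 3.0, 9.0]
delay = driver_chain_delay(sizes, r_min=1000.0, c_g=2e-15, c_d=1e-15)
print(delay)  # delay of the first two stages, in seconds
```

With a constant stage ratio every stage contributes the same delay, which is why the two terms in this example are equal.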

Total Delay Measure t(k,D,W)

Total delay measure:

$$t(k,D,W) = t_D(k,D) + t_l(W)$$

■ Interconnect delay from last driver to sinks

$$t_l(W) = \sum_{\text{sink } N_i} \lambda_i \times t(N_i)$$

where $\lambda_i$ is a user-specified normalized non-negative parameter to prioritize sink $N_i$

Where tD(T,D) is the delay from 1st to 2nd last driver

tl(W) is the interconnect delay from last driver to sinks
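As a small sketch of the total delay measure, the snippet below adds a given driver-chain delay to the λ-weighted sum of sink delays. It assumes that "normalized" means the weights sum to 1; that reading, the function name `total_delay`, and the numbers are assumptions for illustration.

```python
def total_delay(t_driver, sink_delays, weights):
    """t = t_D + sum_i lambda_i * t(N_i), with non-negative weights summing to 1."""
    if len(sink_delays) != len(weights):
        raise ValueError("one weight per sink is required")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1 (assumed normalization)")
    t_wire = sum(w * t for w, t in zip(weights, sink_delays))
    return t_driver + t_wire


# Two sinks; the second one is treated as more critical.
print(total_delay(1.0, [2.0, 4.0], [0.25, 0.75]))
```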



Power Dissipation Formulation

■ Short-circuit: ScP(i) ∝ di

$$\text{Short-circuit Power} = \sum_{i=1}^{k} ScP(i)$$

■ Capacitive:
  CP(i) ∝ (di Cd + di+1 Cg) for i < k
  CP(k) ∝ (dk Cd + CIL), where CIL is the load due to the routing tree

$$\text{Capacitive Power} = \sum_{i=1}^{k} CP(i)$$

■ Total Power = Capacitive + Short-Circuit
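The power measure can be sketched as follows. The slides give only proportionalities, so the constants `k_cap` and `k_sc` (and the function name `power_measure`) are hypothetical placeholders that a real flow would calibrate.

```python
def power_measure(sizes, c_g, c_d, c_il, k_cap=1.0, k_sc=1.0):
    """Capacitive plus short-circuit power for a chain of k drivers.

    CP(i) ~ d_i*c_d + d_{i+1}*c_g for i < k, CP(k) ~ d_k*c_d + c_il,
    ScP(i) ~ d_i. k_cap and k_sc are assumed proportionality constants.
    """
    k = len(sizes)
    capacitive = 0.0
    for i, d in enumerate(sizes):
        # The last driver sees the routing-tree load instead of a gate.
        load = sizes[i + 1] * c_g if i < k - 1 else c_il
        capacitive += d * c_d + load
    short_circuit = sum(sizes)
    return k_cap * capacitive + k_sc * short_circuit


print(power_measure([1.0, 2.0], c_g=1.0, c_d=1.0, c_il=5.0))
```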

Main Theorem: Relation between Driver and Wire Sizing

■ Given (D,W) and (D’, W’) for k drivers

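■ If W = opt-WS(D) and W’ = opt-WS(D’), then D dominates D’ ⇒ W dominates W’

■ If D = opt-DS(W) and D’ = opt-DS(W’), then W dominates W’ ⇒ D dominates D’

(Here X dominates X’ when every component of X is at least the corresponding component of X’.)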



K-SDWS LU-Bound algorithm for Delay Optimization

■ Lower bound of SDWS optimal solution


W0 = Min. Width Assignment (dominated by opt. sol.)

D0 = Opt-DS(W0)

W1 = Opt-WS(D0)

D1 = Opt-DS(W1)

(Di, Wi) monotonically increases

■ (Di, Wi) dominated by optimal solution
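The alternation W0 → D0 → W1 → D1 → ... can be sketched on a toy problem. Everything below is an assumption made for illustration: a single driver of size d behind a fixed source resistance, a single wire of width w, closed-form `opt_ds` and `opt_ws`, and all parameter values. The upper-bound sequence, started from the maximum width, is the symmetric counterpart assumed here.

```python
import math

# Toy model (assumed): source resistance R_S drives a driver of size d,
# which drives a wire of width w and length L ending in load C_L.
R_S, R0, CG = 500.0, 1000.0, 2.0          # source res., unit driver res., unit gate cap
R_W, C_A, L, C_L = 10.0, 0.2, 100.0, 50.0  # unit wire res., unit area cap, length, load
W_MIN, W_MAX, D_MIN, D_MAX = 1.0, 10.0, 1.0, 100.0


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def opt_ds(w):
    """Driver size minimizing R_S*CG*d + (R0/d)*(C_A*w*L + C_L) for fixed w."""
    return clamp(math.sqrt(R0 * (C_A * w * L + C_L) / (R_S * CG)), D_MIN, D_MAX)


def opt_ws(d):
    """Wire width minimizing (R0/d)*C_A*L*w + (R_W*L/w)*C_L for fixed d."""
    return clamp(math.sqrt(R_W * C_L * d / (R0 * C_A)), W_MIN, W_MAX)


def lu_bounds(iterations=50):
    """Return a list of (d_lo, w_lo, d_hi, w_hi), one tuple per iteration."""
    w_lo, w_hi = W_MIN, W_MAX
    history = []
    for _ in range(iterations):
        d_lo, d_hi = opt_ds(w_lo), opt_ds(w_hi)
        history.append((d_lo, w_lo, d_hi, w_hi))
        w_lo, w_hi = opt_ws(d_lo), opt_ws(d_hi)
    return history


history = lu_bounds()
print(history[0])   # initial bounds
print(history[-1])  # bounds after the last iteration
```

Because both toy operators are non-decreasing in their argument, the lower sequence only grows and the upper sequence only shrinks, mirroring the dominance argument on the slide.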

K-SDWS Optimal algorithm for Delay Optimization

■ Linear search for the optimal stage number, k*

Optimal k-SDWS solution

SDWS Optimal Algorithm for Delay Optimization

■ Case 1: the bounds meet

■ Case 2: the bounds do not meet
  ● Discretize driver sizes of k-th driver between the bounds
  ● For each discretized driver
    − compute optimal sizes for k-1 drivers and wires
  ● Select best d-SDWS solution

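Bound on the optimal stage number:

$$1 \le k_D^* \le k_{MAX}^{D}, \qquad k_{MAX}^{D} = \frac{\ln\left(C_{IL}(T, W_{MAX})/C_g\right)}{\ln s^*}$$

where $s^* = e^{(1 + a/s^*)}$ and $a = C_d/C_g$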
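A sketch of how such a stage-number bound could be evaluated numerically. It assumes the classical tapered-buffer relation s* = e^(1 + a/s*) with a = Cd/Cg, and a bound of the form ln(C_load/Cg)/ln s*; both the reading of the bound and the function names are assumptions, and the result is left real-valued rather than rounded.

```python
import math


def optimal_stage_ratio(a, tol=1e-12):
    """Solve s = exp(1 + a/s), i.e. s*(ln s - 1) = a, by bisection (a >= 0)."""
    def h(s):
        return s * (math.log(s) - 1.0) - a

    lo, hi = math.e, 2.0 * math.e   # h(e) = -a <= 0, and h is increasing for s > 1
    while h(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def max_stage_number(c_load, c_g, c_d):
    """ln(c_load / c_g) / ln(s*), with s* computed from a = c_d / c_g."""
    s_star = optimal_stage_ratio(c_d / c_g)
    return math.log(c_load / c_g) / math.log(s_star)


print(optimal_stage_ratio(0.0))   # no diffusion capacitance: ratio e
print(optimal_stage_ratio(1.0))   # Cd = Cg: a larger ratio
print(max_stage_number(c_load=1000.0, c_g=1.0, c_d=1.0))
```

Bisection is used instead of fixed-point iteration so that convergence does not depend on the size of a.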



K-SDWS Optimal algorithm for Combined Delay and Power Optimization

■ Linear search for the optimal stage number, k*

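■ Compute optimal driver sizing solution by MAPLE

■ Select monotone solution

[Equations garbled in extraction: a system in d(i-1), d(i), d(i+1) with coefficients A and B, holding for all i = 2 to k-1, plus a boundary equation for the k-th driver involving CL and Cg.]

■ k_MAX^DP: smallest stage number such that there is no monotone driver solution; it bounds the linear search for k*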

Experiments to Evaluate SDWS Algorithm

■ Compared with other design methods:

  ● CDSMIN (Constant Driver Sizing with ratio e, and MINimum wire width)
  ● ODSMIN (Optimal Driver Sizing, MINimum wire width)
  ● DWSA [Cong-Koh-Leung, LPDW’94] (Independent constant Driver Sizing with ratio e, optimal wire width)

$$\frac{d_{i+1}}{d_i} = \left(\frac{C_L}{C_g}\right)^{1/k}$$



Experimental Results on Power-Delay Trade-off

Simultaneous Transistor and Interconnect Sizing [Cong-He, PDW & ICCAD’96]

Given: Initial layout design for multiple nets, table-based models for device delay and interconnect coupling capacitances

Determine: Discrete sizes for transistors/wires

Minimize: α Delay + β Power + γ Area



Objective for Delay Minimization

■ Delay objective:

$$t(X) = \sum_{i,j} F(i,j)\cdot\frac{R_0(i)}{x_i}\cdot C_0(j)\cdot x_j + \sum_{i,j} F(i,j)\cdot\frac{R_0(i)}{x_i}\cdot C_1(j) + \sum_{i} G(i)\cdot\frac{R_0(i)}{x_i} + \sum_{i} H(i)\cdot\frac{R_0(i)}{x_i}\cdot C_1(i)$$

  ● R0: resistance for unit-width transistor/wire
  ● C0: area capacitance for unit-width transistor/wire
  ● C1: fringing capacitance for transistor/wire
  ● X = {x1, x2, ..., xn}: discrete widths for transistors/wires

■ To minimize t(X) is a simple CH-posynomial program

Dominance Property for Simple CH-posynomial Programs

■ To minimize

$$f(X) = \sum_{p=0}^{m}\sum_{q=0}^{m}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} \frac{a_{pi}}{x_i^{p}}\cdot b_{qj}\,x_j^{q}$$

is a simple CH-posynomial program, where api and bqj are positive constants.

■ Theorem [Cong-He, PDW’96]:
  ● The dominance property holds for simple CH-posynomial program w.r.t. the local refinement.
    − If X dominates optimal solution X* and X’ = local refinement of X, then X’ dominates X*
    − Symmetric for X dominated by X*
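A sketch of local refinement on one simple CH-posynomial instance: Elmore-delay wire sizing of a single line cut into N segments. Local refinement of x_i here means re-optimizing x_i alone with all other widths fixed, which has a closed form because the objective is A·x_i + B/x_i + const in x_i. The wire model, parameter values, and function names are assumptions for illustration.

```python
import math

# Assumed parameters: driver resistance, unit wire resistance and
# capacitance per segment, sink load, width bounds, number of segments.
R_D, R_UNIT, C_UNIT, C_LOAD = 10.0, 10.0, 1.0, 5.0
X_MIN, X_MAX, N = 1.0, 8.0, 4


def delay(x):
    """Elmore delay of the line; segment 0 is next to the driver."""
    total = R_D * (C_UNIT * sum(x) + C_LOAD)
    for i in range(N):
        downstream = C_UNIT * sum(x[i + 1:]) + C_LOAD
        total += (R_UNIT / x[i]) * (C_UNIT * x[i] / 2.0 + downstream)
    return total


def local_refinement(x, i):
    """Best width for segment i with every other width held fixed."""
    a = C_UNIT * (R_D + sum(R_UNIT / x[j] for j in range(i)))   # coefficient of x_i
    b = R_UNIT * (C_UNIT * sum(x[i + 1:]) + C_LOAD)             # coefficient of 1/x_i
    return max(X_MIN, min(X_MAX, math.sqrt(b / a)))


def refine(x, sweeps=100):
    """Apply local refinement to each segment in turn, for a fixed number of sweeps."""
    x = list(x)
    for _ in range(sweeps):
        for i in range(N):
            x[i] = local_refinement(x, i)
    return x


lower = refine([X_MIN] * N)   # start dominated by the optimum
upper = refine([X_MAX] * N)   # start dominating the optimum
print(lower)
print(upper)
```

Starting from the minimum and maximum width assignments gives a lower and an upper bound on the optimal widths; on this toy instance the two bounds coincide to numerical precision.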



Overview of STIS Algorithm

■ Support mixed transistor sizing formulations:
  ● find an optimal size for each gate, each pull-up or pull-down block, or each transistor

■ Algorithm Flow
  1. Partition devices and interconnects into DC-Connected Components (DCCs)
  2. Compute TIGHT lower and upper bounds by iterative LR (local refinement) for devices and wires within each DCC
  3. Compute optimal solution within bounds by bottom-up dynamic program [Lillis-et al, ICCAD’95] within each DCC

Experimental Results

■ Clock nets of 12.7 Mchip/s all-digital BPSK direct-sequence spread-spectrum IF transceiver chip in UCLA1 radio for wireless multimedia information systems

■ Clock nets routed interactively with Flint, fabricated by 1.2um SCMOS technology

■ CLK net: 112 inverters and 255 sinks; DCLK net: 31 inverters and 123 sinks

■ Manually designed driver/buffer: cascade chain of 4 inverters

■ Ideal inter-clock skew = 0



Manual Design versus LR-Based Optimizations

■ Transistor sizing formulation can achieve higher delay and skew reduction at a similar power dissipation

■ Runtimes (wire segmenting: 10um)
  ● LR-based: SBWS 1.18s, STIS 0.88s
  ● Dynamic programming ran out of memory
  ● Total HSPICE simulation ~2000s

                      manual    SBWS              STIS
max delay (ns)        4.6324    4.3447 (-6.2%)    3.9632 (-14.4%)
average power (mW)    60.85     46.09 (-24.3%)    46.29 (-24.2%)
clock skew            470ps     130ps (-3.6x)     40ps (-11.7x)

Trend of Device Effective Resistance

■ R0 is NOT a constant. It depends on size, input slope tt and output load cl
  ● May differ by a factor of 2
  ● NOT a function of a single sizing variable

Effective resistance R0 for unit-width n-transistor:

size = 100x
cl \ tt     0.05ns    0.10ns    0.20ns
0.225pf     12200     12270     19180
0.425pf     8135      9719      12500
0.825pf     8124      8665      10250

size = 400x
cl \ tt     0.05ns    0.10ns    0.20ns
0.501pf     12200     15550     19150
0.901pf     11560     13360     17440
1.701pf     8463      9688      12470

This invalidates the simple CH-posynomial formulation!
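The "factor of 2" remark can be checked directly against the tabulated values. The snippet below only re-reads the numbers from the two tables; the dictionary layout and the function name `spread` are choices made for this sketch.

```python
# R0 values from the slide, keyed by output load (pF); columns are
# input slopes of 0.05 ns, 0.10 ns and 0.20 ns.
R0_100X = {
    0.225: [12200, 12270, 19180],
    0.425: [8135, 9719, 12500],
    0.825: [8124, 8665, 10250],
}
R0_400X = {
    0.501: [12200, 15550, 19150],
    0.901: [11560, 13360, 17440],
    1.701: [8463, 9688, 12470],
}


def spread(table):
    """Ratio of the largest to the smallest R0 entry in a table."""
    values = [v for row in table.values() for v in row]
    return max(values) / min(values)


print(round(spread(R0_100X), 2))
print(round(spread(R0_400X), 2))
```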



Bounded CH-Posynomial Program and Extended Local Refinement

■ To minimize

$$f(X) = \sum_{p=0}^{m}\sum_{q=0}^{m}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} \frac{a_{pi}(X)}{x_i^{p}}\cdot b_{qj}(X)\,x_j^{q}$$

is a general CH-posynomial, when api and bqj are arbitrary functions of X, but each has an upper and lower bound.

■ Extended local refinement on $x_i$ w.r.t. X is local refinement using the following coefficients:
  ● When X dominates X*, for any p, q and $j \neq i$, we use
    − $a_{pi}^{max}$ instead of $a_{pi}(X)$ for $1/x_i^{p}$
    − $a_{qj}^{min}$ instead of $a_{qj}(X)$ for $1/x_j^{q}$
    − $b_{pi}^{min}$ instead of $b_{pi}(X)$ for $x_i^{p}$
    − $b_{qj}^{max}$ instead of $b_{qj}(X)$ for $x_j^{q}$
  ● Symmetric operation when X is dominated by X*

Dominance Property for Bounded CH-Posynomial Program

■ Theorem [Cong-He, ISPD’98]:
  ● The dominance property holds for bounded CH-posynomial program w.r.t. the extended local refinement.
    − If X dominates optimal solution X* and X’ = extended local refinement of X, then X’ dominates X*
    − If X is dominated by X* and X’ = extended local refinement of X, then X’ is dominated by X*

■ Application:
  ● Device and wire sizing problem
    − under general capacitance model
    − under table-based device delay model



Extended Local Refinement for Device

■ $R_0^{max}(i)$ and $R_0^{min}(i)$ are determined
  ● under the assumption that R0 increases w.r.t.
    − increases of size and input slope
    − decrease of output load
  ● by table lookup
    − using continuously updated lower and upper bounds on transistor size, input slope and output load

■ When $X \geq X^*$, we use:
  ● $R_0^{max}(i)$ for LR optimization on transistor i
  ● $R_0^{min}(i)$ for LR optimization on transistors other than i

■ When $X \leq X^*$, we use:
  ● $R_0^{min}(i)$ for LR optimization on transistor i
  ● $R_0^{max}(i)$ for LR optimization on transistors other than i
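A sketch of the table lookup for the R0 bounds, using the size = 100x table of effective resistances from the earlier slide. Under the stated monotonicity assumption, the maximum sits at the corner (largest slope, smallest load) and the minimum at the opposite corner. The transistor size is held fixed and bounds are restricted to grid points; both simplifications, and the function name `r0_bounds`, are assumptions of this sketch.

```python
SLOPES_NS = [0.05, 0.10, 0.20]
LOADS_PF = [0.225, 0.425, 0.825]
# Rows follow LOADS_PF, columns follow SLOPES_NS (size = 100x table).
R0_TABLE = [
    [12200, 12270, 19180],
    [8135, 9719, 12500],
    [8124, 8665, 10250],
]


def r0_bounds(slope_lo, slope_hi, load_lo, load_hi):
    """Return (r0_min, r0_max) over the given slope and load ranges.

    Assumes R0 rises with input slope and falls with output load, so only
    two table corners need to be read. Arguments must be grid values.
    """
    s_lo, s_hi = SLOPES_NS.index(slope_lo), SLOPES_NS.index(slope_hi)
    l_lo, l_hi = LOADS_PF.index(load_lo), LOADS_PF.index(load_hi)
    r0_max = R0_TABLE[l_lo][s_hi]   # smallest load, largest slope
    r0_min = R0_TABLE[l_hi][s_lo]   # largest load, smallest slope
    return r0_min, r0_max


print(r0_bounds(0.05, 0.10, 0.425, 0.825))
```

As the bounds on slope and load tighten during optimization, the two corners move closer together and the gap between the R0 bounds narrows.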

Comparison between STIS Formulations

DCLK        step-model       table-model
sgws        1.16             1.08 (-6.8%)
stis        1.13 (-2.5%)     0.96 (-17.2%)

2cm line    step-model       table-model
sgws        0.82             0.81 (-0.4%)
stis        0.75 (-8.6%)     0.69 (-16.5%)

■ Different formulations on DCLK and 2cm line
  ● Parameters are based on 0.18um process
  ● Optimal buffer insertion is used for 2cm line

■ Total runtime
  ● LR-based optimization ~10 seconds
  ● HSPICE simulation ~3000 seconds



GISS can be Solved as General CH-Posynomial Program

● 16-bit bus, each a 10mm-long line, 500um per segment
● MIN: min width (max spacing)
● GISS/DP: dynamic programming based and under variable ca and cf
● GISS/LR: LR-based and under general cap table

Center       Average Delays (ns)                      Runtimes (s)
spacing      MIN     GISS/DP        GISS/LR           GISS/DP    GISS/LR
2x pitch     1.51    0.80 (-47%)    0.79 (-47%)       183        2.0
3x pitch     1.33    0.52 (-61%)    0.52 (-61%)       189        2.4
4x pitch     1.28    0.42 (-67%)    0.42 (-67%)       511        2.3
5x pitch     1.25    0.37 (-71%)    0.36 (-71%)       1086       4.9
6x pitch     1.23    0.34 (-72%)    0.32 (-73%)       1379       7.7

Simultaneous Device and Interconnect Optimization

■ Simultaneous device and wire sizing

■ Simultaneous buffer insertion and wire sizing

■ Simultaneous topology construction, buffer insertion and wire sizing



Buffer Insertion with Wiresizing [Lillis-Cheng-Lin, ICCAD’95]

■ Objective is to minimize power subject to delay constraints

■ Incorporate the effect of signal slew on buffer delay using piece-wise linear functions

■ In the bottom-up phase, consider discrete wiresizing for each edge e
  ● For each option (c, q) and candidate wire width w:
    cap(e, w) = wire cap. of e with width w
    res(e, w) = wire res. of e with width w
    Compute new option (c’, q’):
      c’ = c + cap(e, w)
      q’ = q − res(e, w) × (cap(e, w)/2 + c)

■ Additional pruning rule considered for power minimization: for options (c, q) with power p and (c’, q’) with power p’, prune (c, q) if p’ < p, c’ ≤ c, q’ ≥ q
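The option update and the power-aware pruning rule can be sketched as below, reading c as downstream capacitance and q as required arrival time (the usual convention for this kind of bottom-up option propagation, not spelled out on the slide). The power bookkeeping (power grows with the added wire capacitance), the cap/res functions, and the function names are assumptions for illustration.

```python
def extend_options(options, widths, cap, res):
    """Propagate (c, q, p) options across an edge for every candidate width."""
    new_options = []
    for c, q, p in options:
        for w in widths:
            c_wire, r_wire = cap(w), res(w)
            c_new = c + c_wire
            q_new = q - r_wire * (c_wire / 2.0 + c)
            p_new = p + c_wire   # assumption: power tracks switched capacitance
            new_options.append((c_new, q_new, p_new))
    return new_options


def prune(options):
    """Drop (c, q, p) if another option has p' < p, c' <= c and q' >= q."""
    kept = []
    for c, q, p in options:
        dominated = any(p2 < p and c2 <= c and q2 >= q for c2, q2, p2 in options)
        if not dominated:
            kept.append((c, q, p))
    return kept


# Hypothetical edge model: capacitance grows and resistance shrinks with width.
def cap(w):
    return 2.0 * w


def res(w):
    return 4.0 / w


options = prune(extend_options([(1.0, 10.0, 0.0)], [1.0, 2.0], cap, res))
print(options)
```

In this example the wider wire yields a better q at the cost of higher c and p, so neither option prunes the other.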

Simultaneous Buffer Insertion/Sizing and Wiresizing [Chu-Wong, ISPD’97]

■ Assumptions:
  ● Consider only area capacitance
  ● Continuous wire widths and buffer sizes without bounds

■ Problem:
  ● Given a single line, driver resistance, load, and the total number of segments n to be used
  ● Objective: find
    (i) the optimal number of buffers to be inserted, and their locations and sizes
    (ii) the optimal length and width of each segment



Simultaneous Buffer Insertion/Sizing and Wiresizing (continued)

■ Results and Implications:
  ● Closed-form formula for optimal number of buffers
  ● All segments in the optimal solution are of equal length
  ● Closed-form formulas for buffer and wire sizes, for any given buffer locations
  ● Buffer locations do not matter, as long as delay is the only objective and the buffer and wire sizes are not bounded
  ⇒ For delay minimization, a chain of cascade drivers is as good as using buffers to break a long line. However, power and area will be affected by buffer locations.

■ For interconnect tree, apply the formulas on edges iteratively; keep buffer locations/sizes and wire widths of other edges fixed while optimizing one edge

■ Shortcoming: ignores fringing capacitance, which is significant in deep submicron

Comparison of Several Interconnect Optimization Algorithms

■ T+B+W: Topology (T), followed by optimal buffer insertion and sizing (B=10), then followed by optimal wire sizing (W=18)

■ TB+BW: Simultaneous T and B (B=3), followed by simultaneous buffer and wire sizing (BW) with B=40, W=18

■ Tbw+BW: Simultaneous TBW with small number of B=3 and W=3, then followed by BW as above

■ TBW: Simultaneous TBW with larger number of B=10 and W=8

■ Provided by the UCLA TRIO (Tree, Repeater, & Interconnect Optimization) package



Comparison of Optimization Results by Different Algorithms

                             Algorithms
Nets           Measure       T+B+W    TB+BW    Tbw+BW    TBW
5-pin nets     Delay (ns)    0.40     0.39     0.35      0.34
                             0.47     0.48     0.38      0.38
                             0.42     0.41     0.36      0.35
               CPU (s)       0.1      0.1      1.4       15
10-pin nets    Delay (ns)    0.42     0.37     0.34      0.33
                             0.56     0.56     0.44      0.44
                             0.47     0.45     0.38      0.38
               CPU (s)       0.8      1.0      6.4       76
20-pin nets    Delay (ns)    0.45     0.43     0.38      0.39
                             0.54     0.48     0.42      0.41
                             0.46     0.43     0.38      0.38
               CPU (s)       1.6      4.0      27.6      350
