
-1-

Sensitivity-Guided Metaheuristics for Accurate Discrete Gate Sizing

Jin Hu*, Andrew B. Kahng†, Seokhyeong Kang†, Myung-Chul Kim* and Igor L. Markov*
†UC San Diego, *University of Michigan

International Conference on Computer-Aided Design, November 5th, 2012

-2-

Outline

Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-3-

Gate Sizing in VLSI Design

Gate sizing
– Effective approach to power and delay optimization
– Sizing problem seen at all phases of the RTL-to-GDS flow

[Figure: Energy vs. performance envelope in VLSI design. Axes: energy vs. delay. The region of all possible designs is bounded by the Pareto frontier of the energy consumption vs. performance tradeoff, which runs between the lowest possible delay and the lowest possible energy.]

-4-


Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)

[Figure: example circuit with inputs A, B, C and output Z, shown with two cell assignments; each gate is labeled delay / power.
Before sizing: 5 / 10, 10 / 15, 5 / 10, 10 / 15, 5 / 10, 10 / 20; arrival time: 30, power: 80
After sizing: 12 / 5, 12 / 9, 12 / 5, 12 / 9, 24 / 2, 6 / 30; arrival time: 30, power: 80 → 60]

-5-


Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)

Tunable parameters: gate width, gate length and Vth
– Gate width (drive strength): INVX2, INVX4, INVX8, INVX16
– Multi-Vth: HVT, NVT, LVT
– Lgate bias: L=65nm, L=60nm, L=55nm
Each list runs from lower power / lower speed to higher power / higher speed

-6-

Previous Approaches

Common heuristics/algorithms
[Figure: linear programming, convex optimization, Lagrangian relaxation, dynamic programming and sensitivity-based sizing, grouped into continuous methods and discrete methods and placed along an optimality vs. scalability axis]

Limitations
– Continuous methods: industrial cell libraries offer discrete gate sizes, and rounding solutions is not easy
– Discrete methods: scalability to large circuits is an issue
– Do not account for realistic delay models and constraints (capacitance, slew)

-7-

Stochastic Combinatorial Optimization

Hard combinatorial optimizations are often solved using Simulated Annealing or other metaheuristics
Our work uses two newer metaheuristic frameworks: Large-Step Markov Chains and Go-With-The-Winners

Simulated Annealing (SA): analogy to physical annealing and thermodynamic ensembles [Kirkpatrick, Gelatt, Vecchi 1983]
– State = solution; Energy = cost
– Optimal only in the limit of infinitely slow cooling and runtime
– Annealing on fractal landscapes: Sorkin91
– Finite-time annealing: BoeseK93

[Figure: SA trajectory from start to end, including up-hill moves]

-8-

Stochastic Combinatorial Optimization: LSMC

Large-Step Markov Chains [Martin, Otto, Felten 1991]: iteratively perform two operations:
1. descend using a greedy search method,
2. perturb the local optimum result with a kick move
Takes advantage of an available local search heuristic; more efficient than conventional simulated annealing
LSMC is essentially greedy, but with a powerful neighborhood operator (= {kick + descent}); always steps directly from one local minimum to a better local minimum

[Figure: LSMC trajectory from start; a kick move followed by descent leads to the next local min]
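The kick + descent loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`lsmc`, `toy_cost`, `toy_descend`, `toy_kick`) and the one-dimensional toy landscape are assumptions made only for this example.

```python
import random


def lsmc(start, cost, descend, kick, iterations, rng):
    """Large-step Markov chain: kick + descent, keeping only improvements."""
    current = descend(start)  # begin at a local minimum
    for _ in range(iterations):
        # One "large step": perturb the local minimum, then descend greedily.
        candidate = descend(kick(current, rng))
        if cost(candidate) < cost(current):  # accept only a better local minimum
            current = candidate
    return current


# Toy landscape (hypothetical): a parabola with a penalty off multiples of 5,
# so multiples of 5 near x = 20 are local minima for +/-1 moves.
def toy_cost(x):
    return (x - 20) ** 2 + (0 if x % 5 == 0 else 100)


def toy_descend(x):
    """Greedy descent over the +/-1 neighborhood until no neighbor improves."""
    while True:
        neighbor = min((x - 1, x + 1), key=toy_cost)
        if toy_cost(neighbor) >= toy_cost(x):
            return x
        x = neighbor


def toy_kick(x, rng):
    """Random perturbation large enough to leave the current basin."""
    return x + rng.randint(-6, 6)


best = lsmc(0, toy_cost, toy_descend, toy_kick, iterations=200, rng=random.Random(0))
print(best, toy_cost(best))
```

Plain greedy descent from x = 0 stays at x = 0 in this toy landscape; the kick move is what lets the search reach better local minima.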

-9-

Stochastic Combinatorial Optimization: GWTW

Go-With-The-Winners [Aldous, Vazirani 1994]: invoke greedy heuristics with randomized multi-starts, explore a large space by continuing the search from a small set of best-seen solutions
Finds the global optimum with high probability under certain assumptions [AV94]
Runtime of GWTW is bounded by a polynomial in the depth of the tree and a tree imbalance parameter

[Figure: trajectories from start-1, start-2, start-3 and start-4 converging to end]
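A simplified Python sketch of the idea on this slide: several randomized starts advance in parallel, and after each round the search continues only from the best-seen solutions. This is a variant in the spirit of the slide rather than the exact scheme of [AV94]; `go_with_the_winners`, `gw_cost` and `gw_step` are hypothetical names, and the toy problem is an assumption.

```python
import random


def go_with_the_winners(starts, step, cost, rounds, keep, rng):
    """Advance all trajectories, then restart the losers from the `keep` best."""
    population = list(starts)
    for _ in range(rounds):
        population = [step(s, rng) for s in population]  # advance every trajectory
        winners = sorted(population, key=cost)[:keep]     # small set of best-seen solutions
        # Every slot in the population continues from one of the winners.
        population = [winners[i % keep] for i in range(len(population))]
    return min(population, key=cost)


# Toy problem (hypothetical): find the integer closest to 42.
def gw_cost(x):
    return abs(x - 42)


def gw_step(x, rng):
    """Randomized greedy step: accept a random move only if it improves the cost."""
    candidate = x + rng.randint(-3, 3)
    return candidate if gw_cost(candidate) < gw_cost(x) else x


starts = [0, 100, -50, 7]
result = go_with_the_winners(starts, gw_step, gw_cost, rounds=50, keep=2, rng=random.Random(1))
print(result, gw_cost(result))
```

Because the step never worsens a solution and the winners always survive, the best cost in the population cannot increase from one round to the next.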

-10-

Our Work: Sensitivity-Guided Metaheuristics

We apply sensitivity-guided metaheuristics based on the Go-With-The-Winners paradigm
– Define a parameterized space for gate sizing
– Explore a heuristic space using a multistart technique and efficient parallelization on a multi-core system
– Use total negative slack (TNS) as a sensitivity function (with a fast estimation technique)

Infrastructure: ISPD 2012 gate sizing contest
– Realistic benchmarks mapped into a modern discrete gate library

-11-

Outline

Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-12-

Trident: Sensitivity-Guided Metaheuristics

In our heuristic, multiple tines of a trident represent multiple solution trajectories

Trident: central to the symbols of both UCSD and Ukraine

-13-

Trident: Sensitivity-Guided Metaheuristics

Our heuristic: explore a parameterized heuristic space with multistarts, then apply Go-With-The-Winners

[Figure: trajectories from the initial solution to the final solution; multistarts followed by go-with-the-winners]
– Global Timing Recovery (GTR): find violation-free solutions with multistarts (recover feasibility)
– Power Reduction with Feasible Timing (PRFT): iteratively reduce total leakage with greedy downsizing (maintain feasibility)

-14-

Trident: Entire Flow

[Flow diagram]
1. Input design (netlist, SPEF, SDF, cell library) with initial cell assignments
2. Primary optimization (multi-threaded)
– Global Timing Recovery (GTR): multistart with coarse search, then fine search; produces a violation-free solution
– Power Reduction with Feasible Timing (PRFT): sensitivity-guided greedy sizing and perturbing (upsizing) bottleneck cells
3. Final cell assignments

-15-

Global Timing Recovery: Flow on Each Thread

GTR seeks violation-free solutions with two parameters: α (leakage exponent) and γ (% of cells upsized)

1. Run static timing analysis
2. Calculate sensitivity (using α) for cells with negative slack
3. Upsize γ% of cells in descending order of sensitivity
4. Update timing
5. Timing met? If NO, repeat from the sensitivity calculation

Cell sensitivity
• TNS: total negative slack
• ∆TNS: TNS reduction after cell upsizing
• ∆leakage: cell leakage increase after cell upsizing
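The ranking step of this flow can be sketched as follows. The sensitivity formula itself did not survive on this slide, so the form used here, ∆TNS / ∆leakage^α, is an assumption chosen to be consistent with α being called the leakage exponent; the cell records, field names and function names are likewise hypothetical.

```python
import math


def sensitivity(delta_tns, delta_leakage, alpha):
    """Assumed form: TNS reduction per unit of leakage increase raised to alpha."""
    return delta_tns / (delta_leakage ** alpha)


def pick_cells_to_upsize(cells, alpha, gamma_percent):
    """Return names of the top gamma% negative-slack cells by sensitivity."""
    violating = [c for c in cells if c["slack"] < 0]
    if not violating:
        return []
    ranked = sorted(
        violating,
        key=lambda c: sensitivity(c["delta_tns"], c["delta_leakage"], alpha),
        reverse=True,  # descending order of sensitivity
    )
    count = max(1, math.ceil(len(ranked) * gamma_percent / 100))
    return [c["name"] for c in ranked[:count]]


# Hypothetical per-cell estimates for one GTR iteration.
cells = [
    {"name": "a", "slack": -5.0, "delta_tns": 10.0, "delta_leakage": 2.0},
    {"name": "b", "slack": -3.0, "delta_tns": 8.0, "delta_leakage": 1.0},
    {"name": "c", "slack": 2.0, "delta_tns": 50.0, "delta_leakage": 1.0},
    {"name": "d", "slack": -1.0, "delta_tns": 1.0, "delta_leakage": 4.0},
]

print(pick_cells_to_upsize(cells, alpha=1.0, gamma_percent=50))
print(pick_cells_to_upsize(cells, alpha=0.0, gamma_percent=30))
```

Under this assumed form, α = 0 ranks cells purely by timing gain, while larger α penalizes leakage-expensive upsizing more strongly, which is why sweeping α yields different solutions.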

-16-

Global Timing Recovery: ∆TNS Estimation

Estimate the impact of a single cell modification
– Invoking STA is computationally prohibitive, so we approximate the impact on TNS (∆TNS)
∆TNS estimation
– A cell modification affects AATs and RATs for downstream cells, and RATs for upstream cells
– ∆TNS estimation reduces runtime and allows us to find a heuristic solution quickly
– ∆delay = delay change (old – new) from upsizing
– Npaths = # of negative-slack paths through the cell

-17-

Global Timing Recovery: Multistart

Multistart with different parameters and “Go-With-The-Winners”: GTR sweeps parameters α and γ, and chooses the best (minimum-leakage) solutions

[Table: search space and step size for α and γ in the coarse and fine searches]
– Coarse search: sweeps α and γ over their initial ranges with coarse-grain step sizes ∆α and ∆γ (CGS), and keeps the best solutions (α, γ)
– Fine search: sweeps ranges of the form [α − ∆α/2, α + ∆α/2] and [γ − ∆γ/2, γ + ∆γ/2] around each best (α, γ) with fine-grain step sizes (FGS)
Focus on ranges around best-seen parameters
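A minimal sketch of such a coarse-to-fine sweep, assuming an `evaluate(alpha, gamma)` callback that runs one GTR start and returns the leakage of the resulting solution (or `float("inf")` if no violation-free solution is found). The grid sizes, the toy leakage function and all names are assumptions for illustration; the slide's exact ranges and step sizes are not reproduced.

```python
import itertools


def sweep(evaluate, alphas, gammas, keep):
    """Evaluate every (alpha, gamma) start and keep the lowest-leakage ones."""
    results = [(evaluate(a, g), a, g) for a, g in itertools.product(alphas, gammas)]
    return sorted(results)[:keep]


def window(center, width, points):
    """Evenly spaced values in [center - width/2, center + width/2]."""
    step = width / (points - 1)
    return [center - width / 2 + i * step for i in range(points)]


def coarse_to_fine(evaluate, alpha_max, gamma_max, coarse_points, fine_points, keep):
    """Coarse sweep over (0, max], then fine sweeps around each coarse winner."""
    d_alpha = alpha_max / coarse_points  # coarse step sizes
    d_gamma = gamma_max / coarse_points
    alphas = [d_alpha * (i + 1) for i in range(coarse_points)]
    gammas = [d_gamma * (i + 1) for i in range(coarse_points)]
    winners = sweep(evaluate, alphas, gammas, keep)
    candidates = list(winners)  # coarse winners stay eligible
    for _, a, g in winners:
        candidates += sweep(
            evaluate,
            window(a, d_alpha, fine_points),  # one coarse step wide, centered on a
            window(g, d_gamma, fine_points),
            keep,
        )
    return winners[0], min(candidates)


# Hypothetical stand-in for "run GTR with (alpha, gamma) and report leakage".
def toy_leakage(alpha, gamma):
    return 1.0 + (alpha - 0.9) ** 2 + ((gamma - 24.0) / 50.0) ** 2


coarse_best, fine_best = coarse_to_fine(
    toy_leakage, alpha_max=2.0, gamma_max=50.0, coarse_points=4, fine_points=5, keep=2
)
print(coarse_best, fine_best)
```

Each `evaluate` call is independent of the others, which is what makes the sweep straightforward to distribute over threads.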

-18-

Power Reduction with Feasible Timing

In GTR, some cells are oversized
PRFT iteratively reduces total leakage power using sensitivity-guided greedy sizing (SGGS)

SGGS procedure:
1. Run static timing analysis
2. Calculate sensitivity for all cells
3. Downsize cell C with maximum sensitivity
4. Run incremental STA
5. If slack(C) < 0 (YES), revert the sizing; otherwise (NO), keep it and continue
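The SGGS loop can be sketched on a deliberately simplified timing model. The assumptions here are loud ones: all cells sit on a single path, so slack is the required time minus the sum of cell delays, and the sensitivity is ∆leakage / ∆delay. A real sizer would use STA on the full netlist; the names and data below are hypothetical.

```python
def sggs(sizes, assignment, required_time):
    """Greedy downsizing: sizes[c] lists (delay, leakage) options, index 0 = smallest."""
    assignment = dict(assignment)
    blocked = set()  # cells whose next downsizing step violated timing

    def slack():
        # Simplified stand-in for incremental STA: one path through all cells.
        return required_time - sum(sizes[c][assignment[c]][0] for c in assignment)

    while True:
        best, best_sens = None, 0.0
        for cell, idx in assignment.items():
            if idx == 0 or cell in blocked:
                continue  # already smallest, or known to break timing
            (d_old, l_old), (d_new, l_new) = sizes[cell][idx], sizes[cell][idx - 1]
            d_leak, d_delay = l_old - l_new, d_new - d_old
            sens = d_leak / d_delay if d_delay > 0 else float("inf")
            if best is None or sens > best_sens:
                best, best_sens = cell, sens
        if best is None:
            return assignment  # nothing left to downsize
        assignment[best] -= 1  # downsize the most sensitive cell
        if slack() < 0:
            assignment[best] += 1  # revert the sizing
            blocked.add(best)


# Hypothetical library: each cell has a small (slow, low-leakage) and a large option.
sizes = {
    "A": [(8, 2), (5, 10)],
    "B": [(9, 3), (5, 6)],
    "C": [(6, 1), (5, 4)],
}
final = sggs(sizes, {"A": 1, "B": 1, "C": 1}, required_time=20)
print(final)
```

In this toy run C and A are downsized, while the attempt on B is reverted because it would make the slack negative.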

-19-

PRFT: Sensitivity Functions

PRFT runs multiple SGGS with different sensitivity functions (SF1 ~ SF5)

SF1   ∆leakage / ∆delay
SF2   ∆leakage * slack
SF3   ∆leakage / (∆delay * #paths)
SF4   ∆leakage * slack / #paths
SF5   ∆leakage * slack / (∆delay * #paths)

Each SF provides a different solution, and we select the best solution among them
Each run automatically finds the best SF for a given testcase
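The five sensitivity functions in the table translate directly into code. The sketch below writes them as Python lambdas and picks the SF whose SGGS run ended with the lowest leakage; the argument names, `select_best_sf` and the sample leakage values are assumptions for illustration.

```python
# Arguments: leakage change, delay change, cell slack, number of paths through the cell.
SENSITIVITY_FUNCTIONS = {
    "SF1": lambda d_leak, d_delay, slack, n_paths: d_leak / d_delay,
    "SF2": lambda d_leak, d_delay, slack, n_paths: d_leak * slack,
    "SF3": lambda d_leak, d_delay, slack, n_paths: d_leak / (d_delay * n_paths),
    "SF4": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / n_paths,
    "SF5": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / (d_delay * n_paths),
}


def select_best_sf(leakage_by_sf):
    """Return the name of the SF whose SGGS run ended with the lowest leakage."""
    return min(leakage_by_sf, key=leakage_by_sf.get)


# Evaluate all five functions for one hypothetical cell.
values = {name: fn(6.0, 2.0, 4.0, 3) for name, fn in SENSITIVITY_FUNCTIONS.items()}
print(values)

# Hypothetical final leakage of three SGGS runs on one testcase.
print(select_best_sf({"SF1": 0.310, "SF2": 0.305, "SF5": 0.299}))
```

Since the runs with different SFs are independent, they can be executed in parallel and compared only at the end.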

-20-

PRFT: Speeding up Bottleneck Cells

Monotonic downward sizing can get stuck in a local optimum
Speed up bottleneck cells: recover timing slack with minimum power impact
Perturbation and greedy sizing recall the LSMC approach

[Flow diagram: sensitivity-guided greedy sizing with SFi → best solution → best seen? If yes, speed up γ% of bottleneck cells (kick-move) and repeat greedy sizing; if no, output the final solution]
[Figure: progression of GTR and PRFT in the (TNS, leakage) plane]

-21-

Incremental Static Timing Analysis

In PRFT, cell slack must be recalculated after each cell sizing; incremental STA is used to reduce runtime
To achieve further speedup, we propagate updated timing only when the change is larger than a propagation threshold (e.g., 0.1ps)

[Figure: sized cell T with fanin cells FI1, FI2, fanout cells FO1, FO2 and cells S1, S2]
1. Update cell delay, transition time and AAT
2. Update RAT and slack
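Threshold-based propagation can be sketched for the forward (AAT) step as follows; the backward RAT/slack update would be analogous. This is a simplified model under stated assumptions: each cell has one delay value, arrival time is the maximum fanin arrival plus the cell delay, and transition times are ignored. The graph and all names are hypothetical.

```python
from collections import deque


def propagate_aat(fanin, fanout, delay, aat, resized, threshold):
    """Update arrival times downstream of `resized`; return the number of cells updated."""
    queue = deque([resized])
    updated = 0
    while queue:
        cell = queue.popleft()
        new_aat = max((aat[p] for p in fanin[cell]), default=0.0) + delay[cell]
        if cell != resized and abs(new_aat - aat[cell]) <= threshold:
            continue  # change too small: stop propagating along this branch
        aat[cell] = new_aat
        updated += 1
        queue.extend(fanout[cell])
    return updated


# Hypothetical netlist: a -> b -> c, and d -> c; cell a was just resized (delay 1.0 -> 1.5).
fanin = {"a": [], "d": [], "b": ["a"], "c": ["b", "d"]}
fanout = {"a": ["b"], "d": ["c"], "b": ["c"], "c": []}
delay = {"a": 1.5, "b": 1.0, "c": 1.0, "d": 5.0}
stale = {"a": 1.0, "b": 2.0, "c": 6.0, "d": 5.0}  # arrival times before resizing

exact = dict(stale)
n_exact = propagate_aat(fanin, fanout, delay, exact, "a", threshold=0.0)

approx = dict(stale)
n_approx = propagate_aat(fanin, fanout, delay, approx, "a", threshold=1.0)
print(n_exact, exact, n_approx, approx)
```

With a zero threshold the update stops at c because its arrival time is dominated by d; with a larger threshold it stops one cell earlier, trading a bounded timing error for less work.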

-22-

Handling Capacitance and Slew Violations

Each standard cell can drive a certain maximum capacitance, and transition time must be smaller than the maximum transition time
Trident removes max-capacitance and max-transition (slew) violations at every iteration of GTR
1. Backward traversal: visit cells in reverse order, and upsize driving cells
2. Forward traversal: downsize fanout cells
Requires one to two iterations

[Figure: circuit with a max. cap. violation, shown for each traversal]

-23-

Configurations for GWTW

Trident can configure the number of best-seen solutions used in GWTW

[Figure: stage pipeline GTR coarse search → GTR fine search 1 → GTR fine search 2 → PRFT greedy sizing (SF1, SF2, ...) → PRFT kick-move + greedy sizing; stages pass on either parameters from the N best-seen solutions or the N best-seen solutions themselves]

Which configuration is optimal in terms of runtime vs. sizing quality?
– More start points → more chances to find a near-optimum
– Runtime increases with the number of start points

-24-

Outline

Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-25-

ISPD 2012 Gate Sizing Contest [Ozdal et al.]

Provides benchmarks to accurately model the discrete gate-sizing problem
Netlist (Verilog), parasitics (SPEF), timing constraints (SDC)
Library: 11 different logic functions, each with 30 different cell types (three Vt options × ten different sizes), for 330 cells in total
The contest compares leakage power of violation-free solutions

-26-

Analysis of Our Implementation

Trident is written in C++ and has a built-in static timer
– Significantly faster than Tcl access to PT in the contest environment
– Runtime is improved with incremental STA
Runtime for all 14 ISPD 2012 benchmarks: less than 83 hours with 4 threads (Intel Xeon E31230)

Runtime breakdown (for NETCARD_slow)
– GTR: 31% (coarse search: 6.2%, fine-grain search: 25.1%)
– PRFT: 69% (greedy sizing: 45.4.0%, perturbing iterations: 23.0%)

-27-

Characterization of ISTA

Runtime of full-scale STA (FSTA) and incremental STA (ISTA)
– Average runtime and maximum slack error have been measured after randomly sizing 1% of cells

Columns 0ps / 0.1ps / 1.0ps are propagation thresholds.

Benchmark   FSTA     ISTA runtime (msec)        ISTA max. error (ps)
            (sec)    0ps      0.1ps    1.0ps    0ps    0.1ps   1.0ps
DMA         0.233    1.495    0.845    0.508    0.0    0.10    2.20
PCI         0.271    0.982    0.717    0.348    0.0    0.22    1.73
DES         1.108    0.7      0.508    0.422    0.0    0.10    1.28
VGA         1.729    22.75    8.108    2.069    0.0    0.33    2.59
B19         2.435    5.46     2.717    1.833    0.0    0.33    1.94
LEON3MP     6.746    43.21    2.152    0.939    0.0    0.32    2.73
NETCARD     9.751    9.612    2.299    1.675    0.0    0.32    2.57
Geomean     924X     2.86X    1.0X     0.54X    0.0X   1.0X    9.1X

-28-

Tuning of ISTA

Runtime vs. quality of sizing solutions with different propagation thresholds
Testcase: DMA (w/ tight timing constraint)

Prop. threshold          0.0ps    0.1ps    1.0ps
ISTA runtime (msec)      1.495    0.845    0.508
ISTA max. error (ps)     0.0      0.10     2.20
Trident runtime (min)    19.4     13.9     12.8
Final leakage (mW)       0.299    0.299    0.306

-29-

Configurations of GWTW

Five multi-threaded stages can provide n best-seen solutions for the GWTW heuristic
Solution quality vs. runtime

[Figure: normalized leakage power (y-axis, 0.880 to 1.020) vs. normalized runtime (x-axis, 0 to 6) for stage configurations 10010, 10011, 11010, 11011, 11111, 11211, 20011, 22211, 22221]

Stage configurations [A][B][C][D][E]
Stage
• A: GTR coarse-grain search
• B: GTR fine-grain search I
• C: GTR fine-grain search II
• D: PRFT greedy sizing
• E: PRFT speed up bottleneck
Value
• 0: skip the stage
• 1: keep one best solution
• 2: keep two best solutions
Default configuration: “22211”
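The five-digit configuration strings can be decoded mechanically. A small sketch follows; `decode_config` is a hypothetical helper, not part of Trident, and only the stage names and value meanings come from the slide.

```python
STAGES = (
    "A: GTR coarse-grain search",
    "B: GTR fine-grain search I",
    "C: GTR fine-grain search II",
    "D: PRFT greedy sizing",
    "E: PRFT speed up bottleneck",
)


def decode_config(config):
    """Map a string such as "22211" to the number of best solutions kept per stage."""
    if len(config) != len(STAGES) or any(ch not in "012" for ch in config):
        raise ValueError("expected five digits, each 0, 1 or 2")
    # 0 = skip the stage, 1 = keep one best solution, 2 = keep two best solutions.
    return {stage: int(ch) for stage, ch in zip(STAGES, config)}


default = decode_config("22211")
for stage, kept in default.items():
    print(stage, "->", "skip" if kept == 0 else f"keep {kept} best")
```

For the default “22211”, the three GTR stages each keep two best solutions and the two PRFT stages each keep one.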

-30-

Experimental Results on ISPD Benchmarks

Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with tight timing constraint
GTR and PRFT parameters are the best parameter values found by our heuristic

Benchmark        Leakage power     Runtime   GTR param.     PRFT param.
(# of cells)     GTR      PRFT     (min)     α      γ(%)    SF     γ(%)
DMA (25K)        0.65     0.299    14        0.91   24.5    SF5    1
PCI (33K)        0.348    0.183    13        0.91   34      SF4    4
DES (111K)       7.157    1.842    83        0.85   46.5    SF5    1
VGA (165K)       0.685    0.471    46        0.7    17.5    SF5    4
B19 (219K)       1.377    0.771    207       1.33   16.5    SF2    4
LEON3 (649K)     1.989    1.487    1323      0.71   7       SF4    1
NETCARD (959K)   1.997    1.861    1097      0.57   4       SF3    1
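As a worked check on the tight-constraint results, the leakage reduction that PRFT achieves over the GTR solution can be computed directly from the two leakage columns. The numbers are copied from the table; the helper name and the choice of percent reduction as the metric are assumptions for illustration.

```python
# (GTR leakage, PRFT leakage) per benchmark, tight timing constraint.
tight = {
    "DMA": (0.65, 0.299),
    "PCI": (0.348, 0.183),
    "DES": (7.157, 1.842),
    "VGA": (0.685, 0.471),
    "B19": (1.377, 0.771),
    "LEON3": (1.989, 1.487),
    "NETCARD": (1.997, 1.861),
}


def reduction_percent(gtr, prft):
    """Leakage saved by PRFT relative to the GTR solution, in percent."""
    return 100.0 * (gtr - prft) / gtr


reductions = {name: round(reduction_percent(*pair), 1) for name, pair in tight.items()}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```

The reduction shrinks on the largest benchmarks, where the GTR solution is already closer to the final leakage.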

-31-

Experimental Results on ISPD Benchmarks

Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with loose timing constraint

Benchmark        Leakage power     Runtime   GTR param.     PRFT param.
(# of cells)     GTR      PRFT     (min)     α      γ(%)    SF     γ(%)
DMA (25K)        0.211    0.145    10        1      10      SF5    5
PCI (33K)        0.185    0.111    10        1.11   36      SF5    4
DES (111K)       0.922    0.614    70        0.83   8.5     SF2    3
VGA (165K)       0.454    0.351    88        1      10      SF4    3
B19 (219K)       0.718    0.583    214       1.5    7.5     SF5    1
LEON3 (649K)     1.422    1.341    1274      0.89   4       SF4    2
NETCARD (959K)   1.818    1.77     300       2.67   4       SF3    1

-32-

Leakage Comparison on ISPD Benchmarks

Contest best: best of all entries in the competition (ISPD 2012 contest)
Intel Labs (contest organizer) released five (near-optimal) results
ISPD 2012 contest: http://archive.sigda.org/ispd/contests/12/ispd2012_contest.html

• In all benchmarks (except one), Trident achieves the lowest leakage power: 43% further reduction over the contest winner.
• We outperform Intel results on four.

[Figure: leakage power of Intel Labs, GTR+PRFT and Contest best on DMA_fast, DMA_slow, pci_bridge32_fast, pci_bridge32_slow, des_perf_fast, des_perf_slow, b19_slow, vga_lcd_fast, vga_lcd_slow, leon3mp_slow, netcard_fast, netcard_slow]

-33-

Conclusions

Within the research-oriented infrastructure used in ISPD 2012 Gate-Sizing Contest, we have developed a metaheuristic approach to gate sizing

Our implementation, Trident, outperforms the best reported results on all but one of the ISPD 2012 benchmarks.

Compared to the 2012 contest winner, we further reduce leakage power by an average of 43%

-34-

Ongoing Work

Extension to support a real industry library
Consider the addition of interconnect delay in the next version of the sizer

-35-

Thank you

-36-

Experimental Results on ISPD Benchmarks

• Leakage power / runtime / parameters for GTR, PRFT

Benchmark      # of cells   Leakage power     Runtime   GTR param.     PRFT param.
               (K)          GTR      PRFT     (min)     α      γ(%)    SF     γ(%)
DMA_fast       25.3         0.65     0.299    14        0.91   24.5    SF5    1
DMA_slow       25.3         0.211    0.145    10        1      10      SF5    5
PCI_fast       33.2         0.348    0.183    13        0.91   34      SF4    4
PCI_slow       33.2         0.185    0.111    10        1.11   36      SF5    4
DES_fast       111          7.157    1.842    83        0.85   46.5    SF5    1
DES_slow       111          0.922    0.614    70        0.83   8.5     SF2    3
VGA_fast       165          0.685    0.471    46        0.7    17.5    SF5    4
VGA_slow       165          0.454    0.351    88        1      10      SF4    3
B19_fast       219          1.377    0.771    207       1.33   16.5    SF2    4
B19_slow       219          0.718    0.583    214       1.5    7.5     SF5    1
LEON3_fast     649          1.989    1.487    1323      0.71   7       SF4    1
LEON3_slow     649          1.422    1.341    1274      0.89   4       SF4    2
NETCARD_fast   959          1.997    1.861    1097      0.57   4       SF3    1
NETCARD_slow   959          1.818    1.77     300       2.67   4       SF3    1

