-1-
Sensitivity-Guided Metaheuristics for Accurate Discrete Gate Sizing
Jin Hu*, Andrew B. Kahng†, Seokhyeong Kang†, Myung-Chul Kim* and Igor L. Markov*
†UC San Diego, *University of Michigan
International Conference on Computer-Aided Design, November 5th, 2012
-2-
Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work
-3-
Gate Sizing in VLSI Design
Gate sizing
– Effective approach to power, delay optimization
– Sizing problem seen at all phases of RTL-to-GDS flow
[Figure: Energy vs. Performance Envelope in VLSI Design. Axes: Energy vs. Delay. The region of all possible designs is bounded by the Pareto frontier of the energy consumption vs. performance tradeoff, with the lowest possible delay and the lowest possible energy marked.]
-4-
Gate Sizing in VLSI Design
Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)
[Figure: example circuit with gates A, B, C and output Z, annotated with delay / power per cell option. Before sizing (annotations 5/10, 10/15, 5/10, 10/15, 5/10, 10/20): arrival time 30, power 80. After sizing (annotations 12/5, 12/9, 12/5, 12/9, 24/2, 6/30): arrival time 30, power 80 → 60.]
-5-
Gate Sizing in VLSI Design
Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)
Tunable parameters: gate-width, gate-length and Vth
– gate-width (drive-strength): INVX2, INVX4, INVX8, INVX16
– multi-Vth: HVT, NVT, LVT
– Lgate-bias: L=60nm, L=65nm, L=55nm
Each parameter trades off lower power, lower speed against higher power, higher speed
-6-
Previous Approaches
Common heuristics/algorithms
– Continuous methods and discrete methods, arranged along an Optimality vs. Scalability axis: Linear programming, Convex optimization, Lagrangian relaxation, Dynamic programming, Sensitivity-based sizing
Limitations
– Continuous methods: industrial cell libraries offer discrete gate sizes, and rounding solutions is not easy
– Discrete methods: scalability to large circuits is an issue
– Do not account for realistic delay models and constraints (capacitance, slew)
-7-
Stochastic Combinatorial Optimization
Hard combinatorial optimizations are often solved using Simulated Annealing or other metaheuristics
Our work uses two newer metaheuristic frameworks: Large-Step Markov Chains and Go-With-The-Winners
Simulated Annealing
– SA: analogy to physical annealing and thermodynamic ensembles [Kirkpatrick, Gelatt, Vecchi 1983]
– State = solution; Energy = cost
– Optimal only in limit of infinitely slow cooling and runtime
– Annealing on fractal landscapes: Sorkin91
– Finite-time annealing: BoeseK93
[Figure: annealing trajectory from start, up hill, to end]
-8-
Stochastic Combinatorial Optimization: LSMC
Large-Step Markov Chains [Martin, Otto, Felten 1991]: iteratively perform two operations:
1. descend using a greedy search method
2. perturb local optimum result with kick move
Takes advantage of an available local search heuristic; more efficient than conventional simulated annealing
LSMC is essentially greedy, but with a powerful neighborhood operator (= {kick + descent}); always steps directly from one local minimum to a better local minimum
[Figure: trajectory from start, via a kick move, to the next local min]
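The kick + descent loop can be sketched on a toy problem. This is a minimal illustration only, not the authors' implementation: the function names (`descend`, `lsmc`), the integer landscape `toy_cost`, and the kick range of ±10 are all assumptions chosen so that the landscape has many local minima.

```python
import random

def descend(x, cost, neighbors):
    """Greedy descent: move to the best neighbor until none improves."""
    while True:
        best = min(neighbors(x), key=cost)
        if cost(best) >= cost(x):
            return x
        x = best

def lsmc(start, cost, neighbors, kick, iters=200, seed=0):
    """Step from one local minimum to a better one via kick + descent."""
    rng = random.Random(seed)
    current = descend(start, cost, neighbors)
    for _ in range(iters):
        candidate = descend(kick(current, rng), cost, neighbors)
        if cost(candidate) < cost(current):  # accept only improvements
            current = candidate
    return current

# Hypothetical toy landscape on 0..99: every multiple of 7 is a local minimum.
def toy_cost(x):
    return (x - 70) ** 2 / 50 + 10 * (x % 7 != 0)

def toy_neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 99]

def toy_kick(x, rng):
    return min(99, max(0, x + rng.randint(-10, 10)))

result = lsmc(0, toy_cost, toy_neighbors, toy_kick)
print(result, toy_cost(result))
```

Plain greedy descent from 0 stays at 0 here, since 0 is already a local minimum; the kick is what lets the search cross the barriers between minima.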
-9-
Stochastic Combinatorial Optimization: GWTW
Go-With-The-Winners [Aldous, Vazirani 1994]: invoke greedy heuristics with randomized multi-starts, explore large space by continuing the search from a small set of best-seen solutions
Finds global optimum with high probability under certain assumptions [AV94]
Runtime of GWTW is bounded by a polynomial in depth of tree and tree imbalance parameter
[Figure: trajectories from start-1, start-2, start-3, start-4 converging to end]
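As a rough sketch of the idea described on this slide (several randomized greedy searches, with the search periodically restarted from the few best-seen solutions), one could write the following. The population handling, the `keep` parameter, and the toy cost are assumptions for illustration; they are not taken from [AV94] or from Trident.

```python
import random

def go_with_the_winners(starts, step, cost, rounds=50, keep=2, seed=0):
    """Advance all searches one step, then continue only from the winners."""
    rng = random.Random(seed)
    population = list(starts)
    for _ in range(rounds):
        population = sorted((step(x, rng) for x in population), key=cost)
        winners = population[:keep]
        # Losers are replaced by clones of the winners.
        population = [winners[i % len(winners)] for i in range(len(population))]
    return population[0]

# Hypothetical toy problem: reach 42 with randomized, never-worsening steps.
def toy_cost(x):
    return abs(x - 42)

def toy_step(x, rng):
    candidate = x + rng.randint(-5, 5)
    return candidate if toy_cost(candidate) <= toy_cost(x) else x

starts = [0, 100, 200, 300]
best = go_with_the_winners(starts, toy_step, toy_cost)
print(best, toy_cost(best))
```

Because each step never worsens a solution and the best solutions always survive, the returned cost can never exceed the cost of the best start.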
-10-
Our Work: Sensitivity-Guided Metaheuristics
We apply sensitivity-guided metaheuristics based on the Go-With-The-Winners paradigm
– Define parameterized space for gate sizing
– Explore a heuristic space using multistart technique and efficient parallelization on multi-core system
– Use total negative slack (TNS) as a sensitivity function (with fast estimation technique)
Infrastructure: ISPD 2012 gate sizing contest
– Realistic benchmarks mapped into a modern discrete gate library
-11-
Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work
-12-
Trident: Sensitivity-Guided Metaheuristics
In our heuristic, multiple tines of a trident represent multiple solution trajectories
Trident: central to the symbols of both UCSD and the Ukraine
-13-
Trident: Sensitivity-Guided Metaheuristics
Our Heuristic: explore a parameterized heuristic space with multistarts, then apply Go-With-The-Winners
[Figure: solution trajectories from the initial solution to the final solution, via multistarts and go-with-the-winners]
Global Timing Recovery (GTR)
– Find violation-free solutions with multistarts (recover feasibility)
Power Reduction with Feasible Timing (PRFT)
– Iteratively reduce total leakage with greedy downsizing (maintain feasibility)
-14-
Trident: Entire Flow
Input Design (Netlist, SPEF, SDF, Cell Library)
→ Initial Cell Assignments
→ Primary Optimization (Multi-threaded)
– Global Timing Recovery (GTR): Multistart with Coarse Search and Fine Search, yielding a violation-free solution
– Power Reduction with Feasible Timing (PRFT): Sensitivity-guided Greedy Sizing, and Perturbing (upsizing) Bottleneck Cells
→ Final Cell Assignments
-15-
Global Timing Recovery: Flow on Each Thread
GTR seeks violation-free solutions w/ two parameters: α (leakage exponent) and γ (% of upsizing)
Flow: Run static timing analysis → Calculate sensitivity (α) for cells w/ negative slack → Upsize γ% of cells in descending order of sensitivity → Timing met? (NO: update timing and repeat from the sensitivity calculation)
Cell sensitivity
• TNS: total negative slack
• ∆TNS: TNS reduction after cell upsizing
• ∆leakage: cell leakage increase after cell upsizing
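One selection step of this flow might look like the sketch below. The sensitivity formula itself did not survive in this transcript, so the form used here, ∆TNS / ∆leakage^α, is an assumption suggested by α being called the leakage exponent. Taking γ% of the negative-slack cells (rather than of all cells) is also an assumption, as are the field names and the numbers.

```python
import math

def pick_cells_to_upsize(cells, alpha, gamma_pct):
    """Return names of the top gamma% negative-slack cells by sensitivity.

    Assumed sensitivity: d_tns / d_leak ** alpha (not confirmed by the slides).
    """
    critical = [c for c in cells if c["slack"] < 0]
    if not critical:
        return []
    ranked = sorted(
        critical,
        key=lambda c: c["d_tns"] / c["d_leak"] ** alpha,
        reverse=True,
    )
    count = max(1, math.ceil(len(ranked) * gamma_pct / 100))
    return [c["name"] for c in ranked[:count]]

# Hypothetical cells: d_tns = estimated TNS reduction, d_leak = leakage increase.
cells = [
    {"name": "a", "slack": -5.0, "d_tns": 10.0, "d_leak": 2.0},
    {"name": "b", "slack": -3.0, "d_tns": 6.0, "d_leak": 1.0},
    {"name": "c", "slack": 1.0, "d_tns": 0.0, "d_leak": 1.0},
    {"name": "d", "slack": -1.0, "d_tns": 1.0, "d_leak": 4.0},
]
print(pick_cells_to_upsize(cells, alpha=0.0, gamma_pct=50))  # ignores leakage
print(pick_cells_to_upsize(cells, alpha=1.0, gamma_pct=50))  # penalizes leakage
```

Under this assumed form, a larger α penalizes leakage-hungry upsizing more strongly, which changes the ranking; that would explain why sweeping α produces different trajectories.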
-16-
Global Timing Recovery: ∆TNS Estimation
Estimate impact of a single cell modification
– Invoking STA is computationally prohibitive → we approximate the impact on TNS (∆TNS)
∆TNS estimation
– Cell modification affects AATs and RATs for downstream cells, RATs for upstream cells
– ∆TNS estimation reduces runtime and allows us to find a heuristic solution quickly
Terms used in the estimate
– ∆delay = delay change (old – new) from upsizing
– Npaths = # of negative-slack paths through the cell
-17-
Global Timing Recovery: Multistart
Multistart w/ different parameters and “Go-With-The-Winners”; GTR sweeps parameters α and γ, and chooses the best (minimum leakage) solution
Coarse Search
– Search space: initial ranges of α and γ (the exact bounds are garbled in this transcript)
– Step size: ∆α_CGS, ∆γ_CGS
– Output: best solutions (α, γ)
Fine Search
– Search space: [α − α_Thres/2, α + α_Thres/2] and [γ − γ_Thres/2, γ + γ_Thres/2]
– Step size: ∆α_FGS, ∆γ_FGS
– Output: best solutions (α, γ)
– Focus on ranges around best-seen param.
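A coarse-then-fine sweep of this kind can be sketched as follows. The `evaluate` callback stands in for a full GTR run and returns the resulting leakage, or `None` when timing is not recovered; that callback, the ranges, step sizes, window widths, and the toy leakage surface are all hypothetical.

```python
from itertools import product

def grid(lo, hi, step):
    """Inclusive grid from lo to hi with the given step."""
    count = int(round((hi - lo) / step))
    return [lo + i * step for i in range(count + 1)]

def sweep(evaluate, alphas, gammas):
    """Return (leakage, alpha, gamma) of the best feasible point, or None."""
    results = []
    for a, g in product(alphas, gammas):
        leakage = evaluate(a, g)  # None means timing was not recovered
        if leakage is not None:
            results.append((leakage, a, g))
    return min(results) if results else None

def coarse_to_fine(evaluate, a_rng, g_rng, coarse, fine, window):
    """Coarse sweep over the full ranges, then a fine sweep around the winner."""
    best = sweep(evaluate, grid(*a_rng, coarse[0]), grid(*g_rng, coarse[1]))
    if best is None:
        return None
    _, a, g = best
    fine_best = sweep(
        evaluate,
        grid(a - window[0] / 2, a + window[0] / 2, fine[0]),
        grid(g - window[1] / 2, g + window[1] / 2, fine[1]),
    )
    return min(best, fine_best) if fine_best else best

# Hypothetical leakage surface with its minimum at alpha = 0.9, gamma = 24.
def toy_leakage(alpha, gamma):
    return (alpha - 0.9) ** 2 + ((gamma - 24) / 10) ** 2

leakage, alpha, gamma = coarse_to_fine(
    toy_leakage, a_rng=(0.5, 2.0), g_rng=(10, 50),
    coarse=(0.5, 10), fine=(0.05, 1), window=(0.5, 10),
)
print(round(alpha, 2), gamma)
```

Since every (α, γ) evaluation is independent, the grid points are natural units of work for the multi-threaded multistart that the slides mention.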
-18-
Power Reduction with Feasible Timing
In GTR, some cells are oversized
PRFT iteratively reduces total leakage power using sensitivity-guided greedy sizing (SGGS)
SGGS procedure: Run static timing analysis → Calculate sensitivity for all cells → Downsize cell C with maximum sensitivity → Incremental STA → slack(C) < 0? (YES: revert the sizing; NO: keep it) → repeat
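The SGGS loop can be sketched on a deliberately simplified timing model: a single path whose delay is the sum of the cell delays, so slack is one number. The single-path model, the rule of freezing a cell after a reverted downsize, the use of ∆leakage / ∆delay as the sensitivity, and the cell data are all assumptions for illustration.

```python
def sggs(sizes, options, required_time):
    """Greedy downsizing; options[cell] lists (delay, leakage) from small to large."""
    sizes = dict(sizes)
    frozen = set()  # cells whose last downsize was reverted

    def slack():
        return required_time - sum(options[c][s][0] for c, s in sizes.items())

    def sensitivity(cell):
        d_now, p_now = options[cell][sizes[cell]]
        d_new, p_new = options[cell][sizes[cell] - 1]
        # Leakage saved per unit of added delay.
        return (p_now - p_new) / max(d_new - d_now, 1e-9)

    while True:
        candidates = [c for c in sizes if sizes[c] > 0 and c not in frozen]
        if not candidates:
            return sizes
        cell = max(candidates, key=sensitivity)
        sizes[cell] -= 1          # downsize the most sensitive cell
        if slack() < 0:           # timing violated: revert the sizing
            sizes[cell] += 1
            frozen.add(cell)

# Hypothetical library: (delay, leakage) per size index.
options = {
    "A": [(12, 5), (5, 10)],
    "B": [(12, 9), (10, 15)],
    "C": [(24, 2), (10, 20), (6, 30)],
}
final = sggs({"A": 1, "B": 1, "C": 2}, options, required_time=30)
print(final, sum(options[c][s][1] for c, s in final.items()))
```

In this toy run, B and then C are downsized once; the second downsize of C and the downsize of A both violate timing and are reverted, which mirrors the YES branch of the procedure.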
-19-
PRFT: Sensitivity Functions
PRFT runs multiple SGGS with different sensitivity functions (SF1 ~ SF5)
SF1: ∆leakage / ∆delay
SF2: ∆leakage * slack
SF3: ∆leakage / (∆delay * #paths)
SF4: ∆leakage * slack / #paths
SF5: ∆leakage * slack / (∆delay * #paths)
Each SF provides a different solution, and we select the best solution among them
Each run automatically finds the best SF for a given testcase
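The five functions translate directly into code. The argument names and the two example cells below are hypothetical; the example only shows that different SFs can rank the same candidates differently, which is why running all five can pay off.

```python
# SF1..SF5 as listed on the slide; the argument names are illustrative.
SENSITIVITY_FUNCTIONS = {
    "SF1": lambda d_leak, d_delay, slack, n_paths: d_leak / d_delay,
    "SF2": lambda d_leak, d_delay, slack, n_paths: d_leak * slack,
    "SF3": lambda d_leak, d_delay, slack, n_paths: d_leak / (d_delay * n_paths),
    "SF4": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / n_paths,
    "SF5": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / (d_delay * n_paths),
}

# Hypothetical downsizing candidates: (d_leak, d_delay, slack, n_paths).
candidates = {
    "X": (6.0, 2.0, 4.0, 3),
    "Y": (4.0, 1.0, 2.0, 1),
}

def pick(sf_name):
    """Return the candidate with maximum sensitivity under the named SF."""
    sf = SENSITIVITY_FUNCTIONS[sf_name]
    return max(candidates, key=lambda name: sf(*candidates[name]))

for name in SENSITIVITY_FUNCTIONS:
    print(name, pick(name))
```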
-20-
PRFT: Speeding up Bottleneck Cells
Monotonic downward sizing can get stuck in a local optimum
Speed up bottleneck cells: recover timing slack with minimum power impact
Perturbation and greedy sizing recall the LSMC approach
Flow: Sensitivity-guided Greedy Sizing w/ SFi → best seen? (yes: record as best solution, speed up γ% bottleneck cells, and repeat; no: output final solution)
[Figure: progression of GTR & PRFT (TNS, leakage), with the kick-move marked]
-21-
Incremental Static Timing Analysis
In PRFT, cell slack must be recalculated → incremental STA is used after cell sizing to reduce runtime
To achieve further speedup, we propagate updated timing only when the change is larger than a propagation threshold (e.g., 0.1ps)
[Figure: timing graph around the sized cell T, with nodes FI1, FI2, S1, S2, FO1, FO2]
1. Update cell delay, transition time and AAT
2. Update RAT and slack
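The thresholded propagation can be illustrated for the forward (AAT) direction only; transition times, RATs and slacks are left out. The graph topology below reuses the node names from the figure, but the connectivity, the delays, and the function name are assumptions.

```python
from collections import deque

def update_arrivals(fanin, fanout, delay, arrival, changed, threshold=0.1):
    """Propagate arrival-time changes forward from `changed`.

    Propagation stops at any cell whose arrival time moves by no more
    than `threshold`. Returns the cells whose arrival time was updated.
    """
    updated = []
    queue = deque([changed])
    while queue:
        cell = queue.popleft()
        latest_input = max((arrival[f] for f in fanin[cell]), default=0.0)
        new_arrival = latest_input + delay[cell]
        if abs(new_arrival - arrival[cell]) > threshold:
            arrival[cell] = new_arrival
            updated.append(cell)
            queue.extend(fanout[cell])
    return updated

# Hypothetical graph: FI1, FI2 -> T -> FO1, FO2; S1 -> FO1; S2 -> FO2.
fanin = {"FI1": [], "FI2": [], "S1": [], "S2": [],
         "T": ["FI1", "FI2"], "FO1": ["T", "S1"], "FO2": ["T", "S2"]}
fanout = {"FI1": ["T"], "FI2": ["T"], "S1": ["FO1"], "S2": ["FO2"],
          "T": ["FO1", "FO2"], "FO1": [], "FO2": []}
delay = {"FI1": 10.0, "FI2": 12.0, "S1": 25.0, "S2": 5.0,
         "T": 8.0, "FO1": 4.0, "FO2": 6.0}
arrival = {"FI1": 10.0, "FI2": 12.0, "S1": 25.0, "S2": 5.0,
           "T": 20.0, "FO1": 29.0, "FO2": 26.0}

delay["T"] = 6.0  # T was resized and became 2 time units faster
updated = update_arrivals(fanin, fanout, delay, arrival, "T")
print(updated, arrival["FO1"], arrival["FO2"])
```

FO1 is dominated by its side input S1, so its arrival time does not change and nothing downstream of it would be revisited. A larger threshold prunes more of the graph at the cost of a bounded slack error.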
-22-
Handling Capacitance and Slew Violations
Each standard cell can drive a certain maximum capacitance, and transition time must be smaller than maximum transition time
Trident removes max-capacitance and max-transition (slew) violations at every iteration of GTR
1. Backward traversal: visit cells in reverse order, and upsize driving cells
2. Forward traversal: downsize fanout cells
Requires one to two iterations
[Figure: max. cap. violation examples for the backward and forward traversals]
-23-
Configurations for GWTW
Trident can configure the number of best-seen solutions used in GWTW
[Figure: stage pipeline GTR: coarse search → GTR: fine search 1 → GTR: fine search 2 → PRFT: greedy sizing (SF1, SF2, ...) → PRFT: kick-move + greedy sizing, with "Parameters from N-best seen solutions" and "N-best seen solutions" passed between stages]
Which configuration is optimal in terms of runtime vs. sizing quality?
– More start points → more chances to find near-optimum
– Runtime increases with the number of start points
-24-
Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work
-25-
ISPD 2012 Gate Sizing Contest [Ozdal et al.]
Provide benchmarks to accurately model the discrete gate-sizing problem
Netlist (Verilog), parasitics (SPEF), timing constraint (SDC)
Library: 11 different logic functions, 30 different cell types (three multi-Vt and ten different sizes) → 330 cells
The contest compares leakage power of violation-free solutions
-26-
Analysis of Our Implementation
Trident is written in C++, has a built-in static timer
– Significantly faster than Tcl access to PT in contest env.
– Improve runtime w/ an incremental STA
Runtime for all (14) ISPD2012 benchmarks: less than 83 hours w/ 4 threads (Intel Xeon E31230)
Runtime breakdown (for NETCARD_slow)
– GTR 31%: coarse search 6.2%, fine-grain search 25.1%
– PRFT 69%: greedy sizing 45.4.0%, perturbing iterations 23.0%
-27-
Characterization of ISTA
Runtime of full-scale STA and incremental STA
– Average runtime and maximum slack error have been measured after randomly sizing 1% of cells

Benchmark   FSTA     ISTA runtime (msec)        ISTA max. error (ps)
            (sec)    0ps      0.1ps    1.0ps    0ps    0.1ps   1.0ps
DMA         0.233    1.495    0.845    0.508    0.0    0.10    2.20
PCI         0.271    0.982    0.717    0.348    0.0    0.22    1.73
DES         1.108    0.7      0.508    0.422    0.0    0.10    1.28
VGA         1.729    22.75    8.108    2.069    0.0    0.33    2.59
B19         2.435    5.46     2.717    1.833    0.0    0.33    1.94
LEON3MP     6.746    43.21    2.152    0.939    0.0    0.32    2.73
NETCARD     9.751    9.612    2.299    1.675    0.0    0.32    2.57
Geomean     924X     2.86X    1.0X     0.54X    0.0X   1.0X    9.1X
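The Geomean row appears to be normalized to the ISTA runtime at the 0.1ps threshold; that reading is an inference from the 1.0X entry, not something the slide states. Under that assumption, the first two entries can be recomputed from the table values as a quick check:

```python
import math

# Values copied from the table above (FSTA in seconds, ISTA in milliseconds).
fsta_sec = {"DMA": 0.233, "PCI": 0.271, "DES": 1.108, "VGA": 1.729,
            "B19": 2.435, "LEON3MP": 6.746, "NETCARD": 9.751}
ista_msec_0ps = {"DMA": 1.495, "PCI": 0.982, "DES": 0.7, "VGA": 22.75,
                 "B19": 5.46, "LEON3MP": 43.21, "NETCARD": 9.612}
ista_msec_01ps = {"DMA": 0.845, "PCI": 0.717, "DES": 0.508, "VGA": 8.108,
                  "B19": 2.717, "LEON3MP": 2.152, "NETCARD": 2.299}

def geomean(values):
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Ratios relative to ISTA with a 0.1ps propagation threshold.
fsta_ratio = geomean(fsta_sec[b] * 1000.0 / ista_msec_01ps[b] for b in fsta_sec)
zero_ratio = geomean(ista_msec_0ps[b] / ista_msec_01ps[b] for b in fsta_sec)
print(f"FSTA: {fsta_ratio:.0f}X, ISTA 0ps: {zero_ratio:.2f}X")
```

Both results land close to the reported 924X and 2.86X, which supports the normalization assumption.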
-28-
Tuning of ISTA
Runtime vs. quality of sizing solutions with different propagation thresholds
Testcase: DMA (w/ tight timing constraint)

Prop. threshold           0.0ps    0.1ps    1.0ps
ISTA runtime (msec)       1.495    0.845    0.508
ISTA max. error (ps)      0.0      0.10     2.20
Trident runtime (min)     19.4     13.9     12.8
Final leakage (mW)        0.299    0.299    0.306
-29-
Configurations of GWTW
Five multi-threaded stages can provide n best-seen solutions for GWTW heuristic
Solution quality vs. runtime
[Figure: normalized leakage power vs. normalized runtime for the stage configurations 10010, 10011, 11010, 11011, 11111, 11211, 20011, 22211, 22221]
Stage configurations [A][B][C][D][E]
Stage
• A: GTR coarse-grain search
• B: GTR fine-grain search I
• C: GTR fine-grain search II
• D: PRFT greedy sizing
• E: PRFT speed up bottleneck
Value
• 0: skip the stage
• 1: keep one best solution
• 2: keep two best solutions
Default configuration: “22211”
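The configuration string is a five-digit code with one digit per stage. A small decoder makes the encoding explicit; the function name and the dictionary layout are illustrative and say nothing about how Trident parses its options.

```python
STAGES = (
    "GTR coarse-grain search",
    "GTR fine-grain search I",
    "GTR fine-grain search II",
    "PRFT greedy sizing",
    "PRFT speed up bottleneck",
)

def decode_config(config):
    """Map a string such as '22211' to {stage: number of solutions kept}.

    A value of 0 means the stage is skipped.
    """
    if len(config) != len(STAGES) or any(ch not in "012" for ch in config):
        raise ValueError(f"expected five digits from 0, 1, 2; got {config!r}")
    return {stage: int(ch) for stage, ch in zip(STAGES, config)}

default = decode_config("22211")
for stage, kept in default.items():
    print(f"{stage}: {'skip' if kept == 0 else f'keep {kept} best'}")
```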
-30-
Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with tight timing constraint
Parameter columns: best parameter values found by our heuristic

Benchmark        Leakage power      Runtime   GTR param.      PRFT param.
(# of cells)     GTR      PRFT      (min)     α      γ(%)     SF     γ(%)
DMA (25K)        0.65     0.299     14        0.91   24.5     SF5    1
PCI (33K)        0.348    0.183     13        0.91   34       SF4    4
DES (111K)       7.157    1.842     83        0.85   46.5     SF5    1
VGA (165K)       0.685    0.471     46        0.7    17.5     SF5    4
B19 (219K)       1.377    0.771     207       1.33   16.5     SF2    4
LEON3 (649K)     1.989    1.487     1323      0.71   7        SF4    1
NETCARD (959K)   1.997    1.861     1097      0.57   4        SF3    1
-31-
Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with loose timing constraint

Benchmark        Leakage power      Runtime   GTR param.      PRFT param.
(# of cells)     GTR      PRFT      (min)     α      γ(%)     SF     γ(%)
DMA (25K)        0.211    0.145     10        1      10       SF5    5
PCI (33K)        0.185    0.111     10        1.11   36       SF5    4
DES (111K)       0.922    0.614     70        0.83   8.5      SF2    3
VGA (165K)       0.454    0.351     88        1      10       SF4    3
B19 (219K)       0.718    0.583     214       1.5    7.5      SF5    1
LEON3 (649K)     1.422    1.341     1274      0.89   4        SF4    2
NETCARD (959K)   1.818    1.77      300       2.67   4        SF3    1
-32-
Leakage Comparison on ISPD Benchmarks
Contest best: best of all entries in the competition (ISPD 2012 contest)
Intel Labs (contest organizer) released five (near-optimal) results
ISPD 2012 contest: http://archive.sigda.org/ispd/contests/12/ispd2012_contest.html
• In all benchmarks (except one), Trident achieves lowest leakage power: 43% further reduction over contest winner.
• We outperform Intel results on four.
[Figure: bar chart of leakage per benchmark (DMA_fast, DMA_slow, pci_bridge32_fast, pci_bridge32_slow, des_perf_fast, des_perf_slow, b19_slow, vga_lcd_fast, vga_lcd_slow, leon3mp_slow, netcard_fast, netcard_slow) for Intel Labs, GTR+PRFT and Contest best]
-33-
Conclusions
Within the research-oriented infrastructure used in ISPD 2012 Gate-Sizing Contest, we have developed a metaheuristic approach to gate sizing
Our implementation, Trident, outperforms the best reported results on all but one of the ISPD 2012 benchmarks.
Compared to the 2012 contest winner, we further reduce leakage power by an average of 43%
-34-
Ongoing Work
Extension to support real industry library
Consider addition of interconnect delay in the next version of sizer
-35-
Thank you
-36-
Experimental Results on ISPD Benchmarks
• Leakage power / runtime / parameters for GTR, PRFT

Benchmark      # of cells   Leakage power      Runtime   GTR param.      PRFT param.
               (K)          GTR      PRFT      (min)     α      γ(%)     SF     γ(%)
DMA_fast       25.3         0.65     0.299     14        0.91   24.5     SF5    1
DMA_slow       25.3         0.211    0.145     10        1      10       SF5    5
PCI_fast       33.2         0.348    0.183     13        0.91   34       SF4    4
PCI_slow       33.2         0.185    0.111     10        1.11   36       SF5    4
DES_fast       111          7.157    1.842     83        0.85   46.5     SF5    1
DES_slow       111          0.922    0.614     70        0.83   8.5      SF2    3
VGA_fast       165          0.685    0.471     46        0.7    17.5     SF5    4
VGA_slow       165          0.454    0.351     88        1      10       SF4    3
B19_fast       219          1.377    0.771     207       1.33   16.5     SF2    4
B19_slow       219          0.718    0.583     214       1.5    7.5      SF5    1
LEON3_fast     649          1.989    1.487     1323      0.71   7        SF4    1
LEON3_slow     649          1.422    1.341     1274      0.89   4        SF4    2
NETCARD_fast   959          1.997    1.861     1097      0.57   4        SF3    1
NETCARD_slow   959          1.818    1.77      300       2.67   4        SF3    1
-37-
Experimental Results on ISPD Benchmarks
• Leakage power / runtime / parameters for GTR, PRFT

Benchmark      # of cells   Leakage power      Runtime   GTR param.      PRFT param.
               (K)          GTR      PRFT      (min)     α      γ(%)     SF     γ(%)
DMA_slow       25.3         0.211    0.145     10        1      10       SF5    5
PCI_slow       33.2         0.185    0.111     10        1.11   36       SF5    4
DES_slow       111          0.922    0.614     70        0.83   8.5      SF2    3
VGA_slow       165          0.454    0.351     88        1      10       SF4    3
B19_slow       219          0.718    0.583     214       1.5    7.5      SF5    1
LEON3_slow     649          1.422    1.341     1274      0.89   4        SF4    2
NETCARD_slow   959          1.818    1.77      300       2.67   4        SF3    1