Valuation and Values in Application-Driven Algorithmics: Case Studies from VLSI CAD
Andrew B. Kahng, UCLA Computer Science Dept. June 2, [email protected], http://vlsicad.cs.ucla.edu
My Research Applied algorithmics
– demonstrably useful solutions for real problems
– “best known” solutions
– “classic” (well-studied) : Steiner, partition, placement, TSP,...
– toolkits: discrete algorithms, global optimization, mathematical programming, approximation frameworks, new-age metaheuristics, engineering
“Ground truths”
– anatomies
– limits
Anatomies Technologies
– semiconductor process roadmap, design-manufacturing I/F
– design technology: methodology, flows, design process
– interconnect modeling/analysis: delay/noise est, compact models
Problems– structural theory of large-scale global optimizations
Heuristics
– hypergraph partitioning and clustering
– wirelength- and timing-driven placement
– single/multiple topology synthesis (length, delay, skew, buffering, ...)
– TSP, ..., IP protection, ..., combinatorial exchange/auction, ...
Cultures– contexts and infrastructure for research and technology transfer
Bounds Exact methods Provable approximations Technology extrapolation
– achievable envelope of system implementation w.r.t. cost, speed, power, reliability, ...
– ideally, should drive and be driven by system architectures, design and implementation methodologies
Today’s Talk “Demonstrably useful solutions for real problems” “Valuation”: What problems require attention ?
– technology extrapolation
– automatic layout of phase-shifting masks
“Values”: How do we advance the leading edge ?
– anatomy of FM-based hypergraph partitioning heuristics
– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”
Technology Extrapolation Evaluates impact of
– design technology
– process technology
Evaluates impact on– achievable design
– associated design problems
What matters, when ?
Sets new requirements for CAD tools and methodologies, capital and R&D investment, ... right tech at the right time
Roadmaps (SIA ITRS): familiar and influential example
How and when do L, SOI, SER, etc. matter?
What is the most power-efficient noise management strategy?
Will layout tools need to perform process simulation to effectively address cross-die and cross-wafer manufacturing variation?
GTX: GSRC Technology Extrapolation System
GTX is a framework for technology extrapolation
Parameters (data)
Rules (models)
Rule chain (study)
Knowledge
Engine (derivation)
GUI (presentation)
Implementation
User inputs
Pre-packaged
GTX
Graphical User Interface (GUI) Provides user interaction
Visualization (plotting, printing, saving to file)
4 views:
– Parameters
– Rules
– Rule chain
– Values in chain
GTX: Open, “Living Roadmap” Openness in grammar, parameters and rules
– easy sharing of data, models in research environment
– contributions of best known models from anywhere
Allows development of proprietary models
– separation between supplied (shared) and user-defined parameters / rules
– usability behind firewalls
– functionality for sharing results instead of data
Multi-platform (SUN Solaris, Windows, Linux)
http://vlsicad.cs.ucla.edu/GSRC/GTX/
GTX Activity Models implemented
– Cycle-time models of SUSPENS (with extension by Takahashi), BACPAC (Sylvester, Berkeley), Fisher (ITRS)
– Currently adding
– GENESYS (with help from Georgia Tech)
– RIPE (with help from RPI)
– New device and power modules (Synopsys / Berkeley)
– New SOI device model (Synopsys / Berkeley)
– Inductance extraction (Silicon Graphics / Berkeley / Synopsys)
Studies performed in GTX
– Modeling and parameter sensitivity analyses
– Design optimization studies: global interconnects, layer stack
– Routability estimation, via impact models, ...
Today’s Talk “Demonstrably useful solutions for real problems” “Valuation”: What problems require attention ?
– technology extrapolation
– automatic layout of phase-shifting masks
“Values”: How do we advance the leading edge ?
– anatomy of FM-based hypergraph partitioning heuristics
– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”
Subwavelength Gap since 0.35 μm
Subwavelength Optical Lithography
EUV, X-rays, E-beams all > 10 years out
huge investment in > 30 years of optical litho infrastructure
Clear areas vs. opaque (chrome) areas
Mask Types
Bright Field
– opaque features
– transparent background
Dark Field
– transparent features
– opaque background
Phase Shifting Masks
[Figure: conventional mask (glass, chrome) vs. phase-shifting mask with phase shifter; E at mask, E at wafer, and intensity I at wafer for each]
Impact of PSM
PSM enables smaller transistor gate lengths Leff
– “critical” polysilicon features only (gate Leff)
– faster device switching → faster circuits
– better critical dimension (CD) control → improved parametric yield
– all features on polysilicon layer, local interconnect layers
– smaller die area → more $/wafer (“full-chip PSM” == BIG win)
Alternative: build a $10B fab with equipment that won’t exist for 5+ years
Data points
– exponential increase in price of CAD technology for PSM
– Numerical Technologies market cap 3x that of Avant!
– 25 nm gates (!!!) manufactured with 248nm DUV steppers (NTI + MIT Lincoln Labs, announced 2 days ago); 90nm gates in production at Motorola, Lucent (since late 1999)
The Phase Assignment Problem
Assign 0, 180 phase regions such that critical features with width (separation) < B are induced by adjacent phase regions with opposite phases
[Figure: bright-field (dark-field) phase assignment with 0/180 phase regions]
Key: Global 2-Colorability
[Figure: a chain of phase implications can leave a region’s phase undetermined (?)]
If there is an odd cycle of “phase implications” → layout cannot be manufactured
– layout verification becomes a global, not local, issue
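The global 2-colorability check is exactly a bipartiteness test on the conflict graph. A minimal sketch, assuming a symmetric adjacency dict as input (the function name and data layout are illustrative, not from any production tool):

```python
from collections import deque

def find_phase_assignment(adj):
    """Attempt to 2-color a conflict graph by BFS.

    adj: dict mapping each region to the regions it conflicts with
    (must list both directions of every conflict).
    Returns a dict region -> 0 or 180 if the graph is bipartite,
    or None if an odd cycle of phase implications exists.
    """
    phase = {}
    for start in adj:
        if start in phase:
            continue
        phase[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in phase:
                    phase[v] = 180 - phase[u]  # opposite phase
                    queue.append(v)
                elif phase[v] == phase[u]:
                    return None  # odd cycle: layout not manufacturable
    return phase

# Triangle of conflicts -> odd cycle -> no valid assignment
assert find_phase_assignment({0: [1, 2], 1: [0, 2], 2: [0, 1]}) is None
```

This is why verification is global: each edge only constrains a pair, but the failure (an odd cycle) can span arbitrarily many features.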
[Figure: features F1–F4 with shifters S1–S8]
PROPER Phase Assignment:
– Opposite phases for opposite shifters
– Same phase for overlapping shifters
[Figure: features F1–F4 and shifters S1–S8 with a phase conflict between overlapping shifters]
Phase Conflict Resolution
Phase conflict → feature shifting to remove the shifter overlap
[Figure: layout after conflict resolution]
Phase conflict → feature widening to turn the conflict into a non-conflict
How will VLSI CAD deal with PSM ?
UCLA: first comprehensive methodology for PSM-aware layout design
– currently being integrated by Cadence, Numerical Technologies
Approach: partition responsibility for phase-assignability
– good layout practices (local geometry)
– (open) problem: is there a set of “design rules” that guarantees phase-assignability of layout ? (no T’s, no doglegs, even fingers, ...)
– automatic phase conflict resolution / bipartization (global colorability)
– enabling reuse of layout (free composability)
– problem: how can we guarantee reusability of phase-assigned layouts, such that no odd cycles can occur when the layouts are composed together in a larger layout ?
Compaction-Oriented Approach
Analyze input layout
Find min-cost set of perturbations needed to eliminate all “odd cycles”
Induce constraints for output layout
– i.e., PSM-induced (shape, spacing) constraints
Compact to get phase-assignable layout
Key: minimize the set of new constraints, i.e., break all odd cycles in the conflict graph by deleting a minimum number of edges.
Conflict Graph
Dark Field: build graph over feature regions
– edge between two features whose separation is < B
Bright Field: build graph over shifter regions
– shifters for features whose width is < B
– two edge types
– adjacency edge between overlapping phase regions : endpoints must have same phase
– conflict edge between shifters on opposite side of critical feature: endpoints must have opposite phase
[Figure: dark-field conflict graph G (green = feature; pink = conflict)]
[Figure: bright-field conflict graph G with conflict edges and adjacency edges]
Optimal Odd Cycle Elimination
[Figure: conflict graph G, its dual graph D, and the T-join of odd-degree nodes in D (dark green = feature; pink = conflict)]
Optimal Odd Cycle Elimination
T-join of odd-degree nodes in D corresponds to broken edges in the original conflict graph
– assign phases: dark green and purple
– remaining pink conflicts correctly handled
The T-join Problem
How to delete a minimum-cost set of edges from conflict graph G to eliminate odd cycles?
Construct geometric dual graph D = dual(G)
Find odd-degree vertices T in D
Solve the T-join problem in D:
– find min-weight edge set J in D such that
– all T-vertices have odd degree
– all other vertices have even degree
Solution J corresponds to desired min-cost edge set in conflict graph G
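For tiny instances the pipeline can be sketched end-to-end: shortest paths between T-vertices, then an optimal pairing. The brute-force pairing search below stands in for the minimum-weight perfect matching step a real solver would use (it is exponential in |T|, so illustration only; all names are assumptions of this sketch):

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path costs from src; adj[u] = [(v, weight), ...]."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def min_tjoin_cost(adj, T):
    """Cost of a minimum T-join: pair up the (even number of) T-vertices
    so that total shortest-path cost of the pairing is minimized.
    Exact but exponential; real implementations solve this pairing as a
    min-weight perfect matching on the complete graph over T."""
    sp = {t: dijkstra(adj, t) for t in T}

    def best(remaining):
        if not remaining:
            return 0
        first, rest = remaining[0], remaining[1:]
        return min(sp[first][other] + best([x for x in rest if x != other])
                   for other in rest)
    return best(sorted(T))
```

On a weighted path a-b-c-d (unit edges) with T = {a, b, c, d}, the optimal pairing is (a,b) + (c,d) at total cost 2, not (a,c) + (b,d) at cost 4, which mirrors why greedy pairing is not enough.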
Solving T-join in Sparse Graphs
Reduction to matching
– construct a complete graph T(G)
– vertices = T-vertices
– edge costs = shortest-path cost
– find minimum-cost perfect matching
Typical example = sparse (not always planar) graph
– note that conflict graphs are sparse
– #vertices = 1,000,000
– #edges ≈ 5 × #vertices
– #T-vertices ≈ 10% of #vertices = 100,000
Drawback: finding all-pairs shortest paths (APSP) too slow, memory-consuming
– #T-vertices = 100,000 → #edges in T(G) ≈ 5,000,000,000
Solving T-join: Reduction to Matching
Desirable properties of reduction to matching:
– exact (i.e., optimal)
– not much memory (say, 2-3X more)
– leads to very fast solution
Solution: gadgets!
– replace each edge/vertex with gadgets s.t. matching all vertices in gadgeted graph ⇔ T-join in original graph

T-join Problem: Reduction to Matching
Replace each vertex with a chain of triangles; one more edge for T-vertices
In graph D: m = #edges, n = #vertices, t = #T-vertices
In gadgeted graph: 4m-2n-t vertices, 7m-5n-t edges
Cost of red edges = original dual edge costs; cost of (black) edges in triangles = 0
[Figure: gadgets for a non-T vertex and a T-vertex]
Results
Testcase        Layout1           Layout2           Layout3
                polygons  edges   polygons  edges   polygons  edges
                3769     12442    9775     26520   18249     51402

Algorithm       edges   runtime   edges   runtime   edges   runtime
Greedy           2650     0.56     2722     3.66     6180     5.38
GW               1612     3.33     1488     5.77     3280    14.47
Exact            1468    19.88     1346    16.67     2958    74.33
New Gadgets      1468     3.62     1346     5.17     2958    17.90
• Runtimes in CPU seconds on Sun Ultra-10
• Greedy = breadth-first-search bicoloring
• GW = Goemans/Williamson95 heuristic
• Cook/Rohe98 for perfect matching
• Integration w/compactor: saves 9+% layout area vs. GW
[Figure: black points = features; blue = shifter overlap; red = extra nodes to distinguish opposite shifters]
Bipartization Problem: delete min # of nodes (or edges) to make graph bipartite
– blue nodes: shifting
– red nodes: widening
Bipartization by node deletion is NP-hard (GW98: 9/4-approximation)
Summary
New fast, optimal algorithms for edge-deletion bipartization
– Fast T-join using gadgets
– applicable to any AltPSM phase conflict graphs
Approximate solution for node-deletion bipartization
– Goemans-Williamson98 9/4-approximation
– If node-deletion cost < 1.5 × edge-deletion cost, GW is better than edge deletion
Comprehensive integration w/NTI, Cadence tools
Today’s Talk “Demonstrably useful solutions for real problems” “Valuation”: What problems require attention ?
– technology extrapolation
– automatic layout of phase-shifting masks
“Values”: How do we advance the leading edge ?
– anatomy of FM-based hypergraph partitioning heuristics
– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”
Applied Algorithmics R&D
Heuristics for hard problems
Problems have practical context
Choices dominated by engineering tradeoffs
– QOR vs. resource usage, accessibility, adoptability
How do you know/show that your approach is good?
Hypergraph Partitioning in VLSI
Variants
– directed/undirected hypergraphs
– weighted/unweighted vertices, edges
– constraints, objectives, …
Human-designed instances; benchmarks
– up to 4,000,000 vertices
– sparse (vertex degree ≈ 4, hyperedge size ≈ 4)
– small number of very large hyperedges
Efficiency, flexibility: KL-FM style preferred
Context: Top-Down Placement Speed
– 6,000 cells/minute to final detailed placement
– partitioning used only in top-down global placement
– implied partitioning runtime: 1 second for 25,000 cells, < 30 seconds for 750,000 cells
Structure
– tight balance constraint on total cell areas in partitions
– widely varying cell areas
– fixed terminals (pads, terminal propagation, etc.)
Fiduccia-Mattheyses (FM) Approach
Pass:
– start with all vertices free to move (unlocked)
– label each possible move with immediate change in cost that it causes (gain)
– iteratively select and execute a move with highest gain, lock the moving vertex (i.e., cannot move again during the pass), and update affected gains
– best solution seen during the pass is adopted as starting solution for next pass
FM:
– start with some initial solution
– perform passes until a pass fails to improve solution quality
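The pass/loop structure above can be sketched as a toy 2-way netcut partitioner. This is an illustrative simplification under stated assumptions (gains recomputed by brute-force scan, a crude minimum-side balance constraint), not the linear-time FM implementation:

```python
def cut_size(hyperedges, part):
    """Number of hyperedges spanning both sides of the bipartition."""
    return sum(len({part[v] for v in e}) > 1 for e in hyperedges)

def fm_pass(hyperedges, part, min_side=1):
    """One simplified FM pass: every vertex moves at most once, the
    best-gain feasible move is taken and the vertex locked, and the
    best solution seen during the pass is returned.  (A real FM keeps
    gains in bucket lists for O(1) retrieval/update; this sketch
    recomputes them by linear scan for clarity.)"""
    part = dict(part)
    locked = set()
    best_cut, best_part = cut_size(hyperedges, part), dict(part)
    for _ in range(len(part)):
        cur = cut_size(hyperedges, part)
        candidates = []
        for v in part:
            if v in locked:
                continue
            part[v] ^= 1                       # tentatively move v
            sizes = [sum(1 for u in part if part[u] == s) for s in (0, 1)]
            if min(sizes) >= min_side:         # balance constraint
                candidates.append((cur - cut_size(hyperedges, part), v))
            part[v] ^= 1                       # undo
        if not candidates:
            break
        gain, v = max(candidates)
        part[v] ^= 1                           # commit best-gain move
        locked.add(v)                          # v cannot move again this pass
        if cut_size(hyperedges, part) < best_cut:
            best_cut, best_part = cut_size(hyperedges, part), dict(part)
    return best_part, best_cut

def fm(hyperedges, part, min_side=1):
    """Run passes until a pass fails to improve solution quality."""
    cut = cut_size(hyperedges, part)
    while True:
        new_part, new_cut = fm_pass(hyperedges, part, min_side)
        if new_cut >= cut:
            return part, cut
        part, cut = new_part, new_cut

# tiny example: 4 cells, 3 two-pin nets; cut improves from 2 to 1
hes = [('a', 'b'), ('c', 'd'), ('a', 'c')]
best, cut = fm(hes, {'a': 0, 'b': 1, 'c': 0, 'd': 1})
```

Note how negative-gain moves are still taken during a pass (hill-climbing), with the best prefix adopted afterwards; that is the essential difference from pure greedy improvement.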
Key Elements of FM Three main operations
– computation of initial gain values at beginning of pass
– retrieval of the best-gain (feasible) move
– update of all affected gain values after a move is made
Contribution of Fiduccia and Mattheyses:
– circuit hypergraphs are sparse
– move gain is bounded between −2× and +2× the maximum vertex degree
– hash moves by gains (gain bucket structure)
– each gain affected by a move is updated in constant time
– linear time complexity per pass
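The gain container can be sketched as follows. This is an illustrative structure only: FM82-style implementations keep a doubly linked list per bucket so tie-breaking (LIFO vs. FIFO) is controlled, whereas the sets here deliberately leave tie-breaking implicit, which is exactly the kind of hidden decision discussed later:

```python
class GainBuckets:
    """Bucket-array gain container (sketch): gains are bounded by
    +/- 2 * max vertex degree, so an array indexed by (shifted) gain
    gives O(1) insert, delete, and update."""

    def __init__(self, max_degree):
        self.offset = 2 * max_degree               # shift gains to >= 0
        self.buckets = [set() for _ in range(4 * max_degree + 1)]
        self.where = {}                            # vertex -> bucket index
        self.max_idx = -1

    def insert(self, v, gain):
        i = gain + self.offset
        self.buckets[i].add(v)
        self.where[v] = i
        self.max_idx = max(self.max_idx, i)

    def remove(self, v):
        self.buckets[self.where.pop(v)].discard(v)

    def update(self, v, delta):
        """Apply a delta gain in O(1) (plus amortized max-pointer decay)."""
        i = self.where[v]
        self.remove(v)
        self.insert(v, i - self.offset + delta)

    def pop_best(self):
        """Retrieve and remove a highest-gain vertex, or None if empty."""
        while self.max_idx >= 0 and not self.buckets[self.max_idx]:
            self.max_idx -= 1                      # amortized decay
        if self.max_idx < 0:
            return None
        v = self.buckets[self.max_idx].pop()       # tie-breaking is implicit!
        del self.where[v]
        return v, self.max_idx - self.offset
```

Because the max pointer only decays between inserts, retrieval is constant amortized time over a pass, which is what makes the per-pass complexity linear.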
Taxonomy of Algorithm and Implementation Improvements
Modifications of the algorithm
Implicit decisions
Tuning that can change the result
Tuning that cannot change the result
Modifications of the Algorithm
Important changes to flow, new steps/features
– lookahead tie-breaking
– CLIP
– instead of actual gain, maintain “updated gain” = actual gain minus initial gain (at start of pass)
– WHY ???
– cut-line refinement
– insert nodes into gain structure only if incident to cut nets
– multiple unlocking
Modifications of the Algorithm
Important changes to flow, new steps/features
– lookahead tie-breaking
– CLIP
– instead of actual gain, maintain “updated gain” = actual gain minus initial gain
– promotes “clustered moves” (similar to “LIFO gain buckets”)
– cut-line refinement
– insert nodes into gain structure only if incident to cut nets
– multiple unlocking
Implicit Decisions
Tie-breaking in choosing highest gain bucket
Tie-breaking in where to attach new element in gain bucket
– LIFO vs. FIFO vs. random ... (known issue: HK 95)
Whether to update, or skip updating, when “delta gain” of a move is zero
Tie-breaking when selecting the best solution seen during pass
– first encountered, last encountered, best-balance, ...
Tuning That Can Change the Result
Threshold large nets to reduce runtime
Skip gain update for large nets
Skip zero delta gain updates
– changes resolution of hash collisions in gain container
Loose/stable net removal
– perform gain updates for only selected nets
Allow illegal solutions during pass
Tuning That Can’t Change the Result
Skip updates for nets that cannot have non-zero delta gain
netcut-specific optimizations
2-way specific optimizations
optimizations for nets of small degree
.....
... 30 years since KL70, 18 years since FM82, 100’s of papers in literature
Zero Delta Gain Update
When vertex x is moved, gains for all vertices y on nets incident to x must potentially be updated
In all FM implementations, this is done by going through incident nets one at a time, computing changes in gain for vertices y on these nets
Implicit decision:
– reinsert a vertex y when it experiences a zero delta gain move (will shift position of y within the same gain bucket)
– skip the gain update (leave position of y unchanged)
Tie-Breaking Between Highest-Gain Buckets
Gain container typically implemented such that available moves are segregated, e.g., by source or destination partition
There can be more than one highest-gain bucket
When the balance constraint is anything other than “exact bisection”, moves from multiple highest-gain buckets can be legal
Implicit decision:
– choose the move from the same partition as the last vertex moved (“toward”)
– choose the move not from the same partition as the last vertex moved (“away”)
– choose the move in partition 0 (“part0”)
Implicit Decision Effects: IBM01
IBM01 with unit areas and 10% balance

Updates    Bias     Flat LIFO         Flat CLIP        ML LIFO          ML CLIP
All gain   Away     856/1723 (12.8)   187/463 (16.9)   185/236 (27.0)   183/239 (25.7)
All gain   Part0    356/1226 (16.3)   185/395 (15.9)   181/238 (25.8)   181/235 (27.4)
All gain   Toward   188/577 (12.6)    181/436 (13.3)   180/236 (27.6)   180/239 (24.5)
Nonzero    Away     201/529 (8.44)    181/415 (13.6)   180/234 (26.3)   181/239 (25.8)
Nonzero    Part0    201/436 (8.81)    181/371 (13.7)   180/232 (25.9)   180/240 (26.0)
Nonzero    Toward   197/454 (9.29)    181/397 (13.2)   181/245 (26.9)   180/237 (24.6)
Effect of Implicit Decisions
Stunning average cutsize difference for flat partitioner with worst vs. best combination
– far outweighs “new improvements”
One wrong decision can lead to misleading conclusions w.r.t. other decisions
– “part0” is worse than “toward” with zero delta gain updates
– better or same without zero delta gain updates
Stronger optimization engines mask flaws
– ML CLIP > ML LIFO > Flat CLIP > Flat LIFO
– less dynamic range → ML masks a bad flat implementation
Tuning Effects
Comparison of two CLIP-FM implementations
Min and Ave cutsizes from 100 single-start trials
Another quiz: Why did this happen ?– N.B.: original inventor of CLIP-FM couldn’t figure it out
Tolerance  CLIP          Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1  Min     471   1228   2569  17782   1990   1499
                   Ave    2456  12158  16695  20178   3156  18154
           Paper2  Min     329    298    797    653   2557    745
                   Ave     485    472   1635   1233   3074   1475
10%        Paper1  Min     246    439   1915    488   2146   1303
                   Ave     462   4163   9720   1232   3016  15658
           Paper2  Min     237    266    675    527   1775    681
                   Ave     424    406   1325    893   2880   1192
Tuning Effects
Comparison of two CLIP-FM implementations
Min and Ave cutsizes from 100 single-start trials
Another quiz: Why did this happen ?– Hint: some modern IBM benchmarks have large macro-cells
Tolerance  CLIP          Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1  Min     471   1228   2569  17782   1990   1499
                   Ave    2456  12158  16695  20178   3156  18154
           Paper2  Min     329    298    797    653   2557    745
                   Ave     485    472   1635   1233   3074   1475
10%        Paper1  Min     246    439   1915    488   2146   1303
                   Ave     462   4163   9720   1232   3016  15658
           Paper2  Min     237    266    675    527   1775    681
                   Ave     424    406   1325    893   2880   1192
Sheer Nightmare Stuff...
Comparison of two LIFO-FM implementations
Min and Ave cut sizes from 100 single-start trials
Papers 1, 2 both published since mid-1998
Tolerance  LIFO-FM       Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1  Min     450    648   2459   3201   2397   1436
                   Ave    2701  12253  16944  20281   3420  16578
           Paper2  Min     366    301   1588   1014   2640   1008
                   Ave     594    542   2688   1802   3382   1746
10%        Paper1  Min     270    313   1624    544   1874   1479
                   Ave     486   3872  12348   2383   3063  14007
           Paper2  Min     244    266   1057    561   2347    821
                   Ave     445    405   1993   1290   3222   1640
How Much Can This Matter ? 5% ? 10% ? 20% ? 50% ? More ?
Answer: 400+% to 2000+% w.r.t. recent literature, and for STANDARD, “WELL-UNDERSTOOD” heuristics
+ lots more + N years = leading partitioner, placer
Today’s Talk “Demonstrably useful solutions for real problems” “Valuation”: What problems require attention ?
– technology extrapolation
– automatic layout of phase-shifting masks
“Values”: How do we advance the leading edge ?
– anatomy of FM-based hypergraph partitioning heuristics
– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”
"Barriers to Entry” for Researchers Code development barrier
– bare-bones self-contained partitioner: 800 lines
– not leading-edge (Dutt/Deng LIFO-FM)
– modern partitioner requires much more code
Expertise barrier
– very small details can have stunning impact
– must not only know what to do, but also what not to do
– impossible to estimate knowledge/expertise required to do research at leading edge
Need reference implementations !
– reference prose (6 pp. 9pt double-column) insufficient
“Barriers to Relevance” for Researchers
All heuristic engines/algorithms tuned to test cases
Test case usage must capture real use models, driving applications
– e.g., recall bipartitioning is driven by top-down placement
– until CKM99: no one considered effect of fixed vertices !!!
Test case usage can be fatally flawed by “details”
– hidden or previously unrealized
– previously believed insignificant
– results of algorithm research will be flawed as a result
Research in mature areas can stall
– incremental research: difficult and risky
– implementations not available → duplicated effort
– too much trust → which approach is really the best?
– some results may not be replicable
– ‘not novel’ is a common reason for paper rejection
– exploratory research: paradoxically, lower-risk
– novelty for the sake of novelty
– yet, novel approaches must be well-substantiated
Pitfalls: questionable value, roadblocks, obsolete contexts
Challenges for Applied Algorithmics
Difficult to be relevant (time-to-market, QOR issues)
– time to market: 5-7 year delay from publishing to first industrial use (cf. market lifetimes, tech extrapolation...)
– quality of results: unmeasurable, unpredictable, basically unknown
Good news: barriers to entry and barriers to relevance are self-inflicted, and possibly curable
– mature domains require mature R&D methodologies
– a possible solution: cultivate flexibility and reuse
– low cost “update” of previous work to support reuse
– future tool/algorithm development biased towards reuse
Analogy: Hardware Design :: Tool Design
Hardware design is difficult
– complex electrical engineering and optimization problems
– mistakes are costly
– verification and test not trivial
– few can afford to truly exploit the limits of technology
– A Winning Approach: Hardware IP reuse
CAD tool design is difficult
– complex software engineering and optimization problems
– mistakes can be showstoppers
– verification and test not trivial
– few can manage complexity of leading-edge approaches
– A "Surprising Idea”: CAD-IP reuse
What is CAD-IP?
Data models and benchmarks
– context descriptions and use models
– testcases and good solutions
Algorithms and algorithm analyses
– mathematical formulations
– comparison and evaluation methodologies for algorithms
– executables and source code of implementations
– leading-edge performance results
Traditional (paper-based) publications
Bookshelf: A Repository for CAD-IP “Community memory” for CAD-IP
– data models
– algorithms
– implementations
Publication medium that enables efficient applied algorithmics algorithm research
– benchmarks, performance results
– algorithm descriptions and analyses
– quality implementations (e.g., open-source Capo, MLPart)
Simplified comparisons to identify best approaches
Easier for industry to communicate new use models
Summary: Addressing Inefficiencies
Inefficiencies
– lack of openness and standards → huge duplication of effort
– incomparable reporting → improvement difficult
– lack of standard comparison / latest use models → best approach not clear
– industry doesn’t bother w/ feedback → outdated use models
Proposed solutions
– widely available, up-to-date, extensible benchmarks
– standardized performance reporting for leading-edge approaches
– available detailed descriptions of algorithms
– peer review of executables (and source code?)
– credit for quality implementations
Better research, faster adoption, more impact http://vlsicad.cs.ucla.edu/GSRC/bookshelf/
Today’s Talk “Demonstrably useful solutions for real problems” “Valuation”: What problems require attention ?
– technology extrapolation
– automatic layout of phase-shifting masks
“Values”: How do we advance the leading edge ?
– anatomy of FM-based hypergraph partitioning heuristics
– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”
Thank you for your attention !!!
Parameters Description of technology, circuit and design attributes
Importance of consistent naming cannot be overstated
– Naming conventions for parameters:
[<preposition>] _ <principal> _ {[qualifier] _ <place>} _ {<qualifier>} _ [<adverbial>] _ [<index>] _ [<unit>]
– Example: r_int_tot_lyr_pu_dl
– Benefits:
– Relatively easy to understand a parameter from its name
– Distinguishable (no two parameters should have the same name)
– r_int (interconnect resistance) = r_int (interconnect resistivity) ?
– Unique (no two names for the same parameter)
– R_int = R_wire ?
– Sortable (important literals come first)
– Software to automatically check parameter naming
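Such a checker can be sketched in a few lines. The token vocabularies below are invented placeholders for illustration, not GTX's actual grammar lists:

```python
# Minimal sketch of a naming checker. These vocabularies are
# illustrative placeholders, not GTX's real token lists.
PRINCIPALS = {"r", "c", "l", "t", "p", "f", "v", "i", "n", "a", "w"}
KNOWN_TOKENS = PRINCIPALS | {"int", "tot", "lyr", "pu", "dl", "gate",
                             "wire", "min", "max", "avg", "glob", "loc"}

def check_name(name, existing=()):
    """Return a list of problems with a proposed parameter name."""
    problems = []
    tokens = name.split("_")
    if not all(t.isalnum() and t == t.lower() for t in tokens):
        problems.append("tokens must be lowercase alphanumerics")
    unknown = [t for t in tokens if t not in KNOWN_TOKENS]
    if unknown:
        problems.append(f"unknown tokens: {unknown}")
    if not any(t in PRINCIPALS for t in tokens[:2]):
        problems.append("no principal near the front (hurts sortability)")
    if name in existing:
        problems.append("name already in use (must be distinguishable)")
    return problems

# the slide's example name passes the sketch checker
assert check_name("r_int_tot_lyr_pu_dl") == []
```

Enforcing the vocabulary mechanically is what makes the naming scheme distinguishable, unique, and sortable in practice rather than by convention only.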
Rules Methods to derive unknown parameters from known ones
ASCII rules
– Laws of physics, models of electrical behavior
– Statistical models (e.g., Rent's rule)
– Include closed-form expressions, vector operations, tables
– Storing of calibration data (e.g., “technology files”) for known process, design points in lookup tables
Constraints
– Simulated by rules that compute boolean values
– Used to limit range during “sweeping”
– Optimization over a collection of rules
– Example: buffer insertion for minimal delay with area constraints
Rules (Cont.) “External executable” rules
– Assume a callable executable (e.g., PERL script)
– Example: optimization of number and size of repeaters for global wires
– Use command-line interface and transfer through files
– Allow complex semantics of a rule
– Example: placers, IPEM executable (Cong, UCLA)
“Code” rules
– Implemented in C++ and linked into the inference engine
– Useful if execution speed is an issue
Engine
Contains no domain-specific knowledge
Evaluates rules in topological order
Performs studies (multiple evaluations → tradeoffs/sweeping, optimization)
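The engine's behavior can be sketched as a fixed-point derivation loop, equivalent to evaluating the rule dependency graph in topological order. The two rules shown (resistance per unit length, and an Elmore-style distributed RC delay) are illustrative stand-ins, not GTX's actual models:

```python
def evaluate(params, rules):
    """Derive unknown parameters by repeatedly firing any rule whose
    inputs are all known; equivalent to topological-order evaluation
    of the rule dependency graph.  Each rule is (inputs, output, fn)."""
    params = dict(params)
    pending = list(rules)
    progress = True
    while pending and progress:
        progress = False
        for rule in list(pending):
            inputs, output, fn = rule
            if all(i in params for i in inputs):
                params[output] = fn(*(params[i] for i in inputs))
                pending.remove(rule)
                progress = True
    if pending:  # a cycle or missing input leaves rules underivable
        raise ValueError("underivable: " + str([r[1] for r in pending]))
    return params

# Toy rule chain: wire delay from geometry (illustrative, not GTX models)
rules = [
    (("rho", "w", "h"), "r_pu", lambda rho, w, h: rho / (w * h)),
    (("r_pu", "c_pu", "length"), "delay",
     lambda r, c, L: 0.5 * r * c * L * L),  # Elmore distributed RC
]
out = evaluate({"rho": 2e-8, "w": 1e-7, "h": 2e-7,
                "c_pu": 2e-10, "length": 1e-3}, rules)
```

Because the engine only inspects declared inputs and outputs, the same loop serves closed-form rules, table lookups, and external-executable rules without domain-specific knowledge.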
Knowledge Representation Rules and parameters are specified separately from the derivation
engine
Human-readable ASCII grammar
Benefits:
– Easy creation/sharing of parameters/rules by multiple users
– D. Sylvester and C. Cao: device and power, SOI modules that “drop in” to GTX
– P.K. Nag: Yield modeling
– Extensible to models of arbitrary complexity (specialized prediction methods, technology data sets, optimization engines)
– Avant! Apollo or Cadence SE P&R tool: just another wirelength estimator
– Applies to any domain of work in semiconductors, VLSI CAD
– Transistor sizing, single wire optimizations, system-level wiring predictions,…
Corking Effect in CLIP CLIP begins by placing all moves into the 0-gain buckets
– CLIP chooses moves by cumulative delta gain (“updated gain”)
– initially, every move has cumulative delta gain = 0
Historical legacy (and for speed): FM partitioners typically look only at the first move in a bucket
– if it is illegal, skip the rest of the bucket (possibly skip all buckets for that partition)
If the move at the head of each bucket at the beginning of a CLIP pass is illegal, the pass terminates without making any moves
– even if the first move is legal, an illegal move soon afterward will “cork” its bucket
New test cases (IBM) have large cells
– large cells have large degree, and often large initial gain
– CLIP inventor couldn’t understand bad performance on IBM cases
Tuning to Uncork CLIP Don’t place nodes with area > balance constraint in gain
container at pass initialization
– actually useful for all FM variants
– zero CPU overhead
Look beyond the first move in a bucket
– extremely expensive
– hurts quality (partitioner doesn’t operate well near balance tolerance)
– not worth it, in our experience
Simply do a LIFO pass before starting CLIP
– spreads out nodes in gain buckets
– reduces likelihood that large node has largest total gain
Effect of Fixed Terminals
[Figure: two plots, Runtime (sec) for IBM01 and Normalized Cost for IBM01]
Conflict in Cell (Macro) Based Layouts
Consider connected components of conflict graphs within each cell master
– each component independently phase-assignable (2^k versions)
– each is a single “vertex” in coarse-grain conflict graph
– problem: assure free composability (reusability) of cell masters, such that no odd cycles can arise in coarse-grain conflict graph
[Figure: connected components in cell masters A and B joined by an edge in the coarse-grain conflict graph]
Case I: Creating CAD IP of Questionable Value
Recent hypergraph partitioning papers report FM implementations 20x worse than leading-edge FM
– previous lack of openness caused wrong conclusions, wasted effort
– some “improvements” may only apply to weak implementations
– duplicated effort re-implementing (incorrectly?) well-known algorithms
– difficult to find the leading edge
– no standard comparison methodology
– how do you know if an implementation is poor?
To make the leading edge apparent and reproducible
– publish performance results on standard benchmarks
– peer review (executables, source code?)
– similar to common publication standards !
Tolerance  LIFO-FM            Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1  Min(100)     450    648   2459   3201   2397   1436
                   Ave         2701  12253  16944  20281   3420  16578
           Paper2  Min(100)     366    301   1588   1014   2640   1008
                   Ave          594    542   2688   1802   3382   1746
Case II: Roadblocks to Creating Needed CAD-IP
“Best approach” to global placement?
– recursive bisection (1970s)
– force-directed (1980s)
– simulated annealing (1980s)
– analytical (1990s)
– hybrids, others
Why is this question difficult?
– latest public placement benchmarks are from the 1980s
– data formats are bulky (hard to mix and match components)
– no public implementations since early 1990s
– new ideas are not compared to old
To match approaches to new contexts
– agree on common up-to-date data model
– publish good format descriptions, benchmarks, performance results
– publish implementations
Case III: Developing CAD-IP for Obsolete Contexts
Global placement example
– much of academia studies variable-die placement
– row length and spacing not fixed
– explicit feedthroughs
– majority of industrial use is fixed-die
– pre-defined layout dimensions
– HPWL-driven vs. routability- or timing-driven
– runtimes are often not even reported
– this affects benchmarks and algorithms
Solution: perform sanity checks and request feedback
– explicitly define use model and QOR measures
– establish a repository for up-to-date formats, benchmarks etc.
– peer review (executables, source code?)
Implicit Decision Effects: IBM02
IBM02 with unit areas and 10% balance

Updates    Bias     Flat LIFO         Flat CLIP        ML LIFO          ML CLIP
All gain   Away     402/1404 (30.8)   274/662 (50.7)   263/285 (62.2)   262/281 (65.7)
All gain   Part0    307/1468 (43.2)   263/513 (41.0)   262/288 (65.9)   262/278 (62.0)
All gain   Toward   283/585 (23.7)    263/446 (40.3)   262/291 (61.4)   262/281 (60.1)
Nonzero    Away     275/471 (18.9)    274/466 (35.5)   262/282 (60.5)   262/286 (51.0)
Nonzero    Part0    262/444 (18.4)    262/442 (35.0)   262/280 (58.3)   262/286 (57.7)
Nonzero    Toward   265/453 (17.0)    262/445 (32.1)   262/281 (56.7)   262/284 (55.8)
Reference Implementations
Documentation does not allow replication of results
– amazingly, true even for "classic" algorithms
– true for vendor R&D, true for academic R&D
Published reference implementations will raise quality
– minimum standard for algorithm implementation quality
– reduce barrier to entry for new R&D
Conclusions
Work with mature heuristics requires mature methodologies
Identified research methodology risks
Identified reporting methodology risks
Community needs to adopt standards for both
– reference “benchmark” implementations
– vigilant awareness of use-model and context
– reporting method that facilitates comparison
Application-Driven Research
Well-studied areas have complex, "tuned" metaheuristics
Risks of poor research methodologies
– irreproducible results or descriptions
– no enabling account of key insights underlying the contribution
– experimental evidence not useful to others
– inconsistent with driving use model
– missing comparisons with leading-edge approaches
– Let’s look at some requirements this induces...
The GSRC Bookshelf for CAD-IP
Bookshelf consists of slots
– slots represent active research areas with “enough customers”
– collectively, the slots cover the field
Who maintains slots?
– experts in each topic collaborate to produce them - anyone can submit
Currently, 10 active slots
– SAT (U. Michigan, Sakallah)
– Graph Coloring (UCLA, Potkonjak)
– Hypergraph Partitioning (UCLA, Kahng)
– Block Packing (UCSC, Dai)
– Placement (UCLA, Kahng)
– Global Routing (SUNY Binghamton, Madden)
– Single Interconnect Tree Synthesis (UIC, Lillis and UCLA, Cong)
– Commitments for more: BDDs, NLP, Test and Verification
What’s in a Slot?
Introduction: why this area is important, and recent progress
– pointers to other resources (links, publications)
Data formats used for benchmarks
– SAT, graph formats, etc.
– new XML-based formats
Benchmarks, solutions, performance results
– including experimental methodology (e.g., runtime-quality Pareto curve)
Binary utilities
– format converters, instance generators, solution evaluators, legality checkers
– optimizers and solvers
– executables
Implementation source code
Other info relevant to algorithm research and implementations
– detailed algorithm descriptions
– algorithm comparisons
Current Progress on the CAD-IP Bookshelf
– 33 members (17 developers)
Main policies and mechanisms published
10 active slots
– inc. executables, performance results for leading-edge partitioners, placers
First Bookshelf Workshop, Nov. 1999
– attendance: UCSC, UCB, NWU, UIC, SUNY Binghamton, UCLA
– agreed on abstract syntax and semantics for initial slots
– committed to XML for common data formats
– peer review of slot webpages
Ongoing research uses components in the Bookshelf