
Valuation and Values in Application-Driven Algorithmics: Case Studies from VLSI CAD

Andrew B. Kahng, UCLA Computer Science Dept., June 2, 2000
[email protected], http://vlsicad.cs.ucla.edu

My Research

Applied algorithmics

– demonstrably useful solutions for real problems

– “best known” solutions

– “classic” (well-studied): Steiner, partition, placement, TSP, ...

– toolkits: discrete algorithms, global optimization, mathematical programming, approximation frameworks, new-age metaheuristics, engineering

“Ground truths”

– anatomies

– limits

Anatomies

Technologies
– semiconductor process roadmap, design-manufacturing I/F
– design technology: methodology, flows, design process
– interconnect modeling/analysis: delay/noise estimation, compact models

Problems
– structural theory of large-scale global optimizations

Heuristics
– hypergraph partitioning and clustering
– wirelength- and timing-driven placement
– single/multiple topology synthesis (length, delay, skew, buffering, ...)
– TSP, ..., IP protection, ..., combinatorial exchange/auction, ...

Cultures
– contexts and infrastructure for research and technology transfer

Bounds

Exact methods

Provable approximations

Technology extrapolation

– achievable envelope of system implementation w.r.t. cost, speed, power, reliability, ...

– ideally, should drive and be driven by system architectures, design and implementation methodologies

Today’s Talk

“Demonstrably useful solutions for real problems”

“Valuation”: What problems require attention?

– technology extrapolation

– automatic layout of phase-shifting masks

“Values”: How do we advance the leading edge?

– anatomy of FM-based hypergraph partitioning heuristics

– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”


Technology Extrapolation

Evaluates impact of

– design technology

– process technology

Evaluates impact on
– achievable design

– associated design problems

What matters, when?

Sets new requirements for CAD tools and methodologies, capital and R&D investment, ... → right tech at the right time

Roadmaps (SIA ITRS): familiar and influential example

How and when do L, SOI, SER, etc. matter?

What is the most power-efficient noise management strategy?

Will layout tools need to perform process simulation to effectively address cross-die and cross-wafer manufacturing variation?

GTX: GSRC Technology Extrapolation System

GTX is a framework for technology extrapolation

[Figure: GTX architecture. Knowledge: Parameters (data), Rules (models), Rule chain (study). Implementation: Engine (derivation), GUI (presentation). Components are either user inputs or pre-packaged.]

Graphical User Interface (GUI)

Provides user interaction

Visualization (plotting, printing, saving to file)

4 views:

– Parameters

– Rules

– Rule chain

– Values in chain

GTX: Open, “Living Roadmap”

Openness in grammar, parameters and rules

– easy sharing of data, models in research environment

– contributions of best known models from anywhere

Allows development of proprietary models

– separation between supplied (shared) and user-defined parameters / rules

– usability behind firewalls

– functionality for sharing results instead of data

Multi-platform (SUN Solaris, Windows, Linux)

http://vlsicad.cs.ucla.edu/GSRC/GTX/

GTX Activity

Models implemented

– Cycle-time models of SUSPENS (with extension by Takahashi), BACPAC (Sylvester, Berkeley), Fisher (ITRS)

– Currently adding

– GENESYS (with help from Georgia Tech)

– RIPE (with help from RPI)

– New device and power modules (Synopsys / Berkeley)

– New SOI device model (Synopsys / Berkeley)

– Inductance extraction (Silicon Graphics / Berkeley / Synopsys)

Studies performed in GTX

– Modeling and parameter sensitivity analyses

– Design optimization studies: global interconnects, layer stack

– Routability estimation, via impact models, ...

Today’s Talk

“Demonstrably useful solutions for real problems”

“Valuation”: What problems require attention?

– technology extrapolation

– automatic layout of phase-shifting masks

“Values”: How do we advance the leading edge?

– anatomy of FM-based hypergraph partitioning heuristics

– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”

Subwavelength Gap since 0.35 μm

Subwavelength Optical Lithography

EUV, X-rays, E-beams all > 10 years out

huge investment in > 30 years of optical litho infrastructure

Mask Types

[Figure: clear areas vs. opaque (chrome) areas on a mask]

Bright Field
– opaque features
– transparent background

Dark Field
– transparent features
– opaque background

Phase Shifting Masks

[Figure: conventional mask (glass, chrome) vs. phase shifting mask (with phase shifter); E at mask, E at wafer, and intensity I at wafer for each]

Impact of PSM

PSM enables smaller transistor gate lengths Leff
– “critical” polysilicon features only (gate Leff)
– faster device switching → faster circuits
– better critical dimension (CD) control → improved parametric yield
– all features on polysilicon layer, local interconnect layers
– smaller die area → more $/wafer (“full-chip PSM” == BIG win)

Alternative: build a $10B fab with equipment that won’t exist for 5+ years

Data points
– exponential increase in price of CAD technology for PSM
– Numerical Technologies market cap 3x that of Avant!
– 25 nm gates (!!!) manufactured with 248 nm DUV steppers (NTI + MIT Lincoln Labs, announced 2 days ago); 90 nm gates in production at Motorola, Lucent (since late 1999)

Double-Exposure Bright-Field PSM

[Figure: 0 and 180 phase exposures combine to define the critical feature]

The Phase Assignment Problem

Assign 0, 180 phase regions such that critical features with width (separation) < B are induced by adjacent phase regions with opposite phases

Bright Field (Dark Field)

[Figure: 0/180 phase regions adjacent to critical features]

Key: Global 2-Colorability

[Figure: a chain of 0/180 phase implications that cannot be completed consistently]

If there is an odd cycle of “phase implications” → layout cannot be manufactured

– layout verification becomes a global, not local, issue
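The check itself is classical. Below is a minimal sketch (illustrative Python of my own, not the UCLA tool): BFS 2-coloring of the phase-implication graph, returning a 0/180 phase per region, or None when an odd implication cycle makes the layout unmanufacturable.

```python
from collections import deque

def phase_assignable(n, conflict_edges):
    """2-color n phase regions; a conflict edge forces opposite phases."""
    adj = [[] for _ in range(n)]
    for u, v in conflict_edges:
        adj[u].append(v)
        adj[v].append(u)
    phase = [None] * n                      # 0 or 180 per region
    for s in range(n):                      # handle each connected component
        if phase[s] is not None:
            continue
        phase[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if phase[v] is None:
                    phase[v] = 180 - phase[u]
                    q.append(v)
                elif phase[v] == phase[u]:  # odd cycle of implications
                    return None
    return phase

# An even cycle of conflicts is assignable; adding a chord makes it odd:
print(phase_assignable(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
print(phase_assignable(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))
```

Note that the failure is global: the odd cycle caught at one edge may span the entire die.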

[Figure: layout with critical features F1-F4]

Critical features: F1, F2, F3, F4

[Figure: the same layout with opposite-phase shifters (0, 180) around each critical feature]

[Figure: features F1-F4 with shifters S1-S8]

Shifters: S1-S8

PROPER Phase Assignment:
– Opposite phases for opposite shifters
– Same phase for overlapping shifters

Phase Conflict

[Figure: features F1-F4 with shifters S1-S8; the implication cycle is odd]

Proper Phase Assignment is IMPOSSIBLE

Phase Conflict Resolution

[Figure: features F1-F4 with shifters S1-S8; shifting a feature removes a shifter overlap]

Phase conflict → feature shifting to remove overlap

Phase Conflict Resolution

[Figure: features F1-F4 with shifters; widening a feature turns the conflict into a non-conflict]

Phase conflict → feature widening to turn conflict into non-conflict

How will VLSI CAD deal with PSM?

UCLA: first comprehensive methodology for PSM-aware layout design
– currently being integrated by Cadence, Numerical Technologies

Approach: partition responsibility for phase-assignability

– good layout practices (local geometry)
– (open) problem: is there a set of “design rules” that guarantees phase-assignability of layout? (no T’s, no doglegs, even fingers, ...)

– automatic phase conflict resolution / bipartization (global colorability)

– enabling reuse of layout (free composability)
– problem: how can we guarantee reusability of phase-assigned layouts, such that no odd cycles can occur when the layouts are composed together in a larger layout?

Automatic Conflict Resolution

Compaction-Oriented Approach

Analyze input layout

Find min-cost set of perturbations needed to eliminate all “odd cycles”

Induce constraints for output layout
– i.e., PSM-induced (shape, spacing) constraints

Compact to get phase-assignable layout

Key: minimize the set of new constraints, i.e., break all odd cycles in the conflict graph by deleting a minimum number of edges.

Conflict Graph

Dark Field: build graph over feature regions
– edge between two features whose separation is < B

Bright Field: build graph over shifter regions
– shifters for features whose width is < B
– two edge types:
– adjacency edge between overlapping phase regions: endpoints must have same phase
– conflict edge between shifters on opposite sides of a critical feature: endpoints must have opposite phase

[Figure: dark-field conflict graph G; green = feature, pink = conflict]

[Figure: bright-field conflict graph G with conflict edges and adjacency edges]
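As an illustration of the dark-field construction just described (an edge between two features whose separation is < B), here is a brute-force sketch; the rectangle representation and names are my own assumptions, and a production tool would use a plane-sweep rather than checking all O(n^2) pairs.

```python
def rect_gap(a, b):
    """Minimum separation between two axis-aligned rectangles (x1, y1, x2, y2)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)   # horizontal gap (0 if overlapping)
    dy = max(a[1] - b[3], b[1] - a[3], 0)   # vertical gap
    return (dx * dx + dy * dy) ** 0.5

def conflict_graph(features, B):
    """Dark-field conflict graph: edge (i, j) when separation < B."""
    return [(i, j)
            for i in range(len(features))
            for j in range(i + 1, len(features))
            if rect_gap(features[i], features[j]) < B]

print(conflict_graph([(0, 0, 2, 1), (3, 0, 5, 1), (0, 3, 2, 4)], B=2))
```

The bright-field version is analogous but built over shifter regions, with the two edge types listed above.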

Optimal Odd Cycle Elimination

[Figure: conflict graph G (dark green = feature, pink = conflict); its geometric dual graph D; T-join of odd-degree nodes in D]

The T-join in D corresponds to broken edges in the original conflict graph:
– assign phases: dark green and purple
– remaining pink conflicts correctly handled

The T-join Problem

How to delete a minimum-cost set of edges from conflict graph G to eliminate odd cycles?

Construct geometric dual graph D = dual(G)

Find odd-degree vertices T in D

Solve the T-join problem in D:
– find min-weight edge set J in D such that
– all T-vertices have odd degree
– all other vertices have even degree

Solution J corresponds to desired min-cost edge set in conflict graph G

Solving T-join in Sparse Graphs

Reduction to matching
– construct a complete graph T(G)
– vertices = T-vertices
– edge costs = shortest-path cost
– find minimum-cost perfect matching

Typical example = sparse (not always planar) graph
– note that conflict graphs are sparse
– #vertices = 1,000,000
– #edges ≈ 5 × #vertices
– #T-vertices ≈ 10% of #vertices = 100,000

Drawback: finding all-pairs shortest paths is too slow and memory-consuming
– #T-vertices = 100,000 → #edges in T(G) = 5,000,000,000
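For concreteness, here is a minimal sketch of that baseline reduction, assuming networkx and a dual graph D whose edges carry a 'weight' attribute; the function name and structure are my own, and this is the slow shortest-paths version that the gadget construction on the next slides is designed to avoid.

```python
import itertools
import networkx as nx

def t_join_by_matching(D, T):
    """Baseline T-join: complete graph T(G) on the T-vertices with
    shortest-path costs, then a min-weight perfect matching; the
    symmetric difference of the matched paths is the edge set J."""
    assert len(T) % 2 == 0                       # odd-degree vertices come in pairs
    dist = {t: nx.single_source_dijkstra_path_length(D, t, weight="weight")
            for t in T}                          # the slow, memory-hungry step
    K = nx.Graph()
    for u, v in itertools.combinations(T, 2):
        K.add_edge(u, v, weight=dist[u][v])
    J = set()
    for u, v in nx.min_weight_matching(K):       # blossom matching on T(G)
        path = nx.dijkstra_path(D, u, v, weight="weight")
        for e in zip(path, path[1:]):            # overlapping paths cancel out
            J.symmetric_difference_update({tuple(sorted(e))})
    return J
```

At the sizes quoted above, T(G) alone is far too large to build, which motivates the gadget-based reduction that follows.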

Solving T-join: Reduction to Matching

Desirable properties of reduction to matching:
– exact (i.e., optimal)
– not much memory (say, 2-3X more)
– leads to very fast solution

Solution: gadgets!
– replace each edge/vertex with gadgets such that matching all vertices in the gadgeted graph ⟺ T-join in the original graph

T-join Problem: Reduction to Matching

Replace each vertex with a chain of triangles; one more edge for T-vertices

In graph D: m = #edges, n = #vertices, t = #T-vertices
In gadgeted graph: 4m - 2n - t vertices, 7m - 5n - t edges

Cost of red edges = original dual edge costs; cost of (black) edges in triangles = 0

[Figure: gadgets for a plain vertex and for a T-vertex]

Example of Gadgeted Graph

[Figure: dual graph and its gadgeted graph; black + red edges == min-cost perfect matching]

Results

Testcase sizes: Layout1 = 3769 polygons, 12442 edges; Layout2 = 9775 polygons, 26520 edges; Layout3 = 18249 polygons, 51402 edges.

Algorithm     Layout1           Layout2           Layout3
              edges  runtime    edges  runtime    edges  runtime
Greedy        2650   0.56       2722   3.66       6180   5.38
GW            1612   3.33       1488   5.77       3280   14.47
Exact         1468   19.88      1346   16.67      2958   74.33
New Gadgets   1468   3.62       1346   5.17       2958   17.9

• Runtimes in CPU seconds on Sun Ultra-10
• Greedy = breadth-first-search bicoloring
• GW = Goemans/Williamson95 heuristic
• Cook/Rohe98 for perfect matching
• Integration w/ compactor: saves 9+% layout area vs. GW

[Figure: features F1-F4 with shifters S1-S8, viewed as a node-deletion bipartization instance]

Can distinguish between use of shifting, widening DOFs

Black points = features; blue = shifter overlap; red = extra nodes to distinguish opposite shifters

Bipartization problem: delete min # of nodes (or edges) to make the graph bipartite
– blue nodes: shifting
– red nodes: widening

Bipartization by node deletion is NP-hard (GW98: 9/4-approximation)

Summary

New fast, optimal algorithms for edge-deletion bipartization
– fast T-join using gadgets
– applicable to any AltPSM phase conflict graph

Approximate solution for node-deletion bipartization
– Goemans-Williamson98 9/4-approximation
– if node-deletion cost < 1.5 × edge-deletion cost, GW is better than edge deletion

Comprehensive integration w/ NTI, Cadence tools

Today’s Talk

“Demonstrably useful solutions for real problems”

“Valuation”: What problems require attention?

– technology extrapolation

– automatic layout of phase-shifting masks

“Values”: How do we advance the leading edge?

– anatomy of FM-based hypergraph partitioning heuristics

– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”

Applied Algorithmics R&D

Heuristics for hard problems

Problems have practical context

Choices dominated by engineering tradeoffs
– QOR vs. resource usage, accessibility, adoptability

How do you know/show that your approach is good?

Hypergraphs in VLSI CAD

Circuit netlist represented by hypergraph

Hypergraph Partitioning in VLSI

Variants
– directed/undirected hypergraphs
– weighted/unweighted vertices, edges
– constraints, objectives, ...

Human-designed instances

Benchmarks
– up to 4,000,000 vertices
– sparse (vertex degree ≈ 4, hyperedge size ≈ 4)
– small number of very large hyperedges

Efficiency, flexibility: KL-FM style preferred

Context: Top-Down VLSI Placement

[Figure: recursive bisection of the placement region]

Context: Top-Down Placement

Speed

– 6,000 cells/minute to final detailed placement

– partitioning used only in top-down global placement

– implied partitioning runtime: 1 second for 25,000 cells, < 30 seconds for 750,000 cells

Structure
– tight balance constraint on total cell areas in partitions

– widely varying cell areas

– fixed terminals (pads, terminal propagation, etc.)

Fiduccia-Mattheyses (FM) Approach

Pass:

– start with all vertices free to move (unlocked)

– label each possible move with immediate change in cost that it causes (gain)

– iteratively select and execute a move with highest gain, lock the moving vertex (i.e., cannot move again during the pass), and update affected gains

– best solution seen during the pass is adopted as starting solution for next pass

FM:

– start with some initial solution

– perform passes until a pass fails to improve solution quality
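A deliberately naive sketch of one such pass may make the flow concrete (my own toy Python: it rescans all moves instead of using gain buckets, and assumes unit vertex areas with a simple size cap); the gain-container sketch a few slides later shows how real implementations make this fast.

```python
def cut(nets, side):
    """Number of hyperedges spanning both sides of the bipartition."""
    return sum(1 for net in nets if len({side[v] for v in net}) == 2)

def fm_pass(nets, side, max_part):
    best_cut, best_side = cut(nets, side), dict(side)
    locked = set()
    while len(locked) < len(side):
        cand = []
        for v in side:                           # all unlocked, legal moves
            if v in locked:
                continue
            trial = dict(side)
            trial[v] = 1 - side[v]
            ones = sum(trial.values())
            if ones <= max_part and len(trial) - ones <= max_part:
                cand.append((cut(nets, side) - cut(nets, trial), v))
        if not cand:
            break
        gain, v = max(cand)                      # ties broken arbitrarily here:
        side = dict(side)                        #   exactly the "implicit
        side[v] = 1 - side[v]                    #   decisions" discussed below
        locked.add(v)                            # vertex cannot move again
        if cut(nets, side) < best_cut:           # keep best prefix of the pass
            best_cut, best_side = cut(nets, side), dict(side)
    return best_cut, best_side

nets = [(0, 1, 2), (2, 3), (3, 4, 5), (0, 5)]    # toy hypergraph
print(fm_pass(nets, {v: v % 2 for v in range(6)}, max_part=4))
```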

Cut During One Pass (Bipartitioning)

[Figure: cut value vs. moves made during one pass; the best solution seen is kept]

Multilevel Partitioning

[Figure: multilevel flow: clustering down, refinement up]

Key Elements of FM

Three main operations
– computation of initial gain values at beginning of pass
– retrieval of the best-gain (feasible) move
– update of all affected gain values after a move is made

Contribution of Fiduccia and Mattheyses:
– circuit hypergraphs are sparse
– move gain is bounded between -2× and +2× max vertex degree
– hash moves by gains (gain bucket structure)
– each gain affected by a move is updated in constant time
– linear time complexity per pass
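A compact sketch of the gain-bucket container (illustrative Python of my own, not the UCLA implementation; a real FM container uses doubly linked lists so that removal, and hence each gain update, is O(1)):

```python
from collections import defaultdict

class GainContainer:
    """Moves hashed by gain; best gain found by scanning down from the max."""
    def __init__(self, max_gain):
        self.max_gain = max_gain
        self.buckets = defaultdict(list)   # gain -> vertices (LIFO order!)
        self.gain = {}                     # vertex -> current gain
        self.best = -max_gain

    def insert(self, v, g):
        self.gain[v] = g
        self.buckets[g].append(v)          # LIFO attach: an implicit decision
        self.best = max(self.best, g)

    def update(self, v, delta):
        if delta == 0:
            return                         # the "skip zero delta gain" variant
        self.buckets[self.gain[v]].remove(v)   # O(1) with a linked list
        self.insert(v, self.gain[v] + delta)

    def pop_best(self):
        while self.best >= -self.max_gain:
            if self.buckets[self.best]:
                v = self.buckets[self.best].pop()
                del self.gain[v]
                return v
            self.best -= 1                 # cost amortized over the pass
        return None
```

Even this toy exposes several of the implicit decisions cataloged below: LIFO attachment, zero-delta skipping, and tie-breaking inside pop_best.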

Taxonomy of Algorithm and Implementation Improvements

Modifications of the algorithm

Implicit decisions

Tuning that can change the result

Tuning that cannot change the result

Modifications of the Algorithm

Important changes to flow, new steps/features

– lookahead tie-breaking

– CLIP

– instead of actual gain, maintain “updated gain” = actual gain minus initial gain (at start of pass)

– WHY ???

– cut-line refinement

– insert nodes into gain structure only if incident to cut nets

– multiple unlocking

Modifications of the Algorithm

Important changes to flow, new steps/features

– lookahead tie-breaking

– CLIP

– instead of actual gain, maintain “updated gain” = actual gain minus initial gain

– promotes “clustered moves” (similar to “LIFO gain buckets”)

– cut-line refinement

– insert nodes into gain structure only if incident to cut nets

– multiple unlocking

Implicit Decisions

Tie-breaking in choosing highest gain bucket

Tie-breaking in where to attach new element in gain bucket

– LIFO vs. FIFO vs. random ... (known issue: HK 95)

Whether to update, or skip updating, when “delta gain” of a move is zero

Tie-breaking when selecting the best solution seen during pass

– first encountered, last encountered, best-balance, ...

Tuning That Can Change the Result

Threshold large nets to reduce runtime

Skip gain update for large nets

Skip zero delta gain updates

– changes resolution of hash collisions in gain container

Loose/stable net removal

– perform gain updates for only selected nets

Allow illegal solutions during pass

Tuning That Can’t Change the Result

Skip updates for nets that cannot have non-zero delta gain

netcut-specific optimizations

2-way specific optimizations

optimizations for nets of small degree

.....

... 30 years since KL70, 18 years since FM82, 100s of papers in the literature

Zero Delta Gain Update

When vertex x is moved, gains for all vertices y on nets incident to x must potentially be updated

In all FM implementations, this is done by going through incident nets one at a time, computing changes in gain for vertices y on these nets

Implicit decision:

– reinsert a vertex y when it experiences a zero delta gain move (will shift position of y within the same gain bucket)

– skip the gain update (leave position of y unchanged)

Tie-Breaking Between Highest-Gain Buckets

Gain container typically implemented such that available moves are segregated, e.g., by source or destination partition

There can be more than one highest-gain bucket

When balance constraint is anything other than “exact bisection”, moves at multiple highest-gain buckets can be legal

Implicit decision:
– choose the move that is from the same partition as the last vertex moved (“toward”)
– choose the move that is not from the same partition as the last vertex moved (“away”)
– choose the move in partition 0 (“part0”)

How Much Can This Matter?

5%? 10%? 20%? more? 50%? more?

Implicit Decision Effects: IBM01

ALGORITHM: IBM01 with unit areas and 10% balance

Updates   Bias     Flat LIFO         Flat CLIP        ML LIFO          ML CLIP
All gain  Away     856/1723 (12.8)   187/463 (16.9)   185/236 (27.0)   183/239 (25.7)
All gain  Part0    356/1226 (16.3)   185/395 (15.9)   181/238 (25.8)   181/235 (27.4)
All gain  Toward   188/577 (12.6)    181/436 (13.3)   180/236 (27.6)   180/239 (24.5)
Nonzero   Away     201/529 (8.44)    181/415 (13.6)   180/234 (26.3)   181/239 (25.8)
Nonzero   Part0    201/436 (8.81)    181/371 (13.7)   180/232 (25.9)   180/240 (26.0)
Nonzero   Toward   197/454 (9.29)    181/397 (13.2)   181/245 (26.9)   180/237 (24.6)

Effect of Implicit Decisions

Stunning average cutsize difference for flat partitioner with worst vs. best combination
– far outweighs “new improvements”

One wrong decision can lead to misleading conclusions w.r.t. other decisions
– “part0” is worse than “toward” with zero delta gain updates
– better or same without zero delta gain updates

Stronger optimization engines mask flaws
– ML CLIP > ML LIFO > Flat CLIP > Flat LIFO
– less dynamic range → ML masks bad flat implementation

Tuning Effects

Comparison of two CLIP-FM implementations

Min and Ave cutsizes from 100 single-start trials

Another quiz: Why did this happen?
– N.B.: original inventor of CLIP-FM couldn’t figure it out

Tolerance  CLIP-FM         Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1   Min      471   1228   2569  17782   1990   1499
2%         Paper1   Ave     2456  12158  16695  20178   3156  18154
2%         Paper2   Min      329    298    797    653   2557    745
2%         Paper2   Ave      485    472   1635   1233   3074   1475
10%        Paper1   Min      246    439   1915    488   2146   1303
10%        Paper1   Ave      462   4163   9720   1232   3016  15658
10%        Paper2   Min      237    266    675    527   1775    681
10%        Paper2   Ave      424    406   1325    893   2880   1192

Tuning Effects

Comparison of two CLIP-FM implementations

Min and Ave cutsizes from 100 single-start trials

Another quiz: Why did this happen?
– Hint: some modern IBM benchmarks have large macro-cells

Tolerance  CLIP-FM         Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1   Min      471   1228   2569  17782   1990   1499
2%         Paper1   Ave     2456  12158  16695  20178   3156  18154
2%         Paper2   Min      329    298    797    653   2557    745
2%         Paper2   Ave      485    472   1635   1233   3074   1475
10%        Paper1   Min      246    439   1915    488   2146   1303
10%        Paper1   Ave      462   4163   9720   1232   3016  15658
10%        Paper2   Min      237    266    675    527   1775    681
10%        Paper2   Ave      424    406   1325    893   2880   1192

Sheer Nightmare Stuff...

Comparison of two LIFO-FM implementations

Min and Ave cut sizes from 100 single-start trials

Papers 1, 2 both published since mid-1998

Tolerance  LIFO-FM         Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1   Min      450    648   2459   3201   2397   1436
2%         Paper1   Ave     2701  12253  16944  20281   3420  16578
2%         Paper2   Min      366    301   1588   1014   2640   1008
2%         Paper2   Ave      594    542   2688   1802   3382   1746
10%        Paper1   Min      270    313   1624    544   1874   1479
10%        Paper1   Ave      486   3872  12348   2383   3063  14007
10%        Paper2   Min      244    266   1057    561   2347    821
10%        Paper2   Ave      445    405   1993   1290   3222   1640

In Case You Are Wondering...

No, VLSI CAD Researchers Are Not Stupid.

How Much Can This Matter? 5%? 10%? 20%? more? 50%? more?

Answer: 400+% to 2000+% w.r.t. recent literature and STANDARD, “WELL-UNDERSTOOD” heuristics

+ lots more + N years = leading partitioner, placer

Today’s Talk

“Demonstrably useful solutions for real problems”

“Valuation”: What problems require attention?

– technology extrapolation

– automatic layout of phase-shifting masks

“Values”: How do we advance the leading edge?

– anatomy of FM-based hypergraph partitioning heuristics

– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”

"Barriers to Entry” for Researchers Code development barrier

– bare-bones self-contained partitioner: 800 lines

– not leading-edge (Dutt/Deng LIFO-FM)

– modern partitioner requires much more code

Expertise barrier

– very small details can have stunning impact

– must not only know what to do, but also what not to do

– impossible to estimate knowledge/expertise required to do research at leading edge

Need reference implementations!

– reference prose (6 pp. 9pt double-column) insufficient

“Barriers to Relevance” for Researchers

All heuristic engines/algorithms tuned to test cases

Test case usage must capture real use models, driving applications

– e.g., recall bipartitioning is driven by top-down placement

– until CKM99: no one considered effect of fixed vertices!!!

Test case usage can be fatally flawed by “details”

– hidden or previously unrealized

– previously believed insignificant

– results of algorithm research will be flawed as a result

Research in mature areas can stall
– incremental research: difficult and risky
– implementations not available → duplicated effort
– too much trust → which approach is really the best?
– some results may not be replicable
– ‘not novel’ is a common reason for paper rejection
– exploratory research: paradoxically, lower-risk
– novelty for the sake of novelty
– yet, novel approaches must be well-substantiated

Pitfalls: questionable value, roadblocks, obsolete contexts

Challenges for Applied Algorithmics

Difficult to be relevant (time-to-market, QOR issues)

– time to market: 5-7 year delay from publishing to first industrial use (cf. market lifetimes, tech extrapolation...)

– quality of results: unmeasurable, unpredictable, basically unknown

Good news: barriers to entry and barriers to relevance are self-inflicted, and possibly curable

– mature domains require mature R&D methodologies

– a possible solution: cultivate flexibility and reuse

– low cost “update” of previous work to support reuse

– future tool/algorithm development biased towards reuse

Analogy: Hardware Design :: Tool Design

Hardware design is difficult
– complex electrical engineering and optimization problems
– mistakes are costly
– verification and test not trivial
– few can afford to truly exploit the limits of technology
– A Winning Approach: Hardware IP reuse

CAD tool design is difficult
– complex software engineering and optimization problems
– mistakes can be showstoppers
– verification and test not trivial
– few can manage complexity of leading-edge approaches
– A “Surprising Idea”: CAD-IP reuse

What is CAD-IP?

Data models and benchmarks
– context descriptions and use models
– testcases and good solutions

Algorithms and algorithm analyses
– mathematical formulations
– comparison and evaluation methodologies for algorithms
– executables and source code of implementations
– leading-edge performance results

Traditional (paper-based) publications

Bookshelf: A Repository for CAD-IP

“Community memory” for CAD-IP

– data models

– algorithms

– implementations

Publication medium that enables efficient applied-algorithmics research

– benchmarks, performance results

– algorithm descriptions and analyses

– quality implementations (e.g., open-source Capo, MLPart)

Simplified comparisons to identify best approaches

Easier for industry to communicate new use models

Summary: Addressing Inefficiencies

Inefficiencies
– lack of openness and standards → huge duplication of effort
– incomparable reporting → improvement difficult
– lack of standard comparison/latest use models → best approach not clear
– industry doesn’t bother w/ feedback → outdated use models

Proposed solutions
– widely available, up-to-date, extensible benchmarks

– standardized performance reporting for leading-edge approaches

– available detailed descriptions of algorithms

– peer review of executables (and source code?)

– credit for quality implementations

Better research, faster adoption, more impact

http://vlsicad.cs.ucla.edu/GSRC/bookshelf/

Today’s Talk

“Demonstrably useful solutions for real problems”

“Valuation”: What problems require attention?

– technology extrapolation

– automatic layout of phase-shifting masks

“Values”: How do we advance the leading edge?

– anatomy of FM-based hypergraph partitioning heuristics

– culture change: restoring time-to-market and QOR in applied algorithmics via “IP reuse”

Thank you for your attention!!!

Spare Slides

Parameters

Description of technology, circuit and design attributes

Importance of consistent naming cannot be overstated

– Naming convention for parameters:
[&lt;preposition&gt;] _ &lt;principal&gt; _ {[qualifier] _ &lt;place&gt;} _ {&lt;qualifier&gt;} _ [&lt;adverbial&gt;] _ [&lt;index&gt;] _ [&lt;unit&gt;]

– Example: r_int_tot_lyr_pu_dl

– Benefits:
– Relatively easy to understand a parameter from its name
– Distinguishable (no two parameters should have the same name)
– r_int (interconnect resistance) = r_int (interconnect resistivity)?
– Unique (no two names for the same parameter)
– R_int = R_wire?
– Sortable (important literals come first)

– Software to automatically check parameter naming
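A small sketch of such a checker, simplified to the &lt;principal&gt;_&lt;qualifier&gt;... core of the convention (the token vocabularies below are illustrative assumptions, not GTX’s actual lists):

```python
import re

PRINCIPALS = {"r", "c", "l", "v", "i", "t", "n", "w", "f"}   # e.g., r = resistance
QUALIFIERS = {"int", "tot", "lyr", "pu", "dl", "min", "max", "eff", "gl", "wire"}

def check_name(name):
    """True iff name follows <principal>_<qualifier>... with known tokens."""
    tokens = name.split("_")
    if not tokens or tokens[0] not in PRINCIPALS:
        return False
    return all(re.fullmatch(r"[a-z0-9]+", t) and (t in QUALIFIERS or t.isdigit())
               for t in tokens[1:])

assert check_name("r_int_tot_lyr_pu_dl")    # the example from this slide
assert not check_name("R_int")              # uppercase principal rejected
```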

Rules

Methods to derive unknown parameters from known ones

ASCII rules
– Laws of physics, models of electrical behavior
– Statistical models (e.g., Rent's rule)
– Include closed-form expressions, vector operations, tables
– Storing of calibration data (e.g., “technology files”) for known process/design points in lookup tables

Constraints
– Simulated by rules that compute boolean values
– Used to limit range during “sweeping”

Optimization over a collection of rules
– Example: buffer insertion for minimal delay with area constraints

Rules (Cont.)

“External executable” rules
– Assume a callable executable (e.g., PERL script)
– Example: optimization of number and size of repeaters for global wires
– Use command-line interface and transfer through files
– Allow complex semantics of a rule
– Example: placers, IPEM executable (Cong, UCLA)

“Code” rules
– Implemented in C++ and linked into the inference engine
– Useful if execution speed is an issue

Engine

Contains no domain-specific knowledge

Evaluates rules in topological order

Performs studies (multiple evaluations for tradeoffs/sweeping, optimization)

[Figure: GTX architecture. Knowledge: Parameters (data), Rules (models), Rule chain (study). Implementation: Engine (derivation), GUI (presentation). Components are either user inputs or pre-packaged.]
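A minimal sketch of such an engine (Python stands in for GTX’s ASCII rule grammar; the parameter names and the Elmore-style delay rule are illustrative assumptions, not GTX’s calibrated models):

```python
from graphlib import TopologicalSorter

def evaluate(params, rules):
    """params: known parameter values; rules: output -> (inputs, function).
    Evaluates every derivable parameter in topological order."""
    deps = {out: set(ins) for out, (ins, _) in rules.items()}
    for name in TopologicalSorter(deps).static_order():
        if name in rules:
            ins, fn = rules[name]
            params[name] = fn(*(params[i] for i in ins))
    return params

rules = {
    "r_wire": (("r_pu", "l_wire"), lambda r, l: r * l),          # total R
    "c_wire": (("c_pu", "l_wire"), lambda c, l: c * l),          # total C
    "t_rc":   (("r_wire", "c_wire"), lambda r, c: 0.5 * r * c),  # Elmore-style delay
}
print(evaluate({"r_pu": 0.08, "c_pu": 0.2e-15, "l_wire": 1000.0}, rules))
```

Sweeping a parameter (e.g., l_wire) is then just repeated evaluation over a range, which is how the tradeoff studies above are produced.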

Knowledge Representation

Rules and parameters are specified separately from the derivation engine

Human-readable ASCII grammar

Benefits:
– Easy creation/sharing of parameters/rules by multiple users
– D. Sylvester and C. Cao: device and power, SOI modules that “drop in” to GTX
– P. K. Nag: yield modeling
– Extensible to models of arbitrary complexity (specialized prediction methods, technology data sets, optimization engines)
– Avant! Apollo or Cadence SE P&R tool: just another wirelength estimator
– Applies to any domain of work in semiconductors, VLSI CAD
– Transistor sizing, single wire optimizations, system-level wiring predictions, ...

Corking Effect in CLIP

CLIP begins by placing all moves into the 0-gain buckets
– CLIP chooses moves by cumulative delta gain (“updated gain”)
– initially, every move has cumulative delta gain = 0

Historical legacy (and for speed): FM partitioners typically look only at the first move in a bucket
– if it is illegal, skip the rest of the bucket (possibly skip all buckets for that partition)

If the move at the head of each bucket at the beginning of a CLIP pass is illegal, the pass terminates without making any moves
– even if the first move is legal, an illegal move soon afterward will “cork”

New test cases (IBM) have large cells
– large cells have large degree, and often large initial gain
– CLIP inventor couldn’t understand bad performance on IBM cases

Tuning to Uncork CLIP

Don’t place nodes with area > balance constraint in gain container at pass initialization
– actually useful for all FM variants
– zero CPU overhead

Look beyond the first move in a bucket
– extremely expensive
– hurts quality (partitioner doesn’t operate well near balance tolerance)
– not worth it, in our experience

Simply do a LIFO pass before starting CLIP
– spreads out nodes in gain buckets
– reduces likelihood that large node has largest total gain

Effect of Fixed Terminals

[Figures: Normalized cost for IBM01; runtime (sec) for IBM01]

Enabling Reuse: Free Composability

Conflict in Cell (Macro) Based Layouts

Consider connected components of conflict graphs within each cell master
– each component independently phase-assignable (2^k versions)
– each is a single “vertex” in coarse-grain conflict graph
– problem: assure free composability (reusability) of cell masters, such that no odd cycles can arise in coarse-grain conflict graph

[Figure: connected components in cell masters A and B; edge in coarse-grain conflict graph]

Case I: Creating CAD-IP of Questionable Value

Recent hypergraph partitioning papers report FM implementations 20x worse than leading-edge FM
– previous lack of openness caused wrong conclusions, wasted effort
– some “improvements” may only apply to weak implementations
– duplicated effort re-implementing (incorrectly?) well-known algorithms
– difficult to find the leading edge
– no standard comparison methodology
– how do you know if an implementation is poor?

To make the leading edge apparent and reproducible
– publish performance results on standard benchmarks
– peer review (executables, source code?)
– similar to common publication standards!

Tolerance  LIFO-FM              Ibm01  Ibm02  Ibm03  Ibm04  Ibm05  Ibm06
2%         Paper1   Min(100)      450    648   2459   3201   2397   1436
2%         Paper1   Ave          2701  12253  16944  20281   3420  16578
2%         Paper2   Min(100)      366    301   1588   1014   2640   1008
2%         Paper2   Ave           594    542   2688   1802   3382   1746

Case II: Roadblocks to Creating Needed CAD-IP

“Best approach” to global placement?
– recursive bisection (1970s)
– force-directed (1980s)
– simulated annealing (1980s)
– analytical (1990s)
– hybrids, others

Why is this question difficult?
– latest public placement benchmarks are from 1980s
– data formats are bulky (hard to mix and match components)
– no public implementations since early 1990s
– new ideas are not compared to old

To match approaches to new contexts
– agree on common up-to-date data model
– publish good format descriptions, benchmarks, performance results
– publish implementations

Case III: Developing CAD-IP for Obsolete Contexts

Global placement example
– much of academia studies variable-die placement
– row length and spacing not fixed
– explicit feedthroughs
– majority of industrial use is fixed-die
– pre-defined layout dimensions
– HPWL-driven vs. routability- or timing-driven
– runtimes are often not even reported
– this affects benchmarks and algorithms

Solution: perform sanity checks and request feedback
– explicitly define use model and QOR measures
– establish a repository for up-to-date formats, benchmarks, etc.
– peer review (executables, source code?)

Implicit Decision Effects: IBM02

ALGORITHM: IBM02 with unit areas and 10% balance

Updates   Bias     Flat LIFO         Flat CLIP        ML LIFO          ML CLIP
All gain  Away     402/1404 (30.8)   274/662 (50.7)   263/285 (62.2)   262/281 (65.7)
All gain  Part0    307/1468 (43.2)   263/513 (41.0)   262/288 (65.9)   262/278 (62.0)
All gain  Toward   283/585 (23.7)    263/446 (40.3)   262/291 (61.4)   262/281 (60.1)
Nonzero   Away     275/471 (18.9)    274/466 (35.5)   262/282 (60.5)   262/286 (51.0)
Nonzero   Part0    262/444 (18.4)    262/442 (35.0)   262/280 (58.3)   262/286 (57.7)
Nonzero   Toward   265/453 (17.0)    262/445 (32.1)   262/281 (56.7)   262/284 (55.8)

Reference Implementations

Documentation does not allow replication of results

– amazingly, true even for "classic" algorithms

– true for vendor R&D, true for academic R&D

Published reference implementations will raise quality

– minimum standard for algorithm implementation quality

– reduce barrier to entry for new R&D

Conclusions

Work with mature heuristics requires mature methodologies

Identified research methodology risks

Identified reporting methodology risks

Community needs to adopt standards for both

– reference “benchmark” implementations

– vigilant awareness of use-model and context

– reporting method that facilitates comparison

Application-Driven Research

Well-studied areas have complex, "tuned" metaheuristics

Risks of poor research methodologies

– irreproducible results or descriptions

– no enabling account of key insights underlying the contribution

– experimental evidence not useful to others

– inconsistent with driving use model

– missing comparisons with leading-edge approaches

– Let’s look at some requirements this induces...

The GSRC Bookshelf for CAD-IP

Bookshelf consists of slots

– slots represent active research areas with “enough customers”

– collectively, the slots cover the field

Who maintains slots?

– experts in each topic collaborate to produce them; anyone can submit

Currently, 10 active slots

– SAT (U. Michigan, Sakallah)

– Graph Coloring (UCLA, Potkonjak)

– Hypergraph Partitioning (UCLA, Kahng)

– Block Packing (UCSC, Dai)

– Placement (UCLA, Kahng)

– Global Routing (SUNY Binghamton, Madden)

– Single Interconnect Tree Synthesis (UIC, Lillis and UCLA, Cong)

– Commitments for more: BDDs, NLP, Test and Verification

What’s in a Slot?

Introduction
– why this area is important and recent progress
– pointers to other resources (links, publications)

Data formats used for benchmarks
– SAT, graph formats, etc.
– new XML-based formats

Benchmarks, solutions, performance results
– including experimental methodology (e.g., runtime-quality Pareto curve)

Binary utilities
– format converters, instance generators, solution evaluators, legality checkers
– optimizers and solvers
– executables

Implementation source code

Other info relevant to algorithm research and implementations
– detailed algorithm descriptions
– algorithm comparisons

Current Progress on the CAD-IP Bookshelf

[email protected]

– 33 members (17 developers)

Main policies and mechanisms published

10 active slots

– incl. executables, performance results for leading-edge partitioners, placers

First Bookshelf Workshop, Nov. 1999

– attendance: UCSC, UCB, NWU, UIC, SUNY Binghamton, UCLA

– agreed on abstract syntax and semantics for initial slots

– committed to XML for common data formats

– peer review of slot webpages

Ongoing research uses components in the Bookshelf