PAPA2011, University of Michigan
Parallelization by SimPLification: A Case Study in VLSI Placement
Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan
Complexities of Parallel Algorithms & SW
1. Objectives of parallelization
A. Improve completion time by using multiple cores in ||
B. Improve throughput by using stream processing (latency may increase and become less predictable)
C. Improve power consumption (by decreasing clk rate)
2. Not an objective (a pitfall)
− Come up with a slow algorithm that is easy to parallelize
■In this talk: how to accomplish 1.A without 2
− Take a leading algorithm and speed up its bottlenecks
− Design a new algorithm that is (a) better, (b) easy to parallelize
CAD Algorithms
■Sequence of optimizations
− Subject to Amdahl’s law
− The more stages, the harder to parallelize effectively
■Additional complications
− Elaborate data structures may entail overhead for parallel access
− When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
■Recommendations
− A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
− Using standard solvers, e.g., linear algebra, helps reuse previous work on parallelization
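The Amdahl's-law point above can be made concrete with a small calculation. This is a minimal sketch; the function names `amdahl_speedup` and `staged_speedup` are illustrative and do not come from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on overall speed-up when only `parallel_fraction`
    of the runtime benefits from `n_threads` threads (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


def staged_speedup(stage_fractions, stage_speedups):
    """Overall speed-up of a multi-stage flow, given each stage's share
    of the original runtime and the speed-up achieved for that stage."""
    if abs(sum(stage_fractions) - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    new_time = sum(f / s for f, s in zip(stage_fractions, stage_speedups))
    return 1.0 / new_time
```

For instance, speeding up a stage that takes 31% of the runtime by a factor of 4 yields an overall speed-up of only about 1.3, which is why every significant stage of a multi-stage flow has to be addressed.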
Global Placement: Motivation
■Interconnect lagging in performance while transistors continue scaling
− Circuit delay, power dissipation and area dominated by interconnect
− Routing quality highly controlled by placement
■Circuit size and complexity rapidly increasing
− Scalable placement algorithm is critical
− Simplicity, integration with other optimizations
[Figure: interconnect effects, labeled Unloaded, Coupling, IR drop, RC delay]
Goals in Placement
■Find good relative ordering of cells
− Minimize wire length and congestion
− Maximize timing slack
■Find good spacing of cells
− Eliminate wiring congestion problems
− Provide space for post-placement stages
–clock trees
–buffer insertion
–timing correction
■Find good global position
[Figures on the following six slides: cells A, B, C]
Optimize Relative Order
To spread ...
... or not to spread
Place to the left
... or to the right
Optimize Relative Order
Without whitespace, placement is dominated by ordering
Example of Global Placement (APlace 2.04 from UCSD)
Example of Global Placement (mFAR from UCSB)
Placement Formulation
■Objective: Minimize estimated wirelength
− Half-perimeter wirelength (HPWL)
− (max X – min X) + (max Y – min Y)
■Subject to constraints:
− Legality: Row-based placement with no overlaps
− Routability: Limiting local interconnect congestion for successful routing
− Timing: Meeting performance target of a design
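The HPWL objective above can be sketched in a few lines. The data layout (nets as lists of cell names, positions as a dictionary) and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


def net_hpwl(pins: Iterable[Point]) -> float:
    """Half-perimeter wirelength of one net:
    (max X - min X) + (max Y - min Y) over the net's pins."""
    pins = list(pins)
    if len(pins) < 2:
        return 0.0
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: List[List[str]], positions: Dict[str, Point]) -> float:
    """Sum of per-net HPWL; each net is a list of cell names."""
    return sum(net_hpwl(positions[cell] for cell in net) for net in nets)
```

Pin offsets within cells are ignored here, so every pin is assumed to sit at its cell's position.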
Quadratic Placement
■Consider a graph first, not a hypergraph
■Minimize $\sum_{e_{ij}} \left[(x_i-x_j)^2+(y_i-y_j)^2\right]$ (the sum is over edges $e_{ij}$)
− Seems unrelated to $\sum_{e_{ij}} \left(|x_i-x_j|+|y_i-y_j|\right)$ but can still be separated into x- and y-components
■Physical analogy: Hooke’s law
− Consider an elastic spring, stretched by x
− Force $F=-kx$ (k is the spring constant)
− Energy $E=\tfrac{1}{2}kx^2$
− Our goal: minimize the energy of the system
A system of springs will only settle in a minimum
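Because the quadratic objective separates into x- and y-components, each coordinate can be found by solving one linear system. The following is a minimal one-dimensional sketch under stated assumptions: the input format, the function names and the dense Gaussian elimination are illustrative choices (a real placer would use a sparse iterative solver), and every connected group of cells is assumed to be tied to at least one fixed pin.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: some cells have no path to a fixed pin")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def quadratic_place_1d(n_movable, edges, fixed):
    """Minimize sum w*(x_i - x_j)^2 over `edges` (i, j, w) between movable
    cells plus sum w*(x_i - pos)^2 over `fixed` (i, pos, w) connections.
    Setting the gradient to zero gives the linear system A x = b."""
    A = [[0.0] * n_movable for _ in range(n_movable)]
    b = [0.0] * n_movable
    for i, j, w in edges:
        A[i][i] += w
        A[j][j] += w
        A[i][j] -= w
        A[j][i] -= w
    for i, pos, w in fixed:
        A[i][i] += w
        b[i] += w * pos
    return solve_dense(A, b)
```

Two cells chained between fixed pins at 0 and 3 with unit weights settle at 1 and 2, the evenly spaced positions that a chain of identical springs would take.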
Iterative Optimization
Prior Work
■ Ideal Placer
− Low runtime without sacrificing solution quality
− Simplicity, integration with other optimizations
[Figure: speed vs. solution quality. Quadratic and force-directed placers: mFAR, Kraftwerk2, FastPlace3. Non-convex optimization placers: mPL6, APlace2, NTUPlace3. The ideal placer is marked separately.]
Key features of SimPL
■Flat quadratic placement
■Primal-dual optimization
− Closing the gap between upper and lower bounds
[Figure: wirelength vs. iteration. Curves: Upper-Bound Solution by Look-ahead Legalization and Lower-Bound Solution by Linear System Solver. Marked points: Initial WL Opt., Final Solution, Final Legal Solution.]
Common Analytical Placement Flow
[Flowchart: Placement Instance → Initial WL Optimization → Global Placement → Converge? (no: back to Global Placement; yes: Legalization and Detailed Placement)]
SimPL Flow
[Flowchart: Placement Instance → Initial WL Optimization (B2B Graph Building, Linear System Solver, repeated until WL converges) → Global Placement (Look-ahead Legalization (Upper-Bound), Pseudonet Insertion, B2B Graph Building, Linear System Solver (Lower-Bound), repeated until convergence) → Legalization and Detailed Placement]
B2B net model [P. Spindler, et al., “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]
We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al., “An Efficient and Effective Detailed Placement Algorithm,” ICCAD 2005]
SimPL: Look-ahead Legalization
■Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound)
■Identify target region
− Find overflow bin b
− Create a minimal wide enough bin cluster B around b
■Perform geometric top-down partitioning
− Find cell area median (Cc) and whitespace median (CB)
− Assign cells (Cc) to corresponding partitions (CB)
■Non-linear scaling
− Form stripe regions
− Move cells across stripe regions in-order based on whitespace
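The partitioning step above can be illustrated in one dimension. This is only a sketch under strong assumptions: the region has no obstacles (so the whitespace median is simply the midpoint of the region), recursion continues until each region holds a single cell, and the non-linear scaling step is not shown. The function names are hypothetical.

```python
def cell_area_median_index(widths):
    """Index k splitting an x-ordered list of cell widths into widths[:k]
    and widths[k:], where widths[:k] is the longest prefix whose total
    width does not exceed half of the total (both sides kept non-empty)."""
    total = sum(widths)
    acc = 0.0
    k = 0
    for k, w in enumerate(widths):
        if acc + w > total / 2:
            break
        acc += w
    return max(1, min(k, len(widths) - 1))


def lookahead_spread_1d(cells, lo, hi):
    """cells: list of (name, x, width). Returns {name: new_center} after
    recursive bisection of the interval [lo, hi], preserving the x-order."""
    ordered = sorted(cells, key=lambda c: c[1])
    placed = {}
    stack = [(ordered, lo, hi)]
    while stack:
        group, left, right = stack.pop()
        if not group:
            continue
        if len(group) == 1:
            placed[group[0][0]] = (left + right) / 2
            continue
        k = cell_area_median_index([w for _, _, w in group])
        cut = (left + right) / 2  # whitespace median of an obstacle-free region
        stack.append((group[:k], left, cut))   # cells left of the area median
        stack.append((group[k:], cut, right))  # cells right of the area median
    return placed
```

Four unit-width cells clumped near x = 0 are spread across [0, 4] in their original left-to-right order, which mirrors how the upper-bound placement keeps the ordering produced by the linear system solver.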
SimPL: Look-ahead Legalization (1)
Performing geometric top-down partitioning
[Figure: bin cluster (B) around an overfilled bin, split into B0 and B1, with the cell-area median (Cc) and the whitespace median (CB)]
SimPL: Look-ahead Legalization (2)
[Figure: sub-cluster B0 with its cell-area median (Cc) and whitespace median (CB)]
SimPL: Look-ahead Legalization (2)
[Figure: non-linear scaling around CB, showing obstacle borders, uniform cutlines, cell ordering and per-stripe linear scaling]
SimPL: Look-ahead Legalization (3)
■Example (adaptec1)
Look-ahead legalization stops when target regions become small enough
SimPL: Using legal locations as anchors
■Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap
■Anchors and Pseudonets
− Look-ahead locations used as fixed, zero-area anchors
− Anchors and original cells connected with 2-pin pseudonets
− Pseudonet weights grow linearly with iterations
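In terms of the linear system, a 2-pin pseudonet to a fixed anchor only touches the diagonal and the right-hand side. The sketch below assumes a dense matrix `A` and vector `b` for one coordinate; the constant `step` and the function names are hypothetical, and any distance-based normalization of pseudonet weights is left out.

```python
def pseudonet_weight(iteration, step=0.01):
    """Weight shared by all pseudonets at a given iteration; it grows
    linearly. `step` is a hypothetical constant, not from the slides."""
    return step * iteration


def add_pseudonets(A, b, anchors, weight):
    """Return copies of (A, b) with one 2-pin pseudonet per cell.
    A pseudonet of weight w between cell i and fixed anchor a_i adds
    w * (x_i - a_i)^2 to the objective: w on A[i][i], w * a_i on b[i]."""
    A2 = [row[:] for row in A]
    b2 = b[:]
    for i, a in enumerate(anchors):
        A2[i][i] += weight
        b2[i] += weight * a
    return A2, b2
```

For a single cell tied to a pin at 0 with weight 1 and anchored at 10, the solution is 10w / (1 + w): 5 for w = 1 and 7.5 for w = 3. As the weight grows, the lower-bound solution is pulled toward the look-ahead location.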
Next illustration: Tug-of-war between low-wirelength and legalized placements
SimPL Iterations on Adaptec1 (1)
[Figure panels: Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound)]
SimPL Iterations on Adaptec1 (2)
[Figure panels: Iteration=10 (Lower Bound), Iteration=11 (Upper Bound), Iteration=20 (Lower Bound), Iteration=21 (Upper Bound)]
SimPL Iterations on Adaptec1 (3)
[Figure panels: Iteration=30 (Lower Bound), Iteration=31 (Upper Bound), Iteration=40 (Lower Bound), Iteration=41 (Upper Bound)]
Convergence of SimPL
■ Legal solution is formed between two bounds
Empirical Results: ISPD05 Benchmarks
■Experimental setup
− Single-threaded runs on a 3.2GHz Intel Core i7 Quad CPU Q660 Linux workstation
− HPWL is computed by GSRC Bookshelf Evaluator
■< 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)
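For readers unfamiliar with the solver mentioned above, here is a minimal sketch of conjugate gradient with a Jacobi (diagonal) preconditioner. It is written in Python for brevity, whereas the talk's implementation is in C++; the function signature, the tolerance and the iteration cap are assumptions, and the matrix is supplied as a `matvec` callback so that any sparse format can be plugged in.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(matvec, diag, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.
    matvec(p) must return A p; diag is the diagonal of A, used as the
    Jacobi preconditioner M = diag(A), applied as z = r / diag."""
    x = [0.0] * len(b)
    r = b[:]                                   # residual b - A x for x = 0
    z = [ri / di for ri, di in zip(r, diag)]   # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

The only operation that touches the matrix is `matvec`, which is why the sparse matrix-vector product is the natural target for parallelization.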
Speeding Up Placement Using Parallelism
■SimPL has very few components (5KLOC)
■Each bottleneck is amenable to some form of ||-ism
− Thread-level
− Instruction-level
[Pie chart: runtime profile]
Initial placement: 8%
CG solver: 31%
Sparse matrix and B2B net modeling: 8%
Look-ahead legalization: 14%
Pseudo-net insertion: 1%
Post Global Placement: 38%
IO: 0%
Parallelism in Conjugate Gradient Solver
■Coarse-grain row partitioning
− Implemented using OpenMP 3.0 compiler directives
■SSE2 (Streaming SIMD Extensions) instructions
− Process four data elements with a single instruction
− Marginal runtime improvement in SpMxV
■Reducing memory bandwidth demand of SpMxV
− CSR (Compressed Sparse Row) format
Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003
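The combination of CSR storage and coarse-grain row partitioning can be sketched as follows. The talk uses OpenMP in C++; this Python version with `concurrent.futures.ThreadPoolExecutor` only illustrates how rows are divided among threads, and all function names are hypothetical. In CPython with the global interpreter lock, these threads will not produce a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def dense_to_csr(A):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_rows(csr, x, row_begin, row_end):
    """Compute rows [row_begin, row_end) of the product A x."""
    values, col_idx, row_ptr = csr
    out = []
    for i in range(row_begin, row_end):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        out.append(s)
    return out


def parallel_spmv(csr, x, n_threads=4):
    """Split the rows into contiguous chunks, one chunk per task."""
    n_rows = len(csr[2]) - 1
    chunk = max(1, -(-n_rows // n_threads))  # ceiling division
    bounds = [(lo, min(lo + chunk, n_rows)) for lo in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda be: spmv_rows(csr, x, be[0], be[1]), bounds)
        return [v for part in parts for v in part]
```

Each task writes a disjoint slice of the result, so no locking is needed; the vector `x` is only read.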
Parallelism in CG Solver - Example
Parallelism in B2B Model Update
■B2B net model update
− B2B model is separable
− Can process the x and y cases in parallel
− Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
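To show why the model is separable and easy to split by nets, here is a sketch of the bound-to-bound (B2B) connections of one net along one axis. It assumes the weight 1/((p − 1)·|x_i − x_j|) for a p-pin net, with the distance clamped by a hypothetical `min_dist` to avoid division by zero; the function names are illustrative.

```python
def b2b_edges(coords, min_dist=1e-6):
    """2-pin connections (i, j, w) of the B2B model for one net on one
    axis: the two boundary pins are connected to each other and to every
    inner pin, with weight 1 / ((p - 1) * |x_i - x_j|)."""
    p = len(coords)
    if p < 2:
        return []
    lo = min(range(p), key=lambda i: coords[i])
    hi = max(range(p), key=lambda i: coords[i])
    if lo == hi:  # all pins coincide: pick any two as boundary pins
        lo, hi = 0, 1

    def weight(i, j):
        return 1.0 / ((p - 1) * max(abs(coords[i] - coords[j]), min_dist))

    edges = [(lo, hi, weight(lo, hi))]
    for k in range(p):
        if k not in (lo, hi):
            edges.append((k, lo, weight(k, lo)))
            edges.append((k, hi, weight(k, hi)))
    return edges


def quadratic_cost(coords, edges):
    """Sum of w * (x_i - x_j)^2 over the given connections."""
    return sum(w * (coords[i] - coords[j]) ** 2 for i, j, w in edges)


def split_nets(nets, n_groups):
    """Round-robin split of the netlist into groups of near-equal size."""
    return [nets[g::n_groups] for g in range(n_groups)]
```

With these weights the quadratic cost of a net equals its extent along that axis at the current placement. Since `b2b_edges` reads only one coordinate of one net, the x and y passes and the net groups can be processed independently.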
SSE optimization affects Runtime Profile
[Pie charts: runtime profile before → after SSE optimization]
Initial placement: 8% → 5%
CG solver: 31% → 19%
Sparse matrix and B2B net modeling: 8% → 10%
Look-ahead legalization: 14% → 18%
Pseudo-net insertion: 1% → 1%
Post Global Placement: 38% → 46%
IO: 0% → 1%
Parallelism in Look-ahead Legalization (1)
■Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
■Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
− Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel
− After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells
Parallelism in Look-ahead Legalization (2)
■LAL keeps the global queue of bin clusters Q
■Static partitioning
− Assign initial bin clusters to available threads such that each thread has a similar number of bin clusters to start
■Subtask updates
− Thread ti processes one of two sub-clusters (for the next level of T&N); the other is added to the global cluster queue Q
■Dynamic task scheduling
− When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved N = max(Q.size()/N_threads, 1)
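The scheduling scheme above can be sketched with a shared queue and a condition variable. This is an illustrative reading of the slide, not the authors' code: `split` and `done` are hypothetical callbacks standing in for one level of T&N and for the stopping test, the division in N is assumed to be integer division, and the `pending` counter used for termination detection is an added assumption.

```python
import threading
from collections import deque


def clusters_to_retrieve(queue_size, n_threads):
    """N = max(Q.size() / N_threads, 1), using integer division."""
    return max(queue_size // n_threads, 1)


def run_lal(initial, n_threads, split, done):
    """Process clusters until all are done. split(c) -> (a, b) models one
    level of T&N; done(c) says whether a cluster is small enough."""
    queue = deque()               # global cluster queue Q
    cond = threading.Condition()  # protects queue, pending, finished
    pending = len(initial)        # clusters created but not yet finished
    finished = []
    # Static partitioning: similar number of initial clusters per thread.
    local = [list(initial[t::n_threads]) for t in range(n_threads)]

    def worker(mine):
        nonlocal pending
        while True:
            if not mine:  # idle: fetch from Q or terminate
                with cond:
                    while not queue and pending > 0:
                        cond.wait()
                    if not queue:  # nothing queued and nothing pending
                        return
                    n = clusters_to_retrieve(len(queue), n_threads)
                    mine = [queue.popleft() for _ in range(n)]
            cluster = mine.pop()
            if done(cluster):
                with cond:
                    finished.append(cluster)
                    pending -= 1
                    if pending == 0:
                        cond.notify_all()  # wake idle threads so they exit
            else:
                keep, give = split(cluster)
                mine.append(keep)  # this thread continues with one half
                with cond:
                    queue.append(give)  # the other half goes to Q
                    pending += 1
                    cond.notify()

    threads = [threading.Thread(target=worker, args=(local[t],))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

Retrieving several clusters at once when Q is long reduces contention on the shared queue, while the lower bound of one keeps idle threads busy when Q is short.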
Empirical Results – Overall Speed-ups
■Experimental setup
− Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16GByte RAM
− Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache
Empirical Results – Component Speed-ups
Extending the Routability-driven Placement
■Ongoing work: simultaneous place-and-route
Simultaneous Place-and-Route
■After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR)
− Integrate an in-house router through clean API
− Cell locations in, accurate congestion maps out
− The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
■ISPD 2011 contest organized by IBM Research
− New, large benchmarks
− Placements evaluated by a common global router
SimPL SimPLR
■Key metric is #overflows (OF)
■Also shown: routed WL (RtWL)
Conclusions
■ New flat quadratic placement algorithm: SimPL
− Novel primal-dual based approach
− Amenable to integration with physical synthesis
■ Self-contained, compact implementation
− Fastest among available academic placers
− Highly competitive solution quality
− Amenable to parallelism
− Easy to extend to simultaneous place-and-route
Questions and Answers
Thank you! Time for Questions