Large-Scale Static Timing Analysis
Akintayo Holder, Rensselaer Polytechnic Institute
2/18/2016
Contents
• Introduction
• Method
• Summary
Static Timing Analysis
• A gate (or other device) requires all input signals to be present at the same time.
• “Same time” defined by clock signal.
• STA ensures all devices have their expected inputs.
Static Timing Analysis
• Static timing analysis (STA) ensures that signals will propagate through a circuit.
• Checks that every gate will have valid inputs.
• Block oriented.
• Polynomial.
• The circuit dictates STA's behaviour.
STA: Arrival Time Calculation
• Late-mode static timing analysis was used in this study.
• Arrival time (AT) is the latest time a signal can arrive at a node.
• AT(i) := max(AT(j) + Delay(S(j,i))), over all incoming segments S(j,i).
• AT propagates forward from the inputs.
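The AT recurrence above amounts to a forward topological sweep over the timing graph. The sketch below is illustrative only, not the study's implementation; the function name, edge representation and the tiny example graph are invented for the example.

```python
# Late-mode arrival time (AT) recurrence:
# AT(i) = max over incoming segments S(j, i) of AT(j) + Delay(S(j, i)).
from collections import defaultdict, deque

def arrival_times(segments, primary_inputs):
    """segments: dict mapping (src, dst) -> delay; AT at primary inputs is 0."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(primary_inputs)
    for (j, i), d in segments.items():
        succ[j].append((i, d))
        indeg[i] += 1
        nodes.update((j, i))
    at = {n: 0.0 for n in primary_inputs}
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:  # process nodes in topological order
        j = ready.popleft()
        for i, d in succ[j]:
            at[i] = max(at.get(i, float("-inf")), at[j] + d)
            indeg[i] -= 1
            if indeg[i] == 0:
                ready.append(i)
    return at

segs = {("a", "c"): 2.0, ("b", "c"): 3.0, ("c", "d"): 1.0}
at = arrival_times(segs, ["a", "b"])
print(at)  # AT(c) = 3.0, AT(d) = 4.0
```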
Slack: how STA checks that the circuit is correct
• Required Arrival Time (RAT) – the latest time a signal may arrive without violating timing.
• RAT propagates backward from the output nodes.
• For a circuit to work, each node's AT must not exceed its RAT; the margin between them is the slack.
• Slack = RAT - AT
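The backward RAT pass is the mirror image of the AT pass: RAT(j) = min over outgoing segments S(j,i) of RAT(i) - Delay(S(j,i)), seeded at the outputs. A minimal sketch, again with invented names, assuming the output RATs come from external constraints:

```python
# Backward required-arrival-time (RAT) pass; Slack(n) = RAT(n) - AT(n),
# and nonnegative slack everywhere means timing is met.
from collections import defaultdict, deque

def required_times(segments, output_rats):
    """segments: dict (src, dst) -> delay; output_rats: RAT at output nodes."""
    pred = defaultdict(list)
    outdeg = defaultdict(int)
    nodes = set(output_rats)
    for (j, i), d in segments.items():
        pred[i].append((j, d))
        outdeg[j] += 1
        nodes.update((j, i))
    rat = dict(output_rats)
    ready = deque(n for n in nodes if outdeg[n] == 0)
    while ready:  # reverse topological order, from outputs toward inputs
        i = ready.popleft()
        for j, d in pred[i]:
            rat[j] = min(rat.get(j, float("inf")), rat[i] - d)
            outdeg[j] -= 1
            if outdeg[j] == 0:
                ready.append(j)
    return rat

segs = {("a", "c"): 2.0, ("b", "c"): 3.0, ("c", "d"): 1.0}
rat = required_times(segs, {"d": 5.0})
print(rat)  # RAT(a) = 2.0, RAT(b) = 1.0, RAT(c) = 4.0
```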
Behaviour of STA
• Block oriented STA is polynomial with respect to the size of the circuit.
• Running time depends on the circuit size, but more exacting tolerances require more exacting estimates.
• Multiple process corners, statistical STA and other techniques are common in industrial tools and increase running times.
Why large-scale static timing analysis is hard
• Performance depends on the circuit.
• The circuit must be divided/partitioned to minimize the cost of communication.
• Graph/circuit partitioning is hard.
What is the role of Decomposition?
• Divide the computation into smaller tasks
• Execute tasks in parallel.
• The smaller the tasks, the more important the decomposition.
• A case of Amdahl's Law.
What is meant by Regularity?
• Presence of a pattern or evidence of rules.
• Easy to understand and define.
• A regular pattern can be used to decompose work.
• Irregular: no obvious pattern or apparent rules.
• With no clear pattern, decomposition must choose among many “bad” choices.
Where do we get Structural Irregularity from?
• Structural Irregularity is derived from unstructured objects.
• Unstructured objects are defined by their neighbour relations.
• Neighbour relations are unique, complex and often irregular.
STA and Structural Irregularity
• STA depends on the shape of the circuit.
• The circuit is irregular, which causes STA to demonstrate structural irregularity.
• Irregularity makes decomposition hard and limits the ability to scale.
Why large-scale static timing analysis is hard
• STA exhibits structural irregularity.
• Irregularity hinders problem decomposition.
• Large-scale systems usually display traits that make efficient decomposition difficult.
• Good decomposition is needed for good performance.
Why balance will help performance
• Balance refers to the relative performance of the processors, the network and the memory interconnect.
• A balanced machine allows the processor to saturate either the network or the memory interconnect.
• Balance simplifies decomposition.
The Blue Gene has modest processors and high-performance networks
The IBM Blue Gene/L
• Supports MPI as the primary programming interface.
• Simple memory management:
  – 512M or 1G per node.
  – 1 or 2 processors per node.
  – No virtual memory.
• Subset of the Linux API.
• Adaptive routing for point-to-point traffic; independent networks for collectives.
• Optimized for bandwidth, scalability and efficiency.
More on Balance: the Cray XT5
• High-density blades with 8 quad-core Opteron processors.
• SeaStar2+, a low-latency, high-bandwidth interconnect.
• A hybrid supercomputer that uses FPGAs and co-processors.
• Better performance when adding a processor from a new node:
  – 45% slowdown when adding a processor on an existing node, compared with adding one from a new node.
• Network saturation not observed in point-to-point experiments.
• Global operations varied in run time.
Worley et al., 2009; Snell et al., 2007.
Contents
• Introduction
• Method
• Summary
Understanding the Input
• Static timing analysis processes a timing graph, which represents the circuit.
• Timing graph:
  – A DAG.
  – Runs from input pins to output pins.
• Large-scale STA sorts edges in the DAG by source depth.
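Sorting edges by source depth amounts to computing each node's level, the length of the longest path from a primary input, so that an edge can be processed once all shallower levels are done. A hedged sketch, with an invented example graph:

```python
# Levelization of a timing DAG: level(n) = longest path from any input to n.
from collections import defaultdict, deque

def levelize(edges):
    """edges: list of (src, dst) pairs; returns {node: level}."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for j, i in edges:
        succ[j].append(i)
        indeg[i] += 1
        nodes.update((j, i))
    level = {n: 0 for n in nodes if indeg[n] == 0}  # primary inputs at level 0
    ready = deque(level)
    while ready:
        j = ready.popleft()
        for i in succ[j]:
            level[i] = max(level.get(i, 0), level[j] + 1)
            indeg[i] -= 1
            if indeg[i] == 0:
                ready.append(i)
    return level

edges = [("in1", "g1"), ("in2", "g1"), ("g1", "g2"), ("in2", "g2")]
lvl = levelize(edges)
print(lvl)  # in1 and in2 at level 0, g1 at level 1, g2 at level 2
```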
An approach to Large-Scale STA
• Compute all nodes on a level, share the results, then move to the next level.
• Levelized Bulk Synchronization:
  – Only arrival time calculations.
  – Each processor loads the entire circuit.
  – Online partitioning using round robin.
Levelized Bulk Synchronization Algorithm
level = 0
Repeat
    Repeat
        Repeat
            if (local segment) calculate delay/slew at local processor
        Until (all incoming segments have been visited)
        Compute node arrival time and slew
    Until (all assigned nodes at the current level have been processed)
    Repeat
        if (gate segment) calculate delay/slew at sink processor
        else if (net segment) calculate delay/slew at source processor
        endif
    Until (all remote outgoing segments have been visited)
    Once all processors are complete, advance to next level
Until (all levels have been processed)
Our modest 3 million node benchmark circuit
• 3 million nodes.
• 4 million segments.
• Almost 1000 levels deep.
• A mean width of 3000 nodes but a median width of 32 nodes.
• Estimated theoretical speedup = 260.
• Speedup computed for 1024 processors.
• Levels with more than 1024 nodes truncated to 1024.
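The slides do not give the exact formula behind the 260 estimate, so the following is one plausible reading, stated as an assumption: on P processors a level of width w takes about max(1, w / P) parallel steps (per-level parallelism is truncated to P), and the estimated speedup is total work over parallel time. The level widths below are illustrative, not the benchmark's real profile.

```python
# Hypothetical reconstruction of the theoretical speedup estimate.
def estimated_speedup(level_widths, processors):
    total_work = sum(level_widths)  # one unit of work per node
    # Each level needs at least one step; wide levels need w / P steps.
    parallel_time = sum(max(1.0, w / processors) for w in level_widths)
    return total_work / parallel_time

widths = [5000, 3000, 32, 32, 1024]  # illustrative level-width profile
print(estimated_speedup(widths, 1024))
```

With this formula, many narrow levels (the median width here was 32) drag the speedup far below the processor count, which is consistent with the 260-on-1024 figure.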
Our modest 3 million node benchmark circuit
Timing? 260x on 1024 processors
But including partitioning, speedup is 119x.
Removing global dependencies to improve the timing algorithm
• Bulk synchronization forces all processors to wait at the end of a level.
• This is not required by STA.
• Solution: remove the global synchronization.
• With no global synchronization:
  – Compute x nodes.
  – Send y updates.
  – Continue until all nodes are done.
Removing Global Synchronization
Speedup without Partitioning Costs

Algorithm                        256 CPUs    1024 CPUs
Levelized Bulk Synchronization   120         263
No Global Synchronization        120         292
• 10% improvement with 1024 processors.
• Improved partitioning appears to have more potential for impact.
What about partitioning?
• Each processor loads the entire circuit.
• For each level:
  – For each node on the level:
    • Assign node n to task (n % num_cpus).
  – Build a list of local nodes and segments.
• Flexible with respect to the number of cores.
• Ignores the structure of the circuit.
• Limits the size of the circuit, since every processor must hold all of it.
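The round-robin assignment described above can be sketched in a few lines; the function name and node labels are illustrative.

```python
# Online round-robin partitioning: node n on a level goes to task n % num_cpus,
# so every processor can determine its share without any communication.
def local_nodes(level_nodes, rank, num_cpus):
    """Return the nodes of one level owned by processor `rank`."""
    return [node for n, node in enumerate(level_nodes) if n % num_cpus == rank]

level = ["n0", "n1", "n2", "n3", "n4"]
print(local_nodes(level, 1, 3))  # -> ['n1', 'n4']
```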
What about partitioning?
Future Work
• Examine how large-scale Static Timing Analysis performs on different circuits.
Why synthesize large circuits?
• Motivated by a lack of benchmark circuits and the relatively small size of the provided circuits.
• Would the algorithm scale to 10,000s of processors with larger circuits?
• Solution: generate billion-node timing graphs.
Synthesis of Large Circuits

By Concatenation:
• Make R copies of the circuit.
• Connect all copies to a new global sink.
• Creates a timing graph that captures the internal intricacies of the original.
• Does not capture shape.
• May be disjoint.

By Scaling:
• Create histograms of pins by level and of segments by level.
• Multiply levels, pins and segments by R.
• Creates a timing graph with the same shape (histogram) but more pins.
• Not a rational circuit.
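Synthesis by concatenation can be sketched as renaming R copies of the graph and wiring each copy's outputs to one new global sink. The edge representation, naming scheme and sink label are assumptions made for illustration.

```python
# Concatenation: R renamed copies of the timing graph, all feeding a new sink.
def concatenate(edges, outputs, R, sink="global_sink"):
    """edges: list of (src, dst); outputs: output nodes of the original graph."""
    big = []
    for r in range(R):
        tag = lambda n: f"{n}@{r}"  # rename nodes so copies stay distinct
        big += [(tag(j), tag(i)) for j, i in edges]
        big += [(tag(o), sink) for o in outputs]  # join copies at the sink
    return big

copies = concatenate([("a", "b")], ["b"], R=2)
print(copies)  # [('a@0', 'b@0'), ('b@0', 'global_sink'), ('a@1', 'b@1'), ('b@1', 'global_sink')]
```

Note that, exactly as the slide says, the copies are disjoint except at the added sink, so the result does not reproduce the original circuit's overall shape.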
By Scaling: 23.6x speedup from 2^9 to 2^14
By Concatenation: 12.6x speedup from 2^10 to 2^14
Large circuits and parallel I/O
• The file is divided into m parts.
• Each processor loads one or more parts.
• Supports n = m/k processors with one file, where m >= k > 0.
• Scales because each processor reads fewer parts as the processor count increases.
Parallel I/O: 12.2x speedup from 10^9 to 10^14
Contents
• Introduction
• Method
• Summary
Summary
• Scaling appears related to the size of the circuit: static timing analysis scales further with larger circuits.
• We do not know how shape and structure affect the performance of large-scale static timing analysis.
• Prototype for a Large-Scale Static Timing Analyzer running on an IBM Blue Gene, Holder, A., Carothers, C. D., and Kalafala, K., 2010, International Workshop on Parallel and Distributed Scientific and Engineering Computing.
• The Impact of Irregularity on Efficient Large-Scale Integer-Intensive Computing, Holder, A., PhD Proposal
Related Work
• Hathaway, D.J. and Donath, W.E., 2001. Distributed Static Timing Analysis.
• Hathaway, D.J. and Donath, W.E., 2003. Distributed Static Timing Analysis.
• Gove, D.J., Mains, R.E. and Chen, G.J., 2009. Multithreaded Static Timing Analysis.
• Gulati, K. and Khatri, S.P., 2009. Accelerating statistical static timing analysis using graphics processing units.
Thanks for your time and attention