Large-Scale Static Timing Analysis
Akintayo Holder, Rensselaer Polytechnic Institute
2/18/2016



Contents

• Introduction
• Method
• Summary

Static Timing Analysis

• A gate (or other device) requires all input signals to be present at the same time.

• “Same time” defined by clock signal.

• STA ensures all devices have their expected inputs.


Static Timing Analysis

• Static timing analysis (STA) ensures that signals will propagate through a circuit.
• Checks that every gate will have valid inputs.
• Block oriented.
• Polynomial.
• Circuit dictates STA's behaviour.

STA: Arrival Time Calculation

• Late-mode static timing analysis was used in this study.

• Arrival time (AT) is the latest time a signal can arrive.

• AT(i) := max over all incoming segments S(j,i) of AT(j) + Delay(S(j,i)).

• AT propagates forward, from inputs to outputs.

Slack: how STA checks that the circuit is correct

• Required Arrival Time (RAT): the time by which a signal must arrive, its late-mode deadline.

• RAT propagates backward from output nodes.

• For a circuit to work, the AT of each node must not exceed its RAT; the margin between them is the slack.

• Slack = RAT - AT
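The two passes can be sketched on a toy timing graph (the node names, delays and the output deadline below are made up for illustration, not from the study):

```python
# Sketch of late-mode AT/RAT/slack on a toy timing graph.
# Edges S(j, i) are stored as (source, sink, delay) triples.
edges = [
    ("in", "g1", 2.0),
    ("in", "g2", 3.0),
    ("g1", "out", 4.0),
    ("g2", "out", 1.0),
]

# Forward pass: AT(i) = max over fan-in of AT(j) + Delay(S(j, i)).
at = {"in": 0.0}
for j, i, d in edges:  # edges assumed already in topological order
    at[i] = max(at.get(i, float("-inf")), at[j] + d)

# Backward pass: RAT(j) = min over fan-out of RAT(i) - Delay(S(j, i)).
rat = {"out": 7.0}  # required arrival time asserted at the output
for j, i, d in reversed(edges):
    rat[j] = min(rat.get(j, float("inf")), rat[i] - d)

# Slack = RAT - AT; negative slack would flag a timing violation.
slack = {n: rat[n] - at[n] for n in at}
print(slack)  # the in -> g1 -> out path is critical here
```

The positive slack at every node confirms overlap between arrival and required times, which is exactly the check the slide describes.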

Behaviour of STA

• Block oriented STA is polynomial with respect to the size of the circuit.
• Running time depends on the circuit size, but more exacting tolerances require more exacting estimates.
• Multiple process corners, statistical STA and other techniques are common in industrial tools and increase running times.

Why large-scale static timing analysis is hard

• Performance depends on the circuit.
• Circuit must be divided/partitioned to minimize the cost of communication.
• Graph/circuit partitioning is hard.

What is the role of Decomposition?

• Divide the computation into smaller tasks.
• Execute tasks in parallel.
• The smaller the tasks, the more important the decomposition.
• A case of Amdahl's Law.
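Amdahl's Law quantifies why decomposition matters: the serial fraction bounds the speedup no matter how many processors are added. A quick sketch (the 99% parallel fraction and processor count are illustrative):

```python
# Amdahl's Law: with parallel fraction p on n processors,
# speedup = 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 99% of the work parallelized, 1024 processors give roughly
# 91x, not 1024x -- the remaining serial fraction dominates.
print(amdahl_speedup(0.99, 1024))
```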

What is meant by Regularity?

• Presence of a pattern or evidence of rules.
• Easy to understand and define.
• A regular pattern can be used to decompose work.
• Irregular: no obvious pattern or apparent rules.
• With no clear pattern, decomposition must choose among many "bad" choices.

Where do we get Structural Irregularity from?

• Structural Irregularity is derived from unstructured objects.

• Unstructured objects are defined by their neighbour relations.

• Neighbour relations are unique, complex and often irregular.

STA and Structural Irregularity

• STA depends on the shape of the circuit.
• The circuit is irregular, which causes STA to exhibit structural irregularity.
• Irregularity makes decomposition hard and limits the ability to scale.

Why large-scale static timing analysis is hard

• STA exhibits structural irregularity.
• Irregularity hinders problem decomposition.
• Large-scale systems usually display traits that make efficient decomposition difficult.
• Good decomposition is needed for good performance.

Why balance will help performance

• Balance refers to the relative performance of the processors, the network and the memory interconnect.

• In a balanced machine, the processors can saturate either the network or the memory interconnect.

• Balance simplifies decomposition.

The Blue Gene has modest processors and high performance networks


The IBM Blue Gene/L

• Supports MPI as the primary programming interface.
• Simple memory management:
  – 512M or 1G per node.
  – 1 or 2 processors per node.
  – No virtual memory.
• Subset of the Linux API.
• Adaptive routing for point to point, independent networks for collectives.
• Optimized for bandwidth, scalability and efficiency.

More on Balance: Cray XT5

• High-density blades with 8 Opteron quad core processors.
• SeaStar2+, a low latency, high bandwidth interconnect.
• Hybrid supercomputer, uses FPGAs and co-processors.
• Better performance when adding a processor from a new node:
  – 45% slowdown when adding a processor on an existing node, compared with adding a processor from a new node.
• Network saturation not observed with point to point experiments.
• Global operations varied in run time.

Worley et al., 2009. Snell et al., 2007.

Contents

• Introduction
• Method
• Summary

Understanding the Input

• Static timing analysis processes a timing graph, which represents the circuit.

• Timing graph:
  – A DAG.
  – From input to output pins.

• Large-scale STA sorts edges in the DAG by source depth.
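Sorting edges by source depth amounts to a longest-path levelization of the DAG. A minimal sketch (the four-node graph is illustrative, not the benchmark circuit):

```python
# Sketch: levelize a timing DAG by source depth, i.e. the length of the
# longest path from a primary input to each node.
from collections import defaultdict, deque

fanout = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

indeg = defaultdict(int)
for n, outs in fanout.items():
    for m in outs:
        indeg[m] += 1

depth = {n: 0 for n in fanout if indeg[n] == 0}  # primary inputs
queue = deque(depth)
while queue:  # Kahn's algorithm, tracking longest-path depth
    n = queue.popleft()
    for m in fanout[n]:
        depth[m] = max(depth.get(m, 0), depth[n] + 1)
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)

# Group edges by the depth of their source node.
levels = defaultdict(list)
for n, outs in fanout.items():
    for m in outs:
        levels[depth[n]].append((n, m))
print(dict(levels))
```

Processing level 0 before level 1 guarantees that every arrival time a node needs has already been computed, which is what the levelized approach on the next slides relies on.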

An approach to Large-Scale STA

• Compute all nodes on a level, share the results, move to the next level.

• Levelized Bulk Synchronization:
  – Only arrival time calculations.
  – Each processor loads the entire circuit.
  – Online partitioning using round robin.

Levelized Bulk Synchronization Algorithm

level = 0
Repeat
    Repeat
        Repeat
            if (local segment) calculate delay/slew at local processor
        Until (all incoming segments have been visited)
        Compute node arrival time and slew
    Until (all assigned nodes at the current level have been processed)
    Repeat
        if (gate segment) calculate delay/slew at sink processor
        else if (net segment) calculate delay/slew at source processor
        endif
    Until (all remote outgoing segments have been visited)
    Once all processors are complete, advance to the next level
Until (all levels have been processed)

Our modest 3 million node benchmark circuit

• 3 million nodes.
• 4 million segments.
• Almost 1000 levels deep.
• A mean width of 3000 nodes but a median width of 32 nodes.
• Estimated theoretical speedup = 260.
• Speedup computed for 1024 processors.
• Levels with more than 1024 nodes truncated to 1024.
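One plausible reconstruction of that estimate (an assumption, not necessarily the authors' exact method) divides the total serial work by the parallel work when each level is spread over P processors, which effectively truncates wide levels to P:

```python
# Assumed reconstruction of the theoretical speedup bound: a level of
# width w takes ceil(w / P) parallel steps on P processors.
import math

def estimated_speedup(level_widths, procs):
    serial = sum(level_widths)                              # one node per step
    parallel = sum(math.ceil(w / procs) for w in level_widths)
    return serial / parallel

# Illustrative skewed distribution: a few wide levels, many narrow ones
# (mirroring the benchmark's mean width of 3000 vs. median of 32).
widths = [2000] * 100 + [32] * 900
print(estimated_speedup(widths, 1024))
```

A skewed width distribution like this is why the bound (260 for the benchmark) falls so far below the processor count: the many narrow levels each still cost a full step.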

Our modest 3 million node benchmark circuit


Timing? 260x on 1024 processors

But including partitioning, speedup is 119x.

Removing global dependencies to improve timing algorithm

• Bulk synch forces all processors to wait at the end of a level.

• This is not required by STA.

• Solution: remove the global synch.

• No Global Synch:
  – Compute x nodes.
  – Send y updates.
  – Continue until all nodes are done.

Removing Global Synchronization

Speedup without Partitioning Costs

  Algorithm                        256 CPUs   1024 CPUs
  Levelized Bulk Synchronization        120         263
  No Global Synchronization             120         292

• 10% improvement with 1024 processors.

• Improved partitioning appears to have more potential for impact.

What about partitioning?

• Each processor loads the entire circuit.

• For each level, for each node on the level:
  – Assign node n to task (n % num_cpus).

• Build list of local nodes and segments.

• Flexible with respect to the number of cores.

• Ignores structure of the circuit.

• Limits the size of the circuit.
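The online round-robin assignment above can be sketched as (node ids and level contents are illustrative):

```python
# Sketch of online round-robin partitioning: every processor scans the
# whole circuit and keeps node n when n % num_cpus == rank.
def local_nodes(levels, num_cpus, rank):
    mine = []
    for level in levels:
        for n in level:
            if n % num_cpus == rank:
                mine.append(n)
    return mine

levels = [[0, 1, 2, 3], [4, 5], [6, 7, 8]]
print(local_nodes(levels, 4, 0))  # -> [0, 4, 8]
```

The scheme works for any core count, but as the slide notes it ignores the circuit's structure, so tightly coupled neighbours often land on different processors.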

What about partitioning?


Future Work

• Examine how large-scale Static Timing Analysis performs on different circuits.


Why synthesize large circuits?

• Motivated by a lack of benchmark circuits and the relatively small size of provided circuits.

• Would the algorithm scale to 10,000s of processors with larger circuits?

• Solution: generate billion node timing graphs.

Synthesis of Large circuits

By Concatenation:
• Make R copies of the circuit.
• Connect all copies to a new global sink.
• Creates a timing graph that captures the internal intricacies of the original.
• Does not capture shape.
• May be disjoint.

By Scaling:
• Create histograms of pins by level and of segments by level.
• Multiply levels, pins and segments by R.
• Creates a timing graph with the same shape (histogram) but more pins.
• Not a rational circuit.
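Synthesis by concatenation can be sketched as follows (the edge-list representation, naming scheme and tiny seed circuit are illustrative assumptions):

```python
# Sketch of synthesis by concatenation: make R tagged copies of a timing
# graph and wire every copy's sink to one new global sink, so the result
# is a single connected DAG rather than R disjoint ones.
def concatenate(edges, sinks, r):
    big, big_sinks = [], []
    for copy in range(r):
        tag = lambda n: f"{n}_{copy}"  # rename nodes per copy
        big += [(tag(u), tag(v)) for u, v in edges]
        big_sinks += [tag(s) for s in sinks]
    # connect every copy to the new global sink
    big += [(s, "global_sink") for s in big_sinks]
    return big

edges = [("in", "g1"), ("g1", "out")]
big = concatenate(edges, ["out"], 3)
print(len(big))  # 3 copies x 2 edges + 3 sink connections = 9
```

Each copy keeps the original's internal path structure, but the overall shape (one shallow, very wide graph) differs from the seed, matching the limitation noted above.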

By Scaling: 23.6x speedup from 2^9 to 2^14 processors.

By Concatenation: 12.6x speedup from 2^10 to 2^14 processors.

Large circuits and parallel I/O

• File divided into m parts.

• Each processor loads one or more parts.

• Supports n = m/k processors with one file, where m >= k > 0.

• Scales because each processor reads fewer parts as the processor count increases.
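The part-to-processor mapping can be sketched as (the contiguous assignment and the part and processor counts are assumptions for illustration):

```python
# Sketch of the parallel I/O scheme: a file split into m parts, with
# processor `rank` of n reading k = m // n consecutive parts.
def parts_for(rank, nprocs, m):
    k = m // nprocs  # parts per processor; assumes nprocs divides m
    return list(range(rank * k, (rank + 1) * k))

m = 8  # file split into 8 parts
print([parts_for(r, 4, m) for r in range(4)])
```

Doubling the processor count halves the parts (and bytes) each processor must read, which is why the scheme scales with the machine size.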

Parallel I/O: 12.2x speedup from 2^9 to 2^14 processors.

Contents

• Introduction
• Method
• Summary

Summary

• Scaling appears related to the size of the circuit: static timing analysis scales further with larger circuits.

• We do not know how shape and structure affect the performance of large-scale static timing analysis.


• Holder, A., Carothers, C. D. and Kalafala, K., 2010. Prototype for a Large-Scale Static Timing Analyzer running on an IBM Blue Gene. International Workshop on Parallel and Distributed Scientific and Engineering Computing.

• Holder, A. The Impact of Irregularity on Efficient Large-Scale Integer-Intensive Computing. PhD Proposal.

Related Work

• Hathaway, D.J. and Donath, W.E., 2001. Distributed Static Timing Analysis.

• Hathaway, D.J. and Donath, W.E., 2003. Distributed Static Timing Analysis.

• Gove, D.J., Mains, R.E. and Chen, G.J., 2009. Multithreaded Static Timing Analysis.

• Gulati, K. and Khatri, S.P., 2009. Accelerating statistical static timing analysis using graphics processing units.


Thanks for your time and attention
