Large-Scale Static Timing Analysis, Akintayo Holder, Rensselaer Polytechnic Institute, 2/18/2016


Large-Scale Static Timing Analysis

Akintayo Holder

Rensselaer Polytechnic Institute


Contents

• Introduction

• Method

• Summary


Static Timing Analysis

• A gate (or other device) requires all input signals to be present at the same time.

• “Same time” defined by clock signal.

• STA ensures all devices have their expected inputs.


Static Timing Analysis

• Static timing analysis (STA) ensures that signals will propagate through a circuit.

• Checks that every gate will have valid inputs.

• Block oriented

• Polynomial

• Circuit dictates STA’s behaviour


STA: Arrival Time Calculation

• Late-mode static timing analysis was used in this study.

• Arrival time (AT) is the latest time a signal can arrive.

• AT(i) := max (AT(j) + Delay(S(j,i)), for all S(j,i))

• AT goes forward.
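The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the analyzer described in these slides: the function name `arrival_times`, the dictionary layout for segments, and the assumption that primary inputs arrive at time 0.0 are all choices made for the example.

```python
from collections import defaultdict

def arrival_times(topo_order, segments, source_at=0.0):
    """Late-mode arrival times, AT(i) = max over S(j,i) of AT(j) + Delay(S(j,i)).

    topo_order: nodes listed so that every segment runs from an earlier node
                to a later one (inputs first).
    segments:   dict mapping (j, i) -> delay of segment S(j, i).
    source_at:  assumed arrival time at nodes with no incoming segment.
    """
    fanin = defaultdict(list)
    for (j, i), delay in segments.items():
        fanin[i].append((j, delay))

    at = {}
    for i in topo_order:
        if not fanin[i]:
            at[i] = source_at  # primary input (assumption: all inputs share one AT)
        else:
            at[i] = max(at[j] + delay for j, delay in fanin[i])
    return at

# Tiny example graph: a and b are inputs, d is the output.
segments = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.5}
at = arrival_times(["a", "b", "c", "d"], segments)
```

In this example the latest path to `d` runs through `c`, so the short direct segment from `b` to `d` does not determine the arrival time at `d`.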


Slack: how STA checks that the circuit is correct

• Required Arrival Time (RAT) – the latest time by which a signal must arrive.

• RAT goes backward from output nodes.

• For a circuit to work, there must be overlap between the RAT and AT of each node (Slack).

• Slack = RAT - AT
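A small sketch of the backward pass and the slack computation follows. The slides only state that RAT goes backward and that Slack = RAT - AT; the `min` rule used here is the usual late-mode form and is an assumption for this example, as are the function names and the hard-coded arrival times.

```python
from collections import defaultdict

def required_times(topo_order, segments, required_at_outputs):
    """Late-mode required arrival times, propagated backward from the outputs.

    Assumed rule: RAT(j) = min over S(j,i) of RAT(i) - Delay(S(j,i)).
    required_at_outputs: dict giving the RAT of each node with no fanout.
    """
    fanout = defaultdict(list)
    for (j, i), delay in segments.items():
        fanout[j].append((i, delay))

    rat = {}
    for j in reversed(topo_order):
        if not fanout[j]:
            rat[j] = required_at_outputs[j]
        else:
            rat[j] = min(rat[i] - delay for i, delay in fanout[j])
    return rat

def slack(at, rat):
    """Slack = RAT - AT for every node; a negative value marks a timing violation."""
    return {n: rat[n] - at[n] for n in at}

segments = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.5}
at = {"a": 0.0, "b": 0.0, "c": 2.0, "d": 5.0}  # arrival times for this graph
rat = required_times(["a", "b", "c", "d"], segments, {"d": 6.0})
slacks = slack(at, rat)
```

With a required time of 6.0 at the output, every node in this example has positive slack, so the toy circuit meets timing.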


Behaviour of STA

• Block oriented STA is polynomial with respect to the size of the circuit.

• Running time depends on the circuit size,

• But more exacting tolerances require more exacting estimates.

• Multiple process corners, statistical STA and other techniques are common in industrial tools and increase running times.


Why large-scale static timing analysis is hard

• Performance depends on the circuit.

• Circuit must be divided/partitioned to minimize the cost of communication.

• Graph/circuit partitioning is hard.


What is the role of Decomposition?

• Divide the computation into smaller tasks

• Execute tasks in parallel

• The smaller the tasks, the more important the decomposition.

• Case of Amdahl’s Law
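For reference, Amdahl's Law bounds the speedup of a program whose parallel fraction is p when run on n processors. The snippet below is a generic illustration; the 99% parallel fraction is an arbitrary value chosen for the example, not a measurement from this work.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Hypothetical workload that is 99% parallel, run on 1024 processors.
example = amdahl_speedup(0.99, 1024)
```

Even with 1024 processors, the hypothetical 1% serial part keeps the speedup below 100 (roughly 91 here), which is why the quality of the decomposition matters more as tasks shrink.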


What is meant by Regularity?

• Presence of a pattern or evidence of rules.

• Easy to understand and define.

• A regular pattern can be used to decompose work.

• Irregular, no obvious pattern or apparent rules.

• No clear pattern. Decomposition chooses among many “bad” choices.


Where do we get Structural Irregularity from?

• Structural Irregularity is derived from unstructured objects.

• Unstructured objects are defined by their neighbour relations

• Neighbour relations are unique, complex and often irregular.


STA and Structural Irregularity

• STA depends on shape of circuit

• Circuit is irregular, which causes STA to demonstrate structural irregularity

• Irregularity makes decomposition hard and limits ability to scale


Why large-scale static timing analysis is hard

• STA exhibits structural irregularity.

• Irregularity hinders problem decomposition.

• Large-scale systems usually display traits that make efficient decomposition difficult.

• Good decomposition needed for good performance.


Why balance will help performance

• Balance refers to the relative performance of the processors, the network and the memory interconnect.

• Balance is the ability of the processor to saturate either the network or the memory interconnect.

• Balance simplifies decomposition.


The Blue Gene has modest processors and high performance networks


The IBM Blue Gene/L

• Supports MPI as primary programming interface

• Simple memory management.

  – 512 MB or 1 GB per node.

  – 1 or 2 processors per node.

  – No virtual memory.

• Subset of Linux API.

• Adaptive routing for point to point, independent networks for collective.

• Optimized for bandwidth, scalability and efficiency.


More on Balance

• Cray XT5

• High-density blades with 8 Opteron quad core processors

• SeaStar2+, a low latency, high bandwidth interconnect.

• Hybrid supercomputer, uses FPGAs and co-processors

• Better performance when adding a processor from a new node.

  – 45% slow down when adding processor on existing node, compared with adding processor from new node.

• Network saturation not observed with point to point experiments.

• Global operations varied in run time.

Worley et al., 2009. Snell et al., 2007


Contents

• Introduction

• Method

• Summary


Understanding the Input

• Static timing analysis processes a timing graph, which represents the circuit.

• Timing graph

  – DAG

  – From input to output pins

• Large-scale STA sorts edges in DAG by source depth.
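One way to picture the sorting step is the sketch below. The slides do not define "depth", so this example assumes it is the length of the longest path from any input pin, computed with Kahn's algorithm; the function names are hypothetical.

```python
from collections import defaultdict, deque

def node_depths(segments):
    """Depth of each node: longest path length from any node with no fanin.

    segments: iterable of (source, sink) pairs forming a DAG.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for j, i in segments:
        fanout[j].append(i)
        indegree[i] += 1
        nodes.update((j, i))

    depth = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    while ready:
        j = ready.popleft()
        for i in fanout[j]:
            depth[i] = max(depth[i], depth[j] + 1)
            indegree[i] -= 1
            if indegree[i] == 0:  # all predecessors of i have been processed
                ready.append(i)
    return depth

def sort_by_source_depth(segments):
    """Order segments so that those leaving shallower nodes come first."""
    depth = node_depths(segments)
    return sorted(segments, key=lambda s: depth[s[0]])

edges = [("c", "d"), ("a", "c"), ("b", "c"), ("b", "d")]
depths = node_depths(edges)
ordered = sort_by_source_depth(edges)
```

Sorting by source depth means a single pass over the edge list visits every segment only after the arrival time of its source can already be known.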


An approach to Large-Scale STA

• Compute all nodes on a level,

• share the results,

• move to the next level.

• Levelized Bulk Synchronization:

  – Only arrival time calculations,

  – Each processor loads the entire circuit,

  – Online partitioning using round robin.


Levelized Bulk Synchronization Algorithm

level = 0
Repeat
    Repeat
        Repeat
            if (local segment) calculate delay/slew at local processor
        Until (all incoming segments have been visited)
        Compute node arrival time and slew
    Until (all assigned nodes at the current level have been processed)
    Repeat
        if (gate segment)
            Calculate delay/slew at sink processor
        else if (net segment)
            Calculate delay/slew at source processor
        endif
    Until (all remote outgoing segments have been visited)
    Once all processors are complete, advance to next level
Until (all levels have been processed)
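The level-by-level structure can be imitated serially in Python. This is only a sketch under several assumptions: processors are simulated by a loop rather than MPI ranks, ownership is round robin by position within a level, slew and the gate/net distinction are ignored, and a "remote update" is simply counted whenever a segment crosses between owners. All names are hypothetical.

```python
from collections import defaultdict

def levelized_bulk_sync(levels, segments, num_procs):
    """Serially simulate levelized arrival time computation.

    levels:   list of lists; levels[d] holds the nodes at depth d.
    segments: dict mapping (j, i) -> delay.
    Returns (arrival times, number of segments crossing processors).
    """
    fanin = defaultdict(list)
    for (j, i), delay in segments.items():
        fanin[i].append((j, delay))

    # Assumed ownership rule: position within the level modulo processor count.
    owner = {n: k % num_procs for level in levels for k, n in enumerate(level)}

    at, remote_updates = {}, 0
    for level in levels:                    # one pass per level
        for p in range(num_procs):          # each simulated processor in turn
            for i in (n for n in level if owner[n] == p):
                at[i] = max((at[j] + d for j, d in fanin[i]), default=0.0)
                remote_updates += sum(owner[j] != p for j, _ in fanin[i])
        # Finishing the processor loop stands in for the barrier: no processor
        # starts the next level until all have completed this one.
    return at, remote_updates

segments = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.5}
at, remote = levelized_bulk_sync([["a", "b"], ["c"], ["d"]], segments, num_procs=2)
```

In this toy graph only node `b` lands on the second simulated processor, so both of its outgoing segments count as remote updates.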


Our modest 3 million node benchmark circuit

• 3 million nodes

• 4 million segments

• Almost 1000 levels deep.

• A mean width of 3000 nodes but a median width of 32 nodes.

• Estimated theoretical speedup = 260

• Speedup computed for 1024 processors.

• Levels with more than 1024 nodes are truncated to 1024.
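The slides do not give the formula behind the estimate, so the following is one plausible reading and should be treated as an assumption: each level can keep at most as many processors busy as it has nodes, so the estimate is the mean level width after truncating wide levels to the processor count. The widths in the example are made up.

```python
def estimated_speedup(level_widths, num_procs):
    """Mean usable parallelism: each level contributes min(width, num_procs)."""
    truncated = [min(width, num_procs) for width in level_widths]
    return sum(truncated) / len(truncated)

# Hypothetical four-level circuit with one very wide level.
estimate = estimated_speedup([4, 2000, 32, 8], num_procs=1024)
```

Under this reading, a few very wide levels raise the mean width a great deal but raise the estimate much less, which fits a circuit whose mean width is 3000 while its median width is 32.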


Our modest 3 million node benchmark circuit


Timing? 260x on 1024 processors

But including partitioning, speedup is 119x.


Removing global dependencies to improve timing algorithm

• Bulk synch forces all processors to wait at the end of a level.

• Not required by STA.

• Solution: Remove global synch.

• No Global Synch:

  – Compute x nodes

  – Send y updates

  – Continue until all nodes are done.
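A dependency-driven version can be sketched as follows. It is a serial imitation, not the authors' implementation: each simulated processor computes up to `batch` ready nodes per turn (standing in for x), then delivers all resulting updates at once (the slides' separate limit y is not modelled), and a node becomes ready as soon as all of its incoming segments have been updated. Names and the ownership rule are hypothetical.

```python
from collections import defaultdict, deque

def no_global_sync(segments, num_procs, batch=2):
    """Arrival times without a per-level barrier (serial simulation)."""
    fanin, fanout = defaultdict(list), defaultdict(list)
    nodes = set()
    for (j, i), delay in segments.items():
        fanin[i].append((j, delay))
        fanout[j].append(i)
        nodes.update((j, i))

    ordered = sorted(nodes)
    owner = {n: k % num_procs for k, n in enumerate(ordered)}  # assumed mapping
    pending = {n: len(fanin[n]) for n in ordered}  # updates still awaited
    ready = [deque() for _ in range(num_procs)]
    for n in ordered:
        if pending[n] == 0:
            ready[owner[n]].append(n)

    at, done = {}, 0
    while done < len(ordered):
        progressed = False
        for p in range(num_procs):
            updates = []
            for _ in range(min(batch, len(ready[p]))):  # compute up to x nodes
                i = ready[p].popleft()
                at[i] = max((at[j] + d for j, d in fanin[i]), default=0.0)
                updates.extend(fanout[i])
                done += 1
                progressed = True
            for k in updates:                           # "send" the updates
                pending[k] -= 1
                if pending[k] == 0:
                    ready[owner[k]].append(k)
        if not progressed:
            raise ValueError("no ready nodes left: the graph is not a DAG")
    return at

segments = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.5}
result = no_global_sync(segments, num_procs=2)
```

The arrival times match the levelized calculation; only the order in which nodes are computed changes, because a processor never waits for the slowest processor on a level.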


Removing Global Synchronization

Speedup without Partitioning Costs

Algorithm                         256 CPUs   1024 CPUs
Levelized Bulk Synchronization    120        263
No Global Synchronization         120        292

• About 11% improvement with 1024 processors (263x to 292x).

• Improved partitioning appears to have more potential for impact.


What about partitioning?

• Each processor loads the entire circuit

• For each level:

  – For each node on level:

    • Assign node n to task (n % (num cpus))

• Build list of local nodes, and segments

• Flexible with respect to the number of cores.

• Ignores structure of the circuit

• Limits the size of the circuit
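The assignment rule is simple enough to show directly. The sketch assumes nodes carry integer ids and only builds the per-task node lists; building the matching segment lists is omitted, and the function name is hypothetical.

```python
def round_robin_partition(levels, num_cpus):
    """Assign node n to task n % num_cpus, visiting nodes level by level.

    levels: list of lists of integer node ids.
    Returns a dict mapping each task to its list of local nodes.
    """
    local_nodes = {task: [] for task in range(num_cpus)}
    for level in levels:
        for n in level:
            local_nodes[n % num_cpus].append(n)
    return local_nodes

parts = round_robin_partition([[0, 1, 2], [3, 4], [5]], num_cpus=2)
```

The rule works for any processor count and needs no preprocessing, but it pays no attention to which nodes are connected, so neighbouring nodes routinely land on different processors.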


What about partitioning?


Future Work

• Examine how large-scale Static Timing Analysis performs on different circuits.


Why synthesize large circuits?

• Motivated by a lack of benchmark circuits and the relatively small size of provided circuits

• Would algorithm scale to 10,000s of processors with larger circuits?

• Solution: Generate billion node timing graphs


Synthesis of Large circuits

By Concatenation

• Make R copies of the circuit

• Connect all copies to a new global sink

• Creates a timing graph that captures internal intricacies of original.

• Does not capture shape

• May be disjoint

By Scaling

• Create histogram of pins by level and of segments by level.

• Multiply levels, pins and segments by R

• Creates a timing graph with same shape (histogram) but more pins

• Not a rational circuit
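Synthesis by concatenation can be sketched as below; synthesis by scaling is not shown. The sketch assumes the circuit is a dictionary of segments, that "outputs" are nodes with no outgoing segment, and that the new sink segments have zero delay. The names `concatenate`, `GLOBAL_SINK` and `sink_delay` are hypothetical.

```python
def concatenate(segments, copies, sink="GLOBAL_SINK", sink_delay=0.0):
    """Build a larger timing graph from `copies` replicas of one circuit.

    segments: dict mapping (j, i) -> delay.
    Each replica r renames node x to (r, x); every replica's outputs are
    then connected to a single new sink node.
    """
    sources = {j for j, _ in segments}
    outputs = {i for _, i in segments} - sources  # nodes with no fanout

    big = {}
    for r in range(copies):
        for (j, i), delay in segments.items():
            big[((r, j), (r, i))] = delay
        for out in outputs:
            big[((r, out), sink)] = sink_delay
    return big

segments = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.5}
big = concatenate(segments, copies=3)
```

Apart from the shared sink, the replicas never interact, which is why a graph built this way keeps the original's internal structure but not the overall shape of a genuinely larger circuit.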


By Scaling. 23.6 x speedup from 2^9 to 2^14


By Concatenation. 12.6 x speedup from 2^10 to 2^14


Large circuits and parallel I/O

• File divided into m parts

• Each processor loads one or more parts

• Supports n=m/k processors with one file, where m >= k > 0

• Scales because processors read fewer parts as count increases
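The relationship n = m/k can be illustrated with a small helper that tells a processor which parts to read. Assigning parts in contiguous blocks is an assumption made for this sketch, as is the function name; the actual file reading is left out.

```python
def parts_for_rank(rank, num_procs, num_parts):
    """Indices of the file parts read by processor `rank`.

    With m = num_parts and n = num_procs, each processor reads k = m / n parts.
    """
    if num_parts % num_procs != 0:
        raise ValueError("num_parts must be a multiple of num_procs")
    k = num_parts // num_procs  # parts per processor
    return list(range(rank * k, (rank + 1) * k))

# The same 8-part file used with 4 processors, then with 8.
with_four = [parts_for_rank(r, 4, 8) for r in range(4)]
with_eight = [parts_for_rank(r, 8, 8) for r in range(8)]
```

Doubling the processor count halves the number of parts each processor reads, which is the scaling behaviour the last bullet describes.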


Parallel I/O. 12.2x speedup from 10^9 to 10^14


Contents

• Introduction

• Method

• Summary


Summary

• Scaling appears related to the size of the circuit: static timing analysis scales further with larger circuits.

• We do not know how shape and structure affect the performance of large-scale static timing analysis.


• Prototype for a Large-Scale Static Timing Analyzer running on an IBM Blue Gene, Holder, A., Carothers, C. D., and Kalafala, K., 2010, International Workshop on Parallel and Distributed Scientific and Engineering Computing.

• The Impact of Irregularity on Efficient Large-Scale Integer-Intensive Computing, Holder, A., PhD Proposal


Related Work

• Hathaway, D.J. and Donath, W.E., 2001. Distributed Static Timing Analysis.

• Hathaway, D.J. and Donath, W.E., 2003. Distributed Static Timing Analysis.

• Gove, D.J., Mains, R.E. and Chen, G.J., 2009. Multithreaded Static Timing Analysis.

• Gulati, K. and Khatri, S.P., 2009. Accelerating statistical static timing analysis using graphics processing units.


Thanks for your time and attention
