Code Optimization of Parallel Programs
Vivek Sarkar, Rice University
vsarkar@rice.edu
[Figure: multicore chip floorplan with an L3 directory/control, three L2 caches, and duplicated per-core units: LSU, IFU, BXU, IDU, FPU, FXU, ISU]
2
Parallel Software Challenges & Focus Area for this Talk
[Figure: parallel software stack, annotated with the parallelism challenges at each layer]
- Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
- Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
- Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
- Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
- Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
- Static & Dynamic Optimizing Compilers: parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization
- Multicore Back-ends: code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management
- Parallel Runtime & System Libraries: task scheduling, synchronization, parallel data structures
- OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)
3
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
4
Our Current Paradigm for Code Optimization has served us well for Fifty Years …
[Figure: Stretch-Harvest compiler organization (1958-1962): Fortran, Autocoder II, and ALPHA front ends each translate to a common IL; the IL flows through the OPTIMIZER and the REGISTER ALLOCATOR, then to the ASSEMBLER, which emits object code for the STRETCH and STRETCH-HARVEST machines]
Source: "Compiling for Parallelism", Fran Allen, Turing Lecture, June 2007
5
… and has been adapted to meet challenges along the way …
- Interprocedural analysis
- Array dependence analysis
- Pointer alias analysis
- Instruction scheduling & software pipelining
- SSA form
- Profile-directed optimization
- Dynamic compilation
- Adaptive optimization
- Auto-tuning
- . . .
6
… but is now under siege because of parallelism
- Proliferation of parallel hardware: multicore, manycore, accelerators, clusters, …
- Proliferation of parallel libraries and languages: OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, …
7
Paradigm Shifts
"The Structure of Scientific Revolutions", Thomas S. Kuhn (1970)
- A paradigm is a scientific structure or framework consisting of assumptions, laws, and techniques
- Normal science is a puzzle-solving activity governed by the rules of the paradigm; it is uncritical of the current paradigm
- Crisis sets in when a series of serious anomalies appears: "The emergence of new theories is generally preceded by a period of pronounced professional insecurity"; scientists engage in philosophical and metaphysical disputes
- A revolution or paradigm shift occurs when an entire paradigm is replaced by another
8
Kuhn’s History of Science
[Figure: Kuhn's cycle: Immature Science → Normal Science → Anomalies → Crisis → Revolution]
- Revolution: a new paradigm emerges
- Old Theory: well established, many followers, many anomalies
- New Theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions
Source: www.philosophy.ed.ac.uk/ug_study/ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt
9
Some Well Known Paradigm Shifts
- Newton’s Laws to Einstein’s Theory of Relativity
- Ptolemy’s geocentric view to Copernicus and Galileo’s heliocentric view
- Creationism to Darwin’s Theory of Evolution
10
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches
- Rice Habanero Multicore Software project
11
What anomalies do we see when optimizing parallel code?
Examples:
1. Control flow rules
2. Data flow rules
3. Load elimination rules
12
1. Control Flow Rules from Sequential Code Optimization
Control Flow Graph
- Node = basic block
- Edge = transfer of control flow
- Succ(b) = successors of block b
- Pred(b) = predecessors of block b

Dominators
- Block d dominates block b if every (sequential) path from START to b includes d
- Dom(b) = set of dominators of block b
- Every block has a unique immediate dominator (parent in the dominator tree)
13
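The Dom(b) definition above can be sketched as a short iterative fixpoint. This Python illustration is not from the talk; it uses the CFG of the following example slide.

```python
# Iterative dominator computation:
#   Dom(b) = {b} ∪ ( intersection of Dom(p) over predecessors p of b )
def dominators(succ, start):
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}   # initialize to "all nodes"
    dom[start] = {start}
    changed = True
    while changed:                          # iterate to a fixpoint
        changed = False
        for n in nodes - {start}:
            new = {n}
            if pred[n]:
                new |= set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

cfg = {"START": ["BB1"], "BB1": ["BB2", "BB3"],
       "BB2": ["BB4"], "BB3": ["BB4"],
       "BB4": ["STOP"], "STOP": []}
dom = dominators(cfg, "START")
# BB1 dominates BB4; neither BB2 nor BB3 does, since each can be bypassed.
```

The unique immediate dominator of each block is the closest element of Dom(b) − {b}, which yields the dominator tree shown on the next slide.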
Dominator Example
[Figure: Control Flow Graph: START → BB1; BB1 branches (T/F) to BB2 and BB3; BB2 → BB4; BB3 → BB4; BB4 → STOP]

[Figure: Dominator Tree: START → BB1; BB1's children are BB2, BB3, and BB4; BB4 → STOP]
14
Anomalies in Control Flow Rules for Parallel Code
BB1
parbegin
  BB2
||
  BB3
parend
BB4

- Does BB4 have a unique immediate dominator?
- Can the dominator relation be represented as a tree?

[Figure: Parallel Control Flow Graph: BB1 → FORK; FORK → BB2 and BB3; BB2, BB3 → JOIN; JOIN → BB4]
15
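The anomaly can be made concrete: if the FORK/JOIN edges are fed unchanged to the sequential dominator algorithm, BB2 and BB3 drop out of Dom(BB4), even though every parallel execution runs both before BB4. A Python sketch (same iterative algorithm as the earlier illustration, not code from the talk):

```python
# Sequential dominators applied naively to the parallel CFG of this slide.
def dominators(succ, start):
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            new = {n}
            if pred[n]:
                new |= set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

pcfg = {"BB1": ["FORK"], "FORK": ["BB2", "BB3"],
        "BB2": ["JOIN"], "BB3": ["JOIN"],
        "JOIN": ["BB4"], "BB4": []}
seq_dom = dominators(pcfg, "BB1")
# The intersection rule treats FORK's out-edges as *alternative* paths,
# so it concludes neither BB2 nor BB3 dominates BB4. Under parallel
# semantics both always execute before BB4, yet neither dominates the
# other: BB4 has no unique immediate dominator, and the relation is no
# longer a tree.
```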
2. Data Flow Rules from Sequential Code Optimization
Example: Reaching Definitions
REACHin(n) = set of definitions d s.t. there is a (sequential) path from d to n in the CFG, and d is not killed along that path.
16
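The REACHin equation pairs with REACHout(n) = GEN(n) ∪ (REACHin(n) − KILL(n)). A minimal Python sketch over a hypothetical three-node straight-line CFG (the names d1, d2, use are invented for illustration):

```python
# Iterative reaching-definitions analysis:
#   REACHin(n)  = union of REACHout(p) over predecessors p of n
#   REACHout(n) = GEN(n) ∪ (REACHin(n) − KILL(n))
def reaching_definitions(succ, gen, kill):
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].append(a)
    rin = {n: set() for n in nodes}
    rout = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            new_in = set().union(*(rout[p] for p in pred[n])) if pred[n] else set()
            new_out = gen[n] | (new_in - kill[n])
            if new_in != rin[n] or new_out != rout[n]:
                rin[n], rout[n], changed = new_in, new_out, True
    return rin

# d1 and d2 both define x; d2 kills d1, so only d2 reaches the use.
succ = {"d1": ["d2"], "d2": ["use"], "use": []}
gen  = {"d1": {"d1:x"}, "d2": {"d2:x"}, "use": set()}
kill = {"d1": {"d2:x"}, "d2": {"d1:x"}, "use": set()}
rin = reaching_definitions(succ, gen, kill)
```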
Anomalies in Data Flow Rules for Parallel Code
What definitions reach COEND?
What if there were no synchronization edges?
How should the data flow equations be defined for parallel code?
[Figure: parallel flow graph with control and sync edges for the code below]

S1: X1 := …
parbegin // Task 1
  S2: X2 := … post(ev2);
  S3: . . . post(ev3);
  S4: wait(ev8); X4 := …
|| // Task 2
  S5: . . .
  S6: wait(ev2);
  S7: X7 := …
  S8: wait(ev3); post(ev8);
parend
. . .
17
3. Load Elimination Rules from Sequential Code Optimization
A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P.

Before:
T1 := *q
T2 := *p
T3 := *q

After:
T1 := *q
T2 := *p
T3 := T1
18
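The sequential rule can be sketched as a single forward pass that tracks which temporary currently holds each loaded address. The toy three-address form ("t := *p") is an illustration, not the talk's notation, and the pass is only sound when no intervening store can change the location, which is exactly the assumption the next slides break.

```python
# Redundant-load elimination over one straight-line block.
# Instructions are (dst, src) pairs; src "*p" is a load, anything
# else is a register copy.
def eliminate_loads(block):
    avail = {}                  # address expression -> temp holding its value
    out = []
    for dst, src in block:
        if src.startswith("*") and src in avail:
            out.append((dst, avail[src]))   # reuse the earlier load
        else:
            out.append((dst, src))
            if src.startswith("*"):
                avail[src] = dst            # record the loaded value
    return out

before = [("T1", "*q"), ("T2", "*p"), ("T3", "*q")]
after = eliminate_loads(before)
# The third instruction T3 := *q becomes T3 := T1.
```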
Anomalies in Load Elimination Rules for Parallel Code (Original Version)
Task 1:
. . .
T1 := *q
T2 := *p
T3 := *q
print T1, T2, T3

Task 2:
. . .
*p = 1
. . .

Assume that p = q, and that *p = *q = 0 initially.

Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model. It is not permitted by Sequential Consistency [Lamport 1979], but it is permitted by Location Consistency [Gao & Sarkar 1993, 2000].
19
Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)
Task 1:
. . .
T1 := *q
T2 := *p
T3 := T1
print T1, T2, T3

Task 2:
. . .
*p = 1
. . .

Assume that p = q, and that *p = *q = 0 initially.

Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency, if load elimination is performed!
20
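The claims on these two slides can be checked mechanically by enumerating every sequentially consistent interleaving of the two tasks. A brute-force Python sketch (assuming p = q and a single shared location initialized to 0, as the slides do):

```python
from itertools import combinations

# Task 2 is the single store *p := 1. Task 1 is three loads (original)
# or two loads plus the copy T3 := T1 (after load elimination).
def outcomes(t3_is_load):
    results = set()
    n_t1 = 3 if t3_is_load else 2
    # Choose positions of Task 1's ops among n_t1 + 1 total ops;
    # the remaining slot is Task 2's store.
    for pos in combinations(range(n_t1 + 1), n_t1):
        mem = 0                                  # the one location *p == *q
        regs = {}
        names = iter(["T1", "T2", "T3"][:n_t1])
        for i in range(n_t1 + 1):
            if i in pos:
                regs[next(names)] = mem          # a load by Task 1
            else:
                mem = 1                          # Task 2's store
        if not t3_is_load:
            regs["T3"] = regs["T1"]              # T3 := T1 after elimination
        results.add((regs["T1"], regs["T2"], regs["T3"]))
    return results

# (0, 1, 0) is impossible under SC originally, but becomes possible
# once the third load is replaced by the register copy.
```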
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
21
Incremental Approaches to coping with Parallel Code Optimization
- Large investment in infrastructures for sequential code optimization
- Introduce ad hoc rules to incrementally extend them for parallel code optimization:
  - Code motion fences at synchronization operations
  - Task creation and termination via function call interfaces
  - Use of volatile storage modifiers
  - . . .
22
More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future
Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs:
- Abstract execution model for PIR
- Storage classes (types) for locality and memory hierarchies
- General framework for task partitioning and code motion in parallel code
- Compiler-friendly memory model
- Combining automatic parallelization and explicit parallelism
- . . .
23
Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]
A Program Dependence Graph, PDG = (N', Ecd, Edd), is derived from a CFG and consists of:
- N', a set of statement, predicate, and region nodes
- Ecd, a set of control dependence edges
- Edd, a set of data dependence edges
24
PDG Example
/* S1 */ max = a[i];
/* S2 */ div = a[i] / b[i];
/* S3 */ if ( max < b[i] )
/* S4 */   max = b[i];

[Figure: PDG with nodes S1, S2, S3, S4; S4 is control dependent on S3; data dependence edges for max: true dependence S1 → S3, output dependence S1 → S4, anti dependence S3 → S4]
25
PDG restrictions
Control Dependence
- Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node
- No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P
26
Violation of the Predicate-Ancestor Condition can lead to “non-serializable” PDGs [LCPC 1993]
Node 4 is executed twice in this acyclic PDG
“Parallel Program Graphs and their Classification”, V.Sarkar & B.Simons, LCPC 1993
27
PDG restrictions (contd.)
Data Dependence
- There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG
- The context C of a data dependence edge (A, B, C) must be plausible, i.e., it cannot identify a dependence from an execution instance IA of node A to an execution instance IB of node B if IB precedes IA in the CFG's execution; e.g., a data dependence from iteration i+1 to iteration i is not plausible in a sequential program
28
Limitations of Program Dependence Graphs
- PDGs and CFGs are tightly coupled: a transformation in one must be reflected in the other
- PDGs reveal maximum parallelism in the program; CFGs reveal sequential execution
- Neither is well suited for code optimization of parallel programs, e.g., how do we represent a partitioning of { 1, 3, 4 } and { 2 } into two tasks?
29
Another Limitation: no Parallel Execution Semantics defined for PDGs
- What is the semantics of control dependence edges with cycles?
- What is the semantics of data dependences when a source or destination node may have zero, one, or more instances?

A[f(i,j)] = …
… = A[g(i)]
30
Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992]
A Parallel Program Graph, PPG = (N, Econtrol, Esync), consists of:
- N, a set of compute, predicate, and parallel nodes; a parallel node creates parallel threads of computation for each of its successors
- Econtrol, a set of labeled control edges; edge (A, B, L) in Econtrol identifies a control edge from node A to node B with label L
- Esync, a set of synchronization edges; edge (A, B, F) in Esync defines a synchronization from node A to node B with synchronization condition F, which identifies the execution instances of A and B that need to be synchronized
“A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs”, V.Sarkar, LCPC 1992
31
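The PPG triple can be sketched as a small data structure. The node kinds and the sync edge built from the earlier post(ev2)/wait(ev2) pair are illustrative, and the synchronization condition F is reduced to an opaque label; this is not an actual Habanero implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PPG:
    """PPG = (N, Econtrol, Esync) as defined on this slide."""
    nodes: dict = field(default_factory=dict)     # name -> node kind
    control: list = field(default_factory=list)   # (src, dst, label L)
    sync: list = field(default_factory=list)      # (src, dst, condition F)

    def add_node(self, name, kind):
        assert kind in ("compute", "predicate", "parallel")
        self.nodes[name] = kind

g = PPG()
g.add_node("PAR", "parallel")   # forks its successors as parallel threads
g.add_node("S2", "compute")
g.add_node("S7", "compute")
g.control.append(("PAR", "S2", "all"))
g.control.append(("PAR", "S7", "all"))
# The post(ev2)/wait(ev2) pair from the earlier example becomes a sync edge
# whose condition F (here just a label) relates the instances to synchronize.
g.sync.append(("S2", "S7", "same-instance"))
```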
PPG Example
[Figure: example PPG]
32
Relating CFGs to PPGs
Construction of a PPG for a sequential program:
- PPG nodes = CFG nodes
- PPG control edges = CFG edges
- PPG synchronization edges = empty set
33
Relating PDGs to PPGs
Construction of a PPG for a PDG:
- PPG nodes = PDG nodes
- PPG parallel nodes = PDG region nodes
- PPG control edges = PDG control dependence edges
- PPG synchronization edges = PDG data dependence edges; the synchronization condition F in a PPG synchronization edge mirrors the context of the corresponding PDG data dependence edge
34
Example of Transforming PPGs
[Figure: example of transforming PPGs]
35
Abstract Interpreter for PPGs
- Build a partial order of dynamic execution instances of PPG nodes as PPG execution unravels
- Each execution instance IA is labeled with its history (calling context), H(IA)
- Initialize the partial order to a singleton set containing an instance of the start node, ISTART, with H(ISTART) initialized to the empty sequence
36
Abstract Interpreter for PPGs (contd.)
Each iteration of the scheduling algorithm:
- Selects an execution instance IA in the partial order such that all of IA's predecessors in the partial order have been scheduled
- Simulates execution of IA and evaluates branch label L
- Creates an instance IB of each c.d. successor B of A for label L
- Adds (IB, IC) to the partial order if instance IC has been created in the partial order and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)
- Adds (IC, IB) to the partial order if instance IC has been created in the partial order and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)
37
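For a branch-free fragment, where all instances can be created eagerly, the scheduling loop above reduces to topological scheduling of the partial order. A Python sketch (a simplification of the slide's algorithm, which creates instances lazily as branch labels are evaluated), using the instance names of the example that follows:

```python
from collections import deque

def schedule(instances, order_edges):
    """Schedule each instance once all its predecessors have run."""
    preds = {i: set() for i in instances}
    succs = {i: set() for i in instances}
    for a, b in order_edges:
        preds[b].add(a)
        succs[a].add(b)
    ready = deque(i for i in instances if not preds[i])
    trace = []
    while ready:
        inst = ready.popleft()
        trace.append(inst)              # "simulate execution of IA"
        for nxt in succs[inst]:
            preds[nxt].discard(inst)
            if not preds[nxt]:          # all predecessors scheduled
                ready.append(nxt)
    return trace

# IPAR forks I1, I2, I3; a sync edge orders I1 before I3.
trace = schedule(
    ["ISTART", "IPAR", "I1", "I2", "I3"],
    [("ISTART", "IPAR"), ("IPAR", "I1"), ("IPAR", "I2"),
     ("IPAR", "I3"), ("I1", "I3")],
)
```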
Abstract Interpreter for PPGs: Example
1. Create ISTART
2. Schedule ISTART
3. Create IPAR
4. Schedule IPAR
5. Create I1, I2, I3
6. Add (I1, I3) to the partial order
7. Schedule I2
8. Schedule I1
9. Schedule I3
10. . . .

[Figure: the example PPG driving this trace]
38
Weak (Deterministic) Memory Model for PPGs
- All memory accesses are assumed to be non-atomic
- Read-write hazard: if IA reads a location for which there is a parallel write of a different value, then the execution result is an error; analogous to an exception thrown if a data race occurs; may be thrown when the read or write operation is performed
- Write-write hazard: if IA writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined; execution results in an error if that location is subsequently read
- Separation of data communication and synchronization: data communication is specified by read/write operations; sequencing is specified by synchronization and control edges
39
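The two hazard rules can be sketched as a check over pairs of accesses known to be unordered (i.e., parallel). Deriving those pairs from a PPG's control and sync edges is elided here, and all names are illustrative.

```python
# Each access is (instance, op, location, value); parallel_pairs lists
# pairs of accesses unordered by control or sync edges.
def hazards(parallel_pairs):
    errs = []
    for (i1, op1, loc1, v1), (i2, op2, loc2, v2) in parallel_pairs:
        if loc1 != loc2 or v1 == v2:
            continue                  # different locations, or same value:
                                      # benign under this weak model
        ops = {op1, op2}
        if ops == {"read", "write"}:  # read-write hazard: error
            errs.append((i1, i2, loc1))
        elif ops == {"write"}:        # write-write hazard: undefined value,
            errs.append((i1, i2, loc1))   # error if later read
    return errs

# A read of x parallel with a write of a different value is a hazard;
# a parallel write of the *same* value is not.
errs = hazards([(("Ia", "read", "x", 0), ("Ib", "write", "x", 1))])
```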
Soundness Properties
- Reordering Theorem: for a given Parallel Program Graph G and input store i, the final store f = G(i) is the same for all possible scheduled sequences in the abstract interpreter
- Equivalence Theorem: a sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store
40
Reaching Definitions Analysis on PPGs [LCPC 1997]
“Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation”, V.Sarkar, LCPC 1997
A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P.
41
Reaching Definitions Analysis on PPGs
[Figure: PPG with control and sync edges for the code below]

S1: X1 := …
// Task 1
S2: X2 := … post(ev2);
S3: . . . post(ev3);
S4: wait(ev8); X4 := …
// Task 2
S5: . . .
S6: wait(ev2);
S7: X7 := …
S8: wait(ev3); post(ev8);
42
PPG Limitations
- Past work has focused on a comprehensive representation and semantics for deterministic programs
- Extensions needed for:
  - Atomicity and mutual exclusion
  - Stronger memory models
  - Storage classes with explicit locality
43
Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999]
Questions:
- Can the load of p.x be moved below the store of q.y?
- Can the load of p.x be moved outside the synchronized block?
- Can the load of r.z be moved inside the synchronized block?
- Can the load of r.z be moved back outside the synchronized block?
- How should the data dependences be modeled?

a = ...
synchronized (L) {
  ... = p.x
  q.y = ...
  b = ...
}
... = r.z
“Dependence Analysis for Java”, C.Chambers et al, LCPC 1999
44
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
45
Habanero Project (habanero.rice.edu)
[Figure: Habanero software stack]
- Parallel Applications, written in the X10/Habanero language (with sequential C, Fortran, Java, … via a Foreign Function Interface)
1) Habanero Programming Language
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
- Vendor Compiler & Libraries
- Multicore Hardware
46
2) Habanero Static Parallelizing & Optimizing Compiler
[Figure: compiler pipeline]
- Inputs: X10/Habanero language source; C / Fortran (restricted code regions for targeting accelerators & high-end computing); sequential C, Fortran, Java, … via a Foreign Function Interface
- Front End → AST → IRGen → Parallel IR (PIR)
- Interprocedural Analysis, then PIR Analysis & Optimization
- Outputs: Annotated Classfiles for the Portable Managed Runtime (via Classfile Transformations), and Partitioned Code for a platform-specific static compiler
47
Habanero Target Applications and Platforms

Applications:
- Parallel Benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, SciMark benchmarks
- Medical Imaging: back-end processing for Compressive Sensing (www.dsp.ece.rice.edu/cs); contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic Data Processing: Rice Inversion project (www.trip.caam.rice.edu); contacts: Bill Symes (Rice), James Gunning (CSIRO)
- Computer Graphics and Visualization: mathematical modeling and smoothing of meshes; contact: Joe Warren (Rice)
- Computational Chemistry: Fock Matrix Construction; contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero Compiler: implement the Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler

Platforms:
- AMD Barcelona Quad-Core
- Clearspeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cell
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- Sun UltraSparc T1, T2
- . . .

Additional suggestions welcome!
48
Habanero Research Topics

1) Language research
- Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore
- Implicit deterministic parallelism: array views, single-assignment constructs
- Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks
- Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90, and Sisal languages

2) Compiler research
- New Parallel Intermediate Representation (PIR)
- Automatic analysis, transformation, and parallelization of PIR
- Optimization of high-level arrays and iterators
- Optimization of synchronization, data transfer, and transactional memory operations
- Code partitioning for accelerators
- Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI, and PTRAN research compilers
49
Habanero Research Topics (contd.)

3) Virtual machine research
- VM support for work-stealing scheduling algorithms with extensions for places, transactions, task groups
- Runtime support for other Habanero language constructs (phasers, regions, distributions)
- Integration and exploitation of lightweight profiling in the VM scheduler and memory management system
- Builds on our experiences with the Jikes Research Virtual Machine

4) Concurrency library research
- New nonblocking data structures to support the Habanero runtime
- Efficient software transactional memory libraries
- Builds on our experiences with the java.util.concurrent and DSTM2 libraries

5) Toolkit research
- Program analysis for common parallel software errors
- Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context
- Builds on our experiences with the HPCToolkit, Eclipse PTP, and DrJava projects
50
Opportunities for Broader Impact
- Education: influence how parallelism is taught in future Computer Science curricula
- Open Source: build an open source testbed to grow an ecosystem for researchers in the Parallel Software area
- Industry standards: use research results as proofs of concept for new features that can be standardized; infrastructure can provide a foundation for reference implementations
Collaborations welcome!
51
Habanero Team (Nov 2007)
Send email to Vivek Sarkar (vsarkar@rice.edu) if you are interested in a PhD, postdoc, research scientist, or programmer position
in the Habanero project, or in collaborating with us!
52
Other Challenges in Code Optimization of Parallel Code
- Optimization of task coordination: task creation and termination (fork, join); mutual exclusion (locks, transactions); synchronization (semaphores, barriers)
- Data locality optimizations: computation and data alignment; communication optimizations
- Deployment and code generation: homogeneous multicore; heterogeneous multicore and accelerators
- Automatic parallelization revisited
- . . .
53
Related Work (Incomplete List)
- Analysis of nondeterministic sequentially consistent parallel programs: [Shasha, Snir 1988], [Midkiff et al 1989], [Chow, Harrison 1992], [Lee et al 1997], …
- Analysis of deterministic parallel programs with copy-in/copy-out semantics: [Srinivasan 1994], [Ferrante et al 1996], …
- Value-oriented semantics for functional subsets of PDGs: [Selke 1989], [Cartwright, Felleisen 1989], [Beck, Pingali 1989], [Ottenstein, Ballance, Maccabe 1990], …
- Serialization of restricted subsets of PDGs: [Ferrante, Mace, Simons 1988], [Simons et al 1990], …
- Concurrency analysis: [Long, Clarke 1989], [Duesterwald, Soffa 1991], [Masticola, Ryder 1993], [Naumovich, Avrunin 1998], [Agarwal et al 2007], …
54
PLDI 2008 Tutorial (Tucson, AZ)
Analysis and Optimization of Parallel Programs
- Intermediate representations for parallel programs
- Data flow analysis frameworks for parallel programs
- Locality analyses: scalar/array privatization, escape analysis of objects, locality types
- Memory models and their impact on code optimization of locks and transactional memory operations
- Optimizations of task partitions and synchronization operations
Sam Midkiff, Vivek Sarkar
Sunday afternoon (June 8, 2008, 1:30pm - 5:00pm)
55
Conclusions
- New paradigm shift in Code Optimization due to Parallel Programs
- Foundations of Code Optimization will need to be revisited from scratch; foundations will impact high-level and low-level optimizers, as well as tools
- Exciting times to be a compiler researcher!