Code Optimization of Parallel Programs
Vivek Sarkar, Rice University
vsarkar@rice.edu
[Figure: multicore chip floorplan with an L3 directory/control, three L2 caches, and duplicated per-core units: LSU, IFU, BXU, IDU, FPU, FXU, ISU]
2
Parallel Software Challenges & Focus Area for this Talk
[Figure: parallel software stack, annotated with the parallelism challenges at each layer]
- Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
- Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
- Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
- Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
- Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
- Static & Dynamic Optimizing Compilers: parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization
- Multicore Back-ends: code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management
- Parallel Runtime & System Libraries: task scheduling, synchronization, parallel data structures
- OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)
3
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
4
Our Current Paradigm for Code Optimization has served us well for Fifty Years …
[Figure: Stretch-Harvest compiler organization (1958-1962): Fortran, Autocoder II, and ALPHA front ends each translate to a common IL; the IL flows through the OPTIMIZER and the REGISTER ALLOCATOR, then to the ASSEMBLER, which emits object code for the STRETCH and STRETCH-HARVEST machines]
Source: "Compiling for Parallelism", Fran Allen, Turing Lecture, June 2007
5
… and has been adapted to meet challenges along the way …
- Interprocedural analysis
- Array dependence analysis
- Pointer alias analysis
- Instruction scheduling & software pipelining
- SSA form
- Profile-directed optimization
- Dynamic compilation
- Adaptive optimization
- Auto-tuning
- . . .
6
… but is now under siege because of parallelism
- Proliferation of parallel hardware: multicore, manycore, accelerators, clusters, …
- Proliferation of parallel libraries and languages: OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, …
7
Paradigm Shifts
"The Structure of Scientific Revolutions", Thomas S. Kuhn (1970)
- A paradigm is a scientific structure or framework consisting of assumptions, laws, and techniques
- Normal science is a puzzle-solving activity governed by the rules of the paradigm; it is uncritical of the current paradigm
- Crisis sets in when a series of serious anomalies appears: "The emergence of new theories is generally preceded by a period of pronounced professional insecurity"; scientists engage in philosophical and metaphysical disputes
- A revolution or paradigm shift occurs when an entire paradigm is replaced by another
8
Kuhn’s History of Science
[Figure: Kuhn's cycle: Immature Science → Normal Science → Anomalies → Crisis → Revolution]
- Revolution: a new paradigm emerges
- Old Theory: well established, many followers, many anomalies
- New Theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions
Source: www.philosophy.ed.ac.uk/ug_study/ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt
9
Some Well Known Paradigm Shifts
- Newton’s Laws to Einstein’s Theory of Relativity
- Ptolemy’s geocentric view to Copernicus and Galileo’s heliocentric view
- Creationism to Darwin’s Theory of Evolution
10
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches
- Rice Habanero Multicore Software project
11
What anomalies do we see when optimizing parallel code?
Examples:
1. Control flow rules
2. Data flow rules
3. Load elimination rules
12
1. Control Flow Rules from Sequential Code Optimization
Control Flow Graph
- Node = basic block
- Edge = transfer of control flow
- Succ(b) = successors of block b
- Pred(b) = predecessors of block b

Dominators
- Block d dominates block b if every (sequential) path from START to b includes d
- Dom(b) = set of dominators of block b
- Every block has a unique immediate dominator (parent in the dominator tree)
13
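The Dom(b) definition above can be sketched as a short iterative fixpoint. This Python illustration is not from the talk; it uses the CFG of the following example slide.

```python
# Iterative dominator computation:
#   Dom(b) = {b} ∪ ( intersection of Dom(p) over predecessors p of b )
def dominators(succ, start):
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}   # initialize to "all nodes"
    dom[start] = {start}
    changed = True
    while changed:                          # iterate to a fixpoint
        changed = False
        for n in nodes - {start}:
            new = {n}
            if pred[n]:
                new |= set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

cfg = {"START": ["BB1"], "BB1": ["BB2", "BB3"],
       "BB2": ["BB4"], "BB3": ["BB4"],
       "BB4": ["STOP"], "STOP": []}
dom = dominators(cfg, "START")
# BB1 dominates BB4; neither BB2 nor BB3 does, since each can be bypassed.
```

The unique immediate dominator of each block is the closest element of Dom(b) − {b}, which yields the dominator tree shown on the next slide.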
Dominator Example
[Figure: Control Flow Graph: START → BB1; BB1 branches (T/F) to BB2 and BB3; BB2 → BB4; BB3 → BB4; BB4 → STOP]

[Figure: Dominator Tree: START → BB1; BB1's children are BB2, BB3, and BB4; BB4 → STOP]
14
Anomalies in Control Flow Rules for Parallel Code
BB1
parbegin
  BB2
||
  BB3
parend
BB4

- Does BB4 have a unique immediate dominator?
- Can the dominator relation be represented as a tree?

[Figure: Parallel Control Flow Graph: BB1 → FORK; FORK → BB2 and BB3; BB2, BB3 → JOIN; JOIN → BB4]
15
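The anomaly can be made concrete: if the FORK/JOIN edges are fed unchanged to the sequential dominator algorithm, BB2 and BB3 drop out of Dom(BB4), even though every parallel execution runs both before BB4. A Python sketch (same iterative algorithm as the earlier illustration, not code from the talk):

```python
# Sequential dominators applied naively to the parallel CFG of this slide.
def dominators(succ, start):
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            new = {n}
            if pred[n]:
                new |= set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

pcfg = {"BB1": ["FORK"], "FORK": ["BB2", "BB3"],
        "BB2": ["JOIN"], "BB3": ["JOIN"],
        "JOIN": ["BB4"], "BB4": []}
seq_dom = dominators(pcfg, "BB1")
# The intersection rule treats FORK's out-edges as *alternative* paths,
# so it concludes neither BB2 nor BB3 dominates BB4. Under parallel
# semantics both always execute before BB4, yet neither dominates the
# other: BB4 has no unique immediate dominator, and the relation is no
# longer a tree.
```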
2. Data Flow Rules from Sequential Code Optimization
Example: Reaching Definitions
REACHin(n) = set of definitions d s.t. there is a (sequential) path from d to n in the CFG, and d is not killed along that path.
16
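The REACHin equation pairs with REACHout(n) = GEN(n) ∪ (REACHin(n) − KILL(n)). A minimal Python sketch over a hypothetical three-node straight-line CFG (the names d1, d2, use are invented for illustration):

```python
# Iterative reaching-definitions analysis:
#   REACHin(n)  = union of REACHout(p) over predecessors p of n
#   REACHout(n) = GEN(n) ∪ (REACHin(n) − KILL(n))
def reaching_definitions(succ, gen, kill):
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for a in succ:
        for b in succ[a]:
            pred[b].append(a)
    rin = {n: set() for n in nodes}
    rout = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            new_in = set().union(*(rout[p] for p in pred[n])) if pred[n] else set()
            new_out = gen[n] | (new_in - kill[n])
            if new_in != rin[n] or new_out != rout[n]:
                rin[n], rout[n], changed = new_in, new_out, True
    return rin

# d1 and d2 both define x; d2 kills d1, so only d2 reaches the use.
succ = {"d1": ["d2"], "d2": ["use"], "use": []}
gen  = {"d1": {"d1:x"}, "d2": {"d2:x"}, "use": set()}
kill = {"d1": {"d2:x"}, "d2": {"d1:x"}, "use": set()}
rin = reaching_definitions(succ, gen, kill)
```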
Anomalies in Data Flow Rules for Parallel Code
What definitions reach COEND?
What if there were no synchronization edges?
How should the data flow equations be defined for parallel code?
[Figure: parallel flow graph with control and sync edges for the code below]

S1: X1 := …
parbegin // Task 1
  S2: X2 := … post(ev2);
  S3: . . . post(ev3);
  S4: wait(ev8); X4 := …
|| // Task 2
  S5: . . .
  S6: wait(ev2);
  S7: X7 := …
  S8: wait(ev3); post(ev8);
parend
. . .
17
3. Load Elimination Rules from Sequential Code Optimization
A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P.

Before:
T1 := *q
T2 := *p
T3 := *q

After:
T1 := *q
T2 := *p
T3 := T1
18
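The sequential rule can be sketched as a single forward pass that tracks which temporary currently holds each loaded address. The toy three-address form ("t := *p") is an illustration, not the talk's notation, and the pass is only sound when no intervening store can change the location, which is exactly the assumption the next slides break.

```python
# Redundant-load elimination over one straight-line block.
# Instructions are (dst, src) pairs; src "*p" is a load, anything
# else is a register copy.
def eliminate_loads(block):
    avail = {}                  # address expression -> temp holding its value
    out = []
    for dst, src in block:
        if src.startswith("*") and src in avail:
            out.append((dst, avail[src]))   # reuse the earlier load
        else:
            out.append((dst, src))
            if src.startswith("*"):
                avail[src] = dst            # record the loaded value
    return out

before = [("T1", "*q"), ("T2", "*p"), ("T3", "*q")]
after = eliminate_loads(before)
# The third instruction T3 := *q becomes T3 := T1.
```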
Anomalies in Load Elimination Rules for Parallel Code (Original Version)
Task 1:
. . .
T1 := *q
T2 := *p
T3 := *q
print T1, T2, T3

Task 2:
. . .
*p = 1
. . .

Assume that p = q, and that *p = *q = 0 initially.

Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model. It is not permitted by Sequential Consistency [Lamport 1979], but it is permitted by Location Consistency [Gao & Sarkar 1993, 2000].
19
Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)
Task 1:
. . .
T1 := *q
T2 := *p
T3 := T1
print T1, T2, T3

Task 2:
. . .
*p = 1
. . .

Assume that p = q, and that *p = *q = 0 initially.

Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency, if load elimination is performed!
20
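The claims on these two slides can be checked mechanically by enumerating every sequentially consistent interleaving of the two tasks. A brute-force Python sketch (assuming p = q and a single shared location initialized to 0, as the slides do):

```python
from itertools import combinations

# Task 2 is the single store *p := 1. Task 1 is three loads (original)
# or two loads plus the copy T3 := T1 (after load elimination).
def outcomes(t3_is_load):
    results = set()
    n_t1 = 3 if t3_is_load else 2
    # Choose positions of Task 1's ops among n_t1 + 1 total ops;
    # the remaining slot is Task 2's store.
    for pos in combinations(range(n_t1 + 1), n_t1):
        mem = 0                                  # the one location *p == *q
        regs = {}
        names = iter(["T1", "T2", "T3"][:n_t1])
        for i in range(n_t1 + 1):
            if i in pos:
                regs[next(names)] = mem          # a load by Task 1
            else:
                mem = 1                          # Task 2's store
        if not t3_is_load:
            regs["T3"] = regs["T1"]              # T3 := T1 after elimination
        results.add((regs["T1"], regs["T2"], regs["T3"]))
    return results

# (0, 1, 0) is impossible under SC originally, but becomes possible
# once the third load is replaced by the register copy.
```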
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
21
Incremental Approaches to coping with Parallel Code Optimization
- Large investment in infrastructures for sequential code optimization
- Introduce ad hoc rules to incrementally extend them for parallel code optimization:
  - Code motion fences at synchronization operations
  - Task creation and termination via function call interfaces
  - Use of volatile storage modifiers
  - . . .
22
More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future
Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs:
- Abstract execution model for PIR
- Storage classes (types) for locality and memory hierarchies
- General framework for task partitioning and code motion in parallel code
- Compiler-friendly memory model
- Combining automatic parallelization and explicit parallelism
- . . .
23
Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]
A Program Dependence Graph, PDG = (N', Ecd, Edd), is derived from a CFG and consists of:
- N', a set of statement, predicate, and region nodes
- Ecd, a set of control dependence edges
- Edd, a set of data dependence edges
24
PDG Example
/* S1 */ max = a[i];
/* S2 */ div = a[i] / b[i];
/* S3 */ if ( max < b[i] )
/* S4 */   max = b[i];

[Figure: PDG with nodes S1, S2, S3, S4; S4 is control dependent on S3; data dependence edges for max: true dependence S1 → S3, output dependence S1 → S4, anti dependence S3 → S4]
25
PDG restrictions
Control Dependence
- Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node
- No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P
26
Violation of the Predicate-Ancestor Condition can lead to “non-serializable” PDGs [LCPC 1993]
Node 4 is executed twice in this acyclic PDG
“Parallel Program Graphs and their Classification”, V.Sarkar & B.Simons, LCPC 1993
27
PDG restrictions (contd.)
Data Dependence
- There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG
- The context C of a data dependence edge (A, B, C) must be plausible, i.e., it cannot identify a dependence from an execution instance IA of node A to an execution instance IB of node B if IB precedes IA in the CFG's execution; e.g., a data dependence from iteration i+1 to iteration i is not plausible in a sequential program
28
Limitations of Program Dependence Graphs
- PDGs and CFGs are tightly coupled: a transformation in one must be reflected in the other
- PDGs reveal maximum parallelism in the program; CFGs reveal sequential execution
- Neither is well suited for code optimization of parallel programs, e.g., how do we represent a partitioning of { 1, 3, 4 } and { 2 } into two tasks?
29
Another Limitation: no Parallel Execution Semantics defined for PDGs
- What is the semantics of control dependence edges with cycles?
- What is the semantics of data dependences when a source or destination node may have zero, one, or more instances?

A[f(i,j)] = …
… = A[g(i)]
30
Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992]
A Parallel Program Graph, PPG = (N, Econtrol, Esync), consists of:
- N, a set of compute, predicate, and parallel nodes; a parallel node creates parallel threads of computation for each of its successors
- Econtrol, a set of labeled control edges; edge (A, B, L) in Econtrol identifies a control edge from node A to node B with label L
- Esync, a set of synchronization edges; edge (A, B, F) in Esync defines a synchronization from node A to node B with synchronization condition F, which identifies the execution instances of A and B that need to be synchronized
“A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs”, V.Sarkar, LCPC 1992
31
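The PPG triple can be sketched as a small data structure. The node kinds and the sync edge built from the earlier post(ev2)/wait(ev2) pair are illustrative, and the synchronization condition F is reduced to an opaque label; this is not an actual Habanero implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PPG:
    """PPG = (N, Econtrol, Esync) as defined on this slide."""
    nodes: dict = field(default_factory=dict)     # name -> node kind
    control: list = field(default_factory=list)   # (src, dst, label L)
    sync: list = field(default_factory=list)      # (src, dst, condition F)

    def add_node(self, name, kind):
        assert kind in ("compute", "predicate", "parallel")
        self.nodes[name] = kind

g = PPG()
g.add_node("PAR", "parallel")   # forks its successors as parallel threads
g.add_node("S2", "compute")
g.add_node("S7", "compute")
g.control.append(("PAR", "S2", "all"))
g.control.append(("PAR", "S7", "all"))
# The post(ev2)/wait(ev2) pair from the earlier example becomes a sync edge
# whose condition F (here just a label) relates the instances to synchronize.
g.sync.append(("S2", "S7", "same-instance"))
```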
PPG Example
[Figure: example PPG]
32
Relating CFGs to PPGs
Construction of a PPG for a sequential program:
- PPG nodes = CFG nodes
- PPG control edges = CFG edges
- PPG synchronization edges = empty set
33
Relating PDGs to PPGs
Construction of a PPG for a PDG:
- PPG nodes = PDG nodes
- PPG parallel nodes = PDG region nodes
- PPG control edges = PDG control dependence edges
- PPG synchronization edges = PDG data dependence edges; the synchronization condition F in a PPG synchronization edge mirrors the context of the corresponding PDG data dependence edge
34
Example of Transforming PPGs
[Figure: example of transforming PPGs]
35
Abstract Interpreter for PPGs
- Build a partial order of dynamic execution instances of PPG nodes as PPG execution unravels
- Each execution instance IA is labeled with its history (calling context), H(IA)
- Initialize the partial order to a singleton set containing an instance of the start node, ISTART, with H(ISTART) initialized to the empty sequence
36
Abstract Interpreter for PPGs (contd.)
Each iteration of the scheduling algorithm:
- Selects an execution instance IA in the partial order such that all of IA's predecessors in the partial order have been scheduled
- Simulates execution of IA and evaluates branch label L
- Creates an instance IB of each c.d. successor B of A for label L
- Adds (IB, IC) to the partial order if instance IC has been created in the partial order and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)
- Adds (IC, IB) to the partial order if instance IC has been created in the partial order and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)
37
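For a branch-free fragment, where all instances can be created eagerly, the scheduling loop above reduces to topological scheduling of the partial order. A Python sketch (a simplification of the slide's algorithm, which creates instances lazily as branch labels are evaluated), using the instance names of the example that follows:

```python
from collections import deque

def schedule(instances, order_edges):
    """Schedule each instance once all its predecessors have run."""
    preds = {i: set() for i in instances}
    succs = {i: set() for i in instances}
    for a, b in order_edges:
        preds[b].add(a)
        succs[a].add(b)
    ready = deque(i for i in instances if not preds[i])
    trace = []
    while ready:
        inst = ready.popleft()
        trace.append(inst)              # "simulate execution of IA"
        for nxt in succs[inst]:
            preds[nxt].discard(inst)
            if not preds[nxt]:          # all predecessors scheduled
                ready.append(nxt)
    return trace

# IPAR forks I1, I2, I3; a sync edge orders I1 before I3.
trace = schedule(
    ["ISTART", "IPAR", "I1", "I2", "I3"],
    [("ISTART", "IPAR"), ("IPAR", "I1"), ("IPAR", "I2"),
     ("IPAR", "I3"), ("I1", "I3")],
)
```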
Abstract Interpreter for PPGs: Example
1. Create ISTART
2. Schedule ISTART
3. Create IPAR
4. Schedule IPAR
5. Create I1, I2, I3
6. Add (I1, I3) to the partial order
7. Schedule I2
8. Schedule I1
9. Schedule I3
10. . . .

[Figure: the example PPG driving this trace]
38
Weak (Deterministic) Memory Model for PPGs
- All memory accesses are assumed to be non-atomic
- Read-write hazard: if IA reads a location for which there is a parallel write of a different value, then the execution result is an error; analogous to an exception thrown if a data race occurs; may be thrown when the read or write operation is performed
- Write-write hazard: if IA writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined; execution results in an error if that location is subsequently read
- Separation of data communication and synchronization: data communication is specified by read/write operations; sequencing is specified by synchronization and control edges
39
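The two hazard rules can be sketched as a check over pairs of accesses known to be unordered (i.e., parallel). Deriving those pairs from a PPG's control and sync edges is elided here, and all names are illustrative.

```python
# Each access is (instance, op, location, value); parallel_pairs lists
# pairs of accesses unordered by control or sync edges.
def hazards(parallel_pairs):
    errs = []
    for (i1, op1, loc1, v1), (i2, op2, loc2, v2) in parallel_pairs:
        if loc1 != loc2 or v1 == v2:
            continue                  # different locations, or same value:
                                      # benign under this weak model
        ops = {op1, op2}
        if ops == {"read", "write"}:  # read-write hazard: error
            errs.append((i1, i2, loc1))
        elif ops == {"write"}:        # write-write hazard: undefined value,
            errs.append((i1, i2, loc1))   # error if later read
    return errs

# A read of x parallel with a write of a different value is a hazard;
# a parallel write of the *same* value is not.
errs = hazards([(("Ia", "read", "x", 0), ("Ib", "write", "x", 1))])
```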
Soundness Properties
- Reordering Theorem: for a given Parallel Program Graph G and input store i, the final store f = G(i) is the same for all possible scheduled sequences in the abstract interpreter
- Equivalence Theorem: a sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store
40
Reaching Definitions Analysis on PPGs [LCPC 1997]
“Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation”, V.Sarkar, LCPC 1997
A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P.
41
Reaching Definitions Analysis on PPGs
[Figure: PPG with control and sync edges for the code below]

S1: X1 := …
// Task 1
S2: X2 := … post(ev2);
S3: . . . post(ev3);
S4: wait(ev8); X4 := …
// Task 2
S5: . . .
S6: wait(ev2);
S7: X7 := …
S8: wait(ev3); post(ev8);
42
PPG Limitations
- Past work has focused on a comprehensive representation and semantics for deterministic programs
- Extensions needed for:
  - Atomicity and mutual exclusion
  - Stronger memory models
  - Storage classes with explicit locality
43
Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999]
Questions:
- Can the load of p.x be moved below the store of q.y?
- Can the load of p.x be moved outside the synchronized block?
- Can the load of r.z be moved inside the synchronized block?
- Can the load of r.z be moved back outside the synchronized block?
- How should the data dependences be modeled?

a = ...
synchronized (L) {
  ... = p.x
  q.y = ...
  b = ...
}
... = r.z
“Dependence Analysis for Java”, C.Chambers et al, LCPC 1999
44
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
45
Habanero Project (habanero.rice.edu)
[Figure: Habanero software stack]
- Parallel Applications, written in the X10/Habanero language (with sequential C, Fortran, Java, … via a Foreign Function Interface)
1) Habanero Programming Language
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
- Vendor Compiler & Libraries
- Multicore Hardware
46
2) Habanero Static Parallelizing & Optimizing Compiler
[Figure: compiler pipeline]
- Inputs: X10/Habanero language source; C / Fortran (restricted code regions for targeting accelerators & high-end computing); sequential C, Fortran, Java, … via a Foreign Function Interface
- Front End → AST → IRGen → Parallel IR (PIR)
- Interprocedural Analysis, then PIR Analysis & Optimization
- Outputs: Annotated Classfiles for the Portable Managed Runtime (via Classfile Transformations), and Partitioned Code for a platform-specific static compiler
47
Habanero Target Applications and Platforms

Applications:
- Parallel Benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, SciMark benchmarks
- Medical Imaging: back-end processing for Compressive Sensing (www.dsp.ece.rice.edu/cs); contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic Data Processing: Rice Inversion project (www.trip.caam.rice.edu); contacts: Bill Symes (Rice), James Gunning (CSIRO)
- Computer Graphics and Visualization: mathematical modeling and smoothing of meshes; contact: Joe Warren (Rice)
- Computational Chemistry: Fock Matrix Construction; contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero Compiler: implement the Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler

Platforms:
- AMD Barcelona Quad-Core
- Clearspeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cell
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- Sun UltraSparc T1, T2
- . . .

Additional suggestions welcome!
48
Habanero Research Topics

1) Language research
- Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore
- Implicit deterministic parallelism: array views, single-assignment constructs
- Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks
- Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90, and Sisal languages

2) Compiler research
- New Parallel Intermediate Representation (PIR)
- Automatic analysis, transformation, and parallelization of PIR
- Optimization of high-level arrays and iterators
- Optimization of synchronization, data transfer, and transactional memory operations
- Code partitioning for accelerators
- Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI, and PTRAN research compilers
49
Habanero Research Topics (contd.)

3) Virtual machine research
- VM support for work-stealing scheduling algorithms with extensions for places, transactions, task groups
- Runtime support for other Habanero language constructs (phasers, regions, distributions)
- Integration and exploitation of lightweight profiling in the VM scheduler and memory management system
- Builds on our experiences with the Jikes Research Virtual Machine

4) Concurrency library research
- New nonblocking data structures to support the Habanero runtime
- Efficient software transactional memory libraries
- Builds on our experiences with the java.util.concurrent and DSTM2 libraries

5) Toolkit research
- Program analysis for common parallel software errors
- Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context
- Builds on our experiences with the HPCToolkit, Eclipse PTP, and DrJava projects
50
Opportunities for Broader Impact
- Education: influence how parallelism is taught in future Computer Science curricula
- Open Source: build an open source testbed to grow an ecosystem for researchers in the Parallel Software area
- Industry standards: use research results as proofs of concept for new features that can be standardized; infrastructure can provide a foundation for reference implementations
Collaborations welcome!
51
Habanero Team (Nov 2007)
Send email to Vivek Sarkar (vsarkar@rice.edu) if you are interested in a PhD, postdoc, research scientist, or programmer position
in the Habanero project, or in collaborating with us!
52
Other Challenges in Code Optimization of Parallel Code
- Optimization of task coordination: task creation and termination (fork, join); mutual exclusion (locks, transactions); synchronization (semaphores, barriers)
- Data locality optimizations: computation and data alignment; communication optimizations
- Deployment and code generation: homogeneous multicore; heterogeneous multicore and accelerators
- Automatic parallelization revisited
- . . .
53
Related Work (Incomplete List)
- Analysis of nondeterministic sequentially consistent parallel programs: [Shasha, Snir 1988], [Midkiff et al 1989], [Chow, Harrison 1992], [Lee et al 1997], …
- Analysis of deterministic parallel programs with copy-in/copy-out semantics: [Srinivasan 1994], [Ferrante et al 1996], …
- Value-oriented semantics for functional subsets of PDGs: [Selke 1989], [Cartwright, Felleisen 1989], [Beck, Pingali 1989], [Ottenstein, Ballance, Maccabe 1990], …
- Serialization of restricted subsets of PDGs: [Ferrante, Mace, Simons 1988], [Simons et al 1990], …
- Concurrency analysis: [Long, Clarke 1989], [Duesterwald, Soffa 1991], [Masticola, Ryder 1993], [Naumovich, Avrunin 1998], [Agarwal et al 2007], …
54
PLDI 2008 Tutorial (Tucson, AZ)
Analysis and Optimization of Parallel Programs
- Intermediate representations for parallel programs
- Data flow analysis frameworks for parallel programs
- Locality analyses: scalar/array privatization, escape analysis of objects, locality types
- Memory models and their impact on code optimization of locks and transactional memory operations
- Optimizations of task partitions and synchronization operations
Sam Midkiff, Vivek Sarkar
Sunday afternoon (June 8, 2008, 1:30pm - 5:00pm)
55
Conclusions
- New paradigm shift in Code Optimization due to Parallel Programs
- Foundations of Code Optimization will need to be revisited from scratch; foundations will impact high-level and low-level optimizers, as well as tools
- Exciting times to be a compiler researcher!