
Finding Inherent Parallelism

Dr Wayne Kelly

Senior Lecturer

Queensland University of Technology

Parallelization

[Diagram: a Sequential Program becomes a Parallel Program by one of two routes: change the algorithm, or exploit the program's inherent parallelism.]

Inherent Parallelism

1: a = 1;
2: b = x + 1;
3: c = a + b;

for (i=1; i<n; i++)
    a[i] = 0;

<<parallel>>
for (i=1; i<n; i++)
    a[i] = 0;

[Diagram: dependence graph in which statements 1 and 2 are independent and statement 3 depends on both, so 1 and 2 can execute in parallel.]
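As a minimal sketch (not from the slides), the <<parallel>> loop above could be realized with .NET's Parallel.For; the array a and bound n follow the example:

using System.Threading.Tasks;

class ParallelLoopSketch
{
    static void Main()
    {
        int n = 1000;
        int[] a = new int[n];

        // Safe to parallelize: each iteration writes only its own a[i],
        // so no two iterations touch the same memory location.
        Parallel.For(1, n, i => { a[i] = 0; });
    }
}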

Exploiting Inherent Parallelism

1. Which steps can be performed in parallel?
– requires analysis of dependencies

2. Which of those steps are worthwhile performing in parallel?
– requires profiling or performance prediction

3. What code do we need to generate to efficiently execute those steps in parallel?
– mapping computation to threads, synchronization, aggregation, etc. (see the sketch below)
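A hedged sketch (not from the slides) of what step 3 can look like in .NET: iterations are mapped to threads by Parallel.For, each thread aggregates into a private partial sum, and a lock synchronizes the final combine. The names data, total and gate are illustrative.

using System;
using System.Threading.Tasks;

class ReductionSketch
{
    static void Main()
    {
        int n = 1000000;
        double[] data = new double[n];
        double total = 0;
        object gate = new object();

        Parallel.For(0, n,
            () => 0.0,                                 // per-thread partial sum
            (i, state, partial) => partial + data[i],  // iterations mapped to threads
            partial => { lock (gate) { total += partial; } }); // synchronized aggregation

        Console.WriteLine(total);
    }
}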


Automatic Parallelization

Parallelization can be performed:
• Automatically by a tool/compiler, and/or
• Manually by a programmer.

Unfortunately:

• Current compilers are not smart enough to perform parallelization in the general case.

• Manual parallelization requires highly skilled programmers, and is very time consuming and error prone.


Control and Data Dependencies

Control Dependencies

for (i=0; i<n; i++)
{
    a[i] = speed(b[i],2);
    if (a[i] > 100)
        break;
    b[i] += a[i] * i;
}

Data Dependencies

Flow dependence (W -> R):
    a = 42;
    ...
    b = a + 1;

Output dependence (W -> W):
    a = 42;
    ...
    a = 0;

Anti dependence (R -> W):
    a = b + 1;
    ...
    b = 42;


Array Data Dependence Analysis

for (int i=0; i<n; i++)
    for (int j=i; j<n; j++)
        a[i, j+1] = a[n, j];

Any data dependencies between loop iterations?

∃ ir, jr, iw, jw : 0 ≤ ir < n ∧ ir ≤ jr < n ∧
                   0 ≤ iw < n ∧ iw ≤ jw < n ∧
                   iw = n ∧ jw + 1 = jr

If there is a (flow) data dependence then there must exist at least one iteration (ir, jr) that reads the same array element that is written by some iteration (iw, jw). Here the constraints are infeasible: iw = n contradicts iw < n, so the loop carries no flow dependence.
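A hedged brute-force check of the same question, enumerating write/read iteration pairs directly (a small n stands in for the symbolic bound):

class DependenceCheckSketch
{
    static void Main()
    {
        int n = 8;            // small instance; the argument holds for any n
        bool found = false;

        // Iteration (iw, jw) writes a[iw, jw+1]; iteration (ir, jr) reads a[n, jr].
        // They touch the same element only if iw == n and jw + 1 == jr.
        for (int iw = 0; iw < n; iw++)
            for (int jw = iw; jw < n; jw++)
                for (int ir = 0; ir < n; ir++)
                    for (int jr = ir; jr < n; jr++)
                        if (iw == n && jw + 1 == jr)
                            found = true;

        // Never found: iw < n contradicts iw == n, so the constraint
        // system is infeasible and there is no flow dependence.
        System.Console.WriteLine(found ? "dependence found" : "no flow dependence");
    }
}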


Pointer/Reference Aliasing

Results a = new Results(x, 100);

Results b = a;

a.Normalize();

b.FindBest(selector);

[Diagram: the references a and b both point to the same Results object.]


Static Analysis Complicated By

• Complex inter-procedural data flow

• Pointers and pass by reference

• Object-oriented virtual function calls

• Component-oriented development

• This is why 99% of automatic parallelizing compilers are for Fortran or other languages with similar semantic restrictions.


Conservative Analysis

• Static data dependence analysis is inexact

– Conservatively overestimates data dependencies

– Results in underestimation of parallelism

• For large, complex, object-oriented applications we often end up finding that virtually nothing can be executed in parallel.

• How much potential parallelism are we missing?


Runtime Analysis

• At runtime we have none of these problems.

• We know precisely:

– which memory locations are accessed

– which virtual methods get invoked

– which components are dynamically loaded

– which branches are taken

– which exceptions are thrown


Dynamic Data Dependence Analysis

• Instrument the code so as to record which memory addresses are read or written by each instruction instance as it executes.

[Diagram: instruction instance A writes memory location mx, and instruction instance B later reads mx, creating a data dependence from A to B.]
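A hedged sketch of the instrumentation idea: each executed read or write calls a hook with the instruction instance's id and the address it touched, and a flow dependence is reported when a read hits an address with a known last writer. OnRead/OnWrite and the id scheme are illustrative, not the tool's actual API.

using System.Collections.Generic;

class TraceHooks
{
    // address -> instruction instance that last wrote it
    static Dictionary<long, int> lastWriter = new Dictionary<long, int>();

    public static void OnWrite(int instr, long addr)
    {
        lastWriter[addr] = instr;
    }

    public static void OnRead(int instr, long addr)
    {
        int writer;
        if (lastWriter.TryGetValue(addr, out writer))
            System.Console.WriteLine("flow dependence: " + writer + " -> " + instr);
    }
}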


Problems with Runtime Analysis

• May not detect all data dependencies
– May vary with input data
– Cannot guarantee that code is safe to run in parallel
– But can prove that code is not parallelizable, and show why

• Static analysis:
– Upper bound on data dependencies
– Lower bound on parallelism

• Dynamic analysis:
– Lower bound on data dependencies
– Upper bound on parallelism


Runtime Structures

• Create a node for each executed instruction instance that reads or writes memory.

• Dictionary which maps each memory location to the most recent instruction that wrote it
– allows flow and output dependencies to be detected

• Dictionary which maps each memory location to the set of all instructions which have read it since it was last written
– allows anti dependencies to be detected (both dictionaries are sketched below)

• Also need to consider object allocation, de-allocation and garbage collection.
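A hedged sketch of the two dictionaries described above: lastWriter detects flow (W -> R) and output (W -> W) dependences, while readersSince detects anti (R -> W) dependences. The class and method names are illustrative.

using System.Collections.Generic;

class DependenceRecorder
{
    // memory location -> most recent writing instruction instance
    Dictionary<long, int> lastWriter = new Dictionary<long, int>();

    // memory location -> instructions that read it since the last write
    Dictionary<long, HashSet<int>> readersSince = new Dictionary<long, HashSet<int>>();

    public void Read(int instr, long addr)
    {
        int w;
        if (lastWriter.TryGetValue(addr, out w))
            Report("flow", w, instr);              // W -> R

        HashSet<int> readers;
        if (!readersSince.TryGetValue(addr, out readers))
            readersSince[addr] = readers = new HashSet<int>();
        readers.Add(instr);
    }

    public void Write(int instr, long addr)
    {
        int w;
        if (lastWriter.TryGetValue(addr, out w))
            Report("output", w, instr);            // W -> W

        HashSet<int> readers;
        if (readersSince.TryGetValue(addr, out readers))
        {
            foreach (int r in readers)
                Report("anti", r, instr);          // R -> W
            readers.Clear();                       // these reads now precede the last write
        }

        lastWriter[addr] = instr;
    }

    void Report(string kind, int src, int sink)
    {
        System.Console.WriteLine(kind + " dependence: " + src + " -> " + sink);
    }
}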


Loop Parallelism

• To determine which loops are parallelizable, we need to associate instruction instances with loop iterations (one possible bookkeeping scheme is sketched below).

[Diagram: instruction instances A and B are nested inside a hierarchy of method calls (Methods A–D) and loop iterations (iterations 1, 3 and 5); instance A last wrote memory location mx, which instance B reads, so the dependence can be attributed to the enclosing loop iterations.]
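A hedged sketch of one way to do this bookkeeping: the instrumented program maintains a stack of (loop id, iteration) frames, and each recorded instruction instance is stamped with a snapshot of that stack, so a dependence can later be attributed to the loops whose iterations differ between its two endpoints. All names are illustrative.

using System.Collections.Generic;

class IterationContext
{
    // one frame per currently-executing loop: (loop id, current iteration)
    static Stack<KeyValuePair<int, long>> frames =
        new Stack<KeyValuePair<int, long>>();

    public static void EnterLoop(int loopId)
    {
        frames.Push(new KeyValuePair<int, long>(loopId, 0));
    }

    public static void NextIteration()
    {
        KeyValuePair<int, long> top = frames.Pop();
        frames.Push(new KeyValuePair<int, long>(top.Key, top.Value + 1));
    }

    public static void ExitLoop()
    {
        frames.Pop();
    }

    // snapshot attached to each recorded instruction instance
    public static KeyValuePair<int, long>[] Snapshot()
    {
        return frames.ToArray();
    }
}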

Visualization of Data Dependencies

• IDE extension to overlay source code with colour-coded arrows linking data dependence source and sink instructions:

for (int i = 0; i < n; i++)
{
    a[i] = 0;
    int x = b[n-i];
    b[i] = foo(i-1);
}

Here, for example, the read of b[n-i] in one iteration and the write of b[i] in another iteration can touch the same element, giving a loop-carried dependence to visualize.

• Refactoring to support Parallelization


What else can we do?

1. Determine which loops appear to be parallelizable
– and, for those that are not, determine why

2. Analyse parallelism considering only flow dependencies, to investigate the parallelism that might result if source transformations such as loop variable privatization were applied to eliminate output and anti dependences.

3. Measure the total theoretical amount of parallelism by determining the earliest execution time of each instruction, assuming an infinite number of processors (see the sketch below):

Parallelism = total number of instructions executed / length of longest sequential path

For the earlier three-statement example (statements 1 and 2 independent, statement 3 depending on both), 3 instructions execute and the longest sequential path has length 2, giving a parallelism of 3/2 = 1.5.
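A hedged sketch of measurement 3: with unlimited processors, an instruction can run one step after the latest of its dependence predecessors, so the critical path is the maximum of those earliest times. The deps representation is illustrative.

using System;
using System.Collections.Generic;

class CriticalPathSketch
{
    // deps[i] lists the earlier instruction instances that i depends on
    static double Parallelism(List<int>[] deps)
    {
        int[] earliest = new int[deps.Length];
        int critical = 0;

        for (int i = 0; i < deps.Length; i++)
        {
            int ready = 0;                        // time all inputs are available
            foreach (int d in deps[i])
                ready = Math.Max(ready, earliest[d]);
            earliest[i] = ready + 1;              // runs one step later
            critical = Math.Max(critical, earliest[i]);
        }
        return (double)deps.Length / critical;    // total work / longest path
    }

    static void Main()
    {
        // the earlier three-statement example: 1 and 2 independent, 3 needs both
        List<int>[] deps = {
            new List<int>(), new List<int>(), new List<int> { 0, 1 }
        };
        Console.WriteLine(Parallelism(deps));     // 3 / 2 = 1.5
    }
}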


What’s the Purpose of the Tool?

1. Help programmers parallelize specific applications.

2. Analyse a large collection of general purpose applications to determine:

a) How much parallelism is typical?

b) What different forms does it commonly arise in?

c) What parallelism are we missing with current techniques?


Implementation

Data collection: CIL code is instrumented via the Mono .NET JIT; x86 native code is instrumented via Intel PIN.

Dependence info: written out as an XML dependence file.

IDE: a Visual Studio extension and an Eclipse plug-in consume the dependence info.


Questions

Wayne Kelly ([email protected])
