domain decomposition in parallel computing ashok srinivasan asriniva florida state university cot...

33
Domain decomposition in parallel computing Ashok Srinivasan www.cs.fsu.edu/~asriniva Florida State University COT 5410 – Spring 2004

Upload: samson-griffith

Post on 25-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Domain decomposition in parallel computing

Ashok Srinivasan

www.cs.fsu.edu/~asriniva

Florida State University

COT 5410 – Spring 2004

Page 2: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Outline

• Background

• Geometric partitioning

• Graph partitioning– Static– Dynamic

• Important points

Page 3: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Background• Tasks in a parallel computation need access to

certain data• Same datum may be needed by multiple tasks

– Example: In matrix-vector multiplication, b2 is needed for the computation of all ci2, 1 < i < n

– If a process does not “own” a datum needed by its task, then it has to get it from a process that has it

• This communication is expensive

– Aims of domain decomposition• Distribute the data in such a manner that the communication

required is minimized

• Ensure that the computational loads on processes are balanced

Page 4: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Domain decomposition example

• Finite difference computation– New value of a node depends on old values of its

neighbors

• We want to divide the nodes amongst the processes so that – Communication is minimized

• Measure of partition quality

– Computational load is evenly balanced

Page 5: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Geometric partitioning

• Partition a set of points– Uses only coordinate information

• Balances the load– The heuristic tries to ensure that communication costs

are low

• Algorithms are typically fast, but partition not of high quality

• Examples– Orthogonal recursive bisection– Inertial– Space filling curves

Page 6: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Orthogonal recursive bisection

• Recursively bisect orthogonal to the longest dimension– Assume communication is proportional to the surface area of

the domain, and aligned with coordinate axes– Recursive bisection

• Divide into two pieces, keeping load balanced• Apply recursively, until desired number of partitions obtained

Page 7: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Inertial

• ORB may not be effective if cuts along the x, y, or z directions are not good ones

• Inertial– Recursively bisect

orthogonal to the inertial axis

Page 8: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Space filling curves

• Space filling curves– A continuous curve that fills the space– Order the points based on their relative

position on the curve– Choose a curve that preserves proximity

• Points that are close in space should be close in the ordering too

• Example– Hilbert curve

Page 9: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Hilbert curve

• Sources– http://www.dcs.napier.ac.uk/~andrew/hilbert.html– http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html

H1

H2

Hi

Hi+1 Hilbert curve = lim Hn

n

Page 10: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Domain decomposition with a space filling curve

• Order points based on their position on the curve

• Divide into P parts– P is the number of processes

• Space filling curves can be used in adaptive computations too

• They can be extended to higher dimensions too

Page 11: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Graph partitioning

• Model as graph partitioning– Graph G = (V, E)– Each task is represented by a vertex

• A weight can be used to represent the computational effort

– An edge exists between tasks if one needs data owned by the other

• Weights can be associated with edges too

– Goal• Partition vertices into P parts such that each partition has equal

vertex weights• Minimize the weights of edges cut• Problem is NP hard

– Edge cut metric• Judge the quality of the partitioning by the number of edges cut

Page 12: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Static graph partitioning

• Combinatorial– Levelized nested dissection – Kernighan-Lin/Feduccia-Matheyses

• Spectral partitioning

• Multi-level methods

Page 13: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Combinatorial partitioning

• Use only connectivity information

• Examples– Levelized nested dissection – Kernighan-Lin/Feduccia-Matheyses

Page 14: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Levelized nested dissection (LND)

• Idea is similar to the geometric methods– But cannot use coordinate information– Instead of projecting vertices along the longest

axis, order them based on distance from a vertex that may be one extreme of the longest dimension of a graph

• Pseudo-peripheral vertex– Perform a breadth-first search, starting from an arbitrary

vertex

– The vertex that is encountered last might be a good approximation to a peripheral vertex

Page 15: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

LND example Finding a pseudoperipheral vertex

Initial vertex

1

1

1

2

2

2

33

3

34

Pseudoperipheral vertex

Page 16: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

LND example – Partitioning

5

3

5

6

4

2

53

2

1

4

Initial vertexPartition

Recursively bisect the subgraphs

Page 17: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Kernighan-Lin/Fiduccia-Matheyses

• Refines an existing partition• Kernighan-Lin

– Consider pairs of vertices from different partitions– Choose a pair whose swapping will result in the best improvement

in partition quality• The best improvement may actually be a worsening

– Perform several passes• Choose best partition among those encountered

• Fiduccia-Matheyses– Similar but more efficient

• Boundary Kernighan-Lin– Consider only boundary vertices to swap

• ... and many other variants

Page 18: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Kernighan-Lin example

Better partition

Edge cut = 3

Existing partition

Edge cut = 4

Swap these

Page 19: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Spectral method

• Based on the observation that a Fiedler vector of a graph contains connectivity information

• Laplacian of a graph: L– lii = di (degree of vertex i)– lij = -1 if edge {i,j} exists, otherwise 0

• Smallest eigenvalue of L is 0 with eigenvector all 1• All other eigenvalues are positive for a connected graph

• Fiedler vector– Eigenvector corresponding to the second smallest

eigenvalue

Page 20: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Fiedler vector

• Consider a partitioning of V into A and B– Let yi = 1 if vi A, and yi = -1 if vi B– For load balance, i yi = 0

– Also eij E (yi-yj)2 = 4 x number of edges

across partitions

– Also, yTLy = i di yi2 – 2 eij E

yiyj

= eij E (yi-yj)2

Page 21: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Optimization problem

• The optimal partition is obtain by solving– Minimize yTLy– Constraints:

• yi {-1,1}• i yi = 0

– This is NP hard

• Relaxed problem– Minimize yTLy– Constraints:

• i yi = 0• Add a constraint on a norm of y, example, ||y||2 = n0.5

– Note• (1, 1, ..., 1)T is an eigenvector with eigenvalue 0• For a connected graph, all other eigenvalues are positive and orthogonal

to this eigenvector, which implies i yi = 0• The objective function is minimized by a Fiedler vector

Page 22: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Spectral algorithm

• Find a Fiedler vector of the Laplacian of the graph– Note that the Fiedler value (the second smallest eigenvalue)

yields a lower bound on the communication cost, when the load is balanced

• From the Fiedler vector, bisect the graph– Let all vertices with components in the Fiedler vector greater

than the median be in one component, and the rest in the other

• Recursively apply this to each partition• Note: Finding the Fiedler vector of a large graph can

be time consuming

Page 23: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel methods

• Idea– It takes time to partition a large graph– So partition a small graph instead!

• Three phases– Graph coarsening

• Combine vertices to create a smaller graph– Example: Find a suitable matching

• Apply this recursively until a suitably small graph is obtained

– Partitioning• Use spectral or another partitioning algorithm to partition the

small graph

– Multilevel refinement• Uncoarsen the graph to get a partitioning of the original graph• At each level, perform some graph refinement

Page 24: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

1

1315

16

14

Page 25: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

1

1315

16

14

Page 26: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

2

11

11

2

2

1

1

11

1315

16

14

Page 27: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

2

11

11

2

2

1315

16

14

1

1

1

Page 28: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

2

11

11

2

2

2

1

1

13 14

15

16 1

1

1

2

Page 29: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Dynamic partitioning

• We have an initial partitioning– Now, the graph changes– Determine a good partition, fast– Also minimize the number of vertices

that need to be moved

• Examples– PLUM– Jostle– Diffusion

Page 30: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

PLUM

• Partition based on the initial mesh– Vertex and edge weights alone changed

• Map partitions to processors– Use more partitions than processors

• Ensures finer granularity

– Compute a similarity matrix based on data already on a process• Measures savings on data redistribution cost for each (process,

partition) pair• Choose assignment of partitions to processors

– Example: Maximum weight matching» Duplicate each processor: # of partitions/P times

– Alternative: Greedy approximation algorithm » Assign in order of maximum similarity value

• http://citeseer.nj.nec.com/oliker98plum.html

Page 31: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

JOSTLE• Use Hu and Blake’s scheme for load balancing

– Solve Lx = b using Conjugate Gradient• L = Laplacian of processor graph, bi = Weight on process Pi

– Average weight

– Move max(xi-xj, 0) weight between Pi and Pj

• Leads to balanced load– Equivalent to Pi sending xi load to each neighbor j, and each

neighbor Pj sending xj to Pi

– Net loss in load for Pi = di xi - neighborj xj = L(i)x = bi

» where L(i) is row i of L, and di is degree of i– New load for Pi = weight on Pi - bi = average weight

• Leads to minimum L2 norm of load moved – Using max(xi-xj, 0)

• Select vertices to move, based on relative gain– http://citeseer.nj.nec.com/walshaw97parallel.html

Page 32: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Diffusion

• Involves only communication with neighbors• A simple scheme

– Processor Pi repeatedly sends wi weight to each neighbor

• wi = weight on Pi

• wk = (I – L) wk-1 , wk = weight vector at iteration k– Simple criteria exist for choosing to ensure convergence

» Example: = 0.5/(maxi di),

• More sophisticated schemes exist

Page 33: Domain decomposition in parallel computing Ashok Srinivasan asriniva Florida State University COT 5410 – Spring 2004

Important points• Goals of domain decomposition

– Balance the load– Minimize communication

• Space filling curves• Graph partitioning model

– Spectral method• Relax NP hard integer optimization to floating point, and then

discretize to get approximate integer solution

– Multilevel methods• Three phases

• Dynamic partitioning – additional requirements– Use old solution to find new one fast– Minimize number of vertices moved