Optimization and Parallelization of the FIND Algorithm
Song Li, Eric Darve
Institute for Computational and Mathematical Engineering, Stanford
[email protected]
SIAM CSE09, March 4, 2009
Outline
1 Background
2 Serial FIND (Fast Inverse using Nested Dissection)
3 Simulation Results
4 Parallel Methods
Introduction

Modeling the current through nano-devices by the Non-Equilibrium Green's Function approach
System of Schrödinger-Poisson equations
Best known algorithm (RGF) has running time O(nx^3 ny)
Our method (FIND): O(nx^2 ny)
Other devices: nanotubes and nanowires
The Math Problem

What we want: the diagonal of G^r = A^(-1)
What we have: a sparse matrix A from a discretized 2D mesh
Example: a 4 × 5 mesh (nx = 4, ny = 5) yields a 20 × 20 matrix A
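The local connectivity can be made concrete with a small sketch. The following pure-Python helper (the name `mesh_matrix` and the 5-point-stencil entries are illustrative, not from the talk) assembles a matrix on an nx × ny grid as a stand-in for the discretized device equations; the real matrix entries differ, but the sparsity pattern is the same:

```python
def mesh_matrix(nx, ny):
    """Assemble a 5-point-stencil matrix on an nx-by-ny grid (dense storage,
    for illustration only): each node couples to itself and its four mesh
    neighbors, which is the local-connectivity pattern FIND relies on."""
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    for x in range(nx):
        for y in range(ny):
            k = x * ny + y  # row-major node index
            A[k][k] = 4.0
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                xx, yy = x + dx, y + dy
                if 0 <= xx < nx and 0 <= yy < ny:
                    A[k][xx * ny + yy] = -1.0
    return A
```

For nx = 4, ny = 5 this gives the 20 × 20 matrix of the example, symmetric and with at most five nonzeros per row.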
Key Observations

The last entry of A^(-1) can be obtained through LU factorization: (A^(-1))_nn = (U^(-1))_nn = (U_nn)^(-1)
Obtain all the diagonal entries through multiple factorizations
Local connectivity ⇒ problem decomposition: partial factorizations are feasible
Proper ordering makes most of them identical: subproblems overlap ⇒ dynamic programming
The computational cost for all the diagonal entries of the inverse is of the same order as a single LU factorization!
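The first observation can be checked directly on a toy matrix. This minimal sketch (dense storage, no pivoting, illustrative names) factors A = LU and solves A x = e_n; the last component of x is (A^(-1))_nn:

```python
def lu(A):
    """Doolittle LU factorization without pivoting (assumes nonzero pivots)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def last_column_of_inverse(L, U):
    """Solve A x = e_n by forward/back substitution; x is the last column
    of A^(-1), so x[-1] = (A^(-1))_nn."""
    n = len(L)
    y = [0.0] * n
    y[-1] = 1.0
    for i in range(n):  # forward substitution with L
        for j in range(i):
            y[i] -= L[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= U[i][j] * x[j]
        x[i] /= U[i][i]
    return x
```

The very last back-substitution step divides by U_nn, which is exactly where the identity (A^(-1))_nn = (U_nn)^(-1) comes from.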
Overall Structure: Partition Tree

Order the mesh nodes in a way similar to nested dissection
Partition the whole mesh and form a tree structure to exploit the subproblem overlap
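A minimal sketch of such a partition tree over a 1D ordering of the clusters (the 2D nested-dissection version splits along separators instead; the names here are illustrative):

```python
def partition_tree(lo, hi, leaf_size=2):
    """Recursively bisect the index range [lo, hi) into a binary tree of
    clusters, stopping at small leaf clusters."""
    node = {"range": (lo, hi), "children": []}
    if hi - lo > leaf_size:
        mid = (lo + hi) // 2
        node["children"] = [partition_tree(lo, mid, leaf_size),
                            partition_tree(mid, hi, leaf_size)]
    return node

def leaves(node):
    """Leaf clusters of the tree, left to right."""
    if not node["children"]:
        return [node["range"]]
    return [r for child in node["children"] for r in leaves(child)]
```

The leaves partition the whole index set, and every internal node is the union of its two children, which is what lets partial eliminations be shared between overlapping subproblems.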
One Step of Elimination

Gaussian elimination of a cluster's inner nodes (i) updates its boundary nodes (b), leaving the outer nodes (o) untouched:

A*(b, b) := A(b, b) − A(b, i) A(i, i)^(-1) A(i, b)

[ A(i,i)  A(i,b)  0      ]                  [ A(i,i)  A(i,b)   0      ]
[ A(b,i)  A(b,b)  A(b,o) ]  elimination ⇒   [ 0       A*(b,b)  A(b,o) ]
[ 0       A(o,b)  A(o,o) ]                  [ 0       A(o,b)   A(o,o) ]

[Figure legend: eliminated node, inner node, boundary node, outer node]
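The elimination step above is a Schur complement, and can be checked against plain Gaussian elimination on a toy 4 × 4 matrix (pure-Python sketch with illustrative names; the inner set i is taken as the first two indices):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Inverse of a 2-by-2 block (enough for this toy example)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def one_step_elimination(A, ni):
    """A*(b,b) = A(b,b) - A(b,i) A(i,i)^(-1) A(i,b), with the inner nodes i
    taken as the first ni indices and the boundary nodes b as the rest."""
    Aii = [row[:ni] for row in A[:ni]]
    Aib = [row[ni:] for row in A[:ni]]
    Abi = [row[:ni] for row in A[ni:]]
    Abb = [row[ni:] for row in A[ni:]]
    T = matmul(matmul(Abi, inv2(Aii)), Aib)
    return [[Abb[r][c] - T[r][c] for c in range(len(T[0]))]
            for r in range(len(T))]

def gauss_eliminate(A, steps):
    """Forward-eliminate the first `steps` pivots; the trailing block then
    equals the Schur complement computed above."""
    A = [row[:] for row in A]
    for k in range(steps):
        for i in range(k + 1, len(A)):
            m = A[i][k] / A[k][k]
            for j in range(len(A)):
                A[i][j] -= m * A[k][j]
    return A
```

Both routes produce the same trailing block, which is why one step of FIND's elimination is just a block of an ordinary LU factorization.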
Two Full Elimination Processes

Keep partitioning the mesh to get small clusters
Store the results of each partial elimination
The partial results can be reused

[Figure: elimination sequence on the mesh; legend: eliminated node, inner node, boundary node, outer node, target node]
Extensions and Optimizations

G^< = A^(-1) Σ A^(-†) has a similar sparsity pattern, so our method is applicable as well
Also applicable to computing off-diagonal entries
Extra sparsity: in the one-step elimination A*(b, b) := A(b, b) − A(b, i) A(i, i)^(-1) A(i, b), these blocks are themselves sparse. Exploit this to optimize!
The elimination preserves symmetry, and this further reduces the cost

[Figure: sparsity pattern of the blocks involved in the elimination]
Simulation Device
Running Time Comparison (Log-Log Scale with Reference Lines)

[Figure: running time (seconds) vs. n (= Nx = Ny) for n = 64 to 1024, comparing FIND and RGF against O(n^3) and O(n^4) reference lines]

FIND: O(n^3)
RGF: O(n^4)
Memory Cost Comparison

FIND: O(N log N)
RGF: O(N^(3/2))
How to Parallelize?

Straightforward for leaf clusters
Top-level clusters dominate the running time but offer less parallelism
Use the idle processors for redundant computations
More floating-point operations but shorter wall-clock time
Works for 1D, 2D, and 3D domains
Problem and Processor Settings

[Figure: a 1D array of 16 clusters, one per processor P0 through P15]

16 processors, 16 clusters in 1D
One target cluster per processor
Keep merging all the other clusters until they are all merged as the complement of the target cluster
Eliminate the merged complement clusters and compute the inverse
Detailed Merging Process

[Figure: the subdomain-doubling merge schedule across processors P0 through P15]

Each processor keeps the complement of its target cluster with respect to the current subdomain
Start with subdomains of size 2
Expand to subdomains of size 4
Some processors are idle; use them to prepare for the next subdomain expansion
Continue until the subdomain has been expanded to the whole domain
Additional speedup of a factor of 2
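The doubling schedule can be simulated abstractly. This sketch (illustrative name, assumes a power-of-two number of clusters in 1D) tracks only which clusters each processor has merged into its complement; in the real algorithm each merge is a partial elimination, not a set union:

```python
def merge_schedule(P):
    """Simulate the subdomain-doubling schedule: after each round, every
    processor p holds the complement of its target cluster p within its
    current subdomain, whose size doubles per round."""
    comp = {p: set() for p in range(P)}
    size = 1
    while size < P:
        for p in range(P):
            base = p // (2 * size) * (2 * size)  # start of the doubled subdomain
            lower = set(range(base, base + size))
            upper = set(range(base + size, base + 2 * size))
            # merge the sibling half-subdomain into p's complement
            comp[p] |= upper if p in lower else lower
        size *= 2
    return comp
```

After log2(P) rounds every processor holds the complement of exactly its own target cluster, which is what lets all P diagonal blocks be computed concurrently.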
Communication Pattern
Summary

Direct method for a fast inverse
Two extensions, two optimizations
An optimal parallel scheme
Collaboration with other groups for more applications