Sparse Matrices and Combinatorial Algorithms in Star-P
Status and Plans
April 8, 2005
Matlab’s sparse matrix design principles
All operations should give the same results for sparse and full matrices (almost all)
Sparse matrices are never created automatically, but once created they propagate
Performance is important -- but usability, simplicity, completeness, and robustness are more important
Storage for a sparse matrix should be O(nonzeros)
Time for a sparse operation should be O(flops) (as nearly as possible)
Matlab’s sparse matrix design principles
Star-P dsparse matrices: same principles, some different tradeoffs
Sparse matrix operations
dsparse layout, same semantics as ddense
For now, only row distribution
Matrix operators: +, -, max, etc.
Matrix indexing and concatenation: A(1:3, [4 5 2]) = [B(:, 7) C];
A \ b by direct methods
Star-P sparse data structure
• Full:
• 2-dimensional array of real or complex numbers
• (nrows*ncols) memory

    31  0 53
     0 59  0
    41 26  0

• Sparse:
• compressed row storage
• about (2*nzs + nrows) memory

    values:          31 53 59 41 26
    column indices:   1  3  2  1  2
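As an aside, the same compressed-row layout can be inspected directly in SciPy, whose csr_matrix follows this design; a minimal sketch using the 3x3 example above (SciPy indices are 0-based):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The 3x3 example matrix from the slide.
A = np.array([[31,  0, 53],
              [ 0, 59,  0],
              [41, 26,  0]])
S = csr_matrix(A)

print(S.data)     # nonzero values, row by row: [31 53 59 41 26]
print(S.indices)  # 0-based column indices:     [0 2 1 0 1]
print(S.indptr)   # row pointers:               [0 2 3 5]
# Storage is just these three arrays: about 2*nnz + nrows words.
```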
Star-P distributed sparse data structure

[Figure: the matrix's rows are partitioned into contiguous blocks, one per processor P0, P1, P2, ..., Pn]

Each processor stores:
• # of local nonzeros
• range of local rows
• nonzeros in CSR form
Sorting in Star-P
[V, perm] = sort (V)
Useful primitive for some sparse array algorithms, e.g. the sparse constructor
Star-P uses a parallel sample sort
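A sample sort picks splitters from a random sample of the data, routes each element to a per-processor bucket by splitter range, and sorts each bucket locally. A minimal serial simulation in Python (illustrative only; `sample_sort` and its parameters are not Star-P names):

```python
import random

def sample_sort(values, p=4):
    """Sort `values` as p 'processors' would: sample, pick splitters,
    partition into p buckets, then sort each bucket locally."""
    # 1. Sample the data and choose p-1 splitters from the sorted sample.
    sample = sorted(random.sample(values, min(len(values), 8 * p)))
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # 2. Route each element to the bucket ("processor") it belongs to.
    buckets = [[] for _ in range(p)]
    for v in values:
        b = sum(v >= s for s in splitters)   # number of splitters v passes
        buckets[b].append(v)
    # 3. Each processor sorts its bucket; concatenation is globally sorted.
    return [v for bucket in buckets for v in sorted(bucket)]

data = [random.randrange(1000) for _ in range(200)]
assert sample_sort(data) == sorted(data)
```

In the parallel setting step 2 is the only all-to-all communication; the splitters guarantee each bucket holds a contiguous range of the sorted output.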
The sparse( ) constructor
A = sparse (I, J, V, nr, nc)
Input: ddense vectors I, J, V, and dimensions nr, nc
A(I(k), J(k)) = V(k) for each k
Sort triples (i, j, v) by (i, j)
Adjust for desired row distribution
Locally convert to compressed row indices
Sum values with duplicate indices
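The steps above can be sketched serially in Python (pure-Python and 0-based; the row-distribution step is omitted, and the function name is illustrative):

```python
def sparse_from_triples(I, J, V, nr, nc):
    """Build CSR arrays (data, indices, indptr) from (i, j, v) triples."""
    triples = sorted(zip(I, J, V))            # sort triples by (i, j)
    data, indices = [], []
    counts = [0] * nr                         # nonzeros per row
    last = None
    for (i, j, v) in triples:
        if (i, j) == last:
            data[-1] += v                     # sum values with duplicate indices
        else:
            data.append(v)
            indices.append(j)
            counts[i] += 1
            last = (i, j)
    indptr = [0]                              # convert to compressed row pointers
    for c in counts:
        indptr.append(indptr[-1] + c)
    return data, indices, indptr

# The duplicate entry at (2, 1) is summed: 13 + 13 = 26.
d, ix, p = sparse_from_triples([0, 0, 1, 2, 2, 2], [0, 2, 1, 0, 1, 1],
                               [31, 53, 59, 41, 13, 13], 3, 3)
assert (d, ix, p) == ([31, 53, 59, 41, 26], [0, 2, 1, 0, 1], [0, 2, 3, 5])
```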
Sparse matrix times dense vector
y = A * x
The first call to matvec caches a communication schedule for matrix A. Later calls to multiply any vector by A use the cached schedule.
Communication and computation overlap.
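A serial simulation of the schedule caching (an illustrative class, not Star-P internals): the first multiply records, per row block, which entries of x live on other processors; later multiplies reuse that list.

```python
import numpy as np
from scipy.sparse import random as sprandom

class DistMatvec:
    def __init__(self, row_blocks, x_owner):
        self.blocks = row_blocks      # per-processor row blocks of A (CSR)
        self.x_owner = x_owner        # x_owner[j] = processor that owns x[j]
        self.schedule = None          # communication schedule, cached lazily

    def matvec(self, x):
        if self.schedule is None:
            # First call: record which remote x-entries each block touches.
            self.schedule = [
                sorted({int(j) for j in blk.indices if self.x_owner[j] != p})
                for p, blk in enumerate(self.blocks)
            ]
        # In MPI these index lists drive the sends/receives; serially we
        # just compute each block's local product and stack the results.
        return np.concatenate([blk @ x for blk in self.blocks])

A = sprandom(8, 8, density=0.4, format="csr", random_state=0)
x = np.arange(8, dtype=float)
dm = DistMatvec([A[:4], A[4:]], x_owner=[0] * 4 + [1] * 4)
assert np.allclose(dm.matvec(x), A @ x)   # first call builds the schedule
assert np.allclose(dm.matvec(x), A @ x)   # later calls reuse it
```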
Sparse linear systems
x = A \ b
Matrix division uses MPI-based direct solvers:
• SuperLU_dist: nonsymmetric, static pivoting
• MUMPS: nonsymmetric multifrontal
• PSPASES: Cholesky
ppsetoption('SparseDirectSolver', 'SUPERLU')
Iterative solvers implemented in Star-P
Ongoing work on preconditioners
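For reference, the serial SciPy analogue of x = A \ b on a sparse system is spsolve, which is backed by SuperLU (the serial cousin of the SuperLU_dist solver named above); a minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

A = csr_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])
x = spsolve(A, b)               # sparse direct solve via SuperLU
assert np.allclose(A @ x, b)    # residual check
```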
Sparse eigenvalue solvers
[V, D] = eigs(A); etc.
Implemented via MPI-based PARPACK [Lehoucq, Maschhoff, Sorensen, Yang]
Work in progress
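SciPy's eigsh/eigs wrap ARPACK, the serial library that PARPACK parallelizes, so a serial sketch of the eigs call looks like this (the 1-D Laplacian test matrix is chosen here because its eigenvalues are known in closed form):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# 1-D Laplacian; its eigenvalues are 2 - 2*cos(k*pi/(n+1)), k = 1..n.
n = 50
L = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
vals, vecs = eigsh(L, k=3, sigma=0, which="LM")   # 3 eigenvalues nearest 0
exact = 2 - 2 * np.cos(np.arange(1, 4) * np.pi / (n + 1))
assert np.allclose(np.sort(vals), exact)
```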
Combinatorial Scientific Computing
Sparse matrix methods; machine learning; search and information retrieval; adaptive multilevel modeling; geometric meshing; computational biology; . . .
How will combinatorial methods be used by people who don’t understand them in complete detail?
Analogy: Matrix division in Matlab
x = A \ b;
Works for either full or sparse A
Is A square? no => use QR to solve the least squares problem
Is A triangular or permuted triangular? yes => sparse triangular solve
Is A symmetric with positive diagonal elements? yes => attempt Cholesky after symmetric minimum degree
Otherwise => use LU on A(:, colamd(A))
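The decision tree above can be sketched as a dispatch function; a toy dense Python analogue (the minimum-degree and colamd reorderings are omitted, and the predicates are simplified):

```python
import numpy as np
import scipy.linalg as la

def mldivide(A, b):
    """Toy dense analogue of Matlab's x = A \\ b polyalgorithm."""
    m, n = A.shape
    if m != n:                                   # not square: least squares via QR
        return np.linalg.lstsq(A, b, rcond=None)[0]
    upper = np.allclose(A, np.triu(A))
    lower = np.allclose(A, np.tril(A))
    if upper or lower:                           # triangular: substitution
        return la.solve_triangular(A, b, lower=lower)
    if np.allclose(A, A.T) and np.all(np.diag(A) > 0):
        try:                                     # attempt Cholesky
            return la.cho_solve(la.cho_factor(A), b)
        except np.linalg.LinAlgError:
            pass                                 # not positive definite: fall through
    return la.lu_solve(la.lu_factor(A), b)       # general LU with pivoting

b = np.array([1.0, 2.0])
spd = np.array([[4.0, 1.0], [1.0, 3.0]])         # Cholesky branch
tri = np.array([[2.0, 1.0], [0.0, 3.0]])         # triangular branch
gen = np.array([[0.0, 2.0], [1.0, 1.0]])         # LU branch
for M in (spd, tri, gen):
    assert np.allclose(M @ mldivide(M, b), b)
```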
Combinatorics in Star-P
A sparse matrix language is a good start on primitives for combinatorial computing.
Random-access indexing: A(i,j)
Neighbor sequencing: find (A(i,:))
Sparse table construction: sparse (I, J, V)
Other primitives we are building: sorting, searching, pointer-jumping, connectivity and paths, . . .
Matching and depth-first search in Matlab
dmperm: Dulmage-Mendelsohn decomposition
Square, full rank A: [p, q, r] = dmperm(A)
• A(p,q) is block upper triangular with nonzero diagonal
• also: strongly connected components of a directed graph
• also: connected components of an undirected graph

Arbitrary A: [p, q, r, s] = dmperm(A)
• maximum-size matching in a bipartite graph
• minimum-size vertex cover in a bipartite graph
• decomposition into strong Hall blocks
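There is no SciPy dmperm, but two of the by-products listed above are available directly in scipy.sparse.csgraph; a minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching, connected_components

# Bipartite graph as a sparse matrix: rows = one side, columns = other.
B = csr_matrix(np.array([[1, 1, 0],
                         [0, 1, 0],
                         [0, 0, 1]]))
match = maximum_bipartite_matching(B, perm_type="column")
assert (match >= 0).sum() == 3        # a perfect matching exists

# Strongly connected components of a directed graph (0 <-> 1, 2 isolated).
G = csr_matrix(np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 0, 0]]))
ncomp, labels = connected_components(G, directed=True, connection="strong")
assert ncomp == 2
```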
Connected components (work in progress)
Sequential Matlab uses depth-first search (dmperm), which doesn’t parallelize well
Shiloach-Vishkin algorithm:
  repeat
    Link every (super)vertex to a random neighbor
    Shrink each tree to a supervertex by pointer jumping
  until no further change

Hybrid SV / search method under construction

Other potential graph kernels:
• Breadth-first search (after Husbands, LBNL)
• Bipartite matching (after Riedy, UCB)
• Strongly connected components (after Pinar, LBNL)
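The Shiloach-Vishkin repeat/until loop can be simulated serially; a minimal sketch (deterministic smallest-root links stand in for the random neighbor choice, and the function name is illustrative):

```python
def sv_components(n, edges):
    """Connected-component labels for an undirected graph on n vertices."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Link: hook the larger of two neighboring roots onto the smaller.
        for (u, v) in edges:
            ru, rv = parent[u], parent[v]
            if parent[ru] != parent[rv]:
                hi, lo = max(parent[ru], parent[rv]), min(parent[ru], parent[rv])
                parent[hi] = lo
                changed = True
        # Shrink: pointer jumping flattens each tree to its root.
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent

labels = sv_components(6, [(0, 1), (1, 2), (3, 4)])
assert labels == [0, 0, 0, 3, 3, 5]   # components {0,1,2}, {3,4}, {5}
```

The parallel appeal is that both the link step and the pointer-jumping step are data-parallel over edges and vertices, which is what a depth-first-search formulation lacks.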
MIT Lincoln Laboratory
SSCA #2 -- Overview
SSCA #2: Directed weighted multigraph with no self-loops
• Scalable data generator
• Kernel 1 — Graph Construction
• Kernel 2 — Classify Large Sets (Search)
• Kernel 3 — Subgraph Extraction
• Kernel 4 — Graph Clustering / Partitioning

• Randomly sized "cliques"
• Most intra-clique edges present
• Random inter-clique edges, gradually thinning with distance
• Integer and character string edge labels
• Randomized vertex numbers
HPCS Benchmark Spectrum: SSCA #2

[Figure: chart placing SSCA #2 in the HPCS benchmark spectrum. HPCchallenge micro & kernel benchmarks (local: DGEMM, STREAM, RandomAccess, 1D FFT; global: Linpack, PTRANS, RandomAccess, 1D FFT) give execution performance bounds. The Scalable Synthetic Compact Applications (1. Optimal Pattern Matching; 2. Graph Analysis; 3. Simulation, NWCHEM; 4. Simulation, NAS PB AU; 5. Simulation, Multi-Physics; 6. Signal Processing / Knowledge Formation) form an HPCS spanning set of kernels covering discrete math, graph analysis, linear solvers, signal processing, simulation, and I/O, and give execution performance indicators. Mission partner application benchmarks (current: UM2000, GAMESS, OVERFLOW, LBMHD, RFCTH, HYCOM; near-future: NWChem, ALEGRA, CCSM) give execution and development performance indicators for existing, emerging, and future applications. SSCA #2 itself comprises a data generator and four kernels: graph construction, sort large sets, graph extraction, graph clustering.]
Scalable Data Generator
• Produce edge tuples containing the start vertex, end vertex, and weight for each directed edge
• Hierarchical nature, with random-sized cliques connected by edges between some of the cliques
• Interclique edges are assigned using a random distribution that represents a hierarchical thinning as vertices are further apart
• Weights are either random integer values or randomly selected character strings
• Random permutations to avoid induced locality:
  vertex numbers — local processor memory locality
  parallel — global memory locality
• Data sets must be developed in their entirety before graph construction
• May be parallelized — data tuple vertex naming schemes must be globally consistent
• Data generator will not be timed
• Statistics will be saved to be used in later verification stages
Kernel 1 — Graph Construction
Construct the multigraph in a format usable by all subsequent kernels
No subsequent modifications will be permitted to benefit specific kernels
Graph construction will be timed
Kernel 2 — Sort on Selected Edge Weights
Sort on edge weights to determine the vertex pairs:
• with the largest integer weights
• with a desired string weight

The two vertex pair lists will be saved for use in the subsequent kernel:
• start set SI for integer weights
• start set SC for character string weights

Sorting weights will be timed
Kernel 3 — Extract Subgraphs
Extract a series of subgraphs formed by following paths of length k from a start set of initial vertices
The set of initial vertices will be determined from the previous kernel
Produce a series of subgraphs consisting of vertices and edges on paths of length k from the vertex pairs start sets SI and SC
A possible computational kernel for graph extraction is Breadth First Search
Extracting graphs will be timed
Kernel 4 — Partition Graph Using a Clustering Algorithm
Apply a graph clustering algorithm to partition the vertices of the graph into subgraphs no larger than a maximum size, so as to minimize the number of edges that need be "cut", e.g.:
• Kernighan and Lin
• Sangiovanni-Vincentelli, Chen, and Chua
The output of this kernel will be a description of the clusters and the edges that form the interconnecting “network”
This step should identify the structure in the graph specified in the data generation phase
It is important that the implementation of this kernel does not utilize a priori knowledge of the details in the data generator or the statistics collected in the graph generation process
Identifying the clusters and the interconnecting “network” will be timed
Concise SSCA#2
A serial Matlab implementation of SSCA#2
Coding style is simple and “data parallel”
Our starting point for Star-P implementation
Edge labels are only integers, not strings yet
Data generation
Using the Lincoln Labs serial Matlab generator at present
Data-parallel Star-P generator under construction
Generate “edge triples” as distributed dense vectors
Kernel 1: Graph construction
cSSCA#2 K1: ~30 lines of executable serial Matlab
Multigraph is a cell array of simple directed graphs
Graphs are dsparse matrices, created by sparse( )
Kernel 2: Search in large sets
cSSCA#2 K2: ~12 lines of executable serial Matlab
Uses max( ) and find( ) to locate edges of max weight
ng = length(G.edgeWeights);   % number of graphs
maxwt = 0;
for g = 1:ng
    maxwt = max(maxwt, max(max(G.edgeWeights{g})));
end
intedges = zeros(0, 2);
for g = 1:ng
    [intstart, intend] = find(G.edgeWeights{g} == maxwt);
    intedges = [intedges ; intstart intend];
end
Kernel 3: Subgraph extraction
cSSCA#2 K3: ~25 lines of executable serial Matlab
Returns subgraphs consisting of vertices and edges within specified distance of specified starting vertices
Sparse matrix-vector product for breadth-first search
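The idea is that the frontier at each level is an indicator vector, and one sparse matvec with the adjacency transpose advances it one level; a serial SciPy sketch (the function name is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_levels(A, start, k):
    """Vertices within distance k of `start` in directed graph A (A[i,j]=1 for edge i->j)."""
    n = A.shape[0]
    frontier = np.zeros(n)
    frontier[start] = 1.0
    reached = frontier.copy()
    for _ in range(k):
        frontier = A.T @ frontier                    # out-neighbors of the frontier
        frontier = ((frontier > 0) & (reached == 0)).astype(float)  # drop seen vertices
        reached += frontier
    return np.flatnonzero(reached)

# Path graph 0 -> 1 -> 2 -> 3.
A = csr_matrix(np.array([[0, 1, 0, 0],
                         [0, 0, 1, 0],
                         [0, 0, 0, 1],
                         [0, 0, 0, 0]]))
assert list(bfs_levels(A, 0, 2)) == [0, 1, 2]
```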
Kernel 4: Partitioning / clustering
cSSCA#2 K4: serial prototype exists, not yet parallel
Different algorithm from Lincoln Labs executable spec, which uses seed-growing breadth-first search
cSSCA#2 uses spectral methods for clustering, based on eigenvectors (Fiedler vectors) of the graph
One version seeks small balanced edge cuts (graph partitioning); another seeks clusters with low isoperimetric number (clustering)
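A minimal sketch of Fiedler-vector partitioning (dense eigh stands in here for the sparse eigensolver a real implementation would use, and the function name is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def fiedler_partition(A):
    """Two-way split of an undirected 0/1-adjacency graph by the Fiedler vector."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = (diags(d) - A).toarray()          # graph Laplacian (dense, for the sketch)
    vals, vecs = np.linalg.eigh(L)        # eigh returns ascending eigenvalues
    fiedler = vecs[:, 1]                  # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= np.median(fiedler)  # True/False marks the two halves

# Two triangles joined by a single edge: the cut should separate them.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
side = fiedler_partition(csr_matrix(A))
assert set(np.flatnonzero(side)) in ({0, 1, 2}, {3, 4, 5})
```

Splitting at the median gives balanced halves (graph partitioning); thresholding at a sweep point that minimizes the isoperimetric ratio instead gives the clustering variant mentioned above.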