multifrontal sparse qr factorization on the gpu · 2012-03-13 · multifrontal sparse qr...


Multifrontal sparse QR factorization on the GPU

Tim Davis, Sanjay Ranka, Sharanyan Chetlur, Nuri Yeralan

University of Florida

Feb 2012


GPU-based Multifrontal QR factorization

why sparse QR?

multifrontal sparse QR in a nutshell

multi-threaded sparse QR

sparse multifrontal QR on the GPU

our strategy

work in progress


Why multifrontal sparse QR factorization?

wide applicability of QR

numerically stable

better parallelism

independent problems decoupled, unlike LU or Cholesky

Communication-Avoiding QR (CAQR)

orthogonal methods have higher flops per memory reference

QR assembly step is GPU-friendly

related to other direct methods (LU, Cholesky, LDL^T)


Multifrontal sparse QR factorization in a nutshell

rows can be operated on in any order

group together rows with left-most nonzeros in the same column

factorize each block of rows independently

each block of rows takes on the same nonzero pattern (a frontal matrix)

merger of frontal matrices: copy, not add (unlike LU, Cholesky)

repeat until the matrix becomes upper triangular
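The grouping step above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' code: the function names and the tiny 5-by-4 pattern are made up, and rows are represented as sets of 0-based column indices.

```python
from collections import defaultdict

def group_rows_by_leftmost(rows):
    """Map each leftmost-nonzero column to the list of rows that start there.

    rows: list of sets, rows[i] = column indices of the nonzeros in row i.
    """
    groups = defaultdict(list)
    for i, cols in enumerate(rows):
        if cols:                      # skip rows that are entirely zero
            groups[min(cols)].append(i)
    return dict(groups)

def block_pattern(rows, members):
    """Union of the nonzero patterns of the rows in one block."""
    return sorted(set().union(*(rows[i] for i in members)))

# Hypothetical sparsity pattern, chosen only for illustration.
rows = [{0, 1, 3}, {0, 2}, {1, 2}, {1, 3}, {2, 3}]
groups = group_rows_by_leftmost(rows)
pattern0 = block_pattern(rows, groups[0])
```

Every row in a block is then stored with the block's union pattern, which is what lets the block be treated as a small dense matrix.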


[Figure: an 11-row sparse matrix with its rows grouped into four blocks by the column of the leftmost nonzero (X); x marks the other nonzeros.]

The dots: union of the nonzero patterns of all rows in each block.


Householder Sparse QR

Sort the rows of A by column of leftmost nonzero, and annihilate

[Figure: the row-sorted matrix before and after the first block is annihilated. The first block becomes

r r r r
0 ∗ ∗ ∗
0 0 ∗ ∗

and the remaining blocks are unchanged.]

Can do the other blocks at the same time.


Householder Sparse QR

Key observation: each block of rows to annihilate has the same nonzero pattern. So place them in a dense submatrix and use dense matrix kernels. For column 1:

[Figure: the row-sorted matrix, with columns numbered 1 to 7.]

use this:

1 2 4 6

X . x x

X x . .

X . x .
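As a rough sketch of how such a dense submatrix could be gathered, the Python below packs the three rows of the column-1 block into a 3-by-4 array over columns 1, 2, 4, 6. The nonzero positions follow the slide; the numeric values and the name `gather_front` are invented for the example.

```python
def gather_front(sparse_rows):
    """Pack sparse rows that share a block into one dense matrix.

    sparse_rows: list of {column: value} dicts.
    Returns (cols, dense), where cols is the union pattern and dense[i][j]
    holds the entry of row i in column cols[j] (0.0 where the row has none).
    """
    cols = sorted(set().union(*(r.keys() for r in sparse_rows)))
    pos = {c: j for j, c in enumerate(cols)}
    dense = [[0.0] * len(cols) for _ in sparse_rows]
    for i, r in enumerate(sparse_rows):
        for c, v in r.items():
            dense[i][pos[c]] = v
    return cols, dense

# Pattern from the slide (columns 1, 2, 4, 6); values are placeholders.
block = [
    {1: 4.0, 4: 1.0, 6: 2.0},   # X . x x
    {1: 3.0, 2: 5.0},           # X x . .
    {1: 1.0, 4: 7.0},           # X . x .
]
cols, F = gather_front(block)
```

The explicit zeros correspond to the dots in the figure: positions that are empty in one row but present in the block's union pattern.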


Multifrontal QR factorization in a nutshell

Group rows of A with nonzero in same leftmost column

. . x . . x .

. . x x . . x

. . x . x . x

. . x . x x .

Apply Householder to reduce each group to upper triangular, one row becomes a row of R

. . r r r r r

. . . x x x x

. . . . x x x

. . . . . x x

Append remainder to the group for the next nonzero column

next: a tree of columns (the column elimination tree)

Lump adjacent columns together if their rows of R have the same nonzero pattern (supernodes)
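The reduction of one group can be illustrated with a plain Householder sweep in Python. This is a minimal sketch under assumptions: the group is already packed into a dense 4-by-5 array over columns 3 to 7 (matching the pattern shown above), the values are invented, and `householder_triangularize` is a hypothetical helper rather than a routine from the talk.

```python
import math

def householder_triangularize(F):
    """Return a copy of dense matrix F reduced to upper triangular form
    by Householder reflections (entries below the diagonal become
    numerically zero)."""
    A = [row[:] for row in F]
    m, n = len(A), len(A[0])
    for k in range(min(m - 1, n)):
        x = [A[i][k] for i in range(k, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue                          # column already zero below row k
        v = x[:]
        v[0] += math.copysign(norm, x[0])     # v = x + sign(x0) * ||x|| * e1
        vtv = sum(t * t for t in v)
        for j in range(k, n):                 # apply H = I - 2 v v^T / (v^T v)
            s = sum(v[i] * A[k + i][j] for i in range(len(v)))
            f = 2.0 * s / vtv
            for i in range(len(v)):
                A[k + i][j] -= f * v[i]
    return A

# Dense form of the 4-row group (columns 3..7); values are placeholders.
group = [
    [2.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0, 0.0, 2.0],
    [4.0, 0.0, 1.0, 0.0, 5.0],
    [2.0, 0.0, 3.0, 1.0, 0.0],
]
T = householder_triangularize(group)
r_row = T[0]                                  # becomes a row of R
remainder = [row[1:] for row in T[1:]]        # appended to the next group
```

After the sweep, the first row is the new row of R and the other three rows, with their leading column dropped, are what gets handed to the group for the next nonzero column.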


the column elimination tree

[Figure: the matrix A (rows 1 to 23, columns 1 to 12), the QR factor R, and the column elimination tree on columns 1 to 12.]


QR factorization of a leaf frontal matrix

[Figure: left, the rows of A for front 1 (rows 1 to 6; columns 1, 2, 6, 8, 11); right, factorized front 1, with entries marked r, h and c.]


QR for a non-leaf frontal matrix

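[Figure: the contribution blocks of children 1, 2 and 3 (row indices at left, column indices on top), next to the column elimination tree on columns 1 to 12.]

child front 1

      6   8   11
 3    1c  1c  1c
 4        1c  1c
 5            1c

child 2

      6   7   8   12
 8    2c  2c  2c  2c
 9        2c  2c  2c
10            2c  2c
11                2c

child 3

      5   7   9   11
13    3c  3c  3c  3c
14        3c  3c  3c
15            3c  3c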


Non-leaf frontal matrix: children to assemble

[Figure: the contribution blocks of child front 1 (rows 3 to 5; columns 6, 8, 11), child 2 (rows 8 to 11; columns 6, 7, 8, 12) and child 3 (rows 13 to 15; columns 5, 7, 9, 11), together with the rows of A for front 4 (rows 16 to 22; columns 5, 6, 7, 8, 9, 11, 12).]

Assembly: shuffle the above data into a single matrix (next slide)


Frontal matrix assembly: no read-modify-write

[Figure: the contribution blocks of child front 1, child 2 and child 3 and the rows of A for front 4 are copied into the assembled front 4 (columns 5, 6, 7, 8, 9, 11, 12).]


Frontal matrix: after factorization

[Figure: factorized front 4 (columns 5, 6, 7, 8, 9, 11, 12), with entries marked r, h and c.]

r: R factor

h: Householder vectors (Q)

c: contribution to parent


Multifrontal QR: assembly step is GPU-friendly

Assembly for multifrontal LU, Cholesky

requires data shuffling and addition

in parallel: two thread blocks would need to synchronize the summation (read-modify-write)

for multifrontal QR

requires just data shuffling; no addition

in parallel: no read-modify-write
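A small Python sketch may make the "copy, not add" point concrete. It uses the column lists of the three child contribution blocks and of front 4 as read from the earlier figures; the values (1.0, 2.0, 3.0 per child), the name `assemble_front`, and the simple stacking order are assumptions made for illustration, and the rows of A are left out for brevity.

```python
def assemble_front(front_cols, blocks):
    """Build a dense front by copying source rows into place.

    front_cols: sorted global column indices of the front.
    blocks: list of (cols, rows); each row holds the values for cols.
    Every source row gets its own row of the front, so each target
    entry is written exactly once and nothing is ever summed.
    """
    pos = {c: j for j, c in enumerate(front_cols)}
    front = []
    for cols, rows in blocks:
        for row in rows:
            out = [0.0] * len(front_cols)
            for c, v in zip(cols, row):
                out[pos[c]] = v          # plain store, never "+="
            front.append(out)
    return front

front_cols = [5, 6, 7, 8, 9, 11, 12]
child1 = ([6, 8, 11], [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
child2 = ([6, 7, 8, 12], [[2.0, 2.0, 2.0, 2.0], [0.0, 2.0, 2.0, 2.0],
                          [0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 0.0, 2.0]])
child3 = ([5, 7, 9, 11], [[3.0, 3.0, 3.0, 3.0], [0.0, 3.0, 3.0, 3.0],
                          [0.0, 0.0, 3.0, 3.0]])
front = assemble_front(front_cols, [child1, child2, child3])
```

Because no two sources target the same entry, independent workers could each copy their own rows without atomics or synchronization, which is the property the slide contrasts with LU and Cholesky assembly.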


Frontal matrix QR factorization on the GPU

Kernel 1: suppose one thread block can factorize one stripe with a fixed maximum number of rows.

If the stripe has too many columns, slice it and apply Q to the remaining slices after it is computed.


Frontal matrix QR factorization on the GPU

Kernel 2: take two stripes; annihilate below diagonal.

[Figure: the two stripes before and after.]

numerically stable because of orthogonal operations
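One way to read the two kernels is as a pairwise reduction: factorize each stripe on its own, then stack the two triangular results and annihilate below the diagonal again. The Python below is a sketch of that reading only; it is an assumption about how the stripes combine, not the GPU kernels themselves, and the helper names and data are hypothetical.

```python
import math

def triangularize(F):
    """Householder reduction of a dense matrix to upper triangular form."""
    A = [row[:] for row in F]
    m, n = len(A), len(A[0])
    for k in range(min(m - 1, n)):
        x = [A[i][k] for i in range(k, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue
        v = x[:]
        v[0] += math.copysign(norm, x[0])
        vtv = sum(t * t for t in v)
        for j in range(k, n):
            f = 2.0 * sum(v[i] * A[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                A[k + i][j] -= f * v[i]
    return A

def combine_two_stripes(top, bottom):
    """Factorize each stripe, then factorize the stacked triangles."""
    n = len(top[0])
    r_top = triangularize(top)[:n]        # "kernel 1" on the first stripe
    r_bottom = triangularize(bottom)[:n]  # "kernel 1" on the second stripe
    return triangularize(r_top + r_bottom)[:n]   # "kernel 2" on the pair

# Two hypothetical 3-by-2 stripes of one frontal matrix.
top = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bottom = [[7.0, 8.0], [9.0, 10.0], [11.0, 13.0]]
R_pair = combine_two_stripes(top, bottom)
R_direct = triangularize(top + bottom)[:2]
```

For a full-rank matrix the two results agree up to the sign of each row of R, so working stripe by stripe gives the same factor as one large factorization while each step touches only a bounded number of rows.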


Frontal matrix QR factorization on the GPU

Combine with more stripes:


Frontal matrix QR factorization on the GPU

Rinse and repeat, always working on pairs of stripes at a time, where two pairs fit in shared memory:


Frontal matrix QR factorization on the GPU

Kernel dependencies for a frontal matrix of 4 stripes and 6 blocks of columns:


Frontal matrix QR factorization on the GPU

Further pipelining for additional parallelism


Algorithm outline

symbolic analysis: on the CPU

numeric factorization: on the GPU


Symbolic analysis: on the CPU

Fill-reducing ordering (typically O(|A|) time)

Symbolic analysis (nearly O(|A|)), without forming A^T A

find the column etree

row counts of R

find relaxed supernodes

sort rows of A

task assignment (subtrees = parallel subtasks)

total time: about O(|A|)
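To show how the column etree can be found without forming A^T A, here is a short Python sketch of the standard ancestor-compression approach used for this purpose in sparse direct solvers. It is not taken from the talk: the function name, the compressed-column input layout and the 4-by-4 test pattern are assumptions for illustration.

```python
def column_etree(n_rows, n_cols, col_ptr, row_idx):
    """Column elimination tree of A (the etree of A^T A), computed
    directly from A stored in compressed-column form.

    Returns parent, where parent[j] is the parent of column j or -1
    for a root. Indices are 0-based.
    """
    parent = [-1] * n_cols
    ancestor = [-1] * n_cols      # path-compressed ancestor links
    prev = [-1] * n_rows          # prev[i]: last column seen in row i
    for k in range(n_cols):
        for p in range(col_ptr[k], col_ptr[k + 1]):
            r = row_idx[p]
            i = prev[r]           # columns sharing row r are linked
            while i != -1 and i < k:
                nxt = ancestor[i]
                ancestor[i] = k   # compress the path toward k
                if nxt == -1:
                    parent[i] = k
                i = nxt
            prev[r] = k
    return parent

# Hypothetical 4-by-4 pattern: column j -> rows with nonzeros.
#   col 0: rows 0,1   col 1: row 2   col 2: rows 1,2   col 3: rows 0,3
col_ptr = [0, 2, 3, 5, 7]
row_idx = [0, 1, 2, 1, 2, 0, 3]
parent = column_etree(4, 4, col_ptr, row_idx)
```

Each nonzero of A is visited once and the ancestor links keep the tree walks short, which is consistent with the "nearly O(|A|)" cost quoted above.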


Numerical factorization: on the GPU

numerical factorization of subtrees:

frontal matrix assembly

frontal matrix factorization

contribution block stacked for parent

the challenge of heterogeneous computations within a subtree

some factorize while others assemble

fronts vary wildly in size from tiny to huge

tree driven by matrix; not simple balanced binary tree

staging:

factorize one subtree while transferring another, CPU ↔ GPU

multi-GPU:

each GPU handles independent subtrees.

if front is huge, treat like multi-GPU dense QR factorization with blocking/striping


Performance results: pre-GPU method

Least squares problem: 2 million by 110 thousand

Method          ordering   procs   time
x=A\b           COLMMD     1       ?
x=A\b           AMD        1       11 days
MA49            AMD        1       3.5 hours
SuiteSparseQR   AMD        1       1.5 hours
SuiteSparseQR   METIS      1       45 minutes
SuiteSparseQR   METIS      16      7.3 minutes

Algorithmic speedup vs x=A\b: 375x

Parallel speedup: 5.75x on 16 cores

Total: 2,155x (14 Gflops on 70 Gflops machine)

Single core: 2.5 Gflop peak, same as LAPACK QR


Gflop vs LAPACK (single core)

[Figure: log-log plot of Gflops (0.1 to 4) against flop count / memory usage in bytes (10^0 to 10^3) for SuiteSparseQR and dense QR (DGEQRF), with points labeled n=100, n=1000 and n=4000.]


Multifrontal QR on the GPU

Tesla C2050

double-precision frontal matrix QR (65 GFlops)

fronts remain on the GPU

frontal matrix assembly

in-progress:

strip-mining scheduling

nodes of tree = one frontal matrix

split each front into a subtree

parallel assembly of some fronts while others are factorized


Multifrontal QR on the GPU: strip-mining

[Figure: a frontal matrix tree with eight fronts, numbered 1 to 8.]


Multifrontal QR on the GPU: strip-mining

[Figure: the expanded frontal matrix tree laid out by kernel launch (1st, 2nd, 3rd, 4th, 5th, etc.), with tasks 1, 2, 5, 6, 7, 3a to 3e, 4a to 4d and 8a to 8c, and tasks labeled A.]


Multifrontal Sparse QR on the GPU: Summary

Fast symbolic analysis and fill-reducing ordering (∼ O(|A|))

Dense matrix kernels to exploit tightly-coupled regular parallelism within the GPU

Elimination tree for loosely-coupled irregular parallelism

High performance in pre-GPU version

peak Gflop rate same as LAPACK

ample parallel speedup

Appears as the built-in x=A\b and qr in MATLAB R2009a.

If A is all nonzero, x=sparse(A)\b can be faster than x=A\b.

GPU method in progress

dense QR for frontal matrices: Sharanyan Chetlur

assembly: by the GPU. Regular memory traffic to/from global memory; all irregular traffic in shared memory within each thread block: Nuri Yeralan.

strip-mining scheduling of the expanded frontal matrix tree

staging subtrees to handle large problems (> 6 GB)


Acknowledgements


Postscript

Please send me your matrices!

http://www.cise.ufl.edu/dropbox/www