
Cholesky Factorization of Band Matrices Using Multithreaded BLAS

Alfredo Remón, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí

Depto. de Ingeniería y Ciencia de Computadores
Universidad Jaume I de Castellón (Spain)
{remon,quintana,gquintan}@icc.uji.es

PARA’06 - June 2006


Band Cholesky Factorization

Quick start:

• Solution of Ax = b, with A symmetric positive definite band (SPDB)


• Sources:

– Matrix Market: a request for matrices arising in SPDB linear systems returns 21 different matrices

– Reorganization of sparse SPD systems

Band Cholesky Factorization

Quick start (Cont.):

• Exploiting the band structure yields important savings in both computational cost and storage. For the factorization of A alone, with bandwidth k:

Structure             Flops             Elements
Dense                 $n^3$             $n^2$
Banded ($k \ll n$)    $n(k^2 + 3k)$     $(k + 1) \times n$

• This is recognized in LAPACK:

– unblocked pbtf2 and

– blocked pbtrf
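To get a feel for the savings, the counts from the table can be evaluated for concrete sizes. The sketch below simply plugs numbers into the two rows of the table; the order n = 10,000 matches the experiments reported later, while k = 100 is an illustrative value chosen here, and the function names are hypothetical.

```python
def dense_cost(n):
    """Flops and stored elements for the dense case, as given in the table."""
    return n**3, n**2


def band_cost(n, k):
    """Flops and stored elements for the banded case, as given in the table."""
    return n * (k**2 + 3 * k), (k + 1) * n


n, k = 10_000, 100  # k is an illustrative bandwidth, not taken from the slides
dense_flops, dense_elems = dense_cost(n)
band_flops, band_elems = band_cost(n, k)
print(f"flop ratio (dense / band):    {dense_flops / band_flops:.0f}")
print(f"storage ratio (dense / band): {dense_elems / band_elems:.0f}")
```

Both ratios grow as k shrinks relative to n, which is why LAPACK provides dedicated band routines.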

LAPACK Routines for Band Cholesky Factorization

Factorize A into either

$A = U^T U$ or $A = L L^T$,

with $U, L \in \mathbb{R}^{n \times n}$ upper and lower triangular with bandwidth $k$, respectively.

Matrix A is stored in packed format

[Figure: entries $\alpha_{ij}$ of the band of A and their layout in packed storage; unused positions are marked ∗]

The Cholesky factor L overwrites the corresponding entries of A
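As an illustration of the packed format, the sketch below packs the lower band of a dense symmetric matrix into a (k+1) × n array following the LAPACK convention for the lower triangular case, where entry (i, j) of the band lands in row i − j of column j. The helper names `to_lower_band` and `from_lower_band` are hypothetical, and unused positions are simply left at zero here.

```python
import numpy as np


def to_lower_band(A, k):
    """Pack the lower band of symmetric A into a (k+1) x n array.

    A[i, j] with j <= i <= j + k is stored in ab[i - j, j]; the trailing
    positions of the last k columns are unused (left as zero here).
    """
    n = A.shape[0]
    ab = np.zeros((k + 1, n))
    for d in range(k + 1):  # d-th subdiagonal goes to row d
        ab[d, : n - d] = np.diagonal(A, -d)
    return ab


def from_lower_band(ab):
    """Rebuild the dense lower triangle from the packed array."""
    k, n = ab.shape[0] - 1, ab.shape[1]
    L = np.zeros((n, n))
    for d in range(k + 1):
        L += np.diag(ab[d, : n - d], -d)
    return L


A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
ab = to_lower_band(A, k=1)
print(ab)
```

Row 0 of `ab` holds the diagonal and row 1 the first subdiagonal, so only (k+1) × n values are kept.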

Routine pbtrf implements a right-looking algorithm

Repartition to expose a $b \times b$ block $A_{11}$

[Figure: partitioning of the band of A into blocks $A_{ij}$ around $A_{11}$]

Then $A_{22} \in \mathbb{R}^{l \times l}$, with $l = k - b$, and $A_{33}$ is $b \times b$

During Iteration i

$A_{11} = L_{11} L_{11}^T$

$A_{21} := A_{21} L_{11}^{-T}$, $A_{22} := A_{22} - L_{21} L_{21}^T$

$A_{31} := A_{31} L_{11}^{-T}$, $A_{32} := A_{32} - L_{31} L_{21}^T$, $A_{33} := A_{33} - L_{31} L_{31}^T$

$A_{11} = L_{11} L_{11}^T$: $b \times b$ dense Cholesky factorization (potf2, with leading dimension lda-1)

$A_{21} := A_{21} L_{11}^{-T}$: $k \times b$ triangular system solve (trsm)

$A_{22} := A_{22} - L_{21} L_{21}^T$: $k \times k$ rank-$b$ update (syrk)

$A_{31} := A_{31} L_{11}^{-T}$: No appropriate BLAS kernel!

$W := \mathrm{triu}(A_{31})$: copy

$W := W L_{11}^{-T}$: $b \times b$ trsm

$A_{32} := A_{32} - L_{31} L_{21}^T$: No appropriate BLAS kernel!

$A_{32} := A_{32} - W L_{21}^T$: $b \times k$ matrix product (gemm)

$A_{33} := A_{33} - L_{31} L_{31}^T$: No appropriate BLAS kernel!

$A_{33} := A_{33} - W W^T$: $b \times b$ syrk

$\mathrm{triu}(A_{31}) := \mathrm{triu}(W)$: copy back
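The sequence of updates above can be sketched end to end in NumPy. This is a minimal illustration on dense storage, not the LAPACK code: it assumes 1 ≤ b ≤ k, uses `np.linalg.solve` as a stand-in for trsm and `np.linalg.cholesky` for potf2, and all function names are hypothetical.

```python
import numpy as np


def solve_rt(B, L):
    """Return B @ inv(L).T (the trsm step; a general solver is used for brevity)."""
    return np.linalg.solve(L, B.T).T


def chol_lower(D):
    """Cholesky factor of a diagonal block, reading only its lower triangle."""
    return np.linalg.cholesky(np.tril(D) + np.tril(D, -1).T)


def band_cholesky(A, k, b):
    """Right-looking blocked Cholesky of an SPD matrix with bandwidth k."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    assert 1 <= b <= k
    for j in range(0, n, b):
        ib = min(b, n - j)                    # order of A11
        i2 = max(0, min(k - ib, n - j - ib))  # rows of A21, order of A22
        i3 = max(0, min(ib, n - j - k))       # rows of A31, order of A33
        c = slice(j, j + ib)                  # columns of the current panel
        r2 = slice(j + ib, j + ib + i2)       # rows of A21 and A22
        r3 = slice(j + k, j + k + i3)         # rows of A31, A32 and A33
        L11 = chol_lower(A[c, c])             # A11 = L11 L11^T
        A[c, c] = L11
        if i2 > 0:
            A[r2, c] = solve_rt(A[r2, c], L11)   # A21 := A21 L11^-T
            A[r2, r2] -= A[r2, c] @ A[r2, c].T   # A22 := A22 - L21 L21^T
        if i3 > 0:
            W = np.triu(A[r3, c])                # W := triu(A31)
            W = solve_rt(W, L11)                 # W := W L11^-T
            if i2 > 0:
                A[r3, r2] -= W @ A[r2, c].T      # A32 := A32 - W L21^T
            A[r3, r3] -= W @ W.T                 # A33 := A33 - W W^T
            A[r3, c] = np.triu(W)                # triu(A31) := triu(W)
    return np.tril(A)


def random_spd_band(n, k, seed=0):
    """Hypothetical test matrix: symmetric, bandwidth k, strictly diagonally dominant."""
    rng = np.random.default_rng(seed)
    T = np.tril(rng.standard_normal((n, n)))
    T -= np.tril(T, -(k + 1))                 # keep only k subdiagonals
    A = T + T.T
    A += np.diag(np.abs(A).sum(axis=1) + 1.0)
    return A


A = random_spd_band(40, 7, seed=1)
L = band_cholesky(A, k=7, b=3)
print(np.allclose(L @ L.T, A))
```

Each iteration issues six separate kernel calls, three of which operate on blocks of order b only; that is the splitting discussed in the remainder of the talk.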


Summary:

• Provided $b \ll k$, the update of A22 is the major computation while the updates of A11, A31, and A33 are minor

• No appropriate kernels in BLAS for the updates of A31, A32, and A33
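A rough flop count per iteration makes the first point concrete. The estimates below are leading-order textbook counts for the kernels involved (Cholesky n³/3, trsm m·n², syrk n²·r, gemm 2·m·n·r); they are an assumption of this sketch rather than figures from the slides, and the sizes k = 1000, b = 32 are illustrative.

```python
def flops_per_iteration(k, b):
    """Leading-order flop estimates for the six updates of one iteration."""
    l = k - b
    return {
        "A11": b**3 / 3,       # b x b Cholesky factorization
        "A21": l * b**2,       # l x b triangular solve
        "A22": l**2 * b,       # l x l rank-b update
        "A31": b**3,           # b x b triangular solve on W
        "A32": 2 * b * l * b,  # b x l product with inner dimension b
        "A33": b**3,           # b x b rank-b update
    }


def flop_shares(k, b):
    """Fraction of the iteration's flops spent in each update."""
    flops = flops_per_iteration(k, b)
    total = sum(flops.values())
    return {name: value / total for name, value in flops.items()}


for name, share in flop_shares(k=1000, b=32).items():
    print(f"{name}: {share:6.2%}")
```

With these sizes the update of A22 accounts for the bulk of the arithmetic, which is the sense in which the other updates are "minor" in flops, though not necessarily in time.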

Experimental Results

Results correspond to a matrix of order n = 10,000 and the optimal block size

Platform  Architecture    #Proc.  Frequency (GHz)  L2 cache (KBytes)  L3 cache (MBytes)  RAM (GBytes)
xeon      Intel Xeon      2       2.4              512                –                  1
itanium   Intel Itanium2  4       1.5              256                4                  4

Platform  BLAS         Compiler   Optimization flags  Operating system
xeon      Goto 1.00    gcc 3.3.5  -O3                 Linux 2.4.27
xeon      MKL 8.0      gcc 3.3.5  -O3                 Linux 2.4.27
itanium   Goto 0.95mt  icc 9.0    -O3                 Linux 2.4.21
itanium   MKL 8.0      icc 9.0    -O3                 Linux 2.4.21

What to expect? itanium + MKL(4T)

[Figure: Distribution of flops for blocked routine DPBTRF + MKL BLAS on ITANIUM (4 proc.). Horizontal axis: bandwidth, kd (0 to 2000). Series: A11, A21, A22, A31, A32, A33, each with 4 threads.]

What do we get? itanium + MKL(4T)

[Figure: Distribution of time for blocked routine DPBTRF + MKL on ITANIUM (4 proc.). Horizontal axis: bandwidth, kd (0 to 2000). Series: A11, A21, A22, A31, A32, A33, each with 4 threads.]

Experimental Results


Summary:

• Given their sizes, the updates of A11, A31, and A33 are executed sequentially, but they may not be that small! (Amdahl's law)

• It is the storage scheme that separates the update of A31 from that of A21, and the updates of A32 and A33 from that of A22
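The reference to Amdahl's law can be made concrete with a short calculation. The sketch below evaluates the standard formula for the speedup on p processors when a fraction of the run time stays sequential; the fractions used are hypothetical values for illustration, not measurements from these experiments.

```python
def amdahl_speedup(seq_fraction, p):
    """Upper bound on speedup with p processors when seq_fraction of the
    single-processor time cannot be parallelized (Amdahl's law)."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)


# Hypothetical sequential fractions, for illustration only.
for s in (0.0, 0.05, 0.2):
    print(f"sequential fraction {s:.2f}: speedup on 4 processors "
          f"= {amdahl_speedup(s, 4):.2f}")
```

Even a modest sequential share caps the attainable speedup well below the processor count, so updates that are negligible in flops can still matter in time.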

Can we merge them back together?

Merging the Updates

Solution A: Embed A into an augmented A with bandwidth k + b

[Figure: band matrix A and augmented matrix, each shown with its storage]

Merging the Updates

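[Figure: blocks A21, A22, A31, A32, A33 with the augmented subdiagonals]

$\begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} := \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} L_{11}^{-T}$ in a single trsm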

Merging the Updates

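[Figure: blocks A21, A22, A31, A32, A33 with the augmented subdiagonals]

$\begin{bmatrix} A_{22} & \\ A_{32} & A_{33} \end{bmatrix} := \begin{bmatrix} A_{22} & \\ A_{32} & A_{33} \end{bmatrix} - \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix}^T$ in a single syrk (with $A_{21}$, $A_{31}$ already overwritten by $L_{21}$, $L_{31}$)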

Merging the Updates

Summary of solution A:

• Provided $b \ll k$, only the factorization of A11 is small

• Same #flops as LAPACK code

• Same sequence of BLAS calls as in dense Cholesky factorization

• Need b additional rows in storage scheme
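A minimal sketch of the merged variant follows, again on dense storage for readability. Working on a dense array plays the role of the augmented storage here, since the positions below the band are physically present and hold zeros; the packed layout with k + b subdiagonals is not modelled. The function names and the test matrix are hypothetical, and `np.linalg.solve` stands in for trsm.

```python
import numpy as np


def band_cholesky_merged(A, k, b):
    """Blocked band Cholesky with one trsm and one syrk per iteration."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    assert 1 <= b <= k
    for j in range(0, n, b):
        ib = min(b, n - j)
        c = slice(j, j + ib)                   # columns of the current panel
        r = slice(j + ib, min(n, j + ib + k))  # rows of [A21; A31] together
        D = A[c, c]
        L11 = np.linalg.cholesky(np.tril(D) + np.tril(D, -1).T)
        A[c, c] = L11                          # A11 = L11 L11^T
        if r.start < r.stop:
            A[r, c] = np.linalg.solve(L11, A[r, c].T).T  # single trsm
            A[r, r] -= A[r, c] @ A[r, c].T               # single syrk
    return np.tril(A)


def toy_spd_band(n):
    """Hypothetical SPD test matrix with bandwidth 3 (diagonally dominant)."""
    A = 10.0 * np.eye(n)
    for d, v in ((1, -2.0), (2, 1.0), (3, 0.5)):
        A += np.diag(np.full(n - d, v), d) + np.diag(np.full(n - d, v), -d)
    return A


A = toy_spd_band(11)
L = band_cholesky_merged(A, k=3, b=2)
print(np.allclose(L @ L.T, A))
```

The loop body now has the same shape as a dense blocked Cholesky factorization restricted to a window of k rows, which is the point of the third bullet above.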

Merging the Updates

Solution B: Allow space for the operations with full (square) A31

[Figure: blocks A21, A22, A31, A32, A33, with space for the full (square) A31]

Merging the Updates

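$W := \mathrm{stril}(A_{31})$; $\mathrm{stril}(A_{31}) := 0$: copy and set to zero

$\begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} := \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} L_{11}^{-T}$: trsm

$\begin{bmatrix} A_{22} & \\ A_{32} & A_{33} \end{bmatrix} := \begin{bmatrix} A_{22} & \\ A_{32} & A_{33} \end{bmatrix} - \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix} \begin{bmatrix} A_{21} \\ A_{31} \end{bmatrix}^T$: syrk

$\mathrm{stril}(A_{31}) := W$: copy back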
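The four steps can be sketched for a single iteration as follows. This is an illustration under simplifying assumptions: the blocks are passed as separate NumPy arrays and stacked with `np.vstack`, whereas in the packed storage they would be contiguous; the strictly lower triangle of the A31 array is filled with unrelated values to emulate entries that belong to other parts of the matrix. The function name is hypothetical.

```python
import numpy as np


def panel_update_solution_b(L11, A21, A31, A22, A32, A33):
    """Updates of one iteration in the style of Solution B (blocks changed in place)."""
    W = np.tril(A31, -1).copy()       # W := stril(A31)
    A31[...] = np.triu(A31)           # stril(A31) := 0
    P = np.vstack([A21, A31])         # [A21; A31] treated as one panel
    P = np.linalg.solve(L11, P.T).T   # trsm on the whole panel
    T = P @ P.T                       # syrk on the merged trailing block
    l = A21.shape[0]
    A21[...] = P[:l]
    A31[...] = np.triu(P[l:])         # result stays upper triangular
    A22 -= T[:l, :l]
    A32 -= T[l:, :l]
    A33 -= T[l:, l:]
    A31 += W                          # stril(A31) := W (copy back)


rng = np.random.default_rng(0)
b, l = 3, 4
L11 = np.tril(rng.standard_normal((b, b))) + 4.0 * np.eye(b)
A21 = rng.standard_normal((l, b))
A31 = rng.standard_normal((b, b))     # strictly lower part holds foreign data
A22 = rng.standard_normal((l, l))
A32 = rng.standard_normal((b, l))
A33 = rng.standard_normal((b, b))

# Reference values computed block by block from triu(A31) only.
foreign = np.tril(A31, -1).copy()
inv_t = np.linalg.inv(L11).T
L21, L31 = A21 @ inv_t, np.triu(A31) @ inv_t
ref22, ref32, ref33 = A22 - L21 @ L21.T, A32 - L31 @ L21.T, A33 - L31 @ L31.T

panel_update_solution_b(L11, A21, A31, A22, A32, A33)
print(np.array_equal(np.tril(A31, -1), foreign))
```

Zeroing the strictly lower triangle keeps the foreign entries out of the trsm and syrk, and the copy back restores them untouched.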

Experimental Results xeon + Goto BLAS(2T)

[Figure: Performance using optimal block size on Intel Xeon. Horizontal axis: bandwidth, ku=kl (0 to 2000). Series: Goto BLAS(2T); Goto BLAS(2T)+inline; Goto BLAS(2T)+A+inline; Goto BLAS(2T)+B+inline.]

Experimental Results itanium + MKL(4T)

[Figure: Performance using optimal block size on Intel Itanium. Horizontal axis: bandwidth, ku=kl (0 to 2000). Series: MKL BLAS(4T); MKL BLAS(4T)+A; MKL BLAS(4T)+B.]

Conclusions

• Band Cholesky:

– No appropriate kernels in BLAS for some of the operations required in the Cholesky factorization of band matrices

– Even if they were available, exploiting the structure of A31 would split the computations into several parts

– Excessive splitting results in lower performance, especially when multiple processors are used

• General:

– Operations that are considered minor in current LAPACK routines will need to be reconsidered for future multicores

– Some LAPACK routines may need to be recoded

Thank you!
