Mixed single-double precision solver
Julien Langou, University of Colorado Denver.
Wednesday April 1st, 2008.

1. Mixed single-double precision solver
2. Rectangular Full Packed Format
3. Communication optimal algorithm
Architecture (BLAS/LAPACK)              n     DGEMM/SGEMM  DGETRF/SGETRF
Intel Pentium III Coppermine (Goto)     3500  2.10         2.24
Intel Pentium III Katmai (Goto)         3000  2.12         2.11
Sun UltraSPARC IIe (Sunperf)            3000  1.45         1.79
Intel Pentium IV Prescott (Goto)        4000  2.00         1.86
Intel Pentium IV-M Northwood (Goto)     4000  2.02         1.98
AMD Opteron (Goto)                      4000  1.98         1.93
Cray X1 (libsci)                        4000  1.68         1.54
IBM PowerPC G5 (2.7 GHz) (VecLib)       4000  2.29         2.03
Compaq Alpha EV6 (CXML)                 3000  0.99         1.08
IBM SP Power3 (ESSL)                    2000  1.03         1.13
SGI Octane (ATLAS)                      1500  1.08         1.13
Intel Itanium 2 (Goto)                  1500  0.71         –
                 Single              Double
                 x87  SSE  3DNow!    x87  SSE2    speedup
Pentium           1    0     0        1    0         1
Pentium II        1    0     0        1    0         1
Pentium III       1    4     0        1    0         4
Pentium IV        1    4     0        1    2         2
Athlon            2    0     4        2    0         2
Enhanced Athlon   2    0     4        2    0         2
Athlon 4          2    4     4        2    0         2
Athlon MP         2    4     4        2    0         2
MMX: Set of “MultiMedia eXtensions” to the x86 ISA. Mainly new instructions for integer performance, and maybe some prefetch. For Intel, all chips starting with the Pentium MMX processor possess these extensions. For AMD, all chips starting with the K6 possess these extensions.

SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions. SSE is a superset of MMX. These instructions are used to speed up single precision (32-bit) floating point arithmetic. For Intel, all chips listed starting with the Pentium III possess SSE extensions. For AMD, all chips starting from the Athlon 4 possess SSE.

3DNow!: AMD’s extension to MMX that does almost exactly what SSE does, except the single precision arithmetic is not IEEE compliant. It is also a superset of MMX (but not of SSE; 3DNow! was released before SSE). It is supported only on AMD, starting with the K6-2 chip.

SSE2: Additional instructions that perform double precision floating point arithmetic. Allows for 2 double precision FLOPs every cycle. For Intel, supported on the Pentium 4; for AMD, supported on the Opteron.
• Factorize the matrix in single precision (O(n^3) FLOPs), then use this factorization as a preconditioner in a double precision iterative method (O(n^2) FLOPs).
• The single precision LU factorization is a poor double precision LU factorization but an excellent preconditioner.
• Typically the iterative method converges in a few steps.
• Single precision computation dominates the FLOP count (O(n^3) in single vs O(n^2) in double): speed of single precision.
• Double precision iterative solver: accuracy in double precision.
Mixed precision, Iterative Refinement for Direct Solvers:

1: LU ← PA (εs)
2: solve Ly = Pb (εs)
3: solve Ux0 = y (εs)
do k = 1, 2, ...
  4: rk ← b − A xk−1 (εd)
  5: solve Ly = P rk (εs)
  6: solve U zk = y (εs)
  7: xk ← xk−1 + zk (εd)
  check convergence
done

TIME = ρ32 · (2/3 n^3) + #ITER · ρ64 · O(n^2)
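The loop above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration of the idea, not the LAPACK DSGESV driver; the function name and parameters are made up for this note, and the inner solves use SciPy's pivoted LU interface:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: O(n^3) LU factorization in float32,
    O(n^2) residual/update steps in float64."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    # steps 1-3: factor and first solve in single precision (eps_s)
    lu, piv = lu_factor(A64.astype(np.float32))
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b64 - A64 @ x                       # step 4: residual in double (eps_d)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break                               # convergence check
        # steps 5-6: correction via the single precision factors (eps_s)
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += z                                  # step 7: update in double (eps_d)
    return x
```

The factorization is reused across all refinement steps, so each iteration costs only O(n^2), matching the cost model above.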
Hardware and software details of the systems used for performance experiments.

Architecture      Clock [GHz]  Peak SP / Peak DP  Memory [MB]  BLAS       Compiler
AMD Opteron 246   2.0          2                  2048         Goto-1.13  Intel-9.1
IBM PowerPC 970   2.5          2                  2048         Goto-1.13  IBM-8.1
Intel Xeon 5100   3.0          2                  4096         Goto-1.13  Intel-9.1
STI Cell BE       3.2          14                 512          –          Cell SDK-1.1
Performance improvements for direct dense methods when going from a full double precision solve (reference time) to a mixed precision solve.

                  Nonsymmetric  Symmetric
AMD Opteron 246   1.82          1.54
IBM PowerPC 970   1.56          1.35
Intel Xeon 5100   1.56          1.43
STI Cell BE       8.62          10.64
Mixed precision, iterative refinement method for the solution of dense linear systems on the STI Cell BE processor.

[Figure: LU Solve -- Cell Broadband Engine 3.2 GHz. Gflop/s vs problem size (500-3500); curves: Full-single, Mixed-precision, Peak-double.]
[Figure: Cholesky Solve -- Cell Broadband Engine 3.2 GHz. Gflop/s vs problem size (1000-4000); curves: Full-single, Mixed-precision, Peak-double.]
[Figure: Cluster AMD Opteron(tm), 64 processors, Open MPI (MX) (nb=121, p=4, q=16).]
Test matrices for sparse mixed precision, iterative refinement solution methods.
n.  Matrix      Size   Nonzeroes  symm.  pos. def.  C. Num.
1   SiO         33401  1317655    yes    no         O(10^3)
2   Lin         25600  1766400    yes    no         O(10^5)
3   c-71        76638  859554     yes    no         O(10)
4   cage-11     39082  559722     no     no         O(1)
5   raefsky3    21200  1488768    no     no         O(10)
6   poisson3Db  85623  2374949    no     no         O(10^3)
Performance improvements for direct sparse methods when going from a full double precision solve (reference time) to a mixed precision solve.

                  Matrix number
                  1      2      3      4      5      6
AMD Opteron 246   1.827  1.783  1.580  1.858  1.846  1.611
IBM PowerPC 970   1.393  1.321  1.217  1.859  1.801  1.463
Intel Xeon 5100   1.799  1.630  1.554  1.768  1.728  1.524
[Figure: Intel Woodcrest 3.0 GHz. Speedup vs matrix number (1-6); bars: Single/double, Mixed prec./double.]
Mixed precision, inner-outer FGMRES(m_out)-GMRES_SP(m_in):

1:  for i = 0, 1, ... do
2:    r = b − A xi (εd)
3:    β = h1,0 = ||r||2 (εd)
4:    check convergence and exit if done
5:    for k = 1, ..., m_out do
6:      vk = r / hk,k−1 (εd)
7:      perform one cycle of GMRES_SP(m_in) in order to (approximately) solve A zk = vk (initial guess zk = 0) (εs)
8:      r = A zk (εd)
9:      for j = 1, ..., k do
10:       hj,k = r^T vj (εd)
11:       r = r − hj,k vj (εd)
12:     end for
13:     hk+1,k = ||r||2 (εd)
14:   end for
15:   define Zk = [z1, ..., zk], Hk = {hi,j} 1≤i≤k+1, 1≤j≤k (εd)
16:   find yk, the vector of size k, that minimizes ||β e1 − Hk yk||2 (εd)
17:   xi+1 = xi + Zk yk (εd)
18: end for
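A compact NumPy sketch of this inner-outer scheme follows. The helper names are made up for this note: the inner solver is a plain restarted GMRES cycle (Arnoldi with modified Gram-Schmidt plus a small least-squares solve) run entirely in float32, while the outer flexible GMRES runs in float64 and keeps the inner solutions z_k explicitly, as in lines 15-17 above:

```python
import numpy as np

def gmres_cycle(A, b, m):
    """One GMRES(m) cycle for A x = b with zero initial guess."""
    n = b.size
    beta = np.linalg.norm(b)
    if beta == 0.0:
        return b.copy()
    V = np.zeros((n, m + 1), dtype=A.dtype)
    H = np.zeros((m + 1, m), dtype=A.dtype)
    V[:, 0] = b / beta
    used = m
    for k in range(m):
        w = A @ V[:, k]
        for j in range(k + 1):                 # modified Gram-Schmidt
            H[j, k] = w @ V[:, j]
            w -= H[j, k] * V[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] <= 1e-7 * beta:         # (near) breakdown: stop early
            used = k + 1
            break
        V[:, k + 1] = w / H[k + 1, k]
    e1 = np.zeros(used + 1, dtype=A.dtype)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:used + 1, :used], e1, rcond=None)
    return V[:, :used] @ y

def fgmres_mixed(A, b, m_in=30, m_out=10, max_outer=5, tol=1e-12):
    """Outer FGMRES in float64; each z_k is one float32 GMRES(m_in) cycle."""
    A32 = A.astype(np.float32)
    n = b.size
    x = np.zeros(n)
    for _ in range(max_outer):
        r = b - A @ x                                   # (eps_d)
        beta = np.linalg.norm(r)
        if beta <= tol * np.linalg.norm(b):
            break                                       # convergence check
        V = np.zeros((n, m_out + 1))
        H = np.zeros((m_out + 1, m_out))
        Z = np.zeros((n, m_out))
        V[:, 0] = r / beta
        used = m_out
        for k in range(m_out):
            # inner solve A z_k = v_k in single precision (eps_s)
            z32 = gmres_cycle(A32, V[:, k].astype(np.float32), m_in)
            Z[:, k] = z32.astype(np.float64)
            w = A @ Z[:, k]                             # (eps_d)
            for j in range(k + 1):
                H[j, k] = w @ V[:, j]
                w -= H[j, k] * V[:, j]
            H[k + 1, k] = np.linalg.norm(w)
            if H[k + 1, k] <= 1e-14 * beta:
                used = k + 1
                break
            V[:, k + 1] = w / H[k + 1, k]
        e1 = np.zeros(used + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:used + 1, :used], e1, rcond=None)
        x = x + Z[:, :used] @ y                         # flexible update (eps_d)
    return x
```

Because the preconditioner (the inner single precision solve) changes from one step to the next, the outer iteration must be flexible: it stores the z_k themselves rather than reconstructing them from the Krylov basis.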
Mixed precision iterative refinement with FGMRES-GMRES_SP vs FGMRES-GMRES_DP (left) and vs full double precision GMRES (right).
[Figure, left: GMRES SP-DP/DP-DP speedup vs matrix number (1-6); Intel Woodcrest 3.0 GHz, AMD Opteron 246 2.0 GHz, IBM PowerPC 970 2.5 GHz.]
[Figure, right: GMRES SP-DP/DP speedup vs matrix number (1-6); same three architectures.]
matrix n.  m_in  m_out  m
1          30    20     150
2          20    10     40
3          100   9      300
4          10    4      18
5          20    20     300
6          20    10     50
Number of iterations needed for our mixed precision method to converge to an accuracy better than the one of the associated double precision solve, as a function of the condition number of the coefficient matrix, in the context of direct dense nonsymmetric solves.
[Figure: # iterations of iterative refinement vs condition number of the coefficient matrix (10^0 to 10^10).]
Improving the robustness of the solver: use FGMRES instead of Richardson iteration.
• M. Arioli and I. Duff. FGMRES to obtain backward stability in mixed precision. RAL-TR-2008-006, 2008.
• J. D. Hogg and J. A. Scott. A fast and robust mixed precision solver for the solution of sparse symmetric linear systems. RAL-TR-2008-023, 2008.
Iterative Refinement in Quadruple Precision on an Intel Xeon 3.2 GHz

       QGESV      QDGESV
n      time (s)   time (s)  speedup
100    0.29       0.03      9.5
200    2.27       0.10      20.9
300    7.61       0.24      30.5
400    17.81      0.44      40.4
500    34.71      0.69      49.7
600    60.11      1.01      59.0
700    94.95      1.38      68.7
800    141.75     1.83      77.3
900    201.81     2.33      86.3
1000   276.94     2.92      94.8

Time for the various kernels in the quadruple accuracy versions for n = 900:

driver name  time (s)  kernel name  time (s)
QGESV        201.81    QGETRF       201.1293
                       QGETRS       0.6845
QDGESV       2.33      DGETRF       0.3200
                       DGETRS       0.0127
                       DLANGE       0.0042
                       DGECON       0.0363
                       QGEMV        1.5526
                       ITERREF      1.9258
Extension to Other Algorithms
• As a rule of thumb, any algorithm with O(n^3) computation and O(n) output can benefit from mixed precision iterative refinement. (Idea: the residual computation needs to be at most O(n^2).)
• This is not the case for eigendecomposition, SVD, or matrix-matrix multiplication. It is the case for the solution of linear systems of equations.
• – J. J. Dongarra, C. B. Moler and J. H. Wilkinson, Improving the Accuracy of Computed Eigenvalues and Eigenvectors, SIAM Journal on Numerical Analysis, 20(1):23-45, 1983.
  – J. J. Dongarra, Improving the Accuracy of Computed Singular Values, SIAM Journal on Scientific and Statistical Computing, 4(4):712-719, 1983.
  – J. J. Dongarra, Algorithm 589: SICEDR: A FORTRAN Subroutine for Improving the Accuracy of Computed Matrix Eigenvalues, ACM Transactions on Mathematical Software, 8(4):371-375, 1982.
F. G. Gustavson, J. Wasniewski, J. J. Dongarra, and Julien Langou. Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution and Inversion. arXiv:0901.1696.
COMPLEX Rectangular Full Packed Format.

EIGHT cases: N is ODD or EVEN; TRANSR = 'N' or TRANSR = 'C'; UPLO = 'L' or UPLO = 'U'.

COMPLEX => entries are complex numbers, represented below by ( REAL_PART, IMAG_PART ).

On the LEFT: example matrix in Full Format (so ZHE, ZSY, ZTR); the entry (***,***) represents unreferenced memory space.
On the RIGHT: example matrix in Rectangular Full Packed Format (so ZHF, ZSF, ZTF).

Some entries are stored as-is (the part of the matrix that is not transposed in the RFP format); the remaining entries are the part of the matrix that is conjugate-transposed in the RFP format.
Case n is odd (n= 7), TRANSR = 'N', UPLO = 'L'
( +0,+97) (***,***) (***,***) (***,***) (***,***) (***,***) (***,***) ( +0,+97) (+32,-65) (+33,-64) (+34,-63) ( +1,+96) ( +8,+89) (***,***) (***,***) (***,***) (***,***) (***,***) ( +1,+96) ( +8,+89) (+40,-57) (+41,-56) ( +2,+95) ( +9,+88) (+16,+81) (***,***) (***,***) (***,***) (***,***) ( +2,+95) ( +9,+88) (+16,+81) (+48,-49) ( +3,+94) (+10,+87) (+17,+80) (+24,+73) (***,***) (***,***) (***,***) ====== > ( +3,+94) (+10,+87) (+17,+80) (+24,+73) ( +4,+93) (+11,+86) (+18,+79) (+25,+72) (+32,+65) (***,***) (***,***) ( +4,+93) (+11,+86) (+18,+79) (+25,+72) ( +5,+92) (+12,+85) (+19,+78) (+26,+71) (+33,+64) (+40,+57) (***,***) ( +5,+92) (+12,+85) (+19,+78) (+26,+71) ( +6,+91) (+13,+84) (+20,+77) (+27,+70) (+34,+63) (+41,+56) (+48,+49) ( +6,+91) (+13,+84) (+20,+77) (+27,+70)
Case n is odd (n= 7), TRANSR = 'N', UPLO = 'U'
( +0,+97) ( +7,+90) (+14,+83) (+21,+76) (+28,+69) (+35,+62) (+42,+55) (+21,+76) (+28,+69) (+35,+62) (+42,+55) (***,***) ( +8,+89) (+15,+82) (+22,+75) (+29,+68) (+36,+61) (+43,+54) (+22,+75) (+29,+68) (+36,+61) (+43,+54) (***,***) (***,***) (+16,+81) (+23,+74) (+30,+67) (+37,+60) (+44,+53) (+23,+74) (+30,+67) (+37,+60) (+44,+53) (***,***) (***,***) (***,***) (+24,+73) (+31,+66) (+38,+59) (+45,+52) ====== > (+24,+73) (+31,+66) (+38,+59) (+45,+52) (***,***) (***,***) (***,***) (***,***) (+32,+65) (+39,+58) (+46,+51) ( +0,-97) (+32,+65) (+39,+58) (+46,+51) (***,***) (***,***) (***,***) (***,***) (***,***) (+40,+57) (+47,+50) ( +7,-90) ( +8,-89) (+40,+57) (+47,+50) (***,***) (***,***) (***,***) (***,***) (***,***) (***,***) (+48,+49) (+14,-83) (+15,-82) (+16,-81) (+48,+49)
Case n is odd (n= 7), TRANSR = 'C', UPLO = 'L'
( +0,+97) (***,***) (***,***) (***,***) (***,***) (***,***) (***,***) ( +0,-97) ( +1,-96) ( +2,-95) ( +3,-94) ( +4,-93) ( +5,-92) ( +6,-91) ( +1,+96) ( +8,+89) (***,***) (***,***) (***,***) (***,***) (***,***) (+32,+65) ( +8,-89) ( +9,-88) (+10,-87) (+11,-86) (+12,-85) (+13,-84) ( +2,+95) ( +9,+88) (+16,+81) (***,***) (***,***) (***,***) (***,***) (+33,+64) (+40,+57) (+16,-81) (+17,-80) (+18,-79) (+19,-78) (+20,-77) ( +3,+94) (+10,+87) (+17,+80) (+24,+73) (***,***) (***,***) (***,***) ====== > (+34,+63) (+41,+56) (+48,+49) (+24,-73) (+25,-72) (+26,-71) (+27,-70) ( +4,+93) (+11,+86) (+18,+79) (+25,+72) (+32,+65) (***,***) (***,***) ( +5,+92) (+12,+85) (+19,+78) (+26,+71) (+33,+64) (+40,+57) (***,***) ( +6,+91) (+13,+84) (+20,+77) (+27,+70) (+34,+63) (+41,+56) (+48,+49)
Case n is odd (n= 7), TRANSR = 'C', UPLO = 'U'
( +0,+97) ( +7,+90) (+14,+83) (+21,+76) (+28,+69) (+35,+62) (+42,+55) (+21,-76) (+22,-75) (+23,-74) (+24,-73) ( +0,+97) ( +7,+90) (+14,+83) (***,***) ( +8,+89) (+15,+82) (+22,+75) (+29,+68) (+36,+61) (+43,+54) (+28,-69) (+29,-68) (+30,-67) (+31,-66) (+32,-65) ( +8,+89) (+15,+82) (***,***) (***,***) (+16,+81) (+23,+74) (+30,+67) (+37,+60) (+44,+53) (+35,-62) (+36,-61) (+37,-60) (+38,-59) (+39,-58) (+40,-57) (+16,+81) (***,***) (***,***) (***,***) (+24,+73) (+31,+66) (+38,+59) (+45,+52) ====== > (+42,-55) (+43,-54) (+44,-53) (+45,-52) (+46,-51) (+47,-50) (+48,-49) (***,***) (***,***) (***,***) (***,***) (+32,+65) (+39,+58) (+46,+51) (***,***) (***,***) (***,***) (***,***) (***,***) (+40,+57) (+47,+50) (***,***) (***,***) (***,***) (***,***) (***,***) (***,***) (+48,+49)
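For concreteness, the TRANSR = 'N', UPLO = 'L', n odd case above can be reproduced with a short NumPy sketch. This is a hypothetical helper written for this note (not LAPACK's ZTRTTF): the lower trapezoid of the first (n+1)/2 columns is stored in place, and the trailing triangle is stored conjugate-transposed in the upper-right corner of the n x (n+1)/2 array, so no entries are wasted.

```python
import numpy as np

def pack_rfp_lower_odd(A):
    """Pack the lower triangle of an n x n matrix (n odd) into
    Rectangular Full Packed format, TRANSR = 'N', UPLO = 'L'.
    Returns an n x (n+1)//2 array with no wasted entries."""
    n = A.shape[0]
    assert n % 2 == 1, "this sketch handles the odd-n case only"
    k = n // 2                       # n = 2k + 1
    AR = np.zeros((n, k + 1), dtype=A.dtype)
    for j in range(n):
        for i in range(j, n):        # lower triangle: i >= j
            if j <= k:
                AR[i, j] = A[i, j]                       # stored as-is
            else:
                AR[j - k - 1, i - k] = np.conj(A[i, j])  # conjugate-transposed corner
    return AR
```

With the n = 7 example above, where the entry in row i, column j of the full format is (i + 7j, 97 − (i + 7j)), the packed array reproduces the right-hand layout: AR[0, 1] holds (+32, −65), AR[2, 3] holds (+48, −49), and so on.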
Case n is even (n= 6), TRANSR = 'N', UPLO = 'L'
( +0,+71) (***,***) (***,***) (***,***) (***,***) (***,***) (+21,-50) (+22,-49) (+23,-48) ( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) ( +0,+71) (+28,-43) (+29,-42) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) ( +1,+70) ( +7,+64) (+35,-36) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +2,+69) ( +8,+63) (+14,+57) ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +3,+68) ( +9,+62) (+15,+56) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36) ( +4,+67) (+10,+61) (+16,+55) ( +5,+66) (+11,+60) (+17,+54)
Case n is even (n= 6), TRANSR = 'N', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,+53) (+24,+47) (+30,+41) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+19,+52) (+25,+46) (+31,+40) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+20,+51) (+26,+45) (+32,+39) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (+21,+50) (+27,+44) (+33,+38) (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) ( +0,-71) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36) ( +6,-65) ( +7,-64) (+35,+36) (+12,-59) (+13,-58) (+14,-57)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'L'
( +0,+71) (***,***) (***,***) (***,***) (***,***) (***,***) (+21,+50) ( +0,-71) ( +1,-70) ( +2,-69) ( +3,-68) ( +4,-67) ( +5,-66) ( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) (+22,+49) (+28,+43) ( +7,-64) ( +8,-63) ( +9,-62) (+10,-61) (+11,-60) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) (+23,+48) (+29,+42) (+35,+36) (+14,-57) (+15,-56) (+16,-55) (+17,-54) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,-53) (+19,-52) (+20,-51) (+21,-50) ( +0,+71) ( +6,+65) (+12,+59) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+24,-47) (+25,-46) (+26,-45) (+27,-44) (+28,-43) ( +7,+64) (+13,+58) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+30,-41) (+31,-40) (+32,-39) (+33,-38) (+34,-37) (+35,-36) (+14,+57) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36)
Case n is even (n= 6), TRANSR = ’C’, UPLO = ’L’
( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) ( +0,+71) (+28,-43) (+29,-42) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) ( +1,+70) ( +7,+64) (+35,-36) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +2,+69) ( +8,+63) (+14,+57) ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +3,+68) ( +9,+62) (+15,+56) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36) ( +4,+67) (+10,+61) (+16,+55) ( +5,+66) (+11,+60) (+17,+54)
Case n is even (n= 6), TRANSR = 'N', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,+53) (+24,+47) (+30,+41) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+19,+52) (+25,+46) (+31,+40) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+20,+51) (+26,+45) (+32,+39) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (+21,+50) (+27,+44) (+33,+38) (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) ( +0,-71) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36) ( +6,-65) ( +7,-64) (+35,+36) (+12,-59) (+13,-58) (+14,-57)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'L'
( +0,+71) (***,***) (***,***) (***,***) (***,***) (***,***) (+21,+50) ( +0,-71) ( +1,-70) ( +2,-69) ( +3,-68) ( +4,-67) ( +5,-66) ( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) (+22,+49) (+28,+43) ( +7,-64) ( +8,-63) ( +9,-62) (+10,-61) (+11,-60) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) (+23,+48) (+29,+42) (+35,+36) (+14,-57) (+15,-56) (+16,-55) (+17,-54) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,-53) (+19,-52) (+20,-51) (+21,-50) ( +0,+71) ( +6,+65) (+12,+59) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+24,-47) (+25,-46) (+26,-45) (+27,-44) (+28,-43) ( +7,+64) (+13,+58) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+30,-41) (+31,-40) (+32,-39) (+33,-38) (+34,-37) (+35,-36) (+14,+57) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36)
Case n is even (n= 6), TRANSR = ’C’, UPLO = ’U’
( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) ( +0,+71) (+28,-43) (+29,-42) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) ( +1,+70) ( +7,+64) (+35,-36) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +2,+69) ( +8,+63) (+14,+57) ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +3,+68) ( +9,+62) (+15,+56) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36) ( +4,+67) (+10,+61) (+16,+55) ( +5,+66) (+11,+60) (+17,+54)
Case n is even (n= 6), TRANSR = 'N', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,+53) (+24,+47) (+30,+41) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+19,+52) (+25,+46) (+31,+40) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+20,+51) (+26,+45) (+32,+39) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (+21,+50) (+27,+44) (+33,+38) (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) ( +0,-71) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36) ( +6,-65) ( +7,-64) (+35,+36) (+12,-59) (+13,-58) (+14,-57)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'L'
( +0,+71) (***,***) (***,***) (***,***) (***,***) (***,***) (+21,+50) ( +0,-71) ( +1,-70) ( +2,-69) ( +3,-68) ( +4,-67) ( +5,-66) ( +1,+70) ( +7,+64) (***,***) (***,***) (***,***) (***,***) (+22,+49) (+28,+43) ( +7,-64) ( +8,-63) ( +9,-62) (+10,-61) (+11,-60) ( +2,+69) ( +8,+63) (+14,+57) (***,***) (***,***) (***,***) (+23,+48) (+29,+42) (+35,+36) (+14,-57) (+15,-56) (+16,-55) (+17,-54) ( +3,+68) ( +9,+62) (+15,+56) (+21,+50) (***,***) (***,***) ====== > ( +4,+67) (+10,+61) (+16,+55) (+22,+49) (+28,+43) (***,***) ( +5,+66) (+11,+60) (+17,+54) (+23,+48) (+29,+42) (+35,+36)
Case n is even (n= 6), TRANSR = 'C', UPLO = 'U'
( +0,+71) ( +6,+65) (+12,+59) (+18,+53) (+24,+47) (+30,+41) (+18,-53) (+19,-52) (+20,-51) (+21,-50) ( +0,+71) ( +6,+65) (+12,+59) (***,***) ( +7,+64) (+13,+58) (+19,+52) (+25,+46) (+31,+40) (+24,-47) (+25,-46) (+26,-45) (+27,-44) (+28,-43) ( +7,+64) (+13,+58) (***,***) (***,***) (+14,+57) (+20,+51) (+26,+45) (+32,+39) (+30,-41) (+31,-40) (+32,-39) (+33,-38) (+34,-37) (+35,-36) (+14,+57) (***,***) (***,***) (***,***) (+21,+50) (+27,+44) (+33,+38) ====== > (***,***) (***,***) (***,***) (***,***) (+28,+43) (+34,+37) (***,***) (***,***) (***,***) (***,***) (***,***) (+35,+36)
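The storage arithmetic behind RFP can be checked in a few lines (a sketch of the counting argument only, not LAPACK code; the function name is ours): for even n, the n(n+1)/2 triangular entries fit exactly into an (n+1)-by-(n/2) rectangle, so RFP keeps the packed-format memory footprint while exposing a full-storage leading dimension to Level-3 BLAS.

```python
# Storage counts for an n-by-n symmetric/triangular matrix, n even.
# Full storage holds n*n doubles; packed storage holds the n*(n+1)/2
# triangular entries; RFP holds the same entries in an (n+1) x (n/2)
# rectangle with leading dimension n+1.
def storage_counts(n):
    assert n % 2 == 0, "this sketch only covers the even case"
    full = n * n
    packed = n * (n + 1) // 2
    rfp = (n + 1) * (n // 2)
    return full, packed, rfp

# For the n = 6 examples above: 36 full entries, only 21 needed,
# and the 7-by-3 RFP rectangle holds exactly those 21.
print(storage_counts(6))
```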
      SUBROUTINE DPFTRF( TRANSR, UPLO, N, A, INFO )
*
*  -- LAPACK routine (version 3.2) --
*
      IF( NISODD ) THEN
         IF( NORMALTRANSR ) THEN
            IF( LOWER ) THEN
*
               CALL DPOTRF( 'L', N1, A( 0 ), N, INFO )
               IF( INFO.GT.0 )
     +            RETURN
               CALL DTRSM( 'R', 'L', 'T', 'N', N2, N1, ONE, A( 0 ), N,
     +                     A( N1 ), N )
               CALL DSYRK( 'U', 'N', N2, N1, -ONE, A( N1 ), N, ONE,
     +                     A( N ), N )
               CALL DPOTRF( 'U', N2, A( N ), N, INFO )
               IF( INFO.GT.0 )
     +            INFO = INFO + N1
*
            ELSE
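The structure of DPFTRF above (factor the leading block, triangular solve, symmetric rank-k update, factor the trailing block) can be sketched in plain Python. This is an illustrative toy operating on full storage, not the LAPACK routine, and the names (chol, n1) are ours:

```python
import math

def chol(A):
    """Lower Cholesky factor of an SPD matrix (list of lists), computed
    with the same 2x2-block recursion as DPFTRF:
    POTRF(A11); TRSM for L21; SYRK update of A22; POTRF(A22)."""
    n = len(A)
    if n == 1:
        return [[math.sqrt(A[0][0])]]
    n1 = n // 2
    # Step 1 (POTRF): factor the leading n1-by-n1 block.
    L11 = chol([row[:n1] for row in A[:n1]])
    # Step 2 (TRSM): solve L21 * L11^T = A21 by forward substitution.
    L21 = [[0.0] * n1 for _ in range(n - n1)]
    for i in range(n - n1):
        for j in range(n1):
            s = A[n1 + i][j] - sum(L21[i][k] * L11[j][k] for k in range(j))
            L21[i][j] = s / L11[j][j]
    # Step 3 (SYRK): A22 := A22 - L21 * L21^T.
    S = [[A[n1 + i][n1 + j] - sum(L21[i][k] * L21[j][k] for k in range(n1))
          for j in range(n - n1)] for i in range(n - n1)]
    # Step 4 (POTRF): factor the updated trailing block.
    L22 = chol(S)
    # Assemble the lower triangular factor.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n1):
        for j in range(i + 1):
            L[i][j] = L11[i][j]
    for i in range(n - n1):
        for j in range(n1):
            L[n1 + i][j] = L21[i][j]
        for j in range(i + 1):
            L[n1 + i][n1 + j] = L22[i][j]
    return L
```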
      SUBROUTINE DPFTRI( TRANSR, UPLO, N, A, INFO )
*
*  -- LAPACK routine (version 3.2) --
*
      IF( NISODD ) THEN
         IF( NORMALTRANSR ) THEN
            IF( LOWER ) THEN
*
               CALL DLAUUM( 'L', N1, A( 0 ), N, INFO )
               CALL DSYRK( 'L', 'T', N1, N2, ONE, A( N1 ), N, ONE,
     +                     A( 0 ), N )
               CALL DTRMM( 'L', 'U', 'N', 'N', N2, N1, ONE, A( N ), N,
     +                     A( N1 ), N )
               CALL DLAUUM( 'U', N2, A( N ), N, INFO )
*
      SUBROUTINE DTFSM( TRANSR, SIDE, UPLO, TRANS, DIAG, M, N, ALPHA, A,
     +                  B, LDB )
*
*  -- LAPACK routine (version 3.2) --
*
      IF( MISODD ) THEN
         IF( NORMALTRANSR ) THEN
            IF( LOWER ) THEN
               IF( NOTRANS ) THEN
                  CALL DTRSM( 'L', 'L', 'N', DIAG, M1, N, ALPHA,
     +                        A( 0 ), M, B, LDB )
                  CALL DGEMM( 'N', 'N', M2, N, M1, -ONE, A( M1 ),
     +                        M, B, LDB, ALPHA, B( M1, 0 ), LDB )
                  CALL DTRSM( 'L', 'U', 'T', DIAG, M2, N, ONE,
     +                        A( M ), M, B( M1, 0 ), LDB )
*
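The core operation DTFSM builds on, a triangular solve against a block of right-hand sides, can be illustrated with a small pure-Python forward substitution (the 'L','L','N' case of a TRSM; this sketch and its names are ours, not LAPACK's):

```python
def trsm_lower(L, B, alpha=1.0):
    """Solve L * X = alpha * B for X, with L m-by-m lower triangular
    and B m-by-n; one forward substitution per right-hand side column."""
    m, n = len(L), len(B[0])
    X = [[alpha * B[i][j] for j in range(n)] for i in range(m)]
    for j in range(n):
        for i in range(m):
            s = X[i][j] - sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = s / L[i][i]
    return X
```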
Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on a SUN UltraSPARC IV+ computer, long real arithmetic.

[Three plots: Mflop/s versus problem size (0 to 4000) for the factorization, inversion, and solution phases, comparing the RFP routines (TRANSR = 'N'/'T', UPLO = 'U'/'L') with the full-storage (POTRF/POTRI/POTRS) and packed (PPTRF/PPTRI/PPTRS) routines.]

Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on a SUN UltraSPARC IV+ computer, long complex arithmetic.

[The same three plots for long complex arithmetic.]
Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on an SGI Altix 3700, Intel Itanium 2 computer, long real arithmetic.

[Three plots: Mflop/s versus problem size (0 to 4000) for the factorization, inversion, and solution phases, comparing the RFP routines (TRANSR = 'N'/'T', UPLO = 'U'/'L') with the full-storage and packed routines.]

Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on an SGI Altix 3700, Intel Itanium 2 computer, long complex arithmetic.

[The same three plots for long complex arithmetic.]
Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on an ia64 Itanium computer.

[Three plots versus problem size (0 to 4000; nrhs = 100 to 400 for the solution phase): PFTRF/PFTRI/PFTRS (TRANSR = 'N'/'T', UPLO = 'U'/'L') compared with POTRF/POTRI/POTRS and PPTRF/PPTRI/PPTRS.]
Performance in Mflop/s of Cholesky Factorization/Inversion/Solution on an SX-6 NEC computer.

[Three plots versus problem size (0 to 4000; nrhs = 100 to 400 for the solution phase): PFTRF/PFTRI/PFTRS (TRANSR = 'N'/'T', UPLO = 'U'/'L') compared with POTRF/POTRI/POTRS and PPTRF/PPTRI/PPTRS.]
Performance of Cholesky Factorization/Inversion/Solution on a quad-socket quad-core Intel Tigerton computer. We use reference LAPACK-3.2.0 (from netlib) and MKL-10.0.1.014 multithreaded BLAS. For the solution phase, nrhs is fixed to 100 for any n. Due to time limitation, the experiment was stopped for the packed storage format inversion at n = 4000.

[Three plots versus problem size (0 to 20000): factorization and inversion in Gflop/s, solution in Mflop/s (nrhs = 100), comparing the RFP routines (TRANSR = 'N'/'T', UPLO = 'U'/'L') with the full-storage and packed routines.]
Performance of Cholesky Factorization/Inversion/Solution on a quad-socket quad-core Intel Tigerton computer. We use MKL-10.0.1.014 multithreaded LAPACK and BLAS. For the solution phase, nrhs is fixed to 100 for any n. Due to time limitation, the experiment was stopped for the packed storage format inversion at n = 4000.

[Three plots versus problem size (0 to 20000): factorization and inversion in Gflop/s, solution in Mflop/s (nrhs = 100), comparing the RFP routines (TRANSR = 'N'/'T', UPLO = 'U'/'L') with the full-storage and packed routines.]
Performance in Gflop/s of Cholesky Factorization on IBM Power 4 (left) and SUN UltraSPARC-IV (right) computers with a different number of processors, testing the SMP parallelism. The implementation of PPTRF in sunperf does not show any SMP parallelism. UPLO = 'L'. N = 5,000 (strong scaling experiment).

[Two plots: Gflop/s versus number of processors (0 to 16), comparing PFTRF N L, POTRF L, and PPTRF L.]
Julien Langou, University of Colorado Denver. Wednesday April 1st, 2008.

1. Mixed single-double precision solver
2. Rectangular Full Packed Format
3. Communication optimal algorithm
For more information:

• Alfredo Buttari, Julien Langou, Jakub Kurzak and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35:38-53, 2009.
• Alfredo Buttari, Julien Langou, Jakub Kurzak and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20(13):1573-1590, 2008.
• James W. Demmel, Laura Grigori, Mark F. Hoemmen, and Julien Langou. Communication-optimal parallel and sequential QR and LU factorizations. arXiv:0808.2664.
• James W. Demmel, Laura Grigori, Mark F. Hoemmen, and Julien Langou. Implementing Communication-Optimal Parallel and Sequential QR Factorizations. arXiv:0809.2407.
AllReduce Algorithms: Application to Householder QR Factorization

Jim Demmel, University of California, Berkeley; Laura Grigori, INRIA, France; Mark Hoemmen, University of California, Berkeley; Julien Langou, University of Colorado, Denver
Reduce Algorithms: Introduction

The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.

[Diagram: blocks A1, A2, A3 map to Q1, Q2, Q3 and R. Input: A is block distributed by rows. Output: Q is block distributed by rows; R is global.]
Reduce Algorithms: Introduction

Example of applications:
a) in linear least squares problems in which the number of equations is far larger than the number of unknowns,
b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers).
Example of applications: in block iterative methods.

a) in iterative methods with multiple right-hand sides (block iterative methods):
   1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   2) Block GMRES, Block GCR, Block CG, Block QMR, …
b) in iterative methods with a single right-hand side:
   1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),
   2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,
   3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).
c) in iterative eigenvalue solvers:
   1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
   2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
   3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   4) PRIMME (A. Stathopoulos, Coll. William & Mary),
   5) and also TRLAN, BLZPACK, IRBLEIGS.
Reduce Algorithms: Introduction

Example of applications:
a) in linear least squares problems in which the number of equations is far larger than the number of unknowns,
b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),
c) in dense, large, and more square QR factorizations, where they are used as the panel factorization step.
Blocked LU and QR algorithms (LAPACK)

[Diagram: LAPACK block LU (right-looking, dgetrf): panel factorization (dgetf2), then update of the remaining submatrix with dtrsm (+dswp) and dgemm. LAPACK block QR (right-looking, dgeqrf): panel factorization (dgeqf2 + dlarft), then update of the remaining submatrix with dlarfb.]
Blocked LU and QR algorithms (LAPACK)

[Block LU diagram repeated: dgetrf = panel factorization (dgetf2) + update (dtrsm (+dswp), dgemm).]

Panel factorization: latency bounded — more than nb AllReduce operations for n*nb² flops.

Update of the remaining submatrix: CPU/bandwidth bounded — the bulk of the computation, n*n*nb flops, highly parallelizable, efficient and scalable.
Parallelization of LU and QR.

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3 n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

Parallelize the panel factorization:
• Not an option in a multicore context (p < 16).
• See e.g. ScaLAPACK or HPL, but it is still by far the slowest part and the bottleneck of the computation.

Hide the panel factorization:
• Look-ahead (see e.g. High Performance LINPACK).
• Dynamic scheduling.

[Diagram: overlapping panel factorizations (lu, dgetf2) with dgemm updates.]
Hiding the panel factorization with dynamic scheduling.

[Execution trace over time. Courtesy of Alfredo Buttari, U Tennessee.]
What about strong scalability? N = 1536, NB = 64, procs = 16.

We cannot hide the panel factorizations (O(n²) work) behind the matrix-matrix multiplies (O(n³) work); actually, it is the MMs that end up hidden behind the panel factorizations!

NEED FOR NEW MATHEMATICAL ALGORITHMS

[Execution trace. Courtesy of Jakub Kurzak, U Tennessee.]
A new generation of algorithms? Algorithms follow hardware evolution along time.

LINPACK (80's) (vector operations): relies on Level-1 BLAS operations.
LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations.
New algorithms (00's) (multicore friendly): rely on a DAG/scheduler, block data layout, and some extra kernels.

Those new algorithms:
• have a very low granularity, so they scale very well (multicore, petascale computing, …),
• remove a lot of the dependencies among the tasks (multicore, distributed computing),
• avoid latency (distributed computing, out-of-core),
• rely on fast kernels.
Those new algorithms need new kernels and rely on efficient scheduling algorithms.
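The DAG/scheduler idea can be illustrated in a few lines: a task becomes runnable as soon as its predecessors complete, so independent panel and update tasks interleave instead of waiting at global barriers. This is a toy sequential simulator of a dynamic scheduler (the task names and API are ours, not any library's):

```python
from collections import deque

def dynamic_schedule(deps):
    """deps maps task -> set of tasks it depends on.
    Returns one valid execution order: a task fires once its deps are done."""
    remaining = {t: set(d) for t, d in deps.items()}
    ready = deque(sorted(t for t, d in remaining.items() if not d))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        # Completing t may release tasks that were waiting on it.
        for u, d in remaining.items():
            if t in d:
                d.remove(t)
                if not d and u not in order and u not in ready:
                    ready.append(u)
    return order

# A tiny 2-panel LU-like DAG: panel1 feeds both updates; the second panel
# only needs the update of its own block column, not update12.
deps = {"panel1": set(), "update11": {"panel1"},
        "update12": {"panel1"}, "panel2": {"update11"}}
order = dynamic_schedule(deps)
```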
2005-2007: new algorithms based on 2D partitioning:

– U Texas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out-of-core)
– U Tennessee (Dongarra): CHOL (multicore)
– HPC2N (Kågström) / IBM (Gustavson): CHOL (distributed)
– UC Berkeley (Demmel) / INRIA (Grigori): LU/QR (distributed)
– UC Denver (Langou): LU/QR (distributed)

A 3rd revolution for dense linear algebra?
Reduce Algorithms: Introduction

Example of applications:
a) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),
b) in dense, large, and more square QR factorizations, where they are used as the panel factorization step, or more simply
c) in linear least squares problems in which the number of equations is far larger than the number of unknowns.

The main characteristics of those three examples are that
a) there is only one column of processors involved but several processor rows,
b) all the data is known from the beginning,
c) and the matrix is dense.

Various methods already exist to perform the QR factorization of such matrices:
a) Gram-Schmidt (mgs(row), cgs),
b) Householder (qr2, qrf),
c) or Cholesky QR.

We present a new method: Allreduce Householder (rhh_qr3, rhh_qrf).
The CholeskyQR Algorithm

[Diagram: Ci := Ai^T Ai; R := chol(C); Qi := Ai / R.]

SYRK:  C := A^T A      (mn²)
CHOL:  R := chol(C)    (n³/3)
TRSM:  Q := A / R      (mn²)
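The three steps above translate almost line for line into code. A pure-Python sketch for illustration (a real implementation would call SYRK/POTRF/TRSM, as in the C code later in these slides):

```python
import math

def cholesky_qr(A):
    """A = Q R with A m-by-n (list of lists), m >= n:
    C = A^T A; R = upper Cholesky factor of C; Q = A R^{-1}."""
    m, n = len(A), len(A[0])
    # SYRK: C := A^T A
    C = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    # CHOL: R := chol(C), upper triangular, computed row by row
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(C[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            R[i][j] = (C[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    # TRSM: Q := A R^{-1}, one forward substitution per row of A
    Q = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            Q[i][j] = (A[i][j] - sum(Q[i][k] * R[k][j] for k in range(j))) / R[j][j]
    return Q, R
```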
Bibliography

• A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23(6):2165-2182, 2002.
• Popularized by iterative eigensolver libraries:
  1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
  2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
  3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
  4) PRIMME (A. Stathopoulos, Coll. William & Mary).
Parallel distributed CholeskyQR

The CholeskyQR method in the parallel distributed context can be described as follows:

[Diagram: C = C1 + C2 + C3 + C4, then steps 1, 2, 3-4, 5 below.]

1: SYRK:        Ci := Ai^T Ai              (mn²)
2: MPI_Reduce:  C := sum over procs of Ci  (on proc 0)
3: CHOL:        R := chol(C)               (n³/3)
4: MPI_Bcast:   broadcast the R factor from proc 0 to all the other processors
5: TRSM:        Qi := Ai / R               (mn²)
Efficient enough?

In this experiment, we fix the problem: m = 100,000 and n = 50. Performance in MFLOP/sec/proc (time in sec in parentheses):

# of procs | cholqr       | cgs          | mgs(row)    | qrf           | mgs
1          | 489.2 (1.02) | 134.1 (3.73) | 73.5 (6.81) | 39.1 (12.78)  | 56.18 (8.90)
2          | 467.3 (0.54) |  78.9 (3.17) | 39.0 (6.41) | 22.3 (11.21)  | 31.21 (8.01)
4          | 466.4 (0.27) |  71.3 (1.75) | 38.7 (3.23) | 22.2 (5.63)   | 29.58 (4.23)
8          | 434.0 (0.14) |  67.4 (0.93) | 36.7 (1.70) | 20.8 (3.01)   | 21.15 (2.96)
16         | 359.2 (0.09) |  54.2 (0.58) | 31.6 (0.99) | 18.3 (1.71)   | 14.44 (2.16)
32         | 197.8 (0.08) |  41.9 (0.37) | 29.0 (0.54) | 15.8 (0.99)   |  8.38 (1.87)
Simple enough?

int choleskyqr_A_v0( int mloc, int n, double *A, int lda, double *R, int ldr,
                     MPI_Comm mpi_comm )
{
   int info;
   cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc,
                1.0e+00, A, lda, 0e+00, R, ldr );
   MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm );
   lapack_dpotrf( lapack_upper, n, R, ldr, &info );
   cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda );
   return 0;
}

(… and, OK, you might want to add an MPI user-defined datatype to send only the upper part of R)
Stable enough?

[Two plots, m = 100, n = 50: ||A − QR||₂/||A||₂ and ||I − QᵀQ||₂ as functions of κ₂(A) (from 1 to 10¹⁴), for cholqr, cgs, mgs, and Householder.]
Parallel distributed CholeskyQR

This method is extremely fast, for two reasons: first, there are only one or two communication phases; second, the local computations are performed with fast operations. Another advantage of this method is that the resulting code is exactly four lines, so the method is simple and relies heavily on other libraries. Despite all those advantages, this method is highly unstable.
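Why unstable? Forming C = AᵀA squares the condition number, so information below the square root of machine epsilon is destroyed before the Cholesky factorization even starts. A three-line illustration with a Läuchli-type matrix (our example, not one from the slides):

```python
# Columns a1 = (1, e, 0) and a2 = (1, 0, e) are nearly parallel for small e.
# Exactly, C = A^T A = [[1+e^2, 1], [1, 1+e^2]], and the second Cholesky
# pivot is (1+e^2) - 1/(1+e^2), about 2e^2 > 0.  In double precision with
# e = 1e-8, 1 + e^2 rounds to exactly 1.0 (e^2 = 1e-16 is below half an ulp
# of 1), so the computed pivot is 0 and the Cholesky factorization of C
# breaks down, even though A has full rank.
e = 1.0e-8
c11 = 1.0 + e * e               # rounds to exactly 1.0
c12 = 1.0
pivot = c11 - c12 * c12 / c11   # Schur complement = second Cholesky pivot
```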
Reduce Algorithms

The gather-scatter variant of our algorithm can be summarized as follows:

1. perform the local QR factorization of the matrix Ai,
2. gather the p R factors on processor 0,
3. perform a QR factorization of all the R factors put one on top of the other; the R factor obtained is the R factor,
4. scatter the Q factors from processor 0 to all the processors,
5. multiply locally the two Q factors together; done.

[Diagram of the gather-scatter variant: QR(Ai) → (Qi(0), Ri); QR of the stacked R1 … R4 → R and QW1 … QW4; Qi := Qi(0) QWi.]
Reduce Algorithms

• This is the scatter-gather version of our algorithm.
• This variant is not very efficient, for two reasons:
  – first, the communication phases 2 and 4 heavily involve processor 0;
  – second, the cost of step 3 is p/3 * n³, so it can get prohibitive for large p.
• Note that the CholeskyQR algorithm can also be implemented in a scatter-gather way, but reduce-broadcast is used instead. This leads naturally to the algorithm presented below, where a reduce-broadcast version of the previous algorithm is described. This will be our final algorithm.
On two processes

[Diagram, one row per process, time flowing left to right:
1. Each process i computes a local QR factorization: QR(Ai) → (Ri(0), Vi(0)).
2. Process 0 stacks the two triangles and factors them: QR([R0(0); R1(0)]) → (R0(1), with Householder vectors V0(1) and V1(1)); R0(1) is the final R.
3. The Householder vectors (V0(1), V1(1)) are applied to [In; 0n] to form Q0(1) and Q1(1); Q1(1) goes to process 1.
4. Each process applies its local Vi(0) to [Qi(1); 0n] to obtain its block Qi of the global Q factor.]
The big picture…

[Diagram: A = QR over time and processes. The blocks A0 … A6, distributed one per process, are factored locally; the R factors are combined up a binary tree; the processes end up holding the blocks Q0 … Q6 of Q and the global R.]
[Successive frames of the same diagram highlight which steps are communication and which are computation.]
Latency, but also the possibility of fast panel factorization.

• DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines.
• Times include QR and DLARFT.
• Run on a Pentium III.

QR factorization and construction of T, m = 10,000. Perf in MFLOP/sec (times in sec):

n   | DGEQR3       | DGEQRF        | DGEQR2
50  | 173.6 (0.29) |  65.0 (0.77)  | 64.6 (0.77)
100 | 240.5 (0.83) |  62.6 (3.17)  | 65.3 (3.04)
150 | 277.9 (1.60) |  81.6 (5.46)  | 64.2 (6.94)
200 | 312.5 (2.53) | 111.3 (7.09)  | 65.9 (11.98)

[Plot: MFLOP/sec versus n for m = 1,000,000.]
When only R is wanted

[Diagram, time flowing left to right: each process i computes QR(Ai) → Ri(0); pairs of R factors are stacked and factored up a binary tree, QR([R0(0); R1(0)]) → R0(1) and QR([R2(0); R3(0)]) → R2(1), and finally QR([R0(1); R2(1)]) → R.]
When only R is wanted: the MPI_Allreduce

In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2: it performs the operation

   QR( [R1; R2] ) → R.

This binary operation is associative, and this is all MPI needs in order to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the binary operation becomes commutative.

The code becomes two lines:

   lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info );
   MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm );
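The reduction operation can be prototyped in a few lines of Python. For illustration we substitute modified Gram-Schmidt for the Householder kernel of the slides: MGS also yields the R factor of the stacked pair, and it naturally produces a positive diagonal, which is exactly the normalization that makes the operation commutative as noted above.

```python
import math

def r_factor(A):
    """R factor (with positive diagonal) of A via modified Gram-Schmidt."""
    m, n = len(A), len(A[0])
    V = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(V[i][j] ** 2 for i in range(m)))
        q = [V[i][j] / R[j][j] for i in range(m)]
        for k in range(j + 1, n):
            R[j][k] = sum(q[i] * V[i][k] for i in range(m))
            for i in range(m):
                V[i][k] -= R[j][k] * q[i]
    return R

def combine(R1, R2):
    """The user-defined reduction: R factor of [R1; R2] stacked."""
    return r_factor(R1 + R2)
```

Since the R factor with positive diagonal is unique for full-rank input, combine(combine(Ra, Rb), Rc) agrees with combine(Ra, combine(Rb, Rc)) up to rounding, and combine(Ra, Rb) agrees with combine(Rb, Ra).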
Does it work?

• The experiments are performed on the beowulf cluster at the University of Colorado at Denver. The cluster is made of 35 bi-pro Pentium III (900 MHz) nodes connected with a Dolphin interconnect.
• The number of operations is taken as 2mn² for all the methods.
• The block size used in ScaLAPACK is 32.
• The code is written in C and uses MPI (mpich-2.1), LAPACK (3.1.1), BLAS (goto-1.10), the LAPACK C wrappers (http://icl.cs.utk.edu/~delmas/lapwrapmw.htm) and the BLAS C wrappers (http://www.netlib.org/blas/blast-forum/cblas.tgz).
• The codes have been tested in various configurations and have never failed to produce a correct answer; releasing those codes is on the agenda.

FLOPs (total):

Method        | R only                      | Q and R
CholeskyQR    | mn² + n³/3                  | 2mn² + n³/3
Gram-Schmidt  | 2mn²                        | 2mn²
Householder   | 2mn² − 2/3 n³               | 4mn² − 4/3 n³
Allreduce HH  | (2mn² − 2/3 n³) + 2/3 n³ p  | (4mn² − 4/3 n³) + 4/3 n³ p
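Plugging the experiment's sizes (m = 100,000, n = 50; p is the number of processors) into these formulas — our own evaluation of the table, not new data — shows why CholeskyQR is hard to beat on flops alone when m >> n, and why the Allreduce term 2/3 n³ p stays negligible while n³ p << mn²:

```python
def flops_r_only(m, n, p):
    """Total FLOPs to compute R only, from the table above."""
    return {
        "CholeskyQR":   m * n**2 + n**3 / 3,
        "Gram-Schmidt": 2 * m * n**2,
        "Householder":  2 * m * n**2 - (2 / 3) * n**3,
        "Allreduce HH": (2 * m * n**2 - (2 / 3) * n**3) + (2 / 3) * n**3 * p,
    }

# m = 100,000, n = 50, p = 32: CholeskyQR needs about half the flops of
# the Householder-based methods; Allreduce HH pays a small extra n^3 term
# per processor in exchange for a single reduction.
f = flops_r_only(100_000, 50, 32)
```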
Mixed single-double precision solver
QandR:Strongscalability• Inthisexperiment,wefixthe
problem:m=100,000andn=50.Thenweincreasethenumberofprocessors.
• Oncemorethealgorithmrhh_qr3isthesecondbehindCholeskyQR.Notethatrhh_qr3isincondionnallystablewhilethestabilityofCholeskyQRdependsonthesquareofthecondi4onnumberoftheini4almatrix.
#ofprocs
cholqr rhh_qr3 cgs mgs(row) rhh_qrf qrf qr2
1 489.2 (1.02) 120.0 (4.17) 134.1 (3.73) 73.5 (6.81) 51.9 (9.64) 39.1 (12.78) 34.3 (14.60)
2 467.3 (0.54) 100.8 (2.48) 78.9 (3.17) 39.0 (6.41) 31.2 (8.02) 22.3 (11.21) 20.2 (12.53)
4 466.4 (0.27) 97.9 (1.28) 71.3 (1.75) 38.7 (3.23) 31.0 (4.03) 22.2 (5.63) 18.8 (6.66)
8 434.0 (0.14) 95.9 (0.65) 67.4 (0.93) 36.7 (1.70) 34.0 (1.84) 20.8 (3.01) 17.7 (3.54)
16 359.2 (0.09) 103.8 (0.30) 54.2 (0.58) 31.6 (0.99) 27.8 (1.12) 18.3 (1.71) 16.3 (1.91)
32 197.8 (0.08) 84.9 (0.18) 41.9 (0.37) 29.0 (0.54) 33.3 (0.47) 15.8 (0.99) 14.5 (1.08)
(Table values: MFLOP/sec/proc, with time in seconds in parentheses.)
Q and R: Weak scalability with respect to m

• We fix the local size to be mloc = 100,000 and n = 50. When we increase the number of processors, the global m grows proportionally.
• rhh_qr3 is the Allreduce algorithm with recursive panel factorization; rhh_qrf is the same with LAPACK Householder QR. We see the obvious benefit of using recursion. See as well (6). qr2 and qrf correspond to the ScaLAPACK Householder QR factorization routines.
# of procs
cholqr rhh_qr3 cgs mgs(row) rhh_qrf qrf qr2
1 489.2 (1.02) 121.2 (4.13) 135.7 (3.69) 70.2 (7.13) 51.9 (9.64) 39.8 (12.56) 35.1 (14.23)
2 466.9 (1.07) 102.3 (4.89) 84.4 (5.93) 35.6 (14.04) 27.7 (18.06) 20.9 (23.87) 20.2 (24.80)
4 454.1 (1.10) 96.7 (5.17) 67.2 (7.44) 41.4 (12.09) 32.3 (15.48) 20.6 (24.28) 18.3 (27.29)
8 458.7 (1.09) 96.2 (5.20) 67.1 (7.46) 33.2 (15.06) 28.3 (17.67) 20.5 (24.43) 17.8 (28.07)
16 451.3 (1.11) 94.8 (5.27) 67.2 (7.45) 33.3 (15.04) 27.4 (18.22) 20.0 (24.95) 17.2 (29.10)
32 442.1 (1.13) 94.6 (5.29) 62.8 (7.97) 32.5 (15.38) 26.5 (18.84) 19.8 (25.27) 16.9 (29.61)
64 414.9 (1.21) 93.0 (5.38) 62.8 (7.96) 32.3 (15.46) 27.0 (18.53) 19.4 (25.79) 16.6 (30.13)
(Table values: MFLOP/sec/proc, with time in seconds in parentheses.)
Q and R: Weak scalability with respect to n

• We fix the global size m = 100,000 and then we increase n as sqrt(p) so that the workload mn^2 per processor remains constant.
• Due to better performance in the local factorization or SYRK, CholeskyQR, rhh_qr3 and rhh_qrf exhibit increasing performance at the beginning, until the n^3 term comes into play.
# of procs
cholqr rhh_qr3 cgs mgs(row) rhh_qrf qrf qr2
1 490.7 (1.02) 120.8 (4.14) 134.0 (3.73) 69.7 (7.17) 51.7 (9.68) 39.6 (12.63) 39.9 (14.31)
2 510.2 (0.99) 126.0 (4.00) 78.6 (6.41) 40.1 (12.56) 32.1 (15.71) 25.4 (19.88) 19.0 (26.56)
4 541.1 (0.92) 149.4 (3.35) 75.6 (6.62) 39.1 (12.78) 31.1 (16.07) 25.5 (19.59) 18.9 (26.48)
8 540.2 (0.92) 173.8 (2.86) 72.3 (6.87) 38.5 (12.89) 43.6 (11.41) 27.8 (17.85) 20.2 (24.58)
16 501.5 (1.00) 195.2 (2.56) 66.8 (7.48) 38.4 (13.02) 51.3 (9.75) 28.9 (17.29) 19.3 (25.87)
32 379.2 (1.32) 177.4 (2.82) 59.8 (8.37) 36.2 (13.84) 61.4 (8.15) 29.5 (16.95) 19.3 (25.92)
64 266.4 (1.88) 83.9 (5.96) 32.3 (15.46) 36.1 (13.84) 52.9 (9.46) 28.2 (17.74) 18.4 (27.13)
(Table values: MFLOP/sec/proc, with time in seconds in parentheses; the processor count and n grow together. The performance drop at 64 processors is the n^3 effect.)
R only: Strong scalability

• In this experiment, we fix the problem: m = 100,000 and n = 50. Then we increase the number of processors.
# of procs
cholqr rhh_qr3 cgs mgs(row) rhh_qrf qrf qr2
1 1099.046 (0.45) 147.6 (3.38) 139.309 (3.58) 73.5 (6.81) 69.049 (7.24) 69.108 (7.23) 68.782 (7.27)
2 1067.856 (0.23) 123.424 (2.02) 78.649 (3.17) 39.0 (6.41) 41.837 (5.97) 38.008 (6.57) 40.782 (6.13)
4 1034.203 (0.12) 116.774 (1.07) 71.101 (1.76) 38.7 (3.23) 39.295 (3.18) 36.263 (3.44) 36.046 (3.47)
8 876.724 (0.07) 119.856 (0.52) 66.513 (0.94) 36.7 (1.70) 37.397 (1.67) 35.313 (1.77) 34.081 (1.83)
16 619.02 (0.05) 129.808 (0.24) 53.352 (0.59) 31.6 (0.99) 33.581 (0.93) 31.339 (0.99) 31.697 (0.98)
32 468.332 (0.03) 95.607 (0.16) 42.276 (0.37) 29.0 (0.54) 37.226 (0.42) 25.695 (0.60) 25.971 (0.60)
64 195.885 (0.04) 77.084 (0.10) 25.89 (0.30) 22.8 (0.34) 36.126 (0.22) 17.746 (0.44) 17.725 (0.44)
(Table values: MFLOP/sec/proc, with time in seconds in parentheses.)
R only: Weak scalability with respect to m

• We fix the local size to be mloc = 100,000 and n = 50. When we increase the number of processors, the global m grows proportionally.
# of procs
cholqr rhh_qr3 cgs mgs(row) rhh_qrf qrf qr2
1 1098.7 (0.45) 145.4 (3.43) 138.2 (3.61) 70.2 (7.13) 70.6 (7.07) 68.7 (7.26) 69.1 (7.22)
2 1048.3 (0.47) 124.3 (4.02) 70.3 (7.11) 35.6 (14.04) 43.1 (11.59) 35.8 (13.95) 36.3 (13.76)
4 1044.0 (0.47) 116.5 (4.29) 82.0 (6.09) 41.4 (12.09) 35.8 (13.94) 36.3 (13.74) 34.7 (14.40)
8 993.9 (0.50) 116.2 (4.30) 66.3 (7.53) 33.2 (15.06) 35.1 (14.21) 35.5 (14.05) 33.8 (14.75)
16 918.7 (0.54) 115.2 (4.33) 64.1 (7.79) 33.3 (15.04) 34.0 (14.66) 33.4 (14.94) 33.0 (15.11)
32 950.7 (0.52) 112.9 (4.42) 63.6 (7.85) 32.5 (15.38) 33.4 (14.95) 33.3 (15.01) 32.9 (15.19)
64 764.6 (0.65) 112.3 (4.45) 62.7 (7.96) 32.3 (15.46) 34.0 (14.66) 32.6 (15.33) 32.3 (15.46)
(Table values: MFLOP/sec/proc, with time in seconds in parentheses.)
Q and R: Strong scalability

In this experiment, we fix the problem: m = 1,000,000 and n = 50. Then we increase the number of processors. BlueGene/L (frost.ncar.edu).

[Figure (plot): MFLOPs/sec/proc versus # of processors (32, 64, 128, 256); curves: ReduceHH (QR3), ReduceHH (QRF), ScaLAPACK QRF, ScaLAPACK QR2.]
Conclusions

We have described a new method for the Householder QR factorization of skinny matrices. The method is named Allreduce Householder and has four advantages:

1. there is only one synchronization point in the algorithm,
2. the method harvests most of the efficiency of the computing unit through large local operations,
3. the method is stable,
4. and finally the method is elegant, in particular in the case where only R is needed.

Allreduce algorithms have been illustrated here with the Householder QR factorization; however, the idea can be applied to other factorizations, for example Gram-Schmidt or LU.

Current development is writing a 2D block-cyclic QR factorization and LU factorization based on those ideas.
[Figure: one step (k=1) of the tile QR/LU factorization on a 3x3 grid of tiles. The diagonal tile (k=1) is factored by DGEQRT/DGETRF; the tiles to its right (k=1, j=2,3) are updated by DLARFB/DGESSM; the tiles below it (k=1, i=2,3) are factored by DTSQRT/DTSTRF; and the trailing tiles (k=1, i=2,3, j=2,3) are updated by DSSRFB/DSSSSM.]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. LAWN 191 – A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, September 2007.
for (k = 0; k < TILES; k++) {
    dgeqrt(A[k][k], T[k][k]);
    for (n = k+1; n < TILES; n++)
        dlarfb(A[k][k], T[k][k], A[k][n]);
    for (m = k+1; m < TILES; m++) {
        dtsqrt(A[k][k], A[m][k], T[m][k]);
        for (n = k+1; n < TILES; n++)
            dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
    }
}
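As a quick sanity check on the task graph this loop nest generates, the kernel invocations can be counted by mirroring the loops (a sketch; count_tasks is not part of the original code): for t tiles there are t DGEQRT, t(t-1)/2 DLARFB, t(t-1)/2 DTSQRT, and t(t-1)(2t-1)/6 DSSRFB tasks.

```c
#include <assert.h>

/* Count the tasks generated by the tile QR loop nest for a t-by-t
 * tile matrix, mirroring the loop structure of the algorithm. */
typedef struct { long geqrt, larfb, tsqrt, ssrfb; } task_counts;

static task_counts count_tasks(long t)
{
    task_counts c = {0, 0, 0, 0};
    for (long k = 0; k < t; k++) {
        c.geqrt++;                                     /* diagonal tile   */
        for (long n = k+1; n < t; n++) c.larfb++;      /* row updates     */
        for (long m = k+1; m < t; m++) {
            c.tsqrt++;                                 /* below diagonal  */
            for (long n = k+1; n < t; n++) c.ssrfb++;  /* trailing tiles  */
        }
    }
    return c;
}
```

The O(t^3) DSSRFB tasks dominate, which is why that kernel's performance sets the asymptotic speed of the whole factorization.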
Unrolling the first steps on a 4x4 tile matrix (the animation highlights in gray the tiles each kernel touches: the diagonal tile holds V and R after dgeqrt, T is stored aside, dlarfb updates the tiles to the right, dtsqrt eliminates the tile below the diagonal, and dssrfb updates the corresponding trailing tiles):

1. dgeqrt(A[0][0], T[0][0]);
2. dlarfb(A[0][0], T[0][0], A[0][1]);
3. dlarfb(A[0][0], T[0][0], A[0][2]);
4. dlarfb(A[0][0], T[0][0], A[0][3]);
5. dtsqrt(A[0][0], A[1][0], T[1][0]);
6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]);
7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]);
8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]);

…and so on for the remaining tiles of column 0, then for steps k = 1, 2, 3.
Execution of the parallel TSQR factorization on a binary tree of four processors. The gray boxes indicate where local QR factorizations take place. The Q and R factors each have two subscripts: the first is the sequence number within that stage, and the second is the stage number.
Execution of the sequential TSQR factorization on a flat tree with four submatrices. The gray boxes indicate where local QR factorizations take place. The Q and R factors each have two subscripts: the first is the sequence number for that stage, and the second is the stage number.
Execution of a hybrid parallel / out-of-core TSQR factorization. The matrix has 16 blocks, and four processors can execute local QR factorizations simultaneously. The gray boxes indicate where local QR factorizations take place. We number the blocks of the input matrix A in hexadecimal to save space (which means that the subscript letter A is the number 10, but the non-subscript letter A is a matrix block). The Q and R factors each have two subscripts: the first is the sequence number for that stage, and the second is the stage number.
A is m-by-n with m = pb and n = qb. We are only interested in the first step of the Householder QR factorization.

[Figure: the m-by-n matrix, m = pb rows by n = qb columns, partitioned into b-by-b blocks.]
1 Classic unblocked Householder factorization: DGEQR2

1. DGEQR2: panel factorization of the first b columns:
       2mb^2 - (2/3)b^3 = 2(p - 1/3)b^3.

2. DLARF: apply the Householder vectors one by one to the remaining matrix; this takes b small steps:
       sum_{l=(p-1)b+1}^{pb} 4(q-1)b·l
       = 4(q-1)b · sum_{i=1}^{b} ((p-1)b + i)
       = 4(q-1)b · ((p-1)b^2 + (1/2)b^2)
       = (4p-2)(q-1)b^3.

Total: (4pq - 2p - 2q + 4/3) b^3.
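The DLARF sum above can be checked numerically; a small sketch (function names are illustrative) compares the exact sum with the (4p-2)(q-1)b^3 model, whose error is exactly the dropped lower-order term 2(q-1)b^2:

```c
#include <assert.h>

/* Exact flop count of applying the b reflectors of the first panel:
 * a reflector of length l applied to (q-1)b columns costs 4(q-1)b*l. */
static double dlarf_exact(long p, long q, long b)
{
    double s = 0.0;
    for (long l = (p-1)*b + 1; l <= p*b; l++)
        s += 4.0*(q - 1)*b*(double)l;
    return s;
}

/* The (4p-2)(q-1)b^3 model derived in the text. */
static double dlarf_model(long p, long q, long b)
{
    return (4.0*p - 2.0)*(q - 1)*(double)b*b*b;
}
```

The gap comes from approximating sum_{i=1}^{b} i = b(b+1)/2 by b^2/2, which changes the inner sum by b/2 and the product by 2(q-1)b^2, negligible against the b^3 terms.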
2 Classic blocked Householder factorization: DGEQRF

1. DGEQR2: panel factorization of the first b columns:
       2mb^2 - (2/3)b^3 = 2(p - 1/3)b^3.

2. DLARFT: construction of the T matrix to apply the Householder reflectors by block:
       mb^2 - (1/3)b^3 = (p - 1/3)b^3.

3. DLARFB: apply the V vectors by block to the remaining matrix, C <- (I - V T V^T) C:
       4bm(n-b) - b^2(n-b) = (4p-1)(q-1)b^3,
   or, doing it by hand:
       (2p-1)(q-1)b^3 + (q-1)b^3 + (2p-1)(q-1)b^3 = (4p-1)(q-1)b^3.

Total: (4pq - p - q) b^3.

So, comparing quickly the unblocked Householder code with the blocked Householder code, there is an overhead of about (p+q)b^3: a pb^3 overhead due to DLARFT and a qb^3 overhead due to DLARFB.
3 SUQRA

Step 1.
1.a. First QR factorization (of the top b-by-b tile): (4/3)b^3.
1.b. DLARFT: construction of the T matrix to apply the Householder reflectors by block: (2/3)b^3.
1.c. DLARFB: apply the V vectors by block to the remaining tiles of the first row: 3(q-1)b^3.

Step 2 (repeat this step (p-1) times).
2.a. TSQR factorization (TS = triangle on top of square): 2b^3.
2.b. TS-LARFT: construction of the T matrix to apply the Householder reflectors by block: (4/3)b^3.
2.c. TS-LARFB: apply the V vectors by block to the remaining matrix: 5(q-1)b^3.

Total: (5pq - (5/3)p - 2q + 2/3) b^3.

This algorithm is going to perform 20% more FLOPs. ⇒ Need for inner blocking.
4 SUQRA without blocking

Step 1.
1.a. First QR factorization: (4/3)b^3.
1.c. DLARFB: apply the reflectors to the remaining tiles of the first row: 2(q-1)b^3.

Step 2 (repeat this step (p-1) times).
2.a. TSQR factorization (TS = triangle on top of square): 2b^3.
2.c. TS-LARFB: apply the reflectors to the remaining matrix: 4(q-1)b^3.

Total: (4pq - 2p - 2q + 4/3) b^3.

Exactly the same number of FLOPs as for the unblocked QR factorization.
5 Comparison

          unblocked QR          unblocked SUQRA       blocked QR           blocked SUQRA
panel     (2p - 2/3)b^3         (2p - 2/3)b^3         (2p - 2/3)b^3        (2p - 2/3)b^3
T         —                     —                     (p - 1/3)b^3         ((4/3)p - 2/3)b^3
update    (4pq-2q-4p+2)b^3      (4pq-2q-4p+2)b^3      (4pq-q-4p+1)b^3      (5pq-2q-5p+2)b^3
total     (4pq-2q-2p+4/3)b^3    (4pq-2q-2p+4/3)b^3    (4pq-q-p)b^3         (5pq-2q-(5/3)p+2/3)b^3
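The totals in this table can be encoded directly (a sketch, in units of b^3, with illustrative function names) to confirm the two headline facts: the two unblocked variants coincide, blocked QR pays roughly (p+q)b^3 over unblocked QR, and blocked SUQRA tends to 5/4 of the QR total, the overhead the text quotes as 20%:

```c
#include <assert.h>
#include <math.h>

/* Totals from the comparison table, in units of b^3. */
static double unblocked_qr(double p, double q)    { return 4*p*q - 2*q - 2*p + 4.0/3; }
static double unblocked_suqra(double p, double q) { return 4*p*q - 2*q - 2*p + 4.0/3; }
static double blocked_qr(double p, double q)      { return 4*p*q - q - p; }
static double blocked_suqra(double p, double q)   { return 5*p*q - 2*q - (5.0/3)*p + 2.0/3; }
```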
We can integrate those values to obtain the FLOP count of the full algorithm: we replace q by i as i varies from q down to 1, and we replace p by p - q + i. We assume that p > q, i.e. the matrix is taller than wide.

This gives for the unblocked QR algorithm:

    sum_{i=1}^{q} (4(p-q+i)i - 2i - 2(p-q+i) + 4/3) b^3
    = sum_{i=1}^{q} ((4p-4q-4)i + 4i^2 - 2p + 2q + 4/3) b^3
    = ((4p-4q-4) q^2/2 + 4 q^3/3 - 2pq + 2q^2 + (4/3)q) b^3
    = (2pq^2 - 2q^3 - 2q^2 + (4/3)q^3 - 2pq + 2q^2 + (4/3)q) b^3
    = (2pq^2 - (2/3)q^3 - 2pq + (4/3)q) b^3
    = 2mn^2 - (2/3)n^3 - 2mnb + (4/3)nb^2.

The leading order is OK: we find 2mn^2 - (2/3)n^3 as expected.
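This integration can also be checked numerically: summing the per-step costs exactly and comparing with the closed form, which keeps only the q^2/2 and q^3/3 terms of the sums, leaves a gap of 2pq - (4/3)q, lower order than the 2pq^2 leading term (a sketch with illustrative p, q):

```c
#include <assert.h>
#include <math.h>

/* Exact sum of the per-step costs of the unblocked QR algorithm,
 * in units of b^3, following the summation in the text. */
static double steps_exact(long p, long q)
{
    double s = 0.0;
    for (long i = 1; i <= q; i++)
        s += 4.0*(p - q + i)*i - 2.0*i - 2.0*(p - q + i) + 4.0/3;
    return s;
}

/* Closed form (2pq^2 - (2/3)q^3 - 2pq + (4/3)q) derived above. */
static double closed_form(long p, long q)
{
    return 2.0*p*q*q - (2.0/3)*q*q*q - 2.0*p*q + (4.0/3)*q;
}
```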
This gives for the blocked SUQRA:

    sum_{i=1}^{q} (5(p-q+i)i - 2i - (5/3)(p-q+i) + 2/3) b^3
    ~ sum_{i=1}^{q} (5pi - 5qi + 5i^2) b^3
    ~ ((5/2)pq^2 - (5/2)q^3 + (5/3)q^3) b^3
    ~ (5/2)mn^2 - (5/6)n^3,     (1)

which means a 20% overhead.
6 Remark on the panel factorization

Classical method: QR factorization of a 2b-by-b panel:
    2(2b)b^2 - (2/3)b^3 = (10/3)b^3.

PUQRA:
    step 1: (4/3)b^3 + (4/3)b^3;
    step 2: (2/3)b^3;
    total: (10/3)b^3.

SUQRA:
    step 1: (4/3)b^3;
    step 2 (triangle on top of square): 2b^3;
    total: (10/3)b^3.
[Figure: the DTSQRT and DSSRFB kernels. DTSQRT factors the triangle A stacked on a square tile, producing the factors U and T and the reflectors V; DSSRFB applies (V, T) to the corresponding pair of tiles (B, C).]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. LAWN 191 – A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, September 2007.
An interesting middleware: SMPSs

void dgeqrt(double *RV1, double *T);
void dtsqrt(double *R, double *V2, double *T);
void dlarfb(double *V1, double *T, double *C1);
void dssrfb(double *V2, double *T, double *C1, double *C2);
cilk void qr_panel(int k)
{
    int m;
    dgeqrt(A[k][k], T[k][k]);
    for (m = k+1; m < TILES; m++)
        dtsqrt(A[k][k], A[m][k], T[m][k]);
}

cilk void qr_update(int n, int k)
{
    int m;
    dlarfb(A[k][k], T[k][k], A[k][n]);
    for (m = k+1; m < TILES; m++)
        dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
    if (n == k+1)
        spawn qr_panel(k+1);
}

spawn qr_panel(0);
sync;
for (k = 0; k < TILES; k++) {
    for (n = k+1; n < TILES; n++)
        spawn qr_update(n, k);
    sync;
}

Figure 14: Cilk implementation of the tile QR factorization with 1D work assignment and lookahead.
5.2 SMPSs Implementation
Figure 15 shows the implementation using SMPSs, which closely follows the one for Cholesky. The functions implementing parallel tasks are designated with #pragma css task annotations defining the directionality of the parameters (input, output, inout). The parallel section of the code is designated with #pragma css start and #pragma css finish annotations. Inside the parallel section the algorithm is implemented using the canonical representation of four loops with three levels of nesting, which closely matches the pseudocode definition of Figure 11.
The SMPSs runtime system schedules tasks based on dependencies and attempts to maximize data reuse by following the parent-child links in the task
#pragma css task inout(RV1[NB][NB]) output(T[NB][NB])
void dgeqrt(double *RV1, double *T);

#pragma css task inout(R[NB][NB], V2[NB][NB]) output(T[NB][NB])
void dtsqrt(double *R, double *V2, double *T);

#pragma css task input(V1[NB][NB], T[NB][NB]) inout(C1[NB][NB])
void dlarfb(double *V1, double *T, double *C1);

#pragma css task input(V2[NB][NB], T[NB][NB]) inout(C1[NB][NB], C2[NB][NB])
void dssrfb(double *V2, double *T, double *C1, double *C2);

#pragma css start
for (k = 0; k < TILES; k++) {
    dgeqrt(A[k][k], T[k][k]);
    for (m = k+1; m < TILES; m++)
        dtsqrt(A[k][k], A[m][k], T[m][k]);
    for (n = k+1; n < TILES; n++) {
        dlarfb(A[k][k], T[k][k], A[k][n]);
        for (m = k+1; m < TILES; m++)
            dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
    }
}
#pragma css finish

Figure 15: SMPSs implementation of the tile QR factorization.
graph when possible.

There is a caveat here, however. V1 is an input parameter of task dlarfb(). It is also an inout parameter of task dtsqrt(). However, dlarfb() only reads the lower triangular portion of the tile, while dtsqrt() only updates the upper triangular portion of the tile. Since in both cases the tile is passed to the functions by the pointer to the upper left corner of the tile, SMPSs sees a false dependency. As a result, the execution of the dlarfb() tasks in a given step will be stalled until all the dtsqrt() tasks complete, despite the fact that both types of tasks can be scheduled in parallel as soon as the dgeqrt() task completes. Figure 16 shows conceptually the change that needs to be done.
Currently SMPSs is not capable of recognizing accesses to triangular matrices. There are, however, multiple ways to enforce the correct behavior.
Figure 23 shows performance for the QR factorization. The situation is a little different here. Performance of the Cilk implementations is still the poorest and the performance of the static pipeline is still superior. However, performance of the initial SMPSs implementation is only marginally better than Cilk 1D, while performance of the improved SMPSs* implementation is only marginally worse than the static pipeline. The same conclusions apply to the tile LU factorization (Figure 24).
[Figure 23 (plot): Tile QR Factorization Performance — Gflop/s versus matrix size (up to 8000); curves: Static Pipeline, SMPSs*, SMPSs, Cilk 1D, Cilk 2D.]

Figure 23: Performance of the tile QR factorization in double precision on a 2.4 GHz quad-socket quad-core (16 cores total) Intel Tigerton system. Tile size nb = 144, inner block size IB = 48.
Relatively better performance of SMPSs for the QR and LU factorizations versus the Cholesky factorization can be explained by the fact that the LU factorization is two times more expensive and the QR factorization is four times more expensive, in terms of floating point operations. This diminishes the impact of various overheads for smaller size problems.
[Figure 24 (plot): Tile LU Factorization Performance — Gflop/s versus matrix size (up to 8000); curves: Static Pipeline, SMPSs*, SMPSs, Cilk 1D, Cilk 2D.]

Figure 24: Performance of the tile LU factorization in double precision on a 2.4 GHz quad-socket quad-core (16 cores total) Intel Tigerton system. Tile size nb = 200, inner block size IB = 40.
8 Conclusions
In this work, the suitability of emerging multicore programming frameworks was analyzed for implementing modern formulations of classic dense linear algebra algorithms: the tile Cholesky, the tile QR and the tile LU factorizations. These workloads are represented by large task graphs with compute-intensive tasks interconnected with a very dense and complex net of dependencies.

For the workloads under investigation, the conducted experiments show a clear advantage of the model where automatic parallelization is based on construction of arbitrary DAGs. SMPSs provides a much higher level of automation than Cilk and similar frameworks, requiring only minimal programmer intervention and basically leaving the programmer oblivious to any aspects of parallelization. At the same time it delivers superior performance through more flexible scheduling of operations.

SMPSs still loses to hand-written code for the very regular compute-intensive workloads investigated.
From:
• Jakub Kurzak, Hatem Ltaief, Jack Dongarra, and Rosa M. Badia. Scheduling Linear Algebra Operations on Multicore Processors. LAWN 213.

See also:

• Rosa M. Badia, Jose R. Herrero, Jesus Labarta, Josep M. Perez, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. Parallelizing dense and banded linear algebra libraries using SMPSs. UPC-DAC-RR-2008-64.

• Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. Scheduling Two-sided Transformations using Algorithms-by-Tiles on Multicore Architectures. LAWN 214.
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. LAWN 191 – A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, September 2007.
[Figure (plot): Cholesky — 2-way Quad Clovertown; Gflop/s versus problem size (2000–12000); curves: DGEMM peak, Tiled + asynch., MKL-9.1, LAPACK.]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. LAWN 191 – A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, September 2007.
[Figure (plot): QR — 2-way Quad Clovertown; Gflop/s versus problem size (2000–12000); curves: DGEMM peak, Tiled + asynch., MKL-9.1, LAPACK.]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. LAWN 191 – A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, September 2007.
[Figure 5 (plot): Tile QR Factorization — 3.2 GHz CELL Processor; Gflop/s versus matrix size (0–4000); curves: SSSRFB peak, Tile QR.]

Figure 5: Performance of the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs. Square matrices were used. The solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 × 8 = 177 Gflop/s).
“The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000–by–4000 dense matrix in single precision in exactly half of a second. To the author's knowledge, at present, it is the fastest reported time of solving such problem by any semiconductor device implemented on a single semiconductor die.”
Jakub Kurzak and Jack Dongarra, LAWN 201 – QR Factorization for the CELL Processor, May 2008.
Strategy:
1. obtain some lower bounds for the cost (latency, bandwidth, # of operations) of LU, QR and Cholesky insequential and parallel distributed
2. compute the costs of our algorithms and compare with the lower bound.
Lower bounds:
1. For LU, observe that:

       [ I  0  -B ]   [ I        ] [ I  0  -B  ]
       [ A  I   0 ] = [ A  I     ] [    I  A·B ]
       [ 0  0   I ]   [        I ] [        I  ]

   therefore the lower bounds for matrix-matrix multiply (latency, bandwidth and operations) also hold for LU.

2. For Cholesky, observe that:

       [  I      A^T    -B ]   [  I                   ] [ I  A^T  -B  ]
       [  A    I+A·A^T   0 ] = [  A      I            ] [    I    A·B ]
       [ -B^T    0       D ]   [ -B^T  (A·B)^T   X    ] [         X^T ]

   however this gets nasty due to the A·A^T term in the initial matrix. See Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal Parallel and Sequential Cholesky decomposition. UCB/EECS-2009-29, February 13th, 2009.

3. For QR, we needed to redo the proof of optimality of matrix-matrix multiply. See James W. Demmel, Laura Grigori, Mark F. Hoemmen, and Julien Langou. Communication-avoiding parallel and sequential QR factorizations. arXiv:0902.2537.
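The LU reduction can be verified numerically: building both sides of the block identity for small blocks shows that the triangular factors of the left-hand matrix indeed contain A·B, so computing an LU factorization computes a matrix product (a toy sketch; the block size S is illustrative):

```c
#include <assert.h>
#include <math.h>
#include <string.h>

enum { S = 2, T3 = 3*S };  /* block size and full matrix size */

/* Left-hand side: [ I 0 -B ; A I 0 ; 0 0 I ]. */
static void build_lhs(const double A[S][S], const double B[S][S],
                      double M[T3][T3])
{
    memset(M, 0, sizeof(double)*T3*T3);
    for (int i = 0; i < T3; i++) M[i][i] = 1.0;
    for (int i = 0; i < S; i++)
        for (int j = 0; j < S; j++) {
            M[S+i][j]   = A[i][j];   /* A block  */
            M[i][2*S+j] = -B[i][j];  /* -B block */
        }
}

/* Unit lower factor L and upper factor U of the identity in the text;
 * U contains the product A*B in its (2,3) block. */
static void build_lu(const double A[S][S], const double B[S][S],
                     double L[T3][T3], double U[T3][T3])
{
    double AB[S][S] = {{0}};
    for (int i = 0; i < S; i++)
        for (int j = 0; j < S; j++)
            for (int k = 0; k < S; k++)
                AB[i][j] += A[i][k]*B[k][j];
    memset(L, 0, sizeof(double)*T3*T3);
    memset(U, 0, sizeof(double)*T3*T3);
    for (int i = 0; i < T3; i++) { L[i][i] = 1.0; U[i][i] = 1.0; }
    for (int i = 0; i < S; i++)
        for (int j = 0; j < S; j++) {
            L[S+i][j]     = A[i][j];   /* A in the unit lower factor */
            U[i][2*S+j]   = -B[i][j];  /* -B in the upper factor     */
            U[S+i][2*S+j] = AB[i][j];  /* A*B shows up inside U      */
        }
}
```

Any LU factorization of the left-hand matrix must therefore produce A·B somewhere in its factors, which is why the matrix-multiply communication lower bounds carry over.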
             New Algorithm: Par. CAQR        ScaLAPACK: PDGEQRF           Lower bound
# flops      (4/3)·(n^3/P)             ✓     (4/3)·(n^3/P)            ✓    O(n^3/P)
# words      ((3/4)·log P)·(n^2/√P)    ✓     ((3/4)·log P)·(n^2/√P)   ✓    O(n^2/√P)
# messages   ((3/8)·log^3 P)·√P        ✓     ((5/4)·log^2 P)·n        ×    O(√P)

The new algorithm attains all the lower bounds (✓); ScaLAPACK misses the latency lower bound (×).

Performance models of parallel CAQR and ScaLAPACK's parallel QR factorization PDGEQRF on a square n×n matrix with P processors, along with lower bounds on the number of flops, words, and messages. The matrix is stored in a 2-D Pr×Pc block cyclic layout with square b×b blocks. We choose b, Pr, and Pc optimally and independently for each algorithm. Everything (messages, words, and flops) is counted along the critical path.
             Seq. CAQR            Householder QR      Lower bound
# flops      (4/3)·n^3            (4/3)·n^3           O(n^3)
# words      3·n^3/√W             (1/3)·n^4/W         O(n^3/√W)
# messages   (1/2)·n^3/W^(3/2)    (1/2)·n^3/W         O(n^3/W^(3/2))

Performance models of sequential CAQR and blocked sequential Householder QR on a square n×n matrix with fast memory size W, along with lower bounds on the number of flops, words, and messages.
Opportunities for further research:
how to tile the matrix? square blocks, nonsquare blocks?
kernel tuning: introduction of a lot of new kernels (e.g. the QR factorization of a triangle on top of a square). For each kernel:

1. how to optimize the blocking parameter (nb)?

2. which algorithmic variant to choose (left looking, recursive, ...)?

3. the inner blocking parameter (ib).

Questions 2 and 3 are standard autotuning problems: choosing ib and the algorithmic variant is done in terms of nb. Question 1 is more subtle: choosing nb is done at the matrix level (n) since it influences the granularity of the algorithm.

reduction algorithm: which reduction tree to use? Binary tree? Flat tree? Hybrid tree? Each of these choices represents an algorithm change. No framework to accommodate this yet.
scheduling: How to schedule all these tasks?
1. static scheduling or dynamic scheduling?
2. and in parallel distributed ...