MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
APPROVED BY SUPERVISING COMMITTEE:
R. Clint Whaley, Ph.D., Chair
Qing Yi, Ph.D.
Dakai Zhu, Ph.D.
Accepted: Dean, Graduate School
MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
by
SIJU SAMUEL, M.Tech.
THESIS
Presented to the Graduate Faculty of
The University of Texas at San Antonio
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT SAN ANTONIO
College of Sciences
Department of Computer Science
May 2011
Acknowledgements
I want to thank my advisor, Dr. R. Clint Whaley, for his excellent academic guidance for
the thesis and for my graduate studies. I am very grateful to all my professors and friends for their
support leading to a rewarding experience at UTSA. I want to thank my family for their love and
constant support for all my pursuits in life. The research for this work was supported in part by
National Science Foundation CRI grants CNS-0551504 and CCF-0833203.
May 2011
MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
Siju Samuel, M.S.
The University of Texas at San Antonio, 2011
Supervising Professor: R. Clint Whaley, Ph.D.
QR factorization is an extremely important linear algebra operation used in solving multi-
ple linear equations, particularly least-square-error problems, and in finding eigenvalues and eigen-
vectors. This thesis details the author’s contributions to the field of computer science by providing
performance-efficient QR routines to ATLAS (Automatically Tuned Linear Algebra Software). AT-
LAS is an open source linear algebra library, intended for high performance computing. The author
has added new implementations for four types/precisions (single real, double real, single complex,
and double complex) in four different variants of matrix factorization (QR, RQ, QL and LQ). QR
factorization involves a panel factorization and a trailing matrix update operation. A statically
blocked algorithm is used for the full matrix factorization. A recursive formulation is implemented
for the QR panel factorization, providing more robust performance. Together these techniques
result in substantial performance improvement over the LAPACK version.
TABLE OF CONTENTS
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER 1: Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Libraries and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Dynamic Panel Factorization Using Recursion . . . . . . . . . . . . . . . . . . 5
1.2.2 Impact of Panel Factorization in Full Matrix Factorization . . . . . . . . . . . 6
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
CHAPTER 2: Overview of QR Matrix Factorization And LAPACK
Implementation 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 QR Decomposition using Householder Transformation . . . . . . . . . . . . . 10
2.2 QR, RQ, QL and LQ Transformations: Computation And Storage . . . . . . . . . . 10
2.2.1 QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 RQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 QL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 LQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 DGEQR2: Unblocked QR Implementation in LAPACK . . . . . . . . . . . . . . . 15
2.4 RQ2, QL2 and LQ2: Serial Implementation in LAPACK . . . . . . . . . . . . . . 18
2.5 DGEQRF: QR Blocked Implementation in LAPACK using WY Representation . . 22
2.5.1 DGEQRF Computational Steps . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 RQF, QLF and LQF Blocked Implementation in LAPACK using WY Representation 28
2.6.1 DGERQF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.2 DGEQLF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.3 DGELQF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
CHAPTER 3: QR ATLAS Implementation using Recursive Algorithm 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Unblocked QR factorization (ATL_geqr2) . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Statically blocked QR factorization (ATL_geqrf) . . . . . . . . . . . . . . . . . . . 33
3.3.1 Call structure for ATLAS GEQRF (ATL_geqrf) . . . . . . . . . . . . . . . . . 34
3.3.2 Computation of GEQRR (ATL_geqrr) . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Blocked QL Factorization (ATL_geqlf) . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Computation of GEQLR (ATL_geqlr) . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Blocked RQ and LQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1 Blocked RQ Factorization (ATL_gerqf) . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Blocked LQ Factorization (ATL_gelqr) . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Derivation for T-Block for the Recursive T Computation . . . . . . . . . . . . . . . . 40
CHAPTER 4: Results and Analysis 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 General Comparison Of ATLAS and LAPACK Performance . . . . . . . . . . 49
4.3.2 Full Problem Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Core2 Full Problem Performance . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.4 Opt8 Panel Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
CHAPTER 5: Summary 57
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
VITA
LIST OF FIGURES
1.1 Statically Blocked QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Recursive QR Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Unblocked Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 LAPACK Blocked QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 LAPACK - A matrix and Work matrix in DLARFB . . . . . . . . . . . . . . . . . . 27
2.4 LAPACK Blocked RQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 LAPACK Blocked QL Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 LAPACK Blocked LQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Recursive QR Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Recursive QL Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Recursive RQ Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Recursive LQ Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Recursive T Computation for QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Recursive T Computation for RQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Recursive T Computation for QL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 Recursive T Computation for LQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Static Blocking Full Square Matrix (Unblocked Vs Recursive Panel Factorization) . . 49
4.2 QR (a,b,c,d) & QL (c,d,e,f), Core2, Full Problem Performances . . . . . . . . . . . . 53
4.3 LQ (a,b,c,d) & RQ (c,d,e,f), Core2, Full Problem Performances . . . . . . . . . . . . 54
4.4 QR (a,b,c,d) & QL (c,d,e,f), Opt8, Full Problem Performances . . . . . . . . . . . . 55
4.5 LQ (a,b,c,d) & RQ (c,d,e,f), Opt8, Full Problem Performances . . . . . . . . . . . . 56
LIST OF TABLES
2.1 DGEQLF LARFB Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 QR2 ATLAS routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 QRF ATLAS routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
CHAPTER 1: INTRODUCTION
1.1 Introduction
The goal of this research is to provide new QR routines in the ATLAS (Automatically Tuned Linear
Algebra Software) [11, 12, 13, 16, 15, 14] high-performance linear algebra library. In linear algebra,
QR, RQ, QL and LQ matrix factorization refers to the decomposition of a matrix into an orthogonal
and a triangular matrix. Householder transformations [5] are widely used in numerical linear algebra
to perform the above decompositions because they are numerically stable. LAPACK (Linear Algebra
PACKage) [1] provides an implementation of the matrix decomposition in FORTRAN. This work
developed an ANSI C implementation of the above matrix factorizations as part of ATLAS. The
recursive panel factorization algorithm [3] is used to provide better computational performance.
1.1.1 Libraries and Terminology
ATLAS provides optimized versions of two standard Linear Algebra APIs. It provides a com-
plete BLAS (Basic Linear Algebra Subprograms) library, and a partial LAPACK (Linear Algebra
PACKage) implementation.
The BLAS are a collection of kernels that are highly tuned to each architecture. They are
designed to be used as building blocks by higher level packages such as LAPACK, which handles
more complicated operations such as factoring matrices, solving systems and finding eigenvalues.
The idea is that most of the tuning can take place in the BLAS (a small set of routines), providing
performance portability to all operations which are built out of them.
The BLAS are split into three levels, depending on their operands. The Level 1 BLAS are
vector-vector operations (e.g., dot product), the Level 2 BLAS are matrix-vector operations (e.g.,
matrix-vector product), and the Level 3 BLAS are matrix-matrix operations (e.g., matrix-matrix
multiply). The Level 1 BLAS have O(N) data and O(N) operations, while the Level 2 BLAS
have O(N^2) data and O(N^2) operations. Thus, neither the Level 1 nor the Level 2 BLAS gets significant
reuse of data in the cache, and thus the performance of both is strongly limited by the speed of
memory. This has two unfortunate consequences: these problems cannot run anywhere near the
computational peak, and they do not parallelize well (since memory performance does not scale
with core count, as computational performance does).
On the other hand, the Level 3 BLAS have O(N^2) data, but O(N^3) operations, which,
along with their ability to be reordered and tiled, allows these operations to get significant reuse at
all levels of the memory hierarchy. The L3BLAS therefore usually obtain most of the theoretical
peak of the machine when run in serial, and achieve extremely good scaling when parallelized.
Level 1, Level 2 and Level 3 operations in BLAS will be identified as L1BLAS, L2BLAS and L3BLAS,
respectively.
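The reuse argument above can be made concrete with a little arithmetic. The Python sketch below is purely illustrative (the real BLAS are tuned native kernels, and these are leading-order counts only); it computes the flop-to-data ratio for a representative operation at each BLAS level:

```python
def flops_per_element(level, n):
    """Rough flop-to-data ratios for representative BLAS operations on
    n-sized operands: dot product (Level 1), matrix-vector multiply
    (Level 2), and matrix-matrix multiply (Level 3)."""
    if level == 1:            # dot product: 2n flops over 2n elements
        flops, data = 2 * n, 2 * n
    elif level == 2:          # gemv: ~2n^2 flops over n^2 + 2n elements
        flops, data = 2 * n * n, n * n + 2 * n
    elif level == 3:          # gemm: 2n^3 flops over 3n^2 elements
        flops, data = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("BLAS level must be 1, 2 or 3")
    return flops / data

# Only the Level 3 ratio grows with n, so only the L3BLAS can amortize
# memory traffic through cache reuse.
```

For n = 1000 the ratios are roughly 1, 2, and 667 flops per element moved, which is why only the L3BLAS can hide the cost of memory traffic.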
The routines of the BLAS and LAPACK follow a naming convention: routine names start
with "D" for double precision real; the corresponding prefixes for single real, single complex and
double complex precisions/types are "S", "C" and "Z", respectively. In the BLAS, some of the important routines are GER,
GEMV and GEMM. GER and GEMV are L2BLAS, and GEMM is the most important L3BLAS. The routine
DGER performs a "General" rank-1 update (adding the outer product of two vectors to the elements
of a matrix). In the BLAS, "General" is used to describe dense rectangular or square matrices, as
opposed to more specialized matrix types, such as triangular or symmetric. DGEMV performs a real
matrix-vector multiplication. DGEMM performs a real matrix-matrix multiplication.
DGEQRF is an LAPACK routine which performs a QR factorization on a full matrix in
double precision using static blocking. DGEQLF, DGELQF and DGERQF are the similar routines for QL,
LQ and RQ variants. Many routines will be introduced in this paper using a similar naming pattern
as needed.
ATLAS follows a naming convention: the routine ATL_dgemm denotes a double precision
matrix-matrix multiplication routine. In most cases the equivalent routines in the BLAS and LAPACK
are named similarly. For example, ATL_dgemm in ATLAS corresponds to DGEMM in the BLAS. Simi-
larly, ATL_zgeqrf refers to ZGEQRF (double complex precision) in LAPACK. A similar pattern is employed
for all QR related routines.
A general usage of GEMM or GEQRF without the precision character denotes the general
computational algorithm. QR, RQ, QL and LQ factorization refer to the general factorization algo-
rithm for each variant, independent of precision/type. Depending on the context, QR is also used to
denote the matrix factorization in general, independent of variant and precision/type.
The following terminologies are used in matrix factorization operations. An unblocked
factorization refers to factoring a matrix using the L2BLAS. A blocked factorization means dividing
a matrix into a ”column panel” and ”trailing matrix”, performing a panel factorization, and then
updating the trailing matrix (using the L3BLAS) iteratively. The column panel is a sub-matrix
with typically a large number of rows and relatively few columns. The trailing matrix is a matrix
following the panel matrix in the full matrix to be factored in one iteration. Similarly a row
panel has a small number of rows and a large number of columns. The width with which a matrix is
blocked is called nb, which is also referred to as the blocking factor. Static blocking refers to a blocked
factorization where the nb is constant in each iteration; if the nb varies in blocked factorization, it
is called dynamic blocking. Figure 1.1(a) shows an example for column panel, trailing matrix and
nb involved in a matrix factorization, and those terms will be more fully described in the following
section.
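The iteration just described can be written down as a schedule. The following Python sketch is a hypothetical helper (not an ATLAS or LAPACK routine) that lists the panel-factorization and trailing-update steps a statically blocked factorization of an m × n matrix performs with a constant nb:

```python
def blocked_schedule(m, n, nb):
    """List the (operation, rows, cols) steps of a statically blocked
    factorization of an m x n matrix with constant blocking factor nb."""
    steps = []
    k = min(m, n)
    for j in range(0, k, nb):
        jb = min(nb, k - j)                        # width of this column panel
        steps.append(("factor_panel", m - j, jb))  # factor A[j:m, j:j+jb]
        if j + jb < n:                             # then update the trailing matrix
            steps.append(("update_trailing", m - j, n - j - jb))
    return steps
```

For an 8 × 8 matrix with nb = 4 this yields a panel factorization and a trailing update, followed by a final panel factorization of the remaining 4 × 4 block; note how each iteration shrinks both dimensions by nb.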
1.1.2 Literature Review
The original Householder transformation and procedure for factoring A = QR appeared in 1958,
in [4]. This is nearly the unblocked algorithm, although LAPACK makes a minor mathematical
change to force the Householder vector to begin with ’1.0’ (to save a storage location). The blocked
version was introduced in 1987 by Bischof and Van Loan [8], however this required a large work
area. That problem was corrected with some new mathematics and a new supplementary matrix,
in 1989 by Schreiber and Van Loan [7]. This is the modern form implemented in LAPACK. The
mathematics and computations necessary for the recursive version (not yet in LAPACK) were
developed in 2000 by Elmroth and Gustavson [3]. In 2008, even more new mathematics led to a
tiled, DAG-based parallel version of QR in [2],[6].
QR is used in solving multiple linear equations, particularly least-square-error problems,
and in finding eigenvalues and eigenvectors. The QR factorization (and its variants) is preferred
primarily for a favorable error analysis property: the factorization is considered perfectly stable.
1.2 Overview
A QR factorization of A ∈ ℜ^(M×N) with M ≥ N yields

    A = Q · R = [Q1 Q2] · [R1; 0] = Q1 · R1                              (1.2.1)

(where [R1; 0] stacks R1 on top of an (M−N)×N zero block).
Here Q ∈ ℜ^(M×M) is orthogonal and R1 ∈ ℜ^(N×N) is upper triangular. We also handle the
M < N case, but since we are concentrating mostly on the panel where M > N , we do not discuss
the M < N case here. In LAPACK, Q is obtained as a product of Householder transformations for
each column. The LAPACK implementation of the QR matrix factorization uses a statically blocked
algorithm. It involves factoring a panel using an unblocked algorithm (L2BLAS) and then applying
the Householder transformation to the trailing matrix using the L3BLAS. The panel factorization
and trailing matrix update operations are done iteratively.
In Figure 1.1(b), for a matrix AM×N , a panel of width nb (static blocking factor) is
factored into a lower triangular/trapezoidal portion of YP and an upper triangular RT. YP represents
the Householder elementary vectors computed for each Householder transformation. YP is a storage-
efficient representation of Q, as used in LAPACK. T is an nb × nb upper triangular matrix used
for the trailing matrix update. Using the panel factorization results and T, the right hand side
trailing matrix computation is performed. The shaded area shows the new matrix to factor in the
next iteration (i.e., at each step of the iteration M and N are reduced by nb).
Figure 1.1: Statically Blocked QR Factorization
The unblocked panel factorization is done using the L2BLAS. The remainder matrix up-
date is done using the L3BLAS. To maximize the performance of QR factorization it is extremely
important to choose an nb which will maximize the performance in the panel factorization operation
and the trailing matrix update operations. Unfortunately, we will see that these goals are at odds,
with the unblocked panel factorization demanding a small nb, while the update does best with a
large nb. We address this issue by adding a recursive panel factorization that can maintain high
performance using a large nb.
Panel factorization performance is strongly affected by the cache state. If the panel is
cache contained, the L2BLAS will run close to the speed of the cache, which is much faster than
main memory. However, limiting the panel size to fit into cache can result in using a very thin
panel for the trailing matrix update. If nb is too small, the L3BLAS are not able to get much cache
reuse, and thus both serial and parallel performance is strongly reduced. On the other hand, the
very large nb that is ideal for the trailing matrix update can cause insupportable slowdown when
used in the L2BLAS-based panel factorization.
1.2.1 Dynamic Panel Factorization Using Recursion
The recursive QR factorization is a dynamically blocked algorithm which uses the L3BLAS, and was
developed by Elmroth and Gustavson [3]. We use the recursive algorithm to perform the panel
factorization. A panel is recursively divided into sub-panels until the sub-panel fits into the L2-
cache (or nb gets too narrow to make further division helpful). Only then do we call the unblocked
panel factorization, which will then run at roughly the cache speed, rather than at memory speed.
More importantly, the panel factor time will now usually be dominated by the time in the L3BLAS
coming from the recursion, and only mildly affected by the unblocked speed experienced at the end
of the recursion.
Figure 1.2: Recursive QR Panel Factorization
An illustration is given in Figure 1.2, where an input panel matrix A is recursively fac-
tored (1). It is divided into sub-panels A0 and A1 (2). Here it is assumed that A0 and A1 fit into the
L2-cache (one level of recursion). A0 is factored using the unblocked algorithm (3). Then the trailing
matrix A1 is updated using the L3BLAS (4). Then the updated sub-panel A1′ is factored using the unblocked
algorithm (5). This completes the panel factorization (6).
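The recursion can also be expressed in executable form. The pure-Python sketch below is illustrative only: the actual ATLAS code is ANSI C and updates the right half with a blocked (compact WY, GEMM-based) update, whereas this sketch applies reflectors one at a time; the names geqr2/geqrr merely mirror the naming used in this thesis, and no LAPACK-style sign choice is made for the reflectors:

```python
def house(x):
    """Householder data for x: H x = (alpha, 0, ..., 0)^T with
    H = I - (1/beta) v v^T.  No sign trick is applied, unlike LAPACK."""
    alpha = sum(e * e for e in x) ** 0.5
    if alpha == x[0]:                 # column already reduced; reflector is a no-op
        return [0.0] * len(x), 0.0
    beta = alpha * (alpha - x[0])
    return [x[0] - alpha] + x[1:], 1.0 / beta

def apply_house(A, rows, cols, v, scale):
    """Apply H = I - scale * v v^T to the submatrix A[rows, cols] in place."""
    for c in cols:
        dot = sum(v[i] * A[r][c] for i, r in enumerate(rows))
        for i, r in enumerate(rows):
            A[r][c] -= scale * dot * v[i]

def geqr2(A, r0, c0, m, n):
    """Unblocked (reflector-at-a-time) factorization of A[r0:r0+m, c0:c0+n]."""
    refs = []
    for j in range(min(m, n)):
        rows = list(range(r0 + j, r0 + m))
        v, scale = house([A[r][c0 + j] for r in rows])
        apply_house(A, rows, range(c0 + j, c0 + n), v, scale)
        refs.append((rows, v, scale))
    return refs

def geqrr(A, r0, c0, m, n, min_width=2):
    """Recursive panel factorization: split the panel, factor the left half,
    update the right half with its reflectors, then recurse on the right."""
    if n <= min_width:
        return geqr2(A, r0, c0, m, n)
    n0 = n // 2
    refs = geqrr(A, r0, c0, m, n0, min_width)
    for rows, v, scale in refs:                      # update trailing sub-panel
        apply_house(A, rows, range(c0 + n0, c0 + n), v, scale)
    return refs + geqrr(A, r0 + n0, c0 + n0, m - n0, n - n0, min_width)
```

Both geqr2 and geqrr overwrite the panel with R in its upper triangle and produce the same reflectors; the recursive version simply reorders the work so that, with a GEMM-based trailing update, most of it lands in the L3BLAS.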
The recursive algorithm involves more floating point operations compared to the un-
blocked algorithm [3]. The performance advantage in using a recursive algorithm in the panel
factorization is attributed to performing L2BLAS operations at cache speed and the L3BLAS replac-
ing the L2BLAS operations involved in the trailing sub-panel update. Similar improvements can also
be achieved by statically blocking the panel, but this requires extensive tuning for different panel
sizes and shapes to perform efficiently. Using recursion helps by dynamically blocking sub-panels
of arbitrary size and shapes without requiring additional empirical tuning of blocking factors.
1.2.2 Impact of Panel Factorization in Full Matrix Factorization
The dynamically blocked panel factorizations allow us to factor much larger panels efficiently. For
a full matrix factorization, this larger nb results in much improved L3BLAS performance and scaling
in the trailing matrix update.
Although a larger nb helps L3BLAS performance, a very large nb can still hurt overall
performance. As mentioned earlier, any blocked algorithm requires more floating point operations
than the unblocked QR factorization. The number of additional floating point operations grows as
O(nb^2 · N) [3]. If we begin the recursion at the top, nb = N/2, which means we would do O(N^3) extra
flops, which is intolerable in QR, since the entire algorithm is only O(N^3). A high nb also demands
more storage for computation which indirectly affects cache reuse. So the best algorithm uses a
hybrid approach [3], where static blocking with a constrained nb is used in the main algorithm in
order to limit the flop growth, while dynamic blocking on panels (where flop growth is minor due
to the static nb) ensures we need only 1 tuning step (for the nb used as the outer blocking factor
in the full matrix factorization).
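To see the scale of this overhead, model the extra work with the leading term c · nb^2 · N; the constant c = 1 and the (4/3)N^3 flop count for square QR are used here purely for illustration, not as measured data:

```python
def extra_flops(nb, n, c=1.0):
    """Leading-order extra flops of a dynamically blocked factorization,
    modeled as c * nb^2 * n (the constant factor c is illustrative)."""
    return c * nb * nb * n

n = 4096
useful = (4.0 / 3.0) * n ** 3        # ~flop count of square QR itself
fixed = extra_flops(56, n)           # modest static nb: overhead is negligible
full = extra_flops(n // 2, n)        # recursion started at the top: nb = N/2

# With nb = N/2 the overhead is itself O(N^3) -- a constant fraction of the
# entire factorization -- while a fixed nb keeps it well under 0.1%.
```

This is the quantitative reason for the hybrid scheme: a constrained static nb bounds the flop growth, and recursion inside each panel adds only the minor nb-dependent term.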
1.3 Motivation
The recursive panel factorization gives significant speedup compared to the unblocked algorithm.
It gives comparable or better performance than the statically blocked algorithm for all problem sizes
and shapes. Note that static blocking for the panel factorization with a fine tuned nb can win over
the recursive algorithm for small problems. The main problem with using a second step of static
blocking to handle the panel factorization is that it requires substantial tuning on both dimensions
in order to maintain good performance. Tuning for all used rectangular shapes is infeasible, and
thus recursion, which performs well on any shape of panel, is strongly preferred.
We validate this by examining the factorization performance of rectangular matrices when
using unblocked, statically blocked and recursive algorithms. Figure 1.3 gives the QR factorization
performance of thin matrices of width 56 and 168 respectively and varying M . The thin matrix
simulates the equivalent of a panel in large matrices.
For each dimension in the graph, using nb = 1 indicates the operation is using the un-
blocked algorithm. The statically blocked performance for each nb is charted next. The last column
always represents the performance obtained by factoring the entire panel using the recursive algo-
rithm. For this experiment, we have used a machine with an L2-cache size of 6144 KB per core, which
can hold roughly a 14000 × 56 or a 4681 × 168 panel of double precision elements.
Figure 1.3: QR Factorization
Figure 1.3 shows that the blocked algorithms (static and recursive) outperform the un-
blocked algorithm for all surveyed problem sizes. As the problem gets larger, the speedup increases.
This is because the larger the matrix, the greater advantage the L3BLAS have over the L2BLAS, with
the L3BLAS completely dominant once the problem is out of cache.
In a statically blocked algorithm, the nb has significant impact on performance. For ex-
ample, for a 4000 × 168 matrix, an nb of 16 gives the best performance. When nb is 4, the performance
is reduced by 56%. This shows that choosing the right nb is extremely important in achieving the
best performance for a given problem size.
Figure 1.3 also shows that the best nb for one problem might not yield the best perfor-
mance for a different size. An nb of 8 gives the best performance for 12000 × 56, but for 12000 ×
168 it performs 21% slower than the best-performing nb of 16. This shows that a tuned nb for a
particular problem size may be substantially slower when used for other panel sizes.
Now if we compare the statically blocked and the dynamically blocked performances,
we can see that when the problem is small, the recursively blocked factorization shows slightly
reduced performance compared to the best statically blocked nb. But as the problem size increases,
the recursive algorithm gives better performance than the best statically blocked nb. This is the
general pattern we see in more complete timings: recursion gets competitive performance for all
panel shapes. Therefore, since recursion leads to good performance without requiring massive and
unsustainable tunings, we use it for our panel factorization.
1.4 Outline
The rest of this paper is organized as follows: Chapter 2 provides a detailed analysis of LAPACK’s
QR subroutines, in order to get a thorough understanding of the matrix factorization algorithms.
In Chapter 3, the new ATLAS implementation using the dynamically blocked recursive algorithm
is discussed. The mathematical analysis and the implementation details are outlined for all QR
variants and in all four precisions/types. Chapter 4 provides results and analysis. A quantitative
comparison of the new matrix factorization technique in ATLAS with the LAPACK implementa-
tions is provided. The impact of recursive panel factorization on QR panel and full
matrix factorizations is studied on two commodity architectures, showing that the recursive algo-
rithm provides an efficient panel factorization, which significantly improves full matrix factorization
performance. Finally Chapter 5 provides the summary of the paper.
In general, details of mathematical analysis and algorithms are presented for double pre-
cision QR factorization. Other variants and precisions/types are discussed in reference to double
precision QR when there is a substantial difference.
CHAPTER 2: OVERVIEW OF QR MATRIX FACTORIZATION
AND LAPACK IMPLEMENTATION
2.1 Introduction
This chapter describes the mathematical details of the QR matrix factorization and how it is
implemented in LAPACK [1]. LAPACK provides implementations of QR (and QL, RQ and LQ) using
unblocked and blocked factorization algorithms in four precisions/types (real single, real double,
single complex and double complex). The unblocked algorithm uses the L2BLAS. As we have seen
in the previous chapter, the blocked algorithm iteratively processes a panel using the unblocked
factorization and then updates a trailing matrix using the L3BLAS. The trailing matrix update
using the L3BLAS is CPU-bound, but the unblocked panel factorization runs at the speed of the
memory. To provide an improved factorization the author needed to do a detailed study of the
existing mathematical formulation, computational mapping, storage and APIs. Mathematical and
computational steps are explained mainly in terms of QR factorization in double precision. Details
about other QR variants and other precisions/types are also provided when there is a significant
difference.
QR matrix factorization is an extremely important operation which is used by LAPACK
to solve all over- and under-determined systems. A full matrix factorization can be made efficient
by providing a better panel factorization technique. A detailed analysis of LAPACK QR routines
will enable us to find the improvements that can be applied to the panel factorization techniques. In
addition, the mathematical algorithms used in trailing matrix update by LAPACK can be adapted
to work in the new ATLAS QR routines.
2.1.1 Outline
The remaining sections of this chapter are organized as follows: First a mathematical overview of
the QR factorization is given in Section 2.1.2. Section 2.2 gives an overview of the computation and
storage for all the variants and precisions implemented by LAPACK. Section 2.3 details the imple-
mentation of the unblocked algorithm along with the mapping of the subroutines to mathematical
steps in LAPACK for QR double precision. Section 2.4 provides similar details for the other QR
variants. The blocked algorithm analysis for QR is provided in Section 2.5. Finally, Section 2.6
describes how to apply blocking to the other variants.
2.1.2 QR Decomposition using Householder Transformation
Using Householder transformations, the matrix A can be factored into a product QR,

    A = QR,

where Q is an orthogonal matrix and R is an upper triangular matrix, with

    Q = H1 H2 ... Hn−1
    R = Hn−1 ... H2 H1 A

The Hi are Householder matrices,

    H = I − (1/β) v v^T                                                  (2.1.1)

The H matrix is orthogonal and symmetric. Given a vector x ∈ R^n (these x are the
columns/modified columns of A),

    α = ‖x‖_2,   β = α(α − x1)
    v = (x1 − α, x2, ..., xn)^T

See [5] for details.
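These formulas can be checked numerically. The Python sketch below is illustrative only: LAPACK additionally chooses the sign of α to avoid cancellation and normalizes v so that v(1) = 1, which this sketch does not, and it assumes x is not already a multiple of e1 (so β ≠ 0):

```python
def householder(x):
    """Build alpha, beta, v from the formulas above, so that
    H x = (alpha, 0, ..., 0)^T with H = I - (1/beta) v v^T."""
    alpha = sum(e * e for e in x) ** 0.5          # alpha = ||x||_2
    beta = alpha * (alpha - x[0])                 # beta = alpha (alpha - x1)
    v = [x[0] - alpha] + list(x[1:])              # v = (x1 - alpha, x2, ..., xn)
    return alpha, beta, v

def reflect(x, beta, v):
    """Apply H = I - (1/beta) v v^T to x without forming H."""
    coef = sum(vi * xi for vi, xi in zip(v, x)) / beta
    return [xi - coef * vi for vi, xi in zip(v, x)]

alpha, beta, v = householder([3.0, 4.0])
# reflect([3.0, 4.0], beta, v) gives [5.0, 0.0] up to rounding:
# the reflector annihilates everything below the first entry.
```

Note that applying H as a rank-1 correction (as reflect does) costs O(n) per vector, which is why reflectors are never formed as explicit matrices in practice.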
LAPACK provides the following implementation for QR decomposition using Householder trans-
formation
1. DGEQRF: Blocked factorization
2. DGEQR2: Unblocked factorization (used to factor panels for DGEQRF)
2.2 QR, RQ, QL and LQ Transformations: Computation And
Storage
A given matrix A[m × n] can be factored into QR, RQ, LQ and QL variants, where Q is an
orthogonal matrix and L, R are triangular matrices. The computation and storage details of each
of these variants is discussed in the following sections.
2.2.1 QR
In QR factorization using Householder transformations, a matrix of size A[m×n] can be decomposed
into:
A = QR
For real numbers, Q is an orthogonal matrix and R is an upper triangular or
an upper trapezoidal matrix
For complex numbers, Q is a unitary matrix and R is an upper triangular or
an upper trapezoidal matrix
Computation and Storage in LAPACK and ATLAS (QR routines)
In the QR-variant, A[m×n] is the input matrix, and on exit, the elements on and above the diagonal of
the array contain the min(m,n)×n upper trapezoidal matrix R (R is upper triangular if m ≥ n).
The elements below the diagonal, together with the array TAU [1], represent the orthogonal matrix
Q in case of real (unitary matrix Q in case of Complex) as a product of elementary reflectors.
The matrix Q is represented as a product of elementary reflectors

    Q = H1 H2 ... Hk, where k = min(m,n).

Each Hi has the form

    Hi = I − τ v v^T    (real precisions)                                (2.2.1)
    Hi = I − τ v v^H    (complex precisions)                             (2.2.2)
On exit, matrix A can be represented as shown below for several representative examples of m and
n. Matrices with dimensions [3× 3], [5× 3] and [3× 5] are shown for real/complex precisions.
[m = n]:
A[3×3] =
  [ R11 R12 R13 ]
  [ v12 R22 R23 ]
  [ v13 v23 R33 ]

[m > n]:
A[5×3] =
  [ R11 R12 R13 ]
  [ v12 R22 R23 ]
  [ v13 v23 R33 ]
  [ v14 v24 v34 ]
  [ v15 v25 v35 ]

[m < n]:
A[3×5] =
  [ R11 R12 R13 R14 R15 ]
  [ v12 R22 R23 R24 R25 ]
  [ v13 v23 R33 R34 R35 ]
v is a real/complex vector with v(1 : i−1) = 0 and v(i) = 1; v(i+1 : m) is stored on exit in
A(i+1 : m, i). Note that the elements v(1 : i−1) = 0 and v(i) = 1 are not stored, since the
elements above the diagonal are known to be 0, and the diagonal element is known to be 1.
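This packing rule can be illustrated directly. In the Python sketch below, unpack_v is a hypothetical helper (not a LAPACK or ATLAS routine), and the numeric values in A are arbitrary placeholders standing in for R entries (on/above the diagonal) and stored v entries (below it):

```python
def unpack_v(A, i):
    """Recover the full Householder vector v_i (0-based column i) from
    packed QR output: v(1:i-1) = 0 and v(i) = 1 are implicit; only
    v(i+1:m) is read from A(i+1:m, i)."""
    m = len(A)
    return [0.0] * i + [1.0] + [A[r][i] for r in range(i + 1, m)]

# Packed 5 x 3 output: R on/above the diagonal, v entries below it.
A = [[9.0, 8.0, 7.0],
     [0.5, 6.0, 5.0],
     [0.4, 0.3, 4.0],
     [0.3, 0.2, 0.1],
     [0.2, 0.1, 0.0]]
```

Here unpack_v(A, 1) returns [0.0, 1.0, 0.3, 0.2, 0.1]: the zero and unit entries are implicit, so Q and R share a single array with no extra storage beyond TAU.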
2.2.2 RQ
In the RQ factorization using Householder transformations, a matrix of size A[m × n] can be
decomposed into:
A = RQ
The matrix Q is represented as a product of elementary reflectors

    Q = H1 H2 ... Hk, where k = min(m,n)          (real precisions)
    Q = H1^H H2^H ... Hk^H, where k = min(m,n)    (complex precisions)

Each Hi has the form

    Hi = I − τ v v^T    (real precisions)                                (2.2.3)
    Hi = I − τ v v^H    (complex precisions)                             (2.2.4)
In the RQ variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several representative examples of m and n.
For real precisions,
[m=n]
A[3×3] =
  R11 R12 R13
  v21 R22 R23
  v31 v32 R33
[m>n]
A[5×3] =
  R11 R12 R13
  R21 R22 R23
  R31 R32 R33
  v21 R42 R43
  v31 v32 R53
[m<n]
A[3×5] =
  v11 v12 R13 R14 R15
  v21 v22 v23 R24 R25
  v31 v32 v33 v34 R35
where v is a real vector with v(n−k+i+1 : n) = 0 and v(n−k+i) = 1; v(1 : n−k+i−1) is stored on exit in A(m−k+i, 1 : n−k+i−1). Note that the elements v(n−k+i) = 1 and v(n−k+i+1 : n) = 0 are not stored. Tau is a vector containing k = min(m,n) scalars, which represent the τ of Equation 2.2.3.
For complex precisions, the stored pattern is the same:
[m=n]
A[3×3] =
  R11 R12 R13
  v21 R22 R23
  v31 v32 R33
[m>n]
A[5×3] =
  R11 R12 R13
  R21 R22 R23
  R31 R32 R33
  v21 R42 R43
  v31 v32 R53
[m<n]
A[3×5] =
  v11 v12 R13 R14 R15
  v21 v22 v23 R24 R25
  v31 v32 v33 v34 R35
For complex, compared to the real case, conjg(v(1 : n−k+i−1)) is stored on exit in A(m−k+i, 1 : n−k+i−1). Tau is a vector containing k = min(m,n) complex scalars, which represent the τ of Equation 2.2.4.
2.2.3 QL
In the QL factorization using Householder transformations, a matrix A of size m×n can be decomposed into
A = QL
The matrix Q is represented as a product of elementary reflectors:
Q = Hk ... H2 H1, where k = min(m,n)  (real and complex precisions)
Each Hi has the form
Hi = I − τvv^T  (real precisions)    (2.2.5)
Hi = I − τvv^H  (complex precisions)    (2.2.6)
In the QL variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several examples of m and n.
For real/complex precisions,
[m=n]
A[3×3] =
  L11 v21 v31
  L21 L22 v32
  L31 L32 L33
[m>n]
A[5×3] =
  v11 v21 v31
  v12 v22 v32
  L11 v23 v33
  L21 L22 v34
  L31 L32 L33
[m<n]
A[3×5] =
  L11 L12 L13 v21 v31
  L21 L22 L23 L24 v32
  L31 L32 L33 L34 L35
v is a real/complex vector with v(m−k+i+1 : m) = 0 and v(m−k+i) = 1; v(1 : m−k+i−1) is stored on exit in A(1 : m−k+i−1, n−k+i). Note that the elements v(m−k+i) = 1 and v(m−k+i+1 : m) = 0 are not stored. Tau is a vector containing k = min(m,n) real/complex scalars, which represent the τ of Equations 2.2.5 and 2.2.6.
2.2.4 LQ
In the LQ factorization using Householder transformations, a matrix A of size m×n can be decomposed into
A = LQ
The matrix Q is represented as a product of elementary reflectors:
Q = Hk ... H2 H1, where k = min(m,n)  (real precisions)
Q = Hk^H ... H2^H H1^H, where k = min(m,n)  (complex precisions)
Each Hi has the form
Hi = I − τvv^T  (real precisions)    (2.2.7)
Hi = I − τvv^H  (complex precisions)    (2.2.8)
In the LQ variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several representative examples of m and n.
For real precision,
[m=n]
A[3×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
[m>n]
A[5×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
  L41 L42 L43
  L51 L52 L53
[m<n]
A[3×5] =
  L11 v12 v13 v14 v15
  L21 L22 v23 v24 v25
  L31 L32 L33 v34 v35
v is a real vector with v(1 : i−1) = 0 and v(i) = 1; v(i+1 : n) is stored on exit in A(i, i+1 : n). Note that the elements v(1 : i−1) = 0 and v(i) = 1 are not stored. Tau is a vector containing k = min(m,n) scalars, which represent the τ of Equation 2.2.7.
For complex precision, the stored pattern is the same:
[m=n]
A[3×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
[m>n]
A[5×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
  L41 L42 L43
  L51 L52 L53
[m<n]
A[3×5] =
  L11 v12 v13 v14 v15
  L21 L22 v23 v24 v25
  L31 L32 L33 v34 v35
v is a complex vector with v(1 : i−1) = 0 and v(i) = 1; conjg(v(i+1 : n)) is stored on exit in A(i, i+1 : n). Tau is a vector containing k = min(m,n) complex scalars, which represent the τ of Equation 2.2.8.
2.3 DGEQR2 : Unblocked QR Implementation in LAPACK
This section outlines LAPACK's DGEQR2, which performs the unblocked factorization (DGEQR2 is called by DGEQRF to factor the panel). DGEQR2 implements an unblocked QR factorization of a real m by n matrix in double precision. A detailed description is provided here for double-precision QR; the analysis of the other variants and precisions/types is outlined in later sections.
In DGEQR2, the computation proceeds over the matrix A one column vector v at a time; each column's result updates all of the columns to its right. An illustration of the unblocked factorization is given in Figure 2.1.
Figure 2.1: Unblocked Factorization
For the first column, find the Householder transformation (1). Apply the transformation to the A matrix; the elements below the diagonal become zero in the first column, and the first row is transformed into a part of the triangular matrix R (2). Note that LAPACK does not explicitly compute H · A; the equivalent computation is given in Step 2 of the computational algorithm detailed below. In LAPACK, the elementary reflectors (v) are stored in place of the zeroed vectors (3). Similar operations continue for iteration 2, but they operate on one fewer row and column. In Figure 2.1, H1 and H2 denote the Householder matrices as discussed in Equation 2.1.1.
The call structure of the DGEQR2 operation is:
DGEQR2
Loop over columns of A:
DLARFG/P : find Householder vector for a column v
DNRM2 : perform Norm-2 operation to find alpha of
Equation 2.1.1. This is done excluding the
diagonal element
DLAPY2 : Update Norm-2 result from DNRM2 with diagonal element
DSCAL : scale column i of A below the diagonal using results from
DLAPY2 to get v
DLARF : update the trailing matrix
DGEMV : find work vector w of Equation 2.3.1
DGER : Rank 1 update of remaining columns as in
Equation 2.3.2
The computational steps for each iteration of the loop are:
Step 1: Create Householder Elementary Reflector (DLARFG)
The Householder elementary reflector matrix is calculated in terms of τ and v, which are obtained by applying the following modifications to Equation 2.1.1. An elementary vector of size 3 is used to show how the computation maps to the H = I − τvv^T form used in LAPACK. In the equations below, A11, A21 and A31 represent the first column vector of A, and α represents the 2-norm of that column.

H = I − (1 / (α(α − A11))) × [ A11 − α ] × [ A11 − α  A21  A31 ]
                             [ A21     ]
                             [ A31     ]

  ⇒ I − ((α − A11)/α) × [ 1           ] × [ 1  A21/(A11−α)  A31/(A11−α) ]
                        [ A21/(A11−α) ]
                        [ A31/(A11−α) ]

  ⇒ I − τ × [ 1  ] × [ 1  v1  v2 ]
            [ v1 ]
            [ v2 ]

where τ = (α − A11)/α. Here α can take either the positive or the negative root; in LAPACK, α is chosen in such a way that the value of τ lies between 1 and 2.
So, H = I − τvv^T.
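The scalar computation above can be sketched in NumPy (a simplified real-precision analogue of DLARFG; `house` is a hypothetical name, and the sketch assumes x[0] is nonzero and omits the overflow safeguards of the real routine):

```python
import numpy as np

def house(x):
    # Generate H = I - tau * v v^T with v[0] = 1 so that H @ x = [alpha, 0, ..., 0].
    # The sign of alpha is chosen opposite to x[0], which keeps tau in [1, 2]
    # and avoids cancellation in x[0] - alpha.  (Assumes x[0] != 0.)
    alpha = -np.sign(x[0]) * np.linalg.norm(x)
    tau = (alpha - x[0]) / alpha
    v = x / (x[0] - alpha)
    v[0] = 1.0
    return v, tau, alpha

x = np.array([3.0, 4.0, 0.0])
v, tau, alpha = house(x.copy())
H = np.eye(3) - tau * np.outer(v, v)
assert np.allclose(H @ x, [alpha, 0.0, 0.0])   # everything below x[0] is zeroed
assert 1.0 <= tau <= 2.0                       # LAPACK's sign choice
```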
Step 2: Applying Householder transformations to A (DLARF)
DLARF computes A′ = HA, with H = (I − τvv^T), so
A′ = (I − τvv^T) A = A − τvv^T A = A − τv(v^T A)
Now v^T A = (A^T v)^T, so let
w = (A^T v),  (DGEMV)    (2.3.1)
A′ = A − τvw^T.  (DGER)    (2.3.2)
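Equations 2.3.1 and 2.3.2 can be sketched directly in NumPy (`apply_reflector` is a hypothetical name; the real DLARF works in place on a submatrix, but the algebra is the same):

```python
import numpy as np

def apply_reflector(A, v, tau):
    # A' = (I - tau * v v^T) @ A without ever forming H: one matrix-vector
    # product (DGEMV) and one rank-1 update (DGER), as in Eqs. 2.3.1/2.3.2.
    w = A.T @ v                   # DGEMV: w = A^T v
    A -= tau * np.outer(v, w)     # DGER : A' = A - tau * v w^T
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
v = np.array([1.0, 0.2, -0.5, 0.3])
tau = 1.4
H = np.eye(4) - tau * np.outer(v, v)
assert np.allclose(apply_reflector(A.copy(), v, tau), H @ A)
```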
DGEQR2 Operation for a 3×3 Matrix
The computational steps are explained using a 3×3 matrix. The first iteration is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column.
Step 0: DGEQR2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
A = [ R11 A12 A13 ]        [ τ1 ]
    [ v21 A22 A23 ]    τ = [ 0  ]
    [ v31 A32 A33 ]        [ 0  ]
Step 2: DLARF/DGEMV (output)
[ A12 A13 ]^T   [ 1   ]
[ A22 A23 ]   × [ v21 ]  ⇒  w = [ w1 ]
[ A32 A33 ]     [ v31 ]         [ w2 ]
Step 3: DLARF/DGER (output)
[ 1   A12 A13 ]   [ 1   ]                        [ 1   R12  R13  ]
[ v21 A22 A23 ] + [ v21 ] × (−τ) × [ w1 w2 ]  ⇒  [ v21 A′22 A′23 ]
[ v31 A32 A33 ]   [ v31 ]                        [ v31 A′32 A′33 ]
Step 4: DGEQR2 (output after first iteration)
[ R11 R12  R13  ]
[ v21 A′22 A′23 ]  ⇒  A′
[ v31 A′32 A′33 ]
Note that R11 is computed in DLARFG. It is replaced by 1 to form the full v vector before the DLARF computations; after the DLARF operation, the value is changed back to R11 in DGEQR2. Similar operations are performed in the other variants but are not discussed explicitly.
2.4 RQ2, QL2 and LQ2 : Serial Implementation in LAPACK
This section outlines the unblocked matrix factorization algorithms in LAPACK for the RQ, QL and LQ variants. Double precision is used as an example, and a 3×3 matrix is used to explain the computational steps.
DGERQ2 Operation for a 3×3 Matrix
In RQ, A = RQ. The Householder elementary vector is stored along rows, and the Householder transformation Hi = I − τvv^T is applied from the right:
AQ^T = R
Q = H1 H2 ... Hk, where k = min(m,n)
Q^T = Hk ... H2 H1
The first iteration (calculating A · Hk) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column. Note that Hi is implemented as Hi = I − τv^T v, specifying v as a row vector instead of a column vector.
Step 0: DGERQ2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
[ A11 A12 A13 ]        [ 0  ]
[ A21 A22 A23 ]    τ = [ 0  ]
[ v31 v32 R33 ]        [ τ3 ]
Step 2: DLARF/DGEMV (output)
[ A11 A12 A13 ] × [ v31 v32 1 ]^T  ⇒  w = [ w1 ]
[ A21 A22 A23 ]                           [ w2 ]
Step 3: DLARF/DGER (output)
[ A11 A12 A13 ]   [ w1 ]                           [ A′11 A′12 R13 ]
[ A21 A22 A23 ] + [ w2 ] × (−τ) × [ v31 v32 1 ]  ⇒ [ A′21 A′22 R23 ]
[ v31 v32 1   ]                                    [ v31  v32  1   ]
Step 4: DGERQ2 (output)
[ A′11 A′12 R13 ]
[ A′21 A′22 R23 ]  ⇒  A′
[ v31  v32  R33 ]
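The row-oriented reflector used in RQ can be sketched the same way (a NumPy illustration; `house_row` is a hypothetical helper, with the pivot at the last element and H applied from the right):

```python
import numpy as np

def house_row(x):
    # Row reflector for RQ: v (with v[-1] = 1) and tau such that
    # x @ (I - tau * v v^T) = [0, ..., 0, alpha]; the pivot is the
    # LAST element of the row rather than the first.
    alpha = -np.sign(x[-1]) * np.linalg.norm(x)
    tau = (alpha - x[-1]) / alpha
    v = x / (x[-1] - alpha)
    v[-1] = 1.0
    return v, tau, alpha

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
v, tau, alpha = house_row(A[-1].copy())
H = np.eye(3) - tau * np.outer(v, v)
A2 = A @ H                 # applied from the right, as in DGERQ2
assert np.allclose(A2[-1], [0.0, 0.0, alpha])   # last row annihilated
```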
DGEQL2 Operation for a 3×3 Matrix
In QL, A = QL. The Householder elementary vector is stored along columns, and the Householder transformation Hi = I − τvv^T is applied from the left:
Q^T A = L
Q = Hk ... H2 H1, where k = min(m,n)
Q^T = H1 H2 ... Hk
The first iteration (calculating Hk · A) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column.
Step 0: DGEQL2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
[ A11 A12 v31 ]        [ 0  ]
[ A21 A22 v32 ]    τ = [ 0  ]
[ A31 A32 L33 ]        [ τ3 ]
Step 2: DLARF/DGEMV (output)
[ A11 A12 ]^T   [ v31 ]
[ A21 A22 ]   × [ v32 ]  ⇒  w = [ w1 ]
[ A31 A32 ]     [ 1   ]         [ w2 ]
Step 3: DLARF/DGER (output)
[ A11 A12 v31 ]           [ v31 ]                [ A′11 A′12 v31 ]
[ A21 A22 v32 ] + (−τ) ×  [ v32 ] × [ w1 w2 ] ⇒  [ A′21 A′22 v32 ]
[ A31 A32 1   ]           [ 1   ]                [ L31  L32  1   ]
Step 4: DGEQL2 (output)
[ A′11 A′12 v31 ]
[ A′21 A′22 v32 ]  ⇒  A′
[ L31  L32  L33 ]
DGELQ2 Operation for a 3×3 Matrix
In LQ, A = LQ. The Householder elementary vector is stored along rows, and the Householder transformation Hi = I − τvv^T is applied from the right:
AQ^T = L
Q = Hk ... H2 H1, where k = min(m,n)
Q^T = H1 H2 ... Hk
The first iteration (calculating A · H1) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column. Note that Hi is implemented as Hi = I − τv^T v, specifying v as a row vector instead of a column vector.
Step 0: DGELQ2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
A = [ L11 v12 v13 ]        [ τ1 ]
    [ A21 A22 A23 ]    τ = [ 0  ]
    [ A31 A32 A33 ]        [ 0  ]
Step 2: DLARF/DGEMV (output)
[ A21 A22 A23 ] × [ 1 v12 v13 ]^T  ⇒  w = [ w1 ]
[ A31 A32 A33 ]                           [ w2 ]
Step 3: DLARF/DGER (output)
[ 1   v12 v13 ]           [ w1 ]                      [ 1   v12  v13  ]
[ A21 A22 A23 ] + (−τ) ×  [ w2 ] × [ 1 v12 v13 ]  ⇒   [ L21 A′22 A′23 ]
[ A31 A32 A33 ]                                       [ L31 A′32 A′33 ]
Step 4: DGELQ2 (output)
[ L11 v12  v13  ]
[ L21 A′22 A′23 ]  ⇒  A′
[ L31 A′32 A′33 ]
2.5 DGEQRF : QR Blocked Implementation in LAPACK using
WY Representation
The DGEQRF subroutine uses blocking to perform QR decomposition based on the storage-efficient WY representation given by Schreiber and van Loan [7]. The paper discusses a compact representation Q = (I + Y · T · Y^T), where the columns of Y are the vectors v computed for each Householder transform and T is upper triangular. The representation is also rich in matrix-matrix multiplication, which is highly desirable for achieving high performance. DGEQRF uses Q = (I − Y · T · Y^T); the differences between the paper and the actual implementation are analyzed below. Refer to [7] for details.
In DGEQRF,
Q = I − Y T Y^T ∈ R^(m×m) is orthogonal, with Y ∈ R^(m×j) (m > j) and T ∈ R^(j×j) (upper triangular).
P = I − τvv^T is the Householder matrix.
If Q+ = QP, then Q+ = I − Y+ T+ Y+^T,
where Y+ = [ Y v ] ∈ R^(m×(j+1)) and
T+ = [ T z ]
     [ 0 ρ ]
with ρ = τ and z = −τ T Y^T v.
Proof:
I − Y+ T+ Y+^T = I − [ Y v ] [ T z ] [ Y^T ]
                             [ 0 ρ ] [ v^T ]
= I − [ Y v ] [ T Y^T + z v^T ]
              [ ρ v^T         ]
= I − Y T Y^T − Y z v^T − v ρ v^T    (2.5.1)
QP = (I − Y T Y^T)(I − τvv^T)
= I − Y T Y^T + Y T Y^T τvv^T − τvv^T    (2.5.2)
From Equations 2.5.1 and 2.5.2,
ρ = τ,  z = −τ T Y^T v.
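The update formula just proved can be checked numerically (a NumPy sketch; the identity is purely algebraic, so arbitrary Y, T, v and τ suffice):

```python
import numpy as np

# Arbitrary unit-lower-trapezoidal Y, upper triangular T, and a new reflector v.
rng = np.random.default_rng(3)
m, j = 6, 2
Y = np.tril(rng.standard_normal((m, j)), -1)
Y[np.arange(j), np.arange(j)] = 1.0
T = np.triu(rng.standard_normal((j, j)))
v = np.zeros(m)
v[j] = 1.0
v[j + 1:] = rng.standard_normal(m - j - 1)
tau = 1.3

Q = np.eye(m) - Y @ T @ Y.T
P = np.eye(m) - tau * np.outer(v, v)

# Augment: Y+ = [Y v], T+ = [[T, z], [0, rho]] with rho = tau, z = -tau*T@Y.T@v
Yp = np.column_stack([Y, v])
z = -tau * (T @ (Y.T @ v))
Tp = np.block([[T, z[:, None]], [np.zeros((1, j)), np.array([[tau]])]])
assert np.allclose(np.eye(m) - Yp @ Tp @ Yp.T, Q @ P)
```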
DGEQRF Implementation in LAPACK
DGEQRF iteratively processes a panel using the unblocked algorithm and then updates the trailing matrix. The call structure of the DGEQRF operation is:
DGEQRF
Loop : for each nb-width column panel;
DGEQR2 : Perform unblocked panel factorization
DLARFT : Find triangular factor matrix T
DLARFB : Apply Householder block reflector to
trailing matrix
Blocked QR factorization involves factoring a panel and applying the Householder transformation to the remainder of the matrix. The operation is done iteratively based on the blocking factor nb, which is the panel width. Figure 2.2 illustrates the basic operation. The panel of width nb is factored into Rp and Yp using the unblocked version of the QR factorization module, DGEQR2. Then the T [nb×nb] matrix is computed using DLARFT, the details of which are discussed later in the chapter. Finally, the transformation Q^T · Atrailing is applied to the trailing matrix, using the identity Q = (I − Yp · T · Yp^T), producing RT (the upper portion of the final R) and AT (the gray area, and the new matrix to factor). The dotted lines indicate the next iteration of the algorithm. Note that it is not required to build T for the final iteration.
Figure 2.2: LAPACK Blocked QR Factorization
Assume the A matrix is of size m×n, and let nb be the block size in the n direction. The details are explained using a 6×6 matrix. (Note: for simplicity, the LAPACK parameter NX is ignored.)
2.5.1 DGEQRF Computational Steps
Step 1: Factor the panel (DGEQR2)
Perform the QR factorization on the M×nb panel by calling DGEQR2. For this example, we assume nb = 2.
[ A11 A12 A13 A14 A15 A16 ]      [ α1  R12 A13 A14 A15 A16 ]
[ A21 A22 A23 A24 A25 A26 ]      [ v21 α2  A23 A24 A25 A26 ]
[ A31 A32 A33 A34 A35 A36 ]  ⇒   [ v31 v32 A33 A34 A35 A36 ]    τ = [ τ1 τ2 0 0 0 0 ]^T
[ A41 A42 A43 A44 A45 A46 ]      [ v41 v42 A43 A44 A45 A46 ]
[ A51 A52 A53 A54 A55 A56 ]      [ v51 v52 A53 A54 A55 A56 ]
[ A61 A62 A63 A64 A65 A66 ]      [ v61 v62 A63 A64 A65 A66 ]
We will next need to compute the T matrix (triangular factor).
Step 2: Find T Matrix.
Find T matrix for M × nb block using DLARFT. For this example, we assume nb = 2.
The T matrix is calculated as
T = τ1  when i = 1
T = [ T z ]  when 1 < i ≤ nb
    [ 0 ρ ]
where ρ = τi and
z = −τ T Y^T v    (2.5.3)
Equation 2.5.3 can be written as z = T · (−τ Y^T v):
Step 2.1: ResVect = (−τ Y^T v) using DGEMV
Step 2.2: zi = (T × ResVect) using DTRMV
Steps 2.1 and 2.2 are repeated nb times.
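Steps 2.1 and 2.2 amount to building T one column at a time; a NumPy sketch follows (a simplified real-precision analogue of DLARFT; `larft` is a hypothetical name and the triangular structure exploited by DTRMV is not used):

```python
import numpy as np

def larft(Y, tau):
    # Build the triangular factor T of I - Y @ T @ Y.T column by column:
    # Step 2.1 forms ResVect = -tau_i * Y^T v_i (DGEMV), Step 2.2 forms
    # z = T * ResVect (DTRMV), and rho = tau_i lands on the diagonal.
    k = len(tau)
    T = np.zeros((k, k))
    T[0, 0] = tau[0]
    for i in range(1, k):
        res_vect = -tau[i] * (Y[:, :i].T @ Y[:, i])   # Step 2.1
        T[:i, i] = T[:i, :i] @ res_vect               # Step 2.2
        T[i, i] = tau[i]                              # rho = tau_i
    return T

# Check: I - Y T Y^T must equal the product H1 H2 ... Hk.
rng = np.random.default_rng(4)
m, nb = 6, 3
Y = np.tril(rng.standard_normal((m, nb)), -1)
Y[np.arange(nb), np.arange(nb)] = 1.0
tau = rng.uniform(1.0, 2.0, nb)
T = larft(Y, tau)
Q = np.eye(m)
for i in range(nb):
    Q = Q @ (np.eye(m) - tau[i] * np.outer(Y[:, i], Y[:, i]))
assert np.allclose(np.eye(m) - Y @ T @ Y.T, Q)
```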
Step 3: Apply (I − Y T Y^T)^T × Atrailing using DLARFB
In DLARFB, the (I − Y T Y^T)^T × Atrailing computation is performed as per Equation 2.5.4. The different matrices involved in the computational steps are shown in Figure 2.3. Here,
A = [ Y C ],  Y = [ Y1 Y2 ]^T,  C = [ C1 C2 ]^T
where Y is the Householder reflector obtained using DGEQR2 and C is the remainder matrix of A, to which the transformation (I − Y T Y^T)^T · C is applied using DLARFB. Note that Y1 is lower triangular/trapezoidal, and a portion of R is stored in the upper triangular area of Y1. The different steps involved are detailed below.
(I − Y T Y^T)^T × Atrailing ⇒ C − (Y T Y^T)^T C
⇒ C − Y T^T Y^T C
⇒ C − Y ((C^T Y) T)^T    (2.5.4)
In DLARFB the following operations are performed to obtain Equation 2.5.4.
Step 3.1: Compute W = C^T Y, where Y = [ Y1 Y2 ]^T and C = [ C1 C2 ]^T, i.e. W = C1^T Y1 + C2^T Y2
Step 3.1.1 using DTRMM: W = W × Y1 (with W initialized to C1^T)
Step 3.1.2 using DGEMM: W = W + C2^T Y2
Step 3.2 using DTRMM: W = W × T
Figure 2.3: LAPACK A matrix and Work matrix in DLARFB (A = [Y1 C1; Y2 C2], Atrailing = [C1 C2] = C; the Work matrix holds W)
Step 3.3: At this step, W = (C^T Y) T. Now compute C := C − Y W^T.
Step 3.3.1 using DGEMM: C2 = C2 − Y2 W^T
Step 3.3.2 using DTRMM: W = W × Y1^T
Step 3.3.3: C1 = C1 − W^T
The above steps perform the operation C = C − Y ((C^T Y) T)^T, which is equivalent to Equation 2.5.4. The A matrix is transformed into the A′ matrix by the first block (M×2) operation; this continues for two more iterations (panels) for the given problem.
[ α1  R12 A13 A14 A15 A16 ]      [ α1  R12 A′13 A′14 A′15 A′16 ]
[ v21 α2  A23 A24 A25 A26 ]      [ v21 α2  A′23 A′24 A′25 A′26 ]
[ v31 v32 A33 A34 A35 A36 ]  ⇒   [ v31 v32 A′33 A′34 A′35 A′36 ]
[ v41 v42 A43 A44 A45 A46 ]      [ v41 v42 A′43 A′44 A′45 A′46 ]
[ v51 v52 A53 A54 A55 A56 ]      [ v51 v52 A′53 A′54 A′55 A′56 ]
[ v61 v62 A63 A64 A65 A66 ]      [ v61 v62 A′63 A′64 A′65 A′66 ]
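The DLARFB update of Steps 3.1-3.3 collapses to three dense products; a NumPy sketch follows (simplified: the DTRMM-based splitting into (Y1, Y2)/(C1, C2) blocks is omitted, and `larfb_left_t` is a hypothetical name):

```python
import numpy as np

def larfb_left_t(Y, T, C):
    # Apply (I - Y @ T @ Y.T)^T to C from the left, as in Equation 2.5.4:
    # W = (C^T Y) T  (Steps 3.1 and 3.2), then C := C - Y W^T (Step 3.3).
    W = (C.T @ Y) @ T
    return C - Y @ W.T

rng = np.random.default_rng(5)
m, nb, n = 6, 2, 4
Y = np.tril(rng.standard_normal((m, nb)), -1)
Y[np.arange(nb), np.arange(nb)] = 1.0
T = np.triu(rng.standard_normal((nb, nb)))
C = rng.standard_normal((m, n))
assert np.allclose(larfb_left_t(Y, T, C),
                   (np.eye(m) - Y @ T @ Y.T).T @ C)
```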
2.6 RQF, QLF and LQF Blocked Implementation in LAPACK
using WY Representation
This section outlines the blocked matrix factorization algorithms in LAPACK for the RQ, QL and LQ variants. Similar to the blocked QR implementation, all the variants perform a panel factorization using the corresponding unblocked algorithm and then update the trailing matrix. The matrices used in the equations are illustrated in the corresponding figures for the first iteration of the panel factorization. The figures also show the pointer to each matrix involved in the computation for the second iteration. In these figures, C represents the trailing matrix in the first iteration and C2 the trailing matrix in the second iteration.
2.6.1 DGERQF
In DGERQF the panels are row-wise and blocking proceeds from bottom to top. Here, as in Figure 2.4,
Figure 2.4: LAPACK Blocked RQ Factorization
A = [ C ],   Y = [ Y1 Y2 ],   C = [ C1 C2 ]
    [ Y ]
where Y is the Householder reflector obtained using DGERQ2, C is the remainder matrix of A, and (I − Y^T T Y) is the block reflector. The transformation C · (I − Y^T T Y) is applied using DLARFB. Note that Y2 is lower triangular, and a portion of R is stored in the upper triangular area of Y2 [refer to Figure 2.4]. Matrix T is lower triangular and has values
T = [ ρ 0 ]
    [ z T ]
in each iteration.
In Figure 2.4, A and A2 represent the pointer to Y in the first two iterations, whereas the pointer to C remains the same in each iteration. Computational steps for the transformation C · (I − Y^T T Y) are shown in Table 2.1.
In each iteration the block reflectors are applied backwards to compute H = Hk ... H2 · H1.
Table 2.1: DGERQF LARFB Computation
Operation               Calling Routine   Remark
W = C2                                    copy C2 to W
W = C · Y^T                               performs W = C1 · Y1^T + C2 · Y2^T
  W = W × Y2^T          DTRMM
  W = W + C1 · Y1^T     DGEMM
W = W × T               DTRMM
C = C − W · Y                             applies C = [C1 C2] − (W × [Y1 Y2])
  C1 = C1 − W · Y1      DGEMM             update C1
  W = W × Y2            DTRMM
  C2 = C2 − W                             update C2
2.6.2 DGEQLF
In DGEQLF, the panels are column-wise and blocking proceeds from left to right.
Figure 2.5: LAPACK Blocked QL Factorization
Here, as in Figure 2.5,
A = [ C Y ],   Y = [ Y1 ],   C = [ C1 ]
                   [ Y2 ]        [ C2 ]
where Y is the Householder reflector obtained using DGEQL2 and C is the remainder matrix of A, to which the transformation (I − Y T Y^T)^T · C is applied using DLARFB. Note that Y2 is upper triangular, and a portion of L is stored in the lower triangular area of Y2. Matrix T is lower triangular and has values
T = [ ρ 0 ]
    [ z T ]
in each iteration.
In Figure 2.5, A and A2 represent the pointer to Y in the first two iterations, whereas the pointer to C remains the same in each iteration. The computational steps remain similar to those of DGEQRF in finding (I − Y T Y^T)^T · C; the differences are in the storage of Y, C, L and T, as represented in the figure. In each iteration the block reflectors are applied backwards to compute H = Hk ... H2 · H1.
2.6.3 DGELQF
In DGELQF, the panels are row-wise and blocking proceeds from top to bottom.
Figure 2.6: LAPACK Blocked LQ Factorization
Here, as in Figure 2.6,
A = [ Y ],   Y = [ Y1 Y2 ],   C = [ C1 C2 ]
    [ C ]
where Y is the Householder reflector obtained using DGELQ2, C is the remainder matrix of A, and (I − Y^T T Y) is the block reflector. The transformation C · (I − Y^T T Y) is applied using DLARFB. Note that Y1 is upper triangular, and a portion of L is stored in the lower triangular area of Y1. Matrix T is upper triangular and has values
T = [ T z ]
    [ 0 ρ ]
in each iteration.
In Figure 2.6, A and A2 represent the pointer to Y in each iteration, whereas C1 and C2 denote that of C. The computational steps remain similar to those of DGERQF in finding C · (I − Y^T T Y); the differences are in the storage of Y, C, L and T, as represented in the figure. The elementary block reflectors are applied forward to compute H = H1 · H2 ... Hk.
CHAPTER 3: QR ATLAS IMPLEMENTATION USING
RECURSIVE ALGORITHM
3.1 Introduction
This chapter describes the mathematical details of QR matrix factorization and how it is implemented in ATLAS [12]. ATLAS provides implementations for QR (QL, RQ and LQ) using an unblocked and a blocked factorization algorithm in four precisions/types (real single, real double, single complex and double complex). As in LAPACK, the blocked algorithm iteratively processes a panel and then updates the remainder matrix using the L3BLAS. In ATLAS, the panel factorization is done using a recursive algorithm [3], which uses recursion to automatically block the panel into sub-panels until the problem size fits into the L2 cache. At that point the sub-panel is factored using the unblocked algorithm, which is cache contained. The trailing sub-panel is also updated using the L3BLAS within the recursive factorization steps. This gives us L3BLAS computation in both the panel factorization and the trailing matrix update, which improves the full matrix factorization performance. The dynamic blocking until the problem is L2-cache-contained also helps the unblocked algorithm involved in the panel factorization to run at cache speed. The trailing matrix update operations remain the same in ATLAS and LAPACK. Another difference compared to LAPACK is that ATLAS uses an L3BLAS-based formulation for finding the triangular factor T, based on the work by Elmroth and Gustavson [3].
The ATLAS implementations of all QR variants are outlined in this chapter. The recursive panel factorization algorithm and the derivation for finding the triangular factor T are presented for the QR, QL, RQ and LQ variants. The discussion is presented mainly in terms of the QR factorization algorithm in double precision; other precisions/types are covered where they differ. In general, a detailed discussion is provided where there is a major algorithmic difference compared to LAPACK. The author has contributed the mathematical analysis, implementation and performance analysis for all the variants and precisions.
ATLAS provides implementations in ANSI C for all the routines. The factorization routines for the QR variant are provided using:
1. Statically blocked version: ATL geqrf
2. Recursively blocked version: ATL geqrr, called by ATL geqrf to factor the column panel
3. Unblocked version: ATL geqr2, called by ATL geqrr to factor the L2-cache-contained column panel
In general, all the precisions are implemented using the same source file, compiled to a different object file for each precision. Similar implementations are provided for the RQ, QL and LQ variants. All QR routines use the high-performance BLAS provided by ATLAS for the bulk of the computation.
3.1.1 Outline
The remaining sections of the chapter are organized as follows. The unblocked QR2 factorization is discussed in Section 3.2. Section 3.3 provides implementation details of the blocked QR factorization, including a detailed illustration of the computational steps involved in the recursive panel factorization. Section 3.4 discusses the blocked QL factorization. The blocked factorizations for the RQ and LQ variants are presented in Section 3.5. Section 3.6 outlines the mathematical analysis for computing the triangular factor T based on the work by Elmroth and Gustavson [3].
3.2 Unblocked QR factorization (ATL geqr2)
The unblocked implementations of the QR variants in ATLAS follow the same mathematical formulation and call structure as those of LAPACK described in Section 2.3. Table 3.1 lists the routines implemented in ATLAS to provide the QR2 functionality. Refer to the ATLAS software for details.
3.3 Statically blocked QR factorization (ATL geqrf)
This section outlines ATLAS’s ATL geqrf, which implements statically blocked factorization for the
QR variant. As in the LAPACK computation described in Section 2.5, the matrix A is split into a
panel and remainder matrix iteratively. The difference is that the panel factorization step is per-
formed with a call to the recursive panel factorization, ATL geqrr, rather than calling the unblocked
33
Table 3.1: QR2 ATLAS routines
Program/Files Variant Precisions
ATL geqr2.c qr All
ATL geql2.c ql All
ATL gerq2.c rq All
ATL gelq2.c lq All
ATL larf.c All All
ATL larfg.c All All
ATL larfp.c All All
ATL ladiv.c All All
ATL lapy2.c All s,d
ATL lapy3.c All c,z
ATL lacgv.c All c,z
algorithm as LAPACK does. The recursion leads to an automatic blocking and it also replaces the
L2BLAS-based operations of LAPACK’s unblocked algorithm (GEQR2) by calls to the L3BLAS. In
addition, unlike LAPACK, the triangular factor T is computed using the recursive algorithm as
described in [3]. Automatic blocking in the panel factorization is achieved by recursing until the
problem size fit into the L2 cache and then calling the unblocked algorithm ATL geqr2. Equivalent
implementations are provided for RQ, QL and LQ factorization in ATL gerqf, ATL geqrlf and
ATL gelqf respectively. Table 3.2 gives the list of all the routines in ATLAS for implementing
blocked matrix factorization. Refer to ATLAS for details. A detailed analysis of QR variant is
given below. QL, RQ and LQ implementations details are given in subsequent sections.
3.3.1 Call structure for ATLAS GEQRF (ATL geqrf)
The following calls are repeated for each nb in MIN(M,N) (nb is the block size):
1. ATL geqrr : Performs the recursive factorization of a panel. ATL geqrr recursively calls itself, dividing the panel width by 2 until it reaches a stopping point and calls the unblocked code, ATL geqr2, to factor the sub-panel. The T matrix is computed by calling ATL larft, then ATL larfb is called to apply the Householder block reflector [7, 3]. ATL larft computes the T matrix using the recursive algorithm developed in [3].
2. ATL larfb : Applies the Householder block reflector to the remainder of the A matrix.

Table 3.2: QRF ATLAS routines
Program/Files   Variant   Precisions
ATL geqrf.c     qr        All
ATL geqlf.c     ql        All
ATL gerqf.c     rq        All
ATL gelqf.c     lq        All
ATL geqrr.c     qr        All
ATL geqlr.c     ql        All
ATL gerqr.c     rq        All
ATL gelqr.c     lq        All
ATL larft.c     All       All
ATL larfb.c     All       All
3.3.2 Computation of GEQRR (ATL geqrr)
The following section outlines the recursive panel factorization algorithm in ATL geqrr. The computational steps are illustrated in Figure 3.1 (assuming 2 levels of recursion).
Figure 3.1: Recursive QR Panel Factorization
The steps of Figure 3.1 are:
1. Block A to perform the factorization; A is stored in column-major order. T will be computed and stored as an upper triangular matrix at the end of all computational steps. The gray area represents completed operations.
2. Divide the panel into left and right sub-panels by recursion.
3. Continue the recursion on the left sub-panel A0 to get A00 and A01. Recursion continues until the problem fits into the L2 cache; here it is assumed that A00 fits into the L2 cache. Factor A00 to QR form using ATL geqr2, obtaining the elementary reflector Y00 and the upper triangular part of R. Produce the T00 matrix by calling ATL larft.
4. Apply the block reflector to A01.
5. Factor the sub-sub-panel A01 to obtain Y01 and the upper triangular part of R. Also obtain T01.
6. Compute the block part of T (i.e., z0) using the identity −T1 · Y1^T · Y2 · T2.
7. Apply the blocked transformation to A1.
8. Apply the recursive panel factorization to the right sub-panel, following steps similar to those for the left panel.
9. Compute z1, T10 and T11.
10. Compute z using T0 and T1.
11. Complete the factorization of the panel into Y and R, and obtain the upper triangular matrix T.
3.4 Blocked QL Factorization (ATL geqlf)
ATL geqlf provides the QL variant of the blocked factorization in ATLAS. As in the LAPACK computation described in Section 2.6.2, the matrix A is split into a panel and a remainder matrix and processed iteratively. Just as in ATL geqrf, ATLAS differs from LAPACK by calling a recursive routine (ATL geqlr) to factor the column panel. The rest of the section outlines the panel factorization in detail. The computation described here is applicable to all precisions.
3.4.1 Computation of GEQLR (ATL geqlr)
The computational steps are illustrated in Figure 3.2; here we show one level of recursion. The steps of Figure 3.2 are:
1. Block panel A to perform the recursion; A is stored in column-major order. T will be computed and stored as a lower triangular matrix at the end of all computational steps. The gray area represents completed operations.
2. Divide the panel into left and right sub-panels by recursion.
3. Factor the right sub-panel A1 to get Y1 and a lower triangular L part. Then compute T1.
4. Apply the block reflector to A0 (trailing matrix update).
5. Factor the left sub-panel and compute T0.
6. Compute z from T0 and T1 using the identity −T2 · Y2^T · Y1 · T1.
7. Complete the factorization of the panel into Y and L, and obtain the lower triangular matrix T.
Figure 3.2: Recursive QL Panel Factorization
3.5 Blocked RQ and LQ Factorization
The RQ and LQ transformations can be performed by using the QL and QR transformations with a transpose. This is done mainly for performance reasons. RQ is (Q · L)^T and LQ is (Q · R)^T, i.e.:
A^T = Q · R ⇒ A = (Q · R)^T = R^T · Q^T = L · Q^T    (3.5.1)
A^T = Q · L ⇒ A = (Q · L)^T = L^T · Q^T = R · Q^T    (3.5.2)
Since Q^T is also orthogonal, and the "Q" in the LQ and RQ factorizations refers to any orthogonal matrix, we can obtain an LQ or RQ factorization of A by transposing the result of a QR or QL factorization of A^T. The QL and QR factorizations access data in contiguous columns (the data is column-major) and proceed column by column, while the RQ and LQ factorizations, as written for LAPACK, access data in non-contiguous rows and proceed row by row. The cost of the transpose copy is small compared to the performance gained by using the more cache- and TLB1-friendly memory access pattern employed by QR and QL.
For a square matrix, the transpose copy in the above operations can be done in place without requiring any additional work space, but for a rectangular matrix an in-place transpose is extremely problematic. ATLAS performs the panel factorization of RQ and LQ by using the QL and QR panel factorizations, which captures the performance advantage described above. The details are discussed in the following section.
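The transpose trick itself can be demonstrated with any QR routine (here NumPy's, purely as an illustration of Equation 3.5.1):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 5))

# LQ of A from QR of A^T:  A^T = Q R  =>  A = R^T Q^T = L * Q^T
Q, R = np.linalg.qr(A.T)
L, Qt = R.T, Q.T
assert np.allclose(L, np.tril(L))    # L is lower triangular
assert np.allclose(L @ Qt, A)        # A = L * Q^T, an LQ factorization
```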
3.5.1 Blocked RQ Factorization (ATL gerqf)
ATL gerqf provides the RQ variant of the blocked factorization in ATLAS. As in the LAPACK
computation described in Section 2.6.1, the matrix A is split into a panel and remainder matrix
and processed iteratively.
In ATLAS, the panel factorization is obtained by making a transposed copy of the panel
and then factorizing using ATL geqlr as discussed earlier. After factorization and finding T , the
output of ATL geqlr is transposed back into the original panel. Then we perform the trailing matrix
update as usual. The computational steps are illustrated in Figure 3.3. The steps of Figure 3.3 are:
1. The first iteration starts from the last panel Ap.
2. Make a transposed copy of Ap and factor it using ATL geqlr; T is also computed. Note that we must conjugate the output T for complex precisions.
3. Transpose ATL geqlr's column panel back into the original row panel Ap, then perform the trailing matrix update.
4. Proceed to the next iteration.
1 Translation Lookaside Buffer.
Figure 3.3: Recursive RQ Panel Factorization
3.5.2 Blocked LQ Factorization (ATL gelqf)
The computational steps are similar to those of ATL gerqf described above and are illustrated in Figure 3.4. The steps of Figure 3.4 are:
1. The first iteration starts from the top panel Ap.
2. Make a transposed copy of Ap and factor it using ATL geqrr; T is also computed. As in the RQ variant, we must conjugate the output T for complex precisions.
3. Transpose ATL geqrr's column panel back into the row panel Ap, then perform the trailing matrix update.
4. Proceed to the next iteration.
3.6 Derivation for T-Block for the Recursive T Computation
This section outlines the computation of the T matrix (triangular factor) in ATL larft. ATLAS
provides a recursive algorithm for computing T using the L3BLAS based on the work by Elmroth
Figure 3.4: Recursive LQ Panel Factorization
and Gustavson [3]. That paper explicitly provides the formulation only for the QR variant, but
ATLAS's ATL_larft routine computes T for all four variants. The computational details are given
below.
QR
The Level 3 formulation for computing the T matrix for QR, as described in Elmroth and Gustavson [3],
is shown below; full details of the QR computation can be found in that paper.
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Q1 Q2 ··· Qk = I − Y T Y^T, where T is k × k upper triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors. Referring to
Figure 3.5, suppose k1 + k2 = k and T1 and T2 are the associated triangular matrices. Then
Q = (I − Y1 T1 Y1^T)(I − Y2 T2 Y2^T), where Y = [Y1 Y2]
  = I − Y2 T2 Y2^T − Y1 T1 Y1^T + Y1 T1 Y1^T Y2 T2 Y2^T          (3.6.1)
Figure 3.5: Recursive T Computation for QR
Also,
Q = I − Y T Y^T
  = I − [Y1 Y2] [T1 z; 0 T2] [Y1^T; Y2^T]     (semicolons separate block rows)
  = I − [Y1 Y2] [T1 Y1^T + z Y2^T; T2 Y2^T]
  = I − Y1 T1 Y1^T − Y1 z Y2^T − Y2 T2 Y2^T          (3.6.2)
Comparing Equations 3.6.1 and 3.6.2, the z block of T is z = −T1 Y1^T Y2 T2.
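This identity can be checked numerically. The sketch below (numpy, not ATLAS's C code) uses random matrices for Y1, Y2, T1 and T2, since the relation between the two expansions is purely algebraic and does not depend on the Householder structure:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k1, k2 = 8, 3, 2
Y1 = rng.standard_normal((m, k1))
Y2 = rng.standard_normal((m, k2))
T1 = np.triu(rng.standard_normal((k1, k1)))
T2 = np.triu(rng.standard_normal((k2, k2)))
z = -T1 @ Y1.T @ Y2 @ T2                          # the z block
Y = np.hstack([Y1, Y2])
T = np.block([[T1, z], [np.zeros((k2, k1)), T2]])
lhs = (np.eye(m) - Y1 @ T1 @ Y1.T) @ (np.eye(m) - Y2 @ T2 @ Y2.T)
assert np.allclose(lhs, np.eye(m) - Y @ T @ Y.T)
```

The assertion confirms that gluing T1, T2 and z into one block upper triangular T reproduces the product of the two factors.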
Computational Steps
The panel is recursively halved (into left and right sub-panels) until the width of a sub-panel is
two or less. At that point T is computed for the left and right sub-panels, and the z (T-block) is
then computed to combine them. This is done recursively; the example below shows one level of
recursion. In Figure 3.5, the following steps are performed in order to obtain T:
1. Divide the panel into left and right sub-panels by recursion. (Here we assume both sub-panels
have reached the stopping criterion.)
2. Compute the triangular factor T1 for Y1. The T1 matrix is computed as described in Step 2
of Section 2.5.1.
3. Compute the triangular factor T2 for Y2 using the same method as in the previous step.
4. Compute z using the identity z = −T1 Y1^T Y2 T2. This step completes the formation of T for
the panel Y.
The computational steps remain the same for the other variants, so the following sections limit
the discussion to the formulation of z for each variant, which is newly introduced in ATLAS.
RQ
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Qk ··· Q2 Q1 = I − Y^T T Y, where T is k × k lower triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.6: Recursive T Computation for RQ
Referring to Figure 3.6, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y2^T T2 Y2)(I − Y1^T T1 Y1), where Y = [Y1; Y2]
  = I − Y1^T T1 Y1 − Y2^T T2 Y2 + Y2^T T2 Y2 Y1^T T1 Y1          (3.6.3)
Also,
Q = I − [Y1^T Y2^T] [T1 0; z T2] [Y1; Y2]
  = I − [Y1^T Y2^T] [T1 Y1; z Y1 + T2 Y2]
  = I − Y1^T T1 Y1 − Y2^T z Y1 − Y2^T T2 Y2          (3.6.4)
Comparing Equations 3.6.3 and 3.6.4, the z block of T is z = −T2 Y2 Y1^T T1.
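The same style of numerical check applies to the RQ form, where the blocks of Y are stacked as rows and z sits below the diagonal of T (a numpy sketch with random data; the identity is again purely algebraic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k1, k2 = 8, 3, 2
Y1 = rng.standard_normal((k1, n))   # Y's blocks are stacked rows here
Y2 = rng.standard_normal((k2, n))
T1 = np.tril(rng.standard_normal((k1, k1)))
T2 = np.tril(rng.standard_normal((k2, k2)))
z = -T2 @ Y2 @ Y1.T @ T1                           # the z block
Y = np.vstack([Y1, Y2])
T = np.block([[T1, np.zeros((k1, k2))], [z, T2]])  # block lower triangular
lhs = (np.eye(n) - Y2.T @ T2 @ Y2) @ (np.eye(n) - Y1.T @ T1 @ Y1)
assert np.allclose(lhs, np.eye(n) - Y.T @ T @ Y)
```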
QL
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Qk ··· Q2 Q1 = I − Y T Y^T, where T is k × k lower triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.7: Recursive T Computation for QL
Referring to Figure 3.7, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y2 T2 Y2^T)(I − Y1 T1 Y1^T), where Y = [Y1 Y2]
  = I − Y2 T2 Y2^T − Y1 T1 Y1^T + Y2 T2 Y2^T Y1 T1 Y1^T          (3.6.5)
Also,
Q = I − [Y1 Y2] [T1 0; z T2] [Y1^T; Y2^T]
  = I − [Y1 Y2] [T1 Y1^T; z Y1^T + T2 Y2^T]
  = I − Y1 T1 Y1^T − Y2 z Y1^T − Y2 T2 Y2^T          (3.6.6)
Comparing Equations 3.6.5 and 3.6.6, the z block of T is z = −T2 Y2^T Y1 T1.
LQ
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Q1 Q2 ··· Qk = I − Y^T T Y, where T is k × k upper triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.8: Recursive T Computation for LQ
Referring to Figure 3.8, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y1^T T1 Y1)(I − Y2^T T2 Y2), where Y = [Y1; Y2]
  = I − Y2^T T2 Y2 − Y1^T T1 Y1 + Y1^T T1 Y1 Y2^T T2 Y2          (3.6.7)
Also,
Q = I − [Y1^T Y2^T] [T1 z; 0 T2] [Y1; Y2]
  = I − [Y1^T Y2^T] [T1 Y1 + z Y2; T2 Y2]
  = I − Y1^T T1 Y1 − Y1^T z Y2 − Y2^T T2 Y2          (3.6.8)
Comparing Equations 3.6.7 and 3.6.8, the z block of T is z = −T1 Y1 Y2^T T2.
CHAPTER 4: RESULTS AND ANALYSIS
4.1 Introduction
This chapter compares the performance of the recursive panel factorization versus the unblocked
panel factorization, both of which use the statically blocked QR factorization for the full problem.
As we have seen, the blocked algorithm has a panel factorization step followed by the trailing
matrix update, performed iteratively. As described in the previous chapter, ATLAS uses a recursive
algorithm to perform the panel factorization, halving the width by recursion until the problem is
reduced to a sub-panel that fits into the L2 cache. At the recursive stopping point, it performs an
unblocked factorization (GEQR2) and then uses the L3BLAS to update the remainder of the panel
via the LAPACK routine LARFB. The trailing matrix of the full problem is also updated using
the L3BLAS in LARFB. The automatic blocking and Level 3 updates provided by the recursive
factorization give several distinct advantages:
1. The unblocked factorization (GEQR2) is cache contained.
2. Panel performance is dominated by the L3BLAS rather than the L2BLAS.
3. The efficient panel performance allows us to use a larger nb, resulting in improved L3BLAS
performance and scaling in the trailing matrix update.
4. Using automatic blocking ensures we do not have to separately tune for all panel shapes.
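The recursive splitting itself can be sketched as follows. This is a hedged numpy illustration under simplifying assumptions: rpanel_qr is a hypothetical name, np.linalg.qr in 'complete' mode stands in for GEQR2, the explicit multiply by Q1^T stands in for LARFB's Level 3 update, and ATLAS of course works in place on Householder vectors rather than forming Q explicitly.

```python
import numpy as np

def rpanel_qr(A, stop=2):
    """Recursive panel QR sketch: halve the width, factor the left half,
    apply a GEMM-rich update to the right half, then factor it in turn."""
    m, n = A.shape
    if n <= stop:
        return np.linalg.qr(A, mode='complete')  # stand-in for unblocked GEQR2
    k = n // 2
    Q1, R1 = rpanel_qr(A[:, :k], stop)   # factor left sub-panel
    B = Q1.T @ A[:, k:]                  # Level 3 update (LARFB's role)
    Q2, R2 = rpanel_qr(B[k:, :], stop)   # factor updated right sub-panel
    Q = Q1.copy()
    Q[:, k:] = Q1[:, k:] @ Q2            # Q = Q1 · diag(I_k, Q2)
    R = np.zeros((m, n))
    R[:, :k] = R1                        # assemble R from the two halves
    R[:k, k:] = B[:k, :]
    R[k:, k:] = R2
    return Q, R
```

Almost all of the work lands in the two matrix multiplies, which is precisely why the recursion keeps panel performance in the L3BLAS.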
The results in this chapter validate the above attributes on two different commodity
platforms. The statically blocked full matrix factorization with unblocked panel factorization (LA-
PACK) and the statically blocked full matrix factorization with dynamically blocked recursive panel
factorization (ATLAS) are compared, and the results are discussed. Performance comparisons of
the default LAPACK, tuned LAPACK and recursive ATLAS implementations are conducted for
full matrix factorization. Although we primarily discuss double precision QR, experimental results
for all QR variants and precisions/types are provided.
4.1.1 Outline
The rest of the chapter is organized as follows. Section 4.2 provides an overview of the experi-
mental methodology. In Section 4.3 we show the performance results for solving full problems. As
previously described, the full problem always uses a statically blocked algorithm, but LAPACK
factorizes the panel with an unblocked algorithm while ATLAS uses a recursive one. Section 4.3.1
provides a quantitative comparison of the recursive and non-recursive algorithms. The performance
results for full matrix factorization on two commodity platforms for all variants and precisions are
provided in Section 4.3.2, and their performance characteristics are analyzed.
4.2 Experimental Methodology
This section outlines the libraries, timing methodology, OS/hardware and tuning used in the
experiments.
Libraries and Software
Unless otherwise noted, all experiments were conducted using the ATLAS v3.9.26 and LAPACK
v3.2.1 libraries.
Timing and Floating Point Operations
All timings used the ATLAS timers, and each reported timing is the average of several sample
runs. We report averages since parallel timings in practice are strongly affected by the system
state during the call and vary widely based on unknown starting conditions. For the timing of full
square problems, we use 50 trials to factor matrices under 4000 × 4000 elements, and at least 20
trials for larger matrices. We flush all processors' caches between timing invocations, as described
in [10].
The floating point counts for all operations are taken from LAPACK's dopla.f. Performance
is measured in MFLOPS (millions of floating point operations per second). Note that the actual
number of floating point computations in the full matrix factorization varies with nb for the same
full problem size [3]. Despite this, the MFLOP rates reported here always use the unblocked flop
counts from dopla.f. Therefore the number of floating point operations
for a given problem size remains the same regardless of the factorization algorithm and blocking
factor. Hence, the MFLOPS we report always give an unbiased measure for comparing performance
across algorithms, and can be converted into raw times using a simple equation.
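The conversion is simply time = flops / (MFLOPS × 10^6). A small sketch using the familiar leading-order approximation to the real QR flop count (an approximation only; dopla.f's exact polynomial includes additional lower-order terms, and both function names and the 20000 MFLOPS rate below are made up for illustration):

```python
def qr_flops(m, n):
    # leading-order real-QR operation count; dopla.f adds lower-order terms
    return 2.0 * n * n * (m - n / 3.0)

def seconds_from_mflops(mflops, m, n):
    # invert the MFLOPS definition: time = flops / (mflops * 1e6)
    return qr_flops(m, n) / (mflops * 1e6)

# e.g., a 6000 x 6000 double precision QR reported at 20000 MFLOPS:
t = seconds_from_mflops(20000.0, 6000, 6000)
```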
OS/Hardware
Timing was done on two commodity platforms, both of which have 8 cores in two physical packages
and run Linux. More specifically, our test systems are:
(1) Opt8, O8: 2.1 GHz AMD Opteron 2352 running Fedora 8 Linux 2.6.25.14-69, with gcc
4.2.1, LAPACK 3.2.1 and ATLAS 3.9.26.
(2) Core2, C2: 2.5 GHz Intel E5420 Core2 Xeon running Fedora 9 Linux 2.6.25.11-97, with
gcc 4.3.0, LAPACK 3.2.1 and ATLAS 3.9.26.
In both architectures, each core has separate L1 and L2 caches, but L3 caches are shared
across cores. Each physical package on the Opt8 consists of one chip, which has a built-in memory
controller. The physical packages of the Core2 contain two chips, and all cores share an off-chip
memory controller.
Tuning
ATLAS's LAPACK autotuner [9] was used to empirically tune the LAPACK blocking parameter
(i.e., the nb used by the statically blocked GEQRF). For the variable blocking in the recursive
algorithm, the L2 cache size is obtained from ATLAS's cache detection probe.
4.3 Results
The influence of the dynamically blocked recursive algorithm on the panel factorization, together
with experimental results, was discussed in Chapter 1 (see Section 1.3). We have seen that the
recursive panel factorization is an efficient way to factorize panels of varying sizes and shapes.
In Section 4.3.1, we compare the performance of the ATLAS and LAPACK implementations
of QR as the blocking parameter is varied. Section 4.3.2 provides detailed results for all types,
precisions, variants and algorithms on both target platforms.
4.3.1 General Comparison of ATLAS and LAPACK Performance
The general characteristics and performance advantages of using a recursive algorithm rather than
the unblocked algorithm for panel factorization, and its influence on full problem performance, are
analyzed below. In Figure 4.1, full square matrices are factored while varying nb over multiples
of the double precision matrix multiplication blocking factor of 56 (56, 112, 168). Other nbs (28,
84 and 140) are also included so that the comparison does not favor any tuned algorithm. The
graph shows the results of both the recursive and non-recursive algorithms, with the recursive
performance data stacked over the non-recursive data. The hatched region highlights, for each
problem size at that specific nb, the performance improvement over the LAPACK routine that
factors the panel using the unblocked code. The X-axis shows the problem size and the Y-axis the
performance in MFLOPS. These results were obtained using ATLAS 3.9.40.
Figure 4.1: Static Blocking Full Square Matrix (Unblocked Vs Recursive Panel Factorization)
For small problems, we see relatively modest improvements. For large problems, which
can more fully use parallelism, however, we see that the recursive panel factorization is key in
getting good performance. For large problems, large blocking factors provide distinct performance
improvements. However, this improvement does not continue smoothly as we increase nb. This
leveling off of performance can largely be explained by two main factors: (1) GEMM already gets
sufficient cache reuse so that further size increases are only of modest benefit, and (2) the extra flops
involved in the larger nb no longer pay for themselves via strongly improved GEMM performance.
An important point to note is that an nb of 168 gives the recursive algorithm near-maximum
performance across problem sizes from 1000 to 8000. In the non-recursive case, however, either 56
or 112 must be used to achieve reasonable performance across problem sizes. When the unblocked
factorization uses an nb of 168, the panel factorization is so slow that the increased performance
of the trailing update is overwhelmed by the slowdown in the panel factorization.
4.3.2 Full Problem Performance
We have seen that the dynamically blocked panel factorization can improve the full matrix
factorization by efficiently factoring a larger panel. This in turn allows us to use larger matrices in the
trailing matrix update. To show the benefit of using a dynamically blocked recursive algorithm we
will examine the best tuned performance of the ATLAS recursive factorization versus the default
LAPACK implementation and the best tuned performance of the LAPACK implementation (as
per [9]).
The performance of the following methodologies is compared for all variants and preci-
sions/types:
1. Default LAPACK: This is the performance of the default installation of LAPACK. The default
width of all panels for these factorizations is 32, for all precisions.
2. Tuned LAPACK: The panel-width is tuned for different problem sizes for each precision using
the ATLAS LAPACK tuner, as discussed in [9].
3. Recursive ATLAS: Uses ATLAS's autotuned, dynamically blocked recursive matrix factor-
ization.
In this section, the blocked matrix factorizations are denoted QR, QL, RQ and LQ for
each variant. The precisions are S, C, D and Z for single precision real, single precision complex,
double precision real and double precision complex, respectively. Following the LAPACK convention,
the names of our methods consist of the precision followed by the two-letter algorithm identifier;
e.g., "ZRQ" refers to the RQ factorization for double-precision complex elements. Performance for
the Core2 and Opt8 architectures is provided below:
4.3.3 Core2 Full Problem Performance
The performance for the QR and QL variants is given in Figure 4.2, and Figure 4.3 gives the
performance data for the RQ and LQ variants. In these charts, the default LAPACK performance
is labeled Dflt-LPK, the best performance the ATLAS framework finds for LAPACK (by changing
blocking factors) is labeled Tuned-LPK, and the ATLAS performance is labeled Rec-ATL.
The X-axis is the square problem size and the Y-axis is the MFLOPS achieved; at each
problem size the three methodologies are compared side by side. Note that all of the performance
charts in this section follow this scheme.
In general, default LAPACK is the worst performer. Tuned LAPACK is typically much
faster, with recursive ATLAS giving the best performance. We concentrate our discussion on
QR double precision. Figure 4.2(b) shows both tuned LAPACK and recursive ATLAS perform
better than default LAPACK for all problem sizes. A representative example with dimension
6000 × 6000 shows that recursive ATLAS is 1.26 (1.90) times faster than the tuned (default)
LAPACK version, respectively. The higher performance of ATLAS is attributed to an efficient
panel factorization coupled with improved performance in the remainder matrix update enabled by
using a larger blocking factor. In this case, tuned LAPACK used nb = 80 whereas ATLAS used
nb = 168.
The tuned LAPACK is slightly faster than recursion for a few problem sizes, particularly
in the complex cases. This seems to be because ATLAS does not tune for every problem size
explicitly. Rather, it tunes for various sizes and then predicts the nb of other sizes based on the
closest measured size. For recursive factorization, which can use very large nb effectively, ATLAS’s
tuning decides to apply the larger nb too early, which slightly depresses the performance of some
small cases. Recursion requires more flops than the unblocked algorithm, which can have a small effect as well.
However, significant losses have never been observed, so we are unwilling to do the additional tuning
required to avoid these minor issues.
The full problem charts are fairly uniform for the QL and QR variants across all precisions/types,
so we discuss them collectively henceforth rather than individually.
For the RQ and LQ variants shown in Figure 4.3, it can be seen that for large problems
the ATLAS implementation performs far better than the best LAPACK implementation. For
example, for the 6000 × 6000 DLQ matrix, ATLAS is 1.7 times faster than the tuned LAPACK.
Similar performance characteristics can be seen for other precisions. This is attributed to ATLAS’s
improved LQ (RQ) factorization, which uses the more efficient QR (QL) for performing the panel
factorization (See Section 3.5).
4.3.4 Opt8 Full Problem Performance
This section shows results for the same runs on the AMD Opteron. The full problem performance
results are shown in Figures 4.4 and 4.5. The charts are fairly uniform and the performance
characteristics remain the same as on the Core2, so a detailed discussion is not provided. These
uniform performance results underscore the fact that the recursive algorithm employed in ATLAS
can successfully auto-adapt to various systems.
Figure 4.2: QR (a,b,c,d) & QL (c,d,e,f), Core2, Full Problem Performances
Figure 4.3: LQ (a,b,c,d) & RQ (c,d,e,f), Core2, Full Problem Performances
Figure 4.4: QR (a,b,c,d) & QL (c,d,e,f), Opt8, Full Problem Performances
Figure 4.5: LQ (a,b,c,d) & RQ (c,d,e,f), Opt8, Full Problem Performances
CHAPTER 5: SUMMARY
We have presented a dynamically blocked QR panel factorization approach based on the recursive
algorithm [3]. The recursion helps to perform dynamic blocking and enables the panel factorization
to use the L3BLAS. We have applied this technique to all QR variants (QR, QL, RQ and LQ) in four
precisions (single real, double real, single complex, and double complex). We have presented the
mathematical analysis and implementation details. Finally, the performance improvements in panel
factorization and their overall impact on full matrix factorization were reviewed. The results showed
that the performance of the new code exceeds that of the LAPACK block algorithms. In addition, we
have shown that the ATLAS implementation performs equally well for both square and rectangular
problems with minimal tuning of the panel width. An ANSI C implementation was developed
for all QR variants and precisions and made available as part of the ATLAS (Automatically Tuned
Linear Algebra Software) library.
BIBLIOGRAPHY
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 3rd edition, 1999.
[2] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice and Experience, 20:1573–1590, June 2008.
[3] E. Elmroth and F. G. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop., 44(4):605–624, 2000.
[4] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM), 5(4), Oct 1958.
[5] Steven J. Leon. Linear Algebra with Applications. Prentice-Hall, 2002.
[6] Gregorio Quintana-Orti, Enrique S. Quintana-Orti, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), pages 301–310, Washington, DC, USA, 2008. IEEE Computer Society.
[7] Robert Schreiber and Charles Van Loan. A storage efficient WY representation for products of Householder transformations. Technical report, Cornell University, 1989. URL: http://hdl.handle.net/1813/6704.
[8] C. Bischof and C. Van Loan. The WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing, 8(1):s2–s13, January 1987.
[9] R. Clint Whaley. Empirically tuning LAPACK's blocking factor for increased performance. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisla, Poland, October 2008.
[10] R. Clint Whaley and Anthony M. Castaldo. Achieving accurate and context-sensitive timing for code optimization. Software: Practice and Experience, 38(15):1621–1642, 2008.
[11] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. Technical Report UT-CS-97-366, University of Tennessee, December 1997. http://www.netlib.org/lapack/lawns/lawn131.ps
[12] R. Clint Whaley and Jack Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, San Antonio, TX, USA, 1998. CD-ROM Proceedings. Winner, best paper in the systems category. http://www.cs.utsa.edu/~whaley/papers/atlas_sc98.ps
[13] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. In Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999. CD-ROM Proceedings.
[14] R. Clint Whaley and Antoine Petitet. ATLAS homepage. http://math-atlas.sourceforge.net/
[15] R. Clint Whaley and Antoine Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, February 2005. http://www.cs.utsa.edu/~whaley/papers/spercw04.ps
[16] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
VITA
Siju Samuel completed his master's in Structural Engineering at the Indian Institute of
Technology Kanpur, India, in 1999. He then worked as a software engineer for 9 years. He
returned to college in 2008, entering as a special undergraduate and later, in January 2009, as a
graduate student in Computer Science at UTSA. He is happily married to Swapna and has a baby
girl, Anna.