MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
APPROVED BY SUPERVISING COMMITTEE:
R. Clint Whaley, Ph.D., Chair
Qing Yi, Ph.D.
Dakai Zhu, Ph.D.
Accepted: Dean, Graduate School
MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
by
SIJU SAMUEL, M.Tech.
THESIS
Presented to the Graduate Faculty of
The University of Texas at San Antonio
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT SAN ANTONIO
College of Sciences
Department of Computer Science
May 2011
Acknowledgements
I want to thank my advisor, Dr. R. Clint Whaley, for his excellent academic guidance for
the thesis and for my graduate studies. I am very grateful to all my professors and friends for their
support leading to a rewarding experience at UTSA. I want to thank my family for their love and
constant support for all my pursuits in life. The research for this work was supported in part by
National Science Foundation CRI grants CNS-0551504 and CCF-0833203.
May 2011
MAINTAINING HIGH PERFORMANCE IN
THE QR FACTORIZATION
WHILE SCALING BOTH PROBLEM SIZE AND PARALLELISM
Siju Samuel, M.S.
The University of Texas at San Antonio, 2011
Supervising Professor: R. Clint Whaley, Ph.D.
QR factorization is an extremely important linear algebra operation used in solving multi-
ple linear equations, particularly least-square-error problems, and in finding eigenvalues and eigen-
vectors. This thesis details the author’s contributions to the field of computer science by providing
performance-efficient QR routines to ATLAS (Automatically Tuned Linear Algebra Software). AT-
LAS is an open source linear algebra library, intended for high performance computing. The author
has added new implementations for four types/precisions (single real, double real, single complex,
and double complex) in four different variants of matrix factorization (QR, RQ, QL and LQ). QR
factorization involves a panel factorization and a trailing matrix update operation. A statically
blocked algorithm is used for the full matrix factorization. A recursive formulation is implemented
for the QR panel factorization, providing more robust performance. Together these techniques
result in substantial performance improvement over the LAPACK version.
TABLE OF CONTENTS
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER 1: Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Libraries and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Dynamic Panel Factorization Using Recursion . . . . . . . . . . . . . . . . . . 5
1.2.2 Impact of Panel Factorization in Full Matrix Factorization . . . . . . . . . . . 6
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
CHAPTER 2: Overview of QR Matrix Factorization And LAPACK
Implementation 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 QR Decomposition using Householder Transformation . . . . . . . . . . . . . 10
2.2 QR, RQ, QL and LQ Transformations: Computation And Storage . . . . . . . . . . 10
2.2.1 QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 RQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 QL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 LQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 DGEQR2: Unblocked QR Implementation in LAPACK . . . . . . . . . . . . . . . 15
2.4 RQ2, QL2 and LQ2: Serial Implementation in LAPACK . . . . . . . . . . . . . . 18
2.5 DGEQRF: QR Blocked Implementation in LAPACK using WY Representation . . 22
2.5.1 DGEQRF Computational Steps . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 RQF, QLF and LQF Blocked Implementation in LAPACK using WY Representation 28
2.6.1 DGERQF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.2 DGEQLF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.3 DGELQF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
CHAPTER 3: QR ATLAS Implementation using Recursive Algorithm 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Unblocked QR factorization (ATL_geqr2) . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Statically blocked QR factorization (ATL_geqrf) . . . . . . . . . . . . . . . . . . . 33
3.3.1 Call structure for ATLAS GEQRF (ATL_geqrf) . . . . . . . . . . . . . . . . . 34
3.3.2 Computation of GEQRR (ATL_geqrr) . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Blocked QL Factorization (ATL_geqlf) . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Computation of GEQLR (ATL_geqlr) . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Blocked RQ and LQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1 Blocked RQ Factorization (ATL_gerqf) . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Blocked LQ Factorization (ATL_gelqr) . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Derivation for T-Block for the Recursive T Computation . . . . . . . . . . . . . . . . 40
CHAPTER 4: Results and Analysis 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 General Comparison Of ATLAS and LAPACK Performance . . . . . . . . . . 49
4.3.2 Full Problem Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Core2 Full Problem Performance . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.4 Opt8 Panel Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
CHAPTER 5: Summary 57
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
VITA
LIST OF FIGURES
1.1 Statically Blocked QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Recursive QR Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Unblocked Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 LAPACK Blocked QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 LAPACK - A matrix and Work matrix in DLARFB . . . . . . . . . . . . . . . . . . 27
2.4 LAPACK Blocked RQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 LAPACK Blocked QL Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 LAPACK Blocked LQ Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Recursive QR Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Recursive QL Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Recursive RQ Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Recursive LQ Panel Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Recursive T Computation for QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Recursive T Computation for RQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Recursive T Computation for QL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 Recursive T Computation for LQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Static Blocking Full Square Matrix (Unblocked Vs Recursive Panel Factorization) . . 49
4.2 QR (a,b,c,d) & QL (c,d,e,f), Core2, Full Problem Performances . . . . . . . . . . . . 53
4.3 LQ (a,b,c,d) & RQ (c,d,e,f), Core2, Full Problem Performances . . . . . . . . . . . . 54
4.4 QR (a,b,c,d) & QL (c,d,e,f), Opt8, Full Problem Performances . . . . . . . . . . . . 55
4.5 LQ (a,b,c,d) & RQ (c,d,e,f), Opt8, Full Problem Performances . . . . . . . . . . . . 56
LIST OF TABLES
2.1 DGEQLF LARFB Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 QR2 ATLAS routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 QRF ATLAS routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
CHAPTER 1: INTRODUCTION
1.1 Introduction
The goal of this research is to provide new QR routines in the ATLAS (Automatically Tuned Linear
Algebra Software) [11, 12, 13, 16, 15, 14] high-performance linear algebra library. In linear algebra,
QR, RQ, QL and LQ matrix factorization refers to the decomposition of a matrix into an orthogonal
and a triangular matrix. Householder transformations [5] are widely used in numerical linear algebra
to perform the above decompositions because they are numerically stable. LAPACK (Linear Algebra
PACKage) [1] provides an implementation of the matrix decomposition in FORTRAN. This work
developed an ANSI C implementation of the above matrix factorizations as part of ATLAS. The
recursive panel factorization algorithm [3] is used to provide better computational performance.
1.1.1 Libraries and Terminology
ATLAS provides optimized versions of two standard Linear Algebra APIs. It provides a com-
plete BLAS (Basic Linear Algebra Subprograms) library, and a partial LAPACK (Linear Algebra
PACKage) implementation.
The BLAS are a collection of kernels that are highly tuned to each architecture. They are
designed to be used as building blocks by higher level packages such as LAPACK, which handles
more complicated operations such as factoring matrices, solving systems and finding eigenvalues.
The idea is that most of the tuning can take place in the BLAS (a small set of routines), providing
performance portability to all operations which are built out of them.
The BLAS are split into three levels, depending on their operands. The Level 1 BLAS are
vector-vector operations (e.g., dot product), the Level 2 BLAS are matrix-vector operations (e.g.,
matrix-vector product), and the Level 3 BLAS are matrix-matrix operations (e.g., matrix-matrix
multiply). The Level 1 BLAS have O(N) data and O(N) operations, while the Level 2 BLAS
have O(N^2) data and O(N^2) operations. Thus, neither the Level 1 nor the Level 2 BLAS gets significant
reuse of data in the cache, and thus the performance of both is strongly limited by the speed of
memory. This has two unfortunate consequences: these problems cannot run anywhere near the
computational peak, and they do not parallelize well (since memory performance does not scale
with core count, as computational performance does).
On the other hand, the Level 3 BLAS have O(N^2) data, but O(N^3) operations, which,
along with their ability to be reordered and tiled, allows these operations to get significant reuse at
all levels of the memory hierarchy. The L3BLAS therefore usually obtain most of the theoretical
peak of the machine when run in serial, and achieve extremely good scaling when parallelized.
Level 1, Level 2 and Level 3 operations in BLAS will be identified as L1BLAS, L2BLAS and L3BLAS,
respectively.
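The reuse argument above can be made concrete with a little arithmetic. The Python sketch below is purely illustrative (the real BLAS are tuned native kernels, and these are leading-order counts only); it computes the flop-to-data ratio for a representative operation at each BLAS level:

```python
def flops_per_element(level, n):
    """Rough flop-to-data ratios for representative BLAS operations on
    n-sized operands: dot product (Level 1), matrix-vector multiply
    (Level 2), and matrix-matrix multiply (Level 3)."""
    if level == 1:            # dot product: 2n flops over 2n elements
        flops, data = 2 * n, 2 * n
    elif level == 2:          # gemv: ~2n^2 flops over n^2 + 2n elements
        flops, data = 2 * n * n, n * n + 2 * n
    elif level == 3:          # gemm: 2n^3 flops over 3n^2 elements
        flops, data = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("BLAS level must be 1, 2 or 3")
    return flops / data

# Only the Level 3 ratio grows with n, so only the L3BLAS can amortize
# memory traffic through cache reuse.
```

For n = 1000 the ratios are roughly 1, 2, and 667 flops per element moved, which is why only the L3BLAS can hide the cost of memory traffic.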
The routines of the BLAS and LAPACK follow a naming convention: routine names start
with "D" for double precision real; the corresponding prefixes for single real, single complex and
double complex precisions/types are "S", "C" and "Z", respectively. In the BLAS, some of the important routines are GER,
GEMV and GEMM. GER and GEMV are L2BLAS, and GEMM is the most important L3BLAS. The routine
DGER performs a "General" rank-1 update (adding the outer product of two vectors to the elements
of a matrix). In the BLAS, "General" is used to describe dense rectangular or square matrices, as
opposed to more specialized matrix types, such as triangular or symmetric. DGEMV performs a real
matrix-vector multiplication. DGEMM performs a real matrix-matrix multiplication.
DGEQRF is an LAPACK routine which performs a QR factorization on a full matrix in
double precision using static blocking. DGEQLF, DGELQF and DGERQF are the similar routines for QL,
LQ and RQ variants. Many routines will be introduced in this paper using a similar naming pattern
as needed.
ATLAS follows a naming convention: the routine ATL_dgemm denotes a double precision
matrix-matrix multiplication routine. In most cases the equivalent routines in the BLAS and LAPACK
are named similarly. For example, ATL_dgemm in ATLAS corresponds to DGEMM in the BLAS. Simi-
larly, ATL_zgeqrf refers to ZGEQRF (double complex precision) in LAPACK. A similar pattern is employed
for all QR related routines.
A general usage of GEMM or GEQRF without the precision character denotes the general
computational algorithm. QR, RQ, QL and LQ factorization refer to the general factorization algo-
rithm for each variant, independent of precision/type. Depending on the context, QR is also used to
denote the matrix factorization in general, independent of variant and precision/type.
The following terminologies are used in matrix factorization operations. An unblocked
factorization refers to factoring a matrix using the L2BLAS. A blocked factorization means dividing
a matrix into a ”column panel” and ”trailing matrix”, performing a panel factorization, and then
updating the trailing matrix (using the L3BLAS) iteratively. The column panel is a sub-matrix
with typically a large number of rows and relatively few columns. The trailing matrix is a matrix
following the panel matrix in the full matrix to be factored in one iteration. Similarly a row
panel has a small number of rows and a large number of columns. The width with which a matrix is
blocked is called nb, which is also referred to as the blocking factor. Static blocking refers to a blocked
factorization where the nb is constant in each iteration; if the nb varies in blocked factorization, it
is called dynamic blocking. Figure 1.1(a) shows an example for column panel, trailing matrix and
nb involved in a matrix factorization, and those terms will be more fully described in the following
section.
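The iteration just described can be written down as a schedule. The following Python sketch is a hypothetical helper (not an ATLAS or LAPACK routine) that lists the panel-factorization and trailing-update steps a statically blocked factorization of an m × n matrix performs with a constant nb:

```python
def blocked_schedule(m, n, nb):
    """List the (operation, rows, cols) steps of a statically blocked
    factorization of an m x n matrix with constant blocking factor nb."""
    steps = []
    k = min(m, n)
    for j in range(0, k, nb):
        jb = min(nb, k - j)                        # width of this column panel
        steps.append(("factor_panel", m - j, jb))  # factor A[j:m, j:j+jb]
        if j + jb < n:                             # then update the trailing matrix
            steps.append(("update_trailing", m - j, n - j - jb))
    return steps
```

For an 8 × 8 matrix with nb = 4 this yields a panel factorization and a trailing update, followed by a final panel factorization of the remaining 4 × 4 block; note how each iteration shrinks both dimensions by nb.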
1.1.2 Literature Review
The original Householder transformation and procedure for factoring A = QR appeared in 1958,
in [4]. This is nearly the unblocked algorithm, although LAPACK makes a minor mathematical
change to force the Householder vector to begin with ’1.0’ (to save a storage location). The blocked
version was introduced in 1987 by Bischof and Van Loan [8], however this required a large work
area. That problem was corrected with some new mathematics and a new supplementary matrix,
in 1989 by Schreiber and Van Loan [7]. This is the modern form implemented in LAPACK. The
mathematics and computations necessary for the recursive version (not yet in LAPACK) were
developed in 2000 by Elmroth and Gustavson [3]. In 2008, even more new mathematics led to a
tiled, DAG-based parallel version of QR in [2],[6].
QR is used in solving multiple linear equations, particularly least-square-error problems,
and in finding eigenvalues and eigenvectors. The QR factorization (and its variants) is preferred
primarily for a favorable error analysis property: the factorization is considered perfectly stable.
1.2 Overview
A QR factorization of A ∈ ℜ^(M×N) with M ≥ N yields

    A = Q · R = [Q1 Q2] · [R1; 0] = Q1 · R1                              (1.2.1)

(where [R1; 0] stacks R1 on top of an (M−N)×N zero block).
Here Q ∈ ℜ^(M×M) is orthogonal and R1 ∈ ℜ^(N×N) is upper triangular. We also handle the
M < N case, but since we are concentrating mostly on the panel where M > N , we do not discuss
the M < N case here. In LAPACK, Q is obtained as a product of Householder transformations for
each column. The LAPACK implementation of the QR matrix factorization uses a statically blocked
algorithm. It involves factoring a panel using an unblocked algorithm (L2BLAS) and then applying
the Householder transformation to the trailing matrix using the L3BLAS. The panel factorization
and trailing matrix update operations are done iteratively.
In Figure 1.1(b), for a matrix AM×N , a panel of width nb (static blocking factor) is
factored into a lower triangular/trapezoidal portion of YP and an upper triangular RT. YP represents
the Householder elementary vectors computed for each Householder transformation. YP is a storage-
efficient representation of Q, as used in LAPACK. T is an nb × nb upper triangular matrix used
for the trailing matrix update. Using the panel factorization results and T, the right hand side
trailing matrix computation is performed. The shaded area shows the new matrix to factor in the
next iteration (i.e., at each step of the iteration M and N are reduced by nb).
Figure 1.1: Statically Blocked QR Factorization
The unblocked panel factorization is done using the L2BLAS. The remainder matrix up-
date is done using the L3BLAS. To maximize the performance of QR factorization it is extremely
important to choose an nb which will maximize the performance in the panel factorization operation
and the trailing matrix update operations. Unfortunately, we will see that these goals are at odds,
with the unblocked panel factorization demanding a small nb, while the update does best with a
large nb. We address this issue by adding a recursive panel factorization that can maintain high
performance using a large nb.
Panel factorization performance is strongly affected by the cache state. If the panel is
cache contained, the L2BLAS will run close to the speed of the cache, which is much faster than
main memory. However, limiting the panel size to fit into cache can result in using a very thin
panel for the trailing matrix update. If nb is too small, the L3BLAS are not able to get much cache
reuse, and thus both serial and parallel performance is strongly reduced. On the other hand, the
very large nb that is ideal for the trailing matrix update can cause insupportable slowdown when
used in the L2BLAS-based panel factorization.
1.2.1 Dynamic Panel Factorization Using Recursion
The recursive QR factorization is a dynamically blocked algorithm which uses the L3BLAS, and was
developed by Elmroth and Gustavson [3]. We use the recursive algorithm to perform the panel
factorization. A panel is recursively divided into sub-panels until the sub-panel fits into the L2-
cache (or nb gets too narrow to make further division helpful). Only then do we call the unblocked
panel factorization, which will then run at roughly the cache speed, rather than at memory speed.
More importantly, the panel factor time will now usually be dominated by the time in the L3BLAS
coming from the recursion, and only mildly affected by the unblocked speed experienced at the end
of the recursion.
Figure 1.2: Recursive QR Panel Factorization
An illustration is given in Figure 1.2, where an input panel matrix A is recursively fac-
tored (1). It is divided into sub-panels A0 and A1 (2). Here it is assumed that A0 and A1 fit into the
L2-cache (one level of recursion). A0 is factored using the unblocked algorithm (3). Then the trailing
matrix A1 is updated using the L3BLAS (4). Then the updated sub-panel A1′ is factored using the unblocked
algorithm (5). This completes the panel factorization (6).
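The recursion can also be expressed in executable form. The pure-Python sketch below is illustrative only: the actual ATLAS code is ANSI C and updates the right half with a blocked (compact WY, GEMM-based) update, whereas this sketch applies reflectors one at a time; the names geqr2/geqrr merely mirror the naming used in this thesis, and no LAPACK-style sign choice is made for the reflectors:

```python
def house(x):
    """Householder data for x: H x = (alpha, 0, ..., 0)^T with
    H = I - (1/beta) v v^T.  No sign trick is applied, unlike LAPACK."""
    alpha = sum(e * e for e in x) ** 0.5
    if alpha == x[0]:                 # column already reduced; reflector is a no-op
        return [0.0] * len(x), 0.0
    beta = alpha * (alpha - x[0])
    return [x[0] - alpha] + x[1:], 1.0 / beta

def apply_house(A, rows, cols, v, scale):
    """Apply H = I - scale * v v^T to the submatrix A[rows, cols] in place."""
    for c in cols:
        dot = sum(v[i] * A[r][c] for i, r in enumerate(rows))
        for i, r in enumerate(rows):
            A[r][c] -= scale * dot * v[i]

def geqr2(A, r0, c0, m, n):
    """Unblocked (reflector-at-a-time) factorization of A[r0:r0+m, c0:c0+n]."""
    refs = []
    for j in range(min(m, n)):
        rows = list(range(r0 + j, r0 + m))
        v, scale = house([A[r][c0 + j] for r in rows])
        apply_house(A, rows, range(c0 + j, c0 + n), v, scale)
        refs.append((rows, v, scale))
    return refs

def geqrr(A, r0, c0, m, n, min_width=2):
    """Recursive panel factorization: split the panel, factor the left half,
    update the right half with its reflectors, then recurse on the right."""
    if n <= min_width:
        return geqr2(A, r0, c0, m, n)
    n0 = n // 2
    refs = geqrr(A, r0, c0, m, n0, min_width)
    for rows, v, scale in refs:                      # update trailing sub-panel
        apply_house(A, rows, range(c0 + n0, c0 + n), v, scale)
    return refs + geqrr(A, r0 + n0, c0 + n0, m - n0, n - n0, min_width)
```

Both geqr2 and geqrr overwrite the panel with R in its upper triangle and produce the same reflectors; the recursive version simply reorders the work so that, with a GEMM-based trailing update, most of it lands in the L3BLAS.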
The recursive algorithm involves more floating point operations compared to the un-
blocked algorithm [3]. The performance advantage in using a recursive algorithm in the panel
factorization is attributed to performing L2BLAS operations at cache speed and the L3BLAS replac-
ing the L2BLAS operations involved in the trailing sub-panel update. Similar improvements can also
be achieved by statically blocking the panel, but this requires extensive tuning for different panel
sizes and shapes to perform efficiently. Using recursion helps by dynamically blocking sub-panels
of arbitrary size and shapes without requiring additional empirical tuning of blocking factors.
1.2.2 Impact of Panel Factorization in Full Matrix Factorization
The dynamically blocked panel factorizations allow us to factor much larger panels efficiently. For
a full matrix factorization, this larger nb results in much improved L3BLAS performance and scaling
in the trailing matrix update.
Although a larger nb helps L3BLAS performance, a very large nb can still hurt overall
performance. As mentioned earlier, any blocked algorithm requires more floating point operations
than the unblocked QR factorization. The number of additional floating point operations grows as
O(nb^2 · N) [3]. If we begin the recursion at the top, nb = N/2, which means we would do O(N^3) extra
flops, which is intolerable in QR, since the entire algorithm is only O(N^3). A high nb also demands
more storage for computation which indirectly affects cache reuse. So the best algorithm uses a
hybrid approach [3], where static blocking with a constrained nb is used in the main algorithm in
order to limit the flop growth, while dynamic blocking on panels (where flop growth is minor due
to the static nb) ensures we need only 1 tuning step (for the nb used as the outer blocking factor
in the full matrix factorization).
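To see the scale of this overhead, model the extra work with the leading term c · nb^2 · N; the constant c = 1 and the (4/3)N^3 flop count for square QR are used here purely for illustration, not as measured data:

```python
def extra_flops(nb, n, c=1.0):
    """Leading-order extra flops of a dynamically blocked factorization,
    modeled as c * nb^2 * n (the constant factor c is illustrative)."""
    return c * nb * nb * n

n = 4096
useful = (4.0 / 3.0) * n ** 3        # ~flop count of square QR itself
fixed = extra_flops(56, n)           # modest static nb: overhead is negligible
full = extra_flops(n // 2, n)        # recursion started at the top: nb = N/2

# With nb = N/2 the overhead is itself O(N^3) -- a constant fraction of the
# entire factorization -- while a fixed nb keeps it well under 0.1%.
```

This is the quantitative reason for the hybrid scheme: a constrained static nb bounds the flop growth, and recursion inside each panel adds only the minor nb-dependent term.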
1.3 Motivation
The recursive panel factorization gives significant speedup compared to the unblocked algorithm.
It gives comparable or better performance than the statically blocked algorithm for all problem sizes
and shapes. Note that static blocking for the panel factorization with a fine tuned nb can win over
the recursive algorithm for small problems. The main problem with using a second step of static
blocking to handle the panel factorization is that it requires substantial tuning on both dimensions
in order to maintain good performance. Tuning for all used rectangular shapes is infeasible, and
thus recursion, which performs well on any shape of panel, is strongly preferred.
We validate this by examining the factorization performance of rectangular matrices when
using unblocked, statically blocked and recursive algorithms. Figure 1.3 gives the QR factorization
performance of thin matrices of width 56 and 168 respectively and varying M . The thin matrix
simulates the equivalent of a panel in large matrices.
For each dimension in the graph, using nb = 1 indicates the operation is using the un-
blocked algorithm. The statically blocked performance for each nb is charted next. The last column
always represents the performance obtained by factoring the entire panel using the recursive algo-
rithm. For this experiment, we have used a machine with an L2-cache size of 6144 KB per core, which
can hold roughly a 14000 × 56 or a 4681 × 168 panel of double precision elements.
Figure 1.3: QR Factorization
Figure 1.3 shows that the blocked algorithms (static and recursive) outperform the un-
blocked algorithm for all surveyed problem sizes. As the problem gets larger, the speedup increases.
This is because the larger the matrix, the greater advantage the L3BLAS have over the L2BLAS, with
the L3BLAS completely dominant once the problem is out of cache.
In a statically blocked algorithm, the nb has significant impact on performance. For ex-
ample, for a 4000 × 168 matrix, an nb of 16 gives the best performance. When nb is 4, the performance
is reduced by 56%. This shows that choosing the right nb is extremely important in achieving the
best performance for a given problem size.
Figure 1.3 also shows that the best nb for one problem might not yield the best perfor-
mance for a different size. An nb of 8 gives the best performance for 12000 × 56, but for 12000 ×
168 it performs 21% slower than the best-performing nb of 16. This shows that a tuned nb for a
particular problem size may be substantially slower when used for other panel sizes.
Now if we compare the statically blocked and the dynamically blocked performances,
we can see that when the problem is small, the recursively blocked factorization shows slightly
reduced performance compared to the best statically blocked nb. But as the problem size increases,
the recursive algorithm gives better performance than the best statically blocked nb. This is the
general pattern we see in more complete timings: recursion gets competitive performance for all
panel shapes. Therefore, since recursion leads to good performance without requiring massive and
unsustainable tunings, we use it for our panel factorization.
1.4 Outline
The rest of this paper is organized as follows: Chapter 2 provides a detailed analysis of LAPACK’s
QR subroutines, in order to get a thorough understanding of the matrix factorization algorithms.
In Chapter 3, the new ATLAS implementation using the dynamically blocked recursive algorithm
is discussed. The mathematical analysis and the implementation details are outlined for all QR
variants and in all four precisions/types. Chapter 4 provides results and analysis. A quantitative
comparison of the new matrix factorization technique in ATLAS with the LAPACK implementa-
tions is provided. The impact of recursive panel factorization on QR panel and full
matrix factorizations is studied on two commodity architectures, showing that the recursive algo-
rithm provides an efficient panel factorization, which significantly improves full matrix factorization
performance. Finally Chapter 5 provides the summary of the paper.
In general, details of mathematical analysis and algorithms are presented for double pre-
cision QR factorization. Other variants and precisions/types are discussed in reference to double
precision QR when there is a substantial difference.
CHAPTER 2: OVERVIEW OF QR MATRIX FACTORIZATION
AND LAPACK IMPLEMENTATION
2.1 Introduction
This chapter describes the mathematical details of the QR matrix factorization and how it is
implemented in LAPACK [1]. LAPACK provides implementations of QR (and QL, RQ and LQ) using
unblocked and blocked factorization algorithms in four precisions/types (real single, real double,
single complex and double complex). The unblocked algorithm uses the L2BLAS. As we have seen
in the previous chapter, the blocked algorithm iteratively processes a panel using the unblocked
factorization and then updates a trailing matrix using the L3BLAS. The trailing matrix update
using the L3BLAS is CPU-bound, but the unblocked panel factorization runs at the speed of the
memory. To provide an improved factorization the author needed to do a detailed study of the
existing mathematical formulation, computational mapping, storage and APIs. Mathematical and
computational steps are explained mainly in terms of QR factorization in double precision. Details
about other QR variants and other precisions/types are also provided when there is a significant
difference.
QR matrix factorization is an extremely important operation which is used by LAPACK
to solve all over- and under-determined systems. A full matrix factorization can be made efficient
by providing a better panel factorization technique. A detailed analysis of LAPACK QR routines
will enable us to find the improvements that can be applied to the panel factorization techniques. In
addition, the mathematical algorithms used in trailing matrix update by LAPACK can be adapted
to work in the new ATLAS QR routines.
2.1.1 Outline
The remaining sections of this chapter are organized as follows: First a mathematical overview of
the QR factorization is given in Section 2.1.2. Section 2.2 gives an overview of the computation and
storage for all the variants and precisions implemented by LAPACK. Section 2.3 details the imple-
mentation of the unblocked algorithm along with the mapping of the subroutines to mathematical
steps in LAPACK for QR double precision. Section 2.4 provides similar details for the other QR
variants. The blocked algorithm analysis for QR is provided in Section 2.5. Finally, Section 2.6
describes how to apply blocking to the other variants.
2.1.2 QR Decomposition using Householder Transformation
Using Householder transformations, the matrix A can be factored into a product QR,

    A = QR,

where Q is an orthogonal matrix and R is an upper triangular matrix, with

    Q = H1 H2 ... Hn−1
    R = Hn−1 ... H2 H1 A

The Hi are Householder matrices,

    H = I − (1/β) v v^T                                                  (2.1.1)

The H matrix is orthogonal and symmetric. Given a vector x ∈ R^n (these x are the
columns/modified columns of A),

    α = ‖x‖_2,   β = α(α − x1)
    v = (x1 − α, x2, ..., xn)^T

See [5] for details.
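These formulas can be checked numerically. The Python sketch below is illustrative only: LAPACK additionally chooses the sign of α to avoid cancellation and normalizes v so that v(1) = 1, which this sketch does not, and it assumes x is not already a multiple of e1 (so β ≠ 0):

```python
def householder(x):
    """Build alpha, beta, v from the formulas above, so that
    H x = (alpha, 0, ..., 0)^T with H = I - (1/beta) v v^T."""
    alpha = sum(e * e for e in x) ** 0.5          # alpha = ||x||_2
    beta = alpha * (alpha - x[0])                 # beta = alpha (alpha - x1)
    v = [x[0] - alpha] + list(x[1:])              # v = (x1 - alpha, x2, ..., xn)
    return alpha, beta, v

def reflect(x, beta, v):
    """Apply H = I - (1/beta) v v^T to x without forming H."""
    coef = sum(vi * xi for vi, xi in zip(v, x)) / beta
    return [xi - coef * vi for vi, xi in zip(v, x)]

alpha, beta, v = householder([3.0, 4.0])
# reflect([3.0, 4.0], beta, v) gives [5.0, 0.0] up to rounding:
# the reflector annihilates everything below the first entry.
```

Note that applying H as a rank-1 correction (as reflect does) costs O(n) per vector, which is why reflectors are never formed as explicit matrices in practice.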
LAPACK provides the following implementation for QR decomposition using Householder trans-
formation
1. DGEQRF: Blocked factorization
2. DGEQR2: Unblocked factorization (used to factor panels for DGEQRF)
2.2 QR, RQ, QL and LQ Transformations: Computation And
Storage
A given matrix A[m × n] can be factored into QR, RQ, LQ and QL variants, where Q is an
orthogonal matrix and L, R are triangular matrices. The computation and storage details of each
of these variants is discussed in the following sections.
2.2.1 QR
In QR factorization using Householder transformations, a matrix of size A[m×n] can be decomposed
into:
A = QR
For real numbers, Q is an orthogonal matrix and R is an upper triangular or
an upper trapezoidal matrix
For complex numbers, Q is a unitary matrix and R is an upper triangular or
an upper trapezoidal matrix
Computation and Storage in LAPACK and ATLAS (QR routines)
In the QR-variant, A[m×n] is the input matrix, and on exit, the elements on and above the diagonal of
the array contain the min(m,n)×n upper trapezoidal matrix R (R is upper triangular if m ≥ n).
The elements below the diagonal, together with the array TAU [1], represent the orthogonal matrix
Q in case of real (unitary matrix Q in case of Complex) as a product of elementary reflectors.
The matrix Q is represented as a product of elementary reflectors

    Q = H1 H2 ... Hk, where k = min(m,n).

Each Hi has the form

    Hi = I − τ v v^T    (real precisions)                                (2.2.1)
    Hi = I − τ v v^H    (complex precisions)                             (2.2.2)
On exit, matrix A can be represented as shown below for several representative examples of m and
n. Matrices with dimensions [3× 3], [5× 3] and [3× 5] are shown for real/complex precisions.
[m = n]:
A[3×3] =
  [ R11 R12 R13 ]
  [ v12 R22 R23 ]
  [ v13 v23 R33 ]

[m > n]:
A[5×3] =
  [ R11 R12 R13 ]
  [ v12 R22 R23 ]
  [ v13 v23 R33 ]
  [ v14 v24 v34 ]
  [ v15 v25 v35 ]

[m < n]:
A[3×5] =
  [ R11 R12 R13 R14 R15 ]
  [ v12 R22 R23 R24 R25 ]
  [ v13 v23 R33 R34 R35 ]
v is a real/complex vector with v(1 : i−1) = 0 and v(i) = 1; v(i+1 : m) is stored on exit in
A(i+1 : m, i). Note that the elements v(1 : i−1) = 0 and v(i) = 1 are not stored, since the
elements above the diagonal are known to be 0, and the diagonal element is known to be 1.
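This packing rule can be illustrated directly. In the Python sketch below, unpack_v is a hypothetical helper (not a LAPACK or ATLAS routine), and the numeric values in A are arbitrary placeholders standing in for R entries (on/above the diagonal) and stored v entries (below it):

```python
def unpack_v(A, i):
    """Recover the full Householder vector v_i (0-based column i) from
    packed QR output: v(1:i-1) = 0 and v(i) = 1 are implicit; only
    v(i+1:m) is read from A(i+1:m, i)."""
    m = len(A)
    return [0.0] * i + [1.0] + [A[r][i] for r in range(i + 1, m)]

# Packed 5 x 3 output: R on/above the diagonal, v entries below it.
A = [[9.0, 8.0, 7.0],
     [0.5, 6.0, 5.0],
     [0.4, 0.3, 4.0],
     [0.3, 0.2, 0.1],
     [0.2, 0.1, 0.0]]
```

Here unpack_v(A, 1) returns [0.0, 1.0, 0.3, 0.2, 0.1]: the zero and unit entries are implicit, so Q and R share a single array with no extra storage beyond TAU.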
2.2.2 RQ
In the RQ factorization using Householder transformations, a matrix of size A[m × n] can be
decomposed into:
A = RQ
The matrix Q is represented as a product of elementary reflectors

    Q = H1 H2 ... Hk, where k = min(m,n)          (real precisions)
    Q = H1^H H2^H ... Hk^H, where k = min(m,n)    (complex precisions)

Each Hi has the form

    Hi = I − τ v v^T    (real precisions)                                (2.2.3)
    Hi = I − τ v v^H    (complex precisions)                             (2.2.4)
In the RQ variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several representative examples of m and n.
For real precisions,
[m=n]
A[3×3] =
  R11 R12 R13
  v21 R22 R23
  v31 v32 R33
[m>n]
A[5×3] =
  R11 R12 R13
  R21 R22 R23
  R31 R32 R33
  v21 R42 R43
  v31 v32 R53
[m<n]
A[3×5] =
  v11 v12 R13 R14 R15
  v21 v22 v23 R24 R25
  v31 v32 v33 v34 R35
where v is a real vector with v(n−k+i+1 : n) = 0 and v(n−k+i) = 1; v(1 : n−k+i−1) is stored on exit in A(m−k+i, 1 : n−k+i−1). Note that the elements v(n−k+i) = 1 and v(n−k+i+1 : n) = 0 are not stored. Tau is a vector containing k = min(m,n) scalars, which represent the τ of Equation 2.2.3.
For complex precisions, the stored pattern is the same:
[m=n]
A[3×3] =
  R11 R12 R13
  v21 R22 R23
  v31 v32 R33
[m>n]
A[5×3] =
  R11 R12 R13
  R21 R22 R23
  R31 R32 R33
  v21 R42 R43
  v31 v32 R53
[m<n]
A[3×5] =
  v11 v12 R13 R14 R15
  v21 v22 v23 R24 R25
  v31 v32 v33 v34 R35
For complex, compared to the real case, conjg(v(1 : n−k+i−1)) is stored on exit in A(m−k+i, 1 : n−k+i−1). Tau is a vector containing k = min(m,n) complex scalars, which represent the τ of Equation 2.2.4.
2.2.3 QL
In the QL factorization using Householder transformations, a matrix A of size m×n can be decomposed into
A = QL
The matrix Q is represented as a product of elementary reflectors:
Q = Hk ... H2 H1, where k = min(m,n)  (real and complex precisions)
Each Hi has the form
Hi = I − τvv^T  (real precisions)    (2.2.5)
Hi = I − τvv^H  (complex precisions)    (2.2.6)
In the QL variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several examples of m and n.
For real/complex precisions,
[m=n]
A[3×3] =
  L11 v21 v31
  L21 L22 v32
  L31 L32 L33
[m>n]
A[5×3] =
  v11 v21 v31
  v12 v22 v32
  L11 v23 v33
  L21 L22 v34
  L31 L32 L33
[m<n]
A[3×5] =
  L11 L12 L13 v21 v31
  L21 L22 L23 L24 v32
  L31 L32 L33 L34 L35
v is a real/complex vector with v(m−k+i+1 : m) = 0 and v(m−k+i) = 1; v(1 : m−k+i−1) is stored on exit in A(1 : m−k+i−1, n−k+i). Note that the elements v(m−k+i) = 1 and v(m−k+i+1 : m) = 0 are not stored. Tau is a vector containing k = min(m,n) real/complex scalars, which represent the τ of Equations 2.2.5 and 2.2.6.
2.2.4 LQ
In the LQ factorization using Householder transformations, a matrix A of size m×n can be decomposed into
A = LQ
The matrix Q is represented as a product of elementary reflectors:
Q = Hk ... H2 H1, where k = min(m,n)  (real precisions)
Q = Hk^H ... H2^H H1^H, where k = min(m,n)  (complex precisions)
Each Hi has the form
Hi = I − τvv^T  (real precisions)    (2.2.7)
Hi = I − τvv^H  (complex precisions)    (2.2.8)
In the LQ variant, A[m×n] is the input matrix. On exit, the matrix A can be represented as shown below for several representative examples of m and n.
For real precision,
[m=n]
A[3×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
[m>n]
A[5×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
  L41 L42 L43
  L51 L52 L53
[m<n]
A[3×5] =
  L11 v12 v13 v14 v15
  L21 L22 v23 v24 v25
  L31 L32 L33 v34 v35
v is a real vector with v(1 : i−1) = 0 and v(i) = 1; v(i+1 : n) is stored on exit in A(i, i+1 : n). Note that the elements v(1 : i−1) = 0 and v(i) = 1 are not stored. Tau is a vector containing k = min(m,n) scalars, which represent the τ of Equation 2.2.7.
For complex precision, the stored pattern is the same:
[m=n]
A[3×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
[m>n]
A[5×3] =
  L11 v12 v13
  L21 L22 v23
  L31 L32 L33
  L41 L42 L43
  L51 L52 L53
[m<n]
A[3×5] =
  L11 v12 v13 v14 v15
  L21 L22 v23 v24 v25
  L31 L32 L33 v34 v35
v is a complex vector with v(1 : i−1) = 0 and v(i) = 1; conjg(v(i+1 : n)) is stored on exit in A(i, i+1 : n). Tau is a vector containing k = min(m,n) complex scalars, which represent the τ of Equation 2.2.8.
2.3 DGEQR2 : Unblocked QR Implementation in LAPACK
This section outlines LAPACK's DGEQR2, which performs the unblocked factorization (DGEQR2 is called by DGEQRF to factor the panel). DGEQR2 implements an unblocked QR factorization of a real m by n matrix in double precision. A detailed description is provided here for double-precision QR; the analysis of the other variants and precisions/types is outlined in later sections.
In DGEQR2, the computation proceeds over the matrix A one column vector v at a time; each column's result updates all of the columns to its right. An illustration of the unblocked factorization is given in Figure 2.1.
Figure 2.1: Unblocked Factorization
For the first column, find the Householder transformation (1). Apply the transformation to the A matrix; the elements below the diagonal become zero in the first column, and the first row is transformed into a part of the triangular matrix R (2). Note that LAPACK does not explicitly compute H · A; the equivalent computation is given in Step 2 of the computational algorithm detailed below. In LAPACK, the elementary reflectors (v) are stored in place of the zeroed vectors (3). Similar operations continue for iteration 2, but they operate on one fewer row and column. In Figure 2.1, H1 and H2 denote the Householder matrices as discussed in Equation 2.1.1.
The call structure of the DGEQR2 operation is:
DGEQR2
Loop over columns of A:
DLARFG/P : find Householder vector for a column v
DNRM2 : perform Norm-2 operation to find alpha of
Equation 2.1.1. This is done excluding the
diagonal element
DLAPY2 : Update Norm-2 result from DNRM2 with diagonal element
DSCAL : scale column i of A below the diagonal using results from
DLAPY2 to get v
DLARF : update the trailing matrix
DGEMV : find work vector w of Equation 2.3.1
DGER : Rank 1 update of remaining columns as in
Equation 2.3.2
The computational steps for each iteration of the loop are:
Step 1: Create Householder Elementary Reflector (DLARFG)
The Householder elementary reflector matrix is calculated in terms of τ and v, which are obtained by applying the following modifications to Equation 2.1.1. An elementary vector of size 3 is used to show how the computation maps to the H = I − τvv^T form used in LAPACK. In the equations below, A11, A21 and A31 represent the first column vector of A, and α represents the 2-norm of that column.

H = I − (1 / (α(α − A11))) × [ A11 − α ] × [ A11 − α  A21  A31 ]
                             [ A21     ]
                             [ A31     ]

  ⇒ I − ((α − A11)/α) × [ 1           ] × [ 1  A21/(A11−α)  A31/(A11−α) ]
                        [ A21/(A11−α) ]
                        [ A31/(A11−α) ]

  ⇒ I − τ × [ 1  ] × [ 1  v1  v2 ]
            [ v1 ]
            [ v2 ]

where τ = (α − A11)/α. Here α can take either the positive or the negative root; in LAPACK, α is chosen in such a way that the value of τ lies between 1 and 2.
So, H = I − τvv^T.
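The scalar computation above can be sketched in NumPy (a simplified real-precision analogue of DLARFG; `house` is a hypothetical name, and the sketch assumes x[0] is nonzero and omits the overflow safeguards of the real routine):

```python
import numpy as np

def house(x):
    # Generate H = I - tau * v v^T with v[0] = 1 so that H @ x = [alpha, 0, ..., 0].
    # The sign of alpha is chosen opposite to x[0], which keeps tau in [1, 2]
    # and avoids cancellation in x[0] - alpha.  (Assumes x[0] != 0.)
    alpha = -np.sign(x[0]) * np.linalg.norm(x)
    tau = (alpha - x[0]) / alpha
    v = x / (x[0] - alpha)
    v[0] = 1.0
    return v, tau, alpha

x = np.array([3.0, 4.0, 0.0])
v, tau, alpha = house(x.copy())
H = np.eye(3) - tau * np.outer(v, v)
assert np.allclose(H @ x, [alpha, 0.0, 0.0])   # everything below x[0] is zeroed
assert 1.0 <= tau <= 2.0                       # LAPACK's sign choice
```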
Step 2: Applying Householder transformations to A (DLARF)
DLARF computes A′ = HA, with H = (I − τvv^T), so
A′ = (I − τvv^T) A = A − τvv^T A = A − τv(v^T A)
Now v^T A = (A^T v)^T, so let
w = (A^T v),  (DGEMV)    (2.3.1)
A′ = A − τvw^T.  (DGER)    (2.3.2)
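Equations 2.3.1 and 2.3.2 can be sketched directly in NumPy (`apply_reflector` is a hypothetical name; the real DLARF works in place on a submatrix, but the algebra is the same):

```python
import numpy as np

def apply_reflector(A, v, tau):
    # A' = (I - tau * v v^T) @ A without ever forming H: one matrix-vector
    # product (DGEMV) and one rank-1 update (DGER), as in Eqs. 2.3.1/2.3.2.
    w = A.T @ v                   # DGEMV: w = A^T v
    A -= tau * np.outer(v, w)     # DGER : A' = A - tau * v w^T
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
v = np.array([1.0, 0.2, -0.5, 0.3])
tau = 1.4
H = np.eye(4) - tau * np.outer(v, v)
assert np.allclose(apply_reflector(A.copy(), v, tau), H @ A)
```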
DGEQR2 Operation for a 3×3 Matrix
The computational steps are explained using a 3×3 matrix. The first iteration is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column.
Step 0: DGEQR2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
A = [ R11 A12 A13 ]        [ τ1 ]
    [ v21 A22 A23 ]    τ = [ 0  ]
    [ v31 A32 A33 ]        [ 0  ]
Step 2: DLARF/DGEMV (output)
[ A12 A13 ]^T   [ 1   ]
[ A22 A23 ]   × [ v21 ]  ⇒  w = [ w1 ]
[ A32 A33 ]     [ v31 ]         [ w2 ]
Step 3: DLARF/DGER (output)
[ 1   A12 A13 ]   [ 1   ]                        [ 1   R12  R13  ]
[ v21 A22 A23 ] + [ v21 ] × (−τ) × [ w1 w2 ]  ⇒  [ v21 A′22 A′23 ]
[ v31 A32 A33 ]   [ v31 ]                        [ v31 A′32 A′33 ]
Step 4: DGEQR2 (output after first iteration)
[ R11 R12  R13  ]
[ v21 A′22 A′23 ]  ⇒  A′
[ v31 A′32 A′33 ]
Note that R11 is computed in DLARFG. It is replaced by 1 to form the full v vector before the DLARF computations; after the DLARF operation, the value is changed back to R11 in DGEQR2. Similar operations are performed in the other variants but are not discussed explicitly.
2.4 RQ2, QL2 and LQ2 : Serial Implementation in LAPACK
This section outlines the unblocked matrix factorization algorithms in LAPACK for the RQ, QL and LQ variants. Double precision is used as an example, and a 3×3 matrix is used to explain the computational steps.
DGERQ2 Operation for a 3×3 Matrix
In RQ, A = RQ. The Householder elementary vector is stored along rows, and the Householder transformation Hi = I − τvv^T is applied from the right:
AQ^T = R
Q = H1 H2 ... Hk, where k = min(m,n)
Q^T = Hk ... H2 H1
The first iteration (calculating A · Hk) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column. Note that Hi is implemented as Hi = I − τv^T v, specifying v as a row vector instead of a column vector.
Step 0: DGERQ2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
[ A11 A12 A13 ]        [ 0  ]
[ A21 A22 A23 ]    τ = [ 0  ]
[ v31 v32 R33 ]        [ τ3 ]
Step 2: DLARF/DGEMV (output)
[ A11 A12 A13 ] × [ v31 v32 1 ]^T  ⇒  w = [ w1 ]
[ A21 A22 A23 ]                           [ w2 ]
Step 3: DLARF/DGER (output)
[ A11 A12 A13 ]   [ w1 ]                           [ A′11 A′12 R13 ]
[ A21 A22 A23 ] + [ w2 ] × (−τ) × [ v31 v32 1 ]  ⇒ [ A′21 A′22 R23 ]
[ v31 v32 1   ]                                    [ v31  v32  1   ]
Step 4: DGERQ2 (output)
[ A′11 A′12 R13 ]
[ A′21 A′22 R23 ]  ⇒  A′
[ v31  v32  R33 ]
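The row-oriented reflector used in RQ can be sketched the same way (a NumPy illustration; `house_row` is a hypothetical helper, with the pivot at the last element and H applied from the right):

```python
import numpy as np

def house_row(x):
    # Row reflector for RQ: v (with v[-1] = 1) and tau such that
    # x @ (I - tau * v v^T) = [0, ..., 0, alpha]; the pivot is the
    # LAST element of the row rather than the first.
    alpha = -np.sign(x[-1]) * np.linalg.norm(x)
    tau = (alpha - x[-1]) / alpha
    v = x / (x[-1] - alpha)
    v[-1] = 1.0
    return v, tau, alpha

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
v, tau, alpha = house_row(A[-1].copy())
H = np.eye(3) - tau * np.outer(v, v)
A2 = A @ H                 # applied from the right, as in DGERQ2
assert np.allclose(A2[-1], [0.0, 0.0, alpha])   # last row annihilated
```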
DGEQL2 Operation for a 3×3 Matrix
In QL, A = QL. The Householder elementary vector is stored along columns, and the Householder transformation Hi = I − τvv^T is applied from the left:
Q^T A = L
Q = Hk ... H2 H1, where k = min(m,n)
Q^T = H1 H2 ... Hk
The first iteration (calculating Hk · A) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column.
Step 0: DGEQL2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
[ A11 A12 v31 ]        [ 0  ]
[ A21 A22 v32 ]    τ = [ 0  ]
[ A31 A32 L33 ]        [ τ3 ]
Step 2: DLARF/DGEMV (output)
[ A11 A12 ]^T   [ v31 ]
[ A21 A22 ]   × [ v32 ]  ⇒  w = [ w1 ]
[ A31 A32 ]     [ 1   ]         [ w2 ]
Step 3: DLARF/DGER (output)
[ A11 A12 v31 ]           [ v31 ]                [ A′11 A′12 v31 ]
[ A21 A22 v32 ] + (−τ) ×  [ v32 ] × [ w1 w2 ] ⇒  [ A′21 A′22 v32 ]
[ A31 A32 1   ]           [ 1   ]                [ L31  L32  1   ]
Step 4: DGEQL2 (output)
[ A′11 A′12 v31 ]
[ A′21 A′22 v32 ]  ⇒  A′
[ L31  L32  L33 ]
DGELQ2 Operation for a 3×3 Matrix
In LQ, A = LQ. The Householder elementary vector is stored along rows, and the Householder transformation Hi = I − τvv^T is applied from the right:
AQ^T = L
Q = Hk ... H2 H1, where k = min(m,n)
Q^T = H1 H2 ... Hk
The first iteration (calculating A · H1) is illustrated; similar operations are performed in subsequent iterations, except that each step applies to one fewer row and column. Note that Hi is implemented as Hi = I − τv^T v, specifying v as a row vector instead of a column vector.
Step 0: DGELQ2 (input)
A = [ A11 A12 A13 ]        [ 0 ]
    [ A21 A22 A23 ]    τ = [ 0 ]
    [ A31 A32 A33 ]        [ 0 ]
Step 1: DLARFG (output)
A = [ L11 v12 v13 ]        [ τ1 ]
    [ A21 A22 A23 ]    τ = [ 0  ]
    [ A31 A32 A33 ]        [ 0  ]
Step 2: DLARF/DGEMV (output)
[ A21 A22 A23 ] × [ 1 v12 v13 ]^T  ⇒  w = [ w1 ]
[ A31 A32 A33 ]                           [ w2 ]
Step 3: DLARF/DGER (output)
[ 1   v12 v13 ]           [ w1 ]                      [ 1   v12  v13  ]
[ A21 A22 A23 ] + (−τ) ×  [ w2 ] × [ 1 v12 v13 ]  ⇒   [ L21 A′22 A′23 ]
[ A31 A32 A33 ]                                       [ L31 A′32 A′33 ]
Step 4: DGELQ2 (output)
[ L11 v12  v13  ]
[ L21 A′22 A′23 ]  ⇒  A′
[ L31 A′32 A′33 ]
2.5 DGEQRF : QR Blocked Implementation in LAPACK using
WY Representation
The DGEQRF subroutine uses blocking to perform QR decomposition based on the storage-efficient WY representation given by Schreiber and van Loan [7]. The paper discusses a compact representation Q = (I + Y · T · Y^T), where the columns of Y are the vectors v computed for each Householder transform and T is upper triangular. The representation is also rich in matrix-matrix multiplication, which is highly desirable for achieving high performance. DGEQRF uses Q = (I − Y · T · Y^T); the differences between the paper and the actual implementation are analyzed below. Refer to [7] for details.
In DGEQRF,
Q = I − Y T Y^T ∈ R^(m×m) is orthogonal, with Y ∈ R^(m×j) (m > j) and T ∈ R^(j×j) (upper triangular).
P = I − τvv^T is the Householder matrix.
If Q+ = QP, then Q+ = I − Y+ T+ Y+^T,
where Y+ = [ Y v ] ∈ R^(m×(j+1)) and
T+ = [ T z ]
     [ 0 ρ ]
with ρ = τ and z = −τ T Y^T v.
Proof:
I − Y+ T+ Y+^T = I − [ Y v ] [ T z ] [ Y^T ]
                             [ 0 ρ ] [ v^T ]
= I − [ Y v ] [ T Y^T + z v^T ]
              [ ρ v^T         ]
= I − Y T Y^T − Y z v^T − v ρ v^T    (2.5.1)
QP = (I − Y T Y^T)(I − τvv^T)
= I − Y T Y^T + Y T Y^T τvv^T − τvv^T    (2.5.2)
From Equations 2.5.1 and 2.5.2,
ρ = τ,  z = −τ T Y^T v.
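The update formula just proved can be checked numerically (a NumPy sketch; the identity is purely algebraic, so arbitrary Y, T, v and τ suffice):

```python
import numpy as np

# Arbitrary unit-lower-trapezoidal Y, upper triangular T, and a new reflector v.
rng = np.random.default_rng(3)
m, j = 6, 2
Y = np.tril(rng.standard_normal((m, j)), -1)
Y[np.arange(j), np.arange(j)] = 1.0
T = np.triu(rng.standard_normal((j, j)))
v = np.zeros(m)
v[j] = 1.0
v[j + 1:] = rng.standard_normal(m - j - 1)
tau = 1.3

Q = np.eye(m) - Y @ T @ Y.T
P = np.eye(m) - tau * np.outer(v, v)

# Augment: Y+ = [Y v], T+ = [[T, z], [0, rho]] with rho = tau, z = -tau*T@Y.T@v
Yp = np.column_stack([Y, v])
z = -tau * (T @ (Y.T @ v))
Tp = np.block([[T, z[:, None]], [np.zeros((1, j)), np.array([[tau]])]])
assert np.allclose(np.eye(m) - Yp @ Tp @ Yp.T, Q @ P)
```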
DGEQRF Implementation in LAPACK
DGEQRF iteratively processes a panel using the unblocked algorithm and then updates the trailing matrix. The call structure of the DGEQRF operation is:
DGEQRF
Loop : for each nb-width column panel;
DGEQR2 : Perform unblocked panel factorization
DLARFT : Find triangular factor matrix T
DLARFB : Apply Householder block reflector to
trailing matrix
Blocked QR factorization involves factoring a panel and applying the Householder transformation to the remainder of the matrix. The operation is done iteratively based on the blocking factor nb, which is the panel width. Figure 2.2 illustrates the basic operation. The panel of width nb is factored into Rp and Yp using the unblocked version of the QR factorization module, DGEQR2. Then the T [nb×nb] matrix is computed using DLARFT, the details of which are discussed later in the chapter. Finally, the transformation Q^T · Atrailing is applied to the trailing matrix, using the identity Q = (I − Yp · T · Yp^T), producing RT (the upper portion of the final R) and AT (the gray area, and the new matrix to factor). The dotted lines indicate the next iteration of the algorithm. Note that it is not required to build T for the final iteration.
Figure 2.2: LAPACK Blocked QR Factorization
Assume the A matrix is of size m×n, and let nb be the block size in the n direction. The details are explained using a 6×6 matrix. (Note: for simplicity, the LAPACK parameter NX is ignored.)
2.5.1 DGEQRF Computational Steps
Step 1: Factor the panel (DGEQR2)
Perform the QR factorization on the M×nb panel by calling DGEQR2. For this example, we assume nb = 2.
[ A11 A12 A13 A14 A15 A16 ]      [ α1  R12 A13 A14 A15 A16 ]
[ A21 A22 A23 A24 A25 A26 ]      [ v21 α2  A23 A24 A25 A26 ]
[ A31 A32 A33 A34 A35 A36 ]  ⇒   [ v31 v32 A33 A34 A35 A36 ]    τ = [ τ1 τ2 0 0 0 0 ]^T
[ A41 A42 A43 A44 A45 A46 ]      [ v41 v42 A43 A44 A45 A46 ]
[ A51 A52 A53 A54 A55 A56 ]      [ v51 v52 A53 A54 A55 A56 ]
[ A61 A62 A63 A64 A65 A66 ]      [ v61 v62 A63 A64 A65 A66 ]
We will next need to compute the T matrix (triangular factor).
Step 2: Find T Matrix.
Find T matrix for M × nb block using DLARFT. For this example, we assume nb = 2.
The T matrix is calculated as
T = τ1  when i = 1
T = [ T z ]  when 1 < i ≤ nb
    [ 0 ρ ]
where ρ = τi and
z = −τ T Y^T v    (2.5.3)
Equation 2.5.3 can be written as z = T · (−τ Y^T v):
Step 2.1: ResVect = (−τ Y^T v) using DGEMV
Step 2.2: zi = (T × ResVect) using DTRMV
Steps 2.1 and 2.2 are repeated nb times.
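Steps 2.1 and 2.2 amount to building T one column at a time; a NumPy sketch follows (a simplified real-precision analogue of DLARFT; `larft` is a hypothetical name and the triangular structure exploited by DTRMV is not used):

```python
import numpy as np

def larft(Y, tau):
    # Build the triangular factor T of I - Y @ T @ Y.T column by column:
    # Step 2.1 forms ResVect = -tau_i * Y^T v_i (DGEMV), Step 2.2 forms
    # z = T * ResVect (DTRMV), and rho = tau_i lands on the diagonal.
    k = len(tau)
    T = np.zeros((k, k))
    T[0, 0] = tau[0]
    for i in range(1, k):
        res_vect = -tau[i] * (Y[:, :i].T @ Y[:, i])   # Step 2.1
        T[:i, i] = T[:i, :i] @ res_vect               # Step 2.2
        T[i, i] = tau[i]                              # rho = tau_i
    return T

# Check: I - Y T Y^T must equal the product H1 H2 ... Hk.
rng = np.random.default_rng(4)
m, nb = 6, 3
Y = np.tril(rng.standard_normal((m, nb)), -1)
Y[np.arange(nb), np.arange(nb)] = 1.0
tau = rng.uniform(1.0, 2.0, nb)
T = larft(Y, tau)
Q = np.eye(m)
for i in range(nb):
    Q = Q @ (np.eye(m) - tau[i] * np.outer(Y[:, i], Y[:, i]))
assert np.allclose(np.eye(m) - Y @ T @ Y.T, Q)
```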
Step 3: Apply (I − Y T Y^T)^T × Atrailing using DLARFB
In DLARFB, the (I − Y T Y^T)^T × Atrailing computation is performed as per Equation 2.5.4. The different matrices involved in the computational steps are shown in Figure 2.3. Here,
A = [ Y C ],  Y = [ Y1 Y2 ]^T,  C = [ C1 C2 ]^T
where Y is the Householder reflector obtained using DGEQR2 and C is the remainder matrix of A, to which the transformation (I − Y T Y^T)^T · C is applied using DLARFB. Note that Y1 is lower triangular/trapezoidal, and a portion of R is stored in the upper triangular area of Y1. The different steps involved are detailed below.
(I − Y T Y^T)^T × Atrailing ⇒ C − (Y T Y^T)^T C
⇒ C − Y T^T Y^T C
⇒ C − Y ((C^T Y) T)^T    (2.5.4)
In DLARFB the following operations are performed to obtain Equation 2.5.4.
Step 3.1: Compute W = C^T Y, where Y = [ Y1 Y2 ]^T and C = [ C1 C2 ]^T, i.e. W = C1^T Y1 + C2^T Y2
Step 3.1.1 using DTRMM: W = W × Y1 (with W initialized to C1^T)
Step 3.1.2 using DGEMM: W = W + C2^T Y2
Step 3.2 using DTRMM: W = W × T
Figure 2.3: LAPACK A matrix and Work matrix in DLARFB (A = [Y1 C1; Y2 C2], Atrailing = [C1 C2] = C; the Work matrix holds W)
Step 3.3: At this step, W = (C^T Y) T. Now compute C := C − Y W^T.
Step 3.3.1 using DGEMM: C2 = C2 − Y2 W^T
Step 3.3.2 using DTRMM: W = W × Y1^T
Step 3.3.3: C1 = C1 − W^T
The above steps perform the operation C = C − Y ((C^T Y) T)^T, which is equivalent to Equation 2.5.4. The A matrix is transformed into the A′ matrix by the first block (M×2) operation; this continues for two more iterations (panels) for the given problem.
[ α1  R12 A13 A14 A15 A16 ]      [ α1  R12 A′13 A′14 A′15 A′16 ]
[ v21 α2  A23 A24 A25 A26 ]      [ v21 α2  A′23 A′24 A′25 A′26 ]
[ v31 v32 A33 A34 A35 A36 ]  ⇒   [ v31 v32 A′33 A′34 A′35 A′36 ]
[ v41 v42 A43 A44 A45 A46 ]      [ v41 v42 A′43 A′44 A′45 A′46 ]
[ v51 v52 A53 A54 A55 A56 ]      [ v51 v52 A′53 A′54 A′55 A′56 ]
[ v61 v62 A63 A64 A65 A66 ]      [ v61 v62 A′63 A′64 A′65 A′66 ]
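The DLARFB update of Steps 3.1-3.3 collapses to three dense products; a NumPy sketch follows (simplified: the DTRMM-based splitting into (Y1, Y2)/(C1, C2) blocks is omitted, and `larfb_left_t` is a hypothetical name):

```python
import numpy as np

def larfb_left_t(Y, T, C):
    # Apply (I - Y @ T @ Y.T)^T to C from the left, as in Equation 2.5.4:
    # W = (C^T Y) T  (Steps 3.1 and 3.2), then C := C - Y W^T (Step 3.3).
    W = (C.T @ Y) @ T
    return C - Y @ W.T

rng = np.random.default_rng(5)
m, nb, n = 6, 2, 4
Y = np.tril(rng.standard_normal((m, nb)), -1)
Y[np.arange(nb), np.arange(nb)] = 1.0
T = np.triu(rng.standard_normal((nb, nb)))
C = rng.standard_normal((m, n))
assert np.allclose(larfb_left_t(Y, T, C),
                   (np.eye(m) - Y @ T @ Y.T).T @ C)
```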
2.6 RQF, QLF and LQF Blocked Implementation in LAPACK
using WY Representation
This section outlines the blocked matrix factorization algorithms in LAPACK for the RQ, QL and LQ variants. Similar to the blocked QR implementation, all the variants perform a panel factorization using the corresponding unblocked algorithm and then update the trailing matrix. The matrices used in the equations are illustrated in the corresponding figures for the first iteration of the panel factorization. The figures also show the pointer to each matrix involved in the computation for the second iteration. In these figures, C represents the trailing matrix in the first iteration and C2 the trailing matrix in the second iteration.
2.6.1 DGERQF
In DGERQF the panels are row-wise and blocking proceeds from bottom to top. Here, as in Figure 2.4,
Figure 2.4: LAPACK Blocked RQ Factorization
A = [ C ],   Y = [ Y1 Y2 ],   C = [ C1 C2 ]
    [ Y ]
where Y is the Householder reflector obtained using DGERQ2, C is the remainder matrix of A, and (I − Y^T T Y) is the block reflector. The transformation C · (I − Y^T T Y) is applied using DLARFB. Note that Y2 is lower triangular, and a portion of R is stored in the upper triangular area of Y2 [refer to Figure 2.4]. Matrix T is lower triangular and has values
T = [ ρ 0 ]
    [ z T ]
in each iteration.
In Figure 2.4, A and A2 represent the pointer to Y in the first two iterations, whereas the pointer to C remains the same in each iteration. Computational steps for the transformation C · (I − Y^T T Y) are shown in Table 2.1.
In each iteration the block reflectors are applied backwards to compute H = Hk ... H2 · H1.
Table 2.1: DGERQF LARFB Computation
Operation               Calling Routine   Remark
W = C2                                    copy C2 to W
W = C · Y^T                               performs W = C1 · Y1^T + C2 · Y2^T
  W = W × Y2^T          DTRMM
  W = W + C1 · Y1^T     DGEMM
W = W × T               DTRMM
C = C − W · Y                             applies C = [C1 C2] − (W × [Y1 Y2])
  C1 = C1 − W · Y1      DGEMM             update C1
  W = W × Y2            DTRMM
  C2 = C2 − W                             update C2
2.6.2 DGEQLF
In DGEQLF, the panels are column-wise and blocking proceeds from left to right.
Figure 2.5: LAPACK Blocked QL Factorization
Here, as in Figure 2.5,
A = [ C Y ],   Y = [ Y1 ],   C = [ C1 ]
                   [ Y2 ]        [ C2 ]
where Y is the Householder reflector obtained using DGEQL2 and C is the remainder matrix of A, to which the transformation (I − Y T Y^T)^T · C is applied using DLARFB. Note that Y2 is upper triangular, and a portion of L is stored in the lower triangular area of Y2. Matrix T is lower triangular and has values
T = [ ρ 0 ]
    [ z T ]
in each iteration.
In Figure 2.5, A and A2 represent the pointer to Y in the first two iterations, whereas the pointer to C remains the same in each iteration. The computational steps remain similar to those of DGEQRF in finding (I − Y T Y^T)^T · C; the differences are in the storage of Y, C, L and T, as represented in the figure. In each iteration the block reflectors are applied backwards to compute H = Hk ... H2 · H1.
2.6.3 DGELQF
In DGELQF, the panels are row-wise and blocking proceeds from top to bottom.
Figure 2.6: LAPACK Blocked LQ Factorization
Here, as in Figure 2.6,
A = [ Y ],   Y = [ Y1 Y2 ],   C = [ C1 C2 ]
    [ C ]
where Y is the Householder reflector obtained using DGELQ2, C is the remainder matrix of A, and (I − Y^T T Y) is the block reflector. The transformation C · (I − Y^T T Y) is applied using DLARFB. Note that Y1 is upper triangular, and a portion of L is stored in the lower triangular area of Y1. Matrix T is upper triangular and has values
T = [ T z ]
    [ 0 ρ ]
in each iteration.
In Figure 2.6, A and A2 represent the pointer to Y in each iteration, whereas C1 and C2 denote that of C. The computational steps remain similar to those of DGERQF in finding C · (I − Y^T T Y); the differences are in the storage of Y, C, L and T, as represented in the figure. The elementary block reflectors are applied forward to compute H = H1 · H2 ... Hk.
CHAPTER 3: QR ATLAS IMPLEMENTATION USING
RECURSIVE ALGORITHM
3.1 Introduction
This chapter describes the mathematical details of QR matrix factorization and how it is implemented in ATLAS [12]. ATLAS provides implementations for QR (QL, RQ and LQ) using an unblocked and a blocked factorization algorithm in four precisions/types (real single, real double, single complex and double complex). As in LAPACK, the blocked algorithm iteratively processes a panel and then updates the remainder matrix using the L3BLAS. In ATLAS, the panel factorization is done using a recursive algorithm [3], which uses recursion to automatically block the panel into sub-panels until the problem size fits into the L2 cache. At that point the sub-panel is factored using the unblocked algorithm, which is cache contained. The trailing sub-panel is also updated using the L3BLAS within the recursive factorization steps. This gives us L3BLAS computation in both the panel factorization and the trailing matrix update, which improves the full matrix factorization performance. The dynamic blocking until the problem is L2-cache-contained also helps the unblocked algorithm involved in the panel factorization to run at cache speed. The trailing matrix update operations remain the same in ATLAS and LAPACK. Another difference compared to LAPACK is that ATLAS uses an L3BLAS-based formulation for finding the triangular factor T, based on the work by Elmroth and Gustavson [3].
The ATLAS implementations of all QR variants are outlined in this chapter. The recursive panel factorization algorithm and the derivation for finding the triangular factor T are presented for the QR, QL, RQ and LQ variants. The discussion is presented mainly in terms of the QR factorization algorithm in double precision; other precisions/types are covered where they differ. In general, a detailed discussion is provided where there is a major algorithmic difference compared to LAPACK. The author has contributed the mathematical analysis, implementation and performance analysis for all the variants and precisions.
ATLAS provides implementations in ANSI C for all the routines. The factorization routines for the QR variant are provided using:
1. Statically blocked version: ATL geqrf
2. Recursively blocked version: ATL geqrr, called by ATL geqrf to factor the column panel
3. Unblocked version: ATL geqr2, called by ATL geqrr to factor the L2-cache-contained column panel
In general, all the precisions are implemented using the same source file, compiled to a different object file for each precision. Similar implementations are provided for the RQ, QL and LQ variants. All QR routines use the high-performance BLAS provided by ATLAS for the bulk of the computation.
3.1.1 Outline
The remaining sections of the chapter are organized as follows. The unblocked QR2 factorization is discussed in Section 3.2. Section 3.3 provides implementation details of the blocked QR factorization, including a detailed illustration of the computational steps involved in the recursive panel factorization. Section 3.4 discusses the blocked QL factorization. The blocked factorizations for the RQ and LQ variants are presented in Section 3.5. Section 3.6 outlines the mathematical analysis for computing the triangular factor T based on the work by Elmroth and Gustavson [3].
3.2 Unblocked QR factorization (ATL geqr2)
The unblocked implementations of the QR variants in ATLAS follow the same mathematical formulation and call structure as those of LAPACK described in Section 2.3. Table 3.1 lists the routines implemented in ATLAS to provide the QR2 functionality. Refer to the ATLAS software for details.
3.3 Statically blocked QR factorization (ATL geqrf)
This section outlines ATLAS’s ATL geqrf, which implements statically blocked factorization for the
QR variant. As in the LAPACK computation described in Section 2.5, the matrix A is split into a
panel and remainder matrix iteratively. The difference is that the panel factorization step is per-
formed with a call to the recursive panel factorization, ATL geqrr, rather than calling the unblocked
33
Table 3.1: QR2 ATLAS routines
Program/Files Variant Precisions
ATL geqr2.c qr All
ATL geql2.c ql All
ATL gerq2.c rq All
ATL gelq2.c lq All
ATL larf.c All All
ATL larfg.c All All
ATL larfp.c All All
ATL ladiv.c All All
ATL lapy2.c All s,d
ATL lapy3.c All c,z
ATL lacgv.c All c,z
algorithm as LAPACK does. The recursion leads to an automatic blocking and it also replaces the
L2BLAS-based operations of LAPACK’s unblocked algorithm (GEQR2) by calls to the L3BLAS. In
addition, unlike LAPACK, the triangular factor T is computed using the recursive algorithm as
described in [3]. Automatic blocking in the panel factorization is achieved by recursing until the
problem size fit into the L2 cache and then calling the unblocked algorithm ATL geqr2. Equivalent
implementations are provided for RQ, QL and LQ factorization in ATL gerqf, ATL geqrlf and
ATL gelqf respectively. Table 3.2 gives the list of all the routines in ATLAS for implementing
blocked matrix factorization. Refer to ATLAS for details. A detailed analysis of QR variant is
given below. QL, RQ and LQ implementations details are given in subsequent sections.
3.3.1 Call structure for ATLAS GEQRF (ATL geqrf)
The following calls are repeated for each nb in MIN(M,N) (nb is the block size):
1. ATL geqrr : Performs the recursive factorization of a panel. ATL geqrr recursively calls itself, dividing the panel width by 2 until it reaches a stopping point and calls the unblocked code, ATL geqr2, to factor the sub-panel. The T matrix is computed by calling ATL larft, then ATL larfb is called to apply the Householder block reflector [7, 3]. ATL larft computes the T matrix using the recursive algorithm developed in [3].
2. ATL larfb : Applies the Householder block reflector to the remainder of the A matrix.

Table 3.2: QRF ATLAS routines
Program/Files   Variant   Precisions
ATL geqrf.c     qr        All
ATL geqlf.c     ql        All
ATL gerqf.c     rq        All
ATL gelqf.c     lq        All
ATL geqrr.c     qr        All
ATL geqlr.c     ql        All
ATL gerqr.c     rq        All
ATL gelqr.c     lq        All
ATL larft.c     All       All
ATL larfb.c     All       All
3.3.2 Computation of GEQRR (ATL geqrr)
The following section outlines the recursive panel factorization algorithm in ATL geqrr. The computational steps are illustrated in Figure 3.1 (assuming 2 levels of recursion).
Figure 3.1: Recursive QR Panel Factorization
The steps of Figure 3.1 are:
1. Block A to perform the factorization; A is stored in column-major order. T will be computed and stored as an upper triangular matrix at the end of all computational steps. The gray area represents completed operations.
2. Divide the panel into left and right sub-panels by recursion.
3. Continue the recursion on the left sub-panel A0 to get A00 and A01. Recursion continues until the problem fits into the L2 cache; here it is assumed that A00 fits into the L2 cache. Factor A00 to QR form using ATL geqr2, obtaining the elementary reflector Y00 and the upper triangular part of R. Produce the T00 matrix by calling ATL larft.
4. Apply the block reflector to A01.
5. Factor the sub-sub-panel A01 to obtain Y01 and the upper triangular part of R. Also obtain T01.
6. Compute the block part of T (i.e., z0) using the identity −T1 · Y1^T · Y2 · T2.
7. Apply the blocked transformation to A1.
8. Apply the recursive panel factorization to the right sub-panel, following steps similar to those for the left panel.
9. Compute z1, T10 and T11.
10. Compute z using T0 and T1.
11. Complete the factorization of the panel into Y and R, and obtain the upper triangular matrix T.
3.4 Blocked QL Factorization (ATL geqlf)
ATL geqlf provides the QL variant of the blocked factorization in ATLAS. As in the LAPACK computation described in Section 2.6.2, the matrix A is split into a panel and a remainder matrix and processed iteratively. Just as in ATL geqrf, ATLAS differs from LAPACK by calling a recursive routine (ATL geqlr) to factor the column panel. The rest of the section outlines the panel factorization in detail. The computation described here is applicable to all precisions.
3.4.1 Computation of GEQLR (ATL geqlr)
The computational steps are illustrated in Figure 3.2; here we show one level of recursion. The steps of Figure 3.2 are:
1. Block panel A to perform the recursion; A is stored in column-major order. T will be computed and stored as a lower triangular matrix at the end of all computational steps. The gray area represents completed operations.
2. Divide the panel into left and right sub-panels by recursion.
3. Factor the right sub-panel A1 to get Y1 and a lower triangular L part. Then compute T1.
4. Apply the block reflector to A0 (trailing matrix update).
5. Factor the left sub-panel and compute T0.
6. Compute z from T0 and T1 using the identity −T2 · Y2^T · Y1 · T1.
7. Complete the factorization of the panel into Y and L, and obtain the lower triangular matrix T.
Figure 3.2: Recursive QL Panel Factorization
3.5 Blocked RQ and LQ Factorization
The RQ and LQ transformations can be performed by using the QL and QR transformations with a transpose. This is done mainly for performance reasons. RQ is (Q · L)^T and LQ is (Q · R)^T, i.e.:
A^T = Q · R ⇒ A = (Q · R)^T = R^T · Q^T = L · Q^T    (3.5.1)
A^T = Q · L ⇒ A = (Q · L)^T = L^T · Q^T = R · Q^T    (3.5.2)
Since Q^T is also orthogonal, and the "Q" in the LQ and RQ factorizations refers to any orthogonal matrix, we can obtain an LQ or RQ factorization of A by transposing the result of a QR or QL factorization of A^T. The QL and QR factorizations access data in contiguous columns (the data is column-major) and proceed column by column, while the RQ and LQ factorizations, as written for LAPACK, access data in non-contiguous rows and proceed row by row. The cost of the transpose copy is small compared to the performance gained by using the more cache- and TLB1-friendly memory access pattern employed by QR and QL.
For a square matrix, the transpose copy in the above operations can be done in place without requiring any additional work space, but for a rectangular matrix an in-place transpose is extremely problematic. ATLAS performs the panel factorization of RQ and LQ by using the QL and QR panel factorizations, which captures the performance advantage described above. The details are discussed in the following section.
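The transpose trick itself can be demonstrated with any QR routine (here NumPy's, purely as an illustration of Equation 3.5.1):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 5))

# LQ of A from QR of A^T:  A^T = Q R  =>  A = R^T Q^T = L * Q^T
Q, R = np.linalg.qr(A.T)
L, Qt = R.T, Q.T
assert np.allclose(L, np.tril(L))    # L is lower triangular
assert np.allclose(L @ Qt, A)        # A = L * Q^T, an LQ factorization
```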
3.5.1 Blocked RQ Factorization (ATL gerqf)
ATL gerqf provides the RQ variant of the blocked factorization in ATLAS. As in the LAPACK
computation described in Section 2.6.1, the matrix A is split into a panel and remainder matrix
and processed iteratively.
In ATLAS, the panel factorization is obtained by making a transposed copy of the panel
and then factorizing using ATL geqlr as discussed earlier. After factorization and finding T , the
output of ATL geqlr is transposed back into the original panel. Then we perform the trailing matrix
update as usual. The computational steps are illustrated in Figure 3.3. The steps of Figure 3.3 are:
1. The first iteration starts from the last panel Ap.
2. Make a transposed copy of Ap and factor it using ATL geqlr; T is also computed. Note that we must conjugate the output T for complex precisions.
3. Transpose ATL geqlr's column panel back into the original row panel Ap, then perform the trailing matrix update.
4. Proceed to the next iteration.
1 Translation Lookaside Buffer.
Figure 3.3: Recursive RQ Panel Factorization
3.5.2 Blocked LQ Factorization (ATL gelqf)
The computational steps are similar to those of ATL gerqf described above and are illustrated in Figure 3.4. The steps of Figure 3.4 are:
1. The first iteration starts from the top panel Ap.
2. Make a transposed copy of Ap and factor it using ATL geqrr; T is also computed. As in the RQ variant, we must conjugate the output T for complex precisions.
3. Transpose ATL geqrr's column panel back into the row panel Ap, then perform the trailing matrix update.
4. Proceed to the next iteration.
3.6 Derivation for T-Block for the Recursive T Computation
This section outlines the computation of the T matrix (triangular factor) in ATL larft. ATLAS
provides a recursive algorithm for computing T using the L3BLAS based on the work by Elmroth
Figure 3.4: Recursive LQ Panel Factorization
and Gustavson [3]. That paper explicitly provides the formulation only for the QR variant, but
ATLAS's ATL_larft routine computes T for all four variants. The computational details are given
below.
QR
The Level 3 formulation for computing the T matrix for QR, as described in Elmroth and Gustavson [3],
is shown below; full details of the QR computation can be found in that paper.
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Q1 Q2 ··· Qk = I − Y T Y^T, where T is k × k upper triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors. Referring to
Figure 3.5, suppose k1 + k2 = k and T1 and T2 are the associated triangular matrices. Then
Q = (I − Y1 T1 Y1^T)(I − Y2 T2 Y2^T), where Y = [Y1 Y2]
  = I − Y2 T2 Y2^T − Y1 T1 Y1^T + Y1 T1 Y1^T Y2 T2 Y2^T          (3.6.1)
Figure 3.5: Recursive T Computation for QR
Also,
Q = I − Y T Y^T
  = I − [Y1 Y2] [T1 z; 0 T2] [Y1^T; Y2^T]     (semicolons separate block rows)
  = I − [Y1 Y2] [T1 Y1^T + z Y2^T; T2 Y2^T]
  = I − Y1 T1 Y1^T − Y1 z Y2^T − Y2 T2 Y2^T          (3.6.2)
Comparing Equations 3.6.1 and 3.6.2, the z block of T is z = −T1 Y1^T Y2 T2.
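This identity can be checked numerically. The sketch below (numpy, not ATLAS's C code) uses random matrices for Y1, Y2, T1 and T2, since the relation between the two expansions is purely algebraic and does not depend on the Householder structure:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k1, k2 = 8, 3, 2
Y1 = rng.standard_normal((m, k1))
Y2 = rng.standard_normal((m, k2))
T1 = np.triu(rng.standard_normal((k1, k1)))
T2 = np.triu(rng.standard_normal((k2, k2)))
z = -T1 @ Y1.T @ Y2 @ T2                          # the z block
Y = np.hstack([Y1, Y2])
T = np.block([[T1, z], [np.zeros((k2, k1)), T2]])
lhs = (np.eye(m) - Y1 @ T1 @ Y1.T) @ (np.eye(m) - Y2 @ T2 @ Y2.T)
assert np.allclose(lhs, np.eye(m) - Y @ T @ Y.T)
```

The assertion confirms that gluing T1, T2 and z into one block upper triangular T reproduces the product of the two factors.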
Computational Steps
The panel is recursively halved (into left and right sub-panels) until the width of a sub-panel is
two or less. At that point T is computed for the left and right sub-panels, and the z (T-block) is
then computed to combine them. This is done recursively; the example below shows one level of
recursion. In Figure 3.5, the following steps are performed in order to obtain T:
1. Divide the panel into left and right sub-panels by recursion. (Here we assume both sub-panels
have reached the stopping criterion.)
2. Compute the triangular factor T1 for Y1. The T1 matrix is computed as described in Step 2
of Section 2.5.1.
3. Compute the triangular factor T2 for Y2 using the same method as in the previous step.
4. Compute z using the identity z = −T1 Y1^T Y2 T2. This step completes the formation of T for
the panel Y.
The computational steps remain the same for the other variants, so the following sections limit
the discussion to the formulation of z for each variant, which is newly introduced in ATLAS.
RQ
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Qk ··· Q2 Q1 = I − Y^T T Y, where T is k × k lower triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.6: Recursive T Computation for RQ
Referring to Figure 3.6, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y2^T T2 Y2)(I − Y1^T T1 Y1), where Y = [Y1; Y2]
  = I − Y1^T T1 Y1 − Y2^T T2 Y2 + Y2^T T2 Y2 Y1^T T1 Y1          (3.6.3)
Also,
Q = I − [Y1^T Y2^T] [T1 0; z T2] [Y1; Y2]
  = I − [Y1^T Y2^T] [T1 Y1; z Y1 + T2 Y2]
  = I − Y1^T T1 Y1 − Y2^T z Y1 − Y2^T T2 Y2          (3.6.4)
Comparing Equations 3.6.3 and 3.6.4, the z block of T is z = −T2 Y2 Y1^T T1.
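The same style of numerical check applies to the RQ form, where the blocks of Y are stacked as rows and z sits below the diagonal of T (a numpy sketch with random data; the identity is again purely algebraic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k1, k2 = 8, 3, 2
Y1 = rng.standard_normal((k1, n))   # Y's blocks are stacked rows here
Y2 = rng.standard_normal((k2, n))
T1 = np.tril(rng.standard_normal((k1, k1)))
T2 = np.tril(rng.standard_normal((k2, k2)))
z = -T2 @ Y2 @ Y1.T @ T1                           # the z block
Y = np.vstack([Y1, Y2])
T = np.block([[T1, np.zeros((k1, k2))], [z, T2]])  # block lower triangular
lhs = (np.eye(n) - Y2.T @ T2 @ Y2) @ (np.eye(n) - Y1.T @ T1 @ Y1)
assert np.allclose(lhs, np.eye(n) - Y.T @ T @ Y)
```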
QL
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Qk ··· Q2 Q1 = I − Y T Y^T, where T is k × k lower triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.7: Recursive T Computation for QL
Referring to Figure 3.7, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y2 T2 Y2^T)(I − Y1 T1 Y1^T), where Y = [Y1 Y2]
  = I − Y2 T2 Y2^T − Y1 T1 Y1^T + Y2 T2 Y2^T Y1 T1 Y1^T          (3.6.5)
Also,
Q = I − [Y1 Y2] [T1 0; z T2] [Y1^T; Y2^T]
  = I − [Y1 Y2] [T1 Y1^T; z Y1^T + T2 Y2^T]
  = I − Y1 T1 Y1^T − Y2 z Y1^T − Y2 T2 Y2^T          (3.6.6)
Comparing Equations 3.6.5 and 3.6.6, the z block of T is z = −T2 Y2^T Y1 T1.
LQ
From [1, 3], the triangular factor T is defined by writing the product of k Householder
transforms Qi, i = 1, ···, k, as
Q = Q1 Q2 ··· Qk = I − Y^T T Y, where T is k × k upper triangular.
Y is the trapezoidal matrix consisting of the k consecutive Householder vectors.
Figure 3.8: Recursive T Computation for LQ
Referring to Figure 3.8, suppose k1 + k2 = k and T1 and T2 are the associated triangular
matrices. Then
Q = (I − Y1^T T1 Y1)(I − Y2^T T2 Y2), where Y = [Y1; Y2]
  = I − Y2^T T2 Y2 − Y1^T T1 Y1 + Y1^T T1 Y1 Y2^T T2 Y2          (3.6.7)
Also,
Q = I − [Y1^T Y2^T] [T1 z; 0 T2] [Y1; Y2]
  = I − [Y1^T Y2^T] [T1 Y1 + z Y2; T2 Y2]
  = I − Y1^T T1 Y1 − Y1^T z Y2 − Y2^T T2 Y2          (3.6.8)
Comparing Equations 3.6.7 and 3.6.8, the z block of T is z = −T1 Y1 Y2^T T2.
CHAPTER 4: RESULTS AND ANALYSIS
4.1 Introduction
This chapter compares the performance of the recursive panel factorization versus the unblocked
panel factorization, both of which use the statically blocked QR factorization for the full problem.
As we have seen, the blocked algorithm has a panel factorization step followed by the trailing
matrix update, performed iteratively. As described in the previous chapter, ATLAS uses a recursive
algorithm to perform the panel factorization, halving the width by recursion until the problem is
reduced to a sub-panel that fits into the L2 cache. At the recursive stopping point, it performs an
unblocked factorization (GEQR2) and then uses the L3BLAS to update the remainder of the panel
via the LAPACK routine LARFB. The trailing matrix of the full problem is also updated using
the L3BLAS in LARFB. The automatic blocking and Level 3 updates provided by the recursive
factorization give several distinct advantages:
1. The unblocked factorization (GEQR2) is cache contained.
2. Panel performance is dominated by the L3BLAS rather than the L2BLAS.
3. The efficient panel performance allows us to use a larger nb, resulting in improved L3BLAS
performance and scaling in the trailing matrix update.
4. Using automatic blocking ensures we do not have to separately tune for all panel shapes.
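The recursive splitting itself can be sketched as follows. This is a hedged numpy illustration under simplifying assumptions: rpanel_qr is a hypothetical name, np.linalg.qr in 'complete' mode stands in for GEQR2, the explicit multiply by Q1^T stands in for LARFB's Level 3 update, and ATLAS of course works in place on Householder vectors rather than forming Q explicitly.

```python
import numpy as np

def rpanel_qr(A, stop=2):
    """Recursive panel QR sketch: halve the width, factor the left half,
    apply a GEMM-rich update to the right half, then factor it in turn."""
    m, n = A.shape
    if n <= stop:
        return np.linalg.qr(A, mode='complete')  # stand-in for unblocked GEQR2
    k = n // 2
    Q1, R1 = rpanel_qr(A[:, :k], stop)   # factor left sub-panel
    B = Q1.T @ A[:, k:]                  # Level 3 update (LARFB's role)
    Q2, R2 = rpanel_qr(B[k:, :], stop)   # factor updated right sub-panel
    Q = Q1.copy()
    Q[:, k:] = Q1[:, k:] @ Q2            # Q = Q1 · diag(I_k, Q2)
    R = np.zeros((m, n))
    R[:, :k] = R1                        # assemble R from the two halves
    R[:k, k:] = B[:k, :]
    R[k:, k:] = R2
    return Q, R
```

Almost all of the work lands in the two matrix multiplies, which is precisely why the recursion keeps panel performance in the L3BLAS.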
The results in this chapter validate the above attributes on two different commodity
platforms. The statically blocked full matrix factorization with unblocked panel factorization (LA-
PACK) and the statically blocked full matrix factorization with dynamically blocked recursive panel
factorization (ATLAS) are compared, and the results are discussed. Performance comparisons of
the default LAPACK, tuned LAPACK and recursive ATLAS implementations are conducted for
full matrix factorization. Although we primarily discuss double precision QR, experimental results
for all QR variants and precisions/types are provided.
4.1.1 Outline
The rest of the chapter is organized as follows. Section 4.2 provides an overview of the experi-
mental methodology. In Section 4.3 we show the performance results for solving full problems. As
previously described, the full problem always uses a statically blocked algorithm, but LAPACK
factorizes the panel with an unblocked algorithm while ATLAS uses a recursive one. Section 4.3.1
provides a quantitative comparison of the recursive and non-recursive algorithms. The performance
results for full matrix factorization on two commodity platforms for all variants and precisions are
provided in Section 4.3.2, and their performance characteristics are analyzed.
4.2 Experimental Methodology
This section outlines the libraries, timing methodology, OS/hardware and tuning used in the
experiments.
Libraries and Software
Unless otherwise noted, all experiments were conducted using the ATLAS v3.9.26 and LAPACK
v3.2.1 libraries.
Timing and Floating Point Operations
All timings used the ATLAS timers, and each reported timing is the average of several sample
runs. We report averages since parallel timings in practice are strongly affected by the system
state during the call and vary widely based on unknown starting conditions. For the timing of full
square problems, we use 50 trials to factor matrices under 4000 × 4000 elements, and at least 20
trials for larger matrices. We flush all processors' caches between timing invocations, as described
in [10].
The floating point counts for all operations are taken from LAPACK's dopla.f. Performance
is measured in MFLOPS (millions of floating point operations per second). Note that the actual
number of floating point computations in the full matrix factorization varies with nb for the same
full problem size [3]. Despite this, the MFLOP rates reported here always use the unblocked flop
counts from dopla.f. Therefore the number of floating point operations
for a given problem size remains the same regardless of the factorization algorithm and blocking
factor. Hence, the MFLOPS we report always give an unbiased measure for comparing performance
across algorithms, and can be converted into raw times using a simple equation.
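The conversion is simply time = flops / (MFLOPS × 10^6). A small sketch using the familiar leading-order approximation to the real QR flop count (an approximation only; dopla.f's exact polynomial includes additional lower-order terms, and both function names and the 20000 MFLOPS rate below are made up for illustration):

```python
def qr_flops(m, n):
    # leading-order real-QR operation count; dopla.f adds lower-order terms
    return 2.0 * n * n * (m - n / 3.0)

def seconds_from_mflops(mflops, m, n):
    # invert the MFLOPS definition: time = flops / (mflops * 1e6)
    return qr_flops(m, n) / (mflops * 1e6)

# e.g., a 6000 x 6000 double precision QR reported at 20000 MFLOPS:
t = seconds_from_mflops(20000.0, 6000, 6000)
```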
OS/Hardware
Timing was done on two commodity platforms, both of which have 8 cores in two physical packages
and run Linux. More specifically, our test systems are:
(1) Opt8, O8: 2.1 GHz AMD Opteron 2352 running Fedora 8 Linux 2.6.25.14-69, with gcc
4.2.1, LAPACK 3.2.1 and ATLAS 3.9.26.
(2) Core2, C2: 2.5 GHz Intel E5420 Core2 Xeon running Fedora 9 Linux 2.6.25.11-97, with
gcc 4.3.0, LAPACK 3.2.1 and ATLAS 3.9.26.
In both architectures, each core has separate L1 and L2 caches, but L3 caches are shared
across cores. Each physical package on the Opt8 consists of one chip, which has a built-in memory
controller. The physical packages of the Core2 contain two chips, and all cores share an off-chip
memory controller.
Tuning
ATLAS's LAPACK autotuner [9] was used to empirically tune the LAPACK blocking parameter
(i.e., the nb used by the statically blocked GEQRF). For the variable blocking in the recursive
algorithm, the L2 cache size is obtained from ATLAS's cache detection probe.
4.3 Results
The influence of the dynamically blocked recursive algorithm on the panel factorization, together
with experimental results, was discussed in Chapter 1 (see Section 1.3). We have seen that the
recursive panel factorization is an efficient way to factorize panels of varying sizes and shapes.
In Section 4.3.1, we compare the performance of the ATLAS and LAPACK implementations
of QR as the blocking parameter is varied. Section 4.3.2 provides detailed results for all types,
precisions, variants and algorithms on both target platforms.
4.3.1 General Comparison of ATLAS and LAPACK Performance
The general characteristics and performance advantages of using a recursive algorithm rather than
the unblocked algorithm for panel factorization, and its influence on full problem performance, are
analyzed below. In Figure 4.1, full square matrices are factored while varying nb over multiples
of the double precision matrix multiplication blocking factor of 56 (56, 112, 168). Other nbs (28,
84 and 140) are also included so that the comparison does not favor any tuned algorithm. The
graph shows the results of both the recursive and non-recursive algorithms, with the recursive
performance data stacked over the non-recursive data. The hatched region highlights, for each
problem size at that specific nb, the performance improvement over the LAPACK routine that
factors the panel using the unblocked code. The X-axis shows the problem size and the Y-axis the
performance in MFLOPS. These results were obtained using ATLAS 3.9.40.
Figure 4.1: Static Blocking Full Square Matrix (Unblocked Vs Recursive Panel Factorization)
For small problems, we see relatively modest improvements. For large problems, which
can more fully use parallelism, however, we see that the recursive panel factorization is key in
getting good performance. For large problems, large blocking factors provide distinct performance
improvements. However, this improvement does not continue smoothly as we increase nb. This
leveling off of performance can largely be explained by two main factors: (1) GEMM already gets
sufficient cache reuse so that further size increases are only of modest benefit, and (2) the extra flops
involved in the larger nb no longer pay for themselves via strongly improved GEMM performance.
An important point to note is that an nb of 168 gives the recursive algorithm near-maximum
performance across problem sizes from 1000 to 8000. In the non-recursive case, however, either 56
or 112 must be used to achieve reasonable performance across problem sizes. When the unblocked
factorization uses an nb of 168, the panel factorization is so slow that the increased performance
of the trailing update is overwhelmed by the slowdown in the panel factorization.
4.3.2 Full Problem Performance
We have seen that the dynamically blocked panel factorization can improve the full matrix
factorization by efficiently factoring a larger panel. This in turn allows us to use larger matrices in the
trailing matrix update. To show the benefit of using a dynamically blocked recursive algorithm we
will examine the best tuned performance of the ATLAS recursive factorization versus the default
LAPACK implementation and the best tuned performance of the LAPACK implementation (as
per [9]).
The performance of the following methodologies is compared for all variants and preci-
sions/types:
1. Default LAPACK: This is the performance of the default installation of LAPACK. The default
width of all panels for these factorizations is 32, for all precisions.
2. Tuned LAPACK: The panel-width is tuned for different problem sizes for each precision using
the ATLAS LAPACK tuner, as discussed in [9].
3. Recursive ATLAS: Uses ATLAS's autotuned, dynamically blocked recursive matrix factor-
ization.
In this section, the blocked matrix factorizations are denoted QR, QL, RQ and LQ for
each variant. The precisions are S, C, D and Z for single precision real, single precision complex,
double precision real and double precision complex, respectively. Following the LAPACK convention,
the names of our methods consist of the precision followed by the two-letter algorithm identifier;
e.g., "ZRQ" refers to the RQ factorization for double-precision complex elements. Performance for
the Core2 and Opt8 architectures is provided below:
4.3.3 Core2 Full Problem Performance
The performance for the QR and QL variants is given in Figure 4.2, and Figure 4.3 gives the
performance data for the RQ and LQ variants. In these charts, the default LAPACK performance
is labeled Dflt-LPK, the best performance the ATLAS framework finds for LAPACK (by changing
blocking factors) is labeled Tuned-LPK, and the ATLAS performance is labeled Rec-ATL.
The X-axis is the square problem size and the Y-axis is the MFLOPS achieved; at each
problem size the three methodologies are compared side by side. Note that all of the performance
charts in this section follow this scheme.
In general, default LAPACK is the worst performer. Tuned LAPACK is typically much
faster, with recursive ATLAS giving the best performance. We concentrate our discussion on
QR double precision. Figure 4.2(b) shows both tuned LAPACK and recursive ATLAS perform
better than default LAPACK for all problem sizes. A representative example with dimension
6000 × 6000 shows that recursive ATLAS is 1.26 (1.90) times faster than the tuned (default)
LAPACK version, respectively. The higher performance of ATLAS is attributed to an efficient
panel factorization coupled with improved performance in the remainder matrix update enabled by
using a larger blocking factor. In this case, tuned LAPACK used nb = 80 whereas ATLAS used
nb = 168.
The tuned LAPACK is slightly faster than recursion for a few problem sizes, particularly
in the complex cases. This seems to be because ATLAS does not tune for every problem size
explicitly. Rather, it tunes for various sizes and then predicts the nb of other sizes based on the
closest measured size. For recursive factorization, which can use very large nb effectively, ATLAS’s
tuning decides to apply the larger nb too early, which slightly depresses the performance of some
small cases. Recursion requires more flops than the unblocked algorithm, which can have a small effect as well.
However, significant losses have never been observed, so we are unwilling to do the additional tuning
required to avoid these minor issues.
The full problem charts are fairly uniform for the QL and QR variants across all precisions/types,
so we discuss them collectively henceforth rather than individually.
For the RQ and LQ variants shown in Figure 4.3, it can be seen that for large problems
the ATLAS implementation performs far better than the best LAPACK implementation. For
example, for the 6000 × 6000 DLQ matrix, ATLAS is 1.7 times faster than the tuned LAPACK.
Similar performance characteristics can be seen for other precisions. This is attributed to ATLAS’s
improved LQ (RQ) factorization, which uses the more efficient QR (QL) for performing the panel
factorization (See Section 3.5).
4.3.4 Opt8 Full Problem Performance
This section shows results for the same runs on the AMD Opteron. The full problem performance
results are shown in Figures 4.4 and 4.5. The charts are fairly uniform and the performance
characteristics remain the same as on the Core2, so a detailed discussion is not provided. These
uniform performance results underscore the fact that the recursive algorithm employed in ATLAS
can successfully auto-adapt to various systems.
Figure 4.2: QR (a,b,c,d) & QL (c,d,e,f), Core2, Full Problem Performances
Figure 4.3: LQ (a,b,c,d) & RQ (c,d,e,f), Core2, Full Problem Performances
Figure 4.4: QR (a,b,c,d) & QL (c,d,e,f), Opt8, Full Problem Performances
Figure 4.5: LQ (a,b,c,d) & RQ (c,d,e,f), Opt8, Full Problem Performances
CHAPTER 5: SUMMARY
We have presented a dynamically blocked QR panel factorization approach based on the recursive
algorithm [3]. The recursion helps to perform dynamic blocking and enables the panel factorization
to use the L3BLAS. We have applied this technique to all QR variants (QR, QL, RQ and LQ) in four
precisions (single real, double real, single complex, and double complex). We have presented the
mathematical analysis and implementation details. Finally, the performance improvements in panel
factorization and their overall impact on full matrix factorization were reviewed. The results showed
that the performance of the new code exceeds that of the LAPACK block algorithms. In addition, we
have shown that the ATLAS implementation performs equally well for both square and rectangular
problems with minimal tuning of the panel width. An ANSI C implementation was developed
for all QR variants and precisions and made available as part of the ATLAS (Automatically Tuned
Linear Algebra Software) library.
BIBLIOGRAPHY
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, 3rd edition, 1999.
[2] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice and Experience, 20:1573–1590, June 2008.
[3] E. Elmroth and F. G. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop., 44(4):605–624, 2000.
[4] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM), 5(4), Oct 1958.
[5] Steven J. Leon. Linear Algebra with Applications. Prentice-Hall, 2002.
[6] Gregorio Quintana-Orti, Enrique S. Quintana-Orti, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), pages 301–310, Washington, DC, USA, 2008. IEEE Computer Society.
[7] Robert Schreiber and Charles Van Loan. A storage efficient WY representation for products of Householder transformations. Technical report, Cornell University, 1989. URL: http://hdl.handle.net/1813/6704.
[8] C. Bischof and C. Van Loan. The WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing, 8(1):s2–s13, January 1987.
[9] R. Clint Whaley. Empirically tuning LAPACK's blocking factor for increased performance. In Proceedings of the International Multiconference on Computer Science and Information Technology, Wisla, Poland, October 2008.
[10] R. Clint Whaley and Anthony M. Castaldo. Achieving accurate and context-sensitive timing for code optimization. Software: Practice and Experience, 38(15):1621–1642, 2008.
[11] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. Technical Report UT-CS-97-366, University of Tennessee, December 1997. http://www.netlib.org/lapack/lawns/lawn131.ps
[12] R. Clint Whaley and Jack Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, San Antonio, TX, USA, 1998. CD-ROM Proceedings. Winner, best paper in the systems category. http://www.cs.utsa.edu/~whaley/papers/atlas_sc98.ps
[13] R. Clint Whaley and Jack Dongarra. Automatically Tuned Linear Algebra Software. In Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999. CD-ROM Proceedings.
[14] R. Clint Whaley and Antoine Petitet. ATLAS homepage. http://math-atlas.sourceforge.net/
[15] R. Clint Whaley and Antoine Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, February 2005. http://www.cs.utsa.edu/~whaley/papers/spercw04.ps
[16] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
VITA
Siju Samuel completed his master's in Structural Engineering at the Indian Institute of
Technology Kanpur, India, in 1999. He then worked as a software engineer for 9 years. He
returned to college in 2008, entering as a special undergraduate and later, in January 2009, as a
graduate student in Computer Science at UTSA. He is happily married to Swapna and has a baby
girl, Anna.