
RICE UNIVERSITY

    Compressive Sensing for 3D Data Processing Tasks:

    Applications, Models and Algorithms

    by

    Chengbo Li

    A Thesis Submitted

    in Partial Fulfillment of the

    Requirements for the Degree

    Doctor of Philosophy

    Approved, Thesis Committee:

    Yin Zhang, Professor, Chair
    Computational and Applied Mathematics

    William W. Symes, Noah G. Harding Professor
    Computational and Applied Mathematics

    Wotao Yin, Assistant Professor
    Computational and Applied Mathematics

    Kevin Kelly, Associate Professor
    Electrical and Computer Engineering

    Houston, Texas

    April 2011

Abstract

    Compressive Sensing for 3D Data Processing

    Tasks: Applications, Models and Algorithms

    by

    Chengbo Li

    Compressive sensing (CS) is a novel sampling methodology representing a paradigm

    shift from conventional data acquisition schemes. The theory of compressive sens-

    ing ensures that under suitable conditions compressible signals or images can be

    reconstructed from far fewer samples or measurements than what are required by

    the Nyquist rate. So far in the literature, most works on CS concentrate on one-

    dimensional or two-dimensional data. However, besides involving far more data,

    three-dimensional (3D) data processing does have particularities that require the de-

    velopment of new techniques in order to make successful transitions from theoretical

    feasibilities to practical capacities. This thesis studies several issues arising from the

    applications of the CS methodology to some 3D image processing tasks. Two specific

    applications are hyperspectral imaging and video compression where 3D images are

    either directly unmixed or recovered as a whole from CS samples. The main issues

    include CS decoding models, preprocessing techniques and reconstruction algorithms,

    as well as CS encoding matrices in the case of video compression.

    Our investigation involves three major parts. (1) Total variation (TV) regularization plays a central role in the decoding models studied in this thesis. To solve

    such models, we propose an efficient scheme to implement the classic augmented

    Lagrangian multiplier method and study its convergence properties. The resulting

    Matlab package TVAL3 is used to solve several models. Computational results show

    that, thanks to its low per-iteration complexity, the proposed algorithm is capable

    of handling realistic 3D image processing tasks. (2) Hyperspectral image processing

    typically demands heavy computational resources due to an enormous amount of data

    involved. We investigate low-complexity procedures to unmix, sometimes blindly, CS

    compressed hyperspectral data to directly obtain material signatures and their abun-

    dance fractions, bypassing the high-complexity task of reconstructing the image cube

    itself. (3) To overcome the “cliff effect” suffered by current video coding schemes, we

    explore a compressive video sampling framework to improve scalability with respect

    to channel capacities. We propose and study a novel multi-resolution CS encoding

    matrix, and a decoding model with a TV-DCT regularization function.

    Extensive numerical results are presented, obtained from experiments that use not

    only synthetic data but also real data measured by hardware. The results establish

    feasibility and robustness, to varying extents, of the proposed 3D data processing

    schemes, models and algorithms. There still remain many challenges to be further

    resolved in each area, but hopefully the progress made in this thesis will represent a

    useful first step towards meeting these challenges in the future.

Acknowledgements

    I would like to express my deepest and sincerest gratitude to my academic advisor

    and also my spiritual mentor, Prof. Yin Zhang. His enthusiasm, profound knowledge,

    and upbeat personality have greatly influenced me in these four years. He has been

    helping me accumulate my research skills, tap into my full potential, as well as build

    up my confidence step by step in the course of researching. Without his wholehearted

    guidance, I might have already lost my interest in optimization, or even in research.

    I truly take pride in working with him.

    I feel so grateful to Prof. Wotao Yin, who brought me to this CAAM family at Rice University in 2007. He has provided me tremendous help on both the academic and personal sides. I owe many thanks to him for his encouragement, patience, and

    guidance. Besides, his intelligence and humor have deeply impressed me. He is not

    only my mentor, but also my friend in life.

    Prof. Kevin Kelly and Ting Sun, who are my collaborators in the ECE department

    of Rice University, have shared large quantities of data with me and helped me fully

    understand the mechanism of the hardware they built, such as the single-pixel camera. It has been a great pleasure working with them, and I look forward to future collaboration

    in other areas.

    Within these four years, two successful internship experiences tremendously en-

    riched my life. I deeply appreciate my supervisors Dr. Hong Jiang in Bell Laboratories

    and Dr. Amit Chakraborty in Siemens Corporate Research for their instructions and

    praise for my work there. Besides, a profound discussion between Dr. Jiang and

    me inspired my research on video compression. I could not have made such rapid

    progress in the field of video coding without Dr. Jiang’s encouragement and support.

    Besides, I need to thank Prof. Richard Baraniuk, who introduced me to a treasured opportunity to continue exercising my professional strengths after gradua-

    tion; Prof. Richard Tapia who taught me that mathematicians could take on more

    than mathematics; Prof. William Symes who is one of my committee members and

    earnestly reviewed my thesis; Prof. Liliana Borcea who was my mentor during my

    first year at CAAM and helped me adapt to the new environment; Daria Lawrence who

    reminded me about administrative procedures and important deadlines from time to

    time; Josh Bell who is one of my best friends in America and treated me just like one

    of his family; Chao Wang who is my soul mate and has been supportive through

    all these years. Meanwhile, I offer my regards and blessings to all of those professors

    and peers who have provided me knowledge and expertise during my undergraduate

    and graduate studies.

    Last but certainly not least, I wish to dedicate this thesis to my grandparents and

    my parents for their selfless love and unconditional support over the years. No matter

    where I am and how far apart we are, you are the love of my life for eternity.

Contents

    Abstract ii

    Acknowledgements iv

    List of Figures viii

    1 Introduction 1

    1.1 Compressive Sensing 2
    1.2 TV Regularization 5
    1.3 3D Data Processing 7
    1.4 Organization 8

    2 General TVAL3 Algorithm 9

    2.1 Review of Augmented Lagrangian Method 9
    2.1.1 Derivations and Basic Results 10
    2.1.2 Operator Splitting 14
    2.1.3 A Discussion on Alternating Direction Methods 18
    2.2 An Algorithm 20
    2.2.1 Descriptions 20
    2.2.2 Convergence Analysis 23
    2.3 General TVAL3 and One Instance 31
    2.3.1 Application to 2D TV Minimization 33

    3 Hyperspectral Data Unmixing 39

    3.1 Introduction to Hyperspectral Imaging 39
    3.2 Compressive Sensing and Unmixing Scheme 42
    3.2.1 Problem Formulation 43
    3.2.2 SVD Preprocessing 46
    3.2.3 Compressed Unmixing Algorithm 50
    3.3 Numerical Results on CSU Scheme 58
    3.3.1 Setup of Experiments 58
    3.3.2 Experimental Results on Synthetic Data 59
    3.3.3 Hardware Implementation 62
    3.3.4 Experimental Results on Hardware-Measured Data 64
    3.4 Extension to CS Blind Unmixing 69
    3.5 Experiments for CS Blind Unmixing 82
    3.5.1 Denoising Tests 83
    3.5.2 Further Scenario Tests 87
    3.5.3 Remarks on Compressed Blind Unmixing 90
    3.6 Conclusion 91

    4 Scalable Video Coding 100

    4.1 Introduction 100
    4.2 Compressive Video Sensing 105
    4.2.1 Encoding Using Compressive Sensing 106
    4.2.2 TV-DCT Method for Decoding 107
    4.3 Multi-Resolution Scheme 111
    4.3.1 Theoretical basis of Low Resolution Reconstruction 112
    4.3.2 Illustration of Low Resolution Reconstruction 115
    4.3.3 A Novel Idea to Build Scalable Sensing Matrices 116
    4.4 Numerical Experiments 126
    4.4.1 Graceful Degradation of TV-DCT Method 126
    4.4.2 Scalability of Multi-Resolution Scheme 134
    4.5 Discussions 139

    5 Conclusions and Remarks 141

    5.1 Contributions 141
    5.2 Remarks and Future Work 144

    Bibliography 146

List of Figures

    2.1 Recovered phantom image from orthonormal measurements. 35
    2.2 Recovered MR brain image. 36
    3.1 Synthetic abundance distributions. 59
    3.2 Endmember spectral signatures. 60
    3.3 Recoverability for noisy and noise-free cases. 61
    3.4 “Urban” image and endmember selection. 62
    3.5 Spectral signatures with water absorption bands abandoned. 63
    3.6 Estimated abundance: CS unmixing solution. 64
    3.7 Estimated abundance: least squares solution. 65
    3.8 Single-pixel camera schematic for hyperspectral data acquisition. 66
    3.9 Target image “Color wheel”. 67
    3.10 Measured spectral signatures of the three endmembers. 68
    3.11 Estimated abundance: CS unmixing solution. 69
    3.12 Four slices computed by the proposed approach. 70
    3.13 Four slices computed slice-by-slice using 2D TV algorithm TwIST. 71
    3.14 Four slices computed slice-by-slice using 2D TV algorithm TVAL3. 72
    3.15 Four slices computed slice-by-slice using 2D TV algorithm NESTA. 73
    3.16 Target image “Subtractive color mixing”. 74
    3.17 Estimated abundance: CS unmixing solution. 74
    3.18 Four slices computed by the proposed approach. 75
    3.19 Four slices computed slice-by-slice using 2D TV algorithm TwIST. 76
    3.20 Four slices computed slice-by-slice using 2D TV algorithm TVAL3. 77
    3.21 Four slices computed slice-by-slice using 2D TV algorithm NESTA. 78
    3.22 Endmember spectral signatures. 83
    3.23 Synthetic abundance distributions. 84
    3.24 Hyperspectral imaging under specific wavelengths. 85
    3.25 Removing the Gaussian noise involved in endmembers. 92
    3.26 Removing the periodic noise involved in endmembers. 93
    3.27 Removing the impulsive noise involved in endmembers (random positions corrupted).
    3.28 Removing the impulsive noise involved in endmembers (same positions corrupted). 95
    3.29 Correcting the wrong scale involved in endmembers. 96
    3.30 Selecting endmembers from candidates. 97
    3.31 Unmixing from one endmember missing. 98
    3.32 Unmixing from two endmembers missing. 99
    4.1 Diagram of a video network. 101
    4.2 Video coding using compressive sensing. 107
    4.3 TV-DCT regularization. 110
    4.4 Flowchart of two schemes. 116
    4.5 Recursive construction of vectorized permutation matrices. 118
    4.6 Demo of the initial permutation matrix. 119
    4.7 Diagram of the mapping T. 122
    4.8 CIF test videos: Frames from (a) News and (b) Container. 127
    4.9 Recoverability for the noise-free case. 127
    4.10 PSNR comparison using different regularizations. 128
    4.11 A typical frame from recovered clips Container. 130
    4.12 A typical frame from recovered clips News. 131
    4.13 PSNR as a function of additive Gaussian noise (CNR). 132
    4.14 Impact of quantization on CIF videos. 133
    4.15 Impact of quantization on HD videos. 133
    4.16 Reconstruction at different resolutions for HD video clip Life. 135
    4.17 Reconstruction at different resolutions for HD video clip Rush hour. 136
    4.18 Three methods used for low-resolution reconstruction. 137
    4.19 PSNR comparison for low-resolution reconstruction. 138

Chapter 1

    Introduction

    For many years, signal processing has relied on the well-known Shannon sampling theorem

    [1], stating that the sampling rate must be at least twice as high as the highest

    frequency to avoid losing information while capturing the signal (the so-called Nyquist

    rate). In many applications, such as digital cameras, sampling at the Nyquist rate produces too much data to store or transmit without compressing it first. In addition,

    increasing the sampling rate might be very costly in many other scenarios — medical

    scanners, high-speed analog-to-digital converters, and so forth.

    In recent years, a new theory of compressive sensing — also known under the

    terminology of compressed sensing, compressive sampling, or CS — has drawn a lot

    of researchers’ attention. It builds a fundamentally novel approach to data acquisition

    and compression which overcomes drawbacks of the traditional method. Nowadays,

    compressive sensing has been widely studied and applied to various fields, such as

    radar imaging [35], magnetic resonance imaging [36, 37, 38], analog-to-information

    conversion [39], sensor networks [40, 41] and even homeland security [42].

    A new iterative CS solver — TVAL3 — has been proposed for 1D and 2D sig-

    nal processing in the author’s master thesis [9], and has been successfully applied to


    single-pixel cameras [32, 34]. TVAL3 is short for “TV minimization by augmented

    Lagrangian and alternating direction algorithms”. Its efficiency and robustness have

    been empirically investigated, but the theoretical convergence has not been estab-

    lished. In this thesis, the algorithm behind TVAL3 will be restated for more general

    cases and a proof of convergence will be studied and presented. After that, the thesis

    will move into the main part — high-dimensional data processing employing the CS

    theory and the general TVAL3 method. It would be inefficient to study the general

    case of the high-dimensional data without considering inherent structures and char-

    acteristics of different kinds. Therefore, two classes of 3D data processing problems

    will be addressed here — hyperspectral data unmixing and video compression.

    The thesis is organized as follows: a review of compressive sensing, an introduction

    to the total variation, and the background of hyperspectral data unmixing and video

    compression will be covered in this chapter; Chapter 2 completes the general TVAL3

    algorithm by extending it to a more general setting and establishing a convergence

    result; Chapter 3 and 4 describe in detail the compressive sensing and unmixing

    of hyperspectral data and the compressive video sensing framework, respectively;

    Chapter 5 concludes the thesis by summarizing the main contributions and discussing the

    future work in the relevant fields of scientific research.

    1.1 Compressive Sensing

    In 2004, Donoho, Candès, Romberg and Tao conducted a series of in-depth studies based on the discovery that a signal may still be recovered even though the number of samples is deemed insufficient by Shannon’s criterion, and built the theory of

    compressive sensing [4, 3, 2]. To make the exact recovery possible from far fewer

    samples or measurements, CS theory counts on two principles: sparsity and incoher-


    ence. Sparsity screens out the signal of interest, while incoherence restricts the sensing

    scheme. Specifically, a large but sparse signal is encoded by a relatively small num-

    ber of incoherent linear measurements, and the original signal can be reconstructed

    from the encoded samples by finding the sparsest signal in the solution set of an

    under-determined linear system. It has been proven that computing the sparsest so-

    lution directly (ℓ0 minimization in mathematics) is NP-hard and generally requires

    prohibitive computations of exponential complexity [10]. However, the discovery of

    ℓ0-ℓ1 equivalence [8] averted solving NP-hard problems for compressive sensing.

    Differing from ℓ0-norm, which counts the number of nonzeros and is not a real

    norm literally, ℓ1-norm measures the sum of magnitudes of all elements of a vector.

    The use of ℓ1-norm as a sparsity-promotion function can be traced back decades. In

    1986, for example, Santosa and Symes [13] introduced ℓ1 minimization to reflection

    seismology, seeking a sparse reflection function which indicates significant variances

    between subsurface layers from bandlimited data. They appear to be the first to

    give a coherent mathematical argument behind using ℓ1-norm for sparsity promotion,

    though it had been used by practitioners long before. In the next few years, Donoho

    and his colleague carried this brilliant idea further and explored some early results

    regarding ℓ1 minimization and signal recovery [15, 16]. More work on ℓ1 minimization

    under special setups was investigated in the early 2000s [22, 23, 24, 25].

    Grounded on those early efforts, a major breakthrough was achieved by Candès,

    Romberg and Tao [3, 2], and Donoho [4] between 2004 and 2006, which theoretically

    proved ℓ1 minimization is equivalent to ℓ0 minimization under some conditions for

    signal reconstruction problems. Furthermore, they showed that a K-sparse signal

    (under some basis) could be exactly recovered from cK linear measurements using ℓ1

    minimization, where c is a constant. This new theory has significantly improved those

    earlier results on sparse recovery using ℓ1. Here, the constant c directly determines how many linear measurements are needed. The introduction of the restricted isometry property (RIP) for matrices [5] — a key concept of compressive sensing — addressed this question

    theoretically. Candès and Tao showed that if the measurement matrix satisfies the

    RIP to a certain degree, it is sufficient to guarantee the exact sparse signal recovery.

    It has been shown that Gaussian, Bernoulli and partial Fourier matrices with random

    permutations possess the RIP with high probability [3, 26], and become reasonable

    choices as the measurement or sensing matrix. For example, K-sparse signals of length

    N require only cK log(N/K)≪ N random Gaussian measurements for exact recovery.

    However, it is extremely difficult and sometimes impractical to verify the RIP property

    for most types of matrices. Is RIP truly an indispensable property for compressive

    sensing? For instance, measurement matrices A and GA in ℓ1 minimization should

    retain exactly the same recoverability and stability as long as matrix G is square and

    nonsingular, but their RIP constant may vary a lot due to different choices of G.

    A non-RIP analysis, studied by Zhang, proved recoverability and stability theorems

    without the aid of RIP and claimed prior knowledge could never hurt, but possibly

    enhance the reconstruction via ℓ1 minimization [7].
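    To make the sampling-complexity statement above concrete, here is a minimal toy sketch (not from the thesis; the constant c = 4, the problem sizes, and all variable names are illustrative choices). It draws a Gaussian sensing matrix, measures a K-sparse signal m ≈ cK log(N/K) times, and recovers it by ℓ1 minimization (basis pursuit) posed as a linear program.

        import numpy as np
        from scipy.optimize import linprog

        rng = np.random.default_rng(1)
        N, K = 128, 5
        x_true = np.zeros(N)
        x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

        m = int(4 * K * np.log(N / K))                 # c = 4 is an arbitrary choice
        A = rng.standard_normal((m, N)) / np.sqrt(m)   # Gaussian sensing matrix
        y = A @ x_true                                 # compressive measurements

        # basis pursuit: min ||x||_1 s.t. Ax = y, as an LP in x = x_plus - x_minus
        res = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=y,
                      bounds=[(0, None)] * (2 * N), method="highs")
        x_rec = res.x[:N] - res.x[N:]
        print(np.linalg.norm(x_rec - x_true))          # typically near zero at this sampling ratio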

    Other than ℓ1 minimization methods (also known as Basis Pursuit [12, 27, 28]),

    greedy methods could also handle compressive sensing problems by iteratively com-

    puting the support of the signal. Generally speaking, a greedy method is one that follows the metaheuristic of choosing the best immediate or local optimum at each stage, in the hope of eventually finding the global optimum. In 1993, Mallat and

    Zhang introduced Matching Pursuit (MP) [29], which is the prototypical greedy al-

    gorithm applied to compressive sensing. In recent years, a series of MP-based greedy

    methods have been proposed for compressive sensing, such as Orthogonal Matching

    Pursuit [30], Compressive Sampling Matching Pursuit [31], and so on. However, ℓ1

    minimization methods usually require fewer measurements than greedy algorithms


    and provide better stability. When noise exists or the signal is not exactly sparse, ℓ1

    minimization methods provide a much more stable solution, which makes them applicable to real-world problems.

    1.2 TV Regularization

    Total variation (abbreviated TV) regularization can be regarded as a generalized ℓ1

    regularization in compressive sensing problems. Instead of assuming the signal is

    sparse, the premise of TV regularization is that the gradient of the underlying signal

    or image is sparse. In other words, total variation measures the discontinuities and

    the TV minimization seeks the solution with the sparsest gradient.
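    As a small, purely illustrative example of this premise, the discrete anisotropic total variation of a 2D image can be computed from its forward differences (one common discretization among several); a piecewise constant image then has a much smaller TV value than its noisy counterpart.

        import numpy as np

        def tv_aniso(u):
            # sum of absolute forward differences along both axes
            return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

        img = np.zeros((64, 64))
        img[20:40, 20:40] = 1.0                             # piecewise constant image
        noisy = img + 0.1 * np.random.default_rng(0).standard_normal(img.shape)
        print(tv_aniso(img), tv_aniso(noisy))               # clean image has far smaller TV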

    In the broad area of compressive sensing, TV minimization has attracted more

    and more research activities since recent research indicates that the use of TV regular-

    ization instead of the ℓ1 term makes the reconstructed images sharper by preserving

    the edges or boundaries more accurately. In most cases, edges of the underlying im-

    age are more essential to characterize different properties than the smooth part. For

    example, in the realm of seismic imaging, detecting boundaries of distinct media play

    a key role in identifying the geological structure. This advantage of TV minimization

    stems from the property that it can recover not only sparse signals or images, but

    also dense staircase signals or piecewise constant images. Even though this result has

    only been theoretically proven under some special circumstances [2], it stands true

    on a much larger scale empirically.

    The history of TV is long and rich, tracing back at least to 1881 when Jordan first

    introduced total variation for real-valued functions while studying the convergence of

    Fourier series [11]. After decades of research, it has been thoroughly investigated and

    widely used for the computation of discontinuous solutions of inverse problems (see


    [19, 20, 21], for example). In 1992, Rudin, Osher and Fatemi [14] first introduced the

    concept of total variation into image denoising problems. From then on, TV minimizing

    models have become one of the most popular and successful methodologies for image

    denoising [14, 43], deconvolution [47, 46] and restoration [49, 48], to cite just a few.

    Some constructive discussions on TV regularized problems have been reported by

    Chambolle et al. [50, 51].

    In spite of those remarkable advantages of TV regularization, the properties of

    non-differentiability and non-linearity make TV minimization computationally far less tractable than ℓ1 minimization. Geman and Yang [45] proposed a

    joint minimization method to solve half-quadratic models [44, 45]. Grounded on

    this work, Wang, Yang, Yin and Zhang proposed and studied a fast half-quadratic

    method to solve deconvolution and denoising problems with TV regularization [46]

    and further extended this method to image reconstruction [52] and multichannel im-

    age deconvolution problems [53, 54]. The two central ideas in this approach are

    “splitting” and “alternating”. The key step is to introduce a so-called splitting vari-

    able to move the differentiation operator from inside the TV term to outside, thus

    enabling low-complexity subproblems in an alternating minimization setting. These

    ideas had previously been used in solving a number of other problems, but their application to TV regularized problems has resulted in algorithms significantly faster

    than the previous state-of-the-art algorithms in this area.

    Even though this method is very efficient and effective, it restricts the measure-

    ment matrix to be a partial Fourier matrix. Under a more general setting, Goldstein

    and Osher [56] added Bregman regularization [55] into this idea, producing the so-

    called split Bregman algorithm for TV regularized problems. This algorithm is equiv-

    alent to the classic alternating direction method of multipliers [58, 59] when only one

    inner iteration of split Bregman is performed. Around the same year, Li, Zhang and


    Yin employed the same splitting and alternating direction idea on the classic aug-

    mented Lagrangian method [60, 61] and developed an efficient TV regularized solver

    — TVAL3 [9, 125]. This particular implementation also integrates a non-monotone

    line search [82] and Barzilai-Borwein steps [79] into it and results in a much faster

    algorithm. TVAL3 was proposed and thoroughly studied in the author’s master thesis [9], and numerical evidence indicates that TVAL3 outperforms other TV solvers

    when solving compressive sensing problems, such as SOCP [48], ℓ1-Magic [2, 3, 5],

    TwIST [86, 85] and NESTA [84]. However, its theoretical convergence had not been established until recently. In this thesis, algorithms for 3D data processing

    are extended from TVAL3, whose general description and convergence proof will be presented in Chapter 2.

    1.3 3D Data Processing

    Three-dimensional (3D) data processing has tremendous applications in today’s world,

    such as in surveillance [93], exploitation [92], wireless communications [96], military

    intelligence [94], public entertainments [95], environmental monitoring [91], and so

    forth. However, some common bottlenecks or difficulties slow down the pace of devel-

    opment of 3D data processing. One of the main difficulties arises from the enormous

    volume of 3D data, which makes storage, transmission and even processing inconvenient. Therefore, it is critical to explore the inherent structure of the data in different domains and develop effective methods to reduce the volume of 3D data without losing the

    key information.

    Compressive sensing has been widely recognized as a promising and effective acqui-

    sition method for 1D and 2D data processing. In this thesis, the author will explore

    two important classes of 3D data processing tasks — hyperspectral unmixing and


    video compression — grounded on the framework of compressive sensing. Both hy-

    perspectral and video data can be regarded as a series of 2D images. Simply applying

    the compressive sensing idea on 2D images slice by slice could work to some extent,

    but is far from optimal. More sparsity and further compression

    can be obtained by properly utilizing inherent connections among those 2D slices.

    For example, video clips are usually continuous in the time domain, and the unchanged background in adjacent frames can be subtracted. This is one straightforward way

    to enhance the sparsity of video data. Moreover, advanced techniques or methods

    require further study on the nature of 3D data sets. More detailed introduction and

    review of hyperspectral and video data will be presented at the beginning of Chapters

    3 and 4, respectively.

    1.4 Organization

    The thesis is organized as follows. Chapter 2 describes the TVAL3 algorithm in a gen-

    eral setting and establishes a theoretical convergence result for the algorithm. Chapter

    3 focuses on hyperspectral imaging and proposes new compressive sensing and

    unmixing schemes which can significantly reduce both the storage and computational

    complexity. Chapter 4 turns to the discussion of video compression for wireless com-

    munication and proposes a novel multi-resolution framework based on compressive video sensing. Both Chapter 3 and Chapter 4 contain descriptions and results of a

    number of numerical experiments to demonstrate the efficiency and effectiveness, as

    well as limitations, of the proposed methods and frameworks. Lastly, Chapter 5 concludes

    the whole thesis and points out the future work of compressive sensing on 3D data

    processing.

Chapter 2

    General TVAL3 Algorithm

    The algorithm of TVAL3 has been proposed and numerically studied for TV regular-

    ized compressive sensing problems in the author’s master thesis [9]. Extensive numerical

    experiments have demonstrated its efficiency and high tolerance to noise. In this chap-

    ter, the methodology of TVAL3 will be described in a general case and convergence

    will be theoretically analyzed for the first time.

    Starting with the review of the classic augmented Lagrangian method, this chapter

    will describe the development of the general TVAL3 algorithm step by step.

    2.1 Review of Augmented Lagrangian Method

    For constrained optimization, a fundamental class of methods is to seek the minimizer

    or maximizer by solving a sequence of unconstrained subproblems iteratively. The

    solutions of subproblems should converge to a minimizer or maximizer eventually.

    Back in 1943, Courant [57] proposed the quadratic penalty method, which could be

    viewed as the precursor to the augmented Lagrangian method. This method penalizes

    equality constraint violation by adding a multiple of the square of the constraint


    violation into the objective function, turning constrained optimization problems into unconstrained ones. Due to its simplicity and intuitive appeal, this approach has been

    used and studied comprehensively. However, it requires the penalty parameter to go to

    infinity to guarantee convergence, which may cause a deterioration in the numerical

    conditioning of the method. In 1969, Hestenes [60] and Powell [61] independently

    proposed the augmented Lagrangian method which, by introducing and adjusting

    Lagrangian multiplier estimates, no longer requires the penalty parameter to go to

    infinity for the method to converge.

    2.1.1 Derivations and Basic Results

    Let us begin with considering a general equality-constrained minimization problem

    min_x f(x),  s.t.  h(x) = 0,  (2.1)

    where h is a vector-valued function and both f and hi for all i are differentiable. The

    first-order optimality conditions for (2.1) are

    ∇xL(x, λ) = 0,  h(x) = 0,  (2.2)

    where L(x, λ) = f(x) − λᵀh(x) is the Lagrangian function of (2.1). By optimiza-

    tion theory, conditions in (2.2) are necessary for optimality under some constraint

    qualifications. In addition, if problem (2.1) is a convex program, then they are also

    sufficient.

    In light of the optimality conditions above, an optimum x∗ to the original problem

    (2.1) is both a stationary point of the Lagrangian function and a feasible point of


    constraints, which means x∗ solves

    min_x L(x, λ),  s.t.  h(x) = 0.  (2.3)

    In fact, it is obvious that (2.1) is equivalent to (2.3) for any λ. According to the

    quadratic penalty method, a local minimizer x∗ of (2.3) may be obtained by solving a

    series of unconstrained problems with the constraint violations penalized as follows:

    min_x LA(x, λ; µ) = f(x) − λᵀh(x) + (µ/2) h(x)ᵀh(x).  (2.4)

    It follows from the analysis of the penalty method that λ can be arbitrary but µ needs to go

    to infinity, which may cause a deterioration of the numerical conditioning and result

    in inaccuracy. The augmented Lagrangian method iteratively solves problem (2.4)

    above, but updates the multiplier λ in a specific way, and still guarantees convergence to

    the minimizer of (2.1) without forcing penalty parameter µ to go to infinity. In that

    case, LA(x, λ;µ) is known as the augmented Lagrangian function.

    Intuitively, the augmented Lagrangian function differs from the standard La-

    grangian function by adding a square penalty term, and differs from the quadratic

    penalty function by the presence of the linear term involving the multiplier λ. Hence,

    the augmented Lagrangian method combines the advantages of the Lagrange multi-

    plier and penalty techniques without having their respective drawbacks.

    Specifically, the augmented Lagrangian method can be described as follows. Fixing

    the multiplier λ at the current estimate λk and the penalty parameter µ to µk > 0

    at the k-th iteration, we minimize the augmented Lagrangian function LA(x, λk;µk)

    with respect to x and denote the resulting minimizer by xk+1. To update

    the multiplier estimates from iteration to iteration, Hestenes [60] and Powell [61]


    suggested the following update formula:

    λk+1 = λk − µkh(xk+1). (2.5)
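    The iteration defined by (2.4) and (2.5) can be sketched on a toy equality-constrained quadratic program; the problem, the fixed penalty µ = 10, and all names below are illustrative choices only, not taken from the thesis. Because this toy augmented Lagrangian is quadratic in x, each x-step reduces to a linear solve.

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((3, 8)); b = rng.standard_normal(3)
        c = rng.standard_normal(8)        # toy problem: min 0.5*||x - c||^2  s.t.  A x = b
        lam = np.zeros(3); mu = 10.0
        x = c.copy()
        for k in range(50):
            # x-step: minimize L_A(x, lam; mu) exactly (a positive definite linear system)
            x = np.linalg.solve(np.eye(8) + mu * A.T @ A, c + A.T @ lam + mu * A.T @ b)
            # multiplier update (2.5)
            lam = lam - mu * (A @ x - b)
        print(np.linalg.norm(A @ x - b))  # constraint violation shrinks toward zero

    Note that µ is kept fixed here; as discussed above, convergence does not require µ to grow without bound.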

    Bertsekas [71] proved one of the fundamental theorems to estimate the error

    bounds and also the rate of convergence. For convenience, ‖·‖ refers to the ℓ2 norm

    hereafter. The theorem can be reiterated as follows:

    Theorem 2.1.1 (Local Convergence). Let x∗ be a strict local optimum of (2.1) at which the gradients ∇hi(x∗) are linearly independent, and f, h ∈ C² in an open

    neighborhood of x∗. Furthermore, x∗ together with its associated Lagrangian multiplier

    λ∗ satisfies

    zᵀ∇²xxL(x∗, λ∗)z > 0,

    for all z ≠ 0 with ∇hi(x∗)ᵀz = 0 ∀ i; i.e., the second-order sufficient conditions are satisfied for λ = λ∗. Choose µ̄ > 0 so that ∇²xxLA(x∗, λ∗; µ̄) is also positive definite.

    Then there exist positive constants δ, ǫ, and M such that the following claims hold:

    1. For all (λk, µk) ∈ D, where D ≜ {(λ, µ) : ‖λ − λ∗‖ < δµ, µ ≥ µ̄}, the problem

    min_x LA(x, λk; µk)  s.t.  ‖x − x∗‖ ≤ ǫ

    has a unique solution xk ≜ x(λk, µk), which satisfies

    ‖xk − x∗‖ ≤ (M/µk)‖λk − λ∗‖.

    Moreover, the function x(λ, µ) is continuously differentiable in the interior of D.


    2. For all (λk, µk) ∈ D,

    ‖λk+1 − λ∗‖ ≤ (M/µk)‖λk − λ∗‖,

    if λk+1 is attained by (2.5).

    3. For all (λk, µk) ∈ D, ∇²xxLA(xk, λk; µk) is positive definite and ∇hi(xk) are

    linearly independent.

    A detailed proof of the local convergence theorem can be found in [71], p. 108.

    The local convergence theorem implies at least three features of the augmented

    Lagrangian method. First of all, the method converges in one iteration if λ = λ∗.

    Secondly, as long as µk satisfies M/µk < 1 for all k, the error bounds in the theorem

    are able to guarantee that

    ‖λk+1 − λ∗‖ < ‖λk − λ∗‖;

    i.e., the multiplier estimates converge linearly. Hence, {xk} also converges linearly.

    Lastly, if µk goes to infinity, then

    lim_{k→+∞} ‖λk+1 − λ∗‖/‖λk − λ∗‖ = 0;

    i.e., the multiplier estimates converge superlinearly.

    The augmented Lagrangian method requires solving an unconstrained minimiza-

    tion subproblem at each iteration, which could be overly expensive. Therefore, design-

    ing appropriate schemes to solve subproblems is one of the key issues when applying

    the augmented Lagrangian method.

    Numerically, it is impossible to find an exact minimizer of the unconstrained minimization subproblem at each iteration. For convex optimization, Rockafellar [63] proved

    the global convergence of the augmented Lagrangian method in the convex case for

    an arbitrary penalty parameter, without demanding an exact minimum at each iter-

    ation. In addition, the objective function f is no longer assumed to be differentiable, and the theorem still holds.

    Theorem 2.1.2 (Global Convergence). Suppose that

    1. f is convex and hi are linear constraints;

    2. the feasible set {x : h(x) = 0} is non-empty;

    3. µk = µ is constant for all k;

    4. a sequence {ǫk} satisfies 0 ≤ ǫk → 0 and Σk √ǫk < +∞.

    2.1.2 Operator Splitting

    f1 and f2 are convex, proper, lower semicontinuous functionals, and B is a linear

    operator. In the early 1980s, Glowinski et al. studied this type of problem in depth

    using the augmented Lagrangian and operator-splitting methods [68, 69, 70], which

    are also closely related to the time-dependent approach as can be seen in, e.g., [67].

    We consider

    min_x {f1(Bx) + f2(x)},  s.t.  Ax = b,  (2.6)

    where f1 may be non-differentiable. Let w = Bx, then (2.6) is clearly equivalent to

    min_{w,x} {f1(w) + f2(x)},  s.t.  Ax = b, Bx = w.  (2.7)

    With a new variable and the extra linear constraints, the objective of (2.6) has been

    split into two parts. The aim of splitting is to separate non-differentiable terms from

    other differentiable ones. Now (2.7) can be simply rewritten as

    min_{w,x} {f1(w) + f2(x)},  s.t.  h(w, x) = 0,  (2.8)

    where for simplicity the two linear constraints have been written into a single con-

    straint.
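    For concreteness, the stacking of the two linear constraints of (2.7) into the single constraint function h of (2.8) can be written as follows (a purely illustrative sketch; shapes and names are arbitrary):

        import numpy as np

        def h(w, x, A, b, B):
            # h(w, x) = [A x - b;  B x - w]; h(w, x) = 0 encodes both constraints of (2.7)
            return np.concatenate([A @ x - b, B @ x - w])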

    The augmented Lagrangian function for (2.8) is

    LA(w, x, λ; µ) = f1(w) + f2(x) − λᵀh(w, x) + (µ/2) h(w, x)ᵀh(w, x).  (2.9)

    For fixed λk and µk, denote f1(w) by ϕ(w) and the remaining part of LA(w, x, λk; µk) by φ(w, x), which is differentiable. Then the augmented Lagrangian method solves

    min_{w,x} {ϕ(w) + φ(w, x)}  (2.10)

    at the k-th iteration and then updates the multiplier. The multiplier-updating formula

    could be more general than the one suggested by Hestenes and Powell; that is,

    λk+1 = λk − ςkµk h(wk+1, xk+1).  (2.11)

    Provided that ςk is selected from a closed interval in (0, 2), the convergence of the

    augmented Lagrangian method is still guaranteed in the convex case analogous to

    Theorem 2.1.2 [63]. Considering problem (2.6) without constraints, Glowinski proved

    a stronger theorem for both finite and infinite dimensional settings [70].

    Other than (2.11), Buys [62] and Tapia [64] have suggested two other multiplier

    update formulas (called Buys update and Tapia update respectively), both involving

    second-order information of LA. Tapia [65] and Byrd [66] have shown that both

    update formulas give quadratic convergence if one-step (for Tapia update) or two-

    step (for Buys update) Newton’s method is applied to subproblems. However, the

    estimate of the second-order derivative and the use of Newton’s step can be too

    expensive to compute at each iteration for large-scale problems.

    Specifically, an implementation of the augmented Lagrangian method for (2.6)

    can be put into the following algorithmic framework:

    Algorithm 2.1.1 (Augmented Lagrangian Method).

    Initialize µ0, λ0, 0 < α0 ≤ ς0 ≤ α1 < 2, tolerance tol, and starting points w0, x0.

    While ‖∇L(xk, λk)‖ > tol Do

        Set the starting point of the inner minimization to (wk, xk);

        Find a minimizer (wk+1, xk+1) of LA(w, x, λk; µk), starting from (wk, xk) and terminating when ‖∇(w,x)LA(wk+1, xk+1, λk; µk)‖ ≤ tol;

        Update the multiplier using (2.11) to obtain λk+1;

        Choose the new penalty parameter µk+1 ≥ µk and α0 ≤ ςk+1 ≤ α1;

    End Do

    To accommodate non-differentiable functions, let

    ∇̃g(u) = argmin_{ξ∈∂g(u)} ‖ξ‖.

    That is, ∇̃g(u) is the member of ∂g(u) with the smallest ℓ2 norm; and it is equivalent

    to the gradient of g if the functional is differentiable. In Algorithm 2.1.1, we will

    replace “∇” by “∇̃” whenever the objective function is non-differentiable.
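    For example, for g(u) = ‖u‖1 the subdifferential of each component at zero is the interval [−1, 1], whose minimum-norm element is 0, so the minimum-norm subgradient ∇̃g is simply the componentwise sign (an illustrative one-liner, not code from TVAL3):

        import numpy as np

        def min_norm_subgrad_l1(u):
            # tilde-grad of ||u||_1: sign(u_i) where u_i != 0, and 0 where u_i == 0
            return np.sign(u)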

    In Algorithm 2.1.1, ςk = 1 appears to generally give the best convergence from

    our computational experience, but it is not necessarily the case for the choice of small

    µk. Concerning the choice of µk, it has been shown that a larger µk results in a faster asymptotic convergence rate. On the other hand, a larger µk causes numerical conditioning problems in practice. Fortunately, the combined effect of all these factors is that convergence of the augmented Lagrangian method is relatively insensi-

    tive to the choice of the penalty parameter in most cases. In practice, starting with a

    small µk and then increasing µk from iterate to iterate usually gives a faster conver-

    gence numerically than keeping µk fixed. This approach is also known as parameter

    continuation.

    The augmented Lagrangian method has been successfully applied to different

    fields, such as constrained motion problems [75], seismic reflection tomography [76],

    and so forth. From a numerical perspective, the only nontrivial part in the use of

    Algorithm 2.1.1 is how to efficiently minimize the augmented Lagrangian function or


    equivalently (2.10) at each iteration. Taking into account the particular structure as

    in (2.10), a well-suited algorithm will be proposed and theoretically analyzed in the

    next section. Before that, another method of multipliers which has a close relation

    to the augmented Lagrangian method will be briefly reviewed.

    2.1.3 A Discussion on Alternating Direction Methods

    Extending the classic augmented Lagrangian method as described above, Glowin-

    ski et al. [58, 59] also suggested another slightly different way to handle (2.8) —

    the alternating direction method (abbreviated ADM). The common advantages of both methods include the capability of handling non-differentiability and side-

    constraints. Instead of requiring the exact minimizer of the augmented Lagrangian

    function (2.9) at each iteration, ADM only demands minimizers with respect to w

    and x, respectively, and then updates the multiplier. Specifically, at the k-th iteration,

    we compute

    xk+1 = argmin_x LA(wk, x, λk; µk),
    wk+1 = argmin_w LA(w, xk+1, λk; µk),
    λk+1 = λk − ςkµk h(wk+1, xk+1).  (2.12)

    In contrast to the joint minimization done in the augmented Lagrangian method,

    ADM uses the idea of alternating minimization to produce computationally more

    affordable iterations (2.12). Provided that

    0 < ςk = ς < (1 + √5)/2,

    the theoretical convergence of ADM can be similarly guaranteed [70]. More results

    and analysis applying ADM to convex programming and variational inequalities can


    be found, for example, in [72, 73, 74].
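    To make the iteration (2.12) concrete, the sketch below applies ADM to the toy splitting min ‖w‖1 + 0.5‖x − c‖² subject to x − w = 0, chosen only because both subproblems then have closed forms; the problem and all parameter values are illustrative and are not among the models treated later in this thesis.

        import numpy as np

        def soft(v, t):                                  # soft-thresholding, prox of t*||.||_1
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def adm_toy(c, mu=1.0, varsigma=1.0, iters=200):
            x = c.copy(); w = np.zeros_like(c); lam = np.zeros_like(c)
            for k in range(iters):
                # x-step: min 0.5*||x-c||^2 - lam'(x-w) + 0.5*mu*||x-w||^2   (closed form)
                x = (c + lam + mu * w) / (1.0 + mu)
                # w-step: min ||w||_1 - lam'(x-w) + 0.5*mu*||x-w||^2   (soft-thresholding)
                w = soft(x - lam / mu, 1.0 / mu)
                # multiplier update with relaxation parameter varsigma
                lam = lam - varsigma * mu * (x - w)
            return x, w

        x, w = adm_toy(np.array([3.0, -0.2, 0.0, 5.0]))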

    ADM can potentially reduce the iteration-complexity of the algorithm by solving

    two simpler subproblems at each iteration, instead of directly minimizing the aug-

    mented Lagrangian function (2.9). In fact, under the assumption that f2 is linear,

    Gabay and Mercier [59] also proved the convergence of ADM for

    0 < ςk = ς < 2.

    However, the linearity assumption is quite strict, and most problems stemming from signal processing or sparse optimization do not fall into this category.

    Even though ADM seems more appealing than the classic augmented Lagrangian

    method, our general TVAL3 algorithm is still founded on the augmented Lagrangian

    method. First of all, on the problems of our interest ADM appears to be more

    sensitive to the choice of penalty parameters, whereas the augmented Lagrangian

    method is more robust. This is advantageous since the observation or data acquired

    by hardware in the field of signal processing are almost always noisy and a more

    robust method is favorable. Secondly, ADM requires separability of the objective

    function into exactly two blocks, and demands high-accuracy minimization for each

    block. ADM is most efficient if both subproblems can be accurately solved efficiently.

    However, it is not necessarily the case for the problems we solve in signal processing

    or sparse optimization. For example, in TV regularized minimization, one of those

    subproblems is usually quadratic minimization and that dominates the computation.

    Thus, without special structures, it can be too expensive to find a high-accuracy

    minimizer at each iteration. The general TVAL3 algorithm considered in this chapter

    handles the quadratic subproblems in an inexact manner (one aggressive step along

    the descent direction). The convergence of the general TVAL3 algorithm, founded


    on the framework of the augmented Lagrangian method, will be proved later in this

    chapter.

    2.2 An Algorithm

    A major concern in applying the augmented Lagrangian method is how to efficiently solve a series of unconstrained subproblems of the form (2.10). Here we propose an alternating direction type method for minimizing this type of function.

    2.2.1 Descriptions

    Suppose g : Rn → R is continuous and bounded below, and has the following form:

    g(u) ≜ g(w, x) = ϕ(w) + φ(w, x).  (2.13)

    Furthermore, let us assume that φ is continuously differentiable and minimizing

    g(w, x) with respect to w alone is easy. Many optimization problems originating in compressive sensing, image denoising, deblurring and inpainting fall into this cate-

    gory after introducing appropriate splitting variables and employing the augmented

    Lagrangian method or other penalty methods. An instance will be given in the next

    section, and further discussion of problems of this type will appear in the following chapters.

    The goal is to solve

    min_{w,x} g(w, x).  (2.14)

    The proposed algorithm is based on an alternating direction scheme, as well as a nonmonotone line search [82] with Barzilai-Borwein [79] steps to accelerate convergence.

    The Barzilai-Borwein (BB) method utilizes the previous two iterates to select step

    length and may achieve superlinear convergence under certain circumstances [79, 80].

    For a given wk, applying the BB method to minimizing g(wk, x) with respect to x leads to a step length

    ᾱk = (skᵀ sk)/(skᵀ yk),  (2.15)

    or alternatively

    ᾱk = (skᵀ yk)/(ykᵀ yk),  (2.16)

    where sk = xk − xk−1 and yk = ∇xg(wk, xk)ᵀ − ∇xg(wk, xk−1)ᵀ (assuming g is differentiable w.r.t. x).
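    A direct transcription of (2.15) and (2.16) (variable names are illustrative):

        import numpy as np

        def bb_step(x_new, x_old, g_new, g_old, variant=1):
            s, y = x_new - x_old, g_new - g_old
            if variant == 1:
                return np.dot(s, s) / np.dot(s, y)       # (2.15)
            return np.dot(s, y) / np.dot(y, y)           # (2.16)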

    Starting with a BB step in (2.15) or (2.16), we utilize a nonmonotone line search

    algorithm (NLSA) to ensure convergence. The NLSA is an improved version of the

    Grippo, Lampariello and Lucidi nonmonotone line search [81]. Zhang and Hager

    [82] showed that the scheme was generally superior to previous schemes with either

    nonmonotone or monotone line search techniques, based on extensive numerical ex-

    periments. At each iteration, NLSA requires checking the so-called nonmonotone

    Armijo condition, which is

    g(wk, xk + αkdk) ≤ Ck + δαk∇xg(wk, xk)dk (2.17)

    where dk is a descent direction and Ck is a weighted average of function values. More

    specifically, the algorithmic framework can be depicted as follows:


    Algorithm 2.2.1 (Nonmonotone Alternating Direction).

    Initialize ζ > 0, 0 < δ < 1 < ρ, 0 ≤ ηmin ≤ ηmax ≤ 1, tolerance tol,

    and starting points w0, x0. Set Q0 = 1 and C0 = g(w0, x0).

    While ‖∇̃g(wk, xk)‖ > tol Do

    Let dk be a descent direction of g(wk, x) at xk;

    Choose αk = ᾱk ρ^θk, where ᾱk > 0 is the BB step and θk is the largest integer

    such that both the nonmonotone Armijo condition (2.17) and αk ≤ ζ hold;

    Set xk+1 = xk + αkdk;

    Choose ηk ∈ [ηmin, ηmax] and set

    Qk+1 = ηkQk + 1, Ck+1 = (ηkQkCk + g(wk, xk+1))/Qk+1;

    Set wk+1 = argminw g(w, xk+1).

    End Do
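    The following self-contained sketch instantiates Algorithm 2.2.1 on the toy objective g(w, x) = ‖w‖1 + 0.5‖x − c‖² + 0.5µ‖x − w‖², so that the w-step is soft-thresholding and the x-step is a steepest-descent direction with a BB step length safeguarded by the nonmonotone Armijo condition (2.17). The objective, the parameter values, and the backtracking-only line search are simplifying assumptions for illustration; this is not the TVAL3 implementation.

        import numpy as np

        def soft(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def nonmonotone_alt_dir(c, mu=1.0, delta=1e-4, rho=2.0, eta=0.85,
                                zeta=1e6, tol=1e-8, max_iter=500):
            gval = lambda w, x: (np.abs(w).sum() + 0.5 * np.dot(x - c, x - c)
                                 + 0.5 * mu * np.dot(x - w, x - w))
            grad = lambda w, x: (x - c) + mu * (x - w)     # gradient of g in x
            x = np.zeros_like(c)
            w = soft(x, 1.0 / mu)                          # exact minimization in w
            Q, C = 1.0, gval(w, x)
            x_old = x.copy()
            alpha = 1.0
            for k in range(max_iter):
                gx = grad(w, x)
                if np.linalg.norm(gx) < tol:
                    break
                d = -gx                                    # steepest-descent direction
                if k > 0:                                  # BB step (2.15), same w in both gradients
                    s, y = x - x_old, gx - grad(w, x_old)
                    if np.dot(s, y) > 0:
                        alpha = np.dot(s, s) / np.dot(s, y)
                alpha = min(alpha, zeta)
                # backtrack until the nonmonotone Armijo condition (2.17) holds
                while gval(w, x + alpha * d) > C + delta * alpha * np.dot(gx, d):
                    alpha /= rho
                x_old = x.copy()
                x = x + alpha * d
                Q_new = eta * Q + 1.0                      # update Q_k and the average C_k
                C = (eta * Q * C + gval(w, x)) / Q_new
                Q = Q_new
                w = soft(x, 1.0 / mu)                      # w-step: argmin_w g(w, x)
            return w, x

        w, x = nonmonotone_alt_dir(np.array([3.0, -0.2, 0.0, 5.0]))

    For this particular objective the x-part is quadratic, so the BB step is essentially exact and backtracking is rarely triggered; in general it is the nonmonotone condition that keeps the aggressive BB steps safe.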

    The nonmonotone Armijo condition could also be replaced by the nonmonotone Wolfe conditions [82]. The choice of ηk controls the degree of nonmonotonicity.

    Specifically, if ηk = 0 for all k, the line search is monotone; if ηk = 1 for all k, Ck is the

    average value of the objective function at (wi, xi) for i = 1, 2, . . . , k. Therefore, the bigger

    ηk is, the more nonmonotone the scheme becomes. Besides, θk need not be positive. In practical implementations, starting from the BB step, we can increase or decrease the step length by forward or backward tracking until the nonmonotone Armijo condition is satisfied.

    Although Algorithm 2.2.1 takes the form of an alternating direction method, it treats

    the two directions quite differently. One direction can be regarded as an “easy”

    direction, another a “hard” one. The proposed algorithm deviates from the two

    common alternating direction strategies: the classic alternating minimization or the

    popular block coordinate descent technique. Unlike the former, it does not require


    minimization of the objective function in the hard direction; and unlike the latter, it

    does not ask for a descent of function value at each iteration. This feature allows the

    algorithm to have inexpensive iterations and to take relatively large steps, while still

    possessing a convergence guarantee as will be shown. Indeed, computational evidence

    shows that this feature helps enhance the practical efficiency of the algorithm in a

    number of applications described later in this thesis.

    2.2.2 Convergence Analysis

    The convergence proof of Algorithm 2.2.1 has some similarities with the proof of

    NLSA shown in [82], and both proofs follow the same path. However, NLSA only

    considers continuously differentiable functionals using gradient methods whereas Al-

    gorithm 2.2.1 takes into account non-differentiability of the objective function under

    the framework of alternating direction. For notational simplicity, define

    gk(·) ≜ g(wk, ·).  (2.18)

    The convergence proof requires the following two assumptions:

    Assumption 2.2.1 (Direction Assumption). There exist c1 > 0 and c2 > 0 such that

    ∇gk(xk)dk ≤ −c1‖∇gk(xk)‖²,  ‖dk‖ ≤ c2‖∇gk(xk)‖.  (2.19)

    Assumption 2.2.2 (Lipschitz Condition). There exists L > 0, such that for any

    given x, x̃, and w,

    ‖∇xg(w, x)−∇xg(w, x̃)‖ = ‖∇xφ(w, x)−∇xφ(w, x̃)‖ ≤ L‖x− x̃‖. (2.20)


    The direction assumption obviously holds if

    dk = −∇gk(xk)ᵀ.

    This choice leads to the simple steepest-descent step in Algorithm 2.2.1. The Lipschitz

    condition is widely assumed in the analysis of convergence of gradient methods. In

    this sense, Assumptions 2.2.1 and 2.2.2 are both reasonable.

    To start with, the following lemma presents some basic properties and suggests

    the algorithm is well-defined.

    Lemma 2.2.1. If ∇gk(xk)dk ≤ 0 holds for each k, then for the sequences generated

    by Algorithm 2.2.1, we have gk(xk) ≤ gk−1(xk) ≤ Ck for each k and {Ck} is monotone

    non-increasing. Moreover, if ∇gk(xk)dk < 0, step length αk > 0 always exists.

    Proof. Define the real-valued function

    Dk(t) = (tCk−1 + gk−1(xk))/(t + 1)  for t ≥ 0;

    then

    D′k(t) = (Ck−1 − gk−1(xk))/(t + 1)²  for t ≥ 0.

    Due to the nonmonotone Armijo condition (2.17) and ∇gk(xk)dk ≤ 0, we have

    Ck−1 − gk−1(xk) ≥ −δαk−1∇gk−1(xk−1)dk−1 ≥ 0.

    Therefore, D′k(t) ≥ 0 holds for any t ≥ 0, and then Dk is non-decreasing.

    Since

    Dk(0) = gk−1(xk) and Dk(ηk−1Qk−1) = Ck,


    we have

    gk−1(xk) ≤ Ck for any k.

    As described in Algorithm 2.2.1,

    wk = argmin_w g(w, xk),

    then we have

    g(wk, xk) ≤ g(wk−1, xk).

    Hence, gk(xk) ≤ gk−1(xk) ≤ Ck holds for any k.

    Furthermore,

    Ck+1 = (ηkQkCk + gk(xk+1))/Qk+1 ≤ (ηkQkCk + Ck+1)/Qk+1,

    i.e.,

    (ηkQk + 1)Ck+1 ≤ ηkQkCk + Ck+1,

    i.e.,

    Ck+1 ≤ Ck.

    Thus, {Ck} is monotone non-increasing.

    If Ck is replaced by gk(xk) in (2.17), the nonmonotone Armijo condition becomes

    the standard Armijo condition. It is well-known that αk > 0 exists for the standard

    Armijo condition when ∇gk(xk)dk < 0 and g is bounded below (see [83] for example). Since gk(xk) ≤ Ck, it follows that αk > 0 exists as well for the nonmonotone Armijo

    condition (2.17).


    Defining Ak recursively by

    Ak = (1/(k + 1)) Σ_{i=0}^{k} gi(xi),  (2.21)

    then by induction, it is easy to show that Ck is bounded above by Ak. Together with

    the facts that Ck is also bounded below by gk(xk) and αk > 0 always exists, it is

    sufficient to claim that Algorithm 2.2.1 is well-defined.

    In the next lemma, the lower bound of the step length generated by Algorithm

    2.2.1 will be given, in preparation for the final convergence proof.

    Lemma 2.2.2. Assuming ∇gk(xk)dk ≤ 0 for any k and the Lipschitz condition (2.20) holds with constant L, then

    αk ≥ min{ ζ/ρ, (2(1 − δ)/(Lρ)) |∇gk(xk)dk|/‖dk‖² }.  (2.22)

    Proof. It is noteworthy that ρ > 1 is required in Algorithm 2.2.1. If ραk ≥ ζ , then

    the lemma already holds.

    Otherwise,

    ραk = ᾱk ρ^(θk+1) < ζ,

    which indicates that θk is not the largest integer to make the k-th step length less

    than ζ . According to Algorithm 2.2.1, θk must be the largest integer satisfying the

    nonmonotone Armijo condition (2.17), which leads to

    gk(xk + ραkdk) ≥ Ck + δραk∇gk(xk)dk.


    Lemma 2.2.1 showed Ck ≥ gk(xk), so

    gk(xk + ραkdk) ≥ gk(xk) + δραk∇gk(xk)dk. (2.23)

    On the other hand, for α > 0 we have

    ∫₀^α (∇gk(xk + tdk) − ∇gk(xk)) dk dt = gk(xk + αdk) − gk(xk) − α∇gk(xk)dk.

    Together with the Lipschitz condition, we get

    gk(xk + αdk) = gk(xk) + α∇gk(xk)dk + ∫₀^α (∇gk(xk + tdk) − ∇gk(xk)) dk dt
                ≤ gk(xk) + α∇gk(xk)dk + ∫₀^α tL‖dk‖² dt
                = gk(xk) + α∇gk(xk)dk + (1/2)Lα²‖dk‖².

    Let α = ραk, which gives

    gk(xk + ραkdk) ≤ gk(xk) + ραk∇gk(xk)dk + (1/2)Lρ²αk²‖dk‖².  (2.24)

    Compare (2.23) with (2.24), which implies

    (δ − 1)∇gk(xk)dk ≤ (1/2)Lραk‖dk‖².

    Since ∇gk(xk)dk ≤ 0,

    αk ≥ (2(1 − δ)/(Lρ)) |∇gk(xk)dk|/‖dk‖².

    Therefore, the step length αk is bounded below by (2.22).

    With the aid of the above lower bound, we are able to establish the convergence


    of Algorithm 2.2.1:

Theorem 2.2.1 (Optimality Conditions). Suppose g is bounded below and both the direction assumption (2.19) and the Lipschitz condition (2.20) hold. Then the iterates uk := (wk, xk) generated by Algorithm 2.2.1 satisfy

lim_{k→∞} ∇̃g(uk) = 0.   (2.25)

Proof. Since g is differentiable with respect to x, (2.25) is equivalent to

lim_{k→∞} ∇̃wg(wk, xk) = 0   and   lim_{k→∞} ∇xg(wk, xk) = 0.   (2.26)

The proof consists of establishing the two limits separately.

First, by the construction of Algorithm 2.2.1,

wk = argminw g(w, xk),

so

0 ∈ ∂wg(wk, xk),

which implies

∇̃wg(wk, xk) = 0.

Next, we establish the second limit based on the nonmonotone Armijo condition

gk(xk + αkdk) ≤ Ck + δαk∇gk(xk)dk.   (2.27)

If ραk < ζ, then by the lower bound on αk given in Lemma 2.2.2 and the direction assumption (2.19), we have

gk(xk + αkdk) ≤ Ck − 2δ(1 − δ)|∇gk(xk)dk|² / (Lρ‖dk‖²)
             ≤ Ck − 2δ(1 − δ)c1²‖∇gk(xk)‖⁴ / (Lρc2²‖∇gk(xk)‖²)
             = Ck − [2δ(1 − δ)c1² / (Lρc2²)] ‖∇gk(xk)‖².

On the other hand, if ραk ≥ ζ, this lower bound together with the direction assumption (2.19) gives

gk(xk + αkdk) ≤ Ck + δαk∇gk(xk)dk
             ≤ Ck − δαkc1‖∇gk(xk)‖²
             ≤ Ck − (δζc1/ρ)‖∇gk(xk)‖².

Define the constant

τ̃ = min{ 2δ(1 − δ)c1² / (Lρc2²), δζc1/ρ },

which leads to

gk(xk + αkdk) ≤ Ck − τ̃‖∇gk(xk)‖².   (2.28)

Next we show that

1/Qk ≥ 1 − ηmax.   (2.29)

Since Q0 = 1, we obviously have

1/Q0 ≥ 1 − ηmax.


Assuming that (2.29) also holds for k = j, then

Qj+1 = ηjQj + 1 ≤ ηj/(1 − ηmax) + 1 ≤ ηmax/(1 − ηmax) + 1 = 1/(1 − ηmax),

which implies

1/Qj+1 ≥ 1 − ηmax.

By induction, we conclude that (2.29) holds for all k.

Thus, it follows from (2.28) and (2.29) that

Ck − Ck+1 = Ck − (ηkQkCk + gk(xk+1)) / Qk+1
          = (Ck(ηkQk + 1) − (ηkQkCk + gk(xk+1))) / Qk+1
          = (Ck − gk(xk+1)) / Qk+1
          ≥ τ̃‖∇gk(xk)‖² / Qk+1
          ≥ τ̃(1 − ηmax)‖∇gk(xk)‖².   (2.30)

Since g is bounded below, {Ck} is also bounded below. Moreover, Lemma 2.2.1 shows that {Ck} is monotone non-increasing, so there exists C∗ ∈ R such that

    Ck → C∗, as k →∞.

    Hence, we have

    Ck − Ck+1 → 0, as k →∞.


Combining this with (2.30), we get

‖∇gk(xk)‖ → 0;

i.e.,

lim_{k→∞} ∇xg(wk, xk) = 0.

Combining the two parts completes the proof of the theorem.

    With the aid of Theorem 2.2.1, we can further conclude the global convergence of

    Algorithm 2.2.1 under the assumption of strong convexity.

Corollary 2.2.1. If g is jointly strongly convex, then under the same assumptions as in Theorem 2.2.1, the sequence (wk, xk) generated by Algorithm 2.2.1 converges to the unique minimizer (w∗, x∗) of the unconstrained problem (2.13).

The proof is omitted here since it follows directly from Theorem 2.2.1.

To this point, we have proposed an alternating-direction-type method with a nonmonotone line search for a special class of unconstrained minimization problems, and completed its description by thoroughly studying its convergence. TVAL3, a combination of this algorithm and the classic augmented Lagrangian method that aims at solving a more general class of both constrained and unconstrained problems, is described next.

    2.3 General TVAL3 and One Instance

The general TVAL3 algorithm is built by combining the classic augmented Lagrangian method with an appropriate variable splitting (see Algorithm 2.1.1) and the nonmonotone alternating direction method for the subproblems (see Algorithm 2.2.1). More precisely, it implements the following algorithmic framework after variable splitting:

    Algorithm 2.3.1 (General TVAL3).

    Initialization.

    While ‖∇̃L(xk, λk)‖ > tol Do

Set starting points w0^(k+1) = wk and x0^(k+1) = xk for the subproblem;

    Find minimizer wk+1 and xk+1 of LA(w, x, λk;µk) using Algorithm 2.2.1;

Update the multiplier using (2.11) and choose the new penalty parameter no smaller than the current one;

    End Do

In fact, the purpose of the variable splitting is to separate out the non-differentiable part so that its subproblem has an easily computed closed-form solution when applying the general TVAL3 algorithm. In other words, the original non-differentiable problem is divided into two parts: a separable non-differentiable part with an explicit solution, and a differentiable part that carries the bulk of the computation.

From the previous analysis, the convergence of this method follows immediately: Theorem 2.1.2 ensures the convergence of the outer loop while Theorem 2.2.1 provides the convergence of the inner loop, which together indicate the convergence of the general TVAL3 method. We do not pursue a convergence-rate analysis, since the rate is not necessarily related to the practical efficiency of a method: it describes the relation between the error and the number of iterations, but neglects the complexity of each iteration. In practice, the real cost is the product of both. One advantage of the general TVAL3 method is its low cost per iteration: usually only two or three matrix-vector multiplications are required per inner iteration, which significantly decreases the overall computation.
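As a rough illustration only, the following MATLAB skeleton shows how the outer loop of Algorithm 2.3.1 could be organized; the function name, the residual-based stopping test, the doubling of the penalty parameter, and the inner_solver handle (standing in for Algorithm 2.2.1) are assumptions made for this sketch, not the released TVAL3 implementation.

% Schematic augmented Lagrangian outer loop in the spirit of Algorithm 2.3.1.
% inner_solver(A, b, lambda, mu, w0, x0) is a user-supplied handle that
% approximately minimizes L_A(w, x, lambda; mu), e.g., by Algorithm 2.2.1.
function [w, x] = general_tval3_sketch(inner_solver, A, b, w, x, mu, tol, maxit)
    lambda = zeros(size(b));                 % multiplier for Au = b
    for k = 1:maxit
        [w, x] = inner_solver(A, b, lambda, mu, w, x);  % warm-started subproblem
        r = A*x - b;                         % primal residual
        if norm(r) <= tol, break; end        % simplified stopping test
        lambda = lambda - mu*r;              % first-order multiplier update, cf. (2.11)
        mu = min(2*mu, 1e8);                 % penalty parameter never decreases
    end
end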


    2.3.1 Application to 2D TV Minimization

    One instance is for solving the compressive sensing problem with total variation (TV)

    regularization:

min_u TV(u) := ∑i ‖Diu‖,   s.t. Au = b,   (2.31)

    where u ∈ Rn or u ∈ Rs×t with s · t = n, Diu ∈ R2 is the discrete gradient of u at

    pixel i, A ∈ Rm×n (m < n) is the measurement matrix, and b ∈ Rm is the observation

    of u via some linear measurements. The regularization term is called isotropic TV. If

‖·‖ is replaced by the 1-norm, then it is called anisotropic TV. With minor modifications, the following derivation for solving (2.31) applies to anisotropic TV as well.

    In light of variable splitting, an equivalent variant of (2.31) is considered:

min_{wi,u} ∑i ‖wi‖,   s.t. Au = b and Diu = wi for all i.   (2.32)

    Its corresponding augmented Lagrangian function is

LA(wi, u) = ∑i ( ‖wi‖ − νiᵀ(Diu − wi) + (βi/2)‖Diu − wi‖² ) − λᵀ(Au − b) + (µ/2)‖Au − b‖²,   (2.33)

    and then the subproblem at each iteration of TVAL3 becomes

min_{wi,u} LA(wi, u).   (2.34)

At the k-th iteration, solving (2.34) with respect to wi gives a closed-form solution since it is separable; i.e.,

wi,k+1 = max{ ‖Diuk − νi/βi‖ − 1/βi, 0 } · (Diuk − νi/βi) / ‖Diuk − νi/βi‖,   (2.35)

where the convention 0·(0/0) = 0 is followed. This formula is usually referred to as shrinkage (see [46] for

    example). On the other hand, (2.33) is quadratic with respect to u and its gradient

can be easily derived as

dk(u) = ∑i ( βiDiᵀ(Diu − wi,k+1) − Diᵀνi ) + µAᵀ(Au − b) − Aᵀλ.   (2.36)

According to Algorithm 2.2.1, we only require one step of steepest descent with a properly adjusted step length; i.e.,

uk+1 = uk − αkdk(uk).   (2.37)

Therefore, the TVAL3 algorithm for TV-regularized compressive sensing problems is obtained by incorporating (2.35), (2.36) and (2.37) into the general framework of Algorithm 2.3.1.
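To illustrate how (2.35), (2.36) and (2.37) fit together, the sketch below performs one inner iteration for the 2D model with a single penalty β and multipliers ν = (ν1, ν2), using forward differences with periodic boundaries for Di; the variable names, the boundary handling and the externally supplied step length alpha are illustrative choices rather than the actual TVAL3 implementation.

function u = tv_inner_step_sketch(u, A, b, nu1, nu2, lambda, beta, mu, alpha)
% One inner iteration of the 2D TV model (sketch, not the TVAL3 code):
% w-update by isotropic shrinkage (2.35), then one steepest descent step
% (2.37) along the gradient (2.36). u is an s-by-t image.
    Du1 = circshift(u, [-1 0]) - u;          % vertical forward differences
    Du2 = circshift(u, [0 -1]) - u;          % horizontal forward differences
    z1 = Du1 - nu1/beta;  z2 = Du2 - nu2/beta;
    nz = sqrt(z1.^2 + z2.^2);  nz(nz == 0) = 1;   % follow the 0*(0/0) = 0 convention
    s  = max(nz - 1/beta, 0) ./ nz;               % isotropic shrinkage (2.35)
    w1 = s.*z1;  w2 = s.*z2;
    DT = @(p, shift) circshift(p, shift) - p;     % adjoint of the forward difference
    g  = beta*(DT(Du1 - w1, [1 0]) + DT(Du2 - w2, [0 1])) ...
         - (DT(nu1, [1 0]) + DT(nu2, [0 1])) ...
         + reshape(A'*(mu*(A*u(:) - b) - lambda), size(u));   % gradient (2.36)
    u = u - alpha*g;                              % steepest descent step (2.37)
end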

    To demonstrate the efficiency of the TVAL3 implementation, it is compared to

    other state-of-the-art implementations of TV regularized methods, such as ℓ1-Magic

    [2, 3, 5], TwIST [85, 86] and NESTA [84].

    Experiments were performed on a Lenovo X301 laptop running Windows XP and

    MATLAB R2009a (32-bit) and equipped with a 1.4GHz Intel Core 2 Duo SU9400

    and 2GB of DDR3 memory.

While running TVAL3, we uniformly set the parameters η = .9995, ρ = 5/3, δ = 10⁻⁵ and ζ = 10⁴ appearing in Algorithm 2.2.1, initialized the multipliers to 0, and fixed the weights in front of the multipliers in Algorithm 2.3.1 at 1.6. Additionally, the


Figure 2.1: Recovered 64×64 phantom image from 30% orthonormal measurements without noise. Top-left: original image. Top-middle: reconstructed by TVAL3 (SNR 77.64 dB, CPU time 4.27 s). Top-right: reconstructed by TwIST (SNR 46.59 dB, CPU time 13.81 s). Bottom-middle: reconstructed by NESTA (SNR 34.18 dB, CPU time 24.35 s). Bottom-right: reconstructed by ℓ1-Magic (SNR 51.08 dB, CPU time 1558.29 s).

values of the penalty parameters may vary in the range 2⁵ to 2⁹, depending on the noise level and the required accuracy.

In an effort to make the comparisons fair, we also tuned the parameters of the other tested solvers mentioned above so that they performed optimally or nearly optimally.

    In the first test, a 64× 64 phantom image is encoded by an orthonormal random

    matrix generated by QR factorization from a Gaussian random matrix. The images

    are recovered by TVAL3, TwIST, NESTA and ℓ1-Magic respectively from 30% mea-

surements without additive noise. The quality of the recovered images is measured by

    the signal-to-noise ratio (SNR), which is defined as the power ratio between a signal

    and the background noise. All parameters are tuned to achieve the best performance.
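For reference, the SNR values reported here could be computed as the power ratio, in decibels, between the original image and the reconstruction error; the helper below is an assumed definition consistent with that description, not necessarily the exact formula used in these experiments.

% Assumed SNR definition (in dB): signal power over reconstruction-error power.
snr_db = @(u_true, u_rec) 10*log10( sum(u_true(:).^2) / ...
                                    sum((u_rec(:) - u_true(:)).^2) );
% usage sketch: snr_db(u_original, u_recovered)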

    From Figure 2.1, we observe that TVAL3 achieves the highest-quality image


Figure 2.2: Recovered 256 × 256 MR brain image. Both the measurement rate and the noise level are 10%. Top-left: original image. Top-right: reconstructed by TVAL3 (SNR 9.40 dB, CPU time 10.20 s). Bottom-left: reconstructed by TwIST (SNR 4.66 dB, CPU time 142.04 s). Bottom-right: reconstructed by NESTA (SNR 8.03 dB, CPU time 29.42 s).

(77.64dB) while requiring the shortest running time (4.27 seconds). The second highest-

    quality image (51.08dB) is recovered by ℓ1-Magic at the expense of the unacceptable

    running time (1558.29 seconds). TwIST and NESTA attain relatively medium-quality

    images (around 46.59dB and 34.18dB respectively) within reasonable running times

    (13.81 and 24.35 seconds respectively). This test suggests that TVAL3 is capable of

high accuracy within an affordable running time, and outperforms the other state-of-the-art implementations tested here.

    Noise is inevitable in practice. The following test focuses on the performance of

different implementations under the influence of Gaussian noise. Specifically, a 256 × 256 MR brain image, which contains many more details than the phantom, is encoded by a permuted sequency-ordered Walsh-Hadamard matrix applied via a fast transform. In order to investigate the robustness, we choose both the noise level and the measurement rate to be 10%. The phantom test above has indicated that ℓ1-Magic is hardly applicable to large-scale problems due to its low efficiency, so only TVAL3, TwIST and NESTA are compared here.
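Such measurements can be applied without forming the matrix explicitly; the sketch below shows one plausible way to build a matrix-free permuted Walsh-Hadamard measurement operator in MATLAB. It assumes the Signal Processing Toolbox function fwht, a signal length that is a power of two, and a random permutation plus random coefficient selection, which are illustrative choices rather than the exact setup of this experiment.

function y = wh_measure_sketch(u, p, rows)
% Matrix-free measurement: permute the pixels, take a fast sequency-ordered
% Walsh-Hadamard transform, and keep a subset of the coefficients.
% Requires fwht (Signal Processing Toolbox); length(u) should be a power of 2.
    c = fwht(u(p));          % fwht uses sequency ordering by default
    y = c(rows);             % keep the m selected coefficients
end

% usage sketch:
%   n = 256^2;  m = round(0.10*n);
%   p = randperm(n);  rows = sort(randperm(n, m));
%   b = wh_measure_sketch(x(:), p, rows);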

From Figure 2.2, we can only recognize a vague outline of the image recovered by TwIST, even though its running time is the longest. In contrast, the images recovered by TVAL3 and NESTA are more detailed and preserve more of the fine structure contained in the original brain image. In comparison with NESTA, TVAL3 achieves better accuracy (higher SNR) in a shorter running time, and provides higher contrast visually. For example, some gyri in the image recovered by TVAL3 are still distinguishable, which is not the case in the images recovered by TwIST or NESTA. Furthermore, the image recovered by NESTA is still noisy while the image recovered by TVAL3 is much cleaner. This indicates that TVAL3 achieves a stronger denoising effect than NESTA, a desirable property when handling data with heavy noise, which is common in practice.

Two tests are far from enough to draw a solid conclusion. More numerical experiments and analyses of different flavors are covered in [9], which gives a more comprehensive picture of the performance of TVAL3 on TV-regularized problems.

With moderate modifications, TVAL3 easily extends to some other TV-regularized models with extra requirements, for example, imposing nonnegativity constraints or dealing with complex signals/measurements. For the convenience of other researchers, it has been implemented in MATLAB to solve various TV-regularized models in the field of compressive sensing, and published at the following URL:


    http://www.caam.rice.edu/~optimization/L1/TVAL3/.


Chapter 3

    Hyperspectral Data Unmixing

In this chapter, we develop a hyperspectral unmixing scheme with the aid of compressive sensing. This scheme recovers the abundances and signatures directly from the compressed data instead of from the whole massive hyperspectral cube. In light of the general TVAL3 method discussed in Chapter 2, an effective and robust reconstruction algorithm is proposed and carefully investigated.

    3.1 Introduction to Hyperspectral Imaging

    By exploiting the wavelength composition of electromagnetic radiation (EMR), hy-

    perspectral imaging collects and processes data from across the electromagnetic spec-

    trum. Hyperspectral sensors capture information as a series of “images” over many

    contiguous spectral bands containing the visible, near-infrared and shortwave infrared

    spectral bands [98]. These images, generated from different bands, pile up and form

    a 3D hyperspectral cube for processing and further analysis. If each image can be

    viewed as a long vector, the hyperspectral cube will become a large matrix which

    is more easily accessible mathematically. Each column of the matrix records the in-


    formation from the same spectral band and each row records the information at the

    same pixel. For much of the past decade, hyperspectral imaging has been actively

    researched and widely developed. It has matured into one of the most powerful and

fastest-growing technologies. For example, the development of hyperspectral sensors

    and their corresponding software to analyze hyperspectral data has been regarded as

    a critical breakthrough in the field of remote sensing. Hyperspectral imaging has a

wide range of applications in industry, agriculture and the military, such as terrain clas-

    sification, mineral detection and exploration [87, 88], pharmaceutical counterfeiting

    [89], environmental monitoring [91] and military surveillance [90].
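Returning to the matrix view of the cube described above (each column a band image, each row a pixel spectrum), the reshaping amounts to a single MATLAB call; the variable names here are illustrative.

% cube: s-by-t-by-nb hyperspectral data; X: (s*t)-by-nb matrix form.
% Column j of X is the vectorized image of band j, and row i of X is the
% spectrum recorded at pixel i.
X = reshape(cube, [], size(cube, 3));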

The fundamental property that researchers want to obtain through hyperspectral imaging is spectral reflectance: the ratio of reflected energy to incident energy as a

    function of wavelength [97]. Reflectance varies with wavelength for most materi-

    als. These variations are evident and sometimes characteristic when comparing these

    spectral reflectance plots of different materials. Several libraries of reflectance spec-

    tra of natural and man-made materials are accessible for public use, such as ASTER

    Spectral Library [122] and USGS Spectral Library [123]. These libraries provide a

    source of reference spectra that helps the interpretation and analysis of hyperspectral

    images.

    It is highly possible that more than one material contributes to an individual

    spectrum captured by the sensor, which leads to a composite or mixed spectrum.

Typically, hyperspectral imaging has low spatial resolution, so that each pixel, from a given spatial element of resolution and at a given spectral band, is a mixture

    of several different material substances, termed endmembers, each possessing a char-

acteristic hyperspectral signature [99]. In general, endmembers refer to spectrally “pure” features, such as soil, vegetation, and so forth. In mineralogy, an endmember is a

    mineral at the extreme end of a mineral series in terms of purity. For example, al-


    bite (NaAlSi3O8) and anorthite (CaAl2Si2O8) are two endmembers in the plagioclase

    series of minerals.

    If the endmember spectra or signatures are available beforehand, we can mathe-

    matically decompose each pixel’s spectrum of a hyperspectral image to identify the

    relative abundance of each endmember component. This process is called unmixing.

    Linear unmixing is a simple spectral matching approach, whose underlying premise is

    that a relatively small number of common endmembers are involved in a scene, and

    most spectral variability in this scene can be attributed to spatial mixing of these

    endmember components in distinct proportions. In the linear model, interactions

    among distinct endmembers are assumed to be negligible [100], which is a plausi-

    ble hypothesis in the realm of hyperspectral imaging. Frequently, the representative

    endmembers for a given scene are known a priori and their signatures can be ob-

    tained from a spectral library (e.g., ASTER [122] and USGS [123]) or codebook. On

    the other hand, when endmembers are unknown but the hyperspectral data is fully

    accessible, many algorithms exist for determining endmembers in a scene, including

    N-FINDR [102], PPI (pixel purity index) [101], VCA (vertex component analysis)

    [103], SGA (simplex growing algorithm) [104]; NMF-MVT (nonnegative matrix fac-

    torization minimum volume transform) [105], SISAL (simplex identification via split

    augmented Lagrangian) [106], MVSA (minimum volume simplex analysis) [108] and

    MVES (minimum-volume enclosing simplex) [107].

Because of their enormous volume, it is particularly difficult to directly process

    and analyze hyperspectral data cubes in real time or near real time. On the other

    hand, hyperspectral data are highly compressible with two-fold compressibility:

    1. each spatial image is compressible, and

    2. the entire cube, when treated as a matrix, is of low rank.


To fully exploit such rich compressibility, a scheme is proposed in this chapter which never requires explicitly storing or processing the hyperspectral cube itself. In this scheme,

    data are acquired by means of compressive sensing (CS). As introduced in Chapter 1,

    the theory of CS shows that a sparse or compressible signal can be recovered from a

    relatively small number of linear measurements. In particular, the concept of the sin-

    gle pixel camera [32] can be extended to the acquisition of compressed hyperspectral

    data, which will be described and used while setting up the experiments. The main

    novelty of the scheme is in the decoding side where we combine data reconstruction

    and unmixing into a single step of much lower complexity. The proposed scheme is

    both computationally low-cost and memory-efficient. At this point, we start from

    the assumption that the involved endmember signatures are known and given, from

which we then directly compute abundance fractions. For brevity, we will call the proposed procedure the compressive sensing and unmixing (CSU) scheme.

In fact, a priori information is not always accessible or precise. For example, changes in the experimental environment may cause fluctuations in endmember reflectance and give rise to a signature slightly different from the one in the standard library. Without correct or complete a priori information, the unmixing problem becomes significantly more intractable. Later in this chapter, the CSU scheme is extended to

    blind unmixing where endmember signatures are not precisely known a priori.

    3.2 Compressive Sensing and Unmixing Scheme

    In this section, we propose and conduct a proof-of-concept study on a low-complexity,

    compressive sensing and unmixing (CSU) scheme, formulating a unmixing model

    based on total variation (TV) minimization, and developing an efficient algorithm

    to solve this model [109]. To validate the CSU scheme, experimental and numerical


    evidence will be provided in the next section. This proposed scheme directly unmixes

    compressively sensed data, bypassing the high-complexity step of reconstructing the

    hyperspectral cube itself. The effectiveness and efficiency of the proposed CSU scheme

are demonstrated using both synthetic and hardware-measured data.

    3.2.1 Problem Formulation

Let us first introduce the necessary notation. Suppose that in a given scene there

    exist ne significant endmembers, with spectral signatures wTi ∈ Rnb , for i = 1, . . . , ne,

    where nb ≥ ne denotes the number of spectral bands. Let xi ∈ Rnb represent the

    hyperspectral data vector at the i-th pixel and hTi ∈ Rne represent the abundance

    fractions of the endmembers for any i ∈ {1, . . . , np}, where np denotes the number of

pixels. Furthermore, let X = [x1, . . . , xnp]T ∈ Rnp×nb denote a matrix representing the

    hyperspectral cube, W = [w1, . . . , wne]T ∈ Rne×nb the mixing matrix containing the

    endmember spectral signatures, and H = [h1, . . . , hnp]T ∈ Rnp×ne a matrix holding

    the respective abundance fractions. We use A ∈ Rm×np to denote the measurement

    matrix in compressive sensing data acquisition, and F ∈ Rm×nb to denote the obser-

    vation matrix, where m < np is the number of samples for each spectral band. For

    convenience, 1s denotes the column vector of all ones with length s. In addition, we

    use 〈·, ·〉 to denote the usual matrix inner product since the notation (·)T (·) for vector

    inner product would not correctly apply.

    Assuming negligible interactions among endmembers, the hyperspectral vector xi

    at the i-th pixel can be regarded as a linear combination of the endmember spectral

    signatures, and the weights are gathered in a nonnegative abundance vector hi. Ide-

    ally, the components of hi, representing abundance fractions, should sum up to unity;

    i.e., the hyperspectral vectors lie in the convex hull of endmember spectral signatures


    [103]. In short, the data model has the form

    X = HW, H1ne = 1np, and H ≥ 0. (3.1)

    However, in reality the sum-to-unity condition on H does not usually hold due to

    imprecisions and noise of various kinds. In our implementation, we imposed this

    condition on synthetic data, but skipped it for measured data.

    Since each column of X represents a 2D image corresponding to a particular

    spectral band, we can collect the compressed hyperspectral data F ∈ Rm×nb by

    randomly sampling all the columns of X using the same measurement matrix A ∈

    Rm×np, where m < np is the number of samples for each column. Mathematically,

    the data acquisition model can be described as

    AX = F. (3.2)

    Combining (3.1) and (3.2), we obtain constraints

    AHW = F, H1ne = 1np, and H ≥ 0. (3.3)

For now, we assume that the endmember spectral signatures in W are known; our goal

    is to find their abundance distributions (or fractions) in H , given the measurement

    matrix A and the compressed hyperspectral data F . In general, system (3.3) is not

    sufficient for determining H , necessitating the use of some prior knowledge about H

    in order to find it.
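As a concrete illustration of the data and acquisition model (3.1)-(3.3), the snippet below generates a small synthetic instance; the sizes, the uniform random signatures and the Gaussian measurement matrix are arbitrary illustrative choices.

% Synthetic instance of AHW = F with H >= 0 and rows of H summing to one.
np = 64*64;  nb = 200;  ne = 4;  m = round(0.2*np);
W = rand(ne, nb);                  % endmember signatures (one per row)
H = rand(np, ne);
H = H ./ sum(H, 2);                % enforce H*1_ne = 1_np (sum-to-unity)
A = randn(m, np)/sqrt(m);          % compressive measurement matrix
X = H*W;                           % hyperspectral cube in matrix form (3.1)
F = A*X;                           % compressed observations, (3.2)-(3.3)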

    In compressive sensing, regularization by ℓ1 minimization has been widely used.

However, Chapter 1 has shown that the use of TV regularization is empirically more advantageous on image problems such as deblurring, denoising and reconstruction, since it better preserves edges or boundaries in images, which are

    essential characteristics. TV regularization puts emphasis on sparsity in the gradient

    map of the image and is suitable when the gradient of the underlying image is sparse

[2]. In our case, we assume that the image formed by the abundance fractions of each endmember is mostly and approximately piecewise constant, so that its gradient is sparse. This is reasonable in the sense that most applications of hyperspectral imaging focus on salient features (simply put, jumps) in a scene rather than on smooth regions. Mathematically, we propose to recover the abundance matrix

H by solving the following unmixing model:

min_{H∈Rnp×ne} ∑_{j=1}^{ne} TV(Hej)   s.t. AHW = F, H1ne = 1np, H ≥ 0,   (3.4)

where ej is the j-th standard unit vector in Rne,

TV(Hej) := ∑_{i=1}^{np} ‖Di(Hej)‖,   (3.5)

‖·‖ is the 2-norm in R², corresponding to the isotropic TV, and Di ∈ R2×np denotes the discrete gradient operator at the i-th pixel, as described in Chapter 2. Instead of the 2-norm, the 1-norm is also applicable here, corresponding to the anisotropic TV, which leads to a very similar analysis and derivation. Since the unmixing model directly uses the compressed data F, we will call it a compressed unmixing model.

    It is important to note that although H consists of several related images each

    corresponding to the distribution of abundance fractions of one material in a scene,

    these images generally do not share many common edges as in color images or some

    other vector-valued images. For example, a sudden decrease in one fraction can be

compensated by an increase in another while all the remaining fractions stay unchanged,


    indicating the occurrence of an edge in two but not all images inH . This phenomenon

    can be observed from the test cases in Section 3.3. Therefore, in our model (3.4),

    instead of applying a coupled TV regularization function for vector-valued images (see

    [17] and [18], for example), we simply use a sum of TV terms for individual scalar-

    valued images without coupling them in the TV regularization. It is possible that

    under certain conditions, the use of vector-valued TV is more appropriate, but this

    point is beyond the scope of this study. Nevertheless, the images in H are connected

through the constraint H1ne = 1np.

    3.2.2 SVD Preprocessing

    The size of the fidelity equation AHW = F in (3.3) is m×nb where m, although less

    than np in compressive sensing, can still be quite large, and nb, the number of spectral

    bands, typically ranges from hundreds to thousands. Here a preprocessing procedure

    is proposed based on singular value decomposition of the observation matrix F , in

    order to decrease the size of the fidelity equations from m× nb to m× ne. Since the

    number of endmembers ne is typically up to two orders of magnitude smaller than nb,

    the resulting reduction in complexity is significant, potentially enabling near-real-time

    processing speed. The proposed preprocessing procedure is based on the following

    result.

    Theorem 3.2.1. Let A ∈ Rm×np and W ∈ Rne×nb be full-rank, and F ∈ Rm×nb be

rank-ne with ne < min{nb, np, m}. Let F = UeΣeVeᵀ be the economy-size singular

    value decomposition of F where Σe ∈ Rne×ne is diagonal and positive definite, Ue ∈

    Rm×ne and Ve ∈ Rnb×ne both have orthonormal columns. Assume that rank(WVe) =

ne. Then the two linear systems below for H ∈ Rnp×ne have the same solution set; i.e.,


    the equivalence holds

    AHW = F ⇐⇒ AHWVe = UeΣe. (3.6)

Proof. We show that the two linear systems have an identical solution set. Denote the solution sets of the two systems by H1 = {H : AHW = F} and H2 = {H : AHWVe = UeΣe}, respectively; both are affine sets. Given that F = UeΣeVeᵀ and VeᵀVe = I, it is obvious that H1 ⊆ H2. To show H1 = H2, it suffices to verify

    that the dimensions of the two are equal, i.e., dim(H1) = dim(H2).

    Let “vec” denote the operator that stacks the columns of a matrix to form a vector.

By well-known properties of the Kronecker product “⊗”, AHW = F is equivalent to

    (W T ⊗ A) vecH = vecF, (3.7)

    where W T ⊗A ∈ R(nbm)×(nenp), and

    rank(W T ⊗A) = rank(W )rank(A) = nem. (3.8)

    Similarly, AHWVe = UeΣe is equivalent to

    ((WVe)T ⊗ A) vecH = vec(UeΣe), (3.9)

    where (WVe)T ⊗ A ∈ R(nem)×(nenp) and, under our assumption rank(WVe) = ne,

    rank((WVe)T ⊗A) = rank(WVe)rank(A) = nem. (3.10)

    Hence, rank(W T ⊗ A) = rank((WVe)T ⊗ A), which implies the solution sets of (3.7)


    and (3.9) have the same dimension; i.e., dim(H1) = dim(H2). Since H1 ⊆ H2, we

    conclude that H1 = H2.

This theorem ensures that under a mild condition the matrices W and F in

    the fidelity equation AHW = F can be replaced, without changing the solution set,

    by the much smaller matrices WVe and UeΣe, respectively, potentially leading to

    multi-order magnitude reductions in equation size.
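In code, the preprocessing amounts to a rank-ne (economy-size or truncated) SVD of F followed by replacing (W, F) with (WVe, UeΣe); the lines below, reusing the quantities from the synthetic example above, are an illustration of this reduction rather than part of any released solver.

% SVD preprocessing: shrink the fidelity equation AHW = F from m-by-nb
% to m-by-ne, cf. Theorem 3.2.1.
[Ue, Se, Ve] = svds(F, ne);     % rank-ne truncated SVD of F
W_red = W*Ve;                   % reduced mixing matrix, ne-by-ne
F_red = Ue*Se;                  % reduced observations,  m-by-ne
% any H with A*H*W = F also satisfies A*H*W_red = F_red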

    Suppose that F is an observation matrix for a rank-ne hyperspectral data matrix

    X̂ . Then F = AĤŴ for some full rank matrices Ĥ ∈ Rnp×ne and Ŵ ∈ Rne×nb.

    Clearly, the rows of Ŵ span the same space as the columns of Ve do. Therefore, the

    condition rank(WVe) = ne is equivalent to rank(WŴT ) = ne, which definitely holds

    for W = Ŵ . It will also hold for a random W with high probability. Indeed, the

    condition rank(WVe) = ne is rather mild.

    In practice, the observation matrix F usually contains model imprecisions or ran-

    dom noise, and hence is unlikely to be exactly rank ne. In this case, truncating the

    SVD of F to rank-ne is a sensible strategy, which will not only serve the dimension

    red