Special Session on Convex Optimization for System Identification



    Special Session on Convex Optimization for

    System Identification

Kristiaan Pelckmans, Johan A.K. Suykens

SysCon, Information Technology, Uppsala University, 75501, Sweden; KU Leuven, ESAT-SCD, B-3001 Leuven (Heverlee), Belgium

Abstract: This special session aims to survey, present new results and stimulate discussions on how to apply techniques of convex optimization in the context of system identification. The scope of this session includes but is not limited to linear and nonlinear modeling, model structure selection, block structured systems, regularization mechanisms (e.g. L1, sum-of-norms, nuclear norm), segmentation of time-series, trend filtering, optimal experiment design and others.

    1. INTRODUCTION

Insights in convex optimization continue to be a driving force for new formulations and methods of estimation and identification. The paramount incentive for phrasing methods of estimation in the format of a standard convex optimization problem has been the availability of efficient solvers, both from a theoretical as well as a practical perspective. From a conceptual perspective, convexity of a problem formulation ensures that no local minima exist. The work in convex optimization has now become a well-matured subject, to such an extent that researchers view the distinction between convex and non-convex problems as more profound than the distinction between linear and non-linear optimization problems. A survey of techniques of convex optimization, together with applications in estimation, is Boyd and Vandenberghe [2004].

Convex optimization has always maintained a close connection to systems theory and estimation problems. Main catalyzers of this synergy include the following:

(Combinatorial) Interest in convex approaches to efficiently solving complex optimization problems can be traced back to the max-flow min-cut theorem, essentially stating when a combinatorial problem can be solved as a convex linear programming one. This result has mainly impacted the field of Operations Research (OR), but the related so-called property of unimodularity of a matrix is currently seeing a revival in a context of machine learning and estimation. Another important result in this line of work has been the recent interest in Semi-Definite Programming (SDP) relaxations for NP-hard combinatorial problems. Standard references are Papadimitriou and Steiglitz [1998] and Schrijver [1998].

(LMI) An immediate predecessor of the solid body of work in convex optimization is the literature on Linear Matrix Inequalities (LMIs). This research has found a particularly rich application area in systems theory, where LMIs occur naturally in problems of stability and of automatic control. Standard works are Boyd et al. [1994] and Ben-Tal and Nemirovskii [2001].

(L1) Compressed Sensing or Compressive Sampling (CS) has led to vigorous research in applications of convex optimization in estimation and recovery problems. More specifically, the interest in sparse estimation problems has led to countless proposals of L1 norms in estimation problems, often stimulated by the promise that sparsity of (a transformation of) the unknowns has a natural interpretation in the specific application at hand. A main benefit of research in CS over earlier interest in the use of the L1 heuristic for recovery is the newly derived theoretical guarantees, a research area often attributed to Candes and Tao [2005]. Lately, much interest has been devoted to the study of the low-rank approximation problem Recht et al. [2010], where different heuristics were proposed to relax the combinatorial rank constraint. For extensions in the area of system identification see Recht et al. [2010], Liu and Vandenberghe [2009].

(Structure) Recent years have made it apparent that techniques of convex optimization can play yet another important role in identification and estimation problems. If the structure of the system underlying the observations can be imposed as constraints in the estimation problem, the set of possible solutions can be sharply reduced, leading in turn to better estimates. This view has been especially important in modeling of nonlinear phenomena, where the role of a parametric model is entirely replaced by structural constraints. Examples of such thinking are Support Vector Machines and other non-parametric methods. The former are also tied to convex optimization via another link: it is found that Lagrange duality (as in the theory of convex optimality) can lead to a systematic approach for introducing nonlinearity through the use of Mercer kernels.

(Design) The design of experiments as in the statistical sciences has always related closely to convex optimization. This is no different in a context of system identification, where the design of experiments now points to the design of an input sequence which properly excites all the modes of the system to be identified. One apparent reason why techniques of convex optimization are useful is that such experiments have to work in the allowed operation regions, that is,


constraints enter naturally in most cases. For related references, see e.g. Boyd and Vandenberghe [2004].

(Priors) A modern view is that methods based on the L1-norm, nuclear norm relaxation or imposing structure are examples of a larger picture, namely that such terms can be used to make the inverse problem well-posed. In other words, they fill in unknown pieces of information in the estimation problem by imposing or suggesting a prior structure. In general, an estimation problem can be greatly helped if one is able to suggest a good prior for completing the evidence given by data only. Such a prior can come in the form of a dictionary into which the solution fits nicely, or as a penalization term in the cost function which penalizes the occurrence of unusual phenomena in the solution. A statistical treatment of regularization is surveyed in Bickel et al. [2006]; regularization and priors in system identification are discussed in Ljung [1999], Ljung et al. [2011].

The last item suggests a definite way forward: namely, how techniques of convex optimization can be used to model appropriate priors in a context of system identification. We conjecture that the near future will witness such a shift of focus from parametric Box-Jenkins and state-space models to structural constraints and application-specific priors.

    REFERENCES

A. Ben-Tal and A.S. Nemirovskii. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. Society for Industrial Mathematics, 2001.

P.J. Bickel, B. Li, A.B. Tsybakov, S.A. van de Geer, B. Yu, T. Valdes, C. Rivero, J. Fan, and A. van der Vaart. Regularization in statistics. Test, 15(2):271-344, 2006.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. Society for Industrial Mathematics, 1994.

E.J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235-1256, 2009.

L. Ljung. System identification. Wiley Online Library, 1999.

L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5):449, 2011.

C.H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Dover, 1998.

B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

A. Schrijver. Theory of linear and integer programming. John Wiley & Sons Inc, 1998.


Convex optimization techniques in system identification

    Lieven Vandenberghe

Electrical Engineering Department, UCLA, Los Angeles, CA 90095 (Tel: 310-206-1259; e-mail: [email protected])

Abstract: In recent years there has been growing interest in convex optimization techniques for system identification and time series modeling. This interest is motivated by the success of convex methods for sparse optimization and rank minimization in signal processing, statistics, and machine learning, and by the development of new classes of algorithms for large-scale nondifferentiable convex optimization.

    1. INTRODUCTION

Low-dimensional model structure in identification problems is typically expressed in terms of matrix rank or sparsity of parameters. In optimization formulations this generally leads to non-convex constraints or objective functions. However, formulations based on convex penalties that indirectly minimize rank or maximize sparsity are often quite effective as heuristics, relaxations, or, in rare cases, exact reformulations. The best known example is ℓ1-norm regularization in sparse optimization, i.e., the use of the ℓ1-norm ‖x‖_1 in an optimization problem as a substitute for the cardinality (number of nonzero elements) of a vector x. This idea has a rich history in statistics, image and signal processing [Rudin et al., 1992, Tibshirani, 1996, Chen et al., 1999, Efron et al., 2004, Candes and Tao, 2007], and an extensive mathematical theory has been developed to explain when and why it works well [Donoho and Huo, 2001, Donoho and Tanner, 2005, Candes et al., 2006b, Candes and Tao, 2005, Candes et al., 2006a, Candes and Tao, 2006, Donoho, 2006, Tropp, 2006]. Several excellent surveys and tutorials on this topic are available; see for example [Romberg, 2008, Candes and Wakin, 2008, Elad, 2010].

The ℓ1-norm used in sparse optimization has a natural counterpart in the nuclear norm for matrix rank minimization. Here one uses the penalty function ‖X‖_*, where ‖·‖_* denotes the nuclear norm (the sum of the singular values), as a substitute for rank(X). Applications of nuclear norm methods in system theory and control were first explored by [Fazel, 2002, Fazel et al., 2004], and have recently gained in popularity in the wake of the success of ℓ1-norm techniques for sparse optimization [Recht et al., 2010]. Much of the recent work in this area has focused on the low-rank matrix completion problem [Candes and Recht, 2009, Candes and Plan, 2010, Candes and Tao, 2010, Mazumder et al., 2010], i.e., the problem of identifying a low-rank matrix from a subset of its entries. This problem has applications in collaborative prediction [Srebro et al., 2005] and multi-task learning [Pong et al., 2011]. Applications of nuclear norm methods in system identification are discussed in [Liu and Vandenberghe, 2009a, Grossmann et al., 2009, Mohan and Fazel, 2010, Gebraad et al., 2011, Fazel et al., 2011].

The ℓ1-norm and nuclear norm techniques can be extended in several interesting ways. The two types of penalties can be combined to promote sparse-plus-low-rank structure in matrices [Candes et al., 2011, Chandrasekaran et al., 2011]. Structured sparsity, such as group sparsity or hierarchical sparsity, can be induced by extensions of the ℓ1-norm penalty [Bach et al., 2012, Jenatton et al., 2011, Bach et al., 2011]. Finally, Chandrasekaran et al. [2010] and Bach [2010] describe systematic approaches for constructing convex penalties for different types of nonconvex structural constraints.

In this tutorial paper we discuss a few applications of convex methods for structured rank minimization and sparse optimization, in combination with classical ideas from system identification and signal processing. We focus on subspace algorithms for system identification and topology selection problems in graphical models. The second part of the paper (Section 4) provides a short survey of available convex optimization algorithms.

    2. SYSTEM IDENTIFICATION

Subspace methods in system identification and signal processing rely on singular value decompositions (SVDs) to make low-rank matrix approximations [Ljung, 1999]. The structure in the approximated matrices (for example, Hankel structure) is therefore lost during the low-rank approximation. A convex optimization formulation based on the nuclear norm penalty offers an interesting alternative, because it promotes low rank while preserving linear matrix structure. An additional benefit of an optimization formulation is the possibility of adding other convex regularization terms or constraints on the optimization variables.

As an illustration, consider the input-output equation used as a starting point in many subspace identification methods:

Y = OX + HU.

The matrices U and Y are block Hankel matrices constructed from a sequence of inputs u(t) and outputs y(t) of a state space model

x(t+1) = Ax(t) + Bu(t),   y(t) = Cx(t) + Du(t),


and the columns of X form a sequence of states x(t). The matrix H depends on the system matrices, and O is an extended observability matrix [Verhaegen and Verdult, 2007, p. 295]. A simple subspace method consists of forming the Hankel matrices U and Y and then projecting the rows of Y onto the nullspace of U. If the data are exact and a persistence of excitation assumption holds, the rank of the projected output matrix is equal to the system order, and from it a system realization is easily computed. When the input-output data are not exact, one can use a singular value decomposition of the projected output Hankel matrix to estimate the order and compute a system realization. However, as mentioned, this step destroys the Hankel structure in Y and U. The nuclear norm penalty, on the other hand, can be used as a convex heuristic for indirectly reducing the rank, while preserving linear structure. For example, if the inputs are exactly known and the measured outputs y_m(t) are subject to error, one can solve the convex problem

minimize   ‖Y Q‖_* + γ Σ_t ‖y(t) − y_m(t)‖_2²

where the columns of Q form a basis of the nullspace of U and γ is a positive weight. The optimization variables are the model outputs y(t), and the matrix Y is a Hankel matrix constructed from the model outputs y(t). This is a convex optimization problem that can be solved via semidefinite programming. We refer the reader to [Liu and Vandenberghe, 2009a,b] for more details and numerical results. As an important advantage, the optimization formulation can be extended to include convex constraints on the model outputs. Another promising application is identification with missing data [Ding et al., 2007, Grossmann et al., 2009].
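As a concrete illustration of the problem above (not taken from the paper), the following minimal Python/CVXPY sketch sets up the nuclear-norm objective for a scalar input-output sequence; the horizon T, the number of block rows r, the weight gamma and the data are all illustrative placeholders, and the Hankel structure is encoded with fixed 0/1 selector matrices so that the model-output Hankel matrix stays affine in the variables.

import numpy as np
import cvxpy as cp

def hankel_selectors(T, r):
    # 0/1 matrices E[t] such that sum_t v[t]*E[t] is the r x (T-r+1) Hankel matrix of v
    c = T - r + 1
    E = [np.zeros((r, c)) for _ in range(T)]
    for i in range(r):
        for j in range(c):
            E[i + j][i, j] = 1.0
    return E

T, r, gamma = 40, 6, 10.0
rng = np.random.default_rng(0)
u = rng.standard_normal(T)                      # known inputs (placeholder data)
y_m = rng.standard_normal(T)                    # measured, noisy outputs (placeholder data)

E = hankel_selectors(T, r)
U = sum(u[t] * E[t] for t in range(T))          # input Hankel matrix (numeric)
_, s, Vt = np.linalg.svd(U)                     # basis of the nullspace of U
Q = Vt[np.sum(s > 1e-10):].T

y = cp.Variable(T)                              # model outputs
Y = sum(y[t] * E[t] for t in range(T))          # Hankel matrix of the model outputs (affine in y)
obj = cp.normNuc(Y @ Q) + gamma * cp.sum_squares(y - y_m)
cp.Problem(cp.Minimize(obj)).solve()

This is only a prototype of the formulation; the specialized solvers cited in the text are needed for problems of realistic size.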

    3. GRAPHICAL MODELS

In a graphical model of a normal distribution x ∼ N(0, Σ), the edges in the graph represent the conditional dependence relations between the components of x. The vertices in the graph correspond to the components of x; the absence of an edge between vertices i and j indicates that x_i and x_j are independent, conditional on the other entries of x. Equivalently, vertices i and j are connected if there is a nonzero in the i, j position of the inverse covariance matrix Σ^{-1}.

A key problem in the estimation of the graphical model is the selection of the topology. Several authors have addressed this problem by adding an ℓ1-norm penalty to the maximum likelihood estimation problem, and solving

minimize   tr(CX) − log det X + γ ‖X‖_1.     (1)

Here X denotes the inverse covariance Σ^{-1}, the matrix C is the sample covariance matrix, and ‖X‖_1 = Σ_{ij} |X_{ij}|. See

[Meinshausen and Buhlmann, 2006, Banerjee et al., 2008, Ravikumar et al., 2008, Friedman et al., 2008, Lu, 2009, Scheinberg and Rish, 2009, Yuan and Lin, 2007, Duchi et al., 2008, Li and Toh, 2010, Scheinberg and Ma, 2012].
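As a small illustration (not from the paper), problem (1) can be prototyped directly with a general-purpose modeling package; in the Python/CVXPY sketch below, the sample covariance C and the weight gamma are placeholders.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 8))
C = np.cov(data, rowvar=False)                  # sample covariance (placeholder data)
gamma = 0.1

X = cp.Variable((8, 8), symmetric=True)         # estimate of the inverse covariance
obj = cp.trace(C @ X) - cp.log_det(X) + gamma * cp.sum(cp.abs(X))
cp.Problem(cp.Minimize(obj)).solve()
# (near-)zero off-diagonal entries of X.value indicate conditional independences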

Graphical models of the conditional independence relations can be extended to Gaussian vector time series [Brillinger, 1996, Dahlhaus, 2000]. In this extension the topology of the graph is determined by the sparsity pattern of the inverse spectral density matrix S(ω)^{-1}, where

S(ω) = Σ_{k=−∞}^{∞} R_k e^{−jkω},

with R_k = E x(t+k)x(t)^T.

Using this characterization, one can formulate extensions of the regularized maximum likelihood problem (1) to vector time series. In [Songsiri et al., 2010, Songsiri and Vandenberghe, 2010] autoregressive models

x(t) = Σ_{k=1}^{p} A_k x(t−k) + v(t),   v(t) ∼ N(0, Σ),

were considered, and convex formulations were presented for the problem of estimating the parameters A_k, Σ, subject to conditional independence constraints, and of estimating the topology via an ℓ1-norm type regularization. The topology selection problem leads to the following extension of (1):

minimize   tr(CX) − log det X_00 + γ h(X)
subject to   X ⪰ 0.     (2)

The variable X is a (p+1) × (p+1) block matrix with blocks of size n × n (the length of the vector x(t)), and X_00 is the leading block of X. The penalty h is chosen to encourage a common, symmetric sparsity pattern for the diagonal sums

Σ_{i=0}^{p−k} X_{i,i+k},   k = 0, 1, ..., p,

of the blocks in X. An extension to ARMA processes is studied by Avventiy et al. [2010].

    4. ALGORITHMS

For small and medium sized problems the applications discussed in the previous sections can be handled by general-purpose convex optimization solvers, such as the modeling packages CVX [Grant and Boyd, 2007] and YALMIP [Lofberg, 2004], and general-purpose conic optimization packages. In this section we discuss algorithmic approaches that are of interest for large problems that fall outside the scope of the general-purpose solvers.

    4.1 Interior-point algorithms

Interior-point algorithms are known to attain a high accuracy in a small number of iterations, fairly independent of problem data and dimensions. The main drawback is the high linear algebra complexity per iteration associated with solving the Newton equations that determine search directions. However, sometimes problem structure can be exploited to devise dedicated interior-point implementations that are significantly more efficient than general-purpose solvers.

A simple example is the ℓ1-norm approximation problem

minimize   ‖Ax − b‖_1


with A of size m × n. This can be formulated as a linear program (LP)

minimize   Σ_{i=1}^{m} y_i
subject to   [ A  −I ; −A  −I ] [ x ; y ] ≤ [ b ; −b ],

at the expense of introducing m auxiliary variables and 2m linear inequality constraints. By taking advantage of the structure in the inequalities, each iteration of an interior-point method for the LP can be reduced to solving linear systems A^T D A Δx = r, where D is a positive diagonal matrix. As a result, the complexity of solving the ℓ1-norm approximation problem using a custom interior-point solver is roughly the equivalent of a small number of weighted least-squares problems.
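As a quick check of the reformulation above (not from the paper), the following Python/CVXPY sketch states the LP with the auxiliary bound variables explicitly; A, b and the sizes are placeholders.

import numpy as np
import cvxpy as cp

m, n = 100, 20
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = cp.Variable(n)
y = cp.Variable(m)                         # auxiliary bound variables
constraints = [A @ x - b <= y, -(A @ x - b) <= y]
cp.Problem(cp.Minimize(cp.sum(y)), constraints).solve()
# the direct statement cp.Minimize(cp.norm(A @ x - b, 1)) attains the same optimal value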

A similar result holds for the nuclear norm approximation problem

minimize   ‖A(x) − B‖_*     (3)

where A(x) is a matrix-valued function of size p × q and x is an n-vector of variables. This problem can be formulated as a semidefinite program (SDP)

minimize   tr U + tr V
subject to   [ U            (A(x) − B)^T ]
             [ A(x) − B      V           ]  ⪰ 0     (4)

with variables x, U, V. The very large number of variables (O(p²) if we assume p ≥ q) makes the nuclear norm optimization problem very expensive to solve by general-purpose SDP solvers. A specialized interior-point solver for the SDP is described in [Liu and Vandenberghe, 2009a], with a linear algebra cost per iteration of O(n²pq) if n ≥ max{p, q}. This is comparable to solving the matrix approximation problem in the Frobenius norm, i.e., minimizing ‖A(x) − B‖_F, and the improvement makes it possible to solve nuclear norm problems with p and q on the order of several hundred by an interior-point method.
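For a small instance, the SDP form (4) can be written out directly; the Python/CVXPY sketch below is an illustration only (A(x) is taken affine in x, and all data and dimensions are placeholders).

import numpy as np
import cvxpy as cp

p, q, n = 6, 4, 3
rng = np.random.default_rng(0)
A_mats = [rng.standard_normal((p, q)) for _ in range(n)]   # A(x) = sum_i x_i * A_i
B = rng.standard_normal((p, q))

x = cp.Variable(n)
U = cp.Variable((q, q), symmetric=True)
V = cp.Variable((p, p), symmetric=True)
Ax = sum(x[i] * A_mats[i] for i in range(n))

block = cp.bmat([[U, (Ax - B).T], [Ax - B, V]])            # the PSD block constraint of (4)
cp.Problem(cp.Minimize(cp.trace(U) + cp.trace(V)), [block >> 0]).solve()
# the same minimizing x is obtained directly from cp.Minimize(cp.normNuc(Ax - B))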

We refer the reader to the book chapter [Andersen et al., 2012] for additional examples of special-purpose interior-point algorithms.

    4.2 Nonlinear optimization methods

Burer and Monteiro [2003, 2005] have developed a large-scale method for semidefinite programming, based on substituting a low-rank factorization for the matrix variable and solving the resulting nonconvex problem by an augmented Lagrangian method. Adapted to the SDP (4), the method amounts to reformulating the problem as

minimize   ‖L‖_F² + ‖R‖_F²
subject to   A(x) − B = L R^T     (5)

with variables x, L ∈ R^{p×r}, R ∈ R^{q×r}, where r is an upper bound on the rank of A(x) − B at the optimum. Recht et al. [2010] discuss Burer and Monteiro's method in detail in the context of nuclear norm optimization.

    4.3 Proximal gradient algorithms

The proximal gradient algorithm is an extension of the gradient algorithm to problems with simple constraints or with simple nondifferentiable terms in the cost function. It is less general than the subgradient algorithm, but it is typically much faster and it handles many types of nondifferentiable problems that occur in practice.

The proximal gradient algorithm applies to a convex problem of the form

minimize   f(x) = g(x) + h(x),     (6)

in which the cost function f is split in two components g and h, with g differentiable and h a simple nondifferentiable function. Simple here means that the prox-operator of h, defined as the mapping

prox_{th}(x) = argmin_u ( h(u) + (1/2t) ‖u − x‖_2² )

(with t > 0), is inexpensive to compute. It can be shown that if h is closed and convex, then prox_{th}(x) exists and is unique for every x.

A typical example is h(x) = ‖x‖_1. Its prox-operator is the element-wise soft-thresholding

(prox_{th}(x))_i =  x_i − t   if x_i ≥ t
                    0         if −t ≤ x_i ≤ t
                    x_i + t   if x_i ≤ −t.

Constrained optimization problems

minimize   g(x)
subject to   x ∈ C

can be brought in the form (6) by defining h(x) = I_C(x), the indicator function of C (i.e., I_C(x) = 0 if x ∈ C and I_C(x) = +∞ if x ∉ C). The prox-operator for I_C is the Euclidean projection on C. Prox-operators share many of the properties of Euclidean projections on closed convex sets. For example, they are nonexpansive, i.e.,

‖prox_{th}(x) − prox_{th}(y)‖_2 ≤ ‖x − y‖_2

for all x, y. (See Moreau [1965].)

The proximal gradient method for minimizing (6) uses the iteration

x^+ = prox_{th}( x − t ∇g(x) )

where t > 0 is a step size. The proximal gradient update consists of a standard gradient step for the differentiable term g, followed by an application of the prox-operator associated with the non-differentiable term h. It can be motivated by noting that x^+ is the minimizer of the function

h(y) + g(x) + ∇g(x)^T (y − x) + (1/2t) ‖y − x‖_2²

over y, so x^+ minimizes an approximation of f, obtained by adding to h a simple local quadratic model of g.
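The iteration is only a few lines of code. The following minimal numpy sketch (not from the paper) applies it to the illustrative choices g(x) = ‖Ax − b‖_2² and h(x) = λ‖x‖_1, using the soft-thresholding prox-operator described above; A, b, λ and the iteration count are placeholders.

import numpy as np

def soft_threshold(x, t):
    # prox of t*||.||_1: elementwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 100))
b = rng.standard_normal(60)
lam = 0.5
L = 2 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad g(x) = 2*A^T(Ax - b)
t = 1.0 / L                            # fixed step size

x = np.zeros(100)
for k in range(500):
    grad = 2 * A.T @ (A @ x - b)                  # gradient step on the smooth term g
    x = soft_threshold(x - t * grad, t * lam)     # prox step on the nonsmooth term h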

It can be shown that if ∇g is Lipschitz continuous with constant L, then the suboptimality f(x^(k)) − f^* decreases to zero as O(1/k) [Nesterov, 2004, Beck and Teboulle, 2009]. Recently, faster variants of the proximal gradient


method with a 1/k² rate of convergence, under the same assumptions and with the same complexity per step, have been developed [Nesterov, 2004, 2005, Beck and Teboulle, 2009, Tseng, 2008, Becker et al., 2011].

The (accelerated) proximal gradient methods are well suited for problems of the form

minimize   g(x) + ‖x‖

where g is differentiable with a Lipschitz-continuous gradient. Most common norms have easily computed prox-operators, and the following property is useful when computing the prox-operator of a norm h(x) = ‖x‖:

prox_{th}(x) = x − t P_B(x/t),

where P_B is Euclidean projection on the unit ball B in the dual norm.
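As a tiny numerical check of this identity (not from the paper), for h = ‖·‖_1 the dual norm is the ℓ∞-norm and projection onto its unit ball is elementwise clipping, so the formula reproduces soft-thresholding:

import numpy as np

def prox_l1_via_projection(x, t):
    # x - t * P_B(x/t), with B the l-infinity unit ball (elementwise clipping)
    return x - t * np.clip(x / t, -1.0, 1.0)

x = np.array([-2.0, -0.3, 0.0, 0.7, 3.0]); t = 1.0
direct = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)     # soft-thresholding
assert np.allclose(prox_l1_via_projection(x, t), direct)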

In other applications it is advantageous to apply the proximal gradient method to the dual problem. Consider for example an optimization problem

minimize   f(x) + ‖Ax − b‖

with f strongly convex. Reformulating this problem as

minimize   f(x) + ‖y‖
subject to   y = Ax − b     (7)

and taking the Lagrange dual gives

maximize   −b^T z − f^*(A^T z)
subject to   ‖z‖_d ≤ 1

where f^*(u) = sup_x (u^T x − f(x)) is the conjugate of f and ‖·‖_d is the dual norm of ‖·‖. It can be shown that if f is strongly convex, then f^* is differentiable with a Lipschitz continuous gradient. If projection on the unit ball of the dual norm is inexpensive, the dual problem is therefore readily solved by a fast gradient projection method.

An extensive library of fast proximal-type algorithms is available in the MATLAB software package TFOCS [Becker et al., 2010].

    4.4 ADMM

The Alternating Direction Method of Multipliers (ADMM) was proposed in the 1970s as a simplified version of the augmented Lagrangian method. It is a simple and often very effective method for large-scale or distributed optimization, and has recently been applied successfully to the regularized covariance selection problem mentioned above [Scheinberg et al., 2010, Scheinberg and Ma, 2012]. The recent survey by Boyd et al. [2011] gives an overview of the theory and applications of ADMM. Here we limit ourselves to a description of the method when applied to a problem of the form (7). The ADMM iteration consists of two alternating minimization steps (over x and y) of the augmented Lagrangian

L(x, y, z) = f(x) + ‖y‖ + z^T (y − Ax + b) + (t/2) ‖y − Ax + b‖_2²,

followed by an update

z := z + t (y − Ax + b)

of the dual variable z. The complexity of minimizing over x depends on the properties of f. If f is quadratic, for example, it reduces to a least-squares problem. The minimization of the augmented Lagrangian over y reduces to the evaluation of the prox-operator of the norm ‖·‖.
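The following minimal numpy sketch (not from the paper) carries out this ADMM iteration for the illustrative choices f(x) = ‖x − c‖_2² and ‖·‖ = ‖·‖_1; A, b, c, the penalty t and the iteration count are placeholders.

import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

rng = np.random.default_rng(0)
m, n = 30, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
t = 1.0                                          # ADMM penalty parameter

x = np.zeros(n); y = np.zeros(m); z = np.zeros(m)
M = np.linalg.inv(2 * np.eye(n) + t * A.T @ A)   # f quadratic: the x-update is a linear solve
for k in range(200):
    x = M @ (2 * c + A.T @ z + t * A.T @ (y + b))      # minimize the augmented Lagrangian over x
    y = soft_threshold(A @ x - b - z / t, 1.0 / t)     # minimize over y (prox of the l1 norm)
    z = z + t * (y - A @ x + b)                        # dual update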

A numerical comparison of the ADMM and proximal gradient algorithms for nuclear norm minimization can be found in the recent paper by Fazel et al. [2011].

    5. SUMMARY

Advances in algorithms for large-scale nondifferentiable convex optimization are leading to a greater role of convex optimization in system identification and time series modeling. These techniques are based on formulations that incorporate convex penalty functions that promote low-dimensional model structure (such as sparsity or rank). Similar techniques have been used extensively in signal processing, image processing, and machine learning. While at this point theoretical results that characterize the success of these convex heuristics in system identification are limited, the extensive theory that supports ℓ1-norm techniques in sparse optimization gives hope that progress can be made in our understanding of similar techniques for system identification as well.

    ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grants No. ECCS-0824003 and ECCS-1128817. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

    REFERENCES

M. S. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe. Interior-point methods for large-scale cone programming. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 55-83. MIT Press, 2012.

E. Avventiy, A. Lindquist, and B. Wahlberg. Graphical models of autoregressive moving-average processes. In The 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS 2010), July 2010.

F. Bach. Structured sparsity-inducing norms through submodular functions. 2010. Available from arxiv.org/abs/1008.4220.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2011.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 19-53. MIT Press, 2012.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.


S. Becker, E. J. Candes, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. 2010. arxiv.org/abs/1009.2065.

S. Becker, J. Bobin, and E. Candes. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1-39, 2011.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

D. R. Brillinger. Remarks concerning graphical models for time series and point processes. Revista de Econometria, 16:1-23, 1996.

S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (Series B), 95(2), 2003.

S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming (Series A), 103(3), 2005.

E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

E. J. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.

E. J. Candes and T. Tao. Near-optimal signal recovery from random projections and universal encoding strategies. IEEE Transactions on Information Theory, 52(12), 2006.

E. J. Candes and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.

E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, 2008.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006a.

E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006b.

E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.

V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. 2010. arXiv:1012.0621v1.

V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33-61, 1999.

R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157-172, 2000.

T. Ding, M. Sznaier, and O. Camps. A rank minimization approach to fast dynamic event detection and track matching in video sequences. In Proceedings of the 46th IEEE Conference on Decision and Control, 2007.

D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845-2862, 2001.

D. L. Donoho and J. Tanner. Sparse nonnegative solutions of underdetermined systems by linear programming. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9446-9451, 2005.

J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in AI, 2008.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407-499, 2004.

M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, pages 3273-3278, 2004.

M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. 2011. Submitted.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.

P. M. O. Gebraad, J. W. van Wingerden, G. J. van der Veen, and M. Verhaegen. LPV subspace identification using a novel nuclear norm regularization method. In Proceedings of the American Control Conference, pages 165-170, 2011.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page and software). http://stanford.edu/~boyd/cvx, 2007.

C. Grossmann, C. N. Jones, and M. Morari. System identification via nuclear norm regularization for simulated bed processes from incomplete data sets. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4692-4697, 2009.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297-2334, 2011.

L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291-315, 2010.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31:1235-1256, 2009a.

Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4676-4681, 2009b.


L. Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, New Jersey, second edition, 1999.

J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.

Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.

R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287-2322, 2010.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.

K. Mohan and M. Fazel. Reweighted nuclear norm minimization with application to system identification. In Proceedings of the American Control Conference (ACC), pages 2953-2959, 2010.

J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Math. Soc. France, 93:273-299, 1965.

Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.

Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming Series A, 103:127-152, 2005.

T. K. Pong, P. Tseng, Shuiwang Ji, and J. Ye. Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465-3489, 2011.

R. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence, 2008. arxiv.org/abs/0811.3628.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

J. Romberg. Imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):14-20, 2008.

L. Rudin, S. J. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259-268, 1992.

K. Scheinberg and S. Ma. Optimization methods for sparse inverse covariance selection. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 455-477. MIT Press, 2012.

K. Scheinberg and I. Rish. SINCO - a greedy coordinate ascent method for the sparse inverse covariance selection problem. Technical report, 2009. IBM Research Report.

K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2101-2109. 2010.

J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 11:2671-2705, 2010.

J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. In Y. Eldar and D. Palomar, editors, Convex Optimization in Signal Processing and Communications, pages 89-116. Cambridge University Press, Cambridge, 2010.

N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329-1336. MIT Press, Cambridge, MA, 2005.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267-288, 1996.

J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.

P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.

M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University Press, 2007.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19, 2007.


Distributed Change Detection *

Henrik Ohlsson, Tianshi Chen

    Sina Khoshfetrat Pakazad

    Lennart Ljung

    S. Shankar Sastry

Division of Automatic Control, Department of Electrical Engineering, Linköping University, Sweden, e-mail: [email protected].

Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, CA, USA.

Abstract: Change detection has traditionally been seen as a centralized problem. Many change detection problems are however distributed in nature, and the need for distributed change detection algorithms is therefore significant. In this paper a distributed change detection algorithm is proposed. The change detection problem is first formulated as a convex optimization problem and then solved distributively with the alternating direction method of multipliers (ADMM). To further reduce the computational burden on each sensor, a homotopy solution is also derived. The proposed method has interesting connections with the Lasso and compressed sensing, and the theory developed for these methods is therefore directly applicable.

    1. INTRODUCTION

The change detection problem is often thought of as a centralized problem. Many scenarios are however distributed and lack a central node, or require distributed processing. A practical example is a sensor network. It may be vulnerable to select one of the sensors as a central node. Moreover, it may be preferable if a failing sensor can be detected in a distributed manner. Another practical example is the monitoring of a fleet of agents (airplanes/UAVs/robots) of the same type, see e.g., Chu et al. [2011]. The problem is how to detect if one or more agents start deviating from the rest. Theoretically, this can be done straightforwardly in a centralized manner. The centralized solution, however, poses many difficulties in practice. For instance, the communication between the central monitor and the agents, and the computation capacity and speed of the central monitor, are highly demanding due to the large number of agents in the fleet and/or the extremely large data sets to be processed, Chu et al. [2011]. Therefore, it is desirable to deal with the change detection problem in a distributed way.

In a distributed setting, there will be no central node. Each sensor or agent makes use of measurements from itself and the other sensors or agents to detect if it has failed or not. To tackle the problem, we first formulate the change detection problem as a convex optimization problem. We then solve the problem in a distributed manner using the so-called alternating direction method of multipliers (ADMM, see for instance Boyd et al. [2011]). The optimization problem turns out to have connections with the Lasso [Tibshirani, 1996] and compressive sensing [Candes et al., 2006, Donoho, 2006], and the theory developed for these methods is therefore applicable. To further reduce the computational burden on each sensor, a homotopy solution (see e.g., Garrigues and El Ghaoui [2008]) is also studied. Finally, we show the effectiveness of the proposed method by a numerical example.

* Ohlsson, Chen and Ljung are partially supported by the Swedish Research Council in the Linnaeus center CADICS and by the European Research Council under the advanced grant LEARN, contract 267381. Ohlsson is also supported by a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund, and by a postdoctoral grant from the Swedish Research Council.

    2. PROBLEM FORMULATION

The basic idea of the proposed method is to use system identification, in a distributed manner, to obtain a nominal model for the sensors or agents and then detect whether one or more sensors or agents start deviating from this nominal model.

To set up the notation, assume that we are given a sensor network consisting of N sensors. Denote the measurement from sensor i at time t by y_i(t) and assume that there is a linear relation of the form

y_i(t) = φ_i^T(t) θ + e_i(t),     (1)

describing the relation between the measurable quantity y_i(t) ∈ R^n and the known quantity φ_i^T(t) ∈ R^{n×m}. We will call θ ∈ R^m the state. The state is related to the sensor reading through φ_i(t). e_i(t) ∈ R^n is the measurement noise and is assumed to be white Gaussian with zero mean and variance σ_i². Moreover, e_i(t) is assumed independent of e_j(t), for all i = 1, ..., N and j = 1, ..., i−1, i+1, ..., N. At time t it is assumed that sensor i obtains y_i(t) and knows φ_i(t).

The problem is now, in a distributed manner, to detect a failing sensor. That is, detect if the relation (1) is no longer valid.

Remark 2.1. (Time varying state). A dynamical equation or a random walk type of description for the state can be incorporated. This is straightforward but for the sake of clarity and due to page limitations, this is not shown here. Actually, the only restriction is that θ does not vary over sensors (that is, it is not dependent on i).


Remark 2.2. (Partially observed state). Note that (1) does not imply that the sensors need to measure all elements of θ. Some sensors can observe some parts and other sensors other parts.

Remark 2.3. (Time-varying network topology). That sensors are added and taken away from the network is a very realistic scenario. We will assume that N is the maximum number of sensors in the network and set φ_i(t) = 0 if sensor i is not present at time t.

Remark 2.4. (Multidimensional y_i(t)). For notational simplicity, from now on, we have chosen to let y_i(t) ∈ R. However, the extension to multidimensional y_i(t) is straightforward.

Remark 2.5. (Distributed system identification). The proposed algorithm could also be seen as a robust distributed system identification scheme. The algorithm computes, in a distributed manner, a nominal model using observations from several systems and is robust to systems deviating from the majority.

A straightforward way to solve the distributed change detection problem as posed here would be to

(1) locally, at each sensor, estimate θ,
(2) broadcast the estimates and the error covariances,
(3) at each sensor, fuse the estimates,
(4) at each sensor, use a likelihood ratio test to detect a failing sensor (see e.g., Willsky and Jones [1976], Willsky [1976]).

This method will work fine as long as the number of measurements available at each sensor well exceeds m. Let us say that {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, ..., N. It is hence required that T_1, T_2, ..., T_N ≫ m. If m > T_i for some i = 1, ..., N, the method will however fail. That m > T_i for some i = 1, ..., N is a very realistic scenario. T_i may for example be very small if new sensors may be added to the network at any time. The case T_1, T_2, ..., T_N ≫ m was previously discussed in Chu et al. [2011].

Remark 2.6. (Sending all data). One may also consider to broadcast data and solve the full problem on each sensor. Sending all data available at time T may however be too much. Sensor i would then have to broadcast {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T}.

    3. BACKGROUND

Change detection has a long history (see e.g., Gustafsson [2001], Patton et al. [1989], Basseville and Nikiforov [1993] and references therein) but has traditionally been seen as a centralized problem. The literature on distributed or decentralized change detection is therefore rather small, and only standard methods such as CUSUM and the generalized likelihood ratio (GLR) test have been discussed and extended to distributed scenarios (see e.g., Tartakovsky and Veeravalli [2002, 2003]). The method proposed here has certainly a strong connection to GLR (see for instance Ohlsson et al. [2012]) and an extensive comparison is seen as future work.

The change detection algorithm proposed also has connections to compressive sensing and ℓ1-minimization. There are several comprehensive review papers that cover the literature of compressive sensing and related optimization techniques in linear programming. The reader is referred to the works of Candes and Wakin [2008], Bruckstein et al. [2009], Loris [2009], Yang et al. [2010].

4. PROPOSED METHOD – BATCH SOLUTION

Assume that the data set {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, ..., N. Since (1) is assumed to hold for a functioning sensor, we would like to detect a failing sensor by checking if its likelihood falls below some threshold. What complicates the problem is that:

• θ is unknown,
• m > T_i for some i = 1, ..., N, typically.

We first solve the problem in a centralized setting.

    4.1 Centralized Solution

Introduce θ_i for the state of sensor i, i = 1, ..., N. Assume that we know that k sensors have failed. The maximum likelihood (ML) solution for θ_i, i = 1, ..., N (taking into account that N − k sensors have the same state) can then be computed by

min_{θ_1,...,θ_N, θ}   Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i}     (2a)
subj. to   ‖ [ ‖θ_1 − θ‖_p   ‖θ_2 − θ‖_p   ...   ‖θ_N − θ‖_p ] ‖_0 = k,     (2b)

with ‖·‖_p being the p-norm, p ≥ 1, and ‖·‖²_{σ_i} defined as ‖·/σ_i‖². The k failing sensors could now be identified as the sensors for which ‖θ_i − θ‖_p ≠ 0. It follows from basic optimization theory (see for instance Boyd and Vandenberghe [2004]) that there exists a λ > 0 such that

min_{θ_1,...,θ_N, θ}   Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i} + λ ‖ [ ‖θ_1 − θ‖_p   ‖θ_2 − θ‖_p   ...   ‖θ_N − θ‖_p ] ‖_0,     (3)

gives exactly the same estimates for θ_1, ..., θ_N, θ as (2). However, both (2) and (3) are non-convex and combinatorial, and in practice unsolvable.

What makes (3) non-convex is the second term. It has recently become popular to approximate the zero-norm by its convex envelope, that is, to replace the zero-norm by the one-norm. This is in line with the reasoning behind the Lasso [Tibshirani, 1996] and compressed sensing [Candes et al., 2006, Donoho, 2006]. Relaxing the zero-norm by replacing it with the one-norm leads to the convex criterion

min_{θ, θ_1,...,θ_N}   Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i} + λ Σ_{i=1}^{N} ‖θ_i − θ‖_p.     (4)

θ should be interpreted as the nominal model. Most sensors will have data that can be explained by the nominal model θ, and the criterion (4) will therefore give θ_i = θ for most i's. However, failing sensors will generate a data sequence that could not have been generated by the nominal model represented by θ, and for these sensors, (4) will give θ_i ≠ θ.

In (4), λ regulates the trade-off between misfit to the observations and the deviation from the nominal model θ. In practice, a large λ will make us less sensitive to noise but may also imply that we fail to detect a deviating


sensor. However, a too small λ may generate false alarms in a noisy environment. λ should be seen as an application-dependent design parameter. The estimates of (2) and (3) are indifferent to the choice of p. The estimate of (4) is not, however. In general, p = 1 is a good choice if one is interested in detecting changes in individual elements of the sensors' or agents' states. If one only cares about detecting whether a sensor or agent is failing, p > 1 is a better choice.

What is remarkable is that under some conditions on φ_i(t) and the number of failures, the criterion (4) will work exactly as well as (2). That is, (4) and (2) will pick out exactly the same sensors as failing sensors. To examine when this happens, theory developed in compressive sensing can be used. This is not discussed here but is a good future research direction.
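As an illustration of the relaxed criterion (4) (not from the paper), the following Python/CVXPY sketch solves (4) for p = 2 on synthetic data; all dimensions, the weight lam and the "failing sensor" are illustrative placeholders and σ_i = 1 for simplicity.

import numpy as np
import cvxpy as cp

N, m, Ti, lam = 10, 3, 5, 2.0
rng = np.random.default_rng(0)
theta_true = rng.standard_normal(m)
Phi = [rng.standard_normal((Ti, m)) for _ in range(N)]
Y = [Phi[i] @ theta_true + 0.05 * rng.standard_normal(Ti) for i in range(N)]
Y[3] = Phi[3] @ (theta_true + 2.0) + 0.05 * rng.standard_normal(Ti)   # sensor 3 deviates

theta = cp.Variable(m)
thetas = [cp.Variable(m) for _ in range(N)]
fit = sum(cp.sum_squares(Y[i] - Phi[i] @ thetas[i]) for i in range(N))
dev = sum(cp.norm(thetas[i] - theta, 2) for i in range(N))
cp.Problem(cp.Minimize(fit + lam * dev)).solve()
failing = [i for i in range(N)
           if np.linalg.norm(thetas[i].value - theta.value) > 1e-3]   # sensors with theta_i != theta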

    4.2 Distributed Solution

Let us now apply ADMM (see e.g., Boyd et al. [2011], [Bertsekas and Tsitsiklis, 1997, Sec. 3.4]) to solve the identification problem in a distributed manner. First let

Y_i = [ y_i(T−T_i+1) ; y_i(T−T_i+2) ; ... ; y_i(T) ],   Φ_i = [ φ_i^T(T−T_i+1) ; φ_i^T(T−T_i+2) ; ... ; φ_i^T(T) ].     (5)

The optimization problem (4) can then be written as

min_{θ, θ_1, ζ_1, ..., θ_N, ζ_N}   Σ_{i=1}^{N} ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖θ_i − ζ_i‖_p,     (6a)
subj. to   ζ_i − θ = 0,   i = 1, ..., N.     (6b)

Let x^T = [ θ_1^T ... θ_N^T  ζ_1^T ... ζ_N^T ], and let u^T = [ u_1^T u_2^T ... u_N^T ] be the Lagrange multiplier vector, with u_i the Lagrange multiplier associated with the ith constraint ζ_i − θ = 0, i = 1, ..., N. So the augmented Lagrangian takes the following form

L(θ, x, u) = Σ_{i=1}^{N} ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖θ_i − ζ_i‖_p + u_i^T (ζ_i − θ) + (ρ/2) ‖ζ_i − θ‖².     (7)

ADMM consists of the following update rules

x^{k+1} = argmin_x L(θ^k, x, u^k)     (8a)
θ^{k+1} = (1/N) Σ_{i=1}^{N} ( ζ_i^{k+1} + (1/ρ) u_i^k )     (8b)
u_i^{k+1} = u_i^k + ρ ( ζ_i^{k+1} − θ^{k+1} ),   for i = 1, ..., N.     (8c)

Remark 4.1. It should be noted that given θ^k, u^k, the criterion L(θ^k, x, u^k) in (8a) is separable in terms of the pairs θ_i, ζ_i, i = 1, ..., N. Therefore, the optimization can be done separately, for each i, as

θ_i^{k+1}, ζ_i^{k+1} = argmin_{θ_i, ζ_i}   ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖θ_i − ζ_i‖_p + (u_i^k)^T (ζ_i − θ^k) + (ρ/2) ‖ζ_i − θ^k‖².     (9)

Remark 4.2. (Boyd et al. [2011]). It is interesting to note that no matter what u^1 is,

Σ_{i=1}^{N} u_i^k = 0,   k ≥ 2.     (10)

To show (10), first note that

Σ_{i=1}^{N} u_i^{k+1} = Σ_{i=1}^{N} u_i^k + ρ ( Σ_{i=1}^{N} ζ_i^{k+1} − N θ^{k+1} ),   k ≥ 1.     (11)

Inserting θ^{k+1} into the above equation yields (10). So without loss of generality, further assume

Σ_{i=1}^{N} u_i^1 = 0.     (12)

Then the update on θ reduces to

θ^{k+1} = (1/N) Σ_{i=1}^{N} ζ_i^{k+1},   k ≥ 1.     (13)

As a result, in order to implement the ADMM in a distributed manner, each sensor or system i should follow the steps below (a simulation sketch of these steps is given after the list).

(1) Initialization: set θ^1, u_i^1 and ρ.
(2) θ_i^{k+1}, ζ_i^{k+1} = argmin_{θ_i, ζ_i} L(θ^k, x, u^k).
(3) Broadcast ζ_i^{k+1} to the other systems (sensors) j = 1, ..., i−1, i+1, ..., N.
(4) θ^{k+1} = (1/N) Σ_{i=1}^{N} ζ_i^{k+1}.
(5) u_i^{k+1} = u_i^k + ρ ( ζ_i^{k+1} − θ^{k+1} ).
(6) If not converged, set k = k+1 and return to step 2.

To show that ADMM gives

• θ^k − ζ_i^k → 0 as k → ∞, i = 1, ..., N,
• Σ_{i=1}^{N} ‖Y_i − Φ_i θ_i^k‖²_{σ_i} + λ ‖θ_i^k − ζ_i^k‖_p → p*, where p* is the optimal objective of (4),

it is sufficient to show that the Lagrangian L_0(θ, x, u) (the augmented Lagrangian evaluated at ρ = 0) has a saddle point according to [Boyd et al., 2011, Sect. 3.2.1] (since the objective consists of closed, proper and convex functions). Let θ*, x* denote the solution of (4). It is easy to show that θ*, x* and u = 0 is a saddle point. Since L_0(θ, x, 0) is convex,

L_0(θ*, x*, 0) ≤ L_0(θ, x, 0)   for all θ, x,     (14)

and since L_0(θ*, x*, 0) = L_0(θ*, x*, u) for all u, the point θ*, x*, u = 0 must be a saddle point. ADMM hence converges to the solution of (4) in the sense listed above.

5. PROPOSED METHOD – RECURSIVE SOLUTION

To apply the above batch method to a scenario where we continuously get new measurements, we propose to re-run the batch algorithm every T̄th sample time:

(1) Initialize by running the batch algorithm proposed in the previous section on the data available.
(2) Every T̄th time-step, re-run the batch algorithm using the last sT̄, s ∈ N, data. Initialize the ADMM iterations using the estimates of θ and u from the previous run. Considering the fact that faults occur rarely over time, the optimal solutions for different data batches are often similar. As a result, using the estimates of θ and u from the previous run to initialize the ADMM algorithm can speed up the convergence of the ADMM algorithm considerably.

To add T̄ new observation pairs, one could possibly use an extended version of the homotopy algorithm proposed by Garrigues and El Ghaoui [2008]. The homotopy algorithm presented in Garrigues and El Ghaoui [2008] was developed for including new observations in the Lasso.


The ADMM algorithm presented in the previous section could also be altered to use single data samples upon arrival. In such a setup, instead of waiting for a collection or batch of measurements, the algorithm is updated upon arrival of new measurements. This can be done by studying the augmented Lagrangian of the problem. The augmented Lagrangian in (7) can also be written in normalized form as

L(θ, x, ū) = Σ_{i=1}^{N} ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖θ_i − ζ_i‖_p + (ρ/2) ‖ζ_i − (θ − ū_i)‖² − (ρ/2) ‖ū_i‖²,     (15)

where ū_i = u_i/ρ. Hence, for p = 2, the update in (9) can be achieved by solving the following convex optimization problem, which can be written as a Second Order Cone Programming (SOCP) problem, [Boyd and Vandenberghe, 2004],

min_{θ_i, ζ_i, s}   θ_i^T H_i θ_i − 2 θ_i^T h_i + (ρ/2) ζ_i^T ζ_i − ζ_i^T h_i^k + λ s
subj. to   ‖θ_i − ζ_i‖ ≤ s     (16)

where the following data matrices describe this optimization problem

H_i = Φ_i^T Φ_i / σ_i²,   h_i = Φ_i^T Y_i / σ_i²,   h_i^k = ρ (θ^k − ū_i^k).     (17)

As can be seen from (17), among these matrices only H_i and h_i are affected by the new measurements. Let y_i^new and φ_i^new denote the new measurements. Then H_i and h_i can be updated as follows:

H_i ← H_i + φ_i^new (φ_i^new)^T / σ_i²,   h_i ← h_i + φ_i^new y_i^new / σ_i².     (18)

To handle single data samples upon arrival, step 2 of the ADMM algorithm should therefore be altered to (a small numeric sketch of (18) is given below):

(2) If there exist any new measurements, update H_i and h_i according to (18). Find θ_i^{k+1}, ζ_i^{k+1} by solving (16).

Remark 5.1. In order for this algorithm to be responsive to the arrival of the new measurements, it is required to have network-wide persistent communication. As a result this approach demands much higher communication traffic than the batch solution.

    6. IMPLEMENTATION ISSUES

Step 2 of ADMM requires solving the optimization problem

min_x L(θ^k, x, u^k).     (19)

This problem is solved locally on each sensor once every ADMM iteration. What varies from one iteration to the next are the values of the arguments θ^k and u^k. However, it is unlikely that θ^k and u^k differ significantly from θ^{k+1} and u^{k+1}. Taking advantage of this fact can considerably ease the computational load on each sensor. We present two methods for doing this: warm start and a homotopy method.

The following two subsections are rather technical and we refer the reader not interested in implementing the proposed algorithm to Section 7.

    6.1 Warm Start for Step 2 of the ADMM Algorithm

For the case where $p = 2$, at each iteration of the ADMM we have to solve an SOCP problem, described in (16) and (17). However, in the batch solution, among the matrices in (17) only $h_i^k$ changes with the iteration number. Therefore, if we assume that the vectors $z^k$ and $u_i^k$ do not change drastically from iteration to iteration, it is possible to use the solution of the problem in (16) at the $k$th iteration to warm start the problem at the $(k+1)$th iteration. This is done as follows. The Lagrangian for the problem in (16) can be written as

$$L(\theta_i, x_i, s, z_i) = \theta_i^T H_i\theta_i - 2\,h_i^T\theta_i + \frac{\rho}{2}\,x_i^T x_i - x_i^T h_i^k + \lambda s - \begin{bmatrix} z_{i1} \\ z_{i2} \end{bmatrix}^T \begin{bmatrix} s \\ \theta_i - x_i \end{bmatrix}, \qquad (20)$$

for all $\|z_{i2}\| \le z_{i1}$. By this, the optimality conditions for the problem in (16) can be written as

$$2H_i\theta_i - 2h_i - z_{i2} = 0, \qquad (21a)$$
$$\rho\,x_i - h_i^k + z_{i2} = 0, \qquad (21b)$$
$$\lambda - z_{i1} = 0, \qquad (21c)$$
$$\|z_{i2}\| \le z_{i1}, \qquad (21d)$$
$$\|\theta_i - x_i\| \le s, \qquad (21e)$$
$$z_{i1}s + z_{i2}^T(\theta_i - x_i) = 0, \qquad (21f)$$

where $z_i = \begin{bmatrix} z_{i1} \\ z_{i2} \end{bmatrix}$ [Boyd and Vandenberghe, 2004]. Let $\theta_i^\star$, $x_i^\star$, $s^\star$ and $z_i^\star$ be the primal and dual optima for the problem in (16) at the $k$th iteration, and let $h_i^{k+1} = h_i^k + \Delta h_i$. These can be used to generate a good warm-start point for the solver that solves (16). As a result, by (21), the following vectors can be used to warm start the solver:

$$\theta_i^w = \theta_i^\star, \qquad x_i^w = x_i^\star + \Delta h_i/\rho, \qquad z_i^w = z_i^\star, \qquad s^w = s^\star + \Delta s, \qquad (22)$$

where $\Delta s$ should be chosen such that

$$\|\theta_i^w - x_i^w\| \le s^\star + \Delta s, \qquad z_{i1}^w\,(s^\star + \Delta s) + (z_{i2}^w)^T(\theta_i^w - x_i^w) = \epsilon, \qquad (23)$$

for some $\epsilon \ge 0$.
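A sketch of this warm-start construction, assuming the previous optimum and the change $\Delta h_i$ are available, is given below. It follows the reconstruction of (22)-(23) above and is illustrative, not the authors' code: the shift of $x_i$ keeps the stationarity condition (21b) satisfied, and $s$ is enlarged just enough to remain feasible.

```python
# Sketch of the warm-start point (22) after h_i^k changes by dh.
import numpy as np

def warm_start(theta_opt, x_opt, z_opt, s_opt, dh, rho, margin=1e-8):
    theta_w = theta_opt
    x_w = x_opt + dh / rho          # keeps stationarity condition (21b)
    z_w = z_opt                     # dual variable left unchanged
    # one admissible choice of s^w = s* + ds, cf. (23)
    s_w = max(s_opt, np.linalg.norm(theta_w - x_w)) + margin
    return theta_w, x_w, z_w, s_w
```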

    6.2 A Homotopy Algorithm

Since it is unlikely that $z^k$ and $\mu^k$ differ significantly from $z^{k+1}$ and $\mu^{k+1}$, one can use the previously computed solution $x^k$ and update it through a homotopy, instead of re-solving (19) from scratch at every iteration. We will in the following assume that $p = 1$ and leave out the details for $p > 1$.

First, define
$$\eta_i = x_i - \theta_i. \qquad (24)$$

The optimization objective of (9) is then

$$g_i^k(\theta_i, \eta_i) \triangleq \frac{\|Y_i - \Phi_i\theta_i\|^2}{2\sigma_i^2} + \lambda\|\eta_i\|_1 + (\mu_i^k)^T(\theta_i + \eta_i - z^k) + \frac{\rho}{2}\|\theta_i + \eta_i - z^k\|^2. \qquad (25)$$

It is then straightforward to show that the optimization problem (9) is equivalent to

$$\theta_i^{k+1}, \eta_i^{k+1} = \operatorname*{argmin}_{\theta_i,\,\eta_i}\; g_i^k(\theta_i, \eta_i). \qquad (26)$$

Moreover,
$$x_i^{k+1} = \theta_i^{k+1} + \eta_i^{k+1}. \qquad (27)$$


Now, compute the subdifferential of $g_i^k(\theta_i, \eta_i)$ with respect to $\theta_i$ and $\eta_i$. Simple calculations show that

$$\partial_{\theta_i}\, g_i^k(\theta_i, \eta_i) = -\frac{1}{\sigma_i^2}\,Y_i^T\Phi_i + \frac{1}{\sigma_i^2}\,\theta_i^T\Phi_i^T\Phi_i + (\mu_i^k)^T + \rho\,(\theta_i + \eta_i - z^k)^T, \qquad (28a)$$

$$\partial_{\eta_i}\, g_i^k(\theta_i, \eta_i) = \lambda\,\partial\|\eta_i\|_1 + (\mu_i^k)^T + \rho\,(\theta_i + \eta_i - z^k)^T. \qquad (28b)$$

A necessary condition for the global optimal solution $\theta_i^{k+1}, \eta_i^{k+1}$ of the optimization problem (9) is

$$0 \in \partial_{\theta_i}\, g_i^k\big(\theta_i^{k+1}, \eta_i^{k+1}\big), \qquad (29a)$$
$$0 \in \partial_{\eta_i}\, g_i^k\big(\theta_i^{k+1}, \eta_i^{k+1}\big). \qquad (29b)$$

It follows from (28a) and (29a) that

$$\theta_i^{k+1} = R_i^{-1}\Big(h_i - \mu_i^k/2 - \frac{\rho}{2}(\eta_i - z^k)\Big), \qquad (30)$$

where we have let

$$R_i = \Phi_i^T\Phi_i/2\sigma_i^2 + \frac{\rho}{2}I, \qquad h_i = \Phi_i^T Y_i/2\sigma_i^2. \qquad (31)$$

With (30), the problem now reduces to solving (29b). Inserting (30) into (28b), and letting $Q_i \triangleq I - \frac{\rho}{2}R_i^{-1}$, yields

$$\partial_{\eta_i}\, g_i^k\big(\theta_i^{k+1}, \eta_i\big) = \lambda\,\partial\|\eta_i\|_1 + \rho\,h_i^T R_i^{-1} + \big(\mu_i^k - \rho z^k + \rho\eta_i\big)^T Q_i.$$

Now, replace $z^k$ with $z^k + t\,\Delta z^{k+1}$ and $\mu_i^k$ with $\mu_i^k + t\,\Delta\mu_i^{k+1}$ (where $\Delta z^{k+1} = z^{k+1} - z^k$ and $\Delta\mu_i^{k+1} = \mu_i^{k+1} - \mu_i^k$), and let

$$G_i^k(t) = \lambda\,\partial\|\eta_i\|_1 + \rho\,h_i^T R_i^{-1} + \big(\mu_i^k - \rho z^k + \rho\eta_i\big)^T Q_i + t\,\big(\Delta\mu_i^{k+1} - \rho\,\Delta z^{k+1}\big)^T Q_i. \qquad (32)$$

$\partial_{\eta_i} g_i^k(\theta_i^{k+1}, \eta_i)$ hence equals $G_i^k(0)$. Let $\eta_i^k(t)$ denote the associated minimizer, i.e., the $\eta_i$ for which $0 \in G_i^k(t)$. It follows that

$$\eta_i^{k+1} = \eta_i^k(0), \qquad \eta_i^{k+2} = \eta_i^k(1). \qquad (33)$$

Assume now that $\eta_i^{k+1}$ has been computed and that the elements have been arranged such that the $q$ first elements are nonzero and the last $m - q$ are zero. Let us write

$$\eta_i^k(0) = \begin{bmatrix} \bar\eta_i^k \\ 0 \end{bmatrix}. \qquad (34)$$

We then have (with both the sign and $|\cdot|$ taken elementwise)

$$\partial\big\|\eta_i^k(0)\big\|_1 = \Big\{\begin{bmatrix} \operatorname{sign}(\bar\eta_i^k)^T & v^T \end{bmatrix}^T : v \in \mathbb{R}^{m-q},\ |v| \le 1\Big\}. \qquad (35)$$

Hence, $0 \in G_i^k(0)$ is equivalent to

$$\lambda\operatorname{sign}(\bar\eta_i^k)^T + \rho\,h_i^T \bar R_i^{-1} + \big(\mu_i^k - \rho z^k + \rho\eta_i^k\big)^T \bar Q_i = 0, \qquad (36a)$$
$$\lambda v^T + \rho\,h_i^T \tilde R_i^{-1} + \big(\mu_i^k - \rho z^k + \rho\eta_i^k\big)^T \tilde Q_i = 0, \qquad (36b)$$

with the column partitions

$$R_i^{-1} = \begin{bmatrix} \bar R_i^{-1} & \tilde R_i^{-1} \end{bmatrix}, \qquad \bar R_i^{-1} \in \mathbb{R}^{m\times q},\quad \tilde R_i^{-1} \in \mathbb{R}^{m\times(m-q)}, \qquad (37)$$
$$Q_i = \begin{bmatrix} \bar Q_i & \tilde Q_i \end{bmatrix}, \qquad \bar Q_i \in \mathbb{R}^{m\times q},\quad \tilde Q_i \in \mathbb{R}^{m\times(m-q)}. \qquad (38)$$

It can be shown that the partition (34), i.e., the support of $\eta_i^k(t)$, stays unchanged for $t \in [0, t^\star)$, $t^\star > 0$ (see Lemma 1 of Garrigues and El Ghaoui [2008]). It hence follows that for $t \in [0, t^\star]$

$$\bar\eta_i^k(t)^T = -\big(\lambda\operatorname{sign}(\bar\eta_i^k)^T/\rho + h_i^T\bar R_i^{-1}\big)\hat Q_i^{-1} + \big(z^k - \mu_i^k/\rho\big)^T\bar Q_i\hat Q_i^{-1} + t\,\big(\Delta z^{k+1} - \Delta\mu_i^{k+1}/\rho\big)^T\bar Q_i\hat Q_i^{-1}, \qquad (39)$$

where we introduced $\hat Q_i$ for the $q\times q$ matrix made up of the top $q$ rows of $\bar Q_i$. We can also compute

$$\begin{aligned}
\lambda\,v(t)^T/\rho &= -h_i^T\tilde R_i^{-1} + \big(z^k - \mu_i^k/\rho - \eta_i^k(t)\big)^T\tilde Q_i + t\,\big(\Delta z^{k+1} - \Delta\mu_i^{k+1}/\rho\big)^T\tilde Q_i \\
&= -h_i^T\tilde R_i^{-1} + \big(z^k - \mu_i^k/\rho\big)^T\tilde Q_i - \eta_i^k(t)^T\tilde Q_i + t\,\big(\Delta z^{k+1} - \Delta\mu_i^{k+1}/\rho\big)^T\tilde Q_i \\
&= -h_i^T\tilde R_i^{-1} + \big(z^k - \mu_i^k/\rho\big)^T\tilde Q_i - \eta_i^k(0)^T\tilde Q_i + t\,\big(\Delta z^{k+1} - \Delta\mu_i^{k+1}/\rho\big)^T\big(\tilde Q_i - \bar Q_i\hat Q_i^{-1}\hat{\tilde Q}_i\big),
\end{aligned}$$

where $\hat{\tilde Q}_i$ was used to denote the top $q$ rows of $\tilde Q_i$. Now, to find $t^\star$, we notice that both $\bar\eta_i^k(t)$ and $v(t)$ are linear functions of $t$. We can hence compute the minimal $t$ that

• makes one or more elements of $v(t)$ equal to $1$ or $-1$, and/or
• makes one or more elements of $\bar\eta_i^k(t)$ equal to $0$.

This minimal $t$ will be $t^\star$. At $t^\star$ the partition (34) changes:

• elements corresponding to $v$-elements equal to $1$ or $-1$ should be included in $\bar\eta_i^k$;
• elements of $\bar\eta_i^k(t^\star)$ equal to $0$ should be fixed to zero.

Given the solution $\eta_i^k(t^\star)$, we can continue in the same way as above to compute $\eta_i^k(t)$, $t \in [t^\star, t^{\star\star}]$, and so on. The procedure continues until $\eta_i^k(1)$ has been computed. Due to space limitations we have chosen not to give a summary of the algorithm; we instead refer the interested reader to the implementation available from http://www.rt.isy.liu.se/~ohlsson/code.html.
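One way to locate the next breakpoint $t^\star$ is to exploit that both quantities are affine in $t$ on the current segment. The helper below is a sketch under that assumption (hypothetical names; it is not the code distributed at the URL above): given the values and slopes of $\bar\eta$ and $v$, it returns the smallest $t$ in $(0, 1]$ at which a component of $\bar\eta$ hits $0$ or a component of $v$ hits $\pm 1$.

```python
# Illustrative breakpoint finder for one homotopy segment:
# eta_bar(t) = e0 + t*e1 and v(t) = v0 + t*v1 are affine in t.
import numpy as np

def next_breakpoint(e0, e1, v0, v1):
    candidates = [1.0]
    # components of eta_bar(t) crossing zero
    with np.errstate(divide="ignore", invalid="ignore"):
        t_zero = -np.asarray(e0, float) / np.asarray(e1, float)
    candidates += [t for t in np.atleast_1d(t_zero) if np.isfinite(t) and 0 < t <= 1]
    # components of v(t) reaching +1 or -1
    for target in (1.0, -1.0):
        with np.errstate(divide="ignore", invalid="ignore"):
            t_hit = (target - np.asarray(v0, float)) / np.asarray(v1, float)
        candidates += [t for t in np.atleast_1d(t_hit) if np.isfinite(t) and 0 < t <= 1]
    return min(candidates)
```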

    7. NUMERICAL ILLUSTRATION

For the numerical illustration, we consider a network of $N = 10$ sensors with $\theta_1 = \theta_2 = \cdots = \theta_{10}$ being random samples from $\mathcal{N}(0, I)$ in $\mathbb{R}^{20}$. We assume that there exist 10 batches of measurements, each consisting of 15 samples. The regressors were unit Gaussian noise and the measurement noise variance was set to one. In the scenario considered in this section, we simulate failures in sensors 2 and 5, upon the arrival of the 4th and 6th measurement batches, respectively. This is done by changing the 5th component of $\theta_2$ (multiplying it by 5) and changing the 8th component of $\theta_5$ (shifting it by adding 5). It is assumed that the faults are persistent.

With $\lambda = \rho = 20$, the centralized solution, ADMM, and ADMM with homotopy give identical results (up to the 2nd digit; 10 iterations were used in ADMM). As can be seen from Figures 1-3, the result correctly detects that the 2nd and 5th sensors are faulty. In addition, as can be seen from Figures 2 and 3, ADMM and ADMM with homotopy show for how many data batches the sensors remained faulty. The results also detect which elements of $\theta_2$ and $\theta_5$ deviated from the nominal value. In this example (using the ADMM algorithm), each sensor had to broadcast $m \times (\text{number of ADMM iterations}) \times (\text{number of batches}) = 20 \times 10 \times 10 = 2000$ scalar values. If instead all data were shared, each sensor would have to broadcast $(m+1) \times T \times (\text{number of batches}) = (20+1) \times 15 \times 10 = 3150$ scalar values. Using the proposed algorithm, the traffic over the network can hence be made considerably lower while keeping the performance of a centralized change detection algorithm. Using the homotopy algorithm (or warm start) to solve step 2 of the ADMM algorithm will not affect the traffic over the network, but could lower the computational burden on the sensors. It is also worth noting that the


Fig. 1. Results from the centralized change detection. As can be seen, sensors 2 and 5 are detected to be faulty.

Fig. 2. Results from the ADMM batch solution. Sensors 2 and 5 have been detected faulty for 6 and 4 batches.

Fig. 3. Results from the homotopy solution. Sensors 2 and 5 have been detected faulty for 6 and 4 batches.

classical approach using a likelihood ratio test, as described in Section 3, would not work on this example since $20 = m > T = 15$.

    8. CONCLUSION

This paper has presented a distributed change detection algorithm. Change detection is most often seen as a centralized problem; as many scenarios are naturally distributed, there is a need for distributed change detection algorithms. The basic idea of the proposed distributed change detection algorithm is to use system identification, in a distributed manner, to obtain a nominal model for the sensors or agents, and then to detect whether one or more sensors or agents start deviating from this nominal model. The proposed formulation takes the form of a convex optimization problem. We show how this can be solved distributively and present a homotopy algorithm to ease the computational load. The proposed formulation has connections with Lasso and compressed sensing, and the theory developed for these methods is therefore directly applicable.

    REFERENCES

M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ, 1993.

D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.

A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34-81, 2009.

E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, March 2008.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489-509, February 2006.

E. Chu, D. Gorinevsky, and S. Boyd. Scalable statistical monitoring of fleet data. In Proceedings of the 18th IFAC World Congress, pages 13227-13232, Milan, Italy, August 2011.

D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, April 2006.

P. Garrigues and L. El Ghaoui. An homotopy algorithm for the lasso with online observations. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

F. Gustafsson. Adaptive Filtering and Change Detection. Wiley, New York, 2001.

I. Loris. On the performance of algorithms for the minimization of l1-penalized functionals. Inverse Problems, 25:1-16, 2009.

H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. Smoothed state estimates under abrupt changes using sum-of-norms regularization. Automatica, 48(4):595-605, 2012.

R. Patton, P. Frank, and R. Clark. Fault Diagnosis in Dynamic Systems: Theory and Application. Prentice Hall, 1989.

A. Tartakovsky and V. Veeravalli. Quickest change detection in distributed sensor systems. In Proceedings of the 6th International Conference on Information Fusion, pages 756-763, Cairns, Australia, July 2003.

A. G. Tartakovsky and V. V. Veeravalli. An efficient sequential procedure for detecting changes in multichannel and distributed systems. In Proceedings of the Fifth International Conference on Information Fusion, pages 41-48, 2002.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B (Methodological), 58(1):267-288, 1996.

A. Willsky. A survey of design methods for failure detection in dynamic systems. Automatica, 12:601-611, 1976.

A. Willsky and H. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, 21(1):108-112, February 1976.

A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast l1-minimization algorithms and an application in robust face recognition: A review. In ICIP, 2010.


An ADMM Algorithm for a Class of Total Variation Regularized Estimation Problems ⋆

Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang

Automatic Control Lab and ACCESS, School of Electrical Engineering, KTH Royal Institute of Technology, SE 100 44 Stockholm, Sweden
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Abstract: We present an alternating augmented Lagrangian method for convex optimization problems where the cost function is the sum of two terms, one that is separable in the variable blocks, and a second that is separable in the difference between consecutive variable blocks. Examples of such problems include Fused Lasso estimation, total variation denoising, and multi-period portfolio optimization with transaction costs. In each iteration of our method, the first step involves separately optimizing over each variable block, which can be carried out in parallel. The second step is not separable in the variables, but can be carried out very efficiently. We apply the algorithm to segmentation of data based on changes in mean ($\ell_1$ mean filtering) or changes in variance ($\ell_1$ variance filtering). In a numerical example, we show that our implementation is around 10000 times faster compared with the generic optimization solver SDPT3.

Keywords: Signal processing algorithms, stochastic parameters, parameter estimation, convex optimization and regularization

    1. INTRODUCTION

In this paper we consider optimization problems where the objective is a sum of two terms: the first term is separable in the variable blocks, and the second term is separable in the difference between consecutive variable blocks. One example is the Fused Lasso method in statistical learning, Tibshirani et al. [2005], where the objective includes an $\ell_1$-norm penalty on the parameters, as well as an $\ell_1$-norm penalty on the difference between consecutive parameters. The first penalty encourages a sparse solution, i.e., one with few nonzero entries, while the second penalty enhances block partitions in the parameter space. The same ideas have been applied in many other areas, such as Total Variation (TV) denoising, Rudin et al. [1992], and segmentation of ARX models, Ohlsson et al. [2010] (where it is called sum-of-norms regularization). Another example is multi-period portfolio optimization, where the variable blocks give the portfolio in different time periods, the first term is the portfolio objective (such as risk-adjusted return), and the second term accounts for transaction costs.

In many applications, the optimization problem involves a large number of variables, and cannot be efficiently handled by generic optimization solvers. In this paper, our main contribution is to derive an efficient and scalable optimization algorithm by exploiting the structure of the optimization problem. To do this, we use a distributed optimization method called the Alternating Direction Method of Multipliers (ADMM). ADMM was developed in the 1970s, and is closely related to many other optimization algorithms, including Bregman iterative algorithms for $\ell_1$ problems, Douglas-Rachford splitting, and proximal point methods; see Eckstein and Bertsekas [1992], Combettes and Pesquet [2007]. ADMM has been applied in many areas, including image and signal processing, Setzer [2011], as well as large-scale problems in statistics and machine learning, Boyd et al. [2011].

⋆ This work was partially supported by the Swedish Research Council, the Linnaeus Center ACCESS at KTH and the European Research Council under the advanced grant LEARN, contract 267381.

We will apply ADMM to $\ell_1$ mean filtering and $\ell_1$ variance filtering (Wahlberg et al. [2011]), which are important problems in signal processing with many applications, for example in financial or biological data analysis. In some applications, mean and variance filtering are used to pre-process data before fitting a parametric model. For non-stationary data it is also important for segmenting the data into stationary subsets. The approach we present is inspired by the $\ell_1$ trend filtering method described in Kim et al. [2009], which tracks changes in the mean value of the data. (An example in this paper also tracks changes in the variance of the underlying stochastic process.) These problems are closely related to the covariance selection problem, Dempster [1972], which is a convex optimization problem when the inverse covariance is used as the optimization variable, Banerjee et al. [2008]. The same ideas can also be found in Kim et al. [2009] and Friedman et al. [2008].


This paper is organized as follows. In Section 2 we review the ADMM method. In Section 3, we apply ADMM to our optimization problem to derive an efficient optimization algorithm. In Section 4.1 we apply our method to $\ell_1$ mean filtering, while in Section 4.2 we consider $\ell_1$ variance filtering. Section 5 contains some numerical examples, and Section 6 concludes the paper.

2. ALTERNATING DIRECTION METHOD OF MULTIPLIERS (ADMM)

In this section we give an overview of ADMM. We follow closely the development in Section 5 of Boyd et al. [2011].

Consider the following optimization problem
$$\text{minimize}\;\; f(x) \qquad \text{subject to}\;\; x \in \mathcal{C}, \qquad (1)$$
with variable $x \in \mathbb{R}^n$, and where $f$ and $\mathcal{C}$ are convex. We let $p^\star$ denote the optimal value of (1). We first re-write the problem as
$$\text{minimize}\;\; f(x) + I_{\mathcal{C}}(z) \qquad \text{subject to}\;\; x = z, \qquad (2)$$
where $I_{\mathcal{C}}(z)$ is the indicator function of $\mathcal{C}$ (i.e., $I_{\mathcal{C}}(z) = 0$ for $z \in \mathcal{C}$, and $I_{\mathcal{C}}(z) = \infty$ for $z \notin \mathcal{C}$). The augmented Lagrangian for this problem is
$$L(x, z, u) = f(x) + I_{\mathcal{C}}(z) + (\rho/2)\|x - z + u\|_2^2,$$
where $u$ is a scaled dual variable associated with the constraint $x = z$, i.e., $u = (1/\rho)y$, where $y$ is the dual variable for $x = z$. Here, $\rho > 0$ is a penalty parameter.

In each iteration of ADMM, we perform alternating minimization of the augmented Lagrangian over $x$ and $z$. At iteration $k$ we carry out the following steps:
$$x^{k+1} := \operatorname*{argmin}_x \{f(x) + (\rho/2)\|x - z^k + u^k\|_2^2\}, \qquad (3)$$
$$z^{k+1} := \Pi_{\mathcal{C}}(x^{k+1} + u^k), \qquad (4)$$
$$u^{k+1} := u^k + (x^{k+1} - z^{k+1}), \qquad (5)$$
where $\Pi_{\mathcal{C}}$ denotes Euclidean projection onto $\mathcal{C}$. In the first step of ADMM, we fix $z$ and $u$ and minimize the augmented Lagrangian over $x$; next, we fix $x$ and $u$ and minimize over $z$; finally, we update the dual variable $u$.

    2.1 Convergence

Under mild assumptions on $f$ and $\mathcal{C}$, we can show that the iterates of ADMM converge to a solution; specifically, we have
$$f(x^k) \to p^\star, \qquad x^k - z^k \to 0,$$
as $k \to \infty$. The rate of convergence, and hence the number of iterations required to achieve a specified accuracy, can depend strongly on the choice of the parameter $\rho$. When $\rho$ is well chosen, this method can converge to a fairly accurate solution (good enough for many applications) within a few tens of iterations. However, if the choice of $\rho$ is poor, many iterations can be needed for convergence. These issues, including heuristics for choosing $\rho$, are discussed in more detail in Boyd et al. [2011].

    2.2 Stopping criterion

The primal and dual residuals at iteration $k$ are given by
$$e_p^k = x^k - z^k, \qquad e_d^k = -\rho\,(z^k - z^{k-1}).$$
We terminate the algorithm when the primal and dual residuals satisfy a stopping criterion (which can vary depending on the requirements of the application). A typical criterion is to stop when
$$\|e_p^k\|_2 \le \epsilon^{\mathrm{pri}}, \qquad \|e_d^k\|_2 \le \epsilon^{\mathrm{dual}}.$$
Here, the tolerances $\epsilon^{\mathrm{pri}} > 0$ and $\epsilon^{\mathrm{dual}} > 0$ can be set via an absolute plus relative criterion,
$$\epsilon^{\mathrm{pri}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\max\{\|x^k\|_2, \|z^k\|_2\}, \qquad \epsilon^{\mathrm{dual}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\|\rho\,u^k\|_2,$$
where $\epsilon^{\mathrm{abs}} > 0$ and $\epsilon^{\mathrm{rel}} > 0$ are absolute and relative tolerances (see Boyd et al. [2011] for details).
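The iterations (3)-(5) and the stopping rule above translate almost directly into code. Below is a minimal, generic numpy sketch (our own illustration, not the authors' implementation); `argmin_f` and `project_C` are assumed user-supplied, problem-specific routines.

```python
# Generic ADMM sketch for minimize f(x) s.t. x in C, with scaled dual u.
# argmin_f(v) must return argmin_x { f(x) + (rho/2)||x - v||_2^2 },
# project_C(v) the Euclidean projection of v onto C.
import numpy as np

def admm(argmin_f, project_C, x0, rho=1.0, eps_abs=1e-4, eps_rel=1e-3, max_iter=500):
    n = x0.size
    x, z, u = x0.copy(), x0.copy(), np.zeros(n)
    for _ in range(max_iter):
        x = argmin_f(z - u)                      # x-update (3)
        z_old = z
        z = project_C(x + u)                     # z-update (4)
        u = u + (x - z)                          # scaled dual update (5)
        r = x - z                                # primal residual
        s = rho * (z - z_old)                    # dual residual (sign irrelevant for the norm)
        eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * u)
        if np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual:
            break
    return x, z
```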

    3. PROBLEM FORMULATION AND METHOD

In this section we formulate our problem and derive an efficient distributed optimization algorithm via ADMM.

    3.1 Optimization problem

We consider the problem
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\phi_i(x_i) + \sum_{i=1}^{N-1}\psi_i(r_i) \\ \text{subject to} & r_i = x_{i+1} - x_i, \quad i = 1, \ldots, N-1, \end{array} \qquad (6)$$
with variables $x_1, \ldots, x_N, r_1, \ldots, r_{N-1} \in \mathbb{R}^n$, and where $\phi_i : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ and $\psi_i : \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$ are convex functions.

This problem has the form (1), with variables $x = (x_1, \ldots, x_N)$, $r = (r_1, \ldots, r_{N-1})$, objective function
$$f(x, r) = \sum_{i=1}^{N}\phi_i(x_i) + \sum_{i=1}^{N-1}\psi_i(r_i),$$
and constraint set
$$\mathcal{C} = \{(x, r) \mid r_i = x_{i+1} - x_i,\ i = 1, \ldots, N-1\}. \qquad (7)$$
The ADMM form for problem (6) is
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\phi_i(x_i) + \sum_{i=1}^{N-1}\psi_i(r_i) + I_{\mathcal{C}}(z, s) \\ \text{subject to} & r_i = s_i, \quad i = 1, \ldots, N-1, \\ & x_i = z_i, \quad i = 1, \ldots, N, \end{array} \qquad (8)$$
with variables $x = (x_1, \ldots, x_N)$, $r = (r_1, \ldots, r_{N-1})$, $z = (z_1, \ldots, z_N)$, and $s = (s_1, \ldots, s_{N-1})$. Furthermore, we let $u = (u_1, \ldots, u_N)$ and $t = (t_1, \ldots, t_{N-1})$ be vectors of scaled dual variables associated with the constraints $x_i = z_i$, $i = 1, \ldots, N$, and $r_i = s_i$, $i = 1, \ldots, N-1$ (i.e., $u_i = (1/\rho)y_i$, where $y_i$ is the dual variable associated with $x_i = z_i$).

    3.2 Distributed optimization method

Applying ADMM to problem (8), we carry out the following steps in each iteration.

Step 1. Since the objective function $f$ is separable in $x_i$ and $r_i$, the first step (3) of the ADMM algorithm consists of $2N - 1$ separate minimizations
$$x_i^{k+1} := \operatorname*{argmin}_{x_i}\{\phi_i(x_i) + (\rho/2)\|x_i - z_i^k + u_i^k\|_2^2\}, \qquad (9)$$


$i = 1, \ldots, N$, and
$$r_i^{k+1} := \operatorname*{argmin}_{r_i}\{\psi_i(r_i) + (\rho/2)\|r_i - s_i^k + t_i^k\|_2^2\}, \qquad (10)$$
$i = 1, \ldots, N-1$. These updates can all be carried out in parallel. For many applications, we will see that we can often solve (9) and (10) analytically.

Step 2. In the second step of ADMM, we project $(x^{k+1} + u^k, r^{k+1} + t^k)$ onto the constraint set $\mathcal{C}$, i.e.,
$$(z^{k+1}, s^{k+1}) := \Pi_{\mathcal{C}}\big((x^{k+1}, r^{k+1}) + (u^k, t^k)\big).$$
For the particular constraint set (7), we will show in Section 3.3 that the projection can be performed extremely efficiently.

Step 3. Finally, we update the dual variables:
$$u_i^{k+1} := u_i^k + (x_i^{k+1} - z_i^{k+1}), \quad i = 1, \ldots, N,$$
and
$$t_i^{k+1} := t_i^k + (r_i^{k+1} - s_i^{k+1}), \quad i = 1, \ldots, N-1.$$
These updates can also be carried out independently in parallel, for each variable block.

    3.3 Projection

In this section we work out an efficient formula for projection onto the constraint set $\mathcal{C}$ in (7). To perform the projection
$$(z, s) = \Pi_{\mathcal{C}}\big((w, v)\big),$$
we solve the optimization problem
$$\text{minimize}\;\; \|z - w\|_2^2 + \|s - v\|_2^2 \qquad \text{subject to}\;\; s = Dz,$$
with variables $z = (z_1, \ldots, z_N)$ and $s = (s_1, \ldots, s_{N-1})$, and where $D \in \mathbb{R}^{(N-1)n \times Nn}$ is the forward difference operator, i.e.,
$$D = \begin{bmatrix} -I & I & & & \\ & -I & I & & \\ & & \ddots & \ddots & \\ & & & -I & I \end{bmatrix}.$$
This problem is equivalent to
$$\text{minimize}\;\; \|z - w\|_2^2 + \|Dz - v\|_2^2,$$
with variable $z = (z_1, \ldots, z_N)$. Thus, to perform the projection we first solve the optimality condition
$$(I + D^TD)\,z = w + D^Tv \qquad (11)$$
for $z$, and then we let $s = Dz$.

The matrix $I + D^TD$ is block tridiagonal, with diagonal blocks equal to multiples of $I$ and sub/super-diagonal blocks equal to $-I$. Let $LL^T$ be the Cholesky factorization of $I + D^TD$. It is easy to show that $L$ is block banded with the form
$$L = \begin{bmatrix} l_{1,1} & & & & \\ l_{2,1} & l_{2,2} & & & \\ & l_{3,2} & l_{3,3} & & \\ & & \ddots & \ddots & \\ & & & l_{N,N-1} & l_{N,N} \end{bmatrix} \otimes I,$$
where $\otimes$ denotes the Kronecker product. The coefficients $l_{i,j}$ can be explicitly computed via the recursion
$$l_{1,1} = \sqrt{2}, \qquad l_{i+1,i} = -1/l_{i,i}, \qquad l_{i+1,i+1} = \sqrt{3 - l_{i+1,i}^2}, \quad i = 1, \ldots, N-2,$$
$$l_{N,N-1} = -1/l_{N-1,N-1}, \qquad l_{N,N} = \sqrt{2 - l_{N,N-1}^2}.$$
The coefficients only need to be computed once, before the projection operator is applied.

The projection therefore consists of the following steps:

(1) Form $b := w + D^Tv$:
$$b_1 := w_1 - v_1, \qquad b_N := w_N + v_{N-1}, \qquad b_i := w_i + (v_{i-1} - v_i), \quad i = 2, \ldots, N-1.$$
(2) Solve $Ly = b$:
$$y_1 := (1/l_{1,1})\,b_1, \qquad y_i := (1/l_{i,i})(b_i - l_{i,i-1}\,y_{i-1}), \quad i = 2, \ldots, N.$$
(3) Solve $L^Tz = y$:
$$z_N := (1/l_{N,N})\,y_N, \qquad z_i := (1/l_{i,i})(y_i - l_{i+1,i}\,z_{i+1}), \quad i = N-1, \ldots, 1.$$
(4) Set $s = Dz$:
$$s_i := z_{i+1} - z_i, \quad i = 1, \ldots, N-1.$$

Thus, we see that we can perform the projection very efficiently, in $O(Nn)$ flops (floating-point operations). In fact, if we pre-compute the inverses $1/l_{i,i}$, $i = 1, \ldots, N$, the only operations required are multiplication, addition, and subtraction. We do not need to perform division, which can be expensive on some hardware platforms.
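The four steps above can be implemented in a few lines. The following numpy sketch is our own illustration of the procedure (not the authors' code): it computes the Cholesky coefficients, performs the block forward/back substitutions, and forms $s = Dz$, for inputs `w` of shape (N, n) and `v` of shape (N-1, n).

```python
# Sketch of the O(Nn) projection onto C, using the banded Cholesky
# factor of I + D^T D as described above.
import numpy as np

def project_C(w, v):
    N, n = w.shape
    # Cholesky coefficients of I + D^T D (in practice computed once and cached)
    l_diag = np.empty(N)
    l_sub = np.empty(N - 1)
    l_diag[0] = np.sqrt(2.0)
    for i in range(N - 2):
        l_sub[i] = -1.0 / l_diag[i]
        l_diag[i + 1] = np.sqrt(3.0 - l_sub[i] ** 2)
    l_sub[N - 2] = -1.0 / l_diag[N - 2]
    l_diag[N - 1] = np.sqrt(2.0 - l_sub[N - 2] ** 2)

    # Step 1: b = w + D^T v
    b = w.copy()
    b[0] -= v[0]
    b[-1] += v[-1]
    b[1:-1] += v[:-1] - v[1:]

    # Step 2: forward substitution, L y = b
    y = np.empty_like(b)
    y[0] = b[0] / l_diag[0]
    for i in range(1, N):
        y[i] = (b[i] - l_sub[i - 1] * y[i - 1]) / l_diag[i]

    # Step 3: back substitution, L^T z = y
    z = np.empty_like(y)
    z[-1] = y[-1] / l_diag[-1]
    for i in range(N - 2, -1, -1):
        z[i] = (y[i] - l_sub[i] * z[i + 1]) / l_diag[i]

    # Step 4: s = D z
    s = z[1:] - z[:-1]
    return z, s
```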

    4. EXAMPLES

4.1 $\ell_1$ Mean filtering

Consider a sequence of vector random variables
$$Y_i \sim \mathcal{N}(\bar y_i, \Sigma), \quad i = 1, \ldots, N,$$
where $\bar y_i \in \mathbb{R}^n$ is the mean and $\Sigma \in \mathbf{S}_+^n$ is the covariance matrix. We assume that the covariance matrix is known, but the mean of the process is unknown. Given a sequence of observations $y_1, \ldots, y_N$, our goal is to estimate the mean under the assumption that it is piecewise constant, i.e., $\bar y_{i+1} = \bar y_i$ for many values of $i$.

In the Fused Group Lasso method, we obtain our estimates by solving
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\frac{1}{2}(y_i - x_i)^T\Sigma^{-1}(y_i - x_i) + \lambda\sum_{i=1}^{N-1}\|r_i\|_2 \\ \text{subject to} & r_i = x_{i+1} - x_i, \quad i = 1, \ldots, N-1, \end{array}$$
with variables $x_1, \ldots, x_N, r_1, \ldots, r_{N-1}$. Let $x_1^\star, \ldots, x_N^\star$, $r_1^\star, \ldots, r_{N-1}^\star$ denote an optimal point; our estimates of $\bar y_1, \ldots, \bar y_N$ are $x_1^\star, \ldots, x_N^\star$.

This problem is clearly in the form (6), with
$$\phi_i(x_i) = \frac{1}{2}(y_i - x_i)^T\Sigma^{-1}(y_i - x_i), \qquad \psi_i(r_i) = \lambda\|r_i\|_2.$$
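For small instances, this Fused Group Lasso problem can also be posed directly in a modeling tool and used as a reference when checking an ADMM implementation. The CVXPY sketch below is our own illustration under that assumption (it is not the method developed in this section); `Y` (an N-by-n array of observations), `Sigma_inv` and `lam` are assumed inputs.

```python
# Hypothetical CVXPY reference formulation of l1 mean filtering.
import cvxpy as cp

def fused_group_lasso_mean(Y, Sigma_inv, lam):
    N, n = Y.shape
    X = cp.Variable((N, n))
    fit = sum(0.5 * cp.quad_form(Y[i] - X[i], Sigma_inv) for i in range(N))
    tv = sum(cp.norm(X[i + 1] - X[i], 2) for i in range(N - 1))
    prob = cp.Problem(cp.Minimize(fit + lam * tv))
    prob.solve()
    return X.value
```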


ADMM steps. For this problem, steps (9) and (10) of ADMM can be further simplified. Step (9) involves minimizing an unconstrained quadratic function in the variable $x_i$, and can be written as
$$x_i^{k+1} = (\Sigma^{-1} + \rho I)^{-1}\big(\Sigma^{-1}y_i + \rho\,(z_i^k - u_i^k)\big).$$
Step (10) is
$$r_i^{k+1} := \operatorname*{argmin}_{r_i}\{\lambda\|r_i\|_2 + (\rho/2)\|r_i - s_i^k + t_i^k\|_2^2\},$$
which simplifies to
$$r_i^{k+1} = S_{\lambda/\rho}(s_i^k - t_i^k), \qquad (12)$$
where $S_\kappa$ is the vector soft thresholding operator, defined as
$$S_\kappa(a) = (1 - \kappa/\|a\|_2)_+\,a, \qquad S_\kappa(0) = 0.$$
Here the notation $(v)_+ = \max\{0, v\}$ denotes the positive part of the vector $v$. (For details see Boyd et al. [2011].)
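A minimal numpy sketch of these two simplified updates is given below (illustrative variable names; not the authors' code).

```python
# x-update and vector soft-thresholding r-update for l1 mean filtering.
import numpy as np

def x_update(y_i, z_i, u_i, Sigma_inv, rho):
    n = y_i.size
    A = Sigma_inv + rho * np.eye(n)
    return np.linalg.solve(A, Sigma_inv @ y_i + rho * (z_i - u_i))

def soft_threshold_vec(a, kappa):
    nrm = np.linalg.norm(a)
    if nrm == 0.0:
        return a
    return max(0.0, 1.0 - kappa / nrm) * a

def r_update(s_i, t_i, lam, rho):
    return soft_threshold_vec(s_i - t_i, lam / rho)   # cf. (12)
```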

Variations. In some problems, we might expect that individual components of $x_i$ will be piecewise constant, in which case we can instead use the standard Fused Lasso method, in which we solve
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\frac{1}{2}(y_i - x_i)^T\Sigma^{-1}(y_i - x_i) + \lambda\sum_{i=1}^{N-1}\|r_i\|_1 \\ \text{subject to} & r_i = x_{i+1} - x_i, \quad i = 1, \ldots, N-1, \end{array}$$
with variables $x_1, \ldots, x_N, r_1, \ldots, r_{N-1}$. The ADMM updates are the same, except that instead of doing vector soft thresholding in step (10), we perform scalar componentwise soft thresholding, i.e.,
$$(r_i^{k+1})_j = S_{\lambda/\rho}\big((s_i^k - t_i^k)_j\big), \quad j = 1, \ldots, n.$$

4.2 $\ell_1$ Variance filtering

Consider a sequence of vector random variables (of dimension $n$)
$$Y_i \sim \mathcal{N}(0, \Sigma_i), \quad i = 1, \ldots, N,$$
where $\Sigma_i \in \mathbf{S}_+^n$ is the covariance matrix for $Y_i$ (which we assume is fixed but unknown). Given observations $y_1, \ldots, y_N$, our goal is to estimate the sequence of covariance matrices $\Sigma_1, \ldots, \Sigma_N$, under the assumption that it is piecewise constant, i.e., it is often the case that $\Sigma_{i+1} = \Sigma_i$. In order to obtain a convex problem, we use the inverse covariances $X_i = \Sigma_i^{-1}$ as our variables.

The Fused Group Lasso method for this problem involves solving
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\big(\operatorname{Tr}(X_i y_i y_i^T) - \log\det X_i\big) + \lambda\sum_{i=1}^{N-1}\|R_i\|_F \\ \text{subject to} & R_i = X_{i+1} - X_i, \quad i = 1, \ldots, N-1, \end{array}$$
where our variables are $R_i \in \mathbf{S}^n$, $i = 1, \ldots, N-1$, and $X_i \in \mathbf{S}_+^n$, $i = 1, \ldots, N$. Here,
$$\|R_i\|_F = \sqrt{\operatorname{Tr}(R_i^TR_i)}$$
is the Frobenius norm of $R_i$. Let $X_1^\star, \ldots, X_N^\star$, $R_1^\star, \ldots, R_{N-1}^\star$ denote an optimal point; our estimates of $\Sigma_1, \ldots, \Sigma_N$ are $(X_1^\star)^{-1}, \ldots, (X_N^\star)^{-1}$.

ADMM steps. It is easy to see that steps (9) and (10) simplify for this problem. Step (9) requires solving
$$X_i^{k+1} := \operatorname*{argmin}_{X_i \succ 0}\{\phi_i(X_i) + (\rho/2)\|X_i - Z_i^k + U_i^k\|_F^2\},$$
where
$$\phi_i(X_i) = \operatorname{Tr}(X_i y_i y_i^T) - \log\det X_i.$$
This update can be solved analytically, as follows.

(1) Compute the eigenvalue decomposition
$$\rho\,(Z_i^k - U_i^k) - y_i y_i^T = Q\Lambda Q^T,$$
where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$.
(2) Now let
$$\tilde x_j := \frac{\lambda_j + \sqrt{\lambda_j^2 + 4\rho}}{2\rho}, \quad j = 1, \ldots, n.$$
(3) Finally, we set
$$X_i^{k+1} = Q\operatorname{diag}(\tilde x_1, \ldots, \tilde x_n)\,Q^T.$$

For details of this derivation, see Section 6.5 in Boyd et al. [2011].
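The analytic update above amounts to one symmetric eigendecomposition per block. A numpy sketch under the reconstruction above (illustrative, not the authors' code):

```python
# Analytic X-update for l1 variance filtering (cf. Boyd et al. [2011], Sec. 6.5).
import numpy as np

def X_update(y_i, Z_i, U_i, rho):
    S_i = np.outer(y_i, y_i)
    lam, Q = np.linalg.eigh(rho * (Z_i - U_i) - S_i)            # step (1)
    x_tilde = (lam + np.sqrt(lam**2 + 4.0 * rho)) / (2.0 * rho)  # step (2)
    return (Q * x_tilde) @ Q.T                                   # step (3): Q diag(x) Q^T
```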

Step (10) is
$$R_i^{k+1} := \operatorname*{argmin}_{R_i}\{\lambda\|R_i\|_F + (\rho/2)\|R_i - S_i^k + T_i^k\|_F^2\},$$
which simplifies to
$$R_i^{k+1} = S_{\lambda/\rho}(S_i^k - T_i^k),$$
where $S_\kappa$ is a matrix soft thresholding operator, defined as
$$S_\kappa(A) = (1 - \kappa/\|A\|_F)_+\,A, \qquad S_\kappa(0) = 0.$$

Variations. As with $\ell_1$ mean filtering, we can replace the Frobenius norm penalty with a componentwise vector $\ell_1$-norm penalty on $R_i$ to get the problem
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\big(\operatorname{Tr}(X_i y_i y_i^T) - \log\det X_i\big) + \lambda\sum_{i=1}^{N-1}\|R_i\|_1 \\ \text{subject to} & R_i = X_{i+1} - X_i, \quad i = 1, \ldots, N-1, \end{array}$$
with variables $R_1, \ldots, R_{N-1} \in \mathbf{S}^n$ and $X_1, \ldots, X_N \in \mathbf{S}_+^n$, and where
$$\|R\|_1 = \sum_{j,k}|R_{jk}|.$$
Again, the ADMM updates are the same; the only difference is that in step (10) we replace matrix soft thresholding with a componentwise soft threshold, i.e.,
$$(R_i^{k+1})_{l,m} = S_{\lambda/\rho}\big((S_i^k - T_i^k)_{l,m}\big), \quad l, m = 1, \ldots, n.$$

4.3 $\ell_1$ Mean and variance filtering

Consider a sequence of vector random variables
$$Y_i \sim \mathcal{N}(\bar y_i, \Sigma_i), \quad i = 1, \ldots, N,$$
where $\bar y_i \in \mathbb{R}^n$ is the mean and $\Sigma_i \in \mathbf{S}_+^n$ is the covariance matrix for $Y_i$. We assume that both the mean and the covariance matrix of the process are unknown. Given observations $y_1, \ldots, y_N$, our goal is to estimate the means and the sequence of covariance matrices $\Sigma_1, \ldots, \Sigma_N$, under the assumption that they are piecewise constant, i.e., it is


often the case that $\bar y_{i+1} = \bar y_i$ and $\Sigma_{i+1} = \Sigma_i$. To obtain a convex optimization problem, we use
$$X_i = \tfrac{1}{2}\Sigma_i^{-1}, \qquad m_i = \Sigma_i^{-1}x_i,$$
as our variables (with $x_i$ denoting the mean). In the Fused Group Lasso method, we obtain our estimates by solving
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N}\Big(-\tfrac{1}{2}\log\det(X_i) + \operatorname{Tr}(X_i y_i y_i^T) - m_i^Ty_i + \tfrac{1}{4}\operatorname{Tr}(X_i^{-1}m_i m_i^T)\Big) + \lambda_1\sum_{i=1}^{N-1}\|r_i\|_2 + \lambda_2\sum_{i=1}^{N-1}\|R_i\|_F \\ \text{subject to} & r_i = m_{i+1} - m_i, \quad i = 1, \ldots, N-1, \\ & R_i = X_{i+1} - X_i, \quad i = 1, \ldots, N-1, \end{array}$$
with variables $r_1, \ldots, r_{N-1} \in \mathbb{R}^n$, $m_1, \ldots, m_N \in \mathbb{R}^n$, $R_1, \ldots, R_{N-1} \in \mathbf{S}^n$, and $X_1, \ldots, X_N \in \mathbf{S}_+^n$.

ADMM steps. This problem is also in the form (6); however, as far as we are aware, there is no analytical formula for steps (9) and (10). To carry out these updates, we must solve semidefinite programs (SDPs), for which there are a number of efficient and reliable software packages (Toh et al. [1999], Sturm [1999]).

    5. NUMERICAL EXAMPLE

In this section we solve an instance of $\ell_1$ mean filtering with $n = 1$, $\lambda = 1$, and $N = 400$, using the standard Fused Lasso method. To improve convergence of the ADMM algorithm, we use over-relaxation with $\alpha = 1.8$; see Boyd et al. [2011]. The