analysis, design, and optimization of large-scale networks ...mihailo/theses/sepid... · theory,...
TRANSCRIPT
Analysis, design, and optimization of
large-scale networks of dynamical
systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF SOUTHERN CALIFORNIA
BY
Sepideh Hassan-Moghaddam
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Doctor of Philosophy
Ming Hsieh Department of Electrical and Computer Engineering
December 2019
Analysis, design, and optimization of
large-scale networks of dynamical
systems
Copyright © 2019
by
Sepideh Hassan-Moghaddam
ALL RIGHTS RESERVED
To Karen, Susan, Hojjat, and Neda
i
Acknowledgements
I would like to express my sincere gratitude to my Ph.D. advisor, Professor Mihailo
Jovanovic, for his continuous support and guidance through these years. He was always
ready to help and never gave up on making me a better researcher. He was the perfect
role model in research with his great vision, immense knowledge, intelligence, and pa-
tience. I am extremely fortunate to have the opportunity to work with him and I will
always be grateful for all his advices and support. His kindness and friendship have
made the hard days of the graduate school enjoyable for me.
I would like to thank Professors Antonio Ortega, Ketan Savla, Urbashi Mitra, and
Meisam Razaviyayn for serving on my doctoral committee. I have received invaluable
comments and feedback from all of them. Also, I would like to thank Professors Tryphon
Georgiou, Andrew Lamperski, Yousef Saad, and Tom Luo for their extremely useful
feedback on my research. Also, I am extremely lucky to have the opportunity to work
with Professor Meyn. His insightful comments and suggestions have made our joint
work successful.
I am very grateful to have my past and current lab mates, Xiaofan Wu, Neil Dhingra,
Armin Zare, Yongxin Chen, Wei Ran, Hesam Mohammadi, and Dongsheng Ding for
making these graduate school years more interesting. I would like to also thank the
senior members Rashad Moarref, Fu Lin, and Binh Lieu who built the foundation of
my research by their hard work.
I am incredibly fortunate for having my PhD program in Minneapolis and Los An-
geles where I met my friends for life. I would like to thank Fatemeh Sheikholeslami,
Pardees Azodanloo, Maziar Sanjabi, Roushanak Navvab, Farnaz Forootaninia, Pooyan
Behnam, Mehdi Lamee, Behnaz Forootaninia, Nazila Haratipour, Mehrdad Hairani,
ii
Maral Mousavi, Meisam Razaviyayn, Saghar Sadeghpour, Pooya Rezaei, Samira Tabar,
Farnoosh Zough, and Rashad Moarref. Thank you for constantly providing love and
support.
I would like to sincerely thank my family. My parents have always been teaching
me to believe in myself. They have scarified a lot for me to be here and I am forever
indebted to them. I am grateful to my sister Neda for her constant support and infinite
love. Finally and most importantly, I would like to thank my wonderful husband, Karen,
for being on my side, loving me, and supporting me through all these years.
iii
Abstract
Large-scale networks of dynamical systems are becoming increasingly important
in science and engineering with applications ranging from economics, social networks,
power systems, and robotics. As a consequence, network science has emerged as an in-
terdisciplinary field that draws from diverse disciplines including graph theory, matrix
theory, dynamical systems, optimization, and statistical mechanics. One of the major
challenges is the development of fast and scalable methods for analysis and design of
these systems. Such systems are usually large-scale, they involve large-scale interconnec-
tions of components but have limitations on communication power, and their dynamics
may be unknown. In large-scale dynamical systems, centralized implementation of con-
trol and estimation policies is computationally expensive or even infeasible. Moreover,
it is often the case that the underlying dynamics are unknown and only limited input-
output measurements are available from the system. To overcome these challenges, we
combine tools and ideas from control theory, distributed optimization, reinforcement
learning, and compressive sensing to develop distributed estimation and control strate-
gies that utilize limited information exchange between the individual subsystems and
do not require knowledge of the underlying dynamics.
In Part I of this dissertation, we study the problem of optimal topology design
for stochastically forced undirected consensus networks in order to optimally enhance
closed-loop performance. The performance is given by the steady-state variance am-
plification of the network with additive stochastic disturbances. We use a sparsity
promoting optimal control framework to avoid computational complexity. In particu-
lar, we introduce `1-regularization into the optimal H2 formulation and cast the design
problem as a nonsmooth composite optimization problem. By exploiting the structure
of the problem, we develop customized proximal gradient and Newton algorithms that
are well-suited for large-scale problems. We illustrate that our algorithms can solve
the problems with more than million edges in the controller graph in a few minutes,
on a PC. We also exploit structure of connected resistive networks to demonstrate how
additional edges can be systematically added in order to minimize the H2 norm of the
iv
closed-loop system. Moreover, we study the problem of performance enhancement in
stochastically-forced directed consensus networks by adding edges to an existing topol-
ogy. We formulate the problem as a feedback control design, and represent the links
as the elements of the controller graph Laplacian matrix. To deal with the structural
constraints that arise from the absence of absolute measurements, we introduce a coor-
dinate transformation to eliminate the average mode and assure convergence of all states
to the average of the initial node values. By exploiting structure of the optimization
problem, we develop a customized algorithm based on the alternating direction method
of multipliers to design a sparse controller network that improves the performance of
the closed-loop system.
Part II studies the problem of identifying sparse interaction topology using sam-
ple covariance matrix of the states of the network. We assume that the statistics
are generated by a stochastically-forced undirected consensus network with unknown
topology. We propose a method for identifying the topology using a regularized Gaus-
sian maximum likelihood framework where the `1 regularizer is introduced as a means
for inducing sparse network topology. The proposed algorithm employs a sequential
quadratic approximation in which the Newton’s direction is obtained using coordinate
descent method. We also develop a method based on growing a Chow-Liu tree that is
well-suited for identifying the underlying structure of large-scale systems in which some
of the nodes may have access to their own states.
In Part III, we study the algorithms that can be used to solve nonsmooth composite
optimization problems from control-theoretic point of view. We view proximal algo-
rithms as dynamical systems and leverage techniques from control theory to study their
global convergence properties. In particular, for problems with strongly convex objec-
tive functions, we utilize the theory of integral quadratic constraints to prove global
exponential stability of the differential equations that govern the evolution of proximal
gradient and Douglas-Rachford splitting flows. Moreover, we establish conditions for
global exponential convergence even in the absence of strong convexity. We also study
a class of nonsmooth composite optimization problems in which the convex objective
function is given by a sum of differentiable and nondifferentiable terms. We propose a
primal-descent dual-ascent gradient flow method that exploits separability of the objec-
tive function and is well-suited for in-network optimization. We prove global asymptotic
v
stability of the proposed algorithm and solve the problem of growing undirected con-
sensus networks in a distributed manner to demonstrate its effectiveness.
Part IV focuses on the distributed design of structured feedback gains for large-scale
systems. This is a challenging problem even in the absence of additional structural con-
straints on the feedback controller. We approach the structured optimal control problem
via a data-driven framework that does not require knowledge of the system parameters
and avoids the need to solve large-scale matrical equations. For the structured opti-
mal H2 state-feedback problem, we show that the objective function and its gradient
can be computed from data and develop customized proximal algorithms based on gra-
dient descent and incremental gradient method. Moreover, we exploit separability of
the objective function and utilize an ADMM-based consensus algorithm to solve the
regularized optimal control problem in a distributed manner over multiple processors.
vi
Table of Contents
Acknowledgements ii
Abstract iv
List of Tables xii
List of Figures xiii
1 Introduction 1
1.1 Main topics of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Distributed design and control of consensus networks . . . . . . . 3
1.1.2 Network estimation and inference . . . . . . . . . . . . . . . . . . 5
1.1.3 Control theoretic approach for analysis of optimization algorithms 7
1.1.4 Distributed in-network optimization . . . . . . . . . . . . . . . . 8
1.1.5 Data-driven and model-free distributed control . . . . . . . . . . 9
1.2 Dissertation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Contributions of the dissertation . . . . . . . . . . . . . . . . . . . . . . 12
I Distributed design and control of networks 16
2 Topology design for stochastically forced undirected consensus net-
works 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Design of optimal sparse topology . . . . . . . . . . . . . . . . . 22
vii
2.2.2 Structured optimal control problem: debiasing step . . . . . . . . 24
2.2.3 Gradient and Hessian of J(x) . . . . . . . . . . . . . . . . . . . . 25
2.3 Dual problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Customized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Proximal gradient method . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Proximal Newton method . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Growing connected resistive networks . . . . . . . . . . . . . . . . . . . . 34
2.5.1 Proximal gradient method . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 Proximal Newton method . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.2 Large-scale Facebook network . . . . . . . . . . . . . . . . . . . . 39
2.6.3 Random disconnected network . . . . . . . . . . . . . . . . . . . 41
2.6.4 Path and ring networks . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3 Edge addition in directed consensus networks 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Motivation and background . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Directed graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2 Feedback design in consensus networks . . . . . . . . . . . . . . . 48
3.2.3 Elimination of the average mode . . . . . . . . . . . . . . . . . . 50
3.3 Topology design for directed networks . . . . . . . . . . . . . . . . . . . 52
3.3.1 Sparsity-promoting optimal control problem . . . . . . . . . . . . 52
3.3.2 Structured optimal control problem . . . . . . . . . . . . . . . . 53
3.4 An ADMM-based algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 F -minimization step . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.2 Lx-minimization step . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Synthetic example . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.2 Performance improvement . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
viii
II Network estimation and inference 60
4 Topology identification of undirected consensus networks via sparse
inverse covariance estimation 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Undirected consensus networks . . . . . . . . . . . . . . . . . . . 62
4.2.2 Topology identification . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Customized algorithm based on sequential quadratic approximation . . 65
4.3.1 Structured problem: debiasing step . . . . . . . . . . . . . . . . . 66
4.3.2 Gradient and Hessian of J(x) . . . . . . . . . . . . . . . . . . . . 66
4.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 Network identification . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.2 Network sparsification . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 Topology identification via growing a Chow-Liu tree network 76
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Topology identification via structured graphical-LASSO . . . . . . . . . 77
5.3 Growing a Chow-Liu tree network . . . . . . . . . . . . . . . . . . . . . 80
5.4 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.1 Structured graphical-LASSO algorithm . . . . . . . . . . . . . . 83
5.4.2 Topology identification via growing a Chow-Liu tree . . . . . . . 84
5.4.3 FMRI dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
III Control theoretic approach for analysis of optimization algo-
rithms 88
6 Proximal gradient flow and Douglas-Rachford splitting dynamics: global
exponential stability via integral quadratic constraints 89
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
ix
6.2 Problem formulation and background . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Proximal operators and the associated envelopes . . . . . . . . . 91
6.2.2 Proximal augmented Lagrangian . . . . . . . . . . . . . . . . . . 94
6.2.3 Strong convexity and Lipschitz continuity . . . . . . . . . . . . . 95
6.2.4 Proximal Polyak-Lojasiewicz inequality . . . . . . . . . . . . . . 95
6.3 Exponential stability of proximal algorithms . . . . . . . . . . . . . . . . 96
6.3.1 Strongly convex problems . . . . . . . . . . . . . . . . . . . . . . 98
6.3.2 Proximal Polyak-Lojasiewicz condition . . . . . . . . . . . . . . . 100
6.4 Global exponential stability of Douglas-Rachford splitting dynamics . . 102
6.4.1 Non-smooth composite optimization problem . . . . . . . . . . . 102
6.4.2 Douglas-Rachford splitting on the dual problem . . . . . . . . . . 107
6.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7 Distributed proximal augmented Lagrangian method for nonsmooth
composite optimization 109
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1 Motivating application . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Proximal augmented Lagrangian . . . . . . . . . . . . . . . . . . . . . . 116
7.4 Arrow-Hurwicz-Uzawa gradient flow . . . . . . . . . . . . . . . . . . . . 118
7.5 Distributed implementation . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5.1 Primal update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5.2 Dual updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.6 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
IV Data-driven and model-free distributed control 125
8 Data-driven proximal algorithms for the design of structured optimal
feedback gains 126
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
x
8.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.3 Computation from data . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3.1 Computation of P and L from data . . . . . . . . . . . . . . . . 131
8.3.2 Computation of PL from data . . . . . . . . . . . . . . . . . . . 132
8.4 Systems with special structure . . . . . . . . . . . . . . . . . . . . . . . 134
8.5 Proximal algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.5.1 Proximal gradient descent . . . . . . . . . . . . . . . . . . . . . . 138
8.5.2 Incremental proximal gradient . . . . . . . . . . . . . . . . . . . 139
8.6 An ADMM-based consensus algorithm . . . . . . . . . . . . . . . . . . . 140
8.6.1 Computational experiments . . . . . . . . . . . . . . . . . . . . . 143
8.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9 Conclusions 148
References 150
Appendix A. Proximal PL condition 169
A.1 Pointwise quadratic inequality . . . . . . . . . . . . . . . . . . . . . . . . 171
Appendix B. On the exponential convergence rate of proximal gradient
flow algorithms 172
B.0.1 Distributed optimization . . . . . . . . . . . . . . . . . . . . . . . 172
xi
List of Tables
2.1 Comparison of our customized proximal algorithms . . . . . . . . . . . . 33
2.2 Algorithmic properties of the primal dual IP method. . . . . . . . . . . 33
2.3 Comparison of algorithms (solve times in seconds/number of iterations)
for the problem of growing connected resistive Erdos-Renyi networks with
different number of nodes n, edge probability 1.05 log(n)/n, and γ =
0.8 γmax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 Table 2.3 continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
xii
List of Figures
1.1 (a) Social network. (b) Computer network. (c) Vehicular formation. (d)
Power network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 (a) Solve times (in seconds); and (b) performance degradation (in per-
cents) of proximal gradient and greedy algorithms relative to the optimal
centralized controller. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 (a) Sparsity level; and (b) optimal tradeoff curves resulting from the
application of proximal gradient algorithm and a heuristic strategy for
the Facebook network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3 Topologies of the plant (blue lines) and controller graphs (red lines) for
an unweighted random network with three disconnected subgraphs. . . . 42
2.4 (a) Sparsity level; (b) performance degradation; and (c) the optimal
tradeoff curve between the performance degradation and the sparsity
level of optimal sparse x compared to the optimal centralized vector of
the edge weights xc. The results are obtained for unweighted random
disconnected plant network with topology shown in Fig. 2.3. . . . . . . . 43
2.5 The problems of growing unweighted path (top row) and ring (bottom
row) networks. Blue lines identify edges in the plant graph, and red lines
identify edges in the controller graph. . . . . . . . . . . . . . . . . . . . 44
3.1 Topologies of the plant (blue lines) and controller graphs (red lines) for
a randomly generated weighted network. . . . . . . . . . . . . . . . . . . 58
xiii
3.2 (a) Sparsity level; (b) performance degradation; and (c) the optimal
tradeoff between the performance degradation and the sparsity level of
the optimal sparse Lx compared to the optimal centralized controller Lc.
The results are obtained for the weighted random plant network with
topology shown in Fig. 3.1a. . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 The problem of identification of sparse networks using sample covariance
matrix for a network with n = 12 nodes. . . . . . . . . . . . . . . . . . . 71
4.2 The problem of identification of sparse networks using sample covariance
matrix for a network with n = 12 nodes. . . . . . . . . . . . . . . . . . . 72
4.3 The problem of sparsification of a network with n = 50 nodes using
sample covariance matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Relative error in log-scale for the sparsification problem of a network with
n = 50 nodes using sample covariance matrix. . . . . . . . . . . . . . . . 74
5.1 The RC tree network with n = 10 nodes. . . . . . . . . . . . . . . . . . . 83
5.2 (a) Plant network; (b) constructed Chow-Liu tree network; and (c) the
identified network from growing the Chow-Liu tree. . . . . . . . . . . . . 84
5.3 The identified networks for different values of γ. . . . . . . . . . . . . . . 86
7.1 Speedup ratio versus the number of the cores used for growing connected
resistive networks with n nodes. . . . . . . . . . . . . . . . . . . . . . . . 124
8.1 A mass-spring-damper system on a line. . . . . . . . . . . . . . . . . . . 143
8.2 Structure of feedback gains resulting from the solution to (8.6) for dif-
ferent values of the regularization parameter γ for an MSD system with
N = 10 masses. The number of non-zero elements in K is denoted by nz. 144
8.3 (a) Sparsity level; (b) performance degradation; and (c) the optimal
tradeoff between the performance degradation and the sparsity level of
the optimal sparse x compared to the optimal centralized controller xc.
The results are obtained for a randomly generated Erdos-Renyi network
with n = 20 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.4 Speedup ratio versus the number of the cores used for growing connected
resistive networks with n nodes. . . . . . . . . . . . . . . . . . . . . . . . 146
xiv
Chapter 1
Introduction
Traditional control theory is concerned with implementing the control and estimation
policies in a centralized fashion [1]. However, in large networks of dynamical systems,
centralized processing is often computationally expensive or even practically infeasible.
This motivates the development of distributed control strategies that require limited
information exchange between the individual subsystems. Analysis and design of dis-
tributed averaging protocols have been an active research area recently; e.g., see [2–11].
Moreover, distributed computing over networks is of fundamental importance in
network science [2]. Consensus problem has received significant attention because it
encounters in a number of applications ranging from animal group behavior [12,13], to
social networks [14,15], to power systems [16–18], to distributed processing networks [19,
20], to spreading processes [21,22], and to coordination of autonomous vehicles [23–26],
see Fig. 1.1 for some examples. An inherent challenge in all of these problems is that
it is desired for all nodes to reach an agreement or to achieve synchronization by only
exchanging relative information with limited number of nodes. The restriction on the
absence of the absolute measurements imposes structural constraints for the analysis
and design.
Furthermore, in many of the real-world applications, the dynamics of the underly-
ing system are unknown, the dynamical systems are large-scale, and only input-output
measurements are available. This necessitates the development of theory and techniques
that utilize distributed computing architectures to cope with large problem sizes and
1
2
(a) (b)
(c) (d)
Figure 1.1: (a) Social network. (b) Computer network. (c) Vehicular formation. (d)Power network.
do not require the knowledge of underlying dynamics of the system. Moreover, in some
applications, having access to even the input-output measurements is not feasible and
only second-order data is available. Recovering network topology from partially avail-
able statistical signatures is a challenging problem. Due to experimental or numerical
limitations, it is often the case that only noisy partial network statistics are known.
Thus, the objective is to account for all observed correlations that may be available
and to develop efficient identification algorithms that are well-suited for big data prob-
lems [27,28].
Moreover, analysis of optimization algorithms from the system theoretic point of
3
view has received significant recent attention [29–31]. In these references, the optimiza-
tion algorithm is interpreted as a feedback interconnection in which the states converge
to the optimal solution of the optimization problem. Many modern large-scale and dis-
tributed optimization problems can be cast into a form in which the objective function
is a sum of a smooth term and a nonsmooth regularizer. Such problems can be solved
via a proximal gradient method which generalizes standard gradient descent to a non-
smooth setup. We leverage the tools from control theory to study global convergence
of proximal gradient flow algorithms in continuous time. We also provide a distributed
implementation of the gradient flow dynamics based on the proximal augmented La-
grangian and prove global exponential stability for strongly convex problems.
In this dissertation, we combine tools and ideas from control theory, distributed
optimization, reinforcement learning, and compressive sensing to develop distributed
estimation and control strategies that require limited information exchange between
the individual subsystems. The remainder of this introductory chapter is organized as
follows. In Section 1.1, we overview different topics of the dissertation and we pro-
vide a brief dicsussion on each of them. The outline of the dissertation is provided in
Section 1.2. Finally, we provide a summary of our main results and contributions in
Section 1.3.
1.1 Main topics of the dissertation In this section, we discuss the
main topics of the dissertation.
1.1.1 Distributed design and control of consensus networks Reaching
agreement using relative information exchange in a decentralized fashion has received
lots of attention. In large networks, centralized implementation of control policies im-
poses heavy communication and computation burden on individual nodes. This mo-
tivates the development of distributed control strategies that require limited informa-
tion exchange between the nodes in order to reach consensus or achieve synchroniza-
tion [5, 7–9].
Optimal design of the edge weights for networks with pre-specified topology has
received significant attention. In [3], the design of the fastest averaging protocol for
4
undirected networks was cast as a semidefinite program (SDP). Two customized algo-
rithms, based on primal barrier interior-point (IP) and subgradient methods, were de-
veloped and the advantages of optimal weight selection over commonly used heuristics
were demonstrated. Similar SDP characterization, for networks with state-dependent
graph Laplacians, was provided in [4]. The allocation of symmetric edge weights that
minimize the mean-square deviation from average for networks with additive stochas-
tic disturbances was solved in [5]. A related problem, aimed at minimizing the total
effective resistance of resistive networks, was addressed in [7]. In [8], the edge Lapla-
cian was used to provide graph-theoretic characterization of the H2 and H∞ symmetric
agreement protocols.
Network coherence quantifies the ability of distributed estimation and control strate-
gies to guard against exogenous disturbances [6, 10]. The coherence is determined by
the sum of reciprocals of the non-zero eigenvalues of the graph Laplacian and its scaling
properties cannot be predicted by algebraic connectivity of the network. In [10], per-
formance limitations of spatially-localized consensus protocols on regular lattices were
examined. It was shown that the fundamental limitations for large-scale networks
are dictated by the network topology rather than by the optimal selection of the edge
weights. Moreover, epidemic spread in networks is strongly influenced by their topol-
ogy [21, 22, 32]. Thus, optimal topology design represents an important challenge. It is
precisely this problem, for undirected consensus networks, that we address here.
More specifically, we study an optimal control problem aimed at achieving a de-
sired tradeoff between the network performance and communication requirements in
the distributed controller. Our goal is to add a certain number of edges to a given
undirected network in order to optimally enhance the closed-loop performance. One
of our key contributions is the formulation of topology design as an optimal control
problem that admits convex characterization and is amenable to the development of
efficient optimization algorithms. In our formulation, the plant network can contain
disconnected components and optimal topology of the controller network is an integral
part of the design. In general, this problem is NP-hard [33] and it amounts to an in-
tractable combinatorial search. Several references have examined convex relaxations
or greedy algorithms to design topology that optimizes algebraic connectivity [34] or
5
network coherence [9, 35–37].
In this dissertation, we tap on recent developments regarding sparse representa-
tions in conjunction with regularization penalties on the level of communication in a
distributed controller. Similar approaches have been used for solving the problem of
optimal sensors/actuators selections in large-scale dynamical systems [38–40]. This
allows us to formulate convex optimization problems that exploit the underlying struc-
ture and are amenable to the development of efficient optimization algorithms. To
avoid combinatorial complexity, we approach optimal topology design using a sparsity-
promoting optimal control framework introduced in [41, 42]. Performance is captured
by the H2 norm of the closed-loop network and `1-regularization is introduced to pro-
mote controller sparsity. While this problem is in general nonconvex [42], for undirected
networks we show that it admits a convex characterization with a non-differentiable ob-
jective function and a positive definite constraint. This problem can be transformed
into an SDP and, for small size networks, the optimal solution can be computed using
standard IP method solvers, e.g., SeDuMi [43] and SDPT3 [44].
Moreover, we study the problem of performance enhancement in stochastically-
forced directed consensus networks by adding edges to an existing topology. Significant
amount of research has been devoted to the study of the consensus problem in networks,
where both the plant and the controller graphs are undirected. We consider the problem
of adding edges to a weakly connected directed consensus network in order to improve
performance. In particular, we are interested in designing sparse communication graphs
that strike a balance between the variance amplification of the closed-loop system and
the number of communication links. In general, this is a combinatorial search problem
and is non-convex.
1.1.2 Network estimation and inference Identifying network topology and
learning graphical models from partially available statistical signatures are topics of im-
mense interest in areas ranging from machine learning to statistics to neuroscience [45–
52]. Moreover, structured covariance completion problem has received significant atten-
tion in the recent years [53–55]. In these works, the unknown elements of a partially
known covariance matrix are estimated such that they are consistent with the dynamics
of the underlying system. Studying the human brain as a complex network has received
6
significant attention recently [56–58]. The brain functional connectivity can be mea-
sured by computing the correlation between time-series functional magnetic resonance
imaging (FMRI) data. The functional connectivity structure between different regions
can be revealed by utilizing different thresholding techniques [59, 60]. In general, this
is a challenging problem because it is often the case that only noisy partial network
statistics are known. The goal is to develop an efficient algorithm for recovering the
underlying topology of a network utilizing the limited sample data.
Recovering the underlying network topology using sample covariance matrix of the
node values under structural constraints has been studied in [27, 61, 62]. Moreover, a
rich body of literature has been devoted to the problems of designing network topol-
ogy to improve performance [33, 34, 36, 37, 63]. Several algorithms can be employed to
identify the underlying network structure from limited statistical data. In [64], the au-
thors show inability of standard graphical-LASSO to identify network topology. It was
demonstrated that this popular algorithm fails to recover the underlying topology even
when the abundance of data is available.
Over the last decade, a rich body of literature has been devoted to the problems of
designing network topology to improve performance [9,34–37,63,65,66] and identifying
an unknown network topology from available data [45–51, 67]. Moreover, the problem
of sparsifying a network with dense interconnection structure in order to optimize a
specific performance measure has been studied in [68, 69]. In [70], it was demonstrated
that individual starlings within a large flock interact only with a limited number of
immediate neighbors. The authors have shown that the networks with six or seven
nearest neighbors provide an optimal trade-off between flock cohesion and effort of
individual birds. These observations were made by examining how robustness of the
group depends on the network structure. The proposed measure of robustness captures
the network coherence [10] and it is well-suited for developing tractable optimization
framework.
We develop a convex optimization framework for identifying sparse interaction topol-
ogy using sample covariance matrix of the state of the network. Our framework utilizes
an `1-regularized Gaussian maximum likelihood estimator. We also develop a method
based on growing a Chow-Liu tree that is well-suited for identifying the underlying
7
structure of large-scale systems. We apply this technique to resting-state functional
MRI (FMRI) data as well as synthetic datasets to illustrate the effectiveness of the
proposed approach.
1.1.3 Control theoretic approach for analysis of optimization algo-
rithms Structured optimal control and inverse problems, that arise when trying to
identify and control dynamical representations of systems evolving in real-time, typ-
ically lead to optimization of objective functions that consist of a sum of a smooth
term and a nonsmooth regularizer. Such problems are of increasing importance in con-
trol [41, 42, 71] and it is thus necessary to develop efficient algorithms for distributed
and embedded nonsmooth composite optimization [72–75]. The lack of differentiability
in the objective function precludes the use of standard descent methods from smooth
optimization. Proximal gradient method [76] generalizes gradient descent to nonsmooth
context and provides a powerful tool for solving problems in which the nonsmooth term
is separable over the optimization variable.
Examining optimization algorithms as continuous-time dynamical systems has been
an active topic since the seminal work of Arrow, Hurwicz, and Uzawa [77]. This view-
point can provide important insight into performance of optimization algorithms and,
during the last decade, it has been advanced and extended to a broad class of prob-
lems including convergence analysis of primal-dual [73, 78–82] and accelerated [83–88]
first-order methods. This should be compared and contrasted to existing techniques
which employ subgradient methods and discontinuous projected dynamics [89–91]. Fur-
thermore, establishing the connection between theory of ordinary differential equations
(ODEs) and numerical optimization algorithms has been a topic of many studies, in-
cluding [92,93]; for recent efforts, see [84,94].
Most algorithms can be viewed as a feedback interconnection of linear dynamical sys-
tems with nonlinearities that posses certain structural properties. This system-theoretic
interpretation was exploited in [31] and further advanced in a number or recent pa-
pers [81, 95–101]. The key idea is to exploit structural features of linear and nonlinear
terms and utilize theory and techniques from stability analysis of nonlinear dynamical
systems to study properties of optimization algorithms. This approach provides new
8
methods for studying not only convergence rate but also robustness of optimization rou-
tines [102–105] and can lead to new classes of algorithms that strike a desired tradeoff
between the speed and robustness.
We study a class of nonsmooth composite convex optimization problems in which the
objective is a sum of a differentiable function and a possibly nondifferentiable regular-
izer. These problems emerge in compressive sensing, machine learning, and control. For
example, structured feedback design can be cast as a nonsmooth composite optimiza-
tion problem [42,71,106]. Standard descent methods cannot be used in the presence of
nondifferentiable component. Proximal gradient algorithms [76,107] offer viable alterna-
tives to the generic descent methods for solving nonsmooth problems. Another effective
strategy is to transform the associated augmented Lagrangian into the continuously
differentiable proximal augmented Lagrangain [108] in which the former is restricted
to the manifold that corresponds to the explicit minimization over the variable in the
nonsmooth term.
1.1.4 Distributed in-network optimization Distributed algorithms are crit-
ically important for solving large-scale optimization problems. The decentralized con-
sensus problem in multi-agent networks [72, 73, 109, 110] arises in many applications.
Moreover, distributed control techniques are critically important in the design of large-
scale systems. In these systems, conventional control strategies that rely on centralized
computation and implementation are often prohibitively expensive. For example, find-
ing the optimal controller requires computation of the solution to the algebraic Riccati
equations which is often infeasible because of high computational requirements. This ne-
cessitates the development of theory and techniques that utilize distributed computing
architectures to cope with large problem sizes.
We study a class of nonsmooth composite optimization problems in which the con-
vex objective function is a sum of differentiable and nondifferentiable functions. Among
other applications, these problems emerge in machine learning, compressive sensing,
and control. Recently, regularization has been used as a promising tool for enhanc-
ing utility of standard optimal control techniques. Generic descent methods cannot
be used in the nonsmooth composite optimization problems due to the presence of a
9
nondifferentiable component in the objective function. Moreover, these standard meth-
ods are not well-suited for distributed implementation. An alternative approach is to
separate the smooth and nonsmooth parts of the objective function and use the alter-
nating direction method of multipliers (ADMM). In [111], we exploit separability of
the objective function and utilize an ADMM-based consensus algorithm to solve the
regularized optimal control problem in a distributed manner over multiple processors.
Even though the optimal control problem is in general non-convex, recent results can
be utilized to show convergence to a local minimum [112]. However, in an update step
of the ADMM algorithm, all the processors halt to compute the weighted average (the
gathering step) [113].
By introducing auxiliary variables in nondifferentiable terms, we provide an equiva-
lent consensus-based characterization that is convenient for distributed implementation.
The Moreau envelope associated with the nonsmooth part of the objective function is
used to bring the optimization problem into a continuously differentiable form that
serves as a basis for the development of a primal-descent dual-ascent gradient flow
method. This algorithm exploits separability of the objective function and is well-suited
for in-network optimization. We prove global asymptotic stability of the proposed al-
gorithm. Moreover, we study primal-descent dual-ascent gradient flow dynamics based
on the proximal augmented Lagrangian [108] that can be used as an effective algorithm
to solve smooth optimization problems with a separable objective function. We provide
a distributed implementation and prove global exponential stability in the presence of
strong convexity.
1.1.5 Data-driven and model-free distributed control One of the major
challenges in analysis and design of large-scale dynamical systems is the development
of fast and scalable methods. Such systems involve large-scale interconnections of com-
ponents, have rapidly-evolving structure and limitations on communication/processing
power, and require real-time distributed control actions. These requirements make con-
trol strategies that rely on centralized information processing infeasible and motivate
new classes of optimal control problems. In these, standard performance metrics are
augmented with typically nonsmooth regularizers to promote desired structural features
(e.g., low communication requirements) in the optimal controller [41, 42, 71, 114, 115].
10
Moreover, in many applications, the dynamics of the plant are unknown and only lim-
ited input-output measurements are available. Designing optimal controllers for these
systems is challenging even in the absence of structural constraints. The LQR problem
in the model-free setup has been recently studied in [116,117].
Modern control applications impose additional requirements on controller design
that cannot be addressed using standard optimal control tools. These requirements may
arise from limited communication and computation resources or the size of the problem.
The standard optimal control techniques typically induce an all-to-all communication
requirements in the controller which is infeasible in large-scale setting. Recently, reg-
ularization has emerged as a promising tool for enhancing utility of standard optimal
control techniques. In this approach, commonly used performance measures (e.g., H2 or
H∞) are augmented with regularization functions that are supposed to promote some
desired structural features in the distributed controller, e.g., sparsity. Such an approach
has received significant level of attention in recent years [41, 42, 71, 106, 115, 118, 119],
but computing optimal solutions in large-scale problems still remains a challenge. Dis-
tributed computing techniques have been commonly used to cope with large problem
sizes; for example, stability and synthesis of cooperative distributed model predictive
controllers for linear systems have been recently studied in [120].
Herein, we address complexity through a combination of linear-algebraic techniques
and computational methods adapted from both machine learning and reinforcement
learning. We study the problem of designing optimal structured feedback gains for
large-scale systems with unknown dynamics in a distributed manner. We quantify per-
formance using the H2 norm and introduce regularization functions to promote desired
structural properties in the controller. The key challenge is to evaluate the objective
function and its gradient without solving the large-scale Lyapunov equations. By ex-
ploiting the square-additive property of the H2 norm, we propose customized proximal
and ADMM-based algorithms that are convenient for distributed implementation.
1.2 Dissertation structure This dissertation consists of four parts. Each
part focuses on a specific topic and includes individual chapters that studies relevant
subjects. In each chapter, we provide a brief introduction, background and motiva-
tion, problem formulation, algorithm and design steps, computational experiments, and
11
conclusion.
Part I. Traditional control theory is concerned with implementing the control poli-
cies in a centralized fashion. However, due to high communication and computation
costs, this is not practically feasible for large-scale networks of dynamical systems.
Leveraging tools from compressive sensing and distributed optimization, we have devel-
oped a theoretical and computational framework for designing distributed controllers for
large-scale systems [63, 66, 106, 119]. In these, standard performance metrics are aug-
mented with typically nonsmooth regularizers to promote desired structural features
(e.g., low communication requirements) in the optimal controller. These algorithms
scale gracefully with problem size and are well-suited for distributed implementation
over multiple processors.
Part II. Identifying network topology and learning graphical models from partially
available statistical signatures are topics of immense interest in areas ranging from
machine learning to statistics to neuroscience. A motivating application in neuroscience
is reconstructing the pattern of causal interactions between distinct units of the brain
using time series data of neural activities. For undirected consensus networks, the
topology identification problem can be imposed as a sparse inverse covariance estimation
with additional structural constraints. We have developed an algorithm to identify
simple models of large consensus networks from a known sample covariance matrix [27].
Moreover, we have developed an inference algorithm based on growing a Chow-Liu tree
which is well-suited for identifying the underlying structure of large-scale dynamical
systems [28].
Part III. Analysis of optimization algorithms from the system theoretic point of
view has received significant recent attention [29–31]. In these references, the optimiza-
tion algorithm is interpreted as a feedback interconnection in which the states converge
to the optimal solution of the optimization problem. Moreover, centralized computa-
tion over large-scale systems are often impractical or very expensive. In this part, we
propose distributed computational algorithm based on alternating direction method of
multiplier and proximal augmented Lagrangian for solving non-smooth composite opti-
mization problems over networks and provide global convergence analysis using control
12
theooretic techniques [98,99,121].
Part IV. Distributed feedback design and complexity constrained control are exam-
ples of problems posed within the domain of structured optimal feedback synthesis. The
optimal feedback gain is typically a non-convex function of system primitives. However,
in recent years, algorithms have been proposed to obtain locally optimal solutions. In
applications to large-scale distributed control, the major obstacle is computational com-
plexity. We address complexity through a combination of linear-algebraic techniques
and computational methods adapted from both machine learning and reinforcement
learning. Moreover, in many applications, the dynamics of the plant are unknown and
only limited input-output measurements are available. Designing optimal controllers
for these systems is challenging even in the absence of structural constraints. We have
shown that for general classes of optimal control problems, the objective function and
its gradient can be computed from data [111]. By employing incremental and stochastic
gradient and ADMM based algorithms, we have developed a data-driven framework for
the structured optimal control problem that does not require knowledge of the system
parameters and avoids the need to solve large-scale matrical equations [122].
1.3 Contributions of the dissertation In this section, the structure
of the dissertation is provided along with the main contributions of each part.
Part I
Topology design for stochastically forced undirected consensus networks.
We have examined the problem of optimal topology design of the corresponding edge
weights for undirected consensus networks. Our approach uses convex optimization to
balance performance of stochastically-forced networks with the number of edges in the
distributed controller. For `1-regularized minimum variance optimal control problem,
we have derived a Lagrange dual and exploited structure of the optimality conditions
for undirected networks to develop customized algorithms that are well-suited for large
problems. These are based on the proximal gradient and the proximal Newton meth-
ods. The proximal gradient algorithm is a first-order method that updates the controller
graph Laplacian via the use of the soft-thresholding operator. In the proximal Newton
13
method, sequential quadratic approximation of the smooth part of the objective func-
tion is employed and the Newton direction is computed using cyclic coordinate descent
over the set of active variables. Examples are provided to demonstrate utility of our
algorithms. We have shown that proximal algorithms can solve the problems with mil-
lions of edges in the controller graph in several minutes, on a PC. Furthermore, we have
specialized our algorithm to the problem of growing connected resistive networks. In
this, the plant graph is connected and there are no joint edges between the plant and the
controller graphs. We have exploited structure of such networks and demonstrated how
additional edges can be systematically added in a computationally efficient manner.
Edge addition in directed consensus networks.
We consider the `1 regularized version of optimal control problem for adding edges
to directed consensus networks in order to reach consensus and optimally enhance per-
formance. Although the given plant network is not necessarily balanced, in order to
reach agreement, we restrict the closed-loop graph Laplacian to be balanced. The
performance is measured by the H2 norm from the disturbance to the output of the
closed-loop network. In general, this problem is a combinatorial search problem. We
use sparsity promoting optimal control framework and introduce weighted `1 regular-
ization as a proxy for promoting sparsity of the controller. By exploiting structure of
the problem, we develop an algorithm based on ADMM.
Part II
Topology identification of undirected consensus networks via sparse inverse
covariance estimation.
We have developed a method for identifying the topology of an undirected consensus
network using available statistical data. In order to promote network sparsity, we intro-
duce a convex optimization framework aimed at finding the solution to the `1-regularized
maximum likelihood problem. This problem is closely related to the problem of sparse
inverse covariance estimation that has received significant attention in the literature.
In our setup, additional structure arises from the requirement that data is generated
by an undirected consensus network. By exploiting the structure of the problem, we
develop an efficient algorithm based on the sequential quadratic approximation method
14
in which the search direction is determined using coordinate descent with active set
strategy. Several examples have been provided to illustrate utility of the method and
efficiency of the customized algorithm.
Topology identification via growing a Chow-Liu tree network.
We have studied the problem of sparse topology identification of an undirected con-
sensus network with leaders using second-order statistical data. The goal is to identify
a sparse interaction topology using sample covariance matrix of the network state. We
have introduced two algorithms based on regularized Gaussian maximum likelihood
and growing a Chow-Liu tree. In the first algorithm, we propose a structured graphical-
LASSO algorithm that uses the weighted `1 regularizer as a proxy for inducing sparse
network topology. The other method is based on growing a Chow-Liu tree that is well-
suited for identifying the underlying structure of large-scale networks. Several examples
have been provided to demonstrate the performance of our framework.
Part III
Global exponential stability via integral quadratic constraints for proximal
gradient flow and Douglas-Rachford splitting dynamics.
We study a class of nonsmooth optimization problems in which it is desired to
minimize the sum of a continuously differentiable function with a Lipschitz continuous
gradient and a nondifferentiable function. For strongly convex problems, we employ the
theory of integral quadratic constraints to prove global exponential stability of proximal
gradient flow and Douglas-Rachford splitting dynamics. We also propose a generaliza-
tion of the Polyak-Lojasiewicz condition to nonsmooth problems and demonstrate the
global exponential convergence of the forward-backward envelope for the proximal gra-
dient flow algorithm even in the absence of strong convexity.
Distributed in-network optimization.
We have studied a class of convex nonsmooth composite optimization problems in
which the objective function is a combination of differentiable and nondifferentiable
functions. By exploiting the structure of the probelm, we have provided an equivalent
consensus-based characterization and have developed an algorithm based on primal-
descent dual-ascent gradient flow method. This algorithm exploits the separability of
15
the objective function and is well-suited for distributed implementation. Convexity of
the smooth part of the objective function is utilized to prove global asymptotic stability
of our algorithm. Finally, by exploiting the structure of the H2 norm, we have employed
this algorithm to design a sparse controller network that improves the performance of
the closed-loop system in a large-scale undirected consensus network in a distributed
manner. An example is provided to demonstrate the utility of the developed approach.
We are currently working on implementing this algorithm in C++ and will use it to solve
structured optimal control problems for large-scale systems in a distributed manner.
Part IV
Data-driven proximal algorithms for the design of structured optimal feed-
back gains.
We have considered the problem of designing optimal structured feedback controllers
for large-scale systems. We have shown that for general classes of optimal control
problems, the objective function and its gradient can be computed from data. Trans-
formations borrowed from the theory of reinforcement learning are adapted to obtain
simulation-based algorithms for computing the structured optimal H2 feedback gain.
Customized proximal algorithms based on gradient descent and incremental gradient
are tested in computational experiments and their relative merits are discussed. More-
over, by exploiting the structure of the H2 norm, we have shown that this problem can
be separated into N subproblems and that it can be solved efficiently using distributed
optimization. We have utilized an ADMM-based consensus algorithm to design a sparse
controller network that improves the performance of the closed-loop system. By splitting
the problem into N separate subproblems over N different cores, we have implemented
this algorithm in C++. Our parallel implementation can be used to solve structured
optimal control problems for large-scale systems, e.g., power networks [16,18]. Compu-
tational experiments are provided to demonstrate the utility of the developed approach.
Part I
Distributed design and control of
networks
16
Chapter 2
Topology design for stochastically
forced undirected consensus
networks
In this chapter, we study an optimal control problem aimed at adding a certain number
of edges to an undirected network, with a known graph Laplacian, in order to optimally
enhance closed-loop performance. The performance is quantified by the steady-state
variance amplification of the network with additive stochastic disturbances. To promote
controller sparsity, we introduce `1-regularization into the optimal H2 formulation and
cast the design problem as a semidefinite program. We derive a Lagrange dual, provide
interpretation of dual variables, and exploit structure of the optimality conditions for
undirected networks to develop customized proximal gradient and Newton algorithms
that are well-suited for large problems. We illustrate that our algorithms can solve the
problems with more than million edges in the controller graph in a few minutes, on
a PC. We also exploit structure of connected resistive networks to demonstrate how
additional edges can be systematically added in order to minimize the H2 norm of the
closed-loop system.
2.1 Introduction Herein, we study an optimal control problem aimed at
achieving a desired tradeoff between the network performance and communication re-
quirements in the distributed controller. Our goal is to add a certain number of edges to
17
18
a given undirected network in order to optimally enhance the closed-loop performance.
One of our key contributions is the formulation of topology design as an optimal control
problem that admits convex characterization and is amenable to the development of
efficient optimization algorithms. In our formulation, the plant network can contain
disconnected components and optimal topology of the controller network is an integral
part of the design. In general, this problem is NP-hard [33] and it amounts to an in-
tractable combinatorial search. Several references have examined convex relaxations
or greedy algorithms to design topology that optimizes algebraic connectivity [34] or
network coherence [9, 35–37].
To enable design of large networks, we pay particular attention to the computational
aspects of the edge-addition problem. We derive a Lagrange dual of the optimal control
problem, provide interpretation of dual variables, and develop efficient proximal algo-
rithms. Furthermore, building on preliminary work [63], we specialize our algorithms
to the problem of growing connected resistive networks described in [7, 34]. In this,
the plant graph is connected and inequality constraints amount to non-negativity of
controller edge weights. This allows us to simplify optimality conditions and further
improve computational efficiency of our customized algorithms.
Proximal gradient algorithms [76] and their accelerated variants [107] have recently
found use in distributed optimization, statistics, machine learning, image and signal
processing. They can be interpreted as generalization of standard gradient projection
to problems with non-smooth and extended real-value objective functions. When the
proximal operator is easy to evaluate, these algorithms are simple yet extremely efficient.
For networks that can contain disconnected components and non-positive edge weights,
we show that the proximal gradient algorithm iteratively updates the controller graph
Laplacian via convenient use of the soft-thresholding operator. This extends the Itera-
tive Shrinkage Thresholding Algorithm (ISTA) to optimal topology design of undirected
networks. In contrast to the `1-regularized least-squares, however, the step-size has to
be selected to guarantee positivity of the second smallest eigenvalue of the closed-loop
graph Laplacian. We combine the Barzilai-Borwein (BB) step-size initialization with
backtracking to achieve this goal and enhance the rate of convergence. The biggest com-
putational challenge comes from evaluation of the objective function and its gradient.
We exploit problem structure to speed up computations and save memory. Finally, for
19
the problem of growing connected resistive networks, the proximal algorithm simplifies
to gradient projection which additionally improves the efficiency.
We also develop a customized algorithm based on the proximal Newton method.
In contrast to the proximal gradient, this method sequentially employs the second-
order Taylor series approximation of the smooth part of the objective function; e.g.,
see [123]. We use cyclic coordinate descent over the set of active variables to efficiently
compute the Newton direction by consecutive minimization with respect to individual
coordinates. Similar approach has been recently utilized in a number of applications,
including sparse inverse covariance estimation in graphical models [124].
Both of our customized proximal algorithms significantly outperform a primal-dual
IP method developed in [63]. It is worth noting that the latter is significantly faster than
the general-purpose solvers. While the customized IP algorithm of [63] with a simple
diagonal preconditioner can solve the problems with hundreds of thousands of edges
in the controller graph in several hours, on a PC, the customized algorithms based on
proximal gradient and Newton methods can solve the problems with millions of edges in
several minutes. Furthermore, they are considerably faster than the greedy algorithm
with efficient rank-one updates developed in [37].
Our presentation is organized as follows. In Section 2.2, we formulate the problem
of optimal topology design for undirected networks subject to additive stochastic dis-
turbances. In Section 2.3, we derive a Lagrange dual of the sparsity-promoting optimal
control problem, provide interpretation of dual variables, and construct dual feasible
variables from the primal ones. In Section 2.4, we develop customized algorithms based
on the proximal gradient and Newton methods. In Section 2.5, we achieve additional
speedup by specializing our algorithms to the problem of growing connected resistive
networks. In Section 2.6, we use computational experiments to design optimal topol-
ogy of a controller graph for benchmark problems and demonstrate efficiency of our
algorithms. In Section 2.7, we provide a brief overview of the chapter.
2.2 Problem formulation We consider undirected consensus networks
with n nodes
ψ = −Lp ψ + u + d (2.1)
20
where d and u are the exogenous disturbance and the control input, respectively, ψ is
the state of the network, and Lp is a symmetric n × n matrix that represents graph
Laplacian of the open-loop system, i.e., plant. Such networks arise in applications
ranging from load balancing to power systems to opinion formation to control of multi-
agent systems. The goal is to improve performance of a consensus algorithm in the
presence of stochastic disturbances by adding a certain number of edges (from a given
set of candidate edges). We formulate this problem as a feedback design problem with
u = −Lx ψ
where the symmetric feedback-gain matrix Lx is required to have the Laplacian struc-
ture. This implies that each node in (2.1) forms control action using a weighted sum of
the differences between its own state and the states of other nodes and that information
is processed in a symmetric fashion. Since a nonzero ijth element of Lx corresponds to
an edge between the nodes i and j, the communication structure in the controller graph
is determined by the sparsity pattern of the matrix Lx.
Upon closing the loop we obtain
ψ = − (Lp + Lx)ψ + d. (2.2a)
For a given Lp, our objective is to design the topology for Lx and the corresponding edge
weights x in order to achieve the desired tradeoff between controller sparsity and network
performance. The performance is quantified by the steady-state variance amplification
of the stochastically-forced network, from the white-in-time input d to the performance
output ζ,
ζ :=
[Q1/2
0
]ψ +
[0
R1/2
]u =
[Q1/2
−R1/2Lx
]ψ (2.2b)
which penalizes deviation from consensus and control effort. Here, Q = QT � 0 and
R = RT � 0 are the state and control weights in the standard quadratic performance
index.
The interesting features of this problem come from structural restrictions on the
Lalpacian matrices Lp and Lx. Both of them are symmetric and are restricted to
21
having an eigenvalue at zero with the corresponding eigenvector of all ones,
Lp 1 = 0, Lx 1 = 0. (2.3)
Since each node uses relative information exchange with its neighbors to update its
state, in the presence of white noise, the average mode ψ(t) := (1/n)1Tψ(t) experiences
a random walk and its variance increases linearly with time. To make the average mode
unobservable from the performance output ζ, the matrix Q is also restricted to having
an eigenvalue at zero associated with the vector of all ones, Q1 = 0. Furthermore, to
guarantee observability of the remaining eigenvalues of Lp, we consider state weights
that are positive definite on the orthogonal complement of the subspace spanned by the
vector of all ones, Q + (1/n)11T � 0; e.g., Q = I − (1/n)11T penalizes mean-square
deviation from the network average.
In what follows, we express Lx as
Lx :=m∑l= 1
xl ξl ξTl = E diag (x)ET (2.4)
where E is the incidence matrix of the controller graph Lx, m is the number of edges in
Lx, and diag (x) is a diagonal matrix containing the vector of the edge weights x ∈ Rm.
The matrix E is given and it determines the set of candidate edges in controller network.
This set can contain all possible edges in the network or it can only include edges that
are not in the plant network. Many other options are possible as long as the union of
the sets of edges in the plant and controller networks yields a connected graph. We note
that the size of the set of candidate edges in controller network influences computational
complexity of our algorithms.
It is desired to select a subset of edges in order to balance the closed-loop performance
with the number of added edges. Vectors ξl ∈ Rn determine the columns of E and they
signify the connection with weight xl between nodes i and j: the ith and jth entries of
ξl are 1 and −1 and all other entries are equal to 0. Thus, Lx given by (2.4) satisfies
structural requirements on the controller graph Laplacian in (2.3) by construction.
To achieve consensus in the absence of disturbances, the closed-loop network has
to be connected [2]. Equivalently, the second smallest eigenvalue of the closed-loop
22
graph Laplacian, L := Lp + Lx, has to be positive, i.e., L has to be positive definite on
1⊥. This amounts to positive definiteness of the “strengthened” graph Laplacian of the
closed-loop network
G := Lp + Lx + (1/n)11T = Gp + E diag (x)ET � 0 (2.5a)
where
Gp := Lp + (1/n)11T . (2.5b)
Structural restrictions (2.3) on the Laplacian matrices introduce an additional constraint
on the matrix G,
G1 = 1. (2.5c)
2.2.1 Design of optimal sparse topology Let d be a white stochastic distur-
bance with zero-mean and unit variance,
E (d(t)) = 0, E(d(t1) dT (t2)
)= I δ(t1 − t2)
where E is the expectation operator. The square of the H2 norm of the transfer function
from d to ζ,
‖H‖22 = limt→∞
E(ψT (t) (Q + LxRLx)ψ(t)
)quantifies the steady-state variance amplification of closed-loop system (2.2). As noted
earlier, the network average ψ(t) corresponds to the zero eigenvalue of the graph Lapla-
cian and it is not observable from the performance output ζ. Thus, the H2 norm is
equivalently given by
‖H‖22 = limt→∞
E(ψT (t) (Q + LxRLx) ψ(t)
)= trace (P (Q + LxRLx)) = 〈P,Q + LxRLx〉
where ψ(t) is the vector of deviations of the states of individual nodes from ψ(t),
ψ(t) := ψ(t) − 1 ψ(t) =(I − (1/n)11T
)ψ(t)
23
and P is the steady-state covariance matrix of ψ,
P := limt→∞
E(ψ(t) ψT (t)
).
The above measure of the amplification of stochastic disturbances is determined by
‖H‖22 = (1/2)J(x), where
J(x) :=⟨(Gp + E diag (x)ET
)−1, Q+ LxRLx
⟩. (2.6)
It can be shown that J can be expressed as
J(x) =⟨(Gp + E diag (x)ET
)−1, Qp
⟩+ diag
(ETRE
)Tx − 〈R,Lp〉 − 1 (2.7)
with
Qp := Q + (1/n)11T + LpRLp.
Note that the last two terms in (2.7) do not depend on the optimization variable x and
that the term LpRLp in Qp has an interesting interpretation: it determines a state-
weight that guarantees inverse optimality (in LQR sense) of u = −Lpψ for a system
with no coupling between the nodes, ψ = u+ d.
We formulate the design of a controller graph that provides an optimal tradeoff
between the H2 performance of the closed-loop network and the controller sparsity as
minimizex
J(x) + γ ‖x‖1
subject to Gp + E diag (x)ET � 0(SP)
where J(x) and Gp are given by (2.7) and (2.5b), respectively. The `1 norm of x,
‖x‖1 :=∑m
l= 1 |xl|, is introduced as a convex proxy for promoting sparsity. In (SP),
the vector of the edge weights x ∈ Rm is the optimization variable; the problem data
are the positive regularization parameter γ, the state and control weights Q and R, the
plant graph Laplacian Lp, and the incidence matrix of the controller graph E.
The sparsity-promoting optimal control problem (SP) is a constrained optimization
problem with a convex non-differentiable objective function [35] and a positive defi-
nite inequality constraint. This implies convexity of (SP). Positive definiteness of the
24
strengthened graph Laplacian G guarantees stability of the closed-loop network (2.2a)
on the subspace 1⊥, and thereby consensus in the absence of disturbances [2].
The consensus can be achieved even if some edge weights are negative [3, 5]. By
expressing x as a difference between two non-negative vectors x+ and x−, (SP) can be
written asminimizex+, x−
⟨(Gp + E diag (x+ − x−)ET
)−1, Qp
⟩+
(γ 1 + c)Tx+ + (γ 1 − c)Tx−
subject to Gp + E diag (x+ − x−)ET � 0
x+ ≥ 0, x− ≥ 0
(2.8)
where c := diag(ETRE
). By utilizing the Schur complement, (2.8) can be cast to an
SDP, and solved via standard IP method algorithms for small size networks.
Reweighted `1 norm
An alternative proxy for promoting sparsity is given by the weighted `1 norm [125],
‖w ◦ x‖1 :=∑m
l= 1wl |xl| where ◦ denotes elementwise product. The vector of non-
negative weights w ∈ Rm can be selected to provide better approximation of non-
convex cardinality function than the `1 norm. An effective heuristic for weight selection
is given by the iterative reweighted algorithm [125], with wl inversely proportional to
the magnitude of xl in the previous iteration,
w+l = 1/(|xl| + ε). (2.9)
This puts larger emphasis on smaller optimization variables, where a small positive
parameter ε ensures that w+l is well-defined. If the weighted `1 norm is used in (SP),
the vector of all ones 1 should be replaced by the vector w in (2.8).
2.2.2 Structured optimal control problem: debiasing step After the
structure of the controller graph Laplacian Lx has been designed, we fix the structure of
Lx and optimize the corresponding edge weights. This “polishing” or “debiasing” step
is used to improve the performance relative to the solution of the regularized optimal
control problem (SP); see [126, Section 6.3.2] for additional information. The structured
25
optimal control problem is obtained by eliminating the columns from the incidence ma-
trix E that correspond to zero elements in the vector of the optimal edge weights x?
resulting from (SP). This yields a new incidence matrix E and leads to
minimizex
⟨(Gp + E diag (x) ET
)−1, Qp
⟩+ diag
(ETR E
)Tx
subject to Gp + E diag (x) ET � 0.
Alternatively, this optimization problem is obtained by setting γ = 0 in (SP) and by
replacing the incidence matrix E with E. The solution provides the optimal vector of
the edge weights x for the controller graph Laplacian with the desired structure.
2.2.3 Gradient and Hessian of J(x) We next summarize the first- and second-
order derivatives of the objective function J , given by (2.7), with respect to the vector
of the edge weights x. The second-order Taylor series approximation of J(x) around
x ∈ Rm is given by
J(x+ x) ≈ J(x) + ∇J(x)T x +1
2xT ∇2J(x) x.
For related developments we refer the reader to [7].
Proposition 1. The gradient and the Hessian of J at x ∈ Rm are determined by
∇J(x) = − diag(ET (Y (x) − R)E
)∇2J(x) = H1(x) ◦ H2(x)
whereY (x) :=
(Gp + EDxE
T)−1
Qp(Gp + EDxE
T)−1
H1(x) := ET Y (x)E
H2(x) := ET(Gp + EDxE
T)−1
E
Dx := diag (x) .
2.3 Dual problem Herein, we study the Lagrange dual of the sparsity-
promoting optimal control problem (2.8), provide interpretation of dual variables, and
construct dual feasible variables from primal feasible variables. Since minimization of
26
the Lagrangian associated with (2.8) does not lead to an explicit expression for the dual
function, we introduce an auxiliary variable G and find the dual of
minimizeG, x±
⟨G−1, Qp
⟩+ (γ 1 + c)Tx+ + (γ 1 − c)Tx−
subject to G − Gp − E diag (x+ − x−)ET = 0
G � 0, x+ ≥ 0, x− ≥ 0.
(P)
In (P), G represents the “strengthened” graph Laplacian of the closed-loop network and
the equality constraint comes from (2.5a). As we show next, the Lagrange dual of the
primal optimization problem (P) admits an explicit characterization.
Proposition 2. The Lagrange dual of the primal optimization problem (P) is given by
maximizeY
2 trace(
(Q1/2p Y Q
1/2p )1/2
)− 〈Y,Gp〉
subject to ‖ diag(ET (Y − R)E
)‖∞ ≤ γ
Y � 0, Y 1 = 1
(D)
where Y = Y T ∈ Rn×n is the dual variable associated with the equality constraint in (P).
The duality gap is
η = yT+ x+ + yT− x− = 1T (y+ ◦ x+ + y− ◦ x−) (2.10)
where
y+ = γ 1 − diag(ET (Y −R)E
)≥ 0 (2.11a)
y− = γ 1 + diag(ET (Y −R)E
)≥ 0. (2.11b)
are the Lagrange multipliers associated with elementwise inequality constraints in (P).
Proof. The Lagrangian of (P) is given by
L =⟨G−1, Qp
⟩+ 〈Y,G〉 − 〈Y,Gp〉 +(
γ 1 − diag(ET (Y −R)E
)− y+
)Tx+ +(
γ 1 + diag(ET (Y −R)E
)− y−
)Tx−.
(2.12)
27
Note that no Lagrange multiplier is assigned to the positive definite constraint on G in
L. Instead, we determine conditions on Y and y± that guarantee G � 0.
Minimizing L with respect to G yields
G−1QpG−1 = Y (2.13a)
or, equivalently,
G = Q1/2p
(Q1/2p Y Q1/2
p
)−1/2Q1/2p . (2.13b)
Positive definiteness of G and Qp implies Y � 0. Furthermore, since Qp1 = 1,
from (2.5c) and (2.13a) we have
Y 1 = 1.
Similarly, minimization with respect to x+ and x− leads to (2.11a) and (2.11a). Thus,
non-negativity of y+ and y− amounts to
−γ 1 ≤ diag(ET (Y −R)E
)≤ γ 1
or, equivalently,
‖diag(ET (Y −R)E
)‖∞ ≤ γ.
Substitution of (2.13) and (2.11) into (2.12) eliminates y+ and y− from the dual problem.
We can thus represent the dual function, infG, x± L(G, x±;Y, y±), as
2 trace(
(Q1/2p Y Q1/2
p )1/2)− 〈Y,Gp〉
which allows us to bring the dual of (P) to (D).
Any dual feasible Y can be used to obtain a lower bound on the optimal value of
the primal problem (P). Furthermore, the difference between the objective functions of
the primal (evaluated at the primal feasible (G, x±)) and dual (evaluated at the dual
feasible Y ) problems yields expression (2.10) for the duality gap η, where y+ and y−
are given by (2.11a) and (2.11b). The duality gap can be used to estimate distance to
optimality.
28
Strong duality follows from Slater’s theorem [126], i.e., convexity of the primal prob-
lem (P) and strict feasibility of the constraints in (P). This implies that at optimality,
the duality gap η for the primal problem (P) and the dual problem (D) is zero. Further-
more, if (G?, x?±) are optimal points of (P), then Y ? = (G?)−1Qp (G?)−1 is the optimal
point of (D). Similarly, if Y ? is the optimal point of (D),
G? = Q1/2p
(Q1/2p Y ?Q1/2
p
)−1/2Q1/2p
is the optimal point of (P). The optimal vector of the edge weights x? is determined by
the non-zero off-diagonal elements of the controller graph Laplacian, L?x = G? −Gp.
Interpretation of dual variables
For electrical networks, the dual variables have appealing interpretations. Let ι ∈ Rn
be a random current injected into the resistor network satisfying
1T ι = 0, E (ι) = 0, E(ιιT)
= Q + LpRLp.
The vector of voltages ϑ ∈ Rm across the edges of the network is then given by ϑ =
ETG−1ι. Furthermore, since
E(ϑϑT
)= ET G−1 E
(ιιT)G−1E = ET Y E,
the dual variable Y is related to the covariance matrix of voltages across the edges.
Moreover, (2.11) implies that y+ and y− quantify the deviations between variances of
edge voltages from their respective upper and lower bounds.
Remark 1. For a primal feasible x, Y resulting from (2.13a) with G given by (2.5a)
may not be dual feasible. Let
Y := β Y +1 − β
n11T (2.14a)
and let the control weight be R = r I with r > 0. If
β ≤ γ + 2 r
‖diag (ET (Y − R)E) ‖∞ + 2 r(2.14b)
29
then Y satisfies the inequality constraint in (D) and it is thus dual feasible.
2.4 Customized algorithms We next exploit the structure of the sparsity-
promoting optimal control problem (SP) and develop customized algorithms based on
the proximal gradient and Newton methods. The proximal gradient algorithm is a first-
order method that uses a simple quadratic approximation of J in (SP). This yields an
explicit update of the vector of the edge weights via application of the soft-thresholding
operator. In the proximal Newton method a sequential quadratic approximation of
the smooth part of the objective function in (SP) is used and the search direction is
efficiently computed via cyclic coordinate descent over the set of active variables.
2.4.1 Proximal gradient method We next use the proximal gradient method
to solve (SP). A simple quadratic approximation of J(x) around the current iterate xk,
J(x) ≈ J(xk) + ∇J(xk)T (x − xk) +1
2αk‖x − xk‖22
is substituted to (SP) to obtain
xk+1 = argminx
g(x) +1
2αk‖x − (xk − αk∇J(xk))‖22.
Here, αk is the step-size and the update is determined by the proximal operator of the
function αk g,
xk+1 = proxαkg
(xk − αk∇J(xk)
).
In particular, for g(x) = γ ‖x‖1, we have
xk+1 = Sγαk
(xk − αk∇J(xk)
)where Sκ(y) = sign (y) max (|y| − κ, 0) is the soft-thresholding function.
The proximal gradient algorithm converges with rate O(1/k) if αk < 1/L, where L is
the Lipschitz constant of ∇J [76,107]. It can be shown that ∇J is Lipschitz continuous
but, since it is challenging to explicitly determine L, we adjust αk via backtracking. To
provide a better estimate of L, we initialize αk using the Barzilai-Borwein (BB) method
which provides an effective heuristic for approximating the Hessian of the function J
30
via the scaled version of the identity [127], (1/αk)I. At the kth iteration, the initial BB
step-size αk,0,
αk,0 :=‖xk − xk−1‖22
(xk−1 − xk)T (∇J(xk−1) − ∇J(xk))(2.15)
is adjusted via backtracking until the inequality constraint in (SP) is satisfied and
J(xk+1) ≤ J(xk) +∇J(xk)T (xk+1 − xk) +1
2αk‖xk+1 − xk‖22.
Since J is continuously differentiable with Lipschitz continuous gradient, this inequality
holds for any αk < 1/L and the algorithm converges sub-linearly [107]. This condition
guarantees that objective function decreases at every iteration. Our numerical experi-
ments in Section 2.6 suggest that BB step-size initialization significantly enhances the
rate of convergence.
Remark 2. The biggest computational challenge comes from evaluation of the objective
function and its gradient. Since the inverse of the strengthened graph Laplacian G has
to be computed, with direct computations these evaluations take O(n3) and O(nm2)
flops, respectively. However, by exploiting the problem structure, ∇J can be computed
more efficiently. The main cost arises in the computation of diag (ETY E). We instead
compute it using sum (ET ◦ (Y E)) which takes O(n2m) operations. Here, sum (A) is
a vector which contains summation of each row of the matrix A in its entries. For
networks with m� n this leads to significant speed up. Moreover, in contrast to direct
computation, we do not need to store the m×m matrix ETY E. Only formation of the
columns is required which offers memory saving.
2.4.2 Proximal Newton method In contrast to the proximal gradient algo-
rithm, the proximal Newton method benefits from second-order Taylor series expansion
of the smooth part of the objective function in (SP). Herein, we employ cyclic coordinate
descent over the set of active variables to efficiently compute the Newton direction.
By approximating the smooth part of the objective function J in (SP) with the
second-order Taylor series expansion around the current iterate x,
J(x+ x) ≈ J(x) + ∇J(x)T x +1
2xT ∇2J(x) x
31
the problem (SP) becomes
minimizex
∇J(x)T x +1
2xT ∇2J(x) x + γ ‖x + x‖1
subject to Gp + E diag (x + x)ET � 0.
(2.16)
Let x denote the current iterate approximating the Newton direction. By perturbing x
in the direction of the ith standard basis vector ei in Rm, the objective function in (2.16)
becomes
∇J(x)T (x + δi ei) +1
2(x + δi ei)
T ∇2J(x) (x + δi ei) + γ |xi + xi + δi|.
Elimination of constant terms allows us to bring (2.16) into
minimizeδi
1
2ai δ
2i + bi δi + γ |ci + δi| (2.17)
where the optimization variable is the scalar δi and (ai, bi, ci, xi, xi) are the problem
data withai := eTi ∇2J(x) ei
bi :=(∇2J(x) ei
)Tx + eTi ∇J(x)
ci := xi + xi.
The explicit solution to (2.17) is given by
δi = − ci + Sγ/ai(ci − bi/ai) .
After the Newton direction x has been computed, we determine the step-size α via
backtracking. This guarantees positive definiteness of the strengthened graph Lapla-
cian and sufficient decrease of the objective function. We use generalization of Armijo
rule [128] to find an appropriate step-size α such that Gp+E diag(x+αx)ET is positive
definite matrix and
J(x+ αx) + γ ‖x+ αx‖1 ≤ J(x) + γ ‖x‖1 + ασ(∇J(x)T x + γ ‖x+ x‖1 − γ ‖x‖1
).
Remark 3. The parameter ai in (2.17) is determined by the ith diagonal element of the
32
Hessian ∇2J(x). On the other hand, the ith column of ∇2J(x) and the ith element of
the gradient vector ∇J(x) enter into the expression for bi. All of these can be obtained
directly from ∇2J(x) and ∇J(x) and forming them does not require any multiplication.
Computation of a single vector inner product between the ith column of the Hessian and
x is required in bi, which typically takes O(m) operations. To avoid direct multiplication,
in each iteration after finding δi, we update the vector ∇2J(x)T x using the correction
term δi(ETY Ei) ◦ ((G−1Ei)
TE)T and take its ith element to form bi. Here, Ei is the
ith column of the incidence matrix of the controller graph. This also avoids the need
to store the Hessian of J , which is an m ×m matrix, thereby leading to a significant
memory saving.
Remark 4. Active set strategy is an effective means for determining the directions
that do not need to be updated in the coordinate descent algorithm. At each outer
iteration, we classify the variable as either active or inactive based on the values of
xi and the ith component of the gradient vector ∇J(x). For g(x) = γ ‖x‖1, the ith
search direction is inactive if
xi = 0 and | eTi ∇J(x) | < γ − ε
and it is active otherwise. Here, ε > 0 is a small number (e.g., ε = 0.0001γ). The
Newton direction is then obtained by solving the optimization problem over the set of
active variables. This significantly improves algorithmic efficiency for large values of
the regularization parameter γ.
Convergence analysis
In (SP), J(x) is smooth for Gp+E diag(x)ET � 0 and the non-smooth part is given by
the `1 norm of x. The objective function of the form J(x) + g(x) was studied in [124],
where J is smooth over the positive definite cone and g is a separable non-differentiable
function. Theorem 16 from [124] thus implies super-linear (i.e., quadratic) convergence
rate of the quadratic approximation method for (SP).
33
Table 2.1: Comparison of our customized proximal algorithms
Algorithm proximal gradient proximal Newton
Order 1st 2nd
Search direction explicit update coordinate descent
Speed-up strategy BB step-size initialization active set strategy
Memory no storage of m×m matrices no storage of m×m matrices
Most expensive part O(n2m) O(m2)
Convergence rate linear super-linear (quadratic)
Table 2.2: Algorithmic properties of the primal dual IP method.
Algorithm primal-dual IP method
Order 2nd
Search direction PCG
Speed-up strategy PCG with preconditioner
Memory no storage of m×m matrices
Most expensive part O(m3)
Convergence rate super-linear
Stopping criteria
The norms of the primal and dual residuals rp and r±d as well as the duality gap η are
used as stopping criteria. In contrast to the stopping criteria available in the literature,
this choice enables fair comparison of the algorithms. We use (2.14) to construct a dual
feasible Y and obtain y± from (2.11), (2.10) to compute the duality gap η, and
rp(x, x±) := x − x+ + x−
r+d (x, y+) := γ 1 − diag
(ET (Y −R)E
)− y+
r−d (x, y−) := γ 1 + diag(ET (Y −R)E
)− y−.
to determine the primal and dual residuals.
Comparison of algorithms
Tables 2.1 and 2.2 compare and contrast features of our customized proximal algorithms
and the algorithm based on the primal-dual IP method developed in [63].
34
2.5 Growing connected resistive networks The problem of opti-
mal topology design for stochastically-forced networks has many interesting variations.
An important class is given by resistive networks in which all edge weights are non-
negative, x ≥ 0. Here, we study the problem of growing connected resistive networks;
e.g., see [34]. In this, the plant graph is connected and there are no joint edges between
the plant and the controller graphs. Our objective is to enhance the closed-loop per-
formance by adding a small number of edges. As we show below, inequality constraints
in this case amount to non-negativity of controller edge weights. This simplifies opti-
mality conditions and enables further improvement of the computational efficiency of
our customized algorithms.
The restriction on connected plant graphs implies positive definiteness of the strength-
ened graph Laplacian of the plant, Gp = Lp + (1/n)11T � 0. Thus, Gp +E diag (x)ET
is always positive definite for connected resistive networks and (SP) simplifies to
minimizex
f(x) + g(x) (2.18)
where
f(x) := J(x) + γ 1Tx
and g(x) is the indicator function for the non-negative orthant,
g(x) := I+(x) =
0, x ≥ 0
+∞, otherwise.
As in Section 2.3, in order to determine the Lagrange dual of the optimization
problem (2.18), we introduce an additional optimization variable G and rewrite (2.18)
asminimize
G, x
⟨G−1, Qp
⟩+ (γ 1 + diag
(ETRE
))Tx
subject to G − Gp − E diag (x)ET = 0
x ≥ 0.
(P1)
Proposition 3. The Lagrange dual of the primal optimization problem (P1) is given
35
by
maximizeY
2 trace(
(Q1/2p Y Q
1/2p )1/2
)− 〈Y,Gp〉
subject to diag(ET (Y − R)E
)≤ γ 1
Y � 0, Y 1 = 1
(D1)
where Y is the dual variable associated with the equality constraint in (P1). The duality
gap is
η = yTx = 1T (y ◦ x) (2.19)
where
y := γ 1 − diag(ET (Y −R)E
)≥ 0 (2.20)
represents the dual variable associated with the non-negativity constraint on the vector
of the edge weights x.
Remark 5. For connected resistive networks with the control weight R = r I, Y given
by (2.14a) is dual feasible if
β ≤ γ + 2 r
max (diag (ET (Y − R)E)) + 2 r. (2.21)
2.5.1 Proximal gradient method Using a simple quadratic approximation of
the smooth part of the objective function f around the current iterate xk
f(x) ≈ f(xk) + ∇f(xk)T (x − xk) +1
2αk‖x − xk‖22
the optimal solution of (2.18) is determined by the proximal operator of the function
g(x) = I+(x),
xk+1 =(xk − αk∇f(xk)
)+
where (·)+ is the projection on the non-negative orthant. Thus, the action of the
proximal operator is given by the projected gradient.
As in Section 2.4.1, we initialize αk using the BB heuristics but we skip the back-
tracking step here and employ a non-monotone BB scheme [129,130]. The effectiveness
36
of this strategy has been established on quadratic problems [127, 129], but its conver-
gence in general is hard to prove. In Section 2.6, we demonstrate efficiency of this
approach.
2.5.2 Proximal Newton method We next adjust the customized algorithm
based on proximal Newton method for growing connected resistive networks. We ap-
proximate the smooth part of the objective function f in (2.18) using the second-order
Taylor series expansion around the current iterate x,
f(x+ x) ≈ f(x) + ∇f(x)T x +1
2xT ∇2f(x) x
and rewrite (2.18) as
minimizex
∇f(x)T x +1
2xT ∇2f(x) x
subject to x + x ≥ 0.
(2.22)
By perturbing x in the direction of the ith standard basis vector ei in Rm, x+ δi ei, the
objective function in (2.22) becomes
∇f(x)T (x + δi ei) +1
2(x + δi ei)
T ∇2f(x) (x + δi ei) .
Elimination of constant terms allows us to bring (2.22) into
minimizeδi
1
2ai δ
2i + bi δi
subject to xi + xi + δi ≥ 0.
(2.23)
The optimization variable is the scalar δi and ai, bi, xi, and xi are the problem data
withai := eTi ∇2f(x) ei
bi :=(∇2f(x) ei
)Tx + eTi ∇f(x)
The explicit solution to (2.23) is given by
δi =
−bi/ai, xi + xi − bi/ai ≥ 0
− (xi + xi) , otherwise .
37
After the Newton direction x has been computed, we determine the step-size α
via backtracking. This guarantees positivity of the updated vector of the edge weights,
x+αx, and sufficient decrease of the objective function, f(x+αx) ≤ f(x)+ασ∇f(x)T x.
Remark 6. As in Section 2.4.2, we use an active set strategy to identify the directions
that do not need to be updated in the coordinate descent algorithm. For g(x) = I+(x), the
ith search direction is inactive if {xi = 0 and eTi ∇f(x) ≥ 0} and it is active otherwise.
Stopping criteria
The norm of the dual residual, rd, and the duality gap, η, are used as stopping cri-
teria. The dual variable y is obtained from (2.20) where Y is given by (2.14a) and β
satisfies (2.21). At each iteration, η is evaluated using (2.19) and the dual residual is
determined by
rd(x, y) := γ 1 − diag(ET (Y (x) − R)E
)− y.
2.6 Computational experiments We next provide examples and eval-
uate performance of our customized algorithms. Algorithm proxBB represents proximal
gradient method with BB step-size initialization and proxN identifies proximal Newton
method in which the search direction is found via coordinate descent. Performance is
compared with the PCG-based primal-dual IP method of [63] and the greedy algorithm
of [37]. We have implemented all algorithms in Matlab and executed tests on a 3.4
GHz Core(TM) i7-3770 Intel(R) machine with 16GB RAM.
In all examples, we set R = I and choose the state weight that penalizes the mean-
square deviation from the network average, Q = I − (1/n)11T . The absolute value of
the dual residual, rd, and the duality gap, η, are used as stopping criteria. We set the
tolerances for rd and η to 10−3 and 10−4, respectively. Finally, for connected plant
networks
γmax := ‖diag (ET G−1p QG−1
p E) ‖∞
identifies the value of the regularization parameter γ for which all edge weights in the
controller graph are equal to zero.
38
Table 2.3: Comparison of algorithms (solve times in seconds/number of iterations) forthe problem of growing connected resistive Erdos-Renyi networks with different numberof nodes n, edge probability 1.05 log(n)/n, and γ = 0.8 γmax.
number of nodes n = 300 n = 700 n = 1000
number of edges m = 43986 m = 242249 m = 495879
IP (PCG) 16.499/8 394.256/13 1014.282/13
proxBB 1.279/11 15.353/11 55.944/13
proxN 1.078/4 11.992/4 34.759/4
Table 2.4: Table 2.3 continued
number of nodes n = 1300 n = 1500
number of edges m = 839487 m = 1118541
IP (PCG) 15948.164/13 179352.208/14
proxBB 157.305/16 239.567/16
proxN 82.488/4 124.307/4
Additional information about our computational experiments, along with Matlab
source codes, can be found at:
www.ece.umn.edu/∼mihailo/software/graphsp/
2.6.1 Performance comparison In what follows, the incidence matrix of the
controller graph is selected to satisfy the following requirements: (i) in the absence of
the sparsity-promoting term, the closed-loop network is given by a complete graph; and
(ii) there are no joint edges between the plant and the controller graphs.
We first solve the problem (P1) for growing connected resistive Erdos-Renyi networks
with different number of nodes. The generator of the plant dynamics is given by an
undirected unweighted graph with edge probability 1.05 log(n)/n. Tables 2.3 and 2.4
compare our customized algorithms in terms of speed and the number of iterations.
Even for small networks, proximal methods are significantly faster than the IP method
and proxN takes smaller number of iterations and converges quicker than proxBB. For
a larger network (with 1500 nodes and 1118541 edges in the controller graph), it takes
about 50 hours for the PCG-based IP method to solve the problem. In contrast, proxN
and proxBB converge in about 2 and 4 minutes, respectively.
Figure 2.1 compares our proximal gradient algorithm with the fast greedy algorithm
39
(a) solve times (b) (J − Jc)/Jc
n n
Figure 2.1: (a) Solve times (in seconds); and (b) performance degradation (in percents)of proximal gradient and greedy algorithms relative to the optimal centralized controller.
of [37]. We solve problem (P1) for Erdos-Renyi networks with different number of
nodes (n = 5 to 500) and γ = 0.4 γmax. After proxBB identifies the edges in the
controller graph, we use the greedy method to select the same number of edges. Finally,
we polish the identified edge weights for both methods. Figure 2.1a shows the solve
times (in seconds) versus the number of nodes. As the number of nodes increases the
proximal algorithm significantly outperforms the fast greedy method. Relative to the
optimal centralized controller, both methods yield similar performance degradation of
the closed-loop network; see Fig. 2.1b.
2.6.2 Large-scale Facebook network To evaluate effectiveness of our algo-
rithms on large networks, we solve the problem of growing a network of friendships.
In such social networks, nodes denote people and edges denote friendships. There is an
edge between two nodes if two people are friends. The network is obtained by examining
social network of 10 users (the so-called ego nodes); all other nodes are friends to at least
one of these ego nodes [131]. The resulting network is undirected and unweighted with
4039 nodes and 88234 edges; the data is available at http://snap.stanford.edu/data/.
Our objective is to improve performance by adding a small number of extra edges. We
assume that people can only form friendships with friends of their friends. This restricts
the number of potential edges in the controller graph to 1358067.
40
(a) card(x) (b) (J − Jc)/Jc
γ card(x)
Figure 2.2: (a) Sparsity level; and (b) optimal tradeoff curves resulting from the applica-tion of proximal gradient algorithm and a heuristic strategy for the Facebook network.
To avoid memory issues, we have implemented our algorithms in C++. For γ =
c γmax with c = {0.1, 0.2, 0.5, 0.8} and γmax = 19.525, the proximal gradient algorithm
computes the solution in about 10, 2.6, 0.87, and 0.43 hours, respectively. After de-
signing the topology of the controller graph, we optimize the resulting edge weights via
polishing.
Figure 2.2a shows that the number of nonzero elements in the vector x decreases as
γ increases and Fig. 2.2b illustrates that the H2 performance deteriorates as the number
of nonzero elements in x decreases. In particular, for γ = 0.8 γmax, the identified sparse
controller has only 3 nonzero elements (it uses only 0.0002% of the potential edges).
Relative to the optimal centralized controller, this controller degrades performance by
16.842%, (J − Jc)/Jc = 16.842%.
In all of our experiments, the added links with the largest edge weights connect
either the ego nodes to each other or three non-ego nodes to the ego nodes. Thus, our
method recognizes significance of the ego nodes and identifies non-ego nodes that play
an important role in improving performance.
We compare performance of the identified controller to a heuristic strategy that is
described next. The controller graph contains 16 potential edges between ego nodes. If
the number of edges identified by our method is smaller than 16, we randomly select the
desired number of edges between ego nodes. Otherwise, we connect all ego nodes and
41
select the remaining edges in the controller graph randomly. We then use polishing to
find the optimal edge weights. The performance of resulting random controller graphs
are averaged over 10 trials and the performance loss relative to the optimal centralized
controller is displayed in Fig. 2.2b. We see that our algorithm always performs better
than the heuristic strategy. On the other hand, the heuristic strategy outperforms the
strategy that adds edges randomly (without paying attention to ego nodes). Unlike
our method, the heuristic strategy does not necessarily improve the performance by
increasing the number of added edges. In fact, the performance deteriorates as the
number of edges in the controller graph increases from 4 to 27; see Fig. 2.2b.
2.6.3 Random disconnected network The plant graph (blue lines) in Fig. 2.3
contains 50 randomly distributed nodes in a region of 10 × 10 units. Two nodes are
neighbors if their Euclidean distance is not greater than 2 units. We examine the
problem of adding edges to a plant graph which is not connected and solve the sparsity-
promoting optimal control problem (SP) for controller graph with m = 1094 potential
edges. This is done for 200 logarithmically-spaced values of γ ∈ [10−3, 2.5] using the
path-following iterative reweighted algorithm as a proxy for inducing sparsity [125]. As
indicated by (2.9), we set the weights to be inversely proportional to the magnitude of
the solution x to (SP) at the previous value of γ. We choose ε = 10−3 in (2.9) and
initialize weights for γ = 10−3 using the solution to (SP) with γ = 0 (i.e., the optimal
centralized vector of the edge weights). Topology design is followed by the polishing
step that computes the optimal edge weights; see Section 2.2.2.
As illustrated in Fig. 2.3, larger values of γ yield sparser controller graphs (red lines).
In contrast to all other examples, the plant graph is not connected and the optimal
solution is obtained using the algorithms of Section 2.4. Note that greedy method [37]
cannot be used here. Since the plant graph has three disconnected subgraphs, at least
two edges in the controller are needed to make the closed-loop network connected.
Figure 2.4 shows that the number of nonzero elements in the vector of the edge
weights x decreases and that the closed-loop performance deteriorates as γ increases.
In particular, Fig. 2.4c illustrates the optimal tradeoff curve between theH2 performance
loss (relative to the optimal centralized controller) and the sparsity of the vector x. For
γ = 2.5, only four edges are added. Relative to the optimal centralized vector of the
controller edge weights xc, the identified sparse controller in this case uses only 0.37% of
42
(a) γ = 0.02 (b) γ = 0.09
(c) γ = 0.63 (d) γ = 2.5
Figure 2.3: Topologies of the plant (blue lines) and controller graphs (red lines) for anunweighted random network with three disconnected subgraphs.
43
(a) card(x)/card(xc) (b) (J − Jc)/Jc (c) (J − Jc)/Jc
γ γ card(x)/card(xc)
Figure 2.4: (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoffcurve between the performance degradation and the sparsity level of optimal sparse xcompared to the optimal centralized vector of the edge weights xc. The results areobtained for unweighted random disconnected plant network with topology shown inFig. 2.3.
the edges, and achieves a performance loss of 82.13%, i.e., card(x)/card(xc) = 0.37%
and (J−Jc)/Jc = 82.13%. Here, xc is the solution to (SP) with γ = 0 and the pattern of
non-zero elements of x is obtained by solving (SP) with γ = 2.5 via the path-following
iterative reweighted algorithm.
2.6.4 Path and ring networks For path networks, our computational exper-
iments show that for a large enough value of the sparsity-promoting parameter γ a
single edge, which generates the longest cycle, is added; see Fig. 2.5, top row. This is
in agreement with [9] where it was proved that the longest cycle is most beneficial for
improving the H2 performance of tree networks. Similar observations are made for the
spatially-invariant ring network with nearest neighbor interactions. For large values of
γ, each node establishes a link to the node that is farthest away in the network; see
Fig. 2.5, bottom row. This is in agreement with recent theoretical developments [132]
where perturbation analysis was used to identify optimal week links in edge-transitive
consensus networks. Thus, for these regular networks and large enough values of the
regularization parameter, our approach indeed provides the globally optimal solution to
the original non-convex cardinality minimization problem.
2.7 Concluding remarks We have examined the problem of optimal topol-
ogy design of the corresponding edge weights for undirected consensus networks. Our
44
(a) γ = 0 (b) γ = 0.09 γmax (c) γ = 0.24 γmax (d) γ = 0.96 γmax
(e) γ = 0 (f) γ = 0.11 γmax (g) γ = 0.24 γmax (h) γ = 0.94 γmax
Figure 2.5: The problems of growing unweighted path (top row) and ring (bottom row)networks. Blue lines identify edges in the plant graph, and red lines identify edges inthe controller graph.
approach uses convex optimization to balance performance of stochastically-forced net-
works with the number of edges in the distributed controller. For `1-regularized mini-
mum variance optimal control problem, we have derived a Lagrange dual and exploited
structure of the optimality conditions for undirected networks to develop customized al-
gorithms that are well-suited for large problems. These are based on the proximal gradi-
ent and the proximal Newton methods. The proximal gradient algorithm is a first-order
method that updates the controller graph Laplacian via the use of the soft-thresholding
operator. In the proximal Newton method, sequential quadratic approximation of the
smooth part of the objective function is employed and the Newton direction is computed
using cyclic coordinate descent over the set of active variables. Examples are provided
to demonstrate utility of our algorithms. We have shown that proximal algorithms can
solve the problems with millions of edges in the controller graph in several minutes,
on a PC. Furthermore, we have specialized our algorithm to the problem of growing
connected resistive networks. In this, the plant graph is connected and there are no
joint edges between the plant and the controller graphs. We have exploited structure of
such networks and demonstrated how additional edges can be systematically added in
45
a computationally efficient manner.
Chapter 3
Edge addition in directed
consensus networks
We study the problem of performance enhancement in stochastically-forced directed
consensus networks by adding edges to an existing topology. We formulate the problem
as a feedback control design, and represent the links as the elements of the controller
graph Laplacian matrix. The topology design of the controller network can be cast as
an `1 regularized version of the H2 optimal control problem. The goal is to optimize
the performance of the network by selecting a controller graph with low communica-
tion requirements. To deal with the structural constraints that arise from the absence
of absolute measurements, we introduce a coordinate transformation to eliminate the
average mode and assure convergence of all states to the average of the initial node
values. By exploiting structure of the optimization problem, we develop a customized
algorithm based on the alternating direction method of multipliers to design a sparse
controller network that improves the performance of the closed-loop system.
3.1 Introduction In this chapter, we consider the problem of adding edges
to a weakly connected directed consensus network in order to improve performance. In
particular, we are interested in designing sparse communication graphs that strike a
balance between the variance amplification of the closed-loop system and the number
of communication links. In general, this is a combinatorial search problem and is non-
convex. In undirected networks, convex relaxations or greedy algorithms have been
46
47
introduced in order to optimize algebraic connectivity of the network [34,65] or network
coherence [35,133,134] by adding edges from a given set of edges.
In the absence of disturbances, a strongly connected balanced network converges
to the average of the initial node values [2]. However, in the presence of additive
stochastic disturbances, the network average experiences a random walk. Thus, the
control objective is to minimize mean square deviation from average. To cope with
structural constraints that arise from the absence of absolute information exchange,
we introduce a coordinate transformation to eliminate the average mode. While it is
desired to promote sparsity of controller graph in the physical domain, the H2 optimal
control problem is solved in the transformed set of coordinates where the average mode
is eliminated.
We formulate the edge addition problem for directed networks with an objective
of optimizing the closed-loop coherence. We consider the scenario in which the plant
graph is unbalanced. Structural requirements that the closed-loop graph Laplacian is
weakly connected and balanced make the optimal control problem challenging. We
approach this problem using sparsity-promoting optimal control framework [41,42,71].
In our formulation, performance is captured by the H2 norm of the closed-loop network
and `1 regularization is introduced as a proxy for inducing sparsity in the controller
graph [35, 133]. By exploiting the structure of the problem, we develop a customized
algorithm based on alternating direction method of multipliers (ADMM) [113].
The rest of the chapter is structured as follows. In Section 3.2, we provide necessary
background on graph theory for directed networks, define consensus problems on graphs,
and introduce an appropriate change of coordinates. In Section 3.3, we formulate the
optimal control problem. In Section 3.4, we develop a customized algorithm based
on ADMM. In Section 3.5, we use our algorithm for sparse edge addition to directed
consensus networks. Finally, in Section 3.6, we provide concluding remarks and highlight
future directions.
3.2 Motivation and background In this section, after providing the
necessary background, we describe the dynamics of consensus networks, inherent struc-
tural constraints, and the challenges in the related design problems. We then introduce
a change of coordinates that enables us to overcome these restrictions.
48
3.2.1 Directed graphs Herein, we provide a brief overview of the background
material; for additional information, see [2].
Weighted digraph: a weighted directed graph is denoted by D = (V,E,w) where V
and E are the sets of nodes and edges and w is the vector of edge weights. The edges
are directed and a value wij is the edge weight between nodes vi and vj . If the ordered
pair (vi, vj) ∈ E then vi is the tail of the edge and vj is its head.
Strongly connected digraph: a digraph is strongly connected if there is a directed
path between every pair of nodes.
Weakly connected digraph: a digraph is weakly connected if it is a connected graph
when the directions are omitted.
Adjacency matrix : the ijth element is given by
Aij =
wij , (vj , vi) ∈ E
0, otherwise.
Degree matrix is a diagonal matrix with Dii = din(vi) where din(vi) is the weighted
in-degree of node vi,
din(vi) =∑j
wij .
We sum over j’s for (vj , vi) ∈ E.
Laplacian matrix is a matrix L = D −A, where L1 = 0 by definition.
Balanced digraph: a digraph is balanced if the weighted in-degree and the weighted
out-degree for each node are equal.
3.2.2 Feedback design in consensus networks We consider a consensus dy-
namics
x = −Lp x + u + d
where d and u are the disturbance and control inputs, x is the vector of the states,
and Lp ∈ Rn×n is the graph Laplacian of the plant network. We assume that the plant
network is weakly connected and formulate the edge addition problem as a feedback
design problem with
u = −Lx x
49
where Lx ∈ Rn×n is the weighted directed Laplacian matrix of the controller. This
matrix represents the locations, directions, and edge weights. Note that each nonzero
element in Lx can either indicate addition of an edge or re-tuning of an existing edge
gain.
The closed-loop system is given by
x = − (Lp + Lx)x + d. (3.1a)
Our goal is to optimally design the feedback gain matrix Lx in order to achieve the
desired tradeoff between the controller sparsity and network performance. The per-
formance is quantified by the steady-state variance amplification of the stochastically-
forced network, from the white-in-time input d to the performance output z that quan-
tifies deviation from consensus and control effort,
z =
[Q1/2
−R1/2Lx
]x. (3.1b)
Here, the matrices Q = QT � 0 and R � 0 are the state and control weights, respec-
tively, and � (�) signifies positive definiteness (semi-definiteness) of a matrix.
In consensus networks, each node updates its state using the relative information
exchange with its neighbors. In the presence of white noise, the average mode x =
(1/n)xT1 experiences a random walk and variance increased linearly with time. A key
property of the Laplacian matrix of the controller is Lx1 = 0. Since our primary control
objective is to achieve consensus, only differences between node values are penalized in
the performance output. Therefore, the state weight matrix Q has a zero eigenvalue
with the corresponding eigenvector of all ones, i.e., Q1 = 0. Furthermore, we choose Q
to be positive definite on the orthogonal complement of the subspace span (1),
Q + (1/n)11T � 0.
The following lemma summarizes the well-known conditions for achieving consensus
in the absence of disturbances in directed networks [2].
Lemma 4. The agreement protocol over a digraph reaches the average consensus,
50
(1/n)11Tx(0), for every initial condition if and only if it is weakly connected and bal-
anced.
Thus, for a weekly connected Lp which is not necessarily balanced, it is required
that the closed-loop graph Laplacian, Lp + Lx be balanced, which amounts to
1T (Lp + Lx) = 0. (3.2)
The problem of designing a controller graph that provides a desired tradeoff between
performance index J of the network and the sparsity of the controller Lx can be formu-
lated asminimize
Lx
J(Lx) + γ ‖W ◦ Lx‖1
subject to 1T (Lx + Lp) = 0
Lx 1 = 0.
(3.3)
The positive scalar γ is the sparsity-promoting parameter that characterizes a trade-off
between network performance and sparsity of the controller and ◦ denotes elementwise
multiplication. The first condition guarantees asymptotic consensus to the initial net-
work average value in the absence of disturbances and the second condition secures
the row stochastic property of the controller graph Laplacian. We gurantee that the
closed-loop graph Laplacian remains weakly connected via proper step-size selection in
the algorithm.
In what follows, we quantify the network performance using the H2 norm. The
weighted `1 norm [125] of Lx is a convex approximation of the cardinality function.
In each iteration, each element of the weight matrix W (i, j) is chosen to be inversely
proportional to the magnitude of Lx(i, j) in the previous iteration and can be determined
as
W (i, j) = 1/(|Lx(i, j)| + ε). (3.4)
This puts larger emphasis on smaller optimization variables, where a small positive
parameter ε ensures that W (i, j) is well-defined.
3.2.3 Elimination of the average mode Since both Q and Lx have a zero
eigenvalue associated with the vector of all ones, the average mode x is not observable
51
from the output z. In order to eliminate the average mode, we introduce the following
change of coordinates [ψ
x
]=
[UT
1T /n
]x
where U ∈ Rn×(n−1) is a full-rank matrix and its columns span the orthogonal comple-
ment of 1. Using the properties of the matrix U ,
UT U = I, U UT = I − (1/n)11T , UT 1 = 0,
we have [ψ
˙x
]=
[−UT (Lp + Lx)U 0
0 0
][ψ
x
]+
[UT
(1/n)1T
]d.
In the new coordinates, the performance output is given by
z =
[Q1/2 U 0
−R1/2Lx U 0
] [ψ
x
].
The network average x = (1/n)1Td experiences a random walk and the vector ψ quan-
tifies the deviation from average of the states. Since our control objective is to minimize
the deviation from average in the nodes, the average mode is not of interest. Therefore,
the minimal representation of the system containing only the state ψ is given by
ψ = −UT (Lp + Lx)U ψ + UT d
z =
[Q1/2 U
−R1/2Lx U
]ψ.
(3.5)
In order to guarantee that the closed-loop graph Laplacian (Lp + Lx) is balanced, we
introduce the following change of variables
Lx = U F UT − (1/n)11TLp ⇔ F = UTLx U. (3.6)
52
Note that the main contribution that differentiates this work from previous results [133]
is that we are considering a weakly-connected directed unbalanced network, and we
require the closed-loop network to be balanced. The problem of adding edges to a
directed network has not been addressed in [133]. Moreover, the presented framework
cannot be used to solve the current constrained problem due to an additional bias term.
In our framework, a new change of variables is introduced to accommodate structural
constraints.
Using the properties of the matrix U , the constraints (3.2) and Lx 1 = 0 are auto-
matically satisfied. The equation (3.6) demonstrates how we can move back and forth
between two variables F and Lx.
Substituting Lx given by (3.6) into (3.5), we can write the state-space representation
of the closed-loop system as
ψ = − (UT Lp U + F )ψ + UT d
z =
[Q1/2 U
−R1/2(U F − (1/n)11TLp U)
]ψ.
(3.7)
Next, we formulate the optimal control problem and propose a framework to design
sparse controller Lx.
3.3 Topology design for directed networks In this section, we
approach the problem of topology design as a regularized optimal control problem.
3.3.1 Sparsity-promoting optimal control problem The H2 norm of the
transfer function from d to z,
‖H‖22 = J(F ) = trace (P (F ))
quantifies the variance amplification of the closed-loop network (3.7). Here, P (F ) is the
closed-loop observability Gramian which is the solution of the Lyapunov equation,
(UT Lp U + F )T P + P (UT Lp U + F ) = CTC (3.8)
53
where
C =
[Q1/2 U
−R1/2(U F − (1/n)11TLp U)
].
The control design problem (3.3) takes the following form
minimizeF,Lx
J(F ) + γ ‖W ◦ Lx‖1
subject to U F UT − Lx − (1/n)11TLp = 0.
(3.9)
In (3.9), the Laplacian matrix Lx and the matrix F are the optimization variables;
the problem data is given by plant graph Laplacian Lp, state and control weights Q
and R, and positive regularization parameter γ. The matrix W is the weight matrix
that imposes a penalty on the magnitude of the elements in Lx. With the problem
formulation (3.9), we are able to minimize the H2 norm of the network J(F ) in the
transformed coordinates where the average mode is eliminated, while promoting sparsity
of Lx in the physical domain. For γ = 0, the solution to (3.9) is typically given by a
matrix Lx with all non-zero components. As the regularization parameter increases, the
number of non-zero elements in the controller graph decreases.
3.3.2 Structured optimal control problem After the structure of the con-
troller graph Laplacian Lx has been designed, we fix the sparsity pattern S and then
solve the following problem
minimizeF
J(F )
subject to U F UT − (1/n)11TLp ∈ S(3.10)
whose solution provides the optimal controller graph Laplacian with the desired struc-
ture. This optimization problem is obtained by setting γ = 0 in (3.9) and adding the
sparsity pattern S. This “polishing” or “debiasing” step is used to improve the perfor-
mance relative to the solution of the sparsity-promoting optimal control problem (3.9).
3.4 An ADMM-based algorithm We next exploit the structure of the
constrained optimization problem (3.9) and develop a customized algorithm based on
54
ADMM. The augmented Lagrangian associated with (3.9) is given by
Lρ(F, Lx; Λ) = J(F ) + γ ‖W ◦ Lx‖1 +⟨Λ, U F UT − Lx − (1/n)11TLp
⟩+
ρ
2‖U F UT − Lx − (1/n)11TLp‖2F
(3.11)
where the matrix Λ is the Lagrange multiplier, ρ is a positive scalar, and ‖ · ‖F is the
Frobenius norm. The ADMM algorithm consists of the following steps at each iteration
F k+1 = argminF
Lρ(F, Lkx; Λk)
Lk+1x = argmin
Lx
Lρ(F k+1, Lx; Λk)
Λk+1 = Λk + ρ(U F k+1 UT − Lk+1
x − (1/n)11TLp).
The algorithm terminates when ‖Lk+1x −Lkx‖F ≤ ε1 and ‖U F k+1 UT−Lk+1
x − (1/n)11TLp)‖ ≤ε2, where ε1 and ε2 are desired tolerances.
Note that, the smooth part J(F ) and the non-smooth part ‖W ◦ Lx‖1 are now
operating in different coordinates; therefore, descent algorithms can be utilized in the
F -minimization step. Moreover, the `1 norm is a separable function with respect to
each element of Lx. Thus, we can determine the solution to the Lx-minimization step
analytically. In the Lagrange multiplier update step, we use the step-size equal to ρ in
order to guarantee the dual feasibility [113].
3.4.1 F -minimization step We bring the F -minimization step to the following
form by using the properties of the matrices Lp and U ,
F k+1 = argminF
J(F ) +ρ
2‖F − Sk‖2F
where
Sk = UT(Lkx + (1/n)11TLp − (1/ρ) Λk
)U = UT
(Lkx − (1/ρ) Λk
)U.
We employ the Anderson-Moore method to solve this problem [135]. This algorithm
converges faster compared to the gradient method and its implementation is simpler
compared to Newton method [42].
55
We next summarize the first- and second-order derivatives of the objective function
J . The second order approximation of the smooth part of objective function J around
F is given by
J(F + F ) ≈ J(F ) +⟨∇FJ(F ), F
⟩+
1
2
⟨∇2FJ(F , F ), F
⟩.
For related developments we refer the reader to [135].
Proposition 5. The gradient and the Hessian of J at F are determined by
∇J(F ) = 2 (R F − (1/n)UTR11T Lp U − P )L
∇2J(F , F ) = 2(
(R F − P )L + (R F − (1/n)UTR11T Lp U − P ) L) (3.12)
where R = UTRU and the matrix P is given by (3.8) and is the observability Gramian.
The matrix L is the controllability Gramian and is determined by
(UT Lp U + F )L + L (UT Lp U + F )T = UT U (3.13)
where UTU = I is identity. The matrices P and L are the solutions to the following
Lyapunov equations
(UT Lp U + F ) L+ L (UT Lp U + F ) = − F L − L F
(UT Lp U + F )T P + P (UT Lp U + F ) =
F T (R F − P − (1/n)UTR11TLpU) +
(F T R − P − (1/n)UTLTp 11TRU) F .
By setting ∇FLρ = ∇FJ + ρ (F − Sk) = 0, we obtain
2(R F − (1/n)UTR11T Lp U − P
)L + ρ (F − Sk) = 0. (3.14)
The necessary conditions for the optimality of Lρ(F, Lx; Λ) are given by (3.8), (3.13)
and (3.14).
The Anderson-Moore method solves the F -minimization step iteratively. In each
56
iteration, the algorithm starts with a stabilizing F and solves two Lyapunov equations
and one Sylvester equation. It first solves the Lyapunov equations (3.8), (3.13) with a
fixed F to obtain controllability and observability Gramians L and P , respectively. Then
it solves the Sylvester equation (3.14) for F with fixed L and P . Then we use Newton’s
method to find the descent direction between two consecutive steps by utilizing (3.12).
We next employ a line search strategy to determine an appropriate step-size in order to
guarantee convergence to a stationary point and the closed-loop stability.
3.4.2 Lx-minimization step Using the expression for the augmented Lagrangian (3.11),
we can write the Lx-minimization step as
Lk+1x = argmin
Lx
γ ‖Lx‖1 +ρ
2‖Lx − T k‖2F
where
T k = U F k+1 UT − (1/n)11TLp + (1/ρ) Λk.
The solution is given by soft-thresholding
Lk(i, j)k+1 =
(1− υ
|T k(i, j)|)T k(i, j), |T k(i, j)| > υ
0, otherwise
where υ = (γ/ρ)W (i, j).
Convergence analysis of ADMM for convex problems can be found in [113]. For non-
convex problems, the quadratic term in the augmented Lagrangian locally convexify the
problem for a large ρ. A recent result on convergence analysis of ADMM for a family
of non-convex problems can be found in [136]. Computational results also show that
ADMM works well when ρ is sufficiently large [137,138].
3.5 Computational experiments
3.5.1 Synthetic example In this section, we employ our customized algorithm
based on ADMM to add certain number of edges to a given directed network. The plant
network is a randomly generated graph with n nodes and with edge weight li ∈ (0, 1)
57
for the ith edge which is obtained randomly.
We set R = I and choose the state weight that penalizes the mean-square deviation
from the network average,
Q = I − (1/n)11T .
We solve the sparsity-promoting optimal control problem (3.9) for controller graph for
100 logarithmically-spaced values of γ ∈ [0.001, 3] using the path-following iterative
reweighted algorithm as a proxy for inducing sparsity [125]. We set the weights to be
inversely proportional to the magnitude of the solution Lx to (3.9) at the previous value
of γ. We choose ε = 10−3 in (3.4) and initialize weights for γ = 0.01 using the solution
to (3.9) with LQR state-feedback matrix. Topology design is followed by a polishing
step that computes the optimal weights of identified edges; see Section I-A.
3.5.2 Performance improvement The randomly generated plant graph in Fig. 3.1a
is a directed graph with n = 20 nodes. The plant graph is weakly connected but unbal-
anced. Thus, it will not converge to the initial nodes average value. In order to reach
consensus and improve the performance, adding edges to the plant network is required.
Figure 3.1 illustrates that by increasing γ, the controller graph becomes sparser. The
number of added edges to the network is equal to the number of nonzero off-diagonal
elements in the controller. Specifically, for γ = 0.001, the number of nonzero elements
in the controller graph is 84, among which 74 edges are added to the original network,
and the other 10 nonzero elements represent the diagonal elements. By increasing γ to
0.0788, there are 33 nonzero elements are in the controller graph, and only 25 edges are
added to the plant network. It is noteworthy that the Laplacian matrix of the controller
graph can not be a zero matrix, because the plant network is an unbalance graph and
a nonzero Laplacian matrix of the controller is needed to make the closed-loop graph
Laplacian balanced.
Figure 3.2 shows that the closed-loop performance deteriorates and the number of
nonzero elements in the controller graph Laplacian Lx decreases as γ increases. As
shown in Fig. 3.2b, relative to the optimal LQR controller, Lc, the H2 performance
loss decreases as the sparsity of the controller graph Laplacian matrix Lx increases.
In particular, for γ = 3, there are only 24 nonzero elements in the controller graph.
This is equivalent to have only 16 added edges. The identified sparse controller in
58
(a) plant graph (b) γ = 0.001
(c) γ = 0.0788 (d) γ = 3
Figure 3.1: Topologies of the plant (blue lines) and controller graphs (red lines) for arandomly generated weighted network.
59card
(Lx)/card
(Lc)
(a)
(J−Jc)/Jc
(b)
(J−Jc)/Jc
(c)
γ γ card(Lx)/card(Lc)
Figure 3.2: (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoffbetween the performance degradation and the sparsity level of the optimal sparse Lxcompared to the optimal centralized controller Lc. The results are obtained for theweighted random plant network with topology shown in Fig. 3.1a.
this case uses only 0.057% of the edges, relative to the optimal LQR controller, i.e.,
card(Lx)/card(Lc) = 29.629% and achieves a performance loss of 15.118%, i.e., (J −Jc)/Jc = 15.118%.
3.6 Concluding remarks We consider the `1 regularized version of opti-
mal control problem for adding edges to directed consensus networks in order to reach
consensus and optimally enhance performance. Although the given plant network is
not necessarily balanced, in order to reach agreement, we restrict the closed-loop graph
Laplacian to be balanced. The performance is measured by the H2 norm from the
disturbance to the output of the closed-loop network. In general, this problem is a com-
binatorial search problem. We use sparsity promoting optimal control framework and
introduce weighted `1 regularization as a proxy for promoting sparsity of the controller.
By exploiting structure of the problem, we develop an algorithm based on ADMM. An
example is provided to demonstrate the utility of the developed algorithm.
Part II
Network estimation and inference
60
Chapter 4
Topology identification of
undirected consensus networks
via sparse inverse covariance
estimation
We study the problem of identifying sparse interaction topology using sample covari-
ance matrix of the state of the network. Specifically, we assume that the statistics
are generated by a stochastically-forced undirected first-order consensus network with
unknown topology. We propose a method for identifying the topology using a regu-
larized Gaussian maximum likelihood framework where the `1 regularizer is introduced
as a means for inducing sparse network topology. The proposed algorithm employs a
sequential quadratic approximation in which the Newton’s direction is obtained using
coordinate descent method. We provide several examples to demonstrate good practical
performance of the method.
4.1 Introduction In this chapter, we develop a convex optimization frame-
work for identifying sparse interaction topology using sample covariance matrix of the
state of the network. Our framework utilizes an `1-regularized Gaussian maximum
likelihood estimator. Because of strong theoretical guarantees, this approach has been
61
62
commonly used for recovering sparse inverse covariance matrices [139–143]. We uti-
lize the structure of undirected networks to develop an efficient second-order method
based on a sequential quadratic approximation. As in [124,144], we compute the New-
ton’s direction using coordinate descent method [145–147] that employs active set strat-
egy. The main point of departure is the formulation of a convex optimization problem
that respects particular structure of undirected consensus networks. We also use a
reweighted `1-heuristics as an effective means for approximating non-convex cardinality
function [125], thereby improving performance relative to standard `1 regularization.
Our presentation is organized as follows. In Section 4.2, we formulate the problem of
topology identification using sparse inverse covariance matrix estimation. In Section 4.3,
we develop a customized second-order algorithm to solve the `1-regularized Gaussian
maximum likelihood estimation problem. In Section 4.4, we use computational exper-
iments to illustrate features of the method. In Section 4.5, we conclude with a brief
summary.
4.2 Problem formulation In this section, we provide background material
on stochastically forced undirected first-order consensus networks and formulate the
problem of topology identification using a sample covariance matrix. The inverse of a
given sample covariance matrix can be estimated using Gaussian maximum likelihood
estimator. For undirected consensus networks, we show that the estimated matrix is
related to the graph Laplacian of the underlying network.
Our objective is to identify the underlying graph structure of a stochastically forced
undirected consensus network with a known number of nodes by sampling its second-
order statistics. In what follows, we relate the problem of topology identification for
consensus networks to the inverse covariance matrix estimation problem.
4.2.1 Undirected consensus networks We consider an undirected consensus
network
ψ = −Lx ψ + d, (4.1)
where ψ ∈ Rn represents the state of n nodes, d is the disturbance input, and the
symmetric n×n matrix Lx represents the graph Laplacian. The matrix Lx is restricted
to have an eigenvalue at zero corresponding to an eigenvector of all ones, Lx1 = 0. This
63
requirement is implicitly satisfied by expressing Lx in terms of the incidence matrix E
Lx :=
m∑l= 1
xl ξl ξTl = E diag (x)ET ,
where diag (x) is a diagonal matrix containing the vector of edge weights x ∈ Rm. Each
column ξl = ei− ej , where ei ∈ Rn is the ith basis vector, represents an edge connecting
nodes i and j. The m columns of E specify the edges that may be used to construct the
consensus network. For a complete graph, there are m = n(n− 1)/2 potential edges.
In order to achieve consensus in the absence of disturbances, it is required that the
closed-loop graph Laplacian, Lx, be positive definite on the orthogonal complement of
the vector of all ones, 1⊥ [2]. This amounts to positive definiteness of the “strengthened”
graph Laplacian of the closed-loop network
X := Lx + (1/n)11T � 0. (4.2)
Clearly, since Lx1 = 0, X1 = 1.
Consensus networks attempt to compute the network average; thus, it is of interest
to study the deviations from average
ψ(t) := ψ(t) − 1 ψ(t) =(I − (1/n)11T
)ψ(t),
where ψ(t) := (1/n)1Tψ(t) is the network average and corresponds to the zero eigenvalue
of Lx. From (4.1), it follows that the dynamics of the deviation from average are,
˙ψ = −Lxψ +
(I − (1/n)11T
)d.
The steady-state covariance of ψ,
P := limt→∞
E(ψ(t) ψT (t)
),
is given by the solution to the algebraic Lyapunov equation
Lx P + P Lx = I − (1/n)11T .
64
For connected networks, the unique solution is given by
P =1
2L†x =
1
2
((Lx + (1/n)11T
)−1 − (1/n)11T)
=1
2
(X−1 − (1/n)11T
),
(4.3)
where (·)† is the pseudo-inverse of a matrix. Thus, the inverse of the steady-state covari-
ance matrix of the deviation from network average is determined by the strengthened
graph Laplacian of the consensus network X.
4.2.2 Topology identification A sparse precision matrix can be obtained as the
solution to the regularized maximum log-likelihood problem [141],
minimizeX
− log det (X) + trace (S X) + γ ‖X‖1
subject to X � 0,(4.4)
where S is the sample covariance matrix, γ is a positive regularization parameter, and
‖X‖1 :=∑|Xij | is the `1 norm of the matrix X. The `1 norm is introduced as a means
for inducing sparsity in the inverse covariance matrix where a zero element implies
conditional independence. This problem has received significant attention in recent
years [124,139,141,144,148–151].
In this work, we establish a relation between inverse covariance matrix estimation
and the problem of topology identification of an undirected consensus network. We are
interested in identifying a sparse topology that yields close approximation of a given
sample covariance matrix. This is achieved by solving the following problem,
minimizex
J(x) + γm∑l= 1
|xl|
subject to E diag (x)ET + (1/n)11T � 0,
(NI)
where
J(x) = − log det(E diag (x)ET + (1/n)11T
)+ trace (S E diag (x)ET ).
65
Relative to [124, 139, 144, 148, 149], our optimization problem has additional structure
induced by the dynamics of undirected consensus networks.
The network identification problem (NI) is a convex but non-smooth problem where
the optimization variable is the vector of the edge weights x ∈ Rm and the problem
data is the sample covariance matrix S and the incidence matrix. The incidence matrix
is selected to contain all possible edges. The `1 norm of x is a convex relaxation of the
cardinality function and it is introduced to promote sparsity. The positive parameter γ
specifies the emphasis on sparsity versus matching the sample covariance matrix S. For
γ = 0, the solution to (NI) is typically given by a vector x with all non-zero elements.
The positive definite constraint comes from (4.2) and guarantees a connected closed-loop
network and thus asymptotic consensus in the absence of disturbances.
Remark 7. We also use this framework to address the network sparsification problem
where it is of interest to find a sparse network that generates close approximation of the
covariance matrix of a given dense network. We choose E to be equal to the incidence
matrix of the primary network.
4.3 Customized algorithm based on sequential quadraticapproximation We next exploit the structure of the optimization problem (NI)
to develop an efficient customized algorithm. Our algorithm is based on sequential
quadratic approximation of the smooth part J of the objective function in (NI). This
method benefits from exploiting second-order information about J and from computing
the Newton direction using cyclic coordinate descent [145–147] over the set of active
variables. We find a step-size that ensures the descent direction via backtracking line
search. Furthermore, by restricting our computations to active search directions, com-
putational cost is significantly reduced. A similar approach has been recently utilized
in a number of applications, including sparse inverse covariance estimation in graphi-
cal models [124, 144, 152]. In this work, we have additional structural constraints and
use reweighted heuristics in order to achieve sparsity. We use an alternative proxy for
promoting sparsity which is given by the weighted `1 norm [125]. In particular, we
solve problem (NI) for different values of γ using a path-following iterative reweighted
algorithm; see Section II (A) in [?]. The topology identification then is followed by a
66
polishing step to debias the identified edge weights.
4.3.1 Structured problem: debiasing step In addition to promoting sparsity
of the identified edge weights, the `1 norm penalizes the magnitude of the nonzero edge
weights. In order to gauge the performance of the estimated network topology, once
we identify a set of sparse topology via (NI), we solve the structured “polishing” or
“debiasing” problem to optimize J over the set of identified edges. To do this, we form
a new incidence matrix E which contains only those edges identified as nonzero in the
solution to (NI) and form the problem,
minimizex
− log det(E diag (x) ET + (1/n)11T
)+ trace
(S E diag (x) ET
)subject to E diag (x) ET + (1/n)11T � 0.
whose solution provides the optimal estimated graph Laplacian with the desired struc-
ture.
4.3.2 Gradient and Hessian of J(x) We next derive the gradient and Hessian
of J which can be used to form a second-order Taylor series approximation of J(x)
around xk,
J(xk + x) ≈ J(xk) + ∇J(xk)T x +1
2xT ∇2J(xk) x. (4.5)
Proposition 6. The gradient and the Hessian of J at xk are
∇J(xk) = diag(ET (S − X−1(xk))E
)∇2J(xk) = M(xk) ◦ M(xk)
(4.6)
where ◦ denotes the elementwise (Hadamard) product and
X−1(xk) :=(EDxk E
T + (1/n)11T)−1
,
M(xk) := ETX−1(xk)E.
67
Proof. Utilizing the second order expansion of the log-determinant function we have
J(xk + x) − J(xk) ≈ trace(ET (S − X−1(xk))EDx
)+
1
2trace
(DxE
TX−1(xk)EDxETX−1(xk)E
).
The expressions in (4.6) can be established using a sequence of straightforward algebraic
manipulations in conjunction with the use of the commutativity invariance of the trace
function and the following properties for a matrix T , a vector α, and a diagonal matrix
Dα,
trace (T Dα) = αT diag (T )
trace (Dα T Dα T ) = αT (T ◦ T )α.
4.3.3 Algorithm Our algorithm is based on building the second-order Taylor series
expansion of the smooth part of the objective function J in (NI) around the current
iterate xk. Approximation J in (NI) with (4.5),
minimizex
∇J(xk)T x +1
2xT ∇2J(xk) x + γ ‖xk + x‖1
subject to E diag(xk + x
)ET + (1/n)11T � 0.
(4.7)
We use the coordinate descent algorithm to determine the Newton direction. Let x
denote the current iterate approximating the Newton direction. By perturbing x in the
direction of the ith standard basis vector ei ∈ Rm, x + µi ei, the objective function
in (4.7) becomes
∇J(xk)T (x + µi ei) +1
2(x + µi ei)
T ∇2J(xk) (x + µi ei) + γ |xki + xi + µi|.
Elimination of constant terms allows us to express (4.7) as
minimizeµi
1
2ai µ
2i + bi µi + γ |ci + µi| (4.8)
68
where (ai, bi, ci, xki , xi) are the problem data,
ai := eTi ∇2J(xk) ei
bi :=(∇2J(xk) ei
)Tx + eTi ∇J(xk)
ci := xki + xi.
The explicit solution to (4.8) is given by
µi = − ci + Sγ/ai(ci − bi/ai) ,
where Sκ(y) = sign (y) max (|y| − κ, 0) , is the soft-thresholding function.
After the Newton direction x has been computed, we determine the step-size α via
backtracking. This guarantees positive definiteness of the strengthened graph Laplacian
and sufficient decrease of the objective function. We use Armijo rule to find an appro-
priate step-size such that E diag(xk + αx)ET + (1/n)11T is positive definite matrix
and
J(xk + αx) + γ ‖xk + αx‖1 ≤ J(xk) + γ ‖xk‖1 +
ασ(∇J(xk)T x + γ ‖xk + αx‖1 − γ ‖xk‖1
).
There are two computational aspects in our work which lead to suitability of this
algorithm for large-scale networks.
• Active set strategy
We propose an active set prediction strategy as an efficient method to solve the
problem (NI) for large values of γ. It is an effective means for determining which
directions need to be updated in the coordinate descent algorithm. The classifica-
tion of a variable as either active or inactive is based on the values of xki and the
ith component of the gradient vector ∇J(xk). Specifically, the ith search direction
is only inactive if
xki = 0 and | eTi ∇J(xk) | < γ − ε
where ε > 0 is a small number (e.g., ε = 0.0001γ). At each outer iteration, the
Newton search direction is obtained by solving the optimization problem over the
69
set of active variables. The size of active sets is small for large values of the
regularization parameter γ.
• Memory saving
Computation of bi in (4.8) requires a single vector inner product between the
ith column of the Hessian and x, which typically takes O(m) operations. To
avoid direct multiplication, in each iteration after finding µi, we update the vector
∇2J(xk)T x using the correction term µi(ETX−1ξi) ◦ ((X−1ξi)
TE)T and take its
ith element to form bi. Here, ξi is the ith column of the incidence matrix of the
controller graph. This also avoids the need to store the Hessian of J , which is an
m×m matrix, thereby leading to a significant memory saving. Moreover, the ith
column of ∇2J(xk) and the ith element of the gradient vector ∇J(xk) enter into
the expression for bi. On the other hand, ai is determined by the ith diagonal
element of the Hessian matrix ∇2J(xk). All of these can be obtained directly
from ∇2J(xk) and ∇J(xk) which are formed in each outer iteration without any
multiplication.
Our problem is closely related to the problem in [124]. The objective function there
has the form f(x) = J(x) + g(x), where J(x) is smooth over the positive definite cone,
and g(x) is a separable non-differentiable function. In our problem formulation, J(x) is
smooth for E diag(x)ET+(1/n)11T � 0 while the non-smooth part is the `1 norm which
is separable. Thus, convergence can be established using similar arguments. According
to [124, Theorems 1,2], the quadratic approximation method converges to the unique
global optimum of (NI) and at super-linear rate.
The optimality condition for any x∗ that satisfies E diag(x∗)ET + (1/n)11T � 0 is
given by
∇J(x∗) + γ ∂ ‖x∗‖1 ∈ 0,
where ∂‖x∗‖1 is the subgradient of the `1 norm. This means that for any i
∇iJ(x∗) ∈
− γ, xi > 0;
γ, xi < 0;
[− γ, γ], xi = 0.
70
The stopping criterion is to check the norm of ∇J(x) and the sign of x to make sure
that x is the optimal solution.
4.4 Computational experiments We next illustrate the performance
of our customized algorithm. We have implemented our algorithm in Matlab, and all
tests were done on 3.4 GHz Core(TM) i7-3770 Intel(R) machine with 16GB RAM.
The problem (NI) is solved for different values of γ using the path-following iterative
reweighted algorithm [125] with ε = 10−5. The initial weights are computed using the
solution to (NI) with γ = 0 (i.e., the optimal centralized vector of the edge weights). We
then adjust ε = 0.001‖x‖2 at each iteration. Topology identification is followed by the
polishing step described in Section 4.3.1. In the figures, we use black dots to represent
nodes, blue lines to identify the original graph, and red lines to denote the edges in
the estimated sparse network. In all examples, we set the tolerance for the stopping
criterion to 10−4.
4.4.1 Network identification We solve the problem of identification of a sparse
network using sample covariance matrix for 500 logarithmically-spaced values of γ ∈[0.1, 1000]. The sample covariance matrix S is obtained by sampling the nodes of
the stochastically-forced undirected unweighted network whose topology is shown in
Fig. 4.1a. To generate samples, we conducted 20 simulations of system (4.1) forced
with zero-mean unit variance band-limited white noise d. The sample covariance ma-
trix is averaged over all simulations and asymptotically converges to the steady-state
covariance matrix. The incidence matrix E in (NI) contains all possible edges.
Empirically, we observe that after about 5 seconds the sample covariance matrix
converges to the steady-state covariance. First, we sample the states after 3 seconds,
so the sample covariance matrix we compute is different than the true steady-state
covariance matrix. For (NI) solved with this problem data, Figures 4.3 and 4.1c illustrate
the topology of the identified networks for the minimum and maximum values of γ.
In Fig. 4.1d, the blue line shows an edge in the original network that has not been
recovered by the algorithm for the largest value of γ. The red lines in this figure
show two extra edges in the estimated network for the smallest value of γ which were
not present in the original graph. Next, we solve the problem (NI) using a sample
71
(a) Original network with n = 12nodes
(b) γ = 0.1
(c) γ = 1000 (d) Distinct edges
Figure 4.1: The problem of identification of sparse networks using sample covariancematrix for a network with n = 12 nodes.
covariance matrix generated by sampling the node states after 15 seconds. In this case,
the sample covariance matrix is closer to the steady-state covariance matrix than the
previous experiment. As shown in Fig. 4.2a, the identified network is exactly the same
as the original network for γ = 0.1. If γ is further increased, a network sparser than the
original is identified; see Fig. 4.2b.
For γ = 0, the relative error between the covariance matrix of the estimated network
and the sample covariance matrix S is given by
‖(E diag (xc)E
T + (1/n)11T)−1 − S‖F
‖S‖F= 0.004%,
where ‖ · ‖F is the Frobenius norm and xc is the solution to (NI) with γ = 0. As γ
increases, the number of nonzero elements in the vector of the edge weights x decreases
and the state covariance matrix gets farther away from the given sample covariance
72
(a) γ = 0.1 (b) γ = 1000
Figure 4.2: The problem of identification of sparse networks using sample covariancematrix for a network with n = 12 nodes.
matrix. In particular, in the first experiment for γ = 1000, only twelve edges are chosen.
Relative to the centralized network with the vector of the edge weights xc, the identified
sparse network in this case uses only 18.182% of the edges, i.e., card(x)/card(xc) =
18.182% and achieves a relative error of 53.841%, (‖X−1 − S‖F )/‖S‖F = 53.841%,
with X =(E diag (x)ET + (1/n)11T
). In the second experiment, the identified sparse
network has 16.666% of the potential edges and achieves a relative error of 51.067%.
4.4.2 Network sparsification We next use (NI) to find a sparse representation
of a dense consensus network. Inspired by [153], we generate a network by randomly
distributing nodes in a 10× 10 box. A pair of nodes can communicate if the Euclidean
distance between them, d(i, j), is not greater than 5 units and the edge connecting them
has weight exp (− d(i, j)). The incidence matrix of the identified graph is selected to
be equal to the incidence matrix of the given graph; i.e., the sparse network’s edges are
a subset of the original network’s. Figure 4.3a shows a graph with 50 nodes. We use
the reweighted `1 regularized Gaussian maximum likelihood estimation framework, for
200 logarithmically-spaced values of γ ∈ [0.01, 100] following by the polishing step. The
sparse graph topologies identified for different values of γ are shown in figures 4.3b, 4.3c,
and 4.3d. As γ increases, the identified graph becomes sparser.
For γ = 0.01, 222 edges are chosen to be in the sparse estimated network which is
only 34.851% of the 637 potential edges. The network with these selected edges achieves
73
(a) Original network with n = 50nodes
(b) γ = 0.01
(c) γ = 0.102 (d) γ = 1.138
Figure 4.3: The problem of sparsification of a network with n = 50 nodes using samplecovariance matrix.
74
a relative error of 29.965%,
‖X−1 − S‖F‖S‖F
= 29.965%.
For the largest value of the sparsity-promoting parameter, γ = 1.138, only 64 edges
are present (10.047% of the potential edges) in the estimated graph that gets a relative
error of 207.493%.
To provide a point of comparison, we compare the performance of our algorithm to
a simple truncation scheme. In this scheme, the edge with the smallest weights that
does not disconnect the network is iteratively removed until the network has the desired
sparsity. After identifying the topology in this way, the polishing step optimizes the
edge weights of the selected set of edges.
Figure 4.4 shows the relative errors of our algorithm (in red dashed lines) and the
truncation algorithm (in blue solid lines) on a log scale against the number of removed
edges. As the number of edges in the estimated graph decreases, the relative error
of both algorithms increases. The relative error of network topologies identified by our
algorithm is much smaller than the error of those identified by the truncation algorithm,
and thus our customized algorithm outperforms the truncation method. In particular,
when 573 edges are removed, the relative errors for our customized algorithm and the
truncation algorithm are 2.075 and 8.977, respectively.
(‖X
−1−S‖ F
)/‖S‖ F
Number of eliminated edges
Figure 4.4: Relative error in log-scale for the sparsification problem of a network withn = 50 nodes using sample covariance matrix.
75
4.5 Concluding Remarks We have developed a method for identifying
the topology of an undirected consensus network using available statistical data. In or-
der to promote network sparsity, we introduce a convex optimization framework aimed
at finding the solution to the `1-regularized maximum likelihood problem. This prob-
lem is closely related to the problem of sparse inverse covariance estimation that has
received significant attention in the literature. In our setup, additional structure arises
from the requirement that data is generated by an undirected consensus network. By
exploiting the structure of the problem, we develop an efficient algorithm based on the
sequential quadratic approximation method in which the search direction is determined
using coordinate descent with active set strategy. Several examples have been provided
to illustrate utility of the method and efficiency of the customized algorithm.
Chapter 5
Topology identification via
growing a Chow-Liu tree network
We study the problem of sparse interaction topology identification using sample covari-
ance matrix of the states of the network. We postulate that the statistics are gener-
ated by a stochastically-forced undirected consensus network with unknown topology in
which some of the nodes may have access to their own states. We first propose a method
for topology identification using a regularized Gaussian maximum likelihood framework
where the `1 regularizer is introduced as a means for inducing sparse network topology.
We also develop a method based on growing a Chow-Liu tree that is well-suited for
identifying the underlying structure of large-scale systems. We apply this technique to
resting-state functional MRI (FMRI) data as well as synthetic datasets to illustrate the
effectiveness of the proposed approach.
5.1 Introduction In this chapter, we develop a convex optimization algo-
rithm for identifying sparse interaction topology using the sample covariance matrix
of the states of the network. First, we utilize an `1-regularized Gaussian maximum
likelihood estimator that has been commonly used for recovering sparse inverse covari-
ance matrices [139–141]. We show that the performance of graphical-LASSO can im-
prove significantly by imposing additional structure on the problem and using reweight-
ing schemes [125]. In particular, our framework overcomes challenges that standard
graphical-LASSO faced and performs well for the case study in [64]. Moreover, inspired
76
77
by [64], we combine the Chow-Liu algorithm [154] with the techniques for growing net-
works developed in [106] to identify the underlying structure of an undirected consensus
network. Constructing the Chow-Liu tree from statistical data does not require any ma-
trix inversion; thereby, it is well-suited for large-scale problems. Furthermore, we have
developed efficient algorithms [106] for growing connected resistive consensus networks.
Herein, we demonstrate that combining these two algorithms yields an efficient method
for recovering the network topology in large-scale systems.
Our presentation is organized as follows. In Section 5.2, we discuss the proper-
ties of consensus networks and formulate the problem of topology identification using
sparse inverse covariance matrix estimation. We also briefly comment on the customized
second-order algorithm based on the proximal Newton method to solve the `1-regularized
Gaussian maximum likelihood estimation problem. In Section 5.3, we develop an algo-
rithm for growing a Chow-Liu tree graph in order to identify the network that yields
close approximation of a given sample covariance matrix. In Section 5.4, we use com-
putational experiments to illustrate features of our method. In particular, we employ
our algorithm to identify the underlying functional network of the human brain using
FMRI data. Finally, in Section 5.5, we conclude with a brief summary.
5.2 Topology identification via structured graphical-LASSOThe problem of topology identification using a sample covariance matrix for stochasti-
cally forced undirected consensus networks has been studied in Chapter 4. In consensus
networks, each node updates its own state using localized information exchange with
the neighbors. Two nodes are neighbors if an edge connects them together. Herein,
we consider a network that leader nodes are equipped with absolute information about
their own states. In Chapter 4, we showed that the underlying topology of an undirected
consensus network can be identified using Gaussian maximum likelihood estimator. In
this chapter, we formulate the topology identification problem for undirected consensus
networks with leader nodes and provide two algorithms to solve this problem.
Consider an undirected network with n nodes governed by
ψi =∑j∈Ni
xij (ψj − ψi) − zi ψi + wi,
78
where each node i updates its own state using relative information exchange with its
neighbors in the set Ni. Moreover, certain nodes, the so-called leaders, have access
to their own states. Here, zi is the weight that the ith node sets for the absolute
measurement, xij is the edge weight, and wi is an exogenous stochastic disturbance.
The ith node is a leader if zi 6= 0 and it is a follower if zi = 0. By concatenating all the
states in a vector ψ ∈ Rn, the consensus dynamics can be written as
ψ = − (Lx + Dz)ψ + w (5.1)
Here, Lx ∈ Rn×n is the graph Laplacian of the consensus network and Dz ∈ Rn×n
is a diagonal matrix with the ith diagonal entry zi. The incidence matrix E of the
underlying graph represents the edges in the network. The lth column of this matrix is
given by
ξl = ei − ej ,
that demonstrates the lth edge between the nodes i and j. Here, ei ∈ Rn is the ith basis
vector. By using the incidence matrix, the Laplacian matrix Lx can be written as
Lx :=m∑l= 1
xl ξl ξTl = E diag (x)ET
where diag (x) is a diagonal matrix with the edge weights x ∈ Rm in its diagonal.
Given that the covariance of the disturbance is a multiple of the identity matrix I,
the steady-state covariance of ψ,
Σ := limt→∞
E (ψ ψT ),
can be computed as the solution to the associated algebraic Lyapunov equation
(Lx + Dz) Σ + Σ (Lx + Dz) = 2 I.
Thus, the steady state covariance can be explicitly computed as
Σ = (Lx + Dz)−1 . (5.2)
79
The inverse of the steady-state covariance matrix of the states of the system can be
determined by the structure of the underlying graph that connects the n nodes. Thus,
by using a sampled second-order statistics and estimating the inverse covariance ma-
trix, the underlying topology of an undirected consensus network with leaders can be
identified. The problem of sparse covariance estimation has received significant atten-
tion recently [124, 144, 150, 151]. Relative to these works, our optimization problem
has additional structure coming from the dynamics of undirected consensus networks.
Moreover, compared to previous chapter, we consider consensus networks with leaders
and introduce a new algorithm that is convenient for solving large-scale problems.
We first generalize the proposed algorithm based on the structured graphical-LASSO
in Chapter 4 to solve the problem of topology identification in undirected consensus
networks with leaders. It is well-known that the estimation of the inverse covariance
matrix X can be obtained as the solution to the regularized maximum log-likelihood
problem [141],
minimizeX
− log det (X) + trace (SX) + γ ‖F ◦ X‖1
subject to X � 0,(5.3)
where S is the sample covariance matrix, γ is a positive regularization parameter, F is
the weight matrix, and ‖F ◦X‖1 :=∑Fij |Xij | is the weighted `1 norm of the matrix X.
By substituting the expression (5.2) for the inverse covariance matrix in (5.3) and using
the incidence matrix E, the topology of a network that generates close approximation
of a given sample covariance matrix can be identified by solving the following problem,
minimizex, z
J(x, z) + γ1
m∑l= 1
fl |xl| + γ2
N∑k= 1
gk |zk|
subject to E diag (x)ET + diag(z) � 0,
(NI)
where
J(x, z) = − log det(E diag (x)ET + diag(z)
)+ trace
(S (E diag (x)ET + diag(z)
).
Moreover, f ∈ Rm and g ∈ RN are the vectors of non-negative weights and (γ1, γ2)
80
are the positive regularization parameters. An effective heuristic for weight selection
is given by the iterative reweighted algorithm where the weights f and g are inversely
proportional to the magnitudes of x and z in the previous iteration [125]. Problem (NI) is
a convex but non-smooth optimization problem where the optimization variables are the
vector of the edge weights x ∈ Rm and the vector of leaders weights z ∈ RN . Relative
to our prior work in the previous chapter, our optimization problem has additional
structure induced by presence of the leader nodes.
The algorithm based on the sequential quadratic approximation of the smooth part
of the objective function in Section 4.3 can be utilized for solving (NI) with minor
changes. The difference is that the optimization variable size has increased form m to
m + N . This method benefits from exploiting second-order information of the smooth
part of the objective funtion and from computing the Newton direction using cyclic
coordinate descent [145] over the set of active variables. We solve the problem (NI) for
different values of (γ1, γ2) using a path-following iterative reweighted algorithm. The
topology identification then is followed by a polishing step to debias the identified edge
weights. We next propose an algorithm to solve the topology identification problem of
large-scale networks.
5.3 Growing a Chow-Liu tree network In this section, we discuss
an alternative algorithm for identifying the underlying network topology which is well-
suited for large-scale systems. In order to find the underlying network structure using
statistical data, the Chow-Liu tree algorithm [154] can be utilized. This method does
not require any matrix inversion; thereby, suitable for large-scale usage. However, as
discussed in [64], it causes false positives and negatives when using it for identifying
the topology of disconnected networks or networks with cycles, respectively. Herein,
we propose a framework in order to combine the Chow-Liu tree and the reweighted
graphical-LASSO algorithms for identifying the structure of connected networks with
cycles.
We consider the same consensus network in (5.1) and we assume that the sample
covariance matrix S is given. In order to use the Chow-Liu algorithm, the mutual infor-
mation matrix M should be constructed from the sample covariance matrix. Assuming
81
Gaussian distribution for the noise w, the mutual information is given by
Mij =1
2log
(Sii Sjj
Sii Sjj − S2ij
),
where Sij is the ijth element of the matrix S. We only use the n (n − 1)/2 off-diagonal
elements of this symmetric matrix to construct the Chow-Liu tree. A spanning tree of a
graph with n nodes has (n−1) edges. To build the Chow-Liu tree, we sort the elements
of the mutual information matrix and choose the biggest (n − 1) of them that do not
create cycles [154].
After finding the underlying tree network that generates close approximation of the
sample covariance matrix, our goal is to add a certain number of edges to the tree graph
in order to enhance the closed-loop performance [106]. The performance is measured
by the proximity of the second-order statistical data generated by the network to the
given sample covariance matrix.
Consider an undirected consensus tree network,
ψ = −Lt ψ + u + w, (5.4)
where w and u are the exogenous disturbance and the control input, respectively and
Lt is the graph Laplacian of the tree network that is obtained using the Chow-Liu
algorithm. The goal is to improve the performance of this system by growing the tree
network. We approach this problem as a feedback design problem with
u = − (Lx + Dz)ψ, (5.5)
whereDz is a diagonal matrix with the ith diagonal entry zi and the symmetric feedback-
gain matrix Lx is required to have the Laplacian structure. Since a nonzero ijth element
of Lx corresponds to an edge between the nodes i and j, the communication structure in
the controller graph is determined by the sparsity pattern of the matrix Lx. Moreover,
the ith node is a leader if zi is nonzero. By substituting the control input u from (5.5)
in (5.4)
ψ = − (Lt + Lx + Dz)ψ + w. (5.6)
82
For a computed Lt from the Chow-Liu algorithm, our objective is to design the topology
Lx and to identify the leader nodes in the network in order to achieve the desired
tradeoff between the controller sparsity and the network performance. The performance
is quantified by the proximity of the steady-state covariance matrix of the closed-loop
system to the sample covariance matrix.
Next, we are going to establish a relation between the closed-loop graph Laplacian
and the inverse covariance matrix of the network. The steady-state covariance of ψ is
given by
Σ = (Lt + Lx + Dz)−1, (5.7)
where Lx = Ediag(x)ET . Thus, the problem of identifying the sparse topology of a
network, i.e., finding Lx and Dz, that generates close approximation of a given sample
covariance matrix is equivalent to sparse inverse covariance estimation problem. This
can be achieved by solving a similar regularized maximum log-likelihood problem to (5.3)
with one main difference. The inverse covariance matrix X is the summation of the
Laplacian matrices of the tree plant network Lt and the controller network Lx + Dz.
Thus, the problem of growing a tree network in order to match the available statistical
data can be formulated as
minimizex, z
J(x, z) + γ1
m∑l= 1
fl |xl| + γ2
N∑k= 1
gk |zk|
subject to Lt + E diag (x)ET + diag(z) � 0,
(5.8)
whereJ(x, z) = − log det
(Lt + E diag (x)ET + diag(z)
)+
trace(S (E diag (x)ET + diag(z)
).
In the case of resistive networks (i.e., all the edge weights are nonnegative), since the
plant network is given by a tree graph, the closed-loop network is connected; thereby,
the optimization problem simplifies to
minimizex, z
J(x, z) + γ1
m∑l= 1
fl xl + γ2
N∑k= 1
gk zk
subject to x ≥ 0, z ≥ 0.
(5.9)
83
In this scenario, the topology identification problem turn into the problem of growing
a tree network and the positive definiteness constraint simplifies to nonnegativity con-
straints of the vectors x and z. Thus, several optimization algorithms can be employed
to solve this problem efficiently for large-scale networks.
5.4 Computational experiments We next illustrate the performance
of our methods by employing them to identify the topology of an RC network with
10 nodes and a tree structure shown in Fig. 5.1. The voltages of the nodes form the
states of the system. In this network, node 5 is grounded with z5 = 4 and all other
edge weights are one which implies that node 5 is a leader. This example is borrowed
from [64].
Figure 5.1: The RC tree network with n = 10 nodes.
5.4.1 Structured graphical-LASSO algorithm Assume that infinite samples
of the nodes’ voltages are available; thereby, the sample covariance matrix S is equal to
the exact solution of the Lyapunov equation. Moreover, we set γ1 = γ2 = γ. In this case,
the structured graphical-LASSO algorithm in Section 5.2 can completely recover the
underlying network topology for different values of γ. This example has been previously
studied in [64]. They show that when the sample covariance matrix is precise, the
graphical-LASSO algorithm results in 5 false positives and negatives. However, by
adding a structural constraint to the problem and using a reweighting scheme, we showed
that the same algorithm can recover the network topology with zero error.
Next, we utilize our method to solve the problem by using a sample covariance
84
(a) plant network (b) Chow-Liu tree (c) identified structure
Figure 5.2: (a) Plant network; (b) constructed Chow-Liu tree network; and (c) theidentified network from growing the Chow-Liu tree.
matrix which is not very close to the actual covariance matrix and is constructed from
only 80 samples. The algorithm is again able to recover the network topology and to
identify the leader for different values of γ.
It is worth to note that the performance of this method deteriorates if we replace
the reweighted `1 norm scheme with the `1 norm. In particular, by eliminating the
reweighted `1 norm, we observed the effect of grounding one of the nodes with high
capacitance. Although the network topology will be identified for some values of γ, by
increasing γ (to very large value), the algorithm chooses the optimal edges in the same
way as [64]. In particular, it ignores the connections between nodes 1 to 5 because of
their low variances. In the next example, we illustrate the effectiveness of growing a
Chow-Liu tree by using it on a synthetic dataset.
5.4.2 Topology identification via growing a Chow-Liu tree In this sec-
tion, the second method is utilized to identify the underlying structure of a network
with cycles. The original plant network is shown in Fig. 5.2a. We first assume that the
sample covariance matrix S is equal to the exact solution of the Lyapunov equation.
We form the mutual information matrix and construct the Chow-Liu tree accordingly
which is shown in Fig. 5.2b. We next grow this tree network in order to enhance the
performance of the closed-loop system. In particular, we solve the problem (5.8) to find
the leader nodes and the Laplacian matrix of the controller graph Lx. In this case, our
algorithm can completely recover the underlying network topology for different values
of γ. Next, we employ this algorithm to identify the topology of a larger network with
85
real data to evaluate its performance.
5.4.3 FMRI dataset The FMRI technique detects the activity of a region in the
brain by measuring the blood flow to that region. Since the blood flow increases in
an active part of the brain, the functioning regions can be identified by monitoring
the blood flow. The functional connectivity structure between different regions can be
revealed by utilizing different thresholding techniques [59,60]. The results indicate that
different regions of the brain that are not anatomically connected act closely together
and are functionally linked. Moreover, the previous studies have shown that the human
brain has small-world network properties [60].
In this section, we employ the second algorithm based on growing a Chow-Liu tree to
identify the underlying functional network of the human brain. The sample covariance
matrix is computed using the resting-state FMRI data of 20 healthy patients [60]. In
the resting-state FMRI, the patients are asked to close their eyes and try not to think.
The studies have shown that even in the rest state, the brain is highly active and
different regions of the brain are communicating with each other [155]. We collect 134
samples from 140 cortical brain regions (nodes) in the right hemisphere. The sample
correlation matrix for each patient is a 140 × 140 matrix and can be computed using the
time series data. The sample covariance matrices are not invertible since the number
of samples is smaller than the number of the nodes in the network. Thus, we use
our proposed algorithm to estimate the inverse covariance matrix and to identify the
underlying network structure of the human brain.
First, we form the mutual information matrix and construct the Chow-Liu tree
Fig 5.3a. Next, we grow the obtained tree network to identify the remained edges and
improve the performance of the closed-loop system. We set γ1 = γ2 = γ. The identified
networks for a randomly chosen patient are shown in Fig 5.3. In particular, as the
sparsity promoting parameter γ increases, the identified network gets sparser.
This example has been previously studied in [156]. Their results show that the
nodes that are located in the lower left corner of the graphs are highly connected to
their neighboring nodes. They compare this pattern of connectivity with the false
positives created by their algorithm in a synthetic RC circuit and conclude that the
high number of edges in that area is false positives created by the same phenomenon
86
(a) Chow-Liu tree graph with 139 edges (b) γ = 0.1
(c) γ = 1
Figure 5.3: The identified networks for different values of γ.
in the circuit example. However, by adding a structural constraint to the problem and
using a reweighting scheme, we showed that the underlying network can be recovered
without high connectivity in the lower left corner. Moreover, the general shape of the
identified network is consistent with the results reported in [60]. Furthermore, the small-
world properties such as high clustering and high efficiency coefficients can be seen in
the identified networks.
To conclude, it seems that using both an additional structural constraint and the
reweighted `1 norm scheme can improve the performance of the graphical-LASSO algo-
rithm significantly. Unlike the Chow-Liu algorithm that can be employed to construct
tree networks only, our algorithm is more general and overcomes the challenges associ-
ated with the conventional algorithms proposed in [156].
87
5.5 Concluding remarks We have studied the problem of sparse topology
identification of an undirected consensus network with leaders using second-order statis-
tical data. The goal is to identify a sparse interaction topology using sample covariance
matrix of the network state. We have introduced two algorithms based on regularized
Gaussian maximum likelihood and growing a Chow-Liu tree. In the first algorithm, we
propose a structured graphical-LASSO algorithm that uses the weighted `1 regularizer
as a proxy for inducing sparse network topology. The other method is based on growing
a Chow-Liu tree that is well-suited for identifying the underlying structure of large-scale
networks. Several examples have been provided to demonstrate the performance of our
framework.
Part III
Control theoretic approach for
analysis of optimization
algorithms
88
Chapter 6
Proximal gradient flow and
Douglas-Rachford splitting
dynamics: global exponential
stability via integral quadratic
constraints
Many large-scale and distributed optimization problems can be brought into a composite
form in which the objective function is given by the sum of a smooth term and a
nonsmooth regularizer. Such problems can be solved via a proximal gradient method and
its variants, thereby generalizing gradient descent to a nonsmooth setup. In this chapter,
we view proximal algorithms as dynamical systems and leverage techniques from control
theory to study their global properties. In particular, for problems with strongly convex
objective functions, we utilize the theory of integral quadratic constraints to prove
global exponential stability of the differential equations that govern the evolution of
proximal gradient and Douglas-Rachford splitting flows. In our analysis, we use the
fact that these algorithms can be interpreted as variable-metric gradient methods on the
forward-backward and the Douglas-Rachford envelopes and exploit structural properties
of the nonlinear terms that arise from the gradient of the smooth part of the objective
89
90
function and the proximal operator associated with the nonsmooth regularizer. We
also demonstrate that these envelopes can be obtained from the augmented Lagrangian
associated with the original nonsmooth problem and establish conditions for global
exponential convergence even in the absence of strong convexity.
6.1 Introduction Analysis of optimization algorithms from the system theo-
retic point of view has received significant recent attention [29–31]. In these references,
the optimization algorithm is interpreted as a feedback interconnection in which the
states converge to the optimal solution of the optimization problem.
In this chapter, we utilize techniques from control theory to establish global prop-
erties of proximal gradient flow and Douglas-Rachford (DR) splitting dynamics. These
algorithms provide an effective tool for solving nonsmooth convex optimization prob-
lems in which the objective function is given by a sum of a differentiable term and a
nondifferentiable regularizer. We exploit the fact that the proximal gradient flow can
be interpreted as a variable-metric gradient method on forward-backward (FB) enve-
lope [157–159] and show that this envelope can be obtained by restricting the proximal
augmented Lagrangian to the manifold in which the dual variable is given by the neg-
ative gradient of the smooth part of the objective function. When this smooth part is
strongly convex with a Lipschitz continuous gradient, we prove global exponential sta-
bility by utilizing the theory of IQCs [160]. We also generalize the Polyak-Lojasiewicz
(PL) [161] condition to nonsmooth problems and show global exponential convergence
of the FB envelope even in the absence of strong convexity. Finally, since the DR split-
ting algorithm [162] can be interpreted as a variable-metric gradient method on DR
envelope [157], we utilize similar approach to establish global stability properties of the
resulting continuous-time dynamics for strongly convex problems.
This chapter is structured as follows. In Section 6.2, we formulate the nonsmooth
composite optimization problem and provide background material. In Section 6.3,
we establish the global exponential stability of the proximal gradient flow dynamics
for a problem with strongly convex objective function. Moreover, by exploiting the
problem structure, we demonstrate the global exponential convergence of the forward-
backward envelope even in the absence of strong convexity. In Section 6.4, we introduce
a continuous-time gradient flow dynamics based on the celebrated Douglas-Rachford
91
splitting algorithm and utilize the theory of IQCs to prove global exponential stability
for strongly convex problems. We offer concluding remarks in Section 6.5.
6.2 Problem formulation and background We consider a com-
posite optimization problem,
minimizex
f(x) + g(Tx) (6.1)
where x ∈ Rn is the optimization variable, T ∈ Rm×n is a given matrix, f : Rn → Ris a continuously differentiable function with a Lipschitz continuous gradient, and g:
Rm → R is a nondifferentiable convex function. Such optimization problems arise in
a number of different applications and depending on the structure of the functions f
and g, different first- and second-order algorithms can be employed to solve them. We
are interested in studying global convergence properties of methods based on proximal
gradient flow algorithms. In what follows, we provide background material that we
utilize in the rest of the chapter.
6.2.1 Proximal operators and the associated envelopes The proximal op-
erator of a proper, lower semicontinuous, and convex function g is defined as
proxµg(v) := argminz
(g(z) +
1
2µ‖z − v‖22
).
The value function of this optimization problem determines the associated Moreau en-
velope,
Mµg(v) := g(proxµg(v)) +1
2µ‖proxµg(v) − v‖22
which is a continuously differentiable function even when g is not [76],
∇Mµg(v) =1
µ(v − proxµg(v)).
Combining the last two expressions yields,
Mµg(v) = g(proxµg(v)) +µ
2‖∇Mµg(v)‖22.
The Moreau envelope of g can be used to introduce the forward-backward (FB)
92
envelope [157–159] of the composite function
F (x) := f(x) + g(x).
The FB envelope is determined by the value function of the problem
minimizev
J(x, v) (6.2a)
where J approximates F via a simple quadratic expansion of the function f around x,
J(x, v) := f(x) + 〈∇f(x), v − x〉 + 12µ ‖v − x‖22 + g(v)
= g(v) + 12µ ‖v − (x − µ∇f(x))‖22 + f(x) − µ
2 ‖∇f(x)‖22.(6.2b)
The optimal solution of (6.2) is determined by
v?µ(x) = proxµg(x − µ∇f(x))
and it can be used to obtain the FB envelope of the function F ,
Fµ(x) := J(x, v?µ(x)) = J(x,proxµg(x − µ∇f(x)))
= f(x) − µ 〈∇f(x), Gµ(x)〉 +µ
2‖Gµ(x)‖22 + g(proxµg(x − µ∇f(x)))
(6.3)
where Gµ is the generalized gradient map,
Gµ(x) :=1
µ(x − proxµg(x − µ∇f(x))). (6.4)
Alternatively, the FB envelope Fµ can be also expressed as
Fµ(x) = f(x) + Mµg(x − µ∇f(x)) − µ
2‖∇f(x)‖22. (6.5)
Moreover, when f is twice continuously differentiable, Fµ is continuously differentiable
and its gradient is determined by [157],
∇Fµ(x) = (I − µ∇2f(x))Gµ(x). (6.6)
93
The Douglas-Rachford (DR) envelope is another useful object that is obtained by
evaluating the FB envelope at proxµf (x) [163],
FDRµ (x) := Fµ(proxµf (x)). (6.7)
Alternatively, the DR envelope can be expressed as
FDRµ (x) = Mµg(x− 2µ∇Mµf (x)) + Mµf (x) − µ‖∇Mµf (x)‖22. (6.8)
From the definition of the proximal operator of the continuously differentiable function
f , we have
µ∇f(proxµf (x)) + proxµf (x) − x = 0 (6.9)
and, thus,
∇Mµf (x) = ∇f(proxµf (x)). (6.10)
Equality (6.7) follows from substituting the expression for ∇Mµf (x) into (6.8), using
equation (6.9), and leveraging the properties of the Moreau envelope,
FDRµ (x) = Mµg(x− 2µ∇f(proxµf (x)) + f(proxµf (x)) +
1
2µ‖proxµf (x) − x‖22 − µ ‖∇f(proxµf (x))‖22
= Mµg(proxµf (x)− µ∇f(proxµf (x)) + f(proxµf (x))− µ
2‖∇f(proxµf (x))‖22
= Fµ(proxµf (x)).
If f is twice continuously differentiable with∇2f(x) � LfI for all x, the DR envelope
is continuously differentiable and its gradient is given by [163]
∇FDRµ (x) =1
µ(2∇proxµf (x) − I)GDRµ (x) (6.11)
where
∇proxµf (x) =(I + µ∇2f(proxµf (x))
)−1
and
GDRµ (x) := proxµf (x) − proxµg(2proxµf (x)− x). (6.12)
94
6.2.2 Proximal augmented Lagrangian By introducing an auxiliary optimiza-
tion variable z, problem (6.1) can be rewritten as follows,
minimizex, z
f(x) + g(z)
subject to Tx − z = 0.(6.13)
The augmented Lagrangian associated with constrained optimization problem (6.13) is
given by,
Lµ(x, z; y) := f(x) + g(z) + 〈y, Tx− z〉 + 12µ ‖Tx− z‖
22
and the completion of squares yields,
Lµ = f(x) + g(z) + 12µ ‖z − (Tx + µy)‖22 −
µ2 ‖y‖
22
where y is the Lagrange multiplier and µ is a positive parameter. The minimizer of Lµwith respect to z is
z?(x, y) = proxµg(Tx + µy)
and the evaluation of Lµ(x, z; y) along the manifold resulting from the explicit mini-
mization over z yields the proximal augmented Lagrangian [81],
Lµ(x; y) := Lµ(x, z?(x, y); y)
= f(x) + Mµg(Tx + µy) − µ2 ‖y‖
22.
(6.14)
This function is continuously differentiable with respect to both x and y and it can be
used as a foundation for the development of different first- and second-order primal-
dual methods for nonsmooth composite optimization [81, 108]. It is noteworthy that,
for T = I, forward-backward envelope Fµ(x) is obtained by restricting the proximal
augmented Lagrangian Lµ(x; y) along the manifold y?(x) = −∇f(x) resulting from
KKT optimality conditions,
Fµ(x) := Lµ(x; y?(x)) = Lµ(x; y = −∇f(x))
= f(x) + Mµg(x − µ∇f(x)) − µ2 ‖∇f(x)‖22.
95
Moreover, relation (6.7) between the two envelopes allows us to interpret the DR en-
velope as the proximal augmented Lagrangian evaluated at proxµf (x) and restricted
along the manifold y?(proxµf (x)) = −∇f(proxµf (x)) = −∇Mµf (x),
FDRµ (x) := Lµ(proxµf (x); y?(proxµf (x)))
= Mµf (x) + Mµg(x − 2µ∇Mµf (x)) − µ ‖∇Mµf (x)‖22.
6.2.3 Strong convexity and Lipschitz continuity The function f is strongly
convex with parameter mf if for any x and x,
f(x) ≥ f(x) + 〈∇f(x), x − x〉 +mf
2‖x − x‖22
and equivalently,
‖∇f(x)−∇f(x)‖2 ≥ mf‖x − x‖2. (6.15)
The gradient of a continuously-differentiable function f is Lipschitz continuous with
parameter Lf if for any x and x,
f(x) ≤ f(x) + 〈∇f(x), x − x〉 +Lf2‖x − x‖22
and equivalently,
‖∇f(x)−∇f(x)‖2 ≤ Lf‖x − x‖2. (6.16)
Moreover, if an mf -strongly convex function f has an Lf -Lipschitz continuous gradient,
the following inequality holds for any x and x,
〈∇f(x)−∇f(x), x − x〉 ≥mf Lfmf + Lf
‖x − x‖22 +1
mf + Lf‖∇f(x)−∇f(x)‖22.
(6.17)
Furthermore, the subgradient ∂g of a nondifferentiable function g is defined as the set
of points z ∈ ∂g(x) that for any x and x satisfy,
g(x) ≥ g(x) + zT (x − x). (6.18)
6.2.4 Proximal Polyak-Lojasiewicz inequality The Polyak-Lojasiewicz (PL)
condition can be used to establish linear convergence of standard gradient descent
96
method in the absence of strong convexity (or even convexity) [164]. For an uncon-
strained optimization problem with a non-empty solution set,
minimizex
f(x)
where f is a twice differentiable function with a Lipschitz continuous gradient, the PL
condition is given by
‖∇f(x)‖22 ≥ γ (f(x) − f?)
where γ > 0 and f? is the optimal value of the function f .
We next provide the generalization of the PL condition for nonsmooth composite
optimization problems (6.13) with T = I. For this class of problems, the proximal PL
inequality holds for µ ∈ (0, 1/Lf ) if there exist γ > 0 such that
‖Gµ(x)‖22 ≥ γ (Fµ(x) − F ?µ). (6.19)
Here, Lf is the Lipschitz constant of∇f , Fµ is the FB envelope, andGµ is the generalized
gradient map. In Appendix A, we show that (6.19) is equivalent to the condition
provided in [164].
6.3 Exponential stability of proximal algorithms In this sec-
tion, we briefly discuss the Arrow-Hurwicz-Uzawa gradient flow dynamics that can be
used to solve (6.13) by computing the saddle points of the proximal augmented La-
grangian [81]. We then show that the proximal gradient method in continuous time can
be obtained from the proximal augmented Lagrangian method by restricting the dual
variable along the manifold y = −∇f(x). Finally, we discuss global stability properties
of proximal algorithms both in the presence and in the absence of strong convexity.
Continuous differentiability of the proximal augmented Lagrangian (6.14) can be
utilized to compute its saddle points via the Arrow-Hurwicz-Uzawa gradient flow dy-
namic, [x
y
]=
[−(∇f(x) + T T∇Mµg(Tx + µy)
)µ (∇Mµg(Tx + µy) − y)
]. (6.20)
As shown in [81], these primal-descent dual-ascent gradient flow dynamics are globally
exponentially stable for convex problems in which the matrix TT T is invertible and the
97
smooth part of the objective function f is strongly convex.
For convex problems with T = I in (6.1),
minimizex
f(x) + g(x) (6.21)
the proximal gradient method,
xk+1 = proxαkg(xk − αk∇f(xk)) (6.22)
with the stepsize αk ∈ (0, 1/Lf ] can be used to solve (6.21), where Lf is the Lipschitz
constant of ∇f . In [157], it was demonstrated that (6.22) can be interpreted as a
variable-metric gradient method on FB envelope,
xk+1 = xk − αk(I − αk∇2f(x)
)−1∇Fαk(xk)
= xk − αkGαk(xk).
This interpretation can be utilized to compute the optimal solution to (6.21) using the
continuous-time proximal gradient flow dynamics
x = −Gµ(x)
= − (∇f(x) + ∇Mµg(x − µ∇f(x)))
= − 1µ
(x − proxµg(x − µ∇f(x))
).
(6.23)
Remark 8. Proximal gradient flow dynamics (6.23) are different from the subgradient
flow dynamics associated with nonsmooth composite optimization problem (6.21). We
note that proximal gradient algorithm (6.22) can be obtained via explicit forward Euler
discretization of (6.23) with the stepsize µ = αk. This should be compared and contrasted
to a standard interpretation [76] in which (6.22) results from implicit backward Euler
discretization of the subgradient flow dynamics associated with (6.21). We also note
that (6.23) can be obtained by substituting −∇f(x) for the dual variable y in the x-
update step of primal-descent dual-ascent gradient flow dynamics (6.20) with T = I.
We next examine global stability properties of proximal gradient flow dynamics (6.23),
first for strongly convex problems and then for the problems in which only the PL con-
dition holds.
98
6.3.1 Strongly convex problems Herein, we utilize the theory of integral quadratic
constraints to establish global asymptotic stability of proximal gradient flow dynam-
ics (6.23) under the following assumption.
Assumption 1. Let the differentiable part f of the objective function in (6.21) be
strongly convex with parameter mf , let ∇f be Lipschitz continuous with parameter Lf ,
and let the regularization function g be proper, lower semicontinuous, and convex.
Proximal gradient flow dynamics (6.23) can be expressed as a feedback interconnec-
tion of an LTI system
x = Ax + B u
ξ = C x(6.24a)
with a nonlinear term,
u(x) := proxµg(x − µ∇f(x)). (6.24b)
Here,
A = − 1µ I, B = 1
µ I, C = I (6.24c)
and the corresponding transfer function H(s), ξ(s) = H(s)u(s), is
H(s) = C (s I − A)−1B =1
µs + 1I. (6.24d)
The following lemma combines firm nonexpansiveness of proxµg, strong convexity of
f , and Lipschitz continuity of ∇f to characterize nonlinear map (6.24b) by establishing
a quadratic inequality that u(x) = proxµg(x− µ∇f(x)) satisfies.
Lemma 7. Let Assumption 1 hold. Then, for any x ∈ Rn, x ∈ Rn, u := proxµg(x −µ∇f(x)), and u := proxµg(x− µ∇f(x)), the pointwise quadratic inequality
[x − x
u − u
]T [η2I 0
0 −I
]︸ ︷︷ ︸
Π
[x − x
u − u
]≥ 0 (6.25)
holds, where
η = max {|1 − µmf |, |1 − µLf |} . (6.26)
99Proof. See Appendix A.1.
Lemma 8. The nonlinear function u(x) := proxµg(x − µ∇f(x)) is a contraction for
µ ∈ (0, 2/Lf ).
Proof. From (A.8), it follows that the nonlinear function u(x) is a contraction for η < 1.
From (6.26), we see that η < 1 if and only if,
−1 < 1 − µLf < 1 and − 1 < 1 − µmf < 1.
Since mf ≤ Lf , these conditions hold for µ ∈ (0, 2/Lf ) .We next employ the KYP lemma in the frequency domain [165][
Hρ(jω)
I
]∗Π
[Hρ(jω)
I
]≺ 0, ∀ω ∈ R (6.27)
where Π is given by (6.25), ω is the temporal frequency, and
Hρ(jω) := C (jωI − (A + ρI))−1B =1
jµω + 1− µρI
to establish global exponential stability of (6.24). Note that stability of Hρ requires
µ < 1/ρ.
Theorem 9. Let Assumption 1 hold and let µ ∈ (0, 2/Lf ). Then, proximal gradient
flow dynamics (6.24) are globally exponentially stable with the rate ρ < (1− η)/µ, i.e.,
‖x(t) − x?‖2 ≤ c e−ρt‖x(0) − x?‖2, ∀ t ≥ 0
where c is a positive constant and η is given by (6.26).
Proof. Substituting Π given by (6.25) into (6.27) implies that the condition (6.27) holds
for all ω ∈ R if
I − η2H∗ρ (jω)Hρ(jω) � 0.
This inequality can be equivalently written as
(µ2ω2 + (1 − µρ)2 − η2
)I � 0 (6.28)
100
which holds for every ω ∈ R if
(1 − µρ + η)(1 − µρ − η) > 0.
Since stability of Hρ requires 1−µρ > 0, the first term is positive and (6.28) is satisfied
for all ω if 1 − µρ − η > 0. Finally, since η < 1 for µ ∈ (0, 2/Lf ), dynamics (6.24) are
globally exponentially stable with the rate ρ < (1− η)/µ.
Remark 9. For µ = 2/(Lf + mf ), the second term on the right-hand-side in (A.7)
disappears and the parameter η is given by
η =Lf − mf
Lf + mf=
κ − 1
κ + 1
where κ := Lf/mf is the condition number associated with the function f . In this case,
the convergence rate is upper bounded by the strong convexity module mf , i.e., ρ < mf .
Remark 10. In Appendix B, by restricting our attention to in-network optimization,
we provide a distributed implementation based on the proximal augmented Lagrangian.
Furthermore, by introducing an appropriate change of coordinates, we utilize the theory
of IQCs to prove global exponential stability under the strong convexity assumption.
6.3.2 Proximal Polyak-Lojasiewicz condition Next, we consider the prob-
lems in which the function f is not strongly convex but the function F := f +g satisfies
the proximal PL condition (6.19).
Assumption 2. Let the regularization function g in (6.13) be proper, lower semicon-
tinuous, and convex, let f be twice continuously differentiable with ∇2f(x) � LfI, and
let the generalized gradient map satisfy the proximal PL condition,
‖Gµ(x)‖22 ≥ γ (Fµ(x) − F ?µ)
where µ ∈ (0, 1/Lf ), γ > 0, and F ?µ is the optimal value of the FB envelope Fµ.
Remark 11. We recall that the proximal gradient algorithm can be interpreted as a
variable-metric gradient method on FB envelope and that (6.23) can be equivalently
written as
x = −(I − µ∇2f(x))−1∇Fµ(x).
101
Under Assumption 2, the matrix I − µ∇2f(x) is invertible and the functions F and Fµ
have the same minimizers and the same optimal values [157],
argminx
F (x) = argminx
Fµ(x), F ? = F ?µ .
This motivates study of the convergence properties of (6.23) in terms of the FB envelope.
Theorem 10. Let Assumption 2 hold. Then the forward-backward envelope associated
with proximal gradient flow dynamics (6.23) converge exponentially to F ?µ = F ? with
the rate ρ = γ(1− µLf ), i.e.,
Fµ(x(t)) − F ?µ ≤ e−ρt(Fµ(x(0)) − F ?µ), ∀ t ≥ 0.
Proof. We introduce a Lyapunov function candidate,
V (x) = Fµ(x) − F ?µ
where Fµ is the FB envelope. The derivative of V along the solutions of (6.23) is given
by
V (x) = 〈∇Fµ(x), x〉
= −⟨∇Fµ(x), (I − µ∇2f(x))−1∇Fµ(x)
⟩= −
⟨Gµ(x), (I − µ∇2f(x))Gµ(x)
⟩where the last expression follows from (6.6). Since the gradient of f is Lf -Lipschitz
continuous, i.e., ∇2f(x) � LfI for all x ∈ Rn, Assumption 2 implies
− (I − µ∇2f(x)) � −(1 − µLf )I
and, thus,
V (x) ≤ −(1 − µLf ) ‖Gµ(x)‖22
≤ −γ (1 − µLf ) (Fµ(x) − F ?µ)(6.29)
is non-positive for µ ∈ (0, 1/Lf ). Moreover, combining the last inequality with the
definition of V yields,
V ≤ −γ (1 − µLf )V
102
which implies
Fµ(x(t)) − F ?µ ≤ e−γ (1−µLf )t(Fµ(x(0)) − F ?µ).
Remark 12. When the proximal PL condition is satisfied, Fµ(x(t)) − F ?µ converges
exponentially but, in the absence of strong convexity, the exponential convergence rate
cannot be established for ‖x(t)−x?‖2. Thus, in the absence of strong convexity, although
the objective function converges exponentially fast, the solution to (6.23) does not enjoy
this convergence rate. To the best of our knowledge, the convergence rate of x(t) to the
set of optimal values x? is not known in this case.
6.4 Global exponential stability of Douglas-Rachford split-ting dynamics We next introduce a continuous-time gradient flow dynamics
based on the well-known Douglas-Rachford splitting algorithm [162] and establish global
exponential stability for strongly convex f .
6.4.1 Non-smooth composite optimization problem The optimality con-
dition for non-smooth composite optimization problem (6.21) is given by
0 ∈ ∇f(x) + ∂g(x).
Multiplication by µ and addition/subtraction of x yields,
0 ∈ [I + µ∇f ] (x) + µ∂g(x) − x.
Since the proximal operator associated with the function µf is determined by the re-
solvent operator of µ∇f , we have
x = (I + µ∇f)−1(x − µ∂g(x)) = proxµf (x − µ∂g(x)).
Introducing a new variable z := x− µ∂g(x) allows us to bring the optimality condition
into the following form
x = proxµf (z)
or, equivalently,
µ∂g(x) = proxµf (z) − z.
103
Now, adding x to both sides of this equation yields
[I + µ∂g] (x) = x + proxµf (z) − z
= 2proxµf (z) − z
which leads to,
x? = proxµg(2 proxµf (z?) − z?) = proxµf (z?). (6.30a)
Furthermore, the reflected proximal operators [166] of the functions f and g,
Rµf (z) := [2 proxµf − I](z), Rµg := [2 proxµg − I](z)
can be used to write optimality condition (6.30a) as
[RµgRµf ](z?) = z?. (6.30b)
This follows from (6.30a) and
[RµgRµf ](z) = z + 2(proxµg(2 proxµf (z) − z)− proxµf (z)).
Building on the optimality conditions, the DR splitting algorithm consists of the
following iterative steps,
xk = proxµf (zk)
yk = proxµg(2xk − zk)
zk+1 = zk + 2α(yk − xk).
(6.31)
Under standard convexity assumptions [167], the DR splitting algorithm converges for
α ∈ (0, 1). Combining all the steps in (6.31) yields the first-order recurrence,
zk+1 = zk + 2α(proxµg(2x
k − zk) − xk)
= [(1 − α)I + αRµgRµf ] (zk)(6.32)
where zk converges to the fixed point of the operator RµfRµg and xk converges to the
104
optimal solution of (6.21).
Optimality conditions (6.30) can be used to obtain the continuous-time gradient flow
dynamics to compute z?,
z = −z + [RµgRµf ](z)
= proxµg(2 proxµf (z) − z) − proxµf (z)
= −GDRµ (z)
(6.33)
where GDRµ (z) is given by (6.12). We note that the discrete-time system (6.32) results
from the explicit forward Euler discretization of (6.33) with the stepsize α.
Remark 13. Using the definition of ∇FDRµ (x) in (6.11), the continuous-time sys-
tem (6.33) can be written as
z = −µ(2∇proxµf (z) − I
)−1∇FDRµ (z)
where the inverse is well-defined for µ ∈ (0, 1/Lf ). Thus, the DR splitting algorithm
can be interpreted as a variable-metric gradient method on the DR envelope FDRµ [157].
The continuous-time dynamics (6.33) can be also seen as a feedback interconnection
of an LTI system
z = Az + B u
ξ = C z(6.34a)
with a nonlinear term,
u(z) := [RµgRµf ](z). (6.34b)
Herein, the matrices in LTI representation (6.34a) are given by
A = − I, B = I, C = I (6.34c)
and the corresponding transfer function is
H(s) = C(s I − A)−1B =1
s + 1I. (6.34d)
105
We first characterize properties of nonlinearity u in (6.34b) and then, similar to
the previous section, employ the KYP lemma to establish the conditions for global
exponential stability of (6.34).
Lemma 11. Let Assumption 1 hold and let µ ∈ (0, 2/Lf ). Then, the operator Rµf is
σ-contractive,
‖Rµf (x) − Rµf (y)‖2 ≤ σ‖x − y‖2
where σ is given by
σ = max {|1 − µmf |, |1 − µLf |} < 1. (6.35)
Proof. Given zx := proxµf (x) and zy := proxµf (y), x and y can be computed as follows
x = zx + µ∇f(zx), y = zy + µ∇f(zy).
Thus,
‖Rµf (x) − Rµf (y)‖2 = ‖2(zx − zy) − (x− y)‖2
= ‖(zx − zy) − µ (∇f(zx)−∇f(zy))‖2
= ‖zx − zy‖2 + |µ(∇f(zx)−∇f(zy))‖2 −
2µ 〈∇f(zx)−∇f(zy), zx − zy〉
≤ max{
(1 − µLf )2, (1 − µmf )2}‖zx − zy‖2
≤ σ ‖x − y‖2.
where the firm non-expansiveness of the proximal operator is used in the last step.
Moreover, according to Lemma 8, for µ ∈ (0, 2/Lf ) we have σ < 1, which completes the
proof.
Lemma 12. Let Assumption 1 hold and let µ ∈ (0, 2/Lf ). Then, the operator Rµg is
firmly non-expansive.
106
Proof.
‖Rµg(x)−Rµg(y)‖22 = 4 ‖proxµf (x) − proxµf (y)‖22 + ‖x − y‖22 −
4⟨x − y,proxµf (x) − proxµf (y)
⟩≤ ‖x − y‖22.
Remark 14. Since Rµg is firmly non-expansive and Rµf is σ-contractive, the composite
operator RµgRµf is also σ-contractive. Moreover, since the operator Rµf and nonlin-
earity u in (6.24b) have the same contraction parameters, the quadratic inequality that
describes nonlinearity (6.24b) can be also utilized to characterize the composite operator
RµgRµf .
Theorem 13. Let Assumption 1 hold. Then, the DR splitting dynamics (6.34) are
globally exponentially stable, i.e., there is c > 0 and ρ ∈ (0, 1− σ) such that,
‖z(t) − z?‖ ≤ c e−ρt‖z(0) − z?‖, ∀ t ≥ 0.
Proof. Note that although the nonlinearities in systems (6.20) and (6.34) are different,
they have the same quadratic characterizations. Moreover, the dynamics are the same
except for a multiplicative coefficient 1/µ. Thus, the KYP lemma implies the global
exponential stability of (6.34) if there exists ρ ∈ (0, 1) such that,
σ2Hρ(jω)∗Hρ(jω) − I ≺ 0, ∀ω ∈ R, (6.36)
where
Hρ(jω) = C (jωI − (A + ρI))−1B =1
j ω + 1− ρI.
The inequality (6.36) is satisfied if,
σ2 − (1 − ρ)2 − ω2 < 0, ∀ω ∈ R
which proves ρ < 1− σ.
107
6.4.2 Douglas-Rachford splitting on the dual problem The DR splitting
algorithm cannot be used to directly solve a problem with a more general linear equality
constraint,
minimizex, z
f(x) + g(z)
subject to Tx + Sz = r(6.37)
where T ∈ Rm×n, S ∈ Rm×n, and r ∈ Rm are problem data. However, it can be utilized
to solve the dual problem,
minimizeζ
f1(ζ) + g1(ζ) (6.38)
wheref1(ζ) = f?(−T T ζ) + rT ζ
g1(ζ) = g?(−ST ζ)
and h?(ζ) := supx(ζTx − h(x)) is the conjugate of the function h. It is a standard
fact [167, 168] that solving the dual problem (6.38) via the DR splitting algorithm
is equivalent to using ADMM for the original problem (6.37). Next, we introduce a
gradient flow dynamics based on the DR splitting algorithm for solving (6.38) and
demonstrate global exponentially stability under the following assumption.
Assumption 3. Let the differentiable part f of the objective function in (6.37) be
strongly convex with parameter mf , let ∇f be Lipschitz continuous with parameter Lf ,
let the function g be proper, lower semicontinuous, and convex, and let the matrix T be
full row rank.
The continuous-time DR splitting algorithm
ζ = proxµg1(2 proxµf1(ζ) − ζ) − proxµf1(ζ) (6.39)
can be used to compute the optimal solution ζ? to (6.38). It is well-known that the
conjugate functions are convex [126]. Since (6.39) is identical to (6.33), if f1 satisfies
the conditions in Assumption 1 global exponential stability of (6.39) follows from The-
orem 13. For a full row rank T , f1 is strongly convex and ∇f1 is Lipschitz continuous
108
with parameters [166, Proposition 4],
Lf1 = ‖T T ‖2/mf , mf1 = θ2/Lf
where θ is a positive parameter that always exists and satisfies ‖T T ν‖ ≥ θ‖ν‖ for all
ν ∈ Rm. Thus, both f1 and g1 satisfy Assumption 3 and global exponential stability
of (6.39) follows from Theorem 13.
6.5 Concluding remarks We study a class of nonsmooth optimization
problems in which it is desired to minimize the sum of a continuously differentiable func-
tion with a Lipschitz continuous gradient and a nondifferentiable function. For strongly
convex problems, we employ the theory of integral quadratic constraints to prove global
exponential stability of proximal gradient flow and Douglas-Rachford splitting dynam-
ics. We also propose a generalization of the Polyak-Lojasiewicz condition to nonsmooth
problems and demonstrate the global exponential convergence of the forward-backward
envelope for the proximal gradient flow algorithm even in the absence of strong convex-
ity.
Chapter 7
Distributed proximal augmented
Lagrangian method for
nonsmooth composite
optimization
We study a class of nonsmooth composite optimization problems in which the con-
vex objective function is given by a sum of differentiable and nondifferentiable terms.
By introducing auxiliary variables in nondifferentiable terms, we provide an equivalent
consensus-based characterization that is convenient for distributed implementation. The
Moreau envelope associated with the nonsmooth part of the objective function is used
to bring the optimization problem into a continuously differentiable form that serves
as a basis for the development of a primal-descent dual-ascent gradient flow method.
This algorithm exploits separability of the objective function and is well-suited for in-
network optimization. We prove global asymptotic stability of the proposed algorithm
and solve the problem of growing undirected consensus networks in a distributed manner
to demonstrate its effectiveness.
7.1 Introduction We study a class of nonsmooth composite optimization
problems in which the convex objective function is a sum of differentiable and non-
differentiable functions. Among other applications, these problems emerge in machine
109
110
learning, compressive sensing, and control. Recently, regularization has been used as
a promising tool for enhancing utility of standard optimal control techniques. In this
approach, commonly used performance measures are augmented with regularization
functions that are supposed to promote some desired structural features in the dis-
tributed controller, e.g., sparsity. Such an approach has received significant attention in
recent years [41,42,71,106,115,118,119], but computing optimal solutions in large-scale
problems still remains a challenge.
Generic descent methods cannot be used in the nonsmooth composite optimization
problems due to the presence of a nondifferentiable component in the objective function.
Moreover, these standard methods are not well-suited for distributed implementation.
An alternative approach is to separate the smooth and nonsmooth parts of the objective
function and use the alternating direction method of multipliers (ADMM). In [111], we
exploit separability of the objective function and utilize an ADMM-based consensus
algorithm to solve the regularized optimal control problem in a distributed manner over
multiple processors. Even though the optimal control problem is in general non-convex,
recent results can be utilized to show convergence to a local minimum [112]. However, in
an update step of the ADMM algorithm, all the processors halt to compute the weighted
average (the gathering step) [113].
Herein, we build on recent work [108] in which the structure of proximal oper-
ators associated with nonsmooth regularizers was exploited to bring the augmented
Lagrangian into a continuously differentiable form. Such an approach is suitable for de-
veloping an algorithm based on primal-descent dual-ascent gradient method. We use the
Arrow-Hurwicz-Uzawa gradient flow dynamics [77] and propose an algorithm that can
be implemented in a fully distributed manner over multiple processors. This increases
the computational efficiency and reduces the overall computation time. By exploiting
convexity of the smooth part of the objective function, we show asymptotic convergence
of our algorithm.
The point of departure of our work from [108] is that we study a more general
form of consensus optimization problems in which the optimization variable is a matrix
111
and develop a fully distributed algorithm. Furthermore, while most existing primal-
dual techniques for nonsmooth distributed optimization employ subgradient flow meth-
ods [73,79,169], our approach yields a gradient flow dynamics with a continuous right-
hand side even for nonsmooth problems.
The rest of the chapter is structured as follows. In Section 7.2, we formulate the
nonsmooth composite optimization problem, discuss a motivating example, and provide
background on proximal operators and the consensus-based ADMM algorithm. In Sec-
tion 7.3, by exploiting the structure of proximal operators, we introduce the proximal
augmented Lagrangian. In Section 7.4, we use the Arrow-Hurwicz-Uzawa method to
develop the gradient flow dynamics that are well-suited for distributed computations
and prove global asymptotic stability. In Section 7.5, we discuss distributed implemen-
tation, in Section 7.6, we provide examples, and, in Section 7.7, we offer concluding
remarks.
7.2 Problem formulation We consider a composite convex optimization
problem,
minimizex
n∑i= 1
fi(x) + g(x) (7.1)
where x ∈ Rm is the optimization variable, the functions fi are continuously differen-
tiable, and the function g is possibly nondifferentiable. This problem can be brought
into a standard consensus form by introducing n local variables xi and a global variable
z,
minimizexi, z
n∑i= 1
fi(xi) + g(z)
subject to xi − z = 0, i = 1, . . . , n.
(7.2)
Even though this reformulation increases the number of optimization variables, it
facilitates distributed computations by bringing the objective function into a separa-
ble form. Clearly, the solutions to (7.1) and (7.2) coincide but, in contrast to (7.1),
optimization problem (7.2) is convenient for distributed implementation. Solving (7.2)
on a single processor is not necessarily more computationally efficient than solving the
original problem via a centralized algorithm. However, optimization problem (7.2) can
be split into n separate subproblems over n different processors. In such a setup, each
112
processor solves an optimization problem that involves a local objective function fi of a
single local variable xi. This is advantageous for large-scale systems where centralized
implementation is prohibitively complex and cannot be afforded.
7.2.1 Motivating application Problem (7.1) arises in feedback design when a
performance metric, e.g., the H2 norm, is augmented with a regularization function
to promote structural features in the optimal controller. Herein, we discuss the prob-
lem of growing undirected consensus networks and show that the objective function is
separable; thereby, it completely fits into the framework (7.2).
We consider the controlled undirected network,
ψ = −Lp ψ + d + u
ζ =
[Q1/2
0
]ψ +
[0
R1/2
]u
where d and u are the disturbance and control inputs, ψ is the state, and ζ is the
performance output. The dynamic matrix Lp is the Laplacian matrix of the plant
network and symmetric matrices Q � 0 and R � 0 specify the state and control weights
in the performance output. For memoryless control laws,
u = −Lx ψ
where Lx is the Laplacian matrix of the controller graph, the closed-loop system is given
by
ψ = − (Lp + Lx)ψ + d
ζ =
[Q1/2
−R1/2Lx
]ψ.
(7.3)
In the absence of exogenous disturbances, the network converges to the average of the
initial node values ψ = (1/n)∑
i ψi(0) if and only if it is connected [2]. Let Q :=
I − (1/n)11T penalizes the deviation of individual node values from average. The
objective is to minimize the mean square deviation from the network average by adding
a few additional edges, specified by the graph Laplacian Lx of a controller network. If
113
E is the incidence matrix of the controller graph, Lx can be written as
Lx = E diag (x)ET
where diag (x) is a diagonal matrix containing the optimization variable x ∈ Rm (i.e., the
vector of the edge weights in the controller graph). Regularization terms may be used
to promote sparsity of the controller network or to impose some additional constraints
on the edge weights. The matrix Lx that optimizes the closed-loop performance and has
certain structural properties can be obtained by solving the regularized optimal control
problem
minimizex
f(x) + g(x). (7.4)
Here, f is the function that quantifies the closed-loop performance, i.e. the H2 norm,
and g is the regularization function that is introduced to promote certain structural
properties of Lx. For example, when it is desired to design Lx with a specified pattern
of zero elements, g is an indicator function of the set that characterizes this pattern [114].
When it is desired to promote sparsity of Lx, the `1 norm g(x) = γ∑
i |xi| can be used
as a sparsity-enhancing regularizer, where γ is the positive parameter that characterizes
emphasis on sparsity [42].Next, we exploit the square-additive property of the H2 norm to provide an equiv-
alent representation that is convenient for large-scale and distributed optimization. As
shown in [106], up to an additive constant, the square of the H2 norm (from d to ζ) is
determined by
f(x) = trace((E diag (x)ET + Lp)
†(I + LpRLp))
+ diag(ETRE
)Tx
where the pseudo-inverse of the closed-loop graph Laplacian is given by
(Ediag(x)ET + Lp)† = (Ediag(x)ET + (1/n)11T + Lp)
−1.
It is easy to show that f(x) can be written as
f(x) =n∑
i= 1
fi(x)
114
where
fi(x) = ξTi(E diag (x)ET + (1/n)11T + Lp
)−1ξi + (1/n) diag
(ETRE
)Tx. (7.5)
Here, ξi = (I + LpRLp)1/2 ei is the ith column of the square root of the matrix
(I + LpRLp). Moreover, it can be shown that the gradient of fi(x) is given by
∇fi(x) = (1/n) diag(ETRE
)− νi(x) ◦ νi(x) (7.6)
where ◦ is the elementwise multiplication and
νi(x) = ET(E diag (x)ET + (1/n)11T + Lp
)−1ξi. (7.7)
In what follows, we provide essential background on the proximal operators that we
utilize for the latter developments.
7.2.2 Background
Proximal operators
The proximal operator of the function g is given by
proxµg(v) := argminz
g(z) +1
2µ‖z − v‖2,
and the Moreau envelope determines the corresponding value function,
Mµg(v) := g(proxµg(v)) +1
2µ‖proxµg(v) − v‖22. (7.8)
Irrespective of differentiability of g, Moreau envelope is a continuously differentiable
function and its gradient is given by [76],
∇Mµg(v) =1
µ(v − proxµg(v)). (7.9)
The above defined functions play essential role in our subsequent developments.
115
Alternating Direction Method of Multipliers (ADMM)
We next demonstrate that a standard consensus algorithm based on the ADMM [113]
can be used to solve the problem (7.2). This algorithm is well-suited for distributed
implementation in which each processor solves an optimization problem. More details
can be found in [111].
The augmented Lagrangian associated with (7.2) is given by
Lµi(xi, z;λ) := g(z) +
n∑i= 1
( fi(xi) + 〈λi, xi − z〉 +1
2µi‖xi − z‖22) (7.10)
where λi’s are the Lagrange multipliers and µi’s are positive parameters. The distributed
ADMM algorithm consists of the following iterative steps,
xk+1i = argmin
xifi(xi) +
1
2µi‖xi − uki ‖22
zk+1 = argminz
g(z) +
n∑i=1
1
2µi‖z − vki ‖22
λk+1i = λki +
1
µi
(xk+1i − zk+1
)where
uki := zk − µi λki
vki := xk+1i + µi λ
ki .
The xi-minimization step can be done via distributed computation by spreading
subproblems to n different processors. On the other hand, the update of z amounts to
the evaluation of the proximal operator of the function g,
zk+1 = proxµg
(µ
n∑i= 1
1
µivki
).
where µ := (∑
i 1/µi)−1. Thus, the update of z requires gathering each xk+1
i and the
associated Lagrange multipliers λki in order to form vki .
The above presented consensus algorithm is standard (e.g., see [113]). We have pre-
viously used this algorithm for distributed design of structured feedback gains in [111].
116
Recently, convergence of this algorithm was established even for problems with non-
convex objective functions fi [112].
7.3 Proximal augmented Lagrangian The interconnection graph
between the nodes in the consensus-based formulation in (7.2) is given by a star graph.
Each node has access to the internal node of the star graph with the state z. This topol-
ogy yields the z-update in ADMM that requires gathering the states of all subsystems.
Recently, an algorithm based on the proximal augmented Lagrangian for solving (7.2)
was developed in [108]. To avoid the above described computational requirement in the
z-update of the ADMM algorithm, we propose a primal-dual algorithm based on the
proximal augmented Lagrangian that can be implemented in a fully distributed manner.
Problem (7.2) can be equivalently written as,
minimizexi, z
n∑i= 1
fi(xi) + g(z)
subject to xi − xj = 0, (i, j) ∈ I
xk − z = 0,
(7.11)
where I is the set of indices between 1 and n such that any index appears in one pair
of the set at least once. This set characterizes structure of the information exchange
network between the agents. The interaction topology is given by a connected graph.
Moreover, the index k ∈ {1, . . . , n} can be chosen arbitrarily. In what follows, we study
one particular instance of problem (7.11).
Without loss of generality, we assume that k = n and that the underlying com-
munication network between different nodes in (7.11) is given by a path graph. By
introducing the optimization variable X,
X :=[x1 · · · xn
]∈ Rm×n
the column vector en,
en =[
0 0 · · · 1]T∈ Rn
117
and the matrix T ,
T =
1 0 · · · 0
−1 1 · · · 0......
. . ....
0 0 · · · −1
∈ Rn×(n−1)
we can rewrite (7.11) as,
minimizeX, z
f(X) + g(z)
subject to X T = 0,
X en − z = 0.
(7.12)
In (7.12), the matrix T is the incidence matrix of an undirected path network that
connects n nodes. We note that any connected network can be used to build the
information exchange structure between the nodes. For an arbitrary connected graph
with m edges, the matrix T ∈ Rn×m has to satisfy the following properties,
T T1 = 0, T T T = L,
where 1 is the vector of all ones and L ∈ Rn×n is the Laplacian matrix of the underlying
graph. This is exactly the problem (6.37) that we discussed in the previous Chapter 6.
In Sec. 6.4.2, we have introduced a gradient flow dynamics based on the DR splitting
algorithm that can be used to solve the dual form of the problem (7.12) with a guaran-
teed global exponential convergence under certain assumptions 3. Herein, we provide an
algorithm based on proximal augmented Lagrangian to solve the primal problem that is
well-suited for distributed implementation and we show global asymptotic convergence.
The augmented Lagrangian associated with (7.12) is given by
Lµi(X, z;λ, Y ) = f(X) + g(z) + 〈λ,X en − z〉 + 〈Y,X T 〉 +1
2µ1‖X en − z‖22 +
1
2µ2‖X T‖2F
(7.13)
where λ ∈ Rm and Y ∈ Rm×(n−1) are the Lagrange multipliers and µ1 and µ2 are
positive parameters. The proximal augmented Lagrangian is obtained by evaluating the
118
augmented Lagrangian on the manifold that results from the explicit minimization of
Lµi with respect to z [108]. This yields a function that is once but not twice continuously
differentiable with respect to both the primal variable X and the dual variables λ and
Y .
The proximal augmented Lagrangian associated with (7.12) is given by
Lµi(X;λ, Y ) = Lµi(X, z?µ1(X;λ);λ, Y ) =
f(X) + Mµ1g(X en + µ1 λ) − µ1
2‖λ‖22 + 〈Y,X T 〉 +
1
2µ2‖X T‖2F ,
(7.14)
where Mµ1g is the Moreau envelope of the function g(z) and z?µ1(X;λ) is given by
z?µ1(X;λ) = proxµ1g(X en + µ1 λ)
and by substituting z? in the augmented Lagrangian (7.13), the proximal augmented
Lagrangian can be written as (7.14).
7.4 Arrow-Hurwicz-Uzawa gradient flow The proximal augmented
Lagrangian is a continuously differentiable function because the Moreau envelope is con-
tinuously differentiable. This facilitates the use of the Arrow-Hurwicz-Uzawa algorithm
which is a primal-descent dual-ascent gradient flow method. In this algorithm, the
primal variable X and the dual variables λ and Y are updated simultaneously. The
gradient flow dynamics are given by
X = −∇XLµ(X;λ, Y )
λ = +∇λLµ(X;λ, Y )
Y = +∇Y Lµ(X;λ, Y ).
By taking the derivatives, the updates can be written as
X = − (∇f(X) + ∇Mµ1g(X en + µ1 λ) eTn +1
µ2X T T T + Y T T )
λ = µ1∇Mµ1g(X en + µ1 λ) − µ1 λ
Y = X T.
(7.15)
119
It is worth to note that if f(x) is separable, i.e. f(x) =∑
i fi(xi), the gradient ∇f(X)
is an m× n matrix and can be written as
∇f(X) =[∇f1(x1) . . . ∇fn(xn)
].
Asymptotic convergence Our subsequent developments are based on the following
assumption
Assumption 4. The function f is continuously differentiable and convex, and the func-
tion g is proper, lower semicontinuous, and convex.
We show that under Assumption 4, dynamics (7.15) are globally asymptotically
stable and converge to (X?, λ?, Y ?) where each of the columns of X? is the optimal
solution to (7.1). The optimal primal and dual points (X?, λ?, Y ?) satisfy the following
first order optimality conditions
∇f(X?) + λ? eTn + Y ? T T = 0 (7.16a)
X? en − z? = 0 (7.16b)
∂g(z?) − λ? 3 0 (7.16c)
X? T = 0 (7.16d)
where ∂g is the subgradient of g.
Proposition 14. Let Assumption 4 hold. Then, gradient flow dynamics (7.15) are
globally asymptotically stable, i.e., they converge globally to the optimal primal and dual
points (X?, λ?, Y ?) of (7.12).
Proof. We introduce a change of variables
X := X − X?, λ := λ − λ?, Y := Y − Y ?
and a Lyapunov function
V (X, λ, Y ) =1
2
⟨X, X
⟩+
1
2
⟨λ, λ
⟩+
1
2
⟨Y , Y
⟩
120
where (X, λ, Y ) satisfy
˙X = ∇f(X?)− ∇f(X) − 1
µ1m eTn −
1
µ2X T T T − Y T T ,
˙λ = −µ1 λ + m, ˙Y = X T,
(7.17)
with
m := µ1 (∇Mµ1g(X en + µ1 λ)−∇Mµ1g(X? en + µ1 λ
?)) .
Based on [108, Lemma 2], we can write
P (Xen + µ1λ) = proxµ1g(Xen + µ1 λ) − proxµ1g(X?en + µ1 λ
?), (7.18)
where I is the identity matrix and P is a positive semidefinite matrix such that P � I.
Thus, from (7.9) we have,
m = (I − P ) (Xen + µ1 λ). (7.19)
The derivative of the Lyapunov function candidate along the solutions of (7.17) is
determined by
V =⟨X, ˙X
⟩+⟨λ,
˙λ⟩
+⟨Y , ˙Y
⟩= −
⟨X,∇f(X) − ∇f(X?)
⟩− µ1
⟨λ, P λ
⟩−
1
µ1
⟨(I − P ) Xen, X en
⟩− 1
µ2
⟨X T T T , X
⟩.
Since f is convex, the first term in nonpositive. Moreover, by utilizing 0 � P � I,
it follows that V ≤ 0. We next invoke LaSalle’s invariance principle [170] to establish
global asymptotic stability.
The points (X, λ, Y ) in the set
S = {(X, λ, Y ) | V (X, λ, Y ) = 0},
121
satisfy
∇f(X? + X) = ∇f(X?), (7.20a)
λ ∈ ker(P ), (7.20b)
X en ∈ ker(I − P ), (7.20c)
XT ∈ span{[1 · · · 1 ]}, (7.20d)
where ker(A) denotes the null space of the matrix A and A ∈ span{[1 · · ·1 ]} signifies
that each column of the matrix A is given by a scalar multiple of the vector of all ones,
1. From (7.19), for the points in this set, we have m = µ1λ. In order to identify the
largest invariant set in S, we evaluate dynamics (7.17) under constraints (7.20) to obtain
˙X = − λ eTn − Y T T ,˙λ = 0, ˙Y = 0. (7.21)
Thus, the invariant set is characterized by λeTn + Y T T = 0 for constant λ and Y . To
complete the proof, we need to show that the largest invariant set in S yields
(X,λ, Y ) = (X?, λ?, Y ?) + (X, λ, Y ),
that satisfy optimality conditions (7.16).
Points (X,λ, Y ) satisfy optimality condition (7.16a) if
∇f(X) + (λ + λ?) eTn + (Y + Y ?)T T = 0.
For any (X, λ, Y ) in the invariant set, we can use (7.20a) to replace∇f(X) with∇f(X?).
Furthermore, since λeTn + Y T T = 0, the resulting (X,λ, Y ) satisfy (7.16a). Moreover,
the substitution of Pλ = 0 and (I − P )Xen = 0 to (7.18) yields
Xen − X?en = proxµ1g(Xen + µ1 λ) − proxµ1g(X?en + µ1 λ
?).
The optimality condition (7.16b) leads to
Xen = proxµ1g(Xen + µ1 λ) = z
122
which implies that the pair (X, z) satisfies (7.16b). We next show that the optimality
condition (7.16c) holds for any (z, λ) in this set. Taking sub-differential of the proximal
operator of the function g in (7.8) yields
∂g(z) + 1µ1
(z − v) 3 0
where v is an arbitrary vector. Choosing v = Xen + µ1λ and utilizing the fact that
Xen = z yields the third optimality condition (7.16c). Furthermore, XT = 0 yields
XT = 0. Thus, X satisfies (7.16d) and the dynamics (7.15) converges asymptotically
to the optimal points (X?, λ?, Y ?).
7.5 Distributed implementation In this section, we exploit the struc-
ture of the problem and show that the gradient flow dynamics (7.15) is well-suited for
distributed implementation. In this case, the underlying interconnection network is
given by a path graph. We first discuss how the gradient flow of the primal variable X
can be implemented in a distributed manner and then show that the dual variables can
be also updated in a distributed fashion.
7.5.1 Primal update The vector xk denotes the kth column of the matrix X.
Each of the columns from k = 2, . . . , (n− 1) can be updated in a distributed manner as
follows
xk = −∇fk(xk) −1
µ2(2xk − xk−1 − xk+1) − yk + yk−1, (7.22a)
where the vector yk is the kth columns of the matrix Y . Thus, each agent only uses
its neighbors’ states and the corresponding dual variables to update its own state. The
updates for the first and last column of X which are x1 and xn are different than the
other updates and can be written as follows
x1 = −∇f1(x1) − x1 − x2
µ2− y1 (7.22b)
xn = −∇fn(xn) − ∇Mµ1g(xn + µ1 λ) − 1
µ2(xn − xn−1) + yn−1. (7.22c)
123
Similarly, we can see only local information exchange and access to local dual variable
is required for these two updates.
7.5.2 Dual updates The dual variable λ is a column vector and its update can be
done by using the following column update
λ = µ1∇Mµ1g(xn + µ1 λ) − µ1 λ. (7.22d)
Thus, we only need the state of the nth agent to update its value. The second dual
variable Y is an m× (n− 1) matrix and the kth column of it, yk, can be updated using
only the states of xk and xk+1 agents for k = 1, . . . , (n− 1),
yk = xk − xk+1. (7.22e)
7.6 Computational experiments In this section, we employ our al-
gorithm for growing connected resistive Erdos-Renyi networks with edge probability
1.05 log(n)/n for different number of nodes using multiple cores. We choose the control
weight matrix R = I and a state weight matrix that penalizes the mean-square devia-
tion from the network average, Q = I − (1/n)11T . Moreover, the incidence matrix of
the controller is such that there are no joint edges between the plant and the controller
graphs. This algorithm is implemented in a distributed fashion by splitting the problem
into N separate subproblems over N different cores. We have provided a parallel im-
plementation in Matlab and have executed tests on a machine featuring an Intel Core
i7-3770 with 16GB of RAM to measure the performance of the algorithm.
The solve times are averaged over 10 trials and the speedup relative to a single core
is displayed in Fig 7.1. It demonstrates that the algorithm is scalable. In particular,
multi-core execution outperforms running just on a single core. Moreover, the speed-up
is even higher for larger networks since overheads of parallel execution are less and more
time is spent on actual parallel computation.
124
Sp
eed
up
Number of cores
Figure 7.1: Speedup ratio versus the number of the cores used for growing connectedresistive networks with n nodes.
7.7 Concluding remarks We have studied a class of convex nonsmooth
composite optimization problems in which the objective function is a combination of dif-
ferentiable and nondifferentiable functions. By exploiting the structure of the probelm,
we have provided an equivalent consensus-based characterization and have developed an
algorithm based on primal-descent dual-ascent gradient flow method. This algorithm
exploits the separability of the objective function and is well-suited for distributed im-
plementation. Convexity of the smooth part of the objective function is utilized to prove
global asymptotic stability of our algorithm. Finally, by exploiting the structure of the
H2 norm, we have employed this algorithm to design a sparse controller network that
improves the performance of the closed-loop system in a large-scale undirected consen-
sus network in a distributed manner. An example is provided to demonstrate the utility
of the developed approach. We are currently working on implementing this algorithm
in C++ and will use it to solve structured optimal control problems for large-scale
systems in a distributed manner.
Part IV
Data-driven and model-free
distributed control
125
Chapter 8
Data-driven proximal algorithms
for the design of structured
optimal feedback gains
Distributed feedback design and complexity constrained control are examples of prob-
lems posed within the domain of structured optimal feedback synthesis. The optimal
feedback gain is typically a non-convex function of system primitives. However, in recent
years, algorithms have been proposed to obtain locally optimal solutions. In applica-
tions to large-scale distributed control, the major obstacle is computational complexity.
This chapter addresses complexity through a combination of linear-algebraic techniques
and computational methods adapted from both machine learning and reinforcement
learning. It is shown that for general classes of optimal control problems, the objective
function and its gradient can be computed from data. Transformations borrowed from
the theory of reinforcement learning are adapted to obtain simulation-based algorithms
for computing the structured optimal H2 feedback gain. Customized algorithms based
on proximal gradient descent, incremental proximal gradient, an ADMM are tested in
computational experiments and their relative merits are discussed.
8.1 Introduction In this chapter, by exploiting the square-additive property
of the H2 norm, we provide efficient algorithms to solve the regularized H2 optimal
control problem. We demonstrate that the objective function and its gradients can
126
127
be evaluated without solving large-scale Lyapunov equations. In particular, our ap-
proach is inspired by the framework developed in [171, 172] where an iterative method
for computing the structured linear quadratic regulator is proposed. Herein, we ad-
dress complexity through a combination of linear-algebraic techniques and computa-
tional methods adapted from both machine learning and reinforcement learning. We
approach the structured optimal control problem via a data-driven framework that does
not require knowledge of the system parameters and avoids the need to solve large-scale
matrical equations. For the structured optimal H2 state-feedback problem, we show
that the objective function and its gradient can be computed from data and develop
customized proximal algorithms based on gradient descent and incremental gradient
method.
Moreover, we exploit separability of the objective function and utilize an ADMM-
based consensus algorithm to solve the regularized optimal control problem in a dis-
tributed manner over multiple processors. Even though the optimal control problem
is in general non-convex, recent results can be utilized to show convergence to a lo-
cal minimum [112]. The ADMM-based consensus algorithm that we use is standard
(e.g., see [113]) but, to the best of our knowledge, it has not been previously used for
distributed design of structured feedback gains. Even though the optimal control prob-
lem is in general non-convex, recent results can be utilized to show convergence to a
local minimum [112]. The ADMM-based consensus algorithm that we use is standard
(e.g., see [113]) but, to the best of our knowledge, it has not been previously used for
distributed design of structured feedback gains.
Our presentation is organized as follows. In Section 8.2, we describe the regularized
structured optimal control problem. In Section 8.3, a square-additive property of the
H2 norm is used to obtain a decomposition of the objective function and its gradient
and duality arguments are utilized to transform these into quantities that can be es-
timated from experiments or simulations. In Section 8.4, we illustrate how structure
of symmetric systems and undirected consensus networks can be exploited to simplify
computations. In Section 8.5, proximal algorithms of tractable complexity for solving
the structured optimal control problem are described and computational experiments
are provided. In Section 8.6, we propose an ADMM-based consensus algorithm to solve
the regularized optimal control problem in a distributed manner. Finally, in Section 8.7,
128
the chapter is concluded with remarks and a summary of outstanding challenges.
8.2 Problem formulation Consider the LTI control system in state-space
formψ = Aψ + B u
ζ =
[Q1/2
0
]ψ +
[0
R1/2
]u
(8.1)
where ψ(t) ∈ Rn is the state, u(t) ∈ Rq is the control input, and ζ(t) ∈ Rn+q is the
performance output. The matrices are all of compatible dimensions, and the standard
assumptions are imposed: Q � 0, R � 0, (A,B) is stabilizable and (A,Q1/2) is de-
tectable.
The control input is defined by state-feedback,
u(t) = −K ψ(t)
with gain matrix K ∈ Rq×n. The closed-loop system is thus
ψ = Acl ψ
ζ =
[Q1/2
−R1/2K
]ψ
(8.2)
with Acl := A−BK.
Closed-loop performance is quantified by the square of the L2 norm of the impulse
response for the closed-loop system, which is expressed using either of the following
expressions:
f(K) = trace (P )
= trace((Q + KTRK)L
) (8.3)
where P � 0 is the closed-loop observability Gramian,
ATclP + P Acl + Q + KTRK = 0 (8.4)
129
and L � 0 is the closed-loop controllability Gramian,
Acl L + LATcl + I = 0. (8.5)
Our objective is to promote certain structural properties of the feedback gain matrix
K by solving the regularized optimal control problem
minimizeK
f(K) + g(K) (8.6)
where f is given by (8.3) and g is the regularization function. When it is desired to
design K to belong to a set S with a specified pattern of zero elements [114], g is an
indicator function associated with the underlying sparsity constraints,
g(K) :=
0 K ∈ S
∞ K 6∈ S.
Similarly, the `1 norm and the nuclear norm are commonly used convex proxies for
promoting sparsity, or for designing K with low rank. In this case, g(K) = γ∑
i, j |Kij |or g(K) = γ‖K‖∗ = γ
∑i σi(K), where γ is a positive regularization parameter and σi
is the ith singular value.
In general, the optimization problem (8.6) is not convex because f is a non-convex
function of K. The proximal augmented Lagrangian method [108] can be used to
compute a local minimum of (8.6); for details, see [173]. The minimization (8.6) remains
non-convex even in the absence of the regularization function g. Of course, in this special
case it reduces to the standard LQR problem, and the globally optimal solution is given
by
K? = R−1BTP
where P is the unique positive definite solution of the algebraic Riccati equation,
ATP + PA + Q − P BR−1BTP = 0.
The recent work [117] established conditions for convergence of gradient descent methods
130
for the standard LQR problem to the global minimum, in spite of the lack of convexity.
The goal of this chapter is twofold: first is to obtain algorithms of tractable complex-
ity for truly large-scale problems. A second goal is to solve the regularized, structured
optimal control problem without knowledge of the underlying model. The approach to
the second goal is a component of our approach to the first: a data-driven framework
is proposed for computing the optimal feedback gain, with unknown matrices A and B
in (8.2). This approach also avoids the need to solve large-scale Lyapunov equations.
8.3 Computation from data A square-additive property of the H2 norm
is used to obtain a decomposition of the objective function f and its gradient ∇f .
Duality arguments are then used to transform each term into a quantity that can be
estimated from experiments or simulations. The resulting algorithm obtains estimates
of the optimal gain without solving a Lyapunov equation, and without knowledge of the
system parameters.
The gradient of f with respect to K is given by [135],
∇f(K) = 2 (RK − BTP )L. (8.7)
In what follows, we use the three decompositions
L =n∑
k= 1
Lk, f(K) =n∑
k= 1
fk, ∇f(K) =n∑
k= 1
∇fk
where Lk � 0 solves the Lyapunov equation
Acl Lk + Lk ATcl + eke
Tk = 0, (8.8)
ek is the kth unit vector in the canonical basis of Rn, and
fk(K) = trace((Q + KTRK)Lk
)∇fk(K) = 2 (RK − BTP )Lk.
(8.9)
131
8.3.1 Computation of P and L from data The solution to (8.8) can be ex-
pressed
Lk =
∫ ∞0
eAclt ekeTk eA
Tclt dt =
∫ ∞0
ψk(t) (ψk(t))T dt (8.10)
where ψk(t) := eAclt ek is the solution to
ψk = Acl ψk, ψk(0) = ek, k = 1, . . . , n. (8.11)
This leads to the first component of the data-driven architecture: for a given feedback
gain K, the term KLk that appears both in f and ∇f can be expressed as the integral
KLk =
∫ ∞0
uk(t) (ψk(t))T dt
where ψk(t) and uk(t) = Kψk(t) are obtained from simulations or experiments of sys-
tem (8.11).
Similarly, the observability Gramian is given by
P =
∫ ∞0
eATclt (Q + KTRK) eAclt dt
and its (i, j)th component is
Pi,j =
∫ ∞0
(ϕi(t))T (Q + KTRK)ϕj(t) dt (8.12)
where ϕi(t) := eAclt ei is the solution to
ϕi = Acl ϕi, ϕi(0) = ei, i = 1, . . . , n.
Thus, using 2n forward-in-time numerical simulations, the gradient of the objective
function ∇f(K) can be computed without solving a Lyapunov equation. Computation
can be further reduced by avoiding multiplication of P and Lk in (8.9). A data-driven
approach to compute the product PLk directly is introduced in the next subsection.
132
8.3.2 Computation of PL from data As showed in [111,172], the product PLk
that appears in ∇fk(K) can be expressed as
Xk := PLk = −∫ ∞
0φk(t) (ψk(t))T dt (8.13)
where φk is obtained from the solution of the adjoint system
φk = −ATcl φk + (Q + KTRK)ψk
φk(∞) = 0(8.14)
and ψk is the solution to (8.11). Thus, numerical simulations of the primal and adjoint
systems (8.11) and (8.14) along with numerical evaluations of the corresponding integrals
can be used to compute Lk, PLk, fk, and ∇fk.
Next, we show that the matrix Xk in (8.13) can be computed without simulating
the adjoint system (8.14). This is a critically important step in the model-free setup
where only experimental or numerical data is available. Introduce the new variable
ηk := (Q + KTRK)ψk
so that (8.14) becomes
φk = −ATcl φk + ηk, φk(∞) = 0 (8.15)
which admits the solution,
φk(t) =[Hηk
](t) = −
∫ ∞t
e−ATcl(t− τ) ηk(τ) dτ. (8.16)
The linear operator H is introduced here to facilitate an adjoint transformation below.
This operator and its adjoint H∗ are defined defined on L2([0,∞)) → Rn. For any
functions ξ, χ ∈ L2([0,∞)), we have by definition 〈χ,Hξ〉 = 〈H∗χ, ξ〉; see, e.g., [174].
An explicit representation for µ = H∗γ is obtained using elementary calculus:
µ(t) =
∫ t
0eAcl(t− τ)γ(τ) dτ.
133
The (i, j)th element of the matrix Xk can be expressed
Xki,j := eTi Xk ej = −
∫ ∞0
φki (t)ψkj (t) dt
= −⟨ψkj , φ
ki
⟩ (8.17)
where φki (t) := eTi φk(t) and ψkj (t) := eTj ψ
k(t) are the ith and jth elements of the vectors
φk(t) and ψk(t), respectively, and the inner product is in L2([0,∞)). By substituting
the expression (8.16) for φk in (8.17), we obtain
Xki,j = −
⟨ψkj , e
Ti Hη
k⟩
= −⟨H∗ei ψ
kj , η
k⟩
=:⟨ξki,j , η
k⟩. (8.18)
Applying the expression for H∗ then gives
ξki,j(t) := −[H∗eie
Tj ψ
k]
(t) =
∫ t
0eAcl(t− τ) ei eTj ψ
k(τ) dτ.
This implies that ξki,j can be obtained as the solution to the LTI system,
ξki,j = Acl ξki,j + ei eTj ψ
k, ξki,j(0) = 0. (8.19)
Thus, to compute ∇f(K) for a given gain K, the only systems that need to be simulated
are the forward in time systems (8.11) and (8.19). The system (8.11) is unforced, and
its solution determines the input to the forced system (8.19).
Remark 15. Similar adjoint techniques are used in analysis of reinforcement learning.
The proof that the TD(1) algorithm solves a minimum norm problem is based on related
adjoint transformations [175, 176]. And, a similar adjoint transformation is a crucial
step in a Q-learning algorithm for deterministic continuous-time systems [177].
Remark 16. By writing the Lyapunov equation for the controllability Gramian as (8.8),
unlike the requirement of stochastic simulations in [172], only deterministic simulations
are needed in the present framework.
Remark 17. Since the optimal unstructured gain K? only depends on the closed-loop
observability Gramian P (which can be obtained from the solution of the algebraic Riccati
134
equation), L does not influence K? and its computation can be avoided. To promote
structure (e.g., via proximal algorithms) we need to compute both P and L to form ∇f .
In contrast, only computation of P is required to form natural policy gradient, which is
defined as [178, 179],
∇h(K) = ∇f(K)L−1 = 2 (RK − BTP ). (8.20)
8.4 Systems with special structure If the underlying system has
additional structure, evaluation of both the objective function and the corresponding
gradients can be further simplified. We next illustrate how structure of symmetric sys-
tems and undirected consensus networks can be exploited to simplify computations. In
both cases, the closed-loop L2 norm has an explicit convex dependance on the opti-
mization variable and there is no need to conduct simulations of the adjoint system and
numerically evaluate the underlying integrals.
Symmetric systems
Let A and K in (8.2) be symmetric matrices and let B = I. For a stabilizing K, the L2
norm of the impulse response for the closed-loop system (8.2) is determined by
f(K) =1
2trace
((K − A)−1(Q + KRK)
).
Convex dependence of f on K can be established via a straightforward use of the Schur
complement. Furthermore, the design for symmetric systems provides a useful starting
point for the design of non-symmetric systems [180].
Equivalently, f(K) can be written as
f(K) =1
2trace
((K − A)−1(Q + ARA)
)+
1
2trace (RK) +
1
2trace (RA)
=1
2
n∑i= 1
fi(K)
135
where
fi(K) = qTi (K − A)−1qi + aTi (K − A)−1ai + rTi (K + A) ri.
The vectors indexed by i in this expression are given by
qi = Q1/2 ei, ri = R1/2 ei, ai = Ari
where ei is the ith canonical basis vector in Rn. Furthermore, the gradient of fi with
respect to K is given by
∇fi(K) = ri rTi − (K − A)−1qi q
Ti (K − A)−1 − (K − A)−1ai a
Ti (K − A)−1
Thus, there is no need to compute and store the inverse of the matrix K − A in
order to evaluate fi(K) and ∇fi(K); only actions of (K−A)−1 on the vectors qi and ai
is necessary. For example, the preconditioned conjugate gradients method can be used
to compute them efficiently.
Remark 18. The vector (K − A)−1qi represents the steady-state solution of a stable
linear system
ψi = (A − K)ψi + qi
and it can be computed via numerical integration. This illustrates that the symmetric na-
ture of the closed-loop system (8.2) allows us to avoid the need for numerical integration
of the corresponding adjoint system.
Undirected consensus networks
For undirected consensus networks with n nodes, the matrices A and K∈ Rn×n in (8.2)
determine the graph Laplacians of the plant and controller networks, respectively. In the
absence of exogenous disturbances, the network converges to the average of the initial
node values ψ = (1/n)∑
i ψi(0) if and only if it is connected [2]. Let B = I and let
Q := I − (1/n)11T penalize the deviation of individual node values from average. The
objective is to minimize the mean square deviation from the network average by adding
a few additional edges, specified by the graph Laplacian K of a controller network. If
136
E is the incidence matrix of the controller graph, K can be written as
K(x) = E diag (x)ET
where diag (x) is a diagonal matrix containing the optimization variable x ∈ Rm (i.e., the
vector of the edge weights in the controller graph). Regularization terms may be used
to promote sparsity of the controller network or to impose some additional constraints
on the edge weights.
As shown in [106], up to an additive constant, the L2 norm of the impulse response
for the closed-loop system is determined by
f(x) = trace((E diag (x)ET − A)†(I + ARA)
)+ diag
(ETRE
)Tx
where the pseudo-inverse of the closed-loop graph Laplacian is given by
(E diag (x)ET −A)† = (E diag (x)ET + (1/n)11T −A)−1.
It is easy to show that f(x) can be written as
f(x) =n∑
i= 1
fi(x)
where
fi(x) = ξTi(E diag (x)ET + (1/n)11T −A
)−1ξi + (1/n) diag
(ETRE
)Tx.
(8.21)
Here, ξi = (I + ARA)1/2 ei is the ith column of the square root of the matrix (I +
ARA). Moreover, it can be shown that the gradient of fi(x) is given by
∇fi(x) = (1/n) diag(ETRE
)− νi(x) ◦ νi(x) (8.22)
where ◦ is the elementwise multiplication and
νi(x) = ET(E diag (x)ET + (1/n)11T −A
)−1ξi. (8.23)
137
As for symmetric systems, there is no need to compute and store the inverse of the
matrix in (8.23) and the preconditioned conjugate gradients method can be used to
compute the gradient efficiently.
We next provide a proposition on the Lipschitz continuity of the gradient of fi
for connected resistive networks which enables us to choose appropriate step-size for
proximal gradient algorithms. In such networks, the plant graph is connected and all
edge weights in both the plant and the controller graphs are non-negative.
Proposition 15. The gradient of fi for connected resistive networks is Lipschitz con-
tinuous with the Lipschitz constant
Li = ξTi A(0)−1E ET A(0)−1 ξi ‖ET A(0)−1E‖2 (8.24)
where A(x) = E diag (x)ET + (1/n)11T −A.
Proof. For a convex and twice differentiable function fi, the gradient ∇fi is Lipschitz
continuous with Lipschitz constant Li, if
∇2fi(x) � Li I
or, equivalently, zT ∇2fi(x) z ≤ Li ‖z‖2. It can be shown that the second order deriva-
tive of fi is given by
∇2fi(x) = (νi νTi ) ◦ (ET A(x)−1E).
Thus, zT ∇2fi(x) z can be written as
ξTi A(x)−1EDz ET A(x)−1EDz E
T A(x)−1ξi
where Dz := diag(z). For positive definite matrices S and T with S � T , we have
zTS z ≤ zTT z and
zTS z ≤ λmax(S) ‖z‖2 ≤√λmax(STS) ‖z‖2 = ‖S‖2 ‖z‖2
138
for any vector z. Therefore,
Dz ET A(x)−1EDz � ‖ET A(0)−1E‖2 ‖z‖2.
As a result,
zT ∇2fi(x) z ≤ νi(0)T νi(0) ‖ET A(0)−1E‖2 ‖z‖2
where νi(0) = ET A(0)−1 ξi.
8.5 Proximal algorithms Algorithms of tractable complexity are intro-
duced here for solving the regularized structured optimal control problem (8.6). A
standard proximal gradient descent algorithm is considered as a starting point. It is
argued however that computing the full gradient may be prohibitively expensive. Thus,
we propose a more efficient algorithm based on incremental proximal gradient that is
well-suited for the design of structured optimal feedback gains in large-scale systems.
8.5.1 Proximal gradient descent The proximal gradient algorithm is a first-
order method that extends standard gradient descent to a class of non-smooth composite
optimization problems. In our setup, it iteratively updates the feedback gain K via
application of the proximal operator associated with the function g on the standard
gradient descent update. The algorithm is initialized with a stabilizing K0 and the
iterates are determined by,
K l+1 = proxαlg
(K l − αl∇f(K l)
)where l is the iteration index and αl is the step-size. At each iteration, the step-size is
selected via backtracking to guarantee sufficient descent of the objective function and
stability of closed-loop system (8.2). To enhance practical performance, we use the
Barzilai-Borwein (BB) step-size initialization [127].
For any matrix V and a positive scalar α, the proximal operator of the function g
is defined as [76],
proxαg(V ) := argminZ
(g(Z) +
1
2α‖Z − V ‖2F
)
139
where ‖ · ‖F is the Frobenius norm. In particular, for g(K) = γ ‖K‖1 = γ∑
i, j |Kij |, we
have
K l+1 = Sγαl
(K l − αl∇f(K l)
)where Sκ(V ) is the soft-thresholding function that acts on the individual entries Vij of
the matrix V according to Sκ(Vij) = sign (Vij) max (|Vij | − κ, 0). In the absence of the
regularization function g in (8.6), the proximal gradient algorithm simplifies to standard
gradient descent with the updates K l+1 = K l − αl∇f(K l).
Recall that estimates of ∇f(K l) require the matrices KL and PL. These can be
computed or estimated from data using the techniques described in the previous section.
Remark 19. The step-size αl is adjusted via backtracking to ensure stability of closed-
loop system (8.2) at each iteration. It is initialized using the BB method which provides
a heuristic for approximating the Hessian of the function f via the scaled version of the
identity matrix [127], (1/αl)I. At the lth iteration, the initial BB step-size
αl,0 :=‖K l − K l−1‖22
trace ((K l−1 − K l)T (∇f(K l−1) − ∇f(K l)))(8.25)
is adjusted via backtracking until system (8.2) is stable and sufficient descent is achieved
at each iteration, i.e.,
f(K l+1) < f(K l) + trace((K l+1 − K l)T ∇f(K l)
)+ 1
2αl‖K l+1 − K l‖22.
(8.26)
The algorithm stops when the generalized gradient map becomes smaller than the given
tolerance,
‖K l − proxαlg
(K l − αl∇f(K l)
)αl
‖ < ε.
8.5.2 Incremental proximal gradient Next, we exploit the separable structure
obtained in Section 8.2 to speed up computations and save memory. This formulation
allows to work with a single function fk and its gradient ∇fk in the controller design,
which reduces computational complexity and allows implementation at scale.
We start with an initial stabilizing gain K0, and at each iteration we solve the
140
following optimization problem
minimizeK
fk(K) + g(K). (8.27)
The explicit solution to (8.27) is given by
K l+1 = proxαlg
(K l − αl∇fk(K l)
).
Here, prox is the proximal operator of the function g, and αl is a step-size. This is the
incremental proximal gradient algorithm because, at each iteration, the optimization
variable is updated in the negative direction of a single element of the gradient where
the index k is selected in a random manner. At each iteration, we choose step-size to
be O(1/l) and adjust it to guarantee stability of closed-loop system (8.2).
Incremental and stochastic gradient based algorithms have recently found widespread
use in large-scale optimization and machine learning. High variance that results from es-
timating the full gradient ∇f using samples of its entries can result in slow convergence.
Moreover, to ensure convergence, the step-size has to decay to zero.
In the first few iterations, the objective function in the incremental proximal gradient
decreases dramatically but it starts to oscillate after that. Thus, in large-scale optimiza-
tion problems where having high accuracy may not be achievable, these methods are
useful because of their fast initial convergence rate.
8.6 An ADMM-based consensus algorithm We next demon-
strate that a standard consensus algorithm based on the Alternating Direction Method
of Multipliers (ADMM) is well-suited for distributed design of optimal structured feed-
back gains. As shown in Section 8.3, the regularized optimal control problem can be
written as
minimizeK
N∑i= 1
fi(K) + g(K). (8.28)
This characterization is suitable for distributed implementation in which each processors
solves an optimization problem. By introducing N local variables Ki and a global
141
variable K0, problem (8.28) can be brought into a standard consensus form,
minimizeKi,K0
N∑i= 1
fi(Ki) + g(K0)
subject to Ki − K0 = 0, i = 1, . . . , N.
(8.29)
This formulation increases the number of optimization variables but it brings the ob-
jective function into a separable form and facilitates distributed computations. When
implemented on a single machine, solving the reformulated problem is not necessarily
more computationally efficient than solving the original problem using centralized al-
gorithms. However, this formulation allows to work with a local objective function fi
that depends on a single local variable Ki. This is advantageous for large-scale systems
where centralized algorithms cannot be afforded (e.g., because of the computational cost
associated with solving the large-scale Lyapunov equations).
The augmented Lagrangian associated with (8.29) is
L(Ki,K0; Λ) = g(K0) +
N∑i=1
( fi(Ki) + 〈Λi,Ki −K0〉 +ρi2‖Ki −K0‖2F ) (8.30)
where Λi’s are the Lagrange multipliers and ρi’s are positive parameters. The ADMM
algorithm consists of the following iterative steps,
Kk+1i = argmin
Ki
fi(Ki) +ρi2‖Ki − Uki ‖2F
Kk+10 = argmin
K0
g(K0) +N∑i=1
ρi2‖K0 − V k
i ‖2F
Λk+1i = Λki + ρi
(Kk+1i − Kk+1
0
)where
Uki := Kk0 − (1/ρi) Λki
V ki := Kk+1
i + (1/ρi) Λki .
The Ki-minimization step can be done via distributed computation by spreading
subproblems to N different processors. On the other hand, the update of K0 amounts
142
to the evaluation of the proximal operator of the function g,
Kk+10 = proxg/(Nρ)
(1
Nρ
N∑i= 1
ρiVki
).
where ρ := (1/N)∑
i ρi. Thus, the update of K0 requires gathering each Kk+1i and the
associated Lagrange multipliers Λki in order to form V ki .
Remark 20. The above presented consensus algorithm is standard (e.g., see [113])
but to the best of our knowledge it has not been previously used for optimal design of
structured feedback gains via distributed optimization. Recently, convergence of this al-
gorithm was established even for problems with non-convex objective functions fi [112].
The authors of [112] also show that additional computational advantage can be gained
by updating only a subset of Ki’s in each iteration and offer several alternative imple-
mentations to speed computations.
Remark 21. The update of each Ki amounts to the computation of the proximal oper-
ator associated with fi,
Kk+1i = proxfi/ρi
(Uki
)However, while the proximal operators of commonly used regularization functions are
easy to evaluate, the computation of proxfi/ρi is significantly more involved. In [42],
the proximal operator associated with the closed-loop H2 norm f(K) was computed using
the Anderson-Moore algorithm. This approach requires computation of the closed-loop
controllability and observability gramians. In contrast, the developments of Section 8.3
facilitate computation of proxfi/ρi via a proximal gradient algorithm that avoids the
need for solving large-scale Lyapunov equations.
Remark 22. For the problem of growing connected resistive consensus networks [106],
Algorithm 3 in [112] can be used to solve the Ki-minimization step in the ADMM-
based algorithm explicitly. The key point of departure compared to the standard ADMM
implementation is linearization in the Ki-minimization step of the function fi with a
Lipschitz continuous gradient around the current K0 iterate. This offers a significant
speed-up and enables an explicit update of Ki.
143
Figure 8.1: A mass-spring-damper system on a line.
8.6.1 Computational experiments We next provide examples to illustrate the
performance of proximal algorithms. The acronym proxG represents the proximal gra-
dient and proxIG is the incremental proximal gradient. We have implemented all algo-
rithms in Matlab. In all examples, we choose R = I and Q = I.
The algorithms were tested for a mass-spring-damper (MSD) model with N masses,
illustrated in Figure 8.1. Our goal is to find the local minimum of the problem (8.6)
with g(K) = γ ‖K‖1, where γ is the sparsity-promoting parameter and the `1 norm is
a proxy for inducing sparsity in the feedback gain matrix. The MSD system with N
masses has n = 2N states where the first N states denote the positions and the rest are
velocities. We consider the case where all masses, spring, and damping constants are
equal to one, which results in the state-space model with
A =
[0 I
−T −T
], B =
[0
I
]
where T is an N×N tridiagonal symmetric Toeplitz matrix with 2 on the main diagonal
and −1 on the first upper and lower sub-diagonals, and I and 0 are N ×N identity and
zero matrices.
Sparsity-promoting controllers were designed for this model with N = 10 masses
and compared with the ADMM-based algorithm [42] that does not utilize the iterative
re-weighting scheme. The algorithms were initialized with the (unstructured) optimal
feedback gain K?. Figure 8.2 shows how the sparsity pattern of the controller changes
as the value of the parameter γ is increased. It is worth to note that all algorithms give
the same structure of K with very close feedback gain values. For γ = 0, the optimal
feedback controller is given by a dense matrix and as γ increases the controller becomes
sparser. It is diagonal when γ = 1, and for sufficiently large γ the gain is identically
zero. This is a feasible solution because the open-loop system is stable.
144
2
4
6
8
10
2 4 6 8 10 12 14 16 18 202 00 4 6 8 10 12 14 16 18 20
2
4
6
8
10
nz = 200 , γ = 0 nz = 102 , γ = 0.01
nz = 38 , γ = 0.1 nz = 16 , γ = 1
Figure 8.2: Structure of feedback gains resulting from the solution to (8.6) for differentvalues of the regularization parameter γ for an MSD system with N = 10 masses. Thenumber of non-zero elements in K is denoted by nz.
Next, we employ the ADMM-based consensus algorithm to design optimal struc-
tured feedback gains in a distributed fashion by splitting the problem into N separate
subproblems over N different cores. We have provided a parallel implementation in
C++ using pthreads library and executed tests on a machine featuring an Intel Core
i7-3770 with 16GB of RAM to measure the performance of the algorithm. We use our
algorithm for growing a connected resistive network with n = 20 nodes. The plant graph
is given by an Erdos-Renyi network with edge probability 1.05 log(n)/n. We choose the
control weight matrix R = I and a state weight matrix that penalizes the mean-square
deviation from the network average, Q = I− (1/n)11T . Moreover, the incidence matrix
of the controller is such that there are no joint edges between the plant and the con-
troller graphs. As discussed in Section 8.4, for consensus networks, the controller can
be written as a function of the vector of the edge weights x. Thus, the H2 norm of the
closed-loop system is f(x) and the regularization function is given by g(x) = γ ‖w ◦ x‖1where w is the vector of the weights. Since the plant network is resistive and connected,
all the edge weights are nonnegative, thereby if the added edges have nonnegative edge
weights, the closed-loop system is stable. We can write the smooth and nonsmooth
parts of the objective function as f(x) + γ wTx and g(x) = I+(x) where I+(x) is an
indicator function for nonnegative orthant.
145
We solve the problem (8.28) to find the controller graph for 500 logarithmically-
spaced values of γ ∈ [0.001, 0.3] using the path-following iterative reweighted algorithm
as a proxy for inducing sparsity [125]. We set the weights to be inversely proportional
to the magnitude of the solution x at the previous value of γ following by a polishing
step that computes the optimal weights of identified edges; see [106].
As γ increases, the number of nonzero edges decreases and the closed-loop perfor-
mance deteriorates. As shown in Fig. 8.3, relative to the optimal centralized vector of
the edge weights, xc, the H2 loss decreases as the sparsity of the vector of the edge
weights x increases. In particular, for γ = 0.3, there is only one nonzero element in the
vector of the edge weights. The identified sparse controller in this case uses only 0.62% of
the edges, relative to the optimal centralized controller, i.e., card(x)/card(xc) = 0.62%
and achieves a performance loss of 23.47%, i.e., (J − Jc)/Jc = 23.47%.
card
(x)/card
(xc)
(a)
(J−Jc)/Jc
(b)
(J−Jc)/Jc
(c)
γ γ card(x)/card(xc)
Figure 8.3: (a) Sparsity level; (b) performance degradation; and (c) the optimal trade-off between the performance degradation and the sparsity level of the optimal sparsex compared to the optimal centralized controller xc. The results are obtained for arandomly generated Erdos-Renyi network with n = 20 nodes.
Next, we employ our algorithm for growing connected resistive networks with differ-
ent number of nodes using multiple cores. The solve times are averaged over 10 trials
and the speedup relative to a single core is displayed in Fig 8.4. This figure demonstrates
that the algorithm is scalable. In particular, multi-core execution outperforms running
just on a single core. The speed-up is even higher for larger networks since overheads
of parallel execution are smaller and more time is spent on actual parallel computation.
146
Sp
eed
up
Number of cores
Figure 8.4: Speedup ratio versus the number of the cores used for growing connectedresistive networks with n nodes.
8.7 Concluding remarks This chapter provides a step towards a compre-
hensive theory for distributed and structured optimal control of LTI systems. Even
within the linear setting there are many open questions, including
• Development of simulation-based methods in the presence of disturbances and
uncertainty;
• Construction of truly recursive algorithms, similar to those used in traditional
reinforcement learning settings;
• Techniques to avoid local minima, and finding conditions that ensure non-existence
of local minima for f ;
• Development of algorithms that combine proximal methods with natural policy
gradient.
Extensions to nonlinear control will probably require more modest objectives since
it is not an easy task to compute the performance of a given policy. However, both
Q-learning and TD-learning are based on value function approximation. Consequently,
as in the present work, these algorithms can be designed so that each value function
approximation serves as a Lyapunov function for the current approximating policy.
147
This provides bounds on performance, as well as stability. Consequently, algorithms
for computation of structured stabilizing policies with good performance are not out of
reach even for nonlinear control systems.
Chapter 9
Conclusions
This dissertation focuses on structure identification and optimal control of large-scale
networks of dynamical systems. To overcome the challenges in this area, we combine
tools and ideas from control theory, distributed optimization, reinforcement learning,
and compressive sensing to develop distributed estimation and control strategies that
require limited information exchange between the individual subsystems. In the first
and second parts, we have studied the problems of optimal distributed design and in-
ference of large-scale networks of dynamical systems. We have developed theoretical
and computational frameworks to solve these problems. In particular, we have designed
and implemented customized optimization algorithms that are well-suited for solving
the above-mentioned problems in large-scale dynamical systems. In the third chapter,
we have studied the convergence properties of these algorithms from system-theoretic
point of view. We have utilized control-theoretic tools to view different optimization al-
gorithms as dynamical systems and study their convergence properties by using stability
analysis techniques from control theory. Finally, in the last chapter, we have provided a
data-driven framework for the structured optimal control problem that does not require
knowledge of the system parameters and avoids the need to solve large-scale matrical
equations.
Future directions
The current research on design and control of networks can be extended to make
them applicable to a larger class of systems such as unbalanced time-varying large-scale
148
149
networks. We have also developed an algorithm to recover the underlying undirected
network topology of a complex system using partially available statistical data. We
will extend this framework to identify the structure of directed networks which is a
more challenging problem since the resulting optimization problem is non-convex. Also,
we are utilizing optimization theoretical results to develop a new model with more
complicated dynamics than first-order consensus. Furthermore, we are extending the
work on distributed computation over networks to solve the non-smooth optimization
problems with more than one regularizers. In this case, the dynamical system of the
primal-dual updates is more complicated and convergence analysis is more challenging.
Moreover, we will extend the data-driven framework to be truly recursive, similar to
those used in traditional reinforcement learning settings and will combine the proximal
methods with natural policy gradient. The framework can be generalized by developing
simulation-based methods in the presence of disturbances and uncertainty.
References
[1] G. E. Dullerud and F. Paganini. A course in robust control theory. Springer, 2000.
[2] M. Mesbahi and M. Egerstedt. Graph Theoretic Methods in Multiagent Networks.
Princeton University Press, 2010.
[3] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Syst. Control
Lett., 53:65–78, 2004.
[4] Y. Kim and M. Mesbahi. On maximizing the second smallest eigenvalue of a
state-dependent graph Laplacian. IEEE Trans. Automat. Control, 51(1):116–120,
2006.
[5] L. Xiao, S. Boyd, and S.-J. Kim. Distributed average consensus with least-mean-
square deviation. J. Parallel Distrib. Comput., 67(1):33–46, 2007.
[6] P. Barooah and J. P. Hespanha. Estimation on graphs from relative measure-
ments: Distributed algorithms and fundamental limits. IEEE Control Syst. Mag.,
27(4):57–74, 2007.
[7] A. Ghosh, S. Boyd, and A. Saberi. Minimizing effective resistance of a graph.
SIAM Rev., 50(1):37–66, 2008.
[8] D. Zelazo and M. Mesbahi. Edge agreement: Graph-theoretic performance bounds
and passivity analysis. IEEE Trans. Automat. Control, 56(3):544–555, 2011.
[9] D. Zelazo, S. Schuler, and F. Allgower. Performance and design of cycles in
consensus networks. Syst. Control Lett., 62(1):85–96, 2013.
150
151
[10] B. Bamieh, M. R. Jovanovic, P. Mitra, and S. Patterson. Coherence in large-
scale networks: dimension dependent limitations of local feedback. IEEE Trans.
Automat. Control, 57(9):2235–2249, September 2012. (2013 George S. Axelby
Outstanding Paper Award).
[11] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging.
Systems & Control Letters, 53(1):65–78, 2004.
[12] C. W. Reynolds. Flocks, herds and schools: A distributed behavioral model. ACM
SIGGRAPH Computer Graphics, 21(4):25–34, 1987.
[13] T. Vicsek, A. Czirok, E. Ben-Jacob, I. Cohen, and O. Shochet. Novel type of phase
transition in a system of self-driven particles. Physical Review Letters, 75(6):1226,
1995.
[14] M. H. DeGroot. Reaching a consensus. J. Amer. Statist. Assoc., 69(345):118–121,
1974.
[15] B. Golub and M. Jackson. Naive learning social networks and the wisdom of
crowds. American Economic Journal: Microeconomics, 2(1):112–149, 2010.
[16] F. Dorfler, M. R. Jovanovic, M. Chertkov, and F. Bullo. Sparsity-promoting opti-
mal wide-area control of power networks. IEEE Trans. Power Syst., 29(5):2281–
2291, September 2014.
[17] X. Wu, F. Dorfler, and M. R. Jovanovic. Decentralized optimal control of inter-
area oscillations in bulk power systems. In Proceedings of the 54th IEEE Confer-
ence on Decision and Control, pages 5532–5537, Osaka, Japan, 2015.
[18] X. Wu, F. Dorfler, and M. R. Jovanovic. Input-output analysis and decentralized
optimal control of inter-area oscillations in power systems. IEEE Trans. Power
Syst., 31(3):2434–2444, May 2016.
[19] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J.
Parallel Distrib. Comput., 7(2):279–301, 1989.
[20] J. E. Boillat. Load balancing and Poisson equation in a graph. Concurrency:
Practice and Experience, 2(4):289–313, 1990.
152
[21] V. Preciado, M. Zargham, C. Enyioha, A. Jadbabaie, and G. Pappas. Optimal
vaccine allocation to control epidemic outbreaks in arbitrary networks. In Pro-
ceedings of the 52nd IEEE Conference on Decision and Control, pages 7486–7491.
IEEE, 2013.
[22] E. Ramırez-Llanos and S. Martınez. A distributed nonlinear dynamics for virus
spread control. Automatica, 2016. To appear.
[23] A. Jadbabaie, J. Lin, and A. S. Morse. Coordination of groups of mobile au-
tonomous agents using nearest neighbor rules. IEEE Trans. Automat. Control,
48(6):988–1001, 2003.
[24] R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with
switching topology and time-delays. IEEE Trans. Automat. Control, 49(9):1520–
1533, 2004.
[25] L. Moreau. Stability of multiagent systems with time-dependent communication
links. IEEE Trans. Automat. Control, 50(2):169–182, 2005.
[26] R. Carli, F. Fagnani, A. Speranzon, and S. Zampieri. Communication constraints
in the average consensus problem. Automatica, 44(3):671–684, 2007.
[27] S. Hassan-Moghaddam, N. K. Dhingra, and M. R. Jovanovic. Topology identifica-
tion of undirected consensus networks via sparse inverse covariance estimation. In
Proceedings of the 55th IEEE Conference on Decision and Control, pages 4624–
4629, Las Vegas, NV, 2016.
[28] S. Hassan-Moghaddam and M. R. Jovanovic. Topology identification via growing
a chow-liu tree network. In Proceedings of the 57th IEEE Conference on Decision
and Control, pages 5421–5426, Miami, FL, 2018.
[29] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization
algorithms via integral quadratic constraints: Nonstrongly convex problems. 2017.
arXiv:1705.03615.
153
[30] B. Hu and P. Seiler. Exponential decay rate conditions for uncertain linear systems
using integral quadratic constraints. IEEE Trans. Autom. Control, 61(11):3631–
3637, 2016.
[31] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algo-
rithms via integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[32] Y. Wan, S. Roy, and A. Saberi. Designing spatially heterogeneous strategies for
control of virus spread. IET Systems Biology, 2(4):184–201, 2008.
[33] M. Siami and N. Motee. Tractable approximation algorithms for the np-hard prob-
lem of growing linear consensus networks. In Proceedings of the 2016 American
Control Conference, pages 6429–6434, 2016.
[34] A. Ghosh and S. Boyd. Growing well-connected graphs. In Proceedings of the
45th IEEE Conference on Decision and Control, pages 6605–6611, 2006.
[35] F. Lin, M. Fardad, and M. R. Jovanovic. Identification of sparse communication
graphs in consensus networks. In Proceedings of the 50th Annual Allerton Confer-
ence on Communication, Control, and Computing, pages 85–89, Monticello, IL,
2012. (Invited paper).
[36] M. Fardad, F. Lin, and M. R. Jovanovic. Design of optimal sparse interconnection
graphs for synchronization of oscillator networks. IEEE Trans. Automat. Control,
59(9):2457–2462, September 2014.
[37] T. H. Summers, I. Shames, J. Lygeros, and F. Dorfler. Topology design for optimal
network coherence. In Proceedings of the 2015 European Control Conference, pages
575–580, 2015.
[38] B. Polyak, M. Khlebnikov, and P. Shcherbakov. An LMI approach to structured
sparse feedback design in linear control systems. In Proceedings of the 2013 Eu-
ropean Control Conference, pages 833–838, 2013.
[39] N. K. Dhingra, M. R. Jovanovic, and Z. Q. Luo. An admm algorithm for optimal
sensor and actuator selection. In Proceedings of the 53rd IEEE Conference on
Decision and Control, pages 4039–4044, Los Angeles, CA, 2014. (Invited paper).
154
[40] A. Zare, H. Mohammadi, N. K. Dhingra, T. T. Georgiou, and M. R. Jo-
vanovic. Proximal algorithms for large-scale statistical modeling and op-
timal sensor/actuator selection. IEEE Trans. Automat. Control, 2019.
doi:10.1109/TAC.2019.2948268; also arXiv:1807.01739.
[41] M. Fardad, F. Lin, and M. R. Jovanovic. Sparsity-promoting optimal control
for a class of distributed systems. In Proceedings of the 2011 American Control
Conference, pages 2050–2055, San Francisco, CA, 2011.
[42] F. Lin, M. Fardad, and M. R. Jovanovic. Design of optimal sparse feedback
gains via the alternating direction method of multipliers. IEEE Trans. Automat.
Control, 58(9):2426–2431, September 2013.
[43] J. F. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over sym-
metric cones. Optim. Methods Softw., 11(1-4):625–653, 1999.
[44] Kim-Chuan Toh, M. J. Todd, and R. H. Tutuncu. SDPT3 – A Matlab software
package for semidefinite programming, version 1.3. Optim. Methods Softw., 11(1-
4):545–581, 1999.
[45] Pedro A Valdes-Sosa, Jose M Sanchez-Bornot, Agustın Lage-Castellanos, Mayrim
Vega-Hernandez, Jorge Bosch-Bayard, Lester Melie-Garcıa, and Erick Canales-
Rodrıguez. Estimating brain functional connectivity with sparse multivariate au-
toregression. Phil. Trans. R. Soc. B, 360(1457):969–981, 2005.
[46] Dongchuan Yu, Marco Righero, and Ljupco Kocarev. Estimating topology of
networks. Phys. Rev. Lett., 97(18):188701, 2006.
[47] Michael M Zavlanos, A Agung Julius, Stephen P Boyd, and George J Pappas.
Identification of stable genetic networks using convex programming. In Proceedings
of the 2008 American Control Conference, pages 2755–2760, 2008.
[48] Domenico Napoletani and Timothy D Sauer. Reconstructing the topology of
sparsely connected dynamical networks. Physical Review E, 77(2):026103, 2008.
155
[49] Mustafa Ayazoglu, Mario Sznaier, and Necmiye Ozay. Blind identification of
sparse dynamic networks and applications. In Proceedings of the 50th IEEE Con-
ference on Decision and Control, pages 2944–2950, 2011.
[50] Marzieh Nabi-Abdolyousefi, Maryam Fazel, and Mehran Mesbahi. A graph re-
alization approach to network identification. In Proceedings of the 51st IEEE
Conference on Decision and Control, pages 4642–4647, 2012.
[51] D. Materassi and M. V. Salapaka. On the problem of reconstructing an unknown
topology via locality properties of the wiener filter. IEEE Trans. Automat. Con-
trol, 57(7):1765–1777, 2012.
[52] D. Heckerman, D. Geiger, and D. M. Chickering. Learning bayesian networks:
The combination of knowledge and statistical data. JMLR, 20(3):197–243, 1995.
[53] A. Zare, Y. Chen, M. R. Jovanovic, and T. T. Georgiou. Low-complexity mod-
eling of partially available second-order statistics: theory and an efficient matrix
completion algorithm. IEEE Trans. Automat. Control, 62(3):1368–1383, March
2017.
[54] A. Zare, M. R. Jovanovic, and T. T. Georgiou. Colour of turbulence. J. Fluid
Mech., 812:636–680, February 2017.
[55] A. Zare, T. T. Georgiou, and M. R. Jovanovic. Stochastic dynamical modeling of
turbulent flows. Annu. Rev. Control Robot. Auton. Syst., 3, May 2020. in press;
doi:10.1146/annurev-control-053018-023843; also arXiv:1908.09487.
[56] Mikail M. Rubinov and O. Sporns. Complex network measures of brain connec-
tivity: uses and interpretations. Neuroimage, 52(3):1059–1069, 2010.
[57] Bharat Biswal, F Zerrin Yetkin, Victor M Haughton, and James S Hyde. Func-
tional connectivity in the motor cortex of resting human brain using echo-planar
mri. Magn. Reson. Med., 34(4):537–541, 1995.
[58] M. Lynall, D. S. Bassett, R. Kerwin, P. J. McKenna, M. Kitzbichler, U. Muller,
and E. Bullmore. Functional connectivity and brain networks in schizophrenia. J.
Neurosci., 30(28):9477–9487, 2010.
156
[59] D. Cordes, V. Haughton, J. D. Carew, K. Arfanakis, and K. Maravilla. Hierarchi-
cal clustering to measure connectivity in fmri resting-state data. JMRI, 20(4):305–
317, 2002.
[60] P. E. Vertes, A. F. Alexander-Bloch, N. Gogtay, J. N. Giedd, J. L. Rapoport, and
E. T. Bullmore. Simple models of human brain functional networks. Proceedings
of the National Academy of Sciences, 109(15):5868–5873, 2012.
[61] H. E. Egilmez, E. Pavez, and A. Ortega. Graph learning from data under laplacian
and structural constraints. IEEE J. Sel. Topics Signal Process, 2017.
[62] H. J. van Waarde, P. Tesi, and M. K. Camlibel. Topology reconstruction
of dynamical networks via constrained lyapunov equations. arXiv preprint
arXiv:1706.09709, 2017.
[63] S. Hassan-Moghaddam and M. R. Jovanovic. An interior point method for grow-
ing connected resistive networks. In Proceedings of the 2015 American Control
Conference, pages 1223–1228, Chicago, IL, 2015.
[64] S. Sojoudi and J. Doyle. On the efficacy of computational methods for learning
graphical models. submitted.
[65] R. Dai and M. Mesbahi. Optimal topology design for dynamic networks. In 50th
IEEE Conference on Decision and Control, pages 1280–1285, 2011.
[66] S. Hassan-Moghaddam and M. R. Jovanovic. Customized algorithms for growing
connected resistive networks. In Proceedings of the 10th IFAC Symposium on
Nonlinear Control Systems, pages 986–991, Monterey, CA, 2016.
[67] Donatello Materassi and Giacomo Innocenti. Topological identification in net-
works of dynamical systems. IEEE Trans. Automat. Control, 55(8):1860–1871,
2010.
[68] N. K. Dhingra, F. Lin, M. Fardad, and M. R. Jovanovic. On identifying sparse
representations of consensus networks. In Preprints of the 3rd IFAC Workshop on
Distributed Estimation and Control in Networked Systems, pages 305–310, Santa
Barbara, CA, 2012.
157
[69] Milad Siami and Nader Motee. Network sparsification with guaranteed systemic
performance measures. In Preprints of the 5th IFAC Workshop on Distributed
Estimation and Control in Networked Systems, volume 48, pages 246–251, 2015.
[70] George F Young, Luca Scardovi, Andrea Cavagna, Irene Giardina, and Naomi E
Leonard. Starling flock networks manage uncertainty in consensus at low cost.
PLoS Comput Biol, 9(1):e1002894, 2013.
[71] M. R. Jovanovic and N. K. Dhingra. Controller architectures: tradeoffs between
performance and structure. Eur. J. Control, 30:76–91, July 2016.
[72] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multiagent opti-
mization. IEEE Trans. on Automat. Control, 54(1):48–61, 2009.
[73] Jing Wang and Nicola Elia. A control perspective for centralized and distributed
convex optimization. In Proceedings of the 50th IEEE Conference on Decision
and Control and the 10th European Control Conference, pages 3800–3805, 2011.
[74] P. Latafat, L. Stella, and P. Patrinos. New primal-dual proximal algorithm for
distributed optimization. In Proceedings of the 55th IEEE Conference on Decision
and Control, pages 1959–1964, 2016.
[75] P. Latafat, N. Freris, and P. Patrinos. A new randomized block-coordinate primal-
dual proximal algorithm for distributed optimization. IEEE Trans. Automat.
Control, 2019. doi:10.1109/TAC.2019.2906924.
[76] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Opti-
mization, 1(3):123–231, 2013.
[77] K. J. Arrow, L. Hurwicz, H. Uzawa, and H. B. Chenery. Studies in linear and
non-linear programming. 1958.
[78] D. Feijer and F. Paganini. Stability of primal–dual gradient dynamics and appli-
cations to network optimization. Automatica, 46(12):1974–1981, 2010.
[79] Ashish Cherukuri, Enrique Mallada, and Jorge Cortes. Asymptotic convergence
of constrained primal–dual dynamics. Syst. Control Lett., 87:10–15, 2016.
158
[80] A. Cherukuri, E. Mallada, S. Low, and J. Cortes. The role of convexity on saddle-
point dynamics: Lyapunov function and robustness. IEEE Trans. Automat. Con-
trol, 63(8):2449–2464, 2018.
[81] N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. The proximal augmented
lagrangian method for nonsmooth composite optimization. IEEE Trans. Automat.
Control, 64(7):2861–2868, July 2019.
[82] G. Qu and N. Li. On the exponential stability of primal-dual gradient dynamics.
IEEE Control Syst. Lett., 3(1):43–48, 2018.
[83] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov’s
accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:1–43,
2016.
[84] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on ac-
celerated methods in optimization. Proc. Natl. Acad. Sci., 113(47):E7351–E7358,
2016.
[85] G. Franca, D. Robinson, and R. Vidal. ADMM and accelerated ADMM as con-
tinuous dynamical systems. 2018. arXiv:1805.06579.
[86] B. Shi, S. Du, M. I. Jordan, and W. Su. Understanding the acceleration phe-
nomenon via high-resolution differential equations. 2018. arXiv:1810.08907.
[87] M. Muehlebach and M. I. Jordan. A dynamical systems perspective on Nesterov
acceleration. 2019. arXiv:1905.07436.
[88] J. I. Poveda and N. Li. Inducing uniform asymptotic stability in time-
varying accelerated optimization dynamics via hybrid regularization. 2019.
arXiv:1905.12110.
[89] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent op-
timization. IEEE Trans. Automat. Control, 54:48, 2009.
[90] D. Feijer and F. Paganini. Stability of primal-dual gradient dynamics and appli-
cations to network optimization. Automatica, 46(12):1974–1981, 2010.
159
[91] A. Cherukuri, E. Mallada, S. Low, and J. Cortes. The role of convexity in saddle-
point dynamics: Lyapunov function and robustness. IEEE Trans. Automat. Con-
trol, 63(8):2449–2464, 2017.
[92] A. Brown and M. Bartholomew-Biggs. Some effective methods for unconstrained
optimization based on the solution of systems of ordinary differential equations.
J. Optimiz. Theory App., 62(2):211–224, 1989.
[93] J. Schropp and I. Singer. A dynamical systems approach to constrained minimiza-
tion. Numer. Func. Anal. Opt., 21(3-4):537–551, 2000.
[94] J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie. Direct Runge-Kutta discretiza-
tion achieves acceleration. In Advances in Neural Information Processing Systems,
pages 3900–3909, 2018.
[95] B. Hu, P. Seiler, and A. Rantzer. A unified analysis of stochastic optimization
methods using jump system theory and quadratic constraints. In Proceedings of
the 2017 Conference on Learning Theory, pages 1157–1189, 2017.
[96] B. Hu and L. Lessard. Dissipativity theory for Nesterov’s accelerated method.
In Proceedings of the 34th International Conference on Machine Learning, pages
1549–1557, 2017.
[97] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimiza-
tion algorithms via integral quadratic constraints: Nonstrongly convex problems.
SIAM J. Optim., 28(3):2654–2689, 2018.
[98] S. Hassan-Moghaddam and M. R. Jovanovic. Distributed proximal augmented
lagrangian method for nonsmooth composite optimization. In Proceedings of the
2018 American Control Conference, pages 2047–2052, Milwaukee, WI, 2018.
[99] S. Hassan-Moghaddam and M. R. Jovanovic. On the exponential convergence rate
of proximal gradient flow algorithms. In Proceedings of the 57th IEEE Conference
on Decision and Control, pages 4246–4251, Miami, FL, 2018. (Invited paper).
[100] D. Ding, B. Hu, N. K. Dhingra, and M. R. Jovanovic. An exponentially convergent
primal-dual algorithm for nonsmooth composite minimization. In Proceedings of
160
the 57th IEEE Conference on Decision and Control, pages 4927–4932, Miami, FL,
2018.
[101] J. Seidman, M. Fazlyab, V. Preciado, and G. Pappas. A control-theoretic ap-
proach to analysis and parameter selection of Douglas-Rachford splitting. 2019.
arXiv:1903.11525.
[102] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Variance amplification
of accelerated first-order algorithms for strongly convex quadratic optimization
problems. In Proceedings of the 57th IEEE Conference on Decision and Control,
pages 5753–5758, Miami, FL, 2018.
[103] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Performance of noisy
nesterov’s accelerated method for strongly convex optimization problems. In Pro-
ceedings of the 2019 American Control Conference, pages 3426–3431, Philadelphia,
PA, 2019.
[104] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Robustness of accelerated
first-order algorithms for strongly convex optimization problems. IEEE Trans.
Automat. Control, 2019. submitted; also arXiv:1905.11011.
[105] S. Michalowsky, C. Scherer, and C. Ebenbauer. Robust and structure exploit-
ing optimization algorithms: An integral quadratic constraint approach. 2019.
arXiv:1905.00279.
[106] S. Hassan-Moghaddam and M. R. Jovanovic. Topology design for stochastically-
forced consensus networks. IEEE Trans. Control Netw. Syst., 5(3):1075–1086,
September 2018.
[107] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[108] N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. A second order primal-dual
method for nonsmooth convex composite optimization. IEEE Trans. Automat.
Control, 2017. conditionally accepted; also arXiv:1709.01610.
161
[109] W. Shi, Q. Ling, G. Wu, and W. Yin. Extra: An exact first-order algorithm for
decentralized consensus optimization. SIAM J. Optim., 25(2):944–966, 2015.
[110] B. Gharesifard and J. Cortes. Distributed continuous-time convex optimization
on weight-balanced digraphs. IEEE Trans. Autom. Control, 59(3):781–786, 2014.
[111] S. Hassan-Moghaddam and M. R. Jovanovic. Distributed design of optimal struc-
tured feedback gains. In Proceedings of the 56th IEEE Conference on Decision
and Control, pages 6586–6591, Melbourne, Australia, 2017.
[112] M. Hong, Z. Q. Luo, and M. Razaviyayn. Convergence analysis of alternating
direction method of multipliers for a family of nonconvex problems. SIAM J.
Optimiz., 26(1):337–364, 2016.
[113] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimiza-
tion and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–124, 2011.
[114] F. Lin, M. Fardad, and M. R. Jovanovic. Augmented lagrangian approach to
design of structured optimal state feedback gains. IEEE Trans. Automat. Control,
56(12):2923–2929, December 2011.
[115] N. Matni and V. Chandrasekaran. Regularization for design. IEEE Trans. Au-
tomat. Control, 61(12):3991–4006, 2016.
[116] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of
the linear quadratic regulator. 2017. arXiv:1710.01688.
[117] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy
gradient methods for linearized control problems. 2018. arXiv:1801.05039.
[118] S. Schuler, P. Li, J. Lam, and F Allgower. Design of structured dynamic output-
feedback controllers for interconnected systems. International Journal of Control,
2011. doi:10.1080/00207179.2011.634029.
[119] S. Hassan-Moghaddam, X. Wu, and M. R. Jovanovic. Edge addition in directed
consensus networks. In Proceedings of the 2017 American Control Conference,
pages 5592–5597, Seattle, WA, 2017.
162
[120] C. Conte, C. N. Jones, M. Morari, and M. N. Zeilinger. Distributed synthesis and
stability of cooperative distributed model predictive control for linear systems.
Automatica, 69:117–125, 2016.
[121] S. Hassan-Moghaddam and M. R. Jovanovic. Proximal gradient flow and douglas-
rachford splitting dynamics: global exponential stability via integral quadratic
constraints. Automatica, 2019. submitted; also arXiv:1908.09043.
[122] S. Hassan-Moghaddam, M. R. Jovanovic, and S. Meyn. Data-driven proximal
algorithms for the design of structured optimal feedback gains. In Proceedings of
the 2019 American Control Conference, pages 5846–5850, Philadelphia, PA, 2019.
[123] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for
minimizing composite functions. SIAM J. Optim., 24(3):1420–1443, 2014.
[124] Cho-Jui Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. QUIC: Quadratic
approximation for sparse inverse covariance estimation. J. Mach. Learn. Res.,
15:2911–2947, 2014.
[125] E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted `1
minimization. J. Fourier Anal. Appl, 14:877–905, 2008.
[126] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[127] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J.
Numer. Anal., 8(1):141–148, 1988.
[128] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nons-
mooth separable minimization. Math. Prog., 117(1-2):387–423, 2009.
[129] Yu-Hong Dai and Roger Fletcher. Projected Barzilai-Borwein methods for
large-scale box-constrained quadratic programming. Numerische Mathematik,
100(1):21–47, 2005.
[130] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by
separable approximation. IEEE Trans. Signal Process., 57(7):2479–2493, 2009.
163
[131] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego
networks. In Adv. Neural Inf. Process. Syst., pages 539–547, 2012.
[132] M. Fardad, X. Zhang, F. Lin, and M. R. Jovanovic. On the properties of optimal
weak links in consensus networks. In Proceedings of the 53rd IEEE Conference on
Decision and Control, pages 2124–2129, Los Angeles, CA, 2014. (Invited paper).
[133] X. Wu and M. R. Jovanovic. Sparsity-promoting optimal control of consensus and
synchronization networks. In Proceedings of the 2014 American Control Confer-
ence, pages 2948–2953, Portland, OR, 2014.
[134] T.H. Summers, Iman Shames, J. Lygeros, and F. Dorfler. Topology Design for
Optimal Network Coherence. In Proceedings of the 14th European Control Con-
ference, July 2015.
[135] T. Rautert and E. W. Sachs. Computational design of optimal output feedback
controllers. SIAM J. Optim., 7(3):837–852, 1997.
[136] M. Hong, Z. Q. Luo, and M. Razaviyayn. Convergence analysis of alternating
direction method of multipliers for a family of nonconvex problems. SIAM J
OPTIMIZ, 26(1):337–364, 2016.
[137] M Fortin and R. Glowinski. Augmented Lagrangian method: Application to the
numerical solution of boundary-value problems. North-Holland, 1983.
[138] R. Glowinski and P. Le Tallec. Augmented Lagrangian and operator-splitting meth-
ods in nonlinear mechanics. SIAM, 1989.
[139] Katya Scheinberg, Shiqian Ma, and Donald Goldfarb. Sparse inverse covariance
selection via alternating linearization methods. In NIPS, pages 2101–2109, 2010.
[140] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation
with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[141] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. Model se-
lection through sparse maximum likelihood estimation for multivariate gaussian
or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
164
[142] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical
model. Biometrika, 94(1):19–35, 2007.
[143] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning
sparse gaussians. arXiv preprint arXiv:1206.3249, 2012.
[144] Cho-Jui Hsieh, Inderjit S Dhillon, Pradeep K Ravikumar, and Matyas A Sustik.
Sparse inverse covariance matrix estimation using quadratic approximation. In
Adv. Neural Inf. Process. Syst., pages 2330–2338, 2011.
[145] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable
minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.
[146] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for `1 regularized
loss minimization. J. Mach. Learn. Res., 12:1865–1892, 2011.
[147] Ankan Saha and Ambuj Tewari. On the non-asymptotic convergence of cyclic
coordinate descent methods. SIAM J. Optim., 23(1):576–601, 2013.
[148] Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, and Arian
Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation.
In NIPS, pages 1574–1582, 2012.
[149] Figen Oztoprak, Jorge Nocedal, Steven Rennie, and Peder A Olsen. Newton-like
methods for sparse inverse covariance estimation. In NIPS, pages 755–763, 2012.
[150] Onureena Banerjee, Laurent El Ghaoui, Alexandre d’Aspremont, and Georges
Natsoulis. Convex optimization techniques for fitting sparse gaussian graphical
models. In Proceedings of the 23rd international conference on Machine learning,
pages 89–96, 2006.
[151] Alexandre d’Aspremont, Onureena Banerjee, and Laurent El Ghaoui. First-order
methods for sparse covariance selection. SIAM J. Matrix Anal. Appl., 30(1):56–66,
2008.
[152] Cho-Jui Hsieh, Matyas A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and
Russell Poldrack. Big & quic: Sparse inverse covariance estimation for a million
variables. In NIPS, pages 3165–3173, 2013.
165
[153] N. Motee and A. Jadbabaie. Optimal control of spatially distributed systems.
IEEE Trans. Automat. Control, 53(7):1616–1629, 2008.
[154] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with
dependence trees. 14(3):462–467, 1968.
[155] M. P. van den Heuvel, C. J. Stam, R. S. Kahn, and H. R. H Pol. Efficiency of
functional brain networks and intellectual performance. J. Neurosci., 29(23):7619–
7624, 2009.
[156] S. Sojoudi and J. Doyle. Study of the brain functional network using synthetic
data. In Proceedings of the 52nd Annual Allerton Conference on Communication,
Control, and Computing, pages 350–357, 2014.
[157] P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton
methods for convex composite optimization. 2014. arXiv:1402.6655.
[158] L. Stella, A. Themelis, and P. Patrinos. Forward–backward quasi-Newton methods
for nonsmooth optimization problems. Comput. Optim. Appl., 67(3):443–487,
2017.
[159] A. Themelis, L. Stella, and P. Patrinos. Forward-backward envelope for the sum
of two nonconvex functions: Further properties and nonmonotone line-search al-
gorithms. SIAM J. Optim., 28(3):2274–2303, 2018.
[160] A. Megretski and A. Rantzer. System analysis via integral quadratic constraints.
IEEE Trans. Autom. Control, 42(6):819–830, 1997.
[161] B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychisli-
tel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
[162] J. Douglas and H. Rachford. On the numerical solution of heat conduction prob-
lems in two and three space variables. Trans. Amer. Math. Soc., 82(2):421–439,
1956.
[163] A. Themelis and P. Patrinos. Douglas-Rachford splitting and ADMM for noncon-
vex optimization: tight convergence results. 2017. arXiv:1709.05747.
166
[164] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and
proximal-gradient methods under the Polyak-Lojasiewicz condition. In Joint Eu-
ropean Conference on Machine Learning and Knowledge Discovery in Databases,
pages 795–811, 2016.
[165] Anders Rantzer. On the Kalman-Yakubovich-Popov lemma. Systems & Control
Letters, 28(1):7 – 10, 1996.
[166] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas-
Rachford splitting and ADMM. IEEE Trans. Automat. Control, 62(2):532–544,
2017.
[167] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and
the proximal point algorithm for maximal monotone operators. Math. Program.,
55(1-3):293–318, 1992.
[168] D. Gabay. Applications of the method of multipliers to variational inequalities.
In Studies in Mathematics and its Applications, volume 15, pages 299–331. 1983.
[169] A. Cherukuri, B. Gharesifard, J., and Cortes. Saddle-point dynamics: conditions
for asymptotic stability of saddle points. SIAM J. Control Optim., 55(1):486–511,
2017.
[170] Hassan K. Khalil. Noninear systems. Prentice-Hall, New Jersey, 2(5):5–1, 1996.
[171] K. Martensson and A. Rantzer. Gradient methods for iterative distributed control
synthesis. In Proceedings of the 48th IEEE Conference on Decision and Control,
pages 549–554, 2009.
[172] K. Martensson and A. Rantzer. A scalable method for continuous-time distributed
control synthesis. In Proceedings of the 2012 American Control Conference, pages
6308–6313, 2012.
[173] N. K. Dhingra and M. R. Jovanovic. A method of multipliers algorithm for
sparsity-promoting optimal control. In Proceedings of the 2016 American Control
Conference, pages 1942–1947, Boston, MA, 2016. (Invited paper).
167
[174] D. G. Luenberger. Optimization by vector space methods. John Wiley & Sons Inc.,
New York, 1969. Reprinted 1997.
[175] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with
function approximation. 42(5):674–690, 1997.
[176] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University
Press, 2007. Pre-publication edition available online.
[177] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In
Proceedings of the 48th IEEE Conference on Decision and Control, pages 3598–
3605, 2009.
[178] S. Amari. Natural gradient works efficiently in learning. Neural Computation,
10(2):251–276, 1998.
[179] S. M. Kakade. A natural policy gradient. In Advances in Neural Information
Processing Systems, pages 1531–1538, 2002.
[180] N. K. Dhingra and M. R. Jovanovic. Convex synthesis of symmetric modifications
to linear systems. In Proceedings of the 2015 American Control Conference, pages
3583–3588, Chicago, IL, 2015.
[181] D. P. Bertsekas. Convex optimization algorithms. Athena Scientific, 2015.
[182] S. Kia, J. Cortes, and S. Martınez. Distributed convex optimization via
continuous-time coordination algorithms with discrete-time communication. Au-
tomatica, 55:254–264, 2015.
168
Appendix A
Proximal PL condition
The generalization of the PL condition to nonsmooth problems was introduced in [164]
and is given by1
2Dg(x, Lf ) ≥ κ (F (x) − F ?) (A.1)
where κ is a positive constant, Lf is the Lipschitz constant of ∇f , and Dg(x, α) is given
by
− 2αminy
(〈∇f(x), y − x〉+
α
2‖y − x‖22 + g(y)− g(x)
). (A.2)
Herein, we show that if proximal PL condition (A.1) holds, there is a lower bound
given by (6.19) on the norm of the generalized gradient map Gµ(x). For µ ∈ (0, 1/Lf ),
Dg(x, 1/µ) ≥ Dg(x, Lf ), and, thus, inequality (A.1) also holds for Dg(x, 1/µ). Moreover,
from the definition of Dg(x, α) given by (A.2), it follows that
Dg(x, 1/µ) =2
µ(F (x) − Fµ(x))
where F := f + g and Fµ is the associated FB envelope. Substituting this expression
for Dg(x, 1/µ) to (A.1) yields,
1
µ(F (x) − Fµ(x)) ≥ κ (F (x) − F ?). (A.3)
169
170
From (6.5), the smooth part of the objective function f can be written as
f(x) = Fµ(x) − g(proxµg(x − µ∇f(x))) +
µ 〈∇f(x), Gµ(x)〉 − µ
2‖Gµ(x)‖22
and substituting this expression for f to both sides of (A.3) yields
(µκ− 1)2 ‖Gµ(x)‖22 ≥ κ (Fµ(x) − F ?) + (µκ− 1)
µ g(x) −(µκ− 1)
µ g( proxµg(x − µ∇f(x))) +
(µκ − 1) 〈∇f(x), Gµ(x)〉 .
(A.4)
Since
Gµ(x) − ∇f(x) ∈ ∂g(x)
the subgradient inequality (6.18) implies
0 ≤ µ ‖Gµ(x)‖22 ≤ g(x) − g(proxµg(x − µ∇f(x))) +
µ 〈∇f(x), Gµ(x)〉 .(A.5)
Thus, if µκ− 1 > 0, inequality (A.4) can be written as
(µκ− 1)2 ‖Gµ(x)‖22 ≥ κ (Fµ(x) − F ?).
Moreover, if µκ− 1 < 0, let α := −(µκ− 1) ≥ 0. From (A.4) and (A.5), we have
α
2‖Gµ(x)‖22 ≥ κ (Fµ(x) − F ?).
Thus, in both cases the inequality holds for α = |µκ− 1|. Furthermore, it can be shown
that [157],
F ? = F ?µ and argmin F = argmin Fµ.
Thus, F ? can be substituted by F ?µ and we have
‖Gµ(x)‖22 ≥ γ (Fµ(x) − F ?µ)
171
where γ := 2κ/α.
A.1 Pointwise quadratic inequality Herein, we provide a pointwise
quadratic inequality that characterizes the nonlinear map u(x) = proxµg(x− µ∇f(x))
in (6.24b). Since the proximal operator of g is firmly nonexpansive [76], it is Lipschitz
continuous with parameter 1 and we have
‖u − u‖22 ≤ ‖(x− µ∇f(x)) − (x− µ∇f(x))‖22. (A.6)
Expanding the right-hand-side of (A.6) yields,
‖u − u‖22 ≤ ‖x − x‖22 + µ2‖∇f(x) − ∇f(x)‖22 −
2µ 〈x − x,∇f(x) − ∇f(x)〉
and utilizing inequality (6.17) for an mf -strongly convex function f with an Lf -Lipschitz
continuous gradient, the last inequality can be further simplified as,
‖u − u‖22 ≤ (1 −2µmfLfLf +mf
) ‖x − x‖22 +
(µ2 − 2µ
Lf +mf) ‖∇f(x) − ∇f(x)‖22.
(A.7)
Now, depending on the sign of µ − 2/(Lf +mf ) either (6.15) or (6.16) can be used to
upper bound the second term on the right-hand-side of (A.7) and obtain
‖u − u‖22 ≤ max{
(1 − µLf )2, (1 − µmf )2}‖x − x‖22. (A.8)
Choosing η given by (6.26) completes the proof of Lemma 7. We note that this con-
traction property has been also exploited in [97,181].
Appendix B
On the exponential convergence
rate of proximal gradient flow
algorithms
Herein, we describe a primal-descent dual-ascent gradient flow dynamics based on proxi-
mal augmented Lagrangian (6.20) can be used for distributed optimization. We provide
a distributed implementation and prove global exponential stability in the presence of
strong convexity.
B.0.1 Distributed optimization Let us consider the unconstrained optimiza-
tion problem,
minimizex
n∑i= 1
fi(x)
where x ∈ Rn is the optimization variable and f :=∑
i fi is a strongly convex objective
function. It is desired to solve this problem over an undirected connected network
with the incidence matrix ET and the graph Laplacian L := ETE. To accomplish this
objective, we reformulate it as,
minimizex
n∑i= 1
fi(xi) + g(Ex) (B.1a)
172
173
where x := [x1 · · · xn ]T and g(Ex) is an indicator function associated with the equality
constraint Ex = 0,
g(Ex) =
0, Ex = 0,
∞, otherwise.(B.1b)
This constraint is introduced to ensure asymptotic agreement between the node values
xi(t) ∈ R.
As demonstrated in [108], the primal-descent dual-ascent gradient flow dynamics
based on the proximal augmented Lagrangian (6.20) can be used to solve this problem.
The resulting gradient flow dynamics are given by,
x = −∇f(x) − 1µ Lx − y
˙y = Lx,(B.2)
where L = ETE is the Laplacian matrix of the underlying communication graph between
neighboring nodes and the vector y := ET y belongs to the orthogonal complement of
the vector of all ones. This setup is well-suited for distributed implementation in which
each node only shares its state xi with its neighbors and maintains the corresponding
dual variable yi. A Lyapunov-based argument was used in [182] to prove the exponential
convergence of (B.2). Herein, we provide an alternative proof that utilizes the theory of
IQCs to establish global exponential stability of (B.2) under the condition that 1T y(0) =
0.
Assumption 5. Let the differentiable part f :=∑
i fi(x) of the objective function
in (B.1) be strongly convex with parameter mf , let ∇f be Lipschitz continuous with
parameter Lf , let the regularization function g be proper, lower semicontinuous, and
convex, let ET be incidence matrix of a connected undirected network, and let 1T y(0) = 0
in (B.2).
From Assumption 5 it follows that the graph Laplacian L := ETE is a positive
semidefinite matrix with one zero eigenvalue. Thus, it can be decomposed as
L = V ΛV T =[U 1
n 1
] [ Λ0 0
0 0
][UT
1n 1
T
]
174
where the columns of V are the eigenvectors of L, Λ0 is a diagonal matrix of the nonzero
eigenvalues of L, and the matrix U satisfies,
UTU = I, UT1 = 0, U UT = I − 1n 11
T . (B.3)
By introducing a change of variables[ψ
x
]=
[UTx1n 1
Tx
],
[φ
y
]=
[UT y1n 1
T y
]
with ψ ∈ Rn−1 and φ ∈ Rn−1, (B.2) can be written asψ
˙x
φ
˙y
=
− ( 1
µ Λ0 + mfI)ψ − φ − UTu
− 1n 1
Tu − mf x − y
Λ0 ψ
0
(B.4)
where u = ∇f(x)−mfx, and from the properties (B.3) of the matrix U we have,
x =[U 1
] [ ψ
x
]. (B.5)
By choosing y(0) = 0, we have y ≡ 0. Thus, the y-dynamics can be eliminated
from (B.4), which yields
w = Aw + B u
ξ = C w
u = ∇f(ξ) − mfξ.
(B.6a)
175
Here, w := [ψT x φT ]T , ξ := x,
A =
− ( 1
µ Λ0 + mfI) 0 − I
0 −mf 0
Λ0 0 0
B =
−UT
− 1n 1
T
0
, C =[U 1 0
](B.6b)
and the corresponding transfer function is
H(s) = −[U 1
] H1(s) 0
01
s+mf
[ UT
1n 1
T
](B.6c)
where
H1(s) = diag
(s
s2 + (λi/µ + mf ) s + λi
).
Furthermore, under Assumption 5 with u = ∇f(ξ) −mfξ, for any ξ and ξ ∈ Rn, we
have [31], [ξ − ξu− u
]T [0 (Lf −mf )I
(Lf −mf )I −2I
][ξ − ξu− u
]� 0.
We now employ the KYP lemma to establish global exponential stability of (B.6a).
Theorem 16. Let Assumption 5 hold. Then proximal gradient flow dynamics (B.6a)
are globally exponentially stable, i.e., there is τ, 0 < ρ < mf such that,
‖w(t) − w?‖ ≤ τ e−ρt ‖w(0) − w?‖
.
Proof. The KYP lemma implies global exponential stability if[Hρ(jω)
I
]∗Π
[Hρ(jω)
I
]≺ 0, ∀ω ∈ R (B.7)
176
where
Π =
[0 (Lf −mf )I
(Lf −mf )I −2I
]Hρ(jω) = H(jω − ρ).
It is easy to show that (B.7) holds for all ω ∈ R if
2 I − (Lf − mf ) (H∗ρ (jω) + Hρ(jω)) � 0.
This condition yields a decoupled family of inequalities,
ω2 + (mf − ρ)2 + (Lf − mf )(mf − ρ) > 0 (B.8)
(ω2 − bi(ρ))2 + ci(ρ)ω2 + di(ρ) > 0 (B.9)
which have to hold for all ω ∈ R and for i = 1, . . . , n− 1. Condition (B.8) clearly holds
if ρ ∈ (0,mf ). On the other hand, checking (B.9) amounts to checking a decoupled
family of quadratic inequalities in ω2 where bi(ρ), ci(ρ), and di(ρ) are parameters that
depend on µ, Lf , mf , λi, and ρ. At ρ = 0, these are given by
bi(0) = λi/µ, ci(0) = (λi/µ + mf ) (λi/µ + Lf ), di(0) = 0.
Positivity of bi(0) and ci(0) for each i and continuity of bi(ρ), ci(ρ), and di(ρ) with
respect to ρ imply the existence of ρ > 0 that guarantees (B.9) for each ω ∈ R and each
i = 1, . . . , n− 1, which completes the proof.