Maximum-likelihood estimation of multivariate normal graphical models: large-scale numerical implementation and topology selection

Joachim Dahl, Vwani Roychowdhury, Lieven Vandenberghe

October 18, 2004

Abstract

We consider large-scale numerical implementations of maximum-likelihood estimation of multivariate normal graphical models with conditional independence. We show how to formulate this as a convex optimization problem, which gives a wealth of methods that can readily be applied. We give details on large-scale implementations based on both a coordinate steepest descent method and a conjugate gradient method. With these methods we can routinely solve problems with several thousand nodes on a standard PC. As an application of the large-scale implementations we investigate topology selection for normal graphical models, where we wish to learn the node-edge incidence matrix of the underlying graph based only on measurements (i.e., based on the sample covariance estimate).

1 Introduction

In this paper we investigate methods for maximum-likelihood parameter estimation for Gaussian graphical models with conditionally independent variables.

To that end, let X ∈ R^n be a random vector with a normal distribution N(µ, Σ), observed through N i.i.d. samples, with the additional restriction that some of the variables are conditionally independent. In a graphical model the nodes represent the different random variables, and conditional independence between variables implies that the corresponding nodes in the graph are not connected (see, e.g., the extensive treatment by Lauritzen [15]). Thus, the problem we study in this paper is one of computing maximum-likelihood estimates of µ and Σ for the normal graph with the additional constraint that some of the nodes are not connected.

While most previous work on this problem has focused on its statistical properties and interpretations, little attention has been paid to its numerical properties. This is our main motivation for studying the problem, and the main topic of the paper. As we shall see, the problem is a very nice application of convex optimization, with a rich field of theory and algorithms that are readily applied. We develop coordinate descent and nonlinear conjugate gradient algorithms that are particularly well suited for large-scale problems. With these algorithms we can readily solve problems with several thousand nodes in a few minutes on a standard PC. Such large-scale implementations have several important applications. One of those applications, which we investigate, is topology selection, where we wish to identify the node-edge incidence matrix of the random graph based only on measurements (i.e., on the sample covariance estimate).


The basic maximum-likelihood problem was first studied in detail by Dempster [7], who used the name covariance selection. For the special case without conditional independence between variables the problem reduces to traditional maximum-likelihood estimation of the covariances of multivariate Gaussian random variables. An elementary result is then that the sample covariance is both the minimum-variance and the maximum-likelihood estimate. What Dempster showed in [7] is that conditional independence between some of the variables can be expressed equivalently by requiring elements to be zero in the inverse of the covariance estimate. Thus, in the language of graphical models, zeros in the inverse covariance matrix give structural or topological information about the random graph. This has recently become a popular (and exciting) research area concerned with inferring structural information about the underlying random graph based only on data measurements (XXX a few good references), or based on both measurements and assumptions on the model (XXX references), which is somewhat easier, but not less interesting.

In particular, the constraints requiring zeros in the inverse covariance estimate are simple convex constraints, i.e., the complete covariance selection problem can be formulated as a convex optimization problem. Furthermore, the special structure of the convex problem allows for very simple and efficient implementations. As more progress is made on large-scale graphical modeling and on inferring topological structure from measurements, we believe that large-scale implementations of those methods will become increasingly important.

The paper is outlined as follows: in §2 we introduce the basic covariance selection problem, and in §3, §4 and §5 we discuss implementations using Newton's method, a coordinate descent method, and a nonlinear conjugate gradient method, respectively. In §6 we discuss modifications of the problem where we want to determine which variables are conditionally independent based on a sample covariance estimate. We show simulation results for different examples in §7, and we devote appendices to details of the algorithms that may not be of interest to all readers. In the appendices we also list simple Matlab implementations of the algorithms; we do that as a service to the referees. We do not suggest including those in a final version, although we maintain that much can be learned from studying a simple and efficient working implementation.

1.1 Notation

Let A be an m × n matrix and let I = (i_1, i_2, . . . , i_q), i_k ∈ {1, 2, . . . , m}, 1 ≤ k ≤ q, and J = (j_1, j_2, . . . , j_r), j_k ∈ {1, 2, . . . , n}, 1 ≤ k ≤ r, be two index lists. Then we define A_{I,J} as the q × r matrix with entries [A_{I,J}]_{k,l} = A_{i_k,j_l}. Similarly, for index lists of equal length (q = r), we define A_{I×J} as the q × 1 column vector with entries [A_{I×J}]_k = A_{i_k,j_k}. In the remainder of this paper we only use index lists of equal length q = r.
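As a small illustration of the notation, here is a Matlab sketch using an arbitrary matrix (not from the paper); the linear indexing via sub2ind is the same device used in the appendix listings.

A = magic(5);              % any 5 x 5 matrix
I = [1 3 4]; J = [2 2 5];  % two index lists of equal length q = 3
A_IJ  = A(I,J);                        % A_{I,J}:  3 x 3 submatrix [A(i_k, j_l)]
A_IxJ = A(sub2ind(size(A), I, J))';    % A_{IxJ}:  3 x 1 vector   [A(i_k, j_k)]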

2 Problem statement

Let X, Y and Z be three random variables with continuous distributions. We say that the random variables X and Y are conditionally independent given Z if

f(x|y, z) = f(x|z).

Informally, this means that once we know Z, knowledge of Y gives no information about X. In expert systems conditional independence is a crucial property since it gives a simple factorization


of the joint distribution f(x, y, z), i.e., if x and y are conditionally independent given z then

f(x, y, z) = f(x|y, z)f(y|z)f(z) = f(x|z)f(y|z)f(z),

see the standard textbooks [15, 6] for a thorough treatment of graphical models and expert systems. We only consider random variables with multivariate normal distributions, i.e.,

    f(x) = (2π)^{-n/2} (det Σ)^{-1/2} exp( -(1/2)(x - µ)^T Σ^{-1} (x - µ) )

if x ∈ R^n, and our main interest is the special case of conditional independence between two variables given all the remaining variables, i.e., if X = [x_1 x_2 · · · x_n] is a multivariate random variable we are interested in cases where x_i and x_j are conditionally independent given all the remaining x_k, i.e.,

    f(x_i | x_1, . . . , x_{i-1}, x_{i+1}, . . . , x_n) = f(x_i | x_1, . . . , x_{i-1}, x_{i+1}, . . . , x_{j-1}, x_{j+1}, . . . , x_n),

in which case we merely say that x_i and x_j are conditionally independent. The following classical result [7] gives a nice link between conditional independence of normally distributed random variables and their joint distribution:

Proposition 1 Consider a multivariate random variable X = [x_1 x_2 · · · x_n] with joint normal distribution X ∼ N(µ, Σ). If x_i and x_j are conditionally independent then K_{ij} = 0, where K is the inverse covariance matrix,

    Σ^{-1} = K = [K_{kl}],  k, l = 1, . . . , n.

In other words, if two elements x_i and x_j of a multivariate normally distributed random variable are conditionally independent (given the remaining elements in X), then the corresponding element K_{ij} = K_{ji} in the inverse covariance matrix of the joint distribution is zero.
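As a quick numerical illustration of Proposition 1 (a Matlab sketch, not taken from the paper): for a stationary first-order autoregressive process the variables form a Markov chain, so x_i and x_j are conditionally independent whenever |i - j| > 1, and the inverse covariance is correspondingly tridiagonal.

n = 6; a = 0.8;
Sigma = toeplitz(a.^(0:n-1));   % AR(1) covariance, Sigma_ij = a^|i-j|
K = inv(Sigma);
K(abs(K) < 1e-10) = 0;          % suppress round-off
spy(K)                          % tridiagonal: K(i,j) = 0 whenever |i-j| > 1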

2.1 Optimality conditions

We next show how maximum-likelihood estimation of the parameters (µ, Σ) of a multivariate distribution X ∼ N(µ, Σ) with conditionally independent pairs of variables (x_i, x_j) can be solved efficiently as a convex optimization problem. To that end, let I_0 and J_0 denote the two index lists for the part of K with zero entries, i.e., K_{I_0,J_0} = 0.

The log-likelihood function of the observations is (up to an additive constant)

    L(y; µ, Σ) = log ∏_{i=1}^{N} f(y_i) = -(N/2) log det(Σ) - (1/2) ∑_{i=1}^{N} (y_i - µ)^T Σ^{-1} (y_i - µ).

Define the sample estimates µ̂ = (1/N) ∑_{i=1}^{N} y_i and Σ̂ = (1/N) ∑_{i=1}^{N} (y_i - µ̂)(y_i - µ̂)^T. Then the likelihood function can be written as

    L(y; µ, Σ) = (N/2) ( -log det(Σ) - Tr(Σ^{-1} Σ̂) - (µ - µ̂)^T Σ^{-1} (µ - µ̂) ).


Thus, for µ^⋆ = µ̂ we have the following maximum-likelihood problem

    maximize    log det(K) - Tr(K Σ̂)
    subject to  K_{I_0,J_0} = 0,                                                     (1)

where K = Σ^{-1} and (I_0, J_0) is the index set of conditionally independent random variables. It is well known that (1) is a concave program (e.g., [4, 5]) over the set of positive semidefinite matrices, i.e.,

    d²(log det(K)) = -Tr( K^{-1/2}(dK) K^{-1} (dK) K^{-1/2} ) ≤ 0

if K ≻ 0. Note, however, that (1) can be unbounded unless Σ̂ ≻ 0. This often occurs in practice, e.g., when the dimension is large but the sample size is small. When we solve (1) we will implicitly assume that the objective is bounded; otherwise we must add constraints that make the objective bounded. The regularized problem we consider in §6 can be viewed as such an attempt, although we shall motivate it differently.

2.2 Lagrangian duality

We can also consider the dual problem of (1). The dual problem can be written as the unconstrained problem

    minimize    -log det( Σ̂ + ∑_{(i,j)} ν_{ij} e_i e_j^T )                            (2)

in the dual variable ν, with one component ν_{ij} for each pair (i, j) in the index set (I_0, J_0). Since (1) is a convex problem and a constraint qualification holds (e.g., Slater's constraint qualification), the two problems (1) and (2) attain the same optimal value; see, e.g., [5] for details on duality theory and on how to derive (2).

From the optimal solution to (2) we can then easily retrieve the solution to (1) using the first-order optimality conditions of (2), i.e.,

    K^⋆ = ( Σ̂ - ∑_{(i,j)} ν^⋆_{ij} e_i e_j^T )^{-1}.

3 Newton’s method

Here, and in the following sections, we discuss the implementation of different algorithms for solving (1) for large sparse graphical models, i.e., models with large n and many zeros in Σ^{-1}. We start with an implementation using Newton's method, which is very efficient for moderate-size problems where the Hessian can easily be stored in memory (and inverted).

The formulation in (1) is desirable only for models with few constraints, since that would result in an unconstrained dual problem with few variables.¹ For sparse models, however, it is preferable to rewrite the problem into an equivalent problem involving only the non-zero elements of K.

¹After eliminating variables the primal problem would have roughly the same complexity.

To that end, let I = (i_1, i_2, . . . , i_q) and J = (j_1, j_2, . . . , j_q) be two index lists for the lower-triangular part of K with non-zero elements (including the diagonal). Further define the n × q matrices E_1 and E_2 as sparse matrices with non-zero elements [E_1]_{i_k,k} = 1, [E_2]_{j_k,k} = 1, k = 1, . . . , q, and zeros elsewhere. We can then write K as a function of its non-zero elements x as

    K(x) = E_1 diag(x) E_2^T + E_2 diag(x) E_1^T

if we let x = (x_o, (1/2)x_d), where x_o and x_d are the off-diagonal and diagonal parts of K(x), respectively. Note that we have scaled the diagonal elements by 1/2. We then have the unconstrained optimization problem

    minimize  f(x) = -log det(K(x)) + Tr(K(x)Σ̂).                                     (3)

When we write (3) as an unconstrained problem it is implied that we optimize f(x) over the domain of log det(·), i.e., dom f = {x | K(x) ≻ 0}.
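For concreteness, here is a minimal Matlab sketch of assembling K(x) from the index lists; the tiny pattern and numerical values are made up for illustration only.

n = 4;  I = [1 2 3 4 2];  J = [1 2 3 4 1];   % pattern: diagonal plus the edge (2,1)
x = [0.5 0.5 0.5 0.5 0.3]';                  % diagonal entries stored as (1/2)*K_ii
q = length(I);
E1 = sparse(I, 1:q, 1, n, q);                % [E1]_{i_k,k} = 1
E2 = sparse(J, 1:q, 1, n, q);                % [E2]_{j_k,k} = 1
K  = E1*spdiags(x,0,q,q)*E2' + E2*spdiags(x,0,q,q)*E1';
full(K)                                      % unit diagonal, K(2,1) = K(1,2) = 0.3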

As shown in App. A, the gradient and Hessian of (3) are

    ∇f(x) = 2 diag( E_1^T (Σ̂ - K(x)^{-1}) E_2 )
    ∇²f(x) = 2 (E_1^T K(x)^{-1} E_1) ∘ (E_2^T K(x)^{-1} E_2) + 2 (E_1^T K(x)^{-1} E_2) ∘ (E_2^T K(x)^{-1} E_1)

or, using the simpler notation,

    ∇f(x) = 2 ( Σ̂_{I×J} - K^{-1}_{I×J}(x) )                                           (4)
    ∇²f(x) = 2 ( K^{-1}_{I,I}(x) ∘ K^{-1}_{J,J}(x) ) + 2 ( K^{-1}_{I,J}(x) ∘ K^{-1}_{J,I}(x) ),   (5)

i.e., instead of computing products of the form E_1^T K^{-1}(x) E_2 we index K^{-1}(x) directly using the given sparsity pattern. A simple Matlab implementation of an algorithm for solving (3) using Newton's method is given in App. A. Despite its simplicity this algorithm is quite efficient for small and medium sized problems where the Hessian can be stored in memory. We next give an algorithm that is specifically tailored for large-scale sparse problems, where we cannot store the Hessian in memory.

4 Coordinate steepest descent

The main idea behind the algorithm is to optimize (3) over one variable at a time, thus avoiding storing all of K^{-1}(x) in memory simultaneously. We then keep iterating the element-wise minimization. The main advantage of the algorithm is that each search direction can often be computed very cheaply, and in practice convergence may be acceptable if the variables are loosely coupled and if the number of variables is not too large. Also, in many cases we may be satisfied with an approximate solution. Similar ideas are reported in the early work by Wermuth and Scheidt [19], and by Speed and Kiiveri [18], who give a specialized convergence proof. We motivate the algorithm quite differently, and we give details on large-scale implementations.

In our algorithm we assume K(x) is given (e.g., we could start with K(x) = I) and we wish to compute the update ∆x_k for x_k. Let K_0 = K(x); we then have the problem

    minimize  g(∆x_k) = -log det( K_0 + (e_i e_j^T + e_j e_i^T) ∆x_k ) + Tr( (K_0 + (e_i e_j^T + e_j e_i^T) ∆x_k) Σ̂ ),

where (i, j) is the index of x_k in K. In our formulation convergence of the algorithm follows from a standard convergence result for coordinate descent algorithms for convex functions (see, e.g., [3]):


if Σ̂ ≻ 0 then g(∆x_k) is a strictly convex, continuously differentiable function for which coordinate descent methods converge.

If we define E = [ e_i  e_j ] and B = [ 0  1 ; 1  0 ], we can write the problem as

    minimize  g(∆x_k) = -log det( K_0 + E B E^T ∆x_k ) + Tr( (K_0 + E B E^T ∆x_k) Σ̂ ).        (6)

Now (6) is still a convex unconstrained optimization problem, so ∆x_k^⋆ is optimal if and only if it satisfies g′(∆x_k^⋆) = 0, and for the simple problem (6) we can solve that equation analytically. The derivative of (6) (or the directional derivative of f along ∆x_k) is

    g′(∆x_k) = -Tr( (K_0 + E B E^T ∆x_k)^{-1} E B E^T ) + Tr( E B E^T Σ̂ )

or equivalently

    g′(∆x_k) = 2 ( Σ̂ - (K_0 + E B E^T ∆x_k)^{-1} )_{i,j}.

In App. B we show how to find ∆x_k^⋆ analytically, as the solution to a second-order equation, without explicitly forming or storing K_0^{-1}; in the remainder of this section we take the solution to g′(∆x_k) = 0 for granted. Instead of updating K_0 + E B E^T ∆x_k directly, we update its Cholesky factorization. We assume that we have a sparse Cholesky factorization of K_0; we use the same factorization when we compute ∆x_k (see App. B).

In other words,

    K_0 = L L^T,

and we wish to compute updates of the form

    L̃ L̃^T = L L^T + (e_i e_j^T + e_j e_i^T) ∆x_k.                                    (7)

The simplest case is when i = j and ∆x_k > 0, i.e., we are doing a positive update of a diagonal element of K_0. This update can be done very efficiently by a series of n - i + 1 Givens rotations, i.e., we reduce the sparse matrix [ L^T ; √(2∆x_k) e_i^T ] to upper triangular form,

    P_n P_{n-1} · · · P_i [ L^T ; √(2∆x_k) e_i^T ] = [ L̃^T ; 0 ],                    (8)

where each rotation P_l cancels an element of the last row of the augmented matrix. We then have that

    L L^T + 2∆x_k e_i e_i^T = [ L   √(2∆x_k) e_i ] P_i^T · · · P_n^T P_n · · · P_i [ L^T ; √(2∆x_k) e_i^T ] = L̃ L̃^T.

Note that on average we achieve a significant saving by skipping the first i - 1 Givens rotations because of the simple structure of the update 2∆x_k e_i e_i^T.

When ∆x_k < 0 we follow a similar approach, reducing an augmented matrix to upper-triangular form by a series of hyperbolic transformations [11]. Because of potential caveats with hyperbolic transformations, we elaborate a bit on this approach. First of all, we are performing a Cholesky downdate, i.e.,

    L̃ L̃^T = L L^T - 2|∆x_k| e_i e_i^T,


where care must be taken so that L̃ L̃^T remains positive definite. However, that issue is of no concern to us, since we are always guaranteed that K_0 + (e_i e_j^T + e_j e_i^T) ∆x_k ≻ 0 by definition of dom f, i.e., when we compute ∆x_k we optimize over the domain of f. In the hyperbolic transformation we solve the set of equations

    [ c  -s ; -s  c ] [ x_1 ; x_2 ] = [ r ; 0 ],      c² - s² = 1.

If x_2 = 0 we take s = 0 and c = 1; otherwise, if |x_2| < |x_1|, we take c = 1/√(1 - (x_2/x_1)²) and s = (x_2/x_1)c. In fact, it is easy to show that a necessary condition for positive definiteness of the downdated matrix is that |x_2| < |x_1|, or, more generally, positive definiteness of (L L^T - d d^T) implies that L²_{ii} > d²_i.

To see this, let y be the solution to L^T y = L_{ii} e_i. Since L is nonsingular (L_{ii} ≠ 0) the system always has a non-trivial solution. Furthermore, from back-substitution we have that y_i = 1 and y_j = 0, j = i + 1, . . . , n. But then

    y^T (L L^T - d d^T) y ≥ 0  ⟹  L²_{ii} ≥ ∑_{j=1}^{i-1} y_j² d_j² + d_i²,

which shows that L L^T ≻ d d^T ⇒ L²_{ii} ≥ d²_i. Thus, if |x_2| > |x_1| we have a certificate that the factorization after downdating would be indefinite.

Unlike the Givens rotation, the hyperbolic transformation is not orthogonal, which makes it less stable. However, similar to Givens rotations, it lets us reduce a matrix to upper-triangular form. As before, we reduce an augmented matrix to upper-triangular form,

    H_n H_{n-1} · · · H_i [ L^T ; √(2|∆x_k|) e_i^T ] = [ L̃^T ; 0 ].                    (9)

By construction the hyperbolic transformations have the useful property that H^T S H = S, where S = [ I_n  0 ; 0  -1 ], and thus we get

    L L^T - 2|∆x_k| e_i e_i^T = [ L   √(2|∆x_k|) e_i ] S [ L^T ; √(2|∆x_k|) e_i^T ]
                              = [ L   √(2|∆x_k|) e_i ] H_i^T · · · H_n^T S H_n · · · H_i [ L^T ; √(2|∆x_k|) e_i^T ]
                              = [ L̃   0 ] [ L̃^T ; 0 ] = L̃ L̃^T.

The complexity of the downdate using hyperbolic transformations is thus identical to that of the update using Givens rotations.

The rank-two Cholesky update (7) for i ≠ j is handled by defining z_1 = (1/√2)(e_i + e_j) and z_2 = (1/√2)(e_i - e_j), rewriting the update as

    L L^T + (e_i e_j^T + e_j e_i^T) ∆x_k = L L^T + (z_1 z_1^T - z_2 z_2^T) ∆x_k,

and performing two rank-one updates using Givens rotations and hyperbolic transformations as described above. In this case the rotations (or transformations) are applied to columns min{i, j}, . . . , n of the augmented matrices, i.e., we apply (n - min{i, j} + 1) Givens rotations and (n - min{i, j} + 1) hyperbolic transformations.
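In Matlab the same rank-two modification can be mimicked with the built-in cholupdate, which operates on the dense upper-triangular factor R = L^T. The following is a sketch of the idea only, with made-up data; the implementation described above instead applies the Givens and hyperbolic transformations directly to the sparse factor.

n  = 5;  i = 2;  j = 4;  dx = 0.3;
K0 = 2*eye(n);                          % any positive definite starting point
ei = zeros(n,1); ei(i) = 1;  ej = zeros(n,1); ej(j) = 1;
z1 = (ei + ej)/sqrt(2);  z2 = (ei - ej)/sqrt(2);

R = chol(K0);                           % upper triangular, R'*R = K0
if dx >= 0
    R = cholupdate(R, sqrt(dx)*z1);     % add      dx*z1*z1'
    R = cholupdate(R, sqrt(dx)*z2,'-'); % subtract dx*z2*z2'
else
    R = cholupdate(R, sqrt(-dx)*z2);
    R = cholupdate(R, sqrt(-dx)*z1,'-');
end
norm(R'*R - (K0 + dx*(ei*ej' + ej*ei')), 'fro')   % ~ 0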


So far we have only discussed how to update a single coordinate x_k. In practice we compute all the derivatives g′(∆x_k), k = 1, . . . , q, at the current point and then update the coordinate with the largest |g′(∆x_k)|; that is why we call the algorithm "coordinate steepest descent".

We terminate the algorithm when all |g′(∆x_k)| are below a given tolerance, or when the decrease in the objective per iteration falls below a given tolerance.
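A minimal dense Matlab sketch of the selection rule follows (illustration only, with a made-up pattern and covariance; the large-scale implementation never forms K^{-1} explicitly but computes the required entries from the Cholesky factor).

n = 4;
Sigma = toeplitz(0.5.^(0:n-1));          % sample covariance stand-in
I = [1 2 3 4 2 3];  J = [1 2 3 4 1 2];   % stored pattern: diagonal plus two edges
K = eye(n);                              % current iterate K0
Kinv = inv(K);
idx  = sub2ind([n n], I, J);
g    = 2*(Sigma(idx) - Kinv(idx));       % g'(0) for each stored coordinate, cf. (4)
[~, kstar] = max(abs(g));                % coordinate with the steepest slope; its
                                         % step is then computed analytically (App. B)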

5 Nonlinear conjugate gradients

In this section we use a nonlinear conjugate gradient algorithm (see, e.g., [16]) to solve (3). The conjugate gradient algorithm is especially suited for large sparse problems like (3), and it is well studied. The basic form of the (Polak-Ribière) conjugate gradient algorithm we use is given below:

    Given x_0 and tolerance ε > 0.
    Compute f_0 = -log det(K(x_0)) + Tr(K(x_0)Σ̂).
    Compute ∇f_0 (see (4)).
    Set p_0 = -∇f_0, k ← 0.
    repeat
        Compute α_k using a line search and set x_{k+1} ← x_k + α_k p_k.
        Compute f_{k+1} and ∇f_{k+1}.
        p_{k+1} ← -∇f_{k+1} + ( ∇f_{k+1}^T (∇f_{k+1} - ∇f_k) / ‖∇f_k‖² ) p_k.
        if ∇f_{k+1}^T p_{k+1} ≥ 0 then
            p_{k+1} ← -∇f_{k+1}.
        end if
        k ← k + 1
    until f_{k-1} - f_k < ε

A few comments about the algorithm are in order. In the algorithm we explicitly check whether the updated conjugate gradient direction is a descent direction, i.e., whether ∇f_{k+1}^T p_{k+1} < 0, and otherwise we take a steepest descent step instead. It can be shown that if α_k is chosen as the exact minimizer of the line-search problem argmin_α f(x_k + α p_k), then the p_{k+1} computed by the algorithm is always a descent direction.

If we use an inexact line-search algorithm, then the update step

    p_{k+1} ← -∇f_{k+1} + ( ∇f_{k+1}^T (∇f_{k+1} - ∇f_k) / ‖∇f_k‖² ) p_k,

known as the Polak-Ribière update, is often believed to be more robust than other similar update rules.

In each step of the algorithm we need to evaluate f_{k+1} and ∇f_{k+1}, which we do from a Cholesky factorization of K(x_{k+1}). Thus, not counting the line search, each step of the algorithm is dominated by computing a Cholesky factorization of K at the current point.

In the line search we first need to find the feasible interval for α_k. Since p_k is a descent direction we know that α_k > 0. We then do a guarded backtracking until K(x_k + α_k p_k) ≻ 0 by performing a sequence of Cholesky factorizations. For the actual line-search algorithm we have observed good performance of both a simple backtracking line search that only uses sufficient decrease as a stopping


criterion, and of an extension of the backtracking line search with a quadratic or cubic interpolation of the line-search function (see [16]).

In the following we drop the iteration index in the subscripts for ease of notation. Both line searches require only evaluation of the derivative f′_α(x + αp) = ∇f(x)^T p (where ∇f(x) is known from the conjugate gradient iteration) and function evaluations of f(x + αp), where each function evaluation is dominated by the cost of computing the Cholesky factorization of K. We can also use Newton's method to compute the exact minimizer of the line search (although our simulations, as well as common belief, suggest that it may not be worthwhile). For Newton's method we must evaluate the Hessian ∇²f(x) along the search direction p, i.e.,

    f″_α(x + αp) = 2 p^T ( (K^{-1}_{I,I}(x) ∘ K^{-1}_{J,J}(x)) + (K^{-1}_{I,J}(x) ∘ K^{-1}_{J,I}(x)) ) p.

We do not need to store the complete Hessian ∇²f(x) (which was our main reason for developing large-scale methods to begin with) to evaluate f″_α. Instead we write (see, e.g., [13])

    p^T ( K^{-1}_{I,I}(x) ∘ K^{-1}_{J,J}(x) ) p = Tr( diag(p) K^{-1}_{I,I}(x) diag(p) K^{-T}_{J,J}(x) ) = ∑_{k=1}^{q} p_k K^{-1}_{i_k,I}(x) ( K^{-1}_{J,j_k}(x) ∘ p )

    p^T ( K^{-1}_{I,J}(x) ∘ K^{-1}_{J,I}(x) ) p = Tr( diag(p) K^{-1}_{I,J}(x) diag(p) K^{-T}_{J,I}(x) ) = ∑_{k=1}^{q} p_k K^{-1}_{i_k,J}(x) ( K^{-1}_{I,j_k}(x) ∘ p ),

consistent with our definition that, e.g., K^{-1}_{I,i_k} is a q × 1 column vector and I = (i_1, i_2, . . . , i_q), J = (j_1, j_2, . . . , j_q).
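The two trace identities translate into a short accumulation loop that never forms the q × q Hessian. The following dense Matlab sketch (with a made-up pattern) is for illustration only; in a genuinely large problem the required entries of K^{-1} would be computed from the Cholesky factor rather than from a full inverse.

n = 4;  I = [1 2 3 4 2];  J = [1 2 3 4 1];  q = length(I);
K = eye(n) + 0.1*ones(n);  Kinv = inv(K);    % small stand-in for K(x) and its inverse
p = randn(q,1);                              % current search direction

h = 0;
for k = 1:q
    h = h + p(k) * ( Kinv(I(k),I) * (Kinv(J,J(k)) .* p) );   % (Kinv_II o Kinv_JJ) term
    h = h + p(k) * ( Kinv(I(k),J) * (Kinv(I,J(k)) .* p) );   % (Kinv_IJ o Kinv_JI) term
end
fpp = 2*h;                                   % f''_alpha along p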

The complexity of the line-search algorithm may be further reduced using Lanczos' method for approximating log det(A); see [11, 10, 1]. However, as shown in [1], the reduction in complexity may be limited, since we must apply Lanczos' method n times to approximate log det(A).

Convergence of the algorithm is almost immediate, since we are minimizing a convex function by taking a sequence of descent steps. A simple Matlab implementation of the conjugate gradient algorithm is given in App. C.

5.1 Preconditioning

Good preconditioning is an essential part of conjugate gradient algorithms. The convergence properties of the conjugate gradient algorithm are closely related to the eigenvalue spread of the Hessian. Thus, besides being easy to compute, a good preconditioner C should make the resulting Hessian close to the identity, i.e.,

    ∇²f(Cx) = C^T ∇²f(x) C ≈ I,

and if we knew ∇²f(x^⋆), then C = (∇²f(x^⋆))^{-1/2} would be an ideal preconditioner. From the optimality conditions we know that

    ∇f(x^⋆) = 2 diag( E_1^T (Σ̂ - K^{-1}(x^⋆)) E_2 ) = 0,


i.e., at optimality K^{-1}_{ij} = Σ̂_{ij} for all (i, j) in the given sparsity pattern. Thus we can approximate the Hessian by replacing K^{-1} by Σ̂ in (5), i.e., if we let

    H = 2( (Σ̂_{I,I} ∘ Σ̂_{J,J}) + (Σ̂_{I,J} ∘ Σ̂_{J,I}) ) = L L^T,                    (10)

then L^{-T} is a good candidate for a preconditioner. However, L^{-1} is in general not sparse and, more critically, we do not want to store H in memory. Instead we directly compute a sparse incomplete inverse factorization (e.g., [2]), accessing only one column of H at a time, where, e.g., the r-th column is readily computed as

    2( (Σ̂_{I,i_r} ∘ Σ̂_{J,j_r}) + (Σ̂_{I,j_r} ∘ Σ̂_{J,i_r}) ).

Sparsity is achieved by reducing the fill-in in the incomplete factorization (values below a certain threshold are discarded). As an example we show the sparsity pattern of a preconditioner for Example 7.1 obtained using this approach; see Fig. 1.

Figure 1: Sparsity pattern for the sparse incomplete inverse Cholesky factor.

For comparison, the inverse factorization of (10) is almost completely dense. In the simulations in §7 we demonstrate excellent results using this sparse preconditioner; it is essentially as good as using the dense inverse factorization of (10). There is no good heuristic for how much fill-in is required; a good sparse preconditioner relies on a bit of trial and error.
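Since the r-th column of the approximate Hessian (10) is cheap to form on the fly, H itself never needs to be stored. A Matlab sketch of the column computation, with a small made-up pattern (not the paper's incomplete-factorization code):

n = 5;  I = [1 2 3 4 5 3];  J = [1 2 3 4 5 1];  r = 6;
Sigma = toeplitz(0.6.^(0:n-1));              % stand-in for the sample covariance
Hcol = 2*( Sigma(I,I(r)).*Sigma(J,J(r)) + Sigma(I,J(r)).*Sigma(J,I(r)) );   % column r of (10)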

6 Topology selection

In this section we discuss different approaches to topology selection, i.e., how to estimate a sparsity pattern that is consistent with the sample covariance estimate. It is clear from the formulation in (1) that a simple maximum-likelihood score is insufficient, since the maximum-likelihood score is maximized by the full (unconstrained) estimate.

A typical approach for obtaining a sparse estimate is to regularize the objective with a penalty term ∑_{ij} |x_{ij}|. We then get the minimization problem

    minimize  f(x) = -log det(K(x)) + Tr(K(x)Σ̂) + γ ∑_{ij} |x_{ij}|,                    (11)


where γ > 0 is a regularization constant that must be chosen appropriately beforehand. We can also interpret (11) as a way to bound the objective in cases where Σ̂ is not strictly positive definite (e.g., for small sample sizes).

A similar problem is considered in [14], where problem (11) is studied without a given sparsity pattern, i.e., an l1-norm penalized dense maximum-likelihood estimate is computed. In contrast to our interest in sparse estimates, their primary motivation for regularizing the problem is to trade variance for bias in the estimate. Also, no attempt is made in [14] to exploit convexity in the solution of the problem; instead the solution is approximated using an alternating optimization approach.

To solve (11) we first rewrite it as a constrained problem,

    minimize    f(x) = -log det(K(x)) + Tr(K(x)Σ̂)
    subject to  -z ⪯ x_o ⪯ z
                1^T z ≤ ρ,                                              (12)

in the variables x and z, with a different regularization constant ρ. We can, e.g., solve (12) using a standard interior-point or barrier method [5]. In each step of such an algorithm we solve an unconstrained problem of the form

    minimize  h(x, z) = t f(x) - ∑_i log(z_i - x^o_i) - ∑_i log(z_i + x^o_i) - log( ρ - ∑_i z_i )                    (13)

for a fixed constant t ≥ 0. Let

    d_1 = ( (z_1 - x^o_1)^{-1}, (z_2 - x^o_2)^{-1}, . . . , (z_{q-n} - x^o_{q-n})^{-1} )
    d_2 = ( (z_1 + x^o_1)^{-1}, (z_2 + x^o_2)^{-1}, . . . , (z_{q-n} + x^o_{q-n})^{-1} )
    d_3 = ( (ρ - ∑_i z_i)^{-1}, (ρ - ∑_i z_i)^{-1}, . . . , (ρ - ∑_i z_i)^{-1} )

where q - n = |x_o|. Then the first- and second-order derivatives of h(x, z) can be written as

    ∇_x h(x, z)  = t ∇_x f(x) + [ d_1 - d_2 ; 0 ]
    ∇_z h(x, z)  = -d_1 - d_2 + d_3
    ∇_{xx} h(x, z) = t ∇_{xx} f(x) + [ diag((d_1 ∘ d_1) + (d_2 ∘ d_2))  0 ; 0  0 ]
    ∇_{xz} h(x, z) = [ diag(-(d_1 ∘ d_1) + (d_2 ∘ d_2)) ; 0 ]
    ∇_{zz} h(x, z) = diag((d_1 ∘ d_1) + (d_2 ∘ d_2)) + d_3 d_3^T,

where the zero blocks correspond to the diagonal part x_d of x, which does not enter the barrier terms.

Thus, to solve the l1-norm regularized problem we must solve a modest number of problems, say 5-10, of the form (13). Each of those problems has q - n additional variables compared to (3). The conjugate gradient method of §5 is well suited for solving each of the problems (13).
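For illustration, evaluating the barrier objective (13) at a trial point might look as follows. This is a dense Matlab sketch with a tiny made-up pattern and made-up values of t and ρ, not the paper's implementation; returning +Inf outside the domain lets a backtracking line search reject infeasible steps directly.

n = 3;  I = [1 2 3 2];  J = [1 2 3 1];  q = length(I);   % diagonal plus the edge (2,1)
Sigma = toeplitz(0.5.^(0:n-1));              % sample covariance stand-in
x  = [0.5 0.5 0.5 0.1]';                     % last entry is the off-diagonal variable x_o
xo = x(4);  z = 0.2;  rho = 1;  t = 10;      % barrier parameter t and l1 budget rho

E1 = sparse(I,1:q,1,n,q);  E2 = sparse(J,1:q,1,n,q);
K  = E1*diag(x)*E2' + E2*diag(x)*E1';
[L,notpd] = chol(K,'lower');
if notpd || any(z - abs(xo) <= 0) || rho - sum(z) <= 0
    h = Inf;                                             % outside dom h
else
    f = -2*sum(log(diag(L))) + sum(sum(K.*Sigma));       % f(x) = -log det K(x) + Tr(K(x)*Sigma)
    h = t*f - sum(log(z - xo)) - sum(log(z + xo)) - log(rho - sum(z));
end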

Typically we solve (12) for several different values of ρ and use the solution that is deemed best by some criterion. Hastie et al. [12] have proposed a clever scheme for regularized quadratic programming problems, where the problem can be re-solved for many values of the regularization constant ρ at the same order of complexity as solving a single quadratic program. Their approach is based on the observation that the dual problem is linear in the regularization constant.


7 Numerical experiments

In this section we show simulation results of the algorithms presented so far.

Example 7.1 We consider a large problem with n = 2000 and a random sparsity pattern with 4000 lower-triangular non-zero entries; see Fig. 2. In the simulations we use a symmetric minimum-degree permutation of Σ, which gives sparser Cholesky factorizations without changing the underlying problem.

Figure 2: Sparsity pattern for the random sparse covariance matrix in Example 7.1. (a) Sparsity pattern of the covariance matrix. (b) Sparsity pattern after a symmetric minimum-degree permutation.

In Fig. 3 we plot the convergence of the different methods discussed so far. As a reference we use Newton's method (§3) to compute the optimal solution (this is about the largest problem we can solve on a Pentium IV computer with 2 GB of RAM using Newton's method). We note several interesting things from Fig. 3:

– The coordinate steepest descent algorithm has a break in its convergence rate after k = 2000 iterations. In the first 2000 iterations the algorithm only updates the diagonal elements, which dominate the objective for this example. After all the diagonal elements have been updated, convergence slows down.

– Surprisingly, using the exact line search is actually worse in this example. Careful inspection of the curves, however, shows that the exact line-search algorithm converges faster in the very initial phase, but then slows down. Further inspection revealed that the backtracking line search on average takes longer steps than the exact line search for this example, which explains the apparent contradiction. In practice it is too costly to use an exact line search; we only show it here for comparison.

– The cubic interpolation and the simple backtracking line search perform comparably (over many different simulations), but the cubic interpolation tends to be more robust (i.e., occasionally the backtracking line search results in many restarts of the conjugate gradient algorithm, which slows down convergence).

– Both the dense and the sparse preconditioner from §5.1 are excellent (over a wide range of simulations). In this example the dense preconditioner converges in only 4 iterations, and the sparse preconditioner in 5 iterations.

Example 7.2 In this example we experiment with the topology selection discussed in §6. In the experiment we randomly generate a sparse inverse covariance matrix of dimension 20 × 20, and we consider both a moderate sample size of K = 100 and a small sample size of K = 10.

We first solve problem (12) for several values of ρ using a full index set, i.e., we make no prior assumptions on the sparsity pattern. For each value of ρ we threshold the corresponding solution K(x^⋆) to identify a sparsity pattern (many elements are close to, but not exactly, zero), and we then use the sparse estimate K(x^⋆) to evaluate the log-likelihood function f(x).
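The thresholding step itself is a one-liner. A Matlab sketch, where the cut-off value and the 3 × 3 matrix are made up for illustration:

tau   = 1e-3;                                  % cut-off (a hypothetical choice)
Kstar = [1 .2 5e-4; .2 1 .3; 5e-4 .3 1];       % stand-in for a computed solution K(x*)
P     = abs(Kstar) > tau;                      % estimated sparsity pattern
[I,J] = find(tril(P));                         % index lists for re-solving (1) with this pattern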

For K = 100 we show this trade-off curve in Fig. 4. From the trade-off curve we identify a "knee" around N = 10 non-zero elements, and we use the corresponding value of ρ. The identified sparsity pattern is shown in Fig. 5a. Fig. 5b shows the identified sparsity pattern for the small sample size K = 10, where we identified the value of ρ from a separate trade-off curve.

The success of the proposed method is highly dependent on the quality of the sample covariance estimate; for a small sample size the algorithm performs poorly (but this is a very difficult problem). Most of the wrongly classified entries ("+") have small magnitude, however, and can be successfully removed by simple heuristics; here we refrain from using such heuristics.

8 Conclusions

In this paper we presented several different algorithms for solving large-scale sparse covariance selection problems. Our algorithms were all based on the observation that the basic covariance selection problem can be formulated as a convex optimization problem.

We directly exploited the inherent sparsity in the inverse covariance estimate that arises from modeling conditional independence between Gaussian random variables (as simple convex equality constraints), and we gave implementation details that are critical for successful large-scale deployment.

With these methods we can routinely solve problems involving several thousand variables on a standard PC. We also suggested applications in topology selection for large sparse random graphs, a topic we believe will become increasingly important in fields where topological information about sparse random graphs is inferred based only on data measurements, or on a combination of measurements and certain model assumptions, with applications in areas such as bioinformatics, fMRI analysis, and more.


Figure 3: Convergence of the different methods for the large-scale problem in Example 7.1; panel (a) shows the full iteration range and panel (b) a close-up. f^⋆ is the solution computed using Newton's method (see §3) and f_k is the objective value of the different iterative methods at iteration k. A: coordinate steepest descent (§4); B: conjugate gradient method with exact line search (§5); C: steepest descent with backtracking line search; D and E: conjugate gradient method with line search using cubic interpolation and backtracking, respectively; F, G: conjugate gradient method (cubic-interpolation line search) using the dense inverse and the sparse incomplete inverse factorization of (10), respectively, as a preconditioner.


Figure 4: Trade-off curve between the log-likelihood objective f(x) and the number of non-zeros N in K(x) (not counting the diagonal elements).

Figure 5: Identified sparsity pattern for Example 7.2, for sample sizes (a) K = 100 and (b) K = 10. Correctly classified elements: "•"; missed elements: "◦"; wrongly identified elements: "+".


A Newton’s method

A.1 Deriving the Newton step

Here we show the details for solving

    minimize  f(x) = -log det(K(x)) + Tr(K(x)Σ̂),                                     (14)

where K(x) = E_1 diag(x) E_2^T + E_2 diag(x) E_1^T, using Newton's method. As shown in, e.g., [5], a second-order approximation of h(X) = -log det(X) near X is

    h(Z) ≈ h(X) - Tr( X^{-1}(Z - X) ) + (1/2) Tr( X^{-1}(Z - X) X^{-1}(Z - X) ).

Now, to identify the gradient of (14) we have that

    ∇f(x) = -K^*(K(x)^{-1}) + K^*(Σ̂),

where K^*(Y) = 2 diag(E_1^T Y E_2) is the adjoint of the mapping K(·), and thus

    ∇f(x) = 2 diag( E_1^T (Σ̂ - K(x)^{-1}) E_2 ).                                     (15)

Similarly, we have for the Hessian that

    ∇²f(x) ∆x = K^*( K^{-1}(x) K(∆x) K^{-1}(x) ).

Using the fact ([17]) that

    diag( A diag(z) B ) = (A ∘ B^T) z,

we identify the Hessian as

    ∇²f(x) = 2 (E_1^T K(x)^{-1} E_1) ∘ (E_2^T K(x)^{-1} E_2) + 2 (E_1^T K(x)^{-1} E_2) ∘ (E_2^T K(x)^{-1} E_1).                    (16)

A.2 A Matlab algorithm using Newton’s method

function R = covsel_sparse_primal(Y,I,J)
% Newton's method for problem (3).  Y is the sample covariance estimate and
% (I,J) are the index lists of the lower-triangular non-zero pattern of K.
n = size(Y,1); N = sub2ind([n,n],I,J); Nt = sub2ind([n,n],J,I);
R = speye(n); dR = spalloc(n,n,2*length(N)-n);
for iters=1:1000
  Rc = chol(R)'; Rci = inv(Rc); Rinv = Rci'*Rci;
  grad = 2*(Y(N)-Rinv(N));                                  % gradient (4)
  hess = 2*(Rinv(I,J).*Rinv(J,I) + Rinv(I,I).*Rinv(J,J));   % Hessian (5)
  v = -hess\grad;                                           % Newton step
  dR(N) = v; dR(Nt) = v; dR(1:n+1:n*n) = 2*dR(1:n+1:n*n);
  sqntdecr = -grad'*v; if sqntdecr<1e-12, break, end        % squared Newton decrement
  % backtracking line-search on f(R+t*dR)-f(R); keeps R+t*dR positive definite
  t = 1;
  lambda = eig(full(dR), full(R));
  while 1,
    if 1+t*lambda>0 & -sum(log(1+t*lambda))+t*Y(:)'*dR(:) < ...
        -0.01*t*sqntdecr;
      break
    end
    t = t*0.5;
  end
  R = R + t*dR;
end

B Coordinate steepest descent

B.1 An analytical solution to the update step

Here we derive an analytical expression for the solution of the update step. The update step ∆x_k is the solution to

    g′(∆x_k) = 2 ( Σ̂ - (K_0 + E B E^T ∆x_k)^{-1} )_{i,j} = 0.

Using the matrix inversion lemma (and the fact that B^{-1} = B) we have that

    (K_0 + E B E^T ∆x_k)^{-1} = K_0^{-1} - ∆x_k K_0^{-1} E ( B + ∆x_k E^T K_0^{-1} E )^{-1} E^T K_0^{-1}.

Define

    C = [ (K_0^{-1})_{ii}   (K_0^{-1})_{ij} ; (K_0^{-1})_{ji}   (K_0^{-1})_{jj} ]

and note that E^T K_0^{-1} E = C. Thus we can reduce the optimality condition to

    e_i^T ( Σ̂ - (K_0 + E B E^T ∆x_k)^{-1} ) e_j = Σ̂_{ij} - C_{12} + ∆x_k e_1^T C (B + ∆x_k C)^{-1} C e_2 = 0.

Note that from our Cholesky factorization K_0 = L L^T we can compute C very efficiently by first solving the sparse n × 2 system L C̄ = E for C̄ and then forming C = C̄^T C̄, so we never explicitly form K_0^{-1}. Next define

    W = [ 1   1 ; -(C_{11}/C_{22})^{1/2}   (C_{11}/C_{22})^{1/2} ].

Then it can easily be verified that

    W^T (B + ∆x_k C) W = diag(λ + ∆x_k γ),

where λ = diag(W^T B W) and γ = diag(W^T C W), i.e., we can jointly diagonalize B and C by W. For ease of notation define c_1 = W^T [ C_{11} ; C_{12} ] and c_2 = W^T [ C_{21} ; C_{22} ]. We can then simplify the optimality condition as

    Σ̂_{ij} - C_{12} + ∆x_k c_1^T diag(λ + ∆x_k γ)^{-1} c_2 = 0.

Thus, after a bit of work, we get a second-order equation for ∆x_k,

    ( (Σ̂_{ij} - C_{12}) γ_1 γ_2 + γ_1 c_{12} c_{22} + γ_2 c_{11} c_{21} ) ∆x_k²
      + ( (Σ̂_{ij} - C_{12})(γ_1 λ_2 + γ_2 λ_1) + λ_1 c_{12} c_{22} + λ_2 c_{11} c_{21} ) ∆x_k
      + (Σ̂_{ij} - C_{12}) λ_1 λ_2 = 0,                                                    (17)

where c_{kl} denotes the l-th component of c_k.


Thus, in general we get two roots, but only one of the roots is in dom f, i.e., only for one of them is K_0 + E B E^T ∆x_k ≻ 0. To see which, we can rewrite

    det(K_0 + E B E^T ∆x_k) = det(K_0) ( ∆x_k² (C_{12}² - C_{11} C_{22}) + 2 C_{12} ∆x_k + 1 ),                    (18)

and the optimal solution ∆x_k^⋆ is the root that makes (18) positive.
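A self-contained numerical Matlab sketch of this update (it bypasses the closed-form coefficients: since the stationarity condition multiplied by det(B + ∆x_k C) is a quadratic in ∆x_k, three evaluations recover (17) exactly, and (18) then selects the feasible root; the tiny problem data are made up for illustration):

n = 4;  i = 2;  j = 3;
Sigma = toeplitz(0.5.^(0:n-1));                % sample covariance stand-in
K0 = eye(n);
E  = zeros(n,2);  E(i,1) = 1;  E(j,2) = 1;  B = [0 1; 1 0];
C  = E'*(K0\E);  C11 = C(1,1);  C12 = C(1,2);  C22 = C(2,2);

% stationarity condition times det(B + d*C) is a quadratic in d
q  = @(d) (Sigma(i,j) - C12 + d*C(1,:)*((B + d*C)\C(:,2))) * det(B + d*C);
co = polyfit([-0.5 0 0.5], arrayfun(q, [-0.5 0 0.5]), 2);  % exact: q is quadratic
dxs  = roots(co);                              % the two candidate roots
feas = dxs.^2*(C12^2 - C11*C22) + 2*C12*dxs + 1;   % (18) divided by det(K0)
dxk  = dxs(find(feas > 0, 1));                 % root for which K0 + dxk*E*B*E' stays PD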

C Nonlinear conjugate gradient method

function R = covsel_sparse_cg(Y,I,J,precon)
% Nonlinear conjugate gradient method (Section 5) for problem (3).  Y is the
% sample covariance estimate, (I,J) the index lists of the non-zero pattern,
% and precon a (sparse) preconditioner.
n = size(Y,1); N = sub2ind([n,n],I,J); Nt = sub2ind([n,n],J,I);
R = spalloc(n,n,2*length(N)-n);
R = R + speye(n); dR = spalloc(n,n,2*length(N)-n);
df = 2*(Y(N)-R(N)); p = -precon*df;
MAXITERS = 2000;
for iters = 1:MAXITERS
  dR(N) = p; dR(Nt) = dR(N); dR(1:n+1:n*n) = 2*dR(1:n+1:n*n);
  alpha = linesearch_backtracking(R, dR, Y, df'*p);
  R = R + alpha*dR;                                  % take the step
  L = chol(R)'; Li = inv(L); Rinv = Li'*Li;
  fsave(iters) = -2*sum(log(diag(L))) + Y(:)'*R(:);  % objective value at the new iterate
  if iters>1 & abs(fsave(iters)-fsave(iters-1)) < 1e-12, break, end;
  dfn = 2*precon'*(Y(N)-Rinv(N));                    % preconditioned gradient
  p = precon*(-dfn + dfn'*(dfn-df)/(df'*df)*p);      % Polak-Ribiere update
  df = dfn;
  if (df'*p>0) | ~mod(iters,length(N)), p = -precon*df; end  % restart on non-descent
end

function alpha = linesearch_backtracking(R0, dR, Y, phi_prime0)
% Backtracking line search with a sufficient-decrease condition; the returned
% step keeps R0 + alpha*dR positive definite.
Rc = chol(R0)'; Rci = inv(Rc); Rinv = Rci'*Rci;
phi0 = -2*sum(log(diag(Rc))) + R0(:)'*Y(:);
alpha = 1;
for iters=1:100
  R = R0 + alpha*dR;
  [Rc, e] = chol(R); Rc = Rc';
  if ~e & -2*sum(log(diag(Rc))) + Y(:)'*R(:) < phi0 + 0.01*alpha*phi_prime0,
    break
  end
  alpha = alpha*0.5;
end


References

[1] Z. Bai, M. Fahey, and G. Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, pages 71-89, 1996.

[2] M. Benzi, C. D. Meyer, and M. Tuma. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM J. Sci. Comput., 17(5):1135-1149, 1996.

[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[4] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. SIAM, 1994.

[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.

[7] A. P. Dempster. Covariance selection. Biometrics, 28(1):157-175, 1972.

[8] A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, pages 196-212, 2004.

[9] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix factorizations. Mathematics of Computation, 28(126), 1974.

[10] G. Golub and G. Meurant. Matrices, moments and quadrature. Technical report, Stanford University, 1994.

[11] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[12] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Technical report, Stanford University, 2004.

[13] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

[14] J. Z. Huang, N. Liu, and M. Pourahmadi. Covariance selection and estimation via penalized normal likelihood. Preprint.

[15] S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, 1996.

[16] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2001.

[17] T. Roh and L. Vandenberghe. Discrete transforms, semidefinite programming, and sum-of-squares representations of nonnegative polynomials. Submitted to SIAM J. Opt., 2004.

[18] T. P. Speed and H. T. Kiiveri. Gaussian Markov distributions over finite graphs. Ann. Statist., 14(1):138-150, 1986.

[19] N. Wermuth and E. Scheidt. Algorithm AS 105: Fitting a covariance selection model to a matrix. Applied Statistics, 26(1):88-92, 1977.
