
Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do Not Distribute.

Chapter 28

Nonparametric Graphical Models

In this chapter we discuss some nonparametric methods for graphical modeling. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are often unrealistic. We discuss two approaches to building more flexible graphical models. One allows arbitrary graphs and a nonparametric extension of the Gaussian; the other uses kernel density estimation and restricts the graphs to trees and forests.

28.1 Introduction

This chapter presents two methods for constructing nonparametric graphical models for continuous data. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields or log-linear models are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are typically unrealistic. Yet few practical alternatives to the Gaussian graphical model exist, particularly for high dimensional data. We discuss two approaches to building more flexible graphical models that exploit sparsity. These two approaches are at different extremes in the array of choices available. One allows arbitrary graphs, but makes a distributional restriction through the use of copulas; this is a semiparametric extension of the Gaussian. The other approach uses kernel density estimation and restricts the graphs to trees and forests; in this case the model is fully nonparametric, at the expense of structural restrictions. We describe two-step estimation methods for both approaches. We also outline some statistical theory for the methods, and compare them in some examples. The primary references for this material are Liu et al. (2009) and Liu et al. (2011).

The methods we present here are relatively simple, and many more possibilities remain for nonparametric graphical modeling. But one of the main messages of this chapter is that a little nonparametricity can go a long way.

28.2 Two Families of Nonparametric Graphical Models

The graph of a random vector is a useful way of exploring the underlying distribution. Recall that if $X = (X_1, \ldots, X_d)$ is a random vector with distribution $P$, then the undirected graph $G = (V, E)$ corresponding to $P$ consists of a vertex set $V$ and an edge set $E$, where $V$ has $d$ elements, one for each variable $X_i$. The edge between $(i, j)$ is excluded from $E$ if and only if $X_i$ is independent of $X_j$ given the other variables $X_{\setminus\{i,j\}} \equiv (X_s : 1 \le s \le d,\ s \ne i, j)$, written

$$X_i \perp\!\!\!\perp X_j \mid X_{\setminus\{i,j\}}. \qquad (28.1)$$

The general form for a (strictly positive) probability density encoded by an undirected graph $G$ is

$$p(x) = \frac{1}{Z(f)} \exp\Bigg( \sum_{C \in \mathrm{Cliques}(G)} f_C(x_C) \Bigg), \qquad (28.2)$$

where the sum is over all cliques, or fully connected subsets of vertices of the graph. In general, this is what we mean by a nonparametric graphical model. It is the graphical model analogue of the general nonparametric regression model. Model (28.2) has two main ingredients, the graph $G$ and the functions $\{f_C\}$. However, without further assumptions, it is much too general to be practical. The main difficulty in working with such a model is the normalizing constant $Z(f)$, which cannot, in general, be efficiently computed or approximated.

In the spirit of nonparametric estimation, we can seek to impose structure on either the graph or the functions $f_C$ in order to get a flexible and useful family of models. One approach parallels the ideas behind sparse additive models for regression. Specifically, we replace the random variable $X = (X_1, \ldots, X_d)$ by the transformed random variable $f(X) = (f_1(X_1), \ldots, f_d(X_d))$, and assume that $f(X)$ is multivariate Gaussian. This results in a nonparametric extension of the Normal that we call the nonparanormal distribution. The nonparanormal depends on the univariate functions $\{f_j\}$, and a mean $\mu$ and covariance matrix $\Sigma$, all of which are to be estimated from data. While the resulting family of distributions is much richer than the standard parametric Normal (the paranormal), the independence relations among the variables are still encoded in the precision matrix $\Omega = \Sigma^{-1}$, as we show below.

The second approach is to force the graphical structure to be a tree or forest, where each pair of vertices is connected by at most one path. Thus, we relax the distributional assumption of normality but we restrict the allowed family of undirected graphs. The complexity of the model is then regulated by selecting the edges to include, using cross validation.


                       nonparanormal                   forest densities
univariate marginals   nonparametric                   nonparametric
bivariate marginals    determined by Gaussian copula   nonparametric
graph                  unrestricted                    acyclic

Figure 28.1. Comparison of properties of the nonparanormal and forest-structured densities.

Figure 28.1 summarizes the tradeoffs made by these two families of models. The nonparanormal can be thought of as an extension of additive models for regression to graphical modeling. This requires estimating the univariate marginals; in the copula approach, this is done by estimating the functions $f_j(x) = \mu_j + \sigma_j \Phi^{-1}(F_j(x))$, where $F_j$ is the distribution function for variable $X_j$. After estimating each $f_j$, we transform to (assumed) jointly Normal via $Z = (f_1(X_1), \ldots, f_d(X_d))$ and then apply methods for Gaussian graphical models to estimate the graph. In this approach, the univariate marginals are fully nonparametric, and the sparsity of the model is regulated through the inverse covariance matrix, as for the graphical lasso, or "glasso" (Banerjee et al., 2008; Friedman et al., 2007).²⁷ The model is estimated in a two-stage procedure; first the functions $f_j$ are estimated, and then the inverse covariance matrix $\Omega$ is estimated. The high level relationship between linear regression models, Gaussian graphical models, and their extensions to additive and high dimensional models is summarized in Figure 28.2.

In the forest graph approach, we restrict the graph to be acyclic, and estimate the bivariate marginals $p(x_i, x_j)$ nonparametrically. In light of equation (28.27), this yields the full nonparametric family of graphical models having acyclic graphs. Here again, the estimation procedure is two-stage; first the marginals are estimated, and then the graph is estimated. Sparsity is regulated through the edges $(i, j)$ that are included in the forest.

Clearly these are just two tractable families within the very large space of possible nonparametric graphical models specified by equation (28.2). Many interesting research possibilities remain for novel nonparametric graphical models that make different assumptions; we discuss some possibilities in a concluding section. We now discuss details of these two model families, beginning with the nonparanormal.

²⁷Throughout the chapter we use the term graphical lasso, or glasso, coined by Friedman et al. (2007), to refer to the solution obtained by $\ell_1$-regularized log-likelihood under the Gaussian graphical model. This estimator goes back at least to Yuan and Lin (2007), and an iterative lasso algorithm for doing the optimization was first proposed by Banerjee et al. (2008). In our experiments we use the R packages glasso (Friedman et al., 2007) and huge to implement this algorithm.


assumptions     dimension   regression              graphical models
parametric      low         linear model            multivariate Normal
                high        lasso                   graphical lasso
nonparametric   low         additive model          nonparanormal
                high        sparse additive model   sparse nonparanormal

Figure 28.2. Comparison of regression and graphical models. The nonparanormal extends additive models to the graphical model setting. Regularizing the inverse covariance leads to an extension to high dimensions, which parallels sparse additive models for regression.

28.3 The Nonparanormal

We say that a random vector $X = (X_1, \ldots, X_d)^T$ has a nonparanormal distribution and write

$$X \sim \mathrm{NPN}(\mu, \Sigma, f)$$

in case there exist functions $\{f_j\}_{j=1}^d$ such that $Z \equiv f(X) \sim N(\mu, \Sigma)$, where $f(X) = (f_1(X_1), \ldots, f_d(X_d))$. When the $f_j$'s are monotone and differentiable, the joint probability density function of $X$ is given by

$$p_X(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (f(x) - \mu)^T \Sigma^{-1} (f(x) - \mu) \right\} \prod_{j=1}^d |f_j'(x_j)|, \qquad (28.3)$$

where the product term is a Jacobian.

Note that the density in (28.3) is not identifiable: we could scale each function by a constant, and scale the diagonal of $\Sigma$ in the same way, and not change the density. To make the family identifiable we demand that $f_j$ preserves marginal means and variances:

$$\mu_j = \mathbb{E}(Z_j) = \mathbb{E}(X_j) \quad \text{and} \quad \sigma_j^2 \equiv \Sigma_{jj} = \mathrm{Var}(Z_j) = \mathrm{Var}(X_j). \qquad (28.4)$$

These conditions depend only on $\mathrm{diag}(\Sigma)$, not on the full covariance matrix.

Now, let $F_j(x)$ denote the marginal distribution function of $X_j$. Since the component $f_j(X_j)$ is Gaussian, we have that

$$F_j(x) = P(X_j \le x) = P(Z_j \le f_j(x)) = \Phi\left( \frac{f_j(x) - \mu_j}{\sigma_j} \right),$$

which implies that

$$f_j(x) = \mu_j + \sigma_j \Phi^{-1}(F_j(x)). \qquad (28.5)$$

The form of the density in (28.3) implies that the conditional independence graph of the nonparanormal is encoded in $\Omega = \Sigma^{-1}$, as for the parametric Normal, since the density factors with respect to the graph of $\Omega$, and therefore obeys the global Markov property of the graph.


Figure 28.3. Densities of three 2-dimensional nonparanormals. The left plots have component functions of the form $f_\alpha(x) = \mathrm{sign}(x) |x|^\alpha$, with $\alpha_1 = 0.9$ and $\alpha_2 = 0.8$. The center plots have component functions of the form $g_\alpha(x) = \lfloor x \rfloor + 1/(1 + \exp(-\alpha(x - \lfloor x \rfloor - 1/2)))$, with $\alpha_1 = 10$ and $\alpha_2 = 5$, where $x - \lfloor x \rfloor$ is the fractional part. The right plots have component functions of the form $h_\alpha(x) = x + \sin(\alpha x)/\alpha$, with $\alpha_1 = 5$ and $\alpha_2 = 10$. In each case $\mu = (0, 0)$ and $\Sigma = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}$.

In fact, this is true for any choice of identification restrictions; thus, it is not necessary to estimate $\mu$ or $\sigma$ to estimate the graph, as the following result shows.

28.6 Lemma. Define

$$h_j(x) = \Phi^{-1}(F_j(x)) \qquad (28.7)$$

and let $\Lambda$ be the covariance matrix of $h(X)$. Then $X_j \perp\!\!\!\perp X_k \mid X_{\setminus\{j,k\}}$ if and only if $\Lambda^{-1}_{jk} = 0$.

Proof. We can rewrite the covariance matrix as

$$\Sigma_{jk} = \mathrm{Cov}(Z_j, Z_k) = \sigma_j \sigma_k \mathrm{Cov}(h_j(X_j), h_k(X_k)).$$

Hence $\Sigma = D \Lambda D$ and

$$\Sigma^{-1} = D^{-1} \Lambda^{-1} D^{-1},$$

where $D$ is the diagonal matrix with $\mathrm{diag}(D) = \sigma$. The zero pattern of $\Lambda^{-1}$ is therefore identical to the zero pattern of $\Sigma^{-1}$.

Figure 28.3 shows three examples of 2-dimensional nonparanormal densities. The component functions are taken to be from three different families of monotonic functions: one using power transforms, one using logistic transforms, and another using sinusoids:

$$f_\alpha(x) = \mathrm{sign}(x) |x|^\alpha$$
$$g_\alpha(x) = \lfloor x \rfloor + \frac{1}{1 + \exp\big( -\alpha (x - \lfloor x \rfloor - \tfrac{1}{2}) \big)}$$
$$h_\alpha(x) = x + \frac{\sin(\alpha x)}{\alpha}.$$

The covariance in each case is $\Sigma = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}$ and the mean is $\mu = (0, 0)$. It can be seen how the concavity and number of modes of the density can change with different nonlinearities. Clearly the nonparanormal family is much richer than the Normal family.

The assumption that $f(X) = (f_1(X_1), \ldots, f_d(X_d))$ is Normal leads to a semiparametric model where only one-dimensional functions need to be estimated. The monotonicity of the functions $f_j$, which map onto $\mathbb{R}$, is what enables computational tractability of the nonparanormal. For more general functions $f$, the normalizing constant for the density

$$p_X(x) \propto \exp\left\{ -\frac{1}{2} (f(x) - \mu)^T \Sigma^{-1} (f(x) - \mu) \right\} \qquad (28.8)$$

cannot be computed in closed form.

28.3.1 Connection to Copula

If $F_j$ is the distribution of $X_j$, then $U_j = F_j(X_j)$ is uniformly distributed on $(0, 1)$. Let $C$ denote the joint distribution function of $U = (U_1, \ldots, U_d)$, and let $F$ denote the distribution function of $X$. Then we have that

$$F(x_1, \ldots, x_d) = P(X_1 \le x_1, \ldots, X_d \le x_d) \qquad (28.9)$$
$$= P(F_1(X_1) \le F_1(x_1), \ldots, F_d(X_d) \le F_d(x_d)) \qquad (28.10)$$
$$= P(U_1 \le F_1(x_1), \ldots, U_d \le F_d(x_d)) \qquad (28.11)$$
$$= C(F_1(x_1), \ldots, F_d(x_d)). \qquad (28.12)$$

This is known as Sklar's theorem (Sklar, 1959), and $C$ is called a copula. If $c$ is the density function of $C$ then

$$p(x_1, \ldots, x_d) = c(F_1(x_1), \ldots, F_d(x_d)) \prod_{j=1}^d p(x_j), \qquad (28.13)$$

where $p(x_j)$ is the marginal density of $X_j$. For the nonparanormal we have

$$F(x_1, \ldots, x_d) = \Phi_{\mu,\Sigma}\big( \Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_d(x_d)) \big), \qquad (28.14)$$

where $\Phi_{\mu,\Sigma}$ is the multivariate Gaussian cdf and $\Phi$ is the univariate standard Gaussian cdf.

The Gaussian copula is usually expressed in terms of the correlation matrix, which is given by $R = \mathrm{diag}(\sigma)^{-1} \Sigma\, \mathrm{diag}(\sigma)^{-1}$. Note that the univariate marginal density for a Normal can be written as $p(x_j) = \frac{1}{\sigma_j} \phi(u_j)$ where $u_j = (x_j - \mu_j)/\sigma_j$. The multivariate Normal density can thus be expressed as

$$p_{\mu,\Sigma}(x_1, \ldots, x_d) = \frac{1}{(2\pi)^{d/2} |R|^{1/2} \prod_{j=1}^d \sigma_j} \exp\left( -\frac{1}{2} u^T R^{-1} u \right) \qquad (28.15)$$

$$= \frac{1}{|R|^{1/2}} \exp\left( -\frac{1}{2} u^T (R^{-1} - I) u \right) \prod_{j=1}^d \frac{\phi(u_j)}{\sigma_j}. \qquad (28.16)$$

Since the distribution $F_j$ of the $j$th variable satisfies $F_j(x_j) = \Phi((x_j - \mu_j)/\sigma_j) = \Phi(u_j)$, we have that $(X_j - \mu_j)/\sigma_j \stackrel{d}{=} \Phi^{-1}(F_j(X_j))$. The Gaussian copula density is thus

$$c(F_1(x_1), \ldots, F_d(x_d)) = \frac{1}{|R|^{1/2}} \exp\left\{ -\frac{1}{2} \Phi^{-1}(F(x))^T (R^{-1} - I)\, \Phi^{-1}(F(x)) \right\} \qquad (28.17)$$

where $\Phi^{-1}(F(x)) = (\Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_d(x_d)))$. This is seen to be equivalent to (28.3) using the chain rule and the identity

$$(\Phi^{-1})'(\eta) = \frac{1}{\phi(\Phi^{-1}(\eta))}. \qquad (28.18)$$

28.3.2 Estimation

Let $X^{(1)}, \ldots, X^{(n)}$ be a sample of size $n$, where $X^{(i)} = (X^{(i)}_1, \ldots, X^{(i)}_d)^T \in \mathbb{R}^d$. We'll design a two-step estimation procedure where first the functions $f_j$ are estimated, and then the inverse covariance matrix $\Omega$ is estimated, after transforming to approximately Normal.

In light of (28.7) we define

$$\hat{h}_j(x) = \Phi^{-1}\big( \tilde{F}_j(x) \big) \qquad (28.19)$$

where $\tilde{F}_j$ is an estimator of $F_j$. A natural candidate for $\tilde{F}_j$ is the marginal empirical distribution function

$$\hat{F}_j(t) \equiv \frac{1}{n} \sum_{i=1}^n \mathbf{1}\big\{ X^{(i)}_j \le t \big\}.$$

However, in this case $\hat{h}_j(x)$ blows up at the largest and smallest values of $X^{(i)}_j$. For the high dimensional setting where $n$ is small relative to $d$, an attractive alternative is to use a truncated or Winsorized²⁸ estimator:

$$\tilde{F}_j(x) = \begin{cases} \delta_n & \text{if } \hat{F}_j(x) < \delta_n \\ \hat{F}_j(x) & \text{if } \delta_n \le \hat{F}_j(x) \le 1 - \delta_n \\ 1 - \delta_n & \text{if } \hat{F}_j(x) > 1 - \delta_n, \end{cases} \qquad (28.20)$$

²⁸After Charles P. Winsor, the statistician whom John Tukey credited with his conversion from topology to statistics (Mallows, 1990).


where $\delta_n$ is a truncation parameter. There is a bias-variance tradeoff in choosing $\delta_n$; increasing $\delta_n$ increases the bias while it decreases the variance.
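As a concrete illustration, here is a minimal sketch of the Winsorized normal-score map in (28.19)–(28.20) (helper name is ours; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

def winsorized_normal_scores(x, delta):
    """Map a sample x to Phi^{-1}(F~_j(x)), with the truncation in (28.20)."""
    n = len(x)
    # Empirical CDF evaluated at the observations: rank / n.
    F_hat = (np.argsort(np.argsort(x)) + 1.0) / n
    # Winsorize: clip into [delta, 1 - delta] so Phi^{-1} stays finite.
    F_tilde = np.clip(F_hat, delta, 1.0 - delta)
    return norm.ppf(F_tilde)
```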

Given this estimate of the distribution of variable $X_j$, we then estimate the transformation function $f_j$ by

$$\tilde{f}_j(x) \equiv \hat{\mu}_j + \hat{\sigma}_j \tilde{h}_j(x) \qquad (28.21)$$

where

$$\tilde{h}_j(x) = \Phi^{-1}\big( \tilde{F}_j(x) \big) \qquad (28.22)$$

and $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the sample mean and standard deviation:

$$\hat{\mu}_j \equiv \frac{1}{n} \sum_{i=1}^n X^{(i)}_j \quad \text{and} \quad \hat{\sigma}_j = \sqrt{ \frac{1}{n} \sum_{i=1}^n \big( X^{(i)}_j - \hat{\mu}_j \big)^2 }.$$

Now, let $S_n(\tilde{f})$ be the sample covariance matrix of $\tilde{f}(X^{(1)}), \ldots, \tilde{f}(X^{(n)})$; that is,

$$S_n(\tilde{f}) \equiv \frac{1}{n} \sum_{i=1}^n \big( \tilde{f}(X^{(i)}) - \mu_n(\tilde{f}) \big) \big( \tilde{f}(X^{(i)}) - \mu_n(\tilde{f}) \big)^T \qquad (28.23)$$

$$\mu_n(\tilde{f}) \equiv \frac{1}{n} \sum_{i=1}^n \tilde{f}(X^{(i)}).$$

We then estimate $\Omega$ using $S_n(\tilde{f})$. For instance, the maximum likelihood estimator is

$$\hat{\Omega}^{\mathrm{MLE}}_n = S_n(\tilde{f})^{-1}.$$

The $\ell_1$-regularized estimator is

$$\hat{\Omega}_n = \arg\min_{\Omega} \Big\{ \mathrm{tr}\big( \Omega\, S_n(\tilde{f}) \big) - \log |\Omega| + \lambda \|\Omega\|_1 \Big\} \qquad (28.24)$$

where $\lambda$ is a regularization parameter, and $\|\Omega\|_1 = \sum_{j=1}^d \sum_{k=1}^d |\Omega_{jk}|$. The estimated graph is then $\hat{E}_n = \{(j, k) : \hat{\Omega}_{jk} \ne 0\}$.

Thus, we use a two-step procedure to estimate the graph.

1. Replace the observations, for each variable, by their respective Normal scores, subject to a Winsorized truncation.

2. Apply the graphical lasso to the transformed data to estimate the undirected graph.

The first step is non-iterative and computationally efficient. The truncation parameter $\delta_n$ is chosen to be

$$\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}} \qquad (28.25)$$

and does not need to be tuned. As will be shown in Theorem 28.26, such a choice makes the nonparanormal amenable to theoretical analysis.


28.3.3 Statistical Properties of $S_n(\tilde{f})$

The main technical result is an analysis of the covariance of the Winsorized estimator above. In particular, we show that under appropriate conditions,

$$\max_{j,k} \big| S_n(\tilde{f})_{jk} - S_n(f)_{jk} \big| = O_P\left( \sqrt{ \frac{\log d + \log^2 n}{n^{1/2}} } \right)$$

where $S_n(\tilde{f})_{jk}$ denotes the $(j, k)$ entry of the matrix $S_n(\tilde{f})$. This result allows us to leverage the significant body of theory on the graphical lasso (Rothman et al., 2008; Ravikumar et al., 2009b), which we apply in step two.

28.26 Theorem. Suppose that $d = n^\xi$ and let $\tilde{f}$ be the Winsorized estimator defined in (28.21) with $\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}$. Define

$$C(M, \xi) \equiv 48 \sqrt{\pi \xi}\, \big( \sqrt{2M} - 1 \big) (M + 2)$$

for $M, \xi > 0$. Then for any $\epsilon \ge C(M, \xi) \sqrt{ \dfrac{\log d + \log^2 n}{n^{1/2}} }$ and sufficiently large $n$, we have

$$P\left( \max_{jk} \big| S_n(\tilde{f})_{jk} - S_n(f)_{jk} \big| > \epsilon \right) \le \frac{c_1 d}{(n \epsilon^2)^{2\xi}} + \frac{c_2 d}{n^{M\xi - 1}} + c_3 \exp\left( -\frac{c_4 n^{1/2} \epsilon^2}{\log d + \log^2 n} \right),$$

where $c_1, c_2, c_3, c_4$ are positive constants.

The proof of this result involves a detailed Gaussian tail analysis, and is given in Liu et al. (2009).

Using Theorem 28.26 and the results of Rothman et al. (2008), it can then be shown that the precision matrix is estimated at the following rates in the Frobenius norm and the $\ell_2$-operator norm:

$$\| \hat{\Omega}_n - \Omega_0 \|_F = O_P\left( \sqrt{ \frac{(s + d) \log d + \log^2 n}{n^{1/2}} } \right)$$

and

$$\| \hat{\Omega}_n - \Omega_0 \|_2 = O_P\left( \sqrt{ \frac{s \log d + \log^2 n}{n^{1/2}} } \right),$$


where

$$s \equiv \mathrm{Card}\big( \{ (i, j) \in \{1, \ldots, d\} \times \{1, \ldots, d\} \mid \Omega_0(i, j) \ne 0,\ i \ne j \} \big)$$

is the number of nonzero off-diagonal elements of the true precision matrix.

Using the results of Ravikumar et al. (2009b), it can also be shown, under appropriate conditions, that the sparsity pattern of the precision matrix is estimated accurately with high probability. In particular, the nonparanormal estimator $\hat{\Omega}_n$ satisfies

$$P\big( \mathcal{G}( \hat{\Omega}_n, \Omega_0 ) \big) \ge 1 - o(1)$$

where $\mathcal{G}(\hat{\Omega}_n, \Omega_0)$ is the event

$$\big\{ \mathrm{sign}\big( \hat{\Omega}_n(j, k) \big) = \mathrm{sign}\big( \Omega_0(j, k) \big),\ \forall j, k \in \{1, \ldots, d\} \big\}.$$

We refer to Liu et al. (2009) for the details of the conditions and proofs.

28.4 Forest Density Estimation

We now describe a very different, but equally flexible and useful approach. Rather than assuming a transformation to normality and an arbitrary undirected graph, we restrict the graph to be a tree or forest, but allow arbitrary nonparametric distributions.

Let $p^*(x)$ be a probability density with respect to Lebesgue measure $\mu(\cdot)$ on $\mathbb{R}^d$ and let $X^{(1)}, \ldots, X^{(n)}$ be $n$ independent identically distributed $\mathbb{R}^d$-valued data vectors sampled from $p^*(x)$, where $X^{(i)} = (X^{(i)}_1, \ldots, X^{(i)}_d)$. Let $\mathcal{X}_j$ denote the range of $X^{(i)}_j$ and let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$.

A graph is a forest if it is acyclic. If $F$ is a $d$-node undirected forest with vertex set $V_F = \{1, \ldots, d\}$ and edge set $E_F \subset \{1, \ldots, d\} \times \{1, \ldots, d\}$, the number of edges satisfies $|E_F| < d$. We say that a probability density function $p(x)$ is supported by a forest $F$ if the density can be written as

$$p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k \in V_F} p(x_k), \qquad (28.27)$$

where each $p(x_i, x_j)$ is a bivariate density on $\mathcal{X}_i \times \mathcal{X}_j$, and each $p(x_k)$ is a univariate density on $\mathcal{X}_k$.

Let $\mathcal{F}_d$ be the family of forests with $d$ nodes, and let $\mathcal{P}_d$ be the corresponding family of densities:

$$\mathcal{P}_d = \left\{ p \ge 0 : \int_{\mathcal{X}} p(x)\, d\mu(x) = 1, \text{ and } p(x) \text{ satisfies (28.27) for some } F \in \mathcal{F}_d \right\}. \qquad (28.28)$$


Define the oracle forest density

$$q^* = \arg\min_{q \in \mathcal{P}_d} D(p^* \,\|\, q) \qquad (28.29)$$

where the Kullback-Leibler divergence $D(p \,\|\, q)$ between two densities $p$ and $q$ is

$$D(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx, \qquad (28.30)$$

under the convention that $0 \log(0/q) = 0$, and $p \log(p/0) = \infty$ for $p \ne 0$. The following is straightforward to prove.

28.31 Proposition. Let $q^*$ be defined as in (28.29). There exists a forest $F^* \in \mathcal{F}_d$ such that

$$q^* = p^*_{F^*} = \prod_{(i,j) \in E_{F^*}} \frac{p^*(x_i, x_j)}{p^*(x_i)\, p^*(x_j)} \prod_{k \in V_{F^*}} p^*(x_k) \qquad (28.32)$$

where $p^*(x_i, x_j)$ and $p^*(x_i)$ are the bivariate and univariate marginal densities of $p^*$.

For any density $q(x)$, the negative log-likelihood risk $R(q)$ is defined as

$$R(q) = -\mathbb{E} \log q(X) = -\int_{\mathcal{X}} p^*(x) \log q(x)\, dx. \qquad (28.33)$$

It is straightforward to see that the density $q^*$ defined in (28.29) also minimizes the negative log-likelihood loss:

$$q^* = \arg\min_{q \in \mathcal{P}_d} D(p^* \,\|\, q) = \arg\min_{q \in \mathcal{P}_d} R(q). \qquad (28.34)$$

We thus define the oracle risk as $R^* = R(q^*)$. Using Proposition 28.31 and equation (28.27), we have

$$R^* = R(q^*) = R(p^*_{F^*}) = -\int_{\mathcal{X}} p^*(x) \left( \sum_{(i,j) \in E_{F^*}} \log \frac{p^*(x_i, x_j)}{p^*(x_i)\, p^*(x_j)} + \sum_{k \in V_{F^*}} \log p^*(x_k) \right) dx$$
$$= -\sum_{(i,j) \in E_{F^*}} I(X_i; X_j) + \sum_{k \in V_{F^*}} H(X_k), \qquad (28.35)$$

where

$$I(X_i; X_j) = \int_{\mathcal{X}_i \times \mathcal{X}_j} p^*(x_i, x_j) \log \frac{p^*(x_i, x_j)}{p^*(x_i)\, p^*(x_j)}\, dx_i\, dx_j \qquad (28.36)$$

is the mutual information between the pair of variables $X_i, X_j$ and

$$H(X_k) = -\int_{\mathcal{X}_k} p^*(x_k) \log p^*(x_k)\, dx_k \qquad (28.37)$$

is the entropy.


28.4.1 A Two-Step Procedure

If the true density $p^*(x)$ were known, by Proposition 28.31, the density estimation problem would be reduced to finding the best forest structure $F^*_d$, satisfying

$$F^*_d = \arg\min_{F \in \mathcal{F}_d} R(p^*_F) = \arg\min_{F \in \mathcal{F}_d} D(p^* \,\|\, p^*_F). \qquad (28.38)$$

The optimal forest $F^*_d$ can be found by minimizing the right hand side of (28.35). Since the entropy term $H(X) = \sum_k H(X_k)$ is constant across all forests, this can be recast as the problem of finding the maximum weight spanning forest for a weighted graph, where the weight of the edge connecting nodes $i$ and $j$ is $I(X_i; X_j)$. Kruskal's algorithm (Kruskal, 1956) is a greedy algorithm that is guaranteed to find a maximum weight spanning tree of a weighted graph. In the setting of density estimation, this procedure was proposed by Chow and Liu (1968) as a way of constructing a tree approximation to a distribution. At each stage the algorithm adds an edge connecting that pair of variables with maximum mutual information among all pairs not yet visited by the algorithm, if doing so does not form a cycle. When stopped early, after $k < d - 1$ edges have been added, it yields the best $k$-edge weighted forest.

Of course, the above procedure is not practical since the true density $p^*(x)$ is unknown. We replace the population mutual information $I(X_i; X_j)$ in (28.35) by a plug-in estimate $\hat{I}_n(X_i; X_j)$, defined as

$$\hat{I}_n(X_i; X_j) = \int_{\mathcal{X}_i \times \mathcal{X}_j} \hat{p}_n(x_i, x_j) \log \frac{\hat{p}_n(x_i, x_j)}{\hat{p}_n(x_i)\, \hat{p}_n(x_j)}\, dx_i\, dx_j \qquad (28.39)$$

where $\hat{p}_n(x_i, x_j)$ and $\hat{p}_n(x_i)$ are bivariate and univariate kernel density estimates. Given this estimated mutual information matrix $\hat{M}_n = \big[ \hat{I}_n(X_i; X_j) \big]$, we can then apply Kruskal's algorithm (equivalently, the Chow-Liu algorithm) to find the best tree structure $\hat{F}_n$.

Since the number of edges of $\hat{F}_n$ controls the number of degrees of freedom in the final density estimator, an automatic data-dependent way to choose it is needed. We adopt the following two-stage procedure. First, we randomly split the data into two sets $\mathcal{D}_1$ and $\mathcal{D}_2$ of sizes $n_1$ and $n_2$; we then apply the following steps:

1. Using $\mathcal{D}_1$, construct kernel density estimates of the univariate and bivariate marginals and calculate $\hat{I}_{n_1}(X_i; X_j)$ for $i, j \in \{1, \ldots, d\}$ with $i \ne j$. Construct a full tree $\hat{F}^{(d-1)}_{n_1}$ with $d - 1$ edges, using the Chow-Liu algorithm.

2. Using $\mathcal{D}_2$, prune the tree $\hat{F}^{(d-1)}_{n_1}$ to find a forest $\hat{F}^{(\hat{k})}_{n_1}$ with $\hat{k}$ edges, for $0 \le \hat{k} \le d - 1$.

Once $\hat{F}^{(\hat{k})}_{n_1}$ is obtained in Step 2, we can calculate $\hat{p}_{\hat{F}^{(\hat{k})}_{n_1}}$ according to (28.27), using the kernel density estimates constructed in Step 1.


Step 1: Constructing a sequence of forests

Step 1 is carried out on the dataset $\mathcal{D}_1$. Let $K(\cdot)$ be a univariate kernel function. Given an evaluation point $(x_i, x_j)$, the bivariate kernel density estimate for $(X_i, X_j)$ based on the observations $\{X^{(s)}_i, X^{(s)}_j\}_{s \in \mathcal{D}_1}$ is defined as

$$\hat{p}_{n_1}(x_i, x_j) = \frac{1}{n_1} \sum_{s \in \mathcal{D}_1} \frac{1}{h_2^2} K\left( \frac{X^{(s)}_i - x_i}{h_2} \right) K\left( \frac{X^{(s)}_j - x_j}{h_2} \right), \qquad (28.40)$$

where we use a product kernel with $h_2 > 0$ as the bandwidth parameter. The univariate kernel density estimate $\hat{p}_{n_1}(x_k)$ for $X_k$ is

$$\hat{p}_{n_1}(x_k) = \frac{1}{n_1} \sum_{s \in \mathcal{D}_1} \frac{1}{h_1} K\left( \frac{X^{(s)}_k - x_k}{h_1} \right), \qquad (28.41)$$

where $h_1 > 0$ is the univariate bandwidth.

We assume that the data lie in a $d$-dimensional unit cube $\mathcal{X} = [0, 1]^d$. To calculate the empirical mutual information $\hat{I}_{n_1}(X_i; X_j)$, we need to numerically evaluate a two-dimensional integral. To do so, we calculate the kernel density estimates on a grid of points. We choose $m$ evaluation points on each dimension, $x_{1i} < x_{2i} < \cdots < x_{mi}$ for the $i$th variable. The mutual information $\hat{I}_{n_1}(X_i; X_j)$ is then approximated as

$$\hat{I}_{n_1}(X_i; X_j) = \frac{1}{m^2} \sum_{k=1}^m \sum_{\ell=1}^m \hat{p}_{n_1}(x_{ki}, x_{\ell j}) \log \frac{\hat{p}_{n_1}(x_{ki}, x_{\ell j})}{\hat{p}_{n_1}(x_{ki})\, \hat{p}_{n_1}(x_{\ell j})}. \qquad (28.42)$$

The approximation error can be made arbitrarily small by choosing $m$ sufficiently large. As a practical concern, care needs to be taken that the factors $\hat{p}_{n_1}(x_{ki})$ and $\hat{p}_{n_1}(x_{\ell j})$ in the denominator are not too small; a truncation procedure can be used to ensure this. Once the $d \times d$ mutual information matrix $\hat{M}_{n_1} = \big[ \hat{I}_{n_1}(X_i; X_j) \big]$ is obtained, we can apply the Chow-Liu (Kruskal) algorithm to find a maximum weight spanning tree.


Tree Construction (Kruskal/Chow-Liu)

Input: Data set $\mathcal{D}_1$ and the bandwidths $h_1, h_2$.

Initialize: Calculate $\hat{M}_{n_1}$ according to (28.40), (28.41), and (28.42). Set $E^{(0)} = \emptyset$.

For $k = 1, \ldots, d - 1$:

1. Set $(i(k), j(k)) \leftarrow \arg\max_{(i,j)} \hat{M}_{n_1}(i, j)$ such that $E^{(k-1)} \cup \{(i(k), j(k))\}$ does not contain a cycle;

2. $E^{(k)} \leftarrow E^{(k-1)} \cup \{(i(k), j(k))\}$.

Output: tree $\hat{F}^{(d-1)}_{n_1}$ with edge set $E^{(d-1)}$.

Step 2: Selecting a forest size

The full tree $\hat{F}^{(d-1)}_{n_1}$ obtained in Step 1 might have high variance when the dimension $d$ is large, leading to overfitting in the density estimate. In order to reduce the variance, we prune the tree; that is, we choose an unconnected tree with $k$ edges. The number of edges $k$ is a tuning parameter that induces a bias-variance tradeoff.

In order to choose $k$, note that in stage $k$ of the Chow-Liu algorithm we have an edge set $E^{(k)}$ (in the notation of Algorithm 28.4.1) which corresponds to a forest $\hat{F}^{(k)}_{n_1}$ with $k$ edges, where $\hat{F}^{(0)}_{n_1}$ is the union of $d$ disconnected nodes. To select $k$, we cross-validate over the $d$ forests $\hat{F}^{(0)}_{n_1}, \hat{F}^{(1)}_{n_1}, \ldots, \hat{F}^{(d-1)}_{n_1}$.

Let $\hat{p}_{n_2}(x_i, x_j)$ and $\hat{p}_{n_2}(x_k)$ be defined as in (28.40) and (28.41), but now evaluated solely based on the held-out data in $\mathcal{D}_2$. For a density $p_F$ that is supported by a forest $F$, we define the held-out negative log-likelihood risk as

$$\hat{R}_{n_2}(p_F) = -\sum_{(i,j) \in E_F} \int_{\mathcal{X}_i \times \mathcal{X}_j} \hat{p}_{n_2}(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\, dx_i\, dx_j - \sum_{k \in V_F} \int_{\mathcal{X}_k} \hat{p}_{n_2}(x_k) \log p(x_k)\, dx_k. \qquad (28.43)$$

The selected forest is then $\hat{F}^{(\hat{k})}_{n_1}$ where

$$\hat{k} = \arg\min_{k \in \{0, \ldots, d-1\}} \hat{R}_{n_2}\big( \hat{p}_{\hat{F}^{(k)}_{n_1}} \big) \qquad (28.44)$$

and where $\hat{p}_{\hat{F}^{(k)}_{n_1}}$ is computed using the density estimate $\hat{p}_{n_1}$ constructed on $\mathcal{D}_1$.


We can also estimate $\hat{k}$ as

$$\hat{k} = \arg\max_{k \in \{0, \ldots, d-1\}} \frac{1}{n_2} \sum_{s \in \mathcal{D}_2} \log \left( \prod_{(i,j) \in E_{F^{(k)}}} \frac{\hat{p}_{n_1}(X^{(s)}_i, X^{(s)}_j)}{\hat{p}_{n_1}(X^{(s)}_i)\, \hat{p}_{n_1}(X^{(s)}_j)} \prod_{\ell \in V_{F^{(k)}}} \hat{p}_{n_1}(X^{(s)}_\ell) \right) \qquad (28.45)$$

$$= \arg\max_{k \in \{0, \ldots, d-1\}} \frac{1}{n_2} \sum_{s \in \mathcal{D}_2} \log \left( \prod_{(i,j) \in E_{F^{(k)}}} \frac{\hat{p}_{n_1}(X^{(s)}_i, X^{(s)}_j)}{\hat{p}_{n_1}(X^{(s)}_i)\, \hat{p}_{n_1}(X^{(s)}_j)} \right). \qquad (28.46)$$

This minimization can be efficiently carried out by iterating over the $d - 1$ edges in $\hat{F}^{(d-1)}_{n_1}$. Once $\hat{k}$ is obtained, the final forest-based kernel density estimate is given by

$$\hat{p}_n(x) = \prod_{(i,j) \in E^{(\hat{k})}} \frac{\hat{p}_{n_1}(x_i, x_j)}{\hat{p}_{n_1}(x_i)\, \hat{p}_{n_1}(x_j)} \prod_k \hat{p}_{n_1}(x_k). \qquad (28.47)$$

Another alternative is to compute a maximum weight spanning forest, using Kruskal's algorithm, but with held-out edge weights

$$\hat{w}_{n_2}(i, j) = \frac{1}{n_2} \sum_{s \in \mathcal{D}_2} \log \frac{\hat{p}_{n_1}(X^{(s)}_i, X^{(s)}_j)}{\hat{p}_{n_1}(X^{(s)}_i)\, \hat{p}_{n_1}(X^{(s)}_j)}. \qquad (28.48)$$

In fact, asymptotically (as $n_2 \to \infty$) this gives the optimal tree-based estimator constructed in terms of the kernel density estimates $\hat{p}_{n_1}$.

28.4.2 Statistical Properties

The statistical properties of the forest density estimator can be analyzed under the same type of assumptions that are made for classical kernel density estimation. In particular, assume that the univariate and bivariate densities lie in a Hölder class with exponent $\beta$. Under this assumption the minimax rate of convergence in the squared error loss is $O(n^{-\beta/(\beta+1)})$ for bivariate densities and $O(n^{-2\beta/(2\beta+1)})$ for univariate densities. Technical assumptions on the kernel yield $L_\infty$ concentration results for kernel density estimation (Giné and Guillou, 2002).

Choose the bandwidths $h_1$ and $h_2$ to be used in the one-dimensional and two-dimensional kernel density estimates according to

$$h_1 \asymp \left( \frac{\log n}{n} \right)^{\frac{1}{1 + 2\beta}} \qquad (28.49)$$

$$h_2 \asymp \left( \frac{\log n}{n} \right)^{\frac{1}{2 + 2\beta}}. \qquad (28.50)$$

This choice of bandwidths ensures the optimal rate of convergence. Let $\mathcal{P}^{(k)}_d$ be the family of $d$-dimensional densities that are supported by forests with at most $k$ edges. Then

$$\mathcal{P}^{(0)}_d \subset \mathcal{P}^{(1)}_d \subset \cdots \subset \mathcal{P}^{(d-1)}_d. \qquad (28.51)$$


Due to this nesting property,

$$\inf_{q_F \in \mathcal{P}^{(0)}_d} R(q_F) \ge \inf_{q_F \in \mathcal{P}^{(1)}_d} R(q_F) \ge \cdots \ge \inf_{q_F \in \mathcal{P}^{(d-1)}_d} R(q_F). \qquad (28.52)$$

This means that a full spanning tree would generally be selected if we had access to the true distribution. However, with access to finite data to estimate the densities ($\hat{p}_{n_1}$), the optimal procedure is to use fewer than $d - 1$ edges. The following result analyzes the excess risk resulting from selecting the forest based on the held-out risk $\hat{R}_{n_2}$.

28.53 Theorem. Let $\hat{p}_{\hat{F}^{(k)}_d}$ be the estimate with $|E_{\hat{F}^{(k)}_d}| = k$ obtained after the first $k$ iterations of the Chow-Liu algorithm. Then under (omitted) technical assumptions on the densities and kernel, for any $1 \le k \le d - 1$,

$$R(\hat{p}_{\hat{F}^{(k)}_d}) - \inf_{q_F \in \mathcal{P}^{(k)}_d} R(q_F) = O_P\left( k \sqrt{ \frac{\log n + \log d}{n^{\beta/(1+\beta)}} } + d \sqrt{ \frac{\log n + \log d}{n^{2\beta/(1+2\beta)}} } \right) \qquad (28.54)$$

and

$$R(\hat{p}_{\hat{F}^{(\hat{k})}_d}) - \min_{0 \le k \le d-1} R(\hat{p}_{\hat{F}^{(k)}_d}) = O_P\left( (k^* + \hat{k}) \sqrt{ \frac{\log n + \log d}{n^{\beta/(1+\beta)}} } + d \sqrt{ \frac{\log n + \log d}{n^{2\beta/(1+2\beta)}} } \right) \qquad (28.55)$$

where $\hat{k} = \arg\min_{0 \le k \le d-1} \hat{R}_{n_2}(\hat{p}_{\hat{F}^{(k)}_d})$ and $k^* = \arg\min_{0 \le k \le d-1} R(\hat{p}_{\hat{F}^{(k)}_d})$.

The main work in proving this result lies in establishing bounds such as

$$\sup_{F \in \mathcal{F}^{(k)}_d} \big| R(\hat{p}_F) - \hat{R}_{n_2}(\hat{p}_F) \big| = O_P\big( \phi_n(k) + \psi_n(d) \big) \qquad (28.56)$$

where $\hat{R}_{n_2}$ is the held-out risk, under the notation

$$\phi_n(k) = k \sqrt{ \frac{\log n + \log d}{n^{\beta/(\beta+1)}} } \qquad (28.57)$$

$$\psi_n(d) = d \sqrt{ \frac{\log n + \log d}{n^{2\beta/(1+2\beta)}} }. \qquad (28.58)$$

For the proof of this and related results, see Liu et al. (2011). Using this, one easily obtains

$$R(\hat{p}_{\hat{F}^{(\hat{k})}_d}) - R(\hat{p}_{\hat{F}^{(k^*)}_d}) = R(\hat{p}_{\hat{F}^{(\hat{k})}_d}) - \hat{R}_{n_2}(\hat{p}_{\hat{F}^{(\hat{k})}_d}) + \hat{R}_{n_2}(\hat{p}_{\hat{F}^{(\hat{k})}_d}) - R(\hat{p}_{\hat{F}^{(k^*)}_d}) \qquad (28.59)$$
$$= O_P\big( \phi_n(\hat{k}) + \psi_n(d) \big) + \hat{R}_{n_2}(\hat{p}_{\hat{F}^{(\hat{k})}_d}) - R(\hat{p}_{\hat{F}^{(k^*)}_d}) \qquad (28.60)$$
$$\le O_P\big( \phi_n(\hat{k}) + \psi_n(d) \big) + \hat{R}_{n_2}(\hat{p}_{\hat{F}^{(k^*)}_d}) - R(\hat{p}_{\hat{F}^{(k^*)}_d}) \qquad (28.61)$$
$$= O_P\big( \phi_n(\hat{k}) + \phi_n(k^*) + \psi_n(d) \big), \qquad (28.62)$$

where (28.61) follows from the fact that $\hat{k}$ is the minimizer of $\hat{R}_{n_2}(\cdot)$.

Note that this result allows the dimension $d$ to increase at a rate $o\big( \sqrt{n^{2\beta/(1+2\beta)} / \log n} \big)$ and the number of edges $k$ to increase at a rate $o\big( \sqrt{n^{\beta/(1+\beta)} / \log n} \big)$, with the excess risk still decreasing to zero asymptotically.

[Figure: Arabidopsis thaliana (source: wikipedia.org). Arabidopsis thaliana is a small flowering plant; it was the first plant genome to be sequenced, and its roughly 27,000 genes and 35,000 proteins have been actively studied. Here we consider a data set based on Affymetrix GeneChip microarrays with sample size n = 118, for which p = 40 genes have been selected for analysis.]

28.5 Examples

28.5.1 Gene-Gene Interaction Graphs

The nonparanormal and Gaussian graphical model can construct very different graphs. Here we consider a data set based on Affymetrix GeneChip microarrays for the plant Arabidopsis thaliana (Wille et al., 2004). The sample size is n = 118. The expression levels for each chip are pre-processed by log-transformation and standardization. A subset of 40 genes from the isoprenoid pathway is chosen for analysis.

While these data are often treated as multivariate Gaussian, the nonparanormal and the glasso give very different graphs over a wide range of regularization parameters, suggesting that the nonparametric method could lead to different biological conclusions.

The regularization paths of the two methods are compared in Figure 28.4. To generate the paths, we select 50 regularization parameters on an evenly spaced grid in the interval [0.16, 1.2]. Although the paths for the two methods look similar, there are some subtle differences. In particular, variables become nonzero in a different order.

[Figure 28.4. Regularization paths of both methods on the microarray data set (left: glasso path; right: nonparanormal path; coefficients plotted against lambda). Although the paths for the two methods look similar, there are some subtle differences.]

Figure 28.5 compares the estimated graphs for the two methods at several values of the regularization parameter $\lambda$ in the range [0.16, 0.37]. For each $\lambda$, we show the estimated graph from the nonparanormal in the first column. In the second column we show the graph obtained by scanning the full regularization path of the glasso fit and finding the graph having the smallest symmetric difference with the nonparanormal graph. The symmetric difference graph is shown in the third column. The closest glasso fit is different, with edges selected by the glasso not selected by the nonparanormal, and vice-versa. The estimated transformation functions for several genes are shown in Figure 28.6, which show non-Gaussian behavior.

Since the graphical lasso typically results in a large parameter bias as a consequence of the $\ell_1$ regularization, it sometimes makes sense to use the refit glasso, which is a two-step procedure: in the first step, a sparse inverse covariance matrix is obtained by the graphical lasso; in the second step, a Gaussian model is refit without $\ell_1$ regularization, but enforcing the sparsity pattern obtained in the first step.

Figure 28.7 compares forest density estimation to the graphical lasso and refit glasso. It can be seen that the forest-based kernel density estimator has better generalization performance. This is not surprising, given that the true distribution of the data is not Gaussian. (Note that since we do not directly compute the marginal univariate densities in the nonparanormal, we are unable to compute likelihoods under this model.) The held-out log-likelihood curve for forest density estimation achieves a maximum when there are only 35 edges in the model. In contrast, the held-out log-likelihood curves of the glasso and refit glasso achieve maxima when there are around 280 edges and 100 edges respectively, while their predictive estimates are still inferior to those of the forest-based kernel density estimator. Figure 28.7 also shows the estimated graphs for the forest-based kernel density estimator and the graphical lasso. The graphs are automatically selected based on held-out log-likelihood, and are clearly different.


[Figure 28.5. The nonparanormal estimated graph for three values of $\lambda$ = 0.2448, 0.2661, 0.30857 (left column), the closest glasso estimated graph from the full path (middle), and the symmetric difference graph (right).]

[Figure 28.6. Estimated transformation functions for four genes in the microarray data set, indicating non-Gaussian marginals. The corresponding genes are among the nodes appearing in the symmetric difference graphs above.]


[Figure 28.7. Results on microarray data. Top: held-out log-likelihood of the forest density estimator (black step function), glasso (red stars), and refit glasso (blue circles), plotted against the number of edges. Bottom: estimated graphs using the forest-based estimator (left) and the glasso (right), using the same node layout.]

28.5.2 Graphs for Equities Data

For the examples in this section we collected stock price data from Yahoo! Finance (finance.yahoo.com). The daily closing prices were obtained for 452 stocks that were consistently in the S&P 500 index between January 1, 2003 and January 1, 2011. This gave us altogether 2,015 data points; each data point corresponds to the vector of closing prices on a trading day. With $S_{t,j}$ denoting the closing price of stock $j$ on day $t$, we consider the variables $X_{tj} = \log(S_{t,j}/S_{t-1,j})$ and build graphs over the indices $j$. We simply treat the instances $X_t$ as independent replicates, even though they form a time series. We Winsorize (or truncate) every stock so that its data points are within six times the mean absolute deviation from the sample average. In Figure 28.8(a) we show boxplots for 10 randomly chosen stocks. It can be seen that the data contain outliers even after Winsorization; the reasons for these outliers include splits in a stock, which increase the number of shares. In Figure 28.8(b) we show the boxplots of the data after the nonparanormal transformation. We show below how removing outliers is important for forest density estimation. In the results shown below, we use the subset of the data between January 1, 2003 and January 1, 2008, before the onset of the "financial crisis." It is interesting to compare to results that include data after 2008, but we omit these for brevity.

[Figure 28.8. Boxplots of $X_t = \log(S_t/S_{t-1})$ for 10 stocks (AAPL, BAC, AMAT, CERN, CLX, MSFT, IBM, JPM, UPS, YHOO): (a) original data, (b) after the nonparanormal transformation. As can be seen, the original data has many outliers, which is addressed by the nonparanormal transformation on the re-scaled data (right).]

The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors: Consumer Discretionary (70 stocks), Consumer Staples (35 stocks), Energy (37 stocks), Financials (74 stocks), Health Care (46 stocks), Industrials (59 stocks), Information Technology (64 stocks), Materials (29 stocks), Telecommunications Services (6 stocks), and Utilities (32 stocks). It is expected that stocks from the same GICS sector should tend to be clustered together, since stocks from the same GICS sector tend to interact more with each other. In the graphs shown below, the nodes are colored according to the GICS sector of the corresponding stock.

In Figures 28.9(a)-(c) we show graphs estimated using the glasso, nonparanormal, and forest density estimator on the data from January 1, 2003 to January 1, 2008. There are altogether n = 1,257 data points and d = 452 dimensions. To estimate the nonparanormal graph, we adopt a variant of the stability selection method proposed by Meinshausen and Bühlmann (2010). More specifically, let $\lambda_{\max}$ be the smallest tuning parameter $\lambda$ such that the estimated nonparanormal graph using Equation (28.24) is empty, and let $\tilde{\lambda} = 0.1 \lambda_{\max}$. We randomly sample 50 sub-datasets, each containing $B = \lfloor 10\sqrt{n} \rfloor = 320$ data points. On each of these 50 subsampled datasets, we estimate a nonparanormal graph using (28.24) with $\lambda = \tilde{\lambda}$. In the final nonparanormal graph shown in Figure 28.9(b), an edge is present only if it appears more than 95 percent of the time among the 50 subsampled datasets. Therefore the nonparanormal graph is in fact a stability graph; the graph has 642 edges.


[Figure 28.9. Graphs built on S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008, estimated using (a) the glasso (624 edges), (b) the nonparanormal (642 edges), and (c) forest density estimation (451 edges). The nodes are colored according to their GICS sector categories. Panel (d) shows the forest density estimation (FDE) graph (451 edges) obtained without transforming the original data to remove outliers.]

To estimate the glasso graph, we again take $\lambda'_{\max}$ to be the smallest tuning parameter such that the estimated glasso graph is empty, randomly subsample 50 datasets with block size B = 320, and fit a glasso graph using the tuning parameter $\tilde{\lambda}' = 0.1 \lambda'_{\max}$. We then plot all the edges whose frequency of occurrence is no smaller than a threshold $\rho \in [0, 1]$, where $\rho$ is chosen such that the total number of edges in the glasso graph is closest to the nonparanormal graph. The final estimated glasso graph has 624 edges and is shown in Figure 28.9(a).

[Figure 28.10. Visualizations of the differences between the estimated graphs. The symmetric difference between the glasso and nonparanormal graphs is shown in (a) (a blue edge is unique to the glasso graph while a red edge is unique to the nonparanormal graph). The graphs in (b) and (c) similarly illustrate the symmetric difference of the glasso and FDE graphs, and of the nonparanormal and FDE graphs; blue edges are unique to the glasso graph, red edges are unique to the nonparanormal graph, and black edges are unique to the FDE graph. The shared edges of the glasso and nonparanormal graphs are shown in (d).]

Since the dataset contains n = 1,257 data points, we directly apply the forest density estimator on the whole dataset to obtain a full spanning tree of d − 1 = 451 edges. This estimator turns out to be very sensitive to outliers, since it exploits kernel density estimates as building blocks. In Figure 28.9(d) we show the estimated forest density graph on the stock data when outliers are not removed. In this case the graph is anomalous, with a snake-like character that weaves in and out of the 10 GICS industries. Intuitively, the outliers make the two-dimensional densities appear like thin "pancakes," and densities with similar orientations are clustered together. To address this, we transform by the nonparanormal transformation, and then run forest density estimation. Figure 28.9(c) shows the estimated forest graph after outliers are removed in this way. The resulting graph has good clustering with respect to the GICS sectors.

Figures 28.10(a)-(c) display the differences between the glasso, nonparanormal, and forest density estimation graphs. Figure 28.10(d) shows the shared edges between the estimated glasso and nonparanormal graphs. Although the nonparanormal and glasso graph topologies appear similar with respect to the clustering behavior in the GICS classes, they have many different edges; in fact, the nonparanormal and glasso graphs share only about 63% of the same edges. In comparing the nonparanormal and glasso graphs with the forest density estimation graph, we find that 58.5% of the edges in the forest density estimation graph are also contained in the nonparanormal graph. In contrast, only 43% of the edges in the forest density estimation graph are contained in the glasso graph.

We refrain from drawing any hard conclusions about the effectiveness of the different methods based on these plots; how these graphs are used will depend on the application. These results serve mainly to highlight how very different inferences about the independence relations can arise from moving from a Gaussian model to a semiparametric model to a fully nonparametric model with restricted graphs.

28.6 Discussion

This chapter has considered undirected graphical models for continuous data, where the general densities take the form

$$p(x) \propto \exp\Bigg( \sum_{C \in \mathrm{Cliques}(G)} f_C(x_C) \Bigg). \qquad (28.63)$$

Such a general family is at least as difficult as the general high-dimensional nonparametric regression model. But, as for regression, simplifying assumptions can lead to tractable and useful models. We have considered two approaches that make very different tradeoffs between statistical generality and computational efficiency. The nonparanormal relies on estimating one-dimensional functions, in a manner that is similar to the way additive models estimate one-dimensional regression functions. This allows arbitrary graphs, but the distribution is semiparametric, via the Gaussian copula. At the other extreme, when we restrict to acyclic graphs we can have fully nonparametric bivariate and univariate marginals. This leverages classical techniques for low-dimensional density estimation, together with approximation algorithms for constructing the graph. Clearly these are just two among many possibilities for nonparametric graphical modeling. We conclude, then, with a brief description of a few potential directions for future work.

As we saw with the nonparanormal, if only the graph is of interest, it may not be important to estimate the functions accurately. More generally, to estimate the graph it is not necessary to estimate the density. One of the most effective and theoretically well-supported methods for estimating Gaussian graphs is due to Meinshausen and Bühlmann (2006). In this approach, we regress each variable $X_j$ onto all other variables $(X_k)_{k \ne j}$ using the lasso. This directly estimates the set of neighbors $\mathcal{N}(j) = \{k \mid (j, k) \in E\}$ for each node $j$ in the graph, but the covariance matrix is not directly estimated; a minimal sketch of this approach appears below. Lasso theory gives conditions and guarantees on these variable selection problems. This approach was adapted to the discrete case by Ravikumar et al. (2010a), where the normalizing constant, and thus the density, can't be efficiently computed. This general strategy may be attractive for graph selection in nonparametric graphical models. In particular, each variable could be regressed on the others using a nonparametric regression method that performs variable selection; one such method with theoretical guarantees is due to Lafferty and Wasserman (2008).
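Here is a minimal sketch of Meinshausen-Bühlmann neighborhood selection (using scikit-learn's Lasso; the OR rule for combining the two neighborhood estimates of each edge is one common convention):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_graph(X, lam):
    """Estimate graph edges by lasso-regressing each X_j on the rest."""
    n, d = X.shape
    neighbors = []
    for j in range(d):
        others = np.delete(np.arange(d), j)
        coef = Lasso(alpha=lam).fit(X[:, others], X[:, j]).coef_
        neighbors.append(set(others[np.abs(coef) > 1e-8]))
    # OR rule: include (j, k) if either regression selects the other node.
    return {(j, k) for j in range(d) for k in range(j + 1, d)
            if k in neighbors[j] or j in neighbors[k]}
```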

No matter how the methodology develops, nonparametric graphical models will at best be approximations to the true distribution in many applications. Yet, there is plenty of experience to show how incorrect models can be useful. An ongoing challenge in nonparametric graphical modeling will be to better understand how the structure can be accurately estimated even when the model is wrong.

28.7 Nonparametric Belief Propagation

The nonparanormal and forest densities are special classes of tractable nonparametric graphical models. More generally, we would like to be able to work with models of the form

$$p(x) = \frac{1}{Z(f)} \exp\Bigg( \sum_C f_C(x_C) \Bigg) \qquad (28.64)$$

as already discussed above. Doing so requires approximate inference, for example stochastic simulation or variational methods. Nonparametric belief propagation is a hybrid of variational approximation and simulation that has been proposed for working with general nonparametric graphical models with continuous variables.

Suppose we are given the functions f_C. How do we carry out inference for this model? In machine learning parlance, this means, for example, computing or approximating the marginal density p(x_i) for one of the component variables X_i. We will consider this question in the setting where a conditional density is specified in the form

p(x \mid y) \propto \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j; y) \prod_{i \in V} \psi_i(x_i; y)    (28.65)


where we condition on evidence y and specify the model in terms of a set of edge potentials ψ_ij(x_i, x_j; y) > 0 and vertex potentials ψ_i(x_i; y) > 0. This model can be thought of as a form of conditional random field. Our objective is to approximate the conditional density of a single node, p(x_i | y).

As indicated in Chapter 20, belief propagation is a message passing algorithm that can be used for this purpose. Although we presented it for discrete distributions, it in principle extends to continuous distributions. The messages are given by densities

m_{ji}(x_i) \propto \int_{\mathcal{X}_j} \psi_{ij}(x_i, x_j; y)\, \psi_j(x_j; y) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)\, d\mu(x_j)    (28.66)

where \mathcal{X}_j is the domain of the variable X_j; this is the outgoing message sent from node j to node i, given the incoming messages received by j from its other neighbors N(j) \ i. With all of the incoming messages defined, the approximation to p(x_i | y) is then given by

q(x_i \mid y) \propto \psi_i(x_i; y) \prod_{j \in N(i)} m_{ji}(x_i).    (28.67)

The difficulty is that the integrals required to compute the messages in (28.66) may be difficult to evaluate numerically. Nonparametric belief propagation uses particle filtering methods to approximate these integrals.
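For intuition, a message such as (28.66) can be approximated by brute-force quadrature when the variables are one-dimensional and the domain is discretized. The following minimal sketch (the potentials psi_ij and psi_j and the grid are illustrative stand-ins) makes the cost visible: each message costs on the order of the grid size squared, which quickly becomes burdensome and motivates the particle approach.

    import numpy as np

    def bp_message(grid, psi_ij, psi_j, incoming):
        """Approximate the message m_ji on a grid by numerical integration.

        grid     : 1-d array discretizing the (shared) domain of X_i and X_j
        psi_ij   : callable (x_i, x_j) -> edge potential, vectorized in x_j
        psi_j    : callable (x_j) -> vertex potential, vectorized
        incoming : list of arrays, each m_kj evaluated on `grid`, for the
                   neighbors k of j other than i
        """
        dx = grid[1] - grid[0]
        # product of the vertex potential and all incoming messages at x_j
        integrand = psi_j(grid)
        for m in incoming:
            integrand = integrand * m
        # m_ji(x_i) ~ sum over x_j of psi_ij(x_i, x_j) * integrand(x_j) * dx
        msg = np.array([np.sum(psi_ij(xi, grid) * integrand) * dx
                        for xi in grid])
        return msg / (np.sum(msg) * dx)  # normalize to a density on the grid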

To explain the algorithm in its simplest form, suppose that the incoming messages m_{kj}(x_j) have been determined, and we wish to compute the outgoing message m_{ji}(x_i); this is done using the following three-step procedure.


NONPARAMETRIC BELIEF PROPAGATION

1. Draw a sample of L "auxiliary particles" \tilde{x}^{(\ell)}_{ji}, by stochastic sampling according to

       \tilde{X}^{(\ell)}_{ji} \sim b_{ji}(X_j)\, \psi_j(X_j; y) \prod_{k \in N(j) \setminus i} m_{kj}(X_j)    (28.68)

   where the bias term b_{ji} is computed as

       b_{ji}(x_j) = \int_{\mathcal{X}_i} \psi_{ij}(x_i, x_j; y)\, d\mu(x_i).    (28.69)

   The sample is formed using importance sampling or another standard MCMC procedure.

2. Given the auxiliary particles, draw a sample of L particles x^{(\ell)}_{ji} by sampling according to

       X^{(\ell)}_{ji} \sim \psi_{ij}(X_i, \tilde{x}^{(\ell)}_{ji}; y).    (28.70)

3. Given the particles x^{(\ell)}_{ji}, the message m_{ji}(x_i) is obtained as the kernel density estimate

       m_{ji}(x_i) = \frac{1}{L} \sum_{\ell=1}^{L} K_h(x_i, x^{(\ell)}_{ji})    (28.71)

   for an appropriately chosen bandwidth h.
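The following is a minimal Python sketch of this three-step update under simplifying assumptions: the variables are one-dimensional, step 1 is carried out with a simple random-walk Metropolis sampler (an illustrative choice), and the callables psi_j, b_ji, and sample_edge, along with the proposal scale, are stand-ins for quantities left abstract in the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def nbp_message_update(L, h, psi_j, b_ji, sample_edge, incoming, init=0.0):
        """One particle-based update of the message m_ji.

        L           : number of particles
        h           : kernel bandwidth
        psi_j       : vertex potential, callable x_j -> positive float
        b_ji        : bias term (28.69), assumed computable in closed form
        sample_edge : given x_j, draws x_i from the edge potential (step 2)
        incoming    : list of callables, the messages m_kj for the
                      neighbors k of j other than i
        """
        def target(xj):  # unnormalized density in (28.68)
            val = b_ji(xj) * psi_j(xj)
            for m in incoming:
                val *= m(xj)
            return val

        # Step 1: random-walk Metropolis draws of the auxiliary particles
        aux, x = [], init
        for _ in range(L):
            prop = x + rng.normal(scale=0.5)
            if rng.uniform() < min(1.0, target(prop) / target(x)):
                x = prop
            aux.append(x)

        # Step 2: propagate each auxiliary particle through the edge potential
        particles = np.array([sample_edge(xj) for xj in aux])

        # Step 3: the new message is a Gaussian kernel density estimate (28.71)
        def message(xi):
            return np.mean(np.exp(-(xi - particles) ** 2 / (2 * h ** 2))
                           / (np.sqrt(2 * np.pi) * h))

        return particles, message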

Using the Gaussian kernel, this algorithm represents each message m_{jk}(x_k) as a mixture of L Gaussians. Note then that if node j has d neighbors, the product \prod_{k \in N(j) \setminus i} m_{kj}(x_j) in the first step can be expressed as a mixture of L^{d-1} Gaussians. If L and d are large, sampling from this explicitly may be computationally prohibitive; a stochastic simulation algorithm will generally require O(dL) cost.

The marginal densities p(x_i | y) are approximated by simulation, by sampling according to

X^{(\ell)}_i \sim \psi_i(X_i; y) \prod_{j \in N(i)} m_{ji}(X_i).    (28.72)

A kernel density estimate of the marginal is then formed from the resulting particles as

q(x_i \mid y) = \frac{1}{L} \sum_{\ell=1}^{L} K_h\big(x_i, x^{(\ell)}_i\big).    (28.73)

The algorithm iterates until the messages converge. This nonparametric belief propagation procedure can thus be seen as a hybrid of variational methods and simulation. Details and variants of this procedure, together with interesting applications to visual tracking and sensor localization, are discussed by Sudderth et al. (2010).
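Continuing the sketch above (and reusing its np and rng), the final marginal approximation (28.72)-(28.73) can be computed the same way: sample particles from the product of the vertex potential and the incoming messages, again by random-walk Metropolis as an illustrative choice, and smooth them with the kernel.

    def marginal_estimate(L, h, psi_i, incoming, init=0.0):
        """Particle approximation of the node marginal q(x_i | y)."""
        def target(xi):  # unnormalized density in (28.72)
            val = psi_i(xi)
            for m in incoming:
                val *= m(xi)
            return val

        xs, x = [], init
        for _ in range(L):
            prop = x + rng.normal(scale=0.5)
            if rng.uniform() < min(1.0, target(prop) / target(x)):
                x = prop
            xs.append(x)
        xs = np.array(xs)

        # kernel density estimate (28.73) built from the sampled particles
        return lambda xi: np.mean(np.exp(-(xi - xs) ** 2 / (2 * h ** 2))
                                  / (np.sqrt(2 * np.pi) * h))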

28.8 Bibliographic Remarks

There is surprisingly little work on structure learning of nonparametric graphical models in high dimensions. One piece of related work is sparse log-density smoothing spline ANOVA models, introduced by Jeon and Lin (2006). In such a model the log-density function is decomposed as the sum of a constant term, one-dimensional functions (main effects), two-dimensional functions (two-way interactions), and so on:

\log p(x) = f(x) \equiv c + \sum_{j=1}^{d} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k) + \cdots.    (28.74)

The component functions satisfy certain constraints so that the model is identifiable. In high dimensions, the model is truncated at second-order interactions so that the computation is still tractable. There is a close connection between the log-density ANOVA model and undirected graphical models. For a model with only main effects and two-way interactions, we define a graph G = (V, E) such that (i, j) ∈ E if and only if f_ij ≠ 0. It can be seen that p(x) is Markov to G. Jeon and Lin (2006) assume that these component functions belong to certain reproducing kernel Hilbert spaces (RKHSs) equipped with an RKHS norm ‖·‖_K. To obtain a sparse estimate of the component functions f(x), they propose a penalized M-estimator

\hat{f} = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^{n} \exp\big(-f(X^{(i)})\big) + \int f(x)\,\rho(x)\, dx + \lambda J(f) \Big\},    (28.75)

where \rho(x) is some pre-defined positive density and J(f) is a sparsity-inducing penalty that takes the form

J(f) = \sum_{j=1}^{d} \|f_j\|_K + \sum_{j<k} \|f_{jk}\|_K.    (28.76)

Solving (28.75) only requires one-dimensional integrals, which can be efficiently computed. However, the optimization in (28.75) exploits a surrogate loss instead of the log-likelihood loss, and is more difficult to analyze theoretically.
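As a small illustration of the truncated expansion (28.74) and its graph, the following sketch evaluates the unnormalized log-density from caller-supplied component functions; reading the edge set off the nonzero interactions is exactly the Markov correspondence described above. The representation of the interactions as a dictionary is an illustrative choice.

    def log_density_anova(x, c, mains, interactions):
        """Unnormalized log-density c + sum_j f_j(x_j) + sum_{j<k} f_jk(x_j, x_k).

        mains        : list of callables f_j, one per coordinate
        interactions : dict {(j, k): f_jk} holding only the nonzero two-way terms
        """
        val = c + sum(f(x[j]) for j, f in enumerate(mains))
        val += sum(f(x[j], x[k]) for (j, k), f in interactions.items())
        return val

    def graph_edges(interactions):
        # (j, k) is an edge exactly when f_jk is not identically zero
        return sorted(interactions)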

Another related idea is to conduct structure learning using nonparametric decomposable graphical models (Schwaighofer et al., 2007). A distribution is a decomposable graphical model if it is Markov to a graph G = (V, E) that has a junction tree representation, which can be viewed as an extension of tree-based graphical models. A junction tree yields a factorized form

p(x) = \frac{\prod_{C \in V_T} p(x_C)}{\prod_{S \in E_T} p(x_S)}    (28.77)


where V_T denotes the set of cliques in the junction tree and E_T is the set of separators, i.e., the intersections of neighboring cliques in the junction tree. Exact search for the junction tree structure that maximizes the likelihood is usually computationally expensive. Schwaighofer et al. (2007) propose a forward-backward strategy for nonparametric structure learning. However, such a greedy procedure does not guarantee that the globally optimal solution is found, and makes theoretical analysis challenging.
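To illustrate the factorization (28.77), the following minimal sketch evaluates a decomposable density from clique and separator marginals, which are assumed to be given; in the nonparametric setting they would be low-dimensional kernel density estimates.

    import numpy as np

    def decomposable_density(x, cliques, separators):
        """Evaluate p(x) = prod over cliques / prod over separators at x.

        x          : 1-d numpy array, the full observation
        cliques    : list of (indices, density) pairs, one per clique in V_T
        separators : list of (indices, density) pairs, one per separator in E_T
        """
        num = np.prod([p(x[list(idx)]) for idx, p in cliques])
        den = np.prod([p(x[list(idx)]) for idx, p in separators])
        return num / den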

A different framework for nonparametricity involves conditioning on a collection of observed explanatory variables Z. Liu et al. (2010) develop a nonparametric procedure called Graph-optimized CART, or Go-CART, to estimate the graph conditionally under a Gaussian model. The main idea is to build a tree partition on the Z space, just as in CART (classification and regression trees), but to estimate a graph at each leaf using the glasso. Oracle inequalities on risk minimization and model selection consistency were established for Go-CART by Liu et al. (2010). When Z is time, graph-valued regression reduces to the time-varying graph estimation problem (Zhou et al., 2010; Chen et al., 2010; Kolar et al., 2009).

In parametric settings, Chandrasekaran et al. (2010) and Choi et al. (2010) develop algorithms and theory for learning graphical models with latent variables. The first paper assumes that the joint distribution of the observed and latent variables is a Gaussian graphical model; the second assumes that the joint distribution is discrete and factors according to a forest. Since the nonparanormal and the forest density estimator are nonparametric versions of the Gaussian and discrete forest graphical models, we expect that techniques similar to those of Chandrasekaran et al. (2010) and Choi et al. (2010) can be used to extend the methods of this chapter to handle latent variables.

Exercises

28.1 Let X be a random variable with mean zero, unit variance, distribution F, and density p(x) (with respect to Lebesgue measure). Show that

p(x) = f'(x)\, \phi(f(x))    (28.78)

where f(x) = \Phi^{-1}(F(x)), with \phi and \Phi the standard normal density and distribution function. Thus, any one-dimensional distribution can be transformed to a normal by a monotone transformation.
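A quick numerical check of (28.78), taking F to be the standard logistic distribution as an illustrative choice (the identity itself does not require the mean-zero, unit-variance normalization):

    import numpy as np
    from scipy.stats import norm, logistic

    x = np.linspace(-3.0, 3.0, 7)
    f = norm.ppf(logistic.cdf(x))        # f = Phi^{-1}(F)
    eps = 1e-6                           # central-difference step for f'
    fprime = (norm.ppf(logistic.cdf(x + eps))
              - norm.ppf(logistic.cdf(x - eps))) / (2 * eps)
    # f'(x) * phi(f(x)) should recover the density of F
    assert np.allclose(fprime * norm.pdf(f), logistic.pdf(x), atol=1e-5)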

28.2 Suppose that P is a distribution with density p, and that the undirected graph of P is a forest with edge set E_F and vertex set V_F. Show that

p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k \in V_F} p(x_k).    (28.79)
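A numerical check of (28.79) on a small example: a three-node Gaussian chain X1 - X2 - X3, constructed so that X1 and X3 are conditionally independent given X2, in which case the joint density should match the factorization over edges and vertices. The coefficients a and b and the test point are illustrative choices.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn, norm

    a, b = 0.7, -0.5   # chain: X2 = a*X1 + e, X3 = b*X2 + e', unit noise
    S = np.array([[1.0,   a,               a * b],
                  [a,     a**2 + 1,        b * (a**2 + 1)],
                  [a * b, b * (a**2 + 1),  b**2 * (a**2 + 1) + 1]])
    x = np.array([0.3, -1.2, 0.8])

    joint = mvn(mean=np.zeros(3), cov=S).pdf(x)
    p12 = mvn(mean=np.zeros(2), cov=S[np.ix_([0, 1], [0, 1])]).pdf(x[[0, 1]])
    p23 = mvn(mean=np.zeros(2), cov=S[np.ix_([1, 2], [1, 2])]).pdf(x[[1, 2]])
    p1, p2, p3 = (norm(scale=np.sqrt(S[i, i])).pdf(x[i]) for i in range(3))
    factored = (p12 / (p1 * p2)) * (p23 / (p2 * p3)) * p1 * p2 * p3
    assert np.isclose(joint, factored)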

28.3 Prove Proposition 28.31.