Stat 437 Lecture Notes 5
Xiongzhi Chen
Washington State Univ.
Xiongzhi Chen (Washington State Univ.) Stat 437 Lecture Notes 5 1 / 16
Mixture models
Intuitive definition: there are K data-generating processes; with fixed probabilities, randomly pick a process, and then generate an observation from it
Formal definition: X is said to have "a mixture density with mixing proportions {π_i}_{i=1}^K and components {f_i}_{i=1}^K" if X has density

f(x) = ∑_{i=1}^K π_i f_i(x), where min_{1≤i≤K} π_i > 0 and ∑_{i=1}^K π_i = 1

Remark: f ∈ Conv({f_i}_{i=1}^K). Question: how do we generate an observation of X when X ∼ f?
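The sampling question above has a direct generative answer: draw a component index with probabilities {π_i}, then draw from that component. A minimal Python sketch (the two Gaussian components and their proportions below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pis, samplers):
    """Draw n observations from a mixture: pick component i with
    probability pis[i], then draw from that component's sampler."""
    idx = rng.choice(len(pis), size=n, p=pis)
    return np.array([samplers[i]() for i in idx])

# Example: K = 2 Gaussian components with mixing proportions (0.3, 0.7)
pis = [0.3, 0.7]
samplers = [lambda: rng.normal(-2.0, 1.0),  # f_1 = N(-2, 1)
            lambda: rng.normal(3.0, 0.5)]   # f_2 = N(3, 0.25)
x = sample_mixture(10_000, pis, samplers)
# Roughly 30% of the draws come from the first component (mass near -2)
print(np.mean(x < 0))
```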
The mighty Gaussian mixture
Gaussian mixtures approximate any probability measure (in the weak topology).
Namely, for any X there exists X_n with a Gaussian mixture density such that

∫ h(X_n) dP → ∫ h(X) dP as n → ∞

for any h ∈ C_b(R)
The mighty Gaussian mixture (supplementary)
Let N be the set of centered Gaussian densities with independent components, i.e.,

N = { φ_t : φ_t(x) = (2πt²)^{−d/2} exp(−‖x‖²/(2t²)), x ∈ R^d }

Let {φ_n}_{n≥1} be an enumeration of all φ_{1/k}(x − ξ_l), k ≥ 1, ξ_l ∈ R^d. Then for any g ∈ C_c(R^d) there exist a sequence {a_n}_{n≥1} ∈ ⋂_{p>1} ℓ^p and an increasing sequence {b_n}_{n≥1} of positive integers such that S_{b_n} = ∑_{k=1}^{b_n} a_k φ_k satisfies

‖g − S_{b_n}‖_1 + ‖g − S_{b_n}‖_∞ → 0 as n → ∞
(ref: Paper 1 and Paper 2)
Settings
Training set T : N observations {(x_i, y_i)}_{i=1}^N with x_i ∈ R^p and y_i ∈ {1, ..., K}
Each x_i is an observation of the p-dimensional feature X; y_i is the class label of x_i
Target: given a new x ∈ R^p, estimate its label y
Bayesian modelling
π_k = Pr(G = k), the prior probability of class k, where G indexes the class label
f_k(x) = Pr(X = x | G = k), the conditional density of X given that it is from class k
Bayes theorem:

Pr(G = k | X = x) = π_k f_k(x) / ∑_{l=1}^K π_l f_l(x)

Bayes classifier: assign G = k₀ if

Pr(G = k₀ | X = x) = max_{g ∈ {1,...,K}} Pr(G = g | X = x)
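A small sketch of the Bayes classifier in Python, assuming the priors π_k and densities f_k are known (the two Gaussian classes below are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class example: known priors pi_k and densities f_k
pis = np.array([0.5, 0.5])
f = [norm(loc=-1.0, scale=1.0).pdf,   # f_1 = N(-1, 1)
     norm(loc=+1.0, scale=1.0).pdf]   # f_2 = N(+1, 1)

def posterior(x):
    """Pr(G = k | X = x) for each k, via Bayes' theorem."""
    w = np.array([pis[k] * f[k](x) for k in range(len(pis))])
    return w / w.sum()

def bayes_classify(x):
    """Assign the class k0 that maximizes the posterior."""
    return int(np.argmax(posterior(x)))

print(bayes_classify(-2.0))  # 0: nearer the class-1 mean
print(bayes_classify(+0.5))  # 1: nearer the class-2 mean
```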
Linear discriminant analysis (LDA)
Gaussian mixture with components

f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) )

Assume Σ_k = Σ for all k. Then

log [ Pr(G = k | X = x) / Pr(G = l | X = x) ]
  = log(π_k/π_l) − (1/2)(µ_k + µ_l)^T Σ^{−1} (µ_k − µ_l) + x^T Σ^{−1} (µ_k − µ_l)

Decision boundary between classes k and l:

D_{k,l} = { x : Pr(G = k | X = x) = Pr(G = l | X = x) }

under the principle of the Bayes classifier
Linear discriminant analysis (LDA)
Under the principle of the Bayes classifier, the log posterior probability log q_k(x) for class k yields the discriminant function δ_k(x) for class k:

δ_k(x) = x^T Σ^{−1} µ_k − (1/2) µ_k^T Σ^{−1} µ_k + log π_k

(this equals log Pr(G = k | X = x) up to a term that is the same for every class)
δ_k is a linear function of x
The previous results can be obtained by simple algebra
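The discriminant δ_k can be evaluated directly; a sketch with made-up values of µ_k, Σ, and π_k:

```python
import numpy as np

def lda_discriminant(x, mu, Sigma, pi):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k,
    computed for every class k at once; mu has one row per class."""
    Sinv = np.linalg.inv(Sigma)
    lin = x @ Sinv @ mu.T                                   # x^T Sigma^{-1} mu_k
    const = -0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return lin + const

# Hypothetical p = 2, K = 3 setup with a shared covariance
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Sigma = np.eye(2)
pi = np.array([1/3, 1/3, 1/3])

delta = lda_discriminant(np.array([2.8, 0.1]), mu, Sigma, pi)
print(np.argmax(delta))  # the point is classified to the class with largest delta_k
```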
Implementation of LDA
Statistical estimates: π̂_k = N_k/N, where N_k = |G_k| is the class-k sample size; µ̂_k = ∑_{i: g_i = k} x_i / N_k; and

Σ̂ = (1/(N − K)) ∑_{k=1}^K ∑_{i: g_i = k} (x_i − µ̂_k)(x_i − µ̂_k)^T

Plug these into δ_k to get the estimate δ̂_k. These are essentially the maximum likelihood estimates (MLEs); Σ̂ uses the unbiased divisor N − K in place of the MLE's N.
Software: function lda in R library MASS
Issue: when p ≫ N, the estimates Σ̂ of Σ and µ̂_k of µ_k can be far off; so regularized estimates of Σ and µ_k are often employed
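Besides lda in MASS, the same plug-in procedure is available in Python via scikit-learn (a sketch on simulated data, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Simulated training data: two well-separated Gaussian classes,
# shared (identity) covariance, 200 observations per class
n = 200
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(n, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

fit = LinearDiscriminantAnalysis()   # plug-in estimates of pi_k, mu_k, Sigma
fit.fit(X, y)
print(fit.predict([[0.2, -0.1], [2.9, 3.3]]))  # -> [0 1]
```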
LDA: illustration
[Figure: ESL Figure 4.1, p. 103. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries, obtained by finding linear boundaries in the five-dimensional space X₁, X₂, X₁X₂, X₁², X₂². Linear inequalities in this space are quadratic inequalities in the original space.]
LDA: illustration
[Figure: ESL Figure 4.5, p. 109. The left panel shows three Gaussian distributions with the same covariance and different means; included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right is a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.]
Quadratic discriminant function (QDA)
Gaussian mixture with components

f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) )

When the Σ_k are not all equal, the discriminant function

δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) + log π_k

is quadratic in x
The decision boundary between classes k and l is given by

{ x : δ_k(x) = δ_l(x) }

One issue in implementing QDA is estimating all the Σ_k's when p is large
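A sketch of the quadratic discriminant with made-up class parameters; note that, unlike LDA, each class carries its own Σ_k:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -(1/2) log|Sigma_k|
                    - (1/2)(x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k"""
    d = x - mu_k
    sign, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d) + np.log(pi_k)

# Hypothetical two classes with different covariances -> quadratic boundary
mus = [np.zeros(2), np.array([2.0, 0.0])]
Sigmas = [np.eye(2), np.diag([4.0, 0.25])]
pis = [0.5, 0.5]

x = np.array([1.0, 0.0])
deltas = [qda_discriminant(x, mus[k], Sigmas[k], pis[k]) for k in range(2)]
print(int(np.argmax(deltas)))
```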
QDA: illustration
[Figure: ESL Figure 4.6, p. 111. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X₁, X₂, X₁X₂, X₁², X₂²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.]
Logistic regression and LDA
LDA via Gaussian mixture:

log [ Pr(G = k | X = x) / Pr(G = K | X = x) ]
  = log(π_k/π_K) − (1/2)(µ_k + µ_K)^T Σ^{−1} (µ_k − µ_K) + x^T Σ^{−1} (µ_k − µ_K)
  = α_{k0} + α_k^T x

Logistic regression:

log [ Pr(G = k | X = x) / Pr(G = K | X = x) ] = β_{k0} + β_k^T x
Logistic regression and LDA
Recall

Pr(X, G = k) = Pr(X) Pr(G = k | X) = Pr(G = k) Pr(X | G = k)

So, for both methods,

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / ( 1 + ∑_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ),  k ≠ K

and

Pr(G = K | X = x) = 1 / ( 1 + ∑_{l=1}^{K−1} exp(β_{l0} + β_l^T x) )
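These posteriors are straightforward to compute from the coefficients; a sketch with hypothetical β's for K = 3 classes, treating class K as the reference:

```python
import numpy as np

def class_posteriors(x, beta0, beta):
    """Pr(G = k | X = x) for k = 1..K, with class K as the reference:
    exp(beta_{k0} + beta_k^T x) / (1 + sum_l exp(beta_{l0} + beta_l^T x))
    for k != K, and 1 / (1 + sum_l exp(...)) for class K."""
    eta = beta0 + beta @ x            # (K-1,) linear predictors
    w = np.exp(eta)
    denom = 1.0 + w.sum()
    return np.append(w / denom, 1.0 / denom)

# Hypothetical K = 3, p = 2 coefficients
beta0 = np.array([0.5, -0.5])
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
p = class_posteriors(np.array([0.2, 0.3]), beta0, beta)
print(p, p.sum())  # the K posteriors sum to 1
```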
Logistic regression and LDA
Logistic regression does not need Pr(X) and maximizes the conditional likelihood Pr(G = k | X); LDA implicitly uses Pr(X) and maximizes the joint likelihood

Pr(X, G = k) = φ(X; µ_k, Σ) π_k

Recall the log-likelihood for logistic regression with binary classification:

l(β) = ∑_{i=1}^N [ y_i β^T x_i − log(1 + e^{β^T x_i}) ]

If the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (the coefficients diverge). In this case, however, the LDA coefficients for the same data are well defined, since the marginal likelihood does not permit these degeneracies.
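The degeneracy under perfect separation is easy to see numerically: scaling a separating β upward keeps increasing l(β) toward 0, so no finite maximizer exists. A small sketch with made-up separable data (the intercept is absorbed as a column of ones):

```python
import numpy as np

def loglik(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]"""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # stable log(1 + e^eta)

# Perfectly separable 1-d data, with an intercept column of ones
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

beta = np.array([0.0, 1.0])           # a separating direction
lls = [loglik(c * beta, X, y) for c in (1.0, 10.0, 100.0)]
print(lls)  # strictly increasing toward 0: the MLE does not exist
```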