Stat 437 Lecture Notes 5
Xiongzhi Chen
Washington State Univ.
Xiongzhi Chen (Washington State Univ.) Stat 437 Lecture Notes 5 1 / 16
Mixture models
Intuitive definition: there are K data-generating processes; with fixed probabilities, randomly pick a process, and then generate an observation from it
Formal definition: X is said to have "a mixture density with mixing proportions {π_i}_{i=1}^K and components {f_i}_{i=1}^K" if X has density

f(x) = ∑_{i=1}^K π_i f_i(x), where min_{1≤i≤K} π_i > 0 and ∑_{i=1}^K π_i = 1

Remark: f ∈ Conv({f_i}_{i=1}^K). Question: how do we generate an observation of X when X ∼ f?
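The sampling question above has a direct generative answer: draw a component index with probabilities {π_i}, then draw from that component. A minimal Python sketch (the two Gaussian components and their proportions below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pis, samplers):
    """Draw n observations from a mixture: pick component i with
    probability pis[i], then draw from that component's sampler."""
    idx = rng.choice(len(pis), size=n, p=pis)
    return np.array([samplers[i]() for i in idx])

# Example: K = 2 Gaussian components with mixing proportions (0.3, 0.7)
pis = [0.3, 0.7]
samplers = [lambda: rng.normal(-2.0, 1.0),  # f_1 = N(-2, 1)
            lambda: rng.normal(3.0, 0.5)]   # f_2 = N(3, 0.25)
x = sample_mixture(10_000, pis, samplers)
# Roughly 30% of the draws come from the first component (mass near -2)
print(np.mean(x < 0))
```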
The mighty Gaussian mixture
Gaussian mixtures approximate any probability measure (in the weak topology).
Namely, for any X there exists X_n with a Gaussian mixture density such that

∫ h(X_n) dP → ∫ h(X) dP as n → ∞

for any h ∈ C_b(R)
The mighty Gaussian mixture (supplementary)
Let N be the set of centered Gaussian densities with independent components, i.e.,

N = { φ_t : φ_t(x) = (2πt²)^{−d/2} exp(−‖x‖²/(2t²)), x ∈ R^d }

Let {φ_n}_{n≥1} be an enumeration of all φ_{1/k}(x − ξ_l), k ≥ 1, ξ_l ∈ R^d. Then for any g ∈ C_c(R^d) there exist a sequence {a_n}_{n≥1} ∈ ⋂_{p>1} ℓ^p and an increasing sequence {b_n}_{n≥1} of positive integers such that S_{b_n} = ∑_{k=1}^{b_n} a_k φ_k satisfies

‖g − S_{b_n}‖_1 + ‖g − S_{b_n}‖_∞ → 0 as n → ∞
(ref: Paper 1 and Paper 2)
Settings
Training set T : N observations {(x_i, y_i)}_{i=1}^N with x_i ∈ R^p and y_i ∈ {1, ..., K}
Each x_i is an observation of the p-dimensional feature X; y_i is the class label of x_i
Target: given a new x ∈ R^p, estimate its label y
Bayesian modelling
π_k = Pr(G = k), the prior probability of class k, where G indexes the class label
f_k(x) = Pr(X = x | G = k), the conditional density of X given that it is from class k
Bayes theorem:

Pr(G = k | X = x) = π_k f_k(x) / ∑_{l=1}^K π_l f_l(x)

Bayes classifier: assign G = k₀ if

Pr(G = k₀ | X = x) = max_{g ∈ {1,...,K}} Pr(G = g | X = x)
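A small sketch of the Bayes classifier in Python, assuming the priors π_k and densities f_k are known (the two Gaussian classes below are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class example: known priors pi_k and densities f_k
pis = np.array([0.5, 0.5])
f = [norm(loc=-1.0, scale=1.0).pdf,   # f_1 = N(-1, 1)
     norm(loc=+1.0, scale=1.0).pdf]   # f_2 = N(+1, 1)

def posterior(x):
    """Pr(G = k | X = x) for each k, via Bayes' theorem."""
    w = np.array([pis[k] * f[k](x) for k in range(len(pis))])
    return w / w.sum()

def bayes_classify(x):
    """Assign the class k0 that maximizes the posterior."""
    return int(np.argmax(posterior(x)))

print(bayes_classify(-2.0))  # 0: nearer the class-1 mean
print(bayes_classify(+0.5))  # 1: nearer the class-2 mean
```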
Linear discriminant analysis (LDA)
Gaussian mixture with components

f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) )

Assume Σ_k = Σ for all k. Then

log [ Pr(G = k | X = x) / Pr(G = l | X = x) ]
  = log(π_k/π_l) − (1/2)(µ_k + µ_l)^T Σ^{−1} (µ_k − µ_l) + x^T Σ^{−1} (µ_k − µ_l)

Decision boundary between classes k and l:

D_{k,l} = { x : Pr(G = k | X = x) = Pr(G = l | X = x) }

under the principle of the Bayes classifier
Linear discriminant analysis (LDA)
Under the principle of the Bayes classifier, the log posterior probability log q_k(x) for class k yields the discriminant function δ_k(x) for class k:

δ_k(x) = x^T Σ^{−1} µ_k − (1/2) µ_k^T Σ^{−1} µ_k + log π_k

(this equals log Pr(G = k | X = x) up to a term that is the same for every class)
δ_k is a linear function of x
The previous results can be obtained by simple algebra
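The discriminant δ_k can be evaluated directly; a sketch with made-up values of µ_k, Σ, and π_k:

```python
import numpy as np

def lda_discriminant(x, mu, Sigma, pi):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k,
    computed for every class k at once; mu has one row per class."""
    Sinv = np.linalg.inv(Sigma)
    lin = x @ Sinv @ mu.T                                   # x^T Sigma^{-1} mu_k
    const = -0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return lin + const

# Hypothetical p = 2, K = 3 setup with a shared covariance
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Sigma = np.eye(2)
pi = np.array([1/3, 1/3, 1/3])

delta = lda_discriminant(np.array([2.8, 0.1]), mu, Sigma, pi)
print(np.argmax(delta))  # the point is classified to the class with largest delta_k
```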
Implementation of LDA
Statistical estimates: π̂_k = N_k/N, where N_k = |G_k| is the class-k sample size; µ̂_k = ∑_{i: g_i = k} x_i / N_k; and

Σ̂ = (1/(N − K)) ∑_{k=1}^K ∑_{i: g_i = k} (x_i − µ̂_k)(x_i − µ̂_k)^T

Plug these into δ_k to get the estimate δ̂_k. These are essentially the maximum likelihood estimates (MLEs); Σ̂ uses the unbiased divisor N − K in place of the MLE's N.
Software: function lda in R library MASS
Issue: when p ≫ N, the estimates Σ̂ of Σ and µ̂_k of µ_k can be far off; so regularized estimates of Σ and µ_k are often employed
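Besides lda in MASS, the same plug-in procedure is available in Python via scikit-learn (a sketch on simulated data, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Simulated training data: two well-separated Gaussian classes,
# shared (identity) covariance, 200 observations per class
n = 200
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(n, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

fit = LinearDiscriminantAnalysis()   # plug-in estimates of pi_k, mu_k, Sigma
fit.fit(X, y)
print(fit.predict([[0.2, -0.1], [2.9, 3.3]]))  # -> [0 1]
```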
LDA: illustration
[Figure: ESL Figure 4.1, p. 103. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries, obtained by finding linear boundaries in the five-dimensional space X₁, X₂, X₁X₂, X₁², X₂². Linear inequalities in this space are quadratic inequalities in the original space.]
LDA: illustration
[Figure: ESL Figure 4.5, p. 109. The left panel shows three Gaussian distributions with the same covariance and different means; included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right is a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.]
Quadratic discriminant function (QDA)
Gaussian mixture with components

f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) )

When the Σ_k are not all equal, the discriminant function

δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) + log π_k

is quadratic in x
The decision boundary between classes k and l is given by

{ x : δ_k(x) = δ_l(x) }

One issue in implementing QDA is estimating all the Σ_k's when p is large
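A sketch of the quadratic discriminant with made-up class parameters; note that, unlike LDA, each class carries its own Σ_k:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -(1/2) log|Sigma_k|
                    - (1/2)(x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log pi_k"""
    d = x - mu_k
    sign, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d) + np.log(pi_k)

# Hypothetical two classes with different covariances -> quadratic boundary
mus = [np.zeros(2), np.array([2.0, 0.0])]
Sigmas = [np.eye(2), np.diag([4.0, 0.25])]
pis = [0.5, 0.5]

x = np.array([1.0, 0.0])
deltas = [qda_discriminant(x, mus[k], Sigmas[k], pis[k]) for k in range(2)]
print(int(np.argmax(deltas)))
```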
QDA: illustration
[Figure: ESL Figure 4.6, p. 111. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X₁, X₂, X₁X₂, X₁², X₂²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.]
Logistic regression and LDA
LDA via Gaussian mixture:

log [ Pr(G = k | X = x) / Pr(G = K | X = x) ]
  = log(π_k/π_K) − (1/2)(µ_k + µ_K)^T Σ^{−1} (µ_k − µ_K) + x^T Σ^{−1} (µ_k − µ_K)
  = α_{k0} + α_k^T x

Logistic regression:

log [ Pr(G = k | X = x) / Pr(G = K | X = x) ] = β_{k0} + β_k^T x
Logistic regression and LDA
Recall

Pr(X, G = k) = Pr(X) Pr(G = k | X) = Pr(G = k) Pr(X | G = k)

So, for both methods,

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / ( 1 + ∑_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ),  k ≠ K

and

Pr(G = K | X = x) = 1 / ( 1 + ∑_{l=1}^{K−1} exp(β_{l0} + β_l^T x) )
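These posteriors are straightforward to compute from the coefficients; a sketch with hypothetical β's for K = 3 classes, treating class K as the reference:

```python
import numpy as np

def class_posteriors(x, beta0, beta):
    """Pr(G = k | X = x) for k = 1..K, with class K as the reference:
    exp(beta_{k0} + beta_k^T x) / (1 + sum_l exp(beta_{l0} + beta_l^T x))
    for k != K, and 1 / (1 + sum_l exp(...)) for class K."""
    eta = beta0 + beta @ x            # (K-1,) linear predictors
    w = np.exp(eta)
    denom = 1.0 + w.sum()
    return np.append(w / denom, 1.0 / denom)

# Hypothetical K = 3, p = 2 coefficients
beta0 = np.array([0.5, -0.5])
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
p = class_posteriors(np.array([0.2, 0.3]), beta0, beta)
print(p, p.sum())  # the K posteriors sum to 1
```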
Logistic regression and LDA
Logistic regression does not need Pr(X) and maximizes the conditional likelihood Pr(G = k | X); LDA implicitly uses Pr(X) and maximizes the joint likelihood

Pr(X, G = k) = φ(X; µ_k, Σ) π_k

Recall the log-likelihood for logistic regression with binary classification:

l(β) = ∑_{i=1}^N [ y_i β^T x_i − log(1 + e^{β^T x_i}) ]

If the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (the coefficients diverge). In this case, however, the LDA coefficients for the same data are well defined, since the marginal likelihood does not permit these degeneracies.
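The degeneracy under perfect separation is easy to see numerically: scaling a separating β upward keeps increasing l(β) toward 0, so no finite maximizer exists. A small sketch with made-up separable data (the intercept is absorbed as a column of ones):

```python
import numpy as np

def loglik(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]"""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # stable log(1 + e^eta)

# Perfectly separable 1-d data, with an intercept column of ones
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

beta = np.array([0.0, 1.0])           # a separating direction
lls = [loglik(c * beta, X, y) for c in (1.0, 10.0, 100.0)]
print(lls)  # strictly increasing toward 0: the MLE does not exist
```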