Post on 21-Dec-2015
Generative Models
Rong Jin
Statistical Inference
Training Examples $\{x_1, x_2, \ldots, x_n\}$ → Learning a Statistical Model $p(x; \theta)$ → Prediction
[Figure: histogram of number of people (0 to 10) vs. height (1.2 to 2.1). Female: Gaussian distribution $N(\mu_1, \sigma_1)$; Male: Gaussian distribution $N(\mu_2, \sigma_2)$]
Pr(male | 1.67m) vs. Pr(female | 1.67m)
Statistical Inference
Training Examples $\{x_1, x_2, \ldots, x_n\}$ → Learning a Statistical Model $p(y \mid x; \theta)$ → Prediction
[Figure: histogram of number of people (0 to 10) vs. height (1.2 to 2.1). Male: Gaussian distribution $N(\mu_1, \sigma_1)$; Female: Gaussian distribution $N(\mu_2, \sigma_2)$]
Pr(male | 1.67m) vs. Pr(female | 1.67m)
Probabilistic Models for Classification Problems
• Apply statistical inference methods
• Given training examples $\{(x_i, y_i)\}_{i=1}^{n}$
• Assume a parametric model $p(y \mid x; \theta)$
• Learn the model parameters $\theta$ from the training examples using the maximum likelihood approach
• The class of a new instance $x$ is predicted by
$$y^* = \arg\max_{y \in \mathcal{Y}} p(y \mid x; \theta)$$
Maximum Likelihood Estimation (MLE)
• Given training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
• Compute the log-likelihood of the data
$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$
• Find the parameters $\theta^*$ that maximize the log-likelihood
$$\theta^* = \arg\max_{\theta} l(D_{train}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$
• In many cases, the maximizer of the log-likelihood has no closed-form expression, and therefore MLE requires numerical calculation
Generative Models
• Most probability distributions are joint distributions (i.e., $p(x; \theta)$), not conditional distributions (i.e., $p(y \mid x; \theta)$)
• Using Bayes' rule, $p(y \mid x; \theta)$ is obtained from $\{p(x \mid y; \theta),\ p(y; \theta)\}$:
$$p(y \mid x; \theta) = \frac{p(y; \theta)\, p(x \mid y; \theta)}{p(x; \theta)}$$
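Generative Models (cont’d)
Treatment of $p(x \mid y; \theta)$
• Let $y \in \mathcal{Y} = \{1, 2, \ldots, c\}$
• Allocate a separate set of parameters for each class: $\theta \to \{\theta_1, \theta_2, \ldots, \theta_c\}$
• $p(x \mid y; \theta) \to p(x; \theta_y)$
• Data in different classes have different input patterns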
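Generative Models (cont’d)
Parameter space
• Parameters for distribution: $\{\theta_1, \theta_2, \ldots, \theta_c\}$
• Class priors: $\{p(y=1), p(y=2), \ldots, p(y=c)\}$
Learn parameters from training examples using MLE
• Compute the log-likelihood
$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) = \sum_{i=1}^{n} \left[ \log p(x_i; \theta_{y_i}) + \log p(y_i) - \log p(x_i) \right]$$
• Search for the optimal parameters by maximizing the log-likelihood; dropping the $\log p(x_i)$ term leaves the joint log-likelihood
$$\max_{\theta} l(D_{train}) \;\to\; \max_{\theta} \sum_{i=1}^{n} \log \left[ p(y_i)\, p(x_i; \theta_{y_i}) \right]$$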
Example
• Task: predict gender of individuals based on their heights
• Given
• 100 height examples of women
• 100 height examples of men
• Assume the heights of women and men follow different Gaussian distributions
[Figure: histograms of the empirical data for male and for female; x-axis from 1.1 to 2, y-axis from 0 to 40]
Example (cont’d)
Gaussian distribution
$$p_{\theta}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \quad \theta = (\mu, \sigma)$$
Parameter space
• Gaussian distribution for male: $(\mu_m, \sigma_m)$
• Gaussian distribution for female: $(\mu_f, \sigma_f)$
• Class priors: $p_m = p(y = \text{male})$, $p_f = p(y = \text{female})$
$$\max_{\theta} l(D_{train}) \;\to\; \max_{\theta} \sum_{i=1}^{n} \log \left[ p(y_i)\, p(x_i; \theta_{y_i}) \right]$$
Example (cont’d)
Given training examples $\{h^m_1, h^m_2, \ldots, h^m_{N_m};\ h^f_1, h^f_2, \ldots, h^f_{N_f}\}$ with $N = N_m + N_f$:
$$\begin{aligned}
l &= \sum_{i=1}^{N} \left[ \log p(h_i \mid y_i) + \log p(y_i) \right] \\
&= \sum_{i=1}^{N_m} \left[ \log p(h^m_i; \mu_m, \sigma_m) + \log p_m \right] + \sum_{i=1}^{N_f} \left[ \log p(h^f_i; \mu_f, \sigma_f) + \log p_f \right] \\
&= \sum_{i=1}^{N_m} \log \frac{\exp\left( -\frac{(h^m_i - \mu_m)^2}{2\sigma_m^2} \right)}{\sqrt{2\pi\sigma_m^2}} + \sum_{i=1}^{N_f} \log \frac{\exp\left( -\frac{(h^f_i - \mu_f)^2}{2\sigma_f^2} \right)}{\sqrt{2\pi\sigma_f^2}} + N_m \log p_m + N_f \log p_f
\end{aligned}$$
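Example (cont’d)
Learn a Gaussian generative model
$$\{\mu_m, \sigma_m, p_m;\ \mu_f, \sigma_f, p_f\}^* = \arg\max\, l$$
$$\mu_m = \frac{1}{N_m} \sum_{i=1}^{N_m} h^m_i, \quad \sigma_m^2 = \frac{1}{N_m} \sum_{i=1}^{N_m} (h^m_i - \mu_m)^2, \quad p_m = \frac{N_m}{N}$$
$$\mu_f = \frac{1}{N_f} \sum_{i=1}^{N_f} h^f_i, \quad \sigma_f^2 = \frac{1}{N_f} \sum_{i=1}^{N_f} (h^f_i - \mu_f)^2, \quad p_f = \frac{N_f}{N}$$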
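The closed-form estimates above (per-class mean, 1/N variance, and class prior) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names and the toy heights are made up for the example.

```python
def fit_gaussian_class(heights):
    """MLE of (mu, sigma^2) for one class, using the 1/N variance estimate."""
    n = len(heights)
    mu = sum(heights) / n
    var = sum((h - mu) ** 2 for h in heights) / n
    return mu, var


def fit_generative_model(male_heights, female_heights):
    """Return {"male": {...}, "female": {...}} with mu, var, and class prior."""
    n_total = len(male_heights) + len(female_heights)
    model = {}
    for label, data in (("male", male_heights), ("female", female_heights)):
        mu, var = fit_gaussian_class(data)
        # The MLE of the class prior is the fraction of examples in the class.
        model[label] = {"mu": mu, "var": var, "prior": len(data) / n_total}
    return model


# Toy heights in metres, invented for illustration only.
toy_male = [1.70, 1.80, 1.75, 1.85]
toy_female = [1.60, 1.65, 1.55, 1.70, 1.62, 1.58]
toy_model = fit_generative_model(toy_male, toy_female)
print(toy_model)
```

Note that the variance uses the 1/N normalization from the MLE solution, not the unbiased 1/(N-1) estimate.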
Example (cont’d)
[Figure: empirical data and fitted distributions for male and for female; x-axis from 1.1 to 2, y-axis from 0 to 40]
Predict the gender of an individual given his/her height
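Example (cont’d)
$$p(\text{male} \mid h) \propto p_m\, p(h \mid \mu_m, \sigma_m) = \frac{p_m}{\sqrt{2\pi\sigma_m^2}} \exp\left( -\frac{(h - \mu_m)^2}{2\sigma_m^2} \right)$$
$$p(\text{female} \mid h) \propto p_f\, p(h \mid \mu_f, \sigma_f) = \frac{p_f}{\sqrt{2\pi\sigma_f^2}} \exp\left( -\frac{(h - \mu_f)^2}{2\sigma_f^2} \right)$$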
[Figure: empirical data and fitted distributions for male and for female, with the decision boundary h* marked; x-axis from 1.1 to 2, y-axis from 0 to 40]
Decision boundary
• Decision boundary $h^*$
• Predict female when $h < h^*$
• Predict male when $h > h^*$
• Random when $h = h^*$
Where is the decision boundary?
• It depends on the ratio $p_m / p_f$
• In the figure, $h^*$ lies further left when $p_f < p_m$ and further right when $p_f > p_m$
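As a rough sketch of how the boundary depends on the priors, the snippet below computes the log-odds of male vs. female for a given height and locates h* by bisection between the two class means. The parameter values are invented for illustration, and the bisection assumes the log-odds changes sign exactly once between the means (true for the equal-variance toy values used here, not guaranteed in general).

```python
import math


def log_gaussian(h, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (h - mu) ** 2 / (2 * var)


def log_odds_male(h, params):
    """log[p_m p(h|male)] - log[p_f p(h|female)]; positive means predict male."""
    m, f = params["male"], params["female"]
    return (math.log(m["prior"]) + log_gaussian(h, m["mu"], m["var"])
            - math.log(f["prior"]) - log_gaussian(h, f["mu"], f["var"]))


def posterior_male(h, params):
    """p(male | h), computed from the log-odds in a numerically stable way."""
    z = log_odds_male(h, params)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decision_boundary(params, tol=1e-10):
    """Bisection for h* between the class means (assumes one sign change)."""
    lo, hi = params["female"]["mu"], params["male"]["mu"]
    if not (log_odds_male(lo, params) < 0 < log_odds_male(hi, params)):
        raise ValueError("log-odds does not change sign between the means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if log_odds_male(mid, params) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative parameters (not fitted to the lecture's data).
equal_priors = {
    "male": {"mu": 1.78, "var": 0.005, "prior": 0.5},
    "female": {"mu": 1.64, "var": 0.005, "prior": 0.5},
}
more_males = {
    "male": {"mu": 1.78, "var": 0.005, "prior": 0.7},
    "female": {"mu": 1.64, "var": 0.005, "prior": 0.3},
}
h_star_equal = decision_boundary(equal_priors)
h_star_skewed = decision_boundary(more_males)
print(h_star_equal, h_star_skewed)
```

With equal priors and equal variances the boundary sits midway between the means; raising the male prior moves it toward the female mean, so more heights are classified as male.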
Gaussian Generative Model (II)
• Inputs contain multiple features: $x = (x_1, x_2, \ldots, x_d)$
• Example
  • Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV
  • Input: (s: salary, h: hours spent watching TV)
  • Output: +1 (overweight), -1 (normal)
Multi-variate Gaussian Distribution
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
Input: $x = (x_1, x_2, \ldots, x_d)$
Mean: $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$, with $\mu_i = \frac{1}{N} \sum_{k=1}^{N} x_{k,i}$
Covariance matrix:
$$\Sigma = \begin{pmatrix} \sigma_{1,1} & \cdots & \sigma_{1,d} \\ \vdots & \sigma_{i,j} & \vdots \\ \sigma_{d,1} & \cdots & \sigma_{d,d} \end{pmatrix}_{d \times d}$$
$$\sigma_{i,j} = E\left[ (x_i - \mu_i)(x_j - \mu_j) \right] \approx \frac{1}{N} \sum_{k=1}^{N} (x_{k,i} - \mu_i)(x_{k,j} - \mu_j)$$
$$\Sigma = \frac{1}{N} \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T$$
Here $x_k$ is the $k$-th training example and $x_{k,i}$ is its $i$-th feature.
Properties of Covariance Matrix
$$\Sigma = \frac{1}{N} \sum_{k=1}^{N} \bar{x}_k \bar{x}_k^T, \quad \bar{x}_k = x_k - \mu, \quad x = (x_1, x_2, \ldots, x_d)$$
• What if the number of data points N < d?
• How about $a^T \Sigma a$ for any vector $a$?
• Positive semi-definite matrix
• Number of different elements in $\Sigma$?
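The questions on this slide have a short answer. A sketch of the reasoning, writing $\bar{x}_k = x_k - \mu$ for the $k$-th centered example (notation assumed here):

$$a^T \Sigma a = \frac{1}{N} \sum_{k=1}^{N} a^T \bar{x}_k \bar{x}_k^T a = \frac{1}{N} \sum_{k=1}^{N} \left( a^T \bar{x}_k \right)^2 \ge 0$$

so $\Sigma$ is positive semi-definite for any vector $a$. Because $\Sigma$ is a sum of $N$ rank-one matrices, $\text{rank}(\Sigma) \le N$, so it is singular whenever $N < d$. Because $\Sigma$ is symmetric, it has at most $d(d+1)/2$ distinct elements.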
Gaussian Generative Model (II)
Joint distribution p(s, h) for salary (s) and hours for watching TV (h)
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{2/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
Input: $x = (s, h)$
Mean: $\mu = (\mu_s, \mu_h)$, with $\mu_s = \frac{1}{N} \sum_{k=1}^{N} x_{k,s}$ and $\mu_h = \frac{1}{N} \sum_{k=1}^{N} x_{k,h}$
Covariance matrix:
$$\Sigma = \begin{pmatrix} \sigma_{s,s} & \sigma_{s,h} \\ \sigma_{h,s} & \sigma_{h,h} \end{pmatrix}$$
$$\sigma_{s,s} = \frac{1}{N} \sum_{k=1}^{N} (x_{k,s} - \mu_s)^2, \quad \sigma_{h,h} = \frac{1}{N} \sum_{k=1}^{N} (x_{k,h} - \mu_h)^2$$
$$\sigma_{s,h} = \sigma_{h,s} = \frac{1}{N} \sum_{k=1}^{N} (x_{k,s} - \mu_s)(x_{k,h} - \mu_h)$$
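For the two-feature case, the estimates and the density can be written out without any matrix library, since a 2×2 matrix has a closed-form determinant and inverse. The following is a minimal sketch under that assumption; the function names and the (salary, hours) toy points are invented for illustration.

```python
import math


def fit_gaussian_2d(points):
    """MLE mean and 2x2 covariance (1/N normalization) for (s, h) pairs."""
    n = len(points)
    mu_s = sum(s for s, _ in points) / n
    mu_h = sum(h for _, h in points) / n
    var_s = sum((s - mu_s) ** 2 for s, _ in points) / n
    var_h = sum((h - mu_h) ** 2 for _, h in points) / n
    cov_sh = sum((s - mu_s) * (h - mu_h) for s, h in points) / n
    return (mu_s, mu_h), ((var_s, cov_sh), (cov_sh, var_h))


def gaussian_2d_density(x, mu, cov):
    """Bivariate Gaussian density using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix is singular or not positive definite")
    ds, dh = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), with Sigma^{-1} = [[d, -b], [-c, a]] / det
    quad = (d * ds * ds - (b + c) * ds * dh + a * dh * dh) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


# Toy (salary in thousands, TV hours per day) pairs, invented for illustration.
toy_points = [(30.0, 4.0), (50.0, 2.0), (40.0, 3.0), (60.0, 3.0)]
mu_hat, cov_hat = fit_gaussian_2d(toy_points)
density_at_mean = gaussian_2d_density(mu_hat, mu_hat, cov_hat)
print(mu_hat, cov_hat, density_at_mean)
```

For more than two features the same computation needs a general matrix inverse and determinant, which is where a singular covariance matrix becomes a practical problem.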
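Multi-variate Gaussian Generative Model
• Input with multiple input features: $x = (x_1, x_2, \ldots, x_d)$
• A multi-variate Gaussian distribution for each class
$$p(x \mid y; \theta) \sim N(\mu_y, \Sigma_y)$$
$$p(x \mid y; \theta) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \right)$$
• Overweight: $\theta_o = (\mu_o, \Sigma_o, p_o = p(y = \text{overweight}))$
• Normal: $\theta_n = (\mu_n, \Sigma_n, p_n = p(y = \text{normal}))$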
Improve Multivariate Gaussian Model
How could we improve the model's prediction of overweight?
• Multiple modes for each class
• Introduce more attributes of individuals
  • Location
  • Occupation
  • The number of children
  • House
  • Age
  • …
Problems with Using Multi-variate Gaussian Generative Model
$$p(x \mid y; \theta) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \right)$$
• $\Sigma$ is a matrix of size d × d and contains d(d+1)/2 independent variables
  • d = 100: the number of variables in $\Sigma$ is 5,050
  • d = 1000: the number of variables in $\Sigma$ is 500,500
  • A large parameter space
• $\Sigma$ can be singular
  • If N < d
  • If two features are linearly correlated
  • In either case $\Sigma^{-1}$ does not exist
Problems with Using Multi-variate Gaussian Generative Model
$$p(x \mid y; \theta) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \right)$$
• Diagonalize $\Sigma$
$$\Sigma = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_d^2 \end{pmatrix}, \quad \sigma_i^2 = \frac{1}{N} \sum_{k=1}^{N} (x_{k,i} - \mu_i)^2, \quad \Sigma^{-1} = \begin{pmatrix} \sigma_1^{-2} & & 0 \\ & \ddots & \\ 0 & & \sigma_d^{-2} \end{pmatrix}$$
• Feature independence assumption (Naïve Bayes assumption)
$$p(x \mid y; \theta) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left( -\sum_{i=1}^{d} \frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right)$$
• Smooth the covariance matrix
$$\Sigma \leftarrow \Sigma + \epsilon I_d, \quad \epsilon > 0 \text{ is a smoothing parameter}$$
Overfitting Issue
Complex model vs. insufficient training data
Example
• Consider a classification problem with multiple inputs
  • 100 input features
  • 5 classes
  • 1000 training examples
• Total number of parameters for a full Gaussian model
  • 5 class priors → 5 parameters
  • 5 means → 500 parameters
  • 5 covariance matrices → 25,250 parameters (5 × 100 × 101 / 2)
  • 25,755 parameters in total → insufficient training data
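The parameter count can be checked with a small helper. This is a sketch that counts each symmetric covariance matrix as d(d+1)/2 free entries, the same count used on the earlier slide; the function names are hypothetical, and the diagonal option anticipates the feature independence assumption.

```python
def covariance_entries(num_features):
    """Free entries of one symmetric d x d covariance matrix: d(d+1)/2."""
    return num_features * (num_features + 1) // 2


def gaussian_param_counts(num_features, num_classes, diagonal=False):
    """Parameter counts for a model with one Gaussian per class."""
    # A diagonal covariance keeps only the d per-feature variances.
    per_class_cov = num_features if diagonal else covariance_entries(num_features)
    counts = {
        "priors": num_classes,
        "means": num_classes * num_features,
        "covariances": num_classes * per_class_cov,
    }
    counts["total"] = sum(counts.values())
    return counts


full = gaussian_param_counts(100, 5)
diag = gaussian_param_counts(100, 5, diagonal=True)
print(full)
print(diag)
```

Comparing the two totals against the 1000 training examples shows why restricting the covariance structure matters when data is limited.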
Model Complexity Vs. Data
[Figure: sequence of four plots; x-axis from about -6 to 6 (about -8 to 8 in the last plot), y-axis from about -1 to 1]
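Problems with Using Multi-variate Gaussian Generative Model
Diagonalize $\Sigma$: feature independence assumption
$$p(x \mid y; \theta) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left( -\sum_{i=1}^{d} \frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right) = \prod_{i=1}^{d} p(x_i \mid y; \theta)$$
$$p(x_i \mid y; \theta) \sim N(\mu_i, \sigma_i)$$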
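The factorization can be verified numerically: the log density of a Gaussian with diagonal covariance equals the sum of per-feature 1-D Gaussian log densities. A minimal sketch, with hypothetical function names and arbitrary test values:

```python
import math


def log_gaussian_1d(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def naive_bayes_log_density(x, mus, variances):
    """Sum of per-feature log densities: log prod_i p(x_i | y)."""
    return sum(log_gaussian_1d(xi, mi, vi)
               for xi, mi, vi in zip(x, mus, variances))


def diagonal_gaussian_log_density(x, mus, variances):
    """Log density of a multivariate Gaussian with Sigma = diag(variances)."""
    d = len(x)
    # log of 1 / ((2 pi)^{d/2} |Sigma|^{1/2}), where |Sigma| = prod_i sigma_i^2
    log_norm = -0.5 * d * math.log(2 * math.pi) - 0.5 * sum(math.log(v) for v in variances)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mus, variances))
    return log_norm - 0.5 * quad


x_example = [1.0, 2.0, -0.5]
mus_example = [0.0, 1.0, 0.0]
vars_example = [1.0, 4.0, 0.25]
lhs = diagonal_gaussian_log_density(x_example, mus_example, vars_example)
rhs = naive_bayes_log_density(x_example, mus_example, vars_example)
print(lhs, rhs)
```

Working in log space avoids underflow when d is large, since the product of many small densities quickly falls below floating-point range.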
Naïve Bayes Model
• In general, for any generative model, we have to estimate $p(x \mid y; \theta)$ (or, $p(x; \theta_y)$)
• For x in a high-dimensional space, this probability is hard to estimate
• In the Naïve Bayes model, we approximate $p(x \mid y; \theta)$ by
$$p(x \mid y; \theta) \approx \prod_{i=1}^{d} p(x_i \mid y; \theta) = p(x_1 \mid y; \theta)\, p(x_2 \mid y; \theta) \cdots p(x_d \mid y; \theta)$$
Text Categorization
• Learn to classify text into predefined categories
• Input x: a document
  • Represented by a vector of words
  • Example: {(president, 10), (bush, 2), (election, 5), …}
• Output y: whether the document is about politics or not
  • +1 for a political document, -1 for a non-political document
Text Categorization
A generative model for text classification (TC)
$$p(y \mid doc) \propto p(y)\, p(doc \mid y)$$
• Parameter space
  • p(+) and p(-)
  • $p(doc \mid +; \theta)$, $p(doc \mid -; \theta)$
• It is difficult to estimate both $p(doc \mid +; \theta)$ and $p(doc \mid -; \theta)$
  • Typical vocabulary size ~ 100,000
  • Each document is a vector of 100,000 attributes!
  • Too many words in a document
• A Naïve Bayes approach
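Text Classification
A Naïve Bayes approach
• For a document $doc = \{(w_1, t_1), (w_2, t_2), \ldots, (w_n, t_n)\}$, where $t_i$ is the count of word $w_i$
$$p(doc \mid +) = p(w_1 \mid +)^{t_1}\, p(w_2 \mid +)^{t_2} \cdots p(w_n \mid +)^{t_n} = \prod_{i=1}^{n} p(w_i \mid +)^{t_i}$$
$$p(doc \mid -) = p(w_1 \mid -)^{t_1}\, p(w_2 \mid -)^{t_2} \cdots p(w_n \mid -)^{t_n} = \prod_{i=1}^{n} p(w_i \mid -)^{t_i}$$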
Text Classification
• The original parameter space
  • p(+) and p(-)
  • $p(doc \mid +; \theta)$, $p(doc \mid -; \theta)$
• Parameter space after the Naïve Bayes simplification
  • p(+) and p(-)
  • $\{p(w_1 \mid +), p(w_2 \mid +), \ldots, p(w_n \mid +)\}$
  • $\{p(w_1 \mid -), p(w_2 \mid -), \ldots, p(w_n \mid -)\}$
Text Classification
Learning parameters from training examples
• Training documents: $\{d^+_1, d^+_2, \ldots, d^+_{n_+};\ d^-_1, d^-_2, \ldots, d^-_{n_-}\}$, with $N = n_+ + n_-$
• Each document: $d_i = \{(w_1, t_{i,1}), (w_2, t_{i,2}), \ldots, (w_n, t_{i,n})\}$
• Learn parameters using maximum likelihood estimation
Text Classification
Training documents $\{d^+_1, \ldots, d^+_{n_+};\ d^-_1, \ldots, d^-_{n_-}\}$, with $N = n_+ + n_-$ and $d_i = \{(w_1, t_{i,1}), (w_2, t_{i,2}), \ldots, (w_n, t_{i,n})\}$:
$$\begin{aligned}
l &= \sum_{i=1}^{n_+} \log p(+ \mid d^+_i) + \sum_{i=1}^{n_-} \log p(- \mid d^-_i) \\
&\to \sum_{i=1}^{n_+} \log \left[ p(+) \prod_{j=1}^{n} p(w_j \mid +)^{t^+_{i,j}} \right] + \sum_{i=1}^{n_-} \log \left[ p(-) \prod_{j=1}^{n} p(w_j \mid -)^{t^-_{i,j}} \right] \\
&= \sum_{i=1}^{n_+} \left[ \log p(+) + \sum_{j=1}^{n} t^+_{i,j} \log p(w_j \mid +) \right] + \sum_{i=1}^{n_-} \left[ \log p(-) + \sum_{j=1}^{n} t^-_{i,j} \log p(w_j \mid -) \right]
\end{aligned}$$
As in the general generative model, the second line drops the $\log p(d_i)$ term and keeps the joint log-likelihood.
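Text Classification
The optimal solution that maximizes the likelihood of the training data:
$$p(+) = \frac{n_+}{N}, \quad p(-) = \frac{n_-}{N}$$
$$p(w_j \mid +) = \frac{\sum_{i=1}^{n_+} t^+_{i,j}}{\sum_{j'=1}^{n} \sum_{i=1}^{n_+} t^+_{i,j'}}, \quad p(w_j \mid -) = \frac{\sum_{i=1}^{n_-} t^-_{i,j}}{\sum_{j'=1}^{n} \sum_{i=1}^{n_-} t^-_{i,j'}}$$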
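These count-based estimates translate directly into code. The sketch below assumes each document is given as a word-to-count dictionary; the function name and the tiny corpus are invented for illustration.

```python
from collections import Counter


def fit_naive_bayes_text(docs_by_class):
    """MLE for class priors and per-class word probabilities.

    docs_by_class maps a class label to a list of documents, where each
    document is a dict {word: count}.
    """
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for label, docs in docs_by_class.items():
        # p(y) = (number of documents in class y) / (total documents)
        priors[label] = len(docs) / n_docs
        totals = Counter()
        for doc in docs:
            totals.update(doc)  # adds the word counts of this document
        total_count = sum(totals.values())
        # p(w | y) = (count of w in class y) / (all word counts in class y)
        word_probs[label] = {w: c / total_count for w, c in totals.items()}
    return priors, word_probs


# Tiny made-up corpus: "+" for political, "-" for non-political documents.
toy_corpus = {
    "+": [{"president": 2, "election": 1}, {"election": 2, "vote": 1}],
    "-": [{"game": 3, "vote": 1}],
}
toy_priors, toy_word_probs = fit_naive_bayes_text(toy_corpus)
print(toy_priors)
print(toy_word_probs)
```

Words that never occur in a class simply get no entry here, which corresponds to an estimated probability of zero.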
Text Classification
Twenty Newsgroups: An Example
Text Classification
Any problems with the Naïve Bayes text classifier?
• Unseen words
  • If word ‘w’ is unseen in the training documents, what is the consequence?
  • If word ‘w’ is unseen only in documents of one class, what is the consequence?
• Related to the overfitting problem
• Any suggestions?
• Solution: word class approach
  • Introduce word classes $T = \{t_1, t_2, \ldots, t_m\}$
  • Compute $p(t_i \mid +)$, $p(t_i \mid -)$
  • When w has not been seen before, replace $p(w \mid \pm)$ with $p(t_i \mid \pm)$
• Introduce a prior for word probabilities
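To make the unseen-word problem concrete, the sketch below scores a document under plain MLE word probabilities and under add-α (Laplace) smoothing. Add-α smoothing is one common way to realize a prior on word probabilities (it can be read as a symmetric Dirichlet prior); the slides do not say which prior is intended, so treat this choice, the function names, and the toy data as assumptions.

```python
import math
from collections import Counter


def word_probs(docs, vocabulary, alpha=0.0):
    """Per-class word probabilities; alpha=0 is the MLE, alpha>0 is add-alpha."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    denom = sum(totals.values()) + alpha * len(vocabulary)
    return {w: (totals[w] + alpha) / denom for w in vocabulary}


def log_score(doc, prior, probs):
    """log p(y) + sum_j t_j log p(w_j | y); -inf if any word has probability 0."""
    score = math.log(prior)
    for word, count in doc.items():
        p = probs.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        score += count * math.log(p)
    return score


# Made-up data: "game" never appears in the political training documents.
vocab = ["president", "election", "game"]
political_docs = [{"president": 2, "election": 1}]
new_doc = {"president": 1, "game": 1}

mle_probs = word_probs(political_docs, vocab)
smoothed_probs = word_probs(political_docs, vocab, alpha=1.0)
mle_score = log_score(new_doc, 0.5, mle_probs)
smoothed_score = log_score(new_doc, 0.5, smoothed_probs)
print(mle_score, smoothed_score)
```

Under the MLE a single unseen word forces the class probability to zero regardless of the rest of the document, while the smoothed estimate keeps the score finite.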
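Naïve Bayes Model
$$p(x \mid y; \theta) \approx \prod_{i=1}^{d} p(x_i \mid y; \theta)$$
This is a terrible approximation, e.g.
$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \quad \to \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$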
Naïve Bayes Model
Why use the Naïve Bayes model?
• We are essentially interested in $p(y \mid x; \theta)$, not $p(x \mid y; \theta)$
$$p(y \mid x; \theta) = \frac{p(y; \theta)\, p(x \mid y; \theta)}{p(x; \theta)} = \frac{p(y; \theta)\, p(x \mid y; \theta)}{\sum_{y'=1}^{c} p(y'; \theta)\, p(x \mid y'; \theta)} = \frac{1}{\sum_{y'=1}^{c} \dfrac{p(y'; \theta)\, p(x \mid y'; \theta)}{p(y; \theta)\, p(x \mid y; \theta)}}$$
Naïve Bayes Model
• The key for the prediction model is not $p(x \mid y; \theta)$, but the ratio $p(x \mid y; \theta) / p(x \mid y'; \theta)$
• Although the Naïve Bayes model does a poor job of estimating $p(x \mid y; \theta)$, it does a reasonably good job of estimating the ratio
The Ratio of Likelihood for Binary Classes
Assume that both classes share the same variance
$$\begin{aligned}
\log \frac{p(y = +1)\, p(x \mid y = +1)}{p(y = -1)\, p(x \mid y = -1)}
&= \sum_{i=1}^{d} \left[ -\frac{(x_i - \mu_{i,+})^2}{2\sigma_i^2} + \frac{(x_i - \mu_{i,-})^2}{2\sigma_i^2} \right] + \log \frac{p(y = +1)}{p(y = -1)} \\
&= \sum_{i=1}^{d} \frac{\mu_{i,+} - \mu_{i,-}}{\sigma_i^2}\, x_i + \sum_{i=1}^{d} \frac{\mu_{i,-}^2 - \mu_{i,+}^2}{2\sigma_i^2} + \log \frac{p(y = +1)}{p(y = -1)}
\end{aligned}$$
Gaussian generative model is a linear model
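The linearity can be checked numerically: with shared per-feature variances, the log ratio equals w·x + b for weights and bias read off the last line of the derivation. A minimal sketch with hypothetical function names and arbitrary parameter values:

```python
import math


def linear_form(mu_pos, mu_neg, variances, prior_pos, prior_neg):
    """Weights w and bias b such that the log ratio equals w . x + b."""
    w = [(mp - mn) / v for mp, mn, v in zip(mu_pos, mu_neg, variances)]
    b = sum((mn ** 2 - mp ** 2) / (2 * v)
            for mp, mn, v in zip(mu_pos, mu_neg, variances))
    b += math.log(prior_pos / prior_neg)
    return w, b


def log_ratio_direct(x, mu_pos, mu_neg, variances, prior_pos, prior_neg):
    """log [p(+1) p(x|+1)] - log [p(-1) p(x|-1)] from the Gaussian densities."""
    def log_lik(mus):
        return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                   for xi, m, v in zip(x, mus, variances))
    return (math.log(prior_pos) + log_lik(mu_pos)
            - math.log(prior_neg) - log_lik(mu_neg))


# Arbitrary illustrative parameters with shared variances across classes.
mu_pos, mu_neg = [1.0, 2.0], [0.0, -1.0]
shared_vars = [1.0, 2.0]
x_test = [0.5, 0.2]
w, b = linear_form(mu_pos, mu_neg, shared_vars, 0.3, 0.7)
linear_value = sum(wi * xi for wi, xi in zip(w, x_test)) + b
direct_value = log_ratio_direct(x_test, mu_pos, mu_neg, shared_vars, 0.3, 0.7)
print(linear_value, direct_value)
```

The quadratic terms in x cancel only because the variances are shared; with class-specific variances the log ratio is quadratic in x and the boundary is no longer linear.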
Linear Decision Boundary
• Gaussian Generative Models == Finding a linear decision boundary
• Why not directly estimate the decision boundary?