generative models rong jin. statistical inference training exampleslearning a statistical model ...

Generative Models

Rong Jin

Statistical Inference

Training Examples: $\{x_1, x_2, \ldots, x_n\}$

Learning a Statistical Model: $p(x;\theta)$

Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

[Figure: histogram of height (m, 1.2 to 2.1) against number of people (1 to 10). Female: Gaussian distribution $N(\mu_1, \sigma_1)$. Male: Gaussian distribution $N(\mu_2, \sigma_2)$.]

Statistical Inference

Training Examples: $\{x_1, x_2, \ldots, x_n\}$

Learning a Statistical Model: $p(y|x;\theta)$

Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

[Figure: histogram of height (m, 1.2 to 2.1) against number of people (1 to 10). Male: Gaussian distribution $N(\mu_1, \sigma_1)$. Female: Gaussian distribution $N(\mu_2, \sigma_2)$.]

Probabilistic Models for Classification Problems

• Apply statistical inference methods

• Given training examples $\{(x_i, y_i)\}_{i=1}^{n}$

• Assume a parametric model $p(y|x;\theta)$

• Learn the model parameters $\theta$ from the training examples using the maximum likelihood approach

• The class of a new instance $x$ is predicted by

$$y^* = \arg\max_{y \in \mathcal{Y}} p(y|x;\theta)$$



Maximum Likelihood Estimation (MLE)

• Given training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

• Compute the log-likelihood of the data

$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i|x_i;\theta)$$

• Find the parameters that maximize the log-likelihood

$$\theta^* = \arg\max_{\theta} l(D_{train}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i|x_i;\theta)$$

• In many cases the maximizer of the log-likelihood has no closed-form expression, and therefore MLE requires numerical calculation
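To make "MLE requires numerical calculation" concrete, here is a minimal sketch that maximizes the log-likelihood by gradient ascent. The model (a one-feature logistic model for $p(y|x;\theta)$), the toy data, the step size, and the function names are all assumptions chosen for illustration; the slides do not prescribe a particular model or optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, data):
    """l(D_train) = sum_i log p(y_i | x_i; theta) for an assumed logistic model."""
    w, b = theta
    total = 0.0
    for x, y in data:
        p1 = sigmoid(w * x + b)          # p(y = 1 | x; theta)
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

def fit_mle(data, step=0.1, iters=2000):
    """Gradient ascent on the log-likelihood; no closed-form maximizer exists here."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        gw = gb = 0.0
        for x, y in data:
            err = y - sigmoid(w * x + b)  # gradient of log p(y | x) w.r.t. (w x + b)
            gw += err * x
            gb += err
        w += step * gw / len(data)
        b += step * gb / len(data)
    return w, b

# Hypothetical, non-separable toy data: (x, y) with y in {0, 1}.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
theta_star = fit_mle(data)
```

The fitted `theta_star` should give a higher log-likelihood than the starting point `(0.0, 0.0)`, which is the only property this sketch relies on.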



Generative Models

• Most probabilistic distributions are joint distributions (i.e., $p(x;\theta)$), not conditional distributions (i.e., $p(y|x;\theta)$)

• Using Bayes' rule, $p(y|x;\theta)$ is obtained from $\{p(x|y;\theta),\ p(y;\theta)\}$:

$$p(y|x;\theta) = \frac{p(y;\theta)\, p(x|y;\theta)}{p(x;\theta)}$$

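Generative Models (cont’d)

• Treatment of $p(x|y;\theta)$

• Let $y \in \mathcal{Y} = \{1, 2, \ldots, c\}$

• Allocate a separate set of parameters for each class: $\theta \to \{\theta_1, \theta_2, \ldots, \theta_c\}$

• $p(x|y;\theta) \to p(x;\theta_y)$

• Data in different classes have different input patterns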

Generative Models (cont’d)

• Parameter space

  • Parameters for distribution: $\{\theta_1, \theta_2, \ldots, \theta_c\}$

  • Class priors: $\{p(y=1), p(y=2), \ldots, p(y=c)\}$

• Learn parameters from training examples using MLE

  • Compute the log-likelihood, using Bayes' rule $p(y|x;\theta) = p(y;\theta)\,p(x|y;\theta)/p(x;\theta)$

$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i|x_i;\theta) = \sum_{i=1}^{n} \left[\log p(x_i;\theta_{y_i}) + \log p(y_i) - \log p(x_i)\right]$$

  • Search for the optimal parameters by maximizing the log-likelihood. The $\log p(x_i)$ term is dropped, so the search maximizes the joint log-likelihood of $(x_i, y_i)$:

$$\max_{\theta} l(D_{train}) \approx \max_{\theta} \sum_{i=1}^{n} \log p(y_i)\, p(x_i;\theta_{y_i})$$

Example

• Task: predict gender of individuals based on their heights

• Given

• 100 height examples of women

• 100 height examples of men

• Assume the heights of women and men follow different Gaussian distributions

[Figure: histograms of height (m, 1.1 to 2.0) against count (0 to 40): empirical data for male, empirical data for female.]

Example (cont’d)

• Gaussian distribution

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad \theta = (\mu, \sigma)$$

• Parameter space

  • Gaussian distribution for male: $(\mu_m, \sigma_m)$

  • Gaussian distribution for female: $(\mu_f, \sigma_f)$

  • Class priors: $p_m = p(y=\text{male})$, $p_f = p(y=\text{female})$

Example (cont’d)

Given training examples $\{h^m_1, h^m_2, \ldots, h^m_{N_m}\}$ (male) and $\{h^f_1, h^f_2, \ldots, h^f_{N_f}\}$ (female), with $N = N_m + N_f$:

$$
\begin{aligned}
l &= \sum_{i=1}^{N} \left[\log p(h_i|y_i) + \log p(y_i)\right] \\
  &= \sum_{i=1}^{N_m} \left[\log p(h^m_i;\mu_m,\sigma_m) + \log p_m\right] + \sum_{i=1}^{N_f} \left[\log p(h^f_i;\mu_f,\sigma_f) + \log p_f\right] \\
  &= \sum_{i=1}^{N_m} \log\left[\frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(h^m_i-\mu_m)^2}{2\sigma_m^2}\right)\right]
   + \sum_{i=1}^{N_f} \log\left[\frac{1}{\sqrt{2\pi\sigma_f^2}} \exp\left(-\frac{(h^f_i-\mu_f)^2}{2\sigma_f^2}\right)\right]
   + N_m \log p_m + N_f \log p_f
\end{aligned}
$$

Learn a Gaussian generative model

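Example (cont’d)

$$\{\mu_m, \sigma_m, p_m;\ \mu_f, \sigma_f, p_f\}^* = \arg\max\ l$$

$$\mu_m = \frac{1}{N_m}\sum_{i=1}^{N_m} h^m_i, \qquad \sigma_m^2 = \frac{1}{N_m}\sum_{i=1}^{N_m} (h^m_i - \mu_m)^2, \qquad p_m = \frac{N_m}{N}$$

$$\mu_f = \frac{1}{N_f}\sum_{i=1}^{N_f} h^f_i, \qquad \sigma_f^2 = \frac{1}{N_f}\sum_{i=1}^{N_f} (h^f_i - \mu_f)^2, \qquad p_f = \frac{N_f}{N}$$

where $N = N_m + N_f$.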
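The closed-form estimates above translate directly into a few lines of code. The sketch below is illustrative only: the function name `fit_gaussian_class` and the height values are hypothetical, not taken from the slides' data set.

```python
import math

def fit_gaussian_class(heights, n_total):
    """MLE for one class: sample mean, (biased) standard deviation, class prior."""
    n = len(heights)
    mu = sum(heights) / n
    var = sum((h - mu) ** 2 for h in heights) / n   # divides by N, as in the MLE
    return mu, math.sqrt(var), n / n_total

# Hypothetical heights in metres.
male = [1.70, 1.80, 1.75, 1.85]
female = [1.60, 1.65, 1.55]
n_total = len(male) + len(female)

mu_m, sigma_m, p_m = fit_gaussian_class(male, n_total)
mu_f, sigma_f, p_f = fit_gaussian_class(female, n_total)
```

Note that the MLE variance divides by the class count, not by the count minus one, so it differs slightly from the unbiased sample variance.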

Example (cont’d)

[Figure: histograms of height (m, 1.1 to 2.0) against count (0 to 40): empirical data and fitted distribution for male, empirical data and fitted distribution for female.]

Predict the gender of an individual given his/her height

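Example (cont’d)

$$p(\text{male}|h) \propto p_m\, p(h|\mu_m,\sigma_m) = \frac{p_m}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(h-\mu_m)^2}{2\sigma_m^2}\right)$$

$$p(\text{female}|h) \propto p_f\, p(h|\mu_f,\sigma_f) = \frac{p_f}{\sqrt{2\pi\sigma_f^2}} \exp\left(-\frac{(h-\mu_f)^2}{2\sigma_f^2}\right)$$

[Figure: fitted distributions over height (m, 1.1 to 2.0), count axis 0 to 40.]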


Decision Boundary

• Decision boundary $h^*$

  • Predict female when $h < h^*$

  • Predict male when $h > h^*$

  • Random when $h = h^*$

• Where is the decision boundary?

  • It depends on the ratio $p_m/p_f$

[Figure: fitted distributions over height (m, 1.1 to 2.0) with the decision boundary $h^*$ marked, shown for $p_f < p_m$ and for $p_f > p_m$.]
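A small sketch of how the boundary $h^*$ moves with the ratio $p_m/p_f$. All parameter values are hypothetical, and the bisection search is just one simple way to locate the point where the two posteriors are equal. It assumes a single crossing inside the search bracket, which holds here because both classes use the same $\sigma$; with unequal variances there can be two crossings.

```python
import math

def gaussian_pdf(h, mu, sigma):
    return math.exp(-(h - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_male(h, mu_m, sigma_m, p_m, mu_f, sigma_f, p_f):
    """p(male | h) by Bayes' rule with Gaussian class-conditional densities."""
    joint_m = p_m * gaussian_pdf(h, mu_m, sigma_m)
    joint_f = p_f * gaussian_pdf(h, mu_f, sigma_f)
    return joint_m / (joint_m + joint_f)

def decision_boundary(params, lo, hi, tol=1e-10):
    """Bisection for h* with p(male | h*) = 0.5, assuming one crossing in [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_male(mid, *params) < 0.5:
            lo = mid      # still on the "female" side
        else:
            hi = mid      # already on the "male" side
    return (lo + hi) / 2

# Hypothetical parameters: (mu_m, sigma_m, p_m, mu_f, sigma_f, p_f).
params_equal = (1.78, 0.07, 0.5, 1.64, 0.07, 0.5)
params_more_men = (1.78, 0.07, 0.7, 1.64, 0.07, 0.3)

h_equal = decision_boundary(params_equal, lo=1.64, hi=1.78)
h_more_men = decision_boundary(params_more_men, lo=1.64, hi=1.78)
```

With equal priors the boundary sits midway between the two means; raising $p_m$ pushes $h^*$ toward the female mean, so more heights are labeled male.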

Gaussian Generative Model (II)

• Inputs contain multiple features: $x = (x_1, x_2, \ldots, x_d)$

• Example

  • Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV

  • Input: (s: salary, h: hours spent watching TV)

  • Output: +1 (overweight), -1 (normal)

Multi-variate Gaussian Distribution

$$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

• Input: $x = (x_1, x_2, \ldots, x_d)$

• Mean: $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$, with $\mu_i = \frac{1}{N}\sum_{k=1}^{N} x_{i,k}$, where $x_{i,k}$ is feature $i$ of the $k$-th training example

• Covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{1,1} & \cdots & \sigma_{1,d} \\ \vdots & \sigma_{i,j} & \vdots \\ \sigma_{d,1} & \cdots & \sigma_{d,d} \end{pmatrix}, \qquad \sigma_{i,j} = E\left[(x_i-\mu_i)(x_j-\mu_j)\right] = \frac{1}{N}\sum_{k=1}^{N} (x_{i,k}-\mu_i)(x_{j,k}-\mu_j)$$

• In matrix form, with $x_k$ the $k$-th training example:

$$\Sigma = \frac{1}{N}\sum_{k=1}^{N} (x_k-\mu)(x_k-\mu)^T$$

Properties of Covariance Matrix

$$\Sigma = \frac{1}{N}\sum_{k=1}^{N} (x_k-\mu)(x_k-\mu)^T, \qquad x = (x_1, x_2, \ldots, x_d)$$

• What if the number of data points N < d?

• How about $a^T \Sigma a$ for any vector $a$?

  • $a^T \Sigma a = \frac{1}{N}\sum_{k=1}^{N} \left(a^T (x_k-\mu)\right)^2 \ge 0$: $\Sigma$ is a positive semi-definite matrix

• Number of different elements in $\Sigma$?
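The two questions on this slide can be checked numerically. In the sketch below, with hypothetical data of N = 2 points in d = 3 dimensions, the sample covariance is positive semi-definite but singular. The helper names are illustrative.

```python
def covariance(points):
    """Sample covariance (1/N) * sum_k (x_k - mu)(x_k - mu)^T as a list of lists."""
    n, d = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / n for i in range(d)]
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def quad_form(a, cov):
    """a^T Sigma a."""
    d = len(a)
    return sum(a[i] * cov[i][j] * a[j] for i in range(d) for j in range(d))

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

points = [(1.0, 2.0, 0.0), (3.0, 0.0, 4.0)]   # hypothetical data, N = 2 < d = 3
cov = covariance(points)

test_vectors = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (-2.0, 0.5, 3.0)]
quad_values = [quad_form(a, cov) for a in test_vectors]   # all non-negative
determinant = det3(cov)                                   # zero: Sigma is singular
```

Because the matrix is symmetric, only the d(d+1)/2 entries on and above the diagonal are distinct, which is 6 for d = 3.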

Gaussian Generative Model (II)

Joint distribution p(s,h) for salary (s) and hours spent watching TV (h)

$$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{2/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

• Input: $x = (s, h)$

• Mean: $\mu = (\mu_s, \mu_h)$, with $\mu_s = \frac{1}{N}\sum_{k=1}^{N} x_{s,k}$ and $\mu_h = \frac{1}{N}\sum_{k=1}^{N} x_{h,k}$

• Covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{s,s} & \sigma_{s,h} \\ \sigma_{h,s} & \sigma_{h,h} \end{pmatrix}$$

$$\sigma_{s,s} = \frac{1}{N}\sum_{k=1}^{N} (x_{s,k}-\mu_s)^2, \qquad \sigma_{h,h} = \frac{1}{N}\sum_{k=1}^{N} (x_{h,k}-\mu_h)^2$$

$$\sigma_{s,h} = \sigma_{h,s} = \frac{1}{N}\sum_{k=1}^{N} (x_{s,k}-\mu_s)(x_{h,k}-\mu_h)$$
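For the two-feature case the estimates and the density can be written out without any matrix library. The sketch below uses made-up (salary, hours) pairs in arbitrary units; `fit_gaussian_2d` and `density_2d` are illustrative names, and the 2x2 inverse is expanded by hand.

```python
import math

def fit_gaussian_2d(points):
    """MLE of the mean and covariance for points given as (s, h) pairs."""
    n = len(points)
    mu_s = sum(s for s, _ in points) / n
    mu_h = sum(h for _, h in points) / n
    s_ss = sum((s - mu_s) ** 2 for s, _ in points) / n
    s_hh = sum((h - mu_h) ** 2 for _, h in points) / n
    s_sh = sum((s - mu_s) * (h - mu_h) for s, h in points) / n
    return (mu_s, mu_h), ((s_ss, s_sh), (s_sh, s_hh))

def density_2d(x, mu, cov):
    """Bivariate Gaussian density; requires a non-singular covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    ds, dh = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the explicit inverse of a 2x2 matrix.
    quad = (d * ds * ds - 2 * b * ds * dh + a * dh * dh) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Hypothetical (salary, hours) observations in arbitrary units.
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
mu, cov = fit_gaussian_2d(points)
peak = density_2d(mu, mu, cov)   # density at the mean
```

In a generative classifier one such pair (mean, covariance) would be fitted per class and combined with the class prior.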

Multi-variate Gaussian Generative Model

• Input with multiple input features: $x = (x_1, x_2, \ldots, x_d)$

• A multi-variate Gaussian distribution for each class

$$p(x|y;\theta) \sim N(\mu_y, \Sigma_y)$$

$$p(x|y;\theta) = \frac{1}{(2\pi)^{d/2}|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1} (x-\mu_y)\right)$$

• Overweight: $\theta_o = (\mu_o, \Sigma_o, p_o = p(y=\text{overweight}))$

• Normal: $\theta_n = (\mu_n, \Sigma_n, p_n = p(y=\text{normal}))$

Improve Multivariate Gaussian Model

• How could we improve the model's prediction for overweight?

  • Multiple modes for each class

  • Introduce more attributes of individuals: location, occupation, number of children, house, age, …

Problems with Using Multi-variate Gaussian Generative Model

$$p(x|y;\theta) = \frac{1}{(2\pi)^{d/2}|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1} (x-\mu_y)\right)$$

• $\Sigma_y$ is a matrix of size d×d and contains d(d+1)/2 independent variables

  • d=100: the number of variables in $\Sigma_y$ is 5,050

  • d=1000: the number of variables in $\Sigma_y$ is 500,500

  • A large parameter space

• $\Sigma_y$ can be singular

  • If N < d

  • If two features are linearly correlated

  • In either case $\Sigma_y^{-1}$ does not exist


• Diagonalize $\Sigma$

$$\Sigma = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_d^2 \end{pmatrix}, \qquad \sigma_i^2 = \frac{1}{N}\sum_{k=1}^{N} (x_{i,k}-\mu_i)^2, \qquad \Sigma^{-1} = \begin{pmatrix} \sigma_1^{-2} & & 0 \\ & \ddots & \\ 0 & & \sigma_d^{-2} \end{pmatrix}$$

• Feature independence assumption (Naïve Bayes assumption)

$$p(x|y;\theta) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\left(-\sum_{i=1}^{d} \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$

• Smooth the covariance matrix

$$\Sigma \leftarrow \Sigma + \varepsilon I_d, \qquad \varepsilon > 0 \text{ is a smoothing parameter}$$

Overfitting Issue

• Complex model vs. insufficient training data

• Example

  • Consider a classification problem with multiple inputs: 100 input features, 5 classes, 1000 training examples

  • Total number of parameters for a full Gaussian model, counting d(d+1)/2 = 5,050 independent entries per covariance matrix:

    • 5 class priors: 5 parameters

    • 5 means: 500 parameters

    • 5 covariance matrices: 25,250 parameters

    • 25,755 parameters in total: insufficient training data
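The parameter count can be reproduced with a short helper. This sketch assumes each symmetric covariance matrix is counted by its d(d+1)/2 distinct entries, as on the earlier slide about problems with the multi-variate model, and counts one prior per class. The function names are illustrative.

```python
def full_gaussian_param_count(d, c):
    """Parameters of a generative model with one full Gaussian per class."""
    priors = c
    means = c * d
    covariances = c * d * (d + 1) // 2   # distinct entries of a symmetric d x d matrix
    return priors, means, covariances, priors + means + covariances

def diagonal_gaussian_param_count(d, c):
    """Same model with diagonal covariances (feature independence assumption)."""
    return c + c * d + c * d

full = full_gaussian_param_count(d=100, c=5)
diagonal = diagonal_gaussian_param_count(d=100, c=5)
```

With 1000 training examples, the full model has far more parameters than data points, while the diagonal model has roughly one parameter per example.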

Model Complexity Vs. Data

[Figure: four plots; x-axes span -6 to 6 (-8 to 8 in the last plot), y-axes lie within about -1 to 1.]

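Problems with Using Multi-variate Gaussian Generative Model

• Diagonalize $\Sigma$

• Feature independence assumption

$$
\begin{aligned}
p(x|y;\theta) &= \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\left(-\sum_{i=1}^{d} \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right) \\
&= \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{d} p(x_i|y;\theta)
\end{aligned}
$$

$$p(x_i|y;\theta) \sim N(\mu_i, \sigma_i^2)$$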

Naïve Bayes Model

• In general, for any generative model, we have to estimate $p(x|y;\theta)$ (or, $p(x|\theta_y)$)

• For x in a high-dimensional space, this probability is hard to estimate

• In the Naïve Bayes Model, we approximate

$$p(x|y;\theta) \approx \prod_{i=1}^{d} p(x_i|y;\theta) = p(x_1|y;\theta)\, p(x_2|y;\theta) \cdots p(x_d|y;\theta)$$
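As an illustration of the factorization, the sketch below fits one univariate Gaussian per feature and per class and predicts by comparing log joint scores. It is a minimal example under stated assumptions: the data are made up, the function names are hypothetical, and the small `eps` added to each variance plays the role of a smoothing term to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Per class: prior, per-feature means, per-feature variances (MLE plus eps)."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n, d = len(rows), len(rows[0])
        mu = [sum(r[i] for r in rows) / n for i in range(d)]
        var = [sum((r[i] - mu[i]) ** 2 for r in rows) / n + eps for i in range(d)]
        model[label] = (n / len(X), mu, var)
    return model

def log_joint(x, prior, mu, var):
    """log p(y) + sum_i log p(x_i | y), the Naive Bayes factorization in log space."""
    score = math.log(prior)
    for xi, m, v in zip(x, mu, var):
        score += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
    return score

def predict(model, x):
    return max(model, key=lambda label: log_joint(x, *model[label]))

# Hypothetical two-feature data with labels +1 / -1.
X = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.3), (3.0, 1.0), (3.2, 0.9), (2.9, 1.2)]
y = [1, 1, 1, -1, -1, -1]
model = fit_gaussian_nb(X, y)
```

Working in log space avoids numerical underflow when the product runs over many features.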

Text Categorization

• Learn to classify text into predefined categories

• Input x: a document

  • Represented by a vector of words

  • Example: {(president, 10), (bush, 2), (election, 5), …}

• Output y: whether the document is about politics or not

  • +1 for a political document, -1 for a non-political document

Text Categorization

• A generative model for text classification (TC)

$$p(y|doc) \propto p(y)\, p(doc|y)$$

• Parameter space

  • p(+) and p(-)

  • $p(doc|+;\theta)$, $p(doc|-;\theta)$

• It is difficult to estimate both $p(doc|+;\theta)$ and $p(doc|-;\theta)$

  • Typical vocabulary size ~ 100,000

  • Each document is a vector of 100,000 attributes!

  • Too many words in a document

• A Naïve Bayes approach


Text Classification

• A Naïve Bayes approach

• For a document $doc = \{(w_1, t_1), (w_2, t_2), \ldots, (w_n, t_n)\}$, where $t_i$ is the number of occurrences of word $w_i$:

$$p(doc|\pm) = \left[p(w_1|\pm)\right]^{t_1} \left[p(w_2|\pm)\right]^{t_2} \cdots \left[p(w_n|\pm)\right]^{t_n} = \prod_{i=1}^{n} \left[p(w_i|\pm)\right]^{t_i}$$

Text Classification

• The original parameter space

  • p(+) and p(-)

  • $p(doc|+;\theta)$, $p(doc|-;\theta)$

• Parameter space after Naïve Bayes simplification

  • p(+) and p(-)

  • $\{p(w_1|+), p(w_2|+), \ldots, p(w_n|+)\}$

  • $\{p(w_1|-), p(w_2|-), \ldots, p(w_n|-)\}$

Text Classification

• Learning parameters from training examples

$$D = \{d^+_1, d^+_2, \ldots, d^+_{n_+};\ d^-_1, d^-_2, \ldots, d^-_{n_-}\}, \qquad N = n_+ + n_-$$

• Each document $d_i = \{(w_1, t_{i,1}), (w_2, t_{i,2}), \ldots, (w_n, t_{i,n})\}$

• Learn parameters using maximum likelihood estimation, dropping the $\log p(d_i)$ terms:

$$
\begin{aligned}
l &= \sum_{i=1}^{n_+} \log p(+|d^+_i) + \sum_{i=1}^{n_-} \log p(-|d^-_i) \\
  &\approx \sum_{i=1}^{n_+} \log\left[p(+) \prod_{j=1}^{n} \left[p(w_j|+)\right]^{t^+_{i,j}}\right] + \sum_{i=1}^{n_-} \log\left[p(-) \prod_{j=1}^{n} \left[p(w_j|-)\right]^{t^-_{i,j}}\right] \\
  &= \sum_{i=1}^{n_+} \left[\log p(+) + \sum_{j=1}^{n} t^+_{i,j} \log p(w_j|+)\right] + \sum_{i=1}^{n_-} \left[\log p(-) + \sum_{j=1}^{n} t^-_{i,j} \log p(w_j|-)\right]
\end{aligned}
$$

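Text Classification

The optimal solution that maximizes the likelihood of the training data:

$$p(+) = \frac{n_+}{N}, \qquad p(-) = \frac{n_-}{N}$$

$$p(w_j|+) = \frac{\sum_{i=1}^{n_+} t^+_{i,j}}{\sum_{j'=1}^{n} \sum_{i=1}^{n_+} t^+_{i,j'}}, \qquad p(w_j|-) = \frac{\sum_{i=1}^{n_-} t^-_{i,j}}{\sum_{j'=1}^{n} \sum_{i=1}^{n_-} t^-_{i,j'}}$$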
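These estimates are simply normalized counts. The sketch below computes them for a tiny made-up corpus; the function name `train_nb`, the labels, and the documents are hypothetical.

```python
from collections import Counter

def train_nb(docs):
    """docs: list of (label, {word: count}) pairs. Returns MLE priors and word probabilities."""
    n_docs = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in n_docs}
    for label, counts in docs:
        word_counts[label].update(counts)        # adds the per-document counts
    priors = {label: n_docs[label] / len(docs) for label in n_docs}
    word_probs = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())             # all word occurrences in this class
        word_probs[label] = {w: t / total for w, t in counts.items()}
    return priors, word_probs

# Hypothetical training documents as bags of words.
docs = [
    ("+", {"president": 2, "election": 1}),
    ("+", {"election": 2, "vote": 1}),
    ("-", {"game": 3, "score": 1}),
]
priors, word_probs = train_nb(docs)
```

Any word that never occurs in a class gets no entry here, which corresponds to an estimated probability of zero for that word in that class.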

Text Classification: An Example on Twenty Newsgroups

Text Classification

• Any problems with the Naïve Bayes text classifier?

• Unseen words

  • If word ‘w’ is unseen in the training documents, what is the consequence?

  • If word ‘w’ is unseen only for documents of one class, what is the consequence?

• Related to the overfitting problem

• Any suggestions?

• Solution: word class approach

  • Introduce word classes $T = \{t_1, t_2, \ldots, t_m\}$

  • Compute $p(t_i|+)$, $p(t_i|-)$

  • When w is unseen before, replace $p(w|\pm)$ with $p(t_i|\pm)$

• Introducing a prior for word probabilities
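The slide does not say which prior is meant. One common choice, shown here purely as an assumption, is add-alpha (Laplace) smoothing, which adds a pseudo-count `alpha` to every vocabulary word so that no estimated word probability is zero. The vocabulary, counts, and function names below are hypothetical.

```python
import math
from collections import Counter

def smoothed_word_probs(counts, vocabulary, alpha=1.0):
    """Add-alpha estimate: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}

def log_score(doc, prior, probs):
    """log p(y) + sum_j t_j log p(w_j | y); every doc word must be in the vocabulary."""
    return math.log(prior) + sum(t * math.log(probs[w]) for w, t in doc.items())

vocabulary = ["president", "election", "game", "score"]
pos_counts = Counter(president=3, election=3)   # "game" and "score" unseen for '+'
neg_counts = Counter(game=3, score=1)           # "president", "election" unseen for '-'

probs_pos = smoothed_word_probs(pos_counts, vocabulary)
probs_neg = smoothed_word_probs(neg_counts, vocabulary)

doc = {"president": 1, "game": 1}               # mixes words from both classes
score_pos = log_score(doc, 0.5, probs_pos)
score_neg = log_score(doc, 0.5, probs_neg)
```

With unsmoothed estimates this document would have probability zero under both classes, so the classifier could not compare them; with smoothing both scores are finite.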

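Naïve Bayes Model

$$p(x|y;\theta) \approx \prod_{i=1}^{d} p(x_i|y;\theta)$$

This is a terrible approximation: the correlation between features is discarded. For example,

$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \quad\longrightarrow\quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$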

Naïve Bayes Model

• Why use the Naïve Bayes Model?

• We are essentially interested in $p(y|x;\theta)$, not $p(x|y;\theta)$

$$
p(y|x;\theta) = \frac{p(y;\theta)\, p(x|y;\theta)}{p(x;\theta)}
= \frac{p(y;\theta)\, p(x|y;\theta)}{\sum_{y'=1}^{c} p(y';\theta)\, p(x|y';\theta)}
= \frac{1}{\sum_{y'=1}^{c} \dfrac{p(y';\theta)\, p(x|y';\theta)}{p(y;\theta)\, p(x|y;\theta)}}
$$

Naïve Bayes Model

• The key for the prediction model is not $p(x|y;\theta)$, but the ratio $p(x|y;\theta)/p(x|y';\theta)$

• Although the Naïve Bayes model does a poor job of estimating $p(x|y;\theta)$, it does a reasonably good job of estimating the ratio

The Ratio of Likelihood for Binary Classes

• Assume that both classes share the same variance $\sigma_i^2$ for each feature

$$
\begin{aligned}
\log \frac{p(y=+1)\, p(x|y=+1)}{p(y=-1)\, p(x|y=-1)}
&= \sum_{i=1}^{d} \left[-\frac{(x_i-\mu_{i,+})^2}{2\sigma_i^2} + \frac{(x_i-\mu_{i,-})^2}{2\sigma_i^2}\right] + \log\frac{p(y=+1)}{p(y=-1)} \\
&= \sum_{i=1}^{d} x_i\, \frac{\mu_{i,+}-\mu_{i,-}}{\sigma_i^2} + \left[\sum_{i=1}^{d} \frac{\mu_{i,-}^2-\mu_{i,+}^2}{2\sigma_i^2} + \log\frac{p(y=+1)}{p(y=-1)}\right]
\end{aligned}
$$

• With shared variances, the Gaussian generative model is a linear model
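The linearity claim can be checked numerically: with a shared per-feature variance, the log ratio computed directly from the Gaussian densities should equal $w^T x + b$ for fixed $w$ and $b$. The sketch below uses made-up parameters; the function names and values are assumptions for illustration.

```python
import math

def log_gauss(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_odds_direct(x, mu_pos, mu_neg, sigma, p_pos, p_neg):
    """log [p(+1) p(x|+1)] / [p(-1) p(x|-1)] under per-feature Gaussians."""
    score = math.log(p_pos / p_neg)
    for xi, mp, mn, sg in zip(x, mu_pos, mu_neg, sigma):
        score += log_gauss(xi, mp, sg) - log_gauss(xi, mn, sg)
    return score

def linear_form(mu_pos, mu_neg, sigma, p_pos, p_neg):
    """Weights w_i = (mu_i+ - mu_i-) / sigma_i^2 and the constant offset b."""
    w = [(mp - mn) / sg ** 2 for mp, mn, sg in zip(mu_pos, mu_neg, sigma)]
    b = math.log(p_pos / p_neg) + sum(
        (mn ** 2 - mp ** 2) / (2 * sg ** 2)
        for mp, mn, sg in zip(mu_pos, mu_neg, sigma))
    return w, b

# Hypothetical parameters for d = 3 features.
mu_pos, mu_neg = [1.0, 0.5, -2.0], [0.0, 1.5, 1.0]
sigma = [1.0, 0.5, 2.0]
p_pos, p_neg = 0.3, 0.7

w, b = linear_form(mu_pos, mu_neg, sigma, p_pos, p_neg)
x = [0.2, -1.0, 3.0]
direct = log_odds_direct(x, mu_pos, mu_neg, sigma, p_pos, p_neg)
linear = sum(wi * xi for wi, xi in zip(w, x)) + b
```

The quadratic terms in $x_i$ cancel only because both classes use the same $\sigma_i$; with class-specific variances the boundary is quadratic rather than linear.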

Linear Decision Boundary

• Gaussian generative models == finding a linear decision boundary

• Why not directly estimate the decision boundary?
