MAP vs. ML



Contents

Part 1:  Introduction to ML, MAP, and Bayesian Estimation
Part 2:  ML, MAP, and Bayesian Prediction
Part 3:  Conjugate Priors
Part 4:  Multinomial Distributions
Part 5:  Modelling Text


PART 1: Introduction to ML, MAP, and Bayesian Estimation


1.1: Say We are Given Evidence X

We may think of X as consisting of a set of independent observations

    X = \{ x_i \}_{i=1}^{|X|}

where each x_i is a realization of a random variable X.

Let's say that a set Θ of probability distribution parameters best explains the evidence X.


1.3: Focusing First on the Estimation of the Parameters Θ ...

We can interpret Bayes' Rule

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

as

    posterior = \frac{likelihood \cdot prior}{evidence}


1.4: ML Estimation of Θ

We seek that value for Θ which maximizes the likelihood; that is, we seek the value of Θ that gives the largest value to

    prob(X | \Theta)

Recognizing that the evidence X consists of independent observations {x_1, x_2, ...}, we can say that we seek the value of Θ that maximizes

    \prod_{x \in X} prob(x | \Theta)

Because of the product on the RHS, it is often simpler to work with the logarithm instead.


Using the symbol L to denote the logarithm of this likelihood, we can express our solution as

    L = \sum_{x \in X} \log prob(x | \Theta)

    \Theta_{ML} = \arg\max_{\Theta} L

The ML solution is usually obtained by setting

    \frac{\partial L}{\partial \theta_i} = 0 \quad \forall \; \theta_i \in \Theta
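As a quick illustration of this recipe, here is a minimal Python sketch (not part of the original slides; it assumes NumPy is available and uses a Bernoulli likelihood with a single parameter p as a stand-in for Θ). It maximizes L over a grid and compares the answer with the closed-form ML estimate:

```python
import numpy as np

# Hypothetical evidence X: independent 0/1 (Bernoulli) observations,
# so Theta reduces to the single parameter p = prob(x = 1).
X = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

def log_likelihood(p):
    # L = sum over x in X of log prob(x | p)
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

# Maximize L by brute force over a grid of candidate values for p.
grid = np.linspace(0.001, 0.999, 999)
p_ml_grid = grid[np.argmax([log_likelihood(p) for p in grid])]

# Setting dL/dp = 0 gives the closed form: p_ML = (number of ones) / N.
print(p_ml_grid, X.mean())   # both are 0.6, up to grid resolution
```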


1.5: MAP Estimation of Θ

For constructing a maximum a posteriori estimate, we first go back to Bayes' Rule:

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

We now seek that value for Θ which maximizes the posterior prob(Θ|X).

Therefore, our solution can now be stated as shown below:


    \Theta_{MAP} = \arg\max_{\Theta} \; prob(\Theta | X)

                 = \arg\max_{\Theta} \; \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

                 = \arg\max_{\Theta} \; prob(X | \Theta) \cdot prob(\Theta)

                 = \arg\max_{\Theta} \; \prod_{x \in X} prob(x | \Theta) \cdot prob(\Theta)

As with the ML estimate, we can make this problem easier if we first take the logarithm of the posterior. We can then write

    \Theta_{MAP} = \arg\max_{\Theta} \left[ \sum_{x \in X} \log prob(x | \Theta) + \log prob(\Theta) \right]


1.6: What Does the MAP Estimate Get Us That the ML Estimate Does NOT?

The MAP estimate allows us to inject into the estimation calculation our prior beliefs regarding the parameter values in Θ.

To illustrate the usefulness of such incorporation of prior beliefs, consider the following example provided by Heinrich:

Let's conduct N independent trials of the following Bernoulli experiment: We will ask each individual we run into in the hallway whether they will vote Democratic or Republican in the next presidential election. Let p be the probability that an individual will vote Democratic.

Let n_d denote the number of individuals, out of the N asked, who say they will vote Democratic. Since the trials are independent, the likelihood of the evidence is

    prob(X | p) = p^{n_d} (1 - p)^{N - n_d}

Setting

    L = \log prob(X | p)

we find the ML estimate for p by setting

    \frac{\partial L}{\partial p} = 0

That gives us the equation

    \frac{n_d}{p} - \frac{N - n_d}{1 - p} = 0

whose solution is the ML estimate

    \hat{p}_{ML} = \frac{n_d}{N}
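For readers who want to check this algebra mechanically, here is a small symbolic sketch (my addition, assuming SymPy is installed) that differentiates L and solves for p:

```python
import sympy as sp

p, n_d, N = sp.symbols('p n_d N', positive=True)
L = n_d * sp.log(p) + (N - n_d) * sp.log(1 - p)   # log prob(X | p)

# Solve dL/dp = 0 for p; the expected answer is [n_d/N].
print(sp.solve(sp.Eq(sp.diff(L, p), 0), p))
```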

To go beyond the ML estimate, we need a prior distribution for p. A convenient choice here is the beta distribution:

    prob(p) = \frac{1}{B(\alpha, \beta)} \, p^{\alpha - 1} (1 - p)^{\beta - 1}

where B(α, β) = Γ(α)Γ(β) / Γ(α + β) is the beta function. (The probability distribution shown above is also displayed as Beta(p|α, β).) Note that the Gamma function Γ() is the generalization of the factorial to real numbers.

When both α and β are greater than one, the above distribution has its mode (meaning its maximum value) at the following point:

    \frac{\alpha - 1}{\alpha + \beta - 2}


Considering that Indiana has traditionally voted Republican in presidential elections, but that this year Democrats are expected to win, we may choose a prior distribution for p that has a peak at 0.5. This we can do by setting α = β = 5.

The variance of a beta distribution is given by

    \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

When α = β = 5, we have a variance of roughly 0.023, implying a standard deviation of roughly 0.15, which should do for us nicely.
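These numbers are easy to verify; the following sketch (my addition, assuming SciPy is available) checks the mode, variance, and standard deviation of the Beta(5, 5) prior:

```python
from scipy.stats import beta

a = b = 5
prior = beta(a, b)
print((a - 1) / (a + b - 2))   # mode of the prior: 0.5
print(prior.var())             # variance: 25/1100, about 0.023
print(prior.std())             # standard deviation: about 0.15
```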

To construct a MAP estimate for p, we now substitute the beta distribution prior for p into the MAP objective derived earlier:

    \log prob(X | p) + \log prob(p)

Setting the derivative of this objective with respect to p to zero gives

    \frac{n_d}{p} - \frac{N - n_d}{1 - p} + \frac{\alpha - 1}{p} - \frac{\beta - 1}{1 - p} = 0

The solution of this equation is

    \hat{p}_{MAP} = \frac{n_d + \alpha - 1}{N + \alpha + \beta - 2} = \frac{n_d + 4}{N + 8}

With N = 20, and with 12 of the 20 saying they would vote Democratic, the MAP estimate for p is 0.571 with α and β both set to 5.
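As a quick numeric check of the formula above (my addition, plain Python):

```python
nd, N = 12, 20
a, b = 5, 5          # the hyperparameters alpha and beta

p_map = (nd + a - 1) / (N + a + b - 2)   # (12 + 4) / (20 + 8)
p_ml = nd / N
print(p_map, p_ml)   # 0.5714... versus the ML estimate 0.6
```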

To summarize, as this example shows, the MAP estimate has the following properties relative to the ML estimate:


•  MAP estimation "pulls" the estimate toward the prior.

•  The more focused our prior belief, the larger the pull toward the prior. By using larger values for α and β (but keeping them equal), we can narrow the peak of the beta distribution around the value p = 0.5. This would cause the MAP estimate to move closer to the prior, as the short sketch after this list illustrates.

•  In the expression we derived for p̂_MAP, the parameters α and β play a "smoothing" role vis-à-vis the measurement n_d.

•  Since we referred to p as the parameter to be estimated, we can refer to α and β as the hyperparameters in the estimation calculations.
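The sketch below (my addition, plain Python) shows this pull toward the prior peak at 0.5 as the equal hyperparameters are made larger:

```python
nd, N = 12, 20
for a in (1, 5, 20, 100):          # alpha = beta = a
    p_map = (nd + a - 1) / (N + 2 * a - 2)
    print(a, round(p_map, 4))
# a = 1 reproduces the ML estimate 0.6; larger a pulls the estimate toward 0.5
```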


1.7: Bayesian Estimation

Given the evidence X, ML considers the parameter vector Θ to be a constant and seeks out that value for the constant that provides maximum support for the evidence. ML does NOT allow us to inject our prior beliefs about the likely values for Θ into the estimation calculations.

MAP allows for the fact that the parameter vector Θ can take values from a distribution that expresses our prior beliefs regarding the parameters. MAP returns that value for Θ where the probability prob(Θ|X) is a maximum.

Both ML and MAP return only single and specific values for the parameter Θ.


1.8: What Makes Bayesian Estimation Complicated?

Bayesian estimation is made complex by the fact that now the denominator in Bayes' Rule

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

cannot be ignored. The denominator, known as the probability of evidence, is related to the other probabilities that appear in Bayes' Rule by

    prob(X) = \int_{\Theta} prob(X | \Theta) \cdot prob(\Theta) \; d\Theta


This leads to the following thought, critical to Bayesian estimation: for a given likelihood function, if we have a choice regarding how we express our prior beliefs, we must use that form which allows us to carry out the integration shown above. It is this thought that leads to the notion of conjugate priors.

Finally, note that, as with MAP, Bayesian estimation also requires us to express our prior beliefs in the possible values of the parameter vector Θ in the form of a distribution.

NOTE: Obtaining an algebraic expression for the posterior is, of course, important from a theoretical perspective. In practice, if you estimate a posterior ignoring the denominator, you can always find the normalization constant simply by adding up what you get for the numerator, assuming you did a sufficiently good job of estimating the numerator. More on this point is in my tutorial "Monte Carlo Integration in Bayesian Estimation."


1.10: An Example of Bayesian Estimation

We will illustrate Bayesian estimation with the same Bernoulli-trial-based example we used earlier for ML and MAP. Our prior for that example is given by the following beta distribution:

    prob(p | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \, p^{\alpha - 1} (1 - p)^{\beta - 1}

where the LHS makes explicit the dependence of the prior on the hyperparameters α and β.

With this prior, the probability of evidence is given by


    prob(X) = \int_0^1 prob(X | p) \cdot prob(p) \; dp

            = \int_0^1 \left[ \prod_{i=1}^{N} prob(x_i | p) \right] \cdot prob(p) \; dp

            = \int_0^1 p^{n_d} (1 - p)^{N - n_d} \cdot prob(p) \; dp

What is interesting is that in this case the integration shown above is easy. When we multiply a beta distribution by either a power of p or a power of (1 − p), we simply get (an unnormalized version of) a different beta distribution. So the above integral can be replaced by a constant Z that depends on the values chosen for α, β, and the measurement n_d.
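To see concretely that this integral is just a number, here is a small check (my addition, assuming SciPy is available) that evaluates it numerically for the hallway example and compares it with the closed form obtained from the beta-function identity ∫ p^{a-1}(1-p)^{b-1} dp = B(a, b):

```python
from scipy.integrate import quad
from scipy.special import beta as B

nd, N = 12, 20
a, b = 5, 5

def integrand(p):
    # prob(X | p) * prob(p): the numerator of Bayes' Rule before normalization
    return p**nd * (1 - p)**(N - nd) * p**(a - 1) * (1 - p)**(b - 1) / B(a, b)

Z_numeric, _ = quad(integrand, 0.0, 1.0)
Z_closed = B(a + nd, b + N - nd) / B(a, b)
print(Z_numeric, Z_closed)   # the two values agree
```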

With the denominator out of the way, let's go back to Bayes' Rule for Bayesian estimation:


    prob(p | X) = \frac{prob(X | p) \cdot prob(p)}{prob(X)}

                = \frac{1}{Z} \cdot prob(X | p) \cdot prob(p)

                = \frac{1}{Z} \cdot \prod_{i=1}^{N} prob(x_i | p) \cdot prob(p)

                = \frac{1}{Z} \cdot p^{n_d} (1 - p)^{N - n_d} \cdot prob(p)

                = Beta(p \,|\, \alpha + n_d, \; \beta + N - n_d)

where the last result follows from the observation made earlier that a beta distribution multiplied by either a power of p or a power of (1 − p) remains a beta distribution, albeit with a different pair of hyperparameters.


The notation Beta() above is a short form for the same beta distribution you saw before. The hyperparameters of this beta distribution are shown to the right of the vertical bar in the argument list.

The above result gives us a closed-form expression for the posterior distribution of the parameter to be estimated.

If we wanted to return a single value as an estimate for p, that would be the expected value of the posterior distribution we just derived. Using the standard formula for the expectation of a beta distribution, the expectation is given by


    \hat{p}_{Bayesian} = E\{ p \,|\, X \} = \frac{\alpha + n_d}{\alpha + \beta + N} = \frac{5 + n_d}{10 + N}

for the case when we set both α and β to 5.

When N = 20 and 12 out of the 20 individuals report that they will vote Democratic, our Bayesian estimate yields a value of 0.567. Compare that to the MAP value of 0.571 and the ML value of 0.6.

One benefit of Bayesian estimation is that we can also calculate the variance associated with the above estimate. One can again use the standard formula for the variance of a beta distribution to show that the variance associated with the Bayesian estimate is 0.0079.
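These numbers can be reproduced directly from the posterior Beta(α + n_d, β + N − n_d) = Beta(17, 13); a minimal check (my addition, assuming SciPy is available):

```python
from scipy.stats import beta

a, b, nd, N = 5, 5, 12, 20
posterior = beta(a + nd, b + N - nd)   # Beta(17, 13)
print(posterior.mean())                # 17/30, about 0.567
print(posterior.var())                 # about 0.0079
```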

PART 2: ML, MAP, and Bayesian Prediction

2.2: ML Prediction

We can write the following equation for the probabilistic support that the past data X provides to a new observation x̃:

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

                        \approx \int_{\Theta} prob(\tilde{x} | \Theta_{ML}) \cdot prob(\Theta | X) \; d\Theta

                        = prob(\tilde{x} | \Theta_{ML})

What this says is that the probability model for the new observation x̃ is the same as for all previous observations that constitute the evidence X. In this probability model, we set the parameters to Θ_ML to compute the support that the evidence lends to the new observation.


2.3: MAP Prediction

With MAP, the derivation shown above becomes

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

                        \approx \int_{\Theta} prob(\tilde{x} | \Theta_{MAP}) \cdot prob(\Theta | X) \; d\Theta

                        = prob(\tilde{x} | \Theta_{MAP})

This is to be interpreted in the same manner as the ML prediction presented above. The probabilistic support for the new data x̃ is to be computed by using the same probability model as used for the evidence X, but with the parameters set to Θ_MAP.


2.4: Bayesian Prediction

In order to compute the support prob(x̃|X) that the evidence X lends to the new observation x̃, we again start with the relationship

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

but now we must use Bayes' Rule for the posterior prob(Θ|X) to yield

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)} \; d\Theta
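For the hallway example, the three predictions of the probability that the next individual votes Democratic can be compared side by side. The sketch below (my addition, assuming SciPy is available) plugs in Θ_ML, plugs in Θ_MAP, and carries out the Bayesian integral over the full posterior:

```python
from scipy.integrate import quad
from scipy.stats import beta

a, b, nd, N = 5, 5, 12, 20

p_ml = nd / N                              # plug in Theta_ML
p_map = (nd + a - 1) / (N + a + b - 2)     # plug in Theta_MAP
posterior = beta(a + nd, b + N - nd)       # full posterior for p

# Bayesian prediction: integrate prob(x_new = Democratic | p) * prob(p | X) over p.
p_bayes, _ = quad(lambda p: p * posterior.pdf(p), 0.0, 1.0)

print(p_ml, p_map, p_bayes)   # 0.6, about 0.571, about 0.567 (the posterior mean)
```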

PART 3: Conjugate Priors

3.1: What is a Conjugate Prior?

As you saw, Bayesian estimation requires us to compute the full posterior distribution for the parameters of interest, as opposed to, say, just the value where the posterior acquires its maximum value. As shown already, the posterior is given by

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{\int prob(X | \Theta) \cdot prob(\Theta) \; d\Theta}

The most challenging part of the calculation here is the derivation of a closed form for the marginal in the denominator on the right.


For a given algebraic form for the likelihood, different forms for the prior prob(Θ) pose different levels of difficulty for the determination of the marginal in the denominator and, therefore, for the determination of the posterior.

For a given likelihood function prob(X|Θ), a prior prob(Θ) is called a conjugate prior if the posterior prob(Θ|X) has the same algebraic form as the prior.

Obviously, Bayesian estimation and prediction become much easier should the engineering assumptions allow a conjugate prior to be chosen for the applicable likelihood function.

When the likelihood can be assumed to be Gaussian (with the mean as the parameter to be estimated and the variance known), a Gaussian prior on the mean would constitute a conjugate prior, because in this case the posterior would also be Gaussian.
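A minimal sketch of this Gaussian-Gaussian case (my addition, assuming NumPy; the update formulas are the standard conjugate results for a Gaussian mean with known noise variance):

```python
import numpy as np

rng = np.random.default_rng(1)

mu0, tau = 0.0, 2.0                   # prior on the mean: N(mu0, tau^2)
sigma = 1.0                           # known standard deviation of each observation
X = rng.normal(1.5, sigma, size=30)   # hypothetical evidence

# Conjugate update: the posterior for the mean is again Gaussian.
post_prec = 1.0 / tau**2 + len(X) / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / tau**2 + X.sum() / sigma**2)
print(post_mean, post_var)            # concentrates near the sample mean as N grows
```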


You have already seen another example of a conjugate prior earlier in this review. For the Bernoulli-trial-based experiment we talked about earlier, the beta distribution constitutes a conjugate prior. As we saw there, the posterior was also a beta distribution (albeit with different hyperparameters).

As we will see later, when the likelihood is a multinomial, the conjugate prior is the Dirichlet distribution.


    PART 4: Multinomial Distributions


4.1: When Are Multinomial Distributions Useful for Likelihoods?

Multinomial distributions are useful for modeling the evidence when each observation in the evidence can be characterized by count-based features.

As a stepping stone to multinomial distributions, let's first talk about binomial distributions.

Binomial distributions answer the following question: Let's carry out N trials of a Bernoulli experiment with p as the probability of success at each trial. Let n be the random variable that denotes the number of times we achieve success in N trials. The random variable n has a binomial distribution that is given by


    prob(n) = \binom{N}{n} \, p^{n} (1 - p)^{N - n}

with the binomial coefficient

    \binom{N}{n} = \frac{N!}{n! \, (N - n)!}

A multinomial distribution is a generalization of the binomial distribution. Now, instead of a binary outcome at each trial, we have k possible mutually exclusive outcomes at each trial. At each trial, the k outcomes can occur with the probabilities p_1, p_2, ..., p_k, respectively, with the constraint that they must all add up to 1.

We still carry out N trials of the underlying experiment, and at the end of the N trials we pose the following question: What is the probability that we saw the first outcome n_1 times, the second outcome n_2 times, ..., and the k-th outcome n_k times? This probability is a multinomial distribution and is given by

    prob(n_1, ..., n_k) = \frac{N!}{n_1! \, n_2! \cdots n_k!} \; \prod_{i=1}^{k} p_i^{n_i}
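A small numerical sketch of this formula (my addition, assuming SciPy is available for the cross-check), using a fair six-sided die rolled N = 10 times as the hypothetical experiment:

```python
import math
from scipy.stats import multinomial

p = [1/6] * 6                    # probabilities of the k = 6 outcomes
n = [3, 1, 2, 0, 2, 2]           # observed counts n_1,...,n_6 (they sum to N = 10)
N = sum(n)

# Direct evaluation of the multinomial probability given above.
coef = math.factorial(N) // math.prod(math.factorial(ni) for ni in n)
pmf_direct = coef * math.prod(pi**ni for pi, ni in zip(p, n))

print(pmf_direct, multinomial(N, p).pmf(n))   # the two values agree
```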

If we wanted to actually measure the probability prob(n_1, ..., n_k) experimentally, we would carry out a large number of multinomial experiments and record the number of experiments in which the first outcome occurs n_1 times, the second outcome n_2 times, and so on.

If we want to make explicit the conditioning variables in prob(n_1, ..., n_k), we can express it as

    prob(n_1, ..., n_k \,|\, p_1, p_2, ..., p_k, N)

This form is sometimes expressed more compactly as

    Multi(\mathbf{n} \,|\, \mathbf{p}, N)


4.2: What is a Multinomial Random Variable?

A random variable W is a multinomial r.v. if its probability distribution is a multinomial. The vector n shown at the end of the last slide is a multinomial random variable.

A multinomial r.v. is a vector. Each element of the vector stands for a count of the occurrences of one of the k possible outcomes across the trials of the underlying experiment.

Think of rolling a die 1000 times. At each roll, you will see one of six possible outcomes. In this case, W = (n_1, ..., n_6), where n_i stands for the number of times you see the face with i dots.

One example: we can characterize an image by counting how often each of k possible feature types occurs among the N features extracted from the image.

A common way to characterize text documents is by the frequency of the words in the documents. Carrying out N trials of a k-outcome experiment could now correspond to recording the N most prominent words in a text file. If we assume that our vocabulary is limited to k words, for each document we would record the frequency of occurrence of each vocabulary word. We can think of each document as the result of one multinomial experiment.

For each of the above two cases, if for a given image or text file the first feature is observed n_1 times, the second feature n_2 times, etc., the likelihood probability to be associated with that image or text file would be

    prob(image \,|\, p_1, ..., p_k) = \prod_{i=1}^{k} p_i^{n_i}

We will refer to this probability as the multinomial likelihood of the image. We can think of p_1, ..., p_k as the parameters that characterize the database.
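In practice this product is evaluated in log space, since many small factors underflow quickly. A tiny sketch (my addition, assuming NumPy) for one hypothetical document over a four-word vocabulary:

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])   # hypothetical word probabilities p_i (k = 4)
n = np.array([30, 10, 5, 5])         # word counts n_i observed in the document

log_likelihood = np.sum(n * np.log(p))      # log of prod_i p_i^{n_i}
print(log_likelihood, np.exp(log_likelihood))
```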


4.4: Conjugate Prior for a Multinomial Likelihood

The conjugate prior for a multinomial likelihood is the Dirichlet distribution:

    prob(p_1, ..., p_k) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \; \prod_{i=1}^{k} p_i^{\alpha_i - 1}

where α_i, i = 1, ..., k, are the hyperparameters of the prior.

The Dirichlet is a generalization of the beta distribution from two degrees of freedom to k degrees of freedom. (Strictly speaking, it is a generalization from the one degree of freedom of a beta distribution to k − 1 degrees of freedom. That is because of the constraint \sum_{i=1}^{k} p_i = 1.)
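Here is a small sketch (my addition, assuming NumPy) that evaluates this density directly at one point of the simplex and, as a sanity check, verifies the known fact that samples from Dir(α) have mean α_i / Σα_i:

```python
import numpy as np
from math import gamma, prod

alpha = [2.0, 3.0, 4.0]          # hypothetical hyperparameters, k = 3
p = [0.2, 0.3, 0.5]              # a point on the simplex: p1 + p2 + p3 = 1

# Direct evaluation of the Dirichlet density given above.
norm = gamma(sum(alpha)) / prod(gamma(a) for a in alpha)
print(norm * prod(pi**(a - 1) for pi, a in zip(p, alpha)))

# Empirical check: the mean of Dirichlet samples is alpha_i / sum(alpha).
samples = np.random.default_rng(0).dirichlet(alpha, size=200_000)
print(samples.mean(axis=0), np.array(alpha) / sum(alpha))
```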


The Dirichlet prior is also expressed more compactly as

    prob(\mathbf{p} \,|\, \boldsymbol{\alpha})

For the purpose of visualization, consider the case when an image is only allowed to have three different kinds of features and we make N feature measurements in each image. In this case, k = 3. The Dirichlet prior would now take the form

    prob(p_1, p_2, p_3 \,|\, \alpha_1, \alpha_2, \alpha_3) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} \; p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} p_3^{\alpha_3 - 1}


under the constraint that p_1 + p_2 + p_3 = 1.

When k = 2, the Dirichlet prior reduces to

    prob(p_1, p_2 \,|\, \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \; p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1}

which is the same thing as the beta distribution shown earlier. Recall that now p_1 + p_2 = 1. Previously, we expressed the beta distribution as prob(p|α, β). The p there is the same thing as p_1 here.


    PART 5: Modelling Text


5.1: A Unigram Model for Documents

Let's say we wish to draw N prominent words from a document for its representation.

We assume that we have a vocabulary of V prominent words. Also assume, for the purpose of mental comfort, that V


Let our corpus (database of text documents) be characterized by the following set of probabilities: the probability that word_i of the vocabulary will appear in any given document is p_i. We will use the vector p to represent (p_1, p_2, ..., p_V). The vector p is referred to as defining the unigram statistics for the documents.

We can now associate the following multinomial likelihood with a document for which the r.v. W takes on the specific value W = (n_1, ..., n_V):

    prob(W \,|\, \mathbf{p}) = \prod_{i=1}^{V} p_i^{n_i}

If we want to carry out a Bayesian estimation of the parameters p, it would be best if we could represent the prior by the Dirichlet distribution:


    prob(\mathbf{p} \,|\, \boldsymbol{\alpha}) = Dir(\mathbf{p} \,|\, \boldsymbol{\alpha})

Since the Dirichlet is a conjugate prior for a multinomial likelihood, our posterior will also be a Dirichlet.

Let's say we wish to compute the posterior after observing a single document with W = (n_1, ..., n_V) as the value for the r.v. W. This can be shown to be

    prob(\mathbf{p} \,|\, W, \boldsymbol{\alpha}) = Dir(\mathbf{p} \,|\, \boldsymbol{\alpha} + \mathbf{n})
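Because of conjugacy, the posterior update is nothing more than adding the word counts to the hyperparameters. A minimal sketch (my addition, assuming NumPy) for a hypothetical five-word vocabulary:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0, 2.0, 2.0])   # symmetric Dirichlet prior
n     = np.array([10,   0,   3,   1,   6])    # word counts W = (n_1,...,n_V) for one document

alpha_post = alpha + n                        # posterior is Dir(p | alpha + n)

# The posterior mean (alpha_i + n_i) / sum(alpha + n) is a smoothed version
# of the raw word frequencies n_i / sum(n).
print(alpha_post / alpha_post.sum())
print(n / n.sum())
```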


5.2: A Mixture of Unigrams Model for Documents

In this model, we assume that the word probability p_i for the occurrence of word_i in a document is conditioned on the selection of a topic for the document. In other words, we first assume that a document contains words corresponding to a specific topic and that the word probabilities then depend on the topic.

Therefore, if we are given, say, 100,000 documents on, say, 20 topics, and if we can assume that each document pertains to only one topic, then the mixture of unigrams approach will partition the corpus into 20 clusters by assigning one of the 20 topic labels to each of the 100,000 documents.


We can now associate the following likelihood with a document for which the r.v. W takes on the specific value W = (n_1, ..., n_V), with n_i as the number of times word_i appears in the document:

    prob(W \,|\, \mathbf{p}) = \sum_{z} prob(z) \cdot \prod_{i=1}^{V} \bigl( p(word_i \,|\, z) \bigr)^{n_i}

Note that the summation over the topic distribution does not imply that a document is allowed to contain multiple topics simultaneously. It simply implies that a document, constrained to contain only one topic, may contain either topic z_1, or topic z_2, or any of the other topics, each with a probability that is given by prob(z).
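A tiny numerical sketch of this mixture likelihood (my addition, assuming NumPy), with two hypothetical topics and a four-word vocabulary:

```python
import numpy as np

prob_z = np.array([0.5, 0.5])                   # prob(z) for the two topics
p_w_given_z = np.array([[0.7, 0.1, 0.1, 0.1],   # p(word_i | z_1)
                        [0.1, 0.1, 0.1, 0.7]])  # p(word_i | z_2)
n = np.array([5, 1, 0, 2])                      # word counts W for one document

# prob(W) = sum over z of prob(z) * prod_i p(word_i | z)^{n_i}
per_topic = np.prod(p_w_given_z ** n, axis=1)
print(np.sum(prob_z * per_topic))
```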


5.3: Document Modelling with PLSA

PLSA stands for Probabilistic Latent Semantic Analysis.

PLSA extends the mixture of unigrams model by considering the document itself to be a random variable and declaring that a document and a word in the document are conditionally independent if we know the topic that governs the production of that word.

The mixture of unigrams model presented above required that a document contain only one topic. With PLSA, on the other hand, as you "generate" the words in a document, at each point you first randomly select a topic and then select a word based on the topic chosen.


The topics themselves are considered to be the hidden variables in the modelling process.

With PLSA, the probability that the word w_n from our vocabulary of V words will appear in a document d is given by

    prob(d, w_n) = prob(d) \sum_{z} prob(w_n \,|\, z) \; prob(z \,|\, d)

where the random variable z represents the hidden topics. Now a document can have any number of topics in it.
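A small sketch of this joint probability (my addition, assuming NumPy) for three hypothetical documents, two hidden topics, and a four-word vocabulary; note that the joint sums to 1 over all (d, w) pairs:

```python
import numpy as np

prob_d = np.array([0.5, 0.3, 0.2])                # prob(d)
prob_w_given_z = np.array([[0.4, 0.3, 0.2, 0.1],  # prob(w | z_1)
                           [0.1, 0.2, 0.3, 0.4]]) # prob(w | z_2)
prob_z_given_d = np.array([[0.9, 0.1],            # prob(z | d_1)
                           [0.5, 0.5],            # prob(z | d_2)
                           [0.2, 0.8]])           # prob(z | d_3)

# prob(d, w_n) = prob(d) * sum over z of prob(w_n | z) * prob(z | d)
joint = prob_d[:, None] * (prob_z_given_d @ prob_w_given_z)
print(joint)          # rows index documents, columns index words
print(joint.sum())    # 1.0
```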


5.4: Modelling Documents with LDA

LDA is a more modern approach to modelling documents.

LDA takes a more principled approach to expressing the dependencies between the topics and the documents on the one hand, and between the topics and the words on the other.

LDA stands for Latent Dirichlet Allocation. The name is justified by the fact that the topics are the latent (hidden) variables and our document modelling process must allocate the words to the topics.

Assume for a moment that we know that a corpus is best modeled with the help of k topics.


For each document, LDA first "constructs" a multinomial whose k outcomes correspond to choosing each of the k topics. For each document we are interested in the frequency of occurrence of each of the k topics. Given a document, the probabilities associated with each of the topics can be expressed as

    \theta = [\, p(z_1 | doc), \; ..., \; p(z_k | doc) \,]

where z_i stands for the i-th topic. It must obviously be the case that \sum_{i=1}^{k} p(z_i | doc) = 1, so the θ vector has only k − 1 degrees of freedom.

LDA assumes that the multinomial θ can be given a Dirichlet prior:

    prob(\theta \,|\, \alpha) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \; \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}


where θ_i stands for p(z_i|doc) and where α denotes the k hyperparameters of the prior.

Choosing θ for a document means randomly specifying the topic mixture for the document. After we have chosen θ randomly for a document, we need to generate the words for the document. This we do by first randomly choosing a topic at each word position according to θ, and then choosing a word by using the distribution specified by the β matrix:

    \beta =
    \begin{bmatrix}
    p(word_1 | z_1) & p(word_2 | z_1) & \cdots & p(word_V | z_1) \\
    p(word_1 | z_2) & p(word_2 | z_2) & \cdots & p(word_V | z_2) \\
    \vdots          & \vdots          &        & \vdots          \\
    p(word_1 | z_K) & p(word_2 | z_K) & \cdots & p(word_V | z_K)
    \end{bmatrix}
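Putting the pieces together, here is a minimal sketch of LDA's generative story for a single document (my addition, assuming NumPy; the sizes and hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, N = 3, 8, 20                          # topics, vocabulary size, words per document
alpha = np.full(k, 0.5)                     # Dirichlet hyperparameters for theta
beta = rng.dirichlet(np.ones(V), size=k)    # row z of beta is p(word | z)

theta = rng.dirichlet(alpha)                # 1) draw the topic mixture theta ~ Dir(alpha)
words = []
for _ in range(N):
    z = rng.choice(k, p=theta)              # 2) draw a topic for this word position
    w = rng.choice(V, p=beta[z])            # 3) draw a word from that topic's distribution
    words.append(w)
print(theta, words)
```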


    Acknowledgements

This presentation was a part of a tutorial marathon we ran in the Purdue Robot Vision Lab in the summer of 2008. The marathon consisted of three presentations: my presentation that you are seeing here on the foundations of parameter estimation and prediction, Shivani Rao's presentation on the application of LDA to text analysis, and Gaurav Srivastava's presentation on the application of LDA to scene interpretation in computer vision.

This tutorial presentation was inspired by the wonderful paper "Parameter Estimation for Text Analysis" by Gregor Heinrich that was brought to my attention by Shivani Rao. My explanations and the examples I have used follow closely those in the paper by Gregor Heinrich.

This tutorial presentation has also benefitted considerably from my many conversations with Shivani Rao.