MAP vs. ML



Contents

Part 1:  Introduction to ML, MAP, and Bayesian Estimation
Part 2:  ML, MAP, and Bayesian Prediction
Part 3:  Conjugate Priors
Part 4:  Multinomial Distributions
Part 5:  Modelling Text


PART 1: Introduction to ML, MAP, and Bayesian Estimation


1.1: Say We are Given Evidence X

We may think of X as consisting of a set of independent observations

    X = \{ x_i \}_{i=1}^{|X|}

where each x_i is a realization of a random variable X.

Let's say that a set Θ of probability distribution parameters best explains the evidence X.


1.3: Focusing First on the Estimation of the Parameters Θ ...

We can interpret Bayes' Rule

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

as

    posterior = \frac{likelihood \cdot prior}{evidence}


1.4: ML Estimation of Θ

We seek that value for Θ which maximizes the likelihood; that is, we seek the value of Θ that gives the largest value to

    prob(X | \Theta)

Recognizing that the evidence X consists of independent observations {x_1, x_2, ...}, we can say that we seek the value of Θ that maximizes

    \prod_{x \in X} prob(x | \Theta)

Because of the product on the RHS, it is often simpler to work with the logarithm instead.


Using the symbol L to denote the logarithm of this likelihood, we can express our solution as

    L = \sum_{x \in X} \log prob(x | \Theta)

    \Theta_{ML} = \arg\max_{\Theta} L

The ML solution is usually obtained by setting

    \frac{\partial L}{\partial \theta_i} = 0 \quad \forall \; \theta_i \in \Theta
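As a quick illustration of this recipe, here is a minimal Python sketch (not part of the original slides; it assumes NumPy is available and uses a Bernoulli likelihood with a single parameter p as a stand-in for Θ). It maximizes L over a grid and compares the answer with the closed-form ML estimate:

```python
import numpy as np

# Hypothetical evidence X: independent 0/1 (Bernoulli) observations,
# so Theta reduces to the single parameter p = prob(x = 1).
X = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

def log_likelihood(p):
    # L = sum over x in X of log prob(x | p)
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

# Maximize L by brute force over a grid of candidate values for p.
grid = np.linspace(0.001, 0.999, 999)
p_ml_grid = grid[np.argmax([log_likelihood(p) for p in grid])]

# Setting dL/dp = 0 gives the closed form: p_ML = (number of ones) / N.
print(p_ml_grid, X.mean())   # both are 0.6, up to grid resolution
```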


1.5: MAP Estimation of Θ

For constructing a maximum a posteriori estimate, we first go back to Bayes' Rule:

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

We now seek that value for Θ which maximizes the posterior prob(Θ|X).

Therefore, our solution can now be stated as shown below:


    \Theta_{MAP} = \arg\max_{\Theta} \; prob(\Theta | X)

                 = \arg\max_{\Theta} \; \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

                 = \arg\max_{\Theta} \; prob(X | \Theta) \cdot prob(\Theta)

                 = \arg\max_{\Theta} \; \prod_{x \in X} prob(x | \Theta) \cdot prob(\Theta)

As with the ML estimate, we can make this problem easier if we first take the logarithm of the posterior. We can then write

    \Theta_{MAP} = \arg\max_{\Theta} \left[ \sum_{x \in X} \log prob(x | \Theta) + \log prob(\Theta) \right]


1.6: What Does the MAP Estimate Get Us That the ML Estimate Does NOT?

The MAP estimate allows us to inject into the estimation calculation our prior beliefs regarding the parameter values in Θ.

To illustrate the usefulness of such incorporation of prior beliefs, consider the following example provided by Heinrich:

Let's conduct N independent trials of the following Bernoulli experiment: We will ask each individual we run into in the hallway whether they will vote Democratic or Republican in the next presidential election. Let p be the probability that an individual will vote Democratic.

Let n_d denote the number of individuals, out of the N asked, who say they will vote Democratic. Since the trials are independent, the likelihood of the evidence is

    prob(X | p) = p^{n_d} (1 - p)^{N - n_d}

Setting

    L = \log prob(X | p)

we find the ML estimate for p by setting

    \frac{\partial L}{\partial p} = 0

That gives us the equation

    \frac{n_d}{p} - \frac{N - n_d}{1 - p} = 0

whose solution is the ML estimate

    \hat{p}_{ML} = \frac{n_d}{N}
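For readers who want to check this algebra mechanically, here is a small symbolic sketch (my addition, assuming SymPy is installed) that differentiates L and solves for p:

```python
import sympy as sp

p, n_d, N = sp.symbols('p n_d N', positive=True)
L = n_d * sp.log(p) + (N - n_d) * sp.log(1 - p)   # log prob(X | p)

# Solve dL/dp = 0 for p; the expected answer is [n_d/N].
print(sp.solve(sp.Eq(sp.diff(L, p), 0), p))
```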

To go beyond the ML estimate, we need a prior distribution for p. A convenient choice here is the beta distribution:

    prob(p) = \frac{1}{B(\alpha, \beta)} \, p^{\alpha - 1} (1 - p)^{\beta - 1}

where B(α, β) = Γ(α)Γ(β) / Γ(α + β) is the beta function. (The probability distribution shown above is also displayed as Beta(p|α, β).) Note that the Gamma function Γ() is the generalization of the factorial to real numbers.

When both α and β are greater than one, the above distribution has its mode (meaning its maximum value) at the following point:

    \frac{\alpha - 1}{\alpha + \beta - 2}


Considering that Indiana has traditionally voted Republican in presidential elections, but that this year Democrats are expected to win, we may choose a prior distribution for p that has a peak at 0.5. This we can do by setting α = β = 5.

The variance of a beta distribution is given by

    \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

When α = β = 5, we have a variance of roughly 0.023, implying a standard deviation of roughly 0.15, which should do for us nicely.
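These numbers are easy to verify; the following sketch (my addition, assuming SciPy is available) checks the mode, variance, and standard deviation of the Beta(5, 5) prior:

```python
from scipy.stats import beta

a = b = 5
prior = beta(a, b)
print((a - 1) / (a + b - 2))   # mode of the prior: 0.5
print(prior.var())             # variance: 25/1100, about 0.023
print(prior.std())             # standard deviation: about 0.15
```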

To construct a MAP estimate for p, we now substitute the beta distribution prior for p into the MAP objective derived earlier:

    \log prob(X | p) + \log prob(p)

Setting the derivative of this objective with respect to p to zero gives

    \frac{n_d}{p} - \frac{N - n_d}{1 - p} + \frac{\alpha - 1}{p} - \frac{\beta - 1}{1 - p} = 0

The solution of this equation is

    \hat{p}_{MAP} = \frac{n_d + \alpha - 1}{N + \alpha + \beta - 2} = \frac{n_d + 4}{N + 8}

With N = 20, and with 12 of the 20 saying they would vote Democratic, the MAP estimate for p is 0.571 with α and β both set to 5.
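As a quick numeric check of the formula above (my addition, plain Python):

```python
nd, N = 12, 20
a, b = 5, 5          # the hyperparameters alpha and beta

p_map = (nd + a - 1) / (N + a + b - 2)   # (12 + 4) / (20 + 8)
p_ml = nd / N
print(p_map, p_ml)   # 0.5714... versus the ML estimate 0.6
```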

To summarize, as this example shows, the MAP estimate has the following properties relative to the ML estimate:


•  MAP estimation "pulls" the estimate toward the prior.

•  The more focused our prior belief, the larger the pull toward the prior. By using larger values for α and β (but keeping them equal), we can narrow the peak of the beta distribution around the value p = 0.5. This would cause the MAP estimate to move closer to the prior, as the short sketch after this list illustrates.

•  In the expression we derived for p̂_MAP, the parameters α and β play a "smoothing" role vis-à-vis the measurement n_d.

•  Since we referred to p as the parameter to be estimated, we can refer to α and β as the hyperparameters in the estimation calculations.
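The sketch below (my addition, plain Python) shows this pull toward the prior peak at 0.5 as the equal hyperparameters are made larger:

```python
nd, N = 12, 20
for a in (1, 5, 20, 100):          # alpha = beta = a
    p_map = (nd + a - 1) / (N + 2 * a - 2)
    print(a, round(p_map, 4))
# a = 1 reproduces the ML estimate 0.6; larger a pulls the estimate toward 0.5
```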


1.7: Bayesian Estimation

Given the evidence X, ML considers the parameter vector Θ to be a constant and seeks out that value for the constant that provides maximum support for the evidence. ML does NOT allow us to inject our prior beliefs about the likely values for Θ into the estimation calculations.

MAP allows for the fact that the parameter vector Θ can take values from a distribution that expresses our prior beliefs regarding the parameters. MAP returns that value for Θ where the probability prob(Θ|X) is a maximum.

Both ML and MAP return only single and specific values for the parameter Θ.


1.8: What Makes Bayesian Estimation Complicated?

Bayesian estimation is made complex by the fact that now the denominator in Bayes' Rule

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)}

cannot be ignored. The denominator, known as the probability of evidence, is related to the other probabilities that appear in Bayes' Rule by

    prob(X) = \int_{\Theta} prob(X | \Theta) \cdot prob(\Theta) \; d\Theta


This leads to the following thought, critical to Bayesian estimation: for a given likelihood function, if we have a choice regarding how we express our prior beliefs, we must use that form which allows us to carry out the integration shown above. It is this thought that leads to the notion of conjugate priors.

Finally, note that, as with MAP, Bayesian estimation also requires us to express our prior beliefs in the possible values of the parameter vector Θ in the form of a distribution.

NOTE: Obtaining an algebraic expression for the posterior is, of course, important from a theoretical perspective. In practice, if you estimate a posterior ignoring the denominator, you can always find the normalization constant simply by adding up what you get for the numerator, assuming you did a sufficiently good job of estimating the numerator. More on this point is in my tutorial "Monte Carlo Integration in Bayesian Estimation."


1.10: An Example of Bayesian Estimation

We will illustrate Bayesian estimation with the same Bernoulli-trial-based example we used earlier for ML and MAP. Our prior for that example is given by the following beta distribution:

    prob(p | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \, p^{\alpha - 1} (1 - p)^{\beta - 1}

where the LHS makes explicit the dependence of the prior on the hyperparameters α and β.

With this prior, the probability of evidence is given by


    prob(X) = \int_0^1 prob(X | p) \cdot prob(p) \; dp

            = \int_0^1 \left[ \prod_{i=1}^{N} prob(x_i | p) \right] \cdot prob(p) \; dp

            = \int_0^1 p^{n_d} (1 - p)^{N - n_d} \cdot prob(p) \; dp

What is interesting is that in this case the integration shown above is easy. When we multiply a beta distribution by either a power of p or a power of (1 − p), we simply get (an unnormalized version of) a different beta distribution. So the above integral can be replaced by a constant Z that depends on the values chosen for α, β, and the measurement n_d.
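To see concretely that this integral is just a number, here is a small check (my addition, assuming SciPy is available) that evaluates it numerically for the hallway example and compares it with the closed form obtained from the beta-function identity ∫ p^{a-1}(1-p)^{b-1} dp = B(a, b):

```python
from scipy.integrate import quad
from scipy.special import beta as B

nd, N = 12, 20
a, b = 5, 5

def integrand(p):
    # prob(X | p) * prob(p): the numerator of Bayes' Rule before normalization
    return p**nd * (1 - p)**(N - nd) * p**(a - 1) * (1 - p)**(b - 1) / B(a, b)

Z_numeric, _ = quad(integrand, 0.0, 1.0)
Z_closed = B(a + nd, b + N - nd) / B(a, b)
print(Z_numeric, Z_closed)   # the two values agree
```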

With the denominator out of the way, let's go back to Bayes' Rule for Bayesian estimation:


    prob(p | X) = \frac{prob(X | p) \cdot prob(p)}{prob(X)}

                = \frac{1}{Z} \cdot prob(X | p) \cdot prob(p)

                = \frac{1}{Z} \cdot \prod_{i=1}^{N} prob(x_i | p) \cdot prob(p)

                = \frac{1}{Z} \cdot p^{n_d} (1 - p)^{N - n_d} \cdot prob(p)

                = Beta(p \,|\, \alpha + n_d, \; \beta + N - n_d)

where the last result follows from the observation made earlier that a beta distribution multiplied by either a power of p or a power of (1 − p) remains a beta distribution, albeit with a different pair of hyperparameters.


The notation Beta() above is a short form for the same beta distribution you saw before. The hyperparameters of this beta distribution are shown to the right of the vertical bar in the argument list.

The above result gives us a closed-form expression for the posterior distribution of the parameter to be estimated.

If we wanted to return a single value as an estimate for p, that would be the expected value of the posterior distribution we just derived. Using the standard formula for the expectation of a beta distribution, the expectation is given by


    \hat{p}_{Bayesian} = E\{ p \,|\, X \} = \frac{\alpha + n_d}{\alpha + \beta + N} = \frac{5 + n_d}{10 + N}

for the case when we set both α and β to 5.

When N = 20 and 12 out of the 20 individuals report that they will vote Democratic, our Bayesian estimate yields a value of 0.567. Compare that to the MAP value of 0.571 and the ML value of 0.6.

One benefit of Bayesian estimation is that we can also calculate the variance associated with the above estimate. One can again use the standard formula for the variance of a beta distribution to show that the variance associated with the Bayesian estimate is 0.0079.
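These numbers can be reproduced directly from the posterior Beta(α + n_d, β + N − n_d) = Beta(17, 13); a minimal check (my addition, assuming SciPy is available):

```python
from scipy.stats import beta

a, b, nd, N = 5, 5, 12, 20
posterior = beta(a + nd, b + N - nd)   # Beta(17, 13)
print(posterior.mean())                # 17/30, about 0.567
print(posterior.var())                 # about 0.0079
```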

PART 2: ML, MAP, and Bayesian Prediction

2.2: ML Prediction

We can write the following equation for the probabilistic support that the past data X provides to a new observation x̃:

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

                        \approx \int_{\Theta} prob(\tilde{x} | \Theta_{ML}) \cdot prob(\Theta | X) \; d\Theta

                        = prob(\tilde{x} | \Theta_{ML})

What this says is that the probability model for the new observation x̃ is the same as for all previous observations that constitute the evidence X. In this probability model, we set the parameters to Θ_ML to compute the support that the evidence lends to the new observation.


2.3: MAP Prediction

With MAP, the derivation shown above becomes

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

                        \approx \int_{\Theta} prob(\tilde{x} | \Theta_{MAP}) \cdot prob(\Theta | X) \; d\Theta

                        = prob(\tilde{x} | \Theta_{MAP})

This is to be interpreted in the same manner as the ML prediction presented above. The probabilistic support for the new data x̃ is to be computed by using the same probability model as used for the evidence X, but with the parameters set to Θ_MAP.


2.4: Bayesian Prediction

In order to compute the support prob(x̃|X) that the evidence X lends to the new observation x̃, we again start with the relationship

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot prob(\Theta | X) \; d\Theta

but now we must use Bayes' Rule for the posterior prob(Θ|X) to yield

    prob(\tilde{x} | X) = \int_{\Theta} prob(\tilde{x} | \Theta) \cdot \frac{prob(X | \Theta) \cdot prob(\Theta)}{prob(X)} \; d\Theta
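For the hallway example, the three predictions of the probability that the next individual votes Democratic can be compared side by side. The sketch below (my addition, assuming SciPy is available) plugs in Θ_ML, plugs in Θ_MAP, and carries out the Bayesian integral over the full posterior:

```python
from scipy.integrate import quad
from scipy.stats import beta

a, b, nd, N = 5, 5, 12, 20

p_ml = nd / N                              # plug in Theta_ML
p_map = (nd + a - 1) / (N + a + b - 2)     # plug in Theta_MAP
posterior = beta(a + nd, b + N - nd)       # full posterior for p

# Bayesian prediction: integrate prob(x_new = Democratic | p) * prob(p | X) over p.
p_bayes, _ = quad(lambda p: p * posterior.pdf(p), 0.0, 1.0)

print(p_ml, p_map, p_bayes)   # 0.6, about 0.571, about 0.567 (the posterior mean)
```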

PART 3: Conjugate Priors

3.1: What is a Conjugate Prior?

As you saw, Bayesian estimation requires us to compute the full posterior distribution for the parameters of interest, as opposed to, say, just the value where the posterior acquires its maximum value. As shown already, the posterior is given by

    prob(\Theta | X) = \frac{prob(X | \Theta) \cdot prob(\Theta)}{\int prob(X | \Theta) \cdot prob(\Theta) \; d\Theta}

The most challenging part of the calculation here is the derivation of a closed form for the marginal in the denominator on the right.


For a given algebraic form for the likelihood, different forms for the prior prob(Θ) pose different levels of difficulty for the determination of the marginal in the denominator and, therefore, for the determination of the posterior.

For a given likelihood function prob(X|Θ), a prior prob(Θ) is called a conjugate prior if the posterior prob(Θ|X) has the same algebraic form as the prior.

Obviously, Bayesian estimation and prediction become much easier should the engineering assumptions allow a conjugate prior to be chosen for the applicable likelihood function.

When the likelihood can be assumed to be Gaussian (with the mean as the parameter to be estimated and the variance known), a Gaussian prior on the mean would constitute a conjugate prior, because in this case the posterior would also be Gaussian.
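A minimal sketch of this Gaussian-Gaussian case (my addition, assuming NumPy; the update formulas are the standard conjugate results for a Gaussian mean with known noise variance):

```python
import numpy as np

rng = np.random.default_rng(1)

mu0, tau = 0.0, 2.0                   # prior on the mean: N(mu0, tau^2)
sigma = 1.0                           # known standard deviation of each observation
X = rng.normal(1.5, sigma, size=30)   # hypothetical evidence

# Conjugate update: the posterior for the mean is again Gaussian.
post_prec = 1.0 / tau**2 + len(X) / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / tau**2 + X.sum() / sigma**2)
print(post_mean, post_var)            # concentrates near the sample mean as N grows
```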


You have already seen another example of a conjugate prior earlier in this review. For the Bernoulli-trial-based experiment we talked about earlier, the beta distribution constitutes a conjugate prior. As we saw there, the posterior was also a beta distribution (albeit with different hyperparameters).

As we will see later, when the likelihood is a multinomial, the conjugate prior is the Dirichlet distribution.


    PART 4: Multinomial Distributions


4.1: When Are Multinomial Distributions Useful for Likelihoods?

Multinomial distributions are useful for modeling the evidence when each observation in the evidence can be characterized by count-based features.

As a stepping stone to multinomial distributions, let's first talk about binomial distributions.

Binomial distributions answer the following question: Let's carry out N trials of a Bernoulli experiment with p as the probability of success at each trial. Let n be the random variable that denotes the number of times we achieve success in N trials. The random variable n has a binomial distribution that is given by


    prob(n) = \binom{N}{n} \, p^{n} (1 - p)^{N - n}

with the binomial coefficient

    \binom{N}{n} = \frac{N!}{n! \, (N - n)!}

A multinomial distribution is a generalization of the binomial distribution. Now, instead of a binary outcome at each trial, we have k possible mutually exclusive outcomes at each trial. At each trial, the k outcomes can occur with the probabilities p_1, p_2, ..., p_k, respectively, with the constraint that they must all add up to 1.

We still carry out N trials of the underlying experiment, and at the end of the N trials we pose the following question: What is the probability that we saw the first outcome n_1 times, the second outcome n_2 times, ..., and the k-th outcome n_k times? This probability is a multinomial distribution and is given by

    prob(n_1, ..., n_k) = \frac{N!}{n_1! \, n_2! \cdots n_k!} \; \prod_{i=1}^{k} p_i^{n_i}
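A small numerical sketch of this formula (my addition, assuming SciPy is available for the cross-check), using a fair six-sided die rolled N = 10 times as the hypothetical experiment:

```python
import math
from scipy.stats import multinomial

p = [1/6] * 6                    # probabilities of the k = 6 outcomes
n = [3, 1, 2, 0, 2, 2]           # observed counts n_1,...,n_6 (they sum to N = 10)
N = sum(n)

# Direct evaluation of the multinomial probability given above.
coef = math.factorial(N) // math.prod(math.factorial(ni) for ni in n)
pmf_direct = coef * math.prod(pi**ni for pi, ni in zip(p, n))

print(pmf_direct, multinomial(N, p).pmf(n))   # the two values agree
```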

If we wanted to actually measure the probability prob(n_1, ..., n_k) experimentally, we would carry out a large number of multinomial experiments and record the number of experiments in which the first outcome occurs n_1 times, the second outcome n_2 times, and so on.

If we want to make explicit the conditioning variables in prob(n_1, ..., n_k), we can express it as

    prob(n_1, ..., n_k \,|\, p_1, p_2, ..., p_k, N)

This form is sometimes expressed more compactly as

    Multi(\mathbf{n} \,|\, \mathbf{p}, N)


4.2: What is a Multinomial Random Variable?

A random variable W is a multinomial r.v. if its probability distribution is a multinomial. The vector n shown at the end of the last slide is a multinomial random variable.

A multinomial r.v. is a vector. Each element of the vector stands for a count of the occurrences of one of the k possible outcomes across the trials of the underlying experiment.

Think of rolling a die 1000 times. At each roll, you will see one of six possible outcomes. In this case, W = (n_1, ..., n_6), where n_i stands for the number of times you see the face with i dots.

One example: we can characterize an image by counting how often each of k possible feature types occurs among the N features extracted from the image.

A common way to characterize text documents is by the frequency of the words in the documents. Carrying out N trials of a k-outcome experiment could now correspond to recording the N most prominent words in a text file. If we assume that our vocabulary is limited to k words, for each document we would record the frequency of occurrence of each vocabulary word. We can think of each document as the result of one multinomial experiment.

For each of the above two cases, if for a given image or text file the first feature is observed n_1 times, the second feature n_2 times, etc., the likelihood probability to be associated with that image or text file would be

    prob(image \,|\, p_1, ..., p_k) = \prod_{i=1}^{k} p_i^{n_i}

We will refer to this probability as the multinomial likelihood of the image. We can think of p_1, ..., p_k as the parameters that characterize the database.
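In practice this product is evaluated in log space, since many small factors underflow quickly. A tiny sketch (my addition, assuming NumPy) for one hypothetical document over a four-word vocabulary:

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])   # hypothetical word probabilities p_i (k = 4)
n = np.array([30, 10, 5, 5])         # word counts n_i observed in the document

log_likelihood = np.sum(n * np.log(p))      # log of prod_i p_i^{n_i}
print(log_likelihood, np.exp(log_likelihood))
```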


4.4: Conjugate Prior for a Multinomial Likelihood

The conjugate prior for a multinomial likelihood is the Dirichlet distribution:

    prob(p_1, ..., p_k) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \; \prod_{i=1}^{k} p_i^{\alpha_i - 1}

where α_i, i = 1, ..., k, are the hyperparameters of the prior.

The Dirichlet is a generalization of the beta distribution from two degrees of freedom to k degrees of freedom. (Strictly speaking, it is a generalization from the one degree of freedom of a beta distribution to k − 1 degrees of freedom. That is because of the constraint \sum_{i=1}^{k} p_i = 1.)
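Here is a small sketch (my addition, assuming NumPy) that evaluates this density directly at one point of the simplex and, as a sanity check, verifies the known fact that samples from Dir(α) have mean α_i / Σα_i:

```python
import numpy as np
from math import gamma, prod

alpha = [2.0, 3.0, 4.0]          # hypothetical hyperparameters, k = 3
p = [0.2, 0.3, 0.5]              # a point on the simplex: p1 + p2 + p3 = 1

# Direct evaluation of the Dirichlet density given above.
norm = gamma(sum(alpha)) / prod(gamma(a) for a in alpha)
print(norm * prod(pi**(a - 1) for pi, a in zip(p, alpha)))

# Empirical check: the mean of Dirichlet samples is alpha_i / sum(alpha).
samples = np.random.default_rng(0).dirichlet(alpha, size=200_000)
print(samples.mean(axis=0), np.array(alpha) / sum(alpha))
```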


The Dirichlet prior is also expressed more compactly as

    prob(\mathbf{p} \,|\, \boldsymbol{\alpha})

For the purpose of visualization, consider the case when an image is only allowed to have three different kinds of features and we make N feature measurements in each image. In this case, k = 3. The Dirichlet prior would now take the form

    prob(p_1, p_2, p_3 \,|\, \alpha_1, \alpha_2, \alpha_3) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} \; p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} p_3^{\alpha_3 - 1}


under the constraint that p_1 + p_2 + p_3 = 1.

When k = 2, the Dirichlet prior reduces to

    prob(p_1, p_2 \,|\, \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \; p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1}

which is the same thing as the beta distribution shown earlier. Recall that now p_1 + p_2 = 1. Previously, we expressed the beta distribution as prob(p|α, β). The p there is the same thing as p_1 here.


    PART 5: Modelling Text


5.1: A Unigram Model for Documents

Let's say we wish to draw N prominent words from a document for its representation.

We assume that we have a vocabulary of V prominent words. Also assume, for the purpose of mental comfort, that V


Let our corpus (database of text documents) be characterized by the following set of probabilities: the probability that word_i of the vocabulary will appear in any given document is p_i. We will use the vector p to represent (p_1, p_2, ..., p_V). The vector p is referred to as defining the unigram statistics for the documents.

We can now associate the following multinomial likelihood with a document for which the r.v. W takes on the specific value W = (n_1, ..., n_V):

    prob(W \,|\, \mathbf{p}) = \prod_{i=1}^{V} p_i^{n_i}

If we want to carry out a Bayesian estimation of the parameters p, it would be best if we could represent the prior by the Dirichlet distribution:


    prob(\mathbf{p} \,|\, \boldsymbol{\alpha}) = Dir(\mathbf{p} \,|\, \boldsymbol{\alpha})

Since the Dirichlet is a conjugate prior for a multinomial likelihood, our posterior will also be a Dirichlet.

Let's say we wish to compute the posterior after observing a single document with W = (n_1, ..., n_V) as the value for the r.v. W. This can be shown to be

    prob(\mathbf{p} \,|\, W, \boldsymbol{\alpha}) = Dir(\mathbf{p} \,|\, \boldsymbol{\alpha} + \mathbf{n})
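Because of conjugacy, the posterior update is nothing more than adding the word counts to the hyperparameters. A minimal sketch (my addition, assuming NumPy) for a hypothetical five-word vocabulary:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0, 2.0, 2.0])   # symmetric Dirichlet prior
n     = np.array([10,   0,   3,   1,   6])    # word counts W = (n_1,...,n_V) for one document

alpha_post = alpha + n                        # posterior is Dir(p | alpha + n)

# The posterior mean (alpha_i + n_i) / sum(alpha + n) is a smoothed version
# of the raw word frequencies n_i / sum(n).
print(alpha_post / alpha_post.sum())
print(n / n.sum())
```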


5.2: A Mixture of Unigrams Model for Documents

In this model, we assume that the word probability p_i for the occurrence of word_i in a document is conditioned on the selection of a topic for the document. In other words, we first assume that a document contains words corresponding to a specific topic and that the word probabilities then depend on the topic.

Therefore, if we are given, say, 100,000 documents on, say, 20 topics, and if we can assume that each document pertains to only one topic, then the mixture of unigrams approach will partition the corpus into 20 clusters by assigning one of the 20 topic labels to each of the 100,000 documents.


We can now associate the following likelihood with a document for which the r.v. W takes on the specific value W = (n_1, ..., n_V), with n_i as the number of times word_i appears in the document:

    prob(W \,|\, \mathbf{p}) = \sum_{z} prob(z) \cdot \prod_{i=1}^{V} \bigl( p(word_i \,|\, z) \bigr)^{n_i}

Note that the summation over the topic distribution does not imply that a document is allowed to contain multiple topics simultaneously. It simply implies that a document, constrained to contain only one topic, may contain either topic z_1, or topic z_2, or any of the other topics, each with a probability that is given by prob(z).
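A tiny numerical sketch of this mixture likelihood (my addition, assuming NumPy), with two hypothetical topics and a four-word vocabulary:

```python
import numpy as np

prob_z = np.array([0.5, 0.5])                   # prob(z) for the two topics
p_w_given_z = np.array([[0.7, 0.1, 0.1, 0.1],   # p(word_i | z_1)
                        [0.1, 0.1, 0.1, 0.7]])  # p(word_i | z_2)
n = np.array([5, 1, 0, 2])                      # word counts W for one document

# prob(W) = sum over z of prob(z) * prod_i p(word_i | z)^{n_i}
per_topic = np.prod(p_w_given_z ** n, axis=1)
print(np.sum(prob_z * per_topic))
```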


5.3: Document Modelling with PLSA

PLSA stands for Probabilistic Latent Semantic Analysis.

PLSA extends the mixture of unigrams model by considering the document itself to be a random variable and declaring that a document and a word in the document are conditionally independent if we know the topic that governs the production of that word.

The mixture of unigrams model presented above required that a document contain only one topic. With PLSA, on the other hand, as you "generate" the words in a document, at each point you first randomly select a topic and then select a word based on the topic chosen.


The topics themselves are considered to be the hidden variables in the modelling process.

With PLSA, the probability that the word w_n from our vocabulary of V words will appear in a document d is given by

    prob(d, w_n) = prob(d) \sum_{z} prob(w_n \,|\, z) \; prob(z \,|\, d)

where the random variable z represents the hidden topics. Now a document can have any number of topics in it.
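A small sketch of this joint probability (my addition, assuming NumPy) for three hypothetical documents, two hidden topics, and a four-word vocabulary; note that the joint sums to 1 over all (d, w) pairs:

```python
import numpy as np

prob_d = np.array([0.5, 0.3, 0.2])                # prob(d)
prob_w_given_z = np.array([[0.4, 0.3, 0.2, 0.1],  # prob(w | z_1)
                           [0.1, 0.2, 0.3, 0.4]]) # prob(w | z_2)
prob_z_given_d = np.array([[0.9, 0.1],            # prob(z | d_1)
                           [0.5, 0.5],            # prob(z | d_2)
                           [0.2, 0.8]])           # prob(z | d_3)

# prob(d, w_n) = prob(d) * sum over z of prob(w_n | z) * prob(z | d)
joint = prob_d[:, None] * (prob_z_given_d @ prob_w_given_z)
print(joint)          # rows index documents, columns index words
print(joint.sum())    # 1.0
```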


5.4: Modelling Documents with LDA

LDA is a more modern approach to modelling documents.

LDA takes a more principled approach to expressing the dependencies between the topics and the documents on the one hand, and between the topics and the words on the other.

LDA stands for Latent Dirichlet Allocation. The name is justified by the fact that the topics are the latent (hidden) variables and our document modelling process must allocate the words to the topics.

Assume for a moment that we know that a corpus is best modeled with the help of k topics.


For each document, LDA first "constructs" a multinomial whose k outcomes correspond to choosing each of the k topics. For each document we are interested in the frequency of occurrence of each of the k topics. Given a document, the probabilities associated with each of the topics can be expressed as

    \theta = [\, p(z_1 | doc), \; ..., \; p(z_k | doc) \,]

where z_i stands for the i-th topic. It must obviously be the case that \sum_{i=1}^{k} p(z_i | doc) = 1, so the θ vector has only k − 1 degrees of freedom.

LDA assumes that the multinomial θ can be given a Dirichlet prior:

    prob(\theta \,|\, \alpha) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \; \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}


where θ_i stands for p(z_i|doc) and where α denotes the k hyperparameters of the prior.

Choosing θ for a document means randomly specifying the topic mixture for the document. After we have chosen θ randomly for a document, we need to generate the words for the document. This we do by first randomly choosing a topic at each word position according to θ, and then choosing a word by using the distribution specified by the β matrix:

    \beta =
    \begin{bmatrix}
    p(word_1 | z_1) & p(word_2 | z_1) & \cdots & p(word_V | z_1) \\
    p(word_1 | z_2) & p(word_2 | z_2) & \cdots & p(word_V | z_2) \\
    \vdots          & \vdots          &        & \vdots          \\
    p(word_1 | z_K) & p(word_2 | z_K) & \cdots & p(word_V | z_K)
    \end{bmatrix}
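Putting the pieces together, here is a minimal sketch of LDA's generative story for a single document (my addition, assuming NumPy; the sizes and hyperparameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, N = 3, 8, 20                          # topics, vocabulary size, words per document
alpha = np.full(k, 0.5)                     # Dirichlet hyperparameters for theta
beta = rng.dirichlet(np.ones(V), size=k)    # row z of beta is p(word | z)

theta = rng.dirichlet(alpha)                # 1) draw the topic mixture theta ~ Dir(alpha)
words = []
for _ in range(N):
    z = rng.choice(k, p=theta)              # 2) draw a topic for this word position
    w = rng.choice(V, p=beta[z])            # 3) draw a word from that topic's distribution
    words.append(w)
print(theta, words)
```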


    Acknowledgements

This presentation was a part of a tutorial marathon we ran in the Purdue Robot Vision Lab in the summer of 2008. The marathon consisted of three presentations: my presentation that you are seeing here on the foundations of parameter estimation and prediction, Shivani Rao's presentation on the application of LDA to text analysis, and Gaurav Srivastava's presentation on the application of LDA to scene interpretation in computer vision.

This tutorial presentation was inspired by the wonderful paper "Parameter Estimation for Text Analysis" by Gregor Heinrich that was brought to my attention by Shivani Rao. My explanations and the examples I have used follow closely those in the paper by Gregor Heinrich.

This tutorial presentation has also benefitted considerably from my many conversations with Shivani Rao.