
Page 1

Probabilistic Models in MapReduce

Jordan Boyd-Graber

April 7, 2011

Adapted from Jimmy Lin’s Slides

Page 2

Roadmap

Homework

Midterm

Probabilistic Models

Hidden Markov Model

Topic Models

Page 3

Homework

Project Proposal: Due 11

Homework 3?

Homework 4 is out

Homework 6

Page 4

Midterm: Max

Obvious answer

Less obvious answer:
Define grouper
Make sure values arrive in sorted order
Reducer only needs to look at first value
Not as efficient as using combiners (except in pathological situations)
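A framework-free sketch of that pattern (plain Python simulating Hadoop's sort and grouper hooks; the data is made up):

from itertools import groupby

# Simulate the secondary-sort trick: sort on (key, -value) so the first
# record in each group carries the max, group on the key alone (the
# "grouper"), and let the "reducer" read only the first value.
pairs = [("a", 3), ("b", 7), ("a", 9), ("b", 2), ("a", 5)]

shuffled = sorted(pairs, key=lambda kv: (kv[0], -kv[1]))    # framework sort
for key, group in groupby(shuffled, key=lambda kv: kv[0]):  # group by key only
    print(key, next(group)[1])  # first value is the max: a 9, b 7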

Page 5

Rule-Based Systems

Until the 1990s, text processing relied on rule-based systems

Advantages

More predictable

Easy to understand

Easy to identify errors and fix them

Disadvantages

Extremely labor-intensive to create

Not robust to out-of-domain input

No partial output or analysis when failure occurs

Page 6

Statistical Methods

Basic idea: learn from a large corpus of examples of what we wish to model (training data)

Advantages

More robust to the complexities of real-world input

Creating training data is usually cheaper than creating rules

Even easier today thanks to Amazon Mechanical Turk

Data may already exist for independent reasons

Disadvantages

Systems often behave differently than expected

Hard to understand the reasons for errors or debug errors

Page 7

Learning from training data usually means estimating the parameters of the statistical model

Estimation usually carried out via machine learning

Two kinds of machine learning algorithms:
Supervised learning

Training data consists of the inputs and respective outputs

(labels)

Labels are usually created via expert annotation (expensive)

Difficult to annotate when predicting more complex outputs

Unsupervised learning
Training data just consists of inputs. No labels.

One example of such an algorithm: Expectation Maximization

(EM)

Page 8

What Problems Can We Solve?

(Supervised) Part of speech tagging

(Unsupervised) Exploring large corpora

But first, a brief recap of estimating probability distributions


Page 10

How do we estimate a probability?

Suppose we want to estimate P(w_n = “dog” | z_n = “NN”).

dog   dog   cat   horse cow
cat   horse cow   fly   mouse
fly   dog   cat   fly   dog
mouse dog   fly   cat   cow

Maximum likelihood (ML) estimate of the probability is:

θ_i = n_i / Σ_k n_k   (1)

Is this reasonable?


Page 14

How do we estimate a probability?

In computational linguistics, we often have a prior notion of what our probability distributions are going to look like (for example, non-zero, sparse, uniform, etc.).

This estimate of a probability distribution is called the maximum a posteriori (MAP) estimate:

θ_MAP = argmax_θ f(x|θ) g(θ)   (2)

Page 15

How do we estimate a probability?

For a multinomial distribution (i.e. a discrete distribution, like over words):

θ_i = (n_i + α_i) / Σ_k (n_k + α_k)   (3)

α_i is called a smoothing factor, a pseudocount, etc.

When α_i = 1 for all i, it’s called “Laplace smoothing” and corresponds to a uniform prior over all multinomial distributions (we talked about this before).

To geek out, the set {α_1, . . . , α_N} parameterizes a Dirichlet distribution, which is itself a distribution over distributions and is the conjugate prior of the multinomial (more later).
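To make Equations 1 and 3 concrete, here is a small sketch (plain Python; the counts come from the noun sample above, and treating the six observed types as the entire vocabulary is an assumption of the sketch):

from collections import Counter

tokens = ("dog dog cat horse cow cat horse cow fly mouse "
          "fly dog cat fly dog mouse dog fly cat cow").split()
counts = Counter(tokens)
total = sum(counts.values())

def ml_estimate(word):
    # Equation (1): theta_i = n_i / sum_k n_k
    return counts[word] / total

def map_estimate(word, alpha=1.0):
    # Equation (3) with a flat pseudocount alpha (Laplace smoothing)
    return (counts[word] + alpha) / (total + alpha * len(counts))

print(ml_estimate("dog"))   # 5/20 = 0.25
print(map_estimate("dog"))  # (5+1)/(20+6) = 0.2307...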


Page 18

Parts of Speech

The Art of Grammar circa 100 B.C.

Written to allow post-Classical Greek speakers to understand the Odyssey and other classical poets

[Noun, Verb, Pronoun, Article, Adverb, Conjunction, Participle,Preposition]

Remarkably enduring list

Occur in almost every language

Defined primarily in terms of syntactic and morphological criteria (affixes)

Page 19

Categories of POS Tags

Closed Class

Relatively fixed membership

Conjunctions, Prepositions, Auxiliaries, Determiners, Pronouns

Function words: short and used primarily for structuring

Open Class

Nouns, Verbs, Adjectives, Adverbs

Frequent neologisms (borrowed/coined)

Most types

Page 20

Tagsets

Several English tagsets have been developed (language-specific)

Vary in number of tags

Brown Tagset (87)

Penn Treebank (45) [More common]

Simple morphology = more ambiguity = smaller tagset

Size depends on language and purpose

Page 21

Why?

Corpus-based Linguistic Analysis & Lexicography

Information Retrieval & Question Answering

Automatic Speech Synthesis

Word Sense Disambiguation

Shallow Syntactic Parsing

Machine Translation

Page 22

What do we need to specify an FSM formally?

Finite number of states

Transitions

Input alphabet

Start state

Final state(s)

Page 23

Weighted FSM

(figure: a weighted FSM with transition weights 0.5, 0.25, 0.25, 0.25, 0.75, and 1.0)

‘a’ is twice as likely to be seen in state 1 as ‘b’ or ‘c’

‘c’ is three times as likely to be seen in state 2 as ‘a’

P(ab) = 0.50 × 1.00 = 0.5, P(bc) = 0.25 × 0.75 = 0.1875   (4)

Page 24

Observable States and Probabilistic Emissions

This is not a valid probabilistic FSM!

No start states

Use prior probabilities

Note that the probability of being in any state ONLY depends on the previous state, i.e., the (1st-order) Markov assumption

This extension of a probabilistic FSM is called a Markov Chain or an Observed Markov Model

Each state corresponds to an observable physical event


Page 26

Are states observable?

What you actually observe:


Page 28

HMM Intuitions

Need to model problems where observed events don’t correspond to states directly

Instead, observations are probabilistic functions of a hidden state

Solution: A Hidden Markov Model (HMM)

Assume two probabilistic processes:
The underlying process is hidden (states = hidden events)
A second process produces the sequence of observed events

Page 29

HMM Definition

Assume K parts of speech, a lexicon size of V, a series of observations {x_1, . . . , x_N}, and a series of unobserved states {z_1, . . . , z_N}.

π A distribution over start states (vector of length K): π_i = p(z_1 = i)

θ A transition matrix (matrix of size K by K): θ_{i,j} = p(z_n = j | z_{n−1} = i)

β An emission matrix (matrix of size K by V): β_{k,v} = p(x_n = v | z_n = k)

Two problems: How do we move from data to a model? (Estimation) How do we move from a model and unlabeled data to labeled data? (Inference)


Page 31

Training Sentences

here  come  old  flattop
MOD   V     MOD  N

a    crowd  of    people  stood  and   stared
DET  N      PREP  N       V      CONJ  V

gotta  get  you  into  my   life
V      V    PRO  PREP  PRO  V

and   I    love  her
CONJ  PRO  V     PRO

Page 32

Initial Probability π

POS    Frequency  Probability
MOD    1.1        0.234
DET    1.1        0.234
CONJ   1.1        0.234
N      0.1        0.021
PREP   0.1        0.021
PRO    0.1        0.021
V      1.1        0.234

Remember, we’re taking MAP estimates, so we add 0.1 (arbitrarily chosen) to each of the counts before normalizing to create a probability distribution. This is easy; one sentence starts with an adjective, one with a determiner, one with a verb, and one with a conjunction.
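A minimal sketch of that computation, assuming nothing beyond the tag inventory and the 0.1 pseudocount on this slide:

from collections import Counter

TAGS = ["MOD", "DET", "CONJ", "N", "PREP", "PRO", "V"]
first_tags = ["MOD", "DET", "V", "CONJ"]  # first tag of each training sentence
counts = Counter(first_tags)

alpha = 0.1                                        # the arbitrarily chosen pseudocount
normalizer = sum(counts[t] + alpha for t in TAGS)  # 4 + 7 * 0.1 = 4.7
pi = {t: (counts[t] + alpha) / normalizer for t in TAGS}

print(round(pi["MOD"], 3))  # 1.1 / 4.7 = 0.234
print(round(pi["N"], 3))    # 0.1 / 4.7 = 0.021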

Page 33

Transition Probability θ

We can ignore the words; just look at the parts of speech. Let’s compute one row, the row for verbs.

We see the following transitions: V → MOD, V → CONJ, V → V, V → PRO, and V → PRO

POS    Frequency  Probability
MOD    1.1        0.193
DET    0.1        0.018
CONJ   1.1        0.193
N      0.1        0.018
PREP   0.1        0.018
PRO    2.1        0.368
V      1.1        0.193

And do the same for each part of speech ...

Page 34

Emission Probability β

Let’s look at verbs . . .

Word         a      and    come   crowd  flattop
Frequency    0.1    0.1    1.1    0.1    0.1
Probability  0.011  0.011  0.121  0.011  0.011

Word         get    gotta  her    here   i
Frequency    1.1    1.1    0.1    0.1    0.1
Probability  0.121  0.121  0.011  0.011  0.011

Word         into   it     life   love   my
Frequency    0.1    0.1    0.1    1.1    0.1
Probability  0.011  0.011  0.011  0.121  0.011

Word         of     old    people stared stood
Frequency    0.1    0.1    0.1    1.1    1.1
Probability  0.011  0.011  0.011  0.121  0.121

Page 35

Viterbi Algorithm

Given an observation sequence of length L, {x_1, . . . , x_L}, we want to find the state sequence {z_1, . . . , z_L} with the highest probability.

It’s intractable to enumerate all K^L possibilities.

So, we use dynamic programming to compute the best sequence for each subsequence from 0 to l.

Base case:
δ_1(k) = π_k β_{k,x_1}   (5)

Recursion:
δ_n(k) = max_j (δ_{n−1}(j) θ_{j,k}) β_{k,x_n}   (6)


Page 37

The complexity of this is now K²L.

But just computing the max isn’t enough. We also have to remember where we came from. (Breadcrumbs from the best previous state.)

Ψ_n(k) = argmax_j δ_{n−1}(j) θ_{j,k}   (7)

Let’s do that for the sentence “come and get it”
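Before tracing it by hand, here is a minimal log-space sketch of Equations 5-7; it assumes pi, theta, and beta are nested dicts shaped like the earlier tables (pi[tag], theta[prev][next], beta[tag][word]):

import math

def viterbi(words, tags, pi, theta, beta):
    # Base case (Equation 5), in log space
    delta = [{k: math.log(pi[k]) + math.log(beta[k][words[0]]) for k in tags}]
    back = [{}]
    for n in range(1, len(words)):
        delta.append({})
        back.append({})
        for k in tags:
            # Breadcrumb (Equation 7): best previous state for landing in k
            j = max(tags, key=lambda j: delta[n - 1][j] + math.log(theta[j][k]))
            back[n][k] = j
            # Recursion (Equation 6)
            delta[n][k] = (delta[n - 1][j] + math.log(theta[j][k])
                           + math.log(beta[k][words[n]]))
    # Follow the breadcrumbs back from the best final state
    best = max(tags, key=lambda k: delta[-1][k])
    path = [best]
    for n in range(len(words) - 1, 0, -1):
        path.append(back[n][path[-1]])
    return list(reversed(path))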


Page 39

POS    π_k    β_{k,x_1}  log δ_1(k)
MOD    0.234  0.024      -5.18
DET    0.234  0.032      -4.89
CONJ   0.234  0.024      -5.18
N      0.021  0.016      -7.99
PREP   0.021  0.024      -7.59
PRO    0.021  0.016      -7.99
V      0.234  0.121      -3.56

come and get it

Why logarithms?

1. More interpretable than a float with lots of zeros.

2. Underflow is less of an issue.

3. Addition is cheaper than multiplication.

Page 40

Computing the second column, δ_2, for the word “and”, in the CONJ state:

POS    log δ_1(j)   log δ_1(j)θ_{j,CONJ}
MOD    -5.18        -8.48
DET    -4.89        -7.72
CONJ   -5.18        -8.47
N      -7.99        ≤ −7.99
PREP   -7.59        ≤ −7.59
PRO    -7.99        ≤ −7.99
V      -3.56        -5.21  ← max

come and get it

log(δ_1(V) θ_{V,CONJ}) = log δ_1(V) + log θ_{V,CONJ} = −3.56 + −1.65 = −5.21

log δ_2(CONJ) = −5.21 + log β_{CONJ,and} = −5.21 − 0.81 = −6.02

Page 51

POS    δ_1(k)  δ_2(k)  b_2  δ_3(k)  b_3   δ_4(k)  b_4
MOD    -5.18   X            X             X
DET    -4.89   X            X             X
CONJ   -5.18   -6.02   V    X             X
N      -7.99   X            X             X
PREP   -7.59   X            X             X
PRO    -7.99   X            X             -14.6   V
V      -3.56   X            -9.03   CONJ  X

WORD   come    and          get           it

(cells marked X are not on the best path)

Page 55

MapReduce: HMM Learning

Mapper

def map(sentence_id, sentence):
    prev = None
    for state, word in sentence:
        if prev is None:
            emit(("S", 0, -1), 1)      # normalizer token for the start distribution
            emit(("S", 0, state), 1)   # start-state count
        else:
            emit(("T", prev, -1), 1)     # normalizer token for transitions out of prev
            emit(("T", prev, state), 1)  # transition count prev -> state
        emit(("E", state, -1), 1)    # normalizer token for emissions from state
        emit(("E", state, word), 1)  # emission count
        prev = state

Page 56

MapReduce: HMM Learning

Reducer

def reduce(key, values):
    distribution, i, j = key
    if j == -1:
        # the -1 key sorts first (order inversion), so the normalizer is
        # ready before the real counts for this distribution arrive
        normalizer = sum(values) + priors_sum[distribution]
    else:
        # MAP estimate: smoothed count over the smoothed normalizer
        emit(key, (sum(values) + priors_element[distribution]) / normalizer)

Page 57

MapReduce: HMM Testing

Distributed parameters via distributed cache

Do Viterbi for each sentence using the parameters

Can happen independently in a mapper; output the final assignment

Use identity reducer

Output final POS sequence (see Lin & Dyer for the gory details)

Page 58

Why topic models?

Suppose you have a huge number of documents

You want to know what’s going on

Don’t have time to read them (e.g. every New York Times article from the 50’s)

Topic models offer a way to get a corpus-level view of major themes

Unsupervised



Page 62

Conceptual Approach

Given a corpus, what topics (a priori number) are expressed throughout the corpus?

For each document, what topics are expressed by that document?

TOPIC 1: computer, technology, system, service, site, phone, internet, machine

TOPIC 2: play, film, movie, theater, production, star, director, stage

TOPIC 3: sell, sale, store, product, business, advertising, market, consumer

Example headlines, each linked to the topics it expresses:

Forget the Bootleg, Just Download the Movie Legally

Multiplex Heralded As Linchpin To Growth

The Shape of Cinema, Transformed At the Click of a Mouse

A Peaceful Crew Puts Muppets Where Its Mouth Is

Stock Trades: A Better Deal For Investors Isn't Simple

The three big Internet portals begin to distinguish among themselves as shopping malls

Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

Page 63

Topics from Science

Page 64

Why should you care?

Neat way to explore / understand corpus collections

NLP Applications:
POS Tagging [Toutanova and Johnson 2008]
Word Sense Disambiguation [Boyd-Graber et al. 2007]
Word Sense Induction [Brody and Lapata 2009]
Discourse Segmentation [Purver et al. 2006]

Psychology [Griffiths et al. 2007b]: word meaning, polysemy

Inference is (relatively) simple

Page 65

Matrix Factorization Approach

Dataset (M × V) ≈ Topic Assignment (M × K) × Topics (K × V)

K Number of topics

M Number of documents

V Size of vocabulary

If you use singular value decomposition (SVD), this technique is called latent semantic analysis.

Popular in information retrieval.
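A toy sketch of this factorization view (the term matrix is made up; numpy's SVD provides the rank-K truncation that latent semantic analysis uses):

import numpy as np

M, V, K = 6, 10, 2
rng = np.random.default_rng(0)
dataset = rng.poisson(1.0, size=(M, V)).astype(float)  # toy M x V term counts

U, S, Vt = np.linalg.svd(dataset, full_matrices=False)
topic_assignment = U[:, :K] * S[:K]        # M x K
topics = Vt[:K, :]                         # K x V
approximation = topic_assignment @ topics  # best rank-K fit to the dataset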


Page 67

Alternative: Generative Model

How your data came to be

Sequence of Probabilistic Steps

Posterior Inference

Page 68

Multinomial Distribution

Distribution over discrete outcomes

Represented by non-negative vector that sums to one

Picture representation: points on the simplex, e.g. (1,0,0), (0,1,0), (0,0,1), (1/2,1/2,0), (1/3,1/3,1/3), (1/4,1/4,1/2)

Come from a Dirichlet distribution


Page 70

Dirichlet Distribution



Page 78

Generative Model Approach

(plate diagram: λ → β_k for each of K topics; α → θ_d → z_n → w_n for each of N word positions in each of M documents)

For each topic k ∈ {1, . . . , K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ

For each document d ∈ {1, . . . , M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α

For each word position n ∈ {1, . . . , N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d. Choose the observed word w_n from the distribution β_{z_n}.

We use statistical inference to uncover the most likely unobserved variables given observed data.
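A sketch of this generative story with toy sizes; K, V, M, N and the hyperparameters lam and alpha below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 20, 5, 12
lam, alpha = 0.1, 0.5                # illustrative Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, lam), size=K)  # one word distribution per topic
docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))   # document's topic proportions
    z = rng.choice(K, size=N, p=theta)         # hidden topic per word position
    w = [rng.choice(V, p=beta[k]) for k in z]  # observed words
    docs.append(w)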

Page 79

Topic Models: What’s Important

A generative probabilistic model of document collections that posits a hidden topical structure which is inferred from data

A topic is a distribution over words

Topics have semantic coherence because of language use

We use latent Dirichlet allocation (LDA) [Blei et al. 2003], a fully Bayesian version of pLSI [Hofmann 1999]

Page 80

Learning topics

What we want: a (topic) model

This is represented by a configuration of latent variables z

What we have: our data D, any hyperparameters Ξ

Compute the likelihood L = p(D|z, Ξ).

The higher this number is, the better we’re doing


Page 83

Expectation Maximization Algorithm

Input: z (hidden variables), ξ (parameters), D (data)

Start with initial guess of z

Repeat:
M-Step: Compute the parameters ξ that maximize likelihood L (use calculus)
E-Step: Compute the expected value of latent variables z

With each iteration, the objective function goes up

Page 84

Theory

Sometimes you can’t actually optimize L

So we instead optimize a lower bound L̃ based on a “variational” distribution q

L̃ = E_q[log(p(D|Z) p(Z|ξ))] − E_q[log q(Z)]   (8)

L − L̃ = KL(q‖p)

This is called variational EM (normal EM is when p = q)

Makes the math possible: we can optimize L̃

Page 85

Variational distribution

(plate diagrams: (a) LDA — α → θ_d → z_n → w_n, with topics β_k; (b) the variational distribution — γ_d governs θ_d and φ_n governs z_n, with the couplings between them removed)

Page 86

Updates - Important Part

φ_{d,n,k} How much the nth word in document d expresses topic k

γ_{d,k} How much the kth topic is expressed in document d

β_{v,k} How much word v is associated with topic k

φ_{d,n,k} ∝ β_{w_{d,n},k} · e^{Ψ(γ_k)}

γ_{d,k} = α_k + Σ_{n=1}^{N_d} φ_{d,n,k}

β_{v,k} ∝ Σ_{d=1}^{C} w_v^{(d)} φ_{d,v,k}

This is the algorithm!
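A per-document sketch of the φ and γ updates above (the β update then sums w_v φ across documents, which is exactly what the MapReduce version parallelizes); the names beta, word_ids, and word_cts are illustrative, not from the slides:

import numpy as np
from scipy.special import psi  # the digamma function Psi used above

def update_document(beta, alpha, word_ids, word_cts, iters=50):
    # beta: K x V topic-word matrix; alpha: length-K Dirichlet parameter;
    # word_ids/word_cts: the distinct words of one document and their counts
    K = beta.shape[0]
    gamma = alpha + sum(word_cts) / K  # initial guess for gamma_d
    for _ in range(iters):
        # phi_{n,k} proportional to beta_{w_n,k} * exp(Psi(gamma_k))
        phi = beta[:, word_ids].T * np.exp(psi(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + expected topic counts
        gamma = alpha + (np.asarray(word_cts)[:, None] * phi).sum(axis=0)
    return gamma, phi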


Page 88

Objective Function

Expanding Equation 8 gives us L(γ, φ; α, β) as a sum of per-document terms:

L(γ, φ; α, β) = Σ_{d=1}^{C} L_d(γ, φ; α, β)
             = Σ_{d=1}^{C} L_d(α)   [computed in driver]
               + Σ_{d=1}^{C} ( L_d(γ, φ) + L_d(φ) + L_d(γ) )   [terms computed in mapper, sum computed in reducer]

where

L_d(α) = log Γ(Σ_{k=1}^{K} α_k) − Σ_{k=1}^{K} log Γ(α_k)

Page 89

Objective Function

The same decomposition as above, where

L_d(γ, φ) = Σ_{k=1}^{K} [ Σ_{v=1}^{V} φ_{v,k} − Σ_{v=1}^{V} φ_{v,k} w_v ] [ Ψ(γ_k) − Ψ(Σ_{i=1}^{K} γ_i) ]

Page 90

Objective Function

The same decomposition as above, where

L_d(φ) = Σ_{v=1}^{V} Σ_{k=1}^{K} φ_{v,k} ( log φ_{v,k} + Σ_{i=1}^{V} w_i log β_{i,k} )

Page 91

Objective Function

The same decomposition as above, where

L_d(γ) = − log Γ(Σ_{k=1}^{K} γ_k) + Σ_{k=1}^{K} log Γ(γ_k)

Page 92

(workflow diagram: document mappers update φ and γ in parallel; reducers aggregate the sufficient statistics for the β update and write β; the driver updates α from Hessian terms and writes α; parameters are redistributed to mappers via the distributed cache; the loop repeats, testing likelihood convergence)

Page 93

Map(d, w⃗)

1: repeat
2:   for all v ∈ [1, V] do
3:     for all k ∈ [1, K] do
4:       Update φ_{v,k} = β_{v,k} × exp(Ψ(γ_{d,k})).
5:     end for
6:     Normalize row φ_{v,∗}, such that Σ_{k=1}^{K} φ_{v,k} = 1.
7:     Update σ = σ + w⃗_v φ_v, where φ_v is a K-dimensional vector, and w⃗_v is the count of v in this document.
8:   end for
9:   Update row vector γ_{d,∗} = α + σ.
10: until convergence
11: for all k ∈ [1, K] do
12:   for all v ∈ [1, V] do
13:     Emit key-value pair ⟨k, ⋆⟩ : w⃗_v φ_v.
14:     Emit key-value pair ⟨k, v⟩ : w⃗_v φ_v. {order inversion}
15:   end for
16:   Emit key-value pair ⟨⋆, k⟩ : (Ψ(γ_{d,k}) − Ψ(Σ_{l=1}^{K} γ_{d,l})). {emit the γ-tokens for the α update}
17:   Output key-value pair ⟨k, d⟩ → γ_{d,k} to file.
18: end for
19: Emit key-value pair ⟨⋆, ⋆⟩ → L, where L is the log-likelihood of this document.

Page 94

Input:
Key - a pair ⟨p_left, p_right⟩.
Value - an iterator I over a sequence of values.

Configuration
1: Initialize the total number of topics as K.
2: Initialize a normalization factor n = 0.

Reduce
1: Compute the sum σ over all values in the sequence I.
2: if p_left = ⋆ then
3:   if p_right = ⋆ then
4:     Output key-value pair ⟨⋆, ⋆⟩ → σ to file. {output the model likelihood L for convergence checking}
5:   else
6:     Output key-value pair ⟨⋆, p_right⟩ → σ to file. {output the γ-tokens to update the α-vector}
7:   end if
8: else
9:   if p_right = ⋆ then
10:    Update the normalization factor n = σ. {order inversion}
11:  else
12:    Output key-value pair ⟨k, v⟩ : σ/n. {output the normalized β value}
13:  end if
14: end if

Page 95

Applications

What’s a document?

What’s a word?

What’s your vocabulary?

How do you evaluate?

Page 96

Applications

Computer Vision [Li Fei-Fei and Perona 2005]

Page 97

Applications

Social Networks [Airoldi et al. 2008]

Page 98

Applications

Music [Hu and Saul 2009]

(Figure 2: the C major and C minor key-profiles learned by the model, as encoded by the β matrix; the remaining key-profiles are obtained by transposition. Figure 3: key judgments for the first 6 measures of Bach's Prelude in C minor, WTC-II, showing the top three keys per measure from the LDA-based model alongside human expert judgments. The model was estimated from 235 MIDI files spanning baroque through romantic composers and reached 86% key-finding accuracy, out-performing the popular KS model.)

Page 99

Nonparametric Models

We’ve always assumed a fixed number of topics

Topic modeling inspired a resurgence of nonparametric Bayesian statistics that can handle infinitely many mixture components [Antoniak 1974]

Equivalent to the Rational Model of Categorization [Griffiths et al. 2007a]

For the rest of this talk, cartoon version - details similar to LDA

Page 100

Active research

Combining these models with models of syntax

Scaling up to larger corpora

Making topics relevant to social scientists

Humans in the loop

Modeling metadata in document collections

Page 101

Recap

Probabilistic models - way of learning what data you have

Make predictions about the future

Require a little bit of math to figure out, but fairly easy to implement in MapReduce

Page 102

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. 2008. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.

Charles E. Antoniak. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174.

David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A topic model for word sense disambiguation. In Proceedings of Empirical Methods in Natural Language Processing.

Samuel Brody and Mirella Lapata. 2009. Bayesian word sense induction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 103–111, Athens, Greece, March. Association for Computational Linguistics.

T. L. Griffiths, K. R. Canini, A. N. Sanborn, and D. J. Navarro. 2007a. Unifying rational models of categorization via the hierarchical Dirichlet process. In Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society.

Thomas L. Griffiths, Mark Steyvers, and Joshua Tenenbaum. 2007b. Topics in semantic representation. Psychological Review, 114(2):211–244.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, Stockholm.

Diane Hu and Lawrence K. Saul. 2009. A probabilistic model of unsupervised learning for musical-key profiles. In International Society for Music Information Retrieval Conference.

Li Fei-Fei and Pietro Perona. 2005. A Bayesian hierarchical model for learning natural scene categories. In CVPR '05 - Volume 2, pages 524–531, Washington, DC, USA. IEEE Computer Society.

Matthew Purver, Konrad Kording, Thomas L. Griffiths, and Joshua Tenenbaum. 2006. Unsupervised topic modelling for multi-party spoken discourse. In Proceedings of the Association for Computational Linguistics.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1521–1528. MIT Press, Cambridge, MA.