Probabilistic Models in MapReduce
Jordan Boyd-Graber
April 7, 2011
Adapted from Jimmy Lin’s Slides
Roadmap
Homework
Midterm
Probabilistic Models
Hidden Markov Model
Topic Models
Homework
Project Proposal: Due 11
Homework 3?
Homework 4 is out
Homework 6
Midterm: Max
Obvious answer
Less obvious answer:
Define a grouper
Make sure values arrive in sorted order
Reducer only needs to look at the first value
Not as efficient as using combiners (except in pathological situations)
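The "less obvious" max answer can be sketched outside Hadoop as a value-to-key simulation (an illustrative sketch, not course code: Python's `sorted` stands in for the framework's shuffle sort, and the names are made up):

```python
# Sketch of value-to-key conversion for max: the value is moved into the
# composite key, sorting delivers the largest first, and the "reducer"
# only needs to look at the first value per natural key.

def mapper(key, value):
    yield (key, -value), value  # negate so the largest value sorts first

def reducer(sorted_pairs):
    """Group by natural key; keep only the first (largest) value seen."""
    result = {}
    for (key, _), value in sorted(sorted_pairs):
        if key not in result:   # first value for this key is the max
            result[key] = value
    return result

pairs = [p for k, v in [("a", 3), ("a", 7), ("b", 5), ("a", 1)]
         for p in mapper(k, v)]
print(reducer(pairs))  # {'a': 7, 'b': 5}
```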
Rule-Based Systems
Until the 1990s, text processing relied on rule-based systems
Advantages
More predictable
Easy to understand
Easy to identify errors and fix them
Disadvantages
Extremely labor-intensive to create
Not robust to out of domain input
No partial output or analysis when failure occurs
Statistical Methods
Basic idea: learn from a large corpus of examples of what we wish to model (training data)
Advantages
More robust to the complexities of real-world input
Creating training data is usually cheaper than creating rules
Even easier today thanks to Amazon Mechanical Turk
Data may already exist for independent reasons
Disadvantages
Systems often behave differently compared to expectations
Hard to understand the reasons for errors or debug errors
Learning from training data usually means estimating the parameters of the statistical model
Estimation usually carried out via machine learning
Two kinds of machine learning algorithms
Supervised learning
Training data consists of the inputs and respective outputs (labels)
Labels are usually created via expert annotation (expensive)
Difficult to annotate when predicting more complex outputs
Unsupervised learning
Training data just consists of inputs. No labels.
One example of such an algorithm: Expectation Maximization (EM)
What Problems Can We Solve?
(Supervised) Part of speech tagging
(Unsupervised) Exploring large corpora
But first, a brief recap of estimating probability distributions
How do we estimate a probability?
Suppose we want to estimate P(wn = "dog" | zn = "NN").

dog dog cat horse cow
cat horse cow fly mouse
fly dog cat fly dog
mouse dog fly cat cow

Maximum likelihood (ML) estimate of the probability is:

θi = ni / Σk nk  (1)
Is this reasonable?
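As a sanity check, the ML estimate of equation (1) can be computed directly from the 20-word grid above (a minimal sketch; the grid is read as 20 observations):

```python
from collections import Counter

# The 20 nouns observed on the slide
words = ("dog dog cat horse cow cat horse cow fly mouse "
         "fly dog cat fly dog mouse dog fly cat cow").split()

counts = Counter(words)
# Maximum likelihood estimate: theta_i = n_i / sum_k n_k
theta = {w: n / len(words) for w, n in counts.items()}
print(theta["dog"])  # 5 occurrences out of 20 -> 0.25
```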
How do we estimate a probability?
In computational linguistics, we often have a prior notion of what our probability distributions are going to look like (for example, non-zero, sparse, uniform, etc.).

This estimate of a probability distribution is called the maximum a posteriori (MAP) estimate:

θMAP = argmaxθ f(x | θ) g(θ)  (2)
How do we estimate a probability?
For a multinomial distribution (i.e., a discrete distribution, like over words):

θi = (ni + αi) / Σk (nk + αk)  (3)

αi is called a smoothing factor, a pseudocount, etc.
When αi = 1 for all i, it's called "Laplace smoothing" and corresponds to a uniform prior over all multinomial distributions (we talked about this before).
To geek out, the set {α1, . . . , αN} parameterizes a Dirichlet distribution, which is itself a distribution over distributions and is the conjugate prior of the multinomial (more later).
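Equation (3) can be sketched on the same 20-word sample from the earlier slide, with a single α shared across the vocabulary (Laplace smoothing when α = 1):

```python
from collections import Counter

words = ("dog dog cat horse cow cat horse cow fly mouse "
         "fly dog cat fly dog mouse dog fly cat cow").split()
vocab = sorted(set(words))   # 6 distinct words
counts = Counter(words)

def map_estimate(alpha):
    """theta_i = (n_i + alpha) / sum_k (n_k + alpha), same alpha for every i."""
    denom = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

laplace = map_estimate(1.0)  # uniform prior over multinomials
print(laplace["dog"])        # (5 + 1) / (20 + 6) ≈ 0.2308
```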
Parts of Speech
The Art of Grammar circa 100 B.C.
Written to allow post-Classical Greek speakers to understand the Odyssey and other classical poets
[Noun, Verb, Pronoun, Article, Adverb, Conjunction, Participle,Preposition]
Remarkably enduring list
Occur in almost every language
Defined primarily in terms of syntactic and morphological criteria (affixes)
Categories of POS Tags
Closed Class
Relatively fixed membership
Conjunctions, Prepositions, Auxiliaries, Determiners, Pronouns
Function words: short and used primarily for structuring
Open Class
Nouns, Verbs, Adjectives, Adverbs
Frequent neologisms (borrowed/coined)
Most types
Tagsets
Several English tagsets have been developed (language specific)
Vary in number of tags
Brown Tagset (87)
Penn Treebank (45) [More common]
Simple morphology = more ambiguity = smaller tagset
Size depends on language and purpose
Why?
Corpus-based Linguistic Analysis & Lexicography
Information Retrieval & Question Answering
Automatic Speech Synthesis
Word Sense Disambiguation
Shallow Syntactic Parsing
Machine Translation
What do we need to specify an FSM formally?
Finite number of states
Transitions
Input alphabet
Start state
Final state(s)
Weighted FSM
(Figure: weighted FSM with edge weights 0.5, 0.25, 0.25, 0.25, 0.75, 1.0)

'a' is twice as likely to be seen in state 1 as 'b' or 'c'
'c' is three times as likely to be seen in state 2 as 'a'

P(ab) = 0.50 × 1.00 = 0.5, P(bc) = 0.25 × 0.75 = 0.1875  (4)
Observable States and Probabilistic Emissions
This is not a valid prob. FSM!
No start states
Use prior probabilities
Note that the prob. of being in any state ONLY depends on the previous state, i.e., the (1st order) Markov assumption
This extension of a prob. FSM is called a Markov Chain or an Observed Markov Model
Each state corresponds to an observable physical event
Are states observable?
What you actually observe:
HMM Intuitions
Need to model problems where observed events don't correspond to states directly
Instead, observations are a probabilistic function of a hidden state
Solution: A Hidden Markov Model (HMM)
Assume two probabilistic processes:
Underlying process is hidden (states = hidden events)
Second process produces the sequence of observed events
![Page 29: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/29.jpg)
HMM Definition
Assume K parts of speech, a lexicon size of V, a series of observations {x1, . . . , xN}, and a series of unobserved states {z1, . . . , zN}.

π A distribution over start states (vector of length K): πi = p(z1 = i)
θ A transition matrix (matrix of size K by K): θi,j = p(zn = j | zn−1 = i)
β An emission matrix (matrix of size K by V): βk,v = p(xn = v | zn = k)

Two problems: How do we move from data to a model? (Estimation) How do we move from a model and unlabeled data to labeled data? (Inference)
Training Sentences
here come old flattop
MOD V MOD N

a crowd of people stopped and stared
DET N PREP N V CONJ V

gotta get you into my life
V V PRO PREP PRO V

and I love her
CONJ PRO V PRO
![Page 32: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/32.jpg)
Initial Probability π
POS   Frequency  Probability
MOD   1.1        0.234
DET   1.1        0.234
CONJ  1.1        0.234
N     0.1        0.021
PREP  0.1        0.021
PRO   0.1        0.021
V     1.1        0.234

Remember, we're taking MAP estimates, so we add 0.1 (arbitrarily chosen) to each of the counts before normalizing to create a probability distribution. This is easy; one sentence starts with an adjective, one with a determiner, one with a verb, and one with a conjunction.
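The table above can be reproduced in a few lines (a sketch; the 0.1 pseudocount and tag names follow the slide):

```python
TAGS = ["MOD", "DET", "CONJ", "N", "PREP", "PRO", "V"]
# First tag of each training sentence on the preceding slide
first_tags = ["MOD", "DET", "V", "CONJ"]
ALPHA = 0.1  # arbitrarily chosen pseudocount, as on the slide

# Smoothed counts, then normalize into a distribution
freq = {t: first_tags.count(t) + ALPHA for t in TAGS}
total = sum(freq.values())
pi = {t: f / total for t, f in freq.items()}
print(round(pi["MOD"], 3), round(pi["N"], 3))  # 0.234 0.021
```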
![Page 33: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/33.jpg)
Transition Probability θ
We can ignore the words; just look at the parts of speech. Let's compute one row, the row for verbs.
We see the following transitions: V → MOD, V → CONJ, V → V, V → PRO, and V → PRO

POS   Frequency  Probability
MOD   1.1        0.193
DET   0.1        0.018
CONJ  1.1        0.193
N     0.1        0.018
PREP  0.1        0.018
PRO   2.1        0.368
V     1.1        0.193
And do the same for each part of speech ...
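The verb row above can be recomputed from the tagged training sentences (a sketch using the same 0.1 pseudocount):

```python
TAGS = ["MOD", "DET", "CONJ", "N", "PREP", "PRO", "V"]
tagged = [
    ["MOD", "V", "MOD", "N"],                     # here come old flattop
    ["DET", "N", "PREP", "N", "V", "CONJ", "V"],  # a crowd of people stopped and stared
    ["V", "V", "PRO", "PREP", "PRO", "V"],        # gotta get you into my life
    ["CONJ", "PRO", "V", "PRO"],                  # and I love her
]
ALPHA = 0.1

# Count transitions out of V, then smooth and normalize
counts = {t: 0 for t in TAGS}
for sent in tagged:
    for prev, nxt in zip(sent, sent[1:]):
        if prev == "V":
            counts[nxt] += 1

freq = {t: c + ALPHA for t, c in counts.items()}
total = sum(freq.values())
theta_v = {t: f / total for t, f in freq.items()}
print(round(theta_v["PRO"], 3))  # 0.368
```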
![Page 34: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/34.jpg)
Emission Probability β
Let's look at verbs . . .

Word         a      and    come   crowd  flattop
Frequency    0.1    0.1    1.1    0.1    0.1
Probability  0.011  0.011  0.121  0.011  0.011

Word         get    gotta  her    here   i
Frequency    1.1    1.1    0.1    0.1    0.1
Probability  0.121  0.121  0.011  0.011  0.011

Word         into   it     life   love   my
Frequency    0.1    0.1    0.1    1.1    0.1
Probability  0.011  0.011  0.011  0.121  0.011

Word         of     old    people stared stopped
Frequency    0.1    0.1    0.1    1.1    1.1
Probability  0.011  0.011  0.011  0.121  0.121
![Page 35: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/35.jpg)
Viterbi Algorithm
Given an unobserved sequence of length L, {x1, . . . , xL}, we want to find a sequence {z1, . . . , zL} with the highest probability.
It's impossible to compute KL possibilities.
So, we use dynamic programming to compute the best sequence for each subsequence from 0 to l.

Base case:
δ1(k) = πk βk,x1  (5)

Recursion:
δn(k) = maxj (δn−1(j) θj,k) βk,xn  (6)
The complexity of this is now K²L.
But just computing the max isn't enough. We also have to remember where we came from. (Breadcrumbs from the best previous state.)

Ψn(k) = argmaxj δn−1(j) θj,k  (7)
Let’s do that for the sentence “come and get it”
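The base case (5), recursion (6), and breadcrumbs (7) fit in one short log-space function. This is a sketch, not the course's reference code, and the toy two-state parameters in the example are made up:

```python
import math

def viterbi(obs, states, pi, theta, beta):
    """Log-space Viterbi: pi[k], theta[j][k], beta[k][x] are probabilities."""
    # Base case: delta_1(k) = pi_k * beta_{k, x_1}, in log space
    delta = [{k: math.log(pi[k]) + math.log(beta[k][obs[0]]) for k in states}]
    back = [{}]  # breadcrumbs; nothing to remember at position 1
    for x in obs[1:]:
        prev = delta[-1]
        delta.append({})
        back.append({})
        for k in states:
            # Recursion: maximize over the predecessor state j
            best = max(states, key=lambda j: prev[j] + math.log(theta[j][k]))
            back[-1][k] = best
            delta[-1][k] = prev[best] + math.log(theta[best][k]) + math.log(beta[k][x])
    # Follow the breadcrumbs back from the best final state
    path = [max(states, key=lambda k: delta[-1][k])]
    for n in range(len(obs) - 1, 0, -1):
        path.append(back[n][path[-1]])
    return list(reversed(path))

# Toy two-state model (made-up numbers, not the slides' POS model)
S = ["A", "B"]
pi0 = {"A": 0.6, "B": 0.4}
theta0 = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
beta0 = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
print(viterbi(["x", "y", "y"], S, pi0, theta0, beta0))  # ['A', 'B', 'B']
```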
![Page 38: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/38.jpg)
POS   πk     βk,x1  log δ1(k)
MOD   0.234  0.024  -5.18
DET   0.234  0.032  -4.89
CONJ  0.234  0.024  -5.18
N     0.021  0.016  -7.99
PREP  0.021  0.024  -7.59
PRO   0.021  0.016  -7.99
V     0.234  0.121  -3.56

come and get it
Why logarithms?
1 More interpretable than a float with lots of zeros.
2 Underflow is less of an issue
3 Addition is cheaper than multiplication
![Page 40: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/40.jpg)
POS   log δ1(j)  log δ1(j)θj,CONJ  log δ2(CONJ)
MOD   -5.18      -8.48
DET   -4.89      -7.72
CONJ  -5.18      -8.47
N     -7.99      ≤ −7.99
PREP  -7.59      ≤ −7.59
PRO   -7.99      ≤ −7.99
V     -3.56      -5.21             -6.02

come and get it

log(δ1(V) θV,CONJ) = log δ1(V) + log θV,CONJ = −3.56 + −1.65 = −5.21
log δ2(CONJ) = −5.21 + log βCONJ,and = −5.21 − 0.81 = −6.02
Viterbi trellis for “come and get it” (δt(k): best log-probability of a path ending in state k at position t; bt: back-pointer):

POS    δ1(k)   δ2(k)  b2     δ3(k)  b3     δ4(k)  b4
MOD    -5.18   -0.00  X      -0.00  X      -0.00  X
DET    -4.89   -0.00  X      -0.00  X      -0.00  X
CONJ   -5.18   -6.02  V      -0.00  X      -0.00  X
N      -7.99   -0.00  X      -0.00  X      -0.00  X
PREP   -7.59   -0.00  X      -0.00  X      -0.00  X
PRO    -7.99   -0.00  X      -0.00  X      -14.6  V
V      -3.56   -0.00  X      -9.03  CONJ   -0.00  X
WORD   come            and           get           it
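The trellis above can be reproduced by a compact Viterbi decoder. A minimal sketch (the `viterbi` function and the toy log-probability tables in the usage below are illustrative, not the actual model behind the slides):

```python
import math

def viterbi(words, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for `words`.

    delta[k] is the best log-probability of any path ending in state k
    after the current word; back[t][k] is the back-pointer b_t.
    """
    neg_inf = -math.inf
    delta = {s: log_start[s] + log_emit[s].get(words[0], neg_inf)
             for s in states}
    back = []
    for word in words[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best previous state to have transitioned from.
            prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = (delta[prev] + log_trans[prev][s]
                            + log_emit[s].get(word, neg_inf))
            pointers[s] = prev
        delta = new_delta
        back.append(pointers)
    # Recover the best path by following back-pointers from the best end state.
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

Filling the δ columns left to right and then walking the b pointers backward is exactly what the table animation shows.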
![Page 55: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/55.jpg)
MapReduce: HMM Learning
Mapper
(Reconstructed from the garbled slide: each count is emitted twice, once under a −1 “normalizer” key whose total the reducer uses for normalization.)

```python
def map(sentence_id, sentence):
    prev = None
    for state, word in sentence:
        if prev is None:
            # Initial-state counts; the -1 key accumulates the normalizer
            emit(("S", 0, -1), 1)
            emit(("S", 0, state), 1)
        else:
            # Transition counts from the previous state to this one
            emit(("T", prev, -1), 1)
            emit(("T", prev, state), 1)
        # Emission counts for this state/word pair
        emit(("E", state, -1), 1)
        emit(("E", state, word), 1)
        prev = state
```
![Page 56: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/56.jpg)
MapReduce: HMM Learning
Reducer
```python
def reduce(key, values):
    distribution, i, j = key
    if j == -1:
        # Keys are sorted, so each distribution's -1 normalizer key
        # arrives before its individual outcomes (order inversion);
        # the reducer keeps `normalizer` as state between calls.
        normalizer = sum(values) + priors_sum[distribution]
    else:
        emit(key, (sum(values) + priors_element[distribution]) / normalizer)
```
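Taken together, the mapper and reducer just compute normalized counts. A self-contained sketch that simulates the whole map/shuffle/reduce pipeline locally (the dictionary stands in for the shuffle; priors are omitted for brevity):

```python
from collections import defaultdict

def hmm_estimates(sentences):
    """Estimate HMM start/transition/emission probabilities by counting.

    Each event is counted under its own key and under a (dist, i, -1)
    "normalizer" key; dividing the two yields relative frequencies.
    """
    grouped = defaultdict(int)          # stands in for the shuffle phase
    for sentence in sentences:          # the "map" phase
        prev = None
        for state, word in sentence:
            if prev is None:
                grouped[("S", 0, -1)] += 1
                grouped[("S", 0, state)] += 1
            else:
                grouped[("T", prev, -1)] += 1
                grouped[("T", prev, state)] += 1
            grouped[("E", state, -1)] += 1
            grouped[("E", state, word)] += 1
            prev = state
    # The "reduce" phase: normalize each outcome by its -1 total.
    return {(d, i, j): count / grouped[(d, i, -1)]
            for (d, i, j), count in grouped.items() if j != -1}
```

On a real cluster the normalizer trick matters because a reducer sees one key at a time; locally the dictionary makes the same bookkeeping visible at a glance.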
![Page 57: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/57.jpg)
MapReduce: HMM Testing
Distribute parameters via the distributed cache

Do Viterbi for each sentence using the parameters

Can happen independently in a mapper; output the final assignment

Use the identity reducer

Output the final POS sequence (see Lin & Dyer for the gory details)
![Page 58: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/58.jpg)
Why topic models?
Suppose you have a huge number of documents
You want to know what’s going on
Don’t have time to read them (e.g., every New York Times article from the 1950s)

Topic models offer a way to get a corpus-level view of major themes
Unsupervised
![Page 60: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/60.jpg)
Conceptual Approach
Given a corpus, what topics (a priori number) are expressed throughout the corpus?

For each document, what topics are expressed by that document?
![Page 62: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/62.jpg)
Conceptual Approach
Given a corpus, what topics (a priori number) are expressed throughout the corpus?

For each document, what topics are expressed by that document?

TOPIC 1: computer, technology, system, service, site, phone, internet, machine

TOPIC 2: play, film, movie, theater, production, star, director, stage

TOPIC 3: sell, sale, store, product, business, advertising, market, consumer

Example headlines and their dominant topics:

TOPIC 2: “Forget the Bootleg, Just Download the Movie Legally”; “Multiplex Heralded As Linchpin To Growth”; “The Shape of Cinema, Transformed At the Click of a Mouse”; “A Peaceful Crew Puts Muppets Where Its Mouth Is”

TOPIC 3: “Stock Trades: A Better Deal For Investors Isn’t Simple”; “The three big Internet portals begin to distinguish among themselves as shopping malls”

TOPIC 1: “Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens”
![Page 63: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/63.jpg)
Topics from Science
![Page 64: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/64.jpg)
Why should you care?
Neat way to explore / understand corpus collections
NLP applications:
POS Tagging [Toutanova and Johnson 2008]
Word Sense Disambiguation [Boyd-Graber et al. 2007]
Word Sense Induction [Brody and Lapata 2009]
Discourse Segmentation [Purver et al. 2006]
Psychology [Griffiths et al. 2007b]: word meaning, polysemy
Inference is (relatively) simple
![Page 65: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/65.jpg)
Matrix Factorization Approach
Dataset (M × V) ≈ Topic Assignment (M × K) × Topics (K × V)

K: number of topics
M: number of documents
V: size of the vocabulary

If you use singular value decomposition (SVD), this technique is called latent semantic analysis, popular in information retrieval.
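As a rough illustration of the factorization, the dominant latent direction of a small document–term matrix can be found with power iteration on AᵀA; this is a pure-Python stand-in for the top component of a full SVD (the function name and toy data are hypothetical):

```python
import math

def top_latent_direction(docs, vocab, iters=50):
    """Power iteration on A^T A for the dominant right singular vector
    of the M x V document-term count matrix A: a one-dimensional
    stand-in for the K-dimensional factorization above."""
    A = [[doc.count(w) for w in vocab] for doc in docs]
    v = [1.0] * len(vocab)
    for _ in range(iters):
        # u = A v: project each document onto the current direction.
        u = [sum(row[j] * v[j] for j in range(len(vocab))) for row in A]
        # w = A^T u: pull the projections back into vocabulary space.
        w = [sum(A[i][j] * u[i] for i in range(len(A)))
             for j in range(len(vocab))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return dict(zip(vocab, v))
```

Words that co-occur in the same documents end up with large weights in the same latent direction, which is the intuition behind LSA.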
![Page 67: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/67.jpg)
Alternative: Generative Model
How your data came to be
Sequence of Probabilistic Steps
Posterior Inference
![Page 68: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/68.jpg)
Multinomial Distribution
Distribution over discrete outcomes
Represented by non-negative vector that sums to one
Picture representation: points on the probability simplex, with (1,0,0), (0,1,0), (0,0,1) at the corners and, e.g., (1/2,1/2,0), (1/3,1/3,1/3), (1/4,1/4,1/2) in the interior

These multinomial parameters come from a Dirichlet distribution
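Python’s standard library has no Dirichlet sampler, but one can be sketched by normalizing independent Gamma draws (the `dirichlet_sample` helper below is illustrative, not from the slides):

```python
import random

def dirichlet_sample(alphas, rng=random.Random(0)):
    """Draw one multinomial parameter vector from Dirichlet(alphas)
    by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]
```

Small concentrations (e.g. `dirichlet_sample([0.1, 0.1, 0.1])`) tend to land near the corners of the simplex; large ones cluster near its center.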
Dirichlet Distribution
Generative Model Approach
Plate diagram: α → θd → zn → wn ← βzn, with each βk drawn from a Dirichlet with parameter λ; plates over the N word positions, the M documents, and the K topics

For each topic k ∈ {1, . . . , K}, draw a multinomial distribution βk from a Dirichlet distribution with parameter λ

For each document d ∈ {1, . . . , M}, draw a multinomial distribution θd from a Dirichlet distribution with parameter α

For each word position n ∈ {1, . . . , N}, select a hidden topic zn from the multinomial distribution parameterized by θd, then choose the observed word wn from the distribution βzn
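The three steps above can be sketched as a forward sampler (a toy implementation with made-up hyperparameter defaults; this generates synthetic data, it is not the inference code):

```python
import random

def generate_corpus(K, vocab, num_docs, doc_len, lam=0.1, alpha=0.5,
                    rng=random.Random(0)):
    """Forward-sample documents from the LDA generative story:
    topics beta_k ~ Dir(lam), proportions theta_d ~ Dir(alpha),
    then z_n ~ theta_d and w_n ~ beta_{z_n} for each word."""
    def dirichlet(dim, conc):
        g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
        s = sum(g)
        return [x / s for x in g]

    betas = [dirichlet(len(vocab), lam) for _ in range(K)]   # step 1
    docs = []
    for _ in range(num_docs):
        theta = dirichlet(K, alpha)                          # step 2
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(K), weights=theta)[0]      # step 3a
            words.append(rng.choices(vocab, weights=betas[z])[0])  # step 3b
        docs.append(words)
    return docs
```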
We use statistical inference to uncover the most likely unobservedvariables given observed data.
![Page 79: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/79.jpg)
Topic Models: What’s Important
A generative probabilistic model of document collections that posits a hidden topical structure, which is inferred from data

A topic is a distribution over words

Topics have semantic coherence because of language use

We use latent Dirichlet allocation (LDA) [Blei et al. 2003], a fully Bayesian version of pLSI [Hofmann 1999]
![Page 80: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/80.jpg)
Learning topics
What we want: a (topic) model
This is represented by a configuration of latent variables z

What we have: our data D and any hyperparameters Ξ

Compute the likelihood L = p(D | z, Ξ)

The higher this number, the better we’re doing
![Page 81: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/81.jpg)
Expectation Maximization Algorithm
Input: z (hidden variables), ξ (parameters), D (data)

Start with an initial guess of z

Repeat:
M-step: compute the parameters ξ that maximize the likelihood L (use calculus)
E-step: compute the expected value of the latent variables z

With each iteration, the objective function goes up
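To make the loop concrete, here is EM on a classic toy problem, a two-coin mixture (a hypothetical example, not from the slides): each session of ten flips came from one of two coins of unknown bias, and EM alternates soft assignments with bias re-estimates.

```python
def em_two_coins(flip_counts, iters=20, init=(0.6, 0.5)):
    """EM for a mixture of two coins.

    flip_counts: heads counts out of n=10 flips per session, where the
    coin used in each session is the hidden variable z.
    Returns the estimated heads probability of each coin.
    """
    n = 10
    pa, pb = init
    for _ in range(iters):
        # E-step: responsibility of coin A for each session.
        heads_a = tot_a = heads_b = tot_b = 0.0
        for h in flip_counts:
            la = pa ** h * (1 - pa) ** (n - h)
            lb = pb ** h * (1 - pb) ** (n - h)
            ra = la / (la + lb)
            heads_a += ra * h
            tot_a += ra * n
            heads_b += (1 - ra) * h
            tot_b += (1 - ra) * n
        # M-step: maximum-likelihood update of each coin's bias.
        pa, pb = heads_a / tot_a, heads_b / tot_b
    return pa, pb
```

Each iteration provably does not decrease the likelihood, which is the “objective function goes up” guarantee on the slide.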
![Page 84: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/84.jpg)
Theory
Sometimes you can’t actually optimize L
So we instead optimize a lower bound based on a “variational” distribution q:

ℒ = 𝔼q[log (p(D|Z) p(Z|ξ))] − 𝔼q[log q(Z)]   (8)

The gap between the log-likelihood and the bound is a KL divergence: log L − ℒ = KL(q‖p)

This is called variational EM (normal EM is the special case q = p)

Makes the math tractable: we optimize ℒ instead of L
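Why Equation 8 lower-bounds the log-likelihood is the standard Jensen's-inequality argument:

```latex
\log p(D \mid \xi)
  = \log \sum_{Z} p(D \mid Z)\, p(Z \mid \xi)
  = \log \sum_{Z} q(Z)\, \frac{p(D \mid Z)\, p(Z \mid \xi)}{q(Z)}
  \ge \sum_{Z} q(Z) \log \frac{p(D \mid Z)\, p(Z \mid \xi)}{q(Z)}
  = \mathbb{E}_q\!\left[\log \big(p(D \mid Z)\, p(Z \mid \xi)\big)\right]
    - \mathbb{E}_q\!\left[\log q(Z)\right]
  = \mathcal{L}.
```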
![Page 85: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/85.jpg)
Variational distribution
Figure: (a) the LDA graphical model (α → θd → zn → wn, with topics βk); (b) the variational distribution, which breaks the couplings and gives each document free parameters γd (for θd) and φn (for each zn)
![Page 86: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/86.jpg)
Updates - Important Part
φd,n,k: how much the nth word in a document expresses topic k
γd,k: how much the kth topic is expressed in document d
βv,k: how much word v is associated with topic k

φd,n,k ∝ βwd,n,k · exp(Ψ(γd,k))

γd,k = αk + Σ_{n=1}^{Nd} φd,n,k

βv,k ∝ Σ_{d=1}^{C} wv(d) φd,v,k

This is the algorithm!
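One document's coordinate-ascent loop over the φ and γ updates above can be sketched as follows (a simplified illustration: Ψ is approximated by a finite difference of log-Γ, since the standard library has no digamma; real code would use, e.g., SciPy's):

```python
import math

def digamma(x, h=1e-5):
    """Numeric psi(x) via a central difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def update_document(word_ids, beta, alpha, iters=20):
    """Alternate the phi and gamma updates for a single document.

    beta[k][v]: topic-word probabilities; alpha: Dirichlet prior.
    Returns (gamma, phi) for this document.
    """
    K = len(beta)
    gamma = [alpha[k] + len(word_ids) / K for k in range(K)]
    phi = [[1.0 / K] * K for _ in word_ids]
    for _ in range(iters):
        for n, v in enumerate(word_ids):
            # phi update: proportional to beta * exp(psi(gamma)).
            weights = [beta[k][v] * math.exp(digamma(gamma[k]))
                       for k in range(K)]
            total = sum(weights)
            phi[n] = [w / total for w in weights]
        # gamma update: prior plus expected topic counts.
        gamma = [alpha[k] + sum(phi[n][k] for n in range(len(word_ids)))
                 for k in range(K)]
    return gamma, phi
```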
![Page 88: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/88.jpg)
Objective Function
Expanding Equation 8 gives us ℒ(γ, φ; α, β) as a sum over the C documents:

ℒ(γ, φ; α, β) = Σ_{d=1}^{C} ℒd(γ, φ; α, β)
             = Σ_{d=1}^{C} ℒd(α) + Σ_{d=1}^{C} ( ℒd(γ, φ) + ℒd(φ) + ℒd(γ) )

The ℒd(α) sum is computed in the driver; the terms of the second sum are computed in the mapper, and the sum itself in the reducer.

where

ℒd(α) = log Γ(Σ_{k=1}^{K} αk) − Σ_{k=1}^{K} log Γ(αk),

ℒd(γ, φ) = Σ_{k=1}^{K} [ Σ_{v=1}^{V} φv,k − Σ_{v=1}^{V} φv,k wv ] [ Ψ(γk) − Ψ(Σ_{i=1}^{K} γi) ],

ℒd(φ) = Σ_{v=1}^{V} Σ_{k=1}^{K} φv,k ( log φv,k + Σ_{i=1}^{V} wi log βi,k ),

ℒd(γ) = − log Γ(Σ_{k=1}^{K} γk) + Σ_{k=1}^{K} log Γ(γk)
![Page 92: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/92.jpg)
Figure: the MapReduce workflow for variational LDA. Document mappers update γ and φ; reducers aggregate the sufficient statistics for the β update; the driver updates α from the Hessian terms; the new β and α are written out and shipped to the next iteration’s mappers via the distributed cache; the loop repeats until the test likelihood converges.
![Page 93: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/93.jpg)
Map(d, w⃗)   (reconstructed from the slide; ⟨·, ·⟩ denotes a composite key and △ a reserved marker symbol)

1: repeat
2:   for all v ∈ [1, V] do
3:     for all k ∈ [1, K] do
4:       Update φv,k = βv,k × exp(Ψ(γd,k))
5:     end for
6:     Normalize row φv,∗ so that Σ_{k=1}^{K} φv,k = 1
7:     Update σ = σ + w⃗v φv, where φv is a K-dimensional vector and w⃗v is the count of v in this document
8:   end for
9:   Update row vector γd,∗ = α + σ
10: until convergence
11: for all k ∈ [1, K] do
12:   for all v ∈ [1, V] do
13:     Emit key-value pair ⟨k, △⟩ : w⃗v φv
14:     Emit key-value pair ⟨k, v⟩ : w⃗v φv   {order inversion}
15:   end for
16:   Emit key-value pair ⟨△, k⟩ : Ψ(γd,k) − Ψ(Σ_{l=1}^{K} γd,l)   {emit the γ-tokens for the α update}
17:   Output key-value pair ⟨k, d⟩ : γd,k to file
18: end for
19: Emit key-value pair ⟨△, △⟩ : L, where L is the log-likelihood of this document
![Page 94: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/94.jpg)
Input:
Key: a pair ⟨pleft, pright⟩
Value: an iterator I over a sequence of values

Configuration
1: Initialize the total number of topics as K
2: Initialize a normalization factor n = 0

Reduce
1: Compute the sum σ over all values in the sequence I
2: if pleft = △ then
3:   if pright = △ then
4:     Output key-value pair ⟨△, △⟩ : σ to file   {output the model likelihood L for convergence checking}
5:   else
6:     Output key-value pair ⟨△, pright⟩ : σ to file   {output the γ-tokens to update the α-vectors, Section ??}
7:   end if
8: else
9:   if pright = △ then
10:    Update the normalization factor n = σ   {order inversion}
11:  else
12:    Output key-value pair ⟨k, v⟩ : σ/n   {output the normalized β value}
13:  end if
14: end if
![Page 95: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/95.jpg)
Applications
What’s a document?
What’s a word?
What’s your vocabulary?
How do you evaluate?
![Page 96: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/96.jpg)
Applications
Computer Vision [Li Fei-Fei and Perona 2005]
![Page 97: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/97.jpg)
Applications
Social Networks [Airoldi et al. 2008]
![Page 98: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/98.jpg)
Applications
Music [Hu and Saul 2009]
Figure 2: The C major and C minor key-profiles learned by our model, as encoded by the β matrix. Remaining key-profiles are obtained by transposition.

Figure 3: Key judgments for the first 6 measures of Bach’s Prelude in C minor, WTC-II. Annotations for each measure show the top three keys (and relative strengths) chosen for each measure. The top set of three annotations are judgments from our LDA-based model; the bottom set of three are from human expert judgments [3].

identified with the 24 major and minor modes of classical western music. We note that our approach is still regarded as unsupervised because we do not learn from labeled or annotated data.

2.3 Results & Applications

Our learned key-profiles are shown in Figure 2. We note that these key-profiles are consistent with music theory principles: in both major and minor modes, weights are given in descending order to degrees of the triad, diatonic, and finally chromatic scales. Intuitively, these key-profiles represent the underlying distributions that are used to characterize all the songs in the corpus.

We also show how to do key-finding and modulation-tracking using the representations learned by our model. The goal of key-finding is to determine the overall key of a musical piece, given the notes of the composition. Since the key weight vector θ represents the most likely keys present in each song, we classify each song as the key that is given the largest weight in θ. A related task is modulation-tracking, which identifies where the modulations occur within a piece. We achieve this by determining the key of each segment from the most probable values of its topic latent variable z.

We estimated our model from a collection of 235 MIDI files compiled from classicalmusicmidipage.com. The collection included works by Bach, Vivaldi, Mozart, Beethoven, Chopin, and Rachmaninoff. These composers were chosen to span the baroque through romantic periods of western classical music. Our results for key-finding achieved an accuracy of 86%, outperforming several other key-finding algorithms, including the popular KS model [3]. We also show in Figure 3 that our annotations for modulation-tracking are comparable to those given by music theory experts. More results can be found in our paper [1].
![Page 99: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/99.jpg)
Nonparametric Models
We’ve always assumed a fixed number of topics
Topic modeling inspired a resurgence of nonparametric Bayesian statistics that can handle infinitely many mixture components [Antoniak 1974]

Equivalent to the Rational Model of Categorization [Griffiths et al. 2007a]

For the rest of this talk, a cartoon version; the details are similar to LDA
![Page 100: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/100.jpg)
Active research
Combining these models with models of syntax
Scaling up to larger corpora
Making topics relevant to social scientists
Humans in the loop
Modeling metadata in document collections
![Page 101: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/101.jpg)
Recap
Probabilistic models: a way of learning what data you have

Make predictions about the future

Require a little bit of math to figure out, but are fairly easy to implement in MapReduce
![Page 102: Probabilistic Models in MapReducejbg/teaching/INFM_718... · to model Training Data Advantages ... than creating rules Even easier today thanks to Amazon Mechanical Turk Data may](https://reader035.vdocuments.us/reader035/viewer/2022081407/604d0a1854ad0e6c6961101e/html5/thumbnails/102.jpg)
Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. 2008. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.

Charles E. Antoniak. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174.

David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A topic model for word sense disambiguation. In Proceedings of Empirical Methods in Natural Language Processing.

Samuel Brody and Mirella Lapata. 2009. Bayesian word sense induction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 103–111, Athens, Greece. Association for Computational Linguistics.

T. L. Griffiths, K. R. Canini, A. N. Sanborn, and D. J. Navarro. 2007a. Unifying rational models of categorization via the hierarchical Dirichlet process. In Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society.

Thomas L. Griffiths, Mark Steyvers, and Joshua Tenenbaum. 2007b. Topics in semantic representation. Psychological Review, 114(2):211–244.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, Stockholm.

Diane Hu and Lawrence K. Saul. 2009. A probabilistic model of unsupervised learning for musical-key profiles. In International Society for Music Information Retrieval Conference.

Li Fei-Fei and Pietro Perona. 2005. A Bayesian hierarchical model for learning natural scene categories. In CVPR ’05, Volume 2, pages 524–531, Washington, DC, USA. IEEE Computer Society.

Matthew Purver, Konrad Kording, Thomas L. Griffiths, and Joshua Tenenbaum. 2006. Unsupervised topic modelling for multi-party spoken discourse. In Proceedings of the Association for Computational Linguistics.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1521–1528. MIT Press, Cambridge, MA.