TRANSCRIPT
Non-parametric Bayesian Learning in Discrete Data
Yueshen Xu, [email protected] / [email protected]
Middleware, CCNT, ZJU
5/10/2016
Statistics & Computational Linguistics
Outline
• Bayes' Rule
• Parametric Bayesian Learning
  • Concept & Example
  • Discrete & Continuous Data
  • Text Clustering & Topic Modeling
  • Pros and Cons
  • Some Important Concepts
• Non-parametric Bayesian Learning
  • Dirichlet Process and Process Construction
  • Dirichlet Process Mixture
  • Hierarchical Dirichlet Process
  • Chinese Restaurant Process
  • Example: Hierarchical Topic Modeling
• Markov Chain Monte Carlo
• Reference
• Discussion
Bayes' Rule
Posterior = Prior × Likelihood

P(Hypothesis | Data) = P(Data | Hypothesis) · P(Hypothesis) / P(Data)

that is, Posterior = Likelihood × Prior / Evidence.
• Update beliefs in hypotheses in response to data.
• Parametric or non-parametric: whether the structure of the hypothesis space is constrained or not (examples later).
• The prior encodes your confidence before seeing the data.
Parametric Bayesian Learning

P(Hypothesis | Data) ∝ P(Data | Hypothesis) · P(Hypothesis)

• The hypothesis can be parametric or non-parametric.
• The evidence P(Data) is a fact: a constant with no randomness in it, so dropping it is a commonly used trick.
• Non-parametric ≠ no parameters.
• Hyper-parameters
  • Parameters of distributions over parameters
  • Parameter vs. variable: in
    Dir(θ | α) = Γ(α₀) / (Γ(α₁)⋯Γ(α_K)) · ∏_{k=1}^K θ_k^{α_k − 1}
    θ is the variable and α = (α₁, …, α_K) is the hyper-parameter (with α₀ = Σ_k α_k).
• p(θ | X) ∝ p(X | θ) p(θ)
Parametric Bayesian Learning
Some Examples
• Clustering: K-Means/Medoid, NMF
• Topic Modeling: LSA, pLSA, LDA
• Hierarchical Concept Building
Parametric Bayesian Learning
Serious Problems
How could we know
• the number of clusters?
• the number of topics?
• the number of layers?
Heuristic pre-processing? Guessing and tuning?
Some basics
Discrete Data & Continuous Data
Discrete Data: text be modeled as natural numbers
Continuous Data: stock, trading, signal, quality, rating be
modeled as real numbers
5/10/2016 Yueshen Xu 7 Middleware, CCNT, ZJU
Some important concepts (Also used in non-parametric case)
Discrete distribution: ๐๐|๐~๐ท๐๐ ๐๐๐๐ก๐(๐)
๐ ๐ ๐ =
๐=1
๐
๐ท๐๐ ๐๐๐๐ก๐ ๐๐; ๐ =
๐=1
๐
๐๐
๐๐
Multinomial distribution: ๐|๐, ๐~๐๐ข๐๐ก๐(๐, ๐)
๐ ๐ ๐, ๐ =๐!
๐=1๐ ๐๐!
๐=1
๐
๐๐
๐๐
Computer Sciencers
often mix them up
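The distinction can be checked numerically; a minimal sketch (the values of θ and the counts are made up for illustration):

```python
import numpy as np
from math import factorial, prod

theta = np.array([0.5, 0.3, 0.2])   # parameter theta of the distribution
counts = np.array([3, 1, 1])        # n_k: how often each category appeared in N = 5 draws
N = counts.sum()

# Discrete (categorical) likelihood of one particular ordered sequence of draws
discrete_lik = np.prod(theta ** counts)

# Multinomial pmf of the count vector: same product times the number of orderings
coeff = factorial(N) // prod(factorial(int(k)) for k in counts)   # N! / prod(n_k!)
multinomial_pmf = coeff * discrete_lik

print(discrete_lik)      # 0.5^3 * 0.3 * 0.2 = 0.0075
print(multinomial_pmf)   # 20 * 0.0075 = 0.15
```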
Parametric Bayesian Learning
Some important concepts (cont.)
• Dirichlet distribution: θ | α ~ Dir(α)
  Dir(θ | α) = Γ(α₀) / (Γ(α₁)⋯Γ(α_K)) · ∏_{k=1}^K θ_k^{α_k − 1}
• Conjugate prior: if the posterior p(θ|X) is in the same family as the prior p(θ), the prior is called a conjugate prior of the likelihood p(X|θ).
• Examples
  • Binomial Distribution ←→ Beta Distribution
  • Multinomial Distribution ←→ Dirichlet Distribution
  Combining p(θ|α) and p(n|θ):
  p(θ | n, α) = Dir(θ | n + α) = Γ(α₀ + N) / (Γ(α₁ + n₁)⋯Γ(α_K + n_K)) · ∏_{k=1}^K θ_k^{α_k + n_k − 1}
Why is it preferable for the prior and posterior to be conjugate distributions? …
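Conjugacy is exactly what makes the update trivial in practice: the posterior hyper-parameters are the prior hyper-parameters plus the observed counts. A minimal sketch, assuming a symmetric Dir(1, 1, 1) prior and made-up counts:

```python
import numpy as np

# Conjugate update for the Dirichlet-Multinomial pair:
# the posterior Dir(alpha + n) is obtained by adding the observed counts.
alpha = np.array([1.0, 1.0, 1.0])    # symmetric Dirichlet prior (illustrative choice)
counts = np.array([10, 3, 2])        # n_k: category counts observed in the data

alpha_post = alpha + counts          # posterior hyper-parameters
posterior_mean = alpha_post / alpha_post.sum()

print(alpha_post)        # [11.  4.  3.]
print(posterior_mean)    # [11/18, 4/18, 3/18]
```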
Parametric Bayesian Learning
Some important concepts (cont.)
• Probabilistic Graphical Model
  Modeling a Bayesian network using plates and circles
• Generative Model & Discriminative Model
  • Generative model: models the joint, i.e., p(θ|X) ∝ p(X|θ)p(θ)
    Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … (mostly unsupervised learning)
  • Discriminative model: models p(C|X) directly
    LR, KNN, SVM, Boosting, Decision Tree (supervised learning)
  Discriminative models also have graphical model representations.
Non-parametric Bayesian Learning
When we talk about non-parametric, what do we usually talk about?
• Discrete Data: Dirichlet Distribution, Dirichlet Process, Chinese Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process Multinomial Model, Clustering, …
• Continuous Data: Gaussian Distribution, Gaussian Process, Regression, Classification, Factorization, Gradient Descent, Covariance Matrix, Brownian Motion, …
The common keyword: Infinite (∞)
Non-parametric Bayesian Learning
Dirichlet Process [Yee Whye Teh, et al.]
Let G₀ be a probability measure/distribution (the base distribution), α₀ a positive real number, and (A₁, A₂, …, A_r) any finite partition of the space. A probability distribution G satisfies G ~ DP(α₀, G₀) iff
(G(A₁), …, G(A_r)) ~ Dir(α₀G₀(A₁), …, α₀G₀(A_r))
• G₀: which exact distribution is G₀? We don't know.
• G: which exact distribution is G? We don't know.
Non-parametric Bayesian Learning
Where is the infinity? Construction of a DP
We need to construct a DP, since it is not given in closed form.
Constructions: stick-breaking, Polya urn scheme, Chinese restaurant process
Stick-breaking construction
(β_k)_{k=1}^∞ and (φ_k)_{k=1}^∞ are i.i.d. sequences:
β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)
G = Σ_{k=1}^∞ π_k δ_{φ_k}
Here Σ_{k=1}^∞ π_k = 1 and π_k is the probability of atom φ_k, so (π_k) is a distribution over the positive integers.
Why is G a DP? …
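The construction above can be simulated by truncating the infinite sum at a large K; a sketch, assuming a standard-normal base distribution G₀ and α₀ = 2 purely for illustration:

```python
import numpy as np

def stick_breaking(alpha0, truncation, rng):
    """Truncated stick-breaking construction of the weights of G ~ DP(alpha0, G0)."""
    betas = rng.beta(1.0, alpha0, size=truncation)            # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                                   # pi_k = beta_k * prod_{l<k}(1 - beta_l)

rng = np.random.default_rng(0)
pis = stick_breaking(alpha0=2.0, truncation=1000, rng=rng)
phis = rng.normal(size=1000)   # atoms phi_k drawn i.i.d. from G0 = N(0, 1) (illustrative choice)

print(pis.sum())   # close to 1 for a large truncation level
```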
Non-parametric Bayesian Learning
Chinese Restaurant Process
A restaurant has an infinite number of tables, and customers (words, generated from the φ_k, one-to-one) enter the restaurant sequentially. The ith customer (θ_i) sits at a table according to the probability:
p(θ_i = occupied table k | θ_1, …, θ_{i−1}) = n_k / (i − 1 + α₀)
p(θ_i = a new table | θ_1, …, θ_{i−1}) = α₀ / (i − 1 + α₀)
where n_k is the number of customers already seated at table k.
θ_i gives a clustering: unsupervised-learning uses include clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
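The seating rule above is easy to simulate directly; a sketch (the number of customers and the value of α₀ are arbitrary illustrative choices):

```python
import random

def crp(num_customers, alpha0, seed=0):
    """Simulate table assignments under a Chinese Restaurant Process."""
    rng = random.Random(seed)
    tables = []                       # tables[k] = number of customers at table k
    assignments = []
    for i in range(num_customers):    # customer i+1 arrives; i customers are seated
        # existing table k has prob n_k/(i + alpha0); a new table has prob alpha0/(i + alpha0)
        r = rng.uniform(0, i + alpha0)
        acc = 0.0
        for k, n_k in enumerate(tables):
            acc += n_k
            if r < acc:
                tables[k] += 1
                assignments.append(k)
                break
        else:
            tables.append(1)          # open a new table
            assignments.append(len(tables) - 1)
    return tables, assignments

tables, assignments = crp(num_customers=100, alpha0=2.0)
print(len(tables), tables)   # the number of occupied tables grows roughly like alpha0 * log(n)
```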
Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
A DP alone is not enough: we need similarity (grouping) rather than exact cloning, which leads to mixture models. (You can draw the graphical model yourself.)
• Mixture models: an element is generated from a mixture/group of variables (usually latent variables) → GMM, LDA, pLSA, …
• DPM: θ_i | G ~ G,  x_i | θ_i ~ F(θ_i). For text data, F(θ_i) is Discrete/Multinomial.
This form is intuitive but not directly helpful; use the stick-breaking construction:
β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)
G = Σ_{k=1}^∞ π_k δ_{φ_k}
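Combining the two pieces, data can be generated from a DPM by drawing G via truncated stick-breaking and then drawing each x_i through its atom. A sketch, assuming for illustration G₀ = N(0, 10²) and F(θ) = N(θ, 1) rather than the multinomial used for text:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha0, K = 2.0, 200                  # truncation level K stands in for infinity

# G ~ DP(alpha0, G0) via truncated stick-breaking
betas = rng.beta(1.0, alpha0, size=K)
pis = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
pis = pis / pis.sum()                 # renormalize the truncated weights
phis = rng.normal(0.0, 10.0, size=K)  # atoms: component means drawn from G0 = N(0, 10^2)

# DPM: theta_i | G ~ G,  x_i | theta_i ~ F(theta_i) = N(theta_i, 1)
z = rng.choice(K, size=500, p=pis)    # which atom each data point uses
x = rng.normal(phis[z], 1.0)

print(len(np.unique(z)), "components actually used for 500 points")
```

Even though K = 200 components are available, only a handful receive data: the stick-breaking weights decay fast, which is the clustering effect of the DPM.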
Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
Truncated at a finite number of components, the DPM for text reduces to the finite Dirichlet Multinomial Mixture Model (DMMM).
What can the DMMM do? Clustering: a sparse document count vector such as
(0, 0, 0, Caption, 0, 0, 0, 0, 0, 0, USA, 0, 0, 0, 0, 0, 0, 0, 0, 0, Action, 0, 0, 0, 0, 0, 0, 0, Hero, 0, 0, 0, 0, 0, 0, …)
is assigned to a cluster.
Non-parametric Bayesian Learning
Hierarchical Dirichlet Process (HDP)
Construction:
G₀ | γ, H ~ DP(γ, H)
G_j | α₀, G₀ ~ DP(α₀, G₀)   (one G_j per document/group j)
HDP mixture: θ_ji | G_j ~ G_j,  x_ji | θ_ji ~ F(θ_ji)
With a finite number of topics and F = Mult, this reduces to LDA, i.e., a hierarchical Dirichlet multinomial mixture model. A very natural model for the statistics guys, but for us computer guys… heh…
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
What can we get from reviews, blogs, question answering, Twitter, news, …? Only topics? Far from enough.
What we really need is a hierarchy that illustrates what exactly the text tells people, like the example hierarchy shown on the slide.
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04]
nCRP: in a restaurant at the 1st level there is one table, which is linked to an infinite number of tables at the 2nd level. Each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level. This structure repeats, like a Matryoshka doll.
At each level the CRP is the prior used to choose a table, which forms a path: one document, one path.
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Generative Process
1. Let c₁ be the root restaurant (only one table).
2. For each level ℓ ∈ {2, …, L}: draw a table from restaurant c_{ℓ−1} using the CRP, and set c_ℓ to be the restaurant referred to by that table.
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α).
4. For each word w_n:
   Draw z ∈ {1, …, L} ~ Mult(θ).
   Draw w_n from the topic associated with restaurant c_z.
[Plate diagram: hyper-parameters α and γ; path c₁, c₂, …, c_L over the T plate; level assignment z_{m,n} and word w_{m,n} inside the N and M plates; topics β_k.]
L can be infinite, but it need not be.
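The path-drawing part of this generative process (steps 1 and 2) can be sketched as follows; the function name and parameter values are illustrative, not from the paper:

```python
import random

def ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from a nested CRP of fixed depth.

    Each node keeps CRP counts over its children; a document descends level by
    level, choosing an existing child with prob n_c/(n + gamma) or opening a
    new one with prob gamma/(n + gamma).
    """
    rng = random.Random(seed)
    children = {(): []}               # node (a tuple path) -> list of child visit counts
    paths = []
    for _ in range(num_docs):
        node = ()                     # start at the root restaurant
        for _level in range(1, depth):
            counts = children.setdefault(node, [])
            r = rng.uniform(0, sum(counts) + gamma)
            acc = 0.0
            for c, n_c in enumerate(counts):
                acc += n_c
                if r < acc:
                    counts[c] += 1
                    node = node + (c,)
                    break
            else:
                counts.append(1)      # create a new branch (new table)
                node = node + (len(counts) - 1,)
        paths.append(node)
    return paths

paths = ncrp_paths(num_docs=20, depth=3, gamma=1.0)
print(paths[:5])   # each path records depth-1 branch choices below the root
```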
Non-parametric Bayesian Learning
What we can get: an inferred topic hierarchy (example figure omitted).
Markov Chain Monte Carlo
Markov Chain
Initialization probability: π₀ = {π₀(1), π₀(2), …, π₀(|S|)}
πₙ = πₙ₋₁P = πₙ₋₂P² = ⋯ = π₀Pⁿ   (P: the transition matrix; Chapman-Kolmogorov equation)
P = [ p(1|1)   p(2|1)   …  p(|S| | 1)
      p(1|2)   p(2|2)   …  p(|S| | 2)
      ⋮        ⋮            ⋮
      p(1| |S|) p(2| |S|) … p(|S| | |S|) ]
Convergence theorem: under the premise of connectivity (irreducibility and aperiodicity) of P,
lim_{n→∞} Pⁿᵢⱼ = π(j), where π(j) = Σ_{i=1}^{|S|} π(i) Pᵢⱼ
so every row of lim_{n→∞} Pⁿ equals π = {π(1), π(2), …, π(j), …, π(|S|)},
and lim_{n→∞} π₀Pⁿ = π regardless of π₀.
Stationary Distribution
x₀ ~ π₀(x) → x₁ ~ π₁(x) → ⋯ → xₙ ~ π(x) → xₙ₊₁ ~ π(x) → xₙ₊₂ ~ π(x) → ⋯
Once the chain has converged, each of xₘ, xₘ₊₁, … is a sample from the stationary distribution π(x).
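The convergence πₙ = π₀Pⁿ → π can be observed directly by repeated multiplication; a sketch with a made-up 3-state transition matrix:

```python
import numpy as np

# A small irreducible, aperiodic transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

pi0 = np.array([1.0, 0.0, 0.0])   # an arbitrary initial distribution

pi = pi0.copy()
for _ in range(200):              # pi_n = pi_0 P^n
    pi = pi @ P

print(pi)                         # the stationary distribution
print(pi @ P)                     # unchanged: pi = pi P
```

Starting from any other π₀ gives the same limit, which is exactly why a converged Markov chain can be used as a sampler for π.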
Markov Chain Monte Carlo
Gibbs Sampling
Step 1: Initialize x⁽⁰⁾ = {x_i⁽⁰⁾ : i = 1, 2, …, n}
Step 2: for t = 0, 1, 2, …
1. x₁⁽ᵗ⁺¹⁾ ~ p(x₁ | x₂⁽ᵗ⁾, x₃⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
2. x₂⁽ᵗ⁺¹⁾ ~ p(x₂ | x₁⁽ᵗ⁺¹⁾, x₃⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
3. …
4. x_j⁽ᵗ⁺¹⁾ ~ p(x_j | x₁⁽ᵗ⁺¹⁾, …, x_{j−1}⁽ᵗ⁺¹⁾, x_{j+1}⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
5. …
6. xₙ⁽ᵗ⁺¹⁾ ~ p(xₙ | x₁⁽ᵗ⁺¹⁾, x₂⁽ᵗ⁺¹⁾, …, x_{n−1}⁽ᵗ⁺¹⁾)
In short, each variable is resampled from its full conditional: x_i ~ p(x_i | x_{−i}).
Gibbs sampling is a special case of Metropolis-Hastings sampling.
You want to know "Gibbs sampling for HDP/DPM/nCRP"? You'd better understand Gibbs sampling for LDA and the DMMM first.
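As a concrete instance of the scan above, here is a two-variable Gibbs sampler whose target, a bivariate normal with correlation ρ = 0.8, is chosen purely for illustration because its full conditionals are known in closed form:

```python
import numpy as np

# Target: (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
rho = 0.8
sd = np.sqrt(1.0 - rho**2)
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0
samples = []
for t in range(20000):
    x1 = rng.normal(rho * x2, sd)   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, sd)   # x2 ~ p(x2 | x1), using the fresh x1
    if t >= 1000:                   # discard burn-in before convergence
        samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8
```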
Reference
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007.
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006.
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012.
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003.
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010.
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008.
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference.
• Rick Durrett. Probability: Theory and Examples, 2010.
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007.
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014.
• David P. Williams. Gaussian Processes, Duke University, 2006.
Q&A