TRANSCRIPT
Non-parametric Bayesian Learning in Discrete Data
Yueshen Xu, [email protected] / [email protected]
Middleware, CCNT, ZJU
5/10/2016
Statistics & Computational Linguistics
Outline
• Bayes' Rule
• Parametric Bayesian Learning
  • Concept & Example
  • Discrete & Continuous Data
  • Text Clustering & Topic Modeling
  • Pros and Cons
  • Some Important Concepts
• Non-parametric Bayesian Learning
  • Dirichlet Process and Process Construction
  • Dirichlet Process Mixture
  • Hierarchical Dirichlet Process
  • Chinese Restaurant Process
  • Example: Hierarchical Topic Modeling
• Markov Chain Monte Carlo
• Reference
• Discussion
Bayes' Rule
Posterior = Prior × Likelihood

P(Hypothesis | Data) = P(Data | Hypothesis) · P(Hypothesis) / P(Data)

that is, Posterior = Likelihood × Prior / Evidence.
• Update beliefs in hypotheses in response to data.
• Parametric or non-parametric: whether the structure of the hypothesis space is constrained or not (examples later).
• The prior encodes your confidence before seeing the data.
Parametric Bayesian Learning

P(Hypothesis | Data) ∝ P(Data | Hypothesis) · P(Hypothesis)

• The hypothesis can be parametric or non-parametric.
• The evidence P(Data) is a fact: a constant with no randomness in it, so dropping it is a commonly used trick.
• Non-parametric ≠ no parameters.
• Hyper-parameters
  • Parameters of distributions over parameters
  • Parameter vs. variable: in
    Dir(θ | α) = Γ(α₀) / (Γ(α₁)⋯Γ(α_K)) · ∏_{k=1}^K θ_k^{α_k − 1}
    θ is the variable and α = (α₁, …, α_K) is the hyper-parameter (with α₀ = Σ_k α_k).
• p(θ | X) ∝ p(X | θ) p(θ)
Parametric Bayesian Learning
Some Examples
• Clustering: K-Means/Medoid, NMF
• Topic Modeling: LSA, pLSA, LDA
• Hierarchical Concept Building
Parametric Bayesian Learning
Serious Problems
How could we know
• the number of clusters?
• the number of topics?
• the number of layers?
Heuristic pre-processing? Guessing and tuning?
Some basics
Discrete Data & Continuous Data
Discrete Data: text be modeled as natural numbers
Continuous Data: stock, trading, signal, quality, rating be
modeled as real numbers
5/10/2016 Yueshen Xu 7 Middleware, CCNT, ZJU
Some important concepts (Also used in non-parametric case)
Discrete distribution: ๐๐|๐~๐ท๐๐ ๐๐๐๐ก๐(๐)
๐ ๐ ๐ =
๐=1
๐
๐ท๐๐ ๐๐๐๐ก๐ ๐๐; ๐ =
๐=1
๐
๐๐
๐๐
Multinomial distribution: ๐|๐, ๐~๐๐ข๐๐ก๐(๐, ๐)
๐ ๐ ๐, ๐ =๐!
๐=1๐ ๐๐!
๐=1
๐
๐๐
๐๐
Computer Sciencers
often mix them up
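The distinction can be checked numerically; a minimal sketch (the values of θ and the counts are made up for illustration):

```python
import numpy as np
from math import factorial, prod

theta = np.array([0.5, 0.3, 0.2])   # parameter theta of the distribution
counts = np.array([3, 1, 1])        # n_k: how often each category appeared in N = 5 draws
N = counts.sum()

# Discrete (categorical) likelihood of one particular ordered sequence of draws
discrete_lik = np.prod(theta ** counts)

# Multinomial pmf of the count vector: same product times the number of orderings
coeff = factorial(N) // prod(factorial(int(k)) for k in counts)   # N! / prod(n_k!)
multinomial_pmf = coeff * discrete_lik

print(discrete_lik)      # 0.5^3 * 0.3 * 0.2 = 0.0075
print(multinomial_pmf)   # 20 * 0.0075 = 0.15
```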
Parametric Bayesian Learning
Some important concepts (cont.)
• Dirichlet distribution: θ | α ~ Dir(α)
  Dir(θ | α) = Γ(α₀) / (Γ(α₁)⋯Γ(α_K)) · ∏_{k=1}^K θ_k^{α_k − 1}
• Conjugate prior: if the posterior p(θ|X) is in the same family as the prior p(θ), the prior is called a conjugate prior of the likelihood p(X|θ).
• Examples
  • Binomial Distribution ←→ Beta Distribution
  • Multinomial Distribution ←→ Dirichlet Distribution
  Combining p(θ|α) and p(n|θ):
  p(θ | n, α) = Dir(θ | n + α) = Γ(α₀ + N) / (Γ(α₁ + n₁)⋯Γ(α_K + n_K)) · ∏_{k=1}^K θ_k^{α_k + n_k − 1}
Why is it preferable for the prior and posterior to be conjugate distributions? …
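Conjugacy is exactly what makes the update trivial in practice: the posterior hyper-parameters are the prior hyper-parameters plus the observed counts. A minimal sketch, assuming a symmetric Dir(1, 1, 1) prior and made-up counts:

```python
import numpy as np

# Conjugate update for the Dirichlet-Multinomial pair:
# the posterior Dir(alpha + n) is obtained by adding the observed counts.
alpha = np.array([1.0, 1.0, 1.0])    # symmetric Dirichlet prior (illustrative choice)
counts = np.array([10, 3, 2])        # n_k: category counts observed in the data

alpha_post = alpha + counts          # posterior hyper-parameters
posterior_mean = alpha_post / alpha_post.sum()

print(alpha_post)        # [11.  4.  3.]
print(posterior_mean)    # [11/18, 4/18, 3/18]
```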
Parametric Bayesian Learning
Some important concepts (cont.)
• Probabilistic Graphical Model
  Modeling a Bayesian network using plates and circles
• Generative Model & Discriminative Model
  • Generative model: models the joint, i.e., p(θ|X) ∝ p(X|θ)p(θ)
    Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … (mostly unsupervised learning)
  • Discriminative model: models p(C|X) directly
    LR, KNN, SVM, Boosting, Decision Tree (supervised learning)
  Discriminative models also have graphical model representations.
Non-parametric Bayesian Learning
When we talk about non-parametric, what do we usually talk about?
• Discrete Data: Dirichlet Distribution, Dirichlet Process, Chinese Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process Multinomial Model, Clustering, …
• Continuous Data: Gaussian Distribution, Gaussian Process, Regression, Classification, Factorization, Gradient Descent, Covariance Matrix, Brownian Motion, …
The common keyword: Infinite (∞)
Non-parametric Bayesian Learning
Dirichlet Process [Yee Whye Teh, et al.]
Let G₀ be a probability measure/distribution (the base distribution), α₀ a positive real number, and (A₁, A₂, …, A_r) any finite partition of the space. A probability distribution G satisfies G ~ DP(α₀, G₀) iff
(G(A₁), …, G(A_r)) ~ Dir(α₀G₀(A₁), …, α₀G₀(A_r))
• G₀: which exact distribution is G₀? We don't know.
• G: which exact distribution is G? We don't know.
Non-parametric Bayesian Learning
Where is the infinity? Construction of a DP
We need to construct a DP, since it is not given in closed form.
Constructions: stick-breaking, Polya urn scheme, Chinese restaurant process
Stick-breaking construction
(β_k)_{k=1}^∞ and (φ_k)_{k=1}^∞ are i.i.d. sequences:
β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)
G = Σ_{k=1}^∞ π_k δ_{φ_k}
Here Σ_{k=1}^∞ π_k = 1 and π_k is the probability of atom φ_k, so (π_k) is a distribution over the positive integers.
Why is G a DP? …
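The construction above can be simulated by truncating the infinite sum at a large K; a sketch, assuming a standard-normal base distribution G₀ and α₀ = 2 purely for illustration:

```python
import numpy as np

def stick_breaking(alpha0, truncation, rng):
    """Truncated stick-breaking construction of the weights of G ~ DP(alpha0, G0)."""
    betas = rng.beta(1.0, alpha0, size=truncation)            # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                                   # pi_k = beta_k * prod_{l<k}(1 - beta_l)

rng = np.random.default_rng(0)
pis = stick_breaking(alpha0=2.0, truncation=1000, rng=rng)
phis = rng.normal(size=1000)   # atoms phi_k drawn i.i.d. from G0 = N(0, 1) (illustrative choice)

print(pis.sum())   # close to 1 for a large truncation level
```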
Non-parametric Bayesian Learning
Chinese Restaurant Process
A restaurant has an infinite number of tables, and customers (words, generated from the φ_k, one-to-one) enter the restaurant sequentially. The ith customer (θ_i) sits at a table according to the probability:
p(θ_i = occupied table k | θ_1, …, θ_{i−1}) = n_k / (i − 1 + α₀)
p(θ_i = a new table | θ_1, …, θ_{i−1}) = α₀ / (i − 1 + α₀)
where n_k is the number of customers already seated at table k.
θ_i gives a clustering: unsupervised-learning uses include clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
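The seating rule above is easy to simulate directly; a sketch (the number of customers and the value of α₀ are arbitrary illustrative choices):

```python
import random

def crp(num_customers, alpha0, seed=0):
    """Simulate table assignments under a Chinese Restaurant Process."""
    rng = random.Random(seed)
    tables = []                       # tables[k] = number of customers at table k
    assignments = []
    for i in range(num_customers):    # customer i+1 arrives; i customers are seated
        # existing table k has prob n_k/(i + alpha0); a new table has prob alpha0/(i + alpha0)
        r = rng.uniform(0, i + alpha0)
        acc = 0.0
        for k, n_k in enumerate(tables):
            acc += n_k
            if r < acc:
                tables[k] += 1
                assignments.append(k)
                break
        else:
            tables.append(1)          # open a new table
            assignments.append(len(tables) - 1)
    return tables, assignments

tables, assignments = crp(num_customers=100, alpha0=2.0)
print(len(tables), tables)   # the number of occupied tables grows roughly like alpha0 * log(n)
```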
Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
A DP alone is not enough: we need similarity (grouping) rather than exact cloning, which leads to mixture models. (You can draw the graphical model yourself.)
• Mixture models: an element is generated from a mixture/group of variables (usually latent variables) → GMM, LDA, pLSA, …
• DPM: θ_i | G ~ G,  x_i | θ_i ~ F(θ_i). For text data, F(θ_i) is Discrete/Multinomial.
This form is intuitive but not directly helpful; use the stick-breaking construction:
β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)
G = Σ_{k=1}^∞ π_k δ_{φ_k}
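Combining the two pieces, data can be generated from a DPM by drawing G via truncated stick-breaking and then drawing each x_i through its atom. A sketch, assuming for illustration G₀ = N(0, 10²) and F(θ) = N(θ, 1) rather than the multinomial used for text:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha0, K = 2.0, 200                  # truncation level K stands in for infinity

# G ~ DP(alpha0, G0) via truncated stick-breaking
betas = rng.beta(1.0, alpha0, size=K)
pis = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
pis = pis / pis.sum()                 # renormalize the truncated weights
phis = rng.normal(0.0, 10.0, size=K)  # atoms: component means drawn from G0 = N(0, 10^2)

# DPM: theta_i | G ~ G,  x_i | theta_i ~ F(theta_i) = N(theta_i, 1)
z = rng.choice(K, size=500, p=pis)    # which atom each data point uses
x = rng.normal(phis[z], 1.0)

print(len(np.unique(z)), "components actually used for 500 points")
```

Even though K = 200 components are available, only a handful receive data: the stick-breaking weights decay fast, which is the clustering effect of the DPM.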
Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
Truncated at a finite number of components, the DPM for text reduces to the finite Dirichlet Multinomial Mixture Model (DMMM).
What can the DMMM do? Clustering: a sparse document count vector such as
(0, 0, 0, Caption, 0, 0, 0, 0, 0, 0, USA, 0, 0, 0, 0, 0, 0, 0, 0, 0, Action, 0, 0, 0, 0, 0, 0, 0, Hero, 0, 0, 0, 0, 0, 0, …)
is assigned to a cluster.
Non-parametric Bayesian Learning
Hierarchical Dirichlet Process (HDP)
Construction:
G₀ | γ, H ~ DP(γ, H)
G_j | α₀, G₀ ~ DP(α₀, G₀)   (one G_j per document/group j)
HDP mixture: θ_ji | G_j ~ G_j,  x_ji | θ_ji ~ F(θ_ji)
With a finite number of topics and F = Mult, this reduces to LDA, i.e., a hierarchical Dirichlet multinomial mixture model. A very natural model for the statistics guys, but for us computer guys… heh…
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
What can we get from reviews, blogs, question answering, Twitter, news, …? Only topics? Far from enough.
What we really need is a hierarchy that illustrates what exactly the text tells people, like the example hierarchy shown on the slide.
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04]
nCRP: in a restaurant at the 1st level there is one table, which is linked to an infinite number of tables at the 2nd level. Each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level. This structure repeats, like a Matryoshka doll.
At each level the CRP is the prior used to choose a table, which forms a path: one document, one path.
Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Generative Process
1. Let c₁ be the root restaurant (only one table).
2. For each level ℓ ∈ {2, …, L}: draw a table from restaurant c_{ℓ−1} using the CRP, and set c_ℓ to be the restaurant referred to by that table.
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α).
4. For each word w_n:
   Draw z ∈ {1, …, L} ~ Mult(θ).
   Draw w_n from the topic associated with restaurant c_z.
[Plate diagram: hyper-parameters α and γ; path c₁, c₂, …, c_L over the T plate; level assignment z_{m,n} and word w_{m,n} inside the N and M plates; topics β_k.]
L can be infinite, but it need not be.
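The path-drawing part of this generative process (steps 1 and 2) can be sketched as follows; the function name and parameter values are illustrative, not from the paper:

```python
import random

def ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from a nested CRP of fixed depth.

    Each node keeps CRP counts over its children; a document descends level by
    level, choosing an existing child with prob n_c/(n + gamma) or opening a
    new one with prob gamma/(n + gamma).
    """
    rng = random.Random(seed)
    children = {(): []}               # node (a tuple path) -> list of child visit counts
    paths = []
    for _ in range(num_docs):
        node = ()                     # start at the root restaurant
        for _level in range(1, depth):
            counts = children.setdefault(node, [])
            r = rng.uniform(0, sum(counts) + gamma)
            acc = 0.0
            for c, n_c in enumerate(counts):
                acc += n_c
                if r < acc:
                    counts[c] += 1
                    node = node + (c,)
                    break
            else:
                counts.append(1)      # create a new branch (new table)
                node = node + (len(counts) - 1,)
        paths.append(node)
    return paths

paths = ncrp_paths(num_docs=20, depth=3, gamma=1.0)
print(paths[:5])   # each path records depth-1 branch choices below the root
```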
Non-parametric Bayesian Learning
What we can get: an inferred topic hierarchy (example figure omitted).
Markov Chain Monte Carlo
Markov Chain
Initialization probability: π₀ = {π₀(1), π₀(2), …, π₀(|S|)}
πₙ = πₙ₋₁P = πₙ₋₂P² = ⋯ = π₀Pⁿ   (P: the transition matrix; Chapman-Kolmogorov equation)
P = [ p(1|1)   p(2|1)   …  p(|S| | 1)
      p(1|2)   p(2|2)   …  p(|S| | 2)
      ⋮        ⋮            ⋮
      p(1| |S|) p(2| |S|) … p(|S| | |S|) ]
Convergence theorem: under the premise of connectivity (irreducibility and aperiodicity) of P,
lim_{n→∞} Pⁿᵢⱼ = π(j), where π(j) = Σ_{i=1}^{|S|} π(i) Pᵢⱼ
so every row of lim_{n→∞} Pⁿ equals π = {π(1), π(2), …, π(j), …, π(|S|)},
and lim_{n→∞} π₀Pⁿ = π regardless of π₀.
Stationary Distribution
x₀ ~ π₀(x) → x₁ ~ π₁(x) → ⋯ → xₙ ~ π(x) → xₙ₊₁ ~ π(x) → xₙ₊₂ ~ π(x) → ⋯
Once the chain has converged, each of xₘ, xₘ₊₁, … is a sample from the stationary distribution π(x).
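The convergence πₙ = π₀Pⁿ → π can be observed directly by repeated multiplication; a sketch with a made-up 3-state transition matrix:

```python
import numpy as np

# A small irreducible, aperiodic transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

pi0 = np.array([1.0, 0.0, 0.0])   # an arbitrary initial distribution

pi = pi0.copy()
for _ in range(200):              # pi_n = pi_0 P^n
    pi = pi @ P

print(pi)                         # the stationary distribution
print(pi @ P)                     # unchanged: pi = pi P
```

Starting from any other π₀ gives the same limit, which is exactly why a converged Markov chain can be used as a sampler for π.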
Markov Chain Monte Carlo
Gibbs Sampling
Step 1: Initialize x⁽⁰⁾ = {x_i⁽⁰⁾ : i = 1, 2, …, n}
Step 2: for t = 0, 1, 2, …
1. x₁⁽ᵗ⁺¹⁾ ~ p(x₁ | x₂⁽ᵗ⁾, x₃⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
2. x₂⁽ᵗ⁺¹⁾ ~ p(x₂ | x₁⁽ᵗ⁺¹⁾, x₃⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
3. …
4. x_j⁽ᵗ⁺¹⁾ ~ p(x_j | x₁⁽ᵗ⁺¹⁾, …, x_{j−1}⁽ᵗ⁺¹⁾, x_{j+1}⁽ᵗ⁾, …, xₙ⁽ᵗ⁾)
5. …
6. xₙ⁽ᵗ⁺¹⁾ ~ p(xₙ | x₁⁽ᵗ⁺¹⁾, x₂⁽ᵗ⁺¹⁾, …, x_{n−1}⁽ᵗ⁺¹⁾)
In short, each variable is resampled from its full conditional: x_i ~ p(x_i | x_{−i}).
Gibbs sampling is a special case of Metropolis-Hastings sampling.
You want to know "Gibbs sampling for HDP/DPM/nCRP"? You'd better understand Gibbs sampling for LDA and the DMMM first.
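As a concrete instance of the scan above, here is a two-variable Gibbs sampler whose target, a bivariate normal with correlation ρ = 0.8, is chosen purely for illustration because its full conditionals are known in closed form:

```python
import numpy as np

# Target: (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
rho = 0.8
sd = np.sqrt(1.0 - rho**2)
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0
samples = []
for t in range(20000):
    x1 = rng.normal(rho * x2, sd)   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, sd)   # x2 ~ p(x2 | x1), using the fresh x1
    if t >= 1000:                   # discard burn-in before convergence
        samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8
```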
Reference
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007.
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006.
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012.
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003.
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010.
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008.
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference.
• Rick Durrett. Probability: Theory and Examples, 2010.
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007.
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014.
• David P. Williams. Gaussian Processes, Duke University, 2006.
Q&A