Nonparametric Bayesian for Sequential Data¹
Jaesik Choi
Statistical Artificial Intelligence Laboratory (SAIL)*
UNIST
*http://sail.unist.ac.kr
¹Some slides are courtesy of [Zoubin Ghahramani, UAI 2005], [Miller, 2009], and [Orbanz, NIPS 2012].
Overview
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Terminology
Parametric model
Number of parameters is fixed (or bounded by a constant) w.r.t. the sample size.
Nonparametric model
Number of parameters grows with the sample size: an ∞-dimensional parameter space.
Example: Density estimation
Nonparametric Bayesian Model
Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Interpretation: parameter space T = set of possible model parameters (or patterns), for example:
Problem | Parameter space T
Density estimation | Probability distributions
Regression | Smooth functions
Clustering | Partitions
Solution to a Bayesian problem = posterior distribution on model parameters.
Exchangeability
Can we justify our assumptions?
Assumption: data = model + noise.
In Bayes' theorem:
$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$
Definition: $X_1, \dots, X_n$ are exchangeable if $p(X_1, \dots, X_n)$ is invariant under any permutation $\pi$:
$p(X_1 = x_1, \dots, X_n = x_n) = p(X_1 = x_{\pi(1)}, \dots, X_n = x_{\pi(n)})$
e.g. $p(1, 1, 0, 0, 0, 1, 1, 0) = p(1, 0, 1, 0, 1, 0, 0, 1)$
Order of observations does not matter.
Exchangeability
De Finetti's Theorem (binary case): all exchangeable binary sequences are mixtures of Bernoulli sequences [de Finetti, 1931]:
$p(x_1, \dots, x_n) = \int_0^1 \theta^{t_n} (1 - \theta)^{n - t_n}\, dF(\theta),$
where $p(x_1, \dots, x_n) = p(X_1 = x_1, \dots, X_n = x_n)$ and $t_n = \sum_{i=1}^{n} x_i$.
Implications
Exchangeable data decomposes into mixtures of models with i.i.d. sequences.
Caution: θ is in general an ∞-dimensional quantity.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Clustering
Observations $X_1, X_2, \dots$
Each observation belongs to exactly one cluster.
Unknown pattern = partition of $\{1, \dots, n\}$
Mixture Models
Mixture models
$p(\text{data} \mid \text{model}) = \int_{\Omega_\theta} p(\text{data} \mid \theta)\, m(d\theta)$
m is called the mixing measure.
Two-stage sampling: sample data $X \sim p(\cdot \mid m)$ as:
1. $\Theta \sim m$
2. $X \sim p(\cdot \mid \Theta)$
Finite mixture model:
$m(\cdot) = \sum_{k=1}^{K} c_k \delta_{\theta_k}(\cdot)$
Bayesian Mixture Models (BMMs)
Random mixing measure: $m(\cdot) = \sum_{k=1}^{K} c_k \delta_{\theta_k}(\cdot)$
Conjugate priors: a Bayesian model is conjugate if the posterior is an element of the same class of distributions as the prior ('closure under sampling').
p(data|model), likelihood | p(model), conjugate prior
Multinomial | Dirichlet
Gaussian | Gaussian
Choice of priors in BMMs: choose a conjugate prior for each parameter, e.g., a Dirichlet prior on the mixture weights.
Dirichlet Process Mixtures
Dirichlet process (DP) [Ferguson, 73] [Sethuraman, 94]
A DP is a distribution on random probability measures of the form
$m(\cdot) = \sum_{k=1}^{\infty} c_k \delta_{\theta_k}(\cdot)$, where $\sum_{k=1}^{\infty} c_k = 1$.
Constructive definition of DP(α, G0):
$\theta_k \sim_{iid} G_0$
$V_k \sim_{iid} \mathrm{Beta}(1, \alpha)$
Compute $c_k$ as
$c_k = V_k \prod_{i=1}^{k-1} (1 - V_i)$
This procedure is called the 'stick-breaking construction'.
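A minimal stick-breaking sampler, as a sketch (assuming NumPy; the truncation level K is an approximation choice, not part of the definition):

```python
import numpy as np

def stick_breaking(alpha, base_sampler, K=1000, rng=None):
    """Truncated stick-breaking draw from DP(alpha, G0).

    Returns atoms theta_k ~iid G0 and weights c_k = V_k * prod_{i<k}(1 - V_i).
    """
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha, size=K)                            # V_k ~ Beta(1, alpha)
    c = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))   # stick weights
    theta = base_sampler(K, rng)                                # theta_k ~iid G0
    return theta, c

# Example with G0 = N(0, 1); the weights sum to ~1 for large K.
theta, c = stick_breaking(2.0, lambda K, r: r.normal(0.0, 1.0, K))
print(c.sum())
```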
Posterior Distribution
DP Posterior
$\theta_{n+1} \mid \theta_1, \dots, \theta_n \sim \frac{1}{n + \alpha} \sum_{j=1}^{n} \delta_{\theta_j} + \frac{\alpha}{n + \alpha} G_0$
Mixture Posterior
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
Conjugacy
The posterior of DP(α, G0) is $\mathrm{DP}\!\left(\alpha + n,\; \frac{1}{n + \alpha}\left(\sum_k n_k \delta_{\theta^*_k} + \alpha G_0\right)\right)$.
The Dirichlet process is conjugate.
Inference
Latent variables
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
We observe the $x_i$ and do not actually observe the $\theta_k$ (latent).
Assignment probabilities
$q_{jk} \propto n_k\, p(x_j \mid \theta^*_k)$ and $q_{j0} \propto \alpha \int p(x_j \mid \theta)\, G_0(\theta)\, d\theta$
Gibbs sampling (see the sketch below) uses an assignment variable $\phi_j$ for each observation $x_j$:
Assignment step: sample $\phi_j \sim \mathrm{Multinomial}(q_{j0}, \dots, q_{jK_n})$.
Parameter step: sample $\theta^*_k$ from its posterior, $p(\theta^*_k \mid \cdot) \propto G_0(\theta^*_k) \prod_{x_j \in \text{cluster } k} p(x_j \mid \theta^*_k)$.
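A minimal sketch of the assignment step (assuming NumPy; `likelihood` and `marginal` are hypothetical callbacks standing in for $p(x \mid \theta)$ and $\int p(x \mid \theta)\, G_0(\theta)\, d\theta$):

```python
import numpy as np

def assignment_step(x_j, clusters, alpha, likelihood, marginal, rng=None):
    """Sample a cluster assignment for one observation x_j.

    clusters: list of (n_k, theta_k) pairs; drawing the last index means
    'open a new cluster'. Parameter resampling happens elsewhere.
    """
    rng = np.random.default_rng(rng)
    q = [n_k * likelihood(x_j, theta_k) for n_k, theta_k in clusters]
    q.append(alpha * marginal(x_j))      # q_{j0}: weight of a brand-new cluster
    q = np.asarray(q)
    return rng.choice(len(q), p=q / q.sum())
```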
Number of Clusters
Dirichlet process
Kn = # of clusters in sample of size n
E[Kn] = O(log(n)) (verified empirically in the sketch below)
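A quick empirical check of the logarithmic growth, as a sketch (assuming NumPy; the seating rule is the Chinese restaurant process predictive rule discussed later):

```python
import numpy as np

def crp_num_clusters(n, alpha, rng):
    """Simulate CRP seating for n customers and return K_n."""
    sizes = []
    for i in range(n):
        # join table k w.p. n_k/(i + alpha), open a new table w.p. alpha/(i + alpha)
        p = np.array(sizes + [alpha]) / (i + alpha)
        k = rng.choice(len(p), p=p)
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return len(sizes)

rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    K_bar = np.mean([crp_num_clusters(n, 2.0, rng) for _ in range(20)])
    print(n, K_bar, 2.0 * np.log(n))   # mean K_n tracks alpha * log(n)
```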
Modeling assumption
Parametric clustering: K∞ is finite (possibly unknown, but fixed).
Nonparametric clustering: K∞ is infinite.
Rephrasing the question
The estimate of $K_n$ is controlled by the distribution of the cluster sizes $c_k$ in $\sum_k c_k \delta_{\theta_k}$.
What should we assume about the distribution of the $c_k$?
Generalizing the DP
Pitman-Yor process
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k - d}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha + K_n d}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
Discount parameter $d \in [0, 1]$.
(Figure: cluster-size distributions under the DP and the Pitman-Yor process.)
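A sketch contrasting cluster sizes under the two predictive rules (assuming NumPy; setting d = 0 recovers the DP):

```python
import numpy as np

def pitman_yor_sizes(n, alpha, d, rng):
    """Cluster sizes grown by the Pitman-Yor predictive rule."""
    sizes = []
    for _ in range(n):
        K = len(sizes)
        w = np.array([s - d for s in sizes] + [alpha + K * d])  # unnormalized
        k = rng.choice(K + 1, p=w / w.sum())
        if k == K:
            sizes.append(1)
        else:
            sizes[k] += 1
    return sorted(sizes, reverse=True)

rng = np.random.default_rng(0)
print(pitman_yor_sizes(5000, 1.0, 0.0, rng)[:8])  # d = 0: DP, few large clusters
print(pitman_yor_sizes(5000, 1.0, 0.5, rng)[:8])  # d = 0.5: heavier power-law tail
```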
Power Laws
The distribution of cluster sizes is called a power law if
$c_j \sim \gamma(\beta) \cdot j^{-\beta}$
for some $\beta \in [0, 1]$.
Examples of power laws:
Word frequencies
Popularity (# of friends) in social networks
Pitman-Yor language model
Random Partitions
Discrete measures and partitions
Sampling from a discrete measure determines a partition of $\mathbb{N}$ into blocks $b_k$:
$\theta_n \sim_{iid} \sum_{k=1}^{\infty} c_k \delta_{\theta^*_k}$, and set $n \in b_k \Leftrightarrow \theta_n = \theta^*_k$
As $n \to \infty$, the block proportions converge: $|b_k| / n \to c_k$.
Induced random partition
The distribution of a random discrete measure $m = \sum_{k=1}^{\infty} c_k \delta_{\theta_k}$ induces the distribution of a random partition $\Pi = (B_1, B_2, \dots)$.
Exchangeable random partitions
$\Pi$ is called exchangeable if its distribution depends only on the sizes of its blocks.
All exchangeable random partitions, and only those, can be represented by a random discrete distribution (Kingman's theorem).
Chinese Restaurant Process (CRP)
Chinese Restaurant Process
The distribution of the random partition induced by the Dirichlet process.
'Customers and tables' analogy
Customers = observations (indices in $\mathbb{N}$)
Tables = clusters (blocks)
Historical remark
Originally introduced by Dubins & Pitman as a distribution on infinite permutations.
A permutation of n items defines a partition of $\{1, \dots, n\}$ (regard the cycles of the permutation as the blocks of the partition).
The induced distribution on partitions is the CRP we use in clustering.
Summary: Clustering
Nonparametric Bayesian clustering
Infinite # of clusters, $K_n \le n$ of which are observed.
If the partition is exchangeable, it can be represented by a random discrete distribution.
Inference: latent-variable algorithms, since the assignments (i.e., the partition) are not observed.
Gibbs sampling
Variational algorithms
Prior assumption
The distribution of cluster sizes implies a prior assumption on the number $K_n$ of clusters.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Gaussian Distributions
Gaussian
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Multivariate Gaussian
$p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Gaussian Processes: Definition
GPs are distributions over functions such that any finite set of function evaluations $[f(x_1), \dots, f(x_n)]$ has a jointly Gaussian distribution.
A GP is completely specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and covariance kernel function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.
Given data $x = [x_1, \dots, x_n]$ and $y = [f(x_1), \dots, f(x_n)]$, the GP model is specified by $\mu(x)$ and $k(x, x')$.
Notation for 'f follows the GP':
$f \sim \mathcal{GP}(\mu(x), k(x, x'))$
The likelihood is:
$p(y \mid X) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\!\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right)$
where $\mu = [\mu(x_1), \dots, \mu(x_n)]$ and $\Sigma_{ij} = k(x_i, x_j)$.
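A minimal sketch of this likelihood (assuming NumPy; the squared-exponential kernel and the jitter term are illustrative choices):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential kernel: sigma^2 exp(-(x - x')^2 / (2 ell^2))."""
    return sigma**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_log_likelihood(x, y, kernel, mu=0.0, jitter=1e-8):
    """log p(y | X) for f ~ GP(mu, k) evaluated at inputs x."""
    n = len(x)
    Sigma = kernel(x, x) + jitter * np.eye(n)    # jitter keeps Sigma invertible
    r = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

x = np.linspace(0, 1, 50)
Sigma = se_kernel(x, x) + 1e-8 * np.eye(50)
f = np.random.default_rng(0).multivariate_normal(np.zeros(50), Sigma)  # prior draw
print(gp_log_likelihood(x, f, se_kernel))
```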
Gaussian Processes: Samples with different kernels
Gaussian Processes: Predictions with different kernels
A sample from the prior for each covariance function
Corresponding predictions, mean with two standard deviations:
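A sketch of such predictions (reusing the illustrative se_kernel above; the noise level is an assumed hyperparameter):

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, kernel, noise=1e-2):
    """Posterior mean and a mean +/- 2 sd band at x_test (zero prior mean)."""
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = kernel(x_train, x_test)                    # cross-covariance
    Kss = kernel(x_test, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)       # posterior mean
    var = np.clip(np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks)), 0.0, None)
    sd = np.sqrt(var)
    return mean, mean - 2 * sd, mean + 2 * sd

# Usage: mean, lo, hi = gp_predict(x_train, y_train, x_test, se_kernel)
```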
Gaussian Processes
Nonparametric regression
Parameter space = continuous functions, say on [a, b]:
$f : [a, b] \to \mathbb{R}$, $\quad T = C[a, b]$
Gaussian Process
$f \sim \mathcal{GP} \;\Leftrightarrow\; (f(x_1), \dots, f(x_d))$ is d-dimensional Gaussian
for any finite set $X \subset [a, b]$.
Construction: Intuition
The marginal of the GP for any finite $X \subset [a, b]$ is a Gaussian.
All these Gaussians are marginals of each other.
Examples
Application | Parameter space | Bayesian nonparametric model
Regression | Function | Gaussian process
Clustering | Partition | Chinese restaurant process
Density estimation | Density | Dirichlet process mixture
Hierarchical clustering | Hierarchical partition | Dirichlet/Pitman-Yor diffusion tree, nested CRP
Latent variable modeling | Features | Beta process / Indian buffet process
Dictionary learning | Dictionary | Beta process / Indian buffet process
Survival analysis | Hazard | Beta process, neutral-to-the-right process
Power-law behaviour | | Pitman-Yor process, stable-beta process
Deep learning | Features | Cascading/nested Indian buffet process
Dimensionality reduction | Manifold | Gaussian process latent variable model
Topic models | Atomic distribution | Hierarchical Dirichlet process
Relational modeling | Relations | Infinite relational model, Mondrian process
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Recent advances in Nonparametric Regression
Problem: descriptive prediction of multiple time series.
(Figure: an example series annotated with learned components: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'; a second series annotated with 'Constant function' and 'Sudden drop between 9/12/01 and 9/15/01'.)
Models
Automatic Bayesian Covariance Discovery* [Lloyd et al., 2014] [Ghahramani, 2015]
* http://www.automaticstatistician.com/
Gaussian Processes
A function follows a Gaussian process:
$f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$
Mean function: $\mu(x) = \mathbb{E}[f(x)]$
Covariance kernel function: $k(x, x') = \mathrm{Cov}(f(x), f(x'))$
Function evaluations follow a multivariate Gaussian:
$[f(x_1), \dots, f(x_N)] \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
Mean vector: $\boldsymbol{\mu} = [\mu(x_1), \dots, \mu(x_N)]$
Covariance matrix: $\boldsymbol{\Sigma}_{ij} = k(x_i, x_j)$
* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)
[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]
The Automatic Statistician*
$f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$
Find an appropriate kernel: (1) encode characteristics; (2) compose new kernels.
If $g(x) \sim \mathcal{GP}(0, k_g)$, $h(x) \sim \mathcal{GP}(0, k_h)$ and $g(x) \perp h(x)$, then
$g(x) + h(x) \sim \mathcal{GP}(0, k_g + k_h)$
$g(x) \times h(x) \sim \mathcal{GP}(0, k_g \times k_h)$
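A sketch of kernel composition (assuming NumPy; the PER form below is the standard exp-sine-squared kernel, since the slide defers the exact formula to the appendix):

```python
import numpy as np

def LIN(x1, x2, sigma=1.0, ell=0.0):
    """Linear kernel sigma^2 (x - ell)(x' - ell)."""
    return sigma**2 * np.outer(x1 - ell, x2 - ell)

def SE(x1, x2, sigma=1.0, ell=1.0):
    """Smooth (squared-exponential) kernel."""
    return sigma**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def PER(x1, x2, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel; exp-sine-squared form assumed here."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma**2 * np.exp(-2 * np.sin(np.pi * d / p)**2 / ell**2)

x = np.linspace(0, 4, 100)
K_sum = SE(x, x) + PER(x, x)     # superposition: smooth + periodic (OR)
K_prod = SE(x, x) * PER(x, x)    # locally periodic structure (AND)
```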
The Automatic Statistician: Base kernels
Base kernel | Encoding function | Kernel function | Parameters
LIN(x, x') | Linear function | $\sigma^2 (x - \ell)(x' - \ell)$ | $\sigma, \ell$
SE(x, x') | Smooth function | $\sigma^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right)$ | $\sigma, \ell$
PER(x, x') | Periodic function | (in appendix) | $\sigma, \ell, p$
(The original slide also shows example kernel shapes and example encoded functions as figures.)
The Automatic Statistician: Operators
Op. | Concept | Params | Example
+ | Addition / superposition (OR operator) | N/A | SE + PER, LIN + PER
× | Multiplication (AND operator) | N/A | SE × PER
The Automatic Statistician: Operators (continued)
Op. | Concept | Params | Example
CP | Divide left vs. right (changepoint) | $\ell, s$ | CP(LIN, LIN)
CW | Divide in vs. out (changewindow) | $\ell, s, w$ | CW(LIN, C)
(Figures: example kernel compositions and the encoded functions they produce.)
The Automatic statistician: Learning
(1) Optimization criterion: the Bayesian Information Criterion (BIC)
$\mathrm{BIC}(\mathcal{M}) = -2 \log P(D \mid \mathcal{M}) + |\mathcal{M}| \log |D|$
The first term is the negative log-likelihood (model fit); the second penalizes model complexity, where $|\mathcal{M}|$ is the number of model parameters and $|D|$ is the number of data points.
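A direct transcription of the criterion (the log-likelihood values in the example are hypothetical):

```python
import numpy as np

def bic(log_lik, num_params, num_data):
    """BIC(M) = -2 log P(D | M) + |M| log |D|; lower is better."""
    return -2.0 * log_lik + num_params * np.log(num_data)

# Comparing two fitted kernels on n = 200 points (log-likelihoods hypothetical):
print(bic(-310.2, num_params=2, num_data=200))   # SE
print(bic(-295.7, num_params=5, num_data=200))   # SE + PER
```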
The Automatic statistician: Learning
(2) Learning algorithm (Composite Kernel Learning)
Iteratively select the best model (structure $k$, parameters $\theta$):
(1) Expand: the current kernel
(2) Optimize: conjugate gradient descent
(3) Select: the best kernel in the level (greedy)
(4) Iterate: go back to (1) for the next level
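A schematic sketch of this search loop (function and kernel names are hypothetical; a real implementation would fit hyperparameters by conjugate gradients and score each candidate by BIC):

```python
def expand(kernel_expr, bases=("LIN", "SE", "PER")):
    """Level expansion: combine the current kernel with each base via + and x."""
    return [f"({kernel_expr} + {b})" for b in bases] + \
           [f"({kernel_expr} x {b})" for b in bases]

def greedy_search(data, score, levels=3, start="SE"):
    """Expand -> optimize/score -> select, repeated greedily per level.

    score(expr, data) stands in for: fit hyperparameters, then return
    the BIC of the fitted model (lower is better).
    """
    best = start
    for _ in range(levels):
        best = min(expand(best), key=lambda e: score(e, data))
    return best
```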
Example Result
(Figure: learned decomposition of a series, annotated with: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'.)
The Automatic Statistician*
Prediction Performance (Extrapolation)
13 famous regression datasets
The Relational Automatic Statistician
[Hwang, Tong and Choi, 2016]
* http://www.automaticstatistician.com/
Motivation
(Figure: adjusted close of General Electric around 9/11/2001, annotated with: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'.)
The Automatic Statistician
Motivation
Adjusted close of General Electric, Microsoft, ExxonMobil around 9/11/2001.
• Exploit multiple time series
• Find global descriptions
• Hope for better predictive performance
The Relational Automatic Statistician
Model: Gaussian Process
(Diagram: the mean function and covariance kernel define a Gaussian process over a latent function; latent function evaluations generate the observations. Mean function: fixed; kernel: given; kernel parameters: optimized.)
$P(D \mid \mathcal{M}) = P(D \mid \mathcal{GP}(0, k(x, x'; \theta)))$
Gaussian Processes → CKL → RKL → SRKL
Model: Composite Kernel Learning (CKL)
(Same diagram; the kernel structure is now drawn from a compositional grammar and its parameters are optimized.)
$P(D \mid \mathcal{M}) = P(D \mid \mathcal{GP}(0, k(x, x'; \theta)))$
GPs → Composite Kernel Learning → RKL → SRKL
Generalizes multiple kernel learning.
Model: Relational Kernel Learning (RKL)
(Same diagram; all M series share one kernel structure from the grammar; each series j has its own scale parameter $\sigma_j$ and shift parameter $c_j$.)
$P(D \mid \mathcal{M}) = \prod_{j=1}^{M} P(d_j \mid \mathcal{GP}(0,\, \sigma_j \times k(x, x'; \theta) + c_j))$
GPs → CKL → Relational Kernel Learning → SRKL
(Illustration: the same shared latent function reproduced under different per-series scale and shift parameters.)
Model: Semi-Relational Kernel Learning (SRKL)
(Same diagram; on top of the shared kernel, each series j has its own distinctive spectral-mixture kernel $k_j$, optimized per series.)
$P(D \mid \mathcal{M}) = \prod_{j=1}^{M} P(d_j \mid \mathcal{GP}(0,\, \sigma_j \times k(x, x'; \theta) + k_j(x, x'; \theta_j)))$
GPs → CKL → RKL → Semi-Relational Kernel Learning
Semi-Relational Kernel Learning (SRKL)
Input: M time series
Output: a shared kernel $k$ and M spectral mixture (SM) kernels $k_j$
(1) Expand: the current shared kernel for all time series
(2) Optimize: the expanded kernels for all M time series (conjugate gradient descent); for each series, individual structure is handled by its SM kernel
(3) Select: the best shared kernel for all time series (greedy)
(4) Iterate: go back to (1) while search level s has not been reached
Find the best shared and distinctive kernels iteratively!
Experimental Results
Three real-world data sets:
US top 9 stocks in year 2001
US top 6 housing markets from 2003 to 2013
Currency exchange rates of 4 emerging markets
Data sets:
Description | Graphs (normalized) | Details
9 adjusted closes of stocks | (figure) | GE, MSFT, XOM, PFE, C, WMT, INTC, BP, AIG
6 US housing price indices | (figure) | New York, Los Angeles, Chicago, Phoenix, San Diego, San Francisco
4 emerging-market currency exchange rates | (figure) | Indonesian IDR, Malaysian MYR, South African ZAR, Russian RUB
Qualitative Results
(Figures: adjusted closes with learned components 1 and 2.)
US stock market values suddenly drop after the US 9/11 attacks.
Currency exchange rates are affected by the Fed's policy change in interest rates around mid-September 2015.
(Figures: Automatic Statistician vs. Relational Automatic Statistician on the 4 currency exchange rates, with the learned component.)
An Automatically Generated Report
Quantitative Results
STOCK3 = GE, MSFT, XOM
STOCK6 = STOCK3 + PFE, C, WMT
STOCK9 = STOCK6 + INTC, BP, AIG
HOUSE2 = NY, LA
HOUSE4 = HOUSE2 + Chicago, Phoenix
HOUSE6 = HOUSE4 + San Diego, San Francisco
CURRENCY4 = IDR, MYR, ZAR, RUB
Quantitative Results (box plots)
9 stocks 6 house price indices 4 currency exchanges
Conclusion
• Our research topic: solving the descriptive prediction problem for multiple time series.
• We proposed models that exploit both common and distinct changes.
• We found that our models show better qualitative and quantitative performance compared to the state-of-the-art GP regression method.
http://saildemo.unist.ac.kr/automatic_statistician/
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Introduction
Simple Example
Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.
Suppose we observe: T, H, H, T. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.
Now suppose we observe: H, H, H, H. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable?
Not really. Why?
Introduction
Simple Example
When we observe H, H, H, H, why does θ = 1 seem unreasonable?
Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.
Introduction
Bayesian Approach to Estimating θ
Place a Beta(a, b) prior on θ. This prior has the form
$p(\theta) \propto \theta^{a-1} (1 - \theta)^{b-1}.$
What does this distribution look like?
(Figure: Beta densities for $(\alpha_1, \alpha_2)$ = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).)
Introduction
Bayesian Approach to Estimating θ
After observing X, a sequence with n heads and m tails, the posterior on θ is:
$p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta) \propto \theta^{a+n-1} (1 - \theta)^{b+m-1} \;\sim\; \mathrm{Beta}(a + n,\, b + m).$
If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like
(Figure: the Beta(6, 3) density.)
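The conjugate update in code (a minimal sketch assuming SciPy):

```python
from scipy import stats

def beta_posterior(a, b, heads, tails):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + heads, b + tails) posterior."""
    return stats.beta(a + heads, b + tails)

post = beta_posterior(a=1, b=1, heads=5, tails=2)   # the Beta(6, 3) example above
print(post.mean())           # posterior mean 6/9 = 2/3
print(post.interval(0.95))   # central 95% credible interval
```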
Parametric Bayesian Clustering
The Dirichlet Distribution
We had $\pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$.
The Dirichlet density is defined as
$p(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}\, \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} \cdots \pi_K^{\alpha_K - 1}$
where $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$.
The expectations of π are
$\mathbb{E}(\pi_i) = \frac{\alpha_i}{\sum_{i=1}^{K} \alpha_i}$
Parametric Bayesian Clustering
The Beta Distribution
A special case of the Dirichlet distribution is the Beta distribution, for K = 2:
$p(\pi \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, \pi^{\alpha_1 - 1} (1 - \pi)^{\alpha_2 - 1}$
(Figure: Beta densities for $(\alpha_1, \alpha_2)$ = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).)
Parametric Bayesian Clustering
The Dirichlet Distribution
In three dimensions:
$p(\pi \mid \alpha_1, \alpha_2, \alpha_3) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}\, \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} (1 - \pi_1 - \pi_2)^{\alpha_3 - 1}$
(Figures: Dirichlet densities on the simplex for α = (2, 2, 2), (5, 5, 5), and (2, 2, 5).)
Parametric Bayesian Clustering
Draws from the Dirichlet Distribution
(Figures: three sample draws each from the Dirichlet distribution with α = (2, 2, 2), (5, 5, 5), and (2, 2, 5), shown as bar plots over the three components.)
Parametric Bayesian Clustering
Key Property of the Dirichlet Distribution
The Aggregation Property: If
$(\pi_1, \dots, \pi_i, \pi_{i+1}, \dots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_i, \alpha_{i+1}, \dots, \alpha_K)$
then
$(\pi_1, \dots, \pi_i + \pi_{i+1}, \dots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_i + \alpha_{i+1}, \dots, \alpha_K)$
This is also valid for any aggregation, e.g.,
$\left(\pi_1 + \pi_2,\; \sum_{k=3}^{K} \pi_k\right) \sim \mathrm{Beta}\!\left(\alpha_1 + \alpha_2,\; \sum_{k=3}^{K} \alpha_k\right)$
Parametric Bayesian Clustering
Multinomial-Dirichlet Conjugacy
Let Z ∼ Multinomial(π) and π ∼ Dir(α).
Posterior:
$p(\pi \mid z) \propto p(z \mid \pi)\, p(\pi) = (\pi_1^{z_1} \cdots \pi_K^{z_K})(\pi_1^{\alpha_1 - 1} \cdots \pi_K^{\alpha_K - 1}) = \pi_1^{z_1 + \alpha_1 - 1} \cdots \pi_K^{z_K + \alpha_K - 1}$
which is $\mathrm{Dir}(\alpha + z)$.
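The update is just vector addition (a minimal sketch assuming NumPy):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Dir(alpha) prior + multinomial counts z -> Dir(alpha + z) posterior."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts)

alpha_post = dirichlet_posterior([1.0, 1.0, 1.0], [5, 2, 3])
print(alpha_post)                      # [6. 3. 4.], i.e. Dir(alpha + z)
print(alpha_post / alpha_post.sum())   # posterior mean E[pi_i] = alpha_i / sum(alpha)
```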
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
The Dirichlet Process Model
Parameters for the Dirichlet Process
• α - The concentration parameter.
• G0 - The base measure. A prior distribution for the cluster-specific parameters.
The Dirichlet Process (DP) is a distribution over distributions. We write
G ∼ DP (α,G0)
to indicate G is a distribution drawn from the DP.
It will become clearer in a bit what α and G0 are.
The Dirichlet Process Model
The Dirichlet Process
Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ ℝ⁺.
The Dirichlet process DP(α, G0) is the distribution on probability measures G such that, for any finite partition $(A_1, \dots, A_m)$ of Ω,
$(G(A_1), \dots, G(A_m)) \sim \mathrm{Dir}(\alpha G_0(A_1), \dots, \alpha G_0(A_m)).$
(Figure: a partition $A_1, \dots, A_5$ of Ω.)
(Ferguson, '73)
The Dirichlet Process Model
Mathematical Properties of the Dirichlet Process
Suppose we sample
• $G \sim \mathrm{DP}(\alpha, G_0)$
• $\theta_1 \sim G$
What is the posterior distribution of G given θ1?
$G \mid \theta_1 \sim \mathrm{DP}\!\left(\alpha + 1,\; \frac{\alpha}{\alpha + 1} G_0 + \frac{1}{\alpha + 1} \delta_{\theta_1}\right)$
More generally,
$G \mid \theta_1, \dots, \theta_n \sim \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{\theta_i}\right)$
The Dirichlet Process Model
Mathematical Properties of the Dirichlet Process
With probability 1, a sample $G \sim \mathrm{DP}(\alpha, G_0)$ is of the form
$G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$
(Sethuraman, '94)
The Dirichlet Process Model
The Stick-Breaking Process
• Define an infinite sequence of Beta random variables:
$\beta_k \sim \mathrm{Beta}(1, \alpha), \quad k = 1, 2, \dots$
• Then define an infinite sequence of mixing proportions as:
$\pi_1 = \beta_1, \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \quad k = 2, 3, \dots$
• This can be viewed as breaking off portions of a stick:
(Diagram: a unit stick broken into pieces $\beta_1$, $\beta_2(1 - \beta_1)$, ….)
• When π is drawn this way, we write $\pi \sim \mathrm{GEM}(\alpha)$.
The Dirichlet Process Model
The Stick-Breaking Process
• We now have an explicit formula for each $\pi_k$: $\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l)$
• We can also easily see that $\sum_{k=1}^{\infty} \pi_k = 1$ (w.p. 1):
$1 - \sum_{k=1}^{K} \pi_k = 1 - \beta_1 - \beta_2(1 - \beta_1) - \beta_3(1 - \beta_1)(1 - \beta_2) - \cdots = (1 - \beta_1)(1 - \beta_2 - \beta_3(1 - \beta_2) - \cdots) = \prod_{k=1}^{K} (1 - \beta_k) \to 0 \quad \text{(w.p. 1 as } K \to \infty\text{)}$
• So $G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$ has a clean definition as a random measure.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
The Dirichlet Process Model
The Chinese Restaurant Process (CRP)
• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables.
• The first customer sits at the first table.
• The mth subsequent customer sits at a table drawn from the following distribution:
$P(\text{previously occupied table } i \mid F_{m-1}) \propto n_i$
$P(\text{the next unoccupied table} \mid F_{m-1}) \propto \alpha$
where $n_i$ is the number of customers currently at table i and $F_{m-1}$ denotes the state of the restaurant after m − 1 customers have been seated.
The Dirichlet Process Model
The CRP and Clustering
• Data points are customers; tables are clusters
• the CRP defines a prior distribution on the partitioning of the data and on the number of tables
• This prior can be completed with:
• a likelihood, e.g., associate a parameterized probability distribution with each table
• a prior for the parameters: the first customer to sit at table k chooses the parameter vector for that table ($\phi_k$) from the prior
(Diagram: tables with parameters $\phi_1, \phi_2, \phi_3, \phi_4$.)
• So we now have a distribution (or can obtain one) for any quantity that we might care about in the clustering setting.
The Dirichlet Process Model
The CRP Prior, Gaussian Likelihood, Conjugate Prior
The Dirichlet Process Model
The CRP and the DP
OK, so we've seen how the CRP relates to clustering. How does it relate to the DP?
Important fact: The CRP is exchangeable.
Remember De Finetti's Theorem: If $(x_1, x_2, \dots)$ are infinitely exchangeable, then for all n
$p(x_1, \dots, x_n) = \int \left(\prod_{i=1}^{n} p(x_i \mid G)\right) dP(G)$
for some random variable G.
The Dirichlet Process Model
The CRP and the DP
The Dirichlet process is the De Finetti mixing distribution for the CRP.
That means, when we integrate out G, we get the CRP:
$p(\theta_1, \dots, \theta_n) = \int \prod_{i=1}^{n} p(\theta_i \mid G)\, dP(G)$
(Graphical model: $\alpha$ and $G_0$ parameterize $G$; $\theta_i \sim G$ generates $x_i$, with a plate over the N observations.)
The Dirichlet Process Model
The CRP and the DP
The Dirichlet process is the De Finetti mixing distribution for the CRP.
In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.
The Dirichlet Process Model
The DP, CRP, and Stick-Breaking Process Summary
$G \sim \mathrm{DP}(\alpha, G_0)$
Stick-Breaking Process (just the weights)
The CRP describes the partitions of θ when G is marginalized out.
(Graphical model: $\alpha, G_0 \to G \to \theta_i \to x_i$, plate over N.)
Beta Processes
Definition [Hjort, 90]: Let H0 be a continuous probability measure on (Ω, B) and α ∈ ℝ⁺. The beta process BP(α, H0) is the distribution on random measures H such that, for any disjoint finite partition $(A_1, \dots, A_K)$ of Ω,
$H(A_i) \sim \mathrm{Beta}(\alpha H_0(A_i),\; \alpha (1 - H_0(A_i)))$
as $K \to \infty$ and $H_0(A_i) \to 0$ for $i = 1, \dots, K$.
The beta process can be written in set-function form:
$H(w) = \sum_{k=1}^{\infty} \pi_k \delta_{w_k}(w)$
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Indian Buffet Processes
Latent feature models
Clustering is not enough for mixed memberships: a grouping problem with overlapping clusters.
Encode as a binary matrix: observation n in cluster k ⇔ $x_{nk} = 1$.
Alternatively: item n possesses feature k ⇔ $x_{nk} = 1$.
Indian buffet process (IBP)
1. Customer 1 tries Poisson(α) dishes.
2. Subsequent customer n + 1:
tries a previously tried dish k with probability $\frac{n_k}{n+1}$
tries $\mathrm{Poisson}\!\left(\frac{\alpha}{n+1}\right)$ new dishes.
Properties
An exchangeable distribution over finite sets (of dishes).
Observation (= customer) n is in cluster (= dish) k if the customer 'tries dish k' (see the sampler sketch below).
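A minimal sampler sketch following the process above (assuming NumPy):

```python
import numpy as np

def ibp_sample(n_customers, alpha, rng=None):
    """Binary feature matrix Z drawn from the Indian buffet process."""
    rng = np.random.default_rng(rng)
    counts = []            # counts[k] = # customers who have tried dish k
    rows = []
    for n in range(n_customers):
        # customer n+1 tries old dish k with probability n_k / (n + 1)
        row = [rng.random() < nk / (n + 1) for nk in counts]
        for k, tried in enumerate(row):
            counts[k] += tried
        new = rng.poisson(alpha / (n + 1))    # Poisson(alpha/(n+1)) new dishes
        row += [True] * new
        counts += [1] * new
        rows.append(row)
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(ibp_sample(10, alpha=2.0, rng=0))
```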
Indian Buffet Process
Alternative description
1. Sample $w_1, \dots, w_K \sim_{iid} \mathrm{Beta}(1, \alpha/K)$
2. Sample $x_{1k}, \dots, x_{nk} \sim_{iid} \mathrm{Bernoulli}(w_k)$
Beta Process (BP)
Distribution on objects of the form
$\theta = \sum_{k=1}^{\infty} w_k \delta_{\phi_k}$ with $w_k \in [0, 1]$.
IBP matrix entries are sampled as $x_{nk} \sim_{iid} \mathrm{Bernoulli}(w_k)$.
The beta process is the de Finetti measure of the IBP.
θ is a random measure.
Binary matrices in left-order form
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Hierarchical Gaussian Processes
Apply the Bayesian representation recursively. Split the parameter Θ:
$\Theta \to \Psi$ and $\Theta \mid \Psi$
$p(\text{data}) = p(\text{data} \mid \Theta)\, p(\Theta) = p(\text{data} \mid \Theta)\, p(\Theta \mid \Psi)\, p(\Psi)$
Example: Hierarchical Gaussian process
Sample $\Psi \sim p(\Psi)$ (e.g., large length-scale, mean 0)
Sample $\Theta \mid \Psi \sim p(\cdot \mid \Psi)$ (e.g., smaller length-scale, mean Ψ)
Decompose the underlying pattern:
Low-frequency component Ψ
High-frequency component Θ
Hierarchical Dirichlet Processes
Sampling scheme
Sample $G_0 \sim \mathrm{DP}(\gamma, H)$
Sample $G_1, G_2, \dots \sim \mathrm{DP}(\alpha, G_0)$
Sample $x_{ij} \sim G_j$
$G_1, G_2, \dots$ have a common 'vocabulary' of atoms.
Nonparametric Latent Dirichlet Allocation (LDA)
$G_0 = \sum_{k=1}^{\infty} c_k \delta_{\theta^*_k} \qquad G_j = \sum_{l=1}^{\infty} D^j_l \delta_{\Phi^j_l}$
$\theta^*_k$ = a finite probability distribution (= 'topic')
$c_k$ = occurrence probability of topic k
Document j is drawn from a weighted combination of topics, with proportions $D^j_l$ ('admixture model').
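A truncated sketch of this sampling scheme (assuming NumPy; the truncation level K and the Gaussian base measure H are illustrative choices, and the group-level DP draw is approximated by a finite Dirichlet):

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated stick-breaking (GEM) weights."""
    V = rng.beta(1.0, alpha, size=K)
    return V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))

def hdp_groups(gamma, alpha, n_groups, K=100, rng=None):
    """Truncated HDP sketch: shared atoms with global weights c ~ GEM(gamma);
    each group's DP(alpha, G0) draw is approximated by Dir(alpha * c), so all
    groups share one 'vocabulary' of atoms with group-specific proportions."""
    rng = np.random.default_rng(rng)
    c = gem(gamma, K, rng)                                # weights of G0
    atoms = rng.normal(0.0, 1.0, size=K)                  # theta*_k ~ H (H = N(0,1) here)
    D = rng.dirichlet(alpha * c + 1e-12, size=n_groups)   # group weights D^j
    return atoms, c, D

atoms, c, D = hdp_groups(gamma=1.0, alpha=5.0, n_groups=3, rng=0)
```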