Nonparametric Bayesian for Sequential Data¹
Jaesik Choi
Statistical Artificial Intelligence Laboratory (SAIL)*
UNIST
*http://sail.unist.ac.kr
¹Some slides are courtesy of [Zoubin Ghahramani, UAI 2005], [Miller, 2009], and [Orbanz, NIPS 2012].
Overview
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Terminology
Parametric model
Number of parameters is fixed (or bounded by a constant) w.r.t. the sample size.
Nonparametric model
Number of parameters grows with the sample size: an ∞-dimensional parameter space.
Example: Density estimation
Nonparametric Bayesian Model
Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Interpretation: parameter space T = set of possible model parameters (or patterns), for example:
Problem | Parameter space T
Density estimation | Probability distributions
Regression | Smooth functions
Clustering | Partitions
Solution to a Bayesian problem = posterior distribution on model parameters.
Exchangeability
Can we justify our assumptions?
Assumption: data = model + noise.
In Bayes' theorem:
$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$
Definition: $X_1, \dots, X_n$ are exchangeable if $p(X_1, \dots, X_n)$ is invariant under any permutation $\pi$:
$p(X_1 = x_1, \dots, X_n = x_n) = p(X_1 = x_{\pi(1)}, \dots, X_n = x_{\pi(n)})$
e.g. $p(1, 1, 0, 0, 0, 1, 1, 0) = p(1, 0, 1, 0, 1, 0, 0, 1)$
Order of observations does not matter.
Exchangeability
De Finetti's Theorem (binary case): all exchangeable binary sequences are mixtures of Bernoulli sequences [de Finetti, 1931]:
$p(x_1, \dots, x_n) = \int_0^1 \theta^{t_n} (1 - \theta)^{n - t_n}\, dF(\theta),$
where $p(x_1, \dots, x_n) = p(X_1 = x_1, \dots, X_n = x_n)$ and $t_n = \sum_{i=1}^{n} x_i$.
Implications
Exchangeable data decomposes into mixtures of models with i.i.d. sequences.
Caution: θ is in general an ∞-dimensional quantity.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Clustering
Observations $X_1, X_2, \dots$
Each observation belongs to exactly one cluster.
Unknown pattern = partition of $\{1, \dots, n\}$
Mixture Models
Mixture models
$p(\text{data} \mid \text{model}) = \int_{\Omega_\theta} p(\text{data} \mid \theta)\, m(d\theta)$
m is called the mixing measure.
Two-stage sampling: sample data $X \sim p(\cdot \mid m)$ as:
1. $\Theta \sim m$
2. $X \sim p(\cdot \mid \Theta)$
Finite mixture model:
$m(\cdot) = \sum_{k=1}^{K} c_k \delta_{\theta_k}(\cdot)$
Bayesian Mixture Models (BMMs)
Random mixing measure: $m(\cdot) = \sum_{k=1}^{K} c_k \delta_{\theta_k}(\cdot)$
Conjugate priors: a Bayesian model is conjugate if the posterior is an element of the same class of distributions as the prior ('closure under sampling').
p(data|model), likelihood | p(model), conjugate prior
Multinomial | Dirichlet
Gaussian | Gaussian
Choice of priors in BMMs: choose a conjugate prior for each parameter, e.g., a Dirichlet prior on the mixture weights.
Dirichlet Process Mixtures
Dirichlet process (DP) [Ferguson, 73] [Sethuraman, 94]
A DP is a distribution on random probability measures of the form
$m(\cdot) = \sum_{k=1}^{\infty} c_k \delta_{\theta_k}(\cdot)$, where $\sum_{k=1}^{\infty} c_k = 1$.
Constructive definition of DP(α, G0):
$\theta_k \sim_{iid} G_0$
$V_k \sim_{iid} \mathrm{Beta}(1, \alpha)$
Compute $c_k$ as
$c_k = V_k \prod_{i=1}^{k-1} (1 - V_i)$
This procedure is called the 'stick-breaking construction'.
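A minimal stick-breaking sampler, as a sketch (assuming NumPy; the truncation level K is an approximation choice, not part of the definition):

```python
import numpy as np

def stick_breaking(alpha, base_sampler, K=1000, rng=None):
    """Truncated stick-breaking draw from DP(alpha, G0).

    Returns atoms theta_k ~iid G0 and weights c_k = V_k * prod_{i<k}(1 - V_i).
    """
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha, size=K)                            # V_k ~ Beta(1, alpha)
    c = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))   # stick weights
    theta = base_sampler(K, rng)                                # theta_k ~iid G0
    return theta, c

# Example with G0 = N(0, 1); the weights sum to ~1 for large K.
theta, c = stick_breaking(2.0, lambda K, r: r.normal(0.0, 1.0, K))
print(c.sum())
```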
Posterior Distribution
DP Posterior
$\theta_{n+1} \mid \theta_1, \dots, \theta_n \sim \frac{1}{n + \alpha} \sum_{j=1}^{n} \delta_{\theta_j} + \frac{\alpha}{n + \alpha} G_0$
Mixture Posterior
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
Conjugacy
The posterior of DP(α, G0) is $\mathrm{DP}\!\left(\alpha + n,\; \frac{1}{n + \alpha}\left(\sum_k n_k \delta_{\theta^*_k} + \alpha G_0\right)\right)$.
The Dirichlet process is conjugate.
Inference
Latent variables
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
We observe the $x_i$ and do not actually observe the $\theta_k$ (latent).
Assignment probabilities
$q_{jk} \propto n_k\, p(x_j \mid \theta^*_k)$ and $q_{j0} \propto \alpha \int p(x_j \mid \theta)\, G_0(\theta)\, d\theta$
Gibbs sampling (see the sketch below) uses an assignment variable $\phi_j$ for each observation $x_j$:
Assignment step: sample $\phi_j \sim \mathrm{Multinomial}(q_{j0}, \dots, q_{jK_n})$.
Parameter step: sample $\theta^*_k$ from its posterior, $p(\theta^*_k \mid \cdot) \propto G_0(\theta^*_k) \prod_{x_j \in \text{cluster } k} p(x_j \mid \theta^*_k)$.
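A minimal sketch of the assignment step (assuming NumPy; `likelihood` and `marginal` are hypothetical callbacks standing in for $p(x \mid \theta)$ and $\int p(x \mid \theta)\, G_0(\theta)\, d\theta$):

```python
import numpy as np

def assignment_step(x_j, clusters, alpha, likelihood, marginal, rng=None):
    """Sample a cluster assignment for one observation x_j.

    clusters: list of (n_k, theta_k) pairs; drawing the last index means
    'open a new cluster'. Parameter resampling happens elsewhere.
    """
    rng = np.random.default_rng(rng)
    q = [n_k * likelihood(x_j, theta_k) for n_k, theta_k in clusters]
    q.append(alpha * marginal(x_j))      # q_{j0}: weight of a brand-new cluster
    q = np.asarray(q)
    return rng.choice(len(q), p=q / q.sum())
```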
Number of Clusters
Dirichlet process
Kn = # of clusters in sample of size n
E[Kn] = O(log(n)) (verified empirically in the sketch below)
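A quick empirical check of the logarithmic growth, as a sketch (assuming NumPy; the seating rule is the Chinese restaurant process predictive rule discussed later):

```python
import numpy as np

def crp_num_clusters(n, alpha, rng):
    """Simulate CRP seating for n customers and return K_n."""
    sizes = []
    for i in range(n):
        # join table k w.p. n_k/(i + alpha), open a new table w.p. alpha/(i + alpha)
        p = np.array(sizes + [alpha]) / (i + alpha)
        k = rng.choice(len(p), p=p)
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return len(sizes)

rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    K_bar = np.mean([crp_num_clusters(n, 2.0, rng) for _ in range(20)])
    print(n, K_bar, 2.0 * np.log(n))   # mean K_n tracks alpha * log(n)
```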
Modeling assumption
Parametric clustering: K∞ is finite (possibly unknown, but fixed).
Nonparametric clustering: K∞ is infinite.
Rephrasing the question
The estimate of $K_n$ is controlled by the distribution of the cluster sizes $c_k$ in $\sum_k c_k \delta_{\theta_k}$.
What should we assume about the distribution of the $c_k$?
Generalizing the DP
Pitman-Yor process
$p(x_{n+1} \mid x_1, \dots, x_n) = \sum_{k=1}^{K_n} \frac{n_k - d}{n + \alpha}\, p(x_{n+1} \mid \theta^*_k) + \frac{\alpha + K_n d}{n + \alpha} \int p(x_{n+1} \mid \theta)\, G_0(\theta)\, d\theta$
Discount parameter $d \in [0, 1]$.
(Figure: cluster-size distributions under the DP and the Pitman-Yor process.)
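A sketch contrasting cluster sizes under the two predictive rules (assuming NumPy; setting d = 0 recovers the DP):

```python
import numpy as np

def pitman_yor_sizes(n, alpha, d, rng):
    """Cluster sizes grown by the Pitman-Yor predictive rule."""
    sizes = []
    for _ in range(n):
        K = len(sizes)
        w = np.array([s - d for s in sizes] + [alpha + K * d])  # unnormalized
        k = rng.choice(K + 1, p=w / w.sum())
        if k == K:
            sizes.append(1)
        else:
            sizes[k] += 1
    return sorted(sizes, reverse=True)

rng = np.random.default_rng(0)
print(pitman_yor_sizes(5000, 1.0, 0.0, rng)[:8])  # d = 0: DP, few large clusters
print(pitman_yor_sizes(5000, 1.0, 0.5, rng)[:8])  # d = 0.5: heavier power-law tail
```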
Power Laws
The distribution of cluster sizes is called a power law if
$c_j \sim \gamma(\beta) \cdot j^{-\beta}$
for some $\beta \in [0, 1]$.
Examples of power laws:
Word frequencies
Popularity (# of friends) in social networks
Pitman-Yor language model
Random Partitions
Discrete measures and partitions
Sampling from a discrete measure determines a partition of $\mathbb{N}$ into blocks $b_k$:
$\theta_n \sim_{iid} \sum_{k=1}^{\infty} c_k \delta_{\theta^*_k}$, and set $n \in b_k \Leftrightarrow \theta_n = \theta^*_k$
As $n \to \infty$, the block proportions converge: $|b_k| / n \to c_k$.
Induced random partition
The distribution of a random discrete measure $m = \sum_{k=1}^{\infty} c_k \delta_{\theta_k}$ induces the distribution of a random partition $\Pi = (B_1, B_2, \dots)$.
Exchangeable random partitions
$\Pi$ is called exchangeable if its distribution depends only on the sizes of its blocks.
All exchangeable random partitions, and only those, can be represented by a random discrete distribution (Kingman's theorem).
Chinese Restaurant Process (CRP)
Chinese Restaurant Process
The distribution of the random partition induced by the Dirichlet process.
'Customers and tables' analogy
Customers = observations (indices in $\mathbb{N}$)
Tables = clusters (blocks)
Historical remark
Originally introduced by Dubins & Pitman as a distribution on infinite permutations.
A permutation of n items defines a partition of $\{1, \dots, n\}$ (regard the cycles of the permutation as the blocks of the partition).
The induced distribution on partitions is the CRP we use in clustering.
Summary: Clustering
Nonparametric Bayesian clustering
Infinite # of clusters, $K_n \le n$ of which are observed.
If the partition is exchangeable, it can be represented by a random discrete distribution.
Inference: latent-variable algorithms, since the assignments (i.e., the partition) are not observed.
Gibbs sampling
Variational algorithms
Prior assumption
The distribution of cluster sizes implies a prior assumption on the number $K_n$ of clusters.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Gaussian Distributions
Gaussian
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Multivariate Gaussian
$p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Gaussian Processes: Definition
GPs are distributions over functions such that any finite set of function evaluations $[f(x_1), \dots, f(x_n)]$ has a jointly Gaussian distribution.
A GP is completely specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and covariance kernel function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.
Given data $x = [x_1, \dots, x_n]$ and $y = [f(x_1), \dots, f(x_n)]$, the GP model is specified by $\mu(x)$ and $k(x, x')$.
Notation for 'f follows the GP':
$f \sim \mathcal{GP}(\mu(x), k(x, x'))$
The likelihood is:
$p(y \mid X) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\!\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right)$
where $\mu = [\mu(x_1), \dots, \mu(x_n)]$ and $\Sigma_{ij} = k(x_i, x_j)$.
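A minimal sketch of this likelihood (assuming NumPy; the squared-exponential kernel and the jitter term are illustrative choices):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential kernel: sigma^2 exp(-(x - x')^2 / (2 ell^2))."""
    return sigma**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_log_likelihood(x, y, kernel, mu=0.0, jitter=1e-8):
    """log p(y | X) for f ~ GP(mu, k) evaluated at inputs x."""
    n = len(x)
    Sigma = kernel(x, x) + jitter * np.eye(n)    # jitter keeps Sigma invertible
    r = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

x = np.linspace(0, 1, 50)
Sigma = se_kernel(x, x) + 1e-8 * np.eye(50)
f = np.random.default_rng(0).multivariate_normal(np.zeros(50), Sigma)  # prior draw
print(gp_log_likelihood(x, f, se_kernel))
```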
Gaussian Processes: Samples with different kernels
Gaussian Processes: Predictions with different kernels
A sample from the prior for each covariance function
Corresponding predictions, mean with two standard deviations:
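A sketch of such predictions (reusing the illustrative se_kernel above; the noise level is an assumed hyperparameter):

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, kernel, noise=1e-2):
    """Posterior mean and a mean +/- 2 sd band at x_test (zero prior mean)."""
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = kernel(x_train, x_test)                    # cross-covariance
    Kss = kernel(x_test, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)       # posterior mean
    var = np.clip(np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks)), 0.0, None)
    sd = np.sqrt(var)
    return mean, mean - 2 * sd, mean + 2 * sd

# Usage: mean, lo, hi = gp_predict(x_train, y_train, x_test, se_kernel)
```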
Gaussian Processes
Nonparametric regression
Parameter space = continuous functions, say on [a, b]:
$f : [a, b] \to \mathbb{R}$, $\quad T = C[a, b]$
Gaussian Process
$f \sim \mathcal{GP} \;\Leftrightarrow\; (f(x_1), \dots, f(x_d))$ is d-dimensional Gaussian
for any finite set $X \subset [a, b]$.
Construction: Intuition
The marginal of the GP for any finite $X \subset [a, b]$ is a Gaussian.
All these Gaussians are marginals of each other.
Examples
Application | Parameter space | Bayesian nonparametric model
Regression | Function | Gaussian process
Clustering | Partition | Chinese restaurant process
Density estimation | Density | Dirichlet process mixture
Hierarchical clustering | Hierarchical partition | Dirichlet/Pitman-Yor diffusion tree, nested CRP
Latent variable modeling | Features | Beta process / Indian buffet process
Dictionary learning | Dictionary | Beta process / Indian buffet process
Survival analysis | Hazard | Beta process, neutral-to-the-right process
Power-law behaviour | | Pitman-Yor process, stable-beta process
Deep learning | Features | Cascading/nested Indian buffet process
Dimensionality reduction | Manifold | Gaussian process latent variable model
Topic models | Atomic distribution | Hierarchical Dirichlet process
Relational modeling | Relations | Infinite relational model, Mondrian process
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Recent advances in Nonparametric Regression
Problem: descriptive prediction of multiple time series.
(Figure: an example series annotated with learned components: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'; a second series annotated with 'Constant function' and 'Sudden drop between 9/12/01 and 9/15/01'.)
Models
Automatic Bayesian Covariance Discovery* [Lloyd et al., 2014] [Ghahramani, 2015]
* http://www.automaticstatistician.com/
Gaussian Processes
A function follows a Gaussian process:
$f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$
Mean function: $\mu(x) = \mathbb{E}[f(x)]$
Covariance kernel function: $k(x, x') = \mathrm{Cov}(f(x), f(x'))$
Function evaluations follow a multivariate Gaussian:
$[f(x_1), \dots, f(x_N)] \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
Mean vector: $\boldsymbol{\mu} = [\mu(x_1), \dots, \mu(x_N)]$
Covariance matrix: $\boldsymbol{\Sigma}_{ij} = k(x_i, x_j)$
* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)
[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]
The Automatic Statistician*
$f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$
Find an appropriate kernel: (1) encode characteristics; (2) compose new kernels.
If $g(x) \sim \mathcal{GP}(0, k_g)$, $h(x) \sim \mathcal{GP}(0, k_h)$ and $g(x) \perp h(x)$, then
$g(x) + h(x) \sim \mathcal{GP}(0, k_g + k_h)$
$g(x) \times h(x) \sim \mathcal{GP}(0, k_g \times k_h)$
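A sketch of kernel composition (assuming NumPy; the PER form below is the standard exp-sine-squared kernel, since the slide defers the exact formula to the appendix):

```python
import numpy as np

def LIN(x1, x2, sigma=1.0, ell=0.0):
    """Linear kernel sigma^2 (x - ell)(x' - ell)."""
    return sigma**2 * np.outer(x1 - ell, x2 - ell)

def SE(x1, x2, sigma=1.0, ell=1.0):
    """Smooth (squared-exponential) kernel."""
    return sigma**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def PER(x1, x2, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel; exp-sine-squared form assumed here."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma**2 * np.exp(-2 * np.sin(np.pi * d / p)**2 / ell**2)

x = np.linspace(0, 4, 100)
K_sum = SE(x, x) + PER(x, x)     # superposition: smooth + periodic (OR)
K_prod = SE(x, x) * PER(x, x)    # locally periodic structure (AND)
```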
The Automatic Statistician: Base kernels
Base kernel | Encoding function | Kernel function | Parameters
LIN(x, x') | Linear function | $\sigma^2 (x - \ell)(x' - \ell)$ | $\sigma, \ell$
SE(x, x') | Smooth function | $\sigma^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right)$ | $\sigma, \ell$
PER(x, x') | Periodic function | (in appendix) | $\sigma, \ell, p$
(The original slide also shows example kernel shapes and example encoded functions as figures.)
The Automatic Statistician: Operators
Op. | Concept | Params | Example
+ | Addition / superposition (OR operator) | N/A | SE + PER, LIN + PER
× | Multiplication (AND operator) | N/A | SE × PER
The Automatic Statistician: Operators (continued)
Op. | Concept | Params | Example
CP | Divide left vs. right (changepoint) | $\ell, s$ | CP(LIN, LIN)
CW | Divide in vs. out (changewindow) | $\ell, s, w$ | CW(LIN, C)
(Figures: example kernel compositions and the encoded functions they produce.)
The Automatic statistician: Learning
(1) Optimization criterion: the Bayesian Information Criterion (BIC)
$\mathrm{BIC}(\mathcal{M}) = -2 \log P(D \mid \mathcal{M}) + |\mathcal{M}| \log |D|$
The first term is the negative log-likelihood (model fit); the second penalizes model complexity, where $|\mathcal{M}|$ is the number of model parameters and $|D|$ is the number of data points.
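A direct transcription of the criterion (the log-likelihood values in the example are hypothetical):

```python
import numpy as np

def bic(log_lik, num_params, num_data):
    """BIC(M) = -2 log P(D | M) + |M| log |D|; lower is better."""
    return -2.0 * log_lik + num_params * np.log(num_data)

# Comparing two fitted kernels on n = 200 points (log-likelihoods hypothetical):
print(bic(-310.2, num_params=2, num_data=200))   # SE
print(bic(-295.7, num_params=5, num_data=200))   # SE + PER
```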
The Automatic statistician: Learning
(2) Learning algorithm (Composite Kernel Learning)
Iteratively select the best model (structure $k$, parameters $\theta$):
(1) Expand: the current kernel
(2) Optimize: conjugate gradient descent
(3) Select: the best kernel in the level (greedy)
(4) Iterate: go back to (1) for the next level
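A schematic sketch of this search loop (function and kernel names are hypothetical; a real implementation would fit hyperparameters by conjugate gradients and score each candidate by BIC):

```python
def expand(kernel_expr, bases=("LIN", "SE", "PER")):
    """Level expansion: combine the current kernel with each base via + and x."""
    return [f"({kernel_expr} + {b})" for b in bases] + \
           [f"({kernel_expr} x {b})" for b in bases]

def greedy_search(data, score, levels=3, start="SE"):
    """Expand -> optimize/score -> select, repeated greedily per level.

    score(expr, data) stands in for: fit hyperparameters, then return
    the BIC of the fitted model (lower is better).
    """
    best = start
    for _ in range(levels):
        best = min(expand(best), key=lambda e: score(e, data))
    return best
```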
Example Result
(Figure: learned decomposition of a series, annotated with: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'.)
The Automatic Statistician*
Prediction Performance (Extrapolation)
13 famous regression datasets
The Relational Automatic Statistician
[Hwang, Tong and Choi, 2016]
* http://www.automaticstatistician.com/
Motivation
(Figure: adjusted close of General Electric around 9/11/2001, annotated with: 'Linear function, decrease x/week'; 'Smooth function, length scale: y weeks'; 'Rapidly varying smooth function, length scale: z hours'.)
The Automatic Statistician
Motivation
Adjusted close of General Electric, Microsoft, ExxonMobil around 9/11/2001.
• Exploit multiple time series
• Find global descriptions
• Hope for better predictive performance
The Relational Automatic Statistician
Model: Gaussian Process
(Diagram: the mean function and covariance kernel define a Gaussian process over a latent function; latent function evaluations generate the observations. Mean function: fixed; kernel: given; kernel parameters: optimized.)
$P(D \mid \mathcal{M}) = P(D \mid \mathcal{GP}(0, k(x, x'; \theta)))$
Gaussian Processes → CKL → RKL → SRKL
Model: Composite Kernel Learning (CKL)
(Same diagram; the kernel structure is now drawn from a compositional grammar and its parameters are optimized.)
$P(D \mid \mathcal{M}) = P(D \mid \mathcal{GP}(0, k(x, x'; \theta)))$
GPs → Composite Kernel Learning → RKL → SRKL
Generalizes multiple kernel learning.
Model: Relational Kernel Learning (RKL)
(Same diagram; all M series share one kernel structure from the grammar; each series j has its own scale parameter $\sigma_j$ and shift parameter $c_j$.)
$P(D \mid \mathcal{M}) = \prod_{j=1}^{M} P(d_j \mid \mathcal{GP}(0,\, \sigma_j \times k(x, x'; \theta) + c_j))$
GPs → CKL → Relational Kernel Learning → SRKL
(Illustration: the same shared latent function reproduced under different per-series scale and shift parameters.)
Model: Semi-Relational Kernel Learning (SRKL)
(Same diagram; on top of the shared kernel, each series j has its own distinctive spectral-mixture kernel $k_j$, optimized per series.)
$P(D \mid \mathcal{M}) = \prod_{j=1}^{M} P(d_j \mid \mathcal{GP}(0,\, \sigma_j \times k(x, x'; \theta) + k_j(x, x'; \theta_j)))$
GPs → CKL → RKL → Semi-Relational Kernel Learning
Semi-Relational Kernel Learning (SRKL)
Input: M time series
Output: a shared kernel $k$ and M spectral mixture (SM) kernels $k_j$
(1) Expand: the current shared kernel for all time series
(2) Optimize: the expanded kernels for all M time series (conjugate gradient descent); for each series, individual structure is handled by its SM kernel
(3) Select: the best shared kernel for all time series (greedy)
(4) Iterate: go back to (1) while search level s has not been reached
Find the best shared and distinctive kernels iteratively!
Experimental Results
Three real-world data sets:
US top 9 stocks in year 2001
US top 6 housing markets from 2003 to 2013
Currency exchange rates of 4 emerging markets
Data sets:
Description | Graphs (normalized) | Details
9 adjusted closes of stocks | (figure) | GE, MSFT, XOM, PFE, C, WMT, INTC, BP, AIG
6 US housing price indices | (figure) | New York, Los Angeles, Chicago, Phoenix, San Diego, San Francisco
4 emerging-market currency exchange rates | (figure) | Indonesian IDR, Malaysian MYR, South African ZAR, Russian RUB
Qualitative Results
(Figures: adjusted closes with learned components 1 and 2.)
US stock market values suddenly drop after the US 9/11 attacks.
Currency exchange rates are affected by the Fed's policy change in interest rates around mid-September 2015.
(Figures: Automatic Statistician vs. Relational Automatic Statistician on the 4 currency exchange rates, with the learned component.)
An Automatically Generated Report
Quantitative Results
STOCK3 = GE, MSFT, XOM
STOCK6 = STOCK3 + PFE, C, WMT
STOCK9 = STOCK6 + INTC, BP, AIG
HOUSE2 = NY, LA
HOUSE4 = HOUSE2 + Chicago, Phoenix
HOUSE6 = HOUSE4 + San Diego, San Francisco
CURRENCY4 = IDR, MYR, ZAR, RUB
Quantitative Results (box plots)
9 stocks 6 house price indices 4 currency exchanges
Conclusion
• Our research topic: solving the descriptive prediction problem for multiple time series.
• We proposed models that exploit both common and distinct changes.
• We found that our models show better qualitative and quantitative performance compared to the state-of-the-art GP regression method.
http://saildemo.unist.ac.kr/automatic_statistician/
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Introduction
Simple Example
Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.
Suppose we observe: T, H, H, T. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.
Now suppose we observe: H, H, H, H. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable?
Not really. Why?
Introduction
Simple Example
When we observe H, H, H, H, why does θ = 1 seem unreasonable?
Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.
Introduction
Bayesian Approach to Estimating θ
Place a Beta(a, b) prior on θ. This prior has the form
$p(\theta) \propto \theta^{a-1} (1 - \theta)^{b-1}.$
What does this distribution look like?
(Figure: Beta densities for $(\alpha_1, \alpha_2)$ = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).)
Introduction
Bayesian Approach to Estimating θ
After observing X, a sequence with n heads and m tails, the posterior on θ is:
$p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta) \propto \theta^{a+n-1} (1 - \theta)^{b+m-1} \;\sim\; \mathrm{Beta}(a + n,\, b + m).$
If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like
(Figure: the Beta(6, 3) density.)
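The conjugate update in code (a minimal sketch assuming SciPy):

```python
from scipy import stats

def beta_posterior(a, b, heads, tails):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + heads, b + tails) posterior."""
    return stats.beta(a + heads, b + tails)

post = beta_posterior(a=1, b=1, heads=5, tails=2)   # the Beta(6, 3) example above
print(post.mean())           # posterior mean 6/9 = 2/3
print(post.interval(0.95))   # central 95% credible interval
```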
Parametric Bayesian Clustering
The Dirichlet Distribution
We had $\pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$.
The Dirichlet density is defined as
$p(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}\, \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} \cdots \pi_K^{\alpha_K - 1}$
where $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$.
The expectations of π are
$\mathbb{E}(\pi_i) = \frac{\alpha_i}{\sum_{i=1}^{K} \alpha_i}$
Parametric Bayesian Clustering
The Beta Distribution
A special case of the Dirichlet distribution is the Beta distribution, for K = 2:
$p(\pi \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, \pi^{\alpha_1 - 1} (1 - \pi)^{\alpha_2 - 1}$
(Figure: Beta densities for $(\alpha_1, \alpha_2)$ = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0).)
Parametric Bayesian Clustering
The Dirichlet Distribution
In three dimensions:
$p(\pi \mid \alpha_1, \alpha_2, \alpha_3) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}\, \pi_1^{\alpha_1 - 1} \pi_2^{\alpha_2 - 1} (1 - \pi_1 - \pi_2)^{\alpha_3 - 1}$
(Figures: Dirichlet densities on the simplex for α = (2, 2, 2), (5, 5, 5), and (2, 2, 5).)
Parametric Bayesian Clustering
Draws from the Dirichlet Distribution
(Figures: three sample draws each from the Dirichlet distribution with α = (2, 2, 2), (5, 5, 5), and (2, 2, 5), shown as bar plots over the three components.)
Parametric Bayesian Clustering
Key Property of the Dirichlet Distribution
The Aggregation Property: If
$(\pi_1, \dots, \pi_i, \pi_{i+1}, \dots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_i, \alpha_{i+1}, \dots, \alpha_K)$
then
$(\pi_1, \dots, \pi_i + \pi_{i+1}, \dots, \pi_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_i + \alpha_{i+1}, \dots, \alpha_K)$
This is also valid for any aggregation, e.g.,
$\left(\pi_1 + \pi_2,\; \sum_{k=3}^{K} \pi_k\right) \sim \mathrm{Beta}\!\left(\alpha_1 + \alpha_2,\; \sum_{k=3}^{K} \alpha_k\right)$
Parametric Bayesian Clustering
Multinomial-Dirichlet Conjugacy
Let Z ∼ Multinomial(π) and π ∼ Dir(α).
Posterior:
$p(\pi \mid z) \propto p(z \mid \pi)\, p(\pi) = (\pi_1^{z_1} \cdots \pi_K^{z_K})(\pi_1^{\alpha_1 - 1} \cdots \pi_K^{\alpha_K - 1}) = \pi_1^{z_1 + \alpha_1 - 1} \cdots \pi_K^{z_K + \alpha_K - 1}$
which is $\mathrm{Dir}(\alpha + z)$.
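The update is just vector addition (a minimal sketch assuming NumPy):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Dir(alpha) prior + multinomial counts z -> Dir(alpha + z) posterior."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts)

alpha_post = dirichlet_posterior([1.0, 1.0, 1.0], [5, 2, 3])
print(alpha_post)                      # [6. 3. 4.], i.e. Dir(alpha + z)
print(alpha_post / alpha_post.sum())   # posterior mean E[pi_i] = alpha_i / sum(alpha)
```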
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
The Dirichlet Process Model
Parameters for the Dirichlet Process
• α - The concentration parameter.
• G0 - The base measure. A prior distribution for the cluster-specific parameters.
The Dirichlet Process (DP) is a distribution over distributions. We write
G ∼ DP (α,G0)
to indicate G is a distribution drawn from the DP.
It will become clearer in a bit what α and G0 are.
The Dirichlet Process Model
The Dirichlet Process
Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ ℝ⁺.
The Dirichlet process DP(α, G0) is the distribution on probability measures G such that, for any finite partition $(A_1, \dots, A_m)$ of Ω,
$(G(A_1), \dots, G(A_m)) \sim \mathrm{Dir}(\alpha G_0(A_1), \dots, \alpha G_0(A_m)).$
(Figure: a partition $A_1, \dots, A_5$ of Ω.)
(Ferguson, '73)
The Dirichlet Process Model
Mathematical Properties of the Dirichlet Process
Suppose we sample
• $G \sim \mathrm{DP}(\alpha, G_0)$
• $\theta_1 \sim G$
What is the posterior distribution of G given θ1?
$G \mid \theta_1 \sim \mathrm{DP}\!\left(\alpha + 1,\; \frac{\alpha}{\alpha + 1} G_0 + \frac{1}{\alpha + 1} \delta_{\theta_1}\right)$
More generally,
$G \mid \theta_1, \dots, \theta_n \sim \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{\theta_i}\right)$
The Dirichlet Process Model
Mathematical Properties of the Dirichlet Process
With probability 1, a sample $G \sim \mathrm{DP}(\alpha, G_0)$ is of the form
$G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$
(Sethuraman, '94)
The Dirichlet Process Model
The Stick-Breaking Process
• Define an infinite sequence of Beta random variables:
$\beta_k \sim \mathrm{Beta}(1, \alpha), \quad k = 1, 2, \dots$
• Then define an infinite sequence of mixing proportions as:
$\pi_1 = \beta_1, \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \quad k = 2, 3, \dots$
• This can be viewed as breaking off portions of a stick:
(Diagram: a unit stick broken into pieces $\beta_1$, $\beta_2(1 - \beta_1)$, ….)
• When π is drawn this way, we write $\pi \sim \mathrm{GEM}(\alpha)$.
The Dirichlet Process Model
The Stick-Breaking Process
• We now have an explicit formula for each $\pi_k$: $\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l)$
• We can also easily see that $\sum_{k=1}^{\infty} \pi_k = 1$ (w.p. 1):
$1 - \sum_{k=1}^{K} \pi_k = 1 - \beta_1 - \beta_2(1 - \beta_1) - \beta_3(1 - \beta_1)(1 - \beta_2) - \cdots = (1 - \beta_1)(1 - \beta_2 - \beta_3(1 - \beta_2) - \cdots) = \prod_{k=1}^{K} (1 - \beta_k) \to 0 \quad \text{(w.p. 1 as } K \to \infty\text{)}$
• So $G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$ has a clean definition as a random measure.
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
The Dirichlet Process Model
The Chinese Restaurant Process (CRP)
• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables.
• The first customer sits at the first table.
• The mth subsequent customer sits at a table drawn from the following distribution:
$P(\text{previously occupied table } i \mid F_{m-1}) \propto n_i$
$P(\text{the next unoccupied table} \mid F_{m-1}) \propto \alpha$
where $n_i$ is the number of customers currently at table i and $F_{m-1}$ denotes the state of the restaurant after m − 1 customers have been seated.
The Dirichlet Process Model
The CRP and Clustering
• Data points are customers; tables are clusters
• the CRP defines a prior distribution on the partitioning of the data and on the number of tables
• This prior can be completed with:
• a likelihood, e.g., associate a parameterized probability distribution with each table
• a prior for the parameters: the first customer to sit at table k chooses the parameter vector for that table ($\phi_k$) from the prior
(Diagram: tables with parameters $\phi_1, \phi_2, \phi_3, \phi_4$.)
• So we now have a distribution (or can obtain one) for any quantity that we might care about in the clustering setting.
The Dirichlet Process Model
The CRP Prior, Gaussian Likelihood, Conjugate Prior
The Dirichlet Process Model
The CRP and the DP
OK, so we've seen how the CRP relates to clustering. How does it relate to the DP?
Important fact: The CRP is exchangeable.
Remember De Finetti's Theorem: If $(x_1, x_2, \dots)$ are infinitely exchangeable, then for all n
$p(x_1, \dots, x_n) = \int \left(\prod_{i=1}^{n} p(x_i \mid G)\right) dP(G)$
for some random variable G.
The Dirichlet Process Model
The CRP and the DP
The Dirichlet process is the De Finetti mixing distribution for the CRP.
That means, when we integrate out G, we get the CRP:
$p(\theta_1, \dots, \theta_n) = \int \prod_{i=1}^{n} p(\theta_i \mid G)\, dP(G)$
(Graphical model: $\alpha$ and $G_0$ parameterize $G$; $\theta_i \sim G$ generates $x_i$, with a plate over the N observations.)
The Dirichlet Process Model
The CRP and the DP
The Dirichlet process is the De Finetti mixing distribution for the CRP.
In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.
The Dirichlet Process Model
The DP, CRP, and Stick-Breaking Process Summary
$G \sim \mathrm{DP}(\alpha, G_0)$
Stick-Breaking Process (just the weights)
The CRP describes the partitions of θ when G is marginalized out.
(Graphical model: $\alpha, G_0 \to G \to \theta_i \to x_i$, plate over N.)
Beta Processes
Definition [Hjort, 90]: Let H0 be a continuous probability measure on (Ω, B) and α ∈ ℝ⁺. The beta process BP(α, H0) is the distribution on random measures H such that, for any disjoint finite partition $(A_1, \dots, A_K)$ of Ω,
$H(A_i) \sim \mathrm{Beta}(\alpha H_0(A_i),\; \alpha (1 - H_0(A_i)))$
as $K \to \infty$ and $H_0(A_i) \to 0$ for $i = 1, \dots, K$.
The beta process can be written in set-function form:
$H(w) = \sum_{k=1}^{\infty} \pi_k \delta_{w_k}(w)$
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Indian Buffet Processes
Latent feature models
Clustering is not enough for mixed memberships: a grouping problem with overlapping clusters.
Encode as a binary matrix: observation n in cluster k ⇔ $x_{nk} = 1$.
Alternatively: item n possesses feature k ⇔ $x_{nk} = 1$.
Indian buffet process (IBP)
1. Customer 1 tries Poisson(α) dishes.
2. Subsequent customer n + 1:
tries a previously tried dish k with probability $\frac{n_k}{n+1}$
tries $\mathrm{Poisson}\!\left(\frac{\alpha}{n+1}\right)$ new dishes.
Properties
An exchangeable distribution over finite sets (of dishes).
Observation (= customer) n is in cluster (= dish) k if the customer 'tries dish k' (see the sampler sketch below).
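A minimal sampler sketch following the process above (assuming NumPy):

```python
import numpy as np

def ibp_sample(n_customers, alpha, rng=None):
    """Binary feature matrix Z drawn from the Indian buffet process."""
    rng = np.random.default_rng(rng)
    counts = []            # counts[k] = # customers who have tried dish k
    rows = []
    for n in range(n_customers):
        # customer n+1 tries old dish k with probability n_k / (n + 1)
        row = [rng.random() < nk / (n + 1) for nk in counts]
        for k, tried in enumerate(row):
            counts[k] += tried
        new = rng.poisson(alpha / (n + 1))    # Poisson(alpha/(n+1)) new dishes
        row += [True] * new
        counts += [1] * new
        rows.append(row)
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(ibp_sample(10, alpha=2.0, rng=0))
```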
Indian Buffet Process
Alternative description
1. Sample $w_1, \dots, w_K \sim_{iid} \mathrm{Beta}(1, \alpha/K)$
2. Sample $x_{1k}, \dots, x_{nk} \sim_{iid} \mathrm{Bernoulli}(w_k)$
Beta Process (BP)
Distribution on objects of the form
$\theta = \sum_{k=1}^{\infty} w_k \delta_{\phi_k}$ with $w_k \in [0, 1]$.
IBP matrix entries are sampled as $x_{nk} \sim_{iid} \mathrm{Bernoulli}(w_k)$.
The beta process is the de Finetti measure of the IBP.
θ is a random measure.
Binary matrices in left-order form
Contents
1 Introduction
2 Dirichlet Processes - Nonparametric Clustering
3 Gaussian Processes - Nonparametric Regression
4 Case Study: The Relational Automatic Statistician System
5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models
Hierarchical Gaussian Processes
Apply the Bayesian representation recursively. Split the parameter Θ:
$\Theta \to \Psi$ and $\Theta \mid \Psi$
$p(\text{data}) = p(\text{data} \mid \Theta)\, p(\Theta) = p(\text{data} \mid \Theta)\, p(\Theta \mid \Psi)\, p(\Psi)$
Example: Hierarchical Gaussian process
Sample $\Psi \sim p(\Psi)$ (e.g., large length-scale, mean 0)
Sample $\Theta \mid \Psi \sim p(\cdot \mid \Psi)$ (e.g., smaller length-scale, mean Ψ)
Decompose the underlying pattern:
Low-frequency component Ψ
High-frequency component Θ
Hierarchical Dirichlet Processes
Sampling scheme
Sample $G_0 \sim \mathrm{DP}(\gamma, H)$
Sample $G_1, G_2, \dots \sim \mathrm{DP}(\alpha, G_0)$
Sample $x_{ij} \sim G_j$
$G_1, G_2, \dots$ have a common 'vocabulary' of atoms.
Nonparametric Latent Dirichlet Allocation (LDA)
$G_0 = \sum_{k=1}^{\infty} c_k \delta_{\theta^*_k} \qquad G_j = \sum_{l=1}^{\infty} D^j_l \delta_{\Phi^j_l}$
$\theta^*_k$ = a finite probability distribution (= 'topic')
$c_k$ = occurrence probability of topic k
Document j is drawn from a weighted combination of topics, with proportions $D^j_l$ ('admixture model').
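A truncated sketch of this sampling scheme (assuming NumPy; the truncation level K and the Gaussian base measure H are illustrative choices, and the group-level DP draw is approximated by a finite Dirichlet):

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated stick-breaking (GEM) weights."""
    V = rng.beta(1.0, alpha, size=K)
    return V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))

def hdp_groups(gamma, alpha, n_groups, K=100, rng=None):
    """Truncated HDP sketch: shared atoms with global weights c ~ GEM(gamma);
    each group's DP(alpha, G0) draw is approximated by Dir(alpha * c), so all
    groups share one 'vocabulary' of atoms with group-specific proportions."""
    rng = np.random.default_rng(rng)
    c = gem(gamma, K, rng)                                # weights of G0
    atoms = rng.normal(0.0, 1.0, size=K)                  # theta*_k ~ H (H = N(0,1) here)
    D = rng.dirichlet(alpha * c + 1e-12, size=n_groups)   # group weights D^j
    return atoms, c, D

atoms, c, D = hdp_groups(gamma=1.0, alpha=5.0, n_groups=3, rng=0)
```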