TRANSCRIPT
Scalable Deep Poisson Factor Analysis for Topic Modeling
Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, Lawrence Carin
Duke University
July 9th, 2015
Presented by David Carlson (Duke) Scalable DPFA July 9th, 2015 1 / 23
Outline
1 Introduction
2 Model Formulation
3 Scalable Posterior Inference
4 Experiments
5 Summary
Introduction
Problem of interest: How to develop deep generative models for documents represented in bag-of-words form?

Directed Graphical Models:
- Latent Dirichlet Allocation (LDA) (Blei et al., 2003)
- Focused Topic Model (FTM) (Williamson et al., 2010)
- Poisson Factor Analysis (PFA) (Zhou et al., 2012)

Going "Deep"? Hierarchical tree-structured topic models:
- nested Chinese Restaurant Process (nCRP) (Blei et al., 2004)
- Hierarchical Dirichlet Process (HDP) (Teh et al., 2006)
- nested Hierarchical Dirichlet Process (nHDP) (Paisley et al., 2015)

What if we want to model general topic correlations?
Introduction
Undirected Graphical Models:
- Replicated Softmax Model (RSM) (Salakhutdinov and Hinton, 2009b), a generalization of the Restricted Boltzmann Machine (RBM) (Hinton, 2002)

Going Deep?
- Deep Belief Networks (DBN) (Hinton et al., 2006; Hinton and Salakhutdinov, 2011)
- Deep Boltzmann Machines (DBM) (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2013)

However, topics are not defined "properly".
Introduction
Main idea: Poisson Factor Analysis (PFA) + deep Sigmoid Belief Network (SBN) or Restricted Boltzmann Machine (RBM).
- PFA is employed to interact with data at the bottom layer.
- The deep SBN or RBM serves as a flexible prior for revealing topic structure.
Scalable Deep Poisson Factor Analysis for Topic Modeling

In the experiments we consider both the deep SBN and deep RBM for representation of the latent binary units, which are connected to topic usage in a given document.

Discussion: An important benefit of SBNs over RBMs is that in the former, sparsity or shrinkage priors can be readily imposed on the global parameters W^(1), and fully Bayesian inference can be implemented as shown in Gan et al. (2015). The RBM relies on an approximation technique known as contrastive divergence (Hinton, 2002), for which prior specification for the model parameters is limited.

2.3. Deep Architecture for Topic Modeling

Specifying a prior distribution on h_n^(2) as in (3) might be too restrictive in some cases. Alternatively, we can use another SBN prior for h_n^(2); in fact, we can add multiple layers as in Gan et al. (2015) to obtain a deep architecture,

p(h_n^(1), ..., h_n^(L)) = p(h_n^(L)) ∏_{ℓ=2}^L p(h_n^(ℓ−1) | h_n^(ℓ)),  (6)

where L is the number of layers, p(h_n^(L)) is the prior for the top layer defined as in (3), p(h_n^(ℓ−1) | h_n^(ℓ)) is defined in (4), and the weights W^(ℓ) ∈ R^{K_ℓ × K_{ℓ+1}} and biases c^(ℓ) ∈ R^{K_ℓ} are omitted from the conditional distributions to keep notation uncluttered. A similar deep architecture may be designed for the RBM (Salakhutdinov & Hinton, 2009b).

Instead of employing the beta-Bernoulli specification for h_n^(1) as in the NB-FTM, which assumes independent topic usage probabilities, we propose using (6) as the prior for h_n^(1), thus

p(x_n, h_n) = p(x_n | h_n^(1)) p(h_n^(1), ..., h_n^(L)),  (7)

where h_n ≜ {h_n^(1), ..., h_n^(L)}, and p(x_n | h_n^(1)) is as in (2). The prior p(h_n^(1) | h_n^(2), ..., h_n^(L)) can be seen as a flexible prior distribution over binary vectors that encodes high-order interactions across elements of h_n^(1). The graphical model for our model, Deep Poisson Factor Analysis (DPFA), is shown in Figure 1.

3. Scalable Posterior Inference

We focus on learning our model with fully Bayesian algorithms; however, emerging large-scale corpora prohibit standard MCMC inference algorithms from being applied directly. For example, in the experiments we consider the RCV1-v2 and Wikipedia corpora, which contain about 800K and 10M documents, respectively. Therefore, fast algorithms for big Bayesian learning are essential. While parallel algorithms based on distributed architectures such as the parameter server (Ho et al., 2013; Li et al., 2014) are popular choices, in the work presented here we focus on another direction for scaling up inference via stochastic algorithms, where mini-batches instead of the whole dataset are utilized in each iteration. Specifically, we develop two stochastic Bayesian inference algorithms, based on Bayesian conditional density filtering (Guhaniyogi et al., 2014) and stochastic gradient thermostats (Ding et al., 2014), both of which have theoretical guarantees in the sense of asymptotic convergence to the true posterior distribution.

3.1. Bayesian conditional density filtering

Bayesian conditional density filtering (BCDF) is a recently proposed stochastic algorithm for Bayesian online learning (Guhaniyogi et al., 2014) that extends Markov chain Monte Carlo (MCMC) sampling to streaming data. Sampling in BCDF proceeds by drawing from the conditional posterior distributions of model parameters, obtained by propagating surrogate conditional sufficient statistics (SCSS). In practice, we repeatedly update the SCSS using the current mini-batch and draw S samples from the conditional densities using, for example, a Gibbs sampler. This eliminates the need to load the entire dataset into memory, and provides computationally cheaper Gibbs updates. More importantly, it can be proved that BCDF leads to an approximation of the conditional distributions that produces samples from the correct target posterior asymptotically, once the entire dataset is seen (Guhaniyogi et al., 2014).

In the learning phase, we are interested in learning the global parameters Ψ_g = ({φ_k}, {r_k}, γ_0, {W^(ℓ), c^(ℓ)}). Denote the local variables as Ψ_l = (Θ, H^(ℓ)), and let S_g represent the SCSS for Ψ_g; the BCDF algorithm can be summarized in Algorithm 1. Specifically, we need to obtain the conditional densities, which can be readily derived given the full local conjugacy of the proposed model. Using dot notation to represent marginal sums, e.g., x_·nk ≜ Σ_p x_pnk, we can write the key conditional densities for (2) as in (Zhou & Carin, 2015).

Figure: Graphical model for the Deep Poisson Factor Analysis with three layers of hidden binary hierarchies. The directed binary hierarchy may be replaced by a deep Boltzmann machine.
Model Formulation
Poisson Factor Analysis (Zhou et al., 2012): We represent a discrete matrix X ∈ Z_+^{P×N} containing counts from N documents and P words as

X = Pois(Φ(Θ ◦ H^(1))).  (1)

- Each column of Φ, φ_k, encodes the relative importance of each word in topic k.
- Each column of Θ, θ_n, contains relative topic intensities specific to document n.
- Each column of H^(1), h_n^(1), defines a sparse set of topics associated with each document.
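The generative process in (1) can be sketched directly. A minimal NumPy simulation with toy sizes (all dimensions and hyperparameter values below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, K = 30, 8, 5  # words, documents, topics (toy sizes)

Phi = rng.dirichlet(np.ones(P), size=K).T             # P x K, columns are topics
Theta = rng.gamma(shape=1.0, scale=1.0, size=(K, N))  # K x N topic intensities
H = rng.integers(0, 2, size=(K, N))                   # K x N binary topic usage

rate = Phi @ (Theta * H)  # Poisson rate matrix, P x N
X = rng.poisson(rate)     # observed word counts

print(X.shape)  # (30, 8)
```

Note that a document whose column of H is all zeros has zero rate and hence all-zero counts, which is why the prior on H^(1) controls which topics a document can use.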
Model Formulation
Poisson Factor Analysis (Zhou et al., 2012): We construct PFAs by placing Dirichlet priors on φ_k and gamma priors on θ_n:

x_pn = Σ_{k=1}^K x_pnk,  x_pnk ∼ Pois(φ_pk θ_kn h_kn^(1)),  (2)

with priors specified as φ_k ∼ Dir(a_φ, ..., a_φ), θ_kn ∼ Gamma(r_k, p_n/(1 − p_n)), r_k ∼ Gamma(γ_0, 1/c_0), and γ_0 ∼ Gamma(e_0, 1/f_0).

Previously, a beta-Bernoulli process prior was defined on h_n^(1), assuming topic independence (Zhou and Carin, 2015). The novelty in our models comes from the prior for h_n^(1).
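Equation (2) represents each observed count x_pn as a sum of K latent Poisson counts; by Poisson superposition this is equivalent to drawing x_pn from a single Poisson with the summed rate, which is what makes (2) consistent with (1). A quick numerical check (the toy rates below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
rates = rng.gamma(1.0, 1.0, size=K)          # stand-ins for the rates φ_pk θ_kn h_kn
x_k = rng.poisson(rates, size=(100_000, K))  # latent counts x_pnk
x = x_k.sum(axis=1)                          # x_pn = Σ_k x_pnk

# Poisson superposition: x behaves like Pois(Σ_k rates),
# so its sample mean and variance both approach rates.sum().
print(abs(x.mean() - rates.sum()))  # small
print(abs(x.var() - rates.sum()))   # small
```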
Model Formulation
Structured Priors on the Latent Binary Matrix:

Assume h_n^(1) ∈ {0,1}^{K_1}; we define another hidden set of units h_n^(2) ∈ {0,1}^{K_2} placed at a layer "above" h_n^(1).

Modeling with the RBM (undirected):

−E(h_n^(1), h_n^(2)) = (h_n^(1))^T c^(1) + (h_n^(1))^T W^(1) h_n^(2) + (h_n^(2))^T c^(2).  (3)

Modeling with the SBN (Neal, 1992) (directed):

p(h_{k_2 n}^(2) = 1) = σ(c_{k_2}^(2)),  (4)
p(h_{k_1 n}^(1) = 1 | h_n^(2)) = σ((w_{k_1}^(1))^T h_n^(2) + c_{k_1}^(1)).  (5)
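Ancestral sampling through (4)–(5) is straightforward: sample the top layer from its prior, then the bottom layer given the top. A minimal sketch with toy layer sizes and random weights (all values below are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
K1, K2 = 6, 3  # layer sizes (toy)
W1 = rng.normal(0, 1, size=(K1, K2))
c1 = rng.normal(0, 1, size=K1)
c2 = rng.normal(0, 1, size=K2)

# Eq. (4): sample the top layer from its prior
h2 = (rng.random(K2) < sigmoid(c2)).astype(int)
# Eq. (5): sample the bottom layer given the top layer
p1 = sigmoid(W1 @ h2 + c1)
h1 = (rng.random(K1) < p1).astype(int)

print(h1, h2)  # two binary vectors
```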
Model Formulation
Going Deep? Add multiple layers of SBNs or RBMs.
Figure: Graphical model for the Deep Poisson Factor Analysis with three layers of hidden binary hierarchies. The directed binary hierarchy may be replaced by a deep Boltzmann machine.
Scalable Posterior Inference
Challenge: Designing scalable Bayesian inference algorithms.
Solution: Scaling up inference by stochastic algorithms.
- Applying the Bayesian conditional density filtering algorithm (Guhaniyogi et al., 2014).
- Extending recently proposed work on stochastic gradient thermostats (Ding et al., 2014).
Scalable Posterior Inference
Bayesian conditional density filtering (BCDF):
- Repeatedly update the surrogate conditional sufficient statistics (SCSS) using the current mini-batch.
- Draw samples from the conditional posterior distributions of model parameters, based on the SCSS.
- "Stochastic Gibbs-style" updates.

Input: text documents, i.e., a count matrix X.
Initialize Ψ_g^(0) randomly and set S_g^(0) to all zeros.
for t = 1 to ∞ do
  Get one mini-batch X^(t).
  Initialize Ψ_g^(t) = Ψ_g^(t−1) and S_g^(t) = S_g^(t−1).
  Initialize Ψ_l^(t) randomly.
  for s = 1 to S do
    Gibbs sampling for DPFA on X^(t).
    Collect samples Ψ_g^(1:S), Ψ_l^(1:S), and S_g^(1:S).
  end for
  Set Ψ_g^(t) = mean(Ψ_g^(1:S)) and S_g^(t) = mean(S_g^(1:S)).
end for

Notation: Ψ_g — global parameters; Ψ_l — local hidden variables; S_g — SCSS for Ψ_g.
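The propagate-SCSS-then-sample pattern above can be seen in a toy setting. A minimal sketch for a streaming beta-Bernoulli model (an illustrative stand-in for the pattern, not the DPFA conditionals; the coin-flip data and hyperparameters are assumptions): the SCSS are running head/tail counts, and the conditional posterior of the global parameter is Beta(a + heads, b + tails).

```python
import numpy as np

rng = np.random.default_rng(3)

a, b = 1.0, 1.0                # prior hyperparameters
data = rng.random(5000) < 0.7  # stream of coin flips, true p = 0.7

heads = tails = 0.0  # SCSS S_g
S = 10               # samples drawn per mini-batch
for batch in np.array_split(data, 50):  # 50 mini-batches
    heads += batch.sum()                # propagate the SCSS
    tails += batch.size - batch.sum()
    draws = rng.beta(a + heads, b + tails, size=S)  # "Gibbs" draws
    p_hat = draws.mean()                # Ψ_g^(t) = mean of the S samples

print(p_hat)  # close to 0.7 once the whole stream has been seen
```

Only the counts (heads, tails) are kept between mini-batches, never the raw data, which is what makes the memory footprint independent of the corpus size.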
Scalable Posterior Inference
Stochastic Gradient Nosé-Hoover Thermostats (SGNHT):
- Extends Hamiltonian Monte Carlo using stochastic gradients.
- Introduces a thermostat to maintain the system temperature.
- Adaptively absorbs stochastic-gradient noise.

The motion of the particles in the system is defined by the stochastic differential equations (SDEs)

dΨ_g = v dt,  dv = f(Ψ_g) dt − ξ v dt + √(2D) dW,  dξ = ((1/M) v^T v − 1) dt,  (6)

where Ψ_g ∈ R^M are the model parameters, v ∈ R^M are the momentum variables, f(Ψ_g) ≜ −∇_{Ψ_g} U(Ψ_g), and U(Ψ_g) is the negative log-posterior.
Scalable Posterior Inference
Extension: We extend the SGNHT by introducing multiple thermostat variables (ξ_1, ..., ξ_M) into the system, such that each ξ_i controls one degree of the particle momentum. The proposed SGNHT is defined by the following SDEs:

dΨ_g = v dt,  dv = f(Ψ_g) dt − Ξ v dt + √(2D) dW,  dΞ = (q − I) dt,  (7)

where Ξ = diag(ξ_1, ..., ξ_M) and q = diag(v_1^2, ..., v_M^2).

Theorem: The equilibrium distribution of the SDE system in (7) is

p(Ψ_g, v, Ξ) ∝ exp(−(1/2) v^T v − U(Ψ_g) − (1/2) tr{(Ξ − D)^T (Ξ − D)}).
Scalable Posterior Inference
Stochastic Gradient Nosé-Hoover Thermostats (SGNHT):

Input: text documents, i.e., a count matrix X.
Random initialization.
for t = 1 to ∞ do
  Ψ_g^(t+1) = Ψ_g^(t) + v^(t) h.
  v^(t+1) = v^(t) + f(Ψ_g^(t+1)) h − Ξ^(t) v^(t) h + √(2Dh) N(0, I).
  Ξ^(t+1) = Ξ^(t) + (q^(t+1) − I) h, where q = diag(v_1^2, ..., v_M^2).
end for

Discussion:
- BCDF: ease of implementation, but requires conditional densities for all the parameters.
- SGNHT: more general and robust, with fast convergence.
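To illustrate the update rule, a minimal sketch applying the SGNHT updates in the scalar case (M = 1) to sample from a standard normal, i.e., U(Ψ) = Ψ²/2 so f(Ψ) = −Ψ. The step size h and diffusion D are assumed toy values, and the exact gradient is used instead of a mini-batch stochastic gradient for clarity:

```python
import numpy as np

rng = np.random.default_rng(4)
h, D = 0.01, 1.0          # step size and injected-noise level (assumed)
psi, v, xi = 0.0, 0.0, D  # parameter, momentum, thermostat

samples = []
for t in range(100_000):
    psi += v * h                          # Ψ update
    grad = -psi                           # f(Ψ) = -∇U for U(Ψ) = Ψ²/2
    v += grad * h - xi * v * h + np.sqrt(2 * D * h) * rng.normal()
    xi += (v * v - 1.0) * h               # thermostat update
    if t >= 20_000:                       # discard burn-in
        samples.append(psi)

samples = np.array(samples)
print(samples.mean(), samples.var())  # should be near 0 and 1
```

The thermostat ξ drifts until the average kinetic energy v² matches 1, which is what lets the sampler absorb unknown gradient noise in the full stochastic-gradient setting.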
Experiments
Datasets:
- 20 Newsgroups: 20K documents with a vocabulary size of 2K.
- RCV1-v2: 800K documents with a vocabulary size of 10K.
- Wikipedia: 10M documents with a vocabulary size of 8K.
Experiments
Quantitative Evaluation:
Table: 20 Newsgroups.

MODEL       METHOD  DIM         PERP.
DPFA-SBN-t  GIBBS   128-64-32   827
DPFA-SBN    GIBBS   128-64-32   846
DPFA-SBN    SGNHT   128-64-32   846
DPFA-RBM    SGNHT   128-64-32   896
DPFA-SBN    BCDF    128-64-32   905
DPFA-SBN    GIBBS   128-64      851
DPFA-SBN    SGNHT   128-64      850
DPFA-RBM    SGNHT   128-64      893
DPFA-SBN    BCDF    128-64      896
LDA         GIBBS   128         893
NB-FTM      GIBBS   128         887
RSM         CD5     128         877
NHDP        SVB     (10,10,5)   889

Table: RCV1-v2 & Wikipedia.

MODEL       METHOD  DIM           RCV    WIKI
DPFA-SBN    SGNHT   1024-512-256  964    770
DPFA-SBN    SGNHT   512-256-128   1073   799
DPFA-SBN    SGNHT   128-64-32     1143   876
DPFA-RBM    SGNHT   128-64-32     920    942
DPFA-SBN    BCDF    128-64-32     1149   986
LDA         BCDF    128           1179   1059
NB-FTM      BCDF    128           1155   991
RSM         CD5     128           1171   1001
NHDP        SVB     (10,5,5)      1041   932
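The numbers above are held-out per-word perplexities (lower is better). As a reference for the quantity being reported, a generic sketch of the standard definition (the paper's exact held-out evaluation protocol is not shown on the slides, so this is only the textbook formula):

```python
import numpy as np

def perplexity(log_probs, counts):
    """Per-word perplexity: exp(-Σ_w n_w log p_w / Σ_w n_w).

    log_probs: model log-probability assigned to each held-out word type;
    counts: how often each type occurs in the held-out set.
    """
    return float(np.exp(-np.dot(counts, log_probs) / counts.sum()))

# Sanity check: a uniform model over a vocabulary of size V
# has perplexity exactly V.
V = 128
lp = np.full(V, -np.log(V))
cnt = np.ones(V)
print(round(perplexity(lp, cnt), 6))  # 128.0
```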
Experiments
Quantitative Evaluation:
[Plots: perplexity vs. iteration number (20 Newsgroups) and perplexity vs. number of documents seen (RCV1-v2, Wikipedia), comparing LDA, NB-FTM, DPFA-SBN (Gibbs), DPFA-SBN (BCDF), DPFA-SBN (SGNHT), and DPFA-RBM (SGNHT).]
Sensitivity Analysis:
[Plots: perplexity vs. number of documents seen for mini-batch sizes from 1 to 100 (Batch_1 through Batch_100), on each of the three corpora.]
Figure: Perplexities. (Left) 20 News. (Middle) RCV1-v2. (Right) Wikipedia.
Experiments
Topics we learned on 20 Newsgroups:
T1: year, hit, runs, good, season
T3: people, real, simply, world, things
T8: group, groups, reading, newsgroup, pro
T9: world, country, countries, germany, nazi
T10: evidence, claim, people, argument, agree
T14: game, games, win, cup, hockey
T15: israel, israeli, jews, arab, jewish
T19: software, modem, port, mac, serial
T21: files, file, ftp, program, format
T24: team, players, player, play, teams
T25: god, existence, exist, human, atheism
T26: fire, fbi, koresh, children, batf
T29: people, life, death, kill, killing
T40: wrong, doesn, jim, agree, quote
T41: image, program, application, widget, color
T43: boston, toronto, montreal, chicago, pittsburgh
T50: problem, work, problems, system, fine
T54: card, video, memory, mhz, bit
T55: windows, dos, file, win, ms
T64: turkish, armenian, armenians, turks, armenia
T65: truth, true, point, fact, body
T69: window, server, display, manager, client
T78: drive, disk, scsi, hard, drives
T81: makes, power, make, doesn, part
T91: question, answer, means, true, people
T94: code, mit, comp, unix, source
T112: children, father, child, mother, son
T118: people, make, person, things, feel
T120: men, women, man, hand, world
T126: sex, sexual, cramer, gay, homosexual
Experiments
Visualization: Sports, Computers, and Politics/Law.
Figure: Graphs induced by the correlation structure learned by DPFA-SBN for the 20 Newsgroups corpus.
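One plausible way such a correlation graph can be induced is to compute pairwise correlations of the binary topic-usage indicators across documents and keep edges above a threshold. This is a sketch under that assumption (the paper may instead derive the structure from the learned weights W^(1)); the synthetic usage matrix and threshold below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
K, N = 8, 500
H = rng.integers(0, 2, size=(K, N)).astype(float)  # binary topic usage
H[1] = H[0]  # force topics 0 and 1 to co-occur perfectly

C = np.corrcoef(H)           # K x K correlation of topic indicators
np.fill_diagonal(C, 0.0)     # ignore self-correlation
edges = np.argwhere(np.triu(C > 0.5))  # keep strongly correlated pairs

print(edges)  # contains the pair [0, 1]
```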
Summary
Model: Deep Poisson Factor Analysis
- PFA is employed to interact with data at the bottom layer.
- The deep SBN or RBM serves as a flexible prior for revealing topic structure.

Scalable Inference:
- Bayesian conditional density filtering.
- Stochastic gradient thermostats.
https://github.com/zhegan27/dpfa_icml2015
Questions?
References I
Blei, D. M., Griffiths, T., Jordan, M. I., and Tenenbaum, J. B. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. NIPS.

Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2014). Bayesian conditional density filtering. arXiv:1401.3632.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.

Hinton, G. E. and Salakhutdinov, R. (2011). Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science.

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence.

Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. (2015). Nested hierarchical Dirichlet processes. PAMI.
References II
Salakhutdinov, R. and Hinton, G. E. (2009a). Deep Boltzmann machines. AISTATS.

Salakhutdinov, R. and Hinton, G. E. (2009b). Replicated softmax: an undirected topic model. NIPS.

Srivastava, N., Salakhutdinov, R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines. UAI.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. JASA.

Williamson, S., Wang, C., Heller, K., and Blei, D. M. (2010). The IBP compound Dirichlet process and its application to focused topic modeling. ICML.

Zhou, M. and Carin, L. (2015). Negative binomial process count and mixture modeling. PAMI.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. AISTATS.