High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models
Chunyuan Li, Changyou Chen, Kai Fan and Lawrence Carin
Duke University, Durham NC 27708, USA
Introduction
Objective: Design simple and efficient Bayesian inference algorithms for deep learning models.
Main idea:
• Multivariate Stochastic Gradient Nosé-Hoover Thermostat (mSGNHT) with a more accurate numerical implementation.
• From the Euler integrator to the symmetric splitting integrator (SSI).
Illustrations
Double-well Potential Function: ρ(θ) ∝ exp(−U(θ)), U(θ) = (θ + 4)(θ + 1)(θ − 1)(θ − 3)/14 + 0.5
• mSGNHT-S consistently provides a better approximation of the true distribution.
• The significant gap at larger stepsizes shows that mSGNHT-S tolerates larger updates; a concrete sketch of the potential follows below.
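As a concrete reference for this illustration, here is a minimal Python sketch of the double-well potential above and its analytic gradient; the function names are ours, not part of the poster:

```python
import numpy as np

def U(theta):
    """Double-well potential; rho(theta) ∝ exp(-U(theta))."""
    return (theta + 4) * (theta + 1) * (theta - 1) * (theta - 3) / 14.0 + 0.5

def grad_U(theta):
    """Analytic gradient of U; the product expands to
    (theta^4 + theta^3 - 13 theta^2 - theta + 12) / 14 + 0.5."""
    return (4 * theta**3 + 3 * theta**2 - 26 * theta - 1) / 14.0
```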
[Plots: KL divergence (log scale) vs. stepsize ∈ [0.001, 0.3] for mSGNHT-E and mSGNHT-S; sample histograms vs. the true distribution at h = 0.001 and h = 0.2; the thermostat variable vs. #samples (×10⁵).]
Figure: Samples of ρ(θ) with SSI (1st column) and Euler integrator (2nd column), and the estimated thermostat variable over iterations (3rd column).
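The KL-divergence curve in the figure can be probed with a simple histogram-based estimate; a rough sketch, where the support, bin count, and smoothing constant are our own choices rather than the poster's:

```python
import numpy as np

def kl_to_true(samples, U, lo=-4.0, hi=4.0, bins=100):
    """Crude KL(empirical || true) over a fixed grid of bins."""
    edges = np.linspace(lo, hi, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    true = np.exp(-U(centers))
    true /= true.sum()                      # normalize the true density on the grid
    hist, _ = np.histogram(samples, bins=edges)
    emp = (hist + 1e-12) / (hist + 1e-12).sum()  # smoothed empirical distribution
    return float(np.sum(emp * np.log(emp / true)))
```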
Convolutional Neural Networks
[Plots: Test Accuracy (%) vs. Epochs for h = 0.0001 and h = 0.0005 (D = 50); mSGNHT-S vs. mSGNHT-E.]
Figure: Learning curves of CNN for different step sizes.
Algorithms
θ: model parameter,  p: momentum,  ξ: thermostat

Euler Integrator:
θ_{t+1} = θ_t + p_t h
p_{t+1} = p_t − ∇_θ U_t(θ_{t+1}) h − diag(ξ_t) p_t h + √(2D) ζ_{t+1}
ξ_{t+1} = ξ_t + (p_{t+1} ⊙ p_{t+1} − 1) h

Symmetric Splitting Integrator:
A: θ_{t+1/2} = θ_t + p_t h/2,   ξ_{t+1/2} = ξ_t + (p_t ⊙ p_t − 1) h/2  →
B: p_{t+1/3} = exp(−ξ_{t+1/2} h/2) ⊙ p_t  →
O: p_{t+2/3} = p_{t+1/3} − ∇_θ U_t(θ_{t+1/2}) h + √(2D) ζ_{t+1}  →
B: p_{t+1} = exp(−ξ_{t+1/2} h/2) ⊙ p_{t+2/3}  →
A: θ_{t+1} = θ_{t+1/2} + p_{t+1} h/2,   ξ_{t+1} = ξ_{t+1/2} + (p_{t+1} ⊙ p_{t+1} − 1) h/2

where h is the stepsize, D is the diffusion factor, ζ is Gaussian noise, and ⊙ denotes elementwise multiplication.
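To make the two update rules concrete, here is a minimal NumPy sketch of one step of each, kept deliberately close to the equations above; treating ζ as N(0, hI) is our reading of the usual discretization, and grad_U stands for any (stochastic) gradient oracle:

```python
import numpy as np

def msgnht_e_step(theta, p, xi, grad_U, h, D):
    """One Euler (mSGNHT-E) update, mirroring the equations above."""
    theta = theta + p * h
    zeta = np.sqrt(h) * np.random.randn(*p.shape)        # zeta ~ N(0, h I)
    p = p - grad_U(theta) * h - xi * p * h + np.sqrt(2.0 * D) * zeta
    xi = xi + (p * p - 1.0) * h
    return theta, p, xi

def msgnht_s_step(theta, p, xi, grad_U, h, D):
    """One symmetric splitting (mSGNHT-S) update: A-B-O-B-A."""
    # A: half step for theta and the thermostat xi
    theta = theta + p * h / 2
    xi = xi + (p * p - 1.0) * h / 2
    # B: friction half step, solved in closed form
    p = np.exp(-xi * h / 2) * p
    # O: full step on the (stochastic) gradient plus injected noise
    zeta = np.sqrt(h) * np.random.randn(*p.shape)        # zeta ~ N(0, h I)
    p = p - grad_U(theta) * h + np.sqrt(2.0 * D) * zeta
    # B: second friction half step
    p = np.exp(-xi * h / 2) * p
    # A: remaining half step for theta and xi
    theta = theta + p * h / 2
    xi = xi + (p * p - 1.0) * h / 2
    return theta, p, xi
```

Note that the B sub-steps solve the friction dynamics in closed form via exp(−ξh/2), one reason the splitting scheme tolerates larger stepsizes than the fully explicit Euler update.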
Theoretical Justification
Roles of numerical integrators: Under certain assumptions, the bias and MSE of an SG-MCMC algorithm with stepsize h and a Kth-order integrator are:

Bias:  |E φ̂ − φ̄| = B_bias + O(h^K)
MSE:  E(φ̂ − φ̄)² = B_mse + O(h^{2K}),

where B_bias and B_mse are functions depending on (h, T) but independent of K.

Remark 1 (Robustness): The bias and MSE of mSGNHT-S are bounded by B_bias + O(h²) and B_mse + O(h⁴), compared to B_bias + O(h) and B_mse + O(h²) for mSGNHT-E, respectively. This indicates that mSGNHT-S is more robust to the stepsize than mSGNHT-E.

Remark 2 (Convergence Rate): The higher the order of a numerical integrator, the faster its optimal convergence rate. The convergence rates in terms of bias are T^{−2/3} for mSGNHT-S and T^{−1/2} for mSGNHT-E, indicating that mSGNHT-S converges faster.

Remark 3 (Measure Accuracy): In the limit of infinite time (T → ∞), the terms B_bias and B_mse above vanish, leaving only the O(h^K) terms. This indicates that mSGNHT-S is an order of magnitude more accurate than mSGNHT-E.
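Remark 1 can be probed numerically: average a test function φ over the chain and compare against a reference value. A minimal sketch, where φ, its reference value phi_true, and the initialization are assumptions of ours (phi_true could come from quadrature against the true density):

```python
import numpy as np

def estimate_bias(step_fn, grad_U, phi, phi_true, h, D,
                  n_iter=100000, burn_in=10000):
    """Monte-Carlo probe of |E phi_hat - phi_bar|, the quantity
    bounded by B_bias + O(h^K) above, for a 1-D target."""
    theta, p, xi = np.zeros(1), np.zeros(1), np.ones(1)  # xi = 1 is a common choice
    total, n = 0.0, 0
    for t in range(n_iter):
        theta, p, xi = step_fn(theta, p, xi, grad_U, h, D)
        if t >= burn_in:
            total += phi(float(theta[0]))
            n += 1
    return abs(total / n - phi_true)
```

Running this with both step functions at several stepsizes should, per Remark 1, display the O(h²)-vs-O(h) gap between the two integrators.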
Advantages for Deep Learning
• It tolerates gradients of various magnitudes, offering a potential way to mitigate the vanishing/exploding-gradient problem.
• It can estimate model parameters faster and more accurately.
Experiments
I. Shallow Models: Latent Dirichlet Allocation and Logistic Regression
[Plots: Test Perplexity vs. Iterations (200–1000) on ICML for h = 0.01, 0.0025, 0.0005 (D = 0.75); mSGNHT-S, mSGNHT-E, SGHMC-S, SGHMC-E.]
Figure: Learning curves of LDA on the ICML dataset for different stepsizes h.
Table: LDA on ICML.
Method      Test Perplexity ↓
mSGNHT-S    939.67
mSGNHT-E    960.56
SGHMC-S     1004.73
SGHMC-E     1017.51
SGRLD       1154.68
Gibbs       907.84

Table: LR on a9a.
Method      Test Accuracy ↑
mSGNHT-S    84.95%
mSGNHT-E    84.72%
SGHMC-S     84.56%
SGHMC-E     84.51%
DSVI        84.80%
HFSGVI      84.84%
II. Deep Models: Deep Neural Networks
[Plots: Test Accuracy and Training Neg. Log-likelihood vs. Epochs for networks 100×100 and 100×100×100 at h = 1e-4, and 100×100×100×100 at h = 5e-5; mSGNHT-S vs. mSGNHT-E.]
Figure: Learning curves of FNN with different depths on the MNIST dataset. mSGNHT-E fails as learning progresses; it starts to work only because the stepsize is decreased by half.
Deep Poisson Factor Analysis
Perplexity on the Wikipedia dataset:
W ∼ Pois(Φ(Ψ ⊙ h^(1))),  h^(1) ∼ DSBN(h^(2), ···, h^(L))
A 3-layer Deep Sigmoid Belief Network (DSBN) is inferred by mSGNHT.
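A rough generative sketch of the observation layer above, with toy dimensions of our own choosing and the DSBN prior replaced by independent Bernoulli(0.5) draws purely for illustration (the actual model uses a 3-layer deep sigmoid belief network):

```python
import numpy as np

V, K, N = 1000, 50, 4   # vocabulary, topics, documents (our toy sizes)

Phi = np.random.dirichlet(np.ones(V), size=K).T   # V x K topic-word matrix
Psi = np.random.gamma(1.0, 1.0, size=(K, N))      # K x N topic intensities
h1 = (np.random.rand(K, N) < 0.5).astype(float)   # stand-in for the DSBN sample h^(1)
W = np.random.poisson(Phi @ (Psi * h1))           # V x N observed word counts
```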
[Plot: Perplexity vs. #Documents Seen (350K–500K) for LDA, NB-FTM, DPFA-SBN (BCDF), DPFA-RBM (mSGNHT), and mSGNHT-E/S at h = 1e-4 and h = 1e-5; inset zooms in on 490K–500K.]
Acknowledgements
This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.