High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models
Chunyuan Li, Changyou Chen, Kai Fan and Lawrence Carin
Duke University, Durham NC 27708, USA
Introduction
Objective: Design simple and efficient Bayesian inference algorithms for deep learning models.
Main idea:
• Multivariate Stochastic Gradient Nosé-Hoover Thermostat (mSGNHT) with a more accurate numerical implementation.
• From the Euler integrator to the symmetric splitting integrator (SSI).
Illustrations
Double-well Potential Function: ρ(θ) ∝ exp(−U(θ)), U(θ) = (θ + 4)(θ + 1)(θ − 1)(θ − 3)/14 + 0.5
• mSGNHT-S consistently provides a better approximation of the true distribution.
• The significant gap at larger stepsizes shows that mSGNHT-S tolerates larger updates; a concrete sketch of the potential follows below.
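As a concrete reference for this illustration, here is a minimal Python sketch of the double-well potential above and its analytic gradient; the function names are ours, not part of the poster:

```python
import numpy as np

def U(theta):
    """Double-well potential; rho(theta) ∝ exp(-U(theta))."""
    return (theta + 4) * (theta + 1) * (theta - 1) * (theta - 3) / 14.0 + 0.5

def grad_U(theta):
    """Analytic gradient of U; the product expands to
    (theta^4 + theta^3 - 13 theta^2 - theta + 12) / 14 + 0.5."""
    return (4 * theta**3 + 3 * theta**2 - 26 * theta - 1) / 14.0
```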
[Plots: KL divergence (log scale) vs. stepsize ∈ [0.001, 0.3] for mSGNHT-E and mSGNHT-S; sample histograms vs. the true distribution at h = 0.001 and h = 0.2; the thermostat variable vs. #samples (×10⁵).]
Figure: Samples of ρ(θ) with SSI (1st column) and Euler integrator (2nd column), and the estimated thermostat variable over iterations (3rd column).
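The KL-divergence curve in the figure can be probed with a simple histogram-based estimate; a rough sketch, where the support, bin count, and smoothing constant are our own choices rather than the poster's:

```python
import numpy as np

def kl_to_true(samples, U, lo=-4.0, hi=4.0, bins=100):
    """Crude KL(empirical || true) over a fixed grid of bins."""
    edges = np.linspace(lo, hi, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    true = np.exp(-U(centers))
    true /= true.sum()                      # normalize the true density on the grid
    hist, _ = np.histogram(samples, bins=edges)
    emp = (hist + 1e-12) / (hist + 1e-12).sum()  # smoothed empirical distribution
    return float(np.sum(emp * np.log(emp / true)))
```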
Convolutional Neural Networks
[Plots: Test Accuracy (%) vs. Epochs for h = 0.0001 and h = 0.0005 (D = 50); mSGNHT-S vs. mSGNHT-E.]
Figure: Learning curves of CNN for different step sizes.
Algorithms
θ: model parameter,  p: momentum,  ξ: thermostat

Euler Integrator:
θ_{t+1} = θ_t + p_t h
p_{t+1} = p_t − ∇_θ U_t(θ_{t+1}) h − diag(ξ_t) p_t h + √(2D) ζ_{t+1}
ξ_{t+1} = ξ_t + (p_{t+1} ⊙ p_{t+1} − 1) h

Symmetric Splitting Integrator:
A: θ_{t+1/2} = θ_t + p_t h/2,   ξ_{t+1/2} = ξ_t + (p_t ⊙ p_t − 1) h/2  →
B: p_{t+1/3} = exp(−ξ_{t+1/2} h/2) ⊙ p_t  →
O: p_{t+2/3} = p_{t+1/3} − ∇_θ U_t(θ_{t+1/2}) h + √(2D) ζ_{t+1}  →
B: p_{t+1} = exp(−ξ_{t+1/2} h/2) ⊙ p_{t+2/3}  →
A: θ_{t+1} = θ_{t+1/2} + p_{t+1} h/2,   ξ_{t+1} = ξ_{t+1/2} + (p_{t+1} ⊙ p_{t+1} − 1) h/2

where h is the stepsize, D is the diffusion factor, ζ is Gaussian noise, and ⊙ denotes elementwise multiplication.
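To make the two update rules concrete, here is a minimal NumPy sketch of one step of each, kept deliberately close to the equations above; treating ζ as N(0, hI) is our reading of the usual discretization, and grad_U stands for any (stochastic) gradient oracle:

```python
import numpy as np

def msgnht_e_step(theta, p, xi, grad_U, h, D):
    """One Euler (mSGNHT-E) update, mirroring the equations above."""
    theta = theta + p * h
    zeta = np.sqrt(h) * np.random.randn(*p.shape)        # zeta ~ N(0, h I)
    p = p - grad_U(theta) * h - xi * p * h + np.sqrt(2.0 * D) * zeta
    xi = xi + (p * p - 1.0) * h
    return theta, p, xi

def msgnht_s_step(theta, p, xi, grad_U, h, D):
    """One symmetric splitting (mSGNHT-S) update: A-B-O-B-A."""
    # A: half step for theta and the thermostat xi
    theta = theta + p * h / 2
    xi = xi + (p * p - 1.0) * h / 2
    # B: friction half step, solved in closed form
    p = np.exp(-xi * h / 2) * p
    # O: full step on the (stochastic) gradient plus injected noise
    zeta = np.sqrt(h) * np.random.randn(*p.shape)        # zeta ~ N(0, h I)
    p = p - grad_U(theta) * h + np.sqrt(2.0 * D) * zeta
    # B: second friction half step
    p = np.exp(-xi * h / 2) * p
    # A: remaining half step for theta and xi
    theta = theta + p * h / 2
    xi = xi + (p * p - 1.0) * h / 2
    return theta, p, xi
```

Note that the B sub-steps solve the friction dynamics in closed form via exp(−ξh/2), one reason the splitting scheme tolerates larger stepsizes than the fully explicit Euler update.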
Theoretical Justification
Roles of numerical integrators: Under certain assumptions, the bias and MSE of an SG-MCMC algorithm with stepsize h and a Kth-order integrator are:

Bias:  |E φ̂ − φ̄| = B_bias + O(h^K)
MSE:  E(φ̂ − φ̄)² = B_mse + O(h^{2K}),

where B_bias and B_mse are functions depending on (h, T) but independent of K.

Remark 1 (Robustness): The bias and MSE of mSGNHT-S are bounded by B_bias + O(h²) and B_mse + O(h⁴), compared to B_bias + O(h) and B_mse + O(h²) for mSGNHT-E, respectively. This indicates that mSGNHT-S is more robust to the stepsize than mSGNHT-E.

Remark 2 (Convergence Rate): The higher the order of a numerical integrator, the faster its optimal convergence rate. The convergence rates in terms of bias are T^{−2/3} for mSGNHT-S and T^{−1/2} for mSGNHT-E, indicating that mSGNHT-S converges faster.

Remark 3 (Measure Accuracy): In the limit of infinite time (T → ∞), the terms B_bias and B_mse above vanish, leaving only the O(h^K) terms. This indicates that mSGNHT-S is an order of magnitude more accurate than mSGNHT-E.
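Remark 1 can be probed numerically: average a test function φ over the chain and compare against a reference value. A minimal sketch, where φ, its reference value phi_true, and the initialization are assumptions of ours (phi_true could come from quadrature against the true density):

```python
import numpy as np

def estimate_bias(step_fn, grad_U, phi, phi_true, h, D,
                  n_iter=100000, burn_in=10000):
    """Monte-Carlo probe of |E phi_hat - phi_bar|, the quantity
    bounded by B_bias + O(h^K) above, for a 1-D target."""
    theta, p, xi = np.zeros(1), np.zeros(1), np.ones(1)  # xi = 1 is a common choice
    total, n = 0.0, 0
    for t in range(n_iter):
        theta, p, xi = step_fn(theta, p, xi, grad_U, h, D)
        if t >= burn_in:
            total += phi(float(theta[0]))
            n += 1
    return abs(total / n - phi_true)
```

Running this with both step functions at several stepsizes should, per Remark 1, display the O(h²)-vs-O(h) gap between the two integrators.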
Advantages for Deep Learning
• It tolerates gradients of various magnitudes, offering a potential way to mitigate the vanishing/exploding-gradient problem.
• It can estimate model parameters faster and more accurately.
Experiments
I. Shallow Models: Latent Dirichlet Allocation and Logistic Regression
[Plots: Test Perplexity vs. Iterations (200–1000) on ICML for h = 0.01, 0.0025, 0.0005 (D = 0.75); mSGNHT-S, mSGNHT-E, SGHMC-S, SGHMC-E.]
Figure: Learning curves of LDA on the ICML dataset for different stepsizes h.
Table: LDA on ICML.
Method      Test Perplexity ↓
mSGNHT-S    939.67
mSGNHT-E    960.56
SGHMC-S     1004.73
SGHMC-E     1017.51
SGRLD       1154.68
Gibbs       907.84

Table: LR on a9a.
Method      Test Accuracy ↑
mSGNHT-S    84.95%
mSGNHT-E    84.72%
SGHMC-S     84.56%
SGHMC-E     84.51%
DSVI        84.80%
HFSGVI      84.84%
II. Deep Models: Deep Neural Networks
[Plots: Test Accuracy and Training Neg. Log-likelihood vs. Epochs for networks 100×100 and 100×100×100 at h = 1e-4, and 100×100×100×100 at h = 5e-5; mSGNHT-S vs. mSGNHT-E.]
Figure: Learning curves of FNN with different depths on the MNIST dataset. mSGNHT-E fails as learning progresses; it starts to work only because the stepsize is decreased by half.
Deep Poisson Factor Analysis
Perplexity on the Wikipedia dataset:
W ∼ Pois(Φ(Ψ ⊙ h^(1))),  h^(1) ∼ DSBN(h^(2), ···, h^(L))
A 3-layer Deep Sigmoid Belief Network (DSBN) is inferred by mSGNHT.
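A rough generative sketch of the observation layer above, with toy dimensions of our own choosing and the DSBN prior replaced by independent Bernoulli(0.5) draws purely for illustration (the actual model uses a 3-layer deep sigmoid belief network):

```python
import numpy as np

V, K, N = 1000, 50, 4   # vocabulary, topics, documents (our toy sizes)

Phi = np.random.dirichlet(np.ones(V), size=K).T   # V x K topic-word matrix
Psi = np.random.gamma(1.0, 1.0, size=(K, N))      # K x N topic intensities
h1 = (np.random.rand(K, N) < 0.5).astype(float)   # stand-in for the DSBN sample h^(1)
W = np.random.poisson(Phi @ (Psi * h1))           # V x N observed word counts
```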
[Plot: Perplexity vs. #Documents Seen (350K–500K) for LDA, NB-FTM, DPFA-SBN (BCDF), DPFA-RBM (mSGNHT), and mSGNHT-E/S at h = 1e-4 and h = 1e-5; inset zooms in on 490K–500K.]
Acknowledgements
This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.