

High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models
Chunyuan Li, Changyou Chen, Kai Fan and Lawrence Carin

Duke University, Durham NC 27708, USA

Introduction

Objective: Designing simple and efficient Bayesian inference algorithms for deep learning models.

Main idea
• Multivariate Stochastic Gradient Nosé-Hoover Thermostat (mSGNHT) with a more accurate numerical implementation.
• From the Euler integrator to the Symmetric Splitting integrator.

Illustrations

Double-well Potential Function
ρ(θ) ∝ exp(−U(θ)),  U(θ) = (θ + 4)(θ + 1)(θ − 1)(θ − 3)/14 + 0.5
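A minimal NumPy sketch (not from the poster) of this double-well target: the potential U, its analytic gradient, and the normalized density ρ that serves as the "true distribution" curve in the figure below.

```python
import numpy as np

# Double-well potential from the poster and its (unnormalized) density.
def U(theta):
    return (theta + 4) * (theta + 1) * (theta - 1) * (theta - 3) / 14.0 + 0.5

def grad_U(theta):
    # Analytic derivative of U(theta) = (theta^4 + theta^3 - 13 theta^2 - theta + 12)/14 + 0.5
    return (4 * theta**3 + 3 * theta**2 - 26 * theta - 1) / 14.0

# Normalize rho(theta) ∝ exp(-U(theta)) on a grid to obtain the reference density.
grid = np.linspace(-6, 6, 2001)
rho = np.exp(-U(grid))
rho /= np.trapz(rho, grid)
```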

• mSGNHT-S consistently provides a better approximation.

• The significant gap at larger stepsizes reveals that mSGNHT-S allows large updates.

[Figure panels: KL divergence (log scale, 10^0 to 10^3) between samples and the true distribution vs. stepsize (0.001 to 0.3) for mSGNHT-E and mSGNHT-S; sample histograms vs. the true distribution at h = 0.001 and h = 0.2 for both integrators; the thermostat variable ξ over the number of samples (×10^5).]

Figure: Samples of ρ(θ) with SSI (1st column) and Euler integrator (2nd column), and the estimated thermostat variable over iterations (3rd column).
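A plausible way to compute the KL-divergence curve shown in the figure (a sketch under my own binning choices, not the poster's evaluation code): bin the sampler's draws, normalize exp(−U) on the same support, and take a discretized KL divergence.

```python
import numpy as np

def kl_to_true(samples, U, lo=-6.0, hi=6.0, bins=100):
    """Discretized KL(empirical || true) for a 1-D target with potential U."""
    edges = np.linspace(lo, hi, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    p_hat, _ = np.histogram(samples, bins=edges, density=True)  # empirical density
    q = np.exp(-U(centers))
    q /= q.sum() * width                                        # normalized true density
    mask = p_hat > 0            # zero-mass bins contribute nothing to the sum
    return np.sum(p_hat[mask] * np.log(p_hat[mask] / q[mask])) * width
```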

Convolutional Neural Networks

[Figure panels: test accuracy (%) vs. epochs (5 to 40) at h = 0.0001, D = 50 and h = 0.0005, D = 50, comparing mSGNHT-E and mSGNHT-S; accuracy range roughly 97.5% to 99.5%.]

Figure: Learning curves of CNN for different step sizes.

Algorithms

θ: model parameter,  p: momentum,  ξ: thermostat

Euler Integrator
θ_{t+1} = θ_t + p_t h
p_{t+1} = p_t − ∇_θ Ũ_t(θ_{t+1}) h − diag(ξ_t) p_t h + √(2D) ζ_{t+1}
ξ_{t+1} = ξ_t + (p_{t+1} ⊙ p_{t+1} − 1) h
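A minimal NumPy sketch of one Euler-integrator update (not the authors' code; the argument names, the `grad_U_tilde` stochastic-gradient callable, and the ζ ~ N(0, hI) noise scaling are my assumptions):

```python
import numpy as np

def euler_step(theta, p, xi, grad_U_tilde, h, D, rng):
    """One mSGNHT-E update with the Euler integrator, following the poster."""
    theta = theta + p * h
    zeta = np.sqrt(h) * rng.standard_normal(p.shape)   # assumes zeta ~ N(0, h I)
    p = p - grad_U_tilde(theta) * h - xi * p * h + np.sqrt(2.0 * D) * zeta
    xi = xi + (p * p - 1.0) * h
    return theta, p, xi
```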

Symmetric Splitting Integrator
A: θ_{t+1/2} = θ_t + p_t h/2,   ξ_{t+1/2} = ξ_t + (p_t ⊙ p_t − 1) h/2  →
B: p_{t+1/3} = exp(−ξ_{t+1/2} h/2) ⊙ p_t  →
O: p_{t+2/3} = p_{t+1/3} − ∇_θ Ũ_t(θ_{t+1/2}) h + √(2D) ζ_{t+1}  →
B: p_{t+1} = exp(−ξ_{t+1/2} h/2) ⊙ p_{t+2/3}  →
A: θ_{t+1} = θ_{t+1/2} + p_{t+1} h/2,   ξ_{t+1} = ξ_{t+1/2} + (p_{t+1} ⊙ p_{t+1} − 1) h/2

where h is the stepsize, D is the diffusion factor, and ζ is Gaussian noise.
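The corresponding symmetric-splitting update, again as a hedged NumPy sketch with the same assumed names and noise scaling as the Euler sketch above:

```python
import numpy as np

def ssi_step(theta, p, xi, grad_U_tilde, h, D, rng):
    """One mSGNHT-S update (symmetric splitting, A-B-O-B-A), following the poster."""
    # A: half-steps for the parameter and the thermostat
    theta = theta + p * h / 2.0
    xi = xi + (p * p - 1.0) * h / 2.0
    # B: elementwise momentum damping by the thermostat
    p = np.exp(-xi * h / 2.0) * p
    # O: stochastic-gradient step plus injected noise
    zeta = np.sqrt(h) * rng.standard_normal(p.shape)    # assumes zeta ~ N(0, h I)
    p = p - grad_U_tilde(theta) * h + np.sqrt(2.0 * D) * zeta
    # B: second damping half-step
    p = np.exp(-xi * h / 2.0) * p
    # A: second half-steps for the parameter and the thermostat
    theta = theta + p * h / 2.0
    xi = xi + (p * p - 1.0) * h / 2.0
    return theta, p, xi
```

Iterating either step on the double-well toy (passing the earlier `grad_U` as `grad_U_tilde`, with some choice of D) reproduces the qualitative comparison in the illustration above.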

Theoretical Justification

Roles of numerical integrators: Under certain assumptions, the Bias and MSE of an SG-MCMC algorithm with stepsize h and a K-th order integrator are

Bias:  |E φ̂ − φ̄| = B_bias + O(h^K)
MSE:  E(φ̂ − φ̄)² = B_mse + O(h^{2K}),

where B_bias and B_mse are functions depending on (h, T) but independent of K.

Remark 1 (Robustness): The bias and MSE of mSGNHT-S are bounded as B_bias + O(h²) and B_mse + O(h⁴), compared to B_bias + O(h) and B_mse + O(h²) for mSGNHT-E, respectively. This indicates that mSGNHT-S is more robust to the stepsize than mSGNHT-E.

Remark 2 (Convergence Rate): The higher the order of a numerical integrator, the faster its optimal convergence rate. The convergence rates in terms of bias for mSGNHT-S and mSGNHT-E are T^{−2/3} and T^{−1/2}, respectively, indicating that mSGNHT-S converges faster.

Remark 3 (Measure Accuracy): In the limit of infinite time (T → ∞), the terms B_bias and B_mse above vanish, leaving only the O(h^K) terms. This indicates mSGNHT-S is an order of magnitude more accurate than mSGNHT-E.

Advantages for Deep Learning

• It tolerates gradients of various magnitudes, providing a potential solution to mitigate the vanishing/exploding gradients problem

• It can estimate model parameters faster and more accurately

Experiments

I. Shallow Models
Latent Dirichlet Allocation and Logistic Regression

[Figure panels: test perplexity vs. iterations (200 to 1000) on ICML with D = 0.75 at h = 0.01, h = 0.0025, and h = 0.0005, comparing mSGNHT-S, mSGNHT-E, SGHMC-S, and SGHMC-E; perplexity range roughly 900 to 1300.]

Figure: Learning curves of LDA on the ICML dataset for different stepsizes h.

Table: LDA on ICML.
Method      Test Perplexity ↓
mSGNHT-S    939.67
mSGNHT-E    960.56
SGHMC-S     1004.73
SGHMC-E     1017.51
SGRLD       1154.68
Gibbs       907.84

Table: LR on a9a.
Method      Test Accuracy ↑
mSGNHT-S    84.95%
mSGNHT-E    84.72%
SGHMC-S     84.56%
SGHMC-E     84.51%
DSVI        84.80%
HFSGVI      84.84%

II. Deep Models
Deep Neural Networks

[Figure panels: test accuracy (0.90 to 0.96) and training negative log-likelihood vs. epochs (10 to 40) on MNIST for FNNs of size 100×100 and 100×100×100 (h = 1e-4) and 100×100×100×100 (h = 5e-5), comparing mSGNHT-S and mSGNHT-E.]

Figure: Learning curves of FNNs of different depths on the MNIST dataset. mSGNHT-E fails as learning progresses; it only starts to work because the stepsize is decreased by half.

Deep Poisson Factor Analysis

Perplexity on Wikipedia dataset:

W ∼ Pois(Φ(Ψ ⊙ h^(1))),  h^(1) ∼ DSBN(h^{(2),...,(L)})

A 3-layer Deep Sigmoid Belief Network (DSBN) is inferred by mSGNHT.
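A rough, hypothetical sketch of a single forward draw from the DPFA likelihood above (all dimensions, the priors on Φ and Ψ, and the i.i.d. Bernoulli stand-in for the DSBN layer h^(1) are my assumptions for illustration, not the authors' model specification):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K_factors, N_docs = 1000, 128, 64   # vocab size, factors, documents (toy sizes)

Phi = rng.dirichlet(np.ones(V), size=K_factors).T    # V x K nonnegative loadings
Psi = rng.gamma(1.0, 1.0, size=(K_factors, N_docs))  # nonnegative factor scores
h1 = rng.binomial(1, 0.5, size=(K_factors, N_docs))  # stand-in for the DSBN draw

# W ~ Pois(Phi (Psi ⊙ h1)): gate factor scores elementwise, then draw Poisson counts.
rate = Phi @ (Psi * h1)
W = rng.poisson(rate)
```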

[Figure: perplexity (roughly 900 to 1100) vs. number of documents seen (350K to 500K) on Wikipedia, comparing LDA, NB-FTM, DPFA-SBN (BCDF), DPFA-RBM (mSGNHT), and mSGNHT-E / mSGNHT-S at h = 1e-4 and h = 1e-5, with an inset zooming into 490K-500K documents (perplexity 880 to 892).]

Acknowledgements

This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.