
A picture of the energy landscape of deep neural networks

Pratik Chaudhari

December 15, 2017


UCLA VISION LAB


What is the shape?

How should it inform the optimization?

$y(x; w) = \sigma\!\left(w_p\, \sigma(w_{p-1}(\ldots \sigma(w_1 x))\ldots)\right)$

$w^* = \arg\min_w \; \underbrace{\mathbb{E}_{(x,y) \in D}\left[ -\sum_{i=1}^{k} \mathbf{1}_{\{y=i\}} \log y_i(x; w) \right]}_{=:\, f(w)}$

(S)GD

Many, many variants: AdaGrad, SVRG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

$w_{k+1} = w_k - \eta\, \nabla f_b(w_k)$

“variance reduction”
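To make the objective and the update concrete, here is a minimal numpy sketch of mini-batch SGD on the cross-entropy loss $f(w)$ (not from the talk; the toy data, the single-layer softmax model standing in for the deep composition $\sigma(w_p\,\sigma(\cdots))$, and the hyper-parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy k-class classification data (illustrative only).
n, d, k = 1000, 20, 5
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)

def f_batch(w, idx):
    """Cross-entropy loss -sum_i 1{y=i} log y_i(x; w) on a mini-batch."""
    W = w.reshape(d, k)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(idx)), y[idx]]).mean()

def grad_f_batch(w, idx):
    """Gradient of the mini-batch loss (closed form for softmax regression)."""
    W = w.reshape(d, k)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p[np.arange(len(idx)), y[idx]] -= 1.0
    return (X[idx].T @ p / len(idx)).ravel()

# SGD: w_{k+1} = w_k - eta * grad f_b(w_k)
w, eta, b = np.zeros(d * k), 0.1, 32
for step in range(2000):
    idx = rng.integers(0, n, size=b)
    w -= eta * grad_f_batch(w, idx)

print("final training loss:", f_batch(w, np.arange(n)))
```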

‣ SGD (Lee et al., '16): mere GD always finds local minima
‣ Ge et al., '15; Sun et al., '15, '16: can escape strict saddle points
‣ Saxe et al., '14: orthogonal initializations
‣ Variance reduction: Schmidt et al., '13; Defazio et al., '14
‣ AdaGrad: Duchi et al., '10

What do we know about the energy landscape?


‣ Multiple, equivalent minima for linear networks: PCA (Baldi & Hornik, '89), deep GPs (Duvenaud et al., '14), piecewise-linear networks (Soudry & Carmon, '16), matrix/tensor factorization (Haeffele & Vidal, '15)

‣ Generalization (our work; Hardt et al., '15; Goodfellow & Vinyals, '15): can easily get zero training error, yet generalize poorly

‣ Paradox between a "benign" energy landscape and delicate training algorithms

‣ Empirical results (Dauphin et al., '14): all local minima are close to the global minimum; saddle points slow down SGD

‣ Statistical physics (Bray & Dean, '07; Fyodorov & Williams, '07; Choromanska et al., '15; Chaudhari & Soatto, '15): Gaussian random fields and spin glasses have many descending directions at high energy; binary perceptron (Baldassi et al., '15)

Motivation from the Hessian


[Figure: two histograms of the Hessian eigenvalues at the end of training (x-axis: eigenvalues, y-axis: frequency on a log scale). Positive eigenvalues extend to roughly 40, while negative eigenvalues only reach about −0.5: a short negative tail.]
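As an aside, a minimal sketch of how such a spectrum can be computed for a tiny network (not from the talk; the two-layer tanh network, the random data, the random initialization in place of a trained model, and the finite-difference Hessian are illustrative assumptions that are only feasible because the parameter count is tiny):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem and a two-layer tanh network.
X = rng.normal(size=(64, 3))
y = rng.normal(size=(64, 1))
d_in, d_h, d_out = 3, 8, 1
n_params = d_in * d_h + d_h * d_out

def loss(w):
    W1 = w[:d_in * d_h].reshape(d_in, d_h)
    W2 = w[d_in * d_h:].reshape(d_h, d_out)
    pred = np.tanh(X @ W1) @ W2
    return 0.5 * np.mean((pred - y) ** 2)

def hessian_fd(w, eps=1e-4):
    """Dense Hessian by central finite differences (O(n^2) loss evaluations)."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps**2)
    return 0.5 * (H + H.T)   # symmetrize

# Hessian at a random initialization (the talk's histograms are at the end of training).
w = 0.1 * rng.normal(size=n_params)
eigs = np.linalg.eigvalsh(hessian_fd(w))
print("most negative eigenvalue:", eigs[0])
print("most positive eigenvalue:", eigs[-1])
```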

How do we exploit this?


Magnify the energy landscape and smooth with a kernel

$w^* = \arg\min_w f(w) = \arg\max_w e^{-f(w)} \approx \arg\min_w \left\{ -\log\left(G_\gamma * e^{-f}\right)(w) \right\}$

where $G_\gamma$ is a Gaussian kernel of variance $\gamma$; the smoothing focuses on the neighborhood of $w$.

[Figure: a one-dimensional example showing the original loss f(x) and two smoothed versions F(x, 10^3) and F(x, 2 × 10^4), together with the minimizer x̂_f of f, the minimizer x̂_F of F, and a candidate point x_candidate.]
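A minimal 1-D sketch of this smoothing (not from the talk; the double-well toy loss and the kernel variances are illustrative assumptions): with a small kernel the smoothed minimizer stays at the narrow global minimum, while a wide kernel moves it into the wide valley.

```python
import numpy as np

# A 1-D loss with a narrow, deep minimum near x = 2 and a wide, shallower one near x = -2.
x = np.linspace(-4.0, 4.0, 4001)
f = 0.05 * (x + 2.0) ** 2 + 1.0 - 2.0 * np.exp(-50.0 * (x - 2.0) ** 2)

def smoothed(f, x, gamma):
    """F(x) = -log( G_gamma * exp(-f) ), with a Gaussian kernel of variance gamma."""
    dx = x[1] - x[0]
    F = np.empty_like(f)
    for i, xi in enumerate(x):
        kernel = np.exp(-(x - xi) ** 2 / (2.0 * gamma))
        F[i] = -np.log(np.sum(kernel * np.exp(-f)) * dx)
    return F

for gamma in (0.01, 0.5):
    F = smoothed(f, x, gamma)
    print(f"gamma={gamma}: argmin f = {x[np.argmin(f)]:+.2f}, "
          f"argmin F = {x[np.argmin(F)]:+.2f}")
```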

A physics interpretation


[Figure: the original global minimum of f versus the new global minimum of the smoothed loss.]

Our new loss is "local entropy" (Baldassi et al., '15, '16)

$f_\gamma(w) = -\log \int_{w'} \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w - w'\|^2\right) dw'$

Minimizing local entropy


‣ Solve
$w^* = \arg\min_w f_\gamma(w)$

‣ Gradient of local entropy
$\nabla f_\gamma(w) = \gamma^{-1}\left(w - \langle w' \rangle\right)$
where $\langle \cdot \rangle$ denotes an expectation over a local Gibbs distribution:
$\langle w' \rangle = Z(w, \gamma)^{-1} \int_{w'} w' \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w - w'\|^2\right) dw'$

‣ Estimate the gradient using MCMC; this can be applied to general deep networks
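A rough rendering of this scheme in the spirit of Entropy-SGD (not the authors' code; the quadratic toy loss, the SGLD step size, the number of inner Langevin steps, and the averaging weights are illustrative assumptions): the inner loop runs Langevin MCMC on the local Gibbs distribution to estimate $\langle w' \rangle$, and the outer loop steps along $-\nabla f_\gamma(w) = -\gamma^{-1}(w - \langle w' \rangle)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative loss gradient (a stand-in for a deep network's grad f).
def grad_f(w):
    return w - np.array([1.0, -2.0])     # f(w) = 0.5 * ||w - (1, -2)||^2

def entropy_sgd_step(w, gamma=0.1, eta=0.5, inner_steps=20,
                     langevin_lr=0.05, beta_inv=1e-3):
    """One outer step: estimate <w'> with SGLD, then move along -grad f_gamma(w)."""
    w_prime = w.copy()
    mean_w_prime = w.copy()
    for _ in range(inner_steps):
        # Langevin dynamics on the local Gibbs distribution
        #   rho(w') ~ exp(-f(w') - ||w - w'||^2 / (2 gamma))
        g = grad_f(w_prime) + (w_prime - w) / gamma
        noise = np.sqrt(2 * langevin_lr * beta_inv) * rng.normal(size=w.shape)
        w_prime = w_prime - langevin_lr * g + noise
        # Running (exponential) average of the iterates as the estimate of <w'>.
        mean_w_prime = 0.75 * mean_w_prime + 0.25 * w_prime
    # Outer update: w <- w - eta * gamma^{-1} (w - <w'>)
    return w - eta * (w - mean_w_prime) / gamma

w = np.zeros(2)
for k in range(50):
    w = entropy_sgd_step(w)
print("final w:", w)            # should approach the minimizer (1, -2)
```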

Medium-scale CNN


‣ All-CNN-BN on CIFAR-10

‣ Do not see much plateauing of training or validation loss

[Figure: validation error (left, final values 7.71% and 7.81%) and cross-entropy loss (right, final values 0.0353 and 0.0336) versus epochs × L for SGD and Entropy-SGD with All-CNN-BN on CIFAR-10; a companion comparison uses wall-clock time on the x-axis.]


A PDE interpretation

‣ Local entropy is the solution of a Hamilton-Jacobi equation

original loss as the initial condition

$u_t = -\frac{1}{2}\,\|\nabla u\|^2 + \frac{1}{2\beta}\,\Delta u, \qquad u(w, 0) = f(w)$

‣ Stochastic control interpretation

quadratic penalty for greedy gradient descent

$dw = -\alpha(s)\, ds + dB(s), \quad t \le s \le T, \qquad w(t) = w$

$C(w(\cdot), \alpha(\cdot)) = \mathbb{E}\left[ f(w(T)) + \frac{1}{2} \int_t^T \|\alpha(s)\|^2\, ds \right]$

$u(w, t) = \min_{\alpha(\cdot)} C(w(\cdot), \alpha(\cdot)), \qquad \alpha(w, t) = \nabla u(w, t)$

$f_\gamma(w) = u(w, \gamma)$
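A minimal numerical sketch of the viscous HJ smoothing in one dimension (not from the talk; the explicit finite-difference scheme, the toy initial condition, and the choices of $\beta^{-1}$, $T$, and the grid are illustrative assumptions, with no claim that this is the most stable discretization):

```python
import numpy as np

# Grid and toy initial condition u(x, 0) = f(x).
x = np.linspace(-4.0, 4.0, 801)
dx = x[1] - x[0]
u = 0.05 * (x + 2.0) ** 2 + 1.0 - 2.0 * np.exp(-50.0 * (x - 2.0) ** 2)
f0 = u.copy()

beta_inv = 0.1          # viscosity coefficient beta^{-1}
T, dt = 0.5, 1e-4       # final time and (small) explicit time step

for _ in range(int(T / dt)):
    ux = np.gradient(u, dx)                                 # central first derivative
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # central second derivative
    uxx[0] = uxx[-1] = 0.0                                  # crude boundary handling
    # Viscous Hamilton-Jacobi: u_t = -1/2 |u_x|^2 + (1/(2 beta)) u_xx
    u = u + dt * (-0.5 * ux**2 + 0.5 * beta_inv * uxx)

print("argmin of f:       ", x[np.argmin(f0)])
print("argmin of u(., T): ", x[np.argmin(u)])
```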

New PDEs


‣ Use the non-viscous HJ equation
$u_t = -\frac{1}{2}\,\|\nabla u\|^2$
(the viscous term $\frac{1}{2\beta}\,\Delta u$ is dropped)

‣ The Hopf-Lax formula gives the solution, the "inf-convolution" or Moreau envelope
$u(w, t) = \inf_{w'} \left\{ f(w') + \frac{1}{2t}\,\|w - w'\|^2 \right\}$

‣ This has a few magical properties…

‣ Simple formula for the gradient (proximal point iteration)
$p^* = \nabla f(w - t\, p^*), \qquad \nabla u(w, t) = p^*$
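A minimal sketch of that fixed-point iteration (not from the talk; the 1-D toy loss, the iteration count, and the brute-force check against a finite-difference gradient of the Moreau envelope are illustrative assumptions):

```python
import numpy as np

# Toy 1-D loss and its gradient.
def f(w):  return 0.5 * w**2 + np.cos(w)
def df(w): return w - np.sin(w)

def grad_moreau(w, t, iters=100):
    """Fixed point p* = grad f(w - t p*); then grad u(w, t) = p*."""
    p = df(w)
    for _ in range(iters):
        p = df(w - t * p)
    return p

def moreau(w, t):
    """u(w, t) = inf_{w'} f(w') + |w - w'|^2 / (2 t), by brute-force grid search."""
    wp = np.linspace(w - 10.0, w + 10.0, 200001)
    return np.min(f(wp) + (w - wp) ** 2 / (2.0 * t))

w, t, h = 1.3, 0.2, 1e-4
print("fixed-point gradient:   ", grad_moreau(w, t))
print("finite-difference of u: ", (moreau(w + h, t) - moreau(w - h, t)) / (2 * h))
```

The fixed-point map contracts when $t$ times the Lipschitz constant of $\nabla f$ is below one, which is why a small $t$ and a smooth toy loss are used in this sketch.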

Smoothing using PDEs


[Figure: a one-dimensional loss f(x) together with its viscous-HJ smoothing u_viscous HJ(x, T) and non-viscous-HJ smoothing u_non-viscous HJ(x, T); the initial density of iterates (where SGD gets stuck) is compared with the final densities under the non-viscous and viscous HJ smoothings.]

Distributed training algorithms


‣ A continuous-time view of local entropy

$dw = -\gamma^{-1}\,(w - w')\, ds \qquad \text{(slow variable)}$

$dw' = -\frac{1}{\epsilon}\left[\nabla f(w') + \frac{1}{\gamma}\,(w' - w)\right] ds + \frac{1}{\sqrt{\epsilon}}\, dB(s) \qquad \text{(fast variable)}$

‣ Elastic-SGD (Zhang et al., '15)

$\arg\min_{w,\, w'_1, \ldots, w'_p} \; \sum_{k=1}^{p} \left[ f(w'_k) + \frac{1}{2\gamma}\,\|w'_k - w\|^2 \right]$

"Homogenized" dynamics as $\epsilon \to 0$

‣ If w’(s) is very fast, w(s) only sees its average

$dw = -\gamma^{-1}\left(w - \langle w' \rangle\right) ds$

$\rho^{\infty}(w'; w) \propto \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w' - w\|^2\right)$

$\nabla f_\gamma(w) = \gamma^{-1}\left(w - \langle w' \rangle\right)$
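A minimal single-process simulation of the Elastic-SGD coupling (a rough sketch, not Zhang et al.'s implementation; the noisy quadratic losses standing in for per-worker mini-batch gradients, the number of workers, and the step sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

p, dim = 8, 10                      # number of workers, parameter dimension
target = rng.normal(size=dim)       # common minimizer of the toy losses

def grad_f(w, k):
    """Stand-in for worker k's noisy mini-batch gradient (k unused in this toy)."""
    return (w - target) + 0.1 * rng.normal(size=dim)

gamma, eta = 0.1, 0.05
w = np.zeros(dim)                   # "center" variable
w_prime = np.zeros((p, dim))        # per-worker variables

for step in range(500):
    for k in range(p):
        # Worker update: gradient of f(w'_k) + ||w'_k - w||^2 / (2 gamma).
        g = grad_f(w_prime[k], k) + (w_prime[k] - w) / gamma
        w_prime[k] -= eta * g
    # Center update: pulled toward the average of the workers.
    w -= eta * (w - w_prime.mean(axis=0)) / gamma

print("distance of center to target:", np.linalg.norm(w - target))
```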


Wide-ResNet on CIFAR-10 / 100


WTH is implicit regularization?


Why is SGD so special?

‣ Fokker-Planck equation and optimal transportation

$\rho^* = \arg\min_\rho \left\{ \int \Phi(w)\, \rho(w)\, dw + \beta^{-1} \int \rho \log \rho\, dw \right\}$

Information bottleneck, Bayesian inference, large batch-sizes, sampling techniques, hyper-parameter choices, neural architecture search, ….

Many, many variants: AdaGrad, SVRG, SAG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

‣ Stochastic differential equation

$dw = -\nabla f(w)\, dt + \sqrt{2\beta^{-1} D(w)}\, dB(t), \qquad \beta^{-1} = \frac{\eta}{2b}$

($\eta$: learning rate, $b$: mini-batch size)
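A minimal check of this continuous-time view in one dimension with isotropic noise (not from the talk; the quadratic loss, the values of $\eta$ and $b$, and the Euler-Maruyama discretization are illustrative assumptions): the long-run variance of the iterates should match the Gibbs prediction $\rho \propto e^{-\beta f}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(w):
    return w                      # f(w) = w^2 / 2

eta, b = 0.1, 32                  # learning rate and batch size
beta_inv = eta / (2 * b)          # beta^{-1} = eta / (2 b)

# Euler-Maruyama simulation of dw = -grad f(w) dt + sqrt(2 beta^{-1}) dB(t).
dt, n_steps, burn_in = 0.01, 500_000, 50_000
w, samples = 0.0, []
for step in range(n_steps):
    w += -grad_f(w) * dt + np.sqrt(2 * beta_inv * dt) * rng.normal()
    if step >= burn_in:
        samples.append(w)

print("empirical variance:", np.var(samples))
print("Gibbs prediction:  ", beta_inv)       # Var = beta^{-1} for f(w) = w^2 / 2
```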


SGD does not minimize $f(w)$; what is $\Phi(w)$?

‣ Deep networks induce highly non-isotropic noise

CIFAR-10: $\lambda(D) = 0.27 \pm 0.84$, $\operatorname{rank}(D) = 0.34\%$

‣ Leads to deterministic, Hamiltonian dynamics in SGD
$\dot{x} = j(x), \qquad \operatorname{div} j(x) = 0$

Most likely trajectories of SGD are closed loops

[Figure: closed-loop trajectories near a saddle point.]

‣ Deep networks have out-of-equilibrium distribution

$\rho^*(w) \not\propto e^{-\beta f(w)}, \qquad \rho^*(w) \propto e^{-\beta\, \Phi(w)}$
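A minimal toy illustration of this point (not from the talk; the two-dimensional quadratic loss and the constant anisotropic diffusion $D = \mathrm{diag}(1, 4)$ are illustrative assumptions): with anisotropic noise the stationary variances become $\beta^{-1}$ and $4\beta^{-1}$, i.e. the iterates sample $e^{-\beta \Phi}$ for a modified potential $\Phi \neq f$.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_inv = 0.01
D = np.diag([1.0, 4.0])            # constant, anisotropic diffusion matrix
sqrtD = np.sqrt(D)                 # matrix square root (valid because D is diagonal)

def grad_f(x):
    return x                       # f(x) = ||x||^2 / 2

# Euler-Maruyama for dx = -grad f(x) dt + sqrt(2 beta^{-1} D) dB(t).
dt, n_steps, burn_in = 0.01, 400_000, 40_000
x, samples = np.zeros(2), []
for step in range(n_steps):
    noise = sqrtD @ rng.normal(size=2)
    x += -grad_f(x) * dt + np.sqrt(2 * beta_inv * dt) * noise
    if step >= burn_in:
        samples.append(x.copy())

var = np.var(np.array(samples), axis=0)
print("empirical variances:  ", var)
print("e^{-beta f} predicts: ", [beta_inv, beta_inv])
print("actual stationary law:", [beta_inv, 4 * beta_inv])
```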


Summary

‣ Techniques from control and physics are interpretable, and also lead to state-of-the-art algorithms

‣ Control has powerful tools to make inroads into understanding and improving deep networks

PDEs, stochastic control, stability of limit cycles, Fokker-Planck equations, continuous-time analysis…

input-output stability, reinforcement learning & optimal control

‣ Deep learning is powerful, and quite "easy" to get into, yet even the fundamentals are unknown and debated

Thank You!


Joint work with

www.pratikac.info

Stefano Soatto

Guillaume Carlier, Adam Oberman, Stanley Osher

Yann LeCun, Anna Choromanska, Ameet Talwalkar

Carlo Baldassi, Christian Borgs, Jennifer Chayes, Riccardo Zecchina
