
A picture of the energy landscape of deep neural networks

Pratik Chaudhari

December 15, 2017


UCLA VISION LAB


What is the shape?

How should it inform the optimization?

$y(x; w) = \sigma\!\left(w_p\, \sigma(w_{p-1}(\ldots \sigma(w_1 x))\ldots)\right)$

$w^* = \arg\min_w \; \underbrace{\mathbb{E}_{(x,y) \in D}\left[ -\sum_{i=1}^{k} \mathbf{1}_{\{y=i\}} \log y_i(x; w) \right]}_{=:\, f(w)}$

(S)GD

Many, many variants: AdaGrad, SVRG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

$w_{k+1} = w_k - \eta\, \nabla f_b(w_k)$

“variance reduction”
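To make the objective and the update concrete, here is a minimal numpy sketch of mini-batch SGD on the cross-entropy loss $f(w)$ (not from the talk; the toy data, the single-layer softmax model standing in for the deep composition $\sigma(w_p\,\sigma(\cdots))$, and the hyper-parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy k-class classification data (illustrative only).
n, d, k = 1000, 20, 5
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)

def f_batch(w, idx):
    """Cross-entropy loss -sum_i 1{y=i} log y_i(x; w) on a mini-batch."""
    W = w.reshape(d, k)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(idx)), y[idx]]).mean()

def grad_f_batch(w, idx):
    """Gradient of the mini-batch loss (closed form for softmax regression)."""
    W = w.reshape(d, k)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p[np.arange(len(idx)), y[idx]] -= 1.0
    return (X[idx].T @ p / len(idx)).ravel()

# SGD: w_{k+1} = w_k - eta * grad f_b(w_k)
w, eta, b = np.zeros(d * k), 0.1, 32
for step in range(2000):
    idx = rng.integers(0, n, size=b)
    w -= eta * grad_f_batch(w, idx)

print("final training loss:", f_batch(w, np.arange(n)))
```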

‣ SGD (Lee et al., '16): mere GD always finds local minima
‣ Ge et al., '15; Sun et al., '15, '16: can escape strict saddle points
‣ Saxe et al., '14: orthogonal initializations
‣ Variance reduction: Schmidt et al., '13; Defazio et al., '14
‣ AdaGrad: Duchi et al., '10

What do we know about the energy landscape?


‣ Multiple, equivalent minima for linear networks: PCA (Baldi & Hornik, '89), deep GPs (Duvenaud et al., '14), piecewise-linear networks (Soudry & Carmon, '16), matrix/tensor factorization (Haeffele & Vidal, '15)

‣ Generalization (our work; Hardt et al., '15; Goodfellow & Vinyals, '15): can easily get zero training error, yet generalize poorly

‣ Paradox between a "benign" energy landscape and delicate training algorithms

‣ Empirical results (Dauphin et al., '14): all local minima are close to the global minimum; saddle points slow down SGD

‣ Statistical physics (Bray & Dean, '07; Fyodorov & Williams, '07; Choromanska et al., '15; Chaudhari & Soatto, '15): Gaussian random fields and spin glasses have many descending directions at high energy; binary perceptron (Baldassi et al., '15)

Motivation from the Hessian


[Figure: two histograms of the Hessian eigenvalues at the end of training (x-axis: eigenvalues, y-axis: frequency on a log scale). Positive eigenvalues extend to roughly 40, while negative eigenvalues only reach about −0.5: a short negative tail.]
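As an aside, a minimal sketch of how such a spectrum can be computed for a tiny network (not from the talk; the two-layer tanh network, the random data, the random initialization in place of a trained model, and the finite-difference Hessian are illustrative assumptions that are only feasible because the parameter count is tiny):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem and a two-layer tanh network.
X = rng.normal(size=(64, 3))
y = rng.normal(size=(64, 1))
d_in, d_h, d_out = 3, 8, 1
n_params = d_in * d_h + d_h * d_out

def loss(w):
    W1 = w[:d_in * d_h].reshape(d_in, d_h)
    W2 = w[d_in * d_h:].reshape(d_h, d_out)
    pred = np.tanh(X @ W1) @ W2
    return 0.5 * np.mean((pred - y) ** 2)

def hessian_fd(w, eps=1e-4):
    """Dense Hessian by central finite differences (O(n^2) loss evaluations)."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps**2)
    return 0.5 * (H + H.T)   # symmetrize

# Hessian at a random initialization (the talk's histograms are at the end of training).
w = 0.1 * rng.normal(size=n_params)
eigs = np.linalg.eigvalsh(hessian_fd(w))
print("most negative eigenvalue:", eigs[0])
print("most positive eigenvalue:", eigs[-1])
```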

How do we exploit this?


Magnify the energy landscape and smooth with a kernel

$w^* = \arg\min_w f(w) = \arg\max_w e^{-f(w)} \approx \arg\min_w \left\{ -\log\left(G_\gamma * e^{-f}\right)(w) \right\}$

where $G_\gamma$ is a Gaussian kernel of variance $\gamma$; the smoothing focuses on the neighborhood of $w$.

[Figure: a one-dimensional example showing the original loss f(x) and two smoothed versions F(x, 10^3) and F(x, 2 × 10^4), together with the minimizer x̂_f of f, the minimizer x̂_F of F, and a candidate point x_candidate.]
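A minimal 1-D sketch of this smoothing (not from the talk; the double-well toy loss and the kernel variances are illustrative assumptions): with a small kernel the smoothed minimizer stays at the narrow global minimum, while a wide kernel moves it into the wide valley.

```python
import numpy as np

# A 1-D loss with a narrow, deep minimum near x = 2 and a wide, shallower one near x = -2.
x = np.linspace(-4.0, 4.0, 4001)
f = 0.05 * (x + 2.0) ** 2 + 1.0 - 2.0 * np.exp(-50.0 * (x - 2.0) ** 2)

def smoothed(f, x, gamma):
    """F(x) = -log( G_gamma * exp(-f) ), with a Gaussian kernel of variance gamma."""
    dx = x[1] - x[0]
    F = np.empty_like(f)
    for i, xi in enumerate(x):
        kernel = np.exp(-(x - xi) ** 2 / (2.0 * gamma))
        F[i] = -np.log(np.sum(kernel * np.exp(-f)) * dx)
    return F

for gamma in (0.01, 0.5):
    F = smoothed(f, x, gamma)
    print(f"gamma={gamma}: argmin f = {x[np.argmin(f)]:+.2f}, "
          f"argmin F = {x[np.argmin(F)]:+.2f}")
```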

A physics interpretation


[Figure: the original global minimum of f versus the new global minimum of the smoothed loss.]

Our new loss is "local entropy" (Baldassi et al., '15, '16)

$f_\gamma(w) = -\log \int_{w'} \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w - w'\|^2\right) dw'$

Minimizing local entropy


‣ Solve
$w^* = \arg\min_w f_\gamma(w)$

‣ Gradient of local entropy
$\nabla f_\gamma(w) = \gamma^{-1}\left(w - \langle w' \rangle\right)$
where $\langle \cdot \rangle$ denotes an expectation over a local Gibbs distribution:
$\langle w' \rangle = Z(w, \gamma)^{-1} \int_{w'} w' \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w - w'\|^2\right) dw'$

‣ Estimate the gradient using MCMC; this can be applied to general deep networks
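A rough rendering of this scheme in the spirit of Entropy-SGD (not the authors' code; the quadratic toy loss, the SGLD step size, the number of inner Langevin steps, and the averaging weights are illustrative assumptions): the inner loop runs Langevin MCMC on the local Gibbs distribution to estimate $\langle w' \rangle$, and the outer loop steps along $-\nabla f_\gamma(w) = -\gamma^{-1}(w - \langle w' \rangle)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative loss gradient (a stand-in for a deep network's grad f).
def grad_f(w):
    return w - np.array([1.0, -2.0])     # f(w) = 0.5 * ||w - (1, -2)||^2

def entropy_sgd_step(w, gamma=0.1, eta=0.5, inner_steps=20,
                     langevin_lr=0.05, beta_inv=1e-3):
    """One outer step: estimate <w'> with SGLD, then move along -grad f_gamma(w)."""
    w_prime = w.copy()
    mean_w_prime = w.copy()
    for _ in range(inner_steps):
        # Langevin dynamics on the local Gibbs distribution
        #   rho(w') ~ exp(-f(w') - ||w - w'||^2 / (2 gamma))
        g = grad_f(w_prime) + (w_prime - w) / gamma
        noise = np.sqrt(2 * langevin_lr * beta_inv) * rng.normal(size=w.shape)
        w_prime = w_prime - langevin_lr * g + noise
        # Running (exponential) average of the iterates as the estimate of <w'>.
        mean_w_prime = 0.75 * mean_w_prime + 0.25 * w_prime
    # Outer update: w <- w - eta * gamma^{-1} (w - <w'>)
    return w - eta * (w - mean_w_prime) / gamma

w = np.zeros(2)
for k in range(50):
    w = entropy_sgd_step(w)
print("final w:", w)            # should approach the minimizer (1, -2)
```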

Medium-scale CNN


‣ All-CNN-BN on CIFAR-10

‣ Do not see much plateauing of training or validation loss

[Figure: validation error (left, final values 7.71% and 7.81%) and cross-entropy loss (right, final values 0.0353 and 0.0336) versus epochs × L for SGD and Entropy-SGD with All-CNN-BN on CIFAR-10; a companion comparison uses wall-clock time on the x-axis.]


A PDE interpretation

‣ Local entropy is the solution of a Hamilton-Jacobi equation

original loss as the initial condition

$u_t = -\frac{1}{2}\,\|\nabla u\|^2 + \frac{1}{2\beta}\,\Delta u, \qquad u(w, 0) = f(w)$

‣ Stochastic control interpretation

quadratic penalty for greedy gradient descent

$dw = -\alpha(s)\, ds + dB(s), \quad t \le s \le T, \qquad w(t) = w$

$C(w(\cdot), \alpha(\cdot)) = \mathbb{E}\left[ f(w(T)) + \frac{1}{2} \int_t^T \|\alpha(s)\|^2\, ds \right]$

$u(w, t) = \min_{\alpha(\cdot)} C(w(\cdot), \alpha(\cdot)), \qquad \alpha(w, t) = \nabla u(w, t)$

$f_\gamma(w) = u(w, \gamma)$
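A minimal numerical sketch of the viscous HJ smoothing in one dimension (not from the talk; the explicit finite-difference scheme, the toy initial condition, and the choices of $\beta^{-1}$, $T$, and the grid are illustrative assumptions, with no claim that this is the most stable discretization):

```python
import numpy as np

# Grid and toy initial condition u(x, 0) = f(x).
x = np.linspace(-4.0, 4.0, 801)
dx = x[1] - x[0]
u = 0.05 * (x + 2.0) ** 2 + 1.0 - 2.0 * np.exp(-50.0 * (x - 2.0) ** 2)
f0 = u.copy()

beta_inv = 0.1          # viscosity coefficient beta^{-1}
T, dt = 0.5, 1e-4       # final time and (small) explicit time step

for _ in range(int(T / dt)):
    ux = np.gradient(u, dx)                                 # central first derivative
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # central second derivative
    uxx[0] = uxx[-1] = 0.0                                  # crude boundary handling
    # Viscous Hamilton-Jacobi: u_t = -1/2 |u_x|^2 + (1/(2 beta)) u_xx
    u = u + dt * (-0.5 * ux**2 + 0.5 * beta_inv * uxx)

print("argmin of f:       ", x[np.argmin(f0)])
print("argmin of u(., T): ", x[np.argmin(u)])
```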

New PDEs


‣ Use the non-viscous HJ equation
$u_t = -\frac{1}{2}\,\|\nabla u\|^2$
(the viscous term $\frac{1}{2\beta}\,\Delta u$ is dropped)

‣ The Hopf-Lax formula gives the solution, the "inf-convolution" or Moreau envelope
$u(w, t) = \inf_{w'} \left\{ f(w') + \frac{1}{2t}\,\|w - w'\|^2 \right\}$

‣ This has a few magical properties…

‣ Simple formula for the gradient (proximal point iteration)
$p^* = \nabla f(w - t\, p^*), \qquad \nabla u(w, t) = p^*$
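A minimal sketch of that fixed-point iteration (not from the talk; the 1-D toy loss, the iteration count, and the brute-force check against a finite-difference gradient of the Moreau envelope are illustrative assumptions):

```python
import numpy as np

# Toy 1-D loss and its gradient.
def f(w):  return 0.5 * w**2 + np.cos(w)
def df(w): return w - np.sin(w)

def grad_moreau(w, t, iters=100):
    """Fixed point p* = grad f(w - t p*); then grad u(w, t) = p*."""
    p = df(w)
    for _ in range(iters):
        p = df(w - t * p)
    return p

def moreau(w, t):
    """u(w, t) = inf_{w'} f(w') + |w - w'|^2 / (2 t), by brute-force grid search."""
    wp = np.linspace(w - 10.0, w + 10.0, 200001)
    return np.min(f(wp) + (w - wp) ** 2 / (2.0 * t))

w, t, h = 1.3, 0.2, 1e-4
print("fixed-point gradient:   ", grad_moreau(w, t))
print("finite-difference of u: ", (moreau(w + h, t) - moreau(w - h, t)) / (2 * h))
```

The fixed-point map contracts when $t$ times the Lipschitz constant of $\nabla f$ is below one, which is why a small $t$ and a smooth toy loss are used in this sketch.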

Smoothing using PDEs


[Figure: a one-dimensional loss f(x) together with its viscous-HJ smoothing u_viscous HJ(x, T) and non-viscous-HJ smoothing u_non-viscous HJ(x, T); the initial density of iterates (where SGD gets stuck) is compared with the final densities under the non-viscous and viscous HJ smoothings.]

Distributed training algorithms


‣ A continuous-time view of local entropy

$dw = -\gamma^{-1}\,(w - w')\, ds \qquad \text{(slow variable)}$

$dw' = -\frac{1}{\epsilon}\left[\nabla f(w') + \frac{1}{\gamma}\,(w' - w)\right] ds + \frac{1}{\sqrt{\epsilon}}\, dB(s) \qquad \text{(fast variable)}$

‣ Elastic-SGD (Zhang et al., '15)

$\arg\min_{w,\, w'_1, \ldots, w'_p} \; \sum_{k=1}^{p} \left[ f(w'_k) + \frac{1}{2\gamma}\,\|w'_k - w\|^2 \right]$

"Homogenized" dynamics as $\epsilon \to 0$

‣ If w’(s) is very fast, w(s) only sees its average

$dw = -\gamma^{-1}\left(w - \langle w' \rangle\right) ds$

$\rho^{\infty}(w'; w) \propto \exp\left(-f(w') - \frac{1}{2\gamma}\,\|w' - w\|^2\right)$

$\nabla f_\gamma(w) = \gamma^{-1}\left(w - \langle w' \rangle\right)$
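A minimal single-process simulation of the Elastic-SGD coupling (a rough sketch, not Zhang et al.'s implementation; the noisy quadratic losses standing in for per-worker mini-batch gradients, the number of workers, and the step sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

p, dim = 8, 10                      # number of workers, parameter dimension
target = rng.normal(size=dim)       # common minimizer of the toy losses

def grad_f(w, k):
    """Stand-in for worker k's noisy mini-batch gradient (k unused in this toy)."""
    return (w - target) + 0.1 * rng.normal(size=dim)

gamma, eta = 0.1, 0.05
w = np.zeros(dim)                   # "center" variable
w_prime = np.zeros((p, dim))        # per-worker variables

for step in range(500):
    for k in range(p):
        # Worker update: gradient of f(w'_k) + ||w'_k - w||^2 / (2 gamma).
        g = grad_f(w_prime[k], k) + (w_prime[k] - w) / gamma
        w_prime[k] -= eta * g
    # Center update: pulled toward the average of the workers.
    w -= eta * (w - w_prime.mean(axis=0)) / gamma

print("distance of center to target:", np.linalg.norm(w - target))
```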


Wide-ResNet on CIFAR-10 / 100


WTH is implicit regularization?


Why is SGD so special?

‣ Fokker-Planck equation and optimal transportation

$\rho^* = \arg\min_\rho \left\{ \int \Phi(w)\, \rho(w)\, dw + \beta^{-1} \int \rho \log \rho\, dw \right\}$

Information bottleneck, Bayesian inference, large batch-sizes, sampling techniques, hyper-parameter choices, neural architecture search, ….

Many, many variants: AdaGrad, SVRG, SAG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

‣ Stochastic differential equation

$dw = -\nabla f(w)\, dt + \sqrt{2\beta^{-1} D(w)}\, dB(t), \qquad \beta^{-1} = \frac{\eta}{2b}$

($\eta$: learning rate, $b$: mini-batch size)
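A minimal check of this continuous-time view in one dimension with isotropic noise (not from the talk; the quadratic loss, the values of $\eta$ and $b$, and the Euler-Maruyama discretization are illustrative assumptions): the long-run variance of the iterates should match the Gibbs prediction $\rho \propto e^{-\beta f}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(w):
    return w                      # f(w) = w^2 / 2

eta, b = 0.1, 32                  # learning rate and batch size
beta_inv = eta / (2 * b)          # beta^{-1} = eta / (2 b)

# Euler-Maruyama simulation of dw = -grad f(w) dt + sqrt(2 beta^{-1}) dB(t).
dt, n_steps, burn_in = 0.01, 500_000, 50_000
w, samples = 0.0, []
for step in range(n_steps):
    w += -grad_f(w) * dt + np.sqrt(2 * beta_inv * dt) * rng.normal()
    if step >= burn_in:
        samples.append(w)

print("empirical variance:", np.var(samples))
print("Gibbs prediction:  ", beta_inv)       # Var = beta^{-1} for f(w) = w^2 / 2
```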


SGD does not minimize $f(w)$; what is $\Phi(w)$?

‣ Deep networks induce highly non-isotropic noise

CIFAR-10: $\lambda(D) = 0.27 \pm 0.84$, $\operatorname{rank}(D) = 0.34\%$

‣ Leads to deterministic, Hamiltonian dynamics in SGD
$\dot{x} = j(x), \qquad \operatorname{div} j(x) = 0$

Most likely trajectories of SGD are closed loops

[Figure: closed-loop trajectories near a saddle point.]

‣ Deep networks have out-of-equilibrium distribution

$\rho^*(w) \not\propto e^{-\beta f(w)}, \qquad \rho^*(w) \propto e^{-\beta\, \Phi(w)}$
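A minimal toy illustration of this point (not from the talk; the two-dimensional quadratic loss and the constant anisotropic diffusion $D = \mathrm{diag}(1, 4)$ are illustrative assumptions): with anisotropic noise the stationary variances become $\beta^{-1}$ and $4\beta^{-1}$, i.e. the iterates sample $e^{-\beta \Phi}$ for a modified potential $\Phi \neq f$.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_inv = 0.01
D = np.diag([1.0, 4.0])            # constant, anisotropic diffusion matrix
sqrtD = np.sqrt(D)                 # matrix square root (valid because D is diagonal)

def grad_f(x):
    return x                       # f(x) = ||x||^2 / 2

# Euler-Maruyama for dx = -grad f(x) dt + sqrt(2 beta^{-1} D) dB(t).
dt, n_steps, burn_in = 0.01, 400_000, 40_000
x, samples = np.zeros(2), []
for step in range(n_steps):
    noise = sqrtD @ rng.normal(size=2)
    x += -grad_f(x) * dt + np.sqrt(2 * beta_inv * dt) * noise
    if step >= burn_in:
        samples.append(x.copy())

var = np.var(np.array(samples), axis=0)
print("empirical variances:  ", var)
print("e^{-beta f} predicts: ", [beta_inv, beta_inv])
print("actual stationary law:", [beta_inv, 4 * beta_inv])
```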


Summary

‣ Techniques from control and physics are interpretable, and also lead to state-of-the-art algorithms

‣ Control has powerful tools to make inroads into understanding and improving deep networks

PDEs, stochastic control, stability of limit cycles, Fokker-Planck equations, continuous-time analysis…

input-output stability, reinforcement learning & optimal control

‣ Deep learning is powerful, and quite "easy" to get into, yet even the fundamentals are unknown and debated

Thank You!


Joint work with

www.pratikac.info

Stefano Soatto

Guillaume Carlier, Adam Oberman, Stanley Osher

Yann LeCun, Anna Choromanska, Ameet Talwalkar

Carlo Baldassi, Christian Borgs, Jennifer Chayes, Riccardo Zecchina
