
Mini-Course 3: Convergence Analysis of Neural Networks

Yang Yuan

Computer Science Department, Cornell University

Deep learning is powerful

What is a neural network?

A simplified view (one layer). Missing: Convolution/BatchNorm.

Input x = (1, 2, −4)⊤

Weight W =
  1 0 0 1
  0 1 0 1
  0 0 1 1

W⊤x = (1, 2, −4, −1)⊤

ReLU(W⊤x) = (1, 2, 0, 0)⊤


Three basic types of theory questions

▶ Representation
  ▶ Can we express any function with neural networks?
  ▶ Yes, very expressive: [Hornik et al., 1989, Cybenko, 1992, Barron, 1993, Eldan and Shamir, 2015, Safran and Shamir, 2016, Lee et al., 2017]
▶ Optimization
  ▶ Are there efficient methods for finding good parameters (i.e., representations)?
▶ Generalization
  ▶ Training data is used for the optimization step.
  ▶ Does it generalize to unseen data (test data)?
  ▶ Little is known. Flat minima? [Shirish Keskar et al., 2016, Hochreiter and Schmidhuber, 1995, Chaudhari et al., 2016, Zhang et al., 2016]
▶ In practice: neural networks do great on ALL THREE! What about in theory?


Optimization

▶ In practice, SGD always finds good local minima.
  ▶ SGD: stochastic gradient descent.
  ▶ x_{t+1} = x_t − η g_t, with E[g_t] = ∇f(x_t). (A minimal sketch follows below.)
▶ Some results are negative, saying that optimization for neural networks is in general hard.
  ▶ [Šíma, 2002, Livni et al., 2014, Shamir, 2016]
▶ Others are positive, but use special algorithms (tensor decomposition, half-space intersection, etc.).
  ▶ [Janzamin et al., 2015, Zhang et al., 2015, Sedghi and Anandkumar, 2015, Goel et al., 2016]
▶ Or they make strong assumptions on the model (weights are complex numbers, learning polynomials only, weights are i.i.d. random).
  ▶ [Andoni et al., 2014, Arora et al., 2014]
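To make the update rule above concrete, here is a minimal SGD sketch in Python/NumPy; the quadratic objective, the noise scale, and all names are illustrative and not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_fn, x0, eta=0.1, steps=200, noise=0.01):
    """Plain SGD: x_{t+1} = x_t - eta * g_t, where E[g_t] = grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_fn(x) + noise * rng.standard_normal(x.shape)  # unbiased noisy gradient
        x = x - eta * g
    return x

# Toy objective f(x) = 0.5 * ||x||^2, so grad f(x) = x and the minimizer is the origin.
print(sgd(lambda x: x, x0=[1.0, -2.0]))  # close to [0, 0]
```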


Recent work: independent activations

Independent activation assumption
▶ The outputs of the ReLU units are assumed to be independent of the input x, and independent of each other. [Choromanska et al., 2015, Kawaguchi, 2016, Brutzkus and Globerson, 2017]

(Same example as before: x = (1, 2, −4)⊤, W⊤x = (1, 2, −4, −1)⊤, ReLU(W⊤x) = (1, 2, 0, 0)⊤ — the assumption treats these activations as "Independent!")


CNN Model in [Brutzkus and Globerson, 2017]

Recent work: guarantees for other algorithms

▶ Tensor decomposition + gradient descent converges to the ground truth for a one-hidden-layer network. [Zhong et al., 2017]
▶ Kernel methods can learn deep neural networks under an eigenvalue-decay assumption. [Goel and Klivans, 2017]

Recent work: deep linear models

Ignore the activation functions.

▶ [Saxe et al., 2013, Kawaguchi, 2016, Hardt and Ma, 2016]
▶ These models only learn a linear function.
▶ Proof idea (for a deep linear residual network):
  ▶ Loss: ‖(I + W_ℓ) ⋯ (I + W_1)x − (Ax + b)‖² (a numerical sketch of this loss follows below).
  ▶ Compute ∂f/∂W_i.
  ▶ Get a lower bound on the gradient norm: ‖∇f(W)‖²_F ≥ C (f(W) − f*).
  ▶ Show C > 0.
  ▶ So every stationary point satisfies f(W) = f*.
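As a concrete reading of the loss above, here is a minimal single-sample sketch of the deep linear residual objective; the dimensions, random data, and function name are illustrative assumptions.

```python
import numpy as np

def deep_linear_residual_loss(Ws, x, A, b):
    """||(I + W_l) ... (I + W_1) x - (A x + b)||^2 for one sample x."""
    h = x
    for W in Ws:          # apply the residual blocks in order
        h = h + W @ h     # (I + W_i) h
    r = h - (A @ x + b)
    return float(r @ r)

rng = np.random.default_rng(0)
d = 4
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]   # small W_i, as in the analysis
x, A, b = rng.standard_normal(d), rng.standard_normal((d, d)), rng.standard_normal(d)
print(deep_linear_residual_loss(Ws, x, A, b))
```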


Today’s paper: convergence analysis for two-layer networks with ReLU [Li and Yuan, 2017]

Below: our thought process; it might be messy.

Our starting model

▶ A two-layer model, not deep:
    f(x, W) = ‖ReLU(W⊤x)‖₁
▶ Assume there exists a teacher network producing the labels:
    f(x, W*) = ‖ReLU(W*⊤x)‖₁
▶ Square loss (a Monte-Carlo sketch follows below):
    L(W) = E_x[(f(x, W) − f(x, W*))²]
▶ x ∼ N(0, I).
  ▶ A common assumption [Choromanska et al., 2015, Tian, 2016, Xie et al., 2017].

[Diagram: input x → W⊤x → ReLU(W⊤x) → take sum → output]
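A minimal Monte-Carlo sketch of this student/teacher setup; the dimension, sample count, and weight scales are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, W):
    """f(x, W) = ||ReLU(W^T x)||_1: sum of the ReLU unit outputs."""
    return np.sum(np.maximum(W.T @ x, 0.0))

def loss_estimate(W, W_star, n_samples=5000):
    """Monte-Carlo estimate of L(W) = E_x[(f(x, W) - f(x, W*))^2] with x ~ N(0, I)."""
    d = W.shape[0]
    X = rng.standard_normal((n_samples, d))
    return float(np.mean([(f(x, W) - f(x, W_star)) ** 2 for x in X]))

d = 5
W_star = 0.1 * rng.standard_normal((d, d))   # teacher weights
W = 0.1 * rng.standard_normal((d, d))        # student weights
print(loss_estimate(W, W_star))              # > 0 for a generic W != W*
```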


However.. [stuck]

[Tian, 2016] showed that, even if
▶ W is initialized symmetrically, and
▶ W* forms an orthonormal basis,
gradient descent may get stuck at saddle points.

Complicated surface, hard to analyze.

[Figure from [Tian, 2016]]


Residual network [He et al., 2016]

▶ State-of-the-art architecture (3300+ citations).
▶ Won several competitions.
▶ Very powerful after stacking multiple blocks together.
▶ Easy to train.

[Figure: a ResNet block, from [He et al., 2016]]


Adding a residual link?

▶ We modify the network
    f(x, W) = ‖ReLU(W⊤x)‖₁
  to (adding the identity)
    f(x, W) = ‖ReLU((I + W)⊤x)‖₁
▶ Same for the ground truth f(x, W*).
▶ Essentially: shift the weights by I. (A one-line code change, sketched below.)

[Diagram: input x → W⊤x → ⊕ residual link (+x) → ReLU((I + W)⊤x) → take sum → output]
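The residual modification is a one-line change to the earlier sketch; the helper name is mine, not the paper's.

```python
import numpy as np

def f_residual(x, W):
    """Residual version: f(x, W) = ||ReLU((I + W)^T x)||_1, i.e. the weights shifted by I."""
    I = np.eye(W.shape[0])
    return np.sum(np.maximum((I + W).T @ x, 0.0))
```

Replacing f with f_residual in the Monte-Carlo loss estimate above gives the model that the simulations below refer to.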



Ask Simulation: does SGD converge for this model?

Simulation says: yes.

Illustration of our key observation

[Figure: weight space showing the points O, I, I + W, and I + W*, with regions labeled "Easy for SGD" (near I), "Unknown", and "Seems hard" (near O), and an arrow labeled "Residual link".]

How to prove this?

One-point convexity.
A function f(x) is called δ-one-point strongly convex in a domain D with respect to a point x*, if for all x ∈ D,

    ⟨−∇f(x), x* − x⟩ > δ‖x* − x‖₂².

▶ A weaker condition than convexity.
▶ If the loss is one-point convex, we get closer to W* after every step, as long as the step size is small. (A numerical check of this condition appears after the illustration below.)

[Figure: 3D plot of a loss surface.]


One-point convex: an illustration

[Figure: a trajectory of iterates W₁ through W₅ approaching W*.]
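A sketch of the kind of check the simulations run: sample points in a region and test the δ-one-point-convexity inequality numerically. The toy quadratic objective and the value of δ are illustrative, not the network loss.

```python
import numpy as np

def is_one_point_convex(grad_fn, x_star, points, delta):
    """Check <-grad f(x), x* - x> > delta * ||x* - x||_2^2 at every sampled point."""
    for x in points:
        diff = x_star - x
        if np.dot(-grad_fn(x), diff) <= delta * np.dot(diff, diff):
            return False
    return True

rng = np.random.default_rng(0)
x_star = np.array([1.0, -1.0])
points = [x_star + rng.standard_normal(2) for _ in range(100)]
# f(x) = 0.5 * ||x - x*||^2 is convex, hence one-point strongly convex w.r.t. x*.
print(is_one_point_convex(lambda x: x - x_star, x_star, points, delta=0.5))  # True
```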

Ask Simulation: is it one-point convex?

Simulation says: yes.

Compute ⟨−∇L(W), W* − W⟩

−∇L(W)_j = Σ_{i=1}^d [ (π/2)(w*_i − w_i) + (π/2 − θ_{i*,j})(e_i + w*_i) − (π/2 − θ_{i,j})(e_i + w_i)
                      + (‖e_i + w*_i‖₂ sin θ_{i*,j} − ‖e_i + w_i‖₂ sin θ_{i,j})(e_j + w_j) ]

▶ e_i, w_i, w*_i are the columns of I, W, W*.
▶ θ_{i,j*}: the angle between e_i + w_i and e_j + w*_j (hard to handle).
▶ sin θ_{i,j} (hard to handle).

Taylor expansion: tedious calculation

What is θ_{i,j*}? It is the angle between w_i + e_i and w*_j + e_j. We have

cos(θ_{i,j*}) = ⟨w_i + e_i, w*_j + e_j⟩ / (‖w_i + e_i‖ ‖w*_j + e_j‖)
             = (⟨w_i, w*_j⟩ + w_{i,j} + w*_{j,i}) / (‖w_i + e_i‖ ‖w*_j + e_j‖)
             ≈ (1 − w_{i,i})(1 − w*_{j,j})(⟨w_i, w*_j⟩ + w_{i,j} + w*_{j,i})
             ≈ (1 − w_{i,i} − w*_{j,j})(⟨w_i, w*_j⟩ + w_{i,j} + w*_{j,i})
             ≈ ⟨w_i, w*_j⟩ + w_{i,j} + w*_{j,i} − w_{i,i}w_{i,j} − w_{i,i}w*_{j,i} − w*_{j,j}w_{i,j} − w*_{j,j}w*_{j,i}

Since arccos(x) ≈ π/2 − x, we have

θ_{i,j*} ≈ π/2 − (⟨w_i, w*_j⟩ + w_{i,j} + w*_{j,i}) / (‖w_i + e_i‖ ‖w*_j + e_j‖)

However, direct Taylor expansion is too loose [stuck]

▶ In order to show ⟨−∇L(W), W* − W⟩ ≥ 0, we need to assume γ ≜ max{‖W‖₂, ‖W*‖₂} ≤ O(1/d).
▶ That is a super-local region.

Ask Simulation: what is the largest γ that satisfies one-point convexity?

Simulation says: Ω(1).

[The key-observation figure again: weight space with O, I, I + W, and I + W*; "Easy for SGD" near I, "Seems hard" near O, "Unknown" in between; arrow labeled "Residual link".]


Use geometry to get tighter bounds!

▶ Denote e_i + w*_i by the vector OC, e_i + w_i by OD, and their unit vectors (e_i + w*_i)/‖e_i + w*_i‖₂ and (e_i + w_i)/‖e_i + w_i‖₂ by OA and OB respectively. Thus ‖w*_i − w_i‖₂ = ‖DC‖₂.
▶ Draw HB ∥ CD, so ‖OH‖₂ ≥ ‖OB‖₂ = ‖OA‖₂.
▶ Since △CDO ∼ △HBO, we have
    ‖CD‖₂ / ‖HB‖₂ = ‖OD‖₂ / ‖OB‖₂ = ‖OD‖₂ ≥ 1 − γ.
▶ So ‖CD‖₂ ≥ (1 − γ)‖HB‖₂.

[Figure: points O, A, B, C, D, E, F, G, H. OC: e_i + w*_i; OD: e_i + w_i; OA, OB: the corresponding unit vectors.]


Central Lemma: geometric lemma

▶ △ABO is an isosceles triangle, so ‖AG‖₂ = ‖GB‖₂.
▶ ‖HB‖₂ ≥ ‖AB‖₂ = 2‖GB‖₂.
▶ ‖GB‖₂ ≤ ‖HB‖₂/2 ≤ ‖CD‖₂ / (2(1 − γ)).

[Same figure as before; recall ‖CD‖₂ ≥ (1 − γ)‖HB‖₂.]


Central Lemma: geometric lemma

▶ △ABE ∼ △BGO.
▶ ‖AE‖₂ / ‖AB‖₂ = ‖OG‖₂ / ‖OB‖₂ = √(1 − ‖GB‖₂²) / 1.
▶ ‖AE‖₂ / ‖AB‖₂ ≥ √(1 − ‖CD‖₂² / (4(1 − γ)²)) ≥ √(1 − (γ/(1 − γ))²).

[Same figure; recall ‖GB‖₂ ≤ ‖CD‖₂ / (2(1 − γ)) and ‖w_i − w*_i‖₂ ≤ 2γ.]


Central Lemma: geometric lemma

▶ ‖CD‖₂ ≥ ‖CF‖₂.
▶ △CFO ∼ △AEO.
▶ ‖W*‖₂ ≤ γ.
▶ ‖CD‖₂ / ‖AE‖₂ ≥ ‖CF‖₂ / ‖AE‖₂ = ‖OC‖₂ / ‖OA‖₂ = ‖e_i + w*_i‖₂ ≥ 1 − γ.
▶ ‖CD‖₂ ≥ (1 − γ)‖AE‖₂ ≥ (1 − γ)√(1 − (γ/(1 − γ))²) ‖AB‖₂.
▶ So ‖AB‖₂ ≤ ‖CD‖₂ / √(1 − 2γ) = ‖w*_i − w_i‖₂ / √(1 − 2γ), since (1 − γ)√(1 − (γ/(1 − γ))²) = √(1 − 2γ).

[Same figure; recall ‖AE‖₂ / ‖AB‖₂ ≥ √(1 − (γ/(1 − γ))²).]


Central Lemma: geometric lemma

▶ |⟨OA − OB, OB⟩| = ‖BE‖₂ (here OA and OB are the unit vectors of e_i + w*_i and e_i + w_i).
▶ △ABE ∼ △GBO.
▶ ‖BE‖₂ / ‖AB‖₂ = ‖GB‖₂ / ‖BO‖₂ = ‖AB‖₂ / 2.

Hence

    |⟨OA − OB, OB⟩| = ‖AB‖₂² / 2 ≤ ‖w*_i − w_i‖₂² / (2(1 − 2γ)).

This is a very tight bound, O(γ²). (A numerical check follows below.)

[Same figure; recall ‖w_i − w*_i‖₂ ≤ 2γ and ‖AB‖₂ ≤ ‖w*_i − w_i‖₂ / √(1 − 2γ).]
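A quick numerical sanity check of the final bound, sampling w_i and w*_i with norm at most γ; the dimension, the value of γ, and the sample count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

d, gamma = 5, 0.2
e = np.zeros(d); e[0] = 1.0                      # e_i: a column of the identity
worst_gap = -np.inf
for _ in range(10000):
    w = rng.standard_normal(d)
    w *= gamma * rng.random() / np.linalg.norm(w)            # ||w_i||_2 <= gamma
    w_star = rng.standard_normal(d)
    w_star *= gamma * rng.random() / np.linalg.norm(w_star)  # ||w*_i||_2 <= gamma
    lhs = abs(np.dot(unit(e + w_star) - unit(e + w), unit(e + w)))
    rhs = np.dot(w_star - w, w_star - w) / (2 * (1 - 2 * gamma))
    worst_gap = max(worst_gap, lhs - rhs)
print(worst_gap <= 0)   # True: the O(gamma^2) bound holds on every sampled pair
```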


Is it enough..?

With the tight bounds, we get: if ‖W₀‖₂, ‖W*‖₂ ≤ γ = Ω(1), then

    ⟨−∇L(W), W* − W⟩ > (0.084 − (1 + γ)g / (2(1 − 2γ))) ‖W* − W‖²_F

▶ g is a potential function: g ≜ Σ_{i=1}^d (‖e_i + w*_i‖₂ − ‖e_i + w_i‖₂).
▶ d is the input dimension.
▶ How does g affect one-point convexity (OPC)?


Ask Simulation: what if g is large? [stuck]

Simulation: [plot omitted — when g is large, the one-point convexity condition can fail.]


Ask Simulation: will g always decrease with SGD?

Simulation: Yes (for lots of instances)

g controls the dynamics!

[Figure: two panels, Phase I (PI) and Phase II (PII).]

The actual dynamics

[Figure: a trajectory W₁ → W₆ → W₁₀ relative to W*.]

Phase I: W₁ → W₆, W may go in the wrong direction. Phase II: W₆ → W₁₀, W gets closer to W* in every step by one-point convexity.

Two-phase framework

▶ Phase I: g decreases to a small value.
  ▶ Technique: analyze the dynamics of SGD.
▶ Phase II: the one-point convex region.
  ▶ Get closer to W* after every step.
  ▶ g stays small.
  ▶ Technique: compute the inner product.

Phase I: g keeps decreasing

▶ g ≜ Σ_{i=1}^d (‖e_i + w*_i‖₂ − ‖e_i + w_i‖₂).

What is Δg here?

    Δg ≈ Σ_i ⟨−Δw_i, e_i + w_i⟩ = ⟨η∇L(W), I + W⟩ ≈ ⟨η∇L(W), I⟩ = η Tr(∇L(W))

What is Tr(∇L(W))?

    Tr(∇L(W)) ≈ O(Tr(W* − W)) + O(Tr((W* − W)uu⊤)) + d·g

[Figure: e_i, the update Δw_i of w_i, and the resulting change Δ‖e_i + w_i‖₂.]


Phase I: g keeps decreasing

Observation:

    Tr(W* − W) = Σ_{i=1}^d (1 + w*_{i,i} − 1 − w_{i,i}) ≈ Σ_{i=1}^d (‖e_i + w*_i‖₂ − ‖e_i + w_i‖₂) = g

But Tr((W* − W)uu⊤) is hard to bound. Therefore, we consider the joint update rule of s ≜ (W* − W)u and g. (The recursion below is for illustration only; a numerical check follows after this slide.)

    ‖s_{t+1}‖₂ ≈ 0.9‖s_t‖₂ + 10η|g_t|
    |g_{t+1}| ≈ 0.9|g_t| + 10η‖s_t‖₂

When 10η < 0.05, ‖s_{t+1}‖₂ + |g_{t+1}| ≤ 0.95(‖s_t‖₂ + |g_t|). So |g_t| becomes very small.
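A minimal check of the illustrative coupled recursion above; the constants 0.9 and 10 are copied from the slide, and the step size is my choice so that 10η < 0.05.

```python
eta = 0.004                 # 10 * eta = 0.04 < 0.05
s, g = 1.0, 1.0             # stand-ins for ||s_0||_2 and |g_0|
total0 = s + g
for t in range(200):
    s, g = 0.9 * s + 10 * eta * g, 0.9 * g + 10 * eta * s
# The sum contracts by a factor (0.9 + 10 * eta) <= 0.95 per step.
print(s + g <= 0.95 ** 200 * total0 + 1e-12)   # True
```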

Main result

Main Theorem (informal).
If the input is drawn from a Gaussian distribution, ‖W₀‖₂, ‖W*‖₂ ≤ γ (a constant), and the step size is small, then mini-batch SGD started from W₀ reaches W* after a polynomial number of steps, in two phases.

This matches standard O(1/√d) initialization schemes (d is the input dimension); a rough numerical check follows below.
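A rough numerical reading of the remark above, under my assumption that "matches O(1/√d) initialization" means: entries of scale 1/√d give a weight matrix whose spectral norm stays bounded by a constant as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (64, 256, 1024):
    W0 = rng.standard_normal((d, d)) / np.sqrt(d)   # entries of scale O(1/sqrt(d))
    print(d, round(np.linalg.norm(W0, 2), 3))       # spectral norm stays O(1) as d grows
```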

Open questions

▶ Multiple layers?
▶ Other input distributions?
▶ Convolutional networks?
▶ A residual link that skips two layers?
▶ Identify different potential functions for other non-convex problems?

References

Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. (2014). Learning polynomials with neural networks. In ICML, pages 1908–1916.

Arora, S., Bhaskara, A., Ge, R., and Ma, T. (2014). Provable bounds for learning some deep representations. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 584–592.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3):930–945.

Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a ConvNet with Gaussian inputs. In ICML 2017.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2016). Entropy-SGD: Biasing gradient descent into wide valleys. ArXiv e-prints.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.

Cybenko, G. (1992). Approximation by superpositions of a sigmoidal function. MCSS, 5(4):455.

Eldan, R. and Shamir, O. (2015). The power of depth for feedforward neural networks. ArXiv e-prints.

Goel, S., Kanade, V., Klivans, A. R., and Thaler, J. (2016). Reliably learning the ReLU in polynomial time. CoRR, abs/1611.10258.

Goel, S. and Klivans, A. (2017). Eigenvalue decay implies polynomial-time learnability for neural networks. In NIPS 2017.

Hardt, M. and Ma, T. (2016). Identity matters in deep learning. CoRR, abs/1611.04231.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages 529–536. MIT Press.

Hornik, K., Stinchcombe, M. B., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Janzamin, M., Sedghi, H., and Anandkumar, A. (2015). Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473.

Kawaguchi, K. (2016). Deep learning without poor local minima. In NIPS, pages 586–594.

Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. (2017). On the ability of neural nets to express distributions. ArXiv e-prints.

Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In NIPS 2017.

Livni, R., Shalev-Shwartz, S., and Shamir, O. (2014). On the computational efficiency of training neural networks. In NIPS, pages 855–863.

Safran, I. and Shamir, O. (2016). Depth-width tradeoffs in approximating natural functions with neural networks. ArXiv e-prints.

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120.

Sedghi, H. and Anandkumar, A. (2015). Provable methods for training neural networks with sparse connectivity. ICLR.

Shamir, O. (2016). Distribution-specific hardness of learning neural networks. CoRR, abs/1609.01037.

Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. ArXiv e-prints.

Šíma, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728.

Tian, Y. (2016). Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. Submitted to ICLR 2017.

Xie, B., Liang, Y., and Song, L. (2017). Diversity leads to generalization in neural networks. In AISTATS.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. ArXiv e-prints.

Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, M. I. (2015). Learning halfspaces and neural networks with random initialization. CoRR, abs/1511.07948.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In ICML 2017.
