
Page 1: Lecture 3: The Principle of Sparsity

Lecture 3: The Principle of Sparsity

[§4, §8, §11.2, and §∞.3 of "The Principles of Deep Learning Theory (PDLT)," arXiv:2106.10165]

Pages 2–8: Lecture 3: The Principle of Sparsity

Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization:

• Problem 1: too many terms in general

• Problem 2: complicated mapping from the initial distributions over model parameters to the statistics at initialization

• Problem 3: complicated dynamics from the statistics at initialization to the statistics after training

Dan has covered Dynamics; Sho will cover Statistics.
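
The expansion referred to here is not reproduced above; in PDLT's notation it is presumably the Taylor expansion of the fully-trained output $z_i(x;\theta^\star)$ around the initialization $\theta_0$:

$$
z_i(x;\theta^\star) \;=\; z_i(x;\theta_0)
\;+\;\sum_{\mu}(\theta^\star-\theta_0)_\mu\,\frac{\partial z_i}{\partial\theta_\mu}\bigg|_{\theta_0}
\;+\;\frac{1}{2}\sum_{\mu,\nu}(\theta^\star-\theta_0)_\mu(\theta^\star-\theta_0)_\nu\,\frac{\partial^2 z_i}{\partial\theta_\mu\,\partial\theta_\nu}\bigg|_{\theta_0}
\;+\;\cdots
$$

Problem 1 is that the sums run over an enormous number of parameters and derivative terms; Problems 2 and 3 are that both the initialization statistics of the expansion coefficients and the learned update $(\theta^\star-\theta_0)$ are complicated.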

Pages 9–10: Lecture 3: The Principle of Sparsity

Sho will cover Statistics for WIDE & DEEP neural networks: the mapping from the initial distributions over model parameters to the statistics at initialization.

Lecture 3: The Principle of Sparsity, deriving recursions

Lecture 4: The Principle of Criticality, solving recursions

Page 11: Lecture 3: The Principle of Sparsity

Outline

1. Neural Networks 101

2. One-Layer Neural Networks

3. Two-Layer Neural Networks

4. Deep Neural Networks

Page 12: Lecture 3: The Principle of Sparsity

1. Neural Networks 101

Pages 13–19: Lecture 3: The Principle of Sparsity

Neural Networks

• Preactivations, activation function, network output.

• A hat denotes a quantity at initialization.

• Biases and weights (model parameters) are independently (& symmetrically) distributed with variances set by the initialization hyperparameters, chosen to have a good wide limit.
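
The defining equations on these slides are not reproduced above; in PDLT's conventions (with the layer indices that pedagogical simplification 2 below will drop) they are presumably the forward iteration and the initialization ensemble

$$
z^{(1)}_i(x) = b^{(1)}_i + \sum_{j=1}^{n_0} W^{(1)}_{ij}\,x_j,
\qquad
z^{(\ell+1)}_i(x) = b^{(\ell+1)}_i + \sum_{j=1}^{n_\ell} W^{(\ell+1)}_{ij}\,\sigma\!\big(z^{(\ell)}_j(x)\big),
$$

$$
\mathbb{E}\big[b^{(\ell)}_i\,b^{(\ell)}_j\big] = \delta_{ij}\,C_b,
\qquad
\mathbb{E}\big[W^{(\ell)}_{ij}\,W^{(\ell)}_{kl}\big] = \delta_{ik}\,\delta_{jl}\,\frac{C_W}{n_{\ell-1}},
$$

where $\sigma$ is the activation function, the $z^{(\ell)}_i$ are the preactivations, the final layer is the network output, and hats mark these quantities at initialization. The $1/n_{\ell-1}$ scaling of the weight variance is what gives the good wide limit.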

Pages 20–25: Lecture 3: The Principle of Sparsity

One Aside on Gradient Descent [Cf. Andrea’s “S” matrix]

Taylor expansion:

• Neural Tangent Kernel (NTK) [Cf. Dan’s ]

• differential of the NTK (dNTK) [Cf. Dan’s ]

• ddNTK
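
The aside's equations are not reproduced above; in PDLT's conventions, with global learning rate $\eta$ and learning-rate tensor $\lambda_{\mu\nu}$, they presumably record the gradient-descent update and the Taylor expansion of the resulting change in the output:

$$
d\theta_\mu = -\,\eta \sum_{\nu} \lambda_{\mu\nu}\,\frac{\partial \mathcal{L}}{\partial \theta_\nu},
\qquad
dz_i = \sum_{\mu} \frac{\partial z_i}{\partial\theta_\mu}\,d\theta_\mu
\;+\; \frac{1}{2}\sum_{\mu,\nu} \frac{\partial^2 z_i}{\partial\theta_\mu\,\partial\theta_\nu}\,d\theta_\mu\,d\theta_\nu \;+\;\cdots
$$

At first order in the update the change in the output is governed by the Neural Tangent Kernel,

$$
\widehat H_{i_1 i_2} \;\equiv\; \sum_{\mu,\nu} \lambda_{\mu\nu}\,
\frac{\partial z_{i_1}}{\partial\theta_\mu}\,
\frac{\partial z_{i_2}}{\partial\theta_\nu},
$$

at the next order by the dNTK, and at the order after that by the ddNTKs.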

Pages 26–28: Lecture 3: The Principle of Sparsity

Neural Tangent Kernel (NTK) [Cf. Andrea’s “S” matrix]

Diagonal, group-by-group, learning rate, chosen to give a good wide limit:
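
The missing equation here is presumably the PDLT choice of a diagonal learning-rate tensor, with one value per parameter group:

$$
\lambda_{b^{(\ell)}_i\,b^{(\ell)}_j} = \delta_{ij}\,\lambda_b,
\qquad
\lambda_{W^{(\ell)}_{ij}\,W^{(\ell)}_{kl}} = \delta_{ik}\,\delta_{jl}\,\frac{\lambda_W}{n_{\ell-1}},
$$

where, just as for the weight variance, the $1/n_{\ell-1}$ scaling of the weight learning rate is what gives the NTK a good wide limit.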

Page 29: Lecture 3: The Principle of Sparsity

Two Pedagogical Simplifications

1. Single input; drop sample indices

[See “PDLT” (arXiv:2106.10165) for more general cases.]

2. Layer-independent hyperparameters; drop layer indices from them

Page 34: Lecture 3: The Principle of Sparsity

2. One-Layer Neural Networks

Pages 35–48: Lecture 3: The Principle of Sparsity

Statistics of [first-layer preactivations]

(Equation-only build slides: the first-layer correlators are computed by “Wick contraction” over the initialization ensemble.)

Page 49: Lecture 3: The Principle of Sparsity

Statistics of [first-layer preactivations]

• Neurons don’t talk to each other; they are statistically independent.

• We marginalized over/integrated out the first-layer biases and weights.

• Two interpretations: (i) outputs of one-layer networks; or (ii) preactivations in the first layer of deeper networks.
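
The result of those Wick contractions is not reproduced above; for a single input $x$ and Gaussian initialization it is the exactly Gaussian first-layer distribution with kernel

$$
\mathbb{E}\big[z^{(1)}_i\,z^{(1)}_j\big] = \delta_{ij}\,G^{(1)},
\qquad
G^{(1)} = C_b + C_W\,\frac{1}{n_0}\sum_{j=1}^{n_0} x_j\,x_j,
$$

with all higher connected correlators vanishing, neuron by neuron.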

Pages 50–53: Lecture 3: The Principle of Sparsity

Statistics of [the first-layer NTK]

(Equation-only build slides: the first-layer NTK follows directly from its definition.)

Page 54: Lecture 3: The Principle of Sparsity

Statistics of [the first-layer NTK]

• “Deterministic”: it doesn't depend on any particular initialization; you always get the same number.

• “Frozen”: it cannot evolve during training; no representation learning.
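
These two properties refer to the first-layer NTK, which for a single input is presumably

$$
\widehat H^{(1)}_{ij} = \delta_{ij}\,H^{(1)},
\qquad
H^{(1)} = \lambda_b + \lambda_W\,\frac{1}{n_0}\sum_{j=1}^{n_0} x_j\,x_j :
$$

it contains no random parameters ("deterministic"), and because the one-layer output is linear in its parameters, its value cannot change during training ("frozen").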


Page 57: Lecture 3: The Principle of Sparsity

Statistics of [the first-layer NTK]

• No representation learning.

• No algorithm dependence.

Page 58: Lecture 3: The Principle of Sparsity

Statistics of One-Layer Neural Networks

Page 59: Lecture 3: The Principle of Sparsity

Statistics of One-Layer Neural Networks

Linear dynamics:

Simple solution:
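
The dynamics and its solution are not reproduced above; assuming an MSE loss on the single training input (a standard choice consistent with PDLT, not stated explicitly here), the deterministic, frozen NTK gives

$$
z_i(t+1) = z_i(t) - \eta\,H^{(1)}\big(z_i(t) - y_i\big)
\quad\Longrightarrow\quad
z_i(t) = y_i + \big(1-\eta H^{(1)}\big)^{t}\,\big(z_i(0) - y_i\big),
$$

so for $0 < \eta H^{(1)} < 2$ the output relaxes exponentially onto the training target, and the statistics after training follow directly from the Gaussian statistics at initialization.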

Page 60: Lecture 3: The Principle of Sparsity

Statistics of One-Layer Neural Networks

statistics at initialization → statistics after training

Page 61: Lecture 3: The Principle of Sparsity

Statistics of One-Layer Neural Networks

• Same trivial statistics for infinite-width neural networks of any fixed depth.

• No representation learning, no algorithm dependence; not a good model of deep learning.

We must study deeper networks of finite width!

Page 62: Lecture 3: The Principle of Sparsity

3. Two-Layer Neural Networks

Pages 63–65: Lecture 3: The Principle of Sparsity

Statistics of [second-layer preactivations]

(Equation-only build slides: Wick contraction, then rearranging.)

Page 66: Lecture 3: The Principle of Sparsity

Statistics of [second-layer preactivations]

• Recursive.

• The 1/n width-scaling of the weight variance was important.
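
The recursion referred to here is not reproduced above; at leading order it is presumably the single-input kernel recursion

$$
G^{(2)} = C_b + C_W\,\big\langle\sigma(z)\,\sigma(z)\big\rangle_{G^{(1)}},
\qquad
\big\langle F(z)\big\rangle_{G} \equiv \int \frac{dz}{\sqrt{2\pi G}}\;e^{-z^2/(2G)}\,F(z),
$$

where the $C_W/n_1$ width-scaling of the second-layer weight variance is exactly what turns the sum over the $n_1$ first-layer neurons into a width-independent Gaussian expectation.

A minimal NumPy sketch of this check, assuming Gaussian initialization and the variance conventions above (the function name and parameter values are ours, not from the lecture):

```python
import numpy as np

def kernel_recursion_check(n1=512, n0=10, C_b=0.1, C_W=1.0,
                           n_nets=500, sigma=np.tanh, seed=0):
    """Monte-Carlo check of G^(2) = C_b + C_W * <sigma(z)^2>_{G^(1)}
    for a single input x, over an ensemble of random two-layer networks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n0)

    # First-layer kernel and the Gaussian expectation on the right-hand side.
    G1 = C_b + C_W * np.dot(x, x) / n0
    z = np.sqrt(G1) * rng.normal(size=200_000)        # z ~ N(0, G^(1))
    G2_theory = C_b + C_W * np.mean(sigma(z) ** 2)

    # Empirical second-layer two-point function <z^(2) z^(2)>,
    # averaged over neurons and over independently initialized networks.
    G2_empirical = 0.0
    for _ in range(n_nets):
        b1 = rng.normal(0.0, np.sqrt(C_b), size=n1)
        W1 = rng.normal(0.0, np.sqrt(C_W / n0), size=(n1, n0))
        b2 = rng.normal(0.0, np.sqrt(C_b), size=n1)
        W2 = rng.normal(0.0, np.sqrt(C_W / n1), size=(n1, n1))
        z2 = b2 + W2 @ sigma(b1 + W1 @ x)
        G2_empirical += np.mean(z2 ** 2) / n_nets

    return G2_theory, G2_empirical   # should agree up to Monte-Carlo error

print(kernel_recursion_check())
```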

Pages 67–72: Lecture 3: The Principle of Sparsity

Statistics of [second-layer preactivations]

(Equation-only build slides: repeated Wick contractions for the higher-point correlators.)

Nearly-Gaussian distribution for [the second-layer preactivations]

[Cf. Gaussian distribution in the first layer]

Page 73: Lecture 3: The Principle of Sparsity

Statistics of [second-layer preactivations]

• Gaussian in the infinite-width limit, too simple; specified by one number (one matrix, the kernel, more generally)

• Sparse description at O(1/n); specified by two numbers (two tensors more generally, one of them having four sample indices)

• Interacting neurons at finite width.
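
The "two numbers" are presumably the single-input kernel $G^{(2)}$ and the four-point vertex $V^{(2)}$; in PDLT's conventions the connected four-point correlator of the second-layer preactivations takes the form

$$
\mathbb{E}\big[z^{(2)}_{i_1} z^{(2)}_{i_2} z^{(2)}_{i_3} z^{(2)}_{i_4}\big]_{\rm connected}
= \frac{1}{n_1}\,\big(\delta_{i_1 i_2}\delta_{i_3 i_4}+\delta_{i_1 i_3}\delta_{i_2 i_4}+\delta_{i_1 i_4}\delta_{i_2 i_3}\big)\,V^{(2)},
$$

$$
V^{(2)} = C_W^2\,\Big[\big\langle\sigma(z)^2\,\sigma(z)^2\big\rangle_{G^{(1)}} - \big\langle\sigma(z)^2\big\rangle_{G^{(1)}}^2\Big],
$$

so inter-neuron correlations are nonzero but suppressed by $1/n_1$; with multiple inputs, $G$ and $V$ become the two- and four-sample-index tensors mentioned above.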

Pages 74–82: Lecture 3: The Principle of Sparsity

Statistics of [the second-layer NTK]

The second-layer NTK splits into two pieces:

• 1st piece, the same as before: the contribution of the second-layer parameters themselves (again, the 1/n width-scaling was important).

• 2nd piece, chain rule: the contribution of the first-layer parameters, propagated through the second layer.

Pages 83–84: Lecture 3: The Principle of Sparsity

Statistics of [the second-layer NTK]

• “Stochastic”: it fluctuates from instantiation to instantiation.

• “Defrosted”: it can evolve during training.

Putting things together, NTK forward equation:
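
The forward equation itself is not reproduced above; with the pedagogical simplifications and conventions stated earlier, PDLT's NTK forward equation from layer $\ell$ to $\ell+1$ reads

$$
\widehat H^{(\ell+1)}_{i_1 i_2}
= \delta_{i_1 i_2}\Big[\lambda_b + \frac{\lambda_W}{n_\ell}\sum_{j}\sigma\big(z^{(\ell)}_j\big)\,\sigma\big(z^{(\ell)}_j\big)\Big]
\;+\; \sum_{j,k} W^{(\ell+1)}_{i_1 j}\,W^{(\ell+1)}_{i_2 k}\,
\sigma'\big(z^{(\ell)}_j\big)\,\sigma'\big(z^{(\ell)}_k\big)\,\widehat H^{(\ell)}_{jk}.
$$

The bracket is the 1st piece, from the layer-$(\ell+1)$ biases and weights; the second term is the 2nd, chain-rule piece, and its random weights and activations are what make the finite-width NTK stochastic and defrosted.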

Pages 85–88: Lecture 3: The Principle of Sparsity

Statistics of [the second-layer NTK] and beyond

Fun for the weekend (solutions in §8, §11.2, and §∞.3):

[*for smooth activation functions]

Representation Learning

Pages 89–90: Lecture 3: The Principle of Sparsity

Statistics of Two-Layer Neural Networks

• Two interpretations: (i) outputs, NTK, … of a two-layer network; or (ii) preactivations, mid-layer NTK, … in the second layer of a deeper network.

• Neurons do talk to each other; they are statistically dependent.

• Yes representation learning (and yes algorithm dependence); they can now capture the rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?

Page 91: Lecture 3: The Principle of Sparsity

4. Deep Neural Networks

Pages 92–94: Lecture 3: The Principle of Sparsity

Statistics of [deeper-layer preactivations]

(Equation-only build slides: Wick contraction, rearranging, and keeping the leading terms at large width.)

Pages 95–97: Lecture 3: The Principle of Sparsity

Statistics of [deeper-layer preactivations and the NTK]

Two-point:

Four-point:

NTK mean:
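
The recursions on these slides are not reproduced above; at leading order in $1/n$, for a single input and equal hidden-layer widths, they are presumably

$$
\text{two-point:}\qquad G^{(\ell+1)} = C_b + C_W\,\big\langle\sigma(z)\,\sigma(z)\big\rangle_{G^{(\ell)}},
$$

$$
\text{four-point:}\qquad V^{(\ell+1)} = C_W^2\Big[\big\langle\sigma(z)^2\sigma(z)^2\big\rangle_{G^{(\ell)}} - \big\langle\sigma(z)^2\big\rangle_{G^{(\ell)}}^2\Big] + \chi_{\parallel}\big(G^{(\ell)}\big)^2\,V^{(\ell)},
\qquad
\chi_{\parallel}(G) \equiv C_W\,\frac{d}{dG}\big\langle\sigma(z)\,\sigma(z)\big\rangle_{G},
$$

$$
\text{NTK mean:}\qquad H^{(\ell+1)} = \lambda_b + \lambda_W\,\big\langle\sigma(z)\,\sigma(z)\big\rangle_{G^{(\ell)}} + C_W\,\big\langle\sigma'(z)\,\sigma'(z)\big\rangle_{G^{(\ell)}}\,H^{(\ell)},
$$

up to the exact normalization and width-ratio bookkeeping for $V$ spelled out in §4 of PDLT.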

Pages 98–99: Lecture 3: The Principle of Sparsity

Statistics of [the deeper-layer NTK]

NTK forward equation:

• 1st trivial piece

• 2nd chain-rule piece

Pages 100–102: Lecture 3: The Principle of Sparsity

Statistics of [deeper layers] and beyond

Similarly for:

• NTK fluctuations (§8)

• dNTK (§11.2)

• ddNTK (§∞.3)

Page 103: Lecture 3: The Principle of Sparsity

statistics at initialization → statistics after training

The Principle of Sparsity for WIDE Neural Networks [Dan: dynamics; Sho: statistics]

Pages 104–105: Lecture 3: The Principle of Sparsity

The Principle of Sparsity for WIDE Neural Networks

statistics at initialization → statistics after training

• Infinite width: specified by

• Large-but-finite width, at O(1/n): specified by
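
The lists completing "specified by" are not reproduced above; based on the recursions earlier and the slide on pages 100–102, they presumably read:

• Infinite width: the kernel $G^{(\ell)}$ and the frozen NTK mean $H^{(\ell)}$.

• Large-but-finite width at $O(1/n)$: additionally the four-point vertex $V^{(\ell)}$, the NTK fluctuations (§8), the dNTK (§11.2), and the ddNTKs (§∞.3).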

Pages 106–107: Lecture 3: The Principle of Sparsity

All determined through recursion relations (RG-flow interpretation: §4.6)

Next Lecture: Solving Recursions, “The Principle of Criticality,” for DEEP Neural Networks

Page 108: Lecture 3: The Principle of Sparsity

One more thing…