Lecture 3: The Principle of Sparsity
TRANSCRIPT
[§4, §8, §11.2, and §∞.3 of “The Principles of Deep Learning Theory (PDLT),” arXiv:2106.10165]
Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization (schematic reproduced after this slide):

• Problem 1: too many terms in general
• Problem 2: complicated mapping from the initial distributions over model parameters to the statistics at initialization
• Problem 3: complicated dynamics taking the statistics at initialization to the statistics after training

Dan has covered Dynamics (Problem 3); Sho will cover Statistics (Problem 2) for WIDE & DEEP neural networks:
Lecture 3: The Principle of Sparsity, deriving recursions
Lecture 4: The Principle of Criticality, solving recursions
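For orientation, here is a schematic of that Taylor expansion in PDLT-style notation (my reconstruction, not copied from the slide), with theta-star the fully-trained parameters and theta the parameters at initialization:

    z_i(x;\theta^\star) = z_i(x;\theta)
      + \sum_\mu \frac{dz_i}{d\theta_\mu}\,(\theta^\star_\mu - \theta_\mu)
      + \frac{1}{2}\sum_{\mu\nu} \frac{d^2 z_i}{d\theta_\mu\, d\theta_\nu}\,(\theta^\star_\mu - \theta_\mu)(\theta^\star_\nu - \theta_\nu) + \ldots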
Outline
1. Neural Networks 101
2. One-Layer Neural Networks
3. Two-Layer Neural Networks
4. Deep Neural Networks
1. Neural Networks 101
Neural Networks

The network is built layer by layer: each layer's preactivations are formed from the previous layer's activations (the activation function applied elementwise to the preactivations), and the last-layer preactivations are the network output. A hat on a quantity denotes its value at initialization.
Biases and weights (model parameters) are independently (and symmetrically) distributed, with variances set by the initialization hyperparameters; the weight variance carries a 1/(width) scaling so that the network has a good wide limit.
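A minimal numerical sketch of this setup (my own NumPy illustration, not from the slides; function names are arbitrary): biases and weights are drawn independently with variances C_b and C_W / n_in, the 1/n scaling being what gives the good wide limit.

    import numpy as np

    def init_params(widths, C_b=0.0, C_W=1.0, rng=None):
        """Draw biases and weights independently: Var[b] = C_b, Var[W] = C_W / n_in."""
        rng = rng or np.random.default_rng(0)
        params = []
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            b = rng.normal(0.0, np.sqrt(C_b), size=n_out) if C_b > 0 else np.zeros(n_out)
            W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
            params.append((b, W))
        return params

    def forward(params, x, sigma=np.tanh):
        """Return the preactivations in every layer; the last entry is the network output."""
        z = params[0][0] + params[0][1] @ x          # first-layer preactivations
        zs = [z]
        for b, W in params[1:]:
            z = b + W @ sigma(z)                     # z^(l+1) = b + W sigma(z^(l))
            zs.append(z)
        return zs

With widths like [n_0, n, n, n_out] and large n, repeated draws of init_params produce preactivation statistics approaching the kernel description derived below.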
One Aside on Gradient Descent
[Cf. Andrea’s “S” matrix]

Taylor expansion: expanding the change in the network output under a gradient-descent update brings in, order by order:
• the Neural Tangent Kernel (NTK) [Cf. Dan’s …]
• the differential of the NTK (dNTK) [Cf. Dan’s …]
• the ddNTK
(a schematic of this expansion follows below)
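Schematically (a reconstruction in PDLT-style notation, not verbatim from the slide): the parameter update and the induced change in the output are

    d\theta_\mu = -\eta \sum_\nu \lambda_{\mu\nu}\,\frac{dL}{d\theta_\nu}\,, \qquad
    dz_i = \sum_\mu \frac{dz_i}{d\theta_\mu}\,d\theta_\mu
         + \frac{1}{2}\sum_{\mu\nu} \frac{d^2 z_i}{d\theta_\mu\, d\theta_\nu}\,d\theta_\mu\, d\theta_\nu + \ldots

    \text{NTK:}\quad H_{i_1 i_2} \equiv \sum_{\mu\nu} \lambda_{\mu\nu}\,\frac{dz_{i_1}}{d\theta_\mu}\,\frac{dz_{i_2}}{d\theta_\nu}

The dNTK enters at the next order (one second derivative of the output), and the ddNTK at the order after that.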
Neural Tangent Kernel (NTK)
[Cf. Andrea’s “S” matrix]

Diagonal, group-by-group, learning rate: one learning rate per parameter group (biases and weights, layer by layer), with the weight learning rates rescaled by width so that the NTK has a good wide limit.
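In equations (following PDLT's conventions; reconstructed, not copied from the slide):

    \lambda_{b^{(\ell)}_{i_1} b^{(\ell)}_{i_2}} = \delta_{i_1 i_2}\,\lambda^{(\ell)}_b\,, \qquad
    \lambda_{W^{(\ell)}_{i_1 j_1} W^{(\ell)}_{i_2 j_2}} = \delta_{i_1 i_2}\,\delta_{j_1 j_2}\,\frac{\lambda^{(\ell)}_W}{n_{\ell-1}}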
Two Pedagogical Simplifications
1. Single input; drop sample indices
[See “PDLT” (arXiv:2106.10165) for more general cases.]
2. Layer-independent hyperparameters; drop layer indices from them
2. One-Layer Neural Networks
Statistics of First-Layer Preactivations

Wick contraction of the first-layer biases and weights gives the two-point function (a reconstruction follows this slide).

• Neurons don’t talk to each other; they are statistically independent.
• We marginalized over/integrated out the first-layer biases and weights.
• Two interpretations: (i) outputs of one-layer networks; or (ii) preactivations in the first layer of deeper networks.
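Concretely (single input x, PDLT-style notation; my reconstruction), the Wick contraction gives a diagonal two-point function governed by the first-layer kernel,

    \mathbb{E}\big[\hat z^{(1)}_{i}\,\hat z^{(1)}_{j}\big] = \delta_{ij}\,K^{(1)}\,, \qquad
    K^{(1)} = C_b + C_W\,\frac{1}{n_0}\sum_{k} x_k\, x_k\,,

and all higher connected correlators vanish, so the first-layer preactivation distribution is exactly Gaussian.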
Statistics of the First-Layer NTK

• “Deterministic”: it doesn’t depend on any particular initialization; you always get the same number.
• “Frozen”: it cannot evolve during training; no representation learning.
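For reference (single input, my reconstruction in PDLT-style notation), the first-layer NTK contains no random parameters at all, which is why it is deterministic and frozen:

    \hat H^{(1)}_{i_1 i_2} = \delta_{i_1 i_2}\Big[\lambda_b + \lambda_W\,\frac{1}{n_0}\sum_k x_k\, x_k\Big]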
Statistics of the First-Layer dNTK and ddNTK

A one-layer network is linear in its parameters, so second and higher parameter derivatives of the output vanish, and with them the dNTK and ddNTK:
• No representation learning.
• No algorithm dependence.
Statistics of One-Layer Neural Networks

Linear dynamics: with a deterministic, frozen NTK, gradient descent acts linearly on the network output.
Simple solution: the dynamics solve in closed form, taking the statistics at initialization directly to the statistics after training (a sketch follows below).
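As a sketch (assuming an MSE loss with targets y; this specific form is my reconstruction, not the slide's), the frozen-NTK dynamics and their solution are

    z(t+1) - y = \big(1 - \eta\, H^{(1)}\big)\,\big(z(t) - y\big)
    \;\;\Rightarrow\;\;
    z(t) = y + \big(1 - \eta\, H^{(1)}\big)^{t}\,\big(z(0) - y\big)\,,

so the statistics after training follow immediately from the statistics at initialization.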
Statistics of One-Layer Neural Networks
• Same trivial statistics for infinite-width neural networks of any fixed depth.
• No representation learning, no algorithm dependence; not a good model of deep learning.
We must study deeper networks of finite width!
3. Two-Layer Neural Networks
Statistics of Second-Layer Preactivations

Two-point function: Wick-contract the second-layer biases and weights, then arrange the result in terms of first-layer statistics.
• Recursive.
• The 1/(width) scaling of the weight variance was important.
Four-point function: repeated Wick contractions of the second-layer parameters leave, in addition to the Gaussian piece, a connected contribution suppressed by the width of the first layer, with a four-point vertex built from first-layer statistics.

Nearly-Gaussian distribution for the second-layer preactivations
[Cf. the Gaussian distribution in the first layer.]
Statistics of Second-Layer Preactivations

• Gaussian in the infinite-width limit, too simple; specified by one number (more generally one matrix, the kernel).
• Sparse description at O(1/n); specified by two numbers (more generally two tensors, one of them having four sample indices).
• Interacting neurons at finite width.
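Concretely (single input, my reconstruction in PDLT-style notation, with ⟨·⟩_K a Gaussian expectation of variance K): the two numbers are the second-layer kernel and four-point vertex,

    K^{(2)} = C_b + C_W\,\big\langle \sigma(z)\,\sigma(z) \big\rangle_{K^{(1)}}\,, \qquad
    V^{(2)} = C_W^2\Big[\big\langle \sigma(z)^2\,\sigma(z)^2 \big\rangle_{K^{(1)}} - \big\langle \sigma(z)^2 \big\rangle_{K^{(1)}}^2\Big]\,,

with the connected four-point correlator scaling as V^{(2)}/n_1.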
Statistics of the Second-Layer NTK

It splits into two pieces: a 1st piece from the second-layer parameters and a 2nd piece from the first-layer parameters.

1st piece, the same as before: a direct contribution from the second-layer biases and weights; the 1/(width) scaling of the weight learning rate was important.
2nd piece, chain rule: derivatives with respect to the first-layer parameters pass through the activation, so this piece comes with factors of the activation's derivative and the first-layer NTK.
Putting things together, NTK forward equation for the second layer (mean version reconstructed after this slide):

• “Stochastic”: it fluctuates from instantiation to instantiation.
• “Defrosted”: it can evolve during training.
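A reconstruction of the mean of that forward equation (single input, PDLT-style notation; not verbatim from the slide): a trivial piece from the last-layer parameters plus a chain-rule piece proportional to the first-layer NTK,

    H^{(2)} = \lambda_b + \lambda_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(1)}}
            + C_W\,\big\langle \sigma'(z)\,\sigma'(z)\big\rangle_{K^{(1)}}\,H^{(1)}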
Statistics of the Second-Layer NTK and Beyond

Fun for the weekend (solutions in §8, §11.2, and §∞.3): work out, in the same way, the NTK fluctuations, the dNTK, and the ddNTK.*
[*for smooth activation functions]

Representation Learning: with a stochastic, defrosted NTK at finite width, the learned features can evolve during training.
Statistics of Two-Layer Neural Networks

• Two interpretations: (i) outputs, NTK, … of a two-layer network; or (ii) preactivations, mid-layer NTK, … in the second layer of a deeper network.
• Neurons do talk to each other; they are statistically dependent.
• Yes representation learning (and yes algorithm dependence); they can now capture rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?
4. Deep Neural Networks
Statistics of Deeper-Layer Preactivations

Wick-contract the current layer's biases and weights, arrange the result in terms of the previous layer's statistics, and keep the leading terms at large width.
This yields recursions for the two-point function (kernel), the four-point function (vertex), and the NTK.

NTK forward equation: as in the second layer, a 1st, trivial piece from the current layer's parameters plus a 2nd, chain-rule piece through the previous layers.

Two-point, four-point, and NTK-mean recursions: collected schematically after this slide.
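Collecting the recursions schematically (single input, layer-independent hyperparameters, my reconstruction in PDLT-style notation; ⟨·⟩_K is a Gaussian expectation of variance K):

    \text{Two-point:}\quad K^{(\ell+1)} = C_b + C_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(\ell)}}

    \text{NTK mean:}\quad H^{(\ell+1)} = \lambda_b + \lambda_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(\ell)}}
                 + C_W\,\big\langle \sigma'(z)\,\sigma'(z)\big\rangle_{K^{(\ell)}}\,H^{(\ell)}

    \text{Four-point:}\quad V^{(\ell+1)} = C_W^2\Big[\big\langle \sigma(z)^2\,\sigma(z)^2\big\rangle_{K^{(\ell)}} - \big\langle \sigma(z)^2\big\rangle_{K^{(\ell)}}^2\Big]
                 + \chi_\parallel\big(K^{(\ell)}\big)^2\,V^{(\ell)}\,, \qquad
    \chi_\parallel(K) \equiv C_W\,\frac{d}{dK}\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K}

Solving these recursions as a function of depth is the subject of the next lecture.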
Statistics of Deeper Layers and Beyond

Similarly for:
• NTK fluctuations (§8)
• dNTK (§11.2)
• ddNTK (§∞.3)

statistics at initialization → statistics after training
The Principle of Sparsity for WIDE Neural Networks

statistics at initialization → statistics after training [Dynamics: Dan; Statistics: Sho]

• Infinite width: specified by the kernel and the NTK mean.
• Large-but-finite width at O(1/n): specified in addition by a handful of finite-width tensors (four-point vertex, NTK fluctuations, dNTK, ddNTK).

All determined through recursion relations (RG-flow interpretation: §4.6).
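As an illustration of "all determined through recursion relations," here is a minimal numerical sketch (my own, not from the slides; names and the tanh choice are illustrative) that iterates the single-input kernel and NTK-mean recursions using Gauss–Hermite quadrature for the Gaussian expectations.

    import numpy as np

    def gauss_avg(f, K, n_pts=101):
        """Gaussian expectation <f(z)> with z ~ N(0, K), via Gauss-Hermite quadrature."""
        x, w = np.polynomial.hermite_e.hermegauss(n_pts)   # nodes/weights for exp(-x^2/2)
        z = np.sqrt(K) * x
        return np.sum(w * f(z)) / np.sum(w)

    def run_recursions(L, C_b, C_W, lam_b, lam_W, K1, H1,
                       sigma=np.tanh, d_sigma=lambda z: 1 - np.tanh(z) ** 2):
        """Iterate the single-input kernel and NTK-mean recursions for L layers."""
        K, H = K1, H1
        for _ in range(L - 1):
            avg_ss = gauss_avg(lambda z: sigma(z) * sigma(z), K)
            avg_dd = gauss_avg(lambda z: d_sigma(z) * d_sigma(z), K)
            K = C_b + C_W * avg_ss                            # kernel recursion
            H = lam_b + lam_W * avg_ss + C_W * avg_dd * H     # NTK-mean recursion
        return K, H

    # Example: first-layer values for an input with x.x / n_0 = 1.
    K, H = run_recursions(L=10, C_b=0.0, C_W=2.0, lam_b=1.0, lam_W=1.0, K1=2.0, H1=2.0)
    print(K, H)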
Next Lecture: Solving Recursions, “The Principle of Criticality” for DEEP Neural Networks
One more thing…