Lecture 3: The Principle of Sparsity
TRANSCRIPT
[§4, §8, §11.2, and §∞.3 of “The Principles of Deep Learning Theory (PDLT),” arXiv:2106.10165]
Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization (schematic reproduced after this slide):

• Problem 1: too many terms in general
• Problem 2: complicated mapping from the initial distributions over model parameters to the statistics at initialization
• Problem 3: complicated dynamics taking the statistics at initialization to the statistics after training

Dan has covered Dynamics (Problem 3); Sho will cover Statistics (Problem 2) for WIDE & DEEP neural networks:
Lecture 3: The Principle of Sparsity, deriving recursions
Lecture 4: The Principle of Criticality, solving recursions
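For orientation, here is a schematic of that Taylor expansion in PDLT-style notation (my reconstruction, not copied from the slide), with theta-star the fully-trained parameters and theta the parameters at initialization:

    z_i(x;\theta^\star) = z_i(x;\theta)
      + \sum_\mu \frac{dz_i}{d\theta_\mu}\,(\theta^\star_\mu - \theta_\mu)
      + \frac{1}{2}\sum_{\mu\nu} \frac{d^2 z_i}{d\theta_\mu\, d\theta_\nu}\,(\theta^\star_\mu - \theta_\mu)(\theta^\star_\nu - \theta_\nu) + \ldots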
Outline
1. Neural Networks 101
2. One-Layer Neural Networks
3. Two-Layer Neural Networks
4. Deep Neural Networks
1. Neural Networks 101
Neural Networks

The network is built layer by layer: each layer's preactivations are formed from the previous layer's activations (the activation function applied elementwise to the preactivations), and the last-layer preactivations are the network output. A hat on a quantity denotes its value at initialization.
Biases and weights (model parameters) are independently (and symmetrically) distributed, with variances set by the initialization hyperparameters; the weight variance carries a 1/(width) scaling so that the network has a good wide limit.
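A minimal numerical sketch of this setup (my own NumPy illustration, not from the slides; function names are arbitrary): biases and weights are drawn independently with variances C_b and C_W / n_in, the 1/n scaling being what gives the good wide limit.

    import numpy as np

    def init_params(widths, C_b=0.0, C_W=1.0, rng=None):
        """Draw biases and weights independently: Var[b] = C_b, Var[W] = C_W / n_in."""
        rng = rng or np.random.default_rng(0)
        params = []
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            b = rng.normal(0.0, np.sqrt(C_b), size=n_out) if C_b > 0 else np.zeros(n_out)
            W = rng.normal(0.0, np.sqrt(C_W / n_in), size=(n_out, n_in))
            params.append((b, W))
        return params

    def forward(params, x, sigma=np.tanh):
        """Return the preactivations in every layer; the last entry is the network output."""
        z = params[0][0] + params[0][1] @ x          # first-layer preactivations
        zs = [z]
        for b, W in params[1:]:
            z = b + W @ sigma(z)                     # z^(l+1) = b + W sigma(z^(l))
            zs.append(z)
        return zs

With widths like [n_0, n, n, n_out] and large n, repeated draws of init_params produce preactivation statistics approaching the kernel description derived below.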
One Aside on Gradient Descent
[Cf. Andrea’s “S” matrix]

Taylor expansion: expanding the change in the network output under a gradient-descent update brings in, order by order:
• the Neural Tangent Kernel (NTK) [Cf. Dan’s …]
• the differential of the NTK (dNTK) [Cf. Dan’s …]
• the ddNTK
(a schematic of this expansion follows below)
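Schematically (a reconstruction in PDLT-style notation, not verbatim from the slide): the parameter update and the induced change in the output are

    d\theta_\mu = -\eta \sum_\nu \lambda_{\mu\nu}\,\frac{dL}{d\theta_\nu}\,, \qquad
    dz_i = \sum_\mu \frac{dz_i}{d\theta_\mu}\,d\theta_\mu
         + \frac{1}{2}\sum_{\mu\nu} \frac{d^2 z_i}{d\theta_\mu\, d\theta_\nu}\,d\theta_\mu\, d\theta_\nu + \ldots

    \text{NTK:}\quad H_{i_1 i_2} \equiv \sum_{\mu\nu} \lambda_{\mu\nu}\,\frac{dz_{i_1}}{d\theta_\mu}\,\frac{dz_{i_2}}{d\theta_\nu}

The dNTK enters at the next order (one second derivative of the output), and the ddNTK at the order after that.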
Neural Tangent Kernel (NTK)
[Cf. Andrea’s “S” matrix]

Diagonal, group-by-group, learning rate: one learning rate per parameter group (biases and weights, layer by layer), with the weight learning rates rescaled by width so that the NTK has a good wide limit.
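In equations (following PDLT's conventions; reconstructed, not copied from the slide):

    \lambda_{b^{(\ell)}_{i_1} b^{(\ell)}_{i_2}} = \delta_{i_1 i_2}\,\lambda^{(\ell)}_b\,, \qquad
    \lambda_{W^{(\ell)}_{i_1 j_1} W^{(\ell)}_{i_2 j_2}} = \delta_{i_1 i_2}\,\delta_{j_1 j_2}\,\frac{\lambda^{(\ell)}_W}{n_{\ell-1}}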
Two Pedagogical Simplifications
1. Single input; drop sample indices
[See “PDLT” (arXiv:2106.10165) for more general cases.]
2. Layer-independent hyperparameters; drop layer indices from them
2. One-Layer Neural Networks
Statistics of First-Layer Preactivations

Wick contraction of the first-layer biases and weights gives the two-point function (a reconstruction follows this slide).

• Neurons don’t talk to each other; they are statistically independent.
• We marginalized over/integrated out the first-layer biases and weights.
• Two interpretations: (i) outputs of one-layer networks; or (ii) preactivations in the first layer of deeper networks.
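Concretely (single input x, PDLT-style notation; my reconstruction), the Wick contraction gives a diagonal two-point function governed by the first-layer kernel,

    \mathbb{E}\big[\hat z^{(1)}_{i}\,\hat z^{(1)}_{j}\big] = \delta_{ij}\,K^{(1)}\,, \qquad
    K^{(1)} = C_b + C_W\,\frac{1}{n_0}\sum_{k} x_k\, x_k\,,

and all higher connected correlators vanish, so the first-layer preactivation distribution is exactly Gaussian.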
Statistics of the First-Layer NTK

• “Deterministic”: it doesn’t depend on any particular initialization; you always get the same number.
• “Frozen”: it cannot evolve during training; no representation learning.
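For reference (single input, my reconstruction in PDLT-style notation), the first-layer NTK contains no random parameters at all, which is why it is deterministic and frozen:

    \hat H^{(1)}_{i_1 i_2} = \delta_{i_1 i_2}\Big[\lambda_b + \lambda_W\,\frac{1}{n_0}\sum_k x_k\, x_k\Big]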
Statistics of the First-Layer dNTK and ddNTK

A one-layer network is linear in its parameters, so second and higher parameter derivatives of the output vanish, and with them the dNTK and ddNTK:
• No representation learning.
• No algorithm dependence.
Statistics of One-Layer Neural Networks

Linear dynamics: with a deterministic, frozen NTK, gradient descent acts linearly on the network output.
Simple solution: the dynamics solve in closed form, taking the statistics at initialization directly to the statistics after training (a sketch follows below).
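As a sketch (assuming an MSE loss with targets y; this specific form is my reconstruction, not the slide's), the frozen-NTK dynamics and their solution are

    z(t+1) - y = \big(1 - \eta\, H^{(1)}\big)\,\big(z(t) - y\big)
    \;\;\Rightarrow\;\;
    z(t) = y + \big(1 - \eta\, H^{(1)}\big)^{t}\,\big(z(0) - y\big)\,,

so the statistics after training follow immediately from the statistics at initialization.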
Statistics of One-Layer Neural Networks
• Same trivial statistics for infinite-width neural networks of any fixed depth.
• No representation learning, no algorithm dependence; not a good model of deep learning.
We must study deeper networks of finite width!
3. Two-Layer Neural Networks
Statistics of Second-Layer Preactivations

Two-point function: Wick-contract the second-layer biases and weights, then arrange the result in terms of first-layer statistics.
• Recursive.
• The 1/(width) scaling of the weight variance was important.
Four-point function: repeated Wick contractions of the second-layer parameters leave, in addition to the Gaussian piece, a connected contribution suppressed by the width of the first layer, with a four-point vertex built from first-layer statistics.

Nearly-Gaussian distribution for the second-layer preactivations
[Cf. the Gaussian distribution in the first layer.]
Statistics of Second-Layer Preactivations

• Gaussian in the infinite-width limit, too simple; specified by one number (more generally one matrix, the kernel).
• Sparse description at O(1/n); specified by two numbers (more generally two tensors, one of them having four sample indices).
• Interacting neurons at finite width.
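Concretely (single input, my reconstruction in PDLT-style notation, with ⟨·⟩_K a Gaussian expectation of variance K): the two numbers are the second-layer kernel and four-point vertex,

    K^{(2)} = C_b + C_W\,\big\langle \sigma(z)\,\sigma(z) \big\rangle_{K^{(1)}}\,, \qquad
    V^{(2)} = C_W^2\Big[\big\langle \sigma(z)^2\,\sigma(z)^2 \big\rangle_{K^{(1)}} - \big\langle \sigma(z)^2 \big\rangle_{K^{(1)}}^2\Big]\,,

with the connected four-point correlator scaling as V^{(2)}/n_1.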
Statistics of the Second-Layer NTK

It splits into two pieces: a 1st piece from the second-layer parameters and a 2nd piece from the first-layer parameters.

1st piece, the same as before: a direct contribution from the second-layer biases and weights; the 1/(width) scaling of the weight learning rate was important.
2nd piece, chain rule: derivatives with respect to the first-layer parameters pass through the activation, so this piece comes with factors of the activation's derivative and the first-layer NTK.
Putting things together, NTK forward equation for the second layer (mean version reconstructed after this slide):

• “Stochastic”: it fluctuates from instantiation to instantiation.
• “Defrosted”: it can evolve during training.
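A reconstruction of the mean of that forward equation (single input, PDLT-style notation; not verbatim from the slide): a trivial piece from the last-layer parameters plus a chain-rule piece proportional to the first-layer NTK,

    H^{(2)} = \lambda_b + \lambda_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(1)}}
            + C_W\,\big\langle \sigma'(z)\,\sigma'(z)\big\rangle_{K^{(1)}}\,H^{(1)}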
Statistics of the Second-Layer NTK and Beyond

Fun for the weekend (solutions in §8, §11.2, and §∞.3): work out, in the same way, the NTK fluctuations, the dNTK, and the ddNTK.*
[*for smooth activation functions]

Representation Learning: with a stochastic, defrosted NTK at finite width, the learned features can evolve during training.
Statistics of Two-Layer Neural Networks

• Two interpretations: (i) outputs, NTK, … of a two-layer network; or (ii) preactivations, mid-layer NTK, … in the second layer of a deeper network.
• Neurons do talk to each other; they are statistically dependent.
• Yes representation learning (and yes algorithm dependence); they can now capture rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?
4. Deep Neural Networks
Statistics of Deeper-Layer Preactivations

Wick-contract the current layer's biases and weights, arrange the result in terms of the previous layer's statistics, and keep the leading terms at large width.
This yields recursions for the two-point function (kernel), the four-point function (vertex), and the NTK.

NTK forward equation: as in the second layer, a 1st, trivial piece from the current layer's parameters plus a 2nd, chain-rule piece through the previous layers.

Two-point, four-point, and NTK-mean recursions: collected schematically after this slide.
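Collecting the recursions schematically (single input, layer-independent hyperparameters, my reconstruction in PDLT-style notation; ⟨·⟩_K is a Gaussian expectation of variance K):

    \text{Two-point:}\quad K^{(\ell+1)} = C_b + C_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(\ell)}}

    \text{NTK mean:}\quad H^{(\ell+1)} = \lambda_b + \lambda_W\,\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K^{(\ell)}}
                 + C_W\,\big\langle \sigma'(z)\,\sigma'(z)\big\rangle_{K^{(\ell)}}\,H^{(\ell)}

    \text{Four-point:}\quad V^{(\ell+1)} = C_W^2\Big[\big\langle \sigma(z)^2\,\sigma(z)^2\big\rangle_{K^{(\ell)}} - \big\langle \sigma(z)^2\big\rangle_{K^{(\ell)}}^2\Big]
                 + \chi_\parallel\big(K^{(\ell)}\big)^2\,V^{(\ell)}\,, \qquad
    \chi_\parallel(K) \equiv C_W\,\frac{d}{dK}\big\langle \sigma(z)\,\sigma(z)\big\rangle_{K}

Solving these recursions as a function of depth is the subject of the next lecture.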
Statistics of Deeper Layers and Beyond

Similarly for:
• NTK fluctuations (§8)
• dNTK (§11.2)
• ddNTK (§∞.3)

statistics at initialization → statistics after training
The Principle of Sparsity for WIDE Neural Networks

statistics at initialization → statistics after training [Dynamics: Dan; Statistics: Sho]

• Infinite width: specified by the kernel and the NTK mean.
• Large-but-finite width at O(1/n): specified in addition by a handful of finite-width tensors (four-point vertex, NTK fluctuations, dNTK, ddNTK).

All determined through recursion relations (RG-flow interpretation: §4.6).
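As an illustration of "all determined through recursion relations," here is a minimal numerical sketch (my own, not from the slides; names and the tanh choice are illustrative) that iterates the single-input kernel and NTK-mean recursions using Gauss–Hermite quadrature for the Gaussian expectations.

    import numpy as np

    def gauss_avg(f, K, n_pts=101):
        """Gaussian expectation <f(z)> with z ~ N(0, K), via Gauss-Hermite quadrature."""
        x, w = np.polynomial.hermite_e.hermegauss(n_pts)   # nodes/weights for exp(-x^2/2)
        z = np.sqrt(K) * x
        return np.sum(w * f(z)) / np.sum(w)

    def run_recursions(L, C_b, C_W, lam_b, lam_W, K1, H1,
                       sigma=np.tanh, d_sigma=lambda z: 1 - np.tanh(z) ** 2):
        """Iterate the single-input kernel and NTK-mean recursions for L layers."""
        K, H = K1, H1
        for _ in range(L - 1):
            avg_ss = gauss_avg(lambda z: sigma(z) * sigma(z), K)
            avg_dd = gauss_avg(lambda z: d_sigma(z) * d_sigma(z), K)
            K = C_b + C_W * avg_ss                            # kernel recursion
            H = lam_b + lam_W * avg_ss + C_W * avg_dd * H     # NTK-mean recursion
        return K, H

    # Example: first-layer values for an input with x.x / n_0 = 1.
    K, H = run_recursions(L=10, C_b=0.0, C_W=2.0, lam_b=1.0, lam_W=1.0, K1=2.0, H1=2.0)
    print(K, H)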
Next Lecture: Solving Recursions, “The Principle of Criticality” for DEEP Neural Networks
One more thing…