Bayesian Deep Learning (MLSS 2019) · 2019. 8. 27.
TRANSCRIPT
Bayesian Deep Learning (MLSS 2019)
Yarin Gal
University of Oxford
yarin@cs.ox.ac.uk
Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license
Contents

Today:
- Introduction
- The Language of Uncertainty
- Bayesian Probabilistic Modelling
- Bayesian Probabilistic Modelling of Functions
Bayesian Deep Learning: Introduction
With great power...

- Many engineering advances in ML
- Systems applied to toy data → deployed in real-life settings
- Control handed over to automated systems, with many scenarios that can become life-threatening to humans
  - Medical: automated decision making or recommendation systems
  - Automotive: autonomous control of drones and self-driving cars
  - High-frequency trading: ability to affect economic markets on a global scale
- But all of these can be quite dangerous...
Example: Medical Diagnostics

- dAIbetes: an exciting new startup (not really)
- claims to automatically diagnose diabetic retinopathy
- accuracy 99% on their 4 train/test patients
- an engineer trained two deep learning systems to predict probability y given input fundus image x

The engineer runs their system on your fundus image x∗ (RHS):
- Which model, f1 or f2, would you want the engineer to use for your diagnosis? None of these! ('I don't know')
Example: Autonomous Driving

Autonomous systems
- Range from simple robotic vacuums to self-driving cars
- Largely divided into systems which
  - control behaviour with rule-based systems
  - learn and adapt to the environment

Both can make use of ML tools
- ML for low-level feature extraction (perception)
- reinforcement learning
Example: Autonomous Driving (cont.)

Real-world example: assisted driving
- first fatality of assisted driving (June 2016)
- the low-level system failed to distinguish the white side of a trailer from the bright sky

If the system had identified its own uncertainty:
- alert the user to take control of steering
- propagate uncertainty to decision making
Point estimates

In medical / robotics / science...
✗ can't use ML models that give a single point estimate (a single value) as a prediction
✓ must use ML models that give an answer like '10 but I'm uncertain', or '10 ± 5'
- Give me a distribution over possible outcomes!
Example: Autonomous Driving (cont.)

ML pipeline in self-driving cars
- process raw sensory input with perception models
  - e.g. image segmentation to find where other cars and pedestrians are
- output fed into a prediction model
  - e.g. where another car will go
- output fed into 'higher-level' decision making procedures
  - e.g. a rule-based system ("cyclist to your left → do not steer left")
- industry is starting to use uncertainty for many components in the pipeline
  - e.g. pedestrian prediction models predict a distribution of pedestrian locations X timesteps ahead
  - or uncertainty in perception components
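A toy sketch of how a distribution, rather than a point estimate, could flow through such a pipeline. All numbers, stage names, and thresholds below are hypothetical, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy pipeline: perception outputs a *distribution* over a pedestrian's
# position, prediction propagates every sample forward, and a rule-based
# decision stage consumes the whole distribution.
def perceive():
    # hypothetical perception output: 1000 position samples (metres)
    return 5.0 + 1.5 * rng.standard_normal(1000)

def predict(pos_samples, velocity=-1.0, timesteps=3):
    # propagate each sample forward instead of a single point estimate
    return pos_samples + velocity * timesteps

def decide(future_samples, danger_zone=1.0):
    # brake if there is non-trivial probability the pedestrian ends up close
    p_close = (future_samples < danger_zone).mean()
    return "brake" if p_close > 0.05 else "continue"

print(decide(predict(perceive())))
```

A point-estimate pipeline would pass only the mean position (2 m, outside the danger zone) and never brake; propagating the distribution exposes the tail risk.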
Sources of uncertainty

- Above are some examples of uncertainty
- There are many other sources of uncertainty
- Test data is very dissimilar to training data
  - model trained on diabetes fundus photos of subpopulation A
  - never saw subpopulation B
  - "images are outside the data distribution the model was trained on"
- desired behaviour
  - return a prediction (attempting to extrapolate)
  - + information that the image lies outside the data distribution
  - (model retrained with subpop. B labels → low uncertainty on these)
Sources of uncertainty (cont.)

- Uncertainty in the model parameters that best explain the data
  - a large number of possible models can explain a dataset
  - we are uncertain which model parameters to choose to predict with
  - this affects how we predict at new test points
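The point above can be made concrete with a toy bootstrap ensemble: many parameter settings fit the same data, and their disagreement grows away from it. This is a rough stand-in for parameter uncertainty, not the Bayesian machinery developed later; the cubic data and seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 20 noisy points on a small interval.
x = rng.uniform(-1, 1, size=20)
y = x**3 + 0.1 * rng.standard_normal(20)

# Fit several cubic polynomials on bootstrap resamples of the data;
# each fit is one plausible setting of the model parameters.
fits = []
for _ in range(50):
    idx = rng.integers(0, len(x), size=len(x))
    fits.append(np.polyfit(x[idx], y[idx], deg=3))

# Disagreement between the fits stands in for parameter uncertainty:
# one test point inside the data range, one far outside it.
x_test = np.array([0.0, 3.0])
preds = np.array([np.polyval(c, x_test) for c in fits])
std = preds.std(axis=0)
print(std)  # spread is much larger at x=3 than at x=0
```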
Sources of uncertainty (cont.)

- Training labels are noisy
  - measurement imprecision
  - expert mistakes
  - crowd-sourced labels
- even with infinite data → ambiguity inherent in the data itself
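A quick numerical illustration of that last bullet: more data pins down the *average* label ever more precisely, but the spread of individual labels never shrinks below the noise level. The quadratic "true function" and the 0.5 noise level are made up for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def labels(x, n):
    # Hypothetical noisy labelling process: true value x**2 plus
    # measurement noise with standard deviation 0.5.
    return x**2 + 0.5 * rng.standard_normal(n)

# The *mean* label at x=2 (true value 4) is estimated ever more
# precisely with more data...
small, large = labels(2.0, 100), labels(2.0, 100_000)
print(abs(small.mean() - 4.0), abs(large.mean() - 4.0))

# ...but the spread of individual labels stays at the noise level (~0.5)
# no matter how many labels we collect: this ambiguity is inherent
# in the data itself.
print(small.std(), large.std())
```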
Deep learning models are deterministic

Deep learning does not capture uncertainty:
- regression models output a single scalar/vector
- classification models output a probability vector (erroneously interpreted as model uncertainty)

[Figure: extrapolation plot, x-axis years 2010–2045, y-axis 'Percentage of Firsts']
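To see why a softmax output is not model uncertainty, consider a hypothetical two-class linear classifier (the weights below are made up): far from anywhere training data could plausibly live, the softmax saturates and the model looks highly "confident":

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# A made-up linear classifier for two classes: logits are a linear
# function of a 1-D input x.
W = np.array([2.0, -2.0])   # hypothetical learned weights
logits = lambda x: W * x

# Near the decision boundary the softmax is appropriately uncertain...
print(softmax(logits(0.1)))    # roughly [0.6, 0.4]

# ...but far from any training data it saturates: extreme "confidence"
# on an input the model has never seen anything like, which is exactly
# the erroneous reading of softmax outputs as model uncertainty.
print(softmax(logits(100.0)))
```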
But when combined with probability theory, deep learning can capture uncertainty in a principled way
→ this is known as Bayesian Deep Learning
Teaser: Uncertainty in Autonomous Driving
Teaser

Define the model and train on data x_train, y_train:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout

inputs = Input(shape=(1,))
x = Dense(512, activation="relu")(inputs)
x = Dropout(0.5)(x, training=True)  # dropout stays on at test time
x = Dense(512, activation="relu")(x)
x = Dropout(0.5)(x, training=True)
outputs = Dense(1)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss="mean_squared_error", optimizer="adam")
model.fit(x_train, y_train)
Teaser

# do stochastic forward passes on x_test:
samples = [model.predict(x_test) for _ in range(100)]
m = np.mean(samples, axis=0)  # predictive mean
v = np.var(samples, axis=0)   # predictive variance

# plot mean and uncertainty
plt.plot(x_test, m)
plt.fill_between(x_test, m - 2*v**0.5, m + 2*v**0.5,
                 alpha=0.1)  # plot two std (95% confidence)

Playground (working code): bdl101.ml/play
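The mechanics of the Keras snippets above can be mimicked in plain NumPy. This is a minimal sketch only: the weights are random and untrained (the shapes mirror the Dense(512)/Dropout(0.5) stack), just to show where the stochastic forward passes and the predictive mean and variance come from:

```python
import numpy as np

rng = np.random.default_rng(42)

# A tiny untrained network with the same structure as the Keras model:
# Dense(512, relu) -> Dropout(0.5) -> Dense(512, relu) -> Dropout(0.5) -> Dense(1)
W1 = rng.standard_normal((1, 512)) * 0.05
W2 = rng.standard_normal((512, 512)) * 0.05
W3 = rng.standard_normal((512, 1)) * 0.05

def stochastic_forward(x):
    # Dropout stays *on* at test time: each call samples fresh binary
    # masks (keep probability 0.5, rescaled), so repeated calls give
    # different predictions.
    h = np.maximum(x @ W1, 0)
    h = h * (rng.random(h.shape) < 0.5) / 0.5
    h = np.maximum(h @ W2, 0)
    h = h * (rng.random(h.shape) < 0.5) / 0.5
    return h @ W3

x_test = np.array([[1.0]])
samples = np.stack([stochastic_forward(x_test) for _ in range(100)])
m = samples.mean(axis=0)  # predictive mean
v = samples.var(axis=0)   # predictive variance
print(m.ravel(), v.ravel())
```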
Bayesian deep learning

Today and tomorrow we'll understand why this code makes sense, and get a taste of
- the formal language of uncertainty (Bayesian probability theory)
- tools to use this language in ML (Bayesian probabilistic modelling)
- techniques to scale to real-world deep learning systems (modern variational inference)
- developing big deep learning systems which convey uncertainty
- with real-world examples

Basic concepts are marked green (if you want to use these as tools); advanced topics are marked amber (if you want to develop new ideas in BDL).
Bayesian Probability Theory: the Language of Uncertainty

Deriving the laws of probability theory from rational degrees of belief
Betting game 1 (some philosophy for the soul)

import numpy as np

def toss():
    if np.random.rand() < 0.5:
        print('Heads')
    else:
        print('Tails')

- unit wager: a 'promise note' where the seller commits to pay the note's owner £1 if the outcome of toss() is 'heads'; a tradeable note; e.g.
- would you pay p = £0.01 for a unit wager on 'heads'?
  - pay a penny to buy a note where I commit to paying £1 if 'heads'
- p = £0.99?
  - pay 99 pence for a note where I commit to paying £1 if 'heads'
- up to £0.05?, £0.95 or above?, ...
Betting game 1b (some philosophy for the soul)

(same toss() as above)

- Unit wager: the note's seller commits to paying £1 if outcome = 'heads'
- would you sell a unit wager at £p for 'heads'?
  - you get £p for the note, and have to pay £1 if heads
- up to £0.05?, £0.95 or above?, ...
Betting game 1c (some philosophy for the soul)
1 import numpy as np2 def toss():3 if np.random.rand() < 0.5:4 print(’Heads’)5 else:6 print(’Tails’)
- Unit wager: the note seller commits to paying £1 if the outcome is 'heads'
- what if you have to set a price £p, and commit to either sell a unit wager at £p for 'heads', or buy one?
- I decide whether to sell to you, or buy from you
  - I sell: you pay £p to buy a note where I commit to paying £1 if heads
  - I buy: you get £p for the note, and have to pay me £1 if heads
- up to £0.05? £0.95 or above? ...
Beliefs as willingness to wager
- A person with degree of belief p in event A is assumed to be willing to pay ≤ £p for a unit wager on A
- and is willing to sell such a wager for any price ≥ £p
- This p captures our degree of belief about the event A taking place (a.k.a. uncertainty, confidence)
Rational beliefs
- Two notes (unit wagers):
  - Note 1: 'outcome = heads'
  - Note 2: 'outcome = tails'
- you decide a price p for note 1 and q for note 2; I decide whether to buy from you or sell to you each note at the price you determined
- if p + q < 1, then I will buy note 1 from you for £p and also note 2 for £q
- whatever the outcome, you give me £1; but because I gave you p + q < 1, you lost £(1 − p − q)
- Dutch book: a set of unit wager notes where you decide the odds (wager prices) and I decide whether to buy or sell each note... and you are guaranteed to always lose money.
Set of beliefs is called rational if no Dutch book exists.
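The Dutch book above can be sketched numerically: with prices p + q < 1 for the two notes, I buy both and pocket 1 − (p + q) whatever the coin does (the specific prices below are illustrative):

```python
p, q = 0.40, 0.50        # your prices for the 'heads' and 'tails' notes

for outcome in ('Heads', 'Tails'):
    payout = 1.0                       # exactly one of the two notes pays £1
    my_profit = payout - (p + q)       # what I paid you up front: p + q
    print(outcome, round(my_profit, 2))  # 0.1 either way: you always lose
```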
Formalism (rational beliefs = probability theory)

Setup
- Define a sample space X of simple events (possible outcomes)
  - e.g. the experiment of flipping two coins: X = {HH, HT, TH, TT}
- Let A be an event (a subset of X). A holding true = at least one of the outcomes in A happened
  - e.g. "at least one heads" ↔ A = {HH, HT, TH}
- Write pA for the belief in event A (your wager on A happening, assuming all wagers are unit wagers)

Can show that {pA}A⊆X are rational beliefs iff {pA}A⊆X satisfy the laws of probability theory
- Already showed that pA + pAc = 1
- Try to devise other betting games at home (bdl101.ml/betting)

Can derive the laws of probability theory from rational beliefs!
- → if you want to be rational, you must follow the laws of probability (otherwise someone can take advantage of your model)
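As a sketch, beliefs read off a probability mass function over the two-coin sample space are automatically coherent; here we check the complement law pA + pAc = 1 (the fair-and-independent-coins assumption is ours):

```python
from itertools import product

X = {''.join(t) for t in product('HT', repeat=2)}   # {'HH', 'HT', 'TH', 'TT'}
mass = {o: 0.25 for o in X}                         # fair, independent coins

def belief(event):
    """p_A: summed mass of the outcomes making up the event A ⊆ X."""
    return sum(mass[o] for o in event)

A = {'HH', 'HT', 'TH'}                 # 'at least one heads'
print(belief(A))                       # 0.75
print(belief(A) + belief(X - A))       # 1.0 -- p_A + p_Ac = 1, no Dutch book
```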
Probability as belief vs frequency

The above is known as Bayesian probability theory:
- forms an interpretation of the laws of probability, and formalises our notion of uncertainty in events
- vs 'frequency as probability':
  - only applicable to repeatable events (e.g., try to answer 'will Trump win 2020?')
  - also other issues, e.g. p-hacking
  - a psychology journal banned p-values (although there are problems with Bayesian arguments as well)
‘Real-world’ example
[https://xkcd.com/1132/]
Bayesian Probabilistic Modelling

Bayesian Probabilistic Modelling (an Introduction)

Simple idea: "If you're doing something which doesn't follow from the laws of probability, then you're doing it wrong"
Bayesian Probabilistic Modelling

- can't do ML without assumptions
- must make some assumptions about how the data was generated
- there always exists some underlying process that generated the observations
- in Bayesian probabilistic modelling we make our assumptions about the underlying process explicit
- want to infer the underlying process (find the distribution that generated the data)
- e.g., astrophysics: gravitational lensing
  - there exists a physics process magnifying far-away galaxies
  - Nature chose a lensing coeff → gravitational lensing mechanism → transformed galaxy
  - we observe the transformed galaxies, and want to infer the lensing coeff
- e.g., cats vs dogs classification
  - there exist some underlying rules we don't know
  - e.g. "if it has pointy ears then cat"
  - we observe pairs (image, "cat"/"no cat"), and want to infer the underlying mapping from images to labels
- e.g., Gaussian density estimation
  - I tell you the process I used to generate the data, and give you 5 data points:

    xn ∼ N(xn; µ, σ²), σ = 1

  - you observe the points {x1, ..., x5}, and want to infer my µ
  - Reminder: the Gaussian density with mean µ and variance σ² is

    p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))
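The setup above can be sketched in code: Nature picks a hidden µ (σ = 1), hands us five samples, and the density reminder becomes a small function (the value 3.0 for the hidden mean is purely illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """p(x | mu, sigma): Gaussian density with mean mu and variance sigma**2."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=5)   # the observed {x_1, ..., x_5}

print(round(gauss_pdf(0.0, 0.0, 1.0), 4))   # 0.3989, the standard normal peak
```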
- e.g., Gaussian density estimation
  - ✗ Which µ generated my data?
  - ✓ What's the probability that µ = 10 generated my data? (we want to infer a distribution over µ!)
- e.g., Gaussian density estimation
  - these are the hypotheses we'll play with
  - I chose a Gaussian (one of those) from which I generated the data
Generative story / model

In Bayesian probabilistic modelling:
- we want to represent our beliefs / assumptions about how the data was generated explicitly
- e.g. via a generative story ['My assumptions are...']:
  - Someone (me / Nature / etc.) selected parameters µ*, σ*
  - Generated N data points xn ∼ N(µ*, σ*²)
  - Gave us D = {x1, ..., xN}
  - → how would you formalise this process?
- Bayesian probabilistic model:
  - prior [what I believe the params might look like]: µ ∼ N(0, 10), σ = 1
  - likelihood [how I believe the data was generated given the params]: xn | µ, σ ∼ N(µ, σ²)
  - we will update the prior belief on µ conditioned on the data you give me (infer a distribution over µ): µ | {x1, ..., xN}
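The generative story can be sketched directly: sample µ from the prior, then the data from the likelihood. (We read N(0, 10) as variance 10; if the slide meant standard deviation 10, replace np.sqrt(10.0) with 10.0.)

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(0.0, np.sqrt(10.0))    # prior draw: mu ~ N(0, 10)
sigma = 1.0                            # fixed by assumption
data = rng.normal(mu, sigma, size=5)   # likelihood draws: x_n | mu ~ N(mu, 1)

print(data.shape)                      # (5,) -- our D = {x_1, ..., x_5}
```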
Can you find my Gaussian?

How can you infer µ? (find a distribution)

Everything follows the laws of probability:

- Sum rule:

  p(X = x) = Σy p(X = x, Y = y) = ∫ p(X = x, Y) dY

- Product rule:

  p(X = x, Y = y) = p(X = x | Y = y) p(Y = y)

- Bayes rule:

  p(X = x | Y = y, H) = p(Y = y | X = x, H) p(X = x | H) / p(Y = y | H)

Note: H is often omitted from the conditional for brevity.
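The three rules can be sanity-checked on a small discrete joint distribution (the table values below are illustrative):

```python
import numpy as np

joint = np.array([[0.1, 0.2],     # joint[i, j] = p(X = i, Y = j)
                  [0.3, 0.4]])

p_x = joint.sum(axis=1)                    # sum rule: marginalise out Y
p_y = joint.sum(axis=0)                    # sum rule: marginalise out X
p_x_given_y = joint / p_y                  # product rule rearranged: p(X|Y) = p(X,Y)/p(Y)
p_y_given_x = joint / p_x[:, None]

# Bayes rule: p(X|Y) = p(Y|X) p(X) / p(Y)
bayes = p_y_given_x * p_x[:, None] / p_y
print(np.allclose(bayes, p_x_given_y))     # True
```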
Can you find my Gaussian?

Remember: products, ratios, marginals, and conditionals of Gaussians are Gaussian!

Summary (and playground) here: bdl101.ml/gauss
Inference

Bayes rule:

p(X = x | Y = y, H) = p(Y = y | X = x, H) p(X = x | H) / p(Y = y | H),

and in probabilistic modelling:

p(µ | D, σ, H) [Posterior] = p(D | µ, σ, H) [Likelihood] × p(µ | σ, H) [Prior] / p(D | σ, H) [Model evidence]

with model evidence p(D | σ, H) = ∫ p(D | µ, σ, H) p(µ | σ, H) dµ (sum rule).

Likelihood:
- we explicitly assumed the data comes i.i.d. from a Gaussian
- compute p(D | µ, σ) = ∏n p(xn | µ, σ) (product rule)
- the probability of observing the data points for given params
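As a sketch, the i.i.d. likelihood from the bullets above, computed as a product over data points (in log space, a sum; the data values are illustrative):

```python
import numpy as np

def log_likelihood(data, mu, sigma=1.0):
    """log p(D | mu, sigma) = sum_n log p(x_n | mu, sigma) for iid Gaussian data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

data = np.array([2.1, 2.9, 3.4, 2.5, 3.1])
# A mu near the data predicts it far better than a distant one:
print(log_likelihood(data, mu=2.8) > log_likelihood(data, mu=10.0))   # True
```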
The Likelihood in more detail

Likelihood:
- we explicitly assumed the data comes i.i.d. from a Gaussian
- compute p(D | µ, σ) = ∏n p(xn | µ, σ) (product rule)
- the probability of observing the data points for given params
Likelihood as a function of parameters

Reducing the dataset from 5 points to 1:
- What does the likelihood look like?
...and with smaller σ:
- trying to maximise the likelihood ends with "absolutely certain that σ = 0 & µ = 0"
- does this make sense? (I told you xn ∼ N!)
- an MLE failure
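The MLE failure can be sketched numerically: with a single point at x = 0, the likelihood at µ = 0 grows without bound as σ → 0, so maximising it drives σ to zero:

```python
import numpy as np

def likelihood(x, mu, sigma):
    """Gaussian density of a single observation x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

vals = [likelihood(0.0, 0.0, s) for s in (1.0, 0.1, 0.01)]
print([round(v, 2) for v in vals])   # [0.4, 3.99, 39.89] -- blows up as sigma -> 0
```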
And with all the data:
- the likelihood function shows how well every value of µ, σ predicted what would happen.
The Posterior in more detail

p(µ | D, σ, H) [Posterior] = p(D | µ, σ, H) [Likelihood] × p(µ | σ, H) [Prior] / p(D | σ, H) [Model evidence]

with model evidence p(D | σ, H) = ∫ p(D | µ, σ, H) p(µ | σ, H) dµ (sum rule). In contrast to the likelihood, the posterior would say:

'with the data you gave me, this is what I currently think µ could be, and I might become more certain if you give me more data'

- normaliser = marginal likelihood = evidence = sum of likelihood × prior
- (but often difficult to calculate... more in the next lecture)
The Posterior in more detail

I Eg, inference with the prior 'we believe the data is equally likely to have come from one of 5 Gaussians with σ = 1':

$p(\mu = \mu_i \mid \sigma, \mathcal{H}) = \frac{1}{5}$ for $i = 1, \dots, 5$, and $p(\mu \neq \mu_i \text{ for all } i \mid \sigma, \mathcal{H}) = 0$

then the marginal likelihood is

$p(\mathcal{D} \mid \sigma, \mathcal{H}) = \sum_i p(\mathcal{D} \mid \mu = \mu_i, \sigma, \mathcal{H})\, p(\mu = \mu_i \mid \sigma, \mathcal{H}) = \sum_i p(\mathcal{D} \mid \mu = \mu_i, \sigma, \mathcal{H}) \cdot \frac{1}{5}$

and the posterior is

$p(\mu = \mu_i \mid \sigma, \mathcal{D}, \mathcal{H}) = \frac{\frac{1}{5}\, p(\mathcal{D} \mid \mu = \mu_i, \sigma, \mathcal{H})}{\sum_j \frac{1}{5}\, p(\mathcal{D} \mid \mu = \mu_j, \sigma, \mathcal{H})}$

37 of 54
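The five-Gaussian example above can be computed directly. The candidate means and the observations below are hypothetical stand-ins (the slides give no concrete numbers); the sum-rule and Bayes-rule steps follow the slide exactly.

```python
import numpy as np

# Five candidate means, each with prior probability 1/5; sigma fixed at 1.
mus = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
prior = np.full(5, 0.2)

# Hypothetical observations (the slides do not specify the data).
data = np.array([0.9, 1.3, 0.7])

def likelihood(mu):
    """p(D | mu, sigma=1, H): product of Gaussian densities over the data."""
    return np.prod(np.exp(-0.5 * (data - mu) ** 2) / np.sqrt(2 * np.pi))

lik = np.array([likelihood(m) for m in mus])

# Marginal likelihood (sum rule), then posterior (Bayes rule).
evidence = np.sum(lik * prior)
posterior = lik * prior / evidence
print(posterior)
```

The posterior sums to 1 by construction, and with these observations its mass concentrates on the candidate mean nearest the sample mean.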
The Posterior in more detail

where $p(\mathcal{D} \mid \mu = \mu_i, \sigma = 1, \mathcal{H})$ is given by the likelihood of each candidate Gaussian.

I marginal likelihood of σ = 1: $p(\mathcal{D} \mid \sigma = 1, \mathcal{H})$ = 'probability that the data came from a single Gaussian with parameter σ = 1'
I similarly, marginal likelihood of the hypothesis: $p(\mathcal{D} \mid \mathcal{H})$ = 'probability that the data came from a single Gaussian (with some µ, σ)'
38 of 54
Bayesian Deep Learning
Bayesian Probabilistic Modelling of Functions
39 of 54
Why uncertainty over functions
I An example going beyond beliefs over statements ('heads happened') / scalars (µ)
I We would want to know the uncertainty (ie belief) of the system in its prediction
I We want to know the distribution over outputs for each input x, ie a distribution over functions
I First, some preliminaries... (history, and notation)
40 of 54
Linear regression
Linear regression [Gauss, 1809]
I Given a set of N input-output pairs {(x1, y1), ..., (xN, yN)}
I eg average number of accidents for different driving speeds
I assumes there exists a linear function mapping vectors $x_i \in \mathbb{R}^Q$ to $y_i \in \mathbb{R}^D$ (with $y_i$ potentially corrupted with observation noise)
I the model is a linear transformation of the inputs: $f(x) = Wx + b$, with W some D-by-Q real matrix and b a real vector with D elements
I different parameters W, b define different linear transformations
I aim: find parameters that (eg) minimise $\frac{1}{N}\sum_i \|y_i - (Wx_i + b)\|^2$
I but the relation between x and y need not be linear
41 of 54
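A minimal sketch of the fit described above, assuming a made-up ground-truth linear map to generate synthetic data; the bias is absorbed by appending a constant-1 feature so a single least-squares solve recovers both W and b.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear map (an assumed toy setup):
# y = w . x + b + noise, with x in R^2 and scalar y.
w_true, b_true = np.array([2.0, -1.0]), 0.5
X = rng.normal(size=(100, 2))                  # N = 100 inputs
y = X @ w_true + b_true + 0.01 * rng.normal(size=100)

# Minimise the average squared error via least squares;
# appending a constant-1 column absorbs the bias into the weights.
X1 = np.hstack([X, np.ones((100, 1))])
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, b_hat = theta[:2], theta[2]
print(w_hat, b_hat)
```

With low observation noise the recovered parameters land very close to the generating ones, matching the squared-error objective on the slide.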
Linear basis function regression
Linear basis function regression [Gergonne, 1815; Smith, 1918]
I the input x is fed through K fixed scalar-valued non-linear transformations $\phi_k(x)$
I collect these into a feature vector $\phi(x) = [\phi_1(x), ..., \phi_K(x)]$
I do linear regression with the φ(x) vector instead of x itself
I with scalar input x, the transformations can be
I wavelets parametrised by k: $\cos(k\pi x)e^{-x^2/2}$
I polynomials of degree k: $x^k$
I sinusoids with various frequencies: $\sin(kx)$
I when $\phi_k(x) := x_k$ and K = Q, basis function regression recovers linear regression
I basis functions are often assumed fixed and orthogonal to each other (the optimal combination is sought)
I but they need not be fixed and mutually orthogonal → parametrised basis functions
42 of 54
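The recipe above, build the feature vector then do linear regression on it, can be sketched with polynomial basis functions $\phi_k(x) = x^k$; the target function sin(3x) is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar inputs; the target is nonlinear in x (an arbitrary toy function).
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.05 * rng.normal(size=200)

# K fixed basis functions: polynomials phi_k(x) = x**k, k = 0..K-1.
K = 8
Phi = np.stack([x ** k for k in range(K)], axis=1)  # N-by-K feature matrix

# Linear regression on phi(x) instead of x itself.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predictions are linear in the features but nonlinear in x.
x_test = np.array([0.0, 0.5])
y_pred = np.stack([x_test ** k for k in range(K)], axis=1) @ w
print(y_pred)  # should be close to sin(0) = 0 and sin(1.5)
```

The model stays linear in the weights, so the same least-squares machinery as before applies even though the fitted function is nonlinear in x.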
Parametrised basis functions
Parametrised basis functions [Bishop, 2006; many others]
I eg basis functions $\phi_k^{w_k,b_k}$ where a scalar-valued function $\phi_k$ is applied to the inner product $w_k^T x + b_k$
I $\phi_k$ is often defined to be identical for all k (only the parameters change)
I eg $\phi_k(\cdot) = \tanh(\cdot)$, giving $\phi_k^{w_k,b_k}(x) = \tanh(w_k^T x + b_k)$
I feature vector = the basis functions' outputs = input to a linear transformation
I in vector form:
I $W_1$ a matrix of dimensions Q by K
I $b_1$ a vector with K elements
I $\phi^{W_1,b_1}(x) = \phi(W_1 x + b_1)$
I $W_2$ a matrix of dimensions K by D
I $b_2$ a vector with D elements
I model output: $f^{W_1,b_1,W_2,b_2}(x) = \phi^{W_1,b_1}(x)\, W_2 + b_2$
I want to find $W_1, b_1, W_2, b_2$ that minimise $\frac{1}{N}\sum_i \|y_i - f^{W_1,b_1,W_2,b_2}(x_i)\|^2$
43 of 54
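A sketch of one layer of identical tanh basis functions followed by a linear transformation. Column-vector inputs are assumed here, so this W1 is stored K-by-Q (the transpose of the slide's stated shape); all parameters are random since no training is shown.

```python
import numpy as np

rng = np.random.default_rng(2)
Q, K, D = 3, 16, 2  # input dim, number of basis functions, output dim

# Parameters of the basis functions and of the final linear transformation.
# Note: W1 is K-by-Q here (transposed w.r.t. the slide) for column vectors.
W1, b1 = rng.normal(size=(K, Q)), rng.normal(size=K)
W2, b2 = rng.normal(size=(K, D)), rng.normal(size=D)

def f(x):
    """Identical tanh basis functions, then linear regression on the features."""
    phi = np.tanh(W1 @ x + b1)  # phi_k(x) = tanh(w_k^T x + b_k), K outputs
    return phi @ W2 + b2        # model output in R^D

print(f(rng.normal(size=Q)).shape)
```

Swapping the fixed polynomial features of the previous slide for these parametrised ones is the whole conceptual step; the outer least-squares objective is unchanged.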
Hierarchy of parametrised basis functions
Hierarchy of parametrised basis functions [Rumelhart et al., 1985]
I called "NNs" for historical reasons
I layers
I = 'feature vectors' in the hierarchy
I linear transformation = 'inner product' layer = 'fully connected' layer
I 'input layer', 'output layer', 'hidden layers'
I transformation matrix = weight matrix = W; intercept = bias = b
I units
I elements in a layer
I feature vector (overloaded term)
I often refers to the penultimate layer (at the top of the model, just before the softmax / last linear transformation)
I denote the feature vector $\phi(x) = [\phi_1(x), ..., \phi_K(x)]$ with K units (a K-by-1 vector)
I denote the feature matrix $\Phi(X) = [\phi(x_1)^T, ..., \phi(x_N)^T]$, an N-by-K matrix
44 of 54
Hierarchy of parametrised basis functions
I regression
I compose multiple basis function layers into a regression model
I the result of the last transformation is also called the "model output"; often no non-linearity here
I classification
I further compose a softmax function at the end; also called "logistic" for 2 classes
I "squashes" its input → probability vector; the probability vector is also called the model output / softmax vector / softmax layer
I "building blocks"
I layers are simple
I modularity in layer composition → versatility of deep models
I many engineers work in the field → lots of tools that scale well
45 of 54
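The softmax "squashing" mentioned above can be written as a small self-contained function; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Squash a vector of real-valued model outputs into a probability vector."""
    z = z - np.max(z)  # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p)  # non-negative entries that sum to 1
```

With 2 classes this reduces to the logistic function applied to the difference of the two outputs.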
Assumptions for the moment
I we'll use deep nets, and denote by W the weight matrix of the last layer and by b the bias of the last layer
I (for the moment) look only at the last layer W, everything else fixed; ie weights other than W do not change
I later we'll worry about the other layers
I assume that y is scalar
I so W is K by 1
I write $w_k$ for the k'th element
I assume that the output layer's b is zero (or, the observed y's are normalised)
I both will simplify derivations here (but pose no difficulty otherwise)
I then $f^W(x) = \sum_k w_k \phi_k(x) = W^T \phi(x)$ with φ(x) a 'frozen' feature vector for some NN
I some notation you'll need to remember... X, x, N, $x_n$, Q, D, K, and the dataset $\mathcal{D} = \{(x_1, y_1), ..., (x_N, y_N)\} = (X, Y)$
46 of 54
Generative story
Want to put a distribution over functions...
I difficult to put a belief over functions, but easy to put one over NN params
I assumption for the moment: our data was generated from the fixed φ (NN) using some W (which we want to infer)
Generative story [what we assume about the data]
I Nature chose W which defines a function: f^W(x) := W^T φ(x)
I generated function values at inputs x_1, ..., x_N: f_n := f^W(x_n)
I corrupted function values with noise [also called "obs noise"]: y_n := f_n + ε_n, ε_n ∼ N(0, σ^2) [additive Gaussian noise with param σ]
I we're given observations {(x_1, y_1), ..., (x_N, y_N)} and σ = 1
47 of 54
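The three steps of the generative story can be sketched in numpy; the random ReLU feature map is again a hypothetical stand-in for the fixed NN φ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen feature map (hypothetical random-feature stand-in for the NN phi).
K = 50
A = rng.normal(size=(K, 1))
c = rng.normal(size=K)
phi = lambda x: np.maximum(A @ x + c, 0.0)

# The generative story, step by step:
W_true = rng.normal(size=K)                  # 1. Nature chose W
N = 20
X = rng.uniform(-3, 3, size=(N, 1))          # inputs x_1, ..., x_N
F = np.array([W_true @ phi(x) for x in X])   # 2. function values f_n = f^W(x_n)
sigma = 1.0                                  # obs-noise std, sigma = 1
Y = F + rng.normal(0.0, sigma, size=N)       # 3. y_n = f_n + eps_n
```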
Model
I questions:
I how can we find the function value f^* for a new x^*?
I how can we find our confidence in this prediction?
I → 'everything follows from the laws of probability theory'
I we build a model:
I put a prior distribution over the params W
p(W) = N(W; 0_K, s^2 I_K)
I likelihood [conditioned on W, generate obs by adding Gaussian noise]
p(y | W, x) = N(y; W^T φ(x), σ^2)
I the prior belief "w_k is more likely to be in the interval [−1, 1] than in [100, 200]" means that the function values are likely to be smooth rather than erratic (we'll see later why)
I we want to infer W (find a dist over W given D)
48 of 54
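The two model ingredients can be written out directly; note how draws from the prior over W induce draws of whole functions, which is what makes the smoothness remark above concrete (frozen features are again a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical frozen feature map, as before.
K, s, sigma = 50, 1.0, 1.0
A = rng.normal(size=(K, 1))
c = rng.normal(size=K)
phi = lambda x: np.maximum(A @ x + c, 0.0)

# Prior p(W) = N(0_K, s^2 I_K): each draw of W defines a whole function
# f^W(x) = W^T phi(x), so a belief over params is a belief over functions.
W_draws = rng.normal(0.0, s, size=(5, K))       # 5 samples from the prior
xs = np.linspace(-3, 3, 100).reshape(-1, 1)
Phi = np.array([phi(x) for x in xs])            # feature matrix, (100, K)
prior_functions = W_draws @ Phi.T               # 5 sampled functions, (5, 100)

# Likelihood p(y | W, x) = N(y; W^T phi(x), sigma^2), as a log-density:
def log_likelihood(y, W, x):
    mu = W @ phi(x)
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (y - mu)**2 / sigma**2
```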
Analytic inference w functions [new technique!]
I posterior variance
Σ' = (σ^−2 ∑_n φ(x_n)φ(x_n)^T + s^−2 I_K)^−1
and in vector form: (σ^−2 Φ(X)^T Φ(X) + s^−2 I_K)^−1
I posterior mean
μ' = Σ' σ^−2 ∑_n y_n φ(x_n)
and in vector form: Σ' σ^−2 Φ(X)^T Y
49 of 54
Analytic predictions with functions
How do we predict function values y∗ for new x∗?
I use probability theory to perform predictions!
p(y^*|x^*, X, Y)
= ∫ p(y^*, W | x^*, X, Y) dW [sum rule]
= ∫ p(y^* | x^*, W, X, Y) p(W | X, Y) dW [product rule]
= ∫ p(y^* | x^*, W) p(W | X, Y) dW [model assumptions]
I how to evaluate? [a new technique!]
I likelihood p(y^*|x^*, W) is Gaussian
I posterior p(W|X, Y) is Gaussian (from above)
I so the predictive p(y^*|x^*, X, Y) is Gaussian..
50 of 54
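A sketch of this prediction step, continuing the toy setup: the predictive mean is φ(x^*)^T μ', and the predictive spread is estimated here by Monte Carlo over posterior draws of W rather than by the closed form (deriving the analytic predictive variance is the homework):

```python
import numpy as np

rng = np.random.default_rng(4)

# Same toy setup and posterior as in the previous sketch
# (hypothetical frozen features).
K, s, sigma, N = 50, 1.0, 1.0, 20
A = rng.normal(size=(K, 1))
c = rng.normal(size=K)
phi = lambda x: np.maximum(A @ x + c, 0.0)
X = rng.uniform(-3, 3, size=(N, 1))
Phi = np.array([phi(x) for x in X])
Y = Phi @ rng.normal(size=K) + rng.normal(0.0, sigma, size=N)
Sigma_post = np.linalg.inv(Phi.T @ Phi / sigma**2 + np.eye(K) / s**2)
mu_post = Sigma_post @ (Phi.T @ Y) / sigma**2

# Predictive p(y*|x*, X, Y): Gaussian likelihood marginalised against the
# Gaussian posterior, hence Gaussian. Mean is phi(x*)^T mu'; the spread is
# estimated by sampling W ~ p(W|X,Y) and then y* ~ p(y*|x*, W).
x_star = np.array([0.5])
pred_mean = phi(x_star) @ mu_post
W_samples = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
y_samples = W_samples @ phi(x_star) + rng.normal(0.0, sigma, size=5000)
pred_var_mc = y_samples.var()   # Monte Carlo estimate of predictive variance
```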
Analytic predictions with functions
I Homework: Predictive variance
51 of 54
What you should be able to do now
I perform density estimation with scalars
I know when MLE fails (and why)
I use Bayes law to make more informed decisions in your life
I win against your friends in a series of bets
I argue with frequentists about how to interpret the laws of probability
I argue with philosophers about the nature of subjective beliefs
I use Bayesian probability in ML correctly
I perform predictions in Bayesian probabilistic modelling correctly
52 of 54
What we will cover next
In the next lecture we’ll
I decompose uncertainty into epistemic and aleatoric components
I use uncertainty in regression correctly
I develop tools to scale the ideas above to large deep models
I develop big deep learning systems which convey uncertainty
I with real-world examples
53 of 54