Lecture-6: Uncertainty in Neural Networks


Page 1:

Lecture-6: Uncertainty in Neural Networks

Page 2:

Motivation

“Every time a scientific paper presents a bit of data, it's accompanied by an error bar – a quiet but insistent reminder that no knowledge is complete or perfect. It's a calibration of how much we trust what we think we know.”

― Carl Sagan, The Demon-Haunted World: Science as a Candle in the Dark

https://commons.wikimedia.org/wiki/File:Carl_Sagan_Planetary_Society.JPG

Further Reading
Sagan, Carl. The Demon-Haunted World: Science as a Candle in the Dark. Ballantine Books, 2011.

Page 3:

Types of Uncertainty

Classical Neural Network (or any parametric model):

$$ y = f(x, \theta), \qquad \mathcal{D} = \{x, t\} $$

Output/Prediction Uncertainty:

$$ y \sim P(y \mid x, \theta) $$

Model (Weight) Uncertainty:

$$ \theta \sim P(\theta \mid \mathcal{D}) $$

Conventions for this lecture:
• $\theta$ = neural network weights
• $x$ = neural network input
• $y$ = neural network output
• $\mathcal{D}$ = training data (inputs and targets)

Page 4:

Three Different Ways to Represent Distributions in Practice

[Figure: three ways to represent the distribution of data]
• Point Estimate: stores only a mean
• Parametric Distribution: stores, e.g., a mean and a variance
• Empirical Distribution: stores a histogram (Bin 1, Bin 2, Bin 3, …, Bin N)

Page 5:

Pros and Cons

Point Estimate
✓ Calculation efficient and stable
✗ Uncertainty not accounted for
✗ Cannot be sampled
✗ Assumes the point estimate is representative: fails, e.g., under multi-modality

Parametric Distribution
✓ Calculation relatively efficient
✓ Analytical representation
✓ Can be sampled
✗ Breaks down if the assumed class of distribution is not correct
✗ Hard to represent multi-modality

Empirical Distribution
✓ Flexible: no assumptions
✓ Can be sampled (more or less)
✗ Breaks down for higher-dimensional data
✗ High memory consumption
✗ Artifacts due to discretization


Page 6:

Output Uncertainty for Regression

Page 7:

Recap: Classification

The standard setup:
• Classification predicts a categorical distribution over the classes
• Targets are one-hot encoded
• The loss function is the cross-entropy

We will focus on Regression henceforth …

[Figure: X → Neural Network → Dense layer (activation: Softmax) → Categorical distribution]

Prediction vs. target for an MNIST digit:

  Class:       0    1     2    3    4   …    9
  Prediction:  0    0    .98   0    0   …   .02
  Target:      0    0     1    0    0   …    0

The softmax output understates the true uncertainty!

Page 8:

Motivation for Output Uncertainty: Mean Average Precision @ 7

Challenge:
Give at most 7 recommendations from a large product catalog. You are evaluated based on the precision of these at most 7 items.

Idea:
Select at most 7 recommendations where the model is certain about the recommendation.

In general:
Situations where a response/action can be rejected.

Further Reading
Brando, Axel, et al. "Uncertainty Modelling in Deep Networks: Forecasting Short and Noisy Series." arXiv preprint arXiv:1807.09011 (2018).

Related Kaggle Challenge
https://www.kaggle.com/c/santander-product-recommendation

Page 9:

Neural Network with Output Uncertainty

$$ y \sim P(y \mid x, \theta) $$

Let's commit to a parametric distribution:

$$ y \sim \mathcal{N}(y \mid \mu, \sigma^2) $$

We will model $\mu$ as a neural network: $\mu(x, \theta)$.

We either model $\sigma$ as a scalar parameter, under the assumption of homoscedastic uncertainty, or as a neural network $\sigma(x, \theta)$ for heteroscedastic uncertainty:

$$ P(y \mid x, \theta) = \frac{1}{\sqrt{2\pi\,\sigma^2(x,\theta)}}\, \exp\!\left(-\frac{\left(y - \mu(x,\theta)\right)^2}{2\,\sigma^2(x,\theta)}\right) $$

Your choice! Also popular: the Laplace distribution.

Page 10:

Neural Network with Output Uncertainty

$$ P(y \mid x, \theta) = \frac{1}{\sqrt{2\pi\,\sigma^2(x,\theta)}}\, \exp\!\left(-\frac{\left(y - \mu(x,\theta)\right)^2}{2\,\sigma^2(x,\theta)}\right), \qquad \theta = \{\theta_\mu, \theta_\sigma\} $$

We will optimize the log-likelihood with respect to the weights $\theta$, i.e., minimize the average negative log-likelihood:

$$ \mathcal{L}(\theta, \mathcal{D}) = -\frac{1}{|\mathcal{D}|} \sum_{(x,t)\in\mathcal{D}} \left[-\log \sigma(x,\theta) - \frac{\left(t - \mu(x,\theta)\right)^2}{2\,\sigma^2(x,\theta)}\right] + \text{const} $$

Use a variant of Stochastic Gradient Descent to train.
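To make this concrete, here is a minimal sketch of the heteroscedastic case in the same TF1-style API as the MDN example later in this lecture; the layer sizes, input dimension, and the Adam optimizer are assumptions for this sketch, not part of the slides:

import tensorflow as tf

D = 1  # input dimension (assumed for this sketch)
X_ph = tf.placeholder(tf.float32, [None, D])
t_ph = tf.placeholder(tf.float32, [None])

net = tf.layers.dense(X_ph, 15, activation=tf.nn.relu)
mu = tf.squeeze(tf.layers.dense(net, 1, activation=None), axis=-1)
# exp activation keeps sigma(x, theta) strictly positive
sigma = tf.squeeze(tf.layers.dense(net, 1, activation=tf.exp), axis=-1)

# negative Gaussian log-likelihood, constant terms dropped
loss = tf.reduce_mean(tf.log(sigma) + tf.square(t_ph - mu) / (2.0 * tf.square(sigma)))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)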

Page 11:

Network with Output Uncertainty: Architecture Variants

[Figure: two architecture variants, both trained with loss $-\log P$]
• Variant A: X → Neural Network → Normal distribution $P(Y \mid X)$; one network outputs both $\mu$ and $\sigma$.
• Variant B: X → two separate Neural Networks, one for $\mu$ and one for $\sigma$, which together parameterize the Normal distribution $P(Y \mid X)$.

Output uncertainty alone still understates the true uncertainty!

Page 12:

Multimodality

[Figure: fitting a single Gaussian to multimodal data]

Page 13:

Example of Multimodality

Learning the value of an action (see also Reinforcement Learning)

[Figure: value of the action as a function of the steering action; the distribution is bimodal, and the mean lies between the modes]
https://commons.wikimedia.org/wiki/File:Newport_Whitepit_Lane_pot_hole.JPG

Further Reading
Depeweg, Stefan, et al. "Learning and policy search in stochastic dynamical systems with Bayesian neural networks." arXiv preprint arXiv:1605.07127 (2016).

Page 14:

Result of Point Estimation

https://commons.wikimedia.org/wiki/File:Bus_in_hole.jpg

Page 15:

Inverse Problems

1. We are interested in X, but X can only be perceived indirectly via Y.
2. The relation between X and Y is given by a known, possibly stochastic function:

$$ y = g(x) $$

This means we would like to learn the inverse of g:

$$ x = f(y) $$

The challenge is that g may not be invertible! This means that f may be stochastic even if g is not.

Page 16:

Inverse Problems: Examples

1. Image de-noising
2. Astronomy imaging
3. Deconvolution
4. Localization
5. …

[Figure: true image X → Atmospheric Disturbance f(x) → observed image y]

Page 17:

Gaussian Mixtures

A single Gaussian:

  Mean   Variance
  1.2    3

A Gaussian mixture:

             Mean   Var.   Weight
  Cluster 1   4.1   1.8    0.6
  Cluster 2  -2.0   2.7    0.38
  Cluster 3  -0.6   0.8    0.02

If you want to know how many clusters to assume:

Further Reading
Rasmussen, Carl Edward. "The infinite Gaussian mixture model." Advances in Neural Information Processing Systems. 2000.

Further Reading
Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian Analysis 1.1 (2006): 121-143.

Further Practice with GMMs
http://scikit-learn.org/stable/modules/mixture.html#bgmm
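A minimal usage sketch for the scikit-learn link above; the synthetic data here is a placeholder, not from the slides:

import numpy as np
from sklearn.mixture import GaussianMixture

# placeholder data: two well-separated clusters
data = np.concatenate([np.random.randn(300, 1) + 4.0,
                       np.random.randn(200, 1) - 2.0])
gmm = GaussianMixture(n_components=2).fit(data)
print(gmm.means_, gmm.covariances_, gmm.weights_)  # mean, var., weight per cluster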

Page 18:

The Mixture Density Network: Architecture

• $\sigma_k$: standard deviations
• $\mu_k$: means
• $\pi_k$: cluster weights
• Loss: $-\log P(y \mid x)$

[Figure: X → Neural Network → Mixture Density $P(Y \mid X)$]

Reference
Bishop, Christopher M. Mixture Density Networks. Technical Report NCRG/4288, Aston University, Birmingham, UK, 1994.

Page 19:

The Mixture Density Network: Implementation in TensorFlow Probability

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# D: input dimensionality (defined elsewhere)
X_ph = tf.placeholder(tf.float32, [None, D])
y_ph = tf.placeholder(tf.float32, [None])

K = 10  # number of mixture components
net = tf.layers.dense(X_ph, 15, activation=tf.nn.relu)
net = tf.layers.dense(net, 15, activation=tf.nn.relu)
locs = tf.layers.dense(net, K, activation=None)      # component means
scales = tf.layers.dense(net, K, activation=tf.exp)  # note the activation: keeps sigma in (0, inf)
logits = tf.layers.dense(net, K, activation=None)    # unnormalized cluster weights

components = [
    tfd.Normal(loc=loc, scale=scale)
    for loc, scale in zip(tf.unstack(tf.transpose(locs)),
                          tf.unstack(tf.transpose(scales)))
]
cat = tfd.Categorical(logits=logits)
mdn = tfd.Mixture(cat=cat, components=components)

loss = -tf.reduce_mean(mdn.log_prob(y_ph))  # negative log-likelihood

[Figure: X → Neural Network → Mixture Density $P(Y \mid X)$]

Note the activation on the scales layer! We want $\sigma \in (0, \infty)$.
Output uncertainty alone still understates the true uncertainty!
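A hypothetical training loop for the model above; X_train and y_train are assumed NumPy arrays, and the optimizer and step count are choices for this sketch, not part of the slides:

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        sess.run(train_op, feed_dict={X_ph: X_train, y_ph: y_train})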

Page 20:

The Mixture Density Network: Result

[Figure: MDN training: the training data and the fitted mixture density model]

Page 21:

Model Uncertainty

Page 22:

Model Uncertainty

$$ y = f(x, \theta), \qquad \mathcal{D} = \{x, t\} $$

Model (Weight) Uncertainty: $\theta \sim P(\theta \mid \mathcal{D})$

Implies an uncertainty in the output!

Is it qualitatively different from simple output uncertainty?

Page 23:

Example: Tossing a Coin

You have tossed a coin 1 000 000 times.

What is your prediction for the outcome of the next flip?

How sure are you? In other words: How sure are you about your model?

  Heads     Tails
  501 071   498 929
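One way to make "How sure are you about your model?" quantitative (a sketch, not from the slides): with a uniform Beta(1, 1) prior on the head probability, the posterior after these observations is again a Beta distribution.

from scipy import stats

# posterior over the head probability after 501 071 heads and 498 929 tails
posterior = stats.beta(1 + 501071, 1 + 498929)
print(posterior.mean())  # ~0.501: predict heads, slightly more often than tails
print(posterior.std())   # ~0.0005: after a million tosses, the model is very certain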

Page 24:

Example: Predicting a Time Series

Data:

  t   0   1   2   3   4   5   6
  y   0   1   ?   ?   ?   ?   ?

What is your prediction for y?

How sure are you? In other words: How sure are you about your model?

Page 25:

Epistemic and Aleatoric Uncertainty

Epistemic

Caused by an insufficient number of observations. In other words: the model is underdetermined.

More observations will reduce this type of uncertainty.

Aleatoric

Caused by stochasticity or un-observability of an aspect of a system.

More observations will not reduce this type of uncertainty. Different types of observations might.

Further Reading
Kendall, Alex, and Yarin Gal. "What uncertainties do we need in Bayesian deep learning for computer vision?" Advances in Neural Information Processing Systems. 2017.

[Diagram: model uncertainty is epistemic; output uncertainty is aleatoric]

Page 26:

Model/Weight Uncertainty in Neural Networks

Bayes' Theorem:

$$ P(\theta \mid \mathcal{D}) = \frac{\overbrace{P(\mathcal{D} \mid \theta)}^{\text{likelihood of data}}\;\overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{normalization}}} $$

Is this radically different from “normal” Neural Networks?

Page 27:

Preparation: Neural Networks in the Bayesian Framework

Parametric Model with Noise Term:

$$ t = f(x, \theta) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \qquad \mathcal{D} = \{x, t\} $$

Inference on Weights:

$$ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} $$

Page 28:

Preparation: Neural Networks in the Bayesian Framework

Likelihood:

$$ P(\mathcal{D} \mid \theta) = \prod_{(x,t)\in\mathcal{D}} \mathcal{N}\!\left(t \mid f(x,\theta),\, \sigma^2\right) $$

Prior:

$$ P(\theta) = \mathcal{N}(\theta \mid 0, 1) $$

Negative Log Posterior (up to constants):

$$ \mathcal{L}_{\wp}(\theta, \mathcal{D}) = \underbrace{\sum_{(x,t)\in\mathcal{D}} \left(f(x,\theta) - t\right)^2}_{\text{MSE}} + \underbrace{\lambda\, \lVert \theta \rVert^2}_{\text{L2 weight regularization}} $$

$$ P(\theta \mid \mathcal{D}) = \frac{\overbrace{P(\mathcal{D} \mid \theta)}^{\text{likelihood of data}}\;\overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{normalization}}} $$

Page 29:

Preparation: Neural Networks in the Bayesian Framework

$$ \mathcal{L}_{\wp}(\theta, \mathcal{D}) = \underbrace{\sum_{(x,t)\in\mathcal{D}} \left(f(x,\theta) - t\right)^2}_{\text{MSE}} + \underbrace{\lambda\, \lVert \theta \rVert^2}_{\text{L2 weight regularization}} $$

Optimization gives the Maximum A Posteriori (MAP) estimate.

Further Reading
Barber, David, and Christopher M. Bishop. "Ensemble learning in Bayesian neural networks." NATO ASI Series F: Computer and Systems Sciences 168 (1998): 215-238.

Page 30:

Why Bayes is Difficult in Practice

Two Main Challenges:

1. How to represent distributions?
2. How to calculate $P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta$?

$$ P(\theta \mid \mathcal{D}) = \frac{\overbrace{P(\mathcal{D} \mid \theta)}^{\text{likelihood of data}}\;\overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{normalization}}} $$

Page 31:

Some Recipes to Deal with Bayes' Theorem

Reduce it to a point estimate → use numerical optimization to find it.
✓ That's what we just did!

Assume conjugate distributions from the exponential family for the likelihood and the prior → solve analytically.
✗ Does not work for Neural Networks!

Empirically estimate $P(\theta \mid \mathcal{D})$ → use Markov Chain Monte Carlo methods to sample $\theta$. (MCMC can sample from un-normalized distributions.)
→ Let's take a closer look!

Make simplifying assumptions, such as factorization of $Q(\theta)$ → use Variational Methods to approximate $P(\theta \mid \mathcal{D})$.
→ Let's take a closer look!

Note: There are more recipes for special types of models, e.g., message passing for graphical models.

Page 32:

Empirical Estimate of the Weight Distribution via Markov Chain Monte Carlo

Problem Statement:
1. We want to sample from the distribution of neural network weights given the data: $P(\theta \mid \mathcal{D})$.
2. For a given $\theta$ we can evaluate the un-normalized $\tilde{P}(\theta) = P(\mathcal{D} \mid \theta)\, P(\theta)$.

Idea:

[Figure: points in weight space] Jump around in the weight space such that the visited points are distributed proportionally to $P(\theta \mid \mathcal{D})$.

Page 33:

How to Jump: The Metropolis-Hastings Algorithm

Split the jump into two phases:

1. Propose a new point in weight space by sampling from an arbitrary but known proposal distribution.
• We call the proposal distribution $Q(\theta^* \mid \theta)$
• We call the proposed weight vector $\theta^*$
• We call the current weight vector $\theta$

2. Define an acceptance probability $A(\theta^*, \theta)$ for the proposed weight vector:
• If accepted: $\theta \leftarrow \theta^*$
• else: $\theta$ stays unchanged

[Figure: a jump in weight space, proposed via $Q(\theta^* \mid \theta)$ and accepted with probability $A(\theta^*, \theta)$]

Page 34:

How to Jump: The Metropolis-Hastings Algorithm

If the acceptance probability is chosen as

$$ A(\theta^*, \theta) = \min\!\left(1,\; \frac{\tilde{P}(\theta^*)\, Q(\theta \mid \theta^*)}{\tilde{P}(\theta)\, Q(\theta^* \mid \theta)}\right) $$

it can be proven that the distribution of the sampled $\theta_1, \theta_2, \dots$ converges to $P(\theta \mid \mathcal{D})$.

[Figure: a jump in weight space, proposed via $Q(\theta^* \mid \theta)$ and accepted with probability $A(\theta^*, \theta)$]

Further Reading
Chapter 11, Sampling Methods, of Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin.
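A minimal random-walk Metropolis-Hastings sketch in NumPy; the function name log_p_tilde, the step size, and the Gaussian proposal are assumptions for this sketch. With a symmetric proposal, the Q terms cancel in the acceptance ratio:

import numpy as np

def metropolis_hastings(log_p_tilde, theta0, n_samples=10000, step=0.1):
    """Sample from the un-normalized density exp(log_p_tilde)."""
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * np.random.randn(*theta.shape)
        # accept with probability min(1, P~(theta*) / P~(theta))
        if np.log(np.random.rand()) < log_p_tilde(proposal) - log_p_tilde(theta):
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# usage sketch: sample from an (un-normalized) standard Gaussian
trace = metropolis_hastings(lambda th: -0.5 * np.sum(th ** 2), np.zeros(2))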

Page 35:

Caveats of Markov Chain Monte Carlo

1. Computational complexity: one sampling step requires computing $P(\mathcal{D} \mid \theta)\, P(\theta)$, which includes the likelihood of the whole dataset.

2. Convergence to $P(\theta \mid \mathcal{D})$ can be slow.

3. Trade-off between high acceptance and efficient traversal of weight space:
• Long jumps make acceptance low
• Short jumps make movement through the space slow

There is a large body of research addressing these points: best use an existing framework or library!

Python Libraries
• PyMC3: https://docs.pymc.io/
• Edward: http://edwardlib.org/
• TensorFlow Probability: https://www.tensorflow.org/probability/

Page 36:

Big Caveat of Markov Chain Monte Carlo for Deep Learning

What's the memory requirement?
• S samples needed for each weight to build a histogram
• W weights in a deep neural network
→ O(S·W)

For VGG19: 26,578,886 weights and 1,000 samples per weight: ~26 billion values! Each value is a float32 (4 bytes), so roughly 100 GB for the sample traces.

Page 37:

Parametric Estimate via the Variational Method

So, $P(\theta \mid \mathcal{D})$ is intractable? Let's simplify and assume a parametric approximation $Q$ with parameters $\lambda$:

$$ P(\theta \mid \mathcal{D}) \approx Q(\theta \mid \lambda), \qquad \forall \lambda: \int Q(\theta \mid \lambda)\, d\theta = 1 $$

Let's measure the discrepancy between $P(\theta \mid \mathcal{D})$ and $Q(\theta \mid \lambda)$ with the Kullback–Leibler divergence:

$$ D_{KL}\!\left(Q \,\|\, P\right) = \mathbb{E}_Q\!\left[\log \frac{Q(\theta)}{P(\theta \mid \mathcal{D})}\right] $$

If we can measure it, maybe we can minimize it with respect to $\lambda$ …

Using $D_{KL}$ is a design choice. There are good reasons to choose differently.

Further Reading
Hernández-Lobato, J. M., et al. "Black-box α-divergence minimization." 33rd International Conference on Machine Learning, ICML 2016. Vol. 4. 2016.

Further Reading
Ranganath, Rajesh, et al. "Operator variational inference." Advances in Neural Information Processing Systems. 2016.

Page 38:

Parametric Estimate via the Variational Method

$$ D_{KL}\!\left(Q \,\|\, P\right) = \mathbb{E}_Q\!\left[\log \frac{Q(\theta)}{P(\theta \mid \mathcal{D})}\right] $$

This still contains the intractable term $P(\theta \mid \mathcal{D})$. Let's dissect it!

$$ \mathbb{E}_Q\!\left[\log \frac{Q(\theta)}{P(\theta \mid \mathcal{D})}\right] = \mathbb{E}_Q\!\left[\log \frac{Q(\theta)}{P(\theta, \mathcal{D})}\right] + \log P(\mathcal{D}) $$

Rearranged:

$$ \underbrace{\log P(\mathcal{D})}_{\text{const for given } \mathcal{D}} = \underbrace{D_{KL}\!\left(Q \,\|\, P(\theta \mid \mathcal{D})\right)}_{\geq\, 0} + \underbrace{\mathbb{E}_Q\!\left[\log \frac{P(\theta, \mathcal{D})}{Q(\theta)}\right]}_{\text{lower bound on } \log P(\mathcal{D})} $$

Page 39:

The Evidence Lower Bound

Let's give it a name: the Evidence Lower Bound

$$ \mathrm{ELBO}(\lambda) := \mathbb{E}_Q\!\left[\log \frac{P(\theta, \mathcal{D})}{Q(\theta)}\right] $$

Two ways to maximize it:

Option A: Use factorization and assume the exponential family for Q → Coordinate Ascent Variational Inference (CAVI) / Mean Field Approximation.
✗ (for deep neural networks)

Option B: Use sampling and stay in the gradient descent framework of neural networks → Automatic Differentiation Variational Inference (ADVI) / Black Box Variational Inference.
✓ (for deep neural networks)

Further Reading
Chapter 10, Approximate Inference, of Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin.

Page 40:

The Evidence Lower Bound in Detail

$$ \mathrm{ELBO}(\lambda) := \mathbb{E}_Q\!\left[\log \frac{P(\theta, \mathcal{D})}{Q(\theta)}\right] $$

With $P(\theta, \mathcal{D}) = P(\mathcal{D} \mid \theta)\, P(\theta)$, further deconstruction yields:

$$ \mathbb{E}_Q\!\left[\log \frac{P(\theta, \mathcal{D})}{Q(\theta)}\right] = \underbrace{\mathbb{E}_Q\!\left[\log P(\mathcal{D} \mid \theta)\right]}_{\text{average log likelihood}} - \underbrace{D_{KL}\!\left(Q \,\|\, P(\theta)\right)}_{\text{regularization}} $$

Page 41:

Alternative Decomposition

$$ \mathrm{ELBO}(\lambda) = \underbrace{\mathbb{E}_Q\!\left[\log P(\mathcal{D} \mid \theta)\right]}_{\text{average log likelihood}} - \underbrace{D_{KL}\!\left(Q \,\|\, P(\theta)\right)}_{\text{regularization (can be solved analytically)}} $$

$$ \mathrm{ELBO}(\lambda) = \underbrace{\mathbb{E}_Q\!\left[\log P(\theta, \mathcal{D})\right]}_{\text{energy}} \;\underbrace{-\; \mathbb{E}_Q\!\left[\log Q(\theta)\right]}_{\text{entropy}} $$

Page 42:

Stochastic Gradient of the ELBO

$$ \nabla_\lambda \mathrm{ELBO}(\lambda) = \nabla_\lambda\, \mathbb{E}_Q\!\left[\log P(\mathcal{D} \mid \theta)\right] - \nabla_\lambda\, D_{KL}\!\left(Q \,\|\, P(\theta)\right) = \nabla_\lambda\, \mathbb{E}_Q\!\left[\log P(\theta, \mathcal{D})\right] - \nabla_\lambda\, \mathbb{E}_Q\!\left[\log Q(\theta)\right] $$

If we could move the derivative into the expectation, we could approximate it by sampling!

There are at least two options:

1. Log Derivative Trick
✗ suffers from high variance
✓ flexible with respect to the distribution
A viable and flexible option in combination with variance control techniques. Also used in Reinforcement Learning.

Further Reading
Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8.3-4 (1992): 229-256.

Further Reading
Ranganath, Rajesh, Sean Gerrish, and David Blei. "Black box variational inference." Artificial Intelligence and Statistics. 2014.

2. Reparametrization (a.k.a. Perturbation Analysis, Pathwise Derivatives)
✓ easy to implement
✓ low gradient variance
✗ does not work for every distribution

Page 43:

Reparametrization Trick

$$ \nabla_\lambda\, \mathbb{E}_{Q(\theta \mid \lambda)}\!\left[f(\theta)\right] = \nabla_\lambda \int Q(\theta \mid \lambda)\, f(\theta)\, d\theta $$

Find a way to reparametrize $Q(\theta \mid \lambda)$ such that $\epsilon \sim P(\epsilon)$, with $P(\epsilon)$ independent of $\lambda$, and we just sample $\epsilon$:

$$ \theta = g(\epsilon, \lambda) $$

We move the derivative inside the expectation (warranted by the dominated convergence theorem):

$$ \nabla_\lambda\, \mathbb{E}_{Q(\theta \mid \lambda)}\!\left[f(\theta)\right] = \mathbb{E}_{P(\epsilon)}\!\left[\nabla_\lambda\, f\!\left(g(\epsilon, \lambda)\right)\right] $$

Example:

$$ \theta \sim \mathcal{N}(\mu, \sigma^2) \;\;\rightarrow\;\; \theta = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) $$

✓ The noise $\epsilon$ is now independent of $\lambda = \{\mu, \sigma\}$, and backpropagation works!

Further Reading
Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).

Page 44:

Flipout

Challenge:
“Because a network typically has many more weights than units, it is very expensive to compute and store separate weight perturbations for every example in a mini-batch. Therefore, stochastic weight methods are typically done with a single sample per mini-batch.”

For VGG19: 26,578,886 weights and a batch size of 100: ~2.6 billion values!

→ Same sample for the entire batch
→ high variance in the gradient
→ small learning rate needed
→ slow convergence

Further Reading
Wen, Yeming, et al. "Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches." arXiv preprint arXiv:1803.04386 (2018).

Page 45:

Flipout

Idea:
Add independence to the weight samples within one batch at low computational cost.

Conditions:
1. the weight distribution is symmetric
2. each weight is sampled independently

These conditions are met for our stochastic perturbation $\Delta W$!

• Let $\Delta W$ be a matrix of independently sampled $\epsilon \sim \mathcal{N}(0, 1)$
• Let $E_n$ be a matrix of independently sampled signs $\pm 1$, one per example $n$

$$ \Delta W_n = \Delta W \circ E_n $$

$\Delta W$ and the $\Delta W_n$ are equally distributed, but the $\Delta W_n$ are far less correlated across the batch!

Page 46:

Flipout

$$ \Delta W_n = \Delta W \circ E_n $$

But $E_n$ is as big as a fully sampled weight matrix for every example in the batch. Nothing gained yet!

Can we generate $E_n$ cheaply?

Idea:
Use a rank-one sign matrix induced by the outer product of two random sign vectors $r_n$ and $s_n$:

$$ E_n = r_n s_n^{\top} $$
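A NumPy sketch of why the rank-one trick is cheap (the shapes are assumptions): the per-example perturbed activation $x_n^\top (\Delta W \circ r_n s_n^\top)$ equals $((x_n \circ r_n)^\top \Delta W) \circ s_n$, so no per-example weight matrix is ever materialized.

import numpy as np

B, n_in, n_out = 100, 5, 3
dW = np.random.randn(n_in, n_out)                   # one shared Gaussian perturbation
r = np.random.choice([-1.0, 1.0], size=(B, n_in))   # per-example sign vectors
s = np.random.choice([-1.0, 1.0], size=(B, n_out))
x = np.random.randn(B, n_in)

# per-example perturbations without materializing B weight matrices
delta_out = ((x * r) @ dW) * s

# check against the explicit (expensive) computation for one example
n = 0
assert np.allclose(delta_out[n], x[n] @ (dW * np.outer(r[n], s[n])))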

For a full analysis of the gradient variance reduction and computational cost see:

Further Reading
Wen, Yeming, et al. "Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches." arXiv preprint arXiv:1803.04386 (2018).

Page 47:

Implementation in TensorFlow Probability

The Bayes version of the LeNet CNN architecture, taken from the TensorFlow Probability examples:

neural_net = tf.keras.Sequential([
    tfp.layers.Convolution2DFlipout(6, kernel_size=5, padding="SAME", activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
    tfp.layers.Convolution2DFlipout(16, kernel_size=5, padding="SAME", activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
    tfp.layers.Convolution2DFlipout(120, kernel_size=5, padding="SAME", activation=tf.nn.relu),
    tf.keras.layers.Flatten(),
    tfp.layers.DenseFlipout(84, activation=tf.nn.relu),
    tfp.layers.DenseFlipout(10),
])

labels_distribution = tfd.Categorical(logits=neural_net(images))

# negative ELBO: average negative log-likelihood plus the KL regularizer
neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
kl = sum(neural_net.losses) / mnist_data.train.num_examples
elbo_loss = neg_log_likelihood + kl

$$ \text{elbo\_loss} \approx -\,\mathbb{E}_Q\!\left[\log P(\mathcal{D} \mid \theta)\right] + D_{KL}\!\left(Q \,\|\, P(\theta)\right) $$

[Figure: Convolution → Maxpooling → … → Dense]

Further Reading on LeNet
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

Further Practice
https://github.com/tensorflow/probability/tree/master/tensorflow_probability/examples

Page 48:

Sampling from a Bayesian Neural Network

Pros
• Estimate of model uncertainty (if your use-case benefits from it!)
• Implies output uncertainty
• Acts as regularization

Con
• Computational / memory overhead

[Figure: Convolution → Maxpooling → Dense network; loop: sample weights → calculate network output]

Three sampled outputs for an MNIST digit:

  Class:     0    1     2    3    4   …    9
  Sample 1:  0    0    .98   0    0   …   .02
  Sample 2:  0   .01   .97   0    0   …   .01
  Sample 3:  0    0    .99   0    0   …    0
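A hypothetical continuation of the LeNet example above: each session run re-draws the weight perturbations in the Flipout layers, so repeated forward passes give a Monte Carlo predictive distribution (sess and images are assumed to come from the surrounding session and input pipeline):

import numpy as np

probs_op = tf.nn.softmax(neural_net(images))
samples = np.stack([sess.run(probs_op) for _ in range(50)])  # 50 weight samples
mean_prob = samples.mean(axis=0)  # predictive mean per class
std_prob = samples.std(axis=0)    # spread across samples ~ model uncertainty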

Page 49:

Approximating Bayesian Networks with Ensembles

Pros
• Simple
• Parallelizable
• Reduction of the expected error

Con
• Excessive training time for modern Deep Neural Networks

[Figure: X → several independently trained Neural Networks → output samples / mean]

Further Reading
Perrone, Michael P., and Leon N. Cooper. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. No. TR-61. Brown Univ., Providence, RI, Inst. for Brain and Neural Systems, 1992.
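A minimal ensemble sketch; make_model, the bootstrap resampling, and the data arrays are hypothetical helpers, not from the slides:

import numpy as np

models = []
for _ in range(5):
    m = make_model()  # fresh random initialization each time
    idx = np.random.randint(len(X_train), size=len(X_train))  # bootstrap resample
    m.fit(X_train[idx], y_train[idx])
    models.append(m)

preds = np.stack([m.predict(X_test) for m in models])
mean = preds.mean(axis=0)  # ensemble prediction
std = preds.std(axis=0)    # disagreement across members ~ model uncertainty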

Page 50:

Approximating Ensembles with Dropout

Pros
• Implies output uncertainty
• Acts as regularization
• Computationally highly efficient

Con
• Only an indirect estimate of model uncertainty

[Figure: Convolution → Maxpooling → Dense network; loop: drop random neurons → calculate network output]

Three sampled outputs for an MNIST digit:

  Class:     0    1     2    3    4   …    9
  Sample 1:  0    0    .98   0    0   …   .02
  Sample 2:  0   .01   .97   0    0   …   .01
  Sample 3:  0    0    .99   0    0   …    0

Further Reading
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Insights and applications." Deep Learning Workshop, ICML. Vol. 1. 2015.
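A minimal MC-dropout sketch in Keras; the architecture, sample count, and x_batch are assumptions for this sketch. Passing training=True keeps dropout active at prediction time, so each forward pass drops a different random set of neurons:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# x_batch is an assumed input batch; training=True keeps dropout on
samples = np.stack([model(x_batch, training=True).numpy() for _ in range(50)])
mean_prob = samples.mean(axis=0)
std_prob = samples.std(axis=0)  # spread across dropout masks ~ uncertainty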

Page 51:

In case you want to delve deeper into Bayesian Methods

Further Practice
https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Further Reading (more theory)
Jaynes, Edwin T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

Further Reading
Sivia, Devinderjit, and John Skilling. Data Analysis: A Bayesian Tutorial. OUP Oxford, 2006.

Further Reading (more practice)
Gelman, Andrew, et al. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.