TRANSCRIPT
Department of Electrical Engineering, Electronic Systems Group
Martin Roa Villescas, Patrick Wijnings, Prof.dr. Henk Corporaal
Intelligent Architectures - 5LIL0
Bayesian Machine Learning
Agenda
• Model-based machine learning
• Probability theory
• Factor graphs
• Bayesian neural networks
2
3
Model-based Machine Learning
Standard vs. Model-based Machine Learning
4
(Figure: a cloud of standard ML algorithms, including K-means clustering, Markov random fields, Gaussian mixtures, logistic regression, Kalman filters, random forests, HMMs, principal components, neural networks, deep networks, kernel PCA, support vector machines, Boltzmann machines, linear regression, ICA, radial basis functions, Gaussian processes, decision trees, factor analysis)
The “No Free Lunch” Theorem
5
“Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
- David Wolpert (1996)
A model is a simplification of reality
Simplification is based on assumptions
Assumptions fail in certain situations
Roughly speaking: “No one model works best for all possible situations.”
Therefore, the goal of ML is to find an algorithm that is well matched to the problem being solved.
Machine Learning
6
Data vs Prior Knowledge trade-off
7
“Big Data”
8
“Big data”
9
Model-based Machine Learning
10
Goal: To derive the appropriate ML algorithm by making the modelling assumptions explicit
Traditional: “How do I map my problem to the standard tools?”
Model-based: “What is the model that represents my problem?”
11
Logistic Regression
12
Deep Neural Networks
13
Deep Neural Networks
14
Data and Prior Knowledge
15
Translational invariance
Convolutional Neural Networks
16
Summary
17
We have seen that:
• There is no universal machine learning algorithm.
• The goal is to find an algorithm that performs well on the particular dataset that we have.
• Such an algorithm depends on combining the data with prior knowledge.
• The dream is that, by being explicit about the prior knowledge and combining it with an inference algorithm, we can derive the machine learning algorithm.
Software transformation
18
Traditional CS: Program + Data → Output
Machine Learning: Data + Output → Program
19
Uncertainty is Everywhere
20
Which movie does the user want to watch?
Which word did the user say/write?
Which web page is the user trying to find?
Which link will the user click on?
Which gesture is the user making?
What is the medical condition of the patient?
Probability
21
Limit of infinite number of trials (frequentist)
Degree of belief (Bayesian)
22
23
24
25
26
27
Probability Theory Notation
Joint probability: p(x, y)
Marginal probability: p(x)
Conditional probability: p(y | x)
Sum rule: p(x) = Σ_y p(x, y)
Product rule: p(x, y) = p(y | x) p(x)
Bayes’ rule: p(y | x) = p(x | y) p(y) / p(x)
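These rules can be checked numerically. The sketch below (not part of the lecture) builds an arbitrary 2x3 joint distribution in NumPy and verifies the sum rule, product rule, and Bayes’ rule on it; the table values are made up for illustration.

```python
import numpy as np

# Arbitrary 2x3 joint distribution p(x, y); values are made up for illustration.
p_xy = np.array([[0.10, 0.25, 0.05],   # p(x=0, y=0..2)
                 [0.20, 0.10, 0.30]])  # p(x=1, y=0..2)
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: p(x) = sum_y p(x, y), and likewise for p(y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(y | x) p(x)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y, p_xy / p_y[None, :])

print("p(x) =", p_x, " p(y) =", p_y)
```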
28
Bayesian Machine Learning
Steps of model-based ML
1. Specify the model
2. Incorporate observed data
3. Do inference (i.e. learn, adapt)
• Iterate 2 and 3 in real-time applications
• Extend the model as required
29
How does a machine learn?
• Updates the parameters of the probabilistic model using Bayes’ rule
Bayesian Machine Learning
Hello world: Coin bias estimation
30
Model specification
• Likelihood: p(x | θ) = Bernoulli(x | θ)
• Prior: p(θ) = Beta(θ | α, β)
HELLO WORLD DEMO
Incorporate observed data
• Virtual coin
Do inference
• Exact analytical inference
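The lecture's demo is built with the ForneyLab toolchain (Julia); as a language-agnostic illustration of the same conjugate update, here is a minimal Python sketch. The prior parameters, the true bias, and the number of tosses are illustrative assumptions, not the lecture's settings.

```python
# Minimal sketch of the coin-bias "hello world": Bernoulli likelihood,
# Beta prior, exact (conjugate) posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_bias = 0.7                       # hidden parameter of the "virtual coin"
tosses = rng.random(25) < true_bias   # observed data: True = heads

a0, b0 = 2.0, 2.0                     # Beta(a0, b0) prior on the bias
heads, tails = tosses.sum(), (~tosses).sum()

# Conjugacy: Beta prior + Bernoulli likelihood -> Beta posterior
a_post, b_post = a0 + heads, b0 + tails
posterior = stats.beta(a_post, b_post)

print(f"posterior mean = {posterior.mean():.3f}")
print(f"94% interval   = {posterior.interval(0.94)}")
```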
Model-based Machine Learning (analogy)
31
(Figure: a cloud of standard electronic circuits, including power amplifiers, controllers, receivers, transmitters, protection circuits, inverters, instrumentation amplifiers, level shifters, supplies, light circuits, alarms, detectors, regulators, chargers, sensors, digital display circuits, function generators, voltmeters)
Model-based ML
32
Model-based ML
33
34
Probabilistic Graphical Models
Probabilistic Graphical Model (PGM)
Diagrammatic representation of a probabilistic model
• Visualizes the structure of the model
• Provides insight into properties of the model (e.g. conditional independence)
• Inference can be expressed in terms of graphical manipulations
35
Three prevailing types
• Bayesian networks
• Markov random fields
• Factor graphs
Factor Graphs
Conveys detailed information about the model factorization
• Suitable for casting inference tasks in a simple and general form
• Rules:
  • A node for every factor
  • An edge (or half-edge) for every variable
  • A node f is connected to edge x iff factor f is a function of variable x.
36
Example
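The worked example on this slide is not recoverable from the transcript; as a generic illustration of the rules, consider the factorization f(x1, x2, x3) = fa(x1, x2) · fb(x2, x3) · fc(x3). The corresponding factor graph has three factor nodes (fa, fb, fc) and three edges (x1, x2, x3): x2 connects fa and fb, x3 connects fb and fc, and x1 appears in only one factor, so it becomes a half-edge attached to fa.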
Factor Graphs
Conveys detailed information about the model factorization
• Suitable for casting inference tasks in a simple and general form
• Rules:
  • A node for every factor
  • An edge (or half-edge) for every variable
  • A node f is connected to edge x iff factor f is a function of variable x.
37
Example: Coin bias estimation
HELLO WORLD DEMO
Belief Propagation
Exact inference in PGMs
• Computes two messages for every edge (one in each direction) using the sum-product rule
• The marginal probability of a variable is the product of the two messages on its corresponding edge
38
Sum-product rule
• The message out of a factor node is the product of that factor and all its incoming messages, integrated over all variables of the incoming messages
Belief Propagation
39
Example
What is the marginal probability p(x4)?
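The slide's graph is not recoverable from the transcript, but the sum-product computation can be illustrated on a small discrete chain. The sketch below uses made-up factor tables and checks the message-passing result for p(x4) against brute-force marginalization.

```python
# Sum-product (belief propagation) sketch on a discrete chain
#   x1 --f12-- x2 --f23-- x3 --f34-- x4,   with a prior factor f1(x1).
# The factor tables are made-up illustrations, not the slide's example.
import numpy as np

rng = np.random.default_rng(1)
K = 3                                    # each variable takes K states
f1 = rng.random(K)                       # unary factor on x1
f12, f23, f34 = (rng.random((K, K)) for _ in range(3))  # pairwise factors

# Forward messages (left to right): the message out of a factor node is
# the factor times the incoming message, summed over the incoming variable.
m_x1 = f1                                # message f1  -> x1
m_x2 = f12.T @ m_x1                      # message f12 -> x2
m_x3 = f23.T @ m_x2                      # message f23 -> x3
m_x4 = f34.T @ m_x3                      # message f34 -> x4

# x4 is a half-edge, so its backward message is uniform; the marginal is
# the normalized product of the two messages on edge x4.
p_x4 = m_x4 * np.ones(K)
p_x4 /= p_x4.sum()
print("p(x4) =", p_x4)

# Sanity check against brute-force marginalization of the full joint.
joint = np.einsum('a,ab,bc,cd->abcd', f1, f12, f23, f34)
print("brute force =", joint.sum(axis=(0, 1, 2)) / joint.sum())
```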
Kalman Filter
40
Kalman Filter
41
Hidden Markov Model
42
Tractability of Exact Inference
Exact inference is intractable in models of practical interest
• High-dimensional hidden spaces
• Integrals with no closed-form analytical solutions
43
Solution: Approximation methods
• Markov chain Monte Carlo (MCMC): stochastic approximations (sketched below)
• Variational inference: deterministic approximations
(Figure: a true distribution approximated by MCMC sampling and by variational inference)
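As a concrete, hedged illustration of the MCMC idea, the sketch below approximates the coin-bias posterior with random-walk Metropolis and compares it with the exact Beta posterior; the step size, sample count, and observed counts are arbitrary choices for the example.

```python
# Random-walk Metropolis sketch: approximate the coin-bias posterior by
# sampling, then compare with the exact conjugate Beta posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
heads, tails = 17, 8                    # assumed observed coin tosses
a0, b0 = 2.0, 2.0                       # Beta prior on the bias

def log_post(theta):
    # Unnormalized log posterior: Beta prior + Bernoulli likelihood.
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (stats.beta.logpdf(theta, a0, b0)
            + heads * np.log(theta) + tails * np.log(1 - theta))

samples, theta = [], 0.5
for _ in range(20_000):
    prop = theta + 0.1 * rng.normal()           # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop                            # accept
    samples.append(theta)

samples = np.array(samples[5_000:])             # drop burn-in
exact = stats.beta(a0 + heads, b0 + tails)
print(f"MCMC mean = {samples.mean():.3f}, exact mean = {exact.mean():.3f}")
```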
44
ForneyLab
(Figure: toolchain: model compiler → inference source code → Julia compiler → compiled algorithm → algorithm execution → marginal distributions)
45
Bayesian Neural Networks
Drawbacks of Deep Learning
• Neural networks compute point estimates
• Overly confident decisions in classification, prediction, and actuation tasks
• Prone to overfitting
• Contain many hyperparameters that may require specific tuning
46
Limitations of Deep Learning
• Very data-hungry
• Very compute-intensive to train and deploy
• Poor at representing uncertainty
• Easily fooled by adversarial examples
• Difficult to optimize, e.g. choice of architecture, learning procedure, initialization, etc.
• Uninterpretable black boxes, difficult to trust
47
Bayesian Neural Network
A neural network with a prior distribution on the weights.
• Accounts for uncertainty in the weights
• Propagates this into uncertainty about predictions
• More robust against overfitting
• Random sampling over network weights serves as a cheap form of model averaging
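A minimal sketch of the "sample the weights, average the predictions" idea, assuming a standard Gaussian prior on the weights of a one-hidden-layer network. This only illustrates Monte Carlo averaging under the prior; it is not the lecture's training or inference procedure.

```python
# Tiny illustration of a neural network with a prior on the weights:
# draw weight samples from a Gaussian prior and average the resulting
# predictions. No posterior inference over the weights is performed.
import numpy as np

rng = np.random.default_rng(3)
H, n_samples = 16, 200                       # hidden units, weight samples
x = np.linspace(-3, 3, 50)[:, None]          # 1-D inputs

preds = []
for _ in range(n_samples):
    W1 = rng.normal(0.0, 1.0, size=(1, H))   # weights drawn from the prior
    b1 = rng.normal(0.0, 1.0, size=H)
    W2 = rng.normal(0.0, 1.0, size=(H, 1))
    preds.append(np.tanh(x @ W1 + b1) @ W2)  # one sampled network's prediction

preds = np.stack(preds)                      # (n_samples, 50, 1)
mean, std = preds.mean(axis=0), preds.std(axis=0)
print("predictive mean/std near x = 0:", mean[25, 0], std[25, 0])
```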
48
Takeaway
We’ve seen:
- A viewpoint on ML that provides a compass through the complex pile of existing ML algorithms
- A change of paradigm in the way software is programmed
- A practical tool to use when building real-world applications
- A vision of how ML can be democratized
49
References
- Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. Available online: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
50
51
Questions?