TRANSCRIPT
Department of Electrical Engineering, Electronic Systems Group
Martin Roa Villescas, Patrick Wijnings, Prof.dr. Henk Corporaal
Intelligent Architectures - 5LIL0
Bayesian Machine Learning
Agenda
• Model-based machine learning
• Probability theory
• Factor graphs
• Bayesian neural networks
2
3
Model-based Machine Learning
Standard vs. Model-based Machine Learning
4
(Figure: a cloud of standard ML algorithms, including K-means clustering, Markov random fields, Gaussian mixtures, logistic regression, Kalman filters, random forests, HMMs, principal components, neural networks, deep networks, kernel PCA, support vector machines, Boltzmann machines, linear regression, ICA, radial basis functions, Gaussian processes, decision trees, factor analysis)
The “No Free Lunch” Theorem
5
“Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
- David Wolpert (1996)
A model is a simplification of reality
Simplification is based on assumptions
Assumptions fail in certain situations
Roughly speaking: “No one model works best for all possible situations.”
Therefore, the goal of ML is to find an algorithm that is well matched to the problem being solved.
Machine Learning
6
Data vs Prior Knowledge trade-off
7
“Big Data”
8
“Big data”
9
Model-based Machine Learning
10
Goal: To derive the appropriate ML algorithm by making the modelling assumptions explicit
Traditional: “How do I map my problem to the standard tools?”
Model-based: “What is the model that represents my problem?”
11
Logistic Regression
12
Deep Neural Networks
13
Deep Neural Networks
14
Data and Prior Knowledge
15
Translational invariance
Convolutional Neural Networks
16
Summary
17
We have seen that:
• There is no universal machine learning algorithm.
• The goal is to find an algorithm that performs well on the particular dataset that we have.
• Such an algorithm depends on combining the data with prior knowledge.
• The dream is that, by being explicit about the prior knowledge and combining it with an inference algorithm, we can derive the machine learning algorithm.
Software transformation
18
Traditional CS: Program + Data → Output
Machine Learning: Data + Output → Program
19
Uncertainty is Everywhere
20
Which movie does the user want to watch?
Which word did the user say/write?
Which web page is the user trying to find?
Which link will the user click on?
Which gesture is the user making?
What is the medical condition of the patient?
Probability
21
Limit of infinite number of trials (frequentist)
Degree of belief (Bayesian)
22
23
24
25
26
27
Probability Theory Notation
Joint probability: p(x, y)
Marginal probability: p(x)
Conditional probability: p(y | x)
Sum rule: p(x) = Σ_y p(x, y)
Product rule: p(x, y) = p(y | x) p(x)
Bayes’ rule: p(y | x) = p(x | y) p(y) / p(x)
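These rules can be checked numerically. The sketch below (not part of the lecture) builds an arbitrary 2x3 joint distribution in NumPy and verifies the sum rule, product rule, and Bayes’ rule on it; the table values are made up for illustration.

```python
import numpy as np

# Arbitrary 2x3 joint distribution p(x, y); values are made up for illustration.
p_xy = np.array([[0.10, 0.25, 0.05],   # p(x=0, y=0..2)
                 [0.20, 0.10, 0.30]])  # p(x=1, y=0..2)
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: p(x) = sum_y p(x, y), and likewise for p(y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(y | x) p(x)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y, p_xy / p_y[None, :])

print("p(x) =", p_x, " p(y) =", p_y)
```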
28
Bayesian Machine Learning
Steps of model-based ML
1. Specify the model
2. Incorporate observed data
3. Do inference (i.e. learn, adapt)
• Iterate 2 and 3 in real-time applications
• Extend the model as required
29
How does a machine learn?
• Updates the parameters of the probabilistic model using Bayes’ rule
Bayesian Machine Learning
Hello world: Coin bias estimation
30
Model specification
• Likelihood: p(x | θ) = Bernoulli(x | θ)
• Prior: p(θ) = Beta(θ | α, β)
HELLO WORLD DEMO
Incorporate observed data
• Virtual coin
Do inference
• Exact analytical inference
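The lecture's demo is built with the ForneyLab toolchain (Julia); as a language-agnostic illustration of the same conjugate update, here is a minimal Python sketch. The prior parameters, the true bias, and the number of tosses are illustrative assumptions, not the lecture's settings.

```python
# Minimal sketch of the coin-bias "hello world": Bernoulli likelihood,
# Beta prior, exact (conjugate) posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_bias = 0.7                       # hidden parameter of the "virtual coin"
tosses = rng.random(25) < true_bias   # observed data: True = heads

a0, b0 = 2.0, 2.0                     # Beta(a0, b0) prior on the bias
heads, tails = tosses.sum(), (~tosses).sum()

# Conjugacy: Beta prior + Bernoulli likelihood -> Beta posterior
a_post, b_post = a0 + heads, b0 + tails
posterior = stats.beta(a_post, b_post)

print(f"posterior mean = {posterior.mean():.3f}")
print(f"94% interval   = {posterior.interval(0.94)}")
```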
Model-based Machine Learning (analogy)
31
(Figure: a cloud of standard electronic circuits, including power amplifiers, controllers, receivers, transmitters, protection circuits, inverters, instrumentation amplifiers, level shifters, supplies, light circuits, alarms, detectors, regulators, chargers, sensors, digital display circuits, function generators, voltmeters)
Model-based ML
32
Model-based ML
33
34
Probabilistic Graphical Models
Probabilistic Graphical Model (PGM)
Diagrammatic representation of a probabilistic model
• Visualizes the structure of the model
• Provides insight into properties of the model (e.g. conditional independence)
• Inference can be expressed in terms of graphical manipulations
35
Three prevailing types
• Bayesian networks
• Markov random fields
• Factor graphs
Factor Graphs
Conveys detailed information about the model factorization
• Suitable for casting inference tasks in a simple and general form
• Rules:
  • A node for every factor
  • An edge (or half-edge) for every variable
  • A node f is connected to edge x iff factor f is a function of variable x.
36
Example
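The worked example on this slide is not recoverable from the transcript; as a generic illustration of the rules, consider the factorization f(x1, x2, x3) = fa(x1, x2) · fb(x2, x3) · fc(x3). The corresponding factor graph has three factor nodes (fa, fb, fc) and three edges (x1, x2, x3): x2 connects fa and fb, x3 connects fb and fc, and x1 appears in only one factor, so it becomes a half-edge attached to fa.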
Factor Graphs
Conveys detailed information about the model factorization
• Suitable for casting inference tasks in a simple and general form
• Rules:
  • A node for every factor
  • An edge (or half-edge) for every variable
  • A node f is connected to edge x iff factor f is a function of variable x.
37
Example: Coin bias estimation
HELLO WORLD DEMO
Belief Propagation
Exact inference in PGMs
• Computes two messages for every edge (one in each direction) using the sum-product rule
• The marginal probability of a variable is the product of the two messages on its corresponding edge
38
Sum-product rule
• The message out of a factor node is the product of that factor and all its incoming messages, integrated over all variables of the incoming messages
Belief Propagation
39
Example
What is the marginal probability p(x4)?
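The slide's graph is not recoverable from the transcript, but the sum-product computation can be illustrated on a small discrete chain. The sketch below uses made-up factor tables and checks the message-passing result for p(x4) against brute-force marginalization.

```python
# Sum-product (belief propagation) sketch on a discrete chain
#   x1 --f12-- x2 --f23-- x3 --f34-- x4,   with a prior factor f1(x1).
# The factor tables are made-up illustrations, not the slide's example.
import numpy as np

rng = np.random.default_rng(1)
K = 3                                    # each variable takes K states
f1 = rng.random(K)                       # unary factor on x1
f12, f23, f34 = (rng.random((K, K)) for _ in range(3))  # pairwise factors

# Forward messages (left to right): the message out of a factor node is
# the factor times the incoming message, summed over the incoming variable.
m_x1 = f1                                # message f1  -> x1
m_x2 = f12.T @ m_x1                      # message f12 -> x2
m_x3 = f23.T @ m_x2                      # message f23 -> x3
m_x4 = f34.T @ m_x3                      # message f34 -> x4

# x4 is a half-edge, so its backward message is uniform; the marginal is
# the normalized product of the two messages on edge x4.
p_x4 = m_x4 * np.ones(K)
p_x4 /= p_x4.sum()
print("p(x4) =", p_x4)

# Sanity check against brute-force marginalization of the full joint.
joint = np.einsum('a,ab,bc,cd->abcd', f1, f12, f23, f34)
print("brute force =", joint.sum(axis=(0, 1, 2)) / joint.sum())
```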
Kalman Filter
40
Kalman Filter
41
Hidden Markov Model
42
Tractability of Exact Inference
Exact inference is intractable in models of practical interest
• High-dimensional hidden spaces
• Integrals with no closed-form analytical solutions
43
Solution: Approximation methods
• Markov chain Monte Carlo (MCMC): stochastic approximations (sketched below)
• Variational inference: deterministic approximations
(Figure: a true distribution approximated by MCMC sampling and by variational inference)
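As a concrete, hedged illustration of the MCMC idea, the sketch below approximates the coin-bias posterior with random-walk Metropolis and compares it with the exact Beta posterior; the step size, sample count, and observed counts are arbitrary choices for the example.

```python
# Random-walk Metropolis sketch: approximate the coin-bias posterior by
# sampling, then compare with the exact conjugate Beta posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
heads, tails = 17, 8                    # assumed observed coin tosses
a0, b0 = 2.0, 2.0                       # Beta prior on the bias

def log_post(theta):
    # Unnormalized log posterior: Beta prior + Bernoulli likelihood.
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (stats.beta.logpdf(theta, a0, b0)
            + heads * np.log(theta) + tails * np.log(1 - theta))

samples, theta = [], 0.5
for _ in range(20_000):
    prop = theta + 0.1 * rng.normal()           # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop                            # accept
    samples.append(theta)

samples = np.array(samples[5_000:])             # drop burn-in
exact = stats.beta(a0 + heads, b0 + tails)
print(f"MCMC mean = {samples.mean():.3f}, exact mean = {exact.mean():.3f}")
```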
44
ForneyLab
(Figure: toolchain: model compiler → inference source code → Julia compiler → compiled algorithm → algorithm execution → marginal distributions)
45
Bayesian Neural Networks
Drawbacks of Deep Learning
• Neural networks compute point estimates
• Overly confident decisions in classification, prediction, and actuation tasks
• Prone to overfitting
• Contain many hyperparameters that may require specific tuning
46
Limitations of Deep Learning
• Very data-hungry
• Very compute-intensive to train and deploy
• Poor at representing uncertainty
• Easily fooled by adversarial examples
• Difficult to optimize, e.g. choice of architecture, learning procedure, initialization, etc.
• Uninterpretable black boxes, difficult to trust
47
Bayesian Neural Network
A neural network with a prior distribution on the weights.
• Accounts for uncertainty in the weights
• Propagates this into uncertainty about predictions
• More robust against overfitting
• Random sampling over network weights serves as a cheap form of model averaging
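A minimal sketch of the "sample the weights, average the predictions" idea, assuming a standard Gaussian prior on the weights of a one-hidden-layer network. This only illustrates Monte Carlo averaging under the prior; it is not the lecture's training or inference procedure.

```python
# Tiny illustration of a neural network with a prior on the weights:
# draw weight samples from a Gaussian prior and average the resulting
# predictions. No posterior inference over the weights is performed.
import numpy as np

rng = np.random.default_rng(3)
H, n_samples = 16, 200                       # hidden units, weight samples
x = np.linspace(-3, 3, 50)[:, None]          # 1-D inputs

preds = []
for _ in range(n_samples):
    W1 = rng.normal(0.0, 1.0, size=(1, H))   # weights drawn from the prior
    b1 = rng.normal(0.0, 1.0, size=H)
    W2 = rng.normal(0.0, 1.0, size=(H, 1))
    preds.append(np.tanh(x @ W1 + b1) @ W2)  # one sampled network's prediction

preds = np.stack(preds)                      # (n_samples, 50, 1)
mean, std = preds.mean(axis=0), preds.std(axis=0)
print("predictive mean/std near x = 0:", mean[25, 0], std[25, 0])
```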
48
Takeaway
We’ve seen:
- A viewpoint on ML that provides a compass through the complex pile of existing ML algorithms
- A change of paradigm in the way software is programmed
- A practical tool to use when building real-world applications
- A vision of how ML can be democratized
49
References
- Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. Available online: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
50
51
Questions?