TRANSCRIPT
Introduction to Bayesian Learning
Davide Bacciu
Dipartimento di Informatica, Università di Pisa - [email protected]
Apprendimento Automatico: Fondamenti - A.A. 2016/2017
Outline
Lecture 1 - Introduction: probabilistic reasoning and learning (Bayesian Networks)
Lecture 2 - Parameter Learning (I): learning fully observed Bayesian models
Lecture 3 - Parameter Learning (II): learning with hidden variables
If we have time, we will also cover some application examples of Bayesian learning and Bayesian Networks
Course Information
Reference Webpage:
http://pages.di.unipi.it/bacciu/teaching/apprendimento-automatico-fondamenti/
Here you can find: course information, lecture slides, articles and course materials
Office Hours
Where? Dipartimento di Informatica - 2nd Floor - Room 367
When? Monday 17-19
When? Basically anytime, if you send me an email beforehand.
Lecture Outline
1 Introduction
2 Probability Theory: Probabilities and Random Variables; Bayes Theorem and Independence
3 Learning as Bayesian Inference: Hypothesis Selection; Bayesian Classifiers
4 Bayesian Networks: Representation; Learning and Inference
Bayesian Learning
Why Bayesian? Easy, because of the frequent use of Bayes theorem...
Bayesian Inference: a powerful approach to probabilistic reasoning
Bayesian Networks: an expressive model for describing probabilistic relationships
Why bother?
The real world is uncertain: data (noisy measurements and partial knowledge) and beliefs (concepts and their relationships)
Probability as a measure of our beliefs: a conceptual framework for describing uncertainty in world representation
Learning and reasoning become matters of probabilistic inference, with probabilistic weighting of the hypotheses
Part I
Probability and Learning
Random Variables
A Random Variable (RV) is a function describing the outcome of a random process by assigning unique values to all possible outcomes of the experiment
Random Process =⇒ Coin Toss
Discrete RV =⇒ X = 0 if heads, 1 if tails
The sample space S of a random process is the set of all possible outcomes, e.g. S = {heads, tails}
An event e is a subset e ⊆ S, i.e. a set of outcomes that may or may not occur as a result of the experiment
Random variables are the building blocks for representing our world
Probability Functions
A probability function P(X = x) ∈ [0,1] (P(x) in short) measures the probability of an RV X attaining the value x, i.e. the probability of event x occurring
If the random process is described by a set of RVs X1, . . . , XN, then the joint probability writes
P(X1 = x1, . . . , XN = xN) = P(x1 ∧ · · · ∧ xN)
Definition (Sum Rule)
The probabilities of all the events must sum to 1
∑_x P(X = x) = 1
Product Rule and Conditional Probabilities
Definition (Product Rule a.k.a. Chain Rule)
P(x1, . . . , xi, . . . , xN | y) = ∏_{i=1}^{N} P(xi | x1, . . . , xi−1, y)
P(x|y) is the conditional probability of x given y
It reflects the fact that the realization of an event y may affect the occurrence of x
Marginalization: sum and product rules together yield the complete probability equation
P(X1 = x1) = ∑_{x2} P(X1 = x1, X2 = x2) = ∑_{x2} P(X1 = x1 | X2 = x2) P(X2 = x2)
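The rules above can be checked numerically; below is a minimal Python sketch (not part of the slides) on a toy joint distribution over two binary RVs, with probability values invented for illustration.

# Toy joint distribution P(X1, X2) over two binary RVs (values chosen arbitrarily)
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Sum rule: the probabilities of all joint events sum to 1
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginalization: P(X1 = x1) = sum over x2 of P(X1 = x1, X2 = x2)
def marginal_x1(x1):
    return sum(p for (v1, v2), p in joint.items() if v1 == x1)

print(marginal_x1(0), marginal_x1(1))   # 0.5 and 0.5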
Bayes Rule
Given hypothesis hi ∈ H and observations d
P(hi|d) = P(d|hi)P(hi) / P(d) = P(d|hi)P(hi) / (∑_j P(d|hj)P(hj))
P(hi) is the prior probability of hi
P(d|hi) is the conditional probability of observing d given that hypothesis hi is true (likelihood)
P(d) is the marginal probability of d
P(hi|d) is the posterior probability that the hypothesis is true given the data and the previous belief about the hypothesis
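As a small hedged sketch of the rule in code (hypothetical priors and likelihoods, not from the slides): the posterior is the prior times the likelihood, normalized by the marginal P(d).

# Hypothetical priors P(h_i) and likelihoods P(d | h_i) for three hypotheses
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.70}

# Marginal P(d) = sum over j of P(d | h_j) P(h_j)
p_d = sum(likelihood[h] * prior[h] for h in prior)

# Posterior P(h_i | d) = P(d | h_i) P(h_i) / P(d)
posterior = {h: likelihood[h] * prior[h] / p_d for h in prior}
print(posterior)   # the posteriors sum to 1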
So, Is This All About Bayes???
Frequentists vs. Bayesians
Figure: Frequentists vs. Bayesians cartoon (credit goes to Mike West @ Duke University)
Independence and Conditional Independence
Two RVs X and Y are independent if knowledge about X does not change the uncertainty about Y, and vice versa
I(X, Y) ⇔ P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X) = P(X)P(Y)
Two RVs X and Y are conditionally independent given Z if, once Z is known, the realization of X is independent of the realization of Y
I(X, Y|Z) ⇔ P(X, Y|Z) = P(X|Y, Z)P(Y|Z) = P(Y|X, Z)P(X|Z) = P(X|Z)P(Y|Z)
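A possible way to see conditional independence operationally is the sketch below (invented numbers): a joint P(X, Y, Z) built so that X and Y are generated independently given Z satisfies P(X, Y|Z) = P(X|Z) P(Y|Z) for every value of Z.

from itertools import product

# Toy CPTs: X and Y are generated independently given Z (numbers are invented)
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: 0.9, 1: 0.2}          # P(X = 1 | Z = z)
p_y_given_z = {0: 0.3, 1: 0.5}          # P(Y = 1 | Z = z)

def bern(p_one, value):
    return p_one if value else 1.0 - p_one

# Joint P(X, Y, Z) assembled from the CPTs above
joint = {(x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
         for x, y, z in product([0, 1], repeat=3)}

# Check P(X=1, Y=1 | Z=z) = P(X=1 | Z=z) * P(Y=1 | Z=z) for both values of z
for z in (0, 1):
    p_xy = joint[(1, 1, z)] / p_z[z]
    assert abs(p_xy - p_x_given_z[z] * p_y_given_z[z]) < 1e-12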
Representing Probabilities with Discrete Data
Joint Probability Distribution Table
X1     . . .   Xi     . . .   Xn     P(X1, . . . , Xn)
x′1    . . .   x′i    . . .   x′n    P(x′1, . . . , x′n)
. . .
xl1    . . .   xli    . . .   xln    P(xl1, . . . , xln)
Describes P(X1, . . . ,Xn) for all the RV instantiations x1, . . . , xn
In general, any probability of interest can be obtained starting from the Joint Probability Distribution P(X1, . . . , Xn)
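To make that last statement concrete, here is a minimal sketch (toy numbers, not from the slides) showing marginal and conditional probabilities being derived from a single joint probability table.

# Toy joint probability table over three binary RVs (X1, X2, X3); the entries sum to 1
jpt = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def prob(**fixed):
    """P of the event fixing some RVs (e.g. X1=1), summing out the others."""
    idx = {("X1", "X2", "X3").index(k): v for k, v in fixed.items()}
    return sum(p for x, p in jpt.items() if all(x[i] == v for i, v in idx.items()))

# Marginal and conditional probabilities read off the same table
p_x1 = prob(X1=1)                              # P(X1 = 1)
p_x1_given_x3 = prob(X1=1, X3=1) / prob(X3=1)  # P(X1 = 1 | X3 = 1)
print(p_x1, p_x1_given_x3)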
Wrapping Up....
We know how to represent the world and the observations
Random Variables =⇒ X1, . . . , XN
Joint Probability Distribution =⇒ P(X1 = x1, . . . , XN = xN)
We have rules for manipulating the probabilistic knowledge: sum-product, marginalization, Bayes, conditional independence
It is about time that we do some learning... in other words, let's discover the values of P(X1 = x1, . . . , XN = xN)
Bayesian Learning
Statistical learning approaches calculate the probability of each hypothesis hi given the data D, and select hypotheses/make predictions on that basis
Bayesian learning makes predictions using all hypotheses, weighted by their probabilities
P(X|D = d) = ∑_i P(X|d, hi) P(hi|d) = ∑_i P(X|hi) · P(hi|d)
(new prediction = hypothesis prediction × posterior weighting)
Single Hypothesis Approximation
Computational and Analytical Tractability Issue
Bayesian learning requires a (possibly infinite) summation over the whole hypothesis space
Maximum a-Posteriori (MAP) predicts P(X|hMAP) using the most likely hypothesis hMAP given the training data
hMAP = arg max_{h∈H} P(h|d) = arg max_{h∈H} P(d|h)P(h) / P(d) = arg max_{h∈H} P(d|h)P(h)
Assuming uniform priors P(hi) = P(hj) yields the Maximum Likelihood (ML) estimate P(X|hML)
hML = arg max_{h∈H} P(d|h)
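A minimal sketch of the two point estimates (hypothetical priors and likelihoods): hMAP maximizes P(d|h)P(h), while hML drops the prior; with an informative prior the two can disagree.

# Hypothetical priors and data likelihoods for a small hypothesis space
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(d | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # arg max P(d|h) P(h)
h_ml = max(prior, key=lambda h: likelihood[h])               # arg max P(d|h)
print(h_map, h_ml)                                           # h1 (MAP) vs h3 (ML)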
All Too Abstract?
Let’s go to the Cinema!!!
How do I choose the next movie (prediction)?
I might ask my friends for their favorite choice given their personal taste (hypothesis)
Select the movie
Bayesian advice? Take a vote over all the friends' suggestions, weighted by their cinema attendance and taste
MAP advice? Ask the friend who goes often to the cinema and whose taste I trust
ML advice? Ask the friend who goes to the cinema most often
The Candy Box Problem
A candy manufacturer produces 5 types of candy boxes (hypotheses) that are indistinguishable in the darkness of the cinema
h1: 100% cherry flavor
h2: 75% cherry and 25% lime flavor
h3: 50% cherry and 50% lime flavor
h4: 25% cherry and 75% lime flavor
h5: 100% lime flavor
Given a sequence of candies d = d1, . . . , dN extracted and reinserted in a box (observations), what is the most likely flavor for the next candy (prediction)?
Candy Box Problem - Hypothesis Posterior
First, we need to compute the posterior for each hypothesis (Bayes)
P(hi|d) = α P(d|hi) P(hi)
The manufacturer is kind enough to provide us with the production shares (prior) for the 5 boxes
(P(h1), P(h2), P(h3), P(h4), P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
The data likelihood can be computed under the assumption that observations are independently and identically distributed (i.i.d.)
P(d|hi) = ∏_{j=1}^{N} P(dj|hi)
Candy Box Problem - Hypothesis Posterior Computation
Suppose that the box is an h5 and consider a sequence of 10 observed lime candies
Figure: Posterior probabilities P(h1|d), . . . , P(h5|d) as a function of the number of samples in d (from 0 to 10)

Hyp   d0    d1    d2
h1    0.1   0     0
h2    0.2   0.1   0.03
h3    0.4   0.4   0.30
h4    0.2   0.3   0.35
h5    0.1   0.2   0.31

P(hi|d) = α P(hi) P(d = l|hi)^N
Posteriors of false hypotheses eventually vanish
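The posterior table above can be reproduced (up to rounding) with a few lines of Python, using the priors and flavor proportions given in the slides; this is only an illustrative sketch.

# Priors (production shares) and per-candy lime probabilities for h1..h5, from the slides
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]            # P(d_j = lime | h_i)

def posterior(num_limes):
    """P(h_i | d) after observing num_limes lime candies (i.i.d. observations)."""
    unnorm = [pr * pl ** num_limes for pr, pl in zip(prior, p_lime)]
    alpha = 1.0 / sum(unnorm)                   # normalization constant
    return [alpha * u for u in unnorm]

for n in range(3):                              # the d0, d1, d2 columns of the table above
    print(n, [round(p, 2) for p in posterior(n)])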
Candy Box Problem - Comparing Predictions
Bayesian learning seeks
P(d11 = l | d1 = l, . . . , d10 = l) = ∑_{i=1}^{5} P(d11 = l | hi) P(hi|d)
Figure: Probability that the next candy is lime, as a function of the number of samples in d (from 0 to 10)
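Continuing the candy sketch (the priors and flavor proportions are repeated so the snippet runs on its own), the Bayesian prediction averages each hypothesis's lime probability with its posterior weight, exactly as in the formula above.

# Priors and per-candy lime probabilities as in the posterior sketch above
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(n):
    unnorm = [pr * pl ** n for pr, pl in zip(prior, p_lime)]
    return [u / sum(unnorm) for u in unnorm]

# P(next = lime | n limes observed) = sum over i of P(lime | h_i) P(h_i | d)
def p_next_lime(n):
    return sum(pl * po for pl, po in zip(p_lime, posterior(n)))

print([round(p_next_lime(n), 2) for n in range(11)])   # climbs towards 1 as limes accumulate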
Observations
Both ML and MAP are point estimates, since they only make predictions based on the most likely hypothesis
MAP predictions are approximately Bayesian if P(X|d) ∼ P(X|hMAP)
MAP and Bayesian predictions become closer as more data becomes available
ML is a good approximation to MAP if the dataset is large and there are no a-priori preferences on the hypotheses
ML is fully data-driven
For large datasets, the influence of the prior becomes irrelevant
ML has problems with small datasets
Regularization
P(h) introduces a preference across hypotheses: penalize complexity
Complex hypotheses have a lower prior probability
The hypothesis prior embodies a trade-off between complexity and degree of fit
MAP hypothesis hMAP
max_h P(d|h)P(h) ≡ min_h −log2 P(d|h) − log2 P(h)
−log2 P(h) is the number of bits required to specify h
MAP =⇒ choosing the hypothesis that provides maximum compression
MAP is a regularization of the ML estimation
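A quick numeric check (hypothetical numbers, reusing the earlier MAP/ML sketch) that maximizing P(d|h)P(h) and minimizing −log2 P(d|h) − log2 P(h) select the same hypothesis.

from math import log2

# Hypothetical priors and data likelihoods
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(d | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])               # max P(d|h) P(h)
h_mdl = min(prior, key=lambda h: -log2(likelihood[h]) - log2(prior[h]))  # min description length
assert h_map == h_mdl
print(h_map)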
Bayes Optimal Classifier
Suppose we want to classify a new observation as one of the classes cj ∈ C
The most probable Bayesian classification is
c∗OPT = arg max_{cj∈C} P(cj|d) = arg max_{cj∈C} ∑_i P(cj|hi) P(hi|d)
Bayesian Optimal Classifier
No other classification method (using the same hypothesis space and prior knowledge) can outperform a Bayesian Optimal Classifier on average.
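A minimal sketch with invented posteriors: even when the single most probable hypothesis predicts one class, the posterior-weighted vote of the Bayes Optimal Classifier can prefer another.

# Posterior P(h_i | d) and the class probabilities each hypothesis assigns (toy values)
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_class_given_h = {                  # P(c | h_i) for classes "+" and "-"
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes=("+", "-")):
    # arg max over c of the sum over i of P(c | h_i) P(h_i | d)
    return max(classes, key=lambda c: sum(p_class_given_h[h][c] * posterior[h]
                                          for h in posterior))

print(bayes_optimal())   # "-" wins (0.6) even though h1 alone would predict "+"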
Gibbs Algorithm
The Bayes Optimal Classifier provides the best results, but can be extremely expensive
Use approximated methods =⇒ the Gibbs Algorithm
1 Choose a hypothesis hG from H at random, by drawing it from the current posterior probability distribution P(hi|d)
2 Use hG to predict the classification of the next instance
Under certain conditions, the expected number of misclassifications is at most twice that of a Bayesian Optimal classifier
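A sketch of the Gibbs algorithm on the same toy posterior (invented values): draw one hypothesis from the posterior and classify with it alone.

import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}           # P(h_i | d), toy values
prediction_of_h = {"h1": "+", "h2": "-", "h3": "-"}     # class predicted by each hypothesis

def gibbs_classify():
    # 1. draw h_G from the posterior P(h_i | d)
    h_g = random.choices(list(posterior), weights=list(posterior.values()))[0]
    # 2. classify the next instance using h_G alone
    return prediction_of_h[h_g]

print([gibbs_classify() for _ in range(5)])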
Naive Bayes Classifier
One of the simplest, yet most popular, tools based on a strong probabilistic assumption
Consider the setting
Target classification function f : X −→ C
Each instance x ∈ X is described by a set of attributes x = ⟨a1, . . . , al, . . . , aL⟩
Seek the MAP classification
cNB = arg max_{cj∈C} P(cj | a1, . . . , aL)
Naive Bayes Assumption
The MAP classification rewrites
cNB = arg max_{cj∈C} P(cj | a1, . . . , aL) = arg max_{cj∈C} P(a1, . . . , aL | cj) P(cj)
Naive Bayes: Assume conditional independence between the attributes al given the classification cj
P(a1, . . . , aL | cj) = ∏_{l=1}^{L} P(al | cj)
Naive Bayes Classification
cNB = arg max_{cj∈C} P(cj) ∏_{l=1}^{L} P(al | cj)
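A minimal sketch of the classification rule, with made-up class priors and CPT entries: multiply the class prior by the per-attribute conditionals and take the arg max.

# Hypothetical class priors and per-attribute CPTs P(a_l | c)
prior = {"spam": 0.3, "ham": 0.7}
cpt = {                                   # cpt[class][attribute_index][attribute_value]
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.5, "no": 0.5}],
}

def naive_bayes(attributes):
    def score(c):
        s = prior[c]
        for l, a in enumerate(attributes):
            s *= cpt[c][l][a]             # P(a_l | c), conditional independence assumption
        return s
    return max(prior, key=score)

print(naive_bayes(["yes", "yes"]))        # spam: 0.3*0.8*0.6 = 0.144 vs ham: 0.7*0.1*0.5 = 0.035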
Observations on Naive Bayes
Use it when
Dealing with moderate or large training sets
The attributes that describe the instances are seemingly conditionally independent
In reality, conditional independence is often violated...
...but Naive Bayes does not seem to know it. It often works surprisingly well!!
Typical applications: text classification, medical diagnosis, recommendation systems
Next class we’ll see how to train a Naive Bayes classifier
Part II
Bayesian Networks
Representing Conditional Independence
Bayes Optimal Classifier (BOC): no independence information between RVs; computationally expensive
Naive Bayes (NB): full independence given the class; extremely restrictive assumption
Figure: Example Bayesian Network with nodes Burglary, Earthquake, Alarm, Radio Announcement and Neighbour Call
Bayesian Networks (BNs) describe conditional independence between subsets of RVs by a graphical model
I(X, Y|Z) ⇔ P(X, Y|Z) = P(X|Z)P(Y|Z)
They combine a-priori information concerning RV dependencies with observed data
Bayesian Network - Directed Representation
Directed Acyclic Graph (DAG)
Nodes represent random variables
Shaded ⇒ observed
Empty ⇒ unobserved
Edges describe the conditional independence relationships
Every variable is conditionally independent of its non-descendants, given its parents
Conditional Probability Tables (CPTs), local to each node, describe the probability distribution given its parents
P(Y1, . . . , YN) = ∏_{i=1}^{N} P(Yi | pa(Yi))
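As a sketch of the factorization, assuming the usual structure in which Burglary and Earthquake are parents of Alarm, Earthquake is the parent of Radio Announcement, and Alarm is the parent of Neighbour Call (the CPT numbers below are invented, since the slide's figure gives none), the joint is the product of each node's CPT given its parents.

# Invented CPTs for the Burglary/Earthquake/Alarm/Radio/NeighbourCall network
p_b, p_e = 0.01, 0.02                                  # P(Burglary = 1), P(Earthquake = 1)
p_a = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.30, (False, False): 0.01}      # P(Alarm = 1 | Burglary, Earthquake)
p_r = {True: 0.90, False: 0.001}                       # P(Radio = 1 | Earthquake)
p_c = {True: 0.80, False: 0.05}                        # P(NeighbourCall = 1 | Alarm)

def bern(p_true, value):
    """Probability of a binary value under a Bernoulli with P(1) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, r, c):
    """P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_r[e], r) * bern(p_c[a], c))

print(joint(True, False, True, False, True))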
Naive Bayes as a Bayesian Network
Figure: Naive Bayes as a Bayesian Network (class node c with attribute children a1, a2, . . . , aL) and its equivalent plate-notation diagram
The Naive Bayes classifier can be represented as a Bayesian Network
A more compact representation =⇒ Plate Notation
Allows specifying more (Bayesian) details of the model
Plate Notation
Figure: Bayesian Naive Bayes in plate notation, with attribute nodes al, class node c, parameters βj, class prior π, and plates of sizes L, C and D
Boxes denote replication for a number of times denoted by the letter in the corner
Shaded nodes are observed variables
Empty nodes denote unobserved latent variables
Black seeds identify constant terms (e.g. the prior distribution π over the classes)
Inference in Bayesian Networks
Inference - How can one determine the distribution of the values of one or several network variables, given the observed values of others?
Bayesian Networks contain all the needed information (the joint probability)
An NP-hard problem in general
Solves easily for a single unobserved variable...
...otherwise
Exact inference on small DAGs
Approximate inference (variational methods)
Simulate the network (Monte Carlo)
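A sketch of the "simulate the network" option: ancestral sampling draws each variable given its already-sampled parents and estimates a query probability by counting; the tiny Burglary/Earthquake/Alarm network and its numbers are invented.

import random

p_b, p_e = 0.01, 0.02                                  # P(Burglary = 1), P(Earthquake = 1), invented
p_a = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.30, (False, False): 0.01}      # P(Alarm = 1 | Burglary, Earthquake)

def sample():
    """Ancestral sampling: draw each RV given its already-sampled parents."""
    b = random.random() < p_b
    e = random.random() < p_e
    a = random.random() < p_a[(b, e)]
    return b, e, a

# Monte Carlo estimate of P(Burglary = 1 | Alarm = 1) by rejection counting
samples = [sample() for _ in range(200_000)]
alarm_on = [s for s in samples if s[2]]
print(sum(s[0] for s in alarm_on) / len(alarm_on))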
Learning with Bayesian Networks
Structure →              Fixed structure, fixed variables        Discover dependencies from the data
Data ↓                   (X → Y, estimate P(Y|X))                (X  Y, estimate P(X, Y))

Complete                 Naive Bayes                             Structure search
                         Calculate frequencies (ML)              Independence tests

Incomplete               Latent variables                        Difficult problem
                         EM Algorithm (ML)                       Structural EM
                         MCMC, VBEM (Bayesian)

                         Parameter Learning                      Structure Learning
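As a preview of the "Calculate frequencies (ML)" cell, with complete data and a fixed structure each CPT entry is just a normalized count; the tiny dataset and the X → Y structure below are invented.

from collections import Counter

# Complete observations of (X, Y) for the two-node network X -> Y (toy data)
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# ML estimate of P(Y = y | X = x) = count(x, y) / count(x)
counts_xy = Counter(data)
counts_x = Counter(x for x, _ in data)
cpt_y_given_x = {(x, y): counts_xy[(x, y)] / counts_x[x] for (x, y) in counts_xy}
print(cpt_y_given_x)   # e.g. P(Y = 1 | X = 1) = 3/4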
Take Home Messages
Bayesian techniques formulate learning as probabilistic inference
ML learning selects the hypothesis that maximizes the data likelihood
MAP learning selects the most likely hypothesis given the data (a regularization of ML)
Conditional independence relations can be expressed by a Bayesian Network
Parameter Learning (next lecture); Structure Learning