TRANSCRIPT
Introduction to Bayesian Learning
Davide Bacciu
Dipartimento di Informatica, Università di Pisa - [email protected]
Apprendimento Automatico: Fondamenti - A.A. 2016/2017
Outline
Lecture 1 - Introduction: probabilistic reasoning and learning (Bayesian Networks)
Lecture 2 - Parameter Learning (I): learning fully observed Bayesian models
Lecture 3 - Parameter Learning (II): learning with hidden variables
If we have time, we will also cover some application examples of Bayesian learning and Bayesian Networks
Course Information
Reference Webpage:
http://pages.di.unipi.it/bacciu/teaching/apprendimento-automatico-fondamenti/
Here you can find: course information, lecture slides, articles and course materials
Office Hours
Where? Dipartimento di Informatica - 2nd Floor - Room 367
When? Monday 17-19
When? Basically anytime, if you send me an email beforehand.
Lecture Outline
1 Introduction
2 Probability Theory: Probabilities and Random Variables; Bayes Theorem and Independence
3 Learning as Bayesian Inference: Hypothesis Selection; Bayesian Classifiers
4 Bayesian Networks: Representation; Learning and Inference
Bayesian Learning
Why Bayesian? Easy, because of the frequent use of Bayes theorem...
Bayesian Inference: a powerful approach to probabilistic reasoning
Bayesian Networks: an expressive model for describing probabilistic relationships
Why bother?
The real world is uncertain: data (noisy measurements and partial knowledge) and beliefs (concepts and their relationships)
Probability as a measure of our beliefs: a conceptual framework for describing uncertainty in world representation
Learning and reasoning become matters of probabilistic inference, with probabilistic weighting of the hypotheses
Part I
Probability and Learning
Random Variables
A Random Variable (RV) is a function describing the outcome of a random process by assigning unique values to all possible outcomes of the experiment
Random Process =⇒ Coin Toss
Discrete RV =⇒ X = 0 if heads, 1 if tails
The sample space S of a random process is the set of all possible outcomes, e.g. S = {heads, tails}
An event e is a subset e ⊆ S, i.e. a set of outcomes that may or may not occur as a result of the experiment
Random variables are the building blocks for representing our world
Probability Functions
A probability function P(X = x) ∈ [0,1] (P(x) in short) measures the probability of an RV X attaining the value x, i.e. the probability of event x occurring
If the random process is described by a set of RVs X1, . . . , XN, then the joint probability writes
P(X1 = x1, . . . , XN = xN) = P(x1 ∧ · · · ∧ xN)
Definition (Sum Rule)
The probabilities of all the events must sum to 1
∑_x P(X = x) = 1
Product Rule and Conditional Probabilities
Definition (Product Rule a.k.a. Chain Rule)
P(x1, . . . , xi, . . . , xN | y) = ∏_{i=1}^{N} P(xi | x1, . . . , xi−1, y)
P(x|y) is the conditional probability of x given y
It reflects the fact that the realization of an event y may affect the occurrence of x
Marginalization: sum and product rules together yield the complete probability equation
P(X1 = x1) = ∑_{x2} P(X1 = x1, X2 = x2) = ∑_{x2} P(X1 = x1 | X2 = x2) P(X2 = x2)
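The rules above can be checked numerically; below is a minimal Python sketch (not part of the slides) on a toy joint distribution over two binary RVs, with probability values invented for illustration.

# Toy joint distribution P(X1, X2) over two binary RVs (values chosen arbitrarily)
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Sum rule: the probabilities of all joint events sum to 1
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginalization: P(X1 = x1) = sum over x2 of P(X1 = x1, X2 = x2)
def marginal_x1(x1):
    return sum(p for (v1, v2), p in joint.items() if v1 == x1)

print(marginal_x1(0), marginal_x1(1))   # 0.5 and 0.5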
Bayes Rule
Given hypothesis hi ∈ H and observations d
P(hi|d) = P(d|hi)P(hi) / P(d) = P(d|hi)P(hi) / (∑_j P(d|hj)P(hj))
P(hi) is the prior probability of hi
P(d|hi) is the conditional probability of observing d given that hypothesis hi is true (likelihood)
P(d) is the marginal probability of d
P(hi|d) is the posterior probability that the hypothesis is true given the data and the previous belief about the hypothesis
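As a small hedged sketch of the rule in code (hypothetical priors and likelihoods, not from the slides): the posterior is the prior times the likelihood, normalized by the marginal P(d).

# Hypothetical priors P(h_i) and likelihoods P(d | h_i) for three hypotheses
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.70}

# Marginal P(d) = sum over j of P(d | h_j) P(h_j)
p_d = sum(likelihood[h] * prior[h] for h in prior)

# Posterior P(h_i | d) = P(d | h_i) P(h_i) / P(d)
posterior = {h: likelihood[h] * prior[h] / p_d for h in prior}
print(posterior)   # the posteriors sum to 1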
So, Is This All About Bayes???
Frequentists vs. Bayesians
Figure: Frequentists vs. Bayesians cartoon (credit goes to Mike West @ Duke University)
Independence and Conditional Independence
Two RVs X and Y are independent if knowledge about X does not change the uncertainty about Y, and vice versa
I(X, Y) ⇔ P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X) = P(X)P(Y)
Two RVs X and Y are conditionally independent given Z if, once Z is known, the realization of X is independent of the realization of Y
I(X, Y|Z) ⇔ P(X, Y|Z) = P(X|Y, Z)P(Y|Z) = P(Y|X, Z)P(X|Z) = P(X|Z)P(Y|Z)
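A possible way to see conditional independence operationally is the sketch below (invented numbers): a joint P(X, Y, Z) built so that X and Y are generated independently given Z satisfies P(X, Y|Z) = P(X|Z) P(Y|Z) for every value of Z.

from itertools import product

# Toy CPTs: X and Y are generated independently given Z (numbers are invented)
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: 0.9, 1: 0.2}          # P(X = 1 | Z = z)
p_y_given_z = {0: 0.3, 1: 0.5}          # P(Y = 1 | Z = z)

def bern(p_one, value):
    return p_one if value else 1.0 - p_one

# Joint P(X, Y, Z) assembled from the CPTs above
joint = {(x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
         for x, y, z in product([0, 1], repeat=3)}

# Check P(X=1, Y=1 | Z=z) = P(X=1 | Z=z) * P(Y=1 | Z=z) for both values of z
for z in (0, 1):
    p_xy = joint[(1, 1, z)] / p_z[z]
    assert abs(p_xy - p_x_given_z[z] * p_y_given_z[z]) < 1e-12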
Representing Probabilities with Discrete Data
Joint Probability Distribution Table
X1     . . .   Xi     . . .   Xn     P(X1, . . . , Xn)
x′1    . . .   x′i    . . .   x′n    P(x′1, . . . , x′n)
. . .
xl1    . . .   xli    . . .   xln    P(xl1, . . . , xln)
Describes P(X1, . . . ,Xn) for all the RV instantiations x1, . . . , xn
In general, any probability of interest can be obtained starting from the Joint Probability Distribution P(X1, . . . , Xn)
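To make that last statement concrete, here is a minimal sketch (toy numbers, not from the slides) showing marginal and conditional probabilities being derived from a single joint probability table.

# Toy joint probability table over three binary RVs (X1, X2, X3); the entries sum to 1
jpt = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def prob(**fixed):
    """P of the event fixing some RVs (e.g. X1=1), summing out the others."""
    idx = {("X1", "X2", "X3").index(k): v for k, v in fixed.items()}
    return sum(p for x, p in jpt.items() if all(x[i] == v for i, v in idx.items()))

# Marginal and conditional probabilities read off the same table
p_x1 = prob(X1=1)                              # P(X1 = 1)
p_x1_given_x3 = prob(X1=1, X3=1) / prob(X3=1)  # P(X1 = 1 | X3 = 1)
print(p_x1, p_x1_given_x3)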
Wrapping Up....
We know how to represent the world and the observations
Random Variables =⇒ X1, . . . , XN
Joint Probability Distribution =⇒ P(X1 = x1, . . . , XN = xN)
We have rules for manipulating the probabilistic knowledge: sum-product, marginalization, Bayes, conditional independence
It is about time that we do some learning... in other words, let's discover the values of P(X1 = x1, . . . , XN = xN)
Bayesian Learning
Statistical learning approaches calculate the probability of each hypothesis hi given the data D, and select hypotheses/make predictions on that basis
Bayesian learning makes predictions using all hypotheses, weighted by their probabilities
P(X|D = d) = ∑_i P(X|d, hi) P(hi|d) = ∑_i P(X|hi) · P(hi|d)
(new prediction = hypothesis prediction × posterior weighting)
Single Hypothesis Approximation
Computational and Analytical Tractability Issue
Bayesian learning requires a (possibly infinite) summation over the whole hypothesis space
Maximum a-Posteriori (MAP) predicts P(X|hMAP) using the most likely hypothesis hMAP given the training data
hMAP = arg max_{h∈H} P(h|d) = arg max_{h∈H} P(d|h)P(h) / P(d) = arg max_{h∈H} P(d|h)P(h)
Assuming uniform priors P(hi) = P(hj) yields the Maximum Likelihood (ML) estimate P(X|hML)
hML = arg max_{h∈H} P(d|h)
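A minimal sketch of the two point estimates (hypothetical priors and likelihoods): hMAP maximizes P(d|h)P(h), while hML drops the prior; with an informative prior the two can disagree.

# Hypothetical priors and data likelihoods for a small hypothesis space
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(d | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # arg max P(d|h) P(h)
h_ml = max(prior, key=lambda h: likelihood[h])               # arg max P(d|h)
print(h_map, h_ml)                                           # h1 (MAP) vs h3 (ML)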
All Too Abstract?
Let’s go to the Cinema!!!
How do I choose the next movie (prediction)?
I might ask my friends for their favorite choice given their personal taste (hypothesis)
Select the movie
Bayesian advice? Take a vote over all the friends' suggestions, weighted by their cinema attendance and taste
MAP advice? Ask the friend who goes often to the cinema and whose taste I trust
ML advice? Ask the friend who goes to the cinema most often
The Candy Box Problem
A candy manufacturer produces 5 types of candy boxes (hypotheses) that are indistinguishable in the darkness of the cinema
h1: 100% cherry flavor
h2: 75% cherry and 25% lime flavor
h3: 50% cherry and 50% lime flavor
h4: 25% cherry and 75% lime flavor
h5: 100% lime flavor
Given a sequence of candies d = d1, . . . , dN extracted and reinserted in a box (observations), what is the most likely flavor for the next candy (prediction)?
Candy Box Problem - Hypothesis Posterior
First, we need to compute the posterior for each hypothesis (Bayes)
P(hi|d) = α P(d|hi) P(hi)
The manufacturer is kind enough to provide us with the production shares (prior) for the 5 boxes
(P(h1), P(h2), P(h3), P(h4), P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
The data likelihood can be computed under the assumption that observations are independently and identically distributed (i.i.d.)
P(d|hi) = ∏_{j=1}^{N} P(dj|hi)
Candy Box Problem - Hypothesis Posterior Computation
Suppose that the box is an h5 and consider a sequence of 10 observed lime candies
Figure: Posterior probabilities P(h1|d), . . . , P(h5|d) as a function of the number of samples in d (from 0 to 10)

Hyp   d0    d1    d2
h1    0.1   0     0
h2    0.2   0.1   0.03
h3    0.4   0.4   0.30
h4    0.2   0.3   0.35
h5    0.1   0.2   0.31

P(hi|d) = α P(hi) P(d = l|hi)^N
Posteriors of false hypotheses eventually vanish
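The posterior table above can be reproduced (up to rounding) with a few lines of Python, using the priors and flavor proportions given in the slides; this is only an illustrative sketch.

# Priors (production shares) and per-candy lime probabilities for h1..h5, from the slides
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]            # P(d_j = lime | h_i)

def posterior(num_limes):
    """P(h_i | d) after observing num_limes lime candies (i.i.d. observations)."""
    unnorm = [pr * pl ** num_limes for pr, pl in zip(prior, p_lime)]
    alpha = 1.0 / sum(unnorm)                   # normalization constant
    return [alpha * u for u in unnorm]

for n in range(3):                              # the d0, d1, d2 columns of the table above
    print(n, [round(p, 2) for p in posterior(n)])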
Candy Box Problem - Comparing Predictions
Bayesian learning seeks
P(d11 = l | d1 = l, . . . , d10 = l) = ∑_{i=1}^{5} P(d11 = l | hi) P(hi|d)
Figure: Probability that the next candy is lime, as a function of the number of samples in d (from 0 to 10)
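Continuing the candy sketch (the priors and flavor proportions are repeated so the snippet runs on its own), the Bayesian prediction averages each hypothesis's lime probability with its posterior weight, exactly as in the formula above.

# Priors and per-candy lime probabilities as in the posterior sketch above
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(n):
    unnorm = [pr * pl ** n for pr, pl in zip(prior, p_lime)]
    return [u / sum(unnorm) for u in unnorm]

# P(next = lime | n limes observed) = sum over i of P(lime | h_i) P(h_i | d)
def p_next_lime(n):
    return sum(pl * po for pl, po in zip(p_lime, posterior(n)))

print([round(p_next_lime(n), 2) for n in range(11)])   # climbs towards 1 as limes accumulate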
Observations
Both ML and MAP are point estimates, since they only make predictions based on the most likely hypothesis
MAP predictions are approximately Bayesian if P(X|d) ∼ P(X|hMAP)
MAP and Bayesian predictions become closer as more data becomes available
ML is a good approximation to MAP if the dataset is large and there are no a-priori preferences on the hypotheses
ML is fully data-driven
For large datasets, the influence of the prior becomes irrelevant
ML has problems with small datasets
Regularization
P(h) introduces a preference across hypotheses: penalize complexity
Complex hypotheses have a lower prior probability
The hypothesis prior embodies a trade-off between complexity and degree of fit
MAP hypothesis hMAP
max_h P(d|h)P(h) ≡ min_h −log2 P(d|h) − log2 P(h)
−log2 P(h) is the number of bits required to specify h
MAP =⇒ choosing the hypothesis that provides maximum compression
MAP is a regularization of the ML estimation
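A quick numeric check (hypothetical numbers, reusing the earlier MAP/ML sketch) that maximizing P(d|h)P(h) and minimizing −log2 P(d|h) − log2 P(h) select the same hypothesis.

from math import log2

# Hypothetical priors and data likelihoods
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(d | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])               # max P(d|h) P(h)
h_mdl = min(prior, key=lambda h: -log2(likelihood[h]) - log2(prior[h]))  # min description length
assert h_map == h_mdl
print(h_map)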
Bayes Optimal Classifier
Suppose we want to classify a new observation as one of the classes cj ∈ C
The most probable Bayesian classification is
c∗OPT = arg max_{cj∈C} P(cj|d) = arg max_{cj∈C} ∑_i P(cj|hi) P(hi|d)
Bayesian Optimal Classifier
No other classification method (using the same hypothesis space and prior knowledge) can outperform a Bayesian Optimal Classifier on average.
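A minimal sketch with invented posteriors: even when the single most probable hypothesis predicts one class, the posterior-weighted vote of the Bayes Optimal Classifier can prefer another.

# Posterior P(h_i | d) and the class probabilities each hypothesis assigns (toy values)
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_class_given_h = {                  # P(c | h_i) for classes "+" and "-"
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes=("+", "-")):
    # arg max over c of the sum over i of P(c | h_i) P(h_i | d)
    return max(classes, key=lambda c: sum(p_class_given_h[h][c] * posterior[h]
                                          for h in posterior))

print(bayes_optimal())   # "-" wins (0.6) even though h1 alone would predict "+"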
Gibbs Algorithm
The Bayes Optimal Classifier provides the best results, but can be extremely expensive
Use approximated methods =⇒ the Gibbs Algorithm
1 Choose a hypothesis hG from H at random, by drawing it from the current posterior probability distribution P(hi|d)
2 Use hG to predict the classification of the next instance
Under certain conditions, the expected number of misclassifications is at most twice that of a Bayesian Optimal classifier
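A sketch of the Gibbs algorithm on the same toy posterior (invented values): draw one hypothesis from the posterior and classify with it alone.

import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}           # P(h_i | d), toy values
prediction_of_h = {"h1": "+", "h2": "-", "h3": "-"}     # class predicted by each hypothesis

def gibbs_classify():
    # 1. draw h_G from the posterior P(h_i | d)
    h_g = random.choices(list(posterior), weights=list(posterior.values()))[0]
    # 2. classify the next instance using h_G alone
    return prediction_of_h[h_g]

print([gibbs_classify() for _ in range(5)])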
Naive Bayes Classifier
One of the simplest, yet most popular, tools based on a strong probabilistic assumption
Consider the setting
Target classification function f : X −→ C
Each instance x ∈ X is described by a set of attributes x = ⟨a1, . . . , al, . . . , aL⟩
Seek the MAP classification
cNB = arg max_{cj∈C} P(cj | a1, . . . , aL)
Naive Bayes Assumption
The MAP classification rewrites
cNB = arg max_{cj∈C} P(cj | a1, . . . , aL) = arg max_{cj∈C} P(a1, . . . , aL | cj) P(cj)
Naive Bayes: Assume conditional independence between the attributes al given the classification cj
P(a1, . . . , aL | cj) = ∏_{l=1}^{L} P(al | cj)
Naive Bayes Classification
cNB = arg max_{cj∈C} P(cj) ∏_{l=1}^{L} P(al | cj)
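A minimal sketch of the classification rule, with made-up class priors and CPT entries: multiply the class prior by the per-attribute conditionals and take the arg max.

# Hypothetical class priors and per-attribute CPTs P(a_l | c)
prior = {"spam": 0.3, "ham": 0.7}
cpt = {                                   # cpt[class][attribute_index][attribute_value]
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.5, "no": 0.5}],
}

def naive_bayes(attributes):
    def score(c):
        s = prior[c]
        for l, a in enumerate(attributes):
            s *= cpt[c][l][a]             # P(a_l | c), conditional independence assumption
        return s
    return max(prior, key=score)

print(naive_bayes(["yes", "yes"]))        # spam: 0.3*0.8*0.6 = 0.144 vs ham: 0.7*0.1*0.5 = 0.035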
Observations on Naive Bayes
Use it when
Dealing with moderate or large training sets
The attributes that describe the instances are seemingly conditionally independent
In reality, conditional independence is often violated...
...but Naive Bayes does not seem to know it. It often works surprisingly well!!
Typical applications: text classification, medical diagnosis, recommendation systems
Next class we’ll see how to train a Naive Bayes classifier
Part II
Bayesian Networks
Representing Conditional Independence
Bayes Optimal Classifier (BOC): no independence information between RVs; computationally expensive
Naive Bayes (NB): full independence given the class; extremely restrictive assumption
Figure: Example Bayesian Network with nodes Burglary, Earthquake, Alarm, Radio Announcement and Neighbour Call
Bayesian Networks (BNs) describe conditional independence between subsets of RVs by a graphical model
I(X, Y|Z) ⇔ P(X, Y|Z) = P(X|Z)P(Y|Z)
They combine a-priori information concerning RV dependencies with observed data
Bayesian Network - Directed Representation
Directed Acyclic Graph (DAG)
Nodes represent random variables
Shaded ⇒ observed
Empty ⇒ unobserved
Edges describe the conditional independence relationships
Every variable is conditionally independent of its non-descendants, given its parents
Conditional Probability Tables (CPTs), local to each node, describe the probability distribution given its parents
P(Y1, . . . , YN) = ∏_{i=1}^{N} P(Yi | pa(Yi))
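As a sketch of the factorization, assuming the usual structure in which Burglary and Earthquake are parents of Alarm, Earthquake is the parent of Radio Announcement, and Alarm is the parent of Neighbour Call (the CPT numbers below are invented, since the slide's figure gives none), the joint is the product of each node's CPT given its parents.

# Invented CPTs for the Burglary/Earthquake/Alarm/Radio/NeighbourCall network
p_b, p_e = 0.01, 0.02                                  # P(Burglary = 1), P(Earthquake = 1)
p_a = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.30, (False, False): 0.01}      # P(Alarm = 1 | Burglary, Earthquake)
p_r = {True: 0.90, False: 0.001}                       # P(Radio = 1 | Earthquake)
p_c = {True: 0.80, False: 0.05}                        # P(NeighbourCall = 1 | Alarm)

def bern(p_true, value):
    """Probability of a binary value under a Bernoulli with P(1) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, r, c):
    """P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)."""
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
            * bern(p_r[e], r) * bern(p_c[a], c))

print(joint(True, False, True, False, True))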
Naive Bayes as a Bayesian Network
Figure: Naive Bayes as a Bayesian Network (class node c with attribute children a1, a2, . . . , aL) and its equivalent plate-notation diagram
The Naive Bayes classifier can be represented as a Bayesian Network
A more compact representation =⇒ Plate Notation
Allows specifying more (Bayesian) details of the model
Plate Notation
Figure: Bayesian Naive Bayes in plate notation, with attribute nodes al, class node c, parameters βj, class prior π, and plates of sizes L, C and D
Boxes denote replication for a number of times denoted by the letter in the corner
Shaded nodes are observed variables
Empty nodes denote unobserved latent variables
Black seeds identify constant terms (e.g. the prior distribution π over the classes)
Inference in Bayesian Networks
Inference - How can one determine the distribution of the values of one or several network variables, given the observed values of others?
Bayesian Networks contain all the needed information (the joint probability)
An NP-hard problem in general
Solves easily for a single unobserved variable...
...otherwise
Exact inference on small DAGs
Approximate inference (variational methods)
Simulate the network (Monte Carlo)
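A sketch of the "simulate the network" option: ancestral sampling draws each variable given its already-sampled parents and estimates a query probability by counting; the tiny Burglary/Earthquake/Alarm network and its numbers are invented.

import random

p_b, p_e = 0.01, 0.02                                  # P(Burglary = 1), P(Earthquake = 1), invented
p_a = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.30, (False, False): 0.01}      # P(Alarm = 1 | Burglary, Earthquake)

def sample():
    """Ancestral sampling: draw each RV given its already-sampled parents."""
    b = random.random() < p_b
    e = random.random() < p_e
    a = random.random() < p_a[(b, e)]
    return b, e, a

# Monte Carlo estimate of P(Burglary = 1 | Alarm = 1) by rejection counting
samples = [sample() for _ in range(200_000)]
alarm_on = [s for s in samples if s[2]]
print(sum(s[0] for s in alarm_on) / len(alarm_on))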
Learning with Bayesian Networks
Structure →              Fixed structure, fixed variables        Discover dependencies from the data
Data ↓                   (X → Y, estimate P(Y|X))                (X  Y, estimate P(X, Y))

Complete                 Naive Bayes                             Structure search
                         Calculate frequencies (ML)              Independence tests

Incomplete               Latent variables                        Difficult problem
                         EM Algorithm (ML)                       Structural EM
                         MCMC, VBEM (Bayesian)

                         Parameter Learning                      Structure Learning
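As a preview of the "Calculate frequencies (ML)" cell, with complete data and a fixed structure each CPT entry is just a normalized count; the tiny dataset and the X → Y structure below are invented.

from collections import Counter

# Complete observations of (X, Y) for the two-node network X -> Y (toy data)
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# ML estimate of P(Y = y | X = x) = count(x, y) / count(x)
counts_xy = Counter(data)
counts_x = Counter(x for x, _ in data)
cpt_y_given_x = {(x, y): counts_xy[(x, y)] / counts_x[x] for (x, y) in counts_xy}
print(cpt_y_given_x)   # e.g. P(Y = 1 | X = 1) = 3/4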
Take Home Messages
Bayesian techniques formulate learning as probabilistic inference
ML learning selects the hypothesis that maximizes the data likelihood
MAP learning selects the most likely hypothesis given the data (a regularization of ML)
Conditional independence relations can be expressed by a Bayesian Network
Parameter Learning (next lecture); Structure Learning