Bayesian Learning
[Read Ch. 6] [Suggested exercises: 6.1, 6.2, 6.6]
• Bayes theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Two Roles for Bayesian Methods
Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor
Bayes Theorem
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

• $P(h)$ = prior probability of hypothesis $h$
• $P(D)$ = prior probability of training data $D$
• $P(h|D)$ = probability of $h$ given $D$
• $P(D|h)$ = probability of $D$ given $h$
Choosing Hypotheses
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

Generally want the most probable hypothesis given the training data: the Maximum a posteriori hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

If assume $P(h_i) = P(h_j)$ then can further simplify, and choose the Maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$$
Bayes Theorem
Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

$P(cancer) = .008 \qquad P(\neg cancer) = .992$
$P(+|cancer) = .98 \qquad P(-|cancer) = .02$
$P(+|\neg cancer) = .03 \qquad P(-|\neg cancer) = .97$
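A quick numeric check of this example (a minimal sketch; the priors and likelihoods are the figures above):

```python
# Posterior probability of cancer given a positive test, by Bayes theorem.
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# P(+) by the theorem of total probability
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_not_cancer * p_not_cancer)

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(cancer|+) = {p_cancer_given_pos:.4f}")  # ~0.21: still unlikely
```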
Basic Formulas for Probabilities
• Product rule: probability $P(A \wedge B)$ of a conjunction of two events A and B:
$$P(A \wedge B) = P(A|B)P(B) = P(B|A)P(A)$$
• Sum rule: probability of a disjunction of two events A and B:
$$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
• Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$
2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$
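A direct translation into code (a sketch; `hypotheses` is a hypothetical list of objects carrying a `prior` attribute and a `likelihood(D)` method, introduced here only for illustration):

```python
def brute_force_map(hypotheses, D):
    """Return the MAP hypothesis by enumerating all of H."""
    # Unnormalized posteriors P(D|h) P(h); P(D) is a constant shared
    # by every h, so it can be dropped from the argmax.
    scores = [(h.likelihood(D) * h.prior, h) for h in hypotheses]
    return max(scores, key=lambda s: s[0])[1]
```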
Relation to Concept Learning
Consider our usual concept learning task
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs most specific hypothesis from the version space $VS_{H,D}$)

What would Bayes rule produce as the MAP hypothesis?

Does FindS output a MAP hypothesis??
Relation to Concept Learning
Assume fixed set of instances $\langle x_1, \ldots, x_m \rangle$
Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$
Choose $P(D|h)$:
Relation to Concept Learning
Assume fixed set of instances $\langle x_1, \ldots, x_m \rangle$
Assume D is the set of classifications: $D = \langle c(x_1), \ldots, c(x_m) \rangle$

Choose $P(D|h)$:
• $P(D|h) = 1$ if h consistent with D
• $P(D|h) = 0$ otherwise

Choose $P(h)$ to be uniform distribution:
• $P(h) = \frac{1}{|H|}$ for all h in H

Then,
$$P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$
Evolution of Posterior Probabilities
[Figure: three panels (a), (b), (c) showing the distribution over hypotheses: the prior $P(h)$, then the posterior $P(h|D_1)$, then $P(h|D_1, D_2)$, concentrating on fewer hypotheses as more data is observed.]
Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (the Candidate Elimination algorithm, taking training examples D and hypothesis space H and producing output hypotheses) is shown as equivalent to a Bayesian inference system (a brute-force MAP learner over the same D and H) once the prior assumptions are made explicit: P(h) uniform, P(D|h) = 0 if inconsistent, = 1 if consistent.]
Learning A Real Valued Function
[Figure: noisy training points (x, y) scattered around a target function f, with the maximum likelihood hypothesis $h_{ML}$ fit through them; e is the noise.]

Consider any real-valued target function f.
Training examples $\langle x_i, d_i \rangle$, where $d_i$ is noisy training value
• $d_i = f(x_i) + e_i$
• $e_i$ is random variable (noise) drawn independently for each $x_i$ according to some Gaussian distribution with mean = 0

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$
Learning A Real Valued Function
$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} p(D|h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h) \\
&= \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}
\end{aligned}$$

Maximize natural log of this instead...

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2 \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
\end{aligned}$$
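A small numeric illustration of this result (a sketch under assumed data: a linear hypothesis class and Gaussian noise with an arbitrary σ):

```python
import math
import random

random.seed(0)

# Noisy samples of f(x) = 2x + 1, as in d_i = f(x_i) + e_i
xs = [i / 10 for i in range(50)]
ds = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]

def neg_log_likelihood(w, b, sigma=0.3):
    """-ln p(D|h) for h(x) = w*x + b under Gaussian noise."""
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (d - (w * x + b))**2 / (2 * sigma**2)
               for x, d in zip(xs, ds))

def sse(w, b):
    """Sum of squared errors for the same hypothesis."""
    return sum((d - (w * x + b))**2 for x, d in zip(xs, ds))

# The two criteria rank candidate hypotheses identically:
cands = [(2.0, 1.0), (1.8, 1.2), (2.2, 0.9)]
print(min(cands, key=lambda p: neg_log_likelihood(*p)))
print(min(cands, key=lambda p: sse(*p)))  # same winner
```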
Learning to Predict Probabilities
Consider predicting survival probability from patient data.
Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0.
Want to train neural network to output a probability given $x_i$ (not a 0 or 1).

In this case can show

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i))$$

Weight update rule for a sigmoid unit:
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$
where
$$\Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk}$$
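A sketch of that update rule for a single sigmoid unit (the toy data and learning rate are assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(examples, n_weights, eta=0.1, epochs=500):
    """Batch gradient ascent on the cross-entropy likelihood above.

    examples: list of (x, d) with x a feature list and d in {0, 1}.
    """
    w = [0.0] * n_weights
    for _ in range(epochs):
        # Delta w_j = eta * sum_i (d_i - h(x_i)) * x_ij
        grad = [0.0] * n_weights
        for x, d in examples:
            h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (d - h) * xj
        w = [wj + eta * g for wj, g in zip(w, grad)]
    return w

# Tiny example: learn OR of two inputs (first input is a constant-1 bias)
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_sigmoid_unit(data, 3)
print([round(sigmoid(sum(wj * xj for wj, xj in zip(w, x))), 2)
       for x, _ in data])  # outputs near 0, 1, 1, 1
```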
Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis

MDL: prefer the hypothesis h that minimizes
$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$$
where $L_C(x)$ is the description length of x under encoding C

Example: H = decision trees, D = training data labels
• $L_{C_1}(h)$ is # bits to describe tree h
• $L_{C_2}(D|h)$ is # bits to describe D given h
• Note $L_{C_2}(D|h) = 0$ if examples classified perfectly by h. Need only describe exceptions.
• Hence $h_{MDL}$ trades off tree size for training errors
Minimum Description Length Principle
$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D|h)\,P(h) \\
&= \arg\max_{h \in H} \log_2 P(D|h) + \log_2 P(h) \\
&= \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h) \quad (1)
\end{aligned}$$

Interesting fact from information theory:

The optimal (shortest expected coding length) code for an event with probability p is $-\log_2 p$ bits.

So interpret (1):
• $-\log_2 P(h)$ is length of h under optimal code
• $-\log_2 P(D|h)$ is length of D given h under optimal code

→ prefer the hypothesis that minimizes
$$length(h) + length(misclassifications)$$
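A quick check of the Shannon code-length fact (the probabilities are assumed example values):

```python
import math

# Optimal expected code length for an event of probability p is -log2(p).
for p in (0.5, 0.25, 0.01):
    print(f"p = {p}: {-math.log2(p):.2f} bits")
# Rare events (small P(h), i.e. a priori unlikely hypotheses) cost more
# bits, which is exactly the MAP/MDL trade-off above.
```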
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., $h_{MAP}$).

Given new instance x, what is its most probable classification?
• $h_{MAP}(x)$ is not the most probable classification!

Consider:
• Three possible hypotheses:
$P(h_1|D) = .4 \qquad P(h_2|D) = .3 \qquad P(h_3|D) = .3$
• Given new instance x,
$h_1(x) = + \qquad h_2(x) = - \qquad h_3(x) = -$
• What's most probable classification of x?
Bayes Optimal Classifier

Bayes optimal classification:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D)$$

Example:
$P(h_1|D) = .4 \qquad P(-|h_1) = 0 \qquad P(+|h_1) = 1$
$P(h_2|D) = .3 \qquad P(-|h_2) = 1 \qquad P(+|h_2) = 0$
$P(h_3|D) = .3 \qquad P(-|h_3) = 1 \qquad P(+|h_3) = 0$

therefore
$$\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4$$
$$\sum_{h_i \in H} P(-|h_i)\,P(h_i|D) = .6$$

and
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = -$$
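The same computation in code (a minimal sketch; hypotheses are represented directly by their posteriors and class-conditional votes from the example above):

```python
# (posterior P(h_i|D), class votes P(v|h_i)) for each hypothesis
hypotheses = [
    (0.4, {'+': 1.0, '-': 0.0}),  # h1
    (0.3, {'+': 0.0, '-': 1.0}),  # h2
    (0.3, {'+': 0.0, '-': 1.0}),  # h3
]

def bayes_optimal(hypotheses, values=('+', '-')):
    # argmax_v sum_i P(v|h_i) P(h_i|D)
    totals = {v: sum(post * pv[v] for post, pv in hypotheses)
              for v in values}
    return max(totals, key=totals.get), totals

print(bayes_optimal(hypotheses))  # ('-', {'+': 0.4, '-': 0.6})
```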
Gibbs Classifier

Bayes optimal classifier provides best result, but can be expensive if many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to $P(h|D)$
2. Use this to classify new instance

Surprising fact: Assume target concepts are drawn at random from H according to priors on H. Then:
$$E[error_{Gibbs}] \leq 2\,E[error_{BayesOptimal}]$$

Suppose correct, uniform prior distribution over H, then
• Pick any hypothesis from VS, with uniform probability
• Its expected error no worse than twice Bayes optimal
Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbor, one of the most practical learning methods.

When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given classification

Successful applications:
• Diagnosis
• Classifying text documents
Naive Bayes Classifier

Assume target function $f: X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.

Most probable value of f(x) is:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j | a_1, a_2, \ldots, a_n)$$
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n | v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n | v_j)\,P(v_j)$$

Naive Bayes assumption:
$$P(a_1, a_2, \ldots, a_n | v_j) = \prod_i P(a_i | v_j)$$
which gives

Naive Bayes classifier: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i | v_j)$
Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value $v_j$
    $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
    For each attribute value $a_i$ of each attribute a
      $\hat{P}(a_i|v_j) \leftarrow$ estimate $P(a_i|v_j)$

Classify_New_Instance(x)
$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i|v_j)$$
Naive Bayes: Example

Consider PlayTennis again, and new instance
$\langle Outlk = sun, Temp = cool, Humid = high, Wind = strong \rangle$

Want to compute:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j)$$

$P(y)\,P(sun|y)\,P(cool|y)\,P(high|y)\,P(strong|y) = .005$
$P(n)\,P(sun|n)\,P(cool|n)\,P(high|n)\,P(strong|n) = .021$
$\to v_{NB} = n$
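Reproducing those two products in code (a sketch; the conditional probabilities below are the standard frequency estimates from the 14-example PlayTennis table, e.g. P(sun|yes) = 2/9):

```python
import math

# [P(v), P(sun|v), P(cool|v), P(high|v), P(strong|v)] for each class
p = {
    'yes': [9/14, 2/9, 3/9, 3/9, 3/9],
    'no':  [5/14, 3/5, 1/5, 4/5, 3/5],
}

scores = {v: math.prod(terms) for v, terms in p.items()}
print(scores)                       # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))  # 'no'
```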
Naive Bayes: Subtleties

1. Conditional independence assumption is often violated
$$P(a_1, a_2, \ldots, a_n | v_j) = \prod_i P(a_i|v_j)$$
• ...but it works surprisingly well anyway. Note don't need estimated posteriors $\hat{P}(v_j|x)$ to be correct; need only that
$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n|v_j)$$
• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
Naive Bayes: Subtleties

2. What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then
$$\hat{P}(a_i|v_j) = 0, \text{ and...} \quad \hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = 0$$

Typical solution is Bayesian estimate for $\hat{P}(a_i|v_j)$:
$$\hat{P}(a_i|v_j) = \frac{n_c + mp}{n + m}$$
where
• n is number of training examples for which $v = v_j$
• $n_c$ is number of examples for which $v = v_j$ and $a = a_i$
• p is prior estimate for $\hat{P}(a_i|v_j)$
• m is weight given to prior (i.e., number of "virtual" examples)
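The m-estimate as a function (a minimal sketch of the formula above; the numbers in the usage line are assumed):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(a_i|v_j): (n_c + m*p) / (n + m).

    Blends the observed fraction n_c/n with a prior p carrying
    the weight of m "virtual" examples.
    """
    return (n_c + m * p) / (n + m)

# No observed examples, yet the estimate stays nonzero:
print(m_estimate(n_c=0, n=5, p=1/3, m=3))  # 0.125
```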
Learning to Classify Text
Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among most effective algorithms.

What attributes shall we use to represent text documents??
Learning to Classify Text
Target concept $Interesting?: Document \to \{+, -\}$
1. Represent each document by vector of words
• one attribute per word position in document
2. Learning: Use training examples to estimate
• $P(+)$
• $P(-)$
• $P(doc|+)$
• $P(doc|-)$

Naive Bayes conditional independence assumption
$$P(doc|v_j) = \prod_{i=1}^{length(doc)} P(a_i = w_k | v_j)$$
where $P(a_i = w_k|v_j)$ is probability that word in position i is $w_k$, given $v_j$

One more assumption:
$$P(a_i = w_k|v_j) = P(a_m = w_k|v_j), \; \forall i, m$$
Learn_naive_Bayes_text(Examples, V)
1. Collect all words and other tokens that occur in Examples
• $Vocabulary \leftarrow$ all distinct words and other tokens in Examples
2. Calculate the required $P(v_j)$ and $P(w_k|v_j)$ probability terms
• For each target value $v_j$ in V do
  – $docs_j \leftarrow$ subset of Examples for which the target value is $v_j$
  – $P(v_j) \leftarrow \frac{|docs_j|}{|Examples|}$
  – $Text_j \leftarrow$ a single document created by concatenating all members of $docs_j$
  – $n \leftarrow$ total number of words in $Text_j$ (counting duplicate words multiple times)
  – For each word $w_k$ in Vocabulary
    * $n_k \leftarrow$ number of times word $w_k$ occurs in $Text_j$
    * $P(w_k|v_j) \leftarrow \frac{n_k + 1}{n + |Vocabulary|}$
Classify_naive_Bayes_text(Doc)
• $positions \leftarrow$ all word positions in Doc that contain tokens found in Vocabulary
• Return $v_{NB}$, where
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i|v_j)$$
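A compact Python rendering of both procedures (a sketch; log-probabilities are used to avoid underflow on long documents, a standard implementation detail not on the slide, and the tiny training set is invented for illustration):

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (list_of_words, label). Returns the model."""
    vocabulary = {w for words, _ in examples for w in words}
    labels = {label for _, label in examples}
    prior, cond = {}, {}
    for vj in labels:
        docs_j = [words for words, label in examples if label == vj]
        prior[vj] = len(docs_j) / len(examples)
        text_j = Counter(w for words in docs_j for w in words)
        n = sum(text_j.values())
        # Smoothed P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)
        cond[vj] = {w: (text_j[w] + 1) / (n + len(vocabulary))
                    for w in vocabulary}
    return prior, cond, vocabulary

def classify_naive_bayes_text(doc, prior, cond, vocabulary):
    positions = [w for w in doc if w in vocabulary]
    scores = {vj: math.log(prior[vj])
                  + sum(math.log(cond[vj][w]) for w in positions)
              for vj in prior}
    return max(scores, key=scores.get)

train = [("the game went to overtime".split(), "hockey"),
         ("new graphics card drivers".split(), "comp")]
prior, cond, vocab = learn_naive_bayes_text(train)
print(classify_naive_bayes_text("overtime game".split(),
                                prior, cond, vocab))  # hockey
```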
Twenty NewsGroups
Given 1000 training documents from each group, learn to classify new documents according to which newsgroup it came from:

  comp.graphics             misc.forsale
  comp.os.ms-windows.misc   rec.autos
  comp.sys.ibm.pc.hardware  rec.motorcycles
  comp.sys.mac.hardware     rec.sport.baseball
  comp.windows.x            rec.sport.hockey
  alt.atheism               sci.space
  soc.religion.christian    sci.crypt
  talk.religion.misc        sci.electronics
  talk.politics.mideast     sci.med
  talk.politics.misc        talk.politics.guns

Naive Bayes: 89% classification accuracy
Article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.e
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinio
Date: ... Apr ... GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But at best he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided
Learning Curve for 20 Newsgroups

[Figure: accuracy (0-100%) vs. training set size (100 to 10000 documents, 1/3 withheld for test) on the 20 newsgroups task, with curves labeled 20News Bayes, TFIDF, and PRTFIDF.]
Bayesian Belief Networks
Interesting because:
• Naive Bayes assumption of conditional independence too restrictive
• But it's intractable without some such assumptions...
• Bayesian belief networks describe conditional independence among subsets of variables
• allows combining prior knowledge about (in)dependencies among variables with observed training data

(also called Bayes Nets)
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
$$(\forall x_i, y_j, z_k)\; P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)$$
more compactly, we write
$$P(X|Y, Z) = P(X|Z)$$

Example: Thunder is conditionally independent of Rain, given Lightning:
$$P(Thunder|Rain, Lightning) = P(Thunder|Lightning)$$

Naive Bayes uses conditional independence to justify:
$$P(X, Y|Z) = P(X|Y, Z)\,P(Y|Z) = P(X|Z)\,P(Y|Z)$$
Bayesian Belief Network
[Figure: a directed acyclic graph over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm (S) and BusTourGroup (B) are the immediate predecessors of Campfire (C), whose conditional probability table is:]

        S,B   S,¬B   ¬S,B   ¬S,¬B
  C     0.4   0.1    0.8    0.2
  ¬C    0.6   0.9    0.2    0.8

Network represents a set of conditional independence assertions:
• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
• Directed acyclic graph
Bayesian Belief Network
[Figure: the same Storm / BusTourGroup / Lightning / Campfire / Thunder / ForestFire network and Campfire table as above.]

Represents joint probability distribution over all variables
• e.g., $P(Storm, BusTourGroup, \ldots, ForestFire)$
• in general,
$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$
where $Parents(Y_i)$ denotes immediate predecessors of $Y_i$ in graph
• so, joint distribution is fully defined by graph, plus the $P(y_i \mid Parents(Y_i))$
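A sketch of the factored joint for a three-node fragment of this network (only the Campfire table comes from the slide; the Storm and BusTourGroup priors are made-up numbers for illustration):

```python
# P(S), P(B) are assumed priors; P(C|S,B) is the Campfire table above.
p_storm = 0.2          # assumed
p_bus = 0.1            # assumed
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(s, b, c):
    """P(S=s, B=b, C=c) = P(s) * P(b) * P(c | s, b)."""
    ps = p_storm if s else 1 - p_storm
    pb = p_bus if b else 1 - p_bus
    pc = p_campfire[(s, b)] if c else 1 - p_campfire[(s, b)]
    return ps * pb * pc

# The joint over all 8 assignments sums to 1:
print(sum(joint(s, b, c) for s in (True, False)
          for b in (True, False) for c in (True, False)))  # 1.0
```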
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
• Bayes net contains all information needed for this inference
• If only one variable with unknown value, easy to infer it
• In general case, problem is NP-hard

In practice, can succeed in many cases:
• Exact inference methods work well for some network structures
• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions
Learning of Bayesian Networks
Several variants of this learning task:
• Network structure might be known or unknown
• Training examples might provide values of all network variables, or just some

If structure known and observe all variables:
• Then it's as easy as training a Naive Bayes classifier
Learning Bayes Nets
Suppose structure known, variables partially observable

e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
• Similar to training neural network with hidden units
• In fact, can learn network conditional probability tables using gradient ascent!
• Converge to network h that (locally) maximizes $P(D|h)$
Gradient Ascent for Bayes Nets
Let $w_{ijk}$ denote one entry in the conditional probability table for variable $Y_i$ in the network:
$$w_{ijk} = P(Y_i = y_{ij} \mid Parents(Y_i) = \text{the list } u_{ik} \text{ of values})$$
e.g., if $Y_i = Campfire$, then $u_{ik}$ might be $\langle Storm = T, BusTourGroup = F \rangle$

Perform gradient ascent by repeatedly:
1. Update all $w_{ijk}$ using training data D:
$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$
2. Then, renormalize the $w_{ijk}$ to assure
• $\sum_j w_{ijk} = 1$
• $0 \leq w_{ijk} \leq 1$
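A sketch of the shape of that update loop (the gradient term $P_h(y_{ij}, u_{ik} \mid d)$ must come from an inference routine over the current network, stubbed out here as a hypothetical `infer` callback):

```python
def gradient_ascent_step(w, data, infer, eta=0.01):
    """One ascent step on CPT entries w[i][j][k] = P(Y_i = y_ij | u_ik).

    infer(i, j, k, d) must return P_h(y_ij, u_ik | d) under the
    current network -- supplied by some inference procedure.
    """
    for i in range(len(w)):
        # 1. Update every entry of variable Y_i's table
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                grad = sum(infer(i, j, k, d) for d in data)
                w[i][j][k] += eta * grad / w[i][j][k]
        # 2. Renormalize so each conditional distribution (fixed k,
        #    summed over values j) sums to 1
        for k in range(len(w[i][0])):
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w
```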
More on Learning Bayes Nets
EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new $w_{ijk}$ to maximize $E[\ln P(D|h)]$, where D now includes both observed and (calculated probabilities of) unobserved variables

When structure unknown...
• Algorithms use greedy search to add/subtract edges and nodes
• Active research topic
Summary� Bayesian Belief Networks
• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area
  – Extend from boolean to real-valued variables
  – Parameterized distributions instead of tables
  – Extend to first-order instead of propositional systems
  – More effective inference methods
  – ...
Expectation Maximization (EM)

When to use:
• Data is only partially observable
• Unsupervised clustering (target value unobservable)
• Supervised learning (some instance attributes unobservable)

Some uses:
• Train Bayesian Belief Networks
• Unsupervised clustering (AUTOCLASS)
• Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians

[Figure: a density p(x) over x formed by overlapping Gaussian bumps.]

Each instance x generated by
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian
EM for Estimating k Means
Given:
• Instances from X generated by mixture of k Gaussian distributions
• Unknown means $\langle \mu_1, \ldots, \mu_k \rangle$ of the k Gaussians
• Don't know which instance $x_i$ was generated by which Gaussian

Determine:
• Maximum likelihood estimates of $\langle \mu_1, \ldots, \mu_k \rangle$

Think of full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$, where
• $z_{ij}$ is 1 if $x_i$ generated by jth Gaussian
• $x_i$ observable
• $z_{ij}$ unobservable
EM for Estimating k Means
EM Algorithm: Pick random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate:

E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds:
$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

M step: Calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated above. Replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu_1', \mu_2' \rangle$:
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
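Those two steps in code (a sketch for k = 2 with a fixed, assumed σ and synthetic data):

```python
import math
import random

random.seed(1)
SIGMA = 1.0

# Synthetic 1-D data from two Gaussians with true means 0 and 5
xs = ([random.gauss(0, SIGMA) for _ in range(100)]
      + [random.gauss(5, SIGMA) for _ in range(100)])

def em_two_means(xs, iters=20):
    mu = [random.choice(xs), random.choice(xs)]  # random initial h
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        E = []
        for x in xs:
            w = [math.exp(-(x - m)**2 / (2 * SIGMA**2)) for m in mu]
            s = sum(w)
            E.append([wj / s for wj in w])
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = [sum(E[i][j] * xs[i] for i in range(len(xs))) /
              sum(E[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

print([round(m, 2) for m in em_two_means(xs)])  # near 0 and 5
```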
EM Algorithm
Converges to local maximum likelihood h, and provides estimates of hidden variables $z_{ij}$.

In fact, local maximum in $E[\ln P(Y|h)]$:
• Y is complete (observable plus unobservable variables) data
• Expected value is taken over possible values of unobserved variables in Y
General EM Problem
Given:
• Observed data $X = \{x_1, \ldots, x_m\}$
• Unobserved data $Z = \{z_1, \ldots, z_m\}$
• Parameterized probability distribution $P(Y|h)$, where
  – $Y = \{y_1, \ldots, y_m\}$ is the full data $y_i = x_i \cup z_i$
  – h are the parameters

Determine:
• h that (locally) maximizes $E[\ln P(Y|h)]$

Many uses:
• Train Bayesian belief networks
• Unsupervised clustering (e.g., k means)
• Hidden Markov Models
General EM Method
Define likelihood function $Q(h'|h)$, which calculates $Y = X \cup Z$ using observed X and current parameters h to estimate Z:
$$Q(h'|h) \leftarrow E[\ln P(Y|h') \mid h, X]$$

EM Algorithm:

Estimation (E) step: Calculate $Q(h'|h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
$$Q(h'|h) \leftarrow E[\ln P(Y|h') \mid h, X]$$

Maximization (M) step: Replace hypothesis h by the hypothesis $h'$ that maximizes this Q function:
$$h \leftarrow \arg\max_{h'} Q(h'|h)$$