Università di Milano-Bicocca Laurea Magistrale in Informatica Corso di APPRENDIMENTO E APPROSSIMAZIONE Prof. Giancarlo Mauri Lezione 5 - Statistical Learning


Page 1: Università di Milano-Bicocca Laurea Magistrale in Informatica

Università di Milano-Bicocca - Laurea Magistrale in Informatica

Corso di

APPRENDIMENTO E APPROSSIMAZIONE

Prof. Giancarlo Mauri

Lezione 5 - Statistical Learning

Page 2: Università di Milano-Bicocca Laurea Magistrale in Informatica

Outline

Bayes theorem

MAP, ML hypotheses

Minimum Description Length principle

Optimal Bayes classifier

Naive Bayes classifier

Expectation Maximization (EM) algorithm

Page 3: Università di Milano-Bicocca Laurea Magistrale in Informatica

Two roles for Bayesian learning

Provides practical learning algorithms:

- Naive Bayes learning
- Bayesian belief network learning
- Combines prior knowledge (prior probabilities) with observed data
- Requires prior probabilities

Provides useful conceptual framework:

- “Gold standard” for evaluating other learning algorithms
- Additional insight into Occam’s razor

Page 4: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian learning

Advantages:

- No hypothesis is eliminated, even if inconsistent with the data
- Each hypothesis h is given a probability P(h) of being the correct one
- P(h) is incrementally updated after each observed example
- Hypotheses that make probabilistic predictions (e.g., for medical diagnosis) are allowed

Drawbacks:

- Prior probabilities are not easy to estimate
- Huge computational cost in the general case

Page 5: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian learning

View as updating of a probability distribution over the hypothesis space H

H (random) hypothesis variable, values h1, h2, …

Start with prior distribution P(H)

The jth observation gives the outcome dj of the random variable Dj; training data D = {d1, d2, …, dN}

Use Bayes theorem to compute posterior probability of each hypothesis

Page 6: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayes theorem

Allows computing the probabilities of hypotheses, given the training sample D and prior probabilities:

P(h|D) = P(D|h) P(h) / P(D)

- P(h) = prior probability of hypothesis h
- P(D) = probability of training set D
- P(h|D) = posterior probability of h given D
- P(D|h) = likelihood of D given h

NB: P(h|D) increases with P(D|h) and P(h), and decreases as P(D) increases

Page 7: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian prediction

Predictions are based on a weighted average of the hypothesis likelihoods, with the hypothesis posterior probabilities as weights:

P(X|D) = Σi P(X|D,hi) P(hi|D) = Σi P(X|hi) P(hi|D)

No need to pick one best-guess hypothesis!

Page 8: Università di Milano-Bicocca Laurea Magistrale in Informatica

Example

A patient takes a laboratory test for a tumour marker. We know that:

The incidence of the disease in the whole population is 8 per thousand (prior probability)

The test gives a positive result in 98% of the cases in which the disease is present (hence 2% false negatives)

It gives a negative result in 97% of the cases in which the disease is not present (hence 3% false positives)

Page 9: Università di Milano-Bicocca Laurea Magistrale in Informatica

Example

We have the following prior and conditional probabilities:

P(c) = 0.008    P(¬c) = 0.992
P(+|c) = 0.98   P(−|c) = 0.02
P(+|¬c) = 0.03  P(−|¬c) = 0.97

If our patient's test comes back positive (D = +), what are the posterior probabilities that he has or does not have cancer?

P(+|c) P(c) = 0.98 × 0.008 = 0.0078 ∝ P(c|+)
P(+|¬c) P(¬c) = 0.03 × 0.992 = 0.0298 ∝ P(¬c|+)

Page 10: Università di Milano-Bicocca Laurea Magistrale in Informatica

Example

Dividing by 0.0078 + 0.0298 to normalize to 1, we obtain:

P(c|+) = 0.21    P(¬c|+) = 0.79

This is a counterintuitive result, explained by the fact that, in a population that is overwhelmingly healthy, the false positives become very numerous.
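A minimal Python sketch of this computation (my own illustration, not part of the slides; variable names are mine):

# Posterior probability of cancer given a positive test, via Bayes' theorem
p_c, p_not_c = 0.008, 0.992              # priors P(c), P(not c)
p_pos_given_c = 0.98                     # P(+|c)
p_pos_given_not_c = 0.03                 # P(+|not c), the false-positive rate

num_c = p_pos_given_c * p_c              # 0.00784, proportional to P(c|+)
num_not_c = p_pos_given_not_c * p_not_c  # 0.02976, proportional to P(not c|+)

evidence = num_c + num_not_c             # P(+), the normalizer
print(round(num_c / evidence, 2))        # P(c|+)      ~ 0.21
print(round(num_not_c / evidence, 2))    # P(not c|+)  ~ 0.79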

Page 11: Università di Milano-Bicocca Laurea Magistrale in Informatica

Example

Suppose there are five kinds of bags of candies:

10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?
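As a sketch of how the Bayesian update and prediction formulas behave on this example (my own code; it assumes, as in the standard version of the example, that the observed candies are all limes):

# Posterior over the five bag hypotheses and predictive probability of the next candy
priors = {'h1': 0.10, 'h2': 0.20, 'h3': 0.40, 'h4': 0.20, 'h5': 0.10}
p_lime = {'h1': 0.00, 'h2': 0.25, 'h3': 0.50, 'h4': 0.75, 'h5': 1.00}  # P(lime | h)

def posterior(n_limes):
    # P(h|d) after observing n_limes lime candies (i.i.d. draws)
    unnorm = {h: priors[h] * p_lime[h] ** n_limes for h in priors}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

def predict_lime(n_limes):
    # P(next candy is lime | d) = sum_i P(lime|h_i) P(h_i|d)
    post = posterior(n_limes)
    return sum(p_lime[h] * post[h] for h in post)

for n in range(11):
    print(n, {h: round(p, 3) for h, p in posterior(n).items()}, round(predict_lime(n), 3))

Under these assumptions the posterior concentrates on the bag most compatible with the data and the predicted lime probability approaches 1, which is the kind of behaviour shown in the figures on the next slides.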

Page 12: Università di Milano-Bicocca Laurea Magistrale in Informatica

Posterior probability of hypotheses

[Figure: posterior probability of each hypothesis hi as a function of the number of samples in d]

Page 13: Università di Milano-Bicocca Laurea Magistrale in Informatica

Prediction probability

[Figure: Bayesian prediction probability as a function of the number of samples in d (0–10)]

Page 14: Università di Milano-Bicocca Laurea Magistrale in Informatica

Posterior probability of hypotheses

In the limit, the correct hypothesis will dominate the prediction, independently of the prior distribution, provided the correct hypothesis is not assigned probability 0

The Bayesian prediction is optimal, i.e. it will be correct more often than any other prediction

Page 15: Università di Milano-Bicocca Laurea Magistrale in Informatica

MAP approximation

Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose the most probable hypothesis wrt training data

hMAP = arg max_{h ∈ H} P(h|D)
     = arg max_{h ∈ H} P(D|h) P(h) / P(D)
     = arg max_{h ∈ H} P(D|h) P(h)
     = arg max_{h ∈ H} (log P(D|h) + log P(h))

N.B. Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis (basic idea of minimum description length (MDL) learning)

For deterministic hypotheses, P(D|h) is 1 if consistent, 0 otherwise

MAP = simplest consistent hypothesis

Page 16: Università di Milano-Bicocca Laurea Magistrale in Informatica

Example

[Figure: posterior probability of each hypothesis hi as a function of the number of samples in d]

After three samples, h5 will be used for prediction with probability 1

Page 17: Università di Milano-Bicocca Laurea Magistrale in Informatica

Brute force MAP learner

1. For each h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

hMAP = arg max_{h ∈ H} P(h|D)

Page 18: Università di Milano-Bicocca Laurea Magistrale in Informatica

Relation to Concept Learning

Consider our usual concept learning task instance space X, hypothesis space H, training examples D

Consider the FIND-S learning algorithm (outputs most specific hypothesis from the version space VSH,D)

What would Bayes rule produce as the MAP hypothesis?

Does Find-S output a MAP hypothesis?

Page 19: Università di Milano-Bicocca Laurea Magistrale in Informatica

Relation to Concept Learning

Assume fixed set of instances ‹x1,…,xm›

Assume D is the set of classifications

D = ‹c(x1),…,c(xm)›

Let P(D|h) = 1 if h consistent with D

= 0 otherwise

Choose P(h) to be uniform distribution

P(h) = 1/|H| for all h in H

Then

P(h|D) = 1/|VSH,D| if h consistent with D

= 0 otherwise

Page 20: Università di Milano-Bicocca Laurea Magistrale in Informatica

Evolution of posterior probabilities

Page 21: Università di Milano-Bicocca Laurea Magistrale in Informatica

ML approximation

Maximum likelihood (ML) hypothesis: if prior probabilities are uniform (P(hi) = P(hj) ∀ i, j), choose the h that maximizes the likelihood of D

hML = arg max_{h ∈ H} P(D|h)

For large data sets, prior becomes irrelevant

I.e., simply get the best fit to the data; identical to MAP for uniform prior (which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method

Page 22: Università di Milano-Bicocca Laurea Magistrale in Informatica

ML parameter learning

Bag from a new manufacturer; fraction θ of cherry candies?

Any θ is possible: continuum of hypotheses hθ

θ is a parameter for this simple (binomial) family of models

Suppose we unwrap N candies, c cherries and α = N − c limes

These are i.i.d. (independent, identically distributed) observations, so

P(D|hθ) = ∏_{j=1}^{N} P(dj|hθ) = θ^c (1 − θ)^α

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(D|hθ) = log P(D|hθ) = Σ_{j=1}^{N} log P(dj|hθ) = c log θ + α log(1 − θ)

dL(D|hθ)/dθ = c/θ − α/(1 − θ) = 0   ⇒   θ = c/(c + α) = c/N

Seems sensible, but causes problems with 0 counts!
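A short numerical check of this maximum-likelihood estimate (my own sketch; the counts N and c below are made up): the closed form θ = c/N coincides with the maximizer of the log-likelihood found by a crude grid search.

import math

N, c = 10, 7                 # hypothetical counts: 10 candies, 7 cherries
limes = N - c

def log_likelihood(theta):
    # L(D|h_theta) = c log(theta) + (N - c) log(1 - theta)
    return c * math.log(theta) + limes * math.log(1 - theta)

print(c / N)                                  # closed-form ML estimate: 0.7
grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=log_likelihood))          # grid search agrees: 0.7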

Page 23: Università di Milano-Bicocca Laurea Magistrale in Informatica


Learning a Real Valued Function

Consider any real-valued target function f

Training examples ‹xi,di›, where di is noisy training value

di = f(xi) + ei

ei is a random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean 0

Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:

hML = arg min_{h ∈ H} Σ_{i=1}^{m} (di − h(xi))²
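A brief sketch of this result with a linear hypothesis class (my own illustration on synthetic data; the true function and noise level are made up): under Gaussian noise, the ML hypothesis is simply the least-squares fit.

import random

random.seed(0)
xs = [i / 10 for i in range(30)]
ds = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]     # d_i = f(x_i) + e_i, with f(x) = 2x + 1

# Closed-form least-squares fit of h(x) = w1*x + w0 (simple linear regression)
n = len(xs)
mx, md = sum(xs) / n, sum(ds) / n
w1 = sum((x - mx) * (d - md) for x, d in zip(xs, ds)) / sum((x - mx) ** 2 for x in xs)
w0 = md - w1 * mx

sse = sum((d - (w1 * x + w0)) ** 2 for x, d in zip(xs, ds))
print(round(w1, 2), round(w0, 2), round(sse, 3))        # w1 ~ 2, w0 ~ 1, minimized sum of squared errors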

Page 24: Università di Milano-Bicocca Laurea Magistrale in Informatica

Learning a real valued function

Page 25: Università di Milano-Bicocca Laurea Magistrale in Informatica

Minimum Description Length Principle

Page 26: Università di Milano-Bicocca Laurea Magistrale in Informatica

Minimum Description Length Principle

Page 27: Università di Milano-Bicocca Laurea Magistrale in Informatica

Minimum Description Length Principle

Page 28: Università di Milano-Bicocca Laurea Magistrale in Informatica

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis

MDL: prefer the hypothesis h that minimizes

hMDL = arg min_{h ∈ H}  LC1(h) + LC2(D|h)

where LC(x) is the description length of x under encoding C

Example: H = decision trees, D = training data labels

- LC1(h) is # bits to describe tree h
- LC2(D|h) is # bits to describe D given h. Note LC2(D|h) = 0 if the examples are classified perfectly by h: need only describe the exceptions

Hence hMDL trades off tree size for training errors

Page 29: Università di Milano-Bicocca Laurea Magistrale in Informatica

Minimum Description Length Principle

hMAP = arg max_{h ∈ H} P(D|h) P(h)
     = arg max_{h ∈ H} log2 P(D|h) + log2 P(h)
     = arg min_{h ∈ H} − log2 P(D|h) − log2 P(h)    (1)

Interesting fact from information theory:

The optimal (shortest coding length) code for an event with probability p is

-log2p bits.

So interpret (1):

-log2P(h) is length of h under optimal code

-log2P(D|h) is length of D given h under optimal code

Prefer the hypothesis that minimizes

Length(h) + length(misclassifications)
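A quick numeric illustration of this fact (my own example): an event with probability 1/8 gets a 3-bit code, while rarer events need longer codes.

import math

for p in (0.5, 0.125, 0.01):
    print(p, -math.log2(p))   # optimal code lengths: 1.0, 3.0 and ~6.64 bits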

Page 30: Università di Milano-Bicocca Laurea Magistrale in Informatica

Most Probable Classification of New Instances

So far we’ve sought the most probable hypothesis given the data D (i.e., hMAP)

Given new instance x, what is its most probable classification?

hMAP(x) is not the most probable classification!

Consider:

Three possible hypotheses: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3

Given new instance x,

h1(x) = +, h2(x) = -, h3(x) = -

What’s the most probable classification of x?

Page 31: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayes Optimal Classifier

Bayes optimal classification:

arg max_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

Example:

P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1

P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0

P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

Therefore

Σ_{hi ∈ H} P(+|hi) P(hi|D) = .4
Σ_{hi ∈ H} P(−|hi) P(hi|D) = .6

and

arg max_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −
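A minimal Python sketch of this computation, using the numbers of the example above (my own code):

# Bayes optimal classification: argmax_v sum_h P(v|h) P(h|D)
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}             # P(h|D)
p_class_given_h = {                                         # P(v|h)
    'h1': {'+': 1.0, '-': 0.0},
    'h2': {'+': 0.0, '-': 1.0},
    'h3': {'+': 0.0, '-': 1.0},
}

scores = {v: sum(p_class_given_h[h][v] * posteriors[h] for h in posteriors)
          for v in ('+', '-')}
print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-', even though h_MAP(x) = '+'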

Page 32: Università di Milano-Bicocca Laurea Magistrale in Informatica

Gibbs Classifier

Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D)

2. Use this to classify new instance

Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:

E[errorGibbs] ≤ 2 E[errorBayesOptimal]

Suppose correct, uniform prior distribution over H, then

- Pick any hypothesis from VS, with uniform probability

- Its expected error no worse than twice Bayes optimal
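A small sketch of the Gibbs step (my own illustration, reusing the posteriors of the previous example):

import random

posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}            # P(h|D)
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}           # h_i(x) on the new instance x

def gibbs_classify():
    # 1. choose one hypothesis at random, according to P(h|D); 2. classify with it
    h = random.choices(list(posteriors), weights=list(posteriors.values()))[0]
    return predictions[h]

print(gibbs_classify())    # '+' with probability 0.4, '-' with probability 0.6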

Page 33: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbour, one of the most practical learning methods.

When to use

- Moderate or large training set available

- Attributes that describe instances are conditionally independent given classification

Successful applications:

- Diagnosis

- Classifying text documents

Page 34: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes Classifier

Assume target function f : X → V, where each instance x is described by attributes ‹a1, a2, …, an›.

Most probable value of f(x) is:

vMAP = arg max_{vj ∈ V} P(vj|a1, a2, …, an)
     = arg max_{vj ∈ V} P(a1, a2, …, an|vj) P(vj)

Naive Bayes assumption:

P(a1, a2, …, an|vj) = ∏_i P(ai|vj)

which gives the Naive Bayes classifier:

vNB = arg max_{vj ∈ V} P(vj) ∏_i P(ai|vj)

Page 35: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)

For each target value vj

P̂(vj) ← estimate P(vj)

For each attribute value ai of each attribute a

P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)

vNB = arg max_{vj ∈ V} P̂(vj) ∏_{ai ∈ x} P̂(ai|vj)
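A compact Python sketch of these two procedures (a minimal illustration with plain frequency estimates, not the course's reference code; it omits the m-estimate smoothing discussed two slides ahead):

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    # examples: list of (attribute_dict, target_value) pairs
    class_counts = Counter(v for _, v in examples)
    p_v = {v: n / len(examples) for v, n in class_counts.items()}
    counts = defaultdict(Counter)                 # (v, attribute) -> Counter over values
    for attrs, v in examples:
        for a, value in attrs.items():
            counts[(v, a)][value] += 1
    p_a_given_v = {key: {value: n / class_counts[key[0]] for value, n in c.items()}
                   for key, c in counts.items()}
    return p_v, p_a_given_v

def classify_new_instance(x, p_v, p_a_given_v):
    # returns v_NB = argmax_v P(v) * prod_i P(a_i | v)
    def score(v):
        s = p_v[v]
        for a, value in x.items():
            s *= p_a_given_v.get((v, a), {}).get(value, 0.0)
        return s
    return max(p_v, key=score)

With PlayTennis-style (attributes, label) examples as input, classify_new_instance carries out the kind of comparison shown on the next slide.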

Page 36: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes: Example

Consider PlayTennis again and the new instance

<Outlk = sun, Temp = cool, Humid = high, Wind = strong>

We want to compute:

vNB = arg max_{vj ∈ {y, n}} P(vj) ∏_i P(ai|vj)

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005

P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021

vNB = n

Page 37: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes: Subtleties

1. The conditional independence assumption is often violated:

P(a1, a2, …, an|vj) = ∏_i P(ai|vj)

- …but it works surprisingly well anyway. Note we don’t need the estimated posteriors P̂(vj|x) to be correct; we need only that

arg max_{vj ∈ V} P̂(vj) ∏_i P̂(ai|vj) = arg max_{vj ∈ V} P(vj) P(a1, …, an|vj)

- See [Domingos & Pazzani, 1996] for analysis

- Naive Bayes posteriors often unrealistically close to 1 or 0

Page 38: Università di Milano-Bicocca Laurea Magistrale in Informatica

Naive Bayes: Subtleties

2. What if none of the training instances with target value vj have attribute value ai? Then

P̂(ai|vj) = 0, and hence P̂(vj) ∏_i P̂(ai|vj) = 0

Typical solution is a Bayesian (m-)estimate for P(ai|vj):

P̂(ai|vj) = (nc + m p) / (n + m)

Where:

- n is number of training examples for which v = vj

- nc number of examples for which v = vj and a = ai

- p is prior estimate for P(ai|vj)

- m is weight given to prior (i.e. number of “virtual” examples)
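A one-line sketch of this estimate (illustrative numbers only):

def m_estimate(nc, n, p, m):
    # Bayesian (m-)estimate of P(a_i | v_j): (nc + m*p) / (n + m)
    return (nc + m * p) / (n + m)

print(m_estimate(0, 10, 1 / 3, 3))   # ~0.077 instead of 0 when the value was never seen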

Page 39: Università di Milano-Bicocca Laurea Magistrale in Informatica

Learning to Classify Text

Why?

- Learn which new articles are of interest

- Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms

What attributes shall we use to represent text documents?

Page 40: Università di Milano-Bicocca Laurea Magistrale in Informatica

Learning to Classify Text

Target concept Interesting? : Document → {+, −}

1. Represent each document by vector of words

- one attribute per word position in document

2. Learning: use training examples to estimate

- P(+)

- P(-)

- P(doc|+)

- P(doc|-)

Naive Bayes conditional independence assumption:

P(doc|vj) = ∏_{i=1}^{length(doc)} P(ai = wk|vj)

where P(ai = wk|vj) is the probability that the word in position i is wk, given vj

One more assumption (position independence):

P(ai = wk|vj) = P(am = wk|vj), ∀ i, m

Page 41: Università di Milano-Bicocca Laurea Magistrale in Informatica

Learn_Naive_Bayes_Text(Examples, V)

1. Collect all words and other tokens that occur in Examples
   - Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(vj) and P(wk|vj) probability terms. For each target value vj in V do:
   - docsj ← subset of Examples for which the target value is vj
   - P(vj) ← |docsj| / |Examples|
   - Textj ← a single document created by concatenating all members of docsj
   - n ← total number of words in Textj (counting duplicate words multiple times)
   - For each word wk in Vocabulary:
     - nk ← number of times word wk occurs in Textj
     - P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

Page 42: Università di Milano-Bicocca Laurea Magistrale in Informatica

Classify_Naive_Bayes_Text(Doc)

- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return vNB, where

vNB = arg max_{vj ∈ V} P(vj) ∏_{i ∈ positions} P(ai|vj)
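A rough Python sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text (my own illustration of the pseudocode above; tokenisation is plain whitespace splitting, and log-probabilities are used to avoid underflow on long documents):

import math
from collections import Counter

def learn_naive_bayes_text(examples):
    # examples: list of (document_string, target_value) pairs
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    p_v, p_w_given_v = {}, {}
    for v in {t for _, t in examples}:
        docs_v = [doc for doc, t in examples if t == v]
        p_v[v] = len(docs_v) / len(examples)
        word_counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(word_counts.values())            # words in Text_j, counting duplicates
        p_w_given_v[v] = {w: (word_counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, p_v, p_w_given_v

def classify_naive_bayes_text(doc, vocabulary, p_v, p_w_given_v):
    positions = [w for w in doc.split() if w in vocabulary]
    def log_score(v):
        return math.log(p_v[v]) + sum(math.log(p_w_given_v[v][w]) for w in positions)
    return max(p_v, key=log_score)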

Page 43: Università di Milano-Bicocca Laurea Magistrale in Informatica

Classify_Naive_Bayes_Text (Doc)

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

comp.graphics misc.forsale

comp.os.ms-windows.misc rec.autos

comp.sys.ibm.pc.hardware rec.motorcycles

comp.sys.mac.hardware rec.sport.baseball

comp.windows.x rec.sport.hockey

alt.atheism sci.space

soc.religion.christian sci.crypt

talk.religion.misc sci.electronics

talk.politics.mideast sci.med

talk.politics.misc

talk.politics.guns

Naive Bayes: 89% classification accuracy

Page 44: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian Networks

The Bayes optimal classifier makes no independence assumptions among the variables and is computationally heavy

The “naive” Bayes classifier is efficient thanks to the very restrictive assumption of conditional independence of all attributes given the target value v

Bayesian networks describe conditional independence among subsets of variables

Page 45: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian Networks

DEF. Let X, Y, Z be discrete random variables.

X is conditionally independent of Y given Z if:

∀ xi, yj, zk:  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

Easily extended to sets of variables:

P(X1, …, Xl | Y1, …, Ym, Z1, …, Zn) = P(X1, …, Xl | Z1, …, Zn)

Page 46: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian Networks - Representation

Directed acyclic graph:
- nodes represent the random variables
- arcs represent the conditional dependence structure: each variable is conditionally independent of its non-descendants, given its parents

A local conditional probability table for each node, giving the probability distribution of the associated variable given its immediate predecessors

One can compute:

P(y1, …, yn) = ∏i P(yi | Parents(Yi))

Page 47: Università di Milano-Bicocca Laurea Magistrale in Informatica

Bayesian Networks - Representation

[Diagram: Bayesian network over the Boolean variables Temporale (Storm), Gita (Trip), Fuoco di campo (Campfire), Fulmine (Lightning), Tuono (Thunder), Incendio (Fire)]

Local conditional probability table for Campfire (F) given Storm (T) and Trip (G):

       T,G    T,¬G   ¬T,G   ¬T,¬G
F      0.4    0.1    0.8    0.2
¬F     0.6    0.9    0.2    0.8

(Boolean-valued variables)
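A tiny Python sketch of the factorization P(y1, …, yn) = ∏i P(yi | Parents(Yi)) restricted to the three variables T, G, F (my own illustration; it assumes, as in the usual version of this example, that T and G are root nodes and F is their child; the priors for T and G are made-up numbers, only the table for F comes from the slide):

p_T = {True: 0.1, False: 0.9}            # hypothetical prior for Temporale (Storm)
p_G = {True: 0.2, False: 0.8}            # hypothetical prior for Gita (Trip)
p_F_given_TG = {                          # P(F | T, G) from the table above
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(t, g, f):
    # P(t, g, f) = P(t) P(g) P(f | t, g)
    p_f = p_F_given_TG[(t, g)] if f else 1 - p_F_given_TG[(t, g)]
    return p_T[t] * p_G[g] * p_f

print(joint(True, True, True))            # 0.1 * 0.2 * 0.4 ≈ 0.008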