
Page 1:

VBM683

Machine Learning

Pinar Duygulu

Slides are adapted from

Dhruv Batra

Page 2:

Plan for Today

• Review of Probability

– Discrete vs Continuous Random Variables

– PMFs vs PDF

– Joint vs Marginal vs Conditional Distributions

– Bayes Rule and Prior

– Expectation, Entropy, KL-Divergence

(C) Dhruv Batra 2

Page 3:

Probability

• The world is a very uncertain place

• 30 years of Artificial Intelligence and Database research danced around this fact

• And then a few AI researchers decided to use some ideas from the eighteenth century

(C) Dhruv Batra 3 | Slide Credit: Andrew Moore

Page 4:

Probability

• A is a non-deterministic event

– Can think of A as a boolean-valued variable

• Examples

– A = your next patient has cancer

– A = Donald Trump wins the 2016 Presidential Election

(C) Dhruv Batra 4

Page 5:

Interpreting Probabilities

• What does P(A) mean?

• Frequentist View

– lim_{N→∞} #(A is true) / N

– limiting frequency of a repeating non-deterministic event

• Bayesian View

– P(A) is your “belief” about A

• Market Design View

– P(A) tells you how much you would bet

(C) Dhruv Batra 5

Page 6:

(C) Dhruv Batra 6 | Image Credit: Intrade / NPR

Page 7:

(C) Dhruv Batra 7


The Axioms Of Probability

Slide Credit: Andrew Moore

Page 8:

Axioms of Probability

• 0 <= P(A) <= 1

• P(empty-set) = 0

• P(everything) = 1

• P(A or B) = P(A) + P(B) – P(A and B)

(C) Dhruv Batra 8
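The last axiom (inclusion-exclusion) is easy to check numerically. Below is a minimal Python sketch, not from the slides: it simulates random "worlds" in which two invented events A and B hold with arbitrary probabilities, and confirms that P(A or B) ≈ P(A) + P(B) – P(A and B).

```python
import random

random.seed(0)
N = 100_000

# Each "world" independently makes A and B true with some probability.
# The events and probabilities here are arbitrary illustrations.
worlds = [(random.random() < 0.3, random.random() < 0.5) for _ in range(N)]

p_a = sum(a for a, _ in worlds) / N
p_b = sum(b for _, b in worlds) / N
p_a_and_b = sum(a and b for a, b in worlds) / N
p_a_or_b = sum(a or b for a, b in worlds) / N

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
print(p_a_or_b, p_a + p_b - p_a_and_b)  # the two numbers agree up to noise
```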

Page 9:

Interpreting the Axioms

• 0 <= P(A) <= 1

• P(empty-set) = 0

• P(everything) = 1

• P(A or B) = P(A) + P(B) – P(A and B)

(C) Dhruv Batra 9 | Image Credit: Andrew Moore

[Figure: "Visualizing A". The event space of all possible worlds has area 1. Worlds in which A is true form an oval; the rest are worlds in which A is false. P(A) = area of the oval.]

Page 10:

Interpreting the Axioms

• 0 <= P(A) <= 1

• P(empty-set) = 0

• P(everything) = 1

• P(A or B) = P(A) + P(B) – P(A and B)

(C) Dhruv Batra 10 | Image Credit: Andrew Moore

[Figure (Andrew Moore): the area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.]

Page 11:

Interpreting the Axioms

• 0 <= P(A) <= 1

• P(empty-set) = 0

• P(everything) = 1

• P(A or B) = P(A) + P(B) – P(A and B)

(C) Dhruv Batra 11 | Image Credit: Andrew Moore

[Figure (Andrew Moore): the area of A can't get any bigger than 1, and an area of 1 would mean all worlds have A true.]

Page 12:

Interpreting the Axioms

• 0 <= P(A) <= 1

• P(empty-set) = 0

• P(everything) = 1

• P(A or B) = P(A) + P(B) – P(A and B)

(C) Dhruv Batra 12 | Image Credit: Andrew Moore

[Figure (Andrew Moore): two overlapping ovals A and B. P(A or B) is the total area covered; P(A and B) is the overlap. Simple addition and subtraction.]

Page 13:

Concepts

• Sample Space

– Space of events

• Random Variables

– Mapping from events to numbers

– Discrete vs Continuous

• Probability

– Mass vs Density

(C) Dhruv Batra 13

Page 14:

Discrete Random Variables

• Sample space of possible outcomes, which may be finite or countably infinite; also written Val(X)

• Outcome of a sample of a discrete random variable

• Probability distribution (probability mass function), with a shorthand used when there is no ambiguity

• Pictured on the slide: a uniform distribution and a degenerate distribution

(C) Dhruv Batra 14 | Slide Credit: Erik Sudderth

Page 15:

Concepts

• Expectation

• Variance

(C) Dhruv Batra 15
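Neither quantity is defined on the slide. As a reminder, for a discrete X, E[X] = Σ_x x p(x) and Var[X] = E[(X − E[X])²]. A minimal Python sketch with an invented PMF:

```python
# Expectation and variance of a discrete random variable.
# The PMF below is an arbitrary example, not from the slides.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

mean = sum(x * p for x, p in pmf.items())               # E[X]
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var[X] = E[(X - E[X])^2]

print(mean, var)  # 2.1, 0.49
```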

Page 16:

Most Important Concepts

• Marginal distributions / Marginalization

• Conditional distribution / Chain Rule

• Bayes Rule

(C) Dhruv Batra 16

Page 17:

Joint Distribution

(C) Dhruv Batra 17


Page 18:

Marginalization

• Marginalization

– Events: P(A) = P(A and B) + P(A and not B)

– Random variables

(C) Dhruv Batra 18

P(X = x) = Σ_y P(X = x, Y = y)
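As a concrete sketch, marginalizing a joint P(X, Y) stored as a 2-D array is just summing out the axis you don't want. The joint table below is invented for illustration:

```python
import numpy as np

# Invented joint distribution P(X, Y): rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)  # P(X = x) = sum_y P(X = x, Y = y)
p_y = joint.sum(axis=0)  # P(Y = y) = sum_x P(X = x, Y = y)
print(p_x, p_y)
```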

Page 19:

Marginal Distributions

(C) Dhruv Batra 19 | Slide Credit: Erik Sudderth

Page 20:

Conditional Probabilities

• P(Y=y | X=x)

• What do you believe about Y=y, if I tell you X=x?

• P(Donald Trump Wins the 2016 Election)?

• What if I tell you:

– He has the Republican nomination

– His Twitter history

– The complete DVD set of The Apprentice

(C) Dhruv Batra 20

Page 21:

Conditional Probabilities

• P(A | B) = fraction of worlds in which B is true that also have A true

• Example

– H: "Have a headache"

– F: "Coming down with flu"

– P(H) = 1/10

– P(F) = 1/40

– P(H|F) = 1/2

"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

(C) Dhruv Batra 21
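The definition P(A|B) = P(A and B) / P(B) lets you move between joint and conditional numbers. A Python sketch using the slide's values; the joint entry P(H and F) is derived via the product rule, which is implicit on the slide:

```python
p_h = 1 / 10         # P(H): have a headache
p_f = 1 / 40         # P(F): coming down with flu
p_h_given_f = 1 / 2  # P(H|F), from the slide

# Product rule: P(H and F) = P(H|F) P(F)
p_h_and_f = p_h_given_f * p_f
print(p_h_and_f)          # 0.0125

# Definition of conditional probability, going back:
print(p_h_and_f / p_f)    # P(H|F) = 0.5
```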

Page 22:

Conditional Distributions

(C) Dhruv Batra 22 | Slide Credit: Erik Sudderth

Page 23:

Conditional Probabilities

• Definition

• Corollary: Chain Rule

(C) Dhruv Batra 23
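The two formulas on this slide were images and did not survive extraction; the standard statements they refer to are:

– Definition: P(A | B) = P(A, B) / P(B), for P(B) > 0

– Chain Rule (apply the definition repeatedly): P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) ⋯ P(Xn | X1, …, Xn−1)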

Page 24:

Independent Random Variables

(C) Dhruv Batra 24 | Slide Credit: Erik Sudderth

Page 25:

• Sets of variables X, Y

• X is independent of Y

– Shorthand: P ⊨ (X ⊥ Y)

• Proposition: P satisfies (X ⊥ Y) if and only if

– P(X = x, Y = y) = P(X = x) P(Y = y), ∀x ∈ Val(X), ∀y ∈ Val(Y)

Marginal Independence

(C) Dhruv Batra 25
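A quick numerical sketch of the proposition. The joint below is constructed to factorize, so the check passes; any joint that does not factorize would fail it:

```python
import numpy as np

# A joint P(X, Y) built as an outer product of two marginals,
# so X ⊥ Y holds by construction; the numbers are illustrative only.
p_x = np.array([0.2, 0.8])
p_y = np.array([0.5, 0.3, 0.2])
joint = np.outer(p_x, p_y)

# Recover the marginals from the joint and test the proposition:
# P(X=x, Y=y) == P(X=x) P(Y=y) for all x, y.
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)
print(np.allclose(joint, np.outer(marg_x, marg_y)))  # True
```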

Page 26:

• Sets of variables X, Y, Z

• X is independent of Y given Z if

– Shorthand: P ⊨ (X ⊥ Y | Z)

– For P ⊨ (X ⊥ Y | ∅), write P ⊨ (X ⊥ Y)

• Proposition: P satisfies (X ⊥ Y | Z) if and only if

– P(X, Y | Z) = P(X|Z) P(Y|Z), ∀x ∈ Val(X), ∀y ∈ Val(Y), ∀z ∈ Val(Z)

Conditional independence

(C) Dhruv Batra 26

Page 27:

Concept

• Bayes Rule

– Simple yet fundamental

(C) Dhruv Batra 27

What we just did:

P(B|A) = P(A ∧ B) / P(A) = P(A|B) P(B) / P(A)

This is Bayes Rule

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Image Credit: Andrew Moore
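Applying the rule to the headache/flu numbers from Page 21 gives the probability that a headache signals flu; a minimal Python sketch:

```python
p_h = 1 / 10         # P(H): have a headache
p_f = 1 / 40         # P(F): coming down with flu
p_h_given_f = 1 / 2  # P(H|F), from the earlier slide

# Bayes Rule: P(F|H) = P(H|F) P(F) / P(H)
p_f_given_h = p_h_given_f * p_f / p_h
print(p_f_given_h)   # 0.125: only a 1-in-8 chance the headache means flu
```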

Page 28:

Bayes Rule

• Simple yet profound

– Using Bayes Rule doesn't make your analysis Bayesian!

• Concepts:

– Likelihood

• How well does a certain hypothesis explain the data?

– Prior

• What do you believe before seeing any data?

– Posterior

• What do you believe after seeing the data?

(C) Dhruv Batra 28

Page 29:

New Topic:

Naïve Bayes

(your first probabilistic classifier)

(C) Dhruv Batra 29

[Diagram: Classification block mapping features x to a label y; y is discrete.]

Page 30:

Classification

• Learn: h : X → Y

– X – features

– Y – target classes

• Suppose you know P(Y|X) exactly. How should you classify?

– Bayes classifier (the rule is sketched below):

Slide Credit: Carlos Guestrin
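The Bayes classifier equation on this slide was an image; the standard rule it refers to predicts the most probable class given the features:

h*(x) = argmax_y P(Y = y | X = x)

If P(Y|X) is known exactly, no classifier built on the same features can achieve lower expected error.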

Page 31:

Generative vs. Discriminative

• Generative Approach

– Assume some functional form for P(X|Y), P(Y)

– Estimate p(X|Y) and p(Y)

– Use Bayes Rule to calculate P(Y| X=x)

– Indirect computation of P(Y|X) through Bayes rule

– But, can generate a sample: P(X) = Σ_y P(y) P(X|y)

• Discriminative Approach

– Estimate p(y|x) directly OR

– Learn “discriminant” function h(x)

– Direct, but cannot obtain a sample of the data, because P(X) is not available

(C) Dhruv Batra 31

Page 32:

How hard is it to learn the optimal classifier?

Slide Credit: Carlos Guestrin

• Categorical Data

• How do we represent these? How many parameters?

– Class-Prior, P(Y):

• Suppose Y is composed of k classes

– Likelihood, P(X|Y):

• Suppose X is composed of d binary features

• Complex model → high variance with limited data!
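The counts the slide asks for (a standard calculation, not transcribed from it): with k classes, the class-prior P(Y) needs k − 1 free parameters. The likelihood P(X|Y) is a full joint over d binary features for each class, so it needs k(2^d − 1) parameters, exponential in d; that exponential table is why the optimal classifier has high variance with limited data.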

Page 33:

Independence to the rescue

(C) Dhruv Batra 33 | Slide Credit: Sam Roweis

Page 34:

The Naïve Bayes assumption

• Naïve Bayes assumption:

– Features are independent given class:

– More generally:

• How many parameters now?

• Suppose X is composed of d binary features

(C) Dhruv Batra 34 | Slide Credit: Carlos Guestrin
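The factorization on this slide was an image; the Naïve Bayes assumption it states is

P(X1, …, Xd | Y) = ∏_i P(Xi | Y)

so with d binary features and k classes the likelihood now needs only dk parameters, versus k(2^d − 1) for the full joint.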

Page 35:

The Naïve Bayes Classifier

(C) Dhruv Batra 35 | Slide Credit: Carlos Guestrin

• Given:

– Class-Prior P(Y)

– d conditionally independent features X given the class Y

– For each Xi, we have likelihood P(Xi|Y)

• Decision rule:

• If assumption holds, NB is optimal classifier!
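The decision rule on the slide was an image; the standard Naïve Bayes rule is y* = argmax_y P(Y = y) ∏_i P(Xi = xi | Y = y). A minimal Python sketch with invented parameters (the "spam"/"ham" classes and three binary features are illustrative only), computed in log space to avoid underflow:

```python
import math

# Invented parameters for illustration: 2 classes, 3 binary features.
prior = {"spam": 0.4, "ham": 0.6}   # P(Y)
likelihood = {                      # P(X_i = 1 | Y)
    "spam": [0.8, 0.1, 0.7],
    "ham":  [0.2, 0.5, 0.3],
}

def nb_predict(x):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    best, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for p_xi, xi in zip(likelihood[y], x):
            score += math.log(p_xi if xi == 1 else 1 - p_xi)
        if score > best_score:
            best, best_score = y, score
    return best

print(nb_predict([1, 0, 1]))  # "spam" under these made-up parameters
```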

Page 36:

Text classification

• Classify e-mails

– Y = {Spam,NotSpam}

• Classify news articles

– Y = {what is the topic of the article?}

• Classify webpages

– Y = {Student, professor, project, …}

• What about the features X?

– The text!

(C) Dhruv Batra 36 | Slide Credit: Carlos Guestrin

Page 37:

Features X are the entire document; Xi is the ith word in the article.

(C) Dhruv Batra 37 | Slide Credit: Carlos Guestrin

Page 38:

NB for Text classification

• P(X|Y) is huge!!!

– An article has at least 1000 words: X = {X1, …, X1000}

– Xi represents the ith word in the document; i.e., the domain of Xi is the entire vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.

• NB assumption helps a lot!!!

– P(Xi = xi | Y = y) is just the probability of observing word xi in a document on topic y

(C) Dhruv Batra 38 | Slide Credit: Carlos Guestrin

Page 39:

Bag of Words model

• Typical additional assumption:

Position in document doesn’t matter:

P(Xi=a|Y=y) = P(Xk=a|Y=y)

– “Bag of words” model – order of words on the page ignored

– Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

(C) Dhruv Batra 39 | Slide Credit: Carlos Guestrin

Page 40:

Bag of Words model

• Typical additional assumption:

Position in document doesn’t matter:

P(Xi=a|Y=y) = P(Xk=a|Y=y)

– “Bag of words” model – order of words on the page ignored

– Sounds really silly, but often works very well!

in is lecture lecture next over person remember room sitting the the the to to up wake when you

(C) Dhruv Batra 40 | Slide Credit: Carlos Guestrin

Page 41:

Bag of Words model

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

(C) Dhruv Batra 41 | Slide Credit: Carlos Guestrin
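A count vector like the one above is just a word-count dictionary. A minimal Python sketch, using the sentence from the earlier slide:

```python
from collections import Counter

text = ("When the lecture is over, remember to wake up the "
        "person sitting next to you in the lecture room.")

# Order is discarded; only counts survive: the bag-of-words representation.
counts = Counter(text.lower().replace(",", "").replace(".", "").split())
print(counts["the"], counts["lecture"], counts["aardvark"])  # 3 2 0
```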

Page 42:

Object → Bag of ‘words’

(C) Dhruv Batra 42 | Slide Credit: Fei Fei Li

Page 43:

(C) Dhruv Batra 43 | Slide Credit: Fei Fei Li

Page 44:

[Diagram (Fei Fei Li): visual bag-of-words pipeline. Feature detection & representation → codewords dictionary → image representation; learning yields category models (and/or) classifiers; recognition yields a category decision.]

(C) Dhruv Batra 44 | Slide Credit: Fei Fei Li

Page 45:

What if we have continuous Xi?

E.g., character recognition: Xi is the ith pixel

Gaussian Naïve Bayes (GNB):

P(Xi = x | Y = yk) = 1 / (σik √(2π)) · exp( -(x - μik)² / (2σik²) )

Sometimes assume variance

• is independent of Y (i.e., σi),

• or independent of Xi (i.e., σk),

• or both (i.e., σ)

(C) Dhruv Batra 45 | Slide Credit: Carlos Guestrin
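A minimal Gaussian Naïve Bayes sketch in Python: it fits a per-class prior plus a per-class, per-feature mean and variance, then classifies with log densities. The toy data are invented for illustration.

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class prior and per-class, per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(Y = c)
                     Xc.mean(axis=0),        # mu_{i,c}
                     Xc.var(axis=0) + 1e-9)  # sigma^2_{i,c} (tiny floor)
    return params

def predict_gnb(params, x):
    """argmax_c log P(c) + sum_i log N(x_i; mu_ic, sigma_ic^2)."""
    def score(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=score)

# Invented toy data: two classes in 2-D.
X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 5.0], [3.8, 5.2]])
y = np.array([0, 0, 1, 1])
params = fit_gnb(X, y)
print(predict_gnb(params, np.array([1.1, 2.1])))  # 0
```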