
CS 2710 Foundations of AI

Lecture 22

Milos Hauskrecht

milos@cs.pitt.edu

5329 Sennott Square

Machine learning

Machine Learning

• The field of machine learning studies the design of computer programs (agents) capable of learning from past experience or adapting to changes in the environment.

• The need for building agents capable of learning is everywhere:
  – predictions in medicine,
  – text and web page classification,
  – speech recognition,
  – image/text retrieval,
  – commercial software.


Learning

Learning process:
A learner (a computer program) processes data D representing past experiences and tries either to develop an appropriate response to future data, or to describe the data seen in some meaningful way.

Example:
The learner sees a set of patient cases (patient records) with corresponding diagnoses. It can try either:
– to predict the presence of a disease for future patients, or
– to describe the dependencies between diseases and symptoms.

Types of learning

• Supervised learning
  – Learning a mapping between inputs x and desired outputs y
  – A teacher gives the y's for learning purposes

• Unsupervised learning
  – Learning relations between data components
  – No specific outputs given by a teacher

• Reinforcement learning
  – Learning a mapping between inputs x and desired outputs y
  – A critic does not give the y's, but instead a signal (reinforcement) of how good the answer was

• Other ‘types’ of learning:
  – Concept learning, active learning, deep learning, …


Supervised learning

Data: a set of n examples $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i = \langle \mathbf{x}_i, y_i \rangle$, $\mathbf{x}_i$ is an input vector, and $y_i$ is the desired output (given by a teacher).

Objective: learn the mapping $f: X \to Y$ s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$

Two types of problems:
• Regression: X is discrete or continuous, Y is continuous
• Classification: X is discrete or continuous, Y is discrete

Supervised learning examples

• Regression: Y is continuous
  Inputs: debt/equity, earnings, future product orders → output: stock price

Data:

  Debt/equity   Earnings   Future prod orders   Stock price
  20            115        20                   123.45
  18            120        31                   140.56
  …

[Figure: scatter plot of training pairs (x, y) with a continuous output y]
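To make the representation concrete, here is a minimal sketch of how such a dataset D might be stored; the column names follow the table above, and only the two rows shown there are used:

```python
import numpy as np

# Each row is one example d_i = (x_i, y_i): the input vector holds
# [debt/equity, earnings, future product orders]; the target is stock price.
X = np.array([
    [20.0, 115.0, 20.0],
    [18.0, 120.0, 31.0],
])
y = np.array([123.45, 140.56])

# Regression: learn f: X -> Y with Y continuous, i.e. predict a
# real-valued stock price from the three input attributes.
print(X.shape, y.shape)  # (2, 3) (2,)
```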


Supervised learning examples

• Classification: Y is discrete

Data: image → digit
  [image of a handwritten digit] → 3
  [image of a handwritten digit] → 7
  [image of a handwritten digit] → 5
  …

The input is a handwritten digit (an array of 0/1s); the output is its label, e.g. “3”.

Unsupervised learning

• Data: $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i = \mathbf{x}_i$ is a vector of values
  No target value (output) y

• Objective:
  – learn relations between samples and between components of samples

Types of problems:
• Clustering
  – Group together “similar” examples, e.g. patient cases
• Density estimation
  – Model probabilistically the population of samples


Unsupervised learning example

• Clustering. Group together similar examples $d_i = \mathbf{x}_i$.

[Figure: 2-D scatter plot of examples falling into a few visible groups]

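As a concrete illustration, a minimal clustering sketch; the synthetic 2-D data, the use of scikit-learn's KMeans, and the choice of 3 clusters are assumptions of the example, not part of the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D examples d_i = x_i drawn around three centers.
centers = np.array([[-1.0, 0.0], [1.0, 2.0], [1.5, -0.5]])
X = np.vstack([c + 0.2 * rng.standard_normal((30, 2)) for c in centers])

# Group together "similar" examples: k-means with k = 3.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster index assigned to the first 10 examples
```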


Unsupervised learning example

• Density estimation. We want to build a probability model $P(\mathbf{x})$ of the population from which we drew the examples $d_i = \mathbf{x}_i$.

[Figure: the same 2-D scatter of examples]

Unsupervised learning. Density estimation

• A probability density of a point in the two-dimensional space
  – Model used here: mixture of Gaussians

[Figure: contour plot of the fitted mixture density over the data]
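A matching density-estimation sketch, assuming scikit-learn's GaussianMixture as the mixture-of-Gaussians model (the synthetic data and the choice of two components are assumptions of the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D examples drawn from two Gaussian blobs.
X = np.vstack([
    rng.normal([-1.0, 0.0], 0.3, size=(100, 2)),
    rng.normal([1.0, 1.5], 0.4, size=(100, 2)),
])

# Fit P(x) as a mixture of two Gaussians.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns log P(x); exponentiate to get the density value.
print(np.exp(gm.score_samples(np.array([[0.0, 0.5]]))))
```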


Reinforcement learning

We want to learn $f: X \to Y$:
• We see examples of inputs x but not y
• We select y for an observed x from the available choices
• We get feedback (reinforcement) from a critic about how good our choice of y was
• The goal is to select outputs that lead to the best reinforcement

[Diagram: input x → Learner → output/action y → Critic → reinforcement back to the Learner]
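A minimal sketch of this loop, simplified to a bandit-style setting in which the critic's reinforcement depends only on the chosen action; the actions, reward probabilities, and exploration rate are all invented for illustration and this is not the slides' algorithm:

```python
import random

random.seed(0)
actions = ["a", "b"]                    # available choices of y
value = {a: 0.0 for a in actions}       # running estimate of reinforcement
counts = {a: 0 for a in actions}

def critic(y):
    # Hypothetical critic: action "b" yields reward 1 more often than "a".
    return 1.0 if random.random() < (0.8 if y == "b" else 0.3) else 0.0

for _ in range(1000):
    # Explore with small probability, otherwise exploit the best estimate.
    if random.random() < 0.1:
        y = random.choice(actions)
    else:
        y = max(actions, key=value.get)
    r = critic(y)                        # feedback on how good the choice was
    counts[y] += 1
    value[y] += (r - value[y]) / counts[y]  # running average of the reward

print(value)  # the estimate for "b" should approach 0.8
```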

Learning: first look

• Assume we see examples of pairs (x, y) in D and we want to learn the mapping $f: X \to Y$ to predict y for some future x
• We get the data D – what should we do?

[Figure: scatter plot of the observed pairs (x, y)]


Learning: first look

• Problem: many possible functions $f: X \to Y$ exist for representing the mapping between x and y
• Which one to choose? Many examples are still unseen!

[Figure: the same data with several different candidate functions passing through it]

Learning: first look

• Solution: make an assumption about the model, say, $f(x) = ax + b$

[Figure: the data with a straight-line fit]


Learning: first look

• Choosing a parametric model or a set of models is not enough
  Still too many functions $f(x) = ax + b$
  – One for every pair of parameters a, b

[Figure: the data with many candidate lines, one per choice of a, b]

Learning: first look

• We want the best set of model parameters
  – reduce the misfit between the model M and the observed data D
  – or, in other words, explain the data the best
• How to measure the misfit?

[Figure: the data with a candidate line; the misfit at each point is the difference between the observed value of y and the model prediction]




Learning: first look

• We want the best set of model parameters
  – reduce the misfit between the model M and the observed data D
  – or, in other words, explain the data the best
• How to measure the misfit?

Objective function:
• Error function: measures the misfit between D and M
• Examples of error functions:
  – Average squared error: $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$
  – Average absolute error: $\frac{1}{n}\sum_{i=1}^{n} |y_i - f(x_i)|$
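Both error functions are one-liners to compute; a minimal sketch with invented values:

```python
import numpy as np

y = np.array([3.1, -0.4, 2.2, 5.0])     # observed outputs y_i
y_hat = np.array([2.8, 0.1, 2.0, 4.5])  # model predictions f(x_i)

mse = np.mean((y - y_hat) ** 2)   # average squared error
mae = np.mean(np.abs(y - y_hat))  # average absolute error
print(mse, mae)
```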

Learning: first look

• Linear regression problem
  – Minimizes the squared error function $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$ for the linear model

[Figure: the data with the least-squares line]
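For the linear model the squared error can be minimized in closed form; a minimal sketch using NumPy's least-squares solver on synthetic data (the true coefficients 4.0 and 1.0 are an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.5, 2.0, size=50)
y = 4.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)  # noisy linear data

# Solve min_{a,b} sum_i (y_i - (a*x_i + b))^2 via least squares.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)  # should come out close to the true 4.0 and 1.0
```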


Learning: first look

• Application: a new example x with unknown value y is checked against the model, and y is calculated as $y = f(x) = ax + b$

[Figure: the fitted line used to read off the prediction y = ? for a new input x]

Supervised learning: Classification

• Data D: pairs (x, y) where y is a class label
  Examples of y: patient will be readmitted or not, has the disease (case) or not (control)

[Figure: 2-D scatter over features x1, x2 with cases and controls marked]


Supervised learning: Classification

• Find a model $f: X \to R$, say $f(\mathbf{x}) = ax_1 + bx_2 + c$, that defines a decision boundary $f(\mathbf{x}) = 0$ that separates the two classes well
  – Note that some examples are not correctly classified

[Figure: the case/control scatter with the linear decision boundary f(x) = 0 drawn between the classes]
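A minimal sketch of using such a linear decision boundary; the parameter values a, b, c are invented, and in practice they would be learned from the case/control data:

```python
# Hypothetical learned parameters of f(x) = a*x1 + b*x2 + c.
a, b, c = 1.5, -2.0, 0.25

def classify(x1, x2):
    f = a * x1 + b * x2 + c
    return "case" if f > 0 else "control"  # f(x) = 0 is the boundary

print(classify(1.0, 0.2))   # one side of the boundary
print(classify(-1.0, 1.0))  # the other side
```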

Supervised learning: Classification

• A new example x with an unknown class label is checked against the model, and the class label is assigned

[Figure: the case/control scatter with a new, unlabeled point marked “?”]


Learning: first look

1. Data: $D = \{d_1, d_2, \ldots, d_n\}$
2. Model selection:
   – Select a model or a set of models (with parameters), e.g. $y = ax + b$
3. Choose the objective function
   – Squared error: $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$
4. Learning:
   – Find the set of parameters optimizing the error function
   – The model and parameters with the smallest error
5. Application:
   – Apply the learned model to new data
   – E.g. predict y's for new inputs x using the learned $f(x)$
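All five steps fit in a few lines; a sketch on synthetic data, with comments keyed to the numbered steps above (the data-generating line is an assumption for illustration):

```python
import numpy as np

# 1. Data: D = {(x_i, y_i)} (synthetic here).
rng = np.random.default_rng(1)
x = rng.uniform(-2.5, 2.0, size=30)
y = 2.0 * x - 0.5 + rng.normal(0.0, 0.8, size=30)

# 2. Model selection: y = a*x + b, with parameters (a, b).
# 3. Objective: squared error (1/n) * sum_i (y_i - f(x_i))^2.
def error(a, b):
    return np.mean((y - (a * x + b)) ** 2)

# 4. Learning: find (a, b) minimizing the error (least squares).
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print("fit:", a, b, "training error:", error(a, b))

# 5. Application: predict y for a new input x.
x_new = 0.7
print("prediction:", a * x_new + b)
```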



Learning: first look

1. Data: $D = \{d_1, d_2, \ldots, d_n\}$
2. Model selection:
   – Select a model or a set of models (with parameters), e.g. $y = ax + b$
3. Choose the objective function
   – Squared error: $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$
4. Learning:
   – Find the set of parameters optimizing the error function
   – The model and parameters with the smallest error
5. Application:
   – Apply the learned model to new data

• Looks straightforward, but there are problems …

Learning

Problem:
• We fit the model based on past experience (the past examples seen)
• But ultimately we are interested in learning a mapping that performs well on the whole population of examples

Training data: the data used to fit the parameters of the model

Training error: $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$

True (generalization) error (over the whole unknown population): $E_{(x,y)}[(y - f(x))^2]$, the expected squared error

The training error tries to approximate the true error!
Does a good training error imply a good generalization error?
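The gap between the two errors can be seen numerically; a sketch that fits a model on a small training sample and approximates the true error with a large fresh sample from the same (synthetic) distribution:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    # Synthetic population: y = sin(2x) plus noise (an assumption).
    x = rng.uniform(-2, 2, size=n)
    return x, np.sin(2 * x) + rng.normal(0.0, 0.3, size=n)

x_tr, y_tr = sample(15)               # the training data we actually see
coef = np.polyfit(x_tr, y_tr, deg=3)  # fit a cubic polynomial to it

# Training error vs. an estimate of the true (generalization) error,
# approximated by a large fresh sample standing in for the population.
x_pop, y_pop = sample(100000)
train_err = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
true_err = np.mean((y_pop - np.polyval(coef, x_pop)) ** 2)
print(train_err, true_err)  # the training error is typically the optimistic one
```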


Overfitting

• Assume we have a set of 10 points and we consider polynomial functions as our possible models

[Figure: scatter plot of the 10 data points]

Overfitting

• Fitting a linear function with the squared error
• The error is nonzero

[Figure: the 10 points with a straight-line fit that misses several of them]



Overfitting

• Linear vs. cubic polynomial
• A higher-order polynomial leads to a better fit and a smaller error

[Figure: the 10 points with both a linear and a cubic fit; the cubic follows the data more closely]


Overfitting

• Is it always good to minimize the error on the observed data?

[Figure: the 10 points with an increasingly flexible fit]

Overfitting

• For 10 data points, a degree 9 polynomial gives a perfect fit (Lagrange interpolation). The error is zero.
• Is it always good to minimize the training error? NO!
• More important: how do we perform on the unseen data?

[Figure: a degree 9 polynomial passing exactly through all 10 points]
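The effect is easy to reproduce; a sketch comparing a linear and a degree 9 fit on 10 synthetic points (np.polyfit may emit a conditioning warning at degree 9, which is itself a symptom of the problem):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.5, 1.5, 10)
y = 2.0 * x + rng.normal(0.0, 1.0, size=10)   # 10 noisy points

x_test = rng.uniform(-1.5, 1.5, size=1000)    # unseen data, same range
y_test = 2.0 * x_test + rng.normal(0.0, 1.0, size=1000)

for deg in (1, 9):
    coef = np.polyfit(x, y, deg)
    tr = np.mean((y - np.polyval(coef, x)) ** 2)
    te = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    print(f"degree {deg}: training error {tr:.3f}, test error {te:.3f}")
# Degree 9 drives the training error to ~0 while the test error blows up.
```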

Overfitting

• The situation when the training error is low and the generalization error is high. Causes of the phenomenon:
  – a model with more degrees of freedom (more parameters)
  – a small data size (as compared to the complexity of the model)

[Figure: the perfectly interpolated degree 9 fit from the previous slide]


How to evaluate the learner’s performance?

• The generalization error $E_{(x,y)}[(y - f(x))^2]$ is the true error for the population of examples that we would like to optimize
• But it cannot be computed exactly
• Optimizing the (mean) training error $\frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$ can lead to overfit, i.e. the training error may not properly reflect the generalization error
• So how to test the generalization error?

How to assess the learner’s performance?

• The generalization error $E_{(x,y)}[(y - f(x))^2]$ is the true error for the population of examples that we would like to optimize
• The sample mean only approximates it
• How to measure the generalization error? Two ways:
  – Theoretical: the Law of Large Numbers
    • statistical bounds on the difference between the true and sample mean errors
  – Practical: use a separate data set with m data samples to test
    • (Mean) test error: $\frac{1}{m}\sum_{j=1}^{m} (y_j - f(x_j))^2$


Testing of learning models

• Simple holdout method
  – Divide the data into training and test data
  – Typically 2/3 training and 1/3 testing

[Diagram: Dataset → Training set (learn/fit the predictive model) and Testing set (evaluate the model)]

Testing of models

[Diagram: the data set of cases and controls is split into a training set and a test set; the model is learned on the training set and evaluated on the test set]
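A minimal holdout sketch; the random 2/3 : 1/3 split mirrors the slide, the data are synthetic, and scikit-learn's train_test_split would do the same splitting job:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=90)
y = 1.5 * x + rng.normal(0.0, 0.5, size=90)

idx = rng.permutation(len(x))           # shuffle, then split 2/3 : 1/3
tr, te = idx[:60], idx[60:]

coef = np.polyfit(x[tr], y[tr], deg=1)  # learn (fit) on the training set
test_err = np.mean((y[te] - np.polyval(coef, x[te])) ** 2)
print("mean test error:", test_err)     # evaluate on the test set
```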


Density estimation

Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values

Objective: estimate the model of the underlying probability distribution $p(\mathbf{X})$ over the variables $\mathbf{X}$, using the examples in D

true distribution $p(\mathbf{X})$ → n samples $D = \{D_1, D_2, \ldots, D_n\}$ → estimate $\hat{p}(\mathbf{X})$

Density estimation

Standard (iid) assumptions: the samples
• are independent of each other
• come from the same (identical) distribution (a fixed $p(\mathbf{X})$)

That is, the instances are drawn independently from the same fixed distribution $p(\mathbf{X})$.


Learning via parameter estimation

In this lecture we consider parametric density estimation.

Basic settings:
• A set of random variables $\mathbf{X} = \{X_1, X_2, \ldots, X_d\}$
• A model of the distribution over the variables in $\mathbf{X}$ with parameters $\Theta$
• Data $D = \{D_1, D_2, \ldots, D_n\}$

Objective: find the parameters $\Theta$ that fit the data the best
• What is the best set of parameters?
  – There are various criteria one can apply here.

Parameter estimation. Basic criteria.

• Maximum likelihood (ML)
  maximize $p(D \mid \Theta, \xi)$
  ($\xi$ represents prior (background) knowledge)

• Maximum a posteriori probability (MAP)
  maximize $p(\Theta \mid D, \xi)$ – selects the mode of the posterior
  $p(\Theta \mid D, \xi) = \dfrac{p(D \mid \Theta, \xi)\, p(\Theta \mid \xi)}{p(D \mid \xi)}$
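For a concrete instance, consider a single Bernoulli parameter θ (the probability of heads) with a Beta prior playing the role of the background knowledge ξ: the ML estimate is the sample fraction, and the MAP estimate is the mode of the Beta posterior. A sketch with invented data and prior counts:

```python
# Coin-flip data: 1 = heads, 0 = tails (invented for illustration).
data = [1, 1, 0, 1, 0, 1, 1, 1]
n, h = len(data), sum(data)

# Maximum likelihood: the theta maximizing p(D | theta) is h / n.
theta_ml = h / n

# MAP with a Beta(alpha, beta) prior: the posterior is
# Beta(alpha + h, beta + n - h), and the MAP estimate is its mode.
alpha, beta = 2.0, 2.0
theta_map = (h + alpha - 1) / (n + alpha + beta - 2)

print(theta_ml, theta_map)  # 0.75 vs 0.7 with this prior
```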