
  • Introduction to Machine Learning

    Figures

    Ethem Alpaydın

    © The MIT Press, 2004

    1

  • Chapter 1:

    Introduction

    2

  • [Plot: savings vs. income; '+' Low-Risk, '−' High-Risk; decision thresholds θ1, θ2]

    Figure 1.1: Example of a training dataset where each circle corresponds to one data instance with input values in the corresponding axes and its sign indicates the class. For simplicity, only two customer attributes, income and savings, are taken as input and the two classes are low-risk ('+') and high-risk ('−'). An example discriminant that separates the two types of examples is also shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    3

  • [Plot: x: mileage, y: price]

    Figure 1.2: A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only input attribute and a linear model is used. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    4

  • Chapter 2:

    Supervised Learning

    5

  • [Plot: x1: price vs. x2: engine power; '+' and '−' examples at coordinates (x1t, x2t)]

    Figure 2.1: Training set for the class of a "family car." Each data point corresponds to one example car and the coordinates of the point indicate the price and engine power of that car. '+' denotes a positive example of the class (a family car), and '−' denotes a negative example (not a family car); it is another type of car. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    6

  • [Plot: x1: price vs. x2: engine power; rectangle C bounded by p1 ≤ x1 ≤ p2 and e1 ≤ x2 ≤ e2]

    Figure 2.2: Example of a hypothesis class. The class of family car is a rectangle in the price–engine power space. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    7

  • [Plot: x1: price vs. x2: engine power; actual class C, hypothesis h, and the false negative and false positive regions]

    Figure 2.3: C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a false negative, and the point where C is 0 but h is 1 is a false positive. Other points, namely true positives and true negatives, are correctly classified. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    8

  • [Plot: x1: price vs. x2: engine power; class C with hypotheses S and G]

    Figure 2.4: S is the most specific hypothesis and G is the most general hypothesis. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    9

  • [Plot: x1 vs. x2; four points and axis-aligned rectangles]

    Figure 2.5: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    10

  • [Plot: x1 vs. x2; rectangles C and h]

    Figure 2.6: The difference between h and C is the sum of four rectangular strips, one of which is shaded. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    11

  • [Plot: x1 vs. x2; hypotheses h1 and h2]

    Figure 2.7: When there is noise, there is not a simple boundary between the positive and negative instances, and zero misclassification error may not be possible with a simple hypothesis. A rectangle is a simple hypothesis with four parameters defining the corners. An arbitrary closed form can be drawn by piecewise functions with a larger number of control points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    12

  • [Plot: price vs. engine power; regions for family car, sports car, and luxury sedan; '?' marks reject regions]

    Figure 2.8: There are three classes: family car, sports car, and luxury sedan. There are three hypotheses induced, each one covering the instances of one class and leaving outside the instances of the other two classes. '?' are reject regions where no, or more than one, class is chosen. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    13

  • [Plot: x: mileage, y: price]

    Figure 2.9: Linear, second-order, and sixth-order polynomials are fitted to the same set of points. The highest order gives a perfect fit, but given this much data, it is very unlikely that the real curve is so shaped. The second order seems better than the linear fit in capturing the trend in the training data. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    14

  • Chapter 3:

    Bayes Decision Theory

    15

  • [Plot: x1 vs. x2; decision regions C1, C2, C3 and a reject region]

    Figure 3.1: Example of decision regions and decision boundaries. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    16

  • [Network: Rain → Wet grass; P(R)=0.4, P(W | R)=0.9, P(W | ~R)=0.2]

    Figure 3.2: Bayesian network modeling that rain is the cause of wet grass. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    17

  • [Network: Sprinkler → Wet grass ← Rain; P(S)=0.2, P(R)=0.4, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.3: Rain and sprinkler are the two causes of wet grass. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    18
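The conditional probabilities above fully determine the joint distribution, so any query can be answered by enumeration, for example the probability that it rained given that the grass is wet. A minimal sketch in Python, using the CPT values printed on the slide:

```python
# Exact inference by enumeration in the sprinkler/rain network of figure 3.3.
# CPT values are those printed on the slide; variables are booleans.
P_S = 0.2
P_R = 0.4
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}  # keyed by (R, S)

def joint(r, s, w):
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

# P(R=1 | W=1): sum the joint over the hidden variable S and normalize.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)  # ≈ 0.7
```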

  • [Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass ← Rain; P(C)=0.5, P(S | C)=0.1, P(S | ~C)=0.5, P(R | C)=0.8, P(R | ~C)=0.1, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.4: If it is cloudy, it is likely that it will rain and we will not use the sprinkler. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    19

  • [Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass ← Rain, Rain → Roof; P(C)=0.5, P(S | C)=0.1, P(S | ~C)=0.5, P(R | C)=0.8, P(R | ~C)=0.1, P(F | R)=0.1, P(F | ~R)=0.7, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.5: Rain not only makes the grass wet but also disturbs the cat who normally makes noise on the roof. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    20

  • [Network: C → x, with P(C) and p(x | C)]

    Figure 3.6: Bayesian network for classification. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    21

  • [Network: C → x1, x2, …, xd, with P(C) and p(x1 | C), p(x2 | C), …, p(xd | C)]

    Figure 3.7: Naive Bayes' classifier is a Bayesian network for classification assuming independent inputs. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    22
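The independence assumption in figure 3.7 turns the posterior into a product of one-dimensional terms, P(C | x) ∝ P(C) ∏j p(xj | C). A minimal Gaussian naive Bayes sketch of that structure; the toy dataset and the helper names are hypothetical:

```python
import math

# Gaussian naive Bayes: P(C | x) ∝ P(C) * prod_j p(x_j | C), with each
# p(x_j | C) a univariate Gaussian estimated from the training sample.
def fit(X, y):
    model = {}
    for c in set(y):
        Xc = [x for x, t in zip(X, y) if t == c]
        prior = len(Xc) / len(X)
        means = [sum(col) / len(col) for col in zip(*Xc)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                 for col, m in zip(zip(*Xc), means)]
        model[c] = (prior, means, vars_)
    return model

def predict(model, x):
    def log_post(c):  # log prior + sum of per-dimension log likelihoods
        prior, means, vars_ = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                 for v, m, s in zip(x, means, vars_))
        return math.log(prior) + ll
    return max(model, key=log_post)

X = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]  # hypothetical data
y = [0, 0, 1, 1]
model = fit(X, y)
print(predict(model, [1.1, 1.1]))  # → 0
```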

  • [Influence diagram: x → choose class → utility U]

    Figure 3.8: Influence diagram corresponding to classification. Depending on input x, a class is chosen that incurs a certain utility (risk). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    23

  • Chapter 4:

    Parametric Methods

    24

  • [Plot: estimates di (denoted '×') scattered around E[d], with bias (distance from θ) and variance marked]

    Figure 4.1: θ is the parameter to be estimated. di are several estimates (denoted by '×') over different samples. Bias is the difference between the expected value of d and θ. Variance is how much the di are scattered around the expected value. We would like both to be small. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    25

  • [Plots: likelihoods p(x | Ci) and posteriors P(Ci | x) over x ∈ [−10, 10]]

    Figure 4.2: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    26

  • [Plots: likelihoods p(x | Ci) and posteriors P(Ci | x) over x ∈ [−10, 10]]

    Figure 4.3: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    27

  • [Plot: linear model E[R|x] = wx + w0, with p(r|x*) a Gaussian centered at E[R|x*]]

    Figure 4.4: Regression assumes 0 mean Gaussian noise added to the model; here, the model is linear. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    28
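Under the 0-mean Gaussian noise assumption of figure 4.4, maximizing the likelihood is equivalent to minimizing squared error, which for a single input has the familiar closed-form solution. A sketch on hypothetical synthetic data (true w = 2, w0 = 1):

```python
import random

# Least-squares fit of the linear model E[R|x] = w*x + w0.
# Synthetic data: true slope 2, intercept 1, Gaussian noise of sd 0.1.
random.seed(0)
X = [i / 10 for i in range(50)]
r = [2 * x + 1 + random.gauss(0, 0.1) for x in X]

n = len(X)
mx = sum(X) / n
mr = sum(r) / n
# Closed-form solution: w = Cov(x, r) / Var(x), w0 = mean(r) - w * mean(x).
w = sum((x - t_mx) * (t - mr) for x, t, t_mx in zip(X, r, [mx] * n)) \
    / sum((x - mx) ** 2 for x in X)
w0 = mr - w * mx
print(round(w, 2), round(w0, 2))  # close to the true values 2 and 1
```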

  • [Plots: (a) function and data, (b) order 1, (c) order 3, (d) order 5; x ∈ [0, 5], y ∈ [−5, 5]]

    Figure 4.5: (a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0, 1)) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) are five polynomial fits, gi(·), of order 1, 3, and 5. For each case, the dotted line is the average of the five fits, namely, g(·). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    29

  • [Plot: bias, variance, and error versus polynomial order 1–5]

    Figure 4.6: In the same setting as that of figure 4.5, using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5. Order 1 has the smallest variance. Order 5 has the smallest bias. As the order is increased, bias decreases but variance increases. Order 3 has the minimum error. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    30
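The bias and variance curves of figure 4.6 can be estimated by the same recipe: draw many noisy datasets from f(x) = 2 sin(1.5x), fit a polynomial to each, and compare the fits on a fixed grid. A sketch (the sample sizes and seed are arbitrary choices, not the book's):

```python
import numpy as np

# Bias/variance experiment behind figures 4.5-4.6: sample many noisy
# datasets from f, fit polynomials, measure squared bias and variance.
rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(1.5 * x)
grid = np.linspace(0, 5, 41)

def bias2_and_variance(order, n_models=100, n_points=20):
    fits = []
    for _ in range(n_models):
        x = rng.uniform(0, 5, n_points)
        r = f(x) + rng.normal(0, 1, n_points)  # N(0, 1) noise as in the book
        coef = np.polyfit(x, r, order)
        fits.append(np.polyval(coef, grid))
    fits = np.array(fits)
    g_bar = fits.mean(axis=0)                  # average fit over datasets
    bias2 = np.mean((g_bar - f(grid)) ** 2)    # squared bias
    variance = np.mean(fits.var(axis=0))       # average variance
    return bias2, variance

for order in (1, 3, 5):
    b, v = bias2_and_variance(order)
    print(order, round(b, 2), round(v, 2))     # bias falls, variance rises
```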

  • [Plots: (a) data and fitted polynomials, (b) training and validation error vs. polynomial order 1–8]

    Figure 4.7: In the same setting as that of figure 4.5, training and validation sets (each containing 50 instances) are generated. (a) Training data and fitted polynomials of order from 1 to 8. (b) Training and validation errors as a function of the polynomial order. The "elbow" is at 3. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    31

  • Chapter 5:

    Multivariate Methods

    32

  • [Surface plot: bivariate normal density over x1, x2]

    Figure 5.1: Bivariate normal distribution. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    33

  • [Figure 5.2 contour plots of the bivariate normal for four cases: Cov(x1,x2)=0 with Var(x1)=Var(x2); Cov(x1,x2)=0 with Var(x1)>Var(x2); Cov(x1,x2)>0; Cov(x1,x2)<0]

    34

  • [Surface plots: p(x | C1) and P(C1 | x) over x1, x2]

    Figure 5.3: Classes have different covariance matrices. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    35

  • Figure 5.4: Covariances may be arbitrary but shared by both classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    36

  • Figure 5.5: All classes have equal, diagonal covariance matrices but variances are not equal. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    37

  • Figure 5.6: All classes have equal, diagonal covariance matrices of equal variances on both dimensions. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    38

  • Chapter 6:

    Dimensionality Reduction

    39

  • [Plot: data in (x1, x2) with principal axes z1, z2]

    Figure 6.1: Principal components analysis centers the sample and then rotates the axes to line up with the directions of highest variance. If the variance on z2 is too small, it can be ignored and we have dimensionality reduction from two to one. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    40

  • [Plots: (a) eigenvalues vs. eigenvectors (scree graph), (b) proportion of variance vs. eigenvectors]

    Figure 6.2: (a) Scree graph. (b) Proportion of variance explained is given for the Optdigits dataset from the UCI Repository. This is a handwritten digit dataset with ten classes and sixty-four dimensional inputs. The first twenty eigenvectors explain 90 percent of the variance. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    41
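The scree graph and proportion-of-variance curve of figure 6.2 come directly from the eigenvalues of the sample covariance matrix. A sketch with random correlated data standing in for Optdigits:

```python
import numpy as np

# Eigenvalues of the sample covariance give the scree graph; their
# cumulative sum gives the proportion of variance explained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated features

Xc = X - X.mean(axis=0)                           # center the sample
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]  # sorted descending
prop_var = np.cumsum(eigvals) / eigvals.sum()

# Smallest k whose leading eigenvectors explain 90% of the variance:
k = int(np.searchsorted(prop_var, 0.9)) + 1
print(k, np.round(prop_var, 3))
```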

  • [Scatter plot: Optdigits projected onto the first two eigenvectors, each point drawn as its digit label 0–9]

    Figure 6.3: Optdigits data plotted in the space of two principal components. Only the labels of a hundred data points are shown to minimize the ink-to-noise ratio. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    42

  • [Diagram: PCA maps variables x1, …, xd to new variables z1, …, zk via W; FA posits factors z1, …, zk that generate the variables x1, …, xd via V]

    Figure 6.4: Principal components analysis generates new variables that are linear combinations of the original input variables. In factor analysis, however, we posit that there are factors that when linearly combined generate the input variables. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    43

  • [Plot: unit normal factors (z1, z2) mapped to inputs (x1, x2)]

    Figure 6.5: Factors are independent unit normals that are stretched, rotated, and translated to make up the inputs. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    44

  • [Scatter plot: twelve European cities placed in two dimensions by MDS]

    Figure 6.6: Map of Europe drawn by MDS. Cities include Athens, Berlin, Dublin, Helsinki, Istanbul, Lisbon, London, Madrid, Moscow, Paris, Rome, and Zurich. Pairwise road travel distances between these cities are given as input, and MDS places them in two dimensions such that these distances are preserved as well as possible. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    45

  • [Plot: two classes with means m1, m2 and variances s1², s2² projected onto direction w]

    Figure 6.7: Two-dimensional, two-class data projected on w. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    46

  • [Scatter plot: Optdigits projected onto the first two LDA dimensions, each point drawn as its digit label 0–9]

    Figure 6.8: Optdigits data plotted in the space of the first two dimensions found by LDA. Comparing this with figure 6.3, we see that LDA, as expected, leads to a better separation of classes than PCA. Even in this two-dimensional space (there are nine), we can discern separate clouds for different classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    47

  • Chapter 7:

    Clustering

    48

  • [Diagram: the encoder finds the closest code word mi to x, sends the index i over the communication line, and the decoder outputs x′ = mi]

    Figure 7.1: Given x, the encoder sends the index of the closest code word and the decoder generates the code word with the received index as x′. Error is ‖x′ − x‖². From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    49

  • [Plots: k-means on data in (x1, x2) — initial centers, and after 1, 2, and 3 iterations]

    Figure 7.2: Evolution of k-means. Crosses indicate center positions. Data points are marked depending on the closest center. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    50

  • Initialize mi, i = 1, . . . , k, for example, to k random xt
    Repeat
        For all xt ∈ X
            bti ← 1 if ‖xt − mi‖ = minj ‖xt − mj‖, 0 otherwise
        For all mi, i = 1, . . . , k
            mi ← Σt bti xt / Σt bti
    Until mi converge

    Figure 7.3: k-means algorithm. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    51
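The pseudocode of figure 7.3 translates almost line for line into NumPy; the two-blob toy dataset below is hypothetical:

```python
import numpy as np

# k-means as in figure 7.3: assign each point to its closest center
# (the b_i^t indicators), move each center to the mean of its points,
# and repeat until the centers stop changing.
def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)]  # init to k random x^t
    while True:
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        b = d.argmin(axis=1)                     # closest-center labels
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                # until m_i converge
            return new_m, b
        m = new_m

# Hypothetical toy data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
centers, labels = kmeans(X, 2)
print(np.round(np.sort(centers[:, 0]), 1))  # one center near 0, one near 5
```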

  • [Plot: data in (x1, x2) with two fitted Gaussians and the hi = 0.5 boundary]

    Figure 7.4: Data points and the fitted Gaussians by EM, initialized by one k-means iteration of figure 7.2. Unlike in k-means, as can be seen, EM allows estimating the covariance matrices. The data points labeled by greater hi, the contours of the estimated Gaussian densities, and the separating curve of hi = 0.5 (dashed line) are shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    52

  • [Plot: six points a–f in two dimensions and the corresponding dendrogram over height h]

    Figure 7.5: A two-dimensional dataset and the dendrogram showing the result of single-link clustering. Note that leaves of the tree are ordered so that no branches cross. The tree is then intersected at a desired value of h to get the clusters. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    53

  • Chapter 8:

    Nonparametric Methods

    54

  • [Plots: histograms of the same data for h = 2, h = 1, h = 0.5]

    Figure 8.1: Histograms for various bin lengths. '×' denote data points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    55

  • [Plots: naive estimates for h = 2, h = 1, h = 0.5]

    Figure 8.2: Naive estimate for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    56

  • [Plots: kernel estimates for h = 1, h = 0.5, h = 0.25]

    Figure 8.3: Kernel estimate for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    57
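The kernel estimator of figure 8.3 is p̂(x) = (1/Nh) Σt K((x − xt)/h), with K a smooth kernel, here the unit Gaussian. A minimal sketch on a hypothetical sample:

```python
import math

# Gaussian kernel density estimate: each data point contributes a bump
# of width h; smaller h gives a spikier estimate.
def kernel_estimate(x, data, h):
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(K((x - xt) / h) for xt in data) / (len(data) * h)

data = [1.0, 1.5, 2.0, 4.0, 4.5]  # hypothetical sample
for h in (1.0, 0.5, 0.25):
    print(h, round(kernel_estimate(2.0, data, h), 3))
```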

  • [Plots: k-nearest neighbor estimates for k = 5, k = 3, k = 1]

    Figure 8.4: k-nearest neighbor estimate for various k values. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    58

  • [Plot: x1 vs. x2; Voronoi tessellation (dotted) and class discriminant, with removable instances marked '*']

    Figure 8.5: Dotted lines are the Voronoi tessellation and the straight line is the class discriminant. In condensed nearest neighbor, those instances that do not participate in defining the discriminant (marked by '*') can be removed without increasing the training error. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    59

  • Z ← ∅
    Repeat
        For all x ∈ X (in random order)
            Find x′ ∈ Z s.t. ‖x − x′‖ = minxj∈Z ‖x − xj‖
            If class(x) ≠ class(x′) add x to Z
    Until Z does not change

    Figure 8.6: Condensed nearest neighbor algorithm. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    60
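A direct translation of figure 8.6's condensed nearest neighbor into Python; the two-cluster toy data is hypothetical:

```python
import math
import random

# Condensed nearest neighbor: keep only the instances that the current
# subset Z misclassifies with 1-NN, until a full pass adds nothing.
def condense(X, y, seed=0):
    rnd = random.Random(seed)
    Z = []                                   # indices of kept instances
    changed = True
    while changed:                           # until Z does not change
        changed = False
        order = list(range(len(X)))
        rnd.shuffle(order)                   # in random order
        for t in order:
            if not Z:
                Z.append(t); changed = True; continue
            nearest = min(Z, key=lambda j: math.dist(X[t], X[j]))
            if y[nearest] != y[t]:           # misclassified: add to Z
                Z.append(t); changed = True
    return Z

# Hypothetical data: two well-separated clusters, one per class.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
y = [0, 0, 0, 1, 1, 1]
Z = condense(X, y)
print(len(Z))  # far fewer than the 6 original instances are kept
```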

  • [Plots: regressogram smoothers for h = 6, h = 3, h = 1]

    Figure 8.7: Regressograms for various bin lengths. '×' denote data points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    61

  • [Plots: running mean smoothers for h = 6, h = 3, h = 1]

    Figure 8.8: Running mean smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    62

  • [Plots: kernel smoothers for h = 1, h = 0.5, h = 0.25]

    Figure 8.9: Kernel smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    63

  • [Plots: running line smoothers for h = 6, h = 3, h = 1]

    Figure 8.10: Running line smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    64

  • [Plots: regressograms with linear fits for h = 6, h = 3, h = 1]

    Figure 8.11: Regressograms with linear fits in bins for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    65

  • Chapter 9:

    Decision Trees

    66

  • [Plot and tree: data in (x1, x2) split at thresholds w10 and w20; decision nodes x1 > w10 and x2 > w20 with leaves labeled C1 and C2]

    Figure 9.1: Example of a dataset and the corresponding decision tree. Oval nodes are the decision nodes and rectangles are leaf nodes. The univariate decision node splits along one axis, and successive splits are orthogonal to each other. After the first split, {x|x1 < w10} is pure and is not split further. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    67

  • [Plot: entropy = −p log2(p) − (1−p) log2(1−p) versus p ∈ [0, 1]]

    Figure 9.2: Entropy function for a two-class problem. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    68
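The curve of figure 9.2 is the impurity measure a classification tree uses to choose splits; it is 0 for a pure node (p = 0 or 1) and maximal (1 bit) at p = 0.5:

```python
import math

# Two-class entropy: H(p) = -p*log2(p) - (1-p)*log2(1-p).
def entropy(p):
    if p in (0.0, 1.0):          # lim x->0 of x*log(x) is 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5), round(entropy(0.9), 3))  # → 1.0 0.469
```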

  • GenerateTree(X)
        If NodeEntropy(X) < θI   /* eq. 9.3 */
            Create leaf labelled by majority class in X
            Return
        i ← SplitAttribute(X)
        For each branch of xi
            Find Xi falling in branch
            GenerateTree(Xi)

    SplitAttribute(X)
        MinEnt ← MAX
        For all attributes i = 1, . . . , d
            If xi is discrete with n values
                Split X into X1, . . . , Xn by xi
                e ← SplitEntropy(X1, . . . , Xn)   /* eq. 9.8 */
                If e < MinEnt MinEnt ← e; bestf ← i
            Else /* xi is numeric */
                For all possible splits
                    Split X into X1, X2 on xi
                    e ← SplitEntropy(X1, X2)
                    If e < MinEnt MinEnt ← e; bestf ← i
        Return bestf

    Figure 9.3: Classification tree construction. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    69

  • [Plots: regression tree smooths for θr = 0.5, θr = 0.2, θr = 0.05]

    Figure 9.4: Regression tree smooths for various values of θr. The corresponding trees are given in figure 9.5. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    70

  • [Diagrams: three regression trees of increasing depth, with decision nodes such as x < 3.16, x < 1.36, x < 5.96, x < 6.91 and numeric values at the leaves]

    Figure 9.5: Regression trees implementing the smooths of figure 9.4 for various values of θr. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    71

  • [Diagram: tree with decision nodes x1 > 38.5, x2 > 2.5, and x4 ∈ {'A', 'B', 'C'}, with numeric leaves; x1: Age, x2: Years in job, x3: Gender, x4: Job type]

    Figure 9.6: Example of a (hypothetical) decision tree. Each path from the root to a leaf can be written down as a conjunctive rule, composed of conditions defined by the decision nodes on the path. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    72

  • Ripper(Pos,Neg,k)
        RuleSet ← LearnRuleSet(Pos,Neg)
        For k times
            RuleSet ← OptimizeRuleSet(RuleSet,Pos,Neg)

    LearnRuleSet(Pos,Neg)
        RuleSet ← ∅
        DL ← DescLen(RuleSet,Pos,Neg)
        Repeat
            Rule ← LearnRule(Pos,Neg)
            Add Rule to RuleSet
            DL′ ← DescLen(RuleSet,Pos,Neg)
            If DL′ > DL + 64
                PruneRuleSet(RuleSet,Pos,Neg)
                Return RuleSet
            If DL′ < DL
                DL ← DL′

    Figure 9.7: Ripper algorithm for learning rules. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    73

  • PruneRuleSet(RuleSet,Pos,Neg)
        For each Rule ∈ RuleSet in reverse order
            DL ← DescLen(RuleSet,Pos,Neg)
            DL′ ← DescLen(RuleSet−Rule,Pos,Neg)
            If DL′ < DL Delete Rule from RuleSet
        Return RuleSet

    OptimizeRuleSet(RuleSet,Pos,Neg)
        For each Rule ∈ RuleSet
            DL0 ← DescLen(RuleSet,Pos,Neg)
            DL1 ← DescLen(RuleSet−Rule+ReplaceRule(RuleSet,Pos,Neg),Pos,Neg)
            DL2 ← DescLen(RuleSet−Rule+ReviseRule(RuleSet,Rule,Pos,Neg),Pos,Neg)
            If DL1 = min(DL0,DL1,DL2)
                Delete Rule from RuleSet and add ReplaceRule(RuleSet,Pos,Neg)
            Else If DL2 = min(DL0,DL1,DL2)
                Delete Rule from RuleSet and add ReviseRule(RuleSet,Rule,Pos,Neg)
        Return RuleSet

    Figure 9.7: Ripper algorithm for learning rules (cont'd). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    74

  • [Plot and tree: oblique split w11x1 + w12x2 + w10 = 0 separating C1 and C2]

    Figure 9.8: Example of a linear multivariate decision tree. The linear multivariate node can place an arbitrary hyperplane and thus is more general, whereas the univariate node is restricted to axis-aligned splits. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    75

  • Chapter 10:

    Linear Discrimination

    76

  • [Plot: x1 vs. x2; classes C1 and C2 separated by the line g(x) = w1x1 + w2x2 + w0 = 0, with half-spaces g(x) > 0 and g(x) < 0]

    Figure 10.1: In the two-dimensional case, the linear discriminant is a line that separates the examples from two classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    77

  • [Plot: weight vector w normal to the line g(x) = 0; distances |w0|/‖w‖ from the origin and |g(x)|/‖w‖ from a point x]

    Figure 10.2: The geometric interpretation of the linear discriminant. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    78

  • [Plot: classes C1, C2, C3 with hyperplanes H1, H2, H3]

    Figure 10.3: In linear classification, each hyperplane Hi separates the examples of Ci from the examples of all other classes. Thus for it to work, the classes should be linearly separable. Dotted lines are the induced boundaries of the linear classifier. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    79

  • [Plot: classes C1, C2, C3 with pairwise hyperplanes H12, H23, H31]

    Figure 10.4: In pairwise linear separation, there is a separate hyperplane for each pair of classes. For an input to be assigned to C1, it should be on the positive side of H12 and H13 (which is the negative side of H31); we do not care about the value of H23. In this case, C1 is not linearly separable from the other classes but is pairwise linearly separable. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    80

  • [Plot: sigmoid(x) = 1/(1 + e^−x) over x ∈ [−10, 10]]

    Figure 10.5: The logistic, or sigmoid, function. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    81

  • For j = 0, . . . , d
        wj ← rand(−0.01, 0.01)
    Repeat
        For j = 0, . . . , d
            ∆wj ← 0
        For t = 1, . . . , N
            o ← 0
            For j = 0, . . . , d
                o ← o + wj xtj
            y ← sigmoid(o)
            For j = 0, . . . , d
                ∆wj ← ∆wj + (rt − y) xtj
        For j = 0, . . . , d
            wj ← wj + η ∆wj
    Until convergence

    Figure 10.6: Logistic discrimination algorithm implementing gradient-descent for the single output case with two classes. For w0, we assume that there is an extra input x0, which is always +1: xt0 ≡ +1, ∀t. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    82
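Figure 10.6's pseudocode translates directly into Python; the constant input x0 ≡ +1 supplies the bias weight w0, and the toy one-dimensional dataset below is hypothetical:

```python
import math
import random

# Batch gradient-descent logistic discrimination for two classes, as in
# figure 10.6: dw_j accumulates (r^t - y) x_j^t over the sample.
def train(X, r, eta=0.5, epochs=1000, seed=0):
    rnd = random.Random(seed)
    d = len(X[0])
    w = [rnd.uniform(-0.01, 0.01) for _ in range(d + 1)]
    for _ in range(epochs):
        dw = [0.0] * (d + 1)
        for x, rt in zip(X, r):
            xt = [1.0] + list(x)                       # x0 ≡ +1
            o = sum(wj * xj for wj, xj in zip(w, xt))
            y = 1 / (1 + math.exp(-o))                 # sigmoid
            for j in range(d + 1):
                dw[j] += (rt - y) * xt[j]
        w = [wj + eta * dwj for wj, dwj in zip(w, dw)]
    return w

def predict(w, x):
    o = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if o > 0 else 0

X = [[0.0], [0.5], [2.5], [3.0]]   # one input; classes split around 1.5
r = [0, 0, 1, 1]
w = train(X, r)
print([predict(w, x) for x in X])  # → [0, 0, 1, 1]
```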

  • [Plot: data ('◦' and '×') with the sigmoid output P(C|x) after 10, 100, and 1,000 iterations]

    Figure 10.7: For a univariate two-class problem (shown with '◦' and '×'), the evolution of the line wx + w0 and the sigmoid output after 10, 100, and 1,000 iterations over the sample. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    83

  • For i = 1, . . . , K, For j = 0, . . . , d, wij ← rand(−0.01, 0.01)
    Repeat
        For i = 1, . . . , K, For j = 0, . . . , d, ∆wij ← 0
        For t = 1, . . . , N
            For i = 1, . . . , K
                oi ← 0
                For j = 0, . . . , d
                    oi ← oi + wij xtj
            For i = 1, . . . , K
                yi ← exp(oi)/Σk exp(ok)
            For i = 1, . . . , K
                For j = 0, . . . , d
                    ∆wij ← ∆wij + (rti − yi) xtj
        For i = 1, . . . , K
            For j = 0, . . . , d
                wij ← wij + η ∆wij
    Until convergence

    Figure 10.8: Logistic discrimination algorithm
    implementing gradient descent for the case with
    K > 2 classes. For generality, we take xt0 ≡ +1, ∀t.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    84
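The K-class batch algorithm of figure 10.8 maps directly onto matrix operations: one softmax over all outputs, one outer-product-style accumulation per epoch. The three-point toy dataset and the name `train_softmax_batch` are illustrative assumptions.

```python
# Minimal sketch of figure 10.8: batch gradient descent with
# softmax outputs for K > 2 classes, on an assumed toy dataset.
import numpy as np

def softmax(O):
    E = np.exp(O - O.max(axis=1, keepdims=True))    # numerically stable
    return E / E.sum(axis=1, keepdims=True)

def train_softmax_batch(X, R, eta=0.1, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])       # x0 = +1
    W = rng.uniform(-0.01, 0.01, (R.shape[1], Xb.shape[1]))
    for _ in range(epochs):
        Y = softmax(Xb @ W.T)                       # all y_i^t at once
        W += eta * (R - Y).T @ Xb                   # accumulate, then update
    return W

# Three classes, one well-separated point per class (1-of-K targets R).
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
R = np.eye(3)
W = train_softmax_batch(X, R)
pred = softmax(np.hstack([np.ones((3, 1)), X]) @ W.T).argmax(axis=1)
```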

  • Figure 10.9: For a two-dimensional problem with
    three classes, the solution found by logistic
    discrimination. Thin lines are where gi(x) = 0, and
    the thick line is the boundary induced by the linear
    classifier choosing the maximum. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    85

  • Figure 10.10: For the same example in figure 10.9,
    the linear discriminants (top), and the posterior
    probabilities after the softmax (bottom). From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    86

  • Figure 10.11: On both sides of the optimal
    separating hyperplane, the instances are at least
    1/‖w‖ away and the total margin is 2/‖w‖. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    87

  • Figure 10.12: In classifying an instance, there are
    three possible cases: In (1), ξ = 0; it is on the right
    side and sufficiently away. In (2), ξ = 1 + g(x) > 1; it
    is on the wrong side. In (3), ξ = 1 − g(x), 0 < ξ < 1; it
    is on the right side but is in the margin and not
    sufficiently away. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    88

  • Figure 10.13: Quadratic and ε-sensitive error
    functions. We see that the ε-sensitive error function
    is not affected by small errors and is also affected
    less by large errors, and thus is more robust to
    outliers. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    89

  • Chapter 11:

    Multilayer Perceptrons

    90

  • Figure 11.1: Simple perceptron. xj , j = 1, . . . , d are
    the input units. x0 is the bias unit that always has
    the value 1. y is the output unit. wj is the weight of
    the directed connection from input xj to the output.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    91

  • Figure 11.2: K parallel perceptrons. xj , j = 0, . . . , d
    are the inputs and yi, i = 1, . . . , K are the outputs.
    wij is the weight of the connection from input xj to
    output yi. Each output is a weighted sum of the
    inputs. When used for a K-class classification
    problem, there is a postprocessing to choose the
    maximum, or softmax if we need the posterior
    probabilities. From: E. Alpaydın. 2004. Introduction
    to Machine Learning. © The MIT Press.

    92

  • For i = 1, . . . , K
        For j = 0, . . . , d
            wij ← rand(−0.01, 0.01)
    Repeat
        For all (xt, rt) ∈ X in random order
            For i = 1, . . . , K
                oi ← 0
                For j = 0, . . . , d
                    oi ← oi + wij xtj
            For i = 1, . . . , K
                yi ← exp(oi)/Σk exp(ok)
            For i = 1, . . . , K
                For j = 0, . . . , d
                    wij ← wij + η(rti − yi)xtj
    Until convergence

    Figure 11.3: Perceptron training algorithm
    implementing stochastic online gradient descent for
    the case with K > 2 classes. This is the online
    version of the algorithm given in figure 10.8. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    93
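The online variant of figure 11.3 differs from figure 10.8 only in that the weights are updated after each instance, visited in random order. A sketch, again on an assumed toy dataset:

```python
# Minimal sketch of figure 11.3: stochastic online updates for
# K parallel perceptrons with softmax outputs (toy data assumed).
import numpy as np

def train_perceptrons_online(X, R, eta=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])        # x0 = +1
    W = rng.uniform(-0.01, 0.01, (R.shape[1], Xb.shape[1]))
    for _ in range(epochs):
        for t in rng.permutation(len(Xb)):           # random order over X
            o = W @ Xb[t]
            y = np.exp(o - o.max())
            y /= y.sum()                             # softmax of o_i
            W += eta * np.outer(R[t] - y, Xb[t])     # update per instance
    return W

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # one point per class
R = np.eye(3)
W = train_perceptrons_online(X, R)
pred = (np.hstack([np.ones((3, 1)), X]) @ W.T).argmax(axis=1)
```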

  • Figure 11.4: The perceptron that implements AND
    (weights +1, +1 and bias −1.5) and its geometric
    interpretation. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    94
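The AND perceptron of figure 11.4 is small enough to write out directly: the weighted sum x1 + x2 − 1.5 is positive only when both inputs are 1.

```python
# The AND perceptron of figure 11.4: w1 = w2 = +1, bias w0 = -1.5,
# thresholded at 0.
def perceptron_and(x1, x2):
    return 1 if 1.0 * x1 + 1.0 * x2 - 1.5 > 0 else 0
```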

  • Figure 11.5: The XOR problem is not linearly separable.
    We cannot draw a line where the empty circles are
    on one side and the filled circles on the other side.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    95

  • Figure 11.6: The structure of a multilayer
    perceptron. From: E. Alpaydın. 2004. Introduction
    to Machine Learning. © The MIT Press.

    96

  • Figure 11.7: The multilayer perceptron that solves
    the XOR problem. The hidden units and the output
    have the threshold activation function with threshold
    at 0. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    97
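The XOR network of figure 11.7 can be checked in a few lines. The weights below realize z1 = "x1 AND NOT x2", z2 = "x2 AND NOT x1", and y = "z1 OR z2", with each −0.5 acting as the bias term of a unit thresholded at 0.

```python
# The XOR multilayer perceptron of figure 11.7 with threshold units.
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)      # fires only for (1, 0)
    z2 = step(-x1 + x2 - 0.5)     # fires only for (0, 1)
    return step(z1 + z2 - 0.5)    # OR of the two hidden units
```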

  • Figure 11.8: Sample training data shown as ‘+’,
    where xt ∼ U(−0.5, 0.5) and yt = f(xt) + N(0, 0.1).
    f(x) = sin(6x) is shown by a dashed line. The
    evolution of the fit of an MLP with two hidden units
    after 100, 200, and 300 epochs is drawn. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    98

  • Figure 11.9: The mean square error on training and
    validation sets as a function of training epochs.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    99

  • Figure 11.10: (a) The hyperplanes of the hidden unit
    weights on the first layer, (b) hidden unit outputs,
    and (c) hidden unit outputs multiplied by the weights
    on the second layer. Two sigmoid hidden units
    slightly displaced, one multiplied by a negative
    weight, when added, implement a bump. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    100

  • Initialize all vih and whj to rand(−0.01, 0.01)
    Repeat
        For all (xt, rt) ∈ X in random order
            For h = 1, . . . , H
                zh ← sigmoid(whT xt)
            For i = 1, . . . , K
                yi = viT z
            For i = 1, . . . , K
                ∆vi = η(rti − yti) z
            For h = 1, . . . , H
                ∆wh = η(Σi (rti − yti)vih) zh(1 − zh) xt
            For i = 1, . . . , K
                vi ← vi + ∆vi
            For h = 1, . . . , H
                wh ← wh + ∆wh
    Until convergence

    Figure 11.11: Backpropagation algorithm for training
    a multilayer perceptron for regression with K
    outputs. This code can easily be adapted for
    two-class classification (by setting a single sigmoid
    output) and to K > 2 classification (by using softmax
    outputs). From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    101
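The backpropagation updates of figure 11.11 can be sketched for the single-output (K = 1) regression case; ∆w is computed before v is changed, as in the pseudocode. The identity-function toy target and the name `train_mlp` are illustrative assumptions.

```python
# Minimal sketch of figure 11.11: online backpropagation for an MLP
# with H sigmoid hidden units and one linear output (toy data assumed).
import numpy as np

def train_mlp(X, r, H=2, eta=0.2, epochs=3000, seed=1):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend bias x0 = +1
    w = rng.uniform(-0.01, 0.01, (H, Xb.shape[1]))  # first-layer weights
    v = rng.uniform(-0.01, 0.01, H + 1)             # second layer (v[0] = bias)
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            z = 1.0 / (1.0 + np.exp(-(w @ Xb[t])))  # hidden-unit outputs
            zb = np.concatenate(([1.0], z))         # prepend z0 = +1
            e = r[t] - v @ zb                       # output error r^t - y^t
            dv = eta * e * zb                       # second-layer update
            dw = eta * e * np.outer(v[1:] * z * (1 - z), Xb[t])
            v += dv
            w += dw
    return w, v

def mlp_predict(w, v, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    Z = 1.0 / (1.0 + np.exp(-(Xb @ w.T)))
    return np.hstack([np.ones((len(X), 1)), Z]) @ v

X = np.linspace(-0.5, 0.5, 16).reshape(-1, 1)
r = X[:, 0]                      # fit the identity function as a smoke test
w, v = train_mlp(X, r)
mse = np.mean((mlp_predict(w, v, X) - r) ** 2)
```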

  • Figure 11.12: As complexity increases, training error
    is fixed but the validation error starts to increase and
    the network starts to overfit. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    102

  • Figure 11.13: As training continues, the validation
    error starts to increase and the network starts to
    overfit. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    103

  • Figure 11.14: A structured MLP. Each unit is

    connected to a local group of units below it and

    checks for a particular feature—for example, edge,

    corner, and so forth—in vision. Only one hidden unit

    is shown for each region; typically there are many to

    check for different local features. From: E. Alpaydın.

2004. Introduction to Machine Learning. © The MIT Press.

    104

  • Figure 11.15: In weight sharing, different units have

    connections to different inputs but share the same

    weight value (denoted by line type). Only one set of

    units is shown; there should be multiple sets of units,

    each checking for different features. From:

    E. Alpaydın. 2004. Introduction to Machine

    Learning. c©The MIT Press.

    105

  • Figure 11.16: The identity of the object does not
    change when it is translated, rotated, or scaled. Note
    that this may not always be true, or may be true up
    to a point: ‘b’ and ‘q’ are rotated versions of each
    other. These are hints that can be incorporated into
    the learning process to make learning easier. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    106

  • Figure 11.17: Two examples of constructive
    algorithms: Dynamic node creation adds a unit to an
    existing layer. Cascade correlation adds each unit as
    a new hidden layer connected to all the previous
    layers. Dashed lines denote the newly added
    unit/connections. Bias units/weights are omitted for
    clarity. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    107

  • Figure 11.18: Optdigits data plotted in the space of
    the two hidden units of an MLP trained for
    classification. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    108

  • Figure 11.19: In the autoassociator, there are as
    many outputs as there are inputs and the desired
    outputs are the inputs. When the number of hidden
    units is less than the number of inputs, the MLP is
    trained to find the best coding of the inputs on the
    hidden units, performing dimensionality reduction.
    On the left, the first layer acts as an encoder and the
    second layer acts as the decoder. On the right, if the
    encoder and decoder are multilayer perceptrons with
    sigmoid hidden units, the network performs nonlinear
    dimensionality reduction. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    109

  • Figure 11.20: A time delay neural network. Inputs in
    a time window of length T are delayed in time until
    we can feed all T inputs as the input vector to the
    MLP. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    110

  • Figure 11.21: Examples of MLP with partial
    recurrency. Recurrent connections are shown with
    dashed lines: (a) self-connections in the hidden layer,
    (b) self-connections in the output layer, and (c)
    connections from the output to the hidden layer.
    Combinations of these are also possible. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    111

  • Figure 11.22: Backpropagation through time: (a)
    recurrent network and (b) its equivalent unfolded
    network that behaves identically in four steps. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    112

  • Chapter 12:

    Local Models

    113

  • Figure 12.1: Shaded circles are the centers and the
    empty circle is the input instance. The online version
    of k-means moves the closest center along the
    direction of (x − mi) by a factor specified by η.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    114

  • Initialize mi, i = 1, . . . , k, for example, to k random xt
    Repeat
        For all xt ∈ X in random order
            i ← arg minj ‖xt − mj‖
            mi ← mi + η(xt − mi)
    Until mi converge

    Figure 12.2: Online k-means algorithm. The batch
    version is given in figure 7.3. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    115
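The update of figure 12.2 can be sketched directly: find the closest center and move it a fraction η toward the instance. The two-blob toy data and the name `online_kmeans` are illustrative assumptions.

```python
# Minimal sketch of figure 12.2: online (winner-take-all) k-means
# on an assumed two-cluster toy dataset.
import numpy as np

def online_kmeans(X, k, eta=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].copy()  # init at k random x^t
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            i = np.argmin(((X[t] - m) ** 2).sum(axis=1))  # closest center
            m[i] += eta * (X[t] - m[i])                   # move it toward x^t
    return m

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # blob around (0, 0)
               rng.normal(10.0, 0.1, (20, 2))])  # blob around (10, 10)
m = online_kmeans(X, k=2)
centers = np.sort(m[:, 0])
```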

  • Figure 12.3: The winner-take-all competitive neural
    network, which is a network of k perceptrons with
    recurrent connections at the output. Dashed lines
    are recurrent connections; the ones that end with an
    arrow are excitatory and the ones that end with a
    circle are inhibitory. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    116

  • Figure 12.4: The distance from xa to the closest
    center is less than the vigilance value ρ and the
    center is updated as in online k-means. However, xb
    is not close enough to any of the centers and a new
    group should be created at that position. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    117

  • Figure 12.5: In the SOM, not only the closest unit
    but also its neighbors, in terms of indices, are moved
    toward the input. Here, the neighborhood is 1; mi and
    its 1-nearest neighbors are updated. Note here that
    mi+1 is far from mi, but as it is updated with mi,
    and as mi will be updated when mi+1 is the winner,
    they will become neighbors in the input space as well.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    118

  • Figure 12.6: The one-dimensional form of the
    bell-shaped function used in the radial basis function
    network. This one has m = 0 and s = 1. It is like a
    Gaussian but it is not a density; it does not integrate
    to 1. It is nonzero between (m − 3s, m + 3s), but a
    more conservative interval is (m − 2s, m + 2s). From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    119

  • Figure 12.7: The difference between local and
    distributed representations. The values are hard,
    0/1, values. One can use soft values in (0, 1) and get
    a more informative encoding. In the local
    representation, this is done by the Gaussian RBF
    that uses the distance to the center, mi, and in the
    distributed representation, this is done by the sigmoid
    that uses the distance to the hyperplane, wi. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    120

  • Figure 12.8: The RBF network where ph are the
    hidden units using the bell-shaped activation
    function. mh, sh are the first-layer parameters, and
    wi are the second-layer weights. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    121

  • Figure 12.9: (-) Before and (- -) after normalization
    for three Gaussians whose centers are denoted by ‘*’.
    Note how the nonzero region of a unit depends also
    on the positions of other units. If the spreads are
    small, normalization implements a harder split; with
    large spreads, units overlap more. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    122

  • Figure 12.10: The mixture of experts can be seen as
    an RBF network where the second-layer weights are
    outputs of linear models. Only one linear model is
    shown for clarity. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    123

  • Figure 12.11: The mixture of experts can be seen as
    a model for combining multiple models. wh are the
    models and the gating network is another model
    determining the weight of each model, as given by gh.
    Viewed in this way, neither the experts nor the gating
    are restricted to be linear. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    124

  • Chapter 13:

    Hidden Markov Models

    125

  • Figure 13.1: An example of a Markov model with three
    states, drawn as a stochastic automaton. πi is the
    probability that the system starts in state Si, and aij
    is the probability that the system moves from state Si
    to state Sj. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    126

  • Figure 13.2: An HMM unfolded in time as a lattice
    (or trellis) showing all the possible trajectories. One
    path, shown in thicker lines, is the actual (unknown)
    state trajectory that generated the observation
    sequence. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    127

  • Figure 13.3: Forward-backward procedure: (a)
    computation of αt(j) and (b) computation of βt(i).
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    128
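The forward pass of figure 13.3(a) can be sketched as a short recursion: αt(j) sums αt−1(i)aij over predecessor states and multiplies by the observation probability bj(Ot). Only the forward half is shown; the two-state toy model below is an illustrative assumption.

```python
# Minimal sketch of the forward procedure of figure 13.3(a):
# alpha_t(j) = P(O_1..O_t, q_t = S_j | model), for an assumed toy HMM.
import numpy as np

def hmm_forward(pi, A, B, O):
    alpha = pi * B[:, O[0]]               # alpha_1(i) = pi_i b_i(O_1)
    for o in O[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                    # P(O | model)

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # a_ij: transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # b_j(o): observation probabilities
p = hmm_forward(pi, A, B, [0, 1])
```

Working the recursion by hand for O = [0, 1] gives α1 = (0.54, 0.08) and P(O) = 0.209, which the code reproduces.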

  • Figure 13.4: Computation of arc probabilities, ξt(i, j).
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    129

  • Figure 13.5: Example of a left-to-right HMM. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    130

  • Chapter 14:

    Assessing and Comparing

    Classification Algorithms

    131

  • Figure 14.1: Typical ROC curve, plotting the hit rate,
    |TP|/(|TP|+|FN|), against the false alarm rate,
    |FP|/(|FP|+|TN|). Each classifier has a
    parameter, for example, a threshold, which allows us
    to move over this curve, and we decide on a point,
    based on the relative importance of hits versus false
    alarms, namely, true positives and false positives.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    132

  • Figure 14.2: 95 percent of the unit normal
    distribution lies between −1.96 and 1.96. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    133

  • Figure 14.3: 95 percent of the unit normal
    distribution lies before 1.64. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    134

  • Chapter 15:

    Combining Multiple Learners

    135

  • Figure 15.1: In voting, the combiner function f(·) is
    a weighted sum. dj are the multiple learners, and wj
    are the weights of their votes. y is the overall output.
    In the case of multiple outputs, for example,
    classification, the learners have multiple outputs dji
    whose weighted sum gives yi. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    136

  • Training:
        For all {xt, rt} (t = 1, . . . , N) ∈ X, initialize p1t ← 1/N
        For all base-learners j = 1, . . . , L
            Randomly draw Xj from X with probabilities pjt
            Train dj using Xj
            For each (xt, rt), calculate yjt ← dj(xt)
            Calculate error rate: εj ← Σt pjt · 1(yjt ≠ rt)
            If εj > 1/2, then L ← j − 1; stop
            βj ← εj/(1 − εj)
            For each (xt, rt), decrease probabilities if correct:
                If yjt = rt, then pj+1,t ← βj pjt; else pj+1,t ← pjt
            Normalize probabilities:
                Zj ← Σt pj+1,t; pj+1,t ← pj+1,t/Zj
    Testing:
        Given x, calculate dj(x), j = 1, . . . , L
        Calculate class outputs, i = 1, . . . , K:
            yi = Σj=1..L (log(1/βj)) dji(x)

    Figure 15.2: AdaBoost algorithm. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    137
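The training loop of figure 15.2 can be sketched with decision stumps as the base-learners. One simplification, a common deterministic variant: instead of drawing a weighted sample Xj, each stump is fit directly to minimize the weighted error under pj. The toy dataset and all function names are illustrative assumptions.

```python
# Minimal sketch of figure 15.2's AdaBoost with 1-D decision stumps,
# fitting each stump to the weighted error rather than sampling.
import numpy as np

def best_stump(X, r, p):
    # search all threshold/sign pairs for the lowest weighted error
    best = None
    for theta in X:
        for s in (1, -1):
            y = np.where(s * (X - theta) > 0, 1, -1)
            err = p[y != r].sum()
            if best is None or err < best[0]:
                best = (err, theta, s)
    return best

def adaboost(X, r, L=5):
    N = len(X)
    p = np.full(N, 1.0 / N)                   # p_1^t = 1/N
    models = []
    for _ in range(L):
        err, theta, s = best_stump(X, r, p)
        if err >= 0.5 or err == 0.0:          # stop as in the figure
            break
        beta = err / (1.0 - err)
        y = np.where(s * (X - theta) > 0, 1, -1)
        p = np.where(y == r, p * beta, p)     # decrease if correct
        p = p / p.sum()                       # normalize
        models.append((np.log(1.0 / beta), theta, s))
    return models

def vote(models, X):
    g = sum(w * np.where(s * (X - theta) > 0, 1, -1)
            for w, theta, s in models)
    return np.where(g > 0, 1, -1)

X = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([-1, 1, 1, -1])                  # not separable by one stump
models = adaboost(X, r)
pred = vote(models, X)
```

No single stump classifies this sample, but the log(1/βj)-weighted vote of a few stumps does.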

  • Figure 15.3: Mixture of experts is a voting method
    where the votes, as given by the gating system, are a
    function of the input. The combiner system f also
    includes this gating system. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    138

  • Figure 15.4: In stacked generalization, the combiner
    is another learner and is not restricted to being a
    linear combination as in voting. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    139

  • Figure 15.5: Cascading is a multistage method where
    there is a sequence of classifiers, and the next one is
    used only when the preceding ones are not confident.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    140

  • Chapter 16:

    Reinforcement Learning

    141

  • Figure 16.1: The agent interacts with an
    environment. At any state of the environment, the
    agent takes an action that changes the state and
    returns a reward. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    142

  • Initialize V(s) to arbitrary values
    Repeat
        For all s ∈ S
            For all a ∈ A
                Q(s, a) ← E[r|s, a] + γ Σs′∈S P(s′|s, a)V(s′)
            V(s) ← maxa Q(s, a)
    Until V(s) converge

    Figure 16.2: Value iteration algorithm for
    model-based learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    143
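The backup of figure 16.2 can be sketched in a few lines of NumPy; the two-state toy MDP below (stay vs. move to an absorbing state that pays reward 1 forever) is an illustrative assumption, not from the book.

```python
# Minimal sketch of figure 16.2's value iteration, assuming a toy MDP.
# P[a, s, s'] are transition probabilities, R[s, a] expected rewards.
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = E[r|s, a] + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)                  # V(s) = max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:      # "until V(s) converge"
            return V_new, Q.argmax(axis=1)
        V = V_new

P = np.array([[[1.0, 0.0], [0.0, 1.0]],        # action 0: stay put
              [[0.0, 1.0], [0.0, 1.0]]])       # action 1: go to state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])         # state 1 pays 1 per step
V, policy = value_iteration(P, R)
```

With γ = 0.9 the fixed point is V(1) = 1/(1 − γ) = 10 and V(0) = γV(1) = 9, with the optimal policy choosing action 1 in state 0.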

  • Initialize a policy π′ arbitrarily
    Repeat
        π ← π′
        Compute the values using π by
        solving the linear equations
            Vπ(s) = E[r|s, π(s)] + γ Σs′∈S P(s′|s, π(s))Vπ(s′)
        Improve the policy at each state
            π′(s) ← arg maxa (E[r|s, a] + γ Σs′∈S P(s′|s, a)Vπ(s′))
    Until π = π′

    Figure 16.3: Policy iteration algorithm for
    model-based learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    144

  • Figure 16.4: Example to show that Q values increase
    but never decrease. This is a deterministic grid-world
    where G is the goal state with reward 100, all other
    immediate rewards are 0, and γ = 0.9. Let us consider
    the Q value of the transition marked by the asterisk,
    and let us consider only the two paths A and B. Let
    us say that path A is seen before path B; then we
    have γ max(0, 81) = 72.9. If afterward B is seen, a
    shorter path is found and the Q value becomes
    γ max(100, 81) = 90. If B is seen before A, the Q value
    is γ max(100, 0) = 90. Then when A is seen, it does
    not change because γ max(100, 81) = 90. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    145

  • Initialize all Q(s, a) arbitrarily
    For all episodes
        Initialize s
        Repeat
            Choose a using policy derived from Q, e.g., ε-greedy
            Take action a, observe r and s′
            Update Q(s, a):
                Q(s, a) ← Q(s, a) + η(r + γ maxa′ Q(s′, a′) − Q(s, a))
            s ← s′
        Until s is terminal state

    Figure 16.5: Q learning, which is an off-policy
    temporal difference algorithm. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    146
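Figure 16.5 can be sketched on a tiny deterministic environment. The five-state corridor below (reward 100 for reaching the rightmost state, γ = 0.9, as in figure 16.4's discussion) is an illustrative assumption; ties in the greedy choice are broken randomly so early episodes still explore.

```python
# Minimal sketch of figure 16.5's Q learning on an assumed toy
# corridor: states 0..4, action 0 = left, action 1 = right;
# entering state 4 pays 100 and ends the episode.
import numpy as np

def q_learning(step_fn, n_states, n_actions, episodes=500,
               eta=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = int(rng.integers(n_states - 1))          # random nonterminal start
        while s is not None:
            if rng.random() < eps:                   # epsilon-greedy choice
                a = int(rng.integers(n_actions))
            else:                                    # greedy, random tie-break
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            r, s2 = step_fn(s, a)
            target = r if s2 is None else r + gamma * Q[s2].max()
            Q[s, a] += eta * (target - Q[s, a])      # TD update
            s = s2
    return Q

def step_fn(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    if s2 == 4:
        return 100.0, None                           # terminal goal state
    return 0.0, s2

Q = q_learning(step_fn, n_states=5, n_actions=2)
```

As in figure 16.4, the learned values back up as powers of γ times 100, and the greedy policy points right in every nonterminal state.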

  • Initialize all Q(s, a) arbitrarily
    For all episodes
        Initialize s
        Choose a using policy derived from Q, e.g., ε-greedy
        Repeat
            Take action a, observe r and s′
            Choose a′ using policy derived from Q, e.g., ε-greedy
            Update Q(s, a):
                Q(s, a) ← Q(s, a) + η(r + γQ(s′, a′) − Q(s, a))
            s ← s′, a ← a′
        Until s is terminal state

    Figure 16.6: Sarsa algorithm, which is an on-policy
    version of Q learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    147

  • Figure 16.7: Example of an eligibility trace for a
    value. Visits are marked by an asterisk. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    148

  • Initialize all Q(s, a) arbitrarily, e(s, a) ← 0, ∀s, a
    For all episodes
        Initialize s
        Choose a using policy derived from Q, e.g., ε-greedy
        Repeat
            Take action a, observe r and s′
            Choose a′ using policy derived from Q, e.g., ε-greedy
            δ ← r + γQ(s′, a′) − Q(s, a)
            e(s, a) ← 1
            For all s, a:
                Q(s, a) ← Q(s, a) + ηδe(s, a)
                e(s, a) ← γλe(s, a)
            s ← s′, a ← a′
        Until s is terminal state

    Figure 16.8: Sarsa(λ) algorithm. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    149

  • Figure 16.9: In the case of a partially observable
    environment, the agent has a state estimator (SE)
    that keeps an internal belief state b, and the policy π
    generates actions based on the belief states. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    150

  • Figure 16.10: The grid world. The agent can move
    in the four compass directions starting from S. The
    goal state is G. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    151

  • Appendix: Probability

    152

  • Figure A.1: Probability density function of Z, the
    unit normal distribution.

    153