
  • Introduction to Machine Learning

    Figures

    Ethem Alpaydın

    © The MIT Press, 2004

    1

  • Chapter 1:

    Introduction

    2

  • [Plot: savings vs. income; '+' Low-Risk, '−' High-Risk; decision thresholds θ1, θ2]

    Figure 1.1: Example of a training dataset where each circle corresponds to one data instance with input values in the corresponding axes and its sign indicates the class. For simplicity, only two customer attributes, income and savings, are taken as input and the two classes are low-risk ('+') and high-risk ('−'). An example discriminant that separates the two types of examples is also shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    3

  • [Plot: x: mileage, y: price]

    Figure 1.2: A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only input attribute and a linear model is used. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    4

  • Chapter 2:

    Supervised Learning

    5

  • [Plot: x1: price vs. x2: engine power; '+' and '−' examples at coordinates (x1t, x2t)]

    Figure 2.1: Training set for the class of a "family car." Each data point corresponds to one example car and the coordinates of the point indicate the price and engine power of that car. '+' denotes a positive example of the class (a family car), and '−' denotes a negative example (not a family car); it is another type of car. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    6

  • [Plot: x1: price vs. x2: engine power; rectangle C bounded by p1 ≤ x1 ≤ p2 and e1 ≤ x2 ≤ e2]

    Figure 2.2: Example of a hypothesis class. The class of family car is a rectangle in the price–engine power space. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    7

  • [Plot: x1: price vs. x2: engine power; actual class C, hypothesis h, and the false negative and false positive regions]

    Figure 2.3: C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a false negative, and the point where C is 0 but h is 1 is a false positive. Other points, namely true positives and true negatives, are correctly classified. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    8

  • [Plot: x1: price vs. x2: engine power; class C with hypotheses S and G]

    Figure 2.4: S is the most specific hypothesis and G is the most general hypothesis. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    9

  • [Plot: x1 vs. x2; four points and axis-aligned rectangles]

    Figure 2.5: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    10

  • [Plot: x1 vs. x2; rectangles C and h]

    Figure 2.6: The difference between h and C is the sum of four rectangular strips, one of which is shaded. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    11

  • [Plot: x1 vs. x2; hypotheses h1 and h2]

    Figure 2.7: When there is noise, there is not a simple boundary between the positive and negative instances, and zero misclassification error may not be possible with a simple hypothesis. A rectangle is a simple hypothesis with four parameters defining the corners. An arbitrary closed form can be drawn by piecewise functions with a larger number of control points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    12

  • [Plot: price vs. engine power; regions for family car, sports car, and luxury sedan; '?' marks reject regions]

    Figure 2.8: There are three classes: family car, sports car, and luxury sedan. There are three hypotheses induced, each one covering the instances of one class and leaving outside the instances of the other two classes. '?' are reject regions where no, or more than one, class is chosen. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    13

  • [Plot: x: mileage, y: price]

    Figure 2.9: Linear, second-order, and sixth-order polynomials are fitted to the same set of points. The highest order gives a perfect fit, but given this much data, it is very unlikely that the real curve is so shaped. The second order seems better than the linear fit in capturing the trend in the training data. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    14

  • Chapter 3:

    Bayes Decision Theory

    15

  • [Plot: x1 vs. x2; decision regions C1, C2, C3 and a reject region]

    Figure 3.1: Example of decision regions and decision boundaries. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    16

  • [Network: Rain → Wet grass; P(R)=0.4, P(W | R)=0.9, P(W | ~R)=0.2]

    Figure 3.2: Bayesian network modeling that rain is the cause of wet grass. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    17

  • [Network: Sprinkler → Wet grass ← Rain; P(S)=0.2, P(R)=0.4, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.3: Rain and sprinkler are the two causes of wet grass. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    18
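The conditional probabilities above fully determine the joint distribution, so any query can be answered by enumeration, for example the probability that it rained given that the grass is wet. A minimal sketch in Python, using the CPT values printed on the slide:

```python
# Exact inference by enumeration in the sprinkler/rain network of figure 3.3.
# CPT values are those printed on the slide; variables are booleans.
P_S = 0.2
P_R = 0.4
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}  # keyed by (R, S)

def joint(r, s, w):
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

# P(R=1 | W=1): sum the joint over the hidden variable S and normalize.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)  # ≈ 0.7
```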

  • [Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass ← Rain; P(C)=0.5, P(S | C)=0.1, P(S | ~C)=0.5, P(R | C)=0.8, P(R | ~C)=0.1, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.4: If it is cloudy, it is likely that it will rain and we will not use the sprinkler. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    19

  • [Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass ← Rain, Rain → Roof; P(C)=0.5, P(S | C)=0.1, P(S | ~C)=0.5, P(R | C)=0.8, P(R | ~C)=0.1, P(F | R)=0.1, P(F | ~R)=0.7, P(W | R,S)=0.95, P(W | R,~S)=0.90, P(W | ~R,S)=0.90, P(W | ~R,~S)=0.10]

    Figure 3.5: Rain not only makes the grass wet but also disturbs the cat who normally makes noise on the roof. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    20

  • [Network: C → x, with P(C) and p(x | C)]

    Figure 3.6: Bayesian network for classification. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    21

  • [Network: C → x1, x2, …, xd, with P(C) and p(x1 | C), p(x2 | C), …, p(xd | C)]

    Figure 3.7: Naive Bayes' classifier is a Bayesian network for classification assuming independent inputs. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    22
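The independence assumption in figure 3.7 turns the posterior into a product of one-dimensional terms, P(C | x) ∝ P(C) ∏j p(xj | C). A minimal Gaussian naive Bayes sketch of that structure; the toy dataset and the helper names are hypothetical:

```python
import math

# Gaussian naive Bayes: P(C | x) ∝ P(C) * prod_j p(x_j | C), with each
# p(x_j | C) a univariate Gaussian estimated from the training sample.
def fit(X, y):
    model = {}
    for c in set(y):
        Xc = [x for x, t in zip(X, y) if t == c]
        prior = len(Xc) / len(X)
        means = [sum(col) / len(col) for col in zip(*Xc)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                 for col, m in zip(zip(*Xc), means)]
        model[c] = (prior, means, vars_)
    return model

def predict(model, x):
    def log_post(c):  # log prior + sum of per-dimension log likelihoods
        prior, means, vars_ = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                 for v, m, s in zip(x, means, vars_))
        return math.log(prior) + ll
    return max(model, key=log_post)

X = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]  # hypothetical data
y = [0, 0, 1, 1]
model = fit(X, y)
print(predict(model, [1.1, 1.1]))  # → 0
```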

  • [Influence diagram: x → choose class → utility U]

    Figure 3.8: Influence diagram corresponding to classification. Depending on input x, a class is chosen that incurs a certain utility (risk). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    23

  • Chapter 4:

    Parametric Methods

    24

  • [Plot: estimates di (denoted '×') scattered around E[d], with bias (distance from θ) and variance marked]

    Figure 4.1: θ is the parameter to be estimated. di are several estimates (denoted by '×') over different samples. Bias is the difference between the expected value of d and θ. Variance is how much the di are scattered around the expected value. We would like both to be small. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    25

  • [Plots: likelihoods p(x | Ci) and posteriors P(Ci | x) over x ∈ [−10, 10]]

    Figure 4.2: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    26

  • [Plots: likelihoods p(x | Ci) and posteriors P(Ci | x) over x ∈ [−10, 10]]

    Figure 4.3: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    27

  • [Plot: linear model E[R|x] = wx + w0, with p(r|x*) a Gaussian centered at E[R|x*]]

    Figure 4.4: Regression assumes 0 mean Gaussian noise added to the model; here, the model is linear. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    28
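Under the 0-mean Gaussian noise assumption of figure 4.4, maximizing the likelihood is equivalent to minimizing squared error, which for a single input has the familiar closed-form solution. A sketch on hypothetical synthetic data (true w = 2, w0 = 1):

```python
import random

# Least-squares fit of the linear model E[R|x] = w*x + w0.
# Synthetic data: true slope 2, intercept 1, Gaussian noise of sd 0.1.
random.seed(0)
X = [i / 10 for i in range(50)]
r = [2 * x + 1 + random.gauss(0, 0.1) for x in X]

n = len(X)
mx = sum(X) / n
mr = sum(r) / n
# Closed-form solution: w = Cov(x, r) / Var(x), w0 = mean(r) - w * mean(x).
w = sum((x - t_mx) * (t - mr) for x, t, t_mx in zip(X, r, [mx] * n)) \
    / sum((x - mx) ** 2 for x in X)
w0 = mr - w * mx
print(round(w, 2), round(w0, 2))  # close to the true values 2 and 1
```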

  • [Plots: (a) function and data, (b) order 1, (c) order 3, (d) order 5; x ∈ [0, 5], y ∈ [−5, 5]]

    Figure 4.5: (a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0, 1)) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) are five polynomial fits, gi(·), of order 1, 3, and 5. For each case, the dotted line is the average of the five fits, namely, g(·). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    29

  • [Plot: bias, variance, and error versus polynomial order 1–5]

    Figure 4.6: In the same setting as that of figure 4.5, using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5. Order 1 has the smallest variance. Order 5 has the smallest bias. As the order is increased, bias decreases but variance increases. Order 3 has the minimum error. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    30
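The bias and variance curves of figure 4.6 can be estimated by the same recipe: draw many noisy datasets from f(x) = 2 sin(1.5x), fit a polynomial to each, and compare the fits on a fixed grid. A sketch (the sample sizes and seed are arbitrary choices, not the book's):

```python
import numpy as np

# Bias/variance experiment behind figures 4.5-4.6: sample many noisy
# datasets from f, fit polynomials, measure squared bias and variance.
rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(1.5 * x)
grid = np.linspace(0, 5, 41)

def bias2_and_variance(order, n_models=100, n_points=20):
    fits = []
    for _ in range(n_models):
        x = rng.uniform(0, 5, n_points)
        r = f(x) + rng.normal(0, 1, n_points)  # N(0, 1) noise as in the book
        coef = np.polyfit(x, r, order)
        fits.append(np.polyval(coef, grid))
    fits = np.array(fits)
    g_bar = fits.mean(axis=0)                  # average fit over datasets
    bias2 = np.mean((g_bar - f(grid)) ** 2)    # squared bias
    variance = np.mean(fits.var(axis=0))       # average variance
    return bias2, variance

for order in (1, 3, 5):
    b, v = bias2_and_variance(order)
    print(order, round(b, 2), round(v, 2))     # bias falls, variance rises
```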

  • [Plots: (a) data and fitted polynomials, (b) training and validation error vs. polynomial order 1–8]

    Figure 4.7: In the same setting as that of figure 4.5, training and validation sets (each containing 50 instances) are generated. (a) Training data and fitted polynomials of order from 1 to 8. (b) Training and validation errors as a function of the polynomial order. The "elbow" is at 3. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    31

  • Chapter 5:

    Multivariate Methods

    32

  • [Surface plot: bivariate normal density over x1, x2]

    Figure 5.1: Bivariate normal distribution. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    33

  • [Figure 5.2 contour plots of the bivariate normal for four cases: Cov(x1,x2)=0 with Var(x1)=Var(x2); Cov(x1,x2)=0 with Var(x1)>Var(x2); Cov(x1,x2)>0; Cov(x1,x2)<0]

    34

  • [Surface plots: p(x | C1) and P(C1 | x) over x1, x2]

    Figure 5.3: Classes have different covariance matrices. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    35

  • Figure 5.4: Covariances may be arbitrary but shared by both classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    36

  • Figure 5.5: All classes have equal, diagonal covariance matrices but variances are not equal. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    37

  • Figure 5.6: All classes have equal, diagonal covariance matrices of equal variances on both dimensions. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    38

  • Chapter 6:

    Dimensionality Reduction

    39

  • [Plot: data in (x1, x2) with principal axes z1, z2]

    Figure 6.1: Principal components analysis centers the sample and then rotates the axes to line up with the directions of highest variance. If the variance on z2 is too small, it can be ignored and we have dimensionality reduction from two to one. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    40

  • [Plots: (a) eigenvalues vs. eigenvectors (scree graph), (b) proportion of variance vs. eigenvectors]

    Figure 6.2: (a) Scree graph. (b) Proportion of variance explained is given for the Optdigits dataset from the UCI Repository. This is a handwritten digit dataset with ten classes and sixty-four dimensional inputs. The first twenty eigenvectors explain 90 percent of the variance. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    41
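The scree graph and proportion-of-variance curve of figure 6.2 come directly from the eigenvalues of the sample covariance matrix. A sketch with random correlated data standing in for Optdigits:

```python
import numpy as np

# Eigenvalues of the sample covariance give the scree graph; their
# cumulative sum gives the proportion of variance explained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated features

Xc = X - X.mean(axis=0)                           # center the sample
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]  # sorted descending
prop_var = np.cumsum(eigvals) / eigvals.sum()

# Smallest k whose leading eigenvectors explain 90% of the variance:
k = int(np.searchsorted(prop_var, 0.9)) + 1
print(k, np.round(prop_var, 3))
```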

  • [Scatter plot: Optdigits projected onto the first two eigenvectors, each point drawn as its digit label 0–9]

    Figure 6.3: Optdigits data plotted in the space of two principal components. Only the labels of a hundred data points are shown to minimize the ink-to-noise ratio. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    42

  • [Diagram: PCA maps variables x1, …, xd to new variables z1, …, zk via W; FA posits factors z1, …, zk that generate the variables x1, …, xd via V]

    Figure 6.4: Principal components analysis generates new variables that are linear combinations of the original input variables. In factor analysis, however, we posit that there are factors that when linearly combined generate the input variables. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    43

  • [Plot: unit normal factors (z1, z2) mapped to inputs (x1, x2)]

    Figure 6.5: Factors are independent unit normals that are stretched, rotated, and translated to make up the inputs. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    44

  • [Scatter plot: twelve European cities placed in two dimensions by MDS]

    Figure 6.6: Map of Europe drawn by MDS. Cities include Athens, Berlin, Dublin, Helsinki, Istanbul, Lisbon, London, Madrid, Moscow, Paris, Rome, and Zurich. Pairwise road travel distances between these cities are given as input, and MDS places them in two dimensions such that these distances are preserved as well as possible. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    45

  • [Plot: two classes with means m1, m2 and variances s1², s2² projected onto direction w]

    Figure 6.7: Two-dimensional, two-class data projected on w. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    46

  • [Scatter plot: Optdigits projected onto the first two LDA dimensions, each point drawn as its digit label 0–9]

    Figure 6.8: Optdigits data plotted in the space of the first two dimensions found by LDA. Comparing this with figure 6.3, we see that LDA, as expected, leads to a better separation of classes than PCA. Even in this two-dimensional space (there are nine), we can discern separate clouds for different classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    47

  • Chapter 7:

    Clustering

    48

  • [Diagram: the encoder finds the closest code word mi to x, sends the index i over the communication line, and the decoder outputs x′ = mi]

    Figure 7.1: Given x, the encoder sends the index of the closest code word and the decoder generates the code word with the received index as x′. Error is ‖x′ − x‖². From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    49

  • [Plots: k-means on data in (x1, x2) — initial centers, and after 1, 2, and 3 iterations]

    Figure 7.2: Evolution of k-means. Crosses indicate center positions. Data points are marked depending on the closest center. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    50

  • Initialize mi, i = 1, . . . , k, for example, to k random xt
    Repeat
        For all xt ∈ X
            bti ← 1 if ‖xt − mi‖ = minj ‖xt − mj‖, 0 otherwise
        For all mi, i = 1, . . . , k
            mi ← Σt bti xt / Σt bti
    Until mi converge

    Figure 7.3: k-means algorithm. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    51
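The pseudocode of figure 7.3 translates almost line for line into NumPy; the two-blob toy dataset below is hypothetical:

```python
import numpy as np

# k-means as in figure 7.3: assign each point to its closest center
# (the b_i^t indicators), move each center to the mean of its points,
# and repeat until the centers stop changing.
def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)]  # init to k random x^t
    while True:
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        b = d.argmin(axis=1)                     # closest-center labels
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                # until m_i converge
            return new_m, b
        m = new_m

# Hypothetical toy data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
centers, labels = kmeans(X, 2)
print(np.round(np.sort(centers[:, 0]), 1))  # one center near 0, one near 5
```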

  • [Plot: data in (x1, x2) with two fitted Gaussians and the hi = 0.5 boundary]

    Figure 7.4: Data points and the fitted Gaussians by EM, initialized by one k-means iteration of figure 7.2. Unlike in k-means, as can be seen, EM allows estimating the covariance matrices. The data points labeled by greater hi, the contours of the estimated Gaussian densities, and the separating curve of hi = 0.5 (dashed line) are shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    52

  • [Plot: six points a–f in two dimensions and the corresponding dendrogram over height h]

    Figure 7.5: A two-dimensional dataset and the dendrogram showing the result of single-link clustering. Note that leaves of the tree are ordered so that no branches cross. The tree is then intersected at a desired value of h to get the clusters. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    53

  • Chapter 8:

    Nonparametric Methods

    54

  • [Plots: histograms of the same data for h = 2, h = 1, h = 0.5]

    Figure 8.1: Histograms for various bin lengths. '×' denote data points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    55

  • [Plots: naive estimates for h = 2, h = 1, h = 0.5]

    Figure 8.2: Naive estimate for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    56

  • [Plots: kernel estimates for h = 1, h = 0.5, h = 0.25]

    Figure 8.3: Kernel estimate for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    57
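The kernel estimator of figure 8.3 is p̂(x) = (1/Nh) Σt K((x − xt)/h), with K a smooth kernel, here the unit Gaussian. A minimal sketch on a hypothetical sample:

```python
import math

# Gaussian kernel density estimate: each data point contributes a bump
# of width h; smaller h gives a spikier estimate.
def kernel_estimate(x, data, h):
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(K((x - xt) / h) for xt in data) / (len(data) * h)

data = [1.0, 1.5, 2.0, 4.0, 4.5]  # hypothetical sample
for h in (1.0, 0.5, 0.25):
    print(h, round(kernel_estimate(2.0, data, h), 3))
```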

  • [Plots: k-nearest neighbor estimates for k = 5, k = 3, k = 1]

    Figure 8.4: k-nearest neighbor estimate for various k values. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    58

  • [Plot: x1 vs. x2; Voronoi tessellation (dotted) and class discriminant, with removable instances marked '*']

    Figure 8.5: Dotted lines are the Voronoi tessellation and the straight line is the class discriminant. In condensed nearest neighbor, those instances that do not participate in defining the discriminant (marked by '*') can be removed without increasing the training error. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    59

  • Z ← ∅
    Repeat
        For all x ∈ X (in random order)
            Find x′ ∈ Z s.t. ‖x − x′‖ = minxj∈Z ‖x − xj‖
            If class(x) ≠ class(x′) add x to Z
    Until Z does not change

    Figure 8.6: Condensed nearest neighbor algorithm. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    60
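A direct translation of figure 8.6's condensed nearest neighbor into Python; the two-cluster toy data is hypothetical:

```python
import math
import random

# Condensed nearest neighbor: keep only the instances that the current
# subset Z misclassifies with 1-NN, until a full pass adds nothing.
def condense(X, y, seed=0):
    rnd = random.Random(seed)
    Z = []                                   # indices of kept instances
    changed = True
    while changed:                           # until Z does not change
        changed = False
        order = list(range(len(X)))
        rnd.shuffle(order)                   # in random order
        for t in order:
            if not Z:
                Z.append(t); changed = True; continue
            nearest = min(Z, key=lambda j: math.dist(X[t], X[j]))
            if y[nearest] != y[t]:           # misclassified: add to Z
                Z.append(t); changed = True
    return Z

# Hypothetical data: two well-separated clusters, one per class.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
y = [0, 0, 0, 1, 1, 1]
Z = condense(X, y)
print(len(Z))  # far fewer than the 6 original instances are kept
```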

  • [Plots: regressogram smoothers for h = 6, h = 3, h = 1]

    Figure 8.7: Regressograms for various bin lengths. '×' denote data points. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    61

  • [Plots: running mean smoothers for h = 6, h = 3, h = 1]

    Figure 8.8: Running mean smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    62

  • [Plots: kernel smoothers for h = 1, h = 0.5, h = 0.25]

    Figure 8.9: Kernel smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    63

  • [Plots: running line smoothers for h = 6, h = 3, h = 1]

    Figure 8.10: Running line smooth for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    64

  • [Plots: regressograms with linear fits for h = 6, h = 3, h = 1]

    Figure 8.11: Regressograms with linear fits in bins for various bin lengths. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    65

  • Chapter 9:

    Decision Trees

    66

  • [Plot and tree: data in (x1, x2) split at thresholds w10 and w20; decision nodes x1 > w10 and x2 > w20 with leaves labeled C1 and C2]

    Figure 9.1: Example of a dataset and the corresponding decision tree. Oval nodes are the decision nodes and rectangles are leaf nodes. The univariate decision node splits along one axis, and successive splits are orthogonal to each other. After the first split, {x|x1 < w10} is pure and is not split further. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    67

  • [Plot: entropy = −p log2(p) − (1−p) log2(1−p) versus p ∈ [0, 1]]

    Figure 9.2: Entropy function for a two-class problem. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    68
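The curve of figure 9.2 is the impurity measure a classification tree uses to choose splits; it is 0 for a pure node (p = 0 or 1) and maximal (1 bit) at p = 0.5:

```python
import math

# Two-class entropy: H(p) = -p*log2(p) - (1-p)*log2(1-p).
def entropy(p):
    if p in (0.0, 1.0):          # lim x->0 of x*log(x) is 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5), round(entropy(0.9), 3))  # → 1.0 0.469
```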

  • GenerateTree(X)
        If NodeEntropy(X) < θI   /* eq. 9.3 */
            Create leaf labelled by majority class in X
            Return
        i ← SplitAttribute(X)
        For each branch of xi
            Find Xi falling in branch
            GenerateTree(Xi)

    SplitAttribute(X)
        MinEnt ← MAX
        For all attributes i = 1, . . . , d
            If xi is discrete with n values
                Split X into X1, . . . , Xn by xi
                e ← SplitEntropy(X1, . . . , Xn)   /* eq. 9.8 */
                If e < MinEnt MinEnt ← e; bestf ← i
            Else /* xi is numeric */
                For all possible splits
                    Split X into X1, X2 on xi
                    e ← SplitEntropy(X1, X2)
                    If e < MinEnt MinEnt ← e; bestf ← i
        Return bestf

    Figure 9.3: Classification tree construction. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    69

  • [Plots: regression tree smooths for θr = 0.5, θr = 0.2, θr = 0.05]

    Figure 9.4: Regression tree smooths for various values of θr. The corresponding trees are given in figure 9.5. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    70

  • [Diagrams: three regression trees of increasing depth, with decision nodes such as x < 3.16, x < 1.36, x < 5.96, x < 6.91 and numeric values at the leaves]

    Figure 9.5: Regression trees implementing the smooths of figure 9.4 for various values of θr. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    71

  • [Diagram: tree with decision nodes x1 > 38.5, x2 > 2.5, and x4 ∈ {'A', 'B', 'C'}, with numeric leaves; x1: Age, x2: Years in job, x3: Gender, x4: Job type]

    Figure 9.6: Example of a (hypothetical) decision tree. Each path from the root to a leaf can be written down as a conjunctive rule, composed of conditions defined by the decision nodes on the path. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    72

  • Ripper(Pos,Neg,k)
        RuleSet ← LearnRuleSet(Pos,Neg)
        For k times
            RuleSet ← OptimizeRuleSet(RuleSet,Pos,Neg)

    LearnRuleSet(Pos,Neg)
        RuleSet ← ∅
        DL ← DescLen(RuleSet,Pos,Neg)
        Repeat
            Rule ← LearnRule(Pos,Neg)
            Add Rule to RuleSet
            DL′ ← DescLen(RuleSet,Pos,Neg)
            If DL′ > DL + 64
                PruneRuleSet(RuleSet,Pos,Neg)
                Return RuleSet
            If DL′ < DL
                DL ← DL′

    Figure 9.7: Ripper algorithm for learning rules. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    73

  • PruneRuleSet(RuleSet,Pos,Neg)
        For each Rule ∈ RuleSet in reverse order
            DL ← DescLen(RuleSet,Pos,Neg)
            DL′ ← DescLen(RuleSet−Rule,Pos,Neg)
            If DL′ < DL Delete Rule from RuleSet
        Return RuleSet

    OptimizeRuleSet(RuleSet,Pos,Neg)
        For each Rule ∈ RuleSet
            DL0 ← DescLen(RuleSet,Pos,Neg)
            DL1 ← DescLen(RuleSet−Rule+ReplaceRule(RuleSet,Pos,Neg),Pos,Neg)
            DL2 ← DescLen(RuleSet−Rule+ReviseRule(RuleSet,Rule,Pos,Neg),Pos,Neg)
            If DL1 = min(DL0,DL1,DL2)
                Delete Rule from RuleSet and add ReplaceRule(RuleSet,Pos,Neg)
            Else If DL2 = min(DL0,DL1,DL2)
                Delete Rule from RuleSet and add ReviseRule(RuleSet,Rule,Pos,Neg)
        Return RuleSet

    Figure 9.7: Ripper algorithm for learning rules (cont'd). From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    74

  • [Plot and tree: oblique split w11x1 + w12x2 + w10 = 0 separating C1 and C2]

    Figure 9.8: Example of a linear multivariate decision tree. The linear multivariate node can place an arbitrary hyperplane and thus is more general, whereas the univariate node is restricted to axis-aligned splits. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    75

  • Chapter 10:

    Linear Discrimination

    76

  • [Plot: x1 vs. x2; classes C1 and C2 separated by the line g(x) = w1x1 + w2x2 + w0 = 0, with half-spaces g(x) > 0 and g(x) < 0]

    Figure 10.1: In the two-dimensional case, the linear discriminant is a line that separates the examples from two classes. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    77

  • [Plot: weight vector w normal to the line g(x) = 0; distances |w0|/‖w‖ from the origin and |g(x)|/‖w‖ from a point x]

    Figure 10.2: The geometric interpretation of the linear discriminant. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    78

  • [Plot: classes C1, C2, C3 with hyperplanes H1, H2, H3]

    Figure 10.3: In linear classification, each hyperplane Hi separates the examples of Ci from the examples of all other classes. Thus for it to work, the classes should be linearly separable. Dotted lines are the induced boundaries of the linear classifier. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    79

  • [Plot: classes C1, C2, C3 with pairwise hyperplanes H12, H23, H31]

    Figure 10.4: In pairwise linear separation, there is a separate hyperplane for each pair of classes. For an input to be assigned to C1, it should be on the positive side of H12 and H13 (which is the negative side of H31); we do not care about the value of H23. In this case, C1 is not linearly separable from the other classes but is pairwise linearly separable. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    80

  • [Plot: sigmoid(x) = 1/(1 + e^−x) over x ∈ [−10, 10]]

    Figure 10.5: The logistic, or sigmoid, function. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    81

  • For j = 0, . . . , d
        wj ← rand(−0.01, 0.01)
    Repeat
        For j = 0, . . . , d
            ∆wj ← 0
        For t = 1, . . . , N
            o ← 0
            For j = 0, . . . , d
                o ← o + wj xtj
            y ← sigmoid(o)
            For j = 0, . . . , d
                ∆wj ← ∆wj + (rt − y) xtj
        For j = 0, . . . , d
            wj ← wj + η ∆wj
    Until convergence

    Figure 10.6: Logistic discrimination algorithm implementing gradient-descent for the single output case with two classes. For w0, we assume that there is an extra input x0, which is always +1: xt0 ≡ +1, ∀t. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    82
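Figure 10.6's pseudocode translates directly into Python; the constant input x0 ≡ +1 supplies the bias weight w0, and the toy one-dimensional dataset below is hypothetical:

```python
import math
import random

# Batch gradient-descent logistic discrimination for two classes, as in
# figure 10.6: dw_j accumulates (r^t - y) x_j^t over the sample.
def train(X, r, eta=0.5, epochs=1000, seed=0):
    rnd = random.Random(seed)
    d = len(X[0])
    w = [rnd.uniform(-0.01, 0.01) for _ in range(d + 1)]
    for _ in range(epochs):
        dw = [0.0] * (d + 1)
        for x, rt in zip(X, r):
            xt = [1.0] + list(x)                       # x0 ≡ +1
            o = sum(wj * xj for wj, xj in zip(w, xt))
            y = 1 / (1 + math.exp(-o))                 # sigmoid
            for j in range(d + 1):
                dw[j] += (rt - y) * xt[j]
        w = [wj + eta * dwj for wj, dwj in zip(w, dw)]
    return w

def predict(w, x):
    o = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if o > 0 else 0

X = [[0.0], [0.5], [2.5], [3.0]]   # one input; classes split around 1.5
r = [0, 0, 1, 1]
w = train(X, r)
print([predict(w, x) for x in X])  # → [0, 0, 1, 1]
```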

  • [Plot: data ('◦' and '×') with the sigmoid output P(C|x) after 10, 100, and 1,000 iterations]

    Figure 10.7: For a univariate two-class problem (shown with '◦' and '×'), the evolution of the line wx + w0 and the sigmoid output after 10, 100, and 1,000 iterations over the sample. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.

    83

  • For i = 1, . . . , K, For j = 0, . . . , d, wij ← rand(−0.01, 0.01)
    Repeat
        For i = 1, . . . , K, For j = 0, . . . , d, ∆wij ← 0
        For t = 1, . . . , N
            For i = 1, . . . , K
                oi ← 0
                For j = 0, . . . , d
                    oi ← oi + wij xtj
            For i = 1, . . . , K
                yi ← exp(oi)/Σk exp(ok)
            For i = 1, . . . , K
                For j = 0, . . . , d
                    ∆wij ← ∆wij + (rti − yi) xtj
        For i = 1, . . . , K
            For j = 0, . . . , d
                wij ← wij + η ∆wij
    Until convergence

    Figure 10.8: Logistic discrimination algorithm
    implementing gradient descent for the case with
    K > 2 classes. For generality, we take xt0 ≡ +1, ∀t.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    84
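The K-class batch algorithm of figure 10.8 maps directly onto matrix operations: one softmax over all outputs, one outer-product-style accumulation per epoch. The three-point toy dataset and the name `train_softmax_batch` are illustrative assumptions.

```python
# Minimal sketch of figure 10.8: batch gradient descent with
# softmax outputs for K > 2 classes, on an assumed toy dataset.
import numpy as np

def softmax(O):
    E = np.exp(O - O.max(axis=1, keepdims=True))    # numerically stable
    return E / E.sum(axis=1, keepdims=True)

def train_softmax_batch(X, R, eta=0.1, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])       # x0 = +1
    W = rng.uniform(-0.01, 0.01, (R.shape[1], Xb.shape[1]))
    for _ in range(epochs):
        Y = softmax(Xb @ W.T)                       # all y_i^t at once
        W += eta * (R - Y).T @ Xb                   # accumulate, then update
    return W

# Three classes, one well-separated point per class (1-of-K targets R).
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
R = np.eye(3)
W = train_softmax_batch(X, R)
pred = softmax(np.hstack([np.ones((3, 1)), X]) @ W.T).argmax(axis=1)
```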

  • Figure 10.9: For a two-dimensional problem with
    three classes, the solution found by logistic
    discrimination. Thin lines are where gi(x) = 0, and
    the thick line is the boundary induced by the linear
    classifier choosing the maximum. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    85

  • Figure 10.10: For the same example in figure 10.9,
    the linear discriminants (top), and the posterior
    probabilities after the softmax (bottom). From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    86

  • Figure 10.11: On both sides of the optimal
    separating hyperplane, the instances are at least
    1/‖w‖ away and the total margin is 2/‖w‖. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    87

  • Figure 10.12: In classifying an instance, there are
    three possible cases: In (1), ξ = 0; it is on the right
    side and sufficiently away. In (2), ξ = 1 + g(x) > 1; it
    is on the wrong side. In (3), ξ = 1 − g(x), 0 < ξ < 1; it
    is on the right side but is in the margin and not
    sufficiently away. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    88

  • Figure 10.13: Quadratic and ε-sensitive error
    functions. We see that the ε-sensitive error function
    is not affected by small errors and is also affected
    less by large errors, and thus is more robust to
    outliers. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    89

  • Chapter 11:

    Multilayer Perceptrons

    90

  • Figure 11.1: Simple perceptron. xj , j = 1, . . . , d are
    the input units. x0 is the bias unit that always has
    the value 1. y is the output unit. wj is the weight of
    the directed connection from input xj to the output.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    91

  • Figure 11.2: K parallel perceptrons. xj , j = 0, . . . , d
    are the inputs and yi, i = 1, . . . , K are the outputs.
    wij is the weight of the connection from input xj to
    output yi. Each output is a weighted sum of the
    inputs. When used for a K-class classification
    problem, there is a postprocessing to choose the
    maximum, or softmax if we need the posterior
    probabilities. From: E. Alpaydın. 2004. Introduction
    to Machine Learning. © The MIT Press.

    92

  • For i = 1, . . . , K
        For j = 0, . . . , d
            wij ← rand(−0.01, 0.01)
    Repeat
        For all (xt, rt) ∈ X in random order
            For i = 1, . . . , K
                oi ← 0
                For j = 0, . . . , d
                    oi ← oi + wij xtj
            For i = 1, . . . , K
                yi ← exp(oi)/Σk exp(ok)
            For i = 1, . . . , K
                For j = 0, . . . , d
                    wij ← wij + η(rti − yi)xtj
    Until convergence

    Figure 11.3: Perceptron training algorithm
    implementing stochastic online gradient descent for
    the case with K > 2 classes. This is the online
    version of the algorithm given in figure 10.8. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    93
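The online variant of figure 11.3 differs from figure 10.8 only in that the weights are updated after each instance, visited in random order. A sketch, again on an assumed toy dataset:

```python
# Minimal sketch of figure 11.3: stochastic online updates for
# K parallel perceptrons with softmax outputs (toy data assumed).
import numpy as np

def train_perceptrons_online(X, R, eta=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])        # x0 = +1
    W = rng.uniform(-0.01, 0.01, (R.shape[1], Xb.shape[1]))
    for _ in range(epochs):
        for t in rng.permutation(len(Xb)):           # random order over X
            o = W @ Xb[t]
            y = np.exp(o - o.max())
            y /= y.sum()                             # softmax of o_i
            W += eta * np.outer(R[t] - y, Xb[t])     # update per instance
    return W

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # one point per class
R = np.eye(3)
W = train_perceptrons_online(X, R)
pred = (np.hstack([np.ones((3, 1)), X]) @ W.T).argmax(axis=1)
```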

  • Figure 11.4: The perceptron that implements AND
    (weights +1, +1 and bias −1.5) and its geometric
    interpretation. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    94
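The AND perceptron of figure 11.4 is small enough to write out directly: the weighted sum x1 + x2 − 1.5 is positive only when both inputs are 1.

```python
# The AND perceptron of figure 11.4: w1 = w2 = +1, bias w0 = -1.5,
# thresholded at 0.
def perceptron_and(x1, x2):
    return 1 if 1.0 * x1 + 1.0 * x2 - 1.5 > 0 else 0
```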

  • Figure 11.5: The XOR problem is not linearly separable.
    We cannot draw a line where the empty circles are
    on one side and the filled circles on the other side.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    95

  • Figure 11.6: The structure of a multilayer
    perceptron. From: E. Alpaydın. 2004. Introduction
    to Machine Learning. © The MIT Press.

    96

  • Figure 11.7: The multilayer perceptron that solves
    the XOR problem. The hidden units and the output
    have the threshold activation function with threshold
    at 0. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    97
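The XOR network of figure 11.7 can be checked in a few lines. The weights below realize z1 = "x1 AND NOT x2", z2 = "x2 AND NOT x1", and y = "z1 OR z2", with each −0.5 acting as the bias term of a unit thresholded at 0.

```python
# The XOR multilayer perceptron of figure 11.7 with threshold units.
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)      # fires only for (1, 0)
    z2 = step(-x1 + x2 - 0.5)     # fires only for (0, 1)
    return step(z1 + z2 - 0.5)    # OR of the two hidden units
```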

  • Figure 11.8: Sample training data shown as ‘+’,
    where xt ∼ U(−0.5, 0.5) and yt = f(xt) + N(0, 0.1).
    f(x) = sin(6x) is shown by a dashed line. The
    evolution of the fit of an MLP with two hidden units
    after 100, 200, and 300 epochs is drawn. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    98

  • Figure 11.9: The mean square error on training and
    validation sets as a function of training epochs.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    99

  • Figure 11.10: (a) The hyperplanes of the hidden unit
    weights on the first layer, (b) hidden unit outputs,
    and (c) hidden unit outputs multiplied by the weights
    on the second layer. Two sigmoid hidden units
    slightly displaced, one multiplied by a negative
    weight, when added, implement a bump. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    100

  • Initialize all vih and whj to rand(−0.01, 0.01)
    Repeat
        For all (xt, rt) ∈ X in random order
            For h = 1, . . . , H
                zh ← sigmoid(whT xt)
            For i = 1, . . . , K
                yi = viT z
            For i = 1, . . . , K
                ∆vi = η(rti − yti) z
            For h = 1, . . . , H
                ∆wh = η(Σi (rti − yti)vih) zh(1 − zh) xt
            For i = 1, . . . , K
                vi ← vi + ∆vi
            For h = 1, . . . , H
                wh ← wh + ∆wh
    Until convergence

    Figure 11.11: Backpropagation algorithm for training
    a multilayer perceptron for regression with K
    outputs. This code can easily be adapted for
    two-class classification (by setting a single sigmoid
    output) and to K > 2 classification (by using softmax
    outputs). From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    101
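The backpropagation updates of figure 11.11 can be sketched for the single-output (K = 1) regression case; ∆w is computed before v is changed, as in the pseudocode. The identity-function toy target and the name `train_mlp` are illustrative assumptions.

```python
# Minimal sketch of figure 11.11: online backpropagation for an MLP
# with H sigmoid hidden units and one linear output (toy data assumed).
import numpy as np

def train_mlp(X, r, H=2, eta=0.2, epochs=3000, seed=1):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend bias x0 = +1
    w = rng.uniform(-0.01, 0.01, (H, Xb.shape[1]))  # first-layer weights
    v = rng.uniform(-0.01, 0.01, H + 1)             # second layer (v[0] = bias)
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            z = 1.0 / (1.0 + np.exp(-(w @ Xb[t])))  # hidden-unit outputs
            zb = np.concatenate(([1.0], z))         # prepend z0 = +1
            e = r[t] - v @ zb                       # output error r^t - y^t
            dv = eta * e * zb                       # second-layer update
            dw = eta * e * np.outer(v[1:] * z * (1 - z), Xb[t])
            v += dv
            w += dw
    return w, v

def mlp_predict(w, v, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    Z = 1.0 / (1.0 + np.exp(-(Xb @ w.T)))
    return np.hstack([np.ones((len(X), 1)), Z]) @ v

X = np.linspace(-0.5, 0.5, 16).reshape(-1, 1)
r = X[:, 0]                      # fit the identity function as a smoke test
w, v = train_mlp(X, r)
mse = np.mean((mlp_predict(w, v, X) - r) ** 2)
```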

  • Figure 11.12: As complexity increases, training error
    is fixed but the validation error starts to increase and
    the network starts to overfit. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    102

  • Figure 11.13: As training continues, the validation
    error starts to increase and the network starts to
    overfit. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    103

  • Figure 11.14: A structured MLP. Each unit is

    connected to a local group of units below it and

    checks for a particular feature—for example, edge,

    corner, and so forth—in vision. Only one hidden unit

    is shown for each region; typically there are many to

    check for different local features. From: E. Alpaydın.

2004. Introduction to Machine Learning. © The MIT Press.

    104

  • Figure 11.15: In weight sharing, different units have

    connections to different inputs but share the same

    weight value (denoted by line type). Only one set of

    units is shown; there should be multiple sets of units,

    each checking for different features. From:

    E. Alpaydın. 2004. Introduction to Machine

    Learning. c©The MIT Press.

    105

  • Figure 11.16: The identity of the object does not
    change when it is translated, rotated, or scaled. Note
    that this may not always be true, or may be true up
    to a point: ‘b’ and ‘q’ are rotated versions of each
    other. These are hints that can be incorporated into
    the learning process to make learning easier. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    106

  • Figure 11.17: Two examples of constructive
    algorithms: Dynamic node creation adds a unit to an
    existing layer. Cascade correlation adds each unit as
    a new hidden layer connected to all the previous
    layers. Dashed lines denote the newly added
    unit/connections. Bias units/weights are omitted for
    clarity. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    107

  • Figure 11.18: Optdigits data plotted in the space of
    the two hidden units of an MLP trained for
    classification. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    108

  • Figure 11.19: In the autoassociator, there are as
    many outputs as there are inputs and the desired
    outputs are the inputs. When the number of hidden
    units is less than the number of inputs, the MLP is
    trained to find the best coding of the inputs on the
    hidden units, performing dimensionality reduction.
    On the left, the first layer acts as an encoder and the
    second layer acts as the decoder. On the right, if the
    encoder and decoder are multilayer perceptrons with
    sigmoid hidden units, the network performs nonlinear
    dimensionality reduction. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    109

  • Figure 11.20: A time delay neural network. Inputs in
    a time window of length T are delayed in time until
    we can feed all T inputs as the input vector to the
    MLP. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    110

  • Figure 11.21: Examples of MLP with partial
    recurrency. Recurrent connections are shown with
    dashed lines: (a) self-connections in the hidden layer,
    (b) self-connections in the output layer, and (c)
    connections from the output to the hidden layer.
    Combinations of these are also possible. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    111

  • Figure 11.22: Backpropagation through time: (a)
    recurrent network and (b) its equivalent unfolded
    network that behaves identically in four steps. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    112

  • Chapter 12:

    Local Models

    113

  • Figure 12.1: Shaded circles are the centers and the
    empty circle is the input instance. The online version
    of k-means moves the closest center along the
    direction of (x − mi) by a factor specified by η.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    114

  • Initialize mi, i = 1, . . . , k, for example, to k random xt
    Repeat
        For all xt ∈ X in random order
            i ← arg minj ‖xt − mj‖
            mi ← mi + η(xt − mi)
    Until mi converge

    Figure 12.2: Online k-means algorithm. The batch
    version is given in figure 7.3. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    115
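The update of figure 12.2 can be sketched directly: find the closest center and move it a fraction η toward the instance. The two-blob toy data and the name `online_kmeans` are illustrative assumptions.

```python
# Minimal sketch of figure 12.2: online (winner-take-all) k-means
# on an assumed two-cluster toy dataset.
import numpy as np

def online_kmeans(X, k, eta=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].copy()  # init at k random x^t
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            i = np.argmin(((X[t] - m) ** 2).sum(axis=1))  # closest center
            m[i] += eta * (X[t] - m[i])                   # move it toward x^t
    return m

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # blob around (0, 0)
               rng.normal(10.0, 0.1, (20, 2))])  # blob around (10, 10)
m = online_kmeans(X, k=2)
centers = np.sort(m[:, 0])
```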

  • Figure 12.3: The winner-take-all competitive neural
    network, which is a network of k perceptrons with
    recurrent connections at the output. Dashed lines
    are recurrent connections; the ones that end with an
    arrow are excitatory and the ones that end with a
    circle are inhibitory. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    116

  • Figure 12.4: The distance from xa to the closest
    center is less than the vigilance value ρ and the
    center is updated as in online k-means. However, xb
    is not close enough to any of the centers and a new
    group should be created at that position. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    117

  • Figure 12.5: In the SOM, not only the closest unit
    but also its neighbors, in terms of indices, are moved
    toward the input. Here, the neighborhood is 1; mi and
    its 1-nearest neighbors are updated. Note here that
    mi+1 is far from mi, but as it is updated with mi,
    and as mi will be updated when mi+1 is the winner,
    they will become neighbors in the input space as well.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    118

  • Figure 12.6: The one-dimensional form of the
    bell-shaped function used in the radial basis function
    network. This one has m = 0 and s = 1. It is like a
    Gaussian but it is not a density; it does not integrate
    to 1. It is nonzero between (m − 3s, m + 3s), but a
    more conservative interval is (m − 2s, m + 2s). From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    119

  • Figure 12.7: The difference between local and
    distributed representations. The values are hard,
    0/1, values. One can use soft values in (0, 1) and get
    a more informative encoding. In the local
    representation, this is done by the Gaussian RBF
    that uses the distance to the center, mi, and in the
    distributed representation, this is done by the sigmoid
    that uses the distance to the hyperplane, wi. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    120

  • Figure 12.8: The RBF network where ph are the
    hidden units using the bell-shaped activation
    function. mh, sh are the first-layer parameters, and
    wi are the second-layer weights. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    121

  • Figure 12.9: (-) Before and (- -) after normalization
    for three Gaussians whose centers are denoted by ‘*’.
    Note how the nonzero region of a unit depends also
    on the positions of other units. If the spreads are
    small, normalization implements a harder split; with
    large spreads, units overlap more. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    122

  • Figure 12.10: The mixture of experts can be seen as
    an RBF network where the second-layer weights are
    outputs of linear models. Only one linear model is
    shown for clarity. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    123

  • Figure 12.11: The mixture of experts can be seen as
    a model for combining multiple models. wh are the
    models and the gating network is another model
    determining the weight of each model, as given by gh.
    Viewed in this way, neither the experts nor the gating
    are restricted to be linear. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    124

  • Chapter 13:

    Hidden Markov Models

    125

  • Figure 13.1: An example of a Markov model with three
    states, drawn as a stochastic automaton. πi is the
    probability that the system starts in state Si, and aij
    is the probability that the system moves from state Si
    to state Sj. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    126

  • Figure 13.2: An HMM unfolded in time as a lattice
    (or trellis) showing all the possible trajectories. One
    path, shown in thicker lines, is the actual (unknown)
    state trajectory that generated the observation
    sequence. From: E. Alpaydın. 2004. Introduction to
    Machine Learning. © The MIT Press.

    127

  • Figure 13.3: Forward-backward procedure: (a)
    computation of αt(j) and (b) computation of βt(i).
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    128
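The forward pass of figure 13.3(a) can be sketched as a short recursion: αt(j) sums αt−1(i)aij over predecessor states and multiplies by the observation probability bj(Ot). Only the forward half is shown; the two-state toy model below is an illustrative assumption.

```python
# Minimal sketch of the forward procedure of figure 13.3(a):
# alpha_t(j) = P(O_1..O_t, q_t = S_j | model), for an assumed toy HMM.
import numpy as np

def hmm_forward(pi, A, B, O):
    alpha = pi * B[:, O[0]]               # alpha_1(i) = pi_i b_i(O_1)
    for o in O[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                    # P(O | model)

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # a_ij: transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # b_j(o): observation probabilities
p = hmm_forward(pi, A, B, [0, 1])
```

Working the recursion by hand for O = [0, 1] gives α1 = (0.54, 0.08) and P(O) = 0.209, which the code reproduces.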

  • Figure 13.4: Computation of arc probabilities, ξt(i, j).
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    129

  • Figure 13.5: Example of a left-to-right HMM. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    130

  • Chapter 14:

    Assessing and Comparing

    Classification Algorithms

    131

  • Figure 14.1: Typical ROC curve, plotting the hit rate,
    |TP|/(|TP|+|FN|), against the false alarm rate,
    |FP|/(|FP|+|TN|). Each classifier has a
    parameter, for example, a threshold, which allows us
    to move over this curve, and we decide on a point,
    based on the relative importance of hits versus false
    alarms, namely, true positives and false positives.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    132

  • Figure 14.2: 95 percent of the unit normal
    distribution lies between −1.96 and 1.96. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    133

  • Figure 14.3: 95 percent of the unit normal
    distribution lies before 1.64. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    134

  • Chapter 15:

    Combining Multiple Learners

    135

  • Figure 15.1: In voting, the combiner function f(·) is
    a weighted sum. dj are the multiple learners, and wj
    are the weights of their votes. y is the overall output.
    In the case of multiple outputs, for example,
    classification, the learners have multiple outputs dji
    whose weighted sum gives yi. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    136

  • Training:
        For all {xt, rt} (t = 1, . . . , N) ∈ X, initialize p1t ← 1/N
        For all base-learners j = 1, . . . , L
            Randomly draw Xj from X with probabilities pjt
            Train dj using Xj
            For each (xt, rt), calculate yjt ← dj(xt)
            Calculate error rate: εj ← Σt pjt · 1(yjt ≠ rt)
            If εj > 1/2, then L ← j − 1; stop
            βj ← εj/(1 − εj)
            For each (xt, rt), decrease probabilities if correct:
                If yjt = rt, then pj+1,t ← βj pjt; else pj+1,t ← pjt
            Normalize probabilities:
                Zj ← Σt pj+1,t; pj+1,t ← pj+1,t/Zj
    Testing:
        Given x, calculate dj(x), j = 1, . . . , L
        Calculate class outputs, i = 1, . . . , K:
            yi = Σj=1..L (log(1/βj)) dji(x)

    Figure 15.2: AdaBoost algorithm. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    137
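The training loop of figure 15.2 can be sketched with decision stumps as the base-learners. One simplification, a common deterministic variant: instead of drawing a weighted sample Xj, each stump is fit directly to minimize the weighted error under pj. The toy dataset and all function names are illustrative assumptions.

```python
# Minimal sketch of figure 15.2's AdaBoost with 1-D decision stumps,
# fitting each stump to the weighted error rather than sampling.
import numpy as np

def best_stump(X, r, p):
    # search all threshold/sign pairs for the lowest weighted error
    best = None
    for theta in X:
        for s in (1, -1):
            y = np.where(s * (X - theta) > 0, 1, -1)
            err = p[y != r].sum()
            if best is None or err < best[0]:
                best = (err, theta, s)
    return best

def adaboost(X, r, L=5):
    N = len(X)
    p = np.full(N, 1.0 / N)                   # p_1^t = 1/N
    models = []
    for _ in range(L):
        err, theta, s = best_stump(X, r, p)
        if err >= 0.5 or err == 0.0:          # stop as in the figure
            break
        beta = err / (1.0 - err)
        y = np.where(s * (X - theta) > 0, 1, -1)
        p = np.where(y == r, p * beta, p)     # decrease if correct
        p = p / p.sum()                       # normalize
        models.append((np.log(1.0 / beta), theta, s))
    return models

def vote(models, X):
    g = sum(w * np.where(s * (X - theta) > 0, 1, -1)
            for w, theta, s in models)
    return np.where(g > 0, 1, -1)

X = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([-1, 1, 1, -1])                  # not separable by one stump
models = adaboost(X, r)
pred = vote(models, X)
```

No single stump classifies this sample, but the log(1/βj)-weighted vote of a few stumps does.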

  • Figure 15.3: Mixture of experts is a voting method
    where the votes, as given by the gating system, are a
    function of the input. The combiner system f also
    includes this gating system. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    138

  • Figure 15.4: In stacked generalization, the combiner
    is another learner and is not restricted to being a
    linear combination as in voting. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    139

  • Figure 15.5: Cascading is a multistage method where
    there is a sequence of classifiers, and the next one is
    used only when the preceding ones are not confident.
    From: E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    140

  • Chapter 16:

    Reinforcement Learning

    141

  • Figure 16.1: The agent interacts with an
    environment. At any state of the environment, the
    agent takes an action that changes the state and
    returns a reward. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    142

  • Initialize V(s) to arbitrary values
    Repeat
        For all s ∈ S
            For all a ∈ A
                Q(s, a) ← E[r|s, a] + γ Σs′∈S P(s′|s, a)V(s′)
            V(s) ← maxa Q(s, a)
    Until V(s) converge

    Figure 16.2: Value iteration algorithm for
    model-based learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    143
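The backup of figure 16.2 can be sketched in a few lines of NumPy; the two-state toy MDP below (stay vs. move to an absorbing state that pays reward 1 forever) is an illustrative assumption, not from the book.

```python
# Minimal sketch of figure 16.2's value iteration, assuming a toy MDP.
# P[a, s, s'] are transition probabilities, R[s, a] expected rewards.
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = E[r|s, a] + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)                  # V(s) = max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:      # "until V(s) converge"
            return V_new, Q.argmax(axis=1)
        V = V_new

P = np.array([[[1.0, 0.0], [0.0, 1.0]],        # action 0: stay put
              [[0.0, 1.0], [0.0, 1.0]]])       # action 1: go to state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])         # state 1 pays 1 per step
V, policy = value_iteration(P, R)
```

With γ = 0.9 the fixed point is V(1) = 1/(1 − γ) = 10 and V(0) = γV(1) = 9, with the optimal policy choosing action 1 in state 0.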

  • Initialize a policy π′ arbitrarily
    Repeat
        π ← π′
        Compute the values using π by
        solving the linear equations
            Vπ(s) = E[r|s, π(s)] + γ Σs′∈S P(s′|s, π(s))Vπ(s′)
        Improve the policy at each state
            π′(s) ← arg maxa (E[r|s, a] + γ Σs′∈S P(s′|s, a)Vπ(s′))
    Until π = π′

    Figure 16.3: Policy iteration algorithm for
    model-based learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    144

  • Figure 16.4: Example to show that Q values increase
    but never decrease. This is a deterministic grid-world
    where G is the goal state with reward 100, all other
    immediate rewards are 0, and γ = 0.9. Let us consider
    the Q value of the transition marked by the asterisk,
    and let us consider only the two paths A and B. Let
    us say that path A is seen before path B; then we
    have γ max(0, 81) = 72.9. If afterward B is seen, a
    shorter path is found and the Q value becomes
    γ max(100, 81) = 90. If B is seen before A, the Q value
    is γ max(100, 0) = 90. Then when A is seen, it does
    not change because γ max(100, 81) = 90. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    145

  • Initialize all Q(s, a) arbitrarily
    For all episodes
        Initialize s
        Repeat
            Choose a using policy derived from Q, e.g., ε-greedy
            Take action a, observe r and s′
            Update Q(s, a):
                Q(s, a) ← Q(s, a) + η(r + γ maxa′ Q(s′, a′) − Q(s, a))
            s ← s′
        Until s is terminal state

    Figure 16.5: Q learning, which is an off-policy
    temporal difference algorithm. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    146
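Figure 16.5 can be sketched on a tiny deterministic environment. The five-state corridor below (reward 100 for reaching the rightmost state, γ = 0.9, as in figure 16.4's discussion) is an illustrative assumption; ties in the greedy choice are broken randomly so early episodes still explore.

```python
# Minimal sketch of figure 16.5's Q learning on an assumed toy
# corridor: states 0..4, action 0 = left, action 1 = right;
# entering state 4 pays 100 and ends the episode.
import numpy as np

def q_learning(step_fn, n_states, n_actions, episodes=500,
               eta=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = int(rng.integers(n_states - 1))          # random nonterminal start
        while s is not None:
            if rng.random() < eps:                   # epsilon-greedy choice
                a = int(rng.integers(n_actions))
            else:                                    # greedy, random tie-break
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            r, s2 = step_fn(s, a)
            target = r if s2 is None else r + gamma * Q[s2].max()
            Q[s, a] += eta * (target - Q[s, a])      # TD update
            s = s2
    return Q

def step_fn(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    if s2 == 4:
        return 100.0, None                           # terminal goal state
    return 0.0, s2

Q = q_learning(step_fn, n_states=5, n_actions=2)
```

As in figure 16.4, the learned values back up as powers of γ times 100, and the greedy policy points right in every nonterminal state.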

  • Initialize all Q(s, a) arbitrarily
    For all episodes
        Initialize s
        Choose a using policy derived from Q, e.g., ε-greedy
        Repeat
            Take action a, observe r and s′
            Choose a′ using policy derived from Q, e.g., ε-greedy
            Update Q(s, a):
                Q(s, a) ← Q(s, a) + η(r + γQ(s′, a′) − Q(s, a))
            s ← s′, a ← a′
        Until s is terminal state

    Figure 16.6: Sarsa algorithm, which is an on-policy
    version of Q learning. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    147

  • Figure 16.7: Example of an eligibility trace for a
    value. Visits are marked by an asterisk. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    148

  • Initialize all Q(s, a) arbitrarily, e(s, a) ← 0, ∀s, a
    For all episodes
        Initialize s
        Choose a using policy derived from Q, e.g., ε-greedy
        Repeat
            Take action a, observe r and s′
            Choose a′ using policy derived from Q, e.g., ε-greedy
            δ ← r + γQ(s′, a′) − Q(s, a)
            e(s, a) ← 1
            For all s, a:
                Q(s, a) ← Q(s, a) + ηδe(s, a)
                e(s, a) ← γλe(s, a)
            s ← s′, a ← a′
        Until s is terminal state

    Figure 16.8: Sarsa(λ) algorithm. From: E. Alpaydın.
    2004. Introduction to Machine Learning. © The MIT Press.

    149

  • Figure 16.9: In the case of a partially observable
    environment, the agent has a state estimator (SE)
    that keeps an internal belief state b, and the policy π
    generates actions based on the belief states. From:
    E. Alpaydın. 2004. Introduction to Machine
    Learning. © The MIT Press.

    150

  • Figure 16.10: The grid world. The agent can move
    in the four compass directions starting from S. The
    goal state is G. From: E. Alpaydın. 2004.
    Introduction to Machine Learning. © The MIT Press.

    151

  • Appendix: Probability

    152

  • Figure A.1: Probability density function of Z, the
    unit normal distribution.

    153