
  • Pattern Classification

    All materials in these slides were taken from
    Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork,
    John Wiley & Sons, 2000, with the permission of the authors and the publisher

  • Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3)

    • Introduction

    • Feedforward Operation and Classification

    • Backpropagation Algorithm

  • Artificial Neural Net Models

    • Massive parallelism is essential for high-performance cognitive tasks (speech & image recognition); humans need only a few msec for most cognitive tasks

    • Design a massively parallel architecture composed of many simple processing elements interconnected to achieve certain collective computational capabilities

    • Also known as connectionist models and parallel distributed processing (PDP) models

    • Derive inspiration from “knowledge” of biological neural nets

  • Natural Neural Net Models

    • Human brain consists of a very large number of neurons (between 10^10 and 10^12)

    • No. of interconnections per neuron is between 1K and 10K

    • Total number of interconnections is about 10^14

    • Damage to a few neurons or synapses (links) does not appear to impair overall performance significantly (robustness)

  • Artificial Neural Net Models

    • Artificial neural nets are specified by
      • Net topology
      • Node (processor) characteristics
      • Training/learning rules

    • We consider only feedforward multilayer networks

    • These networks essentially implement a non-parametric, non-linear classifier

  • Linear Discriminant Functions

    • A discriminant function that is a linear combination of the input features can be written as

      g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0

      where \mathbf{w} is the weight vector, w_0 is the bias or threshold weight, and the sign of the function value gives the class label

  • Introduction

    • Goal: Classify objects by learning nonlinearity

    • There are many problems for which linear discriminants are not sufficient for minimum error

    • The central difficulty is the choice of the appropriate nonlinear functions

    • A “brute force” approach might be to select a complete basis set such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples


    • There is no automatic method for determining the nonlinearities when no information is provided to the classifier

    • In using multilayer neural networks (multilayer Perceptrons), the form of the nonlinearity is learned from the training data

    • Multilayer networks can, in principle, provide the optimal solution to an arbitrary classification problem

    • Nothing “magical” about multilayer neural networks; they implement linear discriminants, but in a space where the inputs have been mapped nonlinearly

  • Feedforward Operation and Classification

    • A three-layer neural network consists of an input layer, a hidden layer and an output layer interconnected by modifiable weights, represented by links between layers

    • A multilayer neural network implements linear discriminants, but in a space where the inputs have been mapped nonlinearly

    • Figure 6.1 shows a simple three-layer network


    • A single “bias unit” is connected to each unit in addition to the input units

    • Net activation:

      net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t \mathbf{x}

      where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_{ji} denotes the input-to-hidden layer weight at hidden unit j. (In neurobiology, such weights or connections are called “synapses”)

    • Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)
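    A minimal sketch of this computation (Python/NumPy); the weight matrix W, bias vector w0, and the choice of f are illustrative assumptions, not values from the text:

      import numpy as np

      def hidden_activations(x, W, w0, f=np.tanh):
          """net_j = sum_i w_ji * x_i + w_j0 ; y_j = f(net_j)."""
          net = W @ x + w0        # one net activation per hidden unit
          return f(net)           # hidden-unit outputs y_j

      # Example with d = 2 inputs and n_H = 3 hidden units (arbitrary weights)
      x = np.array([0.5, -1.0])
      W = np.random.randn(3, 2)   # input-to-hidden weights w_ji
      w0 = np.zeros(3)            # bias weights w_j0
      y = hidden_activations(x, W, w0)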

    • Figure 6.1 shows a simple threshold function

    • The function f(.) is also called the activation function or “nonlinearity” of a unit. There are more general activation functions with desirable properties

    • Each output unit similarly computes its net activation based on the hidden unit signals as:

      net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv \mathbf{w}_k^t \mathbf{y}

      where the subscript k indexes units in the output layer and n_H denotes the number of hidden units


    • The output units are referred to as z_k. An output unit computes the nonlinear function of its net input, emitting z_k = f(net_k)

    • In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x); the input x is classified according to the largest discriminant function g_k(x), ∀ k = 1, …, c

    • The three-layer network with the weights listed in fig. 6.1 solves the XOR problem


    • The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0:
      y_1 = +1 if x_1 + x_2 + 0.5 ≥ 0, and y_1 = -1 otherwise

    • The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0:
      y_2 = +1 if x_1 + x_2 - 1.5 ≥ 0, and y_2 = -1 otherwise

    • The output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1. Using the terminology of computer logic, the units behave like gates: the first hidden unit is an OR gate, the second hidden unit is an AND gate, and the output unit implements

      z_k = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2

      which provides the nonlinear decision of fig. 6.1
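    The XOR network can be checked directly in code; a small sketch with threshold units (Python). The output unit is written as the logic y1 AND NOT y2 rather than with the specific output-layer weights of fig. 6.1, which are not listed in the text:

      def sign(v):
          """Threshold unit: +1 if v >= 0, else -1."""
          return 1 if v >= 0 else -1

      def xor_net(x1, x2):
          y1 = sign(x1 + x2 + 0.5)   # OR-like hidden unit (boundary x1 + x2 + 0.5 = 0)
          y2 = sign(x1 + x2 - 1.5)   # AND-like hidden unit (boundary x1 + x2 - 1.5 = 0)
          # The output fires only when y1 = +1 and y2 = -1, i.e. y1 AND NOT y2
          return 1 if (y1 == 1 and y2 == -1) else -1

      for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
          print(x1, x2, xor_net(x1, x2))   # prints -1, +1, +1, -1 (XOR of +/-1 inputs)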


    • General Feedforward Operation – case of c output units

    • Hidden units enable us to express more complicated nonlinear functions and extend classification capability

    • The activation function does not have to be a sign function; it is often required to be continuous and differentiable

    • We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit

    • Assume for now that all activation functions are identical

      g_k(\mathbf{x}) \equiv z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj} \, f\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \ldots, c \qquad (1)
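    Equation (1) translates almost directly into code; a minimal sketch (Python/NumPy) assuming the same activation f in both layers and illustrative parameter shapes:

      import numpy as np

      def forward(x, W_hidden, b_hidden, W_out, b_out, f=np.tanh):
          """Eq. (1): g_k(x) = f( sum_j w_kj * f( sum_i w_ji x_i + w_j0 ) + w_k0 )."""
          y = f(W_hidden @ x + b_hidden)   # hidden-unit outputs y_j
          z = f(W_out @ y + b_out)         # output values z_k = g_k(x)
          return z

      def classify(x, *params):
          """Assign x to the class with the largest discriminant g_k(x)."""
          return int(np.argmax(forward(x, *params)))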


    • Expressive Power of multi-layer Networks

    Question: Can every decision boundary be implemented by a three-layer network described by equation (1)?

    Answer: Yes (due to A. Kolmogorov)

    “Any continuous function from input to output can be implemented in a three-layer net, given sufficient number of hidden units nH, proper nonlinearities, and weights.”

    Any continuous function g(x) defined on the unit cube can be represented in the following form, for properly chosen functions δ_j and β_{ij}:

      g(\mathbf{x}) = \sum_{j=1}^{2n+1} \delta_j\!\left( \sum_{i=1}^{n} \beta_{ij}(x_i) \right), \quad \forall \mathbf{x} \in I^n \;\; (I = [0,1];\ n \ge 2)


    • The above equation (Eq. 8 in the text) can be expressed in neural network terminology as follows:

    • Each of the (2n+1) hidden units δ_j takes as input a sum of d nonlinear functions, one for each input feature x_i

    • Each hidden unit emits a nonlinear function δj of its total input

    • The output unit emits the sum of the contributions of the hidden units

    Unfortunately, Kolmogorov’s theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition

    Another question: how many hidden nodes should we have?


  • Neural Network Functions

  • Backpropagation Algorithm

    • Any function from input to output can be implemented as a three-layer neural network

    • These results are of greater theoretical interest than practical, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!


    • Our goal is to learn the interconnection weights based on the training patterns and the desired outputs

    • In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights

    • The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights. This is known as:

    The credit assignment problem

  • Network Learning


    • Network has two modes of operation:

    • Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to yield a decision from the output units

    • Learning: the supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to bring the actual outputs closer to the desired target values


  • Network Learning

    • Start with an untrained network, present a training pattern to the input layer, pass the signal through the network and determine the output.

    • Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, …, c. Let w represent all the weights of the network

    • The training error:

      J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \, \| \mathbf{t} - \mathbf{z} \|^2

    • The backpropagation learning rule is based on gradient descent

    • The weights are initialized with random values and are changed in a direction that will reduce the error:

      \Delta \mathbf{w} = -\eta \, \frac{\partial J}{\partial \mathbf{w}}

      where η is the learning rate, which indicates the relative size of the change in the weights

      w(m + 1) = w(m) + \Delta w(m)

      where m indexes the m-th training pattern presented
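    In code, the criterion and a generic gradient-descent step look as follows (Python/NumPy sketch; the value of eta is arbitrary):

      import numpy as np

      def criterion(t, z):
          """Training error for one pattern: J(w) = 1/2 * ||t - z||^2."""
          return 0.5 * np.sum((t - z) ** 2)

      def gradient_step(w, dJ_dw, eta=0.1):
          """Delta w = -eta * dJ/dw, applied as w <- w + Delta w."""
          return w - eta * dJ_dw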

    • Error on the hidden-to-output weights:

      \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}

      where the sensitivity of unit k is defined as:

      \delta_k = -\frac{\partial J}{\partial net_k}

      and describes how the overall error changes with the unit’s net activation:

      \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)


    Since net_k = \mathbf{w}_k^t \mathbf{y}, therefore:

      \frac{\partial net_k}{\partial w_{kj}} = y_j

    Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:

      \Delta w_{kj} = \eta \, \delta_k \, y_j = \eta \, (t_k - z_k) \, f'(net_k) \, y_j

    • The learning rule for the input-to-hidden units is more subtle and is the crux of the credit assignment problem

    • Error on the input-to-hidden units: using the chain rule,

      \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}

    However,

      \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \, f'(net_k) \, w_{kj}

    Similarly to the preceding case, we define the sensitivity of a hidden unit:

      \delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj} \, \delta_k

    The above equation is the core of the “credit assignment” problem: “The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_{kj}, all multiplied by f'(net_j)”; see fig. 6.5

    Conclusion: the learning rule for the input-to-hidden weights is:

      \Delta w_{ji} = \eta \, x_i \, \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \, \delta_k \right] f'(net_j) \, x_i
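    The two sensitivities and the resulting updates can be written compactly; a sketch for one pattern (Python/NumPy), where f_prime (the derivative of the chosen activation) is an assumption of this sketch and bias terms are omitted:

      import numpy as np

      def backprop_updates(t, z, y, x, W_out, net_hidden, net_out, f_prime, eta=0.1):
          """Weight updates Delta w_kj and Delta w_ji from the equations above."""
          delta_k = (t - z) * f_prime(net_out)                 # output sensitivities
          delta_j = f_prime(net_hidden) * (W_out.T @ delta_k)  # hidden sensitivities
          dW_out = eta * np.outer(delta_k, y)                  # Delta w_kj = eta * delta_k * y_j
          dW_hidden = eta * np.outer(delta_j, x)               # Delta w_ji = eta * delta_j * x_i
          return dW_hidden, dW_out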

  • Sensitivity at Hidden Node

  • Backpropagation Algorithm

    • More specifically, the “backpropagation of errors” algorithm

    • During training, an error must be propagated from the output layer back to the hidden layer to learn the input-to-hidden weights

    • It is gradient descent in a layered network

    • The exact behavior of the learning algorithm depends on the starting point

    • Start the process with random values of weights; in practice you learn many networks with different initializations

  • Training protocols:

    • Stochastic: patterns are chosen randomly from the training set; network weights are updated for each pattern

    • Batch: present all patterns before updating weights

    • On-line: present each pattern once & only once (no memory for storing patterns)

    • Stochastic backpropagation algorithm:

      Begin  initialize n_H, w, criterion θ, η, m ← 0
        do  m ← m + 1
            x^m ← randomly chosen pattern
            w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
        until ||∇J(w)|| < θ
        return w
      End
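    A runnable version of this pseudocode, offered only as a sketch (Python/NumPy): the tanh activation, the folding of bias units into the weight matrices, and the hyperparameter values are assumptions, not prescriptions from the text:

      import numpy as np

      def f(net):                       # activation (tanh assumed here)
          return np.tanh(net)

      def f_prime(net):                 # its derivative
          return 1.0 - np.tanh(net) ** 2

      def stochastic_backprop(X, T, n_H, eta=0.1, theta=1e-4, max_iter=100000, seed=0):
          """X: (n, d) training patterns; T: (n, c) targets in {-1, +1}."""
          rng = np.random.default_rng(seed)
          n, d = X.shape
          c = T.shape[1]
          # Small random initial weights; the last column of each matrix is the bias weight
          W_h = rng.uniform(-1/np.sqrt(d),   1/np.sqrt(d),   size=(n_H, d + 1))
          W_o = rng.uniform(-1/np.sqrt(n_H), 1/np.sqrt(n_H), size=(c, n_H + 1))
          for m in range(max_iter):
              i = rng.integers(n)                  # x^m: randomly chosen pattern
              x = np.append(X[i], 1.0)             # append the bias input
              net_j = W_h @ x
              y = np.append(f(net_j), 1.0)         # hidden outputs plus bias unit
              net_k = W_o @ y
              z = f(net_k)
              delta_k = (T[i] - z) * f_prime(net_k)                   # output sensitivities
              delta_j = f_prime(net_j) * (W_o[:, :n_H].T @ delta_k)   # hidden sensitivities
              dW_o = eta * np.outer(delta_k, y)    # w_kj <- w_kj + eta * delta_k * y_j
              dW_h = eta * np.outer(delta_j, x)    # w_ji <- w_ji + eta * delta_j * x_i
              W_o += dW_o
              W_h += dW_h
              # Single-pattern gradient norm as a cheap stand-in for ||grad J(w)||
              if np.sqrt(np.sum(dW_o ** 2) + np.sum(dW_h ** 2)) / eta < theta:
                  break
          return W_h, W_o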


    • Stopping criterion

    • The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ; there are other stopping criteria that lead to better performance than this one

    • A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set

    • In stochastic and batch backpropagation, we must make several passes through the training data


    • Learning Curves

    • Before training starts, the error on the training set is high; as the learning proceeds, error becomes smaller

    • Error per pattern depends on the amount of training data and the expressive power (such as the number of weights) in the network

    • Average error on an independent test set is always higher than on the training set, and it can decrease as well as increase

    • A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the classifier’s generalization power

      “Stop training when the error on the validation set is minimum”
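    A sketch of this stopping rule (Python); train_one_epoch and validation_error are hypothetical callables supplied by the caller, and the patience of 10 epochs is arbitrary:

      def train_with_early_stopping(net, train_one_epoch, validation_error, max_epochs=1000):
          """Stop training when the error on the validation set is (near) its minimum."""
          best_err, best_net, best_epoch = float("inf"), net, 0
          for epoch in range(max_epochs):
              net = train_one_epoch(net)           # hypothetical: one pass over the training data
              err = validation_error(net)          # hypothetical: error on the validation set
              if err < best_err:                   # validation error still decreasing
                  best_err, best_net, best_epoch = err, net, epoch
              elif epoch - best_epoch >= 10:       # no improvement for 10 epochs: stop
                  break
          return best_net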


  • Representation at the Hidden Layer

    • What do the learned weights mean?

    • The weights connecting the hidden layer to the output layer form a linear discriminant

    • The weights connecting the input layer to the hidden layer represent a mapping from the input feature space to a latent feature space

    • For each hidden unit, the weights from the input layer describe the input pattern that leads to the maximum activation of that node

  • Backpropagation as Feature Mapping

    • A 64-2-3 sigmoidal network for classifying three characters (E, F, L)

    • Non-linear interactions between the features may cause the features of the pattern to not manifest in a single hidden node (contrary to the example shown above)

    • It may be difficult to draw similar interpretations in large networks, and caution must be exercised while analyzing weights

    Figure: input-to-hidden layer weights for a character recognition task. The weights at the two hidden nodes are represented as 8x8 patterns; the left node gets activated for F, the right node for L, and both get activated for E

  • Practical Techniques for Improving Backpropagation

    • A naïve application of backpropagation procedures may lead to slow convergence and poor performance

    • Some practical suggestions follow; there are no theoretical results

    • Activation function f(.)
      • Must be non-linear (otherwise a 3-layer network is just a linear discriminant) and saturate (have max and min values) to keep the weights and activations bounded
      • The activation function and its derivative must be continuous and smooth; optionally monotonic
      • The choice may depend on the problem, e.g., Gaussian activation if the data come from a mixture of Gaussians
      • E.g.: sigmoid (most popular), polynomial, tanh, sign function

    • Parameters of the activation function (e.g., sigmoid)
      • Centered at 0, odd function f(-net) = -f(net) (anti-symmetric); leads to faster learning
      • Depend on the range of the input values

  • Activation Function

    The anti-symmetric sigmoid function: f(-x) = -f(x), with a = 1.716 and b = 2/3.
    The figure also shows its first & second derivatives.
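    These parameter values are usually quoted for a sigmoid of the form f(net) = a tanh(b net); assuming that form, a sketch of the function and its first derivative (Python/NumPy):

      import numpy as np

      A, B = 1.716, 2.0 / 3.0      # a and b from the slide

      def f(net):
          """Anti-symmetric sigmoid f(net) = a * tanh(b * net), so f(-x) = -f(x)."""
          return A * np.tanh(B * net)

      def f_prime(net):
          """First derivative: a * b * (1 - tanh(b * net)^2)."""
          return A * B * (1.0 - np.tanh(B * net) ** 2)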

  • Practical Considerations

    • Scaling inputs (important not just for neural networks)
      • Large differences in the scale of different features due to the choice of units are compensated by normalizing them to be in the same range, [0,1] or [-1,1]; without normalization, the error will hardly depend on features with very small values
      • Standardization: shift the inputs to have zero mean and unit variance

    • Target values
      • Use a one-of-C representation for the target vector (C is the no. of classes). It is better to use +1 and -1, which lie well within the range of the sigmoid saturation values (+1.716, -1.716)
      • Higher values (e.g. 1.716, the saturation point of the sigmoid) may require the weights to go to infinity to minimize the error

    • Training with noise
      • For small training sets, it is better to add noise to the input patterns and generate new “virtual” training patterns (a sketch of these preprocessing steps follows this list)
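    A sketch of these preprocessing steps (Python/NumPy); the +1/-1 target coding follows the slide, while the noise scale and number of virtual copies are arbitrary assumptions:

      import numpy as np

      def standardize(X):
          """Shift each feature to zero mean and scale it to unit variance."""
          return (X - X.mean(axis=0)) / X.std(axis=0)

      def one_of_c_targets(labels, c):
          """One-of-C target vectors coded with +1 / -1 instead of 1 / 0."""
          T = -np.ones((len(labels), c))
          T[np.arange(len(labels)), labels] = 1.0
          return T

      def add_noise(X, scale=0.01, copies=5, seed=0):
          """Generate 'virtual' training patterns by jittering the inputs."""
          rng = np.random.default_rng(seed)
          return np.vstack([X + rng.normal(0.0, scale, X.shape) for _ in range(copies)])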

  • Practical Considerations

    • Number of hidden units (n_H)
      • Governs the expressive power of the network
      • The easier the task, the fewer the nodes needed
      • Rule of thumb: the total no. of weights should be less than the number of training examples (preferably 10 times less); the no. of hidden units determines the total no. of weights
      • A more principled method is to adjust the network complexity in response to the training data; e.g., start with a “large” no. of hidden units and “decay”, prune, or eliminate weights

    • Initializing weights (see the sketch after this list)
      • We cannot initialize the weights to zero, otherwise learning cannot take place
      • Choose initial weights w such that |w| < w’
      • If w’ is too small, learning is slow; if too large, early saturation and no learning
      • w’ is chosen to be 1/√d for the input layer and 1/√n_H for the hidden layer
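    A sketch of this initialization rule (Python/NumPy); drawing uniformly in (-w', +w') is an assumption about how |w| < w' is realized:

      import numpy as np

      def init_weights(d, n_H, c, seed=0):
          """Uniform weights with w' = 1/sqrt(d) (input layer) and w' = 1/sqrt(n_H) (hidden layer)."""
          rng = np.random.default_rng(seed)
          W_hidden = rng.uniform(-1/np.sqrt(d),   1/np.sqrt(d),   size=(n_H, d))
          W_out    = rng.uniform(-1/np.sqrt(n_H), 1/np.sqrt(n_H), size=(c, n_H))
          return W_hidden, W_out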

  • Total No. of Weights

    Figure: error per pattern as the number of hidden nodes increases.

    • A 2-n_H-1 network (with bias) trained on 90 2D Gaussian patterns per class (n = 180), each class sampled from a mixture of 3 Gaussians

    • The minimum test error occurs at 17-21 total weights (4-5 hidden nodes). This illustrates the rule of thumb that n/10 weights often gives the lowest error

  • Practical Considerations

    • Learning rate
      • Small learning rate: slow convergence
      • Large learning rate: high oscillation and slow convergence

  • Practical Considerations

    • Momentum (a sketch of momentum and weight decay follows this list)
      • Prevents the algorithm from getting stuck at plateaus and local minima

    • Weight decay
      • Avoids overfitting by imposing the condition that the weights must be small
      • After each update, the weights are decayed by some factor
      • Related to regularization (also used in SVM)

    • Hints
      • Additional input nodes added to the NN that are only used during training; they help learn a better feature representation
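    A sketch of both update tricks (Python); the momentum coefficient alpha and the decay factor epsilon are illustrative values, not taken from the text:

      def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
          """Keep a fraction of the previous update to coast over plateaus and shallow minima."""
          velocity = alpha * velocity - eta * grad
          return w + velocity, velocity

      def weight_decay(w, epsilon=1e-4):
          """Shrink all weights after each update to keep them small (a form of regularization)."""
          return (1.0 - epsilon) * w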

  • Practical Considerations

    • Training setup
      • Online, stochastic, or batch mode

    • Stop training
      • Halt when the validation error reaches its (first) minimum

    • Number of hidden layers
      • More layers -> more complex
      • Networks with more hidden layers are more prone to get caught in local minima
      • The smaller the better (KISS)

    • Criterion function
      • We talked about squared error, but there are others