042407 Neural Networks

Michael S. Lewicki, Carnegie Mellon. Artificial Intelligence: Neural Networks.



The Iris dataset with decision tree boundaries

[Figure: Iris data plotted as petal width (cm) vs. petal length (cm), with the decision tree boundaries overlaid.]

The optimal decision boundary for C2 vs. C3

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3) over petal length (cm), shown with the petal width (cm) vs. petal length (cm) scatter.]

- The optimal decision boundary is determined from the statistical distribution of the classes.
- It is optimal only if the model is correct.
- It assigns a precise degree of uncertainty to the classification.


Optimal decision boundary

[Figure: class-conditional densities p(petal length | C2), p(petal length | C3) and posteriors p(C2 | petal length), p(C3 | petal length), with the optimal decision boundary marked.]

Can we do better?

[Figure: class-conditional densities p(petal length | C2) and p(petal length | C3), alongside the petal width (cm) vs. petal length (cm) scatter.]

- The only way is to use more information.
- Decision trees (DTs) use both petal width and petal length.


Arbitrary decision boundaries would be more powerful

[Figure: petal width (cm) vs. petal length (cm) scatter with a non-linear decision boundary sketched between the classes.]

- Decision boundaries could be non-linear.

Defining a decision boundary

[Figure: two classes plotted in the (x1, x2) plane with a linear decision boundary between them.]

- Consider just two classes: we want points on one side of the line to be in class 1, otherwise class 2.
- 2D linear discriminant function:

  y = m^T x + b

- This defines a 2D plane, which leads to the decision:

  class 1 if y >= 0,
  class 2 if y < 0.

- The decision boundary is y = m^T x + b = 0.
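
As a minimal sketch of this decision rule (the weight vector, bias, and test points below are illustrative values, not taken from the slides):

```python
import numpy as np

def classify(x, m, b):
    """Linear discriminant: class 1 if m^T x + b >= 0, else class 2."""
    y = np.dot(m, x) + b
    return 1 if y >= 0 else 2

# Illustrative boundary in the (petal length, petal width) plane.
m = np.array([1.0, 2.0])   # assumed weight vector
b = -7.0                   # assumed bias
print(classify(np.array([5.0, 1.8]), m, b))  # 1.6 >= 0  -> class 1
print(classify(np.array([3.0, 1.0]), m, b))  # -2.0 < 0  -> class 2
```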


Linear separability

[Figure: Iris data, petal width (cm) vs. petal length (cm); one pair of classes is linearly separable, the other is not.]

Two classes are linearly separable if they can be separated by a linear combination of attributes:
- 1D: threshold
- 2D: line
- 3D: plane
- M-D: hyperplane

Diagramming the classifier as a neural network

The feedforward neural network is specified by weights w_i and bias b:

  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b

It can be written equivalently as

  y = w^T x = \sum_{i=0}^{M} w_i x_i

where w_0 = b is the bias and a dummy input x_0 is always 1.

[Figure: two network diagrams with input units x_1 ... x_M, weights w_1 ... w_M, and output unit y; in the first the bias b feeds the output directly, in the second a dummy input x_0 = 1 with weight w_0 replaces it.]
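
A small sketch of why the two forms are equivalent (the weights and input are made-up values):

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])   # assumed weights w_1..w_M
b = 0.3                          # assumed bias
x = np.array([1.0, 2.0, 0.5])    # assumed input x_1..x_M

# Explicit-bias form: y = w^T x + b
y_bias = np.dot(w, x) + b

# Dummy-input form: prepend w_0 = b and x_0 = 1, then y = w^T x
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))
y_dummy = np.dot(w_aug, x_aug)

assert np.isclose(y_bias, y_dummy)   # both forms give the same output
```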


Determining, i.e. learning, the optimal linear discriminant

- First we must define an objective function, i.e. the goal of learning.
- Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
- Objective: minimize the sum-squared error over all patterns x_n:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

- Note the notation: x_n defines a pattern vector, x_n = (x_1, ..., x_M)_n.
- We can define the desired class as c_n = 0 if x_n is in class 1, and c_n = 1 if x_n is in class 2.
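
A minimal sketch of this objective in code (the patterns and labels here are random placeholders, not the Iris data):

```python
import numpy as np

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2, where X holds one pattern per row."""
    residuals = X @ w - c
    return 0.5 * np.sum(residuals ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # 20 placeholder patterns with M = 3 inputs
c = rng.integers(0, 2, size=20)  # desired classes: 0 or 1
w = np.zeros(3)                  # initial weights
print(sum_squared_error(w, X, c))
```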

We've seen this before: curve fitting

  t = sin(2 \pi x) + noise

[Figure: data points (x_n, t_n) with a fitted curve y(x_n, w); example from Bishop (2006), Pattern Recognition and Machine Learning.]


Neural networks compared to polynomial curve fitting

[Figure: four panels of polynomial fits of increasing order to the same data; example from Bishop (2006), Pattern Recognition and Machine Learning.]

  y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

For the linear network, M = 1 and there are multiple input dimensions.

General form of a linear network

A linear neural network is simply a linear transformation of the input:

  y_j = \sum_{i=0}^{M} w_{i,j} x_i

Or, in matrix-vector form:

  y = W x

Multiple outputs correspond to multivariate regression.

[Figure: network diagram with inputs x_0 = 1 (bias), x_1 ... x_M, weight matrix W, and outputs y_1 ... y_K.]
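
A minimal sketch of the matrix-vector form with the bias folded in as a column of W (dimensions and values are illustrative):

```python
import numpy as np

M, K = 4, 3                         # assumed input and output dimensions
rng = np.random.default_rng(1)
W = rng.normal(size=(K, M + 1))     # weight matrix; W[j, 0] is the bias for output j

x = rng.normal(size=M)              # an input pattern
x_aug = np.concatenate(([1.0], x))  # dummy input x_0 = 1

y = W @ x_aug                       # K outputs: y[j] = sum_i W[j, i] * x_aug[i]
print(y.shape)                      # (3,)
```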


Training the network: Optimization by gradient descent

We can adjust the weights incrementally to minimize the objective function. This is called gradient descent (or gradient ascent if we're maximizing).

The gradient descent rule for weight w_i is:

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

Or in vector form:

  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}

For gradient ascent, the sign of the gradient step changes.

[Figure: contours of the error over the weight space (w_1, w_2), with successive weight vectors w_1, w_2, w_3, w_4 stepping toward the minimum.]

Computing the gradient

Idea: minimize the error by gradient descent. Take the derivative of the objective function with respect to the weights:

  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2

  \frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \dots + w_i x_{i,n} + \dots + w_M x_{M,n} - c_n) x_{i,n}
                                  = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}

And in vector form:

  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n) x_n
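
As a sanity check on this derivation, a short sketch comparing the analytic gradient with a finite-difference estimate on random placeholder data:

```python
import numpy as np

def error(w, X, c):
    """E = 1/2 sum_n (w^T x_n - c_n)^2."""
    return 0.5 * np.sum((X @ w - c) ** 2)

def gradient(w, X, c):
    """Analytic gradient: dE/dw = sum_n (w^T x_n - c_n) x_n."""
    return X.T @ (X @ w - c)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                    # placeholder patterns
c = rng.integers(0, 2, size=10).astype(float)   # placeholder class labels
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(error(w + eps * np.eye(3)[i], X, c) -
                     error(w - eps * np.eye(3)[i], X, c)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(numeric, gradient(w, X, c)))  # True: the derivation checks out
```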


Simulation: learning the decision boundary

Each iteration updates the weights using the gradient:

  \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}

  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}

Epsilon is a small value: \epsilon = 0.1/N.
- Epsilon too large: learning diverges.
- Epsilon too small: convergence is slow.

[Figure: the two-class data in the (x1, x2) plane with the current decision boundary, and the learning curve (error vs. iteration) as training proceeds.]
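
A hedged sketch of this simulation loop, recording the error at each iteration to produce a learning curve (the data are placeholders rather than the Iris measurements; epsilon follows the slide's 0.1/N):

```python
import numpy as np

def train(X, c, epsilon, iterations):
    """Gradient descent on E = 1/2 sum_n (w^T x_n - c_n)^2; returns the learning curve."""
    w = np.zeros(X.shape[1])
    errors = []
    for _ in range(iterations):
        residuals = X @ w - c
        errors.append(0.5 * np.sum(residuals ** 2))   # record E for the learning curve
        w = w - epsilon * (X.T @ residuals)           # w^{t+1} = w^t - eps * dE/dw
    return w, errors

# Placeholder two-class data: class 0 near the origin, class 1 shifted away from it.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((100, 1)),                     # dummy input x_0 = 1
               np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                          rng.normal(3.0, 1.0, (50, 2))])])
c = np.concatenate([np.zeros(50), np.ones(50)])

w, errors = train(X, c, epsilon=0.1 / len(c), iterations=15)
print(errors[0], errors[-1])   # the error decreases across iterations
```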



Simulation: learning the decision boundary

- Learning converges onto the solution that minimizes the error.
- For linear networks, this is guaranteed to converge to the minimum.
- It is also possible to derive a closed-form solution (covered later).

[Figure: the converged decision boundary in the (x1, x2) plane and the corresponding learning curve (error vs. iteration).]

Learning is slow when epsilon is too small

Here, larger step sizes would converge more quickly to the minimum.

[Figure: error as a function of w, with many small gradient steps creeping toward the minimum.]


Divergence when epsilon is too large

If the step size is too large, learning can oscillate between different sides of the minimum.

[Figure: error as a function of w, with gradient steps overshooting back and forth across the minimum.]

Multi-layer networks

Can we extend our network to multiple layers? We have:

  y_j = \sum_i w_{i,j} x_i

  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i

Or in matrix form:

  z = V y = V W x

Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW. It is not more powerful.

How do we address this?

[Figure: network diagram with input x, first-layer weights W producing y, and second-layer weights V producing z.]
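
A minimal numeric check of this equivalence (the shapes are illustrative):

```python
import numpy as np

# A two-layer linear network z = V(Wx) collapses to a single linear map U = VW.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # first-layer weights: 4 inputs -> 3 hidden units
V = rng.normal(size=(2, 3))    # second-layer weights: 3 hidden units -> 2 outputs
x = rng.normal(size=4)

z_two_layer = V @ (W @ x)      # forward pass through both layers
z_one_layer = (V @ W) @ x      # single layer with U = VW

print(np.allclose(z_two_layer, z_one_layer))   # True: no extra power from the second layer
```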


Non-linear neural networks

Idea: introduce a non-linearity:

  y_j = f(\sum_i w_{i,j} x_i)

  z_k = f(\sum_j v_{j,k} y_j) = f(\sum_j v_{j,k} f(\sum_i w_{i,j} x_i))

Now, multiple layers are not equivalent.

Common nonlinearities:
- threshold: f(x) = 0 if x < 0, 1 if x >= 0
- sigmoid: y = \frac{1}{1 + \exp(-x)}

[Figure: plots of the threshold and sigmoid functions.]
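
Both nonlinearities are one-liners; a small sketch:

```python
import numpy as np

def threshold(x):
    """f(x) = 0 for x < 0, 1 for x >= 0."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(threshold(x))              # steps from 0 to 1 at x = 0
print(np.round(sigmoid(x), 3))   # smooth transition from 0 to 1
```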

Modeling logical operators

A one-layer binary-threshold network can implement the logical operators AND and OR, but not XOR. Why not?

  y_j = f(\sum_i w_{i,j} x_i), with the threshold nonlinearity f(x) = 0 if x < 0, 1 if x >= 0.

[Figure: the four input points in the (x1, x2) plane labeled for x1 AND x2, x1 OR x2, and x1 XOR x2; the first two are linearly separable, XOR is not.]
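
A sketch of AND and OR as single threshold units; the weights below are one illustrative choice (not from the slides). No choice of weights works for XOR, because its positive examples are not linearly separable:

```python
import numpy as np

def threshold_unit(x, w, b):
    """Single binary-threshold unit: 1 if w^T x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Illustrative weights: AND fires only when both inputs are 1, OR when at least one is.
for x in inputs:
    x = np.array(x, dtype=float)
    print(x,
          threshold_unit(x, np.array([1.0, 1.0]), -1.5),   # AND
          threshold_unit(x, np.array([1.0, 1.0]), -0.5))   # OR
# XOR would need (0,1) and (1,0) on one side of a line and (0,0), (1,1) on the other,
# which no single line can do.
```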



    Posterior odds interpretation of a sigmoid

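
The body of this slide is not reproduced in the transcript; as a hedged sketch, the standard result behind the title writes the two-class posterior as a logistic sigmoid of the log posterior odds:

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)}
  = \frac{1}{1 + \exp(-a)} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}.
```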

The general classification/regression problem

Data: D = x_1, ..., x_T, where x_i = {x_1, ..., x_N}_i
- the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous)

Desired output: y_1, ..., y_K
- for classification: y_i = 1 if x is in class C_i, 0 otherwise
- regression for arbitrary y

Model: \theta_1, ..., \theta_M
- the model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters

Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.


A general multi-layer neural network

The error function is defined as before, where we use the target vector t_n to define the desired output for network output y_n:

  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W_{1:L}) - t_n)^2

The forward pass computes the outputs at each layer:

  y_j^l = f(\sum_i w_{i,j}^l y_i^{l-1}),  l = 1, ..., L

with input x = y^0 and output y^L.

[Figure: multi-layer network diagram with input units x_1 ... x_M (plus x_0 = 1), two hidden layers each with a bias unit y_0 = 1, and output units y_1 ... y_K, labeled input, layer 1, layer 2, output.]
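
A minimal sketch of this forward pass (layer sizes and weights are illustrative, with the bias folded in as a leading column of each weight matrix):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """Forward pass: y^l = f(W^l [1; y^(l-1)]), with y^0 = x and a bias unit at each layer."""
    y = x
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))   # prepend the bias unit
    return y

# Illustrative layer sizes: 3 inputs -> 4 hidden -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3 + 1)), rng.normal(size=(2, 4 + 1))]
x = rng.normal(size=3)
print(forward(x, weights))   # the network output y^L
```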

Deriving the gradient for a sigmoid neural network

The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.

  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W_{1:L}) - t_n)^2

  W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}

Convenient fact for the sigmoid non-linearity:

  \frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x)(1 - \sigma(x))

The backward pass computes the gradients: back-propagation.

New problem: local minima.

[Figure: the same multi-layer network diagram, with the error gradients propagated backward from the output toward the input.]
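
A compact sketch of the backward pass for a single-hidden-layer sigmoid network trained on one placeholder pattern, using the sigmoid-derivative fact above (sizes, data, and step size are assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))   # hidden layer: 4 units, 2 inputs + bias
W2 = rng.normal(scale=0.5, size=(1, 5))   # output layer: 1 unit, 4 hidden units + bias
x = np.array([0.2, -0.7])                 # placeholder input
t = np.array([1.0])                       # placeholder target
eps = 0.1                                 # assumed step size

errors = []
for _ in range(100):
    # Forward pass (bias units prepended as 1).
    x_aug = np.concatenate(([1.0], x))
    h = sigmoid(W1 @ x_aug)
    h_aug = np.concatenate(([1.0], h))
    y = sigmoid(W2 @ h_aug)
    errors.append(float(0.5 * np.sum((y - t) ** 2)))

    # Backward pass, using d(sigma)/dx = sigma * (1 - sigma).
    delta_out = (y - t) * y * (1.0 - y)                     # error signal at the output
    delta_hid = (W2[:, 1:].T @ delta_out) * h * (1.0 - h)   # propagated to the hidden layer

    # Gradient descent updates: W^{t+1} = W^t - eps * dE/dW.
    W2 -= eps * np.outer(delta_out, h_aug)
    W1 -= eps * np.outer(delta_hid, x_aug)

print(errors[0], errors[-1])   # the error decreases as training proceeds
```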


Applications: Driving (output is analog: steering direction)

- Network with 1 layer (4 hidden units).
- Learns to drive on roads.
- Demonstrated at highway speeds over 100s of miles.

D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.

Real image input is augmented to avoid overfitting

- Training data: images + the corresponding steering angle.
- Important: conditioning of the training data to generate new examples avoids overfitting.


Hand-written digits: LeNet

- Takes as input an image of a handwritten digit.
- Each pixel is an input unit.
- Complex network with many layers.
- Output is the digit class.
- Tested on a large (50,000+) database of handwritten samples.
- Real-time; used commercially.

LeNet

Very low error rate.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
http://yann.lecun.com/exdb/lenet/


Object recognition

LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
http://www.cs.nyu.edu/~yann/research/norb/

Summary

Decision boundaries:
- Bayes optimal
- linear discriminant
- linear separability

Classification vs. regression
Optimization by gradient descent
Degeneracy of a multi-layer linear network
Non-linearities: threshold, sigmoid, others?

Issues:
- very general architecture, can solve many problems
- large number of parameters: need to avoid overfitting
- usually requires a large amount of data, or a special architecture
- local minima, training can be slow, need to set the step size