Neural Networks - Slides - CMU - Aarti Singh & Barnabas Poczos


TRANSCRIPT

  • 7/21/2019 Neural Networks - Slides - CMU - Aarti Singh & Barnabas Poczos

    1/36

    Neural Networks

    Aarti Singh & Barnabas Poczos

    Machine Learning 10-701/15-781

    Apr 24, 2014

    Slides Courtesy: Tom Mitchell


    2/36

    Logistic Regression

    Assumes the following functional form for P(Y|X):

    Logistic function (or Sigmoid):
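The formula on this slide did not survive extraction; as a minimal sketch, assuming the standard logistic form P(Y=1|X) = σ(w0 + Σ wi xi) with σ(z) = 1/(1 + e^(-z)):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(x, w, w0):
    """P(Y=1 | X=x) under logistic regression: sigmoid of the linear score."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(score)

# A zero score gives probability 0.5 -- exactly the decision boundary.
print(sigmoid(0.0))  # 0.5
```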

    3/36

    Logistic Regression is a Linear Classifier!

    Assumes the following functional form for P(Y|X):

    Decision Boundary:

    (Linear Decision Boundary)
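A linear decision boundary means the predicted class flips where the linear score crosses zero; a small sketch (the 2-D weights below are purely illustrative):

```python
def predict(x, w, w0):
    """Classify by the sign of the linear score w0 + w . x:
    score > 0  <=>  sigmoid(score) > 0.5  <=>  predict class 1."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0

# Hypothetical 2-D weights: the boundary is the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
print(predict([2.0, 0.5], w, w0))  # 1: above the line
print(predict([0.2, 0.3], w, w0))  # 0: below the line
```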

    4/36

    Training Logistic Regression

    How to learn the parameters w0 … wd?

    Training Data

    Maximum (Conditional) Likelihood Estimates

    Discriminative

    5/36

    Optimizing convex function

    Max conditional log-likelihood = Min negative conditional log-likelihood

    Negative conditional log-likelihood is a convex function

    Gradient Descent (convex)

    Gradient:

    Learning rate, η > 0

    Update rule:
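The gradient and update rule are images in the original deck; a sketch of one descent step on the negative conditional log-likelihood, assuming the standard gradient Σ_l x_i^l (P(Y=1|x^l) - y^l) and an illustrative step size:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, data, eta=0.1):
    """One gradient-descent step on the negative conditional log-likelihood.
    data: list of (x, y) with x a feature list and y in {0, 1}.
    Gradient w.r.t. w_i is sum_l x_i^l * (P(Y=1|x^l) - y^l)."""
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += xi * (p - y)
    return [wi - eta * g for wi, g in zip(w, grad)]

# Toy 1-D data, bias folded in as a constant feature of 1.0.
data = [([1.0, -2.0], 0), ([1.0, 2.0], 1)]
w = [0.0, 0.0]
for _ in range(100):
    w = gradient_step(w, data)
```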

    6/36

    Logistic function as a Graph

    Sigmoid Unit

    [Diagram: sigmoid unit with inputs x1 … xd]

    7/36

    Neural Networks to learn f: X → Y

    f can be a non-linear function
    X (vector of) continuous and/or discrete variables
    Y (vector of) continuous and/or discrete variables
    Neural networks - …

    8/36

    Neural Network trained to distinguish vowel sounds using 2 formants (features)

    Highly non-linear decision surface

    Two layers of logistic units

    Input layer

    Hidden layer

    Output layer

    9/36

    Neural Network trained to drive a car!

    Weights of each pixel for one hidden unit

    Weights to output units from the hidden unit

    10/36

    11/36

    Prediction using Neural Networks

    Prediction: Given a neural network (hidden units and weights), use it to predict the label of a test point

    Forward Propagation

    Start from the input layer
    For each subsequent layer, compute the output of each sigmoid unit

    Sigmoid unit:

    1-hidden-layer, 1-output NN:
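Forward propagation for a 1-hidden-layer, 1-output network can be sketched as follows (the weight layout and names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward propagation through a 1-hidden-layer, 1-output network.
    W_hidden: one weight list per hidden unit (bias folded in at index 0).
    w_out: weights from hidden units to the single output (bias at index 0)."""
    # Hidden-unit outputs o_h = sigmoid(w_h . [1, x]).
    inputs = [1.0] + list(x)
    o_hidden = [sigmoid(sum(w * v for w, v in zip(wh, inputs))) for wh in W_hidden]
    # The output unit sees [1, o_hidden].
    return sigmoid(sum(w * v for w, v in zip(w_out, [1.0] + o_hidden)))

# Two hidden units, two inputs; all-zero weights give output sigmoid(0) = 0.5.
W_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_out = [0.0, 0.0, 0.0]
print(forward([1.0, -1.0], W_hidden, w_out))  # 0.5
```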

    12/36

    M(C)LE Training for Neural Networks

    Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε,  assume noise ε ~ N(0, σε), iid

    (f(x) deterministic)

    Learned neural network

    Let's maximize the conditional data likelihood

    Train weights of all units to minimize sum of squared errors
    of predicted network outputs

    13/36

    MAP Training for Neural Networks

    Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε,  noise ε ~ N(0, σε)

    (f(x) deterministic)

    Gaussian prior: P(W) = N(0, σI)

    ln P(W) ∝ −c Σi wi²

    Train weights of all units to minimize sum of squared errors
    of predicted network outputs plus weight magnitudes
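The penalized objective described here, squared error plus a weight-magnitude term, can be sketched as follows (the constant c and the toy predictor are illustrative):

```python
def map_objective(data, predict, w, c=0.01):
    """MAP training objective: sum of squared errors plus weight magnitudes.
    Minimizing this corresponds to the Gaussian prior ln P(W) ~ -c * sum w_i^2."""
    sse = sum((y - predict(x, w)) ** 2 for x, y in data)
    penalty = c * sum(wi ** 2 for wi in w)
    return sse + penalty

# Toy linear predictor, just to exercise the objective.
predict = lambda x, w: sum(wi * xi for wi, xi in zip(w, x))
data = [([1.0, 2.0], 1.0), ([1.0, -1.0], 0.0)]
print(map_objective(data, predict, [0.0, 0.0]))  # 1.0: pure squared error at w = 0
```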

    14/36

    E[w] = Mean Square Error

    For Neural Networks, E[w] is no longer convex in w

    15/36

    Training Neural Networks

    Differentiable

    16/36

    Error Gradient for a Sigmoid Unit

    Sigmoid Unit

    For a single example (x, y), with unit output o = σ(w · x) and
    error E = ½ (y − o)²:

    ∂E/∂wi = −(y − o) o (1 − o) xi
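The derivation on this slide is lost to extraction, but the standard result for a sigmoid unit with squared error E = ½(y − o)² can be checked numerically; a sketch (the example values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_gradient(x, y, w):
    """Analytic gradient of E = 0.5 * (y - o)^2 for a sigmoid unit:
    dE/dw_i = -(y - o) * o * (1 - o) * x_i."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-(y - o) * o * (1 - o) * xi for xi in x]

def numeric_gradient(x, y, w, eps=1e-6):
    """Central finite-difference check of the same gradient."""
    def E(w):
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        return 0.5 * (y - o) ** 2
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        g.append((E(wp) - E(wm)) / (2 * eps))
    return g

x, y, w = [1.0, -2.0, 0.5], 1.0, [0.3, 0.1, -0.2]
print(unit_gradient(x, y, w))
print(numeric_gradient(x, y, w))  # agrees with the analytic gradient
```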

    17/36

    Using all training data D:

    ∂E/∂wi = −Σ_l (y^l − o^l) o^l (1 − o^l) x_i^l

    18/36

    Backpropagation Algorithm (MLE)

    For each output unit k:    δk = ok (1 − ok) (yk − ok)
    For each hidden unit h:    δh = oh (1 − oh) Σk whk δk
    Update each weight:        wij ← wij + η δj oi

    Using Forward propagation

    yk = target output (label) of output unit k

    ok / oh = unit output (obtained by forward propagation)

    wij = weight from unit i to unit j

    Note: if i is an input variable, oi = xi
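The quantities defined on this slide (yk, ok/oh, wij) are the ingredients of the backpropagation updates; a minimal end-to-end sketch for a 1-hidden-layer, 1-output network, with illustrative weights and no bias terms:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, W_h, w_o, eta=0.5):
    """One backpropagation step for a 1-hidden-layer, 1-output sigmoid network
    (bias terms omitted to keep the sketch short).
    delta_out = o*(1-o)*(y-o);  delta_h = o_h*(1-o_h)*w_ho*delta_out;
    w_ij <- w_ij + eta * delta_j * o_i."""
    # Forward propagation.
    o_h = [sigmoid(sum(w * xi for w, xi in zip(wh, x))) for wh in W_h]
    o = sigmoid(sum(w * oh for w, oh in zip(w_o, o_h)))
    # Backward pass (deltas use the pre-update weights).
    d_out = o * (1 - o) * (y - o)
    d_h = [oh * (1 - oh) * wo * d_out for oh, wo in zip(o_h, w_o)]
    # Weight updates.
    w_o = [wo + eta * d_out * oh for wo, oh in zip(w_o, o_h)]
    W_h = [[w + eta * dh * xi for w, xi in zip(wh, x)] for wh, dh in zip(W_h, d_h)]
    return W_h, w_o

# Drive the output toward y = 1 on a single example.
x, y = [1.0, 0.5], 1.0
W_h, w_o = [[0.1, -0.1], [0.2, 0.1]], [0.1, -0.2]
for _ in range(200):
    W_h, w_o = backprop_step(x, y, W_h, w_o)
```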

    19/36

    Objective/Error no longer convex in weights

    20/36

    Dealing with Overfitting

    Our learning algorithm involves a parameter:
    n = number of gradient descent iterations

    How do we choose n to optimize future error?
    e.g. the n that minimizes error rate of neural net over future data

    (note: similar issue for logistic regression, decision trees, …)

    21/36

    Dealing with Overfitting

    Our learning algorithm involves a parameter:
    n = number of gradient descent iterations

    How do we choose n to optimize future error?

    Separate available data into training and validation set
    Use training set to perform gradient descent
    n ← number of iterations that optimizes validation set error
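The validation-set recipe can be sketched as an early-stopping loop; `train_step` and `val_error` below are placeholders standing in for one gradient-descent step and validation-set evaluation:

```python
def choose_n_iterations(train_step, val_error, w0, max_iters=100):
    """Early stopping: run training, track validation error per iteration,
    and return the iteration count n with the lowest validation error.
    train_step: w -> w after one step (placeholder).
    val_error: w -> validation-set error (placeholder)."""
    w = w0
    best_n, best_err = 0, val_error(w)
    for n in range(1, max_iters + 1):
        w = train_step(w)
        err = val_error(w)
        if err < best_err:
            best_n, best_err = n, err
    return best_n

# Toy example: the "validation error" is minimized after exactly 7 steps.
step = lambda w: w + 1
err = lambda w: abs(w - 7)
print(choose_n_iterations(step, err, 0))  # 7
```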

    22/36

    K-Fold Cross-validation

    Idea: train multiple times, leaving out a disjoint subset of data each time
    for test. Average the test set accuracies.

    ________________________________________________

    Partition data into K disjoint subsets
    For k = 1 to K
        testData = kth subset
        h ← classifier trained* on all data except for testData
        accuracy(k) = accuracy of h on testData
    end
    FinalAccuracy = mean of the K recorded test set accuracies

    * might withhold some of this to choose number of gradient descent steps
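The pseudocode maps directly onto runnable code; a sketch with a placeholder majority-vote trainer (the data and trainer are illustrative):

```python
def k_fold_accuracy(data, train, K):
    """K-fold cross-validation following the pseudocode above.
    data: list of (x, y); train: list of (x, y) -> classifier h, with h(x) -> label."""
    folds = [data[k::K] for k in range(K)]  # K disjoint subsets
    accuracies = []
    for k in range(K):
        test_data = folds[k]
        train_data = [ex for j, f in enumerate(folds) if j != k for ex in f]
        h = train(train_data)
        correct = sum(1 for x, y in test_data if h(x) == y)
        accuracies.append(correct / len(test_data))
    return sum(accuracies) / K

# Placeholder "trainer": always predict the majority label of the training set.
def majority_trainer(train_data):
    labels = [y for _, y in train_data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, 1) for i in range(8)] + [(i, 0) for i in range(2)]
print(k_fold_accuracy(data, majority_trainer, K=5))  # 0.8
```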

    23/36

    Leave-one-out Cross-validation

    This is just k-fold cross-validation leaving out one example each iteration

    ________________________________________________

    Partition data into K disjoint subsets, each containing one example
    For k = 1 to K
        testData = kth subset
        h ← classifier trained* on all data except for testData
        accuracy(k) = accuracy of h on testData
    end
    FinalAccuracy = mean of the K recorded test set accuracies

    * might withhold some of this to choose number of gradient descent steps

    24/36

    Dealing with Overfitting

    Cross-validation

    Regularization - small weights imply NN is linear (low VC dimension)

    Control number of hidden units - low complexity

    [Figure: logistic output of Σ wi xi]

    25/36

    26/36

    27/36

    28/36

    29/36

    30/36

    31/36

    [Figure: learned weights w0; output-unit labels left, straight, right, up]

    32/36

    Semantic Memory Model Based on ANNs

    [McClelland & Rogers, Nature 2003]

    No hierarchy given.

    Train with assertions,
    e.g., Can(Canary, Fly)

    33/36

    Humans act as though they have a hierarchical memory organization

    1. Victims of Semantic Dementia progressively lose knowledge of objects.
    But they lose specific details first, general properties later, suggesting
    hierarchical memory organization

    [Hierarchy: Thing → Living / NonLiving; Living → Animal / Plant;
    Animal → Bird / Fish; Bird → Canary]

    2. Children appear to learn general categories and properties first,
    following the same hierarchy, top down*.

    * some debate remains on this.

    34/36

    Memory deterioration follows semantic hierarchy
    [McClelland & Rogers, Nature 2003]

    35/36

    36/36

    Artificial Neural Networks: Summary

    Actively used to model distributed com…