Neural Networks - Slides - CMU - Aarti Singh & Barnabas Poczos


TRANSCRIPT

  • 7/21/2019 Neural Networks - Slides - CMU - Aarti Singh & Barnabas Poczos

    1/36

    Neural Networks

    Aarti Singh & Barnabas Poczos

    Machine Learning 10-701/15-781

    Apr 24, 2014

    Slides Courtesy: Tom Mitchell


    2/36

    Logistic Regression

    Assumes the following functional form for P(Y|X):

    Logistic function (or Sigmoid):
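The formula on this slide did not survive extraction; as a minimal sketch, assuming the standard logistic form P(Y=1|X) = σ(w0 + Σ wi xi) with σ(z) = 1/(1 + e^(-z)):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(x, w, w0):
    """P(Y=1 | X=x) under logistic regression: sigmoid of the linear score."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(score)

# A zero score gives probability 0.5 -- exactly the decision boundary.
print(sigmoid(0.0))  # 0.5
```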

    3/36

    Logistic Regression is a Linear Classifier!

    Assumes the following functional form for P(Y|X):

    Decision Boundary:

    (Linear Decision Boundary)
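A linear decision boundary means the predicted class flips where the linear score crosses zero; a small sketch (the 2-D weights below are purely illustrative):

```python
def predict(x, w, w0):
    """Classify by the sign of the linear score w0 + w . x:
    score > 0  <=>  sigmoid(score) > 0.5  <=>  predict class 1."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0

# Hypothetical 2-D weights: the boundary is the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
print(predict([2.0, 0.5], w, w0))  # 1: above the line
print(predict([0.2, 0.3], w, w0))  # 0: below the line
```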

    4/36

    Training Logistic Regression

    How to learn the parameters w0 … wd?

    Training Data

    Maximum (Conditional) Likelihood Estimates

    Discriminative

    5/36

    Optimizing convex function

    Max conditional log-likelihood = Min negative conditional log-likelihood

    Negative conditional log-likelihood is a convex function

    Gradient Descent (convex)

    Gradient:

    Learning rate, η > 0

    Update rule:
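The gradient and update rule are images in the original deck; a sketch of one descent step on the negative conditional log-likelihood, assuming the standard gradient Σ_l x_i^l (P(Y=1|x^l) - y^l) and an illustrative step size:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, data, eta=0.1):
    """One gradient-descent step on the negative conditional log-likelihood.
    data: list of (x, y) with x a feature list and y in {0, 1}.
    Gradient w.r.t. w_i is sum_l x_i^l * (P(Y=1|x^l) - y^l)."""
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += xi * (p - y)
    return [wi - eta * g for wi, g in zip(w, grad)]

# Toy 1-D data, bias folded in as a constant feature of 1.0.
data = [([1.0, -2.0], 0), ([1.0, 2.0], 1)]
w = [0.0, 0.0]
for _ in range(100):
    w = gradient_step(w, data)
```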

    6/36

    Logistic function as a Graph

    Sigmoid Unit

    [Diagram: sigmoid unit with inputs x1 … xd]

    7/36

    Neural Networks to learn f: X → Y

    f can be a non-linear function
    X (vector of) continuous and/or discrete variables
    Y (vector of) continuous and/or discrete variables
    Neural networks - …

    8/36

    Neural Network trained to distinguish vowel sounds using 2 formants (features)

    Highly non-linear decision surface

    Two layers of logistic units

    Input layer

    Hidden layer

    Output layer

    9/36

    Neural Network trained to drive a car!

    Weights of each pixel for one hidden unit

    Weights to output units from the hidden unit

    10/36

    11/36

    Prediction using Neural Networks

    Prediction: Given a neural network (hidden units and weights), use it to predict the label of a test point

    Forward Propagation

    Start from the input layer
    For each subsequent layer, compute the output of each sigmoid unit

    Sigmoid unit:

    1-hidden-layer, 1-output NN:
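Forward propagation for a 1-hidden-layer, 1-output network can be sketched as follows (the weight layout and names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward propagation through a 1-hidden-layer, 1-output network.
    W_hidden: one weight list per hidden unit (bias folded in at index 0).
    w_out: weights from hidden units to the single output (bias at index 0)."""
    # Hidden-unit outputs o_h = sigmoid(w_h . [1, x]).
    inputs = [1.0] + list(x)
    o_hidden = [sigmoid(sum(w * v for w, v in zip(wh, inputs))) for wh in W_hidden]
    # The output unit sees [1, o_hidden].
    return sigmoid(sum(w * v for w, v in zip(w_out, [1.0] + o_hidden)))

# Two hidden units, two inputs; all-zero weights give output sigmoid(0) = 0.5.
W_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_out = [0.0, 0.0, 0.0]
print(forward([1.0, -1.0], W_hidden, w_out))  # 0.5
```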

    12/36

    M(C)LE Training for Neural Networks

    Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε,  assume noise ε ~ N(0, σε), iid

    (f(x) deterministic)

    Learned neural network

    Let's maximize the conditional data likelihood

    Train weights of all units to minimize sum of squared errors
    of predicted network outputs

    13/36

    MAP Training for Neural Networks

    Consider regression problem f: X → Y, for scalar Y:
    y = f(x) + ε,  noise ε ~ N(0, σε)

    (f(x) deterministic)

    Gaussian prior: P(W) = N(0, σI)

    ln P(W) ∝ −c Σi wi²

    Train weights of all units to minimize sum of squared errors
    of predicted network outputs plus weight magnitudes
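The penalized objective described here, squared error plus a weight-magnitude term, can be sketched as follows (the constant c and the toy predictor are illustrative):

```python
def map_objective(data, predict, w, c=0.01):
    """MAP training objective: sum of squared errors plus weight magnitudes.
    Minimizing this corresponds to the Gaussian prior ln P(W) ~ -c * sum w_i^2."""
    sse = sum((y - predict(x, w)) ** 2 for x, y in data)
    penalty = c * sum(wi ** 2 for wi in w)
    return sse + penalty

# Toy linear predictor, just to exercise the objective.
predict = lambda x, w: sum(wi * xi for wi, xi in zip(w, x))
data = [([1.0, 2.0], 1.0), ([1.0, -1.0], 0.0)]
print(map_objective(data, predict, [0.0, 0.0]))  # 1.0: pure squared error at w = 0
```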

    14/36

    E[w] = Mean Square Error

    For Neural Networks, E[w] is no longer convex in w

    15/36

    Training Neural Networks

    Differentiable

    16/36

    Error Gradient for a Sigmoid Unit

    Sigmoid Unit

    For a single example (x, y), with unit output o = σ(w · x) and
    error E = ½ (y − o)²:

    ∂E/∂wi = −(y − o) o (1 − o) xi
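The derivation on this slide is lost to extraction, but the standard result for a sigmoid unit with squared error E = ½(y − o)² can be checked numerically; a sketch (the example values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_gradient(x, y, w):
    """Analytic gradient of E = 0.5 * (y - o)^2 for a sigmoid unit:
    dE/dw_i = -(y - o) * o * (1 - o) * x_i."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-(y - o) * o * (1 - o) * xi for xi in x]

def numeric_gradient(x, y, w, eps=1e-6):
    """Central finite-difference check of the same gradient."""
    def E(w):
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        return 0.5 * (y - o) ** 2
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        g.append((E(wp) - E(wm)) / (2 * eps))
    return g

x, y, w = [1.0, -2.0, 0.5], 1.0, [0.3, 0.1, -0.2]
print(unit_gradient(x, y, w))
print(numeric_gradient(x, y, w))  # agrees with the analytic gradient
```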

    17/36

    Using all training data D:

    ∂E/∂wi = −Σ_l (y^l − o^l) o^l (1 − o^l) x_i^l

    18/36

    Backpropagation Algorithm (MLE)

    For each output unit k:    δk = ok (1 − ok) (yk − ok)
    For each hidden unit h:    δh = oh (1 − oh) Σk whk δk
    Update each weight:        wij ← wij + η δj oi

    Using Forward propagation

    yk = target output (label) of output unit k

    ok / oh = unit output (obtained by forward propagation)

    wij = weight from unit i to unit j

    Note: if i is an input variable, oi = xi
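The quantities defined on this slide (yk, ok/oh, wij) are the ingredients of the backpropagation updates; a minimal end-to-end sketch for a 1-hidden-layer, 1-output network, with illustrative weights and no bias terms:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, W_h, w_o, eta=0.5):
    """One backpropagation step for a 1-hidden-layer, 1-output sigmoid network
    (bias terms omitted to keep the sketch short).
    delta_out = o*(1-o)*(y-o);  delta_h = o_h*(1-o_h)*w_ho*delta_out;
    w_ij <- w_ij + eta * delta_j * o_i."""
    # Forward propagation.
    o_h = [sigmoid(sum(w * xi for w, xi in zip(wh, x))) for wh in W_h]
    o = sigmoid(sum(w * oh for w, oh in zip(w_o, o_h)))
    # Backward pass (deltas use the pre-update weights).
    d_out = o * (1 - o) * (y - o)
    d_h = [oh * (1 - oh) * wo * d_out for oh, wo in zip(o_h, w_o)]
    # Weight updates.
    w_o = [wo + eta * d_out * oh for wo, oh in zip(w_o, o_h)]
    W_h = [[w + eta * dh * xi for w, xi in zip(wh, x)] for wh, dh in zip(W_h, d_h)]
    return W_h, w_o

# Drive the output toward y = 1 on a single example.
x, y = [1.0, 0.5], 1.0
W_h, w_o = [[0.1, -0.1], [0.2, 0.1]], [0.1, -0.2]
for _ in range(200):
    W_h, w_o = backprop_step(x, y, W_h, w_o)
```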

    19/36

    Objective/Error no longer convex in weights

    20/36

    Dealing with Overfitting

    Our learning algorithm involves a parameter:
    n = number of gradient descent iterations

    How do we choose n to optimize future error?
    e.g. the n that minimizes error rate of neural net over future data

    (note: similar issue for logistic regression, decision trees, …)

    21/36

    Dealing with Overfitting

    Our learning algorithm involves a parameter:
    n = number of gradient descent iterations

    How do we choose n to optimize future error?

    Separate available data into training and validation set
    Use training set to perform gradient descent
    n ← number of iterations that optimizes validation set error
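The validation-set recipe can be sketched as an early-stopping loop; `train_step` and `val_error` below are placeholders standing in for one gradient-descent step and validation-set evaluation:

```python
def choose_n_iterations(train_step, val_error, w0, max_iters=100):
    """Early stopping: run training, track validation error per iteration,
    and return the iteration count n with the lowest validation error.
    train_step: w -> w after one step (placeholder).
    val_error: w -> validation-set error (placeholder)."""
    w = w0
    best_n, best_err = 0, val_error(w)
    for n in range(1, max_iters + 1):
        w = train_step(w)
        err = val_error(w)
        if err < best_err:
            best_n, best_err = n, err
    return best_n

# Toy example: the "validation error" is minimized after exactly 7 steps.
step = lambda w: w + 1
err = lambda w: abs(w - 7)
print(choose_n_iterations(step, err, 0))  # 7
```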

    22/36

    K-Fold Cross-validation

    Idea: train multiple times, leaving out a disjoint subset of data each time
    for test. Average the test set accuracies.

    ________________________________________________

    Partition data into K disjoint subsets
    For k = 1 to K
        testData = kth subset
        h ← classifier trained* on all data except for testData
        accuracy(k) = accuracy of h on testData
    end
    FinalAccuracy = mean of the K recorded test set accuracies

    * might withhold some of this to choose number of gradient descent steps
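The pseudocode maps directly onto runnable code; a sketch with a placeholder majority-vote trainer (the data and trainer are illustrative):

```python
def k_fold_accuracy(data, train, K):
    """K-fold cross-validation following the pseudocode above.
    data: list of (x, y); train: list of (x, y) -> classifier h, with h(x) -> label."""
    folds = [data[k::K] for k in range(K)]  # K disjoint subsets
    accuracies = []
    for k in range(K):
        test_data = folds[k]
        train_data = [ex for j, f in enumerate(folds) if j != k for ex in f]
        h = train(train_data)
        correct = sum(1 for x, y in test_data if h(x) == y)
        accuracies.append(correct / len(test_data))
    return sum(accuracies) / K

# Placeholder "trainer": always predict the majority label of the training set.
def majority_trainer(train_data):
    labels = [y for _, y in train_data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, 1) for i in range(8)] + [(i, 0) for i in range(2)]
print(k_fold_accuracy(data, majority_trainer, K=5))  # 0.8
```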

    23/36

    Leave-one-out Cross-validation

    This is just k-fold cross-validation leaving out one example each iteration

    ________________________________________________

    Partition data into K disjoint subsets, each containing one example
    For k = 1 to K
        testData = kth subset
        h ← classifier trained* on all data except for testData
        accuracy(k) = accuracy of h on testData
    end
    FinalAccuracy = mean of the K recorded test set accuracies

    * might withhold some of this to choose number of gradient descent steps

    24/36

    Dealing with Overfitting

    Cross-validation

    Regularization - small weights imply NN is linear (low VC dimension)

    Control number of hidden units - low complexity

    [Figure: logistic output of Σ wi xi]

    25/36

    26/36

    27/36

    28/36

    29/36

    30/36

    31/36

    [Figure: learned weights w0; output-unit labels left, straight, right, up]

    32/36

    Semantic Memory Model Based on ANNs

    [McClelland & Rogers, Nature 2003]

    No hierarchy given.

    Train with assertions,
    e.g., Can(Canary, Fly)

    33/36

    Humans act as though they have a hierarchical memory organization

    1. Victims of Semantic Dementia progressively lose knowledge of objects.
    But they lose specific details first, general properties later, suggesting
    hierarchical memory organization

    [Hierarchy: Thing → Living / NonLiving; Living → Animal / Plant;
    Animal → Bird / Fish; Bird → Canary]

    2. Children appear to learn general categories and properties first,
    following the same hierarchy, top down*.

    * some debate remains on this.

    34/36

    Memory deterioration follows semantic hierarchy
    [McClelland & Rogers, Nature 2003]

    35/36

    36/36

    Artificial Neural Networks: Summary

    Actively used to model distributed com…