Neural Networks - Slides - CMU - Aarti Singh & Barnabas Poczos
TRANSCRIPT
-
Neural Networks
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781
Apr 24, 2014
Slides Courtesy: Tom Mitchell
-
Logistic Regression
Assumes the following functional form for P(Y|X):
P(Y = 1 | X) = 1 / (1 + exp(-(w0 + Σi wi Xi)))
Logistic function (or Sigmoid): σ(z) = 1 / (1 + e^(-z))
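For concreteness, a minimal NumPy sketch of this functional form (not from the slides; the weights and input below are made-up values):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w):
    """P(Y = 1 | X = x) under the logistic regression model."""
    return sigmoid(w0 + np.dot(w, x))

# Made-up weights and a 2-feature input
w0, w = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(p_y1_given_x(x, w0, w))  # P(Y = 1 | x); P(Y = 0 | x) is 1 minus this
```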
-
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):
Decision Boundary: predict Y = 1 when P(Y = 1 | X) ≥ P(Y = 0 | X), i.e. when w0 + Σi wi Xi ≥ 0
(Linear Decision Boundary)
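To see why this boundary is linear, a short standard derivation (added here for clarity): predict Y = 1 exactly when the log-odds are non-negative, and under the assumed functional form the log-odds are linear in X:

```latex
% Predict Y = 1 iff P(Y=1|X) >= P(Y=0|X), i.e. iff the log-odds are >= 0:
\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} \ge 1
\iff
\ln\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = w_0 + \sum_{i=1}^{d} w_i X_i \ge 0
```

The set where w0 + Σi wi Xi = 0 is a hyperplane in feature space, hence the linear decision boundary.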
-
Training Logistic Regression
How to learn the parameters w0, w1, …, wd?
Training Data: {(x(j), y(j))}, j = 1, …, n
Maximum (Conditional) Likelihood Estimate: ŵ = argmaxw Πj P(y(j) | x(j), w)
Discriminative
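As an illustrative sketch (the training set below is hypothetical), the conditional log-likelihood that this estimate maximizes can be computed as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(w0, w, X, y):
    """Sum over training points j of ln P(y(j) | x(j), w); labels are 0/1."""
    p = sigmoid(w0 + X @ w)  # P(Y = 1 | x(j)) for every row of X
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical training set: 4 points, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, 0.3], [2.0, 2.0]])
y = np.array([1, 1, 0, 1])
print(conditional_log_likelihood(-0.5, np.array([1.0, 0.5]), X, y))
```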
-
Optimizing convex function
Max conditional log-likelihood = Min negative conditional log-likelihood
Negative conditional log-likelihood is a convex function.
Gradient Descent (convex)
Gradient: ∇E(w) = [∂E/∂w0, …, ∂E/∂wd]
Update rule: w(t+1) = w(t) − η ∇E(w(t)), with learning rate η > 0
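A minimal sketch of this procedure for logistic regression (assumptions: batch gradients, a fixed learning rate eta, and the bias w0 folded into w via a constant feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize the negative conditional log-likelihood by gradient descent.

    X: (n, d) features with a leading column of 1s for the bias w0.
    y: (n,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)    # predicted P(Y = 1 | x) for each point
        grad = X.T @ (p - y)  # gradient of the negative log-likelihood
        w = w - eta * grad    # update rule: step against the gradient
    return w

# Toy usage with made-up data; first column is the bias feature
X = np.array([[1, 0.2], [1, 1.5], [1, -0.7], [1, 2.3]])
y = np.array([0, 1, 0, 1])
print(gradient_descent(X, y, eta=0.1, n_iters=500))
```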
-
Logistic function as a Graph
Sigmoid Unit
[Figure: sigmoid unit with inputs x1, …, xd, weights w0, …, wd, and output o = σ(w0 + Σi wi xi)]
-
Neural Networks to learn f: X → Y
f can be a non-linear function
X: (vector of) continuous and/or discrete variables
Y: (vector of) continuous and/or discrete variables
Neural networks: represent f by a network of logistic units
-
Neural Network trained to distinguish vowel sounds using 2 formants (features)
Highly non-linear decision surface; two layers of logistic units:
Input layer
Hidden layer
Output layer
-
Neural Network trained to drive a car!
Weights of each pixel for one hidden unit
Weights to output units from the hidden unit
-
Prediction using Neural Networks
Prediction: Given a neural network (hidden units and weights), use it to predict the label of a test point.
Forward Propagation
Start from the input layer. For each subsequent layer, compute the output of each sigmoid unit.
Sigmoid unit: o = σ(w0 + Σi wi xi)
1-hidden-layer, 1-output NN: o = σ(w0 + Σh wh oh), where oh is the output of hidden unit h
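A minimal NumPy sketch of forward propagation for such a 1-hidden-layer, 1-output network (the shapes and weight values below are illustrative; biases are kept separate for readability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward propagation: compute each layer's sigmoid-unit outputs in turn."""
    o_h = sigmoid(W_hidden @ x + b_hidden)  # hidden-unit outputs o_h
    o = sigmoid(w_out @ o_h + b_out)        # final network output
    return o, o_h

# Illustrative network: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 2))
b_hidden = np.zeros(3)
w_out = rng.normal(size=3)
b_out = 0.0

o, o_h = forward(np.array([0.5, -1.0]), W_hidden, b_hidden, w_out, b_out)
print("hidden outputs:", o_h, "prediction:", o)
```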
-
Consider a regression problem f: X → Y, for scalar Y:
y = f(x) + ε, where f is deterministic and the noise ε ~ N(0, σε) is iid
M(C)LE Training for Neural Networks
Learned neural network
Let's maximize the conditional data likelihood:
W ← argmaxW ln Πl P(y(l) | x(l), W) = argminW Σl (y(l) − ŷ(l))²
Train weights of all units to minimize sum of squared errors
of predicted network outputs
-
Consider a regression problem f: X → Y, for scalar Y:
y = f(x) + ε, where f is deterministic and the noise ε ~ N(0, σε)
MAP Training for Neural Networks
Gaussian prior: P(W) = N(0, σI)
ln P(W) ↔ c Σi wi²
Train weights of all units to minimize sum of squared errors
of predicted network outputs plus weight magnitudes
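In code, the resulting MAP objective just adds a weight-magnitude penalty to the squared-error term (a sketch; the trade-off constant c is an assumed hyperparameter):

```python
import numpy as np

def map_objective(y_true, y_pred, weights, c=0.01):
    """Sum of squared errors plus a zero-mean-Gaussian-prior penalty on weights.

    Minimizing this is the MAP criterion above: the ln P(W) term
    contributes the c * sum_i w_i^2 penalty.
    """
    sse = np.sum((y_true - y_pred) ** 2)
    penalty = c * np.sum(weights ** 2)
    return sse + penalty

# Toy usage with made-up predictions and weights
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.3, 0.9])
weights = np.array([0.5, -1.2, 2.0])
print(map_objective(y_true, y_pred, weights, c=0.01))
```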
-
E[w] = Mean Square Error
For Neural Networks, E[w] is no longer convex in w
-
Differentiable
Training Neural Networks
-
Error Gradient for a Sigmoid Unit
E[w] = 1/2 Σl (y^l − o^l)²
∂E/∂wi = ∂/∂wi [ 1/2 Σl (y^l − o^l)² ]
       = −Σl (y^l − o^l) ∂o^l/∂wi
       = −Σl (y^l − o^l) (∂o^l/∂net^l) (∂net^l/∂wi)
where o^l = σ(net^l), net^l = Σi wi xi^l, and ∂σ(net)/∂net = σ(net)(1 − σ(net)), so
∂E/∂wi = −Σl (y^l − o^l) o^l (1 − o^l) xi^l
[Figure: sigmoid unit, as on the earlier slide]
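Not from the slides, but a handy way to verify this gradient is to check it against finite differences on made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, X, y):
    """E[w] = 1/2 * sum_l (y_l - o_l)^2 for a single sigmoid unit."""
    o = sigmoid(X @ w)
    return 0.5 * np.sum((y - o) ** 2)

def analytic_grad(w, X, y):
    """dE/dw_i = -sum_l (y_l - o_l) * o_l * (1 - o_l) * x_i^l."""
    o = sigmoid(X @ w)
    return -X.T @ ((y - o) * o * (1 - o))

# Compare against a numerical gradient on made-up data
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, -0.3])
eps = 1e-6
num_grad = np.array([
    (error(w + eps * e, X, y) - error(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(w))
])
print(analytic_grad(w, X, y), num_grad)  # the two should agree closely
```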
-
Using all training data D:
E_D[w] = 1/2 Σ{l∈D} (y^l − o^l)²
∂E_D/∂wi = −Σ{l∈D} (y^l − o^l) o^l (1 − o^l) xi^l
-
Backpropagation (MLE): repeat, for each training example, using forward propagation to compute unit outputs, then:
For each output unit k: δk = ok (1 − ok) (yk − ok)
For each hidden unit h: δh = oh (1 − oh) Σk whk δk
Update each weight: wij ← wij + η δj oi
yk = target output (label)
ok/h = unit output (obtained by forward propagation)
wij = weight from i to j
Note: if i is an input variable, oi = xi
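A minimal sketch of these updates for one training example (assumptions: a single hidden layer of sigmoid units, squared-error loss, biases omitted, and illustrative shapes and step size):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.5):
    """One stochastic-gradient backpropagation step on a single example.

    W1: hidden-layer weights, shape (n_hidden, n_inputs).
    W2: output-layer weights, shape (n_outputs, n_hidden).
    """
    # Forward propagation
    o_h = sigmoid(W1 @ x)    # hidden-unit outputs
    o_k = sigmoid(W2 @ o_h)  # output-unit outputs

    # Error terms: delta_k for output units, delta_h for hidden units
    delta_k = o_k * (1 - o_k) * (y - o_k)
    delta_h = o_h * (1 - o_h) * (W2.T @ delta_k)

    # Weight updates: w_ij <- w_ij + eta * delta_j * o_i
    W2 += eta * np.outer(delta_k, o_h)
    W1 += eta * np.outer(delta_h, x)
    return W1, W2

# Toy usage: 2 inputs, 3 hidden units, 1 output, made-up example
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(3, 2))
W2 = rng.normal(scale=0.1, size=(1, 3))
W1, W2 = backprop_step(np.array([1.0, -0.5]), np.array([1.0]), W1, W2)
```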
-
Objective/Error no longer convex in weights
-
Our learning algorithm involves a parameter
n=number of gradient descent iterations
How do we choose n to optimize future error?
(note: similar issue for logistic regression, decision trees, …)
e.g. the n that minimizes error rate of neural net over future data
Dealing with Overfitting
-
Our learning algorithm involves a parameter
n=number of gradient descent iterations
How do we choose n to optimize future error?
Separate available data into training and validation sets.
Use the training set to perform gradient descent.
n ← number of iterations that optimizes validation set error
Dealing with Overfitting
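A sketch of this recipe in code (train_step and validation_error are hypothetical callables supplied by the caller, not slide material):

```python
def choose_n_iterations(train_step, validation_error, max_iters=10000):
    """Run gradient descent, tracking validation error; return the best n.

    train_step() performs one gradient descent iteration on training data;
    validation_error() evaluates the current weights on the held-out set.
    Both are assumed to be provided by the caller.
    """
    best_n, best_err = 0, float("inf")
    for n in range(1, max_iters + 1):
        train_step()
        err = validation_error()
        if err < best_err:
            best_n, best_err = n, err
    return best_n
```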
-
Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
________________________________________________
Partition data into K disjoint subsets
For k=1 to K
testData = kth subset
h ← classifier trained* on all data except for testData
accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps
K-fold Cross-validation
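The same procedure as a code sketch (train_fn and accuracy_fn are assumed placeholders for the classifier-specific pieces):

```python
import numpy as np

def k_fold_accuracy(X, y, train_fn, accuracy_fn, K=10):
    """K-fold cross-validation: average test accuracy over K disjoint folds.

    train_fn(X_train, y_train) -> classifier h (supplied by the caller);
    accuracy_fn(h, X_test, y_test) -> accuracy on the held-out fold.
    """
    folds = np.array_split(np.random.permutation(len(y)), K)
    accs = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train_fn(X[train_idx], y[train_idx])
        accs.append(accuracy_fn(h, X[test_idx], y[test_idx]))
    return np.mean(accs)
```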
-
This is just k-fold cross validation leaving out one example each iteration
________________________________________________
Partition data into K disjoint subsets, each containing one example
For k=1 to K
testData = kth subset
h ← classifier trained* on all data except for testData
accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps
Leave-one-out Cross-validation
-
Cross-validation
Regularization: small weights imply the NN is nearly linear (low VC dimension)
Control number of hidden units: low complexity
Dealing with Overfitting
[Figure: logistic unit computing σ(Σi wi xi); with small weights the output stays in the sigmoid's near-linear region]
-
[Figure: learned weights; w0 is the bias weight, with output units for the classes left, straight (strt), right, up]
-
Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions,
e.g., Can(Canary, Fly)
-
Humans act as though they have a hierarchical memory organization
1. Victims of Semantic Dementia progressively lose knowledge of objects. But they lose specific details first and general properties later, suggesting a hierarchical memory organization:
Thing → Living, NonLiving
Living → Animal, Plant
Animal → Bird, Fish
Bird → Canary
2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.
* some debate remains on this.
-
Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]
-
Artificial Neural Networks: Summary
Actively used to model distributed computation…