Carlos Guestrin - Neural Networks
2005-2007 Carlos Guestrin 1
Neural Networks
Machine Learning 10701/15781
Carlos Guestrin
Carnegie Mellon University
October 10th, 2007
2005-2007 Carlos Guestrin 2
Perceptron as a graph
[Figure: sigmoid response curve; horizontal axis from -6 to 6, vertical axis from 0 to 1.]
2005-2007 Carlos Guestrin 3
The perceptron learning rule
Compare to MLE:
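The slide's update equations were images and did not survive extraction. As a minimal sketch of the sigmoid-unit version of the rule (the learning rate value and the 0/1 label encoding are assumptions, not from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_update(w, x, y, eta=0.1):
    """One stochastic-gradient step for a sigmoid unit.

    w: weight vector, x: input vector, y: 0/1 label,
    eta: learning rate (illustrative value).
    """
    y_hat = sigmoid(w @ x)            # current prediction
    return w + eta * (y - y_hat) * x  # nudge weights toward reducing the error
```

Each step moves the weights in proportion to the prediction error on one example.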
2005-2007 Carlos Guestrin 4
Hidden layer
Perceptron:
1-hidden layer:
2005-2007 Carlos Guestrin 5
Example data for NN with hidden layer
2005-2007 Carlos Guestrin 6
Learned weights for hidden layer
2005-2007 Carlos Guestrin 7
NN for images
2005-2007 Carlos Guestrin 8
Weights in NN for images
2005-2007 Carlos Guestrin 9
Gradient descent for 1-hidden layer
Back-propagation: computing the gradient. (Dropped w0 to make the derivation simpler.)
2005-2007 Carlos Guestrin 10
Gradient descent for 1-hidden layer
Back-propagation: computing the gradient. (Dropped w0 to make the derivation simpler.)
2005-2007 Carlos Guestrin 11
Multilayer neural networks
2005-2007 Carlos Guestrin 12
Forward propagation prediction: recursive algorithm
Start from input layer
Output of node V_k with parents U_1, U_2, ...:
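The output formula was an image. A minimal sketch of this recursion, assuming sigmoid response functions and a layer-by-layer representation with the bias w0 folded in as an extra constant input (this representation is an assumption, not the lecture's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(layers, x):
    """Forward propagation through a layered net.

    layers: list of weight matrices, one per layer; each matrix expects
    its input vector with a trailing constant 1 for the bias weight w0.
    """
    v = x
    for W in layers:
        v_aug = np.append(v, 1.0)  # append 1 so the last column acts as w0
        v = sigmoid(W @ v_aug)     # each node: sigmoid of weighted sum of parents
    return v
```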
2005-2007 Carlos Guestrin 13
Back-propagation learning
Just gradient descent!!!
Recursive algorithm for computing gradient
For each example
Perform forward propagation
Start from output layer
Compute gradient of node V_k with parents U_1, U_2, ...
Update weight w_ik (a sketch follows below)
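A minimal sketch of one such pass for the 1-hidden-layer case with squared-error loss; variable names and the learning rate are assumptions, and biases are dropped as on the earlier slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, w2, x, y, eta=0.1):
    """One gradient-descent step for a 1-hidden-layer net.

    W1: (m, d) input-to-hidden weights, w2: (m,) hidden-to-output
    weights, x: (d,) input, y: scalar target.
    """
    # forward propagation
    h = sigmoid(W1 @ x)       # hidden-layer outputs
    y_hat = sigmoid(w2 @ h)   # network output
    # backward: "delta" at the output, then deltas at the hidden nodes
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    d_hid = d_out * w2 * h * (1 - h)
    # gradient-descent weight updates
    w2 = w2 - eta * d_out * h
    W1 = W1 - eta * np.outer(d_hid, x)
    return W1, w2
```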
2005-2007 Carlos Guestrin 14
Many possible response functions (sketched below)
Sigmoid
Linear
Exponential
Gaussian
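A sketch of the four named response functions, where `net` stands for the node's weighted input (the Gaussian's width parameter is an assumed choice):

```python
import numpy as np

# g maps the weighted input net = w . x to the node's output
def g_sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def g_linear(net):
    return net

def g_exponential(net):
    return np.exp(net)

def g_gaussian(net, sigma=1.0):  # width sigma is an assumed parameter
    return np.exp(-net**2 / (2 * sigma**2))
```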
2005-2007 Carlos Guestrin 15
Convergence of backprop
Perceptron leads to convex optimization
Gradient descent reaches global minima
Multilayer neural nets not convex
Gradient descent gets stuck in local minima
Hard to set learning rate
Selecting number of hidden units and layers = fuzzy process
NNs falling in disfavor in last few years
We'll see later in the semester that the kernel trick is a good alternative
Nonetheless, neural nets are one of the most used ML approaches
2005-2007 Carlos Guestrin 16
Overfitting? Neural nets represent complex functions
Output becomes more complex with gradient steps
2005-2007 Carlos Guestrin 17
Overfitting
Output fits training data too well
Poor test set accuracy
Overfitting the training data
Related to bias-variance tradeoff
One of the central problems of ML
Avoiding overfitting?
More training data
Regularization
Early stopping (sketched below)
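A generic sketch of the early-stopping idea; `train_step` and `validation_error` are hypothetical callables standing in for a training pass and a held-out-set evaluation:

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=100, patience=5):
    """Stop training when held-out error stops improving."""
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()              # one pass of gradient descent
        err = validation_error()  # error on held-out data
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:  # no improvement for a while
                break
    return best_epoch
```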
2005-2007 Carlos Guestrin 18
What you need to know about neural networks
Perceptron:
Representation
Perceptron learning rule
Derivation
Multilayer neural nets
Representation
Derivation of backprop
Learning rule
Overfitting
Definition
Training set versus test set
Learning curve
2005-2007 Carlos Guestrin 19
Announcements
Recitation this week: Neural networks
Project proposals due next Wednesday
Exciting data:
Swivel.com: user-generated graphs
Recognizing Captchas
Election contributions
Activity recognition
2005-2007 Carlos Guestrin 20
Instance-based Learning
Machine Learning 10701/15781
Carlos Guestrin
Carnegie Mellon University
October 10th, 2007
2005-2007 Carlos Guestrin 21
Why not just use Linear Regression?
2005-2007 Carlos Guestrin 22
Using data to predict new data
2005-2007 Carlos Guestrin 23
Nearest neighbor
2005-2007 Carlos Guestrin 24
Univariate 1-Nearest Neighbor
Given datapoints $(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)$, where we assume $y_i = f(x_i)$ for some unknown function $f$.
Given a query point $x_q$, your job is to predict $\hat{y} \approx f(x_q)$.
Nearest Neighbor:
1. Find the closest $x_i$ in our set of datapoints: $nn(q) = \arg\min_i |x_i - x_q|$
2. Predict: $\hat{y} = y_{nn(q)}$ (sketched in code below)
[Figure: a dataset with one input, one output, and four datapoints; for each of four query regions, the closest datapoint is marked.]
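A minimal sketch of this procedure; the four datapoint values are illustrative, not read off the slide's figure:

```python
import numpy as np

def nn_predict(X, y, x_q):
    """Univariate 1-NN: find the closest stored x_i, return its y.

    X: array of stored inputs, y: stored outputs, x_q: query point.
    """
    i = np.argmin(np.abs(X - x_q))  # nn(q) = argmin_i |x_i - x_q|
    return y[i]

# A four-point dataset (values illustrative):
X = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 3.0, 2.5])
print(nn_predict(X, y, 4.2))  # -> 3.0, the output of the closest point
```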
2005-2007 Carlos Guestrin 25
1-Nearest Neighbor is an example of...
Instance-based learning
A function approximator that has been around since about 1910. To make a prediction, search the database for similar datapoints, and fit with the local points.
Stored data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Four things make a memory-based learner:
1. A distance metric
2. How many nearby neighbors to look at?
3. A weighting function (optional)
4. How to fit with the local points?
2005-2007 Carlos Guestrin 26
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
Euclidean (and many more)
2. How many nearby neighbors to look at?
One
3. A weighting function (optional)
Unused
4. How to fit with the local points?
Just predict the same output as the nearest neighbor.
2005-2007 Carlos Guestrin 27
Multivariate 1-NN examples
[Figures: a regression example and a classification example.]
2005-2007 Carlos Guestrin 28
Multivariate distance metrics
Suppose the input vectors $x_1, x_2, \ldots, x_N$ are two-dimensional:
$x_1 = (x_{11}, x_{12})$, $x_2 = (x_{21}, x_{22})$, ..., $x_N = (x_{N1}, x_{N2})$.
One can draw the nearest-neighbor regions in input space.
$Dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (3x_{i2} - 3x_{j2})^2$
The relative scalings in the distance metric affect region shapes (a small example follows below).
$Dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2$
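A small example of how the scaling changes which stored point is nearest; the points are illustrative:

```python
import numpy as np

def dist_scaled(xi, xj, s=(1.0, 3.0)):
    """Weighted squared Euclidean distance from the slide:
    (x_i1 - x_j1)^2 + (3 x_i2 - 3 x_j2)^2 when s = (1, 3)."""
    d = np.asarray(s) * (np.asarray(xi) - np.asarray(xj))
    return float(d @ d)

# Scaling the second coordinate by 3 changes which point is nearest:
q, a, b = (0.0, 0.0), (2.0, 0.0), (0.0, 1.0)
print(dist_scaled(q, a, s=(1, 1)), dist_scaled(q, b, s=(1, 1)))  # 4.0, 1.0 -> b nearer
print(dist_scaled(q, a, s=(1, 3)), dist_scaled(q, b, s=(1, 3)))  # 4.0, 9.0 -> a nearer
```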
2005-2007 Carlos Guestrin 29
Euclidean distance metric
$D(x, x')^2 = \sum_i \sigma_i^2 (x_i - x'_i)^2$
Or equivalently,
$D(x, x')^2 = (x - x')^T \Sigma (x - x')$
where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$
Other Metrics:
Mahalanobis, Rank-based, Correlation-based, ...
2005-2007 Carlos Guestrin 30
Notable distance metrics (and their level sets)
L1 norm (absolute)
L∞ (max) norm
Scaled Euclidean (L2)
Mahalanobis (here, the $\Sigma$ on the previous slide is not necessarily diagonal, but is symmetric)
2005-2007 Carlos Guestrin 31
Consistency of 1-NN
Consider an estimator $f_n$ trained on $n$ examples
e.g., 1-NN, neural nets, regression, ...
Estimator is consistent if true error goes to zero as the amount of data increases
e.g., for no-noise data, consistent if: $\lim_{n \to \infty} \text{error}(f_n) = 0$
Regression is not consistent!
Representation bias
1-NN is consistent (under some mild fine print)
What about variance???
2005-2007 Carlos Guestrin 32
1-NN overfits?
2005-2007 Carlos Guestrin 33
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
Euclidean (and many more)
2. How many nearby neighbors to look at?
k
3. A weighting function (optional)
Unused
4. How to fit with the local points?
Just predict the average output among the k nearest neighbors (see the sketch below).
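A minimal sketch of k-NN regression under these four choices:

```python
import numpy as np

def knn_predict(X, y, x_q, k=9):
    """k-NN regression: average the outputs of the k nearest stored points.

    X: (n, d) stored inputs, y: (n,) stored outputs, x_q: (d,) query.
    """
    d2 = np.sum((X - x_q) ** 2, axis=1)  # squared Euclidean distances
    nearest = np.argsort(d2)[:k]         # indices of the k closest points
    return y[nearest].mean()             # unweighted local average
```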
2005-2007 Carlos Guestrin 34
k-Nearest Neighbor (here k=9)
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?
2005-2007 Carlos Guestrin 35
Weighted k-NNs
Neighbors are not all the same
2005-2007 Carlos Guestrin 36
Kernel regression
Four things make a memory-based learner:
1. A distance metric
Euclidean (and many more)
2. How many nearby neighbors to look at?
All of them
3. A weighting function (optional)
$w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$
Nearby points to the query are weighted strongly, far points weakly. The $K_w$ parameter is the kernel width. Very important.
4. How to fit with the local points?
Predict the weighted average of the outputs (sketched below):
$\text{predict} = \sum_i w_i y_i \,/\, \sum_i w_i$
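A minimal sketch of kernel regression with these choices; the kernel width value is illustrative:

```python
import numpy as np

def kernel_regression(X, y, x_q, kw=10.0):
    """Kernel regression as on the slide: Gaussian weights on all
    stored points, then a weighted average of their outputs."""
    d2 = np.sum((X - x_q) ** 2, axis=1)  # squared distances to the query
    w = np.exp(-d2 / kw**2)              # w_i = exp(-D^2 / Kw^2)
    return np.sum(w * y) / np.sum(w)     # sum(w_i y_i) / sum(w_i)
```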
2005-2007 Carlos Guestrin 37
Weighting functions
$w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$
Typically optimize $K_w$ using gradient descent
(Our examples use Gaussian)
2005-2007 Carlos Guestrin 38
Kernel regression predictions
Increasing the kernel width $K_w$ means further away points get an opportunity to influence you.
As $K_w \to \infty$, the prediction tends to the global average.
[Figure: predictions with $K_w$ = 80, 20, and 10.]
2005-2007 Carlos Guestrin 39
Kernel regression on our test cases
[Figure: kernel regression on the three test cases; $K_w$ = 1/16 of x-axis width, 1/32 of x-axis width, and 1/32 of x-axis width.]
Choosing a good $K_w$ is important, not just for kernel regression but for all the locally weighted learners we're about to see.
2005-2007 Carlos Guestrin 40
Kernel regression can look bad
[Figure: the three test cases, each fit with the best $K_w$; the results still look poor.]
Time to try something more powerful
2005-2007 Carlos Guestrin 41
Locally weighted regression
Kernel regression:
Take a very very conservative function approximator called AVERAGING. Locally weight it.
Locally weighted regression:
Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.
2005-2007 Carlos Guestrin 42
Locally weighted regression
Four things make a memory-based learner:
A distance metric
Any
How many nearby neighbors to look at?
All of them
A weighting function (optional)
Kernels: $w_i = \exp(-D(x_i, \text{query})^2 / K_w^2)$
How to fit with the local points?
General weighted regression:
$\beta = \arg\min_\beta \sum_{k=1}^{N} w_k^2 \left( y_k - \beta^T x_k \right)^2$
2005-2007 Carlos Guestrin 43
How LWR works
Linear regression: same parameters for all queries
$\beta = (X^T X)^{-1} X^T Y$
Locally weighted regression: solve a weighted linear regression for each query (sketched below), with weight matrix
$W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$
$\beta = \left( (WX)^T (WX) \right)^{-1} (WX)^T WY$
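A minimal sketch of one LWR query via the weighted normal equations above; X is assumed to already include a constant column if an intercept is wanted, and the kernel width is illustrative:

```python
import numpy as np

def lwr_predict(X, Y, x_q, kw=1.0):
    """Locally weighted regression: one weighted least-squares
    solve per query, beta = ((WX)^T (WX))^{-1} (WX)^T WY."""
    d2 = np.sum((X - x_q) ** 2, axis=1)
    w = np.exp(-d2 / kw**2)       # Gaussian kernel weights
    WX = w[:, None] * X           # rows of X scaled by their weights
    WY = w * Y                    # outputs scaled the same way
    beta = np.linalg.solve(WX.T @ WX, WX.T @ WY)  # weighted normal equations
    return x_q @ beta             # evaluate the local line at the query
```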
2005-2007 Carlos Guestrin 44
Another view of LWR
Image from Cohn, D.A., Ghahramani, Z., and Jordan, M.I. (1996) "Active Learning with Statistical Models", JAIR Volume 4, pages 129-145.
2005-2007 Carlos Guestrin 45
LWR on our test cases
[Figure: LWR on the three test cases; $K_w$ = 1/8, 1/32, and 1/16 of x-axis width.]
2005-2007 Carlos Guestrin 46
Locally weighted polynomial regression
[Figure: three fits, each with kernel width $K_w$ at its optimal level: kernel regression ($K_w$ = 1/100 of x-axis), LW linear regression ($K_w$ = 1/40 of x-axis), LW quadratic regression ($K_w$ = 1/15 of x-axis).]
Local quadratic regression is easy: just add quadratic terms to the $(WX)^T(WX)$ matrix (a sketch follows below). As the regression degree increases, the kernel width can increase without introducing bias.
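A sketch of that feature expansion, reusing the `lwr_predict` sketch above; the helper name is hypothetical:

```python
import numpy as np

def add_quadratic_terms(X):
    """Append squared and cross terms (plus a constant column) so the
    same weighted least-squares solve fits a local quadratic."""
    n, d = X.shape
    cols = [np.ones((n, 1)), X]
    for i in range(d):
        for j in range(i, d):
            cols.append((X[:, i] * X[:, j])[:, None])  # x_i * x_j terms
    return np.hstack(cols)

# Usage: expand the stored inputs and the query the same way, e.g.
#   xq_expanded = add_quadratic_terms(x_q[None, :])[0]
#   lwr_predict(add_quadratic_terms(X), Y, xq_expanded, kw)
```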
2005-2007 Carlos Guestrin 47
Curse of dimensionality for instance-based learning
Must store and retrieve all data!
Most real work done during testing
For every test sample, must search through the whole dataset: very slow!
We'll see fast methods for dealing with large datasets
Instance-based learning often poor with noisy or irrelevant features
2005-2007 Carlos Guestrin 48
Curse of the irrelevant feature
2005-2007 Carlos Guestrin 49
What you need to know about instance-based learning
k-NN
Simplest learning algorithm
With sufficient data, very hard to beat this strawman approach
Picking k?
Kernel regression
Set k to n (the number of data points) and optimize weights by gradient descent
Smoother than k-NN
Locally weighted regression
Generalizes kernel regression; not just a local average
Curse of dimensionality
Must remember (very large) dataset for prediction
Irrelevant features often killers for instance-based approaches
2005-2007 Carlos Guestrin 50
Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials:
http://www.cs.cmu.edu/~awm/tutorials