042407 Neural Networks
TRANSCRIPT
Michael S. Lewicki, Carnegie Mellon. Artificial Intelligence: Neural Networks.
The Iris dataset with decision tree boundaries
[Figure: Iris data, petal width (cm) vs. petal length (cm), with decision tree boundaries]
The optimal decision boundary for C2 vs. C3
[Figure: petal width (cm) vs. petal length (cm) scatter plot with the class-conditional densities p(petal length|C2) and p(petal length|C3) overlaid]
- The optimal decision boundary is determined from the statistical distribution of the classes.
- It is optimal only if the model is correct.
- It assigns a precise degree of uncertainty to the classification.
Optimal decision boundary
[Figure: class-conditional densities p(petal length|C2), p(petal length|C3) and posteriors p(C2|petal length), p(C3|petal length), with the optimal decision boundary where the posteriors cross]
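A minimal sketch of this computation, assuming (hypothetically) Gaussian class-conditional densities for petal length with illustrative parameters rather than the actual Iris estimates: Bayes' rule gives the posteriors, and the optimal boundary is where they are equal.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Gaussian class-conditional densities for petal length;
# the means and standard deviations are illustrative, not the Iris estimates.
mu2, sd2 = 4.3, 0.5    # p(petal length | C2)
mu3, sd3 = 5.5, 0.6    # p(petal length | C3)
prior2 = prior3 = 0.5  # equal class priors

x = np.linspace(1, 7, 601)
lik2 = norm.pdf(x, mu2, sd2)
lik3 = norm.pdf(x, mu3, sd3)

# Bayes' rule: p(Ck | x) = p(x | Ck) p(Ck) / p(x)
evidence = lik2 * prior2 + lik3 * prior3
post2 = lik2 * prior2 / evidence
post3 = lik3 * prior3 / evidence

# The optimal decision boundary is where the two posteriors are equal.
boundary = x[np.argmin(np.abs(post2 - post3))]
print(f"decision boundary near petal length = {boundary:.2f} cm")
```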
Can we do better?
[Figure: petal width (cm) vs. petal length (cm) scatter plot with the 1D densities p(petal length|C2) and p(petal length|C3)]
- The only way to do better is to use more information.
- Decision trees use both petal width and petal length.
Arbitrary decision boundaries would be more powerful
[Figure: petal width (cm) vs. petal length (cm) scatter plot with a non-linear class boundary]
- Decision boundaries could be non-linear.
Defining a decision boundary
[Figure: a line separating two classes in the (x1, x2) plane]
- Consider just two classes: we want points on one side of the line to be in class 1, otherwise class 2.
- 2D linear discriminant function: y = m^T x + b. This defines a plane over the 2D input, which leads to the decision (sketched in code below): class 1 if y >= 0, class 2 if y < 0.
- The decision boundary is the set of points where y = m^T x + b = 0.
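A minimal sketch of that decision rule; the weight vector m and bias b below are illustrative values, not ones from the lecture.

```python
import numpy as np

def linear_discriminant(x, m, b):
    """Return class 1 if y = m^T x + b >= 0, else class 2."""
    y = m @ x + b
    return 1 if y >= 0 else 2

# Illustrative parameters: a line in (petal length, petal width) space.
m = np.array([1.0, 2.5])
b = -8.0

print(linear_discriminant(np.array([4.0, 1.2]), m, b))  # y = -1.0 -> class 2
print(linear_discriminant(np.array([6.0, 2.0]), m, b))  # y =  3.0 -> class 1
```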
Linear separability
[Figure: petal width (cm) vs. petal length (cm) scatter plot; one pair of classes is linearly separable, the other is not]
- Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane
Diagramming the classifier as a neural network
The feedforward neural network is specified by weights w_i and bias b:
  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b
It can be written equivalently as
  y = w^T x = \sum_{i=0}^{M} w_i x_i
where w_0 = b is the bias and x_0 is a dummy input that is always 1 (see the sketch below).
[Diagram: input units x_1 ... x_M (plus dummy input x_0 = 1), weights w_1 ... w_M (plus w_0), bias b, and a single output unit y]
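A minimal sketch of the equivalence, with illustrative numbers: folding the bias in as w_0 with a dummy input x_0 = 1 gives the same output as the explicit-bias form.

```python
import numpy as np

x = np.array([4.0, 1.2])   # inputs x1..xM
w = np.array([1.0, 2.5])   # weights w1..wM
b = -8.0                   # bias

# Explicit bias: y = w^T x + b
y_explicit = w @ x + b

# Bias folded in: prepend a dummy input x0 = 1 and set w0 = b
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_folded = w_aug @ x_aug

assert np.isclose(y_explicit, y_folded)
print(y_explicit, y_folded)
```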
Determining, i.e. learning, the optimal linear discriminant
- First we must define an objective function, i.e. the goal of learning.
- Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
- Objective: minimize the sum-squared error over all patterns x_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
- Note the notation: x_n denotes a pattern vector, x_n = (x_1, ..., x_M)_n.
- We can define the desired class as:
  c_n = 0 if x_n is in class 1, 1 if x_n is in class 2
We've seen this before: curve fitting
[Figure: curve-fitting example from Bishop (2006), Pattern Recognition and Machine Learning: noisy targets t_n = sin(2*pi*x_n) + noise on x in [0, 1], with the fitted curve y(x_n, w)]
Neural networks compared to polynomial curve fitting
[Figure: polynomial fits to the noisy sin(2*pi*x) data for several polynomial orders; example from Bishop (2006), Pattern Recognition and Machine Learning]
  y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j
  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2
- For the linear network, M = 1 and there are multiple input dimensions. (A least-squares fitting sketch follows below.)
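A minimal least-squares sketch of the polynomial fit above, with an assumed noise level and polynomial order (the lecture does not specify these):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x), as in Bishop's curve-fitting example.
N, M = 10, 3                      # number of points, polynomial order (assumed)
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with columns x^0, x^1, ..., x^M, then minimize
# E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 by least squares.
X = np.vander(x, M + 1, increasing=True)
w, *_ = np.linalg.lstsq(X, t, rcond=None)

E = 0.5 * np.sum((X @ w - t) ** 2)
print("fitted weights:", w, "sum-squared error:", E)
```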
General form of a linear network
- A linear neural network is simply a linear transformation of the input:
  y_j = \sum_{i=0}^{M} w_{i,j} x_i
- Or, in matrix-vector form:
  y = W x
- Multiple outputs correspond to multivariate regression.
[Diagram: input units x_0 = 1, x_1 ... x_M (bias as dummy input), weight matrix W, and output units y_1 ... y_K]
Training the network: Optimization by gradient descent
- We can adjust the weights incrementally to minimize the objective function. This is called gradient descent (or gradient ascent if we're maximizing).
- The gradient descent rule for weight w_i is:
  w_i^{t+1} = w_i^t - \epsilon \, \partial E / \partial w_i
- Or in vector form:
  w^{t+1} = w^t - \epsilon \, \partial E / \partial w
- For gradient ascent, the sign of the gradient step changes. (A one-variable sketch follows below.)
[Figure: error surface with successive weight estimates w_1, w_2, w_3, w_4 stepping toward the minimum]
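A one-variable sketch of the update rule on an assumed toy objective, E(w) = (w - 3)^2:

```python
# Gradient descent on the toy objective E(w) = (w - 3)^2, with dE/dw = 2(w - 3).
# For gradient ascent, the sign of the step would flip.
epsilon = 0.1   # step size
w = 0.0
for t in range(50):
    grad = 2 * (w - 3.0)
    w = w - epsilon * grad
print(w)        # approaches the minimum at w = 3
```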
Computing the gradient
- Idea: minimize the error by gradient descent.
- Take the derivative of the objective function with respect to the weights:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
  \partial E / \partial w_i = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + ... + w_i x_{i,n} + ... + w_M x_{M,n} - c_n) x_{i,n}
                            = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}
- And in vector form (used in the training sketch below):
  \partial E / \partial w = \sum_{n=1}^{N} (w^T x_n - c_n) x_n
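A minimal training sketch using this gradient. The two-class data below are synthetic stand-ins (not the Iris measurements), and a dummy input x_0 = 1 carries the bias; the step size ε = 0.1/N matches the simulation slides that follow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data standing in for the lecture's example; a dummy
# input x0 = 1 is prepended so that w0 plays the role of the bias.
X1 = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))   # class with target c = 0
X2 = rng.normal([1.0, 1.0], 0.5, size=(50, 2))     # class with target c = 1
X = np.vstack([X1, X2])
X = np.hstack([np.ones((X.shape[0], 1)), X])       # x0 = 1 column
c = np.concatenate([np.zeros(50), np.ones(50)])
N = len(c)

w = np.zeros(3)
epsilon = 0.1 / N                                  # step size, as on the slides
for t in range(1000):
    err = X @ w - c                                # (w^T x_n - c_n) for all n
    grad = X.T @ err                               # sum_n (w^T x_n - c_n) x_n
    w = w - epsilon * grad

E = 0.5 * np.sum((X @ w - c) ** 2)
print("weights:", w, "error:", E)
```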
Simulation: learning the decision boundary
- Each iteration updates the weights using the gradient:
  \partial E / \partial w_i = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}
  w_i^{t+1} = w_i^t - \epsilon \, \partial E / \partial w_i
- Epsilon is a small value: ε = 0.1/N.
- Epsilon too large: learning diverges.
- Epsilon too small: convergence is slow.
[Figure: the decision boundary in the (x1, x2) plane and the learning curve (error vs. iteration); this slide is repeated over several frames as the simulation progresses]
Simulation: learning the decision boundary
- Learning converges onto the solution that minimizes the error.
- For linear networks, this is guaranteed to converge to the minimum.
- It is also possible to derive a closed-form solution (covered later; a sketch follows below).
[Figure: the learned decision boundary in the (x1, x2) plane and the converged learning curve (error vs. iteration)]
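The closed-form solution is only mentioned here, but for the sum-squared error it is the standard least-squares result; a sketch, assuming the same X (rows x_n, including the x_0 = 1 column) and targets c as in the training sketch above:

```python
import numpy as np

# Setting the gradient sum_n (w^T x_n - c_n) x_n to zero gives the normal
# equations X^T X w = X^T c; their solution minimizes the sum-squared error
# (uniquely, when X^T X is invertible).
def closed_form_weights(X, c):
    return np.linalg.solve(X.T @ X, X.T @ c)
```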
Learning is slow when epsilon is too small
- Here, larger step sizes would converge more quickly to the minimum.
[Figure: error as a function of the weight w, with many small gradient steps]
Divergence when epsilon is too large
- If the step size is too large, learning can oscillate between different sides of the minimum.
[Figure: error as a function of the weight w, with large steps jumping across the minimum]
Multi-layer networks
- Can we extend our network to multiple layers? We have:
  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i
- Or in matrix form:
  z = V y = V W x
- Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW. It is not more powerful (see the sketch below).
- How do we address this?
[Diagram: x -> W -> y -> V -> z]
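A minimal numerical check of the collapse, with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers: y = Wx, z = Vy. Composing them gives z = (VW)x, so the
# two-layer linear network equals a one-layer network with U = VW.
W = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 hidden units
V = rng.normal(size=(2, 4))   # layer 2: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

z_two_layer = V @ (W @ x)
U = V @ W
z_one_layer = U @ x

print(np.allclose(z_two_layer, z_one_layer))   # True
```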
Non-linear neural networks
- Idea: introduce a non-linearity f:
  y_j = f( \sum_i w_{i,j} x_i )
  z_k = f( \sum_j v_{j,k} y_j ) = f( \sum_j v_{j,k} f( \sum_i w_{i,j} x_i ) )
- Now, multiple layers are not equivalent.
- Common nonlinearities (both sketched in code below):
  - threshold: y = 0 if x < 0, 1 if x >= 0
  - sigmoid: y = 1 / (1 + exp(-x))
[Figure: plots of the threshold and sigmoid nonlinearities]
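A minimal sketch of the two nonlinearities and of a two-layer nonlinear forward pass with arbitrary random weights:

```python
import numpy as np

def threshold(x):
    """Step nonlinearity: 0 for x < 0, 1 for x >= 0."""
    return (x >= 0).astype(float)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# With a nonlinearity f, the two-layer output z = f(V f(W x)) no longer
# collapses into a single linear map.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
V = rng.normal(size=(2, 4))
x = rng.normal(size=3)
print(sigmoid(V @ sigmoid(W @ x)))
```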
Modeling logical operators
- A one-layer binary-threshold network can implement the logical operators AND and OR, but not XOR. Why not? (See the sketch below.)
  y_j = f( \sum_i w_{i,j} x_i ),  with threshold f: y = 0 if x < 0, 1 if x >= 0
[Diagram: the four inputs (x1, x2) in {0, 1}^2 labeled by output for x1 AND x2, x1 OR x2, and x1 XOR x2]
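A sketch of one valid choice of weights (there are many) for AND and OR with a single threshold unit; XOR fails because its positive cases cannot be separated from the negative ones by any single line.

```python
import numpy as np

def threshold_unit(x, w, b):
    """Single binary-threshold unit: output 1 if w^T x + b >= 0, else 0."""
    return int(x @ w + b >= 0)

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]

# Illustrative weights (one of many valid choices).
for name, w, b in [("AND", np.array([1.0, 1.0]), -1.5),
                   ("OR",  np.array([1.0, 1.0]), -0.5)]:
    print(name, [threshold_unit(x, w, b) for x in inputs])
# AND [0, 0, 0, 1]
# OR  [0, 1, 1, 1]
# No (w, b) reproduces XOR [0, 1, 1, 0]: the problem is not linearly separable.
```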
Posterior odds interpretation of a sigmoid
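Only the title of this slide survives in the transcript; the standard identity it refers to is that, for two classes, the posterior is a sigmoid of the log posterior odds a:

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
  = \frac{1}{1 + e^{-a}} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
```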
The general classification/regression problem
- Data: D = {x_1, ..., x_T}, where each observation x_i = {x_1, ..., x_N}_i.
- Desired output: y = {y_1, ..., y_K}.
- Model parameters: θ = {θ_1, ..., θ_M}.
- Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.
- The input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
- The model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters.
- For classification: y_i = 1 if x_i is in class C_i, 0 otherwise.
- Regression: arbitrary y.
A general multi-layer neural network
- The error function is defined as before, where the target vector t_n defines the desired output for network output y_n:
  E = \frac{1}{2} \sum_{n=1}^{N} ( y_n(x_n, W_{1:L}) - t_n )^2
- The forward pass computes the outputs at each layer (sketched below):
  y^l_j = f( \sum_i w^l_{i,j} y^{l-1}_i ),  l = {1, ..., L}
  with input x = y^0 and output y^L.
[Diagram: input units x_0 = 1, x_1 ... x_M; hidden layers 1 and 2, each with a dummy unit y_0 = 1; output units y_1 ... y_K]
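A minimal forward-pass sketch. The sigmoid nonlinearity and the bias handling (a dummy unit y_0 = 1 folded into each weight matrix) are assumptions consistent with the diagrams; the layer sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, weights):
    """Compute y^l = f(W^l y^{l-1}) for l = 1..L, with y^0 = x.

    `weights` is a list of matrices W^1..W^L; a bias column is folded into
    each W^l via a dummy unit y0 = 1.
    """
    activations = [x]
    y = x
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))   # prepend dummy y0 = 1
        activations.append(y)
    return activations                                # activations[-1] is y^L

# Tiny two-layer example with arbitrary shapes and random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3 + 1)),   # layer 1: 3 inputs (+ bias) -> 4 units
           rng.normal(size=(2, 4 + 1))]   # layer 2: 4 units (+ bias) -> 2 outputs
print(forward_pass(rng.normal(size=3), weights)[-1])
```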
Deriving the gradient for a sigmoid neural network
- The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
- Convenient fact for the sigmoid non-linearity:
  \frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x) (1 - \sigma(x))
- The backward pass computes the gradients: back-propagation (sketched below).
  E = \frac{1}{2} \sum_{n=1}^{N} ( y_n(x_n, W_{1:L}) - t_n )^2
  W^{t+1} = W^t - \epsilon \, \partial E / \partial W
- New problem: local minima.
[Diagram: the same multi-layer network, with the error propagated backward from the output layer]
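A minimal backward-pass sketch for a two-layer sigmoid network, using the sigma'(x) = sigma(x)(1 - sigma(x)) fact above. Biases are omitted for brevity, and the batch layout (rows are patterns) is an assumption, not the lecture's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(X, T, W1, W2, epsilon):
    """One gradient-descent step for a two-layer sigmoid network.

    Rows of X are input patterns x_n, rows of T are targets t_n, and
    E = 1/2 * sum_n ||y_n - t_n||^2. Biases are omitted for brevity.
    """
    # Forward pass
    H = sigmoid(X @ W1.T)                  # hidden activations, shape (N, n_hidden)
    Y = sigmoid(H @ W2.T)                  # outputs,            shape (N, n_out)

    # Backward pass, using sigma'(a) = sigma(a) * (1 - sigma(a))
    delta2 = (Y - T) * Y * (1 - Y)         # error signal at the output layer
    delta1 = (delta2 @ W2) * H * (1 - H)   # error back-propagated to the hidden layer

    grad_W2 = delta2.T @ H                 # dE/dW2
    grad_W1 = delta1.T @ X                 # dE/dW1

    # Gradient-descent update; with nonlinear units the error surface can have
    # local minima, so convergence to the global minimum is not guaranteed.
    return W1 - epsilon * grad_W1, W2 - epsilon * grad_W2
```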
Applications: Driving (output is analog: steering direction)
- Network with 1 layer (4 hidden units).
- Learns to drive on roads.
- Demonstrated at highway speeds over 100s of miles.
- D. Pomerleau. Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishing, 1993.
Real image input is augmented to avoid overfitting
- Training data: images + corresponding steering angle.
- Important: conditioning of the training data to generate new examples avoids overfitting.
Hand-written digits: LeNet
- Takes as input an image of a handwritten digit; each pixel is an input unit.
- Complex network with many layers; output is the digit class.
- Tested on a large (50,000+) database of handwritten samples.
- Real-time; used commercially.
LeNet
- Very low error rate.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
- http://yann.lecun.com/exdb/lenet/
Object recognition
- LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
- http://www.cs.nyu.edu/~yann/research/norb/
Summary
- Decision boundaries: Bayes optimal, linear discriminant, linear separability
- Classification vs. regression
- Optimization by gradient descent
- Degeneracy of a multi-layer linear network
- Non-linearities: threshold, sigmoid, others?
- Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set step size