
Page 1:

Neural Networks

Aarti Singh

Machine Learning 10-601 Nov 3, 2011

Slides Courtesy: Tom Mitchell


Page 2:

Logistic Regression


Assumes the following functional form for P(Y|X):

P(Y=1|X) = 1 / ( 1 + exp( −(w0 + ∑i wi Xi) ) )

Logistic function (or Sigmoid):

Logistic function applied to a linear function of the data

[Figure: plot of the logistic (sigmoid) function of z]

Features can be discrete or continuous!
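For concreteness, here is a minimal NumPy sketch of this functional form; the feature vector and weight values below are made-up illustrations, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(x, w, w0):
    # Logistic regression: sigmoid applied to a linear function of the features
    return sigmoid(w0 + np.dot(w, x))

# Hypothetical example with d = 3 features (values are illustrative only)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
w0 = 0.2
print(p_y_given_x(x, w, w0))   # P(Y=1 | X=x), a number between 0 and 1
```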

Page 3:

Logistic Regression is a Linear Classifier!


Assumes the following functional form for P(Y|X):

P(Y=1|X) = 1 / ( 1 + exp( −(w0 + ∑i wi Xi) ) )

Decision boundary: predict Y = 1 when P(Y=1|X) ≥ P(Y=0|X), which holds exactly when w0 + ∑i wi Xi ≥ 0 (a linear decision boundary).
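Why the boundary is linear, spelled out via the odds ratio (a standard one-line derivation, using the same w0, wi notation as above):

$$\frac{P(Y=1\mid X)}{P(Y=0\mid X)} \;=\; \exp\!\Big(w_0 + \sum_i w_i X_i\Big) \;\ge\; 1 \quad\Longleftrightarrow\quad w_0 + \sum_i w_i X_i \;\ge\; 0.$$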

Page 4:

Training Logistic Regression


How to learn the parameters w0, w1, …, wd?

Training Data: {(x^j, y^j)}, j = 1, …, n

Maximum (Conditional) Likelihood Estimates:

w_MCLE = arg max_w ∏j P( y^j | x^j, w )    (equivalently, maximize the conditional log-likelihood ∑j ln P( y^j | x^j, w ))

Discriminative philosophy – don't waste effort learning P(X); focus on P(Y|X) – that's all that matters for classification!
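For binary Y, the conditional log-likelihood being maximized has the standard closed form below (training examples indexed by j; this follows directly from the P(Y|X) model above):

$$\ell(w) \;=\; \sum_j \ln P(y^j \mid x^j, w) \;=\; \sum_j \Big[\, y^j \big(w_0 + \sum_i w_i x_i^j\big) \;-\; \ln\!\big(1 + \exp(w_0 + \sum_i w_i x_i^j)\big) \Big].$$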

 

Page 5:

Optimizing a convex function


• Max Conditional log-likelihood = Min Negative Conditional log-likelihood

• Negative Conditional log-likelihood is a convex function

Gradient Descent (convex):

Gradient (of the conditional log-likelihood):

∂ l(w) / ∂wi = ∑j xi^j [ y^j − P(Y=1 | x^j, w) ]

Update rule (learning rate η > 0):

wi ← wi + η ∑j xi^j [ y^j − P(Y=1 | x^j, w) ]
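A compact sketch of this training loop (a minimal illustration; the design matrix X, labels y, and hyperparameters below are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the negative conditional log-likelihood."""
    n, d = X.shape
    w = np.zeros(d)
    w0 = 0.0
    for _ in range(n_iters):
        p = sigmoid(w0 + X @ w)        # P(Y=1 | x^j, w) for every example
        error = y - p                  # y^j - P(Y=1 | x^j, w)
        w0 += eta * error.sum()        # bias update
        w  += eta * X.T @ error        # w_i <- w_i + eta * sum_j x_i^j * error_j
    return w0, w

# Tiny illustrative dataset (made up, roughly separable)
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([0, 0, 1, 1])
w0, w = train_logistic_regression(X, y)
print(sigmoid(w0 + X @ w))  # predicted probabilities for the training points
```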

Page 6:

Logistic function as a Graph

Sigmoid Unit

[Figure: sigmoid unit with inputs x1, …, xd, weights w1, …, wd, net input net = w0 + ∑i wi xi, and output o = σ(net) = 1 / (1 + e^(−net))]

Page 7:

Neural Networks to learn f: X → Y
• f can be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

• Neural networks – represent f by a network of logistic/sigmoid units:

[Figure: network of sigmoid units with input layer X, hidden layer H, and output layer Y]

Page 8:

Neural Network trained to distinguish vowel sounds using 2 formants (features)

Highly non-linear decision surface; two layers of logistic units

[Figure: network with input layer, hidden layer, and output layer]

Page 9:

Neural Network trained to drive a car!

Weights of each pixel for one hidden unit

Weights to output units from the hidden unit

Page 10:
Page 11:

Forward Propagation for prediction

Prediction – Given neural network (hidden units and weights), use it to predict the label of a test point

Forward Propagation –
• Start from the input layer
• For each subsequent layer, compute the output of each sigmoid unit

Sigmoid unit:  o = σ( w0 + ∑i wi xi ),  where σ(z) = 1 / (1 + e^(−z))

1-hidden-layer, 1-output NN:  o = σ( w0 + ∑h wh oh ),  with hidden-unit outputs oh = σ( w0h + ∑i wih xi )
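A minimal forward-propagation sketch for a network with one hidden layer of sigmoid units and a single sigmoid output; the weight matrices and test point below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward propagation through one hidden layer and one output unit.

    W_hidden: (n_hidden, d) weights from inputs to hidden units
    b_hidden: (n_hidden,)   hidden-unit biases (w_0h)
    w_out:    (n_hidden,)   weights from hidden units to the output unit
    b_out:    scalar        output-unit bias (w_0)
    """
    o_h = sigmoid(W_hidden @ x + b_hidden)   # hidden-unit outputs o_h
    o = sigmoid(w_out @ o_h + b_out)         # network output
    return o, o_h

# Illustrative weights for d = 2 inputs and 3 hidden units
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 2))
b_hidden = np.zeros(3)
w_out = rng.normal(size=3)
b_out = 0.0
o, o_h = forward(np.array([0.5, -1.0]), W_hidden, b_hidden, w_out, b_out)
print(o)   # predicted output for the test point
```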

Page 12:

Differentiable: the sigmoid unit's output o = σ( w0 + ∑i wi xi ) is a differentiable function of its inputs and weights, with dσ(z)/dz = σ(z)(1 − σ(z)).

Training Neural Networks

Page 13:

• Consider a regression problem f: X → Y, for scalar Y:

y = f(x) + ε,   where f is deterministic and the noise ε is iid N(0, σε)

M(C)LE Training for Neural Networks

Learned neural network: f̂(x)

•  Let’s maximize the conditional data likelihood

Train weights of all units to minimize sum of squared errors of predicted network outputs
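Spelling out the equivalence stated above (W collects all network weights, training examples are indexed by l, and f̂ denotes the learned network's output):

$$W_{MCLE} \;=\; \arg\max_W \prod_l P(y^l \mid x^l, W) \;=\; \arg\min_W \sum_l \big(y^l - \hat f(x^l)\big)^2 .$$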

Page 14:

• Consider a regression problem f: X → Y, for scalar Y:

y = f(x) + ε,   where f is deterministic and the noise ε is N(0, σε)

MAP Training for Neural Networks

Gaussian prior on the weights: P(W) = N(0, σI)

ln P(W) ↔ −c ∑i wi²  (up to an additive constant), so the log-prior penalizes large weights

Train weights of all units to minimize sum of squared errors of predicted network outputs plus weight magnitudes
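Combining the Gaussian likelihood and the Gaussian prior, and absorbing the variance ratio into the constant c, gives the familiar penalized least-squares objective (same notation as the MCLE case above):

$$W_{MAP} \;=\; \arg\max_W \Big[\ln P(W) + \sum_l \ln P(y^l \mid x^l, W)\Big] \;=\; \arg\min_W \Big[\sum_l \big(y^l - \hat f(x^l)\big)^2 + c \sum_i w_i^2\Big].$$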

Page 15:

E – Mean Square Error:  E[w] = ½ ∑l ( y^l − o^l )²,  where o^l is the network output for training example l

For Neural Networks, E[w] is no longer convex in w

Page 16:

Error Gradient for a Sigmoid Unit

For a sigmoid unit with output o = σ(net), net = w0 + ∑i wi xi, and squared error E[w] = ½ ∑l ( y^l − o^l )²:

∂E/∂wi = ∑l (∂E/∂o^l) (∂o^l/∂net^l) (∂net^l/∂wi) = −∑l ( y^l − o^l ) o^l (1 − o^l) xi^l

using the chain rule and dσ(z)/dz = σ(z)(1 − σ(z)).

Sigmoid Unit: [Figure: sigmoid unit with inputs x1, …, xd, weights w1, …, wd, output o = σ( w0 + ∑i wi xi )]

Page 17:

(MLE) Backpropagation: incremental (stochastic) gradient descent. For each training example, propagate the input forward, then propagate the errors backward:

For each output unit k:   δk = ok (1 − ok) ( yk − ok )

For each hidden unit h:   δh = oh (1 − oh) ∑k whk δk

Update each weight:       wij ← wij + η δj oi

Using forward propagation:

yk = target output (label)

ok, oh = unit outputs (obtained by forward propagation)

wij = weight from unit i to unit j

Note: if i is an input variable, oi = xi
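A sketch of these update rules for a 1-hidden-layer network with sigmoid output units, using incremental (per-example) updates; the array shapes, learning rate, and sample values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W_ih, W_ho, eta=0.05):
    """One incremental backpropagation update.

    x: (d,) input,  y: (K,) target outputs
    W_ih: (H, d+1) weights from inputs (plus bias) to hidden units
    W_ho: (K, H+1) weights from hidden units (plus bias) to outputs
    """
    x1 = np.append(1.0, x)                                 # prepend bias input o_0 = 1
    o_h = sigmoid(W_ih @ x1)                               # hidden-unit outputs o_h
    o_h1 = np.append(1.0, o_h)
    o_k = sigmoid(W_ho @ o_h1)                             # output-unit outputs o_k

    delta_k = o_k * (1 - o_k) * (y - o_k)                  # output deltas
    delta_h = o_h * (1 - o_h) * (W_ho[:, 1:].T @ delta_k)  # hidden deltas
    W_ho += eta * np.outer(delta_k, o_h1)                  # w_ij <- w_ij + eta * delta_j * o_i
    W_ih += eta * np.outer(delta_h, x1)
    return W_ih, W_ho

# Illustrative sizes: d = 2 inputs, H = 3 hidden units, K = 1 output
rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.1, size=(3, 3))
W_ho = rng.normal(scale=0.1, size=(1, 4))
W_ih, W_ho = backprop_step(np.array([0.5, -1.0]), np.array([1.0]), W_ih, W_ho)
```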

Page 18:

Using all training data D (batch gradient descent): sum the per-example gradient contributions over all of D before updating each weight,

wij ← wij + η ∑d∈D δj^(d) oi^(d)

Page 19:

The objective/error is no longer convex in the weights.

Page 20:

Our learning algorithm involves a parameter n=number of gradient descent iterations

How do we choose n to optimize future error, e.g., the n that minimizes the error rate of the neural net on future data? (Note: a similar issue arises for logistic regression, decision trees, …)

Dealing with Overfitting

Page 21:

Our learning algorithm involves a parameter n=number of gradient descent iterations

How do we choose n to optimize future error?
• Separate the available data into a training set and a validation set
• Use the training set to perform gradient descent
• n ← the number of iterations that optimizes validation set error (see the sketch below)

Dealing with Overfitting
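A minimal early-stopping sketch of this procedure, reusing the logistic-regression gradient steps from earlier; the train/validation split and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def choose_n_by_validation(X_train, y_train, X_val, y_val, eta=0.1, max_iters=500):
    """Run gradient descent, returning the iteration count n with best validation error."""
    d = X_train.shape[1]
    w, w0 = np.zeros(d), 0.0
    best_n, best_err = 0, np.inf
    for n in range(1, max_iters + 1):
        error = y_train - sigmoid(w0 + X_train @ w)     # one gradient step
        w0 += eta * error.sum()
        w  += eta * X_train.T @ error
        val_pred = (sigmoid(w0 + X_val @ w) >= 0.5).astype(int)
        val_err = np.mean(val_pred != y_val)            # validation error rate
        if val_err < best_err:
            best_n, best_err = n, val_err
    return best_n, best_err

# Illustrative split of a tiny synthetic dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
n_best, err = choose_n_by_validation(X[:30], y[:30], X[30:], y[30:])
print(n_best, err)
```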

Page 22:

Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.

________________________________________________
Partition data into K disjoint subsets
For k = 1 to K
    testData = kth subset
    h ← classifier trained* on all data except for testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test-set accuracies

* might withhold some of this to choose the number of gradient descent steps

K-fold Cross-validation
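A runnable sketch of this procedure; the `train_classifier` and `accuracy` helpers are hypothetical stand-ins for whatever learner is being evaluated (e.g., the logistic-regression trainer above):

```python
import numpy as np

def k_fold_accuracy(X, y, train_classifier, accuracy, K=10, seed=0):
    """Estimate accuracy by K-fold cross-validation.

    train_classifier(X_train, y_train) -> model
    accuracy(model, X_test, y_test)    -> float in [0, 1]
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)          # K disjoint subsets
    accs = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_classifier(X[train_idx], y[train_idx])
        accs.append(accuracy(model, X[test_idx], y[test_idx]))
    return float(np.mean(accs))             # FinalAccuracy

# Example with a trivial majority-class "classifier" (purely illustrative)
X = np.arange(20).reshape(10, 2); y = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
train = lambda Xtr, ytr: int(round(ytr.mean()))    # "model" = majority label
acc = lambda m, Xte, yte: float(np.mean(yte == m))
print(k_fold_accuracy(X, y, train, acc, K=5))
```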

Page 23:

This is just K-fold cross-validation, leaving out one example each iteration.
________________________________________________
Partition data into K disjoint subsets, each containing one example
For k = 1 to K
    testData = kth subset
    h ← classifier trained* on all data except for testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test-set accuracies

* might withhold some of this to choose the number of gradient descent steps

Leave-one-out Cross-validation

Page 24:

•  Cross-validation

• Regularization – small weights keep the sigmoid units near their linear regime, so the NN behaves almost linearly (low VC dimension)

•  Control number of hidden units – low complexity

Dealing with Overfitting

[Figure: sigmoid unit computing a logistic output of ∑i wi xi]

Page 25:
Page 26:
Page 27:
Page 28:
Page 29:

[Figure labels: w0, left, strt, right, up]

Page 30:

Semantic Memory Model Based on ANNs

[McClelland & Rogers, Nature 2003]

No hierarchy given.

Train with assertions, e.g., Can(Canary,Fly)

Page 31:

Humans act as though they have a hierarchical memory organization

1. Victims of Semantic Dementia progressively lose knowledge of objects. But they lose specific details first and general properties later, suggesting a hierarchical memory organization.

[Hierarchy: Thing → Living, NonLiving; Living → Animal, Plant; Animal → Bird, Fish; Bird → Canary]

2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.

* some debate remains on this.

Question: What learning mechanism could produce this emergent hierarchy?

Page 32:

Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]

Page 33:
Page 34:

• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns

Training Networks on Time Series

Page 35:

• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns

• Idea: use the hidden layer in the network to capture state history (making the network recurrent; see the sketch below)

Training Networks on Time Series
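A minimal sketch of this idea as a simple Elman-style recurrent sigmoid network, where the hidden state from the previous time step is fed back in alongside the current input; all names, sizes, and values here are illustrative assumptions, not material from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(x_seq, W_xh, W_hh, w_out, b_h, b_out):
    """Run a sequence through a 1-hidden-layer recurrent sigmoid network.

    x_seq: (T, d) sequence of inputs
    W_xh:  (H, d) input-to-hidden weights     W_hh: (H, H) hidden-to-hidden weights
    w_out: (H,)   hidden-to-output weights    b_h, b_out: biases
    """
    H = W_xh.shape[0]
    h = np.zeros(H)                                 # hidden state summarizes the history so far
    outputs = []
    for x_t in x_seq:
        h = sigmoid(W_xh @ x_t + W_hh @ h + b_h)    # new state depends on input and old state
        outputs.append(sigmoid(w_out @ h + b_out))  # predicted next reading, scaled to (0, 1)
    return np.array(outputs)

# Illustrative run: T = 5 steps of d = 2 sensor inputs, H = 4 hidden units
rng = np.random.default_rng(0)
preds = recurrent_forward(rng.normal(size=(5, 2)), rng.normal(size=(4, 2)),
                          rng.normal(size=(4, 4)), rng.normal(size=4), np.zeros(4), 0.0)
print(preds)
```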

Page 36:

How can we train a recurrent net?

Training Networks on Time Series

Page 37:

Artificial Neural Networks: Summary

• Actively used to model distributed computation in the brain
• Highly non-linear regression/classification
• Vector-valued inputs and outputs
• Potentially millions of parameters to estimate - overfitting
• Hidden layers learn intermediate representations - how many to use?
• Prediction - forward propagation
• Gradient descent (back-propagation), local minima problems
• Mostly obsolete - kernel tricks are more popular, but coming back in a new form as deep belief networks (probabilistic interpretation)