580.691 Learning Theory
Reza Shadmehr
LMS with Newton-Raphson, weighted least squares, choice of loss function
Review of regression
Multivariate regression: $f: \mathbb{R}^d \to \mathbb{R}, \quad f(\mathbf{x};\mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_d x_d$

Loss on observation $n$: $\quad loss^{(n)} = \left(y^{(n)} - \hat{y}^{(n)}\right)^2 = \left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2$

$$E[loss] = \frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2 = \frac{1}{N}\left(\mathbf{y} - X\mathbf{w}\right)^T\left(\mathbf{y} - X\mathbf{w}\right)$$

Batch algorithm: $\quad \hat{\mathbf{w}} = \left(X^T X\right)^{-1}X^T\mathbf{y}$

Steepest descent: $\quad \mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\,\frac{1}{N}\sum_{t=1}^{N}\left(y^{(t)} - \mathbf{w}^{(n)T}\mathbf{x}^{(t)}\right)\mathbf{x}^{(t)}$

LMS: $\quad \mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\left(y^{(n)} - \mathbf{w}^{(n)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$
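A minimal numpy sketch (mine, not from the slides) comparing the three estimators on synthetic data; the data-generating parameters, learning rate, and iteration counts are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = w0 + w1*x1 + w2*x2 + noise
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # column of 1s carries w0
w_true = np.array([0.5, 1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Batch algorithm: w = (X^T X)^{-1} X^T y
w_batch = np.linalg.solve(X.T @ X, X.T @ y)

# Steepest descent: step along the gradient of the mean squared error
eta = 0.1
w_sd = np.zeros(d + 1)
for _ in range(500):
    w_sd = w_sd + eta * (2.0 / N) * X.T @ (y - X @ w_sd)

# LMS: update after every single observation
w_lms = np.zeros(d + 1)
for _ in range(50):                          # a few passes over the data
    for x_n, y_n in zip(X, y):
        w_lms = w_lms + eta * (y_n - w_lms @ x_n) * x_n

print(w_batch, w_sd, w_lms)                  # all three approach w_true
```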
Finding the minimum of a function in a single step
[Plot: the loss $J(w)$ as a function of $w$, with the points $w^*$ and $w$ marked on the horizontal axis.]

Taylor series expansion of $J$ about the point $w^*$:

$$J(w) = J(w^*) + \frac{J'(w^*)}{1!}\left(w - w^*\right) + \frac{J''(w^*)}{2!}\left(w - w^*\right)^2 + \frac{J'''(w^*)}{3!}\left(w - w^*\right)^3 + \cdots$$

(If $J$ is quadratic the expansion stops at the second-order term; otherwise there are more terms here.)

Setting the derivative of the truncated expansion to zero:

$$J'(w) = J'(w^*) + J''(w^*)\left(w - w^*\right) = 0$$

$$J''(w^*)\left(w - w^*\right) = -J'(w^*)$$

$$w = w^* - \frac{J'(w^*)}{J''(w^*)}$$

If $J$ is quadratic, the $w$ found this way is the minimum, reached in a single step from $w^*$.
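As a worked example (mine, not on the slide): take $J(w) = (w-3)^2 + 1$ and expand about $w^* = 0$. Then $J'(0) = -6$ and $J''(0) = 2$, so

$$w = 0 - \frac{-6}{2} = 3,$$

which is exactly the minimum, found in a single step because $J$ is quadratic.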
Newton-Raphson method
For a vector of parameters $\mathbf{w}$, expand $J$ about $\mathbf{w}^*$:

$$J(\mathbf{w}) = J(\mathbf{w}^*) + \left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^*}^T\left(\mathbf{w} - \mathbf{w}^*\right) + \frac{1}{2}\left(\mathbf{w} - \mathbf{w}^*\right)^T\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}^*}\left(\mathbf{w} - \mathbf{w}^*\right)$$

Setting the gradient to zero:

$$\frac{dJ}{d\mathbf{w}} = \left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^*} + \left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}^*}\left(\mathbf{w} - \mathbf{w}^*\right) = 0$$

$$\mathbf{w} = \mathbf{w}^* - \left(\frac{d^2J}{d\mathbf{w}^2}\right)^{-1}\frac{dJ}{d\mathbf{w}}$$

Applied iteratively, this gives the Newton-Raphson update:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \left(\frac{d^2J}{d\mathbf{w}^2}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^{(n)}}$$
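A small numpy sketch of the iteration (my example cost, not one from the slides); for a quadratic cost a single step of this update lands on the minimum, while a non-quadratic cost takes several steps:

```python
import numpy as np

def newton_raphson(grad, hess, w0, n_iter=20):
    """Iterate w <- w - H(w)^{-1} g(w) starting from w0."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = w - np.linalg.solve(hess(w), grad(w))
    return w

# Example cost: J(w) = (w1 - 1)^4 + (w2 + 2)^2, minimum at (1, -2).
grad = lambda w: np.array([4.0 * (w[0] - 1.0) ** 3, 2.0 * (w[1] + 2.0)])
hess = lambda w: np.array([[12.0 * (w[0] - 1.0) ** 2, 0.0], [0.0, 2.0]])

print(newton_raphson(grad, hess, [3.0, 3.0]))   # approaches [1, -2]
```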
The gradient of the loss function
Newton-Raphson update:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \left(\frac{d^2J}{d\mathbf{w}^2}\right)^{-1}\frac{dJ}{d\mathbf{w}}$$

Loss:

$$J = \frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2$$

Gradient:

$$\frac{dJ}{dw_i} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)x_i^{(n)}$$

$$\frac{dJ}{d\mathbf{w}} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$

Second derivative (Hessian):

$$\frac{d^2J}{d\mathbf{w}^2} = \frac{d}{d\mathbf{w}}\left(\frac{dJ}{d\mathbf{w}}\right) = \frac{2}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$$
The gradient of the loss function
To see why the second derivative takes this form, write out $\mathbf{x}\mathbf{x}^T\mathbf{w}$ for an $m$-dimensional $\mathbf{w}$ and differentiate:

$$\mathbf{x}\mathbf{x}^T\mathbf{w} = \begin{pmatrix} x_1x_1w_1 + x_1x_2w_2 + \cdots + x_1x_mw_m \\ x_2x_1w_1 + x_2x_2w_2 + \cdots + x_2x_mw_m \\ \vdots \\ x_mx_1w_1 + x_mx_2w_2 + \cdots + x_mx_mw_m \end{pmatrix}$$

$$\frac{d}{d\mathbf{w}}\left(\mathbf{x}\mathbf{x}^T\mathbf{w}\right) = \begin{pmatrix} x_1x_1 & x_1x_2 & \cdots & x_1x_m \\ x_2x_1 & x_2x_2 & \cdots & x_2x_m \\ \vdots & & \ddots & \vdots \\ x_mx_1 & x_mx_2 & \cdots & x_mx_m \end{pmatrix} = \mathbf{x}\mathbf{x}^T$$

Therefore:

$$\frac{d^2J}{d\mathbf{w}^2} = \frac{d}{d\mathbf{w}}\left(-\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}\right) = \frac{2}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$$
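A quick numpy check (mine) that these expressions match a finite-difference estimate of the gradient on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

J = lambda w: np.mean((y - X @ w) ** 2)

# Gradient and Hessian from the formulas above
grad = -(2.0 / N) * X.T @ (y - X @ w)
hess = (2.0 / N) * X.T @ X                 # = (2/N) * sum_n x^(n) x^(n)T

# Finite-difference check of the gradient
eps = 1e-6
grad_fd = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.allclose(grad, grad_fd, atol=1e-5))   # True
```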
LMS algorithm with Newton-Raphson
Substituting the gradient and the Hessian into the Newton-Raphson update:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \left(\frac{d^2J}{d\mathbf{w}^2}\right)^{-1}\frac{dJ}{d\mathbf{w}}$$

$$\frac{dJ}{d\mathbf{w}} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}, \qquad \frac{d^2J}{d\mathbf{w}^2} = \frac{2}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$$

Steepest descent algorithm:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\left(\sum_{t=1}^{N}\mathbf{x}^{(t)}\mathbf{x}^{(t)T}\right)^{-1}\sum_{t=1}^{N}\left(y^{(t)} - \mathbf{w}^{(n)T}\mathbf{x}^{(t)}\right)\mathbf{x}^{(t)}$$

LMS:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\left(\mathbf{x}^{(n)}\mathbf{x}^{(n)T}\right)^{-1}\left(y^{(n)} - \mathbf{w}^{(n)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$

Note: $0 < \eta \le 1$, and $\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$ is a singular matrix.
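Because the loss is quadratic in $\mathbf{w}$, the batch form of this update with $\eta = 1$ reaches the least-squares solution in a single step; the per-sample form cannot literally invert $\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$, since that matrix is rank one (hence the note above). A small numpy sketch of the batch step on synthetic data of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true + 0.05 * rng.normal(size=N)

# One Newton-Raphson step on the batch loss from w = 0, with eta = 1:
# w <- w + eta * (sum_t x x^T)^{-1} * sum_t (y - w^T x) x
w = np.zeros(d)
H = X.T @ X
g = X.T @ (y - X @ w)
w = w + 1.0 * np.linalg.solve(H, g)

w_batch = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w, w_batch))               # True: one step reaches the batch solution
```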
Weighted Least Squares
• Suppose some data points are more important than others. We want to
weight the errors in matching those data points more.
$$J = \frac{1}{N}\sum_{n=1}^{N} p^{(n)}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2$$

$$P = \mathrm{diag}\left(p^{(1)}, p^{(2)}, \ldots, p^{(N)}\right), \qquad \text{note: } P^T = P$$

$$J = \frac{1}{N}\left(\mathbf{y} - X\mathbf{w}\right)^T P\left(\mathbf{y} - X\mathbf{w}\right)$$

$$\frac{dJ}{d\mathbf{w}} = \frac{1}{N}\frac{d}{d\mathbf{w}}\left(\mathbf{y}^T P\mathbf{y} - \mathbf{y}^T P X\mathbf{w} - \mathbf{w}^T X^T P\mathbf{y} + \mathbf{w}^T X^T P X\mathbf{w}\right)$$

$$\frac{dJ}{d\mathbf{w}} = \frac{1}{N}\left(-2X^T P\mathbf{y} + 2X^T P X\mathbf{w}\right) = 0$$
In fMRI, we typically measure the signal intensity from N voxels at acquisition time t=1…T. Each of these T measurements constitutes an image. We assume that the time series of voxel n is an arbitrary linear function of the design matrix X plus a noise term:
$$\mathbf{y}_n = X\boldsymbol{\beta}_n + \boldsymbol{\varepsilon}_n$$

where $\mathbf{y}_n$ is a $T \times 1$ column vector, $X$ is the $T \times p$ design matrix, and $\boldsymbol{\beta}_n$ is a $p \times 1$ vector.
If one source of noise is due to random discrete events, for example, artifacts arising from the participant moving their jaw, then only some images will be influenced, violating the assumption of a stationary noise process. To relax this assumption, a simple approach is to allow the variance of noise in each image to be scaled by a separate parameter. Under the temporal independence assumption, the variance-covariance matrix of the noise process might be:
$$\mathrm{var}\left(\boldsymbol{\varepsilon}_n\right) = \sigma_n^2\,\mathbf{V} = \sigma_n^2\begin{pmatrix} s_1^2 & 0 & \cdots & 0 \\ 0 & s_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & s_T^2 \end{pmatrix}$$

where $s_i^2$ is a variance scaling parameter for the $i$-th time that the voxel was imaged.
How to handle artifacts in fMRI data
Diedrichsen and Shadmehr, NeuroImage (2005)
Discrete events (e.g., swallowing) will impact only those images that were acquired during the event. What should be done with these images, once they are identified? A typical approach would be to discard images based on some fixed threshold. If we knew $\mathrm{var}\left(\boldsymbol{\varepsilon}_n\right)$, the optimal approach would be to weight the images by the inverse of their variance:

$$\boldsymbol{\beta}_n^* = \left(X^T\mathbf{V}^{-1}X\right)^{-1}X^T\mathbf{V}^{-1}\mathbf{y}_n$$
But how do we get V? We can use the residuals from our model:
$$\mathbf{r}_n = \mathbf{y}_n - X\hat{\boldsymbol{\beta}}_n = \left(\mathbf{I} - X\left(X^TX\right)^{-1}X^T\right)\mathbf{y}_n = \mathbf{R}\,\mathbf{y}_n$$

$$\hat{\sigma}_n^2 = \frac{\mathbf{r}_n^T\mathbf{r}_n}{T - \mathrm{rank}(X)}$$

$$\hat{\mathbf{V}} = \frac{1}{N}\sum_{n=1}^{N}\mathrm{diag}\left(\mathbf{r}_n\mathbf{r}_n^T\right)\big/\hat{\sigma}_n^2$$
This is a good start, but has some issues regarding bias of our estimator of variance. To improve things, see Diedrichsen and Shadmehr (2005).
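A simplified numpy sketch of this recipe on synthetic data (my construction; it ignores the bias issues just mentioned, which Diedrichsen and Shadmehr address):

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, Nvox = 120, 4, 500

X = rng.normal(size=(T, p))                 # T x p design matrix
beta = rng.normal(size=(p, Nvox))           # one column of parameters per voxel
s2 = np.ones(T)
s2[::17] = 9.0                              # a few "artifact" images with inflated variance
Y = X @ beta + rng.normal(size=(T, Nvox)) * np.sqrt(s2)[:, None]

# Ordinary least-squares residuals: r = (I - X (X^T X)^{-1} X^T) y = R y
R = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
res = R @ Y                                 # T x Nvox matrix of residuals

# Per-voxel noise variance and per-image variance scaling parameters
sigma2 = np.sum(res ** 2, axis=0) / (T - np.linalg.matrix_rank(X))
V_hat = np.diag(np.mean(res ** 2 / sigma2[None, :], axis=1))

# Weight each image by the inverse of its estimated variance
Vinv = np.linalg.inv(V_hat)
beta_w = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)   # p x Nvox weighted estimates
```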
“Normal equations” for weighted least squares
$$J = \frac{1}{N}\left(\mathbf{y} - X\mathbf{w}\right)^T P\left(\mathbf{y} - X\mathbf{w}\right)$$

$$\frac{dJ}{d\mathbf{w}} = \frac{1}{N}\left(-2X^T P\mathbf{y} + 2X^T P X\mathbf{w}\right) = 0$$

$$X^T P X\mathbf{w} = X^T P\mathbf{y}$$

Weighted Least Squares:

$$\hat{\mathbf{w}} = \left(X^T P X\right)^{-1}X^T P\mathbf{y}$$

Weighted LMS:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\,p^{(n)}\left(\mathbf{x}^{(n)}\mathbf{x}^{(n)T}\right)^{-1}\left(y^{(n)} - \mathbf{w}^{(n)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$
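A numpy sketch of the closed-form solution and of a per-sample weighted update (the data and weights are illustrative choices of mine; the incremental update below is the steepest-descent-style version with the weight folded in, since $\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$ itself is not invertible):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 300, 3
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.5])
noise_sd = np.where(rng.random(N) < 0.1, 5.0, 0.5)   # 10% of points are much noisier
y = X @ w_true + noise_sd * rng.normal(size=N)

p = 1.0 / noise_sd ** 2                    # weight each point by the inverse of its noise variance
P = np.diag(p)

# Normal equations: w = (X^T P X)^{-1} X^T P y
w_wls = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

# Per-sample weighted update: w += eta * p^(n) * (y^(n) - w.x^(n)) * x^(n)
eta = 0.05
w_lms = np.zeros(d)
for _ in range(100):
    for x_n, y_n, p_n in zip(X, y, p):
        w_lms = w_lms + eta * p_n * (y_n - w_lms @ x_n) * x_n

print(w_wls, w_lms)                        # both approach w_true
```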
Regression with basis functions
• In general, predictions can be based on a linear combination of a set of basis functions:

basis set: $g_1(\mathbf{x}),\ g_2(\mathbf{x}),\ \ldots,\ g_m(\mathbf{x})$, where each $g_i: \mathbb{R}^d \to \mathbb{R}$

$$f(\mathbf{x};\mathbf{w}) = w_0 + w_1 g_1(\mathbf{x}) + \cdots + w_m g_m(\mathbf{x})$$

Examples:

Linear basis set: $\quad g_i(\mathbf{x}) = x_i$

Gaussian basis set: $\quad g_i(\mathbf{x}) = \exp\left(-\tfrac{1}{2}\left(\mathbf{x} - \mathbf{p}_i\right)^T\left(\mathbf{x} - \mathbf{p}_i\right)\right)$

Each basis is a local expert. This measures how close the features of the input are to those preferred by expert $i$.

Radial basis set (RBF): $\quad g_i(\mathbf{x}) = \exp\left(-\tfrac{1}{2}\left(\mathbf{x} - \mathbf{p}_i\right)^T M^T M\left(\mathbf{x} - \mathbf{p}_i\right)\right)$
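A numpy sketch of fitting with a Gaussian basis set (the target function, the number of experts, and the width parameter sigma are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# 1-D toy problem: learn y = sin(x) with a set of Gaussian "experts"
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 0.05 * rng.normal(size=x.size)

centers = np.linspace(-3, 3, 10)            # preferred inputs p_i of the experts
sigma = 0.5                                 # width of each expert (an added parameter)

# g_i(x) = exp(-(x - p_i)^2 / (2 sigma^2)); a column of 1s carries w_0
G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2))
Xd = np.hstack([np.ones((x.size, 1)), G])

w = np.linalg.lstsq(Xd, y, rcond=None)[0]   # batch least squares on the basis outputs
y_hat = Xd @ w                              # smooth reconstruction of sin(x)
print(np.max(np.abs(y_hat - np.sin(x))))    # small fit error
```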
[Figure: the input space feeds a collection of experts (basis functions); the output is $f(\mathbf{x};\mathbf{w}) = w_0 + w_1 g_1(\mathbf{x}) + \cdots + w_m g_m(\mathbf{x})$.]
Regression with basis functions

$$\hat{\mathbf{y}} = X\mathbf{w}$$

$$J = \frac{1}{n}\left(\mathbf{y} - \hat{\mathbf{y}}\right)^T\left(\mathbf{y} - \hat{\mathbf{y}}\right) = \frac{1}{n}\left(\mathbf{y} - X\mathbf{w}\right)^T\left(\mathbf{y} - X\mathbf{w}\right)$$

$$\hat{\mathbf{w}} = \left(X^TX\right)^{-1}X^T\mathbf{y}$$

where

$$X = \begin{pmatrix} 1 & g_1\left(\mathbf{x}^{(1)}\right) & g_2\left(\mathbf{x}^{(1)}\right) & \cdots & g_m\left(\mathbf{x}^{(1)}\right) \\ 1 & g_1\left(\mathbf{x}^{(2)}\right) & g_2\left(\mathbf{x}^{(2)}\right) & \cdots & g_m\left(\mathbf{x}^{(2)}\right) \\ \vdots & & & & \vdots \\ 1 & g_1\left(\mathbf{x}^{(n)}\right) & g_2\left(\mathbf{x}^{(n)}\right) & \cdots & g_m\left(\mathbf{x}^{(n)}\right) \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}, \qquad \mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{pmatrix}$$
Choice of loss function
Define the error $\tilde{y} = y - \hat{y}$. Some possible loss functions:

$$Loss(\tilde{y}) = \begin{cases} 0 & \text{if } \tilde{y} = 0 \\ 1 & \text{if } \tilde{y} \neq 0 \end{cases}$$

$$Loss(\tilde{y}) = \left|\tilde{y}\right|$$

$$Loss(\tilde{y}) = \left|\tilde{y}\right|^{1.5}$$

$$Loss(\tilde{y}) = \tilde{y}^2$$

$$Loss(\tilde{y}) = 1 - \exp\left(-\frac{\tilde{y}^2}{2k^2}\right)$$

In learning, our aim is to find parameters $\mathbf{w}$ so as to minimize the expected loss:

$$\mathbf{w} = \arg\min_{\mathbf{w}} E\left[Loss(\tilde{y})\right]$$

$$E\left[Loss(\tilde{y})\right] = \int p\left(\tilde{y} \mid \mathbf{w}\right)\,Loss(\tilde{y})\,d\tilde{y}$$

Here $p\left(\tilde{y} \mid \mathbf{w}\right)$ is the probability density of the error, given our model parameters. The expected loss is a weighted sum: each loss value is weighted by the likelihood of observing that error.

[Plot: the loss functions $\left|\tilde{y}\right|$, $\left|\tilde{y}\right|^{1.5}$, and $\tilde{y}^2$ as a function of the error $\tilde{y}$.]
Inferring the choice of loss function from behavior
Kording & Wolpert, PNAS (2004)
A trial lasted 6 seconds. Over this period, a series of ‘peas’ appeared near the target, drawn from a distribution that depended on the finger position. The object was to “place the finger so that on average, the peas land as close as possible to the target”.
The delta loss function:

$$Loss(\tilde{y}) = \begin{cases} 0 & \text{if } \tilde{y} = 0 \\ 1 & \text{if } \tilde{y} \neq 0 \end{cases}$$

[Plot: the delta loss and two candidate error densities, $p\left(\tilde{y} \mid w_1\right)$ and $p\left(\tilde{y} \mid w_2\right)$, as functions of the error $\tilde{y}$.]

With this loss, the smallest possible expected loss occurs when $p(\tilde{y})$ has its peak at $\tilde{y} = 0$.

Imagine that the learner cannot arbitrarily change the density of the errors through learning. All the learner can do is shift the density left or right by setting the parameter $w$. If the learner uses this loss function, the best setting is

$$w = \arg\max_{w} \Pr\left(\tilde{y} = 0 \mid w\right)$$

Therefore, in the plot above, the choice of $w_2$ is better than $w_1$. In effect, the $w$ that the learner chooses will depend on the exact shape of $p(\tilde{y})$.
Behavior with the delta loss function

Suppose the error density is a mixture of two normal distributions whose locations shift together with the parameter $w$:

$$p\left(\tilde{y} \mid w\right) = (1-\alpha)\,p_1\left(\tilde{y} \mid w\right) + \alpha\,p_2\left(\tilde{y} \mid w\right)$$

$$p_1\left(\tilde{y} \mid w\right) = N\left(w,\ 0.2^2\right), \qquad p_2\left(\tilde{y} \mid w\right) = N\left(w - 0.2,\ (0.2/\alpha)^2\right)$$

Suppose the “outside” system (e.g., the teacher) sets $\alpha$. Given the loss function, we can predict what the best $w$ will be for the learner. With the delta loss

$$Loss(\tilde{y}) = \begin{cases} 0 & \text{if } \tilde{y} = 0 \\ 1 & \text{if } \tilde{y} \neq 0 \end{cases}$$

the best parameter is

$$w = \arg\max_{w}\ p\left(\tilde{y} = 0 \mid w\right)$$

[Plots: the mixture density $p\left(\tilde{y} \mid w\right)$, and the predicted best $w$ under the delta loss.]
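A small numpy sketch that computes the predicted best $w$ under the delta loss and, for comparison, under the squared error loss discussed on the next slide. It uses the mixture as reconstructed above; the exact distribution parameters of the original experiment may differ:

```python
import numpy as np

def npdf(y, mu, sd):
    """Normal probability density."""
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def p_err(y, w, a):
    """Error density as reconstructed above: a narrow component at w, a broader one at w - 0.2."""
    return (1.0 - a) * npdf(y, w, 0.2) + a * npdf(y, w - 0.2, 0.2 / a)

w_grid = np.linspace(-0.2, 0.4, 601)
for a in (0.3, 0.5, 0.8):
    w_delta = w_grid[np.argmax(p_err(0.0, w_grid, a))]   # delta loss: put the peak of p at 0
    w_squared = 0.2 * a                                   # squared loss: put the mean of p at 0
    print(a, round(float(w_delta), 3), round(w_squared, 3))
```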
Behavior with the squared error loss function

$$Loss(\tilde{y}) = \tilde{y}^2$$

$$w = \arg\min_{w} E\left[\tilde{y}^2 \mid w\right]$$

$$E\left[\tilde{y}^2 \mid w\right] = \int p\left(\tilde{y} \mid w\right)\,\tilde{y}^2\,d\tilde{y} = \mathrm{var}\left(\tilde{y} \mid w\right) + E\left[\tilde{y} \mid w\right]^2$$

The last step uses the identity

$$\mathrm{var}(x) = E\left[\left(x - E[x]\right)^2\right] = E\left[x^2 - 2xE[x] + E[x]^2\right] = E\left[x^2\right] - 2E[x]^2 + E[x]^2 = E\left[x^2\right] - E[x]^2$$

so that $E\left[x^2\right] = \mathrm{var}(x) + E[x]^2$.

We have a $p(\tilde{y})$ whose variance is independent of $w$ (shifting $w$ moves both components of the mixture together). So to minimize the expected loss, we should pick the $w$ that produces the smallest $E\left[\tilde{y} \mid w\right]^2$; that happens at the $w$ that sets the mean of $p(\tilde{y})$ equal to zero.

$$p\left(\tilde{y} \mid w\right) = (1-\alpha)\,N\left(w,\ 0.2^2\right) + \alpha\,N\left(w - 0.2,\ (0.2/\alpha)^2\right)$$

$$E\left[\tilde{y} \mid w\right] = (1-\alpha)\,w + \alpha\,(w - 0.2) = w - 0.2\,\alpha$$

$$\arg\min_{w} E\left[\tilde{y}^2 \mid w\right] = \arg\min_{w} E\left[\tilde{y} \mid w\right]^2 \quad\Rightarrow\quad w = 0.2\,\alpha$$
[Plot: predicted optimal $w$ (in cm) under the delta, $\left|\tilde{y}\right|^{1.5}$, and $\tilde{y}^2$ loss functions, together with data from typical subjects.]
Kording & Wolpert, PNAS (2004)
• Results: large errors are penalized by less than a squared term. The loss function was estimated to be approximately $Loss(\tilde{y}) = \left|\tilde{y}\right|^{1.75}$.
• However, note that the largest errors tended to occur very infrequently in this experiment.
Mean and variance of mixtures of normal distributions
$$p(x) = \alpha_1\,N\left(\mu_1, \sigma_1^2\right) + \alpha_2\,N\left(\mu_2, \sigma_2^2\right)$$

$$E[x] = \alpha_1\int x\,N\left(x;\mu_1,\sigma_1^2\right)dx + \alpha_2\int x\,N\left(x;\mu_2,\sigma_2^2\right)dx = \alpha_1\mu_1 + \alpha_2\mu_2$$

$$\mathrm{var}(x) = E\left[x^2\right] - E[x]^2$$

$$E\left[x^2\right] = \alpha_1\int x^2\,N\left(x;\mu_1,\sigma_1^2\right)dx + \alpha_2\int x^2\,N\left(x;\mu_2,\sigma_2^2\right)dx = \alpha_1\left(\sigma_1^2 + \mu_1^2\right) + \alpha_2\left(\sigma_2^2 + \mu_2^2\right)$$

$$\mathrm{var}(x) = \alpha_1\left(\sigma_1^2 + \mu_1^2\right) + \alpha_2\left(\sigma_2^2 + \mu_2^2\right) - \left(\alpha_1\mu_1 + \alpha_2\mu_2\right)^2$$
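A quick Monte Carlo check of these formulas (the mixture parameters below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
a1, mu1, s1 = 0.7, 1.0, 0.5
a2, mu2, s2 = 0.3, -2.0, 1.5

# Analytic moments of the mixture
mean = a1 * mu1 + a2 * mu2
var = a1 * (s1 ** 2 + mu1 ** 2) + a2 * (s2 ** 2 + mu2 ** 2) - mean ** 2

# Monte Carlo check: sample the component label, then the component
n = 1_000_000
from_first = rng.random(n) < a1
x = np.where(from_first, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))
print(mean, x.mean())    # ~0.1 in both cases
print(var, x.var())      # ~2.74 in both cases
```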