
  • The LMS Algorithm

    The LMS Objective Function.

    Global solution.

    Pseudoinverse of a matrix.

    Optimization (learning) by gradient descent.

    LMS or Widrow-Hoff Algorithms.

    Convergence of the Batch LMS Rule.

    Compiled by LaTeX at 3:50 PM on February 27, 2013.


  • LMS Objective Function

    Again we consider the problem of programming a linear threshold function

    [Figure: a linear threshold unit with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, bias $w_0$, and output $y$.]

    $y = \operatorname{sgn}(w^T x + w_0) = \begin{cases} +1, & \text{if } w^T x + w_0 > 0, \\ -1, & \text{if } w^T x + w_0 \le 0, \end{cases}$

    so that it agrees with a given dichotomy of $m$ feature vectors,

    $X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},$

    where $x_i \in \mathbb{R}^n$ and $\ell_i \in \{-1, +1\}$, for $i = 1, 2, \ldots, m$.

  • LMS Objective Function (cont.)

    Thus, given a dichotomy

    $X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\},$

    we seek a solution weight vector $w$ and bias $w_0$ such that

    $\operatorname{sgn}(w^T x_i + w_0) = \ell_i,$

    for $i = 1, 2, \ldots, m$. Equivalently, using homogeneous coordinates (or augmented feature vectors),

    $\operatorname{sgn}(\hat{w}^T \hat{x}_i) = \ell_i,$

    for $i = 1, 2, \ldots, m$, where $\hat{x}_i = (1, x_i^T)^T$. Using normalized coordinates,

    $\operatorname{sgn}(\hat{w}^T \hat{x}'_i) = 1, \quad \text{or,} \quad \hat{w}^T \hat{x}'_i > 0,$

    for $i = 1, 2, \ldots, m$, where $\hat{x}'_i = \ell_i \hat{x}_i$.

  • LMS Objective Function (cont.)

    Alternatively, let $b \in \mathbb{R}^m$ satisfy $b_i > 0$ for $i = 1, 2, \ldots, m$. We call $b$ a margin vector. Frequently we will assume that $b_i = 1$. Our learning criterion is certainly satisfied if

    $\hat{w}^T \hat{x}'_i = b_i$

    for $i = 1, 2, \ldots, m$. (Note that this condition is sufficient, but not necessary.) Equivalently, the above is satisfied if the expression

    $E(\hat{w}) = \sum_{i=1}^{m} \bigl(b_i - \hat{w}^T \hat{x}'_i\bigr)^2$

    equals zero. The above expression we call the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by $m$.)
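
    As a concrete illustration (not part of the original slides), the following NumPy sketch builds the normalized, augmented matrix and evaluates $E(\hat{w})$; the helper names and the toy data are assumptions made here for the example.

      import numpy as np

      def normalized_matrix(features, labels):
          """Rows are the normalized, augmented vectors x'_i = l_i * (1, x_i^T)^T."""
          m = features.shape[0]
          augmented = np.hstack([np.ones((m, 1)), features])   # prepend the bias input 1
          return labels[:, None] * augmented                    # multiply each row by its label

      def lms_objective(w_hat, X_norm, b):
          """E(w_hat) = sum_i (b_i - w_hat^T x'_i)^2."""
          residual = b - X_norm @ w_hat
          return float(residual @ residual)

      # Toy dichotomy (the same four points used in the worked example later on).
      features = np.array([[1., 2.], [2., 0.], [3., 1.], [2., 3.]])
      labels = np.array([1., 1., -1., -1.])
      X_norm = normalized_matrix(features, labels)
      b = np.ones(4)
      print(lms_objective(np.zeros(3), X_norm, b))   # at w_hat = 0 this is ||b||^2 = 4.0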

  • LMS Objective Function: Remarks

    Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc. And there is no guarantee that we can find a $\hat{w}^\star \in \mathbb{R}^{n+1}$ that satisfies $E(\hat{w}^\star) = 0$. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.

    Also, there is no unique objective function. The LMS expression,

    $E(\hat{w}) = \sum_{i=1}^{m} \bigl(b_i - \hat{w}^T \hat{x}'_i\bigr)^2,$

    can be replaced by numerous candidates, e.g.

    $\sum_{i=1}^{m} \bigl|b_i - \hat{w}^T \hat{x}'_i\bigr|, \quad \sum_{i=1}^{m} \bigl(1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i)\bigr), \quad \sum_{i=1}^{m} \bigl(1 - \operatorname{sgn}(\hat{w}^T \hat{x}'_i)\bigr)^2, \quad \text{etc.}$

    However, minimizing $E(\hat{w})$ is generally an easy task.

  • Minimizing the LMS Objective Function

    Inspection suggests that the LMS objective function

    $E(\hat{w}) = \sum_{i=1}^{m} \bigl(b_i - \hat{w}^T \hat{x}'_i\bigr)^2$

    describes a parabolic function. It may have a unique global minimum, or an infinite number of global minima which occupy a connected linear set. (The latter can occur if $m < n+1$.) Letting,

    $X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \begin{pmatrix} \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\ \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},$

    then,

    $E(\hat{w}) = \sum_{i=1}^{m} \bigl(b_i - \hat{x}'^T_i \hat{w}\bigr)^2 = \left\| \begin{pmatrix} b_1 - \hat{x}'^T_1 \hat{w} \\ b_2 - \hat{x}'^T_2 \hat{w} \\ \vdots \\ b_m - \hat{x}'^T_m \hat{w} \end{pmatrix} \right\|^2 = \left\| b - \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} \hat{w} \right\|^2 = \| b - X \hat{w} \|^2.$

  • Minimizing the LMS Objective Function (cont.)

    $E(\hat{w}) = \| b - X \hat{w} \|^2 = (b - X\hat{w})^T (b - X\hat{w}) = (b^T - \hat{w}^T X^T)(b - X\hat{w})$
    $\qquad = \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2.$

    As an aside, note that

    $X^T X = (\hat{x}'_1, \ldots, \hat{x}'_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1) \times (n+1)},$

    $b^T X = (b_1, \ldots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.$
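
    The two sum identities above are easy to sanity-check numerically. This small sketch (my own; it reuses the normalized rows from the worked example below) compares the matrix products with the explicit sums.

      import numpy as np

      X = np.array([[1., 1., 2.], [1., 2., 0.], [-1., -3., -1.], [-1., -2., -3.]])
      b = np.ones(4)

      # X^T X as a sum of outer products x'_i x'_i^T
      outer_sum = sum(np.outer(x, x) for x in X)
      print(np.allclose(X.T @ X, outer_sum))         # True

      # b^T X as a sum of weighted rows b_i x'_i^T
      row_sum = sum(b_i * x for b_i, x in zip(b, X))
      print(np.allclose(b @ X, row_sum))             # True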

  • Minimizing the LMS Objective Function (cont.)

    Okay, so how do we minimize

    $E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2?$

    Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$, and algebraically determine a value of $\hat{w}^\star$ which makes each component vanish. That is, solve

    $\nabla E(\hat{w}) = \begin{pmatrix} \partial E / \partial \hat{w}_0 \\ \partial E / \partial \hat{w}_1 \\ \vdots \\ \partial E / \partial \hat{w}_n \end{pmatrix} = 0.$

    It is straightforward to show that

    $\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b.$

  • Minimizing the LMS Objective Function (cont.)

    Thus,

    $\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0$

    if

    $\hat{w}^\star = (X^T X)^{-1} X^T b = X^{\dagger} b,$

    where the matrix,

    $X^{\dagger} \overset{\text{def}}{=} (X^T X)^{-1} X^T \in \mathbb{R}^{(n+1) \times m},$

    is called the pseudoinverse of $X$. If $X^T X$ is singular, one defines

    $X^{\dagger} \overset{\text{def}}{=} \lim_{\epsilon \to 0} (X^T X + \epsilon I)^{-1} X^T.$

    Observe that if $X^T X$ is nonsingular,

    $X^{\dagger} X = (X^T X)^{-1} X^T X = I.$
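
    A minimal numerical sketch of the pseudoinverse (my own illustration, not from the slides): for a full-column-rank $X$, the formula $(X^T X)^{-1} X^T$, the regularized limit $(X^T X + \epsilon I)^{-1} X^T$ with small $\epsilon$, and NumPy's SVD-based np.linalg.pinv all agree, and $X^{\dagger} X = I$.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.standard_normal((6, 3))                 # a generic m x (n+1) matrix, m > n+1

      X_pinv = np.linalg.inv(X.T @ X) @ X.T           # (X^T X)^{-1} X^T
      eps = 1e-10
      X_pinv_limit = np.linalg.inv(X.T @ X + eps * np.eye(3)) @ X.T

      print(np.allclose(X_pinv, np.linalg.pinv(X)))           # True: matches the SVD-based pinv
      print(np.allclose(X_pinv, X_pinv_limit, atol=1e-6))     # True: the limit definition agrees
      print(np.allclose(X_pinv @ X, np.eye(3)))               # True: X+ X = I here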

  • Example

    The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241. Given the dichotomy,

    $X_4 = \bigl\{ \bigl((1, 2)^T, +1\bigr),\ \bigl((2, 0)^T, +1\bigr),\ \bigl((3, 1)^T, -1\bigr),\ \bigl((2, 3)^T, -1\bigr) \bigr\},$

    we obtain,

    $X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.$

    Whence,

    $X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix}, \quad \text{and,} \quad X^{\dagger} = (X^T X)^{-1} X^T = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}.$

  • Example (cont.)

    Letting $b = (1, 1, 1, 1)^T$, then

    $\hat{w} = X^{\dagger} b = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 11/3 \\ -4/3 \\ -2/3 \end{pmatrix}.$

    Whence,

    $w_0 = \tfrac{11}{3} \quad \text{and} \quad w = \bigl(-\tfrac{4}{3},\ -\tfrac{2}{3}\bigr)^T.$

    [Figure: the four training points and the resulting decision boundary $w^T x + w_0 = 0$ in the $(x_1, x_2)$ plane.]
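
    The numbers on the last two slides can be reproduced with a few lines of NumPy. The sketch below (variable names are mine) recomputes $X^T X$, $X^{\dagger}$, and $\hat{w}$ for this dichotomy, and checks that the resulting LTU classifies all four points correctly.

      import numpy as np

      # Normalized, augmented rows x'_i = l_i * (1, x_i^T)^T for the four points above.
      X = np.array([[ 1.,  1.,  2.],
                    [ 1.,  2.,  0.],
                    [-1., -3., -1.],
                    [-1., -2., -3.]])
      b = np.ones(4)

      print(X.T @ X)                      # [[4, 8, 6], [8, 18, 11], [6, 11, 14]]
      X_pinv = np.linalg.inv(X.T @ X) @ X.T
      print(X_pinv)                       # rows approx (5/4, 13/12, 3/4, 7/12), (-1/2, -1/6, -1/2, -1/6), (0, -1/3, 0, -1/3)
      w_hat = X_pinv @ b
      print(w_hat)                        # approx (11/3, -4/3, -2/3): w0 = 11/3, w = (-4/3, -2/3)^T
      print(X @ w_hat)                    # all margins positive, so every point is classified correctly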

  • Method of Steepest Descent

    An alternative approach is the method of steepest descent.

    We begin by representing Taylor's theorem for functions of more than one variable: let $x \in \mathbb{R}^n$, and $f : \mathbb{R}^n \to \mathbb{R}$, so

    $f(x) = f(x_1, x_2, \ldots, x_n) \in \mathbb{R}.$

    Now let $\Delta x \in \mathbb{R}^n$, and consider

    $f(x + \Delta x) = f(x_1 + \Delta x_1, \ldots, x_n + \Delta x_n).$

    Define $F : \mathbb{R} \to \mathbb{R}$, such that,

    $F(s) = f(x + s\, \Delta x).$

    Thus,

    $F(0) = f(x), \quad \text{and,} \quad F(1) = f(x + \Delta x).$

  • Method of Steepest Descent (cont.)

    Taylor's theorem for a single variable (Math 21/22),

    $F(s) = F(0) + \tfrac{1}{1!} F'(0)\, s + \tfrac{1}{2!} F''(0)\, s^2 + \tfrac{1}{3!} F'''(0)\, s^3 + \cdots.$

    Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \Delta x)$, $F(0)$ by $f(x)$, etc. To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,

    $\frac{d}{ds} f\bigl(u(s), v(s)\bigr) = \frac{\partial f}{\partial u}(u, v)\, u'(s) + \frac{\partial f}{\partial v}(u, v)\, v'(s).$

    Thus,

    $F'(s) = \frac{dF}{ds}(s) = \frac{d}{ds} f(x_1 + s\, \Delta x_1, \ldots, x_n + s\, \Delta x_n)$
    $\qquad = \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \frac{d}{ds}(x_1 + s\, \Delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \frac{d}{ds}(x_n + s\, \Delta x_n)$
    $\qquad = \frac{\partial f}{\partial x_1}(x + s\, \Delta x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \Delta x)\, \Delta x_n.$

  • Method of Steepest Descent (cont.)

    Thus,

    $F'(0) = \frac{\partial f}{\partial x_1}(x)\, \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\, \Delta x_n = \nabla f(x)^T \Delta x.$

  • Method of Steepest Descent (cont.)

    Thus, it is possible to show

    $f(x + \Delta x) = f(x) + \nabla f(x)^T \Delta x + O\bigl(\|\Delta x\|^2\bigr) = f(x) + \|\nabla f(x)\|\, \|\Delta x\| \cos\theta + O\bigl(\|\Delta x\|^2\bigr),$

    where $\theta$ defines the angle between $\nabla f(x)$ and $\Delta x$. If $\|\Delta x\| \ll 1$, then

    $\Delta f = f(x + \Delta x) - f(x) \approx \|\nabla f(x)\|\, \|\Delta x\| \cos\theta.$

    Thus, the greatest reduction $\Delta f$ occurs if $\cos\theta = -1$, that is if $\Delta x = -\eta\, \nabla f$, where $\eta > 0$. We thus seek a local minimum of the LMS objective function by taking a sequence of steps

    $\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\bigl(\hat{w}(t)\bigr).$
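
    Before specializing to the LMS objective, here is a generic steepest-descent loop, written as a sketch (the function and parameter names, step size, and stopping rule are my own choices): it needs only a gradient oracle and a step size $\eta$.

      import numpy as np

      def steepest_descent(grad, w0, eta=0.1, max_steps=1000, tol=1e-10):
          """Iterate w(t+1) = w(t) - eta * grad(w(t)) until the step becomes negligible."""
          w = np.asarray(w0, dtype=float)
          for _ in range(max_steps):
              step = eta * grad(w)
              w = w - step
              if np.linalg.norm(step) < tol:
                  break
          return w

      # Example: minimize f(w) = ||w - c||^2, whose gradient is 2 (w - c).
      c = np.array([1.0, -2.0])
      print(steepest_descent(lambda w: 2 * (w - c), np.zeros(2)))   # converges to c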

  • Training an LTU using Steepest Descent

    We now return to our original problem. Given a dichotomy

    $X_m = \{(x_1, \ell_1), \ldots, (x_m, \ell_m)\}$

    of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, +1\}$ for $i = 1, \ldots, m$, we construct the set of normalized, augmented feature vectors

    $\hat{X}'_m = \bigl\{ (\ell_i, \ell_i x_{i,1}, \ldots, \ell_i x_{i,n})^T \in \mathbb{R}^{n+1} \mid i = 1, \ldots, m \bigr\}.$

    Given a margin vector, $b \in \mathbb{R}^m$, with $b_i > 0$ for $i = 1, \ldots, m$, we construct the LMS objective function,

    $E(\hat{w}) = \frac{1}{2} \sum_{i=1}^{m} \bigl(\hat{w}^T \hat{x}'_i - b_i\bigr)^2 = \frac{1}{2} \| X \hat{w} - b \|^2$

    (the factor of $\frac{1}{2}$ is added with some foresight), and then evaluate its gradient

    $\nabla E(\hat{w}) = \sum_{i=1}^{m} \bigl(\hat{w}^T \hat{x}'_i - b_i\bigr) \hat{x}'_i = X^T (X \hat{w} - b).$

  • Training an LTU using Steepest Descent (cont.)

    Substitution into the steepest descent update rule,

    $\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\bigl(\hat{w}(t)\bigr),$

    yields the batch LMS update rule,

    $\hat{w}(t+1) = \hat{w}(t) + \eta \sum_{i=1}^{m} \bigl(b_i - \hat{w}(t)^T \hat{x}'_i\bigr) \hat{x}'_i.$

    Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:

    $\hat{w}(t+1) = \hat{w}(t) + \eta \bigl(b - \hat{w}(t)^T \hat{x}'(t)\bigr) \hat{x}'(t),$

    where $\hat{x}'(t) \in \hat{X}'_m$ is the element of the dichotomy that is presented to the LTU at epoch $t$. (Here, we assume that $b$ is fixed; otherwise, replace it by $b(t)$.)
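
    The batch rule transcribes directly into code. This sketch (my own; the step size and epoch count are arbitrary choices) applies $\hat{w}(t+1) = \hat{w}(t) + \eta\, X^T (b - X \hat{w}(t))$ to the example dichotomy from the pseudoinverse slides.

      import numpy as np

      def batch_lms(X, b, eta=0.06, epochs=1000):
          """Batch LMS: w(t+1) = w(t) + eta * X^T (b - X w(t))."""
          w = np.zeros(X.shape[1])
          for _ in range(epochs):
              w = w + eta * (X.T @ (b - X @ w))
          return w

      X = np.array([[1., 1., 2.], [1., 2., 0.], [-1., -3., -1.], [-1., -2., -3.]])
      b = np.ones(4)
      print(batch_lms(X, b))    # approx (3.667, -1.333, -0.667), the pseudoinverse solution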

  • Sequential LMS Rule

    The sequential LMS rule

    $\hat{w}(t+1) = \hat{w}(t) + \eta \bigl[ b - \hat{w}(t)^T \hat{x}'(t) \bigr] \hat{x}'(t),$

    resembles the sequential perceptron rule,

    $\hat{w}(t+1) = \hat{w}(t) + \frac{\eta}{2} \bigl[ 1 - \operatorname{sgn}\bigl(\hat{w}(t)^T \hat{x}'(t)\bigr) \bigr] \hat{x}'(t).$

    Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e. the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.
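
    A sketch of the sequential rule on the same example (my own code; the fixed step size and the number of passes are arbitrary choices, and samples are simply cycled in order rather than drawn at random).

      import numpy as np

      def sequential_lms(X, b, eta=0.05, passes=2000):
          """Widrow-Hoff: after each presentation, w += eta * (b_i - w^T x'_i) * x'_i."""
          w = np.zeros(X.shape[1])
          for _ in range(passes):
              for x_i, b_i in zip(X, b):
                  w = w + eta * (b_i - w @ x_i) * x_i
          return w

      X = np.array([[1., 1., 2.], [1., 2., 0.], [-1., -3., -1.], [-1., -2., -3.]])
      b = np.ones(4)
      print(sequential_lms(X, b))   # settles on approx (3.667, -1.333, -0.667) for this dichotomy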

  • Convergence of the batch LMS rule

    Recall, the LMS objective function in the form

    $E(\hat{w}) = \frac{1}{2} \| X \hat{w} - b \|^2,$

    has as its gradient,

    $\nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T (X \hat{w} - b).$

    Substitution into the rule of steepest descent,

    $\hat{w}(t+1) = \hat{w}(t) - \eta\, \nabla E\bigl(\hat{w}(t)\bigr),$

    yields,

    $\hat{w}(t+1) = \hat{w}(t) - \eta\, X^T \bigl(X \hat{w}(t) - b\bigr).$

  • Convergence of the batch LMS rule (cont.)

    The algorithm is said to converge to a fixed point $\hat{w}^\star$ if, for every finite initial value $\hat{w}(0)$, $\hat{w}(t) \to \hat{w}^\star$ as $t \to \infty$. Since $\hat{w}^\star = X^{\dagger} b$ satisfies $X^T X \hat{w}^\star = X^T b$, subtracting $\hat{w}^\star$ from both sides of the batch update gives

    $\Delta\hat{w}(t+1) = \bigl(I - \eta\, X^T X\bigr)\, \Delta\hat{w}(t), \quad \text{where } \Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star.$

  • Convergence of the batch LMS rule (cont.)

    Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that

    $\| \Delta\hat{w}(t+1) \| < \| \Delta\hat{w}(t) \|.$

    Inspecting the update rule,

    $\Delta\hat{w}(t+1) = \bigl(I - \eta\, X^T X\bigr)\, \Delta\hat{w}(t),$

    this reduces to the condition that all the eigenvalues of

    $I - \eta\, X^T X$

    have magnitudes less than 1.

    We will now evaluate the eigenvalues of the above matrix.

  • Convergence of the batch LMS rule (cont.)

    Let $S \in \mathbb{R}^{(n+1) \times (n+1)}$ denote the similarity transform that reduces the symmetric matrix $X^T X$ to a diagonal matrix $\Lambda \in \mathbb{R}^{(n+1) \times (n+1)}$. Thus, $S^T S = S S^T = I$, and

    $S\, (X^T X)\, S^T = \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n).$

    The numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ represent the eigenvalues of $X^T X$. Note that $0 \le \lambda_i$ for $i = 0, 1, \ldots, n$. Thus,

    $S\, \Delta\hat{w}(t+1) = S \bigl(I - \eta\, X^T X\bigr) S^T\, S\, \Delta\hat{w}(t) = (I - \eta\, \Lambda)\, S\, \Delta\hat{w}(t).$

    Thus convergence occurs if

    $\| S\, \Delta\hat{w}(t+1) \| < \| S\, \Delta\hat{w}(t) \|,$

    which occurs if the eigenvalues of $I - \eta\, \Lambda$ all have magnitudes less than one.

  • Convergence of the batch LMS rule (cont.)

    The eigenvalues of $I - \eta\, \Lambda$ equal $1 - \eta \lambda_i$, for $i = 0, 1, \ldots, n$. (These are in fact the eigenvalues of $I - \eta\, X^T X$.) Thus, we require that

    $-1 < 1 - \eta \lambda_i < 1, \quad \text{or} \quad 0 < \eta \lambda_i < 2,$

    for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that

    $0 < \eta < \frac{2}{\lambda_{\max}}.$
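
    The bound is easy to check numerically. This closing sketch (my own) computes $\lambda_{\max}$ for the example dichotomy and contrasts a step size just inside the bound with one just outside it.

      import numpy as np

      X = np.array([[1., 1., 2.], [1., 2., 0.], [-1., -3., -1.], [-1., -2., -3.]])
      b = np.ones(4)

      lam_max = np.linalg.eigvalsh(X.T @ X).max()
      print(lam_max, 2.0 / lam_max)            # approx 30.9 and 0.065

      def residual_after_batch_lms(eta, steps=1000):
          w = np.zeros(3)
          for _ in range(steps):
              w = w + eta * (X.T @ (b - X @ w))
          return np.linalg.norm(X @ w - b)

      print(residual_after_batch_lms(0.06))    # inside the bound: the residual decays toward 0
      print(residual_after_batch_lms(0.07))    # outside the bound: the iteration blows up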