
  • EE269 Signal Processing for Machine Learning

    Lecture 13

    Instructor: Mert Pilanci

    Stanford University

    February 27, 2019

  • Gaussian regression models

    - $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
    - training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

      $p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$

    - $P(w)$: prior probability of $w$
    - infinitely many classes parametrized by $w$
    - $\max_w P(y_1, \ldots, y_n, w \mid x_1, \ldots, x_n) = P(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w, b)\, P(w)$
    - independent observations: $= \prod_{i=1}^n P(y_i \mid x_i, w, b)\, P(w)$
    - Maximum a Posteriori (MAP) estimate $w_{\mathrm{MAP}}$

      $w_{\mathrm{MAP}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w, b)\, P(w)$

      $\phantom{w_{\mathrm{MAP}}} = \arg\max_w \sum_{i=1}^n \log P(y_i \mid x_i, w, b) + \log P(w)$

  • Gaussian regression models

    - $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
    - training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

      $p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$

      $w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^n \log P(y_i \mid x_i, w, b) + \log P(w)$

    - Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$ (the log terms are expanded right after this slide)

      $w_{\mathrm{MAP}} = \arg\max_w \; -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^T x_i - b)^2 - \frac{1}{2t^2} \|w\|_2^2$
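
For readability, here is a brief expansion of the two log terms (my addition, not on the original slide), showing where the quadratic objective above comes from; additive constants are dropped.

```latex
% Gaussian likelihood term:
\log P(y_i \mid x_i, w, b)
  = \log\!\Big(\tfrac{1}{\sqrt{2\pi}\,\sigma}\Big)
    - \frac{(y_i - w^T x_i - b)^2}{2\sigma^2}
  = -\frac{(y_i - w^T x_i - b)^2}{2\sigma^2} + \text{const}

% Gaussian prior term, with P(w) = \mathcal{N}(0, t^2 I):
\log P(w) = -\frac{\|w\|_2^2}{2 t^2} + \text{const}
```

Summing the likelihood terms over $i$ and taking the argmax gives exactly the objective on this slide.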

  • Gaussian regression models

    - $y \in \mathbb{R}$ continuous label and $x \in \mathbb{R}^d$
    - training set $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$

      $p(y \mid x, w, b) = \mathcal{N}(w^T x + b, \sigma^2)$

      $w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^n \log P(y_i \mid x_i, w, b) + \log P(w)$

    - Gaussian prior on $w$: $P(w) = \mathcal{N}(0, t^2 I)$

      $w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^n (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t^2} \|w\|_2^2$

    - $\ell_2$ regularization (Ridge regression); see the sketch after this slide
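
A minimal numpy sketch (not from the slides) of the MAP/ridge solution above, on synthetic data of my choosing. It appends a constant feature for the bias $b$, leaves that column unpenalized, and uses the closed form with $\lambda = \sigma^2/t^2$.

```python
import numpy as np

# Synthetic data (assumption: not from the lecture).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true, b_true, sigma, t = np.array([1.0, -2.0, 0.5]), 0.3, 0.1, 1.0
y = X @ w_true + b_true + sigma * rng.normal(size=n)

# MAP with Gaussian prior on w (bias b left unpenalized):
# minimize sum_i (y_i - w^T x_i - b)^2 + (sigma^2 / t^2) * ||w||_2^2
lam = sigma**2 / t**2
Xb = np.hstack([X, np.ones((n, 1))])          # append a constant column for b
R = lam * np.eye(d + 1)
R[-1, -1] = 0.0                               # do not regularize the bias term
wb_map = np.linalg.solve(Xb.T @ Xb + R, Xb.T @ y)
w_map, b_map = wb_map[:d], wb_map[-1]
print(w_map, b_map)
```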

  • Gaussian regression models

    - Laplace prior $P(w) \propto e^{-\frac{|w_1|}{t}} \cdots e^{-\frac{|w_d|}{t}}$

      $w_{\mathrm{MAP}} = \arg\min_w \sum_{i=1}^n (y_i - w^T x_i - b)^2 + \frac{\sigma^2}{t} \|w\|_1$

    - $\ell_1$ regularization (Lasso); a proximal-gradient sketch follows this slide
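
Unlike ridge, the Lasso objective has no closed form. Below is a minimal ISTA (proximal gradient) sketch of my own, one standard solver for this objective; the synthetic data, step size, and omission of the bias term are my assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by proximal gradient (ISTA).
    The slides only state the objective; ISTA is used here as an illustration
    (bias term omitted for simplicity)."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy usage (synthetic sparse problem, my assumption): a sparse w is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.05 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=1.0), 2))
```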

  • $\ell_2$ regularization (Ridge regression)

      $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \qquad (1)$

  • $\ell_1$ regularization (Lasso)

      $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1 \qquad (2)$

  • Exponential density $e^{-\frac{|w|}{t}}$ vs Gaussian density $e^{-\frac{w^2}{t^2}}$

  • Least Squares Regression and Duality

    - in matrix-vector notation (redefine $\lambda \leftarrow n\lambda$)

      $\min_w \frac{1}{2}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

    - equivalent constrained problem

      $\min_{z, w \,:\, Xw = z} \; \frac{1}{2}\|z - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

    - Dual problem:

      $\max_\alpha \; -\alpha^T \Big(\frac{1}{2\lambda} X X^T + \frac{1}{2} I\Big) \alpha + \alpha^T y$

      KKT conditions imply $w^\ast = \frac{1}{\lambda} X^T \alpha^\ast$

      We can solve the dual in closed form: $\alpha^\ast = \big(\frac{1}{\lambda} X X^T + I\big)^{-1} y$ (see the numerical check after this slide)
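
A quick numerical check of my own (synthetic data, hyperparameters chosen arbitrarily) that the dual closed form above recovers the primal ridge solution via $w^\ast = \frac{1}{\lambda} X^T \alpha^\ast$.

```python
import numpy as np

# Check that (1/lam) X^T alpha* equals the primal solution (X^T X + lam I)^{-1} X^T y.
rng = np.random.default_rng(1)
n, d, lam = 30, 5, 0.7
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
alpha_star = np.linalg.solve(X @ X.T / lam + np.eye(n), y)
w_dual = X.T @ alpha_star / lam

print(np.allclose(w_primal, w_dual))   # True
```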

  • Dual Least Squares Regression Problem

    Dual problem:

      $\max_\alpha \; -\alpha^T \Big(\frac{1}{2\lambda} X X^T + \frac{1}{2} I\Big) \alpha + \alpha^T y$

      KKT conditions imply $w^\ast = \frac{1}{\lambda} X^T \alpha^\ast$

      We can solve the dual in closed form: $\alpha^\ast = \big(\frac{1}{\lambda} X X^T + I\big)^{-1} y$

    - Given a test sample $x$, the prediction is

      $w^{\ast T} x = \Big(\frac{1}{\lambda} \sum_{i=1}^n x_i \alpha_i^\ast\Big)^T x = \frac{1}{\lambda} \sum_{i=1}^n \langle x_i, x \rangle \alpha_i^\ast$

    - Kernel map $x \to \phi(x)$, kernel $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$, and kernel matrix $K_{ij} = \kappa(x_i, x_j)$
    - Dual solution $\alpha^\ast = \big(\frac{1}{\lambda} K + I\big)^{-1} y$
    - Prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^n \kappa(x_i, x) \alpha_i^\ast$ (see the sketch after this slide)
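
A minimal sketch of the kernelized dual predictor above, written by me; the helper name `kernel_ridge_fit_predict`, the synthetic data, and the linear-kernel sanity check are all assumptions, not from the slides.

```python
import numpy as np

def kernel_ridge_fit_predict(kappa, X_train, y_train, X_test, lam):
    """alpha* = (K/lam + I)^{-1} y, prediction f_hat(x) = (1/lam) sum_i kappa(x_i, x) alpha_i*."""
    K = kappa(X_train, X_train)                  # n x n kernel matrix, K_ij = kappa(x_i, x_j)
    alpha = np.linalg.solve(K / lam + np.eye(len(y_train)), y_train)
    K_test = kappa(X_test, X_train)              # m x n cross-kernel between test and train
    return K_test @ alpha / lam

# With a linear kernel, this reduces to ordinary (primal) ridge regression.
linear_kernel = lambda A, B: A @ B.T

rng = np.random.default_rng(2)
X_tr, X_te = rng.normal(size=(40, 3)), rng.normal(size=(5, 3))
y_tr = X_tr @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=40)
print(kernel_ridge_fit_predict(linear_kernel, X_tr, y_tr, X_te, lam=0.5))
```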

  • Kernel Regression Application

    - polynomial kernel $\kappa(x, y) = (1 + x^T y)^4$
    - prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^n \kappa(x_i, x) \alpha_i^\ast$

  • Kernel Regression Application

    - Gaussian kernel $\kappa(x, y) = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}$
    - prediction $\hat{f}(x) = \frac{1}{\lambda} \sum_{i=1}^n \kappa(x_i, x) \alpha_i^\ast$ (a toy comparison of the two kernels follows this slide)
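
A toy 1-D comparison of my own of the polynomial and Gaussian kernels on the last two slides, using the dual/kernel predictor; the data, bandwidth, and $\lambda$ are assumptions.

```python
import numpy as np

# f_hat(x) = (1/lam) sum_i kappa(x_i, x) alpha_i*,  alpha* = (K/lam + I)^{-1} y
poly = lambda A, B: (1.0 + A @ B.T) ** 4                 # (1 + x^T y)^4
def gauss(A, B, sigma=0.5):                              # exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, size=30))[:, None]
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=30)
x_test = np.linspace(-1, 1, 7)[:, None]
lam = 0.1
for name, kappa in [("poly", poly), ("gauss", gauss)]:
    alpha = np.linalg.solve(kappa(x, x) / lam + np.eye(len(y)), y)
    print(name, np.round(kappa(x_test, x) @ alpha / lam, 2))
```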

  • Reproducing Kernel Hilbert Space

    - Mercer's Theorem: any positive definite kernel function can be represented in terms of eigenfunctions

      $\kappa(x, y) = \sum_{i=1}^{\infty} \sigma_i \phi_i(x) \phi_i(y)$

    - The functions $\phi_i(x)$ form an orthonormal basis for a function space (a finite-sample analogue follows this slide)

      $\mathcal{H}_\kappa = \Big\{ f : f(x) = \sum_{i=1}^{\infty} f_i \phi_i(x), \; \sum_{i=1}^{\infty} \frac{f_i^2}{\sigma_i} < \infty \Big\}$
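
A finite-sample analogue of the Mercer expansion, added by me as an illustration: the Gram matrix of a positive definite kernel decomposes as $K = \sum_i \sigma_i v_i v_i^T$ with $\sigma_i \ge 0$, mirroring $\kappa(x, y) = \sum_i \sigma_i \phi_i(x)\phi_i(y)$. The sample points and bandwidth are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=(20, 1))
K = np.exp(-(x - x.T) ** 2 / (2 * 0.5**2))     # Gaussian kernel Gram matrix

sigma, V = np.linalg.eigh(K)                   # eigenvalues in ascending order
print(np.all(sigma > -1e-10))                  # positive semidefinite (up to roundoff)
K_rebuilt = (V * sigma) @ V.T                  # sum_i sigma_i v_i v_i^T
print(np.allclose(K, K_rebuilt))
```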


  • Representer Theorem in Reproducing Kernel Hilbert Space

      $(\ast) \quad \min_{f \in \mathcal{H}_\kappa} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_\kappa}^2$

    - Representer theorem: the optimal solution must have the form $f^\ast(x) = \sum_{i=1}^n \alpha_i \kappa(x, x_i)$
    - Plugging in and applying the reproducing property $\langle \kappa(x_i, \cdot), \kappa(x_j, \cdot) \rangle_{\mathcal{H}} = \kappa(x_i, x_j)$, we get

      $(\ast) = \min_\alpha \|K\alpha - y\|_2^2 + \lambda \alpha^T K \alpha$

    - solution $\alpha^\ast = (K + \lambda I)^{-1} y$
    - prediction $\hat{f}(x) = \sum_{i=1}^n \alpha_i^\ast \kappa(x, x_i)$
    - the same prediction rule is obtained with dual ridge regression (verified numerically after this slide)
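
A sketch of my own (synthetic 1-D data, Gaussian kernel, arbitrary hyperparameters) checking the last bullet: the representer-theorem solution $\alpha^\ast = (K + \lambda I)^{-1} y$ gives the same predictions as the dual ridge form $\alpha = (K/\lambda + I)^{-1} y$ with $\hat f(x) = \frac{1}{\lambda}\sum_i \kappa(x_i, x)\alpha_i$.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=(25, 1))
y = np.cos(2 * x[:, 0]) + 0.05 * rng.normal(size=25)
x_test = np.linspace(-1, 1, 5)[:, None]
lam, s = 0.2, 0.4

kappa = lambda A, B: np.exp(-((A - B.T) ** 2) / (2 * s**2))   # 1-D Gaussian kernel
K, K_test = kappa(x, x), kappa(x_test, x)

alpha_rep = np.linalg.solve(K + lam * np.eye(len(y)), y)      # representer theorem
alpha_dual = np.linalg.solve(K / lam + np.eye(len(y)), y)     # dual ridge
print(np.allclose(K_test @ alpha_rep, K_test @ alpha_dual / lam))   # True
```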


  • Example: Gaussian Kernel

    - Gaussian kernel $\kappa(x, y) = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}$

      $\kappa(x, y) = \sum_{i=1}^{\infty} \sigma_i \phi_i(x) \phi_i(y)$

      $\phi_i(x) \propto e^{-(c - a) x^2} H_i(x \sqrt{2c})$ and $\sigma_i = b^i$

      where $a$, $b$, $c$ are functions of $\sigma$ and $H_i$ is the $i$th-order Hermite polynomial (an empirical check of the eigenvalue decay follows this slide)
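
An empirical illustration I added (sample points and bandwidth are assumptions): the eigenvalues of a Gaussian-kernel Gram matrix decay roughly geometrically, consistent with $\sigma_i = b^i$ above.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(200, 1))
K = np.exp(-(x - x.T) ** 2 / (2 * 1.0**2))       # Gaussian kernel Gram matrix

eigvals = np.linalg.eigvalsh(K)[::-1]            # eigenvalues, descending
ratios = eigvals[1:8] / eigvals[:7]              # successive ratios, roughly a constant b
print(np.round(ratios, 3))
```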

  • Example: Gaussian Kernel

    - Gaussian kernel $\kappa(x, y) = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}$

      $f(x) = \sum_{i=1}^n \alpha_i \kappa(x_i, x) = \sum_{i=1}^n \sum_{j=1}^{\infty} \alpha_i \sigma_j \phi_j(x_i) \phi_j(x) = \sum_{j=1}^{\infty} f_j \sqrt{\sigma_j} \phi_j(x)$

      where $f_j = \sum_{i=1}^n \alpha_i \sqrt{\sigma_j} \phi_j(x_i)$

      $\min_{f \in \mathcal{H}_\kappa} \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}_\kappa}^2$

    - For a function $h(x) = \sum_{i=1}^{\infty} h_i \phi_i(x)$,

      $\|h\|_{\mathcal{H}_\kappa}^2 = \langle h, h \rangle_{\mathcal{H}_\kappa} = \sum_{i=1}^{\infty} \frac{h_i^2}{\sigma_i}$

      enforces smoothness by penalizing rough eigenfunctions (small $\sigma_i$).


  • Example: Sobolev Kernel (one dimensional signals)

      $\mathcal{H}_\kappa = \Big\{ f : [0, 1] \to \mathbb{R} \;\Big|\; f(0) = 0, \text{ abs. continuous}, \int_0^1 |f'(t)|^2 \, dt < \infty \Big\}$

    - absolutely continuous: $f'(t)$ exists almost everywhere
    - $\mathcal{H}_\kappa$ is a Reproducing Kernel Hilbert Space with kernel

      $\kappa(x, y) = \min(x, y)$ (a small interpolation sketch follows this slide)
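
A small sketch of my own using kernel ridge regression with $\kappa(x, y) = \min(x, y)$ on a toy 1-D signal (the data and $\lambda$ are assumptions). With this kernel the fit $\hat f(t) = \sum_i \alpha_i \min(t, x_i)$ is piecewise linear with knots at the samples and satisfies $\hat f(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, size=20))
y = np.sin(2 * np.pi * x) * x + 0.05 * rng.normal(size=20)
x_test = np.linspace(0, 1, 9)
lam = 0.1

K = np.minimum(x[:, None], x[None, :])            # K_ij = min(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)
f_hat = np.minimum(x_test[:, None], x[None, :]) @ alpha
print(np.round(f_hat, 3))
```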

  • Sobolev Kernel vs Polynomial Kernel

  • Example: Sinc Kernel (one dimensional signals)

    - Paley-Wiener space
    - $\kappa(x, y) \triangleq \mathrm{sinc}(\alpha(x - y)) = \frac{\sin(\alpha(x - y))}{\alpha(x - y)}$
    - $f(t)$: bandlimited functions
    - related to the Shannon-Whittaker interpolation formula: uniform samples $x[n] = x(nT)$
    - bandlimited interpolation (see the sketch after this slide)

      $x(t) = \sum_n x[n] \, \mathrm{sinc}\Big(\frac{t - nT}{T}\Big)$
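
A short sketch I added of the Shannon-Whittaker formula above, reconstructing a bandlimited signal from uniform samples; the test signal and sampling rate are assumptions, and `np.sinc` uses the normalized convention $\sin(\pi u)/(\pi u)$, which matches the formula on this slide.

```python
import numpy as np

T = 0.1                                            # sampling period (well above Nyquist here)
n = np.arange(-50, 51)
x_n = np.cos(2 * np.pi * 1.5 * n * T)              # samples of a 1.5 Hz cosine

t = np.linspace(-1, 1, 7)                          # points at which to reconstruct
x_t = np.array([np.sum(x_n * np.sinc((ti - n * T) / T)) for ti in t])

print(np.round(x_t, 3))
print(np.round(np.cos(2 * np.pi * 1.5 * t), 3))    # close to the reconstruction (finite sum)
```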
