UVA CS 4501: Machine Learning
Lecture 5: Non-Linear Regression Models

Dr. Yanjun Qi
University of Virginia, Department of Computer Science
10/6/18
Where are we? → Five major sections of this course

- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Regression (supervised):
- Four ways to train/perform optimization for linear regression models: Normal Equation, Gradient Descent (GD), Stochastic GD, Newton's method
- Supervised regression models: Linear regression (LR), LR with non-linear basis functions, Locally weighted LR, LR with regularizations
Today

- Regression models beyond linear:
  - LR with non-linear basis functions
  - Instance-based regression: K-Nearest Neighbors (later)
  - Locally weighted linear regression (extra)
  - Regression trees and multilinear interpolation (later)
LR with non-linear basis functions

- LR does not mean we can only deal with linear relationships. Instead of $y = \theta^T x$, we can fit

$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \theta^T\phi(x)$
LR with non-linear basis functions

- We are free to design the basis functions (e.g., non-linear features). Here the $\phi_j(x)$ are fixed basis functions (also define $\phi_0(x) = 1$).
- E.g., polynomial regression: $\phi(x) := [1,\; x,\; x^2]^T$
e.g. (1) polynomial regression

$y = \theta^T\phi(x)$ (vs. plain LR: $y = \theta^T x$), with $\phi(x) := [1,\; x,\; x^2]^T$

$\theta^* = (\Phi^T\Phi)^{-1}\Phi^T\vec{y}$ (vs. plain LR: $\theta^* = (X^T X)^{-1} X^T\vec{y}$)

KEY: if the bases are given, the problem of learning the parameters $\theta$ is still linear.
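The normal-equation fit with a polynomial basis can be sketched in a few lines of NumPy. This is an illustrative example, not code from the lecture; the function names and toy data are made up.

```python
# A minimal sketch of polynomial regression via the normal equation,
# theta* = (Phi^T Phi)^{-1} Phi^T y, with basis phi(x) = [1, x, x^2].
import numpy as np

def poly_design_matrix(x, degree=2):
    """Map a 1D input vector to the polynomial basis [1, x, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_normal_equation(Phi, y):
    """Solve the least-squares problem; lstsq is used instead of an
    explicit matrix inverse for numerical stability."""
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

# Toy data generated from y = 1 + 2x + 3x^2 (no noise), so the fit is exact.
x = np.linspace(-1, 1, 20)
y = 1 + 2 * x + 3 * x**2
theta = fit_normal_equation(poly_design_matrix(x, degree=2), y)
print(theta)  # close to [1, 2, 3]
```

Because the basis is fixed, the optimization is the same linear least-squares problem as plain LR, just with $\Phi$ in place of $X$.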
Many Possible Basis Functions

- There are many basis functions, e.g.:
  - Polynomial: $\phi_j(x) = x^{j-1}$
  - Radial basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
  - Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
  - Splines, Fourier, wavelets, etc.
e.g. (2) LR with radial-basis functions

- E.g., LR with RBF regression:

$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \phi(x)^T\theta$

$\phi(x) := [1,\; K_{\lambda_1}(x, r_1),\; K_{\lambda_2}(x, r_2),\; K_{\lambda_3}(x, r_3),\; K_{\lambda_4}(x, r_4)]^T$

$\theta^* = (\Phi^T\Phi)^{-1}\Phi^T\vec{y}$

$K_\lambda(x, r) = \exp\left(-\frac{(x - r)^2}{2\lambda^2}\right)$

RBF = radial-basis function: a function which depends only on the radial distance from a centre point.
Gaussian RBF → as the distance from the centre $r$ increases, the output of the RBF decreases (1D case, 2D case).
$K_\lambda(x, r) = \exp\left(-\frac{(x - r)^2}{2\lambda^2}\right)$

Evaluating the kernel as $x$ moves away from the centre, at $x = r,\; r+\lambda,\; r+2\lambda,\; r+3\lambda$, gives: 1, 0.6065307, 0.1353353, 0.0001234098.

e.g. another linear regression with 1D RBF basis functions (assuming 3 predefined centres and width):

$\phi(x) := [1,\; K_{\lambda_1}(x, r_1),\; K_{\lambda_2}(x, r_2),\; K_{\lambda_3}(x, r_3)]^T$

$\theta^* = (\Phi^T\Phi)^{-1}\Phi^T\vec{y}$
e.g. a LR with 1D RBFs (3 predefined centres and width)

- 1D RBF bases
- After fit:
(figures omitted)
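The 3-centre 1D RBF regression above can be sketched as follows. This is an illustrative NumPy example; the centres, width, and toy coefficients are made-up values, not the ones on the slides.

```python
# A sketch of LR with 1D Gaussian RBF bases: phi(x) = [1, K(x,r1), K(x,r2), K(x,r3)]
# with 3 predefined centres and a shared width lambda, fit by least squares.
import numpy as np

def rbf(x, r, lam):
    """Gaussian RBF K_lambda(x, r) = exp(-(x - r)^2 / (2 lambda^2))."""
    return np.exp(-((x - r) ** 2) / (2 * lam**2))

def rbf_design_matrix(x, centres, lam):
    """Stack the bias column and one RBF column per centre."""
    cols = [np.ones_like(x)] + [rbf(x, r, lam) for r in centres]
    return np.column_stack(cols)

centres, lam = [-1.0, 0.0, 1.0], 0.5
x = np.linspace(-2, 2, 50)
# Toy target that is itself a combination of the bases, so recovery is exact.
true_theta = np.array([0.5, 1.0, -2.0, 1.5])
y = rbf_design_matrix(x, centres, lam) @ true_theta

Phi = rbf_design_matrix(x, centres, lam)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # close to [0.5, 1.0, -2.0, 1.5]
```

Note the fit is linear in $\theta$ even though each RBF column is a non-linear function of $x$.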
e.g. Even more possible basis functions?
(2) Multivariate Linear Regression with Basis Expansion

Task:                 Regression
Representation:       Y = weighted linear sum of (X basis expansion)
Score Function:       SSE
Search/Optimization:  Linear algebra
Models, Parameters:   Regression coefficients

$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \phi(x)^T\theta$
Two main issues:

- To learn the parameter $\theta^*$: almost the same as LR, just change $X$ to $\phi(x)$; the model is a linear combination of basis functions (that can themselves be non-linear).
- How to choose the model order: e.g., what polynomial degree for polynomial regression? E.g., where to put the centers for the RBF kernels, and how wide?
e.g. 2D Good and Bad RBF Basis

- A good 2D RBF
- Two bad 2D RBFs
(figures omitted)
Issue: Overfitting and Underfitting

$y = f(x) + \text{noise}$. Can we learn a regression $f$ from the data? Let's consider three methods:
- Linear regression
- Quadratic regression
- Join-the-dots (also known as piecewise linear nonparametric regression, if that makes you feel better)

Which is best? Why not choose the method with the best fit to the data? What do we really want? "How well are you going to predict future data drawn from the same distribution?"

Underfit / Good? / Overfit
Issue: Overfitting and underfitting

$y = \theta_0 + \theta_1 x$ (underfit)  vs.  $y = \theta_0 + \theta_1 x + \theta_2 x^2$ (looks good)  vs.  $y = \sum_{j=0}^{5}\theta_j x^j$ (overfit)

K-fold Cross Validation / Train-Test:
Generalisation: learn a function/hypothesis from past data in order to "explain", "predict", "model" or "control" new data examples.
The test set method

1. Randomly choose some percentage, like 30%, of the labeled data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

(Linear regression example) Mean Squared Error = 2.4

Credit: Prof. Andrew Moore
Evaluation: e.g. train/test split as follows

$X_{\text{train}} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix},\quad \vec{y}_{\text{train}} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad X_{\text{test}} = \begin{bmatrix} x_{n+1}^T \\ x_{n+2}^T \\ \vdots \\ x_{n+m}^T \end{bmatrix},\quad \vec{y}_{\text{test}} = \begin{bmatrix} y_{n+1} \\ y_{n+2} \\ \vdots \\ y_{n+m} \end{bmatrix}$
Testing MSE error to report:

$J_{test} = \frac{1}{m}\sum_{i=n+1}^{n+m}(x_i^T\theta^* - y_i)^2 = \frac{1}{m}\sum_{i=n+1}^{n+m}\varepsilon_i^2$

Train MSE error to observe:

$J_{train\text{-}MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i^T\theta^* - y_i)^2$

In homework, when we ask for plots of training error, we ask for the per-sample train MSE errors, because they are comparable to the test MSE error. In many situations, visualizing the train MSE can be helpful to understand the behavior of your method, e.g., the influence of the hyperparameter you chose.
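The test-set method and the two per-sample MSE formulas can be sketched together. This is an illustrative NumPy example with synthetic data, not the dataset from the slides.

```python
# A sketch of the test-set method: random 70/30 split, fit [1, x] linear
# regression on the training set, report per-sample train and test MSE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.1, size=100)   # noisy linear toy data

# 1-2. Randomly choose 30% as the test set; the remainder is the training set.
idx = rng.permutation(100)
train_idx, test_idx = idx[:70], idx[70:]

# 3. Perform the regression on the training set (design matrix [1, x]).
X_train = np.column_stack([np.ones(70), x[train_idx]])
theta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)

# 4. Estimate future performance with the per-sample MSE on the test set.
def mse(idxs):
    pred = theta[0] + theta[1] * x[idxs]
    return np.mean((pred - y[idxs]) ** 2)

j_train, j_test = mse(train_idx), mse(test_idx)
print(j_train, j_test)  # both near the noise variance, ~0.01
```

Because both errors are per-sample averages, $J_{train}$ and $J_{test}$ are directly comparable even though the two splits have different sizes.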
The test set method (continued)

- (Quadratic regression example) Mean Squared Error = 0.9
- (Join-the-dots example) Mean Squared Error = 2.2

Good news:
- Very, very simple
- Can then simply choose the method with the best test-set score

Bad news:
- Wastes data: we get an estimate of the best method to apply to 30% less data
- If we don't have much data, our test set might just be lucky or unlucky

We say the "test-set estimator of performance has high variance".

Credit: Prof. Andrew Moore
Regression: Complexity versus Goodness of Fit

Training data fits range from too simple, to about right, to too complex (figures omitted):
- Too simple: underfitting → low variance / high bias
- Too complex: overfitting → low bias / high variance

What ultimately matters: GENERALIZATION.
LOOCV (Leave-one-out Cross Validation)

For k = 1 to n:
1. Let $(x_k, y_k)$ be the k-th record.
2. Temporarily remove $(x_k, y_k)$ from the dataset.
3. Train on the remaining n−1 datapoints.
4. Note your error on $(x_k, y_k)$.

When you've done all points, report the mean error.

(Linear regression example) MSE_LOOCV = 2.12

Credit: Prof. Andrew Moore
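The LOOCV loop above can be sketched directly. This is an illustrative NumPy example on synthetic data, not the slide's dataset, so the resulting MSE differs from the 2.12 shown there.

```python
# A sketch of LOOCV for linear regression: for each k, hold out (x_k, y_k),
# fit on the other n-1 points, record the squared error, report the mean.
import numpy as np

def loocv_mse(x, y):
    n = len(x)
    errors = []
    for k in range(n):
        mask = np.arange(n) != k               # temporarily remove record k
        X = np.column_stack([np.ones(n - 1), x[mask]])
        theta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        pred = theta[0] + theta[1] * x[k]      # note the error on (x_k, y_k)
        errors.append((pred - y[k]) ** 2)
    return np.mean(errors)                     # mean error over all points

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = 3 * x - 1 + rng.normal(0, 0.1, size=30)
result = loocv_mse(x, y)
print(result)  # near the noise variance, ~0.01
```

Each of the n fits uses nearly all the data, which is why LOOCV "doesn't waste data" at the cost of n separate trainings.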
LOOCV for Quadratic Regression: same procedure; MSE_LOOCV = 0.962.

Credit: Prof. Andrew Moore
LOOCV for Join-the-dots: same procedure; MSE_LOOCV = 3.33.

Credit: Prof. Andrew Moore
Which kind of Cross Validation?

Method         Downside                                              Upside
Test-set       Variance: unreliable estimate of future performance   Cheap
Leave-one-out  Expensive; has some weird behavior                    Doesn't waste data

...can we get the best of both worlds?

Credit: Prof. Andrew Moore
e.g. By k = 10-fold Cross Validation

fold  P1    P2    P3    P4    P5    P6    P7    P8    P9    P10
1     train train train train train train train train train test
2     train train train train train train train train test  train
3     train train train train train train train test  train train
4     train train train train train train test  train train train
5     train train train train train test  train train train train
6     train train train train test  train train train train train
7     train train train test  train train train train train train
8     train train test  train train train train train train train
9     train test  train train train train train train train train
10    test  train train train train train train train train train

- Divide data into 10 equal pieces
- 9 pieces as training set, the remaining 1 as test set
- Collect the scores from the "test" diagonal
- We normally use the mean of the scores

Make sure that the train/test/validation folds are indeed independent samples.
k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored purple, green, and blue).

For each partition in turn (blue, green, purple): train on all the points not in that partition, then find the test-set sum of errors on the held-out points. Then report the mean error.

- Linear regression: MSE_3FOLD = 2.05
- Quadratic regression: MSE_3FOLD = 1.11
- Join-the-dots: MSE_3FOLD = 2.93

Credit: Prof. Andrew Moore
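The k-fold comparison of models can be sketched as follows. This is an illustrative NumPy example (synthetic data, k = 3), so the MSE values differ from the slide's 2.05 / 1.11 / 2.93.

```python
# A sketch of k-fold CV (k = 3) used to compare polynomial degrees.
import numpy as np

def kfold_mse(x, y, degree, k=3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)             # randomly break into k partitions
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        coefs = np.polyfit(x[train], y[train], degree)  # train off the fold
        pred = np.polyval(coefs, x[f])                  # test on the held-out fold
        errs.append(np.mean((pred - y[f]) ** 2))
    return np.mean(errs)                       # report the mean error

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 60)
y = x**2 + rng.normal(0, 0.2, size=60)         # quadratic ground truth

mse_lin = kfold_mse(x, y, degree=1)
mse_quad = kfold_mse(x, y, degree=2)
print(mse_lin, mse_quad)  # the quadratic model should score much better
```

With a quadratic ground truth, 3-fold CV correctly prefers the quadratic model over the linear one, mirroring the slide's comparison.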
Which kind of Cross Validation?

Method         Downside                                                        Upside
Test-set       Variance: unreliable estimate of future performance             Cheap
Leave-one-out  Expensive; has some weird behavior                              Doesn't waste data
10-fold        Wastes 10% of the data; 10 times more expensive than test set   Only wastes 10%; only 10 times more expensive instead of n times
3-fold         Wastes more data than 10-fold; more expensive than test set     Better than test-set
n-fold         Identical to leave-one-out
CV-based Model Selection

- We're trying to decide which algorithm to use.
- We train each machine and make a table:

i   f_i   TRAINERR   k-FOLD-CV-ERR   Choice
1   f1
2   f2
3   f3                               ✓
4   f4
5   f5
6   f6

Credit: Prof. Andrew Moore; Credit: Stanford Machine Learning course
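Filling in such a table can be sketched as follows: train error alone keeps dropping as the model gets more complex, while the CV error column identifies a sensible model. This is an illustrative NumPy example; the candidate "machines" are polynomial fits of degree 1 to 5 on made-up data.

```python
# A sketch of CV-based model selection: for polynomial machines f_1..f_5
# (degree 1..5), record train error and k-fold CV error, then choose the
# degree with the lowest CV error.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 60)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.3, size=60)   # true degree is 2

def cv_err(degree, k=5):
    # Contiguous index folds; x is drawn i.i.d., so order is already random.
    folds = np.array_split(np.arange(60), k)
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(60), f)
        c = np.polyfit(x[tr], y[tr], degree)
        errs.append(np.mean((np.polyval(c, x[f]) - y[f]) ** 2))
    return np.mean(errs)

table = {}
for d in range(1, 6):
    c = np.polyfit(x, y, d)
    train_err = np.mean((np.polyval(c, x) - y) ** 2)
    table[d] = (train_err, cv_err(d))

best = min(table, key=lambda d: table[d][1])   # the "choice" column
print(table, best)
```

Train error is non-increasing in degree (the bases are nested), so it cannot be used for the choice; CV error rules out the underfit linear model.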
References

- Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- Prof. Andrew Moore's slides @ CMU
Extra: Performance vs. training size, for an example

- The results from batch GD and online (stochastic) GD updates are almost identical, so the plots coincide.
- The test MSE from the normal equation is higher than that of batch GD and SGD for small training sizes. This is probably due to overfitting.
- In the batch and online runs, only 2000 (for example) iterations are allowed at most; this roughly acts as a mechanism that avoids overfitting.
(Dr. Eric Xing's tutorial slide)

EXTRA: MORE REGRESSION MODELS
Extra Today

- Regression models beyond linear:
  - LR with non-linear basis functions
  - Instance-based regression: K-Nearest Neighbors (later)
  - Locally weighted linear regression (extra)
  - Regression trees and multilinear interpolation (later)
K-Nearest Neighbor

- Features:
  - All instances correspond to points in a p-dimensional Euclidean space
  - Regression is delayed till a new instance arrives
  - Regression is done by comparing feature vectors of the different points
  - Target function may be discrete or real-valued
- When the target is continuous, the prediction is the mean value of the k nearest training examples.

K = 5-Nearest Neighbor (1D input); K = 1-Nearest Neighbor (1D input): (figures omitted)
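The "mean of the k nearest training examples" rule can be sketched in a few lines. This is an illustrative NumPy example with made-up 1D data, not the figure's dataset.

```python
# A sketch of k-NN regression (1D input): the prediction at a query point
# is the mean y of the k nearest training examples.
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    dists = np.abs(x_train - x_query)          # Euclidean distance in 1D
    nearest = np.argsort(dists)[:k]            # indices of the k nearest points
    return np.mean(y_train[nearest])           # mean of their targets

x_train = np.arange(10, dtype=float)           # 0, 1, ..., 9
y_train = 2 * x_train                          # perfectly linear toy targets
pred = knn_predict(x_train, y_train, x_query=4.0, k=5)
print(pred)  # → 8.0  (mean of y at x = 2, 3, 4, 5, 6)
```

No training step happens: all the work is deferred until the query arrives, which is exactly the "lazy" behavior described below.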
K-Nearest Neighbor

Task:                 Regression / classification
Representation:       Local smoothness
Score Function:       NA
Search/Optimization:  NA
Models, Parameters:   Training samples
Variants: Distance-Weighted k-Nearest Neighbor Algorithm

- Assign weights to the neighbors based on their "distance" from the query point; the weight "may" be the inverse square of the distance.
- All training points may influence a particular instance, e.g., Shepard's method / Modified Shepard from geospatial analysis.
Instance-based Regression vs. Linear Regression

- Linear regression learning: explicit description of the target function on the whole training set.
- Instance-based learning: learning = storing all training instances; referred to as "lazy" learning.
Locally weighted regression

- a.k.a. local linear regression, LOESS, ...
- A combination of kNN and linear regression.
- Use an RBF function $K_\lambda(x_i, x_0)$ to pick out / emphasize the neighbor region of the query point $x_0$.
Locally weighted regression

- A linear function $x \mapsto y$, $f(x_0) = \hat\theta_0(x_0) + \hat\theta_1(x_0)\, x_0$, only needs to represent the neighbor region of $x_0$ (weighted by $K_\lambda(x_i, x_0)$).
Locally weighted linear regression

Instead of minimizing

$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(x_i^T\theta - y_i)^2$

we now fit to minimize

$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} w_i\,(x_i^T\theta - y_i)^2$

where

$w_i = K_\lambda(x_i, x_0) = \exp\left(-\frac{(x_i - x_0)^2}{2\lambda^2}\right)$

and $x_0$ is the query point for which we'd like to know its corresponding $y$.
Locally weighted linear regression

We fit $\theta$ to minimize the weighted objective $J(\theta)$, where each weight $w_i$ comes from the query point $x_0$ for which we'd like to know the corresponding $y$. Essentially we put higher weights on training examples that are close to the query point $x_0$ than on those that are further away from it.

- The width $\lambda$ of the RBF matters!
LEARNING of locally weighted linear regression

- Separate weighted least squares training and inference at each target point $x_0$:

$f(x_0) = x_0^T \theta^*(x_0)$

$\theta^*(x_0) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^{n} w_i\,(x_i^T\theta(x_0) - y_i)^2 = \arg\min_\theta \frac{1}{2}\sum_{i=1}^{n} K_\lambda(x_i, x_0)\,(x_i^T\theta(x_0) - y_i)^2$

Equivalently, in 1D notation: $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$

Extra: solution of locally weighted linear / non-linear-basis regression. Let $W(x_0) = \mathrm{diag}\big(K_\lambda(x_0, x_i)\big),\ i = 1, \ldots, N$ (an $N \times N$ diagonal matrix). Then

LWR: $\theta^*(x_0) = (B^T W(x_0) B)^{-1} B^T W(x_0)\, y$
versus LR: $\theta^* = (X^T X)^{-1} X^T \vec{y}$
More → Local Weighted Polynomial Regression

- Local polynomial fits of any degree d:

$\min_{\alpha(x_0),\,\beta_j(x_0),\, j=1,\ldots,d}\ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \sum_{j=1}^{d}\beta_j(x_0)\, x_i^j\right]^2$

$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d}\hat\beta_j(x_0)\, x_0^j$

(Figure: blue = true, green = estimated)
Extra: Parametric vs. non-parametric

- Locally weighted linear regression is a non-parametric algorithm.
- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm:
  - because it has a fixed, finite number of parameters (the $\theta$), which are fit to the data;
  - once we've fit the $\theta$ and stored them away, we no longer need to keep the training data around to make future predictions;
  - in contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.
- The term "non-parametric" (roughly) refers to the fact that the amount of knowledge we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
(3) Locally Weighted / Kernel Linear Regression

Task:                 Regression
Representation:       Y = weighted linear sum of X's
Score Function:       Weighted SSE
Search/Optimization:  Linear algebra
Models, Parameters:   Local regression coefficients (conditioned on each test point)

$\min_{\alpha(x_0),\,\beta(x_0)}\ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0)\, x_i\big]^2,\qquad \hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$

$\theta^*(x_0) = (B^T W(x_0) B)^{-1} B^T W(x_0)\, y$