Università di Milano-Bicocca - Laurea Magistrale in Informatica

Course: APPRENDIMENTO E APPROSSIMAZIONE

Lecture 8 - Instance based learning

Prof. Giancarlo Mauri

Page 1

Università di Milano-Bicocca

Laurea Magistrale in Informatica

Course:

APPRENDIMENTO E APPROSSIMAZIONE

Lecture 8 - Instance based learning

Prof. Giancarlo Mauri

Page 2

Instance-based Learning

Key idea: just store all training examples <xi, f(xi)>

Classify a query instance by retrieving a set of "similar" instances

Advantages:

Training is very fast - just store the examples

Can learn complex target functions

No information is lost

Disadvantages:

Slow at query time - needs efficient indexing of training examples

Easily fooled by irrelevant attributes

Page 3

Instance-based Learning

Main approaches:

k-Nearest Neighbor

Locally weighted regression

Radial basis functions

Case-based reasoning

Lazy and eager learning

Page 4

K-Nearest Neighbor Learning

Instances: points in Rn

Euclidean distance to measure similarity

Target function f: Rn → V

The algorithm: Given query instance xq

first locate the nearest training example xn, then estimate f^(xq) = f(xn) (for k=1)

or take a vote among its k nearest neighbors (k≠1 and f discrete-valued)

or take the mean of the f values of its k nearest neighbors (k≠1 and f real-valued)

f^(xq) ← (1/k) ∑i=1..k f(xi)
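To make this concrete, here is a minimal Python sketch of the algorithm above (not from the slides; euclidean and knn_predict are illustrative names), covering both the discrete vote and the real-valued mean:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two points in R^n
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(examples, xq, k=1, discrete=True):
    """examples: list of (xi, f(xi)) pairs; xq: query instance."""
    # Training was just storing `examples`; all work happens at query time
    neighbors = sorted(examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    values = [fx for _, fx in neighbors]
    if discrete:
        # k=1 reduces to the nearest example's label; k>1 is a majority vote
        return Counter(values).most_common(1)[0][0]
    # real-valued f: mean of the k nearest neighbors' values
    return sum(values) / k

# Usage: "training" is just storing the examples
train = [((0, 0), '+'), ((1, 0), '+'), ((4, 4), '-'), ((5, 4), '-')]
print(knn_predict(train, (0.5, 0.2), k=3))  # '+'
```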

Page 5

K-Nearest Neighbor Learning

When to consider Nearest Neighbor:

Instances map to points in Rn

Less than 20 attributes per instance

Lots of training data

Page 6

An example

Boolean target function in 2D

x classified + for k=1 and - for k=5 (left diagram)

Right diagram (Voronoi diagram) shows the decision surface induced by 1-nbr for a given set of training examples (black dots)


Page 7

Voronoi Diagram

[Figure: Voronoi diagram with a query point and its nearest neighbor]

Page 8

3-Nearest Neighbors

[Figure: query point and its 3 nearest neighbors: 2 x, 1 o]

Page 9

7-Nearest Neighbors

[Figure: query point and its 7 nearest neighbors: 3 x, 4 o]

Page 10

Nearest Neighbor (continuous)

1-nearest neighbor

Page 11

Nearest Neighbor (continuous)

3-nearest neighbor

Page 12

Nearest Neighbor (continuous)

5-nearest neighbor

Page 13

Behavior in the Limit

Let p(x) be the probability that instance x will be labeled 1 (positive) versus 0 (negative)

Nearest neighbor: as the number of training examples → ∞, approaches the Gibbs algorithm: with probability p(x) predict 1, else 0

k-Nearest Neighbor: as the number of training examples → ∞ and k grows large, approaches the Bayes optimal classifier: if p(x) > 0.5 then predict 1, else 0

Note: Gibbs has at most twice the expected error of Bayes optimal

Page 14

Distance-Weighted k-NN

Might want to weight nearer neighbors more heavily…

f^(xq) ← ∑i=1..k wi f(xi) / ∑i=1..k wi

where wi = 1/d(xq,xi)2, with d(xq,xi) the distance between xq and xi

(if xq = xi, i.e. d(xq,xi) = 0, then f^(xq) = f(xi))

Note: now it makes sense to use all training examples instead of just k (Shepard's method; global, slow)
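A minimal sketch of this rule, assuming the 1/d2 weighting above and reusing euclidean() from the earlier sketch (weighted_knn_predict is an illustrative name):

```python
def weighted_knn_predict(examples, xq, k=5):
    # k nearest stored examples, reusing euclidean() from the sketch above
    neighbors = sorted(examples, key=lambda ex: euclidean(ex[0], xq))[:k]
    num, den = 0.0, 0.0
    for xi, fxi in neighbors:
        d = euclidean(xi, xq)
        if d == 0.0:
            return fxi            # xq = xi: return f(xi) directly
        w = 1.0 / d ** 2          # wi = 1 / d(xq, xi)^2
        num += w * fxi
        den += w
    return num / den              # sum(wi * f(xi)) / sum(wi)
```

For a discrete-valued f one would instead accumulate the weights per class and predict the class with the largest total; summing over all stored examples rather than the k nearest gives Shepard's method.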

Page 15

Curse of Dimensionality

Imagine instances described by 20 attributes, but only 2 are relevant to target function

nearest neighbor is easily misled when X is high-dimensional

One approach:

Stretch jth axis by weight zj, where z1,…, zn chosen to minimize prediction error

Use cross-validation to automatically choose weights z1,…, zn

Note: setting zj to zero eliminates that dimension altogether (a sketch of the stretched metric follows below)
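A sketch of the stretched metric together with a hold-out error criterion for choosing the weights zj (all names illustrative; full cross-validation would evaluate candidate weight vectors the same way and keep the best):

```python
import numpy as np

def stretched_distance(x, q, z):
    # Stretch the j-th axis by weight z[j]; z[j] = 0 drops that dimension
    return np.sqrt(np.sum((z * (x - q)) ** 2))

def holdout_error(z, train_X, train_y, val_X, val_y):
    # 1-NN classification error on a validation split under the stretched metric
    errors = 0
    for xq, yq in zip(val_X, val_y):
        nearest = np.argmin([stretched_distance(x, xq, z) for x in train_X])
        errors += (train_y[nearest] != yq)
    return errors / len(val_y)

# Choosing z1,…,zn: evaluate candidate weight vectors with holdout_error
# (or cross-validation) and keep the one with the lowest error.
```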

Page 16

Locally weighted Regression

Note that k-NN forms a local approximation to f for each query point xq: why not form an explicit approximation f^(x) for the region surrounding xq?

Fit linear function to k nearest neighbors

Fit quadratic…

Produces “piecewise approximation” to f

Several choices of error to minimize:

Squared error over the k nearest neighbors:

E1(xq) = (1/2) ∑x∈kNN(xq) (f(x) − f^(x))2

Distance-weighted squared error over all neighbors:

E2(xq) = (1/2) ∑x∈D (f(x) − f^(x))2 K(d(xq,x))
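A minimal numpy sketch of locally weighted linear regression minimizing the E2-style criterion with a Gaussian kernel (lwr_predict and tau are illustrative names, not from the slides):

```python
import numpy as np

def lwr_predict(X, y, xq, tau=0.5):
    """Locally weighted linear fit around xq; X: (N, D) inputs, y: (N,) targets."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # bias column for w0
    xqb = np.hstack([1.0, np.atleast_1d(xq)])
    d2 = np.sum((X - xq) ** 2, axis=1)              # squared d(xq, x)
    K = np.exp(-d2 / (2 * tau ** 2))                # Gaussian kernel weights
    W = np.diag(K)
    # Weighted least squares: minimize sum over x of (f(x) - f^(x))^2 K(d(xq, x))
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return float(xqb @ beta)
```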

Page 17

Radial Basis Function Networks

Global approximation to target function, in terms of linear combination of local approximations

Used, e.g., for image classification

A different kind of neural network

Closely related to distance-weighted regression, but “eager” instead of “lazy”

Page 18

Radial Basis Function Networks

f(x) = w0 + ∑u=1..k wu Ku(d(xu,x))

where ai(x) are the attributes describing instance x

One common choice for Ku(d(xu,x)) is the Gaussian:

Ku(d(xu,x)) = exp(−d(xu,x)2 / 2σu2)
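Evaluating such a network is simple once centers, widths and weights are given; a sketch (rbf_output is an illustrative name):

```python
import numpy as np

def rbf_output(x, centers, sigmas, w):
    """f(x) = w[0] + sum_u w[u+1] * Ku(d(xu, x)); centers: (k, D), sigmas: (k,), w: (k+1,)."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared d(xu, x) for every unit
    K = np.exp(-d2 / (2 * sigmas ** 2))       # Gaussian kernels
    return w[0] + np.dot(w[1:], K)            # w0 plus the weighted local terms
```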

Page 19

Training RBF Networks

Q1: what xu to use for each kernel function Ku(d(xu,x))?

Scatter uniformly throughout the instance space

Or use the training instances (reflects the instance distribution)

Q2: how to train the weights (assume here Gaussian Ku)

First choose the variance (and perhaps the mean) for each Ku - e.g., use EM

Then hold Ku fixed, and train the linear output layer - efficient methods to fit linear functions

Page 20

Radial Basis Function Network

Global approximation to target function in terms of linear combination of local approximations

Used, e.g. for image classification

Similar to back-propagation neural network but activation function is Gaussian rather than sigmoid

Closely related to distance-weighted regression but ”eager” instead of ”lazy”

Page 21

Radial Basis Function Network

input layer

Kernel functions

output f(x)

xi

Kn(d(xn,x))=exp(-1/2 d(xn,x)2/σ2)

wn linear parameters

f(x) = w0 + ∑n=1..k wn Kn(d(xn,x))

Page 22

Training Radial Basis Function Networks

How to choose the center xn for each Kernel function Kn?

scatter uniformly across the instance space

use the distribution of training instances (clustering)

How to train the weights?

Choose mean xn and variance σn for each Kn

nonlinear optimization or EM

Hold Kn fixed and use local linear regression to compute the optimal weights wn
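A minimal training sketch following these two steps, choosing centers by clustering and then solving for the weights wn by least squares (illustrative; scikit-learn's KMeans stands in for the clustering step, and a single shared sigma is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, k=10, sigma=1.0):
    # Step 1: place the centers xn according to the data distribution
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    # Step 2: hold the kernels Kn fixed and fit the linear output layer
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)  # (N, k)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    Phi = np.hstack([np.ones((len(X), 1)), Phi])                     # bias w0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                      # optimal wn
    return centers, w
```

Prediction on new data reuses the same kernel-matrix construction (or rbf_output() from the earlier sketch).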

Page 23

Radial Basis Network Example

K1(d(x1,x)) = exp(−(1/2) d(x1,x)2/σ2), with local linear models w1x + w0 and w3x + w2

f^(x) = K1·(w1x + w0) + K2·(w3x + w2)

Page 24

Case-based reasoning

Can apply instance-based learning even when X ≠ Rn

need different “distance” metric

Case-based reasoning is instance-based learning applied to instances with symbolic logic descriptions

((user-complaint error53-on-shutdown)

(cpu-model PowerPC)

(operating-system Windows)

(network-connection PCIA)

(memory 48meg)

(installed-applications Excel Netscape VirusScan)

(disk 1gig)

(likely-cause???))


Page 25

Case-based reasoning in CADET

CADET: 75 stored examples of mechanical devices

each training example: < qualitative function, mechanical structure>

new query: desired function

target value: mechanical structure for this function

Distance metric: match qualitative function descriptions

Page 26

Case-based Reasoning in CADET

Page 27

Case-based Reasoning in CADET

Instances represented by rich structural descriptions

Multiple cases retrieved (and combined) to form solution to new problem

Tight coupling between case retrieval and problem solving

Bottom line:

Simple matching of cases useful for tasks such as answering help-desk queries

Area of ongoing research

Page 28

Lazy and Eager learning

Lazy: wait for the query before generalizing

k-Nearest Neighbor, case-based reasoning

Eager: generalize before seeing the query

Radial basis function networks, ID3, Backpropagation, Naive Bayes, …

Does it matter?

Eager learner must create global approximation

Lazy learner can create many local approximations

If they use the same H, lazy can represent more complex functions (e.g., consider H = linear functions)

Page 29

Machine Learning 2D5362

Instance Based Learning

Page 30

Distance Weighted k-NN

Give more weight to neighbors closer to the query point

f^(xq) = ∑i=1..k wi f(xi) / ∑i=1..k wi

where wi=K(d(xq,xi))

and d(xq,xi) is the distance between xq and xi

Instead of only k-nearest neighbors use all training examples (Shepard’s method)

Page 31

Distance Weighted Average

Weighting the data:

f^(xq) = ∑i f(xi) K(d(xi,xq)) / ∑i K(d(xi,xq))

Relevance of a data point (xi,f(xi)) is measured by calculating the distance d(xi,xq) between the query xq and the input vector xi

Weighting the error criterion:

E(xq) = ∑i (f^(xq) − f(xi))2 K(d(xi,xq))

The best estimate f^(xq) will minimize the cost E(xq), therefore ∂E(xq)/∂f^(xq) = 0; solving this gives exactly the weighted average above (a numeric check follows below)
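A small numeric check of that claim (toy data, any positive kernel; not from the slides):

```python
import numpy as np

xi = np.array([0.0, 1.0, 2.0, 3.0])        # inputs
fi = np.array([1.0, 2.0, 0.5, 3.0])        # observed f(xi)
xq = 1.2
K = np.exp(-(xi - xq) ** 2)                # any positive kernel K(d(xi, xq))

f_hat = np.sum(fi * K) / np.sum(K)         # closed-form weighted average

# Brute-force minimization of E(xq) over candidate values of f^(xq)
grid = np.linspace(fi.min(), fi.max(), 10001)
E = np.sum((grid[:, None] - fi) ** 2 * K, axis=1)
print(np.isclose(grid[np.argmin(E)], f_hat, atol=1e-3))  # True
```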

Page 32

Kernel Functions

Page 33

Distance Weighted NN

K(d(xq,xi)) = 1/ d(xq,xi)2

Page 34

Distance Weighted NN

K(d(xq,xi)) = 1/(d0+d(xq,xi))2

Page 35

Distance Weighted NN

K(d(xq,xi)) = exp(-(d(xq,xi)/σ0)2)

Page 36

Linear Global Models

The model is linear in the parameters wk, which can be estimated using a least squares algorithm

f^(xi) = ∑k=1..D wk xki, or F(X) = Xw

Where xi=(x1,…,xD)i, i=1..N, with D the input dimension and N the number of data points.

Estimate the wk by minimizing the error criterion

E = ∑i=1..N (f^(xi) − yi)2

(XT X) w = XT F(X)

w = (XT X)−1 XT F(X)

wk = ∑m=1..D ∑n=1..N (∑l=1..D xTkl xlm)−1 xTmn f(xn)
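A numpy sketch of this global fit (illustrative names; F(X) is just the vector of targets y):

```python
import numpy as np

def fit_global_linear(X, y):
    """X: (N, D) data matrix, y: (N,) targets; w minimizes sum_i (f^(xi) - yi)^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves (X^T X) w = X^T y
    return w

def predict(X, w):
    return X @ w                                # f^(xi) = sum_k wk xki
```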

Page 37

Linear Regression Example

Page 38

Linear Local Models

Estimate the parameters wk such that they locally (near the query point xq) match the training data, either by

weighting the data:

wi = K(d(xi,xq))^(1/2) and transforming

zi=wi xi

vi=wi yi

or by weighting the error criterion:

E = ∑i=1..N (xiT w − yi)2 K(d(xi,xq))

still linear in w, with least-squares solution

w = ((WX)T WX)−1 (WX)T W F(X)
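A sketch of the "weighting the data" route: scale each row by wi and solve an ordinary least-squares problem (illustrative names; a Gaussian kernel with bandwidth tau is an assumption):

```python
import numpy as np

def local_linear_fit(X, y, xq, tau=0.5):
    # wi = K(d(xi, xq))^(1/2) with a Gaussian kernel
    w = np.exp(-np.sum((X - xq) ** 2, axis=1) / (2 * tau ** 2)) ** 0.5
    Z = w[:, None] * X        # zi = wi * xi
    v = w * y                 # vi = wi * yi
    # Ordinary least squares on the transformed data
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)   # = ((WX)^T WX)^-1 (WX)^T W y
    return beta
```

For the affine model f^(x) = b1x + b0 of the next example, a column of ones would be appended to X before weighting.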

Page 39

Linear Local Model Example

query point xq = 0.35

Kernel K(x,xq)

Local linear model: f^(x) = b1x + b0

f^(xq) = 0.266

Page 40

Linear Local Model Example

Page 41

Design Issues in Local Regression

Local model order (constant, linear, quadratic)

Distance function d

feature scaling: d(x,q) = (∑j=1..d mj (xj − qj)2)1/2

irrelevant dimensions: mj = 0

Kernel function K

Smoothing parameter: bandwidth h in K(d(x,q)/h)

h = |m|: global bandwidth

h = distance to the k-th nearest neighbor point

h = h(q): depending on the query point

h = hi: depending on the stored data points

See the paper by Atkeson [1996], "Locally Weighted Learning"

Page 42

Local Linear Models

Page 43

Local Linear Model Tree (LOLIMOT)

• incremental tree construction algorithm

• partitions the input space by axis-orthogonal splits

• adds one local linear model per iteration

1. start with an initial model (e.g. a single LLM)

2. identify the LLM with the worst model error Ei

3. check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension

4. find the best (smallest error) of the possible divisions

5. add the new validity function and LLM

6. repeat from step 2 until the termination criterion is met (a minimal sketch of this loop follows below)
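A simplified sketch of the loop, assuming hard axis-orthogonal partitions in place of LOLIMOT's smooth Gaussian validity functions (all names illustrative):

```python
import numpy as np

def fit_llm(X, y, mask):
    """Least-squares local linear model over the points inside one hyper-rectangle."""
    if mask.sum() < X.shape[1] + 1:
        return None, np.inf                          # too few points to fit
    Xb = np.hstack([np.ones((mask.sum(), 1)), X[mask]])
    beta, *_ = np.linalg.lstsq(Xb, y[mask], rcond=None)
    return beta, float(np.sum((Xb @ beta - y[mask]) ** 2))   # model, error Ei

def lolimot(X, y, n_models=8):
    in_box = lambda lo, hi: np.all((X >= lo) & (X <= hi), axis=1)
    boxes = [(X.min(axis=0), X.max(axis=0))]         # 1. single global LLM
    fits = [fit_llm(X, y, in_box(*boxes[0]))]
    while len(boxes) < n_models:                     # 6. until termination
        worst = int(np.argmax([e for _, e in fits])) # 2. worst error Ei
        lo, hi = boxes[worst]
        best = None
        for d in range(X.shape[1]):                  # 3. halve along each dim
            mid = 0.5 * (lo[d] + hi[d])
            hi1, lo2 = hi.copy(), lo.copy()
            hi1[d], lo2[d] = mid, mid
            f1 = fit_llm(X, y, in_box(lo, hi1))
            f2 = fit_llm(X, y, in_box(lo2, hi))
            if best is None or f1[1] + f2[1] < best[0]:   # 4. smallest error
                best = (f1[1] + f2[1], (lo, hi1, f1), (lo2, hi, f2))
        _, (l1, h1, f1), (l2, h2, f2) = best
        boxes[worst], fits[worst] = (l1, h1), f1     # 5. replace worst LLM
        boxes.append((l2, h2)); fits.append(f2)      #    by its two halves
    return boxes, fits
```

Real LOLIMOT weights each local model with a normalized Gaussian validity function and fits it by locally weighted least squares; the hard partition above merely keeps the sketch short.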

Page 44

LOLIMOT

Initial global linear model

Split along x1 or x2

Pick the split that minimizes model error (residual)

Page 45

LOLIMOT Example

Page 46

LOLIMOT Example

Page 47

Lazy and Eager Learning

Lazy: wait for the query before generalizing

k-nearest neighbors, weighted linear regression

Eager: generalize before seeing the query

Radial basis function networks, decision trees, back-propagation, LOLIMOT

Eager learner must create global approximation

Lazy learner can create local approximations

If they use the same hypothesis space, lazy can represent more complex functions (e.g., H = linear functions)