
Università di Milano-Bicocca
Laurea Magistrale in Informatica

Course:

APPRENDIMENTO E APPROSSIMAZIONE

Lecture 8 - Instance-based learning

Prof. Giancarlo Mauri

Instance-based Learning

Key idea: just store all training examples <xi, f(xi)>; classify a query instance by retrieving a set of “similar” instances

Advantages: training is very fast (just storing examples); can learn complex target functions; doesn’t lose information

Disadvantages: slow at query time (need for efficient indexing of training examples); easily fooled by irrelevant attributes

Instance-based Learning

Main approaches:

k-Nearest Neighbor

Locally weighted regression

Radial basis functions

Case-based reasoning

Lazy and eager learning

K-Nearest Neighbor Learning

Instances: points in Rn

Euclidean distance to measure similarity

Target function f: Rn → V

The algorithm: given a query instance xq,

first locate the nearest training example xn, then estimate f^(xq) = f(xn) (for k=1)

or take a vote among its k nearest neighbors (k≠1 and f discrete-valued)

or take the mean of the f values of its k nearest neighbors (k≠1 and f real-valued):

f^(xq) ← (1/k) ∑i=1..k f(xi)
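A minimal Python/NumPy sketch of this k=1 / vote / mean procedure (the function name and the toy data are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, discrete=True):
    """Plain k-NN: Euclidean distance, then vote (discrete f) or mean (real-valued f)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)    # d(xq, xi) for every stored example
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    if discrete:
        return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote
    return y_train[nearest].mean()                       # mean of the neighbors' f values

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))   # -> 1
```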

K-Nearest Neighbor Learning

When to consider Nearest Neighbor:

Instances map to points in Rn

Less than 20 attributes per instance

Lots of training data

An example

Boolean target function in 2D

x classified + for k=1 and - for k=5 (left diagram)

The right diagram (Voronoi diagram) shows the decision surface induced by the 1-nearest-neighbor rule for a given set of training examples (black dots)


Voronoi Diagram

query point qf

nearest neighbor qi

3-Nearest Neighbors

query point qf

3 nearest neighbors

2x,1o

7-Nearest Neighbors

query point qf

7 nearest neighbors

3x,4o

Nearest Neighbor (continuous)

1-nearest neighbor

Nearest Neighbor (continuous)

3-nearest neighbor

Nearest Neighbor (continuous)

5-nearest neighbor

Behavior in the Limit

Let p(x) be the probability that instance x will be labeled 1 (positive) versus 0 (negative)

Nearest neighbor: as the number of training examples → ∞, approaches the Gibbs algorithm: with probability p(x) predict 1, else 0

k-Nearest neighbor: as the number of training examples → ∞ and k gets large, approaches Bayes optimal: if p(x) > 0.5 then predict 1, else 0

Note: Gibbs has at most twice the expected error of Bayes optimal

Distance-Weighted k-NN

Might want to weight nearer neighbors more heavily…

f^(xq) ← ∑i=1..k wi f(xi) / ∑i=1..k wi

where wi = 1/d(xq,xi)2 and d(xq,xi) is the distance between xq and xi

(if xq = xi, i.e. d(xq,xi) = 0, then f^(xq) = f(xi))

Note: now it makes sense to use all training examples instead of just k (Shepard’s method: global, but slow)
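A sketch of distance-weighted k-NN for a real-valued target, assuming the wi = 1/d(xq,xi)2 weights above; passing k equal to the full training-set size gives Shepard’s method:

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_query, k=5):
    """Distance-weighted k-NN for a real-valued f, with wi = 1 / d(xq, xi)^2."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    exact = dists == 0.0
    if exact.any():
        return y_train[exact][0]          # xq coincides with a stored xi: return f(xi)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / dists[nearest] ** 2
    return np.sum(w * y_train[nearest]) / np.sum(w)   # weighted average of the neighbors
    # with k = len(X_train) this becomes Shepard's method (all examples, globally weighted)
```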

Curse of Dimensionality

Imagine instances described by 20 attributes, but only 2 are relevant to target function

the nearest neighbor is easily misled when X is high-dimensional

One approach:

Stretch the j-th axis by weight zj, where z1,…, zn are chosen to minimize the prediction error

Use cross-validation to automatically choose weights z1,…, zn

Note: setting zj to zero eliminates that dimension altogether
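A rough illustration of this axis-stretching idea, assuming a weighted Euclidean distance and a crude leave-one-out grid search over candidate weight vectors (the helper names, the grid, and the toy data are mine, not from the slides):

```python
import numpy as np
from itertools import product

def weighted_dist(a, b, z):
    """Euclidean distance with the j-th axis stretched by weight z_j."""
    return np.sqrt(np.sum((z * (a - b)) ** 2))

def loo_error(X, y, z, k=1):
    """Leave-one-out 1-NN error under axis weights z (used to compare candidate z's)."""
    errors = 0
    for i in range(len(X)):
        d = np.array([weighted_dist(X[i], X[j], z) if j != i else np.inf
                      for j in range(len(X))])
        errors += int(y[d.argmin()] != y[i])
    return errors / len(X)

def choose_weights(X, y, grid=(0.0, 0.5, 1.0)):
    """Crude grid search: z_j = 0 drops a dimension, larger z_j emphasizes it."""
    candidates = product(grid, repeat=X.shape[1])
    return min(candidates, key=lambda z: loo_error(X, y, np.array(z)))

# toy usage: the second feature is pure noise
X = np.array([[0.1, 5.0], [0.2, -3.0], [0.9, 4.0], [1.0, -2.0]])
y = np.array([0, 0, 1, 1])
print(choose_weights(X, y))   # tends to down-weight the irrelevant axis
```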

Locally weighted Regression

Note that kNN forms a local approximation to f for each query point xq: why not form an explicit approximation f^(x) for the region surrounding xq?

Fit linear function to k nearest neighbors

Fit quadratic…

Produces “piecewise approximation” to f

Several choices of error to minimize: Squared error over k nearest neighbors

E1(xq) = ½ ∑x ∈ k nearest nbrs of xq (f(x) − f^(x))2

Distance-weighted squared error over all training examples D

E2(xq) = ½ ∑x ∈ D (f(x) − f^(x))2 K(d(xq,x))
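A small sketch of the second criterion for a 1-D input, fitting a local line with a Gaussian kernel. np.polyfit minimizes the sum of squared weighted residuals, so the square root of K is passed as its weight; the kernel width h and the toy data are arbitrary choices, not from the slides:

```python
import numpy as np

def gaussian_kernel(d, h=0.3):
    """K(d) = exp(-(d/h)^2): nearby points dominate the fit."""
    return np.exp(-(d / h) ** 2)

def locally_weighted_fit(x, y, x_query, degree=1, h=0.3):
    """Fit a local polynomial around x_query, minimizing
    sum_i K(d(xq, xi)) * (f(xi) - f^(xi))^2."""
    K = gaussian_kernel(np.abs(x - x_query), h)
    coeffs = np.polyfit(x, y, degree, w=np.sqrt(K))   # sqrt(K) because residuals are squared
    return np.polyval(coeffs, x_query)

# toy 1-D usage: noisy sine, local linear fit at xq = 1.0
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)
print(locally_weighted_fit(x, y, 1.0))   # close to sin(1.0) ≈ 0.84
```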

Radial Basis Function Networks

Global approximation to target function, in terms of linear combination of local approximations

Used, e.g., for image classification

A different kind of neural network

Closely related to distance-weighted regression, but “eager” instead of “lazy”

Radial Basis Function Networks

f(x) = w0 + ∑u=1..k wu Ku(d(xu, x))

where ai(x) are the attributes describing instance x. One common choice for Ku(d(xu, x)) is the Gaussian kernel

Ku(d(xu, x)) = e^(−d(xu, x)2 / (2σu2))

Training RBF Networks

Q1: what xu to use for each kernel function Ku(d(xu, x))? Scatter them uniformly throughout the instance space

Or use training instances (reflects instance distribution)

Q2: how to train the weights (assume here Gaussian Ku)

First choose the variance (and perhaps the mean) for each Ku, e.g. using EM

Then hold the Ku fixed and train the linear output layer (efficient methods exist to fit a linear function)
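A compact sketch of this recipe, with centers taken as a random subset of the training instances, a shared Gaussian width, and the linear output layer fitted by least squares; all of these specific choices (and the toy data) are assumptions for illustration:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian kernels Ku(d(xu, x)) = exp(-d^2 / (2 sigma^2)), plus a bias column for w0."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2.0 * sigma ** 2))])

def train_rbf(X, y, n_centers=10, sigma=0.5, seed=0):
    """Q1: take the centers xu from the training instances (random subset here).
    Q2: hold the kernels fixed and fit the linear output weights by least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    Phi = rbf_features(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w

def rbf_predict(X, centers, sigma, w):
    """f(x) = w0 + sum_u wu Ku(d(xu, x))."""
    return rbf_features(X, centers, sigma) @ w

# toy usage
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (200, 1))
y = np.sin(X[:, 0])
centers, sigma, w = train_rbf(X, y, n_centers=15, sigma=0.7)
print(rbf_predict(np.array([[1.0]]), centers, sigma, w))   # roughly sin(1.0) ≈ 0.84
```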

Radial Basis Function Network

Global approximation to target function in terms of linear combination of local approximations

Used, e.g. for image classification

Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid

Closely related to distance-weighted regression but ”eager” instead of ”lazy”

Radial Basis Function Network

[Figure: network with input layer xi, kernel functions Kn, and output f(x)]

Kn(d(xn,x)) = exp(−½ d(xn,x)2 / σ2)

wn: linear parameters

f(x) = w0 + ∑n=1..k wn Kn(d(xn,x))

Training Radial Basis Function Networks

How to choose the center xn for each Kernel function Kn?

scatter them uniformly across the instance space, or use the distribution of the training instances (clustering)

How to train the weights? Choose mean xn and variance σn for each Kn

nonlinear optimization or EM

Hold Kn fixed and use local linear regression to compute the optimal weights wn

Radial Basis Network Example

K1(d(x1,x)) = exp(−½ d(x1,x)2 / σ2), with local linear models w1 x + w0 and w3 x + w2

f^(x) = K1 (w1 x + w0) + K2 (w3 x + w2)

Case-based reasoning

Can apply instance-based learning even when X ≠ ℜn

need different “distance” metric

Case-based reasoning is instance-based learning applied to instances with symbolic logic descriptions

((user-complaint error53-on-shutdown)

(cpu-model PowerPC)

(operating-system Windows)

(network-connection PCIA)

(memory 48meg)

(installed-applications Excel Netscape VirusScan)

(disk 1gig)

(likely-cause ???))


Case-based reasoning in CADET

CADET: 75 stored examples of mechanical devices

each training example: < qualitative function, mechanical structure>

new query: desired function

target value: mechanical structure for this function

Distance metric: match qualitative function descriptions

Case-based Reasoning in CADET

Instances represented by rich structural descriptions

Multiple cases retrieved (and combined) to form solution to new problem

Tight coupling between case retrieval and problem solving

Bottom line:

Simple matching of cases is useful for tasks such as answering help-desk queries

Area of ongoing research

Lazy and Eager learning

Lazy: wait for the query before generalizing (k-nearest neighbor, case-based reasoning)

Eager: generalize before seeing the query (radial basis function networks, ID3, backpropagation, Naive Bayes, …)

Does it matter?

Eager learner must create global approximation

Lazy learner can create many local approximations

If they use the same H, lazy can represent more complex functions (e.g., consider H = linear functions)

Machine Learning 2D5362

Instance Based Learning

Distance Weighted k-NN

Give more weight to neighbors closer to the query point

f^(xq) = ∑i=1..k wi f(xi) / ∑i=1..k wi

where wi = K(d(xq,xi)) and d(xq,xi) is the distance between xq and xi

Instead of only k-nearest neighbors use all training examples (Shepard’s method)

Distance Weighted Average

Weighting the data:

f^(xq) = ∑i f(xi) K(d(xi,xq)) / ∑i K(d(xi,xq))

Relevance of a data point (xi,f(xi)) is measured by calculating the distance d(xi,xq) between the query xq and the input vector xi

Weighting the error criterion:

E(xq) = ∑i (f^(xq) − f(xi))2 K(d(xi,xq))

the best estimate f^(xq) will minimize the cost E(xq), therefore ∂E(xq)/∂f^(xq) = 0
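Setting that derivative to zero recovers the distance-weighted average above; written out (a standard one-line derivation, not spelled out on the slides):

```latex
\frac{\partial E(x_q)}{\partial \hat f(x_q)}
  = \sum_i 2\,\bigl(\hat f(x_q) - f(x_i)\bigr)\,K\bigl(d(x_i,x_q)\bigr) = 0
  \;\Longrightarrow\;
  \hat f(x_q) = \frac{\sum_i f(x_i)\,K\bigl(d(x_i,x_q)\bigr)}{\sum_i K\bigl(d(x_i,x_q)\bigr)}
```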

Kernel Functions

Distance Weighted NN

K(d(xq,xi)) = 1/d(xq,xi)2

Distance Weighted NN

K(d(xq,xi)) = 1/(d0+d(xq,xi))2

Distance Weighted NN

K(d(xq,xi)) = exp(-(d(xq,xi)/σ0)2)
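The three kernels above written out as small Python helpers (the d0 and σ0 defaults are arbitrary, chosen only for illustration):

```python
import numpy as np

def inverse_square(d):
    return 1.0 / d ** 2                     # K(d) = 1/d^2 (blows up as d -> 0)

def shifted_inverse_square(d, d0=0.1):
    return 1.0 / (d0 + d) ** 2              # K(d) = 1/(d0 + d)^2 (bounded at d = 0)

def gaussian(d, sigma0=0.5):
    return np.exp(-(d / sigma0) ** 2)       # K(d) = exp(-(d/sigma0)^2)

for d in (0.1, 0.5, 1.0):                   # how each kernel attenuates with distance
    print(d, inverse_square(d), shifted_inverse_square(d), gaussian(d))
```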

Linear Global Models

The model is linear in the parameters wk, which can be estimated using a least squares algorithm

f^(xi) = ∑k=1..D wk xki, or in matrix form F(X) = Xw

where xi = (x1,…,xD)i, i = 1..N, with D the input dimension and N the number of data points.

Estimate the wk by minimizing the error criterion

E = ∑i=1..N (f^(xi) − yi)2

which gives the normal equations and their least-squares solution

(XT X) w = XT F(X)

w = (XT X)−1 XT F(X)

or, written component-wise, wk = ∑m=1..D ∑n=1..N (∑l=1..D xTkl xlm)−1 xTmn f(xn)
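A minimal least-squares sketch of this global linear model, solving the normal equations rather than forming the inverse explicitly (the toy data are illustrative):

```python
import numpy as np

def fit_global_linear(X, y):
    """Least-squares estimate of the linear parameters: w = (X^T X)^-1 X^T y,
    computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# toy usage: y = 2*x1 - x2 plus a little noise (prepend a column of 1s for a bias term)
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = 2.0 * X[:, 0] - X[:, 1] + 0.01 * rng.standard_normal(100)
print(fit_global_linear(X, y))   # approximately [2, -1]
```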

Linear Regression Example

Linear Local Models

Estimate the parameters β such that they locally (near the query point xq) match the training data, either by

weighting the data:

wi = K(d(xi,xq))1/2, and transforming zi = wi xi, vi = wi yi

or by weighting the error criterion:

E = ∑i=1..N (xiT β − yi)2 K(d(xi,xq))

still linear in β, with the least-squares solution

β = ((WX)T WX)−1 (WX)T W F(X)
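A sketch of the data-weighting route: transform (zi, vi) = (wi xi, wi yi) and run ordinary least squares, which matches the closed form above. The Gaussian kernel width, the appended intercept column, and the toy data are assumptions for illustration:

```python
import numpy as np

def local_linear_predict(X, y, x_query, sigma=0.3):
    """Weighted least squares around x_query via the data transformation
    zi = wi*xi, vi = wi*yi with wi = K(d(xi, xq))^(1/2)."""
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.sqrt(np.exp(-(d / sigma) ** 2))            # wi = K(d)^(1/2), Gaussian kernel assumed
    Xa = np.hstack([X, np.ones((len(X), 1))])         # column of 1s for the intercept b0
    Z, v = w[:, None] * Xa, w * y                     # transformed data
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)      # ordinary LSQ on (Z, v)
    return np.append(x_query, 1.0) @ beta             # evaluate the local model at xq

# a 1-D toy (not the slide's data), queried at xq = 0.35
x = np.linspace(0.0, 1.0, 40)[:, None]
y = np.sin(2 * np.pi * x[:, 0])
print(local_linear_predict(x, y, np.array([0.35])))
```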

Linear Local Model Example

query point xq = 0.35

kernel K(x, xq)

local linear model: f^(x) = b1 x + b0

f^(xq) = 0.266

Linear Local Model Example

Design Issues in Local Regression

Local model order (constant, linear, quadratic)

Distance function d

feature scaling: d(x,q) = (∑j=1..d mj (xj − qj)2)1/2

irrelevant dimensions: mj = 0

kernel function K

smoothing parameter bandwidth h in K(d(x,q)/h)

h = |m|: global bandwidth

h = distance to the k-th nearest neighbor point

h = h(q): depending on the query point

h = hi: depending on the stored data points

See paper by Atkeson [1996] ”Locally Weighted Learning”

Local Linear Models

Local Linear Model Tree (LOLIMOT)

• incremental tree construction algorithm
• partitions the input space by axis-orthogonal splits
• adds one local linear model per iteration

1. Start with an initial model (e.g. a single LLM)
2. Identify the LLM with the worst model error Ei
3. Check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension
4. Find the best (smallest error) of the possible divisions
5. Add the new validity function and LLM
6. Repeat from step 2 until the termination criterion is met (a sketch of this loop follows below)
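A simplified, self-contained sketch of this loop, under assumptions the slides leave open (Gaussian validity functions centered on each hyper-rectangle with width set to a third of the edge length, local models fitted by weighted least squares); the helper names and the toy data are mine:

```python
import numpy as np

def _augment(X):
    return np.hstack([X, np.ones((len(X), 1))])            # affine local models: theta^T [x, 1]

def _validity(X, rects):
    """Normalized Gaussian validity functions, one per hyper-rectangle
    (center = midpoint, sigma = 1/3 of the edge length: an assumed heuristic)."""
    Phi = []
    for lo, hi in rects:
        c, s = (lo + hi) / 2.0, (hi - lo) / 3.0 + 1e-12
        Phi.append(np.exp(-0.5 * (((X - c) / s) ** 2).sum(axis=1)))
    Phi = np.array(Phi).T                                    # shape (N, number of LLMs)
    return Phi / Phi.sum(axis=1, keepdims=True)

def _fit_llms(X, y, rects):
    """Fit each local linear model by validity-weighted least squares."""
    Xa, Phi = _augment(X), _validity(X, rects)
    thetas = []
    for m in range(len(rects)):
        w = np.sqrt(Phi[:, m])
        theta, *_ = np.linalg.lstsq(w[:, None] * Xa, w * y, rcond=None)
        thetas.append(theta)
    return np.array(thetas), Phi

def _predict(X, rects, thetas):
    return (_validity(X, rects) * (_augment(X) @ thetas.T)).sum(axis=1)

def lolimot(X, y, n_models=6):
    """Grow the partition by axis-orthogonal splits, one new LLM per iteration."""
    rects = [(X.min(axis=0), X.max(axis=0))]                 # 1. start with a single global LLM
    while len(rects) < n_models:
        thetas, Phi = _fit_llms(X, y, rects)
        res2 = (y - _predict(X, rects, thetas)) ** 2
        worst = int(np.argmax(Phi.T @ res2))                 # 2. LLM with the worst local error
        lo_w, hi_w = rects[worst]
        best = None
        for j in range(X.shape[1]):                          # 3. halve the worst rectangle along each dimension
            mid = (lo_w[j] + hi_w[j]) / 2.0
            a_hi, b_lo = hi_w.copy(), lo_w.copy()
            a_hi[j], b_lo[j] = mid, mid
            cand = rects[:worst] + rects[worst + 1:] + [(lo_w, a_hi), (b_lo, hi_w)]
            th, _ = _fit_llms(X, y, cand)
            err = ((y - _predict(X, cand, th)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, cand)                           # 4. keep the division with the smallest error
        rects = best[1]                                      # 5. add the new validity function and LLM
    return rects, _fit_llms(X, y, rects)[0]                  # 6. stop once n_models is reached

# toy usage: approximate a 1-D sine with 8 local linear models
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (200, 1))
y = np.sin(2 * np.pi * X[:, 0])
rects, thetas = lolimot(X, y, n_models=8)
print(_predict(np.array([[0.25]]), rects, thetas))           # roughly sin(pi/2) = 1
```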

LOLIMOT

Initial global linear model

Split along x1 or x2

Pick the split that minimizes the model error (residual)

LOLIMOT Example

LOLIMOT Example

Lazy and Eager Learning

Lazy: wait for the query before generalizing (k-nearest neighbors, weighted linear regression)

Eager: generalize before seeing the query (radial basis function networks, decision trees, back-propagation, LOLIMOT)

Eager learner must create global approximation

Lazy learner can create local approximations

If they use the same hypothesis space, lazy can represent more complex functions (e.g., consider H = linear functions)
