
Page 1

CS 7616 Pattern Recognition
Non-parametric methods: Kernels and other Naïve Things

Aaron Bobick
School of Interactive Computing

Page 2

Administrivia

• Next PS will use the same data and two more methods: probably k-NN and maybe Naïve Bayes.

Page 3

Today brought to you by…
• This lecture (and I'm guessing some more to come!) graciously provided (with real email!) by Aarti Singh @ CMU

Page 4

Parametric methods
• Assume some functional form (Gaussian, logistic, linear, quadratic) for
  • P(X|Y) and P(Y), as in Bayes (remember Y is now the label)
  • P(Y|X), as in logistic, linear and nonlinear regression
• Estimate the parameters (μ, Σ, w, b) using MLE/MAP or through gradient ascent
• Pro – need few data points to learn the parameters
• Con – strong distributional assumptions, not satisfied in practice
  • Doesn't really work very well (sometimes? often?)

Page 5

Example

[Figure: hand-written digit images projected as points in a two-dimensional (nonlinear) feature space]

Page 6

Non-parametric methods
• Typically don't make any distributional assumptions
• As we have more data, we should be able to learn more complex models
• Let the number of parameters scale with the amount of training data

• Today, we will see some nonparametric methods for:
  • Density estimation
  • Classification
  • Maybe regression and Naïve Bayes

Page 7

Histogram density estimate
• Partition the feature space into distinct bins with widths Δᵢ and count the number of observations, nᵢ, in each bin:

$$\hat{p}(x) = \sum_i \frac{n_i}{n\,\Delta_i}\,\mathbf{1}_{x \in \mathrm{Bin}_i}$$

• Often, the same width is used for all bins, Δᵢ = Δ.
• Δ acts as a smoothing parameter.

Image src: Bishop book
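A minimal sketch of this estimator in code (my own illustration, not from the slides; equal-width bins, 1-D data):

```python
import numpy as np

def histogram_density(data, x, n_bins=10):
    """Histogram density estimate: p_hat(x) = n_i / (n * delta) for the bin containing x."""
    n = len(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    delta = edges[1] - edges[0]                   # common bin width
    counts, _ = np.histogram(data, bins=edges)    # n_i for each bin
    i = int(np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1))
    return counts[i] / (n * delta)

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(histogram_density(samples, 0.0))            # highest near the mode of N(0,1)
```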

Page 8

Bias – variance tradeoff

Bias – how close the mean of the estimate is to the truth.
Variance – how much the estimate varies around its mean.

• Choice of bin width ∆, or equivalently # bins = 1/∆:
  • Small ∆, large # bins: "small bias, large variance" (p(x) approximately constant per bin, but few data per bin)
  • Large ∆, small # bins: "large bias, small variance" (more data per bin, stable estimate)

Page 9

Choice of # bins

# bins = 1/∆. For fixed n, as ∆ decreases, nᵢ decreases.

MSE = Bias² + Variance

Image src: Bishop book

Page 10

Histogram as MLE
• Underlying model – the density is constant on each bin. Parameters pⱼ: density in bin j.
  Note the normalization constraint: $\sum_j \Delta \cdot p_j = 1$
• Maximize the likelihood of the data under the probability model with parameters pⱼ.
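Filling in the maximization the slide leaves implicit (standard result; nⱼ is the count in bin j and n the total): with a Lagrange multiplier for the constraint,

$$\max_{\{p_j\}} \sum_j n_j \log p_j - \lambda\Bigl(\sum_j \Delta\, p_j - 1\Bigr)
\;\Longrightarrow\; \frac{n_j}{p_j} = \lambda\,\Delta
\;\Longrightarrow\; \hat{p}_j = \frac{n_j}{n\,\Delta},$$

which is exactly the histogram estimate from the previous slides.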

Page 11

Kernel density estimate

• Histogram – blocky estimate
• Kernel density estimate, aka the "Parzen/moving window method"

[Figure: two density estimates plotted over −5 to 5, with density values up to about 0.35]

Page 12

Kernel density estimate

• More generally…

[Figure: kernel window over the interval −1 to 1]

Page 13

Kernel density estimation

• Place small "bumps" at each data point, determined by the kernel function.
• The estimator consists of a (normalized) "sum of bumps".
• Note that where the points are denser, the density estimate has higher values.

[Figure: Gaussian bumps (red) around six data points and their sum (blue)]

Img src: Wikipedia
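A minimal sketch of such an estimator (my own illustration; the six data points are made up to echo the figure):

```python
import numpy as np

def kde(x, data, h=0.5):
    """Kernel density estimate: average of Gaussian bumps centered at each data point."""
    u = (x - data[:, None]) / h                       # scaled distances, shape (n_data, n_query)
    bumps = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return bumps.mean(axis=0) / h                     # (1/nh) * sum_i K((x - x_i)/h)

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])    # six made-up data points
xs = np.linspace(-5, 10, 200)
density = kde(xs, data)
print(density.max())                                  # peaks where points are densest
```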

Page 14

Kernels

Any kernel function K(u) can be used that satisfies

$$K(u) \ge 0, \qquad \int K(u)\,du = 1$$

Page 15

Kernels

• Finite support – only need local points to compute the estimate.
• Infinite support – need all points to compute the estimate, but quite popular since the estimate is smoother.

Page 16

Choice of kernel bandwidth

[Figure: the "Bart-Simpson" density estimated with a bandwidth that is too small, just right, and too large]

Page 17

Histograms vs. kernel density estimation

∆ (for histograms) and h (for kernels) play the same role: each acts as a smoother.

Page 18

How to pick ∆ (or the bandwidth h)
• Fundamental question: how do we pick the bin width?
• The fewer samples you have, the bigger you need the bin to be, to avoid accidental variations in the density estimate.
• Actually, the fewer samples you have near x, the bigger the bin has to be around x.
• Obvious idea: make the bin size a function of the number of samples in the neighborhood of x.

Page 19

k-NN (nearest neighbor) density estimation

• Histogram
• Kernel density estimate
• k-NN density estimate

Before: fix ∆, estimate the number of points within ∆ of x (nᵢ or nₓ) from the data.
Now: fix nₓ = k, estimate ∆ from the data (the volume of the ball around x that contains the k nearest training points).
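The resulting plug-in estimator (standard form) is $\hat{p}(x) = k / (n\,V_k(x))$, where $V_k(x)$ is the volume of the smallest ball around $x$ containing $k$ training points. A minimal 1-D sketch (my own illustration):

```python
import numpy as np

def knn_density(x, data, k=10):
    """k-NN density estimate in 1-D: p_hat(x) = k / (n * length of k-NN interval)."""
    dists = np.sort(np.abs(data - x))
    radius = dists[k - 1]             # distance to the k-th nearest neighbor
    volume = 2 * radius               # a 1-D "ball" is an interval of length 2r
    return k / (len(data) * volume)

rng = np.random.default_rng(1)
samples = rng.normal(size=500)
print(knn_density(0.0, samples, k=25))   # roughly 1/sqrt(2*pi) ≈ 0.40 near the mode
```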

Page 20

k-NN density estimation

k acts as a smoother.

Not very popular for density estimation – expensive to compute, bad estimates. But a related version for classification is quite popular…

Page 21

From density estimation to classification

Page 22

k-NN classifier

[Figure: training documents from three classes – Sports, Science, Arts]

Page 23

k-NN classifier

[Figure: the same Sports/Science/Arts documents, plus a test document to classify]

Page 24

k-NN classifier (k = 5)

[Figure: the test document and its 5 nearest neighbors, inside a ball of radius ∆k,x]

What should we predict? … Average? Majority? Why?

Page 25

k-NN classifier
• Optimal classifier: pick the class maximizing P(x|y) P(y).
• k-NN classifier: plug in estimates built from two counts – the total # of training points of class y, and the # of training points of class y that lie within the ∆ₖ ball.
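Spelling out the plug-in argument behind this slide (standard derivation; $n_y$ and $k_y$ are my names for the two counts above, $V$ the volume of the ∆ₖ ball):

$$\hat{P}(x \mid y) = \frac{k_y}{n_y\,V}, \qquad \hat{P}(y) = \frac{n_y}{n}
\;\Longrightarrow\;
\arg\max_y \hat{P}(x \mid y)\,\hat{P}(y) = \arg\max_y \frac{k_y}{n\,V} = \arg\max_y k_y,$$

i.e. a majority vote among the k nearest neighbors.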

Page 26

1-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 27

2-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Even k is not used in practice (ties).

Page 28

3-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 29

5-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 30

What is the best k?

Bias-variance tradeoff:
• Larger k ⇒ predicted label is more stable (lower variance)
• Smaller k ⇒ predicted label adapts to finer local structure (lower bias)
Similar to density estimation.

Page 31

1-NN classifier – decision boundary

[Figure: Voronoi diagram induced by the training points]

Page 32

k-NN classifier – decision boundary

• k acts as a smoother (bias-variance tradeoff)

Page 33

Case Study: kNN for Web Classification
• Dataset
  • 20 Newsgroups (20 classes)
  • Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  • 61,118 words, 18,774 documents
  • Class label descriptions

Page 34

Experimental Setup
• Training/test sets:
  • 50%–50% random split
  • 10 runs
  • report average results
• Evaluation criteria:

Page 35

Results: Binary Classes

[Figure: accuracy vs. k for three binary tasks – alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles]

Page 36

From Classification to Regression

Page 37

Temperature sensing
• What is the temperature in the room? → Average.
• What is the temperature at location x? → "Local" average.

Page 38

Kernel Regression

• Aka local regression; the Nadaraya-Watson kernel estimator.
• Weight each training point based on its distance (scaled by h) to the test point.
• A boxcar kernel yields the local average.
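The standard form of the Nadaraya-Watson estimator named above:

$$\hat{f}(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)}$$

and a minimal sketch in code (my own illustration, Gaussian kernel):

```python
import numpy as np

def nadaraya_watson(x, X, Y, h=1.0):
    """Nadaraya-Watson: kernel-weighted average of the training targets Y."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian kernel weights K((x - Xi)/h)
    return np.sum(w * Y) / np.sum(w)

X = np.linspace(0, 10, 50)
Y = np.sin(X) + 0.1 * np.random.default_rng(2).normal(size=50)
print(nadaraya_watson(5.0, X, Y, h=0.5))    # close to sin(5) ≈ -0.96
```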

Page 39

Kernels

[Figure: kernel functions]

Page 40

Choice of kernel bandwidth h

[Figure: fits with h = 1, h = 10, h = 50, h = 200 – ranging from too small, through just right, to too large]

Choice of kernel is not that important.

Page 41

Kernel Regression as Weighted Least Squares

Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(Xᵢ) = β (a constant).
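Concretely, in the standard form of that weighted least squares problem:

$$\hat\beta(x) = \arg\min_{\beta} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\left(Y_i - \beta\right)^2
= \frac{\sum_i K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_i K\!\left(\frac{x - X_i}{h}\right)},$$

which is exactly the Nadaraya-Watson estimate: setting the derivative with respect to β to zero gives the kernel-weighted mean.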

Page 42

Kernel Regression as Weighted Least Squares

Notice that setting f(Xᵢ) = β (a constant) makes the weighted least squares solution a kernel-weighted average of the Yᵢ.

Page 43

Local Linear/Polynomial Regression

Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f to a local polynomial of degree p around X.
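In the standard form, the locally constant β is replaced by a degree-p polynomial in (Xᵢ − x):

$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\Bigl(Y_i - \sum_{j=0}^{p} \beta_j\,(X_i - x)^j\Bigr)^{2}, \qquad \hat f(x) = \hat\beta_0.$$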

Page 44

When doesn't this work?
• Not enough points in a bin of reasonable size…
• When does that happen?
• When the dimensions of the space get too large.
• For example:

Page 45

Curse of dimensionality…

Page 46

Bayesian Revisited and "Naïve Bayes"
• Given data X, the posterior probability of a hypothesis h, P(h|X), follows Bayes' theorem:

$$P(h \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid h)\,P(h)}{P(\mathbf{X})}$$

• MAP (maximum a posteriori) hypothesis:

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid \mathbf{X}) = \arg\max_{h \in H} P(\mathbf{X} \mid h)\,P(h)$$

• Practical difficulty: requires estimation of the multi-dimensional densities P(X|h) with limited data. (h is, for example, class Cⱼ.)

Page 47

Naïve Bayes Classifier and kernel densities
• A simplifying assumption: attributes are conditionally independent given the class:

$$P(C_j \mid \mathbf{X}) \propto P(C_j)\,\prod_{i=1}^{n} P(x_i \mid C_j)$$

• Assumes that the density of each element of the vector is independent given the class.
• Many fewer parameters. E.g., for 5-dimensional Gaussians, 10 parameters instead of 20 (μ and Σ).
• And you can use kernel density estimation to learn each one!
• Conventional wisdom: Naïve Bayes gives poor densities but good classification!
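A minimal sketch combining the two ideas above, Naïve Bayes with a 1-D kernel density estimate per feature per class (my own illustration; Gaussian kernels, fixed bandwidth h):

```python
import numpy as np

def kde1d(x, data, h=0.5):
    """1-D Gaussian kernel density estimate at a single point x."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2 * np.pi))

def nb_kde_predict(x, X, y, h=0.5):
    """Naive Bayes with per-feature KDEs: argmax_c log P(c) + sum_i log p_hat(x_i | c)."""
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]                 # training points of class c
        prior = np.mean(y == c)        # P(c) estimated by class frequency
        log_lik = sum(np.log(kde1d(x[i], Xc[:, i], h)) for i in range(len(x)))
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(nb_kde_predict(np.array([2.5, 2.5]), X, y))   # likely class 1
```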

Page 48

Summary
• Instance-based / non-parametric approaches

Four things make a memory-based learner:
1. A distance metric, dist(x, Xᵢ)
   – Euclidean (and many more)
2. How many nearby neighbors (or what radius) to look at?
   – k, ∆/h
3. A weighting function (optional)
   – W based on kernel K
4. How to fit with the local points?
   – Average, majority vote, weighted average, polynomial fit

Page 49

Summary
• Parametric vs. nonparametric approaches

Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data; parametric models rely on very strong (simplistic) distributional assumptions.

Nonparametric models (other than histograms) require storing and computing with the entire data set; parametric models, once fitted, are much more efficient in terms of storage and computation.

Page 50

Remaining slides not used yet
• Taken from 600.325/425 Declarative Methods – J. Eisner

Page 51

Intuition behind memory-based learning
• Similar inputs map to similar outputs
  • If not true, learning is impossible
  • If true, learning reduces to defining "similar"
• Not all similarities are created equal
  • To guess J. D. Salinger's weight, who are the similar people? Those with similar occupation, age, diet, genes, climate, …
  • To guess J. D. Salinger's IQ: similar occupation, writing style, fame, SAT score, …
• Superficial vs. deep similarities?
  • B. F. Skinner and the behaviorism movement vs. what brains actually do

parts of slide thanks to Rich Caruana

Page 52

1-Nearest Neighbor
• Define a distance d(x1, x2) between any 2 examples
  • examples are feature vectors
  • so could just use Euclidean distance …
• Training: index the training examples for fast lookup.
• Test: given a new x, find the closest x1 in training. Classify x the same as x1 (positive or negative).
• Can learn complex decision boundaries.
• As training size → ∞, the error rate is at most 2× the Bayes-optimal rate (i.e., the error rate you'd get from knowing the true model that generated the data – whatever it is!)
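A minimal sketch of the train/test recipe above (brute-force search instead of a real index; my own illustration):

```python
import numpy as np

def nn1_predict(x, X_train, y_train):
    """1-NN: label of the single closest training example under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nn1_predict(np.array([4.0, 4.5]), X_train, y_train))  # -> 1
```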

Page 53

1-Nearest Neighbor – decision boundary

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

Page 54

k-Nearest Neighbor

Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote.

• The average of k points is more reliable when there is:
  • noise in the training vectors x
  • noise in the training labels y
  • partial overlap between classes

[Figure: + and o examples scattered over attribute_1 vs. attribute_2]

slide thanks to Rich Caruana (modified)

Page 55

1 Nearest Neighbor – decision boundary

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

Page 56

15 Nearest Neighbors – it's smoother!

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

Page 57

How to choose "k"
• Odd k (often 1, 3, or 5):
  • avoids the problem of breaking ties (in a binary classifier)
• Large k:
  • less sensitive to noise (particularly class noise)
  • better probability estimates for discrete classes
  • larger training sets allow larger values of k
• Small k:
  • captures the fine structure of the problem space better
  • may be necessary with small training sets
• Balance between large and small k
  • What does this remind you of?
• As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

slide thanks to Rich Caruana (modified)

Page 58

[Figure from Hastie, Tibshirani, Friedman 2001, p. 419 – why?]

slide thanks to Rich Caruana (modified)

Page 59

Cross-Validation
• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• "Leave-one-out" cross-validation (LOOCV):
  • "remove" each case one at a time
  • use it as a test case, with the remaining cases as the training set
  • average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with MBL!

slide thanks to Rich Caruana
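A minimal sketch of why LOOCV is so cheap for memory-based learners (my own illustration): there is no model to refit, so each fold just excludes one point from the distance search.

```python
import numpy as np

def loocv_error_1nn(X, y):
    """Leave-one-out error for 1-NN: classify each point by its nearest *other* point."""
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                     # exclude the held-out point itself
        errors += y[np.argmin(dists)] != y[i]
    return errors / len(X)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(loocv_error_1nn(X, y))
```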

Page 60

Distance-Weighted kNN
• hard to pick large vs. small k
  • may not even want k to be constant
• use large k, but put more emphasis on nearer neighbors?

$$\text{prediction}(x) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}$$

where $x_1, \dots, x_k$ are the k-NN and $y_1, \dots, y_k$ their labels. We define the relative weights for the k-NN as

$$w_i = \frac{1}{Dist(x, x_i)} \quad \text{or maybe} \quad \frac{1}{Dist(x, x_i)^2} \quad \text{or often} \quad \frac{1}{\exp\!\bigl(\beta \cdot Dist(x, x_i)\bigr)}$$
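A minimal sketch of the weighted vote above for regression (my own illustration, using the exponential weights):

```python
import numpy as np

def weighted_knn_predict(x, X, y, k=5, beta=1.0):
    """Distance-weighted kNN regression: prediction = sum(w_i * y_i) / sum(w_i)."""
    dists = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    w = np.exp(-beta * dists[nn])            # w_i = 1 / exp(beta * Dist(x, x_i))
    return np.sum(w * y[nn]) / np.sum(w)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_predict(np.array([1.5]), X, y, k=3))
```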

Page 61

Combining k-NN with other methods, #1
• Instead of having the k-NN simply vote, put them into a little machine learner!
• To classify x, train a "local" classifier on its k nearest neighbors (maybe weighted).
  • polynomial, neural network, …

parts of slide thanks to Rich Caruana

Page 62

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important

[Figure: + and o examples over attribute_1 vs. attribute_2]

parts of slide thanks to Rich Caruana

Page 63

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #1:
  • What if the input represents physical weight not in pounds but in milligrams?
  • Then small differences in the physical-weight dimension have a huge effect on distances, overwhelming other features: the o's become "closer" to the +'s than to each other.
  • Should really correct for these arbitrary "scaling" issues.
• One simple idea: rescale so that the standard deviation of each dimension = 1.

[Figure: the same + and o data plotted against weight (lb), and against weight (mg) – the mg scaling is bad]

Page 64

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #2:
  • What if some dimensions are more correlated with the true label? (more relevant, or less noisy)
• Stretch those dimensions out so that they are more important in determining distance.
  • One common technique for choosing the stretch is "information gain."

[Figure: the same data with the most relevant attribute compressed (bad) vs. stretched (good)]

parts of slide thanks to Rich Caruana

Page 65

Weighted Euclidean Distance

$$d(x, x') = \sqrt{\sum_{i=1}^{N} s_i\,(x_i - x'_i)^2}$$

• large weight sᵢ: attribute i is more important
• small weight sᵢ: attribute i is less important
• zero weight sᵢ: attribute i doesn't matter

slide thanks to Rich Caruana (modified)

Page 66

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #3:
  • Do we really want to decide separately and theoretically how to scale each dimension?
• Could simply pick the dimension scaling factors that maximize performance on development data (maybe do leave-one-out).
  • Similarly, pick the number of neighbors k and how to weight them.
• Especially useful if the performance measurement is complicated (e.g., 3 classes and differing misclassification costs).

[Figure: + and o examples over attribute_1 vs. attribute_2]

Page 67

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #4:
  • Is it the original input dimensions that we want to scale? (e.g., should we replot on a log scale before measuring distance?)
  • What if the true clusters run diagonally? Or curve?
  • We can transform the data first by extracting a different, useful set of features from it:
    • Linear discriminant analysis
    • Hidden layer of a neural network
  • i.e., redescribe the data by how a different type of learned classifier internally sees it

[Figure: the same data plotted over attribute_1 vs. over exp(attribute_1)]

Page 68

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #5:
  • Do we really want to transform the data globally?
  • What if different regions of the data space behave differently?
  • Could find 300 "nearest" neighbors (using the global transform), then locally transform that subset of the data to redefine "near"
  • Maybe could use decision trees to split up the data space first

[Figure: two separate clusters of + and o examples over attribute_1 vs. attribute_2, with query points marked "?"]

Page 69

Why are we doing all this preprocessing?
• Shouldn't the user figure out a smart way to transform the data before giving it to k-NN?
• Sure, that's always good, but what will the user try?
  • Probably a lot of the same things we're discussing.
  • She'll stare at the training data and try to figure out how to transform it so that close neighbors tend to have the same label.
  • To be nice to her, we're trying to automate the most common parts of that process – like scaling the dimensions appropriately.
  • We may still miss patterns that her visual system or expertise can find, so she may still want to transform the data.
  • On the other hand, we may find patterns that would be hard for her to see.

Page 70

Tangent: Decision Trees (a different simple method)

Is this Reuters article an Earnings Announcement?
Split on the feature that reduces our uncertainty most.

• All docs: 2301/7681 = 0.30 are earnings announcements. Split on "cents":
  • contains "cents" < 2 times: 694/5977 = 0.116. Split on "net":
    • contains "net" < 1 time: 272/5436 = 0.050 → "no"
    • contains "net" ≥ 1 time: 422/541 = 0.780
  • contains "cents" ≥ 2 times: 1607/1704 = 0.943. Split on "versus":
    • contains "versus" < 2 times: 209/301 = 0.694
    • contains "versus" ≥ 2 times: 1398/1403 = 0.996 → "yes"

example thanks to Manning & Schütze

Page 71

Booleans, Nominals, Ordinals, and Reals
• Consider attribute value differences (xᵢ – x'ᵢ): what does subtraction do?
  • Reals: easy! full continuum of differences
  • Integers: not bad: discrete set of differences
  • Ordinals: not bad: discrete set of differences
  • Booleans: awkward: Hamming distance of 0 or 1
  • Nominals? not good! recode as Booleans?

slide thanks to Rich Caruana (modified)

Page 72

"Curse of Dimensionality"
• Pictures on previous slides showed 2-dimensional data
• What happens with lots of dimensions?
• 10 training samples cover the space less & less well…

images thanks to Charles Annis

Page 73

"Curse of Dimensionality"
• A deeper perspective on this:
  • Random points chosen in a high-dimensional space tend to all be pretty much equidistant from one another!
  • (Proof sketch: in 1000 dimensions, the squared distance between two random points is a sample variance of 1000 coordinate distances. Since 1000 is large, this sample variance is usually close to the true variance.)
• So each test example is about equally close to most training examples.
• We need a lot of training examples to expect one that is unusually close to the test example.

images thanks to Charles Annis
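A tiny numeric illustration of the equidistance claim above (my own sketch, not from the slides): the relative spread of pairwise distances between random points shrinks dramatically as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(5)
for d in (2, 1000):
    pts = rng.uniform(size=(100, d))                  # 100 random points in [0,1]^d
    sq = np.sum(pts**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * pts @ pts.T  # squared pairwise distances
    dists = np.sqrt(np.maximum(d2[np.triu_indices(100, k=1)], 0))
    print(d, round(float(dists.std() / dists.mean()), 3))  # relative spread shrinks with d
```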

Page 74

"Curse of Dimensionality"
• Also, with lots of dimensions/attributes/features, the irrelevant ones may overwhelm the relevant ones:

$$d(x, x') = \sqrt{\sum_{i=1}^{\#\text{relevant}} (x_i - x'_i)^2 \;+\; \sum_{j=1}^{\#\text{irrelevant}} (x_j - x'_j)^2}$$

• So the ideas from previous slides grow in importance:
  • feature weights (scaling)
  • feature selection (try to identify & discard irrelevant features)
    • but with lots of features, some irrelevant ones will probably accidentally look relevant on the training data
  • smooth by allowing more neighbors to vote (e.g., larger k)

slide thanks to Rich Caruana (modified)

Page 75

Advantages of Memory-Based Methods
• Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  • never need to learn a global model
  • many simple local models taken together can represent a more complex global model
• Learns arbitrarily complicated decision boundaries
• Very efficient cross-validation
• Easy to explain to users how it works
  • … and why it made a particular decision!
• Can use any distance metric: string-edit distance, …
  • handles missing values, time-varying distributions, ...

slide thanks to Rich Caruana (modified)

Page 76

Weaknesses of Memory-Based Methods
• Curse of dimensionality
  • often works best with 25 or fewer dimensions
• Classification runtime scales with training set size
  • clever indexing may help (k-d trees? locality-sensitive hashing?)
  • large training sets will not fit in memory
• Sometimes you wish NN stood for "neural net" instead of "nearest neighbor"
  • simply averaging nearby training points isn't very subtle
  • naive distance functions are overly respectful of the input encoding
• For regression (predict a number rather than a class), the extrapolated surface has discontinuities

slide thanks to Rich Caruana (modified)

Page 77

Current Research in MBL
• Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10⁶–10¹² cases
• Learn better distance metrics
• Feature selection
• Overfitting, VC-dimension, ...
• MBL in higher dimensions
• MBL in non-numeric domains:
  • Case-Based or Example-Based Reasoning
  • Reasoning by Analogy

slide thanks to Rich Caruana