
Page 1

CS 7616 Pattern Recognition
Non-parametric methods: Kernels and other Naïve Things

Aaron Bobick
School of Interactive Computing

Page 2

Administrivia

• Next PS will use the same data and two more methods: probably k-NN and maybe Naïve Bayes.

Page 3

Today brought to you by…
• This lecture (and I'm guessing some more to come!) graciously provided (with real email!) by Aarti Singh @ CMU

Page 4

Parametric methods
• Assume some functional form (Gaussian, logistic, linear, quadratic) for
  • P(X|Y) and P(Y), as in Bayes (remember Y is now the label)
  • P(Y|X), as in logistic, linear and nonlinear regression
• Estimate the parameters (μ, Σ, w, b) using MLE/MAP or through gradient ascent
• Pro – need few data points to learn the parameters
• Con – strong distributional assumptions, not satisfied in practice
  • Doesn't really work very well (sometimes? often?)

Page 5

Example

[Figure: hand-written digit images projected as points in a two-dimensional (nonlinear) feature space]

Page 6

Non-parametric methods
• Typically don't make any distributional assumptions
• As we have more data, we should be able to learn more complex models
• Let the number of parameters scale with the amount of training data

• Today, we will see some nonparametric methods for:
  • Density estimation
  • Classification
  • Maybe regression and Naïve Bayes

Page 7

Histogram density estimate
• Partition the feature space into distinct bins with widths Δᵢ and count the number of observations, nᵢ, in each bin:

$$\hat{p}(x) = \sum_i \frac{n_i}{n\,\Delta_i}\,\mathbf{1}_{x \in \mathrm{Bin}_i}$$

• Often, the same width is used for all bins, Δᵢ = Δ.
• Δ acts as a smoothing parameter.

Image src: Bishop book
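A minimal sketch of this estimator in code (my own illustration, not from the slides; equal-width bins, 1-D data):

```python
import numpy as np

def histogram_density(data, x, n_bins=10):
    """Histogram density estimate: p_hat(x) = n_i / (n * delta) for the bin containing x."""
    n = len(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    delta = edges[1] - edges[0]                   # common bin width
    counts, _ = np.histogram(data, bins=edges)    # n_i for each bin
    i = int(np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1))
    return counts[i] / (n * delta)

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(histogram_density(samples, 0.0))            # highest near the mode of N(0,1)
```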

Page 8

Bias – variance tradeoff

Bias – how close the mean of the estimate is to the truth.
Variance – how much the estimate varies around its mean.

• Choice of bin width ∆, or equivalently # bins = 1/∆:
  • Small ∆, large # bins: "small bias, large variance" (p(x) approximately constant per bin, but few data per bin)
  • Large ∆, small # bins: "large bias, small variance" (more data per bin, stable estimate)

Page 9

Choice of # bins

# bins = 1/∆. For fixed n, as ∆ decreases, nᵢ decreases.

MSE = Bias² + Variance

Image src: Bishop book

Page 10

Histogram as MLE
• Underlying model – the density is constant on each bin. Parameters pⱼ: density in bin j.
  Note the normalization constraint: $\sum_j \Delta \cdot p_j = 1$
• Maximize the likelihood of the data under the probability model with parameters pⱼ.
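Filling in the maximization the slide leaves implicit (standard result; nⱼ is the count in bin j and n the total): with a Lagrange multiplier for the constraint,

$$\max_{\{p_j\}} \sum_j n_j \log p_j - \lambda\Bigl(\sum_j \Delta\, p_j - 1\Bigr)
\;\Longrightarrow\; \frac{n_j}{p_j} = \lambda\,\Delta
\;\Longrightarrow\; \hat{p}_j = \frac{n_j}{n\,\Delta},$$

which is exactly the histogram estimate from the previous slides.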

Page 11

Kernel density estimate

• Histogram – blocky estimate
• Kernel density estimate, aka the "Parzen/moving window method"

[Figure: two density estimates plotted over −5 to 5, with density values up to about 0.35]

Page 12

Kernel density estimate

• More generally…

[Figure: kernel window over the interval −1 to 1]

Page 13

Kernel density estimation

• Place small "bumps" at each data point, determined by the kernel function.
• The estimator consists of a (normalized) "sum of bumps".
• Note that where the points are denser, the density estimate has higher values.

[Figure: Gaussian bumps (red) around six data points and their sum (blue)]

Img src: Wikipedia
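A minimal sketch of such an estimator (my own illustration; the six data points are made up to echo the figure):

```python
import numpy as np

def kde(x, data, h=0.5):
    """Kernel density estimate: average of Gaussian bumps centered at each data point."""
    u = (x - data[:, None]) / h                       # scaled distances, shape (n_data, n_query)
    bumps = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K(u)
    return bumps.mean(axis=0) / h                     # (1/nh) * sum_i K((x - x_i)/h)

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])    # six made-up data points
xs = np.linspace(-5, 10, 200)
density = kde(xs, data)
print(density.max())                                  # peaks where points are densest
```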

Page 14

Kernels

Any kernel function K(u) can be used that satisfies

$$K(u) \ge 0, \qquad \int K(u)\,du = 1$$

Page 15

Kernels

• Finite support – only need local points to compute the estimate.
• Infinite support – need all points to compute the estimate, but quite popular since the estimate is smoother.

Page 16

Choice of kernel bandwidth

[Figure: the "Bart-Simpson" density estimated with a bandwidth that is too small, just right, and too large]

Page 17

Histograms vs. kernel density estimation

∆ (for histograms) and h (for kernels) play the same role: each acts as a smoother.

Page 18

How to pick ∆ (or the bandwidth h)
• Fundamental question: how do we pick the bin width?
• The fewer samples you have, the bigger you need the bin to be, to avoid accidental variations in the density estimate.
• Actually, the fewer samples you have near x, the bigger the bin has to be around x.
• Obvious idea: make the bin size a function of the number of samples in the neighborhood of x.

Page 19

k-NN (nearest neighbor) density estimation

• Histogram
• Kernel density estimate
• k-NN density estimate

Before: fix ∆, estimate the number of points within ∆ of x (nᵢ or nₓ) from the data.
Now: fix nₓ = k, estimate ∆ from the data (the volume of the ball around x that contains the k nearest training points).
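The resulting plug-in estimator (standard form) is $\hat{p}(x) = k / (n\,V_k(x))$, where $V_k(x)$ is the volume of the smallest ball around $x$ containing $k$ training points. A minimal 1-D sketch (my own illustration):

```python
import numpy as np

def knn_density(x, data, k=10):
    """k-NN density estimate in 1-D: p_hat(x) = k / (n * length of k-NN interval)."""
    dists = np.sort(np.abs(data - x))
    radius = dists[k - 1]             # distance to the k-th nearest neighbor
    volume = 2 * radius               # a 1-D "ball" is an interval of length 2r
    return k / (len(data) * volume)

rng = np.random.default_rng(1)
samples = rng.normal(size=500)
print(knn_density(0.0, samples, k=25))   # roughly 1/sqrt(2*pi) ≈ 0.40 near the mode
```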

Page 20

k-NN density estimation

k acts as a smoother.

Not very popular for density estimation – expensive to compute, bad estimates. But a related version for classification is quite popular…

Page 21

From density estimation to classification

Page 22

k-NN classifier

[Figure: training documents from three classes – Sports, Science, Arts]

Page 23

k-NN classifier

[Figure: the same Sports/Science/Arts documents, plus a test document to classify]

Page 24

k-NN classifier (k = 5)

[Figure: the test document and its 5 nearest neighbors, inside a ball of radius ∆k,x]

What should we predict? … Average? Majority? Why?

Page 25

k-NN classifier
• Optimal classifier: pick the class maximizing P(x|y) P(y).
• k-NN classifier: plug in estimates built from two counts – the total # of training points of class y, and the # of training points of class y that lie within the ∆ₖ ball.
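Spelling out the plug-in argument behind this slide (standard derivation; $n_y$ and $k_y$ are my names for the two counts above, $V$ the volume of the ∆ₖ ball):

$$\hat{P}(x \mid y) = \frac{k_y}{n_y\,V}, \qquad \hat{P}(y) = \frac{n_y}{n}
\;\Longrightarrow\;
\arg\max_y \hat{P}(x \mid y)\,\hat{P}(y) = \arg\max_y \frac{k_y}{n\,V} = \arg\max_y k_y,$$

i.e. a majority vote among the k nearest neighbors.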

Page 26

1-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 27

2-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Even k is not used in practice (ties).

Page 28

3-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 29

5-Nearest Neighbor (kNN) classifier

[Figure: Sports, Science, Arts documents]

Page 30

What is the best k?

Bias-variance tradeoff:
• Larger k ⇒ predicted label is more stable (lower variance)
• Smaller k ⇒ predicted label adapts to finer local structure (lower bias)
Similar to density estimation.

Page 31

1-NN classifier – decision boundary

[Figure: Voronoi diagram induced by the training points]

Page 32

k-NN classifier – decision boundary

• k acts as a smoother (bias-variance tradeoff)

Page 33

Case Study: kNN for Web Classification
• Dataset
  • 20 Newsgroups (20 classes)
  • Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  • 61,118 words, 18,774 documents
  • Class label descriptions

Page 34

Experimental Setup
• Training/test sets:
  • 50%–50% random split
  • 10 runs
  • report average results
• Evaluation criteria:

Page 35

Results: Binary Classes

[Figure: accuracy vs. k for three binary tasks – alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles]

Page 36

From Classification to Regression

Page 37

Temperature sensing
• What is the temperature in the room? → Average.
• What is the temperature at location x? → "Local" average.

Page 38

Kernel Regression

• Aka local regression; the Nadaraya-Watson kernel estimator.
• Weight each training point based on its distance (scaled by h) to the test point.
• A boxcar kernel yields the local average.
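The standard form of the Nadaraya-Watson estimator named above:

$$\hat{f}(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)}$$

and a minimal sketch in code (my own illustration, Gaussian kernel):

```python
import numpy as np

def nadaraya_watson(x, X, Y, h=1.0):
    """Nadaraya-Watson: kernel-weighted average of the training targets Y."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian kernel weights K((x - Xi)/h)
    return np.sum(w * Y) / np.sum(w)

X = np.linspace(0, 10, 50)
Y = np.sin(X) + 0.1 * np.random.default_rng(2).normal(size=50)
print(nadaraya_watson(5.0, X, Y, h=0.5))    # close to sin(5) ≈ -0.96
```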

Page 39

Kernels

[Figure: kernel functions]

Page 40

Choice of kernel bandwidth h

[Figure: fits with h = 1, h = 10, h = 50, h = 200 – ranging from too small, through just right, to too large]

Choice of kernel is not that important.

Page 41

Kernel Regression as Weighted Least Squares

Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(Xᵢ) = β (a constant).
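Concretely, in the standard form of that weighted least squares problem:

$$\hat\beta(x) = \arg\min_{\beta} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\left(Y_i - \beta\right)^2
= \frac{\sum_i K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_i K\!\left(\frac{x - X_i}{h}\right)},$$

which is exactly the Nadaraya-Watson estimate: setting the derivative with respect to β to zero gives the kernel-weighted mean.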

Page 42

Kernel Regression as Weighted Least Squares

Notice that setting f(Xᵢ) = β (a constant) makes the weighted least squares solution a kernel-weighted average of the Yᵢ.

Page 43

Local Linear/Polynomial Regression

Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f to a local polynomial of degree p around X.
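In the standard form, the locally constant β is replaced by a degree-p polynomial in (Xᵢ − x):

$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\Bigl(Y_i - \sum_{j=0}^{p} \beta_j\,(X_i - x)^j\Bigr)^{2}, \qquad \hat f(x) = \hat\beta_0.$$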

Page 44

When doesn't this work?
• Not enough points in a bin of reasonable size…
• When does that happen?
• When the dimensions of the space get too large.
• For example:

Page 45

Curse of dimensionality…

Page 46

Bayesian Revisited and "Naïve Bayes"
• Given data X, the posterior probability of a hypothesis h, P(h|X), follows Bayes' theorem:

$$P(h \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid h)\,P(h)}{P(\mathbf{X})}$$

• MAP (maximum a posteriori) hypothesis:

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid \mathbf{X}) = \arg\max_{h \in H} P(\mathbf{X} \mid h)\,P(h)$$

• Practical difficulty: requires estimation of the multi-dimensional densities P(X|h) with limited data. (h is, for example, class Cⱼ.)

Page 47

Naïve Bayes Classifier and kernel densities
• A simplifying assumption: attributes are conditionally independent given the class:

$$P(C_j \mid \mathbf{X}) \propto P(C_j)\,\prod_{i=1}^{n} P(x_i \mid C_j)$$

• Assumes that the density of each element of the vector is independent given the class.
• Many fewer parameters. E.g., for 5-dimensional Gaussians, 10 parameters instead of 20 (μ and Σ).
• And you can use kernel density estimation to learn each one!
• Conventional wisdom: Naïve Bayes gives poor densities but good classification!
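A minimal sketch combining the two ideas above, Naïve Bayes with a 1-D kernel density estimate per feature per class (my own illustration; Gaussian kernels, fixed bandwidth h):

```python
import numpy as np

def kde1d(x, data, h=0.5):
    """1-D Gaussian kernel density estimate at a single point x."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2 * np.pi))

def nb_kde_predict(x, X, y, h=0.5):
    """Naive Bayes with per-feature KDEs: argmax_c log P(c) + sum_i log p_hat(x_i | c)."""
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]                 # training points of class c
        prior = np.mean(y == c)        # P(c) estimated by class frequency
        log_lik = sum(np.log(kde1d(x[i], Xc[:, i], h)) for i in range(len(x)))
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(nb_kde_predict(np.array([2.5, 2.5]), X, y))   # likely class 1
```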

Page 48

Summary
• Instance-based / non-parametric approaches

Four things make a memory-based learner:
1. A distance metric, dist(x, Xᵢ)
   – Euclidean (and many more)
2. How many nearby neighbors (or what radius) to look at?
   – k, ∆/h
3. A weighting function (optional)
   – W based on kernel K
4. How to fit with the local points?
   – Average, majority vote, weighted average, polynomial fit

Page 49

Summary
• Parametric vs. nonparametric approaches

Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data; parametric models rely on very strong (simplistic) distributional assumptions.

Nonparametric models (other than histograms) require storing and computing with the entire data set; parametric models, once fitted, are much more efficient in terms of storage and computation.

Page 50

Remaining slides not used yet
• Taken from 600.325/425 Declarative Methods – J. Eisner

Page 51

Intuition behind memory-based learning
• Similar inputs map to similar outputs
  • If not true, learning is impossible
  • If true, learning reduces to defining "similar"
• Not all similarities are created equal
  • To guess J. D. Salinger's weight, who are the similar people? Those with similar occupation, age, diet, genes, climate, …
  • To guess J. D. Salinger's IQ: similar occupation, writing style, fame, SAT score, …
• Superficial vs. deep similarities?
  • B. F. Skinner and the behaviorism movement vs. what brains actually do

parts of slide thanks to Rich Caruana

Page 52

1-Nearest Neighbor
• Define a distance d(x1, x2) between any 2 examples
  • examples are feature vectors
  • so could just use Euclidean distance …
• Training: index the training examples for fast lookup.
• Test: given a new x, find the closest x1 in training. Classify x the same as x1 (positive or negative).
• Can learn complex decision boundaries.
• As training size → ∞, the error rate is at most 2× the Bayes-optimal rate (i.e., the error rate you'd get from knowing the true model that generated the data – whatever it is!)
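A minimal sketch of the train/test recipe above (brute-force search instead of a real index; my own illustration):

```python
import numpy as np

def nn1_predict(x, X_train, y_train):
    """1-NN: label of the single closest training example under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nn1_predict(np.array([4.0, 4.5]), X_train, y_train))  # -> 1
```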

Page 53

1-Nearest Neighbor – decision boundary

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

Page 54

k-Nearest Neighbor

Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote.

• The average of k points is more reliable when there is:
  • noise in the training vectors x
  • noise in the training labels y
  • partial overlap between classes

[Figure: + and o examples scattered over attribute_1 vs. attribute_2]

slide thanks to Rich Caruana (modified)

Page 55

1 Nearest Neighbor – decision boundary

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

Page 56

15 Nearest Neighbors – it's smoother!

[Figure from Hastie, Tibshirani, Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

Page 57

How to choose "k"
• Odd k (often 1, 3, or 5):
  • avoids the problem of breaking ties (in a binary classifier)
• Large k:
  • less sensitive to noise (particularly class noise)
  • better probability estimates for discrete classes
  • larger training sets allow larger values of k
• Small k:
  • captures the fine structure of the problem space better
  • may be necessary with small training sets
• Balance between large and small k
  • What does this remind you of?
• As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

slide thanks to Rich Caruana (modified)

Page 58

[Figure from Hastie, Tibshirani, Friedman 2001, p. 419 – why?]

slide thanks to Rich Caruana (modified)

Page 59

Cross-Validation
• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• "Leave-one-out" cross-validation (LOOCV):
  • "remove" each case one at a time
  • use it as a test case, with the remaining cases as the training set
  • average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with MBL!

slide thanks to Rich Caruana
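A minimal sketch of why LOOCV is so cheap for memory-based learners (my own illustration): there is no model to refit, so each fold just excludes one point from the distance search.

```python
import numpy as np

def loocv_error_1nn(X, y):
    """Leave-one-out error for 1-NN: classify each point by its nearest *other* point."""
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                     # exclude the held-out point itself
        errors += y[np.argmin(dists)] != y[i]
    return errors / len(X)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(loocv_error_1nn(X, y))
```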

Page 60

Distance-Weighted kNN
• hard to pick large vs. small k
  • may not even want k to be constant
• use large k, but put more emphasis on nearer neighbors?

$$\text{prediction}(x) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}$$

where $x_1, \dots, x_k$ are the k-NN and $y_1, \dots, y_k$ their labels. We define the relative weights for the k-NN as

$$w_i = \frac{1}{Dist(x, x_i)} \quad \text{or maybe} \quad \frac{1}{Dist(x, x_i)^2} \quad \text{or often} \quad \frac{1}{\exp\!\bigl(\beta \cdot Dist(x, x_i)\bigr)}$$
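A minimal sketch of the weighted vote above for regression (my own illustration, using the exponential weights):

```python
import numpy as np

def weighted_knn_predict(x, X, y, k=5, beta=1.0):
    """Distance-weighted kNN regression: prediction = sum(w_i * y_i) / sum(w_i)."""
    dists = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    w = np.exp(-beta * dists[nn])            # w_i = 1 / exp(beta * Dist(x, x_i))
    return np.sum(w * y[nn]) / np.sum(w)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_predict(np.array([1.5]), X, y, k=3))
```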

Page 61

Combining k-NN with other methods, #1
• Instead of having the k-NN simply vote, put them into a little machine learner!
• To classify x, train a "local" classifier on its k nearest neighbors (maybe weighted).
  • polynomial, neural network, …

parts of slide thanks to Rich Caruana

Page 62

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important

[Figure: + and o examples over attribute_1 vs. attribute_2]

parts of slide thanks to Rich Caruana

Page 63

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #1:
  • What if the input represents physical weight not in pounds but in milligrams?
  • Then small differences in the physical-weight dimension have a huge effect on distances, overwhelming other features: the o's become "closer" to the +'s than to each other.
  • Should really correct for these arbitrary "scaling" issues.
• One simple idea: rescale so that the standard deviation of each dimension = 1.

[Figure: the same + and o data plotted against weight (lb), and against weight (mg) – the mg scaling is bad]

Page 64

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #2:
  • What if some dimensions are more correlated with the true label? (more relevant, or less noisy)
• Stretch those dimensions out so that they are more important in determining distance.
  • One common technique for choosing the stretch is "information gain."

[Figure: the same data with the most relevant attribute compressed (bad) vs. stretched (good)]

parts of slide thanks to Rich Caruana

Page 65

Weighted Euclidean Distance

$$d(x, x') = \sqrt{\sum_{i=1}^{N} s_i\,(x_i - x'_i)^2}$$

• large weight sᵢ: attribute i is more important
• small weight sᵢ: attribute i is less important
• zero weight sᵢ: attribute i doesn't matter

slide thanks to Rich Caruana (modified)

Page 66

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #3:
  • Do we really want to decide separately and theoretically how to scale each dimension?
• Could simply pick the dimension scaling factors that maximize performance on development data (maybe do leave-one-out).
  • Similarly, pick the number of neighbors k and how to weight them.
• Especially useful if the performance measurement is complicated (e.g., 3 classes and differing misclassification costs).

[Figure: + and o examples over attribute_1 vs. attribute_2]

Page 67

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #4:
  • Is it the original input dimensions that we want to scale? (e.g., should we replot on a log scale before measuring distance?)
  • What if the true clusters run diagonally? Or curve?
  • We can transform the data first by extracting a different, useful set of features from it:
    • Linear discriminant analysis
    • Hidden layer of a neural network
  • i.e., redescribe the data by how a different type of learned classifier internally sees it

[Figure: the same data plotted over attribute_1 vs. over exp(attribute_1)]

Page 68

Now back to that distance function

• Euclidean distance treats all of the input dimensions as equally important
• Problem #5:
  • Do we really want to transform the data globally?
  • What if different regions of the data space behave differently?
  • Could find 300 "nearest" neighbors (using the global transform), then locally transform that subset of the data to redefine "near"
  • Maybe could use decision trees to split up the data space first

[Figure: two separate clusters of + and o examples over attribute_1 vs. attribute_2, with query points marked "?"]

Page 69

Why are we doing all this preprocessing?
• Shouldn't the user figure out a smart way to transform the data before giving it to k-NN?
• Sure, that's always good, but what will the user try?
  • Probably a lot of the same things we're discussing.
  • She'll stare at the training data and try to figure out how to transform it so that close neighbors tend to have the same label.
  • To be nice to her, we're trying to automate the most common parts of that process – like scaling the dimensions appropriately.
  • We may still miss patterns that her visual system or expertise can find, so she may still want to transform the data.
  • On the other hand, we may find patterns that would be hard for her to see.

Page 70

Tangent: Decision Trees (a different simple method)

Is this Reuters article an Earnings Announcement?
Split on the feature that reduces our uncertainty most.

• All docs: 2301/7681 = 0.30 are earnings announcements. Split on "cents":
  • contains "cents" < 2 times: 694/5977 = 0.116. Split on "net":
    • contains "net" < 1 time: 272/5436 = 0.050 → "no"
    • contains "net" ≥ 1 time: 422/541 = 0.780
  • contains "cents" ≥ 2 times: 1607/1704 = 0.943. Split on "versus":
    • contains "versus" < 2 times: 209/301 = 0.694
    • contains "versus" ≥ 2 times: 1398/1403 = 0.996 → "yes"

example thanks to Manning & Schütze

Page 71

Booleans, Nominals, Ordinals, and Reals
• Consider attribute value differences (xᵢ – x'ᵢ): what does subtraction do?
  • Reals: easy! full continuum of differences
  • Integers: not bad: discrete set of differences
  • Ordinals: not bad: discrete set of differences
  • Booleans: awkward: Hamming distance of 0 or 1
  • Nominals? not good! recode as Booleans?

slide thanks to Rich Caruana (modified)

Page 72

"Curse of Dimensionality"
• Pictures on previous slides showed 2-dimensional data
• What happens with lots of dimensions?
• 10 training samples cover the space less & less well…

images thanks to Charles Annis

Page 73

"Curse of Dimensionality"
• A deeper perspective on this:
  • Random points chosen in a high-dimensional space tend to all be pretty much equidistant from one another!
  • (Proof sketch: in 1000 dimensions, the squared distance between two random points is a sample variance of 1000 coordinate distances. Since 1000 is large, this sample variance is usually close to the true variance.)
• So each test example is about equally close to most training examples.
• We need a lot of training examples to expect one that is unusually close to the test example.

images thanks to Charles Annis
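A tiny numeric illustration of the equidistance claim above (my own sketch, not from the slides): the relative spread of pairwise distances between random points shrinks dramatically as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(5)
for d in (2, 1000):
    pts = rng.uniform(size=(100, d))                  # 100 random points in [0,1]^d
    sq = np.sum(pts**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * pts @ pts.T  # squared pairwise distances
    dists = np.sqrt(np.maximum(d2[np.triu_indices(100, k=1)], 0))
    print(d, round(float(dists.std() / dists.mean()), 3))  # relative spread shrinks with d
```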

Page 74

"Curse of Dimensionality"
• Also, with lots of dimensions/attributes/features, the irrelevant ones may overwhelm the relevant ones:

$$d(x, x') = \sqrt{\sum_{i=1}^{\#\text{relevant}} (x_i - x'_i)^2 \;+\; \sum_{j=1}^{\#\text{irrelevant}} (x_j - x'_j)^2}$$

• So the ideas from previous slides grow in importance:
  • feature weights (scaling)
  • feature selection (try to identify & discard irrelevant features)
    • but with lots of features, some irrelevant ones will probably accidentally look relevant on the training data
  • smooth by allowing more neighbors to vote (e.g., larger k)

slide thanks to Rich Caruana (modified)

Page 75

Advantages of Memory-Based Methods
• Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  • never need to learn a global model
  • many simple local models taken together can represent a more complex global model
• Learns arbitrarily complicated decision boundaries
• Very efficient cross-validation
• Easy to explain to users how it works
  • … and why it made a particular decision!
• Can use any distance metric: string-edit distance, …
  • handles missing values, time-varying distributions, ...

slide thanks to Rich Caruana (modified)

Page 76

Weaknesses of Memory-Based Methods
• Curse of dimensionality
  • often works best with 25 or fewer dimensions
• Classification runtime scales with training set size
  • clever indexing may help (k-d trees? locality-sensitive hashing?)
  • large training sets will not fit in memory
• Sometimes you wish NN stood for "neural net" instead of "nearest neighbor"
  • simply averaging nearby training points isn't very subtle
  • naive distance functions are overly respectful of the input encoding
• For regression (predict a number rather than a class), the extrapolated surface has discontinuities

slide thanks to Rich Caruana (modified)

Page 77

Current Research in MBL
• Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10⁶–10¹² cases
• Learn better distance metrics
• Feature selection
• Overfitting, VC-dimension, ...
• MBL in higher dimensions
• MBL in non-numeric domains:
  • Case-Based or Example-Based Reasoning
  • Reasoning by Analogy

slide thanks to Rich Caruana