
Finding Predictors: Nearest Neighbor

Modern Motivations: Be Lazy!

Classification
Regression
Choosing the right number of neighbors

Some Optimizations

Other types of lazy algorithms

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä

Motivation: A Zoo

Given: Information about animals in the zoo. How can we classify new animals?


Remember?

Unsupervised vs. Supervised Learning

Unsupervised: No class information given. Goal: detect unknown patterns (e.g. clusters, association rules).

Supervised: Class information exists / is provided by a supervisor. Goal: learn the class structure for future unclassified/unknown data.


Motivation: Expert/Legal Systems/Streber

How do Expert Systems work?

Find most similar case(s).

How does the American Justice System work?

Find most similar case(s).

How does the nerd (“Streber”) learn?

He learns by heart.


Eager vs. Lazy Learners


Lazy: Save all data from training and use it for classifying. (The learner was lazy; the classifier has to do the work.)

Eager: Build a (compact) model/structure during training and use the model for classification. (The learner was eager / worked harder; the classifier has a simple life.)


Nearest neighbour predictors

Nearest neighbour predictors are a special case of instance-based learning.

Instead of constructing a model that generalizes beyond the training data, the training examples are merely stored.

Predictions for new cases are derived directly from these stored examples and their (known) classes or target values.


Simple nearest neighbour predictor

For a new instance, use the target value of the closest neighbour in the training set.

(Figures: 1-nearest-neighbour classification and regression in one input dimension; axes labelled input and output.)
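To make the idea concrete, here is a minimal sketch of a 1-nearest-neighbour predictor in Python (not from the slides; function and variable names are illustrative). It returns the stored target value of the training example closest to the query point under Euclidean distance.

```python
import numpy as np

def predict_1nn(X_train, y_train, query):
    """Return the target value of the training example closest to `query`.

    X_train: (n, d) array of stored training inputs
    y_train: length-n array of class labels or numeric targets
    query:   (d,) array, the new instance
    """
    dists = np.linalg.norm(X_train - query, axis=1)  # Euclidean distances
    return y_train[np.argmin(dists)]                 # target of the nearest neighbour

# Example: classify a new case by its closest stored case
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
y_train = np.array(["mammal", "mammal", "bird"])
print(predict_1nn(X_train, y_train, np.array([0.8, 0.2])))  # -> "mammal"
```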

Nearest Neighbour Predictor: Issues

Noisy Data is a problem:

How can we fix this?


k-nearest neighbour predictor

Instead of relying on only a single instance (the nearest neighbour) for the prediction, usually k (k > 1) neighbours are taken into account, leading to the k-nearest neighbour predictor.

Classification: Choose the majority class among the k nearest neighbours for the prediction.

Regression: Take the mean value of the k nearest neighbours for the prediction.

Disadvantage:

All k nearest neighbours have the same influence on the prediction.

Closer nearest neighbours should have a higher influence on the prediction.

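A compact sketch of the plain (unweighted) k-nearest neighbour predictor described above, in Python with NumPy; names are illustrative and not taken from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Plain k-NN: majority class (classification) or mean target (regression)."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nn_idx = np.argsort(dists)[:k]              # indices of the k nearest neighbours
    neighbours = y_train[nn_idx]
    if task == "classification":
        return Counter(neighbours).most_common(1)[0][0]   # majority vote
    return float(np.mean(neighbours.astype(float)))       # mean target value
```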

Ingredients for the k-nearest neighbour predictor

Distance Metric: The distance metric, together with a possible task-specific scaling or weighting of the attributes, determines which of the training examples are nearest to a query data point and thus selects the training example(s) used to produce a prediction.

Number of Neighbours: The number of neighbours of the query point that are considered can range from only one (the basic nearest neighbour approach) through a few (like k-nearest neighbour approaches) to, in principle, all data points as an extreme case (would that be a good idea?).


Ingredients for the k-nearest neighbour predictor

Weighting function for the neighbours: a weighting function defined on the distance of a neighbour from the query point, which yields higher values for smaller distances.

Prediction function: for multiple neighbours, one needs a procedure to compute the prediction from the (generally differing) classes or target values of these neighbours, since they may differ and thus may not yield a unique prediction directly.


k Nearest neighbour predictor

(Figures: average of the 3 nearest neighbours vs. distance-weighted prediction with 2 nearest neighbours, in one input dimension; axes labelled input and output.)

Nearest neighbour predictor

Choosing the “ingredients”

Distance metric: problem dependent. Often Euclidean distance (after normalisation).

Number of neighbours: very often chosen on the basis of cross-validation. Choose the k that leads to the best performance in cross-validation.

Weighting function for the neighbours, e.g. the tricubic weighting function:

w(s_i, q, k) = (1 − (d(s_i, q) / d_max(q, k))^3)^3

where
q: the query point
s_i: (input vector of) the i-th nearest neighbour of q in the training data set
k: the number of considered neighbours
d: the employed distance function
d_max(q, k): the maximum of the distances between any two of the nearest neighbours and of the distances of the nearest neighbours to the query point

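A small Python sketch of the tricubic weighting function above (illustrative, not from the slides); d_max is computed here from both the pairwise neighbour distances and the distances of the neighbours to the query point, following the definition given above.

```python
import numpy as np

def tricubic_weights(neighbours, query):
    """Tricubic weights w(s_i, q, k) = (1 - (d(s_i, q) / d_max)^3)^3.

    `neighbours` is a (k, d) array holding the k nearest neighbours of `query`.
    """
    d_to_query = np.linalg.norm(neighbours - query, axis=1)
    pairwise = np.linalg.norm(neighbours[:, None, :] - neighbours[None, :, :], axis=2)
    d_max = max(d_to_query.max(), pairwise.max(), 1e-12)  # small epsilon avoids 0/0
    return (1.0 - (d_to_query / d_max) ** 3) ** 3
```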

Nearest neighbour predictor

Choosing the “ingredients”

prediction function

Regression: Compute the weighted average of the target values of the nearest neighbours.

Classification: Sum up the weights for each class among the nearest neighbours. Choose the class with the highest value (or incorporate a cost matrix and interpret the summed weights for the classes as likelihoods).

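A minimal sketch combining the pieces: a distance-weighted k-NN prediction in Python, assuming the tricubic_weights helper from the sketch above (all names are illustrative, not prescribed by the slides).

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=5, task="classification"):
    """Distance-weighted k-NN prediction with tricubic weights."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nn_idx = np.argsort(dists)[:k]
    w = tricubic_weights(X_train[nn_idx], query)        # weights of the k neighbours
    targets = y_train[nn_idx]
    if task == "regression":
        vals = targets.astype(float)
        return float(np.sum(w * vals) / np.sum(w))      # weighted average
    # classification: sum the weights per class, predict the class with the
    # highest summed weight
    classes = np.unique(targets)
    class_weights = {c: w[targets == c].sum() for c in classes}
    return max(class_weights, key=class_weights.get)
```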

Kernel functions

A k-nearest neighbour predictor with a weighting function can be interpreted as an n-nearest neighbour predictor with a modified weighting function, where n is the number of training data points.

The modified weighting function simply assigns the weight 0 to all instances that do not belong to the k nearest neighbours.

More general approach: use a general kernel function that assigns a distance-dependent weight to all instances in the training data set.


Kernel functions

Such a kernel function K, assigning to each data point a weight that depends on its distance d to the query point, should satisfy the following properties:

K(d) ≥ 0

K(0) = 1 (or at least, K has its mode at 0)

K(d) decreases monotonically with increasing d.


Kernel functions

Typical examples for kernel functions (σ > 0 is a predefined constant):

K_rect(d) = 1 if d ≤ σ, 0 otherwise

K_triangle(d) = K_rect(d) · (1 − d/σ)

K_tricubic(d) = K_rect(d) · (1 − d^3/σ^3)^3

K_gauss(d) = exp(−d^2 / (2σ^2))

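The four kernels can be written down directly; a short Python sketch (function names are illustrative):

```python
import numpy as np

def k_rect(d, sigma):
    return (np.asarray(d) <= sigma).astype(float)        # 1 inside the window, 0 outside

def k_triangle(d, sigma):
    return k_rect(d, sigma) * (1.0 - np.asarray(d) / sigma)

def k_tricubic(d, sigma):
    return k_rect(d, sigma) * (1.0 - np.asarray(d) ** 3 / sigma ** 3) ** 3

def k_gauss(d, sigma):
    return np.exp(-np.asarray(d) ** 2 / (2.0 * sigma ** 2))
```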

Locally weighted (polynomial) regression

For regression problems: so far, weighted averaging of the target values.

Instead of a simple weighted average, one can also compute a (local) regression function at the query point, taking the weights into account.


Locally weighted polynomial regression

(Figures: kernel-weighted regression (left) vs. distance-weighted 4-nearest-neighbour regression with a tricubic weighting function (right), in one dimension; axes labelled input and output.)
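For illustration, a minimal locally weighted linear regression sketch in Python: it fits a weighted least-squares line around each query point, using a Gaussian kernel on the distances as the weights. This is a sketch under these assumptions, not the slides' exact procedure.

```python
import numpy as np

def local_linear_predict(x_train, y_train, x_query, sigma=0.5):
    """Locally weighted linear regression at a single 1-D query point."""
    d = np.abs(x_train - x_query)
    w = np.exp(-d ** 2 / (2.0 * sigma ** 2))                  # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x_train), x_train])     # design matrix [1, x]
    W = np.diag(w)
    # Weighted least squares: beta = (X^T W X)^(-1) X^T W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return beta[0] + beta[1] * x_query
```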

Adjusting the distance function

The choice of the distance function is crucial for the success of a nearest neighbour approach.

One can try to adapt the distance function.

One way to adapt the distance function is to use feature weights that put a stronger emphasis on those features that are more important.

A configuration of feature weights can be evaluated based on cross-validation.

The optimisation of the feature weights can then be carried out based on some heuristic strategy like hill climbing, simulated annealing, or evolutionary algorithms.

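A sketch of how one feature-weight configuration could be evaluated, assuming a weighted Euclidean distance and leave-one-out classification accuracy as the score (names and setup are illustrative, not prescribed by the slides):

```python
import numpy as np

def weighted_distances(X, query, feature_weights):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - q_j)^2)."""
    return np.sqrt(((X - query) ** 2 * feature_weights).sum(axis=1))

def loo_accuracy(X, y, feature_weights, k=3):
    """Leave-one-out accuracy of k-NN under the given feature weights."""
    hits = 0
    for i in range(len(X)):
        d = weighted_distances(np.delete(X, i, axis=0), X[i], feature_weights)
        nn = np.argsort(d)[:k]
        labels = np.delete(y, i)[nn]
        values, counts = np.unique(labels, return_counts=True)
        hits += values[np.argmax(counts)] == y[i]
    return hits / len(X)

# A hill-climbing or simulated-annealing loop would repeatedly perturb
# `feature_weights` and keep the configuration with the higher loo_accuracy.
```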

Data set reduction, prototype building

Advantage of the nearest neighbour approach: no time for training is needed, at least as long as no feature weight adaptation is carried out and the number of nearest neighbours is fixed in advance.

Disadvantage: the calculation of the predicted class or value can take long when the data set is large.

Possible solutions:

Finding a smaller subset of the training set for the nearest neighbour predictor.

Building prototypes by merging (close) instances, for instance by averaging.

Both can be carried out based on cross-validation and using heuristic optimisation strategies.

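One plausible way to build prototypes by averaging, sketched in Python: replace the training instances of each class by a handful of class-wise k-means centroids and run the nearest neighbour predictor on those prototypes instead. This is purely illustrative; the slides do not prescribe a specific procedure.

```python
import numpy as np

def class_prototypes(X, y, prototypes_per_class=3, iters=20, seed=None):
    """Merge close instances of each class into averaged prototypes (k-means)."""
    rng = np.random.default_rng(seed)
    proto_X, proto_y = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        m = min(prototypes_per_class, len(Xc))
        centers = Xc[rng.choice(len(Xc), m, replace=False)]
        for _ in range(iters):                                  # plain Lloyd iterations
            assign = np.argmin(
                np.linalg.norm(Xc[:, None, :] - centers[None, :, :], axis=2), axis=1)
            centers = np.array([Xc[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(m)])
        proto_X.append(centers)
        proto_y.extend([c] * m)
    return np.vstack(proto_X), np.array(proto_y)
```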

Choice of parameter k

(Figures: decision regions of the nearest-neighbour classifier for a linear classification problem with some noise, using k = 1, 2, 5, 50, 470, 480 and 500 nearest neighbours.)

Choice of Parameter k

k = 1 yields a piecewise constant labeling.

A "too small" k is very sensitive to outliers.

A "too large" k puts many objects from other clusters (classes) into the decision set.

k = N predicts a globally constant (majority) label.

The selection of k depends on various input "parameters":

the size n of the data set

the quality of the data

...

Choice of Parameter k: cont.

(Figures: a simple data set; the Voronoi tessellation of the input space induced by the simple 1-nearest-neighbour classifier and the resulting classification; decision regions for k = 1, 2 and 3. Concept, images, and analysis from Peter Flach.)

Choice of Parameter k

k = 1: highly localized classifier, perfectly fits separable training data

k > 1:

the instance space partition refines

more segments are labelled with the same local models


Choice of Parameter k - Cross Validation

k is mostly determined manually or heuristically.

One heuristic: cross-validation.

1. Select a cross-validation method (e.g. q-fold cross-validation with D = D_1 ∪ ... ∪ D_q, a disjoint union).

2. Select a range for k (e.g. 1 < k ≤ k_max).

3. Select an evaluation measure, e.g.
E(k) = Σ_{i=1}^{q} Σ_{x ∈ D_i} p(x is correctly classified | D \ D_i).

4. Use the k that results in the minimal value: k_best = argmin E(k).

Can we do this in KNIME?...

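The slides ask whether this can be done in KNIME; as a language-agnostic illustration, here is the same k-selection recipe sketched in Python with scikit-learn (which is not mentioned in the slides), using one minus the mean q-fold accuracy as the error measure.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, k_max=20, q=10):
    """Pick k by q-fold cross-validation, minimising the estimated error E(k)."""
    errors = {}
    for k in range(2, k_max + 1):                 # range 1 < k <= k_max
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=q)
        errors[k] = 1.0 - acc.mean()              # E(k): estimated misclassification rate
    return min(errors, key=errors.get)            # k_best = argmin_k E(k)
```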

kNN Classifier: Summary

Instance Based Classifier: Remembers all training cases

Sensitive to neighborhood:

Distance Function
Neighborhood Weighting
Prediction (Aggregation) Function


Food for Thought: 1-NN Classifier

Bias of the Learning Algorithm?

No variations in search: simply store all examples

Model Bias?

Classification via Nearest Neighbor

Hypothesis Space?

One hypothesis only: Voronoi partitioning of space


Again: Lazy vs. Eager Learners

kNN learns a local model at query time

Previous algorithms (k-means, ID3, ...) learn a global model before query time.

Lazy Algorithms:

do nothing during training (just store the examples)

generate a new hypothesis for each query ("class A!" in the case of kNN)

Eager Algorithms:

do as much as possible during training (ideally: extract the one relevant rule!)

generate one global hypothesis (or a set, see Candidate-Elimination) once.


Other Types of Lazy Learners


Lazy Decision Trees

Can we use a Decision Tree algorithm in a lazy mode?

Sure: only create the branch that contains the test case.

Better: do beam search instead of greedy "branch" building!

Works for essentially all model-building algorithms (but makes sense for "partitioning"-style algorithms only...).

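A minimal Python sketch of the "only create the branch that contains the test case" idea, for categorical attributes and a greedy information-gain split (no beam search); all names are illustrative and this is not the slides' exact algorithm.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def lazy_tree_classify(X, y, query):
    """Grow only the branch of a decision tree that the query falls into."""
    attrs = list(range(X.shape[1]))
    while attrs and len(np.unique(y)) > 1:
        gains = []
        for a in attrs:                              # greedy information-gain split
            vals = np.unique(X[:, a])
            rem = sum(np.mean(X[:, a] == v) * entropy(y[X[:, a] == v]) for v in vals)
            gains.append(entropy(y) - rem)
        best = attrs[int(np.argmax(gains))]
        mask = X[:, best] == query[best]             # follow only the query's branch
        if not mask.any():                           # unseen attribute value: stop
            break
        X, y = X[mask], y[mask]
        attrs.remove(best)
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]                 # majority class in the branch
```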

Lazy(?) Neural Networks

Specht introduced Probabilistic Neural Networks in 1990.

Special type of Neural Networks, optimized for classification tasks.

One Neuron per training instance

Actually long known as “Parzen Window Classifier” in statistics.


Parzen Window

The Parzen windows method is a non-parametric procedure that synthesizes an estimate of a probability density function (pdf) by superposition of a number of windows, replicas of a function (often the Gaussian).

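A minimal Parzen-window classifier sketch in Python: each class density is estimated by a superposition of Gaussian windows centred on that class's training instances, and the query is assigned to the class with the highest prior-weighted density. Names and the fixed σ are illustrative assumptions.

```python
import numpy as np

def parzen_classify(X_train, y_train, query, sigma=1.0):
    """Assign `query` to the class with the largest Parzen-window density."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d2 = ((Xc - query) ** 2).sum(axis=1)
        # Superposition of Gaussian windows centred on the training instances,
        # normalised by the total number of training instances (class prior)
        scores[c] = np.exp(-d2 / (2.0 * sigma ** 2)).sum() / len(X_train)
    return max(scores, key=scores.get)
```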


Probabilistic Neural Networks

Probabilistic Neural Networks are powerful predictors (but the σ-adjustment is problematic).

Efficient (and eager!) training algorithms exist that introduce more general neurons, covering more than just one training instance.

Usual problems of distance based classifiers apply.

