Page 1:

Algorithm-Independent Machine Learning

Anna Egorova-Förster
University of Lugano
Pattern Classification Reading Group, January 2007

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Page 2:


• So far: different classifiers and methods have been presented

• BUT:

• Is any classifier better than all the others?

• How can classifiers be compared?

• Is such a comparison possible at all?

• Is at least some classifier always better than random guessing?

• AND

• Do techniques exist that can boost any classifier?

Algorithm-Independent Machine Learning

Page 3:


No Free Lunch Theorem

For any two learning algorithms P1(h|D) and P2(h|D), the following are true independent of the sampling distribution P(x) and the number n of training points, where $\mathcal{E}_k$ denotes the expected off-training-set classification error of algorithm k:

• Uniformly averaged over all target functions F,

$\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$.

• For any fixed training set D, uniformly averaged over F,

$\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$

• Uniformly averaged over all priors P(F),

$\mathcal{E}_1(E|n) - \mathcal{E}_2(E|n) = 0$

• For any fixed training set D, uniformly averaged over P(F),

$\mathcal{E}_1(E|D) - \mathcal{E}_2(E|D) = 0$

Page 4:


No Free Lunch Theorem

1. Uniformly averaged over all target functions F,

$\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$.

Averaged over all possible target functions, the off-training-set error is the same for all classifiers.

Possible target functions: $2^5 = 32$ (one for each labelling of the five off-training-set points in the example below)

2. For any fixed training set D, uniformly averaged over F,

$\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$

Even if we know the training set D, the off-training-set errors, averaged over F, are the same.

  x     F    h1   h2
 000    1     1    1    (training set D)
 001   -1    -1   -1    (training set D)
 010    1     1    1    (training set D)
 011   -1     1   -1    (off-training set)
 100    1     1   -1    (off-training set)
 101   -1     1   -1    (off-training set)
 110    1     1   -1    (off-training set)
 111    1     1   -1    (off-training set)
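A small numerical check of this example (a minimal sketch: h1 and h2 are read off the table above, and the averaging is made explicit by enumerating all $2^5$ target functions consistent with D):

```python
# Verify that, averaged uniformly over all target functions consistent with D,
# the off-training-set error of h1 and h2 is identical (No Free Lunch, parts 1-2).
from itertools import product

patterns = [format(i, "03b") for i in range(8)]      # all 2^3 binary patterns
train = {"000": 1, "001": -1, "010": 1}              # training set D: x -> F(x)
off_train = [x for x in patterns if x not in train]  # the 5 off-training-set patterns

# Hypotheses h1 and h2 from the table (values on the off-training-set patterns)
h1 = {"011": 1, "100": 1, "101": 1, "110": 1, "111": 1}
h2 = {"011": -1, "100": -1, "101": -1, "110": -1, "111": -1}

def mean_off_training_error(h):
    """Average off-training-set error over all 2^5 target functions consistent with D."""
    errors = []
    for labels in product([-1, 1], repeat=len(off_train)):  # enumerate consistent F
        F = dict(zip(off_train, labels))
        errors.append(sum(h[x] != F[x] for x in off_train) / len(off_train))
    return sum(errors) / len(errors)

print(mean_off_training_error(h1))  # 0.5
print(mean_off_training_error(h2))  # 0.5 -> identical, as the theorem predicts
```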

Page 5:

Consequences of the No Free Lunch Theorem

If no information about the target function F(x) is provided:

• No classifier is better than any other in the general case

• No classifier is better than random guessing in the general case

Page 6:

Ugly Duckling Theorem
Feature Comparison

• Binary features f1, f2, ...

• Patterns xi are expressed as predicates over the features: f1 AND f2, f1 OR f2, etc.

• Rank r of a predicate: the number of simplest (rank-1) patterns it contains.

Example with two features, corresponding to the regions of a Venn diagram (a small code illustration follows below):

Rank 1: x1 = f1 AND NOT f2, x2 = f1 AND f2, x3 = f2 AND NOT f1

Rank 2: x1 OR x2 (= f1)

Rank 3: x1 OR x2 OR x3 (= f1 OR f2)
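As a tiny illustration (a sketch only; encoding the Venn-diagram regions as sets of (f1, f2) value pairs is my own choice, not from the slides), the rank of a predicate can be computed by counting which rank-1 regions it contains:

```python
# Rank-1 regions of the Venn diagram for two binary features, encoded as (f1, f2) pairs
x1 = {(1, 0)}               # f1 AND NOT f2
x2 = {(1, 1)}               # f1 AND f2
x3 = {(0, 1)}               # f2 AND NOT f1

def rank(predicate):
    """Rank = number of rank-1 regions contained in the predicate's region."""
    return sum(1 for region in (x1, x2, x3) if region <= predicate)

f1 = x1 | x2                # the predicate "f1" covers x1 OR x2
print(rank(f1))             # 2
print(rank(x1 | x2 | x3))   # 3  (f1 OR f2)
```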

Page 7:


Features with prior information

Page 8:


Feature Comparison

• To compare two patterns: count the number of features they share?

• Example: Blind_left ∈ {0,1}, Blind_right ∈ {0,1}. Is (0,1) more similar to (1,0) or to (1,1)?

• A different representation is also possible: Blind_right ∈ {0,1}, Both_eyes_same ∈ {0,1}

• With no prior information about the features, it is impossible to prefer one representation over another.

Page 9:


Ugly Duckling Theorem

• Given that we use a finite set of predicates that enables us to distinguish any two patterns under consideration, the number of predicates shared by two such patterns is constant and independent of the choice of those patterns. Furthermore, if pattern similarity is based on the total number of predicates shared by two patterns, then any two patterns are “equally similar”.

• An ugly duckling is as similar to beautiful swan 1 as beautiful swan 2 is to beautiful swan 1.

Page 10:


Ugly Duckling Theorem

• Compare two patterns by the number of predicates they share.

• For two distinct patterns xi and xj:

• They share no predicate of rank 1.

• They share exactly one predicate of rank 2: xi OR xj.

• In general they share $\binom{d-2}{r-2}$ predicates of rank r, where d is the number of rank-1 predicates, so the total number of shared predicates is

$\sum_{r=2}^{d} \binom{d-2}{r-2} = (1+1)^{d-2} = 2^{d-2}$

• The result is independent of the choice of xi and xj! (A small numerical check follows below.)
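A brute-force check of this count (a minimal sketch; choosing d = 4 and representing predicates as subsets of the rank-1 patterns are illustrative assumptions):

```python
# Count the predicates shared by every pair of rank-1 patterns; the total is always
# 2^(d-2), independent of which pair is chosen (Ugly Duckling Theorem).
from itertools import combinations

d = 4
patterns = range(d)  # the d rank-1 patterns

# Every predicate is a non-empty subset of the rank-1 patterns (its rank = subset size)
predicates = [set(s) for r in range(1, d + 1) for s in combinations(patterns, r)]

for xi, xj in combinations(patterns, 2):
    shared = sum(1 for p in predicates if xi in p and xj in p)
    print(xi, xj, shared)  # always 4 = 2^(d-2), whatever the pair
```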

Page 11:


Bias and Variance

• Bias: measures how well the estimate g(x; D), averaged over training sets D, matches the true F(x); low bias means the estimate is accurate on average.

• Variance: measures how much the estimate changes from one training set D to another; low variance means different training sets give nearly the same estimate.

• Low bias usually comes with high variance

• High bias usually comes with low variance

• Best: low bias and low variance

• This is possible only with as much prior information about F(x) as possible.

$\mathcal{E}_D\!\left[\big(g(x;D) - F(x)\big)^2\right] = \underbrace{\big(\mathcal{E}_D\big[g(x;D) - F(x)\big]\big)^2}_{\text{bias}^2} + \underbrace{\mathcal{E}_D\!\left[\big(g(x;D) - \mathcal{E}_D[g(x;D)]\big)^2\right]}_{\text{variance}}$
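The two terms can be estimated by simulation. A minimal sketch (the target F(x) = sin(2πx), the noise level, and the degree-3 polynomial estimator are illustrative assumptions, not from the slides):

```python
# Estimate bias^2 and variance of a polynomial regression estimator g(x; D)
# by averaging over many randomly drawn training sets D.
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)      # assumed target function
x_eval = np.linspace(0, 1, 50)           # points x where bias/variance are measured
n, n_sets, degree, noise = 20, 500, 3, 0.2

preds = np.empty((n_sets, x_eval.size))
for i in range(n_sets):
    x_train = rng.uniform(0, 1, n)                        # draw a training set D
    y_train = F(x_train) + noise * rng.standard_normal(n)
    coeffs = np.polyfit(x_train, y_train, degree)         # fit g(x; D)
    preds[i] = np.polyval(coeffs, x_eval)

mean_g = preds.mean(axis=0)                               # E_D[g(x; D)]
bias2 = (mean_g - F(x_eval)) ** 2                         # bias^2 term
variance = preds.var(axis=0)                              # variance term
print(bias2.mean(), variance.mean())
```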

Page 12:


Bias and variance

Page 13:

Resampling for estimating statistics: Jackknife

• Remove one point $x_i$ from the training set, giving the reduced set $D^{(i)}$

• Calculate the statistic on the reduced training set, e.g. the leave-one-out mean

$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j$

• Repeat for all n points

• Calculate the jackknife estimate of the statistic

$\mu_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \mu_{(i)}$

(A code sketch follows below.)
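A minimal sketch of the jackknife for the mean (the data values are arbitrary examples; the standard-error line uses the standard jackknife variance formula, added for context):

```python
# Leave-one-out estimates mu_(i) and the jackknife estimate mu_(.) of the mean.
import numpy as np

x = np.array([3.1, 2.4, 5.0, 4.2, 3.8, 2.9])   # arbitrary example data
n = len(x)

# mu_(i): the mean of the training set with point i removed
mu_i = np.array([np.delete(x, i).mean() for i in range(n)])

mu_dot = mu_i.mean()                            # jackknife estimate mu_(.)
se = np.sqrt((n - 1) / n * np.sum((mu_i - mu_dot) ** 2))  # jackknife standard error

print(mu_dot, se)
```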

Page 14:


Bagging

• Draw n′ < n training points (with replacement) from D and train a separate classifier on each such sample

• Combine the component classifiers’ votes into the final decision

• The component classifiers are all of the same type: all neural networks, all decision trees, etc.

• Instability: small changes in the training set lead to significantly different classifiers and/or results; bagging helps most for such unstable classifiers. (A short code sketch follows.)
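A minimal sketch of bagging with decision trees (scikit-learn and a synthetic dataset are assumptions; the slide does not prescribe any particular library):

```python
# Bagging: each component tree is trained on a random sample of n' < n points,
# and the trees vote on the final class.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 25 trees, each trained on a random sample of 70% of the training points
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        max_samples=0.7, random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))   # accuracy of the combined vote
```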

Page 15:


Boosting

• Goal: improve the performance of any type of classifier

• Weak learner: a classifier whose accuracy is only slightly better than chance

• Example: three component classifiers for a two-class problem

• Draw three different training sets D1, D2 and D3 and train three different component classifiers C1, C2 and C3 (weak learners).

Page 16:


Boosting

• D1: randomly draw n1 < n training points from D

• Train C1 with D1

• D2: the “most informative” data set with respect to C1

• Half of its points are classified correctly by C1, half are misclassified

• Flip a coin: if heads, add the next pattern in D \ D1 that C1 misclassifies; if tails, add the next pattern that C1 classifies correctly

• Continue for as long as such patterns can be found

• Train C2 with D2

• D3: the most informative data set with respect to C1 and C2

• Randomly select a pattern from D \ (D1 ∪ D2)

• If C1 and C2 disagree on it, add it to D3; otherwise discard it

• Train C3 with D3

(A code sketch of this construction follows below.)
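A minimal sketch of this three-classifier boosting scheme (decision stumps as the weak learners, a synthetic data set, and the subset sizes are all illustrative assumptions; the final vote, where C3 breaks ties between C1 and C2, belongs to the part of the procedure not shown in this excerpt):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
n = len(y)

# D1: randomly draw n1 < n training points and train the weak learner C1
idx1 = rng.choice(n, size=n // 3, replace=False)
C1 = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx1], y[idx1])

# D2: "most informative" w.r.t. C1 -- built by coin flips so roughly half of its
# points are classified correctly by C1 and half are misclassified
remaining = np.setdiff1d(np.arange(n), idx1)
correct = C1.predict(X[remaining]) == y[remaining]
idx2 = []
for _ in range(n // 3):
    heads = rng.random() < 0.5
    # heads: look for a pattern misclassified by C1; tails: a correctly classified one
    pool = np.setdiff1d(remaining[~correct] if heads else remaining[correct], idx2)
    if len(pool) == 0:
        break                      # "continue for as long as such patterns exist"
    idx2.append(int(pool[0]))
idx2 = np.array(idx2)
C2 = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx2], y[idx2])

# D3: patterns from D \ (D1 ∪ D2) on which C1 and C2 disagree
rest = np.setdiff1d(remaining, idx2)
disagree = C1.predict(X[rest]) != C2.predict(X[rest])
idx3 = rest[disagree]
C3 = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx3], y[idx3])

# Combined classifier: if C1 and C2 agree, use their label; otherwise let C3 decide
p1, p2, p3 = C1.predict(X), C2.predict(X), C3.predict(X)
final = np.where(p1 == p2, p1, p3)
print("training accuracy of the boosted classifier:", (final == y).mean())
```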

Page 17:


Boosting