
Page 1:

Predictive Learning from Data

Electrical and Computer Engineering

LECTURE SET 8

Methods for Classification

Page 2:

OUTLINE

• Problem statement and approaches

- Risk minimization (SLT) approach

- Statistical Decision Theory

• Methods’ taxonomy

• Representative methods for classification

• Practical aspects and examples

• Combining methods and Boosting

• Summary

Page 3:

Recall the (binary) classification problem:
• Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a univariate output ('response')
• Classification: y is categorical (a class label)
• Estimation of an indicator function f(x), using the 0/1 loss:
$L(y, f(\mathbf{x})) = 0$ if $y = f(\mathbf{x})$, and $L(y, f(\mathbf{x})) = 1$ if $y \neq f(\mathbf{x})$

Page 4:

Pattern Recognition System (~classification)
• Feature extraction: the hard, application-dependent part
• Classification: y ~ class label, y = (0, 1, ..., J-1), with J known in advance
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, learn/estimate a decision rule that assigns a class label to an input feature vector x
• The classifier is intended for use with future (test) data
[Diagram: input pattern → feature extraction → x → classifier → decision (class label)]

Page 5:

Classification vs Discrimination
• In some applications, the goal is not prediction but capturing the essential differences between the classes in the training data ~ discrimination
• Example: diagnosing the causes of a plane crash. Discrimination is related to explanation of past data
• In this course, we are mainly interested in predictive classification
• It is important to distinguish between:
- conceptual approaches (for classification), and
- constructive learning algorithms

Page 6:

Two Approaches to Classification
• Risk Minimization (VC-theoretical) approach:
- specify a set of models (decision boundaries) of increasing complexity (i.e., a structure)
- minimize training error for each element of the structure (usually loss function ~ training error)
- choose the model of optimal complexity, e.g., via resampling or analytic bounds
• Loss function: should be specified a priori
• Technical problem: non-convex loss function

Page 7:

Statistical Decision Theory Approach
• Parametric density estimation approach:
1. Class densities $p_0(\mathbf{x}, \theta_0^*)$ and $p_1(\mathbf{x}, \theta_1^*)$ are known or estimated from the training data
2. Prior probabilities $P(y=0)$ and $P(y=1)$ are known
3. The posterior probability that a given input x belongs to each class is given by the Bayes formula:
$P(y=0 \mid \mathbf{x}) = \dfrac{p_0(\mathbf{x}, \theta_0^*)\, P(y=0)}{p(\mathbf{x})}$, $\quad P(y=1 \mid \mathbf{x}) = \dfrac{p_1(\mathbf{x}, \theta_1^*)\, P(y=1)}{p(\mathbf{x})}$
4. The Bayes optimal decision rule is then
$f(\mathbf{x}) = 0$ if $p_0(\mathbf{x}, \theta_0^*)\, P(y=0) > p_1(\mathbf{x}, \theta_1^*)\, P(y=1)$, and $f(\mathbf{x}) = 1$ otherwise
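As an illustration, here is a minimal MATLAB sketch of this Bayes decision rule for two univariate Gaussian class densities; the means, variance, and priors below are assumed for illustration, not taken from the slides:

mu0 = 0; mu1 = 2; sigma = 1;        % assumed class means and common st. dev.
P0 = 0.5; P1 = 0.5;                 % assumed prior probabilities
x = 1.3;                            % input to classify
gauss = @(x, mu) exp(-(x - mu).^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));
post0 = gauss(x, mu0) * P0;         % p0(x)*P(y=0), unnormalized posterior
post1 = gauss(x, mu1) * P1;         % p1(x)*P(y=1), unnormalized posterior
f = double(post1 > post0);          % Bayes decision: f(x) = 0 or 1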

Page 8:

Bayes-Optimal Decision Rule
• The Bayes decision rule can be expressed in terms of the likelihood ratio:
$r(\mathbf{x}) = 0$ if $\dfrac{p(\mathbf{x} \mid y=0)}{p(\mathbf{x} \mid y=1)} > \dfrac{P(y=1)}{P(y=0)}$, and $1$ otherwise
More generally, for non-equal misclassification costs $C_{01}$ and $C_{10}$:
$r(\mathbf{x}) = 0$ if $\dfrac{p(\mathbf{x} \mid y=0)\, P(y=0)}{p(\mathbf{x} \mid y=1)\, P(y=1)} > \dfrac{C_{10}}{C_{01}}$, and $1$ otherwise
• Only relative probability magnitudes are critical

Page 9:

Discriminant Functions
• The Bayes decision rule in the form:
$r(\mathbf{x}) = 0$ if $g(\mathbf{x}) > a$, and $1$ otherwise
• Discriminant function ~ probability ratio (or its monotonic transformation)
[Figure: discriminant functions $g_1(\mathbf{x})$, $g_2(\mathbf{x})$, $g_3(\mathbf{x})$ against threshold a, with regions where $r(\mathbf{x}) = 0$ and $r(\mathbf{x}) = 1$]

Page 10:

Class boundaries for known distributions
• For specific (Gaussian) class distributions the optimal decision boundary can be calculated as
$g(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_0)^T \Sigma_0^{-1} (\mathbf{x} - \boldsymbol{\mu}_0) - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}\ln\dfrac{|\Sigma_0|}{|\Sigma_1|}$
• With a threshold $a = \ln\dfrac{P(y=0)}{P(y=1)}$
• For equal covariance matrices $\Sigma_0 = \Sigma_1$, the discriminant function can be expressed in terms of the Mahalanobis distances from x to each class center:
$g(\mathbf{x}) = \frac{1}{2} d^2(\mathbf{x}, \boldsymbol{\mu}_0) - \frac{1}{2} d^2(\mathbf{x}, \boldsymbol{\mu}_1)$
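A hedged MATLAB sketch of this equal-covariance case, computing the discriminant via Mahalanobis distances (the means, covariance, and input below are illustrative assumptions):

mu0 = [0; 0]; mu1 = [2; 1];                 % assumed class centers
S = [1 0.3; 0.3 1];                         % assumed common covariance matrix
x = [1; 0.5];                               % input vector
d2 = @(x, mu) (x - mu)' * (S \ (x - mu));   % squared Mahalanobis distance
gx = 0.5*d2(x, mu0) - 0.5*d2(x, mu1);       % discriminant g(x)
label = double(gx > 0);                     % class 1 if x is closer to mu1 (equal priors)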

Page 11:

• Two interpretations of the Bayes rule for Gaussian classes with a common covariance matrix:
[Figure (top): posterior probabilities $P(y=0 \mid x)$ and $P(y=1 \mid x)$, crossing at the 0.5 level]
[Figure (bottom): class-conditional densities $p(x \mid y=0)$ and $p(x \mid y=1)$, crossing at the decision boundary]

Page 12:

Posterior probability estimate via regression
• For binary classification the class label is a discrete random variable with values Y = {0, 1}. Then, for known distributions, the following equality between posterior probability and conditional expectation holds:
$g^*(\mathbf{x}) = E(Y \mid X = \mathbf{x}) = 0 \cdot P(Y=0 \mid \mathbf{x}) + 1 \cdot P(Y=1 \mid \mathbf{x}) = P(Y=1 \mid \mathbf{x})$
⇒ regression (with squared loss) can be used to estimate the posterior probability
• Example: linear discriminant function for Gaussian classes as in HW 2:
$R(\mathbf{w}, w_0) = \dfrac{1}{n} \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_i + w_0 - y_i)^2$
[Figure: $P(y=0 \mid x)$ and the linear discriminant $g(x)$ over the input range -6 to 6]

Page 13:

Regression-Based Methods
• Generally, class distributions are unknown ⇒ need flexible (adaptive) regression estimators for the posterior probabilities: MARS, RBF, MLP, ...
• For two-class problems with (0, 1) class labels, minimization of
$R_1(\omega) = \dfrac{1}{n} \sum_{i=1}^{n} (\hat{g}_1(\mathbf{x}_i, \omega) - y_i)^2$ yields $\hat{g}_1(\mathbf{x}, \omega^*) \approx P(y=1 \mid \mathbf{x})$
and minimization of
$R_0(\omega) = \dfrac{1}{n} \sum_{i=1}^{n} (\hat{g}_0(\mathbf{x}_i, \omega) - (1 - y_i))^2$ yields $\hat{g}_0(\mathbf{x}, \omega^*) \approx P(y=0 \mid \mathbf{x})$
• For J classes, use 1-of-J encoding for class labels and solve a multiple-response regression problem; e.g., for 3 classes the output encoding is 100, 010, 001
• The outputs of the trained multiple-response regression model are then used as the discriminant functions of a classifier.

Page 14:

Regression-Based Methods (cont’d)
• Training/Estimation: estimate a multiple-response regression model mapping inputs $x_1, \ldots, x_d$ to the encoded outputs $y_1, \ldots, y_J$
• Prediction/Operation: the J trained discriminant functions produce outputs $\hat{y}_1, \ldots, \hat{y}_J$ for a given input, and the class label is chosen via the MAX of these outputs
[Diagram: multiple-response regression network for training; the same network followed by a MAX selector for prediction]
A small sketch of this train/predict scheme follows below.
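The following MATLAB sketch uses linear least squares as the multiple-response regression; the synthetic data and all parameter values are illustrative assumptions:

n = 60; d = 2; J = 3;                   % sample size, input dim, classes
X = randn(n, d);                        % synthetic inputs
y = ceil(J * rand(n, 1));               % synthetic class labels 1..J
Y = zeros(n, J);                        % 1-of-J output encoding
Y(sub2ind([n J], (1:n)', y)) = 1;
Xa = [X ones(n, 1)];                    % append a bias column
W = Xa \ Y;                             % least-squares weights, (d+1) x J
G = Xa * W;                             % J discriminant outputs per sample
[gmax, yhat] = max(G, [], 2);           % MAX rule: predicted class labels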

Page 15:

VC-theoretic Approach
• The learning machine observes samples (x, y) and returns an estimated response (indicator function) $\hat{y} = f(\mathbf{x}, \omega)$
• Goal of learning: find a function (model) $f(\mathbf{x}, \omega^*)$ minimizing the prediction risk
$R(\omega) = \int \mathrm{Loss}(y, f(\mathbf{x}, \omega))\, dP(\mathbf{x}, y) \to \min$
The empirical risk is
$R_{emp}(\omega) = \dfrac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, f(\mathbf{x}_i, \omega))$
[Diagram: Generator of samples → x → System → y; the Learning Machine observes (x, y) and outputs $\hat{y}$]

Page 16:

VC-theoretic Approach (cont’d)
• Minimization of the empirical risk on each element of an SRM structure is difficult due to the discontinuous loss/indicator function
• Solution:
(1) Introduce a flexible parameterization (structure), i.e., a dictionary structure
$g(\mathbf{x}, \mathbf{w}, \mathbf{V}) = \sum_{i=1}^{m} w_i\, s(\mathbf{x} \cdot \mathbf{v}_i) + w_0$
(2) Minimize a continuous risk functional (squared loss)
$R_{emp} = \sum_{i=1}^{n} (y_i - s(g(\mathbf{x}_i, \mathbf{w}, \mathbf{V})))^2$
⇒ MLP classifier (with sigmoid activation functions), similar to multiple-response regression for classification

Page 17:

Discussion
• Risk minimization and statistical approaches often yield similar learning methods
• But their conceptual basis and motivation are different
• This difference may lead to variations in:
- empirical loss function
- implementation of complexity control
- interpretation of the trained model outputs
- evaluation of classifier performance
• Most competitive methods effectively follow the risk-minimization approach, even when presented under statistical terminology.

Page 18:

Important (philosophical) difference:
System identification vs system imitation
Several good models are possible under risk minimization; e.g., recall the HW2 linear decision boundary (via regression):
[Figure: training data in the $(x_1, x_2)$ plane with a linear decision boundary]

Page 19:

• For the same training data (year 2004), a quadratic decision boundary also provides a good solution (trading strategy). Why is this possible?
[Figure: quadratic decision boundary classifier on the same training data, axes $x_1$, $x_2$]

Page 20:

Fisher’s LDA
• A classification method based on the risk-minimization approach (but motivated by statistical arguments)
• Seeks an optimal (linear) projection that achieves maximum separation between (two) classes
Maximization of the empirical index:
$R_{emp}(\mathbf{w}) = \dfrac{(\hat{\mu}_1 - \hat{\mu}_2)^2}{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}$
- Works well for high-dimensional data
- Related to linear regression and ridge regression
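A minimal MATLAB sketch of Fisher’s LDA for two classes, assuming synthetic Gaussian data (all parameter choices here are illustrative):

X0 = randn(50, 2);                          % class 0 samples (synthetic)
X1 = randn(50, 2) + repmat([2 1], 50, 1);   % class 1 samples, shifted mean
m0 = mean(X0)'; m1 = mean(X1)';             % class means (column vectors)
Sw = 49*cov(X0) + 49*cov(X1);               % within-class scatter (n-1 = 49)
w = Sw \ (m1 - m0);                         % LDA projection direction
t = (w'*m0 + w'*m1) / 2;                    % threshold: midpoint of projected means
xnew = [1 0.5];                             % a new input to classify
label = double(xnew * w > t);               % assign class 1 if above the threshold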

Page 21:

OUTLINE

• Problem statement and approaches

• Methods’ taxonomy

• Representative methods for classification

• Practical aspects and application study

• Combining methods and Boosting

• Summary

Page 22:

Methods’ Taxonomy
• Estimating a classifier from data requires specification of:
(1) a set of indicator functions indexed by complexity
(2) a loss function suitable for optimization
(3) an optimization method
• The optimization method correlates with the loss function (2)
⇒ taxonomy based on the optimization method

Page 23:

Methods’ Taxonomy

• Based on optimization method used:

- continuous nonlinear optimization (regression-based methods)

- greedy optimization (decision trees)

- local methods (estimate decision boundary locally)

• Each class of methods has its own implementation issues

Page 24:

Regression-Based Methods
• Empirical loss functions
• Misclassification costs and prior probabilities
• Representative methods: MLP, RBF and CTM classifiers
[Diagram: multiple-response regression network with inputs $x_1, \ldots, x_d$ and outputs $y_1, \ldots, y_J$]

Page 25:

Empirical Loss Functions
An output of a regression-based classifier: $y' = g(\mathbf{x}, \omega)$
• Squared loss
$R_{emp}(\omega) = \dfrac{1}{n} \sum_{i=1}^{n} (g(\mathbf{x}_i, \omega) - y_i)^2 = \dfrac{1}{n} \left[ \sum_{y_i=0} (g(\mathbf{x}_i, \omega) - y_i)^2 + \sum_{y_i=1} (g(\mathbf{x}_i, \omega) - y_i)^2 \right]$
motivated by density estimation of P(y=1 | x)
• Cross-entropy loss
$R_{emp}(\omega) = -\dfrac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln g(\mathbf{x}_i, \omega) + (1 - y_i) \ln (1 - g(\mathbf{x}_i, \omega)) \right]$
motivated by density estimation via maximum likelihood and the Kullback-Leibler criterion $\int \hat{f} \log(\hat{f}/f)\, d\mathbf{x}$
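A short MATLAB sketch computing both empirical losses for a vector of classifier outputs g ~ P(y=1|x) and 0/1 labels y (all values assumed for illustration):

y = [0 0 1 1 1];                                % class labels (illustrative)
g = [0.2 0.4 0.7 0.9 0.6];                      % classifier outputs in (0, 1)
Rsq = mean((g - y).^2);                         % squared loss
Rce = -mean(y.*log(g) + (1 - y).*log(1 - g));   % cross-entropy loss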

Page 26:

Empirical Loss Functions (cont’d)
• Asymptotic results: the outputs $g(\mathbf{x}, \omega^*)$ of a trained network yield accurate estimates of the posterior probabilities, provided that
- the sample size is very large
- the estimator has optimal complexity
In practice, neither of these assumptions holds
• Cross-entropy loss
- claimed to be superior to squared loss (for classification)
- can be easily adapted to MLP training (backpropagation)
• VC-theoretic view: both squared and cross-entropy loss are just mechanisms for minimizing classification error.

Page 27:

Misclassification costs + prior probabilities
• For binary classification, class 0/1 (or -/+):
$C_{10}$ ~ cost of a false negative (true 1, decision 0)
$C_{01}$ ~ cost of a false positive (true 0, decision 1)
Cost-weighted empirical risk:
$R(\omega) = \dfrac{1}{n} \left[ C_{01} \sum_{y_i=0} (g(\mathbf{x}_i, \omega) - y_i)^2 + C_{10} \sum_{y_i=1} (g(\mathbf{x}_i, \omega) - y_i)^2 \right]$
• Known differences in prior probabilities between the training data ~ $P(y=0)$, $P(y=1)$ and the test data ~ $\tilde{P}(y=0)$, $\tilde{P}(y=1)$ are handled by reweighting:
$R(\omega) = \dfrac{1}{n} \left[ \dfrac{\tilde{P}(y=0)}{P(y=0)} \sum_{y_i=0} (g(\mathbf{x}_i, \omega) - y_i)^2 + \dfrac{\tilde{P}(y=1)}{P(y=1)} \sum_{y_i=1} (g(\mathbf{x}_i, \omega) - y_i)^2 \right]$
NOTE: these prescriptions follow risk minimization
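Continuing the loss sketch above, the cost-weighted empirical risk can be computed as follows (the costs C10 and C01 are assumed values):

y = [0 0 1 1 1]; g = [0.2 0.4 0.7 0.9 0.6];   % labels and outputs (illustrative)
C10 = 5; C01 = 1;                             % assumed misclassification costs
i0 = (y == 0); i1 = (y == 1);                 % index sets for the two classes
Rw = (C01*sum((g(i0) - y(i0)).^2) + C10*sum((g(i1) - y(i1)).^2)) / length(y);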

Page 28:

Example Regression-Based Methods
• Regression-based classifiers can use:
- global basis functions (e.g., MLP, MARS)
- local basis functions (e.g., RBF, CTM)
⇒ global vs local decision boundary

Page 29:

MLP Networks for Classification
• Standard MLP network with J output units: use 1-of-J encoding for the outputs
• Practical issues for MLP classifiers:
- prescaling of input values to the [-0.5, 0.5] range
- initialization of weights (to small values)
- set training output (y) values to 0.1 and 0.9 rather than 0/1 (to avoid long training time)
• Stopping rule (1) for training: keep decreasing squared error as long as it reduces classification error
• Stopping rule (2) for complexity control: use classification error for resampling
• Multiple local minima: use classification error to select a good local minimum during training

Page 30:

RBF Classifiers
• Standard multiple-output RBF network (J outputs)
• Practical issues for RBF classifiers:
- prescaling of input values to the [-0.5, 0.5] range
- typically non-adaptive training (as for RBF regression), i.e., estimating RBF centers and widths via unsupervised learning, followed by estimation of the weights W via OLS
• Complexity control:
- usually via the number of basis functions m, selected via resampling
- classification error (not squared error) is used for selecting the optimal complexity parameter
• RBF classifiers work best when the number of basis functions is small, i.e., when the training data can be accurately represented by a small number of ‘RBF clusters’.

Page 31:

CTM Classifiers
• Standard CTM for regression: each unit has a single output y implementing local linear regression
• CTM classifier: each unit has J outputs (via 1-of-J encoding) implementing a local linear decision boundary
• CTM uses the same map for all outputs:
- same map topology
- same neighborhood schedule
- same adaptive scaling of input variables
• Complexity control: determined by both
- the final neighborhood size
- the number of CTM units (local basis functions)

Page 32:

CTM Classifiers: complexity control
Heuristic strategy for complexity control + training:
1. Find the optimal number of units m*, via resampling, using a fixed neighborhood schedule (with final width 0.05), e.g., the exponential schedule
$\lambda_k = \lambda_{initial} \left( \lambda_{final} / \lambda_{initial} \right)^{k/k_{max}}$
with a Gaussian neighborhood function $K_{\lambda_k}(\mathbf{z}, \mathbf{z}') = \exp\!\left( -\|\mathbf{z} - \mathbf{z}'\|^2 / (2\lambda_k^2) \right)$
2. Determine the final neighborhood width by training a CTM network with m* units on the original training data. The optimal final width corresponds to minimum classification error (empirical risk)
Note: both (1) and (2) use classification error for tuning the optimal parameters (through minimization of squared error)

Page 33:

Classification Trees (CART)
• Minimization of a suitable empirical loss via partitioning of the input space into regions:
$f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, I(\mathbf{x} \in R_j)$
• Example of CART partitioning for a function of 2 inputs:
[Figure: the $(x_1, x_2)$ plane partitioned into regions $R_1, \ldots, R_5$ by splits $s_1, \ldots, s_4$, and the corresponding binary tree with split nodes $(x_1, s_1)$, $(x_2, s_2)$, $(x_2, s_3)$, $(x_1, s_4)$]

Page 34:

Classification Trees (CART)
• Binary classification example (2D input space)
• Algorithm similar to regression trees (tree growth via binary splitting + model selection), BUT using a different empirical loss function
[Figure: tree with splits $x_1 < -0.409$, $x_2 < -0.067$, $x_1 < -0.148$ and leaf nodes labeled + and -]

Page 35:

Loss functions for Classification Trees
• Misclassification loss: a poor practical choice
• Other loss (cost) functions for splitting nodes:
For a J-class problem, a cost function is a measure of node impurity
$Q(t) = Q(p(1 \mid t), p(2 \mid t), \ldots, p(J \mid t))$
where $p(i \mid t)$ denotes the probability of class i samples at node t.
• Possible cost functions (a small numeric sketch follows below):
Misclassification: $Q(t) = 1 - \max_j p(j \mid t)$
Gini function: $Q(t) = \sum_{i \neq j} p(i \mid t)\, p(j \mid t) = 1 - \sum_j p(j \mid t)^2$
Entropy function: $Q(t) = -\sum_j p(j \mid t) \ln p(j \mid t)$
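A small MATLAB sketch of the three impurity measures for a node’s class-probability vector p (example values assumed):

p = [0.5 0.5];                          % class probabilities p(j|t) at node t
Qmis = 1 - max(p);                      % misclassification impurity
Qgini = 1 - sum(p.^2);                  % Gini impurity
Qent = -sum(p(p > 0) .* log(p(p > 0))); % entropy impurity (0*log 0 := 0)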

Page 36:

Classification Trees: node splitting
• Minimizing the cost function = maximizing the decrease in node impurity. Assume node t is split into two regions (Left and Right) on variable k at a split point s. Then the decrease in impurity caused by this split is
$\Delta Q(s, k, t) = Q(t) - Q(t_L)\, p_L(t) - Q(t_R)\, p_R(t)$
where $p_L(t) = p(t_L)/p(t)$ and $p_R(t) = p(t_R)/p(t)$
• Misclassification cost ~ discontinuous (due to the max)
- may give sub-optimal solutions (poor local minima)
- does not work well with greedy optimization

Page 37:

Using different cost functions for node splitting
(a) Decrease in impurity: misclassification = 0.25, gini = 0.13, entropy = 0.13
(b) Decrease in impurity: misclassification = 0.25, gini = 0.17, entropy = 0.22
Split (b) is better, as it leads to a smaller final tree

Page 38:

Details of calculating the decrease in impurity
Consider split (a), where the parent node has class probabilities (0.5, 0.5), each child node has class probabilities (3/4, 1/4), and $p_L = p_R = 0.5$:
• Misclassification cost:
$Q(t) = 1 - 0.5 = 0.5$, $\quad Q(t_L) = 1 - 3/4 = 0.25$, $\quad Q(t_R) = 1 - 3/4 = 0.25$
$\Delta Q = 0.5 - 0.5 \cdot 0.25 - 0.5 \cdot 0.25 = 0.25$
• Gini cost:
$Q(t) = 1 - 0.5^2 - 0.5^2 = 0.5$, $\quad Q(t_L) = 1 - (3/4)^2 - (1/4)^2 = 3/8$, $\quad Q(t_R) = 3/8$
$\Delta Q = 0.5 - 0.5 \cdot (3/8) - 0.5 \cdot (3/8) = 1/8$
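These calculations can be checked with a few lines of MATLAB (a sketch reproducing split (a) above):

mis = @(p) 1 - max(p);                  % misclassification impurity
gini = @(p) 1 - sum(p.^2);              % Gini impurity
pL = 0.5; pR = 0.5;                     % fractions of samples sent left/right
dQmis = mis([0.5 0.5]) - pL*mis([0.75 0.25]) - pR*mis([0.75 0.25])    % = 0.25
dQgini = gini([0.5 0.5]) - pL*gini([0.75 0.25]) - pR*gini([0.75 0.25]) % = 0.125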

Page 39:

IRIS Data Set: a data set with 150 random samples of flowers from the iris species setosa, versicolor, and virginica (3 classes). From each species there are 50 observations of sepal length, sepal width, petal length, and petal width in cm. This data set is from classical statistics.

MATLAB code (splitmin = 10):
load fisheriris;
t = treefit(meas, species);
treedisp(t, 'names', {'SL' 'SW' 'PL' 'PW'});

Page 40:

Another example with Iris data: consider the IRIS data set where every other sample is used (75 samples total, 25 per class). Then the CART tree formed using the same MATLAB software (splitmin = 10, Gini loss function) is: [tree figure]

Page 41:

CART model selection
• Model selection strategy:
(1) Grow a large tree (subject to a minimum leaf node size)
(2) Prune the tree by selectively merging tree nodes
• The final model minimizes the penalized risk
$R_{pen}(T, \lambda) = R_{emp}(T) + \lambda\, |T|$
where the empirical risk ~ misclassification rate, $|T|$ ~ the number of leaf nodes, and $\lambda$ ~ a regularization parameter (chosen via resampling)
• Note: larger $\lambda$ ⇒ smaller trees
• In practice, the complexity parameter is often user-defined (splitmin in MATLAB)

Page 42:

Decision Trees: summary
• Advantages

- speed

- interpretability

- different types of input variables

• Limitations: sensitivity to

- correlated inputs

- affine transformations (of input variables)

- general instability of trees

• Variations: ID3 (in machine learning), linear CART

Page 43:

Local Methods for Classification
• The decision boundary is constructed via local estimation (in x-space)
• Nearest Neighbor (k-NN) classifiers:
- define a metric (distance) in x-space and choose k (the complexity parameter)
- for a given x, find the k nearest training samples
- classify x as class A if most of the k neighbors are from class A
(a minimal k-NN sketch follows below)
• Statistical interpretation: local estimation of the probability $P_A(\mathbf{x})$
• VC-theoretic interpretation: estimation of the decision boundary via minimization of local empirical risk
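A minimal MATLAB sketch of a k-NN classifier with a Euclidean metric (the data and the choice k = 3 are illustrative assumptions):

k = 3;                                      % number of neighbors
Xtr = randn(100, 2);                        % synthetic training inputs
ytr = double(sum(Xtr, 2) > 0);              % synthetic 0/1 training labels
x = [0.2 -0.1];                             % query point
d = sum((Xtr - repmat(x, 100, 1)).^2, 2);   % squared distances to x
[dsort, idx] = sort(d);                     % order training samples by distance
yhat = double(mean(ytr(idx(1:k))) > 0.5);   % majority vote among k nearest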

Page 44:

Local Risk Minimization Framework
• Similar to local risk minimization for regression
• Local risk for binary classification:
$R_{emp\_local}(w) = \dfrac{1}{k} \sum_{i=1}^{n} (y_i - w)^2\, K_k(\mathbf{x}_0, \mathbf{x}_i)$
where $K_k(\mathbf{x}_0, \mathbf{x}_i) = 1$ for the k closest samples and 0 otherwise; the parameter w takes the discrete values {0, 1}
• The local risk is minimized when w takes the value of the majority of the class labels.
NOTE: the local risk is minimized directly (no training is needed)

Page 45:

Nearest Neighbor Classifiers
• Advantages
- easy to implement
- no training needed
• Limitations

- choice of distance metric

- irrelevant inputs contribute to noise

- poor on-line performance when training size is large (especially with high-dimensional data)

• Computationally efficient variations

- tree implementations of k-NN

- condensed k-NN

Page 46:

OUTLINE
• Problem statement and approaches
• Methods’ taxonomy
• Representative methods for classification
• Practical aspects and examples
- Problem formalization
- Data quality
- Promising application areas: financial engineering, biomedical/life sciences, fraud detection
• Combining methods and Boosting
• Summary

Page 47:

Formalization of Application Requirements
• A general procedure for mapping application requirements onto a learning problem setting
• A creative process; requires application domain knowledge
[Diagram: APPLICATION NEEDS → (loss function; input, output, and other variables; training/test data; admissible models) → FORMAL PROBLEM STATEMENT → LEARNING THEORY]

Page 48:

Data Quality
• Data is obtained in an observational setting, NOT as the result of a scientific experiment
⇒ always question the integrity of the data
• Example 1: HW2 data
- stock market data: dividend distributions, holidays
• Example 2: Pima Indians Diabetes data (UCI database)
- 35 out of 768 total samples (female Pima Indians) show a blood pressure value of zero
• Example 3: transportation study: Safety Performance of Compliance Reviews

Page 49:

Promising Application Areas

• Financial applications (financial engineering)
- misunderstanding of predictive learning, e.g., backtesting
- main problem: what is risk, and how to measure it? ⇒ misunderstanding of uncertainty/risk
- non-stationarity ⇒ can use only short-term modeling
• Successful investing: two extremes
(1) Based on fundamentals/deep understanding ⇒ buy-and-hold (Warren Buffett)
(2) Short-term, purely quantitative (predictive learning)
Always involves risk (~ of losing money)

Page 50:

Promising Application Areas
• Biomedical + life sciences
- great social + practical importance
- main problem: the cost of human life, which should be agreed upon by society
- ineffectiveness of medical care: due to the existence of many subsystems that put different values on human life
• Two possible applications of predictive learning:
(1) Imitate diagnosis performed by human doctors ⇒ training data ~ diagnostic decisions made by humans
(2) Substitute for human diagnosis/decision making ⇒ training data ~ objective medical outcomes
ASIDE: medical doctors are expected/required to make no errors

Page 51:

Virtual Biopsy Project (NIH, 2007)
Is It Possible to Use Computer Methods to Get the Information a Biopsy Provides without Performing a Biopsy?
(Jim DeLeo, NIH Clinical Center)
Goal: to reduce the number of unnecessary biopsies + reduce cost

Page 52:

Prostate Cancer
A predictive computer model (binary classifier) reduces unnecessary biopsies by more than one-third:

June 25, 2003. Using a predictive computer model could reduce unnecessary prostate biopsies by almost 38%, according to a study conducted by Oregon Health & Science University researchers. The study was presented at the American Society of Clinical Oncology's annual meeting in Chicago.

"While current prostate cancer screening practices are good at helping us find patients with cancer, they unfortunately also identify many patients who don't have cancer. In fact, three out of four men who undergo a prostate biopsy do not have cancer at all," said Mark Garzotto, MD, lead study investigator and member of the OHSU Cancer Institute. "Until now most patients with abnormal screening results were counseled to have prostate biopsies because physicians were unable to discriminate between those with cancer and those without cancer."

Page 53:

Prostate Cancer Virtual Biopsy ANN

Page 54:

OUTLINE

• Problem statement and approaches

• Methods’ taxonomy

• Representative methods for classification

• Practical aspects and examples

• Combining methods and Boosting

• Summary

Page 55:

Strategies for Combining Methods
• A predictive model depends on 3 factors:
(a) parameterization of admissible models
(b) random training sample
(c) empirical loss (for risk minimization)
• Three combining strategies (for improved generalization):
1. Different (a), the same (b) and (c) ⇒ Committee of Networks, Stacking, Bayesian averaging
2. Different (b), the same (a) and (c) ⇒ Bagging
3. Different (c), the same (a) and (b) ⇒ Boosting

Page 56:

Combining strategy 3 (Boosting)
• Boosting: apply the same method to training data where the data samples are adaptively weighted (in the empirical loss function)
• Boosting: designed and used for classification
• Implementation of boosting:
- apply the same method (base classifier) to many (modified) realizations of the training data
- combine the resulting models via a weighted average

Page 57:

Boosting strategy

Apply learning method to many realizations of the data

Page 58:

AdaBoost algorithm (Freund and Schapire, 1996)
Given training data $(\mathbf{x}_i, y_i)$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, n$ (binary classification):
Initialize the sample weights: $\beta_i = 1/n$
Repeat for $j = 1, \ldots, N$:
1. Apply the base method to the training samples with weights $\beta_i$, producing the component model $b_j(\mathbf{x})$
2. Calculate the weighted error of this classifier and its weight:
$err_j = \dfrac{\sum_{i=1}^{n} \beta_i\, I(y_i \neq b_j(\mathbf{x}_i))}{\sum_{i=1}^{n} \beta_i}$, $\quad w_j = \log\dfrac{1 - err_j}{err_j}$
3. Update the data weights: $\beta_i \leftarrow \beta_i \exp\!\left( w_j\, I(y_i \neq b_j(\mathbf{x}_i)) \right)$
Combine the classifiers via weighted majority voting:
$f(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{j=1}^{N} w_j\, b_j(\mathbf{x}) \right)$
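A hedged MATLAB sketch of AdaBoost with decision stumps as the base method, following the steps above; the synthetic data, N, and the exhaustive stump search are illustrative choices, and the weight update uses the equivalent exp(-w*y*b(x)) form followed by normalization:

n = 100; N = 20;                        % sample size and number of iterations
X = randn(n, 2);                        % synthetic inputs
y = sign(X(:,1) + 0.5*randn(n, 1)); y(y == 0) = 1;   % labels in {-1,+1}
beta = ones(n, 1) / n;                  % initial sample weights
stumps = zeros(N, 3); w = zeros(N, 1);  % each stump: [feature, threshold, polarity]
for j = 1:N
  best = inf;                           % exhaustive search for the best stump
  for k = 1:2
    for v = X(:,k)'
      for s = [-1 1]
        pred = s * sign(X(:,k) - v); pred(pred == 0) = s;
        e = sum(beta .* (pred ~= y));
        if e < best, best = e; stumps(j,:) = [k v s]; end
      end
    end
  end
  k = stumps(j,1); v = stumps(j,2); s = stumps(j,3);
  pred = s * sign(X(:,k) - v); pred(pred == 0) = s;
  err = sum(beta .* (pred ~= y)) / sum(beta);   % weighted error
  err = min(max(err, eps), 1 - eps);    % guard against log(0)
  w(j) = 0.5 * log((1 - err) / err);    % component weight
  beta = beta .* exp(-w(j) * y .* pred);% reweight: errors up, correct down
  beta = beta / sum(beta);              % normalize
end
F = zeros(n, 1);                        % weighted majority vote of the N stumps
for j = 1:N
  pred = stumps(j,3) * sign(X(:,stumps(j,1)) - stumps(j,2));
  pred(pred == 0) = stumps(j,3);
  F = F + w(j) * pred;
end
yhat = sign(F);                         % final classifier output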

Page 59:

Example of the AdaBoost algorithm
Original training data: 10 samples
Data sample: 1 2 3 4 5 6 7 8 9 10
Initial $\beta_i$: 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Page 60:

First iteration of AdaBoost
First (weak) classifier $b_1(\mathbf{x})$ and the resulting sample weight changes:
Data sample: 1 2 3 4 5 6 7 8 9 10
Initial $\beta_i$: 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
$I(y_i \neq b_1(\mathbf{x}_i))$: 0 1 1 0 1 0 0 0 0 0
$err_1 = 0.30$, $\quad w_1 = 0.5 \ln((1 - err_1)/err_1) = 0.42$
$\beta_i$ after $b_1$: 0.07 0.17 0.17 0.07 0.17 0.07 0.07 0.07 0.07 0.07

Page 61:

Second iteration of AdaBoost
Second (weak) classifier and sample weight changes [figure]

Page 62:

Third iteration of AdaBoost

Third (weak) classifier

Page 63:

Combine base classifiers
$err_1 = 0.30$, $\quad w_1 = 0.5 \ln((1 - err_1)/err_1) = 0.42$
$err_2 = 0.21$, $\quad w_2 = 0.5 \ln((1 - err_2)/err_2) = 0.65$
$err_3 = 0.14$, $\quad w_3 = 0.5 \ln((1 - err_3)/err_3) = 0.92$

Page 64:

Example of AdaBoost for classification
• 75 training samples: a mixture of three Gaussians, centered at (-2, 0) and (2, 0) ~ class 1, and at (0, 0) ~ class -1
• 600 test samples (from the same distribution)
[Figure: training samples in the $(x_1, x_2)$ plane]

Page 65:

Example (cont’d)
• Base classifier: decision stump
$g(\mathbf{x}, k, v) = +1$ if $x_k > v$, and $-1$ if $x_k \leq v$
• The first 10 component classifiers are shown
[Figure: the 10 stump boundaries (vertical/horizontal lines) overlaid on the training data in the $(x_1, x_2)$ plane]

Page 66:

Example (cont’d)
• Generalization performance (m = 100 iterations)
⇒ Training error decreases with m (this can be proven)
⇒ Test error does not show overfitting for large m
[Figure: misclassification rate vs iteration number, for training and test data]

Page 67:

Relation of boosting to other methods
• Why can boosting generalize well, in spite of the large number of component models (m)?
• What controls the complexity of boosting?
• The AdaBoost final model has an additive form
$f(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{j=1}^{m} w_j\, b_j(\mathbf{x}) \right)$
- can be related to additive methods (statistics)
• Generalization performance can be related to the large-margin properties of the final classifier

Page 68:

Boosting as an additive method
• Dictionary methods:
$f_m(\mathbf{x}, \mathbf{w}, \mathbf{V}) = \sum_{j=1}^{m} w_j\, g(\mathbf{x}, \mathbf{v}_j) + w_0$
• MLP and RBF: basis functions are specified a priori ⇒ model complexity ~ number of basis functions
• Projection Pursuit: basis functions are estimated sequentially via a greedy strategy (backfitting) ⇒ model complexity is difficult to estimate
• Boosting can be shown to implement the backfitting procedure (similar to Projection Pursuit) but using an appropriate loss function:
$L(y, g(\mathbf{x})) = \exp(-y\, g(\mathbf{x}))$

Page 69:

Stepwise form of the AdaBoost algorithm
Given training data $(\mathbf{x}_i, y_i)$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, n$ (binary classification), a base classifier $b(\mathbf{x}, \mathbf{v})$, and the empirical loss $L(y, g(\mathbf{x})) = \exp(-y\, g(\mathbf{x}))$:
Initialization: $g_0(\mathbf{x}) = 0$
Repeat for $j = 1, \ldots, m$:
1. Determine the parameters $w_j$ and $\mathbf{v}_j$ via minimization of the empirical loss:
$(w_j, \mathbf{v}_j) = \arg\min_{w, \mathbf{v}} \sum_{i=1}^{n} L\!\left( y_i,\; g_{j-1}(\mathbf{x}_i) + w\, b(\mathbf{x}_i, \mathbf{v}) \right)$
2. Update the discriminant function: $g_j(\mathbf{x}) = g_{j-1}(\mathbf{x}) + w_j\, b(\mathbf{x}, \mathbf{v}_j)$
Classification rule: $f(\mathbf{x}) = \mathrm{sign}(g_m(\mathbf{x}))$

Page 70:

Various loss functions for classification
• Exponential loss (AdaBoost): $L(y, f(\mathbf{x})) = \exp(-y\, f(\mathbf{x}))$
• SVM loss (SVM classifier): the hinge loss $L(y, f(\mathbf{x})) = \max(0,\, 1 - y\, f(\mathbf{x}))$

Page 71:

Generalization Performance of AdaBoost
• The similarity between the SVM loss and the exponential loss helps explain the good performance of AdaBoost
• Boosting tends to increase the degree of separation between the two classes (the margin)
• Generalization properties are poorly understood
• Complexity control via:
- the number of components
- the complexity of the base classifier
• Poor performance on noisy data sets

Page 72:

Example 1: Hyperbolas Data Set
x1 = ((t - 0.4)·3)² + 0.225 and x2 = 1 - ((t - 0.6)·3)² - 0.225,
with t uniform on [0.2, 0.6] for class 1 and uniform on [0.4, 0.8] for class 2.
Gaussian noise with st. dev. = 0.03 added to both x1 and x2
• 100 training samples (50 per class) / 100 validation samples
• 2,000 test samples (1,000 per class)
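A hedged MATLAB sketch generating one plausible reading of this data set, taking t as the first coordinate of each class curve; this interpretation of the two equations is an assumption:

n = 50; noise = 0.03;                       % samples per class and noise level
t1 = 0.2 + 0.4*rand(n, 1);                  % class 1: t uniform on [0.2, 0.6]
t2 = 0.4 + 0.4*rand(n, 1);                  % class 2: t uniform on [0.4, 0.8]
X1 = [t1, ((t1 - 0.4)*3).^2 + 0.225];       % class 1 curve (assumed reading)
X2 = [t2, 1 - ((t2 - 0.6)*3).^2 - 0.225];   % class 2 curve (assumed reading)
X = [X1; X2] + noise*randn(2*n, 2);         % Gaussian noise on both coordinates
y = [ones(n, 1); -ones(n, 1)];              % class labels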

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Page 73:

AdaBoost using decision stumps:
• Model selection: choose the optimal N using the validation data set
• Repeat the experiments 10 times
[Figure: two plots of the resulting AdaBoost decision boundaries on the hyperbolas training data]

Page 74:

AdaBoost Performance Results
Test error: AdaBoost ~ 2.29% vs RBF SVM ~ 0.42%

Experiment | Training error | Validation error | Test error | Optimal N
1  | 0 | 0.02 | 0.0275 | 32
2  | 0 | 0.01 | 0.0115 | 16
3  | 0 | 0.07 | 0.044  | 32
4  | 0 | 0.02 | 0.0115 | 32
5  | 0 | 0.06 | 0.0235 | 32
6  | 0 | 0.03 | 0.036  | 16
7  | 0 | 0.03 | 0.018  | 16
8  | 0 | 0.02 | 0.016  | 32
9  | 0 | 0.05 | 0.0225 | 32
10 | 0 | 0    | 0.0185 | 16
Ave | 0 | 0.031 | 0.0229 |
St. dev. | 0 | 0.0223 | 0.0105 |

Page 75:

OUTLINE

• Problem statement and approaches

• Methods’ taxonomy

• Representative methods for classification

• Practical aspects and examples

• Combining methods and Boosting

• Summary

Page 76:

SUMMARY
• VC-theoretic approach to classification
- minimization of empirical error
- structure on a set of indicator functions
• Importance of a continuous loss function suitable for minimization
• Simple methods (local classifiers) are often very competitive
• Classification is inherently less sensitive to optimal complexity control (vs regression)